Autocorpus is a set of utilities for automatically extracting language corpora and language models from publicly available datasets. Autocorpus utilities follow the Unix design philosophy and integrate easily into custom data processing pipelines.

README

INTRODUCTION

Autocorpus is a set of utilities for automatically extracting
language corpora and language models from publicly available
datasets. For example, it provides the full set of tools needed to
convert the entire English Wikipedia from a 30+GB XML file into a
clean n-gram language model in a few hours.
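
To make the term "n-gram" concrete, here is a minimal sketch that
counts bigrams (2-grams) in plain text. It uses only standard Unix
tools, not any Autocorpus utility, and the function name
count_bigrams is hypothetical. Raw counts like these are only a first
step; a language model would go on to turn them into probabilities.

```shell
#!/bin/sh
# Illustrative only: count bigrams with standard Unix tools.
# This does not use or represent any Autocorpus utility.

# count_bigrams: reads text on stdin and prints "COUNT WORD1 WORD2"
# lines, most frequent bigram first.
count_bigrams() {
    tr -cs '[:alpha:]' '\n' |        # split into one word per line
    tr '[:upper:]' '[:lower:]' |     # normalize case
    awk 'NF { if (prev != "") print prev, $0; prev = $0 }' |
    sort | uniq -c | sort -k1,1nr -k2 |
    awk '{ print $1, $2, $3 }'       # strip the padding added by uniq -c
}

printf 'the cat sat on the mat. The cat slept.\n' | count_bigrams
```

Each stage reads standard input and writes standard output, which is
the same pipeline style the Unix design philosophy encourages.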



BUILDING

Before building Autocorpus, make sure you have the required
dependencies. These are:

    - Python 2.7.1+
    - g++ 4.6.1
    - libpcre3-dev
    - libboost-dev 1.46
    - libboost-thread-dev 1.46

Older versions *might* work, but have not been tested.
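
As a quick sanity check before building, a sketch like the following
reports which command-line tools are visible on your PATH. The helper
names (have_tool, version_at_least) are hypothetical, the command
name "python" is an assumption about how the interpreter is installed,
and the version comparison relies on "sort -V" from GNU coreutils. It
does not detect the library packages (libpcre3-dev, libboost-dev,
libboost-thread-dev); check those with your package manager.

```shell
#!/bin/sh
# Sketch: report which build tools are visible on PATH.
# Library development packages are NOT detected by this check.

# have_tool NAME: succeeds if NAME resolves to a command.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# version_at_least ACTUAL REQUIRED: succeeds if ACTUAL >= REQUIRED,
# comparing dotted version numbers with sort -V (GNU coreutils).
version_at_least() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# "python" is an assumed command name; adjust for your system.
for tool in python g++ make; do
    if have_tool "$tool"; then
        echo "found:   $tool"
    else
        echo "missing: $tool"
    fi
done
```

For example, "version_at_least 4.8.0 4.6.1" succeeds, while
"version_at_least 4.5.9 4.6.1" fails.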

Once you've verified that you have the prerequisites, build Autocorpus
by running make:

    $ make

The binaries will be placed in the 'bin' directory.


INSTALLING

To install Autocorpus, first build it as described in the previous
section, then run "make install". Installation requires root
privileges; on most desktop Linux distributions this means running
"sudo make install".
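
The choice between the two commands can be sketched as a small shell
helper. The function name install_command is hypothetical, and the
sketch assumes sudo is how you gain root privileges on your system. It
only prints the command to run and does not perform the installation.

```shell
#!/bin/sh
# Sketch: print the install command appropriate for the current user.
# Nothing is installed; the command is only echoed.

# install_command: prints "make install" when already root (uid 0),
# otherwise "sudo make install" (assumes sudo is available).
install_command() {
    if [ "$(id -u)" -eq 0 ]; then
        echo "make install"
    else
        echo "sudo make install"
    fi
}

echo "Run: $(install_command)"
```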
 


USING AUTOCORPUS

Assuming you have properly installed the documentation from the 'man'
directory, you can get a quick overview of how to use Autocorpus by
typing:

    $ man 7 autocorpus

This man page can also be viewed at
http://mpacula.com/autocorpus/1.0/man/autocorpus.7.html

Man pages are also available for individual tools, both locally and online 
at http://mpacula.com/autocorpus/1.0/man



PROJECT WEBSITE

The project's website is http://mpacula.com/autocorpus. Use it to
download new releases and submit bug reports. 



AUTHOR & LICENSING

Autocorpus was written by Maciej Pacula (maciej.pacula@gmail.com) 
and is distributed as free software under the terms of the AGPL v3
license. See the file COPYING for details.

If you would like to incorporate one or more Autocorpus tools in a
proprietary product, please contact the author and inquire about a
commercial license.

Wikipedia-based corpora are distributed under the "Creative Commons
Attribution - ShareAlike 3.0 Unported License". The full text of this
license can be found at: 

http://en.wikipedia.org/wiki/Wikipedia:CC-BY-SA