WikiDAT installation guide
This page provides step-by-step instructions to install WikiDAT on your system. Please feel free to suggest any comments or changes to improve this documentation.
WikiDAT has been designed to run even on modest hardware. Dump files are processed on the fly, working with data streams as the files are decompressed. Thus, it should be enough to make sure that you have enough disk space to store extracted information locally.
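The stream-oriented approach can be sketched with the standard library alone (a toy stand-in, not WikiDAT's actual parser): the XML is parsed incrementally while it is decompressed, so memory use stays flat regardless of dump size. Here a tiny in-memory dump stands in for a real `*.xml.bz2` file from Wikimedia.

```python
import bz2
import io
import xml.etree.ElementTree as ET

# A tiny fake dump compressed in memory; a real run would open a
# pages-meta-history *.xml.bz2 file downloaded from dumps.wikimedia.org.
sample = b"<mediawiki><page><title>A</title></page><page><title>B</title></page></mediawiki>"
compressed = io.BytesIO(bz2.compress(sample))

titles = []
stream = bz2.BZ2File(compressed)       # decompress on the fly
for event, elem in ET.iterparse(stream):
    if elem.tag == 'page':
        titles.append(elem.find('title').text)
        elem.clear()                   # free memory for processed elements

print(titles)
```

Because each `<page>` element is discarded as soon as it has been processed, only disk space for the output (not the decompressed dump) is needed.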
CPU and memory requirements
There are no absolute minimum requirements regarding CPU or memory. WikiDAT itself consumes less than 200MB of memory, even with large Wikipedia languages. However, your local database may require more memory (and additional configuration) to import extracted information quickly.
The following hardware requirements are recommended to ensure a fluent ETL process:
- Intel Core i7 CPU (or comparable).
- At least 8GB RAM (DDR3, 1600 MHz); mostly for your local DB.
- At least one Solid-State Disk (SSD) drive, to speed up data loading.
Disk storage requirements
Disk space requirements depend on the number and size of the projects you intend to analyze. For instance, to analyze the complete history dump (all revisions for each page) of the English Wikipedia, it is recommended to have at least 200 GB of free disk space (storing the compressed dump files as well as the DB tables). The following table provides a (non-exhaustive) list of recommended disk space to analyze some Wikipedia languages:
| Language | Code | Recommended storage space |
|---|---|---|
| English Wikipedia | enwiki | ~ 200 GB |
| German Wikipedia | dewiki | ~ 60 GB |
| French Wikipedia | frwiki | ~ 50 GB |
| Spanish Wikipedia | eswiki | ~ 24 GB |
| Polish Wikipedia | plwiki | ~ 18 GB |
| Italian Wikipedia | itwiki | ~ 24 GB |
| Japanese Wikipedia | jawiki | ~ 21 GB |
| Dutch Wikipedia | nlwiki | ~ 18 GB |
| Portuguese Wikipedia | ptwiki | ~ 18 GB |
| Russian Wikipedia | ruwiki | ~ 32 GB |
IMPORTANT: WikiDAT has only been tested with Python 2.7. Python 3 is not supported yet, although plans to address this issue have already been outlined.
The following software dependencies must be met before installing WikiDAT on your system:
- MySQL (>= v5.5.x) or MariaDB (>= v5.5.x or >= v10.0.x).
- The Python programming language (>= v2.7.x and < 3).
- The following Python packages (installation from PyPI with pip is recommended):
- MySQLdb (>= v1.2.3)
- lxml (>= v3.3.1-0)
- requests (>= v2.2.1)
- beautifulsoup4 (>= v4.2.1)
- configparser (>= v3.3.0r2)
- pyzmq (>= v14.3.0)
- ujson (>= v1.30)
- The 0MQ (ZeroMQ) message queuing library.
- The R programming language.
- The following R packages (available from CRAN):
- RMySQL: Connect to MySQL databases from R.
- Hmisc: Frank Harrell's miscellaneous functions.
- car: Companion package to "An R Companion to Applied Regression", 2nd ed.
- ineq: Calculate inequality metrics and graphics.
- ggplot2: A wonderful library to create appealing graphics in R.
- eha: Library for event history and survival analysis.
- zoo: Library to handle time series data.
Since WikiDAT follows a modular design, you can just run the data extraction process and undertake the data analysis phase with any tool of your choice (e.g. NumPy/SciPy, Pandas or scikit-learn in Python).
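For instance, once the extraction phase has populated your local database, a first analysis pass can be picked up in plain Python. A minimal sketch (the revision rows here are inlined for illustration; in a real run they would be read from your WikiDAT database):

```python
from collections import Counter

# Toy sample of extracted revision metadata: (page_id, contributor).
revisions = [
    (1, 'Alice'), (1, 'Bob'), (2, 'Alice'),
    (1, 'Alice'), (3, 'Carol'), (2, 'Bob'),
]

# Revisions per page: a typical first step in Wikipedia activity analysis.
revs_per_page = Counter(page for page, _ in revisions)

# Number of distinct contributors per page.
editors_per_page = {
    page: len({c for p, c in revisions if p == page})
    for page in revs_per_page
}

print(revs_per_page.most_common(1))  # the busiest page
```

The same aggregations translate directly to Pandas `groupby` operations or SQL queries against the extracted tables.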
GNU/Linux (Ubuntu 13.10/14.04, Debian 7.0)
The following steps install all software dependencies required for both data extraction and data analysis with WikiDAT.
Install pip, the Python package manager:

$ sudo apt-get install python-pip
Install or update all Python dependencies listed above:
$ sudo pip install -U MySQL-python
$ sudo pip install -U lxml
$ sudo pip install -U requests
$ sudo pip install -U beautifulsoup4
$ sudo pip install -U configparser
$ sudo pip install -U pyzmq
$ sudo pip install -U ujson
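After installing, a quick sanity check that every Python dependency can actually be imported may save debugging time later. This is just a sketch; the importable module names below are assumed to correspond to the packages above (note that beautifulsoup4 installs the module `bs4`, and pyzmq installs `zmq`):

```python
# Importable module names assumed for each package listed above.
deps = ['MySQLdb', 'lxml', 'requests', 'bs4', 'configparser', 'zmq', 'ujson']

def missing_modules(names):
    """Return the subset of module names that cannot be imported."""
    missing = []
    for name in names:
        try:
            __import__(name)
        except ImportError:
            missing.append(name)
    return missing

gone = missing_modules(deps)
if gone:
    print('Missing: ' + ', '.join(gone))
else:
    print('All Python dependencies found.')
```

Any name reported as missing points to a package that failed to install correctly.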
Install 0MQ (ZeroMQ):
$ sudo apt-get install libzmq3
Install R, then all required R packages:
> install.packages(c('RMySQL', 'Hmisc', 'car', 'ineq', 'ggplot2', 'eha', 'zoo'), dep=T)
Clone the latest stable version of WikiDAT on your local machine:
$ git clone https://github.com/glimmerphoenix/WikiDAT.git
If you are not working with `virtualenv` in Python, make sure that your environment variable `PYTHONPATH` points to the cloned WikiDAT directory:

$ export PYTHONPATH=$PYTHONPATH:path/to/WikiDAT

You can add this line to the end of your `.bashrc` file to make these changes permanent.

Then, change to the `WikiDAT/wikidat` directory. You should modify the default `config.ini` file to indicate a valid user and password to connect to your local database. Finally, execute the `main.py` file to run the whole process:

WikiDAT/wikidat$ python main.py
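The database credentials in `config.ini` look roughly like the following sketch. The section and key names shown here are assumptions for illustration only, so adapt your values to the entries actually present in the default file rather than copying this structure:

```ini
; Hypothetical example -- check against the actual keys in config.ini
[Database]
host = localhost
user = wikidat
passwd = your_password
```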
Please refer to the Quick start guide for more information about how to quickly customize the execution of WikiDAT to your own needs.