Wikipedia Data Analysis Toolkit

Authors: Felipe Ortega, Aaron Halfaker.
License: GPLv3.

The aim of WikiDAT is to create an extensible toolkit for Wikipedia Data Analysis, based on MySQL, Python and R.

Each module implements a different type of analysis and stores its output in the subdirectories results, figs or traces, which are created in the module's directory. The module source code includes Python and R code, with inline comments, for both the data preparation/cleaning and the data analysis steps. An important goal is to illustrate interesting analyses of Wikipedia data through case examples, following a didactic approach.
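As an illustration of the output layout described above, the following minimal sketch creates the three subdirectories inside a module's directory. The helper name ensure_output_dirs and the OUTPUT_SUBDIRS constant are hypothetical and are not taken from the WikiDAT source; only the directory names come from this README.

```python
import os

# Subdirectory names taken from the description in this README.
OUTPUT_SUBDIRS = ("results", "figs", "traces")


def ensure_output_dirs(module_dir):
    """Create the output subdirectories inside module_dir if they are missing.

    Returns a dict mapping each subdirectory name to its full path.
    This is a hypothetical helper, not part of WikiDAT itself.
    """
    paths = {}
    for name in OUTPUT_SUBDIRS:
        path = os.path.join(module_dir, name)
        # Check first instead of passing exist_ok, which Python 2.7 lacks.
        if not os.path.isdir(path):
            os.makedirs(path)
        paths[name] = path
    return paths
```

Calling ensure_output_dirs on a module's directory a second time is harmless, since existing directories are left untouched.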

The long-term goal is to add more case examples progressively, covering many of the common quantitative analyses that can be undertaken with Wikipedia data. In the future, this may also include tools for distributed computing, to support the analysis of very large data sets in high-resolution studies.

Required dependencies

The following software dependencies are required to run all examples currently included in WikiDAT:

  • MySQL server and client (v5.5 or later).
  • Python programming language (v2.7 or later, but not the v3 branch) and MySQLdb (v1.2.3).
  • R programming language and environment (v2.15.0 or later).
  • Additional R libraries with extra data and functionality (this list will be updated as new features are added to the toolkit):
    • RMySQL: Connect to MySQL databases from R.
    • Hmisc: Frank Harrell's miscellaneous functions (essential).
    • car: Companion library for "An R Companion to Applied Regression", 2nd ed.
    • DAAG: Companion library for "Data Analysis and Graphics using R."
    • ineq: Calculate inequality metrics and graphics.
    • ggplot2: A wonderful library to create appealing graphics in R.
    • eha: Library for event history and survival analysis.
    • zoo: Excellent library to handle time series data.
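
The Python side of the requirements listed above can be checked with a short script before running any module. This is only a sketch: the function names python_version_supported and is_importable are hypothetical, and the version rule simply encodes "v2.7 or later, but not the v3 branch" as stated in the list. It does not check the MySQLdb version, the MySQL server or the R libraries.

```python
import sys


def python_version_supported(version_info):
    """Return True for Python 2.7.x, False for older versions and for 3.x.

    Hypothetical helper; the rule follows the requirement stated above.
    """
    major_minor = tuple(version_info[:2])
    return (2, 7) <= major_minor < (3, 0)


def is_importable(module_name):
    """Return True if the named module can be imported in this interpreter."""
    try:
        __import__(module_name)
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    # Report on the running interpreter and on the MySQLdb package.
    print("Python 2.7.x: %s" % python_version_supported(sys.version_info))
    print("MySQLdb importable: %s" % is_importable("MySQLdb"))
```

Keeping the version rule in a function that takes the version tuple as an argument makes it easy to verify without switching interpreters.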