
Configuration files


You can always refer to the latest version of the example configuration file to obtain the full list of parameters accepted by WikiDAT.

How to configure ETL processes

To skip any of the several ETL processes implemented in WikiDAT, simply comment out the corresponding section in the configuration file. For example, if you are only interested in extracting information about administrative events (the logging table in MediaWiki), just remove or comment out all other [ETL:*] sections in the configuration file template, as in the sketch below. See the following sections for a complete description of the ETL processes currently implemented.
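As a minimal sketch of this approach (assuming standard INI syntax with # comments; the exact contents of the shipped template may differ), the following configuration would run only the pages-logging ETL process:

```ini
# All other [ETL:*] sections are commented out, so WikiDAT
# will only run the pages-logging ETL process.

# [ETL:RevHistory]
# etl_lines = 1
# page_fan = 1
# rev_fan = 2

[ETL:PagesLogging]
log_fan = 2
log_cache_size = 100000
```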

Understanding the configuration file

Configuration parameters are organized into several sections:

General

Parameters affecting general features. An example section is sketched after the list.

  • lang (String): Target Wikipedia project to be processed, as specified in http://dumps.wikimedia.org. For example, enwiki, dewiki or eswiki are valid project names.
  • date (YYYYMMDD): Date of the dump to be processed. It must correspond to a dump that has already been created and is listed on the mirror site. Alternatively, the value latest will try to download the latest available dump.
  • mirror (URL): Mirror site from which dump files will be downloaded.
  • download_files (Boolean): If True, database dump files will be downloaded. Otherwise, the program will try to process dump files already retrieved and stored in a local directory (see the next option).
  • dumps_dir (Path): Absolute or relative path to the local directory in which the dump files for this language have already been stored. If the previous option is True, this value is ignored.
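A sketch of what a [General] section might look like; the values below are purely illustrative, not the defaults from the actual example file:

```ini
[General]
# Target project, as named on http://dumps.wikimedia.org
lang = enwiki
# A concrete dump date (YYYYMMDD) or the special value latest
date = latest
mirror = http://dumps.wikimedia.org
# Set to False to reuse dump files already stored in dumps_dir
download_files = True
dumps_dir = ./dumps/enwiki
```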

Database

Parameters affecting database-related features. An example section is sketched after the list.

  • host (Hostname): Name of the host in which the local database is running.
  • port (Port num.): Port to connect to the local database.
  • db_engine (Engine name): Name of the database engine for the tables that will store the extracted information. Recommended values are MyISAM for MySQL and Aria for MariaDB.
  • db_user (User name): Valid user name to connect to the database. The user must have privileges to create databases and tables.
  • db_passw (Password): Password for this user to connect to the local database.
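A corresponding sketch for the [Database] section (again with illustrative values; port 3306 is the MySQL/MariaDB default):

```ini
[Database]
host = localhost
port = 3306
# MyISAM for MySQL, Aria for MariaDB
db_engine = MyISAM
# This user needs privileges to create databases and tables
db_user = wikidat
db_passw = secret
```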

ETL:RevHistory

Parameters affecting the Extract-Transform-Load process for full revision history dumps. An example section is sketched after the list.

  • etl_lines (Positive Integer): Number of ETL processing lines to be created. See next section to understand the structure of the ETL process and how it is parallelized.
  • page_fan (Positive Integer): Number of worker processes to handle extracted page elements.
  • rev_fan (Positive Integer): Number of worker processes to handle extracted revision elements.
  • page_cache_size (Positive Integer): Number of rows of page information that will be stored in a temporary file before it is uploaded to the local database.
  • rev_cache_size (Positive Integer): Number of rows of revision (and revision hash) information that will be stored in temporary files before they are uploaded to the local database.
  • base_ports (List of port numbers): Port numbers for the data communication sockets created with ZeroMQ. At least one port number must be provided for each ETL line.
  • control_ports (List of port numbers): Port numbers for the command sockets created with ZeroMQ. At least one port number must be provided for each ETL line (avoid overlapping with the base ports specified above).
  • detect_FA (Boolean): If True, revisions corresponding to featured articles (containing the FA template in that language) will be detected.
  • detect_FLIST (Boolean): If True, revisions corresponding to featured lists (containing the FLIST template in that language) will be detected.
  • detect_GA (Boolean): If True, revisions corresponding to good articles (containing the GA template in that language) will be detected.
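Putting these parameters together, an [ETL:RevHistory] section might look like the sketch below. The fan-out and port values are illustrative, and the space-separated list syntax is an assumption; the key constraint is that base_ports and control_ports each provide at least one port per ETL line and do not overlap:

```ini
[ETL:RevHistory]
# Two parallel ETL lines
etl_lines = 2
# Workers per line for extracted page and revision elements
page_fan = 1
rev_fan = 3
# Rows buffered in temporary files before upload to the database
page_cache_size = 100000
rev_cache_size = 100000
# One ZeroMQ data port and one command port per ETL line
base_ports = 10000 10010
control_ports = 11000 11010
# Template-based detection of featured/good content
detect_FA = True
detect_FLIST = False
detect_GA = False
```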

ETL:RevMeta

Parameters affecting the Extract-Transform-Load process for revision metadata history dumps.

NOT IMPLEMENTED YET

ETL:PagesLogging

Parameters affecting the Extract-Transform-Load process for dumps containing records of administrative events (the logging table in MediaWiki). Since pages-logging dump files are not split into separate chunks for any language, no more than a single ETL process can be set up in this case. Hence, the only available level of parallelization is using more workers to process logitem elements (the data units stored in this kind of file); an example section is sketched after the list:

  • log_fan (Positive Integer): Number of worker processes to handle extracted logitem elements.
  • log_cache_size (Positive Integer): Number of rows that will be stored with logitem information in a temporal file before it is uploaded to the local database.
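A final sketch for the [ETL:PagesLogging] section, again with illustrative values:

```ini
[ETL:PagesLogging]
# Workers processing extracted logitem elements
log_fan = 2
# Rows of logitem information buffered in a temporary file before upload
log_cache_size = 100000
```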

Table of contents | (Prev) Default execution | (Next) Understanding ETL