
PhenoDCC: File Source Crawler

This project implements the PhenoDCC crawler. The crawler is a Java application that runs periodically to retrieve phenotype data from research centres around the world. It is responsible for visiting the FTP/sFTP servers hosted by participating research centres and retrieving zip files that contain XML documents in the standardised file format specified by the IMPC. These XML documents contain phenotype data, which the crawler downloads, extracts, validates and processes so that the data is made available for quality control.

Usage

$ java -jar phenodcc-crawler-<version>-jar-with-dependencies.jar -h

PhenoDCC: File Source Crawler
Copyright (c) 2013 Medical Research Council Harwell
(http://www.mousephenotype.org)

usage:
    java -jar program.jar [-a <arg>] [-c <arg>]
        [-d <arg>] [-h] [-m <arg>] [-o <arg>]
        [-p <arg>] [-r <arg>] [-s <arg>]
        [-t <arg>] [-v <arg>] [-x <arg>]

    -a <arg>   Number of parallel downloaders to use.
    -c <arg>   The path to the properties file that specifies
               the Crawler configuration.
    -d <arg>   The path where the downloaded zipped data files will
               be stored. If unspecified, the current directory where
               the program is being executed is used.
    -h         Show help message on how to use the system.
    -m <arg>   Maximum number of download retries.
    -o <arg>   The path to the properties file that specifies
               the context builder configuration.
    -p <arg>   Sets the delay (in hours) for periodic runs. If zero, the
               program returns immediately after processing has finished.
    -r <arg>   If you wish the crawler to send a report, use this
               switch and provide a valid email address.
    -s <arg>   The path to the properties file that specifies
               the XML serialiser configuration.
    -t <arg>   Maximum size of the thread pool.
    -v <arg>   The path to the properties file that specifies
               the XML validator configuration.
    -x <arg>   The path to the properties file that specifies
               the XML validation resources configuration.
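
For example, a run that repeats every 24 hours, uses four parallel downloaders and emails a report might be invoked as follows. All of the property file names, paths and the email address below are illustrative; substitute your own configuration:

$ java -jar phenodcc-crawler-<version>-jar-with-dependencies.jar \
      -c crawler.properties -o contextbuilder.properties \
      -s serialiser.properties -v validator.properties \
      -x resources.properties -d /data/crawler/downloads \
      -a 4 -t 16 -m 3 -p 24 -r admin@example.com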

Dependencies

The following are compile-time dependencies.

  1. Library for XML document validation. This is implemented in org.mousephenotype.dcc.exportlibrary.xsdvalidation.

  2. Crawler entities library for accessing the tracker database. This is implemented in the phenodcc-entities-crawler project.

In addition to the above, the crawler also has various runtime dependencies. The details are available in how-to-deploy/README.md. Since the PhenoDCC system is quite complex, we have not attempted to cover the entire system here.

Implementation notes

The following is an overview of what happens when a crawler is run:

  1. Get a list of all the active centres.

  2. For each centre, send out an FTP/sFTP crawler for each of its file sources.

  3. These crawlers crawl the add, edit and delete directories inside the centre-specific directory of the file source.

  4. All of the zip files found are checked against the PhenoDCC tracker database, and new files are added to it.

  5. Once the crawling has finished, we have a map of what files need downloading.

  6. The download and validation are carried out in the following manner:

    • Start a fixed number of parallel download threads, each downloading a specific unprocessed file (see the thread-pool sketch after this list).

    • To download a file, look up the file source that hosts it in the download map generated by the FTP/sFTP crawlers.

    • After a successful download, the crawler extracts the XML documents and starts multiple threads from an unbounded cached thread pool, each thread validating one XML document.

    • Every download, extraction and validation step is recorded in the tracking database as a <phase, status> pair. Each <phase, status> pair is promoted to the parent entities (XML file > zip download > zip action).

  7. Once all of the XML documents have been validated, we proceed to the next stage. This stage is composed of two parts: data insertion into the Pre-QC database and data integrity checks.

    Since implicit data dependencies might exist between multiple XML documents, this stage cannot be parallelised using multiple threads. The data insertion followed by data integrity checks must be carried out sequentially, one after the other, for each of the XML documents. The temporal ordering of the XML documents is significant, and the XML files are ordered based on the details specified in the XML file names (see the ordering sketch after this list).
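
The following is a minimal sketch, in Java, of the thread-pool arrangement described in step 6: a fixed-size pool of downloaders feeding an unbounded cached pool of validators. All class and method names here (DownloadValidateSketch, fetch, extract, validate) are hypothetical and are not taken from the actual source tree; the sketch only shows the orchestration pattern.

    import java.io.File;
    import java.util.Collections;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class DownloadValidateSketch {

        // Fixed-size pool of parallel downloaders (the -a option above).
        private final ExecutorService downloaders = Executors.newFixedThreadPool(4);

        // Unbounded cached thread pool for validating extracted XML documents;
        // threads are created on demand and reused when idle.
        private final ExecutorService validators = Executors.newCachedThreadPool();

        // One task per unprocessed zip file in the download map.
        void process(List<String> unprocessedZips) throws InterruptedException {
            for (final String zip : unprocessedZips) {
                downloaders.submit(() -> {
                    File download = fetch(zip);                  // record <download, status>
                    for (final File xml : extract(download)) {   // record <extract, status>
                        validators.submit(() -> validate(xml));  // record <validate, status>
                    }
                });
            }
            downloaders.shutdown();
            downloaders.awaitTermination(1, TimeUnit.HOURS);
            // All validation tasks have been submitted by now; wait for them too.
            validators.shutdown();
            validators.awaitTermination(1, TimeUnit.HOURS);
        }

        // Placeholders for the actual download, extraction and validation logic.
        private File fetch(String path) { return new File(path); }
        private List<File> extract(File zip) { return Collections.emptyList(); }
        private void validate(File xml) { /* XSD validation */ }
    }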
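
Similarly, the temporal ordering in step 7 might be expressed as a comparator over the XML file names. The file-name pattern assumed here (a centre identifier, a date and a sequence number separated by dots) is purely illustrative; the actual naming convention is part of the IMPC specification.

    import java.io.File;
    import java.util.Arrays;
    import java.util.Comparator;

    public class XmlOrderingSketch {

        // Hypothetical pattern: <centre>.<yyyy-MM-dd>.<sequence>.xml
        static final Comparator<File> BY_NAME_DETAILS =
                Comparator.comparing((File f) -> f.getName().split("\\.")[1])
                          .thenComparingInt(f -> Integer.parseInt(f.getName().split("\\.")[2]));

        public static void main(String[] args) {
            File[] xmls = new File("extracted").listFiles((dir, name) -> name.endsWith(".xml"));
            if (xmls != null) {
                Arrays.sort(xmls, BY_NAME_DETAILS);
                for (File xml : xmls) {
                    // Insert into the Pre-QC database, then run integrity
                    // checks, strictly one document after the other.
                }
            }
        }
    }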

For further details, see the design documentation in /design.