PhenoDCC: File Source Crawler
This project implements the PhenoDCC crawler. The crawler is a Java application that runs periodically to retrieve phenotype data from research centres around the world. It is responsible for visiting the FTP/sFTP servers hosted by participating research centres and retrieving zip files that contain XML documents in the standardised file format specified by the IMPC. These XML documents contain phenotype data, which the crawler downloads, extracts, validates and processes so that the data is made available for quality control.
```
$ java -jar phenodcc-crawler-<version>-jar-with-dependencies.jar -h

PhenoDCC: File Source Crawler
Copyright (c) 2013 Medical Research Council Harwell (http://www.mousephenotype.org)

usage: java -jar program.jar [-a <arg>] [-c <arg>] [-d <arg>] [-h] [-m <arg>]
       [-o <arg>] [-p <arg>] [-r <arg>] [-s <arg>] [-t <arg>] [-v <arg>]
       [-x <arg>]
 -a <arg>   Number of parallel downloaders to use.
 -c <arg>   The path to the properties file that specifies the Crawler
            configuration.
 -d <arg>   The path where the downloaded zipped data files will be stored.
            If unspecified, the current directory where the program is
            being executed is used.
 -h         Show help message on how to use the system.
 -m <arg>   Maximum number of download retries.
 -o <arg>   The path to the properties file that specifies the context
            builder configuration.
 -p <arg>   Sets the delay (in hours) for periodic runs. If zero, the
            program returns immediately after processing has finished.
 -r <arg>   If you wish the crawler to send a report, use this switch and
            provide a valid email Id.
 -s <arg>   The path to the properties file that specifies the XML
            serialiser configuration.
 -t <arg>   Maximum size of the thread pool.
 -v <arg>   The path to the properties file that specifies the XML
            validator configuration.
 -x <arg>   The path to the properties file that specifies the XML
            validation resources configuration.
```
The following are compile-time dependencies:

- Library for XML document validation. This is implemented in
- Crawler entities library for accessing the tracker database. This is implemented in
In addition to the above, the crawler has various runtime dependencies. The
details are available in how-to-deploy/README.md. Since the PhenoDCC system
is quite complex, we have not attempted to cover the entire system here.
The following is an overview of what happens when a crawler is run:
1. Get a list of all of the active centres.
2. For each of the centres, send out multiple FTP/sFTP crawlers, one for each file source.
3. These crawlers crawl the `delete` directories inside the centre-specific directory of the file source.
4. All of the zip files are checked against the PhenoDCC tracker database, and new files are added to the tracker database.
5. Once the crawling has finished, we have a map of which files need downloading.
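The crawl phase above can be sketched as follows. This is a minimal, hypothetical illustration: the class and method names (`CrawlSketch`, `listZips`, `crawl`) are not from the real code base, the directory listing is stubbed out instead of being fetched over FTP/sFTP, and the tracker database is modelled as an in-memory set.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the crawl phase: visit each active centre, list
// the zip files on its file source, and build the download map of files
// that the tracker database has not seen before.
public class CrawlSketch {

    // Stubbed directory listing; the real crawler retrieves this listing
    // from the centre's FTP/sFTP file source.
    static List<String> listZips(String centre) {
        return List.of(centre + ".2013-01-01.zip",
                       centre + ".2013-02-01.zip");
    }

    // Returns the download map: zip files not already in the tracker.
    static List<String> crawl(List<String> activeCentres, Set<String> tracker) {
        List<String> toDownload = new ArrayList<>();
        for (String centre : activeCentres)
            for (String zip : listZips(centre))
                if (tracker.add(zip))   // new file: record it and queue it
                    toDownload.add(zip);
        return toDownload;
    }

    public static void main(String[] args) {
        Set<String> tracker = new HashSet<>(Set.of("ICS.2013-01-01.zip"));
        // ICS.2013-01-01.zip is already tracked, so three new files remain
        System.out.println(crawl(List.of("ICS", "HMGU"), tracker));
    }
}
```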
The download and validation are carried out in the following manner:

1. Start a fixed number of parallel download threads, each downloading a specific unprocessed file.
2. To carry out a download, find the file source that hosts the file using the download map generated by the FTP/sFTP crawlers.
3. After a successful download, the crawler extracts the XML documents and starts multiple threads from an unbounded cached thread pool, each thread validating one XML document.
4. Every download, extraction and validation step is recorded in the tracking database using `<phase, status>` pairs. Each `<phase, status>` pair is promoted to the parent entities (XML file > zip download > zip action).
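The threading model described above can be sketched with the standard `java.util.concurrent` executors: a fixed pool for downloads (cf. the `-a` option) feeding an unbounded cached pool for validation. The class and method names are hypothetical, and the download, extraction and validation work is stubbed out; this only illustrates the pool structure, not the real crawler code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch of the download/validation pipeline: a fixed pool
// of downloaders feeds an unbounded cached pool of validators.
public class PipelineSketch {

    public static List<String> run(List<String> zips, int downloaders)
            throws InterruptedException {
        ExecutorService downloadPool = Executors.newFixedThreadPool(downloaders);
        ExecutorService validatePool = Executors.newCachedThreadPool();
        List<String> log = Collections.synchronizedList(new ArrayList<>());
        CountDownLatch allValidated = new CountDownLatch(zips.size());
        for (String zip : zips)
            downloadPool.submit(() -> {
                // download and extraction would happen here (stubbed);
                // each extracted XML document is then validated in parallel
                validatePool.submit(() -> {
                    log.add(zip + ": <validation, done>"); // record phase/status
                    allValidated.countDown();
                });
            });
        allValidated.await();   // wait until every document is validated
        downloadPool.shutdown();
        validatePool.shutdown();
        return log;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(List.of("a.zip", "b.zip", "c.zip"), 2).size());
    }
}
```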
Once all of the XML documents have been validated, we proceed to the next stage, which is composed of two parts: data insertion into the Pre-QC database and data integrity checks.
Since implicit data dependencies might exist between multiple XML documents, this stage cannot be parallelised using multiple threads. The data insertion followed by data integrity checks must be carried out sequentially, one after the other, for each of the XML documents. The temporal ordering of the XML documents is significant, and the XML files are ordered based on the details specified in the XML file names.
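The ordering step can be sketched as below. This assumes each XML file name embeds a zero-padded date and sequence number (e.g. `CENTRE.YYYY-MM-DD.SEQ.xml`), so that lexicographic comparison of everything after the centre prefix yields temporal order; the actual naming rules are defined by the IMPC file-format specification, so treat this format, and the class name, as assumptions.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of the sequential-stage ordering: sorts XML file
// names by the date/sequence suffix that follows the centre prefix,
// assuming names look like CENTRE.YYYY-MM-DD.SEQ.xml.
public class OrderingSketch {

    static List<String> temporalOrder(List<String> xmlFiles) {
        List<String> sorted = new ArrayList<>(xmlFiles);
        // Drop the centre prefix, then compare the date/sequence suffix;
        // zero-padded dates make lexicographic order match temporal order
        sorted.sort(Comparator.comparing(name -> name.split("\\.", 2)[1]));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> ordered = temporalOrder(List.of(
                "ICS.2013-03-01.1.xml",
                "HMGU.2013-01-15.1.xml",
                "ICS.2013-01-20.2.xml"));
        // Each file is then inserted and integrity-checked one after another
        System.out.println(ordered);
    }
}
```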
For further details, see the design documentation in