Lucie Hutchins edited this page Dec 26, 2019 · 13 revisions

Biocore Data Downloads Repos

A repository for creating automations that download external bioinformatics datasets.

For each bioinformatics database source, this package can be set up to run an automation that checks whether a new version of the database is available. As soon as it detects a new release, the automation downloads it locally.
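
The check-then-download behaviour can be pictured with a small shell sketch. It is illustrative only: the function name is_new_release, its arguments, and the REMOTE_URL placeholder are assumptions, and the package's own scripts may detect releases differently. It compares a remote version string against the value stored in a local release file such as current_release_NUMBER.

```shell
#!/bin/sh
# Sketch only: is_new_release and its arguments are hypothetical names,
# not part of the package.

# is_new_release RELEASE_FILE REMOTE_VERSION
# Succeeds (exit 0) when nothing is recorded locally yet, or when the
# recorded version differs from REMOTE_VERSION.
is_new_release() {
    release_file=$1
    remote_version=$2
    [ -f "$release_file" ] || return 0
    local_version=$(cat "$release_file")
    [ "$local_version" != "$remote_version" ]
}

# Demo in a scratch directory
demo_dir=$(mktemp -d)
printf '98\n' > "$demo_dir/current_release_NUMBER"

if is_new_release "$demo_dir/current_release_NUMBER" 99; then
    echo "new release detected: a download would start here"
    # e.g. wget -q -P "$demo_dir/99" "$REMOTE_URL"   (REMOTE_URL is a placeholder)
fi
```

The comparison is a plain string inequality, so it makes no assumption about how a given source numbers its releases.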

What It Does

For each database source, the automation creates a root directory named after the database, under the standard base path set in the main configuration (EXTERNAL_DATA_BASE). The organization of files under each root directory depends on how the data source publishes its data.

Release-Centric Data Sources

Under the data source root directory, you will find:

  • A file (current_release_NUMBER) that stores the latest release of the data source
  • A directory for each version downloaded
  • A symbolic link "current" that points to the latest version
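
The layout above can be reproduced with a short shell sketch. This is an illustration under assumptions, not the package's implementation: the function name record_release, the source name "mysource", and the treatment of current_release_NUMBER as a literal file name are all hypothetical.

```shell
#!/bin/sh
# Sketch only: record_release, "mysource", and the literal file name
# current_release_NUMBER are assumptions for illustration.

# record_release BASE SOURCE VERSION
# Creates BASE/SOURCE/VERSION, records VERSION, and repoints "current".
record_release() {
    base=$1
    source_name=$2
    version=$3
    root="$base/$source_name"

    # One directory per downloaded version
    mkdir -p "$root/$version" || return 1

    # File that stores the latest release identifier
    printf '%s\n' "$version" > "$root/current_release_NUMBER" || return 1

    # Repoint the "current" symlink; -n replaces the link itself
    # instead of creating a new link inside the old target directory
    ln -sfn "$version" "$root/current"
}

# Demo in a scratch directory standing in for EXTERNAL_DATA_BASE
demo_base=$(mktemp -d)
record_release "$demo_base" mysource 98
record_release "$demo_base" mysource 99
ls -l "$demo_base/mysource"
```

Using a relative link target keeps the tree valid if the whole base directory is later moved or mounted elsewhere.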

Non Release-Centric Data Sources

Under the data source root directory, files are stored by dataset, or as specified by the DATASETS and/or TAXA variables in the data source configuration file.
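
As a rough illustration, the sketch below builds a dataset-by-taxon directory tree from two variables. Only the variable names DATASETS and TAXA come from this page. Their format (space-separated lists), the example values, the nesting order, and the function name make_dataset_dirs are assumptions, so check the data source's own configuration file for the real convention.

```shell
#!/bin/sh
# Sketch only: the values, the space-separated format, and the
# dataset/taxon nesting are assumptions, not the package's convention.
DATASETS="genes proteins"
TAXA="human mouse"

# make_dataset_dirs ROOT
# Creates ROOT/<dataset>/<taxon> for every combination, or just
# ROOT/<dataset> when TAXA is empty.
make_dataset_dirs() {
    root=$1
    for dataset in $DATASETS; do
        if [ -z "$TAXA" ]; then
            mkdir -p "$root/$dataset" || return 1
        else
            for taxon in $TAXA; do
                mkdir -p "$root/$dataset/$taxon" || return 1
            done
        fi
    done
}

# Demo in a scratch directory
demo_root=$(mktemp -d)
make_dataset_dirs "$demo_root"
find "$demo_root" -type d
```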

Getting Started

Dependencies

System

This package was tested on Linux and Mac OS environments.

Software

The main dependency of this package is wget. To unpack downloaded datasets, make sure the tar, unzip, and gunzip utilities are installed as well.

  • wget
  • tar
  • unzip
  • gunzip

To verify that these tools are installed, run the following commands:

  • wget:   which wget
  • tar:    which tar
  • unzip:  which unzip
  • gunzip: which gunzip
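
The four checks can also be run in one pass. The sketch below is a convenience wrapper, not part of the package, and the function name check_tools is hypothetical. It uses the POSIX built-in command -v, which behaves like which for this purpose.

```shell
#!/bin/sh
# Sketch only: check_tools is a hypothetical helper, not part of the package.

# check_tools TOOL...
# Prints the path of each tool, or MISSING; returns non-zero if any is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if path=$(command -v "$tool"); then
            printf '%-8s %s\n' "$tool" "$path"
        else
            printf '%-8s MISSING\n' "$tool"
            missing=1
        fi
    done
    return "$missing"
}

check_tools wget tar unzip gunzip ||
    echo "Install the missing tools before running the automation." >&2
```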

Repos Organization

Data Sources

Each source is a sub-directory that contains:

  1. a configuration file
  2. a readme file
  3. a download script (only for some Sources)

Each source name is all lowercase and matches the name of the source's download root directory. Where applicable, different versions of a source are downloaded under that same root directory.

Main Scripts and Config

In addition to data source directories, the following are found under the package's root.

Global config and setup

Used to get/set the version to download - Applicable only to release-centric sources

Download Automation Trigger

Used by the Trigger to download datasets

Trigger Helpers -