SRASearch

An NCBI Sequence Read Archive (SRA) upload management search utility.

The SRA produces a regularly updated set of batch telemetry files for laboratories that submit data to its repository. These files help a submitting center reconcile what it has sent with what the SRA has received and processed.

This utility processes these telemetry files and presents that data through a "Google"-esque search interface.

BUILDING

The instructions assume that you have access to a Go compiler.

Set up a Go workspace

Initialize a proper Go development workspace:

mkdir /path/to/project
cd /path/to/project
export GOPATH=$PWD
export GOROOT=$(go env GOROOT)

Set up the git repository:

mkdir -p $GOPATH/src/github.com/indraniel/
cd $GOPATH/src/github.com/indraniel/
git clone git@github.com:indraniel/srasearch.git
cd srasearch/

Initialize the dependencies

make prepare

Build the app

make

You should now see a srasearch executable inside the $GOPATH/src/github.com/indraniel/srasearch directory. You can move that file wherever you like.

USAGE

The examples below use the SRA batch telemetry files available to the McDonnell Genome Institute, whose NCBI center abbreviation is WUGSC.

The SRA provides monthly full and daily incremental versions of the batch telemetry files. Using these we can create an "SRA Dump" file, an intermediary file which is a collection of JSON documents.

srasearch will build upon prior "SRA Dumps" as new incremental telemetry files are obtained.

Initializing an SRA Dump

This command uses the full metadata and data transfer telemetry files produced at the beginning of the month. For example, here is how the command would run on May 1, 2015:

srasearch init-dump -m data/SRA/NCBI_SRA_Metadata_Full_WUGSC_20150501.tar.gz -u data/SRA/NCBI_SRA_Files_Full_WUGSC_20150501.gz -o 2015-05-01.dump.dat.gz

Here we initialized an SRA Dump file named 2015-05-01.dump.dat.gz.

Incrementing an existing SRA Dump

This command uses the incremental metadata and data transfer telemetry files and a prior existing SRA Dump file. For example, this is how the command would run on May 2, 2015:

srasearch increment-dump -i 2015-05-01.dump.dat.gz -m /path/to/NCBI_SRA_Metadata_WUGSC_20150502.tar.gz -u data/SRA/NCBI_SRA_Files_WUGSC_20150502.gz -o 2015-05-02.dump.dat.gz

Here we created an updated SRA Dump file named 2015-05-02.dump.dat.gz.

On May 3, 2015, the command to create a new updated "SRA Dump" file would look like so:

srasearch increment-dump -i 2015-05-02.dump.dat.gz -m /path/to/NCBI_SRA_Metadata_WUGSC_20150503.tar.gz -u data/SRA/NCBI_SRA_Files_WUGSC_20150503.gz -o 2015-05-03.dump.dat.gz

We can proceed similarly through the rest of the month. Once the next month arrives, we can initialize a brand new SRA Dump again.
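The day-by-day pattern above can be sketched as a small shell helper that prints, rather than runs, the increment-dump command for each day in a range. This is only an illustration: the function name print_increment_cmds is hypothetical, and it assumes that all telemetry files live under data/SRA/ and that dumps are named YYYY-MM-DD.dump.dat.gz as in the examples. Only the increment-dump flags shown above are used.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) one increment-dump command per day.
# Assumptions: telemetry files are under data/SRA/, the center is WUGSC,
# and dumps are named YYYY-MM-DD.dump.dat.gz as in the examples above.
print_increment_cmds() {
    year=$1    # four digits, e.g. 2015
    month=$2   # two digits, e.g. 05
    first=$3   # first incremental day, no leading zero, e.g. 2
    last=$4    # last incremental day, no leading zero, e.g. 3
    day=$first
    while [ "$day" -le "$last" ]; do
        # The previous day's dump is the input for the current day.
        prev=$(printf '%s-%s-%02d' "$year" "$month" "$((day - 1))")
        cur=$(printf '%s-%s-%02d' "$year" "$month" "$day")
        stamp=$(printf '%s%s%02d' "$year" "$month" "$day")
        echo "srasearch increment-dump -i ${prev}.dump.dat.gz -m data/SRA/NCBI_SRA_Metadata_WUGSC_${stamp}.tar.gz -u data/SRA/NCBI_SRA_Files_WUGSC_${stamp}.gz -o ${cur}.dump.dat.gz"
        day=$((day + 1))
    done
}

# Print the commands for May 2 and May 3, 2015.
print_increment_cmds 2015 05 2 3
```

Printing the commands first makes it easy to review them before piping the output to a shell. Each day's input is the previous day's output, so the commands have to run in order.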

Creating a search index from a given SRA Dump

Given an SRA Dump file, the make-index sub-command builds the primary search index database:

srasearch make-index -i 2015-05-03.dump.dat.gz -o /path/to/db/srasearch0503.idx

Creating a recent uploads file (optional)

This sub-command creates an abbreviated "recent uploads" TSV file:

srasearch-noweb make-uploads --ncbi-uploads="/path/to/SRA/NCBI_SRA_Files_WUGSC_20150503.gz" --output-dir="/path/to/db/srasearch0503.idx" --threshold=4000 

In this example, we've written the 4000 most recent uploads (from May 3, 2015) to a file inside the primary search index database directory, /path/to/db/srasearch0503.idx/recent-4000-sra-uploads-20150523.tsv.

This file is generally placed within a search index directory. It provides the data for the "Recent Uploads" link in the web app.

Start up the web app

srasearch web --host="0.0.0.0" --port=9999 --index-path="/path/to/db/srasearch0503.idx"
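The index directory written by make-index is the same path later given to web through --index-path, so it can help to derive it once from the dump file name. The sketch below does that and prints the two commands. The helper name index_name_for_dump is hypothetical, and the srasearchMMDD.idx naming is assumed to be just the convention used in the examples above, not something the tool is known to require.

```shell
#!/bin/sh
# Hypothetical helper: derive an index directory name such as
# srasearch0503.idx from a dump file name such as 2015-05-03.dump.dat.gz.
# The naming scheme mirrors the examples above and is an assumption.
index_name_for_dump() {
    base=$(basename "$1")        # 2015-05-03.dump.dat.gz
    date_part=${base%%.*}        # 2015-05-03
    month_day=${date_part#*-}    # 05-03
    month=${month_day%-*}        # 05
    day=${month_day#*-}          # 03
    printf 'srasearch%s%s.idx\n' "$month" "$day"
}

dump=2015-05-03.dump.dat.gz
idx="/path/to/db/$(index_name_for_dump "$dump")"

# Print the commands for review instead of running them.
echo "srasearch make-index -i $dump -o $idx"
echo "srasearch web --host=\"0.0.0.0\" --port=9999 --index-path=\"$idx\""
```

Keeping the path in one variable avoids pointing the web app at a stale index after a new dump has been indexed.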

NOTES

srasearch uses the bleve text indexing library as its underlying search engine, with BoltDB as bleve's underlying key/value store.

All of this app's dependencies are stored within this repository and are managed by godep.

LICENSE

ISC
