Integration of the NCBI Sequence Read Archive (SRA) into Nextflow #89

Closed
evanfloden opened this Issue Nov 18, 2015 · 16 comments

6 participants
@evanfloden
Contributor

evanfloden commented Nov 18, 2015

From Nature:

A condition of publication in a Nature journal is that authors are required to make materials, data, code, and associated protocols promptly available to readers without undue qualifications.

Truly reproducible computational analysis relies on having the exact same software, computational environment and data available. Nextflow along with Docker fulfills the software and environment requirements, yet the data is kept distinct.

It is now common practice that DNA and RNA sequencing data must be made available in the NCBI Sequence Read Archive (SRA). To date the SRA contains over 4.3 quadrillion bases (Nov '15).

Sequence Read Archive (SRA) makes biological sequence data available to the research community to enhance reproducibility and allow for new discoveries by comparing data sets. The SRA stores raw sequencing data and alignment information from high-throughput sequencing platforms, including Roche 454 GS System®, Illumina Genome Analyzer®, Applied Biosystems SOLiD System®, Helicos Heliscope®, Complete Genomics®, and Pacific Biosciences SMRT®.
http://www.ncbi.nlm.nih.gov/sra

Integration of the SRA into Nextflow could allow users to reference, pull and analyse their own or others' read data in a programmatic manner.

As an example, it would be great if a user of a pipeline could specify an SRA Study ID, for example SRP003186, and have the mapping performed exactly as specified in the study.

The SRA study SRP003186 is titled "Identification of fusion genes in breast cancer by paired-end RNA-sequencing" and consists of 7 RNA-seq samples in fastq format. We can see that each individual sample contains the following information about the library used:

Library:
Name: FIMM-OK-normbr-1
Instrument: Illumina Genome Analyzer II
Strategy: WGS
Source: TRANSCRIPTOMIC
Selection: cDNA
Layout: PAIRED

Taking this and other information such as the reference genome, each sample's fastq files could be mapped using the information above as parameters, and the downstream analysis performed.
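
To make the idea concrete, here is a minimal Python sketch of turning the library block above into mapping parameters. The function names, the parameter names (`paired_end`, `rna_seq`, `platform`) and the `reference.fa` placeholder are hypothetical and chosen only for illustration; they do not correspond to any existing Nextflow or SRA API.

```python
# Library description copied from the sample shown above.
LIBRARY = """Library:
Name: FIMM-OK-normbr-1
Instrument: Illumina Genome Analyzer II
Strategy: WGS
Source: TRANSCRIPTOMIC
Selection: cDNA
Layout: PAIRED
"""


def parse_library_block(text):
    """Parse 'Key: value' lines into a dict with lowercased keys."""
    meta = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        # Skip lines without a value, such as the 'Library:' heading.
        if sep and value.strip():
            meta[key.strip().lower()] = value.strip()
    return meta


def mapping_params(meta, reference):
    """Derive generic, hypothetical mapping parameters from the metadata."""
    return {
        "reference": reference,
        "paired_end": meta.get("layout", "").upper() == "PAIRED",
        "rna_seq": meta.get("source", "").upper() == "TRANSCRIPTOMIC",
        "platform": meta.get("instrument", "unknown"),
    }


if __name__ == "__main__":
    # 'reference.fa' is a placeholder, not a file named in the study.
    print(mapping_params(parse_library_block(LIBRARY), "reference.fa"))
```

A pipeline could then branch on something like `paired_end` to decide whether to hand the mapper one or two fastq files per sample.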

Tools for accessing the SRA are available here: https://github.com/ncbi/ngs

Please leave comments if you have any ideas

@pditommaso

Member

pditommaso commented Nov 18, 2015

Thanks for that, Evan. I'm looking at the examples in the NGS binding toolkit, but it looks like there's no API to access a study as you are suggesting.

It seems that the API entry point is a ReadCollection object, which you get by specifying an accession number.

Could that class be useful for what you are proposing?

cc @emi80

@kwrodarmer


kwrodarmer commented Nov 19, 2015

I mentioned in the NGS thread that the NGS API was developed with the purpose of being applicable to other archives as well as the SRA. As such, we took pains to avoid putting in requirements that weren't universally applicable.

At this time, we are putting together the NCBI-specific extensions to the NGS API, which will give access to the unique features of SRA and will include resolving study accessions, accessing coverage graphs built into SRA objects, etc.

@pditommaso

Member

pditommaso commented Nov 20, 2015

Thanks for commenting on this. We will start our integration with the current NGS API, which allows reads to be downloaded from the SRA.

If at some point it becomes possible to retrieve other features such as studies, coverage graphs, etc., that would be great.

@durbrow


durbrow commented Nov 21, 2015

If you're interested in resolving SRPs (and SRXs and possibly other things) to SRA runs, here is some Python code to do it. It uses NCBI's eutils web service to perform the lookup. Look at the SRARunList function, in particular.
https://github.com/durbrow/efetch.py/blob/master/efetch.py
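
For readers who just want the shape of such a lookup, below is a minimal Python sketch using only the standard library. It is not the SRARunList function from the linked script. It assumes the usual two-step E-utilities flow (esearch to get record UIDs, then efetch to get the records) and assumes that the efetch XML for `db=sra` carries run accessions as `<RUN accession="...">` elements; all function names here are made up for the example.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(accession, retmax=500):
    """Build the esearch URL that looks up an accession (e.g. an SRP) in db=sra."""
    query = urllib.parse.urlencode({"db": "sra", "term": accession, "retmax": retmax})
    return f"{EUTILS}/esearch.fcgi?{query}"


def efetch_url(uids):
    """Build the efetch URL that retrieves the SRA records for the given UIDs."""
    query = urllib.parse.urlencode({"db": "sra", "id": ",".join(uids)})
    return f"{EUTILS}/efetch.fcgi?{query}"


def parse_uids(esearch_xml):
    """Extract the record UIDs from an esearch XML response."""
    root = ET.fromstring(esearch_xml)
    return [el.text for el in root.findall("./IdList/Id")]


def parse_runs(efetch_xml):
    """Collect run accessions, assuming they appear as <RUN accession="...">."""
    root = ET.fromstring(efetch_xml)
    return sorted({run.get("accession") for run in root.iter("RUN") if run.get("accession")})


def resolve_runs(accession):
    """Resolve a study or experiment accession to run accessions (needs network)."""
    with urllib.request.urlopen(esearch_url(accession)) as resp:
        uids = parse_uids(resp.read())
    if not uids:
        return []
    with urllib.request.urlopen(efetch_url(uids)) as resp:
        return parse_runs(resp.read())
```

Calling `resolve_runs("SRP003186")` would perform the two HTTP requests. The parsing functions are kept separate so they can be checked without network access.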

@evanfloden

Contributor Author

evanfloden commented Oct 3, 2018

Snakemake has the concept of 'remotes' of which the NCBI is one.

A similar approach would be very useful here.

@pditommaso pditommaso added the nfhack18 label Oct 3, 2018

@pditommaso

Member

pditommaso commented Oct 3, 2018

This is a good topic for the NF hack next month.

@pditommaso

Member

pditommaso commented Oct 5, 2018

@wikiselev

Contributor

wikiselev commented Oct 8, 2018

We would really like to be involved in this! We have just started exploring different options and I found this, really cool!

@pditommaso

Member

pditommaso commented Oct 8, 2018

Do you mean htsget or SRA archive in general?

@wikiselev

Contributor

wikiselev commented Oct 8, 2018

SRA at the moment. But htsget is also on my radar.

@wikiselev

Contributor

wikiselev commented Oct 11, 2018

We just had a discussion with @micans and our conclusion was that for both SRA and iRods there are quite a lot of different scenarios, and it will be hard to put them all into one core function. It would probably be better to have separate processes and provide them as templates or starting points.

@pditommaso

Member

pditommaso commented Oct 11, 2018

Yes, please open a separate issue for iRods with a brief description of a use case (at a practical level).

@lukasjelonek


lukasjelonek commented Nov 26, 2018

During nfhack18 we prototyped the retrieval of data from databases. In the end we came up with the following interface:

// usage
Database.from(db: "sra", format: "fastq", accessions: "acc1,acc2,acc3")

// or with a file of accessions
Database.from(db: "sra", format: "fastq", accession_file: "some/path")

We implemented access via ENA (they had the easier-to-use web API compared to the NCBI). The idea is to create a channel that contains the URLs for the specified accessions and let Nextflow download them with its default mechanism.

You can check out the sample implementation at https://github.com/lukasjelonek/nextflow-download-public-data/blob/master/lib/Database.groovy
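
As a rough illustration of the idea (accessions in, download URLs out), here is a small Python sketch. It is not the Groovy implementation linked above. It assumes the ENA Portal API `filereport` endpoint with `result=read_run` and a `fastq_ftp` column holding semicolon-separated paths without a scheme; the endpoint may differ from what the prototype used, and the helper names are hypothetical.

```python
import csv
import io
import urllib.parse

# Assumed endpoint: the ENA Portal API file report.
ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"


def parse_accessions(accessions=None, accession_file=None):
    """Accept either a comma-separated string or a file with one accession per line."""
    if accessions is not None:
        items = accessions.split(",")
    elif accession_file is not None:
        with open(accession_file) as fh:
            items = fh.read().splitlines()
    else:
        raise ValueError("provide accessions or accession_file")
    return [a.strip() for a in items if a.strip()]


def filereport_url(accession):
    """Build the file report URL listing the fastq files of one accession."""
    query = urllib.parse.urlencode({
        "accession": accession,
        "result": "read_run",
        "fields": "run_accession,fastq_ftp",
        "format": "tsv",
    })
    return f"{ENA_FILEREPORT}?{query}"


def fastq_urls(report_tsv):
    """Turn the TSV report into a flat list of ftp:// URLs."""
    urls = []
    for row in csv.DictReader(io.StringIO(report_tsv), delimiter="\t"):
        # fastq_ftp may hold several paths (e.g. paired reads) separated by ';'.
        for path in (row.get("fastq_ftp") or "").split(";"):
            if path:
                urls.append("ftp://" + path)
    return urls
```

The list returned by `fastq_urls` corresponds to what the channel would emit, with the actual download left to the workflow engine.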

@wikiselev

Contributor

wikiselev commented Nov 26, 2018

Very nice, thanks so much for your effort Lukas!

@pditommaso

Member

pditommaso commented Nov 26, 2018

This looks cool! I'll give it a try at my earliest convenience.

@pditommaso

Member

pditommaso commented Mar 12, 2019

Closing this in favour of #1070

@pditommaso pditommaso closed this Mar 12, 2019
