Support for dbGAP #138

Closed
bhodkins opened this issue Mar 7, 2023 · 4 comments
Comments


bhodkins commented Mar 7, 2023

Description of feature

Based on the docs, it looks like it is not currently possible to get files from dbGAP. Is this something that could be added? This can be done with 'prefetch' and 'fasterq-dump' by specifying either a JWT file or NGC file (the user would need to provide these).
https://www.ncbi.nlm.nih.gov/sra/docs/sra-dbGAP-cloud-download/
https://www.ncbi.nlm.nih.gov/sra/docs/sra-dbgap-download/
Thanks!

@bhodkins bhodkins added the enhancement Improvement for existing functionality label Mar 7, 2023
Contributor

robsyme commented Mar 9, 2023

The danger with including the JWT file as an input to the workflow is that the (often sensitive) key is stored in the run directory. I've had a quick play around and it may be better to store the JWT file contents as a Nextflow secret that can be injected into the container.

I'll try and run some tests using the prj_phs710EA_test.ngc test key to pull the protected SRR1219902 run. I'll update this issue with the results.

Contributor

robsyme commented Mar 23, 2023

We also need to update the metadata function to add the "gap" database, basically by cloning the method below and changing "sra" to "gap":

    @classmethod
    def _id_to_srx(cls, identifier):
        """Resolve the identifier to SRA experiments."""
        params = {"id": identifier, "db": "sra", "rettype": "runinfo", "retmode": "text"}
        response = fetch_url(f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?{urlencode(params)}")
        cls._content_check(response, identifier)
        return [row["Experiment"] for row in open_table(response, delimiter=",")]
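One way to avoid duplicating the method might be to make the database a parameter. The following is only a minimal sketch of that idea, not the change that was merged: `build_runinfo_url` and `experiments_from_runinfo` are hypothetical names, the standard `csv` module stands in for the pipeline's `open_table` helper, and whether efetch returns a usable runinfo table for `db=gap` is an assumption taken from the comment above.

```python
import csv
import io
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_runinfo_url(identifier, db="sra"):
    """Build an efetch runinfo URL for the given Entrez database.

    Passing db="gap" is the assumption under discussion in this issue.
    """
    params = {"id": identifier, "db": db, "rettype": "runinfo", "retmode": "text"}
    return f"{EFETCH_URL}?{urlencode(params)}"


def experiments_from_runinfo(text):
    """Return the Experiment column of a comma-separated runinfo table."""
    return [row["Experiment"] for row in csv.DictReader(io.StringIO(text))]


# Only the URL is built here; no request is sent.
print(build_runinfo_url("SRR1219902", db="gap"))
```

With something like this, the existing "sra" lookup and a new "gap" lookup would differ only in the `db` argument.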

@drpatelh drpatelh added this to the 1.10 milestone Apr 25, 2023
@ejseqera ejseqera assigned ejseqera and unassigned robsyme May 5, 2023
Contributor

ejseqera commented May 9, 2023

I've implemented support for the JWT in PR #152 after testing with SRR1219902. However, there are some caveats with using a JWT to download dbGAP data in this pipeline:

fetchngs resolves each input ID back to its experiment-level accession ID and then fetches the metadata from the ENA API. This means that, for a given SRA run ID, any other runs tied to the same experiment are also included in runinfo_ftp.tsv, and the pipeline will try to download FASTQs for them too.

To use a JWT, then, you must generate it for ALL runs tied to a given experiment accession ID. Otherwise, the pipeline will attempt to run sra-tools prefetch with the other run IDs under that experiment and fail, because the JWT does not grant access to those runs.

For example,

  • I generate a JWT from the SRA Run Selector for SRR1219902

  • I provide SRR1219902 as my only input ID to the pipeline

  • sra_ids_to_runinfo.py pulls metadata for SRR1219902, resolving this ID to its experiment accession, SRX512039, and the resultant metadata looks as follows:

id	run_accession	experiment_accession
SRX512039_SRR1219865	SRR1219865	SRX512039
SRX512039_SRR1219902	SRR1219902	SRX512039
  • Run ID SRR1219865 also exists under experiment SRX512039
  • prefetch tries to pull files for SRR1219902 but also for SRR1219865, which my JWT does not authenticate for
  • The pipeline fails with a 403 error
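The failure described above could in principle be spotted before any download starts. The sketch below is illustrative only and is not part of fetchngs: `runs_missing_from_jwt` is a hypothetical helper, and it assumes the user supplies the list of runs the JWT was generated for (nothing is read from the JWT itself). It uses the metadata columns shown above.

```python
import csv
import io


def runs_missing_from_jwt(runinfo_tsv, authorized_runs):
    """Group runs that the JWT does not cover by experiment accession.

    runinfo_tsv: tab-separated metadata with 'run_accession' and
    'experiment_accession' columns, as in the table above.
    authorized_runs: run IDs the JWT was generated for (user-supplied).
    """
    authorized = set(authorized_runs)
    missing = {}
    for row in csv.DictReader(io.StringIO(runinfo_tsv), delimiter="\t"):
        run = row["run_accession"]
        if run not in authorized:
            missing.setdefault(row["experiment_accession"], []).append(run)
    return missing


runinfo = (
    "id\trun_accession\texperiment_accession\n"
    "SRX512039_SRR1219865\tSRR1219865\tSRX512039\n"
    "SRX512039_SRR1219902\tSRR1219902\tSRX512039\n"
)

# The JWT was generated for SRR1219902 only.
print(runs_missing_from_jwt(runinfo, ["SRR1219902"]))
# prints {'SRX512039': ['SRR1219865']}
```

An empty result would mean every run that the pipeline resolved is covered by the JWT.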

How do we mitigate this?

  • Generate the JWT in the Run Selector by selecting all runs under a given experiment, in this case SRX512039
  • Now I can authenticate for all run data for that experiment using prefetch and fasterq-dump

This is not ideal, because users may end up pulling data for runs they don't need or want. I am also not sure to what extent dbGAP study data is set up in this way. This is the only test of dbGAP with a JWT that is available to us.

Contributor

ejseqera commented May 16, 2023

This has been implemented in #152 and will be available in v1.10.0.
