Not able to fetch metadata for ERR* ids associated with ArrayExpress #85

Closed
drpatelh opened this issue May 5, 2022 · 5 comments

drpatelh commented May 5, 2022

Description of feature

The current implementation in the pipeline assumes that all ERR* ids are available via the ENA API. We fetch the metadata using the URL below:
https://www.ebi.ac.uk/ena/portal/api/filereport?accession=ERR9539214&result=read_run&fields=run_accession%2Cexperiment_accession
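For anyone wanting to reproduce the request outside the pipeline, here is a minimal sketch of how the URL above is put together. `build_filereport_url` is a hypothetical helper written for this comment, not the function used in `sra_ids_to_runinfo.py`; the endpoint and query parameters are copied from the URL above.

```python
from urllib.parse import urlencode

# Endpoint taken from the URL quoted in this issue.
ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"


def build_filereport_url(accession, fields=("run_accession", "experiment_accession")):
    """Build the ENA filereport URL for one run accession (hypothetical helper)."""
    query = urlencode(
        {
            "accession": accession,
            "result": "read_run",
            # urlencode escapes the comma as %2C, matching the URL in the issue.
            "fields": ",".join(fields),
        }
    )
    return f"{ENA_FILEREPORT}?{query}"


print(build_filereport_url("ERR9539214"))
```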

However, this URL returns no content for ERR9539214, and the pipeline's sra_ids_to_runinfo.py step fails with the error below:

Error executing process > 'NFCORE_FETCHNGS:SRA:SRA_IDS_TO_RUNINFO (ERR9539214)'

Caused by:
  Process `NFCORE_FETCHNGS:SRA:SRA_IDS_TO_RUNINFO (ERR9539214)` terminated with an error exit status (1)

Command executed:

  echo ERR9539214 > id.txt
  sra_ids_to_runinfo.py \
      id.txt \
      ERR9539214.runinfo.tsv \
  
  
  cat <<-END_VERSIONS > versions.yml
  "NFCORE_FETCHNGS:SRA:SRA_IDS_TO_RUNINFO":
      python: $(python --version | sed 's/Python //g')
  END_VERSIONS

Command exit status:
  1

Command output:
  (empty)

Command error:
  [ERROR] There is no content for id ERR9539214. Maybe you lack the right permissions?

On closer inspection, this id is associated with an ArrayExpress experiment, E-MTAB-11611. If you follow the link to E-MTAB-11611.sdrf.txt on that page, you get all of the metadata for ERR9539214 as well as for the other ids associated with that submission. This includes FTP links for direct download, which work when you try them locally, so the data isn't restricted even though it is patient data, e.g.

wget -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR953/004/ERR9539214/ERR9539214_1.fastq.gz
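The FTP path above appears to follow ENA's usual directory layout: the first six characters of the accession, then (for accessions with more than six digits) a zero-padded subdirectory built from the remaining digits. The sketch below reconstructs the link under that assumption. `ena_fastq_ftp_url` is a hypothetical helper and the layout is inferred, so in practice it is probably safer to read the `fastq_ftp` field from the filereport API than to derive paths by hand.

```python
FTP_ROOT = "ftp://ftp.sra.ebi.ac.uk/vol1/fastq"  # root taken from the wget example


def ena_fastq_ftp_url(run_accession, suffix="_1"):
    """Guess the FASTQ FTP URL for a run accession (assumed layout, not authoritative)."""
    prefix = run_accession[:6]   # e.g. "ERR953"
    digits = run_accession[3:]   # numeric part, e.g. "9539214"
    parts = [FTP_ROOT, prefix]
    if len(digits) > 6:
        # Digits from the 7th onwards, zero-padded to three characters: "4" -> "004".
        parts.append(digits[6:].zfill(3))
    parts.append(run_accession)
    parts.append(f"{run_accession}{suffix}.fastq.gz")
    return "/".join(parts)


print(ena_fastq_ftp_url("ERR9539214"))
```

Pass `suffix="_2"` for the second read of a pair, or `suffix=""` for single-end data.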

ArrayExpress has its own API, as documented here, and you can access all of the metadata for samples associated with an ArrayExpress id in XML format with the URL below:
https://www.ebi.ac.uk/arrayexpress/xml/v3/experiments/E-MTAB-11611/samples

Still a bit puzzled as to why we can't get any metadata for ERR9539214 from the ENA API, because it's a native id for that database.

We should look into this further and add support for ArrayExpress ids too.

@drpatelh drpatelh added the enhancement Improvement for existing functionality label May 5, 2022
@Midnighter
Contributor

Is it possible to somehow detect that an ID comes from ArrayExpress? Or do you think the solution will be to try one thing first and then the other?

@drpatelh
Member Author

drpatelh commented May 6, 2022

Is it possible to somehow detect that an ID comes from ArrayExpress?

I haven't looked properly, but it would be neat if we could do this for any ERR* id beforehand, although that means another API call. Alternatively, we catch any 204 responses associated with ERR* ids and then try ArrayExpress.

@Midnighter
Contributor

The above error message

[ERROR] There is no content for id ERR9539214. Maybe you lack the right permissions?

is already the result of special-casing 204, so it wouldn't be too hard to try another API call there. Of course, it'd be nice to somehow do this for a group of IDs, since it's likely that a user will request multiple IDs from the same ArrayExpress experiment.
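A rough sketch of what the grouped fallback could look like: query ENA for every id first, collect the ones that come back as 204, and hand that whole batch to a second lookup in one go. Everything here is hypothetical, including `partition_by_status` and the `fetch` callable (assumed to return a `(status, body)` tuple). It is not how `sra_ids_to_runinfo.py` is structured, and the ArrayExpress lookup itself is deliberately left out.

```python
NO_CONTENT = 204  # HTTP status that the script already special-cases


def partition_by_status(run_ids, fetch):
    """Split ids into those with ENA metadata and those that returned 204.

    `fetch` is an assumed callable taking a run id and returning (status, body).
    """
    found = {}
    missing = []
    for run_id in run_ids:
        status, body = fetch(run_id)
        if status == 200:
            found[run_id] = body
        elif status == NO_CONTENT:
            # Candidate for a second lookup, e.g. against ArrayExpress.
            missing.append(run_id)
        else:
            raise RuntimeError(f"Unexpected status {status} for id {run_id}")
    return found, missing


def fake_fetch(run_id):
    """Offline stand-in for the real HTTP call, for illustration only."""
    if run_id == "ERR9539214":
        return 204, ""
    return 200, f"run_accession\n{run_id}\n"


found, missing = partition_by_status(["ERR0000001", "ERR9539214"], fake_fetch)
print(sorted(found), missing)
```

Collecting the 204 ids before retrying means ids from the same ArrayExpress experiment could share a single fallback request rather than one request each.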

@Midnighter
Contributor

I just tried this again and now

curl -X GET "https://www.ebi.ac.uk/ena/portal/api/filereport?accession=ERR9539214&fields=run_accession%2Cexperiment_accession&format=tsv&result=read_run"

succeeds. 🤷🏼‍♂️

@drpatelh drpatelh added this to the 1.10 milestone Apr 25, 2023
@ejseqera
Contributor

ejseqera commented May 5, 2023

Confirmed that this has been resolved with the v2.0 update of the API and we're able to pull the metadata successfully with sra_ids_to_runinfo.py.

(1) Testing the endpoint:

(base) ➜  ~ curl -X GET "https://www.ebi.ac.uk/ena/portal/api/filereport?accession=ERR9539214&fields=run_accession%2Cexperiment_accession&format=tsv&result=read_run"
run_accession	experiment_accession
ERR9539214	ERX9080024

(2) Running the pipeline test with ERR9539214 on dev with merged changes from API update:
[Screenshot of the pipeline test run, taken 2023-05-05 at 9:28 AM]
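For completeness, the TSV response from step (1) can be turned into records with the standard library. This is only an illustrative sketch: `parse_filereport_tsv` is a hypothetical name, and the sample text is the two-line response shown above.

```python
import csv
import io


def parse_filereport_tsv(text):
    """Parse a tab-separated filereport response into a list of dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [dict(row) for row in reader]


# Response body copied from the curl output in step (1).
response = "run_accession\texperiment_accession\nERR9539214\tERX9080024\n"
rows = parse_filereport_tsv(response)
print(rows[0]["experiment_accession"])
```

An empty body, as returned with a 204, parses to an empty list, so the caller can tell "no metadata" apart from a malformed response.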

@ejseqera ejseqera closed this as completed May 5, 2023
drpatelh added a commit to drpatelh/nf-core-fetchngs that referenced this issue May 6, 2023