some samples fail with --force_sratools_download due to changes in prefetch results #98

dmalzl · 2022-06-24T15:25:38Z

Description of the bug

I have been handling my data with fetchngs for over a month now and am quite satisfied with the results. However, I recently ran into difficulties when forcing data download via sratools. Everything worked fine in May, but I had to reprocess and therefore re-download some of the samples, and the pipeline now fails with an error when fetching the data with prefetch. I vaguely remember reading that the SRA changed its data storage policies (or something similar) around the beginning of June. The error and the timing (the same command that worked in May fails in June) hint at a connection to this change.

The .command.log file of the affected jobs shows the core of the issue: prefetch does not download the usual *.sra file but something called *.sralite. The subsequent vdb-validate command does not find it, because prefetch puts the file directly in the temp directory rather than in the ./temp_dir/SRAsomething directory that vdb-validate expects. This in turn causes the pipeline to fail.

I haven't yet checked whether vdb-validate also accepts the *.sralite file. If it does, the problem could be resolved by checking whether prefetch generated the expected folder or the *.sralite file and handling each case accordingly. Downloading the failing samples via the ENA FTP is still possible, so a temporary fix is to download everything I can with sratools and fetch the rest from the FTP.
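The check suggested above (look for either the expected folder or the *.sralite file) could be sketched roughly as follows. This is only an illustration, not code from fetchngs: the function name find_prefetch_output is hypothetical, and the candidate locations are assumptions based on the layout described in this report. Whether vdb-validate accepts the resulting path is untested here.

```python
from pathlib import Path
from typing import Optional


def find_prefetch_output(workdir: Path, accession: str) -> Optional[Path]:
    """Return the file prefetch left behind for `accession`, or None.

    Hypothetical helper: the candidate layouts are assumptions, not
    documented prefetch behaviour.
    """
    candidates = [
        # Layout that vdb-validate expected according to the report.
        workdir / accession / f"{accession}.sra",
        # Assumed variant: lite file inside the accession directory.
        workdir / accession / f"{accession}.sralite",
        # Layout observed in the report: lite file directly in the work dir.
        workdir / f"{accession}.sralite",
    ]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None
```

A caller could then pass the returned path on to the next step, or treat None as a failed download and switch to another source.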

Command used and terminal output

nextflow run nf-core/fetchngs ... --force_sratools_download
2022-06-24T14:44:39 prefetch.2.11.0 int: self NULL while reading file within network system module - cannot Make Compute Environment Token

2022-06-24T14:44:40 prefetch.2.11.0: 1) Downloading 'ERR1141695.sralite'...
2022-06-24T14:44:40 prefetch.2.11.0:  Downloading via HTTPS...
|-------------------------------------------------- 100%
2022-06-24T14:45:04 prefetch.2.11.0:  HTTPS download succeed
2022-06-24T14:45:05 prefetch.2.11.0:  'ERR1141695.sralite' is valid
2022-06-24T14:45:05 prefetch.2.11.0: 1) 'ERR1141695.sralite' was downloaded successfully
2022-06-24T14:45:06 vdb-validate.2.11.0 info: 'ERR1141695' could not be found

Relevant files

No response

System information

No response


jsalignon · 2022-06-28T15:29:49Z

Hi. I can confirm that fetchngs is currently not usable for SRA IDs, with neither version 1.6 nor 1.4.
In my case all my inputs are in SRA format, so I guess I can't use the pipeline unless I convert them.
Any idea of the timescale for fixing this bug?

dmalzl · 2022-06-28T15:53:16Z

For me most of them still work; only some are not available. Did you try using the --force_sratools_download option?

jsalignon · 2022-06-28T16:00:10Z

Hi. Yes, and it doesn't work. I double-checked, and in fact all my IDs are GEO sample IDs (GSM).

dmalzl · 2022-06-28T18:12:37Z

Although I think fetchngs is designed to link these back to SRA accessions as well, I would suggest using the SRA Run Selector or the Entrez API to link the GSM IDs to SRA accessions. Maybe that solves the problem.
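As a rough illustration of the Entrez route, the sketch below only builds the request URL for an E-utilities esearch call against the sra database. The helper name build_esearch_url and the placeholder ID GSM0000000 are made up for this example, and it is an assumption that searching the sra database with a GSM term returns the linked records for a given sample.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(gsm_id: str) -> str:
    """Build an esearch URL that looks up `gsm_id` in the SRA database."""
    params = {"db": "sra", "term": gsm_id, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


# GSM0000000 is a placeholder, not a real sample ID.
print(build_esearch_url("GSM0000000"))
```

The response lists Entrez UIDs rather than run accessions, so a second request (for example efetch) would still be needed to obtain the SRR identifiers.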

jsalignon · 2022-06-29T13:48:12Z

Thanks for the suggestion. I managed to convert the IDs using the Entrez API as you suggested and ran the command again, both with and without the --force_sratools_download option, but I get the same error message as before.

dmalzl · 2022-06-29T13:51:38Z

Could you post the exact error you get and for which process this occurs?

jsalignon · 2022-06-29T13:52:53Z

Yes, here is the error:

Error executing process > 'NFCORE_FETCHNGS:SRA:SRA_IDS_TO_RUNINFO (SRR5000684)'

Caused by:
  Process `NFCORE_FETCHNGS:SRA:SRA_IDS_TO_RUNINFO (SRR5000684)` terminated with an error exit status (1)

Command executed:

  echo SRR5000684 > id.txt
  sra_ids_to_runinfo.py \
      id.txt \
      SRR5000684.runinfo.tsv \


  cat <<-END_VERSIONS > versions.yml
  "NFCORE_FETCHNGS:SRA:SRA_IDS_TO_RUNINFO":
      python: $(python --version | sed 's/Python //g')
  END_VERSIONS

Command exit status:
  1

Command output:
  (empty)

Command error:
  [ERROR] The server couldn't fulfill the request.
  [ERROR] Status: 400 Bad Request. Both list of IDs and query_key are empty

dmalzl · 2022-06-29T14:01:56Z

Okay, this is the same error I get in connection with this issue. To get around it, I locally changed the bin/sra_ids_to_runinfo.py file so that the ENA FTP is preferred over SRA. This can be done simply by changing line 229 from return cls._id_to_srx(identifier) to return cls._id_to_erx(identifier).
However, be aware that this only works for SRR identifiers and not for GSM.
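The effect of that one-line change can be pictured with the simplified stand-in below. It is not the actual contents of bin/sra_ids_to_runinfo.py: the class ResolverSketch, the prefer_ena switch, and the stub return values are hypothetical, and only the method names _id_to_srx and _id_to_erx are taken from the comment above. Treating ERR and DRR accessions like SRR is also an assumption.

```python
import re

# Run accessions; the comment above only confirms SRR, the E/D prefixes
# are assumed to behave the same way.
RUN_ACCESSION = re.compile(r"^[SED]RR\d+$")


class ResolverSketch:
    """Hypothetical stand-in for the resolver in sra_ids_to_runinfo.py."""

    prefer_ena = True  # hypothetical switch, not a real pipeline option

    @classmethod
    def _id_to_srx(cls, identifier: str):
        # Stub for the SRA-based lookup.
        return ("sra", identifier)

    @classmethod
    def _id_to_erx(cls, identifier: str):
        # Stub for the ENA-based lookup.
        return ("ena", identifier)

    @classmethod
    def resolve(cls, identifier: str):
        # Only run accessions take the ENA route; GSM and other IDs
        # still go through the SRA lookup.
        if cls.prefer_ena and RUN_ACCESSION.match(identifier):
            return cls._id_to_erx(identifier)
        return cls._id_to_srx(identifier)
```

Guarding on the accession pattern keeps GSM identifiers on the original path instead of sending them to a lookup that cannot handle them.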

jsalignon · 2022-06-29T14:16:56Z

It works now. Thanks a lot!

drpatelh · 2022-07-01T07:24:52Z

Hi @dmalzl! Thanks for looking into a fix. Yes, the NCBI changed their APIs yet again with a breaking change. Given that this pipeline and other tools make assumptions about the API calls, unfortunately the only thing we can do is patch on the fly...

Have you by any chance found a backwards compatible fix? If so, we can do a patch release straight away.

We have plans to use and contribute to ffq to fetch the metadata in the future but that won't be immune to these sorts of changes either. It will however standardise the way we fetch the metadata.

dmalzl · 2022-07-01T07:34:47Z

Hi @drpatelh,

Unfortunately, the only things I could come up with are using the ENA FTP by default or falling back on the FTP when prefetch fails. What I did was first ignore all prefetch failures caused by it only downloading the *.sralite file, and then download the failed samples via the ENA FTP. That was a very quick fix for me. I have also thought about feeding the *.sralite file to fasterq-dump to see what happens. If it shows the expected behaviour, one might be able to just check for either outcome of prefetch and act accordingly, but I haven't tried that yet.

Such API changes are so annoying and I know this is out of your hands. Just wanted to point it out so that other users know what's going on.

Thanks for looking into it though
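The "try prefetch first, then fall back to the ENA FTP" approach described in this comment can be sketched generically as below. The names DownloadError and download_with_fallback are hypothetical, and the two downloaders are passed in as plain callables, so nothing here calls sratools or the ENA directly.

```python
from typing import Callable, Dict, Iterable


class DownloadError(Exception):
    """Raised by a downloader when an accession could not be fetched."""


def download_with_fallback(
    accessions: Iterable[str],
    primary: Callable[[str], None],
    fallback: Callable[[str], None],
) -> Dict[str, str]:
    """Try `primary` for each accession and use `fallback` when it fails.

    Returns a mapping from accession to the route that succeeded.
    Errors raised by `fallback` are not caught and propagate to the caller.
    """
    routes = {}
    for accession in accessions:
        try:
            primary(accession)
            routes[accession] = "primary"
        except DownloadError:
            fallback(accession)
            routes[accession] = "fallback"
    return routes
```

In practice the primary callable would wrap the prefetch step and raise DownloadError when only a *.sralite file appears, while the fallback would wrap the FTP download.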

Jokendo-collab · 2022-08-14T12:39:01Z

return cls._id_to_erx(identifier)

This also worked for me.

drpatelh · 2022-11-04T10:36:56Z

This issue should mostly be solved I think after the API was fixed. Feel free to re-open if the problem persists.

dmalzl added the bug label Jun 24, 2022

dmalzl mentioned this issue Jun 24, 2022

SRA_IDS_TO_RUNINFO fails due to bad request #99

Closed

drpatelh mentioned this issue Jul 1, 2022

Test fails when running with singularity #101

Closed

drpatelh added this to the 1.8 milestone Nov 4, 2022

drpatelh closed this as completed Nov 4, 2022

ollieeknight mentioned this issue Apr 11, 2023

sra_ids_to_runinfo.py: command not found #142

Closed
