.faa.gz files not being downloaded for bacteria #136

BhushanDhamale · 2020-10-07T11:21:29Z

Hello.
For the past week, I have been attempting to download protein fasta files for all bacteria using the following command:
ncbi-genome-download -F 'protein-fasta' -p 5 -r 3 -v 'bacteria'
This creates the directory structure ./refseq/bacteria/GCF*, but each directory contains only the MD5SUMS file.
Strangely enough, the same command run for other groups (archaea, fungi, plants, etc.) runs just fine and downloads the desired .faa.gz files.
What am I missing here?
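To see how many assembly directories are affected, a small script can list those whose only file is MD5SUMS. This is a minimal sketch, not part of ncbi-genome-download; it assumes the ./refseq/bacteria/GCF* layout described above, and the function name `dirs_with_only_md5sums` is made up for illustration.

```python
from pathlib import Path


def dirs_with_only_md5sums(root):
    """Return assembly directories under root whose only entry is MD5SUMS."""
    stalled = []
    for entry in sorted(Path(root).iterdir()):
        if not entry.is_dir():
            continue
        names = {child.name for child in entry.iterdir()}
        if names == {"MD5SUMS"}:
            stalled.append(entry)
    return stalled


if __name__ == "__main__":
    # Assumed output location, taken from the report above.
    root = Path("refseq/bacteria")
    if root.is_dir():
        for path in dirs_with_only_md5sums(root):
            print(path)
```

Running it again a few minutes later shows whether the list is shrinking, which would suggest the download is still in progress rather than stuck.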


kblin · 2020-10-07T12:21:40Z

The main thing that comes to mind is that right now there are 127 plant assemblies, 333 fungal assemblies, 1,157 archaeal assemblies, and 200,357 bacterial assemblies in RefSeq. So there are massively more bacterial assemblies.
The way ncbi-genome-download is built currently, we do keep some info on the downloads in memory while running. Because ncbi-genome-download wasn't really designed as a tool to just download all the things, I wasn't super careful with runtime memory usage, so chances are that you're just running out of memory while downloading all bacteria.

Unaimend · 2023-05-31T18:31:01Z

I have the same problem on my side.
Executing this command
ncbi-genome-download bacteria --section refseq -l complete

results in directories that contain only the MD5SUMS file.

Unaimend · 2023-05-31T19:22:19Z

ncbi-genome-download --formats fasta,assembly-stats --assembly-levels complete bacteria results in the same problem.

kblin · 2023-06-01T04:26:28Z

Hm, I don't think you should be running out of memory on a restricted download set like this. So much for that theory. Could you run one of your download commands with the added --debug parameter and paste the last 10 lines or so of that run in here?

Unaimend · 2023-06-01T19:23:33Z

log.log

Executed command
ncbi-genome-download --formats fasta,assembly-stats --assembly-levels complete bacteria --debug &> log.log

@kblin Is this error reproducible on your side? If not, I can try to dig into the Python code myself; it looks pretty clean.

I can't really post the last lines because the run takes quite a while, i.e. I have no idea how long my request would take.

Unaimend · 2023-06-01T19:27:39Z

ncbi-genome-download --genera "Vibrio fortis" --formats fasta,assembly-stats --assembly-levels complete bacteria --debug &> log.log actually downloads "everything"
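Since the restricted query above completes, one possible workaround is to split a large download into per-genus runs. The sketch below only builds the command lines and does not execute them. It uses only flags that appear in this thread; the helper name `build_commands` and the idea of batching by genus are assumptions, not something the tool documents or the maintainer has recommended here.

```python
import shlex


def build_commands(genera, formats="fasta,assembly-stats", levels="complete"):
    """Build one ncbi-genome-download argument list per genus."""
    commands = []
    for genus in genera:
        commands.append([
            "ncbi-genome-download",
            "--genera", genus,
            "--formats", formats,
            "--assembly-levels", levels,
            "bacteria",
        ])
    return commands


if __name__ == "__main__":
    # "Vibrio fortis" is the example from the comment above.
    for cmd in build_commands(["Vibrio fortis"]):
        print(shlex.join(cmd))
```

Each argument list can be passed to `subprocess.run` as is. Whether per-genus runs are actually faster overall is not established in this thread.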

kblin · 2023-06-02T04:42:49Z

The full set of complete bacteria is a big download. I've just started a download with 12 parallel server connections, and extrapolating from the speed at which the MD5SUMS files are coming in, it'll take around 20 minutes just to get those, before I can even start downloading the sequence files.

I just noticed that I haven't yet released the progress bar changes that tell you about the MD5SUMS download progress. I'll see what I can do about that.
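Because the MD5SUMS files arrive before the sequence files, a directory that holds only MD5SUMS may simply be waiting its turn. A rough way to check a single assembly directory is to compare its contents against the checksum file. The sketch below assumes each MD5SUMS line has the form `<md5 hex digest>  ./<file name>`; that format is an assumption about the checksum file, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def parse_md5sums(text):
    """Map file name -> expected MD5 hex digest from checksum-file text."""
    expected = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        digest, name = parts
        name = name.strip()
        if name.startswith("./"):
            name = name[2:]
        expected[name] = digest
    return expected


def check_directory(directory):
    """Label each file listed in MD5SUMS as 'ok', 'mismatch' or 'missing'."""
    directory = Path(directory)
    expected = parse_md5sums((directory / "MD5SUMS").read_text())
    report = {}
    for name, digest in expected.items():
        path = directory / name
        if not path.is_file():
            report[name] = "missing"
            continue
        actual = hashlib.md5(path.read_bytes()).hexdigest()
        report[name] = "ok" if actual == digest else "mismatch"
    return report
```

A file reported as "missing" is not necessarily an error: the checksum file may list formats that were never requested, so only the requested formats are worth checking.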

Unaimend · 2023-06-02T08:07:09Z

Ahh ok, so it first downloads all the MD5SUMS files? I did not know that. Is this documented somewhere? That might be the problem then.
Is it acceptable to just make that many requests? I wrote my own shitty version of a RefSeq downloader before I found your program (I just used wget) and ran into the problem that after a while my wget calls just stopped downloading stuff. I thought this was some kind of soft blocking because I made so many requests.

kblin added the bug label Oct 7, 2020
