.faa.gz files not being downloaded for bacteria #136

Open
BhushanDhamale opened this issue Oct 7, 2020 · 8 comments

@BhushanDhamale

Hello.
For the past week, I have been attempting to download protein fasta files for all bacteria using the following command:
ncbi-genome-download -F 'protein-fasta' -p 5 -r 3 -v 'bacteria'
This creates the directory structure ./refseq/bacteria/GCF*, but each directory contains only the MD5SUMS file.
Strangely enough, the same command run for other groups (archaea, fungi, plants, etc.) runs just fine and downloads the desired .faa.gz files.
What am I missing here?

@kblin
Owner

kblin commented Oct 7, 2020

The main thing that comes to mind is that right now there are 127 plant assemblies, 333 fungal assemblies, 1,157 archaeal assemblies, and 200,357 bacterial assemblies in RefSeq. So there are massively more bacterial assemblies.
The way ncbi-genome-download is built currently, we do keep some info on the downloads in memory while running. Because ncbi-genome-download wasn't really designed as a tool to just download all the things, I wasn't super careful with runtime memory usage, so chances are that you're just running out of memory while downloading all bacteria.

@kblin kblin added the bug label Oct 7, 2020
@Unaimend

I have the same problem on my side
Executing this command
ncbi-genome-download bacteria --section refseq -l complete

results in directories that contain only the MD5SUMS file
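To see how many assembly directories are in this state, a short script can walk the output tree and list every directory whose only entry is MD5SUMS. This is a minimal sketch, not part of ncbi-genome-download; the function name `find_incomplete_dirs` is made up for illustration, and it assumes the checksum file is named MD5SUMS as described in this thread.

```python
from pathlib import Path


def find_incomplete_dirs(root):
    """Return directories under root whose only entry is an MD5SUMS file."""
    incomplete = []
    for md5_file in sorted(Path(root).rglob("MD5SUMS")):
        directory = md5_file.parent
        # Everything in the directory apart from the checksum file itself.
        others = [p for p in directory.iterdir() if p.name != "MD5SUMS"]
        if not others:
            incomplete.append(directory)
    return incomplete
```

Calling `find_incomplete_dirs("refseq/bacteria")` from the download directory returns the affected paths, and `len(...)` of the result gives a quick count to compare between runs.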


@Unaimend

ncbi-genome-download --formats fasta,assembly-stats --assembly-levels complete bacteria results in the same problem

@kblin
Owner

kblin commented Jun 1, 2023

Hm, I don't think you should be running out of memory on a restricted download set like this. So much for that theory. Could you run one of your download commands with the added --debug parameter and paste the last 10 lines or so of that run in here?

@Unaimend

Unaimend commented Jun 1, 2023

log.log

Executed command
ncbi-genome-download --formats fasta,assembly-stats --assembly-levels complete bacteria --debug &> log.log

@kblin Is this error reproducible on your side? If not, I can try to dig into the Python code myself; it looks pretty clean.

I can't really post the last lines because the run takes quite a while, i.e. I have no idea how long my request would take.

@Unaimend

Unaimend commented Jun 1, 2023

ncbi-genome-download --genera "Vibrio fortis" --formats fasta,assembly-stats --assembly-levels complete bacteria --debug &> log.log actually downloads "everything"

@kblin
Owner

kblin commented Jun 2, 2023

The full set of complete bacteria is a big download. I've just started a download with 12 parallel server connections, and extrapolating from the speed at which the MD5SUMS files are coming in, it'll take around 20 minutes just to get those, before I can even get started on downloading the sequence files.
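The extrapolation is simple proportional arithmetic: measure how many MD5SUMS files arrived in a short window, then scale up to the full set. A minimal sketch follows; the function name and all three input numbers are invented for illustration and were not measured in this thread.

```python
def estimate_minutes(total_files, files_done, seconds_elapsed):
    """Extrapolate the total run time in minutes from progress so far."""
    if files_done <= 0 or seconds_elapsed <= 0:
        raise ValueError("need some progress to extrapolate from")
    rate = files_done / seconds_elapsed  # files per second
    return total_files / rate / 60


# Made-up numbers: 600 of 36,000 MD5SUMS files fetched in the first 20 seconds.
print(estimate_minutes(36_000, 600, 20))  # prints 20.0
```

The estimate assumes the rate stays constant, which may not hold if the server throttles connections over time.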

I just noticed that I didn't release the progress bar changes that tell you about the MD5SUMS download progress yet, I'll see what I can do about that.

@Unaimend

Unaimend commented Jun 2, 2023

Ahh ok, so I will first download all MD5SUMS? I did not know that. Is this documented somewhere? That might be the problem then.
Is it acceptable to just do that many requests? I wrote my own shitty version of a refseq downloader before I found your program (I just used wget) and ran into the problem that after a while my wget calls just stopped downloading stuff. I thought this was some kind of soft blocking because I made so many requests.
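For context on why the checksum files come first: once an MD5SUMS file is on disk, it can be used to tell which sequence files in that directory are missing or corrupt. The sketch below is not ncbi-genome-download's actual code. It assumes MD5SUMS uses the usual md5sum layout (hex digest, whitespace, then a file name that may start with "./"), and the function names are made up for illustration.

```python
import hashlib
from pathlib import Path


def parse_md5sums(text):
    """Map file names to expected MD5 digests from md5sum-style lines."""
    expected = {}
    for line in text.splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        digest, name = parts
        # Names may carry a "*" (binary-mode marker) and a "./" prefix.
        name = name.strip().lstrip("*")
        if name.startswith("./"):
            name = name[2:]
        expected[name] = digest.lower()
    return expected


def verify_directory(directory):
    """Return names listed in MD5SUMS that are missing or fail the check."""
    directory = Path(directory)
    expected = parse_md5sums((directory / "MD5SUMS").read_text())
    bad = []
    for name, digest in expected.items():
        path = directory / name
        if not path.is_file():
            bad.append(name)  # listed but never downloaded
            continue
        actual = hashlib.md5(path.read_bytes()).hexdigest()
        if actual != digest:
            bad.append(name)  # present but content differs
    return bad
```

For a directory that holds only MD5SUMS, `verify_directory` would return every listed file name, which matches the symptom reported at the top of this issue.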
