The number of GTDB complete genomes does not match the count on the official website because some records were deleted from NCBI #94
Comments
Additionally, old: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/882/255/GCF_002882255.1_FW507-14D01
Unfortunately, that is the case: some records are removed from NCBI permanently. Thanks for the investigation. I will add this information to the README and link this issue.
@donovan.parks' reply
Thanks for the investigation. Did you ever manage to find sequences for any of those "suppressed" entries? The metadata still exists on the FTP and on the NCBI website, but you never get to the sequence. Even if you go to the WGS entry, it's not there (I tried this and this). There are many different reasons for suppression, and in some cases it may still be possible to retrieve the data. In the case of a changed URL, as you mentioned above, it eventually gets updated in the main assembly_summary_refseq.txt. Note that genome_updater already scrapes the "suppressed" or older entries from assembly_summary_refseq_historical.txt, but that file only holds metadata. If the sequence is not on the FTP, genome_updater will skip it.
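For anyone who wants to see which entries are flagged as suppressed, here is a minimal sketch in Python. It is not how genome_updater does it. It assumes the summary file is tab-separated, that comment lines start with `#`, and that the last comment line names the columns `assembly_accession`, `version_status` and `ftp_path`. The function name `suppressed_entries` and the sample rows are made up for illustration.

```python
def suppressed_entries(summary_text):
    """Return (accession, ftp_path) pairs whose version_status is 'suppressed'.

    Assumes a tab-separated layout where the last '#' line before the data
    holds the column names (an assumption about the file, check your copy).
    """
    header = None
    hits = []
    for line in summary_text.splitlines():
        if not line.strip():
            continue
        if line.startswith("#"):
            # Keep overwriting: the last comment line is taken as the header.
            header = line.lstrip("#").strip().split("\t")
            continue
        if header is None:
            raise ValueError("no header line found before the data rows")
        record = dict(zip(header, line.split("\t")))
        if record.get("version_status") == "suppressed":
            hits.append((record["assembly_accession"], record.get("ftp_path", "")))
    return hits


# Toy input with invented accessions, not real NCBI records.
SAMPLE = "\n".join([
    "# toy data for illustration only",
    "#assembly_accession\tversion_status\tftp_path",
    "GCF_000000001.1\tlatest\thttps://example.org/GCF_000000001.1",
    "GCF_000000002.1\tsuppressed\tna",
    "GCF_000000003.1\treplaced\tna",
])

for accession, ftp_path in suppressed_entries(SAMPLE):
    print(accession, ftp_path)
```

A list like this only tells you that metadata exists for an entry. As noted above, it says nothing about whether the sequence can still be downloaded.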
I just ignored these ungettable records. 😀
A few days ago, I downloaded all 2 million prokaryotic genomes, and only 3 genomes failed to download.
GTDB complete genomes
Oh, 402,538 < 402,709 genomes! 402,709 is from https://gtdb.ecogenomic.org/.
Check it.
So, 171 genomes are missing. Here's the full list: missing.txt.
I manually searched for them (with and without the version suffix .1) on NCBI and found no records, so they have been removed.
Who are they?
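A minimal sketch of how a list of missing accessions could be derived, comparing both with and without the version suffix. It is an illustration rather than the exact commands used for this issue. The helper names `strip_version` and `missing_accessions` are hypothetical, and the accessions in the example are invented.

```python
import re

# NCBI assembly accessions look like GCF_002882255.1:
# a GCA/GCF prefix, nine digits, and an optional ".<version>" suffix.
ACCESSION_RE = re.compile(r"^(GC[AF]_\d{9})(?:\.\d+)?$")


def strip_version(accession):
    """Drop the version suffix, e.g. 'GCF_002882255.1' -> 'GCF_002882255'."""
    match = ACCESSION_RE.match(accession.strip())
    if match is None:
        raise ValueError(f"not an assembly accession: {accession!r}")
    return match.group(1)


def missing_accessions(expected, downloaded, ignore_version=False):
    """Return the sorted accessions in `expected` that are absent from `downloaded`."""
    key = strip_version if ignore_version else str.strip
    have = {key(a) for a in downloaded}
    return sorted(a for a in set(expected) if key(a) not in have)


# Invented accessions for illustration only.
expected = ["GCF_000000001.1", "GCF_000000002.1", "GCF_000000003.2"]
downloaded = ["GCF_000000001.1", "GCF_000000003.1"]

print(missing_accessions(expected, downloaded))
print(missing_accessions(expected, downloaded, ignore_version=True))

# The gap reported in this issue: 402,709 expected vs 402,538 downloaded.
print(402_709 - 402_538)
```

Comparing without the version helps separate genomes that only changed version from those that are gone entirely.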