The number of GTDB complete genomes does not match the official website because some records were deleted from NCBI #94

Open
shenwei356 opened this issue Feb 1, 2024 · 5 comments

shenwei356 commented Feb 1, 2024

GTDB complete genomes

time genome_updater.sh -d "refseq,genbank" -g "archaea,bacteria" -f "genomic.fna.gz" -o "GTDB_complete" -M "gtdb" -t 12 -m -L curl -i

cd GTDB_complete/2024-01-30_19-34-40/
wc -l assembly_summary.txt 
402538 assembly_summary.txt

That is 402,538 genomes, fewer than the 402,709 listed on https://gtdb.ecogenomic.org/.

Let's check which ones are missing.

# download metadata
wget https://data.gtdb.ecogenomic.org/releases/release214/214.1/ar53_metadata_r214.tsv.gz
wget https://data.gtdb.ecogenomic.org/releases/release214/214.1/bac120_metadata_r214.tsv.gz

# concatenate metadata
(zcat ar53_metadata_r214.tsv.gz; zcat bac120_metadata_r214.tsv.gz | sed 1d) > metadata.tsv

# check missing
cd GTDB_complete/2024-01-30_19-34-40/
csvtk replace -t -p '^...' ../../metadata.tsv |  csvtk grep -t -v -P <(cut -f 1 assembly_summary.txt)  > missing.tsv

csvtk dim -t missing.tsv
file         num_cols  num_rows
missing.tsv       110       171

So, 171 genomes are missing. Here's the full list: missing.txt.
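
For readers without csvtk, the same anti-join can be sketched with awk alone. This is a minimal illustration on made-up accessions, assuming (as in the command above) that the metadata accessions carry a 3-character prefix such as RS_ or GB_ and that the assembly summary holds the bare accession in column 1. The toy file names are hypothetical.

```shell
# Toy inputs; every accession and organism name below is made up.
tmp=$(mktemp -d)
# GTDB-style metadata: one header row, accessions prefixed with RS_ or GB_.
printf 'accession\tncbi_organism_name\nRS_GCF_000000001.1\tsp. A\nGB_GCA_000000002.1\tsp. B\nRS_GCF_000000003.1\tsp. C\n' > "$tmp/metadata_toy.tsv"
# Assembly summary: bare accession in column 1, no header.
printf 'GCF_000000001.1\tx\nGCF_000000003.1\tx\n' > "$tmp/assembly_summary_toy.txt"

# Pass 1 (first file): remember every downloaded accession.
# Pass 2 (second file): keep the header, strip the 3-character prefix,
# and print only rows whose accession was not seen in pass 1.
missing=$(awk -F'\t' -v OFS='\t' '
  NR == FNR { seen[$1] = 1; next }
  FNR == 1  { print; next }
  { $1 = substr($1, 4); if (!($1 in seen)) print }
' "$tmp/assembly_summary_toy.txt" "$tmp/metadata_toy.tsv")
printf '%s\n' "$missing"
rm -r "$tmp"
```

On the real data, the two toy paths would be replaced by assembly_summary.txt and ../../metadata.tsv. The count also agrees with the two totals: 402,709 - 402,538 = 171.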

$ csvtk cut -t -f accession missing.tsv | head -n 5
accession
GCA_024650005.1
GCF_023371115.1
GCF_024450885.1
GCF_024654755.1

I manually searched for them on NCBI (with and without the version suffix .1) and found no records, so they appear to have been removed.

Which organisms are they?

$ csvtk freq -t -f ncbi_organism_name missing.tsv -nr | csvtk pretty -t
ncbi_organism_name                                     frequency
----------------------------------------------------   ---------
Escherichia coli                                       78
Acinetobacter baumannii                                38
Klebsiella pneumoniae                                  13
Pseudomonas aeruginosa                                 10
Staphylococcus xylosus                                 5
Acinetobacter nosocomialis                             2
Proteus mirabilis                                      2
Candidatus Bathyarchaeota archaeon                     1
Chromohalobacter sp. TMW 2.2303                        1
Enterobacter cloacae                                   1
Enterobacter hormaechei                                1
Enterobacter roggenkampii                              1
Enterobacter sp. ODB01                                 1
Fusobacterium sp. Marseille-Q7035                      1
Klebsiella oxytoca                                     1
Klebsiella quasipneumoniae                             1
Klebsiella sp. VKM B-1436                              1
Limnospira indica PCC 8005                             1
Methylobacillus methanolivorans                        1
Oscillospiraceae bacterium BX18                        1
Pseudoalteromonas rhizosphaerae                        1
Pseudomonas graminis                                   1
Pseudomonas qingdaonensis                              1
Salmonella enterica                                    1
Salmonella enterica subsp. enterica serovar Kedougou   1
Salmonella enterica subsp. enterica serovar Stanley    1
Streptomyces sp. GBA 94-10                             1
Stutzerimonas frequens                                 1
Tunicatimonas sp. TK19036                              1
Xylella fastidiosa subsp. multiplex                    1
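
The same tally can be produced without csvtk. A minimal sketch with awk and sort, run on made-up rows; the column is located by its header name, so no column index is assumed. The toy file name is hypothetical.

```shell
# Toy stand-in for missing.tsv; the accessions are made up.
tmp=$(mktemp -d)
printf 'accession\tncbi_organism_name\n' > "$tmp/missing_toy.tsv"
printf 'X1\tEscherichia coli\nX2\tAcinetobacter baumannii\nX3\tEscherichia coli\n' >> "$tmp/missing_toy.tsv"
printf 'X4\tKlebsiella pneumoniae\nX5\tEscherichia coli\nX6\tAcinetobacter baumannii\n' >> "$tmp/missing_toy.tsv"

# Find the column by header name, count each value, then sort by
# descending count (ties broken by name).
tab=$(printf '\t')
freq=$(awk -F'\t' -v col='ncbi_organism_name' '
  NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) c = i; next }
  { count[$c]++ }
  END { for (name in count) print name "\t" count[name] }
' "$tmp/missing_toy.tsv" | sort -t "$tab" -k2,2nr -k1,1)
printf '%s\n' "$freq"
rm -r "$tmp"
```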

pirovc commented Feb 5, 2024

Unfortunately, that is the case: some records are removed from NCBI for good. Thanks for the investigation. I will add this information to the README and link this issue.

shenwei356 commented

@donovan.parks' reply:

NCBI generally does not (never?) delete data, but data records can become suppressed. For example, GCA_024650005.1 has been suppressed, but you can still find information about this record and how to download the data at: https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_024650005.1/

pirovc commented Feb 20, 2024

Thanks for the investigation. Did you ever manage to find sequences for any of those "suppressed" entries?

The metadata still exists on the FTP server and the NCBI website, but you never get to the sequence. Even if you go to the WGS entry, it's not there (I tried this and this). There are many different reasons for suppression, and in some cases it may still be possible to retrieve the data.

In the case of a changed URL, as you mentioned above, it eventually gets updated in the main assembly_summary_refseq.txt.

Note that genome_updater already scrapes the "suppressed" or older entries from assembly_summary_refseq_historical.txt, but that file only holds metadata. If the sequence is not on the FTP server, genome_updater skips it.
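
As an illustration of what the historical file offers, suppressed accessions can be listed from it with awk. This sketch assumes the layout NCBI documents for assembly summary files (tab-separated, comment lines starting with #, assembly_accession in column 1, version_status in column 11). It runs on made-up rows, and it is not necessarily how genome_updater itself reads the file. The helper make_row and the toy file name are hypothetical.

```shell
# Build one tab-separated row with the accession in column 1 and the
# status in column 11; columns 2-10 are filled with "." placeholders.
make_row() {
  awk -v acc="$1" -v status="$2" 'BEGIN {
    row = acc
    for (i = 2; i <= 10; i++) row = row "\t."
    print row "\t" status
  }'
}

# Toy stand-in for a historical assembly summary; accessions are made up.
tmp=$(mktemp -d)
{
  printf '# toy header line\n'
  make_row GCA_000000001.1 suppressed
  make_row GCA_000000002.1 replaced
  make_row GCA_000000003.1 suppressed
} > "$tmp/historical_toy.txt"

# Skip comment lines and print accessions whose status is "suppressed".
suppressed=$(awk -F'\t' '!/^#/ && $11 == "suppressed" { print $1 }' "$tmp/historical_toy.txt")
printf '%s\n' "$suppressed"
rm -r "$tmp"
```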

shenwei356 commented

I just ignored these ungettable records. 😀

genome_updater is already good enough for downloading genbank+refseq assemblies. I once tried to generate URLs from the assembly_summary file myself, but it turned out to need more effort.

A few days ago, I downloaded all 2 million prokaryotic genomes, and only 3 failed to download.
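
For reference, the simplest case of generating a download URL from an assembly summary looks roughly like the sketch below. It assumes the usual NCBI convention that the ftp_path field (column 20 in the NCBI assembly summary layout) ends in a directory named accession_assemblyname and that the sequence file inside it repeats that name. The function name genomic_url is hypothetical. The sketch covers none of the extra effort mentioned above, such as entries with no usable path, suppressed records, or checksum verification.

```shell
# Build the genomic FASTA URL from an ftp_path value.
# Returns non-zero when the value is not a URL (for example "na").
genomic_url() {
  case "$1" in
    http*|ftp*) printf '%s/%s_genomic.fna.gz\n' "$1" "${1##*/}" ;;
    *) return 1 ;;
  esac
}

# Example ftp_path value (E. coli K-12 MG1655 reference assembly).
ftp_path='https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2'
url=$(genomic_url "$ftp_path")
printf '%s\n' "$url"
```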
