# Get information about the available genomes given a list of taxa

----------
- Author: Guillem Ylla, Ph.D.
 - Post-Doctoral Fellow at Harvard University
- Date: March 2021

- Ran again for Cassandra MUSHIT - February 2022
-----------

Given a list of taxa (e.g. orders, superorders, species), retrieve information from NCBI (GenBank and RefSeq) about the genome assemblies available for each taxon, such as the number of assemblies per taxon and the list of species with an available genome.

In [2]:
import sys
import zipfile
import pandas as pd
from pprint import pprint
from datetime import datetime
from collections import defaultdict, Counter
import ncbi.datasets #pip install ncbi-datasets-pylib

- This script uses the ncbi.datasets Python library (more info: https://www.ncbi.nlm.nih.gov/datasets/).

In [3]:
## Set up api
api_instance = ncbi.datasets.GenomeApi(ncbi.datasets.ApiClient())

**Indicate the file to read here**
- The file must include the header "Taxons" and one taxon per row

In [4]:
# Indicate the file to read here (replace Taxons_to_check.txt with your file)
filetoread="Taxons_to_check.txt"

Check first 5 lines of the file:

In [5]:
taxonlist=pd.read_csv(filetoread)
taxonlist[0:5]

Unnamed: 0,Taxons
0,Odonata
1,Ephemeroptera
2,Plecoptera
3,Dermaptera
4,Orthoptera
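
The cells below look up `taxon['Taxons']`, so a file with a different header would only fail once the queries start. A minimal sketch of an upfront check follows; `load_taxa` is a hypothetical helper that is not part of the original notebook, and the in-memory file stands in for `Taxons_to_check.txt`.

```python
import io

import pandas as pd


def load_taxa(source, column="Taxons"):
    """Read the taxon list and return the non-empty taxon names as a list.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    table = pd.read_csv(source)
    if column not in table.columns:
        raise ValueError(
            f"expected a '{column}' header, found {list(table.columns)}"
        )
    # Drop missing values and strip surrounding whitespace from each name.
    names = table[column].dropna().astype(str).str.strip()
    return [name for name in names if name]


# In-memory stand-in for Taxons_to_check.txt (one taxon per row).
example_file = io.StringIO("Taxons\nOdonata\nEphemeroptera\n\nPlecoptera \n")
taxa = load_taxa(example_file)
print(taxa)
```

Passing a file path instead of the `StringIO` object should work the same way, since `pd.read_csv` accepts both.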


## Number of genome assemblies available per taxon
- Warning: a species may have more than one assembly.
- For the number of unique species with a genome assembly per taxon, see below.


In [16]:
for lindex, taxon in taxonlist.iterrows():
    tax_name = taxon['Taxons']
    # query NCBI
    genome_summary = api_instance.assembly_descriptors_by_taxon(taxon=tax_name, page_size=1000)
    print(f"- {tax_name}: assemblies; {genome_summary.total_count}")

- Odonata: assemblies; 7
- Ephemeroptera: assemblies; 3
- Plecoptera: assemblies; 9
- Dermaptera: assemblies; 1
- Orthoptera: assemblies; 13
- Zoraptera: assemblies; None
- Mantodea: assemblies; None
- Blattodea: assemblies; 10
- Mantophasmatodea: assemblies; None
- Grylloblattodea: assemblies; None
- Phasmatodea: assemblies; 16
- Embiodea: assemblies; None
- Psocoptera: assemblies; None
- Phthiraptera: assemblies; 6
- Thysanoptera: assemblies; 6
- Hemiptera: assemblies; 114
- Holometabola: assemblies; 2565
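
In the output above, taxa without any assembly report `None` rather than 0, which is awkward if the counts are later summed or tabulated. A minimal sketch of normalizing the value follows. `count_assemblies` is a hypothetical helper, and `fetch_count` is an assumed callable standing in for the `total_count` lookup so that the example runs without network access.

```python
def count_assemblies(taxa, fetch_count):
    """Return {taxon: number of assemblies}, mapping a missing count to 0.

    `fetch_count(taxon)` is expected to return an int or None.
    Hypothetical helper for illustration; not part of the original notebook.
    """
    counts = {}
    for name in taxa:
        total = fetch_count(name)
        counts[name] = 0 if total is None else int(total)
    return counts


# Offline stand-in built from three values in the output above.
reported = {"Odonata": 7, "Zoraptera": None, "Dermaptera": 1}
counts = count_assemblies(reported, reported.get)
for name, n in counts.items():
    print(f"- {name}: {n}")
```

With the live API, `fetch_count` could wrap the same call used in the cell above and return its `total_count` attribute.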


## List the species of the available genomes

In [38]:
for lindex, taxon in taxonlist.iterrows():
    tax_name =taxon['Taxons']
    columns = ['spp', 'acc', 'level',"numChrScaff","sub.date","org.rank","taxID","assembly.length"]
    lst = []
    # query NCBI
    genome_summary = api_instance.assembly_descriptors_by_taxon(taxon=tax_name, page_size=1000)
    print(f"{tax_name}; {genome_summary.total_count} assemblies")

    if genome_summary.total_count is None:
        print("No genomes")
    else:
        for assembly in map(lambda d: d.assembly, genome_summary.assemblies):
            lst.append([
                assembly.org.sci_name,
                assembly.assembly_accession,
                assembly.assembly_level,
                len(assembly.chromosomes),
                assembly.submission_date,
                assembly.org.rank,
                assembly.org.tax_id,
                assembly.seq_length])
        ## the API returns a page token when the results exceed page_size (1000)
        while genome_summary.next_page_token is not None:
            print("Page: ", genome_summary.next_page_token)
            genome_summary = api_instance.assembly_descriptors_by_taxon(taxon=tax_name, page_size=1000, page_token=genome_summary.next_page_token)
            for assembly in map(lambda d: d.assembly, genome_summary.assemblies):
                lst.append([
                    assembly.org.sci_name,
                    assembly.assembly_accession,
                    assembly.assembly_level,
                    len(assembly.chromosomes),
                    assembly.submission_date,
                    assembly.org.rank,
                    assembly.org.tax_id,
                    assembly.seq_length])

        genomesdf = pd.DataFrame(lst, columns=columns)
        display(genomesdf)
        # the Species_genomes/ directory must already exist
        genomesdf.to_csv("Species_genomes/"+tax_name+'_genomes.csv')


Odonata; 7 assemblies


Unnamed: 0,spp,acc,level,numChrScaff,sub.date,org.rank,taxID,assembly.length
0,Calopteryx splendens,GCA_002093875.1,Scaffold,1,2017-04-14,SPECIES,52612,1628352544
1,Ischnura elegans,GCF_921293095.1,Chromosome,15,2021-12-05,SPECIES,197161,1722738530
2,Ischnura elegans,GCA_921293095.1,Chromosome,16,2021-12-05,SPECIES,197161,1722738530
3,Ischnura elegans,GCA_921293115.1,Scaffold,1,2021-12-05,SPECIES,197161,1602242354
4,Ladona fulva,GCA_000376725.2,Scaffold,1,2017-10-31,SPECIES,123851,1157615891
5,Pantala flavescens,GCA_020796165.1,Chromosome,13,2021-11-08,SPECIES,185825,663273809
6,Rhinocypha anisoptera,GCA_011762765.1,Scaffold,1,2020-03-27,SPECIES,2487776,1871797026


Ephemeroptera; 3 assemblies


Unnamed: 0,spp,acc,level,numChrScaff,sub.date,org.rank,taxID,assembly.length
0,Baetis rhodani,GCA_001676355.1,Contig,1,2016-06-24,SPECIES,189839,174117300
1,Cloeon dipterum,GCA_902829235.1,Scaffold,1,2020-04-15,SPECIES,197152,180286980
2,Ephemera danica,GCA_000507165.2,Scaffold,1,2017-12-12,SPECIES,1049336,474347418


Plecoptera; 9 assemblies


Unnamed: 0,spp,acc,level,numChrScaff,sub.date,org.rank,taxID,assembly.length
0,Amphinemura sulcicollis,GCA_001676325.1,Contig,1,2016-06-24,SPECIES,143724,271924966
1,Brachyptera putata,GCA_907164805.1,Chromosome,14,2021-05-17,SPECIES,2065413,436497185
2,Brachyptera putata,GCA_907164815.1,Scaffold,1,2021-05-15,SPECIES,2065413,79437563
3,Isoperla grammatica,GCA_001676475.1,Contig,1,2016-06-24,SPECIES,552050,509522935
4,Lednia tumana,GCA_003287335.1,Scaffold,1,2018-07-03,SPECIES,466727,304502267
5,Nemoura dubitans,GCA_921293005.1,Chromosome,8,2021-12-05,SPECIES,2014036,321001405
6,Nemoura dubitans,GCA_921293025.1,Contig,1,2021-12-05,SPECIES,2014036,295227936
7,Nemurella pictetii,GCA_921293315.1,Chromosome,14,2021-12-05,SPECIES,143722,257036096
8,Nemurella pictetii,GCA_921293065.1,Contig,1,2021-12-05,SPECIES,143722,219787140


Dermaptera; 1 assemblies


Unnamed: 0,spp,acc,level,numChrScaff,sub.date,org.rank,taxID,assembly.length
0,Anisolabis maritima,GCA_010014785.1,Scaffold,1,2020-01-30,SPECIES,62749,641675028


Orthoptera; 13 assemblies


Unnamed: 0,spp,acc,level,numChrScaff,sub.date,org.rank,taxID,assembly.length
0,Acheta domesticus,GCA_014858955.1,Contig,1,2020-10-07,SPECIES,6997,929173017
1,Apteronemobius asahinai,GCA_019974035.1,Scaffold,1,2021-08-23,SPECIES,746126,1676217857
2,Gryllus bimaculatus,GCA_017312745.1,Scaffold,1,2021-02-12,SPECIES,6999,1658007496
3,Laupala kohalensis,GCA_002313205.1,Scaffold,1,2017-09-19,SPECIES,109027,1595214429
4,Locusta migratoria,GCA_000516895.1,Contig,1,2013-12-18,SPECIES,7004,5759798599
5,Schistocerca americana,GCA_021461395.1,Chromosome,13,2022-01-13,SPECIES,7009,8990354198
6,Schistocerca piceifrons,GCA_021461385.1,Chromosome,13,2022-01-13,SPECIES,274613,8742443784
7,Teleogryllus occipitalis,GCA_011170035.1,Scaffold,2,2020-02-21,SPECIES,470939,1933831998
8,Vandiemenella viatica,GCA_019457785.1,Scaffold,1,2021-08-07,SPECIES,431949,3905819974
9,Vandiemenella viatica,GCA_019457805.1,Scaffold,1,2021-08-07,SPECIES,431949,3822312222


Zoraptera; None assemblies
No genomes
Mantodea; None assemblies
No genomes
Blattodea; 10 assemblies


Unnamed: 0,spp,acc,level,numChrScaff,sub.date,org.rank,taxID,assembly.length
0,Blattella germanica,GCA_003018175.1,Scaffold,1,2018-03-26,SPECIES,6973,2037297555
1,Blattella germanica,GCA_000762945.2,Scaffold,1,2018-01-19,SPECIES,6973,1796331457
2,Coptotermes formosanus,GCA_013340265.1,Scaffold,1,2020-06-01,SPECIES,36987,875711502
3,Cryptotermes secundus,GCF_002891405.2,Scaffold,1,2018-01-17,SPECIES,105785,1018932804
4,Cryptotermes secundus,GCA_002891405.2,Scaffold,1,2018-01-17,SPECIES,105785,1018932804
5,Periplaneta americana,GCA_002939525.1,Scaffold,1,2018-02-23,SPECIES,6978,3374840130
6,Reticulitermes speratus,GCA_021186555.1,Scaffold,1,2021-12-14,SPECIES,60591,880618617
7,Reticulitermes speratus,GCA_021605165.1,Scaffold,1,2021-12-22,SPECIES,60591,881338091
8,Zootermopsis nevadensis,GCF_000696155.1,Scaffold,2,2014-07-22,SPECIES,136037,485009472
9,Zootermopsis nevadensis,GCA_000696155.1,Scaffold,1,2014-06-05,SPECIES,136037,485009472


Mantophasmatodea; None assemblies
No genomes
Grylloblattodea; None assemblies
No genomes
Phasmatodea; 16 assemblies


Unnamed: 0,spp,acc,level,numChrScaff,sub.date,org.rank,taxID,assembly.length
0,Clitarchus hookeri,GCA_002778355.1,Scaffold,2,2017-11-16,SPECIES,325643,3802613888
1,Dryococelus australis,GCA_002236955.1,Contig,1,2017-07-28,SPECIES,614101,3416447916
2,Medauroidea extradentata,GCA_003012365.1,Scaffold,1,2018-03-20,SPECIES,614211,2593359413
3,Timema bartmani,GCA_902151455.1,Scaffold,1,2019-07-03,SPECIES,61472,1108645000
4,Timema californicum,GCA_902141385.1,Scaffold,1,2019-06-19,SPECIES,61474,1220418541
5,Timema cristinae,GCA_002926335.1,Scaffold,1,2018-02-13,SPECIES,61476,931401240
6,Timema cristinae,GCA_002009905.3,Chromosome,14,2017-04-13,SPECIES,61476,1042628579
7,Timema cristinae,GCA_002928295.1,Chromosome,14,2018-02-15,SPECIES,61476,955490247
8,Timema cristinae,GCA_902151435.1,Scaffold,1,2019-07-03,SPECIES,61476,1177599484
9,Timema douglasi,GCA_901482245.1,Scaffold,1,2019-06-02,SPECIES,61478,1123070144


Embiodea; None assemblies
No genomes
Psocoptera; None assemblies
No genomes
Phthiraptera; 6 assemblies


Unnamed: 0,spp,acc,level,numChrScaff,sub.date,org.rank,taxID,assembly.length
0,Columbicola columbae,GCA_016920875.1,Scaffold,1,2021-02-23,SPECIES,128991,207887661
1,Columbicola hoogstraali,GCA_913698275.1,Contig,1,2021-09-13,SPECIES,1719211,1357829
2,Degeeriella rufa,GCA_913698455.1,Contig,1,2021-09-13,SPECIES,236531,1364611
3,Pediculus humanus corporis,GCF_000006295.1,Scaffold,1,2007-04-23,SUBSPECIES,121224,110781312
4,Pediculus humanus corporis,GCA_000006295.1,Scaffold,1,2007-04-23,SUBSPECIES,121224,110781312
5,Proechinophthirus fluctus,GCA_913698485.1,Contig,1,2021-09-13,SPECIES,908931,1251895


Thysanoptera; 6 assemblies


Unnamed: 0,spp,acc,level,numChrScaff,sub.date,org.rank,taxID,assembly.length
0,Aptinothrips rufus,GCA_902196195.1,Contig,1,2019-08-01,SPECIES,1291243,339926289
1,Frankliniella occidentalis,GCF_000697945.2,Scaffold,2,2017-12-15,SPECIES,133901,274974745
2,Frankliniella occidentalis,GCA_000697945.4,Scaffold,1,2019-11-12,SPECIES,133901,415771118
3,Frankliniella occidentalis,GCA_000697945.3,Scaffold,1,2017-12-15,SPECIES,133901,274974745
4,Thrips palmi,GCF_012932325.1,Scaffold,2,2020-04-29,SPECIES,161013,237830083
5,Thrips palmi,GCA_012932325.1,Scaffold,1,2020-04-29,SPECIES,161013,237830083


Hemiptera; 114 assemblies


Unnamed: 0,spp,acc,level,numChrScaff,sub.date,org.rank,taxID,assembly.length
0,Acyrthosiphon pisum,GCF_005508785.1,Chromosome,6,2019-05-20,SPECIES,7029,541120603
1,Acyrthosiphon pisum,GCF_000142985.2,Scaffold,2,2010-06-24,SPECIES,7029,541675471
2,Acyrthosiphon pisum,GCA_000142985.2,Scaffold,1,2010-06-24,SPECIES,7029,541675471
3,Acyrthosiphon pisum,GCA_005508785.1,Chromosome,5,2019-05-20,SPECIES,7029,541120603
4,Aelia acuminata,GCA_911387785.1,Chromosome,10,2021-07-21,SPECIES,1511221,1170030458
...,...,...,...,...,...,...,...,...
109,Trialeurodes vaporariorum,GCA_011764245.1,Scaffold,1,2020-03-27,SPECIES,88556,787484423
110,Trialeurodes vaporariorum,GCA_009741425.1,Scaffold,1,2019-12-11,SPECIES,88556,842014990
111,Triatoma infestans,GCA_011037195.1,Scaffold,1,2020-02-27,SPECIES,30076,949486157
112,Trionymus perrisii,GCA_900050545.1,Scaffold,1,2016-03-27,SPECIES,1295211,237582517


Holometabola; 2565 assemblies
Page:  eNrjsnQpyi_OL8jIzElUKC6tKs3OzOTMyy-JL0pNi8_Py6nkzCwGsYtTCzndnd3iDQwMTMyNDA1M9QwBgnITyA
Page:  eNrjsvYvysssycgvKEktSlQoKMpMzC0t5swsji9KTYvPz8up5MzLLwFxilMLOd2dHeMNDA2NDc3NTU31DAGz7RSn


Unnamed: 0,spp,acc,level,numChrScaff,sub.date,org.rank,taxID,assembly.length
0,Abisara bifasciata,GCA_018237985.1,Contig,1,2021-05-03,SPECIES,2614024,332741916
1,Abrostola tripartita,GCA_905340225.1,Chromosome,33,2021-03-25,SPECIES,938171,381042155
2,Abrostola tripartita,GCA_905340255.1,Contig,1,2021-03-25,SPECIES,938171,377898841
3,Abscondita terminalis,GCA_013368085.1,Scaffold,1,2020-06-23,SPECIES,2069292,499652588
4,Achlyodes mithridates tamenund,GCA_018251735.1,Contig,1,2021-05-03,SUBSPECIES,2201486,360619446
...,...,...,...,...,...,...,...,...
2560,Zonia zonia panamensis,GCA_018246015.1,Contig,1,2021-05-03,SUBSPECIES,2201726,115122648
2561,Zonia zonia panamensis,GCA_018248875.1,Contig,1,2021-05-03,SUBSPECIES,2201726,172979046
2562,Zopyrion sandace,GCA_018248435.1,Contig,1,2021-05-03,SPECIES,509524,243637293
2563,Zygaena filipendulae,GCA_907165275.2,Chromosome,33,2021-11-18,SPECIES,287375,365930723
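
The cell above repeats the row-building loop once for the first page and once for every following page. One way to avoid the duplication is a generator that follows `next_page_token` until it is empty. This is only a sketch: `iter_assemblies` and `fetch_page` are hypothetical names, and the fake pages below imitate just the two attributes (`assemblies`, `next_page_token`) that the notebook reads from the response.

```python
from types import SimpleNamespace


def iter_assemblies(fetch_page):
    """Yield every assembly across all result pages.

    `fetch_page(page_token)` must return an object exposing `assemblies`
    and `next_page_token`; it is called with None for the first page.
    Hypothetical helper for illustration; not part of the original notebook.
    """
    token = None
    while True:
        page = fetch_page(token)
        for descriptor in page.assemblies or []:
            yield descriptor.assembly
        token = page.next_page_token
        if not token:
            break


def make_page(names, token):
    """Build a fake response page for offline testing."""
    return SimpleNamespace(
        assemblies=[SimpleNamespace(assembly=name) for name in names],
        next_page_token=token,
    )


# Two fake pages: the first points to the second, the second ends the run.
fake_pages = {
    None: make_page(["a1", "a2"], "t1"),
    "t1": make_page(["a3"], None),
}
collected = list(iter_assemblies(fake_pages.get))
print(collected)
```

Against the live API, `fetch_page` would wrap `api_instance.assembly_descriptors_by_taxon`, leaving out `page_token` on the first call as the cell above does.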


## Number of species with a genome assembly per taxon

In [35]:
for lindex, taxon in taxonlist.iterrows():
    tax_name = taxon['Taxons']
    columns = ['spp', 'acc', 'level',"numChrScaff","sub.date","org.rank","taxID","assembly.length"]
    lst = []
    # query NCBI
    genome_summary = api_instance.assembly_descriptors_by_taxon(taxon=tax_name, page_size=1000)
    
    if genome_summary.total_count is None:
        print(f"{tax_name}; species with genome: 0")
    else:
        for assembly in map(lambda d: d.assembly, genome_summary.assemblies):
            lst.append([
                assembly.org.sci_name,
                assembly.assembly_accession,
                assembly.assembly_level,
                len(assembly.chromosomes),
                assembly.submission_date,
                assembly.org.rank,
                assembly.org.tax_id,
                assembly.seq_length])
        
        
        ## the API returns a page token when the results exceed page_size (1000)
        while genome_summary.next_page_token is not None:
            print("Page: ", genome_summary.next_page_token)
            genome_summary = api_instance.assembly_descriptors_by_taxon(taxon=tax_name, page_size=1000, page_token=genome_summary.next_page_token)
            
            
            for assembly in map(lambda d: d.assembly, genome_summary.assemblies):
                lst.append([
                    assembly.org.sci_name,
                    assembly.assembly_accession,
                    assembly.assembly_level,
                    len(assembly.chromosomes),
                    assembly.submission_date,
                    assembly.org.rank,
                    assembly.org.tax_id,
                    assembly.seq_length])            

        
        genomesdf = pd.DataFrame(lst, columns=columns)
        print(f"{tax_name}; species with genome: ", len(genomesdf.spp.unique()))
       # print("     -Spp list: ",",".join(genomesdf.spp.unique()))     


Odonata; species with genome:  5
Ephemeroptera; species with genome:  3
Plecoptera; species with genome:  6
Dermaptera; species with genome:  1
Orthoptera; species with genome:  10
Zoraptera; species with genome: 0
Mantodea; species with genome: 0
Blattodea; species with genome:  6
Mantophasmatodea; species with genome: 0
Grylloblattodea; species with genome: 0
Phasmatodea; species with genome:  13
Embiodea; species with genome: 0
Psocoptera; species with genome: 0
Phthiraptera; species with genome:  5
Thysanoptera; species with genome:  3
Hemiptera; species with genome:  63
Page:  eNrjsnQpyi_OL8jIzElUKC6tKs3OzOTMyy-JL0pNi8_Py6nkzCwGsYtTCzndnd3iDQwMTMyNDA1M9QwBgnITyA
Page:  eNrjsvYvysssycgvKEktSlQoKMpMzC0t5swsji9KTYvPz8up5MzLLwFxilMLOd2dHeMNDA2NDc3NTU31DAGz7RSn
Holometabola; species with genome:  1516
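
The species totals above come from counting unique `spp` values, which hides how many assemblies each species has. A short sketch of the per-species breakdown, using the species and accession columns copied from the Odonata table earlier in this notebook (the variable names are illustrative):

```python
import pandas as pd

# (species, accession) pairs copied from the Odonata table above.
odonata = pd.DataFrame(
    [
        ("Calopteryx splendens", "GCA_002093875.1"),
        ("Ischnura elegans", "GCF_921293095.1"),
        ("Ischnura elegans", "GCA_921293095.1"),
        ("Ischnura elegans", "GCA_921293115.1"),
        ("Ladona fulva", "GCA_000376725.2"),
        ("Pantala flavescens", "GCA_020796165.1"),
        ("Rhinocypha anisoptera", "GCA_011762765.1"),
    ],
    columns=["spp", "acc"],
)

# Number of assemblies for each species, largest first.
per_species = odonata.groupby("spp")["acc"].count().sort_values(ascending=False)

# Number of distinct species, the quantity printed by the cell above.
n_species = odonata["spp"].nunique()

# Species represented by more than one assembly.
multi = per_species[per_species > 1]

print(per_species)
print("species with genome:", n_species)
```

The same grouping could be applied to `genomesdf` inside the loop, or to the CSV files written in the previous section.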
