# Get information about the available genomes for a list of taxa

----------
- Author: Guillem Ylla, Ph.D.
  - Post-Doctoral Fellow at Harvard University
- Date: March 2021
-----------

Notebook based on my other repository [Number_genomes_available_in_NCBI](https://github.com/guillemylla/Number_genomes_available_in_NCBI/blob/main/Get_Available_Genomes_by_Taxon.ipynb)

In [5]:
import sys
import zipfile
import pandas as pd
from pprint import pprint
from datetime import datetime
from collections import defaultdict, Counter
import ncbi.datasets

- This notebook uses the `ncbi.datasets` Python library (more information: https://www.ncbi.nlm.nih.gov/datasets/).

In [10]:
## Set up api
api_instance = ncbi.datasets.GenomeApi(ncbi.datasets.ApiClient())

**Indicate the file to read here**
- The file [Taxons_to_check](Taxons_to_check.txt) includes all the hemimetabolous orders plus the superorder Endopterygota (holometabolous).
- For more details about the hemimetabolous orders, check the file [Hemimetabolous_orders.txt](Hemimetabolous_orders.txt), which I compiled from the book ["Insect Metamorphosis"](https://www.elsevier.com/books/insect-metamorphosis/belles/978-0-12-813020-9) by Xavier Belles.

In [9]:
# Indicate the file to read here (replace Taxons_to_check.txt with your own file)
filetoread="Taxons_to_check.txt"

In [14]:
taxonlist=pd.read_csv(filetoread)
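
`pd.read_csv` treats the first line of the file as the header, and the loops below read the column `Taxons`, so the input is expected to be a one-column list whose first line is `Taxons`. The following is a minimal sketch of checking that layout with the standard library only. It runs on a made-up two-taxon sample rather than the real file, and the helper name `read_taxa` is hypothetical, not part of the notebook.

```python
import csv
import io


def read_taxa(handle, column="Taxons"):
    """Return the non-empty taxon names found under `column`."""
    reader = csv.DictReader(handle)
    # fieldnames is None for an empty file
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise ValueError(f"expected a header line containing {column!r}")
    return [row[column].strip() for row in reader
            if row[column] and row[column].strip()]


# Illustrative sample only; the real file has one line per taxon.
sample = io.StringIO("Taxons\nOdonata\nEphemeroptera\n")
print(read_taxa(sample))
```

Failing early on a missing header gives a clearer message than the `KeyError` that `taxon['Taxons']` would otherwise raise inside the loop.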

## Number of species with an available genome within each taxon

In [15]:
for _, taxon in taxonlist.iterrows():
    tax_name = taxon['Taxons']
    columns = ['spp', 'acc', 'level', 'numChrScaff', 'sub.date', 'org.rank', 'taxID', 'assembly.length']
    lst = []
    # Query NCBI for every assembly descriptor under this taxon
    genome_summary = api_instance.assembly_descriptors_by_taxon(taxon=tax_name, limit='all')

    # total_count is None when NCBI returns no assemblies for the taxon
    if genome_summary.total_count is None:
        print(f"{tax_name}; species with genome: 0")
    else:
        for assembly in map(lambda d: d.assembly, genome_summary.assemblies):
            lst.append([
                assembly.org.sci_name,
                assembly.assembly_accession,
                assembly.assembly_level,
                len(assembly.chromosomes),
                assembly.submission_date,
                assembly.org.rank,
                assembly.org.tax_id,
                assembly.seq_length])
        genomesdf = pd.DataFrame(lst, columns=columns)
        # Count distinct scientific names, since one species can have several assemblies
        print(f"{tax_name}; species with genome: ", len(genomesdf.spp.unique()))
        # print("     -Spp list: ", ",".join(genomesdf.spp.unique()))


Odonata; species with genome:  3
Ephemeroptera; species with genome:  3
Plecoptera; species with genome:  3
Dermaptera; species with genome:  1
Orthoptera; species with genome:  6
Zoraptera; species with genome: 0
Mantodea; species with genome: 0
Blattodea; species with genome:  5
Mantophasmatodea; species with genome: 0
Grylloblattodea; species with genome: 0
Phasmatodea; species with genome:  13
Embiodea; species with genome: 0
Psocoptera; species with genome: 0
Phthiraptera; species with genome:  2
Thysanoptera; species with genome:  3
Hemiptera; species with genome:  49
Endopterygota; species with genome:  601
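
Both loops in this notebook build the same eight-column row for each assembly, so that step could be factored into one helper. The sketch below is one possible way to do it, not the notebook's own code. The names `assembly_rows` and `COLUMNS` are hypothetical, and the response objects are hand-built stand-ins (`SimpleNamespace`) so the example runs without contacting NCBI. Only attributes already used in the notebook are accessed. The sample values are copied from the Dermaptera row of the detailed output, and the content of `chromosomes` is a placeholder of length 1.

```python
from types import SimpleNamespace

COLUMNS = ["spp", "acc", "level", "numChrScaff", "sub.date",
           "org.rank", "taxID", "assembly.length"]


def assembly_rows(genome_summary):
    """Return one row per assembly, or an empty list when nothing was found."""
    if genome_summary.total_count is None:
        return []
    rows = []
    for descriptor in genome_summary.assemblies:
        a = descriptor.assembly
        rows.append([a.org.sci_name, a.assembly_accession, a.assembly_level,
                     len(a.chromosomes), a.submission_date, a.org.rank,
                     a.org.tax_id, a.seq_length])
    return rows


# Hand-built stand-in for an API response (assumption: same attribute names
# as the objects used in the notebook's loops).
fake_assembly = SimpleNamespace(
    org=SimpleNamespace(sci_name="Anisolabis maritima", rank="SPECIES", tax_id=62749),
    assembly_accession="GCA_010014785.1",
    assembly_level="Scaffold",
    chromosomes=["placeholder"],
    submission_date="2020-01-30",
    seq_length=641675028,
)
fake_summary = SimpleNamespace(
    total_count=1, assemblies=[SimpleNamespace(assembly=fake_assembly)])
empty_summary = SimpleNamespace(total_count=None, assemblies=None)

print(assembly_rows(fake_summary))
print(assembly_rows(empty_summary))
```

With such a helper, each loop body would reduce to one call followed by `pd.DataFrame(rows, columns=COLUMNS)`.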


## Detailed information about the available genomes within each taxon

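- A single species may have several assemblies available (for example, different submissions, or a GenBank `GCA_` record alongside its RefSeq `GCF_` counterpart), so the assembly counts below can exceed the species counts above.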
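
To make the difference between species and assembly counts concrete, the sketch below recounts the Blattodea rows from the output further down. `GCA_` accessions are GenBank records and `GCF_` accessions are RefSeq records. Treating two accessions that share the same digits as one underlying assembly is a heuristic assumed here for illustration, and the helper name `assembly_number` is hypothetical. The notebook itself does not perform this step.

```python
# (species, accession) pairs copied from the Blattodea table in this notebook.
blattodea = [
    ("Blattella germanica", "GCA_003018175.1"),
    ("Blattella germanica", "GCA_000762945.2"),
    ("Periplaneta americana", "GCA_002939525.1"),
    ("Zootermopsis nevadensis", "GCF_000696155.1"),
    ("Zootermopsis nevadensis", "GCA_000696155.1"),
    ("Cryptotermes secundus", "GCF_002891405.2"),
    ("Cryptotermes secundus", "GCA_002891405.2"),
    ("Coptotermes formosanus", "GCA_013340265.1"),
]


def assembly_number(accession):
    """Return the digits between the GCA_/GCF_ prefix and the version suffix."""
    return accession.split("_", 1)[1].split(".", 1)[0]


n_records = len(blattodea)                                       # rows returned
n_species = len({spp for spp, _ in blattodea})                   # distinct names
n_distinct = len({assembly_number(acc) for _, acc in blattodea}) # GCA/GCF pairs merged

print(n_records, n_species, n_distinct)  # 8 5 6
```

The 8 records and 5 species match the two Blattodea outputs in this notebook. The intermediate figure of 6 only holds under the pairing heuristic described above.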

In [16]:
for _, taxon in taxonlist.iterrows():
    tax_name = taxon['Taxons']
    columns = ['spp', 'acc', 'level', 'numChrScaff', 'sub.date', 'org.rank', 'taxID', 'assembly.length']
    lst = []
    # Query NCBI for every assembly descriptor under this taxon
    genome_summary = api_instance.assembly_descriptors_by_taxon(taxon=tax_name, limit='all')
    print(f"{tax_name}; {genome_summary.total_count} assemblies")

    if genome_summary.total_count is None:
        print("No genomes")
    else:
        for assembly in map(lambda d: d.assembly, genome_summary.assemblies):
            lst.append([
                assembly.org.sci_name,
                assembly.assembly_accession,
                assembly.assembly_level,
                len(assembly.chromosomes),
                assembly.submission_date,
                assembly.org.rank,
                assembly.org.tax_id,
                assembly.seq_length])
        genomesdf = pd.DataFrame(lst, columns=columns)
        display(genomesdf)

Odonata; 3 assemblies


| | spp | acc | level | numChrScaff | sub.date | org.rank | taxID | assembly.length |
|---|---|---|---|---|---|---|---|---|
| 0 | Calopteryx splendens | GCA_002093875.1 | Scaffold | 1 | 2017-04-14 | SPECIES | 52612 | 1628352544 |
| 1 | Rhinocypha anisoptera | GCA_011762765.1 | Scaffold | 1 | 2020-03-27 | SPECIES | 2487776 | 1871797026 |
| 2 | Ladona fulva | GCA_000376725.2 | Scaffold | 1 | 2017-10-31 | SPECIES | 123851 | 1157615891 |


Ephemeroptera; 3 assemblies


| | spp | acc | level | numChrScaff | sub.date | org.rank | taxID | assembly.length |
|---|---|---|---|---|---|---|---|---|
| 0 | Ephemera danica | GCA_000507165.2 | Scaffold | 1 | 2017-12-12 | SPECIES | 1049336 | 474347418 |
| 1 | Baetis rhodani | GCA_001676355.1 | Contig | 1 | 2016-06-24 | SPECIES | 189839 | 174117300 |
| 2 | Cloeon dipterum | GCA_902829235.1 | Scaffold | 1 | 2020-04-15 | SPECIES | 197152 | 180286980 |


Plecoptera; 3 assemblies


| | spp | acc | level | numChrScaff | sub.date | org.rank | taxID | assembly.length |
|---|---|---|---|---|---|---|---|---|
| 0 | Amphinemura sulcicollis | GCA_001676325.1 | Contig | 1 | 2016-06-24 | SPECIES | 143724 | 271924966 |
| 1 | Lednia tumana | GCA_003287335.1 | Scaffold | 1 | 2018-07-03 | SPECIES | 466727 | 304502267 |
| 2 | Isoperla grammatica | GCA_001676475.1 | Contig | 1 | 2016-06-24 | SPECIES | 552050 | 509522935 |


Dermaptera; 1 assemblies


| | spp | acc | level | numChrScaff | sub.date | org.rank | taxID | assembly.length |
|---|---|---|---|---|---|---|---|---|
| 0 | Anisolabis maritima | GCA_010014785.1 | Scaffold | 1 | 2020-01-30 | SPECIES | 62749 | 641675028 |


Orthoptera; 6 assemblies


| | spp | acc | level | numChrScaff | sub.date | org.rank | taxID | assembly.length |
|---|---|---|---|---|---|---|---|---|
| 0 | Acheta domesticus | GCA_014858955.1 | Contig | 1 | 2020-10-07 | SPECIES | 6997 | 929173017 |
| 1 | Gryllus bimaculatus | GCA_017312745.1 | Scaffold | 1 | 2021-02-12 | SPECIES | 6999 | 1658007496 |
| 2 | Teleogryllus occipitalis | GCA_011170035.1 | Scaffold | 2 | 2020-02-21 | SPECIES | 470939 | 1933831998 |
| 3 | Laupala kohalensis | GCA_002313205.1 | Scaffold | 1 | 2017-09-19 | SPECIES | 109027 | 1595214429 |
| 4 | Locusta migratoria | GCA_000516895.1 | Contig | 1 | 2013-12-18 | SPECIES | 7004 | 5759798599 |
| 5 | Xenocatantops brachycerus | GCA_900249655.1 | Contig | 1 | 2018-01-18 | SPECIES | 227619 | 41642132 |


Zoraptera; None assemblies
No genomes
Mantodea; None assemblies
No genomes
Blattodea; 8 assemblies


| | spp | acc | level | numChrScaff | sub.date | org.rank | taxID | assembly.length |
|---|---|---|---|---|---|---|---|---|
| 0 | Blattella germanica | GCA_003018175.1 | Scaffold | 1 | 2018-03-26 | SPECIES | 6973 | 2037297555 |
| 1 | Blattella germanica | GCA_000762945.2 | Scaffold | 1 | 2018-01-19 | SPECIES | 6973 | 1796331457 |
| 2 | Periplaneta americana | GCA_002939525.1 | Scaffold | 1 | 2018-02-23 | SPECIES | 6978 | 3374840130 |
| 3 | Zootermopsis nevadensis | GCF_000696155.1 | Scaffold | 2 | 2014-07-22 | SPECIES | 136037 | 485009472 |
| 4 | Zootermopsis nevadensis | GCA_000696155.1 | Scaffold | 1 | 2014-06-05 | SPECIES | 136037 | 485009472 |
| 5 | Cryptotermes secundus | GCF_002891405.2 | Scaffold | 1 | 2018-01-17 | SPECIES | 105785 | 1018932804 |
| 6 | Cryptotermes secundus | GCA_002891405.2 | Scaffold | 1 | 2018-01-17 | SPECIES | 105785 | 1018932804 |
| 7 | Coptotermes formosanus | GCA_013340265.1 | Scaffold | 1 | 2020-06-01 | SPECIES | 36987 | 875711502 |


Mantophasmatodea; None assemblies
No genomes
Grylloblattodea; None assemblies
No genomes
Phasmatodea; 16 assemblies


| | spp | acc | level | numChrScaff | sub.date | org.rank | taxID | assembly.length |
|---|---|---|---|---|---|---|---|---|
| 0 | Timema californicum | GCA_902141385.1 | Scaffold | 1 | 2019-06-19 | SPECIES | 61474 | 1220418541 |
| 1 | Timema bartmani | GCA_902151455.1 | Scaffold | 1 | 2019-07-03 | SPECIES | 61472 | 1108645000 |
| 2 | Timema cristinae | GCA_002926335.1 | Scaffold | 1 | 2018-02-13 | SPECIES | 61476 | 931401240 |
| 3 | Timema cristinae | GCA_902151435.1 | Scaffold | 1 | 2019-07-03 | SPECIES | 61476 | 1177599484 |
| 4 | Timema cristinae | GCA_002928295.1 | Chromosome | 14 | 2018-02-15 | SPECIES | 61476 | 955490247 |
| 5 | Timema cristinae | GCA_002009905.3 | Chromosome | 14 | 2017-04-13 | SPECIES | 61476 | 1042628579 |
| 6 | Timema douglasi | GCA_901482245.1 | Scaffold | 1 | 2019-06-02 | SPECIES | 61478 | 1123070144 |
| 7 | Timema genevievae | GCA_902155825.1 | Scaffold | 1 | 2019-07-08 | SPECIES | 629358 | 1049658498 |
| 8 | Timema monikensis | GCA_902151445.1 | Scaffold | 1 | 2019-07-03 | SPECIES | 170555 | 1098921974 |
| 9 | Timema podura | GCA_902151475.1 | Scaffold | 1 | 2019-07-03 | SPECIES | 61482 | 1105092765 |


Embiodea; None assemblies
No genomes
Psocoptera; None assemblies
No genomes
Phthiraptera; 3 assemblies


| | spp | acc | level | numChrScaff | sub.date | org.rank | taxID | assembly.length |
|---|---|---|---|---|---|---|---|---|
| 0 | Columbicola columbae | GCA_016920875.1 | Scaffold | 1 | 2021-02-23 | SPECIES | 128991 | 207887661 |
| 1 | Pediculus humanus corporis | GCF_000006295.1 | Scaffold | 1 | 2007-04-23 | SUBSPECIES | 121224 | 110781312 |
| 2 | Pediculus humanus corporis | GCA_000006295.1 | Scaffold | 1 | 2007-04-23 | SUBSPECIES | 121224 | 110781312 |


Thysanoptera; 5 assemblies


| | spp | acc | level | numChrScaff | sub.date | org.rank | taxID | assembly.length |
|---|---|---|---|---|---|---|---|---|
| 0 | Aptinothrips rufus | GCA_902196195.1 | Contig | 1 | 2019-08-01 | SPECIES | 1291243 | 339926289 |
| 1 | Frankliniella occidentalis | GCF_000697945.2 | Scaffold | 2 | 2017-12-15 | SPECIES | 133901 | 274974745 |
| 2 | Frankliniella occidentalis | GCA_000697945.4 | Scaffold | 1 | 2019-11-12 | SPECIES | 133901 | 415771118 |
| 3 | Thrips palmi | GCF_012932325.1 | Scaffold | 2 | 2020-04-29 | SPECIES | 161013 | 237830083 |
| 4 | Thrips palmi | GCA_012932325.1 | Scaffold | 1 | 2020-04-29 | SPECIES | 161013 | 237830083 |


Hemiptera; 82 assemblies


| | spp | acc | level | numChrScaff | sub.date | org.rank | taxID | assembly.length |
|---|---|---|---|---|---|---|---|---|
| 0 | Philaenus spumarius | GCA_002233535.1 | Scaffold | 1 | 2017-07-25 | SPECIES | 36667 | 1248087067 |
| 1 | Magicicada septendecim | GCA_011326945.1 | Scaffold | 1 | 2020-03-16 | SPECIES | 44679 | 1579033894 |
| 2 | Magicicada septendecula | GCA_011763675.1 | Scaffold | 1 | 2020-03-27 | SPECIES | 44680 | 1585977997 |
| 3 | Homalodisca vitripennis | GCA_000696855.2 | Scaffold | 1 | 2018-06-15 | SPECIES | 197043 | 1445215006 |
| 4 | Laodelphax striatellus | GCA_017141395.1 | Chromosome | 16 | 2021-03-09 | SPECIES | 195883 | 510210580 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 77 | Bemisia tabaci | GCA_902825415.1 | Scaffold | 1 | 2021-02-12 | SPECIES | 7038 | 657779510 |
| 78 | Bemisia tabaci | GCA_004919745.1 | Scaffold | 1 | 2019-04-29 | SPECIES | 7038 | 538478037 |
| 79 | Bemisia tabaci | GCA_003994315.1 | Scaffold | 1 | 2018-12-11 | SPECIES | 7038 | 657500294 |
| 80 | Trialeurodes vaporariorum | GCA_011764245.1 | Scaffold | 1 | 2020-03-27 | SPECIES | 88556 | 787484423 |


Endopterygota; 1078 assemblies


| | spp | acc | level | numChrScaff | sub.date | org.rank | taxID | assembly.length |
|---|---|---|---|---|---|---|---|---|
| 0 | Hydropsyche tenuis | GCA_009617725.1 | Contig | 1 | 2019-11-07 | SPECIES | 577407 | 229663394 |
| 1 | Stenopsyche tienmushanensis | GCA_008973525.1 | Contig | 1 | 2019-10-11 | SPECIES | 1560151 | 451494475 |
| 2 | Plectrocnemia conspersa | GCA_009617715.1 | Contig | 1 | 2019-11-07 | SPECIES | 161770 | 396695105 |
| 3 | Sericostoma sp. HW-2014 | GCA_003003475.1 | Scaffold | 1 | 2018-03-15 | SPECIES | 1977532 | 1015727762 |
| 4 | Glossosoma conforme | GCA_003347265.1 | Scaffold | 1 | 2018-07-31 | SPECIES | 1875375 | 604293666 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1073 | Campoletis sonorensis | GCA_013761285.1 | Scaffold | 1 | 2020-07-27 | SPECIES | 7416 | 256042896 |
| 1074 | Diadegma semiclausum | GCA_011421695.1 | Contig | 1 | 2020-03-18 | SPECIES | 208481 | 168496476 |
| 1075 | Diadromus collaris | GCA_009394715.1 | Contig | 1 | 2019-10-30 | SPECIES | 7421 | 188922525 |
| 1076 | Neoneuromus ignobilis | GCA_014529405.1 | Scaffold | 2 | 2020-09-11 | SPECIES | 1569773 | 767796098 |
