<a href="https://colab.research.google.com/github/mftorres/APP/blob/main/American_palm_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Leaving this here in case you need info about Colaboratory in Google

<p><img alt="Colaboratory logo" height="45px" src="/img/colab_favicon.ico" align="left" hspace="10px" vspace="0px"></p>

<h1>What is Colaboratory?</h1>

Colaboratory, or "Colab" for short, allows you to write and execute Python in your browser, with 
- Zero configuration required
- Free access to GPUs
- Easy sharing

Whether you're a **student**, a **data scientist** or an **AI researcher**, Colab can make your work easier. Watch [Introduction to Colab](https://www.youtube.com/watch?v=inN8seMm7UI) to learn more, or just get started below!

----

# **American Palm Phylogeny data exploration/management**

Document authored by **Maria Fernanda Torres Jimenez**  
Date: 2021 May 05

____


This notebook shows how to retrieve the American Palm Phylogeny metadata, how to query it, and how to update records.

Maintaining metadata associated with files requires version control with properly dated versions, and backups at all times. Be prepared for accidental changes or deletions, and make sure you can always go back to a stable version (this depends on you and your data practices).
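
As a minimal sketch of the dated-backup habit described above, the helper below copies a metadata file into a backup folder and adds today's date to the name of the copy. The function name `backup_metadata` and the `backups` folder are hypothetical and not part of the project.

```python
import datetime
import shutil
from pathlib import Path

def backup_metadata(path, backup_dir='backups'):
    """Copy `path` into `backup_dir`, adding today's date to the file name."""
    src = Path(path)
    dest_dir = Path(backup_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)   # create the folder if needed
    stamp = datetime.date.today().isoformat()     # ISO date, YYYY-MM-DD
    dest = dest_dir / f'{src.stem}_{stamp}{src.suffix}'
    shutil.copy2(src, dest)                       # copy2 also keeps file timestamps
    return dest
```

Running it twice on the same day overwrites that day's copy, so it gives one backup per day at most.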


## 0. Importing packages

In [16]:
import pandas as pd
import re
import matplotlib as mpl
from matplotlib import pyplot as plt
import requests as rs
import datetime
import numpy as np

!pip install fuzzywuzzy # install package to do partial text match
from fuzzywuzzy import fuzz
from fuzzywuzzy import process



The database is stored in a Google spreadsheet. Keeping it in the cloud is useful: Google keeps a version history, and everyone can access the document instead of it staying with a single person.

Creating a function to download the data:

In [17]:
def get_data():
  return pd.read_csv('./palm_metadata_%s.txt'%(str(datetime.datetime.now()).split(' ')[0]), sep = '\t')

# try reading today's local copy if it exists (the file name carries the date); otherwise download it first
try:
  df = get_data()
except FileNotFoundError:
  url = 'https://raw.githubusercontent.com/mftorres/APP/main/data/Metadata_palm_gap_2021-05-05.txt' # shared link careful who you share this with
  print('Downloading')
  res = rs.get(url = url)
  res.raise_for_status() # stop here if the download failed
  with open('./palm_metadata_%s.txt'%(str(datetime.datetime.now()).split(' ')[0]), 'wb') as file: # writing the document in binary mode so Python doesn't make changes
    file.write(res.content)
  df = get_data()

df # check the database

Unnamed: 0,dataset,provider,pi_code,originalmetadata,filename,sense,filecode,samplecode,library_index,botanic_garden,voucher,taxgenus,taxspecies,ifmorphotype,ifpopulation,sent_to_Cano,newfilename,raw_reads,reads_trimed_paired_flag,reads_trimed_paired,reads_trimed_single,percentage_lost_tosingles,pcr_filtered,collection_year,Continent,country,long,lat,flag,notes
0,Cano,Angela Cano,AC,Sent Appendix_SamplingPhylogeny_CentralAmerica...,1_151124_000000000-AJE6N_P3252_1001_1.fastq.gz,R1,AJE6N_P3252_1001,AJE6N_1001,CGATGT,G,Cano_A._etal__ACS338,Acrocomia,Acrocomia_aculeata,,,sent,Acr_acu_AJE6N1001_AC_R1.fastq.gz,,passed,,,,,,americas,Panama,,,,
1,Cano,Angela Cano,AC,Sent Appendix_SamplingPhylogeny_CentralAmerica...,1_151124_000000000-AJE6N_P3252_1001_2.fastq.gz,R2,AJE6N_P3252_1001,AJE6N_1001,CGATGT,G,Cano_A._etal__ACS338,Acrocomia,Acrocomia_aculeata,,,sent,Acr_acu_AJE6N1001_AC_R2.fastq.gz,,passed,,,,,,americas,Panama,,,,
2,Cano,Angela Cano,AC,Sent Appendix_SamplingPhylogeny_CentralAmerica...,1_151130_000000000-AK5EU_P3252_1126_1.fastq.gz,R1,AK5EU_P3252_1126,AK5EU_1126,CACCGG,FTBG,_FTBG_20040120A,Acrocomia,Acrocomia_crispa,,,sent,Acr_cri_AK5EU1126_AC_R1.fastq.gz,,failed,,,,,,americas,,,,,
3,Cano,Angela Cano,AC,Sent Appendix_SamplingPhylogeny_CentralAmerica...,1_151130_000000000-AK5EU_P3252_1126_2.fastq.gz,R2,AK5EU_P3252_1126,AK5EU_1126,CACCGG,FTBG,_FTBG_20040120A,Acrocomia,Acrocomia_crispa,,,sent,Acr_cri_AK5EU1126_AC_R2.fastq.gz,,failed,,,,,,americas,,,,,
4,Cano,Angela Cano,AC,Sent Appendix_SamplingPhylogeny_CentralAmerica...,1_151130_000000000-AKG91_P3252_1182_1.fastq.gz,R1,AKG91_P3252_1182,AKG91_1182,CTAGCT,JBP,Lorenzi_&_Soares_6762,Acrocomia,Acrocomia_emensis,,,sent,Acr_eme_AKG911182_AC_R1.fastq.gz,,passed,,,,,,americas,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3044,EliottGardner,Eliott Gardner,EGR,none,Chamaedorea_vulgata_Standley67344_R2_001.fastq.gz,R2,Standley67344,,,,,Chamaedorea,Chamaedorea_vulgata,,,,,,,,,,,,,,,,,
3045,EliottGardner,Eliott Gardner,EGR,none,Chamaedorea_zamorae_Azofeifa484_R1_001.fastq.gz,R1,Azofeifa484,,,,,Chamaedorea,Chamaedorea_zamorae,,,,,,,,,,,,,,,,,
3046,EliottGardner,Eliott Gardner,EGR,none,Chamaedorea_zamorae_Azofeifa484_R2_001.fastq.gz,R2,Azofeifa484,,,,,Chamaedorea,Chamaedorea_zamorae,,,,,,,,,,,,,,,,,
3047,EliottGardner,Eliott Gardner,EGR,none,Chamaedorea_zamorae_Azofeifa484_S36_L001_R1_00...,R1,Azofeifa484_S36_L001,,,,,Chamaedorea,Chamaedorea_zamorae,,,,,,,,,,,,,,,,,


The first thing to check is which columns the data has and what they mean.

In [18]:
list(df.columns)

['dataset',
 'provider',
 'pi_code',
 'originalmetadata',
 'filename',
 'sense',
 'filecode',
 'samplecode',
 'library_index',
 'botanic_garden',
 'voucher',
 'taxgenus',
 'taxspecies',
 'ifmorphotype',
 'ifpopulation',
 'sent_to_Cano',
 'newfilename',
 'raw_reads',
 'reads_trimed_paired_flag',
 'reads_trimed_paired',
 'reads_trimed_single',
 'percentage_lost_tosingles',
 'pcr_filtered',
 'collection_year',
 'Continent',
 'country',
 'long',
 'lat',
 'flag',
 'notes']

**dataset** bundle in which files are provided. e.g. CANO: sequences extracted by Angela Cano  
**provider** who provided the files or is responsible for sending the files  
**pi_code** short code (two or three letters, e.g. AC, EGR) that abbreviates the *provider*  
**originalmetadata** name of original metadata provided by provider. That file should never be modified and only annotated  
**filename** original file name, usually as delivered by the sequencing facility  
**sense** whether file contains forward, reverse, interleaved, single reads    
**filecode** code provided by sequencing facility, often links to sequencing experiment, lane and plate information  
**samplecode** sample ID in the plate or as delivered to the sequencing facility  
**library_index** sequencing index if provided or if in header of fastq. When double, each index is separated with a '+'  
**botanic_garden** if sample was obtained from a botanic garden rather than collected  
**voucher** voucher of the tissue or collecting code  
**taxgenus** taxonomic genus of the sample  
**taxspecies** taxonomic species with genus, separated by an underscore '_'  
**ifmorphotype** if the sample has a morphotype, variety or subspecies assigned  
**ifpopulation** if the sample belongs to a population-wide sampling effort, which population it was collected from  
**sent_to_Cano** files sent to Angela Cano who's analysing the data and leading the project  
**newfilename** new file name assigned to the sample. This is to annotate provider, dataset, species and voucher in the name. See Appendix for the code used to name files  
**raw_reads** number of raw reads before processing the data  -- filled as analysis progresses  
**reads_trimed_paired_flag** if trimmed -- filled as analysis progresses  
**reads_trimed_paired** number of reads after trimming -- filled as analysis progresses  
**reads_trimed_single** number of unpaired reads after trimming -- filled as analysis progresses  
**percentage_lost_tosingles** percentage of unpaired reads from the total of reads -- filled as analysis progresses  
**pcr_filtered** if PCR duplicate reads were filtered out using samtools or picard -- filled as analysis progresses  
**collection_year** year that sample was collected  
**Continent** continent. The project involves species from America but the metadata keeps track of other data from other continents  
**country** country where the sample was collected, if known from the collection or voucher  
**long** longitude coordinates if available, should be standardised to decimals  
**lat**  latitude coordinates if available, should be standardised to decimals  
**flag** single word descriptors separated by spaces (never colons, semicolons or tabs) that describe the flaws of the file. e.g. misingtaxspecies  
**notes** elaborated notes if needed but avoid using commas, semicolons or tabs.  
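
Some of the conventions above can be checked mechanically. This is a minimal sketch, not part of the notebook: the helper names `split_flags` and `species_matches_genus` are made up for illustration, and both expect plain strings (empty cells read by pandas as `nan` would need to be skipped first).

```python
def split_flags(flag):
    """Split a `flag` cell into its single-word descriptors.

    Raises ValueError if the cell contains a separator that the
    convention forbids (colon, semicolon or tab).
    """
    forbidden = [c for c in ':;\t' if c in flag]
    if forbidden:
        raise ValueError('forbidden separator(s) in flag: %r' % forbidden)
    return flag.split()  # descriptors are separated by spaces

def species_matches_genus(taxgenus, taxspecies):
    """True if `taxspecies` looks like genus_epithet and the genus equals `taxgenus`."""
    genus, sep, epithet = taxspecies.partition('_')
    return genus == taxgenus and sep == '_' and epithet != ''

print(split_flags('misingtaxspecies'))                              # ['misingtaxspecies']
print(species_matches_genus('Acrocomia', 'Acrocomia_aculeata'))     # True
print(species_matches_genus('Chamaedorea', 'Chamaedorea'))          # False, genus only
```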

The second interesting thing to do is to filter the unique species for which we have data. The files here represent species that might or might not be good, but at this point of the analysis (which will proceed on a sequence by sequence basis) we can only know what we have sequenced.

We also need to filter out species that are not from America.

In [19]:
df['Continent'].unique()

array(['americas', 'out_americas', nan], dtype=object)

Some species don't have a continent specified. Let's see which ones:

In [20]:
# filtering pandas dataframe
df[df['Continent'].isna()]['taxspecies'].unique()

array(['Ceroxylon_alpinum', 'Ceroxylon_ceriferum',
       'Ceroxylon_echinulatum', 'Ceroxylon_parvifrons',
       'Ceroxylon_parvum', 'Ceroxylon_quindiuense',
       'Ceroxylon_ventricosum', 'Ceroxylon_vogelianum', 'Chamaedorea',
       'Cocos', nan, 'Phytelephas_aecuatorialis',
       'Phytelephas_macrocarpa', 'Phytelephas', 'Phytelephas_tumacana',
       'Ravenea_sambiranensis', 'Chamaedorea_nationsiana',
       'Chamaedorea_arenbergiana', 'Chamaedorea_binderi',
       'Chamaedorea_brachyclada', 'Chamaedorea_seifrizii',
       'Chamaedorea_fractiflexa', 'Chamaedorea_graminifolia',
       'Chamaedorea_hodelii', 'Chamaedorea_ibarrae',
       'Chamaedorea_incrustata', 'Chamaedorea_keelerorum',
       'Chamaedorea_lehmannii', 'Chamaedorea_liebmannii',
       'Chamaedorea_pachenoana', 'Chamaedorea_parvisecta',
       'Chamaedorea_pauciflora', 'Chamaedorea_piscifolia',
       'Chamaedorea_pumila', 'Chamaedorea_costaricana',
       'Chamaedorea_rigida', 'Chamaedorea_schideana',
       'Cham

In [21]:
# using a different syntax (attribute access) that can help with readability
df[df.Continent.isna()]['taxspecies'].unique()

array(['Ceroxylon_alpinum', 'Ceroxylon_ceriferum',
       'Ceroxylon_echinulatum', 'Ceroxylon_parvifrons',
       'Ceroxylon_parvum', 'Ceroxylon_quindiuense',
       'Ceroxylon_ventricosum', 'Ceroxylon_vogelianum', 'Chamaedorea',
       'Cocos', nan, 'Phytelephas_aecuatorialis',
       'Phytelephas_macrocarpa', 'Phytelephas', 'Phytelephas_tumacana',
       'Ravenea_sambiranensis', 'Chamaedorea_nationsiana',
       'Chamaedorea_arenbergiana', 'Chamaedorea_binderi',
       'Chamaedorea_brachyclada', 'Chamaedorea_seifrizii',
       'Chamaedorea_fractiflexa', 'Chamaedorea_graminifolia',
       'Chamaedorea_hodelii', 'Chamaedorea_ibarrae',
       'Chamaedorea_incrustata', 'Chamaedorea_keelerorum',
       'Chamaedorea_lehmannii', 'Chamaedorea_liebmannii',
       'Chamaedorea_pachenoana', 'Chamaedorea_parvisecta',
       'Chamaedorea_pauciflora', 'Chamaedorea_piscifolia',
       'Chamaedorea_pumila', 'Chamaedorea_costaricana',
       'Chamaedorea_rigida', 'Chamaedorea_schideana',
       'Cham

In [22]:
df[df.Continent.isna()]['taxgenus'].unique()

array(['Ceroxylon', 'Chamaedorea', 'Cocos', nan, 'Phytelephas', 'Ravenea'],
      dtype=object)

These species are from the genera *Ceroxylon*, *Chamaedorea*, *Cocos*, *Phytelephas*, and *Ravenea* (some rows have no genus at all). I remember which dataset they come from, and I can tell they lack a continent only because I haven't updated the dataframe (it is not my main priority).
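
Rather than relying on memory, the source dataset of those rows can be looked up directly. This is a minimal sketch on a toy frame: the rows and the dataset labels 'A', 'B', 'C' are invented stand-ins, and only the column names `dataset`, `taxgenus` and `Continent` come from the metadata.

```python
import numpy as np
import pandas as pd

# toy stand-in for the metadata; the values are made up for illustration
toy = pd.DataFrame({
    'dataset':   ['A', 'A', 'B', 'B', 'C'],
    'taxgenus':  ['Ceroxylon', 'Ceroxylon', 'Cocos', 'Acrocomia', 'Ravenea'],
    'Continent': [np.nan, np.nan, np.nan, 'americas', np.nan],
})

missing = toy[toy['Continent'].isna()]
# number of rows without a continent, per dataset and genus
counts = missing.groupby(['dataset', 'taxgenus']).size()
print(counts)
```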

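We can totally fix that today, though. Let's iterate through the rows in the dataset to assign the continent information. We also know that all but *Cocos* (which is cosmopolitan) are from America; note that the loop below assigns 'americas' to every row with a missing continent, *Cocos* included.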

In [23]:
# itertuples() yields one named tuple per row: the first field is the index (row.Index) and each column is an attribute named after its label. Similar to the iterrows() method but faster.
# with itertuples(), columns are read with the "dot" syntax (row.Continent); row['Continent'] does not work.

for row in df[df['Continent'].isna()].itertuples():
  # print(row.Continent,'empty') # test
  df.loc[row.Index,'Continent'] = 'americas' # loc assigns a value in a cell with coordinates [row, column] labels
df['Continent'].unique() # no nan should appear

array(['americas', 'out_americas'], dtype=object)
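
The same assignment can be written without a loop by passing a boolean mask to `.loc`. The sketch below also shows one way to leave *Cocos* out, as the text suggests; the toy rows are invented for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'taxgenus':  ['Ceroxylon', 'Cocos', 'Phytelephas', 'Acrocomia'],
    'Continent': [np.nan, np.nan, np.nan, 'americas'],
})

# rows with no continent, excluding the cosmopolitan genus Cocos
mask = toy['Continent'].isna() & (toy['taxgenus'] != 'Cocos')
toy.loc[mask, 'Continent'] = 'americas'

print(toy)
```

In this variant the *Cocos* row stays empty and would still need whatever label the project agrees on.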

Nice. Now we can filter all the species from America, then list and count them.

In [24]:
df[df['Continent'] == 'americas']['taxspecies'].unique()

array(['Acrocomia_aculeata', 'Acrocomia_crispa', 'Acrocomia_emensis',
       'Acrocomia_glaucescens', 'Acrocomia_intumescens',
       'Acrocomia_totai', 'Aiphanes_acaulis', 'Aiphanes_buenaventurae',
       'Aiphanes_concinna', 'Aiphanes_erinacea', 'Aiphanes_gelatinosa',
       'Aiphanes_hirsuta', 'Aiphanes_horrida', 'Aiphanes_killipii',
       'Aiphanes_leiostachys', 'Aiphanes_lindeniana', 'Aiphanes_linearis',
       'Aiphanes_macroloba', 'Aiphanes_minima', 'Aiphanes_parvifolia',
       'Aiphanes_pilaris', 'Aiphanes_simplex', 'Aiphanes_tricuspidata',
       'Aiphanes_ulei', 'Allagoptera_arenaria', 'Allagoptera_brevicalyx',
       'Allagoptera_caudescens', 'Allagoptera_leucocalyx',
       'Allagoptera_pectinata', 'Ammandra_decasperma', 'Aphandra_natalia',
       'Astrocaryum_acaule', 'Astrocaryum_aculeatum',
       'Astrocaryum_alatum', 'Astrocaryum_campestre',
       'Astrocaryum_carnosum', 'Astrocaryum_chambira',
       'Astrocaryum_chonta', 'Astrocaryum_ciliatum',
       'Astrocaryum

In [25]:
print('Number of american palm species sequenced:')
len(df[df['Continent'] == 'americas']['taxspecies'].unique())

Number of american palm species sequenced:


531
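
Note that `unique()` keeps `nan` as a value, and that entries carrying only a genus name (such as 'Chamaedorea' in the earlier output) are counted too, so the figure above may slightly overstate the number of species. A minimal sketch of a stricter count, on invented rows:

```python
import numpy as np
import pandas as pd

species = pd.Series(['Acrocomia_aculeata', 'Acrocomia_aculeata',
                     'Chamaedorea', np.nan, 'Cocos_nucifera'])

# unique() counts nan and the genus-only entry as well
print(len(species.unique()))

named = species.dropna()
# keep only entries of the form genus_epithet
binomials = named[named.str.contains('_', regex=False)]
print(binomials.nunique())
```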

We need to know how many species are American species. I initially used this list, which was passed on to me (but it is something to check more formally). The Kew checklist, for example, doesn't include some *Sabal* species for which we have sequences. There has to be a consensus about which list to compare against.

In [26]:
palms_americas=['Schippia_concolor','Pseudophoenix_ekmanii','Pseudophoenix_lediniana','Pseudophoenix_sargentii','Pseudophoenix_vinifera','Acoelorraphe_wrightii','Pholidostachys_amazonensis','Pholidostachys_dactyloides','Pholidostachys_kalbreyeri','Pholidostachys_occidentalis','Pholidostachys_panamensis','Pholidostachys_pulchra','Pholidostachys_sanluisensis','Pholidostachys_synanthera','Lepidocaryum_tenue','Hyospathe_elegans','Hyospathe_frontinensis','Hyospathe_macrorhachis','Hyospathe_peruviana','Hyospathe_pittieri','Hyospathe_wendlandiana','Chelyocarpus_chuco','Chelyocarpus_dianeurus','Chelyocarpus_repens','Chelyocarpus_ulei','Acrocomia_aculeata','Acrocomia_crispa','Acrocomia_emensis','Acrocomia_glaucescens','Acrocomia_hassleri','Acrocomia_intumescens','Acrocomia_media','Acrocomia_totai','Aiphanes_acanthophylla','Aiphanes_acaulis','Aiphanes_bicornis','Aiphanes_buenaventurae','Aiphanes_chiribogensis','Aiphanes_concinna','Aiphanes_deltoidea','Aiphanes_duquei','Aiphanes_eggersii','Aiphanes_erinacea','Aiphanes_gelatinosa','Aiphanes_graminifolia','Aiphanes_grandis','Aiphanes_hirsuta','Aiphanes_horrida','Aiphanes_killipii','Aiphanes_leiostachys','Aiphanes_lindeniana','Aiphanes_linearis','Aiphanes_macroloba','Aiphanes_minima','Aiphanes_multiplex','Aiphanes_parvifolia','Aiphanes_pilaris','Aiphanes_simplex','Aiphanes_spicata','Aiphanes_stergiosii','Aiphanes_tricuspidata','Aiphanes_ulei','Aiphanes_verrucosa','Aiphanes_weberbaueri','Allagoptera_arenaria','Allagoptera_brevicalyx','Allagoptera_campestris','Allagoptera_caudescens','Allagoptera_leucocalyx','Ammandra_decasperma','Aphandra_natalia','Asterogyne_guianensis','Asterogyne_martiana','Asterogyne_ramosa','Asterogyne_spicata','Asterogyne_yaracuyense','Astrocaryum_acaule','Astrocaryum_aculeatissimum','Astrocaryum_aculeatum','Astrocaryum_alatum','Astrocaryum_campestre','Astrocaryum_carnosum','Astrocaryum_chambira','Astrocaryum_chonta','Astrocaryum_ciliatum','Astrocaryum_confertum','Astrocaryum_cuatrecasanum',
'Astrocaryum_echinatum','Astrocaryum_faranae','Astrocaryum_farinosum','Astrocaryum_ferrugineum','Astrocaryum_giganteum','Astrocaryum_gratum','Astrocaryum_gynacanthum','Astrocaryum_huaimi','Astrocaryum_huicungo','Astrocaryum_jauari','Astrocaryum_javarense','Astrocaryum_macrocalyx','Astrocaryum_malybo','Astrocaryum_mexicanum','Astrocaryum_minus','Astrocaryum_murumuru','Astrocaryum_paramaca','Astrocaryum_perangustatum','Astrocaryum_rodriguesii','Astrocaryum_sciophilum','Astrocaryum_scopatum','Astrocaryum_sociale','Astrocaryum_standleyanum','Astrocaryum_triandrum','Astrocaryum_tucuma','Astrocaryum_ulei','Astrocaryum_urostachys','Astrocaryum_vulgare','Attalea_allenii','Attalea_amygdalina','Attalea_amylacea','Attalea_anisitsiana','Attalea_apoda','Attalea_attaleoides','Attalea_barreirensis','Attalea_bassleriana','Attalea_blepharopus','Attalea_brasiliensis','Attalea_brejinhoensis','Attalea_burretiana','Attalea_butyracea','Attalea_camopiensis','Attalea_cephalotus','Attalea_cohune','Attalea_colenda','Attalea_compta','Attalea_crassispatha','Attalea_cuatrecasana','Attalea_dahlgreniana','Attalea_degranvillei','Attalea_dubia','Attalea_eichleri','Attalea_exigua','Attalea_fairchildensis','Attalea_funifera','Attalea_geraensis','Attalea_guacuyule','Attalea_guianensis','Attalea_hoehnei','Attalea_huebneri','Attalea_humilis','Attalea_iguadummat','Attalea_insignis','Attalea_kewensis','Attalea_lauromuelleriana','Attalea_leandroana','Attalea_luetzelburgii','Attalea_macrolepis','Attalea_magdalenica','Attalea_maracaibensis','Attalea_maripa','Attalea_maripensis','Attalea_microcarpa','Attalea_moorei','Attalea_nucifera','Attalea_oleifera','Attalea_osmantha','Attalea_peruviana','Attalea_phalerata','Attalea_pindobassu','Attalea_plowmanii','Attalea_princeps','Attalea_racemosa','Attalea_rhynchocarpa','Attalea_rostrata','Attalea_salazarii','Attalea_salvadorensis','Attalea_seabrensis','Attalea_septuagenata','Attalea_speciosa','Attalea_spectabilis','Attalea_tessmannii','Attalea_vitrivir','Attalea_weberbaueri',
'Attalea_wesselsboeri','Bactris_acanthocarpa','Bactris_acanthocarpoides','Bactris_ana-juliae','Bactris_aubletiana','Bactris_bahiensis','Bactris_balanophora','Bactris_barronis','Bactris_bidentula','Bactris_bifida','Bactris_brongniartii','Bactris_campestris','Bactris_caryotifolia','Bactris_caudata','Bactris_charnleyae','Bactris_chaveziae','Bactris_coloniata','Bactris_coloradonis','Bactris_concinna','Bactris_constanciae','Bactris_corossilla','Bactris_cubensis','Bactris_cuspidata','Bactris_dianeura','Bactris_elegans','Bactris_faucium','Bactris_ferruginea','Bactris_fissifrons','Bactris_gasipaes','Bactris_gastoniana','Bactris_glandulosa','Bactris_glassmanii','Bactris_glaucescens','Bactris_gracilior','Bactris_grayumii','Bactris_guineensis','Bactris_halmoorei','Bactris_hatschbachii','Bactris_herrerana','Bactris_hirta','Bactris_hondurensis','Bactris_horridispatha','Bactris_jamaicana','Bactris_killipii','Bactris_kunorum','Bactris_longiseta','Bactris_macroacantha','Bactris_major','Bactris_maraja','Bactris_martiana','Bactris_mexicana','Bactris_militaris','Bactris_nancibaensis','Bactris_obovata','Bactris_oligocarpa','Bactris_oligoclada','Bactris_panamensis','Bactris_pickelii','Bactris_pilosa','Bactris_pliniana','Bactris_plumeriana','Bactris_polystachya','Bactris_ptariana','Bactris_rhaphidacantha','Bactris_riparia','Bactris_rostrata','Bactris_schultesii','Bactris_setiflora','Bactris_setosa','Bactris_setulosa','Bactris_simplicifrons','Bactris_soeiroana','Bactris_sphaerocarpa','Bactris_syagroides','Bactris_tefensis','Bactris_timbuiensis','Bactris_tomentosa','Bactris_turbinocarpa','Bactris_vulgaris','Brahea_aculeata','Brahea_armata','Brahea_brandegeei','Brahea_calcarea','Brahea_decumbens','Brahea_dulcis','Brahea_edulis','Brahea_moorei','Brahea_pimo','Brahea_salvadorensis','Brahea_sarukhanii','Butia_archeri','Butia_campicola','Butia_capitata','Butia_catarinensis','Butia_eriospatha','Butia_exilata','Butia_exospadix','Butia_lallemantii','Butia_leptospatha','Butia_marmorii',
'Butia_matogrossensis','Butia_microspadix','Butia_missionera','Butia_noblickii','Butia_odorata','Butia_paraguayensis','Butia_purpurascens','Butia_quaraimana','Butia_stolonifera','Butia_yatay','Calyptrogyne_allenii','Calyptrogyne_anomala','Calyptrogyne_baudensis','Calyptrogyne_coloradensis','Calyptrogyne_condensata','Calyptrogyne_costatifrons','Calyptrogyne_deneversii','Calyptrogyne_fortunensis','Calyptrogyne_ghiesbreghtiana','Calyptrogyne_herrerae','Calyptrogyne_kunorum','Calyptrogyne_osensis','Calyptrogyne_panamensis','Calyptrogyne_pubescens','Calyptrogyne_sanblasensis','Calyptrogyne_trichostachys','Calyptrogyne_tutensis','Calyptronoma_occidentalis','Calyptronoma_plumeriana','Calyptronoma_rivalis','Ceroxylon_alpinum','Ceroxylon_amazonicum','Ceroxylon_ceriferum','Ceroxylon_echinulatum','Ceroxylon_parvifrons','Ceroxylon_parvum','Ceroxylon_peruvianum','Ceroxylon_pityrophyllum','Ceroxylon_quindiuense','Ceroxylon_sasaimae','Ceroxylon_ventricosum','Ceroxylon_vogelianum','Chamaedorea_adscendens','Chamaedorea_allenii','Chamaedorea_alternans','Chamaedorea_amabilis','Chamaedorea_anemophila','Chamaedorea_angustisecta','Chamaedorea_arenbergiana','Chamaedorea_atrovirens','Chamaedorea_benziei','Chamaedorea_binderi','Chamaedorea_brachyclada','Chamaedorea_brachypoda','Chamaedorea_carchensis','Chamaedorea_castillo-montii','Chamaedorea_cataractarum','Chamaedorea_christinae','Chamaedorea_correae','Chamaedorea_costaricana','Chamaedorea_crucensis','Chamaedorea_dammeriana','Chamaedorea_deckeriana','Chamaedorea_deneversiana','Chamaedorea_elatior','Chamaedorea_elegans','Chamaedorea_ernesti-augusti','Chamaedorea_falcifera','Chamaedorea_foveata','Chamaedorea_fractiflexa','Chamaedorea_fragrans','Chamaedorea_frondosa','Chamaedorea_geonomiformis','Chamaedorea_glaucifolia','Chamaedorea_graminifolia','Chamaedorea_guntheriana','Chamaedorea_hodelii','Chamaedorea_hooperiana','Chamaedorea_ibarrae','Chamaedorea_incrustata','Chamaedorea_keelerorum','Chamaedorea_klotzschiana','Chamaedorea_latisecta',
'Chamaedorea_lehmannii','Chamaedorea_liebmannii','Chamaedorea_linearis','Chamaedorea_lucidifrons','Chamaedorea_macrospadix','Chamaedorea_matae','Chamaedorea_metallica','Chamaedorea_microphylla','Chamaedorea_microspadix','Chamaedorea_moliniana','Chamaedorea_murriensis','Chamaedorea_nationsiana','Chamaedorea_neurochlamys','Chamaedorea_nubium','Chamaedorea_oblongata','Chamaedorea_oreophila','Chamaedorea_pachecoana','Chamaedorea_palmeriana','Chamaedorea_parvifolia','Chamaedorea_parvisecta','Chamaedorea_pauciflora','Chamaedorea_pedunculata','Chamaedorea_pinnatifrons','Chamaedorea_piscifolia','Chamaedorea_pittieri','Chamaedorea_plumosa','Chamaedorea_pochutlensis','Chamaedorea_ponderosa','Chamaedorea_pumila','Chamaedorea_pygmaea','Chamaedorea_queroana','Chamaedorea_radicalis','Chamaedorea_recurvata','Chamaedorea_rhizomatosa','Chamaedorea_ricardoi','Chamaedorea_rigida','Chamaedorea_robertii','Chamaedorea_rojasiana','Chamaedorea_rosibeliae','Chamaedorea_rossteniorum','Chamaedorea_sartorii','Chamaedorea_scheryi','Chamaedorea_schiedeana','Chamaedorea_schippii','Chamaedorea_seifrizii','Chamaedorea_serpens','Chamaedorea_simplex','Chamaedorea_skutchii','Chamaedorea_smithii','Chamaedorea_stenocarpa','Chamaedorea_stolonifera','Chamaedorea_stricta','Chamaedorea_subjectifolia','Chamaedorea_tenerrima','Chamaedorea_tepejilote','Chamaedorea_tuerckheimii','Chamaedorea_undulatifolia','Chamaedorea_verapazensis','Chamaedorea_verecunda','Chamaedorea_volcanensis','Chamaedorea_vulgata','Chamaedorea_warscewiczii','Chamaedorea_whitelockiana','Chamaedorea_woodsoniana','Chamaedorea_zamorae','Coccothrinax_acunana','Coccothrinax_alexandri','Coccothrinax_argentata','Coccothrinax_argentea','Coccothrinax_baracoensis','Coccothrinax_barbadensis','Coccothrinax_bermudezii','Coccothrinax_borhidiana','Coccothrinax_boschiana','Coccothrinax_camagueyana','Coccothrinax_clarensis','Coccothrinax_concolor','Coccothrinax_crinita','Coccothrinax_cupularis','Coccothrinax_ekmanii','Coccothrinax_elegans','Coccothrinax_fagildei',
'Coccothrinax_fragrans','Coccothrinax_garciana','Coccothrinax_gracilis','Coccothrinax_guantanamensis','Coccothrinax_gundlachii','Coccothrinax_hioramii','Coccothrinax_inaguensis','Coccothrinax_jamaicensis','Coccothrinax_leonis','Coccothrinax_litoralis','Coccothrinax_macroglossa','Coccothrinax_microphylla','Coccothrinax_miraguama','Coccothrinax_moaensis','Coccothrinax_montana','Coccothrinax_munizii','Coccothrinax_muricata','Coccothrinax_nipensis','Coccothrinax_orientalis','Coccothrinax_pauciramosa','Coccothrinax_proctorii','Coccothrinax_pseudorigida','Coccothrinax_pumila','Coccothrinax_readii','Coccothrinax_rigida','Coccothrinax_salvatoris','Coccothrinax_saxicola','Coccothrinax_scoparia','Coccothrinax_spissa','Coccothrinax_torrida','Coccothrinax_trinitensis','Coccothrinax_victorini','Coccothrinax_yunquensis','Coccothrinax_yuraguana','Colpothrinax_aphanopetala','Colpothrinax_cookii','Colpothrinax_wrightii','Copernicia_alba','Copernicia_baileyana','Copernicia_berteroana','Copernicia_brittonorum','Copernicia_cowellii','Copernicia_curbeloi','Copernicia_curtissii','Copernicia_ekmanii','Copernicia_fallaensis','Copernicia_gigas','Copernicia_glabrescens','Copernicia_hospita','Copernicia_humicola','Copernicia_longiglossa','Copernicia_macroglossa','Copernicia_molineti','Copernicia_oxycalyx','Copernicia_prunifera','Copernicia_rigida','Copernicia_roigii','Copernicia_tectorum','Copernicia_yarey','Cryosophila_bartlettii','Cryosophila_cookii','Cryosophila_grayumii','Cryosophila_guagara','Cryosophila_kalbreyeri','Cryosophila_macrocarpa','Cryosophila_nana','Cryosophila_stauracantha','Cryosophila_warscewiczii','Cryosophila_williamsii','Desmoncus_chinantlensis','Desmoncus_cirrhifer','Desmoncus_costaricensis','Desmoncus_giganteus','Desmoncus_horridus','Desmoncus_interjectus','Desmoncus_kunarius','Desmoncus_latisectus','Desmoncus_leptoclonos','Desmoncus_loretanus','Desmoncus_madrensis','Desmoncus_mitis','Desmoncus_moorei','Desmoncus_myriacanthos','Desmoncus_obovoideus','Desmoncus_orthacanthos',
'Desmoncus_osensis','Desmoncus_parvulus','Desmoncus_polyacanthos','Desmoncus_prunifer','Desmoncus_pumilus','Desmoncus_setosus','Desmoncus_stans','Desmoncus_vacivus','Dictyocaryum_fuscum','Dictyocaryum_lamarckianum','Dictyocaryum_ptarianum','Elaeis_oleifera','Euterpe_broadwayi','Euterpe_catinga','Euterpe_edulis','Euterpe_longibracteata','Euterpe_luminosa','Euterpe_oleracea','Euterpe_precatoria','Gaussia_attenuata','Gaussia_gomez-pompae','Gaussia_maya','Gaussia_princeps','Gaussia_spirituana','Geonoma_aspidiifolia','Geonoma_baculifera','Geonoma_bernalii','Geonoma_braunii','Geonoma_brenesii','Geonoma_brongniartii','Geonoma_calyptrogynoidea','Geonoma_camana','Geonoma_chlamydostachys','Geonoma_chococola','Geonoma_concinna','Geonoma_concinnoidea','Geonoma_congesta','Geonoma_cuneata','Geonoma_deneversii','Geonoma_deversa','Geonoma_dindoensis','Geonoma_divisa','Geonoma_elegans','Geonoma_epetiolata','Geonoma_euspatha','Geonoma_ferruginea','Geonoma_fosteri','Geonoma_frontinensis','Geonoma_galeanoae','Geonoma_gentryi','Geonoma_hollinensis','Geonoma_hugonis','Geonoma_interrupta','Geonoma_lanata','Geonoma_laxiflora','Geonoma_lehmannii','Geonoma_leptospadix','Geonoma_longipedunculata','Geonoma_longivaginata','Geonoma_macrostachys','Geonoma_maxima','Geonoma_monospatha','Geonoma_mooreana','Geonoma_multisecta','Geonoma_occidentalis','Geonoma_oldemanii','Geonoma_oligoclona','Geonoma_operculata','Geonoma_orbignyana','Geonoma_paradoxa','Geonoma_pauciflora','Geonoma_peruviana','Geonoma_pinnatifrons','Geonoma_poeppigiana','Geonoma_pohliana','Geonoma_poiteauana','Geonoma_sanmartinensis','Geonoma_santanderensis','Geonoma_schizocarpa','Geonoma_schottiana','Geonoma_scoparia','Geonoma_simplicifrons','Geonoma_spinescens','Geonoma_stricta','Geonoma_talamancana','Geonoma_tenuissima','Geonoma_triandra','Geonoma_triglochin','Geonoma_trigona','Geonoma_umbraculiformis','Geonoma_undata','Geonoma_venosa','Guihaia_argyrata','Guihaia_grossifibrosa','Guihaia_lancifolia','Hemithrinax_compacta',
'Hemithrinax_ekmaniana','Hemithrinax_rivularis','Iriartea_deltoidea','Iriartella_setigera','Iriartella_stenocarpa','Itaya_amicorum','Juania_australis','Jubaea_chilensis','Leopoldinia_piassaba','Leopoldinia_pulchra','Leucothrinax_morrisii','Lytocaryum_hoehnei','Lytocaryum_insigne','Lytocaryum_itapebiense','Lytocaryum_weddellianum','Manicaria_martiana','Manicaria_saccifera','Mauritia_carana','Mauritia_flexuosa','Mauritiella_aculeata','Mauritiella_armata','Mauritiella_macroclada','Mauritiella_pumila','Neonicholsonia_watsonii','Oenocarpus_bacaba','Oenocarpus_balickii','Oenocarpus_bataua','Oenocarpus_circumtextus','Oenocarpus_distichus','Oenocarpus_makeru','Oenocarpus_mapora','Oenocarpus_minor','Oenocarpus_simplex','Phytelephas_macrocarpa','Phytelephas_schottii','Phytelephas_seemannii','Phytelephas_tenuicaulis','Phytelephas_tumacana','Prestoea_acuminata','Prestoea_carderi','Prestoea_decurrens','Prestoea_ensiformis','Prestoea_longepetiolata','Prestoea_pubens','Prestoea_pubigera','Prestoea_schultzeana','Prestoea_simplicifolia','Prestoea_tenuiramosa','Raphia_taedigera','Reinhardtia_elegans','Reinhardtia_gracilis','Reinhardtia_koschnyana','Reinhardtia_latisecta','Reinhardtia_paiewonskiana','Reinhardtia_simplex','Rhapidophyllum_hystrix','Roystonea_altissima','Roystonea_borinquena','Roystonea_dunlapiana','Roystonea_lenis','Roystonea_maisiana','Roystonea_oleracea','Roystonea_princeps','Roystonea_regia','Roystonea_stellata','Roystonea_violacea','Sabal_bermudana','Sabal_causiarum','Sabal_domingensis','Sabal_etonia','Sabal_gretherae','Sabal_maritima','Sabal_mauritiiformis','Sabal_mexicana','Sabal_minor','Sabal_palmetto','Sabal_pumos','Sabal_rosei','Sabal_uresana','Sabal_yapa','Serenoa_repens','Socratea_exorrhiza','Socratea_hecatonandra','Socratea_montana','Socratea_rostrata','Socratea_salazarii','Syagrus_allagopteroides','Syagrus_amara','Syagrus_angustifolia','Syagrus_botryophora','Syagrus_caerulescens','Syagrus_campestris','Syagrus_campylospatha','Syagrus_cardenasii','Syagrus_cataphracta',
'Syagrus_cearensis','Syagrus_cerqueirana','Syagrus_cocoides','Syagrus_comosa','Syagrus_coronata','Syagrus_deflexa','Syagrus_duartei','Syagrus_emasensis','Syagrus_evansiana','Syagrus_flexuosa','Syagrus_glaucescens','Syagrus_glazioviana','Syagrus_gouveiana','Syagrus_graminifolia','Syagrus_harleyi','Syagrus_hoehnei','Syagrus_inajai','Syagrus_insignis','Syagrus_itacambirana','Syagrus_itapebiensis','Syagrus_kellyana','Syagrus_lilliputiana','Syagrus_loefgrenii','Syagrus_longipedunculata','Syagrus_lorenzoniorum','Syagrus_macrocarpa','Syagrus_mendanhensis','Syagrus_menzeliana','Syagrus_microphylla','Syagrus_minor','Syagrus_oleracea','Syagrus_orinocensis','Syagrus_petraea','Syagrus_picrophylla','Syagrus_pimentae','Syagrus_pleioclada','Syagrus_pleiocladoides','Syagrus_pompeoi','Syagrus_procumbens','Syagrus_pseudococos','Syagrus_romanzoffiana','Syagrus_rupicola','Syagrus_ruschiana','Syagrus_sancona','Syagrus_santosii','Syagrus_schizophylla','Syagrus_smithii','Syagrus_stenopetala','Syagrus_stratincola','Syagrus_vagans','Syagrus_vermicularis','Syagrus_werdermannii','Syagrus_yungasensis','Synechanthus_fibrosus','Synechanthus_warscewiczianus','Thrinax_excelsa','Thrinax_parviflora','Thrinax_radiata','Trithrinax_acanthocoma','Trithrinax_brasiliensis','Trithrinax_campestris','Trithrinax_schizophylla','Washingtonia_filifera','Washingtonia_robusta','Welfia_regia','Wendlandiella_gracilis','Wettinia_aequalis','Wettinia_aequatorialis','Wettinia_anomala','Wettinia_augusta','Wettinia_castanea','Wettinia_disticha','Wettinia_drudei','Wettinia_fascicularis','Wettinia_hirsuta','Wettinia_kalbreyeri','Wettinia_lanata','Wettinia_longipetala','Wettinia_maynensis','Wettinia_microcarpa','Wettinia_minima','Wettinia_oxycarpa','Wettinia_panamensis','Wettinia_praemorsa','Wettinia_quinaria','Wettinia_radiata','Wettinia_verruculosa','Zombia_antillarum']

print('Number of american palms:')
len(palms_americas)

Number of american palms:


810

So, what have we sequenced?

In [27]:
# first, we create sets of items to compare, using set comprehensions (single-line loops written within curly brackets)
sppseqed = {spp for spp in df[df['Continent'] == 'americas']['taxspecies'].unique()} # species already in the metadata
apspp = {spp for spp in palms_americas} # american palm species
# we can apply set theory to find the differences
# setA - setB is the set difference: items in setA that are not in setB
# (the symmetric difference, setA ^ setB, would also return items that are only in setB)
print(apspp - sppseqed) # american palm species that are not in the metadata yet
sppmissing = len(apspp - sppseqed)
print('\nAlleged missing species: %s'%(sppmissing))
print('Percentage of species missing: %s'%((sppmissing * 100)/len(palms_americas)))
print('Percentage of species sequenced: %s'%(100-(sppmissing * 100)/len(palms_americas)))

{'Chamaedorea_undulatifolia', 'Desmoncus_osensis', 'Attalea_macrolepis', 'Bactris_balanophora', 'Brahea_edulis', 'Copernicia_humicola', 'Chamaedorea_smithii', 'Bactris_grayumii', 'Calyptrogyne_kunorum', 'Oenocarpus_simplex', 'Colpothrinax_cookii', 'Chamaedorea_latisecta', 'Desmoncus_myriacanthos', 'Reinhardtia_paiewonskiana', 'Bactris_nancibaensis', 'Brahea_calcarea', 'Oenocarpus_circumtextus', 'Attalea_wesselsboeri', 'Copernicia_molineti', 'Coccothrinax_scoparia', 'Geonoma_chococola', 'Roystonea_violacea', 'Geonoma_trigona', 'Oenocarpus_makeru', 'Coccothrinax_bermudezii', 'Socratea_rostrata', 'Calyptrogyne_trichostachys', 'Aiphanes_bicornis', 'Dictyocaryum_fuscum', 'Acoelorraphe_wrightii', 'Geonoma_longipedunculata', 'Wettinia_longipetala', 'Attalea_fairchildensis', 'Dictyocaryum_ptarianum', 'Attalea_lauromuelleriana', 'Wettinia_oxycarpa', 'Bactris_cubensis', 'Bactris_sphaerocarpa', 'Wettinia_anomala', 'Geonoma_scoparia', 'Chamaedorea_queroana', 'Socratea_hecatonandra', 'Astrocaryum_p

That looks like a very low percentage. But there are a bunch of sequences we have not logged in the metadata. We can have a look at lists of species other people sequenced and count them in.

The datasets that are missing:  
- Chamaedoreas from Elliot Gardner
- Species extracted by Paola Lima
- Acrocomia from Brazil researchers
- Geonoma from Loiseau et al. 2019

We can check lists of files and metadata for those datasets

First, Chamaedoreas. They are conveniently named already. All I'll do is to create a list of files in a text file and work from there.

If running this on a local disk, you can use jupyter notebook's magic commands for bash


```
%%bash

ls *gz > files_chamaedorea.txt
```
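
If bash is not available, the same list can be written from Python. This is a minimal sketch using only the standard library; the function name `write_file_list` and its default arguments are illustrative and not part of the original workflow.

```python
from pathlib import Path

def write_file_list(folder='.', pattern='*gz', out_name='files_chamaedorea.txt'):
    """Write the names of the files matching `pattern` in `folder` to a text file, one per line."""
    folder = Path(folder)
    # sorted so the order is stable, similar to what `ls` prints
    names = sorted(p.name for p in folder.glob(pattern) if p.is_file())
    (folder / out_name).write_text(''.join(name + '\n' for name in names))
    return names
```

Calling `write_file_list()` in the folder holding the reads should produce a file equivalent to the one created by the bash cell.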


In [28]:
def get_data():
  return pd.read_csv('./files_chamaedorea.txt', sep = '\t', names = ['file']) # the file doesn't have a header

# try reading the local file; download it only if it does not exist yet (keep in mind the date of your copy)
try:
  eg = get_data()
except FileNotFoundError:
  url = 'https://raw.githubusercontent.com/mftorres/APP/main/data/files_chamaedorea.txt' # shared link, be careful who you share this with
  print('Downloading')
  res = rs.get(url = url)
  res.raise_for_status() # stop here if the download failed
  with open('./files_chamaedorea.txt', 'wb') as file: # writing the document in binary mode so python doesn't make changes
    file.write(res.content)
  eg = get_data()

eg # check the database

Downloading


Unnamed: 0,file
0,Chamaedorea_aff_nationsiana_Steyermark41863_S2...
1,Chamaedorea_aff_nationsiana_Steyermark41863_S2...
2,Chamaedorea_arenbergiana_Hawkins228_R1_001.fas...
3,Chamaedorea_arenbergiana_Hawkins228_R2_001.fas...
4,Chamaedorea_binderi_Grayum3351_R1_001.fastq.gz*
...,...
75,Chamaedorea_vulgata_Standley67344_R2_001.fastq...
76,Chamaedorea_zamorae_Azofeifa484_R1_001.fastq.gz*
77,Chamaedorea_zamorae_Azofeifa484_R2_001.fastq.gz*
78,Chamaedorea_zamorae_Azofeifa484_S36_L001_R1_00...


That file lists 80 files of compressed paired-end reads, forward and reverse.
We need to add the file names to the database. We keep one row per file so that single and paired-end reads can be mixed.
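
Before adding file names to the database it helps to clean them and to extract the read direction. The sketch below assumes the names end in `_R1_001.fastq.gz` or `_R2_001.fastq.gz`, as the Chamaedorea files listed above do, and that a trailing `*` (visible in the listing) is an `ls` decoration rather than part of the name. The helper names are hypothetical.

```python
import re

# assumed naming scheme: ..._R1_001.fastq.gz or ..._R2_001.fastq.gz
SENSE_PATTERN = re.compile(r'_(R[12])_\d+\.fastq\.gz$')

def clean_name(raw):
    """Strip surrounding whitespace and a trailing '*' from a listed file name."""
    return raw.strip().rstrip('*')

def read_sense(filename):
    """Return 'R1' or 'R2', or None when the name does not follow the assumed scheme."""
    match = SENSE_PATTERN.search(filename)
    return match.group(1) if match else None

example = clean_name('Chamaedorea_binderi_Grayum3351_R1_001.fastq.gz*')
print(example, read_sense(example))
```

In pandas, the same pattern could be applied to a whole column with `str.extract`.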

The column for files looks like this:

In [29]:
df.columns
df['filename']

0          1_151124_000000000-AJE6N_P3252_1001_1.fastq.gz
1          1_151124_000000000-AJE6N_P3252_1001_2.fastq.gz
2          1_151130_000000000-AK5EU_P3252_1126_1.fastq.gz
3          1_151130_000000000-AK5EU_P3252_1126_2.fastq.gz
4          1_151130_000000000-AKG91_P3252_1182_1.fastq.gz
                              ...                        
3044    Chamaedorea_vulgata_Standley67344_R2_001.fastq.gz
3045      Chamaedorea_zamorae_Azofeifa484_R1_001.fastq.gz
3046      Chamaedorea_zamorae_Azofeifa484_R2_001.fastq.gz
3047    Chamaedorea_zamorae_Azofeifa484_S36_L001_R1_00...
3048    Chamaedorea_zamorae_Azofeifa484_S36_L001_R2_00...
Name: filename, Length: 3049, dtype: object

Well, it looks like I had already added those files to the metadata. Let's see if the other fields are filled. They look pretty much complete, so we can go ahead with the other datasets.

In [30]:
df[df['filename'].str.contains('Chamaedorea')] # filters the database by looking for a string match in the file name

Unnamed: 0,dataset,provider,pi_code,originalmetadata,filename,sense,filecode,samplecode,library_index,botanic_garden,voucher,taxgenus,taxspecies,ifmorphotype,ifpopulation,sent_to_Cano,newfilename,raw_reads,reads_trimed_paired_flag,reads_trimed_paired,reads_trimed_single,percentage_lost_tosingles,pcr_filtered,collection_year,Continent,country,long,lat,flag,notes
2969,EliottGardner,Eliott Gardner,EGR,none,Chamaedorea_aff_nationsiana_Steyermark41863_S2...,R1,Steyermark41863_S26_L001,,,,,Chamaedorea,Chamaedorea_nationsiana,,,,,,,,,,,,americas,,,,,
2970,EliottGardner,Eliott Gardner,EGR,none,Chamaedorea_aff_nationsiana_Steyermark41863_S2...,R2,Steyermark41863_S26_L001,,,,,Chamaedorea,Chamaedorea_nationsiana,,,,,,,,,,,,americas,,,,,
2971,EliottGardner,Eliott Gardner,EGR,none,Chamaedorea_arenbergiana_Hawkins228_R1_001.fas...,R1,Hawkins228,,,,,Chamaedorea,Chamaedorea_arenbergiana,,,,,,,,,,,,americas,,,,,
2972,EliottGardner,Eliott Gardner,EGR,none,Chamaedorea_arenbergiana_Hawkins228_R2_001.fas...,R2,Hawkins228,,,,,Chamaedorea,Chamaedorea_arenbergiana,,,,,,,,,,,,americas,,,,,
2973,EliottGardner,Eliott Gardner,EGR,none,Chamaedorea_binderi_Grayum3351_R1_001.fastq.gz,R1,Grayum3351,,,,,Chamaedorea,Chamaedorea_binderi,,,,,,,,,,,,americas,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3044,EliottGardner,Eliott Gardner,EGR,none,Chamaedorea_vulgata_Standley67344_R2_001.fastq.gz,R2,Standley67344,,,,,Chamaedorea,Chamaedorea_vulgata,,,,,,,,,,,,americas,,,,,
3045,EliottGardner,Eliott Gardner,EGR,none,Chamaedorea_zamorae_Azofeifa484_R1_001.fastq.gz,R1,Azofeifa484,,,,,Chamaedorea,Chamaedorea_zamorae,,,,,,,,,,,,americas,,,,,
3046,EliottGardner,Eliott Gardner,EGR,none,Chamaedorea_zamorae_Azofeifa484_R2_001.fastq.gz,R2,Azofeifa484,,,,,Chamaedorea,Chamaedorea_zamorae,,,,,,,,,,,,americas,,,,,
3047,EliottGardner,Eliott Gardner,EGR,none,Chamaedorea_zamorae_Azofeifa484_S36_L001_R1_00...,R1,Azofeifa484_S36_L001,,,,,Chamaedorea,Chamaedorea_zamorae,,,,,,,,,,,,americas,,,,,


Let's check the data sequenced by Paola. There's a list of species:

In [31]:
def get_data():
  return pd.read_csv('./metadata_machinereadable_Paola_2020.txt', sep = '\t') # tab-separated; the first row is the header

# try reading the local file; download it only if it does not exist yet (keep in mind the date of your copy)
try:
  seqp = get_data()
except FileNotFoundError:
  url = 'https://raw.githubusercontent.com/mftorres/APP/main/data/metadata_machinereadable_Paola_2020.txt' # shared link, be careful who you share this with
  print('Downloading')
  res = rs.get(url = url)
  res.raise_for_status() # stop here if the download failed
  with open('./metadata_machinereadable_Paola_2020.txt', 'wb') as file: # writing the document in binary mode so python doesn't make changes
    file.write(res.content)
  seqp = get_data()

seqp # check the database

Downloading


Unnamed: 0,plate,Rapid_Genomics_Identification,Plate,Column Sort,Silica number,Species,Collector and Number,Final Concentration (ng/ul),Nanodrop (260/280)
0,1,GOT_130404_P001_WA01,A1,1,3174,Acoelorrhaphe wrightii,H. Balslev 8156,50.0,1.76
1,1,GOT_130404_P001_WA02,A2,9,1156,Aiphanes bicornis,J. West s.n.,50.0,1.82
2,1,GOT_130404_P001_WA03,A3,17,261,Aiphanes chiribogensis,F. Borchsenius 652,50.0,1.80
3,1,GOT_130404_P001_WA04,A4,25,3563,Aiphanes deltoidea,L.A. Nuñez s.n.,50.0,1.84
4,1,GOT_130404_P001_WA05,A5,33,397,Aiphanes duquei,R. Bernal 2852,50.0,1.78
...,...,...,...,...,...,...,...,...,...
118,2,GOT_130404_P002_WB11,B11,82,431,Wettinia praemorsa,R. Bernal 2883,50.0,1.82
119,2,GOT_130404_P002_WB12,B12,90,434,Wettinia quinaria,R. Bernal 2491,50.0,1.73
120,2,GOT_130404_P002_WC01,C1,3,436,Wettinia radiata,R. Bernal 2190,50.0,1.75
121,2,GOT_130404_P002_WC02,C2,11,3514,Wettinia verruculosa,R. Bernal 2500,50.0,1.90


In [32]:
# now the rapid genomics data
def get_data():
  return pd.read_csv('./GOT_130404_SampleSheet.csv', sep = '\t') # tab-separated despite the .csv extension; the first row is the header

# try reading the local file; download it only if it does not exist yet (keep in mind the date of your copy)
try:
  seqprapid = get_data()
except FileNotFoundError:
  url = 'https://raw.githubusercontent.com/mftorres/APP/main/data/GOT_130404_SampleSheet.csv' # shared link, be careful who you share this with
  print('Downloading')
  res = rs.get(url = url)
  res.raise_for_status() # stop here if the download failed
  with open('./GOT_130404_SampleSheet.csv', 'wb') as file: # writing the document in binary mode so python doesn't make changes
    file.write(res.content)
  seqprapid = get_data()

seqprapid # check the database

Downloading


Unnamed: 0,RG_Sample_Code,Customer_Code,i5_Barcode_Seq,i7_Barcode_Seq,Sequence_Name,Sequencing_Cycle
0,GOT_130404_P001_WA01,,TATGCAGT,GTGTTCTA,RAPiD-Genomics_F147_GOT_130404_P001_WA01_i5-51...,2x150
1,GOT_130404_P001_WA02,,TATGCAGT,AGTCACTA,RAPiD-Genomics_F147_GOT_130404_P001_WA02_i5-51...,2x150
2,GOT_130404_P001_WA03,,TATGCAGT,ATTGGCTC,RAPiD-Genomics_F147_GOT_130404_P001_WA03_i5-51...,2x150
3,GOT_130404_P001_WA04,,TATGCAGT,CAGATCTG,RAPiD-Genomics_F147_GOT_130404_P001_WA04_i5-51...,2x150
4,GOT_130404_P001_WA05,,TATGCAGT,CCGTGAGA,RAPiD-Genomics_F147_GOT_130404_P001_WA05_i5-51...,2x150
...,...,...,...,...,...,...
118,GOT_130404_P002_WB11,,TACTCCTT,ACCACTGT,RAPiD-Genomics_F147_GOT_130404_P002_WB11_i5-51...,2x150
119,GOT_130404_P002_WB12,,TACTCCTT,ACAGATTC,RAPiD-Genomics_F147_GOT_130404_P002_WB12_i5-51...,2x150
120,GOT_130404_P002_WC01,,TACTCCTT,TGGAACAA,RAPiD-Genomics_F147_GOT_130404_P002_WC01_i5-51...,2x150
121,GOT_130404_P002_WC02,,TACTCCTT,TGGTGGTA,RAPiD-Genomics_F147_GOT_130404_P002_WC02_i5-51...,2x150
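
The last three cells repeat the same "read the local copy, otherwise download it" pattern. One way to avoid the repetition is a small helper, sketched here with the standard library only (`urllib` instead of `requests`). The names `fetch_bytes` and `ensure_local_copy` are hypothetical, and the `fetch` argument exists only so the download step can be swapped out or tested.

```python
from pathlib import Path
from urllib.request import urlopen

def fetch_bytes(url):
    """Download a URL and return its raw bytes."""
    with urlopen(url) as response:
        return response.read()

def ensure_local_copy(path, url, fetch=fetch_bytes):
    """Return `path`, downloading it from `url` first if it is not on disk yet."""
    path = Path(path)
    if not path.exists():
        print('Downloading', url)
        path.write_bytes(fetch(url))  # bytes are written unchanged
    return path
```

With such a helper, each cell would reduce to something like `pd.read_csv(ensure_local_copy('./GOT_130404_SampleSheet.csv', url), sep = '\t')`.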


In [33]:
df.columns
df['dataset'].unique() # looks like these sequences aren't there. We can fill up the dataset then.

array(['Cano', 'Lausanne_Cerox', 'American Palms', 'Astrocaryum',
       'Mauritia', 'Zizka', 'Raphia', 'Geonoma', 'EliottGardner'],
      dtype=object)

We can create new columns in seqprapid to hold the information needed in df and then concatenate both.

File names can be extracted from the 'Sequence_Name' column in seqprapid. We can add the species and voucher from seqp to seqprapid, then duplicate the rows to create the R2 read entries, rename columns, and concatenate with df.

In [34]:
# create a column with file information, forward and reverse
# will create empty lists that I will join in a dataframe later

# first, a dictionary object linking sample codes with species and vouchers
# a dictionary has keys (unique) and values; the latter can be strings, lists, other dictionaries, etc.

# in this case keys are the rapid genomics identifications
# values are tuples of species (replacing space with underscore) and voucher
id_spp_dict = dict(zip(seqp['Rapid_Genomics_Identification'],zip(seqp['Species'].str.replace(' ','_'),seqp['Collector and Number']))) # think of zip as a jacket zip: it makes tuples out of the values found at the same position

# looks like this
id_spp_dict

{'GOT_130404_P001_WA01': ('Acoelorrhaphe_wrightii', 'H. Balslev 8156'),
 'GOT_130404_P001_WA02': ('Aiphanes_bicornis', 'J. West s.n.'),
 'GOT_130404_P001_WA03': ('Aiphanes_chiribogensis', 'F. Borchsenius 652'),
 'GOT_130404_P001_WA04': ('Aiphanes_deltoidea', 'L.A. Nuñez s.n.'),
 'GOT_130404_P001_WA05': ('Aiphanes_duquei', 'R. Bernal 2852'),
 'GOT_130404_P001_WA06': ('Aiphanes_eggersii', 'J.C.Pintaud 1476'),
 'GOT_130404_P001_WA07': ('Aiphanes_graminifolia', 'R.Bernal 4825'),
 'GOT_130404_P001_WA08': ('Aiphanes_grandis', 'F. Borchsenius 648'),
 'GOT_130404_P001_WA09': ('Aiphanes_verrucosa', 'F. Borchsenius 641'),
 'GOT_130404_P001_WA10': ('Aiphanes_weberbaueri', 'H.Balslev 7769'),
 'GOT_130404_P001_WA11': ('Asterogyne_guianensis', 'J. Roncal 406'),
 'GOT_130404_P001_WA12': ('Asterogyne_ramosa', 'Private collection'),
 'GOT_130404_P001_WB01': ('Astrocaryum_chambira', 'T. Emilio 864'),
 'GOT_130404_P001_WB02': ('Astrocaryum_faranae', 'H. Balslev 7897'),
 'GOT_130404_P001_WB03': ('Astrocar

In [35]:
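# now, a dictionary for sample code: plate position + voucher stripped of punctuation and spaces
# regex = True is needed in recent pandas versions, where str.replace treats the pattern as literal text by default
id_code_dict = dict(zip(seqp['Rapid_Genomics_Identification'],seqp['Plate']+'_'+seqp['Collector and Number'].str.replace(r'[^a-zA-Z \d]','', regex = True).str.replace(' ','')))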
id_code_dict

{'GOT_130404_P001_WA01': 'A1_HBalslev8156',
 'GOT_130404_P001_WA02': 'A2_JWestsn',
 'GOT_130404_P001_WA03': 'A3_FBorchsenius652',
 'GOT_130404_P001_WA04': 'A4_LANuezsn',
 'GOT_130404_P001_WA05': 'A5_RBernal2852',
 'GOT_130404_P001_WA06': 'A6_JCPintaud1476',
 'GOT_130404_P001_WA07': 'A7_RBernal4825',
 'GOT_130404_P001_WA08': 'A8_FBorchsenius648',
 'GOT_130404_P001_WA09': 'A9_FBorchsenius641',
 'GOT_130404_P001_WA10': 'A10_HBalslev7769',
 'GOT_130404_P001_WA11': 'A11_JRoncal406',
 'GOT_130404_P001_WA12': 'A12_Privatecollection',
 'GOT_130404_P001_WB01': 'B1_TEmilio864',
 'GOT_130404_P001_WB02': 'B2_HBalslev7897',
 'GOT_130404_P001_WB03': 'B3_TPLCouvreur289',
 'GOT_130404_P001_WB04': 'B4_HBalslev7347',
 'GOT_130404_P001_WB05': 'B5_RBernal4373',
 'GOT_130404_P001_WB06': 'B6_DeGranvilleJJ14511',
 'GOT_130404_P001_WB07': 'B7_JCPintaud901',
 'GOT_130404_P001_WB08': 'B8_BMaguire54126',
 'GOT_130404_P001_WB09': 'B9_FBorchseniusnovoucher',
 'GOT_130404_P001_WB10': 'B10_MJBalic1626',
 'GOT_130404
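
Note that the character class above also drops accented letters, which is why 'L.A. Nuñez s.n.' becomes 'LANuezsn'. If keeping the base letter is preferred, the accents can be removed before filtering. This is a sketch with the standard library; `strip_accents` and `voucher_code` are illustrative names, and whether the transliterated form is wanted in the metadata is a decision for the curators.

```python
import re
import unicodedata

def strip_accents(text):
    """Replace accented letters with their base ASCII letter (ñ -> n, é -> e)."""
    decomposed = unicodedata.normalize('NFKD', text)
    # characters without an ASCII decomposition are dropped by 'ignore'
    return decomposed.encode('ascii', 'ignore').decode('ascii')

def voucher_code(voucher, keep_base_letters=False):
    """Keep only ASCII letters and digits from a voucher string."""
    if keep_base_letters:
        voucher = strip_accents(voucher)
    return re.sub(r'[^a-zA-Z0-9]', '', voucher)

print(voucher_code('L.A. Nuñez s.n.'))
print(voucher_code('L.A. Nuñez s.n.', keep_base_letters=True))
```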

In [36]:
# you can access the values for a particular key like this
# dictionaries are a cool object that R doesn't have..jejeje
# GOT_13etc is the key
print(id_spp_dict['GOT_130404_P001_WA01'])

# to access only the species:
print('\n',id_spp_dict['GOT_130404_P001_WA01'][0]) # python, unlike R, is zero-indexed. The first item of any sequence has index 0

# to access only the voucher:
print('\n',id_spp_dict['GOT_130404_P001_WA01'][1])

('Acoelorrhaphe_wrightii', 'H. Balslev 8156')

 Acoelorrhaphe_wrightii

 H. Balslev 8156
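
Asking a dictionary for a key it does not contain with square brackets raises a `KeyError`. The toy example below rebuilds a two-entry version of the dictionary and shows a tolerant lookup with `.get()`. `HYPOTHETICAL_CODE` is a made-up sample code and `lookup` is an illustrative helper, not part of the notebook.

```python
codes = ['GOT_130404_P001_WA01', 'GOT_130404_P001_WA02']
species = ['Acoelorrhaphe_wrightii', 'Aiphanes_bicornis']
vouchers = ['H. Balslev 8156', 'J. West s.n.']

# the inner zip pairs species with voucher; the outer zip pairs each code with that tuple
toy_dict = dict(zip(codes, zip(species, vouchers)))

def lookup(sample_code, table, default=('', '')):
    """Return the (species, voucher) tuple, or `default` when the code is unknown."""
    return table.get(sample_code, default)

# codes present in a sample sheet but absent from the dictionary
sheet_codes = ['GOT_130404_P001_WA01', 'HYPOTHETICAL_CODE']
unknown = [code for code in sheet_codes if code not in toy_dict]
print(unknown)
```

Listing the unknown codes first makes it clear which samples lack metadata before any column is filled in.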


In [43]:
# creating a column with species and voucher in the rapidgenomics dataframe
seqprapid.columns # print them to make sure you aren't replacing any column

seqprapid['taxspecies'] = ''
seqprapid['voucher'] = ''
seqprapid['taxgenus'] = ''
for row in seqprapid.itertuples():
  seqprapid.loc[row.Index,'taxspecies'] = id_spp_dict[row.RG_Sample_Code][0]
  seqprapid.loc[row.Index,'taxgenus'] = id_spp_dict[row.RG_Sample_Code][0].split('_')[0]
  seqprapid.loc[row.Index,'voucher'] = id_spp_dict[row.RG_Sample_Code][1]
seqprapid['dataset'] = 'APP'
seqprapid['provider'] = 'Paola Lima'
seqprapid['pi_code'] = 'PL'
seqprapid['Continent'] = 'americas'
seqprapid['sense'] = 'R1'

seqprapid['samplecode'] = seqprapid['RG_Sample_Code'].map(id_code_dict)

seqprapid['originalmetadata'] = 'GOT_130404_SampleSheet metadata_machinereadable_Paola_2020'
seqprapid['library_index'] = seqprapid['i5_Barcode_Seq']+'+'+seqprapid['i7_Barcode_Seq']

seqprapid.rename(columns = {'Sequence_Name':'filename'}, inplace = True) # uses a dictionary with keys (old name) and value (new name)
seqprapid

Unnamed: 0,RG_Sample_Code,Customer_Code,i5_Barcode_Seq,i7_Barcode_Seq,filename,Sequencing_Cycle,taxspecies,voucher,taxgenus,dataset,provider,pi_code,Continent,sense,originalmetadata,samplecode,library_index
0,GOT_130404_P001_WA01,,TATGCAGT,GTGTTCTA,RAPiD-Genomics_F147_GOT_130404_P001_WA01_i5-51...,2x150,Acoelorrhaphe_wrightii,H. Balslev 8156,Acoelorrhaphe,APP,Paola Lima,PL,americas,R1,GOT_130404_SampleSheet metadata_machinereadabl...,A1_HBalslev8156,TATGCAGT+GTGTTCTA
1,GOT_130404_P001_WA02,,TATGCAGT,AGTCACTA,RAPiD-Genomics_F147_GOT_130404_P001_WA02_i5-51...,2x150,Aiphanes_bicornis,J. West s.n.,Aiphanes,APP,Paola Lima,PL,americas,R1,GOT_130404_SampleSheet metadata_machinereadabl...,A2_JWestsn,TATGCAGT+AGTCACTA
2,GOT_130404_P001_WA03,,TATGCAGT,ATTGGCTC,RAPiD-Genomics_F147_GOT_130404_P001_WA03_i5-51...,2x150,Aiphanes_chiribogensis,F. Borchsenius 652,Aiphanes,APP,Paola Lima,PL,americas,R1,GOT_130404_SampleSheet metadata_machinereadabl...,A3_FBorchsenius652,TATGCAGT+ATTGGCTC
3,GOT_130404_P001_WA04,,TATGCAGT,CAGATCTG,RAPiD-Genomics_F147_GOT_130404_P001_WA04_i5-51...,2x150,Aiphanes_deltoidea,L.A. Nuñez s.n.,Aiphanes,APP,Paola Lima,PL,americas,R1,GOT_130404_SampleSheet metadata_machinereadabl...,A4_LANuezsn,TATGCAGT+CAGATCTG
4,GOT_130404_P001_WA05,,TATGCAGT,CCGTGAGA,RAPiD-Genomics_F147_GOT_130404_P001_WA05_i5-51...,2x150,Aiphanes_duquei,R. Bernal 2852,Aiphanes,APP,Paola Lima,PL,americas,R1,GOT_130404_SampleSheet metadata_machinereadabl...,A5_RBernal2852,TATGCAGT+CCGTGAGA
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
118,GOT_130404_P002_WB11,,TACTCCTT,ACCACTGT,RAPiD-Genomics_F147_GOT_130404_P002_WB11_i5-51...,2x150,Wettinia_praemorsa,R. Bernal 2883,Wettinia,APP,Paola Lima,PL,americas,R1,GOT_130404_SampleSheet metadata_machinereadabl...,B11_RBernal2883,TACTCCTT+ACCACTGT
119,GOT_130404_P002_WB12,,TACTCCTT,ACAGATTC,RAPiD-Genomics_F147_GOT_130404_P002_WB12_i5-51...,2x150,Wettinia_quinaria,R. Bernal 2491,Wettinia,APP,Paola Lima,PL,americas,R1,GOT_130404_SampleSheet metadata_machinereadabl...,B12_RBernal2491,TACTCCTT+ACAGATTC
120,GOT_130404_P002_WC01,,TACTCCTT,TGGAACAA,RAPiD-Genomics_F147_GOT_130404_P002_WC01_i5-51...,2x150,Wettinia_radiata,R. Bernal 2190,Wettinia,APP,Paola Lima,PL,americas,R1,GOT_130404_SampleSheet metadata_machinereadabl...,C1_RBernal2190,TACTCCTT+TGGAACAA
121,GOT_130404_P002_WC02,,TACTCCTT,TGGTGGTA,RAPiD-Genomics_F147_GOT_130404_P002_WC02_i5-51...,2x150,Wettinia_verruculosa,R. Bernal 2500,Wettinia,APP,Paola Lima,PL,americas,R1,GOT_130404_SampleSheet metadata_machinereadabl...,C2_RBernal2500,TACTCCTT+TGGTGGTA
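
The row-by-row loop works well for a table of this size. As an alternative, the same columns can be filled with `Series.map`, which looks up every value of a column in a dictionary at once. The sketch uses toy stand-ins for `id_spp_dict` and `seqprapid`; the codes `S1` to `S3` are made up, and `S3` deliberately has no metadata.

```python
import pandas as pd

# toy stand-ins; 'S3' is a hypothetical code without an entry in the dictionary
toy_dict = {'S1': ('Aiphanes_bicornis', 'J. West s.n.'),
            'S2': ('Wettinia_radiata', 'R. Bernal 2190')}
toy = pd.DataFrame({'RG_Sample_Code': ['S1', 'S2', 'S3']})

# split the tuple values into one plain dictionary per column
species_by_code = {code: pair[0] for code, pair in toy_dict.items()}
voucher_by_code = {code: pair[1] for code, pair in toy_dict.items()}

toy['taxspecies'] = toy['RG_Sample_Code'].map(species_by_code)
toy['voucher'] = toy['RG_Sample_Code'].map(voucher_by_code)
toy['taxgenus'] = toy['taxspecies'].str.split('_').str[0]
print(toy)
```

One difference to keep in mind: the loop stops with a `KeyError` when a code is missing from the dictionary, whereas `map` leaves a missing value in that row.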


In [51]:
# duplicate the dataframe to create the entries for the reverse (R2) files
seqprapidR2 = seqprapid.copy(deep = True)
seqprapidR2['filename'] = seqprapidR2['filename'].str.replace('_R1_','_R2_')
seqprapidR2['sense'] = seqprapidR2['sense'].str.replace('R1','R2')

list(seqprapidR2['filename'])

['RAPiD-Genomics_F147_GOT_130404_P001_WA01_i5-514_i7-59_S42_L001_R2_001.fastq.gz',
 'RAPiD-Genomics_F147_GOT_130404_P001_WA02_i5-514_i7-27_S43_L001_R2_001.fastq.gz',
 'RAPiD-Genomics_F147_GOT_130404_P001_WA03_i5-514_i7-82_S44_L001_R2_001.fastq.gz',
 'RAPiD-Genomics_F147_GOT_130404_P001_WA04_i5-514_i7-7_S45_L001_R2_001.fastq.gz',
 'RAPiD-Genomics_F147_GOT_130404_P001_WA05_i5-514_i7-38_S46_L001_R2_001.fastq.gz',
 'RAPiD-Genomics_F147_GOT_130404_P001_WA06_i5-514_i7-74_S47_L001_R2_001.fastq.gz',
 'RAPiD-Genomics_F147_GOT_130404_P001_WA07_i5-514_i7-77_S48_L001_R2_001.fastq.gz',
 'RAPiD-Genomics_F147_GOT_130404_P001_WA08_i5-514_i7-16_S49_L001_R2_001.fastq.gz',
 'RAPiD-Genomics_F147_GOT_130404_P001_WA09_i5-514_i7-36_S50_L001_R2_001.fastq.gz',
 'RAPiD-Genomics_F147_GOT_130404_P001_WA10_i5-514_i7-54_S51_L001_R2_001.fastq.gz',
 'RAPiD-Genomics_F147_GOT_130404_P001_WA11_i5-514_i7-25_S52_L001_R2_001.fastq.gz',
 'RAPiD-Genomics_F147_GOT_130404_P001_WA12_i5-514_i7-23_S53_L001_R2_001.fastq.gz',
 'RAP
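
Since the R2 names are generated by text replacement rather than read from disk, it can be worth checking that every file has its mate. A small sketch, assuming the `_R1_` / `_R2_` tag appears exactly once in each name; the function name `unpaired_reads` is hypothetical.

```python
def unpaired_reads(filenames):
    """Return the file names whose mate (R1 <-> R2) is absent from the list."""
    names = set(filenames)
    missing_mate = []
    for name in sorted(names):
        if '_R1_' in name:
            mate = name.replace('_R1_', '_R2_')
        elif '_R2_' in name:
            mate = name.replace('_R2_', '_R1_')
        else:
            continue  # name does not follow the assumed _R1_/_R2_ scheme
        if mate not in names:
            missing_mate.append(name)
    return missing_mate
```

The function can be run on the generated names or, better, on a listing of the files actually on disk; an empty list means every forward file has a reverse partner.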

In [52]:
# concatenate dataframes for forward and reverse files and sort by filename
seqprapidall = pd.concat([seqprapid, seqprapidR2]).sort_values(by = 'filename').reset_index()
seqprapidall

Unnamed: 0,index,RG_Sample_Code,Customer_Code,i5_Barcode_Seq,i7_Barcode_Seq,filename,Sequencing_Cycle,taxspecies,voucher,taxgenus,dataset,provider,pi_code,Continent,sense,originalmetadata,samplecode,library_index
0,0,GOT_130404_P001_WA01,,TATGCAGT,GTGTTCTA,RAPiD-Genomics_F147_GOT_130404_P001_WA01_i5-51...,2x150,Acoelorrhaphe_wrightii,H. Balslev 8156,Acoelorrhaphe,APP,Paola Lima,PL,americas,R1,GOT_130404_SampleSheet metadata_machinereadabl...,A1_HBalslev8156,TATGCAGT+GTGTTCTA
1,0,GOT_130404_P001_WA01,,TATGCAGT,GTGTTCTA,RAPiD-Genomics_F147_GOT_130404_P001_WA01_i5-51...,2x150,Acoelorrhaphe_wrightii,H. Balslev 8156,Acoelorrhaphe,APP,Paola Lima,PL,americas,R2,GOT_130404_SampleSheet metadata_machinereadabl...,A1_HBalslev8156,TATGCAGT+GTGTTCTA
2,1,GOT_130404_P001_WA02,,TATGCAGT,AGTCACTA,RAPiD-Genomics_F147_GOT_130404_P001_WA02_i5-51...,2x150,Aiphanes_bicornis,J. West s.n.,Aiphanes,APP,Paola Lima,PL,americas,R1,GOT_130404_SampleSheet metadata_machinereadabl...,A2_JWestsn,TATGCAGT+AGTCACTA
3,1,GOT_130404_P001_WA02,,TATGCAGT,AGTCACTA,RAPiD-Genomics_F147_GOT_130404_P001_WA02_i5-51...,2x150,Aiphanes_bicornis,J. West s.n.,Aiphanes,APP,Paola Lima,PL,americas,R2,GOT_130404_SampleSheet metadata_machinereadabl...,A2_JWestsn,TATGCAGT+AGTCACTA
4,2,GOT_130404_P001_WA03,,TATGCAGT,ATTGGCTC,RAPiD-Genomics_F147_GOT_130404_P001_WA03_i5-51...,2x150,Aiphanes_chiribogensis,F. Borchsenius 652,Aiphanes,APP,Paola Lima,PL,americas,R1,GOT_130404_SampleSheet metadata_machinereadabl...,A3_FBorchsenius652,TATGCAGT+ATTGGCTC
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
241,120,GOT_130404_P002_WC01,,TACTCCTT,TGGAACAA,RAPiD-Genomics_F147_GOT_130404_P002_WC01_i5-51...,2x150,Wettinia_radiata,R. Bernal 2190,Wettinia,APP,Paola Lima,PL,americas,R2,GOT_130404_SampleSheet metadata_machinereadabl...,C1_RBernal2190,TACTCCTT+TGGAACAA
242,121,GOT_130404_P002_WC02,,TACTCCTT,TGGTGGTA,RAPiD-Genomics_F147_GOT_130404_P002_WC02_i5-51...,2x150,Wettinia_verruculosa,R. Bernal 2500,Wettinia,APP,Paola Lima,PL,americas,R1,GOT_130404_SampleSheet metadata_machinereadabl...,C2_RBernal2500,TACTCCTT+TGGTGGTA
243,121,GOT_130404_P002_WC02,,TACTCCTT,TGGTGGTA,RAPiD-Genomics_F147_GOT_130404_P002_WC02_i5-51...,2x150,Wettinia_verruculosa,R. Bernal 2500,Wettinia,APP,Paola Lima,PL,americas,R2,GOT_130404_SampleSheet metadata_machinereadabl...,C2_RBernal2500,TACTCCTT+TGGTGGTA
244,122,GOT_130404_P002_WC03,,TACTCCTT,CGAACTTA,RAPiD-Genomics_F147_GOT_130404_P002_WC03_i5-51...,2x150,Mauritiella_armata,J.C.Pintaud 1488,Mauritiella,APP,Paola Lima,PL,americas,R1,GOT_130404_SampleSheet metadata_machinereadabl...,C3_JCPintaud1488,TACTCCTT+CGAACTTA


Now, we can concatenate the dataframes

In [53]:
df = pd.concat([df, seqprapidall[['filename','taxspecies', 'voucher','taxgenus',
       'dataset', 'provider', 'pi_code', 'Continent', 'sense',
       'originalmetadata', 'samplecode', 'library_index' ]]])
df

Unnamed: 0,dataset,provider,pi_code,originalmetadata,filename,sense,filecode,samplecode,library_index,botanic_garden,voucher,taxgenus,taxspecies,ifmorphotype,ifpopulation,sent_to_Cano,newfilename,raw_reads,reads_trimed_paired_flag,reads_trimed_paired,reads_trimed_single,percentage_lost_tosingles,pcr_filtered,collection_year,Continent,country,long,lat,flag,notes
0,Cano,Angela Cano,AC,Sent Appendix_SamplingPhylogeny_CentralAmerica...,1_151124_000000000-AJE6N_P3252_1001_1.fastq.gz,R1,AJE6N_P3252_1001,AJE6N_1001,CGATGT,G,Cano_A._etal__ACS338,Acrocomia,Acrocomia_aculeata,,,sent,Acr_acu_AJE6N1001_AC_R1.fastq.gz,,passed,,,,,,americas,Panama,,,,
1,Cano,Angela Cano,AC,Sent Appendix_SamplingPhylogeny_CentralAmerica...,1_151124_000000000-AJE6N_P3252_1001_2.fastq.gz,R2,AJE6N_P3252_1001,AJE6N_1001,CGATGT,G,Cano_A._etal__ACS338,Acrocomia,Acrocomia_aculeata,,,sent,Acr_acu_AJE6N1001_AC_R2.fastq.gz,,passed,,,,,,americas,Panama,,,,
2,Cano,Angela Cano,AC,Sent Appendix_SamplingPhylogeny_CentralAmerica...,1_151130_000000000-AK5EU_P3252_1126_1.fastq.gz,R1,AK5EU_P3252_1126,AK5EU_1126,CACCGG,FTBG,_FTBG_20040120A,Acrocomia,Acrocomia_crispa,,,sent,Acr_cri_AK5EU1126_AC_R1.fastq.gz,,failed,,,,,,americas,,,,,
3,Cano,Angela Cano,AC,Sent Appendix_SamplingPhylogeny_CentralAmerica...,1_151130_000000000-AK5EU_P3252_1126_2.fastq.gz,R2,AK5EU_P3252_1126,AK5EU_1126,CACCGG,FTBG,_FTBG_20040120A,Acrocomia,Acrocomia_crispa,,,sent,Acr_cri_AK5EU1126_AC_R2.fastq.gz,,failed,,,,,,americas,,,,,
4,Cano,Angela Cano,AC,Sent Appendix_SamplingPhylogeny_CentralAmerica...,1_151130_000000000-AKG91_P3252_1182_1.fastq.gz,R1,AKG91_P3252_1182,AKG91_1182,CTAGCT,JBP,Lorenzi_&_Soares_6762,Acrocomia,Acrocomia_emensis,,,sent,Acr_eme_AKG911182_AC_R1.fastq.gz,,passed,,,,,,americas,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
241,APP,Paola Lima,PL,GOT_130404_SampleSheet metadata_machinereadabl...,RAPiD-Genomics_F147_GOT_130404_P002_WC01_i5-51...,R2,,C1_RBernal2190,TACTCCTT+TGGAACAA,,R. Bernal 2190,Wettinia,Wettinia_radiata,,,,,,,,,,,,americas,,,,,
242,APP,Paola Lima,PL,GOT_130404_SampleSheet metadata_machinereadabl...,RAPiD-Genomics_F147_GOT_130404_P002_WC02_i5-51...,R1,,C2_RBernal2500,TACTCCTT+TGGTGGTA,,R. Bernal 2500,Wettinia,Wettinia_verruculosa,,,,,,,,,,,,americas,,,,,
243,APP,Paola Lima,PL,GOT_130404_SampleSheet metadata_machinereadabl...,RAPiD-Genomics_F147_GOT_130404_P002_WC02_i5-51...,R2,,C2_RBernal2500,TACTCCTT+TGGTGGTA,,R. Bernal 2500,Wettinia,Wettinia_verruculosa,,,,,,,,,,,,americas,,,,,
244,APP,Paola Lima,PL,GOT_130404_SampleSheet metadata_machinereadabl...,RAPiD-Genomics_F147_GOT_130404_P002_WC03_i5-51...,R1,,C3_JCPintaud1488,TACTCCTT+CGAACTTA,,J.C.Pintaud 1488,Mauritiella,Mauritiella_armata,,,,,,,,,,,,americas,,,,,


Let's count species again

In [54]:
# first, we create sets of items to compare, using set comprehensions (single-line loops written within curly brackets)
sppseqed = {spp for spp in df[df['Continent'] == 'americas']['taxspecies'].unique()} # species already in the metadata
apspp = {spp for spp in palms_americas} # american palm species
# we can apply set theory to find the differences
# setA - setB is the set difference: items in setA that are not in setB
# (the symmetric difference, setA ^ setB, would also return items that are only in setB)
print(apspp - sppseqed) # american palm species that are still not in the metadata
sppmissing = len(apspp - sppseqed)
print('\nAlleged missing species: %s'%(sppmissing))
print('Percentage of species missing: %s'%((sppmissing * 100)/len(palms_americas)))
print('Percentage of species sequenced: %s'%(100-(sppmissing * 100)/len(palms_americas)))

{'Attalea_exigua', 'Chamaedorea_undulatifolia', 'Euterpe_longibracteata', 'Lytocaryum_insigne', 'Desmoncus_osensis', 'Attalea_macrolepis', 'Geonoma_brenesii', 'Brahea_edulis', 'Copernicia_humicola', 'Chamaedorea_smithii', 'Bactris_grayumii', 'Attalea_salazarii', 'Aiphanes_multiplex', 'Euterpe_luminosa', 'Bactris_ptariana', 'Calyptrogyne_kunorum', 'Pholidostachys_amazonensis', 'Guihaia_grossifibrosa', 'Colpothrinax_cookii', 'Chamaedorea_latisecta', 'Chamaedorea_christinae', 'Coccothrinax_microphylla', 'Bactris_herrerana', 'Coccothrinax_acunana', 'Desmoncus_kunarius', 'Desmoncus_horridus', 'Reinhardtia_paiewonskiana', 'Bactris_nancibaensis', 'Lytocaryum_hoehnei', 'Brahea_calcarea', 'Geonoma_simplicifrons', 'Coccothrinax_muricata', 'Oenocarpus_circumtextus', 'Hyospathe_wendlandiana', 'Astrocaryum_aculeatissimum', 'Aiphanes_acanthophylla', 'Prestoea_carderi', 'Hemithrinax_rivularis', 'Attalea_wesselsboeri', 'Copernicia_molineti', 'Geonoma_poiteauana', 'Attalea_dahlgreniana', 'Coccothrinax_

Looks better, huh?
We can also double check that we are not missing species because of mislabeled continents. We already have the list of american species, so the Continent column is redundant, but it is better to have it.
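
Spelling variants are another way for a species to look missing: the first missing list above includes 'Acoelorraphe_wrightii', while Paola's sheet spells it 'Acoelorrhaphe wrightii'. The notebook imports fuzzywuzzy for partial matching; the sketch below uses `difflib` from the standard library instead, so it runs without extra packages. The function name and the 0.9 cutoff are assumptions that would need tuning on the real data.

```python
import difflib

def close_spellings(missing, sequenced, cutoff=0.9):
    """Map each 'missing' name to similarly spelled names found among the sequenced ones."""
    candidates = sorted(sequenced)
    suggestions = {}
    for name in sorted(missing):
        matches = difflib.get_close_matches(name, candidates, n=3, cutoff=cutoff)
        if matches:
            suggestions[name] = matches
    return suggestions

toy_missing = {'Acoelorraphe_wrightii', 'Brahea_edulis'}
toy_sequenced = {'Acoelorrhaphe_wrightii', 'Aiphanes_bicornis'}
print(close_spellings(toy_missing, toy_sequenced))
```

Any pair it reports is only a candidate; a person still has to decide which spelling is the accepted one.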

In [55]:
# first, we create sets of items to compare, using set comprehensions (single-line loops written within curly brackets)
sppseqed = {spp for spp in df['taxspecies'].unique()} # all species in the metadata, regardless of continent
apspp = {spp for spp in palms_americas} # american palm species
# we can apply set theory to find the differences
# setA - setB is the set difference: items in setA that are not in setB
# (the symmetric difference, setA ^ setB, would also return items that are only in setB)
print(apspp - sppseqed) # american palm species that are still not in the metadata
sppmissing = len(apspp - sppseqed)
print('\nAlleged missing species: %s'%(sppmissing))
print('Percentage of species missing: %s'%((sppmissing * 100)/len(palms_americas)))
print('Percentage of species sequenced: %s'%(100-(sppmissing * 100)/len(palms_americas)))

{'Attalea_exigua', 'Chamaedorea_undulatifolia', 'Euterpe_longibracteata', 'Lytocaryum_insigne', 'Desmoncus_osensis', 'Attalea_macrolepis', 'Geonoma_brenesii', 'Brahea_edulis', 'Copernicia_humicola', 'Chamaedorea_smithii', 'Bactris_grayumii', 'Attalea_salazarii', 'Aiphanes_multiplex', 'Euterpe_luminosa', 'Bactris_ptariana', 'Calyptrogyne_kunorum', 'Pholidostachys_amazonensis', 'Guihaia_grossifibrosa', 'Colpothrinax_cookii', 'Chamaedorea_latisecta', 'Chamaedorea_christinae', 'Coccothrinax_microphylla', 'Bactris_herrerana', 'Coccothrinax_acunana', 'Desmoncus_kunarius', 'Desmoncus_horridus', 'Reinhardtia_paiewonskiana', 'Bactris_nancibaensis', 'Lytocaryum_hoehnei', 'Brahea_calcarea', 'Geonoma_simplicifrons', 'Coccothrinax_muricata', 'Oenocarpus_circumtextus', 'Hyospathe_wendlandiana', 'Astrocaryum_aculeatissimum', 'Aiphanes_acanthophylla', 'Prestoea_carderi', 'Hemithrinax_rivularis', 'Attalea_wesselsboeri', 'Copernicia_molineti', 'Geonoma_poiteauana', 'Attalea_dahlgreniana', 'Coccothrinax_

Well, same thing. But it is always GOOD to double check things.
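
The counting cell has now been run three times with small changes, so it could be wrapped in a function. A minimal sketch follows; `coverage_report` is a hypothetical name and the species in the toy call are arbitrary examples, not results.

```python
def coverage_report(reference, observed):
    """Compare a reference species list with the species observed in the metadata."""
    reference, observed = set(reference), set(observed)
    missing = reference - observed       # expected but not sequenced
    unexpected = observed - reference    # sequenced but not in the reference list
    pct_missing = 100 * len(missing) / len(reference)
    return {'missing': missing, 'unexpected': unexpected,
            'pct_missing': pct_missing, 'pct_sequenced': 100 - pct_missing}

# toy call with made-up inputs
toy_report = coverage_report(['Sabal_minor', 'Sabal_yapa', 'Serenoa_repens', 'Welfia_regia'],
                             ['Sabal_minor', 'Welfia_regia', 'Raphia_farinifera'])
print(toy_report['pct_sequenced'])
```

On the real data the call would look like `coverage_report(palms_americas, df['taxspecies'].dropna().unique())`. The `unexpected` entry also gives the other half of the comparison: names in the metadata that are not in the list of american palms.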

The next thing to do is the Acrocomias. First... how many species do we have so far?

You can import your own data into Colab notebooks from your Google Drive account, including from spreadsheets, as well as from Github and many other sources. To learn more about importing data, and how Colab can be used for data science, see the links below under [Working with Data](#working-with-data).