# Darwin Core Conversion of eDNA Sequence Data Using the DNA Derived Data Extension

**Author:** Diana LaScala-Gruenewald

**Last Updated:** 09-Dec-2021

**Requirements:**
- Python 3
- Python 3 standard library modules:
    - datetime
    - random
    - os
- External packages:
    - Bio.Entrez (part of Biopython)
    - numpy
    - pandas
    - pytz
- Custom modules:
    - WoRMS

**Resources:**
- [OBIS Webinar on Genetic Data]()
- Andersson AF, Bissett A, Finstad AG, Fossøy F, Grosjean M, Hope M, Jeppesen TS, Kõljalg U, Lundin D, Nilsson RN, Prager M, Svenningsen C & Schigel D (2021) Publishing DNA-derived data through biodiversity data platforms. v1.0 Copenhagen: GBIF Secretariat. https://doi.org/10.35035/doc-vf1a-nr22.
- [TDWG Darwin Core Occurrence Core](https://dwc.tdwg.org/terms/#occurrence)
- [GBIF DNA Derived Data Extension](https://tools.gbif.org/dwca-validator/extension.do?id=http://rs.gbif.org/terms/1.0/DNADerivedData)

In [1]:
## Imports

from datetime import datetime
import os
import random

import numpy as np
import pandas as pd
import pytz # for handling time zones

import WoRMS # custom functions for querying WoRMS API

## Load data

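Note that in a Jupyter Notebook, the current working directory defaults to the folder containing the .ipynb file. The paths below therefore assume the notebook is run from a `src` folder that sits alongside a `raw` folder holding the data.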
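
The cells below build file paths by replacing `src` in the working directory string, which would also change any other `src` substring in the path. A minimal alternative sketch using `pathlib` follows. It assumes the same `src`/`raw` sibling layout, and the helper name `raw_data_path` is hypothetical, not part of the original notebook:

```python
from pathlib import Path


def raw_data_path(filename, cwd=None):
    """Return the path to `filename` in a `raw` folder beside the working directory.

    Assumes a layout like project/src (notebook) and project/raw (data).
    """
    base = Path(cwd) if cwd is not None else Path.cwd()
    return base.parent / 'raw' / filename


# Example with an explicit, made-up working directory
example = raw_data_path('asv_table.csv', cwd='/home/user/project/src')
print(example.as_posix())
```

Calling `raw_data_path('asv_table.csv')` without `cwd` would use the notebook's current working directory instead.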

In [2]:
## Plate data

filename = os.getcwd().replace('src', os.path.join('raw', 'asv_table.csv'))  
plate = pd.read_csv(filename)
print(plate.shape)
plate.head()

(280440, 11)


Unnamed: 0,ASV,FilterID,Sequence_ID,Reads,Kingdom,Phylum,Class,Order,Family,Genus,Species
0,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,05114c01_12_edna_1,05114c01_12_edna_1_S,14825,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,unassigned
1,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,05114c01_12_edna_2,05114c01_12_edna_2_S,16094,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,unassigned
2,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,05114c01_12_edna_3,05114c01_12_edna_3_S,22459,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,unassigned
3,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,11216c01_12_edna_1,11216c01_12_edna_1_S,19312,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,unassigned
4,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,11216c01_12_edna_2,11216c01_12_edna_2_S,16491,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,unassigned


Plate data contains the ASV sequence, the number of reads (number of times that ASV was observed in the sample), and the taxonomy associated with that ASV.

| Column name| Column definition                                                                                                                                                                           |
|------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| ASV        | The sequence of the Amplicon Sequence Variant observed                                                                                                                                       |
| FilterID   | A unique identifier for the filter the sample was obtained from, composed of: <br>- cruise number <br>- CTD cast number<br>- CTD bottle number <br>- filter indicator <br>- replicate number |
| Sequence_ID| The FilterID plus a letter indicating which plate the sample was on when sequenced                                                                                                           |
| Reads      | The number of reads for the ASV                                                                                                                                                             |
| Kingdom    | The Kingdom of the taxonomic identity assigned to the ASV, if known                                                                                                                          |
| Phylum     | The Phylum of the taxonomic identity assigned to the ASV, if known                                                                                                                           |
| Class      | The Class of the taxonomic identity assigned to the ASV, if known                                                                                                                            |
| Order      | The Order of the taxonomic identity assigned to the ASV, if known                                                                                                                            |
| Family     | The Family of the taxonomic identity assigned to the ASV, if known                                                                                                                           |
| Genus      | The Genus of the taxonomic identity assigned to the ASV, if known                                                                                                                            |
| Species    | The Species of the taxonomic identity assigned to the ASV, if known                                                                                                                          |
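
As an illustration of the FilterID structure described in the table, the sketch below splits an identifier into its parts. The regular expression and the function name `parse_filter_id` are assumptions inferred from example values such as `05114c01_12_edna_1`. In particular, it is a guess that the digits before `c` are the cruise number and the digits after it are the CTD cast number, so the pattern should be confirmed with the data provider before it is relied on:

```python
import re

# Assumed layout: <cruise>c<cast>_<bottle>_<filter indicator>_<replicate>
FILTER_ID_PATTERN = re.compile(
    r'^(?P<cruise>\d+)c(?P<cast>\d+)_(?P<bottle>\d+)'
    r'_(?P<filter>[A-Za-z]+)_(?P<replicate>\d+)$'
)


def parse_filter_id(filter_id):
    """Split a FilterID into its components, or return None if it does not match."""
    match = FILTER_ID_PATTERN.match(filter_id)
    return match.groupdict() if match else None


parts = parse_filter_id('05114c01_12_edna_1')
print(parts)
```

Identifiers that do not follow the assumed layout return `None`, which makes unexpected formats easy to spot.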

Additionally, taxonomic columns may include the following designations:
- **unknown** = GenBank could not provide a scientifically agreed-upon name for the given taxonomic rank: either no name exists, or there is not enough scientific consensus to assign one.
- **no_hit** = BLAST did not find any hits for the ASV.
- **unassigned** = The ASV got BLAST hits, but the post-processing program Megan6 didn't assign the ASV to any taxonomic group.
- **g_** or **s_** = Megan6 assigned the ASV to a genus or species, but not with high enough confidence to include it. 
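
Before converting, it can help to see how often each of these designations occurs at each rank. The sketch below counts them on a small made-up table. The frame `toy` and its values are illustrative assumptions, not the real plate data:

```python
import pandas as pd

# Placeholder designations described in the list above
PLACEHOLDERS = ['unknown', 'no_hit', 'unassigned', 'g_', 's_']

# Toy stand-in for the taxonomy columns of the plate table (values are made up)
toy = pd.DataFrame({
    'Genus': ['Paracalanus', 'g_', 'unknown', 'no_hit'],
    'Species': ['unassigned', 's_', 'unknown', 'no_hit'],
})

# Count how often each placeholder occurs in each column (0 if absent)
counts = pd.DataFrame({
    col: toy[col].value_counts().reindex(PLACEHOLDERS, fill_value=0)
    for col in toy.columns
})
print(counts)
```

The same dictionary comprehension could be pointed at the `Kingdom` through `Species` columns of `plate`.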

In [3]:
## Plate metadata

filename = os.getcwd().replace('src', os.path.join('raw', 'metadata_table.csv'))  
meta = pd.read_csv(filename)
print(meta.shape)
meta.head()

(60, 67)


Unnamed: 0,sample_name,library,tag_sequence,primer_sequence_forward,primer_sequence_reverse,R1,R2,PlateID,sample_type,target_gene,...,pcr_primer_name_forward,pcr_primer_name_reverse,pcr_primer_reference,seq_meth,sequencing_facility,seqID,identificationRemarks,identificationReferences,FilterID,associatedSequences
0,14213c01_12_eDNA_1,S1,ACGAGACTGATT,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,14213c01_12_edna_1_S1_L001_R1_001.fastq.gz,14213c01_12_edna_1_S1_L001_R2_001.fastq.gz,S,environmental,18S,...,1391f,EukBr,"Amaral-Zettler LA, McCliment EA, Ducklow HW, H...",NGS Illumina Miseq,Stanford,14213c01_12_edna_1_S,Genbank nr Release 221 September 20 2017,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,14213c01_12_edna,NCBI BioProject accession number PRJNA433203
1,14213c01_12_eDNA_2,S2,GAATACCAAGTC,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,14213c01_12_edna_2_S2_L001_R1_001.fastq.gz,14213c01_12_edna_2_S2_L001_R2_001.fastq.gz,S,environmental,18S,...,1391f,EukBr,"Amaral-Zettler LA, McCliment EA, Ducklow HW, H...",NGS Illumina Miseq,Stanford,14213c01_12_edna_2_S,Genbank nr Release 221 September 20 2017,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,14213c01_12_edna,NCBI BioProject accession number PRJNA433203
2,14213c01_12_eDNA_3,S3,CGAGGGAAAGTC,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,14213c01_12_edna_3_S3_L001_R1_001.fastq.gz,14213c01_12_edna_3_S3_L001_R2_001.fastq.gz,S,environmental,18S,...,1391f,EukBr,"Amaral-Zettler LA, McCliment EA, Ducklow HW, H...",NGS Illumina Miseq,Stanford,14213c01_12_edna_3_S,Genbank nr Release 221 September 20 2017,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,14213c01_12_edna,NCBI BioProject accession number PRJNA433203
3,22013c01_12_eDNA_1,S4,GAACACTTTGGA,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,22013c01_12_edna_1_S4_L001_R1_001.fastq.gz,22013c01_12_edna_1_S4_L001_R2_001.fastq.gz,S,environmental,18S,...,1391f,EukBr,"Amaral-Zettler LA, McCliment EA, Ducklow HW, H...",NGS Illumina Miseq,Stanford,22013c01_12_edna_1_S,Genbank nr Release 221 September 20 2017,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,22013c01_12_edna,NCBI BioProject accession number PRJNA433203
4,22013c01_12_eDNA_2,S5,ACTCACAGGAAT,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,22013c01_12_edna_2_S5_L001_R1_001.fastq.gz,22013c01_12_edna_2_S5_L001_R2_001.fastq.gz,S,environmental,18S,...,1391f,EukBr,"Amaral-Zettler LA, McCliment EA, Ducklow HW, H...",NGS Illumina Miseq,Stanford,22013c01_12_edna_2_S,Genbank nr Release 221 September 20 2017,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,22013c01_12_edna,NCBI BioProject accession number PRJNA433203


Metadata contains information on sample collection, DNA extraction, DNA amplification, and DNA sequencing. 

Definitions of relevant columns:

| Column name              | Column definition                                                                       |
|--------------------------|-----------------------------------------------------------------------------------------|
| primer_sequence_forward  | The sequence of the forward primer used during PCR                                      |
| primer_sequence_reverse  | The sequence of the reverse primer used during PCR                                      |
| target_gene              | The gene being targeted for amplification during PCR                                    |
| eventDate                | The date (and time, if available) the water sample was collected                        |
| decimalLatitude          | The latitude in decimal degrees where the water sample was collected (WGS84)            |
| decimalLongitude         | The longitude in decimal degrees where the water sample was collected (WGS84)           |
| env_broad_scale          | The most broad descriptor of the environment from which the water sample was collected  |
| env_local_scale          | A more specific descriptor of the environment from which the water sample was collected |
| env_medium               | A descriptor of the medium from which the DNA was collected                             |
| minimumDepthInMeters     | The minimum depth at which the water sample was collected                               |
| maximumDepthInMeters     | The maximum depth at which the water sample was collected                               |
| samp_vol_we_dna_ext      | The volume of the water sample that was processed during DNA extraction                 |
| nucl_acid_ext            | Reference to the DNA extraction protocol                                                |
| nucl_acid_amp            | Reference to the DNA amplification protocol                                             |
| sop                      | Links or references to standard operating protocols used to obtain the data             |
| pcr_primer_name_forward  | Name of the forward primer used during PCR                                              |
| pcr_primer_name_reverse  | Name of the reverse primer used during PCR                                              |
| pcr_primer_reference     | Reference for PCR primers                                                               |
| seq_meth                 | The sequencing method used                                                              |
| identificationRemarks    | Information on the taxonomic identification process                                     |
| identificationReferences | References to procedures and/or code used during the taxonomic identification process   |
| associatedSequences      | The identifier of the published raw DNA sequences from the water sample, if available   |

## Convert

For this data set, an `event` is a filtered water sample that was sequenced and an `occurrence` is an ASV observed within a water sample. Since there are no event-level measurements (i.e., measurements that are associated with the water sample but not the ASV), a separate event file is not required. We will assemble an occurrence file complying with Darwin Core and a DNA derived data (ddd) file complying with the DNA derived data extension.

In [4]:
## eventID - the Sequence_ID column in the plate dataframe uniquely identifies a water sample

occ = pd.DataFrame({'eventID':plate['Sequence_ID']})
print(occ.shape)
occ.head()

(280440, 1)


Unnamed: 0,eventID
0,05114c01_12_edna_1_S
1,05114c01_12_edna_2_S
2,05114c01_12_edna_3_S
3,11216c01_12_edna_1_S
4,11216c01_12_edna_2_S


In [5]:
## Merge with the metadata table (meta) to obtain columns that can be added directly from metadata

metadata_cols = [
    'seqID',
    'eventDate', 
    'decimalLatitude', 
    'decimalLongitude',
    'env_broad_scale',
    'env_local_scale',
    'env_medium',
    'target_gene',
    'primer_sequence_forward',
    'primer_sequence_reverse',
    'pcr_primer_name_forward',
    'pcr_primer_name_reverse',
    'pcr_primer_reference',
    'sop',
    'seq_meth',
    'samp_vol_we_dna_ext',
    'nucl_acid_ext', 
    'nucl_acid_amp',
]

dwc_cols = metadata_cols.copy()
dwc_cols[0] = 'eventID'

occ = occ.merge(meta[metadata_cols], how='left', left_on='eventID', right_on='seqID')
occ.drop(columns='seqID', inplace=True)
occ.columns = dwc_cols
occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,primer_sequence_forward,primer_sequence_reverse,pcr_primer_name_forward,pcr_primer_name_reverse,pcr_primer_reference,sop,seq_meth,samp_vol_we_dna_ext,nucl_acid_ext,nucl_acid_amp
0,05114c01_12_edna_1_S,2/20/14 15:33,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,"Amaral-Zettler LA, McCliment EA, Ducklow HW, H...",dx.doi.org/10.17504/protocols.io.xjufknw|dx.do...,NGS Illumina Miseq,100ml,dx.doi.org/10.17504/protocols.io.xjufknw,dx.doi.org/10.17504/protocols.io.n2vdge6
1,05114c01_12_edna_2_S,2/20/14 15:33,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,"Amaral-Zettler LA, McCliment EA, Ducklow HW, H...",dx.doi.org/10.17504/protocols.io.xjufknw|dx.do...,NGS Illumina Miseq,100ml,dx.doi.org/10.17504/protocols.io.xjufknw,dx.doi.org/10.17504/protocols.io.n2vdge6
2,05114c01_12_edna_3_S,2/20/14 15:33,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,"Amaral-Zettler LA, McCliment EA, Ducklow HW, H...",dx.doi.org/10.17504/protocols.io.xjufknw|dx.do...,NGS Illumina Miseq,100ml,dx.doi.org/10.17504/protocols.io.xjufknw,dx.doi.org/10.17504/protocols.io.n2vdge6
3,11216c01_12_edna_1_S,4/21/16 14:39,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,"Amaral-Zettler LA, McCliment EA, Ducklow HW, H...",dx.doi.org/10.17504/protocols.io.xjufknw|dx.do...,NGS Illumina Miseq,1000ml,dx.doi.org/10.17504/protocols.io.xjufknw,dx.doi.org/10.17504/protocols.io.n2vdge6
4,11216c01_12_edna_2_S,4/21/16 14:39,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,"Amaral-Zettler LA, McCliment EA, Ducklow HW, H...",dx.doi.org/10.17504/protocols.io.xjufknw|dx.do...,NGS Illumina Miseq,1000ml,dx.doi.org/10.17504/protocols.io.xjufknw,dx.doi.org/10.17504/protocols.io.n2vdge6
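
A left merge keeps every occurrence row, so an `eventID` with no matching `seqID` would silently receive NaN metadata. The sketch below shows one way to check for that with the `validate` and `indicator` options of `DataFrame.merge`. The frames `occ_toy` and `meta_toy` are made-up stand-ins for `occ` and `meta`:

```python
import pandas as pd

# Toy stand-ins for `occ` and `meta` (values are made up)
occ_toy = pd.DataFrame({'eventID': ['s1', 's1', 's2', 's3']})
meta_toy = pd.DataFrame({'seqID': ['s1', 's2'], 'target_gene': ['18S', '18S']})

# validate='many_to_one' raises if seqID is duplicated in the metadata;
# indicator=True adds a `_merge` column marking rows found only on the left
merged = occ_toy.merge(
    meta_toy,
    how='left',
    left_on='eventID',
    right_on='seqID',
    validate='many_to_one',
    indicator=True,
)

# eventIDs that found no metadata row
unmatched = merged.loc[merged['_merge'] == 'left_only', 'eventID'].unique()
print(unmatched)
```

An empty `unmatched` array would indicate that every water sample found its metadata row.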


In [6]:
## Format eventDate

pst = pytz.timezone('America/Los_Angeles')
eventDate = [pst.localize(datetime.strptime(dt, '%m/%d/%y %H:%M')).isoformat() for dt in occ['eventDate']]
occ['eventDate'] = eventDate

occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,primer_sequence_forward,primer_sequence_reverse,pcr_primer_name_forward,pcr_primer_name_reverse,pcr_primer_reference,sop,seq_meth,samp_vol_we_dna_ext,nucl_acid_ext,nucl_acid_amp
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,"Amaral-Zettler LA, McCliment EA, Ducklow HW, H...",dx.doi.org/10.17504/protocols.io.xjufknw|dx.do...,NGS Illumina Miseq,100ml,dx.doi.org/10.17504/protocols.io.xjufknw,dx.doi.org/10.17504/protocols.io.n2vdge6
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,"Amaral-Zettler LA, McCliment EA, Ducklow HW, H...",dx.doi.org/10.17504/protocols.io.xjufknw|dx.do...,NGS Illumina Miseq,100ml,dx.doi.org/10.17504/protocols.io.xjufknw,dx.doi.org/10.17504/protocols.io.n2vdge6
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,"Amaral-Zettler LA, McCliment EA, Ducklow HW, H...",dx.doi.org/10.17504/protocols.io.xjufknw|dx.do...,NGS Illumina Miseq,100ml,dx.doi.org/10.17504/protocols.io.xjufknw,dx.doi.org/10.17504/protocols.io.n2vdge6
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,"Amaral-Zettler LA, McCliment EA, Ducklow HW, H...",dx.doi.org/10.17504/protocols.io.xjufknw|dx.do...,NGS Illumina Miseq,1000ml,dx.doi.org/10.17504/protocols.io.xjufknw,dx.doi.org/10.17504/protocols.io.n2vdge6
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,"Amaral-Zettler LA, McCliment EA, Ducklow HW, H...",dx.doi.org/10.17504/protocols.io.xjufknw|dx.do...,NGS Illumina Miseq,1000ml,dx.doi.org/10.17504/protocols.io.xjufknw,dx.doi.org/10.17504/protocols.io.n2vdge6
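
The offsets in the output differ (`-08:00` in February, `-07:00` in April) because localizing to `America/Los_Angeles` accounts for daylight saving time. The sketch below shows the same conversion with the standard library `zoneinfo` module (Python 3.9+) swapped in for `pytz`. The helper name `to_iso8601` is hypothetical:

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # standard library, Python 3.9+

PACIFIC = ZoneInfo('America/Los_Angeles')


def to_iso8601(raw, tz=PACIFIC):
    """Parse a 'month/day/2-digit-year hour:minute' string as local time; return ISO 8601."""
    # With zoneinfo, attaching the zone via replace() applies the correct UTC offset
    return datetime.strptime(raw, '%m/%d/%y %H:%M').replace(tzinfo=tz).isoformat()


winter = to_iso8601('2/20/14 15:33')  # standard time
spring = to_iso8601('4/21/16 14:39')  # daylight saving time
print(winter)
print(spring)
```

Note that `replace(tzinfo=...)` is only appropriate for `zoneinfo` zones. With `pytz`, `localize()` should be used, as in the cell above.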


In [7]:
## Clean sop

occ['sop'] = occ['sop'].str.replace('|', ' | ', regex=False)
occ['sop'].iloc[0]

'dx.doi.org/10.17504/protocols.io.xjufknw | dx.doi.org/10.17504/protocols.io.n2vdge6 | https://github.com/MBARI-BOG/BOG-Banzai-Dada2-Pipeline'

In [8]:
## Change column names as needed

occ = occ.rename(columns = {'primer_sequence_forward':'pcr_primer_forward',
                            'primer_sequence_reverse':'pcr_primer_reverse'})
occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,pcr_primer_forward,pcr_primer_reverse,pcr_primer_name_forward,pcr_primer_name_reverse,pcr_primer_reference,sop,seq_meth,samp_vol_we_dna_ext,nucl_acid_ext,nucl_acid_amp
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,"Amaral-Zettler LA, McCliment EA, Ducklow HW, H...",dx.doi.org/10.17504/protocols.io.xjufknw | dx....,NGS Illumina Miseq,100ml,dx.doi.org/10.17504/protocols.io.xjufknw,dx.doi.org/10.17504/protocols.io.n2vdge6
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,"Amaral-Zettler LA, McCliment EA, Ducklow HW, H...",dx.doi.org/10.17504/protocols.io.xjufknw | dx....,NGS Illumina Miseq,100ml,dx.doi.org/10.17504/protocols.io.xjufknw,dx.doi.org/10.17504/protocols.io.n2vdge6
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,"Amaral-Zettler LA, McCliment EA, Ducklow HW, H...",dx.doi.org/10.17504/protocols.io.xjufknw | dx....,NGS Illumina Miseq,100ml,dx.doi.org/10.17504/protocols.io.xjufknw,dx.doi.org/10.17504/protocols.io.n2vdge6
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,"Amaral-Zettler LA, McCliment EA, Ducklow HW, H...",dx.doi.org/10.17504/protocols.io.xjufknw | dx....,NGS Illumina Miseq,1000ml,dx.doi.org/10.17504/protocols.io.xjufknw,dx.doi.org/10.17504/protocols.io.n2vdge6
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,1391f,EukBr,"Amaral-Zettler LA, McCliment EA, Ducklow HW, H...",dx.doi.org/10.17504/protocols.io.xjufknw | dx....,NGS Illumina Miseq,1000ml,dx.doi.org/10.17504/protocols.io.xjufknw,dx.doi.org/10.17504/protocols.io.n2vdge6


In [9]:
## Add extension terms that weren't in metadata file (obtained by asking data provider)

occ['target_subfragment'] = 'V9'
occ['lib_layout'] = 'paired'
occ['otu_class_appr'] = 'dada2;version;ASV'
occ['otu_seq_comp_appr'] = 'blast;version;80% identity | MEGAN6;version; bitscore:100:2%'
occ['otu_db'] = 'Genbank nr;221'

occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,pcr_primer_forward,pcr_primer_reverse,...,sop,seq_meth,samp_vol_we_dna_ext,nucl_acid_ext,nucl_acid_amp,target_subfragment,lib_layout,otu_class_appr,otu_seq_comp_appr,otu_db
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,dx.doi.org/10.17504/protocols.io.xjufknw | dx....,NGS Illumina Miseq,100ml,dx.doi.org/10.17504/protocols.io.xjufknw,dx.doi.org/10.17504/protocols.io.n2vdge6,V9,paired,dada2;version;ASV,blast;version;80% identity | MEGAN6;version; b...,Genbank nr;221
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,dx.doi.org/10.17504/protocols.io.xjufknw | dx....,NGS Illumina Miseq,100ml,dx.doi.org/10.17504/protocols.io.xjufknw,dx.doi.org/10.17504/protocols.io.n2vdge6,V9,paired,dada2;version;ASV,blast;version;80% identity | MEGAN6;version; b...,Genbank nr;221
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,dx.doi.org/10.17504/protocols.io.xjufknw | dx....,NGS Illumina Miseq,100ml,dx.doi.org/10.17504/protocols.io.xjufknw,dx.doi.org/10.17504/protocols.io.n2vdge6,V9,paired,dada2;version;ASV,blast;version;80% identity | MEGAN6;version; b...,Genbank nr;221
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,dx.doi.org/10.17504/protocols.io.xjufknw | dx....,NGS Illumina Miseq,1000ml,dx.doi.org/10.17504/protocols.io.xjufknw,dx.doi.org/10.17504/protocols.io.n2vdge6,V9,paired,dada2;version;ASV,blast;version;80% identity | MEGAN6;version; b...,Genbank nr;221
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,dx.doi.org/10.17504/protocols.io.xjufknw | dx....,NGS Illumina Miseq,1000ml,dx.doi.org/10.17504/protocols.io.xjufknw,dx.doi.org/10.17504/protocols.io.n2vdge6,V9,paired,dada2;version;ASV,blast;version;80% identity | MEGAN6;version; b...,Genbank nr;221


In [10]:
## Create an occurrenceID that will uniquely identify each ASV observed within a water sample

occ['occurrenceID'] = plate.groupby('Sequence_ID')['ASV'].cumcount()+1
occ['occurrenceID'] = occ['eventID'] + '_occ' + occ['occurrenceID'].astype(str)
occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,pcr_primer_forward,pcr_primer_reverse,...,seq_meth,samp_vol_we_dna_ext,nucl_acid_ext,nucl_acid_amp,target_subfragment,lib_layout,otu_class_appr,otu_seq_comp_appr,otu_db,occurrenceID
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,NGS Illumina Miseq,100ml,dx.doi.org/10.17504/protocols.io.xjufknw,dx.doi.org/10.17504/protocols.io.n2vdge6,V9,paired,dada2;version;ASV,blast;version;80% identity | MEGAN6;version; b...,Genbank nr;221,05114c01_12_edna_1_S_occ1
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,NGS Illumina Miseq,100ml,dx.doi.org/10.17504/protocols.io.xjufknw,dx.doi.org/10.17504/protocols.io.n2vdge6,V9,paired,dada2;version;ASV,blast;version;80% identity | MEGAN6;version; b...,Genbank nr;221,05114c01_12_edna_2_S_occ1
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,NGS Illumina Miseq,100ml,dx.doi.org/10.17504/protocols.io.xjufknw,dx.doi.org/10.17504/protocols.io.n2vdge6,V9,paired,dada2;version;ASV,blast;version;80% identity | MEGAN6;version; b...,Genbank nr;221,05114c01_12_edna_3_S_occ1
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,NGS Illumina Miseq,1000ml,dx.doi.org/10.17504/protocols.io.xjufknw,dx.doi.org/10.17504/protocols.io.n2vdge6,V9,paired,dada2;version;ASV,blast;version;80% identity | MEGAN6;version; b...,Genbank nr;221,11216c01_12_edna_1_S_occ1
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,NGS Illumina Miseq,1000ml,dx.doi.org/10.17504/protocols.io.xjufknw,dx.doi.org/10.17504/protocols.io.n2vdge6,V9,paired,dada2;version;ASV,blast;version;80% identity | MEGAN6;version; b...,Genbank nr;221,11216c01_12_edna_2_S_occ1
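
Because the `occurrenceID` is built from a per-sample counter, it is worth confirming that the resulting identifiers are unique. The sketch below repeats the construction on a small made-up table. `plate_toy` and its values are illustrative assumptions:

```python
import pandas as pd

# Toy stand-in for the plate table (values are made up)
plate_toy = pd.DataFrame({
    'Sequence_ID': ['a_S', 'b_S', 'a_S', 'b_S', 'a_S'],
    'ASV': ['ACGT', 'ACGT', 'GGCC', 'TTAA', 'TTAA'],
})

# Number the rows within each sample, starting at 1, as in the cell above
counter = plate_toy.groupby('Sequence_ID').cumcount() + 1
occurrence_id = plate_toy['Sequence_ID'] + '_occ' + counter.astype(str)

# Each occurrenceID should appear exactly once
assert occurrence_id.is_unique
print(occurrence_id.tolist())
```

Identifiers built this way depend on row order, so re-sorting the input table before this step would assign different numbers to the same ASV.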


In [11]:
## Add DNA_sequence

occ['DNA_sequence'] = plate['ASV']
occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,pcr_primer_forward,pcr_primer_reverse,...,samp_vol_we_dna_ext,nucl_acid_ext,nucl_acid_amp,target_subfragment,lib_layout,otu_class_appr,otu_seq_comp_appr,otu_db,occurrenceID,DNA_sequence
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,100ml,dx.doi.org/10.17504/protocols.io.xjufknw,dx.doi.org/10.17504/protocols.io.n2vdge6,V9,paired,dada2;version;ASV,blast;version;80% identity | MEGAN6;version; b...,Genbank nr;221,05114c01_12_edna_1_S_occ1,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,100ml,dx.doi.org/10.17504/protocols.io.xjufknw,dx.doi.org/10.17504/protocols.io.n2vdge6,V9,paired,dada2;version;ASV,blast;version;80% identity | MEGAN6;version; b...,Genbank nr;221,05114c01_12_edna_2_S_occ1,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,100ml,dx.doi.org/10.17504/protocols.io.xjufknw,dx.doi.org/10.17504/protocols.io.n2vdge6,V9,paired,dada2;version;ASV,blast;version;80% identity | MEGAN6;version; b...,Genbank nr;221,05114c01_12_edna_3_S_occ1,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,1000ml,dx.doi.org/10.17504/protocols.io.xjufknw,dx.doi.org/10.17504/protocols.io.n2vdge6,V9,paired,dada2;version;ASV,blast;version;80% identity | MEGAN6;version; b...,Genbank nr;221,11216c01_12_edna_1_S_occ1,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,1000ml,dx.doi.org/10.17504/protocols.io.xjufknw,dx.doi.org/10.17504/protocols.io.n2vdge6,V9,paired,dada2;version;ASV,blast;version;80% identity | MEGAN6;version; b...,Genbank nr;221,11216c01_12_edna_2_S_occ1,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...


In [12]:
## Add scientificName, taxonomic info

occ['scientificName'] = plate['Species']
occ['kingdom'] = plate['Kingdom']
occ['phylum'] = plate['Phylum']
occ['class'] = plate['Class']
occ['order'] = plate['Order']
occ['family'] = plate['Family']
occ['genus'] = plate['Genus']

occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,pcr_primer_forward,pcr_primer_reverse,...,otu_db,occurrenceID,DNA_sequence,scientificName,kingdom,phylum,class,order,family,genus
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Genbank nr;221,05114c01_12_edna_1_S_occ1,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,unassigned,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Genbank nr;221,05114c01_12_edna_2_S_occ1,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,unassigned,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Genbank nr;221,05114c01_12_edna_3_S_occ1,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,unassigned,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Genbank nr;221,11216c01_12_edna_1_S_occ1,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,unassigned,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Genbank nr;221,11216c01_12_edna_2_S_occ1,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,unassigned,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus


When submitting data to OBIS, the various placeholders for missing data (e.g. "unknown," "no_hit," etc.) add no information. We can replace these with NaN, which is easy to work with in pandas.

In [13]:
## Replace 'unknown', 'unassigned', etc. in scientificName and taxonomy columns with NaN

cols = ['scientificName', 'kingdom', 'phylum', 'class', 'order', 'family', 'genus']
occ[cols] = occ[cols].replace({'unassigned':np.nan,
                              's_':np.nan,
                              'g_':np.nan,
                              'unknown':np.nan,
                              'no_hit':np.nan})
occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,pcr_primer_forward,pcr_primer_reverse,...,otu_db,occurrenceID,DNA_sequence,scientificName,kingdom,phylum,class,order,family,genus
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Genbank nr;221,05114c01_12_edna_1_S_occ1,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Genbank nr;221,05114c01_12_edna_2_S_occ1,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Genbank nr;221,05114c01_12_edna_3_S_occ1,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Genbank nr;221,11216c01_12_edna_1_S_occ1,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Genbank nr;221,11216c01_12_edna_2_S_occ1,GCTACTACCGATTGAACATTTTAGTGAGGTCCTCGGACTGTGAGCC...,,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus


In [14]:
## Get unique species names

names = occ['scientificName'].unique()
names = names[~pd.isnull(names)]  # remove NaN
print(len(names))

323


OBIS uses the World Register of Marine Species (or [WoRMS](http://www.marinespecies.org/)) as its taxonomic backbone, so scientific names have to be WoRMS-approved in order to show up as valid occurrences. But there are a number of entries in the `scientificName` column, like "uncultured marine eukaryote," "eukaryote clone OLI11007," and "Acantharian sp. 6201," that are **not proper Linnaean species names**. Since these essentially indicate that a more precise name is unknown, it seemed reasonable to replace these with NaN as well.

**NOTE: I used a simple rule to filter out non-Linnaean names, but it's important to check and see if any true species names are being removed.**

To visually inspect names that are being filtered out, use:
```python
names = occ['scientificName'].unique()
names = names[~pd.isnull(names)]  # remove NaN
for name in names:
    words_in_name = name.split(' ')
    if len(words_in_name) > 2:
        print(name)
```

In [15]:
## Replace non-Linnaean species names with NaN

# Get non-Linnaean names
non_latin_names = []
for name in names:
    words_in_name = name.split(' ')
    if len(words_in_name) > 2:
        non_latin_names.append(name)
non_latin_names_dict = {i:np.nan for i in non_latin_names}

# Add any names that didn't get caught in the simple filter
non_latin_names_dict['phototrophic eukaryote'] = np.nan
non_latin_names_dict['Candida <clade Candida/Lodderomyces clade>'] = np.nan

# Replace
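# Assign the result back rather than using inplace=True on a column selection,
# which may not update occ under newer pandas versions
occ['scientificName'] = occ['scientificName'].replace(non_latin_names_dict)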
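The two manual additions above show that a word count alone misses some non-Linnaean names. One possible complementary check is a pattern test, sketched below. The function names and the regular expression are my own assumptions rather than part of the original workflow, and 'Paracalanus parvus' is included only as an illustrative binomial. Like the word-count rule, this would also flag three-word names such as subspecies, so the output still needs a visual check.

```python
import re

# Assumed pattern: a capitalized genus, optionally followed by one lowercase epithet
BINOMIAL_PATTERN = re.compile(r'[A-Z][a-z]+(?: [a-z][a-z-]*)?')

def looks_linnaean(name):
    """Return True if name has the shape of a genus or a genus + species binomial."""
    return BINOMIAL_PATTERN.fullmatch(name) is not None

def suspicious_names(names):
    """Return the names that do not have a Linnaean shape, for manual review."""
    return [name for name in names if not looks_linnaean(name)]

example_names = [
    'Paracalanus parvus',
    'Paracalanus',
    'phototrophic eukaryote',
    'Candida <clade Candida/Lodderomyces clade>',
    'Acantharian sp. 6201',
    'uncultured marine eukaryote',
]
flagged = suspicious_names(example_names)
for name in flagged:
    print(name)
```

A check like this only flags candidates. Whether a flagged name should actually be replaced with NaN is still a judgment call.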

In addition, many records **only give "Eukaryota" as the scientific name** (i.e. Eukaryota is in the kingdom field, and there is no more taxonomic information). These should be replaced with [Biota](http://marinespecies.org/aphia.php?p=taxdetails&id=1), which is WoRMS's most general taxonomic designation.

In [16]:
## Replace entries where kingdom = 'Eukaryota' with the WoRMS-approved 'Biota'

occ.loc[occ['kingdom'] == 'Eukaryota', 'kingdom'] = 'Biota'

The data providers for this dataset used the [NCBI taxonomy database](https://www.ncbi.nlm.nih.gov/taxonomy) as their reference database when assigning taxonomies to ASVs. **It's important to note** that this taxonomy database is not a taxonomic authority, and the taxonomic ranks it gives for a scientific name may not correspond directly to the ranks given for that name on WoRMS. There are ongoing discussions about this problem (see [this](https://github.com/iobis/Project-team-Genetic-Data/issues/5) GitHub issue). At the moment, I don't see a way to definitively ensure that a given scientific name actually has the same taxonomic ranks on both platforms without going case-by-case.

In addition, there are still names in the data that will not match on WoRMS at all, despite appearing to be Linnaean names. This is because the name may not have been fully and officially adopted by the scientific community. I therefore need a system for searching through the higher taxonomic ranks given, finding the lowest one that will match on WoRMS, and putting that name in the `scientificName` column. The following few code blocks do this - they're clunky, but they were sufficient for this data set.

In [17]:
## Define functions for finding the lowest available taxonomic rank that will match on WoRMS

def fill_lowest_taxon(df, cols):
    """ Takes the occurrence pandas data frame and fills missing values in scientificName 
    with values from the first non-missing taxonomic rank column. The names of the taxonomic
    rank columns are listed in cols. """
    
    cols.reverse()
    
    for col in cols[:-1]:
        df['scientificName'] = df['scientificName'].combine_first(df[col])
    
    cols.reverse()
    
    return(df)

def find_not_matched(df, name_dict):
    """ Takes the occurrence pandas data frame and name_dict matching scientificName values 
    with names on WoRMS and returns a list of names that did not match on WoRMS. """
    
    not_matched = []
    
    for name in df['scientificName'].unique():
        if name not in name_dict.keys():
            not_matched.append(name)
    
    try:
        not_matched.remove(np.nan)
    except ValueError:
        pass
            
    return(not_matched)

def replace_not_matched(df, not_matched, cols):
    """ Takes the occurrence pandas data frame and a list of scientificName values that 
    did not match on WoRMS and replaces those values with NaN in the columns specified by cols. """
    
    df[cols] = df[cols].replace(not_matched, np.nan)
    
    return(df)  

In [18]:
## Iterate to match lowest possible taxonomic rank on WoRMS (takes ~8 minutes when starting with ~750 names)

# Note that cols (list of taxonomic column names) was defined in a previous code block 

# Initialize dictionaries
name_name_dict = {}
name_id_dict = {}
name_taxid_dict = {}
name_class_dict = {}

# Initialize not_matched
not_matched = [1]

# Iterate
while len(not_matched) > 0:
    
    # Step 1 - fill
    occ = fill_lowest_taxon(occ, cols)

    # Step 2 - get names to match
    to_match = find_not_matched(occ, name_name_dict)

    # Step 3 - match
    print('Matching {num} names on WoRMS.'.format(num = len(to_match)))
    name_id, name_name, name_taxid, name_class = WoRMS.run_get_worms_from_scientific_name(to_match, verbose_flag=False)
    name_id_dict = {**name_id_dict, **name_id}
    name_name_dict = {**name_name_dict, **name_name}
    name_taxid_dict = {**name_taxid_dict, **name_taxid}
    name_class_dict = {**name_class_dict, **name_class}
    print('Length of name_name_dict: {length}'.format(length = len(name_name_dict)))

    # Step 4 - get names that didn't match
    not_matched = find_not_matched(occ, name_name_dict)
    print('Number of names not matched: {num}'.format(num = len(not_matched)))

    # Step 5 - replace these values with NaN
    occ = replace_not_matched(occ, not_matched, cols)

Matching 756 names on WoRMS.
Length of name_name_dict: 696
Number of names not matched: 60
Matching 35 names on WoRMS.
Length of name_name_dict: 716
Number of names not matched: 15
Matching 11 names on WoRMS.
Length of name_name_dict: 721
Number of names not matched: 6
Matching 3 names on WoRMS.
Length of name_name_dict: 722
Number of names not matched: 2
Matching 1 names on WoRMS.
Length of name_name_dict: 722
Number of names not matched: 1
Matching 0 names on WoRMS.
Length of name_name_dict: 722
Number of names not matched: 0


There are, I'm sure, a vast number of ways to improve on this. A few that have crossed my mind are:
- Avoid the two in-place `.reverse()` calls in `fill_lowest_taxon()`, which temporarily change `cols`
- Add a progress bar
- Consider better and/or additional stopping criteria. Importantly, what if not all names can be matched?
- Consider using pyworms instead of my custom WoRMS functions
There are quite a few records where no taxonomic information was obtained at all (i.e., after this whole process, `scientificName` is still NaN). I set `scientificName` to 'Biota' for these records.

In [20]:
## Change scientificName to Biota in cases where all taxonomic information is missing

print(occ[occ['scientificName'].isna()].shape)
occ.loc[occ['scientificName'].isna(), 'scientificName'] = 'Biota'
occ[occ['scientificName'].isna()].shape

(33360, 32)


(0, 32)

Finally, during the above process, **I altered the taxonomy columns in order to obtain the best possible `scientificName` column**. I chose to re-populate these columns with the taxonomy from the original data set, rather than altering some of the names to match taxonomy retrieved from WoRMS. In this case, it seemed best to adhere as closely as possible to the original data.

In [21]:
## Fix taxonomy columns

# Replace with original data
occ[cols[1:]] = plate[['Kingdom', 'Phylum', 'Class', 'Order', 'Family', 'Genus']].copy()

# Replace missing data indicators in original data with empty strings ('')
occ[cols[1:]] = occ[cols[1:]].replace({
    'unassigned':'',
    's_':'',
    'g_':'',
    'unknown':'',
    'no_hit':''})

In [22]:
## Add scientific name-related columns

# Assign results back rather than using inplace=True on a column selection,
# which may not update occ under newer pandas versions
occ['scientificNameID'] = occ['scientificName'].replace(name_id_dict)

occ['taxonID'] = occ['scientificName'].replace(name_taxid_dict)

occ['scientificName'] = occ['scientificName'].replace(name_name_dict)

occ['nameAccordingTo'] = 'WoRMS'
occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,pcr_primer_forward,pcr_primer_reverse,...,scientificName,kingdom,phylum,class,order,family,genus,scientificNameID,taxonID,nameAccordingTo
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Paracalanus,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Paracalanus,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Paracalanus,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Paracalanus,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Paracalanus,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS


I also wanted to carry the original name from the NCBI taxonomy database over into the Darwin Core-converted data set. To do this, I queried the database based on the name in the original data to obtain its taxonomic ID number.

In [23]:
## Get set up to query NCBI taxonomy 

from Bio import Entrez

# ----- Insert your email here -----
Entrez.email = 'dianalg@mbari.org'
# ----------------------------------

# Get list of all databases available through this tool
record = Entrez.read(Entrez.einfo())
all_dbs = record['DbList']
all_dbs

['pubmed', 'protein', 'nuccore', 'ipg', 'nucleotide', 'structure', 'genome', 'annotinfo', 'assembly', 'bioproject', 'biosample', 'blastdbinfo', 'books', 'cdd', 'clinvar', 'gap', 'gapplus', 'grasp', 'dbvar', 'gene', 'gds', 'geoprofiles', 'homologene', 'medgen', 'mesh', 'ncbisearch', 'nlmcatalog', 'omim', 'orgtrack', 'pmc', 'popset', 'proteinclusters', 'pcassay', 'protfam', 'biosystems', 'pccompound', 'pcsubstance', 'seqannot', 'snp', 'sra', 'taxonomy', 'biocollections', 'gtr']

In [24]:
## Get NCBI taxIDs for each name in dataset ---- TAKES ~ 2 MINUTES FOR 300 RECORDS

name_ncbiid_dict = {}

for name in names:
    handle = Entrez.esearch(db='taxonomy', retmax=10, term=name)
    record = Entrez.read(handle)
    name_ncbiid_dict[name] = record['IdList'][0]
    handle.close()

**Note** that this code will raise an IndexError (IndexError: list index out of range) if a term is not found, because `record['IdList']` is then empty.
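One way to guard against this is to check whether the ID list is empty before indexing it. The sketch below is a hedged variant of the loop above. The `lookup_taxids` function, its `search` parameter, and the fake search used to exercise it are hypothetical. With Biopython, `search` could wrap the same `Entrez.esearch` and `Entrez.read` calls used in the cell above.

```python
def lookup_taxids(names, search, missing=''):
    """Map each name to the first ID returned by search, or to missing if none is found.
    search is a hypothetical callable: name -> record with an 'IdList' entry."""
    ids = {}
    for name in names:
        record = search(name)
        id_list = record['IdList']
        ids[name] = id_list[0] if id_list else missing
    return ids

# With Biopython, search might look like this (not run here, since it needs network access):
# def entrez_search(name):
#     handle = Entrez.esearch(db='taxonomy', retmax=10, term=name)
#     record = Entrez.read(handle)
#     handle.close()
#     return record

# Stand-in search with made-up IDs, for illustration only
fake_results = {'Name one': ['111'], 'Name two': ['222', '333']}

def fake_search(name):
    return {'IdList': fake_results.get(name, [])}

toy_ids = lookup_taxids(['Name one', 'Name two', 'Name three'], fake_search)
print(toy_ids)
```

Using an empty string for missing IDs matches how the missing-taxonomy indicators are handled in the next cell.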

In [25]:
## Add NCBI taxonomy IDs under taxonConceptID

# Map indicators that say no taxonomy was assigned to empty strings
for indicator in ['unassigned', 's_', 'no_hit', 'unknown', 'g_']:
    name_ncbiid_dict[indicator] = ''

# Create column (assign the result back rather than using inplace=True on a column selection)
occ['taxonConceptID'] = plate['Species'].replace(name_ncbiid_dict)

# Add remainder of text and clean
occ['taxonConceptID'] = 'NCBI:txid' + occ['taxonConceptID']
occ['taxonConceptID'] = occ['taxonConceptID'].replace('NCBI:txid', '')
occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,pcr_primer_forward,pcr_primer_reverse,...,kingdom,phylum,class,order,family,genus,scientificNameID,taxonID,nameAccordingTo,taxonConceptID
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Eukaryota,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,


In [27]:
## identificationRemarks

# Get identificationRemarks
occ = occ.merge(meta[['seqID', 'identificationRemarks']], how='left', left_on='eventID', right_on='seqID')
occ.drop(columns='seqID', inplace=True)

# Add name that matched in GenBank - i.e. the species name from the original data
occ['identificationRemarks'] = plate['Species'].copy() + ', ' + occ['identificationRemarks']
occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,pcr_primer_forward,pcr_primer_reverse,...,phylum,class,order,family,genus,scientificNameID,taxonID,nameAccordingTo,taxonConceptID,identificationRemarks
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2..."
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2..."
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2..."
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2..."
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Arthropoda,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2..."


In [28]:
## basisOfRecord

occ['basisOfRecord'] = 'MaterialSample'
occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,pcr_primer_forward,pcr_primer_reverse,...,class,order,family,genus,scientificNameID,taxonID,nameAccordingTo,taxonConceptID,identificationRemarks,basisOfRecord
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Hexanauplia,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample


In [29]:
## Add identificationReferences 

occ = occ.merge(meta[['seqID', 'identificationReferences']], how='left', left_on='eventID', right_on='seqID')
occ.drop(columns='seqID', inplace=True)
occ['identificationReferences'] = occ['identificationReferences'].str.replace('| ', ' | ', regex=False)

occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,pcr_primer_forward,pcr_primer_reverse,...,order,family,genus,scientificNameID,taxonID,nameAccordingTo,taxonConceptID,identificationRemarks,basisOfRecord,identificationReferences
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Calanoida,Paracalanidae,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...


In [30]:
## organismQuantity (number of reads)

occ['organismQuantity'] = plate['Reads']
occ['organismQuantityType'] = 'DNA sequence reads'
occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,pcr_primer_forward,pcr_primer_reverse,...,genus,scientificNameID,taxonID,nameAccordingTo,taxonConceptID,identificationRemarks,basisOfRecord,identificationReferences,organismQuantity,organismQuantityType
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,14825,DNA sequence reads
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,16094,DNA sequence reads
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,22459,DNA sequence reads
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,19312,DNA sequence reads
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,Paracalanus,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,16491,DNA sequence reads


In the context of eDNA data, `sampleSizeValue` should be the total number of reads for a given sample.

In [31]:
## sampleSizeValue

count_by_seq = plate.groupby('Sequence_ID', as_index=False)['Reads'].sum()
occ = occ.merge(count_by_seq, how='left', left_on='eventID', right_on='Sequence_ID')
occ.drop(columns='Sequence_ID', inplace=True)
occ.rename(columns={'Reads':'sampleSizeValue'}, inplace=True)
print(occ.shape)
occ.head()

(280440, 42)


Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,pcr_primer_forward,pcr_primer_reverse,...,scientificNameID,taxonID,nameAccordingTo,taxonConceptID,identificationRemarks,basisOfRecord,identificationReferences,organismQuantity,organismQuantityType,sampleSizeValue
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,14825,DNA sequence reads,85600
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,16094,DNA sequence reads,90702
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,22459,DNA sequence reads,130275
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,19312,DNA sequence reads,147220
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,urn:lsid:marinespecies.org:taxname:104196,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,16491,DNA sequence reads,121419
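As a sanity check on this step, the per-ASV read counts (`organismQuantity`) within an event should sum to that event's `sampleSizeValue`. A minimal sketch on toy data follows. The `toy_plate` and `toy_occ` frames are made up for illustration, and `transform('sum')` stands in for the groupby-and-merge used above.

```python
import pandas as pd

# Toy stand-ins for plate and occ: two ASVs in sample s1, one in sample s2
toy_plate = pd.DataFrame({
    'Sequence_ID': ['s1', 's1', 's2'],
    'Reads': [10, 5, 7],
})
toy_occ = pd.DataFrame({
    'eventID': toy_plate['Sequence_ID'],
    'organismQuantity': toy_plate['Reads'],
})

# Total reads per sample, broadcast back to one value per occurrence row
toy_occ['sampleSizeValue'] = toy_plate.groupby('Sequence_ID')['Reads'].transform('sum')

# Check: summing organismQuantity within each event reproduces sampleSizeValue
per_event_total = toy_occ.groupby('eventID')['organismQuantity'].transform('sum')
totals_agree = bool((per_event_total == toy_occ['sampleSizeValue']).all())
print(toy_occ)
print(totals_agree)
```

The same comparison could be run on `occ` itself, assuming each row of `plate` corresponds to exactly one row of `occ`.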


In [32]:
## sampleSizeUnit

occ['sampleSizeUnit'] = 'DNA sequence reads'
occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,pcr_primer_forward,pcr_primer_reverse,...,taxonID,nameAccordingTo,taxonConceptID,identificationRemarks,basisOfRecord,identificationReferences,organismQuantity,organismQuantityType,sampleSizeValue,sampleSizeUnit
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,14825,DNA sequence reads,85600,DNA sequence reads
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,16094,DNA sequence reads,90702,DNA sequence reads
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,22459,DNA sequence reads,130275,DNA sequence reads
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,19312,DNA sequence reads,147220,DNA sequence reads
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,104196,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,16491,DNA sequence reads,121419,DNA sequence reads


In [34]:
## associatedSequences

occ = occ.merge(meta[['seqID', 'associatedSequences']], how='left', left_on='eventID', right_on='seqID')
occ.drop(columns='seqID', inplace=True)
occ.head()

Unnamed: 0,eventID,eventDate,decimalLatitude,decimalLongitude,env_broad_scale,env_local_scale,env_medium,target_gene,pcr_primer_forward,pcr_primer_reverse,...,nameAccordingTo,taxonConceptID,identificationRemarks,basisOfRecord,identificationReferences,organismQuantity,organismQuantityType,sampleSizeValue,sampleSizeUnit,associatedSequences
0,05114c01_12_edna_1_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,14825,DNA sequence reads,85600,DNA sequence reads,NCBI BioProject accession number PRJNA433203
1,05114c01_12_edna_2_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,16094,DNA sequence reads,90702,DNA sequence reads,NCBI BioProject accession number PRJNA433203
2,05114c01_12_edna_3_S,2014-02-20T15:33:00-08:00,36.7958,-121.848,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,22459,DNA sequence reads,130275,DNA sequence reads,NCBI BioProject accession number PRJNA433203
3,11216c01_12_edna_1_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,19312,DNA sequence reads,147220,DNA sequence reads,NCBI BioProject accession number PRJNA433203
4,11216c01_12_edna_2_S,2016-04-21T14:39:00-07:00,36.7962,-121.846,marine biome (ENVO:00000447),coastal water (ENVO:00001250),waterborne particulate matter (ENVO:01000436),18S,GTACACACCGCCCGTC,TGATCCTTCTGCAGGTTCACCTAC,...,WoRMS,,"unassigned, Genbank nr Release 221 September 2...",MaterialSample,https://github.com/MBARI-BOG/BOG-Banzai-Dada2-...,16491,DNA sequence reads,121419,DNA sequence reads,NCBI BioProject accession number PRJNA433203
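
A left merge like the one above can silently duplicate occurrence rows if the metadata table holds more than one row per sequence ID. The sketch below is one way to guard against that, using toy stand-ins for `occ` and `meta` (the frame names `occ_demo` and `meta_demo` and all values are invented for illustration) and the `validate` argument of `DataFrame.merge`.

```python
import pandas as pd

# Toy stand-ins for occ and meta; names and values are illustrative only
occ_demo = pd.DataFrame({
    'eventID': ['s1', 's1', 's2'],
    'organismQuantity': [10, 0, 5],
})
meta_demo = pd.DataFrame({
    'seqID': ['s1', 's2'],
    'associatedSequences': ['accession A', 'accession B'],
})

n_before = len(occ_demo)

# validate='many_to_one' raises pandas.errors.MergeError if seqID
# is not unique in the right-hand table
merged = occ_demo.merge(
    meta_demo[['seqID', 'associatedSequences']],
    how='left',
    left_on='eventID',
    right_on='seqID',
    validate='many_to_one',
).drop(columns='seqID')

# A left merge against unique keys should preserve the row count
assert len(merged) == n_before
print(merged)
```

If the real metadata table does contain repeated sequence IDs, the merge would fail loudly here instead of inflating the occurrence table.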


In [35]:
## Drop records where organismQuantity = 0 (absences are not meaningful for this data set)

occ = occ[occ['organismQuantity'] > 0].copy()  # copy so later in-place edits act on an independent frame
print(occ.shape)

(64903, 44)


In [36]:
## Check for NaN values in string fields - if there are any, replace them with empty strings ('')

occ.isna().sum()

eventID                     0
eventDate                   0
decimalLatitude             0
decimalLongitude            0
env_broad_scale             0
env_local_scale             0
env_medium                  0
target_gene                 0
pcr_primer_forward          0
pcr_primer_reverse          0
pcr_primer_name_forward     0
pcr_primer_name_reverse     0
pcr_primer_reference        0
sop                         0
seq_meth                    0
samp_vol_we_dna_ext         0
nucl_acid_ext               0
nucl_acid_amp               0
target_subfragment          0
lib_layout                  0
otu_class_appr              0
otu_seq_comp_appr           0
otu_db                      0
occurrenceID                0
DNA_sequence                0
scientificName              0
kingdom                     0
phylum                      0
class                       0
order                       0
family                      0
genus                       0
scientificNameID            0
taxonID   
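
The comment in the cell above mentions replacing NaN values in string fields with empty strings, but the replacement itself is not shown. A minimal sketch of one way to do it follows, on a toy frame; `occ_demo` and the hand-picked `string_cols` list are assumptions made for this example, not part of the original workflow.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for occ; values are illustrative only
occ_demo = pd.DataFrame({
    'eventID': ['s1', 's2'],
    'taxonConceptID': [np.nan, np.nan],  # a column that is entirely missing
    'genus': ['Paracalanus', None],
    'organismQuantity': [14825, 16094],
})

# Columns treated as text in this example (chosen by hand; an assumption)
string_cols = ['taxonConceptID', 'genus']

filled = occ_demo.copy()
for col in string_cols:
    # Cast to object first so an all-NaN float column can hold strings
    filled[col] = filled[col].astype(object).fillna('')

# Numeric columns are left untouched; no missing values remain
print(filled.isna().sum())
```

Restricting the fill to named text columns keeps numeric fields such as `organismQuantity` from being turned into strings by accident.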

In [37]:
## Divide into occurrence and DNADerivedDataExt

ddd_cols = [
    'eventID',
    'occurrenceID',
    'DNA_sequence',
    'sop',
    'nucl_acid_ext',
    'samp_vol_we_dna_ext',
    'nucl_acid_amp',
    'target_gene',
    'target_subfragment',
    'lib_layout',
    'pcr_primer_forward',
    'pcr_primer_reverse',
    'pcr_primer_name_forward',
    'pcr_primer_name_reverse',
    'pcr_primer_reference',
    'seq_meth',
    'otu_class_appr',
    'otu_seq_comp_appr',
    'otu_db',
    'env_broad_scale',
    'env_local_scale',
    'env_medium',
]

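# The extension keeps eventID and occurrenceID so each row can be linked back to the occurrence table
DNADerivedData = occ[ddd_cols].copy()

# Remove the extension-only columns from the occurrence table (keep eventID and occurrenceID)
occ.drop(ddd_cols[2:], axis=1, inplace=True)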
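
After a split like this, it can be worth confirming that the two tables still line up. The sketch below repeats the same pattern on a toy frame and checks that only the linking columns are shared and that every occurrence has a matching extension row. The names `combined`, `ext_cols`, `core`, and `extension` and all values are hypothetical.

```python
import pandas as pd

# Toy combined table; values are illustrative only
combined = pd.DataFrame({
    'eventID': ['s1', 's1'],
    'occurrenceID': ['s1_occ1', 's1_occ2'],
    'DNA_sequence': ['ACGT', 'TTGA'],
    'target_gene': ['18S', '18S'],
    'scientificName': ['Paracalanus', 'Calanoida'],
})

# The first two entries are the linking columns kept in both tables
ext_cols = ['eventID', 'occurrenceID', 'DNA_sequence', 'target_gene']

extension = combined[ext_cols].copy()
core = combined.drop(columns=ext_cols[2:])

# Only the linking columns should appear in both tables
shared = sorted(set(core.columns) & set(extension.columns))
assert shared == ['eventID', 'occurrenceID']

# Each occurrence should be unique and present in the extension
assert core['occurrenceID'].is_unique
assert set(core['occurrenceID']) == set(extension['occurrenceID'])
```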

## Save

In [45]:
## Save

# Get path
folder = os.getcwd().replace('src', 'processed')
occ_filename = os.path.join(folder, 'occurrence.csv')
ddd_filename = os.path.join(folder, 'dna_extension.csv')

# Create folder (no error if it already exists)
os.makedirs(folder, exist_ok=True)

# Save
occ.to_csv(occ_filename, index=False, na_rep='NaN')
DNADerivedData.to_csv(ddd_filename, index=False, na_rep='NaN')
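
With `na_rep='NaN'`, missing values are written to the file as the literal text `NaN`. The sketch below, on a toy frame written to an in-memory buffer instead of disk, shows that pandas reads that text back as missing by default. Whether other tools that consume the file treat `NaN` the same way is not covered here and would need to be checked separately. The frame `demo` and its values are invented for illustration.

```python
import io

import numpy as np
import pandas as pd

# Toy frame with an entirely missing column; values are illustrative only
demo = pd.DataFrame({
    'occurrenceID': ['a', 'b'],
    'taxonConceptID': [np.nan, np.nan],
})

# Write to an in-memory buffer with the same options used for the real files
buffer = io.StringIO()
demo.to_csv(buffer, index=False, na_rep='NaN')
csv_text = buffer.getvalue()

# Read the text back; pandas treats the string 'NaN' as missing by default
buffer.seek(0)
reloaded = pd.read_csv(buffer)

assert 'NaN' in csv_text
assert reloaded.shape == demo.shape
assert reloaded['taxonConceptID'].isna().all()
```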