# DwC Occurrences. Keister Zooplankton Hood Canal 2012-13 data

University of Washington Pelagic Hypoxia Hood Canal project, Zooplankton dataset.   
Alignment of the dataset to Darwin Core (DwC) for NANOOS, https://www.nanoos.org. This data alignment work, including this Jupyter notebook, is described in the GitHub repository https://github.com/nanoos-pnw/obis-keisterhczoop. See [notebooks-notes.md](https://github.com/nanoos-pnw/obis-keisterhczoop/blob/main/notebooks-notes.md) and [README.md](https://github.com/nanoos-pnw/obis-keisterhczoop/blob/main/README.md).

Emilio Mayorga, https://github.com/emiliom

## Goals and scope of this notebook

Parse the source data and combine it with the `intermediate_DwC_taxonomy.csv` file created in the notebook `Keister-dwcTaxonomy.ipynb` to create the DwC "occurrence" file `DwC_occurrence.csv`. In this file, each record is an individual taxonomic occurrence observation from a sample, containing WoRMS taxonomic information, sex and life stage. The original sex and life stage information in the source data (in the `life_history_stage` column) was parsed, manually corrected in some cases based on previous research (encoded in `data_preprocess.py` and `common_mappings.json`), and matched to standard vocabularies when applicable. This notebook also creates an intermediate file, `intermediate_DwC_occurrence_life_history_stage.csv`, which pairs each `occurrenceID` created here with its `life_history_stage` entry, to be used later in the `Keister-dwceMoF.ipynb` notebook.

## Settings

In [1]:
from datetime import datetime
import json
from pathlib import Path
import uuid

import numpy as np
import pandas as pd

from data_preprocess import read_and_parse_sourcedata

In [2]:
data_pth = Path(".")

Set to `True` when debugging. `csv` files will not be exported when `debug_no_csvexport = True`.

In [3]:
debug_no_csvexport = False

## Process JSON file containing common mappings and strings

In [4]:
with open(data_pth / 'common_mappings.json') as f:
    common_mappings = json.load(f)

In [5]:
DatasetCode = common_mappings['datasetcode']
sex_dwciri_terms = common_mappings['sex_dwciri_terms']
life_stage_mappings = common_mappings['life_stage_mappings']

# Split life_stage_mappings into two lookups keyed by the source value:
# element 0 of each entry feeds lifeStage, element 1 feeds dwciri:lifeStage
lifeStage_mapping = {k: v[0] for k, v in life_stage_mappings.items()}
lifeStage_dwciri_terms = {k: v[1] for k, v in life_stage_mappings.items()}
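
The two comprehensions above imply that each entry in `life_stage_mappings` is a two-element sequence: a text label first, then a vocabulary term suffix (or `null`). The following is a minimal sketch of that assumed shape. The entries and the `EXAMPLE_*` suffixes are hypothetical placeholders, not values taken from `common_mappings.json`.

```python
# Hypothetical excerpt of the assumed structure: source value -> [label, term suffix].
# The real entries live in common_mappings.json; these are illustrative only.
example_life_stage_mappings = {
    "Female": ["adult", "S11/current/EXAMPLE_ADULT/"],
    "5": ["copepodites C5", "S11/current/EXAMPLE_C5/"],
    "Unknown": ["", None],  # no matching vocabulary term
}

# Same split as in the notebook cell: element 0 is the label, element 1 the term suffix.
label_by_key = {k: v[0] for k, v in example_life_stage_mappings.items()}
term_by_key = {k: v[1] for k, v in example_life_stage_mappings.items()}

print(label_by_key["5"])
print(term_by_key["Unknown"])
```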

## Read the data

### Read the pre-processed csv file

`usecols` lists the columns to keep, in the order in which they will appear.

In [6]:
usecols = ['sample_code', 'species', 'life_history_stage', 'lhs_0', 'lhs_1']
occursource_df = read_and_parse_sourcedata()[usecols]

In [7]:
len(occursource_df)

6866

In [8]:
occursource_df.head()

Unnamed: 0,sample_code,species,life_history_stage,lhs_0,lhs_1
0,20131003DBDm2_200,ACARTIA,3;_CIII,3,CIII
1,20130906DBiDm1_200,ACARTIA,5;_CV,5,CV
2,20131003DBDm1_200,ACARTIA,Female;_Adult,Female,Adult
3,20131003DBDm1_200,ACARTIA,Male;_Adult,Male,Adult
4,20120614DBDm3_200,ACARTIA_CLAUSI,Female;_Adult,Female,Adult
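
The output above shows `life_history_stage` values such as `Female;_Adult` split into `lhs_0` and `lhs_1`. The actual parsing happens in `data_preprocess.py` and is not shown in this notebook. As a rough sketch only, assuming `;_` is the separator and that single-token entries leave `lhs_1` empty, the split could look like this (`split_life_history_stage` is a hypothetical helper name):

```python
def split_life_history_stage(value):
    """Split a life_history_stage string into (lhs_0, lhs_1).

    Assumes ';_' separates the two parts; a value without the
    separator yields None for the second part (an assumption).
    """
    first, _, second = value.partition(";_")
    return first, (second or None)


for raw in ["3;_CIII", "Female;_Adult", "Medusa"]:
    print(raw, "->", split_life_history_stage(raw))
```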


## Merge resolved taxonomy from taxonomy csv

In [9]:
taxonomy_df = pd.read_csv(
    data_pth / "intermediate_DwC_taxonomy.csv"
)

In [10]:
occursource_df = occursource_df.merge(
    taxonomy_df, 
    how='inner', 
    left_on='species',
    right_on='verbatimIdentification'
)

In [11]:
len(occursource_df)

6853
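
The row count drops from 6866 to 6853 after the inner merge. This suggests that 13 source rows carry a `species` value with no match in `verbatimIdentification`, assuming the taxonomy file has no duplicate keys. One way to list such rows is a left merge with `indicator=True`. The sketch below uses small made-up dataframes, not the project data.

```python
import pandas as pd

# Made-up stand-ins for occursource_df and taxonomy_df
source = pd.DataFrame({
    "sample_code": ["s1", "s2", "s3"],
    "species": ["ACARTIA", "AGLANTHA", "NOT_IN_TAXONOMY"],
})
taxonomy = pd.DataFrame({
    "verbatimIdentification": ["ACARTIA", "AGLANTHA"],
    "scientificName": ["Acartia", "Aglantha"],
})

# A left merge keeps every source row; the _merge column records
# whether a taxonomy match was found for each one.
checked = source.merge(
    taxonomy,
    how="left",
    left_on="species",
    right_on="verbatimIdentification",
    indicator=True,
)

# Source rows that an inner merge would silently drop
unmatched = checked.loc[checked["_merge"] == "left_only", "species"]
print(unmatched.value_counts())
```

Running the same check on the real dataframes before the inner merge would show which `species` entries are being excluded.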

In [12]:
occursource_df.head()

Unnamed: 0,sample_code,species,life_history_stage,lhs_0,lhs_1,scientificName,scientificNameID,taxonRank,kingdom,phylum,class,order,family,genus,scientificNameAuthorship,verbatimIdentification
0,20131003DBDm2_200,ACARTIA,3;_CIII,3,CIII,Acartia,urn:lsid:marinespecies.org:taxname:104108,Genus,Animalia,Arthropoda,Copepoda,Calanoida,Acartiidae,Acartia,"Dana, 1846",ACARTIA
1,20130906DBiDm1_200,ACARTIA,5;_CV,5,CV,Acartia,urn:lsid:marinespecies.org:taxname:104108,Genus,Animalia,Arthropoda,Copepoda,Calanoida,Acartiidae,Acartia,"Dana, 1846",ACARTIA
2,20131003DBDm1_200,ACARTIA,Female;_Adult,Female,Adult,Acartia,urn:lsid:marinespecies.org:taxname:104108,Genus,Animalia,Arthropoda,Copepoda,Calanoida,Acartiidae,Acartia,"Dana, 1846",ACARTIA
3,20131003DBDm1_200,ACARTIA,Male;_Adult,Male,Adult,Acartia,urn:lsid:marinespecies.org:taxname:104108,Genus,Animalia,Arthropoda,Copepoda,Calanoida,Acartiidae,Acartia,"Dana, 1846",ACARTIA
4,20130905DUNm1_200,ACARTIA,5;_CV,5,CV,Acartia,urn:lsid:marinespecies.org:taxname:104108,Genus,Animalia,Arthropoda,Copepoda,Calanoida,Acartiidae,Acartia,"Dana, 1846",ACARTIA


## Create and populate `occurrence_df` dataframe

### Add `occurrenceID`, `basisOfRecord`, `occurrenceStatus`

In [13]:
occurrence_df = occursource_df.copy()

In [14]:
# One random (version 4) UUID per record; the values change on every run
occurrence_df.insert(1, 'occurrenceID', [uuid.uuid4() for _ in range(len(occurrence_df))])
occurrence_df.insert(2, 'basisOfRecord', 'MaterialSample')
occurrence_df.insert(3, 'occurrenceStatus', 'present')

occurrence_df.rename(columns={'sample_code':'eventID'}, inplace=True)

In [15]:
len(occurrence_df)

6853
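
Because `uuid.uuid4()` returns random identifiers, each execution of the notebook assigns new `occurrenceID` values, so the intermediate and final csv files need to come from the same run. A minimal sketch of the ID generation with a uniqueness check, using a small stand-in record count:

```python
import uuid

n_records = 5  # stand-in for len(occurrence_df)

# Generate one random UUID string per record
occurrence_ids = [str(uuid.uuid4()) for _ in range(n_records)]

# Collisions are extremely unlikely, but the check is cheap
assert len(set(occurrence_ids)) == n_records

print(occurrence_ids[0])
```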

In [16]:
occurrence_df.head(10)

Unnamed: 0,eventID,occurrenceID,basisOfRecord,occurrenceStatus,species,life_history_stage,lhs_0,lhs_1,scientificName,scientificNameID,taxonRank,kingdom,phylum,class,order,family,genus,scientificNameAuthorship,verbatimIdentification
0,20131003DBDm2_200,57bb9fa7-f0b5-46dc-a396-932be8af1868,MaterialSample,present,ACARTIA,3;_CIII,3,CIII,Acartia,urn:lsid:marinespecies.org:taxname:104108,Genus,Animalia,Arthropoda,Copepoda,Calanoida,Acartiidae,Acartia,"Dana, 1846",ACARTIA
1,20130906DBiDm1_200,20219c4a-a71c-4ae6-a2d2-66e93e4a2d25,MaterialSample,present,ACARTIA,5;_CV,5,CV,Acartia,urn:lsid:marinespecies.org:taxname:104108,Genus,Animalia,Arthropoda,Copepoda,Calanoida,Acartiidae,Acartia,"Dana, 1846",ACARTIA
2,20131003DBDm1_200,30479af1-73fc-410a-b423-413c154e5ae4,MaterialSample,present,ACARTIA,Female;_Adult,Female,Adult,Acartia,urn:lsid:marinespecies.org:taxname:104108,Genus,Animalia,Arthropoda,Copepoda,Calanoida,Acartiidae,Acartia,"Dana, 1846",ACARTIA
3,20131003DBDm1_200,d7900a6e-8774-41fb-b352-c80f522a90e6,MaterialSample,present,ACARTIA,Male;_Adult,Male,Adult,Acartia,urn:lsid:marinespecies.org:taxname:104108,Genus,Animalia,Arthropoda,Copepoda,Calanoida,Acartiidae,Acartia,"Dana, 1846",ACARTIA
4,20130905DUNm1_200,b544d497-cd9d-4679-a90c-feb74408b655,MaterialSample,present,ACARTIA,5;_CV,5,CV,Acartia,urn:lsid:marinespecies.org:taxname:104108,Genus,Animalia,Arthropoda,Copepoda,Calanoida,Acartiidae,Acartia,"Dana, 1846",ACARTIA
5,20120614DBDm3_200,7ee05e13-74d2-4b3c-a436-199d6c1740ed,MaterialSample,present,ACARTIA_CLAUSI,Female;_Adult,Female,Adult,Acartia clausi,urn:lsid:marinespecies.org:taxname:104251,Species,Animalia,Arthropoda,Copepoda,Calanoida,Acartiidae,Acartia,"Giesbrecht, 1889",ACARTIA_CLAUSI
6,20120614DBDm3_200,011a20f4-4dbc-4aa0-8283-7a8e394b0371,MaterialSample,present,ACARTIA_CLAUSI,Male;_Adult,Male,Adult,Acartia clausi,urn:lsid:marinespecies.org:taxname:104251,Species,Animalia,Arthropoda,Copepoda,Calanoida,Acartiidae,Acartia,"Giesbrecht, 1889",ACARTIA_CLAUSI
7,20120614DBDm4_200,13f798cb-739e-46ca-a1b1-de229e743a3a,MaterialSample,present,ACARTIA_CLAUSI,Female;_Adult,Female,Adult,Acartia clausi,urn:lsid:marinespecies.org:taxname:104251,Species,Animalia,Arthropoda,Copepoda,Calanoida,Acartiidae,Acartia,"Giesbrecht, 1889",ACARTIA_CLAUSI
8,20120712DBDm1_200,0f970c3c-a7ba-465e-858d-9a31e1235862,MaterialSample,present,ACARTIA_CLAUSI,5;_CV,5,CV,Acartia clausi,urn:lsid:marinespecies.org:taxname:104251,Species,Animalia,Arthropoda,Copepoda,Calanoida,Acartiidae,Acartia,"Giesbrecht, 1889",ACARTIA_CLAUSI
9,20121004DBDm4_200,c13bd4f3-e623-466e-a597-bb7b43195a9b,MaterialSample,present,ACARTIA_CLAUSI,Female;_Adult,Female,Adult,Acartia clausi,urn:lsid:marinespecies.org:taxname:104251,Species,Animalia,Arthropoda,Copepoda,Calanoida,Acartiidae,Acartia,"Giesbrecht, 1889",ACARTIA_CLAUSI


### Map `life_history_stage` into `sex` and `lifeStage`

In [17]:
# dwc standard columns
occurrence_df.insert(
    4, 'sex', 
    occurrence_df['lhs_0'].apply(
        lambda s: s.lower() if s in ['Female', 'Male'] else 'indeterminate'
    )
)
occurrence_df.insert(
    5, 'lifeStage', 
    occurrence_df['lhs_0'].apply(lambda s: lifeStage_mapping[s])
)

# dwciri columns
nercvocabs_url = "http://vocab.nerc.ac.uk/collection/"
occurrence_df.insert(
    6, 'dwciri:sex', 
    occurrence_df['sex'].apply(lambda s: nercvocabs_url + sex_dwciri_terms[s])
)
occurrence_df.insert(
    7, 'dwciri:lifeStage', 
    occurrence_df['lhs_0'].apply(
        lambda s: None if lifeStage_dwciri_terms[s] is None 
        else nercvocabs_url + lifeStage_dwciri_terms[s]
    )
)
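
The mapping logic in the cell above can be sketched without pandas as two small functions. The names `to_dwc_sex` and `to_iri` are hypothetical, and the `EXAMPLE_*` term suffixes are placeholders rather than the values stored in `common_mappings.json`.

```python
NERC_VOCABS_URL = "http://vocab.nerc.ac.uk/collection/"

# Placeholder suffixes; the real ones come from common_mappings.json
example_sex_terms = {
    "female": "S10/current/EXAMPLE_F/",
    "male": "S10/current/EXAMPLE_M/",
    "indeterminate": "S10/current/EXAMPLE_I/",
}


def to_dwc_sex(lhs_0):
    """Return the DwC sex value; anything other than Female/Male is indeterminate."""
    return lhs_0.lower() if lhs_0 in ("Female", "Male") else "indeterminate"


def to_iri(term_suffix):
    """Build the full vocabulary URL, or None when no term is defined."""
    return None if term_suffix is None else NERC_VOCABS_URL + term_suffix


for lhs_0 in ["Female", "Male", "5", "Medusa"]:
    sex = to_dwc_sex(lhs_0)
    print(lhs_0, "->", sex, to_iri(example_sex_terms[sex]))
```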

In [18]:
occurrence_df = (
    occurrence_df
    .sort_values(by=['eventID', 'scientificName', 'lifeStage', 'sex'])
    .reset_index(drop=True)
)

In [19]:
occurrence_df.head(5)

Unnamed: 0,eventID,occurrenceID,basisOfRecord,occurrenceStatus,sex,lifeStage,dwciri:sex,dwciri:lifeStage,species,life_history_stage,...,scientificNameID,taxonRank,kingdom,phylum,class,order,family,genus,scientificNameAuthorship,verbatimIdentification
0,20120611UNDm1_200,7a3212fd-52ad-45f8-996a-d0a3c573b786,MaterialSample,present,female,adult,http://vocab.nerc.ac.uk/collection/S10/current...,http://vocab.nerc.ac.uk/collection/S11/current...,ACARTIA_CLAUSI,Female;_Adult,...,urn:lsid:marinespecies.org:taxname:104251,Species,Animalia,Arthropoda,Copepoda,Calanoida,Acartiidae,Acartia,"Giesbrecht, 1889",ACARTIA_CLAUSI
1,20120611UNDm1_200,9244c0a0-ac51-4d6f-b7a2-b0602e32572c,MaterialSample,present,male,adult,http://vocab.nerc.ac.uk/collection/S10/current...,http://vocab.nerc.ac.uk/collection/S11/current...,ACARTIA_CLAUSI,Male;_Adult,...,urn:lsid:marinespecies.org:taxname:104251,Species,Animalia,Arthropoda,Copepoda,Calanoida,Acartiidae,Acartia,"Giesbrecht, 1889",ACARTIA_CLAUSI
2,20120611UNDm1_200,399652c3-c897-443f-8251-d595da59ee50,MaterialSample,present,indeterminate,copepodites C5,http://vocab.nerc.ac.uk/collection/S10/current...,http://vocab.nerc.ac.uk/collection/S11/current...,ACARTIA_CLAUSI,5;_CV,...,urn:lsid:marinespecies.org:taxname:104251,Species,Animalia,Arthropoda,Copepoda,Calanoida,Acartiidae,Acartia,"Giesbrecht, 1889",ACARTIA_CLAUSI
3,20120611UNDm1_200,91c1d04e-6756-44a8-a66a-cf16c21bf92d,MaterialSample,present,indeterminate,medusae,http://vocab.nerc.ac.uk/collection/S10/current...,http://vocab.nerc.ac.uk/collection/S11/current...,AGLANTHA,Medusa,...,urn:lsid:marinespecies.org:taxname:117212,Genus,Animalia,Cnidaria,Hydrozoa,Trachymedusae,Rhopalonematidae,Aglantha,"Haeckel, 1879",AGLANTHA
4,20120611UNDm1_200,8a4e9752-8a85-4765-9ee8-8777c548dab4,MaterialSample,present,indeterminate,veliger,http://vocab.nerc.ac.uk/collection/S10/current...,http://vocab.nerc.ac.uk/collection/S11/current...,BIVALVIA,Veliger,...,urn:lsid:marinespecies.org:taxname:105,Class,Animalia,Mollusca,Bivalvia,,,,"Linnaeus, 1758",BIVALVIA


## Export intermediate table for `life_history_stage` matching to csv

In [20]:
if not debug_no_csvexport:
    occurrence_df[['occurrenceID', 'life_history_stage']].to_csv(
        data_pth / 'intermediate_DwC_occurrence_life_history_stage.csv', index=False
    )

## Export `occurrence_df` to csv

In [21]:
len(occurrence_df)

6853

### Cleanup

In [22]:
occurrence_df.drop(columns=['species', 'life_history_stage', 'lhs_0', 'lhs_1'], inplace=True)

if not debug_no_csvexport:
    occurrence_df.to_csv(data_pth / 'aligned_csvs' / 'DwC_occurrence.csv', index=False)
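
The export above writes into an `aligned_csvs` subdirectory, and writing fails if that directory does not exist. As a hedged sketch, the directory can be created on demand before exporting. The example uses a temporary directory and a one-line placeholder file in place of the real `data_pth` and dataframe.

```python
from pathlib import Path
import tempfile

debug_no_csvexport = False
exported = False

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "aligned_csvs"  # stand-in for data_pth / 'aligned_csvs'
    if not debug_no_csvexport:
        # Create the target directory if needed; no error if it already exists
        out_dir.mkdir(parents=True, exist_ok=True)
        out_file = out_dir / "DwC_occurrence.csv"
        out_file.write_text("eventID,occurrenceID\n")  # placeholder content
        exported = out_file.exists()

print(exported)
```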

## Package versions

In [23]:
print(
    f"{datetime.utcnow()} +00:00\n"
    f"pandas: {pd.__version__}"
)

2023-11-02 23:40:03.006607 +00:00
pandas: 1.5.3
