# DwC Occurrences. PSZMP dataset

TODO: Fill this out later, using the Hood Canal text as a template.  
University of Washington Pelagic Hypoxia Hood Canal project, Zooplankton dataset.   
Alignment of dataset to Darwin Core (DwC) for NANOOS, https://www.nanoos.org. This data alignment work, including this Jupyter notebook, is described in the GitHub repository https://github.com/nanoos-pnw/obis-keisterhczoop. See [notebooks-notes.md](https://github.com/nanoos-pnw/obis-keisterhczoop/blob/main/notebooks-notes.md) and [README.md](https://github.com/nanoos-pnw/obis-keisterhczoop/blob/main/README.md).   

Emilio Mayorga, https://github.com/emiliom

February 9, 2024

## Goals and scope of this notebook

Parse the source data and combine it with the `intermediate_DwC_taxonomy.csv` file created in the notebook `Keister-dwcTaxonomy.ipynb` to create the DwC "occurrence" file `DwC_occurrence.csv`. In this file, each record is an individual taxonomic occurrence observation from a sample, containing WoRMS taxonomic information, sex, and life stage. The original sex and life stage information in the source data (in the `life_history_stage` column) was parsed, manually corrected in some cases based on previous research (encoded in `data_preprocess.py` and `common_mappings.json`), and matched to standard vocabularies when applicable. This notebook also creates an intermediate file, `intermediate_DwC_occurrence_life_history_stage.csv`, that matches each `occurrenceID` created here to its `life_history_stage` entry, for later use in the `Keister-dwceMoF.ipynb` notebook.

## Settings

In [1]:
from datetime import datetime
import json
from pathlib import Path
import uuid

import pandas as pd

from data_preprocess import read_and_parse_sourcedata

In [2]:
data_pth = Path(".")

Set to `True` when debugging. `csv` files will not be exported when `debug_no_csvexport = True`.

In [3]:
debug_no_csvexport = False

## Process JSON file containing common mappings and strings

In [4]:
with open(data_pth / 'common_mappings.json') as f:
    common_mappings = json.load(f)

In [5]:
DatasetCode = common_mappings['datasetcode']
sex_dwciri_terms = common_mappings['sex_dwciri_terms']
life_stage_mappings = common_mappings['life_stage_mappings']

# TODO: Create two mappings from life_stage_mappings, for lifeStage and dwciri:lifeStage
lifeStage_mapping = {k:v[0] for k,v in life_stage_mappings.items()}
lifeStage_dwciri_terms = {k:v[1] for k,v in life_stage_mappings.items()}
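
The two comprehensions above imply that each value in `life_stage_mappings` is a two-item list. As a minimal sketch, assume the first item is the `lifeStage` label and the second is a NERC vocabulary path, or `null` when no term applies. The dictionary below is hypothetical: its keys, labels, and placeholder paths are illustrative and are not copied from `common_mappings.json`.

```python
# Hypothetical excerpt; the real entries live in common_mappings.json.
# Assumed layout: key -> [lifeStage label, vocabulary path or None]
example_life_stage_mappings = {
    "female": ["adult", "S11/current/PLACEHOLDER_A/"],
    "medusa": ["medusae", "S11/current/PLACEHOLDER_B/"],
    "unknown": ["unknown", None],
}

# Same split as in the cell above: item 0 is the label, item 1 is the term.
label_by_key = {k: v[0] for k, v in example_life_stage_mappings.items()}
term_by_key = {k: v[1] for k, v in example_life_stage_mappings.items()}

print(label_by_key["medusa"])
print(term_by_key["unknown"])
```

Keeping both values under a single key means a new life stage only has to be added in one place in the JSON file.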

## Read the data

### Read the pre-processed csv file

`usecols` defines the columns that will be kept and the order in which they'll be organized

In [6]:
# usecols = ['sample_code', 'species', 'life_history_stage', 'lhs_0', 'lhs_1']
usecols = ['Sample Code', 'Genus species_lc', 'Life History Stage_lc', 'lhs_0', 'lhs_1']

# occursource_df = read_and_parse_sourcedata(test_n_rows=1000)[usecols]
occursource_df = read_and_parse_sourcedata()[usecols]

# TODO: Rename more columns, if needed
occursource_df.rename(
    columns={
        'Sample Code':'sample_code',
        'Genus species_lc':'species',
        'Life History Stage_lc':'life_history_stage',
    },
    inplace=True
)

In [7]:
len(occursource_df)

153825

In [8]:
occursource_df.head()

Unnamed: 0,sample_code,species,life_history_stage,lhs_0,lhs_1
0,010218ELIV1151,acartia hudsonica,"male, adult",male,adult
1,010218ELIV1151,acartia longiremis,"female, adult",female,adult
2,010218ELIV1151,acartia longiremis,"male, adult",male,adult
3,010218ELIV1151,aetideus,"female, adult",female,adult
4,010218ELIV1151,aglantha digitale,medusa,medusa,


## Merge resolved taxonomy from taxonomy csv

In [9]:
taxonomy_df = pd.read_csv(
    data_pth / "intermediate_DwC_taxonomy.csv"
)

In [10]:
occursource_df = occursource_df.merge(
    taxonomy_df, 
    how='inner', 
    left_on='species',
    right_on='verbatimIdentification'
)

In [11]:
len(occursource_df)

153542
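
The row count drops from 153825 to 153542 after the inner merge, which suggests that some `species` values have no match in `verbatimIdentification`. One way to see which ones is a left merge with `indicator=True`. The sketch below uses small toy dataframes with hypothetical values in place of `occursource_df` and `taxonomy_df`.

```python
import pandas as pd

# Toy stand-ins for occursource_df and taxonomy_df (hypothetical values).
source = pd.DataFrame({
    "sample_code": ["S1", "S1", "S2"],
    "species": ["acartia hudsonica", "unresolved taxon", "aetideus"],
})
taxonomy = pd.DataFrame({
    "verbatimIdentification": ["acartia hudsonica", "aetideus"],
    "scientificName": ["Acartia hudsonica", "Aetideus"],
})

# A left merge keeps every source row; the _merge column flags
# rows that found no partner in the taxonomy table as "left_only".
checked = source.merge(
    taxonomy,
    how="left",
    left_on="species",
    right_on="verbatimIdentification",
    indicator=True,
)
unmatched = checked.loc[checked["_merge"] == "left_only", "species"]
unmatched_counts = unmatched.value_counts()
print(unmatched_counts)
```

Listing the unmatched names with their counts makes it easier to decide whether they are expected exclusions or entries still missing from the taxonomy file.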

In [12]:
occursource_df.head()

Unnamed: 0,sample_code,species,life_history_stage,lhs_0,lhs_1,scientificName,scientificNameID,taxonRank,kingdom,phylum,class,order,family,genus,scientificNameAuthorship,verbatimIdentification
0,010218ELIV1151,acartia hudsonica,"male, adult",male,adult,Acartia hudsonica,urn:lsid:marinespecies.org:taxname:157664,Species,Animalia,Arthropoda,Copepoda,Calanoida,Acartiidae,Acartia,"Pinhey, 1926",acartia hudsonica
1,010621SARAV1115,acartia hudsonica,copepodite,copepodite,,Acartia hudsonica,urn:lsid:marinespecies.org:taxname:157664,Species,Animalia,Arthropoda,Copepoda,Calanoida,Acartiidae,Acartia,"Pinhey, 1926",acartia hudsonica
2,010621SARAV1115,acartia hudsonica,"female, adult",female,adult,Acartia hudsonica,urn:lsid:marinespecies.org:taxname:157664,Species,Animalia,Arthropoda,Copepoda,Calanoida,Acartiidae,Acartia,"Pinhey, 1926",acartia hudsonica
3,010621SARAV1115,acartia hudsonica,"male, adult",male,adult,Acartia hudsonica,urn:lsid:marinespecies.org:taxname:157664,Species,Animalia,Arthropoda,Copepoda,Calanoida,Acartiidae,Acartia,"Pinhey, 1926",acartia hudsonica
4,010818SKETV1058,acartia hudsonica,"female, adult",female,adult,Acartia hudsonica,urn:lsid:marinespecies.org:taxname:157664,Species,Animalia,Arthropoda,Copepoda,Calanoida,Acartiidae,Acartia,"Pinhey, 1926",acartia hudsonica


## Create and populate `occurrence_df` dataframe

### Add `occurrenceID`, `basisOfRecord`, `occurrenceStatus`

In [13]:
occurrence_df = occursource_df.copy()

In [14]:
occurrence_df.insert(1, 'occurrenceID', [uuid.uuid4() for i in range(len(occurrence_df))])
occurrence_df.insert(2, 'basisOfRecord', 'MaterialSample')
occurrence_df.insert(3, 'occurrenceStatus', 'present')

occurrence_df.rename(columns={'sample_code':'eventID'}, inplace=True)

In [15]:
len(occurrence_df)

153542
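
Because `uuid.uuid4()` returns random identifiers, each run of this notebook assigns new `occurrenceID` values. A small check like the following sketch can confirm that the generated identifiers are unique. The toy dataframe is hypothetical, and converting each UUID to `str` is an optional choice made here, not something the cell above does.

```python
import uuid

import pandas as pd

# Toy stand-in for occurrence_df (hypothetical values).
toy = pd.DataFrame({"eventID": ["S1", "S1", "S2"]})

# One random UUID per row, stored as a plain string.
toy.insert(1, "occurrenceID", [str(uuid.uuid4()) for _ in range(len(toy))])

ids_are_unique = toy["occurrenceID"].is_unique
print(ids_are_unique)
```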

In [16]:
occurrence_df.head(10)

Unnamed: 0,eventID,occurrenceID,basisOfRecord,occurrenceStatus,species,life_history_stage,lhs_0,lhs_1,scientificName,scientificNameID,taxonRank,kingdom,phylum,class,order,family,genus,scientificNameAuthorship,verbatimIdentification
0,010218ELIV1151,d2dd5fd6-4562-4b58-a38f-d7de0259c40d,MaterialSample,present,acartia hudsonica,"male, adult",male,adult,Acartia hudsonica,urn:lsid:marinespecies.org:taxname:157664,Species,Animalia,Arthropoda,Copepoda,Calanoida,Acartiidae,Acartia,"Pinhey, 1926",acartia hudsonica
1,010621SARAV1115,5f24285b-649f-421e-b0cd-dda6b68c1ef8,MaterialSample,present,acartia hudsonica,copepodite,copepodite,,Acartia hudsonica,urn:lsid:marinespecies.org:taxname:157664,Species,Animalia,Arthropoda,Copepoda,Calanoida,Acartiidae,Acartia,"Pinhey, 1926",acartia hudsonica
2,010621SARAV1115,54265715-9235-4247-b96f-dfbeb7369873,MaterialSample,present,acartia hudsonica,"female, adult",female,adult,Acartia hudsonica,urn:lsid:marinespecies.org:taxname:157664,Species,Animalia,Arthropoda,Copepoda,Calanoida,Acartiidae,Acartia,"Pinhey, 1926",acartia hudsonica
3,010621SARAV1115,a9ab3ecc-3bd5-4309-9119-c58fdf6b7bb7,MaterialSample,present,acartia hudsonica,"male, adult",male,adult,Acartia hudsonica,urn:lsid:marinespecies.org:taxname:157664,Species,Animalia,Arthropoda,Copepoda,Calanoida,Acartiidae,Acartia,"Pinhey, 1926",acartia hudsonica
4,010818SKETV1058,fa7fa9d0-0c16-44ac-81ef-408802246f84,MaterialSample,present,acartia hudsonica,"female, adult",female,adult,Acartia hudsonica,urn:lsid:marinespecies.org:taxname:157664,Species,Animalia,Arthropoda,Copepoda,Calanoida,Acartiidae,Acartia,"Pinhey, 1926",acartia hudsonica
5,010821HCB004V1342,93284eea-66a3-4d86-b5ce-5cbf6d677111,MaterialSample,present,acartia hudsonica,copepodite,copepodite,,Acartia hudsonica,urn:lsid:marinespecies.org:taxname:157664,Species,Animalia,Arthropoda,Copepoda,Calanoida,Acartiidae,Acartia,"Pinhey, 1926",acartia hudsonica
6,010821HCB004V1342,2db58c44-110d-4b5b-b227-b8d6d3719abc,MaterialSample,present,acartia hudsonica,"male, adult",male,adult,Acartia hudsonica,urn:lsid:marinespecies.org:taxname:157664,Species,Animalia,Arthropoda,Copepoda,Calanoida,Acartiidae,Acartia,"Pinhey, 1926",acartia hudsonica
7,010921Wat1V1237,f1fe3314-2a74-4482-8196-2fdacf36d2e7,MaterialSample,present,acartia hudsonica,"female, adult",female,adult,Acartia hudsonica,urn:lsid:marinespecies.org:taxname:157664,Species,Animalia,Arthropoda,Copepoda,Calanoida,Acartiidae,Acartia,"Pinhey, 1926",acartia hudsonica
8,011315CAMV1330,b17f90b9-3e2b-4793-9d85-bbf3fdbc3cf1,MaterialSample,present,acartia hudsonica,copepodite,copepodite,,Acartia hudsonica,urn:lsid:marinespecies.org:taxname:157664,Species,Animalia,Arthropoda,Copepoda,Calanoida,Acartiidae,Acartia,"Pinhey, 1926",acartia hudsonica
9,011315CAMV1330,9bbe3fc1-c430-4b87-97b4-94971ff5b39a,MaterialSample,present,acartia hudsonica,"female, adult",female,adult,Acartia hudsonica,urn:lsid:marinespecies.org:taxname:157664,Species,Animalia,Arthropoda,Copepoda,Calanoida,Acartiidae,Acartia,"Pinhey, 1926",acartia hudsonica


### Map `life_history_stage` into `sex` and `lifeStage`

In [17]:
# dwc standard columns
occurrence_df.insert(
    4, 'sex', 
    occurrence_df['lhs_0'].apply(
        lambda s: s if s in ['female', 'male'] else 'indeterminate'
    )
)
occurrence_df.insert(
    5, 'lifeStage', 
    occurrence_df['lhs_0'].apply(lambda s: lifeStage_mapping[s])
)

# dwciri columns
nercvocabs_url = "http://vocab.nerc.ac.uk/collection/"
occurrence_df.insert(
    6, 'dwciri:sex', 
    occurrence_df['sex'].apply(lambda s: nercvocabs_url + sex_dwciri_terms[s])
)
occurrence_df.insert(
    7, 'dwciri:lifeStage', 
    occurrence_df['lhs_0'].apply(
        lambda s: None if lifeStage_dwciri_terms[s] is None 
        else nercvocabs_url + lifeStage_dwciri_terms[s]
    )
)
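
The dictionary lookups in the cell above raise a `KeyError` if `lhs_0` contains a value that is missing from the mapping. The sketch below shows one way to list such values beforehand, along with a vectorized equivalent of the `sex` rule. The mapping and the series are hypothetical stand-ins for `lifeStage_mapping` and `occurrence_df['lhs_0']`.

```python
import pandas as pd

# Hypothetical stand-ins for lifeStage_mapping and the lhs_0 column.
toy_mapping = {"female": "adult", "male": "adult", "medusa": "medusae"}
lhs_0 = pd.Series(["male", "medusa", "nauplius", "female"])

# Values present in the data but absent from the mapping keys.
missing = sorted(set(lhs_0.dropna()) - set(toy_mapping))
print(missing)

# Series.map leaves unmapped values as NaN instead of raising.
life_stage = lhs_0.map(toy_mapping)

# Keep 'female' and 'male'; replace everything else with 'indeterminate'.
sex = lhs_0.where(lhs_0.isin(["female", "male"]), "indeterminate")
print(sex.tolist())
```

Whether to fail on an unmapped value or carry it through as missing is a design choice. Failing early, as the notebook cell does, surfaces gaps in `common_mappings.json` immediately.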

In [18]:
occurrence_df = (
    occurrence_df
    .sort_values(by=['eventID', 'scientificName', 'lifeStage', 'sex'])
    .reset_index(drop=True)
)

In [19]:
occurrence_df.head(5)

Unnamed: 0,eventID,occurrenceID,basisOfRecord,occurrenceStatus,sex,lifeStage,dwciri:sex,dwciri:lifeStage,species,life_history_stage,...,scientificNameID,taxonRank,kingdom,phylum,class,order,family,genus,scientificNameAuthorship,verbatimIdentification
0,010218ELIV1151,d2dd5fd6-4562-4b58-a38f-d7de0259c40d,MaterialSample,present,male,adult,http://vocab.nerc.ac.uk/collection/S10/current...,http://vocab.nerc.ac.uk/collection/S11/current...,acartia hudsonica,"male, adult",...,urn:lsid:marinespecies.org:taxname:157664,Species,Animalia,Arthropoda,Copepoda,Calanoida,Acartiidae,Acartia,"Pinhey, 1926",acartia hudsonica
1,010218ELIV1151,2a44b045-8183-49fb-a5a3-78f27bd5252a,MaterialSample,present,female,adult,http://vocab.nerc.ac.uk/collection/S10/current...,http://vocab.nerc.ac.uk/collection/S11/current...,acartia longiremis,"female, adult",...,urn:lsid:marinespecies.org:taxname:104257,Species,Animalia,Arthropoda,Copepoda,Calanoida,Acartiidae,Acartia,"(Lilljeborg, 1853)",acartia longiremis
2,010218ELIV1151,ec8d3696-adf7-41a9-a1d3-99344b4cb97b,MaterialSample,present,male,adult,http://vocab.nerc.ac.uk/collection/S10/current...,http://vocab.nerc.ac.uk/collection/S11/current...,acartia longiremis,"male, adult",...,urn:lsid:marinespecies.org:taxname:104257,Species,Animalia,Arthropoda,Copepoda,Calanoida,Acartiidae,Acartia,"(Lilljeborg, 1853)",acartia longiremis
3,010218ELIV1151,d2b1f319-dd4f-4d67-a4de-422734979f9e,MaterialSample,present,female,adult,http://vocab.nerc.ac.uk/collection/S10/current...,http://vocab.nerc.ac.uk/collection/S11/current...,aetideus,"female, adult",...,urn:lsid:marinespecies.org:taxname:104112,Genus,Animalia,Arthropoda,Copepoda,Calanoida,Aetideidae,Aetideus,"Brady, 1883",aetideus
4,010218ELIV1151,92c8d50f-33d2-4569-a87e-d396c74d851b,MaterialSample,present,indeterminate,medusae,http://vocab.nerc.ac.uk/collection/S10/current...,http://vocab.nerc.ac.uk/collection/S11/current...,aglantha digitale,medusa,...,urn:lsid:marinespecies.org:taxname:117849,Species,Animalia,Cnidaria,Hydrozoa,Trachymedusae,Rhopalonematidae,Aglantha,"(O. F. Müller, 1776)",aglantha digitale


## Export intermediate table for `life_history_stage` matching to csv

In [20]:
if not debug_no_csvexport:
    occurrence_df[['occurrenceID', 'life_history_stage']].to_csv(
        data_pth / 'intermediate_DwC_occurrence_life_history_stage.csv', index=False
    )

## Export `occurrence_df` to csv

In [21]:
len(occurrence_df)

153542

### Cleanup

In [22]:
occurrence_df.drop(columns=['species', 'life_history_stage', 'lhs_0', 'lhs_1'], inplace=True)

if not debug_no_csvexport:
    occurrence_df.to_csv(data_pth / 'aligned_csvs' / 'DwC_occurrence.csv', index=False)

## Package versions

In [23]:
print(
    f"{datetime.utcnow()} +00:00\n"
    f"pandas: {pd.__version__}"
)

2024-02-09 23:44:48.478632 +00:00
pandas: 1.5.3
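
`datetime.utcnow()` is deprecated as of Python 3.12. On newer Python versions, a timezone-aware timestamp could be used instead, as in this sketch. The variable names `stamp` and `report` are illustrative, and the output format differs slightly from the cell above.

```python
from datetime import datetime, timezone

import pandas as pd

# datetime.now(timezone.utc) returns an aware datetime, so the UTC offset
# comes from the value itself rather than being typed into the f-string.
stamp = datetime.now(timezone.utc)
report = f"{stamp.isoformat(sep=' ')}\npandas: {pd.__version__}"
print(report)
```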
