# Darwin Core (DwC) **Occurrences**, Puget Sound Zooplankton Monitoring Program (PSZMP) dataset

Alignment of the zooplankton dataset to the [Darwin Core (DwC) data standard](https://dwc.tdwg.org/), carried out by **NANOOS**, https://www.nanoos.org. This data alignment work, including this Jupyter notebook, is described in the NANOOS GitHub repository https://github.com/nanoos-pnw/obis-pszmp. See [README.md](https://github.com/nanoos-pnw/obis-pszmp/blob/main/README.md) for further background on the dataset, DwC and data transformations.

Emilio Mayorga, https://github.com/emiliom

## Goals and scope of this notebook

Parse the source data and combine it with the `intermediate_DwC_taxonomy.csv` file created in the notebook `PSZMP-dwcTaxonomy.ipynb` to create the DwC "occurrence" file `DwC_occurrence.csv`. In this file, each record is an individual taxonomic occurrence observation from a sample, containing [WoRMS](https://www.marinespecies.org/) (World Register of Marine Species) taxonomic information as well as, when available, sex and life stage. The original sex and life stage information in the source data (in the `Life History Stage` column, renamed to `life_history_stage`) was parsed, converted to lowercase, manually corrected in some cases based on previous research (encoded in [data_preprocess.py](data_preprocess.py) and [common_mappings.json](common_mappings.json)), and matched to standard vocabularies when applicable. This notebook also creates an intermediate file, `intermediate_DwC_occurrence_life_history_stage.csv`, that pairs each `occurrenceID` created here with its `life_history_stage` entry, for later use in the `PSZMP-dwceMoF.ipynb` notebook.

## Settings

In [1]:
from datetime import datetime
import json
from pathlib import Path

import pandas as pd

from data_preprocess import create_csv_zip, read_and_parse_sourcedata

In [2]:
data_pth = Path(".")

Set to `True` when debugging. `csv` files will not be exported when `debug_no_csvexport = True`.

In [3]:
debug_no_csvexport = False

## Process JSON file containing common mappings and strings

In [4]:
with open(data_pth / 'common_mappings.json') as f:
    mappings = json.load(f)

In [5]:
DatasetCode = mappings['datasetcode']
sex_dwciri_terms = mappings['sex_dwciri_terms']
life_stage_mappings = mappings['life_stage_mappings']

lifeStage_mapping = {k:v[0] for k,v in life_stage_mappings.items()}
lifeStage_dwciri_terms = {k:v[1] for k,v in life_stage_mappings.items()}
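
The two comprehensions above imply that each value in `life_stage_mappings` is a two-element sequence: the `lifeStage` label first, then a vocabulary term suffix (or `null` when no term applies). The following is a minimal sketch of that split, assuming this structure. The entries and the `EXAMPLE` term suffix are made-up placeholders, not the actual contents of `common_mappings.json`.

```python
# Hypothetical entries; the real keys and terms live in common_mappings.json.
example_life_stage_mappings = {
    "nauplius": ["nauplius", "S11/current/EXAMPLE/"],  # placeholder term suffix
    "unknown": ["unknown", None],                      # no vocabulary term
}

# Same comprehensions as in the notebook cell, applied to the toy dictionary.
example_lifeStage_mapping = {
    k: v[0] for k, v in example_life_stage_mappings.items()
}
example_lifeStage_dwciri_terms = {
    k: v[1] for k, v in example_life_stage_mappings.items()
}

print(example_lifeStage_mapping)
print(example_lifeStage_dwciri_terms)
```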

## Pre-process data for Occurrence table

### Read and pre-process the source data from Excel file

`usecols` defines the columns that will be kept and the order in which they'll be organized

In [6]:
usecols = [
    'BugSampleID', 'Sample Code', 'Genus species', 'Genus species_lc', 'Life History Stage_lc', 'lhs_0', 'lhs_1'
]

# occursource_df = read_and_parse_sourcedata(test_n_rows=1000)[usecols]
occursource_df = read_and_parse_sourcedata()[usecols]

occursource_df.rename(
    columns={
        'Sample Code':'sample_code',
        'Genus species_lc':'species',
        'Life History Stage_lc':'life_history_stage',
    },
    inplace=True
)

In [7]:
len(occursource_df)

185737

In [8]:
occursource_df.head()

Unnamed: 0,BugSampleID,sample_code,Genus species,species,life_history_stage,lhs_0,lhs_1
0,12496,032514DANAD1147,ALPHEIDAE,alpheidae,unknown,unknown,
1,12497,032514DANAD1147,BARNACLES,cirripedia,cyprid larva,cyprid larva,
2,12498,032514DANAD1147,BARNACLES,cirripedia,nauplius,nauplius,
3,12499,032514DANAD1147,CALANUS,calanus,c5-adult,c5-adult,
4,12500,032514DANAD1147,CANCRIDAE,cancridae,"z1, zoea i",z1,zoea i


## Merge resolved taxonomy from taxonomy csv

In [9]:
taxonomy_df = pd.read_csv(
    data_pth / "intermediate_DwC_taxonomy.csv"
)

In [10]:
occursource_df = occursource_df.merge(
    taxonomy_df, 
    how='inner', 
    left_on='Genus species',
    right_on='verbatimIdentification'
)

In [11]:
len(occursource_df)

185737
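
The row count is unchanged by the inner merge, which suggests that every source record matched exactly one taxonomy record. A more explicit check is possible with the `validate` and `indicator` arguments of `DataFrame.merge`. The sketch below uses small toy dataframes with illustrative values, standing in for `occursource_df` and `taxonomy_df`.

```python
import pandas as pd

# Toy stand-ins for occursource_df and taxonomy_df (illustrative values only).
left = pd.DataFrame({"Genus species": ["CALANUS", "CALANUS", "ALPHEIDAE"]})
right = pd.DataFrame({
    "verbatimIdentification": ["CALANUS", "ALPHEIDAE"],
    "scientificName": ["Calanus", "Alpheidae"],
})

checked = left.merge(
    right,
    how="left",
    left_on="Genus species",
    right_on="verbatimIdentification",
    validate="many_to_one",  # raises MergeError if taxonomy keys are duplicated
    indicator=True,          # adds a "_merge" column: "both" or "left_only"
)

# Source rows with no taxonomy match would be flagged as "left_only".
unmatched = checked[checked["_merge"] == "left_only"]
print(len(left), len(checked), len(unmatched))
```

With a left merge, unmatched source rows are kept and flagged rather than silently dropped, as they would be with `how='inner'`.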

In [12]:
occursource_df.head()

Unnamed: 0,BugSampleID,sample_code,Genus species,species,life_history_stage,lhs_0,lhs_1,scientificName,scientificNameID,taxonRank,kingdom,phylum,class,order,family,genus,scientificNameAuthorship,verbatimIdentification
0,12496,032514DANAD1147,ALPHEIDAE,alpheidae,unknown,unknown,,Alpheidae,urn:lsid:marinespecies.org:taxname:106776,Family,Animalia,Arthropoda,Malacostraca,Decapoda,Alpheidae,,"Rafinesque, 1815",ALPHEIDAE
1,12523,032514DANAM1126,ALPHEIDAE,alpheidae,unknown,unknown,,Alpheidae,urn:lsid:marinespecies.org:taxname:106776,Family,Animalia,Arthropoda,Malacostraca,Decapoda,Alpheidae,,"Rafinesque, 1815",ALPHEIDAE
2,12605,040114sketm1120,ALPHEIDAE,alpheidae,unknown,unknown,,Alpheidae,urn:lsid:marinespecies.org:taxname:106776,Family,Animalia,Arthropoda,Malacostraca,Decapoda,Alpheidae,,"Rafinesque, 1815",ALPHEIDAE
3,12699,040914danam1055,ALPHEIDAE,alpheidae,unknown,unknown,,Alpheidae,urn:lsid:marinespecies.org:taxname:106776,Family,Animalia,Arthropoda,Malacostraca,Decapoda,Alpheidae,,"Rafinesque, 1815",ALPHEIDAE
4,12722,040914danas1030,ALPHEIDAE,alpheidae,unknown,unknown,,Alpheidae,urn:lsid:marinespecies.org:taxname:106776,Family,Animalia,Arthropoda,Malacostraca,Decapoda,Alpheidae,,"Rafinesque, 1815",ALPHEIDAE


## Create and populate `occurrence_df` dataframe

In [13]:
occurrence_df = occursource_df.copy()

### Add `occurrenceID`, `eventID`, `basisOfRecord`, `occurrenceStatus`

`occurrenceID` will be a placeholder initially, then populated in a subsequent step.

In [14]:
occurrence_df.insert(0, 'occurrenceID', '')
occurrence_df.insert(1, 'eventID', DatasetCode + "-SMP-" + occurrence_df['sample_code'])
occurrence_df.insert(2, 'basisOfRecord', 'MaterialSample')
occurrence_df.insert(3, 'occurrenceStatus', 'present')

### Map `life_history_stage` into `sex` and `lifeStage`

In [15]:
# dwc standard columns
occurrence_df.insert(
    4, 'sex', 
    occurrence_df['lhs_0'].apply(
        lambda s: s if s in ['female', 'male'] else 'indeterminate'
    )
)
occurrence_df.insert(
    5, 'lifeStage', 
    occurrence_df['lhs_0'].apply(lambda s: lifeStage_mapping[s])
)

# dwciri columns
occurrence_df.insert(
    6, 'dwciri:sex', 
    occurrence_df['sex'].apply(lambda s: mappings['vocab_server_base_url'] + sex_dwciri_terms[s])
)
occurrence_df.insert(
    7, 'dwciri:lifeStage', 
    occurrence_df['lhs_0'].apply(
        lambda s: None if lifeStage_dwciri_terms[s] is None 
        else mappings['vocab_server_base_url'] + lifeStage_dwciri_terms[s]
    )
)
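
To illustrate the mapping logic in the cell above: only `lhs_0` values of `female` or `male` are carried into `sex`, and everything else becomes `indeterminate`, while `lifeStage` is looked up in the mapping dictionary. The sketch below assumes a small, hypothetical `toy_lifeStage_mapping`; the actual mapping comes from `common_mappings.json`.

```python
import pandas as pd

# Hypothetical mapping entries, standing in for lifeStage_mapping.
toy_lifeStage_mapping = {
    "female": "adult",
    "male": "adult",
    "nauplius": "nauplius",
}

lhs_0 = pd.Series(["female", "male", "nauplius"])

# Same rules as the notebook cell: sex is kept only for female/male.
sex = lhs_0.apply(lambda s: s if s in ["female", "male"] else "indeterminate")
# A lookup with [] raises KeyError for any value missing from the mapping.
life_stage = lhs_0.apply(lambda s: toy_lifeStage_mapping[s])

print(pd.DataFrame({"lhs_0": lhs_0, "sex": sex, "lifeStage": life_stage}))
```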

### Construct and populate `occurrenceID`

`occurrenceID` is constructed from existing but regularized information. Specifically:
```
<Modified Sample Event ID>-<BugSampleID>-<WoRMS scientific name>-<lifestage>-<sex>
```
where `<Modified Sample Event ID>` is the occurrence's parent Event ID with the "-SMP-" ("sample") string token replaced by "-OCC-" ("occurrence"), and `<BugSampleID>` is a unique, largely persistent integer ID used in the data provider's internal database. Here's an example of a resulting `occurrenceID`:
```
PSZMP-OCC-040320MUKV1159-149417-DITRICHOCORYCAEUS_ANGLICUS-ADULT-M
```
where
- `<Modified Sample Event ID>` = "PSZMP-OCC-040320MUKV1159"
- `<BugSampleID>` = "149417"
- `<WoRMS scientific name>` = "DITRICHOCORYCAEUS_ANGLICUS"
- `<lifestage>` = "ADULT"
- `<sex>` = "M"

In [16]:
def construct_occurrenceID(row):
    """
    Construct occurrenceID from the sample eventID, BugSampleID, WoRMS
    (marinespecies.org) scientific name, lifeStage and sex, replacing
    "-SMP-" in the eventID with "-OCC-".
    """
    eventid_token = row["eventID"].replace("-SMP-", "-OCC-")
    bugsample_id = str(row["BugSampleID"])
    taxon_token = row["scientificName"].replace(" ", "_").upper()
    lifestage_token = "_".join(row["lifeStage"].split(" ")[:2]).upper()
    sex_token = row["sex"][0].upper()
    
    return "-".join(
        [eventid_token, bugsample_id, taxon_token, lifestage_token, sex_token]
    )

occurrence_df['occurrenceID'] = occurrence_df.apply(
    construct_occurrenceID,
    axis=1
)
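
Since `occurrenceID` is meant to identify each record, a uniqueness check after construction can be useful. The following is a minimal sketch on a toy dataframe. The first ID is the example given above; the second is a made-up variant for illustration.

```python
import pandas as pd

# Toy stand-in for occurrence_df; the second ID is invented for illustration.
toy = pd.DataFrame({
    "occurrenceID": [
        "PSZMP-OCC-040320MUKV1159-149417-DITRICHOCORYCAEUS_ANGLICUS-ADULT-M",
        "PSZMP-OCC-040320MUKV1159-149418-DITRICHOCORYCAEUS_ANGLICUS-ADULT-F",
    ]
})

# True when no occurrenceID value is repeated.
ids_are_unique = toy["occurrenceID"].is_unique

# Rows sharing an occurrenceID with another row (empty when all are unique).
duplicates = toy[toy["occurrenceID"].duplicated(keep=False)]

print(ids_are_unique, len(duplicates))
```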

In [17]:
occurrence_df = (
    occurrence_df
    .sort_values(by=['eventID', 'scientificName', 'lifeStage', 'sex'])
    .reset_index(drop=True)
)

In [18]:
occurrence_df.head(5)

Unnamed: 0,occurrenceID,eventID,basisOfRecord,occurrenceStatus,sex,lifeStage,dwciri:sex,dwciri:lifeStage,BugSampleID,sample_code,...,scientificNameID,taxonRank,kingdom,phylum,class,order,family,genus,scientificNameAuthorship,verbatimIdentification
0,PSZMP-OCC-010218ELIV1151-101148-ACARTIA_HUDSON...,PSZMP-SMP-010218ELIV1151,MaterialSample,present,male,adult,https://vocab.nerc.ac.uk/collection/S10/curren...,https://vocab.nerc.ac.uk/collection/S11/curren...,101148,010218ELIV1151,...,urn:lsid:marinespecies.org:taxname:157664,Species,Animalia,Arthropoda,Copepoda,Calanoida,Acartiidae,Acartia,"Pinhey, 1926",ACARTIA HUDSONICA
1,PSZMP-OCC-010218ELIV1151-101150-ACARTIA_LONGIR...,PSZMP-SMP-010218ELIV1151,MaterialSample,present,female,adult,https://vocab.nerc.ac.uk/collection/S10/curren...,https://vocab.nerc.ac.uk/collection/S11/curren...,101150,010218ELIV1151,...,urn:lsid:marinespecies.org:taxname:104257,Species,Animalia,Arthropoda,Copepoda,Calanoida,Acartiidae,Acartia,"(Lilljeborg, 1853)",ACARTIA LONGIREMIS
2,PSZMP-OCC-010218ELIV1151-101149-ACARTIA_LONGIR...,PSZMP-SMP-010218ELIV1151,MaterialSample,present,male,adult,https://vocab.nerc.ac.uk/collection/S10/curren...,https://vocab.nerc.ac.uk/collection/S11/curren...,101149,010218ELIV1151,...,urn:lsid:marinespecies.org:taxname:104257,Species,Animalia,Arthropoda,Copepoda,Calanoida,Acartiidae,Acartia,"(Lilljeborg, 1853)",ACARTIA LONGIREMIS
3,PSZMP-OCC-010218ELIV1151-101151-AETIDEUS-ADULT-F,PSZMP-SMP-010218ELIV1151,MaterialSample,present,female,adult,https://vocab.nerc.ac.uk/collection/S10/curren...,https://vocab.nerc.ac.uk/collection/S11/curren...,101151,010218ELIV1151,...,urn:lsid:marinespecies.org:taxname:104112,Genus,Animalia,Arthropoda,Copepoda,Calanoida,Aetideidae,Aetideus,"Brady, 1883",AETIDEUS
4,PSZMP-OCC-010218ELIV1151-101147-AGLANTHA_DIGIT...,PSZMP-SMP-010218ELIV1151,MaterialSample,present,indeterminate,medusae,https://vocab.nerc.ac.uk/collection/S10/curren...,https://vocab.nerc.ac.uk/collection/S11/curren...,101147,010218ELIV1151,...,urn:lsid:marinespecies.org:taxname:117849,Species,Animalia,Cnidaria,Hydrozoa,Trachymedusae,Rhopalonematidae,Aglantha,"(O. F. Müller, 1776)",AGLANTHA DIGITALE


## Export intermediate table for `life_history_stage` matching to csv

In [19]:
if not debug_no_csvexport:
    occurrence_df[['occurrenceID', 'life_history_stage']].to_csv(
        data_pth / 'intermediate_DwC_occurrence_life_history_stage.csv', index=False
    )

## Export `occurrence_df` to csv

In [20]:
len(occurrence_df)

185737

First remove interim columns that won't be exported.

In [21]:
occurrence_df.drop(
    columns=['BugSampleID', 'sample_code', 'Genus species', 'species', 'life_history_stage', 'lhs_0', 'lhs_1'], 
    inplace=True
)

In [22]:
csv_fpth = data_pth / "aligned_csvs" / "DwC_occurrence.csv"

In [23]:
if not debug_no_csvexport:
    occurrence_df.to_csv(csv_fpth, index=False)

### Create zip file with the csv

In [24]:
if not debug_no_csvexport:
    create_csv_zip(csv_fpth)
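
`create_csv_zip` is defined in `data_preprocess.py` and is not shown in this notebook. As a rough idea of what such a helper could do, here is a sketch using the standard library `zipfile` module. The function name `zip_single_csv`, the output file naming, and the compression choice are all assumptions for illustration and may differ from the actual implementation.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_single_csv(csv_path: Path) -> Path:
    """Hypothetical helper: write a zip next to csv_path containing that csv."""
    zip_path = csv_path.with_suffix(".zip")  # assumed naming convention
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # arcname stores only the file name, not the full directory path.
        zf.write(csv_path, arcname=csv_path.name)
    return zip_path


# Demonstration in a temporary directory with a tiny placeholder csv.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "DwC_occurrence.csv"
    csv_path.write_text("occurrenceID,eventID\n")
    zip_path = zip_single_csv(csv_path)
    with zipfile.ZipFile(zip_path) as zf:
        names = zf.namelist()

print(names)
```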

## Package versions

In [25]:
print(
    f"{datetime.utcnow()} +00:00\n"
    f"pandas: {pd.__version__}"
)

2024-10-06 20:12:42.420984 +00:00
pandas: 1.5.3
