# DwC eMoFs. PSZMP data

**Puget Sound Zooplankton Monitoring Program dataset.** Alignment of dataset to Darwin Core (DwC) for NANOOS, https://www.nanoos.org. This data alignment work, including this Jupyter notebook, is described in the GitHub repository https://github.com/nanoos-pnw/obis-pszmp. See [README.md](https://github.com/nanoos-pnw/obis-pszmp/blob/main/README.md).   

Emilio Mayorga, https://github.com/emiliom   

2/15/2024

**TODO:**
- The `eventID` values for the non-occurrence `measurementType` entries don't look correct. They're not the `eventID` values created in the event notebook. Could this issue (error?) also be found in the Hood Canal notebook?

## Goals and scope of this notebook

Parse the source data and combine it with `common_mappings.json`, the previously generated DwC event and occurrence csv files, and the `intermediate_DwC_occurrence_life_history_stage.csv` file created in the notebook `PSZMP-dwcOccurrence.ipynb` to create the DwC "Extended Measurement or Fact" (eMoF) file `DwC_emof.csv`. This file contains four types of eMoF information, referenced to standard vocabularies:
- Description of the multinet sampling platform
- Assignment of the water column sampling scheme
- Assignment of the sample mesh size
- Matchup of the sample density (abundance per unit volume) to the `occurrenceID` from the DwC occurrence table

The DwC eMoF table is populated sequentially for each of these eMoF types, in that order. Columns are populated differently depending on the eMoF type.
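
As a minimal sketch of what "populated differently" means in practice, the toy example below concatenates one event-level record and one occurrence-level record with pandas. The identifiers and values are made up for illustration and are not taken from the dataset. Columns that a given record type does not provide are left as missing values.

```python
import pandas as pd

# Toy records; IDs and values are illustrative, not from the PSZMP data
event_level = pd.DataFrame({
    "eventID": ["sample_A"],
    "measurementType": ["Sampling net mesh size"],
    "measurementValue": ["200"],
})
occurrence_level = pd.DataFrame({
    "eventID": ["sample_A"],
    "occurrenceID": ["occ_1"],
    "measurementType": ["Abundance (toy label)"],
    "measurementValue": ["6.3"],
})

# Concatenation aligns on column names; absent columns become NaN
toy_emof = pd.concat([event_level, occurrence_level], ignore_index=True)
print(toy_emof)
```

The event-level row ends up with no `occurrenceID`, which is the pattern the sections below rely on.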

## Settings

In [1]:
from datetime import datetime
import json
from pathlib import Path

import numpy as np
import pandas as pd

from data_preprocess import read_and_parse_sourcedata

In [2]:
data_pth = Path(".")

Set to `True` when debugging. `csv` files will not be exported when `debug_no_csvexport = True`.

In [3]:
debug_no_csvexport = False

## Process JSON file containing common mappings and strings

In [4]:
with open(data_pth / 'common_mappings.json') as f:
    common_mappings = json.load(f)

In [5]:
DatasetCode = common_mappings['datasetcode']
net_tow = common_mappings['net_tow']

iso8601_format = common_mappings['iso8601_format']
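
For orientation, the sketch below shows one way the three keys read above could be checked before use. Only the key names (`datasetcode`, `net_tow`, `iso8601_format`) come from this notebook. The JSON values are hypothetical placeholders, not the actual contents of `common_mappings.json`.

```python
import json

# Hypothetical file contents; only the key names are taken from the notebook
example_json = """
{
  "datasetcode": "EXAMPLE",
  "net_tow": {"Vertical": "placeholder A", "Oblique": "placeholder B"},
  "iso8601_format": "%Y-%m-%dT%H:%M:%S%z"
}
"""
example_mappings = json.loads(example_json)

# Fail early with a clear message if an expected key is absent
required_keys = ["datasetcode", "net_tow", "iso8601_format"]
missing_keys = [key for key in required_keys if key not in example_mappings]
if missing_keys:
    raise KeyError(f"Missing keys in common mappings: {missing_keys}")
```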

## Pre-process data for eMoF table

### Read and pre-process the source data from Excel file

`usecols` defines the columns that will be kept and the order in which they'll be organized

In [6]:
# I doubt I'll need any of the datetime columns here
# usecols = [
#     'sample_code', 'station', 
#     # 'date', 'time_start', 'time', 'day_night', 
#     'mesh_size', 'Tow Type', 
#     'species', 'life_history_stage', 'density'
# ]

usecols = [
    'Sample Code', 'Station', 
    'Mesh Size', 'Tow Type', 
    'Genus species_lc', 'Life History Stage_lc', 'Density (#/m3)'
]

emofsource_df = read_and_parse_sourcedata()[usecols]

# TODO: Rename more columns, if needed
emofsource_df.rename(
    columns={
        'Sample Code': 'sample_code',
        'Station': 'station',
        'Mesh Size': 'mesh_size',
        'Genus species_lc': 'species',
        'Life History Stage_lc': 'life_history_stage',
        'Density (#/m3)': 'density',
    },
    inplace=True
)
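
The selection and renaming steps above can be sketched on a toy frame. Selecting with a list of column names keeps only those columns, in the order of the list. The toy data and the `Extra Column` name are illustrative assumptions. This variant chains a non-in-place `rename`, which is one alternative to `inplace=True`.

```python
import pandas as pd

# Toy stand-in for the parsed source data; values are illustrative
toy_source = pd.DataFrame({
    "Station": ["ELIV"],
    "Extra Column": ["not selected"],
    "Sample Code": ["010218ELIV1151"],
    "Density (#/m3)": [0.899394],
})
toy_usecols = ["Sample Code", "Station", "Density (#/m3)"]

# List selection fixes both the subset and the column order
toy_df = toy_source[toy_usecols].rename(
    columns={
        "Sample Code": "sample_code",
        "Station": "station",
        "Density (#/m3)": "density",
    }
)
print(list(toy_df.columns))
```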

In [7]:
len(emofsource_df)

153825

In [8]:
emofsource_df.head()

Unnamed: 0,sample_code,station,mesh_size,Tow Type,species,life_history_stage,density
0,010218ELIV1151,ELIV,200,Vertical,acartia hudsonica,"male, adult",0.899394
1,010218ELIV1151,ELIV,200,Vertical,acartia longiremis,"female, adult",6.295757
2,010218ELIV1151,ELIV,200,Vertical,acartia longiremis,"male, adult",0.899394
3,010218ELIV1151,ELIV,200,Vertical,aetideus,"female, adult",1.798788
4,010218ELIV1151,ELIV,200,Vertical,aglantha digitale,medusa,1.43903


## Read dwcEvent and dwcOccurrence csv's

In [9]:
dwcevent_df = pd.read_csv(
    data_pth / "aligned_csvs" / "DwC_event.csv",
    parse_dates=['eventDate'],
    usecols=['eventID', 'parentEventID', 'eventDate']
)

dwcoccurrence_df = pd.read_csv(
    data_pth / "aligned_csvs" / "DwC_occurrence.csv",
    usecols=['occurrenceID', 'eventID', 'verbatimIdentification']
)

occurrence_vs_life_history_stage = pd.read_csv(
    data_pth / "intermediate_DwC_occurrence_life_history_stage.csv"
)

Add `life_history_stage` to the occurrence dataframe, to enable the eMoF data processing steps used here.

In [10]:
dwcoccurrence_df = dwcoccurrence_df.merge(occurrence_vs_life_history_stage, on='occurrenceID')
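
A minimal sketch of the same kind of merge with an explicit uniqueness check, assuming each `occurrenceID` should appear exactly once in both tables. The toy frames are illustrative, and the column layout of the intermediate file is an assumption. The `validate` argument of `DataFrame.merge` raises `pandas.errors.MergeError` if that assumption does not hold.

```python
import pandas as pd

# Toy frames; IDs and values are illustrative
toy_occurrence = pd.DataFrame({
    "occurrenceID": ["occ_1", "occ_2"],
    "eventID": ["s1", "s1"],
})
toy_stage = pd.DataFrame({
    "occurrenceID": ["occ_1", "occ_2"],
    "life_history_stage": ["female, adult", "male, adult"],
})

# validate="one_to_one" checks that the merge keys are unique on both sides
toy_merged = toy_occurrence.merge(
    toy_stage, on="occurrenceID", how="inner", validate="one_to_one"
)
print(toy_merged)
```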

## Create empty eMoF dataframe

In [11]:
# Won't use measurementID, per Abby's explanation on Slack
emof_cols_dtypes = np.dtype(
    [
        ('eventID', str),
        ('occurrenceID', str), 
        # --- temporary, for validation
        # ('life_history_stage', str),
        # ('verbatimIdentification', str),
        # ---- below, commented out columns not used
        ('measurementType', str),
        ('measurementTypeID', str), 
        ('measurementValue', str),
        # ('measurementValueID', str),
        # ('measurementAccuracy', str),
        ('measurementUnit', str),
        ('measurementUnitID', str),
        # ('measurementDeterminedDate', str),
        # ('measurementDeterminedBy', str), 
        # ('measurementMethod', str),
        # ('measurementRemarks', str)
    ]
)

In [12]:
emof_df = pd.DataFrame(np.empty(0, dtype=emof_cols_dtypes))

## MoF's associated with an event rather than an occurrence

These will have no `occurrenceID` entry.

### Associated with sample events

In [13]:
emofsource_samples_df = dwcevent_df.merge(
    emofsource_df[['sample_code', 'mesh_size', 'Tow Type']], 
    how='inner',
    left_on='eventID',
    right_on='sample_code'
)

emofsource_samples_df = (
    emofsource_samples_df
    .drop_duplicates()
    .drop(columns='sample_code')
    .sort_values(by='eventID')
    .reset_index(drop=True)
)

In [14]:
len(emofsource_samples_df)

3567
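
The reduction from one row per taxon to one row per sample can be sketched on toy data. The merge repeats the event columns for every source row of a sample, and `drop_duplicates` then collapses rows that are identical across all columns. The IDs and values below are illustrative.

```python
import pandas as pd

# Toy event table: one row per sample event
toy_events = pd.DataFrame({
    "eventID": ["s1", "s2"],
    "parentEventID": ["p1", "p2"],
})
# Toy source rows: several taxa per sample, so sample attributes repeat
toy_rows = pd.DataFrame({
    "sample_code": ["s1", "s1", "s1", "s2"],
    "mesh_size": [200, 200, 200, 335],
    "Tow Type": ["Vertical", "Vertical", "Vertical", "Oblique"],
})

toy_samples = (
    toy_events
    .merge(toy_rows, how="inner", left_on="eventID", right_on="sample_code")
    .drop_duplicates()
    .drop(columns="sample_code")
    .sort_values(by="eventID")
    .reset_index(drop=True)
)
print(toy_samples)
```

If a sample carried two different mesh sizes or tow types, it would survive as two rows, so checking `eventID` for uniqueness afterwards is a cheap sanity test.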

In [15]:
emofsource_samples_df.head()

Unnamed: 0,eventID,parentEventID,eventDate,mesh_size,Tow Type
0,010218ELIV1151,PSZMP_20180102D_ELIV,2018-01-02 11:51:00-07:00,200,Vertical
1,010322KSBP01D0815,PSZMP_20220103D_KSBP01D,2022-01-03 08:15:00-07:00,335,Oblique
2,010422LSNT01D1323,PSZMP_20220104D_LSNT01D,2022-01-04 13:23:00-07:00,335,Oblique
3,010422LSNT01V1305,PSZMP_20220104D_LSNT01V,2022-01-04 13:05:00-07:00,200,Vertical
4,010422NSEX01V1049,PSZMP_20220104D_NSEX01V,2022-01-04 10:49:00-07:00,200,Vertical


#### multinet sampling

In [16]:
multinet_emof_df = emofsource_samples_df.copy()

In [17]:
multinet_emof_df['measurementType'] = "multinet"
multinet_emof_df['measurementTypeID'] = "http://vocab.nerc.ac.uk/collection/L05/current/68/"
multinet_emof_df['measurementValue'] = 'Hydro-bios 5-net'

In [18]:
len(multinet_emof_df)

3567

Populate (append to) the `emof_df` table with the eMoF records.

In [19]:
emof_df = pd.concat(
    [
        emof_df,
        multinet_emof_df[['eventID', 'measurementType', 'measurementTypeID', 'measurementValue']]
    ],
    ignore_index=True
)

In [20]:
len(emof_df)

3567

In [None]:
emof_df.head(10)

Unnamed: 0,eventID,occurrenceID,measurementType,measurementTypeID,measurementValue,measurementUnit,measurementUnitID
0,010218ELIV1151,,multinet,http://vocab.nerc.ac.uk/collection/L05/current...,Hydro-bios 5-net,,
1,010322KSBP01D0815,,multinet,http://vocab.nerc.ac.uk/collection/L05/current...,Hydro-bios 5-net,,
2,010422LSNT01D1323,,multinet,http://vocab.nerc.ac.uk/collection/L05/current...,Hydro-bios 5-net,,
3,010422LSNT01V1305,,multinet,http://vocab.nerc.ac.uk/collection/L05/current...,Hydro-bios 5-net,,
4,010422NSEX01V1049,,multinet,http://vocab.nerc.ac.uk/collection/L05/current...,Hydro-bios 5-net,,
5,010621SARAV1115,,multinet,http://vocab.nerc.ac.uk/collection/L05/current...,Hydro-bios 5-net,,
6,010818SKETV1058,,multinet,http://vocab.nerc.ac.uk/collection/L05/current...,Hydro-bios 5-net,,
7,010820KSBP01D1450,,multinet,http://vocab.nerc.ac.uk/collection/L05/current...,Hydro-bios 5-net,,
8,010820KSBP01V1430,,multinet,http://vocab.nerc.ac.uk/collection/L05/current...,Hydro-bios 5-net,,
9,010820LSNT01D1131,,multinet,http://vocab.nerc.ac.uk/collection/L05/current...,Hydro-bios 5-net,,


#### depth stratified vs full water column sampling

In [21]:
fwcds_emof_df = emofsource_samples_df.copy()

In [22]:
fwcds_emof_df['measurementType'] = "Sampling method"
fwcds_emof_df['measurementTypeID'] = "http://vocab.nerc.ac.uk/collection/Q01/current/Q0100003/"
fwcds_emof_df['measurementValue'] = fwcds_emof_df['Tow Type'].apply(lambda fwc_ds: net_tow[fwc_ds])
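
The dictionary lookup above raises a `KeyError` if a tow type is absent from `net_tow`. As a hedged alternative sketch, `Series.map` returns missing values for unmapped entries, which makes it easy to list them. The mapping values and the `Horizontal` entry below are hypothetical, not taken from `common_mappings.json`.

```python
import pandas as pd

# Hypothetical mapping; the real values live in common_mappings.json
toy_net_tow = {"Vertical": "placeholder A", "Oblique": "placeholder B"}

toy_tow_type = pd.Series(["Vertical", "Oblique", "Horizontal"])

# map() yields NaN where the key is not in the dictionary
mapped = toy_tow_type.map(toy_net_tow)
unmapped = toy_tow_type[mapped.isna()].unique().tolist()
print(unmapped)
```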

In [23]:
len(fwcds_emof_df)

3567

Populate (append to) the `emof_df` table with the eMoF records.

In [24]:
emof_df = pd.concat(
    [
        emof_df,
        fwcds_emof_df[['eventID', 'measurementType', 'measurementTypeID', 'measurementValue']], 
    ],
    ignore_index=True
)

In [25]:
len(emof_df)

7134

In [26]:
emof_df.head(10)

Unnamed: 0,eventID,occurrenceID,measurementType,measurementTypeID,measurementValue,measurementUnit,measurementUnitID
0,010218ELIV1151,,multinet,http://vocab.nerc.ac.uk/collection/L05/current...,Hydro-bios 5-net,,
1,010322KSBP01D0815,,multinet,http://vocab.nerc.ac.uk/collection/L05/current...,Hydro-bios 5-net,,
2,010422LSNT01D1323,,multinet,http://vocab.nerc.ac.uk/collection/L05/current...,Hydro-bios 5-net,,
3,010422LSNT01V1305,,multinet,http://vocab.nerc.ac.uk/collection/L05/current...,Hydro-bios 5-net,,
4,010422NSEX01V1049,,multinet,http://vocab.nerc.ac.uk/collection/L05/current...,Hydro-bios 5-net,,
5,010621SARAV1115,,multinet,http://vocab.nerc.ac.uk/collection/L05/current...,Hydro-bios 5-net,,
6,010818SKETV1058,,multinet,http://vocab.nerc.ac.uk/collection/L05/current...,Hydro-bios 5-net,,
7,010820KSBP01D1450,,multinet,http://vocab.nerc.ac.uk/collection/L05/current...,Hydro-bios 5-net,,
8,010820KSBP01V1430,,multinet,http://vocab.nerc.ac.uk/collection/L05/current...,Hydro-bios 5-net,,
9,010820LSNT01D1131,,multinet,http://vocab.nerc.ac.uk/collection/L05/current...,Hydro-bios 5-net,,


#### mesh size

In [27]:
mesh_emof_df = emofsource_samples_df.copy()

In [28]:
mesh_emof_df['measurementType'] = "Sampling net mesh size"
mesh_emof_df['measurementTypeID'] = "http://vocab.nerc.ac.uk/collection/Q01/current/Q0100015/"
mesh_emof_df['measurementUnit'] = "Micrometres (microns)"
mesh_emof_df['measurementUnitID'] = "http://vocab.nerc.ac.uk/collection/P06/current/UMIC/"

mesh_emof_df.rename(columns={'mesh_size':'measurementValue'}, inplace=True)

In [29]:
len(mesh_emof_df)

3567
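
The eMoF table declares `measurementValue` as a string column, while the mesh size appears to be numeric in the source. Whether an explicit cast is needed depends on how downstream tools treat mixed types. The sketch below shows one way to do it on toy data.

```python
import pandas as pd

# Toy mesh size records; IDs are illustrative
toy_mesh = pd.DataFrame({"eventID": ["s1", "s2"], "mesh_size": [200, 335]})

toy_mesh = toy_mesh.rename(columns={"mesh_size": "measurementValue"})
# Cast to string so the column holds one type after concatenation
toy_mesh["measurementValue"] = toy_mesh["measurementValue"].astype(str)
print(toy_mesh["measurementValue"].tolist())
```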

Populate (append to) the `emof_df` table with the eMoF records.

In [30]:
emof_df = pd.concat(
    [
        emof_df,
        mesh_emof_df[[
            'eventID', 
            'measurementType', 'measurementTypeID', 'measurementValue', 
            'measurementUnit', 'measurementUnitID'
        ]]
    ],
    ignore_index=True
)

In [31]:
len(emof_df)

10701

In [32]:
emof_df.tail(10)

Unnamed: 0,eventID,occurrenceID,measurementType,measurementTypeID,measurementValue,measurementUnit,measurementUnitID
10691,121922TDBV1158,,Sampling net mesh size,http://vocab.nerc.ac.uk/collection/Q01/current...,200,Micrometres (microns),http://vocab.nerc.ac.uk/collection/P06/current...
10692,122116LSNT01D1115,,Sampling net mesh size,http://vocab.nerc.ac.uk/collection/Q01/current...,335,Micrometres (microns),http://vocab.nerc.ac.uk/collection/P06/current...
10693,122116LSNT01V1057,,Sampling net mesh size,http://vocab.nerc.ac.uk/collection/Q01/current...,200,Micrometres (microns),http://vocab.nerc.ac.uk/collection/P06/current...
10694,122116NSEX01V1237,,Sampling net mesh size,http://vocab.nerc.ac.uk/collection/Q01/current...,200,Micrometres (microns),http://vocab.nerc.ac.uk/collection/P06/current...
10695,122121HCB003V1233,,Sampling net mesh size,http://vocab.nerc.ac.uk/collection/Q01/current...,200,Micrometres (microns),http://vocab.nerc.ac.uk/collection/P06/current...
10696,122121HCB004V1332,,Sampling net mesh size,http://vocab.nerc.ac.uk/collection/Q01/current...,200,Micrometres (microns),http://vocab.nerc.ac.uk/collection/P06/current...
10697,122220CAMV1044,,Sampling net mesh size,http://vocab.nerc.ac.uk/collection/Q01/current...,200,Micrometres (microns),http://vocab.nerc.ac.uk/collection/P06/current...
10698,122220Cow3V21046,,Sampling net mesh size,http://vocab.nerc.ac.uk/collection/Q01/current...,200,Micrometres (microns),http://vocab.nerc.ac.uk/collection/P06/current...
10699,122220WAT1S1303,,Sampling net mesh size,http://vocab.nerc.ac.uk/collection/Q01/current...,335,Micrometres (microns),http://vocab.nerc.ac.uk/collection/P06/current...
10700,122320ELID1047,,Sampling net mesh size,http://vocab.nerc.ac.uk/collection/Q01/current...,335,Micrometres (microns),http://vocab.nerc.ac.uk/collection/P06/current...


## MoF's associated with exactly one occurrence

These will have an `occurrenceID` entry.

The original `life_history_stage` strings must be kept with the occurrence table so that it can be merged properly with `emofsource_df`. Possible approaches are to add an `occurrenceRemarks` column, or to keep a `life_history_stage` column that is dropped at a later stage.

In [33]:
emofsource_samplesoccur_df = (
    dwcevent_df
    .merge(
        emofsource_df[['sample_code', 'species', 'life_history_stage', 'density']], 
        how='inner',
        left_on='eventID',
        right_on='sample_code'
    )
    .merge(
        dwcoccurrence_df,
        how='inner',
        left_on=['eventID', 'species', 'life_history_stage'],
        right_on=['eventID', 'verbatimIdentification', 'life_history_stage']
    )
)

emofsource_samplesoccur_df = (
    emofsource_samplesoccur_df
    .drop_duplicates()
    .drop(columns=['sample_code', 'parentEventID', 'species'])
    .sort_values(by='eventID')
    .reset_index(drop=True)
)

In [34]:
len(emofsource_samplesoccur_df)

155471
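
The merged row count (155471) exceeds the number of source rows (153825). One possible explanation, not confirmed here, is that some combinations of the merge keys occur more than once in one of the tables, so that a single source row matches several occurrences. The sketch below shows how such duplicated keys could be listed, using toy data with one deliberately duplicated key.

```python
import pandas as pd

# Toy source rows and occurrence rows; IDs and values are illustrative
toy_source_rows = pd.DataFrame({
    "eventID": ["s1", "s1"],
    "species": ["acartia longiremis", "metridia pacifica"],
    "life_history_stage": ["female, adult", "male, adult"],
    "density": [6.3, 0.12],
})
toy_occ = pd.DataFrame({
    "eventID": ["s1", "s1", "s1"],
    "verbatimIdentification": [
        "acartia longiremis", "metridia pacifica", "metridia pacifica"
    ],
    "life_history_stage": ["female, adult", "male, adult", "male, adult"],
    "occurrenceID": ["occ_1", "occ_2", "occ_3"],
})

key_cols = ["eventID", "verbatimIdentification", "life_history_stage"]
# Rows whose key combination appears more than once
dup_keys = toy_occ[toy_occ.duplicated(subset=key_cols, keep=False)]

merged = toy_source_rows.merge(
    toy_occ,
    how="inner",
    left_on=["eventID", "species", "life_history_stage"],
    right_on=key_cols,
)
print(len(toy_source_rows), len(merged), len(dup_keys))
```

Here two source rows become three merged rows because one key matches two occurrences.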

In [35]:
emofsource_samplesoccur_df.head()

Unnamed: 0,eventID,eventDate,life_history_stage,density,occurrenceID,verbatimIdentification
0,010218ELIV1151,2018-01-02 11:51:00-07:00,"female, adult",6.295757,c8152ecf-73c1-4440-9244-3d4b31c0aa0c,acartia longiremis
1,010218ELIV1151,2018-01-02 11:51:00-07:00,"male, adult",0.119919,f52a8b15-4312-4d82-a71e-75c976187b23,metridia pacifica
2,010218ELIV1151,2018-01-02 11:51:00-07:00,"female, adult",3.117899,0414a828-2232-4b9d-ac24-6ef19d0eeb8e,metridia pacifica
3,010218ELIV1151,2018-01-02 11:51:00-07:00,"5, cv",0.899394,17101bf0-3626-4878-ad51-2553d02332e8,metridia pacifica
4,010218ELIV1151,2018-01-02 11:51:00-07:00,"z1, zoea i",0.899394,2f468579-49cc-4bde-954c-4207cd850be5,metacarcinus magister


### density / abundance

In [36]:
abundance_emof_df = emofsource_samplesoccur_df.copy()

In [37]:
abundance_emof_df['measurementType'] = "Abundance of biological entity specified elsewhere per unit volume of the water body"
abundance_emof_df['measurementTypeID'] = "http://vocab.nerc.ac.uk/collection/P01/current/SDBIOL01/"
abundance_emof_df['measurementUnit'] = "Number per cubic metre"
abundance_emof_df['measurementUnitID'] = "http://vocab.nerc.ac.uk/collection/P06/current/UPMM/"

abundance_emof_df.rename(columns={'density':'measurementValue'}, inplace=True)

Populate (append to) the `emof_df` table with the eMoF records.

In [38]:
emof_df = pd.concat(
    [
        emof_df,
        abundance_emof_df[[
            'eventID', 'occurrenceID',
            # 'life_history_stage', 'verbatimIdentification', # --- temporary, for validation
            'measurementType', 'measurementTypeID', 'measurementValue', 
            'measurementUnit', 'measurementUnitID'
        ]]
    ],
    ignore_index=True
)

In [39]:
len(emof_df)

166172

In [40]:
emof_df.tail(10)

Unnamed: 0,eventID,occurrenceID,measurementType,measurementTypeID,measurementValue,measurementUnit,measurementUnitID
166162,122320ELID1047,0af3428e-851d-4093-8ec0-ada241d2f806,Abundance of biological entity specified elsew...,http://vocab.nerc.ac.uk/collection/P01/current...,1.205102,Number per cubic metre,http://vocab.nerc.ac.uk/collection/P06/current...
166163,122320ELID1047,d872cf27-451e-4630-9389-3b805d4e50d5,Abundance of biological entity specified elsew...,http://vocab.nerc.ac.uk/collection/P01/current...,0.060255,Number per cubic metre,http://vocab.nerc.ac.uk/collection/P06/current...
166164,122320ELID1047,1a237e49-5dde-4299-8e4d-ee0a7444b9ac,Abundance of biological entity specified elsew...,http://vocab.nerc.ac.uk/collection/P01/current...,0.24102,Number per cubic metre,http://vocab.nerc.ac.uk/collection/P06/current...
166165,122320ELID1047,d7794aec-7472-4133-87ea-0b4875b4e55a,Abundance of biological entity specified elsew...,http://vocab.nerc.ac.uk/collection/P01/current...,0.060255,Number per cubic metre,http://vocab.nerc.ac.uk/collection/P06/current...
166166,122320ELID1047,5597c18d-e6c1-4961-b27a-184b789e92da,Abundance of biological entity specified elsew...,http://vocab.nerc.ac.uk/collection/P01/current...,0.060255,Number per cubic metre,http://vocab.nerc.ac.uk/collection/P06/current...
166167,122320ELID1047,2adced5a-e159-42a6-ae7d-c8103b7b527b,Abundance of biological entity specified elsew...,http://vocab.nerc.ac.uk/collection/P01/current...,0.602551,Number per cubic metre,http://vocab.nerc.ac.uk/collection/P06/current...
166168,122320ELID1047,87c756ec-c197-47c6-ab65-76134ce596ae,Abundance of biological entity specified elsew...,http://vocab.nerc.ac.uk/collection/P01/current...,,Number per cubic metre,http://vocab.nerc.ac.uk/collection/P06/current...
166169,122320ELID1047,8f14b45e-ebe2-4350-8c63-82d658c66a2c,Abundance of biological entity specified elsew...,http://vocab.nerc.ac.uk/collection/P01/current...,0.060255,Number per cubic metre,http://vocab.nerc.ac.uk/collection/P06/current...
166170,122320ELID1047,c53f59ff-0ea5-4bb1-a055-f72aa1046e98,Abundance of biological entity specified elsew...,http://vocab.nerc.ac.uk/collection/P01/current...,0.602551,Number per cubic metre,http://vocab.nerc.ac.uk/collection/P06/current...
166171,122320ELID1047,f8b72e4c-3a67-4a56-8ac4-304b25822e23,Abundance of biological entity specified elsew...,http://vocab.nerc.ac.uk/collection/P01/current...,171.726995,Number per cubic metre,http://vocab.nerc.ac.uk/collection/P06/current...


## Export `emof_df` to csv

In [41]:
emof_df.measurementType.value_counts()

Abundance of biological entity specified elsewhere per unit volume of the water body    155471
multinet                                                                                  3567
Sampling method                                                                           3567
Sampling net mesh size                                                                    3567
Name: measurementType, dtype: int64
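
Before exporting, a few consistency checks may be useful. The sketch below, on toy data, checks that every `eventID` in the eMoF table exists in the event table and that only abundance records carry an `occurrenceID`. The toy IDs are illustrative, and the `startswith("Abundance")` test is an assumption based on the measurement type labels used in this notebook.

```python
import pandas as pd

# Toy eMoF table and event IDs; values are illustrative
toy_emof = pd.DataFrame({
    "eventID": ["s1", "s1", "s2"],
    "occurrenceID": [None, "occ_1", None],
    "measurementType": ["multinet", "Abundance (toy label)", "multinet"],
})
toy_event_ids = {"s1", "s2"}

# eventIDs present in the eMoF table but absent from the event table
unknown_events = set(toy_emof["eventID"]) - toy_event_ids

# occurrenceID should be set for abundance records and empty otherwise
is_abundance = toy_emof["measurementType"].str.startswith("Abundance")
occurrence_id_ok = (toy_emof["occurrenceID"].notna() == is_abundance).all()
print(unknown_events, occurrence_id_ok)
```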

In [42]:
if not debug_no_csvexport:
    emof_df.to_csv(data_pth / 'aligned_csvs' / 'DwC_emof.csv', index=False)

## Package versions

In [43]:
print(
    f"{datetime.utcnow()} +00:00\n"
    f"pandas: {pd.__version__}"
)

2024-02-16 04:34:47.267196 +00:00
pandas: 1.5.3
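
`datetime.utcnow()` is deprecated as of Python 3.12. A minimal sketch of a timezone-aware alternative, which includes the UTC offset in the value itself instead of appending it by hand:

```python
from datetime import datetime, timezone

import pandas as pd

# Timezone-aware current time in UTC
now_utc = datetime.now(timezone.utc)

report = f"{now_utc.isoformat()}\npandas: {pd.__version__}"
print(report)
```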
