# DwC eMoFs. PSZMP data

**Puget Sound Zooplankton Monitoring Program dataset.** Alignment of dataset to Darwin Core (DwC) for NANOOS, https://www.nanoos.org. This data alignment work, including this Jupyter notebook, is described in the GitHub repository https://github.com/nanoos-pnw/obis-pszmp. See [README.md](https://github.com/nanoos-pnw/obis-pszmp/blob/main/README.md).   

Emilio Mayorga, https://github.com/emiliom   

3/19,13,5-1. 2/16,15/2024

## Goals and scope of this notebook

Parse the source data and combine it with `common_mappings.json`, the previously generated DwC event and occurrence csv files, and the `intermediate_DwC_occurrence_life_history_stage.csv` file created in the notebook `PSZMP-dwcOccurrence.ipynb` to create the DwC "Extended Measurement or Fact" (eMoF) file `DwC_emof.csv`. This file contains four types of eMoF information, referenced to standard vocabularies:
- Description of the net sampling platform
- Assignment of the water column sampling scheme
- Assignment of the sample mesh size
- Matchup of the sample density (abundance per unit volume) and biomass carbon (mg of C per unit volume) to the `occurrenceID` from the DwC occurrence table

The DwC eMoF table is populated sequentially for each of these eMoF types, in that order. Columns are populated differently depending on the eMoF type.
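As a rough illustration of the resulting long-format layout, the sketch below stacks one event-level record and two occurrence-level records. The identifiers (`sample_A`, `occ-1`, `occ-2`), the shortened measurement type `Abundance`, and all values are made up for illustration; they are not taken from the dataset.

```python
import pandas as pd

# Toy rows only; IDs, names and values are hypothetical.
event_level = pd.DataFrame({
    "eventID": ["sample_A"],
    "measurementType": ["Sampling net mesh size"],
    "measurementValue": ["200"],
    "measurementUnit": ["Micrometres (microns)"],
})
occurrence_level = pd.DataFrame({
    "eventID": ["sample_A", "sample_A"],
    "occurrenceID": ["occ-1", "occ-2"],
    "measurementType": ["Abundance", "Abundance"],
    "measurementValue": ["0.899", "6.296"],
    "measurementUnit": ["Number per cubic metre"] * 2,
})

# Stacking the blocks in order gives one long table. A column that a block
# does not provide (occurrenceID for the event-level row) is left null.
toy_emof = pd.concat([event_level, occurrence_level], ignore_index=True)
print(toy_emof)
```

Each row carries a single measurement, so an event or occurrence with several measurements appears on several rows.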

## Settings

In [1]:
from datetime import datetime
import json
from pathlib import Path

import numpy as np
import pandas as pd

from data_preprocess import create_csv_zip, read_and_parse_sourcedata

In [2]:
data_pth = Path(".")

Set to `True` when debugging. `csv` files will not be exported when `debug_no_csvexport = True`.

In [3]:
debug_no_csvexport = False

## Process JSON file containing common mappings and strings

In [4]:
with open(data_pth / 'common_mappings.json') as f:
    mappings = json.load(f)

In [5]:
DatasetCode = mappings['datasetcode']
net_tow = mappings['net_tow']

iso8601_format = mappings['iso8601_format']

## Pre-process data for eMoF table

### Read and pre-process the source data from Excel file

`usecols` defines the columns that will be kept and the order in which they'll be organized

In [6]:
usecols = [
    'Sample Code', 'Station', 
    'Mesh Size', 'Tow Type', 
    'Genus species_lc', 'Life History Stage_lc', 
    'Density (#/m3)', 'Final Carbon (mg/m3)'
]

emofsource_df = read_and_parse_sourcedata()[usecols]

# TODO: Rename more columns, if needed
emofsource_df.rename(
    columns={
        'Sample Code': 'sample_code',
        'Station': 'station',
        'Mesh Size': 'mesh_size',
        'Genus species_lc': 'species',
        'Life History Stage_lc': 'life_history_stage',
        'Density (#/m3)': 'density',
        'Final Carbon (mg/m3)': 'carbon_biomass',
    },
    inplace=True
)
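As a small aside on the select-then-rename pattern used above, the sketch below runs it on a toy frame. The frame `source` and its values are hypothetical stand-ins for the parsed source data. It uses `rename()` without `inplace=True`, which is an alternative style and not the notebook's own approach.

```python
import pandas as pd

# Toy stand-in for the parsed source data (values are made up).
source = pd.DataFrame({
    "Station": ["ELIV", "ELIV"],
    "Density (#/m3)": [0.9, 6.3],
    "Sample Code": ["S1", "S1"],
    "Unused column": [1, 2],
})

keep = ["Sample Code", "Station", "Density (#/m3)"]

# Selecting with a list keeps only those columns, in the list's order.
# rename() without inplace returns a new frame and leaves `source` untouched.
subset = source[keep].rename(
    columns={
        "Sample Code": "sample_code",
        "Station": "station",
        "Density (#/m3)": "density",
    }
)
print(list(subset.columns))
```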

In [7]:
len(emofsource_df)

185660

In [8]:
emofsource_df.head()

Unnamed: 0,sample_code,station,mesh_size,Tow Type,species,life_history_stage,density,carbon_biomass
0,010218ELIV1151,ELIV,200,Vertical,acartia hudsonica,"male, adult",0.899394,0.001152
1,010218ELIV1151,ELIV,200,Vertical,acartia longiremis,"female, adult",6.295757,0.016254
2,010218ELIV1151,ELIV,200,Vertical,acartia longiremis,"male, adult",0.899394,0.001152
3,010218ELIV1151,ELIV,200,Vertical,aetideus,"female, adult",1.798788,0.068536
4,010218ELIV1151,ELIV,200,Vertical,aglantha digitale,medusa,1.43903,0.003957


## Read dwcEvent and dwcOccurrence csv's

In [9]:
dwcevent_df = pd.read_csv(
    data_pth / "aligned_csvs" / "DwC_event.csv",
    parse_dates=['eventDate'],
    usecols=['eventID', 'parentEventID', 'eventDate']
)

dwcoccurrence_df = pd.read_csv(
    data_pth / "aligned_csvs" / "DwC_occurrence.csv",
    usecols=['occurrenceID', 'eventID', 'verbatimIdentification']
)

occurrence_vs_life_history_stage = pd.read_csv(
    data_pth / "intermediate_DwC_occurrence_life_history_stage.csv"
)

Add `life_history_stage` to the occurrence dataframe, to enable the eMoF data processing steps used here.

In [10]:
dwcoccurrence_df = dwcoccurrence_df.merge(occurrence_vs_life_history_stage, on='occurrenceID')
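A merge like the one above is expected to match each occurrence to exactly one life history stage. One way to make that expectation explicit is the `validate` argument of `DataFrame.merge`. The sketch below uses toy frames. It assumes the intermediate file has one row per `occurrenceID` with a `life_history_stage` column, and the IDs are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the occurrence table and the intermediate lookup table.
occ = pd.DataFrame({
    "occurrenceID": ["occ-1", "occ-2", "occ-3"],
    "eventID": ["sample_A", "sample_A", "sample_B"],
})
stages = pd.DataFrame({
    "occurrenceID": ["occ-1", "occ-2", "occ-3"],
    "life_history_stage": ["male, adult", "female, adult", "medusa"],
})

# validate="one_to_one" raises pandas.errors.MergeError if occurrenceID
# is repeated on either side, instead of silently multiplying rows.
merged = occ.merge(stages, on="occurrenceID", validate="one_to_one")

# The default inner merge keeps only matching keys, so a row-count check
# confirms that no occurrence was dropped.
print(len(merged) == len(occ))
```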

## Create empty eMoF dataframe

In [11]:
# Won't use measurementID, per Abby's explanation on Slack
emof_cols_dtypes = np.dtype(
    [
        ('eventID', str),
        ('occurrenceID', str), 
        # --- temporary, for validation
        # ('life_history_stage', str),
        # ('verbatimIdentification', str),
        # ---- below, commented out columns not used
        ('measurementType', str),
        ('measurementTypeID', str), 
        ('measurementValue', str),
        # ('measurementValueID', str),
        # ('measurementAccuracy', str),
        ('measurementUnit', str),
        ('measurementUnitID', str),
        # ('measurementDeterminedDate', str),
        # ('measurementDeterminedBy', str), 
        # ('measurementMethod', str),
        # ('measurementRemarks', str)
    ]
)

In [12]:
emof_df = pd.DataFrame(np.empty(0, dtype=emof_cols_dtypes))

## Functions for eMoF processing

In [13]:
def populate_emof_columns(
        source_df, 
        meas_type, 
        meas_type_id,
        meas_value_df,
        meas_unit=None,
        meas_unit_id=None
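    ):
    """
    Return a copy of source_df with the eMoF measurement columns added.

    meas_type_id and meas_unit_id are paths relative to
    mappings['vocab_server_base_url']. meas_value_df holds the measurement
    values (typically a Series aligned with the index of source_df). The
    unit columns are added only when meas_unit / meas_unit_id are given.
    """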

    df = source_df.copy()

    df['measurementType'] = meas_type
    df['measurementTypeID'] = mappings['vocab_server_base_url'] + meas_type_id
    df['measurementValue'] = meas_value_df
    if meas_unit:
        df['measurementUnit'] = meas_unit
    if meas_unit_id:
        df['measurementUnitID'] = mappings['vocab_server_base_url'] + meas_unit_id

    return df

In [14]:
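def concat_to_emof_df(df, emof_df, is_occurrence_emof=False, has_meas_unit=False):
    """
    Append the new emof records to the cumulative emof_df dataframe.

    Only the eMoF columns are kept from df: occurrenceID is included when
    is_occurrence_emof is True, and the unit columns when has_meas_unit is
    True. Columns left out are filled with nulls by pd.concat.
    """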

    emof_cols = ['eventID']
    if is_occurrence_emof:
        emof_cols += ['occurrenceID']
    emof_cols += ['measurementType', 'measurementTypeID', 'measurementValue']
    if has_meas_unit:
        emof_cols += ['measurementUnit', 'measurementUnitID']
    
    return pd.concat([emof_df, df[emof_cols]], ignore_index=True)

## MoF's associated with an event rather than an occurrence

These will have no `occurrenceID` entry.

### Associated with sample events

In [15]:
emofsource_samples_df = dwcevent_df.merge(
    emofsource_df[['sample_code', 'mesh_size', 'Tow Type']], 
    how='inner',
    left_on='eventID',
    right_on='sample_code'
)

emofsource_samples_df = (
    emofsource_samples_df
    .drop_duplicates()
    .drop(columns='sample_code')
    .sort_values(by='eventID')
    .reset_index(drop=True)
)

In [16]:
len(emofsource_samples_df)

3567

In [17]:
emofsource_samples_df.head()

Unnamed: 0,eventID,parentEventID,eventDate,mesh_size,Tow Type
0,010218ELIV1151,PSZMP_20180102D_ELIV,2018-01-02 11:51:00-08:00,200,Vertical
1,010322KSBP01D0815,PSZMP_20220103D_KSBP01D,2022-01-03 08:15:00-08:00,335,Oblique
2,010422LSNT01D1323,PSZMP_20220104D_LSNT01D,2022-01-04 13:23:00-08:00,335,Oblique
3,010422LSNT01V1305,PSZMP_20220104D_LSNT01V,2022-01-04 13:05:00-08:00,200,Vertical
4,010422NSEX01V1049,PSZMP_20220104D_NSEX01V,2022-01-04 10:49:00-08:00,200,Vertical


#### net tow sampling

In [18]:
net_type = dict(Vertical="ring net", Oblique="bongo net")

net_emof_df = populate_emof_columns(
    emofsource_samples_df, 
    meas_type="plankton net", 
    meas_type_id="L05/current/68/",
    meas_value_df=emofsource_samples_df['Tow Type'].apply(lambda tt: net_type[tt])
)

In [19]:
emof_df = concat_to_emof_df(net_emof_df, emof_df)

In [20]:
len(emof_df)

3567

In [21]:
emof_df.tail()

Unnamed: 0,eventID,occurrenceID,measurementType,measurementTypeID,measurementValue,measurementUnit,measurementUnitID
3562,122121HCB004V1332,,plankton net,https://vocab.nerc.ac.uk/collection/L05/curren...,ring net,,
3563,122220CAMV1044,,plankton net,https://vocab.nerc.ac.uk/collection/L05/curren...,ring net,,
3564,122220Cow3V21046,,plankton net,https://vocab.nerc.ac.uk/collection/L05/curren...,ring net,,
3565,122220WAT1S1303,,plankton net,https://vocab.nerc.ac.uk/collection/L05/curren...,bongo net,,
3566,122320ELID1047,,plankton net,https://vocab.nerc.ac.uk/collection/L05/curren...,bongo net,,


#### oblique vs vertical (full water column) sampling

In [22]:
towtype_emof_df = populate_emof_columns(
    emofsource_samples_df, 
    meas_type="Sampling method", 
    meas_type_id="Q01/current/Q0100003/",
    meas_value_df=emofsource_samples_df['Tow Type'].apply(lambda tt: net_tow[tt])
)

In [23]:
emof_df = concat_to_emof_df(towtype_emof_df, emof_df)

In [24]:
len(emof_df)

7134

In [25]:
emof_df.tail()

Unnamed: 0,eventID,occurrenceID,measurementType,measurementTypeID,measurementValue,measurementUnit,measurementUnitID
7129,122121HCB004V1332,,Sampling method,https://vocab.nerc.ac.uk/collection/Q01/curren...,full water column vertical,,
7130,122220CAMV1044,,Sampling method,https://vocab.nerc.ac.uk/collection/Q01/curren...,full water column vertical,,
7131,122220Cow3V21046,,Sampling method,https://vocab.nerc.ac.uk/collection/Q01/curren...,full water column vertical,,
7132,122220WAT1S1303,,Sampling method,https://vocab.nerc.ac.uk/collection/Q01/curren...,oblique,,
7133,122320ELID1047,,Sampling method,https://vocab.nerc.ac.uk/collection/Q01/curren...,oblique,,


#### mesh size

In [26]:
mesh_emof_df = populate_emof_columns(
    emofsource_samples_df, 
    meas_type="Sampling net mesh size", 
    meas_type_id="Q01/current/Q0100015/",
    meas_value_df=emofsource_samples_df['mesh_size'],
    meas_unit="Micrometres (microns)",
    meas_unit_id="P06/current/UMIC/"
)

In [27]:
emof_df = concat_to_emof_df(
    mesh_emof_df, 
    emof_df, 
    has_meas_unit=True
)

In [28]:
len(emof_df)

10701

In [29]:
emof_df.tail()

Unnamed: 0,eventID,occurrenceID,measurementType,measurementTypeID,measurementValue,measurementUnit,measurementUnitID
10696,122121HCB004V1332,,Sampling net mesh size,https://vocab.nerc.ac.uk/collection/Q01/curren...,200,Micrometres (microns),https://vocab.nerc.ac.uk/collection/P06/curren...
10697,122220CAMV1044,,Sampling net mesh size,https://vocab.nerc.ac.uk/collection/Q01/curren...,200,Micrometres (microns),https://vocab.nerc.ac.uk/collection/P06/curren...
10698,122220Cow3V21046,,Sampling net mesh size,https://vocab.nerc.ac.uk/collection/Q01/curren...,200,Micrometres (microns),https://vocab.nerc.ac.uk/collection/P06/curren...
10699,122220WAT1S1303,,Sampling net mesh size,https://vocab.nerc.ac.uk/collection/Q01/curren...,335,Micrometres (microns),https://vocab.nerc.ac.uk/collection/P06/curren...
10700,122320ELID1047,,Sampling net mesh size,https://vocab.nerc.ac.uk/collection/Q01/curren...,335,Micrometres (microns),https://vocab.nerc.ac.uk/collection/P06/curren...


## MoF's associated with exactly one occurrence

These will have an `occurrenceID` entry.

We have to preserve the original `life_history_stage` strings in the occurrence csv table in order to merge the occurrence table properly with `emofsource_df`. One option is to add an `occurrenceRemarks` column; another is to keep a `life_history_stage` column that is dropped at a later stage.
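When merging on a composite key like this, it can help to check for repeated keys and unmatched rows. The sketch below is one possible check on toy data, not part of the notebook's workflow. The frames, IDs and values are hypothetical; only the column names follow the notebook.

```python
import pandas as pd

# Toy stand-ins for the source records and the occurrence table.
source = pd.DataFrame({
    "sample_code": ["sample_A", "sample_A", "sample_B"],
    "species": ["acartia hudsonica", "aetideus", "aglantha digitale"],
    "life_history_stage": ["male, adult", "female, adult", "medusa"],
    "density": [0.9, 1.8, 1.4],
})
occ = pd.DataFrame({
    "occurrenceID": ["occ-1", "occ-2"],
    "eventID": ["sample_A", "sample_A"],
    "verbatimIdentification": ["acartia hudsonica", "aetideus"],
    "life_history_stage": ["male, adult", "female, adult"],
})

occ_key = ["eventID", "verbatimIdentification", "life_history_stage"]

# Repeated keys on the occurrence side would multiply rows in the merge.
n_repeated_keys = int(occ.duplicated(subset=occ_key, keep=False).sum())

# A left merge with indicator=True flags source rows without an occurrence.
checked = source.merge(
    occ,
    how="left",
    left_on=["sample_code", "species", "life_history_stage"],
    right_on=occ_key,
    indicator=True,
)
unmatched = checked[checked["_merge"] == "left_only"]

print(n_repeated_keys, len(unmatched))
```

In this toy case the `sample_B` record has no matching occurrence, so it is the only row flagged as `left_only`.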

In [30]:
emofsource_samplesoccur_all_df = (
    dwcevent_df
    .merge(
        emofsource_df[
            ['sample_code', 'species', 'life_history_stage', 
             'density', 'carbon_biomass']
        ], 
        how='inner',
        left_on='eventID',
        right_on='sample_code'
    )
    .merge(
        dwcoccurrence_df,
        how='inner',
        left_on=['eventID', 'species', 'life_history_stage'],
        right_on=['eventID', 'verbatimIdentification', 'life_history_stage']
    )
)

In [31]:
len(emofsource_samplesoccur_all_df)

187492

In [32]:
emofsource_samplesoccur_all_df.tail()

Unnamed: 0,eventID,parentEventID,eventDate,sample_code,species,life_history_stage,density,carbon_biomass,occurrenceID,verbatimIdentification
187487,121922TDBV1158,PSZMP_20221219D_TDBV,2022-12-19 11:58:00-08:00,121922TDBV1158,pseudocalanus,"male, adult",31.347963,0.176113,00984492-cd2f-460d-a215-8bc3effc9528,pseudocalanus
187488,121922TDBV1158,PSZMP_20221219D_TDBV,2022-12-19 11:58:00-08:00,121922TDBV1158,scolecithricella minor,"female, adult",2.61233,0.010094,9882ec7b-b675-4199-ac3b-c643f65d4539,scolecithricella minor
187489,121922TDBV1158,PSZMP_20221219D_TDBV,2022-12-19 11:58:00-08:00,121922TDBV1158,tortanus discaudatus,copepodite,0.174155,0.000298,54bed329-fd81-4159-9781-8932632f5a75,tortanus discaudatus
187490,121922TDBV1158,PSZMP_20221219D_TDBV,2022-12-19 11:58:00-08:00,121922TDBV1158,tortanus discaudatus,"female, adult",2.089864,0.012364,94e47f4b-d0bf-4c61-9a7e-f25b7e618d6d,tortanus discaudatus
187491,121922TDBV1158,PSZMP_20221219D_TDBV,2022-12-19 11:58:00-08:00,121922TDBV1158,tortanus discaudatus,"male, adult",0.174155,0.000562,9af1730a-aa00-4f3a-a1da-54f392a40270,tortanus discaudatus


In [33]:
common_cols = ['eventID', 'occurrenceID']

### density / abundance

In [34]:
emofsource_samplesoccur_df = (
    emofsource_samplesoccur_all_df[common_cols + ['density']]
    .drop_duplicates()
    # .drop(columns=['sample_code', 'species', 'life_history_stage'])
    .sort_values(by='eventID')
    .reset_index(drop=True)
)

In [35]:
len(emofsource_samplesoccur_df)

187306

Count the number of records having empty (null) density values.

In [36]:
len(emofsource_samplesoccur_df[emofsource_samplesoccur_df['density'].isna()])

1703

- Then filter out records with empty density values.
- Round density values to 3 decimal places.

**TODO:** Convert to or create a string-type column so that trailing zeros are preserved? e.g., `0.060` rather than `0.06`
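One possible way to address the TODO, shown here only as a sketch on made-up values, is to format the rounded numbers as fixed-decimal strings. Whether the eMoF file should use string values is still an open question.

```python
import pandas as pd

# Made-up density values, for illustration only.
density = pd.Series([0.2409, 0.06, 1.2054])

# Round to 3 decimal places, then format with exactly 3 decimals
# so that trailing zeros are kept in the text representation.
as_text = density.round(3).map("{:.3f}".format)
print(as_text.tolist())
```

Null values would be formatted as the string `nan`, so this step belongs after the null records have been filtered out.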

In [37]:
emofsource_samplesoccur_nonulls_df = emofsource_samplesoccur_df[
    ~emofsource_samplesoccur_df['density'].isna()
]

In [38]:
abundance_emof_df = populate_emof_columns(
    emofsource_samplesoccur_nonulls_df, 
    meas_type="Abundance of biological entity specified elsewhere per unit volume of the water body",
    meas_type_id="P01/current/SDBIOL01/",
    meas_value_df=emofsource_samplesoccur_nonulls_df['density'].round(3),
    meas_unit="Number per cubic metre",
    meas_unit_id="P06/current/UPMM/"
)

In [39]:
# For validation, add 'life_history_stage', 'verbatimIdentification' to the emof columns

emof_df = concat_to_emof_df(
    abundance_emof_df, 
    emof_df, 
    is_occurrence_emof=True, 
    has_meas_unit=True
)

In [40]:
len(emof_df)

196304

In [41]:
emof_df.tail()

Unnamed: 0,eventID,occurrenceID,measurementType,measurementTypeID,measurementValue,measurementUnit,measurementUnitID
196299,122320ELID1047,cc8346ca-2bdf-43c7-a817-809c2d951ba1,Abundance of biological entity specified elsew...,https://vocab.nerc.ac.uk/collection/P01/curren...,0.241,Number per cubic metre,https://vocab.nerc.ac.uk/collection/P06/curren...
196300,122320ELID1047,2a7dce92-2730-4424-b00d-5b4e3458ba4c,Abundance of biological entity specified elsew...,https://vocab.nerc.ac.uk/collection/P01/curren...,0.06,Number per cubic metre,https://vocab.nerc.ac.uk/collection/P06/curren...
196301,122320ELID1047,cfd7e4b5-fa6c-4ae8-b742-7ed7e9c0fb1d,Abundance of biological entity specified elsew...,https://vocab.nerc.ac.uk/collection/P01/curren...,1.205,Number per cubic metre,https://vocab.nerc.ac.uk/collection/P06/curren...
196302,122320ELID1047,ccfc8fb7-3887-409c-ba5b-5b55cfc15aa9,Abundance of biological entity specified elsew...,https://vocab.nerc.ac.uk/collection/P01/curren...,0.06,Number per cubic metre,https://vocab.nerc.ac.uk/collection/P06/curren...
196303,122320ELID1047,5d6c97f1-8ee1-4647-9fda-fe305870665c,Abundance of biological entity specified elsew...,https://vocab.nerc.ac.uk/collection/P01/curren...,1.205,Number per cubic metre,https://vocab.nerc.ac.uk/collection/P06/curren...


### biomass carbon

In [42]:
emofsource_samplesoccur_df = (
    emofsource_samplesoccur_all_df[common_cols + ['carbon_biomass']]
    .drop_duplicates()
    # .drop(columns=['sample_code', 'species', 'life_history_stage'])
    .sort_values(by='eventID')
    .reset_index(drop=True)
)

In [43]:
len(emofsource_samplesoccur_df)

187476

Count the number of records having empty (null) carbon biomass values.

In [44]:
len(emofsource_samplesoccur_df[emofsource_samplesoccur_df['carbon_biomass'].isna()])

0

 Amanda & BethElLee (Julie's team): *Our zooplankton carbon values come from the literature, most of which are derived from CHN analyzers where the remaining Carbon and Nitrogen are correlated to lengths of the analyzed specimens. The result is a Carbon regression applied to the actual length of a particular specimen within our sample or an average length multiplier. The Carbon value is then multiplied to the density (individuals per cubic meter of water sampled) of corresponding taxa, resulting in milligrams of Carbon per cubic meter of water sampled: mg/m3.*
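The final step of that description, multiplying a per-individual carbon value by the density, can be sketched as below. The numbers and variable names are illustrative only; they are not taken from the dataset or from the literature regressions.

```python
# Illustrative numbers only; not from the dataset or the literature.
carbon_per_individual_mg = 0.002  # mg C per individual (e.g. from a length regression)
density_per_m3 = 6.5              # individuals per cubic metre of water sampled

# mg C per individual * individuals per m3 = mg C per m3
carbon_biomass_mg_per_m3 = carbon_per_individual_mg * density_per_m3
print(round(carbon_biomass_mg_per_m3, 5))
```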


Filter out records with empty carbon biomass values.

**TODO:**
- Round carbon biomass values to **5 (how many is appropriate?)** decimal places.
- Convert to or create a string-type column so that trailing zeros are preserved? e.g., `0.060` rather than `0.06`

In [45]:
emofsource_samplesoccur_nonulls_df = emofsource_samplesoccur_df[
    ~emofsource_samplesoccur_df['carbon_biomass'].isna()
]

In [46]:
# Full definition on NVS:
#   Biomass as carbon of mesozooplankton per unit volume of the water body 
#   by optical microscopy and computation of carbon biomass from abundance

carbon_emof_df = populate_emof_columns(
    emofsource_samplesoccur_nonulls_df, 
    # Abbreviated measurement type definition
    meas_type="Biomass as carbon of mesozooplankton per unit volume of the water body",
    meas_type_id="P01/current/MSBCMITX/",
    meas_value_df=emofsource_samplesoccur_nonulls_df['carbon_biomass'].round(5),
    meas_unit="Milligrams per cubic metre",
    meas_unit_id="P06/current/UMMC/"
)

In [47]:
# For validation, add 'life_history_stage', 'verbatimIdentification' to the emof columns

emof_df = concat_to_emof_df(
    carbon_emof_df, 
    emof_df, 
    is_occurrence_emof=True, 
    has_meas_unit=True
)

In [48]:
len(emof_df)

383780

In [49]:
emof_df.tail()

Unnamed: 0,eventID,occurrenceID,measurementType,measurementTypeID,measurementValue,measurementUnit,measurementUnitID
383775,122320ELID1047,2a7dce92-2730-4424-b00d-5b4e3458ba4c,Biomass as carbon of mesozooplankton per unit ...,https://vocab.nerc.ac.uk/collection/P01/curren...,0.00264,Milligrams per cubic metre,https://vocab.nerc.ac.uk/collection/P06/curren...
383776,122320ELID1047,cfd7e4b5-fa6c-4ae8-b742-7ed7e9c0fb1d,Biomass as carbon of mesozooplankton per unit ...,https://vocab.nerc.ac.uk/collection/P01/curren...,0.00716,Milligrams per cubic metre,https://vocab.nerc.ac.uk/collection/P06/curren...
383777,122320ELID1047,7d70353a-1711-4291-bd40-ef53154eacd2,Biomass as carbon of mesozooplankton per unit ...,https://vocab.nerc.ac.uk/collection/P01/curren...,0.01775,Milligrams per cubic metre,https://vocab.nerc.ac.uk/collection/P06/curren...
383778,122320ELID1047,939f8d59-d529-4d7a-9b6c-ab080bff7da7,Biomass as carbon of mesozooplankton per unit ...,https://vocab.nerc.ac.uk/collection/P01/curren...,0.03587,Milligrams per cubic metre,https://vocab.nerc.ac.uk/collection/P06/curren...
383779,122320ELID1047,95b06a76-2d0a-4b3c-a042-a36e3031326d,Biomass as carbon of mesozooplankton per unit ...,https://vocab.nerc.ac.uk/collection/P01/curren...,0.00135,Milligrams per cubic metre,https://vocab.nerc.ac.uk/collection/P06/curren...


## Export `emof_df` to csv

In [50]:
emof_df.measurementType.value_counts()

Biomass as carbon of mesozooplankton per unit volume of the water body                  187476
Abundance of biological entity specified elsewhere per unit volume of the water body    185603
plankton net                                                                              3567
Sampling method                                                                           3567
Sampling net mesh size                                                                    3567
Name: measurementType, dtype: int64

In [51]:
csv_fpth = data_pth / "aligned_csvs" / "DwC_emof.csv"

In [52]:
if not debug_no_csvexport:
    emof_df.to_csv(csv_fpth, index=False)

### Create zip file with the csv

In [53]:
if not debug_no_csvexport:
    create_csv_zip(csv_fpth)

## Package versions

In [54]:
print(
    f"{datetime.utcnow()} +00:00\n"
    f"pandas: {pd.__version__}"
)

2024-03-20 06:26:41.805417 +00:00
pandas: 1.5.3
