# Darwin Core (DwC) **eMoFs** (extended Measurements or Facts), Puget Sound Zooplankton Monitoring Program (PSZMP) dataset

Alignment of the zooplankton dataset to the [Darwin Core (DwC) data standard](https://dwc.tdwg.org/), carried out by **NANOOS**, https://www.nanoos.org. This data alignment work, including this Jupyter notebook, is described in the NANOOS GitHub repository https://github.com/nanoos-pnw/obis-pszmp. See [README.md](https://github.com/nanoos-pnw/obis-pszmp/blob/main/README.md) for further background on the dataset, DwC and data transformations.

Emilio Mayorga, https://github.com/emiliom

## Goals and scope of this notebook

Parse the source data and combine it with [common_mappings.json](common_mappings.json), the previously generated DwC event and occurrence csv files, and the `intermediate_DwC_occurrence_life_history_stage.csv` file created in the notebook `PSZMP-dwcOccurrence.ipynb` to create the DwC "Extended Measurement or Fact" (eMoF) file `DwC_emof.csv`. This file contains five types of eMoF information, referenced to standard vocabularies:
- eMoF's associated with an Event
    1. Description of the net sampling platform
    2. Assignment of the water column sampling scheme
    3. Assignment of the sample mesh size
- eMoF's associated with an Occurrence
    1. Sample density (abundance per unit volume)
    2. Biomass carbon (mg of C per unit volume)

The DwC eMoF table is populated sequentially for each of these eMoF types, in that order. Columns are populated differently depending on the eMoF type.
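
As a rough illustration of the resulting long-format table, the sketch below lists one record per eMoF type. The IDs are made up; the measurement types and units are the ones used later in this notebook. Event-level records leave `occurrenceID` empty, while occurrence-level records fill it in.

```python
# Illustrative records only: IDs are hypothetical, types and units follow this notebook
emof_records = [
    # Event-level eMoFs (no occurrenceID)
    {"eventID": "evt-1", "occurrenceID": None,
     "measurementType": "plankton net", "measurementUnit": None},
    {"eventID": "evt-1", "occurrenceID": None,
     "measurementType": "Sampling method", "measurementUnit": None},
    {"eventID": "evt-1", "occurrenceID": None,
     "measurementType": "Sampling net mesh size",
     "measurementUnit": "Micrometres (microns)"},
    # Occurrence-level eMoFs (occurrenceID present)
    {"eventID": "evt-1", "occurrenceID": "occ-1",
     "measurementType": "Abundance of biological entity specified elsewhere "
                        "per unit volume of the water body",
     "measurementUnit": "Number per cubic metre"},
    {"eventID": "evt-1", "occurrenceID": "occ-1",
     "measurementType": "Biomass as carbon of mesozooplankton per unit volume "
                        "of the water body",
     "measurementUnit": "Milligrams per cubic metre"},
]

event_level = [r for r in emof_records if r["occurrenceID"] is None]
occurrence_level = [r for r in emof_records if r["occurrenceID"] is not None]
print(len(event_level), len(occurrence_level))
```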

## Settings

In [1]:
from datetime import datetime
import json
from pathlib import Path

import numpy as np
import pandas as pd

from data_preprocess import create_csv_zip, read_and_parse_sourcedata

In [2]:
data_pth = Path(".")

Set to `True` when debugging. `csv` files will not be exported when `debug_no_csvexport = True`.

In [3]:
debug_no_csvexport = False

## Process JSON file containing common mappings and strings

In [4]:
with open(data_pth / 'common_mappings.json') as f:
    mappings = json.load(f)

In [5]:
DatasetCode = mappings['datasetcode']
net_tow = mappings['net_tow']

iso8601_format = mappings['iso8601_format']
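
For orientation, here is a minimal sketch of a mappings structure consistent with the keys this notebook reads. The values are assumptions inferred from outputs displayed further down (the `iso8601_format` value in particular is a guess), and the real `common_mappings.json` may contain additional keys.

```python
import json

# Hypothetical stand-in for common_mappings.json; the values are assumptions
# inferred from notebook outputs, not copied from the real file
example_json = """
{
  "datasetcode": "PSZMP",
  "net_tow": {"Vertical": "vertical", "Oblique": "oblique"},
  "iso8601_format": "%Y-%m-%dT%H:%M:%S%z",
  "vocab_server_base_url": "https://vocab.nerc.ac.uk/collection/"
}
"""

example_mappings = json.loads(example_json)

# Keys that this notebook looks up in the mappings
required_keys = [
    "datasetcode", "net_tow", "iso8601_format", "vocab_server_base_url"
]
missing = [key for key in required_keys if key not in example_mappings]
if missing:
    raise KeyError(f"mappings file is missing keys: {missing}")

print(example_mappings["datasetcode"])
```

Checking for the required keys right after loading gives a clearer error than a `KeyError` raised deep inside a later cell.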

## Pre-process data for eMoF table

### Read and pre-process the source data from Excel file

`usecols` defines the columns that will be kept and the order in which they'll be organized

In [6]:
usecols = [
    'Sample Code', 'Station', 
    'Mesh Size', 'Tow Type', 
    'Genus species', 'Life History Stage_lc', 
    'Density (#/m3)', 'Final Carbon (mg/m3)'
]

emofsource_df = read_and_parse_sourcedata()[usecols]

emofsource_df.rename(
    columns={
        'Sample Code': 'sample_code',
        'Station': 'station',
        'Mesh Size': 'mesh_size',
        'Life History Stage_lc': 'life_history_stage',
        'Density (#/m3)': 'density',
        'Final Carbon (mg/m3)': 'carbon_biomass',
    },
    inplace=True
)

In [7]:
len(emofsource_df)

185729

### Compose the sample `eventID` for linkage to event table

In [8]:
emofsource_df['sample_eventID'] = (
    DatasetCode + "-SMP-" + emofsource_df['sample_code']
)
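
The composed ID only works as a link if it matches an `eventID` in the event table exactly. The sketch below mirrors the concatenation in plain Python and checks for unmatched IDs. The dataset code value is an assumption inferred from the displayed IDs, and both the helper name and the set of known event IDs are hypothetical.

```python
# Assumed value of mappings['datasetcode'], inferred from the displayed eventIDs
dataset_code = "PSZMP"


def compose_sample_event_id(sample_code: str) -> str:
    """Hypothetical helper mirroring the string concatenation used above."""
    return f"{dataset_code}-SMP-{sample_code}"


# Toy stand-in for the eventID column of the event table
known_event_ids = {"PSZMP-SMP-032514DANAD1147", "PSZMP-SMP-010218ELIV1151"}

sample_codes = ["032514DANAD1147", "010218ELIV1151"]
composed_ids = {compose_sample_event_id(code) for code in sample_codes}

# Any ID listed here would fail to link to the event table
unmatched_ids = composed_ids - known_event_ids
print(sorted(unmatched_ids))
```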

In [9]:
emofsource_df.head()

Unnamed: 0,sample_code,station,mesh_size,Tow Type,Genus species,life_history_stage,density,carbon_biomass,sample_eventID
0,032514DANAD1147,DANAD,335,Oblique,ALPHEIDAE,unknown,4.784479,0.047259,PSZMP-SMP-032514DANAD1147
1,032514DANAD1147,DANAD,335,Oblique,BARNACLES,cyprid larva,4.784479,0.004759,PSZMP-SMP-032514DANAD1147
2,032514DANAD1147,DANAD,335,Oblique,BARNACLES,nauplius,105.258527,0.031641,PSZMP-SMP-032514DANAD1147
3,032514DANAD1147,DANAD,335,Oblique,CALANUS,c5-adult,3.169717,0.075896,PSZMP-SMP-032514DANAD1147
4,032514DANAD1147,DANAD,335,Oblique,CANCRIDAE,"z1, zoea i",197.000903,1.16972,PSZMP-SMP-032514DANAD1147


## Read dwcEvent and dwcOccurrence csv's

In [10]:
dwcevent_df = pd.read_csv(
    data_pth / "aligned_csvs" / "DwC_event.csv",
    parse_dates=['eventDate'],
    usecols=['eventID', 'parentEventID', 'eventDate']
)

dwcoccurrence_df = pd.read_csv(
    data_pth / "aligned_csvs" / "DwC_occurrence.csv",
    usecols=['occurrenceID', 'eventID', 'verbatimIdentification']
)

## Read `life_history_stage` csv and add it to the occurrence dataframe

The `life_history_stage` column is needed later as a merge key between the occurrence table and the source data.

In [11]:
occurrence_vs_life_history_stage = pd.read_csv(
    data_pth / "intermediate_DwC_occurrence_life_history_stage.csv"
)

dwcoccurrence_df = dwcoccurrence_df.merge(
    occurrence_vs_life_history_stage, on='occurrenceID'
)
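
Because this merge uses the default inner join, any occurrence without a match would be dropped silently. A minimal sketch of a guarded version on toy data follows; the column layout of the intermediate file is assumed from how it is used here.

```python
import pandas as pd

# Toy stand-ins for the occurrence table and the intermediate
# life-history-stage table (column layout is an assumption)
occ_df = pd.DataFrame({
    "occurrenceID": ["occ-1", "occ-2", "occ-3"],
    "eventID": ["evt-1", "evt-1", "evt-2"],
    "verbatimIdentification": ["CALANUS", "BARNACLES", "CALANUS"],
})
stage_df = pd.DataFrame({
    "occurrenceID": ["occ-1", "occ-2", "occ-3"],
    "life_history_stage": ["c5-adult", "nauplius", "unknown"],
})

# validate="one_to_one" raises pandas.errors.MergeError if occurrenceID
# is duplicated on either side
merged_df = occ_df.merge(stage_df, on="occurrenceID", validate="one_to_one")

# An inner join drops unmatched rows, so compare lengths explicitly
rows_lost = len(occ_df) - len(merged_df)
print(rows_lost)
```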

## Create empty eMoF dataframe

In [12]:
# Won't use measurementID, per Abby's explanation on Slack
emof_cols_dtypes = np.dtype(
    [
        ('eventID', str),
        ('occurrenceID', str), 
        # --- temporary, for validation
        # ('life_history_stage', str),
        # ('verbatimIdentification', str),
        # ---- below, commented out columns not used
        ('measurementType', str),
        ('measurementTypeID', str), 
        ('measurementValue', str),
        # ('measurementValueID', str),
        ('measurementAccuracy', str),
        ('measurementUnit', str),
        ('measurementUnitID', str),
        # ('measurementDeterminedDate', str),
        # ('measurementDeterminedBy', str), 
        # ('measurementMethod', str),
        # ('measurementRemarks', str)
    ]
)

In [13]:
emof_df = pd.DataFrame(np.empty(0, dtype=emof_cols_dtypes))

## Functions for eMoF processing

In [14]:
def populate_emof_columns(
        source_df, 
        meas_type, 
        meas_type_id,
        meas_value_ser,
        meas_value_format=None,
        meas_accuracy=None,
        meas_unit=None,
        meas_unit_id=None
    ):
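    """
    Return a copy of source_df with the eMoF measurement columns added.

    meas_type_id and meas_unit_id are vocabulary paths appended to
    mappings['vocab_server_base_url']. If meas_value_format is given
    (eg, ".03f"), measurement values are converted to formatted strings.
    Accuracy and unit columns are added only when provided.
    """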

    df = source_df.copy()

    df['measurementType'] = meas_type
    df['measurementTypeID'] = mappings['vocab_server_base_url'] + meas_type_id
    if meas_value_format is None:
        df['measurementValue'] = meas_value_ser
    else:
        format_str = f"{{:{meas_value_format}}}"  # eg, "{:.03f}"
        df['measurementValue'] = meas_value_ser.apply(lambda x: format_str.format(x))
    
    if meas_accuracy:
        df['measurementAccuracy'] = meas_accuracy
    if meas_unit:
        df['measurementUnit'] = meas_unit
    if meas_unit_id:
        df['measurementUnitID'] = mappings['vocab_server_base_url'] + meas_unit_id

    return df
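
The optional `meas_value_format` argument is a standard Python format specification. The sketch below isolates that step in a hypothetical helper and applies it to two values taken from the source data preview above.

```python
def format_value(value: float, meas_value_format: str) -> str:
    """Hypothetical helper: build the format string as populate_emof_columns does."""
    format_str = f"{{:{meas_value_format}}}"  # ".03f" becomes "{:.03f}"
    return format_str.format(value)


formatted_density = format_value(4.784479, ".03f")
formatted_carbon = format_value(0.047259, ".05f")
print(formatted_density, formatted_carbon)
```

Note that formatting turns the values into strings, so the `measurementValue` column would no longer be numeric after this step.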

In [15]:
def concat_to_emof_df(df, emof_df, is_occurrence_emof=False):
    """
    Append the new emof records to the cumulative emof_df dataframe.
    """

    emof_cols = ['eventID']
    if is_occurrence_emof:
        emof_cols += ['occurrenceID']
    emof_cols += ['measurementType', 'measurementTypeID', 'measurementValue']
    if 'measurementAccuracy' in df.columns:
        emof_cols += ['measurementAccuracy']
    if 'measurementUnit' in df.columns:
        emof_cols += ['measurementUnit', 'measurementUnitID']
    
    return pd.concat([emof_df, df[emof_cols]], ignore_index=True)
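
`pd.concat` aligns on column names, so columns missing from the new records are filled with nulls. This is why event-level eMoFs end up with an empty `occurrenceID`. A small sketch on toy data:

```python
import pandas as pd

# Event-level records: no occurrenceID or unit columns
event_rows = pd.DataFrame({
    "eventID": ["evt-1"],
    "measurementType": ["plankton net"],
    "measurementValue": ["ring net"],
})
# Occurrence-level records: occurrenceID and unit are present
occurrence_rows = pd.DataFrame({
    "eventID": ["evt-1"],
    "occurrenceID": ["occ-1"],
    "measurementType": ["abundance"],
    "measurementValue": [4.78],
    "measurementUnit": ["Number per cubic metre"],
})

combined = pd.concat([event_rows, occurrence_rows], ignore_index=True)
print(combined[["eventID", "occurrenceID", "measurementUnit"]])
```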

## eMoF's associated with an event rather than an occurrence

These will have no `occurrenceID` entry.

### Associated with sample events

In [16]:
emofsource_samples_df = dwcevent_df.merge(
    emofsource_df[['sample_eventID', 'mesh_size', 'Tow Type']], 
    how='inner',
    left_on='eventID',
    right_on='sample_eventID'
)

emofsource_samples_df = (
    emofsource_samples_df
    .drop_duplicates()
    .drop(columns='sample_eventID')
    .sort_values(by='eventID')
    .reset_index(drop=True)
)
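
`drop_duplicates` collapses the per-taxon source rows to one row per sample only if the sample attributes are identical on every row. The following sketch, on toy data, shows one way to flag samples that would survive as more than one row because of conflicting attributes.

```python
import pandas as pd

# Toy source rows: several taxa per sample repeat the same sample attributes
source_rows = pd.DataFrame({
    "sample_eventID": ["evt-1", "evt-1", "evt-2"],
    "mesh_size": [335, 335, 200],
    "Tow Type": ["Oblique", "Oblique", "Vertical"],
})

samples = source_rows.drop_duplicates().reset_index(drop=True)

# A sample with conflicting mesh size or tow type would appear more than once
conflicting = samples[samples["sample_eventID"].duplicated(keep=False)]
print(len(samples), len(conflicting))
```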

In [17]:
len(emofsource_samples_df)

3569

In [None]:
emofsource_samples_df.head()

Unnamed: 0,eventID,parentEventID,eventDate,mesh_size,Tow Type
0,PSZMP-SMP-010218ELIV1151,PSZMP-STA-20180102D_ELIV,2018-01-02 11:51:00-08:00,200,Vertical
1,PSZMP-SMP-010322KSBP01D0815,PSZMP-STA-20220103D_KSBP01D,2022-01-03 08:15:00-08:00,335,Oblique
2,PSZMP-SMP-010422LSNT01D1323,PSZMP-STA-20220104D_LSNT01D,2022-01-04 13:23:00-08:00,335,Oblique
3,PSZMP-SMP-010422LSNT01V1305,PSZMP-STA-20220104D_LSNT01V,2022-01-04 13:05:00-08:00,200,Vertical
4,PSZMP-SMP-010422NSEX01V1049,PSZMP-STA-20220104D_NSEX01V,2022-01-04 10:49:00-08:00,200,Vertical


#### net tow sampling

In [19]:
net_type = dict(Vertical="ring net", Oblique="bongo net")

net_emof_df = populate_emof_columns(
    emofsource_samples_df, 
    meas_type="plankton net", 
    meas_type_id="L05/current/68",
    meas_value_ser=emofsource_samples_df['Tow Type'].apply(lambda tt: net_type[tt])
)
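
The lambda lookup raises a `KeyError` on any tow type missing from `net_type`. As an alternative sketch, `Series.map` leaves unmapped values as nulls so they can be listed before deciding how to handle them. The "Horizontal" value below is made up for illustration.

```python
import pandas as pd

net_type = dict(Vertical="ring net", Oblique="bongo net")

# "Horizontal" is a made-up tow type used to show the unmapped case
tow_types = pd.Series(["Vertical", "Oblique", "Horizontal"])

mapped = tow_types.map(net_type)
unmapped = tow_types[mapped.isna()].unique().tolist()
print(unmapped)
```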

In [None]:
emof_df = concat_to_emof_df(net_emof_df, emof_df)

len(emof_df)

3569

In [None]:
emof_df.tail()

Unnamed: 0,eventID,occurrenceID,measurementType,measurementTypeID,measurementValue,measurementAccuracy,measurementUnit,measurementUnitID
3564,PSZMP-SMP-122121HCB004V1332,,plankton net,https://vocab.nerc.ac.uk/collection/L05/curren...,ring net,,,
3565,PSZMP-SMP-122220CAMV1044,,plankton net,https://vocab.nerc.ac.uk/collection/L05/curren...,ring net,,,
3566,PSZMP-SMP-122220Cow3V21046,,plankton net,https://vocab.nerc.ac.uk/collection/L05/curren...,ring net,,,
3567,PSZMP-SMP-122220WAT1S1303,,plankton net,https://vocab.nerc.ac.uk/collection/L05/curren...,bongo net,,,
3568,PSZMP-SMP-122320ELID1047,,plankton net,https://vocab.nerc.ac.uk/collection/L05/curren...,bongo net,,,


#### oblique vs vertical (full water column) sampling

In [22]:
towtype_emof_df = populate_emof_columns(
    emofsource_samples_df, 
    meas_type="Sampling method", 
    meas_type_id="Q01/current/Q0100003",
    meas_value_ser=emofsource_samples_df['Tow Type'].apply(
        lambda tt: net_tow[tt] + "net tow"
    )
)

In [None]:
emof_df = concat_to_emof_df(towtype_emof_df, emof_df)

len(emof_df)

7138

In [None]:
emof_df.tail()

Unnamed: 0,eventID,occurrenceID,measurementType,measurementTypeID,measurementValue,measurementAccuracy,measurementUnit,measurementUnitID
7133,PSZMP-SMP-122121HCB004V1332,,Sampling method,https://vocab.nerc.ac.uk/collection/Q01/curren...,verticalnet tow,,,
7134,PSZMP-SMP-122220CAMV1044,,Sampling method,https://vocab.nerc.ac.uk/collection/Q01/curren...,verticalnet tow,,,
7135,PSZMP-SMP-122220Cow3V21046,,Sampling method,https://vocab.nerc.ac.uk/collection/Q01/curren...,verticalnet tow,,,
7136,PSZMP-SMP-122220WAT1S1303,,Sampling method,https://vocab.nerc.ac.uk/collection/Q01/curren...,obliquenet tow,,,
7137,PSZMP-SMP-122320ELID1047,,Sampling method,https://vocab.nerc.ac.uk/collection/Q01/curren...,obliquenet tow,,,


#### mesh size

In [25]:
mesh_emof_df = populate_emof_columns(
    emofsource_samples_df, 
    meas_type="Sampling net mesh size", 
    meas_type_id="Q01/current/Q0100015",
    meas_value_ser=emofsource_samples_df['mesh_size'],
    meas_unit="Micrometres (microns)",
    meas_unit_id="P06/current/UMIC"
)

In [None]:
mesh_emof_df

Unnamed: 0,eventID,parentEventID,eventDate,mesh_size,Tow Type,measurementType,measurementTypeID,measurementValue,measurementUnit,measurementUnitID
0,PSZMP-SMP-010218ELIV1151,PSZMP-STA-20180102D_ELIV,2018-01-02 11:51:00-08:00,200,Vertical,Sampling net mesh size,https://vocab.nerc.ac.uk/collection/Q01/curren...,200,Micrometres (microns),https://vocab.nerc.ac.uk/collection/P06/curren...
1,PSZMP-SMP-010322KSBP01D0815,PSZMP-STA-20220103D_KSBP01D,2022-01-03 08:15:00-08:00,335,Oblique,Sampling net mesh size,https://vocab.nerc.ac.uk/collection/Q01/curren...,335,Micrometres (microns),https://vocab.nerc.ac.uk/collection/P06/curren...
2,PSZMP-SMP-010422LSNT01D1323,PSZMP-STA-20220104D_LSNT01D,2022-01-04 13:23:00-08:00,335,Oblique,Sampling net mesh size,https://vocab.nerc.ac.uk/collection/Q01/curren...,335,Micrometres (microns),https://vocab.nerc.ac.uk/collection/P06/curren...
3,PSZMP-SMP-010422LSNT01V1305,PSZMP-STA-20220104D_LSNT01V,2022-01-04 13:05:00-08:00,200,Vertical,Sampling net mesh size,https://vocab.nerc.ac.uk/collection/Q01/curren...,200,Micrometres (microns),https://vocab.nerc.ac.uk/collection/P06/curren...
4,PSZMP-SMP-010422NSEX01V1049,PSZMP-STA-20220104D_NSEX01V,2022-01-04 10:49:00-08:00,200,Vertical,Sampling net mesh size,https://vocab.nerc.ac.uk/collection/Q01/curren...,200,Micrometres (microns),https://vocab.nerc.ac.uk/collection/P06/curren...
...,...,...,...,...,...,...,...,...,...,...
3564,PSZMP-SMP-122121HCB004V1332,PSZMP-STA-20211221D_HCB004V,2021-12-21 13:32:00-08:00,200,Vertical,Sampling net mesh size,https://vocab.nerc.ac.uk/collection/Q01/curren...,200,Micrometres (microns),https://vocab.nerc.ac.uk/collection/P06/curren...
3565,PSZMP-SMP-122220CAMV1044,PSZMP-STA-20201222D_CAMV,2020-12-22 10:44:00-08:00,200,Vertical,Sampling net mesh size,https://vocab.nerc.ac.uk/collection/Q01/curren...,200,Micrometres (microns),https://vocab.nerc.ac.uk/collection/P06/curren...
3566,PSZMP-SMP-122220Cow3V21046,PSZMP-STA-20201222D_COW3V2,2020-12-22 10:46:00-08:00,200,Vertical,Sampling net mesh size,https://vocab.nerc.ac.uk/collection/Q01/curren...,200,Micrometres (microns),https://vocab.nerc.ac.uk/collection/P06/curren...
3567,PSZMP-SMP-122220WAT1S1303,PSZMP-STA-20201222D_WAT1S,2020-12-22 13:03:00-08:00,335,Oblique,Sampling net mesh size,https://vocab.nerc.ac.uk/collection/Q01/curren...,335,Micrometres (microns),https://vocab.nerc.ac.uk/collection/P06/curren...


In [None]:
emof_df = concat_to_emof_df(mesh_emof_df, emof_df)

len(emof_df)

10707

In [28]:
emof_df.tail()

Unnamed: 0,eventID,occurrenceID,measurementType,measurementTypeID,measurementValue,measurementAccuracy,measurementUnit,measurementUnitID
10702,PSZMP-SMP-122121HCB004V1332,,Sampling net mesh size,https://vocab.nerc.ac.uk/collection/Q01/curren...,200,,Micrometres (microns),https://vocab.nerc.ac.uk/collection/P06/curren...
10703,PSZMP-SMP-122220CAMV1044,,Sampling net mesh size,https://vocab.nerc.ac.uk/collection/Q01/curren...,200,,Micrometres (microns),https://vocab.nerc.ac.uk/collection/P06/curren...
10704,PSZMP-SMP-122220Cow3V21046,,Sampling net mesh size,https://vocab.nerc.ac.uk/collection/Q01/curren...,200,,Micrometres (microns),https://vocab.nerc.ac.uk/collection/P06/curren...
10705,PSZMP-SMP-122220WAT1S1303,,Sampling net mesh size,https://vocab.nerc.ac.uk/collection/Q01/curren...,335,,Micrometres (microns),https://vocab.nerc.ac.uk/collection/P06/curren...
10706,PSZMP-SMP-122320ELID1047,,Sampling net mesh size,https://vocab.nerc.ac.uk/collection/Q01/curren...,335,,Micrometres (microns),https://vocab.nerc.ac.uk/collection/P06/curren...


## eMoF's associated with exactly one occurrence

These will have an `occurrenceID` entry.

The occurrence table carries the original `life_history_stage` strings so that it can be merged properly with `emofsource_df`. The `life_history_stage` column will be dropped in a later step.

In [29]:
emofsource_samplesoccur_all_df = (
    dwcevent_df
    .merge(
        emofsource_df[
            ['sample_eventID', 'Genus species', 'life_history_stage', 
             'density', 'carbon_biomass']
        ], 
        how='inner',
        left_on='eventID',
        right_on='sample_eventID'
    )
    .merge(
        dwcoccurrence_df,
        how='inner',
        left_on=['eventID', 'Genus species', 'life_history_stage'],
        right_on=['eventID', 'verbatimIdentification', 'life_history_stage']
    )
)
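
An inner join on three keys drops any source row whose taxon or life history stage string does not match the occurrence table exactly. One way to inspect such rows, sketched here on toy data, is a left join with `indicator=True`.

```python
import pandas as pd

# Toy stand-ins for the source data and the occurrence table
source = pd.DataFrame({
    "eventID": ["evt-1", "evt-1", "evt-2"],
    "Genus species": ["CALANUS", "BARNACLES", "CALANUS"],
    "life_history_stage": ["c5-adult", "nauplius", "unknown"],
    "density": [3.17, 105.26, 0.35],
})
occurrences = pd.DataFrame({
    "occurrenceID": ["occ-1", "occ-2"],
    "eventID": ["evt-1", "evt-1"],
    "verbatimIdentification": ["CALANUS", "BARNACLES"],
    "life_history_stage": ["c5-adult", "nauplius"],
})

checked = source.merge(
    occurrences,
    how="left",
    left_on=["eventID", "Genus species", "life_history_stage"],
    right_on=["eventID", "verbatimIdentification", "life_history_stage"],
    indicator=True,  # adds a "_merge" column: "both" or "left_only"
)

# Source rows with no matching occurrence
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched[["eventID", "Genus species", "life_history_stage"]])
```

In this notebook the row count after the merge equals the source row count, which suggests no rows were lost or duplicated.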

In [30]:
len(emofsource_samplesoccur_all_df)

185729

In [None]:
emofsource_samplesoccur_all_df.tail()

Unnamed: 0,eventID,parentEventID,eventDate,sample_eventID,Genus species,life_history_stage,density,carbon_biomass,occurrenceID,verbatimIdentification
185724,PSZMP-SMP-121922TDBV1158,PSZMP-STA-20221219D_TDBV,2022-12-19 11:58:00-08:00,PSZMP-SMP-121922TDBV1158,HIPPOLYTIDAE,unknown,0.348311,0.001235,PSZMP-OCC-121922TDBV1158-201843-HIPPOLYTIDAE-I...,HIPPOLYTIDAE
185725,PSZMP-SMP-121922TDBV1158,PSZMP-STA-20221219D_TDBV,2022-12-19 11:58:00-08:00,PSZMP-SMP-121922TDBV1158,TORTANUS DISCAUDATUS,copepodite,0.174155,0.000298,PSZMP-OCC-121922TDBV1158-201844-TORTANUS_DISCA...,TORTANUS DISCAUDATUS
185726,PSZMP-SMP-121922TDBV1158,PSZMP-STA-20221219D_TDBV,2022-12-19 11:58:00-08:00,PSZMP-SMP-121922TDBV1158,METACARCINUS GRACILIS,"z3, zoea iii",0.174155,0.002495,PSZMP-OCC-121922TDBV1158-201845-METACARCINUS_G...,METACARCINUS GRACILIS
185727,PSZMP-SMP-121922TDBV1158,PSZMP-STA-20221219D_TDBV,2022-12-19 11:58:00-08:00,PSZMP-SMP-121922TDBV1158,CANCER PRODUCTUS,"z1, zoea i",0.174155,0.000983,PSZMP-OCC-121922TDBV1158-201846-CANCER_PRODUCT...,CANCER PRODUCTUS
185728,PSZMP-SMP-121922TDBV1158,PSZMP-STA-20221219D_TDBV,2022-12-19 11:58:00-08:00,PSZMP-SMP-121922TDBV1158,CANCER PRODUCTUS,"z4, zoea iv",0.174155,0.00514,PSZMP-OCC-121922TDBV1158-201847-CANCER_PRODUCT...,CANCER PRODUCTUS


`common_cols` will be used in the two occurrence-associated eMoF's

In [32]:
common_cols = ['eventID', 'occurrenceID']

### density / abundance

In [33]:
emofsource_samplesoccur_df = (
    emofsource_samplesoccur_all_df[common_cols + ['density']]
    .drop_duplicates()
    .sort_values(by='eventID')
    .reset_index(drop=True)
)

len(emofsource_samplesoccur_df)

185729

Count the number of records having empty (null) density values.

In [34]:
len(emofsource_samplesoccur_df[emofsource_samplesoccur_df['density'].isna()])

1703

Filter out records with empty density values.

In [35]:
emofsource_samplesoccur_nonulls_df = emofsource_samplesoccur_df[
    ~emofsource_samplesoccur_df['density'].isna()
]

Verify that there are no duplicate occurrences (duplicate `occurrenceID` entries)

In [36]:
len(emofsource_samplesoccur_nonulls_df.occurrenceID.unique()) == len(emofsource_samplesoccur_nonulls_df)

True
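
The null count, filter, and uniqueness check above can also be written with `dropna` and `Series.is_unique`. This is an equivalent sketch on toy data, not a change to the notebook's logic.

```python
import pandas as pd

toy = pd.DataFrame({
    "eventID": ["evt-1", "evt-1", "evt-2"],
    "occurrenceID": ["occ-1", "occ-2", "occ-3"],
    "density": [4.78, None, 0.35],
})

n_null = int(toy["density"].isna().sum())    # number of empty density values
nonulls = toy.dropna(subset=["density"])     # keep rows with a density value
ids_unique = nonulls["occurrenceID"].is_unique

print(n_null, len(nonulls), ids_unique)
```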

In [37]:
abundance_emof_df = populate_emof_columns(
    emofsource_samplesoccur_nonulls_df, 
    meas_type="Abundance of biological entity specified elsewhere per unit volume of the water body",
    meas_type_id="P01/current/SDBIOL01",
    meas_value_ser=emofsource_samplesoccur_nonulls_df['density'],
    # meas_value_format=".03f",
    meas_accuracy="0.01",
    meas_unit="Number per cubic metre",
    meas_unit_id="P06/current/UPMM"
)

In [38]:
# For validation, add 'life_history_stage', 'verbatimIdentification' to the emof columns

emof_df = concat_to_emof_df(abundance_emof_df, emof_df, is_occurrence_emof=True)

len(emof_df)

194733

In [39]:
emof_df.tail()

Unnamed: 0,eventID,occurrenceID,measurementType,measurementTypeID,measurementValue,measurementAccuracy,measurementUnit,measurementUnitID
194728,PSZMP-SMP-122320ELID1047,PSZMP-OCC-122320ELID1047-147216-CHAETOGNATHA-C...,Abundance of biological entity specified elsew...,https://vocab.nerc.ac.uk/collection/P01/curren...,0.24102,0.01,Number per cubic metre,https://vocab.nerc.ac.uk/collection/P06/curren...
194729,PSZMP-SMP-122320ELID1047,PSZMP-OCC-122320ELID1047-147215-CHAETOGNATHA-C...,Abundance of biological entity specified elsew...,https://vocab.nerc.ac.uk/collection/P01/curren...,0.060255,0.01,Number per cubic metre,https://vocab.nerc.ac.uk/collection/P06/curren...
194730,PSZMP-SMP-122320ELID1047,PSZMP-OCC-122320ELID1047-147214-CANCRIDAE-ZOEA...,Abundance of biological entity specified elsew...,https://vocab.nerc.ac.uk/collection/P01/curren...,1.205102,0.01,Number per cubic metre,https://vocab.nerc.ac.uk/collection/P06/curren...
194731,PSZMP-SMP-122320ELID1047,PSZMP-OCC-122320ELID1047-147220-HYDROZOA-MEDUS...,Abundance of biological entity specified elsew...,https://vocab.nerc.ac.uk/collection/P01/curren...,0.060255,0.01,Number per cubic metre,https://vocab.nerc.ac.uk/collection/P06/curren...
194732,PSZMP-SMP-122320ELID1047,PSZMP-OCC-122320ELID1047-147226-OSTRACODA-INDE...,Abundance of biological entity specified elsew...,https://vocab.nerc.ac.uk/collection/P01/curren...,1.205102,0.01,Number per cubic metre,https://vocab.nerc.ac.uk/collection/P06/curren...


### biomass carbon

In [40]:
emofsource_samplesoccur_df = (
    emofsource_samplesoccur_all_df[common_cols + ['carbon_biomass']]
    .drop_duplicates()
    .sort_values(by='eventID')
    .reset_index(drop=True)
)

len(emofsource_samplesoccur_df)

185729

Count the number of records having empty (null) carbon biomass values.

In [41]:
len(emofsource_samplesoccur_df[emofsource_samplesoccur_df['carbon_biomass'].isna()])

0

Amanda & BethElLee (Julie's team): *Our zooplankton carbon values come from the literature, most of which are derived from CHN analyzers where the remaining Carbon and Nitrogen are correlated to lengths of the analyzed specimens. The result is a Carbon regression applied to the actual length of a particular specimen within our sample or an average length multiplier. The Carbon value is then multiplied to the density (individuals per cubic meter of water sampled) of corresponding taxa, resulting in milligrams of Carbon per cubic meter of water sampled: mg/m3.*
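
The last step described in the quote is a simple product. The sketch below works it through with a hypothetical per-individual carbon value; only the density comes from the source data preview shown earlier (barnacle nauplius row).

```python
# Hypothetical carbon content per individual (mg C); in the real workflow this
# comes from literature length-to-carbon regressions
carbon_per_individual_mg = 0.0003

# Individuals per cubic metre, from the barnacle nauplius row shown earlier
density_per_m3 = 105.258527

# mg C per individual x individuals per m3 = mg C per m3
carbon_biomass_mg_per_m3 = carbon_per_individual_mg * density_per_m3
print(f"{carbon_biomass_mg_per_m3:.6f} mg/m3")
```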

Filter out records with empty carbon biomass values.

In [42]:
emofsource_samplesoccur_nonulls_df = emofsource_samplesoccur_df[
    ~emofsource_samplesoccur_df['carbon_biomass'].isna()
]

Verify that there are no duplicate occurrences (duplicate `occurrenceID` entries)

In [43]:
len(emofsource_samplesoccur_nonulls_df.occurrenceID.unique()) == len(emofsource_samplesoccur_nonulls_df)

True

In [44]:
# Full definition on NVS:
#   Biomass as carbon of mesozooplankton per unit volume of the water body 
#   by optical microscopy and computation of carbon biomass from abundance

carbon_emof_df = populate_emof_columns(
    emofsource_samplesoccur_nonulls_df, 
    # Abbreviated measurement type definition
    meas_type="Biomass as carbon of mesozooplankton per unit volume of the water body",
    meas_type_id="P01/current/MSBCMITX",
    meas_value_ser=emofsource_samplesoccur_nonulls_df['carbon_biomass'],
    # Round to 5 decimal places
    # meas_value_format=".05f",
    meas_accuracy="0.01",
    meas_unit="Milligrams per cubic metre",
    meas_unit_id="P06/current/UMMC"
)

In [45]:
# For validation, add 'life_history_stage', 'verbatimIdentification' to the emof columns

emof_df = concat_to_emof_df(carbon_emof_df, emof_df, is_occurrence_emof=True)

len(emof_df)

380462
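
The final row count can be reconciled from the counts reported in earlier cells: three event-level eMoF types per sample event, plus one abundance record per occurrence with a non-null density, plus one carbon biomass record per occurrence.

```python
# Counts reported by earlier cells in this notebook
n_sample_events = 3569
n_event_emof_types = 3      # net type, sampling method, mesh size
n_occurrences = 185729
n_null_density = 1703
n_null_carbon = 0

expected_total = (
    n_sample_events * n_event_emof_types
    + (n_occurrences - n_null_density)
    + (n_occurrences - n_null_carbon)
)
print(expected_total)
```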

In [46]:
emof_df.tail()

Unnamed: 0,eventID,occurrenceID,measurementType,measurementTypeID,measurementValue,measurementAccuracy,measurementUnit,measurementUnitID
380457,PSZMP-SMP-122320ELID1047,PSZMP-OCC-122320ELID1047-147216-CHAETOGNATHA-C...,Biomass as carbon of mesozooplankton per unit ...,https://vocab.nerc.ac.uk/collection/P01/curren...,0.039178,0.01,Milligrams per cubic metre,https://vocab.nerc.ac.uk/collection/P06/curren...
380458,PSZMP-SMP-122320ELID1047,PSZMP-OCC-122320ELID1047-147215-CHAETOGNATHA-C...,Biomass as carbon of mesozooplankton per unit ...,https://vocab.nerc.ac.uk/collection/P01/curren...,0.002642,0.01,Milligrams per cubic metre,https://vocab.nerc.ac.uk/collection/P06/curren...
380459,PSZMP-SMP-122320ELID1047,PSZMP-OCC-122320ELID1047-147214-CANCRIDAE-ZOEA...,Biomass as carbon of mesozooplankton per unit ...,https://vocab.nerc.ac.uk/collection/P01/curren...,0.007155,0.01,Milligrams per cubic metre,https://vocab.nerc.ac.uk/collection/P06/curren...
380460,PSZMP-SMP-122320ELID1047,PSZMP-OCC-122320ELID1047-147220-HYDROZOA-MEDUS...,Biomass as carbon of mesozooplankton per unit ...,https://vocab.nerc.ac.uk/collection/P01/curren...,0.000131,0.01,Milligrams per cubic metre,https://vocab.nerc.ac.uk/collection/P06/curren...
380461,PSZMP-SMP-122320ELID1047,PSZMP-OCC-122320ELID1047-147226-OSTRACODA-INDE...,Biomass as carbon of mesozooplankton per unit ...,https://vocab.nerc.ac.uk/collection/P01/curren...,0.021152,0.01,Milligrams per cubic metre,https://vocab.nerc.ac.uk/collection/P06/curren...


## Export `emof_df` to csv

In [47]:
emof_df.measurementType.value_counts()

Biomass as carbon of mesozooplankton per unit volume of the water body                  185729
Abundance of biological entity specified elsewhere per unit volume of the water body    184026
plankton net                                                                              3569
Sampling method                                                                           3569
Sampling net mesh size                                                                    3569
Name: measurementType, dtype: int64

In [48]:
csv_fpth = data_pth / "aligned_csvs" / "DwC_emof.csv"

In [49]:
if not debug_no_csvexport:
    emof_df.to_csv(csv_fpth, index=False)

### Create zip file with the csv

In [50]:
if not debug_no_csvexport:
    create_csv_zip(csv_fpth)
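
The implementation of `create_csv_zip` lives in `data_preprocess` and is not shown here. As a rough idea of what such a step can look like, the sketch below uses the standard library `zipfile` module. The function name, the output naming scheme and the demo file are all hypothetical.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_single_csv(csv_path: Path) -> Path:
    """Hypothetical stand-in for create_csv_zip; the real helper may differ."""
    zip_path = csv_path.with_suffix(".zip")
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # arcname stores only the file name, not the full directory path
        zf.write(csv_path, arcname=csv_path.name)
    return zip_path


with tempfile.TemporaryDirectory() as tmp_dir:
    demo_csv = Path(tmp_dir) / "DwC_emof_demo.csv"
    demo_csv.write_text("eventID,measurementType\nevt-1,plankton net\n")
    demo_zip = zip_single_csv(demo_csv)
    with zipfile.ZipFile(demo_zip) as zf:
        archived_names = zf.namelist()

print(archived_names)
```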

## Package versions

In [51]:
print(
    f"{datetime.utcnow()} +00:00\n"
    f"pandas: {pd.__version__}"
)

2024-12-12 22:01:49.868259 +00:00
pandas: 1.5.3
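
`datetime.utcnow()` returns a naive datetime, which is why the offset is appended by hand above, and it is deprecated as of Python 3.12. A timezone-aware alternative, offered as a sketch:

```python
from datetime import datetime, timezone

# Timezone-aware UTC timestamp; isoformat() writes the +00:00 offset itself
utc_stamp = datetime.now(timezone.utc).isoformat(sep=" ", timespec="seconds")
print(utc_stamp)
```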
