# DwC eMoFs. Keister Zooplankton Hood Canal 2012-13 data

University of Washington Pelagic Hypoxia Hood Canal project, Zooplankton dataset.   
Alignment of the dataset to Darwin Core (DwC) for NANOOS, https://www.nanoos.org. This data alignment work, including this Jupyter notebook, is described in the GitHub repository https://github.com/nanoos-pnw/obis-keisterhczoop. See [notebooks-notes.md](https://github.com/nanoos-pnw/obis-keisterhczoop/blob/main/notebooks-notes.md) and [README.md](https://github.com/nanoos-pnw/obis-keisterhczoop/blob/main/README.md).

Emilio Mayorga, https://github.com/emiliom

## Goals and scope of this notebook

Parse the source data and combine it with `common_mappings.json`, the previously generated DwC event and occurrence csv files, and the `intermediate_DwC_occurrence_life_history_stage.csv` file created in the notebook `Keister-dwcOccurrence.ipynb` to create the DwC "Extended Measurement or Fact" (eMoF) file `DwC_emof.csv`. This file contains four types of eMoF information, referenced to standard vocabularies:
- Description of the multinet sampling platform
- Assignment of the water column sampling scheme
- Assignment of the sample mesh size
- Matchup of the sample density (abundance per unit volume) to the `occurrenceID` from the DwC occurrence table

The DwC eMoF table is populated sequentially for each of these eMoF types, in that order. Columns are populated differently depending on the eMoF type.
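
As a rough illustration of how columns end up populated differently by eMoF type, the sketch below concatenates one event-level record and one occurrence-level record. All values and the `toy_` names are made up for illustration; only the column names come from this notebook.

```python
import pandas as pd

# Column layout of the eMoF table, as defined later in this notebook
emof_cols = [
    "eventID", "occurrenceID",
    "measurementType", "measurementTypeID",
    "measurementValue", "measurementValueID",
    "measurementUnit", "measurementUnitID",
]

# Event-level record: no occurrenceID and no unit (toy values)
event_level = pd.DataFrame({
    "eventID": ["sampleA"],
    "measurementType": ["multinet"],
    "measurementValue": ["Hydro-bios 5-net"],
})

# Occurrence-level record: occurrenceID and unit are filled in (toy values)
occurrence_level = pd.DataFrame({
    "eventID": ["sampleA"],
    "occurrenceID": ["occ-1"],
    "measurementType": ["Abundance"],
    "measurementValue": [12.5],
    "measurementUnit": ["Number per cubic metre"],
})

# Columns missing from either input are filled with NaN
toy_emof = (
    pd.concat([event_level, occurrence_level], ignore_index=True)
    .reindex(columns=emof_cols)
)
print(toy_emof[["eventID", "occurrenceID", "measurementUnit"]])
```

Columns that a given eMoF type does not use are simply left empty, which is how they appear in the exported csv.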

## Settings

In [1]:
import json
from datetime import datetime, timedelta, timezone
from pathlib import Path

import numpy as np
import pandas as pd

from data_preprocess import read_and_parse_sourcedata

In [2]:
data_pth = Path(".")

Set to `True` when debugging. `csv` files will not be exported when `debug_no_csvexport = True`.

In [3]:
debug_no_csvexport = False

## Process JSON file containing common mappings and strings

In [4]:
with open(data_pth / 'common_mappings.json') as f:
    common_mappings = json.load(f)

In [5]:
DatasetCode = common_mappings['datasetcode']
net_tow = common_mappings['net_tow']

iso8601_format = common_mappings['iso8601_format']
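
For orientation, here is a minimal sketch of what a mappings file with these three keys might look like. The key names are the ones read above, and the `FWC` and `DS` codes are taken from the `FWC_DS` column of the source data. Every value is a placeholder assumption; the real `common_mappings.json` may differ.

```python
import json

# Hypothetical content: only the key names are known from this notebook
example_json = """
{
  "datasetcode": "EXAMPLE_DATASET",
  "net_tow": {"FWC": "full water column", "DS": "depth stratified"},
  "iso8601_format": "%Y-%m-%dT%H:%M:%S%z"
}
"""

example_mappings = json.loads(example_json)
example_net_tow = example_mappings["net_tow"]

# Look up the description for a depth-stratified tow
print(example_net_tow["DS"])
```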

## Pre-process data from csv for eMoF table

### Read the pre-processed csv file

`usecols` defines the columns that will be kept and the order in which they'll be organized

In [6]:
# I doubt I'll need any of the datetime columns here
usecols = [
    'sample_code', 'station', 
    # 'date', 'time_start', 'time', 'day_night', 
    'mesh_size', 'FWC_DS', 
    'species', 'life_history_stage', 'density'
]

emofsource_df = read_and_parse_sourcedata()[usecols]
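
Indexing a dataframe with a list of column names both subsets and reorders the columns, which is how `usecols` is applied above. A small sketch with a toy dataframe (the `toy_` names and values are illustrative only):

```python
import pandas as pd

toy_source = pd.DataFrame({
    "density": [6.4, 38.8],
    "station": ["DB", "DB"],
    "date": ["2013-10-03", "2013-09-06"],
    "sample_code": ["s1", "s2"],
})

# Keep three columns, in this order; "date" is dropped
toy_usecols = ["sample_code", "station", "density"]
toy_subset = toy_source[toy_usecols]

print(list(toy_subset.columns))
```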

In [7]:
len(emofsource_df)

6867

In [8]:
emofsource_df.head()

Unnamed: 0,sample_code,station,mesh_size,FWC_DS,species,life_history_stage,density
0,20131003DBDm2_200,DB,200,DS,ACARTIA,3;_CIII,6.395349
1,20130906DBiDm1_200,DB,200,FWC,ACARTIA,5;_CV,38.75969
2,20131003DBDm1_200,DB,200,FWC,ACARTIA,Female;_Adult,0.384615
3,20131003DBDm1_200,DB,200,FWC,ACARTIA,Male;_Adult,1.538462
4,20120614DBDm3_200,DB,200,DS,ACARTIA_CLAUSI,Female;_Adult,21.052632


## Read dwcEvent and dwcOccurrence csv's

In [9]:
dwcevent_df = pd.read_csv(
    data_pth / "aligned_csvs" / "DwC_event.csv",
    parse_dates=['eventDate'],
    usecols=['eventID', 'parentEventID', 'eventDate']
)

dwcoccurrence_df = pd.read_csv(
    data_pth / "aligned_csvs" / "DwC_occurrence.csv",
    usecols=['occurrenceID', 'eventID', 'verbatimIdentification']
)

occurrence_vs_life_history_stage = pd.read_csv(
    data_pth / "intermediate_DwC_occurrence_life_history_stage.csv"
)

Add `life_history_stage` to the occurrence dataframe, to enable the eMoF data processing steps used here.

In [10]:
dwcoccurrence_df = dwcoccurrence_df.merge(occurrence_vs_life_history_stage, on='occurrenceID')
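
This merge is expected to add exactly one `life_history_stage` value per occurrence. One way to guard that expectation is the `validate` argument of `DataFrame.merge`, sketched below on toy data. It assumes the intermediate file holds one row per `occurrenceID`, which this notebook does not state explicitly.

```python
import pandas as pd

toy_occurrence = pd.DataFrame({
    "occurrenceID": ["occ-1", "occ-2"],
    "eventID": ["s1", "s1"],
    "verbatimIdentification": ["ACARTIA", "ACARTIA"],
})
toy_stage = pd.DataFrame({
    "occurrenceID": ["occ-1", "occ-2"],
    "life_history_stage": ["3;_CIII", "5;_CV"],
})

# validate="one_to_one" raises pandas.errors.MergeError
# if occurrenceID is duplicated on either side
toy_merged = toy_occurrence.merge(
    toy_stage, on="occurrenceID", validate="one_to_one"
)
print(toy_merged)
```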

## Create empty eMoF dataframe

In [11]:
# Won't use measurementID, per Abby's explanation on Slack
emof_cols_dtypes = np.dtype(
    [
        ('eventID', str),
        ('occurrenceID', str), 
        # --- temporary, for validation
        # ('life_history_stage', str),
        # ('verbatimIdentification', str),
        # ---- below, commented out columns not used
        ('measurementType', str),
        ('measurementTypeID', str), 
        ('measurementValue', str),
        ('measurementValueID', str),
        # ('measurementAccuracy', str),
        ('measurementUnit', str),
        ('measurementUnitID', str),
        # ('measurementDeterminedDate', str),
        # ('measurementDeterminedBy', str), 
        # ('measurementMethod', str),
        # ('measurementRemarks', str)
    ]
)

In [12]:
emof_df = pd.DataFrame(np.empty(0, dtype=emof_cols_dtypes))

## MoF's associated with an event rather than an occurrence

These will have no `occurrenceID` entry.

### Associated with sample events

In [13]:
emofsource_samples_df = dwcevent_df.merge(
    emofsource_df[['sample_code', 'mesh_size', 'FWC_DS']], 
    how='inner',
    left_on='eventID',
    right_on='sample_code'
)

emofsource_samples_df = (
    emofsource_samples_df
    .drop_duplicates()
    .drop(columns='sample_code')
    .sort_values(by='eventID')
    .reset_index(drop=True)
)
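
The source data has one row per taxon and life history stage, so the merge above repeats each sample many times before `drop_duplicates` collapses it to one row per sample. A toy sketch of the same pattern (all values are made up):

```python
import pandas as pd

toy_events = pd.DataFrame({
    "eventID": ["s1", "s2"],
    "parentEventID": ["p1", "p1"],
})

# Two source rows per sample, standing in for multiple taxa per sample
toy_source = pd.DataFrame({
    "sample_code": ["s2", "s1", "s1", "s2"],
    "mesh_size": [335, 200, 200, 335],
    "FWC_DS": ["DS", "DS", "DS", "DS"],
})

toy_samples = (
    toy_events
    .merge(toy_source, how="inner", left_on="eventID", right_on="sample_code")
    .drop_duplicates()
    .drop(columns="sample_code")
    .sort_values(by="eventID")
    .reset_index(drop=True)
)
print(toy_samples)
```

If a sample carried inconsistent `mesh_size` or `FWC_DS` values across its source rows, it would survive as more than one row, so checking `toy_samples["eventID"].is_unique` is a cheap sanity test.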

In [14]:
len(emofsource_samples_df)

271

In [15]:
emofsource_samples_df.head()

Unnamed: 0,eventID,parentEventID,eventDate,mesh_size,FWC_DS
0,20120611UNDm1_200,UWPHHCZoop_CB975-20120611UND,2012-06-11 14:10:00-07:00,200,DS
1,20120611UNDm1_335,UWPHHCZoop_CB975-20120611UND,2012-06-11 15:50:00-07:00,335,DS
2,20120611UNDm2_200,UWPHHCZoop_CB975-20120611UND,2012-06-11 16:53:00-07:00,200,DS
3,20120611UNDm2_335,UWPHHCZoop_CB975-20120611UND,2012-06-11 15:50:00-07:00,335,DS
4,20120611UNDm3_200,UWPHHCZoop_CB975-20120611UND,2012-06-11 23:10:00-07:00,200,DS


#### multinet sampling

In [16]:
multinet_emof_df = emofsource_samples_df.copy()

In [17]:
multinet_emof_df['measurementType'] = "multinet"
multinet_emof_df['measurementTypeID'] = "http://vocab.nerc.ac.uk/collection/L05/current/68/"
multinet_emof_df['measurementValue'] = 'Hydro-bios 5-net'

In [18]:
len(multinet_emof_df)

271

Populate (append to) the `emof_df` table with the emof records.

In [19]:
emof_df = pd.concat(
    [
        emof_df,
        multinet_emof_df[['eventID', 'measurementType', 'measurementTypeID', 'measurementValue']]
    ],
    ignore_index=True
)

In [20]:
len(emof_df)

271

#### depth stratified vs full water column sampling

In [21]:
fwcds_emof_df = emofsource_samples_df.copy()

In [22]:
fwcds_emof_df['measurementType'] = "Sampling method"
fwcds_emof_df['measurementTypeID'] = "http://vocab.nerc.ac.uk/collection/Q01/current/Q0100003/"
fwcds_emof_df['measurementValue'] = fwcds_emof_df['FWC_DS'].apply(lambda fwc_ds: net_tow[fwc_ds])
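
The `apply` call above raises a `KeyError` if a code is missing from `net_tow`. An alternative, sketched here with hypothetical mapping values, is `Series.map`, which returns `NaN` for unknown codes so they can be listed before deciding how to handle them.

```python
import pandas as pd

# Hypothetical mapping values; the real ones live in common_mappings.json
example_net_tow = {"FWC": "full water column", "DS": "depth stratified"}

# "XX" stands in for an unexpected code
codes = pd.Series(["DS", "FWC", "XX"])

mapped = codes.map(example_net_tow)
unmapped = codes[mapped.isna()]

print(mapped.tolist())
print(unmapped.tolist())
```

Whether failing fast or collecting unknown codes is preferable depends on how much the source data is trusted.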

In [23]:
len(fwcds_emof_df)

271

Populate (append to) the `emof_df` table with the emof records.

In [24]:
emof_df = pd.concat(
    [
        emof_df,
        fwcds_emof_df[['eventID', 'measurementType', 'measurementTypeID', 'measurementValue']], 
    ],
    ignore_index=True
)

In [25]:
len(emof_df)

542

In [26]:
emof_df.head(10)

Unnamed: 0,eventID,occurrenceID,measurementType,measurementTypeID,measurementValue,measurementValueID,measurementUnit,measurementUnitID
0,20120611UNDm1_200,,multinet,http://vocab.nerc.ac.uk/collection/L05/current...,Hydro-bios 5-net,,,
1,20120611UNDm1_335,,multinet,http://vocab.nerc.ac.uk/collection/L05/current...,Hydro-bios 5-net,,,
2,20120611UNDm2_200,,multinet,http://vocab.nerc.ac.uk/collection/L05/current...,Hydro-bios 5-net,,,
3,20120611UNDm2_335,,multinet,http://vocab.nerc.ac.uk/collection/L05/current...,Hydro-bios 5-net,,,
4,20120611UNDm3_200,,multinet,http://vocab.nerc.ac.uk/collection/L05/current...,Hydro-bios 5-net,,,
5,20120611UNDm3_335,,multinet,http://vocab.nerc.ac.uk/collection/L05/current...,Hydro-bios 5-net,,,
6,20120611UNDm4_200,,multinet,http://vocab.nerc.ac.uk/collection/L05/current...,Hydro-bios 5-net,,,
7,20120611UNNm1_200,,multinet,http://vocab.nerc.ac.uk/collection/L05/current...,Hydro-bios 5-net,,,
8,20120611UNNm1_335,,multinet,http://vocab.nerc.ac.uk/collection/L05/current...,Hydro-bios 5-net,,,
9,20120611UNNm2_200,,multinet,http://vocab.nerc.ac.uk/collection/L05/current...,Hydro-bios 5-net,,,


#### mesh size

In [27]:
mesh_emof_df = emofsource_samples_df.copy()

In [28]:
mesh_emof_df['measurementType'] = "Sampling net mesh size"
mesh_emof_df['measurementTypeID'] = "http://vocab.nerc.ac.uk/collection/Q01/current/Q0100015/"
mesh_emof_df['measurementUnit'] = "Micrometres (microns)"
mesh_emof_df['measurementUnitID'] = "http://vocab.nerc.ac.uk/collection/P06/current/UMIC/"

mesh_emof_df.rename(columns={'mesh_size':'measurementValue'}, inplace=True)

In [29]:
len(mesh_emof_df)

271

Populate (append to) the `emof_df` table with the emof records.

In [30]:
emof_df = pd.concat(
    [
        emof_df,
        mesh_emof_df[[
            'eventID', 
            'measurementType', 'measurementTypeID', 'measurementValue', 
            'measurementUnit', 'measurementUnitID'
        ]]
    ],
    ignore_index=True
)

In [31]:
len(emof_df)

813

In [32]:
emof_df.tail(10)

Unnamed: 0,eventID,occurrenceID,measurementType,measurementTypeID,measurementValue,measurementValueID,measurementUnit,measurementUnitID
803,20131003DBDm3_200,,Sampling net mesh size,http://vocab.nerc.ac.uk/collection/Q01/current...,200,,Micrometres (microns),http://vocab.nerc.ac.uk/collection/P06/current...
804,20131003DBDm3_335,,Sampling net mesh size,http://vocab.nerc.ac.uk/collection/Q01/current...,335,,Micrometres (microns),http://vocab.nerc.ac.uk/collection/P06/current...
805,20131003DBDm4_200,,Sampling net mesh size,http://vocab.nerc.ac.uk/collection/Q01/current...,200,,Micrometres (microns),http://vocab.nerc.ac.uk/collection/P06/current...
806,20131003DBDm4_335,,Sampling net mesh size,http://vocab.nerc.ac.uk/collection/Q01/current...,335,,Micrometres (microns),http://vocab.nerc.ac.uk/collection/P06/current...
807,20131003DBDm5_200,,Sampling net mesh size,http://vocab.nerc.ac.uk/collection/Q01/current...,200,,Micrometres (microns),http://vocab.nerc.ac.uk/collection/P06/current...
808,20131003DBDm5_335,,Sampling net mesh size,http://vocab.nerc.ac.uk/collection/Q01/current...,335,,Micrometres (microns),http://vocab.nerc.ac.uk/collection/P06/current...
809,20131003DBNm1_335,,Sampling net mesh size,http://vocab.nerc.ac.uk/collection/Q01/current...,335,,Micrometres (microns),http://vocab.nerc.ac.uk/collection/P06/current...
810,20131003DBNm2_335,,Sampling net mesh size,http://vocab.nerc.ac.uk/collection/Q01/current...,335,,Micrometres (microns),http://vocab.nerc.ac.uk/collection/P06/current...
811,20131003DBNm3_335,,Sampling net mesh size,http://vocab.nerc.ac.uk/collection/Q01/current...,335,,Micrometres (microns),http://vocab.nerc.ac.uk/collection/P06/current...
812,20131003DBNm4_335,,Sampling net mesh size,http://vocab.nerc.ac.uk/collection/Q01/current...,335,,Micrometres (microns),http://vocab.nerc.ac.uk/collection/P06/current...


## MoF's associated with exactly one occurrence

These will have an `occurrenceID` entry.

To merge the occurrence table with `emofsource_df` correctly, the original `life_history_stage` strings must be preserved in the occurrence table. This can be done either by adding an `occurrenceRemarks` column, or by keeping a `life_history_stage` column that is dropped at a later stage. This notebook uses the second approach: `life_history_stage` was added to `dwcoccurrence_df` above from `intermediate_DwC_occurrence_life_history_stage.csv`.

In [33]:
emofsource_samplesoccur_df = (
    dwcevent_df
    .merge(
        emofsource_df[['sample_code', 'species', 'life_history_stage', 'density']], 
        how='inner',
        left_on='eventID',
        right_on='sample_code'
    )
    .merge(
        dwcoccurrence_df,
        how='inner',
        left_on=['eventID', 'species', 'life_history_stage'],
        right_on=['eventID', 'verbatimIdentification', 'life_history_stage']
    )
)

emofsource_samplesoccur_df = (
    emofsource_samplesoccur_df
    .drop_duplicates()
    .drop(columns=['sample_code', 'parentEventID', 'species'])
    .sort_values(by='eventID')
    .reset_index(drop=True)
)
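
The source data has 6867 rows, while the merged and de-duplicated result below has 6854. This notebook does not explain the difference. One way to inspect it, sketched on toy data, is a left merge with `indicator=True`, which flags source rows that find no matching occurrence.

```python
import pandas as pd

toy_source = pd.DataFrame({
    "sample_code": ["s1", "s1", "s2"],
    "species": ["ACARTIA", "ACARTIA", "METRIDIA_PACIFICA"],
    "life_history_stage": ["3;_CIII", "5;_CV", "1;_CI"],
    "density": [6.4, 38.8, 17.8],
})
toy_occurrence = pd.DataFrame({
    "occurrenceID": ["occ-1", "occ-2"],
    "eventID": ["s1", "s1"],
    "verbatimIdentification": ["ACARTIA", "ACARTIA"],
    "life_history_stage": ["3;_CIII", "5;_CV"],
})

# how="left" keeps every source row; _merge says whether a match was found
check = toy_source.merge(
    toy_occurrence,
    how="left",
    left_on=["sample_code", "species", "life_history_stage"],
    right_on=["eventID", "verbatimIdentification", "life_history_stage"],
    indicator=True,
)
unmatched = check[check["_merge"] == "left_only"]
print(unmatched[["sample_code", "species", "life_history_stage"]])
```

Exact duplicate rows removed by `drop_duplicates` would be another possible source of the difference; `DataFrame.duplicated` can be used to count those.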

In [34]:
len(emofsource_samplesoccur_df)

6854

In [35]:
emofsource_samplesoccur_df.head()

Unnamed: 0,eventID,eventDate,life_history_stage,density,occurrenceID,verbatimIdentification
0,20120611UNDm1_200,2012-06-11 14:10:00-07:00,Female;_Adult,47.567568,3dc09f04-260b-470c-bb8b-21a56df3b326,ACARTIA_CLAUSI
1,20120611UNDm1_200,2012-06-11 14:10:00-07:00,1;_CI,17.83063,15ff027b-4c05-4741-a2e2-8cfb8ee6d914,METRIDIA_PACIFICA
2,20120611UNDm1_200,2012-06-11 14:10:00-07:00,3;_CIII,79.279279,3dfa54c6-31ba-4142-996d-72a7469f1009,METRIDIA_PACIFICA
3,20120611UNDm1_200,2012-06-11 14:10:00-07:00,4;_CIV,71.66847,f3d71f04-e053-41f1-95b8-18ec55432969,METRIDIA_PACIFICA
4,20120611UNDm1_200,2012-06-11 14:10:00-07:00,5;_CV,53.953154,84409f0c-4834-42a7-a39e-d934afd1ede6,METRIDIA_PACIFICA


### density / abundance

In [36]:
abundance_emof_df = emofsource_samplesoccur_df.copy()

In [37]:
abundance_emof_df['measurementType'] = "Abundance of biological entity specified elsewhere per unit volume of the water body"
abundance_emof_df['measurementTypeID'] = "http://vocab.nerc.ac.uk/collection/P01/current/SDBIOL01/"
abundance_emof_df['measurementUnit'] = "Number per cubic metre"
abundance_emof_df['measurementUnitID'] = "http://vocab.nerc.ac.uk/collection/P06/current/UPMM/"

abundance_emof_df.rename(columns={'density':'measurementValue'}, inplace=True)

Populate (append to) the `emof_df` table with the emof records.

In [38]:
emof_df = pd.concat(
    [
        emof_df,
        abundance_emof_df[[
            'eventID', 'occurrenceID',
            # 'life_history_stage', 'verbatimIdentification', # --- temporary, for validation
            'measurementType', 'measurementTypeID', 'measurementValue', 
            'measurementUnit', 'measurementUnitID'
        ]]
    ],
    ignore_index=True
)

In [39]:
len(emof_df)

7667

In [40]:
emof_df.tail(10)

Unnamed: 0,eventID,occurrenceID,measurementType,measurementTypeID,measurementValue,measurementValueID,measurementUnit,measurementUnitID
7657,20131003DBNm4_335,6db48ca3-6aa8-4559-9084-91395c4aa60f,Abundance of biological entity specified elsew...,http://vocab.nerc.ac.uk/collection/P01/current...,0.190476,,Number per cubic metre,http://vocab.nerc.ac.uk/collection/P06/current...
7658,20131003DBNm4_335,722c9cd0-39ca-491e-927f-0e6e079167e8,Abundance of biological entity specified elsew...,http://vocab.nerc.ac.uk/collection/P01/current...,0.190476,,Number per cubic metre,http://vocab.nerc.ac.uk/collection/P06/current...
7659,20131003DBNm4_335,92924515-b22c-4b12-9477-bf06d0f69f93,Abundance of biological entity specified elsew...,http://vocab.nerc.ac.uk/collection/P01/current...,0.190476,,Number per cubic metre,http://vocab.nerc.ac.uk/collection/P06/current...
7660,20131003DBNm4_335,891777e0-914e-4f2d-836b-8a8f0a7c8d3c,Abundance of biological entity specified elsew...,http://vocab.nerc.ac.uk/collection/P01/current...,0.190476,,Number per cubic metre,http://vocab.nerc.ac.uk/collection/P06/current...
7661,20131003DBNm4_335,2d7893a0-0a1f-469e-a69f-e6bb8abfd991,Abundance of biological entity specified elsew...,http://vocab.nerc.ac.uk/collection/P01/current...,2.095238,,Number per cubic metre,http://vocab.nerc.ac.uk/collection/P06/current...
7662,20131003DBNm4_335,01cbc051-6ac0-4a5c-99d3-d8ab229feae2,Abundance of biological entity specified elsew...,http://vocab.nerc.ac.uk/collection/P01/current...,0.380952,,Number per cubic metre,http://vocab.nerc.ac.uk/collection/P06/current...
7663,20131003DBNm4_335,30bfbffc-44d5-4ba0-9d47-8c6eca8ac909,Abundance of biological entity specified elsew...,http://vocab.nerc.ac.uk/collection/P01/current...,4.190476,,Number per cubic metre,http://vocab.nerc.ac.uk/collection/P06/current...
7664,20131003DBNm4_335,6d381a52-daf2-42ee-9abe-e9a611a39c7c,Abundance of biological entity specified elsew...,http://vocab.nerc.ac.uk/collection/P01/current...,4.952381,,Number per cubic metre,http://vocab.nerc.ac.uk/collection/P06/current...
7665,20131003DBNm4_335,7c3352e0-045b-4ebc-a8ca-eac5db1115a9,Abundance of biological entity specified elsew...,http://vocab.nerc.ac.uk/collection/P01/current...,1.904762,,Number per cubic metre,http://vocab.nerc.ac.uk/collection/P06/current...
7666,20131003DBNm4_335,e38f37ae-2997-4ca2-8ed8-f48c0947c6f0,Abundance of biological entity specified elsew...,http://vocab.nerc.ac.uk/collection/P01/current...,0.190476,,Number per cubic metre,http://vocab.nerc.ac.uk/collection/P06/current...


## Export `emof_df` to csv

In [41]:
emof_df.measurementType.value_counts()

Abundance of biological entity specified elsewhere per unit volume of the water body    6854
multinet                                                                                 271
Sampling method                                                                          271
Sampling net mesh size                                                                   271
Name: measurementType, dtype: int64
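
These counts are consistent with the totals reported earlier: three event-level eMoF types with 271 records each, plus 6854 abundance records, give 7667 rows. A small sketch of the same bookkeeping on a toy table, splitting rows by whether `occurrenceID` is set (toy values only):

```python
import pandas as pd

toy_emof = pd.DataFrame({
    "eventID": ["s1", "s1", "s1", "s1"],
    "occurrenceID": [None, None, None, "occ-1"],
    "measurementType": [
        "multinet", "Sampling method", "Sampling net mesh size", "Abundance",
    ],
})

# Event-level records have no occurrenceID
is_occurrence_level = toy_emof["occurrenceID"].notna()
n_event_level = int((~is_occurrence_level).sum())
n_occurrence_level = int(is_occurrence_level.sum())
print(n_event_level, n_occurrence_level)

# Arithmetic check on the counts reported in this notebook
assert 3 * 271 + 6854 == 7667
```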

In [42]:
if not debug_no_csvexport:
    emof_df.to_csv(data_pth / 'aligned_csvs' / 'DwC_emof.csv', index=False)
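
As a hedged illustration of what the export produces, the sketch below writes a toy eMoF table to a temporary file and reads it back. Missing values such as an absent `occurrenceID` are written as empty fields. The file name and values are placeholders.

```python
import tempfile
from pathlib import Path

import pandas as pd

toy_emof = pd.DataFrame({
    "eventID": ["s1", "s1"],
    "occurrenceID": [None, "occ-1"],
    "measurementValue": ["Hydro-bios 5-net", "12.5"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "toy_emof.csv"
    # index=False keeps the row index out of the file
    toy_emof.to_csv(csv_path, index=False)
    csv_text = csv_path.read_text()
    reloaded = pd.read_csv(csv_path)

print(csv_text)
```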

## Package versions

In [43]:
print(
    f"{datetime.utcnow()} +00:00\n"
    f"pandas: {pd.__version__}"
)

2023-11-01 04:06:08.661759 +00:00
pandas: 1.5.3
