# DwC eMoFs. Keister Zooplankton Hood Canal 2012-13 data

University of Washington Pelagic Hypoxia Hood Canal project, Zooplankton dataset.

2022-01-12

eMoF: Extended Measurement or Fact

**TODO:**
- DONE. Manually validate the resulting eMoF table, especially for abundance. Some repeated adjacent abundance values look suspicious.
- Consider other eMoFs to include, e.g. day vs night and other event-level metadata.
- Consider rounding abundance values to, say, 2-3 decimal places.
- sex and life stage from `life_history_stage`
    - Add emof records here, in addition to the `sex` and `lifeStage` columns added to the occurrence table?
    - sex vocabulary: http://vocab.nerc.ac.uk/collection/S10/current/
    - life stage vocabulary: http://vocab.nerc.ac.uk/collection/S11/current/
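
The sex / life stage item in the TODO list could start from something like the sketch below. It assumes every `life_history_stage` string follows the `<prefix>;_<stage>` pattern seen in the sample rows (`Female;_Adult`, `3;_CIII`), and that only the `Female` and `Male` prefixes carry sex information. The function name is hypothetical, and mapping the results to S10 / S11 vocabulary terms is left out.

```python
import pandas as pd


def split_life_history_stage(value):
    """Split a source string such as 'Female;_Adult' into (sex, life stage)."""
    prefix, _, stage = value.partition(";_")
    # Assumption: numeric prefixes (e.g. '3' in '3;_CIII') say nothing about sex
    sex = prefix.lower() if prefix in ("Female", "Male") else None
    return sex, stage


stages = pd.Series(["Female;_Adult", "Male;_Adult", "3;_CIII"])
parsed = pd.DataFrame(
    [split_life_history_stage(v) for v in stages],
    columns=["sex", "lifeStage"],
)
print(parsed)
```

Strings without the `;_` separator would come back with an empty life stage, so they are worth checking for before relying on this.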

In [1]:
import json
from datetime import datetime, timedelta, timezone
from pathlib import Path

import numpy as np
import pandas as pd

In [2]:
data_pth = Path(".")

## Process JSON file containing common mappings and strings

In [3]:
with open(data_pth / 'common_mappings.json') as f:
    common_mappings = json.load(f)

In [4]:
DatasetCode = common_mappings['datasetcode']
net_tow = common_mappings['net_tow']

iso8601_format = common_mappings['iso8601_format']
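
For orientation, here is a sketch of the shape `common_mappings.json` would need for the lookups above to work. Only the three key names come from this notebook. Every value below is a hypothetical placeholder, since the real file is not shown here.

```python
import json

# Hypothetical contents: only the key names are taken from the notebook
example_json = """
{
  "datasetcode": "EXAMPLE_DATASET",
  "net_tow": {"FWC": "full water column", "DS": "depth stratified"},
  "iso8601_format": "%Y-%m-%dT%H:%M:%SZ"
}
"""

common_mappings_example = json.loads(example_json)
net_tow_example = common_mappings_example["net_tow"]
print(net_tow_example["DS"])
```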

## Pre-process data from csv for Event table

### Read the csv file

In [5]:
sourcecsvdata_pth = data_pth / "sourcedata" / "bcodmo_dataset_682074_data.csv"

`usecols` lists the columns to keep. Indexing the result with `[usecols]` then puts them in that order.

In [6]:
# I doubt I'll need any of the datetime columns here
usecols = [
    'sample_code', 'station', 
    'date', 'time_start', 'time', 'day_night', 
    'mesh_size', 'FWC_DS', 
    'species', 'life_history_stage', 'density'
]

emofsource_df = pd.read_csv(
    sourcecsvdata_pth, 
    skiprows=[1], 
    parse_dates=['time'],
    usecols=usecols
)[usecols]
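
A small sketch of why the trailing `[usecols]` matters: `read_csv` keeps the column order found in the file regardless of the order given in `usecols`, so the reindexing step is what enforces the desired order. The toy CSV below is made up, and its second line stands in for a non-data row (assumed here to be a units row) that `skiprows=[1]` drops.

```python
import io

import pandas as pd

csv_text = "a,b,c\nunits_a,units_b,units_c\n1,2,3\n4,5,6\n"
wanted = ["c", "a"]

# usecols selects columns but does not reorder them
raw = pd.read_csv(io.StringIO(csv_text), skiprows=[1], usecols=wanted)
print(list(raw.columns))

# Indexing with the same list puts the columns in the requested order
ordered = raw[wanted]
print(list(ordered.columns))
```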

In [7]:
len(emofsource_df)

6884

In [8]:
emofsource_df.head()

Unnamed: 0,sample_code,station,date,time_start,time,day_night,mesh_size,FWC_DS,species,life_history_stage,density
0,20131003DBDm2_200,DB,20131003,14:11,2013-10-03 14:11:00+00:00,Day,200,DS,ACARTIA,3;_CIII,6.395349
1,20130906DBiDm1_200,DB,20130906,16:17,2013-09-06 16:17:00+00:00,Day,200,FWC,ACARTIA,5;_CV,38.75969
2,20131003DBDm1_200,DB,20131003,,NaT,Day,200,FWC,ACARTIA,Female;_Adult,0.384615
3,20131003DBDm1_200,DB,20131003,,NaT,Day,200,FWC,ACARTIA,Male;_Adult,1.538462
4,20120614DBDm3_200,DB,20120614,17:25,2012-06-14 17:25:00+00:00,Day,200,DS,ACARTIA_CLAUSI,Female;_Adult,21.052632


## Read dwcEvent and dwcOccurrence csv's

In [9]:
dwcevent_df = pd.read_csv(
    data_pth / "aligned_csvs" / "DwC_event.csv",
    parse_dates=['eventDate'],
    usecols=['eventID', 'parentEventID', 'eventDate']
)

dwcoccurrence_df = pd.read_csv(
    data_pth / "aligned_csvs" / "DwC_occurrence.csv",
    usecols=['occurrenceID', 'eventID', 'verbatimIdentification']
)

occurrence_vs_life_history_stage = pd.read_csv(
    data_pth / "intermediate_DwC_occurrence_life_history_stage.csv"
)

Add `life_history_stage` to the occurrence dataframe, to enable the eMoF data processing steps used here.

In [10]:
dwcoccurrence_df = dwcoccurrence_df.merge(occurrence_vs_life_history_stage, on='occurrenceID')
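
This merge is only safe if `occurrenceID` is unique in both tables. One way to have pandas check that is the `validate` argument of `merge`, sketched here on made-up frames (the IDs and values are placeholders, not data from this project).

```python
import pandas as pd

occ = pd.DataFrame(
    {"occurrenceID": ["a", "b", "c"], "eventID": ["e1", "e1", "e2"]}
)
stage = pd.DataFrame(
    {
        "occurrenceID": ["a", "b", "c"],
        "life_history_stage": ["Female;_Adult", "5;_CV", "Male;_Adult"],
    }
)

# Passes silently because each occurrenceID appears once on each side
merged = occ.merge(stage, on="occurrenceID", how="inner", validate="one_to_one")

# A duplicated key on either side raises pandas.errors.MergeError
stage_dup = pd.concat([stage, stage.iloc[[0]]], ignore_index=True)
try:
    occ.merge(stage_dup, on="occurrenceID", validate="one_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True

print(len(merged), raised)
```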

## Create empty eMoF dataframe

In [11]:
# Won't use measurementID, per Abby's explanation on Slack
emof_cols_dtypes = np.dtype(
    [
        ('eventID', str),
        ('occurrenceID', str), 
        # --- temporary, for validation
        # ('life_history_stage', str),
        # ('verbatimIdentification', str),
        # -------------------------------
        ('measurementType', str),
        ('measurementTypeID', str), 
        ('measurementValue', str),
        ('measurementValueID', str),
        ('measurementAccuracy', str),
        ('measurementUnit', str),
        ('measurementUnitID', str),
        ('measurementDeterminedDate', str),
        ('measurementDeterminedBy', str), 
        ('measurementMethod', str),
        ('measurementRemarks', str)
    ]
)

In [12]:
emof_df = pd.DataFrame(np.empty(0, dtype=emof_cols_dtypes))

## MoF's associated with an event rather than an occurrence

These will have no `occurrenceID` entry.

### Associated with sample events

In [13]:
emofsource_samples_df = dwcevent_df.merge(
    emofsource_df[['sample_code', 'mesh_size', 'FWC_DS']], 
    how='inner',
    left_on='eventID',
    right_on='sample_code'
)

emofsource_samples_df = (
    emofsource_samples_df
    .drop_duplicates()
    .drop(columns='sample_code')
    .sort_values(by='eventID')
    .reset_index(drop=True)
)

In [14]:
len(emofsource_samples_df)

271

In [15]:
emofsource_samples_df.head()

Unnamed: 0,eventID,parentEventID,eventDate,mesh_size,FWC_DS
0,20120611UNDm1_200,UWPHHCZoop_CB975-20120611UND,2012-06-11 14:10:00+00:00,200,DS
1,20120611UNDm1_335,UWPHHCZoop_CB975-20120611UND,2012-06-11 15:50:00+00:00,335,DS
2,20120611UNDm2_200,UWPHHCZoop_CB975-20120611UND,2012-06-11 16:53:00+00:00,200,DS
3,20120611UNDm2_335,UWPHHCZoop_CB975-20120611UND,2012-06-11 15:50:00+00:00,335,DS
4,20120611UNDm3_200,UWPHHCZoop_CB975-20120611UND,2012-06-11 23:10:00+00:00,200,DS


#### multinet sampling

In [16]:
multinet_emof_df = emofsource_samples_df.copy()

In [17]:
multinet_emof_df['measurementType'] = "multinet"
multinet_emof_df['measurementTypeID'] = "http://vocab.nerc.ac.uk/collection/L05/current/68/"
multinet_emof_df['measurementValue'] = 'Hydro-bios 5-net'

In [18]:
len(multinet_emof_df)

271

Populate (append to) the `emof_df` table with the emof records.

In [19]:
emof_df = emof_df.append(
    multinet_emof_df[['eventID', 
                      'measurementType', 'measurementTypeID', 'measurementValue']], 
    ignore_index=True
)
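
Note for anyone re-running this with a newer pandas: `DataFrame.append`, used throughout this notebook, was deprecated in pandas 1.4 and removed in pandas 2.0. The drop-in replacement is `pd.concat([emof_df, new_records], ignore_index=True)`. A variation, sketched below on made-up records with a shortened column list, collects the per-measurement frames and concatenates them once, then uses `reindex` to restore the full eMoF column set.

```python
import pandas as pd

# Shortened, illustrative column list (the notebook's eMoF table has more)
emof_columns = ["eventID", "occurrenceID", "measurementType", "measurementValue"]

multinet_records = pd.DataFrame(
    {
        "eventID": ["ev1", "ev2"],
        "measurementType": ["multinet", "multinet"],
        "measurementValue": ["Hydro-bios 5-net", "Hydro-bios 5-net"],
    }
)
mesh_records = pd.DataFrame(
    {
        "eventID": ["ev1"],
        "measurementType": ["Sampling net mesh size"],
        "measurementValue": ["200"],
    }
)

# Concatenate once, then add the columns that no frame supplied (filled with NaN)
emof_example = pd.concat(
    [multinet_records, mesh_records], ignore_index=True
).reindex(columns=emof_columns)

print(emof_example)
```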

In [20]:
len(emof_df)

271

#### depth stratified vs full water column sampling

In [21]:
fwcds_emof_df = emofsource_samples_df.copy()

In [22]:
fwcds_emof_df['measurementType'] = "Sampling method"
fwcds_emof_df['measurementTypeID'] = "http://vocab.nerc.ac.uk/collection/Q01/current/Q0100003/"
fwcds_emof_df['measurementValue'] = fwcds_emof_df['FWC_DS'].apply(lambda fwc_ds: net_tow[fwc_ds])
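
The dictionary lookup above can also be written with `Series.map`, sketched here with a hypothetical stand-in for `net_tow` (the real mapping lives in `common_mappings.json`). The two differ for unknown codes: the `apply` / `lambda` form raises a `KeyError`, while `map` quietly yields `NaN`, so the stricter form may be preferable when an unmapped code should stop the notebook.

```python
import pandas as pd

# Hypothetical mapping values; only the FWC / DS codes appear in the source data
net_tow_example = {"FWC": "full water column", "DS": "depth stratified"}
fwc_ds = pd.Series(["DS", "FWC", "DS"])

via_apply = fwc_ds.apply(lambda code: net_tow_example[code])
via_map = fwc_ds.map(net_tow_example)
print(via_apply.equals(via_map))

# An unknown code becomes NaN with map instead of raising
unmapped = pd.Series(["DS", "XX"]).map(net_tow_example)
print(unmapped.isna().tolist())
```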

In [23]:
len(fwcds_emof_df)

271

Populate (append to) the `emof_df` table with the emof records.

In [24]:
emof_df = emof_df.append(
    fwcds_emof_df[['eventID', 
                   'measurementType', 'measurementTypeID', 'measurementValue']], 
    ignore_index=True
)

In [25]:
len(emof_df)

542

In [26]:
emof_df.head(10)

Unnamed: 0,eventID,occurrenceID,measurementType,measurementTypeID,measurementValue,measurementValueID,measurementAccuracy,measurementUnit,measurementUnitID,measurementDeterminedDate,measurementDeterminedBy,measurementMethod,measurementRemarks
0,20120611UNDm1_200,,multinet,http://vocab.nerc.ac.uk/collection/L05/current...,Hydro-bios 5-net,,,,,,,,
1,20120611UNDm1_335,,multinet,http://vocab.nerc.ac.uk/collection/L05/current...,Hydro-bios 5-net,,,,,,,,
2,20120611UNDm2_200,,multinet,http://vocab.nerc.ac.uk/collection/L05/current...,Hydro-bios 5-net,,,,,,,,
3,20120611UNDm2_335,,multinet,http://vocab.nerc.ac.uk/collection/L05/current...,Hydro-bios 5-net,,,,,,,,
4,20120611UNDm3_200,,multinet,http://vocab.nerc.ac.uk/collection/L05/current...,Hydro-bios 5-net,,,,,,,,
5,20120611UNDm3_335,,multinet,http://vocab.nerc.ac.uk/collection/L05/current...,Hydro-bios 5-net,,,,,,,,
6,20120611UNDm4_200,,multinet,http://vocab.nerc.ac.uk/collection/L05/current...,Hydro-bios 5-net,,,,,,,,
7,20120611UNNm1_200,,multinet,http://vocab.nerc.ac.uk/collection/L05/current...,Hydro-bios 5-net,,,,,,,,
8,20120611UNNm1_335,,multinet,http://vocab.nerc.ac.uk/collection/L05/current...,Hydro-bios 5-net,,,,,,,,
9,20120611UNNm2_200,,multinet,http://vocab.nerc.ac.uk/collection/L05/current...,Hydro-bios 5-net,,,,,,,,


#### mesh size

In [27]:
mesh_emof_df = emofsource_samples_df.copy()

In [28]:
mesh_emof_df['measurementType'] = "Sampling net mesh size"
mesh_emof_df['measurementTypeID'] = "http://vocab.nerc.ac.uk/collection/Q01/current/Q0100015/"
mesh_emof_df['measurementUnit'] = "Micrometres (microns)"
mesh_emof_df['measurementUnitID'] = "http://vocab.nerc.ac.uk/collection/P06/current/UMIC/"

mesh_emof_df.rename(columns={'mesh_size':'measurementValue'}, inplace=True)

In [29]:
len(mesh_emof_df)

271

Populate (append to) the `emof_df` table with the emof records.

In [30]:
emof_df = emof_df.append(
    mesh_emof_df[['eventID', 
                  'measurementType', 'measurementTypeID', 'measurementValue', 
                  'measurementUnit', 'measurementUnitID']], 
    ignore_index=True
)

In [31]:
len(emof_df)

813

In [32]:
emof_df.tail(10)

Unnamed: 0,eventID,occurrenceID,measurementType,measurementTypeID,measurementValue,measurementValueID,measurementAccuracy,measurementUnit,measurementUnitID,measurementDeterminedDate,measurementDeterminedBy,measurementMethod,measurementRemarks
803,20131003DBDm3_200,,Sampling net mesh size,http://vocab.nerc.ac.uk/collection/Q01/current...,200,,,Micrometres (microns),http://vocab.nerc.ac.uk/collection/P06/current...,,,,
804,20131003DBDm3_335,,Sampling net mesh size,http://vocab.nerc.ac.uk/collection/Q01/current...,335,,,Micrometres (microns),http://vocab.nerc.ac.uk/collection/P06/current...,,,,
805,20131003DBDm4_200,,Sampling net mesh size,http://vocab.nerc.ac.uk/collection/Q01/current...,200,,,Micrometres (microns),http://vocab.nerc.ac.uk/collection/P06/current...,,,,
806,20131003DBDm4_335,,Sampling net mesh size,http://vocab.nerc.ac.uk/collection/Q01/current...,335,,,Micrometres (microns),http://vocab.nerc.ac.uk/collection/P06/current...,,,,
807,20131003DBDm5_200,,Sampling net mesh size,http://vocab.nerc.ac.uk/collection/Q01/current...,200,,,Micrometres (microns),http://vocab.nerc.ac.uk/collection/P06/current...,,,,
808,20131003DBDm5_335,,Sampling net mesh size,http://vocab.nerc.ac.uk/collection/Q01/current...,335,,,Micrometres (microns),http://vocab.nerc.ac.uk/collection/P06/current...,,,,
809,20131003DBNm1_335,,Sampling net mesh size,http://vocab.nerc.ac.uk/collection/Q01/current...,335,,,Micrometres (microns),http://vocab.nerc.ac.uk/collection/P06/current...,,,,
810,20131003DBNm2_335,,Sampling net mesh size,http://vocab.nerc.ac.uk/collection/Q01/current...,335,,,Micrometres (microns),http://vocab.nerc.ac.uk/collection/P06/current...,,,,
811,20131003DBNm3_335,,Sampling net mesh size,http://vocab.nerc.ac.uk/collection/Q01/current...,335,,,Micrometres (microns),http://vocab.nerc.ac.uk/collection/P06/current...,,,,
812,20131003DBNm4_335,,Sampling net mesh size,http://vocab.nerc.ac.uk/collection/Q01/current...,335,,,Micrometres (microns),http://vocab.nerc.ac.uk/collection/P06/current...,,,,


## MoF's associated with exactly one occurrence

These will have an `occurrenceID` entry.

To merge the occurrence table with `emofsource_df`, the original `life_history_stage` strings must be preserved in the occurrence csv table. One option is to add an `occurrenceRemarks` column; another is to keep a `life_history_stage` column that is dropped at a later stage.

In [33]:
emofsource_samplesoccur_df = (
    dwcevent_df
    .merge(
        emofsource_df[['sample_code', 'species', 'life_history_stage', 'density']], 
        how='inner',
        left_on='eventID',
        right_on='sample_code'
    )
    .merge(
        dwcoccurrence_df,
        how='inner',
        left_on=['eventID', 'species', 'life_history_stage'],
        right_on=['eventID', 'verbatimIdentification', 'life_history_stage']
    )
)

emofsource_samplesoccur_df = (
    emofsource_samplesoccur_df
    .drop_duplicates()
    .drop(columns=['sample_code', 'parentEventID', 'species'])
    .sort_values(by='eventID')
    .reset_index(drop=True)
)

In [34]:
len(emofsource_samplesoccur_df)

6732

**TODO/NOTE:** This is 30 greater than the 6702 occurrences found in `dwcoccurrence_df`. Will need to investigate.
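
One possible starting point for that investigation, offered as a guess rather than a diagnosis: an inner merge produces extra rows whenever a key combination is repeated on either side. The sketch below uses made-up rows to show how repeated `(sample_code, species, life_history_stage)` combinations could be listed and counted; the same check would apply to the matching key columns in `dwcoccurrence_df`.

```python
import pandas as pd

# Toy rows, not real data: the first two share the same merge key
source = pd.DataFrame(
    {
        "sample_code": ["s1", "s1", "s1"],
        "species": ["ACARTIA", "ACARTIA", "METRIDIA_PACIFICA"],
        "life_history_stage": ["Female;_Adult", "Female;_Adult", "5;_CV"],
        "density": [1.5, 2.5, 3.0],
    }
)
keys = ["sample_code", "species", "life_history_stage"]

# Every row whose key combination occurs more than once
dup_rows = source[source.duplicated(subset=keys, keep=False)]

# Number of rows beyond one per key combination
rows_per_key = source.groupby(keys).size()
extra_rows = int((rows_per_key - 1).sum())

print(dup_rows)
print(extra_rows)
```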

In [35]:
emofsource_samplesoccur_df.head()

Unnamed: 0,eventID,eventDate,life_history_stage,density,occurrenceID,verbatimIdentification
0,20120611UNDm1_200,2012-06-11 14:10:00+00:00,Female;_Adult,47.567568,bba937ac-f6b5-4b35-a513-7852052b537c,ACARTIA_CLAUSI
1,20120611UNDm1_200,2012-06-11 14:10:00+00:00,1;_CI,17.83063,40f95f0a-f09c-4818-b4ac-902248eb5e15,METRIDIA_PACIFICA
2,20120611UNDm1_200,2012-06-11 14:10:00+00:00,3;_CIII,79.279279,8c567fcc-1b40-4bed-b0ce-a194f28de203,METRIDIA_PACIFICA
3,20120611UNDm1_200,2012-06-11 14:10:00+00:00,4;_CIV,71.66847,98e2624c-958b-4883-991c-d4fdd9dda392,METRIDIA_PACIFICA
4,20120611UNDm1_200,2012-06-11 14:10:00+00:00,5;_CV,53.953154,ce0d446b-c723-4ad1-b967-1dbbcfccc886,METRIDIA_PACIFICA


### density / abundance

In [36]:
abundance_emof_df = emofsource_samplesoccur_df.copy()

In [37]:
abundance_emof_df['measurementType'] = "Abundance of biological entity specified elsewhere per unit volume of the water body"
abundance_emof_df['measurementTypeID'] = "http://vocab.nerc.ac.uk/collection/P01/current/SDBIOL01/"
abundance_emof_df['measurementUnit'] = "Number per cubic metre"
abundance_emof_df['measurementUnitID'] = "http://vocab.nerc.ac.uk/collection/P06/current/UPMM/"

abundance_emof_df.rename(columns={'density':'measurementValue'}, inplace=True)
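
The rounding idea from the TODO list could look like the sketch below, using the first three density values from the source table preview. Three decimal places is an arbitrary choice for illustration. Because `measurementValue` is declared as a string column, the last step also shows one way to format the rounded values as fixed-width strings.

```python
import pandas as pd

density = pd.Series([6.395349, 38.75969, 0.384615])

# Round to 3 decimal places (an illustrative choice, not a project decision)
rounded = density.round(3)

# Optional: fixed number of decimals as strings, keeping trailing zeros
as_text = rounded.map("{:.3f}".format)
print(as_text.tolist())
```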

Populate (append to) the `emof_df` table with the emof records.

In [38]:
emof_df = emof_df.append(
    abundance_emof_df[['eventID', 'occurrenceID',
                       # 'life_history_stage', 'verbatimIdentification', # --- temporary, for validation
                       'measurementType', 'measurementTypeID', 'measurementValue', 
                       'measurementUnit', 'measurementUnitID']], 
    ignore_index=True
)

In [39]:
len(emof_df)

7545

In [40]:
emof_df.tail(10)

Unnamed: 0,eventID,occurrenceID,measurementType,measurementTypeID,measurementValue,measurementValueID,measurementAccuracy,measurementUnit,measurementUnitID,measurementDeterminedDate,measurementDeterminedBy,measurementMethod,measurementRemarks
7535,20131003DBNm4_335,0fe107b7-f8fc-4731-af71-b2026821c6a2,Abundance of biological entity specified elsew...,http://vocab.nerc.ac.uk/collection/P01/current...,0.190476,,,Number per cubic metre,http://vocab.nerc.ac.uk/collection/P06/current...,,,,
7536,20131003DBNm4_335,7e702b1a-19e4-42f2-852b-a7a1209e4b49,Abundance of biological entity specified elsew...,http://vocab.nerc.ac.uk/collection/P01/current...,0.190476,,,Number per cubic metre,http://vocab.nerc.ac.uk/collection/P06/current...,,,,
7537,20131003DBNm4_335,178ac23d-241b-4e12-8a29-39edd9f77eda,Abundance of biological entity specified elsew...,http://vocab.nerc.ac.uk/collection/P01/current...,1.904762,,,Number per cubic metre,http://vocab.nerc.ac.uk/collection/P06/current...,,,,
7538,20131003DBNm4_335,489a2657-c674-4832-82cf-e163f729973c,Abundance of biological entity specified elsew...,http://vocab.nerc.ac.uk/collection/P01/current...,5.904762,,,Number per cubic metre,http://vocab.nerc.ac.uk/collection/P06/current...,,,,
7539,20131003DBNm4_335,38342ad4-61c2-444b-99a1-a9c967f076fb,Abundance of biological entity specified elsew...,http://vocab.nerc.ac.uk/collection/P01/current...,2.095238,,,Number per cubic metre,http://vocab.nerc.ac.uk/collection/P06/current...,,,,
7540,20131003DBNm4_335,d9de87b9-a779-48f9-83a5-ebb26b30a7ed,Abundance of biological entity specified elsew...,http://vocab.nerc.ac.uk/collection/P01/current...,0.380952,,,Number per cubic metre,http://vocab.nerc.ac.uk/collection/P06/current...,,,,
7541,20131003DBNm4_335,b7f84115-55a3-4d86-aaf0-2ec2b8efcb9d,Abundance of biological entity specified elsew...,http://vocab.nerc.ac.uk/collection/P01/current...,4.190476,,,Number per cubic metre,http://vocab.nerc.ac.uk/collection/P06/current...,,,,
7542,20131003DBNm4_335,b8261dab-9919-4bba-8d65-d99b73fc2a44,Abundance of biological entity specified elsew...,http://vocab.nerc.ac.uk/collection/P01/current...,0.190476,,,Number per cubic metre,http://vocab.nerc.ac.uk/collection/P06/current...,,,,
7543,20131003DBNm4_335,f311e43b-74cb-42bc-be8c-3aa57588e973,Abundance of biological entity specified elsew...,http://vocab.nerc.ac.uk/collection/P01/current...,0.190476,,,Number per cubic metre,http://vocab.nerc.ac.uk/collection/P06/current...,,,,
7544,20131003DBNm4_335,1650ee8c-3d25-4592-bdba-72da2a58f43e,Abundance of biological entity specified elsew...,http://vocab.nerc.ac.uk/collection/P01/current...,0.190476,,,Number per cubic metre,http://vocab.nerc.ac.uk/collection/P06/current...,,,,


## Export `emof_df` to csv

In [41]:
emof_df.measurementType.value_counts()

Abundance of biological entity specified elsewhere per unit volume of the water body    6732
multinet                                                                                 271
Sampling method                                                                          271
Sampling net mesh size                                                                   271
Name: measurementType, dtype: int64

In [42]:
emof_df.to_csv(data_pth / 'aligned_csvs' / 'DwC_emof.csv', index=False)

## Package versions

In [43]:
print(
    f"{datetime.utcnow()} +00:00\n"
    f"pandas: {pd.__version__}"
)

2022-01-13 02:51:59.258003 +00:00
pandas: 1.3.5
