# DwC Occurrences. Keister Zooplankton Hood Canal 2012-13 data

University of Washington Pelagic Hypoxia Hood Canal project, Zooplankton dataset.

2022-1-12

- Refer to https://obis.org/manual/darwincore/#occurrence and https://obis.org/manual/darwincore/#taxonomy for guidance
- I won't use `acceptedname` and `acceptedNameID`; they are not needed or even recommended


**TODO**
- Decide how to handle "absent" `occurrenceStatus`. Will need to decide when to assign "absent" occurrence, and how. From Li et al 2019, it's clear that for *Euphausia pacifica*, if there is no occurrence record for a "sample" (a specific net tow), "absence" is implied and can/should be entered. But how about others? Will have to consult with Julie and the Standardizing Bio data group
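
One possible way to generate implied absences, once the rule is settled, is sketched below. It is only an illustration, not a decision: the toy `occurrence` table, the `target` taxon variable and the rule "one absent record per tow that lacks the taxon" are all assumptions made for this example.

```python
import uuid

import pandas as pd

# Toy occurrence table (illustrative values, not taken from the dataset)
occurrence = pd.DataFrame({
    "eventID": ["tow1", "tow1", "tow2", "tow3"],
    "scientificName": ["Euphausia pacifica", "Acartia", "Acartia", "Euphausia pacifica"],
    "occurrenceStatus": "present",
})
target = "Euphausia pacifica"  # taxon for which absence is assumed to be implied

# Tows (events) that have no record for the target taxon
all_events = set(occurrence["eventID"])
events_with_target = set(
    occurrence.loc[occurrence["scientificName"] == target, "eventID"]
)
missing_events = sorted(all_events - events_with_target)

# One "absent" record per tow lacking the target taxon
absent = pd.DataFrame({
    "eventID": missing_events,
    "occurrenceID": [str(uuid.uuid4()) for _ in missing_events],
    "scientificName": target,
    "occurrenceStatus": "absent",
})

combined = pd.concat([occurrence, absent], ignore_index=True)
print(combined)
```

In practice the full list of tows would more safely come from the event table than from the occurrence records, since a tow with no records at all would otherwise be missed.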

**Questions, issues, comments about life history and sex parsing**
- `life_history_stage` is exported in a separate, intermediate csv table together with `occurrenceID` in order to enable the join/merge needed for generating abundance measurements in emof.
- `life_stage_mapping` dictionary is now stored in an external JSON file
- Now that I've thoroughly compared `lhs_0` and `lhs_1`, it's clear that `lhs_1` can be thrown out. `lhs_0` is all that's needed
- `Female` and `Male` are only found for `Adult`, and only found in `lhs_0`
- Are `CI` - `CV` (and matching 1-5) the same as `Copepodite` 1 - 5? If they were, their use with *Euphausia pacifica* (see the next cell) would be wrong.
- Life stages `Calyptopis` 1-3, `Furcilia` 1-10 and `Zoea` 1-5 are not found in https://vocab.nerc.ac.uk/collection/S11/current/. That vocab doesn't have any "phases" for `Calyptopis`, `Furcilia` and `Zoea`
- These are also not found: `Bract`, `Gonophore`, `Nectophore`, `Polyp`
- The meaning of `Furcilia_1_0_legs` (`F1_0` in `lhs_0`) is not clear

In [1]:
from datetime import datetime
import json
from pathlib import Path
import uuid

import numpy as np
import pandas as pd

In [2]:
data_pth = Path(".")

## Process JSON file containing common mappings and strings

In [3]:
with open(data_pth / 'common_mappings.json') as f:
    common_mappings = json.load(f)

In [4]:
DatasetCode = common_mappings['datasetcode']
life_stage_mapping = common_mappings['life_stage_mapping']

## Read the data

## Pre-process data from csv for Occurrence table

### Read the csv file

In [5]:
sourcecsvdata_pth = data_pth / "sourcedata" / "bcodmo_dataset_682074_data.csv"

In [6]:
usecols = ['sample_code', 'species', 'life_history_stage']

occursource_df = pd.read_csv(
    sourcecsvdata_pth, 
    skiprows=[1], 
    usecols=usecols
)[usecols]

In [7]:
len(occursource_df)

6884

In [8]:
occursource_df.head()

Unnamed: 0,sample_code,species,life_history_stage
0,20131003DBDm2_200,ACARTIA,3;_CIII
1,20130906DBiDm1_200,ACARTIA,5;_CV
2,20131003DBDm1_200,ACARTIA,Female;_Adult
3,20131003DBDm1_200,ACARTIA,Male;_Adult
4,20120614DBDm3_200,ACARTIA_CLAUSI,Female;_Adult


## Merge resolved taxonomy from taxonomy csv

In [9]:
taxonomy_df = pd.read_csv(
    data_pth / "intermediate_DwC_taxonomy.csv"
)

In [10]:
occursource_df = occursource_df.merge(
    taxonomy_df, 
    how='inner', 
    left_on='species',
    right_on='verbatimIdentification'
)

In [11]:
len(occursource_df)

6702
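
The inner merge reduces the row count from 6884 to 6702, so some `species` values have no match in the taxonomy table. A minimal sketch of how the dropped rows could be listed, using a left merge with `indicator=True`, is shown below. The two small dataframes are toy stand-ins for `occursource_df` and `taxonomy_df`, and `UNKNOWN_TAXON` is a made-up value.

```python
import pandas as pd

# Toy stand-ins for occursource_df and taxonomy_df (illustrative values only)
source = pd.DataFrame({
    "sample_code": ["s1", "s1", "s2"],
    "species": ["ACARTIA", "UNKNOWN_TAXON", "ACARTIA_CLAUSI"],
})
taxonomy = pd.DataFrame({
    "verbatimIdentification": ["ACARTIA", "ACARTIA_CLAUSI"],
    "scientificName": ["Acartia", "Acartia clausi"],
})

# A left merge keeps every source row; the _merge column flags its match status
checked = source.merge(
    taxonomy,
    how="left",
    left_on="species",
    right_on="verbatimIdentification",
    indicator=True,
)

# Rows that an inner merge would have dropped, summarized per verbatim name
unmatched = checked.loc[checked["_merge"] == "left_only", "species"]
unmatched_counts = unmatched.value_counts()
print(unmatched_counts)
```

Listing the unmatched names this way makes it easier to confirm that every dropped record was excluded on purpose.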

In [12]:
occursource_df.head()

Unnamed: 0,sample_code,species,life_history_stage,scientificName,scientificNameID,taxonRank,kingdom,phylum,class,order,family,genus,scientificNameAuthorship,verbatimIdentification
0,20131003DBDm2_200,ACARTIA,3;_CIII,Acartia,urn:lsid:marinespecies.org:taxname:104108,Genus,Animalia,Arthropoda,Hexanauplia,Calanoida,Acartiidae,Acartia,"Dana, 1846",ACARTIA
1,20130906DBiDm1_200,ACARTIA,5;_CV,Acartia,urn:lsid:marinespecies.org:taxname:104108,Genus,Animalia,Arthropoda,Hexanauplia,Calanoida,Acartiidae,Acartia,"Dana, 1846",ACARTIA
2,20131003DBDm1_200,ACARTIA,Female;_Adult,Acartia,urn:lsid:marinespecies.org:taxname:104108,Genus,Animalia,Arthropoda,Hexanauplia,Calanoida,Acartiidae,Acartia,"Dana, 1846",ACARTIA
3,20131003DBDm1_200,ACARTIA,Male;_Adult,Acartia,urn:lsid:marinespecies.org:taxname:104108,Genus,Animalia,Arthropoda,Hexanauplia,Calanoida,Acartiidae,Acartia,"Dana, 1846",ACARTIA
4,20130905DUNm1_200,ACARTIA,5;_CV,Acartia,urn:lsid:marinespecies.org:taxname:104108,Genus,Animalia,Arthropoda,Hexanauplia,Calanoida,Acartiidae,Acartia,"Dana, 1846",ACARTIA


## Create and populate `occurrence_df` dataframe

### Add `occurrenceID`, `basisOfRecord`, `occurrenceStatus`

In [13]:
occurrence_df = occursource_df.copy()

In [14]:
occurrence_df.insert(1, 'occurrenceID', [uuid.uuid4() for i in range(len(occurrence_df))])
occurrence_df.insert(2, 'basisOfRecord', 'MaterialSample')
occurrence_df.insert(3, 'occurrenceStatus', 'present')

occurrence_df.rename(columns={'sample_code':'eventID'}, inplace=True)

In [15]:
len(occurrence_df)

6702

In [16]:
occurrence_df.head()

Unnamed: 0,eventID,occurrenceID,basisOfRecord,occurrenceStatus,species,life_history_stage,scientificName,scientificNameID,taxonRank,kingdom,phylum,class,order,family,genus,scientificNameAuthorship,verbatimIdentification
0,20131003DBDm2_200,5cf65d91-e5cb-449a-ae69-46356beb3257,MaterialSample,present,ACARTIA,3;_CIII,Acartia,urn:lsid:marinespecies.org:taxname:104108,Genus,Animalia,Arthropoda,Hexanauplia,Calanoida,Acartiidae,Acartia,"Dana, 1846",ACARTIA
1,20130906DBiDm1_200,cc4c1e4a-05cf-48b1-bce2-4fd56842d015,MaterialSample,present,ACARTIA,5;_CV,Acartia,urn:lsid:marinespecies.org:taxname:104108,Genus,Animalia,Arthropoda,Hexanauplia,Calanoida,Acartiidae,Acartia,"Dana, 1846",ACARTIA
2,20131003DBDm1_200,d843e311-da84-4134-9c9d-2f69a82a71c8,MaterialSample,present,ACARTIA,Female;_Adult,Acartia,urn:lsid:marinespecies.org:taxname:104108,Genus,Animalia,Arthropoda,Hexanauplia,Calanoida,Acartiidae,Acartia,"Dana, 1846",ACARTIA
3,20131003DBDm1_200,319b1cf5-f79e-4216-869e-197b70952ec3,MaterialSample,present,ACARTIA,Male;_Adult,Acartia,urn:lsid:marinespecies.org:taxname:104108,Genus,Animalia,Arthropoda,Hexanauplia,Calanoida,Acartiidae,Acartia,"Dana, 1846",ACARTIA
4,20130905DUNm1_200,416f34f3-6f0e-4f44-ab47-2232c52eec9d,MaterialSample,present,ACARTIA,5;_CV,Acartia,urn:lsid:marinespecies.org:taxname:104108,Genus,Animalia,Arthropoda,Hexanauplia,Calanoida,Acartiidae,Acartia,"Dana, 1846",ACARTIA


### Parse `life_history_stage` and map into `sex` and `lifeStage`

In [17]:
occurrence_df[['lhs_0', 'lhs_1']] = pd.DataFrame(
    occurrence_df.life_history_stage.str.split(';_').to_list(), 
    index=occurrence_df.index
)
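
As an alternative sketch (not the approach used in this notebook), the same two columns can be produced with `str.split(..., expand=True)`. The `reindex` call is an added safeguard, so that a second column exists even when no value contains the `;_` separator. The sample values are illustrative.

```python
import pandas as pd

# Illustrative values: two with a ";_" separator, one without
stages = pd.Series(["3;_CIII", "Female;_Adult", "Medusa"], name="life_history_stage")

# Split at most once; expand=True returns a DataFrame with integer column labels
parts = stages.str.split(";_", n=1, expand=True).reindex(columns=[0, 1])
parts.columns = ["lhs_0", "lhs_1"]

print(parts)
```

Values without a separator end up with a missing `lhs_1`, matching the `NaN` entries seen in the value counts.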

In [18]:
occurrence_df.head(10)

Unnamed: 0,eventID,occurrenceID,basisOfRecord,occurrenceStatus,species,life_history_stage,scientificName,scientificNameID,taxonRank,kingdom,phylum,class,order,family,genus,scientificNameAuthorship,verbatimIdentification,lhs_0,lhs_1
0,20131003DBDm2_200,5cf65d91-e5cb-449a-ae69-46356beb3257,MaterialSample,present,ACARTIA,3;_CIII,Acartia,urn:lsid:marinespecies.org:taxname:104108,Genus,Animalia,Arthropoda,Hexanauplia,Calanoida,Acartiidae,Acartia,"Dana, 1846",ACARTIA,3,CIII
1,20130906DBiDm1_200,cc4c1e4a-05cf-48b1-bce2-4fd56842d015,MaterialSample,present,ACARTIA,5;_CV,Acartia,urn:lsid:marinespecies.org:taxname:104108,Genus,Animalia,Arthropoda,Hexanauplia,Calanoida,Acartiidae,Acartia,"Dana, 1846",ACARTIA,5,CV
2,20131003DBDm1_200,d843e311-da84-4134-9c9d-2f69a82a71c8,MaterialSample,present,ACARTIA,Female;_Adult,Acartia,urn:lsid:marinespecies.org:taxname:104108,Genus,Animalia,Arthropoda,Hexanauplia,Calanoida,Acartiidae,Acartia,"Dana, 1846",ACARTIA,Female,Adult
3,20131003DBDm1_200,319b1cf5-f79e-4216-869e-197b70952ec3,MaterialSample,present,ACARTIA,Male;_Adult,Acartia,urn:lsid:marinespecies.org:taxname:104108,Genus,Animalia,Arthropoda,Hexanauplia,Calanoida,Acartiidae,Acartia,"Dana, 1846",ACARTIA,Male,Adult
4,20130905DUNm1_200,416f34f3-6f0e-4f44-ab47-2232c52eec9d,MaterialSample,present,ACARTIA,5;_CV,Acartia,urn:lsid:marinespecies.org:taxname:104108,Genus,Animalia,Arthropoda,Hexanauplia,Calanoida,Acartiidae,Acartia,"Dana, 1846",ACARTIA,5,CV
5,20120614DBDm3_200,9ab76a26-5e7e-4808-8302-d5ea5ddb6103,MaterialSample,present,ACARTIA_CLAUSI,Female;_Adult,Acartia clausi,urn:lsid:marinespecies.org:taxname:104251,Species,Animalia,Arthropoda,Hexanauplia,Calanoida,Acartiidae,Acartia,"Giesbrecht, 1889",ACARTIA_CLAUSI,Female,Adult
6,20120614DBDm3_200,11b133f4-05ed-4d7e-b7b9-7cc2a137ea52,MaterialSample,present,ACARTIA_CLAUSI,Male;_Adult,Acartia clausi,urn:lsid:marinespecies.org:taxname:104251,Species,Animalia,Arthropoda,Hexanauplia,Calanoida,Acartiidae,Acartia,"Giesbrecht, 1889",ACARTIA_CLAUSI,Male,Adult
7,20120614DBDm4_200,7fa42696-e3f8-4220-9909-85939f95a127,MaterialSample,present,ACARTIA_CLAUSI,Female;_Adult,Acartia clausi,urn:lsid:marinespecies.org:taxname:104251,Species,Animalia,Arthropoda,Hexanauplia,Calanoida,Acartiidae,Acartia,"Giesbrecht, 1889",ACARTIA_CLAUSI,Female,Adult
8,20120712DBDm1_200,aa8f9933-1c9e-411a-a63b-1c11d829aab0,MaterialSample,present,ACARTIA_CLAUSI,5;_CV,Acartia clausi,urn:lsid:marinespecies.org:taxname:104251,Species,Animalia,Arthropoda,Hexanauplia,Calanoida,Acartiidae,Acartia,"Giesbrecht, 1889",ACARTIA_CLAUSI,5,CV
9,20121004DBDm4_200,42fe6b37-9076-4ed8-8385-24cfaa6c4181,MaterialSample,present,ACARTIA_CLAUSI,Female;_Adult,Acartia clausi,urn:lsid:marinespecies.org:taxname:104251,Species,Animalia,Arthropoda,Hexanauplia,Calanoida,Acartiidae,Acartia,"Giesbrecht, 1889",ACARTIA_CLAUSI,Female,Adult


In [19]:
occurrence_df[['lhs_0', 'lhs_1']].value_counts(dropna=False).sort_index(ascending=True)

lhs_0             lhs_1            
1                 CI                    129
2                 CII                   185
3                 CIII                  278
4                 CIV                   491
5                 CV                    611
Adult             NaN                   193
Bract             NaN                     6
Cal1              Calyptopis_1           40
Cal2              Calyptopis_2           25
Cal3              Calyptopis_3           36
Copepodite        NaN                   265
Cyphonaut         NaN                    81
Cyprid_larva      NaN                    36
Egg               NaN                    73
Eudoxid           NaN                     5
F1                Furcilia_1             76
F10               Furcilia_10             2
F1_0              Furcilia_1_0_legs       1
F2                Furcilia_2             66
F3                Furcilia_3             57
F4                Furcilia_4             30
F5                Furcilia_5            

In [20]:
# occurrence_df[occurrence_df.species == 'EUPHAUSIA_PACIFICA'][['lhs_0', 'lhs_1']].value_counts(dropna=False).sort_index(ascending=True)

In [21]:
occurrence_df.insert(
    4, 'sex', 
    occurrence_df['lhs_0'].apply(
        lambda s: s.lower() if s in ['Female', 'Male'] else 'indeterminate'
    )
)
occurrence_df.insert(
    5, 'lifeStage', 
    occurrence_df['lhs_0'].apply(
        lambda s: life_stage_mapping[s]
    )
)

occurrence_df = (
    occurrence_df
    .sort_values(by=['eventID', 'scientificName', 'lifeStage', 'sex'])
    .reset_index(drop=True)
)
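
Because `life_stage_mapping[s]` raises a `KeyError` for any `lhs_0` value missing from the JSON dictionary, it can help to check for unmapped values first. The sketch below assumes a small, hypothetical subset of the mapping; the real dictionary is loaded from `common_mappings.json` and may differ.

```python
import pandas as pd

# Hypothetical subset of the mapping (the real one lives in common_mappings.json)
life_stage_mapping = {
    "Adult": "adult",
    "Female": "adult",
    "Male": "adult",
    "5": "copepodites C5",
}
lhs_0 = pd.Series(["Female", "5", "F1_0", "Adult"])

# Values present in the data but absent from the mapping keys
unmapped = sorted(set(lhs_0.dropna()) - set(life_stage_mapping))
print("Unmapped lhs_0 values:", unmapped)

# Series.map leaves NaN for unmapped values instead of raising KeyError
life_stage = lhs_0.map(life_stage_mapping)
print(life_stage)
```

Whether an unmapped value should stop the run or simply be left empty is a design choice; the dictionary lookup used above fails fast, while `map` fails quietly.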

In [22]:
occurrence_df.head(5)

Unnamed: 0,eventID,occurrenceID,basisOfRecord,occurrenceStatus,sex,lifeStage,species,life_history_stage,scientificName,scientificNameID,...,kingdom,phylum,class,order,family,genus,scientificNameAuthorship,verbatimIdentification,lhs_0,lhs_1
0,20120611UNDm1_200,bba937ac-f6b5-4b35-a513-7852052b537c,MaterialSample,present,female,adult,ACARTIA_CLAUSI,Female;_Adult,Acartia clausi,urn:lsid:marinespecies.org:taxname:104251,...,Animalia,Arthropoda,Hexanauplia,Calanoida,Acartiidae,Acartia,"Giesbrecht, 1889",ACARTIA_CLAUSI,Female,Adult
1,20120611UNDm1_200,07133f93-4d2d-47aa-a011-f1c8891ae684,MaterialSample,present,male,adult,ACARTIA_CLAUSI,Male;_Adult,Acartia clausi,urn:lsid:marinespecies.org:taxname:104251,...,Animalia,Arthropoda,Hexanauplia,Calanoida,Acartiidae,Acartia,"Giesbrecht, 1889",ACARTIA_CLAUSI,Male,Adult
2,20120611UNDm1_200,a6b4d661-7b75-4f13-8dbb-25862dbd14e4,MaterialSample,present,indeterminate,copepodites C5,ACARTIA_CLAUSI,5;_CV,Acartia clausi,urn:lsid:marinespecies.org:taxname:104251,...,Animalia,Arthropoda,Hexanauplia,Calanoida,Acartiidae,Acartia,"Giesbrecht, 1889",ACARTIA_CLAUSI,5,CV
3,20120611UNDm1_200,26c1ac03-7cfc-428b-afb1-a4d669219c40,MaterialSample,present,indeterminate,medusae,AGLANTHA,Medusa,Aglantha,urn:lsid:marinespecies.org:taxname:117212,...,Animalia,Cnidaria,Hydrozoa,Trachymedusae,Rhopalonematidae,Aglantha,"Haeckel, 1879",AGLANTHA,Medusa,
4,20120611UNDm1_200,f0eba26e-9ffc-4d37-8908-0f3b3c8c8737,MaterialSample,present,indeterminate,veliger,BIVALVIA,Veliger,Bivalvia,urn:lsid:marinespecies.org:taxname:105,...,Animalia,Mollusca,Bivalvia,,,,"Linnaeus, 1758",BIVALVIA,Veliger,


## Export intermediate table for `life_history_stage` matching to csv

In [23]:
occurrence_df[['occurrenceID', 'life_history_stage']].to_csv(
    data_pth / 'intermediate_DwC_occurrence_life_history_stage.csv', index=False
)

## Export `occurrence_df` to csv

### Cleanup

In [24]:
occurrence_df.drop(columns=['species', 'life_history_stage', 'lhs_0', 'lhs_1'], inplace=True)

In [25]:
occurrence_df.to_csv(data_pth / 'aligned_csvs' / 'DwC_occurrence.csv', index=False)

## Package versions

In [26]:
print(
    f"{datetime.utcnow()} +00:00\n"
    f"pandas: {pd.__version__}"
)

2022-01-13 02:50:07.198498 +00:00
pandas: 1.3.5
