# DwC Occurrences. Keister Zooplankton Hood Canal 2012-13 data

University of Washington Pelagic Hypoxia Hood Canal project, Zooplankton dataset.

2023-3-28,7. 2022-3-14, 1-12

- Refer to https://obis.org/manual/darwincore/#occurrence and https://obis.org/manual/darwincore/#taxonomy for guidance
- `acceptedname` and `acceptedNameID` are not needed or even recommended


**TODO**
- Decide when and how to assign `occurrenceStatus` "absent". From Li et al. 2019, it's clear that for *Euphausia pacifica*, if a "sample" (a specific net tow) has no occurrence record, absence is implied and can/should be entered. But what about other taxa? Will have to consult with Julie and the Standardizing Bio data group.
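
One possible way to generate the implied absences, once the decision is made, is sketched below. It is only an illustration under stated assumptions: the event IDs, the toy `present_df` table and the helper `implied_absences` are hypothetical, and the sketch ignores life stage (it adds one "absent" row per net tow with no record of the taxon at all).

```python
import uuid

import pandas as pd

# Toy stand-ins (hypothetical values) for the real event list and occurrence table.
event_ids = ["tow_A", "tow_B", "tow_C"]
present_df = pd.DataFrame({
    "eventID": ["tow_A", "tow_C", "tow_C"],
    "scientificName": ["Euphausia pacifica", "Euphausia pacifica", "Acartia clausi"],
    "occurrenceStatus": "present",
})


def implied_absences(present_df, event_ids, taxon):
    """Return one 'absent' row per event that has no record of `taxon`."""
    seen = set(present_df.loc[present_df["scientificName"] == taxon, "eventID"])
    missing = [e for e in event_ids if e not in seen]
    return pd.DataFrame({
        "eventID": missing,
        "occurrenceID": [str(uuid.uuid4()) for _ in missing],
        "scientificName": taxon,
        "occurrenceStatus": "absent",
    })


absent_df = implied_absences(present_df, event_ids, "Euphausia pacifica")
print(absent_df[["eventID", "scientificName", "occurrenceStatus"]])
```

The resulting rows could then be concatenated with the "present" records before export, if that is the approach the group settles on.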

**Questions, issues, comments about life history and sex parsing**
- `life_history_stage` is exported in a separate, intermediate csv table together with `occurrenceID` in order to enable the join/merge needed for generating abundance measurements in emof.
- `life_stage_mapping` dictionary is now stored in an external JSON file
- Now that I've thoroughly compared `lhs_0` and `lhs_1`, it's clear that `lhs_1` can be thrown out. `lhs_0` is all that's needed
- `Female` and `Male` are only found for `Adult`, and only found in `lhs_0`
- Are `CI` - `CV` (and matching 1-5) the same as `Copepodite` 1 - 5? If they were, their use with *Euphausia pacifica* (see the next cell) would be wrong.
- Life stages `Calyptopis` 1-3, `Furcilia` 1-10 and `Zoea` 1-5 are not found in https://vocab.nerc.ac.uk/collection/S11/current/. That vocab doesn't have any "phases" for `Calyptopis`, `Furcilia` and `Zoea`
- These are also not found: `Bract`, `Gonophore`, `Nectophore`, `Polyp`
- `Furcilia_1_0_legs` is not clear

In [1]:
from datetime import datetime
import json
from pathlib import Path
import uuid

import numpy as np
import pandas as pd

from data_preprocess import read_and_parse_sourcedata

In [2]:
data_pth = Path(".")

## Settings

Set to `True` when debugging. `csv` files will not be exported when `debug_no_csvexport = True`.

In [3]:
debug_no_csvexport = False

## Process JSON file containing common mappings and strings

In [4]:
with open(data_pth / 'common_mappings.json') as f:
    common_mappings = json.load(f)

In [5]:
DatasetCode = common_mappings['datasetcode']
life_stage_mapping = common_mappings['life_stage_mapping']

## Read the data

### Read the pre-processed csv file

`usecols` defines the columns that will be kept and the order in which they'll be organized

In [6]:
usecols = ['sample_code', 'species', 'life_history_stage', 'lhs_0', 'lhs_1']
occursource_df = read_and_parse_sourcedata()[usecols]

In [7]:
len(occursource_df)

6884

In [8]:
occursource_df.head()

Unnamed: 0,sample_code,species,life_history_stage,lhs_0,lhs_1
0,20131003DBDm2_200,ACARTIA,3;_CIII,3,CIII
1,20130906DBiDm1_200,ACARTIA,5;_CV,5,CV
2,20131003DBDm1_200,ACARTIA,Female;_Adult,Female,Adult
3,20131003DBDm1_200,ACARTIA,Male;_Adult,Male,Adult
4,20120614DBDm3_200,ACARTIA_CLAUSI,Female;_Adult,Female,Adult


## Merge resolved taxonomy from taxonomy csv

In [9]:
taxonomy_df = pd.read_csv(
    data_pth / "intermediate_DwC_taxonomy.csv"
)

In [10]:
occursource_df = occursource_df.merge(
    taxonomy_df, 
    how='inner', 
    left_on='species',
    right_on='verbatimIdentification'
)

In [11]:
len(occursource_df)

6702
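
The row count drops from 6884 to 6702 after the inner merge, which suggests that some `species` values have no matching `verbatimIdentification` in the taxonomy table. A minimal sketch of how the dropped names could be listed is shown below, using a left merge with `indicator=True`. The two small dataframes are toy stand-ins, not the real data.

```python
import pandas as pd

# Toy stand-ins for occursource_df and taxonomy_df (hypothetical values).
source = pd.DataFrame({
    "sample_code": ["s1", "s1", "s2"],
    "species": ["ACARTIA", "UNKNOWN_TAXON", "ACARTIA"],
})
taxonomy = pd.DataFrame({
    "verbatimIdentification": ["ACARTIA"],
    "scientificName": ["Acartia"],
})

# A left merge keeps every source row; the `_merge` column added by
# indicator=True marks rows that found no partner as "left_only".
audit = source.merge(
    taxonomy,
    how="left",
    left_on="species",
    right_on="verbatimIdentification",
    indicator=True,
)
unmatched = audit.loc[audit["_merge"] == "left_only", "species"].value_counts()
print(unmatched)
```

Running the same check on the real tables would show which taxa account for the missing rows and whether their exclusion is intended.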

In [12]:
occursource_df.head()

Unnamed: 0,sample_code,species,life_history_stage,lhs_0,lhs_1,scientificName,scientificNameID,taxonRank,kingdom,phylum,class,order,family,genus,scientificNameAuthorship,verbatimIdentification
0,20131003DBDm2_200,ACARTIA,3;_CIII,3,CIII,Acartia,urn:lsid:marinespecies.org:taxname:104108,Genus,Animalia,Arthropoda,Copepoda,Calanoida,Acartiidae,Acartia,"Dana, 1846",ACARTIA
1,20130906DBiDm1_200,ACARTIA,5;_CV,5,CV,Acartia,urn:lsid:marinespecies.org:taxname:104108,Genus,Animalia,Arthropoda,Copepoda,Calanoida,Acartiidae,Acartia,"Dana, 1846",ACARTIA
2,20131003DBDm1_200,ACARTIA,Female;_Adult,Female,Adult,Acartia,urn:lsid:marinespecies.org:taxname:104108,Genus,Animalia,Arthropoda,Copepoda,Calanoida,Acartiidae,Acartia,"Dana, 1846",ACARTIA
3,20131003DBDm1_200,ACARTIA,Male;_Adult,Male,Adult,Acartia,urn:lsid:marinespecies.org:taxname:104108,Genus,Animalia,Arthropoda,Copepoda,Calanoida,Acartiidae,Acartia,"Dana, 1846",ACARTIA
4,20130905DUNm1_200,ACARTIA,5;_CV,5,CV,Acartia,urn:lsid:marinespecies.org:taxname:104108,Genus,Animalia,Arthropoda,Copepoda,Calanoida,Acartiidae,Acartia,"Dana, 1846",ACARTIA


## Create and populate `occurrence_df` dataframe

### Add `occurrenceID`, `basisOfRecord`, `occurrenceStatus`

In [13]:
occurrence_df = occursource_df.copy()

In [14]:
occurrence_df.insert(1, 'occurrenceID', [uuid.uuid4() for i in range(len(occurrence_df))])
occurrence_df.insert(2, 'basisOfRecord', 'MaterialSample')
occurrence_df.insert(3, 'occurrenceStatus', 'present')

occurrence_df.rename(columns={'sample_code':'eventID'}, inplace=True)
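
Note that `uuid.uuid4()` is random, so every run of this notebook produces new `occurrenceID` values, and the intermediate `life_history_stage` table and the occurrence csv must come from the same run to stay joinable. If stable IDs are ever wanted, one alternative (not what this notebook does) is a name-based `uuid.uuid5`. The sketch below assumes that `eventID`, `species` and `life_history_stage` together identify a row uniquely, which has not been verified here. The namespace URL and the helper `stable_occurrence_id` are hypothetical.

```python
import uuid

# Hypothetical namespace; any fixed UUID works as long as it never changes.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/hoodcanal-zooplankton")


def stable_occurrence_id(event_id, species, life_history_stage):
    """Derive a reproducible UUID from the fields assumed to identify a row."""
    key = "|".join([event_id, species, life_history_stage])
    return str(uuid.uuid5(NAMESPACE, key))


a = stable_occurrence_id("20131003DBDm2_200", "ACARTIA", "3;_CIII")
b = stable_occurrence_id("20131003DBDm2_200", "ACARTIA", "3;_CIII")
c = stable_occurrence_id("20131003DBDm2_200", "ACARTIA", "5;_CV")
print(a == b, a == c)
```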

In [15]:
len(occurrence_df)

6702

In [16]:
occurrence_df.head(10)

Unnamed: 0,eventID,occurrenceID,basisOfRecord,occurrenceStatus,species,life_history_stage,lhs_0,lhs_1,scientificName,scientificNameID,taxonRank,kingdom,phylum,class,order,family,genus,scientificNameAuthorship,verbatimIdentification
0,20131003DBDm2_200,ca12f4f3-1fa1-4a1e-bbc7-1ce8fc07d6ca,MaterialSample,present,ACARTIA,3;_CIII,3,CIII,Acartia,urn:lsid:marinespecies.org:taxname:104108,Genus,Animalia,Arthropoda,Copepoda,Calanoida,Acartiidae,Acartia,"Dana, 1846",ACARTIA
1,20130906DBiDm1_200,dbb9aa4e-3e58-4219-acd4-030286bfbd1a,MaterialSample,present,ACARTIA,5;_CV,5,CV,Acartia,urn:lsid:marinespecies.org:taxname:104108,Genus,Animalia,Arthropoda,Copepoda,Calanoida,Acartiidae,Acartia,"Dana, 1846",ACARTIA
2,20131003DBDm1_200,b20c5dc7-0d8d-4338-a1f4-540b257fea00,MaterialSample,present,ACARTIA,Female;_Adult,Female,Adult,Acartia,urn:lsid:marinespecies.org:taxname:104108,Genus,Animalia,Arthropoda,Copepoda,Calanoida,Acartiidae,Acartia,"Dana, 1846",ACARTIA
3,20131003DBDm1_200,2a68e475-b7d3-4a42-8182-a446ceee6789,MaterialSample,present,ACARTIA,Male;_Adult,Male,Adult,Acartia,urn:lsid:marinespecies.org:taxname:104108,Genus,Animalia,Arthropoda,Copepoda,Calanoida,Acartiidae,Acartia,"Dana, 1846",ACARTIA
4,20130905DUNm1_200,84469dfc-a6f5-47dc-acf2-54e30986a761,MaterialSample,present,ACARTIA,5;_CV,5,CV,Acartia,urn:lsid:marinespecies.org:taxname:104108,Genus,Animalia,Arthropoda,Copepoda,Calanoida,Acartiidae,Acartia,"Dana, 1846",ACARTIA
5,20120614DBDm3_200,74715bc4-e434-453c-a3fb-28c458a32427,MaterialSample,present,ACARTIA_CLAUSI,Female;_Adult,Female,Adult,Acartia clausi,urn:lsid:marinespecies.org:taxname:104251,Species,Animalia,Arthropoda,Copepoda,Calanoida,Acartiidae,Acartia,"Giesbrecht, 1889",ACARTIA_CLAUSI
6,20120614DBDm3_200,792b0814-c5c6-4f36-8230-59b7c1f29eda,MaterialSample,present,ACARTIA_CLAUSI,Male;_Adult,Male,Adult,Acartia clausi,urn:lsid:marinespecies.org:taxname:104251,Species,Animalia,Arthropoda,Copepoda,Calanoida,Acartiidae,Acartia,"Giesbrecht, 1889",ACARTIA_CLAUSI
7,20120614DBDm4_200,0cbfe732-422e-4dff-8292-61288630ae6d,MaterialSample,present,ACARTIA_CLAUSI,Female;_Adult,Female,Adult,Acartia clausi,urn:lsid:marinespecies.org:taxname:104251,Species,Animalia,Arthropoda,Copepoda,Calanoida,Acartiidae,Acartia,"Giesbrecht, 1889",ACARTIA_CLAUSI
8,20120712DBDm1_200,ec1a67e3-7ead-46e2-b81a-7e8153287828,MaterialSample,present,ACARTIA_CLAUSI,5;_CV,5,CV,Acartia clausi,urn:lsid:marinespecies.org:taxname:104251,Species,Animalia,Arthropoda,Copepoda,Calanoida,Acartiidae,Acartia,"Giesbrecht, 1889",ACARTIA_CLAUSI
9,20121004DBDm4_200,5579be99-96cb-4ff0-aded-f616b9171cef,MaterialSample,present,ACARTIA_CLAUSI,Female;_Adult,Female,Adult,Acartia clausi,urn:lsid:marinespecies.org:taxname:104251,Species,Animalia,Arthropoda,Copepoda,Calanoida,Acartiidae,Acartia,"Giesbrecht, 1889",ACARTIA_CLAUSI


### Map `life_history_stage` into `sex` and `lifeStage`

In [17]:
occurrence_df[['lhs_0', 'lhs_1']].value_counts(dropna=False).sort_index(ascending=True)

lhs_0             lhs_1       
1                 CI               120
2                 CII              176
3                 CIII             272
4                 CIV              491
5                 CV               611
Adult             NaN              193
Bract             NaN                6
Cal1              Calyptopis_1      49
Cal2              Calyptopis_2      34
Cal3              Calyptopis_3      42
Copepodite        NaN              265
Cyphonaut         NaN               81
Cyprid_larva      NaN               36
Egg               NaN               73
Eudoxid           NaN                5
F1                Furcilia_1        76
F10               Furcilia_10        3
F2                Furcilia_2        66
F3                Furcilia_3        57
F4                Furcilia_4        30
F5                Furcilia_5        10
F6                Furcilia_6        24
F7                Furcilia_7        24
F9                Furcilia_9         1
Female            Adult          

In [18]:
# Examine use of CI-V life stages (Copepodites?) with EUPHAUSIA_PACIFICA krill
# Note that only CI-III are used with krill, and only a small subset of the totals (9, 9 & 6, respectively)

# occurrence_df[occurrence_df.species == 'EUPHAUSIA_PACIFICA'][['lhs_0', 'lhs_1']].value_counts(dropna=False).sort_index(ascending=True)

Confirm that 'Cal1', 'Cal2', 'Cal3' are only used with krill

In [19]:
pd.DataFrame(occurrence_df[
    occurrence_df.lhs_0.isin(['Cal1', 'Cal2', 'Cal3'])
][
    ['lhs_0', 'lhs_1', 'species']
].value_counts(dropna=False).sort_index(ascending=True)).head(60)

lhs_0,lhs_1,species,0
Cal1,Calyptopis_1,EUPHAUSIA_PACIFICA,43
Cal1,Calyptopis_1,THYSANOESSA,6
Cal2,Calyptopis_2,EUPHAUSIA_PACIFICA,32
Cal2,Calyptopis_2,THYSANOESSA,1
Cal2,Calyptopis_2,THYSANOESSA_RASCHII,1
Cal3,Calyptopis_3,EUPHAUSIA_PACIFICA,37
Cal3,Calyptopis_3,THYSANOESSA,4
Cal3,Calyptopis_3,THYSANOESSA_RASCHII,1


In [20]:
# pd.DataFrame(occurrence_df[
#     occurrence_df.lhs_0.isin(['1', '2', '3', '4', '5'])
# ][
#     ['lhs_0', 'lhs_1', 'species']
# ].value_counts(dropna=False).sort_index(ascending=True)).head(60)

In [21]:
occurrence_df.insert(
    4, 'sex', 
    occurrence_df['lhs_0'].apply(
        lambda s: s.lower() if s in ['Female', 'Male'] else 'indeterminate'
    )
)
occurrence_df.insert(
    5, 'lifeStage', 
    occurrence_df['lhs_0'].apply(
        lambda s: life_stage_mapping[s]
    )
)

occurrence_df = (
    occurrence_df
    .sort_values(by=['eventID', 'scientificName', 'lifeStage', 'sex'])
    .reset_index(drop=True)
)
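
The `life_stage_mapping[s]` lookup above raises a `KeyError` as soon as `lhs_0` contains a value that is missing from the JSON mapping. A small pre-check like the sketch below could list such values before the mapping is applied. The mapping excerpt only repeats entries visible in the output of this notebook, and the `lhs_0` series is a toy example.

```python
import pandas as pd

# Excerpt only; the full mapping lives in common_mappings.json.
life_stage_mapping = {"Female": "adult", "Male": "adult", "5": "copepodites C5"}
# Toy series standing in for occurrence_df['lhs_0'].
lhs_0 = pd.Series(["Female", "5", "Furcilia_1_0_legs", "Male"])

# Values present in the data but absent from the mapping keys.
unmapped = sorted(set(lhs_0.dropna()) - set(life_stage_mapping))
print(unmapped)

# Series.map with a dict yields NaN for unmapped values instead of raising.
life_stage = lhs_0.map(life_stage_mapping)
print(life_stage)
```

Whether to fail loudly (as the current lookup does) or to carry `NaN` forward is a design choice. The pre-check makes either option easier to diagnose.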

In [22]:
occurrence_df.head(5)

Unnamed: 0,eventID,occurrenceID,basisOfRecord,occurrenceStatus,sex,lifeStage,species,life_history_stage,lhs_0,lhs_1,...,scientificNameID,taxonRank,kingdom,phylum,class,order,family,genus,scientificNameAuthorship,verbatimIdentification
0,20120611UNDm1_200,6941d6e8-4539-49f5-b908-db87dc5f9837,MaterialSample,present,female,adult,ACARTIA_CLAUSI,Female;_Adult,Female,Adult,...,urn:lsid:marinespecies.org:taxname:104251,Species,Animalia,Arthropoda,Copepoda,Calanoida,Acartiidae,Acartia,"Giesbrecht, 1889",ACARTIA_CLAUSI
1,20120611UNDm1_200,52d79ec6-8dd2-466b-a03b-6bab0ab1c16c,MaterialSample,present,male,adult,ACARTIA_CLAUSI,Male;_Adult,Male,Adult,...,urn:lsid:marinespecies.org:taxname:104251,Species,Animalia,Arthropoda,Copepoda,Calanoida,Acartiidae,Acartia,"Giesbrecht, 1889",ACARTIA_CLAUSI
2,20120611UNDm1_200,c6c45ae2-1df8-4124-a2a9-06e3e3104105,MaterialSample,present,indeterminate,copepodites C5,ACARTIA_CLAUSI,5;_CV,5,CV,...,urn:lsid:marinespecies.org:taxname:104251,Species,Animalia,Arthropoda,Copepoda,Calanoida,Acartiidae,Acartia,"Giesbrecht, 1889",ACARTIA_CLAUSI
3,20120611UNDm1_200,31ab7e6f-4d0a-47a6-a31f-b0b86dc2bf5a,MaterialSample,present,indeterminate,medusae,AGLANTHA,Medusa,Medusa,,...,urn:lsid:marinespecies.org:taxname:117212,Genus,Animalia,Cnidaria,Hydrozoa,Trachymedusae,Rhopalonematidae,Aglantha,"Haeckel, 1879",AGLANTHA
4,20120611UNDm1_200,412576d2-12b0-4d9f-853f-10d829585b66,MaterialSample,present,indeterminate,veliger,BIVALVIA,Veliger,Veliger,,...,urn:lsid:marinespecies.org:taxname:105,Class,Animalia,Mollusca,Bivalvia,,,,"Linnaeus, 1758",BIVALVIA


## Export intermediate table for `life_history_stage` matching to csv

In [23]:
if not debug_no_csvexport:
    occurrence_df[['occurrenceID', 'life_history_stage']].to_csv(
        data_pth / 'intermediate_DwC_occurrence_life_history_stage.csv', index=False
    )

## Export `occurrence_df` to csv

### Cleanup

In [24]:
occurrence_df.drop(columns=['species', 'life_history_stage', 'lhs_0', 'lhs_1'], inplace=True)

if not debug_no_csvexport:
    occurrence_df.to_csv(data_pth / 'aligned_csvs' / 'DwC_occurrence.csv', index=False)

## Package versions

In [25]:
print(
    f"{datetime.utcnow()} +00:00\n"
    f"pandas: {pd.__version__}"
)

2023-03-29 03:10:49.880341 +00:00
pandas: 1.5.3
