# Darwin Core (DwC) **Taxonomy Mappings**, Puget Sound Zooplankton Monitoring Program (PSZMP) dataset

Alignment of the PSZMP zooplankton dataset with the [Darwin Core (DwC) data standard](https://dwc.tdwg.org/), carried out by **NANOOS**, https://www.nanoos.org. This data alignment work, including this Jupyter notebook, is described in the NANOOS GitHub repository https://github.com/nanoos-pnw/obis-pszmp. See [README.md](https://github.com/nanoos-pnw/obis-pszmp/blob/main/README.md) for further background on the dataset, DwC and data transformations.

Emilio Mayorga, https://github.com/emiliom

## Goals and scope of this notebook

Match the taxonomic information found in the source data with the corresponding, fully fleshed-out information from [WoRMS](https://www.marinespecies.org/) (World Register of Marine Species). In addition to automatic match-ups, several manual match-ups are applied as a pre-processing step in [data_preprocess.py](data_preprocess.py), based on unsuccessful matches and input from the PSZMP team. The taxonomic match-up is performed on `Genus species_lc`, a version of the original `Genus species` entries in which the strings were converted to all-lowercase and variations containing non-taxonomic information were simplified and homogenized (eg, "PSEUDOCALANUS Lg" and "PSEUDOCALANUS" are both transformed into "pseudocalanus"). The column `Genus species_lc` is renamed to `species` for simplicity (though that's a misnomer, since many entries are identified only to a higher rank such as genus or family). This notebook creates an intermediate file, `intermediate_DwC_taxonomy.csv`, containing the taxonomic match-up between the original, unmodified `Genus species` entry and the corresponding WoRMS information; that file is used later in the `PSZMP-dwcOccurrence.ipynb` notebook.
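
The normalization described above can be illustrated with a minimal sketch. This is not the code in `data_preprocess.py`: the function name `normalize_taxon` and the `QUALIFIERS` set are hypothetical, and the actual pre-processing may handle more cases.

```python
# Hypothetical set of non-taxonomic qualifiers (size classes); the real rules
# live in data_preprocess.py and may differ.
QUALIFIERS = {"lg", "sm"}


def normalize_taxon(verbatim: str) -> str:
    """Lowercase a verbatim entry and drop trailing non-taxonomic qualifiers."""
    tokens = verbatim.lower().split()
    while tokens and tokens[-1] in QUALIFIERS:
        tokens.pop()
    return " ".join(tokens)


for raw in ["PSEUDOCALANUS Lg", "PSEUDOCALANUS Sm", "PSEUDOCALANUS"]:
    print(f"{raw!r} -> {normalize_taxon(raw)!r}")
```

All three variants collapse to the same lowercase string, which is the value used for the WoRMS look-up.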

## Settings

In [1]:
from datetime import datetime
from pathlib import Path

import pandas as pd
import pyworms

from data_preprocess import read_and_parse_sourcedata

In [2]:
data_pth = Path(".")

Set to `True` when debugging. `csv` files will not be exported when `debug_no_csvexport = True`.

In [3]:
debug_no_csvexport = False

## Pre-process data for taxonomy table

### Read and pre-process the source data from Excel file

`usecols` defines the columns that will be kept and the order in which they'll be organized

In [4]:
# NOTE: Not sure 'Sample Code' is actually used in this notebook
usecols = ['Sample Code', 'Genus species', 'Genus species_lc']

# taxonsource_df = read_and_parse_sourcedata(test_n_rows=1000)[usecols]
taxonsource_df = read_and_parse_sourcedata()[usecols]

# Map column names to simpler, more manageable ones
taxonsource_df.rename(
    columns={
        'Sample Code':'sample_code',
        'Genus species_lc':'species',
    },
    inplace=True
)

In [5]:
len(taxonsource_df)

185729

In [6]:
taxonsource_df.head()

Unnamed: 0,sample_code,Genus species,species
0,032514DANAD1147,ALPHEIDAE,alpheidae
1,032514DANAD1147,BARNACLES,cirripedia
2,032514DANAD1147,BARNACLES,cirripedia
3,032514DANAD1147,CALANUS,calanus
4,032514DANAD1147,CANCRIDAE,cancridae


## Perform taxon matching with pyworms

Next, we perform a taxonomic lookup with the [pyworms](https://pyworms.readthedocs.io/en/latest/) library, using the function `pyworms.aphiaRecordsByMatchNames()` to collect the information and populate the look-up table.

In [None]:
species_valuecnt = taxonsource_df['species'].value_counts()
species_valuecnt

chaetognatha              12117
calanus pacificus          8815
euphausia pacifica         7359
metridia pacifica          5686
cirripedia                 5372
                          ...  
eucalanus hyalinus            1
farranula                     1
paraphronima crassipes        1
tiaropsis                     1
zaus                          1
Name: species, Length: 261, dtype: int64

`species_list` will be a sorted list of unique `species` entries.

In [None]:
species_list = sorted(list(taxonsource_df['species'].unique()))

len(species_list)

261

In [9]:
%%time
resp = pyworms.aphiaRecordsByMatchNames(
    [s.replace('_', ' ') for s in species_list]
)

CPU times: user 256 ms, sys: 7.27 ms, total: 263 ms
Wall time: 47.5 s


In [10]:
len(resp)

261

In [None]:
resp[1][0]

{'AphiaID': 157664,
 'url': 'https://www.marinespecies.org/aphia.php?p=taxdetails&id=157664',
 'scientificname': 'Acartia hudsonica',
 'authority': 'Pinhey, 1926',
 'status': 'alternative representation',
 'unacceptreason': None,
 'taxonRankID': 220,
 'rank': 'Species',
 'valid_AphiaID': 149751,
 'valid_name': 'Acartia (Acartiura) hudsonica',
 'valid_authority': 'Pinhey, 1926',
 'parentNameUsageID': 104108,
 'kingdom': 'Animalia',
 'phylum': 'Arthropoda',
 'class': 'Copepoda',
 'order': 'Calanoida',
 'family': 'Acartiidae',
 'genus': 'Acartia',
 'citation': 'Walter, T.C.; Boxshall, G. (2024). World of Copepods Database. Acartia hudsonica Pinhey, 1926. Accessed through: World Register of Marine Species at: https://www.marinespecies.org/aphia.php?p=taxdetails&id=157664 on 2024-12-12',
 'lsid': 'urn:lsid:marinespecies.org:taxname:157664',
 'isMarine': 1,
 'isBrackish': 0,
 'isFreshwater': 0,
 'isTerrestrial': 0,
 'isExtinct': None,
 'match_type': 'exact',
 'modified': '2023-12-09T12:47:42
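
The record above has `status` "alternative representation" and a `valid_AphiaID` that differs from its `AphiaID`, so it may be useful to list such cases for manual review. The sketch below is only an illustration on mock data: `review_flags` is a hypothetical helper, the first mock record copies a few fields from the output above, and the second record (including its "like" match type) is invented.

```python
# Mock of the nested response: one list of candidate records per queried name.
mock_resp = [
    [{"AphiaID": 157664, "scientificname": "Acartia hudsonica",
      "status": "alternative representation", "valid_AphiaID": 149751,
      "valid_name": "Acartia (Acartiura) hudsonica", "match_type": "exact"}],
    # Invented record, for illustration only
    [{"AphiaID": 999999999, "scientificname": "Hypothetica exempli",
      "status": "accepted", "valid_AphiaID": 999999999,
      "valid_name": "Hypothetica exempli", "match_type": "like"}],
]


def review_flags(resp):
    """Return (scientificname, reasons) for first matches worth a manual look."""
    flagged = []
    for matches in resp:
        if not matches:
            continue  # unmatched names are handled separately
        rec = matches[0]
        reasons = []
        if rec["match_type"] != "exact":
            reasons.append(f"match_type={rec['match_type']}")
        if rec["AphiaID"] != rec["valid_AphiaID"]:
            reasons.append(f"valid name is {rec['valid_name']!r}")
        if reasons:
            flagged.append((rec["scientificname"], reasons))
    return flagged


for name, reasons in review_flags(mock_resp):
    print(name, "->", "; ".join(reasons))
```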

Create a dictionary of successful matches keyed by `AphiaID`, to pull out information conveniently.

In [None]:
resp_matched_byaphiaid = {r[0]['AphiaID']:r[0] for r in resp if len(r) > 0}

len(resp_matched_byaphiaid)

261

If this test is `True`, all `species` entries were successfully matched.

In [None]:
len(species_list) == len(resp_matched_byaphiaid)

True
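
Note that this length comparison assumes every name resolves to a distinct `AphiaID`: if two names shared one, the dictionary would hold fewer entries and the test would read `False` even with every name matched. A more direct check could look like the sketch below. `match_report` and the demo inputs are hypothetical; the "acartia sp" entry is invented to show a shared ID.

```python
from collections import Counter


def match_report(names, resp):
    """Return (unmatched names, AphiaIDs shared by more than one name)."""
    unmatched = [n for n, matches in zip(names, resp) if not matches]
    first_ids = [matches[0]["AphiaID"] for matches in resp if matches]
    shared = {aid: cnt for aid, cnt in Counter(first_ids).items() if cnt > 1}
    return unmatched, shared


# Demo inputs shaped like `species_list` and `resp`
demo_names = ["acartia", "acartia hudsonica", "acartia sp", "notataxon"]
demo_resp = [
    [{"AphiaID": 104108}],
    [{"AphiaID": 157664}],
    [{"AphiaID": 104108}],  # invented: same ID as "acartia"
    [],                     # no match
]

unmatched, shared = match_report(demo_names, demo_resp)
print("unmatched:", unmatched)
print("shared AphiaIDs:", shared)
```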

Create simple dictionary of `species` (original dataset species name) vs `AphiaID`

In [14]:
species_worms_all = {}
for spec,r in zip(species_list, resp):
    aphiaID = r[0]['AphiaID'] if len(r) > 0 else None
    species_worms_all[spec] = aphiaID

Print a few of the *successful* matches, with corresponding `AphiaID`, followed by an example of the use of `resp_matched_byaphiaid` to bring in all the taxonomic information returned by `pyworms` (eg, for species name, 'CANCER_GRACILIS' > 'Metacarcinus gracilis')

In [15]:
for k,v in list(species_worms_all.items())[:10]:
    print(f"{k}: {v}")

acartia: 104108
acartia hudsonica: 157664
acartia longiremis: 104257
acartia tonsa: 104262
aegina citrea: 117263
aequorea victoria: 283998
aetideidae: 104075
aetideus: 104112
aglantha digitale: 117849
alienacanthomysis macropsis: 226662


In [16]:
resp_matched_byaphiaid[species_worms_all['acartia longiremis']]

{'AphiaID': 104257,
 'url': 'https://www.marinespecies.org/aphia.php?p=taxdetails&id=104257',
 'scientificname': 'Acartia longiremis',
 'authority': '(Lilljeborg, 1853)',
 'status': 'alternative representation',
 'unacceptreason': None,
 'taxonRankID': 220,
 'rank': 'Species',
 'valid_AphiaID': 346037,
 'valid_name': 'Acartia (Acartiura) longiremis',
 'valid_authority': '(Lilljeborg, 1853)',
 'parentNameUsageID': 104108,
 'kingdom': 'Animalia',
 'phylum': 'Arthropoda',
 'class': 'Copepoda',
 'order': 'Calanoida',
 'family': 'Acartiidae',
 'genus': 'Acartia',
 'citation': 'Walter, T.C.; Boxshall, G. (2024). World of Copepods Database. Acartia longiremis (Lilljeborg, 1853). Accessed through: World Register of Marine Species at: https://www.marinespecies.org/aphia.php?p=taxdetails&id=104257 on 2024-12-12',
 'lsid': 'urn:lsid:marinespecies.org:taxname:104257',
 'isMarine': 1,
 'isBrackish': 0,
 'isFreshwater': 0,
 'isTerrestrial': 0,
 'isExtinct': None,
 'match_type': 'exact',
 'modified':

## Confirm there are no unmatched species

In [17]:
species_worms_no_match = [
    (k,species_valuecnt[k]) for k,v in species_worms_all.items()
    if v is None
]

# Entries not matched by `pyworms`, followed by their occurrence count
species_worms_no_match

[]

To apply manual substitutions for taxa not matched by WoRMS in this notebook (instead of, or in addition to, doing so in `data_preprocess.py`), see the corresponding code in the Hood Canal zooplankton dataset notebook, [Keister-dwcTaxonomy.ipynb](https://github.com/nanoos-pnw/obis-keisterhczoop/blob/main/Keister-dwcTaxonomy.ipynb).
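
As a rough idea of what such a substitution step might look like (this is not the code from the Keister notebook), the sketch below replaces names through a hand-made dictionary before they are looked up. `MANUAL_SUBSTITUTIONS` and `apply_substitutions` are hypothetical; "barnacles" to "cirripedia" mirrors the mapping visible in the `taxonsource_df.head()` output, while the second entry is invented.

```python
# Hypothetical overrides: normalized source name -> name expected to match in WoRMS
MANUAL_SUBSTITUTIONS = {
    "barnacles": "cirripedia",   # mapping seen in the source data preview
    "fish larvae": "teleostei",  # invented example
}


def apply_substitutions(names, substitutions):
    """Replace names found in `substitutions`; return a sorted, de-duplicated list."""
    return sorted({substitutions.get(name, name) for name in names})


print(apply_substitutions(["barnacles", "calanus", "cirripedia"], MANUAL_SUBSTITUTIONS))
```

The substituted names would then be passed to the WoRMS look-up again, and the unmatched list rechecked.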

## Final taxon look-up table

`aphiaid` will be `None` if there was no match.

In [None]:
species_worms_lut_df = pd.DataFrame.from_dict(species_worms_all, orient='index', columns=['aphiaid'])
species_worms_lut_df['species'] = species_worms_lut_df.index

len(species_worms_lut_df)

261

In [19]:
species_worms_lut_df.head()

Unnamed: 0,aphiaid,species
acartia,104108,acartia
acartia hudsonica,157664,acartia hudsonica
acartia longiremis,104257,acartia longiremis
acartia tonsa,104262,acartia tonsa
aegina citrea,117263,aegina citrea


If `aphiaid` contains any nulls, pandas will have cast the column to `float` (because nulls are stored as `NaN`). Remove those rows, if any, then cast `aphiaid` to `int`.

In [20]:
species_worms_lut_df.dropna(axis='rows', inplace=True)
species_worms_lut_df['aphiaid'] = species_worms_lut_df['aphiaid'].astype(int)

len(species_worms_lut_df)

261

In [21]:
col_mapping = [
    ('scientificName', 'scientificname'),
    ('scientificNameID', 'lsid'),
    ('taxonRank', 'rank'),
    ('kingdom', 'kingdom'),
    ('phylum', 'phylum'),
    ('class', 'class'),
    ('order', 'order'),
    ('family', 'family'),
    ('genus', 'genus'),
    ('scientificNameAuthorship', 'authority')
]

for cm in col_mapping:
    species_worms_lut_df[cm[0]] = species_worms_lut_df['aphiaid'].apply(
        lambda aid: resp_matched_byaphiaid[aid][cm[1]]
    )
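
The same renaming from WoRMS keys to Darwin Core terms can be shown on a single record with plain Python. This is a sketch only: `to_dwc` is a hypothetical helper, and the sample record is assembled from the Aetideidae row of the look-up table, assuming its empty `genus` is returned as `None`.

```python
# (Darwin Core term, WoRMS record key), same pairs as `col_mapping`
COL_MAPPING = [
    ("scientificName", "scientificname"),
    ("scientificNameID", "lsid"),
    ("taxonRank", "rank"),
    ("kingdom", "kingdom"),
    ("phylum", "phylum"),
    ("class", "class"),
    ("order", "order"),
    ("family", "family"),
    ("genus", "genus"),
    ("scientificNameAuthorship", "authority"),
]


def to_dwc(record):
    """Return a dict keyed by Darwin Core terms; missing WoRMS keys become None."""
    return {dwc: record.get(worms) for dwc, worms in COL_MAPPING}


sample_record = {
    "scientificname": "Aetideidae",
    "lsid": "urn:lsid:marinespecies.org:taxname:104075",
    "rank": "Family",
    "kingdom": "Animalia",
    "phylum": "Arthropoda",
    "class": "Copepoda",
    "order": "Calanoida",
    "family": "Aetideidae",
    "genus": None,  # assumption: no genus for a family-rank record
    "authority": "Giesbrecht, 1892-1893",
}

dwc_row = to_dwc(sample_record)
print(dwc_row["scientificName"], dwc_row["taxonRank"], dwc_row["genus"])
```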

In [22]:
set(species_list).difference(set(species_worms_lut_df['species'].unique()))

set()

In [None]:
species_worms_lut_df.head(15)

Unnamed: 0,aphiaid,species,scientificName,scientificNameID,taxonRank,kingdom,phylum,class,order,family,genus,scientificNameAuthorship
acartia,104108,acartia,Acartia,urn:lsid:marinespecies.org:taxname:104108,Genus,Animalia,Arthropoda,Copepoda,Calanoida,Acartiidae,Acartia,"Dana, 1846"
acartia hudsonica,157664,acartia hudsonica,Acartia hudsonica,urn:lsid:marinespecies.org:taxname:157664,Species,Animalia,Arthropoda,Copepoda,Calanoida,Acartiidae,Acartia,"Pinhey, 1926"
acartia longiremis,104257,acartia longiremis,Acartia longiremis,urn:lsid:marinespecies.org:taxname:104257,Species,Animalia,Arthropoda,Copepoda,Calanoida,Acartiidae,Acartia,"(Lilljeborg, 1853)"
acartia tonsa,104262,acartia tonsa,Acartia tonsa,urn:lsid:marinespecies.org:taxname:104262,Species,Animalia,Arthropoda,Copepoda,Calanoida,Acartiidae,Acartia,"Dana, 1849-1852"
aegina citrea,117263,aegina citrea,Aegina citrea,urn:lsid:marinespecies.org:taxname:117263,Species,Animalia,Cnidaria,Hydrozoa,Narcomedusae,Aeginidae,Aegina,"Eschscholtz, 1829"
aequorea victoria,283998,aequorea victoria,Aequorea victoria,urn:lsid:marinespecies.org:taxname:283998,Species,Animalia,Cnidaria,Hydrozoa,Leptothecata,Aequoreidae,Aequorea,"(Murbach & Shearer, 1902)"
aetideidae,104075,aetideidae,Aetideidae,urn:lsid:marinespecies.org:taxname:104075,Family,Animalia,Arthropoda,Copepoda,Calanoida,Aetideidae,,"Giesbrecht, 1892-1893"
aetideus,104112,aetideus,Aetideus,urn:lsid:marinespecies.org:taxname:104112,Genus,Animalia,Arthropoda,Copepoda,Calanoida,Aetideidae,Aetideus,"Brady, 1883"
aglantha digitale,117849,aglantha digitale,Aglantha digitale,urn:lsid:marinespecies.org:taxname:117849,Species,Animalia,Cnidaria,Hydrozoa,Trachymedusae,Rhopalonematidae,Aglantha,"(O. F. Müller, 1776)"
alienacanthomysis macropsis,226662,alienacanthomysis macropsis,Alienacanthomysis macropsis,urn:lsid:marinespecies.org:taxname:226662,Species,Animalia,Arthropoda,Malacostraca,Mysida,Mysidae,Alienacanthomysis,"(W. Tattersall, 1932)"


Some summaries:

In [24]:
species_worms_lut_df['taxonRank'].value_counts()

Species        103
Genus           71
Family          26
Order           21
Class           13
Phylum          10
Suborder         6
Infraorder       3
Subclass         3
Superfamily      3
Kingdom          1
Subphylum        1
Name: taxonRank, dtype: int64

In [25]:
species_worms_lut_df[species_worms_lut_df['taxonRank'] == 'Phylum']['scientificName'].value_counts()

Annelida           1
Bryozoa            1
Chaetognatha       1
Ctenophora         1
Echinodermata      1
Mollusca           1
Nematoda           1
Nemertea           1
Phoronida          1
Platyhelminthes    1
Name: scientificName, dtype: int64

In [None]:
species_worms_lut_df[species_worms_lut_df['taxonRank'].isin(['Class', 'Subclass'])]['scientificName'].value_counts()

Ascidiacea       1
Bivalvia         1
Cirripedia       1
Copepoda         1
Facetotecta      1
Gastropoda       1
Holothuroidea    1
Hydrozoa         1
Insecta          1
Larvacea         1
Ophiuroidea      1
Ostracoda        1
Polychaeta       1
Pycnogonida      1
Scyphozoa        1
Teleostei        1
Name: scientificName, dtype: int64

## Create and populate `taxonomy_df` dataframe

In [None]:
taxonomy_df = taxonsource_df.merge(
    species_worms_lut_df, 
    how='inner', 
    on='species'
)

taxonomy_df['verbatimIdentification'] = taxonomy_df['Genus species']

len(taxonomy_df)

185729

Some summaries, based on individual records:

In [28]:
taxonomy_df['taxonRank'].value_counts()

Species        83794
Genus          34431
Class          21025
Family         18312
Phylum         16309
Subclass        5498
Order           2937
Kingdom         1084
Suborder        1060
Infraorder      1047
Superfamily      131
Subphylum        101
Name: taxonRank, dtype: int64

In [29]:
taxonomy_df[taxonomy_df['taxonRank'] == 'Phylum']['scientificName'].value_counts()

Chaetognatha       12117
Bryozoa             2090
Echinodermata       1328
Phoronida            221
Ctenophora           204
Platyhelminthes      124
Nemertea             115
Nematoda              73
Mollusca              31
Annelida               6
Name: scientificName, dtype: int64

In [30]:
taxonomy_df[taxonomy_df['taxonRank'].isin(['Class', 'Subclass'])]['scientificName'].value_counts()

Cirripedia       5372
Copepoda         4787
Polychaeta       3733
Teleostei        3473
Gastropoda       2816
Bivalvia         1860
Larvacea         1647
Hydrozoa         1500
Ostracoda         972
Insecta           124
Ascidiacea         92
Scyphozoa          84
Holothuroidea      37
Ophiuroidea        19
Pycnogonida         5
Facetotecta         2
Name: scientificName, dtype: int64

### Clean up and simplify into a taxonomy table without duplicates

The table will still contain a few repeated `scientificName` values, and it will be longer than `species_worms_lut_df`, because several `Genus species` variations (eg, "PSEUDOCALANUS Lg", "PSEUDOCALANUS Sm" and "PSEUDOCALANUS") correspond to the same `species` (eg, "pseudocalanus") and `scientificName` (eg, "Pseudocalanus").

In [31]:
taxonomy_df = (
    taxonomy_df
    .drop(columns=['sample_code', 'Genus species', 'species', 'aphiaid'])
    .drop_duplicates()
    .sort_values(by='verbatimIdentification')
    .reset_index(drop=True)
)

len(taxonomy_df)

288
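
The difference between this length and that of the look-up table comes from verbatim variants that share a `species` value. The sketch below shows the counting logic on a handful of made-up pairs; with the real data, the pairs would come from the `Genus species` and `species` columns.

```python
from collections import defaultdict

# (verbatim entry, normalized species) pairs; a small illustrative sample
pairs = [
    ("PSEUDOCALANUS Lg", "pseudocalanus"),
    ("PSEUDOCALANUS Sm", "pseudocalanus"),
    ("PSEUDOCALANUS", "pseudocalanus"),
    ("CALANUS", "calanus"),
    ("BARNACLES", "cirripedia"),
    ("BARNACLES", "cirripedia"),  # repeated record, removed by de-duplication
]

variants = defaultdict(set)
for verbatim, species in pairs:
    variants[species].add(verbatim)

n_lookup_rows = len(variants)                               # one row per species
n_taxonomy_rows = sum(len(v) for v in variants.values())    # one row per verbatim entry
multi = {s: sorted(v) for s, v in variants.items() if len(v) > 1}

print(n_lookup_rows, n_taxonomy_rows)
print(multi)
```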

In [32]:
taxonomy_df.head()

Unnamed: 0,scientificName,scientificNameID,taxonRank,kingdom,phylum,class,order,family,genus,scientificNameAuthorship,verbatimIdentification
0,Acartia,urn:lsid:marinespecies.org:taxname:104108,Genus,Animalia,Arthropoda,Copepoda,Calanoida,Acartiidae,Acartia,"Dana, 1846",ACARTIA
1,Acartia hudsonica,urn:lsid:marinespecies.org:taxname:157664,Species,Animalia,Arthropoda,Copepoda,Calanoida,Acartiidae,Acartia,"Pinhey, 1926",ACARTIA HUDSONICA
2,Acartia longiremis,urn:lsid:marinespecies.org:taxname:104257,Species,Animalia,Arthropoda,Copepoda,Calanoida,Acartiidae,Acartia,"(Lilljeborg, 1853)",ACARTIA LONGIREMIS
3,Acartia tonsa,urn:lsid:marinespecies.org:taxname:104262,Species,Animalia,Arthropoda,Copepoda,Calanoida,Acartiidae,Acartia,"Dana, 1849-1852",ACARTIA TONSA
4,Aegina citrea,urn:lsid:marinespecies.org:taxname:117263,Species,Animalia,Cnidaria,Hydrozoa,Narcomedusae,Aeginidae,Aegina,"Eschscholtz, 1829",AEGINA CITREA


## Export `taxonomy_df` to csv

In [33]:
if not debug_no_csvexport:
    taxonomy_df.to_csv(data_pth / 'intermediate_DwC_taxonomy.csv', index=False)

## Package versions

In [34]:
print(
    f"{datetime.utcnow()} +00:00\n"
    f"pandas: {pd.__version__}"
)

2024-12-12 21:50:04.255372 +00:00
pandas: 1.5.3


The version of the `pyworms` package used here doesn't have a version attribute. From https://github.com/iobis/pyworms/blob/master/setup.py, the latest version is 0.2.0, as of February 2021. The latest [tagged release](https://github.com/iobis/pyworms/releases/tag/0.2.0) is 0.2.0, from October 2020.
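
When a package lacks a version attribute, the installed version can often still be read from its distribution metadata with the standard library (Python 3.8 or later). The sketch below assumes `pyworms` was installed with metadata (eg, via pip or conda); `installed_version` is a hypothetical helper, and no particular output is implied.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed distribution version, or None if it is not found."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


print(f"pyworms: {installed_version('pyworms') or 'not found'}")
```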