# DwC Taxonomy Mappings. Keister Zooplankton Hood Canal 2012-13 data

University of Washington Pelagic Hypoxia Hood Canal project, Zooplankton dataset.

5-3, 4-25, 3-28, 2023. 2022-3-14, 1-12

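This notebook reads the pre-processed zooplankton source data, matches the dataset's taxon names against the World Register of Marine Species (WoRMS) using `pyworms`, adds manual matches for names that WoRMS did not resolve, and exports a Darwin Core (DwC) taxonomy table to `intermediate_DwC_taxonomy.csv`.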

In [1]:
from datetime import datetime
from pathlib import Path

import numpy as np
import pandas as pd
import pyworms

from data_preprocess import read_and_parse_sourcedata

## Settings

Set to `True` when debugging. `csv` files will not be exported when `debug_no_csvexport = True`.

In [2]:
debug_no_csvexport = False

## Read the data

In [3]:
data_pth = Path(".")

## Pre-process data from csv for Taxonomy table

### Read the pre-processed csv file

`usecols` defines the columns that will be kept and the order in which they'll be organized

In [4]:
usecols = ['sample_code', 'species']
taxonsource_df = read_and_parse_sourcedata()[usecols]

In [5]:
len(taxonsource_df)

6884

In [6]:
taxonsource_df.head()

|   | sample_code | species |
|---|---|---|
| 0 | 20131003DBDm2_200 | ACARTIA |
| 1 | 20130906DBiDm1_200 | ACARTIA |
| 2 | 20131003DBDm1_200 | ACARTIA |
| 3 | 20131003DBDm1_200 | ACARTIA |
| 4 | 20120614DBDm3_200 | ACARTIA_CLAUSI |


## Perform taxon matching with pyworms

Next, we perform a taxonomic lookup using the [pyworms](https://pyworms.readthedocs.io/en/latest/) library. The function `pyworms.aphiaRecordsByMatchNames()` collects the information used to populate the lookup table.

In [7]:
species_valuecnt = taxonsource_df.species.value_counts()
species_valuecnt

EUPHAUSIA_PACIFICA      647
CALANUS_PACIFICUS       603
PARACALANUS_PARVUS      561
METRIDIA_PACIFICA       450
PSEUDOCALANUS           404
                       ... 
NEOCALANUS_PLUMCHRUS      1
PINNOTHERIDAE             1
OPHIUROIDEA               1
FRITILLARIA               1
TANAIDACEA                1
Name: species, Length: 137, dtype: int64

`species_list` will be a sorted list

In [8]:
species_list = sorted(list(taxonsource_df.species.unique()))
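
As a small illustration of how the names are prepared for the query, the sketch below sorts a few verbatim names taken from this dataset and replaces underscores with spaces. Python's default string sort is case-sensitive, so uppercase letters sort before lowercase ones. The helper name `to_query_name` is hypothetical and not part of the notebook.

```python
# Hypothetical helper (not in the notebook): turn a verbatim dataset
# name into the space-separated form sent to WoRMS.
def to_query_name(verbatim: str) -> str:
    return verbatim.replace("_", " ")


# A few verbatim names that appear in this dataset.
sample_names = ["AETIDEUS_divergens", "ACARTIA_CLAUSI", "AETIDEUS_PACIFICUS", "ACARTIA"]

# sorted() compares code points, so "P" (uppercase) sorts before "d".
sorted_names = sorted(sample_names)
query_names = [to_query_name(s) for s in sorted_names]

print(sorted_names)
print(query_names)
```

This is why `AETIDEUS_divergens` appears after `AETIDEUS_PACIFICUS` in the listings further down.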

In [9]:
%%time
resp = pyworms.aphiaRecordsByMatchNames(
    [s.replace('_', ' ') for s in species_list]
)

CPU times: user 119 ms, sys: 1.41 ms, total: 121 ms
Wall time: 22.7 s


In [10]:
len(resp)

137

In [11]:
resp[1][0]

{'AphiaID': 104251,
 'url': 'https://www.marinespecies.org/aphia.php?p=taxdetails&id=104251',
 'scientificname': 'Acartia clausi',
 'authority': 'Giesbrecht, 1889',
 'status': 'alternate representation',
 'unacceptreason': None,
 'taxonRankID': 220,
 'rank': 'Species',
 'valid_AphiaID': 149755,
 'valid_name': 'Acartia (Acartiura) clausi',
 'valid_authority': 'Giesbrecht, 1889',
 'parentNameUsageID': 104108,
 'kingdom': 'Animalia',
 'phylum': 'Arthropoda',
 'class': 'Copepoda',
 'order': 'Calanoida',
 'family': 'Acartiidae',
 'genus': 'Acartia',
 'citation': 'Walter, T.C.; Boxshall, G. (2023). World of Copepods Database. Acartia clausi Giesbrecht, 1889. Accessed through: World Register of Marine Species at: https://www.marinespecies.org/aphia.php?p=taxdetails&id=104251 on 2023-05-04',
 'lsid': 'urn:lsid:marinespecies.org:taxname:104251',
 'isMarine': 1,
 'isBrackish': 0,
 'isFreshwater': 0,
 'isTerrestrial': 0,
 'isExtinct': None,
 'match_type': 'exact',
 'modified': '2020-05-28T16:04:5

Create a dictionary of successful matches keyed by `AphiaID`, to pull out information conveniently.

In [12]:
resp_matched_byaphiaid = {r[0]['AphiaID']:r[0] for r in resp if len(r) > 0}
len(resp_matched_byaphiaid)

124

13 (137 - 124) of the 137 unique "species" names in the original dataset were not matched by `pyworms`
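
As the code above implies, the response holds one entry per queried name, and each entry is a list of candidate records that is empty when nothing matched. The sketch below reproduces the matched/unmatched tally on a small mock response without calling the WoRMS service. The `mock_resp` records are trimmed, illustrative stand-ins, not real query output.

```python
# Mock stand-in for the pyworms response: one inner list per queried name,
# empty when the name was not matched. Records are trimmed to two fields.
mock_names = ["ACARTIA", "ACARTIA_CLAUSI", "BARNACLES"]
mock_resp = [
    [{"AphiaID": 104108, "scientificname": "Acartia"}],
    [{"AphiaID": 104251, "scientificname": "Acartia clausi"}],
    [],  # no match returned for "BARNACLES"
]

# Same pattern as the notebook: keep the first record of each non-empty entry.
matched_by_aphiaid = {r[0]["AphiaID"]: r[0] for r in mock_resp if len(r) > 0}

# Names whose entry came back empty.
unmatched = [name for name, r in zip(mock_names, mock_resp) if len(r) == 0]

print(len(matched_by_aphiaid), unmatched)
```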

Create simple dictionary of `species` (original dataset species name) vs `AphiaID`

In [13]:
species_worms_all = {}
for spec,r in zip(species_list, resp):
    aphiaID = r[0]['AphiaID'] if len(r) > 0 else None
    species_worms_all[spec] = aphiaID

Print a few of the *successful* matches with their corresponding `AphiaID`. This is followed by an example of using `resp_matched_byaphiaid` to bring in all the taxonomic information returned by `pyworms` (e.g., for the species name 'CANCER_GRACILIS' > 'Metacarcinus gracilis').

In [14]:
for k,v in list(species_worms_all.items())[:10]:
    print(f"{k}: {v}")

ACARTIA: 104108
ACARTIA_CLAUSI: 104251
ACARTIA_LONGIREMIS: 104257
AEQUOREA: 116998
AETIDEIDAE: 104075
AETIDEUS: 104112
AETIDEUS_ARMATUS: 104275
AETIDEUS_PACIFICUS: 254648
AETIDEUS_divergens: 346076
AGLANTHA: 117212


In [15]:
resp_matched_byaphiaid[species_worms_all['ACARTIA_LONGIREMIS']]

{'AphiaID': 104257,
 'url': 'https://www.marinespecies.org/aphia.php?p=taxdetails&id=104257',
 'scientificname': 'Acartia longiremis',
 'authority': '(Lilljeborg, 1853)',
 'status': 'alternate representation',
 'unacceptreason': None,
 'taxonRankID': 220,
 'rank': 'Species',
 'valid_AphiaID': 346037,
 'valid_name': 'Acartia (Acartiura) longiremis',
 'valid_authority': '(Lilljeborg, 1853)',
 'parentNameUsageID': 104108,
 'kingdom': 'Animalia',
 'phylum': 'Arthropoda',
 'class': 'Copepoda',
 'order': 'Calanoida',
 'family': 'Acartiidae',
 'genus': 'Acartia',
 'citation': 'Walter, T.C.; Boxshall, G. (2023). World of Copepods Database. Acartia longiremis (Lilljeborg, 1853). Accessed through: World Register of Marine Species at: https://www.marinespecies.org/aphia.php?p=taxdetails&id=104257 on 2023-05-04',
 'lsid': 'urn:lsid:marinespecies.org:taxname:104257',
 'isMarine': 1,
 'isBrackish': 0,
 'isFreshwater': 0,
 'isTerrestrial': 0,
 'isExtinct': None,
 'match_type': 'exact',
 'modified': '

In the future, we could identify cases where `valid_AphiaID != AphiaID`, select `valid_AphiaID`, reissue a query based on those Aphia IDs, and then use those results instead of the original ones.
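
A minimal sketch of the first step of that idea, assuming records shaped like the ones shown above: collect the `valid_AphiaID` of every record whose `valid_AphiaID` differs from its `AphiaID`. The two Acartia species records use the IDs printed in this notebook; the genus record's `valid_AphiaID` is an illustrative assumption. The follow-up query is left out because it would require network access.

```python
# Trimmed records; the first two use the IDs shown in the outputs above.
records_by_aphiaid = {
    104251: {"AphiaID": 104251, "valid_AphiaID": 149755},
    104257: {"AphiaID": 104257, "valid_AphiaID": 346037},
    # Assumed for illustration: a record that is already the valid one.
    104108: {"AphiaID": 104108, "valid_AphiaID": 104108},
}

# Map each original AphiaID to the valid AphiaID it should be replaced by.
needs_requery = {
    aid: rec["valid_AphiaID"]
    for aid, rec in records_by_aphiaid.items()
    if rec["valid_AphiaID"] is not None and rec["valid_AphiaID"] != aid
}

print(needs_requery)
```

The values of `needs_requery` would be the IDs to look up in a second query.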

## Replace unmatched species with the known matched taxon

In [16]:
species_worms_no_match = [
    (k,species_valuecnt[k]) for k,v in species_worms_all.items()
    if v is None
]

Entries not matched by `pyworms`, followed by their occurrence count

In [17]:
species_worms_no_match

[('BARNACLES', 116),
 ('BRYOZOAN', 80),
 ('CANCER_ANTENNARIUS', 2),
 ('CRABS', 11),
 ('FISH', 45),
 ('JELLYFISHES', 9),
 ('MITES', 1),
 ('PLAINFIN_MIDSHIPMAN', 1),
 ('PORCELAIN_CRABS', 1),
 ('SHRIMP', 100),
 ('UNKNOWN_EGG_JK', 13),
 ('WORM', 3),
 ('monstrilloid', 2)]

**Several of these have matches in taxonomic databases other than WoRMS when using the [Global Names Resolver](http://resolver.globalnames.org/api).** Others can be identified with a bit of digging, to varying degrees of taxonomic resolution/rank (e.g., BRYOZOAN, PLAINFIN_MIDSHIPMAN, PORCELAIN_CRABS, monstrilloid). These resources were used to find matching scientific names.

**We'll use the following "manual" matches. No match will be attempted for 3 entries: 'MITES', 'UNKNOWN_EGG_JK', 'WORM'**

In [18]:
species_replacements = {
    'BARNACLES': 'Cirripedia',
    'BRYOZOAN': 'Bryozoa',
    'CANCER_ANTENNARIUS': 'Romaleon antennarium',
    'PLAINFIN_MIDSHIPMAN': 'Porichthys notatus',
    'PORCELAIN_CRABS': 'Porcellanidae',
    'monstrilloid': 'Monstrillidae',
    'CRABS': 'Brachyura',
    'FISH': 'Actinopterygii',
    'JELLYFISHES': 'Cnidaria',
    'SHRIMP': 'Dendrobranchiata',
}

species_replacements_rev = {v:k for k,v in species_replacements.items()}
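
Reversing a dictionary this way is only safe when the replacement names are unique; otherwise later keys silently overwrite earlier ones. A quick sanity check, sketched here on a subset of the mapping above, could look like this.

```python
# Subset of the manual replacements defined above.
replacements = {
    "BARNACLES": "Cirripedia",
    "BRYOZOAN": "Bryozoa",
    "CRABS": "Brachyura",
}

# Reverse lookup: scientific name -> verbatim dataset name.
replacements_rev = {v: k for k, v in replacements.items()}

# If two verbatim names shared a replacement, the reversed dict would be shorter.
assert len(replacements_rev) == len(replacements), "replacement names are not unique"

print(replacements_rev["Brachyura"])
```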

Taxa that will remain unmatched, unprocessed:

In [19]:
{spec[0] for spec in species_worms_no_match} - species_replacements.keys()

{'MITES', 'UNKNOWN_EGG_JK', 'WORM'}

In [20]:
species_list2 = species_replacements.values()

In [21]:
resp2 = pyworms.aphiaRecordsByMatchNames(
    [s.replace('_', ' ') for s in species_list2]
)

In [22]:
resp_matched_byaphiaid2 = {r[0]['AphiaID']:r[0] for r in resp2 if len(r) > 0}
len(resp_matched_byaphiaid2)

10

In [23]:
species_worms_2 = {}
for spec,r in zip(species_list2, resp2):
    aphiaID = r[0]['AphiaID'] if len(r) > 0 else None
    species_worms_2[species_replacements_rev[spec]] = aphiaID

## Final taxon look-up table

`aphiaid` will be `None` if there was no match.

In [24]:
species_worms_lut_df = pd.DataFrame.from_dict(species_worms_all, orient='index', columns=['aphiaid'])
species_worms_lut_df['species'] = species_worms_lut_df.index

len(species_worms_lut_df)

137

In [25]:
species_worms_lut_df.head()

|   | aphiaid | species |
|---|---|---|
| ACARTIA | 104108.0 | ACARTIA |
| ACARTIA_CLAUSI | 104251.0 | ACARTIA_CLAUSI |
| ACARTIA_LONGIREMIS | 104257.0 | ACARTIA_LONGIREMIS |
| AEQUOREA | 116998.0 | AEQUOREA |
| AETIDEIDAE | 104075.0 | AETIDEIDAE |


`aphiaid` is cast to float because the column contains some nulls. After the manual matches are filled in and the remaining null rows are removed, it is cast to int.

In [26]:
sel = species_worms_lut_df['species'].isin(species_worms_2.keys())
species_worms_lut_df.loc[sel, 'aphiaid'] = species_worms_lut_df[sel]['species'].apply(
    lambda s: species_worms_2[s]
)

species_worms_lut_df.dropna(axis='rows', inplace=True)
species_worms_lut_df['aphiaid'] = species_worms_lut_df.aphiaid.astype(int)

len(species_worms_lut_df)

134
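
The float cast happens because a NumPy-backed integer column cannot hold missing values. As an alternative sketch (not what this notebook does), pandas' nullable `Int64` dtype keeps integer IDs alongside missing values. The rows below are a small illustrative subset.

```python
import pandas as pd

# One matched ID and one unmatched (None) entry, as in the look-up table.
lut = pd.DataFrame(
    {"aphiaid": [104108, None]},
    index=["ACARTIA", "MITES"],
)

# With a missing value present, pandas infers a float column.
float_dtype = str(lut["aphiaid"].dtype)

# The nullable integer dtype keeps integers and marks the gap as <NA>.
lut["aphiaid"] = lut["aphiaid"].astype("Int64")
nullable_dtype = str(lut["aphiaid"].dtype)

print(float_dtype, nullable_dtype)
```

With this dtype, the null rows could be dropped later without an intermediate float representation.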

In [27]:
species_worms_lut_df.head()

|   | aphiaid | species |
|---|---|---|
| ACARTIA | 104108 | ACARTIA |
| ACARTIA_CLAUSI | 104251 | ACARTIA_CLAUSI |
| ACARTIA_LONGIREMIS | 104257 | ACARTIA_LONGIREMIS |
| AEQUOREA | 116998 | AEQUOREA |
| AETIDEIDAE | 104075 | AETIDEIDAE |


In [28]:
len(resp_matched_byaphiaid), len(resp_matched_byaphiaid2)

(124, 10)

One entry from `resp_matched_byaphiaid2` is already present in `resp_matched_byaphiaid`, so the merged dictionary will hold 133 (124 + 10 - 1) entries rather than 134. It's 'Brachyura' (see below), the same taxon as the manual assignment for 'CRABS'.

In [29]:
resp_matched_byaphiaid2[
    list(set(resp_matched_byaphiaid.keys()) &  set(resp_matched_byaphiaid2.keys()))[0]
]['scientificname']

'Brachyura'

`|=` is the in-place dictionary union (merge) operator, available since Python 3.9 (https://blog.finxter.com/python-inplace-or-operator-meaning/). `dict_1 |= dict_2` updates `dict_1` in place with the entries of `dict_2`; for keys present in both, the value from `dict_2` wins.

In [30]:
resp_matched_byaphiaid |= resp_matched_byaphiaid2

In [31]:
len(resp_matched_byaphiaid)

133
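
To see why the merged length is one less than the sum of the two lengths, here is a small sketch of `|=` with one overlapping key (requires Python 3.9 or later). The dictionaries are trimmed stand-ins, and the integer keys are placeholders, not real AphiaIDs.

```python
# Trimmed stand-ins for the two record dictionaries, keyed by placeholder IDs.
first = {1: "Acartia", 2: "Brachyura"}
second = {2: "Brachyura", 3: "Cirripedia"}

# Keys present in both dictionaries.
shared = set(first) & set(second)

# In-place union: entries from `second` are added to `first`;
# for shared keys the value from `second` replaces the one in `first`.
first |= second

print(len(first), shared)
```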

In [32]:
col_mapping = [
    ('scientificName', 'scientificname'),
    ('scientificNameID', 'lsid'),
    ('taxonRank', 'rank'),
    ('kingdom', 'kingdom'),
    ('phylum', 'phylum'),
    ('class', 'class'),
    ('order', 'order'),
    ('family', 'family'),
    ('genus', 'genus'),
    ('scientificNameAuthorship', 'authority')
]

for cm in col_mapping:
    species_worms_lut_df[cm[0]] = species_worms_lut_df['aphiaid'].apply(
        lambda aid: resp_matched_byaphiaid[aid][cm[1]]
    )

In [33]:
species_worms_lut_df.head(15)

|   | aphiaid | species | scientificName | scientificNameID | taxonRank | kingdom | phylum | class | order | family | genus | scientificNameAuthorship |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ACARTIA | 104108 | ACARTIA | Acartia | urn:lsid:marinespecies.org:taxname:104108 | Genus | Animalia | Arthropoda | Copepoda | Calanoida | Acartiidae | Acartia | Dana, 1846 |
| ACARTIA_CLAUSI | 104251 | ACARTIA_CLAUSI | Acartia clausi | urn:lsid:marinespecies.org:taxname:104251 | Species | Animalia | Arthropoda | Copepoda | Calanoida | Acartiidae | Acartia | Giesbrecht, 1889 |
| ACARTIA_LONGIREMIS | 104257 | ACARTIA_LONGIREMIS | Acartia longiremis | urn:lsid:marinespecies.org:taxname:104257 | Species | Animalia | Arthropoda | Copepoda | Calanoida | Acartiidae | Acartia | (Lilljeborg, 1853) |
| AEQUOREA | 116998 | AEQUOREA | Aequorea | urn:lsid:marinespecies.org:taxname:116998 | Genus | Animalia | Cnidaria | Hydrozoa | Leptothecata | Aequoreidae | Aequorea | Péron & Lesueur, 1810 |
| AETIDEIDAE | 104075 | AETIDEIDAE | Aetideidae | urn:lsid:marinespecies.org:taxname:104075 | Family | Animalia | Arthropoda | Copepoda | Calanoida | Aetideidae |  | Giesbrecht, 1892 |
| AETIDEUS | 104112 | AETIDEUS | Aetideus | urn:lsid:marinespecies.org:taxname:104112 | Genus | Animalia | Arthropoda | Copepoda | Calanoida | Aetideidae | Aetideus | Brady, 1883 |
| AETIDEUS_ARMATUS | 104275 | AETIDEUS_ARMATUS | Aetideus armatus | urn:lsid:marinespecies.org:taxname:104275 | Species | Animalia | Arthropoda | Copepoda | Calanoida | Aetideidae | Aetideus | (Boeck, 1872) |
| AETIDEUS_PACIFICUS | 254648 | AETIDEUS_PACIFICUS | Aetideus pacificus | urn:lsid:marinespecies.org:taxname:254648 | Species | Animalia | Arthropoda | Copepoda | Calanoida | Aetideidae | Aetideus | Brodsky, 1950 |
| AETIDEUS_divergens | 346076 | AETIDEUS_divergens | Aetideus divergens | urn:lsid:marinespecies.org:taxname:346076 | Species | Animalia | Arthropoda | Copepoda | Calanoida | Aetideidae | Aetideus | Bradford, 1971 |
| AGLANTHA | 117212 | AGLANTHA | Aglantha | urn:lsid:marinespecies.org:taxname:117212 | Genus | Animalia | Cnidaria | Hydrozoa | Trachymedusae | Rhopalonematidae | Aglantha | Haeckel, 1879 |


## Create and populate `taxonomy_df` dataframe

In [34]:
taxonomy_df = taxonsource_df.merge(
    species_worms_lut_df, 
    how='inner', 
    on='species'
)

taxonomy_df['verbatimIdentification'] = taxonomy_df['species']
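
Because the merge is an inner join, records whose `species` has no row in the look-up table (the three unmatched names) are dropped from `taxonomy_df`. A small sketch with illustrative rows; the sample codes `s1` to `s3` are hypothetical.

```python
import pandas as pd

# Illustrative source records; "WORM" has no entry in the look-up table.
source = pd.DataFrame({
    "sample_code": ["s1", "s2", "s3"],
    "species": ["ACARTIA", "WORM", "ACARTIA"],
})
lut = pd.DataFrame({"species": ["ACARTIA"], "aphiaid": [104108]})

# Inner join keeps only rows whose species appears in both frames.
merged = source.merge(lut, how="inner", on="species")

print(len(source), len(merged))
```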

Some summaries, based on individual records:

In [35]:
taxonomy_df.taxonRank.value_counts()

Species       4047
Genus         1216
Class          603
Family         327
Phylum         214
Suborder       146
Order          132
Subclass       116
Gigaclass       45
Infraorder      21
Name: taxonRank, dtype: int64

In [36]:
taxonomy_df[taxonomy_df.taxonRank == 'Phylum'].scientificName.value_counts()

Chaetognatha     108
Bryozoa           80
Echinodermata     17
Cnidaria           9
Name: scientificName, dtype: int64

In [37]:
taxonomy_df[taxonomy_df.taxonRank.isin(['Class', 'Subclass'])].scientificName.value_counts()

Polychaeta     157
Bivalvia       145
Gastropoda     145
Cirripedia     116
Larvacea        91
Ostracoda       48
Copepoda        14
Hydrozoa         2
Ophiuroidea      1
Name: scientificName, dtype: int64

Clean up and simplify into a taxonomy table without duplicates

In [38]:
taxonomy_df = (
    taxonomy_df
    .drop(columns=['sample_code', 'species', 'aphiaid'])
    .drop_duplicates()
    .sort_values(by='verbatimIdentification')
    .reset_index(drop=True)
)

In [39]:
len(taxonomy_df)

134

In [40]:
taxonomy_df.head()

|   | scientificName | scientificNameID | taxonRank | kingdom | phylum | class | order | family | genus | scientificNameAuthorship | verbatimIdentification |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Acartia | urn:lsid:marinespecies.org:taxname:104108 | Genus | Animalia | Arthropoda | Copepoda | Calanoida | Acartiidae | Acartia | Dana, 1846 | ACARTIA |
| 1 | Acartia clausi | urn:lsid:marinespecies.org:taxname:104251 | Species | Animalia | Arthropoda | Copepoda | Calanoida | Acartiidae | Acartia | Giesbrecht, 1889 | ACARTIA_CLAUSI |
| 2 | Acartia longiremis | urn:lsid:marinespecies.org:taxname:104257 | Species | Animalia | Arthropoda | Copepoda | Calanoida | Acartiidae | Acartia | (Lilljeborg, 1853) | ACARTIA_LONGIREMIS |
| 3 | Aequorea | urn:lsid:marinespecies.org:taxname:116998 | Genus | Animalia | Cnidaria | Hydrozoa | Leptothecata | Aequoreidae | Aequorea | Péron & Lesueur, 1810 | AEQUOREA |
| 4 | Aetideidae | urn:lsid:marinespecies.org:taxname:104075 | Family | Animalia | Arthropoda | Copepoda | Calanoida | Aetideidae |  | Giesbrecht, 1892 | AETIDEIDAE |


Some summaries, now based on unique taxon records (no duplicates):

In [41]:
taxonomy_df.taxonRank.value_counts()

Species       51
Genus         35
Family        17
Order         11
Class          8
Phylum         4
Infraorder     3
Suborder       3
Subclass       1
Gigaclass      1
Name: taxonRank, dtype: int64

In [42]:
taxonomy_df[taxonomy_df.taxonRank == 'Phylum'].scientificName.value_counts()

Bryozoa          1
Chaetognatha     1
Echinodermata    1
Cnidaria         1
Name: scientificName, dtype: int64

In [43]:
taxonomy_df[taxonomy_df.taxonRank.isin(['Class', 'Subclass'])].scientificName.value_counts()

Cirripedia     1
Bivalvia       1
Copepoda       1
Gastropoda     1
Hydrozoa       1
Larvacea       1
Ophiuroidea    1
Ostracoda      1
Polychaeta     1
Name: scientificName, dtype: int64

## Export `taxonomy_df` to csv

In [44]:
if not debug_no_csvexport:
    taxonomy_df.to_csv(data_pth / 'intermediate_DwC_taxonomy.csv', index=False)

## Package versions

In [45]:
print(
    f"{datetime.utcnow()} +00:00\n"
    f"pandas: {pd.__version__}"
)

2023-05-04 07:17:07.526523 +00:00
pandas: 1.5.3


The `pyworms` package doesn't have a version attribute. From https://github.com/iobis/pyworms/blob/master/setup.py, the latest version is 0.2.0, as of February 2021. The latest [tagged release](https://github.com/iobis/pyworms/releases/tag/0.2.0) is 0.2.0, from October 2020.
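
When a package lacks a `__version__` attribute, the installed version can usually still be read from its distribution metadata with the standard library's `importlib.metadata` (Python 3.8+). A minimal sketch, assuming the distribution is installed under the name `pyworms`; the helper `installed_version` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# "pyworms" is assumed to be the distribution name used at install time.
print(f"pyworms: {installed_version('pyworms')}")
```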