In [39]:
#| default_exp handlers.ospar

# OSPAR (WIP)
Data pipeline (handler) to convert OSPAR data ([source](https://odims.ospar.org/en/)) to `NetCDF` format.


***

## OSPAR Environment database

OSPAR [data](https://odims.ospar.org/en/) is provided as a Microsoft Access database. 
`Mdbtools` (https://github.com/mdbtools/mdbtools) can be used to convert the tables of the Microsoft Access database to `.csv` files on Unix-like OS.

Example steps:
1. Download data.
2. Install mdbtools via the VS Code terminal

    ```
    sudo apt-get -y install mdbtools
    ```

3. Install unzip via the VS Code terminal

    ```
    sudo apt-get -y install unzip
    ```

4. In the VS Code terminal, navigate to the marisco data folder

    ```
    cd /home/marisco/downloads/marisco/_data/accdb/mors_19840101_20211231
    ```

5. Unzip MORS_ENVIRONMENT.zip 

    ```
    unzip MORS_ENVIRONMENT.zip 
    ```

6. Run preprocess.sh to generate the required data files

    ```
    ./preprocess.sh MORS_ENVIRONMENT.zip
    ```

7. Contents of the `preprocess.sh` script:
    ```
    #!/bin/bash

    # Example of use: ./preprocess.sh MORS_ENVIRONMENT.zip
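    unzip "$1"
    dbname=$(ls *.accdb *.mdb 2>/dev/null)
    mkdir -p csv
    # Read one table name per line so that names containing spaces
    # (e.g. "Seawater data") are not split into separate words.
    mdb-tables -1 "$dbname" | while IFS= read -r table; do
        echo "Export table $table"
        mdb-export "$dbname" "$table" > "csv/$table.csv"
    done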
    ```


***

## Packages import

In [40]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [41]:
#| export
import pandas as pd # Python package that provides fast, flexible, and expressive data structures.
import numpy as np
from tqdm import tqdm # Python Progress Bar Library
from functools import partial # Returns a new callable that behaves like func called with preset positional and keyword arguments
import fastcore.all as fc # package that brings fastcore functionality, see https://fastcore.fast.ai/.
from pathlib import Path # This module offers classes representing filesystem paths
from dataclasses import asdict
import re # provides regular expression matching operations

from marisco.utils import (has_valid_varname, match_worms, match_maris_lut, Match)
from marisco.callbacks import (Callback, Transformer, EncodeTimeCB, SanitizeLonLatCB)
from marisco.metadata import (GlobAttrsFeeder, BboxCB, DepthRangeCB, TimeRangeCB, ZoteroCB, KeyValuePairCB)
from marisco.configs import (base_path, nc_tpl_path, cfg, cache_path, cdl_cfg, Enums, lut_path,
                             species_lut_path, sediments_lut_path, bodyparts_lut_path)
from marisco.serializers import NetCDFEncoder


In [42]:
import warnings
warnings.filterwarnings('ignore')

Get the current working directory (cwd).  

In [43]:
Path.cwd()

Path('/home/marisco/downloads/marisco/nbs/handlers')

Here we define the fname_in and fname_out variables. Both are paths relative to the current working directory. fname_in points to the csv folder that contains the OSPAR data; fname_out defines the path and filename of the NetCDF output.

In [44]:
fname_in = '../../_data/accdb/ospar/csv'
fname_out = '../../_data/output/ospar_19950103_2021214.nc'

***

## Utils

In [45]:
#| export
def load_data(src_dir,
              smp_types=('Seawater data', 'Biota data')):
    """
    Load OSPAR data and return one dataframe per sample type.

    Reads the csv files found in `src_dir` (i.e. `fname_in`), which hold both the
    measurement and the sample information. Returns a dictionary of pandas
    dataframes keyed by sample type (i.e. the values of `lut_smp_type`).
    """
    dfs = {}
    lut_smp_type = {'Seawater data': 'seawater', 'Biota data': 'biota'}
    for smp_type in smp_types:
        fname_meas = smp_type + '.csv' # measurement (i.e. radioactivity) information and sample information     
        df = pd.read_csv(Path(src_dir)/fname_meas, encoding='unicode_escape')
        dfs[lut_smp_type[smp_type]] = df
    return dfs

In [46]:
dfs = load_data(fname_in)
dfs

{'seawater':            ID Contracting Party  RSC Sub-division   Station ID Sample ID  \
 0           1           Belgium               8.0  Belgica-W01    WNZ 01   
 1           2           Belgium               8.0  Belgica-W02    WNZ 02   
 2           3           Belgium               8.0  Belgica-W03    WNZ 03   
 3           4           Belgium               8.0  Belgica-W04    WNZ 04   
 4           5           Belgium               8.0  Belgica-W05    WNZ 05   
 ...       ...               ...               ...          ...       ...   
 18851  121646    United Kingdom              10.0       Rosyth   2100318   
 18852  121647    United Kingdom              10.0       Rosyth   2101399   
 18853  121648    United Kingdom               6.0        Wylfa    21-656   
 18854  121649    United Kingdom               6.0        Wylfa    21-657   
 18855  121650    United Kingdom               6.0        Wylfa    21-654   
 
        LatD  LatM  LatS LatDir  LongD  ...  Sampling date  Nu

In [47]:
#| export
def rename_cols(cols):
    "Flatten multiindex columns"
    new_cols = []
    for outer, inner in cols:
        if not inner:
            new_cols.append(outer)
        else:
            if outer == 'unit':
                new_cols.append(inner + '_' + outer)
            if outer == 'unc':
                new_cols.append(inner + '_' + outer)
            if outer == 'value':
                new_cols.append(inner)
    return new_cols
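
To make the effect of `rename_cols` concrete, here is a small sketch that runs a condensed, behavior-equivalent copy of the function on plain tuples. The column tuples are hypothetical and only illustrate the `(outer, inner)` shape of a flattened MultiIndex; they are not taken from the OSPAR data.

```python
def rename_cols(cols):
    "Flatten (outer, inner) column tuples into single column names."
    new_cols = []
    for outer, inner in cols:
        if not inner:
            # No inner level: keep the outer name as is.
            new_cols.append(outer)
        elif outer in ('unit', 'unc'):
            # Unit and uncertainty columns get a suffix.
            new_cols.append(f'{inner}_{outer}')
        elif outer == 'value':
            # The value column takes the bare inner name (e.g. the nuclide).
            new_cols.append(inner)
    return new_cols

# Hypothetical column tuples, for illustration only.
cols = [('time', ''), ('value', 'cs137'), ('unc', 'cs137'), ('unit', 'cs137')]
flat = rename_cols(cols)
print(flat)
```

Any tuple whose outer level is not `value`, `unc` or `unit` and whose inner level is non-empty is silently skipped, which is also how the original function behaves.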

***

## Load tables (dataframes)

dfs is a dictionary of tables (dataframes) created from the OSPAR dataset located at fname_in. There is one dataframe per sample type, and each dictionary key is the sample type of the corresponding dataframe.

In [48]:
dfs = load_data(fname_in)
dfs

{'seawater':            ID Contracting Party  RSC Sub-division   Station ID Sample ID  \
 0           1           Belgium               8.0  Belgica-W01    WNZ 01   
 1           2           Belgium               8.0  Belgica-W02    WNZ 02   
 2           3           Belgium               8.0  Belgica-W03    WNZ 03   
 3           4           Belgium               8.0  Belgica-W04    WNZ 04   
 4           5           Belgium               8.0  Belgica-W05    WNZ 05   
 ...       ...               ...               ...          ...       ...   
 18851  121646    United Kingdom              10.0       Rosyth   2100318   
 18852  121647    United Kingdom              10.0       Rosyth   2101399   
 18853  121648    United Kingdom               6.0        Wylfa    21-656   
 18854  121649    United Kingdom               6.0        Wylfa    21-657   
 18855  121650    United Kingdom               6.0        Wylfa    21-654   
 
        LatD  LatM  LatS LatDir  LongD  ...  Sampling date  Nu

List the keys for the dictionary of dataframes.  

In [49]:
keys=dfs.keys()
keys

dict_keys(['seawater', 'biota'])

Show the structure of the 'seawater' dataframe. 

In [50]:
dfs['seawater'].head()


Unnamed: 0,ID,Contracting Party,RSC Sub-division,Station ID,Sample ID,LatD,LatM,LatS,LatDir,LongD,...,Sampling date,Nuclide,Value type,Activity or MDA,Uncertainty,Unit,Data provider,Measurement Comment,Sample Comment,Reference Comment
0,1,Belgium,8.0,Belgica-W01,WNZ 01,51.0,22.0,31.0,N,3.0,...,27/01/2010,137Cs,<,0.2,,Bq/l,SCKCEN,,,
1,2,Belgium,8.0,Belgica-W02,WNZ 02,51.0,13.0,25.0,N,2.0,...,27/01/2010,137Cs,<,0.27,,Bq/l,SCKCEN,,,
2,3,Belgium,8.0,Belgica-W03,WNZ 03,51.0,11.0,4.0,N,2.0,...,27/01/2010,137Cs,<,0.26,,Bq/l,SCKCEN,,,
3,4,Belgium,8.0,Belgica-W04,WNZ 04,51.0,25.0,13.0,N,3.0,...,27/01/2010,137Cs,<,0.25,,Bq/l,SCKCEN,,,
4,5,Belgium,8.0,Belgica-W05,WNZ 05,51.0,24.0,58.0,N,2.0,...,26/01/2010,137Cs,<,0.2,,Bq/l,SCKCEN,,,


Show the structure of the 'biota' dataframe. 

In [51]:
dfs['biota'].head()

Unnamed: 0,ID,Contracting Party,RSC Sub-division,Station ID,Sample ID,LatD,LatM,LatS,LatDir,LongD,...,Sampling date,Nuclide,Value type,Activity or MDA,Uncertainty,Unit,Data provider,Measurement Comment,Sample Comment,Reference Comment
0,96793,United Kingdom,5,Hunterston,2200086,55,43,31.0,N,4,...,31/12/2021,"239,240Pu",=,0.351,0.066,Bq/kg f.w.,SEPA-Scottish Environment Protection Agency,,"PLZ. Annual bulk of 2 samples, representative ...",
1,96822,United Kingdom,6,Chapelcross,2200081,54,58,8.0,N,3,...,31/12/2021,99Tc,=,39.0,15.0,Bq/kg f.w.,SEPA-Scottish Environment Protection Agency,,PLZ,
2,96823,United Kingdom,7,Dounreay,2200093,58,33,57.0,N,3,...,31/12/2021,"239,240Pu",=,0.0938,0.018,Bq/kg f.w.,SEPA-Scottish Environment Protection Agency,,"Sandside Bay. Annual bulk of 4 samples, repre...",
3,96824,United Kingdom,7,Dounreay,2200089,58,37,7.0,N,3,...,31/12/2021,"239,240Pu",=,1.54,0.31,Bq/kg f.w.,SEPA-Scottish Environment Protection Agency,,"Brims Ness. Annual bulk of 4 samples, represe...",
4,96857,United Kingdom,10,Torness,2100074,55,57,53.0,N,2,...,31/12/2021,99Tc,=,16.0,6.0,Bq/kg f.w.,SEPA-Scottish Environment Protection Agency,,"Thornton Loch. Annual bulk of 2 samples, repre...",


***

## Data transformation pipeline

### Normalize nuclide names

**Lower & strip** 

Define a callback class, LowerStripRdnNameCB, that receives a dictionary of dataframes. For each dataframe it drops the rows without a nuclide name, converts the contents of the 'Nuclide' column to lowercase, removes separators (`-`, `,`) and whitespace, and moves a leading mass number behind the element symbol (e.g. 137cs becomes cs137).

In [99]:
#| export
class LowerStripRdnNameCB(Callback):
    "Drop NaN nuclide names, convert nuclide names to lowercase, strip separators (e.g. `-`,`,`) and any trailing space(s)"
    def __call__(self, tfm):
        # Drop NaN entries. 
        self.drop_nan(tfm)        
        
        # Apply condition to Nuclide col. 
        for k in tfm.dfs.keys():
            tfm.dfs[k]['Nuclide'] = tfm.dfs[k]['Nuclide'].apply(self.condition)
            
    def drop_nan(self, tfm):
        for k in tfm.dfs.keys():
            tfm.dfs[k] = tfm.dfs[k][tfm.dfs[k]['Nuclide'].notna()]
                
    def condition(self, var):
        # lowercase, strip separators (e.g. `-`,`,`) and any white-space(s)
        separators="-,"
        var= var.lower().translate({ord(x): '' for x in separators}).replace(" ", "")
        
        # Format nuclide name with number then letters (e.g. 137cs) to 
        # letters and then numbers (e.g. cs137).
        reg_num_str=re.compile("([0-9]+)([a-zA-Z]+)")
        sol=reg_num_str.match(var)
        if sol is not None:
            reg_group=sol.groups()
            var=reg_group[1]+reg_group[0]
        return (var)  
    

Here we apply the transformer LowerStripRdnNameCB and print the unique nuclide names found in the 'Nuclide' column of each dataframe included in the dictionary of dataframes.

In [100]:
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[LowerStripRdnNameCB()])
print('seawater nuclides: ')
print(tfm()['seawater']['Nuclide'].unique())
print('biota nuclides: ')
print(tfm()['biota']['Nuclide'].unique())

seawater nuclides: 
['cs137' 'pu239240' 'ra226' 'ra228' 'tc99' 'h3' 'po210' 'pb210']
biota nuclides: 
['pu239240' 'tc99' 'cs137' 'ra226' 'ra228' 'pu238' 'am241' 'cs134' 'h3'
 'pb210' 'po210']
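
The normalization can also be tried in isolation, without pandas or the `Transformer`. The sketch below restates the logic of `condition` as a standalone function; the name `normalize_nuclide` and the padded input `' 3H '` are illustrative assumptions, while `137Cs`, `239,240Pu` and `99Tc` appear in the raw data shown earlier.

```python
import re

SEPARATORS = '-,'
# A name that starts with digits followed by letters, e.g. "137cs".
NUM_THEN_LETTERS = re.compile(r'([0-9]+)([a-zA-Z]+)')

def normalize_nuclide(name):
    "Lowercase, remove separators and spaces, then put letters before digits."
    name = name.lower().translate({ord(c): '' for c in SEPARATORS}).replace(' ', '')
    m = NUM_THEN_LETTERS.match(name)
    # Names that do not start with a number are returned unchanged.
    return m.group(2) + m.group(1) if m else name

raw = ['137Cs', '239,240Pu', '99Tc', ' 3H ']
normalized = [normalize_nuclide(n) for n in raw]
print(normalized)
```

Because a name that already starts with letters does not match the pattern, applying the function twice gives the same result as applying it once.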


***


#### Remap to MARIS nuclide names 

The marisco package includes a template that defines the permitted structure of the data. This template is located at `nc_tpl_path` and is available in a `NetCDF` format.
This template can be viewed in a human-readable form as CDL (Common Data Language).

Path to maris-template.nc

In [54]:
nc_tpl_path()

Path('/home/marisco/.marisco/maris-template.nc')

*View the 'maris-template.nc' with 'ncdump' via Terminal*
```
cd /home/marisco/.marisco/
ncdump -h maris-template.nc
```

The function get_unique_nuclides returns a list of the unique nuclides of each dataframe included in the dictionary of dataframes. A nuclide present in several dataframes appears once per dataframe.

In [55]:
#| export
def get_unique_nuclides(dfs):
    "Get list of unique radionuclide types measured across samples."
    nuclides = []
    for k in dfs.keys():
        nuclides += dfs[k]['Nuclide'].unique().tolist()
    return nuclides

The function has_valid_varname checks whether the variable names found in the dataframes (i.e. the OSPAR dataset), in this case nuclide names, are consistent with the template defined by maris-template.nc. It prints every variable name that is not valid.

In [56]:
has_valid_varname(get_unique_nuclides(tfm.dfs), nc_tpl_path())

"pu239240" variable name not found in MARIS CDL
"pu239240" variable name not found in MARIS CDL


False

Create a look up table, varnames_lut_updates, which will be used to correct the nuclide names in the dictionary of dataframes (i.e. dfs) that are not compatible with the template at nc_tpl_path. 

Note: in the HELCOM dataset, cs138, cs139, cs140, cs141, cs142, cs143, cs144, cs145 and cs146 are known errors that all stand for cs137. In the OSPAR dataset, the check above only reports pu239240.

In [57]:
#| export
varnames_lut_updates = {
    'pu239240': 'pu239_240_tot'}

Create a function, get_varnames_lut, which returns a dictionary of nuclide names. This dictionary of nuclide names includes the 'Nuclide' names in the dictionary and the corrections included in varnames_lut_updates.

In [58]:
#| export
def get_varnames_lut(dfs, lut=varnames_lut_updates):
    "Map each unique nuclide name to itself, then apply the corrections given in `lut`."
    varnames_lut = {n: n for n in set(get_unique_nuclides(dfs))}
    varnames_lut.update(lut)
    return varnames_lut

Create the varnames_lut variable, a dictionary of nuclide names including updates defined by varnames_lut_updates.  

In [59]:
#|eval: false
varnames_lut = partial(get_varnames_lut, lut=varnames_lut_updates)(tfm.dfs)
varnames_lut

{'pu239240': 'pu239_240_tot',
 'ra228': 'ra228',
 'ra226': 'ra226',
 'cs134': 'cs134',
 'po210': 'po210',
 'h3': 'h3',
 'am241': 'am241',
 'tc99': 'tc99',
 'pu238': 'pu238',
 'pb210': 'pb210',
 'cs137': 'cs137'}
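
The lookup table is an identity mapping over the observed names, overridden by the corrections. The following minimal sketch shows this on a plain list; `build_varnames_lut` is a hypothetical stand-in for `get_varnames_lut` that takes a list of names instead of the dictionary of dataframes.

```python
def build_varnames_lut(nuclides, updates):
    "Identity mapping over the unique names, overridden by `updates`."
    lut = {n: n for n in set(nuclides)}
    lut.update(updates)
    return lut

# Names as they look after normalization; duplicates are expected.
nuclides = ['cs137', 'pu239240', 'tc99', 'pu239240']
updates = {'pu239240': 'pu239_240_tot'}

lut = build_varnames_lut(nuclides, updates)
remapped = [lut[n] for n in nuclides]
print(remapped)
```

Keeping the valid names in the table, rather than only the corrections, means every name found in the data has an entry, so a plain dictionary lookup never fails.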

Create a class that remaps the nuclide names in dfs according to the lookup table returned by get_varnames_lut.

In [60]:
# | export
class RemapRdnNameCB(Callback):
    "Remap to MARIS radionuclide names."
    def __init__(self,
                 fn_lut=partial(get_varnames_lut, lut=varnames_lut_updates)):
        fc.store_attr()
        
    def __call__(self, tfm):       
        # Replace 'Nuclide' vars according to lut. 
        lut = self.fn_lut(tfm.dfs)
        for k in tfm.dfs.keys():
            tfm.dfs[k]['Nuclide'] = tfm.dfs[k]['Nuclide'].replace(lut)


Apply the transformers LowerStripRdnNameCB and RemapRdnNameCB. Print the unique nuclides for each dataframe included in the dictionary of dataframes. 

In [61]:
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[LowerStripRdnNameCB(),
                            RemapRdnNameCB()])
print('seawater nuclides: ')
print(tfm()['seawater']['Nuclide'].unique())
print('biota nuclides: ')
print(tfm()['biota']['Nuclide'].unique())


seawater nuclides: 
['cs137' 'pu239_240_tot' 'ra226' 'ra228' 'tc99' 'h3' 'po210' 'pb210']
biota nuclides: 
['pu239_240_tot' 'tc99' 'cs137' 'ra226' 'ra228' 'pu238' 'am241' 'cs134'
 'h3' 'pb210' 'po210']


Check that all nuclide varnames are valid. Returns True if all are valid.

In [62]:
has_valid_varname(get_unique_nuclides(tfm.dfs), nc_tpl_path())

True

### Parse time

Create a class that parses the 'Sampling date' column of each dataframe (format '%d/%m/%Y') into a datetime column named 'time'. Rows without a sampling date are dropped.

In [63]:
#| export
class ParseTimeCB(Callback):
    "Drop rows without a sampling date and parse 'Sampling date' into a datetime 'time' column."
    def __call__(self, tfm):
        self.drop_nan(tfm)
        for k in tfm.dfs.keys():
            tfm.dfs[k]['time'] = pd.to_datetime(tfm.dfs[k]['Sampling date'], 
                                                format='%d/%m/%Y')
    def drop_nan(self, tfm):
        for k in tfm.dfs.keys():
            tfm.dfs[k] = tfm.dfs[k][tfm.dfs[k]['Sampling date'].notna()]

In [64]:
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[LowerStripRdnNameCB(),
                            RemapRdnNameCB(),
                            ParseTimeCB()])

print(tfm()['seawater']['time'][:5])

0   2010-01-27
1   2010-01-27
2   2010-01-27
3   2010-01-27
4   2010-01-26
Name: time, dtype: datetime64[ns]
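
The explicit day-first format matters because a date such as 03/01/1995 is ambiguous without it. The sketch below illustrates the same format string with the standard library only; `parse_sampling_date` is a hypothetical helper, and the first two sample dates are taken from the tables shown earlier.

```python
from datetime import datetime

def parse_sampling_date(text, fmt='%d/%m/%Y'):
    "Parse a day-first date string such as '27/01/2010'."
    return datetime.strptime(text, fmt)

dates = ['27/01/2010', '31/12/2021', '03/01/1995']
parsed = [parse_sampling_date(d) for d in dates]
iso = [d.date().isoformat() for d in parsed]
print(iso)
```

With this format, 03/01/1995 is read as 3 January 1995 rather than 1 March 1995.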


***

In [65]:
tfm()['biota']['Species']

0        LITTORINA LITTOREA
1         FUCUS VESICULOSUS
2        LITTORINA LITTOREA
3        LITTORINA LITTOREA
4         FUCUS VESICULOSUS
                ...        
15309    LITTORINA LITTOREA
15310           Patella sp.
15311        Fucus serratus
15312        Fucus serratus
15313        Fucus serratus
Name: Species, Length: 15314, dtype: object

### Lookup

#### Biota species

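Define get_maris_lut, which matches the unique names found in a column of the biota dataframe (here 'Species') against a MARIS lookup table and caches the result.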

In [66]:
#|export
def get_maris_lut(df_biota,
                  fname_cache, # For instance 'species_ospar.pkl'
                  data_provider_name_col:str, # Data provider lookup column name of interest
                  maris_lut:str, # MARIS source lookup table name and path
                  maris_id: str, # Id of MARIS lookup table nomenclature item to match
                  maris_name: str, # Name of MARIS lookup table nomenclature item to match
                  unmatched_fixes={},
                  as_dataframe=False,
                  overwrite=False
                 ):
    fname_cache = cache_path() / fname_cache
    lut = {}


    if overwrite or (not fname_cache.exists()):
        
        df = pd.DataFrame({data_provider_name_col : df_biota[data_provider_name_col].unique()})
        
        for _, row in tqdm(df.iterrows(), total=len(df)):
            
            # Fix if unmatched
            has_to_be_fixed = row[data_provider_name_col] in unmatched_fixes       
            name_to_match = unmatched_fixes[row[data_provider_name_col]] if has_to_be_fixed else row[data_provider_name_col]
            
            
            # Match
            # Mark 'Not available' with a matched_id of 0.
            if name_to_match == 'Not available':
                match = Match(matched_id=0, matched_maris_name='0', source_name=row[data_provider_name_col], match_score=-1)
            # Mark 'nan' with a matched_id of 0.
            elif pd.isna(name_to_match):
                match = Match(matched_id=0, matched_maris_name='0', source_name=row[data_provider_name_col], match_score=-1)

            else:    
                result = match_maris_lut(maris_lut, name_to_match, maris_id, maris_name)
                match = Match(result.iloc[0][maris_id], result.iloc[0][maris_name], 
                            row[data_provider_name_col], result.iloc[0]['score'])
                
            lut[row[data_provider_name_col]] = match
            
        fc.save_pickle(fname_cache, lut)
    else:
        lut = fc.load_pickle(fname_cache)

    if as_dataframe:
        df_lut = pd.DataFrame({k: asdict(v) for k, v in lut.items()}).transpose()
        df_lut.index.name = 'source_id'
        return df_lut.sort_values(by='match_score', ascending=False)
    else:
        return lut
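
Since matching takes over a minute, `get_maris_lut` only recomputes the table when `overwrite` is set or the cache file is missing. Below is a minimal sketch of that caching pattern using `pickle` from the standard library instead of `fc.save_pickle` / `fc.load_pickle`; the names `cached` and `expensive_lookup` are assumptions made for the example.

```python
import pickle
import tempfile
from pathlib import Path

def cached(fname_cache, compute, overwrite=False):
    "Return the pickled result if present, else compute it and store it."
    fname_cache = Path(fname_cache)
    if overwrite or not fname_cache.exists():
        result = compute()
        with open(fname_cache, 'wb') as f:
            pickle.dump(result, f)
        return result
    with open(fname_cache, 'rb') as f:
        return pickle.load(f)

calls = []

def expensive_lookup():
    "Stand-in for the slow matching loop; records each real computation."
    calls.append(1)
    return {'LITTORINA LITTOREA': 394}

with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / 'species_demo.pkl'
    first = cached(cache_file, expensive_lookup)                    # computes
    second = cached(cache_file, expensive_lookup)                   # loads from cache
    third = cached(cache_file, expensive_lookup, overwrite=True)    # recomputes

n_calls = len(calls)
print(n_calls)
```

One consequence of this design is that a change to `unmatched_fixes` has no effect until the cache is rebuilt with `overwrite=True`.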

In [67]:
#|export
# key equals name in dfs['biota']. 
# value equals replacement name to use in match_maris_lut (i.e. name_to_match)
unmatched_fixes_biota_species = {}

Note: the lookup uses dbo_species_expanded.xlsx which, unlike dbo_species.xlsx, does not include a 'Not available' entry.

TODO: investigate speeding this up. 

In [68]:
species_lut_df = get_maris_lut(df_biota=tfm()['biota'], 
                                fname_cache='species_ospar.pkl', 
                                data_provider_name_col='Species',
                                maris_lut=species_lut_path(),
                                maris_id='species_id',
                                maris_name='species',
                                unmatched_fixes=unmatched_fixes_biota_species,
                                as_dataframe=True,
                                overwrite=True)

100%|██████████| 156/156 [01:14<00:00,  2.10it/s]


In [69]:
species_lut_df

Unnamed: 0_level_0,matched_id,matched_maris_name,source_name,match_score
source_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
RHODYMENIA PSEUDOPALAMATA & PALMARIA PALMATA,898,Rhombosolea leporina,RHODYMENIA PSEUDOPALAMATA & PALMARIA PALMATA,31
"Mixture of green, red and brown algae",1041,Melongena melongena,"Mixture of green, red and brown algae",26
Solea solea (S.vulgaris),161,Loligo vulgaris,Solea solea (S.vulgaris),12
SOLEA SOLEA (S.VULGARIS),161,Loligo vulgaris,SOLEA SOLEA (S.VULGARIS),12
CERASTODERMA (CARDIUM) EDULE,274,Cerastoderma edule,CERASTODERMA (CARDIUM) EDULE,10
...,...,...,...,...
BUCCINUM UNDATUM,391,Buccinum undatum,BUCCINUM UNDATUM,0
Anguilla anguilla,272,Anguilla anguilla,Anguilla anguilla,0
Thunnus thynnus,556,Thunnus thynnus,Thunnus thynnus,0
LITTORINA LITTOREA,394,Littorina littorea,LITTORINA LITTOREA,0


TODO: decide how to handle mixed species entries (e.g. RHODYMENIA PSEUDOPALAMATA & PALMARIA PALMATA). Drop?

Show the entries of species_lut_df whose match_score is greater than 1 (a match_score of 0 is a perfect match).

In [70]:
species_lut_df[species_lut_df['match_score'] > 1]

Unnamed: 0_level_0,matched_id,matched_maris_name,source_name,match_score
source_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
RHODYMENIA PSEUDOPALAMATA & PALMARIA PALMATA,898,Rhombosolea leporina,RHODYMENIA PSEUDOPALAMATA & PALMARIA PALMATA,31
"Mixture of green, red and brown algae",1041,Melongena melongena,"Mixture of green, red and brown algae",26
Solea solea (S.vulgaris),161,Loligo vulgaris,Solea solea (S.vulgaris),12
SOLEA SOLEA (S.VULGARIS),161,Loligo vulgaris,SOLEA SOLEA (S.VULGARIS),12
CERASTODERMA (CARDIUM) EDULE,274,Cerastoderma edule,CERASTODERMA (CARDIUM) EDULE,10
Cerastoderma (Cardium) Edule,274,Cerastoderma edule,Cerastoderma (Cardium) Edule,10
MONODONTA LINEATA,425,Osilinus lineatus,MONODONTA LINEATA,9
NUCELLA LAPILLUS,1074,Nacella concinna,NUCELLA LAPILLUS,9
DICENTRARCHUS (MORONE) LABRAX,424,Dicentrarchus labrax,DICENTRARCHUS (MORONE) LABRAX,9
Pleuronectiformes [order],411,Pleuronectiformes,Pleuronectiformes [order],8


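Define fixes for the biota species that were not matched correctly. Each key is a name found in the OSPAR data; each value is the name to match against the MARIS lookup table instead.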

In [71]:
#|export
# 0 used for 'na' 
unmatched_fixes_biota_species = {'RHODYMENIA PSEUDOPALAMATA & PALMARIA PALMATA': 'Not available', # mix
 'Mixture of green, red and brown algae': 'Not available', #mix 
 'Solea solea (S.vulgaris)': 'Solea solea',
 'SOLEA SOLEA (S.VULGARIS)': 'Solea solea',
 'CERASTODERMA (CARDIUM) EDULE': 'Cerastoderma edule',
 'Cerastoderma (Cardium) Edule': 'Cerastoderma edule',
 'MONODONTA LINEATA': 'Phorcus lineatus',
 'NUCELLA LAPILLUS': 'Not available', # Dropped. In WoRMS: 'Nucella lapillus (Linnaeus, 1758)'
 'DICENTRARCHUS (MORONE) LABRAX': 'Dicentrarchus labrax',
 'Pleuronectiformes [order]': 'Pleuronectiformes',
 'RAJIDAE/BATOIDEA': 'Not available', #mix 
 'PALMARIA PALMATA': 'Not available', # Dropped. In WoRMS: 'Palmaria palmata (Linnaeus) F.Weber & D.Mohr, 1805'
 'Sepia spp.': 'Sepia',
 'Rhodymenia spp.': 'Rhodymenia',
 'unknown': 'Not available',
 'RAJA DIPTURUS BATIS': 'Dipturus batis',
 'Unknown': 'Not available',
 'Flatfish': 'Not available',
 'FUCUS SPP.': 'FUCUS',
 'Patella sp.': 'Patella',
 'Gadus sp.': 'Gadus',
 'FUCUS spp': 'FUCUS',
 'Tapes sp.': 'Tapes',
 'Thunnus sp.': 'Thunnus',
 'RHODYMENIA spp': 'RHODYMENIA',
 'Fucus sp.': 'Fucus',
 'PECTINIDAE': 'Not available', # Dropped. In WoRMS, PECTINIDAE is a family.
 'PLUERONECTES PLATESSA': 'Pleuronectes platessa',
 'Gaidropsarus argenteus': 'Gaidropsarus argentatus'}
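
The fixes are applied before matching: a name listed in the dictionary is replaced by its fix, and anything that resolves to 'Not available' or is missing receives an id of 0. A minimal sketch of that decision, using plain Python in place of `pd.isna` and with hypothetical helper names:

```python
NOT_AVAILABLE = 'Not available'

def name_to_match(source_name, fixes):
    "Use the manual fix when one exists, else the source name itself."
    return fixes.get(source_name, source_name)

def is_unavailable(name):
    "True for None, NaN (which is not equal to itself) and 'Not available'."
    return name is None or name != name or name == NOT_AVAILABLE

# Two fixes copied from the dictionary above.
fixes = {'Solea solea (S.vulgaris)': 'Solea solea',
         'Flatfish': 'Not available'}

sources = ['Solea solea (S.vulgaris)', 'Flatfish', 'Anguilla anguilla']
resolved = {s: name_to_match(s, fixes) for s in sources}
unavailable = [s for s in sources if is_unavailable(resolved[s])]
print(resolved)
print(unavailable)
```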

In [73]:
species_lut_df = get_maris_lut(df_biota=tfm()['biota'], 
                                fname_cache='species_ospar.pkl', 
                                data_provider_name_col='Species',
                                maris_lut=species_lut_path(),
                                maris_id='species_id',
                                maris_name='species',
                                unmatched_fixes=unmatched_fixes_biota_species,
                                as_dataframe=True,
                                overwrite=True)

100%|██████████| 156/156 [01:11<00:00,  2.17it/s]


In [74]:
species_lut_df[species_lut_df['match_score'] > 1]

Unnamed: 0_level_0,matched_id,matched_maris_name,source_name,match_score
source_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1


In [75]:
species_lut_df

Unnamed: 0_level_0,matched_id,matched_maris_name,source_name,match_score
source_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Sebastes vivipares,390,Sebastes viviparus,Sebastes vivipares,1
ASCOPHYLLUN NODOSUM,401,Ascophyllum nodosum,ASCOPHYLLUN NODOSUM,1
LITTORINA LITTOREA,394,Littorina littorea,LITTORINA LITTOREA,0
PORPHYRA UMBILICALIS,417,Porphyra umbilicalis,PORPHYRA UMBILICALIS,0
FUCUS spp,395,Fucus,FUCUS spp,0
...,...,...,...,...
PECTINIDAE,0,0,PECTINIDAE,-1
NUCELLA LAPILLUS,0,0,NUCELLA LAPILLUS,-1
PALMARIA PALMATA,0,0,PALMARIA PALMATA,-1
,0,0,,-1


In [168]:
#| export
class LookupBiotaSpeciesCB(Callback):
    "Remap the biota 'Species' names to MARIS species ids."
    def __init__(self, fn_lut): fc.store_attr()
    def __call__(self, tfm):
        lut = self.fn_lut(df_biota=tfm.dfs['biota'])      
        # Drop rows where 'Species' are 'nan'
        tfm.dfs['biota']=tfm.dfs['biota'][tfm.dfs['biota']['Species'].notna()]
        tfm.dfs['biota']['species'] = tfm.dfs['biota']['Species'].apply(lambda x: lut[x].matched_id)
        

In [169]:
#| export
get_maris_species = partial(get_maris_lut, 
                fname_cache='species_ospar.pkl', 
                data_provider_name_col='Species',
                maris_lut=species_lut_path(),
                maris_id='species_id',
                maris_name='species',
                unmatched_fixes=unmatched_fixes_biota_species,
                as_dataframe=False,
                overwrite=False)

In [170]:
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[LowerStripRdnNameCB(),
                            RemapRdnNameCB(),
                            ParseTimeCB(),
                            LookupBiotaSpeciesCB(get_maris_species)
                            ])
print(tfm()['biota']['species'].unique())

[ 394   96  129   50  139  270  395   99  377  414 1608  244  192   23
    0  402  407  401  274  378 1609  384  386  191  382  404  405  385
  388  383  379  432  243  392  393  413  400  425  419  399  556  272
  391  234  431  442  396 1606  403  412  435 1610  381  437  434  444
  443  389  440  441  439  427  438 1605  436  426  433  390  420  417
  397  421  294  422  423  428  424  415 1607  387  380  406  398  416
  408  409  418  430  429  411  410]


#### Biota tissues

##### Correct OSPAR 'Body Part' labelled as Whole

The OSPAR data includes entries whose 'Body Part' is labelled as whole. MARIS requires 'body_part' to distinguish between 'Whole animal' and 'Whole plant'. The OSPAR 'Biological group' column makes this possible: entries labelled as whole are relabelled 'Whole animal' or 'Whole plant' according to their biological group.

In [118]:
#| export
whole_animal_plant = {'whole' : ['Whole','WHOLE', 'WHOLE FISH', 'Whole fisk', 'Whole fish'],
                      'Whole animal' : ['Molluscs','Fish','FISH','molluscs','fish','MOLLUSCS'],
                      'Whole plant' : ['Seaweed','seaweed','SEAWEED'] }

In [119]:
#| export
class CorrectWholeBodyPart(Callback):
    def __init__(self, wap=whole_animal_plant): fc.store_attr()
    
    def __call__(self, tfm):
        tfm.dfs['biota'] = self.correct_whole_body_part(tfm.dfs['biota'],self.wap)

    def correct_whole_body_part(self, df, wap):
        whole_list = wap['whole']
        animal_list = wap['Whole animal']
        plant_list = wap['Whole plant']
        is_whole = df['Body Part'].isin(whole_list)
        # Assign through a single .loc call to avoid chained assignment.
        df.loc[is_whole & df['Biological group'].isin(animal_list), 'Body Part'] = 'Whole animal'
        df.loc[is_whole & df['Biological group'].isin(plant_list), 'Body Part'] = 'Whole plant'
        return df



In [120]:
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[LowerStripRdnNameCB(),
                            RemapRdnNameCB(),
                            ParseTimeCB(),
                            LookupBiotaSpeciesCB(get_maris_species),
                            CorrectWholeBodyPart()
                            ])
print(tfm()['biota']['Body Part'].unique())

['SOFT PARTS' 'GROWING TIPS' 'Whole plant' 'Whole animal'
 'FLESH WITHOUT BONES' 'WHOLE ANIMAL' 'WHOLE PLANT' 'Soft Parts'
 'Whole without head' 'Cod medallion' 'Muscle'
 'Mix of muscle and whole fish without liver' 'Flesh' 'FLESH WITHOUT BONE'
 'UNKNOWN' 'FLESH' 'FLESH WITH SCALES' 'HEAD' 'Flesh without bones'
 'Soft parts' 'whole plant' 'LIVER' 'MUSCLE']
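
The relabelling rule can be stated per row. The sketch below applies it to plain tuples rather than a dataframe; the sets are copied from `whole_animal_plant`, while the function name `correct_whole` and the 'Crustaceans' group are assumptions used to show an unhandled case.

```python
WHOLE = {'Whole', 'WHOLE', 'WHOLE FISH', 'Whole fisk', 'Whole fish'}
ANIMALS = {'Molluscs', 'Fish', 'FISH', 'molluscs', 'fish', 'MOLLUSCS'}
PLANTS = {'Seaweed', 'seaweed', 'SEAWEED'}

def correct_whole(body_part, biological_group):
    "Relabel a 'whole' body part according to the biological group."
    if body_part in WHOLE:
        if biological_group in ANIMALS:
            return 'Whole animal'
        if biological_group in PLANTS:
            return 'Whole plant'
    # Other body parts, and unknown groups, are left untouched.
    return body_part

rows = [('WHOLE', 'Fish'),
        ('Whole', 'SEAWEED'),
        ('SOFT PARTS', 'Molluscs'),
        ('Whole', 'Crustaceans')]  # hypothetical group, not in the lists
corrected = [correct_whole(bp, grp) for bp, grp in rows]
print(corrected)
```

Labels such as 'WHOLE ANIMAL' or 'whole plant' in the output above are not changed by this step; they are left for the lookup that follows.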


Build a dataframe that matches the OSPAR biota tissues against the MARIS body parts.

In [142]:
#|export
unmatched_fixes_biota_tissues = {}

In [143]:
tissues_lut_df = get_maris_lut(df_biota=tfm()['biota'], 
                                fname_cache='tissues_ospar.pkl', 
                                data_provider_name_col='Body Part',
                                maris_lut=bodyparts_lut_path(),
                                maris_id='bodypar_id',
                                maris_name='bodypar',
                                unmatched_fixes=unmatched_fixes_biota_tissues,
                                as_dataframe=True,
                                overwrite=True)
tissues_lut_df.head()

100%|██████████| 23/23 [00:00<00:00, 101.65it/s]


Unnamed: 0_level_0,matched_id,matched_maris_name,source_name,match_score
source_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Mix of muscle and whole fish without liver,52,Flesh without bones,Mix of muscle and whole fish without liver,27
Whole without head,52,Flesh without bones,Whole without head,10
Cod medallion,9,Endoskeleton,Cod medallion,9
UNKNOWN,12,Skin,UNKNOWN,5
FLESH,42,Leaf,FLESH,3


List unmatched OSPAR tissues

In [144]:
tissues_lut_df[tissues_lut_df['match_score'] > 1]['source_name'].tolist()

['Mix of muscle and whole fish without liver',
 'Whole without head',
 'Cod medallion',
 'UNKNOWN',
 'FLESH',
 'Flesh']

Read Maris tissue lut to correct unmatched tissues

In [147]:
marisco_lut_df=pd.read_excel(bodyparts_lut_path())
marisco_lut_df

Unnamed: 0,bodypar_id,bodypar,bodycode,groupcode
0,0,(Not available),0,0
1,1,Whole animal,WHOA,WHO
2,2,Whole animal eviscerated,WHOEV,WHO
3,3,Whole animal eviscerated without head,WHOHE,WHO
4,4,Flesh with bones,FLEB,FLEB
...,...,...,...,...
56,56,Growing tips,GTIP,PHAN
57,57,Upper parts of plants,UPPL,PHAN
58,58,Lower parts of plants,LWPL,PHAN
59,59,Shells/carapace,SHCA,SKEL


Create a dictionary of fixes for the unmatched tissues.

In [150]:
unmatched_fixes_biota_tissues = {
'Mix of muscle and whole fish without liver' : 'Not available', # Drop
 'Whole without head' : 'Whole animal eviscerated without head', # Drop? eviscerated? ,
 'Cod medallion' : 'Whole animal eviscerated without head',
 'FLESH' : 'Flesh without bones', # Drop? with or without bones?
 'Flesh' : 'Flesh without bones', # Drop? with or without bones?
 'UNKNOWN' : 'Not available'
}

In [151]:
tissues_lut_df = get_maris_lut(df_biota=tfm()['biota'], 
                                fname_cache='tissues_ospar.pkl', 
                                data_provider_name_col='Body Part',
                                maris_lut=bodyparts_lut_path(),
                                maris_id='bodypar_id',
                                maris_name='bodypar',
                                unmatched_fixes=unmatched_fixes_biota_tissues,
                                as_dataframe=True,
                                overwrite=True)
tissues_lut_df.head()

100%|██████████| 23/23 [00:00<00:00, 111.22it/s]


Unnamed: 0_level_0,matched_id,matched_maris_name,source_name,match_score
source_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
FLESH WITHOUT BONE,52,Flesh without bones,FLESH WITHOUT BONE,1
SOFT PARTS,19,Soft parts,SOFT PARTS,0
GROWING TIPS,56,Growing tips,GROWING TIPS,0
LIVER,25,Liver,LIVER,0
whole plant,40,Whole plant,whole plant,0


List unmatched OSPAR tissues

In [172]:
tissues_lut_df[tissues_lut_df['match_score'] > 1]['source_name'].tolist()

[]

In [177]:
#| export
class LookupBiotaBodyPartCB(Callback):
    """
    Update bodypart id based on MARIS dbo_bodypar.xlsx:
        - 3: 'Whole animal eviscerated without head',
        - 12: 'Viscera',
        - 8: 'Skin'
    """
    def __init__(self, fn_lut): fc.store_attr()
    def __call__(self, tfm):
        lut = self.fn_lut(df_biota=tfm.dfs['biota'])      
        # Drop rows where 'Body Part' is missing (NaN)
        tfm.dfs['biota'] = tfm.dfs['biota'][tfm.dfs['biota']['Body Part'].notna()].copy()
        tfm.dfs['biota']['body_part'] = tfm.dfs['biota']['Body Part'].apply(lambda x: lut[x].matched_id)
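
The lookup step can be tried on a toy frame. This is a sketch under assumptions: `Match` is a hypothetical stand-in for the record returned by `get_maris_lut` (the callback only relies on a `matched_id` attribute), and the two ids are copied from the lookup table shown earlier.

```python
from collections import namedtuple

import pandas as pd

# Hypothetical stand-in for a lookup entry; only `matched_id` is used.
Match = namedtuple('Match', ['matched_id', 'matched_maris_name'])
lut = {
    'SOFT PARTS': Match(19, 'Soft parts'),
    'LIVER': Match(25, 'Liver'),
}

df = pd.DataFrame({'Body Part': ['SOFT PARTS', None, 'LIVER']})

# Drop rows without a body part, then map each name to its MARIS id.
# `.copy()` avoids assigning into a view of the original frame.
df = df[df['Body Part'].notna()].copy()
df['body_part'] = df['Body Part'].map(lambda x: lut[x].matched_id)
print(df)
```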

In [178]:
get_maris_bodypart=partial(get_maris_lut, 
                            fname_cache='tissues_ospar.pkl', 
                            data_provider_name_col='Body Part',
                            maris_lut=bodyparts_lut_path(),
                            maris_id='bodypar_id',
                            maris_name='bodypar',
                            unmatched_fixes=unmatched_fixes_biota_tissues,
                            as_dataframe=False,
                            overwrite=False)
tissues_lut_df.head()

Unnamed: 0_level_0,matched_id,matched_maris_name,source_name,match_score
source_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
FLESH WITHOUT BONE,52,Flesh without bones,FLESH WITHOUT BONE,1
SOFT PARTS,19,Soft parts,SOFT PARTS,0
GROWING TIPS,56,Growing tips,GROWING TIPS,0
LIVER,25,Liver,LIVER,0
whole plant,40,Whole plant,whole plant,0


In [182]:
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[LowerStripRdnNameCB(),
                            RemapRdnNameCB(),
                            ParseTimeCB(),
                            LookupBiotaSpeciesCB(get_maris_species),
                            CorrectWholeBodyPart(),
                            LookupBiotaBodyPartCB(get_maris_bodypart)
                            ])
print(tfm()['biota'][['Body Part', 'body_part']][:5])

      Body Part  body_part
0    SOFT PARTS         19
1  GROWING TIPS         56
2    SOFT PARTS         19
3    SOFT PARTS         19
4  GROWING TIPS         56


In [520]:
species_lut=get_maris_species(fname_in, 'species_helcom.pkl', overwrite=True,verbose=False)

  0%|          | 0/46 [00:00<?, ?it/s]

100%|██████████| 46/46 [00:42<00:00,  1.08it/s]


Show species_lut as a dataframe 

In [521]:
species_lut_df=pd.DataFrame(species_lut).transpose()
species_lut_df

Unnamed: 0,id,name,source,status,match_type
ABRA BRA,271,Abramis brama,ABRAMIS BRAMA,marisco_cdl,0
ANGU ANG,272,Anguilla anguilla,ANGUILLA ANGUILLA,marisco_cdl,0
ARCT ISL,273,Arctica islandica,ARCTICA ISLANDICA,marisco_cdl,0
ASTE RUB,21,Asterias rubens,ASTERIAS RUBENS,marisco_cdl,0
CARD EDU,988,Cardiidae,CARDIUM EDULE,marisco_cdl,6
CH HI;BA,122,Macoma balthica,CHARA BALTICA,marisco_cdl,6
CLAD GLO,290,Cladophora glomerata,CLADOPHORA GLOMERATA,marisco_cdl,0
CLUP HAR,50,Clupea harengus,CLUPEA HARENGUS,marisco_cdl,0
CRAN CRA,59,Crangon crangon,CRANGON CRANGON,marisco_cdl,0
CYPR CAR,275,Cyprinus carpio,CYPRINUS CARPIO,marisco_cdl,0


Show the rows of `species_lut_df` whose `match_type` is not a perfect match (i.e. not equal to 0).

In [132]:
species_lut_df[species_lut_df['match_type']!=0]

KeyError: 'match_type'

`get_worms_species` looks up each species listed in `RUBIN_NAME.csv` against the WoRMS database at https://www.marinespecies.org/rest/AphiaRecordsByMatchNames. If the `load_lut` parameter is `True`, the previously saved lut is loaded first, and any `RUBIN` code already present in it with a `match_type` of 0 is skipped rather than looked up again.
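
The skip logic can be sketched without any network access. Everything below is illustrative: `fake_match` is a hypothetical stand-in for the remote query, `refresh_lut` is not part of the handler, and the `'queried'` marker is a placeholder for the value WoRMS would return.

```python
def fake_match(name):
    """Hypothetical stand-in for the remote WoRMS query."""
    known = {'CHARA BALTICA': 399467}
    return known.get(name, -1)

def refresh_lut(rows, lut, match=fake_match):
    """Re-query only the entries that are missing or not a perfect match."""
    queried = []
    for code, name in rows:
        if code in lut and lut[code]['match_type'] == 0:
            continue  # perfect match already cached: keep it
        queried.append(code)
        lut[code] = {'id': match(name), 'source': name, 'match_type': 'queried'}
    return lut, queried

cached = {
    'ABRA BRA': {'id': 271, 'source': 'ABRAMIS BRAMA', 'match_type': 0},
    'CH HI;BA': {'id': 122, 'source': 'CHARA BALTICA', 'match_type': 6},
}
rows = [('ABRA BRA', 'ABRAMIS BRAMA'), ('CH HI;BA', 'CHARA BALTICA')]
lut, queried = refresh_lut(rows, cached)
print(queried)
```

Only the imperfect match is sent to the matcher, which is what keeps repeated runs short.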

In [523]:
#| export
def get_worms_species(fname_in, fname_cache, load_lut=False, overwrite=False):
    fname_cache = cache_path() / fname_cache
    lut = {}

    if overwrite or (not fname_cache.exists()):
        df = pd.read_csv(Path(fname_in) / 'RUBIN_NAME.csv')
        
        if load_lut and fname_cache.exists():
            # Start from the previously saved lut
            lut = fc.load_pickle(fname_cache)
        
        for _, row in tqdm(df[['RUBIN', 'SCIENTIFIC NAME']].iterrows(), total=df.shape[0]):
            # Skip the WoRMS lookup for entries already in the lut with match_type 0
            if load_lut and row['RUBIN'] in lut and lut[row['RUBIN']]['match_type'] == 0:
                continue
            res = match_worms(row['SCIENTIFIC NAME'])
            if (res == -1):
                print(f"No match found for {row['RUBIN']} ({row['SCIENTIFIC NAME']})")
                id = -1 
                lut[row['RUBIN']] = {'id': id, 'name': '', 'source': row["SCIENTIFIC NAME"] ,'status': 'No match', 'match_type': 'No match', 'unacceptreason':'No match'}
            else:
                if len(res[0]) > 1:
                    print(f"Several matches for {row['RUBIN']} ({row['SCIENTIFIC NAME']})")
                    
                id, name, status, match_type,unacceptreason  = [res[0][0].get(key) 
                                                for key in ['AphiaID', 'scientificname', 'status', 'match_type','unacceptreason']]        
                
                lut[row['RUBIN']] = {'id': id, 'name': name, 'source': row["SCIENTIFIC NAME"] ,'status': status, 'match_type': match_type, 'unacceptreason':unacceptreason}
        fc.save_pickle(fname_cache, lut)
    else:
        lut = fc.load_pickle(fname_cache)
        
    return lut

In [524]:
species_lut = get_worms_species(fname_in, 'species_helcom.pkl', load_lut=True, overwrite=True); 

  0%|          | 0/46 [00:00<?, ?it/s]

 24%|██▍       | 11/46 [00:09<00:30,  1.15it/s]

No match found for ENCH CIM (ENCHINODERMATA CIM)


100%|██████████| 46/46 [00:27<00:00,  1.67it/s]


Show `species_lut` as a dataframe after the WoRMS lookup.

In [525]:
species_lut_df=pd.DataFrame(species_lut).transpose()
species_lut_df

Unnamed: 0,id,name,source,status,match_type,unacceptreason
ABRA BRA,271,Abramis brama,ABRAMIS BRAMA,marisco_cdl,0,
ANGU ANG,272,Anguilla anguilla,ANGUILLA ANGUILLA,marisco_cdl,0,
ARCT ISL,273,Arctica islandica,ARCTICA ISLANDICA,marisco_cdl,0,
ASTE RUB,21,Asterias rubens,ASTERIAS RUBENS,marisco_cdl,0,
CARD EDU,152921,Cardium edule,CARDIUM EDULE,superseded combination,exact,original combination
CH HI;BA,399467,Chara baltica,CHARA BALTICA,accepted,exact,
CLAD GLO,290,Cladophora glomerata,CLADOPHORA GLOMERATA,marisco_cdl,0,
CLUP HAR,50,Clupea harengus,CLUPEA HARENGUS,marisco_cdl,0,
CRAN CRA,59,Crangon crangon,CRANGON CRANGON,marisco_cdl,0,
CYPR CAR,275,Cyprinus carpio,CYPRINUS CARPIO,marisco_cdl,0,


Show all rows that were included in the WoRMS lookup (i.e. `match_type` not equal to 0).

In [131]:
species_lut_df[species_lut_df['match_type']!=0]

KeyError: 'match_type'

In [527]:
#| export
class LookupBiotaSpeciesCB(Callback):
    'Match species with MARIS database.'
    def __init__(self, fn_lut): fc.store_attr()
    def __call__(self, tfm):
        lut = self.fn_lut()
        tfm.dfs['biota']['species_id'] = tfm.dfs['biota']['RUBIN'].apply(
            lambda x: lut[x.strip()]['id'])
        # Remove data with a species_id of -1.
        tfm.dfs['biota']=tfm.dfs['biota'].drop(tfm.dfs['biota'][tfm.dfs['biota']['species_id'] == -1 ].index)

Species helcom is found at /home/marisco/.marisco/cache/species_helcom.pkl.
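
A toy version of the species step, assuming a two-entry lut in the same shape as `species_lut` (the `-1` id marks a species with no match). The boolean mask used here is an alternative to `drop(...index)` and gives the same rows when the index is unique.

```python
import pandas as pd

# Toy lut in the same shape as `species_lut`; -1 marks "no match".
lut = {'ABRA BRA': {'id': 271}, 'ENCH CIM': {'id': -1}}

df = pd.DataFrame({'RUBIN': ['ABRA BRA ', 'ENCH CIM', 'ABRA BRA']})

# Strip padding before the lookup, as the callback does.
df['species_id'] = df['RUBIN'].map(lambda x: lut[x.strip()]['id'])

# Keep only rows with a valid species id.
df = df[df['species_id'] != -1]
print(df)
```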



In [528]:
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[LowerStripRdnNameCB(),
                            RemapRdnNameCB(),
                            ParseTimeCB(),
                            NormalizeUncUnitCB(),
                            LookupBiotaSpeciesCB(partial(get_maris_species, 
                                                         fname_in, 'species_helcom.pkl'))
                            ])
tfm()

{'seawater':                 KEY NUCLIDE METHOD < VALUE_Bq/m³  VALUE_Bq/m³  ERROR%_m³  \
 0      WKRIL2012003   cs137    NaN           NaN          5.3      1.696   
 1      WKRIL2012004   cs137    NaN           NaN         19.9      3.980   
 2      WKRIL2012005   cs137    NaN           NaN         25.5      5.100   
 3      WKRIL2012006   cs137    NaN           NaN         17.0      4.930   
 4      WKRIL2012007   cs137    NaN           NaN         22.2      3.996   
 ...             ...     ...    ...           ...          ...        ...   
 21211  WSSSM2021005      h3  SSM45           NaN       1030.0    960.000   
 21212  WSSSM2021006      h3  SSM45           NaN       2240.0    970.000   
 21213  WSSSM2021007      h3  SSM45           NaN       2060.0    970.000   
 21214  WSSSM2021008      h3  SSM45           NaN       2300.0   1000.000   
 21215  WSSSM2021004      h3  SSM45             <          NaN        NaN   
 
          DATE_OF_ENTRY_x  COUNTRY LABORATORY   SEQUENCE  ... 

### Lookup biota tissues

In [529]:
dfs['biota']['TISSUE'].unique()

array([ 5,  1, 41,  3, 51, 43, 42, 12, 10, 18, 52, 20,  8, 54, 53, 13])

In [530]:
#| export
def get_bodypart(verbose=False):
    "Naive lut - TO BE REFACTORED"
    lut={
        5: 52,
        1: 1,
        41: 1,
        3: 3,
        51: 54,
        43: 19,        
        42: 59,
        12: 20,
        10: 7,
        18: 25,
        52: 55,
        20: 38,
        8: 12,
        54: 57,
        53: 56,
        13:21}
    
    if verbose:
        maris_dbo_bodypar=pd.read_excel('../../nbs/files/lut/dbo_bodypar.xlsx')
        helcom_tissue=pd.read_csv('../../_data/accdb/mors/csv/TISSUE.csv')
        print ('helcom_tissue  :  maris_dbo_bodypar')
        for k, v in lut.items():
            print (str(helcom_tissue[helcom_tissue.TISSUE==int(k)].TISSUE_DESCRIPTION.values[0]) + '  :  ' + str(maris_dbo_bodypar[maris_dbo_bodypar.bodypar_id==v].bodypar.values[0]))   
    return lut

In [531]:
#| export
class LookupBiotaBodyPartCB(Callback):
    'Update bodypart id based on MARIS dbo_bodypar.xlsx'
    def __init__(self, fn_lut): fc.store_attr()
    def __call__(self, tfm):
        lut = self.fn_lut()
        tfm.dfs['biota']['body_part'] = tfm.dfs['biota']['TISSUE'].apply(lambda x: lut[x])
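
Because the lut is hard-coded, a code that is absent from it makes `lut[x]` raise a `KeyError`. One way to check coverage beforehand is sketched below; the three-entry lut is an excerpt of `get_bodypart`, and the code `99` is made up to trigger the check.

```python
import pandas as pd

# Excerpt of the hard-coded lut; 99 is a made-up code that is not covered.
lut = {5: 52, 1: 1, 41: 1}
tissue = pd.Series([5, 1, 41, 99], name='TISSUE')

# Codes present in the data but absent from the lut.
missing = sorted(int(c) for c in set(tissue.unique()) - set(lut))
print(missing)

# `Series.map` with a dict leaves unmapped codes as NaN instead of raising.
mapped = tissue.map(lut)
print(mapped)
```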

In [532]:
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[LowerStripRdnNameCB(),
                            RemapRdnNameCB(),
                            ParseTimeCB(),
                            LookupBiotaBodyPartCB(get_bodypart)
                            ])

tfm()

{'seawater':                 KEY NUCLIDE METHOD < VALUE_Bq/m³  VALUE_Bq/m³  ERROR%_m³  \
 0      WKRIL2012003   cs137    NaN           NaN          5.3  32.000000   
 1      WKRIL2012004   cs137    NaN           NaN         19.9  20.000000   
 2      WKRIL2012005   cs137    NaN           NaN         25.5  20.000000   
 3      WKRIL2012006   cs137    NaN           NaN         17.0  29.000000   
 4      WKRIL2012007   cs137    NaN           NaN         22.2  18.000000   
 ...             ...     ...    ...           ...          ...        ...   
 21211  WSSSM2021005      h3  SSM45           NaN       1030.0  93.203883   
 21212  WSSSM2021006      h3  SSM45           NaN       2240.0  43.303571   
 21213  WSSSM2021007      h3  SSM45           NaN       2060.0  47.087379   
 21214  WSSSM2021008      h3  SSM45           NaN       2300.0  43.478261   
 21215  WSSSM2021004      h3  SSM45             <          NaN        NaN   
 
          DATE_OF_ENTRY_x  COUNTRY LABORATORY   SEQUENCE  ... 

***

### Lookup sediment types

In [533]:
df_sediment = pd.read_csv(Path(fname_in) / 'SEDIMENT_TYPE.csv')
df_sediment.head(5)

Unnamed: 0,SEDI,SEDIMENT TYPE,RECOMMENDED TO BE USED
0,-99,NO DATA,
1,0,GRAVEL,YES
2,1,SAND,YES
3,2,FINE SAND,NO
4,3,SILT,YES


In [534]:
df_sediment['SEDI'].unique()

array([-99,   0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,
        12,  13,  14,  15,  20,  21,  22,  23,  24,  25,  30,  31,  32,
        33,  34,  35,  40,  41,  42,  43,  44,  45,  46,  47,  48,  49,
        50,  51,  52,  54,  55,  57,  58,  59])

In [535]:
#| export
def get_sediment(verbose=False):
    lut = {}
    if verbose: print('Source:Destination')
    df_sediment = pd.read_csv(Path(fname_in) / 'SEDIMENT_TYPE.csv')
    
    for _, row in df_sediment.iterrows():
        match = match_maris_sediment(row['SEDIMENT TYPE'])
        
        lut[row['SEDI']] = match.iloc[0,0]
        if verbose: print(f'({row["SEDI"]}) {row["SEDIMENT TYPE"]}: ({match.iloc[0,0]}) {match.iloc[0,1]}')
    return lut   

In [589]:
df_sediment=get_sediment(verbose=True)
df_sediment

Source:Destination
(-99) NO DATA: (0) (Not available)
(0) GRAVEL: (2) Gravel
(1) SAND: (6) Sand
(2) FINE SAND: (7) Fine sand
(3) SILT: (12) Silt
(4) CLAY: (1) Clay
(5) MUD: (4) Mud
(6) GLACIAL: (25) Glacial
(7) SOFT: (26) Soft
(8) SULPHIDIC: (27) Sulphidic
(9) Fe-Mg CONCRETIONS: (28) Fe-Mg concretions
(10) SAND AND GRAVEL: (29) Sand and gravel
(11) PURE SAND: (30) Pure sand
(12) SAND AND FINE SAND: (31) Sand and fine sand
(13) SAND AND SILT: (62) Sand and silt
(14) SAND AND CLAY: (32) Sand and clay
(15) SAND AND MUD: (33) Sand and mud
(20) FINE SAND AND GRAVEL: (34) Fine sand and gravel
(21) FINE SAND AND SAND: (35) Fine sand and sand
(22) PURE FINE SAND: (36) Pure fine sand
(23) FINE SAND AND SILT: (37) Fine sand and silt
(24) FINE SAND AND CLAY: (38) Fine sand and clay
(25) FINE SAND AND MUD: (39) Fine sand and mud
(30) SILT AND GRAVEL: (11) Silt and gravel
(31) SILT AND SAND: (40) Silt and sand
(32) SILT AND FINE SAND: (41) Silt and fine sand
(33) PURE SILT: (42) Pure silt
(34) SILT

{-99: 0,
 0: 2,
 1: 6,
 2: 7,
 3: 12,
 4: 1,
 5: 4,
 6: 25,
 7: 26,
 8: 27,
 9: 28,
 10: 29,
 11: 30,
 12: 31,
 13: 62,
 14: 32,
 15: 33,
 20: 34,
 21: 35,
 22: 36,
 23: 37,
 24: 38,
 25: 39,
 30: 11,
 31: 40,
 32: 41,
 33: 42,
 34: 10,
 35: 43,
 40: 44,
 41: 45,
 42: 46,
 43: 48,
 44: 47,
 45: 49,
 46: 50,
 47: 51,
 48: 52,
 49: 53,
 50: 54,
 51: 55,
 52: 56,
 54: 57,
 55: 58,
 57: 59,
 58: 60,
 59: 61}

View unique SEDI types

In [590]:
dfs['sediment']['SEDI'].unique()

array([ nan, -99.,   0.,  55.,  11.,  57.,  51.,  52.,  22.,  10.,  44.,
         5.,  50.,  15.,   1.,  40.,  33.,  43.,  59.,  54.,   9.,  45.,
        14.,  41.,  25.,  42.,  24.,  12.,  58.,  13.,   7.,  49.,  48.,
         4.,  47.,  23.,  20.,  46.,   2.,  34.,  32.,  56.,  35.,  73.,
        21.])

Fill 'nan' with -99

In [591]:
dfs['sediment']['SEDI'].fillna(-99).astype('int')

0       -99
1       -99
2       -99
3       -99
4       -99
         ..
39812     0
39813     0
39814     0
39815     0
39816     0
Name: SEDI, Length: 39817, dtype: int64
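
The order of the two calls matters: a column containing `NaN` is stored as float, and casting it to an integer type fails until the gaps are filled. A minimal illustration on made-up values:

```python
import numpy as np
import pandas as pd

sedi = pd.Series([np.nan, -99.0, 0.0, 55.0], name='SEDI')
print(sedi.dtype)  # float, because NaN cannot be stored in an integer column

# Casting directly fails while NaN is present.
try:
    sedi.astype('int')
    cast_failed = False
except ValueError:
    cast_failed = True

# Filling first makes the cast safe.
filled = sedi.fillna(-99).astype('int')
print(filled)
```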

In [592]:
#| export
class LookupSedimentCB(Callback):
    'Update sediment id  based on MARIS dbo_sedtype.xlsx'
    def __init__(self, fn_lut): fc.store_attr()
    def __call__(self, tfm):
        lut = self.fn_lut()
        tfm.dfs['sediment']['SEDI'] = tfm.dfs['sediment']['SEDI'].fillna(-99).astype('int')
        # Codes 56 and 73 are not listed in SEDIMENT_TYPE.csv; map them to -99 (to check with HELCOM)
        tfm.dfs['sediment']['SEDI'] = tfm.dfs['sediment']['SEDI'].replace({56: -99, 73: -99})
        tfm.dfs['sediment']['sed_type'] = tfm.dfs['sediment']['SEDI'].apply(lambda x: lut[x])
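
The two codes are special-cased by hand. A possible generalisation, shown here only as a sketch on a three-entry excerpt of the lut, is to send every code that the lut does not know to `-99`:

```python
import pandas as pd

# Excerpt of the sediment lut; 56 and 73 are the codes absent from it.
lut = {-99: 0, 0: 2, 1: 6}
sedi = pd.Series([0, 56, 1, 73, -99])

# Flag codes the lut knows, and replace all others with the "no data" code.
known = sedi.isin(list(lut))
cleaned = sedi.where(known, -99)

sed_type = cleaned.map(lut)
print(sedi[~known].tolist())
print(sed_type.tolist())
```

Whether silently mapping unknown codes to "no data" is acceptable depends on the data provider's answer, so listing them first (as `sedi[~known]` does) is probably worth keeping.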

In [593]:
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[LowerStripRdnNameCB(),
                            RemapRdnNameCB(),
                            ParseTimeCB(),
                            LookupSedimentCB(get_sediment)
                            ])

tfm()['sediment'][['SEDI', 'sed_type']]


Unnamed: 0,SEDI,sed_type
0,-99,0
1,-99,0
2,-99,0
3,-99,0
4,-99,0
...,...,...
39812,0,2
39813,0,2
39814,0,2
39815,0,2


### Capture Units

In [594]:
#| export
# Define unit names renaming rules
renaming_unit_rules = { 'VALUE_Bq/m³': 1, #'Bq/m3'
                  'VALUE_Bq/kg': 3 #'Bq/kg'
                }
                  

TODO: For sediment we drop '< VALUE_Bq/m²', 'VALUE_Bq/m²', 'ERROR%_m²'. Is this intentional? 

In [595]:
#| export
class LookupUnitCB(Callback):
    def __init__(self,
                 renaming_unit_rules=renaming_unit_rules):
        fc.store_attr()
    def __call__(self, tfm):
        for grp in tfm.dfs.keys():
            for k,v in self.renaming_unit_rules.items():
                if k in tfm.dfs[grp].columns:
                    tfm.dfs[grp]['unit'] = np.where(tfm.dfs[grp].loc[:,k].notna(), np.int64(v), np.int64(0))
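
The effect of the `np.where` call on a single frame, using made-up values: rows with a reported value get the unit id, rows without one get `0`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'VALUE_Bq/kg': [4.3, np.nan, 65.0]})

# 3 is the unit id associated with 'VALUE_Bq/kg' in `renaming_unit_rules`.
df['unit'] = np.where(df['VALUE_Bq/kg'].notna(), np.int64(3), np.int64(0))
print(df)
```

One caveat worth checking: if a frame ever contained both value columns, the rule processed last would overwrite the `unit` column written by the first.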


In [596]:
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[LowerStripRdnNameCB(),
                            RemapRdnNameCB(),
                            ParseTimeCB(),
                            NormalizeUncUnitCB(),
                            LookupBiotaSpeciesCB(partial(get_maris_species, 
                                                         fname_in, 'species_helcom.pkl')),
                            LookupBiotaBodyPartCB(get_bodypart),
                            LookupSedimentCB(get_sediment),
                            LookupUnitCB()])

tfm()


{'seawater':                 KEY NUCLIDE METHOD < VALUE_Bq/m³  VALUE_Bq/m³  ERROR%_m³  \
 0      WKRIL2012003   cs137    NaN           NaN          5.3      1.696   
 1      WKRIL2012004   cs137    NaN           NaN         19.9      3.980   
 2      WKRIL2012005   cs137    NaN           NaN         25.5      5.100   
 3      WKRIL2012006   cs137    NaN           NaN         17.0      4.930   
 4      WKRIL2012007   cs137    NaN           NaN         22.2      3.996   
 ...             ...     ...    ...           ...          ...        ...   
 21211  WSSSM2021005      h3  SSM45           NaN       1030.0    960.000   
 21212  WSSSM2021006      h3  SSM45           NaN       2240.0    970.000   
 21213  WSSSM2021007      h3  SSM45           NaN       2060.0    970.000   
 21214  WSSSM2021008      h3  SSM45           NaN       2300.0   1000.000   
 21215  WSSSM2021004      h3  SSM45             <          NaN        NaN   
 
          DATE_OF_ENTRY_x  COUNTRY LABORATORY   SEQUENCE  ... 

In [597]:
tfm.dfs['sediment']['unit']

0        3
1        3
2        3
3        3
4        3
        ..
39812    3
39813    3
39814    3
39815    3
39816    3
Name: unit, Length: 39817, dtype: int64

### Rename columns

In [598]:
#| export
# Define columns of interest by sample type
coi_grp = {'seawater': ['NUCLIDE', 'VALUE_Bq/m³', 'ERROR%_m³', 'time',
                        'TDEPTH', 'LATITUDE (dddddd)', 'LONGITUDE (dddddd)','unit'],
           'sediment': ['NUCLIDE', 'VALUE_Bq/kg', 'ERROR%_kg', 'time',
                        'TDEPTH', 'LATITUDE (dddddd)', 'LONGITUDE (dddddd)',
                        'sed_type','unit'],
           'biota': ['NUCLIDE', 'VALUE_Bq/kg', 'ERROR%', 'time',
                     'SDEPTH', 'LATITUDE ddmmmm', 'LONGITUDE ddmmmm',
                     'species_id', 'body_part','unit']}


In [599]:
#| export
# Define column names renaming rules
renaming_rules = {
    'NUCLIDE': 'nuclide',
    'VALUE_Bq/m³': 'value',
    'VALUE_Bq/kg': 'value',
    'ERROR%_m³': 'unc',
    'ERROR%_kg': 'unc',
    'ERROR%': 'unc',
    'TDEPTH': 'depth',
    'SDEPTH': 'depth',
    'LATITUDE (dddddd)': 'lat',
    'LATITUDE ddmmmm': 'lat',
    'LONGITUDE (dddddd)': 'lon',
    'LONGITUDE ddmmmm': 'lon'
}


In [600]:
#| export
class RenameColumnCB(Callback):
    def __init__(self,
                 coi=coi_grp,
                 renaming_rules=renaming_rules):
        fc.store_attr()

    def __call__(self, tfm):
        for k in tfm.dfs.keys():
            # Select cols of interest
            tfm.dfs[k] = tfm.dfs[k].loc[:, self.coi[k]]

            # Rename cols
            tfm.dfs[k].rename(columns=self.renaming_rules, inplace=True)
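
The select-then-rename pattern on a one-row toy frame. The column lists are shortened excerpts of `coi_grp` and `renaming_rules`; note that `rename` ignores rules whose source column is absent, which is why a single rules dictionary can serve all three sample types.

```python
import pandas as pd

coi = ['NUCLIDE', 'VALUE_Bq/kg', 'SDEPTH']
rules = {
    'NUCLIDE': 'nuclide',
    'VALUE_Bq/kg': 'value',
    'VALUE_Bq/m³': 'value',  # not present in this frame: silently ignored
    'SDEPTH': 'depth',
}

df = pd.DataFrame({'KEY': ['a'], 'NUCLIDE': ['cs137'],
                   'VALUE_Bq/kg': [4.3], 'SDEPTH': [0.0]})

# Keep the columns of interest, then rename them.
out = df.loc[:, coi].rename(columns=rules)
print(out)
```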

In [601]:
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[LowerStripRdnNameCB(),
                            RemapRdnNameCB(),
                            ParseTimeCB(),
                            NormalizeUncUnitCB(),
                            LookupBiotaSpeciesCB(partial(get_maris_species, 
                                                         fname_in, 'species_helcom.pkl')),
                            LookupBiotaBodyPartCB(get_bodypart),
                            LookupSedimentCB(get_sediment),
                            LookupUnitCB(),
                            RenameColumnCB()])

#print(tfm()['biota'].head(5))

tfm()

{'seawater':       nuclide   value       unc       time  depth      lat      lon  unit
 0       cs137     5.3     1.696 2012-05-23    NaN  60.0833  29.3333     1
 1       cs137    19.9     3.980 2012-05-23    NaN  60.0833  29.3333     1
 2       cs137    25.5     5.100 2012-06-17    NaN  59.4333  23.1500     1
 3       cs137    17.0     4.930 2012-05-24    NaN  60.2500  27.9833     1
 4       cs137    22.2     3.996 2012-05-24    NaN  60.2500  27.9833     1
 ...       ...     ...       ...        ...    ...      ...      ...   ...
 21211      h3  1030.0   960.000 2021-10-15    NaN  60.5200  18.3572     1
 21212      h3  2240.0   970.000 2021-11-04    NaN  57.4217  17.0000     1
 21213      h3  2060.0   970.000 2021-10-15    NaN  57.2347  11.9452     1
 21214      h3  2300.0  1000.000 2021-05-17    NaN  57.2347  11.9452     1
 21215      h3     NaN       NaN 2021-05-13    NaN  58.6033  11.2450     0
 
 [21216 rows x 8 columns],
 'sediment':       nuclide   value       unc       time  de

In [602]:
tfm.dfs['biota']

Unnamed: 0,nuclide,value,unc,time,depth,lat,lon,species_id,body_part,unit
0,cs134,0.010140,,2012-09-23,,54.170,12.1900,99,52,3
1,k40,135.300000,4.830210,2012-09-23,,54.170,12.1900,99,52,3
2,co60,0.013980,,2012-09-23,,54.170,12.1900,99,52,3
3,cs137,4.338000,0.150962,2012-09-23,,54.170,12.1900,99,52,3
4,cs134,0.009614,,2012-09-23,,54.170,12.1900,99,52,3
...,...,...,...,...,...,...,...,...,...,...
15822,k40,65.000000,6.630000,2020-10-09,0.0,60.224,18.2374,141579,1,3
15823,cs137,4.500000,0.279000,2020-10-09,0.0,60.224,18.2374,141579,1,3
15824,be7,94.000000,3.196000,2020-10-26,0.0,60.302,18.2200,96,54,3
15825,k40,1100.000000,17.600000,2020-10-26,0.0,60.302,18.2200,96,54,3


In [620]:
#| export
class ReshapeLongToWide(Callback):
    def __init__(self): fc.store_attr()

    def __call__(self, tfm):
        for k in tfm.dfs.keys():
            cols = ['nuclide']
            #vals = ['value', 'unc']
            vals = ['value', 'unc', 'unit']
            
            idx = list(set(tfm.dfs[k].columns) -
                       set(cols + vals))  # All others

            tfm.dfs[k] = tfm.dfs[k].pivot_table(index=idx,
                                                columns=cols,
                                                values=vals).reset_index()
            
            # Flatten cols name
            tfm.dfs[k].columns = rename_cols(tfm.dfs[k].columns)
            
            # Update dtypes of unit
            unit_cols = [col for col in tfm.dfs[k].columns if 'unit' in col]
            tfm.dfs[k][unit_cols] = tfm.dfs[k][unit_cols].fillna(0)
            tfm.dfs[k][unit_cols] = tfm.dfs[k][unit_cols].apply(lambda x: x.astype('int64'))
            
            #tfm.dfs[grp]['unit']=tfm.dfs[grp]['unit'].astype('int64')
            # Set index
            tfm.dfs[k].index.name = 'sample'
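
The reshape can be illustrated on a small long-format frame. This is a sketch with made-up values; the flattening line is a hypothetical substitute for `rename_cols`, whose exact naming rule is not shown here.

```python
import pandas as pd

long_df = pd.DataFrame({
    'lat': [54.17, 54.17, 60.22],
    'lon': [12.19, 12.19, 18.24],
    'nuclide': ['cs137', 'k40', 'cs137'],
    'value': [4.3, 135.3, 4.5],
    'unc': [0.15, 4.83, 0.28],
})

# One row per (lat, lon); one column per (measure, nuclide) pair.
wide = long_df.pivot_table(index=['lat', 'lon'],
                           columns='nuclide',
                           values=['value', 'unc']).reset_index()

# Hypothetical flattening of the two-level column names.
wide.columns = [f'{nuc}_{val}' if nuc else val for val, nuc in wide.columns]
print(wide)
```

Two behaviours of `pivot_table` are worth keeping in mind: rows sharing the same index values are aggregated (mean by default), and rows with a missing value in any index column are dropped.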

In [621]:
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[LowerStripRdnNameCB(),
                            RemapRdnNameCB(),
                            ParseTimeCB(),
                            NormalizeUncUnitCB(),
                            LookupBiotaSpeciesCB(partial(get_maris_species, 
                                                         fname_in, 'species_helcom.pkl')),
                            LookupBiotaBodyPartCB(get_bodypart),
                            LookupSedimentCB(get_sediment),
                            LookupUnitCB(),
                            RenameColumnCB(),
                            ReshapeLongToWide()])

tfm()

{'seawater':             lat     lon       time  depth  ag110m_unc  am241_unc  ba140_unc  \
 sample                                                                        
 0        0.0000   0.000 2015-04-21   12.0         NaN        NaN        NaN   
 1        0.0000   0.000 2015-04-23    4.0         NaN        NaN        NaN   
 2        0.0000   0.000 2015-04-30   13.0         NaN        NaN        NaN   
 3        0.0000   0.000 2015-05-19   81.0         NaN        NaN        NaN   
 4        0.0000   0.000 2015-05-20   69.0         NaN        NaN        NaN   
 ...         ...     ...        ...    ...         ...        ...        ...   
 4814    65.6347  24.335 1994-07-12   17.0         NaN        NaN        NaN   
 4815    65.6347  24.335 1990-07-23   17.0         NaN        NaN        NaN   
 4816    65.6347  24.335 1991-07-23   17.0         NaN        NaN        NaN   
 4817    65.6347  24.335 1992-05-25   17.0         NaN        NaN        NaN   
 4818    65.6347  24.335 199

### Encode time (seconds since ...)
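
The conversion itself can be sketched with pandas alone. The reference date actually used by `EncodeTimeCB` comes from `cfg()` and is not shown here; `1970-01-01` is an assumption, chosen because it is consistent with the encoded values in the output of this section.

```python
import pandas as pd

times = pd.Series(pd.to_datetime(['2015-04-21', '1994-07-12']))

# Assumed reference date (not read from the configuration).
epoch = pd.Timestamp('1970-01-01')

# Whole seconds elapsed since the reference date.
seconds = (times - epoch) // pd.Timedelta(seconds=1)
print(seconds.tolist())
```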

In [622]:
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[LowerStripRdnNameCB(),
                            RemapRdnNameCB(),
                            ParseTimeCB(),
                            NormalizeUncUnitCB(),
                            LookupBiotaSpeciesCB(partial(get_maris_species, 
                                                         fname_in, 'species_helcom.pkl')),
                            LookupBiotaBodyPartCB(get_bodypart),
                            LookupSedimentCB(get_sediment),
                            LookupUnitCB(),
                            RenameColumnCB(),
                            ReshapeLongToWide(),
                            EncodeTimeCB(cfg())])

tfm()

{'seawater':             lat     lon        time  depth  ag110m_unc  am241_unc  ba140_unc  \
 sample                                                                         
 0        0.0000   0.000  1429574400   12.0         NaN        NaN        NaN   
 1        0.0000   0.000  1429747200    4.0         NaN        NaN        NaN   
 2        0.0000   0.000  1430352000   13.0         NaN        NaN        NaN   
 3        0.0000   0.000  1431993600   81.0         NaN        NaN        NaN   
 4        0.0000   0.000  1432080000   69.0         NaN        NaN        NaN   
 ...         ...     ...         ...    ...         ...        ...        ...   
 4814    65.6347  24.335   773971200   17.0         NaN        NaN        NaN   
 4815    65.6347  24.335   648691200   17.0         NaN        NaN        NaN   
 4816    65.6347  24.335   680227200   17.0         NaN        NaN        NaN   
 4817    65.6347  24.335   706752000   17.0         NaN        NaN        NaN   
 4818    65.6347

### Sanitize coordinates
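
According to its log message, `SanitizeLonLatCB` drops rows where both coordinates are 0, drops unrealistic values and converts `,` decimal separators to `.`. A rough equivalent on a toy frame is sketched below; the limits of ±90 and ±180 degrees are an assumption about what counts as unrealistic, and the real callback may differ.

```python
import pandas as pd

df = pd.DataFrame({'lat': ['53,9422', '0', '95.0', '65.6347'],
                   'lon': ['14,2578', '0', '10.0', '24.335']})

# Convert ',' decimal separators to '.' and parse as float.
for col in ['lat', 'lon']:
    df[col] = df[col].str.replace(',', '.', regex=False).astype(float)

# Rows where both coordinates are exactly 0 are treated as missing positions.
both_zero = (df['lat'] == 0) & (df['lon'] == 0)

# Assumed validity ranges for latitude and longitude.
in_range = df['lat'].between(-90, 90) & df['lon'].between(-180, 180)

clean = df[~both_zero & in_range]
print(clean)
```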

In [623]:
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[LowerStripRdnNameCB(),
                            RemapRdnNameCB(),
                            ParseTimeCB(),
                            NormalizeUncUnitCB(),
                            LookupBiotaSpeciesCB(partial(get_maris_species, 
                                                         fname_in, 'species_helcom.pkl')),
                            LookupBiotaBodyPartCB(get_bodypart),
                            LookupSedimentCB(get_sediment),
                            LookupUnitCB(),
                            RenameColumnCB(),
                            ReshapeLongToWide(),
                            EncodeTimeCB(cfg()),
                            SanitizeLonLatCB()])

tfm()


{'seawater':             lat      lon        time  depth  ag110m_unc  am241_unc  ba140_unc  \
 sample                                                                          
 10      53.9422  14.2578  1339545600   10.0         NaN        NaN        NaN   
 11      53.9483  14.2633   870220800   10.0         NaN        NaN        NaN   
 12      53.9483  14.2633   966729600   10.0         NaN        NaN        NaN   
 13      53.9483  14.2633   992044800   10.0         NaN        NaN        NaN   
 14      53.9483  14.2633  1023580800   10.0         NaN        NaN        NaN   
 ...         ...      ...         ...    ...         ...        ...        ...   
 4814    65.6347  24.3350   773971200   17.0         NaN        NaN        NaN   
 4815    65.6347  24.3350   648691200   17.0         NaN        NaN        NaN   
 4816    65.6347  24.3350   680227200   17.0         NaN        NaN        NaN   
 4817    65.6347  24.3350   706752000   17.0         NaN        NaN        NaN   
 481

## Encode to NetCDF

In [624]:
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[LowerStripRdnNameCB(),
                            RemapRdnNameCB(),
                            ParseTimeCB(),
                            NormalizeUncUnitCB(),
                            LookupBiotaSpeciesCB(partial(get_maris_species, 
                                                         fname_in, 'species_helcom.pkl')),
                            LookupBiotaBodyPartCB(get_bodypart),
                            LookupSedimentCB(get_sediment),
                            LookupUnitCB(),
                            RenameColumnCB(),
                            ReshapeLongToWide(),
                            EncodeTimeCB(cfg()),
                            SanitizeLonLatCB()])

tfm()

{'seawater':             lat      lon        time  depth  ag110m_unc  am241_unc  ba140_unc  \
 sample                                                                          
 10      53.9422  14.2578  1339545600   10.0         NaN        NaN        NaN   
 11      53.9483  14.2633   870220800   10.0         NaN        NaN        NaN   
 12      53.9483  14.2633   966729600   10.0         NaN        NaN        NaN   
 13      53.9483  14.2633   992044800   10.0         NaN        NaN        NaN   
 14      53.9483  14.2633  1023580800   10.0         NaN        NaN        NaN   
 ...         ...      ...         ...    ...         ...        ...        ...   
 4814    65.6347  24.3350   773971200   17.0         NaN        NaN        NaN   
 4815    65.6347  24.3350   648691200   17.0         NaN        NaN        NaN   
 4816    65.6347  24.3350   680227200   17.0         NaN        NaN        NaN   
 4817    65.6347  24.3350   706752000   17.0         NaN        NaN        NaN   
 481

In [625]:
dfs_tfm_biota=tfm.dfs['biota']

In [626]:
dfs_tfm_sediment=tfm.dfs['sediment']

In [627]:
dfs_tfm_seawater=tfm.dfs['seawater']

In [628]:
print(dfs_tfm_seawater['ag110m_unit'])

sample
10      0
11      0
12      0
13      0
14      0
       ..
4814    0
4815    0
4816    0
4817    0
4818    0
Name: ag110m_unit, Length: 4809, dtype: int64


In [629]:
tfm.logs

['Convert nuclide names to lowercase & strip any trailing space(s)',
 'Remap to MARIS radionuclide names.',
 'Convert from relative error % to uncertainty of activity unit',
 'Match species with MARIS database.',
 'Update bodypart id based on MARIS dbo_bodypar.xlsx',
 'Update sediment id  based on MARIS dbo_sedtype.xlsx',
 'Encode time as `int` representing seconds since xxx',
 'Drop row when both longitude & latitude equal 0. Drop unrealistic longitude & latitude values. Convert longitude & latitude `,` separator to `.` separator.']

### Feed global attributes

In [630]:
#| export
kw = ['oceanography', 'Earth Science > Oceans > Ocean Chemistry > Radionuclides',
      'Earth Science > Human Dimensions > Environmental Impacts > Nuclear Radiation Exposure',
      'Earth Science > Oceans > Ocean Chemistry > Ocean Tracers, Earth Science > Oceans > Marine Sediments',
      'Earth Science > Oceans > Ocean Chemistry, Earth Science > Oceans > Sea Ice > Isotopes',
      'Earth Science > Oceans > Water Quality > Ocean Contaminants',
      'Earth Science > Biological Classification > Animals/Vertebrates > Fish',
      'Earth Science > Biosphere > Ecosystems > Marine Ecosystems',
      'Earth Science > Biological Classification > Animals/Invertebrates > Mollusks',
      'Earth Science > Biological Classification > Animals/Invertebrates > Arthropods > Crustaceans',
      'Earth Science > Biological Classification > Plants > Macroalgae (Seaweeds)']


In [631]:
#| export
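def get_attrs(tfm, zotero_key='26VMZZ2Q', kw=kw):
    "Build the NetCDF global attributes from the dataframes held in `tfm.dfs`."
    return GlobAttrsFeeder(tfm.dfs, cbs=[
        BboxCB(),            # geospatial bounding box
        DepthRangeCB(),      # vertical (depth) range
        TimeRangeCB(cfg()),  # time coverage start and end
        ZoteroCB(zotero_key, cfg=cfg()),  # bibliographic metadata such as title and summary
        KeyValuePairCB('keywords', ', '.join(kw)),
        # Keep a trace of the transformation steps logged by the `Transformer`
        KeyValuePairCB('publisher_postprocess_logs', ', '.join(tfm.logs))
        ])()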
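
To make the callback pattern used by `get_attrs` more concrete, here is a minimal, self-contained sketch of the idea: each callback receives the data and adds its own entries to a shared attributes dictionary. `BboxSketch`, `KeyValueSketch` and `feed_attrs` are hypothetical stand-ins written for this illustration, not marisco's `BboxCB`, `KeyValuePairCB` or `GlobAttrsFeeder`, and the coordinates are toy values.

```python
class BboxSketch:
    """Hypothetical callback: adds bounding-box attributes as strings."""
    def __call__(self, rows, attrs):
        lons = [r['lon'] for r in rows]
        lats = [r['lat'] for r in rows]
        attrs['geospatial_lon_min'] = str(min(lons))
        attrs['geospatial_lon_max'] = str(max(lons))
        attrs['geospatial_lat_min'] = str(min(lats))
        attrs['geospatial_lat_max'] = str(max(lats))

class KeyValueSketch:
    """Hypothetical callback: adds a single fixed key/value pair."""
    def __init__(self, key, value):
        self.key, self.value = key, value

    def __call__(self, rows, attrs):
        attrs[self.key] = self.value

def feed_attrs(rows, cbs):
    """Run every callback in order and return the accumulated attributes."""
    attrs = {}
    for cb in cbs:
        cb(rows, attrs)
    return attrs

# Toy sample positions, invented for this example
rows = [{'lon': 10.0, 'lat': 54.0}, {'lon': 20.5, 'lat': 60.25}]
attrs = feed_attrs(rows, [
    BboxSketch(),
    KeyValueSketch('keywords', ', '.join(['oceanography', 'radionuclides'])),
])
```

Keeping each attribute group in its own callback means a handler can add or drop a group by editing the `cbs` list only.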

In [632]:
get_attrs(tfm, zotero_key='26VMZZ2Q', kw=kw)

{'geospatial_lat_min': '31.1667',
 'geospatial_lat_max': '65.6347',
 'geospatial_lon_min': '9.41',
 'geospatial_lon_max': '53.458',
 'geospatial_bounds': 'POLYGON ((9.41 53.458, 31.1667 53.458, 31.1667 65.6347, 9.41 65.6347, 9.41 53.458))',
 'geospatial_vertical_max': '0',
 'geospatial_vertical_min': '-460.0',
 'time_coverage_start': '1984-01-10T00:00:00',
 'time_coverage_end': '2021-12-06T00:00:00',
 'title': 'Environmental database - Helsinki Commission Monitoring of Radioactive Substances',
 'summary': 'MORS Environment database has been used to collate data resulting from monitoring of environmental radioactivity in the Baltic Sea based on HELCOM Recommendation 26/3.\n\nThe database is structured according to HELCOM Guidelines on Monitoring of Radioactive Substances (https://www.helcom.fi/wp-content/uploads/2019/08/Guidelines-for-Monitoring-of-Radioactive-Substances.pdf), which specifies reporting format, database structure, data types and obligatory parameters used for reporting d

### Encoding

In [633]:
#| export
def encode(fname_in, fname_out, nc_tpl_path, **kwargs):
    "Transform the data loaded from `fname_in` and encode it as NetCDF in `fname_out`."
    dfs = load_data(fname_in)
    # Callbacks are listed in the order given to the `Transformer`
    tfm = Transformer(dfs, cbs=[
        LowerStripRdnNameCB(),
        RemapRdnNameCB(),
        ParseTimeCB(),
        NormalizeUncUnitCB(),
        LookupBiotaSpeciesCB(partial(get_maris_species, 
                                     fname_in, 'species_helcom.pkl')),
        LookupBiotaBodyPartCB(get_bodypart),
        LookupSedimentCB(get_sediment),
        LookupUnitCB(),
        RenameColumnCB(),
        ReshapeLongToWide(),
        EncodeTimeCB(cfg()),
        SanitizeLonLatCB()
        ])
    
    # Extra enumeration passed to the encoder: species name -> id,
    # skipping lookup entries that have an empty name
    species_lut = get_maris_species(fname_in, 'species_helcom.pkl')
    enums_xtra = {
        'species_t': {info['name']: info['id'] 
                      for info in species_lut.values() if info['name'] != ''}
    }
        
    # `tfm()` is evaluated before `get_attrs(tfm, ...)` because positional
    # arguments are evaluated ahead of keyword arguments
    encoder = NetCDFEncoder(tfm(), 
                            src_fname=nc_tpl_path,
                            dest_fname=fname_out, 
                            global_attrs=get_attrs(tfm, zotero_key='26VMZZ2Q', kw=kw),
                            enums_xtra=enums_xtra,
                            **kwargs)
    encoder.encode()
    return encoder
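
The `enums_xtra` step in `encode` is a plain dictionary comprehension. The sketch below runs the same expression on a toy lookup table; the keys, names and ids are invented for illustration, and only the `name`/`id` fields are taken from the code above (the real table returned by `get_maris_species` may carry more fields).

```python
# Toy lookup table: source species name -> matched record (all values invented)
species_lut = {
    'source name a': {'id': 1, 'name': 'Species A'},
    'source name b': {'id': 2, 'name': 'Species B'},
    'unmatched':     {'id': -1, 'name': ''},  # no match: empty name
}

# Same comprehension as in `encode`: keep name -> id, drop empty names
enums_xtra = {
    'species_t': {info['name']: info['id']
                  for info in species_lut.values() if info['name'] != ''}
}
```

Entries with an empty name are filtered out, so the enumeration only contains species that have a usable label.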

In [634]:
encode(fname_in, fname_out, nc_tpl_path(), verbose=False)

<marisco.serializers.NetCDFEncoder at 0x7fde305993d0>