# Generating a data model for CLIWOC

The purpose of this notebook is to demonstrate the structure of data models used by the `cdm_reader_mapper` toolbox.

## ICOADS IMMA

A common format for marine observational records is the ICOADS IMMA format. This is a text format, where each line contains the data (including metadata) for an individual record. The format is _attachment_ based: each record is constructed from a selection of (typically) fixed-width sections (called attachments), each containing a different subset of the data or metadata associated with the record. Documentation on the format and the available attachments can be found at [https://icoads.noaa.gov/e-doc/imma/R3.0-imma1.pdf](https://icoads.noaa.gov/e-doc/imma/R3.0-imma1.pdf).

Records within the same file can contain different attachments, meaning that the IMMA format is not a fixed-width format, as line lengths will vary between records. Each record, however, must contain a certain subset of the attachments (in this case the `core` (or `c0`), `c1`, and `c98` attachments). 
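
To illustrate why attachment-based records have varying line lengths, here is a minimal sketch using a toy layout. The layout (a 2-character identifier, a 2-character total length, then the payload) and the helper `split_attachments` are assumptions made for illustration only; they are not the IMMA specification and not part of `cdm_reader_mapper`.

```python
def split_attachments(record):
    """Split a toy record into {attachment id: payload}.

    Toy layout (hypothetical, NOT the real IMMA layout): each attachment is
    a 2-character id, a 2-character total length, then the payload.
    """
    attachments = {}
    pos = 0
    while pos < len(record):
        att_id = record[pos : pos + 2]
        att_len = int(record[pos + 2 : pos + 4])
        if att_len < 4:
            # The id and length fields alone take 4 characters
            raise ValueError(f"invalid attachment length at position {pos}")
        attachments[att_id] = record[pos + 4 : pos + att_len]
        pos += att_len
    return attachments


# Two records from the same (toy) file carrying different attachments
rec_a = "AA08abcd" + "BB06xy"
rec_b = "AA08efgh"

print(split_attachments(rec_a))
print(split_attachments(rec_b))
print(len(rec_a), len(rec_b))
```

Each attachment is fixed-width on its own, but the record length depends on which attachments are present.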

### Supplementary Data

Additional data or metadata can be provided in the `c99` attachment. This attachment is not fixed-width as different sources or decks can provide different collections of supplementary data.

## CLIWOC

In this example we use a subset of ICOADS release 3.0.0 IMMA formatted data for deck 730, which is data from the Climatological Database for the World's Oceans (CLIWOC). There is a large amount of supplementary data available in the `c99` attachment, which for deck 730 can be split into multiple sections. Here, we will start with the standard schema for the ICOADS IMMA format (included in `cdm_reader_mapper` as the `"icoads"` `imodel`), and extend the schema with fields for a subset of the `c99` attachment. We will add fields for the _logbook_ section of the `c99` attachment for this deck.

An internal schema already exists for this deck (`"icoads_r300_d730"`). The purpose of this notebook is to demonstrate how one can extend the `"icoads"` data model to parse `c99` data.

## Overview

* An initial read of the data subset using the `"icoads"` data model which does not parse the `c99` attachment.
* Extension of the `"icoads"` schema to add fields for the logbook section of the `c99` attachment for deck 730.
* Construction of a code table for a categorical field in the `c99` attachment.
* Comparison with the internal schema for deck 730.

In [None]:
from __future__ import annotations

import glob
import json
import os
import shutil

import pandas as pd

from cdm_reader_mapper import read_mdf, test_data
from cdm_reader_mapper.mdf_reader.properties import _base as base

try:
    from importlib.resources import files as get_files
except ImportError:
    from importlib_resources import files as get_files

from collections import OrderedDict
from tempfile import TemporaryDirectory

## The Data

For this example we load a subset of ICOADS data for deck 730 from the `cdm_reader_mapper` test data. This is the data that will be used throughout this notebook.

In [None]:
data_file_path = test_data.test_icoads_r300_d730["source"]

## Initial Read

First we read the data using the basic `"icoads"` data model. This step isn't necessary for extending the schema; it is included to highlight the raw `c99` data.

In [None]:
data_bundle = read_mdf(data_file_path, imodel="icoads")
data_raw = data_bundle.data

### Supplementary (`c99`) data

By looking at the `c99` section we can see that the supplementary data has not been parsed.

In [None]:
data_raw["c99"].head()

In [None]:
data_raw["c99"].iloc[3]

## Creating a data model

### Custom Schema

To use a custom schema we need to use the `ext_schema_path` argument in `read_mdf`. The structure of the directory is:

```
name_of_model/
    name_of_model.json
    code_tables/
        ...
```

The `code_tables` sub-directory contains the code tables that map the key columns in the data to their values.

In this example we create a temporary directory for the data model, so that it is cleaned up after the notebook is finished; in reality you would want to store the data model in a permanent directory!
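
As a quick sanity check of the layout described above, the following sketch builds the directory structure in a temporary directory and verifies it. The helper `check_model_dir` is hypothetical (it is not part of `cdm_reader_mapper`) and only checks the two items shown in the directory tree.

```python
from pathlib import Path
from tempfile import TemporaryDirectory


def check_model_dir(model_dir):
    """Return a list of layout problems for a data model directory (empty if none)."""
    model_dir = Path(model_dir)
    problems = []
    # The schema file is named after the model directory
    schema_file = model_dir / f"{model_dir.name}.json"
    if not schema_file.is_file():
        problems.append(f"missing schema file: {schema_file.name}")
    if not (model_dir / "code_tables").is_dir():
        problems.append("missing code_tables directory")
    return problems


with TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "name_of_model"
    model_dir.mkdir()
    before = check_model_dir(model_dir)  # nothing created yet

    (model_dir / "name_of_model.json").write_text("{}")
    (model_dir / "code_tables").mkdir()
    after = check_model_dir(model_dir)  # layout now complete

print(before)
print(after)
```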

We start from the basic `"icoads"` model. The `c99` section will be based on the `"icoads_r300_d730"` schema and code tables.

#### Copy the `"icoads"` schema

First we create a copy of the `"icoads"` schema (located at `mdf_reader/schemas/icoads/icoads.json`). NOTE: `cdm_reader_mapper.mdf_reader.properties._base` is used so that we have a relative path to the original schema and code tables.

In [None]:
tmp_dir = TemporaryDirectory()
my_model_name = "cliwoc"
my_model_path = os.path.join(tmp_dir.name, my_model_name)
os.mkdir(my_model_path)

# Get a copy of the "icoads" schema
icoads_schema_path = icoads_code_tables_path = get_files(
    ".".join([base, "schemas", "icoads"])
)
icoads_schema_path = os.path.join(icoads_schema_path, "icoads.json")

my_schema_path = os.path.join(my_model_path, my_model_name + ".json")
copy = shutil.copyfile(icoads_schema_path, my_schema_path)

#### Copy the code tables

We now copy each of the `"icoads"` code tables. This includes generic `icoads` code tables (located in `mdf_reader/codes/icoads`).

In [None]:
# Get code tables and copy to the directory
my_code_tables_path = os.path.join(my_model_path, "code_tables")
os.mkdir(my_code_tables_path)

# Original code table directory (generic ICOADS code tables)
icoads_code_tables_path = get_files(".".join([base, "codes", "icoads"]))

# Get filenames for each of the code tables
code_table_files = glob.glob(os.path.join(icoads_code_tables_path, "ICOADS.*.json"))

# Copy each file
for file in code_table_files:
    basename = os.path.basename(file)
    out_path = os.path.join(my_code_tables_path, basename)
    shutil.copyfile(file, out_path)

#### Extending the schema: CLIWOC logbook information

For this example we'll load the schema into the environment as a dictionary (we use an ordered dictionary to guarantee that the ordering of the fields is maintained!).

In [None]:
with open(my_schema_path) as io:
    schema = json.load(io, object_pairs_hook=OrderedDict)
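
A small sketch of the ordering behaviour, using a made-up JSON string rather than the real schema. On Python 3.7 and later a plain `dict` also preserves insertion order, so `OrderedDict` here mainly makes the intent explicit.

```python
import json
from collections import OrderedDict

# Made-up JSON text for illustration; the real schema is much larger
text = '{"sentinel": 1, "InstAbbr": 2, "InstName": 3}'

as_ordered = json.loads(text, object_pairs_hook=OrderedDict)
as_dict = json.loads(text)

ordered_keys = list(as_ordered)
dict_keys = list(as_dict)

# Both keep the keys in the order they appear in the JSON document
print(ordered_keys)
print(dict_keys)
```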

We now add the contents for section `c99`. There are some standard (`"header"`) fields we need to supply. The `"sentinel"` is the prefix for the attachment; it is printed in the raw supplementary data and identifies the start of the attachment.

We also need to specify the length of the attachment and the layout.

We then add our data fields to the `elements` field for the `c99` section. We'll add the fields for the logbook component of the CLIWOC supplementary data. There are additional components we could resolve, but we'll keep to the logbook for this example.

In [None]:
schema["sections"]["c99"]["header"]["sentinel"] = "99 0 "
schema["sections"]["c99"]["header"]["disable_read"] = False
schema["sections"]["c99"]["header"]["field_layout"] = "fixed_width"
schema["sections"]["c99"]["header"]["length"] = 245 + 5  # sentinel length
schema["sections"]["c99"]["elements"] = OrderedDict(
    {
        "sentinel": {
            "description": "attachment sentinel",
            "field_length": 5,
            "column_type": "str",
            "ignore": True,
        },
        "InstAbbr": {
            "description": "Abbreviation of the Institute storing the original data",
            "field_length": 8,
            "column_type": "str",
        },
        "InstName": {
            "description": "Full name of the Institute storing the original data",
            "field_length": 50,
            "column_type": "str",
        },
        "InstCity": {
            "description": "City where the Institute storing the data is located",
            "field_length": 10,
            "column_type": "str",
        },
        "InstCountry": {
            "description": "Country where the Institute storing the data is located",
            "field_length": 14,
            "column_type": "str",
        },
        "ArchiveID": {
            "description": "Administrative number under which the data is found within the Institute storing the data",
            "field_length": 15,
            "column_type": "str",
        },
        "ArchiveName": {
            "description": "Administrative name under which the data is found within the Institute storing the data",
            "field_length": 17,
            "column_type": "str",
        },
        "ArchivePart": {
            "description": "Part of the archive set in which the data is found within the Institute storing the data",
            "field_length": 39,
            "column_type": "str",
        },
        "ArchivePartSpec": {
            "description": "Specification of the part of the archive set in which the data is found within the Institute storing the data",
            "field_length": 31,
            "column_type": "str",
        },
        "LogbookID": {
            "description": "Identification Number of the logbook containing the data",
            "field_length": 30,
            "column_type": "str",
        },
        "LogbookLang": {
            "description": "Language of the logbook containing the data",
            "field_length": 7,
            "column_type": "str",
        },
        "ImageID": {
            "description": "Identification Number of the original image of the logbook",
            "field_length": 23,
            "column_type": "str",
        },
        "IllustrationAvail": {
            "description": "Illustration available on the current page of the logbook",
            "field_length": 1,
            "column_type": "key",
            "codetable": "CLIWOC_ILLUSTRATION_I",
        },
    }
)
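
The following sketch shows how a fixed-width layout turns field lengths into character positions, and checks that the field lengths add up to the header length. The field lengths are copied from the elements above; the helper `field_slices` and the input line are made up for illustration, and this is not how `cdm_reader_mapper` implements its parsing.

```python
from itertools import accumulate

# Field lengths copied from the "elements" defined above
field_lengths = {
    "sentinel": 5,
    "InstAbbr": 8,
    "InstName": 50,
    "InstCity": 10,
    "InstCountry": 14,
    "ArchiveID": 15,
    "ArchiveName": 17,
    "ArchivePart": 39,
    "ArchivePartSpec": 31,
    "LogbookID": 30,
    "LogbookLang": 7,
    "ImageID": 23,
    "IllustrationAvail": 1,
}
header_length = 245 + 5  # data fields + sentinel


def field_slices(lengths):
    """Map each field name to the slice it occupies in a fixed-width string."""
    ends = list(accumulate(lengths.values()))
    starts = [0] + ends[:-1]
    return {name: slice(s, e) for name, s, e in zip(lengths, starts, ends)}


slices = field_slices(field_lengths)
total = sum(field_lengths.values())
assert total == header_length, "field lengths must add up to the header length"

# Synthetic line (not real CLIWOC data), padded to the full section length
raw = "99 0 " + "ABCD".ljust(8) + "Some Institute".ljust(50) + " " * 187
parsed = {name: raw[sl].strip() for name, sl in slices.items()}

print(slices["InstAbbr"], repr(parsed["InstAbbr"]))
print(slices["InstName"], repr(parsed["InstName"]))
```

A check like this catches a mistyped `field_length` before the schema is written to disk.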

We can now write the dictionary to the schema file.

In [None]:
json_object = json.dumps(schema, indent=2)

with open(my_schema_path, "w") as outfile:
    outfile.write(json_object)

#### `IllustrationAvail` Code Table

One of the fields we have added has a `"column_type"` of `"key"`. This indicates categorical data, where each key maps to a longer descriptive value. We also specified a code table for this field, which should describe that mapping. Let's create that table now. As with the schema, it should be JSON-formatted.

This field has two possible values. We save the dictionary to a JSON file in the `code_tables` directory. The name of the file must match the `"codetable"` value for the field (plus the `".json"` extension).

In [None]:
illustration_avail_codes = {
    "0": "No illustration on the current logbook page.",
    "1": "Illustration available on the current logbook page.",
}
illustration_avail_path = os.path.join(
    my_code_tables_path, "CLIWOC_ILLUSTRATION_I.json"
)

json_object = json.dumps(illustration_avail_codes, indent=2)

with open(illustration_avail_path, "w") as outfile:
    outfile.write(json_object)
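
To make the link between the `"codetable"` value, the file name and the key lookup concrete, here is a minimal standalone sketch. It writes the table to a separate temporary directory and decodes a few made-up keys. Returning `None` for an unknown key is a choice made for this sketch, not a statement about how `cdm_reader_mapper` treats invalid keys.

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory

# The relevant parts of the element definition and its code table
element = {"column_type": "key", "codetable": "CLIWOC_ILLUSTRATION_I"}
codes = {
    "0": "No illustration on the current logbook page.",
    "1": "Illustration available on the current logbook page.",
}

with TemporaryDirectory() as tmp:
    # File name = "codetable" value + ".json"
    table_file = Path(tmp) / (element["codetable"] + ".json")
    table_file.write_text(json.dumps(codes, indent=2))
    loaded = json.loads(table_file.read_text())

# Made-up raw keys; "7" is deliberately not in the table
raw_keys = ["0", "1", "7"]
decoded = [loaded.get(key) for key in raw_keys]

for key, value in zip(raw_keys, decoded):
    print(key, "->", value)
```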

### Reading

We can now read the data file with the schema we have just created (by copying and extending the `"icoads"` schema). We specify the path to the data model (the directory containing the schema JSON file) and the path to the code tables.

In [None]:
my_bundle = read_mdf(
    data_file_path,  # Path to the data file
    ext_schema_path=my_model_path,  # Path to the directory containing the schema json file
    ext_table_path=my_code_tables_path,  # Path to the directory containing the json code tables
)
my_data = my_bundle.data

#### Analysing the output

We can now investigate components of the `c99` section.

In [None]:
my_data[["c99"]].head()

In [None]:
my_data[["c99"]].describe(include="all")

## Internal Schema

`cdm_reader_mapper` already includes a data model for the CLIWOC deck. The model parses all sections of supplementary data and provides all required code tables. Let's now read in the data using the `"icoads_r300_d730"` model.

In [None]:
all_data = read_mdf(
    data_file_path,
    imodel="icoads_r300_d730",
)

The `c99` section has been split into multiple sections. There is no `c99` section in the output; instead we now have:

* `c99_logbook`
* `c99_voyage`
* `c99_data`

We can compare the `c99_logbook` section to the output of our model. We see that we have extracted the same data, although we chose different column names for the elements.

In [None]:
all_data.data[["c99_logbook"]].describe(include="all")

In [None]:
my_data[["c99"]].describe(include="all")

#### Additional Sections

We can also look at the additional components we did not parse in our model.

We can note some remaining issues with the model as we look at the extra data. Most of the challenges relate to language translations.

In [None]:
pd.options.display.max_columns = None
all_data.data[["c99_voyage"]].describe(include="all")

In [None]:
all_data.data[["c99_voyage"]].c99_voyage.ZeroMeridian.head()

##### Ship types and languages

For example, the ship types on this deck are given in many different languages. There is no code table for this variable on the CLIWOC website.

In [None]:
all_data.data[["c99_voyage"]].c99_voyage.Ship_type.dropna().head()

In [None]:
all_data.data[["c99_data"]].c99_data.describe(include="all")

##### Wind force scales and languages

What about the different scales for the wind force, given different languages?

In [None]:
all_data.data[["c99_data"]].c99_data.wind_force.head()