## How to create a data model

### Creating a data model

In this notebook we will create and apply a new **data model/schema** to a raw `.imma` file, using the `mdf_reader` module. We will add supplemental metadata to the basic `imma1` data model and display the supplemental data as a pandas dataframe.

Let's first import all the tools that we will need.

In [1]:
from __future__ import annotations

import json
import os
import shutil
import sys

import pandas as pd

pd.options.display.max_columns = None

from collections import OrderedDict
from tempfile import TemporaryDirectory

try:
    from importlib.resources import files as get_files
except ImportError:
    from importlib_resources import files as get_files

from cdm_reader_mapper import mdf_reader
from cdm_reader_mapper.data import test_data

2024-10-02 13:31:02,677 - root - INFO - init basic configure of logging success


The `mdf_reader` tool comes with data model templates, stored as `.json` files, that we can use to build our own models. For more information see the [manual](https://git.noc.ac.uk/iregon/mdf_reader/-/blob/master/docs/User_manual.docx).

In [2]:
mdf_reader.properties.supported_data_models

['craid', 'gcc', 'icoads', 'pub47']

According to the manual, ICOADS data stored with the [IMMA format](https://icoads.noaa.gov/e-doc/imma/R3.0-imma1.pdf) represents a complex data model, since the data includes blocks of sections which are exclusive to certain decks (DCKs), e.g. data coming from the NOAA National Climatic Data Center (NCDC) TD-11 formats. Most of the ICOADS data, however, will need a **schema** based on the `icoads.json` format.

Let's try to build our own **schema** based on this template for a new deck (dck). In this notebook we will organise the data and metadata from the **US Maury collection** that corresponds to `source/dck 069-701`.

1. First, let's read a raw `.imma` file from dck 701 as an example, using a subset of the data collected in April 1845.

One should note that a full schema for this deck already exists: `"icoads_r300_d701"`.

In [3]:
# Load the test data
data = test_data.test_icoads_r300_d701_type2

data_raw = mdf_reader.read(data.get("source"), imodel="icoads")

2024-10-02 13:31:05,923 - root - INFO - Attempting to fetch remote file: icoads/r300/d701/input/icoads_r300_d701_1845-04-01_subset.imma.md5
2024-10-02 13:31:06,254 - root - INFO - READING DATA MODEL SCHEMA FILE...
2024-10-02 13:31:06,261 - root - INFO - EXTRACTING DATA FROM MODEL: icoads
2024-10-02 13:31:06,262 - root - INFO - Getting data string from source...
2024-10-02 13:31:06,614 - root - INFO - CREATING OUTPUT DATA ATTRIBUTES FROM DATA MODEL


We now look at the supplementary data column for this data, i.e. the `"c99"` column.

In [4]:
data_raw.data["c99"]

0    99 0 300850118450401  5404N 2354W             ...
1    99 0 810348118450401  4836N 2330W             ...
2    99 0 370731118450401  4643N15147W             ...
3    99 0 260597118450401  4454N 3015W             ...
4    99 0 250661118450401  4356N 2220W             ...
Name: c99, dtype: object

The `c99` column is a bit messy: the Supplemental Metadata is ingested in ICOADS as a single string, so we need to split each row into fields according to the source and deck (dck) documentation.

2. We then need to make a new data model or **schema** which can then be used by the `mdf_reader` module. For this we create a schema with the name `imma1_d701`. For the purposes of this notebook we will create this schema in a temporary directory.
3. In this directory we will need to add a `.json` file with the same name. This `imma1_d701.json` file will contain all the data model information, with instructions on how to subdivide the metadata stored in `c99`. The file is named `imma1_d701.json` because the data model for this deck is based on the `imma1` template (the `icoads` model listed above), with `c99` further subdivided into other columns/sections. We will start with a copy of the original `"imma1"` schema and add elements to the `c99` section.

In [5]:
# Get a copy of the "imma1" schema
schema: OrderedDict = mdf_reader.schemas.read_schema(imodel="icoads")
del schema["name"]

In [6]:
# Create the directory where we store the schema
my_model_name = "imma1_d701"
tmp_dir = TemporaryDirectory()
my_model_path = os.path.join(tmp_dir.name, my_model_name)
os.mkdir(my_model_path)
print(my_model_path)

/var/folders/vf/pskk3w4j38l8kk7bc9xm07j00000gp/T/tmp3g4eeqbu/imma1_d701
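The temporary directory created above stays on disk until `tmp_dir.cleanup()` is called or the object is discarded. Outside a notebook, a context manager makes that lifetime explicit. The following is a minimal sketch using only the standard library; it reuses the model name from this notebook and does not depend on `mdf_reader`.

```python
import os
from tempfile import TemporaryDirectory

my_model_name = "imma1_d701"

with TemporaryDirectory() as tmp_root:
    # Create <tmp_root>/imma1_d701 to hold the schema and code tables.
    model_path = os.path.join(tmp_root, my_model_name)
    os.mkdir(model_path)
    exists_inside = os.path.isdir(model_path)

# Leaving the `with` block removes the directory and everything in it.
exists_after = os.path.isdir(model_path)
print(exists_inside, exists_after)
```

In the notebook itself the directory has to survive across cells, which is why `TemporaryDirectory()` is kept as an object there.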


We should now look at the documentation for this deck, to see if we can parse the `c99` section.

From the US Maury collection [ICOADS documentation](https://icoads.noaa.gov/e-doc/other/transpec/maury/maury_transpec), we find out that the `c99` for this deck is composed of the following sections:

- Data
- Header information
- Quality control information (qc)

In this example we will only create a few new elements for demonstration purposes. A full schema file already exists for deck 701, and we are not looking to duplicate it in full here.

```
Data stored in the supplemental attachment consisted of the entire data record
(173 characters); followed by a selection of fields from, or derived from, the
associated header record (through character 241); and selected fields from the
qc file (total 250 characters):
  # Pos.     Total #  Field  Record
    range    of pos.   name    type  Description of field (of derived field)
--- -------  -------  -----  ------  ----------------------------------------
  1 1-7         7     cvoyd    data  voyage number
... ...               ...       ...  ...
 47 172-173     2     cmvq     data  magnetic variation QC indicator
 NA 174-175     2     cts2   header  (fr ship type, ctship, according to [5])
  4 176-177     2     cft    header  form type
  5 178-193    16     comm   header  commander (first 16 positions only) [6]
  6 194-217    24     cfr    header  from city
  7 218-241    24     cto    header  to city
  2 242-246     5     qc2    qc      reel sequence number
  5 247-248     2     qc5    qc      day  (local time) (99 indicates missing)
  6 249-250     2     qc6*   qc      hour (local time) (99 indicates missing)
--- -------  -------  -----  ------  ----------------------------------------
* Whenever qc6 was 24, zero was inadvertently written out to the supplemental
attachment.  This resulted from an error in the conversion program, but can
be fixed by interpretation of hour zero as hour 24 of qc5 + 1 (as noted in [2],
qc6 originally ranged 1-24, with 24 signifying hour 0 of the next day.  As
intended, qc5 was included in the supplementary attachment in original form.
```

The `c99_sentinal` section identifies where in the data a new section begins. In this case the new section corresponds to the Supplemental Metadata.

In our example this supplemental metadata is described in the documentation of the US Maury collection on the [ICOADS website](https://icoads.noaa.gov/e-doc/other/transpec/maury/maury_transpec).

4. We will need to add the metadata information from the website inside that `c99_sentinal` section and create as many sections as the data requires.

> sentinal: section identifier
> applies to: format.fixed_width
> is mandatory: it is not mandatory if the section is unique, unique in a parsing_order block, or
> part of a sequential parsing_order block.
> type: string
> comments: the element bearing the sentinal needs to be, additionally, declared in the
> elements block

In [7]:
c99 = data_raw.data["c99"]
line = c99.iloc[2]

In [8]:
# sentinal = 5
part_1 = line[0:5]
part_1

'99 0 '

In [9]:
# cvoyd voyage number = 7
part_2 = line[5 : 5 + 7]
part_2

'3707311'

In [10]:
# date = 8 (YYYYMMDD); the slice of 10 also picks up the next two characters (blank here)
part_3 = line[12 : 12 + 10]
part_3

'18450401  '
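The manual slicing above can be generalised to a list of field widths. This is a minimal sketch, not part of `mdf_reader`: `split_fixed_width` is a hypothetical helper, and `sample` is only the visible prefix of row 2 of the `c99` column shown earlier (the rest of the record is truncated in the display).

```python
from itertools import accumulate


def split_fixed_width(line, layout):
    """Slice `line` into named fields given ordered (name, width) pairs."""
    widths = [width for _, width in layout]
    # Start offsets are the running sum of the preceding widths.
    starts = [0, *accumulate(widths)][:-1]
    return {
        name: line[start : start + width]
        for (name, width), start in zip(layout, starts)
    }


# Widths taken from the slices above: sentinal 5, voyage number 7, date 4 + 2 + 2.
layout = [("sentinal", 5), ("cvoyd", 7), ("year", 4), ("month", 2), ("day", 2)]
sample = "99 0 370731118450401  4643N15147W"

fields = split_fixed_width(sample, layout)
print(fields)
```

The schema built in the next section encodes the same idea declaratively: each element carries a `field_length`, and the order of the elements determines the offsets.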

### Create the custom model

We now make the adjustments to the schema to parse the sentinal, voyage number, and date fields. The rest can be skipped for this example.

Here we add to the dictionary containing the `"imma1"` schema loaded earlier, and then save it to a `json` file in our model directory.

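Note that the ordering of the fields is important, because the elements of a fixed-width section are read one after another. We use an `OrderedDict` here to make that explicit. Since Python 3.7 a standard `dict` also preserves insertion order, but older Python versions did not guarantee it.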

In [11]:
schema["sections"]["c99"]["header"]["sentinal"] = "99 0 "
schema["sections"]["c99"]["header"]["disable_read"] = False
schema["sections"]["c99"]["header"]["field_layout"] = "fixed_width"
schema["sections"]["c99"]["header"]["length"] = 250 + 5  # Sentinal length
schema["sections"]["c99"]["elements"] = OrderedDict(
    {
        "sentinal": {
            "description": "attachment sentinal",
            "field_length": 5,
            "column_type": "str",
        },
        "cvoyd": {
            "description": "Voyage Information",
            "field_length": 7,
            "column_type": "str",
        },
        "year": {
            "description": "Year",
            "field_length": 4,
            "column_type": "uint16",
        },
        "month": {
            "description": "Month",
            "field_length": 2,
            "column_type": "uint8",
        },
        "day": {
            "description": "Day",
            "field_length": 2,
            "column_type": "uint8",
        },
        "rest": {
            "description": "Remaining c99 string",
            "field_length": 235,  # 250 - (8 + 7)
            "column_type": "str",
        },
    }
)
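Because the section is fixed width, the element lengths should add up to the declared section length. The check below is a small sketch under that assumption; `total_field_length` is a hypothetical helper, and the dictionary only mirrors the `field_length` values defined above. Whether `mdf_reader` performs a similar check itself is not stated here.

```python
from collections import OrderedDict


def total_field_length(elements):
    """Sum the `field_length` of every element in a section."""
    return sum(element["field_length"] for element in elements.values())


# Same field lengths as in the schema above (other keys omitted).
elements = OrderedDict(
    [
        ("sentinal", {"field_length": 5}),
        ("cvoyd", {"field_length": 7}),
        ("year", {"field_length": 4}),
        ("month", {"field_length": 2}),
        ("day", {"field_length": 2}),
        ("rest", {"field_length": 235}),
    ]
)

section_length = 250 + 5  # supplemental attachment plus sentinal
total = total_field_length(elements)
print(total, total == section_length)
```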

In [12]:
json_object = json.dumps(schema, indent=2)

with open(os.path.join(my_model_path, my_model_name + ".json"), "w") as outfile:
    outfile.write(json_object)
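Since element order matters, it can be reassuring to confirm that the order survives the trip through JSON. The following sketch uses only the standard library and a shortened, illustrative set of elements; `json.dumps` writes keys in insertion order (as long as `sort_keys` is not set), and `object_pairs_hook=OrderedDict` restores them in file order.

```python
import json
from collections import OrderedDict

# A shortened stand-in for the c99 elements, in parsing order.
elements = OrderedDict(
    [
        ("sentinal", {"field_length": 5}),
        ("cvoyd", {"field_length": 7}),
        ("year", {"field_length": 4}),
    ]
)

text = json.dumps({"elements": elements}, indent=2)

# Read the JSON back, keeping keys in the order they appear in the text.
restored = json.loads(text, object_pairs_hook=OrderedDict)
restored_order = list(restored["elements"])
print(restored_order)
```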

The final component of a model for the `mdf_reader` module is the `code_tables`. These are the tables that relate `key` columns to their values. For this example we will copy the code tables from the original `icoads` model.

In [13]:
code_tables_path = get_files(".".join([mdf_reader.properties._base, "codes", "icoads"]))
shutil.copytree(code_tables_path, os.path.join(my_model_path, "code_tables"))

'/var/folders/vf/pskk3w4j38l8kk7bc9xm07j00000gp/T/tmp3g4eeqbu/imma1_d701/code_tables'
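At this point the model directory holds the two pieces created in this notebook: a schema file named after the directory, and a `code_tables` subdirectory. The sketch below checks for that layout. `check_model_dir` is a hypothetical helper written for this example, and the layout it tests is simply the one built above, not a statement of everything `mdf_reader` may accept.

```python
from pathlib import Path
from tempfile import TemporaryDirectory


def check_model_dir(model_path):
    """Return a list of problems found in a custom model directory."""
    model_path = Path(model_path)
    problems = []
    # The schema file carries the same name as the directory.
    if not (model_path / f"{model_path.name}.json").is_file():
        problems.append("missing schema file")
    if not (model_path / "code_tables").is_dir():
        problems.append("missing code_tables directory")
    return problems


with TemporaryDirectory() as tmp_root:
    model = Path(tmp_root) / "imma1_d701"
    model.mkdir()
    before = check_model_dir(model)  # nothing has been created yet

    (model / "imma1_d701.json").write_text("{}")
    (model / "code_tables").mkdir()
    after = check_model_dir(model)  # both pieces are now present

print(before, after)
```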

Now we feed this new data model to the `mdf_reader.read` function. To use our custom schema we need to specify the `ext_schema_path` argument, rather than the `imodel` argument used earlier.

In [14]:
data_new = mdf_reader.read(data.get("source"), ext_schema_path=my_model_path)

2024-10-02 13:31:07,213 - root - INFO - READING DATA MODEL SCHEMA FILE...
2024-10-02 13:31:07,269 - root - INFO - EXTRACTING DATA FROM MODEL: None
2024-10-02 13:31:07,269 - root - INFO - Getting data string from source...
2024-10-02 13:31:07,831 - root - INFO - CREATING OUTPUT DATA ATTRIBUTES FROM DATA MODEL


And, as if by magic, the messy string is now _partially_ separated!

In [15]:
data_new.data["c99"]

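|   | sentinal | cvoyd | year | month | day | rest |
|---|----------|-------|------|-------|-----|------|
| 0 | 99 0 | 3008501 | 1845 | 4 | 1 | 5404N 2354W ... |
| 1 | 99 0 | 8103481 | 1845 | 4 | 1 | 4836N 2330W 29291 ... |
| 2 | 99 0 | 3707311 | 1845 | 4 | 1 | 4643N15147W ... |
| 3 | 99 0 | 2605971 | 1845 | 4 | 1 | 4454N 3015W 20200W ... |
| 4 | 99 0 | 2506611 | 1845 | 4 | 1 | 4356N 2220W ... |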


We can also quickly verify that the original fields are parsed in the same way. Here we just check that the columns in the `core` section are unchanged by the changes made to the schema.

In [16]:
for c in data_new.data["core"].columns:
    assert data_new.data["core"][c].equals(data_raw.data[("core", c)])