## Mapping supplemental data from deck 704 to a CDM

This guide follows up on the [mdf_reader()](https://git.noc.ac.uk/brecinosrivas/mdf_reader/-/tree/master/) python tool [notebooks](https://git.noc.ac.uk/brecinosrivas/mdf_reader/-/tree/master/docs/notebooks), in which we extracted supplemental metadata from [ICOADSv3.0](https://icoads.noaa.gov/r3.html) stored in the [IMMA](https://icoads.noaa.gov/e-doc/imma/R3.0-imma1.pdf) format.
Now we will map this supplemental data to a Common Data Model (CDM) format defined in the following [documentation](https://git.noc.ac.uk/brecinosrivas/cdm-mapper/-/blob/master/docs/cdm_latest.pdf).

This is done by using the [cdm-mapper python tool from the branch deck704v0](https://git.noc.ac.uk/brecinosrivas/cdm-mapper/-/tree/deck704v0) and following the workflow explained below. 

Throughout this notebook we will also point out issues found in the cdm-mapper code and the CDM docs. These issues need to be addressed if we want to map certain variables to the CDM.

We are analysing deck: `704`, the [US Marine Meteorological Journals Collection](https://icoads.noaa.gov/usmmj.html)

In [1]:
from __future__ import annotations

import os
import sys

import pandas as pd

import cdm_reader_mapper.cdm_mapper as cdm
from cdm_reader_mapper import mdf_reader, test_data

2024-08-23 10:16:15,244 - root - INFO - init basic configure of logging success
2024-08-23 10:16:17,527 - root - INFO - init basic configure of logging success


We first read the supplemental data from the `c99` section of the IMMA file for a subset of the data (here, October 1878).

In [2]:
schema = "imma1_d704"
data_file_path = test_data.test_125_704["source"]
data_raw = mdf_reader.read(data_file_path, data_model=schema)

2024-08-23 10:16:17,673 - root - INFO - Attempting to fetch remote file: imma1_704/input/125-704_1878-10-01_subset.imma.md5
2024-08-23 10:16:18,442 - root - INFO - READING DATA MODEL SCHEMA FILE...
2024-08-23 10:16:18,449 - root - INFO - EXTRACTING DATA FROM MODEL: imma1_d704
2024-08-23 10:16:18,449 - root - INFO - Getting data string from source...
2024-08-23 10:16:19,641 - root - INFO - CREATING OUTPUT DATA ATTRIBUTES FROM DATA MODEL


The data from the c99 column for this deck is split into the following subsections:
- c99_sentinal
- c99_journal
- c99_voyage
- c99_daily
- c99_data4
- c99_data5

In [3]:
data_raw.data.c99_sentinal.head()

Unnamed: 0,ATTI,ATTL,BLK
0,99,0,
1,99,0,
2,99,0,
3,99,0,
4,99,0,


In [4]:
pd.options.display.max_columns = None
data_raw.data.c99_journal.head()

Unnamed: 0,sentinal,reel_no,journal_no,frame_no,ship_name,journal_ed,rig,ship_material,vessel_type,vessel_length,vessel_beam,commander,country,screw_paddle,hold_depth,tonnage,baro_type,baro_height,baro_cdate,baro_loc,baro_units,baro_cor,thermo_mount,SST_I
0,1,2,18,3,Panay,78,1,1,1,187,37,"S.P.Bray,Jr",1,3,23,1190,2,14,,Bulkhead of cabin,1,- .102,2,
1,1,2,18,3,Panay,78,1,1,1,187,37,"S.P.Bray,Jr",1,3,23,1190,2,14,,Bulkhead of cabin,1,- .102,2,
2,1,2,18,3,Panay,78,1,1,1,187,37,"S.P.Bray,Jr",1,3,23,1190,2,14,,Bulkhead of cabin,1,- .102,2,
3,1,2,18,3,Panay,78,1,1,1,187,37,"S.P.Bray,Jr",1,3,23,1190,2,14,,Bulkhead of cabin,1,- .102,2,
4,1,2,18,3,Panay,78,1,1,1,187,37,"S.P.Bray,Jr",1,3,23,1190,2,14,,Bulkhead of cabin,1,- .102,2,


In [5]:
data_raw.data.c99_voyage.head()

Unnamed: 0,sentinal,reel_no,journal_no,frame_start,from_city,to_city
0,2,2,18,14,Boston,Rio de Janeiro
1,2,2,18,14,Boston,Rio de Janeiro
2,2,2,18,14,Boston,Rio de Janeiro
3,2,2,18,14,Boston,Rio de Janeiro
4,2,2,18,14,Boston,Rio de Janeiro


In [6]:
data_raw.data.c99_daily.head()

Unnamed: 0,sentinal,reel_no,journal_no,frame_start,frame,year,month,day,distance,lat_deg_an,lat_min_an,lat_hemis_an,lon_deg_an,lon_min_an,lon_hemis_an,lat_deg_on,lat_min_on,lat_hemis_on,lon_deg_of,lon_min_of,lon_hemis_of,current_speed,current_direction
0,3,2,18,14,15,1878,10,20,,,,,,,,42,20,N,66,30,W,0.1,E
1,3,2,18,14,15,1878,10,20,,,,,,,,42,20,N,66,30,W,0.1,E
2,3,2,18,14,15,1878,10,20,,,,,,,,42,20,N,66,30,W,0.1,E
3,3,2,18,14,15,1878,10,20,,,,,,,,42,20,N,66,30,W,0.1,E
4,3,2,18,14,15,1878,10,20,,,,,,,,42,20,N,66,30,W,0.1,E


In [7]:
data_raw.data.c99_data4.head()

Unnamed: 0,sentinal,reel_no,journal_no,frame_start,frame,year,month,day,time_ind,hour,ship_speed,compass_ind,ship_course_compass,compass_correction,ship_course_true,wind_dir_mag,wind_dir_true,wind_force,barometer,temp_ind,attached_thermometer,air_temperature,wet_bulb_temperature,sea_temperature,present_weather,clouds,sky_clear,sea_state
0,4,2,18,14,15,1878,10,20,1,2,8.5,,EXS,,,WSW,,6,2960,1,5.8,,,,BOC,CU,5,R
1,4,2,18,14,15,1878,10,20,1,4,8.5,,EXS,,,WSW,,6,2960,1,5.6,,,,BOC,SC,3,R
2,4,2,18,14,15,1878,10,20,1,6,8.5,,EXS,,,W,,6,2962,1,5.6,4.8,,5.2,OCG,SC,0,R
3,4,2,18,14,15,1878,10,20,1,8,8.0,,EXS,,,W,,6,2964,1,5.6,4.8,,5.2,CG,SC,0,R
4,4,2,18,14,15,1878,10,20,1,10,8.5,,EXS,,,W,,6,2969,1,5.7,4.8,,5.0,BC,SC,2,L


In [8]:
data_raw.data.c99_data5.head()

Unnamed: 0,sentinal,reel_no,journal_no,frame_start,frame,year,month,day,time_ind,hour,ship_speed,compass_ind,ship_course_compass,blank,ship_course_true,wind_dir_mag,wind_dir_true,wind_force,barometer,temp_ind,attached_thermometer,air_temperature,wet_bulb_temperature,sea_temperature,present_weather,clouds,sky_clear,sea_state,compass_correction_ind,compass_correction,compass_correction_dir
0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


Now that we have separated the c99 data into its sections, we see that this deck stores the same type of data in two alternative sections:

- c99_data4
- c99_data5

Both sections use the same variable names. To map the correct section into the CDM, we need to filter out the section that contains only NaN values.
The problem is that we do not know in advance which years of the time series use c99_data4 and which use c99_data5.

> Note that excluding one section only works for decks whose sections are mutually exclusive: among the sections listed in the block, only one appears in any given report.
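
The filter described above can be sketched on a toy DataFrame. This is only an illustration, not the cdm-mapper implementation: the two-level `(section, field)` column layout, the toy values and the helper name `non_empty_sections` are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Toy frame with (section, field) columns; this layout is an assumption
columns = pd.MultiIndex.from_tuples(
    [
        ("c99_data4", "hour"),
        ("c99_data4", "wind_force"),
        ("c99_data5", "hour"),
        ("c99_data5", "wind_force"),
    ]
)
toy = pd.DataFrame(
    [[2, 6, np.nan, np.nan], [4, 6, np.nan, np.nan]],
    columns=columns,
)


def non_empty_sections(df, candidates):
    """Return the candidate sections that hold at least one non-NaN value."""
    return [name for name in candidates if df[name].notna().any().any()]


# Hypothetical helper applied to the two mutually exclusive sections
kept = non_empty_sections(toy, ["c99_data4", "c99_data5"])
print(kept)
```

Because the check runs on the data itself, it does not require knowing beforehand which years use which section.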


We can now use the `"icoads_r3000_d704"` model to map the raw data to the Common Data Model [glamod/common_data_model](https://www.github.com/glamod/common_data_model). The `map_model` function from the `cdm_mapper` module applies the model's conversion functions, so that variables end up in the units and/or specification required by the [CDM Documentation](https://github.com/glamod/common_data_model/blob/master/cdm_latest.pdf).

To run the data model we need two things:

- the raw data (the data we just read above, including its sections and column names)
- the name of the model

In [9]:
help(cdm.map_model)

Help on function map_model in module cdm_reader_mapper.cdm_mapper.mapper:

map_model(imodel, data, cdm_subset=None, codes_subset=None, log_level='INFO')
    Map a pandas DataFrame to the CDM header and observational tables.

    Parameters
    ----------
    imodel: str
      a data model that can be of several types.
      1. A generic mapping from a defined data model, like IMMA1’s core and attachments.
      e.g. ``cdm/library/mappings/icoads_r3000``
      2. A specific mapping from generic data model to CDM, like map a SID-DCK from IMMA1’s core and attachments to
      CDM in a specific way.
      e.g. ``cdm/library/mappings/icoads_r3000_d704``
    data: pd.DataFrame, pd.parser.TextFileReader or io.String
      input data to map.
      Type: string.
    cdm_subset: list, optional
      subset of CDM model tables to map.
      Defaults to the full set of CDM tables defined for the imodel.
    codes_subset: list, optional
      subset of code mapping tables to map.
      Default to t

In [10]:
name_of_model = "icoads_r3000_d704"

cdm_dict = cdm.map_model(
    name_of_model,
    data_raw.data,
)

2024-08-23 10:16:19,878 - root - INFO - init basic configure of logging success
2024-08-23 10:16:19,894 - root - INFO - init basic configure of logging success
2024-08-23 10:16:19,898 - root - INFO - init basic configure of logging success
2024-08-23 10:16:19,901 - root - INFO - init basic configure of logging success
     YR VS
0  1878  3
1  1878  3
2  1878  3
3  1878  3
4  1878  3 to frame.


Now, have we succeeded in writing some of the data to the CDM format?

We were looking to write the following data:

### Header section

 - Platform type and sub type
 - primary station id: original ship names
 - Longitude and Latitudes: converted from Degrees Minutes and Hemisphere to Decimal degrees
 - Location accuracy
 
 
### Observations tables

- `Observations-at`: latitude, longitude and location precision
- `Observations-dpt`: latitude, longitude and location precision
- `Observations-slp`: latitude, longitude and location precision
     - `z_coordinate`: barometer height in feet converted to metres.
     - original units: written in the CDM code format

- `Observations-sst`: latitude, longitude and location precision
- `Observations-wbt`: latitude, longitude and location precision
- `Observations-wd`: latitude, longitude and location precision
- `Observations-ws`: latitude, longitude and location precision


In [11]:
data = cdm_dict["header"]["data"]
data.head()

Unnamed: 0,report_id,application_area,observing_programme,report_type,station_name,station_type,platform_type,platform_sub_type,primary_station_id,station_record_number,primary_station_id_scheme,longitude,latitude,location_accuracy,location_quality,crs,station_speed,station_course,height_of_station_above_local_ground,height_of_station_above_sea_level,report_meaning_of_timestamp,report_timestamp,report_duration,report_time_accuracy,report_time_quality,report_quality,duplicate_status,record_timestamp,history,source_id,source_record_id
0,ICOADS-30-020N16,"[1, 7, 10, 11]","[5, 7, 56]",0,Panay,2,2,26,Panay,1,8,-68.41,42.28,,0,0,4.11552,90,0,0,2,1878-10-20 06:00:00,11,3600,2,0,4,2024-08-23 08:16:19.992685+00:00,2024-08-23 08:16:19. Initial conversion from I...,ICOADS-3-0-0T-125-704-1878-10,020N16
1,ICOADS-30-020N1P,"[1, 7, 10, 11]","[5, 7, 56]",0,Panay,2,2,26,Panay,1,8,-68.03,42.31,,0,0,4.11552,90,0,0,2,1878-10-20 08:00:00,11,3600,2,0,4,2024-08-23 08:16:19.992685+00:00,2024-08-23 08:16:19. Initial conversion from I...,ICOADS-3-0-0T-125-704-1878-10,020N1P
2,ICOADS-30-020N25,"[1, 7, 10, 11]","[5, 7, 56]",0,Panay,2,2,26,Panay,1,8,-67.64,42.33,,0,0,4.11552,90,0,0,2,1878-10-20 10:00:00,11,3600,2,0,4,2024-08-23 08:16:19.992685+00:00,2024-08-23 08:16:19. Initial conversion from I...,ICOADS-3-0-0T-125-704-1878-10,020N25
3,ICOADS-30-020N2Q,"[1, 7, 10, 11]","[5, 7, 56]",0,Panay,2,2,26,Panay,1,8,-67.29,42.35,,0,0,4.11552,90,0,0,2,1878-10-20 12:00:00,11,3600,2,0,4,2024-08-23 08:16:19.992685+00:00,2024-08-23 08:16:19. Initial conversion from I...,ICOADS-3-0-0T-125-704-1878-10,020N2Q
4,ICOADS-30-020N3A,"[1, 7, 10, 11]","[5, 7, 56]",0,Panay,2,2,26,Panay,1,8,-66.9,42.37,,0,0,4.11552,90,0,0,2,1878-10-20 14:00:00,11,3600,2,0,4,2024-08-23 08:16:19.992685+00:00,2024-08-23 08:16:19. Initial conversion from I...,ICOADS-3-0-0T-125-704-1878-10,020N3A


As an example, we look at the mapped latitude and longitude:

In [12]:
data.latitude.head(), data.longitude.head()

(0    42.28
 1    42.31
 2    42.33
 3    42.35
 4    42.37
 Name: latitude, dtype: float64,
 0   -68.41
 1   -68.03
 2   -67.64
 3   -67.29
 4   -66.90
 Name: longitude, dtype: float64)

In [13]:
data_raw.data.c99_daily[
    [
        "lat_deg_on",
        "lat_min_on",
        "lat_hemis_on",
        "lon_deg_of",
        "lon_min_of",
        "lon_hemis_of",
    ]
].head()

Unnamed: 0,lat_deg_on,lat_min_on,lat_hemis_on,lon_deg_of,lon_min_of,lon_hemis_of
0,42,20,N,66,30,W
1,42,20,N,66,30,W
2,42,20,N,66,30,W
3,42,20,N,66,30,W
4,42,20,N,66,30,W


This has been successfully converted to decimal degrees, with the correct sign for each hemisphere.
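
The arithmetic behind such a conversion can be sketched as follows. This is a minimal illustration and not the cdm-mapper code: the function name `dms_to_decimal` and the rounding to two decimals are assumptions, and the input is the single daily position from `c99_daily` shown above, so the result is not expected to match the per-report header positions exactly.

```python
def dms_to_decimal(degrees, minutes, hemisphere):
    """Convert degrees, minutes and a hemisphere letter to decimal degrees."""
    value = degrees + minutes / 60.0
    # Southern and western hemispheres are negative by convention
    if hemisphere.upper() in ("S", "W"):
        value = -value
    # Rounding to two decimals is an assumption made for this example
    return round(value, 2)


lat = dms_to_decimal(42, 20, "N")
lon = dms_to_decimal(66, 30, "W")
print(lat, lon)
```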


For sea level pressure (SLP), the journal section holds additional information:

In [14]:
data_raw.data.c99_journal[["baro_type", "baro_height", "baro_units"]].head()

Unnamed: 0,baro_type,baro_height,baro_units
0,2,14,1
1,2,14,1
2,2,14,1
3,2,14,1
4,2,14,1


Original code table for the barometer type:

```
{
	"1":"aneroid",
	"2":"mercurial"
}
```
Original code table for the barometer units, which has been left unchanged:

```
{
	"1":"inches",
	"2":"millimeters",
	"3":"millibars",
	"4":"unable to determine",
	"5":"Paris inches"
}
```

Our CDM code table will be:
```
{
  "1":1001,
  "2":1002,
  "3":1003,
  "4":9999,
  "5":1005
}
```

The value 9999 corresponds to `"fill_value": 9999`, which tells the cdm-mapper to treat these entries as NaN.
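
The effect of this code table can be illustrated with plain pandas. This is a sketch under assumptions, not the cdm-mapper's own lookup: the names `BARO_UNITS_TO_CDM`, `FILL_VALUE` and `raw_units` are hypothetical, and how the mapper handles the fill value internally is not shown in this notebook.

```python
import pandas as pd

# Code table copied from the text above
BARO_UNITS_TO_CDM = {"1": 1001, "2": 1002, "3": 1003, "4": 9999, "5": 1005}
FILL_VALUE = 9999

# Toy input: inches, unable to determine, Paris inches
raw_units = pd.Series(["1", "4", "5"])

# Look up the CDM code for each original code
mapped = raw_units.map(BARO_UNITS_TO_CDM)

# Replace the fill value with NaN (assumed behaviour, for illustration only)
cleaned = mapped.where(mapped != FILL_VALUE)
print(cleaned)
```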


In [15]:
data_obs = cdm_dict["observations-slp"]["data"]
data_obs.head()

Unnamed: 0,observation_id,report_id,data_policy_licence,date_time,date_time_meaning,observation_duration,longitude,latitude,crs,z_coordinate,z_coordinate_type,observation_height_above_station_surface,observed_variable,observation_value,value_significance,units,conversion_flag,location_precision,spatial_representativeness,quality_flag,numerical_precision,sensor_automation_status,exposure_of_sensor,original_precision,original_units,original_value,conversion_method,processing_level,traceability,advanced_qc,advanced_uncertainty,advanced_homogenisation,source_id
0,ICOADS-30-020N16-SLP,ICOADS-30-020N16,0,1878-10-20 06:00:00,2,8,-68.41,42.28,0,4.27,0,4.27,58,99610.0,2,32,0,,3,2,,5,3,,1001,996.1,7,3,2,0,0,0,ICOADS-3-0-0T-125-704-1878-10
1,ICOADS-30-020N1P-SLP,ICOADS-30-020N1P,0,1878-10-20 08:00:00,2,8,-68.03,42.31,0,4.27,0,4.27,58,99630.0,2,32,0,,3,2,,5,3,,1001,996.3,7,3,2,0,0,0,ICOADS-3-0-0T-125-704-1878-10
2,ICOADS-30-020N25-SLP,ICOADS-30-020N25,0,1878-10-20 10:00:00,2,8,-67.64,42.33,0,4.27,0,4.27,58,99690.0,2,32,0,,3,2,,5,3,,1001,996.9,7,3,2,0,0,0,ICOADS-3-0-0T-125-704-1878-10
3,ICOADS-30-020N2Q-SLP,ICOADS-30-020N2Q,0,1878-10-20 12:00:00,2,8,-67.29,42.35,0,4.27,0,4.27,58,99760.0,2,32,0,,3,2,,5,3,,1001,997.6,7,3,2,0,0,0,ICOADS-3-0-0T-125-704-1878-10
4,ICOADS-30-020N3A-SLP,ICOADS-30-020N3A,0,1878-10-20 14:00:00,2,8,-66.9,42.37,0,4.27,0,4.27,58,99920.0,2,32,0,,3,2,,5,3,,1001,999.2,7,3,2,0,0,0,ICOADS-3-0-0T-125-704-1878-10
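
As a cross-check of the `z_coordinate` column above, the journal reports a barometer height of 14 (`baro_height`), which this notebook describes as feet converted to metres. A minimal sketch of that conversion, assuming the international foot (0.3048 m) and rounding to two decimals; both are assumptions and are not taken from the cdm-mapper code:

```python
# Assumed conversion factor: one international foot in metres
FEET_TO_METRES = 0.3048


def feet_to_metres(height_ft):
    """Convert a height in feet to metres, rounded to two decimals."""
    return round(height_ft * FEET_TO_METRES, 2)


# baro_height taken from the c99_journal output above
z_coordinate = feet_to_metres(14)
print(z_coordinate)
```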
