# Interchange format

In PRIMAP2, data is handled internally in
[xarray datasets](https://xarray.pydata.org/en/stable/data-structures.html#dataset)
with defined coordinates and metadata. On disk, this structure is stored as a netCDF
file. To enable easy data interchange with other researchers and to provide a data format
that retains the full structural information of PRIMAP2 datasets but is easy to read using
other software packages, or even Excel or Calc, we have developed the **PRIMAP2
Interchange Format**, which is based on a wide format with separately stored
accompanying metadata.

## Logical format
In the interchange format, all dimensions and time points are represented by columns in
a two-dimensional array. Values in the time columns are data, while values in the other
columns are metadata. To store the metadata and additional information that the PRIMAP2
xarray format keeps in its `attrs` dict, we use an additional structure. See the
sections *In-memory representation* and *On-disk representation* below for information
on how these structures are stored.

The metadata requirements are the same as in the PRIMAP2 standard data format.
The dimensions `area` and `source`, which are mandatory in the xarray format, are mandatory
columns in the interchange format. The `time` dimension is spread along the horizontal
axis of the table, with one column per time point. Additionally, `unit` and `entity`
are mandatory columns, with the restriction that each entity can have only one unit.
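
The requirements above can be illustrated with a minimal sketch that checks a table given as a list of row dicts. The helper `check_interchange_rows` and the sample rows are hypothetical and are not part of the PRIMAP2 API; the sketch only assumes that a column such as `area (ISO3)` counts as the mandatory `area` column.

```python
# Hypothetical helper for illustration, not part of the PRIMAP2 API.
MANDATORY_DIMS = ("area", "source", "unit", "entity")


def check_interchange_rows(rows):
    """Return a list of problems found in a list of row dicts."""
    problems = []
    columns = set().union(*(row.keys() for row in rows))
    # A column such as "area (ISO3)" counts as the mandatory "area" column.
    dims = {col.split(" (")[0] for col in columns}
    for name in MANDATORY_DIMS:
        if name not in dims:
            problems.append(f"missing mandatory column: {name}")
    # Each entity may have only one unit.
    units_by_entity = {}
    for row in rows:
        if "entity" in row and "unit" in row:
            units_by_entity.setdefault(row["entity"], set()).add(row["unit"])
    for entity, units in sorted(units_by_entity.items()):
        if len(units) > 1:
            problems.append(f"entity {entity} has several units: {sorted(units)}")
    return problems


# Made-up rows: the same entity appears with two different units.
rows = [
    {"source": "TESTcsv2021", "area (ISO3)": "AUS", "entity": "CO2",
     "unit": "Mt CO2 / yr", "1991": 4.0},
    {"source": "TESTcsv2021", "area (ISO3)": "FRA", "entity": "CO2",
     "unit": "Gg CO2 / yr", "1991": 12.0},
]
print(check_interchange_rows(rows))
```

Here the check reports the conflicting units for `CO2`, while all four mandatory columns are found.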

All optional dimensions (see [Data format details](data_format_details.rst)) can be
added as optional columns. Secondary categories are columns with free format names.
They are listed as secondary columns in the metadata dict.

Column names correspond to the dimension key of the xarray format, i.e. they contain
the terminology in parentheses (e.g. `area (ISO3)`).
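
As a small illustration of this naming scheme, the following sketch splits a column name into its dimension and terminology. The function `split_column_name` is a hypothetical helper written for this page, not a PRIMAP2 function.

```python
import re

# Matches "dimension (terminology)", e.g. "area (ISO3)".
_COLUMN_RE = re.compile(r"^(?P<dim>.+?) \((?P<term>[^()]+)\)$")


def split_column_name(column):
    """Split 'area (ISO3)' into ('area', 'ISO3'); plain names give (name, None)."""
    match = _COLUMN_RE.match(column)
    if match is None:
        return column, None
    return match.group("dim"), match.group("term")


for col in ["area (ISO3)", "category (IPCC2006)", "Class (class)", "entity", "unit"]:
    print(col, "->", split_column_name(col))
```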

Additional columns are currently not possible, but the option will be added
in a future release ([#25](https://github.com/pik-primap/primap2/issues/25)).

The metadata dict corresponds to the `attrs` dict of the xarray format
(see [Data format details](data_format_details.rst)).

## Use
The interchange format is intended mainly for use in two settings.

* To publish data processed using PRIMAP2 in a way that is easy for others to read but
also keeps the internal structure and metadata. The format will be used for future data
publications by the PRIMAP team, including PRIMAP-hist.
* To have a common intermediate format for reading data from original sources (mostly
xls or csv files in different formats). All data is first read into the interchange
format and subsequently converted into the native PRIMAP2 format. This simplifies the
data reading functions and enables the use of our data reading routines in other
projects and software packages.

## In-memory representation
The in-memory representation of the interchange format uses a pandas DataFrame
to store the data and a dict to store the additional metadata. Pandas DataFrames
can store metadata internally (in `DataFrame.attrs`), but this feature is still
experimental and subject to change without notice. We therefore use it only in rare
exceptions and generally store the additional metadata separately. For an example see
the *Examples* section below.
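
To make the structure concrete, here is a minimal sketch of a DataFrame and a separate metadata dict built by hand. The values are made up, and the single key shown in `attrs` is an assumption for illustration; real PRIMAP2 metadata dicts may contain more entries.

```python
import pandas as pd

# Illustrative values only; column names follow the "dimension (terminology)" pattern.
data = pd.DataFrame(
    {
        "source": ["TESTcsv2021", "TESTcsv2021"],
        "area (ISO3)": ["AUS", "FRA"],
        "entity": ["CO2", "CH4"],
        "unit": ["Mt CO2 / yr", "Gg CH4 / yr"],
        "1991": [4.0, 7.0],
        "2000": [5.0, 8.0],
    }
)

# Metadata kept in a separate dict (the key shown here is an assumption).
attrs = {"area": "area (ISO3)"}

# Time columns hold data, all other columns hold metadata.
time_cols = [col for col in data.columns if col.isdigit()]
meta_cols = [col for col in data.columns if col not in time_cols]
print(time_cols)
print(meta_cols)
```

Detecting time columns with `str.isdigit` only works for plain year labels, as in this example; other time formats would need a different test.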

## On-disk representation
On disk, the dataset is represented by a csv file containing the array and a yaml file
containing the additional metadata as a dict named `attrs`.
Both files should have the same name except for the file
extension. Additionally, the yaml file contains the string variable `data_file`, which holds the
name of the csv file. Thus, a function reading interchange format data only needs the yaml
file name to read the data. For an example see the *Examples* section below.
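
The file pair described above can be sketched with the standard `csv` module and PyYAML (assumed to be installed). This is not the PRIMAP2 implementation: `read_pair` is a hypothetical helper, the file contents are made up, and resolving `data_file` relative to the folder of the yaml file is an assumption. Files written by PRIMAP2 may contain further keys.

```python
import csv
import tempfile
from pathlib import Path

import yaml  # PyYAML, assumed to be installed


def read_pair(yaml_path):
    """Read the yaml file, then the csv file named in its `data_file` entry."""
    yaml_path = Path(yaml_path)
    with yaml_path.open() as f:
        meta = yaml.safe_load(f)
    # Assumption: the csv file name is relative to the folder of the yaml file.
    csv_path = yaml_path.parent / meta["data_file"]
    with csv_path.open(newline="") as f:
        rows = list(csv.DictReader(f))
    return rows, meta["attrs"]


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp) / "example_if"
    # Write the csv file containing the wide-format table.
    with base.with_suffix(".csv").open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["source", "area (ISO3)", "entity", "unit", "1991"])
        writer.writerow(["TESTcsv2021", "AUS", "CO2", "Mt CO2 / yr", 4.0])
    # Write the yaml file with the `attrs` dict and the csv file name.
    meta = {"attrs": {"area": "area (ISO3)"}, "data_file": base.with_suffix(".csv").name}
    with base.with_suffix(".yaml").open("w") as f:
        yaml.safe_dump(meta, f)
    # Only the yaml file name is needed to read both files back.
    rows, attrs = read_pair(base.with_suffix(".yaml"))

print(rows[0]["area (ISO3)"], attrs)
```

Note that the plain `csv` reader returns all values as strings; converting the time columns back to numbers is left out of this sketch.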

## Examples
Here we show a few examples of the interchange format. As the methods are still
under development, the examples are currently limited and will be expanded as the methods
become available.

```python
# import all the used libraries
import primap2 as pm2
```

### Reading csv data
The PRIMAP2 data reading procedures first convert data into the interchange format.
For an explanation of the parameters used, see the
[Data reading example](data_reading_example_test_data.ipynb). A more complex dataset is
read in [Data reading PRIMAP-hist](data_reading_example_PRIMAP-hist.ipynb).

```python
file = "test_csv_data_sec_cat.csv"
coords_cols = {
    "unit": "unit",
    "entity": "gas",
    "area": "country",
    "category": "category",
    "sec_cats__Class": "classification",
}
coords_defaults = {
    "source": "TESTcsv2021",
    "sec_cats__Type": "fugitive",
    "scenario": "HISTORY",
}
coords_terminologies = {
    "area": "ISO3",
    "category": "IPCC2006",
    "sec_cats__Type": "type",
    "sec_cats__Class": "class",
    "scenario": "general",
}
coords_value_mapping = {"category": "PRIMAP1", "entity": "PRIMAP1"}
filter_keep = {}
filter_remove = {}
data_if = pm2.pm2io.read_wide_csv_file_if(
    file,
    coords_cols=coords_cols,
    coords_defaults=coords_defaults,
    coords_terminologies=coords_terminologies,
    coords_value_mapping=coords_value_mapping,
    filter_keep=filter_keep,
    filter_remove=filter_remove,
)
data_if.head()
```

| | source | scenario (general) | area (ISO3) | entity | unit | category (IPCC2006) | Class (class) | Type (type) | 1991 | 2000 | 2010 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | TESTcsv2021 | HISTORY | AUS | CO2 | Mt CO2 / yr | 1 | TOTAL | fugitive | 4.0 | 5.0 | 6.0 |
| 1 | TESTcsv2021 | HISTORY | AUS | KYOTOGHG (SARGWP100) | Mt CO2 / yr | 0 | TOTAL | fugitive | 8.0 | 9.0 | 10.0 |
| 2 | TESTcsv2021 | HISTORY | FRA | CH4 | Gg CH4 / yr | 2 | TOTAL | fugitive | 7.0 | 8.0 | 9.0 |
| 3 | TESTcsv2021 | HISTORY | FRA | CO2 | Mt CO2 / yr | 2 | TOTAL | fugitive | 0.012 | 0.013 | 0.014 |
| 4 | TESTcsv2021 | HISTORY | FRA | KYOTOGHG (SARGWP100) | Mt CO2 / yr | 0 | TOTAL | fugitive | 0.03 | 0.02 | 0.04 |


### Writing interchange format data
Data is written using the `pm2io.write_interchange_format` function, which takes a filename
and path (`str` or `pathlib.Path`), an interchange format dataframe (`pandas.DataFrame`),
and optionally an attribute `dict` as inputs. Any file extension in the filename is
ignored. The function writes a `yaml` file and a `csv` file.

```python
file_if = "test_csv_data_sec_cat_if"
pm2.pm2io.write_interchange_format(file_if, data_if)
```


### Reading data from disk
To read interchange format data from disk, the function `pm2io.read_interchange_format`
is used. It takes only a filename and path as input (`str` or `pathlib.Path`) and returns
a `pandas.DataFrame` containing the data and metadata. The filename and path have to point
to the `yaml` file. The `csv` file is read from the filename contained in the `yaml`
file.

```python
data_if_read = pm2.pm2io.read_interchange_format(file_if)
data_if_read.head()
```

| | source | scenario (general) | area (ISO3) | entity | unit | category (IPCC2006) | Class (class) | Type (type) | 1991 | 2000 | 2010 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | TESTcsv2021 | HISTORY | AUS | CO2 | Mt CO2 / yr | 1 | TOTAL | fugitive | 4.0 | 5.0 | 6.0 |
| 1 | TESTcsv2021 | HISTORY | AUS | KYOTOGHG (SARGWP100) | Mt CO2 / yr | 0 | TOTAL | fugitive | 8.0 | 9.0 | 10.0 |
| 2 | TESTcsv2021 | HISTORY | FRA | CH4 | Gg CH4 / yr | 2 | TOTAL | fugitive | 7.0 | 8.0 | 9.0 |
| 3 | TESTcsv2021 | HISTORY | FRA | CO2 | Mt CO2 / yr | 2 | TOTAL | fugitive | 0.012 | 0.013 | 0.014 |
| 4 | TESTcsv2021 | HISTORY | FRA | KYOTOGHG (SARGWP100) | Mt CO2 / yr | 0 | TOTAL | fugitive | 0.03 | 0.02 | 0.04 |


### Further examples
Once the functionality is finalized, an example of the conversion from the xarray format
to the interchange format will be added.