# Xarray - Pandas converter
---------------------------

This notebook uses the example from the Xarray user guide (section ["working with pandas"](https://docs.xarray.dev/en/stable/user-guide/pandas.html)) to show how the `ntv_pandas` converter complements the existing Xarray interface.

A [simple use case](https://nbviewer.org/github/loco-philippe/ntv-pandas/blob/main/example/example_multidimensional.ipynb) shows the advantages of a multidimensional representation (conversion of a dataset to Xarray, optimization of data size).

A [third example](https://nbviewer.org/github/loco-philippe/ntv-pandas/blob/main/example/example_multidimensional.ipynb) shows how the hidden multidimensional structure of a tabular dataset can be revealed.

## Xarray interface

In [1]:
import numpy as np
import xarray as xr


ds = xr.Dataset(
    {"foo": (("x", "y"), np.random.randn(2, 3))},
    coords={
        "x": [10, 20],
        "y": ["a", "b", "c"],
        "along_x": ("x", np.random.randn(2)),
        "scalar": 123,
    },
    attrs={"example": "Xarray user-guide"},
)
ds

*Note:*

- the `attrs` metadata is an addition to the example in the Xarray user-guide.

In [2]:
df = ds.to_dataframe()
df

| x | y | foo | along_x | scalar |
|---|---|---|---|---|
| **10** | **a** | -3.420051 | 0.415806 | 123 |
| **10** | **b** | -0.906099 | 0.415806 | 123 |
| **10** | **c** | 0.498232 | 0.415806 | 123 |
| **20** | **a** | -1.322896 | 1.140233 | 123 |
| **20** | **b** | 1.836943 | 1.140233 | 123 |
| **20** | **c** | 0.268444 | 1.140233 | 123 |


In [3]:
xr.Dataset.from_dataframe(df)

This example shows that the conversion is not reversible (lossy roundtrip) and that the size of the ``dataset`` increases.

In particular, after a roundtrip, the following deviations are noted:

- a non-dimension Dataset ``coordinate`` is converted into a ``variable``
- a non-dimension DataArray ``coordinate`` is not converted
- ``dtype`` is not always the same (e.g. "str" is converted to "object")
- ``attrs`` metadata is not converted

The `ntv_pandas` converter avoids this data loss, as explained below.
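The deviations listed above can be checked with Xarray alone. The sketch below rebuilds the same dataset, performs the native roundtrip, and prints what changed. The name `rt` is introduced here for illustration, and the exact dtypes printed may depend on the installed pandas and Xarray versions.

```python
import numpy as np
import xarray as xr

# Same dataset as in the first cell of this notebook.
ds = xr.Dataset(
    {"foo": (("x", "y"), np.random.randn(2, 3))},
    coords={
        "x": [10, 20],
        "y": ["a", "b", "c"],
        "along_x": ("x", np.random.randn(2)),
        "scalar": 123,
    },
    attrs={"example": "Xarray user-guide"},
)

# Native roundtrip: Dataset -> DataFrame -> Dataset.
rt = xr.Dataset.from_dataframe(ds.to_dataframe())

# 'along_x' was a coordinate on 'x' only; check where it ends up.
print("along_x" in ds.coords, "along_x" in rt.data_vars)
print(ds["along_x"].dims, rt["along_x"].dims)

# The dtype of the string coordinate may differ after the roundtrip.
print(ds["y"].dtype, rt["y"].dtype)

# Dataset-level metadata before and after.
print(ds.attrs, rt.attrs)

# In-memory size (bytes) before and after.
print(ds.nbytes, rt.nbytes)
```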

## ntv_pandas converter : Dataset -> DataFrame

Three options are available:

- **ntv_type**: Boolean (default True) - if False, the `ntv_type` is not included in the column names
- **info**: Boolean (default True) - if True, the `DataFrame.attrs` contains the multidimensional structure
- **index**: Boolean (default True) - if True, dimensions are translated into `indexes`

In [4]:
df_min = ds.nxr.to_dataframe(
    ntv_type=False, info=False, index=False
)  # without additional data
df_min

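|   | x | y | along_x | foo | scalar |
|---|---|---|---|---|---|
| **0** | 10 | a | 0.415806 | -3.420051 | 123 |
| **1** | 10 | b | 0.415806 | -0.906099 | 123 |
| **2** | 10 | c | 0.415806 | 0.498232 | 123 |
| **3** | 20 | a | 1.140233 | -1.322896 | 123 |
| **4** | 20 | b | 1.140233 | 1.836943 | 123 |
| **5** | 20 | c | 1.140233 | 0.268444 | 123 |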


In [5]:
df_min.attrs

{}

In [6]:
df_full = ds.nxr.to_dataframe()
df_full

| x:int32 | y:string | along_x:float64 | foo:float64 | scalar:int32 |
|---|---|---|---|---|
| **10** | **a** | 0.415806 | -3.420051 | 123 |
| **10** | **b** | 0.415806 | -0.906099 | 123 |
| **10** | **c** | 0.415806 | 0.498232 | 123 |
| **20** | **a** | 1.140233 | -1.322896 | 123 |
| **20** | **b** | 1.140233 | 1.836943 | 123 |
| **20** | **c** | 1.140233 | 0.268444 | 123 |


In [7]:
df_full.attrs

{'info': {'dimensions': ['x', 'y'],
  'data': {'example': {'meta': 'Xarray user-guide', 'xtype': 'meta'},
   'x': {'shape': [2], 'xtype': 'namedarray'},
   'y': {'shape': [3], 'xtype': 'namedarray'},
   'along_x': {'shape': [2], 'xtype': 'variable', 'links': ['x']},
   'scalar': {'shape': [1], 'xtype': 'namedarray'},
   'foo': {'shape': [2, 3], 'xtype': 'variable', 'links': ['x', 'y']}}},
 'metadata': {'example': 'Xarray user-guide'}}
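The output above is a plain nested dict, so it can be inspected with ordinary Python. As an illustration only, the sketch below copies that output into a literal and lists which fields are attached to which dimensions. The variable names `attrs`, `data`, and `linked` are introduced here and are not part of `ntv_pandas`.

```python
# Literal copy of the `df_full.attrs` output shown above.
attrs = {
    "info": {
        "dimensions": ["x", "y"],
        "data": {
            "example": {"meta": "Xarray user-guide", "xtype": "meta"},
            "x": {"shape": [2], "xtype": "namedarray"},
            "y": {"shape": [3], "xtype": "namedarray"},
            "along_x": {"shape": [2], "xtype": "variable", "links": ["x"]},
            "scalar": {"shape": [1], "xtype": "namedarray"},
            "foo": {"shape": [2, 3], "xtype": "variable", "links": ["x", "y"]},
        },
    },
    "metadata": {"example": "Xarray user-guide"},
}

data = attrs["info"]["data"]

# Fields that carry a 'links' entry, with the dimensions they refer to.
linked = {name: desc["links"] for name, desc in data.items() if "links" in desc}
print(linked)

# Fields described as metadata rather than arrays.
print([name for name, desc in data.items() if desc["xtype"] == "meta"])
```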

*Note:*

- The `DataFrame.attrs` attribute is still experimental in pandas (some operations remove it). The associated information should therefore be read before any further processing of the DataFrame.
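As a small illustration of why this matters, the pandas-only sketch below (toy frames, not the notebook's data) shows one case where `attrs` is not carried over: concatenating frames whose `attrs` differ. Copying the dict beforehand keeps the information available. The names `left`, `right`, and `saved` are assumptions made for this example.

```python
import copy

import pandas as pd

left = pd.DataFrame({"v": [1, 2]})
right = pd.DataFrame({"v": [3, 4]})
left.attrs = {"info": {"dimensions": ["x"]}}
right.attrs = {"source": "other"}

# Save the structural information before transforming the frame.
saved = copy.deepcopy(left.attrs)

# The two frames disagree on attrs, so the result does not inherit them.
combined = pd.concat([left, right], ignore_index=True)
print(combined.attrs)

# The saved copy is unaffected by later operations.
print(saved)
```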

## ntv_pandas converter : DataFrame -> Dataset

The conversion is done without loss, either by reading the `DataFrame.attrs` or by finding the multidimensional structure hidden in the tabular structure.

Three options are available:

- **dims**: list of strings (default None) - order of dimensions to apply
- **dataset**: Boolean (default True) - if False and there is a single data_var, return an `xr.DataArray`
- **info**: Boolean (default True) - if True, use `DataFrame.attrs`

In [8]:
ds_min = df_min.npd.to_xarray()
ds_min

*Note:*

- The multidimensional structure is found by the `tab_analysis` module

In [9]:
df_min.reset_index(drop=False).npd.analysis(distr=True).field_partition()

{'primary': ['x', 'y'],
 'secondary': ['along_x'],
 'mixte': [],
 'unique': ['scalar'],
 'variable': ['index', 'foo']}
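To give an intuition for this partition, the sketch below re-derives part of it with plain pandas on a hand-written copy of `df_min` (values rounded). It is a simplified illustration, not the `tab_analysis` algorithm: the helper names `is_crossed` and `depends_on` are hypothetical, and the candidate columns are passed in by hand rather than discovered.

```python
import pandas as pd

# Hand-written copy of df_min, with rounded values.
df = pd.DataFrame(
    {
        "x": [10, 10, 10, 20, 20, 20],
        "y": ["a", "b", "c", "a", "b", "c"],
        "along_x": [0.4, 0.4, 0.4, 1.1, 1.1, 1.1],
        "foo": [-3.4, -0.9, 0.5, -1.3, 1.8, 0.3],
        "scalar": [123] * 6,
    }
)


def is_crossed(frame, cols):
    """True if `cols` form a full grid: every combination appears exactly once."""
    full_grid = frame[cols].nunique().prod() == len(frame)
    no_duplicates = not frame.duplicated(subset=cols).any()
    return bool(full_grid and no_duplicates)


def depends_on(frame, child, parent):
    """True if each value of `parent` maps to a single value of `child`."""
    return bool((frame.groupby(parent)[child].nunique() == 1).all())


# 'x' and 'y' behave like dimensions (the 'primary' fields above).
print(is_crossed(df, ["x", "y"]))

# 'along_x' is determined by 'x' alone (a 'secondary' field).
print(depends_on(df, "along_x", "x"))

# 'foo' varies within each 'x', so it needs both dimensions (a 'variable' field).
print(depends_on(df, "foo", "x"))

# Constant columns correspond to the 'unique' fields.
unique_fields = [c for c in df.columns if df[c].nunique() == 1]
print(unique_fields)
```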

In [10]:
ds_full = df_full.npd.to_xarray()
ds_full

*Note:*

- The multidimensional structure is preserved with both options
- The `dtype` is preserved with both options 