# Pandas interoperability with multidimensional tools
----
## Goal
- Show, on a simple example, the interoperability between tabular data and multidimensional data.

## Presentation of the example
Let's take the example of a table containing the price of some fruits and vegetables.

|product|plants   |plts|quantity|price|price level|group   |id  |supplier|location|valid|
|:-----:|:-------:|:--:|:------:|:---:|:---------:|:------:|:--:|:------:|:------:|:---:|
|apple  |fruit    |fr  |1 kg    |1    |low        |fruit 1 |1001|sup1    |fr      |ok   |
|apple  |fruit    |fr  |10 kg   |10   |low        |fruit 10|1002|sup1    |gb      |ok   |
|orange |fruit    |fr  |1 kg    |2    |high       |fruit 1 |1003|sup1    |es      |ok   |
|orange |fruit    |fr  |10 kg   |20   |high       |veget   |1004|sup2    |ch      |ok   |
|peppers|vegetable|ve  |1 kg    |1.5  |low        |veget   |1005|sup2    |gb      |ok   |
|peppers|vegetable|ve  |10 kg   |15   |low        |veget   |1006|sup2    |fr      |ok   |
|carrot |vegetable|ve  |1 kg    |1.5  |high       |veget   |1007|sup2    |es      |ok   |
|carrot |vegetable|ve  |10 kg   |20   |high       |veget   |1008|sup1    |ch      |ok   |


The price depends on the product and on the packaging (1 kg or 10 kg).

In [1]:
import pandas as pd
import ntv_pandas as npd # activate pandas npd accessor

fruits = {'plants':      ['fruit', 'fruit', 'fruit', 'fruit', 'vegetable', 'vegetable', 'vegetable', 'vegetable'],
          'plts':        ['fr', 'fr', 'fr', 'fr', 've', 've', 've', 've'], 
          'quantity':    ['1 kg', '10 kg', '1 kg', '10 kg', '1 kg', '10 kg', '1 kg', '10 kg'],
          'product':     ['apple', 'apple', 'orange', 'orange', 'peppers', 'peppers', 'carrot', 'carrot'],
          'price':       [1, 10, 2, 20, 1.5, 15, 1.5, 20],
          'price level': ['low', 'low', 'high', 'high', 'low', 'low', 'high', 'high'],
          'group':       ['fruit 1', 'fruit 10', 'fruit 1', 'veget', 'veget', 'veget', 'veget', 'veget'],
          'id':          [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008],
          'supplier':    ["sup1", "sup1", "sup1", "sup2", "sup2", "sup2", "sup2", "sup1"],
          'location':    ["fr", "gb", "es", "ch", "gb", "fr", "es", "ch"],
          'valid':       ["ok", "ok", "ok", "ok", "ok", "ok", "ok", "ok"]}
df_fruits = pd.DataFrame(fruits)

## Multidimensional structure analysis

In [2]:
ana_fruits = df_fruits.npd.analysis(distr=True)

### Partitions
A partition is a minimal list of Fields whose combinations of values are all different in the dataset (i.e. the dimensions of a multidimensional structure).

In [3]:
ana_fruits.partitions()

[['plants', 'quantity', 'price level'],
 ['quantity', 'price level', 'supplier'],
 ['plants', 'location'],
 ['quantity', 'product'],
 ['supplier', 'location'],
 ['id']]
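
The definition can be checked by hand. The sketch below is a plain-Python illustration, not the `ntv_pandas` implementation: the helper `is_partition` is a hypothetical name, and it assumes that a partition is a list of fields whose value combinations are all different and cover every possible combination (so that no field can be dropped).

```python
from math import prod

# Columns of the example table (only the fields needed here)
fruits = {
    'plants':      ['fruit', 'fruit', 'fruit', 'fruit',
                    'vegetable', 'vegetable', 'vegetable', 'vegetable'],
    'quantity':    ['1 kg', '10 kg', '1 kg', '10 kg', '1 kg', '10 kg', '1 kg', '10 kg'],
    'product':     ['apple', 'apple', 'orange', 'orange',
                    'peppers', 'peppers', 'carrot', 'carrot'],
    'price level': ['low', 'low', 'high', 'high', 'low', 'low', 'high', 'high'],
}

def is_partition(columns, fields):
    """Hypothetical helper: True if the fields are fully crossed and identify each row."""
    n_rows = len(columns[fields[0]])
    # distinct combinations of values actually present in the data
    present = set(zip(*(columns[name] for name in fields)))
    # number of combinations that could exist given the distinct values of each field
    n_possible = prod(len(set(columns[name])) for name in fields)
    # every row has its own combination, and every possible combination is present
    return len(present) == n_rows == n_possible

print(is_partition(fruits, ['plants', 'quantity', 'price level']))  # True
print(is_partition(fruits, ['quantity', 'product']))                # True
print(is_partition(fruits, ['plants', 'product']))                  # False
```

The last call fails because 'plants' and 'product' only produce 4 distinct combinations for 8 rows.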

The dimension of a Dataset is the size of its largest partition.

In [4]:
ana_fruits.dimension

3
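
As a quick cross-check, the same value can be recomputed from the list of partitions shown earlier. This is only a plain-Python sketch: the `partitions` list is copied from the output above rather than obtained from the library.

```python
# Partitions copied from the output of ana_fruits.partitions()
partitions = [['plants', 'quantity', 'price level'],
              ['quantity', 'price level', 'supplier'],
              ['plants', 'location'],
              ['quantity', 'product'],
              ['supplier', 'location'],
              ['id']]

# The dimension is the number of fields of the largest partition
dimension = max(len(partition) for partition in partitions)
print(dimension)  # 3
```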

For a given partition, the Fields of the Dataset are divided into:
- primary: partition fields
- secondary: fields derived from or coupled to a primary field
- mixte (mixed): fields with a multidimensional structure that are also derived from a primary field
- unique: fields with a single value
- variable: other fields


In [5]:
ana_fruits.field_partition() # first partition

{'primary': ['plants', 'quantity', 'price level'],
 'secondary': ['plts'],
 'mixte': ['product'],
 'unique': ['valid'],
 'variable': ['price', 'group', 'id', 'supplier', 'location']}

In [6]:
ana_fruits.relation_partition() # first partition

{'plants': ['plants'],
 'quantity': ['quantity'],
 'price level': ['price level'],
 'plts': ['plants'],
 'product': ['plants', 'price level'],
 'valid': [],
 'price': ['plants', 'quantity', 'price level'],
 'group': ['plants', 'quantity', 'price level'],
 'id': ['plants', 'quantity', 'price level'],
 'supplier': ['plants', 'quantity', 'price level'],
 'location': ['plants', 'quantity', 'price level']}

In [7]:
ana_fruits.field_partition(partition=['product', 'quantity'])

{'primary': ['product', 'quantity'],
 'secondary': ['plants', 'plts', 'price level'],
 'mixte': [],
 'unique': ['valid'],
 'variable': ['price', 'group', 'id', 'supplier', 'location']}

In [8]:
ana_fruits.relation_partition(partition=['product', 'quantity'])

{'product': ['product'],
 'quantity': ['quantity'],
 'plants': ['product'],
 'plts': ['plants'],
 'price level': ['product'],
 'valid': [],
 'price': ['product', 'quantity'],
 'group': ['product', 'quantity'],
 'id': ['product', 'quantity'],
 'supplier': ['product', 'quantity'],
 'location': ['product', 'quantity']}
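
The role assignment for both partitions can be approximated in plain Python. The sketch below is an interpretation, not the library's algorithm: `determines` and `field_roles` are hypothetical helpers, and the rule used (a field is 'secondary' if one primary field determines it, 'mixte' if it needs several but not all primary fields, 'variable' if it needs all of them) is an assumption that happens to reproduce the two outputs above.

```python
from itertools import combinations

fruits = {
    'plants':      ['fruit', 'fruit', 'fruit', 'fruit',
                    'vegetable', 'vegetable', 'vegetable', 'vegetable'],
    'plts':        ['fr', 'fr', 'fr', 'fr', 've', 've', 've', 've'],
    'quantity':    ['1 kg', '10 kg', '1 kg', '10 kg', '1 kg', '10 kg', '1 kg', '10 kg'],
    'product':     ['apple', 'apple', 'orange', 'orange',
                    'peppers', 'peppers', 'carrot', 'carrot'],
    'price':       [1, 10, 2, 20, 1.5, 15, 1.5, 20],
    'price level': ['low', 'low', 'high', 'high', 'low', 'low', 'high', 'high'],
    'group':       ['fruit 1', 'fruit 10', 'fruit 1', 'veget',
                    'veget', 'veget', 'veget', 'veget'],
    'id':          [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008],
    'supplier':    ['sup1', 'sup1', 'sup1', 'sup2', 'sup2', 'sup2', 'sup2', 'sup1'],
    'location':    ['fr', 'gb', 'es', 'ch', 'gb', 'fr', 'es', 'ch'],
    'valid':       ['ok', 'ok', 'ok', 'ok', 'ok', 'ok', 'ok', 'ok'],
}

def determines(columns, parents, child):
    """True if each combination of parent values maps to a single child value."""
    parent_keys = set(zip(*(columns[name] for name in parents)))
    pairs = set(zip(*(columns[name] for name in parents), columns[child]))
    return len(pairs) == len(parent_keys)

def field_roles(columns, partition):
    """Hypothetical role assignment for one partition (assumed rule, see text)."""
    roles = {'primary': list(partition), 'secondary': [], 'mixte': [],
             'unique': [], 'variable': []}
    for name in columns:
        if name in partition:
            continue
        if len(set(columns[name])) == 1:
            roles['unique'].append(name)
            continue
        # smallest number of primary fields needed to determine this field
        size = next((k for k in range(1, len(partition) + 1)
                     if any(determines(columns, subset, name)
                            for subset in combinations(partition, k))),
                    len(partition))
        if size == 1:
            roles['secondary'].append(name)
        elif size < len(partition):
            roles['mixte'].append(name)
        else:
            roles['variable'].append(name)
    return roles

print(field_roles(fruits, ['plants', 'quantity', 'price level']))
print(field_roles(fruits, ['product', 'quantity']))
```

With this rule, 'product' is 'mixte' for the first partition because it is determined by the pair ('plants', 'price level') but by no single primary field.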

## Xarray and scipp interoperability
For a given partition, a DataFrame can be converted into a multidimensional entity (an xarray or scipp object) and back.

In [9]:
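from base64 import b64encode
from IPython.display import Image, display

# Render the Mermaid diagram stored in 'fruits.mmd' with the mermaid.ink service
with open('fruits.mmd', 'r', encoding="utf-8") as mmd_file:
    mermaid_b64 = b64encode(mmd_file.read().encode("utf-8")).decode("ascii")
display(Image(url="https://mermaid.ink/img/" + mermaid_b64))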

### Partition 1

In [10]:
kwargs = {'dims':['plants', 'quantity', 'price level'], 'info': False, 'ntv_type': False}

xd_fruits_1 = df_fruits.npd.to_xarray(**kwargs)
xd_fruits_1

In [11]:
import ntv_numpy  # activate xarray nxr accessor

In [12]:
df_fruits_xd = xd_fruits_1.nxr.to_dataframe(ntv_type=False) # or npd.from_xarray(xd_fruits_1, ntv_type=False)

df_fruits_xd_sort = df_fruits_xd.reset_index()[list(df_fruits.columns)].sort_values(list(df_fruits.columns)).reset_index(drop=True)
df_fruits_sort = df_fruits.sort_values(list(df_fruits.columns)).reset_index(drop=True)

df_fruits_xd_sort.equals(df_fruits_sort)

True
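
The round trip is lossless because each combination of partition values points to exactly one row. The following plain-Python sketch illustrates that idea only: it does not use `xarray`, `scipp` or the `ntv_pandas` converters, and `to_grid` and `from_grid` are hypothetical helper names.

```python
from itertools import product

fruits = {
    'plants':      ['fruit', 'fruit', 'fruit', 'fruit',
                    'vegetable', 'vegetable', 'vegetable', 'vegetable'],
    'quantity':    ['1 kg', '10 kg', '1 kg', '10 kg', '1 kg', '10 kg', '1 kg', '10 kg'],
    'price level': ['low', 'low', 'high', 'high', 'low', 'low', 'high', 'high'],
    'price':       [1, 10, 2, 20, 1.5, 15, 1.5, 20],
}
dims = ['plants', 'quantity', 'price level']

def to_grid(columns, dims, variable):
    """Place one variable on a grid indexed by positions along each dimension."""
    coords = {name: sorted(set(columns[name])) for name in dims}
    grid = {}
    for row, value in enumerate(columns[variable]):
        index = tuple(coords[name].index(columns[name][row]) for name in dims)
        grid[index] = value
    return coords, grid

def from_grid(coords, grid, dims, variable):
    """Rebuild one tabular row per grid cell."""
    rows = []
    for index in product(*(range(len(coords[name])) for name in dims)):
        row = {name: coords[name][pos] for name, pos in zip(dims, index)}
        row[variable] = grid[index]
        rows.append(row)
    return rows

def sort_key(row):
    """Order rows by their dimension values so that both tables can be compared."""
    return tuple(row[name] for name in dims)

coords, grid = to_grid(fruits, dims, 'price')
restored = from_grid(coords, grid, dims, 'price')
original = [dict(zip(fruits, values)) for values in zip(*fruits.values())]

print(sorted(restored, key=sort_key) == sorted(original, key=sort_key))  # True
```

The grid has 2 x 2 x 2 = 8 cells, one per row of the table, which is why no cell is empty and no row is lost.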

In [13]:
sc_fruits_1 = df_fruits.npd.to_scipp(**kwargs)
sc_fruits_1

In [14]:
df_fruits_sc = npd.from_scipp(sc_fruits_1, ntv_type=False)

df_fruits_sc_sort = df_fruits_sc.reset_index()[list(df_fruits.columns)].sort_values(list(df_fruits.columns)).reset_index(drop=True)
df_fruits_sort = df_fruits.sort_values(list(df_fruits.columns)).reset_index(drop=True)

df_fruits_sc_sort.equals(df_fruits_sort)

True

### Partition 2

In [15]:
kwargs = {'dims':['product', 'quantity'], 'info': False, 'ntv_type': False}

xd_fruits_2 = df_fruits.npd.to_xarray(**kwargs)
xd_fruits_2

In [16]:
df_fruits_xd = xd_fruits_2.nxr.to_dataframe(ntv_type=False) # or npd.from_xarray(xd_fruits_2, ntv_type=False)

df_fruits_xd_sort = df_fruits_xd.reset_index()[list(df_fruits.columns)].sort_values(list(df_fruits.columns)).reset_index(drop=True)
df_fruits_sort = df_fruits.sort_values(list(df_fruits.columns)).reset_index(drop=True)

df_fruits_xd_sort.equals(df_fruits_sort)

True

In [17]:
sc_fruits_2 = df_fruits.npd.to_scipp(**kwargs)
sc_fruits_2

In [18]:
df_fruits_sc = npd.from_scipp(sc_fruits_2, ntv_type=False)

df_fruits_sc_sort = df_fruits_sc.reset_index()[list(df_fruits.columns)].sort_values(list(df_fruits.columns)).reset_index(drop=True)
df_fruits_sort = df_fruits.sort_values(list(df_fruits.columns)).reset_index(drop=True)

df_fruits_sc_sort.equals(df_fruits_sort)

True

## Appendix: relationship analysis

### Relationship
Three kinds of relationships are present:
- coupled : each 'plants' value corresponds to one 'plts' value
- derived : each 'product' value is associated with one 'plants' value
- crossed : each 'quantity' value is associated with each 'product' value

In [19]:
print(ana_fruits.get_relation('plants', 'plts').typecoupl)
print(ana_fruits.get_relation('plants', 'product').typecoupl)
print(ana_fruits.get_relation('quantity', 'product').typecoupl)


coupled
derived
crossed
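
These three cases can be told apart by counting distinct values and distinct pairs of values. The sketch below is a plain-Python illustration with a hypothetical `relation_type` helper; the counting rule and the 'other' fallback label are assumptions, not the library's implementation.

```python
fruits = {
    'plants':   ['fruit', 'fruit', 'fruit', 'fruit',
                 'vegetable', 'vegetable', 'vegetable', 'vegetable'],
    'plts':     ['fr', 'fr', 'fr', 'fr', 've', 've', 've', 've'],
    'quantity': ['1 kg', '10 kg', '1 kg', '10 kg', '1 kg', '10 kg', '1 kg', '10 kg'],
    'product':  ['apple', 'apple', 'orange', 'orange',
                 'peppers', 'peppers', 'carrot', 'carrot'],
}

def relation_type(columns, field_a, field_b):
    """Hypothetical classifier based on the number of distinct values and pairs."""
    n_a = len(set(columns[field_a]))
    n_b = len(set(columns[field_b]))
    n_pairs = len(set(zip(columns[field_a], columns[field_b])))
    if n_pairs == n_a == n_b:
        return 'coupled'   # one-to-one correspondence
    if n_pairs == max(n_a, n_b):
        return 'derived'   # one field determines the other
    if n_pairs == n_a * n_b:
        return 'crossed'   # every combination is present
    return 'other'         # assumed label for the remaining cases

print(relation_type(fruits, 'plants', 'plts'))       # coupled
print(relation_type(fruits, 'plants', 'product'))    # derived
print(relation_type(fruits, 'quantity', 'product'))  # crossed
```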


A relationship can be quantified by a distance (the number of codec links to change for the relationship to become coupled).

If a relationship is coupled, the distance is zero.
The maximal distance is the Field length minus one.

In [20]:
print('minimum distance: ', ana_fruits.get_relation('plants', 'plts').distance)
print('maximum distance: ', ana_fruits.get_relation('id', 'valid').distance)
print('intermediate distance: ', ana_fruits.get_relation('plants', 'product').distance)
# The 'plants' - 'product' relationship becomes 'coupled' if we change, for example,
# 'fruit-orange' to 'citrus-orange' and 'vegetable-carrot' to 'root vegetable-carrot' (2 changes)

minimum distance:  0
maximum distance:  7
intermediate distance:  2
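
One formula that reproduces the three values above is the number of distinct value pairs minus the smaller of the two numbers of distinct values. It is offered here as an assumption rather than as the library's definition, and the `distance` helper below is a hypothetical plain-Python sketch.

```python
fruits = {
    'plants':  ['fruit', 'fruit', 'fruit', 'fruit',
                'vegetable', 'vegetable', 'vegetable', 'vegetable'],
    'plts':    ['fr', 'fr', 'fr', 'fr', 've', 've', 've', 've'],
    'product': ['apple', 'apple', 'orange', 'orange',
                'peppers', 'peppers', 'carrot', 'carrot'],
    'id':      [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008],
    'valid':   ['ok', 'ok', 'ok', 'ok', 'ok', 'ok', 'ok', 'ok'],
}

def distance(columns, field_a, field_b):
    """Assumed formula: distinct pairs minus the smaller number of distinct values."""
    n_pairs = len(set(zip(columns[field_a], columns[field_b])))
    n_min = min(len(set(columns[field_a])), len(set(columns[field_b])))
    return n_pairs - n_min

print(distance(fruits, 'plants', 'plts'))     # 0
print(distance(fruits, 'id', 'valid'))        # 7
print(distance(fruits, 'plants', 'product'))  # 2
```

With 8 rows, the largest possible value is 8 distinct pairs minus 1 distinct value, i.e. the Field length minus one.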


### Fields
Each field has a category based on its relationships with other fields:
- rooted : Fields coupled with the root Field
- unique : Fields with a single value
- coupled : Fields coupled with another Field
- derived : Fields without derived child
- mixed : other Fields

In [21]:
# list of categories for each Field
print({field.idfield: category for field, category in zip(ana_fruits.fields, ana_fruits.category)})

{'plants': 'derived', 'plts': 'coupled', 'quantity': 'derived', 'product': 'mixed', 'price': 'mixed', 'price level': 'derived', 'group': 'derived', 'id': 'rooted', 'supplier': 'derived', 'location': 'mixed', 'valid': 'unique'}


### Tree
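A Dataset can be represented as a Field tree where each Field has a parent Field.
The parent of a Field is the Field it is derived from with the minimal 'distance'.
In the output, the two numbers in parentheses appear to be the distance to the parent and the number of distinct values of the Field.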

In [22]:
print(ana_fruits.tree())

-1: root-derived (8)
   3 : product (4 - 4)
      0 : plants (2 - 2)
         1 : plts (0 - 2)
      5 : price level (2 - 2)
   4 : price (2 - 6)
      2 : quantity (4 - 2)
      6 : group (3 - 3)
   7 : id (0 - 8)
   8 : supplier (6 - 2)
   9 : location (4 - 4)
   10: valid (7 - 1)
