# Analysis : Example

----
## Goal
- Show, on a simple example, the main uses of the analysis module for tabular data.

## Presentation of the example
Let's take the example of a CSV file containing the prices of some fruits and vegetables.

|product|plants   |plts |quantity|price|price level|group  |id   |supplier|location|valid|
|:-----:|:-------:|:---:|:------:|:---:|:---------:|:-----:|:---:|:------:|:------:|:---:|
|apple  |fruit    |fr   |1 kg    |1    |low        |fruit1 |1001 |sup1    |fr      |ok   |
|apple  |fruit    |fr   |10 kg   |10   |low        |fruit10|1002 |sup1    |gb      |ok   |
|orange |fruit    |fr   |1 kg    |2    |high       |fruit1 |1003 |sup1    |es      |ok   |
|orange |fruit    |fr   |10 kg   |20   |high       |veget  |1004 |sup2    |ch      |ok   |
|peppers|vegetable|ve   |1 kg    |1.5  |low        |veget  |1005 |sup2    |gb      |ok   |
|peppers|vegetable|ve   |10 kg   |15   |low        |veget  |1006 |sup2    |fr      |ok   |
|carrot |vegetable|ve   |1 kg    |1.5  |high       |veget  |1007 |sup2    |es      |ok   |
|carrot |vegetable|ve   |10 kg   |20   |high       |veget  |1008 |sup1    |ch      |ok   |


The price depends on the product and on the packaging (1 kg or 10 kg).

In [1]:
fruits = {
    "plants": [
        "fruit",
        "fruit",
        "fruit",
        "fruit",
        "vegetable",
        "vegetable",
        "vegetable",
        "vegetable",
    ],
    "plts": ["fr", "fr", "fr", "fr", "ve", "ve", "ve", "ve"],
    "quantity": ["1 kg", "10 kg", "1 kg", "10 kg", "1 kg", "10 kg", "1 kg", "10 kg"],
    "product": [
        "apple",
        "apple",
        "orange",
        "orange",
        "peppers",
        "peppers",
        "carrot",
        "carrot",
    ],
    "price": [1, 10, 2, 20, 1.5, 15, 1.5, 20],
    "price level": ["low", "low", "high", "high", "low", "low", "high", "high"],
    "group": [
        "fruit 1",
        "fruit 10",
        "fruit 1",
        "veget",
        "veget",
        "veget",
        "veget",
        "veget",
    ],
    "id": [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008],
    "supplier": ["sup1", "sup1", "sup1", "sup2", "sup2", "sup2", "sup2", "sup1"],
    "location": ["fr", "gb", "es", "ch", "gb", "fr", "es", "ch"],
    "valid": ["ok", "ok", "ok", "ok", "ok", "ok", "ok", "ok"],
}

In [2]:
from tab_dataset import Sdataset

dts = Sdataset.ntv(fruits)
adts = dts.analysis

## Relationship
Three kinds of relationship are present:
- coupled : each 'plants' value corresponds to exactly one 'plts' value, and vice versa
- derived : each 'product' value is associated with exactly one 'plants' value
- crossed : each 'quantity' value is associated with every 'product' value

In [3]:
print(adts.get_relation("plants", "plts").typecoupl)
print(adts.get_relation("plants", "product").typecoupl)
print(adts.get_relation("quantity", "product").typecoupl)

coupled
derived
crossed
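
The three categories can also be illustrated without the library by counting distinct values and distinct value pairs. The sketch below is only an illustration under that assumption: `relation_type` and the fallback label `"other"` are hypothetical names, not part of `tab_dataset`.

```python
def relation_type(a, b):
    """Classify the relationship between two equally long columns."""
    n_a, n_b = len(set(a)), len(set(b))
    n_pairs = len(set(zip(a, b)))  # number of distinct (a, b) combinations
    if n_pairs == n_a == n_b:
        return "coupled"  # one-to-one mapping between the values
    if n_pairs == max(n_a, n_b):
        return "derived"  # one column fully determines the other
    if n_pairs == n_a * n_b:
        return "crossed"  # every combination of values occurs
    return "other"  # hypothetical fallback label


plants = ["fruit"] * 4 + ["vegetable"] * 4
plts = ["fr"] * 4 + ["ve"] * 4
product = ["apple", "apple", "orange", "orange",
           "peppers", "peppers", "carrot", "carrot"]
quantity = ["1 kg", "10 kg"] * 4

print(relation_type(plants, plts))       # coupled
print(relation_type(plants, product))    # derived
print(relation_type(quantity, product))  # crossed
```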


A relationship can be quantified by a distance: the number of links between codec values that must be changed for the relationship to become coupled.

If a relationship is coupled, the distance is zero.
The maximum distance is the length of the Fields minus one (here 8 - 1 = 7).

In [4]:
print("minimum distance: ", adts.get_relation("plants", "plts").distance)
print("maximum distance: ", adts.get_relation("id", "valid").distance)
print("intermediate distance: ", adts.get_relation("plants", "product").distance)
# The 'plants' - 'product' relationship becomes 'coupled' if we change, for example,
# 'fruit-orange' to 'citrus-orange' and 'vegetable-carrot' to 'root vegetable-carrot' (2 changes)

minimum distance:  0
maximum distance:  7
intermediate distance:  2
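
The three values above are consistent with a simple count: the number of distinct value pairs minus the smaller number of distinct values. The sketch below assumes this reading. It is inferred from the printed results rather than taken from the library, and `distance` here is a hypothetical standalone function.

```python
def distance(a, b):
    """Distinct (a, b) pairs minus the smaller count of distinct values."""
    n_pairs = len(set(zip(a, b)))
    return n_pairs - min(len(set(a)), len(set(b)))


plants = ["fruit"] * 4 + ["vegetable"] * 4
plts = ["fr"] * 4 + ["ve"] * 4
product = ["apple", "apple", "orange", "orange",
           "peppers", "peppers", "carrot", "carrot"]
ids = list(range(1001, 1009))
valid = ["ok"] * 8

print(distance(plants, plts))     # 0
print(distance(ids, valid))       # 7
print(distance(plants, product))  # 2
```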


## Fields
Each field has a category based on its relationships with other fields:
- rooted : Fields coupled with the root Field
- unique : Fields with a single value
- coupled : Fields coupled with another Field
- derived : Fields without a derived child
- mixed : other Fields

In [5]:
# list of categories for each Field
print(adts.category)

['derived', 'coupled', 'derived', 'mixed', 'mixed', 'derived', 'derived', 'rooted', 'derived', 'mixed', 'unique']


## Tree
A Dataset can be represented as a tree of Fields in which each Field has a parent Field.
The parent of a Field is the Field from which it is derived with the minimal 'distance'.

In [6]:
print(adts.tree())

-1: root-derived (8)
   3 : product (4 - 4)
      0 : plants (2 - 2)
         1 : plts (0 - 2)
      5 : price level (2 - 2)
   4 : price (2 - 6)
      2 : quantity (4 - 2)
      6 : group (3 - 3)
   7 : id (0 - 8)
   8 : supplier (6 - 2)
   9 : location (4 - 4)
   10: valid (7 - 1)


## Partitions
A partition is a minimal list of Fields whose combinations of values are all different in the Dataset.

In [7]:
adts.partitions(mode="id")

[['plants', 'price level', 'quantity'],
 ['price level', 'quantity', 'supplier'],
 ['location', 'plants'],
 ['location', 'supplier'],
 ['product', 'quantity'],
 ['id']]
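
Every partition listed above also has a stronger property: the product of the numbers of distinct values equals the number of rows, so the Fields are fully crossed. The check below assumes this reading, which is inferred from the output and not confirmed by the text. `is_partition` is a hypothetical helper, not a library function.

```python
from math import prod


def is_partition(columns):
    """True if all row combinations are distinct and the columns are fully crossed."""
    n_rows = len(columns[0])
    n_combos = len(set(zip(*columns)))                # distinct row combinations
    n_crossed = prod(len(set(c)) for c in columns)    # size of the full cross product
    return n_combos == n_rows == n_crossed


plants = ["fruit"] * 4 + ["vegetable"] * 4
quantity = ["1 kg", "10 kg"] * 4
product = ["apple", "apple", "orange", "orange",
           "peppers", "peppers", "carrot", "carrot"]
price = [1, 10, 2, 20, 1.5, 15, 1.5, 20]
price_level = ["low", "low", "high", "high", "low", "low", "high", "high"]
supplier = ["sup1", "sup1", "sup1", "sup2", "sup2", "sup2", "sup2", "sup1"]
location = ["fr", "gb", "es", "ch", "gb", "fr", "es", "ch"]

print(is_partition([product, quantity]))              # True
print(is_partition([plants, price_level, quantity]))  # True
print(is_partition([location, supplier]))             # True
print(is_partition([product, supplier]))              # False: repeated combinations
print(is_partition([price, product]))                 # False: distinct rows, but not crossed
```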

The dimension of a Dataset is the size of its largest partition.

In [8]:
adts.dimension

3
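
As a quick check, the same value can be recomputed from the list of partitions printed earlier. This is plain Python on the displayed output, not a call to the library.

```python
# Partitions as printed by adts.partitions(mode="id")
partitions = [
    ["plants", "price level", "quantity"],
    ["price level", "quantity", "supplier"],
    ["location", "plants"],
    ["location", "supplier"],
    ["product", "quantity"],
    ["id"],
]

# The dimension is the number of Fields in the largest partition
dimension = max(len(p) for p in partitions)
print(dimension)  # 3
```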

For a given partition, the Fields of a Dataset are split into:
- primary: partition fields
- secondary: fields derived from or coupled to primary fields
- unique: unique fields
- variable: other fields


In [9]:
adts.field_partition(mode="id")

{'primary': ['plants', 'quantity', 'price level'],
 'secondary': ['plts'],
 'mixte': ['supplier', 'location', 'product'],
 'unique': ['valid'],
 'variable': ['price', 'group', 'id']}
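
The 'primary', 'secondary' and 'unique' groups can be reproduced with simple counts. The sketch below is an illustration only: `depends_on` and `split_fields` are hypothetical helpers, and the remaining Fields are lumped into a single `"other"` group because the text does not say how 'mixte' and 'variable' are told apart.

```python
fruits = {
    "plants": ["fruit"] * 4 + ["vegetable"] * 4,
    "plts": ["fr"] * 4 + ["ve"] * 4,
    "quantity": ["1 kg", "10 kg"] * 4,
    "product": ["apple", "apple", "orange", "orange",
                "peppers", "peppers", "carrot", "carrot"],
    "price": [1, 10, 2, 20, 1.5, 15, 1.5, 20],
    "price level": ["low", "low", "high", "high", "low", "low", "high", "high"],
    "group": ["fruit 1", "fruit 10", "fruit 1", "veget",
              "veget", "veget", "veget", "veget"],
    "id": [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008],
    "supplier": ["sup1", "sup1", "sup1", "sup2", "sup2", "sup2", "sup2", "sup1"],
    "location": ["fr", "gb", "es", "ch", "gb", "fr", "es", "ch"],
    "valid": ["ok"] * 8,
}


def depends_on(field, primary):
    """True if each value of `primary` maps to a single value of `field`."""
    return len(set(zip(field, primary))) == len(set(primary))


def split_fields(data, partition):
    """Group Fields relative to a partition (simplified, see the note above)."""
    groups = {"primary": list(partition), "secondary": [], "unique": [], "other": []}
    for name, col in data.items():
        if name in partition:
            continue
        if len(set(col)) == 1:
            groups["unique"].append(name)      # a single value
        elif any(depends_on(col, data[p]) for p in partition):
            groups["secondary"].append(name)   # derived from or coupled to a primary Field
        else:
            groups["other"].append(name)       # 'mixte' and 'variable' together
    return groups


print(split_fields(fruits, ["plants", "quantity", "price level"]))
print(split_fields(fruits, ["product", "quantity"]))
```

With these assumptions, the 'secondary' group is `['plts']` for the first partition and `['plants', 'plts', 'price level']` for the second.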

In [14]:
adts.relation_partition()

{'plants': ['plants'],
 'quantity': ['quantity'],
 'price level': ['price level'],
 'plts': ['plants'],
 'supplier': [],
 'location': ['price level'],
 'product': ['plants', 'price level'],
 'valid': [],
 'price': ['plants', 'quantity', 'price level'],
 'group': ['plants', 'quantity', 'price level'],
 'id': ['plants', 'quantity', 'price level']}

In [10]:
adts.field_partition(mode="id", partition=["product", "quantity"])

{'primary': ['product', 'quantity'],
 'secondary': ['plants', 'plts', 'price level'],
 'mixte': ['supplier', 'location'],
 'unique': ['valid'],
 'variable': ['price', 'group', 'id']}

In [13]:
adts.relation_partition(partition=["product", "quantity"])

{'product': ['product'],
 'quantity': ['quantity'],
 'plants': ['product'],
 'plts': ['plants'],
 'price level': ['product'],
 'supplier': [],
 'location': [],
 'valid': [],
 'price': ['product', 'quantity'],
 'group': ['product', 'quantity'],
 'id': ['product', 'quantity']}

## Use of Partitions
For a given partition, a Dataset can be converted into a multi-dimensional entity (here, an xarray object).

In [11]:
dts.to_xarray()

IndexError: list index out of range

In [None]:
print(dts.to_xarray(idxname=["product", "quantity"]))
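
To illustrate the idea behind this conversion, the sketch below lays out 'price' on a 2-D grid indexed by the partition ['product', 'quantity'], using plain Python lists. It does not reproduce what `to_xarray` does internally. The names `rows`, `cols` and `grid` are illustrative only.

```python
product = ["apple", "apple", "orange", "orange",
           "peppers", "peppers", "carrot", "carrot"]
quantity = ["1 kg", "10 kg"] * 4
price = [1, 10, 2, 20, 1.5, 15, 1.5, 20]

# Axis labels: distinct values of each partition Field, in order of appearance
rows = list(dict.fromkeys(product))
cols = list(dict.fromkeys(quantity))

# One cell per (product, quantity) combination; the partition guarantees
# that each cell is filled exactly once
grid = [[None] * len(cols) for _ in rows]
for p, q, value in zip(product, quantity, price):
    grid[rows.index(p)][cols.index(q)] = value

for label, line in zip(rows, grid):
    print(label, line)
```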