# Dataset: structure analysis

## Goals

- understand the structure of a Dataset object
- introduce the methods for managing that structure


-----

## Example

The Dataset reused in this example is:

<img src="https://loco-philippe.github.io/ES/ilist_merge.png" width="700">


In [1]:
from tab_dataset import Sdataset
from pprint import pprint

img = Sdataset.from_file('score.il')                # reuse the Dataset from the aggregation Notebook

## Relationships

The relationships between Fields can be shown using the img Dataset:

<img src="https://loco-philippe.github.io/ES/ilist_canonical.png" width="600">


In [2]:
pprint(img.field_partition())
print('\n', img.tree())

{'primary': [], 'secondary': [], 'unique': [], 'variable': []}

 -1: root-derived (13)
   0 : score (2 - 11)
      1 : course (8 - 3)
   2 : year (12 - 1)
   3 : examen (10 - 3)
   4 : full name (9 - 4)
      7 : surname (0 - 4)
      5 : last name (1 - 3)
      6 : first name (1 - 3)
      8 : group (1 - 3)
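
The kinds of relationship named in this notebook (coupled, derived, crossed, linked) can be illustrated without the library. The sketch below is not the tab_dataset implementation: `relationship` is a hypothetical helper, and the counting rules it uses are an assumption about what the four terms mean (one-to-one, one-to-many, all combinations present, anything else). The sample values are the ones attached to the four students in this Dataset.

```python
def relationship(a, b):
    """Classify the relationship between two equal-length columns.

    Hypothetical helper: the rules below are an assumed reading of the
    terms 'coupled', 'derived', 'crossed' and 'linked'.
    """
    n_a, n_b = len(set(a)), len(set(b))
    n_pairs = len(set(zip(a, b)))          # distinct (a, b) combinations
    if n_pairs == n_a == n_b:
        return 'coupled'                   # one-to-one
    if n_pairs == max(n_a, n_b):
        return 'derived'                   # one column determines the other
    if n_pairs == n_a * n_b:
        return 'crossed'                   # every combination is present
    return 'linked'                        # none of the above

full_name = ['anne white', 'philippe white', 'camille red', 'philippe black']
surname = ['skyler', 'heisenberg', 'saul', 'gus']
last_name = ['white', 'white', 'red', 'black']
first_name = ['anne', 'philippe', 'camille', 'philippe']

print(relationship(full_name, surname))     # coupled
print(relationship(full_name, last_name))   # derived
print(relationship(last_name, first_name))  # linked
print(relationship(['math', 'math', 'english', 'english'],
                   ['t1', 't2', 't1', 't2']))  # crossed
```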


## Transformation to complete Dataset

When its primary Fields are crossed, a Dataset is complete: it can be converted into a matrix or stored in a compact file.

In [3]:
img.full(fillvalue=float('nan'), idxname=['full name', 'course', 'examen'])   # convert secondary indexes into primary indexes

pprint(img.field_partition(mode='id'))

print('\nlength :', len(img), 'dimension :', img.dimension)

{'primary': ['course', 'examen', 'full name'],
 'secondary': ['last name', 'first name', 'group', 'surname'],
 'unique': ['year'],
 'variable': ['score']}

length : 36 dimension : 3
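
The length of 36 is the product of the sizes of the three primary Fields (3 courses, 3 exams, 4 names). As a rough illustration of what crossing means, the following plain-Python sketch fills in every missing combination with a fill value. It is independent of tab_dataset: `cross_fill` is a hypothetical helper, and only three of the known scores are listed to keep the example short.

```python
from itertools import product

courses = ['math', 'english', 'software']
exams = ['t1', 't2', 't3']
names = ['anne white', 'philippe white', 'camille red', 'philippe black']

# A few existing records: (course, examen, full name) -> score
scores = {
    ('math', 't1', 'anne white'): 11,
    ('math', 't1', 'philippe white'): 15,
    ('english', 't2', 'camille red'): 4,
}

def cross_fill(records, *axes, fillvalue=float('nan')):
    """Return one record per combination of the axes (hypothetical helper).

    Combinations absent from `records` receive `fillvalue`.
    """
    return {key: records.get(key, fillvalue) for key in product(*axes)}

complete = cross_fill(scores, courses, exams, names)
print(len(complete))  # 36 = 3 * 3 * 4
```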


## Indexed matrix

By adjusting codecs or values, a Dataset can be transformed into a matrix with a chosen dimension.

In [4]:
print(img.to_xarray())
print('\nObject img is complete  ? ', img.complete)
print('\nObject img have the canonical order ? ', img.iscanonorder())


<xarray.DataArray 'score' (course: 3, examen: 3, full name: 4)>
array([[[11, 15, nan, nan],
        [13, nan, nan, nan],
        [15, nan, nan, nan]],

       [[nan, nan, 2, 6],
        [10, 8, 4, nan],
        [12, nan, nan, nan]],

       [[nan, nan, nan, nan],
        [nan, nan, 18, nan],
        [nan, nan, 17, 18]]], dtype=object)
Coordinates:
  * course      (course) object 'math' 'english' 'software'
  * examen      (examen) object 't1' 't2' 't3'
  * full name   (full name) object 'anne white' ... 'philippe black'
    last name   (full name) object 'white' 'white' 'red' 'black'
    first name  (full name) object 'anne' 'philippe' 'camille' 'philippe'
    group       (full name) object 'gr1' 'gr2' 'gr3' 'gr3'
    surname     (full name) object 'skyler' 'heisenberg' 'saul' 'gus'
Attributes:
    year:     2021

Object img is complete  ?  True

Object img have the canonical order ?  True
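
A complete Dataset whose records are sorted can be reshaped into a matrix by position alone. The sketch below assumes that canonical order means row-major order over the primary Fields (the last Field varies fastest); that reading is an assumption, and `flat_index` and `to_matrix` are hypothetical helpers rather than library functions.

```python
def flat_index(i, j, k, shape):
    """Position of cell (i, j, k) in a row-major flat list (assumed ordering)."""
    _, n_j, n_k = shape
    return (i * n_j + j) * n_k + k

def to_matrix(flat, shape):
    """Reshape a flat list of n_i * n_j * n_k values into nested lists."""
    n_i, n_j, n_k = shape
    return [[[flat[flat_index(i, j, k, shape)] for k in range(n_k)]
             for j in range(n_j)]
            for i in range(n_i)]

shape = (3, 3, 4)             # course, examen, full name
flat = list(range(36))        # placeholder values: each cell holds its own position
matrix = to_matrix(flat, shape)

print(matrix[1][2][3])        # 23 = (1 * 3 + 2) * 4 + 3
```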


In [5]:
#img.nindex('score').tostdcodec(inplace=True)
print('absolute keys is not necessary in json object when Dataset is complete :\n')   
pprint(img.to_ntv().to_obj(), width=200)
print('\nconversion is reversible ? ', img.from_ntv(img.to_ntv()) == img)

absolute keys is not necessary in json object when Dataset is complete :

{'course': [['math', 'english', 'software'], [12]],
 'examen': [['t1', 't2', 't3'], [4]],
 'first name': [['anne', 'philippe', 'camille'], 2, [0, 1, 2, 1]],
 'full name': [['anne white', 'philippe white', 'camille red', 'philippe black'], [1]],
 'group': [['gr1', 'gr2', 'gr3'], 2, [0, 1, 2, 2]],
 'last name': [['white', 'red', 'black'], 2, [0, 0, 1, 2]],
 'score': [[11, 13, 15, 10, 12, 8, 17, 18, 2, 4, 6, nan], [0, 2, 11, 11, 1, 11, 11, 11, 2, 11, 11, 11, 11, 11, 8, 10, 3, 5, 9, 11, 4, 11, 11, 11, 11, 11, 11, 11, 11, 11, 7, 11, 11, 11, 6, 7]],
 'surname': [['skyler', 'heisenberg', 'saul', 'gus'], 2],
 'year': 2021}

conversion is reversible ?  True
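
In the JSON above, the primary Fields carry a single integer instead of a list of 36 keys. The sketch below assumes that this integer is a repetition coefficient (each codec value is repeated that many times, and the pattern is cycled to the full length), which is consistent with the sizes shown (3 × 12 = 36 for course). `expand` is a hypothetical helper, not part of tab_dataset. The score Field is decoded from its explicit keys.

```python
def expand(codec, coef, length):
    """Rebuild full values from a codec and a repetition coefficient.

    Assumed rule: value at position i is codec[(i // coef) % len(codec)].
    """
    return [codec[(i // coef) % len(codec)] for i in range(length)]

course = expand(['math', 'english', 'software'], 12, 36)
examen = expand(['t1', 't2', 't3'], 4, 36)
full_name = expand(['anne white', 'philippe white', 'camille red', 'philippe black'], 1, 36)

# score uses explicit keys: each key is a position in the codec
codec = [11, 13, 15, 10, 12, 8, 17, 18, 2, 4, 6, float('nan')]
keys = [0, 2, 11, 11, 1, 11, 11, 11, 2, 11, 11, 11,
        11, 11, 8, 10, 3, 5, 9, 11, 4, 11, 11, 11,
        11, 11, 11, 11, 11, 11, 7, 11, 11, 11, 6, 7]
score = [codec[k] for k in keys]

records = list(zip(course, examen, full_name, score))
print(records[0])    # ('math', 't1', 'anne white', 11)
print(records[17])   # ('english', 't2', 'philippe white', 8)
print(records[30])   # ('software', 't2', 'camille red', 18)
```

The three printed records agree with the corresponding cells of the matrix shown earlier, which supports the assumed reading of the coefficients.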


In [6]:
# matrix of dimension 2
img.nindex('course').coupling(img.nindex('examen'))   # transform two linked Fields into two derived or coupled Fields
print('new dimension : ', img.dimension, '\n')
img.to_xarray()

new dimension :  2 

