# Subject: example use of the "relationship" property
The Jupyter Notebook is available on [nbviewer](http://nbviewer.org/github/loco-philippe/tab-dataset/tree/main/example/Dataset/).
## Goal
- show, on a real example, how to specify the links between fields
- identify what a tool for analyzing these links could contribute


## Presentation of the example
The example uses the IRVE file of electric vehicle (EV) charging stations ([data schema](https://schema.data.gouv.fr/etalab/schema-irve/latest/)).

The IRVE file lists charging stations and includes, in particular:
- for a station: an Id, a name, an address and coordinates
- for each station several charging points identified by an Id_pdc 
- an operator for each station 

>   
> <img src="https://loco-philippe.github.io/ES/IRVE_modele_conceptuel.PNG" width="600">

Only a few rows and columns have been extracted for the example (table below for 4 stations):

|nom_operateur	|id_station_itinerance	|nom_station	|adresse_station	|coordonneesXY |id_pdc_itinerance|
|:----|:----|:----|:----|:----|:----|
|SEVDEC	|FRSEVP1SCH01	|SCH01	|151 Rue d'Uelzen 76230 Bois-Guillaume	|[1.106329, 49.474202]	|FRSEVE1SCH0101|
|SEVDEC	|FRSEVP1SCH03	|SCH03	|151 Rue d'Uelzen 76230 Bois-Guillaume	|[1.106329, 49.474202]	|FRSEVE1SCH0301|
|SEVDEC	|FRSEVP1SCH02	|SCH02	|151 Rue d'Uelzen 76230 Bois-Guillaume	|[1.106329, 49.474202]	|FRSEVE1SCH0201|
|Sodetrel	|FRS35PSD35711	|RENNES - PLACE HONORE COMMEREUC	|13 Place Honoré Commeurec 35000 Rennes	|[-1.679739, 48.108482]	|FRS35ESD357111|
|Sodetrel	|FRS35PSD35712	|RENNES - PLACE HONORE COMMEREUC	|13 Place Honoré Commeurec 35000 Rennes	|[-1.679739, 48.108482]	|FRS35ESD357112|
|Virta	|FRE10E30333	|Camping Arinella	|Route de la mer, Brushetto - 20240 Ghisonaccia	|[9.445075, 41.995246]	|FRE10E30333|
|Virta	|FRE10E20923	|Camping Arinella	|Route de la mer, Brushetto - 20240 Ghisonaccia	|[9.445073, 41.995246]	|FRE10E20923|
|Virta	|FRE10P20922	|Camping Arinella	|Route de la mer, Brushetto - 20240 Ghisonaccia	|[9.445072, 41.995246]	|FRE10P20922|
|Virta	|FRE10P20921	|Camping Arinella	|Route de la mer, Brushetto - 20240 Ghisonaccia	|[9.445071, 41.995246]	|FRE10P20921|
|DEBELEC	|FRSGAP1M2026	|M2026	|2682 Boulevard François Xavier Fafeur 11000 Carcassonne	|[2.298185, 43.212574]	|FRSGAE1M202603|
|DEBELEC	|FRSGAP1M2026	|M2026	|2682 Boulevard François Xavier Fafeur 11000 Carcassonne	|[2.298185, 43.212574]	|FRSGAE1M202602|
|DEBELEC	|FRSGAP1M2026	|M2026	|2682 Boulevard François Xavier Fafeur 11000 Carcassonne	|[2.298185, 43.212574]	|FRSGAE1M202601|

The extract contains a few errors:
- the id and name of the station operated by SEVDEC are different for each charging point,
- the id of the station operated by Sodetrel also differs for each charging point,
- the coordinates and ids of the Virta station also vary from one charging point to another.

## Improvement of the specification
The errors found could be avoided by defining the dependency rules between columns according to the data model associated with the table. 

There are three entities: 
- the operator who can operate several stations (a single field: nom_operateur)
- the stations, each of which contains several charging points (four fields: id_station_itinerance, nom_station, adresse_station, coordonneesXY),
- the charging points (a single field: id_pdc_itinerance)

This data model results in the following specifications: 
- the nom_operateur field is derived from the id_station_itinerance field (1-n relationship)
- the id_station_itinerance field is derived from the id_pdc_itinerance field (1-n relationship)
- the nom_station, adresse_station and coordonneesXY fields are coupled to the id_station_itinerance field (1-1 relationship)
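The two kinds of link can be illustrated in plain Python. This is a minimal sketch, not the `tab_dataset` implementation: the helper names `is_derived` and `is_coupled` are hypothetical, and it assumes that "derived" means each source value is paired with exactly one target value, and that "coupled" means this holds in both directions. The values are copied from the Sodetrel and DEBELEC rows of the table.

```python
from collections import defaultdict


def is_derived(source, target):
    """True if each value of `source` is paired with exactly one value of `target`."""
    seen = defaultdict(set)
    for s, t in zip(source, target):
        seen[s].add(t)
    return all(len(values) == 1 for values in seen.values())


def is_coupled(left, right):
    """True if the pairing is one-to-one, i.e. derived in both directions."""
    return is_derived(left, right) and is_derived(right, left)


# Values copied from two Sodetrel rows and two DEBELEC rows of the table.
id_pdc = ['FRS35ESD357111', 'FRS35ESD357112', 'FRSGAE1M202603', 'FRSGAE1M202602']
id_station = ['FRS35PSD35711', 'FRS35PSD35712', 'FRSGAP1M2026', 'FRSGAP1M2026']
nom_station = ['RENNES - PLACE HONORE COMMEREUC', 'RENNES - PLACE HONORE COMMEREUC',
               'M2026', 'M2026']

# Each charging point belongs to a single station: the 1-n link holds.
print(is_derived(id_pdc, id_station))
# The Rennes name is shared by two station ids: the 1-1 link fails.
print(is_coupled(id_station, nom_station))
```

On these four rows the first check succeeds and the second fails, which is the Sodetrel error described earlier.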

These specifications translate into "relationship" properties for each of the fields:

In [1]:
schema = {'relationships': [
    {'fields': ['id_station_itinerance', 'nom_operateur'],     'link': 'derived'},
    {'fields': ['id_pdc_itinerance', 'id_station_itinerance'], 'link': 'derived'},
    {'fields': ['id_station_itinerance', 'nom_station'],       'link': 'coupled'},
    {'fields': ['id_station_itinerance', 'adresse_station'],   'link': 'coupled'},
    {'fields': ['id_station_itinerance', 'coordonneesXY'],     'link': 'coupled'}
]}
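
To show how such a schema can drive a check, here is a hedged sketch in plain Python that evaluates each relationship on a dictionary of columns. The functions `determines` and `evaluate` and the variable `schema_excerpt` are hypothetical names introduced for illustration, and the sketch assumes that in a `'derived'` entry the first field determines the second. It is not the library's own logic.

```python
from collections import defaultdict


def determines(source, target):
    """True if every value of `source` maps to a single value of `target`."""
    mapping = defaultdict(set)
    for s, t in zip(source, target):
        mapping[s].add(t)
    return all(len(values) == 1 for values in mapping.values())


def evaluate(schema, columns):
    """Return {(field_a, field_b): bool} for each relationship in `schema`."""
    results = {}
    for rel in schema['relationships']:
        a, b = rel['fields']
        ok = determines(columns[a], columns[b])
        if rel['link'] == 'coupled':
            # a coupled link must also hold in the reverse direction
            ok = ok and determines(columns[b], columns[a])
        results[(a, b)] = ok
    return results


# Two relationships taken from the schema, applied to the three SEVDEC rows.
schema_excerpt = {'relationships': [
    {'fields': ['id_station_itinerance', 'nom_operateur'], 'link': 'derived'},
    {'fields': ['id_station_itinerance', 'coordonneesXY'], 'link': 'coupled'},
]}
columns = {
    'nom_operateur': ['SEVDEC', 'SEVDEC', 'SEVDEC'],
    'id_station_itinerance': ['FRSEVP1SCH01', 'FRSEVP1SCH03', 'FRSEVP1SCH02'],
    'coordonneesXY': ['[1.106329, 49.474202]'] * 3,
}
results = evaluate(schema_excerpt, columns)
print(results)
```

The derived link holds on these rows, while the coupled link fails because a single coordinate pair is shared by three station ids.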

------
## Specification check tool example

- a CSV file is populated with the table above
- an `Sdataset` object is initialized with this file


In [2]:
import requests
from pprint import pprint
from tab_dataset import Sdataset

chemin = 'https://raw.githubusercontent.com/loco-philippe/Environmental-Sensing/master/python/Validation/irve/'
data_csv = 'IRVE_example.csv'
url = chemin + data_csv
# download the example file and save it locally (the file is closed on exit)
with open(data_csv, 'wb') as f:
    f.write(requests.get(url, allow_redirects=True).content)

irve = Sdataset.from_csv(data_csv, header=True, optcsv=None, decode_json=False)
print('row number : ', len(irve))
print('fields list : ')
pprint(irve.indexinfos(keys=['num', 'id']))

row number :  12
fields list : 
[{'id': 'nom_operateur', 'num': 0},
 {'id': 'id_station_itinerance', 'num': 1},
 {'id': 'nom_station', 'num': 2},
 {'id': 'adresse_station', 'num': 3},
 {'id': 'coordonneesXY', 'num': 4},
 {'id': 'id_pdc_itinerance', 'num': 5}]


## Initial check

Only two of the five relationships hold on the raw data (id_station / id_pdc and operateur / id_station).


In [3]:
operateur, id_station, nom_station, adresse, coord, id_pdc = irve.lindex
print('operateur and id_station are derived : ', id_station.isderived(operateur))
print('id_station and id_pdc are derived : ', id_station.isderived(id_pdc))
print('nom_station and id_station are coupled : ', nom_station.iscoupled(id_station))
print('adresse_station and id_station are coupled : ', adresse.iscoupled(id_station))
print('coordonneesXY and id_station are coupled : ', coord.iscoupled(id_station), '\n')

print('derived tree :\n', irve.tree())

operateur and id_station are derived :  True
id_station and id_pdc are derived :  True
nom_station and id_station are coupled :  False
adresse_station and id_station are coupled :  False
coordonneesXY and id_station are coupled :  False 

derived tree :
 -1: root-derived (12)
   1 : id_station_itineranc (2 - 10)
      2 : nom_station (4 - 6)
         0 : nom_operateur (2 - 4)
            3 : adresse_station (0 - 4)
      4 : coordonneesXY (3 - 7)
   5 : id_pdc_itinerance (0 - 12)


----
## Application of an imposed structure
Records that are inconsistent with the defined data schema can also be located.

In [4]:
# check_relationship returns, for each relationship, the rows that violate it
errors = irve.check_relationship(schema)
pprint(errors)

{'adresse_station - id_station_itinerance': (0, 1, 2, 3, 4, 5, 6, 7, 8),
 'coordonneesXY - id_station_itinerance': (0, 1, 2, 3, 4),
 'id_station_itinerance - id_pdc_itinerance': (),
 'nom_operateur - id_station_itinerance': (),
 'nom_station - id_station_itinerance': (3, 4, 5, 6, 7, 8)}
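
The idea of locating the offending rows can be sketched without the library. The function `violating_rows` below is a hypothetical helper, not `check_relationship` itself. It assumes a row is in error when its value on one side is paired with several values on the other side. Applied to the `id_station_itinerance` and `nom_station` columns of the table, it returns the same row numbers as the last entry of the output above.

```python
from collections import defaultdict


def violating_rows(left, right, link):
    """Return the indices of the rows that break a 'derived' or 'coupled' link."""
    forward, backward = defaultdict(set), defaultdict(set)
    for a, b in zip(left, right):
        forward[a].add(b)
        backward[b].add(a)
    bad = []
    for i, (a, b) in enumerate(zip(left, right)):
        ambiguous = len(forward[a]) > 1
        if link == 'coupled':
            # a coupled link must also be unambiguous from right to left
            ambiguous = ambiguous or len(backward[b]) > 1
        if ambiguous:
            bad.append(i)
    return tuple(bad)


# Columns copied from the 12-row table.
id_station = ['FRSEVP1SCH01', 'FRSEVP1SCH03', 'FRSEVP1SCH02',
              'FRS35PSD35711', 'FRS35PSD35712',
              'FRE10E30333', 'FRE10E20923', 'FRE10P20922', 'FRE10P20921',
              'FRSGAP1M2026', 'FRSGAP1M2026', 'FRSGAP1M2026']
nom_station = (['SCH01', 'SCH03', 'SCH02']
               + ['RENNES - PLACE HONORE COMMEREUC'] * 2
               + ['Camping Arinella'] * 4
               + ['M2026'] * 3)

rows = violating_rows(id_station, nom_station, 'coupled')
print(rows)  # (3, 4, 5, 6, 7, 8)
```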


## Checking against the imposed structure
Imposing the structure produces additional data, which is checked by the 'getduplicates' function.

A new column is added, with the value True when a record respects the structure and False otherwise. In the example considered, only the last three records, which correspond to operator DEBELEC, are correct.

Note: for more detail, a column could be added for each of the defined couplings.
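
The boolean column, and the per-coupling columns mentioned in the note, can be sketched in plain Python. This is an illustration under stated assumptions, not the library's 'getduplicates' function: `row_flags` is a hypothetical helper, and only two of the three couplings are evaluated to keep the data short.

```python
from collections import defaultdict


def row_flags(left, right):
    """One boolean per row: True when the row's (left, right) pair is one-to-one."""
    forward, backward = defaultdict(set), defaultdict(set)
    for a, b in zip(left, right):
        forward[a].add(b)
        backward[b].add(a)
    return [len(forward[a]) == 1 and len(backward[b]) == 1
            for a, b in zip(left, right)]


# Columns copied from the 12-row table.
id_station = ['FRSEVP1SCH01', 'FRSEVP1SCH03', 'FRSEVP1SCH02',
              'FRS35PSD35711', 'FRS35PSD35712',
              'FRE10E30333', 'FRE10E20923', 'FRE10P20922', 'FRE10P20921',
              'FRSGAP1M2026', 'FRSGAP1M2026', 'FRSGAP1M2026']
nom_station = (['SCH01', 'SCH03', 'SCH02']
               + ['RENNES - PLACE HONORE COMMEREUC'] * 2
               + ['Camping Arinella'] * 4
               + ['M2026'] * 3)
coord = (['[1.106329, 49.474202]'] * 3
         + ['[-1.679739, 48.108482]'] * 2
         + ['[9.445075, 41.995246]', '[9.445073, 41.995246]',
            '[9.445072, 41.995246]', '[9.445071, 41.995246]']
         + ['[2.298185, 43.212574]'] * 3)

# One column of flags per coupling, then a global column combining them.
per_coupling = {
    'nom_station': row_flags(id_station, nom_station),
    'coordonneesXY': row_flags(id_station, coord),
}
valid = [all(flags) for flags in zip(*per_coupling.values())]
print(valid)
```

With these two couplings, the global flag is False for the first nine rows and True for the three DEBELEC rows.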

----
## Data correction
The corrections to be made to comply with the specification could be as follows:
- field id_station_itinerance: FRSEVP1SCH (first 3 rows), FRS35PSD35711 (next 2 rows), FRE10E2092 (next 4 rows)
- field nom_station: SCH (first 3 rows)
- field coordonneesXY: [9.445071, 41.995246] (rows 6 to 8)

The corrected table would therefore be:

|nom_operateur	|id_station_itinerance	|nom_station	|adresse_station	|coordonneesXY  |id_pdc_itinerance|
|:----|:----|:----|:----|:----|:----|
|SEVDEC	|FRSEVP1SCH	|SCH	|151 Rue d'Uelzen 76230 Bois-Guillaume	|[1.106329, 49.474202]	|FRSEVE1SCH0101|
|SEVDEC	|FRSEVP1SCH	|SCH	|151 Rue d'Uelzen 76230 Bois-Guillaume	|[1.106329, 49.474202]	|FRSEVE1SCH0301|
|SEVDEC	|FRSEVP1SCH	|SCH	|151 Rue d'Uelzen 76230 Bois-Guillaume	|[1.106329, 49.474202]	|FRSEVE1SCH0201|
|Sodetrel	|FRS35PSD35711	|RENNES - PLACE HONORE COMMEREUC	|13 Place Honoré Commeurec 35000 Rennes	|[-1.679739, 48.108482]	|FRS35ESD357111|
|Sodetrel	|FRS35PSD35711	|RENNES - PLACE HONORE COMMEREUC	|13 Place Honoré Commeurec 35000 Rennes	|[-1.679739, 48.108482]	|FRS35ESD357112|
|Virta	|FRE10E2092	|Camping Arinella	|Route de la mer, Brushetto - 20240 Ghisonaccia	|[9.445071, 41.995246]	|FRE10E30333|
|Virta	|FRE10E2092	|Camping Arinella	|Route de la mer, Brushetto - 20240 Ghisonaccia	|[9.445071, 41.995246]	|FRE10E20923|
|Virta	|FRE10E2092	|Camping Arinella	|Route de la mer, Brushetto - 20240 Ghisonaccia	|[9.445071, 41.995246]	|FRE10P20922|
|Virta	|FRE10E2092	|Camping Arinella	|Route de la mer, Brushetto - 20240 Ghisonaccia	|[9.445071, 41.995246]	|FRE10P20921|
|DEBELEC	|FRSGAP1M2026	|M2026	|2682 Boulevard François Xavier Fafeur 11000 Carcassonne	|[2.298185, 43.212574]	|FRSGAE1M202603|
|DEBELEC	|FRSGAP1M2026	|M2026	|2682 Boulevard François Xavier Fafeur 11000 Carcassonne	|[2.298185, 43.212574]	|FRSGAE1M202602|
|DEBELEC	|FRSGAP1M2026	|M2026	|2682 Boulevard François Xavier Fafeur 11000 Carcassonne	|[2.298185, 43.212574]	|FRSGAE1M202601|

In [5]:
id_station[:9] = ['FRSEVP1SCH', 'FRSEVP1SCH', 'FRSEVP1SCH', 'FRS35PSD35711', 'FRS35PSD35711', 
                   'FRE10E2092', 'FRE10E2092', 'FRE10E2092', 'FRE10E2092']
nom_station[:3] = ['SCH', 'SCH', 'SCH']
coord[5:8] = ['[9.445071, 41.995246]', '[9.445071, 41.995246]', '[9.445071, 41.995246]']
irve.reindex()

Sdataset[12, 6]
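
The effect of the correction on a coupling can also be sketched on plain lists, limited here to the first five rows (SEVDEC and Sodetrel). The helper `one_to_one` is a hypothetical name. It relies on the fact that a pairing is one-to-one exactly when the number of distinct pairs equals the number of distinct values on each side.

```python
def one_to_one(left, right):
    """True if the (left, right) pairing is a bijection between distinct values."""
    pairs = set(zip(left, right))
    distinct_left = {a for a, _ in pairs}
    distinct_right = {b for _, b in pairs}
    return len(pairs) == len(distinct_left) == len(distinct_right)


# First five rows of the table, before correction.
id_station = ['FRSEVP1SCH01', 'FRSEVP1SCH03', 'FRSEVP1SCH02',
              'FRS35PSD35711', 'FRS35PSD35712']
coord = ['[1.106329, 49.474202]'] * 3 + ['[-1.679739, 48.108482]'] * 2

before = one_to_one(id_station, coord)

# Apply the station id correction described in this section.
id_station[:5] = ['FRSEVP1SCH'] * 3 + ['FRS35PSD35711'] * 2

after = one_to_one(id_station, coord)
print(before, after)
```

Before the correction, five ids share two coordinate pairs, so the coupling fails. After it, two ids match two coordinate pairs and the coupling holds.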

## New check 
The check carried out with this new data shows that the specification would then be respected:


In [6]:
print('operateur and id_station are derived : ', id_station.isderived(operateur))
print('id_station and id_pdc are derived : ', id_station.isderived(id_pdc))
print('nom_station and id_station are coupled : ', nom_station.iscoupled(id_station))
print('adresse_station and id_station are coupled : ', adresse.iscoupled(id_station))
print('coordonneesXY and id_station are coupled : ', coord.iscoupled(id_station), '\n')

pprint(irve.check_relationship(schema))

operateur and id_station are derived :  True
id_station and id_pdc are derived :  True
nom_station and id_station are coupled :  True
adresse_station and id_station are coupled :  True
coordonneesXY and id_station are coupled :  True 

{'adresse_station - id_station_itinerance': (),
 'coordonneesXY - id_station_itinerance': (),
 'id_station_itinerance - id_pdc_itinerance': (),
 'nom_operateur - id_station_itinerance': (),
 'nom_station - id_station_itinerance': ()}
