# Methodology for taking into account relations between fields in tabular representations
This notebook proposes an evolution of the methodology used in several open data projects (e.g. the French guide to [préparation des données à l'ouverture et la circulation](https://guides.etalab.gouv.fr/qualite/documenter-les-donnees/#points-de-contact)).
   
Only the additions that could be made to the existing approach are discussed below.
   
This Jupyter Notebook is available for consultation on [nbviewer](http://nbviewer.org/github/loco-philippe/Environmental-Sensing/tree/main/property_relationship).

----------

## 0 - Introduction
### 0.1 - Objective
Data schema definition tools (e.g. [TableSchema](https://specs.frictionlessdata.io/table-schema/#language)) define, on the one hand, descriptive and explanatory information about a data structure and, on the other hand, the rules to be respected when documenting this structure.
   
The rules currently defined mainly concern the fields taken separately but do not include the relationships between the fields that make up this structure.
> *Example of rules not currently processed:*
> - *a "person" is associated with a single "social security number" (and vice versa)*
> - *a "student" belongs to a single "class"*   
   
Relationships between fields are important for the consistency of a dataset. Moreover, they are very often expressed in the data models that describe them.   
However, they are neither included in the data schemas nor checked when the dataset is documented.
   
The proposed evolution therefore consists in taking these relationships between fields into account both in the preparation phase and in the operation phase.

### 0.2 - Example

To make the subject easier to understand, a single example is used throughout this presentation. It concerns electric vehicle charging infrastructure (IRVE), which is the subject of a detailed schema and a large dataset ([data.gouv.fr link](https://www.data.gouv.fr/fr/datasets/fichier-consolide-des-bornes-de-recharge-pour-vehicules-electriques/)).
   
An analysis of the complete IRVE dataset is available [on this link](https://github.com/loco-philippe/Environmental-Sensing/blob/main/python/Validation/irve/test_IRVE.ipynb).
    
The example presented is also detailed [on this link](https://github.com/loco-philippe/Environmental-Sensing/blob/main/python/Validation/irve/test_IRVE-simple.ipynb).

## 1 - Preparation: Establishing the table schema


### 1.1 Description of the conceptual data model

The conceptual data model describes the structure of the information that makes up the data sets.
The most widely used modelling approach, and the best suited to tabular datasets, is the ["entity-association"](https://en.wikipedia.org/wiki/Entity%E2%80%93relationship_model) model. It describes:
- the entities,
- associations and dependencies between entities,
- the identifiers (primaryKeys) and attributes which explain the entities.

The initial modeling does not take implementation constraints into account; it is a tool for dialogue between the various stakeholders.

> *IRVE example:*
>
> In the simplified example, we consider two main entities:
> - **stations**: they are uniquely identified by an "Id" and characterized by a name,
> - **points de charge (pdc)**, i.e. charging points: the equipment attached to a station that connects to the vehicles to be charged. They are also identified by an "Id".
>
> Two other entities are also present:
> - a **localisation**: identified by geographical coordinates and described by an address
> - an **opérateur**: the operator of the infrastructure. It is identified by a name
>
> <img src="https://loco-philippe.github.io/ES/IRVE_modele_conceptuel.PNG" width="600">

### 1.2 Description of the logical data model

The logical data model adapts the conceptual model to the envisaged information system (e.g. relational database, object modelling, etc.).
   
In the case of a tabular implementation, the logical model shows each future field as an entity, according to the following rules:
- each entity of the conceptual model is replaced by an entity representing its identifier,
- each attribute of the conceptual model becomes a new entity,
- attribute entities are associated with identifier entities by 1-n relationships.

The logical model is therefore directly deduced from the conceptual model.
   
> *Note:*
>
> 1 - *The 1-n relationship between attributes and identifiers expresses the fact that an attribute describes a given object. It can be reinforced into a 1-1 relationship if the considered attribute must be unique for a given identifier.*
> 2 - *In a relational database implementation, the notion of attributes can remain attached to the notion of entity when a "table" includes both identifiers and attributes.*

> *IRVE example:*
>
> The logical model deduced from the previous conceptual model is as follows:
> <img src="https://loco-philippe.github.io/ES/IRVE_modele_logique.PNG" width="600">
>    
> If the name of the station must be unique, the 1-n relationship between id_station_itinerance and nom_station can be reinforced by a 1-1 relationship.

### 1.3 Physical model

The physical model consists, on the one hand, of describing the fields in the schema and, on the other hand, of specifying how they are split into files.

#### 1.3.1 Field structure

The fields are defined in the schema (not detailed here). The relationships expressed in the logical model should therefore be added to this schema.
    
To do this, a `relationship` property is added to the schema for the fields concerned. It names the `parent` field and carries a `link` with two possible values:
- "derived" which expresses a 1 - n relationship
- "coupled" which expresses a 1 - 1 relationship


> *Note:*
>
> 1 - *The "coupled" property is symmetric, so it can be carried indifferently by one of the two fields (unlike the "derived" property).*   
> 2 - *A cardinality 0-n (or 0-1) in a tabular representation is equivalent to indicating that the field is optional (an undefined value such as null, NaN, None or other is allowed in the corresponding field).*   
> 3 - *This new property is the subject of a TableSchema upgrade request (issue 803 under examination)*   
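
The two values can be illustrated with a minimal Python sketch. It assumes that "derived" means that each value of the parent field is associated with a single value of the field, and that "coupled" means that this holds in both directions. The function names `is_derived` and `is_coupled` and the records are hypothetical; they reuse the rules quoted in the introduction.

```python
from collections import defaultdict


def is_derived(rows, field, parent):
    """True if every value of `parent` is associated with a single value of `field`."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[parent]].add(row[field])
    return all(len(values) == 1 for values in seen.values())


def is_coupled(rows, field, parent):
    """True if the relationship is single-valued in both directions (1-1)."""
    return is_derived(rows, field, parent) and is_derived(rows, parent, field)


# Hypothetical records illustrating the rules quoted in the introduction.
students = [
    {"student": "Anne", "class": "A", "ssn": "001"},
    {"student": "Paul", "class": "A", "ssn": "002"},
    {"student": "Lea", "class": "B", "ssn": "003"},
]

derived_ok = is_derived(students, "class", "student")     # a student belongs to a single class
coupled_ok = is_coupled(students, "ssn", "student")       # one number per student and vice versa
class_coupled = is_coupled(students, "student", "class")  # class "A" has two students

print(derived_ok, coupled_ok, class_coupled)
```

In this sketch "class" is derived from "student" but not coupled with it, because a class can contain several students.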
    
> *IRVE example:*
>
> Properties replace cardinalities:
> <img src="https://loco-philippe.github.io/ES/IRVE_champs.PNG" width="600">
>    
> With the TableSchema syntax, the structure of the fields is as follows (in addition to the [existing properties](https://schema.data.gouv.fr/schemas/etalab/schema-irve/2.1.0/schema.json)):
> ```json
> "fields": [
>   {
>     "name": "nom_operateur",
>     "relationship" : {
>         "parent" : "id_station_itinerance",
>         "link" : "derived"
>     }
>   },
>   {
>     "name": "id_station_itinerance",
>     "relationship" : {
>         "parent" : "id_pdc_itinerance",
>         "link" : "derived"
>     }
>   },
>   {
>     "name": "nom_station",
>     "relationship" : {
>         "parent" : "id_station_itinerance",
>         "link" : "derived"
>     }
>   },
>   {
>     "name": "adresse_station",
>     "relationship" : {
>         "parent" : "coordonneesXY",
>         "link" : "derived"
>     }
>   },
>   {
>     "name": "coordonneesXY",
>     "relationship" : {
>         "parent" : "id_station_itinerance",
>         "link" : "coupled"
>     }
>   }
> ]
> ```
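
As an illustration, the `relationship` entries of such a schema could be read with the standard `json` module. This is only a sketch: the helper `read_relationships` is hypothetical and is not part of TableSchema or of any existing tool, and the schema text is a shortened fragment wrapped in braces so that it forms a valid JSON document.

```python
import json

# Shortened schema fragment, wrapped in braces to form a valid JSON document.
schema_text = """
{
  "fields": [
    {"name": "id_pdc_itinerance"},
    {"name": "id_station_itinerance",
     "relationship": {"parent": "id_pdc_itinerance", "link": "derived"}},
    {"name": "nom_station",
     "relationship": {"parent": "id_station_itinerance", "link": "derived"}},
    {"name": "coordonneesXY",
     "relationship": {"parent": "id_station_itinerance", "link": "coupled"}}
  ]
}
"""

ALLOWED_LINKS = {"derived", "coupled"}


def read_relationships(schema):
    """Return (field, parent, link) tuples for the fields that declare a relationship."""
    names = {field["name"] for field in schema["fields"]}
    relationships = []
    for field in schema["fields"]:
        rel = field.get("relationship")
        if rel is None:
            continue  # fields without the property are left unconstrained
        if rel["link"] not in ALLOWED_LINKS:
            raise ValueError(f"unknown link {rel['link']!r} for {field['name']}")
        if rel["parent"] not in names:
            raise ValueError(f"unknown parent {rel['parent']!r} for {field['name']}")
        relationships.append((field["name"], rel["parent"], rel["link"]))
    return relationships


relationships = read_relationships(json.loads(schema_text))
for field, parent, link in relationships:
    print(f"{field} is {link} from {parent}")
```

Checking that each `parent` is a declared field name is a design choice of this sketch: it catches a misspelled field name before any data is validated.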


#### 1.3.2 Splitting into files

Several strategies are possible:
- minimize the number of files: This makes it easier to access the data (eg direct use in a spreadsheet)
- create one file per main entity: This allows you to "physically" enforce the defined structure

In the multi-file case, the separation is necessarily carried out at the level of the identifier entities.
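
A minimal sketch of such a separation is given below. The helper `split_by_identifier`, the records and the allocation of fields to each file are assumptions made for illustration, not the layout of the IRVE files.

```python
def split_by_identifier(rows, identifier, attributes):
    """Project rows onto `identifier` and `attributes`, keeping one record per identifier value."""
    table = {}
    for row in rows:
        # The first record met for an identifier value is kept.
        table.setdefault(row[identifier], {k: row[k] for k in [identifier, *attributes]})
    return list(table.values())


# Hypothetical single-file records: one row per charging point.
rows = [
    {"id_pdc_itinerance": "P1", "id_station_itinerance": "S1", "nom_station": "Station 1"},
    {"id_pdc_itinerance": "P2", "id_station_itinerance": "S1", "nom_station": "Station 1"},
    {"id_pdc_itinerance": "P3", "id_station_itinerance": "S2", "nom_station": "Station 2"},
]

# One file per identifier entity; the pdc file keeps the station identifier as a link.
stations = split_by_identifier(rows, "id_station_itinerance", ["nom_station"])
pdc = split_by_identifier(rows, "id_pdc_itinerance", ["id_station_itinerance"])

print(stations)
print(pdc)
```

The split loses no information only if the attributes are single-valued for each identifier value, which is what the "derived" relationship expresses.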
   
> *IRVE example:*
>
> An example of a two-file implementation is shown below:
> <img src="https://loco-philippe.github.io/ES/IRVE_fichiers.PNG" width="600">
>
> An example of a documented single file with 4 PDC is given below:
>
> |nom_operateur|id_station_itinerance|nom_station|adresse_station|coordonneesXY|id_pdc_itinerance|
> |:----|:----|:----|:----|:----|:----|
> |SEVDEC|FRSEVP1SCH01|SCH01|151 Rue d'Uelzen 76230 Bois-Guillaume|[1.106329, 49.474202]|FRSEVE1SCH0101|
> |SEVDEC|FRSEVP1SCH03|SCH03|151 Rue d'Uelzen 76230 Bois-Guillaume|[1.106329, 49.474202]|FRSEVE1SCH0301|
> |SEVDEC|FRSEVP1SCH02|SCH02|151 Rue d'Uelzen 76230 Bois-Guillaume|[1.106329, 49.474202]|FRSEVE1SCH0201|
> |Sodetrel|FRS35PSD35711|RENNES - PLACE HONORE COMMEREUC|13 Place Honoré Commeurec 35000 Rennes|[-1.679739, 48.108482]|FRS35ESD357111|
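
The relationships declared in section 1.3.1 can be evaluated on these four records with a short Python sketch. It assumes that "derived" means one value of the field per value of the parent, and that "coupled" adds the reverse condition. The helpers `single_valued` and `holds` are hypothetical names.

```python
from collections import defaultdict

FIELDS = ["nom_operateur", "id_station_itinerance", "nom_station",
          "adresse_station", "coordonneesXY", "id_pdc_itinerance"]

# The four records of the single-file example.
RECORDS = [
    ("SEVDEC", "FRSEVP1SCH01", "SCH01", "151 Rue d'Uelzen 76230 Bois-Guillaume",
     "[1.106329, 49.474202]", "FRSEVE1SCH0101"),
    ("SEVDEC", "FRSEVP1SCH03", "SCH03", "151 Rue d'Uelzen 76230 Bois-Guillaume",
     "[1.106329, 49.474202]", "FRSEVE1SCH0301"),
    ("SEVDEC", "FRSEVP1SCH02", "SCH02", "151 Rue d'Uelzen 76230 Bois-Guillaume",
     "[1.106329, 49.474202]", "FRSEVE1SCH0201"),
    ("Sodetrel", "FRS35PSD35711", "RENNES - PLACE HONORE COMMEREUC",
     "13 Place Honoré Commeurec 35000 Rennes", "[-1.679739, 48.108482]", "FRS35ESD357111"),
]
rows = [dict(zip(FIELDS, record)) for record in RECORDS]

# (field, parent, link) triples taken from the schema of section 1.3.1.
RELATIONSHIPS = [
    ("nom_operateur", "id_station_itinerance", "derived"),
    ("id_station_itinerance", "id_pdc_itinerance", "derived"),
    ("nom_station", "id_station_itinerance", "derived"),
    ("adresse_station", "coordonneesXY", "derived"),
    ("coordonneesXY", "id_station_itinerance", "coupled"),
]


def single_valued(rows, source, target):
    """True if each value of `source` is associated with a single value of `target`."""
    groups = defaultdict(set)
    for row in rows:
        groups[row[source]].add(row[target])
    return all(len(values) == 1 for values in groups.values())


def holds(rows, field, parent, link):
    """Evaluate one relationship under the interpretation assumed in this sketch."""
    ok = single_valued(rows, parent, field)
    if link == "coupled":
        ok = ok and single_valued(rows, field, parent)
    return ok


report = {(field, parent): holds(rows, field, parent, link)
          for field, parent, link in RELATIONSHIPS}
for (field, parent), ok in report.items():
    print(f"{field} / {parent}: {'ok' if ok else 'not respected'}")
```

Under this interpretation, the four "derived" relationships hold on these records, whereas the "coupled" relationship between coordonneesXY and id_station_itinerance does not, because three stations share the same coordinates.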


## 2 - Operation: Documentation and assembly of data sets

Dataset documentation consists of documenting a set of rows according to the defined file structure.
The main expectation of this phase is to detect and correct any deviation from the defined rules as early and as simply as possible.
   
Four levels of analysis must be taken into account:
- unit validation of data for a field
- validation of a record (multi-fields)
- internal validation of a single data set (multi records)
- external validation of the global dataset (multiple data sets)

The first two levels are handled by the existing tools (not detailed here).
The third level consists in validating the rules defined by the `relationship` property on the data set before requesting its integration into the corresponding file.
The fourth level is functionally identical to the third but can only be performed on the aggregation of all the data sets.
   
Errors can be reported simply by adding Boolean control fields, one for each property checked.
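
A minimal sketch of such a control field is given below. The helper `add_control_field`, the naming convention of the added field and the records are hypothetical. The check assumes that "derived" means one value of the field per value of the parent, and that "coupled" adds the reverse condition.

```python
from collections import defaultdict


def add_control_field(rows, field, parent, link):
    """Return copies of `rows` with a Boolean field flagging the records that respect the rule."""
    children = defaultdict(set)  # parent value -> values of the field
    parents = defaultdict(set)   # value of the field -> parent values
    for row in rows:
        children[row[parent]].add(row[field])
        parents[row[field]].add(row[parent])
    control = f"{field}_{link}"  # hypothetical naming convention
    checked = []
    for row in rows:
        ok = len(children[row[parent]]) == 1
        if link == "coupled":
            ok = ok and len(parents[row[field]]) == 1
        checked.append({**row, control: ok})
    return checked


# Hypothetical records: two stations share the same coordinates.
rows = [
    {"id_station_itinerance": "S1", "coordonneesXY": "[1.0, 49.0]"},
    {"id_station_itinerance": "S2", "coordonneesXY": "[1.0, 49.0]"},
    {"id_station_itinerance": "S3", "coordonneesXY": "[2.0, 48.0]"},
]

checked = add_control_field(rows, "coordonneesXY", "id_station_itinerance", "coupled")
faulty = [index for index, row in enumerate(checked) if not row["coordonneesXY_coupled"]]
print(faulty)
```

Because the Boolean value is carried by each record, the faulty rows can be listed directly, which gives a simple form of error localization.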
   
> *Note:*
>
> 1 - *To be easily operable, the control tool must make it possible to precisely locate the errors identified.*   
> 2 - *The tool indicated in [this link](https://github.com/loco-philippe/Environmental-Sensing/blob/main/property_relationship/example.ipynb) is a simple example of such a check, but it does not locate errors.*   
> 3 - *The aggregation of multiple validated data sets does not necessarily result in a valid file. The external validation of a data set can therefore lead to the identification of errors that potentially relate to data sets that had already been validated*   
> 4 - *To integrate the 4th level as soon as data is entered, the global check must be available on request (e.g. via a service made available by the holder of the dataset aggregation file). This can be expensive in terms of resources and response time.*
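
Note 3 can be illustrated with a minimal sketch in which two data sets each respect a "coupled" relationship while their aggregation does not. The function `is_coupled` and the records are hypothetical.

```python
def is_coupled(rows, field, parent):
    """True if the distinct (parent, field) pairs form a 1-1 correspondence."""
    pairs = {(row[parent], row[field]) for row in rows}
    parents = {p for p, _ in pairs}
    children = {c for _, c in pairs}
    # 1-1 means as many distinct pairs as distinct parents and distinct children.
    return len(pairs) == len(parents) == len(children)


# Hypothetical data sets, supplied separately.
dataset_a = [{"id_station_itinerance": "S1", "coordonneesXY": "[1.0, 49.0]"}]
dataset_b = [{"id_station_itinerance": "S2", "coordonneesXY": "[1.0, 49.0]"}]

valid_a = is_coupled(dataset_a, "coordonneesXY", "id_station_itinerance")
valid_b = is_coupled(dataset_b, "coordonneesXY", "id_station_itinerance")
valid_all = is_coupled(dataset_a + dataset_b, "coordonneesXY", "id_station_itinerance")

print(valid_a, valid_b, valid_all)
```

Each data set passes the internal validation, but the aggregated file fails because the two stations share the same coordinates. Only the external validation can reveal this kind of error.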
   
> *IRVE example:*
>
> The activation and validation of these rules on a dataset are presented [on this link](https://github.com/loco-philippe/Environmental-Sensing/blob/main/python/Validation/irve/test_IRVE-simple.ipynb).
> In particular, it presents the implementation of a tool allowing both the detection and localization of errors.