# DataFrame integrity
------
This Notebook uses an example to explain how to identify integrity errors in a DataFrame.

In the example, each record is a country and is made up of four Fields:

- country : country name
- region : name of the region of the country
- code : alpha-2 country code
- population : population of the region (millions)

A Data model is used to identify relationships between Fields.

## Data model

Two entities are defined :

- country : The first attribute is the name of the country (primary key of the entity), the second is its alpha-2 country code. The code is unique to each country.
- region : The first attribute is the name of the region (primary key of the entity), the second is its population.

The data model is as follows :

In [1]:
from base64 import b64encode
from IPython.display import Image, display
from json_ntv import MermaidConnec

In [2]:
# JSON data model (entities, attributes and relationship)
df_country = { 
    'country and region:$erDiagram' : { 
        'entity': {
            'COUNTRY':  [ 
                ['string', 'country',  'PK' ], 
                ['string', 'code', 'unique'] 
            ], 
            'REGION': [ 
                ['string', 'region',  'PK'],
                ['number',    'population'] 
            ]
        },
        'relationship': [ 
            [ 'REGION', 'exactly one', 'identifying', 'one or more', 'COUNTRY',     'brings_together']
        ],

     } }

# Convert the model into a Mermaid diagram definition, then display it as an image
diag = MermaidConnec.diagram(df_country)
display(Image(url="https://mermaid.ink/img/" + b64encode(diag.encode("ascii")).decode("ascii")))

## Rules to translate the Data model into a Tabular structure

Main rules :

* Each attribute in the data model is converted into a Field
* Each Field has a 'derived' (or 'coupled' if the attribute is unique) relationship with the Field associated with the PK attribute of the same entity
* The relationship between two entities is converted into a relationship between the Fields associated with the PK attributes
* Cardinalities with a 0 are translated with the same rules as cardinalities with 1 (0 indicates that the Field is optional) 
* The cardinality of the data model relationships translates as follows :
  * 1 - 1 : "coupled"
  * 1 - n : "derived"
  * n - n : "linked" 

## Relationships

By applying the rules above, we identify three relationships :

```json
{ "relationships": [
  { "fields" : [ "country", "code"],      "description" : "attributes",      "link" : "coupled" },
  { "fields" : [ "region", "population"], "description" : "attributes",      "link" : "derived" },
  { "fields" : [ "region", "country"],    "description" : "brings_together", "link" : "derived" }
] }
```

Because the country code is declared unique for a country, the relationship between "country" and "code" is strengthened: it would otherwise be "derived" and becomes "coupled".
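
As an illustration, the translation rules can be applied mechanically to the data model. The sketch below is not part of `json_ntv`, `ntv_pandas` or TAB-analysis: `derive_relationships` and `is_many` are hypothetical helpers, and the code assumes that each entity declares exactly one `PK` attribute and that a "many" side is spelled "... or more".

```python
# Hypothetical helpers written for this notebook, not a library API.
model = {
    "entity": {
        "COUNTRY": [["string", "country", "PK"], ["string", "code", "unique"]],
        "REGION": [["string", "region", "PK"], ["number", "population"]],
    },
    "relationship": [
        ["REGION", "exactly one", "identifying", "one or more", "COUNTRY", "brings_together"]
    ],
}


def is_many(cardinality):
    # Assumption: "n" sides end with "or more"; a 0 is treated like a 1.
    return cardinality.endswith("or more")


def derive_relationships(model):
    relationships = []
    primary_key = {}
    for entity, attributes in model["entity"].items():
        # Assumption: exactly one attribute per entity carries the 'PK' flag.
        pk = next(attr[1] for attr in attributes if "PK" in attr[2:])
        primary_key[entity] = pk
        for attr in attributes:
            if attr[1] == pk:
                continue
            # A unique attribute is 'coupled' with the PK, any other is 'derived'.
            link = "coupled" if "unique" in attr[2:] else "derived"
            relationships.append(
                {"fields": [pk, attr[1]], "description": "attributes", "link": link}
            )
    for left, left_card, _kind, right_card, right, label in model["relationship"]:
        # 1 - 1 : coupled, 1 - n : derived, n - n : linked
        many_sides = is_many(left_card) + is_many(right_card)
        link = ("coupled", "derived", "linked")[many_sides]
        relationships.append(
            {"fields": [primary_key[left], primary_key[right]], "description": label, "link": link}
        )
    return relationships


relationships = derive_relationships(model)
for rel in relationships:
    print(rel)
```

For this model the helper yields the same three relationships as listed above.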


## Example
The example contains three EU countries and one in Africa:

| country | region         | code  | population |
|---------|----------------|-------|------------|
| France  | European Union | FR    | 449        |
| Spain   | European Union | ES    | 48         |
| Estonia | European Union | ES    | 449        |
| Nigeria | Africa         | NI    | 1460       |

In [3]:
import pandas as pd
import ntv_pandas as npd

In [4]:
example1 = {'country' :   ['France', 'Spain', 'Estonia', 'Nigeria'],
            'region':     ['European Union', 'European Union', 'European Union', 'Africa'],
            'code':       ['FR', 'ES', 'ES', 'NI'],
            'population': [449, 48, 449, 1460]}
pd_ex1 = pd.DataFrame(example1)

The `analysis` method relies on the [TAB-analysis](https://github.com/loco-philippe/tab-analysis/blob/main/README.md) module to check the relationships.

In [5]:
ana1 = pd_ex1.npd.analysis()
print("country - code (must be coupled): ", ana1.get_relation('country', 'code').typecoupl)
print("region - population (must be derived True): ", ana1.get_relation('region', 'population').typecoupl, 
      ana1.get_relation('region', 'population').parent_child)
print("country - region (must be derived True): ", ana1.get_relation('country', 'region').typecoupl,
     ana1.get_relation('country', 'region').parent_child)

country - code (must be coupled):  derived
region - population (must be derived True):  derived False
country - region (must be derived True):  derived True


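Two relationships are inconsistent with the data model: country - code is "derived" instead of "coupled", and region - population is "derived False" instead of "derived True". The detection tool (not detailed here, see [tab_dataset](https://github.com/loco-philippe/tab-dataset/blob/main/README.md)) returns the offending rows: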

In [6]:
print(pd_ex1.npd.check_relation('country', 'code', 'coupled', value=True))
print(pd_ex1.npd.check_relation('region', 'population', 'derived', value=True))

{'row': [1, 2], 'code': ['ES', 'ES'], 'country': ['Spain', 'Estonia']}
{'row': [0, 2, 1], 'population': [449, 449, 48], 'region': ['European Union', 'European Union', 'European Union']}
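
The same offending rows can be located with plain pandas, which may help to see what the check looks for. This is only a sketch and not the implementation of `check_relation`: `rows_breaking_dependency` is a hypothetical helper that flags the rows where one value of a Field is paired with several values of another Field.

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["France", "Spain", "Estonia", "Nigeria"],
    "region": ["European Union", "European Union", "European Union", "Africa"],
    "code": ["FR", "ES", "ES", "NI"],
    "population": [449, 48, 449, 1460],
})


def rows_breaking_dependency(frame, source, target):
    """Rows whose `source` value is paired with more than one `target` value."""
    distinct_targets = frame.groupby(source)[target].transform("nunique")
    return frame[distinct_targets > 1]


# 'coupled' requires one country per code and one code per country;
# in this data only the code -> country direction fails.
code_errors = rows_breaking_dependency(df, "code", "country")

# The model requires a single population per region.
population_errors = rows_breaking_dependency(df, "region", "population")

print(code_errors[["country", "code"]])
print(population_errors[["region", "population"]])
```

The first selection keeps the Spain and Estonia rows (shared code `ES`), the second keeps the three European Union rows (two different populations).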


## Example : after corrections

We correct the Estonia code to EE and set the population of the European Union to 449 in the Spain record:

| country | region         | code  | population |
|---------|----------------|-------|------------|
| France  | European Union | FR    | 449        |
| Spain   | European Union | ES    | 449        |
| Estonia | European Union | EE    | 449        |
| Nigeria | Africa         | NI    | 1460       |

In [7]:
example2 = {'country' :   ['France', 'Spain', 'Estonia', 'Nigeria'],
            'region':     ['European Union', 'European Union', 'European Union', 'Africa'],
            'code':       ['FR', 'ES', 'EE', 'NI'],
            'population': [449, 449, 449, 1460]}
pd_ex2 = pd.DataFrame(example2)

In [8]:
ana2 = pd_ex2.npd.analysis()
print("country - code (must be coupled): ", ana2.get_relation('country', 'code').typecoupl)
print("region - population (must be derived True): ", ana2.get_relation('region', 'population').typecoupl, 
      ana2.get_relation('region', 'population').parent_child)
print("country - region (must be derived True): ", ana2.get_relation('country', 'region').typecoupl,
     ana2.get_relation('country', 'region').parent_child)

country - code (must be coupled):  coupled
region - population (must be derived True):  coupled True
country - region (must be derived True):  derived True


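Note : The relationship between region and population is now "coupled". A "coupled" relationship is stronger than a "derived" one and therefore also satisfies the "derived" requirement of the data model.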
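
Why "coupled" implies "derived" can be seen with a rough pandas approximation of the classification. The sketch below is an assumption about what the two terms mean in terms of values (a one-way or two-way dependency between Fields), not the TAB-analysis algorithm; `determines` and `classify` are hypothetical helpers, and the `"other"` label simply covers every remaining case.

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["France", "Spain", "Estonia", "Nigeria"],
    "region": ["European Union", "European Union", "European Union", "Africa"],
    "code": ["FR", "ES", "EE", "NI"],
    "population": [449, 449, 449, 1460],
})


def determines(frame, a, b):
    """True if each value of Field `a` is paired with a single value of Field `b`."""
    return bool((frame.groupby(a)[b].nunique() <= 1).all())


def classify(frame, a, b):
    a_to_b = determines(frame, a, b)
    b_to_a = determines(frame, b, a)
    if a_to_b and b_to_a:
        return "coupled"   # dependency holds in both directions
    if a_to_b or b_to_a:
        return "derived"   # dependency holds in one direction only
    return "other"         # simplification: no dependency in either direction


for a, b in [("country", "code"), ("region", "population"), ("country", "region")]:
    print(a, "-", b, ":", classify(df, a, b))
```

Since "coupled" requires the dependency in both directions, the single direction required by "derived" holds as well.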