# Relationships descriptor

This notebook presents an example implementation of the `relationships` descriptor in `Validata`, as proposed in the [pattern](https://github.com/frictionlessdata/specs/pull/859) document.

## Example

The chosen example is the one defined in the pattern:

    
| country | region         | code  | population |
|---------|----------------|-------|------------|
| France  | European Union | FR    | 449        |
| Spain   | European Union | ES    | 48         |
| Estonia | European Union | ES    | 449        |
| Nigeria | Africa         | NI    | 1460       |
    
The data schema for this dataset is:

  ```json
  {"fields": [ 
      {"name": "country", "type": "string"},
      {"name": "region", "type": "string"},
      {"name": "code", "type": "string", "description": "country code alpha-2"},
      {"name": "population", "type": "integer", "description": "region population in 2022 (millions)"}]
  }
  ```
If we now look at the data, we see that this dataset is not consistent because it contains two structural errors:

* The value of the "code" Field must be unique for each country, so "Spain" and "Estonia" cannot both have "ES".
* The "population" Field cannot have two different values (449 and 48) for the same region, "European Union".

These structural errors make the data unusable, yet they are not detected when the dataset is validated: the current version of Table Schema has no Descriptor to express this kind of dependency between two Fields.

The pattern proposes adding a `relationships` descriptor to check these relationships:

  ```json
  { "fields": [ ... ],
    "custom_checks": [
        {"name": "relationships",
         "params": {
             "fields" : ["country", "code"],
             "description" : "is the country code alpha-2 of",
             "link" : "coupled"}},
        {"name": "relationships",
         "params": {
             "fields" : ["region", "population"],
             "description" : "is the population of",
             "link" : "derived"}}
    ]
  }
  ```
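
To make the two link types concrete, here is a minimal sketch in plain Python, independent of Validata and `tab_dataset`. The helper `inconsistent_rows` is hypothetical, and the semantics it applies are an interpretation that matches this example rather than the normative text of the pattern: "derived" is read as "each value of the first field is associated with a single value of the second field", and "coupled" as the same constraint in both directions.

```python
from collections import defaultdict


def inconsistent_rows(rows, first, second, link):
    """Return the 0-based indices of the data rows that break a relationship.

    Assumed semantics (an interpretation, not the pattern's normative text):
    - "derived": each value of `first` is associated with one value of `second`
    - "coupled": "derived" in both directions (one-to-one)
    """
    if link not in ("derived", "coupled"):
        raise ValueError(f"unsupported link: {link!r}")
    header, *data = rows
    i, j = header.index(first), header.index(second)

    def ambiguous(src, dst):
        # Collect the distinct `dst` values seen for each `src` value
        seen = defaultdict(set)
        for row in data:
            seen[row[src]].add(row[dst])
        # A row is inconsistent when its `src` value maps to several `dst` values
        return {n for n, row in enumerate(data) if len(seen[row[src]]) > 1}

    bad = ambiguous(i, j)
    if link == "coupled":
        bad |= ambiguous(j, i)
    return sorted(bad)


countries = [
    ["country", "region", "code", "population"],
    ["France", "European Union", "FR", 449],
    ["Spain", "European Union", "ES", 48],
    ["Estonia", "European Union", "ES", 449],
    ["Nigeria", "Africa", "NI", 1460],
]

print(inconsistent_rows(countries, "country", "code", "coupled"))      # [1, 2]
print(inconsistent_rows(countries, "region", "population", "derived"))  # [0, 1, 2]
```

The indices are counted from the first data row, so `[1, 2]` designates the "Spain" and "Estonia" rows and `[0, 1, 2]` the three "European Union" rows.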
 

In [6]:
countries = [['country', 'region',         'code', 'population'], 
             ['France',  'European Union', 'FR',    449        ], 
             ['Spain',   'European Union', 'ES',    48         ], 
             ['Estonia', 'European Union', 'ES',    449        ], 
             ['Nigeria', 'Africa',         'NI',    1460       ]
            ]
sch = { "fields": [
              {"name": "country", "type": "string"},
              {"name": "region", "type": "string"},
              {"name": "code", "type": "string", "description": "country code alpha-2"},
              {"name": "population", "type": "integer", "description": "region population in 2022 (millions)"}],
        "custom_checks": [
              {"name": "relationships",
               "params": {
                   "fields" : ["country", "code"], 
                   "link" : "coupled", 
                   "description" : "is the country code alpha-2 of"}},
              {"name": "relationships",
               "params": {
                   "fields" : ["region", "population"],
                   "link" : "derived",
                   "description" : "is the population of"}}
        ]
      }

## Implementation 

The analysis of relationships relies on vectorized (column-wise) processing of the Fields data, whereas Validata currently processes data row by row. The analysis must therefore be implemented in Validata as a global check, computed once over the whole table.
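
The idea of combining a table-wide analysis with a row-based validator can be sketched without any library. In the toy class below, `GlobalDerivedCheck` is a hypothetical name and is not part of Validata or frictionless. It transposes the rows into columns, computes the inconsistent rows once, and then answers each per-row call with a simple lookup. Only the "derived" case is handled, read here as "each value of the first field has a single value of the second field".

```python
from collections import defaultdict


class GlobalDerivedCheck:
    """Toy stand-in for a row-based check that relies on a table-wide analysis."""

    def __init__(self, rows, first, second):
        header, *data = rows
        # Transpose the row-oriented table into one tuple of values per field
        columns = dict(zip(header, zip(*data)))
        # Table-wide pass: distinct `second` values seen for each `first` value
        seen = defaultdict(set)
        for a, b in zip(columns[first], columns[second]):
            seen[a].add(b)
        # Keep the 0-based indices of the rows whose `first` value is ambiguous
        self.errors = {n for n, a in enumerate(columns[first]) if len(seen[a]) > 1}
        self.num_row = -1

    def validate_row(self, row):
        """Row hook: a lookup in the precomputed set of inconsistent rows."""
        self.num_row += 1
        if self.num_row in self.errors:
            # The header is row 1, so data row 0 is reported as row 2
            return f"row position {self.num_row + 2} is not consistent"
        return None


table = [
    ["country", "region", "code", "population"],
    ["France", "European Union", "FR", 449],
    ["Spain", "European Union", "ES", 48],
    ["Estonia", "European Union", "ES", 449],
    ["Nigeria", "Africa", "NI", 1460],
]

check = GlobalDerivedCheck(table, "region", "population")
notes = [check.validate_row(row) for row in table[1:]]
print(notes)
```

With this layout the cost of the relationship analysis is paid once at construction time, and the row hook stays cheap regardless of the table size.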

The implementation below is an example for discussion.

In [2]:
from typing import Optional, Type

import attrs
from frictionless import Check, Row
from frictionless.errors import RowError
from tab_dataset import Cdataset, Cfield
from typing_extensions import Self

class RelationshipError(RowError):
    title = None
    type = 'relationships'
    description = None
    template = "row position {rowNumber} is not consistent"

@attrs.define(kw_only=True, repr=False)
class Relationship(Check):
    """Check a Relationship between two fields"""

    type = "relationships"
    Errors = [RelationshipError]
    
    def __init__(self, **kwargs):
        super().__init__()
       
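    @classmethod
    def from_descriptor(cls, descriptor, resource=None):
        # Transpose the row-oriented resource (header row first) into columns
        res_t = list(map(list, zip(*resource)))
        # One Cfield per column: fld[0] is the field name, fld[1:] the values
        dts = Cdataset([Cfield(fld[1:], fld[0]) for fld in res_t])
        check = super().from_descriptor(descriptor)
        # Row counter, incremented in validate_row (the first data row is 0)
        check.__num_row = -1
        check.__relationship = descriptor
        # Table-wide analysis, done once: the rows that break the relationship
        check.__errors = dts.check_relation(descriptor['fields'][0],
                                            descriptor['fields'][1],
                                            descriptor['link'], value=False)
        return check

    def validate_row(self, row: Row):
        self.__num_row += 1
        # Report an error only for the rows flagged by the table-wide analysis
        if self.__num_row in self.__errors:
            note = 'cells "' + self.__relationship['fields'][0] + \
                   '" and "' + self.__relationship['fields'][1] + \
                   '" are not ' + self.__relationship['link'] + ' in this row'
            yield RelationshipError.from_row(row, note=note)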

    @classmethod
    def metadata_select_class(cls, type: Optional[str]) -> Type[Self]:
        return cls

    
    metadata_profile = {  # type: ignore
        "type": "object",
        "required": ["fields", "link"],
        "properties": {
            "link": {"type": "string"},
            "fields": {"type": "array"}}
    }

## Tests

The `validate` function detects inconsistencies in both relationships:

- between the "region" and "population" Fields (rows 2, 3 and 4)
- between the "country" and "code" Fields (rows 3 and 4)

In [5]:
from validata_core import validate

report = validate(countries, sch, resource=countries)
print(report.valid)
print(report.stats)
report.to_dict()['tasks'][0]['errors']

False


[{'type': 'relationships',
  'message': 'row position 2 is not consistent',
  'tags': ['#table', '#row'],
  'note': 'cells "region" and "population" are not derived in this row',
  'cells': ['France', 'European Union', 'FR', '449'],
  'rowNumber': 2,
  'code': 'relationships',
  'name': '',
  'rowPosition': 2},
 {'type': 'relationships',
  'message': 'row position 3 is not consistent',
  'tags': ['#table', '#row'],
  'note': 'cells "country" and "code" are not coupled in this row',
  'cells': ['Spain', 'European Union', 'ES', '48'],
  'rowNumber': 3,
  'code': 'relationships',
  'name': '',
  'rowPosition': 3},
 {'type': 'relationships',
  'message': 'row position 3 is not consistent',
  'tags': ['#table', '#row'],
  'note': 'cells "region" and "population" are not derived in this row',
  'cells': ['Spain', 'European Union', 'ES', '48'],
  'rowNumber': 3,
  'code': 'relationships',
  'name': '',
  'rowPosition': 3},
 {'type': 'relationships',
  'message': 'row position 4 is not consist

With the corrected values (Estonia code: EE, European Union population: 449), the same test does not detect any errors.

In [4]:
countries_2 = [['country', 'region',         'code', 'population'], 
               ['France',  'European Union', 'FR',    449        ], 
               ['Spain',   'European Union', 'ES',    449        ], 
               ['Estonia', 'European Union', 'EE',    449        ], 
               ['Nigeria', 'Africa',         'NI',    1460       ]
              ]
validate(countries_2, sch, resource=countries_2).valid

True