# Relationships descriptor

This notebook presents an example implementation of the `relationships` descriptor proposed in the [pattern](https://github.com/frictionlessdata/specs/pull/859) documentation.

## Example

The chosen example is the one defined in the pattern:

    
| country | region         | code  | population |
|---------|----------------|-------|------------|
| France  | European Union | FR    | 449        |
| Spain   | European Union | ES    | 48         |
| Estonia | European Union | ES    | 449        |
| Nigeria | Africa         | NI    | 1460       |
    
The data schema for this dataset is:

  ```json
  {"fields": [ 
      {"name": "country", "type": "string"},
      {"name": "region", "type": "string"},
      {"name": "code", "type": "string", "description": "country code alpha-2"},
      {"name": "population", "type": "integer", "description": "region population in 2022 (millions)"}]
  }
  ```
If we now look at the data we see that this dataset is not consistent because it contains two structural errors:

* The value of the "code" Field must be unique for each country, so "Spain" and "Estonia" cannot both have "ES".
* The "population" Field cannot have two different values (449 and 48) for the same region, "European Union".

These structural errors make the data unusable, yet validation of the dataset does not detect them: the current version of Table Schema has no descriptor to express this kind of dependency between two fields.

The pattern proposes adding a `relationships` descriptor to check such relationships:

  ```json
  { "fields": [ ... ],
    "relationships": [
      { "fields" : [ "country", "code"],
        "description" : "is the country code alpha-2 of",
        "link" : "coupled"
      },
      { "fields" : [ "region", "population"],
        "description" : "is the population of",
        "link" : "derived"}
    ]
  }
  ```
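
To make the intent of the two `link` values concrete, here is a minimal, library-free sketch. The semantics are an assumption inferred from the two errors described above, not a quote from the pattern: "derived" is read as "the first field determines the second", and "coupled" as "each field determines the other". The names `inconsistent_rows` and `_ambiguous` are hypothetical helpers introduced for this illustration only.

```python
from collections import defaultdict

# Same rows as the example table (header removed).
ROWS = [
    {"country": "France",  "region": "European Union", "code": "FR", "population": 449},
    {"country": "Spain",   "region": "European Union", "code": "ES", "population": 48},
    {"country": "Estonia", "region": "European Union", "code": "ES", "population": 449},
    {"country": "Nigeria", "region": "Africa",         "code": "NI", "population": 1460},
]


def _ambiguous(rows, key, other):
    """Indexes of rows whose `key` value is paired with several `other` values."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[key]].add(row[other])
    return {i for i, row in enumerate(rows) if len(seen[row[key]]) > 1}


def inconsistent_rows(rows, relationship):
    """Return the sorted 0-based indexes of the rows that break `relationship`.

    Assumed semantics (an interpretation, not taken from the pattern text):
    - "derived": the first field determines the second one;
    - "coupled": each field determines the other one.
    """
    first, second = relationship["fields"]
    link = relationship["link"]
    if link not in ("coupled", "derived"):
        raise ValueError(f"unsupported link: {link!r}")
    bad = _ambiguous(rows, first, second)
    if link == "coupled":
        bad |= _ambiguous(rows, second, first)
    return sorted(bad)


coupled = {"fields": ["country", "code"], "link": "coupled"}
derived = {"fields": ["region", "population"], "link": "derived"}
print(inconsistent_rows(ROWS, coupled))  # [1, 2]: Spain and Estonia share "ES"
print(inconsistent_rows(ROWS, derived))  # [0, 1, 2]: European Union has 449 and 48
```

Every row of a conflicting group is flagged, because the data alone cannot tell which of the conflicting values is the wrong one.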
 

In [1]:
from frictionless import Resource, Schema

countries = Resource(data=[  ['country', 'region',         'code', 'population'], 
                             ['France',  'European Union', 'FR',    449        ], 
                             ['Spain',   'European Union', 'ES',    48         ], 
                             ['Estonia', 'European Union', 'ES',    449        ], 
                             ['Nigeria', 'Africa',         'NI',    1460       ]])
sch = {"fields": [
          {"name": "country", "type": "string"},
          {"name": "region", "type": "string"},
          {"name": "code", "type": "string", "description": "country code alpha-2"},
          {"name": "population", "type": "integer", "description": "region population in 2022 (millions)"}],
       "relationships": [
          { "fields" : [ "country", "code"], "link" : "coupled", "description" : "is the country code alpha-2 of"},
          { "fields" : [ "region", "population"], "link" : "derived", "description" : "is the population of"}]}

countries.schema = Schema.from_descriptor(sch)

## Implementation 

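The relationship analysis relies on vectorized (column-wise) processing of the Field data, whereas Table Schema validation currently works row by row. The analysis therefore has to be implemented as a global check on the whole table: in the example below, the inconsistent rows are computed once when the check is created and are then reported row by row.

The implementation below is an example for discussion (I am not a Table Schema expert).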
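
The gap between the two models can be illustrated without any library: the analysis needs whole columns, while a validator hands over rows one at a time. The following is a minimal sketch on a reduced table; `transpose`, `find_bad_rows` and `RowWiseCheck` are hypothetical names for this illustration and are not part of frictionless or tab_dataset. The column-wise analysis runs once up front, and the per-row hook only looks up its result.

```python
def transpose(table):
    """Turn a list of rows (header first) into a {field name: column values} dict."""
    header, *rows = table
    return {name: list(column) for name, column in zip(header, zip(*rows))}


def find_bad_rows(columns, first, second):
    """Hypothetical column-wise analysis: 0-based indexes of the rows where a
    value of `first` appears with more than one value of `second`."""
    values = {}
    for a, b in zip(columns[first], columns[second]):
        values.setdefault(a, set()).add(b)
    return {i for i, a in enumerate(columns[first]) if len(values[a]) > 1}


class RowWiseCheck:
    """Hypothetical stand-in for a row-based check."""

    def __init__(self, table, first, second):
        # Global step: needs the whole table, so it runs before any row is seen.
        self.bad_rows = find_bad_rows(transpose(table), first, second)
        self.position = -1

    def validate_row(self, row):
        # Local step: called once per data row, in table order.
        self.position += 1
        if self.position in self.bad_rows:
            yield f"row {self.position} is not consistent: {row}"


table = [
    ["region", "population"],
    ["European Union", 449],
    ["European Union", 48],
    ["Africa", 1460],
]
check = RowWiseCheck(table, "region", "population")
messages = [msg for row in table[1:] for msg in check.validate_row(row)]
print(messages)
```

The design relies on the validator visiting the data rows in the same order as the table that was analysed; the position counter is the only link between the two steps.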

In [6]:
import attrs
import frictionless
from frictionless import Check, Row
from frictionless.errors import RowError
from tab_dataset import Cdataset, Cfield

def validate(resource):
    checks = [Relationship(resource, desc) for desc in resource.schema.custom['relationships']]
    return frictionless.validate(resource, checks=checks)
    
class RelationshipError(RowError):
    title = None
    type = 'Relationship'
    description = None
    template = "row position {rowNumber} is not consistent"

@attrs.define(kw_only=True, repr=False)
class Relationship(Check):
    """Check a Relationship between two fields"""

    Errors = [RelationshipError]
    
    def __init__(self, resource, descriptor):
        
        super().__init__()
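        # transpose the rows into columns; the first item of each column is the field name
        res_t = list(map(list, zip(*resource.read_data())))
        dts = Cdataset([Cfield(fld[1:], fld[0]) for fld in res_t])
        # position of the current data row (0-based), incremented in validate_row
        self.__num_row = -1
        self.__relationship = descriptor
        # positions of the rows that do not respect the relationship
        self.__errors = dts.check_relation(descriptor['fields'][0],
                                          descriptor['fields'][1],
                                          descriptor['link'], value=False)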
        
    def validate_row(self, row: Row):
        self.__num_row += 1
        if self.__num_row in self.__errors:
            note = 'cells "' + self.__relationship['fields'][0] + \
                   '" and "' + self.__relationship['fields'][1] + \
                   '" are not ' + self.__relationship['link'] + ' in this row'
            yield RelationshipError.from_row(row, note=note)

## Tests
The `validate` function detects errors in both relationships (five row errors in total):

- between "region" and "population" Fields (rows 2, 3 and 4)
- between "country" and "code" Fields (rows 3 and 4)

In [7]:
validate(countries)

{'valid': False,
 'errors': [],
 'tasks': [{'name': 'memory',
            'type': 'table',
            'valid': False,
            'place': '<memory>',
            'labels': ['country', 'region', 'code', 'population'],
            'stats': {'errors': 5,
                      'seconds': 0.006,
                      'fields': 4,
                      'rows': 4},
            'errors': [{'type': 'Relationship',
                        'message': 'row position 2 is not consistent',
                        'tags': ['#table', '#row'],
                        'note': 'cells "region" and "population" are not '
                                'derived in this row',
                        'cells': ['France', 'European Union', 'FR', '449'],
                        'rowNumber': 2},
                       {'type': 'Relationship',
                        'message': 'row position 3 is not consistent',
                        'tags': ['#table', '#row'],
                        'note': 'cells "country"

The test with correct values (Estonia code: EE, European Union population: 449) detects no errors.

In [8]:
countries_2 = Resource(data=[['country', 'region',         'code', 'population'], 
                             ['France',  'European Union', 'FR',    449        ], 
                             ['Spain',   'European Union', 'ES',    449        ], 
                             ['Estonia', 'European Union', 'EE',    449        ], 
                             ['Nigeria', 'Africa',         'NI',    1460       ]])
countries_2.schema = Schema.from_descriptor(sch)
validate(countries_2)

{'valid': True,
 'errors': [],
 'tasks': [{'name': 'memory',
            'type': 'table',
            'valid': True,
            'place': '<memory>',
            'labels': ['country', 'region', 'code', 'population'],
            'stats': {'errors': 0,
                      'seconds': 0.003,
                      'fields': 4,
                      'rows': 4},
            'errors': []}]}
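
Reports like the ones above are easier to compare with the expected rows once the errors are grouped by note. The sketch below assumes the report is available as a plain dict with the `tasks` / `errors` / `note` / `rowNumber` keys visible in the printed output. `rows_by_note` is a hypothetical helper, and `sample` is a hand-written dict that mirrors the rows listed in the Tests section; it is not a captured frictionless output.

```python
from collections import defaultdict


def rows_by_note(report):
    """Map each error note to the sorted row numbers it was raised for."""
    grouped = defaultdict(list)
    for task in report.get("tasks", []):
        for error in task.get("errors", []):
            if "rowNumber" in error:  # skip errors that are not tied to a row
                grouped[error.get("note", "")].append(error["rowNumber"])
    return {note: sorted(rows) for note, rows in grouped.items()}


DERIVED = 'cells "region" and "population" are not derived in this row'
COUPLED = 'cells "country" and "code" are not coupled in this row'

# Hand-written sample with the same shape as the failing report (assumption).
sample = {
    "valid": False,
    "tasks": [{"errors": [
        {"type": "Relationship", "note": DERIVED, "rowNumber": 2},
        {"type": "Relationship", "note": COUPLED, "rowNumber": 3},
        {"type": "Relationship", "note": DERIVED, "rowNumber": 3},
        {"type": "Relationship", "note": COUPLED, "rowNumber": 4},
        {"type": "Relationship", "note": DERIVED, "rowNumber": 4},
    ]}],
}

for note, rows in rows_by_note(sample).items():
    print(rows, note)
```

A report without errors, such as the second one above, yields an empty dict.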