# Validation

The following is an introduction to basic validation in datachef and usage of the built-in `validators` module.

Note - while `validators` is useful for getting up and running with validation, it's more important to understand the concepts here so you can implement something appropriate for your own use cases.

## Source Data

The data source we're using for these examples is shown below:

| <span style="color:green">Note - this particular table has some very verbose headers we don't care about, so we'll be using `bounded=` to remove them from the previews as well as to show just the subset of data we're working with.</span>|
|-----------------------------------------|

The [full data source can be downloaded here](https://github.com/mikeAdamss/datachef/raw/main/tests/fixtures/xlsx/ons-oic.xlsx). We'll be using the 10th tab, named "Table 3c".

In [1]:
from typing import List
from datachef import acquire, preview, XlsxSelectable

tables: List[XlsxSelectable] = acquire.xlsx.http("https://github.com/mikeAdamss/datachef/raw/main/tests/fixtures/xlsx/ons-oic.xlsx")
preview(tables[9], bounded="A4:H10")

| | A | B | C | D | E | F | G | H |
|---|---|---|---|---|---|---|---|---|
| 4 | Percentage change 3 months on previous 3 months | | | | | | | |
| 5 | Time period | Public new housing | Private new housing | Total new housing | Infrastructure new work | Public other new work | Private industrial new work | Private commercial new work |
| 6 | Dataset identifier code | MVO6 | MVO7 | MVO8 | MVO9 | MVP2 | MVP3 | MVP4 |
| 7 | Jun 2010 | 5.6 | 9.8 | 8.8 | 3 | 4.3 | 3.7 | 1.9 |
| 8 | Jul 2010 | 2 | 5.6 | 4.8 | 0.2 | -0.2 | 9.7 | 3.5 |
| 9 | Aug 2010 | 5.5 | 4.5 | 4.7 | -2.9 | -2.9 | 24.4 | 5.9 |
| 10 | Sep 2010 | 11.7 | 7.5 | 8.5 | -6.8 | -3.3 | 16.1 | 5.3 |


## Column: the validate= keyword

It's important to understand how Column validation works:

- The `Column` class has a keyword argument of `validate=`.
- This keyword expects a callable (a function, lambda function or callable class).
- Whatever callable is passed will _typically_ raise an exception when it is given a `Cell` object that is not valid for the column being populated.

_Note - I say "typically" here as validators are intended to be highly customised and defined in large part by the user base. Raising an immediate exception is the simplest thing you could do, but by no means the only thing you could do._
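
As a rough illustration of the three callable shapes listed above, the sketch below uses a stand-in `FakeCell` dataclass. It is hypothetical and not part of datachef; it models only the `value` attribute that the examples in this document rely on. The validator names are likewise invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class FakeCell:
    """Hypothetical stand-in for a datachef Cell; only .value is modelled."""
    value: str


# 1. A plain function that raises when the cell is not valid.
def starts_with_m(cell: FakeCell) -> None:
    assert cell.value.startswith("M"), f'"{cell.value}" does not start with M'


# 2. A lambda. It cannot contain a raise statement, so this one returns a bool.
is_four_chars = lambda cell: len(cell.value) == 4


# 3. A callable class, handy when the validator needs some configuration.
class OneOf:
    def __init__(self, allowed):
        self.allowed = set(allowed)

    def __call__(self, cell: FakeCell) -> None:
        if cell.value not in self.allowed:
            raise ValueError(f'"{cell.value}" is not one of {sorted(self.allowed)}')


good = FakeCell("MVO6")
starts_with_m(good)            # passes silently
print(is_four_chars(good))     # prints True
OneOf(["MVO6", "MVO7"])(good)  # passes silently
```

How a column treats a returned `False` (as opposed to a raised exception) is not something this sketch assumes; check that against the library before relying on bool-returning validators.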

## Simple Regex Validation

For this example we're going to use the `valid` module imported in the code below. This has a simple regex validator that works in exactly the way explained above.

i.e. `valid.regex` is just a convenience; you could define this yourself (and we will in a later example).

So the following example is valid for the regex provided:

In [2]:
from typing import List
from datachef import acquire, XlsxSelectable, valid, TidyData, Column, down, right

tables: List[XlsxSelectable] = acquire.xlsx.http("https://github.com/mikeAdamss/datachef/raw/main/tests/fixtures/xlsx/ons-oic.xlsx")
table = tables[9]

observations = table.excel_ref("B7:H10").label_as("Observations")
dataset_identifier_code = table.excel_ref("B6").expand(right).label_as("Dataset Identifier Codes")

# Note: matches a regex of capital M followed by anything
tidy_data = TidyData(
    observations,
    Column(dataset_identifier_code.finds_observations_directly(down), validate=valid.regex("M.*"))
)

Runs without error.

Whereas in the following example we change the regex so the data no longer matches that which is expected - note the exception.

In [3]:
from typing import List
from datachef import acquire, XlsxSelectable, valid, TidyData, Column, down, right

tables: List[XlsxSelectable] = acquire.xlsx.http("https://github.com/mikeAdamss/datachef/raw/main/tests/fixtures/xlsx/ons-oic.xlsx")
table = tables[9]

observations = table.excel_ref("B7:H10").label_as("Observations")
dataset_identifier_code = table.excel_ref("B6").expand(right).label_as("Dataset Identifier Codes")

# Note: matches a regex of capital Z followed by anything
tidy_data = TidyData(
    observations,
    Column(dataset_identifier_code.finds_observations_directly(down), validate=valid.regex("Z.*"))
)

print(tidy_data)

AssertionError: Value of cell "<B6, value:"MVO6", x:1, y:5>" does not match provided regex pattern "Z.*".
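
For readers who want to see roughly what such a regex validator involves, here is a minimal hand-rolled sketch. It is not the library's implementation: `regex_validator` and `FakeCell` are hypothetical names, and the use of `re.match` (which anchors only at the start of the string) is an assumption rather than a statement about how `valid.regex` behaves.

```python
import re
from dataclasses import dataclass


@dataclass
class FakeCell:
    """Hypothetical stand-in for a datachef Cell; only .value is modelled."""
    value: str


def regex_validator(pattern: str):
    """Return a validator that raises if a cell's value does not match pattern."""
    compiled = re.compile(pattern)

    def validate(cell: FakeCell) -> None:
        # re.match only anchors at the start of the string.
        assert compiled.match(cell.value) is not None, (
            f'Value "{cell.value}" does not match provided regex pattern "{pattern}".'
        )

    return validate


regex_validator("M.*")(FakeCell("MVO6"))  # passes silently

try:
    regex_validator("Z.*")(FakeCell("MVO6"))
except AssertionError as err:
    print(err)  # the message names the offending value and the pattern
```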

## A Note on Lazy Evaluation

One thing you may notice about the above is that the validation error does not occur until we try to print the `tidy_data` variable. This is because the `TidyData` class uses _lazy evaluation_.

Simply put, this means the tidy data is not extracted until the last possible moment that it has to be, in this case when the user goes to print the results.
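
The general pattern can be sketched in a few lines of plain Python. `LazyColumnData` below is a hypothetical toy class, not how `TidyData` is actually written; it only shows why constructing the object is error-free while printing it is not.

```python
class LazyColumnData:
    """Hypothetical toy class: holds values and a validator, checks only on demand."""

    def __init__(self, values, validate):
        self._values = list(values)
        self._validate = validate
        self._rows = None  # nothing has been extracted or validated yet

    def _evaluate(self):
        # Run the validator once, the first time the rows are needed.
        if self._rows is None:
            for value in self._values:
                self._validate(value)  # may raise
            self._rows = self._values
        return self._rows

    def __str__(self):
        # Printing forces evaluation, so this is where errors surface.
        return "\n".join(self._evaluate())


def must_start_with_z(value: str) -> None:
    assert value.startswith("Z"), f'"{value}" does not start with Z'


data = LazyColumnData(["MVO6", "MVO7"], must_start_with_z)  # no error here

try:
    print(data)  # the validator only runs now
except AssertionError as err:
    print(f"Raised on print: {err}")
```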

## A Custom Validation Example

For this example we're going to create a custom validation callable.

In [None]:
from typing import List
from datachef.models.source.cell import Cell
from datachef import acquire, XlsxSelectable, TidyData, Column, down, right

def code_validator(cell: Cell):
    """
    Custom validator that validates that a code is 4 characters
    long and begins with a capital M
    """
    if len(cell.value) == 4 and cell.value[0] == "M":
        return True
    return False

tables: List[XlsxSelectable] = acquire.xlsx.http("https://github.com/mikeAdamss/datachef/raw/main/tests/fixtures/xlsx/ons-oic.xlsx")
table = tables[9]

observations = table.excel_ref("B7:H10").label_as("Observations")
dataset_identifier_code = table.excel_ref("B6").expand(right).label_as("Dataset Identifier Codes")

# Note: each code is checked by the custom callable defined above
tidy_data = TidyData(
    observations,
    Column(dataset_identifier_code.finds_observations_directly(down), validate=code_validator)
)

print(tidy_data)


## Validating Against An External Source Of Truth

The most powerful form of validation is to compare values to an external source of truth, be that a schema, a codelist, an API or pretty much any "master list" of acceptable values.

For this example we're going to use a simple json file.

First we create the json file and write it to `./validation.json`.

In [None]:
import json

# Note: I've only populated Housing, but you could specify multiple
# columns this way.
valid = {
    "Housing": [
        "Public new housing",
        "Private new housing",
        "Total new housing",
        "Infrastructure new work",
        "Public other new work",
        "Private industrial new work",
        "Private commercial new work"
        ]
}

# Open in write mode so json.dump can create the file
with open("./validation.json", "w") as f:
    json.dump(valid, f)


Now that we have our json file, here's one implementation of how you could use it for Column value validation.

In [None]:
import json
from typing import List
from datachef.models.source.cell import Cell
from datachef import acquire, XlsxSelectable, TidyData, Column, down, right

class Validator:

    def __init__(self, column: str):
        """
        Use the column name to get a list of valid values for
        this column
        """
        with open("./validation.json") as f:
            validation_dict = json.load(f)
        self.valid_values = validation_dict[column]

    def __call__(self, cell: Cell):
        """
        It's valid if the cell value is in the list of valid
        values for this column
        """
        assert cell.value in self.valid_values, (f'''
            Cell value {cell.value} not in list of
            valid values: {self.valid_values}
        ''')

tables: List[XlsxSelectable] = acquire.xlsx.http("https://github.com/mikeAdamss/datachef/raw/main/tests/fixtures/xlsx/ons-oic.xlsx")
table = tables[9]

observations = table.excel_ref("B7:H10").label_as("Observations")
housing = table.excel_ref("B5").expand(right).label_as("Housing")

tidy_data = TidyData(
    observations,
    Column(housing.finds_observations_directly(down), validate=Validator("Housing"))
)

print(tidy_data)


As a final example, we'll write a second validation json file that contains only two of the valid values and see what happens. An exception is raised when we go to print and the Column gets evaluated, because the column now contains values that fail the assertion in `Validator("Housing")`.

In [None]:
import json

# Note: I've only populated Housing, but you could specify multiple
# columns this way.
valid = {
    "Housing": [
        "Public new housing",
        "Private new housing"
        ]
}

# Open in write mode so json.dump can create the file
with open("./validation-dict2.json", "w") as f:
    json.dump(valid, f)


In [None]:
import json
from typing import List
from datachef.models.source.cell import Cell
from datachef import acquire, XlsxSelectable, TidyData, Column, down, right

class Validator:

    def __init__(self, column: str):
        """
        Use the column name to get a list of valid values for
        this column
        """
        with open("./validation-dict2.json") as f:
            validation_dict = json.load(f)
        self.valid_values = validation_dict[column]

    def __call__(self, cell: Cell):
        """
        It's valid if the cell value is in the list of valid
        values for this column
        """
        assert cell.value in self.valid_values, (f'''
            Cell value {cell.value} not in list of
            valid values: {self.valid_values}
        ''')

tables: List[XlsxSelectable] = acquire.xlsx.http("https://github.com/mikeAdamss/datachef/raw/main/tests/fixtures/xlsx/ons-oic.xlsx")
table = tables[9]

observations = table.excel_ref("B7:H10").label_as("Observations")
housing = table.excel_ref("B5").expand(right).label_as("Housing")

tidy_data = TidyData(
    observations,
    Column(housing.finds_observations_directly(down), validate=Validator("Housing"))
)

print(tidy_data)
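
The earlier note on validators points out that raising immediately is not the only option. As one possible alternative, the sketch below records every invalid value and reports them together at the end. `CollectingValidator` and `FakeCell` are hypothetical names, and the validator is driven by a plain loop here rather than by a `Column`, since how datachef handles a non-raising validator is not assumed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FakeCell:
    """Hypothetical stand-in for a datachef Cell; only .value is modelled."""
    value: str


class CollectingValidator:
    """Hypothetical validator that records invalid values instead of raising at once."""

    def __init__(self, valid_values: List[str]):
        self.valid_values = set(valid_values)
        self.invalid: List[str] = []

    def __call__(self, cell: FakeCell) -> bool:
        ok = cell.value in self.valid_values
        if not ok:
            self.invalid.append(cell.value)
        return ok

    def raise_if_invalid(self) -> None:
        # Report every problem in one go rather than stopping at the first.
        if self.invalid:
            raise ValueError(f"Invalid values found: {self.invalid}")


validator = CollectingValidator(["Public new housing", "Private new housing"])
for text in ["Public new housing", "Total new housing", "Infrastructure new work"]:
    validator(FakeCell(text))

print(validator.invalid)  # the two values missing from the valid list
```

Calling `validator.raise_if_invalid()` afterwards would raise a single `ValueError` listing both offending values.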
