# Data Validation

By now we are used to the fact that in Python a name is just a label that we can "stick" on anything. The variable `x` can refer to a number and, three lines later, to a NumPy array or even a file object. We can add type hints to function parameters, but the interpreter does not enforce them; you can still pass something else.
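
A minimal sketch of that last point. The function `double` is a made-up example, not part of any library; it only shows that a type hint does not stop a call with the "wrong" type:

```python
def double(n: int) -> int:
    """The hint says `int`, but nothing checks it when the function is called."""
    return n * 2

number_result = double(21)    # an int, as the hint promises
text_result = double("ab")    # a str: the interpreter does not complain

print(number_result)
print(text_result)
```

The second call silently repeats the string instead of failing, which is exactly the kind of late-discovered error described in the rest of this section.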

Sometimes this is a blessing, sometimes a curse. It helps least when systems communicate over a fixed interface: if we receive data from another system or send data to one, we usually work with a strongly typed format.
If a single number arrives as text, everything breaks.

When data comes from elsewhere (an external system, a user), there is no guarantee that it arrives in the correct format, meaning that:
- every field is present,
- every value has the right type (e.g., a number is not sent as a string),
- every value is within its valid range (e.g., no negative count, no future birth date).

If there is no clear, type-described "contract" that specifies what must be passed, then:

* errors occur at runtime, often discovered late,
* it's hard to find the error,
* different parts of the system interpret the same data differently (think about all the different meanings of the multiplication or exponentiation symbols in Python!).

If we need strict types (and want to check them), our best friend will be the Pydantic library.

## Pydantic
Pydantic is a data validation and data modeling library built on Python type annotations.
So, you provide your types, and Pydantic checks, converts, and guarantees that the data moving in your system conforms to the rules you specified.

Pydantic is not a built-in package, so you need to install it:
```bash
pip install pydantic
```

In this Colab all important packages are already installed, including this one.


## Models

In [None]:
# every structure that Pydantic can validate derives from BaseModel
from pydantic import BaseModel
from datetime import date

class Person(BaseModel):
    name: str
    age: int
    birth_date: date | None = None

p = Person(name="Anna", age="30", birth_date="1995-05-10")
print(p)
print(p.age, type(p.age))
print(p.birth_date, type(p.birth_date))


What is happening here?
* `"30"` is automatically converted to `int`
* `"1995-05-10"` is converted to `date`
* If we pass nonsense (`age="thirty"`), we get a `ValidationError`
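
The failing case can be sketched as follows, assuming Pydantic v2. The model is repeated so the snippet runs on its own; `exc.errors()` returns a list of dictionaries describing each problem:

```python
from datetime import date
from typing import Optional

from pydantic import BaseModel, ValidationError

class Person(BaseModel):
    name: str
    age: int
    birth_date: Optional[date] = None

errors = []
try:
    Person(name="Anna", age="thirty")  # "thirty" cannot be parsed as an int
except ValidationError as exc:
    errors = exc.errors()
    for err in errors:
        # loc = which field failed, type = machine-readable error code
        print(err["loc"], err["type"], err["msg"])
```

The `loc` entry tells us that the `age` field is the culprit, which is handy when a model has many fields.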

Create a program that asks the user for these three data items separated by commas and validates them! Catch the ValidationError and print that the user provided bad data.

You can split text like this: `text.split(',')`.

In [None]:
# user data validation
...



## Required and optional fields, with default values

In [None]:
from pydantic import BaseModel
from typing import Optional

class SensorReading(BaseModel):
    id: int
    value: float
    unit: str = "C" # default value
    location: Optional[str] = None # optional

r1 = SensorReading(id=1, value=23.5)
r2 = SensorReading(id=2, value=18.2, unit="kPa", location="Lab 3")
r1,r2

(SensorReading(id=1, value=23.5, unit='C', location=None),
 SensorReading(id=2, value=18.2, unit='kPa', location='Lab 3'))
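
For contrast, here is a sketch of what happens when a required field is left out (assuming Pydantic v2, with the model repeated so the snippet is self-contained). Note that in Pydantic v2 it is the `= None` default that makes `location` omittable; `Optional[str]` by itself only allows the value `None`.

```python
from typing import Optional

from pydantic import BaseModel, ValidationError

class SensorReading(BaseModel):
    id: int
    value: float
    unit: str = "C"                  # default value
    location: Optional[str] = None   # may be omitted because of the default

missing = []
try:
    SensorReading(value=23.5)  # `id` is required but not given
except ValidationError as exc:
    # collect the locations of all "missing field" errors
    missing = [err["loc"] for err in exc.errors() if err["type"] == "missing"]

print(missing)
```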

## Complex structures

Interfaces commonly exchange JSON that contains nested objects (with further nested objects inside them, and so on).

In [None]:
from pydantic import BaseModel
from typing import List

# a point type: a pair of floating-point numbers
class Point(BaseModel):
    x: float
    y: float

class Measurement(BaseModel):
    id: int
    points: List[Point]
    description: str | None = None # either text or none

# incoming data (validation).
# try making an error in it: e.g., use z instead of y or wrong data!
m = Measurement(
    id=42,
    points=[
        {"x": 0, "y": 1.5},
        {"x": 2.3, "y": -0.4},
    ],
)

m
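
If we make the suggested mistake and write `z` instead of `y`, the error report points to the exact place inside the nested structure. A minimal sketch, assuming Pydantic v2 with its default settings (unknown keys such as `z` are ignored by default, so only the missing `y` is reported):

```python
from typing import List, Optional

from pydantic import BaseModel, ValidationError

class Point(BaseModel):
    x: float
    y: float

class Measurement(BaseModel):
    id: int
    points: List[Point]
    description: Optional[str] = None

locations = []
try:
    Measurement(
        id=42,
        points=[
            {"x": 0, "y": 1.5},
            {"x": 2.3, "z": -0.4},  # mistake: "z" instead of "y"
        ],
    )
except ValidationError as exc:
    # each location is a path: field name, list index, nested field name
    locations = [err["loc"] for err in exc.errors()]

print(locations)
```

The path reads as "field `points`, item with index 1, field `y`".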

But what if a field does not have one fixed type, but may be either one type or another? For example, a user identifier might be an `int`, but it could also be a `str`!

Well, we simply declare it as a union of the alternative Python types:

In [None]:
class User(BaseModel):
    id: str | int

User(id=12), User(id="jozsi")

*Extra info*: programmers (and mathematicians) call this a 'sum type' because the possible values are the sum of both type sets (all str + all int).

In contrast, if you store both, for example:
```python
class User(BaseModel):
    id:int
    name:str
```
That would be a "product type" because its possible values are the Cartesian product of the two value sets (every `int` paired with every `str`).
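
Returning to the sum type: the sketch below shows which alternative Pydantic keeps for a few inputs. It assumes Pydantic v2 and its default ("smart") union handling, and redefines `User` with a single `id` field, written as `Union[str, int]`, which is equivalent to `str | int`:

```python
from typing import Union

from pydantic import BaseModel

class User(BaseModel):
    id: Union[str, int]  # same meaning as `str | int`

numeric = User(id=12)       # stays an int
textual = User(id="jozsi")  # stays a str
digits = User(id="12")      # already a valid str, so it is not converted to int

for user in (numeric, textual, digits):
    print(repr(user.id), type(user.id).__name__)
```

In other words, a value that already matches one of the alternatives exactly is kept as it is.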


## Validation rules

Often it is not enough for us that the type is correct; we want to precisely specify the valid range (min/max, length, etc.).

In this case we can use `Field`, which lets us specify all of this!

In [None]:
from pydantic import BaseModel, Field

class Product(BaseModel):
    name: str = Field(min_length=1, max_length=100)
    price: float = Field(gt=0) # > 0, gt = greater than
    quantity: int = Field(ge=0) # >= 0, ge = greater than or equal

Product(name="Cement", price=12.5, quantity=100)
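
To see the rules in action, here is a sketch that breaks all three constraints at once (assuming Pydantic v2; the model is repeated so the snippet runs on its own). Pydantic reports every violated rule, not just the first one:

```python
from pydantic import BaseModel, Field, ValidationError

class Product(BaseModel):
    name: str = Field(min_length=1, max_length=100)
    price: float = Field(gt=0)
    quantity: int = Field(ge=0)

violations = {}
try:
    Product(name="", price=0, quantity=-1)  # empty name, zero price, negative quantity
except ValidationError as exc:
    # map each failing field to the error code of the broken rule
    violations = {err["loc"][0]: err["type"] for err in exc.errors()}

print(violations)
```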

In [None]:
# if only certain product names are allowed, we can use a Literal type
# (a fixed set of permitted values)
from typing import Literal

ProductType = Literal["Cement", "Steel"]

class Product(BaseModel):
    name: ProductType
    price: float = Field(gt=0)
    quantity: int = Field(ge=0)

Product(name="Steel", price=12.5, quantity=100)
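
A name outside the permitted set is rejected. A short sketch, assuming Pydantic v2, with `"Wood"` as a made-up invalid value:

```python
from typing import Literal

from pydantic import BaseModel, Field, ValidationError

ProductType = Literal["Cement", "Steel"]

class Product(BaseModel):
    name: ProductType
    price: float = Field(gt=0)
    quantity: int = Field(ge=0)

error_types = []
try:
    Product(name="Wood", price=12.5, quantity=100)  # "Wood" is not in ProductType
except ValidationError as exc:
    error_types = [err["type"] for err in exc.errors()]
    print(exc)  # the message lists the values that would have been accepted

print(error_types)
```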

## Data input/output, communication with JSON

Suppose we work with the earlier strictly defined specification (interface) and receive the following data (from a user, API call, other system):

In [None]:
class Product(BaseModel):
    name: str = Field(min_length=1, max_length=100)
    price: float = Field(gt=0, le=10000)
    quantity: int = Field(ge=0)

incoming = {
    "name": "Cement",
    "price": "12.5",
    "quantity": "100"
}

In [None]:
# because this is already a Python structure, the case is super simple:
p = Product(**incoming)

# and we can use it:
p.name, p.price, p.quantity

('Cement', 12.5, 100)

Unfortunately, data usually does not come to us as Python code, but in another format, e.g., CSV or, even more often, JSON.
But that is not a problem either!


In [None]:
# this is a text, e.g., coming from a file or network:
json_data = '{"name": "Cement", "price": "12.5", "quantity": "100"}'

In [None]:
# validate the data:
Product.model_validate_json(json_data)
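
The opposite direction (output) works in a similar way. The sketch below, assuming Pydantic v2, turns a validated model back into a plain dictionary with `model_dump()` and into JSON text with `model_dump_json()`:

```python
from pydantic import BaseModel, Field

class Product(BaseModel):
    name: str = Field(min_length=1, max_length=100)
    price: float = Field(gt=0, le=10000)
    quantity: int = Field(ge=0)

p = Product.model_validate_json(
    '{"name": "Cement", "price": "12.5", "quantity": "100"}'
)

as_dict = p.model_dump()        # plain Python dict with converted values
as_json = p.model_dump_json()   # JSON text, ready to send or save

print(as_dict)
print(as_json)

# reading the JSON back gives an equal model
round_trip = Product.model_validate_json(as_json)
print(round_trip == p)
```

Note that the exported values are real numbers, not the strings that originally came in.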

But it can also easily happen that we have many such "Product" data items, for example in a list! And we want to validate all of them.

By default, if any item is invalid, validating the whole list fails with a ValidationError, because Pydantic (as its name suggests) is super strict!

In [None]:
be_JSON = """
 [
 {"name": "Cement", "price": "12.5", "quantity": "100"},
 {"name": "Steel", "price": "22.1", "quantity": "-80"},
 {"name": "Cement", "price": "11.5", "quantity": "72"}
 ]
"""
# the second item has an invalid quantity (it is negative)!

Here the `TypeAdapter` helps us: it wraps an ordinary Python type (e.g., a `list` of Pydantic models) so that it can be validated and serialized just like a model.

A `TypeAdapter` is not itself a Pydantic type (you cannot use it in other contexts), it just helps us easily create a "list reader" or "list writer".
To skip bad items we use the `OnErrorOmit` type (this really is a type): when an item wrapped in it fails validation, no error is raised and the item is simply left out.


In [None]:
from pydantic import OnErrorOmit, TypeAdapter

# create a Python list reader that expects Product items:
adapter = TypeAdapter(list[OnErrorOmit[Product]])

# and now we can read it (validated!)
products = adapter.validate_json(be_JSON)

products

[Product(name='Cement', price=12.5, quantity=100),
 Product(name='Cement', price=11.5, quantity=72)]
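
Silently dropping items is convenient, but sometimes we also want to know what was dropped. One possible approach, sketched here under the assumption of Pydantic v2, is to validate the same text with a plain `list[Product]` adapter (without `OnErrorOmit`) and inspect the error report. The names `raw` and `strict_adapter` are only illustrative:

```python
from pydantic import BaseModel, Field, TypeAdapter, ValidationError

class Product(BaseModel):
    name: str = Field(min_length=1, max_length=100)
    price: float = Field(gt=0, le=10000)
    quantity: int = Field(ge=0)

raw = """
[
  {"name": "Cement", "price": "12.5", "quantity": "100"},
  {"name": "Steel", "price": "22.1", "quantity": "-80"},
  {"name": "Cement", "price": "11.5", "quantity": "72"}
]
"""

# no OnErrorOmit here: one bad item makes the whole list fail
strict_adapter = TypeAdapter(list[Product])

rejected = []
try:
    strict_adapter.validate_json(raw)
except ValidationError as exc:
    # loc starts with the list index of the offending item
    rejected = [(err["loc"], err["type"]) for err in exc.errors()]

print(rejected)
```

The first element of each location is the index of the bad item in the list, so here it points at the "Steel" entry.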