<a href="https://colab.research.google.com/github/goteguru/kmooc_python/blob/main/notebooks/en/kmooc_11_2_pydantic_en.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data validation

In Python we are used to "sticking" labels on things however we like: `x` can be a number, and a few lines later a NumPy array or even a file object. We can add type hints to function parameters, but the interpreter does not enforce them, so you can still pass something else.
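
A minimal sketch of what "not enforced" means in practice. The function `double` is a hypothetical example, not part of any library: its hint says `int`, yet Python happily accepts a string and silently does something different.

```python
def double(x: int) -> int:
    # the hint promises an int, but nothing checks it at runtime
    return x * 2

number_result = double(21)    # multiplication of numbers
text_result = double("ab")    # repetition of a string, no error raised

print(number_result)
print(text_result)
```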

This is sometimes a blessing, sometimes a curse. One common case where it doesn't help at all is standardized communication. When we receive data from another system or send data to it, we usually work with a strictly defined format. If a single number arrives as text, everything can break.

If data comes from elsewhere (an external system, a user), there is no guarantee that it is in the right format, i.e. that:
- every field arrives,
- every field arrives with the correct type (not a string instead of a number, etc.),
- every value is in a valid range (no negative counts, no birth date in the future, etc.).

If there is no clear, type-described "contract" that prescribes what must be provided, then:

* errors are discovered at runtime, often late,
* the bug is hard to find,
* different parts of the system interpret the same data differently (think about how many meanings multiplication or exponentiation can have in Python!).


If you need strict types (and want to check them), our best friend will be the Pydantic library.

## Pydantic
Pydantic is a data validation and modelling library built on Python's type annotations. (Sounds good, right?) You give it your types, and Pydantic checks, transforms, and guarantees that the data moving through your system conforms to the rules you specified. It won't be smaller, larger, longer, or in any other way different.

Pydantic is not a built-in package, so you need to install it:
```bash
pip install pydantic
```

Colab comes with the most important packages preinstalled, and Pydantic is among them.


## Models

In [None]:
# pydantic derives all "validatable" structures from BaseModel
from pydantic import BaseModel
from datetime import date

class Person(BaseModel):
    name: str
    age: int
    birth_date: date | None = None

p = Person(name="Anna", age="30", birth_date="1995-05-10")
print(p)
print(p.age, type(p.age))
print(p.birth_date, type(p.birth_date))


What is happening here?
* `"30"` is automatically converted to an `int`.
* `"1995-05-10"` becomes a `date`.
* If we pass nonsense (`age="thirty"`), we get a `ValidationError`.

Write a program that asks the user for these three values, separated by commas, and validates them! Catch the `ValidationError` exception and print a message saying that the user provided invalid data.

You can split the text like this: `text.split(',')`.

In [None]:
# user data validation
...



## Required and optional fields, with default values

In [None]:
from pydantic import BaseModel
from typing import Optional

class SensorReading(BaseModel):
    id: int
    value: float
    unit: str = "C"  # default value
    location: Optional[str] = None  # optional

r1 = SensorReading(id=1, value=23.5)
r2 = SensorReading(id=2, value=18.2, unit="kPa", location="Lab 3")
r1, r2

(SensorReading(id=1, value=23.5, unit='C', location=None),
 SensorReading(id=2, value=18.2, unit='kPa', location='Lab 3'))
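
To see which fields Pydantic treats as required, one option (assuming Pydantic v2) is to inspect the class attribute `model_fields`. The sketch below repeats the model so it is self-contained: fields without a default are required, the others are not.

```python
from typing import Optional
from pydantic import BaseModel

class SensorReading(BaseModel):
    id: int
    value: float
    unit: str = "C"
    location: Optional[str] = None

# map each field name to whether a value must be supplied
required = {name: info.is_required()
            for name, info in SensorReading.model_fields.items()}
print(required)
```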

## Complex structures

Data exchanged through interfaces often contains nested objects in JSON (which can themselves contain further nested objects, and so on).

In [None]:
from pydantic import BaseModel
from typing import List

# let's define a Point type which is a pair of floats.
class Point(BaseModel):
    x: float
    y: float

class Measurement(BaseModel):
    id: int
    points: List[Point]
    description: str | None = None  # either there is text or not

# incoming data (validation).
# try to make an error in it: e.g. use z instead of y or provide bad data!
m = Measurement(
    id=42,
    points=[
      {"x": 0, "y": 1.5},
      {"x": 2.3, "y": -0.4},
    ],
)

m
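
As a sketch of the experiment suggested in the comment above, here the second point uses `z` instead of `y`. The models are repeated (with the built-in `list` instead of `typing.List`) so the cell runs on its own. Each error carries a path that leads to the exact nested location.

```python
from pydantic import BaseModel, ValidationError

class Point(BaseModel):
    x: float
    y: float

class Measurement(BaseModel):
    id: int
    points: list[Point]

try:
    Measurement(id=42, points=[{"x": 0, "y": 1.5}, {"x": 2.3, "z": -0.4}])
    error_paths = []
except ValidationError as err:
    # the path reads: field "points", list index 1, field "y"
    error_paths = [e["loc"] for e in err.errors()]

print(error_paths)
```

By default the unknown key `z` is simply ignored, so the only complaint is about the missing `y`.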

But what if a field can hold one type or another, rather than a single fixed type? For example, a user identifier might be an `int` or it might be a `str`!

In that case we simply declare it as a union of the alternative Python types:

In [None]:
class User(BaseModel):
    id: str | int

User(id=12), User(id="jozsi")

*Extra info*: programmers (and mathematicians) call this a 'sum type' because the set of possible values is the sum of the two types (all str + all int).

In contrast, if you store both, for example:
```python
class User(BaseModel):
  id: int
  name: str
```
That would be a "product type", because the possible values are the product of the two types. (Every int can be paired with every str).

## Validation rules

Often it's not enough that the type is correct: we also want to specify the valid range precisely (min/max, length, etc.).

For that we can use `Field`, where all of this can be specified!

In [None]:
from pydantic import BaseModel, Field

class Product(BaseModel):
  name: str = Field(min_length=1, max_length=100)
  price: float = Field(gt=0)  # > 0, gt = greater than
  quantity: int = Field(ge=0) # >= 0, ge = greater than or equal

Product(name="Cement", price=12.5, quantity=100)
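
For contrast, a sketch of what happens when the constraints are violated. The model is repeated so the example is self-contained. Pydantic reports every failing field in one go rather than stopping at the first.

```python
from pydantic import BaseModel, Field, ValidationError

class Product(BaseModel):
    name: str = Field(min_length=1, max_length=100)
    price: float = Field(gt=0)
    quantity: int = Field(ge=0)

try:
    # empty name and negative price break two rules; quantity=0 is allowed
    Product(name="", price=-1, quantity=0)
    bad_fields = []
except ValidationError as err:
    bad_fields = sorted(e["loc"][0] for e in err.errors())
    print(err)

print(bad_fields)
```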

In [None]:
# if the product name cannot be arbitrary, we can use a Literal type
# (a concrete set of allowed values)
from typing import Literal

ProductType = Literal["Cement", "Steel"]

class Product(BaseModel):
  name: ProductType
  price: float = Field(gt=0)  # > 0, gt = greater than
  quantity: int = Field(ge=0) # >= 0, ge = greater than or equal

Product(name="Steel", price=12.5, quantity=100)

In [None]:
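# but we can also write our own validation rule (the rule below is only an example)
from pydantic import field_validator

class Person(BaseModel):
  name: str
  age: float = Field(gt=0, le=200)

  @field_validator("name")
  @classmethod
  def name_validator(cls, value: str) -> str:
    if not value.strip():
      raise ValueError("name must not be blank")
    return value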

## Data input/output, communicating with JSON

Assume we work with a strictly defined standard (interface) like the ones above, and we receive the following data (from a user, an API call, or another system):

In [None]:
class Product(BaseModel):
  name: str = Field(min_length=1, max_length=100)
  price: float = Field(gt=0, le=10000)
  quantity: int = Field(ge=0)

incoming = {
    "name": "Cement",
    "price": "12.5",
    "quantity": "100"
}

In [None]:
# since this is already a Python structure (a dict), the case is super simple:
p = Product(**incoming)

# and we can use it right away:
p.name, p.price, p.quantity

('Cement', 12.5, 100)

Unfortunately the data usually doesn't come as Python code, but in other formats, e.g. CSV or more often JSON. But that's not a problem either!



In [None]:
# this is a text, e.g. from a file or over the network:
json_data = '{"name": "Cement", "price": "12.5", "quantity": "100"}'

In [None]:
# validate the data:
Product.model_validate_json(json_data)
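
The opposite direction, output, works in a similar way. A minimal sketch, assuming Pydantic v2: `model_dump()` gives a plain dict and `model_dump_json()` gives JSON text, both containing the already converted values. The model and input are repeated so the cell is self-contained.

```python
import json
from pydantic import BaseModel, Field

class Product(BaseModel):
    name: str = Field(min_length=1, max_length=100)
    price: float = Field(gt=0, le=10000)
    quantity: int = Field(ge=0)

json_data = '{"name": "Cement", "price": "12.5", "quantity": "100"}'
p = Product.model_validate_json(json_data)

as_dict = p.model_dump()       # plain Python dict
as_json = p.model_dump_json()  # JSON text, ready to be sent or saved

print(as_dict)
print(as_json)
```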

But it can also easily happen that we have many such "Product" items, for example in a list! And we want to validate them all. Let's add a twist: we simply ignore the invalid items.

Normally, if any item is wrong anywhere, we immediately get a ValidationError, because Pydantic (as its name suggests) is super strict!

In [None]:
in_JSON = """
    [
      {"name": "Cement", "price": "12.5", "quantity": "100"},
      {"name": "Steel", "price": "22.1", "quantity": "-80"},
      {"name": "Cement", "price": "11.5", "quantity": "72"}
    ]
"""
# the quantity of the second item (Steel) is invalid: it is negative!


In such cases `TypeAdapter` helps us: it lets us validate an ordinary Python type, such as a list of Pydantic models. The `TypeAdapter` itself is not a Pydantic type (you can't use it as a field type in other models); it simply helps us create a "list reader" or "list writer".

We handle the skipping with `OnErrorOmit`, which is a real type: when an item wrapped in it fails validation, the item is left out of the result instead of raising an error.


In [None]:
from pydantic import OnErrorOmit, TypeAdapter

# let's create a python list reader that expects Products:
adapter = TypeAdapter(list[OnErrorOmit[Product]])

# and we can read it now (validated!)
products = adapter.validate_json(in_JSON)

products

[Product(name='Cement', price=12.5, quantity=100),
 Product(name='Cement', price=11.5, quantity=72)]
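
For comparison, a sketch of the strict variant without `OnErrorOmit`, using a shortened version of the input. The whole list is rejected, and the error path names the offending item. The same adapter also works as a "list writer" through `dump_python`.

```python
from pydantic import BaseModel, Field, TypeAdapter, ValidationError

class Product(BaseModel):
    name: str = Field(min_length=1, max_length=100)
    price: float = Field(gt=0, le=10000)
    quantity: int = Field(ge=0)

in_JSON = """
[
  {"name": "Cement", "price": "12.5", "quantity": "100"},
  {"name": "Steel", "price": "22.1", "quantity": "-80"}
]
"""

strict_adapter = TypeAdapter(list[Product])

try:
    strict_adapter.validate_json(in_JSON)
    failed_locations = []
except ValidationError as err:
    # (list index, field name) of every invalid value
    failed_locations = [e["loc"] for e in err.errors()]

print(failed_locations)

# writing: turn a list of models back into plain Python data
written = strict_adapter.dump_python(
    [Product(name="Cement", price=12.5, quantity=100)]
)
print(written)
```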