<img src=images/xd-logo.png align=right width=300px>

# Pydantic
*Data parsing and validation using type annotations.*

After this notebook, you will be able to:

- Understand why and when to use Pydantic.
- Use Pydantic to validate your data at any stage of your application.
- Validate that your data meets any arbitrary condition.
- Be aware of some of the extra functionality that Pydantic provides.

You can access the official Pydantic documentation [here](https://docs.pydantic.dev/latest/).

## Why Pydantic?

Let's have a look at the following scenario with three components:

- A custom data structure `ApiResponse`, which is a `dataclass` that holds all the input data for this application.
- The function `get_api_data()` that simulates requesting an API and parsing the response into our custom data structure.
- The function `use_data()` that simulates the behavior of the application.


In [None]:
from dataclasses import dataclass

@dataclass
class ApiResponse:
    number: int

def get_api_data():
    return ApiResponse(**{"number": 2})

def use_data(val):
    return val**2 + 1

api_response = get_api_data()

use_data(api_response.number)

So far everything works as expected.
However, uncomment the following code and see what would happen if at any point the output of the external API changes.

In [None]:
# def get_new_api_data(): # the api returns number as an str now instead of an int
#     return ApiResponse(**{"number": "2"})
# 
# new_api_response = get_new_api_data()
# use_data(new_api_response.number)

Even though you specified that you expected `number` to be an `int` in `ApiResponse`, dataclasses in Python don't perform any data validation. The string is stored as is, and your application now fails with a `TypeError` when `use_data()` tries to square it.

If the data is not validated, an application that works today could stop working in the future if the input data changes.
Or, even worse, it might still run without errors while no longer behaving as expected.
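
As a small illustration of that silent failure mode, here is a sketch in which the type change does not raise anything. The `double` function is hypothetical and not part of the application above.

```python
from dataclasses import dataclass


@dataclass
class ApiResponse:
    number: int  # only an annotation; dataclasses do not enforce it


def double(val):
    # Hypothetical application logic that happens to work on strings too.
    return val * 2


print(double(ApiResponse(number=2).number))    # 4
print(double(ApiResponse(number="2").number))  # 22 (string repetition, no error raised)
```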

Pydantic allows you to validate data in your applications.
It's useful for validating data from external sources (like APIs or user input), but it can also help when testing programs and in other use cases.

The previous example using Pydantic looks like:

In [None]:
from pydantic import BaseModel

class ValidatedApiResponse(BaseModel):
    number: int
        
def get_validated_api_data(number=2):
    return ValidatedApiResponse(**{"number": number})

validated_api_response = get_validated_api_data()

use_data(validated_api_response.number)

In [None]:
use_data(get_validated_api_data("2").number)

In [None]:
# use_data(get_validated_api_data("asdasd").number)

Data validation allows you to catch potential errors and have confidence that your application will behave as expected.
If the data is not valid, you are warned as soon as it enters your application, instead of finding out later when you hit an error and have to debug and trace back to its origin.
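
A minimal sketch of what that early warning can look like in code: Pydantic raises a `ValidationError` when the object is constructed, and its `errors()` method lists what went wrong and where.

```python
from pydantic import BaseModel, ValidationError


class ValidatedApiResponse(BaseModel):
    number: int


try:
    ValidatedApiResponse(number="asdasd")
except ValidationError as exc:
    # Each entry describes one failing field.
    errors = exc.errors()
    for error in errors:
        print(error["loc"], error["msg"])
```

Catching the error at the boundary of your application lets you decide what to do with bad input (log it, reject the request, fall back to a default) before it reaches the rest of your code.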

## Pydantic Basics

The main class exposed by Pydantic is `BaseModel`. Any class that inherits from it will validate that its inputs conform to the required types when objects of that class are initialized.

You can also set default field values, like the `name` field in the following example.

In [None]:
from pydantic import BaseModel

class User(BaseModel):
    id: int
    name: str = "Jane Doe"

In [None]:
user = User(id=1)
user

Pydantic also converts values to the specified types whenever possible (type casting).

In [None]:
User(id="123")

In [None]:
# User(id="one-two-three")

In [None]:
# User(name="True")
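
To make "whenever possible" a little more concrete, here is a short sketch of which inputs the `id` field accepts in Pydantic's default mode. It redefines the same `User` model so that it can run on its own.

```python
from pydantic import BaseModel, ValidationError


class User(BaseModel):
    id: int
    name: str = "Jane Doe"


# Values that convert to an int without losing information are accepted.
print(User(id="123").id)  # 123
print(User(id=3.0).id)    # 3

# A float with a fractional part cannot be converted losslessly, so it is rejected.
try:
    User(id=3.5)
except ValidationError as exc:
    print(exc.error_count(), "validation error(s)")
```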

### Exercise: Create your own validated data structure

- a) Create a Pydantic alternative to `DataClassUser`.
- b) Create instances of the `DataClassUser` and your own validated class from an input dict. 
- c) What happens if you provide a string in the list of friends?
- d) Bonus: Try other possible combinations of inputs and explore what succeeds or fails.

In [None]:
from dataclasses import dataclass
from datetime import datetime

from pydantic import BaseModel

@dataclass
class DataClassUser:
    id: int
    name: str  = "John Doe"
    signup_ts: datetime | None =  None
    friend_ids: list[int] | None = None
        
external_data = {
    "id": "123",
    "signup_ts": "2019-06-01 12:22",
    "friend_ids": [1, 2, 3],
}

user_dataclass = DataClassUser(**external_data)

In [None]:
# %load answers/exercise-1.py

## Validators

So far you've seen how to validate that the types of the data match their expected types.
But with Pydantic you can do much more than that.

Writing your own validators allows you to validate that any arbitrary condition is met, and also to apply transformations to each field.

To create validators you need to define class methods using the `@field_validator` decorator, which takes as an argument the name of the field it will validate.
The method itself needs to accept as arguments:
- First argument: the class
- Second argument: the value to validate
- Third argument: an object (usually called info) with a `.data` attribute that is a `dict` with all previously validated fields.

The validator method should also return the validated value, possibly after transforming it.

Validators are run in the order in which their associated fields are defined.

In [None]:
from pydantic import BaseModel, field_validator

class User(BaseModel):
    id: int
    name: str = "Jane Doe"
    
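    @field_validator("id")
    @classmethod
    def id_is_positive(cls, v):
        # Runs after the input has been converted to an int.
        assert v > 0, "id has to be positive"
        return v
    
    @field_validator("name")
    @classmethod
    def name_must_have_space(cls, v):
        if " " not in v:
            raise ValueError("must contain a space")
        # The returned value replaces the input, so the name is stored in title case.
        return v.title()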

In [None]:
# User(id = 0)

In [None]:
# User(id = 3, name = "David")

In [None]:
User(id = 3, name = "xeBIa dAtA")
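
The validators above only look at the value of their own field. As a sketch of the optional third argument, the hypothetical `Booking` model below checks one field against a previously validated one through `info.data`. Because validators run in field order, `start` is already available when `end` is validated.

```python
from datetime import date

from pydantic import BaseModel, ValidationError, ValidationInfo, field_validator


class Booking(BaseModel):  # hypothetical model, for illustration only
    start: date
    end: date

    @field_validator("end")
    @classmethod
    def end_not_before_start(cls, v: date, info: ValidationInfo) -> date:
        # "start" is missing from info.data if it failed its own validation.
        start = info.data.get("start")
        if start is not None and v < start:
            raise ValueError("end must not be before start")
        return v


booking = Booking(start="2024-01-01", end="2024-01-05")
print(booking)

try:
    Booking(start="2024-01-05", end="2024-01-01")
except ValidationError as exc:
    print(exc)
```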

### Exercise: Create your own validators

- a) Validate that `signup_ts` is not in the future.
    - *Hint: use `datetime.now()`.*
- b) Did the type conversion from string to datetime happen before or after your custom validator?
- c) Add two password fields: `password1` and `password2` and validate that the input to both fields is the same.
    - *Hint: use the optional third argument of the validator.*

In [None]:
from datetime import datetime

from pydantic import BaseModel, field_validator

class PydanticUser(BaseModel):
    id: int
    name:str = "John Doe"
    signup_ts: datetime | None =  None
    friend_ids: list[int] | None = None
        
        
external_data = {
    "id": "123",
    "signup_ts": "2019-06-01 12:22",
    "friend_ids": [1, 2, "3"],
   # "password1": "passypass",
   # "password2": "passypazz"
}


In [None]:
# %load answers/exercise-2.py

## Additional niceties of Pydantic

### Aliases

Aliases allow you to have input and output names that differ from the field name. This is useful when communicating with APIs that follow different naming styles.

For example, in Python it's preferred to use `snake_case` to name objects, while other environments might use `camelCase`.

In [None]:
camel_data = {"firstName": "Topsy", "lastName": "Tops"}
snake_data = {"first_name": "Kaa", "last_name": "Kipling"}

If you try to validate the data with unmatching field names, the validation will fail:

In [None]:
# from pydantic import BaseModel

# class User(BaseModel):
#     first_name: str
#     last_name: str
#         
# User(**camel_data)

You can use aliases to change the argument names that the class constructor expects:

In [None]:
from pydantic import BaseModel, Field

class User(BaseModel):
    first_name: str = Field(alias="firstName")
    last_name: str = Field(alias="lastName")
        
User(**camel_data)

However, this naive approach would prevent you from using the actual field names (in `snake_case`). This would fail:

In [None]:
# User(**snake_data)

To allow for both options you can set the `populate_by_name` argument to `True` in the class definition.

In [None]:
class User(BaseModel, populate_by_name=True):
    first_name: str = Field(alias="firstName")
    last_name: str = Field(alias="lastName")
        
print(User(**camel_data))
print(User(**snake_data))

Alternatively you can define aliases dynamically for all fields by defining an `alias_generator` function that automatically generates aliases for all fields.

In [None]:
def to_camel_case(snake_str: str) -> str:
    components = snake_str.split("_")
    return components[0] + "".join(x.title() for x in components[1:])

class User(BaseModel, populate_by_name=True, alias_generator=to_camel_case):
    first_name: str
    last_name: str

print(User(**camel_data))
print(User(**snake_data))
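
Aliases apply to output as well as input. The sketch below uses a hypothetical `Person` model and the `to_camel` helper from `pydantic.alias_generators` (an alternative to the hand-written `to_camel_case` above) to show that `model_dump()` uses the field names by default and the aliases when `by_alias=True` is passed.

```python
from pydantic import BaseModel
from pydantic.alias_generators import to_camel


class Person(BaseModel, populate_by_name=True, alias_generator=to_camel):
    first_name: str
    last_name: str


person = Person(firstName="Topsy", lastName="Tops")

print(person.model_dump())               # keys are the snake_case field names
print(person.model_dump(by_alias=True))  # keys are the camelCase aliases
```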

### IO (input-output)

There are a few handy methods to import/export information and data about our validated classes.

In [None]:
class PydanticUser(BaseModel, populate_by_name=True, alias_generator=to_camel_case):
    id: int
    name:str = "John Doe"
    signup_ts: datetime | None =  None
    friend_ids: list[int] | None = None

user = PydanticUser(id = 3)
user

In [None]:
user.model_dump()

In [None]:
user.model_dump_json()

In [None]:
user.model_json_schema()

In [None]:
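# `parse_file` is deprecated in Pydantic v2; read the file yourself and validate its contents:
# User.model_validate_json(Path("path/to/JSON").read_text())  # needs `from pathlib import Path`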
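
The export methods above have import counterparts. A minimal round-trip sketch, using a hypothetical `Point` model:

```python
from pydantic import BaseModel


class Point(BaseModel):  # hypothetical model, for illustration only
    x: int
    y: int = 0


original = Point(x=1)

as_dict = original.model_dump()       # plain Python dict
as_json = original.model_dump_json()  # JSON string

# Both representations can be validated back into a model instance.
from_dict = Point.model_validate(as_dict)
from_json = Point.model_validate_json(as_json)

print(from_dict == original, from_json == original)
```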

### Settings management

Another use case for Pydantic is validating and managing all kinds of settings, configuration, options, etc. that might differ between environments.

Pydantic (through the companion `pydantic-settings` package) will automatically read the values from environment variables and validate that they conform to the expected schema. It also ships ready-made types for common kinds of settings, such as the `RedisDsn` connection string used in the next example.

In [None]:
from pydantic import RedisDsn
from pydantic_settings import BaseSettings, SettingsConfigDict
import os

class APIConfig(BaseSettings):
    AUTH_KEY: str
    API_KEY: str
    DB_HANDLE: RedisDsn
        
os.environ["AUTH_KEY"] = "authauthauth"
os.environ["API_KEY"] = "apiapiapi"
os.environ["DB_HANDLE"] = "redis://user:pass@localhost:6379/1"

APIConfig()

A commonly used option is `env_prefix`, which specifies a prefix that every environment variable name is expected to start with.

In [None]:
from pydantic import RedisDsn
from pydantic_settings import BaseSettings, SettingsConfigDict
import os

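class APIConfig(BaseSettings):
    # Every environment variable is expected to start with this prefix.
    model_config = SettingsConfigDict(env_prefix="xebia_data_training_")

    AUTH_KEY: str
    API_KEY: str
    DB_HANDLE: RedisDsn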
        
os.environ["xebia_data_training_AUTH_KEY"] = "authauthauth"
os.environ["xebia_data_training_API_KEY"] = "apiapiapi"
os.environ["xebia_data_training_DB_HANDLE"] = "redis://user:pass@localhost:6379/1"

APIConfig()

### FastAPI <3 Pydantic

FastAPI is built on top of Pydantic. When a request handler declares a parameter whose type is a Pydantic model, FastAPI reads that parameter from the request body and validates it automatically. This gives you a very ergonomic way to define APIs that are validated end to end.

In [None]:
# FastAPI endpoint definition
# @app.post("/users/") 
def create_user(user: PydanticUser):
    # do important things
    return {"user_name": user.name, "user_id": user.id}

## Conclusion

Pydantic is not a large or complicated library, but it does a handful of simple things very well. It is pleasant to use and a valuable tool to have in your toolbox when developing Python applications. It takes care of tedious chores such as type conversion, and it makes it relatively easy to define validation logic that keeps your app working as designed.
