<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Imports" data-toc-modified-id="Imports-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Imports</a></span></li><li><span><a href="#First-comprehension" data-toc-modified-id="First-comprehension-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>First comprehension</a></span></li><li><span><a href="#Models" data-toc-modified-id="Models-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Models</a></span></li><li><span><a href="#More-explicit-errors-:-ValidationError" data-toc-modified-id="More-explicit-errors-:-ValidationError-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>More explicit errors : ValidationError</a></span></li><li><span><a href="#Integration-with-IDE" data-toc-modified-id="Integration-with-IDE-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Integration with IDE</a></span></li><li><span><a href="#Adding-constraints" data-toc-modified-id="Adding-constraints-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Adding constraints</a></span></li><li><span><a href="#Extra-validation" data-toc-modified-id="Extra-validation-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Extra validation</a></span></li><li><span><a href="#Environment-variables" data-toc-modified-id="Environment-variables-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Environment variables</a></span></li><li><span><a href="#How-can-we-run-validation-on-our-use-case-?" 
data-toc-modified-id="How-can-we-run-validation-on-our-use-case-?-9"><span class="toc-item-num">9&nbsp;&nbsp;</span>How can we run validation on our use case ?</a></span></li><li><span><a href="#Extra-types" data-toc-modified-id="Extra-types-10"><span class="toc-item-num">10&nbsp;&nbsp;</span>Extra types</a></span></li><li><span><a href="#Why-is-it-useful" data-toc-modified-id="Why-is-it-useful-11"><span class="toc-item-num">11&nbsp;&nbsp;</span>Why is it useful</a></span></li><li><span><a href="#Sources" data-toc-modified-id="Sources-12"><span class="toc-item-num">12&nbsp;&nbsp;</span>Sources</a></span></li></ul></div>

*Objective*: test Pydantic (a data validation library)

*Resources* : https://pydantic-docs.helpmanual.io/

*What is it ?* 
- An abstraction for settings and data validation which does not have any impact on your code logic.

*Data validation*  
"Data validation is a process that makes data compliant with a set of rules, schemas or constraints that we defined. This makes our code ingest and return data in the exact way it was expected to.
Data validation prevents unexpected errors that occur due to problems such as malformed user inputs, schema evolutions etc.  In that sense, it also acts as a sanitization process." (https://towardsdatascience.com/8-reasons-to-start-using-pydantic-to-improve-data-parsing-and-validation-4f437eae7678) 
  
*Value ?* 
- Define how data should be in pure, canonical python; validate it with pydantic.
- Data validation makes sure the data we ingest and send to another service follows a set of constraints.

## Imports 

In [85]:
from pydantic import BaseModel, ValidationError
from datetime import datetime
from typing import List, Optional
import json

## First comprehension

Python does not enforce type hints at runtime (https://docs.python.org/3/library/typing.html)

In [65]:
def greeting(name: str) -> str:
    """this function takes as input a string and is expected to return a string as well """
    return 'Hello ' + name

# Python does not check the annotation at runtime: greeting(1) is accepted as a call and only fails
# inside the function with TypeError: can only concatenate str (not "int") to str
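
To make the point concrete, here is a minimal sketch using only the standard library. The `describe` function is a hypothetical helper added for illustration: it shows that a wrongly typed argument is accepted silently, while `greeting(1)` fails only because `str + int` is undefined.

```python
def greeting(name: str) -> str:
    """Return a greeting; the annotations are hints only."""
    return "Hello " + name


def describe(value: int) -> str:
    """Annotated as int, but nothing checks the argument at call time."""
    return f"got {value!r}"


# The hint says int, yet a string is accepted without any complaint.
silent_result = describe("not an int")

# greeting(1) does fail, but because of the concatenation, not the annotation.
try:
    greeting(1)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(silent_result)
print(error_message)
```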

With a pydantic model, we declare the expected data types and pydantic enforces them when the model is instantiated, coercing values where it can.

In [66]:
# Fields can be required or optional
class User(BaseModel):
    id: int
    name: str
    date: Optional[datetime]
    referrals: Optional[List[int]] = []

In [67]:
params = {"id": "2", "name": 1, "date":datetime(2021, 1, 1), "referrals":["1"]}
# pydantic coerces values to the declared types when possible: the int name 1 becomes the str "1",
# the str id "2" becomes the int 2 and the referral "1" becomes the int 1. Otherwise an exception is raised.
USER = User(**params)
greeting(USER.name)

'Hello 1'
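
The coercion can be checked by inspecting the parsed values. The sketch below restates the `User` model so that it runs on its own; the name "Ada" is an illustrative value, and `date` is given an explicit `None` default, which is an assumption made here to keep the field optional.

```python
from datetime import datetime
from typing import List, Optional

from pydantic import BaseModel


class User(BaseModel):
    id: int
    name: str
    date: Optional[datetime] = None  # explicit default: the field may be omitted
    referrals: Optional[List[int]] = []


# Numeric strings are converted to the declared int types.
user = User(id="2", name="Ada", referrals=["1", "3"])

print(type(user.id).__name__, user.id)
print(user.referrals)
print(user.date)
```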

In [73]:
# If the type cannot be enforced, a clear error is raised. 
try : 
    User(id=1, name=datetime(2021, 1, 1))
except Exception as e: 
    print(f"we raised an exception {e} with pydantic ")

we raised an exception 1 validation error for User
name
  str type expected (type=type_error.str) with pydantic 


In [78]:
# errors about missing required data are also raised explicitly

try : 
    User(name=datetime(2021, 1, 1))
except Exception as e: 
    print(f"we raised an exception {e} with pydantic ")

we raised an exception 2 validation errors for User
id
  field required (type=value_error.missing)
name
  str type expected (type=type_error.str) with pydantic 


## Models 

"Untrusted data can be passed to a model, and after parsing and validation pydantic guarantees that the fields of the resultant model instance will conform to the field types defined on the model.

You can still make your data follow these constraints by loading it and applying a series of conditions to each field. This could work but it can quickly result in a lot of code that becomes unmaintainable over time.
What if we could encapsulate the data into a class, create a typed attribute for each of its fields and validate the field constraints at runtime when the data is loaded into the class?"

It is an abstraction to set constraints about data validation. 

If the parsed data does not meet the model's constraints, a ValidationError is raised.


In [76]:
# We can use other models as field types
from pydb.base_connector import BaseConnector
class ReservationsSourceConnector(BaseModel):
    connector: BaseConnector

ModuleNotFoundError: No module named 'pydb'
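
Since `pydb` is not available here, the idea can be sketched with a stand-in. `Connector` and `ReservationsSource` below are hypothetical models (not the real `BaseConnector`); they only illustrate that a model can be used as a field type and that nested data is validated too.

```python
from pydantic import BaseModel, ValidationError


class Connector(BaseModel):
    """Hypothetical stand-in for a connector model; not the pydb BaseConnector."""
    host: str
    port: int


class ReservationsSource(BaseModel):
    connector: Connector  # a model used as a field type


# A plain dict is parsed into a Connector instance.
source = ReservationsSource(connector={"host": "localhost", "port": 3306})

# Errors in the nested data are reported with their full path.
try:
    ReservationsSource(connector={"host": "localhost"})
    nested_error_locations = []
except ValidationError as exc:
    nested_error_locations = [error["loc"] for error in exc.errors()]

print(source.connector.port)
print(nested_error_locations)
```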

## More explicit errors : ValidationError 

In [86]:
bad_data = {"id": 1, "name": datetime(2021, 1, 1)}
try : 
    User(**bad_data)
except ValidationError as e:
    print(f"more explicit error {e.json()}")


more explicit error [
  {
    "loc": [
      "name"
    ],
    "msg": "str type expected",
    "type": "type_error.str"
  }
]
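
Besides `e.json()`, the same information is available as a list of dicts through `e.errors()`, which is convenient when the errors must be handled in code. A minimal sketch, where `collect_errors` is a hypothetical helper and the `User` model is reduced to its two required fields:

```python
from pydantic import BaseModel, ValidationError


class User(BaseModel):
    id: int
    name: str


def collect_errors(data: dict) -> list:
    """Return one (field, message) pair per validation error; empty if valid."""
    try:
        User(**data)
    except ValidationError as exc:
        return [
            (".".join(str(part) for part in err["loc"]), err["msg"])
            for err in exc.errors()
        ]
    return []


# "id" cannot be parsed as an int and "name" is missing.
problems = collect_errors({"id": "not-a-number"})
for field, message in problems:
    print(f"{field}: {message}")
```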


## Integration with IDE

Because models are plain Python classes with type annotations, IDEs can offer autocompletion and type checking on model fields. This helps avoid mistakes and saves time.

## Adding constraints 

We can add constraints when creating validation models. To do that, we use the Field function.


- you can add constraints on the length of the string fields by using the Field’s max_length and min_length arguments
- you can set boundaries on the numerical fields by using the Field’s ge and le arguments. (ge: greater than or equal, le: less than or equal).
- regex : this adds a regular expression validator. This is useful when you want some string values to match a specific pattern
- multiple_of : this applies to numeric fields such as int. It adds a “multiple of” validator
- max_items and min_items : this applies to lists and imposes a constraint on the number of items contained in them
- allow_mutation : this applies to any type of field. It defaults to True. When set to False, assigning a new value to the field raises an error, which makes the field immutable (or protected). The check is only performed when the model Config sets validate_assignment = True.

In [100]:
from pydantic import Field
class SourceData(BaseModel):
    db: str = Field(min_length=1, max_length=25)
    table: str = Field(min_length=1, max_length=25)
    attributes: List = Field(min_items=1, max_items=20)

In [102]:
config_source_data = {"db" : "bddadmcity", "table" : "reservations", "attributes" : ["id", "date_end", "last_status"]}
source_data = SourceData(**config_source_data)
source_data.json() # source_data.schema() returns the JSON schema generated for the model

'{"db": "bddadmcity", "table": "reservations", "attributes": ["id", "date_end", "last_status"]}'
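
The run above only shows valid data. The sketch below shows what happens when constraints are violated, and adds numeric bounds with `ge`, `le` and `multiple_of`. `BatchConfig` and its field names are hypothetical, chosen only for illustration.

```python
from pydantic import BaseModel, Field, ValidationError


class BatchConfig(BaseModel):
    """Hypothetical model, used only to illustrate Field constraints."""
    table: str = Field(min_length=1, max_length=25)
    batch_size: int = Field(ge=1, le=1000)
    step: int = Field(multiple_of=5)


# All constraints are satisfied here.
valid = BatchConfig(table="reservations", batch_size=500, step=10)

# Empty string, value below the lower bound, value that is not a multiple of 5.
try:
    BatchConfig(table="", batch_size=0, step=7)
    violated_fields = []
except ValidationError as exc:
    violated_fields = [error["loc"][0] for error in exc.errors()]

print(violated_fields)
```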

## Extra validation

We can create our own validators with the @validator decorator. For example, the params for our Gaussian mixture model need to fit these constraints : 
- n_components must be an integer greater than 1 and lower than 100; 
- covariance_type must be a string and the allowed values are "full", "tied", "diag" and "spherical". 

In [117]:
from pydantic import BaseModel, Field, validator
from typing import Optional

class BayesianGaussianMixture(BaseModel):
    n_components: int = Field(gt=1, lt=100)
    covariance_type: Optional[str]

    @validator("covariance_type")
    def covariance_type_is_valid(cls, covariance_type: Optional[str]) -> Optional[str]:
        # Reject None as well as any value outside the allowed keywords.
        valid_set_values = ["full", "tied", "diag", "spherical"]
        if (covariance_type is None) or (covariance_type not in valid_set_values):
            raise ValueError("covariance_type should match one of these keywords: full, tied, diag or spherical")
        return covariance_type

In [116]:
BGM_PARAMS = {"n_components" : 2, "covariance_type" : "full"}
BayesianGaussianMixture(**BGM_PARAMS)

BayesianGaussianMixture(n_components=2, covariance_type='full')

Validators can also cover other cases, such as missing data.
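
The following sketch exercises a validator of this kind on valid, invalid and missing input. `MixtureParams` and `is_valid` are hypothetical names, and the explicit `None` default on `covariance_type` is an assumption made so that the field can be omitted.

```python
from typing import Optional

from pydantic import BaseModel, Field, ValidationError, validator

ALLOWED_COVARIANCE_TYPES = ("full", "tied", "diag", "spherical")


class MixtureParams(BaseModel):
    n_components: int = Field(gt=1, lt=100)
    covariance_type: Optional[str] = None

    @validator("covariance_type")
    def covariance_type_is_valid(cls, value):
        # Runs only when a value is actually passed for the field.
        if value not in ALLOWED_COVARIANCE_TYPES:
            raise ValueError("covariance_type must be one of: full, tied, diag, spherical")
        return value


def is_valid(params: dict) -> bool:
    """Return True when the params pass validation."""
    try:
        MixtureParams(**params)
    except ValidationError:
        return False
    return True


checks = {
    "valid": is_valid({"n_components": 2, "covariance_type": "full"}),
    "unknown keyword": is_valid({"n_components": 2, "covariance_type": "cubic"}),
    "too many components": is_valid({"n_components": 100, "covariance_type": "full"}),
    "covariance_type omitted": is_valid({"n_components": 2}),
}
print(checks)
```

Note that the omitted field passes: by default the validator is not run on a default value. Passing `always=True` to `@validator` is the documented way to run it in that case as well.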

## Environment variables

Pydantic can read settings from environment variables or from .env files and parse them directly into a BaseSettings class. Reading .env files requires the python-dotenv package.

In [121]:
from pydantic import BaseSettings

class Settings(BaseSettings):
    api_key: str
    login: str
    seed: int
    class Config:
        env_file = ".env"
        env_file_encoding = "utf-8"

settings = Settings()
print(settings)

ImportError: python-dotenv is not installed, run `pip install pydantic[dotenv]`

## How can we run validation on our use case ? 

* Services : 
    - data preparation 
    - PCA decomposition 
    - Unsupervised clustering
    - Load results

* Interfaces : 
    - data_source
    
* Models : 
    - tables + fields => data_source_model(connector), table, fields (readable and maintainable => schema is explicitly defined!) 
    - settings (config file => .env + model config) => validation
    
* Validation :
    - config file => .env + model config

"Pydantic models are structures that ingest the data, parse it and make sure it conforms to the fields’ constraints defined in it."

- DB source interface (BigQuery or MySQL); 
- model input validation; 
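
One possible way to tie these pieces together is a single top-level model that validates the whole configuration in one call. This is only a sketch: `PipelineConfig`, `ClusteringParams` and the JSON layout are hypothetical, and the values are reused from the examples above.

```python
import json
from typing import List

from pydantic import BaseModel, Field, ValidationError


class SourceData(BaseModel):
    db: str = Field(min_length=1, max_length=25)
    table: str = Field(min_length=1, max_length=25)
    attributes: List[str]


class ClusteringParams(BaseModel):
    n_components: int = Field(gt=1, lt=100)
    covariance_type: str = "full"


class PipelineConfig(BaseModel):
    """Hypothetical top-level config: one validated object per pipeline run."""
    source: SourceData
    clustering: ClusteringParams


# Stand-in for the content of a config file.
raw_config = json.dumps({
    "source": {
        "db": "bddadmcity",
        "table": "reservations",
        "attributes": ["id", "date_end", "last_status"],
    },
    "clustering": {"n_components": 2},
})

config = PipelineConfig(**json.loads(raw_config))
print(config.source.table, config.clustering.n_components)

# A broken config is rejected before any service runs.
try:
    PipelineConfig(source={"db": "", "table": "reservations", "attributes": []},
                   clustering={"n_components": 1})
    config_rejected = False
except ValidationError:
    config_rejected = True
```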

## Extra types 

## Why is it useful

- it is fast;  
- provides clear error messages (data validation); 
- allows us to focus on input data when writing code and to set constraints; 
- makes it easy to write readable code.


=> "To avoid starting our functions with a long set of validations and assertions, we use pydantic to validate the input."

## Sources 

- https://towardsdatascience.com/8-reasons-to-start-using-pydantic-to-improve-data-parsing-and-validation-4f437eae7678
- https://pydantic-docs.helpmanual.io/
- https://datascience.statnett.no/2020/05/11/how-we-validate-data-using-pydantic/
- https://dev.to/tiangolo/the-future-of-fastapi-and-pydantic-is-bright-3pbm
- https://www.youtube.com/watch?v=lon-dEXfY2I