# Test Pydantic

In this script we test the Python library Pydantic as a way to validate datasets before they are included in our workflows.

This script is based on https://towardsai.net/p/machine-learning/data-reliability-101-a-practical-guide-to-data-validation-using-pydantic-in-data-science-projects

### 1. Import Python libraries

In [18]:
import numpy as np
from pydantic import BaseModel, ValidationError, Field
from typing import List, Optional
import pandas as pd

### 2. Define Pydantic classes

Dictvalidator specifies the data type of each feature in the dataset to be imported, along with the conditions each feature must satisfy to pass validation.

IMPORTANT: the conditions inside the class are designed for https://github.com/shivamshinde123/Thyroid_Disease_Detection_Internship/blob/main/Data/Raw_data/ThyroidRawData.csv. If you intend to test Pydantic on your own dataset, you will need to edit the conditions inside the class to match your own set of features.

In [19]:
class Dictvalidator(BaseModel):
    
    age: int = Field(gt=0, le=100)
    sex: Optional[str]
    on_thyroxine: Optional[str]
    query_on_thyroxine: Optional[str]
    on_antithyroid_meds: Optional[str]
    sick: Optional[str]
    pregnant: Optional[str]
    thyroid_surgery: Optional[str]
    I131_treatment: Optional[str]
    query_hypothyroid: Optional[str]
    query_hyperthyroid: Optional[str]
    lithium: Optional[str]
    goitre: Optional[str]
    tumor: Optional[str]
    hypopituitary: Optional[str]
    psych: Optional[str]
    TSH_measured: str
    TSH: Optional[float]
    T3_measured: str
    T3: Optional[float]
    TT4_measured: str
    TT4: Optional[float]
    T4U_measured: str
    T4U: Optional[float]
    FTI_measured: str
    FTI: Optional[float]
    TBG_measured: str
    TBG: Optional[float]
    referral_source: Optional[str]
    target: str
    patient_id: int
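
As a minimal sketch of how these field declarations behave, the example below uses a hypothetical three-field model, `MiniRecord`, which is not part of the original script. It assumes that `Field(gt=0, le=100)` and `Optional[...]` work as in Dictvalidator, and it reads the offending field name from `ValidationError.errors()`.

```python
from typing import Optional
from pydantic import BaseModel, Field, ValidationError

# Hypothetical reduced model, used only to illustrate the constraints.
class MiniRecord(BaseModel):
    age: int = Field(gt=0, le=100)  # must satisfy 0 < age <= 100
    sex: Optional[str]              # a string or None
    TSH: Optional[float]            # a float or None

# A record that satisfies every condition is accepted.
ok = MiniRecord(age=41, sex="F", TSH=None)

# A record with an out-of-range age triggers a ValidationError.
bad_fields = []
try:
    MiniRecord(age=455, sex="F", TSH=1.3)
except ValidationError as e:
    # Each error entry carries the location of the failing field.
    bad_fields = [err["loc"][0] for err in e.errors()]

print(bad_fields)
```

Only `age` is reported here, because the other two fields hold acceptable values.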

dataframe_validator checks that every row of the imported pandas DataFrame satisfies the conditions specified in Dictvalidator.

In [8]:
class dataframe_validator(BaseModel):
    
    df_dict: List[Dictvalidator]
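
The sketch below illustrates what wrapping a model in a `List[...]` field provides: when a record fails, the error location includes the position of that record in the list. `Row` and `Table` are hypothetical stand-ins for Dictvalidator and dataframe_validator, and the records are made up for illustration.

```python
from typing import List
from pydantic import BaseModel, Field, ValidationError

class Row(BaseModel):    # hypothetical stand-in for Dictvalidator
    age: int = Field(gt=0, le=100)
    target: str

class Table(BaseModel):  # hypothetical stand-in for dataframe_validator
    rows: List[Row]

# Illustrative records; the second one violates gt=0.
records = [
    {"age": 41, "target": "negative"},
    {"age": 0, "target": "negative"},
    {"age": 23, "target": "negative"},
]

error_locations = []
try:
    Table(rows=records)
except ValidationError as e:
    # Each location has the form (field name, list position, nested field name).
    error_locations = [tuple(err["loc"]) for err in e.errors()]

print(error_locations)
```

The list position in each location can be mapped back to the corresponding row of the DataFrame when the records come from `df.to_dict(orient='records')`.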

### 3. Locate dataset to be modeled

In [None]:
raw_data_file_path = input("Type the filepath to your dataset (including filename and extension):")

print("\nYou are about to model the following dataset:", raw_data_file_path)


### 4. Run Pydantic modeling

In [17]:
if __name__ =='__main__':
    
    # read dataframe    
    df = pd.read_csv(raw_data_file_path)
    
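    # remove individuals with incorrect ages (rows with a missing age are
    # dropped too); this runs before the NaN conversion because comparing
    # None with an integer raises a TypeError
    df = df[df['age'] <= 100]
    
    # convert NaNs to None
    df = df.replace({np.nan: None})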
    
    # validate dataframe
    try:
        dataframe_validator(df_dict = df.to_dict(orient = 'records'))
    except ValidationError as e:
        raise e
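
Re-raising the exception stops at a single traceback. As a possible alternative, the sketch below validates each row separately and collects the failing fields per DataFrame index. It is an illustration under stated assumptions, not the original workflow: `Row` is a hypothetical two-field stand-in for Dictvalidator, the DataFrame is made up, and missing values are converted to None in plain Python with `pd.isna` instead of `df.replace`.

```python
import numpy as np
import pandas as pd
from typing import Optional
from pydantic import BaseModel, Field, ValidationError

class Row(BaseModel):  # hypothetical stand-in for Dictvalidator
    age: int = Field(gt=0, le=100)
    sex: Optional[str]

# Illustrative data: row 1 has an out-of-range age, row 2 a missing sex.
df = pd.DataFrame({"age": [41, 455, 23], "sex": ["F", "M", np.nan]})

failed = {}
for idx, record in zip(df.index, df.to_dict(orient="records")):
    # Convert missing values to None so that Optional fields accept them.
    clean = {k: (None if pd.isna(v) else v) for k, v in record.items()}
    try:
        Row(**clean)
    except ValidationError as e:
        # Store the names of the fields that failed for this row.
        failed[int(idx)] = [str(err["loc"][0]) for err in e.errors()]

print(failed)
```

Row-by-row validation is slower than validating the whole list at once, but it leaves the valid rows usable and yields a compact report of what needs fixing.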
    