<a id='Q0'></a>
<center><a target="_blank" href="http://www.propulsion.academy"><img src="https://drive.google.com/uc?id=1McNxpNrSwfqu1w-QtlOmPSmfULvkkMQV" width="200" style="background:none; border:none; box-shadow:none;" /></a> </center>
<center> <h1> Live Coding 1: Simple Prediction Notebook</h1> </center>
<p style="margin-bottom:1cm;"></p>
<center><h4>Propulsion Academy, 2021</h4></center>
<p style="margin-bottom:1cm;"></p>

<div style="background:#EEEDF5;border-top:0.1cm solid #EF475B;border-bottom:0.1cm solid #EF475B;">
    <div style="margin-left: 0.5cm;margin-top: 0.5cm;margin-bottom: 0.5cm;color:#303030">
        <p><strong>Goal:</strong> Review a simple prediction model built with scikit-learn</p>
        <strong> Outline:</strong>
        <a id='P0' name="P0"></a>
        <ol>
            <li> <a style="color:#303030" href='#SU'>Set up</a></li>
            <li> <a style="color:#303030" href='#P1'>Data Preparation</a></li>
            <li> <a style="color:#303030" href='#P2'>Modelling</a></li>
            <li> <a style="color:#303030" href='#P3'>Model Evaluation</a></li>
            <li> <a style="color:#303030" href='#CL'>Conclusion</a></li>
        </ol>
        <strong>Topics Trained:</strong> Notebook Layout, Data Cleaning, Modelling and Model Evaluation
    </div>
</div>

<nav style="text-align:right"><strong>
        <a style="color:#00BAE5" href="https://monolith.propulsion-home.ch/backend/api/momentum/materials/ds-materials/07_MLEngineering/index.html" title="momentum"> Module 7, Machine Learning Engineering </a>|
        <a style="color:#00BAE5" href="https://monolith.propulsion-home.ch/backend/api/momentum/materials/ds-materials/07_MLEngineering/day4/index.html" title="momentum">Day 4, Data Science Project Development </a>|
        <a style="color:#00BAE5" href="https://monolith.propulsion-home.ch/backend/api/momentum/materials/ds-materials/07_MLEngineering/day4/pages/MLE_D4_LC1_California_Prediction.html" title="momentum"> Live Coding 1, Simple Prediction Notebook</a>
</strong></nav>

<a id='I' name="I"></a>
## [Introduction](#P0)

This notebook is a minimal example of a regression experiment on the California Housing Dataset. It is inspired by the exercise from day 2 of the [Machine Learning Module](https://monolith.propulsion-home.ch/backend/api/momentum/materials/ds-materials/04_MachineLearning/day2/pages/07_exercises.html).

The modelling and data cleaning are deliberately simple, so that you can focus on MLOps concepts.

<a id='SU' name="SU"></a>
## [Set up](#P0)

### Packages

In [2]:
import numpy as np
import pandas as pd
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn.pipeline import Pipeline
import pydantic as pyd


### User Dependent Variables

In [3]:
data_path = "./data/raw/california_housing_0.csv"

<a id='P1'></a>
## [Data Preparation](#P0)

In [4]:
data = pd.read_csv(data_path)

### Data Exploration

In [5]:
data.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0


In [6]:
data.shape

(5000, 9)

In [7]:
data.isna().sum()

longitude              0
latitude               0
housing_median_age     0
total_rooms            0
total_bedrooms        53
population             0
households             0
median_income          0
median_house_value     0
dtype: int64

### Data Cleaning

In [8]:
data = data.drop(columns="total_bedrooms")

### Train-Test Split

In [9]:
data_train, data_test = train_test_split(data, test_size=0.33, random_state=0)

In [10]:
data_train.shape, data_test.shape

((3350, 8), (1650, 8))

In [11]:
# Select X and y values (predictors and outcome)
X_train = data_train.drop(columns="median_house_value")
y_train = data_train["median_house_value"]

In [12]:
X_test = data_test.drop(columns="median_house_value")
y_test = data_test["median_house_value"]

In [13]:
X_train.shape, X_test.shape

((3350, 7), (1650, 7))

<a id='P2' name="P2"></a>
## [Modelling](#P0)

### Pipeline Definition

In [14]:
sc = StandardScaler()
lin_reg = LinearRegression()
pipeline_mlr = Pipeline([("data_scaling", sc), ("estimator", lin_reg)])

### Model Fit

In [15]:
pipeline_mlr.fit(X_train, y_train)
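Putting the scaler inside the pipeline means its statistics are learned during `fit` and reused unchanged during `predict`. The sketch below illustrates this on synthetic data (the arrays and the relationship between `X` and `y` are assumptions made for the example, not the housing data); the step names match the ones used in this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): two features on very different scales.
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(0, 1, 50), rng.normal(1000, 200, 50)])
y = 2 * X[:, 0] + 3 * X[:, 1] + 1

pipe = Pipeline([("data_scaling", StandardScaler()), ("estimator", LinearRegression())])
pipe.fit(X, y)

# The scaler's statistics are learned during fit and stored on the step ...
scaler = pipe.named_steps["data_scaling"]
print(scaler.mean_)

# ... and predict() reuses them, so new rows are scaled exactly like the training rows.
manual = pipe.named_steps["estimator"].predict(scaler.transform(X[:5]))
print(np.allclose(manual, pipe.predict(X[:5])))
```

Because the scaler travels with the estimator, saving the pipeline also saves the scaling parameters.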

<a id='P3' name="P3"></a>
## [Model Evaluation](#P0)

In [16]:
import pickle
with open("./src/models/trained_model.pkl", "wb+") as f:
    pickle.dump(pipeline_mlr, f)
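Persisting the fitted pipeline is what lets prediction run separately from training. A minimal sketch of the save-and-reload round trip follows. It uses a plain dictionary as a stand-in for `pipeline_mlr` and a temporary directory instead of `./src/models/`; both are assumptions made so the snippet runs on its own.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for the fitted pipeline (assumption: any picklable object round-trips the same way).
stand_in_model = {"coef": [1.5, -0.3], "intercept": 2.0}

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical target folder; create it first, since open() does not create parent directories.
    model_dir = Path(tmp_dir) / "src" / "models"
    model_dir.mkdir(parents=True, exist_ok=True)
    model_path = model_dir / "trained_model.pkl"

    # Write the object in binary mode ...
    with open(model_path, "wb") as f:
        pickle.dump(stand_in_model, f)

    # ... and read it back, as a separate prediction script would.
    with open(model_path, "rb") as f:
        reloaded_model = pickle.load(f)

print(reloaded_model == stand_in_model)
```

Only unpickle files that you created yourself or otherwise trust: `pickle.load` can execute arbitrary code stored in the file.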

In [17]:
predictions_mlr = pipeline_mlr.predict(X_test)

###-----------

In [18]:
X_test.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,population,households,median_income
398,-122.29,37.89,52.0,979.0,374.0,153.0,5.1675
3833,-118.44,34.2,35.0,1717.0,1628.0,495.0,2.5197
4836,-118.29,34.04,32.0,432.0,702.0,186.0,2.1471
4572,-118.27,34.07,32.0,1657.0,2071.0,598.0,2.1135
636,-122.16,37.72,10.0,2229.0,877.0,485.0,3.3431


In [19]:
predictions_mlr

array([261798.58448047, 167748.87186656, 146574.29536721, ...,
        50622.42473071, 175935.40919334, 174491.35266133])

## Make model and input data types explicit

In [43]:
def robust_predict(model: sklearn.pipeline.Pipeline, input_data: pd.DataFrame) -> np.ndarray:
    return model.predict(input_data)

## Add docstrings

In [44]:
def robust_predict(model: sklearn.pipeline.Pipeline, input_data: pd.DataFrame) -> np.ndarray:
    """This function outputs an inference from an sklearn pipeline using a dataframe for input data

    Args:
        model (sklearn.pipeline.Pipeline): model to be used for inference
        input_data (pd.DataFrame): dataframe of features (X) to be used for inference

    Returns:
        np.ndarray: array with housing price predictions
    """
    return model.predict(input_data)
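Type hints and docstrings document what `robust_predict` expects, but Python does not check them when the function is called. The sketch below illustrates this with a hypothetical `add_one` function that is not part of the notebook: a call with the wrong argument type only fails once the body runs, which is why explicit input validation is still needed.

```python
def add_one(values: list[float]) -> list[float]:
    """Return a new list with 1 added to every element."""
    return [v + 1 for v in values]


# The annotation is available for documentation and static analysis tools ...
print(add_one.__annotations__["values"])

# ... but it is not enforced at call time: a tuple is accepted without complaint.
print(add_one((1.0, 2.0)))

# A string also gets past the signature and only fails inside the body.
try:
    add_one("12")
except TypeError as err:
    print(f"TypeError raised inside the function body: {err}")
```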

## Include data validation with pydantic



In [36]:
X_test.iloc[0]

longitude            -122.2900
latitude               37.8900
housing_median_age     52.0000
total_rooms           979.0000
population            374.0000
households            153.0000
median_income           5.1675
Name: 398, dtype: float64

In [22]:
class InputDataModel(pyd.BaseModel):
    longitude: float
    latitude: float
    housing_median_age: float
    total_rooms: float
    population: float
    households: float
    median_income: float

    @pyd.field_validator("longitude")
    @classmethod
    def enforce_longitude_range(cls, value):
        # California lies west of the prime meridian, so longitudes must be negative
        if value > 0:
            raise ValueError("longitude should be lower than 0")
        return value

#---
input_data = InputDataModel(**X_test.iloc[0].to_dict())
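The cell above only exercises the happy path. A minimal sketch of what rejection looks like, assuming pydantic v2 and a reduced, hypothetical `HousingRow` model with two of the seven fields:

```python
from pydantic import BaseModel, ValidationError, field_validator


class HousingRow(BaseModel):
    # Reduced, hypothetical version of the data model: two fields only.
    longitude: float
    median_income: float

    @field_validator("longitude")
    @classmethod
    def enforce_longitude_range(cls, value: float) -> float:
        if value > 0:
            raise ValueError("longitude should be lower than 0")
        return value


# A valid row passes and is returned as a typed object.
valid_row = HousingRow(longitude=-122.29, median_income=5.1675)
print(valid_row.longitude)

# Rows that break the custom rule or carry a non-numeric value are rejected.
bad_rows = [
    {"longitude": 122.29, "median_income": 5.1675},   # positive longitude
    {"longitude": -122.29, "median_income": "high"},  # cannot be read as a float
]
rejected_fields = []
for row in bad_rows:
    try:
        HousingRow(**row)
    except ValidationError as err:
        # err.errors() holds one entry per failing field; "loc" names the field.
        rejected_fields.append(err.errors()[0]["loc"][0])

print(rejected_fields)
```

Both the built-in type check and the custom validator surface as the same `ValidationError`, so calling code needs only one `except` clause.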

## Final version: Import from script


In [41]:
class InputDataModel(pyd.BaseModel):
    longitude: float
    latitude: float
    housing_median_age: float
    total_rooms: float
    population: float
    households: float
    median_income: float

    @pyd.field_validator("longitude")
    @classmethod
    def enforce_longitude_range(cls, value):
        if value > 0:
            raise ValueError("longitude should be lower than 0")
        return value

def robust_predict(model: sklearn.pipeline.Pipeline, input_data: pd.DataFrame, Validator: type[pyd.BaseModel] = InputDataModel) -> np.ndarray:
    """This function outputs an inference from an sklearn pipeline using a dataframe for input data

    Args:
        model (sklearn.pipeline.Pipeline): model to be used for inference
        input_data (pd.DataFrame): dataframe of features (X) to be used for inference
        Validator (type[pyd.BaseModel], optional): data model every row must satisfy. Defaults to InputDataModel.

    Raises:
        TypeError: if any row fails validation

    Returns:
        np.ndarray: array with housing price predictions
    """
    try:
        # Building the data model from a row raises ValidationError if the row is invalid
        input_data.apply(lambda row: Validator(**row.to_dict()), axis=1)
    except pyd.ValidationError as e:
        # Chain the original error so the failing field stays visible in the traceback
        raise TypeError("input data is wrong, my friend!") from e

    return model.predict(input_data)

#---
robust_predict(model=pipeline_mlr, input_data=X_test)

array([261798.58448047, 167748.87186656, 146574.29536721, ...,
        50622.42473071, 175935.40919334, 174491.35266133])

In [None]:
from src.models.predict_model import robust_predict

In [None]:
import sklearn
import pydantic as pyd

class InputData(pyd.BaseModel):
    longitude: float
    latitude: float
    housing_median_age: float
    total_rooms: float
    population: float
    households: float
    median_income: float

    @pyd.field_validator("longitude")
    def enforce_longitude_range(cls, value):
        if value > 0:
            raise ValueError("longitude should be lower than 0")
        return value



def robust_predict(model:sklearn.pipeline.Pipeline, input_data: pd.DataFrame, data_model:pyd.BaseModel)->np.ndarray:
    """This function outputs an inference from an sklearn pipeline using a dataframe for input data

    Args:
        model (sklearn.pipeline.Pipeline): model to be used for inference
        input_data (pd.DataFrame):dataframe of x to be used for inference

    Returns:
        np.ndarray:array with housing price prediction
    """
    # check data type with dataModel
    validated_inputs = []
    for n,v in input_data.iterrows():
        print(dict(v))
        validated_data = data_model(**dict(v))
        validated_inputs.append(validated_data)


    return model.predict(pd.DataFrame(validated_data))


In [None]:
robust_predict(pipeline_mlr, X_test, InputData)

{'longitude': -122.29, 'latitude': 37.89, 'housing_median_age': 52.0, 'total_rooms': 979.0, 'population': 374.0, 'households': 153.0, 'median_income': 5.1675}
{'longitude': -118.44, 'latitude': 34.2, 'housing_median_age': 35.0, 'total_rooms': 1717.0, 'population': 1628.0, 'households': 495.0, 'median_income': 2.5197}
{'longitude': -118.29, 'latitude': 34.04, 'housing_median_age': 32.0, 'total_rooms': 432.0, 'population': 702.0, 'households': 186.0, 'median_income': 2.1471}
{'longitude': -118.27, 'latitude': 34.07, 'housing_median_age': 32.0, 'total_rooms': 1657.0, 'population': 2071.0, 'households': 598.0, 'median_income': 2.1135}
{'longitude': -122.16, 'latitude': 37.72, 'housing_median_age': 10.0, 'total_rooms': 2229.0, 'population': 877.0, 'households': 485.0, 'median_income': 3.3431}
{'longitude': -124.17, 'latitude': 40.78, 'housing_median_age': 39.0, 'total_rooms': 1606.0, 'population': 731.0, 'households': 327.0, 'median_income': 1.6369}
{'longitude': -121.54, 'latitude': 39.51,



ValueError: could not convert string to float: 'longitude'

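The call fails because the last line of `robust_predict` builds the DataFrame from `validated_data` (the last validated row, a single pydantic object) instead of `validated_inputs` (the list of all rows). Iterating a pydantic model yields `(field name, value)` pairs, so the string `'longitude'` reaches the scaler as if it were a feature value.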
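The `ValueError` in the output above can be reproduced in isolation. The sketch below, assuming pydantic v2 and a hypothetical two-field `Row` model, compares a DataFrame built from a single model object with one built from a list of dumped rows.

```python
import pandas as pd
from pydantic import BaseModel


class Row(BaseModel):
    # Hypothetical two-field model, kept small for illustration.
    longitude: float
    median_income: float


rows = [
    Row(longitude=-122.29, median_income=5.1675),
    Row(longitude=-118.44, median_income=2.5197),
    Row(longitude=-118.29, median_income=2.1471),
]

# Passing one model object: pandas iterates it as (name, value) pairs,
# so the field names land in the first column as strings.
wrong = pd.DataFrame(rows[-1])
print(wrong)

# Passing the dumped rows: one row per input, one column per field.
right = pd.DataFrame([r.model_dump() for r in rows])
print(right)
```

The second form keeps the column names and numeric dtypes the pipeline was trained on.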


In [None]:
import pickle
with open("../src/models/trained_model.pkl", "wb+") as f:
    pickle.dump(pipeline_mlr, f)

In [None]:
### Production-ready predict function
import pickle
from typing import Callable

import sklearn
from pydantic import BaseModel, field_validator, ValidationError


class PredictionInput(BaseModel):
    longitude: float
    latitude: float
    housing_median_age: float
    total_rooms: float
    population: float
    households: float
    median_income: float

    @field_validator("longitude")
    @classmethod
    def longitude_is_negative(cls, value):
        if value > 0:
            raise ValueError("longitude should be lower than 0")
        return value


def verify_input_with_pydantic_dm(input_data: pd.DataFrame, DataModel: type[BaseModel] = PredictionInput) -> pd.DataFrame:
    """Use a pydantic data model to check each row of a pandas dataframe with input data.

    Args:
        input_data (pd.DataFrame): dataframe to check
        DataModel (type[BaseModel], optional): data model defining data expectations. Defaults to PredictionInput.

    Raises:
        ValueError: if data model validation raises an error

    Returns:
        pd.DataFrame: verified data
    """
    verified_inputs = []
    for n, row in input_data.iterrows():
        try:
            verified_input = DataModel(**dict(row))
            # model_dump() replaces the .dict() method deprecated in pydantic v2
            verified_inputs.append(verified_input.model_dump())
        except ValidationError as e:
            raise ValueError(f"input_data on line {n} does not conform to expectations: {e}") from e

    return pd.DataFrame(verified_inputs)


def robust_predict(model_path: str, input_data: pd.DataFrame, DataVerifier: Callable[[pd.DataFrame], pd.DataFrame]) -> np.ndarray:
    """Output predictions given the path to a pickled sklearn pipeline and a dataframe of input data.

    Created to decouple model building and prediction and ensure conformity of input data.

    Args:
        model_path (str): path to the trained model
        input_data (pd.DataFrame): dataframe with input data for predictions
        DataVerifier (Callable): function that checks conformity of the input data and returns the verified dataframe

    Returns:
        np.ndarray: output array of floats giving the predicted price of houses
    """
    with open(model_path, "rb") as f:
        model = pickle.load(f)

    verified_data = DataVerifier(input_data)

    return model.predict(verified_data)

#-------
y_pred = robust_predict("../src/models/trained_model.pkl", X_test, verify_input_with_pydantic_dm)
y_pred

array([261798.58448047, 167748.87186656, 146574.29536721, ...,
        50622.42473071, 175935.40919334, 174491.35266133])

In [None]:
# Test score
pipeline_mlr.score(X_test, y_test)

0.6303801772866726

In [None]:
print("MAE", metrics.mean_absolute_error(y_test, predictions_mlr))
print("MSE", metrics.mean_squared_error(y_test, predictions_mlr))
print("RMSE", np.sqrt(metrics.mean_squared_error(y_test, predictions_mlr)))
print("Explained Var Score", metrics.explained_variance_score(y_test, predictions_mlr))

MAE 46722.86268024486
MSE 4155629700.837575
RMSE 64464.17377766952
Explained Var Score 0.6304406174521564
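The metrics printed above can be reproduced from their definitions. A minimal sketch with NumPy on made-up numbers (the arrays are illustrative and not taken from the housing data); it also shows why the R² returned by `score` and the explained variance score can differ slightly, as they do in this notebook.

```python
import numpy as np

# Illustrative values only (not from the notebook's data).
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

errors = y_true - y_pred  # [1, 0, -1, -2]

mae = np.mean(np.abs(errors))  # (1 + 0 + 1 + 2) / 4
mse = np.mean(errors ** 2)     # (1 + 0 + 1 + 4) / 4
rmse = np.sqrt(mse)

# R^2, the value a regressor's score() returns: 1 - SS_res / SS_tot.
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# Explained variance: 1 - Var(errors) / Var(y_true).
# It matches R^2 only when the mean error is zero; here the mean error is -0.5.
explained_var = 1 - np.var(errors) / np.var(y_true)

print("MAE", mae)
print("MSE", mse)
print("RMSE", rmse)
print("R2", r2)
print("Explained Var Score", explained_var)
```

MAE and RMSE are in the units of the target (here, house value), which makes them easier to interpret than MSE.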


<a id='CL'></a>
## [Conclusion](#P0)

This notebook shows a simple modelling experiment. We will use it as the base for building our machine learning project.

<div style="border-top:0.1cm solid #EF475B"></div>
    <strong><a href='#Q0'><div style="text-align: right"> <h3>End of this Notebook.</h3></div></a></strong>