## Exercise 1 MLOps

In this exercise, do the following:
1. Create a function that preprocesses new Ames data in the same way as the original Ames data was preprocessed in step 5 of the `MLOps.ipynb` notebook.
2. Create a function that takes a new Ames dataset and a model as input. The function should preprocess the new data and evaluate the model on it using mean absolute error.
3. Test the function from 2. on the "NewAmesData1.csv" dataset and the best model from the `MLOps.ipynb` notebook.
4. Test the function from 2. on the "NewAmesData2.csv" dataset and the best model from the `MLOps.ipynb` notebook. Do you see any drift?
5. Do you see a data drift in "NewAmesData2.csv"? If so, for which variables?
6. Do you see a data drift in "NewAmesData4.csv"? If so, for which variables?

In [110]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
from sklearn.metrics import mean_absolute_error
import pickle
from scipy.stats import ks_2samp, entropy


### 1. Create a function that preprocesses new Ames data in the same way as the original Ames data was preprocessed in step 5 of the `MLOps.ipynb` notebook.

In [111]:
def preprocess_ames_data(df):
    """
    Preprocess new Ames housing data by converting the categorical columns
    'Bldg Type' and 'Neighborhood' into dummy variables, and then dropping them.
    
    Parameters:
        df (pd.DataFrame): The input Ames dataset containing at least 'Bldg Type' and 'Neighborhood'.
        
    Returns:
        pd.DataFrame: The preprocessed dataset with dummy variables.
    """
    df_processed = df.copy()
    
    # Create dummy variables for 'Bldg Type'
    bldg_dummies = pd.get_dummies(df_processed['Bldg Type'], drop_first=True, dtype='int', prefix='BType')
    df_processed = df_processed.join(bldg_dummies)
    
    # Create dummy variables for 'Neighborhood'
    nbh_dummies = pd.get_dummies(df_processed['Neighborhood'], drop_first=True, dtype='int', prefix='Nbh')
    df_processed = df_processed.join(nbh_dummies)
    
    # Drop the original categorical columns
    df_processed.drop(columns=['Bldg Type', 'Neighborhood'], inplace=True)
    
    return df_processed
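
One caveat with this approach: `pd.get_dummies` only creates columns for the categories present in the data it is given, and `drop_first=True` drops whichever category sorts first in that data. A new file that lacks a neighborhood, or contains an unseen one, would therefore produce different columns than the model was trained on. The sketch below shows one possible way to guard against that by reindexing to the training columns. The helper `align_to_training_columns` and the list `train_columns` are hypothetical names, not part of the `MLOps.ipynb` notebook; for a scikit-learn model fitted on a DataFrame, the column list could plausibly be taken from `model.feature_names_in_`.

```python
import pandas as pd


def align_to_training_columns(df_processed, train_columns):
    """Return df_processed with exactly the columns in train_columns, in that order.

    Dummy columns missing from the new data are added and filled with 0;
    columns the model never saw are dropped.
    """
    return df_processed.reindex(columns=list(train_columns), fill_value=0)


# Toy example (made-up columns): the new data lacks 'Nbh_Sawyer'
# and contains an unseen 'Nbh_Other'.
train_columns = ["Lot Area", "Nbh_NAmes", "Nbh_Sawyer"]
new_processed = pd.DataFrame(
    {"Lot Area": [9000, 12000], "Nbh_Other": [1, 0], "Nbh_NAmes": [0, 1]}
)
aligned = align_to_training_columns(new_processed, train_columns)
print(aligned)
```

Silently dropping unseen categories hides a possible drift signal, so it may be worth logging which columns were added or removed before aligning.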

### 2. Create a function that takes as input a new Ames dataset and a model. The function should preprocess the new data and evaluate the model on that new data using mean absolute error.

In [112]:
def evaluate_model_on_new_data(new_data, model):
    """
    Preprocess the new Ames data and evaluate the given model using MAE.
    
    Parameters:
        new_data (pd.DataFrame): New Ames dataset that includes the 'SalePrice' column.
        model: A trained ML model with a .predict() method.
        
    Returns:
        float: Mean Absolute Error (MAE) of the model predictions on the new data.
    """
    # Preprocess the new data
    new_data_processed = preprocess_ames_data(new_data)
    
    # Separate features and target
    X_new = new_data_processed.drop(columns=["SalePrice"])
    y_new = new_data_processed["SalePrice"]
    
    # Make predictions and calculate MAE
    y_pred = model.predict(X_new)
    mae = mean_absolute_error(y_new, y_pred)
    
    return mae

### 3. Test the function from 2. on the "NewAmesData1.csv" dataset and the best model from the `MLOps.ipynb` notebook.

In [113]:
with open("model_rf_500.pkl", "rb") as f:
    model_final = pickle.load(f)

In [114]:
# Load the new Ames dataset
new_ames_data1 = pd.read_csv("NewAmesData1.csv")

# Evaluate the model on the new data
mae_new1 = evaluate_model_on_new_data(new_ames_data1, model_final)

print("Mean Absolute Error on NewAmesData1.csv:", mae_new1)

Mean Absolute Error on NewAmesData1.csv: 19358.33538293174


### 4. Test the function from 2. on the "NewAmesData2.csv" dataset and the best model from the `MLOps.ipynb` notebook. Do you see any drift?

In [115]:
# Load the new Ames dataset
new_ames_data2 = pd.read_csv("NewAmesData2.csv")

# Evaluate the model on the new data
mae_new2 = evaluate_model_on_new_data(new_ames_data2, model_final)

print("Mean Absolute Error on NewAmesData2.csv:", mae_new2)

Mean Absolute Error on NewAmesData2.csv: 122642.93781866827


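Yes, we see drift: the MAE rose from about 19,358 on NewAmesData1.csv to about 122,643 on NewAmesData2.csv, roughly a 6.3-fold increase. The two-sample KS test below compares the `SalePrice` distributions of the two files.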

In [116]:
ks_2samp(new_ames_data1['SalePrice'], new_ames_data2['SalePrice'])

KstestResult(statistic=np.float64(0.6208277703604806), pvalue=np.float64(2.4920189035284662e-135), statistic_location=np.float64(78441.7083), statistic_sign=np.int8(-1))

### 5. Do you see a data drift in "NewAmesData2.csv"? If so, for which variables?

In [120]:
new_ames_data1.describe()

| | Lot Area | Overall Cond | Year Built | Gr Liv Area | TotRms AbvGrd | Mo Sold | Yr Sold | SalePrice |
|---|---|---|---|---|---|---|---|---|
| count | 749.0 | 749.0 | 749.0 | 749.0 | 749.0 | 749.0 | 749.0 | 749.0 |
| mean | 10077.269693 | 5.058745 | 1971.823765 | 1481.263017 | 5.870494 | 6.325768 | 2007.736983 | 181129.140187 |
| std | 5431.422205 | 1.226607 | 32.027285 | 491.585274 | 1.69398 | 2.853957 | 1.346011 | 84355.480982 |
| min | 1471.0 | 1.0 | 1865.0 | 409.0 | 2.0 | 1.0 | 2006.0 | 12796.0 |
| 25% | 7590.0 | 4.0 | 1951.0 | 1113.0 | 5.0 | 4.0 | 2007.0 | 128708.0 |
| 50% | 9421.0 | 5.0 | 1975.0 | 1442.0 | 6.0 | 6.0 | 2008.0 | 159453.0 |
| 75% | 11599.0 | 6.0 | 1998.0 | 1752.0 | 7.0 | 8.0 | 2009.0 | 211905.0 |
| max | 70207.0 | 9.0 | 2022.0 | 4669.0 | 13.0 | 12.0 | 2010.0 | 621346.0 |


In [121]:
new_ames_data2.describe()

| | Lot Area | Overall Cond | Year Built | Gr Liv Area | TotRms AbvGrd | Mo Sold | Yr Sold | SalePrice |
|---|---|---|---|---|---|---|---|---|
| count | 749.0 | 749.0 | 749.0 | 749.0 | 749.0 | 749.0 | 749.0 | 749.0 |
| mean | 10077.269693 | 5.058745 | 1971.823765 | 1481.263017 | 5.870494 | 6.325768 | 2007.736983 | 115736.106582 |
| std | 5431.422205 | 1.226607 | 32.027285 | 491.585274 | 1.69398 | 2.853957 | 1.346011 | 161832.682309 |
| min | 1471.0 | 1.0 | 1865.0 | 409.0 | 2.0 | 1.0 | 2006.0 | -80248.6992 |
| 25% | 7590.0 | 4.0 | 1951.0 | 1113.0 | 5.0 | 4.0 | 2007.0 | 7215.642 |
| 50% | 9421.0 | 5.0 | 1975.0 | 1442.0 | 6.0 | 6.0 | 2008.0 | 35803.9088 |
| 75% | 11599.0 | 6.0 | 1998.0 | 1752.0 | 7.0 | 8.0 | 2009.0 | 208342.8996 |
| max | 70207.0 | 9.0 | 2022.0 | 4669.0 | 13.0 | 12.0 | 2010.0 | 681099.7182 |


The summary statistics of all numeric features are identical in the two files; the only column that changes is `SalePrice`. Something strange is going on there, though: the minimum is negative (-80,248.70), and a sale price cannot be negative.
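
Comparing `describe()` tables by eye works for a handful of columns, but the same check can be automated. The sketch below is one possible way to answer "for which variables?": it runs the two-sample KS test already used on `SalePrice` above on every numeric column shared by two DataFrames. The function name `ks_drift_report` and the 0.05 threshold are assumptions made for illustration, and the demo uses synthetic data rather than the Ames files.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp


def ks_drift_report(reference, current, alpha=0.05):
    """Run a two-sample KS test on every numeric column shared by both frames."""
    numeric_cols = reference.select_dtypes(include="number").columns
    rows = []
    for col in numeric_cols:
        if col not in current.columns:
            continue
        result = ks_2samp(reference[col].dropna(), current[col].dropna())
        rows.append(
            {
                "column": col,
                "ks_statistic": result.statistic,
                "p_value": result.pvalue,
                # Flag the column when the p-value falls below the chosen threshold.
                "drift": bool(result.pvalue < alpha),
            }
        )
    return pd.DataFrame(rows).sort_values("ks_statistic", ascending=False)


# Synthetic stand-in data: column 'a' is unchanged, column 'b' is shifted.
rng = np.random.default_rng(0)
ref = pd.DataFrame({"a": rng.normal(0, 1, 500), "b": rng.normal(0, 1, 500)})
cur = pd.DataFrame({"a": ref["a"].to_numpy(), "b": rng.normal(3, 1, 500)})
report = ks_drift_report(ref, cur)
print(report)
```

With the notebook's data the call would be `ks_drift_report(new_ames_data1, new_ames_data2)`. The categorical columns `Bldg Type` and `Neighborhood` are not covered by this sketch and would need a separate comparison of category frequencies.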

In [None]:
new_ames_data2[new_ames_data2['SalePrice'] < 0]

| | Lot Area | Overall Cond | Year Built | Gr Liv Area | TotRms AbvGrd | Mo Sold | Yr Sold | Bldg Type | Neighborhood | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10738 | 6 | 1954 | 1457 | 5 | 8 | 2009 | 1Fam | NAmes | -9790.3818 |
| 4 | 14033 | 4 | 1992 | 1357 | 5 | 10 | 2009 | 1Fam | Mitchel | -57592.0466 |
| 17 | 10871 | 4 | 1948 | 1245 | 5 | 5 | 2008 | 1Fam | NAmes | -13582.5482 |
| 22 | 17581 | 4 | 1915 | 878 | 4 | 4 | 2006 | 1Fam | Sawyer | -79560.3207 |
| 29 | 12121 | 5 | 1977 | 1421 | 7 | 6 | 2008 | 1Fam | NWAmes | -37094.5932 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 730 | 10724 | 3 | 1971 | 1645 | 5 | 7 | 2008 | 1Fam | Sawyer | -10152.1452 |
| 736 | 13211 | 4 | 1938 | 1223 | 6 | 6 | 2007 | 1Fam | Crawfor | -49638.5279 |
| 738 | 10851 | 3 | 1930 | 709 | 4 | 8 | 2006 | 1Fam | NAmes | -5751.5686 |
| 741 | 13153 | 5 | 1956 | 1721 | 9 | 11 | 2008 | Duplex | NAmes | -44992.0488 |


So, drop or replace? The two files appear to describe the same houses at different points in time, so we could replace each negative sale price with the last known positive price for that house. For now I'll just drop them; this only affects 132 of the 749 rows.
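
For comparison, the "replace" option could look roughly like the sketch below. It rests on a strong assumption that is not verified in this notebook: that row *i* in both files refers to the same house, so the price in NewAmesData1.csv can serve as the "last known positive price". The function name `replace_negative_prices` is hypothetical, and the demo uses toy frames instead of the real files.

```python
import pandas as pd


def replace_negative_prices(current, reference, price_col="SalePrice"):
    """Replace negative prices in `current` with the price at the same index in `reference`.

    Assumes both frames share an index that identifies the same house.
    """
    fixed = current.copy()
    is_valid = fixed[price_col] >= 0
    # Series.where keeps values where the condition holds and takes `other` elsewhere.
    fixed[price_col] = fixed[price_col].where(is_valid, reference[price_col])
    return fixed


# Toy frames standing in for NewAmesData1 (reference) and NewAmesData2 (current).
reference = pd.DataFrame({"SalePrice": [150000.0, 200000.0, 98000.0]})
current = pd.DataFrame({"SalePrice": [-9790.38, 210000.0, -13582.55]})
fixed = replace_negative_prices(current, reference)
print(fixed)
```

This keeps all rows but mixes prices from two points in time, which can mask the very drift being investigated; dropping the rows is the simpler and more transparent choice here.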

In [123]:
new_ames_data2_clean = new_ames_data2[new_ames_data2['SalePrice'] >= 0]

In [124]:
new_ames_data2_clean[new_ames_data2_clean['SalePrice'] < 0]

| | Lot Area | Overall Cond | Year Built | Gr Liv Area | TotRms AbvGrd | Mo Sold | Yr Sold | Bldg Type | Neighborhood | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|


In [125]:
new_ames_data2_clean.describe()


| | Lot Area | Overall Cond | Year Built | Gr Liv Area | TotRms AbvGrd | Mo Sold | Yr Sold | SalePrice |
|---|---|---|---|---|---|---|---|---|
| count | 617.0 | 617.0 | 617.0 | 617.0 | 617.0 | 617.0 | 617.0 | 617.0 |
| mean | 9755.862237 | 5.082658 | 1974.179903 | 1515.218801 | 5.923825 | 6.364668 | 2007.753647 | 145032.787927 |
| std | 5900.405248 | 1.203879 | 32.610029 | 507.13967 | 1.69674 | 2.843011 | 1.359523 | 163871.250758 |
| min | 1471.0 | 1.0 | 1865.0 | 409.0 | 2.0 | 1.0 | 2006.0 | 336.062 |
| 25% | 7170.0 | 4.0 | 1953.0 | 1135.0 | 5.0 | 4.0 | 2007.0 | 22982.4292 |
| 50% | 8809.0 | 5.0 | 1981.0 | 1471.0 | 6.0 | 6.0 | 2008.0 | 52608.0663 |
| 75% | 11165.0 | 6.0 | 2001.0 | 1790.0 | 7.0 | 8.0 | 2009.0 | 297437.02 |
| max | 70207.0 | 9.0 | 2022.0 | 4669.0 | 13.0 | 12.0 | 2010.0 | 681099.7182 |
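
Even with the negative prices removed, `SalePrice` still looks shifted: the median is about 52,608 here versus 159,453 in NewAmesData1.csv. One way to put a number on such a shift is the population stability index (PSI). The sketch below is a minimal NumPy version; the function name, the choice of 10 quantile bins, and the small `eps` used for empty bins are assumptions, and the demo uses synthetic values rather than the Ames files. A frequently quoted rule of thumb treats a PSI above roughly 0.25 as a large shift, but that cutoff is a convention and not a statistical test.

```python
import numpy as np


def population_stability_index(reference, current, n_bins=10, eps=1e-6):
    """PSI between two 1-D samples, using quantile bins from the reference sample."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)
    # Interior bin edges are quantiles of the reference sample;
    # the first and last bins are open-ended.
    edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1))[1:-1]
    ref_counts = np.bincount(np.searchsorted(edges, reference), minlength=n_bins)
    cur_counts = np.bincount(np.searchsorted(edges, current), minlength=n_bins)
    # eps avoids log(0) and division by zero for empty bins.
    ref_frac = np.clip(ref_counts / ref_counts.sum(), eps, None)
    cur_frac = np.clip(cur_counts / cur_counts.sum(), eps, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))


# Synthetic prices: the second sample is shifted downwards.
rng = np.random.default_rng(0)
ref_prices = rng.normal(180_000, 80_000, 1000)
psi_same = population_stability_index(ref_prices, ref_prices)
psi_shifted = population_stability_index(ref_prices, rng.normal(60_000, 80_000, 1000))
print(psi_same, psi_shifted)
```

With the notebook's data the call would be `population_stability_index(new_ames_data1["SalePrice"], new_ames_data2_clean["SalePrice"])`.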


## Exercise 1 mlflow

In this exercise, do the following:
1. Load the dataset used in the time series example - Energy consumption data. You can find it in the notebook "TSA_Example" in Time Series folder in Moodle.
2. Set up a nested MLFlow loop in which different modelling experiments can be tracked, and then use the dataset from point 1 to experiment and track models. You should cover the following combinations:
    1. At least 3 model types
    2. At least 3 different feature combinations
    3. At least 3 different options for 3 different hyperparameters
    4. At least 3 different time splits for train test
3. For each option in the combination, you should calculate & log the following in MLFlow:
    1. RMSE
    2. MAE
    3. Plot of actual vs predicted for 1 month data
    4. Plot of actual vs predicted for 1 week of data
    5. All of the combination info from point 2, such as which model, which feature combination, which hyperparameters, and which train/test split were used
4. Turn on MLFlow UI and track your experiments
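
The combinations in point 2 multiply quickly, so it can help to enumerate them before wiring up any tracking. The sketch below is one possible way to build the grid and compute the two required metrics using only `itertools` and NumPy. Every model name, feature name, hyperparameter value, and split date in it is a placeholder, not taken from the "TSA_Example" notebook, and in practice each model type would likely need its own hyperparameter grid. Each yielded dict could then be logged in one child run opened with `mlflow.start_run(nested=True)` inside a parent run, using `mlflow.log_params`, `mlflow.log_metric` and `mlflow.log_figure`; the MLflow calls themselves are left out so that the sketch runs without a tracking server.

```python
from itertools import product

import numpy as np

# All names and values below are placeholders chosen for illustration.
MODEL_TYPES = ["linear_regression", "random_forest", "gradient_boosting"]
FEATURE_SETS = {
    "calendar": ["hour", "dayofweek", "month"],
    "calendar_lag": ["hour", "dayofweek", "month", "lag_24h"],
    "calendar_lag_roll": ["hour", "dayofweek", "month", "lag_24h", "rolling_mean_7d"],
}
HYPERPARAMS = {
    "n_estimators": [100, 300, 500],
    "max_depth": [3, 6, 9],
    "learning_rate": [0.01, 0.05, 0.1],
}
SPLIT_DATES = ["2015-01-01", "2016-01-01", "2017-01-01"]


def iter_run_configs():
    """Yield one dict per child run: model x feature set x split x hyperparameters."""
    hp_names = list(HYPERPARAMS)
    for model, feature_set, split in product(MODEL_TYPES, FEATURE_SETS, SPLIT_DATES):
        for values in product(*(HYPERPARAMS[name] for name in hp_names)):
            yield {
                "model": model,
                "feature_set": feature_set,
                "split_date": split,
                **dict(zip(hp_names, values)),
            }


def rmse_and_mae(y_true, y_pred):
    """Return (RMSE, MAE) for two equal-length sequences."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(err**2))), float(np.mean(np.abs(err)))


configs = list(iter_run_configs())
# 3 models * 3 feature sets * 3 splits * 27 hyperparameter combinations
print(len(configs))
print(configs[0])
```

Once runs have been logged, the tracking UI is started from a terminal with `mlflow ui`.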