# Lab | Comparing regression models


For this lab, we will be using the same dataset we used in the previous labs. We recommend using the same notebook, since you will be reusing the variables you previously created and used in those labs.

### Instructions

1. In this final lab, we will model our data. Import `train_test_split` from sklearn and split the data into training and test sets.
2. Try a simple linear regression with all the data to see whether we are getting good results.
3. Great! Now define a function that takes a list of models and trains (and tests) them so we can try a lot of them without repeating code.
4. Use the function to check `LinearRegression` and `KNeighborsRegressor`.
5. You can also check the `MLPRegressor` for this task!
6. Check and discuss the results.


In [1]:
# Libraries

import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

pd.set_option('display.max_columns', None)


In [2]:
# Import our database

cleaned_df = pd.read_csv('/Users/leozinho.air/Desktop/ironhack_da/class_12/lab-data-cleaning-and-wrangling/cleaned_customer.csv')
cleaned_df = cleaned_df.drop('Unnamed: 0', axis = 1)
cleaned_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7940 entries, 0 to 7939
Data columns (total 69 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   customer                           7940 non-null   object 
 1   response                           7940 non-null   int64  
 2   effective_to_date                  7940 non-null   object 
 3   gender_f                           7940 non-null   int64  
 4   gender_m                           7940 non-null   int64  
 5   corporate_auto                     7940 non-null   int64  
 6   personal_auto                      7940 non-null   int64  
 7   special_auto                       7940 non-null   int64  
 8   customer_lifetime_value            7940 non-null   float64
 9   total_claim_amount                 7940 non-null   float64
 10  income                             7940 non-null   int64  
 11  monthly_premium_auto               7940 non-null   int64

In [231]:
# Load the original df -> I will use it to get customer_lifetime_value / total_claim_amount NOT normalized

original_df = pd.read_csv('/Users/leozinho.air/Desktop/ironhack_da/class_12/lab-data-cleaning-and-wrangling/original_customer.csv')

# Concatenate the unnormalized columns to the cleaned df

cleaned_df.rename(columns = {'customer_lifetime_value':'customer_lifetime_value_norm',
                                'total_claim_amount':'total_claim_amount_norm'}, inplace = True) # Rename the normalized columns to distinguish them from the original (unnormalized) ones

original_features = original_df.loc[:, ['total_claim_amount', 'customer_lifetime_value']]

cleaned_df = pd.concat([cleaned_df, original_features], axis=1) # Concatenate the two dfs column-wise (rows are matched on the index)

cleaned_df

Unnamed: 0,customer,response,effective_to_date,gender_f,gender_m,corporate_auto,personal_auto,special_auto,customer_lifetime_value_norm,total_claim_amount_norm,income,monthly_premium_auto,months_since_last_claim,months_since_policy_inception,number_of_open_complaints,number_of_policies,state_Arizona,state_California,state_Nevada,state_Oregon,state_Washington,marital_status_Divorced,marital_status_Married,marital_status_Single,policy_combined_Corporate Auto_L1,policy_combined_Corporate Auto_L2,policy_combined_Corporate Auto_L3,policy_combined_Personal Auto_L1,policy_combined_Personal Auto_L2,policy_combined_Personal Auto_L3,policy_combined_Special Auto_L1,policy_combined_Special Auto_L2,policy_combined_Special Auto_L3,renew_offer_type_Offer1,renew_offer_type_Offer2,renew_offer_type_Offer3,renew_offer_type_Offer4,sales_channel_Agent,sales_channel_Branch,sales_channel_Call Center,sales_channel_Web,vehicle_class_Four-Door Car,vehicle_class_Luxury Car,vehicle_class_Luxury SUV,vehicle_class_SUV,vehicle_class_Sports Car,vehicle_class_Two-Door Car,employmentstatus_Disabled,employmentstatus_Employed,employmentstatus_Medical Leave,employmentstatus_Retired,employmentstatus_Unemployed,location_code_Rural,location_code_Suburban,location_code_Urban,vehicle_size_Large,vehicle_size_Medsize,vehicle_size_Small,coverage_Basic,coverage_Extended,coverage_Premium,education_Bachelor,education_College,education_Doctor,education_High School or Below,education_Master,day,week,month,total_claim_amount,customer_lifetime_value
0,BU79786,0,2011-02-24,1,0,1,0,0,0.061875,0.400735,56274,69,32,5,0,1,0,0,0,0,1,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,1,0,1,0,0,1,0,0,0,0,24,8,2,384.811147,2763.519279
1,AI49188,0,2011-02-19,1,0,0,1,0,0.785631,0.589962,48767,108,18,38,0,2,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,1,0,0,0,1,1,0,0,0,0,19,7,2,566.472247,12887.431650
2,WW63253,0,2011-01-20,0,1,1,0,0,0.410913,0.551847,0,106,18,65,0,7,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,1,0,0,1,0,1,0,0,1,0,0,0,0,20,3,1,529.881344,7645.861827
3,HB64268,0,2011-02-03,0,1,0,1,0,0.065462,0.143781,43836,73,12,44,0,1,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,1,0,1,0,0,1,0,0,0,0,3,5,2,138.130879,2813.692575
4,OC83172,1,2011-01-25,1,0,0,1,0,0.454552,0.165918,62902,69,14,94,0,2,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,1,0,0,0,1,0,1,0,0,1,0,0,0,0,25,4,1,159.383042,8256.297800
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7935,YM19146,0,2011-01-06,1,0,0,1,0,0.157448,0.563723,47761,104,16,58,0,1,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,1,0,1,0,0,0,6,1,1,541.282007,4100.398533
7936,PK87824,1,2011-02-12,1,0,1,0,0,0.085681,0.394890,21604,79,14,28,0,1,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,1,0,0,1,0,0,0,12,6,2,379.200000,3096.511217
7937,TD14365,0,2011-02-06,0,1,1,0,0,0.447946,0.823617,0,85,9,37,3,2,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,1,0,0,1,0,1,0,0,0,0,6,5,2,790.784983,8163.890428
7938,UP19263,0,2011-02-03,0,1,0,1,0,0.402232,0.719885,21941,96,34,3,0,3,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,1,0,0,1,0,0,0,3,5,2,691.200000,7524.442436
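The concatenation above only pairs the right rows if `cleaned_df` and `original_df` share the same index, because `pd.concat(..., axis=1)` matches rows by index label rather than by position. The following is a minimal sketch of that behavior on hypothetical toy frames (`left`, `right`, and their columns are made up for illustration and are not part of the lab data):

```python
import pandas as pd

# Hypothetical toy frames: the second one is missing a row but kept its old index labels
left = pd.DataFrame({"a_norm": [0.1, 0.2, 0.3]})          # index 0, 1, 2
right = pd.DataFrame({"a": [10.0, 30.0]}, index=[0, 2])   # index 0, 2

# axis=1 concatenation aligns on index labels (outer join by default),
# so the row with label 1 gets NaN in column "a"
combined = pd.concat([left, right], axis=1)
print(combined)

# A cheap guard before concatenating frames that should describe the same rows
same_rows = left.index.equals(right.index)
print("indexes match:", same_rows)
```

If the two indexes do not match, the result contains `NaN` values instead of raising an error, so checking `index.equals` (or the row counts) first is a low-cost safeguard.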


## Model: Linear Regression

In [232]:
# X - y Split

y = cleaned_df['total_claim_amount']
X = cleaned_df.drop(['total_claim_amount','total_claim_amount_norm','customer_lifetime_value_norm','customer','effective_to_date'],axis = 1)

# Train test split (train_test_split, linear_model and the metrics were already imported in the Libraries cell)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42) # 80% train, 20% test

# Create and fit the model

lm = linear_model.LinearRegression()
lm.fit(X_train, y_train)

# Evaluating the model

train_r2 = lm.score(X_train, y_train) # R^2 on the training set

predictions = lm.predict(X_test) # Predictions for the test set


r2 = r2_score(y_test, predictions)
mse = mean_squared_error(y_test, predictions)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, predictions)

print("R2 value is =", round(r2, 4))
print("The mean squared error of the model is =", round(mse, 2))
print("The root mean squared error of the model is =", round(rmse, 2))
print("The mean absolute error of the model is =", round(mae, 2))


print('\n')
print('The model explains around 80% of the variance in the total claim amount.\nHas an average error of approximately 74.39 in predicting the claim amount.\nIt\'s important to further investigate the significance of individual coefficients and check for model assumption.\nThere is a potential issue of multicollinearity.')

# OLS model (statsmodels), fitted on the training set

import statsmodels.api as sm

# Use new names so the full X and y defined above are not overwritten
X_train_const = sm.add_constant(X_train)
ols_model = sm.OLS(y_train, X_train_const).fit()

ols_model.summary()

R2 value is = 0.7946
The mean squared error of the model is = 9270.09
The root mean squared error of the model is = 96.28
The mean absolute error of the model is = 74.39


The model explains around 80% of the variance in the total claim amount.
Has an average error of approximately 74.39 in predicting the claim amount.
It's important to further investigate the significance of individual coefficients and check for model assumption.
There is a potential issue of multicollinearity.


Dep. Variable:      total_claim_amount   R-squared:            0.785
Model:              OLS                  Adj. R-squared:       0.783
Method:             Least Squares        F-statistic:          450.7
Date:               Thu, 09 Nov 2023     Prob (F-statistic):   0.0
Time:               18:37:39             Log-Likelihood:       -38057.0
No. Observations:   6352                 AIC:                  76220.0
Df Residuals:       6300                 BIC:                  76570.0
Df Model:           51
Covariance Type:    nonrobust

                          coef    std err        t    P>|t|     [0.025     0.975]
const                   4.1665      6.957    0.599    0.549     -9.472     17.805
response              -14.4016      3.847   -3.744    0.000    -21.942     -6.861
gender_f               -0.8150      3.704   -0.220    0.826     -8.075      6.445
gender_m                4.9815      3.678    1.354    0.176     -2.229     12.192
corporate_auto         -3.0961      3.201   -0.967    0.334     -9.372      3.180
personal_auto          -0.0933      2.928   -0.032    0.975     -5.834      5.647
special_auto            7.3559      4.295    1.713    0.087     -1.064     15.776
income                 -0.0002   7.03e-05   -2.243    0.025     -0.000  -1.99e-05
monthly_premium_auto    3.5050      0.213   16.460    0.000      3.088      3.922

Omnibus:          726.047   Durbin-Watson:      2.021
Prob(Omnibus):    0.0       Jarque-Bera (JB):   1398.164
Skew:             0.742     Prob(JB):           2.47e-304
Kurtosis:         4.756     Cond. No.           8.69e+16
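The very large condition number (8.69e+16) supports the multicollinearity remark. `Df Model` is also only 51, although roughly 66 feature columns enter the model (69 reported by `info()`, plus 2 added, minus 5 dropped). One plausible cause, assuming every one-hot group (for example `gender_f` and `gender_m`) sums to 1 in each row, is that each group is an exact linear combination of the intercept. The toy design matrix below is hypothetical and only illustrates the mechanism; it is not taken from the lab data:

```python
import numpy as np

# Hypothetical toy design: an intercept plus BOTH dummies of a binary category,
# mirroring gender_f / gender_m in the summary above
gender_f = np.array([1, 0, 1, 0, 1, 0])
gender_m = 1 - gender_f
const = np.ones_like(gender_f)

X_full = np.column_stack([const, gender_f, gender_m]).astype(float)
X_dropped = np.column_stack([const, gender_f]).astype(float)  # one dummy dropped

# gender_f + gender_m equals the constant column, so X_full loses one rank
rank_full = np.linalg.matrix_rank(X_full)
rank_dropped = np.linalg.matrix_rank(X_dropped)
print("columns:", X_full.shape[1], "rank:", rank_full)
print("columns:", X_dropped.shape[1], "rank:", rank_dropped)
```

Dropping one dummy per group (for instance with `pd.get_dummies(..., drop_first=True)` at encoding time) removes this exact dependence. The individual coefficients and p-values of the dummy columns should be read with caution until that is done.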


## First function: without ChatGPT

In [233]:
# First attempt: a function that trains and tests Linear Regression and KNN on the same split, so the code is not repeated.

y = cleaned_df['total_claim_amount']
X = cleaned_df.drop(['total_claim_amount','total_claim_amount_norm','customer_lifetime_value_norm','customer','effective_to_date'],axis = 1)



def linear_knn_train_evaluate(X, y):
    '''
    This function trains and evaluates two regression models: Linear Regression and K-Nearest Neighbors (KNN)
    It calculates various regression metrics for both models.
    
    Parameters:
    X: Features
    y: Target variable
    '''
    # Import the models
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
    from sklearn.linear_model import LinearRegression
    
    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
    
    # Linear Regression
    lm = LinearRegression()
    model_lm = lm.fit(X_train, y_train)  # Train the Linear Regression model
    predictions_lm = lm.predict(X_test)  # Predict on the test set
    
    # Calculate evaluation metrics for Linear Regression
    r2_lm = r2_score(y_test, predictions_lm)  # R-squared
    mse_lm = mean_squared_error(y_test, predictions_lm)  # Mean Squared Error
    rmse_lm = mse_lm ** 0.5  # Root Mean Squared Error
    mae_lm = mean_absolute_error(y_test, predictions_lm)  # Mean Absolute Error
    
    # Print Linear Regression evaluation results
    print("Linear Regression Evaluation:")
    print("R2 value is =", round(r2_lm, 4))
    print("Mean Squared Error =", round(mse_lm, 2))
    print("Root Mean Squared Error =", round(rmse_lm, 2))
    print("Mean Absolute Error =", round(mae_lm, 2))
    print("\n")
    
    # K-Nearest Neighbors Regression
    knn = KNeighborsRegressor(n_neighbors=10)
    model_knn = knn.fit(X_train, y_train)  # Train the KNN Regression model
    predictions_knn = knn.predict(X_test)  # Predict on the test set
    
    # Calculate evaluation metrics for KNN Regression
    mse_knn = mean_squared_error(y_test, predictions_knn)  # Mean Squared Error
    rmse_knn = mse_knn ** 0.5  # Root Mean Squared Error
    mae_knn = mean_absolute_error(y_test, predictions_knn)  # Mean Absolute Error
    
    # Print KNN Regression evaluation results
    print("K-Nearest Neighbors (KNN) Evaluation:")
    print("Mean Squared Error =", round(mse_knn, 2))
    print("Root Mean Squared Error =", round(rmse_knn, 2))
    print("Mean Absolute Error =", round(mae_knn, 2))


linear_knn_train_evaluate(X, y)


Linear Regression Evaluation:
R2 value is = 0.7946
Mean Squared Error = 9270.09
Root Mean Squared Error = 96.28
Mean Absolute Error = 74.39


K-Nearest Neighbors (KNN) Evaluation:
Mean Squared Error = 35461.01
Root Mean Squared Error = 188.31
Mean Absolute Error = 147.74
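The KNN block above does not report R2, but it can be recovered from the MSE because R2 = 1 - MSE / Var(y_test), where the variance is the population variance of the test targets. The sketch below checks this identity against `r2_score` on hypothetical toy values (not taken from the lab dataset):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical toy values, not taken from the lab dataset
y_true = np.array([100.0, 250.0, 400.0, 550.0, 700.0])
y_pred = np.array([120.0, 230.0, 410.0, 500.0, 720.0])

mse = mean_squared_error(y_true, y_pred)
r2_from_mse = 1 - mse / np.var(y_true)  # np.var uses ddof=0 (population variance) by default
r2_sklearn = r2_score(y_true, y_pred)

print(round(r2_from_mse, 4), round(r2_sklearn, 4))
```

Both models are scored on the same test split, so the linear regression figures imply Var(y_test) ≈ 9270.09 / (1 - 0.7946) ≈ 45132. That puts the KNN R2 at roughly 1 - 35461.01 / 45132 ≈ 0.21. This is a back-of-the-envelope estimate from the rounded outputs, not a value printed by the notebook.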


## Second function: using ChatGPT

In [235]:
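# These classes are referenced by name below; so far only `linear_model` was imported at notebook level
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

def train_evaluate_models(models, X, y):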
    '''
    Trains and evaluates multiple regression models based on the provided list of models.

    Arguments:
    models -- List of regression models
    X -- Feature matrix
    y -- Target variable

    Returns:
    None
    '''

    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

    # Iterate through each model in the provided list of models
    for model in models:
        model_name = model.__class__.__name__  # Get the name of the model class
        
        model.fit(X_train, y_train)  # Train the model
        
        predictions = model.predict(X_test)  # Predict on the test set
        
        # Calculate evaluation metrics
        r2 = r2_score(y_test, predictions)  # R-squared
        mse = mean_squared_error(y_test, predictions)  # Mean Squared Error
        rmse = mse ** 0.5  # Root Mean Squared Error
        mae = mean_absolute_error(y_test, predictions)  # Mean Absolute Error
        
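        if isinstance(model,LinearRegression):
            print('Linear Regression evaluation:')
            print("R2 value is =", round(r2, 4))
            print("Mean Squared Error =", round(mse, 2))
            print("Root Mean Squared Error =", round(rmse, 2))
            print("Mean Absolute Error =", round(mae, 2))
            print('\n')
        elif isinstance(model,KNeighborsRegressor):
            print('KNeighborsRegressor evaluation:')
            print("Mean Squared Error =", round(mse, 2))
            print("Root Mean Squared Error =", round(rmse, 2))
            print("Mean Absolute Error =", round(mae, 2))
        else:
            # Fallback so any other regressor (e.g. MLPRegressor) is reported instead of silently skipped
            print(model_name, 'evaluation:')
            print("R2 value is =", round(r2, 4))
            print("Mean Squared Error =", round(mse, 2))
            print("Root Mean Squared Error =", round(rmse, 2))
            print("Mean Absolute Error =", round(mae, 2))
            print('\n')

    return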

lm = LinearRegression()
knn = KNeighborsRegressor(n_neighbors=10)

models = [lm,knn]

train_evaluate_models(models, X, y)

Linear Regression evaluation:
R2 value is = 0.7946
Mean Squared Error = 9270.09
Root Mean Squared Error = 96.28
Mean Absolute Error = 74.39


KNeighborsRegressor evaluation:
Mean Squared Error = 35461.01
Root Mean Squared Error = 188.31
Mean Absolute Error = 147.74
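A variant that does not branch on the model type can report the same metrics for every regressor in the list and return them as a table. The following is only a sketch: `compare_models` is a hypothetical helper name, and the data comes from `make_regression` as a synthetic stand-in so the block runs without the lab's CSV files.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor


def compare_models(models, X, y, test_size=0.20, random_state=42):
    """Hypothetical helper: fit each model on one shared split and return a metrics table."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    rows = []
    for model in models:
        model.fit(X_train, y_train)            # train on the shared training set
        predictions = model.predict(X_test)    # predict on the shared test set
        mse = mean_squared_error(y_test, predictions)
        rows.append({
            "model": model.__class__.__name__,
            "r2": r2_score(y_test, predictions),
            "mse": mse,
            "rmse": mse ** 0.5,
            "mae": mean_absolute_error(y_test, predictions),
        })
    return pd.DataFrame(rows).set_index("model")


# Synthetic stand-in data (an assumption for this sketch, not the lab dataset)
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
results = compare_models(
    [LinearRegression(), KNeighborsRegressor(n_neighbors=10)], X_demo, y_demo
)
print(results.round(2))
```

With the lab data, the call would presumably be `compare_models(models, X, y)`. Adding `MLPRegressor` from `sklearn.neural_network` to the list needs no change to the function.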


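## Conclusion
On the shared 80/20 split, the Linear Regression model clearly outperforms the KNeighborsRegressor model (`n_neighbors=10`). Its MSE (9270.09 vs. 35461.01), RMSE (96.28 vs. 188.31), and MAE (74.39 vs. 147.74) are all lower, so its typical prediction error is roughly half that of KNN. Linear Regression reaches an R2 of 0.7946 on the test set; R2 was not printed for KNN, so the comparison rests on the error metrics. Note that KNN was fitted on unscaled features, which may put it at a disadvantage because distance-based models are sensitive to feature scale.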
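One possible contributor to the weak KNN result is feature scale: `X` mixes 0/1 dummy columns with unscaled columns such as `income` and `customer_lifetime_value`, and KNN distances are dominated by whichever columns have the largest range. The sketch below illustrates the effect on synthetic stand-in data (the features `small_signal` and `large_noise` are hypothetical). Whether scaling helps on the lab data was not tested here.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical): the target depends only on a small-scale
# feature, while a large-scale feature (think of an income column) is pure noise
rng = np.random.default_rng(0)
n = 1000
small_signal = rng.uniform(0, 1, n)
large_noise = rng.uniform(0, 100_000, n)
X_demo = np.column_stack([small_signal, large_noise])
y_demo = 500 * small_signal + rng.normal(0, 10, n)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42
)

# Same KNN settings, with and without standardizing the features first
knn_raw = KNeighborsRegressor(n_neighbors=10).fit(X_train, y_train)
knn_scaled = make_pipeline(
    StandardScaler(), KNeighborsRegressor(n_neighbors=10)
).fit(X_train, y_train)

r2_raw = r2_score(y_test, knn_raw.predict(X_test))
r2_scaled = r2_score(y_test, knn_scaled.predict(X_test))
print("R2 without scaling:", round(r2_raw, 3))
print("R2 with scaling:   ", round(r2_scaled, 3))
```

Wrapping the scaler and the regressor in a pipeline keeps the scaling parameters fitted on the training set only, so the test set does not leak into preprocessing.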