# Lab | Model Fitting and Evaluating.

## Activity (Tuesday)

## Linear Regression

1. X-y split (y is the target variable, in this case "total claim amount").
2. Train-test split.
3. Standardize the data (after the data split!).
4. Apply linear regression.
5. Model interpretation.

## Load the data 

In [320]:
import pandas as pd 
import seaborn as sns

In [321]:
csv_path = '/Users/matthewbatchelor/Downloads/marketing_customer_analysis_clean.csv'

In [322]:
df = pd.read_csv(csv_path)

In [323]:
df.head()

Unnamed: 0,unnamed:_0,customer,state,customer_lifetime_value,response,coverage,education,effective_to_date,employmentstatus,gender,...,number_of_policies,policy_type,policy,renew_offer_type,sales_channel,total_claim_amount,vehicle_class,vehicle_size,vehicle_type,month
0,0,DK49336,Arizona,4809.21696,No,Basic,College,2011-02-18,Employed,M,...,9,Corporate Auto,Corporate L3,Offer3,Agent,292.8,Four-Door Car,Medsize,A,2
1,1,KX64629,California,2228.525238,No,Basic,College,2011-01-18,Unemployed,F,...,1,Personal Auto,Personal L3,Offer4,Call Center,744.924331,Four-Door Car,Medsize,A,1
2,2,LZ68649,Washington,14947.9173,No,Basic,Bachelor,2011-02-10,Employed,M,...,2,Personal Auto,Personal L3,Offer3,Call Center,480.0,SUV,Medsize,A,2
3,3,XL78013,Oregon,22332.43946,Yes,Extended,College,2011-01-11,Employed,M,...,2,Corporate Auto,Corporate L3,Offer2,Branch,484.013411,Four-Door Car,Medsize,A,1
4,4,QA50777,Oregon,9025.067525,No,Premium,Bachelor,2011-01-17,Medical Leave,F,...,7,Personal Auto,Personal L2,Offer1,Branch,707.925645,Four-Door Car,Medsize,A,1


In [324]:
# Isolate numerical variables 

numericals_df = df.select_dtypes(include=['number'])

In [325]:
numericals_df.head()

Unnamed: 0,unnamed:_0,customer_lifetime_value,income,monthly_premium_auto,months_since_last_claim,months_since_policy_inception,number_of_open_complaints,number_of_policies,total_claim_amount,month
0,0,4809.21696,48029,61,7.0,52,0.0,9,292.8,2
1,1,2228.525238,0,64,3.0,26,0.0,1,744.924331,1
2,2,14947.9173,22139,100,34.0,31,0.0,2,480.0,2
3,3,22332.43946,49078,97,10.0,3,0.0,2,484.013411,1
4,4,9025.067525,23675,117,15.149071,31,0.384256,7,707.925645,1


In [326]:
numericals_df = numericals_df.dropna()

In [327]:
numericals_df = numericals_df.drop_duplicates()

## 1 X-y split (y is the target variable, in this case, "total claim amount")

In [328]:
X = numericals_df.drop(["total_claim_amount"], axis=1) # all numerical columns except the target are used as independent variables
y = numericals_df[["total_claim_amount"]]

## 2 Train-test split

In [329]:
from sklearn.model_selection import train_test_split

In [330]:
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state = 42)
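
The split above relies on the defaults of `train_test_split`: when no `test_size` is given, 25% of the rows are held out for testing. The sketch below checks this on a small synthetic frame; the data and the `demo`, `X_demo`, `y_demo` names are made up for illustration and are not the lab dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the lab data (hypothetical values, 100 rows)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "income": rng.integers(0, 100_000, size=100),
    "monthly_premium_auto": rng.integers(60, 300, size=100),
    "total_claim_amount": rng.uniform(0, 1_000, size=100),
})

X_demo = demo.drop(columns=["total_claim_amount"])
y_demo = demo[["total_claim_amount"]]

# No test_size given, so 25% of the rows go to the test set
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)
print(X_tr.shape, X_te.shape)

# The same random_state reproduces the same split
X_tr_again, _, _, _ = train_test_split(X_demo, y_demo, random_state=42)
print(X_tr.index.equals(X_tr_again.index))
```

Fixing `random_state` keeps the split reproducible between runs, which makes the error metrics later in the notebook comparable.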

## 3 Standardize the data 

In [389]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler, PowerTransformer

# Fit the PowerTransformer to your training data

# Side note: how do we choose between PowerTransformer, MinMaxScaler and StandardScaler?
# Source: ChatGPT (31.01.24)

# Decision criteria:

# 1 Distribution of the data:
# - If the features are skewed and can be made more Gaussian-like, consider PowerTransformer.
# - If the features need to lie in a fixed range (by default [0, 1]), consider MinMaxScaler.
# - If the features have different means and standard deviations and only need centring and
#   scaling (mean 0, standard deviation 1), consider StandardScaler.

# 2 Handling outliers:
# - PowerTransformer compresses long tails, so it is generally less sensitive to outliers.
# - StandardScaler and MinMaxScaler are both affected by outliers; MinMaxScaler more so,
#   because a single extreme value sets the whole range.

# 3 Model assumptions:
# - Consider the assumptions of the models you plan to use. Some models assume that features
#   are normally distributed or are on comparable scales.

# 4 Experimentation:
# - Try different scalers and evaluate their impact on the performance of your specific model.
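
One way to check the first criterion is to measure skewness before and after each scaler. The sketch below does this on synthetic log-normal data (not the lab dataset), using `scipy.stats.skew`. MinMaxScaler and StandardScaler only shift and rescale, so they leave the skewness unchanged, while PowerTransformer changes the shape of the distribution.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import MinMaxScaler, PowerTransformer, StandardScaler

# Synthetic right-skewed feature (log-normal), one column
rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=(1_000, 1))
raw_skew = float(skew(skewed[:, 0]))

# Skewness of the same feature after each scaler
skewness = {}
for scaler_cls in (MinMaxScaler, StandardScaler, PowerTransformer):
    transformed = scaler_cls().fit_transform(skewed)
    skewness[scaler_cls.__name__] = float(skew(transformed[:, 0]))

print("raw:", raw_skew)
print(skewness)
```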

In [391]:
pt = PowerTransformer() # PowerTransformer chosen because many of the numerical variables are skewed

# Fit on the training data only; fit() returns the transformer itself, so the result does not need to be stored

pt.fit(X_train)

# Transform the training data

X_train_trans = pt.transform(X_train)

# Transform X_test data 

X_test_trans = pt.transform(X_test)
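
The reason for standardizing after the split is that the transformer should learn its parameters from the training rows only and then reuse them on the test rows. A minimal sketch on synthetic data (the `X_demo` array and `pt_demo` name are illustrative) shows what that means in practice: the transformed training columns have mean 0 and standard deviation 1, while the test columns are only approximately centred, because they were never seen during fitting.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PowerTransformer

# Synthetic skewed features, 200 rows and 3 columns
rng = np.random.default_rng(1)
X_demo = rng.lognormal(size=(200, 3))
X_tr, X_te = train_test_split(X_demo, random_state=42)

pt_demo = PowerTransformer()
X_tr_trans = pt_demo.fit_transform(X_tr)  # learn the lambdas on the training rows only
X_te_trans = pt_demo.transform(X_te)      # reuse them on the test rows, no refitting

print("lambdas:", pt_demo.lambdas_)
print("train means:", X_tr_trans.mean(axis=0).round(6))
print("test means:", X_te_trans.mean(axis=0).round(3))
```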

## 4 Apply linear regression.

In [390]:
from sklearn.linear_model import LinearRegression

In [367]:
lm = LinearRegression()
model = lm.fit(X_train_trans,y_train)

In [368]:
model.coef_

array([[ 9.82399068e-02,  1.12738011e+01, -1.04032839e+02,
         1.53052473e+02,  3.50539681e+00, -3.85903839e+00,
        -1.41103257e+00, -7.66019405e+00, -2.09632903e-01]])

In [369]:
model.intercept_

array([434.60833836])
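
The coefficients above are listed in the column order of `X_train`, and because the features were transformed before fitting, each one describes the change in the target per unit of the transformed feature, not per raw unit. Pairing them with column names makes them easier to read. The sketch below shows the idea on synthetic data with a known relationship; it uses StandardScaler rather than PowerTransformer to keep the expected values obvious, and all names in it are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic features with a known relationship: target = 3*a - 2*b + 10 + noise
rng = np.random.default_rng(2)
X_demo = pd.DataFrame({
    "feature_a": rng.normal(size=300),
    "feature_b": rng.normal(size=300),
})
y_demo = pd.DataFrame({
    "target": 3 * X_demo["feature_a"] - 2 * X_demo["feature_b"] + 10
    + rng.normal(scale=0.1, size=300)
})

X_scaled = StandardScaler().fit_transform(X_demo)
demo_model = LinearRegression().fit(X_scaled, y_demo)

# y is a one-column DataFrame, so coef_ has shape (1, n_features); take row 0
coef_table = pd.Series(demo_model.coef_[0], index=X_demo.columns)
coef_table = coef_table.sort_values(key=np.abs, ascending=False)
print(coef_table)
```

Sorting by absolute value puts the features with the largest effect on the prediction first, which is only a fair comparison because the features share a common scale.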

## 5 Model Interpretation.

In [370]:
# Predict the total claim amount for one randomly chosen customer from the test set

In [371]:
random_customer = X_test.sample()

In [372]:
random_customer

Unnamed: 0,unnamed:_0,customer_lifetime_value,income,monthly_premium_auto,months_since_last_claim,months_since_policy_inception,number_of_open_complaints,number_of_policies,month
10441,10441,13438.04189,69693,112,34.0,62,1.0,2,1


In [373]:
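# Caution: random_customer comes from the untransformed X_test, but the model was fitted on
# PowerTransformer output, so the prediction below is not meaningful.
# The row should be transformed first: model.predict(pt.transform(random_customer))
model.predict(random_customer)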



array([[-7080397.63401566]])
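
A prediction this far from any plausible claim amount is what happens when raw feature values are fed to a model that was fitted on transformed features. The sketch below reproduces the effect on synthetic data and compares it with a prediction that goes through the same transformer first; the column names and the relationship between features and target are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PowerTransformer

# Synthetic data (hypothetical columns); target is roughly 200 to 1100
rng = np.random.default_rng(3)
X_demo = pd.DataFrame({
    "income_k": rng.uniform(10, 100, size=500),
    "monthly_premium": rng.uniform(60, 300, size=500),
})
y_demo = 2 * X_demo["income_k"] + 3 * X_demo["monthly_premium"] + rng.normal(scale=5, size=500)

# Fit the transformer, then fit the model on the transformed features
pt_demo = PowerTransformer().fit(X_demo)
demo_model = LinearRegression().fit(pt_demo.transform(X_demo), y_demo)

one_row = X_demo.sample(1, random_state=0)
actual = float(y_demo.loc[one_row.index[0]])

wrong = demo_model.predict(one_row.to_numpy())         # raw units, no transform
right = demo_model.predict(pt_demo.transform(one_row))  # same preprocessing as in training

print("actual:", actual)
print("without transform:", float(wrong[0]))
print("with transform:", float(right[0]))
```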

## Create predictions for test set

In [374]:
y_pred = model.predict(X_test_trans)

In [375]:
y_pred = pd.DataFrame(y_pred)

In [376]:
y_test = y_test.reset_index(drop=True)

In [377]:
residuals_df = pd.concat([y_test,y_pred],axis=1)

In [378]:
residuals_df = residuals_df.rename(columns={"total_claim_amount":"y_test", 0:"y_pred"})

In [379]:
residuals_df["residual"] = residuals_df["y_test"]-residuals_df["y_pred"]

In [380]:
residuals_df.head()

Unnamed: 0,y_test,y_pred,residual
0,475.423848,219.404346,256.019502
1,350.4,309.353785,41.046215
2,482.4,438.168865,44.231135
3,673.34265,497.926,175.41665
4,302.4,163.53129,138.86871


## Model evaluation

In [381]:
mean_error = residuals_df["residual"].mean()

In [382]:
mean_error

0.8230562109049625

In [383]:
from sklearn.metrics import mean_squared_error as mse , mean_absolute_error as mae

In [384]:
mse(y_test,y_pred)

45743.03824143722

In [385]:
mae(y_test,y_pred)

152.92551635883393

In [386]:
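rmse = mse(y_test,y_pred) ** 0.5 # squared=False is deprecated in recent scikit-learn releases; the square root of the MSE gives the same value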

In [387]:
rmse

213.87622177660896
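
To see what the three metrics measure, they can be recomputed by hand from the residuals. The sketch below uses only the five `y_test` / `y_pred` pairs shown in the residuals table earlier, so its numbers will not match the full test-set values above; it just confirms that MAE is the mean absolute residual, MSE the mean squared residual, and RMSE the square root of the MSE.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# First five y_test / y_pred pairs from the residuals table
y_true = np.array([475.423848, 350.4, 482.4, 673.34265, 302.4])
y_hat = np.array([219.404346, 309.353785, 438.168865, 497.926, 163.53129])

residuals = y_true - y_hat
mae_manual = np.abs(residuals).mean()    # mean absolute error
mse_manual = (residuals ** 2).mean()     # mean squared error
rmse_manual = np.sqrt(mse_manual)        # root mean squared error

# The manual values agree with scikit-learn's implementations
print(np.isclose(mae_manual, mean_absolute_error(y_true, y_hat)))
print(np.isclose(mse_manual, mean_squared_error(y_true, y_hat)))
print(mae_manual, mse_manual, rmse_manual)
```

RMSE is never smaller than MAE because squaring gives large residuals more weight, which is consistent with the full test-set results above (RMSE of about 213.9 against MAE of about 152.9).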