## Lab | Comparing regression models

Instructions

1. Concatenate Numerical and Categorical dataframes into one dataframe called data. Split into X=features y=target (total_claim_amount).

2. In this final lab, we will model our data. Import sklearn train_test_split and separate the data.

3. Separate X_train and X_test into numerical and categorical (X_train_cat , X_train_num , X_test_cat , X_test_num)

4. Use X_train_num to fit scalers. Transform BOTH X_train_num and X_test_num.

5. Encode the categorical variables X_train_cat and X_test_cat (See the hint below for encoding categorical data!!!)

6. Since the model will only accept numerical data, check that every column is numerical. If some are not, convert them using encoding.

7. Try a simple linear regression with all the data to see whether we are getting good results.

8. Great! Now define a function that takes a list of models and trains (and tests) them, so we can try many of them without repeating code.

9. Use the function to check LinearRegression and KNeighborsRegressor.

10. You can also check the MLPRegressor for this task!

11. Check and discuss the results.

In [93]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn 

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler 
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

from sklearn import linear_model
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor

#### 1. X/y Split

In [2]:
# Note:
# I already concatenated the numerical and categorical data during the previous lab (feature extraction).
# At the end I exported the result as customer_df_con.csv, so I will load that file here and work on it.

In [3]:
data = pd.read_csv('customer_df_con.csv')

In [4]:
pd.set_option('display.max_columns', None)
data.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,state,response,coverage,education,employmentstatus,gender,location_code,marital_status,policy,renew_offer_type,sales_channel,vehicle_class,vehicle_size,customer_lifetime_value,effective_to_date,income,monthly_premium_auto,months_since_last_claim,months_since_policy_inception,number_of_open_complaints,number_of_policies,total_claim_amount,date,day,month,year,week
0,0,0.0,Washington,No,Basic,Bachelor,Employed,F,Suburban,Married,Corporate,Offer1,Agent,Two-Door Car,Medsize,2763.519279,2011-02-24,56274.0,69.0,32.0,5.0,0.0,1.0,384.811147,2011-02-24,24,2,2011,8
1,1,1.0,Arizona,No,Extended,Bachelor,Unemployed,F,Suburban,Single,Personal,Offer3,Agent,Four-Door Car,Medsize,6979.535903,2011-01-31,0.0,94.0,13.0,42.0,0.0,8.0,1131.464935,2011-01-31,31,1,2011,5
2,2,2.0,Nevada,No,Premium,Bachelor,Employed,F,Suburban,Married,Personal,Offer1,Agent,Two-Door Car,Medsize,12887.43165,2011-02-19,48767.0,108.0,18.0,38.0,0.0,2.0,566.472247,2011-02-19,19,2,2011,7
3,3,3.0,California,No,Basic,Bachelor,Unemployed,M,Suburban,Married,Corporate,Offer1,Call Center,SUV,Medsize,7645.861827,2011-01-20,0.0,106.0,18.0,65.0,0.0,7.0,529.881344,2011-01-20,20,1,2011,3
4,4,4.0,Washington,No,Basic,Bachelor,Employed,M,Rural,Single,Personal,Offer1,Agent,Four-Door Car,Medsize,2813.692575,2011-02-03,43836.0,73.0,12.0,44.0,0.0,1.0,138.130879,2011-02-03,3,2,2011,5


In [5]:
data.isna().sum()

Unnamed: 0.1                       0
Unnamed: 0                       735
state                            735
response                         735
coverage                         735
education                        735
employmentstatus                 735
gender                           735
location_code                    735
marital_status                   735
policy                           735
renew_offer_type                 735
sales_channel                    735
vehicle_class                    735
vehicle_size                     735
customer_lifetime_value          735
effective_to_date                735
income                           735
monthly_premium_auto             735
months_since_last_claim          735
months_since_policy_inception    735
number_of_open_complaints        735
number_of_policies               735
total_claim_amount               735
date                             735
day                                0
month                              0
y

In [6]:
# Observation:
# There are many nulls here because I removed outliers while cleaning the numerical columns.
# At this point I also need to remove all rows with null values.

# Dropping empty rows
data.dropna(inplace=True)

data.isna().sum().sum()

0

In [7]:
# At this point I see that there are several columns we don't need, so I will drop them.
# I will drop: Unnamed: 0.1 and Unnamed: 0

# I also drop "effective_to_date" and the matching "date" column, as the values were transformed into new columns (day, month, year and week).
# The "week" column is dropped as well.

In [8]:
data = data.drop(['Unnamed: 0.1', 'Unnamed: 0', 'effective_to_date', 'date', 'week'], axis=1)
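The `Unnamed: 0` style columns dropped above typically appear when a DataFrame is written with `to_csv()` including its index and then read back. A minimal sketch of that round trip, using a small made-up frame and an in-memory buffer instead of `customer_df_con.csv` (both are assumptions for illustration):

```python
import io
import pandas as pd

# Tiny stand-in frame; the real data has many more columns.
df = pd.DataFrame({"state": ["Washington", "Arizona"], "income": [56274.0, 0.0]})

# Default to_csv() also writes the index, as a first column without a name ...
buf_default = io.StringIO()
df.to_csv(buf_default)
buf_default.seek(0)
reloaded_default = pd.read_csv(buf_default)   # ... which comes back as "Unnamed: 0"

# Writing with index=False avoids the extra column on the next load.
buf_clean = io.StringIO()
df.to_csv(buf_clean, index=False)
buf_clean.seek(0)
reloaded_clean = pd.read_csv(buf_clean)

print(list(reloaded_default.columns))
print(list(reloaded_clean.columns))
```

Exporting with `index=False` in the previous lab would probably make the manual drop unnecessary.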

In [9]:
display(data.shape)
data.head()

(7582, 24)

Unnamed: 0,state,response,coverage,education,employmentstatus,gender,location_code,marital_status,policy,renew_offer_type,sales_channel,vehicle_class,vehicle_size,customer_lifetime_value,income,monthly_premium_auto,months_since_last_claim,months_since_policy_inception,number_of_open_complaints,number_of_policies,total_claim_amount,day,month,year
0,Washington,No,Basic,Bachelor,Employed,F,Suburban,Married,Corporate,Offer1,Agent,Two-Door Car,Medsize,2763.519279,56274.0,69.0,32.0,5.0,0.0,1.0,384.811147,24,2,2011
1,Arizona,No,Extended,Bachelor,Unemployed,F,Suburban,Single,Personal,Offer3,Agent,Four-Door Car,Medsize,6979.535903,0.0,94.0,13.0,42.0,0.0,8.0,1131.464935,31,1,2011
2,Nevada,No,Premium,Bachelor,Employed,F,Suburban,Married,Personal,Offer1,Agent,Two-Door Car,Medsize,12887.43165,48767.0,108.0,18.0,38.0,0.0,2.0,566.472247,19,2,2011
3,California,No,Basic,Bachelor,Unemployed,M,Suburban,Married,Corporate,Offer1,Call Center,SUV,Medsize,7645.861827,0.0,106.0,18.0,65.0,0.0,7.0,529.881344,20,1,2011
4,Washington,No,Basic,Bachelor,Employed,M,Rural,Single,Personal,Offer1,Agent,Four-Door Car,Medsize,2813.692575,43836.0,73.0,12.0,44.0,0.0,1.0,138.130879,3,2,2011


In [10]:
# X / y split

y = data['total_claim_amount']
X = data.drop(['total_claim_amount'], axis=1)

#### 2. Train/test split

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
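Since no `test_size` is passed, `train_test_split` uses its default of a 25% test set. A small sketch with synthetic arrays of the same length as the cleaned data (7582 rows; the arrays themselves are placeholders) shows the resulting sizes:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as the cleaned data.
X_demo = np.arange(7582 * 2).reshape(7582, 2)
y_demo = np.arange(7582)

# No test_size given: scikit-learn falls back to a 25% test split.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

print(X_tr.shape, X_te.shape)
```

Passing `test_size` explicitly (for example `test_size=0.25`) would make the split ratio visible in the notebook without changing the result.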

#### 3. X_train_cat , X_train_num , X_test_cat , X_test_num

In [12]:
# numerical/categorical on train set
X_train_num = X_train.select_dtypes(include = np.number)
X_train_cat = X_train.select_dtypes(include = object)

# numerical/categorical on test set
X_test_num = X_test.select_dtypes(include = np.number)
X_test_cat = X_test.select_dtypes(include = object)

In [13]:
display(X_train_num.shape)
X_train_num.head()

(5686, 10)

Unnamed: 0,customer_lifetime_value,income,monthly_premium_auto,months_since_last_claim,months_since_policy_inception,number_of_open_complaints,number_of_policies,day,month,year
4976,4462.997006,0.0,73.0,35.0,52.0,3.0,5.0,30,1,2011
3624,6331.338496,0.0,87.0,3.0,82.0,0.0,3.0,21,2,2011
1214,10700.4952,75313.0,89.0,16.0,64.0,1.0,2.0,24,2,2011
3989,3176.356488,64706.0,81.0,5.0,3.0,0.0,1.0,10,1,2011
6138,5723.535705,77646.0,71.0,3.0,29.0,0.0,6.0,9,2,2011


In [14]:
display(X_train_cat.shape)
X_train_cat.head()

(5686, 13)

Unnamed: 0,state,response,coverage,education,employmentstatus,gender,location_code,marital_status,policy,renew_offer_type,sales_channel,vehicle_class,vehicle_size
4976,Oregon,No,Basic,Bachelor,Employed,F,Rural,Married,Personal,Offer4,Call Center,Four-Door Car,Large
3624,Arizona,No,Extended,Master,Employed,F,Rural,Married,Personal,Offer2,Agent,SUV,Medsize
1214,California,No,Basic,High School or Below,Unemployed,M,Suburban,Single,Corporate,Offer2,Call Center,Two-Door Car,Medsize
3989,Arizona,No,Basic,High School or Below,Employed,F,Urban,Married,Personal,Offer3,Web,Two-Door Car,Medsize
6138,California,No,Basic,Bachelor,Employed,F,Suburban,Married,Personal,Offer1,Web,SUV,Medsize


#### 4. Use X_train_num to fit scalers. Transform BOTH X_train_num and X_test_num

In [15]:
transformer = StandardScaler().fit(X_train_num) 

In [16]:
X_train_normalized = transformer.transform(X_train_num)   # Output is an array 
X_test_normalized = transformer.transform(X_test_num)
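Fitting the scaler on the training set only means the test set is transformed with the training means and standard deviations. The following sketch uses random data (an assumption, not the lab data) to illustrate that the scaled training columns end up with mean 0 and standard deviation 1, while the test columns are only close to that:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train_num = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
test_num = rng.normal(loc=50.0, scale=10.0, size=(50, 3))

# Statistics are learned from the training data only.
scaler = StandardScaler().fit(train_num)
train_scaled = scaler.transform(train_num)
test_scaled = scaler.transform(test_num)

# Training columns: mean ~0 and std ~1 by construction.
print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
# Test columns: close to 0, but not exactly, because the train statistics are reused.
print(test_scaled.mean(axis=0).round(3))
```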

#### 5. Encode the categorical variables X_train_cat and X_test_cat 

In [None]:
# Because we use two different types of encoding, I will split the categorical data sets into:

 
# X_train_cat_ord
# X_test_cat_ord
# X_train_cat_onehot
# X_test_cat_onehot

In [29]:
X_train_cat_ord = X_train_cat[["coverage", "employmentstatus", "location_code", "vehicle_class", "education"]]
# X_train_cat_ord
X_test_cat_ord = X_test_cat[["coverage", "employmentstatus", "location_code", "vehicle_class", "education"]]
X_test_cat_ord


Unnamed: 0,coverage,employmentstatus,location_code,vehicle_class,education
7749,Extended,Others,Suburban,Two-Door Car,Doctor
4643,Extended,Employed,Suburban,Four-Door Car,High School or Below
4205,Basic,Others,Suburban,Four-Door Car,High School or Below
958,Basic,Employed,Urban,Two-Door Car,High School or Below
645,Extended,Employed,Rural,Four-Door Car,College
...,...,...,...,...,...
6896,Basic,Employed,Urban,SUV,Bachelor
6326,Extended,Unemployed,Suburban,Four-Door Car,Bachelor
7586,Basic,Employed,Suburban,Four-Door Car,High School or Below
674,Basic,Employed,Rural,SUV,Bachelor


In [32]:
X_train_cat_onehot = X_train_cat.drop(["coverage", "employmentstatus", "location_code", "vehicle_class", "education"], axis=1)
X_test_cat_onehot = X_test_cat.drop(["coverage", "employmentstatus", "location_code", "vehicle_class", "education"], axis=1)
X_test_cat_onehot

Unnamed: 0,state,response,gender,marital_status,policy,renew_offer_type,sales_channel,vehicle_size
7749,California,No,F,Divorced,Personal,Offer3,Agent,Medsize
4643,Oregon,No,M,Married,Personal,Offer4,Agent,Large
4205,Oregon,Yes,F,Divorced,Personal,Offer1,Agent,Medsize
958,California,No,F,Married,Personal,Offer4,Call Center,Large
645,Oregon,No,F,Married,Personal,Offer1,Agent,Medsize
...,...,...,...,...,...,...,...,...
6896,Arizona,No,F,Single,Personal,Offer3,Agent,Medsize
6326,Oregon,No,F,Single,Personal,Offer1,Branch,Medsize
7586,California,No,M,Married,Personal,Offer2,Agent,Large
674,Arizona,No,M,Married,Personal,Offer1,Agent,Medsize


In [34]:
display(X_train_cat["employmentstatus"].value_counts())
display(X_train_cat["location_code"].value_counts())
display(X_train_cat["vehicle_class"].value_counts())
display(X_train_cat["education"].value_counts())


employmentstatus
Employed      3554
Unemployed    1431
Others         701
Name: count, dtype: int64

location_code
Suburban    3575
Rural       1089
Urban       1022
Name: count, dtype: int64

vehicle_class
Four-Door Car    2963
Two-Door Car     1178
SUV              1106
Sports Car        294
Luxury SUV         78
Luxury Car         67
Name: count, dtype: int64

education
Bachelor                1787
College                 1656
High School or Below    1586
Master                   446
Doctor                   211
Name: count, dtype: int64

In [33]:
X_train_cat_ord.head()

Unnamed: 0,coverage,employmentstatus,location_code,vehicle_class,education
4976,Basic,Employed,Rural,Four-Door Car,Bachelor
3624,Extended,Employed,Rural,SUV,Master
1214,Basic,Unemployed,Suburban,Two-Door Car,High School or Below
3989,Basic,Employed,Urban,Two-Door Car,High School or Below
6138,Basic,Employed,Suburban,SUV,Bachelor


In [36]:
### Ordinal encoding

# Ordinal encoding: giving the values a particular hierarchy/order

X_train_cat_ord["coverage"] = X_train_cat_ord["coverage"].map({"Basic" : 0, "Extended" : 1, "Premium" : 2})
X_train_cat_ord["employmentstatus"] = X_train_cat_ord["employmentstatus"].map({"Employed" : 2, "Unemployed" : 0, "Others" : 1})
X_train_cat_ord["location_code"] = X_train_cat_ord["location_code"].map({"Suburban" : 0, "Rural" : 1, "Urban" : 2})
X_train_cat_ord["vehicle_class"] = X_train_cat_ord["vehicle_class"].map({"Four-Door Car" : 1, "Two-Door Car" : 1, "SUV" : 2, \
                                                  "Sports Car" : 2, "Luxury SUV" : 3, "Luxury Car" : 3})
X_train_cat_ord["education"] = X_train_cat_ord["education"].map({"Bachelor" : 2, "College" : 1, "High School or Below" : 0, \
                                          "Master" : 3, "Doctor" : 4})

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_train_cat_ord["coverage"] = X_train_cat_ord["coverage"].map({"Basic" : 0, "Extended" : 1, "Premium" : 2})
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_train_cat_ord["employmentstatus"] = X_train_cat_ord["employmentstatus"].map({"Employed" : 2, "Unemployed" : 0, "Others" : 1})
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#re

In [40]:
X_test_cat_ord["coverage"] = X_test_cat_ord["coverage"].map({"Basic" : 0, "Extended" : 1, "Premium" : 2})
X_test_cat_ord["employmentstatus"] = X_test_cat_ord["employmentstatus"].map({"Employed" : 2, "Unemployed" : 0, "Others" : 1})
X_test_cat_ord["location_code"] = X_test_cat_ord["location_code"].map({"Suburban" : 0, "Rural" : 1, "Urban" : 2})
X_test_cat_ord["vehicle_class"] = X_test_cat_ord["vehicle_class"].map({"Four-Door Car" : 1, "Two-Door Car" : 1, "SUV" : 2, \
                                                  "Sports Car" : 2, "Luxury SUV" : 3, "Luxury Car" : 3})
X_test_cat_ord["education"] = X_test_cat_ord["education"].map({"Bachelor" : 2, "College" : 1, "High School or Below" : 0, \
                                          "Master" : 3, "Doctor" : 4})

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_test_cat_ord["coverage"] = X_test_cat_ord["coverage"].map({"Basic" : 0, "Extended" : 1, "Premium" : 2})
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_test_cat_ord["employmentstatus"] = X_test_cat_ord["employmentstatus"].map({"Employed" : 2, "Unemployed" : 0, "Others" : 1})
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#return
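The warnings above appear because `X_train_cat_ord` and `X_test_cat_ord` were created by selecting columns from another DataFrame and were then assigned to. One possible way to avoid them, sketched here with a hypothetical helper `encode_ordinal` and only two of the five mappings, is to take an explicit `.copy()` before mapping and to reuse the same function for train and test:

```python
import pandas as pd

# Same mappings as in the cells above, shown for two of the columns.
ORDINAL_MAPS = {
    "coverage": {"Basic": 0, "Extended": 1, "Premium": 2},
    "education": {"High School or Below": 0, "College": 1, "Bachelor": 2,
                  "Master": 3, "Doctor": 4},
}

def encode_ordinal(cat_df, maps):
    """Return a new frame with the mapped columns; the input is left untouched."""
    out = cat_df[list(maps)].copy()   # explicit copy, so the assignment is unambiguous
    for col, mapping in maps.items():
        out[col] = out[col].map(mapping)
    return out

# Small made-up frame standing in for X_train_cat / X_test_cat.
demo = pd.DataFrame({
    "coverage": ["Basic", "Premium", "Extended"],
    "education": ["Doctor", "College", "Bachelor"],
    "state": ["Oregon", "Arizona", "Nevada"],
})
encoded = encode_ordinal(demo, ORDINAL_MAPS)
print(encoded)
```

Using one function for both sets also guarantees that train and test are encoded with identical mappings.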

In [42]:
# display(X_train_cat_ord["employmentstatus"].value_counts())
# display(X_train_cat_ord["location_code"].value_counts())
# display(X_train_cat_ord["vehicle_class"].value_counts())
# display(X_train_cat_ord["education"].value_counts())
# display(X_train_cat_ord["coverage"].value_counts())

# display(X_test_cat_ord["employmentstatus"].value_counts())
# display(X_test_cat_ord["location_code"].value_counts())
# display(X_test_cat_ord["vehicle_class"].value_counts())
# display(X_test_cat_ord["education"].value_counts())
# display(X_test_cat_ord["coverage"].value_counts())

In [46]:
X_train_cat_ord.head()

Unnamed: 0,coverage,employmentstatus,location_code,vehicle_class,education
4976,0,2,1,1,2
3624,1,2,1,2,3
1214,0,0,0,1,0
3989,0,2,2,1,0
6138,0,2,0,2,2


In [None]:
# Onehot encoding

In [35]:
encoder = OneHotEncoder(drop='first').fit(X_train_cat_onehot)

In [43]:
# Getting names of columns to be able to label features
cols = encoder.get_feature_names_out(input_features=X_train_cat_onehot.columns)
cols

array(['state_California', 'state_Nevada', 'state_Oregon',
       'state_Washington', 'response_Yes', 'gender_M',
       'marital_status_Married', 'marital_status_Single',
       'policy_Personal', 'policy_Special', 'renew_offer_type_Offer2',
       'renew_offer_type_Offer3', 'renew_offer_type_Offer4',
       'sales_channel_Branch', 'sales_channel_Call Center',
       'sales_channel_Web', 'vehicle_size_Medsize', 'vehicle_size_Small'],
      dtype=object)

In [55]:
# Running the encoder on X_train_cat_onehot and X_test_cat_onehot
X_train_cat_onehot_encoded = encoder.transform(X_train_cat_onehot).toarray()
X_test_cat_onehot_encoded = encoder.transform(X_test_cat_onehot).toarray()

In [None]:
# The outputs of scaling and one-hot encoding are now stored as arrays.
# Because the model doesn't need a dataframe, I will not convert the arrays to pandas dataframes.

#### 6. Check and make sure that every column is numerical, if some are not, change it using encoding.

In [None]:
# Concatenating numerical and categorical sets separately for test and training

In [56]:
X_train_transformed = np.concatenate([X_train_normalized, X_train_cat_ord, X_train_cat_onehot_encoded], axis=1)
X_test_transformed = np.concatenate([X_test_normalized, X_test_cat_ord, X_test_cat_onehot_encoded], axis=1)

In [58]:
# Checking if all features are numerical

X_train_transformed_df = pd.DataFrame(X_train_transformed)
X_test_transformed_df = pd.DataFrame(X_test_transformed)

In [60]:
display(X_train_transformed_df.dtypes)
display(X_test_transformed_df.dtypes)

0     float64
1     float64
2     float64
3     float64
4     float64
5     float64
6     float64
7     float64
8     float64
9     float64
10    float64
11    float64
12    float64
13    float64
14    float64
15    float64
16    float64
17    float64
18    float64
19    float64
20    float64
21    float64
22    float64
23    float64
24    float64
25    float64
26    float64
27    float64
28    float64
29    float64
30    float64
31    float64
32    float64
dtype: object

0     float64
1     float64
2     float64
3     float64
4     float64
5     float64
6     float64
7     float64
8     float64
9     float64
10    float64
11    float64
12    float64
13    float64
14    float64
15    float64
16    float64
17    float64
18    float64
19    float64
20    float64
21    float64
22    float64
23    float64
24    float64
25    float64
26    float64
27    float64
28    float64
29    float64
30    float64
31    float64
32    float64
dtype: object

In [None]:
# Conclusion:
# All columns are numerical. 
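Checking dtypes confirms the columns are numeric, but it does not catch missing values: `Series.map()` returns NaN for any category that is not in the mapping dictionary, and the column still has a float dtype. A short sketch of a combined check, with a hypothetical helper name and a made-up category `"Platinum"`:

```python
import numpy as np
import pandas as pd

def is_all_numeric_and_finite(arr):
    """True if the array has a numeric dtype and contains no NaN or inf."""
    arr = np.asarray(arr)
    return np.issubdtype(arr.dtype, np.number) and bool(np.isfinite(arr).all())

# "Platinum" is not in the mapping, so map() silently produces NaN for it.
mapped = pd.Series(["Basic", "Premium", "Platinum"]).map(
    {"Basic": 0, "Extended": 1, "Premium": 2}
)

print(is_all_numeric_and_finite(np.array([[1.0, 2.0], [3.0, 4.0]])))
print(is_all_numeric_and_finite(mapped))
```

Running such a check on `X_train_transformed` and `X_test_transformed` before fitting would flag a category that appears only in the test set.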

#### 7. Linear regression

In [62]:
lm = linear_model.LinearRegression()
lm.fit(X_train_transformed,y_train)

In [63]:
predictions_train = lm.predict(X_train_transformed) # y_train_pred
predictions_test = lm.predict(X_test_transformed)   # y_test_pred

In [64]:
r2_score_train = r2_score(y_train, predictions_train)
r2_score_test = r2_score(y_test, predictions_test)

print("R2_Score for train: ", r2_score_train)  
print("R2_Score for test: ", r2_score_test)

R2_Score for train:  0.49604904449537424
R2_Score for test:  0.49051505328637524


In [None]:
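# Comment:
# An R2 score of about 0.49 means the model explains roughly half of the variance in total_claim_amount.
# That is clearly better than always predicting the mean (R2 = 0), but it is not a good fit either.
# Train and test scores are almost identical, so there is no sign of overfitting.
# Still, I hope to get better results with another model.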
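The comparison with blind guessing can be made concrete. R2 is defined as 1 - SS_res / SS_tot, so a predictor that always outputs the mean of the evaluated targets scores exactly 0. A small sketch with made-up numbers (not taken from the lab data):

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([150.0, 180.0, 310.0, 360.0])

# R2 = 1 - (residual sum of squares) / (total sum of squares around the mean)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# Baseline: always predict the mean of the targets being evaluated.
mean_guess = np.full_like(y_true, y_true.mean())
r2_baseline = r2_score(y_true, mean_guess)

print(r2_manual, r2_score(y_true, y_pred), r2_baseline)
```

In practice the baseline mean would come from the training targets, so its test R2 would be at or slightly below 0.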

#### 8. Function that trains (and tests) several models

In [86]:
def compare_liner_knn_models(X_train, X_test, y_train, y_test, n):
    # Input: X_train, X_test, y_train, y_test
    # Input: n = number of neighbors
    # Output: r2 scores for models
    
    # Creating models
    linear_regressor = linear_model.LinearRegression()
    knn_regressor = KNeighborsRegressor(n_neighbors=n) 
    
    # Training models with training set
    linear_regressor.fit(X_train, y_train)
    knn_regressor.fit(X_train, y_train)
    
    # Predictions for test set
    linear_preds = linear_regressor.predict(X_test)
    knn_preds = knn_regressor.predict(X_test)
    
    # Calculate r2_score on the test set (same metric call for both models)
    linear_r2score = r2_score(y_test, linear_preds)
    knn_r2score = r2_score(y_test, knn_preds)
    
    return linear_r2score, knn_r2score
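The function above is fixed to two models. The lab instruction asks for a function that takes a list of models, which could look roughly like the sketch below. The name `evaluate_models` and the synthetic data from `make_regression` are assumptions for illustration, not part of the original notebook:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

def evaluate_models(models, X_train, X_test, y_train, y_test):
    """Fit each model on the training set; return {class name: (train R2, test R2)}."""
    scores = {}
    for model in models:
        model.fit(X_train, y_train)
        name = type(model).__name__
        scores[name] = (
            r2_score(y_train, model.predict(X_train)),
            r2_score(y_test, model.predict(X_test)),
        )
    return scores

# Synthetic regression problem standing in for the transformed lab data.
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

results = evaluate_models(
    [LinearRegression(), KNeighborsRegressor(n_neighbors=5)], X_tr, X_te, y_tr, y_te
)
for name, (train_r2, test_r2) in results.items():
    print(f"{name}: train R2 = {train_r2:.3f}, test R2 = {test_r2:.3f}")
```

Because results are keyed by class name, two models of the same class (for example two KNN regressors with different k) would overwrite each other; a list of `(label, model)` pairs would avoid that.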

#### 9. Use the function to check LinearRegression and KNeighborsRegressor

In [90]:
# For 3 neighbors and all features:
linear_r2score, knn_r2score = compare_liner_knn_models(X_train_transformed, X_test_transformed, y_train, y_test, 3)

print("linear_r2score", linear_r2score)
print("knn_score", knn_r2score)

linear_r2score 0.49051505328637524
knn_score 0.29291071610897035


In [91]:
# For 5 neighbors and all features:
linear_r2score, knn_r2score = compare_liner_knn_models(X_train_transformed, X_test_transformed, y_train, y_test, 5)

print("linear_r2score", linear_r2score)
print("knn_score", knn_r2score)

linear_r2score 0.49051505328637524
knn_score 0.35603626598450566


In [92]:
# For 10 neighbors and all features:
linear_r2score, knn_r2score = compare_liner_knn_models(X_train_transformed, X_test_transformed, y_train, y_test, 10)

print("linear_r2score", linear_r2score)
print("knn_score", knn_r2score)

linear_r2score 0.49051505328637524
knn_score 0.40008627960849674


In [99]:
# For 50 neighbors and all features:
linear_r2score, knn_r2score = compare_liner_knn_models(X_train_transformed, X_test_transformed, y_train, y_test, 50)

print("linear_r2score", linear_r2score)
print("knn_score", knn_r2score)

linear_r2score 0.49051505328637524
knn_score 0.40421519292733676
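The four cells above differ only in the number of neighbors, so the same comparison could be written as a loop. A minimal sketch on synthetic data (the data and variable names are assumptions):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression problem standing in for the transformed lab data.
X_demo, y_demo = make_regression(n_samples=400, n_features=5, noise=20.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

k_values = [3, 5, 10, 50]
knn_scores = {}
for k in k_values:
    knn = KNeighborsRegressor(n_neighbors=k).fit(X_tr, y_tr)
    knn_scores[k] = knn.score(X_te, y_te)   # score() returns R2 for regressors

for k, score in knn_scores.items():
    print(f"k={k}: test R2 = {score:.3f}")
```

Choosing k by comparing test scores uses the test set for tuning; cross-validation on the training set would be a cleaner way to pick it.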


#### 10. MLP Regressor

In [96]:
mlp_regressor = MLPRegressor(hidden_layer_sizes=(100, 50), activation='relu', solver='adam', max_iter=500, random_state=42)

In [98]:
mlp_regressor.fit(X_train_transformed, y_train)
y_pred = mlp_regressor.predict(X_test_transformed)
r2 = r2_score(y_test, y_pred)
print("R2 Score:", r2)

R2 Score: 0.5035491120421212




#### 11. Discussion

- Linear regression gave a (surprisingly) good result: test R2 ≈ 0.49, well above every KNN run.
- The KNN model improves with a larger k (number of neighbors): test R2 rises from 0.29 at k=3 to 0.40 at k=10, but k=50 gives almost the same result (0.404).
- This suggests that the elbow point is probably somewhere around k=10.
- The MLP Regressor gave the best predictions overall (test R2 ≈ 0.50).
- However, the linear and MLP regressions give very similar results (0.49 vs. 0.50).