![logo_ironhack_blue 7](https://user-images.githubusercontent.com/23629340/40541063-a07a0a8a-601a-11e8-91b5-2f13e4e6b441.png)

<h1 style="color: #00BFFF;">00 | Comparing regression models</h1>

For this lab, we will be using the same dataset we used in the previous labs. We recommend using the same notebook since you will be reusing the same variables you previous created and used in labs. 

In [1]:
# 📚 Basic libraries
import pandas as pd # data manipulation
import numpy as np # numerical operations
import matplotlib.pyplot as plt # 2D visualizations
import os # filemanagment
import seaborn as sns # high-resolution visualization

# 🤖 Machine Learning
from sklearn.model_selection import train_test_split # splitting data into train/test sets
# Data Normalization
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
# linear Regression Models
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import ElasticNet
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
# Model evaluation metrics
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# ⚙️ Settings
pd.set_option('display.max_columns', None)

# 🔄 Functions
import sys # system path to our functions
module = "C:/Users/apisi/01. IronData/01. GitHub/01. IronLabs/usefulness/easy"
sys.path.append(os.path.abspath(module))

from functions import open_data  # quick data overview
from functions import snake_columns  # snake_case
from functions import explore_data  # checks for duplicates, NaN & empty spaces

<h1 style="color: #00BFFF;">00 | Data Extraction</h1>

In [2]:
file_path = os.path.join('C:/Users/apisi/01. IronData/01. GitHub/01. IronLabs/unit_4_py/lab-comparing-regression-models/01_data/we_fn_use_c_marketing_customer_value_analysis.csv')
data = pd.read_csv(file_path)
snake_columns(data) # snake_case columns
open_data(data) # returns shape, data types & shows a small sample

Data shape is (9134, 24).

customer                          object
state                             object
customer_lifetime_value          float64
response                          object
coverage                          object
education                         object
effective_to_date                 object
employmentstatus                  object
gender                            object
income                             int64
location_code                     object
marital_status                    object
monthly_premium_auto               int64
months_since_last_claim            int64
months_since_policy_inception      int64
number_of_open_complaints          int64
number_of_policies                 int64
policy_type                       object
policy                            object
renew_offer_type                  object
sales_channel                     object
total_claim_amount               float64
vehicle_class                     object
vehicle_size                  

Unnamed: 0,customer,state,customer_lifetime_value,response,coverage,education,effective_to_date,employmentstatus,gender,income,location_code,marital_status,monthly_premium_auto,months_since_last_claim,months_since_policy_inception,number_of_open_complaints,number_of_policies,policy_type,policy,renew_offer_type,sales_channel,total_claim_amount,vehicle_class,vehicle_size
8713,XP41644,California,4335.676738,No,Basic,High School or Below,1/19/11,Employed,M,30574,Urban,Single,114,19,10,1,1,Corporate Auto,Corporate L2,Offer3,Web,396.105473,Sports Car,Medsize
5767,VU99589,Arizona,34611.37896,Yes,Basic,High School or Below,1/14/11,Employed,F,20090,Suburban,Married,109,10,59,0,2,Personal Auto,Personal L3,Offer2,Agent,523.2,Sports Car,Medsize
3202,KI25980,Washington,14722.22122,No,Extended,Master,2/1/11,Employed,M,43982,Suburban,Divorced,126,20,68,1,2,Personal Auto,Personal L3,Offer1,Web,604.8,SUV,Small
4816,VU72280,California,5318.89664,Yes,Basic,Bachelor,1/12/11,Employed,F,25134,Suburban,Married,67,3,0,0,6,Corporate Auto,Corporate L2,Offer2,Call Center,321.6,Four-Door Car,Small
4391,DQ34772,Arizona,6361.845408,No,Extended,High School or Below,2/20/11,Unemployed,M,0,Rural,Married,88,28,23,0,4,Personal Auto,Personal L3,Offer4,Branch,110.527627,Four-Door Car,Medsize


In [3]:
explore_data(data)  # sum & returns duplicates, NaN & empty spaces

There are 0 duplicate rows. Also;


Unnamed: 0,NaN,EmptySpaces
customer,0,0
state,0,0
customer_lifetime_value,0,0
response,0,0
coverage,0,0
education,0,0
effective_to_date,0,0
employmentstatus,0,0
gender,0,0
income,0,0


In [4]:
# Moving on!

<h1 style="color: #00BFFF;">00 | Data Cleaning</h1>

In [5]:
## Summary of all previosu labs ! ##

data_c = data.copy() ## Copy as best practices
data_c = data_c.drop(['customer'], axis=1) # We drop 'customer' ID from data_c (safe copy) like its NaN. ID = uselesness most of the times

# Dates are complex. First, we will change it to datetime format
data_c['effective_to_date'] = data_c['effective_to_date'].astype('datetime64[ns]')
# And now, to have it as numericals in separate days, months and years
data_c['year'] = data_c['effective_to_date'].dt.year
data_c['month'] = data_c['effective_to_date'].dt.month
data_c['day'] = data_c['effective_to_date'].dt.day
# And... bye bye!
data_c = data_c.drop(['effective_to_date'], axis=1)

# Selecting Numericals for this lab
n = data_c.select_dtypes(include=np.number).drop_duplicates()
explore_data(n)

# Why I added .drop_duplicates?

## In explore_data(data) I had 0 duplicates
## In explore_data(data_c) just making the copy, I created 163 duplicates
## Lastly, explore_data(n) first returned me... 1090 duplicates !
## So, I am not sure what the error is in the first place to be honest, but I think it's safe to drop them, they were not present in the first dataset.

There are 0 duplicate rows. Also;


Unnamed: 0,NaN,EmptySpaces
customer_lifetime_value,0,0
income,0,0
monthly_premium_auto,0,0
months_since_last_claim,0,0
months_since_policy_inception,0,0
number_of_open_complaints,0,0
number_of_policies,0,0
total_claim_amount,0,0
year,0,0
month,0,0


In [6]:
n.columns

Index(['customer_lifetime_value', 'income', 'monthly_premium_auto',
       'months_since_last_claim', 'months_since_policy_inception',
       'number_of_open_complaints', 'number_of_policies', 'total_claim_amount',
       'year', 'month', 'day'],
      dtype='object')

In [7]:
n = n[['customer_lifetime_value', 'income', 'monthly_premium_auto',
       'months_since_last_claim', 'months_since_policy_inception',
       'number_of_open_complaints', 'number_of_policies',
       'year', 'month', 'day', 'total_claim_amount']] # Moving total_claim_amount (target) to the right

### Selecting Categoricals

In [8]:
c = data_c.select_dtypes(exclude=np.number).drop_duplicates() # Again, we got the same error :/

### Encoding Categoricals
* We will count `unique` for each feature.
* **If** it follows an hierarchy, ordinal encoding. **Elif**, manual encoding. **Elif** (too many uniques), get dummies. **Else** (dates), transform it to a datetime object and then create new columns for `day`, `month` & `year`

In [9]:
# One by one, we will check unique values to encode them manually if it's necessary
c['response'].unique()

array(['No', 'Yes'], dtype=object)

In [10]:
binary = {'No' : 0, 'Yes' : 1}
c['response'].replace(binary, inplace=True)

In [11]:
c['coverage'].unique()

array(['Basic', 'Extended', 'Premium'], dtype=object)

In [12]:
# In this case, ordinal encoding. Premium > Extended > Basic
ordinal = {'Basic' : 0, 'Extended' : 1, 'Premium' : 2}
c['coverage'].replace(ordinal, inplace=True)

In [13]:
c['education'].unique()

array(['Bachelor', 'College', 'Master', 'High School or Below', 'Doctor'],
      dtype=object)

In [14]:
# Then again, ordinal. Doctor > Master > College > Bachelor > High School or Below
ordinal = {'High School or Below' : 0, 'Bachelor' : 1, 'College' : 2, 'Master' : 3, 'Doctor' : 4}
c['education'].replace(ordinal, inplace=True)

In [15]:
# Next, employmentstatus:
c['employmentstatus'].unique() # In this case, we will use get_dummies, since we don't want to represent a hierarchy

array(['Employed', 'Unemployed', 'Medical Leave', 'Disabled', 'Retired'],
      dtype=object)

In [16]:
c['gender'].unique() # We have two genders in this dataset, so get_dummies

array(['F', 'M'], dtype=object)

In [17]:
c['location_code'].unique() # Again, we don't want to show any hierarchy so we will use get_dummies

array(['Suburban', 'Rural', 'Urban'], dtype=object)

In [18]:
c['marital_status'].unique() # get_dummies

array(['Married', 'Single', 'Divorced'], dtype=object)

In [19]:
c['policy_type'].unique()

array(['Corporate Auto', 'Personal Auto', 'Special Auto'], dtype=object)

In [20]:
# Then again, hierarchy. Special Auto > Corporate Auto > Personal Auto
ordinal = {'Personal Auto' : 0, 'Corporate Auto' : 1, 'Special Auto' : 2}
c['policy_type'].replace(ordinal, inplace=True)

In [21]:
c['policy'].unique() # get_dummies

array(['Corporate L3', 'Personal L3', 'Corporate L2', 'Personal L1',
       'Special L2', 'Corporate L1', 'Personal L2', 'Special L1',
       'Special L3'], dtype=object)

In [22]:
# Then again, hierarchy. Special L3 > Special L2 > Special L1 > Corporate L3 > Corporate L2 > Corporate L1 > Personal L3 > Personal L2 > Personal L1
ordinal = {'Personal L1' : 0, 'Personal L2' : 1, 'Personal L3': 2, 'Corporate L1' : 3, 'Corporate L2' : 4, 'Corporate L3' : 5, 'Special L1' : 6, 'Special L2' : 7, 'Special L3' : 8}
c['policy'].replace(ordinal, inplace=True)

In [23]:
c['renew_offer_type'].unique() # get_dummies, we don't know the hierarchy of the offers

array(['Offer1', 'Offer3', 'Offer2', 'Offer4'], dtype=object)

In [24]:
c['sales_channel'].unique() # get_dummies

array(['Agent', 'Call Center', 'Web', 'Branch'], dtype=object)

In [25]:
c['vehicle_class'].unique() # There is a clear hierarchy Luxury > Sports but not with the others. We will use get_dummies

array(['Two-Door Car', 'Four-Door Car', 'SUV', 'Luxury SUV', 'Sports Car',
       'Luxury Car'], dtype=object)

In [26]:
c['vehicle_size'].unique()

array(['Medsize', 'Small', 'Large'], dtype=object)

In [27]:
ordinal = {'Small' : 0, 'Medsize' : 1, 'Large': 2}
c['vehicle_size'].replace(ordinal, inplace=True)

In [31]:
# We now select all our categoricals encoded before applying get dummies
c_n = c.select_dtypes(include = np.number).drop_duplicates() # rrrgggg
# Now again, we select only categoricals to encode them with get_dummies
c  = c.select_dtypes(exclude = np.number).drop_duplicates()
# Now, get_dummies
c_dumm = pd.get_dummies(c, drop_first=False)
# And... all together
c = pd.concat([c_n, c_dumm ], axis=1) # we concat them with our numerical values, target at our righ
c.head(10)

Unnamed: 0,response,coverage,education,policy_type,policy,vehicle_size,state_Arizona,state_California,state_Nevada,state_Oregon,state_Washington,employmentstatus_Disabled,employmentstatus_Employed,employmentstatus_Medical Leave,employmentstatus_Retired,employmentstatus_Unemployed,gender_F,gender_M,location_code_Rural,location_code_Suburban,location_code_Urban,marital_status_Divorced,marital_status_Married,marital_status_Single,renew_offer_type_Offer1,renew_offer_type_Offer2,renew_offer_type_Offer3,renew_offer_type_Offer4,sales_channel_Agent,sales_channel_Branch,sales_channel_Call Center,sales_channel_Web,vehicle_class_Four-Door Car,vehicle_class_Luxury Car,vehicle_class_Luxury SUV,vehicle_class_SUV,vehicle_class_Sports Car,vehicle_class_Two-Door Car
0,0.0,0.0,1.0,1.0,5.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,0.0,1.0,1.0,0.0,2.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,0.0,2.0,1.0,0.0,2.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3,0.0,0.0,1.0,1.0,4.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
5,1.0,0.0,1.0,0.0,2.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
6,1.0,0.0,2.0,1.0,5.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
7,0.0,2.0,3.0,1.0,5.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
8,1.0,0.0,1.0,1.0,5.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
9,0.0,1.0,2.0,2.0,7.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0


<h3 style="color: #008080;">1. In this final lab, we will model our data. Import sklearn `train_test_split` and separate the data.</h3>

In [29]:
Y = n['total_claim_amount']
X = n.drop(['total_claim_amount'], axis=1)

In [30]:
# We define train and test for X and Y
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=42) 

# test_size = We give 30% for testing and 70% for testing
# random_state = it'll improve the model to divide always the model in the same way

<h3 style="color: #008080;">2. Try a simple linear regression with all the data to see whether we are getting good results.</h3>

In [24]:
lr = LinearRegression() # A simple Linear Regression model
lr.fit(X_train, y_train) # Train data for the model

 # Predictions
predictions = lr.predict(X_test)
r2_3 = r2_score(y_test, predictions)
RMSE_3 = mean_squared_error(y_test, predictions, squared=False)
MSE_3 = mean_squared_error(y_test, predictions)
MAE_3 = mean_absolute_error(y_test, predictions)

#Printing the results
print("R2 = ", round(r2_3, 4))
print("RMSE = ", round(RMSE_3, 4))
print("The value of the metric MSE is ", round(MSE_3, 4))
print("MAE = ", round(MAE_3, 4))

R2 =  0.0117
RMSE =  0.8513
The value of the metric MSE is  0.7248
MAE =  0.5953


<h3 style="color: #008080;">3. Great! Now define a function that takes a list of models and train (and tests) them so we can try a lot of them without repeating code.</h3>

In [42]:
def model_test(X_train, y_train, X_test):
    # Linear Regression
    lr = LinearRegression() # A simple Linear Regression model
    lr.fit(X_train, y_train) # Train data for the model
    # Predictions
    predictions = lr.predict(X_test)
    r2_3 = r2_score(y_test, predictions)
    RMSE_3 = mean_squared_error(y_test, predictions, squared=False)
    MSE_3 = mean_squared_error(y_test, predictions)
    MAE_3 = mean_absolute_error(y_test, predictions)
    #Printing the results
    print("Linear Regression Results")
    print("R2 = ", round(r2_3, 4))
    print("RMSE = ", round(RMSE_3, 4))
    print("The value of the metric MSE is ", round(MSE_3, 4))
    print("MAE = ", round(MAE_3, 4))
    print()
    
    
    # ElasticNet
    # Settings
    alpha = 1.0
    l1_ratio = 0.5
    elasticnet = ElasticNet(alpha=alpha, l1_ratio=l1_ratio)
    elasticnet.fit(X_train, y_train) # Train data for the model
    # Predictions
    predictions = elasticnet.predict(X_test)
    er2_3 = r2_score(y_test, predictions)
    eRMSE_3 = mean_squared_error(y_test, predictions, squared=False)
    eMSE_3 = mean_squared_error(y_test, predictions)
    eMAE_3 = mean_absolute_error(y_test, predictions)
    #Printing the results
    print("ElasticNet Results")
    print("R2 = ", round(er2_3, 4))
    print("RMSE = ", round(eRMSE_3, 4))
    print("The value of the metric MSE is ", round(eMSE_3, 4))
    print("MAE = ", round(eMAE_3, 4))
    print()
    
    # KNeighborsRegressor
    KN = KNeighborsRegressor() # A simple Linear Regression model
    KN.fit(X_train, y_train) # Train data for the model
    # Predictions
    predictions = KN.predict(X_test)
    kr2_3 = r2_score(y_test, predictions)
    kRMSE_3 = mean_squared_error(y_test, predictions, squared=False)
    kMSE_3 = mean_squared_error(y_test, predictions)
    kMAE_3 = mean_absolute_error(y_test, predictions)
    #Printing the results
    print("KNeighborsRegressor Results")
    print("R2 = ", round(kr2_3, 4))
    print("RMSE = ", round(kRMSE_3, 4))
    print("The value of the metric MSE is ", round(kMSE_3, 4))
    print("MAE = ", round(kMAE_3, 4))
    print()
    
    # MLPRegressor
    ML = KNeighborsRegressor() # A simple Linear Regression model
    ML.fit(X_train, y_train) # Train data for the model
    # Predictions
    predictions = ML.predict(X_test)
    mr2_3 = r2_score(y_test, predictions)
    mRMSE_3 = mean_squared_error(y_test, predictions, squared=False)
    mMSE_3 = mean_squared_error(y_test, predictions)
    mMAE_3 = mean_absolute_error(y_test, predictions)
    #Printing the results
    print("MLPRegressor Results")
    print("R2 = ", round(mr2_3, 4))
    print("RMSE = ", round(mRMSE_3, 4))
    print("The value of the metric MSE is ", round(mMSE_3, 4))
    print("MAE = ", round(mMAE_3, 4))
    print()

<h3 style="color: #008080;">4. Use the function to check `LinearRegressor` and `KNeighborsRegressor`.
</h3>

<h3 style="color: #008080;">5. You can check also the `MLPRegressor` for this task!</h3>

In [44]:
model_test(X_train, y_train, X_test)

Linear Regression Results
R2 =  0.0117
RMSE =  0.8513
The value of the metric MSE is  0.7248
MAE =  0.5953

ElasticNet Results
R2 =  -0.0012
RMSE =  0.8569
The value of the metric MSE is  0.7342
MAE =  0.5978

KNeighborsRegressor Results
R2 =  -0.105
RMSE =  0.9002
The value of the metric MSE is  0.8104
MAE =  0.6688

MLPRegressor Results
R2 =  -0.105
RMSE =  0.9002
The value of the metric MSE is  0.8104
MAE =  0.6688



<h3 style="color: #008080;">6. Check and discuss the results.
</h3>

In [45]:
# Data must be not right, because I got weird values as results.
# To be honest, I kind of did this lab in a rush and I will try to get back to it (having a a goodfunction to run different models it's a must)

# From the results above, Linear Regression still proves to have the best r2 score
# ElasticNet is one I used in the past, and it's more usefull for overfitted data

# Again, I'll come back and try to make it better... I just wanted to get this one done.