<b/> Instructions

- Fit the models LinearRegression, Lasso and Ridge and compare their performance.
- (Optional) Define a function that takes a list of models and trains (and tests) them, so we can try many of them without repeating code.
- Use feature selection techniques (p-value, RFE) to select a subset of features to train the model with (if necessary).
- (Optional) Refit the models with the selected features.
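
The optional helper described in the instructions could look roughly like the sketch below. The function name `fit_and_score` and the synthetic data from `make_regression` are assumptions made for illustration; in the notebook the real `X_train`, `X_test`, `y_train` and `y_test` would be passed instead.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.model_selection import train_test_split


def fit_and_score(models, X_train, X_test, y_train, y_test):
    """Fit each model and return {model name: (train R^2, test R^2)}."""
    results = {}
    for model in models:
        model.fit(X_train, y_train)
        results[model.__class__.__name__] = (
            model.score(X_train, y_train),
            model.score(X_test, y_test),
        )
    return results


# Synthetic stand-in data (an assumption; swap in the notebook's X and y).
X_demo, y_demo = make_regression(
    n_samples=300, n_features=6, noise=10.0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42
)

scores = fit_and_score(
    [LinearRegression(), Lasso(alpha=1.0), Ridge(alpha=1.0)],
    X_tr, X_te, y_tr, y_te,
)
for name, (train_r2, test_r2) in scores.items():
    print(f"{name}: Train -> {train_r2:.3f}, Test -> {test_r2:.3f}")
```

Keying the results by class name keeps the output in the same shape as the print statements used later in the notebook. Two models of the same class would overwrite each other, so a different key would be needed in that case.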

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline
df = pd.read_csv('Data_Marketing_Customer_Analysis_Round3.csv')

In [3]:
# Keep only the numerical columns (categorical ones are left out)
numerical = df.select_dtypes(include=np.number)
numerical

Unnamed: 0,customer_lifetime_value,income,monthly_premium_auto,months_since_last_claim,months_since_policy_inception,number_of_open_complaints,number_of_policies,total_claim_amount
0,4809,48029,61,7,52,0,9,292
1,2228,92260,64,3,26,0,1,744
2,14947,22139,100,34,31,0,2,480
3,22332,49078,97,10,3,0,2,484
4,9025,23675,117,33,31,0,7,707
...,...,...,...,...,...,...,...,...
10684,15563,61541,253,12,40,0,7,1214
10685,5259,61146,65,7,68,0,6,273
10686,23893,39837,201,11,63,0,2,381
10687,11971,64195,158,0,27,4,6,618


In [4]:
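# It seems there are some NaNs: drop every column that contains one (axis=1 drops columns, not rows)
numerical = numerical.dropna(axis=1)
# reset_index() without drop=True keeps the old index as a new 'index' column,
# so 'index' shows up among the features below
numerical = numerical.reset_index(col_fill='')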

In [5]:
# check for NaN values
print(numerical.isna().sum())

index                            0
customer_lifetime_value          0
income                           0
monthly_premium_auto             0
months_since_last_claim          0
months_since_policy_inception    0
number_of_open_complaints        0
number_of_policies               0
total_claim_amount               0
dtype: int64


In [6]:
# check the data type of every column
numerical.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10689 entries, 0 to 10688
Data columns (total 9 columns):
 #   Column                         Non-Null Count  Dtype
---  ------                         --------------  -----
 0   index                          10689 non-null  int64
 1   customer_lifetime_value        10689 non-null  int64
 2   income                         10689 non-null  int64
 3   monthly_premium_auto           10689 non-null  int64
 4   months_since_last_claim        10689 non-null  int64
 5   months_since_policy_inception  10689 non-null  int64
 6   number_of_open_complaints      10689 non-null  int64
 7   number_of_policies             10689 non-null  int64
 8   total_claim_amount             10689 non-null  int64
dtypes: int64(9)
memory usage: 751.7 KB


In [7]:
# change data type of columns
numerical = numerical.astype('float32')

In [8]:
# Defining X & y
X = numerical.drop(columns=["total_claim_amount"])
# y comes from df; its rows still line up with X because only columns were dropped above
y = df['total_claim_amount']

In [9]:
# Data Splitting
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

X_train = pd.DataFrame(X_train, columns=X.columns)
X_test  = pd.DataFrame(X_test, columns=X.columns)

In [10]:
X_train.describe()

Unnamed: 0,index,customer_lifetime_value,income,monthly_premium_auto,months_since_last_claim,months_since_policy_inception,number_of_open_complaints,number_of_policies
count,8551.0,8551.0,8551.0,8551.0,8551.0,8551.0,8551.0,8551.0
mean,5341.805176,7994.913574,51817.503906,93.295288,15.13554,48.192959,0.375395,2.983511
std,3082.456543,6848.842773,24717.367188,34.575584,10.133221,27.84947,0.899732,2.398449
min,1.0,1898.0,10074.0,61.0,0.0,0.0,0.0,1.0
25%,2665.5,4020.5,29435.0,68.0,6.0,25.0,0.0,1.0
50%,5361.0,5764.0,50446.0,83.0,14.0,48.0,0.0,2.0
75%,8001.0,8964.0,72194.5,109.0,23.0,71.0,0.0,4.0
max,10688.0,74228.0,99981.0,298.0,35.0,99.0,5.0,9.0


### Variance threshold method

Univariate method

In [11]:
from sklearn.feature_selection import VarianceThreshold # It only works with numerical features


X_train = X_train.select_dtypes(include=np.number)
X_test  = X_test.select_dtypes(include=np.number)

#display(X_train)
print("Initial number of numerical columns: ",X_train.shape)
print()


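selector = VarianceThreshold(100) # Default threshold value is 0
# Features with a training-set variance lower than this threshold will be removed.
# Variance depends on each feature's scale, so small-range columns (e.g. counts) are dropped here regardless of how predictive they are.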
selector.fit(X_train)

kept_features_indexes = selector.get_support(indices = True) #returns an array of integers corresponding to nonremoved features
kept_features = list(X_train.iloc[:,kept_features_indexes].columns)

X_train = selector.transform(X_train)
X_test  = selector.transform(X_test)

X_train = pd.DataFrame(X_train, columns=kept_features)
X_test  = pd.DataFrame(X_test, columns=kept_features)

print("Final number of numerical columns: ",X_train.shape)
print()
X_train

Initial number of numerical columns:  (8551, 8)

Final number of numerical columns:  (8551, 6)



Unnamed: 0,index,customer_lifetime_value,income,monthly_premium_auto,months_since_last_claim,months_since_policy_inception
0,9877.0,21423.0,22379.0,65.0,9.0,31.0
1,10069.0,8391.0,40211.0,106.0,5.0,98.0
2,10317.0,3969.0,49544.0,101.0,3.0,29.0
3,9796.0,14914.0,45963.0,63.0,3.0,73.0
4,8995.0,18060.0,57882.0,115.0,1.0,61.0
...,...,...,...,...,...,...
8546,5734.0,7610.0,98701.0,94.0,22.0,66.0
8547,5191.0,35186.0,86134.0,98.0,17.0,78.0
8548,5390.0,4241.0,19834.0,64.0,26.0,8.0
8549,860.0,12941.0,77060.0,106.0,23.0,90.0
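
The instructions also mention p-value based selection, which is another univariate filter. A minimal sketch, assuming scikit-learn's `f_regression` (a univariate F-test of each feature against the target) is an acceptable way to get the p-values. The synthetic data, the names `X_demo` and `y_demo`, and the 0.05 significance level are all assumptions for illustration.

```python
import numpy as np
from sklearn.feature_selection import f_regression

rng = np.random.default_rng(0)
n = 500
X_demo = rng.normal(size=(n, 3))
# Only column 0 drives the target; columns 1 and 2 are pure noise.
y_demo = 2.0 * X_demo[:, 0] + rng.normal(size=n)

# f_regression returns one F statistic and one p-value per feature.
f_stats, p_values = f_regression(X_demo, y_demo)

significance_level = 0.05  # hypothetical cut-off
kept = [i for i, p in enumerate(p_values) if p < significance_level]

print("p-values:", p_values)
print("kept column positions:", kept)
```

Unlike the variance threshold, this filter looks at the relationship with `y`, so it does not depend on the scale of each feature. It tests each feature on its own and does not account for correlated predictors.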


### Recursive feature elimination (RFE)


We need to eliminate NaNs for that.

In [12]:
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE  ## recursive feature elimination technique

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

X_train = X_train.select_dtypes(include=np.number)
X_test  = X_test.select_dtypes(include=np.number)

X_train = pd.DataFrame(X_train, columns=X.columns)
X_test  = pd.DataFrame(X_test, columns=X.columns)

#display(X_train)
X_train

Unnamed: 0,index,customer_lifetime_value,income,monthly_premium_auto,months_since_last_claim,months_since_policy_inception,number_of_open_complaints,number_of_policies
9877,9877.0,21423.0,22379.0,65.0,9.0,31.0,0.0,2.0
10069,10069.0,8391.0,40211.0,106.0,5.0,98.0,2.0,6.0
10317,10317.0,3969.0,49544.0,101.0,3.0,29.0,0.0,1.0
9796,9796.0,14914.0,45963.0,63.0,3.0,73.0,2.0,2.0
8995,8995.0,18060.0,57882.0,115.0,1.0,61.0,0.0,2.0
...,...,...,...,...,...,...,...,...
5734,5734.0,7610.0,98701.0,94.0,22.0,66.0,0.0,3.0
5191,5191.0,35186.0,86134.0,98.0,17.0,78.0,0.0,2.0
5390,5390.0,4241.0,19834.0,64.0,26.0,8.0,4.0,8.0
860,860.0,12941.0,77060.0,106.0,23.0,90.0,0.0,2.0


In [13]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8551 entries, 9877 to 7270
Data columns (total 8 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   index                          8551 non-null   float32
 1   customer_lifetime_value        8551 non-null   float32
 2   income                         8551 non-null   float32
 3   monthly_premium_auto           8551 non-null   float32
 4   months_since_last_claim        8551 non-null   float32
 5   months_since_policy_inception  8551 non-null   float32
 6   number_of_open_complaints      8551 non-null   float32
 7   number_of_policies             8551 non-null   float32
dtypes: float32(8)
memory usage: 334.0 KB


In [14]:
X_train

Unnamed: 0,index,customer_lifetime_value,income,monthly_premium_auto,months_since_last_claim,months_since_policy_inception,number_of_open_complaints,number_of_policies
9877,9877.0,21423.0,22379.0,65.0,9.0,31.0,0.0,2.0
10069,10069.0,8391.0,40211.0,106.0,5.0,98.0,2.0,6.0
10317,10317.0,3969.0,49544.0,101.0,3.0,29.0,0.0,1.0
9796,9796.0,14914.0,45963.0,63.0,3.0,73.0,2.0,2.0
8995,8995.0,18060.0,57882.0,115.0,1.0,61.0,0.0,2.0
...,...,...,...,...,...,...,...,...
5734,5734.0,7610.0,98701.0,94.0,22.0,66.0,0.0,3.0
5191,5191.0,35186.0,86134.0,98.0,17.0,78.0,0.0,2.0
5390,5390.0,4241.0,19834.0,64.0,26.0,8.0,4.0,8.0
860,860.0,12941.0,77060.0,106.0,23.0,90.0,0.0,2.0


In [15]:
lm = LinearRegression()

selector = RFE(lm, n_features_to_select= 5, step = 1, verbose = 1) # step is how many features to remove at each iteration
selector.fit(X_train, y_train)

kept_features = selector.get_support(indices = True) #returns an array of integers corresponding to nonremoved features
kept_features = list(X_train.iloc[:,kept_features].columns)

X_train = selector.transform(X_train)
X_test  = selector.transform(X_test)

X_train = pd.DataFrame(X_train, columns=kept_features)
X_test  = pd.DataFrame(X_test, columns=kept_features)

print("Final selected features: ")
display(X_train)

Fitting estimator with 8 features.
Fitting estimator with 7 features.
Fitting estimator with 6 features.
Final selected features: 


Unnamed: 0,monthly_premium_auto,months_since_last_claim,months_since_policy_inception,number_of_open_complaints,number_of_policies
0,65.0,9.0,31.0,0.0,2.0
1,106.0,5.0,98.0,2.0,6.0
2,101.0,3.0,29.0,0.0,1.0
3,63.0,3.0,73.0,2.0,2.0
4,115.0,1.0,61.0,0.0,2.0
...,...,...,...,...,...
8546,94.0,22.0,66.0,0.0,3.0
8547,98.0,17.0,78.0,0.0,2.0
8548,64.0,26.0,8.0,4.0,8.0
8549,106.0,23.0,90.0,0.0,2.0
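
RFE with a linear model ranks features by coefficient magnitude, and coefficients of unscaled features are not directly comparable. The three columns dropped above (`index`, `customer_lifetime_value`, `income`) are the ones with the largest scale, so scale may be part of the reason they were eliminated. The sketch below illustrates the effect on synthetic data. The data, the variable names and the use of `StandardScaler` are assumptions for illustration, not something the notebook does.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
big = rng.normal(0.0, 10_000.0, size=n)    # large-scale feature (income-like)
small = rng.normal(0.0, 1.0, size=(n, 3))  # three unit-scale features
X_demo = np.column_stack([big, small])
# The large-scale column explains most of the variation in y,
# but its raw coefficient (0.001) is tiny.
y_demo = 0.001 * big + 0.5 * small.sum(axis=1) + rng.normal(0.0, 1.0, size=n)

# RFE on raw features: the tiny raw coefficient gets eliminated first.
raw_rfe = RFE(LinearRegression(), n_features_to_select=1).fit(X_demo, y_demo)

# RFE on standardized features: coefficients are now comparable.
X_scaled = StandardScaler().fit_transform(X_demo)
scaled_rfe = RFE(LinearRegression(), n_features_to_select=1).fit(X_scaled, y_demo)

print("raw support:   ", raw_rfe.support_)
print("scaled support:", scaled_rfe.support_)
```

In a train/test workflow the scaler would be fitted on the training split only and then applied to the test split.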


<b/> Embedded methods

In [16]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

X_train = X_train.select_dtypes(include=np.number)
X_test  = X_test.select_dtypes(include=np.number)

In [17]:
import numpy as np
from sklearn.impute import SimpleImputer
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
X_train=imp_mean.fit_transform(X_train)
X_test = imp_mean.transform(X_test)  # reuse the training means; do not refit on the test set
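
An imputer is meant to learn its fill values from the training split only, so that nothing about the test split leaks into preprocessing. A tiny sketch with made-up arrays (the values and the names `train`, `test` and `imputer` are assumptions for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer

train = np.array([[1.0], [3.0], [np.nan]])
test = np.array([[np.nan], [10.0]])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
train_filled = imputer.fit_transform(train)  # learns the mean (2.0) from train
test_filled = imputer.transform(test)        # fills with the training mean

print(train_filled.ravel())
print(test_filled.ravel())
```

Calling `fit_transform` on `test` instead would fill the gap with the test mean (10.0 here), which is information the model should not have seen.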

<b/> OLS

In [18]:
model=LinearRegression()
model.fit(X_train, y_train)
print(f"{model.__class__.__name__}: Train -> {model.score(X_train, y_train)}, Test -> {model.score(X_test, y_test)}")

LinearRegression: Train -> 0.4086989399570252, Test -> 0.41132737690716314


<b/> A Lasso model can shrink coefficients to exactly zero, so it can also act as a feature selection technique (this requires alpha > 0)

In [19]:
from sklearn.linear_model import Lasso,Ridge,ElasticNet, LinearRegression
model=Lasso(alpha=0)  # alpha=0 removes the L1 penalty, so this is equivalent to OLS and drops no features

model.fit(X_train, y_train)
print(f"{model.__class__.__name__}: Train -> {model.score(X_train, y_train)}, Test -> {model.score(X_test, y_test)}")

Lasso: Train -> 0.4086989381605699, Test -> 0.4113266384469658


<b/> Ridge

In [20]:
model=Ridge(alpha=0)  # alpha=0 removes the L2 penalty, so this is equivalent to OLS
model.fit(X_train, y_train)
print(f"{model.__class__.__name__}: Train -> {model.score(X_train, y_train)}, Test -> {model.score(X_test, y_test)}")

Ridge: Train -> 0.40869893771569954, Test -> 0.41132661563882145


<b/> ElasticNet

In [21]:
model=ElasticNet(alpha=0.1)
model.fit(X_train, y_train)
print(f"{model.__class__.__name__}: Train -> {model.score(X_train, y_train)}, Test -> {model.score(X_test, y_test)}")

ElasticNet: Train -> 0.40869868016275823, Test -> 0.4113647243347006
