### Use numeric prediction techniques to build a predictive model for the HW3.xlsx dataset. This dataset is provided on the course website and contains data about whether or not different consumers made a purchase in response to a test mailing of a certain catalog and, in case of a purchase, how much money each consumer spent. The data file has a brief description of all the attributes in a separate worksheet. Note that this dataset has two possible outcome variables: Purchase (0/1 value: whether or not the purchase was made) and Spending (numeric value: amount spent).

Your tasks:

(a) Build numeric prediction models that predict Spending based on the other available customer information (obviously, not including the Purchase attribute among the inputs!). Use linear regression, k-NN, regression tree, SVM regression, neural network, and ensembling models. Briefly discuss your explorations and present the best result (best predictive model) for each of these techniques. Compare the techniques; which of them provides the best predictive performance? Please make sure you use best practices for predictive modeling. (I.e., which hyper-parameters do you need to set? Do you need to normalize?)

(b) As a variation on this exercise, create a separate “restricted” dataset (i.e., a subset of the original dataset), which includes only purchase records (i.e., where Purchase = 1). Build numeric prediction models to predict Spending for this restricted dataset. All the same requirements as for task (a) apply.

(c) For each predictive modeling technique, discuss the predictive performance differences between the models built for task (a) vs. task (b): which models exhibit better predictive performance? Why do you think that is?

### Loading Required Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor
import warnings 
warnings.filterwarnings('ignore')

### Loading and exploring the data

In [2]:
df = pd.read_excel("HW3.xlsx", sheet_name = 0)

In [4]:
# Viewing the Dataframe

df.head(5)

Unnamed: 0,sequence_number,US,source_a,source_c,source_b,source_d,source_e,source_m,source_o,source_h,...,source_x,source_w,Freq,last_update_days_ago,1st_update_days_ago,Web order,Gender=male,Address_is_res,Purchase,Spending
0,1,1,0,0,1,0,0,0,0,0,...,0,0,2,3662,3662,1,0,1,1,127.87
1,2,1,0,0,0,0,1,0,0,0,...,0,0,0,2900,2900,1,1,0,0,0.0
2,3,1,0,0,0,0,0,0,0,0,...,0,0,2,3883,3914,0,0,0,1,127.48
3,4,1,0,1,0,0,0,0,0,0,...,0,0,1,829,829,0,1,0,0,0.0
4,5,1,0,1,0,0,0,0,0,0,...,0,0,1,869,869,0,0,0,0,0.0


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 25 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   sequence_number       2000 non-null   int64  
 1   US                    2000 non-null   int64  
 2   source_a              2000 non-null   int64  
 3   source_c              2000 non-null   int64  
 4   source_b              2000 non-null   int64  
 5   source_d              2000 non-null   int64  
 6   source_e              2000 non-null   int64  
 7   source_m              2000 non-null   int64  
 8   source_o              2000 non-null   int64  
 9   source_h              2000 non-null   int64  
 10  source_r              2000 non-null   int64  
 11  source_s              2000 non-null   int64  
 12  source_t              2000 non-null   int64  
 13  source_u              2000 non-null   int64  
 14  source_p              2000 non-null   int64  
 15  source_x             

Most of the variables in this dataset are binary, and a few are continuous. There are no categorical variables in this dataset.
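The binary versus continuous split can also be checked programmatically. The sketch below is only an illustration on a toy frame: the column names mirror a few HW3.xlsx attributes, but the values and the names `toy`, `binary_cols`, and `numeric_cols` are made up here.

```python
import pandas as pd

# Toy frame standing in for the real data (values are invented for illustration).
toy = pd.DataFrame({
    "US": [1, 0, 1, 1],
    "source_a": [0, 0, 1, 0],
    "Freq": [2, 0, 1, 5],
    "last_update_days_ago": [3662, 2900, 829, 869],
})

# Treat a column as binary when all of its observed values are in {0, 1}.
is_binary = toy.apply(lambda col: set(col.unique()) <= {0, 1})
binary_cols = is_binary[is_binary].index.tolist()
numeric_cols = is_binary[~is_binary].index.tolist()

print("binary:", binary_cols)
print("non-binary:", numeric_cols)
```

Running the same check on `df` would list which attributes only take the values 0 and 1.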

In [5]:
# Checking the distribution of Data

df.describe()

Unnamed: 0,sequence_number,US,source_a,source_c,source_b,source_d,source_e,source_m,source_o,source_h,...,source_x,source_w,Freq,last_update_days_ago,1st_update_days_ago,Web order,Gender=male,Address_is_res,Purchase,Spending
count,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,...,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0
mean,1000.5,0.8245,0.1265,0.056,0.06,0.0415,0.151,0.0165,0.0335,0.0525,...,0.018,0.1375,1.417,2155.101,2435.6015,0.426,0.5245,0.221,0.5,102.560745
std,577.494589,0.380489,0.332495,0.229979,0.237546,0.199493,0.358138,0.12742,0.179983,0.223089,...,0.132984,0.344461,1.405738,1141.302846,1077.872233,0.494617,0.499524,0.415024,0.500125,186.749816
min,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
25%,500.75,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,1133.0,1671.25,0.0,0.0,0.0,0.0,0.0
50%,1000.5,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,2280.0,2721.0,0.0,1.0,0.0,0.5,1.855
75%,1500.25,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,2.0,3139.25,3353.0,1.0,1.0,0.0,1.0,152.5325
max,2000.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,15.0,4188.0,4188.0,1.0,1.0,1.0,1.0,1500.06


### Create train and test data
I will use an 80:20 train/test split.

In [6]:
# Separating predictor and response variable

x = df.drop(['Purchase', 'Spending'], axis = 1)
y = df['Spending']

In [7]:
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state=42)

### Normalization using StandardScaler
StandardScaler standardizes each feature to mean 0 and standard deviation 1.

In [8]:
# Fit the scaler on the training data only, then apply it to both splits
standardscaler = preprocessing.StandardScaler()

X_train = standardscaler.fit_transform(X_train)
X_test = standardscaler.transform(X_test)
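One caveat with scaling before cross-validation is that the scaler sees every training row, including rows that later act as validation folds. A common alternative is to make the scaler a step in a `Pipeline`, so it is refit on each training fold. The sketch below assumes synthetic data from `make_regression` and a k-NN regressor purely for illustration; it is not the notebook's data or its results.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the HW3.xlsx dataset).
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# The scaler is a pipeline step, so each CV split fits it on that split's training fold only.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsRegressor(n_neighbors=5)),
])

cv = KFold(n_splits=3, shuffle=True, random_state=10)
neg_mse = cross_val_score(pipe, X_demo, y_demo, scoring="neg_mean_squared_error", cv=cv)
print("Mean CV MSE:", -neg_mse.mean())
```

With a pipeline, grid-search parameter names take a step prefix, for example `knn__n_neighbors`.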

### Score Metrics

We will use mean squared error (MSE) as the scoring metric for evaluating model performance.

In [11]:
# Define score metrics for parameters optimization and model selection
score = 'neg_mean_squared_error'

# Inner and outer CV for models - 3 Folds
inner_cv = KFold(n_splits=3, shuffle=True, random_state=10)
outer_cv = KFold(n_splits=3, shuffle=True, random_state=11)
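The inner and outer splitters set up nested cross-validation: the inner folds choose hyperparameters, and the outer folds estimate how well that whole tuning procedure generalizes. A minimal sketch of the pattern follows, assuming synthetic data and a `Ridge` model with a small `alpha` grid. Both are swapped in here to keep the example fast and are not among the notebook's models.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# Synthetic stand-in data (assumption: not the HW3.xlsx dataset).
X_demo, y_demo = make_regression(n_samples=150, n_features=4, noise=5.0, random_state=1)

inner = KFold(n_splits=3, shuffle=True, random_state=10)
outer = KFold(n_splits=3, shuffle=True, random_state=11)

# Inner loop: pick alpha by cross-validated (negated) MSE.
search = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]},
                      scoring="neg_mean_squared_error", cv=inner)

# Outer loop: each outer training fold reruns the full grid search,
# and the tuned model is scored on the held-out outer fold.
outer_scores = cross_val_score(search, X_demo, y_demo,
                               scoring="neg_mean_squared_error", cv=outer)
print("Outer-fold MSE:", -outer_scores)
```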

### Model building

In [46]:
# Linear Regression
lr = LinearRegression()

# Parameters for lr gridsearch
lr_grid = {"fit_intercept":[True,False], "normalize":[True,False]}

In [47]:
lr_g = GridSearchCV(lr, lr_grid, scoring = score, cv = inner_cv)

In [48]:
lr_score = cross_val_score(lr_g, X=X_train, y=y_train, scoring = score, cv=outer_cv)
lr_scores = lr_score.mean()

In [49]:
# KNN
knn = KNeighborsRegressor()

# Parameters for knn gridsearch
knn_grid = {'n_neighbors':list(range(3,15)),
            'weights': ['uniform','distance']}

In [50]:
knn_g = GridSearchCV(knn, knn_grid, scoring = score, cv = inner_cv)

In [51]:
knn_score = cross_val_score(knn_g, X=X_train, y=y_train, scoring = score, cv=outer_cv)
knn_scores = knn_score.mean()

In [52]:
# Decision Tree
dt = DecisionTreeRegressor()

# Parameters for dt gridsearch
dt_grid = {'max_depth' : list(range(3,15)),
           'min_samples_split' : list(range(2,10)),
           'min_samples_leaf': list(range(1,5))}

In [53]:
dt_g = GridSearchCV(dt, dt_grid, scoring = score, cv = inner_cv)

dt_score = cross_val_score(dt_g, X=X_train, y=y_train, scoring = score, cv=outer_cv)
dt_scores = dt_score.mean()

In [54]:
# SVM
svm = SVR()

# Parameters for svm gridsearch
svm_grid = {'kernel': ['rbf'],
            'gamma': [1,0.1,0.01,0.001],
            'C': [0.001,0.01,0.1,1,10,100,1000]}

In [55]:
svm_g = GridSearchCV(svm, svm_grid, scoring = score, cv = inner_cv)

In [56]:
svm_score = cross_val_score(svm_g, X=X_train, y=y_train, scoring = score, cv=outer_cv)
svm_scores = svm_score.mean()

In [57]:
# Neural Network
nn = MLPRegressor()

# Parameters for nn gridsearch
nn_grid = {'hidden_layer_sizes':[(1,),(50,)], 
           'activation':['identity','logistic','tanh','relu']}

In [58]:
nn_g = GridSearchCV(nn, nn_grid, scoring = score, cv = inner_cv)

In [59]:
nn_score = cross_val_score(nn_g, X=X_train, y=y_train, scoring = score, cv=outer_cv)
nn_scores = nn_score.mean() 

In [40]:
# XGBoost
xgb = XGBRegressor()

# Parameters for xgb gridsearch
xgb_grid = {'max_depth': [5,7,8,9],
            'learning_rate': [0.1, 0.2, 0.3],
            'colsample_bytree': [0.4, 0.8],
            'min_child_weight': [1,5,10],
            'gamma': [0.5, 1, 1.5]}

In [41]:
xgb_g = GridSearchCV(xgb, xgb_grid, scoring = score, cv = inner_cv)

In [42]:
xgb_score = cross_val_score(xgb_g, X=X_train, y=y_train, scoring = score, cv=outer_cv)
xgb_scores = xgb_score.mean()

In [60]:
print('Mean MSE of Linear Regression:', lr_scores)
print('Mean MSE of KNN:', knn_scores)
print('Mean MSE of Decision Tree:', dt_scores)
print('Mean MSE of SVM:', svm_scores)
print('Mean MSE of Neural Networks:', nn_scores)
print('Mean MSE of XGBoost:', xgb_scores)

Mean MSE of Linear Regression: -17523.49968853053
Mean MSE of KNN: -23471.77351331543
Mean MSE of Decision Tree: -21685.779046653013
Mean MSE of SVM: -16866.141249547472
Mean MSE of Neural Networks: -26160.399030528144
Mean MSE of XGBoost: -18106.44558297301


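SVM has the lowest mean cross-validated MSE (the scores are negated MSE, so the value closest to zero is best). Next, we refit each grid search on the full training set to obtain the best parameters and best scores.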
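Because the `neg_mean_squared_error` scorer reports MSE with its sign flipped, the printed means can be easier to read after converting them to RMSE, which is in the same units as Spending. The sketch below reuses the means printed above; the names `neg_mse_a`, `rmse_a`, and `best_a` are introduced here for illustration.

```python
import math

# Mean nested-CV scores copied from the output above (negated MSE).
neg_mse_a = {
    "Linear Regression": -17523.49968853053,
    "KNN": -23471.77351331543,
    "Decision Tree": -21685.779046653013,
    "SVM": -16866.141249547472,
    "Neural Networks": -26160.399030528144,
    "XGBoost": -18106.44558297301,
}

# Flip the sign to recover MSE, then take the square root to get RMSE.
rmse_a = {name: math.sqrt(-score) for name, score in neg_mse_a.items()}

# Print the models from lowest to highest RMSE.
for name, value in sorted(rmse_a.items(), key=lambda kv: kv[1]):
    print(f"{name}: {value:.2f}")

best_a = min(rmse_a, key=rmse_a.get)
print("Lowest RMSE:", best_a)
```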

In [61]:
# Find the best parameter

best_lr = lr_g.fit(X_train, y_train)
best_knn = knn_g.fit(X_train, y_train)
best_dt = dt_g.fit(X_train, y_train)
best_svm = svm_g.fit(X_train, y_train)
best_nn = nn_g.fit(X_train, y_train)
best_xgb = xgb_g.fit(X_train, y_train)

In [81]:
# optimized hyperparameters
print('Best LR para:', best_lr.best_params_)
print('Best KNN para:', best_knn.best_params_)
print('Best DT para:', best_dt.best_params_)
print('Best SVM para:', best_svm.best_params_)
print('Best NN para:', best_nn.best_params_)
print('Best XGB para:', best_xgb.best_params_)

Best LR para: {'fit_intercept': True, 'normalize': True}
Best KNN para: {'n_neighbors': 11, 'weights': 'distance'}
Best DT para: {'max_depth': 5, 'min_samples_leaf': 3, 'min_samples_split': 7}
Best SVM para: {'C': 1000, 'gamma': 0.01, 'kernel': 'rbf'}
Best NN para: {'activation': 'identity', 'hidden_layer_sizes': (50,)}
Best XGB para: {'colsample_bytree': 0.8, 'gamma': 0.5, 'learning_rate': 0.1, 'max_depth': 5, 'min_child_weight': 5}


In [82]:
# best model scores
print('Best LR score:', best_lr.best_score_)
print('Best KNN score:', best_knn.best_score_)
print('Best DT score:', best_dt.best_score_)
print('Best SVM score:', best_svm.best_score_)
print('Best NN score:', best_nn.best_score_)
print('Best XGB score:', best_xgb.best_score_)

Best LR score: -17362.19752321727
Best KNN score: -23523.71048133974
Best DT score: -19184.45571991688
Best SVM score: -16507.730280342807
Best NN score: -24173.248539377135
Best XGB score: -16996.76105463757


By the grid-search scores above, SVM has the lowest MSE (about 16,508), with XGBoost close behind (about 16,997). We evaluate the tuned XGBoost model on the test set.

In [89]:
# Run model for predictions
y_pred = best_xgb.predict(X_test)

In [90]:
# Evaluating the tuned XGBoost model with RMSE on the test set
# (squared=False makes mean_squared_error return the root of the MSE)

best_xgb_rmse = mean_squared_error(y_test, y_pred, squared=False)
print(best_xgb_rmse)

190.5023386033806


The final XGBoost model with tuned parameters has an RMSE of 120.3323 on the test set.
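A test RMSE is easier to judge next to a naive baseline that always predicts the training mean. The sketch below uses made-up numbers and a hand-written `rmse` helper (a hypothetical name). It is computed with NumPy so that it does not depend on the `squared=False` argument, which recent scikit-learn releases have deprecated.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up values for illustration only (not the notebook's data).
y_train_demo = np.array([0.0, 0.0, 120.0, 80.0])
y_test_demo = np.array([0.0, 100.0, 50.0])
y_pred_demo = np.array([10.0, 90.0, 60.0])

# Baseline: predict the training-set mean for every test record.
baseline_pred = np.full_like(y_test_demo, y_train_demo.mean())

model_rmse = rmse(y_test_demo, y_pred_demo)
baseline_rmse = rmse(y_test_demo, baseline_pred)
print("Model RMSE:", model_rmse)
print("Mean-baseline RMSE:", baseline_rmse)
```

A model is only useful if its RMSE is clearly below the baseline's.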

## Task (b) Create a new dataset with only the records where Purchase = 1 and build the models to compare performance.

### Filter the original dataset to include only (Purchase = 1)

In [91]:
df2 = df[df['Purchase']==1]

### Create train and test data

In [92]:
x = df2.drop(['Purchase','Spending'], axis=1)
y = df2['Spending']

In [93]:
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.4, random_state=42)

### Normalization using StandardScaler, as in task (a)

In [94]:
# Fit the scaler on the training data only, then apply it to both splits
standardscaler = preprocessing.StandardScaler()
X_train = standardscaler.fit_transform(X_train)
X_test = standardscaler.transform(X_test)

In [96]:
# Linear Regression
lr = LinearRegression()

# Parameters for lr gridsearch
lr_grid = {"fit_intercept":[True,False], "normalize":[True,False]}

# Inner and outer CV
inner_cv = KFold(n_splits=3, shuffle=True, random_state=10)
outer_cv = KFold(n_splits=3, shuffle=True, random_state=11)

# Define score metrics for parameters optimization and model selection
score = 'neg_mean_squared_error'

lr_g = GridSearchCV(lr, lr_grid, scoring = score, cv = inner_cv)

lr_score = cross_val_score(lr_g, X=X_train, y=y_train, scoring = score, cv=outer_cv)
lr_scores = lr_score.mean()

In [97]:
# KNN
knn = KNeighborsRegressor()

# Parameters for knn gridsearch
knn_grid = {'n_neighbors':list(range(3,15)),
            'weights': ['uniform','distance']}

knn_g = GridSearchCV(knn, knn_grid, scoring = score, cv = inner_cv)

knn_score = cross_val_score(knn_g, X=X_train, y=y_train, scoring = score, cv=outer_cv)
knn_scores = knn_score.mean()

In [98]:
# Decision Tree
dt = DecisionTreeRegressor()

# Parameters for dt gridsearch
dt_grid = {'max_depth' : list(range(13,15)),
           'min_samples_split' : list(range(2,4)),
           'min_samples_leaf': list(range(1,2))}

dt_g = GridSearchCV(dt, dt_grid, scoring = score, cv = inner_cv)

dt_score = cross_val_score(dt_g, X=X_train, y=y_train, scoring = score, cv=outer_cv)
dt_scores = dt_score.mean()

In [99]:
# SVM
svm = SVR()

# Parameters for svm gridsearch
svm_grid = {'kernel': ['rbf'],
            'gamma': [1,0.1,0.01,0.001],
            'C': [0.001,0.01,0.1,1,10,100,1000]}

svm_g = GridSearchCV(svm, svm_grid, scoring = score, cv = inner_cv)

svm_score = cross_val_score(svm_g, X=X_train, y=y_train, scoring = score, cv=outer_cv)
svm_scores = svm_score.mean()

In [100]:
# Neural Network
nn = MLPRegressor()

# Parameters for nn gridsearch
nn_grid = {'hidden_layer_sizes':[(1,),(50,)], 
           'activation':['identity','logistic','tanh','relu']}

nn_g = GridSearchCV(nn, nn_grid, scoring = score, cv = inner_cv)

nn_score = cross_val_score(nn_g, X=X_train, y=y_train, scoring = score, cv=outer_cv)
nn_scores = nn_score.mean() 

In [101]:
# XGBoost
xgb = XGBRegressor()

# Parameters for xgb gridsearch
xgb_grid = {'max_depth': [3,4,5],
            'learning_rate': [0.1, 0.2, 0.3],
            'colsample_bytree': [0.4, 0.8],
            'min_child_weight': [1,5,10],
            'gamma': [0.5, 1, 1.5]}

xgb_g = GridSearchCV(xgb, xgb_grid, scoring = score, cv = inner_cv)

xgb_score = cross_val_score(xgb_g, X=X_train, y=y_train, scoring = score, cv=outer_cv)
xgb_scores = xgb_score.mean()

In [102]:
print('Mean MSE of Linear Regression:', lr_scores)
print('Mean MSE of KNN:', knn_scores)
print('Mean MSE of Decision Tree:', dt_scores)
print('Mean MSE of SVM:', svm_scores)
print('Mean MSE of Neural Networks:', nn_scores)
print('Mean MSE of XGBoost:', xgb_scores)

Mean MSE of Linear Regression: -28689.20545500593
Mean MSE of KNN: -37802.45814780904
Mean MSE of Decision Tree: -55308.69564453004
Mean MSE of SVM: -28945.809591083962
Mean MSE of Neural Networks: -68418.53273716995
Mean MSE of XGBoost: -33888.648141369216


Linear Regression and SVM perform best. Next, we refit each grid search to obtain the best parameters and best scores, and use them to decide which model to evaluate on the test set for this dataset containing only Purchase = 1 records.

In [86]:
# Find the best parameter

best_lr = lr_g.fit(X_train, y_train)
best_knn = knn_g.fit(X_train, y_train)
best_dt = dt_g.fit(X_train, y_train)
best_svm = svm_g.fit(X_train, y_train)
best_nn = nn_g.fit(X_train, y_train)
best_xgb = xgb_g.fit(X_train, y_train)

In [87]:
# optimized hyperparameters
print('Best LR para:', best_lr.best_params_)
print('Best KNN para:', best_knn.best_params_)
print('Best DT para:', best_dt.best_params_)
print('Best SVM para:', best_svm.best_params_)
print('Best NN para:', best_nn.best_params_)
print('Best XGB para:', best_xgb.best_params_)

Best LR para: {'fit_intercept': True, 'normalize': False}
Best KNN para: {'n_neighbors': 4, 'weights': 'uniform'}
Best DT para: {'max_depth': 14, 'min_samples_leaf': 1, 'min_samples_split': 2}
Best SVM para: {'C': 1000, 'gamma': 0.01, 'kernel': 'rbf'}
Best NN para: {'activation': 'relu', 'hidden_layer_sizes': (50,)}
Best XGB para: {'colsample_bytree': 0.8, 'gamma': 0.5, 'learning_rate': 0.1, 'max_depth': 3, 'min_child_weight': 10}


In [88]:
# best model scores
print('Best LR score:', best_lr.best_score_)
print('Best KNN score:', best_knn.best_score_)
print('Best DT score:', best_dt.best_score_)
print('Best SVM score:', best_svm.best_score_)
print('Best NN score:', best_nn.best_score_)
print('Best XGB score:', best_xgb.best_score_)

Best LR score: -27499.283218975794
Best KNN score: -30731.528298583333
Best DT score: -44882.71063830598
Best SVM score: -25984.87576312307
Best NN score: -70157.694948203
Best XGB score: -28899.743754322568


The best model with optimized parameters is SVM; we now evaluate it on the test set.

In [64]:
# Modeling for prediction

y_pred = best_svm.predict(X_test)

In [65]:
# Evaluating the best model with RMSE on the test set
# (squared=False makes mean_squared_error return the root of the MSE)
best_svm_rmse = mean_squared_error(y_test, y_pred, squared=False)
print(best_svm_rmse)

128.18814970327554


### Task (c)
### For each predictive modeling technique, discuss the predictive performance differences between the models built for task (a) vs. task (b): which models exhibit better predictive performance? Why do you think that is?

Every technique performs better in task (a). The last column shows how much lower the best cross-validated MSE is in task (a) than in task (b), relative to the task (b) value. The scores are negated MSE, so values closer to zero are better.

| Model | Task (a) best score | Task (b) best score | MSE lower in task (a) by |
|---|---|---|---|
| Linear Regression | -17362.19752321727 | -27499.283218975794 | 36% |
| KNN | -23523.71048133974 | -30731.528298583333 | 23% |
| Decision Tree | -19184.45571991688 | -44882.71063830598 | 57% |
| SVM | -16507.730280342807 | -25984.87576312307 | 36% |
| Neural Network | -24173.248539377135 | -70157.694948203 | 65% |
| XGBoost | -16996.76105463757 | -28899.743754322568 | 41% |
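The percentages quoted for task (a) versus task (b) can be reproduced from the best scores reported in the two tasks. The sketch below assumes the percentage is the reduction in MSE relative to the task (b) value, truncated to a whole number; the dictionary names are introduced here for illustration.

```python
# Best grid-search scores (negated MSE) copied from the outputs of tasks (a) and (b).
best_a = {
    "LR": -17362.19752321727,
    "KNN": -23523.71048133974,
    "DT": -19184.45571991688,
    "SVM": -16507.730280342807,
    "NN": -24173.248539377135,
    "XGB": -16996.76105463757,
}
best_b = {
    "LR": -27499.283218975794,
    "KNN": -30731.528298583333,
    "DT": -44882.71063830598,
    "SVM": -25984.87576312307,
    "NN": -70157.694948203,
    "XGB": -28899.743754322568,
}

# Relative MSE reduction of task (a) versus task (b), in percent.
pct_lower = {}
for model in best_a:
    mse_a, mse_b = -best_a[model], -best_b[model]
    pct_lower[model] = 100.0 * (mse_b - mse_a) / mse_b

for model, pct in pct_lower.items():
    print(f"{model}: task (a) MSE is {int(pct)}% lower")
```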


### Evaluation

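In task (b) we removed the records with Purchase = 0, so the training set is much smaller: half of the 2,000 records remain, and with the 60:40 split only 600 of them are used for training, versus 1,600 in task (a). All the mean squared errors are much higher in task (b) than in task (a). This suggests that whether a purchase was made carries much of the information about Spending: in task (a) about half of the records have zero spending, which are likely easier to predict, whereas in task (b) the models must explain the variation among buyers only. Note that Purchase itself is not a model input in either task. In future work we should check every attribute for this kind of effect, especially attributes that carry a lot of weight.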
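MSE depends on the spread of the target, so comparing raw MSE across two datasets with different Spending distributions can mislead. One option is to divide by the variance of the target, which gives R². The sketch below uses made-up arrays (not the notebook's data) and a hypothetical helper `r2_from_predictions`. It shows that identical absolute errors give a lower R² once the zero-spending records are removed.

```python
import numpy as np

def r2_from_predictions(y_true, y_pred):
    """R^2 computed as 1 - MSE / Var(y_true)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mse = np.mean((y_true - y_pred) ** 2)
    return float(1.0 - mse / np.var(y_true))

# Made-up spending values: three non-buyers and three buyers.
y_all = np.array([0.0, 0.0, 0.0, 100.0, 200.0, 300.0])
pred_all = np.array([10.0, 10.0, 10.0, 110.0, 190.0, 310.0])  # every error is 10

# Restrict to buyers, mirroring the Purchase = 1 subset.
buyers = y_all > 0
r2_all = r2_from_predictions(y_all, pred_all)
r2_buyers = r2_from_predictions(y_all[buyers], pred_all[buyers])

print("R^2 on all records:", r2_all)
print("R^2 on buyers only:", r2_buyers)
```

Reporting R² (or RMSE divided by the target's standard deviation) alongside MSE would make the task (a) and task (b) results more directly comparable.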