<h1 align="center">MSIN0114: Business Analytics Consulting Project</h1>
<h2 align="center">S2R Analytics, pt. 4</h2>

# Table of Contents

* [Part 5](#part5): Regression
    * [5.0](#5_0): Data splitting
<br />
<br />
* [Part 6](#part6): Classification
    * [6.0](#6_0): Data splitting
<br />
<br />
* [Part 7](#part7): Decomposition
    * [7.0](#7_0): Data splitting
<br />
<br />
* [Part 8](#part8): Performance evaluation
* [Part 9](#part9): Feature importance and statistical tests
* [Part 10](#part10): Converting the output

## Notebook Setup

In [4]:
#Essentials
import pandas as pd
from pandas import Series, DataFrame
from pandas.api.types import CategoricalDtype
pd.options.display.max_columns = None
import sqlite3
import pyodbc
import numpy as np

#Image creation and display
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import matplotlib.patches as mpatches
from matplotlib import pyplot
import plotly.express as px
import plotly.graph_objects as go
#from image import image, display

#Preprocessing
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_transformer
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline

#Models
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.naive_bayes import GaussianNB

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.svm import SVC
from sklearn import svm
from sklearn.base import clone
from sklearn.ensemble import StackingClassifier
from sklearn.ensemble import BaggingClassifier

#Metrics of accuracy
from numpy import mean
from numpy import std
from sklearn import metrics
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import roc_curve, auc, precision_recall_curve
from sklearn.metrics import roc_auc_score

#Other
import itertools as it
import io
import os
os.sys.path
import sys
import glob
import concurrent.futures
from __future__ import print_function
import binascii
import struct
from PIL import Image
import scipy
import scipy.misc
import scipy.cluster
import datetime, time
import functools, operator
from datetime import datetime
from numpy.random import seed
from numpy.random import randn
from numpy import percentile

In [5]:
df = pd.read_csv('csv-files/preprocessed_data.csv')

## Part 5: <a class="anchor" id="part5"></a> Regression

### 5.0 <a class="anchor" id="5_0"></a> Data splitting

In [6]:
# Choose the dependent variable
Y = df[['Avg_Profit']]

# Drop the identifier and all target columns from the feature data set
X = df.drop(columns = ['Project_ID','Avg_Profit', 'Avg_Rec', 'Rec_Class', 'Profit_Class'])

# Scale the explanatory variables
# Note: df_1 is not used below; the split and all models use the unscaled X
df_1 = pd.DataFrame(RobustScaler().fit_transform(X), columns=X.columns)

# Split data set into train and test
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state=1)

print(f'No. of training data: {X_train.shape[0]}')
print(f'No. of training targets: {Y_train.shape[0]}')
print(f'No. of testing data: {X_test.shape[0]}')
print(f'No. of testing targets: {Y_test.shape[0]}')

No. of training data: 7704
No. of training targets: 7704
No. of testing data: 1927
No. of testing targets: 1927
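
If scaling is wanted, one option is to fit the scaler on the training rows only, so that test-set statistics do not leak into preprocessing. The following is a minimal sketch on synthetic data, not the notebook's own pipeline: the feature names `f1` to `f3`, the coefficients and the `_demo` variables are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in for the project features (hypothetical columns)
rng = np.random.default_rng(1)
X_demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=["f1", "f2", "f3"])
y_demo = 0.5 * X_demo["f1"] - 0.2 * X_demo["f2"] + rng.normal(scale=0.1, size=200)

# Split first, then let the pipeline fit the scaler on the training rows only
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

scaled_lr = make_pipeline(RobustScaler(), LinearRegression())
scaled_lr.fit(X_tr, y_tr)

# score() applies the already-fitted scaler to the test rows before predicting
demo_r2 = scaled_lr.score(X_te, y_te)
print(f"Test R2 on synthetic data: {demo_r2:.3f}")
```

Wrapping the scaler and the model in one pipeline also means the same object can be passed to `cross_val_score` or `GridSearchCV`, where the scaler is refitted inside each fold.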


### 5.1 <a class="anchor" id="5_1"></a> Linear regression

In [7]:
lr = LinearRegression()
lr.fit(X_train, Y_train)

In [8]:
# Predicting average profit using the Linear Regression model on the train data set
y_pred_train_lr = lr.predict(X_train)
lr_mse_train = mean_squared_error(Y_train, y_pred_train_lr)
lr_rmse_train = np.sqrt(lr_mse_train)
lr_r2_train = metrics.r2_score(Y_train, y_pred_train_lr)
print("Linear Regression MSE: ", lr_mse_train)
print("Linear Regression RMSE: ", lr_rmse_train)
print("Linear Regression R2: ", lr_r2_train)

Linear Regression MSE:  0.04524369894551325
Linear Regression RMSE:  0.21270566270203822
Linear Regression R2:  0.07924143031518804


In [9]:
# Predicting average profit using the Linear Regression model on the test data set
y_pred_test_lr = lr.predict(X_test)
lr_mse_test = mean_squared_error(Y_test, y_pred_test_lr)
lr_rmse_test = np.sqrt(lr_mse_test)
lr_r2_test = metrics.r2_score(Y_test, y_pred_test_lr)
print("Linear Regression MSE: ", lr_mse_test)
print("Linear Regression RMSE: ", lr_rmse_test)
print("Linear Regression R2: ", lr_r2_test)

Linear Regression MSE:  0.04329139500594088
Linear Regression RMSE:  0.208065842958283
Linear Regression R2:  0.08462804145547598


### 5.2 <a class="anchor" id="5_2"></a> Lasso regression

In [10]:
lasso = Lasso(random_state = 1)
lasso.fit(X_train, Y_train)

In [11]:
# Predicting average profit using the Lasso model on the train data set
y_pred_train_lasso = lasso.predict(X_train)
lasso_mse_train = mean_squared_error(Y_train, y_pred_train_lasso)
lasso_rmse_train = np.sqrt(lasso_mse_train)
lasso_R2_train = metrics.r2_score(Y_train, y_pred_train_lasso)
print("Lasso Regression MSE: ",lasso_mse_train)
print("Lasso Regression RMSE: ",lasso_rmse_train)
print("Lasso Regression R2: ",lasso_R2_train)

Lasso Regression MSE:  0.04880400030211004
Lasso Regression RMSE:  0.22091627441659892
Lasso Regression R2:  0.006785418513526098


In [12]:
# Predicting average profit using the Lasso model on the test data set
y_pred_test_lasso = lasso.predict(X_test)
lasso_mse_test = mean_squared_error(Y_test, y_pred_test_lasso)
lasso_rmse_test = np.sqrt(lasso_mse_test)
lasso_R2_test = metrics.r2_score(Y_test, y_pred_test_lasso)
print("Lasso Regression MSE: ",lasso_mse_test)
print("Lasso Regression RMSE: ",lasso_rmse_test)
print("Lasso Regression R2: ",lasso_R2_test)

Lasso Regression MSE:  0.047313547618668156
Lasso Regression RMSE:  0.2175167754879337
Lasso Regression R2:  -0.00041809101893108824


### 5.3 <a class="anchor" id="5_3"></a> Decision tree regressor

In [13]:
dt = DecisionTreeRegressor(random_state = 1)
dt.fit(X_train, Y_train)

In [14]:
# Train data set
y_pred_train_dt = dt.predict(X_train)
dt_mse_train = mean_squared_error(Y_train, y_pred_train_dt)
dt_rmse_train = np.sqrt(dt_mse_train)
dt_r2_train = metrics.r2_score(Y_train, y_pred_train_dt)
print("Decision Tree MSE: ", dt_mse_train)
print("Decision Tree RMSE: ", dt_rmse_train)
print("Decision Tree R2: ", dt_r2_train)

Decision Tree MSE:  3.0758881147119096e-34
Decision Tree RMSE:  1.7538210041825562e-17
Decision Tree R2:  1.0


In [15]:
# Test data set
y_pred_test_dt = dt.predict(X_test)
dt_mse_test = mean_squared_error(Y_test, y_pred_test_dt)
dt_rmse_test = np.sqrt(dt_mse_test)
dt_r2_test = metrics.r2_score(Y_test, y_pred_test_dt)
print("Decision Tree MSE: ", dt_mse_test)
print("Decision Tree RMSE: ", dt_rmse_test)
print("Decision Tree R2: ", dt_r2_test)

Decision Tree MSE:  0.05365230231910129
Decision Tree RMSE:  0.23162966631910795
Decision Tree R2:  -0.13444745884301113
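
A training R2 of 1.0 against a negative test R2 suggests that the unconstrained tree has memorised the training set. One common remedy, which this notebook does not apply, is to choose the tree's complexity by cross-validation. The sketch below runs on synthetic data, and the parameter grid is an assumption for illustration rather than a tuned recommendation.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem (hypothetical data, not the project data set)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 4))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.3, size=300)

# Search over tree depth and leaf size with 5-fold cross-validation on R2
search = GridSearchCV(
    DecisionTreeRegressor(random_state=1),
    param_grid={"max_depth": [2, 4, 6, None], "min_samples_leaf": [1, 10]},
    scoring="r2",
    cv=5,
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best cross-validated R2:", round(search.best_score_, 3))
```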


### 5.4 <a class="anchor" id="5_4"></a> Random forest

In [16]:
rf = RandomForestRegressor(random_state = 1)
rf.fit(X_train, Y_train)


In [17]:
# Train data set
y_pred_train_rf = rf.predict(X_train)
rf_mse_train = mean_squared_error(Y_train, y_pred_train_rf)
rf_rmse_train = np.sqrt(rf_mse_train)
rf_R2_train = metrics.r2_score(Y_train, y_pred_train_rf)

print("Random Forest MSE: ",rf_mse_train)
print("Random Forest RMSE: ",rf_rmse_train)
print("Random Forest R2: ",rf_R2_train)

Random Forest MSE:  0.004213460364131468
Random Forest RMSE:  0.06491117287595001
Random Forest R2:  0.9142514907330301


In [18]:
# Test data set
y_pred_test_rf = rf.predict(X_test)
rf_mse_test = mean_squared_error(Y_test, y_pred_test_rf)
rf_rmse_test = np.sqrt(rf_mse_test)
rf_R2_test = metrics.r2_score(Y_test, y_pred_test_rf)

print("Random Forest MSE: ",rf_mse_test)
print("Random Forest RMSE: ",rf_rmse_test)
print("Random Forest R2: ",rf_R2_test)

Random Forest MSE:  0.027939758593346804
Random Forest RMSE:  0.16715190275120054
Random Forest R2:  0.40922967390301557


### 5.6 <a class="anchor" id="5_6"></a> XGBoost

In [19]:
xgb = XGBRegressor(random_state = 1)
xgb.fit(X_train, Y_train)

In [20]:
# Train data set
y_pred_train_xgb = xgb.predict(X_train)
xgb_mse_train = mean_squared_error(Y_train, y_pred_train_xgb)
xgb_rmse_train = np.sqrt(xgb_mse_train)
xgb_R2_train = metrics.r2_score(Y_train, y_pred_train_xgb)

print("XGBoost MSE: ",xgb_mse_train)
print("XGBoost RMSE: ",xgb_rmse_train)
print("XGBoost R2: ",xgb_R2_train)

XGBoost MSE:  0.009377839330214379
XGBoost RMSE:  0.09683924478337477
XGBoost R2:  0.8091507518246702


In [21]:
# Test data set
y_pred_test_xgb = xgb.predict(X_test)
xgb_mse_test = mean_squared_error(Y_test, y_pred_test_xgb)
xgb_rmse_test = np.sqrt(xgb_mse_test)
xgb_R2_test = metrics.r2_score(Y_test, y_pred_test_xgb)

print("XGBoost MSE: ",xgb_mse_test)
print("XGBoost RMSE: ",xgb_rmse_test)
print("XGBoost R2: ",xgb_R2_test)

XGBoost MSE:  0.029250173136015526
XGBoost RMSE:  0.17102681993189117
XGBoost R2:  0.38152170269388075


### 5.7  <a class="anchor" id="5_7"></a> Models comparison

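Worst performing model: the decision tree regressor, which fits the training set perfectly (R2 of 1.0) but has a negative R2 (-0.134) on the test set.
Best: random forest, with a test R2 of 0.409, followed closely by XGBoost (0.382). Linear and Lasso regression explain almost none of the variance in the test set (R2 of 0.085 and -0.0004 respectively).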
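
A side-by-side table makes a comparison like this easier to read than separate cell outputs. The following is a sketch under stated assumptions: the `evaluate` helper, the synthetic data and the two-model dictionary are hypothetical and stand in for the models fitted above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def evaluate(model, X_tr, X_te, y_tr, y_te):
    """Fit the model and return RMSE and R2 for the train and test splits."""
    model.fit(X_tr, y_tr)
    row = {}
    for split, X_part, y_part in (("train", X_tr, y_tr), ("test", X_te, y_te)):
        pred = model.predict(X_part)
        row[f"rmse_{split}"] = np.sqrt(mean_squared_error(y_part, pred))
        row[f"r2_{split}"] = r2_score(y_part, pred)
    return row


# Synthetic data (hypothetical), with a linear signal in the first two features
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([0.5, -0.2, 0.0]) + rng.normal(scale=0.1, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

models = {"Linear regression": LinearRegression(), "Lasso": Lasso(random_state=1)}
results = {name: evaluate(m, X_tr, X_te, y_tr, y_te) for name, m in models.items()}

# One row per model, best test R2 first
comparison = pd.DataFrame.from_dict(results, orient="index").sort_values("r2_test", ascending=False)
print(comparison.round(3))
```

Sorting on the test R2 rather than the training R2 keeps an overfitted model, such as the unpruned decision tree, from appearing at the top.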

## Part 6: <a class="anchor" id="part6"></a> Classification

### 6.0 <a class="anchor" id="6_0"></a> Data splitting

In [22]:
# Choose dependent variables
Y = df[['Profit_Class']]

# Drop the dependent variables from the feature data set
X = df.drop(columns = ['Project_ID','Avg_Profit', 'Avg_Rec', 'Rec_Class', 'Profit_Class'])

# Scale the explanatory variables
#df_2 = pd.DataFrame(RobustScaler().fit_transform(X), columns=X.columns)

# Split data set into train and test
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state=1, stratify = Y)

print(f'No. of training data: {X_train.shape[0]}')
print(f'No. of training targets: {Y_train.shape[0]}')
print(f'No. of testing data: {X_test.shape[0]}')
print(f'No. of testing targets: {Y_test.shape[0]}')

No. of training data: 7704
No. of training targets: 7704
No. of testing data: 1927
No. of testing targets: 1927


### 6.1 <a class="anchor" id="6_1"></a> Logistic regression

In [None]:
#URL: https://www.datacamp.com/community/tutorials/understanding-logistic-regression-python

# Instantiate the model
log = LogisticRegression(random_state = 1)

#Train the model using the training sets
log.fit(X_train, Y_train)

#Predict the response for test dataset (using the logistic model, not the linear regression `lr`)
log_y_pred = log.predict(X_test)

# Accuracy measures
print("Accuracy score of LR: " + str(round(metrics.accuracy_score(Y_test, log_y_pred), 4)*100)+"%")
print("Precision score of LR: " + str(round(metrics.precision_score(Y_test, log_y_pred, average="weighted", zero_division=1), 4)*100)+"%")
print("Recall score of LR: " + str(round(metrics.recall_score(Y_test, log_y_pred, average="weighted"), 4)*100)+"%")
print("F1 of LR: " + str(round(metrics.f1_score(Y_test, log_y_pred, average="weighted"), 4)*100)+"%")

### 6.2 <a class="anchor" id="6_2"></a> KNeighbors classifier

In [26]:
#Create KNN Classifier
#random.seed(2022)
knn_3 = KNeighborsClassifier(n_neighbors = 3)

#Train the model using the training sets
knn_3.fit(X_train, Y_train)

#Predict the response for test dataset
knn_3_y_pred = knn_3.predict(X_test)

# Accuracy measures
print("Accuracy score of KNN-3: " + str(round(metrics.accuracy_score(Y_test, knn_3_y_pred), 4)*100)+"%")
print("Precision score of KNN-3: " + str(round(metrics.precision_score(Y_test, knn_3_y_pred, average="weighted", zero_division=1), 4)*100)+"%")
print("Recall score of KNN-3 " + str(round(metrics.recall_score(Y_test, knn_3_y_pred, average="weighted"), 4)*100)+"%")
print("F1 of KNN-3: " + str(round(metrics.f1_score(Y_test, knn_3_y_pred, average="weighted"), 4)*100)+"%")

Accuracy score of KNN-3: 48.620000000000005%
Precision score of KNN-3: 49.46%
Recall score of KNN-3 48.620000000000005%
F1 of KNN-3: 48.99%



In [27]:
#random.seed(2022)
knn_7 = KNeighborsClassifier(n_neighbors=7)
knn_7.fit(X_train, Y_train)
knn_7_y_pred = knn_7.predict(X_test)

print("Accuracy score of KNN-7: " + str(round(metrics.accuracy_score(Y_test, knn_7_y_pred), 4)*100)+"%")
print("Precision score of KNN-7: " + str(round(metrics.precision_score(Y_test, knn_7_y_pred, average="weighted", zero_division=1), 4)*100)+"%")
print("Recall score of KNN-7 " + str(round(metrics.recall_score(Y_test, knn_7_y_pred, average="weighted"), 4)*100)+"%")
print("F1 of KNN-7: " + str(round(metrics.f1_score(Y_test, knn_7_y_pred, average="weighted"), 4)*100)+"%")

Accuracy score of KNN-7: 52.410000000000004%
Precision score of KNN-7: 50.690000000000005%
Recall score of KNN-7 52.410000000000004%
F1 of KNN-7: 51.38%
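
Values such as `52.410000000000004%` are floating-point artefacts of multiplying a rounded score by 100. A small helper with percent formatting avoids them and removes the repeated print statements. The sketch below is an assumption-laden illustration: `format_scores` and the toy labels are hypothetical. Flattening the target with `np.ravel` is also the usual way to avoid scikit-learn's warning about a column-vector `y`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score


def format_scores(name, y_true, y_pred):
    """Return formatted accuracy and weighted precision, recall and F1 lines."""
    y_true = np.ravel(y_true)  # accept a single-column DataFrame or 2-D array
    scores = {
        "Accuracy": accuracy_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred, average="weighted", zero_division=1),
        "Recall": recall_score(y_true, y_pred, average="weighted"),
        "F1": f1_score(y_true, y_pred, average="weighted"),
    }
    # The .2% format multiplies by 100 and rounds to two decimals in one step
    return [f"{metric} score of {name}: {value:.2%}" for metric, value in scores.items()]


# Toy three-class labels (hypothetical), with the true labels as a column vector
y_true_demo = np.array([[0], [1], [2], [1], [0], [2], [1], [1]])
y_pred_demo = np.array([0, 1, 1, 1, 0, 2, 2, 1])

demo_lines = format_scores("demo", y_true_demo, y_pred_demo)
for line in demo_lines:
    print(line)
```

With weighted averaging, recall always equals accuracy, which is why the two values coincide in every cell of this section.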



### 6.3 <a class="anchor" id="6_3"></a> Decision tree classifier

In [28]:
# Create Decision Tree Classifier object
#random.seed(2022)
dtc = DecisionTreeClassifier(random_state = 1)

# Train Decision Tree Classifier
dtc = dtc.fit(X_train, Y_train)

#Predict the response for test dataset
dtc_y_pred = dtc.predict(X_test)

# Accuracy measures
print("Accuracy score of DTC: " + str(round(metrics.accuracy_score(Y_test, dtc_y_pred), 4)*100)+"%")
print("Precision score of DTC: " + str(round(metrics.precision_score(Y_test, dtc_y_pred, average="weighted", zero_division=1), 4)*100)+"%")
print("Recall score of DTC: " + str(round(metrics.recall_score(Y_test, dtc_y_pred, average="weighted"), 4)*100)+"%")
print("F1 of DTC: " + str(round(metrics.f1_score(Y_test, dtc_y_pred, average="weighted"), 4)*100)+"%")

Accuracy score of DTC: 61.39%
Precision score of DTC: 61.58%
Recall score of DTC: 61.39%
F1 of DTC: 61.46%


### 6.4 <a class="anchor" id="6_4"></a> Random forest classifier

In [29]:
#Create a random forest classifier
#random.seed(2022)
rfc = RandomForestClassifier(random_state = 1)

#Train the model using the training sets
rfc.fit(X_train, Y_train)

# prediction on test set
rfc_y_pred=rfc.predict(X_test)

# Accuracy measures
print("Accuracy score of RFC: " + str(round(metrics.accuracy_score(Y_test, rfc_y_pred), 4)*100)+"%")
print("Precision score of RFC: " + str(round(metrics.precision_score(Y_test, rfc_y_pred, average="weighted", zero_division=1), 4)*100)+"%")
print("Recall score of RFC: " + str(round(metrics.recall_score(Y_test, rfc_y_pred, average="weighted"), 4)*100)+"%")
print("F1 of RFC: " + str(round(metrics.f1_score(Y_test, rfc_y_pred, average="weighted"), 4)*100)+"%")


Accuracy score of RFC: 69.69%
Precision score of RFC: 70.00999999999999%
Recall score of RFC: 69.69%
F1 of RFC: 69.57%


### 6.5 <a class="anchor" id="6_5"></a> XGBoost classifier

In [30]:
#Create an XGBoost model
#random.seed(2022)
xgb = XGBClassifier(n_estimators=100, learning_rate=0.05, booster='gbtree', random_state = 1)

#Train the model using the training sets
xgb.fit(X_train, Y_train)

# Prediction on test set
xgb_y_pred=xgb.predict(X_test)

# Accuracy measures
print("Accuracy score of XGB: " + str(round(metrics.accuracy_score(Y_test, xgb_y_pred), 4)*100)+"%")
print("Precision score of XGB: " + str(round(metrics.precision_score(Y_test, xgb_y_pred, average="weighted", zero_division=1), 4)*100)+"%")
print("Recall score of XGB: " + str(round(metrics.recall_score(Y_test, xgb_y_pred, average="weighted"), 4)*100)+"%")
print("F1 of XGB: " + str(round(metrics.f1_score(Y_test, xgb_y_pred, average="weighted"), 4)*100)+"%")

Accuracy score of XGB: 69.89999999999999%
Precision score of XGB: 70.5%
Recall score of XGB: 69.89999999999999%
F1 of XGB: 69.77%


### 6.6 <a class="anchor" id="6_6"></a> Naive Bayes

In [27]:
gnb = GaussianNB()
gnb.fit(X_train, Y_train)

#Predict the response for test dataset
gnb_y_pred = gnb.predict(X_test)

# Accuracy measures
print("Accuracy score of GNB: " + str(round(metrics.accuracy_score(Y_test, gnb_y_pred), 4)*100)+"%")
print("Precision score of GNB: " + str(round(metrics.precision_score(Y_test, gnb_y_pred, average="weighted", zero_division=1), 4)*100)+"%")
print("Recall score of GNB: " + str(round(metrics.recall_score(Y_test, gnb_y_pred, average="weighted"), 4)*100)+"%")
print("F1 of GNB: " + str(round(metrics.f1_score(Y_test, gnb_y_pred, average="weighted"), 4)*100)+"%")

Accuracy score of GNB: 6.950000000000001%
Precision score of GNB: 59.06%
Recall score of GNB: 6.950000000000001%
F1 of GNB: 7.46%



### 6.7  <a class="anchor" id="6_7"></a> Models comparison

Gaussian naive Bayes performed worst by a wide margin, with 6.95% accuracy and a weighted F1-score of 7.46%. The KNeighbors classifiers followed, with weighted F1-scores of 48.99% (k = 3) and 51.38% (k = 7), and the decision tree reached 61.46%. Random forest and XGBoost performed about the same and best overall: XGBoost reached 69.90% accuracy and 70.5% weighted precision, against 69.69% and 70.01% for random forest. No output is recorded for logistic regression, so it is left out of this comparison.

In [None]:
# Set up the figure to compare the confusion matrix of each model as heatmaps
plt.figure(figsize=(16,7))
# `ax` was never defined; set the overall title on the figure instead
plt.suptitle('Confusion matrix heatmaps of trained models', size = 17)

# Logistic model confusion matrix heatmap
plt.subplot(2,3,1)
sns.heatmap(confusion_matrix(Y_test,log_y_pred)/np.sum(confusion_matrix(Y_test,log_y_pred)), annot=True, fmt='.2%',cmap='Greens')
plt.title('Logistic Regression')

# KNN confusion matrix heatmap
plt.subplot(2,3,2)
sns.heatmap(confusion_matrix(Y_test,knn_7_y_pred)/np.sum(confusion_matrix(Y_test,knn_7_y_pred)), annot=True, fmt='.2%',cmap='Greens')
plt.title('KNN')

# Decision Tree confusion matrix heatmap
plt.subplot(2,3,3)
sns.heatmap(confusion_matrix(Y_test,dtc_y_pred)/np.sum(confusion_matrix(Y_test,dtc_y_pred)), annot=True, fmt='.2%',cmap='Greens')
plt.title('Decision Tree')

# Random Forest confusion matrix heatmap
plt.subplot(2,3,4)
sns.heatmap(confusion_matrix(Y_test,rfc_y_pred)/np.sum(confusion_matrix(Y_test,rfc_y_pred)), annot=True, fmt='.2%',cmap='Greens')
plt.title('Random Forest')

# XGBoost confusion matrix heatmap
plt.subplot(2,3,5)
sns.heatmap(confusion_matrix(Y_test,xgb_y_pred)/np.sum(confusion_matrix(Y_test,xgb_y_pred)), annot=True, fmt='.2%',cmap='Greens')
plt.title('XGBoost')

# Naive Bayes model confusion matrix heatmap
plt.subplot(2,3,6)
sns.heatmap(confusion_matrix(Y_test,gnb_y_pred)/np.sum(confusion_matrix(Y_test,gnb_y_pred)), annot=True, fmt='.2%',cmap='Greens')
plt.title('Naive Bayes')


plt.savefig('Confusion matrix heatmaps of trained classification models')

## Part 7: <a class="anchor" id="part7"></a> Decomposition

### 7.0 <a class="anchor" id="7_0"></a> Data splitting

### 7.1 <a class="anchor" id="7_1"></a> Feature selection

### 7.2 <a class="anchor" id="7_2"></a> Feature extraction