# Machine Learning: 3 Approaches

Dataset from Rabie El Kharoua on Kaggle: https://www.kaggle.com/datasets/rabieelkharoua/predict-customer-purchase-behavior-dataset 

This notebook will look at a dataset of customer purchases. The goal is to explore the data and build several models using all features in the dataset, test each model and get a baseline for performance. Then feature selection will be performed and the models will be retested to see if feature selection has a significant impace on model performance. 

To begin, three libraries will be imported and several warning categories will be supressed as these warnings do not affect the output of the code: 

In [1]:
# Importing libraries
import numpy as np
import pandas as pd
import warnings

# Supressing warnings
warnings.filterwarnings("ignore", category=UserWarning) # suppress user warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) # suppress depreciation warnings
warnings.filterwarnings("ignore", category=FutureWarning) 

Next, the data can be imported from a local CSV file: 

In [2]:
df = pd.read_csv("customer_purchase_data.csv")
df

Unnamed: 0,Age,Gender,AnnualIncome,NumberOfPurchases,ProductCategory,TimeSpentOnWebsite,LoyaltyProgram,DiscountsAvailed,PurchaseStatus
0,40,1,66120.267939,8,0,30.568601,0,5,1
1,20,1,23579.773583,4,2,38.240097,0,5,0
2,27,1,127821.306432,11,2,31.633212,1,0,1
3,24,1,137798.623120,19,3,46.167059,0,4,1
4,31,1,99300.964220,19,1,19.823592,0,0,1
...,...,...,...,...,...,...,...,...,...
1495,39,1,65048.141834,13,0,34.590743,0,5,1
1496,67,1,28775.331069,18,2,17.625707,0,1,1
1497,40,1,57363.247541,7,4,12.206033,0,0,0
1498,63,0,134021.775532,16,2,37.311634,1,0,1


The dataset cotains 9 columns (from the Kaggle page): 
* Age: Customer's age
* Gender: Customer's gender (0: Male, 1: Female)
* Annual Income: Annual income of the customer in dollars
* Number of Purchases: Total number of purchases made by the customer
* Product Category: Category of the purchased product (0: Electronics, 1: Clothing, 2: Home Goods, 3: Beauty, 4: Sports)
* Time Spent on Website: Time spent by the customer on the website in minutes
* Loyalty Program: Whether the customer is a member of the loyalty program (0: No, 1: Yes)
* Discounts Availed: Number of discounts availed by the customer (range: 0-5)
* PurchaseStatus (Target Variable): Likelihood of the customer making a purchase (0: No, 1: Yes)

## Cleaning the Data

While most of the columns can be left unchanged, ProductCategory is a categorical, non-ordinal variable that needs to be modified for use in the models:

In [3]:
df["ProductCategory"].value_counts().sort_index()

ProductCategory
0    289
1    331
2    273
3    286
4    321
Name: count, dtype: int64

Because the column is categorical it can be one-hot encoded, which is the process of making each category a column and indicating presence of each value with a 1 and absence with a 0. This is done below: 

In [4]:
cat_df = pd.get_dummies(df["ProductCategory"]).astype(int)
cat_df

Unnamed: 0,0,1,2,3,4
0,1,0,0,0,0
1,0,0,1,0,0
2,0,0,1,0,0
3,0,0,0,1,0
4,0,1,0,0,0
...,...,...,...,...,...
1495,1,0,0,0,0
1496,0,0,1,0,0
1497,0,0,0,0,1
1498,0,0,1,0,0


The first row of data had a ProductCategory of 0 (electronics), so there is a 1 in the '0' column in the dataframe above. This is good for our machine learning model, but it would be helpful to indicate that each column represents a category when it is added to the original dataframe. So "cat_" will be added to the beginning of each column name to indicate that it represents a given category: 

In [5]:
cols = [] # create empty array

for col in cat_df.columns: # loop through the columns
    cols.append(f"cat_{col}") # add each new column name to the 'cols' array

cat_df.columns = cols # assign the 'cols' array to the columns in the dataframe
cat_df

Unnamed: 0,cat_0,cat_1,cat_2,cat_3,cat_4
0,1,0,0,0,0
1,0,0,1,0,0
2,0,0,1,0,0
3,0,0,0,1,0
4,0,1,0,0,0
...,...,...,...,...,...
1495,1,0,0,0,0
1496,0,0,1,0,0
1497,0,0,0,0,1
1498,0,0,1,0,0


Now that the one-hot encoded categories have been made, they can be combined with the original dataframe:

In [6]:
df = pd.concat([df, cat_df], axis=1) # Concatenates the dataframes
df

Unnamed: 0,Age,Gender,AnnualIncome,NumberOfPurchases,ProductCategory,TimeSpentOnWebsite,LoyaltyProgram,DiscountsAvailed,PurchaseStatus,cat_0,cat_1,cat_2,cat_3,cat_4
0,40,1,66120.267939,8,0,30.568601,0,5,1,1,0,0,0,0
1,20,1,23579.773583,4,2,38.240097,0,5,0,0,0,1,0,0
2,27,1,127821.306432,11,2,31.633212,1,0,1,0,0,1,0,0
3,24,1,137798.623120,19,3,46.167059,0,4,1,0,0,0,1,0
4,31,1,99300.964220,19,1,19.823592,0,0,1,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1495,39,1,65048.141834,13,0,34.590743,0,5,1,1,0,0,0,0
1496,67,1,28775.331069,18,2,17.625707,0,1,1,0,0,1,0,0
1497,40,1,57363.247541,7,4,12.206033,0,0,0,0,0,0,0,1
1498,63,0,134021.775532,16,2,37.311634,1,0,1,0,0,1,0,0


Since we have the one-hot encoded columns added, the ProductCategory column is no longer needed: 

In [7]:
del df["ProductCategory"]

For the final step in the data cleaning process, the column names can all be set to lowercase: 

In [8]:
cols = []

for col in df.columns:
    cols.append(col.lower())

df.columns = cols
df

Unnamed: 0,age,gender,annualincome,numberofpurchases,timespentonwebsite,loyaltyprogram,discountsavailed,purchasestatus,cat_0,cat_1,cat_2,cat_3,cat_4
0,40,1,66120.267939,8,30.568601,0,5,1,1,0,0,0,0
1,20,1,23579.773583,4,38.240097,0,5,0,0,0,1,0,0
2,27,1,127821.306432,11,31.633212,1,0,1,0,0,1,0,0
3,24,1,137798.623120,19,46.167059,0,4,1,0,0,0,1,0
4,31,1,99300.964220,19,19.823592,0,0,1,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1495,39,1,65048.141834,13,34.590743,0,5,1,1,0,0,0,0
1496,67,1,28775.331069,18,17.625707,0,1,1,0,0,1,0,0
1497,40,1,57363.247541,7,12.206033,0,0,0,0,0,0,0,1
1498,63,0,134021.775532,16,37.311634,1,0,1,0,0,1,0,0


## Defining Features and Splitting the Data

Now that the data has been cleaned, features need to be defined so that they can be used as input to the models: 

In [9]:
features = ["age", "gender","annualincome","numberofpurchases","timespentonwebsite","loyaltyprogram",
            "cat_0","cat_1","cat_2","cat_3", "cat_4"] # Input
output = "purchasestatus"# Target

Using the features and target columns, the X and y variables can be defined: 

In [10]:
X = df[features]
y = df[output]

Finally, the dataset can be split into training and testing sets for model training and evaluation: 

In [11]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)

X_train

Unnamed: 0,age,gender,annualincome,numberofpurchases,timespentonwebsite,loyaltyprogram,cat_0,cat_1,cat_2,cat_3,cat_4
382,54,1,81098.474861,6,16.521722,0,1,0,0,0,0
538,61,1,75731.262728,8,15.374676,1,0,0,0,0,1
1493,34,1,20418.374269,9,54.459955,0,0,1,0,0,0
1112,47,1,145165.346420,2,12.150154,0,1,0,0,0,0
324,38,0,132143.261249,14,42.546358,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...
1130,66,1,141127.983526,0,1.128864,0,0,1,0,0,0
1294,57,1,117678.761184,10,30.935996,0,1,0,0,0,0
860,23,0,22485.092276,10,31.839858,0,0,0,1,0,0
1459,18,0,99007.775893,10,16.263599,0,0,1,0,0,0


The data is now ready to be used in machine learning models. 

## Creating the Models

### Basic Models

To start, 5 different models will be created:

1. KNeighborsClassifier: clusters similar data points together.
2. LinearSVC: creates hyperplanes that define which category a given data point should fall into.
3. GaussianNB: calculates the probability of each class based on the data points and selects the class with the highest probability.
4. DecisionTreeClassifier: creates branches based on data points and makes predictions based on the branches. 
5. RandomForestClassifier: creates several subsets of data and forms different trees. It then uses the combined output from these trees to make predictions.

First the KNeighborsClassifier needs to be instantiated: 

In [12]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()

Now that the model has been instantiated, it can be fitted to the training data: 

In [13]:
knn.fit(X_train, y_train)

We need to be able to test the model without "peeking" at the test set. This allows our model to remain unbiased, only using data that it has already seen. In order to accomplish this, cross-validation can be used to test the model against subsets of training data, which allows us to see how it will perform without using the testing data. 

Below, a function is created to break the data into 6 distinct groups and using each group as a validation set against which the model can be tested. The output is the best score from the cross-validated model: 

In [14]:
from sklearn.model_selection import cross_val_score

def cv_model(model, cv=6):
    cv = cross_val_score(model, X_train, y_train, cv=6, scoring='accuracy')
    return cv.max()

We can now use the cross-validation model to get a score for the KNeighborsClassifier: 

In [15]:
knn_cv = cv_model(knn)
knn_cv

0.62

The model's best accuracy in cross-validation was 0.62, not terrible but not great. 

The same can now be done to each additional model. All steps done to the KNeighborsClassifier will be done in a single cell for each of the following models. Next, the LinearSVC model can be created, trained, and cross-validated: 

In [16]:
from sklearn.svm import LinearSVC

svc = LinearSVC(random_state=42)
svc.fit(X_train, y_train)

svc_cv = cv_model(svc)
svc_cv

0.57

This model's best accuracy was 0.57, performing worse than the KNeighborsClassifier. Next is the GaussianNB model, which is a Naive-Bayes classifier:  

In [17]:
from sklearn.naive_bayes import GaussianNB

naive = GaussianNB()
naive.fit(X_train, y_train)

naive_cv = cv_model(naive)
naive_cv

0.765

This model did very well, with it's best accuracy being 0.765 during cross-validation. Next, DecisionTreeClassifier can be created and cross validated:

In [18]:
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)

dt_cv = cv_model(dt)
dt_cv

0.77

With it's best accuracy being 0.77, the DecisionTreeClassifier had the best score of the 4 models tested.

The final model that will be trained is the Random Forest Classifier. However, performing cross validation takes extremely long, so it will simply be trained using grid search, with no cross-validation being applied to it: 

In [19]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)

While some of the scores are decent, none of the models performed as well as one might have hope. The scores for each model are below:

In [20]:
print(f"KNN: {knn_cv}")
print(f"SVC: {svc_cv}")
print(f"Naive Bayes: {naive_cv}")
print(f"Decision Tree: {dt_cv}")
print("Random Forest: N/A")

KNN: 0.62
SVC: 0.57
Naive Bayes: 0.765
Decision Tree: 0.77
Random Forest: N/A


### Evaluating the Promising Models

From the scores above, the Naive-Bayes and Decision Tree models performed the best based on accuracy. We can make predictions with these models and test how well each performs on data that it has not yet seen. 

The first model to be tested is the Naive-Bayes model. It had an accuracy score of 0.765 when cross validating. We are hoping for a score as good or better than this. However, before we can measure the accuracy, we need to make predictions based on data the model has not yet seen: 

In [21]:
naive_preds = naive.predict(X_test) # X_test has not yet been seen by the model

Now that we have predictions, we can see how accurate they are compared to our actual test target values (y_test): 

In [22]:
from sklearn.metrics import accuracy_score

accuracy_score(y_test, naive_preds)

0.73

The model performed slightly worse on new data than it did in cross validation. 

Although accuracy can give us a good idea of how many predictions the model got right, it doesn't tell us how well the model did at predicting a 1 when the value was actually 1. Conversely, it also doesn't tell us how well the model was at predicting 0s when the value is 0. 

To determine this, we can use the confusion matrix. It shows the actual values in the rows and the predicted values in the columns, as can be seen below: 

In [23]:
from sklearn.metrics import confusion_matrix

def create_conf_matrix(actual, preds):

    df_data = {"Predicted 0": confusion_matrix(actual, preds)[0], 
            "Predicted 1": confusion_matrix(actual, preds)[1]}

    return pd.DataFrame(data=df_data, index=["Actual 0", "Actual 1"])

create_conf_matrix(y_test, # actual values
                   naive_preds) # our models predictions

Unnamed: 0,Predicted 0,Predicted 1
Actual 0,142,51
Actual 1,30,77


While an overall accuracy of 0.73 sounds very good, if we look at the confusion matrix, is misclassified 51 0s as 1s and 30 1s as 0s. We can use precision and recall scores to quantify these deficiencies in our model. The Precision score is the ability of the model to detect <b><u>only</u></b> the positive cases. It is calculated as the true positives (77) divided by the sum of true positives and false positives (30 + 77). The calculation is below: 

> 77 / (77 + 30) = 0.72

This is very close the the overall accuracy score of 0.73. Second, the recall score is the ability of the model to correctly classify <b><u>all</u></b> positive cases. It is calculated as the number of true positives (77) divided by the sum of true positives and false negatives (77 + 51). The calculation for recall is below: 

> 77 / (77 + 51) = 0.6

The recall score is significantly lower than the overall accuracy score. This means that the model doesn't do a good job of detecting all positive instances in the data. 

I showed the math behind precision and recall to demonstrate what each score represents. However, SK-Learn has built in functions for calculating this as well: 

In [24]:
from sklearn.metrics import precision_score, recall_score

print("Precision:", round(precision_score(y_test, naive_preds), 2))
print("Recall:", round(recall_score(y_test, naive_preds), 2))

Precision: 0.72
Recall: 0.6


I will modify the confusion matrix function from above to show the precision scores along with the confusion matrix to make model evaluation easier: 

In [25]:
def create_conf_matrix(actual, preds):

    df_data = {"Predicted 0": confusion_matrix(actual, preds)[0], 
            "Predicted 1": confusion_matrix(actual, preds)[1]}

    print("Accuracy:", round(accuracy_score(actual, preds), 2))
    print("Precision:", round(precision_score(actual, preds), 2))
    print("Recall:", round(recall_score(actual, preds), 2))

    return pd.DataFrame(data=df_data, index=["Actual 0", "Actual 1"])

We can also evaluate the Decision Tree model as it performed slightly better than the Naive-Bayes model on accuracy:  

In [26]:
dt_preds = dt.predict(X_test)
create_conf_matrix(y_test, dt_preds)

Accuracy: 0.77
Precision: 0.76
Recall: 0.67


Unnamed: 0,Predicted 0,Predicted 1
Actual 0,145,42
Actual 1,27,86


While the model did better than the Naive-Bayes model, the recall score is still low. Finally, the Random Forest model can be evaluated as well: 

In [27]:
rf_preds = rf.predict(X_test)
create_conf_matrix(y_test, rf_preds)

Accuracy: 0.8
Precision: 0.78
Recall: 0.73


Unnamed: 0,Predicted 0,Predicted 1
Actual 0,146,34
Actual 1,26,94


The Random Forest model did better than the Naive-Bayes and Decision Tree models. 

After testing all these models, the performance of each is subpar. So, how can the models be improved? one method is a grid search. 

### Fine Tuning Models with Grid Search

A grid search is a method for hypertuning model parameters. It takes several parameters and creates models using each parameter. It then decides which parameters produce the best results and outputs the optimal model. 

Below, a GridSearch object is used in conjuction with user-defined parameters to find the optimal model: 

In [28]:
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

params = { # the parameters that will be used in each model
    "n_neighbors": [2, 3, 4, 5, 6],
    "weights": ["uniform","distance"],
    "algorithm": ["auto", "ball_tree", "kd_tree", "brute"],
    "p": [1, 2]
}

knn_grid = GridSearchCV(KNeighborsClassifier(), params) # fitting the parameters to the model
knn_grid.fit(X_train, y_train) # fitting the data to the model

Below, the parameters that the grid search selected are shown: 

In [29]:
for param in knn_grid.best_params_:
    print(param, ":", knn_grid.best_params_[param])

algorithm : auto
n_neighbors : 2
p : 1
weights : distance


Now the best model can be cross validated: 

In [30]:
knn_grid_cv = cv_model(knn_grid)
knn_grid_cv

0.665

The model accuracy has improved marginally, from 0.62 to 0.665. Now the same can be done for each of the other models. 

LinearSVC: 

In [31]:
from sklearn.svm import LinearSVC

params = {
    "loss": ["hinge", "squared_hinge"]
}

svc_grid = GridSearchCV(LinearSVC(penalty="l2", random_state=42), params)
svc_grid.fit(X_train, y_train)

svc_grid_cv = cv_model(svc_grid)
svc_grid_cv

0.57

The model's accuracy did not improve at all, maintaining a score of 0.57. 

GaussianNB: Because the GaussianNB model does not have applicable parameters for a grid search, it will not be hypertuned. 

DecisionTreeClassifier:

In [32]:
from sklearn.tree import DecisionTreeClassifier

params = {
    "criterion": ["gini", "entropy", "log_loss"],
    "min_samples_split": [2, 4, 6, 8],
    "min_samples_leaf": [1, 2, 3, 4]
}

dt_grid = GridSearchCV(DecisionTreeClassifier(random_state=42), params)
dt_grid.fit(X_train, y_train)

dt_grid_cv = cv_model(dt_grid)
dt_grid_cv

0.775

The DecisionTreeClassifier's accuracy improved slightly, increase by 0.005 to achieve a score of 0.775. 

The RandomForestClassifier will be trained as well. However, as stated above, cross validation will not be performed on it: 

In [33]:
from sklearn.ensemble import RandomForestClassifier

params = {
    "n_estimators": [100, 150, 200],
    "min_samples_split": [4, 6, 8],
    "min_samples_leaf": [2, 3, 4]
}

rf_grid = GridSearchCV(RandomForestClassifier(random_state=42), params)
rf_grid.fit(X_train, y_train)

So, the best scores for each model on the current features are as follows: 

In [34]:
print(f"KNN: {knn_grid_cv}")
print(f"SVC: {svc_grid_cv}")
print(f"Naive-Bayes (No GridSearch): {naive_cv}")
print(f"Decision Tree: {dt_grid_cv}")
print(f"Random Forest: N/A")

KNN: 0.665
SVC: 0.57
Naive-Bayes (No GridSearch): 0.765
Decision Tree: 0.775
Random Forest: N/A


### Evaluating the Promising GridSearch Models

As above, we can evaluate the best models to see how they perform now that a grid search has been performed. Because no grid search was performed on the Naive-Bayes model, it will be skipped in this section of evaluation as the base model was already evaluated. 

Decision Tree: 

In [35]:
dt_preds = dt_grid.predict(X_test)
create_conf_matrix(y_test, dt_preds)

Accuracy: 0.76
Precision: 0.75
Recall: 0.66


Unnamed: 0,Predicted 0,Predicted 1
Actual 0,144,44
Actual 1,28,84


While the model imrpoved, the recall score is still low. 

Random Forest: 

In [36]:
rf_preds = rf_grid.predict(X_test)
create_conf_matrix(y_test, rf_preds)

Accuracy: 0.82
Precision: 0.81
Recall: 0.75


Unnamed: 0,Predicted 0,Predicted 1
Actual 0,149,32
Actual 1,23,96


After testing all these models, the performance has improved but it could be better. So, what can we do to increase each model's score? We can select only relevant features, which will remove noise from the model. 

### New Features

Feature selection can help the model hone in on what actually matters in the model. One way of doing this is to look at the correlation between each feature and the target value to determine which items are correlated: 

In [37]:
corr_matrix = df.corr()
corr_matrix["purchasestatus"]

age                  -0.255747
gender                0.002627
annualincome          0.188214
numberofpurchases     0.222691
timespentonwebsite    0.277112
loyaltyprogram        0.310838
discountsavailed      0.303297
purchasestatus        1.000000
cat_0                -0.026781
cat_1                 0.019498
cat_2                -0.006753
cat_3                 0.032369
cat_4                -0.018613
Name: purchasestatus, dtype: float64

We can see that most of the features are highly correlated, with the exceptions of gender and category (cat_1 - cat_5). Based on this information, we can set a threshold of 0.1 as the requirement for selecting the feature. If a feature has a correlation greater than 0.1, we add it, otherwise we ignore it. 

Below, we only select the features that meet the correlation threshold of 0.1: 

In [38]:
features = [feature for feature in df.columns if abs(corr_matrix["purchasestatus"].mean()) > 0.1 
            and feature != "purchasestatus"] # we don't want to include the purchasestatus column

Now that we have our features, we need to define the target as well:

In [39]:
target = "purchasestatus"

Now we can define the X and y variables and split the data into training and testing sets:

In [40]:
X = df[features]
y = df[output]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)

Now we can retrain and test our models to see how they perform on the new features. 

### Retraining the Models

We will use the same models we created above. As this has already been done previously, each block will be run without additional explanation. Remember that the output of the following code blocks are the accuracy scores for each model: 

In [41]:
params = {
    "n_neighbors": [2, 3, 4, 5, 6],
    "weights": ["uniform","distance"],
    "algorithm": ["auto", "ball_tree", "kd_tree", "brute"],
    "p": [1, 2]
}

knn_grid2 = GridSearchCV(KNeighborsClassifier(), params)
knn_grid2.fit(X_train, y_train)

knn_grid_cv2 = cv_model(knn_grid2)
knn_grid_cv2 # Accuracy

0.66

In [42]:
from sklearn.svm import LinearSVC

params = {
    "loss": ["hinge", "squared_hinge"]
}

svc_grid2 = GridSearchCV(LinearSVC(random_state=42), params)
svc_grid2.fit(X_train, y_train)

svc_grid_cv2 = cv_model(svc_grid2)
svc_grid_cv2

0.57

In [43]:
naive2 = GaussianNB()
naive2.fit(X_train, y_train)

naive_cv2 = cv_model(naive2)
naive_cv2

0.8

In [44]:
params = {
    "criterion": ["gini", "entropy", "log_loss"],
    "min_samples_split": [2, 4, 6, 8],
    "min_samples_leaf": [1, 2, 3, 4]
}

dt_grid2 = GridSearchCV(DecisionTreeClassifier(random_state=42), params)
dt_grid2.fit(X_train, y_train)

dt_grid_cv2 = cv_model(dt_grid2)
dt_grid_cv2

0.925

Once again, the random forest will be trained but not cross validated: 

In [45]:
from sklearn.ensemble import RandomForestClassifier

params = {
    "n_estimators": [100, 150, 200],
    "min_samples_split": [4, 6, 8],
    "min_samples_leaf": [2, 3, 4]
}

rf_grid2 = GridSearchCV(RandomForestClassifier(random_state=42), params)
rf_grid2.fit(X_train, y_train)

The final cross-validation accuracy scores for these models:

In [46]:
print(f"KNN: {knn_grid_cv2}")
print(f"SVC: {svc_grid_cv2}")
print(f"Naive-Bayes (No GridSearch): {naive_cv2}")
print(f"Decision Tree: {dt_grid_cv2}")
print(f"Random Forest: N/A")

KNN: 0.66
SVC: 0.57
Naive-Bayes (No GridSearch): 0.8
Decision Tree: 0.925
Random Forest: N/A


### Reevaluating the Promising Models

In [47]:
naive2_preds = naive2.predict(X_test)
create_conf_matrix(y_test, naive2_preds)

Accuracy: 0.81
Precision: 0.84
Recall: 0.67


Unnamed: 0,Predicted 0,Predicted 1
Actual 0,156,42
Actual 1,16,86


In [48]:
dt_grid2_preds = dt_grid2.predict(X_test)
create_conf_matrix(y_test, dt_grid2_preds)

Accuracy: 0.91
Precision: 0.93
Recall: 0.86


Unnamed: 0,Predicted 0,Predicted 1
Actual 0,164,18
Actual 1,8,110


In [49]:
rf_grid2_preds = rf_grid2.predict(X_test)
create_conf_matrix(y_test, rf_grid2_preds)

Accuracy: 0.95
Precision: 0.97
Recall: 0.91


Unnamed: 0,Predicted 0,Predicted 1
Actual 0,169,12
Actual 1,3,116


Each model has improved significantly, with the Random Forest model attaining the best scores overall. 

Below is the table for the final scores for each model, before and after we performed grid searches and feature selection. Each column is described below: 

1. Features: What features were used.
    * Default - All Features
    * Grid Search - All Features
    * Grid Search - Selected Features
2. Model: Type of model used.
    * Naive-Bayes
    * Decision Tree
    * Random Forest
3. Accuracy: The model's accuracy score.
4. Precision: The model's precision score.
5. Recall: The model's recall score.

In [50]:
final_scores = {
    "Accuracy": [0.73, 0.77, 0.8, np.nan, 0.76, 0.82, 0.81, 0.91, 0.95],
    "Precision": [0.72, 0.76, 0.78, np.nan, 0.75, 0.81, 0.84, 0.93, 0.97],
    "Recall": [0.6, 0.67, 0.73, np.nan, 0.66, 0.75, 0.67, 0.86, 0.91]
}
levels = [["Default - All Features", 'Grid Search - All Features', 'Grid Search - Selected Features'], ['Naive-Bayes', 'Decision Tree', "Random Forest"]]

index = pd.MultiIndex.from_product(levels, names=["Features", "Model"])

pd.DataFrame(final_scores, index=index)

Unnamed: 0_level_0,Unnamed: 1_level_0,Accuracy,Precision,Recall
Features,Model,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Default - All Features,Naive-Bayes,0.73,0.72,0.6
Default - All Features,Decision Tree,0.77,0.76,0.67
Default - All Features,Random Forest,0.8,0.78,0.73
Grid Search - All Features,Naive-Bayes,,,
Grid Search - All Features,Decision Tree,0.76,0.75,0.66
Grid Search - All Features,Random Forest,0.82,0.81,0.75
Grid Search - Selected Features,Naive-Bayes,0.81,0.84,0.67
Grid Search - Selected Features,Decision Tree,0.91,0.93,0.86
Grid Search - Selected Features,Random Forest,0.95,0.97,0.91


## Conclusion

Although default models can be very helpful when trying to quickly create a machine learning algorithm, they can often be lacking when it comes to results. While libraries like SK-Learn do a good job of delivering usable, quickly deployable models, the hyperparameters often need to be fine-tuned to deliver the best results. However, hypertuning parameters only had a small effect on the performance of the models above. So, another method had to be deployed: feature selection. 

Although the idea that machine learning models always get better when you input more data, clearly thats not always the case. We saw above that our three best performing models actually did significantlly better when we intentionally selected each feature. Although the Naive-Bayes model didn't improve as much as we might have hoped, the Decision Tree and Random Forest models got excellent scores after we performed feature selection on the dataset. 

Sometimes selecting fewer features is the best course of action to improve a model. However, we cannot always assume this, hence why we trained models on the full dataset before we performed feature selection. This allowed us get a baseline of how good we cannot expect the models to perform with the full dataset. We could then use the baseline scores to determine if feature selection imrpoved each model. 

Overall, the Decision Tree and Random Forest models were improved to a point where they perform well on the testing data, which means that we can assume (to the best of our knowledge) that they will generalize well to new data. 

Thank you. 