# Classifying Likely Purchases for Social Network Ads

By: Matt Purvis

This project trains a Gaussian Naive Bayes classifier to predict which customers are likely to make a purchase, based on their age and estimated salary.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report, confusion_matrix

# Import Data

In [3]:
# Every backslash is escaped so that no part of the Windows path is read as an escape sequence.
filepath = 'C:\\Users\\v-mpurvis\\OneDrive\\Personal Files\\Python Machine Learning Examples\\DataSets-Modules\\'

dataset = pd.read_csv(filepath + 'Social_Network_Ads.csv')
dataset

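| Unnamed: 0 | Age | EstimatedSalary | Purchased |
|---|---|---|---|
| 0 | 19 | 19000 | 0 |
| 1 | 35 | 20000 | 0 |
| 2 | 26 | 43000 | 0 |
| 3 | 27 | 57000 | 0 |
| 4 | 19 | 76000 | 0 |
| ... | ... | ... | ... |
| 395 | 46 | 41000 | 1 |
| 396 | 51 | 23000 | 1 |
| 397 | 50 | 20000 | 1 |
| 398 | 36 | 33000 | 0 |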


# Train Test Split

In [4]:
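# Features: every column except the last. Target: the last column (Purchased).
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

# Hold out 25% of the rows for testing; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)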
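
The introduction describes a model built on age and estimated salary, but `iloc[:, :-1]` keeps every column except the last, which in the table shown above would also include the `Unnamed: 0` index column. One way to make the feature set explicit is to select columns by name. The sketch below uses a toy frame copied from the first five rows of that table (the name `toy` is an assumption for illustration, not part of the original notebook).

```python
import pandas as pd

# Toy frame copied from the first five rows of the table above;
# the real notebook would use the full `dataset` instead.
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3, 4],
    "Age": [19, 35, 26, 27, 19],
    "EstimatedSalary": [19000, 20000, 43000, 57000, 76000],
    "Purchased": [0, 0, 0, 0, 0],
})

# Name the features explicitly so the index column cannot slip in.
feature_cols = ["Age", "EstimatedSalary"]
X_toy = toy[feature_cols].values
y_toy = toy["Purchased"].values

print(X_toy.shape)  # (5, 2)
```

Whether the reported metrics change when the index column is dropped would need to be checked by rerunning the notebook.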

# Scale the Features

In [6]:
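sc = StandardScaler()
# Learn the mean and standard deviation from the training set only...
X_train = sc.fit_transform(X_train)
# ...then apply those same statistics to the test set to avoid leakage.
X_test = sc.transform(X_test)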
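
As a rough illustration of what the scaling step computes, the sketch below standardizes a single feature by hand. It assumes toy values: ages from the first and last rows of the table above stand in for a training and a test split, which is not the split the notebook actually produces. `StandardScaler` uses the population standard deviation (dividing by n), so `pstdev` is used here.

```python
from statistics import mean, pstdev

# Toy ages taken from the table above, standing in for the real splits.
train_ages = [19, 35, 26, 27, 19]
test_ages = [46, 51]

# Statistics come from the training values only.
mu = mean(train_ages)
sigma = pstdev(train_ages)  # population standard deviation (divides by n)

# The same mu and sigma are applied to both splits.
train_scaled = [(a - mu) / sigma for a in train_ages]
test_scaled = [(a - mu) / sigma for a in test_ages]

print([round(v, 2) for v in train_scaled])
print([round(v, 2) for v in test_scaled])
```

After scaling, the training values have mean 0 and standard deviation 1, while the test values generally do not, because they are measured against the training statistics.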

# Create and Fit Model

In [7]:
classifier = GaussianNB()
classifier.fit(X_train, y_train)

GaussianNB()
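
To give a sense of what a Gaussian Naive Bayes model does when it is fit, here is a minimal from-scratch sketch. It is not scikit-learn's implementation: the function names, the toy data, and the fixed `1e-9` variance floor are all assumptions made for illustration. For each class it estimates a prior plus a per-feature mean and variance, then predicts the class with the highest log prior plus summed Gaussian log-likelihoods.

```python
import math

def fit_gaussian_nb(X, y):
    """Estimate the prior, feature means, and feature variances per class."""
    params = {}
    for label in sorted(set(y)):
        rows = [x for x, t in zip(X, y) if t == label]
        n_features = len(rows[0])
        means = [sum(r[j] for r in rows) / len(rows) for j in range(n_features)]
        variances = [
            # Small floor keeps the variance strictly positive (simplified smoothing).
            sum((r[j] - means[j]) ** 2 for r in rows) / len(rows) + 1e-9
            for j in range(n_features)
        ]
        params[label] = (len(rows) / len(X), means, variances)
    return params

def predict_one(params, x):
    """Return the class with the highest log-posterior score for one sample."""
    best_label, best_score = None, -math.inf
    for label, (prior, means, variances) in params.items():
        score = math.log(prior)
        for value, mu, var in zip(x, means, variances):
            # Log of the Gaussian density, summed over independent features.
            score += -0.5 * math.log(2 * math.pi * var) - (value - mu) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical standardized (age, salary) pairs, not taken from the dataset.
X_toy = [[-1.0, -0.8], [-0.5, -1.2], [-1.2, 0.1],
         [1.0, 0.9], [0.8, 1.4], [1.3, 0.2]]
y_toy = [0, 0, 0, 1, 1, 1]

model = fit_gaussian_nb(X_toy, y_toy)
print(predict_one(model, [-0.9, -0.5]))  # 0
print(predict_one(model, [1.1, 1.0]))    # 1
```

The "naive" part is the inner loop: each feature contributes its own term to the score, as if age and salary were independent given the class.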

# Evaluate

In [8]:
y_pred = classifier.predict(X_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.90      0.96      0.93        68
           1       0.89      0.78      0.83        32

    accuracy                           0.90       100
   macro avg       0.90      0.87      0.88       100
weighted avg       0.90      0.90      0.90       100

[[65  3]
 [ 7 25]]


Ten of the 100 test observations were misclassified: 7 false negatives and 3 false positives, for an accuracy of 90%. Precision is the share of predicted positives that are actually positive (25/28 ≈ 89%). Recall, the "true positive rate", is the share of actual positives that the model predicts as positive (25/32 ≈ 78%). The F1 score is the harmonic mean of precision and recall, so it drops when either one is low. With classes as uneven as they are here (68 negatives versus 32 positives in the test set), it is often a better indicator of model success than accuracy alone. In this case the F1 score for the purchasing class is 0.83, the macro-averaged F1 is 0.88, and the accuracy is 0.90, which is not bad.

Generalizability remains an open question because the dataset is small: the test set holds only 100 observations. More observations could strengthen the model's predictive power and generalizability.
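
The metrics discussed above can be recomputed directly from the confusion matrix, which may help make the definitions concrete. The sketch below hard-codes the four counts printed in the evaluation output; the variable names are chosen here for illustration.

```python
# Confusion matrix from the evaluation output above:
# rows are actual classes (0, 1), columns are predicted classes (0, 1).
cm = [[65, 3],
      [7, 25]]

tn, fp = cm[0]  # actual 0: true negatives, false positives
fn, tp = cm[1]  # actual 1: false negatives, true positives

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # 25 / 28
recall = tp / (tp + fn)      # 25 / 32
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy:  {accuracy:.2f}")   # 0.90
print(f"precision: {precision:.2f}")  # 0.89
print(f"recall:    {recall:.2f}")     # 0.78
print(f"f1:        {f1:.2f}")         # 0.83
```

These values match the row for class 1 in the classification report.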