This is a social network ad dataset that includes three columns: Age, Estimated Salary, and Purchased.

    Age:              Integer
    Estimated Salary: Integer
    Purchased:        Integer (0 or 1)
    
The goal is to build a classifier that predicts whether a given user will make a purchase, based on the user's age and estimated salary. Two features are not much to go on, so this is not a very good dataset for the task; however, this is just a little intro for me to play around with different classification methods.

First we need to import the necessary libraries.

In [1]:
import numpy as np
import pandas as pd

In [4]:
df = pd.read_csv('Social_Network_Ads.csv')
print(df)
print(df['Purchased'])

     Age  EstimatedSalary  Purchased
0     19            19000          0
1     35            20000          0
2     26            43000          0
3     27            57000          0
4     19            76000          0
..   ...              ...        ...
395   46            41000          1
396   51            23000          1
397   50            20000          1
398   36            33000          0
399   49            36000          1

[400 rows x 3 columns]
0      0
1      0
2      0
3      0
4      0
      ..
395    1
396    1
397    1
398    0
399    1
Name: Purchased, Length: 400, dtype: int64



My first classification method will be a support vector machine (SVM). SVMs work well on datasets that are linearly separable, and the kernel trick lets them handle data that is not. I do not yet know whether this data is linearly separable, so I will first attempt a linear SVM and then move to the kernel trick if that is needed.
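To make the linear-versus-kernel distinction concrete, here is a minimal sketch on synthetic data (concentric circles from `make_circles`, not the ad dataset). The names `X_demo`, `y_demo`, and `kernel_scores` are only for illustration. The `StandardScaler` step is my own assumption: SVMs are sensitive to feature scale, and Age and EstimatedSalary differ by several orders of magnitude, so scaling is probably worth trying on the real data too. Note that scikit-learn's `SVC()` uses `kernel='rbf'` by default, so a linear SVM has to be requested explicitly.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data that is NOT linearly separable (two concentric rings).
X_demo, y_demo = make_circles(n_samples=400, noise=0.1, factor=0.3, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, random_state=0, stratify=y_demo
)

# Fit the same pipeline twice, changing only the kernel.
kernel_scores = {}
for kernel in ("linear", "rbf"):
    clf = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    clf.fit(Xd_train, yd_train)
    kernel_scores[kernel] = clf.score(Xd_test, yd_test)

print(kernel_scores)
```

On data like this, the RBF kernel should score far higher than the linear one. Whether the same holds for the ad dataset is something to check rather than assume.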

First, I am going to check whether the dataset is balanced.

In [9]:
print("Num Purchased = ", len(df.loc[df['Purchased'] == 1]))
print("Num Not Purchased = ", len(df.loc[df['Purchased'] == 0]))


Num Purchased =  143
Num Not Purchased =  257


Sadly, this dataset is not balanced: non-purchasers outnumber purchasers 257 to 143.
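To put a number on the imbalance, the short sketch below turns the two counts printed above into class shares and a majority-class baseline, meaning the accuracy of a classifier that always predicts 0. The variable names are only for illustration.

```python
# Class counts copied from the printed output above.
num_purchased = 143
num_not_purchased = 257
total = num_purchased + num_not_purchased

# Fraction of users in the minority (purchased) class.
purchased_share = num_purchased / total

# Accuracy of a trivial classifier that always predicts the majority class.
majority_baseline = max(num_purchased, num_not_purchased) / total

print(f"Purchased share:   {purchased_share:.4f}")
print(f"Majority baseline: {majority_baseline:.4f}")
```

Any model worth keeping should beat that baseline. Two options that may help with the imbalance are passing `stratify=y` to `train_test_split` and `class_weight='balanced'` to `SVC`; I have not tried either here.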

In [25]:
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

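# Features: Age and EstimatedSalary; target: Purchased.
x = df.iloc[:, 0:2].values
y = df['Purchased'].to_numpy()

# With the default settings, 25% of the rows are held out for testing.
x_train, x_test, y_train, y_test = train_test_split(x, y)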
print(len(x_train))
print(len(x_test))
print(len(y_train))
print(len(y_test))

300
100
300
100


In [28]:
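# Note: SVC() uses the RBF kernel by default; pass kernel='linear' for a linear SVM.
model = SVC()

# Fit on the training split, then predict the held-out test split.
model.fit(x_train, y_train)

predictions = model.predict(x_test)

# Per-class precision/recall/F1, followed by the raw confusion matrix.
print(classification_report(y_test, predictions))
print(confusion_matrix(y_test, predictions))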

              precision    recall  f1-score   support

           0       0.64      1.00      0.78        63
           1       1.00      0.05      0.10        37

    accuracy                           0.65       100
   macro avg       0.82      0.53      0.44       100
weighted avg       0.78      0.65      0.53       100

[[63  0]
 [35  2]]
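To see where the numbers in the report come from, the sketch below recomputes them by hand from the confusion matrix printed above. In scikit-learn's `confusion_matrix`, rows are the true classes and columns are the predicted classes. The variable names are only for illustration.

```python
# Confusion matrix copied from the output above:
# rows = true class (0, 1), columns = predicted class (0, 1).
cm = [[63, 0],
      [35, 2]]

tn, fp = cm[0]  # true class 0: predicted 0, predicted 1
fn, tp = cm[1]  # true class 1: predicted 0, predicted 1

precision_1 = tp / (tp + fp)  # of the predicted purchases, how many were real
recall_1 = tp / (tp + fn)     # of the real purchases, how many were found
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"class 1: precision={precision_1:.2f} recall={recall_1:.2f}")
print(f"class 0: precision={precision_0:.2f} recall={recall_0:.2f}")
print(f"accuracy={accuracy:.2f}")
```

The model labels 98 of the 100 test users as class 0 and finds only 2 of the 37 actual purchasers, so the 0.65 accuracy mostly reflects the share of non-purchasers in the test split rather than anything the model has learned.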




To explore further, see the scikit-learn SVM documentation: https://scikit-learn.org/stable/modules/svm.html