# SVM Demonstration

In this tutorial we will demonstrate how to use the `SVC` class in `scikit-learn` to fit support vector machine (SVM) classifiers to a dataset.

NOTE: In this example we split the data into training and validation sets (as we did in the previous examples) and evaluate each model on the validation set. The dataset has only 24 rows, so the resulting metrics are illustrative rather than reliable estimates of model performance.

## 1. Setup

Import modules

In [1]:
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Seed NumPy's global random state so the train/test split below is reproducible
np.random.seed(28)

## 2. Data Load

Load data (it's already cleaned and preprocessed)

In [2]:
# Uncomment the following snippet of code to debug problems with finding the .csv file path.
# It prints the current working directory.
#import os
#print(os.getcwd())

In [3]:
df = pd.read_csv('C:/Users/surya/OneDrive/Documents/USF/Classes/Semester-2/DSP/Week-3 Codes/RidingMowers.csv') # let's use the same data as we did in the logistic regression example
df.head(5)

Unnamed: 0,Income,Lot_Size,Ownership
0,60.0,18.4,Owner
1,85.5,16.8,Owner
2,64.8,21.6,Owner
3,61.5,20.8,Owner
4,87.0,23.6,Owner


Now let us inspect the structure of the data.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24 entries, 0 to 23
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Income     24 non-null     float64
 1   Lot_Size   24 non-null     float64
 2   Ownership  24 non-null     object 
dtypes: float64(2), object(1)
memory usage: 704.0+ bytes


Next, let us check the data for null values.

In [5]:
df.isna().sum()

Income       0
Lot_Size     0
Ownership    0
dtype: int64

Encoding the Target Variable using Label Encoder

In [6]:
from sklearn.preprocessing import LabelEncoder


In [7]:
LE = LabelEncoder()
df['Ownership'] = LE.fit_transform(df['Ownership'])
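`LabelEncoder` assigns integer codes in sorted order of the class labels, so it is worth checking which code ended up as the positive class before computing precision and recall. A minimal sketch, assuming the two labels are `Owner` and `Nonowner` (only `Owner` is visible in the preview above, so the second label is an assumption):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical label values; only "Owner" appears in the preview above
labels = ["Owner", "Nonowner", "Owner", "Nonowner"]

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ is sorted, and the position of each label is its integer code
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original strings from the codes
restored = list(encoder.inverse_transform(codes))
```

Under that assumption the owner class is encoded as `1`, which is the class the confusion-matrix code later treats as positive.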

Now that the encoding is done, let us split the data into training and validation sets.

We will use a 70:30 split between training and validation so that most of the rows remain available for training.

In [8]:
df_train, df_test = train_test_split(df, test_size=0.3)
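With a dataset this small, a purely random split can leave the validation set with an unbalanced class mix. The sketch below shows a reproducible, stratified alternative on a synthetic frame. The column names mirror the notebook, but the values are made up for illustration. `stratify` and `random_state` are standard `train_test_split` parameters that the cell above does not use.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 24 rows, 12 per class (values are made up)
toy = pd.DataFrame({
    "Income": [50.0 + i for i in range(24)],
    "Lot_Size": [15.0 + 0.5 * i for i in range(24)],
    "Ownership": [0, 1] * 12,
})

toy_train, toy_test = train_test_split(
    toy,
    test_size=0.3,
    random_state=28,            # fixes the shuffle without touching the global seed
    stratify=toy["Ownership"],  # keeps the class ratio the same in both parts
)

print(len(toy_train), len(toy_test))
print(toy_test["Ownership"].value_counts().to_dict())
```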

Let us now define what our predictor and target variables are.

In [9]:
target = 'Ownership'
predictors = list(df.columns)
predictors.remove(target)

Preparing the training and test sets from the data

In [10]:
X_train = df_train[predictors]
X_test = df_test[predictors]
y_train = df_train[target]
y_test = df_test[target]

## 3. Model the data

First, let's create a dataframe to load the model performance metrics into.

In [11]:
performance = pd.DataFrame({"model": [], "Accuracy": [], "Precision": [], "Recall": [], "F1": []})

### 3.1 Fit an SVM classification model using a linear kernel

In [12]:
svm_lin_model = SVC(kernel="linear", probability=True)
_ = svm_lin_model.fit(X_train, np.ravel(y_train))

In [13]:
model_preds = svm_lin_model.predict(X_test)
c_matrix = confusion_matrix(y_test, model_preds)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"linear svm", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
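The hand-written formulas above match the metric functions in `sklearn.metrics`. The sketch below wraps them in a small helper; `metrics_row` is a hypothetical name, and the labels are toy values, not the notebook's predictions. It assumes class `1` is the positive class.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score


def metrics_row(name, y_true, y_pred):
    """Return a one-row DataFrame with the same columns as `performance`."""
    return pd.DataFrame({
        "model": [name],
        "Accuracy": [accuracy_score(y_true, y_pred)],
        "Precision": [precision_score(y_true, y_pred)],
        "Recall": [recall_score(y_true, y_pred)],
        "F1": [f1_score(y_true, y_pred)],
    })


# Toy labels for illustration only: TP=2, TN=1, FP=1, FN=1
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0]

row = metrics_row("toy model", y_true, y_pred)
print(row)
```

Each result could then be appended with `pd.concat([performance, metrics_row(...)], ignore_index=True)`, which would also avoid the repeated index `0` in the summary table.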

### 3.2 Fit an SVM classification model using an RBF kernel

In [14]:
svm_rbf_model = SVC(kernel="rbf", C=10, gamma='scale')
_ = svm_rbf_model.fit(X_train, np.ravel(y_train))
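For the RBF kernel, `gamma='scale'` tells scikit-learn to compute gamma as `1 / (n_features * X.var())` from the training data. A small sketch on synthetic data (the arrays are made up; only the `SVC` parameters come from the cell above) checks that passing this value explicitly gives the same decision function:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic two-feature data, loosely shaped like (Income, Lot_Size); values are made up
X = np.column_stack([rng.normal(70, 20, 40), rng.normal(19, 2, 40)])
y = (X[:, 0] + 5 * X[:, 1] > 165).astype(int)

# What gamma='scale' resolves to for this X
gamma_scale = 1.0 / (X.shape[1] * X.var())

auto_model = SVC(kernel="rbf", C=10, gamma="scale").fit(X, y)
manual_model = SVC(kernel="rbf", C=10, gamma=gamma_scale).fit(X, y)

same = np.allclose(auto_model.decision_function(X), manual_model.decision_function(X))
print(gamma_scale, same)
```

Because `X.var()` is taken over all features together, a feature on a larger scale (here, income compared with lot size) dominates the value. This is one reason features are often standardized before fitting an RBF SVM.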

In [15]:
model_preds = svm_rbf_model.predict(X_test)
c_matrix = confusion_matrix(y_test, model_preds)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"rbf svm", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])

### 3.3 Fit an SVM classification model using a polynomial kernel

In [16]:
svm_poly_model = SVC(kernel="poly", degree=3, coef0=1, C=10)
_ = svm_poly_model.fit(X_train, np.ravel(y_train))

In [17]:
model_preds = svm_poly_model.predict(X_test)
c_matrix = confusion_matrix(y_test, model_preds)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"poly svm", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])

## 4. Summary


In [18]:
performance

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,linear svm,0.875,0.833333,1.0,0.909091
0,rbf svm,0.625,0.666667,0.8,0.727273
0,poly svm,0.75,0.8,0.8,0.8


## 5. Analysis

Choosing a model depends entirely on how we define the business problem. I am defining the business problem as follows.

There are two ways in which we can make a wrong prediction:

i. The person is not actually an owner but we predict them to be an owner, which is a false positive.

ii. The person actually is an owner but we predict them to be a non-owner, which is a false negative.

In my view, a false positive is more costly than a false negative: we put a lot of effort into pursuing a potential customer, and when that prediction turns out to be wrong, we not only fail to do any business but also spend a lot in the process of trying to acquire the customer.

So our main aim should be to reduce false positives, and we choose precision as the metric for comparing the models.
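As a worked example of this metric, the counts below are inferred from the linear SVM's reported scores under the assumption that the validation set has 8 rows (30% of 24, rounded up). The notebook never prints the confusion matrix itself, so treat them as an illustration rather than recorded output.

```python
# Counts inferred from the reported linear SVM scores, assuming 8 validation rows
TP, TN, FP, FN = 5, 2, 1, 0

precision = TP / (TP + FP)   # share of predicted owners who really are owners
recall = TP / (TP + FN)      # share of real owners the model finds
accuracy = (TP + TN) / (TP + TN + FP + FN)
f1 = 2 * TP / (2 * TP + FP + FN)

print(round(accuracy, 3), round(precision, 6), round(recall, 1), round(f1, 6))
# prints: 0.875 0.833333 1.0 0.909091
```

Under these counts, a single false positive among six predicted owners is all that separates the model from perfect precision.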

In [19]:
performance.sort_values(by=['Precision'])

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,rbf svm,0.625,0.666667,0.8,0.727273
0,poly svm,0.75,0.8,0.8,0.8
0,linear svm,0.875,0.833333,1.0,0.909091


We can clearly see that the linear SVM performs best on precision (0.833); it also has the highest accuracy, recall and F1 of the three models.

So the linear SVM is my winning model.

## 6. Prediction using Winning Model

In [20]:
df_test = df_test.copy()  # work on an explicit copy to avoid pandas' SettingWithCopyWarning
df_test['predicted'] = svm_lin_model.predict(X_test)
df_test.head(5)

Unnamed: 0,Income,Lot_Size,Ownership,predicted
10,51.0,22.0,1,1
19,66.0,18.4,0,1
17,49.2,17.6,0,0
11,81.0,20.0,1,1
4,87.0,23.6,1,1


In [21]:
df_test['pred_prob'] = svm_lin_model.predict_proba(X_test)[:,1]
df_test.head(5)

Unnamed: 0,Income,Lot_Size,Ownership,predicted,pred_prob
10,51.0,22.0,1,1,0.446892
19,66.0,18.4,0,1,0.446376
17,49.2,17.6,0,0,0.426178
11,81.0,20.0,1,1,0.468516
4,87.0,23.6,1,1,0.490436
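Several rows above have `predicted = 1` even though `pred_prob` is below 0.5. This is expected with `SVC`: `predict` follows the sign of `decision_function`, while `predict_proba` comes from a separate Platt-scaling fit with internal cross-validation, and the scikit-learn documentation warns that the two can disagree, especially on very small datasets. A sketch on synthetic data (values are made up, not the notebook's) shows the relationship that holds for a binary `SVC`:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)

# Synthetic binary problem; not the notebook's data
X = rng.normal(size=(30, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

model = SVC(kernel="linear", probability=True, random_state=0).fit(X, y)

scores = model.decision_function(X)       # signed score per row
labels_from_scores = (scores > 0).astype(int)

# predict() agrees with the sign of the score, whatever predict_proba says
agrees = np.array_equal(labels_from_scores, model.predict(X))
proba_positive = model.predict_proba(X)[:, 1]
print(agrees, proba_positive[:3])
```

When a ranking or threshold is needed, `decision_function` is therefore the more consistent companion to `predict`; the probabilities are best read as rough estimates on data of this size.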


## 7. Saving the model to disk

In [22]:
import pickle

# Use a context manager so the file handle is closed after writing
with open('SVM_LinearModel.pkl', "wb") as f:
    pickle.dump(svm_lin_model, f)
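To reuse a saved model later, it can be loaded back with `pickle.load`. A minimal round-trip sketch on a toy model follows; the temporary file path and the toy data are assumptions for illustration, and pickles should only be loaded from sources you trust.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.svm import SVC

# Toy model standing in for svm_lin_model
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [3.0, 3.0], [4.0, 3.0], [3.0, 4.0]])
y = np.array([0, 0, 0, 1, 1, 1])
model = SVC(kernel="linear").fit(X, y)

# Write to a temporary location rather than overwriting a real file
path = os.path.join(tempfile.mkdtemp(), "toy_svm.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

# The restored model should reproduce the original predictions
round_trip_ok = np.array_equal(model.predict(X), restored.predict(X))
print(round_trip_ok)
```

A pickled estimator is generally only safe to load with the same scikit-learn version that produced it, so it helps to record that version alongside the file.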
