<a href="https://colab.research.google.com/github/jcdumlao14/Supervised-Learning-Algorithms--Classification/blob/main/Micro_Courses_Supervised_Algorithms_Classification_Exercises_SVM_Exercise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Objective**

Find the digit that a set of image pixels might be representing.

For this exercise, scikit-learn's digits dataset is used. It is one of the built-in datasets in scikit-learn and can be loaded directly from `sklearn.datasets`.


# **Tasks:**

The dataset has already been loaded and provided to you as a DataFrame named df.

1. Split the data into train and test sets with an 80:20 ratio.
2. Determine the default kernel that is used while creating an SVM.
3. Create an SVM model with the RBF kernel. Train it and find out its score.
4. Create an SVM model with the linear kernel. Train it and find out its score.
5. Compare the two and see which one performs better.
6. Try using different values of C and gamma to improve model performance.

# **What Does Support Vector Machine (SVM) Mean?**
---
A support vector machine (SVM) is a supervised machine learning algorithm used for classification and regression. In its basic form, an SVM sorts data into one of two categories by finding the boundary that keeps the margin between the two classes as wide as possible. For problems with more than two classes, such as the ten digits in this exercise, scikit-learn's `SVC` combines several two-class classifiers (one-vs-one). SVMs are used in text categorization, image classification, handwriting recognition, and the sciences.

A support vector machine is also known as a support vector network (SVN).

References: https://www.techopedia.com/
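
To make the margin idea concrete, here is a minimal sketch on a hypothetical toy dataset of six 2-D points (the points, the variable names, and the choice of `C=1000` are assumptions for illustration and are not part of the exercise). It fits a linear SVM and reports the support vectors and the width of the margin between the two classes.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two linearly separable classes on either side of x = 0.
X_toy = np.array([[-2, 0], [-1, 1], [-1, -1], [1, 1], [1, -1], [2, 0]], dtype=float)
y_toy = np.array([0, 0, 0, 1, 1, 1])

# A large C approximates a hard-margin SVM on separable data.
toy_model = SVC(kernel="linear", C=1000)
toy_model.fit(X_toy, y_toy)

w = toy_model.coef_[0]                 # normal vector of the separating line
margin_width = 2 / np.linalg.norm(w)   # distance between the two margin boundaries

print("Support vectors:\n", toy_model.support_vectors_)
print("Margin width:", margin_width)
```

The points closest to the boundary are the support vectors; moving any other point (without crossing the margin) would leave the fitted boundary unchanged.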

# **SVM Kernel Functions**
---
SVM algorithms rely on mathematical functions called kernels. A kernel takes the data as input and transforms it into the required form, typically by implicitly mapping it into a space where the classes are easier to separate. Kernels can be linear or nonlinear; common nonlinear types are polynomial, radial basis function (RBF), and sigmoid.
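
As an illustration of what a kernel computes, the sketch below evaluates the RBF kernel, exp(-gamma * ||x - z||^2), by hand for two hypothetical points and compares it with scikit-learn's `rbf_kernel`. The helper `rbf`, the points `a` and `b`, and the value of `gamma` are assumptions chosen for the example.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def rbf(x, z, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance between x and z)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

# Hypothetical example points and gamma value.
a = np.array([1.0, 2.0])
b = np.array([2.0, 4.0])
gamma = 0.5

manual = rbf(a, b, gamma)
library = rbf_kernel(a.reshape(1, -1), b.reshape(1, -1), gamma=gamma)[0, 0]

print("Manual :", manual)
print("sklearn:", library)
```

The kernel value is close to 1 for points that are near each other and falls toward 0 as they move apart; a larger gamma makes that fall-off steeper.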


# **Loading Libraries and Loading the Data**

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from matplotlib import pyplot as plt
%matplotlib inline
from sklearn.datasets import load_digits
digits = load_digits() 
# Colab currently has sklearn version 0.22; with version 0.23 or above, you can pass `as_frame=True` here to get the data as a DataFrame
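
For readers on scikit-learn 0.23 or newer, a minimal sketch of the `as_frame=True` route mentioned in the comment above might look like this. The names `digits_frame` and `df_alt` are hypothetical and are used only to avoid clashing with the notebook's own `digits` and `df`.

```python
from sklearn.datasets import load_digits

# as_frame=True requires scikit-learn 0.23 or newer.
digits_frame = load_digits(as_frame=True)

# .frame holds the 64 pixel columns plus a 'target' column in one DataFrame.
df_alt = digits_frame.frame
print(df_alt.shape)
print(df_alt.columns[-1])
```

With this route the target column is already attached, so the manual conversion steps in the following cells are not needed.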

In [None]:
digits.target_names

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

We thus have the digits 0-9 in the dataset.

In [None]:
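# Converting data to DataFrame
# Note: the second positional argument of pd.DataFrame is the index, so the
# target labels become the row index here; the target column is added in the next cell.
df = pd.DataFrame(digits.data, digits.target)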
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,54,55,56,57,58,59,60,61,62,63
0,0.0,0.0,5.0,13.0,9.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,6.0,13.0,10.0,0.0,0.0,0.0
1,0.0,0.0,0.0,12.0,13.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,11.0,16.0,10.0,0.0,0.0
2,0.0,0.0,0.0,4.0,15.0,12.0,0.0,0.0,0.0,0.0,...,5.0,0.0,0.0,0.0,0.0,3.0,11.0,16.0,9.0,0.0
3,0.0,0.0,7.0,15.0,13.0,1.0,0.0,0.0,0.0,8.0,...,9.0,0.0,0.0,0.0,7.0,13.0,13.0,9.0,0.0,0.0
4,0.0,0.0,0.0,1.0,11.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,2.0,16.0,4.0,0.0,0.0


In [None]:
# Adding the target column to the dataframe
df['target'] = digits.target
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,55,56,57,58,59,60,61,62,63,target
0,0.0,0.0,5.0,13.0,9.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,6.0,13.0,10.0,0.0,0.0,0.0,0
1,0.0,0.0,0.0,12.0,13.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,11.0,16.0,10.0,0.0,0.0,1
2,0.0,0.0,0.0,4.0,15.0,12.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,3.0,11.0,16.0,9.0,0.0,2
3,0.0,0.0,7.0,15.0,13.0,1.0,0.0,0.0,0.0,8.0,...,0.0,0.0,0.0,7.0,13.0,13.0,9.0,0.0,0.0,3
4,0.0,0.0,0.0,1.0,11.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,2.0,16.0,4.0,0.0,0.0,4


# **1. Split the data into train and test sets with ratio 80:20**

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.drop('target',axis='columns'), df.target, test_size=0.2)

In [None]:
X = df.drop(['target'], axis = 'columns')
X.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,54,55,56,57,58,59,60,61,62,63
0,0.0,0.0,5.0,13.0,9.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,6.0,13.0,10.0,0.0,0.0,0.0
1,0.0,0.0,0.0,12.0,13.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,11.0,16.0,10.0,0.0,0.0
2,0.0,0.0,0.0,4.0,15.0,12.0,0.0,0.0,0.0,0.0,...,5.0,0.0,0.0,0.0,0.0,3.0,11.0,16.0,9.0,0.0
3,0.0,0.0,7.0,15.0,13.0,1.0,0.0,0.0,0.0,8.0,...,9.0,0.0,0.0,0.0,7.0,13.0,13.0,9.0,0.0,0.0
4,0.0,0.0,0.0,1.0,11.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,2.0,16.0,4.0,0.0,0.0


In [None]:
y = df.target
y

0    0
1    1
2    2
3    3
4    4
    ..
9    9
0    0
8    8
9    9
8    8
Name: target, Length: 1797, dtype: int64

Splitting the Data

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size= 0.2)

In [None]:
print(len(X_train))
print(len(X_test))

1437
360
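
Because `train_test_split` shuffles randomly, the scores in this notebook will vary slightly from run to run. One possible way to make the split reproducible and keep the digit proportions equal in both sets is sketched below; the `random_state=42` value, the `stratify` option, and the variable names are assumptions added for illustration, not part of the original exercise.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_digits(return_X_y=True)

# random_state fixes the shuffle; stratify keeps class proportions the same in both sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(len(X_tr), len(X_te))
```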


# **2. Determine the default kernel that is used while creating an SVM.**

In [None]:

model = SVC()
model.fit(X_train, y_train)


SVC()

In [None]:
model.score(X_train,y_train)

0.9965205288796103

In [None]:
model.score(X_test,y_test)

0.9916666666666667

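The default SVC scores about 99.7% on the training set and about 99.2% on the test set. The default kernel is RBF (`kernel='rbf'`), which is why the fitted model prints as plain `SVC()`.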
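
The default kernel can also be read directly from the estimator rather than inferred from the scores. A short sketch, using a hypothetical variable name `default_model`:

```python
from sklearn.svm import SVC

default_model = SVC()

# The constructor arguments are stored as attributes and exposed via get_params().
print(default_model.kernel)
print(default_model.get_params()["kernel"])
```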

# **3. Create an SVM model with RBF kernel. Train it and find out its score.**

In [None]:
model = SVC(kernel = 'rbf')
model.fit(X_train, y_train)

SVC()

In [None]:
model.score(X_train,y_train)

0.9965205288796103

The RBF kernel gives a training score of about 99.7%. This is identical to the default model's training score, as expected, because RBF is the default kernel.

# **4. Create an SVM model with Linear kernel. Train it and find out its score.**

In [None]:
model = SVC(kernel = 'linear')
model.fit(X_train, y_train)

SVC(kernel='linear')

In [None]:
model.score(X_train,y_train)

1.0

The linear kernel gives a training score of 100%.

# **5. Compare the two and see which one performs better.**

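Both the linear kernel and the RBF (radial basis function) kernel perform well. On the training set, the linear kernel scores 100% and the RBF kernel about 99.7%. On the test set, the RBF kernel scored about 99.2% in section 2 (the default kernel is RBF), while the linear kernel, evaluated in the cells that follow, scores about 99.7%. The linear kernel is therefore slightly ahead on this particular split.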

In [None]:
import numpy as np
import pickle
from sklearn import preprocessing


In [None]:
# `model` is still the linear-kernel SVC trained in step 4
predictions = model.predict(X_test)
print(predictions)

[8 4 2 1 2 1 8 5 8 6 7 2 0 9 0 5 8 6 9 3 9 4 7 5 7 1 5 4 9 8 7 0 3 5 0 2 8
 3 1 0 2 2 9 0 9 8 1 4 3 5 2 7 1 7 4 4 4 8 5 7 2 1 0 5 3 2 9 5 6 5 6 6 5 3
 2 9 0 7 3 3 5 3 9 4 5 4 8 1 0 3 7 1 3 8 4 8 3 6 8 5 7 6 9 6 2 5 2 6 2 4 8
 2 2 2 7 5 8 0 9 1 5 6 4 0 0 2 8 5 6 8 1 3 4 0 4 9 9 5 8 7 6 8 9 2 9 1 9 1
 9 1 0 4 5 3 0 7 2 2 3 2 3 3 7 8 6 9 4 6 9 3 5 3 6 5 4 5 3 3 5 7 4 7 8 4 8
 2 1 4 5 8 6 1 6 8 8 4 2 8 9 6 3 9 2 3 7 5 4 8 8 1 4 2 8 4 0 8 6 2 9 6 7 3
 7 6 7 2 7 3 5 0 9 6 9 3 1 5 1 7 7 1 5 1 5 2 3 2 3 9 4 4 4 5 6 4 9 0 6 9 7
 4 1 0 3 6 1 7 6 8 5 8 0 0 1 3 2 7 8 7 3 8 4 6 6 6 2 5 2 6 0 7 9 5 3 8 1 6
 3 8 8 5 4 0 3 0 1 7 3 9 4 5 0 2 5 5 0 0 7 9 1 2 5 0 5 7 0 4 6 0 4 7 0 1 3
 6 4 5 5 4 0 0 3 0 3 6 3 6 3 6 2 6 4 2 1 0 4 1 0 3 6 4]


In [None]:
percentage = model.score(X_test,y_test)

In [None]:
from sklearn.metrics import confusion_matrix
res = confusion_matrix(y_test, predictions)
print("Confusion Matrix")
print(res)
print(f"Test Set: {len(X_test)}")
print(f"Accuracy = {percentage * 100}%")

Confusion Matrix
[[36  0  0  0  0  0  0  0  0  0]
 [ 0 29  0  0  0  0  0  0  0  0]
 [ 0  0 36  0  0  0  0  0  0  0]
 [ 0  0  0 41  0  0  0  0  0  0]
 [ 0  0  0  0 39  0  0  0  0  0]
 [ 0  0  0  0  0 42  0  0  0  0]
 [ 0  0  0  0  0  0 38  0  0  0]
 [ 0  0  0  0  0  0  0 32  0  0]
 [ 0  1  0  0  0  0  0  0 36  0]
 [ 0  0  0  0  0  0  0  0  0 30]]
Test Set: 360
Accuracy = 99.72222222222223%
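
To compare the two kernels side by side on the same split, a loop such as the following sketch can help. It reloads the data so that it runs on its own; the `random_state=0` value and the names `results` and `clf` are assumptions, and the exact scores will depend on the split.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X_all, y_all = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.2, random_state=0)

# Train one model per kernel and record (train accuracy, test accuracy).
results = {}
for kernel in ("rbf", "linear"):
    clf = SVC(kernel=kernel).fit(X_tr, y_tr)
    results[kernel] = (clf.score(X_tr, y_tr), clf.score(X_te, y_te))
    print(f"{kernel:>6}: train={results[kernel][0]:.4f}  test={results[kernel][1]:.4f}")
```

Comparing test scores rather than training scores gives a fairer picture, since a model can fit the training set perfectly and still generalize less well.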


# **6. Try using different values of C and gamma to improve model performance.**

Values of C to improve model performance
------

In [None]:
model = SVC(C=10)
model.fit(X_train, y_train)

SVC(C=10)

In [None]:
model.score(X_train,y_train)

1.0

Value of gamma = 10 to improve model performance
------

In [None]:
model = SVC(gamma=10)
model.fit(X_train, y_train)

SVC(gamma=10)

In [None]:
model.score(X_train,y_train)

1.0

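Both settings give the same training score of 1.0. Because these are training-set scores, they do not show whether performance on unseen data improved. A large gamma in particular can overfit the training data, so the test-set score should be checked as well.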
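
A more systematic way to try different values of C and gamma is a cross-validated grid search. The sketch below is one possible setup; the grid values, `cv=3`, `random_state=0`, and the variable names are assumptions chosen for illustration rather than tuned recommendations.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X_all, y_all = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.2, random_state=0)

# Hypothetical grid: every combination of C and gamma is scored with 3-fold cross-validation.
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.001, 0.01]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3)
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print("Best CV accuracy:", search.best_score_)
# score() uses the best model, refit on the whole training set.
print("Test accuracy:", search.score(X_te, y_te))
```

Selecting parameters by cross-validation on the training data and reporting the test score only at the end keeps the test set as an honest estimate of performance.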