## SVM

- A Support Vector Machine (SVM) is a supervised learning model for classification and regression problems. In its basic form it is a linear model, and with kernels it can also solve non-linear problems, so it works well for many practical tasks. The idea of SVM is simple: the algorithm finds a line or a hyperplane that separates the data into classes.

### Main Idea Behind SVM

1. Start with data in a relatively low dimension.
2. Move the data to a higher dimension. 
3. Find the __Support Vector Classifier__ that separates the higher-dimensional data into 2 groups.


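- When the data is 1D, the Support Vector Classifier is a __POINT__ on a number line.
- When the data is 2D, the Support Vector Classifier is a __1D LINE__.
- When the data is 3D, the Support Vector Classifier is a __2D PLANE__.
- When the data has 4 or more dimensions, the Support Vector Classifier is a __HYPERPLANE__.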
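
The three steps above can be sketched on a toy example. The data below is made up for illustration, and adding `x**2` as a second feature is just one hand-picked way to "move to a higher dimension"; a kernel SVM does something comparable implicitly. The names `X_1d`, `X_2d`, `acc_1d` and `acc_2d` are assumptions of this sketch.

```python
import numpy as np
from sklearn.svm import SVC

# Toy 1D data (made up): class 1 sits in the middle, class 0 on both sides,
# so no single point on the number line can separate the two classes.
x = np.array([-4.0, -3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0, 0, 0, 1, 1, 1, 0, 0, 0])

# Step 1: the data in its original low dimension.
X_1d = x.reshape(-1, 1)

# Step 2: move the data to a higher dimension by adding x**2 as a second feature.
X_2d = np.column_stack([x, x ** 2])

# Step 3: fit a linear support vector classifier in each space and compare.
acc_1d = SVC(kernel="linear", C=10).fit(X_1d, y).score(X_1d, y)
acc_2d = SVC(kernel="linear", C=10).fit(X_2d, y).score(X_2d, y)

print("training accuracy in 1D:", acc_1d)
print("training accuracy in 2D:", acc_2d)
```

In the 2D space the classes can be split by a straight line (roughly a threshold on `x**2`), which is not possible with a single point in 1D.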

### About The Project And Dataset

- The dataset we are going to use comes from the __National Institute of Diabetes and Digestive and Kidney Diseases__, and contains anonymized diagnostic measurements for a set of female patients.  
- We will train a support vector machine to predict whether a new patient has diabetes based on such measurements.

### Import Libraries

In [1]:
import numpy as np
import pandas as pd
from sklearn import svm
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

### Getting Data

In [2]:
column_names = ['pregnancies', 'glucose', 'bpressure', 'skinfold', 'insulin', 'bmi', 'pedigree', 'age', 'class']

In [3]:
df = pd.read_csv("data.csv", names= column_names)

In [4]:
df.head()

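|   | pregnancies | glucose | bpressure | skinfold | insulin | bmi | pedigree | age | class |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |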


In [5]:
df.shape

(768, 9)

### Extracting Data


In [6]:
X = df.iloc[:,:8]

In [7]:
X.head()

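|   | pregnancies | glucose | bpressure | skinfold | insulin | bmi | pedigree | age |
|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 |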


### Extracting Class Labels

In [8]:
y = df['class']

In [9]:
y.head()

0    1
1    0
2    1
3    0
4    1
Name: class, dtype: int64

### Splitting The Dataset Into Training And Testing Sets

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [11]:
print(X_train.shape)
print(y_train.shape)

print(X_test.shape)
print(y_test.shape)

(576, 8)
(576,)
(192, 8)
(192,)
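
Because the classes are not equally frequent, it may be worth keeping the class ratio the same in both parts of the split. A minimal sketch using the `stratify` parameter of `train_test_split`, on synthetic labels rather than the diabetes data (the 65/35 ratio and the names `X_tr`, `X_te`, `y_tr`, `y_te` are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels (an assumption, not the diabetes data): 65 zeros, 35 ones.
y = np.array([0] * 65 + [1] * 35)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

# stratify=y asks train_test_split to preserve the class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("share of positives in train:", y_tr.mean())
print("share of positives in test: ", y_te.mean())
```

Without `stratify`, the two shares can drift apart by chance, especially on small datasets.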


### Normalizing Features

In [12]:
scaler = StandardScaler()

In [13]:
scaler.fit(X_train)

StandardScaler()

In [14]:
X_train = scaler.transform(X_train)

In [15]:
X_train[:5, :]

array([[ 2.80346794,  0.25977903, -3.78077929,  0.61677038, -0.69205168,
         1.03974028,  0.29608546,  0.96352088],
       [ 0.07832678,  0.25977903,  0.89724451, -0.03210586,  1.63307692,
         0.40945373, -0.70087555, -0.86295593],
       [-0.22446668, -1.85825286,  0.67966201,  0.48699513, -0.69205168,
         0.31753694, -0.66548048,  1.13747105],
       [-0.52726014, -1.2353023 ,  0.13570575, -0.35654397, -0.03757104,
        -0.24709476,  0.2311945 , -0.68900576],
       [-1.13284707, -0.58120422,  0.29889263,  0.16255702, -0.69205168,
        -4.19951667,  0.30493422, -1.03690611]])
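
To make the scaling step concrete, the sketch below checks on a tiny made-up matrix that `StandardScaler` subtracts each column's mean and divides by its standard deviation. The names `X_demo`, `scaler_demo`, `scaled` and `manual` are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Small made-up feature matrix: two columns on very different scales.
X_demo = np.array([[1.0, 100.0],
                   [2.0, 300.0],
                   [3.0, 500.0]])

scaler_demo = StandardScaler().fit(X_demo)
scaled = scaler_demo.transform(X_demo)

# The same result by hand: subtract the column mean and divide by the
# column standard deviation (population std, ddof=0, as StandardScaler uses).
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(scaled)
print(np.allclose(scaled, manual))
```

After scaling, every column has mean 0 and unit variance, so no single feature dominates the distance computations inside the SVM just because of its units.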

### Training The Support Vector Machines

In [16]:
clf = svm.SVC(kernel='sigmoid')
clf.fit(X_train, y_train)

SVC(kernel='sigmoid')

### Predictions On The Training Set

In [17]:
y_pred = clf.predict(X_train)

In [18]:
print(y_pred)

[0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 1 1 0 0 0 1 0 1 0 0 1 1 0 0 0 0 1 1 0 1
 0 0 0 1 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0
 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 1 1 0 0 0 1 1 0 1 0 0 0 0 1 0 0 1 1 0 0 0 1
 0 0 1 0 1 1 1 0 1 1 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 1 1 0 0
 0 0 1 1 1 1 0 1 0 0 0 0 1 0 1 1 1 0 0 1 1 1 1 0 0 0 0 0 1 0 0 0 1 1 1 0 1
 0 0 0 0 1 1 1 0 0 1 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0
 0 1 1 0 1 1 0 1 0 0 0 0 1 0 0 0 0 0 0 1 1 0 0 1 1 0 1 0 1 0 1 0 1 0 0 1 0
 0 1 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 1 1 0 1 0 0 0 0 0 0 1 0 1 1 0
 1 1 0 1 0 1 0 0 1 0 0 0 1 0 0 1 1 0 0 1 1 0 0 0 1 0 0 0 0 0 1 1 1 0 0 1 0
 0 1 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 1 0 1 0 0 0 0 1 0 1 0 0 0 0 0 1 1 0 0
 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 1 1 0 0 1 1 0 0 0 1 0 0 0 0 0 0 1
 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0
 1 0 0 1 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 1 0 1 0 0 0 0 0
 0 0 0 1 0 1 0 0 1 0 0 0 

In [19]:
accuracy_score(y_train, y_pred)

0.6875

### Checking SVM for Different Kernels

In [20]:
for k in ('linear', 'poly', 'rbf', 'sigmoid'):
    clf = svm.SVC(kernel=k)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_train)
    print(k)
    print(accuracy_score(y_train, y_pred))

linear
0.78125
poly
0.7951388888888888
rbf
0.8315972222222222
sigmoid
0.6875
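
The accuracies above are measured on the same rows the models were trained on, so they can be optimistic. One common alternative is k-fold cross-validation. The sketch below runs it on synthetic data from `make_classification`; that data and the names `X_demo`, `y_demo` and `cv_scores` are assumptions, not part of this notebook, so the printed numbers say nothing about the diabetes dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; swap in the real training split).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

cv_scores = {}
for k in ("linear", "poly", "rbf", "sigmoid"):
    # Scaling lives inside the pipeline, so each fold is scaled using
    # statistics from its own training part only.
    model = make_pipeline(StandardScaler(), SVC(kernel=k))
    cv_scores[k] = cross_val_score(model, X_demo, y_demo, cv=5).mean()

for k, score in cv_scores.items():
    print(k, round(score, 3))
```

Each score is the mean accuracy over 5 held-out folds, which is usually a fairer basis for choosing a kernel than training accuracy.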


### Instantiating the Best Model

In [21]:
clf = svm.SVC(kernel='rbf')

In [22]:
clf.fit(X_train, y_train)

SVC()

### Making A Single Prediction

In [23]:
# Feature order: pregnancies, glucose, bpressure, skinfold, insulin, bmi, pedigree, age

patient = np.array([[1., 50., 75., 40., 0., 45., 1.5, 20.]])
patient = scaler.transform(patient)
clf.predict(patient)

# i.e. no diabetes

array([0], dtype=int64)

In [24]:
patient = np.array([[1., 200. , 75., 40., 0., 45., 1.5, 20]])
patient = scaler.transform(patient)
clf.predict(patient)

# i.e. has diabetes

array([1], dtype=int64)

### Testing Set Prediction

In [25]:
patient = np.array([X_test.iloc[0]])
patient = scaler.transform(patient)
print(clf.predict(patient))
print(y_test.iloc[0])

# predicted correctly

[0]
0


In [26]:
patient = np.array([X_test.iloc[1]])
patient = scaler.transform(patient)
print(clf.predict(patient))
print(y_test.iloc[1])

# predicted correctly

[0]
0


In [27]:
patient = np.array([X_test.iloc[2]])
patient = scaler.transform(patient)
print(clf.predict(patient))
print(y_test.iloc[2])

[0]
0


In [28]:
patient = np.array([X_test.iloc[8]])
patient = scaler.transform(patient)
print(clf.predict(patient))
print(y_test.iloc[8])

# predicted incorrectly

[1]
0


### Accuracy On Testing Set

In [29]:
X_test = scaler.transform(X_test)

In [30]:
y_pred = clf.predict(X_test)

In [31]:
print(accuracy_score(y_test,y_pred))

# Our model is about 73% accurate on the test set

0.7291666666666666


### Comparison to All-Zero Prediction

That is, we predict that none of the patients have diabetes, so every value of y_pred is 0.

In [32]:
y_zero  = np.zeros(y_test.shape)

In [33]:
print(accuracy_score(y_test, y_zero))

# Predicting "no diabetes" for every patient is already correct for about 64% of the test set.
# This shows that the dataset is unbalanced: more patients do not have diabetes than do.

0.640625
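
The same baseline can also be obtained with scikit-learn's `DummyClassifier`. The sketch below uses made-up labels with the class counts reported for the test set (123 negatives, 69 positives); the names `X_demo`, `y_demo`, `baseline` and `baseline_acc` are assumptions. For brevity it fits and scores on the same labels, whereas in practice the baseline would be fit on the training split.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up labels with the test set's class counts: 123 negatives, 69 positives.
y_demo = np.array([0] * 123 + [1] * 69)
X_demo = np.zeros((len(y_demo), 1))  # features are ignored by the dummy model

# "most_frequent" always predicts the majority class, here class 0.
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
baseline_acc = accuracy_score(y_demo, baseline.predict(X_demo))

print(baseline_acc)
```

Any real model should beat this number by a clear margin before its accuracy is considered meaningful.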


### Precision and Recall


__Precision__
- What percent of your positive predictions were correct?
- Precision is the ability of a classifier not to label an instance positive that is actually negative. 
- For each class it is defined as the ratio of true positives to the sum of true and false positives.
- Precision: Accuracy of positive predictions.
- Precision = TP/(TP + FP)

__Recall__
- What percent of the positive cases did you catch?
- Recall is the ability of a classifier to find all positive instances.
- For each class it is defined as the ratio of true positives to the sum of true positives and false negatives.
- Recall: Fraction of positives that were correctly identified.
- Recall = TP/(TP+FN)

__f1-score__
- How well does the classifier balance precision and recall?
- The F1 score is the harmonic mean of precision and recall, such that the best score is 1.0 and the worst is 0.0. Generally speaking, F1 scores are lower than accuracy measures as they embed precision and recall into their computation. As a rule of thumb, the weighted average of F1 should be used to compare classifier models, not global accuracy.
- F1 Score = 2*(Recall * Precision) / (Recall + Precision)
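
The three formulas above can be checked by hand on a small example. The labels and predictions below are made up for illustration (the names `y_true` and `y_hat` are assumptions), and the results are compared against scikit-learn's own metric functions.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up labels and predictions for illustration.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_hat = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

tp = int(np.sum((y_true == 1) & (y_hat == 1)))  # true positives
fp = int(np.sum((y_true == 0) & (y_hat == 1)))  # false positives
fn = int(np.sum((y_true == 1) & (y_hat == 0)))  # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * (recall * precision) / (recall + precision)

print("TP, FP, FN:", tp, fp, fn)
print("precision, recall, f1:", precision, recall, f1)

# Cross-check against scikit-learn's implementations.
print(np.isclose(precision, precision_score(y_true, y_hat)))
print(np.isclose(recall, recall_score(y_true, y_hat)))
print(np.isclose(f1, f1_score(y_true, y_hat)))
```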


In [34]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.77      0.82      0.80       123
           1       0.64      0.57      0.60        69

    accuracy                           0.73       192
   macro avg       0.71      0.69      0.70       192
weighted avg       0.72      0.73      0.73       192



In [38]:
# Saving the model
import pickle

# Use a context manager so the file is closed after writing
with open('model.pkl', 'wb') as f:
    pickle.dump(clf, f)

In [39]:
# Load the model back, closing the file afterwards
with open('model.pkl', 'rb') as f:
    model = pickle.load(f)

In [40]:
# Saving the scaler
# We’ll also save the scaler in the present working directory
# so that we can scale user input data before passing it to our model and displaying the results on our website.

import joblib
joblib.dump(scaler,'model_scaler.pkl')

['model_scaler.pkl']
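
An alternative to saving the model and the scaler as two files is to bundle them in a scikit-learn `Pipeline` and save that single object. This is a sketch under assumptions, not what the notebook does: the data is synthetic, and the file name `svm_pipeline.joblib` and the names `pipe` and `restored` are hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; use the real, unscaled training split).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# One object that first scales the input and then classifies it.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf")).fit(X_demo, y_demo)

# Save to a temporary directory and load it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "svm_pipeline.joblib")  # hypothetical file name
    joblib.dump(pipe, path)
    restored = joblib.load(path)

# The restored pipeline should reproduce the original predictions.
same = bool((restored.predict(X_demo) == pipe.predict(X_demo)).all())
print(same)
```

With this design, raw user input can be passed straight to `restored.predict`, and there is no risk of loading a model with a mismatched scaler.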

In [41]:
loaded_model = pd.read_pickle(r'model.pkl')

In [42]:
loaded_model

SVC()