# Support Vector Machines (SVM)

A Support Vector Machine (SVM) is, in its basic form, a binary linear classifier whose decision boundary is explicitly constructed to minimize generalization error. It is a powerful and versatile machine learning model, capable of performing linear or nonlinear classification, regression and even outlier detection.

SVMs are well suited to classifying complex but small or medium-sized datasets.

**In other words...**

The objective of the support vector machine algorithm is to find a hyperplane in an N-dimensional space (where N is the number of features) that distinctly classifies the data points.

## What are Support Vectors?

![Support vectors and the margin](https://www.dtreg.com/uploaded/pageimg/SvmMargin2.jpg)
 
Support vectors are the data points nearest to the hyperplane: the points of a data set that, if removed, would alter the position of the dividing hyperplane. Because of this, they can be considered the critical elements of a data set; they are what the SVM is built from.

## What's a hyperplane?

![Hyperplanes as decision surfaces](http://slideplayer.com/slide/1579281/5/images/32/Hyperplanes+as+decision+surfaces.jpg)

Geometry tells us that a hyperplane is a subspace of one dimension less than its ambient space. For instance, a hyperplane of an n-dimensional space is a flat subset with dimension n − 1. By its nature, it separates the space into two half spaces.
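
To make the "two half spaces" idea concrete, here is a minimal sketch in 2-D, where the hyperplane is just a line. The weights `w`, the offset `b` and the three points are made-up values for illustration only. The sign of `w . x + b` tells us on which side of the hyperplane a point lies.

```python
import numpy as np

# Hypothetical hyperplane in 2-D: w . x + b = 0 (here the line x1 + x2 = 3).
w = np.array([1.0, 1.0])
b = -3.0

points = np.array([
    [0.0, 0.0],   # on one side of the line
    [1.0, 2.0],   # exactly on the line
    [3.0, 3.0],   # on the other side of the line
])

# Signed distance of each point to the hyperplane.
signed_distance = (points @ w + b) / np.linalg.norm(w)

# The sign tells us which half space the point falls in (0 = on the hyperplane).
side = np.sign(signed_distance)

print(signed_distance)
print(side)
```

A linear classifier does exactly this: it assigns one class to the points with a positive sign and the other class to the points with a negative sign.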

### Separate two classes

To separate the two classes of data points, there are many possible hyperplanes that could be chosen. Our objective is to find the hyperplane that has the maximum margin, i.e. the maximum distance to the nearest data points of both classes. Maximizing the margin gives the classifier some slack, so that future data points can be classified with more confidence.

![alt text](https://cdn-images-1.medium.com/max/540/0*9jEWNXTAao7phK-5.png) ![alt text](https://cdn-images-1.medium.com/max/540/0*0o8xIA4k3gXUDCFU.png)

![alt text](https://miro.medium.com/v2/resize:fit:1100/1*XE9jt0r1yAW8LnliQ3mllQ.png)

# How does SVM classify?

It's important to start with the intuition for SVM with the **special linearly separable** classification case.

If classification of observations is **"linearly separable"**, SVM fits the **"decision boundary"** that is defined by the largest margin between the closest points for each class. This is commonly called the **"maximum margin hyperplane (MMH)"**.
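
As a rough illustration, we can fit a linear SVM on a tiny, made-up, linearly separable dataset and inspect the result. The data (`X_toy`, `y_toy`) and the very large `C` (used here to approximate a hard margin) are assumptions for this sketch. For a linear SVM with weights `w`, the width of the margin is `2 / ||w||`.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data (made up for illustration).
X_toy = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
                  [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# A very large C leaves almost no room for margin violations,
# which approximates the hard-margin (maximum margin) SVM.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X_toy, y_toy)

w = clf.coef_[0]        # normal vector of the separating hyperplane
b = clf.intercept_[0]   # offset of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)

print("support vectors:\n", clf.support_vectors_)
print("margin width:", margin_width)
```

Only the points closest to the boundary show up in `support_vectors_`; the other points could be moved (without crossing the margin) and the hyperplane would stay the same.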

![linearly separable SVM](https://raw.githubusercontent.com/nalamidi/Breast-Cancer-Classification-with-Support-Vector-Machine/master/linear_separability_vs_not.png)

## Example: Classify Tumors

   Classify tumors into malignant (cancer) or benign using features obtained from several cell images.
   
   Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.
   
   
**Attribute Information:**
   
1.  ID number 
2.  Diagnosis (M = malignant, B = benign) 

**Ten real-valued features are computed for each cell nucleus:**

1. Radius (mean of distances from center to points on the perimeter) 
2. Texture (standard deviation of gray-scale values) 
3. Perimeter 
4. Area 
5. Smoothness (local variation in radius lengths) 
6. Compactness (perimeter^2 / area - 1.0) 
7. Concavity (severity of concave portions of the contour) 
8. Concave points (number of concave portions of the contour) 
9. Symmetry 
10. Fractal dimension ("coastline approximation" - 1)




# Import needed Python Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

#Import Cancer data from the Sklearn library
# Dataset can also be found here (http://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+%28diagnostic%29)

from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import MinMaxScaler, StandardScaler
cancer = load_breast_cancer()
cancer

# Let's view the data in a dataframe.

In [None]:
df_cancer = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df_cancer['target'] = pd.Series(cancer.target)
df_cancer.head()

# Let's Explore Our Dataset

In [None]:
df_cancer.shape

As we can see, we have 569 rows (instances) and 31 columns (30 features plus the target).

In [None]:
df_cancer.columns

Above are the names of the columns in our dataframe.

# The next step is to Visualize our data

In [None]:
# Let's plot out just the first 4 variables (features)
sns.pairplot(df_cancer, vars = ['mean radius', 'mean texture', 'mean perimeter', 'mean area'])

The above plots show the relationship between our features. The only problem is that they do not show us which of the "dots" are malignant and which are benign.

This issue will be addressed below by using "target" variable as the "hue" for the plots.

In [None]:
# Same 4 variables (features), now colored by the "target" class
sns.pairplot(df_cancer, hue = 'target', vars = ['mean radius', 'mean texture', 'mean perimeter','mean area'] )

**Note:** 
    
  1.0 (Orange) = Benign (No Cancer)
  
  0.0 (Blue) = Malignant (Cancer)

# How many Benign and Malignant do we have in our dataset?

In [None]:
df_cancer['target'].value_counts()

As we can see, we have 212 malignant and 357 benign samples.

Let's visualize our counts

In [None]:
# One bar per class, with the number of samples on the y axis
sns.countplot(x='target', data=df_cancer)

# Let's check the correlation between our features 

In [None]:
plt.figure(figsize=(20,12)) 
sns.heatmap(df_cancer.corr(), annot=True) 

There is a strong correlation between mean radius and mean perimeter, and between mean area and mean perimeter.

# Model Training

**From our dataset, let's create the target and predictor matrix**

- "y" = the variable we are trying to predict (the output). In this case we are trying to predict whether a tumor is malignant (cancer) or benign (not cancer), i.e. we are going to use the "target" column here.
- "X" = the predictors, which are the remaining columns (mean radius, mean texture, mean perimeter, mean area, mean smoothness, etc.)

In [None]:
# Drop the "target" column and use all the remaining (30) feature columns to train the model.
X = df_cancer.drop(columns=['target'])

In [None]:
y = df_cancer['target']

# Create the training and testing data

Now that we've assigned values to our "X" and "y", the next step is to import the python library that will help us to split our dataset into training and testing data.

- Training data = the subset of our data used to train our model.
- Testing data = the subset of our data that the model hasn't seen before. It is used to test the performance of our model.

In [None]:
from sklearn.model_selection import train_test_split

Let's split our data using 80% for training and the remaining 20% for testing.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 100)
y_test.value_counts()

One final consideration is for **classification problems only**.

Some classification problems do not have a balanced number of examples for each class label. As such, it is desirable to split the dataset into train and test sets in a way that preserves the same proportions of examples in each class as observed in the original dataset.

This is called a stratified **train-test split**.

We can achieve this by setting the “**stratify**” argument to the "y" component (**target variable**) of the original dataset. This will be used by the train_test_split() function to ensure that both the train and test sets have the proportion of examples in each class that is present in the provided “y” array.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 10, stratify=y)
y_test.value_counts()

Let's now check the size of our training and testing data.

In [None]:
print('The size of our training "X" (input features - X_train) is', X_train.shape)
print('The size of our testing "X" (input features - X_test) is', X_test.shape)
print('The size of our training "y" (output feature - y_train) is', y_train.shape)
print('The size of our testing "y" (output feature - y_test) is', y_test.shape)

# Import Support Vector Machine (SVM) Model 

In [None]:
# SVMs can also be used for regression - Support Vector Regressor
# from sklearn.svm import SVR

# Support Vector Classifier
from sklearn.svm import SVC

# Try other kernels ("rbf" - Radial Basis Function, "linear" - Linear and "poly" - Polynomial)
# C = Regularization
svc_algo = SVC(gamma="auto", kernel="rbf", C=1)

## Kernels

In practice, the SVM algorithm is implemented using a kernel. A kernel is a function that measures the similarity between two observations and lets the SVM work with the data in a transformed form. SVM uses a technique called the kernel trick: the kernel behaves as if the low-dimensional input space had been mapped into a higher-dimensional space, without ever computing that mapping explicitly. In other words, it can turn a problem that is not linearly separable into a separable one by adding more dimensions. This is most useful for non-linear separation problems, where the kernel trick often helps you build a more accurate classifier.

The most commonly used kernels are:

- **Linear Kernel** A linear kernel is the ordinary dot product of any two given observations, i.e. the sum of the products of each pair of input values.


- **Polynomial Kernel** A polynomial kernel is a more generalized form of the linear kernel. It can separate curved or otherwise nonlinear input spaces.


- **Radial Basis Function Kernel** The radial basis function (RBF) kernel is a popular kernel commonly used in support vector machine classification. It implicitly maps the input space into an infinite-dimensional space.
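
The three kernels above can be sketched as plain functions of two observations. The helper names (`linear_kernel_fn`, `polynomial_kernel_fn`, `rbf_kernel_fn`) and the vectors `u` and `v` are made up for this example. The parameter names `degree`, `gamma` and `coef0` mirror the ones used by scikit-learn's `SVC`, but the default values chosen here are only illustrative.

```python
import numpy as np

def linear_kernel_fn(a, b):
    # Ordinary dot product of the two observations.
    return float(np.dot(a, b))

def polynomial_kernel_fn(a, b, degree=3, gamma=1.0, coef0=1.0):
    # (gamma * <a, b> + coef0) ** degree
    return float((gamma * np.dot(a, b) + coef0) ** degree)

def rbf_kernel_fn(a, b, gamma=0.5):
    # exp(-gamma * ||a - b||^2): close to 1 for nearby points, close to 0 for distant ones.
    return float(np.exp(-gamma * np.sum((a - b) ** 2)))

u = np.array([1.0, 2.0])
v = np.array([3.0, 4.0])

print(linear_kernel_fn(u, v))                # 1*3 + 2*4 = 11.0
print(polynomial_kernel_fn(u, v, degree=2))  # (11 + 1)^2 = 144.0
print(rbf_kernel_fn(u, v))                   # exp(-0.5 * 8) = exp(-4)
```

In all three cases the kernel returns a single number for a pair of observations; the SVM never needs the coordinates of the points in the higher-dimensional space.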

<img src="https://qph.fs.quoracdn.net/main-qimg-866b6450fc8c20dbf5dbd6a404cfe58a">

## Tuning Hyperparameters

- **Kernel:** The kernel determines the shape of the decision boundary by transforming the input data into the required form. Common choices are linear, polynomial, and radial basis function (RBF). Polynomial and RBF kernels are useful when the classes cannot be separated by a linear hyperplane in the original space: they compute the separating hyperplane in a higher-dimensional space. For classes whose boundary is curved or nonlinear, a more complex kernel can lead to a more accurate classifier.


- **Regularization:** In scikit-learn, regularization is controlled by the C parameter. C is the penalty parameter of the error term: it tells the SVM optimization how much misclassification is tolerable, and so controls the trade-off between a wide margin and few misclassified training points. A smaller value of C tolerates more errors and creates a larger-margin hyperplane, while a larger value of C penalizes errors more heavily and creates a smaller-margin hyperplane.


- **Gamma**: A lower value of gamma will loosely fit the training dataset, whereas a higher value of gamma will fit the training dataset very closely, which can cause over-fitting. In other words, with a high value of gamma only nearby points influence the separation line, while with a low value of gamma far-away points are also taken into account.

<img src="https://amueller.github.io/COMS4995-s18/slides/aml-09-021418-support-vector-machines/images/img_4.png">
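
To get a feeling for gamma, here is a small experiment on a synthetic, noisy dataset (`make_moons`, not the tumor data). The dataset, the split and the three gamma values are arbitrary choices for this sketch. We compare the accuracy on the training data with the accuracy on unseen data.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic, noisy two-class data (not the tumor dataset).
X_demo, y_demo = make_moons(n_samples=400, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo)

results = {}
for gamma in [0.01, 1, 1000]:
    clf = SVC(kernel="rbf", C=1, gamma=gamma).fit(X_tr, y_tr)
    # Store (training accuracy, testing accuracy) for each gamma.
    results[gamma] = (clf.score(X_tr, y_tr), clf.score(X_te, y_te))
    print(f"gamma={gamma:<6} train acc={results[gamma][0]:.2f} "
          f"test acc={results[gamma][1]:.2f} "
          f"support vectors={clf.n_support_.sum()}")
```

A large gap between training and testing accuracy is the typical sign of over-fitting. The same kind of loop can be used to try different values of C.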

# Now, let's train our SVM model with our "training" dataset.

In [None]:
# svc_algo.fit trains our model on the training subset (X_train, y_train)
model = svc_algo.fit(X_train, y_train)

# Let's use our trained model to make a prediction using our testing data

In [None]:
# predictions made by our model (using X_test and storing the predictions in a variable - y_predict)
y_predict = model.predict(X_test)
y_predict

**Next step is to check the accuracy of our prediction by comparing it to the output we already have (y_test). We are going to use confusion matrix for this comparison**

In [None]:
# Import metric libraries
from sklearn.metrics import classification_report, confusion_matrix

In [None]:
cm = confusion_matrix(y_test, y_predict, labels=[0,1])
confusion = pd.DataFrame(cm, index=['is_cancer', 'is_not_cancer'],
                         columns=['predicted_cancer','predicted_not_cancer'])
confusion

In [None]:
sns.heatmap(confusion, annot=True)

In [None]:
print(classification_report(y_test, y_predict))
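
The numbers in the classification report can be derived by hand from a confusion matrix. Here is a minimal sketch with a made-up matrix (`cm_demo`; the counts are not the results of our model), laid out like ours: rows are the actual classes, columns are the predicted classes, in the order malignant (0), benign (1). We treat malignant as the "positive" class.

```python
import numpy as np

# Hypothetical confusion matrix (made-up counts, for illustration only).
# Rows = actual class, columns = predicted class, order = [malignant, benign].
cm_demo = np.array([[40, 2],
                    [3, 69]])

tp, fn = cm_demo[0]   # actual malignant: correctly found / missed
fp, tn = cm_demo[1]   # actual benign: wrongly flagged / correctly cleared

accuracy = (tp + tn) / cm_demo.sum()   # share of all predictions that are correct
precision = tp / (tp + fp)             # of the predicted malignant, how many really are
recall = tp / (tp + fn)                # of the actual malignant, how many we found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

For a cancer classifier, recall on the malignant class is usually the number to watch, because a missed malignant tumor (false negative) is the most costly mistake.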

**As we can see, our model did not do a very good job in its predictions.**

**One way of improving the model is by changing kernels and its hyperparameters.** 

But let's explore other ways to improve the performance of our model.

# Improving our Model

The first thing we will try is normalizing our data.

Data normalization is a feature scaling process that brings all values into a fixed range, such as [0,1] or [-1,1], depending on the scaler used.

**For the scaling between [0,1], it applies the following formula**:

<img src="https://miro.medium.com/max/682/0*8btMQlMD6O50pUDP">
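
Min-max scaling to [0,1] computes `x_scaled = (x - min) / (max - min)` for each feature. Here is a minimal sketch of that formula with NumPy; the small `train` and `test` arrays are made up for illustration. Note that the minimum and maximum come from the training data only and are then reused for the test data.

```python
import numpy as np

# Made-up data: 2 features on very different scales.
train = np.array([[1.0, 200.0],
                  [3.0, 400.0],
                  [5.0, 600.0]])
test = np.array([[2.0, 700.0]])

# Learn the minimum and maximum of each feature from the training data only.
col_min = train.min(axis=0)
col_max = train.max(axis=0)

# Apply the same formula, with the same min and max, to both sets.
train_scaled = (train - col_min) / (col_max - col_min)
test_scaled = (test - col_min) / (col_max - col_min)

print(train_scaled)
print(test_scaled)   # values outside [0, 1] are possible for unseen data
```

The second test value is larger than anything seen in training, so it ends up above 1. That is expected and not a problem.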

Let's see what happens when we do and don't normalize data:

<img src="https://scikit-learn.org/0.18/_images/sphx_glr_plot_robust_scaling_001.png">

To normalize, we use a class from scikit-learn called **MinMaxScaler**.

# Normalize Training Data

In [None]:
# create the scaler (MinMaxScaler)
scaler = MinMaxScaler()
#scaler = StandardScaler()

# Fit and transform (fit learns the min and max of each feature, transform returns the scaled values)
X_train_scaled = scaler.fit_transform(X_train)

# build a new training dataset, this time scaled
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns)
X_train_scaled.head()

# Normalize Testing Data

In [None]:
# Reuse the scaler fitted on the training data: fitting a second scaler on the
# test set would scale it differently from the data the model was trained on.
X_test_scaled = scaler.transform(X_test)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns)
X_test_scaled.head()
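
An alternative that keeps scaling and model together is a scikit-learn pipeline. The sketch below is self-contained (it reloads the data and uses its own variable names such as `X_tr` and `pipe`), and the SVC settings are just an example, not tuned values. The pipeline fits the scaler on the training data only and applies the same scaling automatically when predicting.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X_all, y_all = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.2, random_state=10, stratify=y_all)

# Step 1 scales the features, step 2 trains the classifier on the scaled features.
pipe = make_pipeline(MinMaxScaler(), SVC(kernel="rbf", C=1, gamma="scale"))
pipe.fit(X_tr, y_tr)

# score() scales X_te with the min/max learned from X_tr before predicting.
test_accuracy = pipe.score(X_te, y_te)
print(test_accuracy)
```

With a pipeline there is only one object to fit and to call `predict` on, which makes it harder to scale the training and testing data inconsistently by accident.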

# SVM with Normalized data

In [None]:
svc_algo_2 = SVC(gamma="auto", kernel="sigmoid", C=9)
model = svc_algo_2.fit(X_train_scaled, y_train)

In [None]:
y_predict_2 = model.predict(X_test_scaled)

In [None]:
cm = confusion_matrix(y_test, y_predict_2, labels=[0,1])
confusion = pd.DataFrame(cm, index=['is_cancer', 'is_not_cancer'],
                         columns=['predicted_cancer','predicted_not_cancer'])
confusion

In [None]:
sns.heatmap(confusion, annot=True)

In [None]:
print(classification_report(y_test,y_predict_2))

**Awesome performance! We only have 1 false prediction.**

## Visualize using 2 features

In [None]:
# adapted from: http://bit.ly/2iv7FFL

def make_meshgrid(x, y, h=.02):
    x_min, x_max = x.min() - 1, x.max() + 1
    y_min, y_max = y.min() - 1, y.max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    return xx, yy

def plot_contours(ax, clf, xx, yy, **params):
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    out = ax.contourf(xx, yy, Z, **params)
    return out

def plot_model(model, x1, x2, title):

    X0, X1 = X.iloc[:, 0], X.iloc[:, 1]
    xx, yy = make_meshgrid(X0, X1)
    
    plt.figure()    
    plot_contours(plt, model, xx, yy, alpha=0.75)
    plt.scatter(X0, X1, c=y, s=15, alpha=0.95, edgecolors='#333333', linewidths=0.3) 
    plt.xlabel(x1)
    plt.ylabel(x2)
    plt.title(title)
    plt.show()

In [None]:
X = X.iloc[:,:2]

X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=0)

svclf = SVC(C=9, kernel="rbf", gamma='auto')
svclf.fit(X_train, y_train)

y_predict = svclf.predict(X_test)
print(classification_report(y_test,y_predict))

In [None]:
x1 = 'Mean Radius'
x2 = 'Mean Texture'
title = 'SVM with RBF Kernel'

plot_model(svclf, x1, x2, title)

In [None]:
# predict for different values of mean radius and mean texture
X_test_2 = [
    [50.10, 10.52], 
    [50.50, 10.1]
]
svclf.predict(X_test_2)

### Can we do better by normalizing data?

### Normalize the new training set

In [None]:
scaler_3 = MinMaxScaler()
X_train_scaled_2 = scaler_3.fit_transform(X_train)
X_train_scaled_2 = pd.DataFrame(X_train_scaled_2, columns=X_train.columns)
X_train_scaled_2.head()

### And the new test set

In [None]:
# Reuse scaler_3 (fitted on the training set) so both sets share the same scaling.
X_test_scaled_2 = scaler_3.transform(X_test)
X_test_scaled_2 = pd.DataFrame(X_test_scaled_2, columns=X_test.columns)
X_test_scaled_2.head()

In [None]:
svclf2 = SVC(kernel="rbf", gamma='auto')
svclf2.fit(X_train_scaled_2, y_train)

y_predict_2 = svclf2.predict(X_test_scaled_2)
print(classification_report(y_test,y_predict_2))

**Yes! We did a little bit better. Normalizing data works!**

## Pros and Cons associated with SVM

### Pros:
- It works really well when there is a clear margin of separation.
- It is effective in high dimensional spaces.
- It is effective in cases where number of dimensions is greater than the number of samples.
- It uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.

### Cons:
- It doesn’t perform well on large data sets, because the required training time is high.
- It also doesn’t perform very well when the data set is noisy, i.e. when the target classes overlap.
- SVM doesn’t directly provide probability estimates; in scikit-learn's SVC they are calculated using an expensive five-fold cross-validation.
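
As a rough sketch of that last point: in scikit-learn, probability estimates have to be requested explicitly with `probability=True`, which makes training slower. The example below is self-contained and uses default SVC settings on the unscaled data, so treat it as an illustration of the API rather than a tuned model.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.svm import SVC

X_all, y_all = load_breast_cancer(return_X_y=True)

# probability=True enables predict_proba, at the cost of extra training time.
clf_proba = SVC(probability=True, random_state=0)
clf_proba.fit(X_all, y_all)

# One row per sample, one column per class (0 = malignant, 1 = benign).
proba = clf_proba.predict_proba(X_all[:3])
print(proba)

# Without probabilities, decision_function gives the signed distance to the boundary.
print(clf_proba.decision_function(X_all[:3]))
```

If you only need a ranking or a confidence score, `decision_function` is available without the extra cost.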