# **Support Vector Machine**

- It's very powerful model which is capable of:
    - Linear Classification
    - Non-linear Classification
    - Linear Regression
    - Non-linear Regression
    - Outlier Detection
- They are binary classifiers

---
---

# **SVM Classification**

The general idea behind a *SVM Classification* is to try and seperate the classes and fit a street between them.
- It tries to fit the widest street possible between the classes *(known as Large Margin Classifier)*
- Adding more training instances off the street will not affect the decision boundary at all, since it's fully determined by the support vectors

There are two types of *SVMs*:
- Hard Margin Classification
- Soft Margin Classification

## *Hard Margin Classification*
- It strictly imposes that all the instances must be off the street and on the right side of the street
- There are two main issues with this:
    - It only works if the data is linearly seperable
    - It's sensitive to any outlier

## *Soft Margin Classification*
- It doesn't have restrictions as *Hard Margin Classification*
- It tries to find a good balance between keeping the street as wide as possible and limiting any margin violations

I will implement SVC on both *linear* and *non-linear* data using three different methods:
- [`LinearSVC`](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)
- [`SVC` with `kernel`](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html)
- [`SGDClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html)

---

## **Linear SVM Classification**

Preparing the data

In [8]:
import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier

iris = datasets.load_iris()
X = iris['data'][:, (2, 3)]  # Petal lenght, Petal width
y = (iris["target"] == 2).astype(np.float64)  # Iris virginica

### Using `LinearSVC` :

In [6]:
linear_svc_clf = Pipeline([
    ('scaler', StandardScaler()),
    ('linear_svc', LinearSVC(C=1, loss='hinge'))
])
linear_svc_clf.fit(X, y)
print(f'Prediction using LinearSVC: {linear_svc_clf.predict([[5.5, 1.7]])}')

Prediction using LinearSVC: [1.]


### Using `SVC` with the `kernel`=*linear* :

In [9]:
kernel_svc_lin = Pipeline([
    ('scaler', StandardScaler()),
    ('svc', SVC(C=1, kernel='linear'))
])
kernel_svc_lin.fit(X, y)
print(f'Prediction using Kernel SVC: {kernel_svc_lin.predict([[5.5, 1.7]])}')

Prediction using Kernel SVC: [1.]


### Using `SGDClassifier` :

In [12]:
sgd_clf = Pipeline([
    ('scaler', StandardScaler()),
    ('sgd', SGDClassifier(loss='hinge'))
])
sgd_clf.fit(X, y)
print(f'Prediction using SGDClassifier: {sgd_clf.predict([[5.5, 1.7]])}')

Prediction using SGDClassifier: [1.]


- `LinearSVC` converges the fastest
- `SGDClassifier` can be useful when you need to classify something online or on huge datasets that do not fit in the memory
- `SVC` with kernel trick is more useful in a non-linear model

---

## **Nonlinear SVM Classification**

Preparing the data

In [13]:
from sklearn.datasets import make_moons
from sklearn.preprocessing import PolynomialFeatures

X, y = make_moons(n_samples=100, noise=0.15)

### Using `LinearSVC` :

In [18]:
polynomial_lin_svc = Pipeline([
    ('poly_features', PolynomialFeatures(degree=3)),
    ('scale', StandardScaler()),
    ('svm_clf', LinearSVC(C=10, loss='hinge', max_iter=1000))
])

polynomial_lin_svc.fit(X, y)
print(f'Prediction using LinearSVC: {polynomial_lin_svc.predict([[5.5, 1.7]])}')

Prediction using LinearSVC: [1]


### Using `SVC` with kernel :

#### Polynomial Kernel
- We can use `kernel`=*poly*

In [19]:
poly_kernel_svm_clf = Pipeline([
    ('scale', StandardScaler()),
    ('svm_clf', SVC(kernel='poly', degree=3, coef0=1, C=5))
])

#### Gausian RBF Kernel
- It adds similarity features to your dataset

In [20]:
rbf_kernel_svm_clf = Pipeline([
    ('scale', StandardScaler()),
    ('svm_clf', SVC(kernel='rbf', gamma=5, C=0.001))
])

---

## **Regularizing SVMs**

## *Applicable to all SVMs*
- `C` *regularization parameter*
    - Low value:
        - Results in a wider street
        - Generalizes better
        - More margin violations
    - High value:
        - Results in a narrower street
        - Generalizes poorly
        - Less margin violations

## *Non-Linear SVMs*

### Polynomial SVMs
- `degree` hyperparameter can be used for regularization
- `coef0` controls how much the model is influenced by high-degree polynomials versus low-degree polynomials

### Gausian RBF Kernel
- `gamma` hyperparamer:
    - Acts as a regularization parameter
    - If the model is underfitting then you should increase it's value
    - Increasing `gamma` hyperparameter 
        - Makes the bell-shaped curve narrower
        - Each instance's range of influence is smaller
        - Decision boundary ends up being more irregular.
    - Reducing `gamma` hyperparameter
        - Makes the bell-shaped curve wider
        - Instances have a larger range of influence
        - Decision boundary ends up smoother

---

# **Computational Complexity**

|Class|Time complexity|Out-of-core support|Scaling required|Kernel trick|
|---|---|---|---|---|
|`LinearSVC`|$O(m*n)$|No|Yes|No|
|`SGDClassifier`|$O(m*n)$|Yes|Yes|No|
|`SVC`|$O(m^2*n)$ to $(m^3*n)$|No|Yes|No|


---

# **Points to note about SVMs:**

- SVMs are sensitive to feature scales, so it's kind of mandatory to scale the features. We can use Scikit-Learn's `StandardScalar` to scale the features
- SVMs do not output probabilites for each class

---

## **General Terms:**

### Linearly Seperable:
- If the classes in a dataset can be seperated by a single straight line then they are called as *linearly seperable* classes

### Support Vectors:
- The instances which decide the boundary of a street for SVM

### Margin Violations:
- Any instance which are on the street or on the wrong side of the street

---
---

# **SVM Regression**

- Instead of fitting the instances off the street, *SVM Regressor* tries to fit as much instances within the street
- Adding more instances within the margin doesn't effect the model

---

## **Linear SVM Regression**

### Preparing the data

In [21]:
X = 2 * np.random.rand(50, 1)
y = (4 + 3 * X + np.random.randn(50, 1)).ravel()

### Implementing SVR Regression using `LinearSVR`

In [22]:
from sklearn.svm import LinearSVR

svm_reg = LinearSVR(epsilon=1.5)

---

## **Non-linear SVM Regression**

### Preparing the data

In [23]:
X = 2 * np.random.rand(100, 1) - 1
y = (0.2 + 0.1 * X + 0.5 * X**2 + np.random.randn(100, 1)/10).ravel()

### Implementing SVR Regression using SVR

In [24]:
from sklearn.svm import SVR

svm_poly_reg = SVR(kernel='poly', degree=2, C=100, epsilon=0.1)

---

## **Regularizing SVMs**

## *Applicable to all SVMs*
- `epsilon`
    - Controls the width of the street
    - Low value - narrower street
    - High value - wider street
- `C`
    - Low value - higher regularization
    - High value - less regularization


---
---

# **Support Vector Machine on MNIST Data**

## *Preparing the data*

In [2]:
import numpy as np
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1, cache=True)

X = mnist['data']
y = mnist['target'].astype(np.uint8)

X_train = X[:60000]
y_train = y[:60000]
X_test = X[60000:]
y_test = y[60000:]

The MNIST dataset is already shuffled and placed in a particular order for *train* and *test* sets

## *Linear SVM*
Let's start with the simplest of SVMs, *Linear SVM*. By default, it will use One-vs-All *(One-vs-Rest)*

### Initializing the model

In [3]:
from sklearn.svm import LinearSVC
lin_clf = LinearSVC(random_state=42)
lin_clf.fit(X_train, y_train)



LinearSVC(random_state=42)

### Evaluating the model

In [7]:
from sklearn.metrics import accuracy_score

y_pred = lin_clf.predict(X_train)
print(f'Accuracy score: {round(accuracy_score(y_train, y_pred)*100, 2)}%')

Accuracy score: 83.49%


Accuracy score of 83.49% for a simple linear model without any data preprocessing and any regularization is pretty good. But let's try to preprocess the data first and see the result.

### Preprocessing the data

In [8]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.astype(np.float32))
X_test_scaled = scaler.fit_transform(X_test.astype(np.float32))

Now let's apply the same model to the scaled data

In [9]:
lin_clf = LinearSVC(random_state=42)
lin_clf.fit(X_train_scaled, y_train)



LinearSVC(random_state=42)

Now let's look at the accuracy score for the scaled model

In [11]:
y_pred = lin_clf.predict(X_train_scaled)
print(f'Accuracy score for the scaled model: {round(accuracy_score(y_train, y_pred)*100, 2)}%')

Accuracy score for the scaled model: 92.17%


Ah! that's impressive. Just by scaling the data we were able to improve the accuracy score to 92.17%

## *Kernel SVM*

Instead of training on the complete set, let's first train on a sub-set, fine-tune the parameters and see the results

In [12]:
from sklearn.svm import SVC
svm_clf = SVC(kernel='rbf',
             gamma='scale')
svm_clf.fit(X_train_scaled[:10000], y_train[:10000])

SVC()

Checking the accuracy score

In [13]:
y_pred = svm_clf.predict(X_train_scaled)
print(f'Accuracy score of SVM: {round(accuracy_score(y_train, y_pred)*100, 2)}%')

Accuracy score of SVM: 94.55%


So it turns out that we get a better perfomance even though the SVM model is trained on 6 times less data.

Let's now tune the hyperparameters using a randomized search with cross validation. We will do this on a small dataset just to speed things up.

In [14]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import reciprocal, uniform

param_distribution = {'gamma': reciprocal(0.001, 0.1), 'C': uniform(1, 10)}
rnd_search_cv = RandomizedSearchCV(svm_clf, 
                                   param_distribution, 
                                   n_iter=10, 
                                   verbose=2, 
                                   cv=3)
rnd_search_cv.fit(X_train_scaled[:1000], y_train[:1000])

Fitting 3 folds for each of 10 candidates, totalling 30 fits
[CV] END ....C=8.019100659027712, gamma=0.017696273174877687; total time=   0.2s
[CV] END ....C=8.019100659027712, gamma=0.017696273174877687; total time=   0.1s
[CV] END ....C=8.019100659027712, gamma=0.017696273174877687; total time=   0.1s
[CV] END ....C=9.638323939420122, gamma=0.010263097329434766; total time=   0.1s
[CV] END ....C=9.638323939420122, gamma=0.010263097329434766; total time=   0.1s
[CV] END ....C=9.638323939420122, gamma=0.010263097329434766; total time=   0.1s
[CV] END .....C=9.802707110189608, gamma=0.06617091538557746; total time=   0.2s
[CV] END .....C=9.802707110189608, gamma=0.06617091538557746; total time=   0.2s
[CV] END .....C=9.802707110189608, gamma=0.06617091538557746; total time=   0.2s
[CV] END ..C=10.071907048103903, gamma=0.0033974504088832807; total time=   0.1s
[CV] END ..C=10.071907048103903, gamma=0.0033974504088832807; total time=   0.1s
[CV] END ..C=10.071907048103903, gamma=0.0033974

RandomizedSearchCV(cv=3, estimator=SVC(),
                   param_distributions={'C': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000001F131ADDA30>,
                                        'gamma': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000001F130EC8C40>},
                   verbose=2)

Let's look at the best estimator and best estimator

In [16]:
print(f'Best Estimator: {rnd_search_cv.best_estimator_}')
print(f'Best Score: {rnd_search_cv.best_score_}')

Best Estimator: SVC(C=10.071907048103903, gamma=0.0033974504088832807)
Best Score: 0.8009866153578727


Huh! this looks really low. Let's try to train the model on the entire dataset instead of the subset

In [17]:
rnd_search_cv.best_estimator_.fit(X_train_scaled, y_train)

SVC(C=10.071907048103903, gamma=0.0033974504088832807)

Let's look at the accuracy score for this model:

In [18]:
y_pred = rnd_search_cv.best_estimator_.predict(X_train_scaled)
print(f'Accuracy score for RBF SVM: {round(accuracy_score(y_train, y_pred)*100, 2)}%')

Accuracy score for RBF SVM: 100.0%


Now that looks promising! Let's try to evaluate the model on the test set and see the accuracy score.

In [19]:
y_pred = rnd_search_cv.best_estimator_.predict(X_test_scaled)
print(f'Accuracy score for RBF SVM on test data: {round(accuracy_score(y_test, y_pred)*100, 2)}%')

Accuracy score for RBF SVM on test data: 96.04%
