# Support Vector Machines

## Introduction <br>
*  In this practical, we will learn to train linear and nonlinear SVM classifiers, experiment with different settings of hyperparameters and look at how they affect the decision boundaries, and evaluate the performance of the classifiers with commonly used error metrics. <br>
*  We will use Scikit-Learn's `Pipeline` class to demonstrate simple machine learning workflow management in this practical.

In [1]:
import numpy as np
import pandas as pd

%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

## 1. Linear and nonlinear SVM classifiers with wine quality dataset <br>
*  We will train a `LinearSVC` classifier, experiment with different hyperparameter settings, and evaluate the performance of the classifier using commonly used performance metrics <br>
*  We will try to improve the performance by scaling the data (how does it help?) <br>
*  We will also learn one technique of "feature selection" (i.e. we do not train the model with all the features, but a subset of more relevant features), in a bid to improve the performance (counter-intuitive?) <br>
*  We will repeat the above process with a nonlinear `SVC` classifier with the RBF kernel <br>
*  The dataset that we'll be using is the wine quality dataset. It contains various chemical properties of each wine, such as acidity, sugar, pH, and alcohol, along with a quality score (3-9, higher is better) and a color (red or white) <br> <br>
(Some parts of the code in this section are adapted from Reference [1])

We'll begin by importing the data

In [2]:
data = pd.read_csv('Wine_Quality_Data.csv')

Let's explore the dataset to know more about it (e.g. the size of the dataset, what are the features and their data types and ranges, are there any missing values, etc.)

In [None]:
data.shape

In [None]:
data.head()

In [None]:
data.describe()

In [None]:
data.info()

In [None]:
data["color"].values

We will be using the features to predict the `color` of the wine (red or white)

But first, we need to code the target feature `color` as numeric data (1 = `red`, 0 = `white`) <br>
Why do we need to do that?

In [8]:
# np.int was removed in NumPy 1.24, so the built-in int is used instead
data["color"] = data["color"].map({"white": 0, "red": 1}).astype(int)

In [None]:
data.info()

In [None]:
data.describe()

We notice that there are a lot more white wines than red wines in the dataset (from the mean value of the `color` feature) <br> Let's confirm it

In [None]:
data["color"].value_counts()

In [None]:
data["color"].value_counts(normalize=True)

Like the datasets we previously experimented with, this dataset is also imbalanced

Separate the dataset into the features (`X`) and the target (`y`)

In [13]:
X = data.drop("color", axis=1)
y = data["color"]

In [None]:
X.head()

In [None]:
y.head()

In [None]:
y.tail()

Split the dataset into a training set and a test set <br>
This time, let's experiment with Scikit-Learn's `train_test_split()` function (note that `train_test_split()` also has a `stratify` parameter, which we do not use here) <br>
We want to keep 20% of the dataset to be used as the test set

In [17]:
from sklearn.model_selection import train_test_split

In [18]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

In [None]:
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

Note that the training set (also the test set) is shuffled

In [None]:
X_train.head(10)

Check whether the two classes of wine are proportionately distributed between the training and test sets

In [None]:
print(y_train.value_counts())
print(y_test.value_counts())

In [None]:
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
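The split above relies on random shuffling alone, so the class proportions in the two sets match only approximately. A minimal sketch of the `stratify` parameter follows. It uses synthetic labels (`X_demo`, `y_demo` are made-up stand-ins, not the wine data) with a 75/25 class imbalance as an assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 750 samples of class 0, 250 of class 1
rng = np.random.RandomState(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([0] * 750 + [1] * 250)

# stratify=y_demo asks the splitter to preserve the class ratio in both sets
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo
)

# Fraction of class 1 in each set
train_ratio = ys_train.mean()
test_ratio = ys_test.mean()
print(train_ratio, test_ratio)
```

With stratification, both ratios should sit very close to the 25% share of class 1 in the full label vector.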

Train a `LinearSVC` classifier on the dataset using the default parameters

In [36]:
from sklearn.svm import LinearSVC

In [37]:
LSVC = LinearSVC()

In [None]:
LSVC.fit(X_train, y_train)

How does the `LinearSVC` classifier perform?

In [39]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def measure_error(y_true, y_pred, label):
    return pd.Series({'accuracy':accuracy_score(y_true, y_pred),
                      'precision': precision_score(y_true, y_pred),
                      'recall': recall_score(y_true, y_pred),
                      'f1': f1_score(y_true, y_pred)},
                      name=label)

The prediction accuracy on the training and test sets

In [None]:
LSVC_y_train_pred = LSVC.predict(X_train)
LSVC_y_test_pred = LSVC.predict(X_test)

train_test_full_error = pd.concat([measure_error(y_train, LSVC_y_train_pred, 'train'),
                                   measure_error(y_test, LSVC_y_test_pred, 'test')],
                                   axis=1)

train_test_full_error

Look at the confusion matrix, precision and recall scores

In [41]:
from sklearn.metrics import classification_report, confusion_matrix

In [None]:
print(confusion_matrix(y_test, LSVC_y_test_pred))

In [None]:
print(classification_report(y_test, LSVC_y_test_pred))
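To connect the confusion matrix with the precision and recall scores, here is a small sketch on hand-made labels (`y_true_demo` and `y_pred_demo` are invented for illustration). It relies on Scikit-Learn's convention that rows are true classes and columns are predicted classes.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Invented labels for illustration only (1 = positive class)
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred_demo = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision_manual = tp / (tp + fp)  # of the predicted positives, how many are right
recall_manual = tp / (tp + fn)     # of the actual positives, how many are found

print(precision_manual, precision_score(y_true_demo, y_pred_demo))
print(recall_manual, recall_score(y_true_demo, y_pred_demo))
```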

Not too bad <br>
Can we improve by scaling the data?

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [None]:
X_train_scaled.mean(axis=0)

In [None]:
X_train_scaled.std(axis=0)

In [None]:
X_train_scaled.max(axis=0)

In [None]:
X_train_scaled.min(axis=0)

In [None]:
LSVC.fit(X_train_scaled, y_train)

In [None]:
LSVC_y_train_pred = LSVC.predict(X_train_scaled)
LSVC_y_test_pred = LSVC.predict(X_test_scaled)

train_test_full_error = pd.concat([measure_error(y_train, LSVC_y_train_pred, 'train'),
                              measure_error(y_test, LSVC_y_test_pred, 'test')],
                              axis=1)

train_test_full_error

In [None]:
print(confusion_matrix(y_test, LSVC_y_test_pred))

In [None]:
print(classification_report(y_test, LSVC_y_test_pred))

Some improvement. Can we improve further?

Experiment with different settings of hyperparameters for `LinearSVC` classifier (e.g. `C=10`, `C=100`, `C=500`, etc.)
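One possible way to run such an experiment is a simple loop over candidate values of `C`. The sketch below uses a synthetic dataset from `make_classification` as a stand-in for the wine data (an assumption), and the names `X_demo`, `y_demo` and `c_scores` are illustrative only. `max_iter` is raised because larger values of `C` tend to need more iterations to converge.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic, imbalanced stand-in data (assumption, not the wine dataset)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=4,
    weights=[0.75, 0.25], random_state=42
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo
)

c_scores = {}
for C in (0.01, 1, 100):
    # Scaling and the classifier are chained so both are refit for each C
    clf = Pipeline([
        ("scaler", StandardScaler()),
        ("svm_clf", LinearSVC(C=C, max_iter=10000, random_state=42)),
    ])
    clf.fit(Xd_train, yd_train)
    c_scores[C] = f1_score(yd_test, clf.predict(Xd_test))

for C, score in c_scores.items():
    print(f"C={C}: test F1={score:.3f}")
```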

Let's explore using only the most relevant features

First, get some idea of how data instances distribute in the feature space (in a pairwise manner), by creating a pairplot for the dataset

In [82]:
import seaborn as sns

sns.set_context('talk')
sns.set_palette('dark')
sns.set_style('white')

In [None]:
sns.pairplot(data, hue='color')

How does each feature correlate with the wine color?

In [None]:
correlations = X_train.corrwith(y_train)  
correlations.sort_values(inplace=True)
correlations

Create a bar plot showing the correlations between each feature and the target class `color`

In [None]:
ax = correlations.plot(kind='bar')
ax.set(ylim=[-1, 1], ylabel='pearson correlation');

Try training the model without the three features that are least correlated with the target attribute (wine color), i.e. those whose correlations are closest to zero

In [60]:
# Boolean mask over the columns: True for the nine features we keep
keep = [c not in ('citric_acid', 'quality', 'alcohol') for c in X.columns]
X_train_9_scaled = X_train_scaled[:, keep]
X_test_9_scaled = X_test_scaled[:, keep]

In [None]:
X_train_9_scaled.shape

In [None]:
X_test_9_scaled.shape
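The feature names above are typed in by hand after reading the correlation plot. As a sketch of how the same selection could be derived programmatically, the example below ranks features by absolute correlation with the target and drops the weakest ones. The data is synthetic, and all column names (`strong`, `medium`, `noise_a`, `noise_b`) as well as `n_drop` are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (assumption): two informative and two noise features
rng = np.random.RandomState(0)
y_demo = pd.Series(rng.randint(0, 2, size=200), name="target")
X_demo = pd.DataFrame({
    "strong": y_demo + rng.normal(scale=0.3, size=200),
    "medium": y_demo + rng.normal(scale=1.0, size=200),
    "noise_a": rng.normal(size=200),
    "noise_b": rng.normal(size=200),
})

# Rank by absolute value: a strongly negative correlation is still informative
abs_corr = X_demo.corrwith(y_demo).abs().sort_values()

n_drop = 2  # hypothetical choice of how many features to discard
dropped = list(abs_corr.index[:n_drop])

# Boolean column mask, usable on a DataFrame or on a scaled NumPy array
keep_mask = ~X_demo.columns.isin(dropped)
X_demo_selected = X_demo.loc[:, keep_mask]

print("dropped:", dropped)
print("kept:", list(X_demo_selected.columns))
```

Because `keep_mask` is a plain boolean array aligned with the column order, it can also index a scaled NumPy array with `array[:, keep_mask]`.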

In [None]:
LSVC.fit(X_train_9_scaled, y_train)

In [None]:
LSVC_y_train_pred = LSVC.predict(X_train_9_scaled)
LSVC_y_test_pred = LSVC.predict(X_test_9_scaled)

train_test_full_error = pd.concat([measure_error(y_train, LSVC_y_train_pred, 'train'),
                                   measure_error(y_test, LSVC_y_test_pred, 'test')],
                                   axis=1)

train_test_full_error

In [None]:
print(confusion_matrix(y_test, LSVC_y_test_pred))

In [None]:
print(classification_report(y_test, LSVC_y_test_pred))

How do the results compare with those obtained from the model that is trained with the full set of features?

Try to train the model with different values of hyperparameter `C` and see if we can further improve classification performance 

Now, let's train an `SVC` with a Gaussian (RBF) kernel on the scaled dataset <br>
We want to experiment with different settings of `gamma` and `C`

In [76]:
from sklearn.svm import SVC

In [None]:
gamma1, gamma2 = 0.1, 50
C1, C2 = 0.1, 10000
hyperparams = (gamma1, C1), (gamma1, C2), (gamma2, C1), (gamma2, C2)

for gamma, C in hyperparams:
    GSVC = SVC(kernel="rbf", gamma=gamma, C=C) 
    GSVC.fit(X_train_scaled, y_train)
    GSVC_y_train_pred = GSVC.predict(X_train_scaled)
    GSVC_y_test_pred = GSVC.predict(X_test_scaled)
    train_test_full_error = pd.concat([measure_error(y_train, GSVC_y_train_pred, 'train'),
                                       measure_error(y_test, GSVC_y_test_pred, 'test')],
                                       axis=1)
    print("gamma =", gamma)
    print("C =", C)
    print(train_test_full_error)
    print(confusion_matrix(y_test, GSVC_y_test_pred))
    print(classification_report(y_test, GSVC_y_test_pred))
    print()

Experiment with more combinations of `gamma` and `C` <br>
How much more can we improve?
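Trying combinations by hand and comparing test scores risks tuning the hyperparameters to the test set. One common alternative is a cross-validated grid search on the training set only. The sketch below uses `GridSearchCV` on synthetic data, which is an assumption standing in for the wine data. The grid values are arbitrary examples, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption, not the wine dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=42
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="rbf")),
])

# "<step name>__<parameter>" addresses a parameter inside the pipeline
param_grid = {
    "svm_clf__gamma": [0.01, 0.1, 1],
    "svm_clf__C": [0.1, 1, 10, 100],
}

# 5-fold cross-validation on the training set; the test set stays untouched
search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1")
search.fit(Xd_train, yd_train)

best_params = search.best_params_
test_score = search.score(Xd_test, yd_test)  # F1 of the refit best model
print(best_params, round(test_score, 3))
```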

## 2. Decision boundaries of linear and nonlinear SVM classifiers
-  In this section, we will look at the decision boundaries of linear and nonlinear SVM classifiers on the moons dataset <br>
-  We will experiment with different hyperparameter settings to see how they alter the decision boundaries <br> <br>
(Some parts of the code in this section are adapted from Reference [2])

First, two plotting functions for visualization

In [81]:
def plot_dataset(X, y, axes):
    plt.plot(X[:, 0][y==0], X[:, 1][y==0], "bs")
    plt.plot(X[:, 0][y==1], X[:, 1][y==1], "g^")
    plt.axis(axes)
    plt.grid(True, which='both')
    plt.xlabel(r"$x_1$", fontsize=20)
    plt.ylabel(r"$x_2$", fontsize=20, rotation=0)

In [82]:
def plot_predictions(clf, axes):
    x0s = np.linspace(axes[0], axes[1], 100)
    x1s = np.linspace(axes[2], axes[3], 100)
    x0, x1 = np.meshgrid(x0s, x1s)
    X = np.c_[x0.ravel(), x1.ravel()]
    y_pred = clf.predict(X).reshape(x0.shape)
    y_decision = clf.decision_function(X).reshape(x0.shape)
    plt.contourf(x0, x1, y_pred, cmap=plt.cm.brg, alpha=0.2)
    plt.contourf(x0, x1, y_decision, cmap=plt.cm.brg, alpha=0.1)

Get the moons dataset

In [83]:
from sklearn.datasets import make_moons
X, y = make_moons(n_samples=100, noise=0.15, random_state=42)

Let's explore the dataset as we usually do

In [None]:
X.shape

In [None]:
X

In [None]:
y.shape

In [None]:
y

Plot the dataset (nothing beats actually seeing it!)

In [None]:
plot_dataset(X, y, [-1.5, 2.5, -1, 1.5])
plt.show()

Decision boundary of `LinearSVC` <br>
What is the effect of hyperparameter `C` on the decision boundary? <br>
Try `C=100` and `C=10000` <br>
Take note of the use of the `Pipeline` class

In [99]:
from sklearn.pipeline import Pipeline

In [None]:
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler

linear100_svm_clf = Pipeline([
        ("scaler", StandardScaler()),
        ("svm_clf", LinearSVC(C=100, loss="hinge"))
    ])
linear100_svm_clf.fit(X, y)

linear10000_svm_clf = Pipeline([
        ("scaler", StandardScaler()),
        ("svm_clf", LinearSVC(C=10000, loss="hinge"))
    ])
linear10000_svm_clf.fit(X, y)

Plot the decision boundaries

In [None]:
plt.figure(figsize=(11, 4))

plt.subplot(121)
plot_predictions(linear100_svm_clf, [-1.5, 2.5, -1, 1.5])
plot_dataset(X, y, [-1.5, 2.5, -1, 1.5])
plt.title(r"$C=100$", fontsize=18)

plt.subplot(122)
plot_predictions(linear10000_svm_clf, [-1.5, 2.5, -1, 1.5])
plot_dataset(X, y, [-1.5, 2.5, -1, 1.5])
plt.title(r"$C=10000$", fontsize=18)

plt.show()

Try other values of `C` <br>
Can we make sense of the change to the decision boundary?
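One way to quantify the change is the margin width, which for a linear SVM equals 2/||w||, where w is the learned weight vector. The sketch below swaps in `SVC(kernel="linear")` for `LinearSVC` so that the support vectors are also exposed. The two `C` values are arbitrary choices for contrast, and the widths are measured in the scaled feature space.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Same moons settings as in this section, under separate names
X_moons, y_moons = make_moons(n_samples=100, noise=0.15, random_state=42)

margins = {}
n_support = {}
for C in (0.01, 100):
    clf = Pipeline([
        ("scaler", StandardScaler()),
        ("svm_clf", SVC(kernel="linear", C=C)),
    ])
    clf.fit(X_moons, y_moons)
    svm = clf.named_steps["svm_clf"]
    # Distance between the two margin lines in the scaled feature space
    margins[C] = 2 / np.linalg.norm(svm.coef_)
    # Total number of support vectors across both classes
    n_support[C] = int(svm.n_support_.sum())
    print(f"C={C}: margin width={margins[C]:.2f}, support vectors={n_support[C]}")
```

A small `C` tolerates more margin violations in exchange for a wider margin, while a large `C` penalizes violations heavily and narrows it.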

Decision boundary of `SVC` with polynomial kernel <br>
How does the decision boundary change with different values of `degree`, `coef0` and `C`? <br>

In [None]:
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler

poly3_kernel_svm_clf = Pipeline([
        ("scaler", StandardScaler()),
        ("svm_clf", SVC(kernel="poly", degree=3, coef0=1, C=5))
    ])
poly3_kernel_svm_clf.fit(X, y)

In [None]:
poly10_kernel_svm_clf = Pipeline([
        ("scaler", StandardScaler()),
        ("svm_clf", SVC(kernel="poly", degree=10, coef0=100, C=5))
    ])
poly10_kernel_svm_clf.fit(X, y)

In [None]:
plt.figure(figsize=(11, 4))

plt.subplot(121)
plot_predictions(poly3_kernel_svm_clf, [-1.5, 2.5, -1, 1.5])
plot_dataset(X, y, [-1.5, 2.5, -1, 1.5])
plt.title(r"$d=3, r=1, C=5$", fontsize=18)

plt.subplot(122)
plot_predictions(poly10_kernel_svm_clf, [-1.5, 2.5, -1, 1.5])
plot_dataset(X, y, [-1.5, 2.5, -1, 1.5])
plt.title(r"$d=10, r=100, C=5$", fontsize=18)

plt.show()

Try and explore other combinations of the hyperparameters
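For reference when choosing values, the polynomial kernel used by `SVC(kernel="poly")` can be written as follows, where $d$ is `degree`, $r$ is `coef0` (matching the plot titles above) and $\gamma$ is `gamma`, which the cells above leave at its default:

```latex
K(\mathbf{a}, \mathbf{b}) = \left( \gamma \, \mathbf{a}^{\top} \mathbf{b} + r \right)^{d}
```

Expanding the power shows that $r$ controls how much the lower-degree terms contribute relative to the highest-degree term, which is one reason `coef0` and `degree` interact.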

Decision boundary of `SVC` with RBF kernel <br>
What are the hyperparameters that affect the decision boundary?

In [None]:
rbf_kernel_svm_clf = Pipeline([
        ("scaler", StandardScaler()),
        ("svm_clf", SVC(kernel="rbf", gamma=5, C=0.001))
    ])
rbf_kernel_svm_clf.fit(X, y)

We can plot the decision boundary here, for `gamma=5`, `C=0.001`

Alternatively, we can try different combinations of `gamma` and `C` in one shot

In [None]:
gamma1, gamma2 = 0.1, 5
C1, C2 = 0.001, 1000
hyperparams = (gamma1, C1), (gamma1, C2), (gamma2, C1), (gamma2, C2)

svm_clfs = []
for gamma, C in hyperparams:
    rbf_kernel_svm_clf = Pipeline([
            ("scaler", StandardScaler()),
            ("svm_clf", SVC(kernel="rbf", gamma=gamma, C=C))
        ])
    rbf_kernel_svm_clf.fit(X, y)
    svm_clfs.append(rbf_kernel_svm_clf)

plt.figure(figsize=(11, 7))

for i, svm_clf in enumerate(svm_clfs):
    plt.subplot(221 + i)
    plot_predictions(svm_clf, [-1.5, 2.5, -1, 1.5])
    plot_dataset(X, y, [-1.5, 2.5, -1, 1.5])
    gamma, C = hyperparams[i]
    plt.title(r"$\gamma = {}, C = {}$".format(gamma, C), fontsize=16)

plt.show()
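To help interpret the role of `gamma` in the plots above, recall that the RBF kernel measures similarity as exp(-gamma * ||a - b||^2). The sketch below evaluates it by hand for two made-up points one unit apart and compares the result with Scikit-Learn's `rbf_kernel` helper.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Two illustrative points at distance 1 from each other
a = np.array([[0.0, 0.0]])
b = np.array([[1.0, 0.0]])

similarities = {}
for gamma in (0.1, 5):
    manual = np.exp(-gamma * np.sum((a - b) ** 2))  # kernel formula by hand
    library = rbf_kernel(a, b, gamma=gamma)[0, 0]   # same value via Scikit-Learn
    similarities[gamma] = (manual, library)
    print(f"gamma={gamma}: manual={manual:.4f}, library={library:.4f}")
```

With `gamma=0.1` the two points still look similar (about 0.90), while with `gamma=5` the similarity drops to about 0.007. A larger `gamma` therefore shrinks each training instance's range of influence and makes the decision boundary more irregular.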

#### References <br>
[1] Intel AI Academy, Machine Learning 501. <br>
[2] A. Géron (2017), Hands-On Machine Learning with Scikit-Learn and TensorFlow, Chapter 5 (O’Reilly).