# Kernels Tutorial

Support vector machines (SVMs) are classification models whose decision boundary is shaped by the choice of kernel function. In this project we use the SVM implementation in the scikit-learn (sklearn) library. Its documentation is available at https://scikit-learn.org/stable/modules/svm.html
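
To make the notion of a kernel concrete, the sketch below evaluates the three kernels used in this tutorial on a single pair of points. It relies on the helper functions in `sklearn.metrics.pairwise`; the points `a` and `b` and the parameter values are arbitrary choices made for illustration only.

```python
import numpy as np
from sklearn.metrics.pairwise import linear_kernel, polynomial_kernel, rbf_kernel

# Two illustrative 2-D points (hypothetical values), each as a 1-row matrix
a = np.array([[1.0, 2.0]])
b = np.array([[3.0, 0.5]])

# Linear kernel: <a, b> = 1*3 + 2*0.5 = 4
k_lin = linear_kernel(a, b)[0, 0]

# Polynomial kernel: (gamma * <a, b> + coef0) ** degree = (1 * 4 + 0) ** 3 = 64
k_poly = polynomial_kernel(a, b, degree=3, gamma=1.0, coef0=0.0)[0, 0]

# RBF kernel: exp(-gamma * ||a - b||^2), with ||a - b||^2 = 4 + 2.25 = 6.25
k_rbf = rbf_kernel(a, b, gamma=2.0)[0, 0]

print("linear:", k_lin)
print("polynomial:", k_poly)
print("rbf:", k_rbf)
```

The kernel value acts as a similarity score between two points; the SVM combines these scores over its support vectors to decide on which side of the boundary a new point falls.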

In [None]:
# Original Code source: Gaël Varoquaux & Andreas Müller
# Modified for documentation by Jaques Grobler
# Edited for course project by Joel Sjöberg and Ion Petre
# License: BSD 3 clause

# Libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
# For data standardization
from sklearn.preprocessing import StandardScaler

# Import the support vector classifier 
from sklearn.svm import SVC


In [None]:
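import random

# Seed Python's built-in random module.
# NumPy and scikit-learn do not draw from this generator; the calls below
# are made reproducible through their own random_state arguments.
random.seed(2021)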

We will see how different kernels form decision boundaries on a simple dataset, which we generate with the `make_moons` function.

In [None]:
# Generate dataset
X, y = make_moons(noise=0.2, random_state=0, n_samples=1000)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42
)

# Create an instance of the standardization scaler
std_scaler = StandardScaler()

# Standardize the training data to mean 0 and standard deviation 1.
# We get this through fitting the scaler to the training data.
X_train = std_scaler.fit_transform(X_train)

# Use the same scaler to transform the test data. 
# This is needed because the model will be trained on standardized data.
# Even for novel "production" data, we will have to apply the same standardization.
# Note that the scaler isn't fit again on the test data!!
X_test = std_scaler.transform(X_test)

# Scatter the training points according to their category
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, s=30, cmap=plt.cm.Paired)
plt.show()
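
As a quick sanity check of the standardization step, the sketch below repeats the split and scaling with the same parameters and inspects the resulting statistics. The variable names (`X_tr_std`, `X_te_std`, `manual`) are introduced here only for illustration.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Same data and split as in the tutorial
X, y = make_moons(noise=0.2, random_state=0, n_samples=1000)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit on the training data only, then reuse the fitted scaler on the test data
scaler = StandardScaler()
X_tr_std = scaler.fit_transform(X_tr)
X_te_std = scaler.transform(X_te)

# The training features have mean 0 and standard deviation 1 by construction
print("train mean:", X_tr_std.mean(axis=0))
print("train std: ", X_tr_std.std(axis=0))

# The test features are only close to 0 and 1, because the scaler
# uses the training statistics stored in mean_ and scale_
print("test mean: ", X_te_std.mean(axis=0))
print("test std:  ", X_te_std.std(axis=0))

# transform() is equivalent to subtracting mean_ and dividing by scale_
manual = (X_te - scaler.mean_) / scaler.scale_
print("manual transform matches:", np.allclose(manual, X_te_std))
```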

## Creating a model

We now create a model to categorize the dataset. We begin with a linear kernel model.

In [None]:
# Import the support vector classifier 
from sklearn.svm import SVC

# Fit a linear support vector classifier
clf = SVC(kernel="linear", C=1.0, random_state = 0)
clf.fit(X_train, y_train)

# Measure accuracy on the train data
print("Accuracy on the training set:", clf.score(X_train, y_train))

In [None]:
# Plot the decision boundary and the support vectors

plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, s=30, cmap=plt.cm.Paired)

# plot the decision function
ax = plt.gca()
xlim = ax.get_xlim()
ylim = ax.get_ylim()

# create grid to evaluate model
xx = np.linspace(xlim[0], xlim[1], 30)
yy = np.linspace(ylim[0], ylim[1], 30)
YY, XX = np.meshgrid(yy, xx)
xy = np.vstack([XX.ravel(), YY.ravel()]).T
Z = clf.decision_function(xy).reshape(XX.shape)

# plot decision boundary and margins
ax.contour(
    XX, YY, Z, colors="k", levels=[-1, 0, 1], alpha=0.5, linestyles=["--", "-", "--"]
)
# plot support vectors
ax.scatter(
    clf.support_vectors_[:, 0],
    clf.support_vectors_[:, 1],
    s=100,
    linewidth=1,
    facecolors="none",
    edgecolors="k",
)
plt.show()
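
For a linear kernel, the decision function drawn above is simply `w . x + b`. The following sketch, which refits the same model on freshly generated data so that it runs on its own, checks this using the `coef_` and `intercept_` attributes of the fitted classifier. The names `w`, `b`, and `manual` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Recreate the standardized training data used in the tutorial
X, y = make_moons(noise=0.2, random_state=0, n_samples=1000)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)
X_tr = StandardScaler().fit_transform(X_tr)

clf = SVC(kernel="linear", C=1.0, random_state=0).fit(X_tr, y_tr)

# coef_ and intercept_ are only defined for the linear kernel
w = clf.coef_.ravel()   # weight vector, one entry per feature
b = clf.intercept_[0]   # bias term

# Decision values computed by hand: w . x + b for every training point
manual = X_tr @ w + b
print("matches decision_function:", np.allclose(manual, clf.decision_function(X_tr)))

# The dashed contour lines at -1 and +1 are the margins; their distance is 2 / ||w||
print("margin width:", 2.0 / np.linalg.norm(w))
```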

Let's now try a polynomial kernel. We will get better results than with the linear kernel.

In [None]:
# Initialize model
# Suggestion: try several different degrees in the kernel by modifying the parameter "degree"
pol_model = SVC(kernel="poly", degree=3, C=1.0, random_state = 0)

# Train the model
pol_model.fit(X_train, y_train)

# Measure accuracy on the training data
print("Accuracy on the training set:", pol_model.score(X_train, y_train))

In [None]:
# Plot the decision boundary and the support vectors

plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, s=30, cmap=plt.cm.Paired)

# plot the decision function
ax = plt.gca()
xlim = ax.get_xlim()
ylim = ax.get_ylim()

# create grid to evaluate model
xx = np.linspace(xlim[0], xlim[1], 30)
yy = np.linspace(ylim[0], ylim[1], 30)
YY, XX = np.meshgrid(yy, xx)
xy = np.vstack([XX.ravel(), YY.ravel()]).T
Z = pol_model.decision_function(xy).reshape(XX.shape)

# plot decision boundary and margins
ax.contour(
    XX, YY, Z, colors="k", levels=[-1, 0, 1], alpha=0.5, linestyles=["--", "-", "--"]
)
# plot support vectors
ax.scatter(
    pol_model.support_vectors_[:, 0],
    pol_model.support_vectors_[:, 1],
    s=100,
    linewidth=1,
    facecolors="none",
    edgecolors="k",
)
plt.show()
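
The suggestion to try several degrees can be followed with a short loop. This is a minimal sketch that regenerates the tutorial data so that it is self-contained; the dictionary `scores` and the chosen range of degrees are illustrative assumptions, and all other `SVC` parameters are left at the values used above.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Recreate the standardized training data used in the tutorial
X, y = make_moons(noise=0.2, random_state=0, n_samples=1000)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)
X_tr = StandardScaler().fit_transform(X_tr)

# Fit one polynomial-kernel model per degree and record its training accuracy
scores = {}
for degree in range(1, 6):
    model = SVC(kernel="poly", degree=degree, C=1.0, random_state=0)
    model.fit(X_tr, y_tr)
    scores[degree] = model.score(X_tr, y_tr)
    print(f"degree={degree}: training accuracy = {scores[degree]:.3f}")
```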

Let's now try the radial basis function (RBF) kernel. We will get even better results, which is to be expected given the curved, interleaving shape of the two classes.

In [None]:
# Initialize model
rbf_model = SVC(kernel="rbf", C=1.0, gamma = 2, random_state = 0)

# Train the model
rbf_model.fit(X_train, y_train)

# Measure accuracy on the training data
print("Accuracy on the training set:", rbf_model.score(X_train, y_train))

In [None]:
# Plot the decision boundary and the support vectors

plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, s=30, cmap=plt.cm.Paired)

# plot the decision function
ax = plt.gca()
xlim = ax.get_xlim()
ylim = ax.get_ylim()

# create grid to evaluate model
xx = np.linspace(xlim[0], xlim[1], 30)
yy = np.linspace(ylim[0], ylim[1], 30)
YY, XX = np.meshgrid(yy, xx)
xy = np.vstack([XX.ravel(), YY.ravel()]).T
Z = rbf_model.decision_function(xy).reshape(XX.shape)

# plot decision boundary and margins
ax.contour(
    XX, YY, Z, colors="k", levels=[-1, 0, 1], alpha=0.5, linestyles=["--", "-", "--"]
)
# plot support vectors
ax.scatter(
    rbf_model.support_vectors_[:, 0],
    rbf_model.support_vectors_[:, 1],
    s=100,
    linewidth=1,
    facecolors="none",
    edgecolors="k",
)
plt.show()
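
To see how the RBF model arrives at its decision values, the sketch below recomputes them by hand as a weighted sum of kernel evaluations against the support vectors. For brevity it fits on the raw (unstandardized) moons data, which is an assumption that differs from the tutorial cells; the names `K` and `manual` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_moons(noise=0.2, random_state=0, n_samples=1000)

gamma = 2.0
model = SVC(kernel="rbf", C=1.0, gamma=gamma, random_state=0).fit(X, y)

# Kernel values between every data point and every support vector:
# K[i, j] = exp(-gamma * ||X[i] - support_vector[j]||^2)
K = rbf_kernel(X, model.support_vectors_, gamma=gamma)

# Decision value = sum_j dual_coef[j] * K[i, j] + intercept
manual = K @ model.dual_coef_.ravel() + model.intercept_[0]

print("support vectors:", model.support_vectors_.shape[0])
print("matches decision_function:", np.allclose(manual, model.decision_function(X)))
```

Only the support vectors (the circled points in the plots) contribute to the sum, which is why they alone determine the decision boundary.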

Conclusion: the RBF kernel performs best in this example; its decision boundary follows the data quite faithfully.
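
The accuracies reported above are measured on the training set. As a complementary check, the following sketch scores the same three models on the held-out test set as well. It regenerates the data so that it runs on its own; the `models` and `results` dictionaries are introduced here for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Recreate the standardized train/test split used in the tutorial
X, y = make_moons(noise=0.2, random_state=0, n_samples=1000)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)
scaler = StandardScaler()
X_tr = scaler.fit_transform(X_tr)
X_te = scaler.transform(X_te)

# The three models from the tutorial, with the same hyperparameters
models = {
    "linear": SVC(kernel="linear", C=1.0, random_state=0),
    "poly": SVC(kernel="poly", degree=3, C=1.0, random_state=0),
    "rbf": SVC(kernel="rbf", C=1.0, gamma=2, random_state=0),
}

# Record (training accuracy, test accuracy) for each kernel
results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    results[name] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print(f"{name:>6}: train = {results[name][0]:.3f}, test = {results[name][1]:.3f}")
```

A large gap between the training and test accuracy of a model would suggest overfitting, so comparing both columns gives a fuller picture than the training accuracy alone.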

# Kernels Project

Let's load a simple income dataset. Each data point is an individual with various features such as age, education, marital status, native country, number of work hours per week, etc. The target value is 1 if the person earns 50k per year or more, 0 otherwise. The dataset is available in the Penn Machine Learning Benchmarks library. Information about this collection can be found at https://github.com/EpistasisLab/pmlb

In [None]:
# Install the pmlb library on this runtime
!pip install pmlb
import pmlb

Link to dataset metadata (feature descriptions): https://github.com/EpistasisLab/pmlb/blob/master/datasets/adult/metadata.yaml

In [None]:
#import pandas as pd
#pd.set_option('display.max_columns', None)

# Get the "adult" dataset
frame = pmlb.fetch_data("adult")

# Print the head of the resulting dataframe
print(frame.head(10))

# Print the name of the features in the data
print(frame.columns)

In [None]:
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
        frame.to_numpy()[:, :-1], frame.to_numpy()[:, -1], test_size=0.3, random_state=0
)

## Modelling

Let's build a support vector machine model that separates the individuals according to whether they earn over 50k/year.

In [None]:
random.seed(2021)

In [None]:
# Try the linear kernel
# Q1: What is the accuracy on the training dataset? Use random_state = 0.

# Initialize model

# Train the model

# Measure accuracy on the training data


In [None]:
# Try the polynomial kernel
# Q2: Which is the best-fitting polynomial kernel among degrees 1, 2, 3, 4, 5?
# Use random_state = 0.

random.seed(2021)



# Q3. For the best fitting polynomial kernel, what is the accuracy on the training dataset?

In [None]:
# Try the RBF kernel
# Q4: What is the accuracy on the training dataset? Use random_state = 0.

random.seed(2021)


# Train the model


# Measure accuracy on the training data



In [None]:
# Q5. For the model with the best accuracy on the training dataset, what is its accuracy on the test dataset?


In [None]:
# Measure accuracy on the test data
