# A Very Brief Introduction to Support Vector Machines in Practice

This Jupyter notebook demonstrates a simple but reasonably standard use of the support vector machine (SVM) algorithm. For better or worse, the mathematical details behind this algorithm are beyond the technical background required of this notebook's audience. The mathematically sophisticated (or simply adventurous) student can visit https://en.wikipedia.org/wiki/Support_vector_machine to see a more exact treatment of what the algorithm does, and https://en.wikipedia.org/wiki/Convex_optimization for a glimpse of exactly how and why it works.

We begin with a (fictional) dataset describing 300 young basketball players. The first two columns give each player's height in inches and 100m sprint time in seconds*, as measured during their senior-year season. The third column indicates whether each player did (1) or did not (0) end up playing basketball on a college team. We think of height and sprint time as predictive variables and college playing as an output to be predicted, and we will end up with an algorithm that predicts just that.

But first, the boring bits. Let's import the various libraries we'll need for this exercise.

*For context: An active adult can usually run 100m in under 20s. A typical college athlete can run it in under 15s. 10s and under is the domain of Olympic sprinters.

In [1]:
# various imports
import numpy as np
import os
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import io
import requests
import tarfile
from sklearn import svm
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score

Next we need to grab the data. Finding, importing, and formatting data sets is its own art, but today we're lucky: the data exists as a .csv file and is already formatted neatly. To ensure everything worked as intended, we display three different ways of summarizing the data.

In [2]:
url = "https://raw.githubusercontent.com/mathewphilipc/HandsOnDeep/master/basketball_data.csv"
response = requests.get(url)
response.raise_for_status()  # stop here if the download failed
data = pd.read_csv(io.StringIO(response.content.decode('utf-8')))
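
The download step boils down to handing pandas a block of CSV text. A minimal offline sketch of that idea follows; the two data rows are copied from the notebook's output, while the header row and the names `csv_text` and `toy` are assumptions made for illustration.

```python
import io
import pandas as pd

# A few rows typed in by hand, standing in for the downloaded file's contents.
# The header row is an assumption based on the column names the notebook reports.
csv_text = (
    "Height,Sprint,College\n"
    "73.617769,11.177916,1\n"
    "74.288237,11.828745,0\n"
)

# io.StringIO wraps the string in a file-like object that pd.read_csv can read from
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)   # (rows, columns)
print(toy.dtypes)  # column types inferred by pandas
```

The same pattern works for any text you already hold in memory, which is handy when testing without a network connection.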

In [3]:
data.head()

| | Height | Sprint | College |
|---|---|---|---|
| 0 | 73.617769 | 11.177916 | 1 |
| 1 | 74.288237 | 11.828745 | 0 |
| 2 | 72.972113 | 11.680184 | 0 |
| 3 | 74.351292 | 12.501294 | 1 |
| 4 | 72.827937 | 12.170944 | 0 |


In [4]:
data.describe()

| | Height | Sprint | College |
|---|---|---|---|
| count | 300.0 | 300.0 | 300.0 |
| mean | 73.877209 | 12.021287 | 0.593333 |
| std | 1.997583 | 0.393754 | 0.492032 |
| min | 68.98388 | 10.877144 | 0.0 |
| 25% | 72.56573 | 11.733058 | 0.0 |
| 50% | 73.852688 | 12.02203 | 1.0 |
| 75% | 75.263846 | 12.298332 | 1.0 |
| max | 80.134534 | 13.176756 | 1.0 |


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300 entries, 0 to 299
Data columns (total 3 columns):
Height     300 non-null float64
Sprint     300 non-null float64
College    300 non-null int64
dtypes: float64(2), int64(1)
memory usage: 7.1 KB


Great, everything checks out. Using data.head() we see the first five rows of our data set, corresponding to the first five athletes. And the numbers we see are reasonable: the first athlete is a bit under 6'2" (tall, but not unusual for a basketball player) and can run 100m in a little over 11s (quite fast, but not unheard of for a high-level athlete). The next four athletes are reasonable as well.

data.describe() offers some aggregate information about each column. For example, all 300 athletes have listed values for each attribute. The average athlete on our list is a bit under 6'2" and runs 100m in just over 12s, and 59.3% of them went on to play in college. We can also see slightly more sophisticated statistical information such as standard deviations and quartiles for each trait.
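
Heights in this data set are stored in inches, so reading them as feet and inches takes a small conversion. Here is a minimal sketch of that arithmetic; the helper name `inches_to_feet_inches` is ours and is not part of pandas.

```python
def inches_to_feet_inches(total_inches):
    """Split a height in inches into whole feet and leftover inches."""
    feet = int(total_inches // 12)
    inches = total_inches - 12 * feet
    return feet, inches

# Heights taken from the data.head() and data.describe() output
first_athlete = inches_to_feet_inches(73.617769)  # first row of the table
mean_athlete = inches_to_feet_inches(73.877209)   # mean of the Height column

print(first_athlete)
print(mean_athlete)
```

Both values come out at 6 feet plus a little under 2 inches.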

data.info() tells us basic info about our data as a data structure. It's something called a pandas DataFrame. Our two input attributes are floats and the output is an int. And the whole thing occupies 7.1 KB in memory.

As nice as this data is, we need to do just a little bit of reshaping before we can feed it into our support vector machine.

In [6]:
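# Target: 1 if the player went on to play in college, 0 otherwise
y = data["College"]
# Each feature on its own as a column vector of shape (300, 1)
X1 = data["Height"].values.reshape(-1, 1)
X2 = data["Sprint"].values.reshape(-1, 1)
# Both features together as a (300, 2) array, the input format scikit-learn expects
X = data[["Height", "Sprint"]].values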
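
scikit-learn estimators generally expect the inputs as a two-dimensional array of shape (number of samples, number of features) and the target as a one-dimensional array. The sketch below checks those shapes on a three-row stand-in for the real data; the names `toy`, `X_toy`, `X1_toy`, and `y_toy` are illustrative only.

```python
import pandas as pd

# A tiny stand-in for the real data (the first three rows shown by data.head())
toy = pd.DataFrame({
    "Height": [73.617769, 74.288237, 72.972113],
    "Sprint": [11.177916, 11.828745, 11.680184],
    "College": [1, 0, 0],
})

y_toy = toy["College"]                         # 1-D target, one label per athlete
X_toy = toy[["Height", "Sprint"]].values       # 2-D features, one row per athlete
X1_toy = toy["Height"].values.reshape(-1, 1)   # a single feature as a column vector

print(X_toy.shape, X1_toy.shape, y_toy.shape)
```

The `reshape(-1, 1)` call matters only when using a single feature: without it the array would be one-dimensional, which is not the layout `fit` expects for inputs.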

Now that that's out of the way, we can start building a model for our data. As we mentioned above, the exact details of support vector machines are going to be a black box. For now it suffices to think of this technique as something like a generalization of logistic and linear regression. It models the data using a function with several free parameters, tweaking those parameters until the function best matches the data. We are free to decide what particular function to use - logistic, linear, quadratic, or something more exotic. In the land of SVMs, the choice of function amounts to choosing something called a 'kernel' (another black box).
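
Without opening the black box very far, a kernel can be thought of as a function that scores how similar two data points are. As a rough illustration, the sketch below evaluates two kernels on a pair of athletes using helpers from sklearn.metrics.pairwise. The `gamma` value is an arbitrary choice for this sketch, not a recommended setting.

```python
import numpy as np
from sklearn.metrics.pairwise import linear_kernel, rbf_kernel

# Two athletes from data.head(): [height in inches, sprint time in seconds]
a = np.array([[73.617769, 11.177916]])
b = np.array([[74.288237, 11.828745]])

gamma = 0.5  # illustrative value chosen for this sketch, not a tuned setting

# Linear kernel: the plain dot product of the two feature vectors
k_lin = linear_kernel(a, b)[0, 0]

# RBF kernel: exp(-gamma * squared distance), which lies between 0 and 1
k_rbf = rbf_kernel(a, b, gamma=gamma)[0, 0]

print(k_lin, k_rbf)
```

The RBF value shrinks toward 0 as two athletes get farther apart and equals 1 when they are identical.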

In this tutorial we'll be using the svm.SVC function provided by scikit-learn. According to the official scikit-learn documentation for this function (found at http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) we can choose our kernel by inserting

> kernel = YourChoiceHere

as an argument of svm.SVC(). Let's start with a linear kernel:

In [7]:
linear_model = svm.SVC(kernel='linear').fit(X,y)
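
Once fitted, a two-class SVC assigns each point a signed score and predicts the second class wherever that score is positive. The sketch below shows this on synthetic stand-in data (an assumption made so the example runs on its own, not the notebook's data set); `demo_model` and `extreme_athletes` are illustrative names.

```python
import numpy as np
from sklearn import svm

# Synthetic stand-in data (an assumption for this sketch, not the notebook's data set):
# taller, faster players are labelled 1; shorter, slower players are labelled 0.
rng = np.random.default_rng(0)
X_pos = rng.normal(loc=[76.0, 11.5], scale=[1.0, 0.2], size=(50, 2))
X_neg = rng.normal(loc=[71.0, 12.5], scale=[1.0, 0.2], size=(50, 2))
X_demo = np.vstack([X_pos, X_neg])
y_demo = np.array([1] * 50 + [0] * 50)

demo_model = svm.SVC(kernel="linear").fit(X_demo, y_demo)

# One very short, slow athlete and one very tall, fast athlete
extreme_athletes = [[54.5, 20.6], [82.8, 4.1]]
scores = demo_model.decision_function(extreme_athletes)  # signed score per athlete
labels = demo_model.predict(extreme_athletes)            # 1 where the score is positive

print(scores)
print(labels)
```

The size of a score gives a rough sense of how far a point sits from the decision boundary, which can be more informative than the bare 0 or 1.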

Just for fun and as a sanity check, let's make up a few edge case athletes. The first five are very short and pretty slow (~4'6" and ~20s), the next five are extremely tall and impossibly fast (~6'10" and ~5s). If our algorithm isn't completely broken, it should return 0 for the first five athletes and 1 for the next five.

In [8]:
madeUpAthletes = [[57.6,20.7],
                  [52.8,20.5],
                  [54.5,20.6],
                  [56.4,21.6],
                  [54.3,21.5],
                  [80.5,5.3],
                  [82.8,4.1],
                  [82.4,7.3],
                  [82.7,5.4],
                  [81.7,5.1]]

In [9]:
linear_model.predict(madeUpAthletes)

array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

Great! It passes our sanity check. Without going off the deep end of sophisticated metrics and cost functions, let's go just one step further and see what fraction of the 300 athletes in our data set (the same ones the model was fitted to) are correctly predicted by our linear model.

In [10]:
accuracy_score(y, linear_model.predict(X))

0.68333333333333335

It turns out our linear model is correct just over 2/3 of the time. So there's probably nothing wrong with our code, but the real underlying relationship behind our data probably isn't linear. Let's try some other models. Just as we avoided the deep end of performance metrics, we avoid the full range of SVM parameters provided by sklearn. Instead, let's try a few common nonlinear kernels, starting with rbf (the radial basis function kernel, which has a Gaussian shape).
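
One caveat before trying other kernels: a score computed on the same athletes the model was fitted to tends to be optimistic. cross_val_score, which is already imported at the top of this notebook, gives a held-out estimate instead. The sketch below runs it on synthetic stand-in data (an assumption made so the example is self-contained); on the real data one would pass `X` and `y` in place of `X_demo` and `y_demo`.

```python
import numpy as np
from sklearn import svm
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption for this sketch, not the notebook's data set)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[74.0, 12.0], scale=[2.0, 0.4], size=(200, 2))
# Label depends noisily on being taller and faster than average
signal = (X_demo[:, 0] - 74.0) / 2.0 - (X_demo[:, 1] - 12.0) / 0.4
y_demo = (signal + rng.normal(scale=1.0, size=200) > 0).astype(int)

model = svm.SVC(kernel="linear")

# 5-fold cross-validation: fit on 4/5 of the rows, score on the held-out 1/5, five times
cv_scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")

print(cv_scores)         # one accuracy per fold
print(cv_scores.mean())  # average held-out accuracy
```

If the cross-validated accuracy is noticeably lower than the training accuracy, the model is probably memorizing quirks of the training set rather than learning a general pattern.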

In [11]:
rbf_model = svm.SVC(kernel='rbf').fit(X,y)
accuracy_score(y, rbf_model.predict(X))

0.69333333333333336

This is better, but not by much: a gain of 0.01 amounts to three more athletes out of 300 classified correctly. Next we try a sigmoid (tanh) kernel.

In [13]:
sigmoid_model = svm.SVC(kernel='sigmoid').fit(X,y)
accuracy_score(y, sigmoid_model.predict(X))

0.59333333333333338
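
A useful reference point for all three scores is a model that ignores height and sprint time entirely and always predicts the more common outcome. Since the mean of the College column is 0.593333 over 300 rows, 178 athletes played in college, so always predicting 1 would score about 0.593. That is the same figure the sigmoid model produced, which does not prove the sigmoid model predicts 1 for everyone but is consistent with it. Here is a minimal sketch of such a baseline using scikit-learn's DummyClassifier on stand-in labels with the same 178/122 split:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Stand-in labels: 178 ones and 122 zeros, matching the 0.593333 mean of the College column
y_demo = np.array([1] * 178 + [0] * 122)
X_demo = np.zeros((300, 2))  # placeholder features; this baseline ignores them

# Always predict whichever label is most common in the training data
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
baseline_accuracy = accuracy_score(y_demo, baseline.predict(X_demo))

print(baseline_accuracy)
```

Seen against this baseline, the linear and rbf models improve on blind guessing by roughly nine to ten percentage points.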

# Further Reading

Tutorials:

https://www.analyticsvidhya.com/blog/2017/09/understaing-support-vector-machine-example-code/

https://sadanand-singh.github.io/posts/svmpython/

Documentation:

http://scikit-learn.org/stable/modules/svm.html

http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html