This Jupyter notebook demonstrates a simple but reasonably standard use of the support vector machine (SVM) algorithm. For better or worse, the mathematical details behind this algorithm are beyond the technical background required of this notebook's audience. The mathematically sophisticated (or simply adventurous) student can visit https://en.wikipedia.org/wiki/Support_vector_machine to see a more exact treatment of what the algorithm does, and https://en.wikipedia.org/wiki/Convex_optimization for a glimpse of exactly how and why it works.

We begin with a (fictional) dataset describing 300 young basketball players. The first two columns give each player's height in inches and 100m sprint time in seconds, as measured during their senior-year season. The third column indicates whether each player did (1) or did not (0) end up playing basketball on a college team. We think of height and sprint time as predictive variables and college playing as the output to be predicted, and we will end up with an algorithm that predicts just that.

But first, the boring bits. Let's import the various libraries we'll need for this exercise.

In [21]:
# various imports
import numpy as np
import os
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import io
import requests
import tarfile
from sklearn import svm

Next we need to grab the data. Finding, importing, and formatting data sets is its own art, but today we're lucky: the data exists as a .csv file and is already formatted neatly. To ensure everything worked as intended, we display three different ways of summarizing the data.

In [2]:
# Download the raw CSV file and parse it into a pandas DataFrame
url = "https://raw.githubusercontent.com/mathewphilipc/HandsOnDeep/master/basketball_data.csv"
s = requests.get(url).content  # raw bytes of the file
data = pd.read_csv(io.StringIO(s.decode('utf-8')))
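
The decode-then-wrap step in the cell above can look a little mysterious, so here is a minimal sketch of the same idea that doesn't touch the network. The bytes in `raw_bytes` are a hypothetical stand-in typed in by hand, using the column names the notebook reports; the real file may be laid out differently.

```python
import io
import pandas as pd

# Hypothetical stand-in for the downloaded bytes (not the real file).
raw_bytes = (
    b"Height,Sprint,College\n"
    b"73.617769,11.177916,1\n"
    b"74.288237,11.828745,0\n"
    b"72.972113,11.680184,0\n"
)

# requests hands back bytes, so we decode them to text, then wrap the text
# in a file-like object that pd.read_csv knows how to read.
sample = pd.read_csv(io.StringIO(raw_bytes.decode("utf-8")))

print(sample.shape)
print(sample.dtypes)
```

`pd.read_csv` can also be handed a URL directly, which would skip the `requests` step; the explicit version just makes each stage visible.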

In [3]:
data.head()

      Height     Sprint  College
0  73.617769  11.177916        1
1  74.288237  11.828745        0
2  72.972113  11.680184        0
3  74.351292  12.501294        1
4  72.827937  12.170944        0


In [8]:
data.describe()

          Height     Sprint   College
count      300.0      300.0     300.0
mean   73.877209  12.021287  0.593333
std     1.997583   0.393754  0.492032
min     68.98388  10.877144       0.0
25%     72.56573  11.733058       0.0
50%    73.852688   12.02203       1.0
75%    75.263846  12.298332       1.0
max    80.134534  13.176756       1.0


In [9]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300 entries, 0 to 299
Data columns (total 3 columns):
Height     300 non-null float64
Sprint     300 non-null float64
College    300 non-null int64
dtypes: float64(2), int64(1)
memory usage: 7.1 KB


Great, everything checks out. Using data.head() we see the first five rows of our data set, corresponding to the first five athletes. And the numbers we see are reasonable: the first athlete is a bit over 6'1" (tall, but not unusual for a basketball player) and can run 100m in a little over 11 seconds (quite fast, but not unheard of for a high-level athlete). The next four athletes are reasonable as well.

data.describe() offers some aggregate information about each column. For example, the count row confirms that all 300 athletes have a value listed for each attribute. The average athlete on our list is a bit under 6'2" and runs 100m in just over 12 seconds, and 59.3% of them went on to play in college. We can also see slightly more sophisticated statistical information such as standard deviations and quartiles for each trait.
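
Since the heights are stored in inches, it helps to convert them before judging whether they look sensible. A quick sketch of that arithmetic, using the first height from data.head() and the mean from data.describe() (the helper name `to_feet_inches` is ours, not part of any library):

```python
def to_feet_inches(total_inches):
    """Split a height in inches into whole feet and leftover inches."""
    feet, inches = divmod(total_inches, 12)
    return int(feet), inches

first_height = 73.617769  # first row of data.head()
mean_height = 73.877209   # "mean" row of data.describe()

for label, value in [("first athlete", first_height),
                     ("average athlete", mean_height)]:
    feet, inches = to_feet_inches(value)
    print(f"{label}: {feet}'{inches:.1f}\"")
```

Both values come out a little over six feet, with the mean slightly higher than the first athlete.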

data.info() tells us basic information about our data as a data structure. It's something called a pandas DataFrame, with 300 rows and 3 columns. Our two input attributes are stored as floats and the output as an int, and no column has missing values. And apparently the whole thing occupies 7.1 KB in memory.

As nice as this data is, we need to do just a little bit of reshaping before we can feed it into our support vector machine.

In [40]:
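# Target: 1 if the player went on to play in college, else 0
y = data["College"].values.reshape(-1, 1)
# Each feature on its own, as a column vector
X1 = data["Height"].values.reshape(-1, 1)
X2 = data["Sprint"].values.reshape(-1, 1)
# Feature matrix with one row per player: [Height, Sprint]
X = data.drop("College", axis=1).values.reshape(-1, 2)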
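
To make the reshaping less abstract, here is a small sketch of what these calls do to array shapes. The three-row frame `toy` is made up for illustration and only borrows the column names of the real data. In `reshape(-1, 1)`, the `-1` asks NumPy to infer the number of rows, and the `1` forces a single column.

```python
import pandas as pd

# Hypothetical three-row frame with the same column names as the real data.
toy = pd.DataFrame({
    "Height": [73.6, 74.3, 73.0],
    "Sprint": [11.2, 11.8, 11.7],
    "College": [1, 0, 0],
})

features = toy.drop("College", axis=1).values          # one row per player
labels_column = toy["College"].values.reshape(-1, 1)   # single column
labels_flat = toy["College"].values                    # plain 1-D vector

print(features.shape)       # (3, 2)
print(labels_column.shape)  # (3, 1)
print(labels_flat.shape)    # (3,)
```

scikit-learn estimators generally want the features as a 2-D array of shape (n_samples, n_features) and the labels as a 1-D array of length n_samples, so a column-shaped label array can be flattened with `.ravel()` when needed.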

Now that that's out of the way, we can create a support vector classifier with a linear kernel and fit it to our data.

In [47]:
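# Create a support vector classifier with a linear kernel and fit it
model = svm.SVC(kernel='linear')
model.fit(X, y.ravel())  # ravel() gives the 1-D label vector scikit-learn expects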

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
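
Once a model like this is fitted, it can be asked for predictions. The sketch below shows the general pattern on a tiny, deliberately well-separated toy data set; every number in it is made up, and `toy_model`, `X_toy`, and `new_players` are illustrative names rather than anything from the notebook's real data.

```python
import numpy as np
from sklearn import svm

# Made-up toy data: [height in inches, 100m time in seconds].
X_toy = np.array([
    [70.0, 13.0], [71.0, 12.8], [70.5, 12.9],   # labelled 0
    [77.0, 11.0], [78.0, 11.2], [77.5, 11.1],   # labelled 1
])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_model = svm.SVC(kernel="linear")
toy_model.fit(X_toy, y_toy)

# predict() takes a 2-D array with one row per new player.
new_players = np.array([[77.2, 11.05], [70.2, 12.95]])
predictions = toy_model.predict(new_players)
print(predictions)

# With a linear kernel, coef_ and intercept_ describe the separating line.
print(toy_model.coef_, toy_model.intercept_)
```

The same two calls, `predict` and a look at `coef_`, should work on the `model` fitted above, though the real data is far less cleanly separated than this toy example.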