# Linear Models & Support Vector Machines

In this lesson we will explore a new way to think about classification. Rather than classifying new examples directly from old ones (e.g. *k*-Nearest Neighbor), which implicitly models the entire distribution of the classes, we can simply model the *boundaries* between different classes. Ideally, classes with distinct boundaries present a simpler problem, one that is easier to learn than with kNN.

First, we will gain some intuition around what modeling boundaries means, and define some terms to help us in our understanding.

Next, we will explore two different methods for generating boundaries between class distributions:
1. Linear Discriminant Analysis
2. Support Vector Machines

## Introduction

As mentioned above, we are going to shift our way of thinking about classification from directly classifying new examples based on old ones, to directly modeling a function that separates classes (i.e. creates boundaries). The **discriminant** is the function that models decision boundaries between classes. When the classes are well bounded, any novel example simply belongs to the class whose region it falls in.

Like all frameworks we choose for classification, there are some trade-offs we must consider. How well decision boundaries work depends on how complex the discriminant must be to separate the class distributions, that is, the regions of the space each class occupies. If a discriminant that is simple (compared to the class distributions) is enough, then this is a good approach. However, if the data presents ambiguous overlap between class distributions, then this method becomes inaccurate. In theory, using very complex, non-linear discriminants to approximate boundaries sounds nice, and can potentially separate the classes perfectly. In practice, however, the space of candidate discriminants becomes intractable to search.

<img src="../img/3_discriminant.png" alt="http://uc-r.github.io/discriminant_analysis" style="width: 500px;"/>

Consider the image above with class distributions 1 (red), 2 (green), and 3 (blue). The space the distributions exist in (not necessarily 2-dimensional!) is separated into three distinct sections by two purple discriminant lines. Here is where the intuition behind our trade-off begins to take effect. The image on the left is divided by linear boundaries, which scale well to many classes but lack predictive accuracy (notice the red "1s" creeping into the blue "3s" region). The image on the right improves on the linear model's accuracy by generating non-linear boundaries (quadratic in this case), at the expense of longer training time and a more irregular hypothesis space. The above example is simple enough that a non-linear boundary for three classes is feasible. If we consider examples with many dimensions and many classes, however, generating non-linear boundary functions becomes much harder to manage.

In future lessons we will return to this idea, and hopefully change the notion that non-linear, high-dimensional boundaries are difficult by applying some clever tricks, but first we will explore the simplest case, by restricting ourselves to *linear* discriminant functions.

### Mathematical Intuition

Before we tackle the use of a Linear Discriminant, let us once again consider the mathematics behind our newly found intuition.

Bear in mind, when we say "linear" we mean that the boundary will be a **hyperplane**, which is a flat subspace with one dimension fewer than our distribution space. In 2 dimensions this hyperplane is a line and in 3 dimensions it is a plane; in higher dimensions we still refer to this as a linear model.

In low-dimensional spaces, the concept of a hyperplane is immediately obvious. However, in much higher dimensions it becomes nearly impossible to visualize, let alone to determine which side of a boundary a data point lies on. Luckily, there is only one trick you need to know to determine which side of a hyperplane you are on. Consider the example data point as a vector $\vec{a}=[a_1\:\dots\:a_n]$, and describe the hyperplane boundary by a vector perpendicular to it, $\vec{w}=[w_1\:\dots\:w_n]$. Project the point onto the line perpendicular to the hyperplane using a **dot product** of vectors, written as:

\begin{equation*}
\vec{a}\cdot\vec{w}=\sum_{i=1}^n{a_iw_i}=a_1w_1+\dots+a_nw_n
\end{equation*}

then check whether the projected value (which is a scalar) is greater or less than a threshold determined by where the hyperplane intersects the line (i.e. whether $\vec{a}\cdot\vec{w}+b<0$ or $\vec{a}\cdot\vec{w}+b>0$, where $b$ is an offset that sets the threshold).
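The side test above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the function names `dot` and `side_of_hyperplane` and the values chosen for $\vec{w}$ and $b$ are assumptions made up for this example.

```python
def dot(a, w):
    """Dot product: a_1*w_1 + ... + a_n*w_n."""
    return sum(a_i * w_i for a_i, w_i in zip(a, w))


def side_of_hyperplane(a, w, b):
    """Return +1 or -1 for the side of the hyperplane the point a lies on,
    or 0 if a lies exactly on the hyperplane (a.w + b == 0)."""
    score = dot(a, w) + b
    if score > 0:
        return 1
    if score < 0:
        return -1
    return 0


# Illustrative (made-up) hyperplane in 2 dimensions.
w = [1.0, -1.0]
b = 0.5

print(side_of_hyperplane([2.0, 1.0], w, b))  # 2 - 1 + 0.5 = 1.5  -> 1
print(side_of_hyperplane([0.0, 3.0], w, b))  # 0 - 3 + 0.5 = -2.5 -> -1
```

The same function works unchanged for vectors of any length, which is the point of the trick: the check stays a single sum no matter how many dimensions the data has.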

If every data point can be correctly classified by such a hyperplane boundary, then we say the classes are **linearly separable**. If no hyperplane exists that can separate our classes correctly, we say our data is not linearly separable.
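As a small sketch of what "correctly classified by a hyperplane" means, the hypothetical helper below checks whether one given candidate hyperplane separates a labeled dataset. The labels are assumed to be encoded as +1 and -1, and the points and hyperplane values are invented for illustration.

```python
def separates(points, labels, w, b):
    """True if the hyperplane (w, b) puts every +1 point on the positive
    side and every -1 point on the negative side."""
    for x, y in zip(points, labels):
        score = sum(x_i * w_i for x_i, w_i in zip(x, w)) + b
        if y * score <= 0:  # wrong side, or exactly on the boundary
            return False
    return True


corners = [(0, 0), (1, 0), (0, 1), (1, 1)]
and_labels = [-1, -1, -1, 1]   # only (1, 1) is positive
xor_labels = [-1, 1, 1, -1]    # opposite corners share a class

w, b = (1.0, 1.0), -1.5        # illustrative candidate hyperplane

print(separates(corners, and_labels, w, b))  # True
print(separates(corners, xor_labels, w, b))  # False
```

Note that a `False` result only says this particular hyperplane fails. Showing that data is not linearly separable requires ruling out every hyperplane; the XOR-style labeling above happens to be a well-known case where none exists.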

How many boundaries do we need for our dataset? A single boundary is enough to separate two classes of data points (when they are linearly separable). However, if we have more than two classes we need multiple decision boundaries. The typical approach is to use a *one-vs-all* method, meaning there will be one decision boundary per class, separating that class from all the others. This can lead to some problems, specifically when a new point lies in a region claimed by more than one hyperplane boundary. Several algorithms exist for choosing a hyperplane; the most notable is "gradient descent", which works by picking a random candidate hyperplane and repeatedly making small adjustments to it in ways that reduce an error measure. While this level of specificity is beyond the scope of this lesson, if this subject is interesting to you I encourage you to read about techniques for generating "good" hyperplane boundaries.
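To make the "random start, small adjustments" idea concrete, here is a minimal sketch for two classes. It assumes one particular error measure, the perceptron criterion (the sum of $-y(\vec{w}\cdot\vec{x}+b)$ over misclassified points, with labels $y$ encoded as +1 and -1), and applies the update one point at a time. The function name, learning rate, and toy data are all assumptions for illustration; other error measures lead to different update rules.

```python
import random


def train_linear_boundary(points, labels, learning_rate=0.1,
                          max_epochs=1000, seed=0):
    """Start from a random hyperplane and nudge it toward each
    misclassified point until a full pass makes no mistakes."""
    rng = random.Random(seed)
    w = [rng.uniform(-1.0, 1.0) for _ in range(len(points[0]))]
    b = rng.uniform(-1.0, 1.0)

    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            score = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
            if y * score <= 0:  # misclassified (or on the boundary)
                # Step against the gradient of -y * (w.x + b).
                w = [w_i + learning_rate * y * x_i for w_i, x_i in zip(w, x)]
                b += learning_rate * y
                mistakes += 1
        if mistakes == 0:  # every training point is on its correct side
            break
    return w, b


# Toy, linearly separable data (made up for this example).
points = [(0, 0), (1, 0), (0, 1), (3, 3), (4, 3), (3, 4)]
labels = [-1, -1, -1, 1, 1, 1]

w, b = train_linear_boundary(points, labels)
print("w =", w, "b =", b)
```

On linearly separable data like this, the loop stops once every point is on its correct side. On data that is not linearly separable it would simply run until `max_epochs`, which is one reason practical methods use other error measures.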

As we will see, finding the "best" boundary hyperplane is non-trivial. There may be choices that give perfect accuracy on a training set, or there may be none. In either case, let's first explore a linear technique for generating boundaries.
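The sketch below illustrates why training accuracy alone does not identify a "best" boundary. Two hand-picked hyperplanes (both invented for this example, as are the data and helper names) classify the same toy training set perfectly, yet disagree on a new point.

```python
def predict(x, w, b):
    """Classify x as +1 or -1 by the sign of w.x + b."""
    score = sum(x_i * w_i for x_i, w_i in zip(x, w)) + b
    return 1 if score > 0 else -1


def accuracy(points, labels, w, b):
    """Fraction of points whose predicted label matches the true label."""
    correct = sum(1 for x, y in zip(points, labels) if predict(x, w, b) == y)
    return correct / len(points)


points = [(0, 0), (1, 0), (0, 1), (3, 3), (4, 3), (3, 4)]
labels = [-1, -1, -1, 1, 1, 1]

boundary_a = ((1.0, 1.0), -3.5)  # diagonal line x1 + x2 = 3.5
boundary_b = ((1.0, 0.0), -2.0)  # vertical line x1 = 2

for w, b in (boundary_a, boundary_b):
    print(accuracy(points, labels, w, b))  # 1.0 for both boundaries

new_point = (2.5, 0.0)
print(predict(new_point, *boundary_a))  # -1
print(predict(new_point, *boundary_b))  # 1
```

Both boundaries are "perfect" on the training set, so some additional criterion is needed to prefer one over the other.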

## Linear Discriminant Analysis

TODO: This section.