# Linear Models & Support Vector Machines

In this lesson we will explore a new way to think about classification. Rather than modeling the entire distribution of classes by directly classifying new examples based on old ones (e.g. *k*-Nearest Neighbor), we can simply model the *boundaries* between different classes.

First, we will gain some intuition around what modeling boundaries means, and define some terms to help guide our motivation before exploring two different algorithms for boundary creation.

Next, we will explore Linear Discriminant Analysis (LDA), which will be our first exposure to representation learning; specifically classification via dimensionality reduction. LDA is a supervised algorithm, typically used for binary (two class) classification problems.

Finally, we will expand on LDA's data transformation techniques with Support Vector Machines. SVM is a supervised algorithm, typically used for classification or regression problems.

## Introduction

We are going to shift our way of thinking about classification from directly classifying new examples based on old ones, to directly modeling a function that separates classes (i.e. creating boundaries). The **discriminant** is the function that models decision boundaries between classes. In a well-bounded example, any novel example will simply belong to the class it is bound by.

Like all frameworks we choose, there are trade-offs we must consider. In the case of classification, creating decision boundaries between class distributions, or the range of classes that exist on a plane, depends on how complex the discriminant is. If the discriminant is simple (compared to the class distribution) then this is a good approach. However, if the data presents ambiguous overlap between class distributions, then this method will not make accurate predictions. In theory, using very complex, non-linear discriminants to approximate boundaries sounds nice, and can potentially separate boundaries perfectly. In practice, however, it often results in a space that is intractable.

<img src="../img/3_discriminant.png" alt="http://uc-r.github.io/discriminant_analysis" style="width: 400px;"/>

Consider the image above with class distributions 1 (red), 2 (green), and 3 (blue). The plane the distribution exists in (not necessarily a 2-dimensional plane!) is separated into three distinct sections by two (or three) purple discriminant lines. Here is where the intuition behind our trade-off begins to take effect. The image on the left is divided by linear boundaries, which are cheap to compute even for many classes but lack predictive accuracy (notice the red "1s" creeping into the blue "3s" boundary). The image on the right improves on the linear model's accuracy by generating non-linear boundaries, quadratic in this case, at the expense of longer training time and an irregular hypothesis space. The above example is simple enough that a non-linear boundary for three classes is feasible. If we consider examples with many dimensions and many classes, then generating non-linear function boundaries becomes much harder to manage.

We will return to this problem, and hopefully change the notion that non-linear, high-dimensional boundaries are difficult by applying some clever tricks, but first we will explore the simplest case, by restricting ourselves to *linear* discriminant functions.

### Mathematical Intuition

Let us once again consider the mathematics behind our newly found intuition.

When we say "linear" we mean that the boundary will be a **hyperplane**, which is a flat surface with one dimension fewer than the space our data lives in. In 2 dimensions this hyperplane is a line, but in higher dimensions we can still refer to this as a "linear" model.

In low-dimensional spaces, the concept of a hyperplane is immediately obvious. However, in much higher dimensions it becomes nearly impossible to conceptualize, let alone to determine which side of a boundary a data point lies on. Luckily, there is only one trick you need to know to determine which side of a hyperplane you are on. Consider the example vector $\vec{a}=[a_1\:\dots\:a_n]$, and the hyperplane boundary encoded by the vector $\vec{w}=[w_1\:\dots\:w_n]$, which is perpendicular to the hyperplane. Project the point onto a line perpendicular to the hyperplane using a **dot product** of vectors, written as:

\begin{equation*}
\vec{a}\cdot\vec{w}=\sum_{i=1}^n{a_iw_i}=a_1w_1+\dots+a_nw_n
\tag{3.1}
\end{equation*}

then check whether the projected value (which is a scalar) is greater or less than a threshold value based on where the hyperplane intersects the line (i.e. $\vec{a}\cdot\vec{w}+b<0$ or $\vec{a}\cdot\vec{w}+b>0$, where $b$ is a threshold value).
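The side-of-hyperplane test can be sketched in a few lines of plain Python. The hyperplane used here ($x+y-3=0$) and the function names `dot` and `side_of_hyperplane` are illustrative assumptions, not part of the lesson.

```python
def dot(a, w):
    """Dot product of two equal-length vectors (equation 3.1)."""
    return sum(a_i * w_i for a_i, w_i in zip(a, w))


def side_of_hyperplane(a, w, b):
    """Return +1, -1, or 0 depending on the sign of a.w + b."""
    score = dot(a, w) + b
    if score > 0:
        return 1
    if score < 0:
        return -1
    return 0  # the point lies exactly on the hyperplane


# Hypothetical hyperplane x + y - 3 = 0 in 2 dimensions.
w = [1.0, 1.0]
b = -3.0

print(side_of_hyperplane([4.0, 2.0], w, b))  # 4 + 2 - 3 = 3 > 0, prints 1
print(side_of_hyperplane([0.5, 1.0], w, b))  # 0.5 + 1 - 3 = -1.5 < 0, prints -1
```

The same function works unchanged in any number of dimensions, which is the point of the trick: no visualization of the hyperplane is needed.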

If every data point can be correctly classified by this hyperplane boundary, then we say the classes are **linearly separable**. If no hyperplane exists that can separate our classes correctly, we say our data is not linearly separable.

How do we choose how many boundaries exist for our dataset? A single boundary is enough to separate two classes of data points. If we are presented with more than two classes, we need multiple decision boundaries. The typical approach is to use a *one-vs-all* method, meaning there will be one decision boundary per class. This can lead to some problems, specifically where a new point lies in a region claimed by more than one hyperplane. Several algorithms exist to choose a hyperplane. The most notable is "gradient descent", which works by picking a random candidate split plane and repeatedly making small adjustments to it in ways that reduce the error measure. While this level of specificity is beyond the scope of this lesson, if this subject is interesting to you I encourage you to read about techniques for generating "good" hyperplane boundaries.
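To make the one-vs-all idea concrete, here is a minimal sketch. The three boundaries are hand-picked, hypothetical values rather than learned ones, and resolving overlapping regions by taking the highest score is one common convention that this lesson does not prescribe.

```python
def dot(a, w):
    """Dot product of two equal-length vectors."""
    return sum(a_i * w_i for a_i, w_i in zip(a, w))


# Hypothetical one-vs-all model: one (w, b) pair per class.
# A positive score means "this class claims the point".
boundaries = {
    "red": ([1.0, 0.0], -2.0),     # claims points with x > 2
    "green": ([-1.0, 0.0], -2.0),  # claims points with x < -2
    "blue": ([0.0, 1.0], -2.0),    # claims points with y > 2
}


def predict(point):
    """Score the point against every boundary and keep the best class."""
    scores = {label: dot(point, w) + b for label, (w, b) in boundaries.items()}
    return max(scores, key=scores.get)


# (3, 4) is claimed by both "red" (score 1) and "blue" (score 2).
print(predict([3.0, 4.0]))  # prints blue
# (5, 1) is claimed only by "red" (score 3).
print(predict([5.0, 1.0]))  # prints red
```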

As we will see, finding the "best" boundary hyperplane is non-trivial. There may be choices that give perfect accuracy on a training set, or there may be none. In either case, let's first explore a linear technique for generating boundaries.

## Linear Discriminant Analysis

Consider the following data point distribution of two classes, red and blue:

<img src="../img/3_linearly_inseperable_data.png" alt="https://sthalles.github.io/fisher-linear-discriminant/" style="width: 300px;"/>

Clearly, there is no way to draw a line to separate the red and blue points and achieve good predictive results, as seen in the left hand image below. Instead, we can *transform* the data in such a way that we can separate classes with a line, as seen in the right hand image below by squaring the input feature vectors. How did we know to square the input feature vectors to make this problem linearly separable? The answer to this is not trivial, and involves learning a representation, which we will cover in more depth in the "Deep Learning" lessons.
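The effect of squaring the features can be checked on a toy dataset. The points below are made up for illustration (an inner cluster surrounded by an outer ring) and are not the data shown in the figures; the boundary $x^2+y^2=1$ is likewise an assumption chosen by hand.

```python
# Hypothetical data: the inner class sits near the origin, the outer class surrounds it.
inner = [(0.5, 0.0), (-0.3, 0.4), (0.0, -0.6), (0.2, 0.2)]
outer = [(2.0, 0.0), (-1.5, 1.5), (0.0, -2.5), (1.8, -1.2)]


def square_features(point):
    """Transform (x, y) into (x^2, y^2)."""
    x, y = point
    return (x * x, y * y)


def linear_score(features, w=(1.0, 1.0), b=-1.0):
    """Linear discriminant w.features + b in the transformed space."""
    return sum(f * w_i for f, w_i in zip(features, w)) + b


inner_scores = [linear_score(square_features(p)) for p in inner]
outer_scores = [linear_score(square_features(p)) for p in outer]

# A single line in the squared space puts every inner point on the
# negative side and every outer point on the positive side.
print(all(s < 0 for s in inner_scores))  # prints True
print(all(s > 0 for s in outer_scores))  # prints True
```

In the original $(x, y)$ space the same boundary is a circle, which is why no straight line could do the job before the transformation.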

<img src="../img/3_feature_transformation.png" alt="https://sthalles.github.io/fisher-linear-discriminant/" style="width: 500px;"/>

LDA, often referred to as **Fisher’s Linear Discriminant**, relies on a process of transforming data into a lower dimensional representation of the original space, known as **dimensionality reduction**. Dimensionality reduction will be covered in much greater detail in Lesson 8, but for now we will use it as a tool simply to show how boundaries can be created in multi-class datasets. We attempt to find a transformed representation of our data samples by projecting the samples onto a space with one less dimension. For example, using a dimensionality reduction algorithm, we can project data points in an x-y-z, 3-dimensional space onto an x-y, 2-dimensional plane. You can essentially run the same reduction repeatedly as a form of dimensionality reduction, but this is not advised for classification problems. Typically, you want to project your data into a space one dimension less, then apply some other classification algorithm to the projected data.

Fisher wanted to maximize a function that returns a large variance among classes, and a small variance within classes. In other words, a function that, when maximized, increases the distance between *different* classes, and reduces the distance between *same* classes, thus minimizing the overlap among *different* classes. Recall our issue stated in the Introduction, where a linear separation is not accurate, since there are classes of a different label belonging to another label's boundary. LDA attempts to resolve this issue by finding a *distinct* linear boundary via dimensionality reduction. 

It should be noted, like all frameworks, there is no single model for every type of data, and it is possible that LDA can't find a linearly separable boundary between classes that gives reasonable accuracy. We are simply gaining another method to make predictions; and in Lesson 6 we will apply our machine learning models to different types of data, then evaluate and understand why some models work better in different situations.

## Mathematical Intuition

### Linear Algebra

Recall from the section above that we can encode a hyperplane as a vector, $\vec{w}$. Let us treat this vector as a **basis vector**, and project our data points onto it. A **basis** is a set of vectors that **spans** a **vector space**, defined as follows: given a vector space $V=\mathbb{R}^n$, a set of vectors $\{\vec{v}_1,\dots,\vec{v}_n\}\subset V$ is a basis for $V$ if every element in $V$ can be written as a unique linear combination of the $\vec{v}_i$. A single vector $\vec{w}$ is a basis for the line through the origin in its direction, so once a data point is projected onto that line we can encode it as a scalar value (i.e. a distance from the origin along the vector).

Here is some intuition behind the paragraph above. Let's take the vector space $V=\mathbb{R}^2$, defined in the x-y coordinate plane, where points are written $(x,y)$. Consider a point $(1,2)$ on this plane. We can write this coordinate as the sum (or linear combination) of its $x$ and $y$ components:

\begin{equation*}
(1,2)=1\cdot(1,0)+2\cdot(0,1)
\end{equation*}

Where $(1,0)$ and $(0,1)$ are unit vectors. The pair of unit vectors *spans* $V$, since any vector in $V$ can be decomposed into a row of coefficients multiplied by a matrix whose rows are the basis vectors:

\begin{equation*}
(a,b)=a\cdot(1,0)+b\cdot(0,1)=
\begin{pmatrix}
a & b
\end{pmatrix}
\begin{pmatrix}
1 & 0\\
0 & 1
\end{pmatrix}
\end{equation*}

Notice that the unit vectors are not the only basis for $V$; we can choose $\{(1,1),(0,1)\}$ and form the linear combination:

\begin{equation*}
(a,b)=a\cdot(1,1)+(b-a)\cdot(0,1)
\end{equation*}

In fact, we can take any real value $t\in\mathbb{R}$ such that:
\begin{equation*}
(a,b)=a\cdot(1,t)+(b-at)\cdot(0,1)=
\begin{pmatrix}
a & b-at
\end{pmatrix}
\begin{pmatrix}
1 & t\\
0 & 1
\end{pmatrix}
\end{equation*}

This pair of vectors spans $V$ for any choice of $t$.
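As a quick check of the change of basis, here is a minimal sketch; the helper names `combine` and `coefficients` are illustrative. For the basis $\{(1,t),(0,1)\}$ the coefficients of $(a,b)$ are $a$ and $b-at$, which reduces to $b-a$ when $t=1$.

```python
def combine(c1, v1, c2, v2):
    """Return the linear combination c1*v1 + c2*v2 of two 2-D vectors."""
    return (c1 * v1[0] + c2 * v2[0], c1 * v1[1] + c2 * v2[1])


def coefficients(point, t):
    """Coefficients of `point` in the basis {(1, t), (0, 1)}."""
    a, b = point
    return a, b - a * t


point = (1, 2)
for t in (0, 1, 5):
    c1, c2 = coefficients(point, t)
    rebuilt = combine(c1, (1, t), c2, (0, 1))
    # The coefficients change with t, but the rebuilt point never does.
    print(t, (c1, c2), rebuilt)
```

The coefficients differ for each basis, yet each combination reproduces the same point $(1,2)$, which is what it means for each pair of vectors to span $V$.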

We can multiply vectors (or 1-dimensional matrices) together as follows:
\begin{equation*}
\begin{bmatrix}a & b & c\end{bmatrix}\times\begin{bmatrix}d\\e\\f\end{bmatrix}=a\cdot d + b\cdot e + c\cdot f
\end{equation*}

Note that the result is a scalar value, and that we obtain it only by multiplying a row vector by a column vector, in that order, since the inner dimensions must be compatible. Suppose we are given two column vectors (i.e. $n\times 1$ matrices) $A$ and $B$. If we want to multiply them into a scalar, we must take the **transpose** of one of them, say $A$, denoted $A^\intercal$, to convert it to a compatible row vector for multiplication with $B$, hence $A^\intercal\times B$.
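The role of the transpose is easiest to see when shapes are explicit. The sketch below stores matrices as nested lists (an assumption made for illustration, so a column vector is an $n\times 1$ list of one-element rows) and uses two small hypothetical helpers, `transpose` and `matmul`.

```python
def transpose(m):
    """Swap the rows and columns of a matrix stored as nested lists."""
    return [list(row) for row in zip(*m)]


def matmul(p, q):
    """Multiply an (r x k) matrix by a (k x c) matrix."""
    if len(p[0]) != len(q):
        raise ValueError("inner dimensions must match")
    return [[sum(p[i][k] * q[k][j] for k in range(len(q)))
             for j in range(len(q[0]))]
            for i in range(len(p))]


A = [[1], [2], [3]]  # 3 x 1 column vector
B = [[4], [5], [6]]  # 3 x 1 column vector

# Row times column: (1 x 3)(3 x 1) gives a 1 x 1 result, i.e. a scalar.
inner = matmul(transpose(A), B)
print(inner)  # prints [[32]] because 1*4 + 2*5 + 3*6 = 32

# Column times row: (3 x 1)(1 x 3) gives a 3 x 3 matrix, not a scalar.
outer = matmul(A, transpose(B))
print(outer)  # prints [[4, 5, 6], [8, 10, 12], [12, 15, 18]]
```

Multiplying `A` by `B` directly raises the `ValueError`, since a $3\times 1$ matrix cannot be multiplied by another $3\times 1$ matrix.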

This is all the Linear Algebra required to understand LDA, but I encourage you to deepen your understanding of the subject, as it will be useful for future lessons.

### Linear Discriminant Analysis

For the sake of example, we will consider a problem space in $\mathbb{R}^2$, and a projection space in $\mathbb{R}$ represented by a straight line. Consider the image below, where classes (represented as green and black dots on an x-y axis) are projected onto a new hyperplane in red. The hyperplane line is simply a basis vector that spans the set of feature vectors.

<img src="../img/3_lda.jpg" alt="https://www.geeksforgeeks.org/ml-linear-discriminant-analysis/" style="width: 350px;"/>

So how does LDA choose a "best" hyperplane? There are two criteria for choosing:
1. Maximizing the distance between mean projected values of different classes.
2. Minimizing the **scatter**, or the "same-class" variance in the projected space, within each class.

Formally, LDA aims to maximize the **objective function** $J$, the ratio of the *squared* difference of per-class projected means ($\hat{m}$) to the sum of per-class scatters ($\hat{s}$) for classes $a$ and $b$:

\begin{equation*}
J(\vec{w})=\frac{(\hat{m_a}-\hat{m_b})^2}{\hat{s_a}+\hat{s_b}}
\tag{3.2}
\end{equation*}

Ideally, the numerator of equation (3.2) will be large, and the denominator will be small, satisfying the criteria above. This seems satisfying, but let's dig a little deeper into what LDA is actually doing.
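To see how the objective in equation (3.2) separates good projection directions from poor ones, here is a small sketch on made-up data. The two classes and the candidate directions are hypothetical, and the function names are illustrative.

```python
def project(points, w):
    """Project each point onto w with a dot product."""
    return [sum(w_i * x_i for w_i, x_i in zip(w, x)) for x in points]


def mean(values):
    return sum(values) / len(values)


def scatter(values):
    """Sum of squared differences from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def fisher_criterion(class_a, class_b, w):
    """Equation (3.2), evaluated in the projected space."""
    y_a, y_b = project(class_a, w), project(class_b, w)
    return (mean(y_a) - mean(y_b)) ** 2 / (scatter(y_a) + scatter(y_b))


# Hypothetical classes that differ mostly along the x axis.
class_a = [(1, 2), (2, 4), (3, 3)]
class_b = [(6, 1), (7, 3), (8, 2)]

print(fisher_criterion(class_a, class_b, (1, 0)))  # prints 6.25
print(fisher_criterion(class_a, class_b, (0, 1)))  # prints 0.25
```

Projecting onto the x axis keeps the classes far apart relative to their spread, so it scores much higher than projecting onto the y axis, where the two classes overlap.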

Let vector $\vec{w}$ be a vector **normal**, or perpendicular, to the hyperplane, illustrated by the dotted lines above. $\vec{w}$ is used to project vector $\vec{x}$ in our input space $\mathbb{R}^2$ onto $y$, a scalar value in $\mathbb{R}$. Clearly, $\vec{w}$ has the same dimension as $\vec{x}$. We can write our scalar value $y$ as a linear function:

$$y=\vec{w}^\intercal\vec{x}\tag{3.3}$$

Let $t$ be a threshold that separates classes in our projected space, $\mathbb{R}$. For an input vector $\vec{x}$, if the projected value $y\geq t$ then $\vec{x}$ belongs to class $a$, or $C_a$, otherwise $\vec{x}$ belongs to class $b$, or $C_b$. We would like to choose a line to maximize the separability of points $y$ based on our first criterion.

In order to do this, we must first define a measure of separation between projected classes. Let $m_i$ and $\hat{m_i}$ represent the mean of each class $i$ in the original space (a vector) and in the projected space (a scalar) respectively:

\begin{align}
m_i&=\frac{1}{N_i}\sum_{\vec{x}\in C_i}{\vec{x}} \\
\hat{m_i}&=\frac{1}{N_i}\sum_{y\in C_i}{y}=\frac{1}{N_i}\sum_{\vec{x}\in C_i}{\vec{w}^\intercal\vec{x}}=\frac{\vec{w}^\intercal}{N_i}\sum_{\vec{x}\in C_i}{\vec{x}}=\vec{w}^\intercal m_i
\tag{3.4}
\end{align}
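The identity $\hat{m_i}=\vec{w}^\intercal m_i$ says that projecting every point and then averaging gives the same number as averaging first and projecting once. A minimal numeric check, using a made-up class and direction:

```python
import math


def dot(u, v):
    return sum(u_i * v_i for u_i, v_i in zip(u, v))


def mean_vector(points):
    """Component-wise mean of a list of points (m_i in equation 3.4)."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


# Hypothetical class and projection direction.
class_a = [(1, 2), (2, 4), (3, 3)]
w = [2, -1]

projected = [dot(w, x) for x in class_a]
mean_of_projections = sum(projected) / len(projected)
projection_of_mean = dot(w, mean_vector(class_a))

print(mean_of_projections, projection_of_mean)
print(math.isclose(mean_of_projections, projection_of_mean))  # prints True
```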

We aren't quite done yet, as Fisher's Linear Discriminant aims to normalize the difference of means by the same-class variance, or scatter, defined in our second criterion. Let $\hat{s_i}$ represent the scatter for class $i$ in the projected space. The scatter is the sum of squared differences between values and their class's mean, which we can rewrite in terms of the original space:

\begin{align}
\hat{s_i}&=\sum_{y\in C_i}{(y-\hat{m_i})^2} \\
&=\sum_{\vec{x}\in C_i}{(\vec{w}^\intercal\vec{x}-\vec{w}^\intercal m_i)^2}=\vec{w}^\intercal\left(\sum_{\vec{x}\in C_i}{(\vec{x}-m_i)(\vec{x}-m_i)^\intercal}\right)\vec{w}
\tag{3.5}
\end{align}

Let $S_i$ represent the **scatter matrix** (an unnormalized **covariance matrix**) for class $i$:
\begin{equation*}
S_i=\left(\sum_{\vec{x}\in C_i}{(\vec{x}-m_i)(\vec{x}-m_i)^\intercal}\right)
\tag{3.6}
\end{equation*}

The same-class scatter matrix, $S_w$, for classes $a$, and $b$ is $S_w=S_a+S_b$. Therefore we can write equation (3.5) as:

\begin{equation*}
\hat{s_a}+\hat{s_b}=\vec{w}^\intercal S_w\vec{w}=\hat{S_w}
\tag{3.7}
\end{equation*}

Where $\hat{S_w}$ is the same-class scatter of the projected values $y$, which is a scalar.
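The claim that the projected scatters sum to $\vec{w}^\intercal S_w\vec{w}$ can be verified numerically. The sketch below uses made-up data and illustrative helper names, and computes the quantity both ways.

```python
import math


def mean_vector(points):
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def scatter_matrix(points):
    """Sum of outer products (x - m)(x - m)^T over a class (equation 3.6)."""
    m = mean_vector(points)
    d = len(m)
    S = [[0.0] * d for _ in range(d)]
    for x in points:
        diff = [x_i - m_i for x_i, m_i in zip(x, m)]
        for i in range(d):
            for j in range(d):
                S[i][j] += diff[i] * diff[j]
    return S


def quadratic_form(w, S):
    """Compute w^T S w."""
    n = len(w)
    return sum(w[i] * S[i][j] * w[j] for i in range(n) for j in range(n))


def projected_scatter(points, w):
    """Scatter of the projected values y = w^T x."""
    ys = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for x in points]
    y_mean = sum(ys) / len(ys)
    return sum((y - y_mean) ** 2 for y in ys)


# Hypothetical classes and projection direction.
class_a = [(1, 2), (2, 4), (3, 3)]
class_b = [(6, 1), (7, 3), (8, 2)]
w = [2, -1]

S_a, S_b = scatter_matrix(class_a), scatter_matrix(class_b)
S_w = [[S_a[i][j] + S_b[i][j] for j in range(2)] for i in range(2)]

lhs = projected_scatter(class_a, w) + projected_scatter(class_b, w)
rhs = quadratic_form(w, S_w)
print(lhs, rhs)
print(math.isclose(lhs, rhs))  # prints True
```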

Using a similar method, we can express the difference of means between the projected values in $y$ space in terms of our original $\vec{x}$ space.

\begin{equation*}
(\hat{m_a}-\hat{m_b})^2=\vec{w}^\intercal(m_a-m_b)(m_a-m_b)^\intercal\vec{w}=\vec{w}^\intercal S_m\vec{w}=\hat{S_m}
\tag{3.8}
\end{equation*}

Where $S_m=(m_a-m_b)(m_a-m_b)^\intercal$ is the different-class scatter matrix in the original space, and $\hat{S_m}$ is the different-class scatter of the projected values $y$.

We can now write $J$ in terms of our original feature space:

\begin{equation*}
J(\vec{w})=\frac{(\hat{m_a}-\hat{m_b})^2}{\hat{s_a}+\hat{s_b}}=\frac{\vec{w}^\intercal S_m\vec{w}}{\vec{w}^\intercal S_w\vec{w}}
\tag{3.9}
\end{equation*}
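As a sanity check, $J$ can be evaluated entirely from quantities in the original feature space, with the different-class scatter $\vec{w}^\intercal S_m\vec{w}$ in the numerator and the same-class scatter $\vec{w}^\intercal S_w\vec{w}$ in the denominator. The data, direction, and helper names below are hypothetical.

```python
import math


def mean_vector(points):
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def outer(u, v):
    """Outer product u v^T as a nested list."""
    return [[u_i * v_j for v_j in v] for u_i in u]


def add(P, Q):
    return [[p + q for p, q in zip(rp, rq)] for rp, rq in zip(P, Q)]


def scatter_matrix(points):
    """Sum of (x - m)(x - m)^T over one class."""
    m = mean_vector(points)
    S = [[0.0] * len(m) for _ in m]
    for x in points:
        diff = [x_i - m_i for x_i, m_i in zip(x, m)]
        S = add(S, outer(diff, diff))
    return S


def quadratic_form(w, S):
    """Compute w^T S w."""
    n = len(w)
    return sum(w[i] * S[i][j] * w[j] for i in range(n) for j in range(n))


def fisher_criterion(class_a, class_b, w):
    """J(w) written with scatter matrices of the original space."""
    m_a, m_b = mean_vector(class_a), mean_vector(class_b)
    mean_diff = [a - b for a, b in zip(m_a, m_b)]
    S_m = outer(mean_diff, mean_diff)                         # different-class
    S_w = add(scatter_matrix(class_a), scatter_matrix(class_b))  # same-class
    return quadratic_form(w, S_m) / quadratic_form(w, S_w)


class_a = [(1, 2), (2, 4), (3, 3)]
class_b = [(6, 1), (7, 3), (8, 2)]

J_x = fisher_criterion(class_a, class_b, [1, 0])
J_diag = fisher_criterion(class_a, class_b, [2, -1])
print(J_x)     # prints 6.25
print(J_diag)  # 121 / 12, roughly 10.08
```

For the direction $(1,0)$ this gives the same $6.25$ obtained by projecting the points first and applying equation (3.2) directly, and the direction $(2,-1)$ scores higher still, which is the kind of improvement the maximization is looking for.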

In order to find the maximum of $J$, we can differentiate it with respect to $\vec{w}$ and set the derivative equal to zero.