# Linear Models & Support Vector Machines

In this lesson we will explore a new way to think about classification. Rather than classifying new examples directly from old ones, which implicitly models the entire distribution of classes (e.g. *k*-Nearest Neighbors), we can simply model the *boundaries* between different classes.

First, we will gain some intuition around what modeling boundaries means, and define some terms to help guide our motivation before exploring two different algorithms for boundary creation.

Next, we will explore the first of our boundary creation algorithms: Linear Discriminant Analysis (LDA). LDA will be our first exposure to representation learning, specifically classification via dimensionality reduction. LDA is a supervised algorithm, typically used for binary (two-class) classification problems.

Finally, we will expand on LDA's data transformation techniques with Support Vector Machines. SVM is a supervised algorithm, typically used for classification or regression problems.

## Decision Boundaries

We are going to shift our way of thinking about classification from directly classifying new examples based on old ones, to directly modeling a function that separates classes (i.e. creating boundaries). The **discriminant** is the function that models decision boundaries between classes. In a well-bounded space, a novel data point simply belongs to the class whose boundaries enclose it. This bounded region is known as the **decision region**.

Like any framework, this one comes with trade-offs we must consider. In the case of classification, how well we can create decision regions between classes depends on how complex the discriminant needs to be. If a simple discriminant is enough (relative to how the class data points are distributed), then this is a good approach. However, if the class distributions overlap ambiguously, then this method will not make accurate predictions. In theory, using very complex, non-linear discriminants to approximate boundaries sounds nice, and can potentially separate classes perfectly. In practice, however, it often results in a space that is intractable.

<img src="../img/3_discriminant.png" alt="http://uc-r.github.io/discriminant_analysis" style="width: 400px;"/>

Consider the image above with class distributions 1 (red), 2 (green), and 3 (blue). The space the distributions exist in (not necessarily a 2-dimensional plane!) is separated into three distinct sections by two (or three) purple discriminant lines. Here is where the intuition behind our trade-off begins to take effect. The image on the left is divided by linear boundaries, which are performant for many classes but lack predictive accuracy (notice the red "1s" creeping into the blue "3s" region). The image on the right improves on the linear model's accuracy by generating non-linear boundaries, quadratic in this case, at the expense of an irregular hypothesis space. The above example is simple enough that a non-linear boundary for three classes is feasible. If we consider examples with many dimensions and many classes, however, generating non-linear boundaries becomes much harder to manage.

First, we will explore the simplest case, by restricting ourselves to *linear* discriminant functions.

### Mathematical Intuition

Let us once again consider the mathematics behind our newly found intuition.

When we say "linear" we mean the boundary will be a **hyperplane** (or **decision surface**), which is a flat surface with one dimension fewer than our example space. In 2 dimensions this hyperplane is a line, but in higher dimensions we can still refer to this as a "linear" model.

In low-dimensional spaces, the concept of a hyperplane is immediately obvious. In much higher dimensions, however, it becomes nearly impossible to visualize, let alone to determine which side of a boundary a data point lies on. Luckily, there is only one trick you need to know to determine which side of a hyperplane you are on. Consider the example vector $\mathbf{a}=[a_1\:\dots\:a_n]$, and the hyperplane boundary described by the vector $\mathbf{w}=[w_1\:\dots\:w_n]$, which points along the line perpendicular to the hyperplane. Project the point onto that perpendicular line using a **dot product** of vectors, written as:

\begin{equation*}
\mathbf{a}\cdot\mathbf{w}=\sum_{i=1}^n{a_iw_i}=a_1w_1+\dots+a_nw_n
\label{3.1}\tag{3.1}
\end{equation*}

Next, check whether the projected value (which is a scalar) is greater or less than a threshold value based on where the hyperplane intersects the line (i.e. check whether $\mathbf{a}\cdot\mathbf{w}+b<0$ or $\mathbf{a}\cdot\mathbf{w}+b>0$, where $b$ is a scalar threshold value).
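The side-of-hyperplane test above can be sketched in a few lines of NumPy. The function name `side_of_hyperplane` and the values chosen for `w` and `b` are illustrative assumptions, not part of the lesson.

```python
import numpy as np

def side_of_hyperplane(a, w, b):
    """Return +1, -1, or 0 according to the sign of a . w + b."""
    score = np.dot(a, w) + b  # scalar projection shifted by the threshold
    return int(np.sign(score))

# Illustrative (assumed) hyperplane in 2 dimensions.
w = np.array([1.0, -1.0])
b = 0.5

print(side_of_hyperplane(np.array([2.0, 1.0]), w, b))  # score is 2 - 1 + 0.5 = 1.5
print(side_of_hyperplane(np.array([0.0, 3.0]), w, b))  # score is 0 - 3 + 0.5 = -2.5
```

A result of `0` means the point lies exactly on the hyperplane, a case the strict inequalities above leave undecided.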

If all of our data points can be correctly classified by this hyperplane boundary, then we say the classes are **linearly separable**. If no hyperplane exists that can separate our classes correctly, we say our data is not linearly separable.

How do we choose how many boundaries exist for our dataset? A single boundary is enough to separate two classes of data points. If we are presented with more than two classes we need multiple decision boundaries. The typical approach is to use a *one-vs-all* method, meaning there will be one decision boundary per class. This can lead to some problems, specifically where a new point lies in a region claimed by more than one hyperplane. Several algorithms exist for choosing a hyperplane. The most notable is "gradient descent", which works by picking a random candidate hyperplane and repeatedly making small adjustments to it in ways that reduce an error measure. While this level of specificity is beyond the scope of this lesson, if this subject is interesting to you I encourage you to read about techniques for generating "good" hyperplane boundaries.
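To make the "small adjustments" idea concrete, here is a minimal sketch using a perceptron-style update rule, which is one simple instance of this family of methods and not necessarily the one you will use in practice. The synthetic data, the learning rate, and the epoch limit are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two well-separated synthetic 2-D clusters with labels +1 and -1.
X = np.vstack([
    rng.normal(loc=[2.0, 2.0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(50, 2)),
])
y = np.concatenate([np.ones(50), -np.ones(50)])

# Start from a small random candidate hyperplane.
w = rng.normal(scale=0.01, size=2)
b = 0.0
learning_rate = 0.1  # assumed step size

for epoch in range(100):
    mistakes = 0
    for x_i, y_i in zip(X, y):
        # A point is misclassified when it falls on the wrong side.
        if y_i * (np.dot(x_i, w) + b) <= 0:
            # Nudge the hyperplane toward classifying this point correctly.
            w += learning_rate * y_i * x_i
            b += learning_rate * y_i
            mistakes += 1
    if mistakes == 0:
        break  # every training point is on the correct side

accuracy = np.mean(np.sign(X @ w + b) == y)
print(accuracy)
```

This rule is only guaranteed to stop making mistakes when the data is linearly separable, which is why the clusters here are deliberately far apart.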

As we will see, finding the "best" boundary hyperplane is non-trivial. There may be choices that give perfect accuracy on a training set, or there may be none. In either case, let's first explore a linear technique for generating boundaries.

## Linear Discriminant Analysis

Consider the following data point distribution of two classes, red and blue:

<img src="../img/3_linearly_inseperable_data.png" alt="https://sthalles.github.io/fisher-linear-discriminant/" style="width: 350px;"/>

Clearly, there is no way to draw a line to separate the red and blue points and achieve good predictive results, as seen in the left hand image below. Instead, we can *transform* the data in such a way that we can separate classes with a line, as seen in the right hand image below by squaring the input feature vectors. How did we know to square the input feature vectors to make this problem linearly separable? The answer to this is not trivial, and involves learning a representation, which we will cover in more depth in the "Deep Learning" lessons.

<img src="../img/3_feature_transformation.png" alt="https://sthalles.github.io/fisher-linear-discriminant/" style="width: 500px;"/>
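The effect of squaring the features can be checked numerically. The sketch below assumes a synthetic dataset in which one class sits inside a ring formed by the other, which may differ from the data in the images. The helper `ring` and the chosen radii are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def ring(n, r_min, r_max):
    """Sample n points whose distance from the origin lies in [r_min, r_max]."""
    r = rng.uniform(r_min, r_max, size=n)
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n)
    return np.column_stack([r * np.cos(theta), r * np.sin(theta)])

inner = ring(100, 0.0, 1.0)  # one class clustered around the origin
outer = ring(100, 2.0, 3.0)  # the other class surrounding it

# Square each feature: (x1, x2) -> (x1^2, x2^2).
inner_sq = inner ** 2
outer_sq = outer ** 2

# In the squared space x1^2 + x2^2 is the squared radius, so the line
# z1 + z2 = 2.25 (a radius of 1.5) separates the two classes.
w = np.array([1.0, 1.0])
b = -2.25

print(np.all(inner_sq @ w + b < 0))
print(np.all(outer_sq @ w + b > 0))
```

Here the transformation was chosen by hand because we know the data is radial. Discovering such a transformation automatically is the harder problem.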

LDA, often referred to as **Fisher’s Linear Discriminant**, relies on a process of transforming data into a lower dimensional representation of the original space, known as **dimensionality reduction**. Dimensionality reduction will be covered in much greater detail in Lesson 8, but for now we will use it as a tool to create boundaries in two-class datasets. We attempt to find a transformed representation of our data samples by projecting the samples onto a space of one less dimension. You can essentially run LDA repeatedly as a form of dimensionality reduction (as long as each projection is orthogonal to the last), but this is not advised for classification problems. Typically, you want to project your data into a space of one less dimension, then apply some other classification algorithm to the projected data.

Fisher wanted to maximize a function that returns a large variance among classes, and a small variance within classes. In other words, a function that, when maximized, increases the distance between *different* classes, and reduces the distance within *same* class clusters. Recall the issue raised earlier, where a linear separation is not accurate because points of one class fall inside another class's decision region. LDA attempts to resolve this issue by finding a *distinct* linear boundary via changing the representation of the original data.

Note that LDA may not be able to find a linear boundary between classes that gives reasonable accuracy. We are simply gaining another method to make predictions. In Lesson 6 we will apply our machine learning models to different types of data, then evaluate and understand why some models work better on different datasets.

### Mathematical Intuition

For the sake of example, we will consider a binary classification problem with class labels $G$ and $B$, with features in $\mathbb{R}^2$, and a projection space in $\mathbb{R}$. Consider the image below, where classes (represented as green and black circles on an xy-axis) are *projected* onto the hyperplane in red.

Once points are projected onto a line, we can describe how they are dispersed. In the image below, if we were to split the red line in half we would notice the majority of green points exist above our imaginary split, and the majority of black points exist below it. We would like to find a distribution of points that are cleanly separated such that when a new point is added we can easily classify it based on what side of the hyperplane it exists.

<img src="../img/3_lda.jpg" alt="https://www.geeksforgeeks.org/ml-linear-discriminant-analysis/" style="width: 350px;"/>

So how does LDA help us choose a "best" hyperplane configuration? Fisher outlines the following two criteria:
1. Maximizing the distance between mean projected values of different classes.
2. Minimizing the **scatter**, or the "same-class" variance in the projected space, within each class.

Formally, LDA aims to maximize the **objective function**, $J$, which is the ratio of the *squared* difference between projected means ($\hat{m}$) to the sum of projected same-class scatter ($\hat{s}$):

\begin{equation}
J(\mathbf{w})=\frac{(\hat{m_G}-\hat{m_B})^2}{\hat{s_G}+\hat{s_B}}
\label{3.2}\tag{3.2}
\end{equation}

Subscripts $G$ and $B$ indicate the projected mean and projected scatter of classes $G$ and $B$ respectively. Let's dig a little deeper into what LDA is actually doing.

Let vector $\mathbf{w}$ define our hyperplane boundary as previously described: $\mathbf{w}$ is the **normal**, the vector perpendicular to the boundary at every point, and it gives the direction of the red line onto which we project. Its **transpose**, denoted by $\mathbf{w}^\intercal$, is the same vector written as a row, so that $\mathbf{w}^\intercal\mathbf{x}$ is the dot product from \ref{3.1}. Fisher's Linear Discriminant is defined by the linear function \ref{3.3}. Visually, the dotted lines map each feature vector to its projection on the red line. Fisher's Linear Discriminant projects vector $\mathbf{x}$ from our input space in $\mathbb{R}^2$ onto $y$, a scalar value in $\mathbb{R}$, by a linear combination of the components of $\mathbf{x}$ weighted by $\mathbf{w}$. We can write our scalar projection point $y$ as follows:

\begin{equation}
y=\mathbf{w}^\intercal\mathbf{x}
\label{3.3}\tag{3.3}
\end{equation}

Let $t$ be the threshold that separates classes in our projected space. For an input vector $\mathbf{x}$, if the projected value $y\geq t$ then $\mathbf{x}$ belongs to class $G$, otherwise $\mathbf{x}$ belongs to class $B$. We would like to maximize the separability of projected points, $y$, based on Fisher's first criterion. In order to do this, we must first define a measure of separation between different classes. Let $m_i$ represent the mean vector of class $C_i$, taken over all $\mathbf{x}\in C_i$, where $N_i$ is the number of points in that class:

\begin{equation}
m_i=\frac{1}{N_i}\sum_{\mathbf{x}\in C_i}{\mathbf{x}}
\label{3.4}\tag{3.4}
\end{equation}

Recall that \ref{3.2} requires the difference of per-class projected means. Let $\hat{m_i}$ represent the mean of class $C_i$ after projection, which is a scalar:

\begin{equation}
\hat{m_i}=\frac{1}{N_i}\sum_{y\in C_i}{y}=\frac{1}{N_i}\sum_{\mathbf{x}\in C_i}{\mathbf{w}^\intercal\mathbf{x}}=\mathbf{w}^\intercal\cdot\frac{1}{N_i}\sum_{\mathbf{x}\in C_i}{\mathbf{x}}=\mathbf{w}^\intercal m_i
\label{3.5}\tag{3.5}
\end{equation}
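The identity $\hat{m_i}=\mathbf{w}^\intercal m_i$ is easy to confirm numerically. The sketch below uses a small made-up dataset and an arbitrary direction `w`. Both are assumptions for illustration only.

```python
import numpy as np

# Small made-up two-class dataset in R^2.
X_G = np.array([[4.0, 2.0], [2.0, 4.0], [2.0, 3.0], [3.0, 6.0], [4.0, 4.0]])
X_B = np.array([[9.0, 10.0], [6.0, 8.0], [9.0, 5.0], [8.0, 7.0], [10.0, 8.0]])

w = np.array([1.0, 0.5])  # arbitrary projection direction

# Equation 3.3: project every point to a scalar y = w^T x.
y_G = X_G @ w
y_B = X_B @ w

# Equation 3.4: class mean vectors in the original space.
m_G = X_G.mean(axis=0)
m_B = X_B.mean(axis=0)

# Equation 3.5: the mean of the projections equals the projection of the mean.
m_hat_G = y_G.mean()
m_hat_B = y_B.mean()
print(np.isclose(m_hat_G, w @ m_G))
print(np.isclose(m_hat_B, w @ m_B))
```

In practice this means we never need to project every point just to find the projected means. Projecting the two class mean vectors is enough.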

Notice that the projected mean, $\hat{m_i}$, is simply the dot product of our vector normal to the hyperplane and the original mean vector, $m_i$. The numerator of \ref{3.2} is the squared difference between projected class mean values. Why are we squaring the difference? Spatially, $\hat{m_G}-\hat{m_B}$ is the signed distance between the projected mean values of our two classes. Since we don't know if $\hat{m_G}\gt\hat{m_B}$, or if $\hat{m_G}\lt\hat{m_B}$, we square the difference so the result is never negative. Intuitively, this makes sense, since a negative distance is not possible.

The second condition of Fisher's Linear Discriminant function aims to minimize the same-class variance, or scatter, defined by our second criterion. Let $\hat{s_i}$ represent the scatter for class $C_i$ in the projected space. The scatter is the sum of squared differences between values and their class's mean value (dividing it by $N_i$ would give the variance). Consider the variability within a class after it is projected onto the $y$-space:

\begin{equation}
\hat{s_i}=\sum_{y\in C_i}{(y-\hat{m_i})^2}
\label{3.6}\tag{3.6}
\end{equation}

In our two-class example, $\hat{s_G}+\hat{s_B}$ measures the variability within the two classes after projection onto the hyperplane. We have successfully defined our *projected* values to satisfy \ref{3.2}. However, in order to find a maximizing projection we need to express $J(\mathbf{w})$ as a function of our input vectors, $\mathbf{x}$. Let $S_i$ represent the **scatter matrix** (an unnormalized **covariance matrix**) for class $C_i$:

\begin{equation*}
S_i=\sum_{\mathbf{x}\in C_i}{(\mathbf{x}-m_i)(\mathbf{x}-m_i)^\intercal}
\label{3.7}\tag{3.7}
\end{equation*}

The scatter matrix will tell us the variance within each class, revealing how scattered points are around the mean. Let's return to \ref{3.6}, and substitute values for $y$ and $\hat{m_i}$ to find $\hat{s_i}$ in terms of input vectors:

\begin{equation}
\hat{s_i}=\sum_{y\in C_i}{(y-\hat{m_i})^2}=\sum_{\mathbf{x}\in C_i}{(\mathbf{w}^\intercal\mathbf{x}-\mathbf{w}^\intercal m_i)^2}=\mathbf{w}^\intercal\left(\sum_{\mathbf{x}\in C_i}{(\mathbf{x}-m_i)(\mathbf{x}-m_i)^\intercal}\right)\mathbf{w}
\label{3.8}\tag{3.8}
\end{equation}
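As a quick sanity check, the projected scatter can be computed both directly from the projected values and through the scatter matrix. The data and the direction `w` below are made-up values used only for illustration.

```python
import numpy as np

# Made-up points for a single class in R^2.
X_G = np.array([[4.0, 2.0], [2.0, 4.0], [2.0, 3.0], [3.0, 6.0], [4.0, 4.0]])
w = np.array([1.0, 0.5])  # arbitrary projection direction

m_G = X_G.mean(axis=0)

# Scatter matrix: sum of outer products (x - m)(x - m)^T over the class.
centered = X_G - m_G
S_G = centered.T @ centered

# Projected scatter computed directly from the scalar projections.
y_G = X_G @ w
projected_scatter = np.sum((y_G - y_G.mean()) ** 2)

# The same quantity expressed as w^T S_G w.
quadratic_form = w @ S_G @ w

print(projected_scatter, quadratic_form)
```

Both numbers agree up to floating point error, which is what lets us move from the projected space back to the original feature space.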

Summing over both classes, we define the same-class scatter matrix *before* projection as $S_w=S_G+S_B$, so that the same-class scatter after projection, $\hat{s_G}+\hat{s_B}$, becomes:

\begin{equation}
\hat{s_G}+\hat{s_B}=\mathbf{w}^\intercal S_w\mathbf{w}=\hat{S_w}
\label{3.9}\tag{3.9}
\end{equation}

Where $\hat{S_w}$ is the same-class scatter of the projected values $y$ (a scalar), and $S_w$ is the same-class scatter matrix of the original feature vectors. Using the same logic, we can express the difference of means between the projected values in $y$-space in terms of our original space:

\begin{equation}
(\hat{m_G}-\hat{m_B})^2=\mathbf{w}^\intercal(m_G-m_B)(m_G-m_B)^\intercal\mathbf{w}=\mathbf{w}^\intercal S_m\mathbf{w}=\hat{S_m}
\label{3.10}\tag{3.10}
\end{equation}

Where $\hat{S_m}$ is the between-class scatter of the projected values $y$ (a scalar), and $S_m=(m_G-m_B)(m_G-m_B)^\intercal$ is the between-class scatter matrix of the original feature vectors. We can now write $J$ in terms of our original feature space:

\begin{equation}
J(\mathbf{w})=\frac{(\hat{m_G}-\hat{m_B})^2}{\hat{s_G}+\hat{s_B}}=\frac{\mathbf{w}^\intercal S_m\mathbf{w}}{\mathbf{w}^\intercal S_w\mathbf{w}}
\label{3.11}\tag{3.11}
\end{equation}

In order to find the maximum of $J$, we can differentiate it with respect to $\mathbf{w}$ and set the derivative to zero.
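For the two-class case, setting the derivative of the ratio $\frac{\mathbf{w}^\intercal S_m\mathbf{w}}{\mathbf{w}^\intercal S_w\mathbf{w}}$ to zero leads to the well-known closed-form direction $\mathbf{w}\propto S_w^{-1}(m_G-m_B)$, assuming $S_w$ is invertible. The sketch below computes that direction on a made-up dataset and compares its $J$ value against randomly drawn directions. The helper names `scatter` and `J` are hypothetical.

```python
import numpy as np

# Made-up two-class dataset in R^2.
X_G = np.array([[4.0, 2.0], [2.0, 4.0], [2.0, 3.0], [3.0, 6.0], [4.0, 4.0]])
X_B = np.array([[9.0, 10.0], [6.0, 8.0], [9.0, 5.0], [8.0, 7.0], [10.0, 8.0]])

def scatter(X):
    """Scatter matrix of one class: sum of (x - m)(x - m)^T."""
    centered = X - X.mean(axis=0)
    return centered.T @ centered

m_G, m_B = X_G.mean(axis=0), X_B.mean(axis=0)
S_w = scatter(X_G) + scatter(X_B)  # same-class scatter matrix
diff = m_G - m_B
S_m = np.outer(diff, diff)         # between-class scatter matrix

def J(w):
    """Fisher's objective expressed in the original feature space."""
    return (w @ S_m @ w) / (w @ S_w @ w)

# Closed-form maximizer, up to scale: solve S_w w = (m_G - m_B).
w_star = np.linalg.solve(S_w, diff)

# Compare against many random directions.
rng = np.random.default_rng(0)
random_dirs = rng.normal(size=(1000, 2))
best_random = max(J(w) for w in random_dirs)

print(J(w_star), best_random)
```

Because $J$ is unchanged when $\mathbf{w}$ is rescaled, only the direction of `w_star` matters. Using `np.linalg.solve` avoids forming the inverse of $S_w$ explicitly.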

For classification problems with more than 2 classes, we can extend our definition of the between-class scatter. For each class, take the difference between the class mean and the global mean (the mean of all vectors in that space), form the outer product of that difference with itself, and scale it by the number of feature vectors in that class. Summing these terms over all classes gives the between-class scatter matrix.
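A minimal sketch of this multi-class extension follows, assuming the commonly used definition in which each class contributes $N_i(m_i-m)(m_i-m)^\intercal$ to the between-class scatter. The function name `scatter_matrices` and the synthetic three-class data are illustrative assumptions.

```python
import numpy as np

def scatter_matrices(X, labels):
    """Return (same-class, between-class) scatter matrices for labeled data."""
    overall_mean = X.mean(axis=0)
    n_features = X.shape[1]
    S_within = np.zeros((n_features, n_features))
    S_between = np.zeros((n_features, n_features))
    for c in np.unique(labels):
        X_c = X[labels == c]
        m_c = X_c.mean(axis=0)
        centered = X_c - m_c
        S_within += centered.T @ centered
        # Class mean minus global mean, scaled by the class size.
        diff = m_c - overall_mean
        S_between += len(X_c) * np.outer(diff, diff)
    return S_within, S_between

# Synthetic data: three classes of 10 points around different centers.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
X = rng.normal(size=(30, 2)) + np.repeat(centers, 10, axis=0)
labels = np.repeat([0, 1, 2], 10)

S_within, S_between = scatter_matrices(X, labels)
print(S_between)
```

A useful check is that the two matrices add up to the total scatter of the data around the global mean.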

### Algorithm