# 6.4  The Perceptron

- As we have seen with logistic regression we treat classification as a particular form of nonlinear regression (employing - with the choice of label values $y_p \in \left\{-1,+1\right\}$ - a tanh nonlinearity). 


- This results in the learning of a proper nonlinear regressor, and a corresponding *linear decision boundary* 

\begin{equation}
\mathring{\mathbf{x}}_{\,}^{T}\mathbf{w}^{\,}=0.
\end{equation}

- Instead of learning this decision boundary as a by-product of a nonlinear regression, the *perceptron* derivation described in this Section aims to determine this ideal linear decision boundary directly.  


- We will also see how this direct approach leads back to the *Softmax cost function*.


- Practically speaking, the perceptron and logistic regression *often result in learning the same linear decision boundary*. Even so, the perceptron's focus on learning the decision boundary directly provides a valuable new perspective on the process of two-class classification.  

- You can toggle the code on and off in this presentation via the button below.

In [2]:
from IPython.display import HTML

HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')

In [1]:
# This code cell will not be shown in the HTML version of this notebook
# import custom library
import sys
sys.path.append('../../')
from mlrefined_libraries import superlearn_library as superlearn
from mlrefined_libraries import math_optimization_library as optlib
datapath = '../../mlrefined_datasets/superlearn_datasets/'

# demos for this notebook
regress_plotter = superlearn.lin_regression_demos
optimizers = optlib.optimizers
static_plotter = superlearn.classification_static_plotter.Visualizer();

# import autograd functionality to build functions properly for optimizers
import autograd.numpy as np

# import timer
from datetime import datetime 

# This is needed to compensate for %matplotlib notebook's tendency to blow up images when plotted inline
%matplotlib notebook
from matplotlib import rcParams
rcParams['figure.autolayout'] = True

%load_ext autoreload
%autoreload 2

## The Perceptron cost function

- With two-class classification we have a training set of $P$ points $\left\{ \left(\mathbf{x}_{p},y_{p}\right)\right\} _{p=1}^{P}$ - where $y_p$'s take on just two label values from $\{-1, +1\}$ - consisting of two classes which we would like to learn how to distinguish between automatically.  


- As we saw in our discussion of logistic regression, in the simplest instance our two classes of data are largely separated by a *linear decision boundary* with each class (largely) lying on either side.  


- This decision boundary is written as

\begin{equation}
\mathring{\mathbf{x}}^{T}\mathbf{w}^{\,} = 0
\end{equation}

- This boundary is a *point* when the dimension of the input is $N=1$, a *line* when $N = 2$, and more generally, for arbitrary $N$, a *hyperplane* defined in the input space of a dataset.  


- This scenario can be best visualized in the case $N=2$, where we view the problem of classification 'from above' - showing the input of a dataset colored to denote class membership.  

  <img src= '../../mlrefined_images/superlearn_images/Fig_4_1.png' width="80%" height="80%" alt=""/>

- The default coloring scheme we use here - matching the scheme used in the previous Section - is to color points with label $y_p = -1$ blue and $y_p = +1$ red.  


- The linear decision boundary is here a line that best separates points from the $y_p = -1$ class from those of the $y_p = +1$ class.

- A linear decision boundary cuts the input space into two *half-spaces*, one lying 'above' the hyperplane where $\mathring{\mathbf{x}}^{T}\mathbf{w}^{\,} > 0$ and one lying 'below' it where $\mathring{\mathbf{x}}^{T}\mathbf{w}^{\,}  < 0$.  


- Notice then that a proper set of weights $\mathbf{w}$ defines a linear decision boundary that separates a two-class dataset as well as possible, with *as many members of one class as possible lying above it, and likewise as many members as possible of the other class lying below it*.  

- So our *desired* set of weights defines a hyperplane where as often as possible we have that

\begin{equation}
\begin{array}{ll}
\mathring{\mathbf{x}}_{p}^T\mathbf{w}^{\,} >0 & \,\,\,\,\text{if} \,\,\, y_{p}=+1\\
\mathring{\mathbf{x}}_{p}^T\mathbf{w}^{\,} <0 & \,\,\,\,\text{if} \,\,\, y_{p}=-1.
\end{array}
\end{equation}
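- The two conditions above can be checked numerically. The following is a minimal sketch, assuming the input is stored as an $N \times P$ array `x` and the weights as a length $N+1$ array `w` with the bias in `w[0]`. The toy data, the weights, and the function name `model` are hypothetical and chosen only for illustration.

```python
import numpy as np

def model(x, w):
    """Return the values [1, x_p]^T w for every point (column) of x."""
    # assumed layout: x is N x P, w has length N + 1 with the bias in w[0]
    return w[0] + np.dot(x.T, w[1:])

# hypothetical toy dataset with N = 2 and P = 4
x = np.array([[1.0, 2.0, -1.0, -2.0],
              [1.0, 0.5, -1.0, -0.5]])
y = np.array([1, 1, -1, -1])

# hypothetical weights whose boundary is the line x_1 + x_2 = 0
w = np.array([0.0, 1.0, 1.0])

# the sign of the model tells us which half-space each point lies in
half_space = np.sign(model(x, w))
print(half_space)
```

- For these weights every point with label $+1$ lies in the positive half-space and every point with label $-1$ in the negative one, so the sign of the model matches the labels.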

- Because of our choice of label values we can consolidate the ideal conditions above into the single inequality below

\begin{equation}
-\overset{\,}{y}_{p}\mathring{\mathbf{x}}_{p}^T\mathbf{w}^{\,} <0.
\end{equation}


- Again we can do so specifically because we chose the label values $y_p \in \{-1,+1\}$.  
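- As a quick sanity check, the sketch below compares the two-case condition with the consolidated one on randomly generated values. The array `a` stands in for the values $\mathring{\mathbf{x}}_{p}^T\mathbf{w}$, and both it and the random labels are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical stand-ins for the values [1, x_p]^T w and for the labels y_p
a = rng.normal(size=1000)
y = rng.choice([-1, 1], size=1000)

# two-case condition: a > 0 when y = +1, and a < 0 when y = -1
two_case = np.where(y == 1, a > 0, a < 0)

# consolidated condition: -y * a < 0
consolidated = (-y * a) < 0

print(np.array_equal(two_case, consolidated))
```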


- Likewise by taking the maximum of this quantity and zero we can then write this ideal condition as

\begin{equation}
g_p\left(\mathbf{w}\right) = \text{max}\left(0,\,-\overset{\,}{y}_{p}\mathring{\mathbf{x}}_{p}^T\mathbf{w}^{\,}\right)=0
\end{equation}

- Note that the expression $\text{max}\left(0,-\overset{\,}{y}_{p}\mathring{\mathbf{x}}_{p}^T\mathbf{w}^{\,}\right)$ is always nonnegative.


- The functional form of this point-wise cost $\text{max}\left(0,\cdot\right)$ is called a *rectified linear unit*.  


- Because these point-wise costs are nonnegative and equal *zero* when our weights are tuned correctly, we can take their average over the entire dataset to form a proper cost function as


\begin{equation}
g\left(\mathbf{w}\right)=  \frac{1}{P}\sum_{p=1}^Pg_p\left(\mathbf{w}\right) = \frac{1}{P}\sum_{p=1}^P\text{max}\left(0,-\overset{\,}{y}_{p}\mathring{\mathbf{x}}_{p}^T\mathbf{w}^{\,}\right).
\end{equation}
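- The cost above translates almost directly into NumPy. This is a minimal sketch under the same assumed data layout ($N \times P$ input array `x`, bias stored in `w[0]`). The function name `perceptron_cost` and the toy data are illustrative and not taken from the notebook's library.

```python
import numpy as np

def perceptron_cost(w, x, y):
    """Average of max(0, -y_p [1, x_p]^T w) over all P points."""
    # assumed layout: x is N x P, y has length P, w has length N + 1
    a = w[0] + np.dot(x.T, w[1:])
    return np.mean(np.maximum(0.0, -y * a))

# hypothetical toy dataset with N = 2 and P = 4
x = np.array([[1.0, 2.0, -1.0, -2.0],
              [1.0, 0.5, -1.0, -0.5]])
y = np.array([1, 1, -1, -1])

w_good = np.array([0.0, 1.0, 1.0])    # classifies every point correctly
w_bad = np.array([0.0, -1.0, -1.0])   # classifies every point incorrectly

print(perceptron_cost(w_good, x, y))
print(perceptron_cost(w_bad, x, y))
```

- Correctly classified points contribute nothing to the cost, so it is zero for `w_good`. Each misclassified point contributes the magnitude of its model value, so the cost is positive for `w_bad`.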

- When minimized appropriately this cost function can be used to recover the ideal weights.


- This cost function goes by many names such as the *perceptron* cost, the *rectified linear unit* cost (or *ReLU cost* for short), and the *hinge cost*.


- This cost function is *always convex* but has only a single (discontinuous) derivative in each input dimension. 


- This implies that we can only use zero and first order local optimization schemes (i.e., not Newton's method).  


- Note that the perceptron cost *always* has a trivial solution at  $\mathbf{w} = \mathbf{0}$, since indeed $g\left(\mathbf{0}\right) = 0$, thus one may need to take care in practice to avoid finding it (or a point too close to it) accidentally.
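- The sketch below illustrates the trivial solution on a hypothetical dataset with two deliberately flipped labels, so that the chosen weights misclassify some points. It also shows why points close to zero are a concern: since $\text{max}\left(0,ct\right) = c\,\text{max}\left(0,t\right)$ for any $c \geq 0$, shrinking the weights shrinks the cost by the same factor without changing the decision boundary.

```python
import numpy as np

def perceptron_cost(w, x, y):
    """Average of max(0, -y_p [1, x_p]^T w) over all P points."""
    a = w[0] + np.dot(x.T, w[1:])
    return np.mean(np.maximum(0.0, -y * a))

# hypothetical toy dataset; two labels are flipped so w below makes mistakes
x = np.array([[1.0, 2.0, -1.0, -2.0],
              [1.0, 0.5, -1.0, -0.5]])
y_noisy = np.array([1, -1, -1, 1])

w = np.array([0.0, 1.0, 1.0])

cost_at_zero = perceptron_cost(np.zeros(3), x, y_noisy)
cost_at_w = perceptron_cost(w, x, y_noisy)
cost_at_small_w = perceptron_cost(0.01 * w, x, y_noisy)

print(cost_at_zero, cost_at_w, cost_at_small_w)
```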

## The smooth softmax approximation to the ReLU cost

- Learning and optimization go hand in hand, and as we know from the discussion above the ReLU function limits the number of optimization tools we can bring to bear for learning. 


- Here we describe a common approach to ameliorating this issue by introducing a smooth approximation to this cost function. 


- If the approximation closely matches the true cost function then, at the cost of a small amount of accuracy (we will after all be minimizing the approximation, not the true function itself), we significantly broaden the set of optimization tools we can use.

- One popular way of doing this for the ReLU cost function is via the *softmax* function defined as

\begin{equation}
\text{soft}\left(s_0,s_1,...,s_{C-1}\right) = \text{log}\left(e^{s_0} + e^{s_1} + \cdots + e^{s_{C-1}} \right)
\end{equation}


- Here $s_0,\,s_1,\,...,s_{C-1}$ are any $C$ scalar values. The softmax is a generic smooth approximation to the *max* function, i.e., 


\begin{equation}
\text{soft}\left(s_0,s_1,...,s_{C-1}\right) \approx \text{max}\left(s_0,s_1,...,s_{C-1}\right)
\end{equation}

- The fact that the softmax approximates the maximum can be proved formally (see text).
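- A short numerical check makes the approximation concrete. The sketch below evaluates the softmax in a numerically stable way by factoring out the largest input, and the function name `soft` simply mirrors the notation above. The softmax always lies between the max and the max plus $\text{log}\left(C\right)$, and it is closest to the max when one input dominates the others.

```python
import numpy as np

def soft(*s):
    """Softmax of scalars: log(e^{s_0} + ... + e^{s_{C-1}})."""
    s = np.asarray(s, dtype=float)
    m = np.max(s)
    # factoring out e^m avoids overflow when the inputs are large
    return m + np.log(np.sum(np.exp(s - m)))

# one dominant input: the softmax is very close to the max
gap_far = soft(0.0, 10.0) - max(0.0, 10.0)

# tied inputs: the gap reaches its largest value, log(C) with C = 2
gap_tied = soft(0.0, 0.0) - max(0.0, 0.0)

print(gap_far, gap_tied)
```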


- Below we show the one dimensional comparison.  The softmax is shown in dashed black, and the max in green.


  <img src= '../../mlrefined_images/superlearn_images/Fig_4_2.png' width="50%" height="auto"/>

- We now replace the $p^{th}$ summand of our ReLU cost with its softmax approximation

\begin{equation}
g_p\left(\mathbf{w}\right) = \text{soft}\left(0,-\overset{\,}{y}_{p}\mathring{\mathbf{x}}_{p}^T\mathbf{w}^{\,}\right) = \text{log}\left(1 + e^{-\overset{\,}{y}_{p}\mathring{\mathbf{x}}_{p}^T\mathbf{w}^{\,}}\right)
\end{equation}


- The overall cost function is then the average of these point-wise costs, exactly as with the ReLU cost: $g\left(\mathbf{w}\right)=\frac{1}{P}\sum_{p=1}^P g_p\left(\mathbf{w}\right) = \frac{1}{P}\sum_{p=1}^P\text{log}\left(1 + e^{-\overset{\,}{y}_{p}\mathring{\mathbf{x}}_{p}^T\mathbf{w}^{\,}}\right)$


- This is the *Softmax cost* we saw previously derived from the logistic regression perspective.

- This is why the cost is called *Softmax*, since it derives from the general softmax approximation to the max function.
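- A minimal sketch of the Softmax cost next to the ReLU cost it approximates, under the same assumed data layout ($N \times P$ input array `x`, bias in `w[0]`). It uses `np.logaddexp(0, t)`, which computes $\text{log}\left(1 + e^{t}\right)$ without overflow. The sketch averages over the $P$ points as the perceptron cost above does. Dropping the $\frac{1}{P}$ factor only rescales the cost and does not move its minimizers. Function names and data are illustrative.

```python
import numpy as np

def perceptron_cost(w, x, y):
    """Average of max(0, -y_p [1, x_p]^T w) over all P points."""
    a = w[0] + np.dot(x.T, w[1:])
    return np.mean(np.maximum(0.0, -y * a))

def softmax_cost(w, x, y):
    """Average of log(1 + e^{-y_p [1, x_p]^T w}) over all P points."""
    a = w[0] + np.dot(x.T, w[1:])
    # logaddexp(0, t) = log(e^0 + e^t) = log(1 + e^t)
    return np.mean(np.logaddexp(0.0, -y * a))

# hypothetical toy dataset with N = 2 and P = 4
x = np.array([[1.0, 2.0, -1.0, -2.0],
              [1.0, 0.5, -1.0, -0.5]])
y = np.array([1, 1, -1, -1])
w = np.array([0.0, 1.0, 1.0])

relu_value = perceptron_cost(w, x, y)
soft_value = softmax_cost(w, x, y)
print(relu_value, soft_value)
```

- Because $\text{soft}\left(0,t\right) \geq \text{max}\left(0,t\right)$ for every $t$, the Softmax value is never smaller than the ReLU value at the same weights.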


- Note that *like* the ReLU cost - as we already know - the Softmax cost is convex. 


- However *unlike* the ReLU cost, the softmax has infinitely many derivatives and Newton's method can therefore be used to minimize it. 


- Moreover, the Softmax cost does not have a trivial solution at zero like the ReLU cost does.  
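- To see this concretely, the sketch below evaluates the averaged Softmax cost and its gradient at $\mathbf{w} = \mathbf{0}$ on a hypothetical toy dataset. With the $\frac{1}{P}$ averaging used here the cost at zero equals $\text{log}\left(2\right)$ rather than $0$, and for this data the gradient there is nonzero, so zero is not a stationary point. The gradient formula is derived by hand from the point-wise cost, and the function names are illustrative.

```python
import numpy as np

def softmax_cost(w, x, y):
    """Average of log(1 + e^{-y_p [1, x_p]^T w}) over all P points."""
    a = w[0] + np.dot(x.T, w[1:])
    return np.mean(np.logaddexp(0.0, -y * a))

def softmax_gradient(w, x, y):
    """Gradient of the averaged Softmax cost with respect to w."""
    # stack a row of ones on top of x to form the points [1, x_p]
    x_ring = np.vstack((np.ones((1, x.shape[1])), x))
    a = np.dot(x_ring.T, w)
    # the derivative of log(1 + e^{-t}) with respect to t is -1 / (1 + e^{t})
    coeff = -y / (1.0 + np.exp(y * a))
    return np.dot(x_ring, coeff) / y.size

# hypothetical toy dataset with N = 2 and P = 4
x = np.array([[1.0, 2.0, -1.0, -2.0],
              [1.0, 0.5, -1.0, -0.5]])
y = np.array([1, 1, -1, -1])

w_zero = np.zeros(3)
cost_at_zero = softmax_cost(w_zero, x, y)
grad_at_zero = softmax_gradient(w_zero, x, y)
print(cost_at_zero, grad_at_zero)
```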

- Nonetheless, the fact that the Softmax cost so closely approximates the ReLU shows just how closely aligned both logistic regression and the perceptron truly are. 


- Practically speaking their differences lie in how well - for a particular dataset - one can optimize either one, along with (what is very often slight) differences in the quality of each cost function's learned decision boundary.  


- Of course when the Softmax is employed from the perceptron perspective there is no qualitative difference between the perceptron and logistic regression at all.