# Notes on Chapter 2: Training Machine Learning Algorithms for Classification

### Background

The first algorithms to be developed for classification tasks were the **perceptron** and **adaptive linear neurons**. These algorithms are part of the early history of machine intelligence.

The idea of **artificial neurons** was (according to Raschka) first developed by Warren McCulloch and Walter Pitts in 1943 – hence the name McCulloch-Pitts (MCP) neuron. McCulloch and Pitts were inspired by developments in cognitive biology (*early evidence of biology's status as an inspiration to ML -JC*).

The intuition behind the MCP neuron is that a computational classifier could have a similar structure to a (*vastly oversimplified*) neuron: 

```
Input -\                              1
Input --\                            /
Input -----> function --> ?threshold?
Input --/                            \ 
Input -/                              0
```

In 1957, Frank Rosenblatt extended this idea to what he called the **perceptron**, a neuron that would *learn* the optimal weights to apply to its input "features".

## Rosenblatt's perceptron

Formally, Rosenblatt's perceptron defines a weight vector $w$ and an input vector $x$:

$$
w = \left( \begin{array}{c} w_{1} \\ w_{2} \\ \vdots \\ w_{m} \end{array} \right), \quad 
x = \left( \begin{array}{c} x_{1} \\ x_{2} \\ \vdots \\ x_{m} \end{array} \right) 
$$
  
With an activation function:

$$\phi(z) = \begin{cases} 1 & \text{if } z \ge \theta \\ -1 & \text{otherwise} \end{cases}$$

Where $\theta$ is the **threshold** (its negative, $-\theta$, is often called the **bias**) and $z$ is a linear combination (dot product) of the weight vector and the input vector:

$$z \; = \; w \cdot x \; = \; w_{1}x_{1} + w_{2}x_{2} + \; ... \; + w_{m}x_{m}$$

Hence, we can move $\theta$ to the left-hand side:

$$z \ge \theta \quad \Leftrightarrow \quad -\theta + z \ge 0$$

Fold it into the weight vector as $-\theta$, paired with a constant input of $1$:

$$
w = \left( \begin{array}{c} -\theta \\ w_{1} \\ w_{2} \\ \vdots \\ w_{m} \end{array} \right), \quad
x = \left( \begin{array}{c} 1 \\ x_{1} \\ x_{2} \\ \vdots \\ x_{m} \end{array} \right)
$$

And redefine the activation function in terms of $0$:

$$\phi(z) = \begin{cases} 1 & \text{if } z \ge 0 \\ -1 & \text{otherwise} \end{cases}$$
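To check that folding the threshold into the weights changes nothing, here is a minimal NumPy sketch. The function names `predict_with_threshold` and `predict_augmented` and the example numbers are my own illustrative assumptions, not from the book.

```python
import numpy as np

def predict_with_threshold(w, x, theta):
    """Original form: output 1 if w . x >= theta, else -1."""
    z = np.dot(w, x)
    return 1 if z >= theta else -1

def predict_augmented(w, x, theta):
    """Augmented form: prepend -theta to w and 1 to x, then compare with 0."""
    w_aug = np.concatenate(([-theta], w))
    x_aug = np.concatenate(([1.0], x))
    z = np.dot(w_aug, x_aug)
    return 1 if z >= 0 else -1

# Hypothetical example values, chosen only for illustration
w = np.array([0.5, -0.25])
theta = 0.25
samples = [np.array([1.0, 1.0]), np.array([0.0, 2.0]), np.array([2.0, 0.0])]

# Both forms should agree on every sample
for x in samples:
    assert predict_with_threshold(w, x, theta) == predict_augmented(w, x, theta)
```

The practical benefit is that the threshold becomes just another weight, so whatever rule updates the weights can update it too.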

### Updating the perceptron weights

The algorithm for updating the Rosenblatt perceptron weights is quite simple:

1. Set the weights to 0, or small random numbers

2. For each training vector $x^{(i)}$ (and its labeled output $y^{(i)}$):

    1. Compute the output class label $\hat{y^{(i)}}$
    
    2. Update the weights based on the error, $y^{(i)} - \hat{y^{(i)}}$
    
Easy! So how do we update the weights? To do this, we'll sum the initial weights with an adjustment term $\Delta w_{j}$:

$$w_{j} = w_{j} + \Delta w_{j}$$

We can compute $\Delta w_{j}$ like so:

$$\Delta w_{j} = \eta \; (y^{(i)} - \hat{y^{(i)}}) \; x^{(i)}_{j}$$

Where $\eta$ is a constant **learning rate** coefficient on the interval $[0, 1]$, $y^{(i)} - \hat{y^{(i)}}$ is the **error term**, and $x^{(i)}_{j}$ is the element of the input vector at index $j$. Note that the coefficient `learning_rate * error` is the same for every index $j$ of a given sample $x^{(i)}$; each weight's update is that coefficient scaled by $x^{(i)}_{j}$, making it proportional to that value. This has the really nice property of adjusting the weights attached to larger-magnitude inputs $x^{(i)}_{j}$ more dramatically than those attached to smaller ones. When the prediction is correct the error term is $0$ and no weight changes; when it is wrong the error term is $\pm 2$.
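A single application of the update rule can be sketched in a few lines of NumPy. The function name `perceptron_update` and the sample values are hypothetical; the sketch assumes the weight and input vectors already include the bias entry at index 0.

```python
import numpy as np

def perceptron_update(w, x, y, eta=0.1):
    """Return the weights after seeing one labeled sample (x, y).

    Assumes w[0] = -theta and x[0] = 1 (the augmented form).
    """
    y_hat = 1 if np.dot(w, x) >= 0 else -1   # predicted class label
    delta_w = eta * (y - y_hat) * x          # one adjustment per index j
    return w + delta_w

w = np.zeros(3)                   # step 1: start from zero weights
x = np.array([1.0, 2.0, -0.5])    # made-up sample; x[0] = 1 is the bias input
y = -1                            # its true label

w_new = perceptron_update(w, x, y, eta=0.1)
print(w_new)
```

Here the prediction is wrong, so every weight moves, and the weight paired with the largest-magnitude input moves the most. Calling `perceptron_update` again on the same sample leaves the weights unchanged, because the prediction is now correct and the error term is zero.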

Note that the perceptron algorithm assumes our classes are **linearly separable** – that is, we can find some linear decision boundary (a line, plane, or hyperplane) that cleanly divides the classes in two.

If the classes aren't linearly separable, we have two options:

1. Define a maximum number of passes over the training set (**epochs**) after which we'll stop updating the weights
2. Set a **threshold** of acceptable misclassification

These approaches aren't mutually exclusive, of course!
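Both stopping rules can sit in the same training loop. The following is a rough sketch rather than Raschka's implementation: `train_perceptron`, `max_epochs`, and `max_errors` are names I made up, and the toy data is invented for illustration.

```python
import numpy as np

def train_perceptron(X, y, eta=0.1, max_epochs=10, max_errors=0):
    """Train a perceptron with two stopping rules.

    Stops after max_epochs passes over the data, or earlier once an epoch
    makes no more than max_errors weight updates (misclassifications).
    X has shape (n_samples, n_features); y holds labels in {-1, 1}.
    Returns the weights (bias entry at index 0) and the errors per epoch.
    """
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend the bias input 1
    w = np.zeros(X_aug.shape[1])
    errors_per_epoch = []
    for _ in range(max_epochs):
        errors = 0
        for x_i, y_i in zip(X_aug, y):
            y_hat = 1 if np.dot(w, x_i) >= 0 else -1
            update = eta * (y_i - y_hat)
            w += update * x_i
            errors += int(update != 0.0)  # count samples that triggered an update
        errors_per_epoch.append(errors)
        if errors <= max_errors:
            break
    return w, errors_per_epoch

# Tiny, linearly separable toy data (made up for illustration)
X = np.array([[2.0, 1.0], [3.0, 2.0], [-1.0, -1.0], [-2.0, -3.0]])
y = np.array([1, 1, -1, -1])
w, errors_per_epoch = train_perceptron(X, y)
print(errors_per_epoch)
```

On separable data like this the loop exits as soon as an epoch finishes without errors. On data that is not separable, `max_epochs` is what guarantees termination.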

### What if we have multiple classes?

Rosenblatt's perceptron is a **binary classifier**, meaning that it can only return two possible classifications: either a sample $x^{(i)}$ *is* part of a certain class (when the net input $z \ge 0$, so that $\phi(z) = 1$) or it *isn't* (when $\phi(z) = -1$). On its own, the perceptron cannot choose among more than two classes – at least in this implementation.

If we want to, we can use some clever ideas from set theory to extend the perceptron to multiple classes. The technique is called **One-vs-All**, and the intuition goes something like this:

1. For each class $A$ in the set of classes $\omega$:

    1. Bind $A$ to the positive output of the classifier (a success – an output of 1) and bind all other classes $A^{C}$ to the negative output (-1)
    
    2. Classify the sample as $A$ or $A^{C}$ ("anything but $A$")
    
    3. Compute the confidence of the classification and store it
    
2. Select the positive class label with the highest confidence, and apply that label to the sample

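This technique is also known as **One-vs-Rest** (OvR), and it is a strategy for **multi-class classification** (not to be confused with multi-label classification, where a single sample can carry several labels at once). Raschka doesn't go into much detail about the implementation (in particular, how to compute the confidence for any given class label), but it seems like we'll learn more about it later.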
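The steps above could be sketched as follows. Since the book leaves the confidence measure unspecified, this sketch assumes that the raw net input $z = w \cdot x$ of each binary classifier can stand in for confidence; that choice, the function names, and the hand-picked weights are all my own assumptions.

```python
import numpy as np

def one_vs_all_labels(y, positive_class):
    """Relabel y for one binary problem: positive_class -> 1, all others -> -1."""
    return np.where(y == positive_class, 1, -1)

def one_vs_all_predict(weights_by_class, x):
    """Pick the class whose binary classifier reports the largest net input.

    ASSUMPTION: the net input z = w . x is used as the "confidence".
    weights_by_class maps a class label to a weight vector (bias at index 0).
    """
    x_aug = np.concatenate(([1.0], x))
    scores = {label: float(np.dot(w, x_aug)) for label, w in weights_by_class.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Relabeling step for class "a"
y = np.array(["a", "b", "c", "a"])
y_a = one_vs_all_labels(y, "a")

# Hypothetical, hand-picked weights for three classes (not trained here)
weights_by_class = {
    "a": np.array([0.0, 1.0, 0.0]),
    "b": np.array([0.0, 0.0, 1.0]),
    "c": np.array([0.0, -1.0, -1.0]),
}
label, scores = one_vs_all_predict(weights_by_class, np.array([0.5, 2.0]))
print(label, scores)
```

In practice each weight vector would come from training one binary perceptron on the relabeled targets produced by `one_vs_all_labels`.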

### A note on the decision surface

The **decision surface** of a binary classifier is the "line" (or plane, or hyperplane, depending on the dimension) that divides the two classes. In Rosenblatt's perceptron, the decision surface corresponds to a linear combination of the form

$$z = 0 \quad \Leftrightarrow \quad w \cdot x = 0 \quad \Leftrightarrow \quad -\theta + w_{1}x_{1} + w_{2}x_{2} + \; ... \; + w_{m}x_{m} = 0$$

where the weight vector $w$ is no longer a variable, but is instead the optimal weight coefficients learned by the algorithm. Hence, the input vector $x$ will be the only variable, and we can plot the surface to see the dividing line that the perceptron has learned from the data.
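For two features the surface is a line, and we can solve for $x_{2}$ in terms of $x_{1}$ to get points to plot. A small sketch, assuming a weight vector in the augmented form (bias entry $-\theta$ at index 0); the helper name `boundary_x2` and the weights are hypothetical.

```python
import numpy as np

def boundary_x2(w, x1):
    """Solve w[0] + w[1]*x1 + w[2]*x2 = 0 for x2 (requires w[2] != 0).

    w[0] is the bias entry (-theta) of the augmented weight vector.
    """
    return -(w[0] + w[1] * x1) / w[2]

w = np.array([-1.0, 2.0, 4.0])     # hypothetical learned weights
x1 = np.linspace(-2.0, 2.0, 5)     # a few x1 values along the axis
x2 = boundary_x2(w, x1)

# Every point (x1, x2) on the line has a net input of zero
z = w[0] + w[1] * x1 + w[2] * x2
assert np.allclose(z, 0.0)
```

Points on one side of this line get a positive net input (class 1) and points on the other side get a negative one (class -1).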

## Adaptive Linear Neurons (Adaline)

The idea of the **Adaptive Linear Neuron** – or **"Adaline"** for short – was first proposed (according to Raschka) by Bernard Widrow and Ted Hoff in 1960 as an improvement on Rosenblatt's perceptron. Most importantly, the Adaline rule introduces the concept of a **cost function**, differentiable with respect to the weights, that we can use to mathematically minimize our classifier's error.

The Adaline rule also redefines the activation function to separate the process of calculating error from the classification itself. Formally, the Adaline rule replaces the unit step (or Heaviside) function that the perceptron uses as its activation with a **linear activation function**. In the basic case, the linear activation function is just the identity function:

$$\phi(z) \; = \; \phi(w \cdot x) \; = \; w \cdot x$$

which makes the Adaline rule very similar to linear regression! But since our goal is to classify data and not to fit a regression line, we add a step function, or **quantizer**, after the linear activation function; it takes the continuous output and converts it to a class label. The error used to adjust the weights is computed from the continuous output, before the quantizer. The neural structure then looks more like this:

```
Input -\  <------------------------------- error: adjust weights              1
Input --\                                                      |            /
Input -----> net input function --> linear activation function |-> quantizer
Input --/                                                                   \ 
Input -/                                                                     0
```
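The three stages in the diagram can be sketched as separate functions. This is only an illustration of the forward pass: the weights and data are made up, and the sketch assumes the quantizer emits labels in $\{1, -1\}$ to match the perceptron's activation defined earlier in these notes.

```python
import numpy as np

def net_input(w, X):
    """z = w . x for each row of X; w[0] is the bias entry."""
    return w[0] + X @ w[1:]

def activation(z):
    """Linear activation: the identity function."""
    return z

def quantizer(a):
    """Turn the continuous activation into a class label (assumed 1 / -1)."""
    return np.where(a >= 0.0, 1, -1)

w = np.array([0.1, 0.5, -0.5])             # hypothetical weights
X = np.array([[1.0, 0.0], [0.0, 1.0]])     # two made-up samples
y = np.array([1, -1])                      # their true labels

a = activation(net_input(w, X))   # continuous output of the linear activation
errors = y - a                    # error signal, taken before the quantizer
labels = quantizer(a)             # class labels, used only for prediction
print(a, errors, labels)
```

Unlike the perceptron's error, which is zero whenever the label is right, `errors` here is nonzero even for correctly classified samples, which is the kind of graded signal a differentiable cost function needs.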