# Chapter.02 Rosenblatt's Perceptron
---

### 2.1 Rosenblatt's Perceptron Model
2.1.1. Overview<br>

1. The first model/algorithm for supervised learning, introduced in <strong>1957</strong>
2. A single-layer, single-output neural network for binary classification of <strong>linearly separable</strong> patterns

2.1.2. Activation function: sign function (threshold function)

$$
\varphi(x) = 
\begin{cases}
+1, & \text{if } x > 0 \\
-1, & \text{if } x \le 0
\end{cases}
$$

2.1.3. Network Architecture<br>
- Basic model :

<img src="./res/ch02/fig_2_1.png" width="550" height="300"><br>
<div align="center">
  Figure.2.1.1
</div>

- Compact model : 

Let $ \mathbf{x} = [+1, \,\ x_1, \,\ x_2, \,\ \cdots, \,\ x_m]^T $ and $ \mathbf{w} = [b, \,\ w_1, \,\ w_2, \,\ \cdots, \,\ w_m]^T $, so that $ x_0 = +1 $ and $ w_0 = b $.<br>
$$ v = \sum_{i = 0}^{m} w_i x_i = \mathbf{w}^T \mathbf{x}, \quad y = \text{sgn}(v) = \text{sgn}(\mathbf{w}^T \mathbf{x}) $$

<img src="./res/ch02/fig_2_2.png" width="600" height="300"><br>
<div align="center">
  Figure.2.1.2
</div>
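
The compact model can be illustrated with a minimal Python sketch. The function names (`sgn`, `augment`, `perceptron_output`) and the weight values are hypothetical, chosen only for illustration, and mapping $ v = 0 $ to $ -1 $ is an assumption made so that it agrees with the condition $ \mathbf{w}^T \mathbf{x} \le 0 $ for class $ C_2 $ in 2.1.5.

```python
def sgn(v):
    """Sign activation: +1 if v > 0, otherwise -1 (v = 0 maps to -1, an assumption)."""
    return 1 if v > 0 else -1


def augment(x):
    """Prepend the constant input +1 so that the bias b becomes weight w_0."""
    return [1.0] + list(x)


def perceptron_output(w, x):
    """Compute y = sgn(w^T x) for an augmented weight vector w = [b, w_1, ..., w_m]."""
    xa = augment(x)
    v = sum(wi * xi for wi, xi in zip(w, xa))  # induced local field v = w^T x
    return sgn(v)


# Hypothetical weights: b = -1.5, w_1 = w_2 = 1.
w = [-1.5, 1.0, 1.0]
print(perceptron_output(w, [1, 1]))
print(perceptron_output(w, [0, 1]))
```

With these illustrative weights, the input $ [1, 1] $ gives $ v = 0.5 $ and falls on the positive side of the boundary, while $ [0, 1] $ gives $ v = -0.5 $ and falls on the negative side.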

2.1.4. Assumption <br>
 - The task is binary classification.
 - The two classes are linearly separable.
 - Decision boundary (hyperplane):
 $$ \mathbf{w}^T \mathbf{x} = 0 $$
 - Decision rule for binary classification:
 $$ \mathbf{x} \in C_1 \,\ \text{if} \,\ y = +1 \,\ \Leftrightarrow \,\ \mathbf{w}^T \mathbf{x} > 0 $$
 $$ \mathbf{x} \in C_2 \,\ \text{if} \,\ y = -1 \,\ \Leftrightarrow \,\ \mathbf{w}^T \mathbf{x} \le 0 $$

2.1.5. Training Problem Definition<br>
Find a weight vector $ \mathbf{w} $ such that<br>
$ \mathbf{w}^T \mathbf{x} > 0 $ for every input vector $ \mathbf{x} $ belonging to class $ C_1 $<br>
$ \mathbf{w}^T \mathbf{x} \le 0 $ for every input vector $ \mathbf{x} $ belonging to class $ C_2 $<br>

2.1.6. Training Algorithm for Perceptron<br>
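- Cost function: use a distance measure rather than MSE, namely the total (unnormalized) distance between the decision boundary and the misclassified samples.

$$ J(\mathbf{w}) = \sum_{\mathbf{x} \in \mathcal{H}} | \mathbf{w}^T \mathbf{x} |, \quad \text{where } \mathcal{H} \text{ is the set of misclassified samples.} $$
Let $ d \in \{+1, -1\} $ be the desired response for $ \mathbf{x} $ ($ d = +1 $ for $ C_1 $, $ d = -1 $ for $ C_2 $). Then
$$ (\mathbf{w}^T \mathbf{x}) d > 0  \quad (\text{correctly classified}) $$
$$ (\mathbf{w}^T \mathbf{x}) d \le 0  \quad (\text{wrongly classified}) $$
Therefore $ | \mathbf{w}^T \mathbf{x} | = - d \mathbf{w}^T \mathbf{x} $ for every $ \mathbf{x} \in \mathcal{H} $, and $ J(\mathbf{w}) = - \sum_{\mathbf{x} \in \mathcal{H}} d \mathbf{w}^T \mathbf{x} $.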
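
The cost above can be evaluated with a short Python sketch. The helper names and the sample values are hypothetical; each sample is an already augmented input vector (with $ x_0 = +1 $) paired with a desired response $ d \in \{+1, -1\} $, and the sketch uses the fact that $ | \mathbf{w}^T \mathbf{x} | = - d \mathbf{w}^T \mathbf{x} $ on misclassified samples.

```python
def dot(w, x):
    """Inner product w^T x of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))


def misclassified(w, samples):
    """Return the set H: all samples (x, d) with d * w^T x <= 0."""
    return [(x, d) for x, d in samples if d * dot(w, x) <= 0]


def cost(w, samples):
    """J(w) = sum over H of -d * w^T x, which equals sum over H of |w^T x|."""
    return sum(-d * dot(w, x) for x, d in misclassified(w, samples))


# Hypothetical augmented samples (x_0 = +1) with desired responses d.
samples = [
    ([1.0, 2.0, 1.0], +1),
    ([1.0, -1.0, -2.0], -1),
    ([1.0, 1.0, 3.0], +1),
]
w = [0.0, -1.0, 0.5]

H = misclassified(w, samples)
print(len(H), cost(w, samples))
```

Note that a sample lying exactly on the boundary ($ \mathbf{w}^T \mathbf{x} = 0 $) is counted in $ \mathcal{H} $ here but contributes nothing to the cost.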

- Learning via gradient descent

$$ \nabla J(\mathbf{w}) = \frac{\partial}{\partial \mathbf{w}} ( - \sum_{\mathbf{x} \in \mathcal{H}} d \mathbf{w}^T \mathbf{x} ) = - \sum_{\mathbf{x} \in \mathcal{H}} d \mathbf{x} $$
It has the same error-correction form as the <strong>Widrow-Hoff (or LMS)</strong> learning rule: <br>

 - The correction depends on the <strong>error</strong>
 - The update is small (or large) when the error is small (or large)

An equivalent update takes the following form:
$$
\begin{aligned}
\mathbf{w} &\leftarrow \mathbf{w} + \eta \sum_{\mathbf{x} \in \mathcal{H}} d \mathbf{x}, \quad \mathcal{H} : \text{set of misclassified samples} \\
&= \mathbf{w} + \frac{\eta}{2} \sum_{\mathbf{x} \in \mathcal{H}} (d - y) \mathbf{x}, \qquad \text{where } (d-y) \text{ is the error}
\end{aligned}
$$
The second line holds because $ y = -d $ for every misclassified sample, so $ d - y = 2d $.

Batch training (the constant factor $ \frac{1}{2} $ is absorbed into $ \eta $):
$$ \mathbf{w} \leftarrow \mathbf{w} + \eta \sum_{\mathbf{x} \in \mathcal{H}} (d-y) \mathbf{x} $$ 

Online version:
$$ \mathbf{w} \leftarrow \mathbf{w} + \eta (d-y) \mathbf{x} $$ 

Both learning algorithms are shown to converge by Novikoff's theorem.

<img src="./res/ch02/alg_2_1.png" width="700" height="500"><br>
<div align="center">
  Algorithm.2.1.1
</div>
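
The batch and online updates can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not a transcription of Algorithm.2.1.1: the function names, the zero initial weights, the `max_epochs` cap, and the AND-style data set are all hypothetical, and inputs are assumed to be augmented with $ x_0 = +1 $ already.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def sgn(v):
    return 1 if v > 0 else -1


def train_online(samples, eta=1.0, max_epochs=100):
    """Online rule: w <- w + eta * (d - y) * x, applied one sample at a time."""
    w = [0.0] * len(samples[0][0])
    for _ in range(max_epochs):
        errors = 0
        for x, d in samples:
            y = sgn(dot(w, x))
            if y != d:  # d - y is nonzero only for misclassified samples
                w = [wi + eta * (d - y) * xi for wi, xi in zip(w, x)]
                errors += 1
        if errors == 0:  # a full pass without corrections: w is a solution
            return w
    return w


def train_batch(samples, eta=1.0, max_epochs=100):
    """Batch rule: w <- w + eta * sum over H of (d - y) * x."""
    w = [0.0] * len(samples[0][0])
    for _ in range(max_epochs):
        step = [0.0] * len(w)
        errors = 0
        for x, d in samples:
            y = sgn(dot(w, x))
            if y != d:  # accumulate the correction over the misclassified set H
                step = [si + (d - y) * xi for si, xi in zip(step, x)]
                errors += 1
        if errors == 0:
            return w
        w = [wi + eta * si for wi, si in zip(w, step)]
    return w


# Hypothetical linearly separable data: logical AND with targets in {-1, +1}.
samples = [
    ([1.0, 0.0, 0.0], -1),
    ([1.0, 0.0, 1.0], -1),
    ([1.0, 1.0, 0.0], -1),
    ([1.0, 1.0, 1.0], +1),
]
w_online = train_online(samples)
w_batch = train_batch(samples)
print(w_online, w_batch)
```

The two variants generally end at different weight vectors, since any vector that separates the classes is a valid solution. The `max_epochs` cap only guards against data that is not linearly separable, where neither loop would otherwise terminate.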


2.1.7. Geometric Interpretation<br>
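Each misclassified sample $ \mathbf{x} $ pulls the weight vector toward $ \mathbf{x} $ (if $ d = +1 $) or pushes it away from $ \mathbf{x} $ (if $ d = -1 $):
$$ \mathbf{w}_{new} \leftarrow \mathbf{w}_{old} + d \mathbf{x} $$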

<img src="./res/ch02/fig_2_3.png" width="550" height="300"><br>
<div align="center">
  Figure.2.1.3
</div>

This procedure works for linearly separable sets.
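
A short worked step, added here as a sketch, shows why the correction moves the boundary in the right direction. For a misclassified sample, $ d \mathbf{w}_{old}^T \mathbf{x} \le 0 $. After the update, using $ d^2 = 1 $:

$$ d \mathbf{w}_{new}^T \mathbf{x} = d (\mathbf{w}_{old} + d \mathbf{x})^T \mathbf{x} = d \mathbf{w}_{old}^T \mathbf{x} + \| \mathbf{x} \|^2 $$

The quantity $ d \mathbf{w}^T \mathbf{x} $ therefore grows by $ \| \mathbf{x} \|^2 \ge 1 $ (the augmented vector always contains $ x_0 = +1 $), but one update need not be enough. With the illustrative values $ \mathbf{w}_{old} = [0, -3, 0]^T $, $ \mathbf{x} = [1, 1, 0]^T $, $ d = +1 $:

$$ d \mathbf{w}_{old}^T \mathbf{x} = -3 \quad \rightarrow \quad \mathbf{w}_{new} = [1, -2, 0]^T, \,\ d \mathbf{w}_{new}^T \mathbf{x} = -1 \quad \rightarrow \quad \mathbf{w} = [2, -1, 0]^T, \,\ d \mathbf{w}^T \mathbf{x} = +1 $$

The sample is still misclassified after the first correction and becomes correctly classified after the second.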

2.1.8. 

### 2.2 Perceptron Convergence Theorem
2.2.1. Perceptron Convergence Theorem<br>
If the two subsets of training vectors are linearly separable (see the assumptions in 2.1.4), the perceptron converges after some $ n_0 $ iterations, in the sense that

$$ \mathbf{w}(n_0) = \mathbf{w}(n_0 + i) \qquad \forall i \in \mathbb{N} $$

where $ \mathbf{w}(n_0) $ is a solution vector and $ n_0 \le n_{max} $.<br><br>

<strong>Proof.</strong><br>
When the sample is correctly classified, no correction is made:
$$ \mathbf{w}(n+1) = \mathbf{w}(n)  \quad if \,\ \mathbf{w}^T(n) \mathbf{x}(n) > 0, \,\ \mathbf{x}(n) \in C_1 $$
$$ \mathbf{w}(n+1) = \mathbf{w}(n)  \quad if \,\ \mathbf{w}^T(n) \mathbf{x}(n) \le 0, \,\ \mathbf{x}(n) \in C_2 $$
When wrongly classified, the weight vector is updated as
$$ \mathbf{w}(n+1) = \mathbf{w}(n) - \eta \mathbf{x}(n) \quad if \,\ \mathbf{w}^T(n) \mathbf{x}(n) > 0, \,\ \mathbf{x}(n) \in C_2 $$
$$ \mathbf{w}(n+1) = \mathbf{w}(n) + \eta \mathbf{x}(n) \quad if \,\ \mathbf{w}^T(n) \mathbf{x}(n) \le 0, \,\ \mathbf{x}(n) \in C_1 $$
<br>
Consider the worst case, in which every presented sample belongs to $ C_1 $ and is misclassified:
$$ \eta = 1, \,\ \mathbf{w}(0) = 0, \,\ \mathbf{w}^T(n) \mathbf{x}(n) < 0, \,\ \mathbf{x}(n) \in C_1 , \,\ \forall n $$
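
The notes stop at the worst-case setup. What follows is a sketch of how the argument is usually completed, following the standard textbook proof. The symbols $ \mathbf{w}_o $, $ \alpha $, and $ \beta $ are introduced for this sketch and do not appear in the original notes.

In the worst case every step applies $ \mathbf{w}(n+1) = \mathbf{w}(n) + \mathbf{x}(n) $, so with $ \mathbf{w}(0) = 0 $:

$$ \mathbf{w}(n) = \mathbf{x}(0) + \mathbf{x}(1) + \cdots + \mathbf{x}(n-1) $$

Linear separability guarantees a solution vector $ \mathbf{w}_o $ with $ \mathbf{w}_o^T \mathbf{x} > 0 $ for every $ \mathbf{x} \in C_1 $. Let $ \alpha = \min_{\mathbf{x} \in C_1} \mathbf{w}_o^T \mathbf{x} > 0 $. Multiplying the sum by $ \mathbf{w}_o^T $ and applying the Cauchy-Schwarz inequality gives a lower bound:

$$ \mathbf{w}_o^T \mathbf{w}(n) \ge n \alpha \quad \Rightarrow \quad \| \mathbf{w}(n) \|^2 \ge \frac{n^2 \alpha^2}{\| \mathbf{w}_o \|^2} $$

For an upper bound, expand one update and use $ \mathbf{w}^T(k) \mathbf{x}(k) < 0 $:

$$ \| \mathbf{w}(k+1) \|^2 = \| \mathbf{w}(k) \|^2 + 2 \mathbf{w}^T(k) \mathbf{x}(k) + \| \mathbf{x}(k) \|^2 \le \| \mathbf{w}(k) \|^2 + \| \mathbf{x}(k) \|^2 $$

Summing over $ k = 0, \cdots, n-1 $ with $ \beta = \max_{\mathbf{x} \in C_1} \| \mathbf{x} \|^2 $ gives $ \| \mathbf{w}(n) \|^2 \le n \beta $. The two bounds are compatible only if

$$ \frac{n^2 \alpha^2}{\| \mathbf{w}_o \|^2} \le n \beta \quad \Rightarrow \quad n \le n_{max} = \frac{\beta \| \mathbf{w}_o \|^2}{\alpha^2} $$

so the number of corrections is finite and the weight vector stops changing after at most $ n_{max} $ updates.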

### 2.3 Example

https://www.ctan.org/pkg/algorithmicx<br>