## 1. Learning Rule of the Perceptron

* We have **$P$ training points (patterns)**, each with an input vector $x_p$ of $N$ components $x_{jp}$ and a label $y_p$ which is either $-1$ or $+1$.
* The perceptron has weights $w_j$ connecting inputs to output.
* For input $x_p$, the output is:

$$
O_p = \text{sign}(w^T x_p)
$$

* If the output $O_p$ is wrong (not equal to $y_p$), update weights:

$$
w_j \leftarrow w_j + \Delta_j
$$

where

$$
\Delta_j = \begin{cases}
2 \eta y_p x_{jp}, & \text{if } O_p \neq y_p \\
0, & \text{otherwise}
\end{cases}
$$

* $\eta$ is the learning rate (a small positive number).

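* Equivalent update formulas, valid for every pattern because they vanish when $O_p = y_p$, are:

$$
\Delta_j = \eta (1 - O_p y_p) y_p x_{jp} \quad \text{or} \quad \Delta_j = \eta (y_p - O_p) x_{jp}
$$

  On an error $O_p y_p = -1$, so $1 - O_p y_p = 2$ and $y_p - O_p = 2 y_p$; both forms reduce to $2 \eta y_p x_{jp}$.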
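
The rule above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the names `sign` and `train_perceptron`, the convention $\text{sign}(0) = +1$, the epoch limit, and the toy dataset (logical AND in $\pm 1$ coding with a constant first input acting as a bias) are all assumptions made for the example.

```python
def sign(z):
    # Assumed convention for this sketch: sign(0) = +1
    return 1 if z >= 0 else -1


def train_perceptron(X, y, eta=0.1, max_epochs=100):
    """Cycle through the patterns and update the weights only on errors."""
    w = [0.0] * len(X[0])
    for _ in range(max_epochs):
        errors = 0
        for x_p, y_p in zip(X, y):
            o_p = sign(sum(wj * xj for wj, xj in zip(w, x_p)))
            if o_p != y_p:
                # Delta_j = eta * (y_p - O_p) * x_jp, which is 2*eta*y_p*x_jp on an error
                w = [wj + eta * (y_p - o_p) * xj for wj, xj in zip(w, x_p)]
                errors += 1
        if errors == 0:  # a full pass without mistakes: stop
            break
    return w


# Toy data (assumed): logical AND, first component is a constant bias input
X = [(1, -1, -1), (1, -1, 1), (1, 1, -1), (1, 1, 1)]
y = [-1, -1, -1, 1]

w = train_perceptron(X, y)
outputs = [sign(sum(wj * xj for wj, xj in zip(w, x_p))) for x_p in X]
print(w, outputs)
```

After training, `outputs` should match `y` for this separable toy problem.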

---

## 2. Perceptron Optimality and Margin

* Instead of just correct sign matching, we want the product $y_p w^T x_p$ to be **larger than a margin** $N \rho$, where:

  * $N$ is the number of inputs
  * $\rho > 0$ is the margin size (a fixed positive number)

* Condition:

$$
y_p w^T x_p > N \rho
$$

* When this is true, the pattern is confidently classified (not just barely correct).

* The update rule with margin becomes (the factor 2 from Section 1 is absorbed into $\eta$):

$$
\Delta_j = \eta \, \Theta \left(N \rho - y_p w^T x_p\right) y_p x_{jp}
$$

where $\Theta$ is the step function:

$$
\Theta(z) = \begin{cases}
1 & \text{if } z \geq 0 \\
0 & \text{otherwise}
\end{cases}
$$

* This means weights update only if the margin condition fails.
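
As a sketch of how the margin rule might be coded, the function below applies $\Delta_j = \eta\,\Theta(N\rho - y_p w^T x_p)\,y_p x_{jp}$ until no pattern triggers an update. The function name `train_margin_perceptron`, the `max_updates` safety cap, and the toy AND dataset are assumptions for illustration; the values $\eta = 0.5$ and $\rho = 0.25$ are chosen only so that all arithmetic is exact in floating point.

```python
def train_margin_perceptron(X, y, rho, eta=0.1, max_updates=10_000):
    """Update w whenever y_p * w.x_p <= N * rho, i.e. whenever Theta(...) = 1."""
    n = len(X[0])
    w = [0.0] * n
    updates = 0
    changed = True
    while changed and updates < max_updates:
        changed = False
        for x_p, y_p in zip(X, y):
            stability = y_p * sum(wj * xj for wj, xj in zip(w, x_p))
            if n * rho - stability >= 0:  # margin condition fails
                w = [wj + eta * y_p * xj for wj, xj in zip(w, x_p)]
                updates += 1
                changed = True
    return w, updates


# Toy data (assumed): logical AND in +/-1 coding with a constant bias input
X = [(1, -1, -1), (1, -1, 1), (1, 1, -1), (1, 1, 1)]
y = [-1, -1, -1, 1]

w, updates = train_margin_perceptron(X, y, rho=0.25, eta=0.5)
stabilities = [y_p * sum(wj * xj for wj, xj in zip(w, x_p)) for x_p, y_p in zip(X, y)]
print(w, updates, stabilities)
```

When the loop ends without hitting the cap, every entry of `stabilities` exceeds $N\rho$, which is exactly the margin condition.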

---

## 3. Margin and Optimal Weights

* Define $\hat{x}_p = y_p x_p$ to combine input and label.

* The margin condition is:

$$
w^T \hat{x}_p > N \rho
$$

* **Margin $M(w)$** is the distance of the closest point to the decision boundary:

$$
M(w) = \frac{1}{\|w\|} \min_p w^T \hat{x}_p
$$

* The **best weights $w^*$** maximize the margin:

$$
w^* = \arg \max_w M(w) = \arg \max_w \frac{\min_p w^T \hat{x}_p}{\|w\|}
$$

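* A larger margin leaves more room for perturbations of the inputs before any training pattern crosses the decision boundary, so the classification is more robust.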
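
To make the definition of $M(w)$ concrete, here is a small sketch that evaluates it for a few weight vectors. The helper name `margin`, the candidate weight vectors, and the toy AND dataset are assumptions for illustration only.

```python
import math


def margin(w, X, y):
    """M(w) = min_p (w . y_p x_p) / ||w||."""
    norm = math.sqrt(sum(wj * wj for wj in w))
    worst = min(y_p * sum(wj * xj for wj, xj in zip(w, x_p)) for x_p, y_p in zip(X, y))
    return worst / norm


# Toy data (assumed): logical AND in +/-1 coding with a constant bias input
X = [(1, -1, -1), (1, -1, 1), (1, 1, -1), (1, 1, 1)]
y = [-1, -1, -1, 1]

m_a = margin([-1.0, 1.0, 1.0], X, y)   # a separating vector
m_b = margin([-0.2, 0.2, 0.2], X, y)   # same direction, different length
m_c = margin([-3.0, 2.0, 2.0], X, y)   # another separating vector
print(m_a, m_b, m_c)
```

Because of the division by $\|w\|$, rescaling $w$ leaves $M(w)$ unchanged, so `m_a` and `m_b` agree up to rounding, while `m_c` is smaller: that direction separates the data less robustly.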

---

## 4. Proof of Convergence of the Learning Rule

### What we want to prove:

If the data are linearly separable (there exists some $w^*$ that classifies all patterns correctly with margin $M(w^*) > 0$), then the perceptron learning algorithm **stops after a finite number of updates, with weights $w$ that classify every pattern correctly**.

---

### Step 1: Define margin of best solution

$$
M(w^*) = \frac{1}{\|w^*\|} \min_p w^{*T} \hat{x}_p > 0
$$

---

### Step 2: Express current weights after $H$ updates

Starting from $w = 0$ and applying the margin update $w \leftarrow w + \eta \hat{x}_p$ each time a pattern $p$ violates the margin condition:

$$
w = \eta \sum_{p=1}^P H^p \hat{x}_p
$$

where $H^p$ is how many times pattern $p$ was used in an update, and $H = \sum_p H^p$ is total updates.

---

### Step 3: Growth of overlap with $w^*$

Calculate the dot product between $w$ and $w^*$:

$$
w \cdot w^* = \eta \sum_p H^p \hat{x}_p \cdot w^* \geq \eta \min_p (\hat{x}_p \cdot w^*) \sum_p H^p = \eta H M(w^*) \|w^*\|
$$

This means the projection $w \cdot w^*$ grows **at least linearly with $H$**.

---

### Step 4: Growth of weight length $\|w\|$

At each update:

$$
\Delta \|w\|^2 = \|w + \eta \hat{x}_p\|^2 - \|w\|^2 = \eta^2 \|\hat{x}_p\|^2 + 2 \eta w \cdot \hat{x}_p
$$

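Since the components of $\hat{x}_p$ are bounded (typically $\pm 1$), we have:

$$
\|\hat{x}_p\|^2 \leq N
$$

Moreover, an update on pattern $p$ happens only when its margin condition fails, so at that moment:

$$
w \cdot \hat{x}_p \leq N \rho
$$

So,

$$
\Delta \|w\|^2 \leq N \eta (\eta + 2\rho)
$$

Summing over $H$ updates, starting from $w = 0$,

$$
\|w\|^2 \leq H N \eta (\eta + 2\rho)
$$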

Thus, $\|w\|$ grows **at most proportional to $\sqrt{H}$**.
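
The two growth estimates can be checked numerically. The sketch below runs the margin-rule updates and records, after each update, the overlap $w \cdot w^*$ next to its lower bound $\eta H \min_p (\hat{x}_p \cdot w^*)$, and $\|w\|^2$ next to its upper bound $H N \eta (\eta + 2\rho)$, with $\rho$ the margin parameter from Section 2. The function name `track_bounds`, the reference vector `w_ref` (any separating vector works in place of $w^*$ here), and the toy data are assumptions; the inputs are $\pm 1$ so that $\|\hat{x}_p\|^2 = N$.

```python
def track_bounds(X, y, w_ref, rho, eta, max_updates=1000):
    """Record both sides of the Step 3 and Step 4 inequalities after each update."""
    n = len(X[0])
    x_hat = [[y_p * xj for xj in x_p] for x_p, y_p in zip(X, y)]
    # min_p (x_hat_p . w_ref) equals M(w_ref) * ||w_ref||
    min_overlap = min(sum(a * b for a, b in zip(xh, w_ref)) for xh in x_hat)
    w = [0.0] * n
    history = []
    h = 0
    changed = True
    while changed and h < max_updates:
        changed = False
        for xh in x_hat:
            if sum(a * b for a, b in zip(w, xh)) <= n * rho:  # margin violated
                w = [a + eta * b for a, b in zip(w, xh)]
                h += 1
                changed = True
                overlap = sum(a * b for a, b in zip(w, w_ref))
                norm_sq = sum(a * a for a in w)
                history.append((h, overlap, eta * h * min_overlap,
                                norm_sq, h * n * eta * (eta + 2 * rho)))
    return history


# Toy data (assumed): logical AND in +/-1 coding with a constant bias input
X = [(1, -1, -1), (1, -1, 1), (1, 1, -1), (1, 1, 1)]
y = [-1, -1, -1, 1]

history = track_bounds(X, y, w_ref=[-1.0, 1.0, 1.0], rho=0.25, eta=0.5)
for h, overlap, lower, norm_sq, upper in history:
    print(h, overlap, ">=", lower, "|", norm_sq, "<=", upper)
```

Each printed row should show the overlap at or above its linear lower bound and the squared norm at or below its linear upper bound.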

---

### Step 5: Combining the two

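Look at the normalized overlap:

$$
\frac{w \cdot w^*}{\|w\|} \geq \frac{\eta H M(w^*) \|w^*\|}{\sqrt{H N \eta (\eta + 2\rho)}} = \sqrt{H} \, \frac{\eta M(w^*) \|w^*\|}{\sqrt{N \eta (\eta + 2\rho)}}
$$

The lower bound grows as $\sqrt{H}$, but the left-hand side can never exceed $\|w^*\|$, because the cosine of the angle between $w$ and $w^*$ is at most 1. So $H$ cannot grow without bound.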

---

### Step 6: Bound on $H$

Define:

$$
\phi = \frac{(w \cdot w^*)^2}{\|w\|^2 \|w^*\|^2} = \cos^2(\alpha) \leq 1
$$

Squaring the previous inequality and dividing by $\|w^*\|^2$,

$$
1 \geq \phi \geq H \frac{M(w^*)^2 \eta}{N (\eta + 2\rho)}
$$

Rearranged:

$$
H \leq \frac{N (\eta + 2\rho)}{M(w^*)^2 \eta}
$$

This says the total number of updates $H$ is **bounded and finite**.
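
To get a feel for the bound, the sketch below evaluates $N(\eta + 2\rho) / (M(w^*)^2 \eta)$, with $\rho$ the margin parameter from Section 2. The function name `update_bound` and all numeric values are assumptions: $N = 3$, $\eta = 0.5$, $\rho = 0.25$, and $M = 1/\sqrt{3}$, which is the margin of the vector $(-1, 1, 1)$ on a $\pm 1$-coded AND problem with a constant bias input.

```python
import math


def update_bound(n_inputs, eta, rho, margin_star):
    """Upper bound on the total number of updates: N (eta + 2 rho) / (M^2 eta)."""
    return n_inputs * (eta + 2 * rho) / (margin_star ** 2 * eta)


m = 1 / math.sqrt(3)  # assumed margin of a separating solution
bound = update_bound(3, eta=0.5, rho=0.25, margin_star=m)
bound_half_margin = update_bound(3, eta=0.5, rho=0.25, margin_star=m / 2)
print(bound, bound_half_margin)
```

Halving the margin multiplies the bound by four, reflecting the $1/M(w^*)^2$ dependence. The bound is a worst-case guarantee, so an actual run may need far fewer updates.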

---

## Summary:

* If the data are linearly separable, the perceptron learning rule stops after a finite number of updates.
* The bound on the number of updates depends on the margin $M(w^*)$, the learning rate $\eta$, the margin parameter $\rho$, and the number of inputs $N$.
* A larger margin $M(w^*)$ gives a smaller bound, since the bound scales as $1 / M(w^*)^2$.

