<div style="position: relative; text-align: center; padding: 30px;">
  <h1><strong>Perceptron Algorithm</strong></h1>
</div>


The main goal of the **Perceptron algorithm** is to compute the unknown **parameters** $w_i$, $i=0,1,\ldots, l$ defining the decision hyperplane.

First, we assume that two classes $\omega_1, \omega_2$ are **linearly separable**. In other words, we assume that **there exists** a hyperplane, defined by
$$\textbf{w}^{*T} \textbf{x} =0$$
such that
$$
\textbf{w}^{*T}\textbf{x}  > 0 \quad \forall \textbf{x}\in \omega_1\\
\textbf{w}^{*T}\textbf{x}  < 0 \quad \forall \textbf{x}\in \omega_2\\
$$

This formulation also **covers the case of a hyperplane not crossing the origin**,
that is,
$$\textbf{w}^{*T} \textbf{x} +w_{0}^{*} =0$$
since this can be incorporated into the previous formulation by defining the extended $(l+1)$-dimensional vectors

$$
\textbf{x}^{'} \equiv (\textbf{x}^T, 1)^T\\
\textbf{w}^{'} \equiv (\textbf{w}^{*T}, w_0^{*})^T
$$
Then
$$\textbf{w}^{*T} \textbf{x} + w_0^{*} = \textbf{w}^{'T} \textbf{x}^{'}$$
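
A minimal sketch of this augmentation follows. The numeric values and the names `w`, `w0`, `x`, `x_ext`, and `w_ext` are illustrative assumptions, not taken from the text. It checks that appending a constant 1 to the feature vector absorbs the bias term into the weight vector:

```python
import numpy as np

# Hypothetical values chosen only for illustration
w = np.array([2.0, -1.0])    # weight vector in the original l-dimensional space
w0 = 0.5                     # bias (threshold) term
x = np.array([1.0, 3.0])     # a feature vector

# Extended (l+1)-dimensional vectors
x_ext = np.append(x, 1.0)    # x' = (x^T, 1)^T
w_ext = np.append(w, w0)     # w' = (w^T, w0)^T

affine = w @ x + w0          # w^T x + w0
homogeneous = w_ext @ x_ext  # w'^T x'
print(affine, homogeneous)
```

Both expressions give the same value, so from here on the bias can be treated as just another weight.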

We will approach the problem as a typical **optimization task**.

Thus we need to adopt
* An appropriate **cost function**
* An **algorithmic scheme** to optimize it.

## **The perceptron cost**

The **Perceptron cost** is defined as

$$J(\textbf{w}) = \sum_{x\in Y} (\delta_{x}\textbf{w}^T \textbf{x}) \tag{1}$$

where $Y$ is the subset of the training vectors that are **misclassified** by the hyperplane defined by the weight vector $\textbf{w}$.

The variable $\delta_x$ is chosen so that
$$\delta_x =\begin{cases} -1 & \text{if} \ \textbf{x}\in \omega_1\\
                          1 & \text{if} \ \textbf{x}\in \omega_2
\end{cases}$$

Therefore, the sum in (1) is **always nonnegative**, and it becomes zero when $Y$ becomes the empty set, that is, when there are no misclassified vectors $\textbf{x}$.

Indeed:
> If $x \in \omega_1$ and it is misclassified, then $\textbf{w}^T \textbf{x} <0$, and the product with $\delta_x=-1$ is positive.

> If $x \in \omega_2$ and it is misclassified, then $\textbf{w}^T \textbf{x} >0$, and the product with $\delta_x=1$ is positive.

When the cost function takes its minimum value, 0, a solution has been obtained, since all training feature vectors are correctly classified.
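
The definition in (1) can be sketched directly in code. This is an illustrative version under stated assumptions: the function name `perceptron_cost_delta`, the class encoding (1 for $\omega_1$, 2 for $\omega_2$), and the toy data are all hypothetical.

```python
import numpy as np

def perceptron_cost_delta(w, X, labels):
    """Cost (1): sum of delta_x * w^T x over the misclassified set Y.

    labels[i] is 1 for class omega_1 and 2 for class omega_2
    (an encoding assumed here for illustration).
    """
    delta = np.where(labels == 1, -1.0, 1.0)  # delta_x = -1 for omega_1, +1 for omega_2
    scores = delta * (X @ w)                  # delta_x * w^T x for every training vector
    in_Y = scores > 0                         # misclassified exactly when the product is positive
    return scores[in_Y].sum()

# Toy data (illustrative only), separable by a line through the origin
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
labels = np.array([1, 1, 2, 2])

good_w = np.array([1.0, 1.0])    # classifies every point correctly
bad_w = np.array([-1.0, -1.0])   # misclassifies every point
cost_good = perceptron_cost_delta(good_w, X, labels)
cost_bad = perceptron_cost_delta(bad_w, X, labels)
print(cost_good, cost_bad)
```

The separating vector yields zero cost, while the reversed one accumulates a positive term for each misclassified point.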

### **Alternative formulation**


Another way to understand the **Perceptron** is as follows. Let $\{(\textbf{x}_i, y_i)\}_{i=1}^N$ be the training set, where $\textbf{x}_i$ is the feature vector of the $i$th instance, and $y_i\in \{-1,1\}$ is its label: $y_i=1$ if instance $i$ belongs to $\omega_1$, and $y_i=-1$ if it belongs to $\omega_2$.

We consider the following functions

$$
f_{\textbf{x}}(\textbf{w})  = g_{\textbf{w}}(\textbf{x}) := \textbf{w}^T\textbf{x}
$$

Then, the **perceptron cost function** becomes

$$
J(\textbf{w}) = \sum_{i=1}^N \max \{0, \ -y_i \cdot f_{\textbf{x}_i}(\textbf{w})\} \tag{2}$$
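
As a rough check that the two formulations agree, the sketch below evaluates the max-based cost and the sum over misclassified vectors on the same toy data, using $\delta_x = -y$. The function names and the data are assumptions made for illustration.

```python
import numpy as np

def perceptron_cost_max(w, X, y):
    """Max-based cost: sum_i max(0, -y_i * w^T x_i), with y_i in {-1, +1}."""
    return np.maximum(0.0, -y * (X @ w)).sum()

def perceptron_cost_sum(w, X, y):
    """Sum of delta_x * w^T x over misclassified vectors, with delta_x = -y."""
    scores = -y * (X @ w)
    return scores[scores > 0].sum()

# Toy data (illustrative only)
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])   # +1 for omega_1, -1 for omega_2

w = np.array([1.0, -1.0])              # misclassifies two of the four points
cost_max = perceptron_cost_max(w, X, y)
cost_sum = perceptron_cost_sum(w, X, y)
print(cost_max, cost_sum)
```

Correctly classified vectors contribute zero to the max-based form, which is why both sums coincide.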



## **The gradient descent scheme**

Note that the perceptron cost $J$ is **continuous** and **piecewise linear**.

Indeed, if we change the weight vector smoothly, the cost $J(\textbf{w})$ changes linearly until the point at which there is a change in the number of misclassified vectors.
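
The piecewise-linear behaviour can be probed numerically. In the sketch below (toy data and weight vectors are hypothetical), the cost at the midpoint of a segment equals the average of the endpoint costs as long as the misclassified set does not change, and stops doing so once it does.

```python
import numpy as np

def perceptron_cost(w, X, y):
    """Perceptron cost: sum_i max(0, -y_i * w^T x_i)."""
    return np.maximum(0.0, -y * (X @ w)).sum()

# Toy data (illustrative only)
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w_a = np.array([1.0, -1.0])
w_b = np.array([1.0, -0.8])   # same misclassified set as w_a
w_c = np.array([1.0, 1.0])    # no misclassified vectors

# Inside one linear piece: midpoint cost equals the average of the endpoint costs
mid_ab = perceptron_cost((w_a + w_b) / 2, X, y)
avg_ab = (perceptron_cost(w_a, X, y) + perceptron_cost(w_b, X, y)) / 2

# Across a change in the misclassified set: the equality no longer holds
mid_ac = perceptron_cost((w_a + w_c) / 2, X, y)
avg_ac = (perceptron_cost(w_a, X, y) + perceptron_cost(w_c, X, y)) / 2
print(mid_ab, avg_ab, mid_ac, avg_ac)
```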

In [1]:
# Libraries
import matplotlib.pyplot as plt
import numpy as np
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.colors as mcolors
from mpl_toolkits.axes_grid1 import make_axes_locatable
import ipywidgets as widgets

In [2]:
# Training set
np.random.seed(0)
X1 = np.random.randn(5, 2) + np.array([2, 2])  # Class 1
X2 = np.random.randn(5, 2) + np.array([-2, -2])  # Class 2
X = np.vstack((X1, X2))
y = np.hstack((np.ones(5), -np.ones(5)))  # Labels: +1 (class 1) and -1 (class 2)
lim = (-5, 5)
# Grid over the weight space (w1, w2)
w1_vals = np.linspace(lim[0], lim[1], 100)
w2_vals = np.linspace(lim[0], lim[1], 100)
W1, W2 = np.meshgrid(w1_vals, w2_vals)

# Perceptron cost
def perceptron_cost(w):
    return sum(max(0, -y[i] * (w @ X[i])) for i in range(len(X)))

def draw(w1, w2):
  fig, ax = plt.subplots(1, 3, figsize=(20, 6))

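  # Scatter plot of training data and decision line H: w1*x1 + w2*x2 = 0
  w = np.array([w1, w2])
  x_vals = np.linspace(lim[0], lim[1], 100)
  if w[1] != 0:
    # Non-vertical line: x2 = -(w1 / w2) * x1
    ax[0].plot(x_vals, -w[0] / w[1] * x_vals, linestyle='--', color='black', label="$H$")
  elif w[0] != 0:
    # w2 == 0: the decision line is the vertical axis x1 = 0
    ax[0].axvline(0, linestyle='--', color='black', label="$H$")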
  ax[0].scatter(X1[:, 0], X1[:, 1], color='blue', label="Class 1")
  ax[0].scatter(X2[:, 0], X2[:, 1], color='red', label="Class 2")
  ax[0].quiver(0, 0, w[0], w[1], angles='xy', scale_units='xy', scale=1, color='black', label=r'$\mathbf{w}$')
  ax[0].set_xlabel("$x_1$")
  ax[0].set_ylabel("$x_2$")
  ax[0].set_title("Training data")
  ax[0].axhline(0, color='black', linewidth=0.5, linestyle='--')
  ax[0].axvline(0, color='black', linewidth=0.5, linestyle='--')
  ax[0].set_xlim(lim)
  ax[0].set_ylim(lim)
  ax[0].legend()
  ax[0].set_aspect('equal')
  ax[0].grid(linestyle='--')

  W1_flat = W1.flatten()
  W2_flat = W2.flatten()
  J = np.array([perceptron_cost(np.array(weights)) for weights in zip(W1_flat, W2_flat)]).reshape(W1.shape)
  J_w = perceptron_cost(w)

  # Contour plot in ax[1]
  contour = ax[1].contourf(W1, W2, J, cmap='viridis', alpha=0.9)
  ax[1].scatter(w[0], w[1], color='black', s=100, label=r"$\mathbf{w}$")
  ax[1].text(w[0], w[1], f"$J(w)={np.round(J_w, 3)}$",
             color="white", horizontalalignment="left", va="bottom")
  ax[1].set_xlabel("$w_1$")
  ax[1].set_ylabel("$w_2$")
  ax[1].set_title("Level sets of the cost function")
  ax[1].grid(linestyle='--')
  ax[1].axhline(0, color='black', linewidth=0.5, linestyle='--')
  ax[1].axvline(0, color='black', linewidth=0.5, linestyle='--')
  ax[1].set_aspect('equal')
  ax[1].set_xlim(lim)
  ax[1].set_ylim(lim)
  ax[1].legend()
  divider = make_axes_locatable(ax[1])
  cax = divider.append_axes("right", size="5%", pad=0.05)
  cbar = plt.colorbar(contour, cax=cax)
  cbar.set_label("Cost function value")

  # 3D Surface plot in ax[2]
  # Swap the flat placeholder axes for a 3D axes in the same grid position
  ax[2].remove()
  ax[2] = fig.add_subplot(1, 3, 3, projection='3d')
  ax[2].plot_surface(W1, W2, J, cmap='viridis', alpha=0.75, linewidth=1)
  ax[2].set_xlabel("$w_1$")
  ax[2].set_ylabel("$w_2$")
  ax[2].set_zlabel("$J(w)$")
  ax[2].set_title("Cost function surface")
  ax[2].view_init(elev=30, azim=-45)
  ax[2].set_xlim(lim)
  ax[2].set_ylim(lim)
  ax[2].scatter(w[0], w[1], J_w, color='black', s=100, label=r"$\mathbf{w}$")
  ax[2].legend()
  ax[2].grid(linestyle='--')

  plt.show()

In [3]:
# Interactive draw
w1_slider = widgets.FloatSlider(min=lim[0], max=lim[1], step=1, value=1, description='w1')
w2_slider = widgets.FloatSlider(min=lim[0], max=lim[1], step=1, value=1, description='w2')
widgets.interactive(draw, w1=w1_slider, w2=w2_slider)


To derive the algorithm for the **iterative minimization** of the cost function,
we will adopt an iterative scheme in the spirit of the **gradient descent method**, that is,

$$\textbf{w}(t+1) = \textbf{w}(t) - \alpha_t \frac{\partial J(\textbf{w})}{\partial \textbf{w}} |_{\textbf{w}=\textbf{w}(t)}$$

where $\textbf{w}(t)$ is the weight vector estimate at the $t$th iteration step and $\alpha_t$ is a sequence of positive real numbers.

However, the gradient is not defined at the points where $J$ is not differentiable, that is, where the set of misclassified vectors changes.

From the definition in (1), and at the points where the gradient exists, we get
$$
\frac{\partial J(\textbf{w})}{\partial \textbf{w}} = \frac{\partial}{\partial \textbf{w}}\sum_{x\in Y} (\delta_x\textbf{w}^T \textbf{x})\\
= \sum_{x\in Y} \delta_x\frac{\partial (\textbf{w}^T \textbf{x})}{\partial \textbf{w}}\\
= \sum_{x\in Y} \delta_x\textbf{x}
$$
Substituting, we obtain
$$\textbf{w}(t+1) = \textbf{w}(t) - \alpha_t  \sum_{x\in Y} \delta_x\textbf{x} \tag{3}$$

Note that this equation is defined at all points.

Equation (3) is known as the **perceptron algorithm** or the **perceptron rule**.
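
The gradient expression and a single update step can be checked numerically. The sketch below is illustrative: the helper names, the toy data, and the step size `alpha = 0.05` are assumptions. It compares the analytic gradient with central finite differences at a point where $J$ is differentiable, then applies one step of the rule.

```python
import numpy as np

def perceptron_cost(w, X, y):
    """Perceptron cost: sum_i max(0, -y_i * w^T x_i)."""
    return np.maximum(0.0, -y * (X @ w)).sum()

def perceptron_gradient(w, X, y):
    """Sum of delta_x * x over the misclassified set Y, with delta_x = -y."""
    delta = -y
    in_Y = delta * (X @ w) > 0
    return X[in_Y].T @ delta[in_Y]

# Toy data (illustrative only)
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = np.array([1.0, -1.0])
grad = perceptron_gradient(w, X, y)

# Central finite differences as an independent check of the gradient
eps = 1e-6
numeric = np.array([
    (perceptron_cost(w + eps * e, X, y) - perceptron_cost(w - eps * e, X, y)) / (2 * eps)
    for e in np.eye(2)
])

# One step of the perceptron rule with a hypothetical step size
alpha = 0.05
w_next = w - alpha * grad
cost_before = perceptron_cost(w, X, y)
cost_after = perceptron_cost(w_next, X, y)
print(grad, numeric, cost_before, cost_after)
```

For this particular point and step size the cost decreases after the update; a larger step could overshoot into a different linear piece.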

## **The Algorithm**

1. The algorithm is initialized from an arbitrary weight vector $\textbf{w}(0)$, and the correction vector $\sum_{x\in Y} \delta_x \textbf{x}$ is calculated using the misclassified feature vectors.

2. The weight vector is then corrected according to the preceding rule.

3. This is repeated until the algorithm converges to a solution, that is, until all feature vectors are correctly classified.

Intuitively, in geometric terms, each update moves the weight vector $\textbf{w}$ in the direction of $-\sum_{x\in Y}\delta_x\textbf{x}$, that is, toward the misclassified vectors of $\omega_1$ and away from the misclassified vectors of $\omega_2$.

### **Pseudocode**

Perceptron Algorithm
---------------------
1. Choose $w(0)$ randomly
2. Choose $\alpha_0$
3. $t \leftarrow 0$
4. Repeat:

  a. $Y \leftarrow \emptyset$

  b. For $i = 1$ to $N$ do:

      * If $\delta_{x_i} \textbf{w}(t)^T \textbf{x}_i \geq 0$ then

          * $Y \leftarrow Y \cup \{\textbf{x}_i\}$

  c. $\textbf{w}(t + 1) \leftarrow \textbf{w}(t) - \alpha_t \sum_{x\in Y} \delta_x \textbf{x}$

  d. Update $\alpha_t$

  e. $t \leftarrow t + 1$

5. Until $Y=\emptyset$
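
The loop above could be sketched in Python as follows. This is a minimal illustration under several assumptions: the step size is constant, labels are encoded as $y_i \in \{+1, -1\}$ so that $\delta_x = -y$, and the function name `batch_perceptron`, the `w_init` argument, and the `max_iter` safeguard are not part of the pseudocode.

```python
import numpy as np

def batch_perceptron(X, y, w_init=None, alpha=1.0, max_iter=1000, seed=0):
    """Batch perceptron rule with a constant step size (illustrative sketch).

    X: (N, l) array of feature vectors; y: labels in {+1 (omega_1), -1 (omega_2)}.
    Returns (w, number of updates performed, converged flag).
    """
    rng = np.random.default_rng(seed)
    # Arbitrary initial weight vector w(0), random unless one is supplied
    w = rng.standard_normal(X.shape[1]) if w_init is None else np.asarray(w_init, dtype=float)
    delta = -y                                # delta_x = -1 for omega_1, +1 for omega_2
    for t in range(max_iter):                 # max_iter guards against non-separable data
        in_Y = delta * (X @ w) >= 0           # misclassified vectors (or on the hyperplane)
        if not in_Y.any():
            return w, t, True                 # Y is empty: every vector is classified correctly
        correction = X[in_Y].T @ delta[in_Y]  # sum over Y of delta_x * x
        w = w - alpha * correction
    return w, max_iter, False

# Toy data (illustrative only), separable by a line through the origin
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w, steps, converged = batch_perceptron(X, y, w_init=[1.0, -1.0])
print(w, steps, converged)
```

The stopping test is placed before the update so that the returned vector is the one that was actually verified against the whole training set.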


### **Convergence**

It can be shown that the perceptron algorithm **converges** to a solution $\textbf{w}^{*}$ in a **finite number** of iteration steps, provided that the sequence $\alpha_t$ meets certain conditions (and the problem is **linearly separable**).

In particular, the algorithm converges in a finite number of steps if the sequence $\alpha_t$ satisfies the following two conditions:

$$
\lim_{t\rightarrow \infty} \sum_{k=0}^t \alpha_k = \infty \tag{4}
$$
$$
\lim_{t\rightarrow \infty} \sum_{k=0}^t \alpha_k^2 < \infty \tag{5}
$$

In other words, the corrections become increasingly small. These conditions state that $\alpha_t$ should vanish as $t\rightarrow \infty$ (eq. 5) but, on the other hand, should not go to zero too fast (eq. 4).

An example of a sequence satisfying both conditions is
$$\alpha_t = \frac{c}{t}$$
where $c$ is a positive constant and $t \geq 1$.
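
As a brief worked check, assuming the index starts at $t=1$ and $c>0$, the first sum is a multiple of the harmonic series, which diverges, while the second is a multiple of the convergent series $\sum 1/t^2$:

$$
\sum_{t=1}^{\infty} \frac{c}{t} = c \sum_{t=1}^{\infty} \frac{1}{t} = \infty,
\qquad
\sum_{t=1}^{\infty} \frac{c^2}{t^2} = c^2 \, \frac{\pi^2}{6} < \infty
$$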

It is also possible to show that the algorithm converges for a constant
$\alpha_t = \alpha$, provided $\alpha$ is properly bounded.

In practice, the proper choice of the sequence $\alpha_t$ is vital for the convergence speed of the algorithm.


**Note**: The solution $\textbf{w}^{*}$ **is not unique**, because more than one hyperplane separates two linearly separable classes.

The following are concrete examples of models that **are linear in the parameters** but **not linear in the variables**. These are very common in practice and can be fitted with linear regression once the appropriate transformations are applied.

---

### 🔹 Examples of models that are **nonlinear in the variables** (but **linear in the parameters**):

---

#### 1. **Quadratic model**
$$
y = \beta_0 + \beta_1 x + \beta_2 x^2
$$

- The term $x^2$ makes the model **nonlinear in the variables**
- But the parameters $\beta_0, \beta_1, \beta_2$ enter linearly → ✅ OLS can be used

---

#### 2. **Model with logarithms:**
$$
y = \beta_0 + \beta_1 \ln(x)
$$

- The variable is transformed with a logarithm
- But the parameters enter linearly → ✅

---

#### 3. **Model with a square root:**
$$
y = \beta_0 + \beta_1 \sqrt{x}
$$

- $x$ appears under a nonlinear transformation (square root)
- But the coefficients enter linearly → OLS can be used

---

#### 4. **Model with trigonometric functions:**
$$
y = \beta_0 + \beta_1 \sin(x) + \beta_2 \cos(x)
$$

- The variable $x$ is transformed with sine and cosine functions
- But the model remains linear in the parameters → OLS works

---

#### 5. **Polynomial model of degree $n$:**
$$
y = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_n x^n
$$

- Clearly nonlinear in the variables (because of the polynomial terms)
- But fully linear in the parameters → can be fitted with linear regression

---

### ⚠️ Summary table:

| Model                                 | Linear in variables? | Linear in parameters? | Can OLS be used? |
|---------------------------------------|----------------------|-----------------------|------------------|
| $y = \beta_0 + \beta_1 x$             | ✅ Yes               | ✅ Yes                | ✅ Yes           |
| $y = \beta_0 + \beta_1 x^2$           | ❌ No                | ✅ Yes                | ✅ Yes           |
| $y = \beta_0 + \beta_1 \ln(x)$        | ❌ No                | ✅ Yes                | ✅ Yes           |
| $y = \beta_0 + e^{\beta_1 x}$         | ❌ No                | ❌ No                 | ❌ No            |
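
A quadratic model can be fitted with ordinary least squares by building a design matrix from the transformed variables. The sketch below is illustrative only: the coefficients 1, 2, and -0.5 and the noise-free synthetic data are assumptions chosen so that the recovered parameters are easy to verify.

```python
import numpy as np

# Synthetic, noise-free data from a hypothetical model y = 1 + 2x - 0.5x^2
x = np.linspace(-3, 3, 25)
y = 1.0 + 2.0 * x - 0.5 * x**2

# Design matrix with columns [1, x, x^2]: nonlinear in x, linear in the betas
A = np.column_stack([np.ones_like(x), x, x**2])

# Least-squares solution of A @ beta = y
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
print(beta)
```

The same pattern applies to the other rows that are linear in the parameters: only the columns of the design matrix change, for example `np.log(x)` or `np.sqrt(x)` for positive `x`.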

---

