# The Perceptron: Where It All Began

In 1957, Frank Rosenblatt introduced the perceptron learning rule. It was one of the earliest algorithms that could learn a classification rule from labeled examples.

In this notebook, we will:
1. Understand what machine learning actually is
2. Implement a perceptron from scratch
3. Train it on real data (Iris dataset)
4. Visualize the decision boundary
5. See where it fails (XOR problem)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

## What is Machine Learning?

**Traditional programming:**
```
Rules + Data → Output
```

**Machine learning:**
```
Data + Output → Rules
```

Instead of writing rules by hand, we let the computer learn them from examples.
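
To make the contrast concrete, here is a minimal sketch on made-up petal lengths. The data values and the helper names `rule_by_hand` and `learn_threshold` are illustrative assumptions, not part of this notebook. The first function encodes a threshold that a person chose; the second derives a threshold from labeled examples by taking the midpoint between the two class means, which is a deliberately crude stand-in for "learning".

```python
import numpy as np

# Hypothetical toy data: petal lengths in cm and their labels (0 or 1).
lengths = np.array([1.4, 1.5, 1.3, 4.7, 4.5, 4.9])
labels = np.array([0, 0, 0, 1, 1, 1])

def rule_by_hand(length):
    """Traditional programming: a person picks the threshold (2.5 is a guess)."""
    return int(length > 2.5)

def learn_threshold(x, y):
    """Crude learning: place the threshold midway between the class means."""
    return (x[y == 0].mean() + x[y == 1].mean()) / 2.0

threshold = learn_threshold(lengths, labels)
learned_predictions = (lengths > threshold).astype(int)

print(f"Learned threshold: {threshold:.2f}")
print(f"Predictions: {learned_predictions}")
```

In both cases the final rule has the same form; the difference is whether the number in it came from a person or from the data.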

## The Perceptron Model

The perceptron mimics a biological neuron:

1. Take inputs (x)
2. Multiply each by a weight (w)
3. Add them up, plus a bias (b)
4. If the sum >= 0, output 1. Otherwise, output 0.

**Net input:**
$$z = w_1 x_1 + w_2 x_2 + \dots + w_m x_m + b = \mathbf{w} \cdot \mathbf{x} + b$$

**Decision function (unit step):**
$$\hat{y} = \begin{cases} 1 & \text{if } z \geq 0 \\ 0 & \text{otherwise} \end{cases}$$
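
As a quick numeric check of the two formulas, the sketch below computes the net input and the unit step for two inputs. The weights, bias, and inputs (`w`, `b`, `X_demo`) are hand-picked illustrative values, not learned ones.

```python
import numpy as np

# Illustrative values chosen by hand; they are not learned weights.
w = np.array([0.5, -0.25])
b = -0.25

# Two example inputs, one per row.
X_demo = np.array([[2.0, 1.0],
                   [0.0, 1.0]])

# Net input z = w . x + b, computed for every row at once.
z = X_demo @ w + b

# Unit step: 1 where z >= 0, otherwise 0.
y_hat = np.where(z >= 0.0, 1, 0)

print(f"Net input z: {z}")
print(f"Predictions: {y_hat}")
```

The first row gives $z = 0.5 \cdot 2 - 0.25 \cdot 1 - 0.25 = 0.5$, so the output is 1. The second row gives $z = -0.5$, so the output is 0.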

## Implementation

Let's build it step by step.

In [None]:
class Perceptron:
    """Perceptron classifier.
    
    Parameters
    ----------
    eta : float
        Learning rate (between 0.0 and 1.0)
    n_iter : int
        Number of passes over the training dataset (epochs)
    random_state : int
        Random seed for weight initialization
    
    Attributes
    ----------
    w_ : 1d-array
        Weights after fitting
    b_ : float
        Bias unit after fitting
    errors_ : list
        Number of misclassifications in each epoch
    """
    
    def __init__(self, eta=0.01, n_iter=50, random_state=1):
        self.eta = eta
        self.n_iter = n_iter
        self.random_state = random_state
    
    def fit(self, X, y):
        """Fit training data.
        
        Parameters
        ----------
        X : array-like, shape = [n_examples, n_features]
            Training vectors
        y : array-like, shape = [n_examples]
            Target values (0 or 1)
        
        Returns
        -------
        self : object
        """
        rgen = np.random.RandomState(self.random_state)
        self.w_ = rgen.normal(loc=0.0, scale=0.01, size=X.shape[1])
        self.b_ = np.float64(0.0)
        self.errors_ = []
        
        for _ in range(self.n_iter):
            errors = 0
            for xi, target in zip(X, y):
                update = self.eta * (target - self.predict(xi))
                self.w_ += update * xi
                self.b_ += update
                errors += int(update != 0.0)
            self.errors_.append(errors)
            
            # Early stopping if converged
            if errors == 0:
                break
        
        return self
    
    def net_input(self, X):
        """Calculate net input: w . x + b"""
        return np.dot(X, self.w_) + self.b_
    
    def predict(self, X):
        """Return class label after unit step"""
        return np.where(self.net_input(X) >= 0.0, 1, 0)

## The Learning Rule

The update rule is simple:

$$w = w + \eta (y - \hat{y}) x$$
$$b = b + \eta (y - \hat{y})$$

Where:
- $\eta$ is the learning rate
- $y$ is the true label
- $\hat{y}$ is the predicted label

**When correct:** $(y - \hat{y}) = 0$ → no update

**When wrong:**
- Predicted 0, actual 1 → $(y - \hat{y}) = +1$ → weights move toward the input
- Predicted 1, actual 0 → $(y - \hat{y}) = -1$ → weights move away from the input
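
The "predicted 1, actual 0" case can be traced by hand with a single update. This is a minimal sketch: `perceptron_step` is a hypothetical helper written for this illustration, and the input values are made up.

```python
import numpy as np

def perceptron_step(w, b, x, target, eta=0.1):
    """Apply one perceptron update for a single example (hypothetical helper)."""
    y_hat = 1 if np.dot(w, x) + b >= 0.0 else 0
    update = eta * (target - y_hat)
    return w + update * x, b + update, y_hat

w = np.zeros(2)
b = 0.0
x = np.array([2.0, 1.0])

# With zero weights, z = 0, so the model predicts 1. The true label is 0.
w, b, y_hat_before = perceptron_step(w, b, x, target=0)
print(f"Predicted {y_hat_before}, updated w = {w}, b = {b}")

# The same example is now classified correctly, so a second step changes nothing.
w_again, b_again, y_hat_after = perceptron_step(w, b, x, target=0)
print(f"Predicted {y_hat_after}, w unchanged: {np.array_equal(w, w_again)}")
```

After the first step the weights have moved away from the input (to roughly `[-0.2, -0.1]` with bias `-0.1`), which pushes the net input for this example below zero.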

## Training on the Iris Dataset

Iris is a classic machine learning dataset with 150 flower samples, 3 species, and 4 features.

We will use:
- 2 species: setosa and versicolor (first 100 samples)
- 2 features: sepal length and petal length (for visualization)

In [None]:
# Load Iris dataset
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
df = pd.read_csv(url, header=None, encoding='utf-8')

# Show first few rows
df.head()

In [None]:
# Extract setosa and versicolor (first 100 samples)
y = df.iloc[0:100, 4].values
y = np.where(y == 'Iris-setosa', 0, 1)

# Extract sepal length (column 0) and petal length (column 2)
X = df.iloc[0:100, [0, 2]].values

print(f"X shape: {X.shape}")
print(f"y shape: {y.shape}")
print(f"Classes: {np.unique(y)}")

In [None]:
# Visualize the data
plt.figure(figsize=(8, 6))
plt.scatter(X[:50, 0], X[:50, 1], color='red', marker='o', label='Setosa')
plt.scatter(X[50:100, 0], X[50:100, 1], color='blue', marker='s', label='Versicolor')
plt.xlabel('Sepal length [cm]')
plt.ylabel('Petal length [cm]')
plt.legend(loc='upper left')
plt.title('Iris Dataset: Setosa vs Versicolor')
plt.show()

The two classes are clearly separable by a line. This is called **linear separability**.

In [None]:
# Train the perceptron
ppn = Perceptron(eta=0.1, n_iter=10)
ppn.fit(X, y)

print(f"Errors per epoch: {ppn.errors_}")
print(f"Final weights: {ppn.w_}")
print(f"Final bias: {ppn.b_}")

In [None]:
# Plot convergence
plt.figure(figsize=(8, 5))
plt.plot(range(1, len(ppn.errors_) + 1), ppn.errors_, marker='o')
plt.xlabel('Epoch')
plt.ylabel('Number of misclassifications')
plt.title('Perceptron Convergence')
plt.show()

The perceptron converges to zero errors, meaning it found a line that perfectly separates the two classes.

## Visualizing the Decision Boundary

In [None]:
def plot_decision_regions(X, y, classifier, resolution=0.02):
    """Plot decision regions for a 2D dataset."""
    # Setup marker generator and color map
    markers = ('o', 's', '^', 'v', '<')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])
    
    # Plot the decision surface
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(
        np.arange(x1_min, x1_max, resolution),
        np.arange(x2_min, x2_max, resolution)
    )
    
    lab = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    lab = lab.reshape(xx1.shape)
    
    plt.contourf(xx1, xx2, lab, alpha=0.3, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())
    
    # Plot class examples
    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(
            x=X[y == cl, 0],
            y=X[y == cl, 1],
            alpha=0.8,
            c=colors[idx],
            marker=markers[idx],
            label=f'Class {cl}',
            edgecolor='black'
        )

In [None]:
plt.figure(figsize=(8, 6))
plot_decision_regions(X, y, classifier=ppn)
plt.xlabel('Sepal length [cm]')
plt.ylabel('Petal length [cm]')
plt.legend(loc='upper left')
plt.title('Perceptron Decision Boundary')
plt.show()

## Check Accuracy

In [None]:
predictions = ppn.predict(X)
accuracy = np.mean(predictions == y)
print(f"Training accuracy: {accuracy * 100:.1f}%")

## The Limitation: XOR Problem

The perceptron can only solve **linearly separable** problems. Here is a classic example where it fails:

In [None]:
# XOR dataset
X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([0, 1, 1, 0])  # XOR: same inputs → 0, different inputs → 1

print("XOR Truth Table:")
print("x1  x2  |  y")
print("-" * 12)
for (x1, x2), y_val in zip(X_xor, y_xor):
    print(f" {x1}   {x2}  |  {y_val}")

In [None]:
# Visualize XOR
plt.figure(figsize=(6, 6))
plt.scatter(X_xor[y_xor == 0, 0], X_xor[y_xor == 0, 1], 
            color='red', marker='o', s=200, label='Class 0')
plt.scatter(X_xor[y_xor == 1, 0], X_xor[y_xor == 1, 1], 
            color='blue', marker='s', s=200, label='Class 1')
plt.xlabel('x1')
plt.ylabel('x2')
plt.legend()
plt.title('XOR Problem: Try drawing a single line to separate the classes')
plt.xlim(-0.5, 1.5)
plt.ylim(-0.5, 1.5)
plt.grid(True, alpha=0.3)
plt.show()

In [None]:
# Try to train perceptron on XOR
ppn_xor = Perceptron(eta=0.1, n_iter=100)
ppn_xor.fit(X_xor, y_xor)

print(f"Errors per epoch (last 10): {ppn_xor.errors_[-10:]}")
print(f"Final accuracy: {np.mean(ppn_xor.predict(X_xor) == y_xor) * 100:.1f}%")

The perceptron never converges on XOR. It keeps making errors because no single line can separate the classes.
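
To see why, suppose some weights $w_1$, $w_2$ and bias $b$ did classify all four XOR points correctly under the decision function defined earlier ($\hat{y} = 1$ exactly when $z \geq 0$). The following short derivation, offered as a sketch, shows that this leads to a contradiction:

$$\begin{aligned}
(0, 0) \to 0 &: \quad b < 0 \\
(0, 1) \to 1 &: \quad w_2 + b \geq 0 \\
(1, 0) \to 1 &: \quad w_1 + b \geq 0 \\
(1, 1) \to 0 &: \quad w_1 + w_2 + b < 0
\end{aligned}$$

Adding the two middle inequalities gives $w_1 + w_2 + 2b \geq 0$, so $w_1 + w_2 + b \geq -b > 0$. That contradicts the last line, so no choice of $w_1$, $w_2$, and $b$ satisfies all four conditions.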

Minsky and Papert analyzed this limitation in their 1969 book *Perceptrons*, which is often cited as one of the factors that contributed to the first "AI winter."

**The solution:** Add more layers (multi-layer perceptron).
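
As a minimal sketch of that idea, the network below stacks two layers of step units and reproduces XOR. The weights are set by hand for illustration (one hidden unit acts as OR, the other as AND); they are not learned, and names such as `W_hidden` and `w_out` are assumptions made for this example.

```python
import numpy as np

def step(z):
    """Unit step: 1 where z >= 0, otherwise 0."""
    return np.where(z >= 0.0, 1, 0)

X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([0, 1, 1, 0])

# Hidden layer (hand-set): column 0 computes OR, column 1 computes AND.
W_hidden = np.array([[1.0, 1.0],
                     [1.0, 1.0]])
b_hidden = np.array([-0.5, -1.5])

# Output unit (hand-set): fires when OR is on and AND is off.
w_out = np.array([1.0, -1.0])
b_out = -0.5

hidden = step(X_xor @ W_hidden + b_hidden)
output = step(hidden @ w_out + b_out)

print(f"Hidden activations:\n{hidden}")
print(f"Network output: {output}")
print(f"Matches XOR: {np.array_equal(output, y_xor)}")
```

The hidden layer maps the four points into a space where a single line does separate the classes. The perceptron rule alone cannot find such hidden weights, which is why a different training method is needed for multi-layer networks.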

## Exercises

1. **Try different learning rates:** What happens with eta=0.001 vs eta=1.0?

2. **Use different features:** Try sepal width and petal width instead.

3. **Versicolor vs Virginica:** Use rows 50 to 149 (`df.iloc[50:150]`) instead, and relabel the targets accordingly. Does the perceptron converge? Why or why not?

4. **Implement from scratch:** Without looking at the code above, write your own `predict` and `fit` functions.

In [None]:
# Exercise 1: Try different learning rates
# Your code here


In [None]:
# Exercise 2: Use different features
# Your code here


In [None]:
# Exercise 3: Versicolor vs Virginica
# Your code here


## Summary

The perceptron is just:

```python
if dot_product + bias >= 0:
    return 1
else:
    return 0
```

But it brought together key ideas we still use today:
- Weighted sums
- Activation functions
- Bias terms
- Iterative learning from examples

Most layers in modern neural networks follow the same pattern: **weighted sum → activation → repeat**.

## Next Steps

In the next session, we will cover:
- **Adaline:** Using gradient descent instead of the perceptron rule
- **Backpropagation:** How to train multi-layer networks

## Resources

- [Interactive Demo](https://i33ym.cc/demo-perceptron/)
- [Full Essay](https://i33ym.cc/the-perceptron/)
- [Slides](https://i33ym.cc/slides-perceptron/)