# Feedforward Neural Networks, Backpropagation

- 📺 **Video:** [https://youtu.be/8WhPYIWyR5g](https://youtu.be/8WhPYIWyR5g)

## Overview
- Review how the forward pass computes activations and the loss, and how the backward pass computes gradients, for multilayer perceptrons.
- Link analytical derivatives to practical training loops that update weights by gradient descent.

## Key ideas
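- **Composition:** a feedforward network alternates linear layers with nonlinear activations; the nonlinearity is what lets the composition form non-linear decision boundaries.
- **Chain rule:** backpropagation applies the chain rule layer by layer and reuses each layer's upstream gradient, so computing all gradients costs about a constant factor more than one forward pass.
- **Loss gradients:** gradients of the loss with respect to weights and biases set the direction of each parameter update.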
- **Implementation:** vectorized NumPy code mirrors the equations derived in the lecture.
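
To make the composition point concrete, here is a minimal sketch. The shapes, the seed, and the variable names are arbitrary choices for illustration, not taken from the lecture. It checks that two stacked linear layers with no activation between them equal a single affine map, and that inserting a ReLU breaks that equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for illustration only
X = rng.normal(size=(5, 2))
W1, b1 = rng.normal(size=(2, 4)), rng.normal(size=4)
W2, b2 = rng.normal(size=(4, 3)), rng.normal(size=3)

# Two linear layers with no activation in between ...
stacked = (X @ W1 + b1) @ W2 + b2

# ... are equivalent to one affine map with merged parameters.
W_merged = W1 @ W2
b_merged = b1 @ W2 + b2
collapsed = X @ W_merged + b_merged
linear_layers_collapse = np.allclose(stacked, collapsed)

# With a ReLU between the layers, the merged affine map no longer matches.
with_relu = np.maximum(0, X @ W1 + b1) @ W2 + b2
relu_breaks_collapse = not np.allclose(with_relu, collapsed)

print(linear_layers_collapse, relu_breaks_collapse)
```

Without the nonlinearity, extra depth adds no expressive power, which is why the hidden layer needs an activation such as ReLU.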

## Demo
Implement a two-layer neural network from scratch on scikit-learn's `make_moons` dataset, explicitly coding the backward pass as shown in the lecture ([https://youtu.be/pTWVHNJImDM](https://youtu.be/pTWVHNJImDM)).
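
For reference, the gradients for this architecture (one ReLU hidden layer, softmax output, mean cross-entropy over $N$ examples with one-hot targets $Y$) can be written as follows. The symbols are chosen here to mirror the variable names in the code ($Z_1$ for `z1`, $A_1$ for `a1`, $S$ for `logits`, $P$ for `probs`); the lecture's notation may differ.

```latex
\begin{aligned}
Z_1 &= X W_1 + b_1, \qquad A_1 = \max(0, Z_1), \qquad S = A_1 W_2 + b_2, \qquad P = \operatorname{softmax}(S), \\
\mathcal{L} &= -\frac{1}{N} \sum_{i=1}^{N} \sum_{k} Y_{ik} \log P_{ik}, \\
\frac{\partial \mathcal{L}}{\partial S} &= \frac{1}{N}\,(P - Y), \\
\frac{\partial \mathcal{L}}{\partial W_2} &= A_1^{\top} \frac{\partial \mathcal{L}}{\partial S}, \qquad
\frac{\partial \mathcal{L}}{\partial b_2} = \sum_{i=1}^{N} \left(\frac{\partial \mathcal{L}}{\partial S}\right)_{i,:}, \\
\frac{\partial \mathcal{L}}{\partial A_1} &= \frac{\partial \mathcal{L}}{\partial S}\, W_2^{\top}, \qquad
\frac{\partial \mathcal{L}}{\partial Z_1} = \frac{\partial \mathcal{L}}{\partial A_1} \odot \mathbb{1}[Z_1 > 0], \\
\frac{\partial \mathcal{L}}{\partial W_1} &= X^{\top} \frac{\partial \mathcal{L}}{\partial Z_1}, \qquad
\frac{\partial \mathcal{L}}{\partial b_1} = \sum_{i=1}^{N} \left(\frac{\partial \mathcal{L}}{\partial Z_1}\right)_{i,:}.
\end{aligned}
```

Each line reuses the quantity computed on the line before it, which is the reuse that the chain-rule bullet refers to.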

In [1]:
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X, y = make_moons(n_samples=600, noise=0.2, random_state=0)
y_onehot = np.eye(2)[y]
X_train, X_test, y_train, y_test = train_test_split(X, y_onehot, test_size=0.3, random_state=0)

input_dim = X.shape[1]
hidden_dim = 16
output_dim = 2

W1 = rng.normal(scale=0.3, size=(input_dim, hidden_dim))
B1 = np.zeros(hidden_dim)
W2 = rng.normal(scale=0.3, size=(hidden_dim, output_dim))
B2 = np.zeros(output_dim)

learning_rate = 0.5

def relu(x):
    return np.maximum(0, x)

for epoch in range(1, 401):
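    z1 = X_train @ W1 + B1  # hidden pre-activations, shape (N, hidden_dim)
    a1 = relu(z1)
    logits = a1 @ W2 + B2  # unnormalized class scores, shape (N, output_dim)
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))  # subtract the row max for numerical stability
    probs = exp / exp.sum(axis=1, keepdims=True)  # softmax probabilities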

    loss = -np.mean(np.sum(y_train * np.log(probs + 1e-8), axis=1))

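    grad_logits = (probs - y_train) / len(X_train)  # d(loss)/d(logits) for softmax + mean cross-entropy
    grad_W2 = a1.T @ grad_logits
    grad_B2 = grad_logits.sum(axis=0)  # the bias is shared across examples, so sum over the batch

    grad_a1 = grad_logits @ W2.T  # propagate the gradient back through the output layer
    grad_z1 = grad_a1 * (z1 > 0)  # ReLU passes gradient only where its input was positive
    grad_W1 = X_train.T @ grad_z1
    grad_B1 = grad_z1.sum(axis=0)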

    W2 -= learning_rate * grad_W2
    B2 -= learning_rate * grad_B2
    W1 -= learning_rate * grad_W1
    B1 -= learning_rate * grad_B1

    if epoch % 100 == 0:
        preds = np.argmax(probs, axis=1)
        acc = accuracy_score(np.argmax(y_train, axis=1), preds)
        print(f"epoch {epoch:3d} | loss {loss:.4f} | train acc {acc:.3f}")

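z1_test = X_test @ W1 + B1
logits_test = relu(z1_test) @ W2 + B2
exp_test = np.exp(logits_test - logits_test.max(axis=1, keepdims=True))  # same max-shift as in training, avoids overflow
probs_test = exp_test / exp_test.sum(axis=1, keepdims=True)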

test_acc = accuracy_score(np.argmax(y_test, axis=1), np.argmax(probs_test, axis=1))
print()
print('Test accuracy:', test_acc)


epoch 100 | loss 0.2307 | train acc 0.898
epoch 200 | loss 0.1876 | train acc 0.914
epoch 300 | loss 0.1265 | train acc 0.960
epoch 400 | loss 0.0899 | train acc 0.967

Test accuracy: 0.9333333333333333
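
A common way to gain confidence in a hand-written backward pass is to compare it against central finite differences. The sketch below is not part of the lecture demo: the helper names `forward_loss`, `analytic_grads`, and `numeric_grads` are hypothetical, and it uses a small random dataset instead of the moons data so that it runs on its own. The analytic gradients use the same formulas as the training loop above.

```python
import numpy as np

def forward_loss(params, X, Y):
    """Mean cross-entropy of a one-hidden-layer ReLU network with softmax output."""
    W1, b1, W2, b2 = params
    z1 = X @ W1 + b1
    a1 = np.maximum(0, z1)
    logits = a1 @ W2 + b2
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    loss = -np.mean(np.sum(Y * log_probs, axis=1))
    return loss, (z1, a1, np.exp(log_probs))

def analytic_grads(params, X, Y):
    """Backward pass with the same formulas as the demo's training loop."""
    W1, b1, W2, b2 = params
    _, (z1, a1, probs) = forward_loss(params, X, Y)
    grad_logits = (probs - Y) / len(X)
    grad_W2 = a1.T @ grad_logits
    grad_b2 = grad_logits.sum(axis=0)
    grad_z1 = (grad_logits @ W2.T) * (z1 > 0)
    grad_W1 = X.T @ grad_z1
    grad_b1 = grad_z1.sum(axis=0)
    return [grad_W1, grad_b1, grad_W2, grad_b2]

def numeric_grads(params, X, Y, eps=1e-5):
    """Central finite differences, perturbing one parameter entry at a time."""
    grads = []
    for p in params:
        g = np.zeros_like(p)
        for idx in np.ndindex(p.shape):
            old = p[idx]
            p[idx] = old + eps
            loss_plus, _ = forward_loss(params, X, Y)
            p[idx] = old - eps
            loss_minus, _ = forward_loss(params, X, Y)
            p[idx] = old  # restore the original value
            g[idx] = (loss_plus - loss_minus) / (2 * eps)
        grads.append(g)
    return grads

rng = np.random.default_rng(0)  # arbitrary seed and sizes, for illustration
X = rng.normal(size=(8, 2))
Y = np.eye(2)[rng.integers(0, 2, size=8)]
params = [
    rng.normal(scale=0.3, size=(2, 4)), rng.normal(scale=0.1, size=4),
    rng.normal(scale=0.3, size=(4, 2)), rng.normal(scale=0.1, size=2),
]

max_abs_diff = max(
    np.abs(a - n).max()
    for a, n in zip(analytic_grads(params, X, Y), numeric_grads(params, X, Y))
)
print(f"max |analytic - numeric| = {max_abs_diff:.2e}")
```

A small maximum difference suggests the backward pass matches the loss. The check can be unreliable when a pre-activation sits very close to zero, because ReLU is not differentiable there.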


## Try it
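- Modify the demo: change `hidden_dim` or `learning_rate` and observe how the loss and test accuracy respond.
- Add a tiny dataset or counter-example, such as XOR, that a model without a hidden nonlinearity cannot fit.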
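
One possible tiny counter-example is XOR, sketched below under stated assumptions. The hidden-layer weights are hand-picked rather than learned, and the random search over linear classifiers is only an illustration of the known fact that no single linear boundary labels all four XOR points correctly.

```python
import numpy as np

# XOR: the four corner points, labeled 1 when exactly one input is 1.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

# Hand-picked weights (not learned): the hidden units compute
# relu(x1 + x2) and relu(x1 + x2 - 1).
W1 = np.array([[1.0, 1.0], [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
w2 = np.array([1.0, -2.0])

hidden = np.maximum(0, X @ W1 + b1)
score = hidden @ w2  # equals the XOR label on these four points
mlp_preds = (score > 0.5).astype(int)

# Sample many random linear classifiers and keep the best accuracy seen.
rng = np.random.default_rng(0)  # arbitrary seed
best_linear_acc = 0.0
for _ in range(10_000):
    w, b = rng.normal(size=2), rng.normal()
    linear_preds = (X @ w + b > 0).astype(int)
    best_linear_acc = max(best_linear_acc, float(np.mean(linear_preds == y)))

print("MLP predictions:", mlp_preds)
print("best sampled linear accuracy:", best_linear_acc)
```

The two ReLU units fold the input space so that the points labeled 1 become separable from the points labeled 0, which a purely linear model cannot achieve.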


## References
- [Eisenstein 4.2](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Multiclass lecture note](https://www.cs.utexas.edu/~gdurrett/courses/online-course/multiclass.pdf)
- [A large annotated corpus for learning natural language inference](https://www.aclweb.org/anthology/D15-1075/)
- [Authorship Attribution of Micro-Messages](https://www.aclweb.org/anthology/D13-1193/)
- [50 Years of Test (Un)fairness: Lessons for Machine Learning](https://arxiv.org/pdf/1811.10104.pdf)
- [[Article] Amazon scraps secret AI recruiting tool that showed bias against women](https://www.reuters.com/article/us-amazon-com-jobs-automation-insight/amazon-scraps-secret-ai-recruiting-tool-that-showed-bias-against-women-idUSKCN1MK08G)
- [[Blog] Neural Networks, Manifolds, and Topology](http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/)
- [Eisenstein Chapter 3.1-3.3](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Dropout: a simple way to prevent neural networks from overfitting](https://dl.acm.org/doi/10.5555/2627435.2670313)
- [Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift](https://arxiv.org/abs/1502.03167)
- [Adam: A Method for Stochastic Optimization](https://arxiv.org/abs/1412.6980)
- [The Marginal Value of Adaptive Gradient Methods in Machine Learning](https://papers.nips.cc/paper/2017/hash/81b3833e2504647f9d794f7d7b9bf341-Abstract.html)


*Links only; we do not redistribute slides or papers.*