# Neural Networks

- 📺 **Video:** [https://youtu.be/DU_p-RBy5gM](https://youtu.be/DU_p-RBy5gM)

## Overview
- Motivate neural networks as flexible function approximators that learn hierarchical features.
- See how depth, activation functions, and capacity interact.

## Key ideas
- **Nonlinearity:** stacking linear layers without nonlinearities collapses to a single linear map.
- **Capacity:** width and depth increase expressiveness but risk overfitting.
- **Optimization:** gradient-based methods enable large-scale training.
- **Generalization:** regularization and data augmentation tame over-parameterized models.
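
The nonlinearity point above can be checked numerically. The following is a minimal sketch, not taken from the lecture: the shapes, seed, and variable names are arbitrary choices for illustration. It suggests that two stacked linear layers compute the same function as one linear layer with merged weights, and that placing a ReLU between them breaks that equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes chosen only for illustration: 2 -> 16 -> 2.
W1, b1 = rng.normal(size=(2, 16)), rng.normal(size=16)
W2, b2 = rng.normal(size=(16, 2)), rng.normal(size=2)
X = rng.normal(size=(5, 2))

# Two linear layers applied in sequence, with no activation in between.
stacked = (X @ W1 + b1) @ W2 + b2

# The same map written as a single linear layer with merged parameters.
W_merged = W1 @ W2
b_merged = b1 @ W2 + b2
single = X @ W_merged + b_merged

collapses = np.allclose(stacked, single)

# With a ReLU between the layers, the merged linear layer no longer matches.
with_relu = np.maximum(0, X @ W1 + b1) @ W2 + b2
relu_differs = not np.allclose(with_relu, single)

print("linear stack equals single layer:", collapses)
print("ReLU network differs from it:", relu_differs)
```

The merged weights follow from expanding `(X @ W1 + b1) @ W2 + b2` into `X @ (W1 @ W2) + (b1 @ W2 + b2)`, which is why extra linear layers alone add no expressiveness.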

## Demo
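Fit a shallow network (one hidden layer) and a deeper network (two hidden layers) to the same two-moons data and compare how well each fits it, echoing the lecture (https://youtu.be/bHnf4UxwZls) on why depth matters. Accuracy is measured on the training data, so the comparison reflects capacity rather than generalization.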

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(8)
X, y = make_moons(n_samples=500, noise=0.25, random_state=8)
y_onehot = np.eye(2)[y]

def relu(x):
    return np.maximum(0, x)

def forward(X, layers):
    """Return softmax probabilities plus each hidden layer's (z, activation) pair."""
    activations = X
    caches = []
    for W, B in layers[:-1]:
        z = activations @ W + B
        activations = relu(z)
        caches.append((z, activations))
    W_last, B_last = layers[-1]
    logits = activations @ W_last + B_last
    # Subtract the row-wise max before exponentiating for numerical stability.
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs = exp / exp.sum(axis=1, keepdims=True)
    return probs, caches

# One hidden layer of 16 ReLU units.
layers_shallow = [
    (rng.normal(scale=0.3, size=(2, 16)), np.zeros(16)),
    (rng.normal(scale=0.3, size=(16, 2)), np.zeros(2))
]

# Two hidden layers of 16 ReLU units each.
layers_deep = [
    (rng.normal(scale=0.3, size=(2, 16)), np.zeros(16)),
    (rng.normal(scale=0.3, size=(16, 16)), np.zeros(16)),
    (rng.normal(scale=0.3, size=(16, 2)), np.zeros(2))
]

def train(layers, lr=0.3, epochs=400):
    """Full-batch gradient descent on cross-entropy; updates `layers` in place."""
    name = 'deep' if len(layers) == 3 else 'shallow'
    for epoch in range(1, epochs + 1):
        probs, caches = forward(X, layers)
        loss = -np.mean(np.sum(y_onehot * np.log(probs + 1e-8), axis=1))

        # Gradient of the mean cross-entropy with respect to the logits.
        grad = (probs - y_onehot) / len(X)
        # Input seen by each layer: the data, then each hidden activation.
        inputs = [X] + [a for _, a in caches]

        # Backpropagate from the output layer down to the first layer.
        for i in reversed(range(len(layers))):
            W, B = layers[i]
            grad_W = inputs[i].T @ grad
            grad_B = grad.sum(axis=0)
            if i > 0:
                # Push the gradient through W (pre-update) and the ReLU below it.
                grad = (grad @ W.T) * (caches[i - 1][0] > 0)
            layers[i] = (W - lr * grad_W, B - lr * grad_B)

        if epoch % 200 == 0:
            preds = np.argmax(probs, axis=1)
            acc = accuracy_score(y, preds)
            print(f"{name} epoch {epoch:3d} | loss {loss:.4f} | acc {acc:.3f}")

train(layers_shallow)
train(layers_deep)

for label, layers in [('shallow', layers_shallow), ('deep', layers_deep)]:
    probs, _ = forward(X, layers)
    preds = np.argmax(probs, axis=1)
    print(f"Final {label} accuracy:", accuracy_score(y, preds))
```


```text
shallow epoch 200 | loss 0.2841 | acc 0.878
shallow epoch 400 | loss 0.2372 | acc 0.896
deep epoch 200 | loss 0.1544 | acc 0.940
deep epoch 400 | loss 0.1331 | acc 0.938
Final shallow accuracy: 0.896
Final deep accuracy: 0.94
```
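
The backward pass in the demo starts from the expression `(probs - y_onehot) / len(X)` for the gradient of the mean cross-entropy with respect to the logits. One way to gain confidence in that formula is a finite-difference check. The sketch below is an illustration under stated assumptions: the tiny batch, the helper names `softmax` and `cross_entropy`, and the step size `eps` are hypothetical, and the `1e-8` term the demo adds inside the logarithm is omitted for simplicity.

```python
import numpy as np

def softmax(logits):
    # Subtract the row-wise max for numerical stability, as the demo does.
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(logits, y_onehot):
    # Mean negative log-likelihood over the batch.
    probs = softmax(logits)
    return -np.mean(np.sum(y_onehot * np.log(probs), axis=1))

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 2))              # hypothetical batch: 4 examples, 2 classes
y_onehot = np.eye(2)[np.array([0, 1, 1, 0])]  # hypothetical labels

# Analytic gradient, the same expression the demo uses for the logits.
analytic = (softmax(logits) - y_onehot) / len(logits)

# Central finite differences, perturbing one logit at a time.
eps = 1e-6
numeric = np.zeros_like(logits)
for i in range(logits.shape[0]):
    for j in range(logits.shape[1]):
        bump = np.zeros_like(logits)
        bump[i, j] = eps
        numeric[i, j] = (cross_entropy(logits + bump, y_onehot)
                         - cross_entropy(logits - bump, y_onehot)) / (2 * eps)

max_abs_diff = np.abs(analytic - numeric).max()
print("max |analytic - numeric|:", max_abs_diff)
```

If the two gradients agree to within a small tolerance, the analytic expression is likely correct. The same check can be extended to the weight gradients, at the cost of one pair of loss evaluations per parameter.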


## Try it
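- Modify the demo: vary the hidden width (16), the learning rate (`lr=0.3`), or the `noise` passed to `make_moons`, and see how the gap between the shallow and deep networks changes.
- Add a tiny dataset or counter-example, such as a held-out split that tests whether the deeper network's higher training accuracy carries over to unseen points.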
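
As one possible tiny counter-example, XOR is a four-point dataset that no linear model fits, while a single hidden ReLU layer represents it exactly. The sketch below is an assumption-laden illustration rather than part of the lecture: the linear model is fit by least squares (not the softmax classifier from the demo), and the ReLU weights are hand-picked rather than learned.

```python
import numpy as np

# XOR: four points that no single linear boundary separates.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

# Best linear fit in the least-squares sense, with a bias column appended.
X_aug = np.hstack([X, np.ones((4, 1))])
w, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
linear_out = X_aug @ w

# One hidden ReLU layer with hand-picked (not learned) weights:
# h1 = relu(x1 + x2), h2 = relu(x1 + x2 - 1), output = h1 - 2 * h2.
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
W2 = np.array([1.0, -2.0])
hidden = np.maximum(0, X @ W1 + b1)
relu_out = hidden @ W2

print("linear predictions:", linear_out)
print("ReLU predictions:  ", relu_out)
print("targets:           ", y)
```

The least-squares fit predicts approximately 0.5 for every point, so it carries no information about the label, whereas the two-unit ReLU network reproduces the targets. Training the demo's `train` function on this dataset instead of using hand-picked weights is a natural next step.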


## References
- [Eisenstein 4.2](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Multiclass lecture note](https://www.cs.utexas.edu/~gdurrett/courses/online-course/multiclass.pdf)
- [A large annotated corpus for learning natural language inference](https://www.aclweb.org/anthology/D15-1075/)
- [Authorship Attribution of Micro-Messages](https://www.aclweb.org/anthology/D13-1193/)
- [50 Years of Test (Un)fairness: Lessons for Machine Learning](https://arxiv.org/pdf/1811.10104.pdf)
- [Amazon scraps secret AI recruiting tool that showed bias against women](https://www.reuters.com/article/us-amazon-com-jobs-automation-insight/amazon-scraps-secret-ai-recruiting-tool-that-showed-bias-against-women-idUSKCN1MK08G) (article)
- [Neural Networks, Manifolds, and Topology](http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/) (blog)
- [Eisenstein Chapter 3.1-3.3](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Dropout: a simple way to prevent neural networks from overfitting](https://dl.acm.org/doi/10.5555/2627435.2670313)
- [Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift](https://arxiv.org/abs/1502.03167)
- [Adam: A Method for Stochastic Optimization](https://arxiv.org/abs/1412.6980)
- [The Marginal Value of Adaptive Gradient Methods in Machine Learning](https://papers.nips.cc/paper/2017/hash/81b3833e2504647f9d794f7d7b9bf341-Abstract.html)


*Links only; we do not redistribute slides or papers.*