# Neural Net Training, Optimization

- 📺 **Video:** [https://youtu.be/KPZb2rYS4BE](https://youtu.be/KPZb2rYS4BE)

## Overview
- Investigate how optimizer choice and hyperparameters affect neural network training speed and stability.
- Diagnose underfitting versus overfitting through learning curves.

## Key ideas
- **Learning rate:** too large causes divergence; too small slows progress.
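- **Momentum/Adam:** momentum averages recent gradients to smooth noisy updates; Adam additionally scales each parameter's step by a running estimate of its gradient magnitude.
- **Regularization:** dropout and weight decay help reduce overfitting.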
- **Monitoring:** track validation loss to time early stopping.
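
To make the learning-rate point concrete, here is a minimal sketch, assuming a one-dimensional quadratic loss f(x) = x² instead of a neural network. The helper `gradient_descent` and the three step sizes are illustrative choices, not taken from the lecture.

```python
def gradient_descent(lr, steps=50, x0=1.0):
    """Minimize f(x) = x**2 (gradient 2x) from x0 with a fixed step size."""
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x  # gradient step: x <- x - lr * f'(x)
    return x


# Too small, reasonable, and too large step sizes (illustrative values).
for lr in (0.001, 0.1, 1.1):
    print(f"lr={lr}: |x| after 50 steps = {abs(gradient_descent(lr)):.3g}")
```

On this toy loss each update multiplies x by (1 - 2·lr), so the iterate barely moves for a tiny step size, shrinks quickly for a moderate one, and grows without bound once lr exceeds 1.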

## Demo
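Train the same one-hidden-layer MLP architecture twice on a noisy two-moons dataset, once with plain gradient descent (labelled `sgd`; every update uses the full dataset) and once with an Adam-style update, and compare their training accuracy every 100 epochs, echoing the lecture (https://youtu.be/aQNySJU0vZ4). Each copy gets its own random initialization.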

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(7)
# Two interleaving half-moons with Gaussian noise; labels are 0/1.
X, y = make_moons(n_samples=500, noise=0.25, random_state=7)
y_onehot = np.eye(2)[y]

def relu(x):
    return np.maximum(0, x)

input_dim = X.shape[1]
hidden = 20
output_dim = 2

# One independently initialized copy of the network per optimizer.
weights = {
    'sgd': {
        'W1': rng.normal(scale=0.3, size=(input_dim, hidden)),
        'B1': np.zeros(hidden),
        'W2': rng.normal(scale=0.3, size=(hidden, output_dim)),
        'B2': np.zeros(output_dim)
    },
    'adam': {
        'W1': rng.normal(scale=0.3, size=(input_dim, hidden)),
        'B1': np.zeros(hidden),
        'W2': rng.normal(scale=0.3, size=(hidden, output_dim)),
        'B2': np.zeros(output_dim)
    }
}

# First- and second-moment estimates (only the 'adam' entries are used).
adam_m = {key: {k: np.zeros_like(v) for k, v in params.items()} for key, params in weights.items()}
adam_v = {key: {k: np.zeros_like(v) for k, v in params.items()} for key, params in weights.items()}

sgd_lr = 0.3
adam_lr = 0.05
beta1, beta2 = 0.9, 0.999

for epoch in range(1, 401):
    for mode in ['sgd', 'adam']:
        W1, B1, W2, B2 = weights[mode]['W1'], weights[mode]['B1'], weights[mode]['W2'], weights[mode]['B2']
        # Forward pass: ReLU hidden layer, then a numerically stable softmax.
        z1 = X @ W1 + B1
        a1 = relu(z1)
        logits = a1 @ W2 + B2
        exp = np.exp(logits - logits.max(axis=1, keepdims=True))
        probs = exp / exp.sum(axis=1, keepdims=True)
        # Mean cross-entropy; computed for reference but not logged below.
        loss = -np.mean(np.sum(y_onehot * np.log(probs + 1e-8), axis=1))

        # Backward pass for softmax + cross-entropy, averaged over all samples.
        grad_logits = (probs - y_onehot) / len(X)
        grad_W2 = a1.T @ grad_logits
        grad_B2 = grad_logits.sum(axis=0)
        grad_a1 = grad_logits @ W2.T
        grad_z1 = grad_a1 * (z1 > 0)
        grad_W1 = X.T @ grad_z1
        grad_B1 = grad_z1.sum(axis=0)

        if mode == 'sgd':
            # Full-batch gradient step; -= updates the stored arrays in place.
            W2 -= sgd_lr * grad_W2
            B2 -= sgd_lr * grad_B2
            W1 -= sgd_lr * grad_W1
            B1 -= sgd_lr * grad_B1
        else:
            for name, grad in [('W2', grad_W2), ('B2', grad_B2), ('W1', grad_W1), ('B1', grad_B1)]:
                # Exponential moving averages of the gradient and its square.
                adam_m[mode][name] = beta1 * adam_m[mode][name] + (1 - beta1) * grad
                adam_v[mode][name] = beta2 * adam_v[mode][name] + (1 - beta2) * (grad ** 2)
                # Bias correction; the epoch doubles as the step count.
                m_hat = adam_m[mode][name] / (1 - beta1 ** epoch)
                v_hat = adam_v[mode][name] / (1 - beta2 ** epoch)
                weights[mode][name] -= adam_lr * m_hat / (np.sqrt(v_hat) + 1e-8)

    # Report training accuracy for both optimizers every 100 epochs.
    if epoch % 100 == 0:
        metrics = {}
        for mode in ['sgd', 'adam']:
            W1, B1, W2, B2 = weights[mode]['W1'], weights[mode]['B1'], weights[mode]['W2'], weights[mode]['B2']
            logits = relu(X @ W1 + B1) @ W2 + B2
            probs = np.exp(logits - logits.max(axis=1, keepdims=True))
            probs /= probs.sum(axis=1, keepdims=True)
            preds = np.argmax(probs, axis=1)
            metrics[mode] = accuracy_score(y, preds)
        print(f"epoch {epoch:3d} | SGD acc {metrics['sgd']:.3f} | Adam-style acc {metrics['adam']:.3f}")

for mode in ['sgd', 'adam']:
    W1, B1, W2, B2 = weights[mode]['W1'], weights[mode]['B1'], weights[mode]['W2'], weights[mode]['B2']
    logits = relu(X @ W1 + B1) @ W2 + B2
    preds = np.argmax(logits, axis=1)
    print(f"Final accuracy ({mode}):", accuracy_score(y, preds))
```

```text
epoch 100 | SGD acc 0.890 | Adam-style acc 0.946
epoch 200 | SGD acc 0.892 | Adam-style acc 0.950
epoch 300 | SGD acc 0.898 | Adam-style acc 0.950
epoch 400 | SGD acc 0.896 | Adam-style acc 0.950
Final accuracy (sgd): 0.896
Final accuracy (adam): 0.95
```

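The Adam-style update inside the demo loop can be pulled out into a standalone function, which makes its behavior easier to inspect. This is a minimal sketch: the name `adam_step` and the two-element toy parameter vector are hypothetical, while the default hyperparameters mirror the demo's `adam_lr`, `beta1`, and `beta2`.

```python
import numpy as np


def adam_step(param, grad, m, v, t, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return updated (param, m, v) after one Adam step; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad           # moving average of the gradient
    v = beta2 * v + (1 - beta2) * grad ** 2      # moving average of the squared gradient
    m_hat = m / (1 - beta1 ** t)                 # bias correction for the zero init
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v


# Toy problem (an assumption for illustration): f(w) = sum(w**2), gradient 2w.
# The two coordinates have gradients of very different magnitude.
w = np.array([3.0, -0.01])
m = np.zeros_like(w)
v = np.zeros_like(w)
w_next, m, v = adam_step(w, 2 * w, m, v, t=1)
first_step = w - w_next
print(first_step)
```

On the first step the bias-corrected ratio reduces to roughly the sign of the gradient, so both coordinates move by about `lr` even though their gradients differ by a factor of 300. Plain gradient descent would instead take steps proportional to each gradient.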

## Try it
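- Modify the demo: change `sgd_lr`, `adam_lr`, or `hidden` and compare how the two accuracy curves respond.
- Add a tiny dataset or counter-example, for instance one where plain gradient descent matches or beats the Adam-style update.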

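One possible extension is to hold out a validation split and apply the early-stopping idea from the key ideas. The sketch below shows only the stopping rule, assuming a patience-based criterion; the function `early_stop`, its parameters, and the loss values are hypothetical and are not output from the demo.

```python
def early_stop(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch) as 0-based indices into val_losses.

    Training stops once the validation loss has gone `patience` epochs
    without improving on the best value by more than `min_delta`.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    # Never triggered: ran through every recorded epoch.
    return len(val_losses) - 1, best_epoch


# Made-up validation losses for illustration only.
toy_val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.56, 0.60]
stop_epoch, best_epoch = early_stop(toy_val_losses, patience=3)
print(f"stop at epoch {stop_epoch}, restore weights from epoch {best_epoch}")
```

In a real training loop the weights from `best_epoch` would be saved as training proceeds, so they can be restored when the rule fires.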

## References
- [Eisenstein 4.2](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Multiclass lecture note](https://www.cs.utexas.edu/~gdurrett/courses/online-course/multiclass.pdf)
- [A large annotated corpus for learning natural language inference](https://www.aclweb.org/anthology/D15-1075/)
- [Authorship Attribution of Micro-Messages](https://www.aclweb.org/anthology/D13-1193/)
- [50 Years of Test (Un)fairness: Lessons for Machine Learning](https://arxiv.org/pdf/1811.10104.pdf)
- [[Article] Amazon scraps secret AI recruiting tool that showed bias against women](https://www.reuters.com/article/us-amazon-com-jobs-automation-insight/amazon-scraps-secret-ai-recruiting-tool-that-showed-bias-against-women-idUSKCN1MK08G)
- [[Blog] Neural Networks, Manifolds, and Topology](http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/)
- [Eisenstein Chapter 3.1-3.3](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Dropout: a simple way to prevent neural networks from overfitting](https://dl.acm.org/doi/10.5555/2627435.2670313)
- [Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift](https://arxiv.org/abs/1502.03167)
- [Adam: A Method for Stochastic Optimization](https://arxiv.org/abs/1412.6980)
- [The Marginal Value of Adaptive Gradient Methods in Machine Learning](https://papers.nips.cc/paper/2017/hash/81b3833e2504647f9d794f7d7b9bf341-Abstract.html)


*Links only; we do not redistribute slides or papers.*