# Basics of Learning, Gradient Descent

- 📺 **Video:** [https://youtu.be/_We4tlPkaj0](https://youtu.be/_We4tlPkaj0)

## Overview
- Walk through how gradient descent iteratively reduces empirical risk by following the slope of a differentiable loss.
- See why tuning the learning rate and the convergence criteria matters for scalable NLP training loops.

## Key ideas
- **Empirical risk minimization:** treat learning as optimizing an average loss over training examples.
- **Gradient direction:** the negative gradient gives the step that most rapidly decreases the loss locally.
- **Learning rate schedules:** the step size trades convergence speed against stability (too large overshoots or diverges, too small stalls); a schedule adjusts it over the course of training.
- **Stopping:** monitor loss reduction or gradient norm to decide when the model has converged.
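
The learning-rate and stopping ideas above can be sketched together on a small least-squares problem, using the update $w \leftarrow w - \eta \nabla \hat{R}(w)$. This is a minimal illustration rather than a reference implementation: the function name `gradient_descent`, the thresholds `tol=1e-8` and `1e12`, the step cap, and the synthetic dataset are all assumptions made for this example. For a quadratic objective, gradient descent is stable only when the learning rate is below 2 divided by the largest curvature (the largest eigenvalue of the Hessian), so the three rates are expressed relative to that value.

```python
import numpy as np

def gradient_descent(X, y, learning_rate, tol=1e-8, max_steps=500):
    """Batch gradient descent on 0.5 * MSE with a gradient-norm stopping rule.

    Hypothetical helper for illustration; tol, max_steps, and the
    divergence threshold are arbitrary choices.
    """
    w = np.zeros(X.shape[1])
    for step in range(max_steps):
        grad = X.T @ (X @ w - y) / len(X)  # gradient of 0.5 * MSE
        grad_norm = np.linalg.norm(grad)
        if grad_norm < tol:                # stopping rule: gradient is nearly zero
            return w, step, "converged"
        if grad_norm > 1e12:               # gradient is blowing up: step too large
            return w, step, "diverged"
        w = w - learning_rate * grad       # step against the gradient
    return w, max_steps, "max_steps"

# Tiny synthetic regression problem (assumed data, not from the lecture).
rng = np.random.default_rng(0)
x = rng.normal(size=100)
X = np.c_[np.ones_like(x), x]
y = 1.5 * x + 0.5 + rng.normal(scale=0.1, size=100)

# Largest eigenvalue of the Hessian X^T X / n sets the stability limit 2 / curvature.
curvature = np.linalg.eigvalsh(X.T @ X / len(X)).max()

results = {}
for name, scale in [("small", 0.01), ("moderate", 1.0), ("large", 2.5)]:
    lr = scale / curvature
    w, steps, status = gradient_descent(X, y, lr)
    results[name] = status
    print(f"{name:8s} lr={lr:.3f} -> {status} after {steps} steps")
```

Under these assumptions, the small rate should run into the step cap before the gradient norm falls below `tol`, the moderate rate should converge within a few dozen steps, and the large rate should be flagged as diverged because it exceeds the stability limit.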

## Demo
Minimize a least-squares linear regression objective (a quadratic function of the weights) with batch gradient descent to highlight how the loss and parameters evolve.
The video (https://youtu.be/_We4tlPkaj0) motivates this procedure conceptually; the code shows it numerically.

In [1]:
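import numpy as np

# Synthetic data: y = 3x - 2 plus Gaussian noise (standard deviation 0.4).
rng = np.random.default_rng(3)
X_raw = rng.normal(size=200)
y = 3.0 * X_raw - 2.0 + rng.normal(scale=0.4, size=200)
X = np.c_[np.ones_like(X_raw), X_raw]  # prepend a bias column, so w = [intercept, slope]

w = np.zeros(X.shape[1])
learning_rate = 0.1
history = []  # (step, mse) pairs recorded every 100 steps

for step in range(1, 601):
    preds = X @ w
    error = preds - y
    grad = X.T @ error / len(X)  # gradient of 0.5 * MSE with respect to w
    w -= learning_rate * grad    # step against the gradient
    if step % 100 == 0:
        mse = np.mean(error ** 2)  # MSE of the weights before this step's update
        history.append((step, mse))
        print(f"step {step:3d} | mse {mse:.4f} | weights {w}")

print()
print('Final parameters:', w)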


step 100 | mse 0.1551 | weights [-1.99269966  3.01262177]
step 200 | mse 0.1551 | weights [-1.99279116  3.01269009]
step 300 | mse 0.1551 | weights [-1.99279116  3.0126901 ]
step 400 | mse 0.1551 | weights [-1.99279116  3.0126901 ]
step 500 | mse 0.1551 | weights [-1.99279116  3.0126901 ]
step 600 | mse 0.1551 | weights [-1.99279116  3.0126901 ]

Final parameters: [-1.99279116  3.0126901 ]
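
The final parameters sit near, but not exactly at, the generating values of -2 and 3 because gradient descent converges to the least-squares fit of the noisy sample, not to the noise-free line. One way to check this, sketched here under the assumption that the data is regenerated with the same seed as the demo, is to compare against the closed-form solution from `np.linalg.lstsq`. The variable names `w_gd` and `w_closed` are introduced only for this example.

```python
import numpy as np

# Regenerate the demo's data with the same seed.
rng = np.random.default_rng(3)
X_raw = rng.normal(size=200)
y = 3.0 * X_raw - 2.0 + rng.normal(scale=0.4, size=200)
X = np.c_[np.ones_like(X_raw), X_raw]

# Same batch gradient descent as the demo: 600 steps, learning rate 0.1.
w_gd = np.zeros(X.shape[1])
for _ in range(600):
    w_gd -= 0.1 * (X.T @ (X @ w_gd - y) / len(X))

# Closed-form least-squares solution for comparison.
w_closed, *_ = np.linalg.lstsq(X, y, rcond=None)

print("gradient descent:", w_gd)
print("least squares:   ", w_closed)
print("max abs difference:", np.max(np.abs(w_gd - w_closed)))
```

If the two vectors agree to many decimal places, the loop has reached the minimizer of the training loss, and the remaining gap to -2 and 3 comes from the noise in the sample, not from the optimizer.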


## Try it
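- Modify the demo: change `learning_rate` (for example, 0.01 or 1.5) or the number of steps, and watch how the printed MSE and weights respond.
- Add a tiny dataset or counter-example: for instance, multiply `X_raw` by 100 and check whether `learning_rate = 0.1` still converges.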


## References
- [Eisenstein 2.0-2.5, 4.2-4.4.1](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Perceptron and logistic regression](https://www.cs.utexas.edu/~gdurrett/courses/online-course/perc-lr-connections.pdf)
- [Eisenstein 4.1](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Thumbs up? Sentiment Classification using Machine Learning Techniques](https://www.aclweb.org/anthology/W02-1011/)
- [Baselines and Bigrams: Simple, Good Sentiment and Topic Classification](https://www.aclweb.org/anthology/P12-2018/)
- [Convolutional Neural Networks for Sentence Classification](https://www.aclweb.org/anthology/D14-1181/)
- [[GitHub] NLP Progress on Sentiment Analysis](https://github.com/sebastianruder/NLP-progress/blob/master/english/sentiment_analysis.md)


*Links only; we do not redistribute slides or papers.*