# Optimization Basics

- 📺 **Video:** [https://youtu.be/65ui-GdtY0Q](https://youtu.be/65ui-GdtY0Q)

## Overview
- Contrast batch, stochastic, and mini-batch optimization strategies for convex losses.
- See how gradients, momentum, and adaptive schedules interact in practice.

## Key ideas
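- **Batch vs. stochastic:** full-batch gradients are stable but cost a pass over the whole dataset per step; SGD's single-example updates are cheap but noisy, and on non-convex losses that noise can help escape shallow local minima.
- **Mini-batching:** averages the gradient over a small subset of examples, trading some of SGD's noise for a higher per-step cost.
- **Momentum:** accumulates a running history of gradients to smooth the update direction.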
- **Adaptive rates:** algorithms like Adagrad rescale steps per-parameter based on observed gradients.
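
To make the momentum and adaptive-rate ideas above concrete, here is a minimal sketch on a toy convex quadratic whose two coordinates have curvatures 10 and 1. The objective, the function names (`grad`, `sgd_momentum`, `adagrad`), and the hyperparameters (`lr`, `beta`, `eps`, `steps`) are illustrative assumptions, not taken from the lecture. The momentum update uses the `v = beta * v + g` convention; other write-ups scale the gradient term by `1 - beta`.

```python
import numpy as np

def grad(w):
    # Gradient of the toy objective f(w) = 0.5 * (10 * w[0]**2 + w[1]**2)
    return np.array([10.0 * w[0], w[1]])

def sgd_momentum(w0, lr=0.05, beta=0.9, steps=200):
    """Gradient descent with a velocity term (hypothetical helper)."""
    w = np.array(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(steps):
        v = beta * v + grad(w)   # decaying sum of past gradients
        w -= lr * v              # step along the smoothed direction
    return w

def adagrad(w0, lr=0.5, eps=1e-8, steps=200):
    """Adagrad: per-parameter steps scaled by accumulated squared gradients."""
    w = np.array(w0, dtype=float)
    g2 = np.zeros_like(w)
    for _ in range(steps):
        g = grad(w)
        g2 += g ** 2                        # running sum of squared gradients
        w -= lr * g / (np.sqrt(g2) + eps)   # larger history means smaller step
    return w

w_mom = sgd_momentum([1.0, 1.0])
w_ada = adagrad([1.0, 1.0])
print("momentum:", w_mom)
print("adagrad: ", w_ada)
```

On this toy problem, Adagrad's normalization cancels the tenfold difference in curvature, so both coordinates shrink at nearly the same rate. That is the per-parameter rescaling described in the list above.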

## Demo
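Compare batch gradient descent and stochastic gradient descent on the same logistic regression objective to see how their convergence differs. Both optimizers use a step size of 0.2 for 300 steps, and the loss on the full dataset is printed every 60 steps. This ties back to the lecture ([https://youtu.be/XOqa0dQDdJY](https://youtu.be/XOqa0dQDdJY)).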

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Synthetic data: 400 examples, 3 Gaussian features.
rng = np.random.default_rng(21)
X = rng.normal(size=(400, 3))
true_w = np.array([0.5, -1.2, 2.0])
logits = X @ true_w
probs = sigmoid(logits)
# Labels are a deterministic threshold of the true probabilities (no label noise).
y = (probs > 0.5).astype(int)
# Prepend a bias column of ones.
X = np.c_[np.ones(len(X)), X]

def logistic_grad(w, xi, yi):
    """Gradient of the logistic loss for a single example."""
    pred = sigmoid(xi @ w)
    return (pred - yi) * xi

w_batch = np.zeros(X.shape[1])
w_sgd = np.zeros_like(w_batch)

for step in range(1, 301):
    # Batch gradient descent: average the gradient over all examples.
    preds = sigmoid(X @ w_batch)
    grad = (X.T @ (preds - y)) / len(X)
    w_batch -= 0.2 * grad

    # SGD: one example per step, visited in a fixed cyclic order (not shuffled).
    xi = X[step % len(X)]
    yi = y[step % len(X)]
    w_sgd -= 0.2 * logistic_grad(w_sgd, xi, yi)

    if step % 60 == 0:
        # batch_loss uses preds computed before this step's batch update;
        # sgd_loss is evaluated on the full dataset after the SGD update.
        batch_loss = -np.mean(y * np.log(preds + 1e-8) + (1 - y) * np.log(1 - preds + 1e-8))
        sgd_preds = sigmoid(X @ w_sgd)
        sgd_loss = -np.mean(y * np.log(sgd_preds + 1e-8) + (1 - y) * np.log(1 - sgd_preds + 1e-8))
        print(f"step {step:3d} | batch loss {batch_loss:.4f} | sgd loss {sgd_loss:.4f}")

print()
print('Batch weights:', w_batch)
print('SGD weights:', w_sgd)
```


Output:

```text
step  60 | batch loss 0.2694 | sgd loss 0.2830
step 120 | batch loss 0.2051 | sgd loss 0.2096
step 180 | batch loss 0.1753 | sgd loss 0.1770
step 240 | batch loss 0.1571 | sgd loss 0.1615
step 300 | batch loss 0.1445 | sgd loss 0.1476

Batch weights: [ 0.11624997  0.85447811 -2.20080151  3.70600393]
SGD weights: [ 0.19620556  0.89518683 -1.94777362  3.87739243]
```
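
The demo covers the two extremes, full batch and single example. As a hedged sketch of the mini-batch middle ground from the key ideas, the block below rebuilds the same kind of data and averages the gradient over a random subset at each step. The batch size of 32, the random sampling without replacement, and the names `log_loss` and `w_mb` are assumptions for illustration and are not part of the original demo.

```python
import numpy as np

rng = np.random.default_rng(21)
X = rng.normal(size=(400, 3))
# Same labeling rule as the demo: probs > 0.5 is equivalent to logits > 0.
y = (X @ np.array([0.5, -1.2, 2.0]) > 0).astype(int)
X = np.c_[np.ones(len(X)), X]  # bias column

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def log_loss(w):
    # Mean logistic loss over the full dataset.
    p = sigmoid(X @ w)
    return -np.mean(y * np.log(p + 1e-8) + (1 - y) * np.log(1 - p + 1e-8))

batch_size = 32  # assumed value; try 1 (SGD-like) up to len(X) (full batch)
w_mb = np.zeros(X.shape[1])
initial_loss = log_loss(w_mb)

for step in range(300):
    # Draw a fresh random subset of examples for this step.
    idx = rng.choice(len(X), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    grad = Xb.T @ (sigmoid(Xb @ w_mb) - yb) / batch_size
    w_mb -= 0.2 * grad

final_loss = log_loss(w_mb)
print(f"initial loss {initial_loss:.4f} | mini-batch loss after 300 steps {final_loss:.4f}")
```

Each step touches 32 rows instead of 400, so it costs less than a full-batch step while averaging out much of the single-example noise.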


## Try it
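- Modify the demo: change the step size (0.2), the number of steps, or the random seed, and compare how the batch and SGD losses move.
- Add a tiny dataset or counter-example, such as a case where one of the two methods behaves poorly.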
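
One possible counter-example in the spirit of these prompts: the demo's labels are a noise-free threshold of a linear model, so the classes are linearly separable and the unregularized logistic loss has no finite minimizer. The sketch below is an illustration under assumptions, with its own seed and no bias column, rather than part of the original demo. It runs batch gradient descent for longer and records the weight norm, which would be expected to keep growing while the direction stays close to `true_w`.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumed seed, different from the demo
X = rng.normal(size=(400, 3))
true_w = np.array([0.5, -1.2, 2.0])
y = (X @ true_w > 0).astype(int)  # noise-free labels: linearly separable

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

w = np.zeros(3)
norms, losses = [], []
for step in range(1, 3001):
    p = sigmoid(X @ w)
    w -= 0.2 * (X.T @ (p - y)) / len(X)  # full-batch gradient step
    if step in (300, 1000, 3000):
        p = sigmoid(X @ w)
        loss = -np.mean(y * np.log(p + 1e-8) + (1 - y) * np.log(1 - p + 1e-8))
        norms.append(np.linalg.norm(w))
        losses.append(loss)
        print(f"step {step:4d} | loss {loss:.4f} | ||w|| {norms[-1]:.3f}")

# Cosine similarity between the learned and the generating weights.
cosine = w @ true_w / (np.linalg.norm(w) * np.linalg.norm(true_w))
print(f"cosine(w, true_w) = {cosine:.3f}")
```

This is one reason the demo's learned weights come out larger in magnitude than `true_w`. Adding label noise or an L2 penalty would give the problem a finite optimum.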


## References
- [Eisenstein 2.0-2.5, 4.2-4.4.1](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Perceptron and logistic regression](https://www.cs.utexas.edu/~gdurrett/courses/online-course/perc-lr-connections.pdf)
- [Eisenstein 4.1](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Thumbs up? Sentiment Classification using Machine Learning Techniques](https://www.aclweb.org/anthology/W02-1011/)
- [Baselines and Bigrams: Simple, Good Sentiment and Topic Classification](https://www.aclweb.org/anthology/P12-2018/)
- [Convolutional Neural Networks for Sentence Classification](https://www.aclweb.org/anthology/D14-1181/)
- [[GitHub] NLP Progress on Sentiment Analysis](https://github.com/sebastianruder/NLP-progress/blob/master/english/sentiment_analysis.md)


*Links only; we do not redistribute slides or papers.*