# Lecture Notes: Hyperparameters and the Limits of Gradient Descent

## Recap

<br>

<img src="./images/381.png" width="500" style="display: block; margin: auto;">

<br>

We've previously studied **stochastic gradient descent (SGD)** and how it is used to optimize model parameters during training.

## Can SGD Optimize Everything?

No. SGD can optimize any parameter with respect to which we can compute a gradient of the loss, but **some important settings in neural network training are *not* optimized this way**, because the training loss is never differentiated with respect to them.

### Example:
- The learning rate ($\epsilon$) in the update rule:
  
  $$ \theta \leftarrow \theta - \epsilon \cdot \nabla_\theta L(\theta) $$

  The learning rate $\epsilon$ is **not** optimized via SGD; it is set manually before training.
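As a minimal sketch of the update rule above, the snippet below runs plain (non-stochastic) gradient descent on a hypothetical one-dimensional loss $L(\theta) = (\theta - 3)^2$. The toy loss, the function names, and the two values of $\epsilon$ are assumptions chosen for illustration. Only `theta` changes inside the loop; `epsilon` is passed in from outside and never updated.

```python
def grad(theta):
    # Analytic gradient of the hypothetical toy loss L(theta) = (theta - 3)^2
    return 2.0 * (theta - 3.0)


def gradient_descent(theta0, epsilon, steps):
    # epsilon is supplied by the caller and stays fixed for the whole run
    theta = theta0
    for _ in range(steps):
        theta = theta - epsilon * grad(theta)  # only theta is updated
    return theta


# Same loss, same starting point, two hand-picked learning rates
theta_small_step = gradient_descent(theta0=0.0, epsilon=0.1, steps=50)
theta_large_step = gradient_descent(theta0=0.0, epsilon=1.1, steps=50)

print(theta_small_step)  # close to the minimum at 3
print(theta_large_step)  # far from 3: the iterates diverge
```

On this toy loss each step multiplies the distance to the minimum by $(1 - 2\epsilon)$, so $\epsilon = 0.1$ shrinks it and $\epsilon = 1.1$ grows it. The update rule itself gives no signal about which value to choose.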

---

## Hyperparameters

<br>

<img src="./images/382.png" width="500" style="display: block; margin: auto;">

<br>

### Definition:
**Hyperparameters** are settings that **cannot be learned** directly through gradient descent on the training loss.

They are **external to the training loop**: they are fixed before training starts and need to be **manually specified or tuned**.

### Common hyperparameters include:
- Learning rate ($\epsilon$)
- Number of epochs
- Model architecture (e.g., number of layers, hidden units)
- Loss function
- Choice of optimizer or optimization variant (e.g., SGD, Adam, RMSprop)

These are **not part of the model's differentiable graph**, so gradient-based optimization can’t adjust them.
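One way to picture the distinction is to keep hyperparameters and learned parameters in separate containers. The sketch below is illustrative only: the `Hyperparameters` class, its fields and default values, and `init_parameters` are hypothetical names, not part of any particular library.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: the training loop cannot modify these values
class Hyperparameters:
    learning_rate: float = 0.01
    num_epochs: int = 10
    hidden_units: int = 16
    optimizer: str = "sgd"


def init_parameters(hp, num_inputs, seed=0):
    # Learned parameters: their shapes are dictated by the hyperparameters,
    # but their values are what gradient descent would update.
    rng = random.Random(seed)
    weights = [
        [rng.gauss(0.0, 0.1) for _ in range(num_inputs)]
        for _ in range(hp.hidden_units)
    ]
    biases = [0.0] * hp.hidden_units
    return {"W": weights, "b": biases}


hp = Hyperparameters()
params = init_parameters(hp, num_inputs=4)
print(len(params["W"]), len(params["W"][0]))  # hidden_units x num_inputs
```

Changing `hidden_units` changes the shape of `W` itself, which is one way to see why an architectural choice cannot be adjusted by a gradient step on `W`.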

---

<br>

<img src="./images/383.png" width="500" style="display: block; margin: auto;">

<br>

## How Are Hyperparameters Tuned?

There is **no universal algorithm** that reliably finds the best hyperparameter configuration.

### In practice:
- Hyperparameters are often **tuned by hand**
- Choosing good values often relies on:
  - Experience
  - Intuition
  - Trial and error
  - Reusing and adapting settings that worked for similar tasks

This manual tuning process is humorously referred to as:

> "**Graduate Student Descent**", because graduate students spend a lot of time adjusting hyperparameters by hand during research.
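The trial-and-error loop described above can be sketched as a small sweep: train once per candidate learning rate, then compare the results on held-out data. Everything in this example is an assumption made for illustration, including the synthetic linear data, the candidate values, and the helper names `make_data`, `train`, and `mse`.

```python
import random


def make_data(n, seed):
    # Synthetic data from a hypothetical ground truth y = 2x + 1 plus noise
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, 2.0 * x + 1.0 + rng.gauss(0.0, 0.1)) for x in xs]


def mse(w, b, data):
    # Mean squared error of the linear model w * x + b on a dataset
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def train(data, learning_rate, num_epochs, seed=0):
    # SGD on squared error, one example per update
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(num_epochs):
        shuffled = data[:]
        rng.shuffle(shuffled)
        for x, y in shuffled:
            err = w * x + b - y
            w -= learning_rate * 2.0 * err * x
            b -= learning_rate * 2.0 * err
    return w, b


train_data = make_data(100, seed=1)
val_data = make_data(50, seed=2)

# The outer loop is the manual part: SGD never touches learning_rate
results = {}
for lr in [0.0001, 0.001, 0.01, 0.1]:
    w, b = train(train_data, learning_rate=lr, num_epochs=20)
    results[lr] = mse(w, b, val_data)

best_lr = min(results, key=results.get)
print(results)
print("best learning rate:", best_lr)
```

The inner loop is ordinary SGD over `w` and `b`. The outer loop over `lr` is the part done by hand in practice: each candidate costs a full training run, and the comparison uses validation data rather than the training loss.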

---

<br>

<img src="./images/384.png" width="500" style="display: block; margin: auto;">

<br>

## Key Takeaways

- **SGD can only optimize parameters for which a gradient of the loss can be computed**.
- **Hyperparameters are not learned by SGD** and require manual or external tuning.
- These include settings related to:
  - The training algorithm
  - The model architecture
  - The loss function
- Hyperparameter tuning is **iterative** and often guided by experimentation and domain experience.

