## Problem with Batch Gradient Descent

Batch Gradient Descent calculates the gradient of the loss function using the **entire training dataset** for every parameter update. While this ensures stable and accurate updates, it introduces several challenges:
- **High Computational Cost:** Each update requires processing all training examples, making it very slow for large datasets.
- **Memory Limitations:** A straightforward implementation keeps the entire dataset in memory to compute each gradient, which is not feasible for very large datasets.
- **Infrequent Updates:** Parameters are updated only once per epoch, leading to slower learning, especially when the dataset is large.
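
To make the cost of a single update concrete, here is a minimal sketch of batch gradient descent for linear regression with a mean squared-error loss. The function name `batch_gradient_step`, the toy data, and the hyperparameter values are illustrative assumptions, not part of the original text.

```python
def batch_gradient_step(theta, xs, ys, alpha):
    """One batch update: the gradient is averaged over every training example."""
    m = len(xs)
    grad = [0.0] * len(theta)
    for x, y in zip(xs, ys):  # a full pass over the data for a single update
        error = sum(t * xj for t, xj in zip(theta, x)) - y
        for j, xj in enumerate(x):
            grad[j] += error * xj / m
    return [t - alpha * g for t, g in zip(theta, grad)]


# Toy data generated from y = 1 + 2x; the leading 1.0 in each row is the bias feature.
xs = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
ys = [1.0, 3.0, 5.0, 7.0]

theta = [0.0, 0.0]
for epoch in range(2000):  # exactly one parameter update per epoch
    theta = batch_gradient_step(theta, xs, ys, alpha=0.1)
```

With four examples the full pass is trivial, but the inner loop grows linearly with the dataset size while still producing only one update.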

---

## Introduction to Stochastic Gradient Descent

**Stochastic Gradient Descent (SGD)** is an optimization technique designed to overcome the limitations of batch gradient descent. Instead of using the whole dataset to compute the gradient, SGD updates the model parameters using **only one randomly selected data point** at each iteration. Each update is therefore much cheaper to compute and updates happen far more often, making SGD particularly suitable for large-scale and online learning tasks.

---

## Mathematical Formulation of Stochastic Gradient Descent

Given a cost function $J(\theta)$ and a dataset with $m$ examples, the update rule for the parameters $\theta$ in SGD is:

$$
\theta := \theta - \alpha \nabla_\theta J(\theta; x^{(i)}, y^{(i)})
$$

where:
- $\alpha$ is the learning rate,
- $(x^{(i)}, y^{(i)})$ is a training example chosen at random, with $i$ drawn from $\{1, \dots, m\}$,
- $\nabla_\theta J(\theta; x^{(i)}, y^{(i)})$ is the gradient of the loss function with respect to $\theta$ for the $i$-th example.

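For linear regression with hypothesis $h_\theta(x) = \theta^T x$ and per-example squared-error loss $J(\theta; x^{(i)}, y^{(i)}) = \frac{1}{2}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$, the update for each parameter $\theta_j$ is:

$$
\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta; x^{(i)}, y^{(i)}) = \theta_j - \alpha \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}
$$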
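
The update rule can be sketched in a few lines of plain Python. This is a minimal illustration, assuming a linear model $\theta^T x$ with per-example loss $\frac{1}{2}(\theta^T x - y)^2$; the function name `sgd_linear_regression`, the toy data, and the default hyperparameters are hypothetical choices made for the example.

```python
import random


def sgd_linear_regression(xs, ys, alpha=0.05, n_steps=5000, seed=0):
    """SGD for linear regression: one randomly chosen example per update."""
    rng = random.Random(seed)
    theta = [0.0] * len(xs[0])
    for _ in range(n_steps):
        i = rng.randrange(len(xs))  # pick one example uniformly at random
        x, y = xs[i], ys[i]
        error = sum(t * xj for t, xj in zip(theta, x)) - y  # prediction minus target
        # theta_j := theta_j - alpha * error * x_j, applied to all j at once
        theta = [t - alpha * error * xj for t, xj in zip(theta, x)]
    return theta


# Toy data generated from y = 1 + 2x; the leading 1.0 in each row is the bias feature.
xs = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
ys = [1.0, 3.0, 5.0, 7.0]

theta = sgd_linear_regression(xs, ys)
print(theta)  # should be close to [1.0, 2.0] on this noise-free data
```

Because the toy data contains no noise, every per-example gradient vanishes at the true parameters, so the iterates settle there even with a constant learning rate. On noisy data the estimate would keep fluctuating around the minimum.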

---

## Advantages of Stochastic Gradient Descent

- **Faster Parameter Updates:** Updates occur after each training example, enabling quicker learning and responsiveness.
- **Scalable to Large Datasets:** Can efficiently handle datasets that are too large to fit into memory.
- **Potential to Escape Local Minima:** On non-convex loss surfaces, the noise in the updates can help the algorithm jump out of shallow local minima.
- **Ideal for Online and Real-Time Learning:** Suitable for scenarios where data arrives sequentially or in streams.
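
The online setting can be sketched with a generator that hands over one example at a time, so the full dataset is never held in memory. The data source `example_stream` is a hypothetical stand-in for a real stream, and `online_sgd` and its learning rate are assumptions made for illustration.

```python
import random


def example_stream(n, seed=0):
    """Hypothetical data source: yields (x, y) pairs one at a time from y = 1 + 2x."""
    rng = random.Random(seed)
    for _ in range(n):
        x1 = rng.uniform(0.0, 1.0)
        yield [1.0, x1], 1.0 + 2.0 * x1  # the leading 1.0 is the bias feature


def online_sgd(stream, n_features, alpha=0.1):
    """Update the parameters as each example arrives, then discard the example."""
    theta = [0.0] * n_features
    for x, y in stream:
        error = sum(t * xj for t, xj in zip(theta, x)) - y
        theta = [t - alpha * error * xj for t, xj in zip(theta, x)]
    return theta


theta = online_sgd(example_stream(20000), n_features=2)
print(theta)  # should be close to [1.0, 2.0]
```

Only the current example and the parameter vector are stored at any time, which is what makes the approach usable when data arrives sequentially.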

---

## Problems with Stochastic Gradient Descent

- **Noisy Updates:** The randomness introduces noise, causing the loss to fluctuate rather than decrease smoothly.
- **Possible Oscillation:** With a constant learning rate, the algorithm may not converge exactly to the minimum, but instead oscillate around it.
- **Sensitive to Learning Rate:** Choosing the right learning rate is crucial; a value too high can cause divergence, while too low can slow down convergence.
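
One common way to reduce the noise and oscillation is to shrink the learning rate over time. The sketch below uses a deliberately simple problem, estimating a mean by running SGD on $J(\theta; y) = \frac{1}{2}(\theta - y)^2$, and compares a constant step with the decaying schedule $\alpha_t = 1/t$. The function name `sgd_mean`, the synthetic samples, and both schedules are assumptions chosen for illustration.

```python
import random


def sgd_mean(samples, alpha_fn):
    """SGD on J(theta; y) = 0.5 * (theta - y)**2, whose gradient is theta - y."""
    theta = 0.0
    for t, y in enumerate(samples, start=1):
        theta -= alpha_fn(t) * (theta - y)
    return theta


rng = random.Random(0)
samples = [rng.gauss(3.0, 1.0) for _ in range(10000)]  # noisy observations around 3.0

theta_constant = sgd_mean(samples, lambda t: 0.5)     # final value depends heavily on the last few samples
theta_decayed = sgd_mean(samples, lambda t: 1.0 / t)  # step size shrinks as 1/t
sample_mean = sum(samples) / len(samples)

print(theta_constant, theta_decayed, sample_mean)
```

With $\alpha_t = 1/t$ the iterate is exactly the running average of the samples seen so far, so the noise averages out. With the constant step, the estimate keeps reacting to each new noisy sample and does not settle.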