## Problem with Batch Gradient Descent

Batch Gradient Descent calculates the gradient of the loss function using the **entire training dataset** for every parameter update. Because each step uses the exact gradient of the training loss, the updates are stable, but this introduces several challenges:
- **High Computational Cost:** Each update requires processing all training examples, making it very slow for large datasets.
- **Memory Limitations:** A straightforward implementation keeps the entire dataset in memory, which is not feasible for very large datasets.
- **Infrequent Updates:** Parameters are updated only once per epoch, leading to slower learning, especially when the dataset is large.
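
To make the cost concrete, here is a minimal sketch of full-batch gradient descent for linear regression. The function name `batch_gradient_descent`, the synthetic data, and the default hyperparameters are illustrative assumptions, not part of the notebook below.

```python
import numpy as np

def batch_gradient_descent(X, y, learning_rate=0.1, epochs=500):
    """Full-batch gradient descent for linear regression (illustrative sketch)."""
    n_samples, n_features = X.shape
    coef = np.zeros(n_features)
    intercept = 0.0
    for _ in range(epochs):
        # Every update touches all n_samples rows: this is the costly part.
        errors = X @ coef + intercept - y
        grad_coef = X.T @ errors / n_samples
        grad_intercept = errors.mean()
        # Exactly one parameter update per pass over the data.
        coef -= learning_rate * grad_coef
        intercept -= learning_rate * grad_intercept
    return intercept, coef

# Hypothetical noise-free data: y = 4 + 3*x0 - 2*x1
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = 4.0 + X_demo @ np.array([3.0, -2.0])
bgd_intercept, bgd_coef = batch_gradient_descent(X_demo, y_demo)
print(bgd_intercept, bgd_coef)
```

With 500 epochs the loop performs only 500 updates, yet it reads all 100 rows each time, which is the trade-off the list above describes.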

---

## Introduction to Stochastic Gradient Descent

**Stochastic Gradient Descent (SGD)** is an optimization technique designed to overcome the limitations of batch gradient descent. Instead of using the whole dataset to compute the gradient, SGD updates the model parameters using **only one randomly selected data point** at each iteration. Each update is therefore much cheaper to compute, and the parameters change many times per pass over the data instead of once, making SGD particularly suitable for large-scale and online learning tasks.

---

## Mathematical Formulation of Stochastic Gradient Descent

Given a cost function $J(\theta)$ and a dataset with $m$ examples, the update rule for the parameters $\theta$ in SGD is:

$$
\theta := \theta - \alpha \nabla_\theta J(\theta; x^{(i)}, y^{(i)})
$$

where:
- $\alpha$ is the learning rate,
- $(x^{(i)}, y^{(i)})$ is a randomly chosen training example,
- $\nabla_\theta J(\theta; x^{(i)}, y^{(i)})$ is the gradient of the loss function with respect to $\theta$ for the $i$-th example.

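For linear regression with hypothesis $h_\theta(x) = \theta^T x$ (taking $x_0 = 1$ for the intercept), and assuming the per-example squared-error loss $J(\theta; x^{(i)}, y^{(i)}) = \frac{1}{2}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$, the update for each parameter $\theta_j$ is:

$$
\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta; x^{(i)}, y^{(i)}) = \theta_j - \alpha \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}
$$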
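
As a small worked example, a single SGD update for linear regression can be sketched as follows. This assumes a per-example loss of the form `0.5 * (prediction - y)**2`; the helper name `sgd_step` and the numbers are made up for illustration.

```python
import numpy as np

def sgd_step(coef, intercept, x_i, y_i, learning_rate):
    """One SGD update from a single example, assuming loss 0.5 * (pred - y)**2."""
    error = np.dot(x_i, coef) + intercept - y_i
    new_coef = coef - learning_rate * error * x_i
    new_intercept = intercept - learning_rate * error
    return new_coef, new_intercept

# Worked example with made-up numbers
coef0 = np.array([1.0, 1.0])
x_i = np.array([2.0, 3.0])
step_coef, step_intercept = sgd_step(coef0, 0.0, x_i, y_i=10.0, learning_rate=0.1)
# prediction = 5 and error = -5, so coef moves by +0.5 * x_i and the intercept by +0.5:
# coef -> [2.0, 2.5], intercept -> 0.5
print(step_coef, step_intercept)
```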

---

## Advantages of Stochastic Gradient Descent

- **Faster Parameter Updates:** Updates occur after each training example, enabling quicker learning and responsiveness.
- **Scalable to Large Datasets:** Each update needs only one example in memory at a time, so SGD can handle datasets that are too large to fit into memory.
- **Potential to Escape Local Minima:** The randomness in updates can help the algorithm jump out of shallow local minima.
- **Ideal for Online and Real-Time Learning:** Suitable for scenarios where data arrives sequentially or in streams.
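
The online-learning point can be illustrated with a short sketch in which examples arrive one at a time and are discarded after use. The generator `sample_stream` is a hypothetical stand-in for a real data stream, and the learning rate is an arbitrary choice for this toy problem.

```python
import numpy as np

def sample_stream(n_samples, seed=0):
    """Hypothetical data source: yields one (x, y) pair at a time, y = 1 + 2*x0 - x1."""
    rng = np.random.default_rng(seed)
    for _ in range(n_samples):
        x = rng.normal(size=2)
        yield x, 1.0 + 2.0 * x[0] - 1.0 * x[1]

stream_coef = np.zeros(2)
stream_intercept = 0.0
learning_rate = 0.05
for x, y in sample_stream(5000):
    # Each example is used for one update and then dropped; nothing is stored.
    error = np.dot(x, stream_coef) + stream_intercept - y
    stream_coef -= learning_rate * error * x
    stream_intercept -= learning_rate * error
print(stream_intercept, stream_coef)
```

Memory use stays constant no matter how many examples the stream produces, because the loop never holds more than one example.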

---

## Problems with Stochastic Gradient Descent

- **Noisy Updates:** The randomness introduces noise, causing the loss to fluctuate rather than decrease smoothly.
- **Possible Oscillation:** The algorithm may not converge exactly to the minimum, but instead oscillate around it.
- **Sensitive to Learning Rate:** Choosing the right learning rate is crucial; a value too high can cause divergence, while too low can slow down convergence.
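
The learning-rate sensitivity is easy to see on a toy problem. The sketch below runs SGD on a one-parameter model `y = w * x` with noise-free data; the function `sgd_1d`, the two data points, and both learning rates are assumptions chosen so the effect is unambiguous.

```python
import numpy as np

def sgd_1d(learning_rate, n_steps=50, seed=0):
    """SGD on the one-parameter model y = w * x with noise-free data (true w = 2)."""
    rng = np.random.default_rng(seed)
    xs = np.array([2.0, 3.0])
    ys = 2.0 * xs
    w = 0.0
    for _ in range(n_steps):
        i = rng.integers(len(xs))
        error = w * xs[i] - ys[i]
        w -= learning_rate * error * xs[i]
    return w

w_small = sgd_1d(learning_rate=0.05)  # every step shrinks the distance to w = 2
w_large = sgd_1d(learning_rate=1.0)   # every step overshoots, so the distance grows
print(w_small, w_large)
```

Each update multiplies the distance to the true weight by `1 - learning_rate * x**2`. With a rate of 0.05 that factor is 0.8 or 0.55, so the error shrinks. With a rate of 1.0 it is -3 or -8, so the error grows on every step.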

In [1]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_diabetes
import matplotlib.pyplot as plt
import plotly.express as px
from sklearn.model_selection import train_test_split

In [2]:
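# Load the diabetes dataset: 442 samples, 10 scaled features
X, y = load_diabetes(return_X_y=True)

In [3]:
# Hold out 30% of the data for testing
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2)

In [4]:
# Ordinary least squares baseline to compare the SGD result against
reg = LinearRegression()
reg.fit(x_train, y_train)

In [5]:
reg.intercept_, reg.coef_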

(np.float64(152.6275795791894),
 array([ -31.80406415, -214.66452697,  445.14791885,  376.34786891,
        -802.50380919,  525.99207358,  106.4287134 ,  172.3171033 ,
         799.50335529,   57.1646797 ]))

In [6]:
y_pred1 = reg.predict(x_test)

In [14]:
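class GdRegressor:
    """Linear regression fitted with stochastic gradient descent (one sample per update)."""

    def __init__(self, learning_rate, epochs):
        self.lr = learning_rate
        # Note: `epochs` here counts single-sample updates, not full passes over the data
        self.epochs = epochs
        self.intercept = None
        self.coef_ = None

    def fit(self, x, y):
        # Start from intercept 0 and all-ones coefficients
        self.intercept = 0
        self.coef_ = np.ones(x.shape[1])
        n = x.shape[0]
        for _ in range(self.epochs):
            # Pick one training example uniformly at random (with replacement)
            idx = np.random.randint(0, n)
            xi = x[idx]
            yi = y[idx]
            y_pred = np.dot(xi, self.coef_) + self.intercept
            error = y_pred - yi
            # Gradient step for the loss 0.5 * error**2 on this single example
            self.coef_ -= self.lr * error * xi
            self.intercept -= self.lr * error
        print(self.intercept, self.coef_)

    def predict(self, x):
        return np.dot(x, self.coef_) + self.intercept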

In [22]:
gd = GdRegressor(learning_rate=0.01, epochs=10000)
gd.fit(x_train, y_train)

152.52237167580654 [ 47.26698511 -10.42346256 149.51438389 123.63666939  47.1470109
  31.51389329 -95.71645665  99.36672061 157.03295034  87.90303921]
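
In the class above, `epochs` is the number of single-sample updates, so 10000 "epochs" means 10000 randomly drawn examples rather than 10000 passes over the training set. A common alternative, sketched below, treats one epoch as a full shuffled pass over the data. The function `sgd_epochs` and the synthetic data are illustrative assumptions, not the notebook's method.

```python
import numpy as np

def sgd_epochs(X, y, learning_rate=0.01, epochs=50, seed=0):
    """SGD where one epoch is a full shuffled pass over the data (illustrative variant)."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    coef = np.ones(n_features)
    intercept = 0.0
    for _ in range(epochs):
        # Visit every example exactly once per epoch, in a fresh random order.
        for idx in rng.permutation(n_samples):
            error = np.dot(X[idx], coef) + intercept - y[idx]
            coef -= learning_rate * error * X[idx]
            intercept -= learning_rate * error
    return intercept, coef

# Hypothetical noise-free data: y = 3 + 2*x0 - x1 + 0.5*x2
rng = np.random.default_rng(1)
X_syn = rng.normal(size=(200, 3))
y_syn = 3.0 + X_syn @ np.array([2.0, -1.0, 0.5])
ep_intercept, ep_coef = sgd_epochs(X_syn, y_syn)
print(ep_intercept, ep_coef)
```

Shuffling once per epoch guarantees that every example contributes equally often, whereas sampling with replacement may skip some examples and repeat others within the same number of updates.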
