## Problem with Batch Gradient Descent

Batch Gradient Descent calculates the gradient of the loss function using the **entire training dataset** for every parameter update. While this ensures stable and accurate updates, it introduces several challenges:
- **High Computational Cost:** Each update requires processing all training examples, making it very slow for large datasets.
- **Memory Limitations:** The entire dataset must fit in memory, which is not feasible for very large datasets.
- **Infrequent Updates:** Parameters are updated only once per epoch, leading to slower learning, especially when the dataset is large.

---

## Introduction to Stochastic Gradient Descent

**Stochastic Gradient Descent (SGD)** is an optimization technique designed to overcome the limitations of batch gradient descent. Instead of using the whole dataset to compute the gradient, SGD updates the model parameters using **only one randomly selected data point** at each iteration. This results in much faster and more frequent updates, making SGD particularly suitable for large-scale and online learning tasks.
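
One way to see why a single example can stand in for the whole dataset: for mean squared error, the full-batch gradient is the average of the per-example gradients, so the gradient at a uniformly sampled example matches the batch gradient in expectation. The sketch below checks this numerically. The synthetic data, the seed, and all variable names are assumptions made for illustration only.

```python
import numpy as np

# Synthetic regression data (assumed for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)
theta = np.zeros(3)

residual = X @ theta - y                           # shape (200,)

# Gradient of the squared error of each example, one row per example
per_example_grads = 2.0 * residual[:, None] * X    # shape (200, 3)

# Gradient of the mean squared error over the full dataset
batch_grad = 2.0 * X.T @ residual / X.shape[0]     # shape (3,)

# Averaging the per-example gradients recovers the batch gradient
mean_of_per_example = per_example_grads.mean(axis=0)
print(np.allclose(mean_of_per_example, batch_grad))
```

Any single row of `per_example_grads` is therefore a cheap but noisy estimate of `batch_grad`, which is the trade-off SGD makes.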

---

## Mathematical Formulation of Stochastic Gradient Descent

Given a cost function $J(\theta)$ and a dataset with $m$ examples, the update rule for the parameters $\theta$ in SGD is:

$$
\theta := \theta - \alpha \nabla_\theta J(\theta; x^{(i)}, y^{(i)})
$$

where:
- $\alpha$ is the learning rate,
- $(x^{(i)}, y^{(i)})$ is a randomly chosen training example,
- $\nabla_\theta J(\theta; x^{(i)}, y^{(i)})$ is the gradient of the loss function with respect to $\theta$ for the $i$-th example.

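For linear regression with the squared-error loss on a single example, $J(\theta; x^{(i)}, y^{(i)}) = \left(\theta^\top x^{(i)} - y^{(i)}\right)^2$, the update for each parameter $\theta_j$ is:

$$
\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta; x^{(i)}, y^{(i)}) = \theta_j - 2\alpha \left(\theta^\top x^{(i)} - y^{(i)}\right) x_j^{(i)}
$$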
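
As a minimal sketch of one such update for linear regression, assuming a squared-error loss on the single example and an intercept folded into $\theta$ through a constant feature, the step can be written as below. The function name `sgd_step` and the toy numbers are hypothetical.

```python
import numpy as np

def sgd_step(theta, x_i, y_i, alpha):
    """One SGD update for linear regression on a single example (x_i, y_i)."""
    error = np.dot(x_i, theta) - y_i   # prediction minus target
    grad = 2.0 * error * x_i           # gradient of (x_i . theta - y_i)^2 w.r.t. theta
    return theta - alpha * grad

theta = np.zeros(2)
x_i = np.array([1.0, 2.0])
y_i = 3.0

# error = -3, grad = [-6, -12], so theta moves to [0.6, 1.2]
theta_new = sgd_step(theta, x_i, y_i, alpha=0.1)
print(theta_new)
```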

---

## Advantages of Stochastic Gradient Descent

- **Faster Parameter Updates:** Updates occur after each training example, enabling quicker learning and responsiveness.
- **Scalable to Large Datasets:** Can efficiently handle datasets that are too large to fit into memory.
- **Potential to Escape Local Minima:** The randomness in updates can help the algorithm jump out of shallow local minima.
- **Ideal for Online and Real-Time Learning:** Suitable for scenarios where data arrives sequentially or in streams.
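
For the streaming case in the last point, scikit-learn's `SGDRegressor` provides `partial_fit`, which updates the model from one chunk of data at a time. The sketch below is only an illustration: the stream is simulated with synthetic chunks, and the chunk size, seed, and `true_coef` values are assumptions.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
true_coef = np.array([2.0, -1.0])   # assumed ground truth for the simulated stream

model = SGDRegressor(learning_rate='constant', eta0=0.01, random_state=0)

# Simulate data arriving in small chunks; the model never sees the full dataset at once
for _ in range(200):
    X_chunk = rng.normal(size=(10, 2))
    y_chunk = X_chunk @ true_coef + rng.normal(scale=0.1, size=10)
    model.partial_fit(X_chunk, y_chunk)

print(model.coef_, model.intercept_)
```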

---

## Problems with Stochastic Gradient Descent

- **Noisy Updates:** The randomness introduces noise, causing the loss to fluctuate rather than decrease smoothly.
- **Possible Oscillation:** The algorithm may not converge exactly to the minimum, but instead oscillate around it.
- **Sensitive to Learning Rate:** Choosing the right learning rate is crucial; a value too high can cause divergence, while too low can slow down convergence.
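
The noise described above can be made visible by recording the full-dataset loss after every epoch. The following is a rough sketch on synthetic data (the data, seed, learning rate `alpha`, and helper `mse` are all assumptions for illustration). With a constant learning rate, the recorded losses typically drop sharply at first and then wobble around a floor instead of decreasing strictly.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

def mse(theta):
    """Mean squared error over the full synthetic dataset."""
    return float(np.mean((X @ theta - y) ** 2))

theta = np.zeros(3)
alpha = 0.01                      # constant learning rate (assumed value)
losses = [mse(theta)]             # loss before any update

for epoch in range(20):
    for _ in range(X.shape[0]):
        i = rng.integers(0, X.shape[0])      # pick one random example
        error = X[i] @ theta - y[i]
        theta -= alpha * 2.0 * error * X[i]  # single-example update
    losses.append(mse(theta))     # full-dataset loss after this epoch

print([round(loss, 4) for loss in losses])
```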

In [153]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_diabetes
import matplotlib.pyplot as plt
import plotly.express as px
from sklearn.model_selection import train_test_split
import time

In [154]:
X,y=load_diabetes(return_X_y=True)

In [155]:
x_train,x_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=2)

In [156]:
reg=LinearRegression()
reg.fit(x_train,y_train)

In [157]:
reg.intercept_,reg.coef_

(np.float64(152.6275795791894),
 array([ -31.80406415, -214.66452697,  445.14791885,  376.34786891,
        -802.50380919,  525.99207358,  106.4287134 ,  172.3171033 ,
         799.50335529,   57.1646797 ]))

In [158]:
y_pred1=reg.predict(x_test)

In [159]:
#for stochastic gradient descent
class SdRegressor:
    def __init__(self, learning_rate, epochs):
        self.lr = learning_rate
        self.epochs = epochs
        self.intercept = None
        self.coef_ = None

    def fit(self, x, y):
        self.intercept = 0
        self.coef_ = np.ones(x.shape[1])
        n = x.shape[0]
        for i in range(self.epochs):
            for j in range(x.shape[0]):
                idx = np.random.randint(0, n)
                xi = x[idx]
                yi = y[idx]
                y_pred = np.dot(xi, self.coef_) + self.intercept
                error = y_pred - yi
                # Update coefficients and intercept
                self.coef_ -= self.lr * 2*error * xi
                self.intercept -= self.lr *2* error
        print(self.intercept, self.coef_)

    def predict(self, x):
        return np.dot(x, self.coef_) + self.intercept

In [160]:
start=time.time()
gd=SdRegressor(0.1,100)
gd.fit(x_train,y_train)
print('Time taken is:',time.time()-start)

134.53171136668246 [ -45.46660132 -212.47875607  482.18090685  387.89477304 -110.69010715
  -33.20881573 -220.78545581   99.19336955  519.82644435   83.09649849]
Time taken is: 0.09905600547790527


In [161]:
y_pred2=gd.predict(x_test)

In [162]:
from sklearn.metrics import r2_score

In [163]:
print("for the ols method:",r2_score(y_test,y_pred1))
print("for the gd method:",r2_score(y_test,y_pred2))

for the ols method: 0.5012577617513012
for the gd method: 0.4559352004820073


Stochastic gradient descent does not improve consistently: because each update uses a single random example, the next iteration may be worse than the previous one, and the fitted coefficients change from run to run.

In [164]:
#for the batch gradient descent
class GdRegressor:
    def __init__(self,learning_rate,epochs):
        self.lr=learning_rate
        self.epochs=epochs
        self.intercept=None
        self.coef_=None
    
    def fit(self,x,y):
        self.intercept=0
        self.coef_=np.ones(x.shape[1])
        for i in range(self.epochs):
            # Vectorized prediction: one dot product computes y_hat for all rows at once
            y_hat = np.dot(x,self.coef_) + self.intercept
            intercept_der = -2 * np.mean(y - y_hat)
            self.intercept = self.intercept - (self.lr * intercept_der)
            # (y - y_hat) has shape (n,) and x has shape (n, 10), so their dot product has shape (10,): one partial derivative per coefficient after dividing by n
            coef_der = -2 * np.dot((y - y_hat),x)/x.shape[0]
            self.coef_ = self.coef_ - (self.lr * coef_der)
        print(self.intercept,self.coef_)
    
    def predict(self,x):
        return np.dot(x,self.coef_) + self.intercept
    

In [165]:
start=time.time()
gdr = GdRegressor(epochs=1000,learning_rate=0.5)
gdr.fit(x_train,y_train)
print('Time taken:',time.time()-start)

152.7908903056216 [  -1.98714805 -178.52647764  422.78228872  342.45490237  -20.1557654
  -86.29425447 -212.27779724  133.24587264  437.27773939  102.99345759]
Time taken: 0.0013000965118408203


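The code above was written for the time comparison. Note that the two runs are not configured identically: the batch run performs 1000 vectorized updates (one per epoch), while the SGD run performs 100 epochs with one update per training row inside a Python loop.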

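**When to use Stochastic Gradient Descent?**
- For large datasets, where a full pass over the data for every update is too slow or the data does not fit in memory.
- For non-convex loss functions with multiple local minima, where the noise in the updates can help escape shallow ones.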


### **Learning schedules**
```
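t0, t1 = 5, 50

def learning_rate(t):
    # Decays from t0/t1 toward 0 as the update count t grows
    return t0 / (t + t1)

for i in range(epochs):
    for j in range(X.shape[0]):
        # t = number of updates performed so far, across all epochs
        lr = learning_rate(i * X.shape[0] + j)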
```
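
To show where a schedule plugs into the training loop, here is a minimal sketch that assumes a decaying schedule of the form `t0 / (t + t1)` and uses it in place of a fixed learning rate. The function `sgd_with_schedule`, the synthetic `X_demo`/`y_demo` data, and the seeds are hypothetical and chosen only for illustration.

```python
import numpy as np

t0, t1 = 5, 50

def learning_rate(t):
    """Step size that shrinks as the update count t grows (assumed schedule)."""
    return t0 / (t + t1)

def sgd_with_schedule(X, y, epochs, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    coef = np.zeros(d)
    intercept = 0.0
    for epoch in range(epochs):
        for j in range(n):
            lr = learning_rate(epoch * n + j)   # large steps early, small steps late
            idx = rng.integers(0, n)            # one random training example
            error = X[idx] @ coef + intercept - y[idx]
            coef -= lr * 2.0 * error * X[idx]
            intercept -= lr * 2.0 * error
    return intercept, coef

# Synthetic data with features in [-1, 1] so the early, larger steps stay stable
rng = np.random.default_rng(1)
X_demo = rng.uniform(-1.0, 1.0, size=(200, 3))
true_coef = np.array([1.5, -2.0, 0.5])
y_demo = X_demo @ true_coef + 4.0 + rng.normal(scale=0.05, size=200)

intercept, coef = sgd_with_schedule(X_demo, y_demo, epochs=20)
print(intercept, coef)
```

The early steps are large enough to move quickly toward the solution, while the shrinking step size damps the oscillation around the minimum that a constant learning rate leaves behind.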

In [166]:

from sklearn.linear_model import SGDRegressor

In [167]:
reg = SGDRegressor(max_iter=100,learning_rate='constant',eta0=0.01)

In [168]:

reg.fit(x_train,y_train)

y_pred = reg.predict(x_test)

r2_score(y_test,y_pred)



0.4209803958818156