# Stochastic Gradient Descent (SGD)

**Stochastic Gradient Descent (SGD)** is a variant of gradient descent in which the model parameters are updated
using **one training example at a time** instead of the entire dataset.

Each update needs only a single gradient evaluation, so its cost does not grow with the dataset size. This makes SGD faster per update and more scalable on large datasets, at the price of noisier updates.

---

## How Stochastic Gradient Descent Works

At each iteration:
1. Randomly select **one training sample**
2. Compute the gradient using only that sample
3. Update model parameters immediately
4. Repeat for all samples (one epoch)

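This results in **one update per sample**: an epoch over $n$ samples performs $n$ parameter updates, compared with a single update per epoch in batch gradient descent.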
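The steps above can be sketched as a short loop. This is a minimal illustration, not a reference implementation: the helper names `sgd_epoch` and `squared_error_grad`, the toy data, and the choice to visit samples in a shuffled order without replacement are all assumptions made for the example.

```python
import numpy as np

def sgd_epoch(theta, X, y, grad_fn, lr, rng):
    """Run one epoch of SGD: one parameter update per training sample."""
    n_updates = 0
    # Visit the samples in a fresh random order (without replacement).
    for i in rng.permutation(len(X)):
        grad = grad_fn(theta, X[i], y[i])   # gradient from a single sample
        theta = theta - lr * grad           # immediate parameter update
        n_updates += 1
    return theta, n_updates

def squared_error_grad(theta, x_i, y_i):
    """Gradient of (theta . x_i - y_i)^2 with respect to theta."""
    return 2.0 * (np.dot(theta, x_i) - y_i) * x_i

# Toy, noiseless data (hypothetical values chosen for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta

theta = np.zeros(3)
for _ in range(20):  # 20 epochs
    theta, n_updates = sgd_epoch(theta, X, y, squared_error_grad, lr=0.05, rng=rng)
```

With 50 samples, each call to `sgd_epoch` performs 50 updates.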

---

## Update Rule

For a single training example $(x^{(i)}, y^{(i)})$, the update rule is:

$$
\theta := \theta - \alpha \nabla_\theta L(\hat{y}^{(i)}, y^{(i)})
$$

Where:
- $\theta$ = model parameters  
- $\alpha$ = learning rate  
- $\hat{y}^{(i)}$ = model prediction for $x^{(i)}$  
- $L(\hat{y}^{(i)}, y^{(i)})$ = loss for a single sample  
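
As a small worked sketch of one application of this rule, assume a linear prediction $\hat{y} = \theta \cdot x$ and a squared-error loss. The numbers below are made up for illustration.

```python
import numpy as np

theta = np.array([0.5, -1.0])   # current parameters (hypothetical values)
x_i = np.array([2.0, 1.0])      # one training example
y_i = 3.0                       # its target
alpha = 0.05                    # learning rate

y_hat = np.dot(theta, x_i)                 # 0.5*2 + (-1.0)*1 = 0.0
loss_before = (y_hat - y_i) ** 2           # (0 - 3)^2 = 9.0

grad = 2.0 * (y_hat - y_i) * x_i           # 2*(-3)*[2, 1] = [-12, -6]
theta_new = theta - alpha * grad           # [0.5 + 0.6, -1.0 + 0.3] = [1.1, -0.7]

loss_after = (np.dot(theta_new, x_i) - y_i) ** 2   # (1.5 - 3)^2 = 2.25
```

A single step moves the prediction for this sample from 0.0 to 1.5, reducing its loss from 9.0 to 2.25.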

---

## Example: Linear Regression

For a single sample, the Mean Squared Error reduces to the squared error of that sample:

$$
L = (\hat{y}^{(i)} - y^{(i)})^2
$$

The gradient is computed using **only one data point** at a time.
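
To make the gradient explicit, assume the linear model $\hat{y}^{(i)} = w^\top x^{(i)} + b$, where the symbols $w$ (coefficients) and $b$ (intercept) are introduced here for illustration. Applying the chain rule to the single-sample loss gives:

$$
\frac{\partial L}{\partial w} = 2\,(\hat{y}^{(i)} - y^{(i)})\, x^{(i)},
\qquad
\frac{\partial L}{\partial b} = 2\,(\hat{y}^{(i)} - y^{(i)})
$$

These are the same quantities that the `SGDRegressor` class later in this notebook computes as `coef_der` and `intercept_der`, where they are written in the equivalent form $-2\,(y^{(i)} - \hat{y}^{(i)})$.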

---

## Characteristics

- Very frequent parameter updates  
- Noisy and fluctuating loss curve  
- Faster initial convergence  
- Approximate gradient estimation  
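
The "approximate gradient" point can be checked numerically. The sketch below uses synthetic data and a squared-error loss, both assumptions for illustration. It shows that the per-sample gradients average to the full-batch gradient, while individual gradients scatter around that average.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = X @ np.array([3.0, -1.0]) + rng.normal(scale=0.5, size=200)
theta = np.zeros(2)

residuals = X @ theta - y                         # shape (200,)
per_sample_grads = 2.0 * residuals[:, None] * X   # one gradient per row, shape (200, 2)
full_grad = 2.0 * X.T @ residuals / len(X)        # gradient of the mean squared error

mean_of_samples = per_sample_grads.mean(axis=0)   # equals full_grad
spread = per_sample_grads.std(axis=0)             # per-coordinate noise of single-sample gradients
```

The nonzero `spread` is the source of the noisy, fluctuating loss curve.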

---

## Advantages

- Each update is cheap, even on very large datasets  
- Low memory usage, since only one sample is needed per update  
- Suitable for online and streaming data  
- The noise in the updates can help escape shallow local minima  
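
The online and streaming case can be sketched with a generator that yields one sample at a time. The `sample_stream` source and its underlying weights are hypothetical. The point is only that the model never needs more than the current sample in memory.

```python
import numpy as np

def sample_stream(n_samples, rng):
    """Hypothetical data source that yields one (x, y) pair at a time."""
    true_w, true_b = np.array([2.0, -3.0]), 1.0
    for _ in range(n_samples):
        x = rng.normal(size=2)
        yield x, np.dot(true_w, x) + true_b

rng = np.random.default_rng(1)
w, b, lr = np.zeros(2), 0.0, 0.05

# Only the current sample is held in memory; the model is usable after every update.
for x, y in sample_stream(2000, rng):
    error = np.dot(w, x) + b - y
    w -= lr * 2.0 * error * x
    b -= lr * 2.0 * error
```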

---

## Limitations

- Noisy updates can make training unstable  
- The loss may fluctuate instead of decreasing smoothly  
- Sensitive to the choice of learning rate  
- Harder to debug, because a fluctuating loss does not by itself indicate a problem  
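
The sensitivity to the learning rate can be illustrated on a one-parameter toy problem. The function `run_sgd`, the target relation `y = 2x`, and the two learning rates are assumptions chosen for the example.

```python
import numpy as np

def run_sgd(lr, n_steps=200, seed=0):
    """Fit y = 2x with single-sample SGD and return the final absolute weight error."""
    rng = np.random.default_rng(seed)
    w = 0.0
    for _ in range(n_steps):
        x = rng.uniform(0.5, 1.5)
        grad = 2.0 * (w * x - 2.0 * x) * x   # gradient of (w*x - 2x)^2 w.r.t. w
        w -= lr * grad
    return abs(w - 2.0)

small_lr_error = run_sgd(lr=0.1)   # each step shrinks the error
large_lr_error = run_sgd(lr=2.0)   # steps overshoot and the error grows
```

Each update multiplies the weight error by $1 - 2\alpha x^2$. With `lr=0.1` this factor stays between 0.55 and 0.95, so the error decays. With `lr=2.0` its magnitude is usually greater than 1, so the iterates diverge.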

---

## When to Use SGD

SGD is preferred when:
- Dataset is very large  
- Fast updates are required  
- Online or real-time learning is needed  
- Approximate solutions are acceptable  

---

## SGD vs Batch Gradient Descent

| Method | Data Used per Update | Speed per Update | Stability |
|------|---------------------|-------|-----------|
| Batch GD | Entire dataset | Slow | Very High |
| SGD | Single sample | Very Fast | Low |
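
The contrast in the table can be sketched by running one epoch of each method on the same synthetic data. The data, learning rate, and `mse` helper are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.5, -0.5])
lr = 0.05

# Batch gradient descent: one update per epoch, gradient averaged over all samples.
w_batch = np.zeros(2)
w_batch -= lr * 2.0 * X.T @ (X @ w_batch - y) / len(X)
batch_updates = 1

# SGD: one update per sample, so len(X) updates per epoch.
w_sgd = np.zeros(2)
sgd_updates = 0
for i in rng.permutation(len(X)):
    w_sgd -= lr * 2.0 * (X[i] @ w_sgd - y[i]) * X[i]
    sgd_updates += 1

def mse(w):
    """Mean squared error of weights w on the full dataset."""
    return np.mean((X @ w - y) ** 2)
```

After a single pass over the data, SGD has made 100 updates against one for batch gradient descent, and on this toy problem `mse(w_sgd)` ends up below `mse(w_batch)`.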

---

## Summary

Stochastic Gradient Descent updates model parameters using one training sample at a time.
While it introduces noise into the optimization process, its speed and scalability make it
a powerful choice for large-scale machine learning and deep learning applications.


In [1]:
%%capture
!pip install numpy
!pip install pandas
!pip install matplotlib
!pip install seaborn
!pip install scikit-learn


In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [3]:
from sklearn.datasets import load_diabetes
X,y = load_diabetes(return_X_y=True)

In [4]:
print(X.shape)
print(y.shape)

(442, 10)
(442,)


In [6]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=2)

### Creating Our Own Stochastic Gradient Descent Class

In [7]:
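class SGDRegressor:
    """Linear regression trained with single-sample stochastic gradient descent."""
    
    def __init__(self,learning_rate=0.01,epochs=100):
        
        self.coef_ = None
        self.intercept_ = None
        self.lr = learning_rate
        self.epochs = epochs
        
    def fit(self,X_train,y_train):
        # Initialise the intercept to 0 and every coefficient to 1
        self.intercept_ = 0
        self.coef_ = np.ones(X_train.shape[1])
        
        for i in range(self.epochs):
            # One epoch = as many updates as there are training rows
            for j in range(X_train.shape[0]):
                # Draw one row at random (with replacement, so within an epoch
                # some rows may repeat and others may be skipped)
                idx = np.random.randint(0,X_train.shape[0])
                
                # Prediction for the selected sample
                y_hat = np.dot(X_train[idx],self.coef_) + self.intercept_
                
                # Gradient of (y - y_hat)^2 with respect to the intercept
                intercept_der = -2 * (y_train[idx] - y_hat)
                self.intercept_ = self.intercept_ - (self.lr * intercept_der)
                
                # Gradient of (y - y_hat)^2 with respect to the coefficients
                coef_der = -2 * np.dot((y_train[idx] - y_hat),X_train[idx])
                self.coef_ = self.coef_ - (self.lr * coef_der)
        
        # Report the learned parameters
        print(self.intercept_,self.coef_)
    
    def predict(self,X_test):
        # Linear prediction: X . coef_ + intercept_
        return np.dot(X_test,self.coef_) + self.intercept_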
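
The class above draws `idx` with replacement, so an epoch does not necessarily touch every row. A possible variant, sketched below under the hypothetical name `ShuffledSGDRegressor`, shuffles the row order each epoch so that every sample is used exactly once. It is demonstrated on synthetic data rather than the diabetes dataset, and no claim is made about how it would change the scores reported in this notebook.

```python
import numpy as np

class ShuffledSGDRegressor:
    """Hypothetical variant: visits every training row exactly once per epoch."""

    def __init__(self, learning_rate=0.01, epochs=100, random_state=None):
        self.lr = learning_rate
        self.epochs = epochs
        self.rng = np.random.default_rng(random_state)
        self.coef_ = None
        self.intercept_ = None

    def fit(self, X_train, y_train):
        self.intercept_ = 0.0
        self.coef_ = np.ones(X_train.shape[1])
        for _ in range(self.epochs):
            # A fresh permutation guarantees each row is used once per epoch.
            for idx in self.rng.permutation(X_train.shape[0]):
                error = np.dot(X_train[idx], self.coef_) + self.intercept_ - y_train[idx]
                self.intercept_ -= self.lr * 2.0 * error
                self.coef_ -= self.lr * 2.0 * error * X_train[idx]
        return self

    def predict(self, X_test):
        return np.dot(X_test, self.coef_) + self.intercept_

# Synthetic, noiseless demo data (illustrative values only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(80, 3))
y_demo = X_demo @ np.array([2.0, 0.0, -1.0]) + 4.0

model = ShuffledSGDRegressor(learning_rate=0.02, epochs=30, random_state=0)
model.fit(X_demo, y_demo)
```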

In [8]:
sgd = SGDRegressor(learning_rate=0.01,epochs=40)

In [10]:
import time
start = time.time()
sgd.fit(X_train,y_train)
print("The time taken is",time.time() - start)

161.1196269105991 [  61.89624205  -56.69764592  312.16578868  230.16020066   33.71997831
  -11.30192306 -153.20151766  125.68041968  297.42476089  125.8321059 ]
The time taken is 0.1340339183807373


In [11]:
y_pred = sgd.predict(X_test)

In [12]:
from sklearn.metrics import r2_score
r2_score(y_test,y_pred)

0.4078961695456099
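
For reference, the $R^2$ score used here is $1 - SS_{res} / SS_{tot}$. The sketch below recomputes it by hand; `r2_manual` is a hypothetical helper name and the demo arrays are made up.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
perfect = r2_manual(y_true_demo, y_true_demo)                       # exact predictions
baseline = r2_manual(y_true_demo, np.full(4, y_true_demo.mean()))   # always predict the mean
```

Exact predictions give a score of 1.0 and always predicting the mean gives 0.0, so a score of about 0.41 means the model explains roughly 41% of the variance in the test targets.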

### Using Scikit-Learn's Built-in SGD Class

In [13]:
# Note: this import shadows the custom SGDRegressor class defined earlier
from sklearn.linear_model import SGDRegressor

In [14]:
reg = SGDRegressor(max_iter=100,learning_rate='constant',eta0=0.01)
reg.fit(X_train,y_train)




| Parameter | Value |
|-----------|-------|
| loss | 'squared_error' |
| penalty | 'l2' |
| alpha | 0.0001 |
| l1_ratio | 0.15 |
| fit_intercept | True |
| max_iter | 100 |
| tol | 0.001 |
| shuffle | True |
| verbose | 0 |
| epsilon | 0.1 |


In [15]:
y_pred = reg.predict(X_test)

In [16]:
r2_score(y_test,y_pred)

0.43166684528927035