## Problems of Batch and Stochastic Gradient Descent

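- **Batch Gradient Descent**: Computes the gradient using the entire dataset for every parameter update.
    - **Problems**:
        - Slow for large datasets, since each update requires a full pass over the data.
        - High memory usage, since the whole dataset is needed for each gradient computation.
        - Can get stuck in local minima on non-convex losses, since its deterministic updates contain no noise that could help escape them.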

- **Stochastic Gradient Descent (SGD)**: Updates parameters using one data point at a time.
    - **Problems**:
        - High variance in updates, leading to noisy convergence.
        - With a constant learning rate, may keep oscillating around the minimum instead of settling at it.
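
The contrast between the two methods can be made concrete with a small sketch. It assumes a hypothetical toy regression problem and a squared-error loss (neither comes from the notebook's data) and compares the exact full-batch gradient with the gradients obtained from single examples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy regression data: 200 examples, 3 features.
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=200)

w = np.zeros(3)  # current parameter estimate


def per_example_grads(X, y, w):
    """Gradient of the squared error (y_i - x_i . w)^2 for every example, one row each."""
    residual = y - X @ w
    return -2.0 * residual[:, None] * X


grads = per_example_grads(X, y, w)

# Batch gradient descent uses the mean over all examples: one exact gradient per pass.
full_grad = grads.mean(axis=0)

# SGD uses a single row at a time: cheap, but each row can point far from full_grad.
spread = np.linalg.norm(grads - full_grad, axis=1)

print("full-batch gradient:", full_grad)
print("mean distance of a single-example gradient from it:", spread.mean())
```

The single-example gradients average out to the full-batch gradient, but any individual one can differ from it substantially, which is the source of SGD's noisy updates.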

---

## Description of Mini-Batch Gradient Descent

Mini-batch gradient descent is a compromise between batch and stochastic gradient descent. It splits the dataset into small batches and updates the model parameters once per batch. This balances the stable gradient estimates of batch gradient descent with the frequent, inexpensive updates of SGD.
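
One common way to form the batches is to shuffle the rows once per epoch and slice them into consecutive chunks. The helper below, `iterate_minibatches`, is a hypothetical name used only for illustration. It is a minimal sketch of that scheme, not the sampling strategy used by the `MBGDRegressor` class later in this notebook.

```python
import numpy as np


def iterate_minibatches(X, y, batch_size, rng):
    """Yield (X_batch, y_batch) pairs covering a shuffled copy of the data once."""
    n = X.shape[0]
    order = rng.permutation(n)
    for start in range(0, n, batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], y[idx]


rng = np.random.default_rng(0)
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10, dtype=float)

batch_sizes = [len(yb) for _, yb in iterate_minibatches(X, y, batch_size=4, rng=rng)]
print(batch_sizes)  # 10 examples in batches of 4 -> [4, 4, 2]
```

With this scheme every row is used exactly once per epoch, and the last batch is simply smaller when the dataset size is not a multiple of the batch size.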

---

## Mathematical Formulation

Given a per-example loss $J(\theta; x^{(i)}, y^{(i)})$ and a mini-batch of size $m$, the parameter update rule is:

$$
\theta := \theta - \eta \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta J(\theta; x^{(i)}, y^{(i)})
$$

where:
- $\theta$ = model parameters  
- $\eta$ = learning rate  
- $m$ = mini-batch size  
- $(x^{(i)}, y^{(i)})$ = $i$-th training example in the mini-batch
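
As an illustration, the update rule can be written out for a squared-error loss $J = (y^{(i)} - \theta^\top x^{(i)})^2$. The function name `minibatch_step`, the toy data, and the omission of an intercept term are all assumptions made to keep the sketch short.

```python
import numpy as np


def minibatch_step(theta, X_batch, y_batch, eta):
    """Apply one update: theta <- theta - eta * (1/m) * sum of per-example gradients."""
    m = X_batch.shape[0]
    residual = y_batch - X_batch @ theta
    # For squared error, the per-example gradient is -2 * residual_i * x_i.
    grad = (-2.0 / m) * (X_batch.T @ residual)
    return theta - eta * grad


rng = np.random.default_rng(0)
X_batch = rng.normal(size=(8, 2))          # hypothetical mini-batch, m = 8
y_batch = X_batch @ np.array([1.5, -0.5])
theta = np.zeros(2)

loss_before = np.mean((y_batch - X_batch @ theta) ** 2)
theta = minibatch_step(theta, X_batch, y_batch, eta=0.05)
loss_after = np.mean((y_batch - X_batch @ theta) ** 2)
print(loss_before, loss_after)
```

With a sufficiently small learning rate, a single step lowers the loss on the batch it was computed from.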

---

## Advantages of Mini-Batch Gradient Descent

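- Typically makes faster progress per epoch than batch gradient descent, because the parameters are updated many times per pass over the data.
- More stable updates than SGD, because averaging over a batch reduces the variance of the gradient estimate.
- Efficient use of vectorized operations and hardware acceleration (e.g., GPUs).
- Reduces memory requirements compared to batch gradient descent, since only one batch has to be processed at a time.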
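
The point about vectorization can be sketched as follows. The example assumes a squared-error loss and random toy data, and checks that a per-example loop and a single matrix product give the same mini-batch gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
X_batch = rng.normal(size=(32, 5))
y_batch = rng.normal(size=32)
theta = rng.normal(size=5)

# Loop version: accumulate one per-example gradient at a time.
grad_loop = np.zeros(5)
for x_i, y_i in zip(X_batch, y_batch):
    grad_loop += -2.0 * (y_i - x_i @ theta) * x_i
grad_loop /= len(y_batch)

# Vectorized version: one matrix-vector product over the whole mini-batch.
grad_vec = -2.0 * X_batch.T @ (y_batch - X_batch @ theta) / len(y_batch)

print(np.allclose(grad_loop, grad_vec))  # True
```

The vectorized form is the one that benefits from optimized linear algebra routines and GPUs, which is why a mini-batch is usually cheaper per example than processing examples one by one.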

---

## Problems Associated with Mini-Batch Gradient Descent

- The choice of mini-batch size affects both computational performance and convergence behavior.
- May still get stuck in local minima or at saddle points.
- Adds the batch size as a hyperparameter to tune alongside the learning rate.
- Updates are still noisy, though less so than with SGD.
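
How the batch size trades off against gradient noise can be estimated empirically. The sketch below uses hypothetical synthetic data and a squared-error loss, and measures how far mini-batch gradients fall from the full-batch gradient for several batch sizes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 4
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(size=n)
theta = np.zeros(d)


def batch_grad(idx):
    """Mean squared-error gradient over the rows selected by idx."""
    residual = y[idx] - X[idx] @ theta
    return -2.0 * X[idx].T @ residual / len(idx)


full_grad = batch_grad(np.arange(n))


def gradient_noise(batch_size, trials=300):
    """Average distance between a mini-batch gradient and the full gradient."""
    errors = []
    for _ in range(trials):
        idx = rng.choice(n, size=batch_size, replace=False)
        errors.append(np.linalg.norm(batch_grad(idx) - full_grad))
    return float(np.mean(errors))


noise = {m: gradient_noise(m) for m in (1, 8, 64, 512)}
for m, value in noise.items():
    print(f"batch_size={m:4d}  mean gradient error={value:.3f}")
```

Larger batches give gradients closer to the full-batch gradient, at the cost of more computation per update and fewer updates per epoch.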

In [13]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_diabetes
import matplotlib.pyplot as plt
import plotly.express as px
from sklearn.model_selection import train_test_split
import time
import random
from sklearn.metrics import r2_score

In [14]:
X, y = load_diabetes(return_X_y=True)

In [15]:
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2)

In [16]:
reg = LinearRegression()
reg.fit(x_train,y_train)

In [17]:
x_train.shape

(309, 10)

In [18]:
reg.intercept_,reg.coef_

(np.float64(152.6275795791894),
 array([ -31.80406415, -214.66452697,  445.14791885,  376.34786891,
        -802.50380919,  525.99207358,  106.4287134 ,  172.3171033 ,
         799.50335529,   57.1646797 ]))

In [19]:
y_pred1=reg.predict(x_test)

In [20]:
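class MBGDRegressor:
    """Linear regression fitted with mini-batch gradient descent."""

    def __init__(self, batch_size, learning_rate=0.01, epochs=40):
        self.coef_ = None
        self.intercept_ = None
        self.lr = learning_rate
        self.epochs = epochs
        self.batch_size = batch_size

    def fit(self, X_train, y_train):
        # Initialise the intercept at 0 and every coefficient at 1.
        self.intercept_ = 0
        self.coef_ = np.ones(X_train.shape[1])

        for i in range(self.epochs):
            # Number of updates per epoch; rows beyond the last full batch add no extra update.
            for j in range(X_train.shape[0] // self.batch_size):
                # Draw batch_size distinct row indices at random. Batches are drawn
                # independently, so a row can appear in several batches of one epoch.
                idx = random.sample(range(X_train.shape[0]), self.batch_size)

                y_hat = np.dot(X_train[idx], self.coef_) + self.intercept_

                # Gradient of the mean squared error with respect to the intercept.
                intercept_der = -2 * np.mean(y_train[idx] - y_hat)
                self.intercept_ = self.intercept_ - (self.lr * intercept_der)

                # Gradient with respect to the coefficients. This is a sum over the
                # batch, not a mean, so the coefficient step is batch_size times
                # larger than the 1/m update rule gives for the same learning rate.
                coef_der = -2 * np.dot((y_train[idx] - y_hat), X_train[idx])
                self.coef_ = self.coef_ - (self.lr * coef_der)

        print(self.intercept_, self.coef_)

    def predict(self, X_test):
        return np.dot(X_test, self.coef_) + self.intercept_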
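
For comparison, the class above sums the coefficient gradient over the batch and draws each batch independently. A variant that follows the $\frac{1}{m}$ update rule literally might look like the sketch below. The function `fit_minibatch_linear`, its defaults, and the synthetic data are assumptions for illustration, not the notebook's method.

```python
import numpy as np


def fit_minibatch_linear(X, y, batch_size, learning_rate=0.1, epochs=200, seed=0):
    """Mini-batch gradient descent for linear regression with a mean gradient.

    Rows are reshuffled every epoch and split into consecutive batches,
    and both gradients are averaged over the batch.
    """
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    intercept = 0.0
    coef = np.zeros(n_features)

    for _ in range(epochs):
        order = rng.permutation(n_samples)
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]
            residual = y[idx] - (X[idx] @ coef + intercept)
            # Both gradients are means over the batch (the 1/m factor).
            intercept -= learning_rate * (-2.0 * residual.mean())
            coef -= learning_rate * (-2.0 * X[idx].T @ residual / len(idx))
    return intercept, coef


# Hypothetical noiseless data so the recovered parameters can be checked.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 3))
y_demo = 4.0 + X_demo @ np.array([2.0, -3.0, 1.0])

intercept, coef = fit_minibatch_linear(X_demo, y_demo, batch_size=30)
print(intercept, coef)
```

Because the gradient is averaged, the coefficient step here is smaller by a factor of the batch size than in the class above for the same learning rate, so the learning rate and number of epochs would likely need retuning before applying this variant to the diabetes data.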

In [21]:
mbgdr = MBGDRegressor(batch_size=30, learning_rate=0.1)
mbgdr.fit(x_train,y_train)


154.0854833328983 [ -39.61453485 -221.40747944  465.74679809  360.1205132   -62.48912994
  -99.36783642 -218.30534648  116.87496593  485.32270423   58.36973404]


In [22]:
y_pred2=mbgdr.predict(x_test)

In [23]:
print("For OLS:", r2_score(y_test, y_pred1))
print("For MBGD:", r2_score(y_test, y_pred2))

For OLS: 0.5012577617513012
For MBGD: 0.5042660262010064
