# Mini-Batch Gradient Descent

Mini-Batch Gradient Descent is a variant of gradient descent that updates model parameters using
**small batches of training samples** instead of the entire dataset or a single sample.

It offers a **practical trade-off between speed and stability** and is the most widely used optimization
approach in modern machine learning and deep learning.

---

## How Mini-Batch Gradient Descent Works

In each epoch:
1. Shuffle the training dataset  
2. Split the data into small batches (mini-batches)  
3. Compute the gradient on one mini-batch  
4. Update the model parameters  
5. Repeat steps 3 and 4 for every remaining mini-batch; one full pass over the data is one epoch
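
The shuffle-and-split part of this loop can be sketched as a small generator. This is a minimal illustration, assuming the data is held in NumPy arrays; the name `iterate_minibatches` and its `rng` parameter are hypothetical and not part of any library.

```python
import numpy as np

def iterate_minibatches(X, y, batch_size, rng):
    """Yield (X_batch, y_batch) pairs that cover the data once, in shuffled order."""
    n_samples = X.shape[0]
    order = rng.permutation(n_samples)            # step 1: shuffle the row indices
    for start in range(0, n_samples, batch_size):
        idx = order[start:start + batch_size]     # step 2: take the next mini-batch
        yield X[idx], y[idx]

rng = np.random.default_rng(0)
X_demo = np.arange(20, dtype=float).reshape(10, 2)
y_demo = np.arange(10, dtype=float)

# Steps 3 and 4 (gradient and update) would go inside this loop.
batch_sizes = [len(y_batch) for _, y_batch in iterate_minibatches(X_demo, y_demo, 4, rng)]
print(batch_sizes)
```

With 10 samples and a batch size of 4, the last batch holds only the 2 leftover samples. Dropping or keeping that remainder is a design choice; this sketch keeps it.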

---

## Update Rule

For a mini-batch of size $b$:

$$
\theta := \theta - \alpha \frac{1}{b}
\sum_{i=1}^{b} \nabla_\theta L(\hat{y}^{(i)}, y^{(i)})
$$

Where:
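- $\theta$ = model parameters  
- $\alpha$ = learning rate  
- $b$ = mini-batch size  
- $L(\hat{y}^{(i)}, y^{(i)})$ = loss on the $i$-th sample of the mini-batch, where $\hat{y}^{(i)}$ is the prediction and $y^{(i)}$ the target  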
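
As a worked case, the rule can be specialized to linear regression with squared-error loss. That choice of model and loss is an assumption made here for illustration, as is the notation $\theta = (\beta_0, \beta)$ with $\hat{y}^{(i)} = \beta_0 + \beta^\top x^{(i)}$ and $L(\hat{y}^{(i)}, y^{(i)}) = (y^{(i)} - \hat{y}^{(i)})^2$. The averaged mini-batch gradients are then:

$$
\frac{1}{b} \sum_{i=1}^{b} \frac{\partial}{\partial \beta_0} L(\hat{y}^{(i)}, y^{(i)})
= -\frac{2}{b} \sum_{i=1}^{b} \left( y^{(i)} - \hat{y}^{(i)} \right)
$$

$$
\frac{1}{b} \sum_{i=1}^{b} \nabla_{\beta} L(\hat{y}^{(i)}, y^{(i)})
= -\frac{2}{b} \sum_{i=1}^{b} \left( y^{(i)} - \hat{y}^{(i)} \right) x^{(i)}
$$

Substituting these into the update rule gives $\beta_0 := \beta_0 - \alpha \cdot (\text{first expression})$ and $\beta := \beta - \alpha \cdot (\text{second expression})$.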

---

## Choosing the Batch Size

Common mini-batch sizes:
- 16, 32, 64, 128  

Batch size affects:
- Training speed  
- Memory usage  
- Convergence stability  

Smaller batches → noisier gradient estimates, but more updates per epoch  
Larger batches → smoother gradient estimates, but fewer updates per epoch and more memory per update  
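
The effect of batch size on gradient noise can be checked empirically. The sketch below is illustrative only: it uses synthetic linear-regression data, and the helper names `mse_gradient` and `gradient_noise` are made up for this example. It measures how far mini-batch gradients fall, on average, from the full-dataset gradient.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear-regression data (assumed for illustration).
n_samples, n_features = 2000, 5
X = rng.normal(size=(n_samples, n_features))
true_w = rng.normal(size=n_features)
y = X @ true_w + rng.normal(scale=0.5, size=n_samples)

w = np.zeros(n_features)  # fixed point at which all gradients are evaluated

def mse_gradient(X_batch, y_batch, w):
    """Gradient of the mean squared error with respect to w."""
    residual = y_batch - X_batch @ w
    return -2.0 * X_batch.T @ residual / len(y_batch)

full_gradient = mse_gradient(X, y, w)

def gradient_noise(batch_size, n_trials=300):
    """Mean distance between random mini-batch gradients and the full gradient."""
    distances = []
    for _ in range(n_trials):
        idx = rng.choice(n_samples, size=batch_size, replace=False)
        g = mse_gradient(X[idx], y[idx], w)
        distances.append(np.linalg.norm(g - full_gradient))
    return float(np.mean(distances))

noise = {b: gradient_noise(b) for b in (16, 32, 64, 128)}
for b, value in noise.items():
    print(f"batch_size={b:>3}  mean gradient error={value:.3f}")
```

The printed error should shrink as the batch size grows, which is the "smoother updates" behaviour described above.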

---

## Characteristics

- Multiple updates per epoch  
- Balanced gradient estimation  
- Efficient use of hardware (CPU/GPU)  
- Smooth but fast convergence  

---

## Advantages

- Faster than Batch Gradient Descent  
- More stable than SGD  
- Scales well to large datasets  
- Suitable for parallel computation  

---

## Limitations

- Requires tuning batch size  
- Still sensitive to learning rate  
- Can get stuck in local minima for non-convex problems  

---

## When to Use Mini-Batch Gradient Descent

Mini-Batch Gradient Descent is ideal when:
- Dataset is large  
- GPU or parallel hardware is available  
- Stability and speed are both important  
- Training deep learning models  

---

## Comparison with Other Variants

| Method | Data Used per Update | Speed per Update | Stability |
|------|---------------------|-------|-----------|
| Batch GD | Entire dataset | Slow | Very High |
| SGD | Single sample | Very Fast | Low |
| Mini-Batch GD | Small batch | Fast | Medium–High |
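
The three variants differ only in batch size, which also fixes how many parameter updates happen in one epoch. A small sketch, where the dataset size of 353 and the helper name `updates_per_epoch` are assumptions chosen for illustration:

```python
import math

def updates_per_epoch(n_samples, batch_size):
    """Number of parameter updates in one full pass over the data."""
    return math.ceil(n_samples / batch_size)

n_samples = 353  # illustrative dataset size

counts = {
    "Batch GD": updates_per_epoch(n_samples, n_samples),       # whole dataset per update
    "SGD": updates_per_epoch(n_samples, 1),                    # one sample per update
    "Mini-Batch GD (b=32)": updates_per_epoch(n_samples, 32),  # small batch per update
}

for method, n_updates in counts.items():
    print(f"{method}: {n_updates} updates per epoch")
```

Batch GD makes a single expensive update per epoch, SGD makes one cheap update per sample, and mini-batch GD sits in between.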

---

## Summary

Mini-Batch Gradient Descent combines the strengths of Batch Gradient Descent and Stochastic Gradient Descent.
By using small batches of data, it achieves fast training, stable convergence, and efficient hardware utilization,
making it the default choice for most machine learning and deep learning models.


In [1]:
%%capture
!pip install numpy pandas matplotlib seaborn scikit-learn


In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [3]:
from sklearn.datasets import load_diabetes
X,y = load_diabetes(return_X_y=True)

In [4]:
print(X.shape)
print(y.shape)

(442, 10)
(442,)


In [5]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=2)

### Creating Our Own Mini-Batch Gradient Descent Class

In [6]:
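import random

class MBGDRegressor:
    """Linear regression trained with mini-batch gradient descent on squared error."""

    def __init__(self, batch_size, learning_rate=0.01, epochs=100):
        self.coef_ = None
        self.intercept_ = None
        self.lr = learning_rate
        self.epochs = epochs
        self.batch_size = batch_size

    def fit(self, X_train, y_train):
        # Initialise the intercept to 0 and every coefficient to 1
        self.intercept_ = 0
        self.coef_ = np.ones(X_train.shape[1])

        # Number of updates per epoch (any remainder rows are not counted)
        n_batches = X_train.shape[0] // self.batch_size

        for i in range(self.epochs):
            for j in range(n_batches):
                # Draw one mini-batch without replacement. Each batch is sampled
                # independently, so within an epoch some rows can appear in
                # several batches and others in none.
                idx = random.sample(range(X_train.shape[0]), self.batch_size)

                y_hat = np.dot(X_train[idx], self.coef_) + self.intercept_

                # Intercept gradient: averaged over the mini-batch
                intercept_der = -2 * np.mean(y_train[idx] - y_hat)
                self.intercept_ = self.intercept_ - (self.lr * intercept_der)

                # Coefficient gradient: summed over the mini-batch (not divided
                # by batch_size), so the effective coefficient step is
                # batch_size times larger than in the averaged update rule
                coef_der = -2 * np.dot((y_train[idx] - y_hat), X_train[idx])
                self.coef_ = self.coef_ - (self.lr * coef_der)

        print(self.intercept_, self.coef_)

    def predict(self, X_test):
        return np.dot(X_test, self.coef_) + self.intercept_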
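
The class above draws every mini-batch independently and sums the coefficient gradient over the batch. For comparison, here is a hedged sketch of a variant that follows the steps and update rule from earlier more literally: one shuffle per epoch, non-overlapping batches, and gradients averaged over the batch. The function name `fit_minibatch_gd`, its default hyperparameters, and the synthetic data are assumptions for illustration, not the method used in this notebook.

```python
import numpy as np

def fit_minibatch_gd(X, y, batch_size=32, learning_rate=0.05, epochs=200, seed=0):
    """Fit linear regression with mini-batch gradient descent on squared error."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    intercept = 0.0
    coef = np.zeros(n_features)

    for _ in range(epochs):
        order = rng.permutation(n_samples)             # shuffle once per epoch
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]      # non-overlapping mini-batch
            residual = y[idx] - (X[idx] @ coef + intercept)
            # Gradients averaged over the batch (the 1/b factor in the update rule).
            intercept_grad = -2.0 * residual.mean()
            coef_grad = -2.0 * X[idx].T @ residual / len(idx)
            intercept -= learning_rate * intercept_grad
            coef -= learning_rate * coef_grad
    return intercept, coef

# Synthetic check: noise-free data generated from known parameters.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(500, 3))
true_coef = np.array([2.0, -1.0, 0.5])
y_demo = X_demo @ true_coef + 4.0

intercept, coef = fit_minibatch_gd(X_demo, y_demo)
print(intercept, coef)
```

Because the gradient is averaged, the step size no longer scales with the batch size, so a learning rate tuned for one batch size tends to carry over more predictably to another.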

In [7]:
mbr = MBGDRegressor(batch_size=int(X_train.shape[0]/50),learning_rate=0.01,epochs=100)

In [8]:
mbr.fit(X_train,y_train)

152.73914716353028 [  25.48513263 -132.09677548  454.7927651   304.54619762  -17.54755095
  -87.02674564 -186.48037159  112.92870132  411.52692739  111.79906255]


In [9]:
y_pred = mbr.predict(X_test)

In [11]:
from sklearn.metrics import r2_score
r2_score(y_test,y_pred)

0.4515814462813902

### Using Scikit-Learn's Built-in SGDRegressor Class

In [12]:
from sklearn.linear_model import SGDRegressor

In [13]:
sgd = SGDRegressor(learning_rate='constant',eta0=0.1)

In [14]:
batch_size = 35

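for i in range(100):
    # Draw a random mini-batch and run one pass of SGD over it.
    # Note: partial_fit updates the weights sample by sample within the
    # batch rather than applying a single averaged mini-batch update.
    idx = random.sample(range(X_train.shape[0]),batch_size)
    sgd.partial_fit(X_train[idx],y_train[idx])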

In [15]:
sgd.coef_

array([  58.38755699,  -70.86857683,  356.55716034,  259.91093245,
         11.81546429,  -33.17869112, -179.77522828,  134.87474357,
        324.52423427,  134.09083859])

In [16]:
sgd.intercept_

array([150.43560229])

In [17]:
y_pred = sgd.predict(X_test)

In [18]:
r2_score(y_test,y_pred)

0.4351399142157185