# Batch Gradient Descent

**Batch Gradient Descent** is a variant of gradient descent where the **entire training dataset**
is used to compute the gradient of the loss function **at each update step**.

It provides stable and smooth convergence but can be computationally expensive for large datasets.

---

## How Batch Gradient Descent Works

At each iteration:
1. Use **all training samples** to compute the gradient  
2. Update model parameters once per iteration  
3. Repeat until the cost function converges  

Because each iteration is a full pass over the training data, one iteration equals one epoch, so only **one update** is performed per epoch.
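
The three steps above can be sketched as a short NumPy loop. This is a minimal sketch on hypothetical toy data (`y = 3x + 2`, no noise); the variable names, learning rate, and epoch count are illustrative assumptions rather than values taken from this notebook.

```python
import numpy as np

# Hypothetical toy data: y = 3x + 2 with no noise (illustrative only).
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 3.0 * X[:, 0] + 2.0

w = np.zeros(X.shape[1])  # weights
b = 0.0                   # intercept
lr = 0.1                  # learning rate (assumed value)
n_epochs = 2000
n_updates = 0

for epoch in range(n_epochs):
    y_hat = X @ w + b                      # step 1: predictions for ALL samples
    error = y_hat - y
    grad_w = 2.0 * X.T @ error / len(y)    # MSE gradient w.r.t. the weights
    grad_b = 2.0 * error.mean()            # MSE gradient w.r.t. the intercept
    w -= lr * grad_w                       # step 2: exactly one update per pass
    b -= lr * grad_b
    n_updates += 1                         # step 3: repeat (fixed epoch budget here)

print(w, b, n_updates)
```

The loop uses a fixed number of epochs instead of a convergence test to keep the sketch short; `n_updates` ends up equal to `n_epochs`, which is the "one update per epoch" property.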

---

## Update Rule

For a cost function $J(\theta)$ defined as the average of the per-sample loss $L$ over the training set, the update rule is:

$$
\theta := \theta - \alpha \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta L(\hat{y}^{(i)}, y^{(i)})
$$

Where:
- $m$ = number of training samples  
- $\alpha$ = learning rate  
- $\nabla_\theta L$ = gradient of the per-sample loss with respect to the parameters $\theta$  
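
As a worked instance of the update rule, the sketch below performs a single step by hand. It assumes a one-parameter model $\hat{y} = \theta x$ with a squared per-sample loss; the helper name `batch_gd_step` and the three data points are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import numpy as np

def batch_gd_step(theta, alpha, per_sample_grads):
    """One batch update: theta minus alpha times the mean per-sample gradient."""
    return theta - alpha * np.mean(per_sample_grads, axis=0)

# Hypothetical data for the model y_hat = theta * x (illustrative only).
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
theta = 0.0
alpha = 0.1

# Gradient of the squared loss (theta * x - y)**2 w.r.t. theta, one value per sample.
per_sample_grads = 2.0 * (theta * x - y) * x   # [-4, -16, -36]

theta_new = batch_gd_step(theta, alpha, per_sample_grads)
print(theta_new)  # 0 - 0.1 * (-56 / 3), roughly 1.8667
```

All three per-sample gradients are averaged before the single parameter is touched, which is what distinguishes this rule from the per-sample updates of SGD.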

---

## Example: Linear Regression

For Mean Squared Error (MSE):

$$
J(\theta) = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}^{(i)} - y^{(i)})^2
$$

The gradient is computed using **all samples** before updating parameters.
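
For a linear model $\hat{y}^{(i)} = \theta^\top x^{(i)} + b$ (the intercept $b$ is notation assumed here), differentiating the MSE gives $\nabla_\theta J = \frac{2}{m} \sum_{i=1}^{m} (\hat{y}^{(i)} - y^{(i)})\, x^{(i)}$ and $\frac{\partial J}{\partial b} = \frac{2}{m} \sum_{i=1}^{m} (\hat{y}^{(i)} - y^{(i)})$. The sketch below checks these formulas against central finite differences on synthetic data; the function names are hypothetical.

```python
import numpy as np

def mse(w, b, X, y):
    """Mean squared error of the linear model X @ w + b."""
    return np.mean((X @ w + b - y) ** 2)

def mse_gradients(w, b, X, y):
    """Analytic MSE gradients, computed from all samples at once."""
    error = X @ w + b - y
    grad_w = 2.0 * X.T @ error / len(y)
    grad_b = 2.0 * error.mean()
    return grad_w, grad_b

rng = np.random.default_rng(0)   # synthetic data, illustrative only
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
w = rng.normal(size=3)
b = 0.5

grad_w, grad_b = mse_gradients(w, b, X, y)

# Central finite differences as an independent check of the formulas.
eps = 1e-6
num_grad_w = np.array([
    (mse(w + eps * e, b, X, y) - mse(w - eps * e, b, X, y)) / (2 * eps)
    for e in np.eye(3)
])
num_grad_b = (mse(w, b + eps, X, y) - mse(w, b - eps, X, y)) / (2 * eps)

print(np.allclose(grad_w, num_grad_w, atol=1e-6))
print(np.isclose(grad_b, num_grad_b, atol=1e-6))
```

A check like this is a cheap way to catch sign or scaling mistakes before the gradients are wired into a training loop.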

---

## Characteristics

- Deterministic updates  
- Smooth and stable convergence  
- Exact gradient of the training loss at each step (no sampling noise)  
- One update per epoch  

---

## Advantages

- Converges to the global minimum for convex, smooth loss functions, provided the learning rate is small enough  
- Stable and predictable updates  
- Easier to debug and analyze  
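
The convergence guarantee depends on the learning rate. The sketch below is a minimal illustration on the convex function $J(\theta) = \theta^2$ (an assumed stand-in for a real loss, with a hypothetical helper `run_gd`): a small step size shrinks $\theta$ toward the minimum at 0, while a step size that is too large makes every update overshoot and diverge.

```python
def run_gd(alpha, steps=50, theta0=1.0):
    """Gradient descent on the convex function J(theta) = theta**2."""
    theta = theta0
    for _ in range(steps):
        theta -= alpha * 2.0 * theta   # the gradient of theta**2 is 2 * theta
    return theta

small = run_gd(alpha=0.1)   # each step multiplies theta by 0.8
large = run_gd(alpha=1.1)   # each step multiplies theta by -1.2

print(abs(small))  # close to 0: converging
print(abs(large))  # large: diverging
```

For this function any learning rate below 1 converges, because the update multiplies $\theta$ by $1 - 2\alpha$, whose magnitude must stay under 1.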

---

## Limitations

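- Slow for large datasets, since every single update requires a full pass over the data  
- High memory usage when the whole dataset is loaded at once to compute the gradient  
- Not suitable for online learning, where samples arrive one at a time  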

---

## When to Use Batch Gradient Descent

Batch Gradient Descent is suitable when:
- Dataset size is small to medium  
- High precision in gradient computation is required  
- The loss function is convex  
- Training time is not a major concern  

---

## Batch Gradient Descent vs Other Variants

| Method | Data Used per Update | Speed per Update | Stability |
|------|---------------------|-------|-----------|
| Batch GD | Entire dataset | Slow | Very High |
| SGD | Single sample | Fast | Low |
| Mini-Batch GD | Small batch | Balanced | Medium |
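
The "Data Used per Update" column determines how many parameter updates happen in one pass over the data. The sketch below makes that concrete with a hypothetical helper, `updates_per_epoch`; the mini-batch size of 32 is an assumed, commonly used value, and 353 is the size of the training split created later in this notebook.

```python
import math

def updates_per_epoch(n_samples, batch_size):
    """Number of parameter updates in one pass over the data."""
    return math.ceil(n_samples / batch_size)

n_samples = 353

print(updates_per_epoch(n_samples, batch_size=n_samples))  # batch GD: 1
print(updates_per_epoch(n_samples, batch_size=1))          # SGD: 353
print(updates_per_epoch(n_samples, batch_size=32))         # mini-batch GD: 12
```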

---

## Summary

Batch Gradient Descent computes parameter updates using the entire dataset at each iteration.
While it offers stable and accurate convergence, it is computationally expensive and less suitable
for large-scale machine learning problems compared to mini-batch methods.


In [1]:
%%capture
!pip install numpy pandas matplotlib scikit-learn


In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [3]:
from sklearn.datasets import load_diabetes

In [4]:
from sklearn.model_selection import train_test_split

In [5]:
X,y = load_diabetes(return_X_y=True)

In [6]:
print(X.shape)
print(y.shape)

(442, 10)
(442,)


In [7]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=2)

In [8]:
X_train.shape

(353, 10)

### Creating Our Own Batch Gradient Descent Class

In [9]:
class GDRegressor:
    """Linear regression trained with batch gradient descent on the MSE loss."""

    def __init__(self, learning_rate=0.01, epochs=100):
        self.coef_ = None
        self.intercept_ = None
        self.lr = learning_rate
        self.epochs = epochs

    def fit(self, X_train, y_train):
        # Initialise the intercept to 0 and every coefficient to 1
        self.intercept_ = 0
        self.coef_ = np.ones(X_train.shape[1])

        for _ in range(self.epochs):
            # Predictions for the whole training set: one full pass per update
            y_hat = np.dot(X_train, self.coef_) + self.intercept_

            # Gradient of the MSE with respect to the intercept, then the update
            intercept_der = -2 * np.mean(y_train - y_hat)
            self.intercept_ = self.intercept_ - (self.lr * intercept_der)

            # Gradient of the MSE with respect to the coefficients, then the update
            coef_der = -2 * np.dot((y_train - y_hat), X_train) / X_train.shape[0]
            self.coef_ = self.coef_ - (self.lr * coef_der)

        # Show the learned parameters once training has finished
        print(self.intercept_, self.coef_)

    def predict(self, X_test):
        return np.dot(X_test, self.coef_) + self.intercept_

In [10]:
gdr = GDRegressor(epochs=1000,learning_rate=0.5)

In [11]:
gdr.fit(X_train,y_train)

152.01351687661833 [  14.38990585 -173.7235727   491.54898524  323.91524824  -39.32648042
 -116.01061213 -194.04077415  103.38135565  451.63448787   97.57218278]


In [12]:
y_pred = gdr.predict(X_test)

In [13]:
from sklearn.metrics import r2_score
r2_score(y_test,y_pred)

0.4534503034722803