## Batch Gradient Descent

Batch Gradient Descent is an optimization algorithm used to minimize the cost function in machine learning and deep learning models. In this method, the entire training dataset is used to compute the gradient of the loss function with respect to the model parameters. The parameters are then updated in the direction that reduces the loss.

### Mathematical Description

Given a cost function \( J(\theta) \) over a dataset with \( m \) examples, the update rule for the parameters \( \theta \) is:

$$
\theta := \theta - \alpha \nabla_\theta J(\theta)
$$

where:
- \( \alpha \) is the learning rate,
- \( \nabla_\theta J(\theta) \) is the gradient of the cost function with respect to \( \theta \), computed over the entire dataset.
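
The update rule can be tried on a toy problem before applying it to a real model. The sketch below is only an illustration: the helper `gradient_descent`, the quadratic cost, and the values of `alpha` and `steps` are assumptions chosen for the example, not part of the original notebook.

```python
import numpy as np

def gradient_descent(grad, theta0, alpha=0.1, steps=100):
    """Repeatedly apply theta := theta - alpha * grad(theta)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        theta = theta - alpha * grad(theta)
    return theta

# Toy cost J(theta) = ||theta - target||^2, whose gradient is 2 * (theta - target).
target = np.array([3.0, -1.0])
theta_min = gradient_descent(lambda t: 2.0 * (t - target), theta0=[0.0, 0.0])
print(theta_min)  # close to [3, -1]
```

Each step moves `theta` against the gradient, so it approaches the minimizer `target` as long as `alpha` is small enough for this cost.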

For linear regression, the cost function is:

$$
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2
$$

and the update for each parameter \( \theta_j \) is:

$$
\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}
$$
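
As a minimal sketch, the per-parameter update above can be written as a single matrix expression that updates every \( \theta_j \) at once. The function names `batch_gd_step` and `cost`, the synthetic data, and the choice `alpha=0.5` are assumptions made for illustration.

```python
import numpy as np

def batch_gd_step(theta, X, y, alpha):
    """One batch update for linear regression; X must already contain a column of ones."""
    m = X.shape[0]
    residual = X @ theta - y          # h_theta(x^(i)) - y^(i) for every example
    gradient = (X.T @ residual) / m   # (1/m) * sum_i residual_i * x_j^(i), for all j
    return theta - alpha * gradient

def cost(theta, X, y):
    """J(theta) = 1/(2m) * sum of squared residuals."""
    return np.sum((X @ theta - y) ** 2) / (2 * X.shape[0])

# Noise-free synthetic data generated from y = 1 + 2x (illustrative only).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x

theta = np.zeros(2)
initial_cost = cost(theta, X, y)
for _ in range(500):
    theta = batch_gd_step(theta, X, y, alpha=0.5)
print(theta, cost(theta, X, y))
```

Every step uses all 50 examples, which is what makes this the batch variant of gradient descent.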

### Advantages

- **Stable Convergence:** Since the gradient is computed using the whole dataset, updates are more stable and less noisy.
- **Deterministic:** For a given dataset and initial parameters, the updates are deterministic and reproducible.
- **Efficient Vectorization:** Allows for efficient computation using matrix operations, leveraging optimized linear algebra libraries.
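
To illustrate the vectorization point, the sketch below computes the same full-dataset gradient twice, once with explicit Python loops and once with matrix operations, and checks that the two agree. The function names and the random data are assumptions for the example.

```python
import numpy as np

def gradient_loop(theta, X, y):
    """Full-dataset gradient computed one example and one feature at a time."""
    m, n = X.shape
    grad = np.zeros(n)
    for i in range(m):
        error = X[i] @ theta - y[i]
        for j in range(n):
            grad[j] += error * X[i, j]
    return grad / m

def gradient_vectorized(theta, X, y):
    """The same gradient as a single matrix expression."""
    return X.T @ (X @ theta - y) / X.shape[0]

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = rng.normal(size=200)
theta = rng.normal(size=5)

g_loop = gradient_loop(theta, X, y)
g_vec = gradient_vectorized(theta, X, y)
print(np.allclose(g_loop, g_vec))
```

Because no random sampling of examples is involved, calling either function again with the same inputs returns the same gradient, which is the determinism noted in the list above.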

### Disadvantages

- **Computationally Expensive:** Processing the entire dataset for each update can be slow and resource-intensive, especially for large datasets.
- **Memory Intensive:** A straightforward vectorized implementation loads the entire dataset into memory, which may not be feasible for very large datasets.
- **Slower Updates:** Model parameters are updated only once per full pass over the data, whereas stochastic and mini-batch gradient descent update many times per pass, so batch gradient descent can need more passes to reach a comparable loss.
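
The difference in update frequency can be made concrete by counting parameter updates per pass over the data. This is a small sketch; the dataset size `m = 10_000`, the mini-batch size of 32, and the helper `updates_per_epoch` are hypothetical values chosen for illustration.

```python
import math

def updates_per_epoch(m, batch_size):
    """Number of parameter updates in one full pass over m examples."""
    return math.ceil(m / batch_size)

m = 10_000  # hypothetical dataset size
counts = {
    "batch": updates_per_epoch(m, m),
    "mini-batch (32)": updates_per_epoch(m, 32),
    "stochastic": updates_per_epoch(m, 1),
}
for name, count in counts.items():
    print(name, count)
# batch 1
# mini-batch (32) 313
# stochastic 10000
```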

### Mathematical Formulation

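For a dataset with $n$ features, the partial derivative of the cost function $J$ (the Mean Squared Error with the $\frac{1}{2}$ factor used above) with respect to each parameter $\beta_j$ ($j = 0, 1, \ldots, n$) is given below, where $\beta$ plays the role of $\theta$ from the earlier equations: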

$$
\frac{\partial J}{\partial \beta_j} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\beta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
$$

where:
- $h_\beta(x^{(i)}) = \beta_0 + \beta_1 x_1^{(i)} + \cdots + \beta_n x_n^{(i)}$
- $x_0^{(i)} = 1$ (for the intercept term $\beta_0$)
- $x_j^{(i)}$ is the $j$-th feature of the $i$-th example
- $m$ is the number of training examples
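
One way to sanity-check the derivative is to compare it against a central finite-difference approximation of the cost. The sketch below assumes the halved squared-error cost $J(\beta) = \frac{1}{2m}\sum_i (h_\beta(x^{(i)}) - y^{(i)})^2$; the function names, the random data, and the step `eps` are assumptions for the example.

```python
import numpy as np

def cost(beta, X, y):
    """J(beta) = 1/(2m) * sum of squared residuals; X has a leading column of ones."""
    residual = X @ beta - y
    return residual @ residual / (2 * len(y))

def analytic_gradient(beta, X, y):
    """(1/m) * sum_i (h_beta(x^(i)) - y^(i)) * x_j^(i), for all j at once."""
    return X.T @ (X @ beta - y) / len(y)

def numeric_gradient(beta, X, y, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = np.zeros_like(beta)
    for j in range(len(beta)):
        step = np.zeros_like(beta)
        step[j] = eps
        grad[j] = (cost(beta + step, X, y) - cost(beta - step, X, y)) / (2 * eps)
    return grad

rng = np.random.default_rng(1)
features = rng.normal(size=(30, 3))
X = np.column_stack([np.ones(30), features])  # x_0 = 1 for the intercept beta_0
y = rng.normal(size=30)
beta = rng.normal(size=4)

max_diff = np.max(np.abs(analytic_gradient(beta, X, y) - numeric_gradient(beta, X, y)))
print(max_diff)  # expected to be very small
```

If the cost is defined without the $\frac{1}{2}$ factor, the analytic gradient picks up a factor of 2 and the check should be adjusted accordingly.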


In [55]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_diabetes
import matplotlib.pyplot as plt
import plotly.express as px
from sklearn.model_selection import train_test_split

In [56]:
X,y=load_diabetes(return_X_y=True)

In [57]:
X.shape,y.shape

((442, 10), (442,))

In [58]:
x_train,x_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=2)

In [59]:
reg=LinearRegression()
reg.fit(x_train,y_train)

In [60]:
reg.intercept_,reg.coef_

(np.float64(152.6275795791894),
 array([ -31.80406415, -214.66452697,  445.14791885,  376.34786891,
        -802.50380919,  525.99207358,  106.4287134 ,  172.3171033 ,
         799.50335529,   57.1646797 ]))

In [61]:
y_pred1=reg.predict(x_test)

In [None]:
class GdRegressor:
    """Linear regression fitted with batch gradient descent on the MSE loss."""

    def __init__(self, learning_rate, epochs):
        self.lr = learning_rate
        self.epochs = epochs
        self.intercept = None
        self.coef_ = None

    def fit(self, x, y):
        # Start from intercept 0 and all coefficients equal to 1.
        self.intercept = 0
        self.coef_ = np.ones(x.shape[1])
        for _ in range(self.epochs):
            # Vectorized predictions for every training row at once.
            y_hat = np.dot(x, self.coef_) + self.intercept
            # Gradient of the MSE, (1/m) * sum((y - y_hat)^2), w.r.t. the intercept.
            # This is twice the gradient of the halved cost J above, so the
            # effective step size is 2 * lr relative to that formula.
            intercept_der = -2 * np.mean(y - y_hat)
            self.intercept = self.intercept - (self.lr * intercept_der)
            # (y - y_hat) has shape (m,) and x has shape (m, n), so the dot
            # product has shape (n,): one partial derivative per coefficient.
            coef_der = -2 * np.dot((y - y_hat), x) / x.shape[0]
            self.coef_ = self.coef_ - (self.lr * coef_der)
        print(self.intercept, self.coef_)

    def predict(self, x):
        return np.dot(x, self.coef_) + self.intercept

In [63]:
gdr = GdRegressor(epochs=1000,learning_rate=0.5)

In [64]:
gdr.fit(x_train,y_train)

152.7908903056216 [  -1.98714805 -178.52647764  422.78228872  342.45490237  -20.1557654
  -86.29425447 -212.27779724  133.24587264  437.27773939  102.99345759]


In [65]:
y_pred2=gdr.predict(x_test)

In [66]:
from sklearn.metrics import r2_score

In [67]:
print('For the OLS:',r2_score(y_test,y_pred1))
print('For the gdr:',r2_score(y_test,y_pred2))

For the OLS: 0.5012577617513012
For the gdr: 0.49367273964010805


<span style='color:green'>Both models achieve similar R² scores on the test set (about 0.50 for OLS and 0.49 for gradient descent), even though the learned coefficients differ.</span>
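
A possible follow-up is to record the training loss at every epoch and check whether it is still decreasing when training stops. The sketch below repeats the same update as `GdRegressor` in a standalone function; `fit_with_history` and the synthetic data (small-scale features with made-up coefficients) are assumptions for illustration and do not reproduce the diabetes results above.

```python
import numpy as np

def fit_with_history(x, y, learning_rate, epochs):
    """Batch gradient descent on the MSE, recording the loss before each update."""
    intercept, coef = 0.0, np.ones(x.shape[1])
    history = []
    for _ in range(epochs):
        y_hat = x @ coef + intercept
        history.append(np.mean((y - y_hat) ** 2))
        # Both gradients use the same y_hat, so this is a simultaneous update.
        intercept = intercept - learning_rate * (-2 * np.mean(y - y_hat))
        coef = coef - learning_rate * (-2 * (y - y_hat) @ x / x.shape[0])
    return intercept, coef, history

# Synthetic data with small-scale features (an assumption for this example).
rng = np.random.default_rng(0)
x = rng.normal(scale=0.05, size=(300, 3))
y = 150 + x @ np.array([400.0, -200.0, 100.0]) + rng.normal(scale=5, size=300)

intercept, coef, history = fit_with_history(x, y, learning_rate=0.5, epochs=1000)
print(history[0], history[-1])
```

If the recorded loss is still falling at the last epoch, more epochs may bring the gradient descent coefficients closer to the OLS solution.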