## Stochastic Gradient Descent

Stochastic Gradient Descent (SGD) is an optimization algorithm that minimizes the cost function of linear regression (or other machine learning models) by updating the model parameters (coefficients and intercept) iteratively, using one data point (or a small batch of data points) at a time. Unlike batch gradient descent, which computes every gradient from the entire training dataset, SGD computes each gradient from a single data point (or mini-batch). Each update is therefore much cheaper, which lets SGD handle large datasets, at the cost of noisier individual steps.

### Mathematical Formulation:

In multiple linear regression, the objective is to minimize the mean squared error (MSE) between the predicted and actual values of the response variable:

$$ \text{MSE} = \frac{1}{m} \sum_{i=1}^{m} (Y_i - \hat{Y}_i)^2 $$

Where:
- $m$ is the number of training examples (rows),
- $Y_i$ is the actual response variable for the $i$-th training example,
- $\hat{Y}_i$ is the predicted response variable for the $i$-th training example.

The linear regression model can be represented as:

$$ \hat{Y}_i = w_0 + w_1 X_{i1} + w_2 X_{i2} + \ldots + w_n X_{in} $$

Where:
- $w_0$ is the intercept,
- $w_1, w_2, \ldots, w_n$ are the coefficients (weights) corresponding to the features $X_{i1}, X_{i2}, \ldots, X_{in}$ respectively.
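
As a small illustration of the two formulas above, the following sketch evaluates the model and its error on made-up numbers. The data and parameter values are hypothetical, and the error is taken as the plain mean of the squared residuals (some texts add a factor of 1/2, which only rescales the gradient).

```python
import numpy as np

# Toy data (hypothetical values): m = 3 examples, n = 2 features
X = np.array([[1.0, 2.0],
              [2.0, 0.0],
              [3.0, 1.0]])
y = np.array([5.0, 4.0, 8.0])

w0 = 1.0                    # intercept
w = np.array([2.0, 0.5])    # coefficients w_1, w_2

# Prediction for every row: w0 + w_1 * X_i1 + w_2 * X_i2
y_hat = w0 + X @ w

# Mean of the squared residuals
mse = np.mean((y - y_hat) ** 2)
print(y_hat, mse)
```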

### Stochastic Gradient Descent Algorithm:

1. **Initialize Parameters:**
   - Start with initial values for the intercept ($w_0$) and coefficients ($w_1, w_2, \ldots, w_n$).

2. **Compute Gradients:**
   - Pick one training example $i$ at random and compute the partial derivatives of its squared error $L_i = (Y_i - \hat{Y}_i)^2$, the single-sample estimate of the MSE, with respect to each parameter:
   $$ \frac{\partial L_i}{\partial w_0} = -2(Y_i - \hat{Y}_i) $$
   $$ \frac{\partial L_i}{\partial w_j} = -2(Y_i - \hat{Y}_i) X_{ij} \quad \text{for } j = 1, 2, \ldots, n $$

3. **Update Parameters:**
   - Update the intercept and coefficients using the gradients and a learning rate ($\alpha$):
   $$ w_0 := w_0 - \alpha \frac{\partial L_i}{\partial w_0} $$
   $$ w_j := w_j - \alpha \frac{\partial L_i}{\partial w_j} \quad \text{for } j = 1, 2, \ldots, n $$

4. **Iterate:**
   - Repeat steps 2 and 3 until the cost function converges to a minimum or a predefined number of iterations is reached.
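
To make steps 2 and 3 concrete, here is one update carried out by hand on a single training example. The starting parameters, the example, and the learning rate are hypothetical values chosen so the arithmetic is easy to follow.

```python
import numpy as np

# Hypothetical starting parameters and one training example
w0 = 0.0
w = np.array([1.0, 1.0])
x_i = np.array([1.0, 2.0])
y_i = 5.0
alpha = 0.1

# Step 2: gradients of the squared error for this one example
y_hat = w0 + x_i @ w            # 0 + 1*1 + 1*2 = 3
residual = y_i - y_hat          # 5 - 3 = 2
grad_w0 = -2 * residual         # -4
grad_w = -2 * residual * x_i    # [-4, -8]

# Step 3: move each parameter against its gradient
w0 = w0 - alpha * grad_w0       # 0.4
w = w - alpha * grad_w          # [1.4, 1.8]

new_y_hat = w0 + x_i @ w        # 0.4 + 1.4 + 3.6 = 5.4
print(w0, w, new_y_hat)
```

After this single step the prediction for the example moves from 3.0 to 5.4, so the absolute error shrinks from 2.0 to 0.4 while slightly overshooting the target.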
  
### Interpretation:

- Stochastic gradient descent updates the model parameters (coefficients and intercept) using only one data point at a time, so each update costs far less than a batch gradient descent step, although the individual steps are noisier.
- It is more suitable for large datasets and online learning scenarios where the entire dataset cannot fit into memory.
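
The relationship between single-sample, mini-batch, and batch gradients can be sketched with one helper that averages the per-row gradient over whatever rows it receives. The function name `squared_error_gradients` and the toy data are hypothetical; the helper uses the plain mean of squared errors, so the one-row case matches the $-2(Y_i - \hat{Y}_i)$ expressions in the algorithm section.

```python
import numpy as np

def squared_error_gradients(X_batch, y_batch, w0, w):
    """Gradients of the mean squared error over the rows in the batch."""
    residual = y_batch - (w0 + X_batch @ w)
    grad_w0 = -2 * residual.mean()
    grad_w = -2 * (X_batch.T @ residual) / len(y_batch)
    return grad_w0, grad_w

# Hypothetical toy data: 4 examples, 2 features
X = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0], [0.0, 1.0]])
y = np.array([5.0, 4.0, 8.0, 2.0])
w0, w = 0.0, np.zeros(2)

g_single = squared_error_gradients(X[:1], y[:1], w0, w)  # SGD: one row
g_mini = squared_error_gradients(X[:2], y[:2], w0, w)    # mini-batch: two rows
g_full = squared_error_gradients(X, y, w0, w)            # batch GD: all rows

# The full-batch gradient is the average of the per-row gradients
per_row = [squared_error_gradients(X[i:i + 1], y[i:i + 1], w0, w)[1]
           for i in range(len(y))]
avg_of_rows = np.mean(per_row, axis=0)
```

The only difference between the three variants is how many rows go into each call: fewer rows give a cheaper but noisier estimate of the same quantity.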

### Summary:

Stochastic Gradient Descent is an optimization algorithm used to minimize the cost function for linear regression and other machine learning models. By randomly selecting one data point at each iteration and updating the parameters based on the gradient computed from that data point, SGD iteratively moves the coefficients and intercept toward the values that minimize the mean squared error (MSE). With a constant learning rate the parameters typically keep fluctuating around that minimum instead of settling on it exactly, so the result approximates the best-fit linear model for the given dataset.

In [16]:
from sklearn.datasets import load_diabetes

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

In [17]:
X,y = load_diabetes(return_X_y=True)

In [18]:
print(X.shape)
print(y.shape)

(442, 10)
(442,)


In [19]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2, random_state=2)

In [20]:
reg = LinearRegression()
reg.fit(X_train, y_train)

In [21]:
print(reg.coef_)
print(reg.intercept_)

[  -9.15865318 -205.45432163  516.69374454  340.61999905 -895.5520019
  561.22067904  153.89310954  126.73139688  861.12700152   52.42112238]
151.88331005254167


In [22]:
y_pred = reg.predict(X_test)
r2_score(y_test, y_pred)

0.4399338661568968

### Creating our own SGD class

In [66]:
class MySGDRegressor:
    def __init__(self, learning_rate = 0.01, epochs=100):
        self.coef_ = None
        self.intercept_ = None
        self.lr = learning_rate
        self.epochs = epochs

    def fit(self, X_train, y_train):
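        # Start from a zero intercept and unit coefficients
        self.intercept_ = 0
        self.coef_ = np.ones(X_train.shape[1])

        for i in range(self.epochs):
            # One epoch = as many single-row updates as there are training rows
            for j in range(X_train.shape[0]):
                # Draw one row at random (with replacement)
                idx = np.random.randint(0,X_train.shape[0])

                # Prediction for the sampled row with the current parameters
                y_hat = np.dot(X_train[idx], self.coef_) + self.intercept_

                # Gradient of the squared error w.r.t. the intercept, then update
                intercept_der = -2 * (y_train[idx] - y_hat)
                self.intercept_ = self.intercept_ - (self.lr * intercept_der)

                # Gradient w.r.t. the coefficients: scalar residual times the feature row
                coef_der = -2 * np.dot((y_train[idx] - y_hat), X_train[idx])
                self.coef_ = self.coef_ - (self.lr * coef_der)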

        print(self.intercept_, self.coef_)

    def predict(self, X_test):
        return np.dot(X_test, self.coef_) + self.intercept_

In [99]:
my_sgd = MySGDRegressor(learning_rate=0.01, epochs=100)

In [100]:
my_sgd.fit(X_train, y_train)

149.25178109623482 [  31.40306852 -140.18719161  457.4584895   301.84052217  -34.23875626
 -105.70556629 -199.53909988  110.46616617  421.34526077  107.75413467]


In [101]:
y_pred = my_sgd.predict(X_test)
r2_score(y_test, y_pred)

0.45128762739253414
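
Because `fit` draws rows with `np.random.randint` and no seed is set, the printed parameters and the score above will differ from run to run. The following is a minimal sketch of a seeded, functional variant of the same update rule, checked on synthetic data whose true parameters are known. The name `sgd_fit`, the `seed` argument, and the synthetic data are assumptions made for illustration and are not part of the class above.

```python
import numpy as np

def sgd_fit(X, y, learning_rate=0.01, epochs=20, seed=0):
    """Single-sample SGD for linear regression with a seeded generator."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    intercept, coef = 0.0, np.ones(n_features)
    for _ in range(epochs):
        for _ in range(n_samples):
            idx = rng.integers(0, n_samples)   # random row, with replacement
            residual = y[idx] - (X[idx] @ coef + intercept)
            # Same rule as above: parameter -= lr * (-2 * residual * input)
            intercept += learning_rate * 2 * residual
            coef += learning_rate * 2 * residual * X[idx]
    return intercept, coef

# Synthetic data with known parameters (hypothetical values)
data_rng = np.random.default_rng(42)
X_demo = data_rng.normal(size=(100, 2))
y_demo = 3.0 + X_demo @ np.array([2.0, -1.0]) + data_rng.normal(scale=0.1, size=100)

b1, w1 = sgd_fit(X_demo, y_demo, seed=0)
b2, w2 = sgd_fit(X_demo, y_demo, seed=0)
print(b1, w1)
```

Two calls with the same seed visit the rows in the same order and return identical parameters, which makes comparisons between learning rates or epoch counts easier to interpret.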

### Using SGDRegressor from sklearn library

In [102]:
from sklearn.linear_model import SGDRegressor
sgd = SGDRegressor(max_iter=100, learning_rate='constant', eta0=0.01)

In [103]:
sgd.fit(X_train, y_train)



In [104]:
y_pred = sgd.predict(X_test)
r2_score(y_test, y_pred)

0.430322792279596
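
For the online-learning scenario mentioned in the interpretation, where the data arrives in pieces or does not fit in memory, `SGDRegressor` also provides `partial_fit`, which updates the model from whatever rows it is given. The sketch below feeds the training split in chunks; the chunk size, the number of passes, and `random_state=0` are arbitrary choices for illustration, and the resulting score is not expected to match the one above exactly.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2)

sgd_online = SGDRegressor(learning_rate='constant', eta0=0.01, random_state=0)

chunk_size = 50  # hypothetical size of each incoming chunk
for epoch in range(100):
    for start in range(0, X_train.shape[0], chunk_size):
        stop = start + chunk_size
        # Each call runs one pass of SGD updates over the rows in this chunk
        sgd_online.partial_fit(X_train[start:stop], y_train[start:stop])

score = float(r2_score(y_test, sgd_online.predict(X_test)))
print(score)
```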