# **Stochastic Gradient Descent**
Stochastic Gradient Descent (SGD) with a single-row update, also known as online SGD, updates the model's parameters one data point (or row) at a time, rather than once per mini-batch or once per full pass over the data. After each individual data point is processed, the parameters are adjusted using the gradient computed for that point alone. This approach is sometimes referred to as "online learning."

<center><img src="https://editor.analyticsvidhya.com/uploads/58182variations_comparison.png" style="width: 60%"></center>

Here's how SGD with a single-row update works:

1. **Initialization**: Start with an initial set of model parameters.

2. **Data Shuffling**: The training dataset is often shuffled to ensure that the order of data points does not bias the training process.

3. **Iterative Updates**: Within each epoch (one full pass over the training set), the algorithm processes one data point at a time. The model's parameters are updated after every data point, based on the gradient of the loss function with respect to that single point.

4. **Gradient Computation**: The gradient of the loss function with respect to the model parameters is computed by backpropagating errors through the network (for neural networks) or using analytical derivatives (for simpler models). The gradient represents how the loss changes with small perturbations in the model parameters for that specific data point.

5. **Parameter Update**: The model parameters are adjusted in the opposite direction of the gradient, just as in batch or mini-batch gradient descent. The learning rate controls the step size of each update.

6. **Repeat**: Steps 3-5 are repeated for every data point in the training set, and for as many epochs as needed, until the convergence criteria are met.
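
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used later in this notebook: the function name `sgd_single_row`, the synthetic noise-free data, and the choice of squared error as the loss are all assumptions made for the example. It reshuffles the rows once per epoch, as described in step 2.

```python
import numpy as np

def sgd_single_row(X, y, lr=0.01, epochs=50, seed=0):
    """Fit y ~ X @ w + b with one-row-at-a-time updates on squared error."""
    rng = np.random.default_rng(seed)
    n_rows, n_features = X.shape
    w = np.zeros(n_features)                 # step 1: initialization
    b = 0.0
    for _ in range(epochs):
        order = rng.permutation(n_rows)      # step 2: reshuffle every epoch
        for i in order:                      # step 3: one row per update
            error = y[i] - (X[i] @ w + b)
            grad_w = -2.0 * error * X[i]     # step 4: gradient for this row only
            grad_b = -2.0 * error
            w -= lr * grad_w                 # step 5: move against the gradient
            b -= lr * grad_b
    return w, b                              # step 6: stop after `epochs` passes

# Hypothetical noise-free data, so the true parameters are recoverable
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
true_b = 4.0
y_demo = X_demo @ true_w + true_b

w_hat, b_hat = sgd_single_row(X_demo, y_demo)
print("Estimated weights:", w_hat)
print("Estimated intercept:", b_hat)
```

On data like this the estimates should land very close to `true_w` and `true_b`; with noisy targets a constant learning rate would instead leave the parameters hovering around the optimum.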

Online SGD has both advantages and disadvantages:

**Advantages:**

1. **Efficiency**: Online SGD can be very efficient, as it processes one data point at a time, making it suitable for streaming data or scenarios with limited memory.

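2. **Quick Initial Progress**: Because the parameters are updated after every data point rather than once per pass, online SGD can make rapid progress early in training, especially when the data is abundant and diverse.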

3. **Adaptability**: It can adapt to changing data distributions and non-stationary data, making it useful in online learning and real-time applications.

**Disadvantages:**

1. **High Variability**: Since updates are based on individual data points, the parameter updates can be highly variable and noisy, which may result in a less stable convergence.

2. **Slower Convergence**: Close to the optimum, online SGD can converge more slowly than SGD with mini-batches, because the high variance in parameter updates keeps the parameters fluctuating around the minimum.

3. **Difficulty in Hyperparameter Tuning**: Choosing an appropriate learning rate and other hyperparameters can be more challenging because of the high variance in updates.
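
To make the variance argument concrete, the sketch below compares the spread of single-row gradients with the spread of mini-batch gradients at a fixed parameter vector. Everything in it is illustrative: the synthetic data, the helper names `batch_gradient` and `gradient_spread`, and the batch size of 32 are assumptions, not part of this notebook's experiment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical regression data
X_demo = rng.normal(size=(1000, 5))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + rng.normal(scale=0.5, size=1000)

w = np.zeros(5)  # evaluate every gradient at the same fixed point

def batch_gradient(idx):
    """Mean squared-error gradient w.r.t. w over the rows in idx."""
    residual = y_demo[idx] - X_demo[idx] @ w
    return -2.0 * residual @ X_demo[idx] / len(idx)

def gradient_spread(batch_size, n_draws=2000):
    """Total variance of the gradient estimate across random batches."""
    grads = np.array([
        batch_gradient(rng.choice(len(X_demo), size=batch_size, replace=False))
        for _ in range(n_draws)
    ])
    return grads.var(axis=0).sum()

spread_single = gradient_spread(1)
spread_batch = gradient_spread(32)
print("Single-row gradient spread:", spread_single)
print("Mini-batch (32) gradient spread:", spread_batch)
```

Averaging over a batch shrinks the variance of the gradient estimate, so the single-row spread should come out much larger than the mini-batch one. That extra noise is what makes single-row updates less stable and the learning rate harder to tune.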

Online SGD is typically used in situations where computational resources or memory are limited, or when the data distribution is constantly changing. It's commonly employed in online learning scenarios, such as recommendation systems, where new data arrives continuously and must be processed as it comes in. However, it may require careful tuning and monitoring to achieve optimal convergence and performance.

## **Import Required Libraries**

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import time
import warnings
warnings.filterwarnings("ignore")

## **Load the Data**

In [2]:
from sklearn.datasets import load_diabetes

In [3]:
# Read the Diabetes data
X, y = load_diabetes(return_X_y=True)

In [4]:
X.shape

(442, 10)

## **Train Test Split**

In [5]:
from sklearn.model_selection import train_test_split

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
X_train.shape, X_test.shape

((309, 10), (133, 10))

## **Apply Linear Regression with Ordinary Least Squares (OLS)**

In [7]:
from sklearn.linear_model import LinearRegression

In [8]:
# Store the initial time in a variable
start = time.time()

# Instantiate a linear regression object
lr = LinearRegression()

# Fit the data
lr.fit(X_train, y_train)

# Print the actual time taken to fit the data
print("The Time taken is:", time.time() - start)

The Time taken is: 0.017316102981567383


In [9]:
# Print the coefficients and intercept
print("Coefficients:\n", lr.coef_, "\n")
print("Intercept:", lr.intercept_)

Coefficients:
 [ -52.46478548 -193.50733393  579.49108514  272.453666   -504.64830389
  241.62372969  -69.76596029   86.61313961  721.92083806   26.78067442] 

Intercept: 153.71901624380382


In [10]:
# Predict the test data
y_pred = lr.predict(X_test)

In [11]:
# Calculate the R2 Score
from sklearn.metrics import r2_score

In [12]:
print("R2 Score:", r2_score(y_test, y_pred))

R2 Score: 0.39289927216962917
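
For reference, the R2 score reported above is one minus the ratio of the residual sum of squares to the total sum of squares. The helper below is a hand-rolled sketch of that definition (the name `r2_from_scratch` and the toy arrays are hypothetical), intended only to show what the metric measures.

```python
import numpy as np

def r2_from_scratch(y_true, y_pred):
    """R2 = 1 - SS_res / SS_tot for a single regression target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot

y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])

# Perfect predictions give 1.0
print(r2_from_scratch(y_true_demo, y_true_demo))

# Always predicting the mean gives 0.0
print(r2_from_scratch(y_true_demo, np.full(4, y_true_demo.mean())))
```

A score near 0.39, as obtained here, means the model explains roughly 39% of the variance in the test targets.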


## **Apply Multiple Linear Regression with Stochastic Gradient Descent**

In [13]:
# Create a class that fits linear regression with single-row stochastic gradient descent
class SGDRegressor:
    def __init__(self, lr=0.01, epochs=100):
        self.lr = lr
        self.epochs = epochs
        self.coef_ = None
        self.intercept_ = None
        
    def fit(self, X_train, y_train):
        self.intercept_ = 0
        self.coef_ = np.ones(X_train.shape[1])
        
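        for _ in range(self.epochs):
            # One "epoch" here means X_train.shape[0] single-row updates
            for _ in range(X_train.shape[0]):
                # Pick a row uniformly at random (with replacement)
                idx = np.random.randint(0, X_train.shape[0])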
                
                # Predict the y_hat
                y_hat = np.dot(X_train[idx], self.coef_) + self.intercept_
                
                # Update the intercept using a single row
                intercept_der = -2 * (y_train[idx] - y_hat)
                self.intercept_ = self.intercept_ - (self.lr * intercept_der)
                
                # Update the coefficients using a single row
                coef_der = -2 * np.dot((y_train[idx] - y_hat), X_train[idx])
                self.coef_ = self.coef_ - (self.lr * coef_der)
    
    def predict(self, X):
        return np.dot(X, self.coef_) + self.intercept_
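
The two derivative lines in `fit` follow from the squared error of a single row. The notation below is introduced here for illustration ($x_i$ for the sampled row, $w$ for `coef_`, $b$ for `intercept_`, $\eta$ for `lr`):

$$
\hat{y}_i = x_i \cdot w + b, \qquad L_i = (y_i - \hat{y}_i)^2
$$

$$
\frac{\partial L_i}{\partial b} = -2\,(y_i - \hat{y}_i), \qquad
\frac{\partial L_i}{\partial w} = -2\,(y_i - \hat{y}_i)\,x_i
$$

$$
b \leftarrow b - \eta\,\frac{\partial L_i}{\partial b}, \qquad
w \leftarrow w - \eta\,\frac{\partial L_i}{\partial w}
$$

Note that the class draws each row index independently with `np.random.randint`, so within one pass some rows may be visited more than once and others not at all. This differs slightly from shuffling the dataset and visiting every row exactly once per epoch.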

In [14]:
# Store the initial time in a variable
start = time.time()

# Instantiate a SGDRegressor object
sgdr = SGDRegressor(lr=0.03, epochs=100)

# Fit the data
sgdr.fit(X_train, y_train)

# Print the actual time taken to fit the data
print("The Time taken is:", time.time() - start)

The Time taken is: 0.34311676025390625


In [15]:
# Print the coefficients and intercept
print("Coefficients:\n", sgdr.coef_, "\n")
print("Intercept:", sgdr.intercept_)

Coefficients:
 [ -41.03889143 -194.92599674  590.11828356  272.37981486  -69.86465318
 -152.76619768 -238.47971595   88.93869043  513.01930642   56.26845156] 

Intercept: 155.20404866654545


In [16]:
# Predict the test data
y_pred = sgdr.predict(X_test)

In [17]:
# Calculate the R2 Score
print("R2 Score:", r2_score(y_test, y_pred))

R2 Score: 0.38663888603937424


## **Stochastic Gradient Descent with Scikit-Learn**

In [18]:
# Note: this import shadows the custom SGDRegressor class defined above
from sklearn.linear_model import SGDRegressor

In [19]:
# Instantiate a SGDRegressor object
reg = SGDRegressor(loss='squared_error', learning_rate='constant', eta0=0.01)

# Fit the data
reg.fit(X_train, y_train)

In [20]:
# Predict the test data
y_pred = reg.predict(X_test)

In [21]:
# Calculate the R2 Score
print("R2 Score:", r2_score(y_test, y_pred))

R2 Score: 0.39411766076013666
