# Introduction:

In batch gradient descent we calculate the gradient over the whole dataset before each weight update, while in stochastic gradient descent (SGD) we update the weights after each individual data point, which makes every update much cheaper. Updating on single data points causes a lot of fluctuation in the weight vector; however, if the learning rate is decreased over time, SGD shows the same convergence behaviour as batch gradient descent. To get the best of both worlds we can use mini-batch gradient descent.
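
To make the difference concrete, here is a minimal sketch of the three variants for the squared loss. They differ only in how many rows feed each gradient estimate. The helper name `squared_loss_gradient` and the toy data are assumptions made for illustration and are not part of the implementation further down.

```python
import numpy as np

def squared_loss_gradient(X, y, w, rows):
    """Average gradient of 0.5 * (x.w - y)^2 over the selected rows."""
    Xb, yb = X[rows], y[rows]
    residual = Xb.dot(w) - yb              # one residual per selected row
    return Xb.T.dot(residual) / len(rows)

# Toy data, assumed purely for illustration
rng = np.random.RandomState(0)
X = rng.randn(8, 3)
y = rng.randn(8)
w = np.zeros(3)

all_rows = np.arange(X.shape[0])
batch_grad = squared_loss_gradient(X, y, w, all_rows)      # batch: every row
sgd_grad = squared_loss_gradient(X, y, w, all_rows[:1])    # stochastic: one row
mini_grad = squared_loss_gradient(X, y, w, all_rows[:4])   # mini-batch: a few rows
```

The batch gradient is the average of the single-row gradients, so each SGD step is a noisy but cheap estimate of it, and a mini-batch sits in between.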

# Algorithm:
1. Initialize the weight vector.
2. for i = 1 to number of epochs do
    3. shuffle data points
    4. for j = 1 to number of data points do
        5. calculate the gradient 
        6. decrease the learning rate
        7. update the weight vector
8. return weights
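
The steps above can be sketched as follows for plain least squares. The function name `sgd_with_decay`, the schedule `alpha0 / (1 + decay * t)` and the default values are assumptions chosen for illustration; the algorithm only says that the learning rate should decrease, not how.

```python
import numpy as np

def sgd_with_decay(X, y, alpha0=0.05, decay=0.01, epochs=50, seed=0):
    """SGD for least squares with a decaying learning rate (illustrative)."""
    rng = np.random.RandomState(seed)
    n, d = X.shape
    w = np.zeros(d)                              # step 1: initialize weights
    t = 0                                        # number of updates so far
    for _ in range(epochs):                      # step 2: loop over epochs
        for i in rng.permutation(n):             # steps 3-4: shuffled pass
            grad = X[i] * (X[i].dot(w) - y[i])   # step 5: single-point gradient
            alpha = alpha0 / (1.0 + decay * t)   # step 6: assumed decay schedule
            w = w - alpha * grad                 # step 7: update the weights
            t += 1
    return w                                     # step 8: return weights

# Noiseless synthetic data, so the true weights should be recovered closely
rng = np.random.RandomState(1)
X_demo = rng.randn(200, 3)
true_w = np.array([1.0, -2.0, 0.5])
y_demo = X_demo.dot(true_w)
w_hat = sgd_with_decay(X_demo, y_demo)
```

Other schedules, such as dividing by the square root of the update count, would fit the same loop; only the line marked as step 6 changes.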

# Python Implementation:

In [80]:
'Import libraries'
from __future__ import division
import numpy as np
import pandas as pd
from collections import defaultdict
import copy
import time
from sklearn.cross_validation import ShuffleSplit
from sklearn.preprocessing import StandardScaler

**Before implementing the algorithm, let us first define a few helper functions that we will need.**

In [81]:
"Util functions"
def paddingData(data):
    '''
    :param data: Data to be padded
    :return: Padded data with value 1 in the first column
    '''
    return np.c_[np.ones(data.shape[0]), data]

def setWeights(numFeat):
    '''
    :param numFeat: Total number of features in data
    :return: vector of ones of length equal to number of features in data
    '''
    return np.ones(numFeat).reshape((numFeat, 1))

def geterror(true, pred):
    '''
    :param true: true labels
    :param pred: predicted labels
    :return: L2 norm of the residuals divided by the number of samples
    '''
    return np.linalg.norm(np.subtract(pred, true))/true.shape[0]

def splitData(test_size,cv, numpoints):
    '''
    Wraps sklearn's ShuffleSplit, which takes the number of data points and the
    test size and returns randomly shuffled train/test indices, one pair per
    requested re-shuffle.
    :param test_size: size of the test data required (value between 0 and 1).
    :param cv: Number of re-shuffling.
    :param numpoints: Total number of data points.
    :return: indices of the shuffled splits.
    '''
    ss = ShuffleSplit(n=numpoints, n_iter=cv, test_size=test_size, random_state=32)
    return ss
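
To show what a single shuffled split amounts to, here is a minimal NumPy-only sketch. The helper `shuffle_split_indices` is hypothetical, and it will not reproduce the exact indices that sklearn draws for the same seed. Note also that in scikit-learn 0.18 and later `ShuffleSplit` lives in `sklearn.model_selection`, takes `n_splits` in place of `n` and `n_iter`, and is iterated through its `split` method, so the cell above needs an older release to run as written.

```python
import numpy as np

def shuffle_split_indices(num_points, test_size, seed=32):
    """Return (train_index, test_index) for one random, non-overlapping split."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(num_points)               # shuffle all row indices
    num_test = int(np.ceil(test_size * num_points))   # assumed: round the test size up
    return order[num_test:], order[:num_test]

# 442 points with a 25% test share, matching the sizes used in this notebook
demo_train, demo_test = shuffle_split_indices(442, 0.25)
```

Every index ends up in exactly one of the two arrays, so this is a split without replacement and not a bootstrap sample.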

# SGD Linear Regression:

In [82]:
'Implements SGD with l2 regularization'
class stochasticgradientdescent():

    def __init__(self,alpha,epoch,Lambda):
        '''
        :param alpha: learning rate
        :param epoch: number of passes over data
        :param Lambda: regularization parameter
        '''
        self.alpha = alpha
        self.epoch = epoch
        self.Lambda = Lambda
        self.weights = None

    def fit(self,Xtrain,ytrain):
        start = time.time()
        print "Running Stochastic Gradient Descent"
        # Pad the data with a leading column of ones (bias term)
        Xtrain = paddingData(Xtrain)
        self.weights = setWeights(Xtrain.shape[1])
        # Fix the random seed so that results are reproducible
        np.random.seed(32)
        for _ in range(self.epoch):
            # Visit every row exactly once per epoch, in random order
            ite = np.random.choice(a=Xtrain.shape[0], size=Xtrain.shape[0], replace=False)
            for i in ite:
                oneData = Xtrain[i, :].reshape((1, Xtrain.shape[1]))
                oneLabel = ytrain[i,:]
                # Residual (prediction minus label) for this single data point
                loss = np.dot(oneData, self.weights) - oneLabel
                gradient = np.dot(oneData.transpose(), loss)
                # Gradient step with L2 weight decay; alpha is kept constant here
                self.weights = (1 - 2 * self.Lambda * self.alpha) * self.weights - self.alpha * gradient
        end = time.time()
        print 'Time taken to fit data:', end-start
        
    def predict(self,Xtest):
        Xtest = paddingData(Xtest)
        return np.dot(Xtest, self.weights)
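
The weight update in `fit` is consistent with taking one gradient step on a per-sample squared loss with an L2 penalty. The objective below is inferred from the update line and is not stated explicitly in the code, so treat it as an assumption; here $x_i$ is the padded feature vector, $\alpha$ is `alpha` and $\lambda$ is `Lambda`.

$$
\begin{aligned}
J_i(w) &= \tfrac{1}{2}\left(x_i^\top w - y_i\right)^2 + \lambda \lVert w \rVert_2^2 \\
\nabla_w J_i(w) &= x_i\left(x_i^\top w - y_i\right) + 2\lambda w \\
w &\leftarrow w - \alpha \nabla_w J_i(w) = (1 - 2\lambda\alpha)\,w - \alpha\, x_i\left(x_i^\top w - y_i\right)
\end{aligned}
$$

Because the weight vector includes the entry for the padded column of ones, the bias term is shrunk by the same factor $(1 - 2\lambda\alpha)$ as the other weights.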

# Load the data:
We will use the diabetes dataset from sklearn to test our algorithm. This dataset has 10 continuous features and 442 samples.

In [83]:
from sklearn.datasets import load_diabetes
diabetes = load_diabetes()
numSamples, numFeat = diabetes.data.shape
ss = splitData(test_size=0.25,cv=1, numpoints=numSamples)
for train_index, test_index in ss:
    Xtrain = diabetes.data[train_index, :]
    ytrain = diabetes.target[train_index].reshape((train_index.shape[0], 1))
    Xtest = diabetes.data[test_index, :]
    ytest = diabetes.target[test_index].reshape((test_index.shape[0], 1))
# from sklearn.datasets import load_boston
# boston = load_boston()
# numSamples, numFeat = boston.data.shape
# ss = splitData(test_size=0.25,cv=1, numpoints=numSamples)
# for train_index, test_index in ss:
#     Xtrain = boston.data[train_index, :]
#     ytrain = boston.target[train_index].reshape((train_index.shape[0], 1))
#     Xtest = boston.data[test_index, :]
#     ytest = boston.target[test_index].reshape((test_index.shape[0], 1))

In [84]:
'Normalize the data to zero mean and unit variance'
# Fit the scaler on the training data only, then reuse its statistics on the test data
scaler = StandardScaler()
Xtrain = scaler.fit_transform(Xtrain)
Xtest = scaler.transform(Xtest)
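
As a rough illustration of what this cell does, the following NumPy sketch standardizes both sets using statistics computed on the training set only. The function name `standardize` and the toy arrays are assumptions for illustration; it is a sketch of the idea, not a replacement for `StandardScaler`.

```python
import numpy as np

def standardize(train, test):
    """Scale both arrays with the column mean and std of the training array."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)      # population standard deviation (ddof=0)
    std[std == 0.0] = 1.0        # guard: leave constant columns unscaled
    return (train - mean) / std, (test - mean) / std

# Toy data, assumed purely for illustration
toy_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
toy_test = np.array([[2.0, 20.0]])
toy_train_s, toy_test_s = standardize(toy_train, toy_test)
```

Reusing the training statistics on the test set keeps information about the test data out of the fitting step, which is why the cell calls `fit_transform` on `Xtrain` but only `transform` on `Xtest`.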

# Running our algorithm:
As a sanity check, I also compare the performance of our SGD implementation with sklearn's SGDRegressor.

In [85]:
ourSGD = stochasticgradientdescent(alpha = 0.001,epoch = 200,Lambda = 0.0001)
ourSGD.fit(Xtrain,ytrain)
pred = ourSGD.predict(Xtest)
pred = pred.reshape((pred.shape[0], 1))
print 'Error:',geterror(ytest, pred)

Running Stochastic Gradient Descent
Time taken to fit data: 0.588402986526
Error: 5.10859016007
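
When reading this number, note that `geterror` divides the L2 norm of the residuals by the number of samples, which is not the same as the root mean squared error (RMSE). The sketch below, with assumed toy values and the hypothetical helper names `norm_over_n` and `rmse`, shows how the two are related.

```python
import numpy as np

def norm_over_n(true, pred):
    """Same quantity as geterror: ||pred - true||_2 divided by n."""
    return np.linalg.norm(pred - true) / true.shape[0]

def rmse(true, pred):
    """Root mean squared error: ||pred - true||_2 divided by sqrt(n)."""
    return np.sqrt(np.mean((pred - true) ** 2))

# Toy labels and predictions, assumed purely for illustration
toy_true = np.array([[3.0], [5.0], [7.0], [9.0]])
toy_pred = np.array([[2.0], [5.0], [8.0], [11.0]])

n = toy_true.shape[0]
same_scale = norm_over_n(toy_true, toy_pred) * np.sqrt(n)   # equals the RMSE
```

Since the two metrics differ only by a factor of the square root of the number of test samples, comparing two models on the same test set gives the same ranking under either one.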


In [86]:
from sklearn.linear_model import SGDRegressor
sgd = SGDRegressor()
sgd.fit(Xtrain, ytrain.ravel())  # SGDRegressor expects a 1-D target array
pred1 = sgd.predict(Xtest)
pred1 = pred1.reshape((pred1.shape[0], 1))
print 'Error using sklearn SGDRegressor:',geterror(ytest, pred1)

Error using sklearn SGDRegressor: 5.05439165818



# Conclusion:
Our implementation gives almost the same test error as sklearn's SGDRegressor (about 5.11 versus 5.05).

# References:
My class notes and slides.