# Linear Regression via Gradient Descent

In [21]:
import numpy as np
from bokeh.plotting import figure, output_notebook,show
output_notebook()

### Set up the data

We use the multivariate simulated data from the regression lab. With zero-based indexing, the target variable is column 1 and the features are columns 2 and 3; column 0 is not used. We append a column of ones to the feature matrix so that the last coefficient acts as the intercept.

In [50]:
data = np.genfromtxt("data/multivar_simulated/data.csv", skip_header=1, delimiter=",")
Y = data[:, 1].reshape(-1,1)
X = data[:, 2:]
X = np.concatenate([X, np.ones(shape=(X.shape[0], 1))], axis=1)

Draw a random initial guess and precompute the matrices needed for the gradient, $A = X^{T}Y$ and $D = X^{T}X$. Choose a learning rate and a convergence tolerance.
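
For reference, here is a short derivation of where these two matrices come from. The symbol $L(M)$ for the summed squared error is notation introduced here, not taken from the lab:

$$
\begin{aligned}
L(M) &= \|Y - XM\|^2 = (Y - XM)^{T}(Y - XM),\\
\nabla_M L(M) &= -2X^{T}(Y - XM) = -2X^{T}Y + 2X^{T}XM = -2A + 2DM,
\end{aligned}
\qquad A = X^{T}Y,\quad D = X^{T}X.
$$

Setting the gradient to zero gives the normal equations $DM = A$, which is the system the direct solution solves.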


In [80]:
M0 = np.random.normal(0,1,size=(X.shape[1],1))

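lr = .0001         # learning rate
epsilon = .00001   # stop when the loss changes by less than this
A = X.T @ Y        # already has shape (n_features, 1), so no hard-coded reshape is needed
D = X.T @ X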

Run the gradient descent iteration, stopping once the loss changes by less than `epsilon` between consecutive iterations.


In [81]:
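MSE0=0
losses = []
for i in range(10000):
    # gradient of the summed squared error at M0 is -2*A + 2*(D@M0)
    M = M0 - lr*(-2*A+2*(D@M0))
    # note: this is the sum of squared errors, not the mean
    MSE = np.sum(np.square(Y-(X@M)))
    # stop once the loss changes by less than epsilon between iterations
    if np.abs(MSE-MSE0)<epsilon:
        print("converged after {} iterations".format(i))
        break
    M0=M.copy()
    MSE0=MSE
    losses.append(MSE)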

converged after 2006 iterations
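
The learning rate here was picked by hand. For this quadratic loss, the update `M0 - lr*(-2*A + 2*(D@M0))` shrinks the error only when `lr` is below $1/\lambda_{\max}(D)$, where $\lambda_{\max}(D)$ is the largest eigenvalue of `D` (assuming `D` is positive definite). A minimal sketch of that check follows. It uses a synthetic stand-in for `X`; the helper name `max_stable_lr` and the demo data are assumptions, not part of the lab.

```python
import numpy as np

def max_stable_lr(X):
    """Upper bound on lr for the update M <- M - lr * (2 * X.T @ X @ M - 2 * X.T @ Y)."""
    D = X.T @ X
    lam_max = np.linalg.eigvalsh(D)[-1]  # eigvalsh returns eigenvalues in ascending order
    return 1.0 / lam_max

# Synthetic stand-in for the notebook's X (assumption: 200 rows, 2 features plus a ones column)
rng = np.random.default_rng(0)
X_demo = np.concatenate([rng.normal(size=(200, 2)), np.ones((200, 1))], axis=1)
lr_bound = max_stable_lr(X_demo)
print(f"gradient descent is stable for lr below {lr_bound:.6f}")
```

On the real data, calling `max_stable_lr(X)` before choosing `lr` would indicate how much headroom a given value leaves.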


Compare the results of the gradient descent with the direct solution.

In [82]:
direct_solution = np.linalg.solve(D, A)  # solve the normal equations D M = A without forming the inverse
print(f"Gradient Descent yields {M.ravel()} while direct computation yields {direct_solution.ravel()}")

Gradient Descent yields [ 1.78971545 -3.47779267  6.05091821] while direct computation yields [ 1.78777492 -3.47899986  6.0608333 ]
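
The direct solution can be computed in more than one way. The sketch below compares the explicit inverse, `np.linalg.solve` on the normal equations, and `np.linalg.lstsq` applied to the data directly. It runs on synthetic data; the coefficients in `true_M` and the noise level are hypothetical values chosen for illustration, not the lab's data.

```python
import numpy as np

# Synthetic data (assumption): 2 features plus a ones column, known coefficients, small noise
rng = np.random.default_rng(1)
X_demo = np.concatenate([rng.normal(size=(100, 2)), np.ones((100, 1))], axis=1)
true_M = np.array([[2.0], [-3.0], [6.0]])  # hypothetical coefficients
Y_demo = X_demo @ true_M + 0.1 * rng.normal(size=(100, 1))

D_demo = X_demo.T @ X_demo
A_demo = X_demo.T @ Y_demo

M_inv = np.linalg.inv(D_demo) @ A_demo                      # explicit inverse
M_solve = np.linalg.solve(D_demo, A_demo)                   # linear solve, no inverse formed
M_lstsq, *_ = np.linalg.lstsq(X_demo, Y_demo, rcond=None)   # least squares on X directly

agree = np.allclose(M_inv, M_solve) and np.allclose(M_solve, M_lstsq)
print("all three direct solutions agree:", agree)
```

For a small, well-conditioned problem like this one the three agree to rounding error; the solve-based routes are generally preferred when `D` is close to singular.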


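Behavior of the loss over the iterations. The first plot shows every 100th iteration from iteration 500 onward; the second zooms in on every 10th iteration from iteration 1500 onward. In both, the horizontal axis is the iteration number divided by 10.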


In [90]:

f=figure(title='MSE for gradient descent vs iterations',x_axis_label='Iterations (x10)')
x=list(range(len(losses)))
f.scatter(x=[y/10 for y in x[500::100]],y=losses[500::100])
show(f)

In [89]:

f2=figure(title='MSE vs iterations for gradient descent',x_axis_label='Iterations (x10)')
x=list(range(len(losses)))
f2.scatter(x=[y/10 for y in x[1500::10]],y=losses[1500::10])
show(f2)

## Stochastic Gradient Descent

In stochastic gradient descent, each update uses just one pair $(x, y)$ from the dataset: $x$ is a row of the $X$ matrix, $y$ is the corresponding entry of the $Y$ vector, and the error is $\|y-xM\|^2$. The gradient of this error with respect to $M$ is $-2x^{T}(y-xM) = -2x^{T}y + 2x^{T}xM$.

Stochastic gradient descent avoids large matrix multiplications; in particular, it avoids computing $X^{T}X$, which can be prohibitively expensive when $X$ has many rows.
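
A quick way to check the sign and shape of the single-sample gradient is to compare the analytic expression $-2x^{T}(y-xM)$, which is what the `grad` line in the loop below computes, against a central finite difference. This is a minimal sketch on made-up values; the function names `sample_loss` and `sample_grad` are introduced here for illustration.

```python
import numpy as np

def sample_loss(M, x, y):
    """Squared error for one sample; x is (1, d), M is (d, 1), y is a scalar."""
    return float((y - (x @ M)[0, 0]) ** 2)

def sample_grad(M, x, y):
    """Analytic gradient -2 x^T (y - xM), returned with shape (d, 1)."""
    return -2.0 * x.T * (y - (x @ M)[0, 0])

# Made-up sample and parameter vector (assumptions for the check only)
rng = np.random.default_rng(2)
x = rng.normal(size=(1, 3))
y = 1.5
M = rng.normal(size=(3, 1))

# Central finite difference, one coordinate of M at a time
h = 1e-6
numeric = np.zeros_like(M)
for k in range(M.shape[0]):
    step = np.zeros_like(M)
    step[k, 0] = h
    numeric[k, 0] = (sample_loss(M + step, x, y) - sample_loss(M - step, x, y)) / (2 * h)

max_gap = np.max(np.abs(numeric - sample_grad(M, x, y)))
print("largest difference between numeric and analytic gradient:", max_gap)
```

Because the loss is quadratic in $M$, the central difference matches the analytic gradient up to floating-point rounding.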

In [201]:
M0 = np.random.normal(0,1,size=(X.shape[1],1))
lr = .0001
epsilon=.00001

In [202]:
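losses=[]
for j in range(10000):                      # one pass over the data per outer iteration
    for i in range(X.shape[0]):             # rows are visited in file order, not shuffled
        x_i = X[i,:].reshape(1,X.shape[1])  # one sample as a (1, n_features) row
        target = Y[i,0]
        # gradient of (target - x_i@M0)^2 with respect to M0, as a row vector
        grad = -2*x_i*(target-((x_i@M0)[0,0]))
        M = M0 - lr*grad.T
        M0 = M.copy()
    MSE = np.sum(np.square(Y-(X@M)))        # sum of squared errors over the full dataset
    losses.append(MSE)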

In [203]:
M0

array([[ 1.78866833],
       [-3.48161702],
       [ 6.05914503]])
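
The loop above visits the rows in the same order on every pass. A common variant reshuffles the rows at the start of each pass. The sketch below shows that variant on synthetic data; the function name `sgd_linear`, its default learning rate and epoch count, and the values in `true_M` are all assumptions made for illustration, not settings from the lab.

```python
import numpy as np

def sgd_linear(X, Y, lr=1e-2, epochs=50, seed=0):
    """Single-sample SGD for least squares, reshuffling the rows every epoch."""
    rng = np.random.default_rng(seed)
    M = rng.normal(size=(X.shape[1], 1))
    for _ in range(epochs):
        for i in rng.permutation(X.shape[0]):   # new random row order each epoch
            x = X[i:i + 1, :]                   # one sample, shape (1, d)
            resid = Y[i, 0] - (x @ M)[0, 0]
            # gradient is -2 * x.T * resid, so the descent step adds 2 * lr * resid * x.T
            M = M + 2.0 * lr * resid * x.T
    return M

# Synthetic data (assumption): 2 features plus a ones column, known coefficients, small noise
rng = np.random.default_rng(3)
X_demo = np.concatenate([rng.normal(size=(200, 2)), np.ones((200, 1))], axis=1)
true_M = np.array([[2.0], [-3.0], [6.0]])  # hypothetical coefficients
Y_demo = X_demo @ true_M + 0.1 * rng.normal(size=(200, 1))

M_sgd = sgd_linear(X_demo, Y_demo)
print(M_sgd.ravel())
```

With a fixed learning rate the estimate keeps jittering around the least-squares solution rather than settling exactly on it; a smaller `lr` reduces the jitter at the cost of slower progress.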

In [204]:
len(losses)

10000

In [205]:
f3=figure(title='Loss vs passes over the data for stochastic gradient descent',x_axis_label='Passes (x100)',y_range=[0,100])
f3.scatter(x=[z/100 for z in list(range(0,10000,100))],y=losses[::100])
show(f3)