In [1]:
%load_ext autoreload
%autoreload 2
import sys
sys.path.append('../src/')

In [52]:
from data.create_dataset import *
from visualization.visualize import *
from modelling import ols, ridge
from model_evaluation.metrics import *
from processing.data_preprocessing import *
from utils.utils import *
import numpy as np
from sklearn.model_selection import train_test_split

In [48]:
import numpy as np

In [3]:
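# Random starting guess for the coefficients (overwritten by the fits below)
beta = np.random.randn(2, 1)

# Synthetic data: y = 4 + 3x + Gaussian noise, with x uniform on [0, 2)
n = 1000
x = 2 * np.random.rand(n, 1)
y = 4 + 3 * x + np.random.randn(n, 1)

# Design matrix with a leading column of ones for the intercept
X = np.c_[np.ones((n, 1)), x]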

In [20]:
beta_real = ols.fit_beta(X,y)
beta_real

array([[3.82630258],
       [3.09085066]])

In [16]:
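from sklearn.linear_model import SGDRegressor
# penalty=None: no regularization; tol=None: no early stopping, run all max_iter epochs
sgdreg = SGDRegressor(max_iter=100, penalty=None, eta0=0.01, tol=None)
# Fit on x (not X): SGDRegressor adds its own intercept
sgdreg = sgdreg.fit(x, y.ravel())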

In [17]:
[sgdreg.intercept_,sgdreg.coef_]

[array([3.82248455]), array([3.08709076])]

In [23]:
beta = ols.fit_beta_sgd(X,y,0.01,100,100)
beta

array([[3.82203767],
       [3.09061134]])

So when making a schedule for the learning rate (lr), for instance using
$
\gamma = \frac{t_0}{e \cdot m + i + t_1}
$
where e is the epoch, m the number of batches per epoch and i the batch index, it seems that the lower t1 is, the faster the lr decreases in the first epochs. t0 mostly scales the curve vertically: the starting lr is t0/t1, and the relative decay from one step to the next does not depend on t0. t1 simply shifts the curve horizontally, which is why a lower t1 gives a faster decrease in the beginning, i.e. the vertical asymptote is at -t1.
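To check the effect of t0 and t1 numerically, here is a minimal sketch of the schedule above. The function name `learning_schedule`, the default values and the choice of m = 10 batches per epoch are assumptions for illustration, not taken from the `ols` module.

```python
def learning_schedule(epoch, i, m, t0=1.0, t1=10.0):
    """Return t0 / (epoch * m + i + t1) for batch i of the given epoch."""
    t = epoch * m + i  # global step counter
    return t0 / (t + t1)


m = 10  # assumed number of mini-batches per epoch

for t0, t1 in [(1.0, 10.0), (1.0, 50.0), (5.0, 50.0)]:
    start = learning_schedule(0, 0, m, t0, t1)      # lr at the very first step
    after_one = learning_schedule(1, 0, m, t0, t1)  # lr after one full epoch
    print(f"t0={t0}, t1={t1}: start={start:.4f}, "
          f"after 1 epoch={after_one:.4f}, ratio={after_one / start:.3f}")
```

The ratio after one epoch is t1 / (m + t1), so it only depends on t1: a smaller t1 gives a steeper relative drop early on, while t0 only rescales every value.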

So why use a learning schedule? It seems to be mostly so that the optimization converges faster: it lets you use a large learning rate in the beginning and a small enough one at the end. Note 'small enough': recall that the learning rate needs to be sufficiently small to ensure convergence. However, should it not converge anyway, given enough n_epochs? With learning rate 1 and 10000 epochs it ends up in the neighbourhood of the real betas, though not very well. Presumably this is because, with a constant rate, the noise from the mini-batch gradients never shrinks, so more epochs alone do not close the gap.

In [31]:
beta = ols.fit_beta_sgd(X,y,1,100,10000)
beta

array([[5.06058942],
       [1.87303311]])

How to implement this? If I use the schedule above, I need to define t0 and t1. Should I make a class out of it? That may be overkill. I could simply fix some values, since I don't want another parameter to tweak when it only alters the training speed.

In [34]:
beta = ols.fit_beta_sgd(X,y,0.1,100,100)
beta

array([[3.84556117],
       [3.07184904]])

Maybe make the learning schedule optional? In that case, how?
I want to pass a single argument that is either a number, meaning a constant learning rate, or a schedule that decides the learning rate from the epoch and batch. The thing is, I don't want an if statement inside the loop, which is slow. I want a function that returns the same lr when set to constant, and a varying one when set to schedule. In the first case it only needs lr as input and returns it unchanged; in the other it needs the epoch, the batch and the number of batches. So either way, the inputs would have to differ between the two cases. I don't think that can be done directly, unless I pass a list? That might work, though it might be ugly. And why even a list? I would still have to pass every parameter to both functions and let each one use only what it needs. I can still do this...
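One possible way out is to always work with a callable that takes (epoch, i, m), and to wrap a plain number in a function that ignores those arguments. The branch then happens once, before the loop. This is only a sketch: `constant_lr`, `decaying_lr`, `as_schedule` and `sgd_step_sizes` are hypothetical names, not functions from the `ols` module.

```python
def constant_lr(lr):
    """Build a schedule that ignores its arguments and always returns lr."""
    def schedule(epoch, i, m):
        return lr
    return schedule


def decaying_lr(t0, t1):
    """Build the schedule t0 / (epoch * m + i + t1)."""
    def schedule(epoch, i, m):
        return t0 / (epoch * m + i + t1)
    return schedule


def as_schedule(lr):
    """Accept either a number or a callable and always return a callable."""
    return lr if callable(lr) else constant_lr(lr)


def sgd_step_sizes(lr, n_epochs, m):
    """Return the step size used at every (epoch, batch) pair.

    Stands in for the real SGD loop: the only branch on the type of lr
    is inside as_schedule, which runs once before the loop.
    """
    schedule = as_schedule(lr)
    return [schedule(epoch, i, m) for epoch in range(n_epochs) for i in range(m)]


print(sgd_step_sizes(0.1, n_epochs=2, m=3))
print(sgd_step_sizes(decaying_lr(1.0, 10.0), n_epochs=2, m=3))
```

Both schedules share the same signature, so the loop body is identical in the two cases and no list of parameters is needed.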

In [69]:
beta = ols.fit_beta_sgd(X,y, batch_size=100,n_epochs=100)
beta

array([[3.84313045],
       [3.07424732]])

Now we do SGD on ridge. I guess the only difference is the cost function, which adds a penalty term and therefore one extra term in the gradient.
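As a sketch of that difference, the two mini-batch gradients could look like this. It assumes the cost is the mean squared error over the batch plus lam * ||beta||^2, with the intercept penalized as well; the actual `ridge.fit_beta_sgd` may scale or penalize differently, and `ols_gradient`, `ridge_gradient` and `lam` are hypothetical names.

```python
import numpy as np


def ols_gradient(X_b, y_b, beta):
    """Gradient of the mean squared error on one mini-batch."""
    return 2.0 / len(X_b) * X_b.T @ (X_b @ beta - y_b)


def ridge_gradient(X_b, y_b, beta, lam):
    """OLS gradient plus the gradient of the penalty lam * ||beta||^2."""
    return ols_gradient(X_b, y_b, beta) + 2.0 * lam * beta


# Tiny hand-made batch, just to compare the two gradients
X_b = np.array([[1.0, 0.5], [1.0, 1.5]])
y_b = np.array([[1.0], [2.0]])
beta = np.array([[0.5], [1.0]])

print(ols_gradient(X_b, y_b, beta))
print(ridge_gradient(X_b, y_b, beta, lam=0.1))
```

With lam = 0 the two gradients coincide, so under these assumptions the rest of the SGD loop can stay the same for both models.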

In [64]:
beta_ridge = ridge.fit_beta_sgd(X,y, batch_size=100,n_epochs=10,lr=0.01)
beta_ridge

array([[5.20757294],
       [1.72801081]])

In [61]:
from sklearn.linear_model import SGDRegressor
sgdreg = SGDRegressor(max_iter = 100, penalty='l2', eta0=0.01, tol=None)
sgdreg = sgdreg.fit(x,y.ravel())
[sgdreg.intercept_,sgdreg.coef_]

[array([3.8238777]), array([3.08625028])]