# Multivariate Regression

The ml library uses scikit-learn under the hood, so the approach we used before for multivariate and polynomial data also works here.

# Data

We use the advertising dataset. Bear in mind that when features have very different scales, gradient descent may have trouble converging. It often helps to standardize each feature to z-scores (zero mean, unit variance), which we do by supplying scale=True. We discuss scaling further in the part on data preprocessing.

In [1]:
from ml import *
data = advertising('Sales', scale=True)
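How `scale=True` is implemented inside the ml library is not shown here, so the following is only a minimal sketch of what z-score standardization does, assuming it is applied per feature column. The array `X` is made-up toy data, not the advertising dataset.

```python
import numpy as np

# Hypothetical toy features on very different scales (not the advertising data).
X = np.array([[230.1, 37.8, 0.5],
              [44.5, 39.3, 1.2],
              [17.2, 45.9, 0.1],
              [151.5, 41.3, 0.9]])

# z-score each column: subtract the column mean, divide by the column standard deviation.
mean = X.mean(axis=0)
std = X.std(axis=0)
X_scaled = (X - mean) / std

print(X_scaled.mean(axis=0))  # each column mean is approximately 0
print(X_scaled.std(axis=0))   # each column standard deviation is approximately 1
```

When a separate validation or test set is used, the mean and standard deviation are normally computed on the training set only and then reused to transform the other sets.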

# Model

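Use the SGDRegressor with its default squared-error loss (named 'squared_loss' in older scikit-learn releases and 'squared_error' in newer ones) and an initial learning rate eta0 of 0.01. Note that in SGDRegressor the parameter named alpha is the regularization strength, not the learning rate; with penalty=None no regularization is applied.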

In [2]:
model = SGDRegressor(eta0=1e-2, learning_rate='invscaling', penalty=None)
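With learning_rate='invscaling', eta0 is only the starting step size. scikit-learn documents the schedule as eta = eta0 / pow(t, power_t), where power_t defaults to 0.25 for SGDRegressor. The helper below is a hypothetical illustration of that formula, not a scikit-learn function, and it assumes t counts the weight updates performed so far.

```python
def invscaling_eta(eta0, t, power_t=0.25):
    """Step size after t updates under the 'invscaling' schedule (illustrative helper)."""
    return eta0 / (t ** power_t)

# The step size shrinks slowly as the number of updates grows.
for t in [1, 16, 256, 10_000]:
    print(t, invscaling_eta(0.01, t))
```

Because the step size decays, a value of eta0 that looks slightly too large at the start can still settle down later in training.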

# Training

We see that the loss converges quite quickly to around 2.90, which is very close to what we obtained using the normal equation.

In [3]:
for epoch in range(101):
    # One pass of SGD updates over the training set
    model.partial_fit(data.train_X, data.train_y)
    if epoch % 10 == 0:
        # Report the training MSE every 10 passes
        y_predict = model.predict(data.train_X)
        print(mean_squared_error(data.train_y, y_predict))

80.06753795218285
3.0789260599368515
2.898927949103308
2.897049739085272
2.8969786108073494
2.8969845495829567
2.8969723913170453
2.896996546267101
2.89697701132697
2.8969669401701976
2.896967462954102
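The least-squares solution from the normal equation gives the lowest training MSE a linear model can reach, so an iterative method should approach that value and never drop below it. The sketch below illustrates this on synthetic stand-in data (an assumption, not the advertising dataset) and swaps in plain batch gradient descent with a fixed step size instead of SGDRegressor to keep the example self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (an assumption; not the advertising dataset).
n, d = 200, 3
X = rng.normal(size=(n, d))  # features that are already roughly standardized
true_w = np.array([4.0, 3.0, 0.1])
y = 14.0 + X @ true_w + rng.normal(scale=1.7, size=n)

# Add a column of ones so the intercept is learned as an extra weight.
Xb = np.hstack([np.ones((n, 1)), X])

# Least-squares solution, equivalent to solving the normal equation.
w_exact, *_ = np.linalg.lstsq(Xb, y, rcond=None)
mse_exact = np.mean((Xb @ w_exact - y) ** 2)

# Batch gradient descent on the same mean squared error.
w = np.zeros(d + 1)
eta = 0.1
for _ in range(500):
    grad = 2.0 / n * Xb.T @ (Xb @ w - y)
    w -= eta * grad
mse_gd = np.mean((Xb @ w - y) ** 2)

print(mse_exact, mse_gd)  # the two values should agree closely
```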


# Finding a good learning rate

To find a suitable learning rate $\alpha$, try a few values from the sequence 1e-5, 3e-5, 1e-4, 3e-4, 1e-3, and so on. When the learning rate is too high, the model does not converge. When it is too low, the model converges very slowly. The sweet spot is usually easy to recognize: the loss drops quickly and then stays stable.

We move the training loop into a function so we can try out different values.

In [4]:
def learn(𝛼):
    # Train a fresh model with initial learning rate 𝛼 and report the training MSE
    model = SGDRegressor(eta0=𝛼, learning_rate='invscaling', penalty=None)
    for epoch in range(101):
        model.partial_fit(data.train_X, data.train_y)
        if epoch % 10 == 0:
            y_predict = model.predict(data.train_X)
            print(𝛼, mean_squared_error(data.train_y, y_predict))

In [5]:
for 𝛼 in [1e-5, 3e-5, 1e-4, 3e-4, 1e-3, 3e-3, 1e-2, 3e-2, 1e-1, 3e-1, 1, 3]:
    print('----')
    learn(𝛼)

----
1e-05 258.5584435831246
1e-05 257.01064398534857
1e-05 255.85755306511993
1e-05 254.8440516487964
1e-05 253.91415007191193
1e-05 253.0429941491671
1e-05 252.21664829057016
1e-05 251.4262403917478
1e-05 250.66562541409704
1e-05 249.93043385119714
1e-05 249.21732226680652
----
3e-05 257.9749410400494
3e-05 253.36891961783607
3e-05 249.97488059461585
3e-05 247.0169621786718
3e-05 244.32396663921682
3e-05 241.8193869119451
3e-05 239.4596880897243
3e-05 237.21717035553303
3e-05 235.07273091577622
3e-05 233.01247944727544
3e-05 231.02569800695454
----
0.0001 255.7551312667668
0.0001 240.84170134693295
0.0001 230.26537739331587
0.0001 221.32034592769622
0.0001 213.39457652018547
0.0001 206.20573827240037
0.0001 199.59203127474444
0.0001 193.44802007331606
0.0001 187.69984089402732
0.0001 182.29292751919897
0.0001 177.18515961587264
----
0.0003 249.82173727808703
0.0003 208.66367835820247
0.0003 182.43375982047615
0.0003 162.07803229589655
0.0003 145.38242666426677
0.0003 131.289753760846

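Comparing the runs, $\alpha = 0.01$ converges best among the values tried on this dataset. Much smaller values are far too slow: with $\alpha$ = 1e-5 the loss is still above 249 after 100 passes.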
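The three regimes described above (too low, about right, too high) can be reproduced in a small self-contained sketch. It uses synthetic data and fixed-step batch gradient descent, both assumptions made for illustration, so the specific learning rates do not carry over to SGDRegressor or to the advertising dataset.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic regression problem with two roughly standardized features.
n = 100
X = rng.normal(size=(n, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=n)

def final_mse(eta, steps=100):
    """Run fixed-step batch gradient descent and return the final training MSE."""
    w = np.zeros(2)
    for _ in range(steps):
        grad = 2.0 / n * X.T @ (X @ w - y)
        w -= eta * grad
    return float(np.mean((X @ w - y) ** 2))

# Too low: barely moves. Moderate: converges. Too high: the loss blows up.
for eta in [1e-4, 1e-2, 1e-1, 1.5]:
    print(eta, final_mse(eta))
```

Stepping through values roughly a factor of 3 apart, as in the sequence above, is usually enough to locate the range where the loss falls quickly without diverging.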