# SGDClassifier

In [1]:
from sklearn.linear_model import SGDClassifier

In [2]:
import sys
sys.path.append('..')
from utils import *

In [3]:
x_train, y_train = train_data()
x_val, y_val = val_data()
bl = baseline(x_train, y_train, x_val, y_val)
bl

0.69245946059591246

Scikit-learn is the obvious choice for classical ML, and what best than to start by following the sklearn cheat-sheet for choosing what model to use.

In [4]:
from IPython.display import Image
Image(url= "http://scikit-learn.org/stable/_static/ml_map.png")

First question: classification or regression?

Well, the task is binary classification, but Numer.ai asks for the probability of positive, so it's technically regression, like in logistic regression. Most classifiers in sklearn return the probabilities though, and I read in the forums that classification is better fitted to the task, so let's start with that.

Second question: <100K training samples?

Let's see.

In [5]:
x_train.shape

(535713, 50)

Nope.

Answer: SGDClassifier

This is a bit of a cop out, it just means: use a linear classifier. Using SGD means that we have more freedom when choosing the loss, as we don't need an analytical solution.

## Out-of-the-box

Let's first try it without any tuning. It defaults to SVM with L2 regularization, I believe it doesn't use any kernel transformation, it's just a hinge loss.

Unfortunately, and predictably, probability estimation is only available for LogReg and Huber, let's try both.

### LogReg

In [6]:
%time model = SGDClassifier(loss='log').fit(x_train, y_train)

CPU times: user 1.18 s, sys: 60 ms, total: 1.24 s
Wall time: 1.24 s


In [7]:
y_val_pred = model.predict_proba(x_val)
validate(y_val, y_val_pred, bl, True)

(0.69306060260318614, 8.6577956759148478e-05, -0.00060114200727368061)

### Huber

In [9]:
%time model = SGDClassifier(loss='modified_huber').fit(x_train, y_train)

CPU times: user 1.12 s, sys: 64 ms, total: 1.19 s
Wall time: 1.19 s


In [10]:
y_val_pred = model.predict_proba(x_val)
validate(y_val, y_val_pred, bl, True)

(0.71230993567454437, -0.019162755114599084, -0.019850475078631913)

## Binary predictions

Numer.ai may accept probabilities, but what if I stick to binary? It is an interesting notion, let's test it.

We'll just use the default SVM setting.

In [11]:
%time model = SGDClassifier().fit(x_train, y_train)

CPU times: user 1.05 s, sys: 72 ms, total: 1.12 s
Wall time: 1.12 s


In [12]:
y_val_pred = model.predict(x_val)
validate(y_val, y_val_pred, bl, True)

(16.782063661454917, -16.088916480894973, -16.089604200859004)

I see... Bad idea...

## Kernel approximation

I could play with hyperparameter tuning, but linear models all seem the same, no improvements from the baseline.

I have the assumption that the feature vectors are very abstract because of the encryption. I think the encryption is done using deep models, so I will definitely need more non-linearity if I'm looking for a single model solution, let's move in that direction.

The sklearn cheat-sheet proposes to move to kernel approximation, which makes a lot of sense. It's a scalable alternative to the kernel trick, and introduces sweet sweet non-linearity to play.

Let's start easy, RBF is a classic kernel, my ML prof always told be to choose it by default. Let's go then.

In [13]:
from sklearn.kernel_approximation import RBFSampler

In [14]:
rbf = RBFSampler()

In [15]:
x_train_rbf = rbf.fit_transform(x_train)
x_val_rbf = rbf.fit_transform(x_val)

In [16]:
model = SGDClassifier(loss='log').fit(x_train_rbf, y_train)

In [17]:
y_val_pred = model.predict_proba(x_val_rbf)
validate(y_val, y_val_pred, bl, True)

(0.6954379452472661, -0.0022907646873208121, -0.0029784846513536412)

Oh man, that's too bad...

## Polynomial features

If the encryption is done by using deep models, then a polynomial basis may work better. It's also common practice, although not very scalable. Let's see.

In [18]:
from sklearn.preprocessing import PolynomialFeatures

In [19]:
poly = PolynomialFeatures(degree=2)

In [20]:
x_train_poly = poly.fit_transform(x_train)

In [21]:
model = SGDClassifier(loss='log').fit(x_train_poly, y_train)

In [22]:
del x_train_poly

In [23]:
x_val_poly = poly.fit_transform(x_val)

In [24]:
y_val_pred = model.predict_proba(x_val_poly)

In [25]:
del x_val_poly

In [27]:
validate(y_val, y_val_pred, bl, True)

(0.80222230564711217, -0.10907512508716688, -0.10976284505119971)

Actually, pretty bad...

## AdaBoost

Now that we are working with linear models, why not use boosting. Boosting improves weak learners, it may help. Boosting is common with Decision Trees, and we'll try that at some other point, but let's stick to SGDClassifier for now.

In [28]:
from sklearn.ensemble import AdaBoostClassifier

In [29]:
model = AdaBoostClassifier(
    base_estimator=SGDClassifier(loss='log')
).fit(x_train, y_train)

In [30]:
y_val_pred = model.predict_proba(x_val)
validate(y_val, y_val_pred, bl, True)

(0.6931465190819549, 6.6147799038240862e-07, -0.00068705848604244668)

Just for curiosity, let's try the default tree AdaBoost.

In [31]:
model = AdaBoostClassifier().fit(x_train, y_train)

In [32]:
y_val_pred = model.predict_proba(x_val)
validate(y_val, y_val_pred, bl, True)

(0.69309945634692993, 4.7724213015354344e-05, -0.00063999575101747475)

## Conclusions

Alright, nothing much I can do to alter the results of these linear models. I should go to somewhere else.

Looking at the results we can see that the predictions barely deviate from 0.5, the models have no idea what they are doing, I need something deeper.

Actually, I may still be reinventing the wheel too much, there's plenty of people that have done this before me. Let's look at their work some more.