# SGDClassifier

In [1]:
from sklearn.linear_model import SGDClassifier

In [2]:
import sys
sys.path.append('..')
from utils import *

In [3]:
x_train, y_train = train_data()
x_val, y_val = val_data()
bl = baseline(x_train, y_train, x_val, y_val)
bl

0.69245946059591246

Scikit-learn is the obvious choice for classical ML, and what best than to start by following the sklearn cheat-sheet for choosing what model to use.

In [3]:
from IPython.display import Image
Image(url= "http://scikit-learn.org/stable/_static/ml_map.png")

First question: classification or regression?

Well, the task is binary classification, but Numer.ai asks for the probability of positive, so it's technically regression, like in logistic regression. Most classifiers in sklearn return the probabilities though, and I read in the forums that classification is better fitted to the task, so let's start with that.

Second question: <100K training samples?

Let's see.

In [4]:
x_train.shape

(535713, 50)

Nope.

Answer: SGDClassifier

This is a bit of a cop out, it just means: use a linear classifier. Using SGD means that we have more freedom when choosing the loss, as we don't need an analytical solution.

## Out-of-the-box

Let's first try it without any tuning. It defaults to SVM with L2 regularization, I believe it doesn't use any kernel transformation, it's just a hinge loss.

Unfortunately, and predictably, probability estimation is only available for LogReg and Huber, let's try both.

### LogReg

In [14]:
%time model = SGDClassifier(loss='log').fit(x_train, y_train)

CPU times: user 1.22 s, sys: 80 ms, total: 1.3 s
Wall time: 1.3 s


In [15]:
y_val_pred = model.predict_proba(x_val)
logloss = validate(y_val, y_val_pred)
logloss

0.69281484314352071

In [16]:
compare(bl, logloss)

-0.00051321783848894034

### Huber

In [23]:
%time model = SGDClassifier(loss='modified_huber').fit(x_train, y_train)

CPU times: user 1.15 s, sys: 68 ms, total: 1.22 s
Wall time: 1.21 s


In [24]:
y_val_pred = model.predict_proba(x_val)
logloss = validate(y_val, y_val_pred)
logloss

0.70016548550642688

In [13]:
compare(bl, logloss)

-0.0055079038343857295

## Binary predictions

Numer.ai may accept probabilities, but what if I stick to binary? It is an interesting notion, let's test it.

We'll just use the default SVM setting.

In [31]:
%time model = SGDClassifier().fit(x_train, y_train)

CPU times: user 1.09 s, sys: 72 ms, total: 1.16 s
Wall time: 1.16 s


In [32]:
y_val_pred = model.predict(x_val)
logloss = validate(y_val, y_val_pred)
logloss

16.956232478676892

I see... Bad idea...

## Kernel approximation

I could play with hyperparameter tuning, but linear models all seem the same, no improvements from the baseline.

I have the assumption that the feature vectors are very abstract because of the encryption. I think the encryption is done using deep models, so I will definitely need more non-linearity if I'm looking for a single model solution, let's move in that direction.

The sklearn cheat-sheet proposes to move to kernel approximation, which makes a lot of sense. It's a scalable alternative to the kernel trick, and introduces sweet sweet non-linearity to play.

Let's start easy, RBF is a classic kernel, my ML prof always told be to choose it by default. Let's go then.

In [70]:
from sklearn.kernel_approximation import RBFSampler

In [71]:
rbf = RBFSampler()

In [72]:
x_train_rbf = rbf.fit_transform(x_train)
x_val_rbf = rbf.fit_transform(x_val)

In [73]:
model = SGDClassifier(loss='log').fit(x_train_rbf, y_train)

In [74]:
y_val_pred = model.predict_proba(x_val_rbf)
logloss = validate(y_val, y_val_pred)
logloss

0.69573908540301288

In [75]:
compare(bl, logloss)

-0.0047361975591727256

Oh man, that's too bad...

## Polynomial features

If the encryption is done by using deep models, then a polynomial basis may work better. It's also common practice, although not very scalable. Let's see.

In [4]:
from sklearn.preprocessing import PolynomialFeatures

In [5]:
poly = PolynomialFeatures(degree=2)

In [6]:
x_train_poly = poly.fit_transform(x_train)

In [7]:
model = SGDClassifier(loss='log').fit(x_train_poly, y_train)

In [8]:
del x_train_poly

In [9]:
x_val_poly = poly.fit_transform(x_val)

In [10]:
y_val_pred = model.predict_proba(x_val_poly)
logloss = validate(y_val, y_val_pred)
logloss

0.69457646842529008

In [11]:
compare(bl, logloss)

-0.0030572299894000795

Nothing yet...

## Conclusions

Alright, nothing much I can do to alter the results of these linear models. I should go to somewhere else.

Actually, I may still be reinventing the wheel too much, there's plenty of people that have done this before me. Let's look at their work some more.