# Baseline

I just copied the baseline given by Numer.ai into a notebook to explore it.

## Inspect the baseline

### Imports

In [1]:
import pandas as pd
import numpy as np
from sklearn import metrics, preprocessing, linear_model

### Setup

In [2]:
# Set seed for reproducibility
np.random.seed(0)

In [4]:
# Load the data from the CSV files
training_data = pd.read_csv('data/numerai_training_data.csv', header=0)
prediction_data = pd.read_csv('data/numerai_tournament_data.csv', header=0)

We have 50 features and a binary target. There is also an `era` attribute; I assume it refers to some abstract measure of time.

In [7]:
training_data.sample(10)

Unnamed: 0,id,era,data_type,feature1,feature2,feature3,feature4,feature5,feature6,feature7,...,feature42,feature43,feature44,feature45,feature46,feature47,feature48,feature49,feature50,target
52736,n7d52e39d0e22490,era9,train,0.60401,0.5052,0.27749,0.77684,0.62245,0.65862,0.45053,...,0.55147,0.53308,0.75621,0.47978,0.42135,0.46774,0.55192,0.6271,0.66621,0
431805,n2a27c0ea6ad84d8,era69,train,0.40248,0.55483,0.53647,0.54903,0.56724,0.56572,0.35852,...,0.56192,0.48128,0.59425,0.44154,0.46792,0.54394,0.64569,0.52086,0.52996,0
488688,n98842ea1ac8843a,era78,train,0.29641,0.47708,0.48294,0.31919,0.53438,0.53539,0.59678,...,0.37949,0.47633,0.44345,0.63916,0.5452,0.69952,0.56312,0.33349,0.50899,1
526910,n07b0e792e0e845e,era84,train,0.37686,0.48353,0.51329,0.55529,0.57198,0.44464,0.45682,...,0.49065,0.4256,0.5025,0.60277,0.46578,0.56232,0.5451,0.43592,0.49101,0
349947,n1f9ca0dfccd14d6,era56,train,0.33096,0.62125,0.69332,0.43064,0.48083,0.67085,0.53164,...,0.49245,0.48688,0.51067,0.498,0.47322,0.70052,0.57413,0.47063,0.53732,1
59891,n1d05d8218c4d4fd,era10,train,0.77222,0.42438,0.44725,0.66136,0.42622,0.8767,0.48996,...,0.53129,0.41772,0.5112,0.5048,0.4812,0.53022,0.58179,0.55974,0.74943,1
327607,n19b1158aad1d4fd,era52,train,0.69019,0.60163,0.48076,0.83235,0.60487,0.49129,0.45402,...,0.65561,0.43834,0.79491,0.51193,0.3539,0.40164,0.65208,0.74008,0.67859,1
165885,n181a8e02bb024e7,era27,train,0.44236,0.35316,0.50878,0.44798,0.64075,0.52911,0.31184,...,0.7071,0.53409,0.50295,0.60301,0.44653,0.57183,0.46365,0.50258,0.60877,0
204807,n21814d6ab2ee485,era32,train,0.53924,0.54476,0.72279,0.70324,0.41926,0.46674,0.77306,...,0.57224,0.17847,0.53731,0.44714,0.3634,0.47561,0.56544,0.56468,0.61561,0
440712,n3d62a2ef250e4a5,era71,train,0.56535,0.42849,0.6493,0.57886,0.38304,0.34199,0.55597,...,0.51514,0.43316,0.45001,0.50838,0.69629,0.44154,0.48275,0.62332,0.63393,0


The tournament file has the same columns. It mixes validation rows, which carry a target and can be used for local evaluation, with test rows, whose target is empty and has to be predicted.

In [58]:
prediction_data.sample(10)

Unnamed: 0,id,era,data_type,feature1,feature2,feature3,feature4,feature5,feature6,feature7,...,feature42,feature43,feature44,feature45,feature46,feature47,feature48,feature49,feature50,target
251499,ne7b4a9e6aafc44d,eraX,test,0.38958,0.55354,0.63414,0.21006,0.48615,0.5438,0.56129,...,0.31083,0.49531,0.34462,0.45604,0.60231,0.61381,0.50496,0.35649,0.33639,
190096,nda26afa259744b9,eraX,test,0.41494,0.44686,0.62303,0.46883,0.60564,0.36824,0.36287,...,0.40904,0.56414,0.49443,0.4657,0.57135,0.56458,0.40636,0.47398,0.3529,
298832,n0d6d7d23fc1449b,eraX,test,0.43008,0.41066,0.46273,0.41919,0.68978,0.47479,0.35709,...,0.58541,0.68744,0.58057,0.49921,0.49758,0.63115,0.33069,0.50399,0.43213,
26609,n321035196b7d4c5,era90,validation,0.41154,0.66604,0.53876,0.60045,0.45295,0.63023,0.50169,...,0.49267,0.51883,0.48751,0.52842,0.63063,0.13366,0.62943,0.46904,0.47784,1.0
296122,n6c04848b42a343c,eraX,test,0.42362,0.31957,0.53045,0.50261,0.4898,0.48713,0.51283,...,0.68447,0.42182,0.39916,0.63896,0.55254,0.51753,0.49028,0.62738,0.7337,
109166,n3009230e49b1430,eraX,test,0.49058,0.41641,0.6113,0.40676,0.38886,0.55508,0.54577,...,0.51131,0.51467,0.32632,0.68377,0.5647,0.53813,0.61556,0.53201,0.40087,
101859,n7f282c3cfeed41d,eraX,test,0.47482,0.37269,0.62005,0.62109,0.54101,0.61744,0.52978,...,0.42139,0.40316,0.38581,0.56813,0.47336,0.4788,0.68585,0.3681,0.60581,
18385,n1bc340e974d642c,era89,validation,0.65194,0.38285,0.61532,0.63747,0.35383,0.3618,0.51359,...,0.40007,0.49838,0.51419,0.47875,0.60149,0.32767,0.57218,0.67672,0.76421,1.0
156909,n6eea6e7e7514453,eraX,test,0.42493,0.78626,0.56307,0.48817,0.6051,0.52181,0.3275,...,0.37399,0.48465,0.65622,0.3494,0.6353,0.38745,0.69425,0.40842,0.46666,
73484,n1860c054b9e1480,era97,validation,0.52323,0.46474,0.68431,0.47072,0.45769,0.43222,0.72275,...,0.48189,0.41147,0.41047,0.6064,0.57087,0.67208,0.46943,0.52679,0.38084,0.0
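
The `era` and `data_type` columns visible in both samples can be summarized with a couple of pandas calls. The sketch below uses a small made-up frame (`toy`, with invented rows) rather than the real files, so the numbers are illustrative only; the same calls should work on `prediction_data` as loaded above.

```python
import pandas as pd

# Toy stand-in for the tournament file: same column names, made-up rows.
toy = pd.DataFrame({
    "era": ["era89", "era89", "era90", "eraX", "eraX"],
    "data_type": ["validation", "validation", "validation", "test", "test"],
    "target": [1.0, 0.0, 1.0, float("nan"), float("nan")],
})

# Rows per split: how much of the file is labeled validation vs. unlabeled test.
rows_per_split = toy["data_type"].value_counts()

# Share of positive targets per era; missing targets are skipped by mean(),
# so an era with no labels at all comes out as NaN.
positive_rate = toy.groupby("era")["target"].mean()

print(rows_per_split)
print(positive_rate)
```

On the real data this gives a quick check of how many eras are labeled and whether the target is roughly balanced within each of them.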


In [12]:
# Select the feature columns, the target and the ids (these stay pandas objects)
features = [f for f in list(training_data) if "feature" in f]
X = training_data[features]
Y = training_data["target"]
x_prediction = prediction_data[features]
ids = prediction_data["id"]

### Training

Naturally, the baseline is a logistic regression.

In [10]:
# This is your model that will learn to predict
model = linear_model.LogisticRegression(n_jobs=-1)

In [13]:
# Your model is trained on the training_data
model.fit(X, Y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=-1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

### Prediction

In [15]:
# Your trained model is now used to make predictions on the numerai_tournament_data
# The model returns two columns: [probability of 0, probability of 1]
# We are just interested in the probability that the target is 1.
y_prediction = model.predict_proba(x_prediction)
results = y_prediction[:, 1]
results_df = pd.DataFrame(data={'probability':results})
joined = pd.DataFrame(ids).join(results_df)

In [16]:
# Save the predictions out to a CSV file
joined.to_csv("predictions.csv", index=False)
# Now you can upload these predictions on numer.ai
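
The comment above says column 1 of `predict_proba` is the probability of class 1. In scikit-learn the columns follow the order of the fitted `classes_` attribute, so for 0/1 labels that holds. A minimal sketch on synthetic data (`X_toy`, `y_toy` and `clf` are made-up names, not part of the baseline) that looks the column up instead of hard-coding it:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem, only for illustration.
rng = np.random.RandomState(0)
X_toy = rng.rand(200, 3)
y_toy = (X_toy[:, 0] + 0.1 * rng.randn(200) > 0.5).astype(int)

clf = LogisticRegression().fit(X_toy, y_toy)
proba = clf.predict_proba(X_toy)

# The columns of predict_proba are ordered like clf.classes_,
# so find the column that belongs to label 1 explicitly.
positive_col = list(clf.classes_).index(1)
p_one = proba[:, positive_col]

print(clf.classes_, positive_col)
```

For this dataset the lookup is redundant, but it guards against surprises if the labels are ever encoded differently.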

## Writing utility functions

There's some data wrangling code that every later model will need, so let's collect it in a simple module.

The following functions will be copied to `../utils.py`.

### Training data

In [28]:
def train_data(path='data/numerai_training_data.csv'):
    
    training_data = pd.read_csv(path, header=0)
    features = [f for f in list(training_data) if "feature" in f]
    
    X = training_data[features]
    Y = training_data["target"]
    
    return X, Y

In [30]:
x, y = train_data()

In [32]:
x.sample(3)

Unnamed: 0,feature1,feature2,feature3,feature4,feature5,feature6,feature7,feature8,feature9,feature10,...,feature41,feature42,feature43,feature44,feature45,feature46,feature47,feature48,feature49,feature50
497225,0.52618,0.57091,0.62041,0.58871,0.52077,0.46767,0.44714,0.4266,0.44727,0.50927,...,0.37827,0.48779,0.51719,0.59788,0.38803,0.4622,0.46273,0.63752,0.53769,0.385
209143,0.44124,0.37456,0.5737,0.46306,0.62956,0.57386,0.28982,0.56695,0.45397,0.42695,...,0.50993,0.40892,0.58595,0.52257,0.41568,0.46965,0.62172,0.65691,0.434,0.46475
269280,0.5427,0.62655,0.64731,0.39688,0.57367,0.58154,0.4863,0.40618,0.5731,0.46558,...,0.45807,0.32236,0.49103,0.63213,0.38263,0.55201,0.65261,0.70341,0.43373,0.44353


In [33]:
y.sample(3)

312606    0
523072    1
42070     1
Name: target, dtype: int64

### Prediction data

In [43]:
def pred_data(path='data/numerai_tournament_data.csv'):
    prediction_data = pd.read_csv(path, header=0)
    features = [f for f in list(prediction_data) if "feature" in f]
    x_prediction = prediction_data[features]
    ids = prediction_data["id"]
    
    return x_prediction, ids

In [44]:
x_pred, ids = pred_data()

In [45]:
x_pred.sample(3)

Unnamed: 0,feature1,feature2,feature3,feature4,feature5,feature6,feature7,feature8,feature9,feature10,...,feature41,feature42,feature43,feature44,feature45,feature46,feature47,feature48,feature49,feature50
12465,0.32372,0.46998,0.32368,0.38455,0.67962,0.56765,0.38017,0.43107,0.59212,0.42502,...,0.27537,0.34188,0.51467,0.50906,0.51943,0.49347,0.64019,0.52034,0.27359,0.47895
177281,0.41085,0.41434,0.66978,0.40757,0.39977,0.45228,0.42933,0.48106,0.48825,0.46429,...,0.48941,0.41247,0.48232,0.45119,0.57648,0.51066,0.56564,0.62927,0.46496,0.50244
190416,0.69441,0.59084,0.58576,0.46836,0.36185,0.44898,0.60011,0.56373,0.56057,0.36954,...,0.55458,0.35509,0.56887,0.56848,0.32957,0.82561,0.42328,0.25735,0.60933,0.57144


In [46]:
ids.sample(3)

158518    neb88c2430b4d4d7
157821    na791c74a658a4e4
15317     n898634e441a6432
Name: id, dtype: object

### Save predictions

In [37]:
def save_pred(y_prediction, ids, path='predictions.csv'):
    results = y_prediction[:, 1]
    results_df = pd.DataFrame(data={'probability':results})
    joined = pd.DataFrame(ids).join(results_df)
    joined.to_csv(path, index=False)
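
One thing to keep in mind about `save_pred`: `DataFrame.join` aligns rows on index labels, not on position. That is fine for ids coming from `pred_data`, which keep the default 0..n-1 index, but it would silently misalign ids taken from a filtered frame such as the validation subset. A small sketch with made-up values (`ids_toy`, `proba_toy`) shows the effect and a position-based alternative:

```python
import numpy as np
import pandas as pd

# Ids taken from a filtered frame keep their original row labels (here 5, 7, 9).
ids_toy = pd.Series(["a", "b", "c"], index=[5, 7, 9], name="id")
proba_toy = np.array([[0.4, 0.6], [0.7, 0.3], [0.5, 0.5]])

# Same construction as save_pred: the probability frame gets a fresh 0..2 index,
# so no label matches and every probability ends up as NaN.
misaligned = pd.DataFrame(ids_toy).join(
    pd.DataFrame({"probability": proba_toy[:, 1]})
)

# Position-based alternative: build the frame from plain arrays.
aligned = pd.DataFrame({"id": ids_toy.to_numpy(), "probability": proba_toy[:, 1]})

print(misaligned)
print(aligned)
```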

### Validation data

In [62]:
def val_data(path='data/numerai_tournament_data.csv'):
    prediction_data = pd.read_csv(path, header=0)
    validation_data = prediction_data[prediction_data['data_type'] == 'validation']
    features = [f for f in list(validation_data) if "feature" in f]
    x_validation = validation_data[features]
    y_validation = validation_data["target"]
    
    return x_validation, y_validation

In [63]:
x_val, y_val = val_data()

In [65]:
x_val.sample(3)

Unnamed: 0,feature1,feature2,feature3,feature4,feature5,feature6,feature7,feature8,feature9,feature10,...,feature41,feature42,feature43,feature44,feature45,feature46,feature47,feature48,feature49,feature50
3240,0.50556,0.54693,0.51451,0.61103,0.5558,0.43823,0.47483,0.51428,0.34181,0.41723,...,0.32985,0.38372,0.40258,0.5418,0.54488,0.50834,0.38983,0.60164,0.51298,0.41407
10484,0.37411,0.3324,0.60859,0.41821,0.61704,0.45401,0.36619,0.46227,0.53933,0.56382,...,0.49816,0.6435,0.50944,0.52458,0.47293,0.25215,0.44624,0.61643,0.49272,0.44006
46662,0.46708,0.51316,0.62916,0.30926,0.49903,0.57531,0.54157,0.45751,0.53852,0.53643,...,0.46791,0.42254,0.56141,0.40174,0.58188,0.47765,0.60251,0.60199,0.35301,0.34892


In [66]:
y_val.sample(3)

56255    1.0
24085    0.0
2573     0.0
Name: target, dtype: float64

### Validation

The baseline has no validation step. It is important to be able to evaluate locally, so we'll try to reproduce the logloss that Numerai reports for the uploaded predictions.

Numerai gives me: 0.69245

sklearn has a `log_loss` function, which is handy; I bet it is the exact metric.

In [67]:
from sklearn.metrics import log_loss

In [69]:
y_pred = model.predict_proba(x_val)

In [70]:
log_loss(y_val, y_pred)

0.69245946059591246

Indeed, it matches the reported value. Let's put it into a function.

In [None]:
def validate(y_true, y_pred):
    return log_loss(y_true, y_pred)
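
For reference, the binary log loss is the mean negative log-likelihood of the true labels:

$$\text{logloss} = -\frac{1}{N}\sum_{i=1}^{N}\bigl[y_i \log p_i + (1 - y_i)\log(1 - p_i)\bigr]$$

where $p_i$ is the predicted probability that $y_i = 1$. The sketch below writes this out in NumPy; `binary_log_loss` and the clipping constant `eps` are my own choices for illustration, and the result should agree with sklearn's `log_loss` up to clipping details.

```python
import numpy as np

def binary_log_loss(y_true, p_one, eps=1e-15):
    """Mean negative log-likelihood for 0/1 labels and P(y = 1) predictions."""
    y = np.asarray(y_true, dtype=float)
    # Clip probabilities away from 0 and 1 so the logarithms stay finite.
    p = np.clip(np.asarray(p_one, dtype=float), eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

y_toy = np.array([1, 0, 1, 0])

# Predicting 0.5 for every row scores ln 2, whatever the labels are.
coin_flip = binary_log_loss(y_toy, np.full(4, 0.5))
print(coin_flip, np.log(2))
```

A constant prediction of 0.5 scores ln 2 ≈ 0.6931, so the 0.69245 obtained above is only slightly better than that uninformed reference.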

### All together

I've copied all the functions into `../utils.py`. Let's do a final test.

In [73]:
import sys
sys.path.append('..')
import utils

In [75]:
x_train, y_train = utils.train_data()
x_val, y_val = utils.val_data()
x_pred, ids = utils.pred_data()

model = linear_model.LogisticRegression(n_jobs=-1)
model.fit(x_train, y_train)

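y_val_pred = model.predict_proba(x_val)
# Call the copies in utils.py, not the notebook-local definitions, so the module is what gets tested
print(utils.validate(y_val, y_val_pred))

y_pred = model.predict_proba(x_pred)
utils.save_pred(y_pred, ids)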

0.692459460596


Yeah... Much nicer.