# Overview

The idea behind the Numerai tournament is to let data scientists earn money by using machine learning techniques to create predictions for Numerai’s hedge fund.

The machine learning techniques can be neural networks, random forests, support vector machines, and so on. In this article I will focus on neural networks, which will be programmed and trained using Python, Keras and scikit-learn (sklearn).

In addition to creating predictions for the regular tournament, it is also possible to use our predictions to participate in a ‘staked’ tournament. We can stake the cryptocurrency numeraire (NMR) on our predictions in order to earn additional money. In this article, though, I will only cover how to create predictions for the regular tournament.

# Pre-requisites

To follow along with this article, the following programs and libraries need to be installed:

* Python 3 (from Anaconda)
* Keras
* scikit-learn
* Pandas
* Numpy
* Jupyter Notebook

We will also need to have an account on Numerai.

In [1]:
import numpy as np
import pandas as pd
%matplotlib inline

# Exploring the Datasets

After downloading the current datasets, we can have a look in the unzipped directory. Of relevance are the two CSV files:
* `numerai_training_data.csv`
* `numerai_tournament_data.csv`

The first file contains the training data and the second file the tournament data.

In [2]:
training_data = pd.read_csv('numerai_training_data.csv', header=0)
tournament_data = pd.read_csv('numerai_tournament_data.csv', header=0)

Let’s have a quick look at the data in the files using Pandas.

In [3]:
training_data.head()

Unnamed: 0,id,era,data_type,feature1,feature2,feature3,feature4,feature5,feature6,feature7,...,feature42,feature43,feature44,feature45,feature46,feature47,feature48,feature49,feature50,target
0,n221973c37ed247a,era1,train,0.55098,0.42673,0.4018,0.44622,0.68562,0.45346,0.24763,...,0.55924,0.64714,0.62358,0.40199,0.5121,0.42287,0.33241,0.54669,0.55408,1
1,n435a06426e694b7,era1,train,0.32694,0.37829,0.38716,0.41725,0.50691,0.38413,0.61237,...,0.29351,0.57591,0.40191,0.60666,0.53842,0.52236,0.55653,0.26194,0.33737,1
2,n6b5405942fb446b,era1,train,0.4544,0.45144,0.55052,0.64551,0.63833,0.51962,0.34126,...,0.69339,0.44649,0.58797,0.51314,0.42471,0.31818,0.61949,0.66547,0.57674,1
3,ne5d49a5111e84b7,era1,train,0.64494,0.60252,0.43466,0.60305,0.52179,0.42805,0.59592,...,0.19793,0.45149,0.56702,0.31475,0.42197,0.38904,0.5957,0.42558,0.44277,1
4,nd3625b02877d4b2,era1,train,0.3906,0.62302,0.73704,0.58155,0.42124,0.54693,0.51778,...,0.64195,0.22353,0.48502,0.46545,0.52634,0.32485,0.62126,0.51344,0.55221,0


The above statement shows us the first 5 rows of the training data. Here we can see the names of the columns. Take note of the era and data_type columns.

In [4]:
print(training_data.era.unique())
print(training_data.data_type.unique())

['era1' 'era2' 'era3' 'era4' 'era5' 'era6' 'era7' 'era8' 'era9' 'era10'
 'era11' 'era12' 'era13' 'era14' 'era15' 'era16' 'era17' 'era18' 'era19'
 'era20' 'era21' 'era22' 'era23' 'era24' 'era25' 'era26' 'era27' 'era28'
 'era29' 'era30' 'era31' 'era32' 'era33' 'era34' 'era35' 'era36' 'era37'
 'era38' 'era39' 'era40' 'era41' 'era42' 'era43' 'era44' 'era45' 'era46'
 'era47' 'era48' 'era49' 'era50' 'era51' 'era52' 'era53' 'era54' 'era55'
 'era56' 'era57' 'era58' 'era59' 'era60' 'era61' 'era62' 'era63' 'era64'
 'era65' 'era66' 'era67' 'era68' 'era69' 'era70' 'era71' 'era72' 'era73'
 'era74' 'era75' 'era76' 'era77' 'era78' 'era79' 'era80' 'era81' 'era82'
 'era83' 'era84' 'era85']
['train']


Using the above statements we can see all the unique names for eras and data types in our data. In the training data there are 85 eras represented and all data is indeed training data. We can do something similar for the tournament data:

In [5]:
tournament_data.head()

Unnamed: 0,id,era,data_type,feature1,feature2,feature3,feature4,feature5,feature6,feature7,...,feature42,feature43,feature44,feature45,feature46,feature47,feature48,feature49,feature50,target
0,n00e1d5ebcf3d4d5,era86,validation,0.28523,0.52729,0.60784,0.43518,0.32576,0.63765,0.44005,...,0.35296,0.4617,0.50857,0.40087,0.55512,0.65612,0.60729,0.37915,0.46449,1.0
1,nb0b4cce48b78471,era86,validation,0.38658,0.57589,0.36267,0.36722,0.52405,0.54712,0.63671,...,0.26381,0.5604,0.36975,0.50206,0.59444,0.65968,0.42385,0.335,0.35268,1.0
2,n7aae3361b330439,era86,validation,0.33371,0.5465,0.49027,0.40156,0.43806,0.47818,0.55603,...,0.34934,0.46677,0.28978,0.63833,0.70284,0.46035,0.49885,0.29836,0.52302,0.0
3,neca221c2e9374fe,era86,validation,0.29859,0.35833,0.47076,0.38464,0.52346,0.48471,0.55128,...,0.4413,0.5533,0.37002,0.42181,0.45798,0.54207,0.58005,0.48274,0.44154,1.0
4,nf7f52c87d740439,era86,validation,0.60599,0.69024,0.80057,0.52854,0.43206,0.6174,0.44755,...,0.39515,0.32171,0.62867,0.53641,0.54626,0.67708,0.67124,0.33609,0.44527,0.0


Now, we get the first 5 rows of our tournament data.

In [6]:
print(tournament_data.era.unique())
print(tournament_data.data_type.unique())

['era86' 'era87' 'era88' 'era89' 'era90' 'era91' 'era92' 'era93' 'era94'
 'era95' 'era96' 'era97' 'eraX']
['validation' 'test' 'live']


The above statements show that the tournament data contains more diverse data types: validation, test and live. It covers eras 86 to 97 plus an additional one called ‘eraX’. Let’s dig a bit deeper into this with the following:

In [7]:
print(tournament_data.era[tournament_data.data_type=='validation'].unique())
print(tournament_data.era[tournament_data.data_type=='test'].unique())
print(tournament_data.era[tournament_data.data_type=='live'].unique())

['era86' 'era87' 'era88' 'era89' 'era90' 'era91' 'era92' 'era93' 'era94'
 'era95' 'era96' 'era97']
['eraX']
['eraX']


Now we can see that the tournament data of the ‘validation’ type contains the eras from 86 to 97 and that the data of type ‘test’ and ‘live’ contain the ‘eraX’ era.
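
The same breakdown can also be read from a single table. The following is a minimal sketch on a small made-up frame (the rows are invented for illustration and are not Numerai data) that uses `pd.crosstab` to count rows per data type and era:

```python
import pandas as pd

# Toy stand-in for tournament_data; the rows are invented for illustration.
toy = pd.DataFrame({
    "era": ["era86", "era86", "era87", "eraX", "eraX", "eraX"],
    "data_type": ["validation", "validation", "validation", "test", "test", "live"],
})

# Rows are data types, columns are eras, cells are row counts.
era_by_type = pd.crosstab(toy["data_type"], toy["era"])
print(era_by_type)
```

On the real data the equivalent call would be `pd.crosstab(tournament_data.data_type, tournament_data.era)`.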

In the documentation on the Numerai help page, it is mentioned that the validation data contains the targets (like the training data does). Let’s verify this:

In [8]:
print(tournament_data.target[tournament_data.data_type=='validation'].unique())
print(tournament_data.target[tournament_data.data_type=='test'].unique())
print(tournament_data.target[tournament_data.data_type=='live'].unique())

[ 1.  0.]
[ nan]
[ nan]


The validation data does indeed contain the targets [0., 1.], whereas the test and live data contain [nan].
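
A more compact way to run this check is to compute the share of missing targets per data type. This is only a sketch on a toy frame with invented rows; the column names `data_type` and `target` match the ones used above:

```python
import numpy as np
import pandas as pd

# Toy stand-in for tournament_data; the values are invented for illustration.
toy = pd.DataFrame({
    "data_type": ["validation", "validation", "test", "live"],
    "target": [1.0, 0.0, np.nan, np.nan],
})

# Fraction of rows with a missing target within each data type.
missing_share = toy["target"].isna().groupby(toy["data_type"]).mean()
print(missing_share)
```

A share of 0 means every row of that type has a target, and a share of 1 means none does.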

Given that we cannot ever have too much training data, combining all the eras into one ‘complete’ training set (training data plus validation data) is the thing to do. We will therefore disregard the warning made by Numerai:

> We recommend you do not train on the validation data even though you have the targets.

But, won’t this give us problems with overfitting? Not necessarily, given that we will use cross-validation to test the performance of our neural network models.

Let’s create this ‘complete’ training set:

In [9]:
validation_data = tournament_data[tournament_data.data_type=='validation']
complete_training_data = pd.concat([training_data, validation_data])

And let’s check to see we have the correct eras:

In [10]:
complete_training_data.era.unique()

array(['era1', 'era2', 'era3', 'era4', 'era5', 'era6', 'era7', 'era8',
       'era9', 'era10', 'era11', 'era12', 'era13', 'era14', 'era15',
       'era16', 'era17', 'era18', 'era19', 'era20', 'era21', 'era22',
       'era23', 'era24', 'era25', 'era26', 'era27', 'era28', 'era29',
       'era30', 'era31', 'era32', 'era33', 'era34', 'era35', 'era36',
       'era37', 'era38', 'era39', 'era40', 'era41', 'era42', 'era43',
       'era44', 'era45', 'era46', 'era47', 'era48', 'era49', 'era50',
       'era51', 'era52', 'era53', 'era54', 'era55', 'era56', 'era57',
       'era58', 'era59', 'era60', 'era61', 'era62', 'era63', 'era64',
       'era65', 'era66', 'era67', 'era68', 'era69', 'era70', 'era71',
       'era72', 'era73', 'era74', 'era75', 'era76', 'era77', 'era78',
       'era79', 'era80', 'era81', 'era82', 'era83', 'era84', 'era85',
       'era86', 'era87', 'era88', 'era89', 'era90', 'era91', 'era92',
       'era93', 'era94', 'era95', 'era96', 'era97'], dtype=object)

This gives us all eras from 1 to 97, as it is supposed to.

We can now create our features (X) and labels (Y) for training our neural network:

In [11]:
features = [f for f in list(complete_training_data) if "feature" in f]
X = complete_training_data[features]
Y = complete_training_data["target"]

In [12]:
X.head()

Unnamed: 0,feature1,feature2,feature3,feature4,feature5,feature6,feature7,feature8,feature9,feature10,...,feature41,feature42,feature43,feature44,feature45,feature46,feature47,feature48,feature49,feature50
0,0.55098,0.42673,0.4018,0.44622,0.68562,0.45346,0.24763,0.61223,0.52564,0.48586,...,0.53232,0.55924,0.64714,0.62358,0.40199,0.5121,0.42287,0.33241,0.54669,0.55408
1,0.32694,0.37829,0.38716,0.41725,0.50691,0.38413,0.61237,0.40076,0.52302,0.42932,...,0.46287,0.29351,0.57591,0.40191,0.60666,0.53842,0.52236,0.55653,0.26194,0.33737
2,0.4544,0.45144,0.55052,0.64551,0.63833,0.51962,0.34126,0.57061,0.44524,0.41106,...,0.40919,0.69339,0.44649,0.58797,0.51314,0.42471,0.31818,0.61949,0.66547,0.57674
3,0.64494,0.60252,0.43466,0.60305,0.52179,0.42805,0.59592,0.33314,0.48087,0.5699,...,0.40814,0.19793,0.45149,0.56702,0.31475,0.42197,0.38904,0.5957,0.42558,0.44277
4,0.3906,0.62302,0.73704,0.58155,0.42124,0.54693,0.51778,0.35616,0.37648,0.51755,...,0.37677,0.64195,0.22353,0.48502,0.46545,0.52634,0.32485,0.62126,0.51344,0.55221


In [13]:
Y.head()

0    1.0
1    1.0
2    1.0
3    1.0
4    0.0
Name: target, dtype: float64

# Performing Predictions with Keras and scikit-learn

Using the Keras neural network library for Python we can define a simple neural network model which we can train on our complete training set. First we define the neural network model in a function.

Then we create a wrapper for the neural network. This is needed to create a bridge between Keras and scikit-learn. We’ll tell it to run for 8 epochs, with a batch size of 128. Verbosity is set to 0, because we don’t need to see how far the network has been trained.

Now, from the Numerai documentation we can learn that:

> For cross-validation, it’s better to hold out a random sample of eras rather than a random sample rows. Using a random sample of rows tends to over fit.

Therefore, we should use a special cross-validation method called group-k-fold. In our case it is set to do k-fold over the individual eras rather than over the individual rows in the complete training set. We shall use the GroupKFold class from scikit-learn for this. First we create an instance of this class and tell it to create 5 folds. Then using the split method we tell the object to split the training data based on the eras.
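
To see what splitting by era means in practice, here is a small sketch on synthetic data. The era labels and array shapes are made up for illustration; only `GroupKFold` and its `split` method come from scikit-learn:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data: 12 rows spread over 6 made-up eras, 2 rows per era.
eras = np.repeat(["era1", "era2", "era3", "era4", "era5", "era6"], 2)
X_toy = np.arange(24).reshape(12, 2)
y_toy = np.tile([0, 1], 6)

gkf_toy = GroupKFold(n_splits=3)
folds = list(gkf_toy.split(X_toy, y_toy, groups=eras))

for train_idx, test_idx in folds:
    train_eras = set(eras[train_idx])
    test_eras = set(eras[test_idx])
    # No era may appear on both sides of a split.
    assert train_eras.isdisjoint(test_eras)
    print("held-out eras:", sorted(test_eras))
```

Each era is held out as a whole, so the model is never scored on rows from an era it was trained on.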

Using GridSearchCV from scikit-learn we can find good hyper-parameters for our neural network model. Here we try to find out what works best: 10 or 14 neurons, and a dropout probability of 0.01 or 0.26. This gives a parameter grid with a total of 4 combinations to try out.
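
The 4 combinations can be listed explicitly with scikit-learn’s `ParameterGrid`. This is a minimal sketch; the variable names `combinations` and `n_fits` are chosen here for illustration:

```python
from sklearn.model_selection import ParameterGrid

param_grid = {"neurons": [10, 14], "dropout": [0.01, 0.26]}

# Expand the grid into every combination of the listed values.
combinations = list(ParameterGrid(param_grid))
for params in combinations:
    print(params)

# With 5 folds, every combination is trained and scored once per fold.
n_splits = 5
n_fits = len(combinations) * n_splits
print("cross-validation fits:", n_fits)
```

With the default `refit=True`, GridSearchCV additionally retrains the best combination once on the full training set after the search.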

Finally, we create an instance of GridSearchCV with the neural network model we defined above, the parameter grid with the 4 combinations, a scoring function that matches the loss function of the neural network, one thread and a verbose level of 3. Then we tell it to fit our training data.


In [14]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import GroupKFold
from keras.models import Sequential
from keras.layers import Dense, BatchNormalization, Dropout, Activation
from keras.wrappers.scikit_learn import KerasClassifier

def create_model(neurons=200, dropout=0.2):
    model = Sequential()
    model.add(Dense(neurons, input_shape=(50,), kernel_initializer='glorot_uniform', use_bias=False))
    model.add(BatchNormalization())
    model.add(Dropout(dropout))
    model.add(Activation('relu'))
    model.add(Dense(1, activation='sigmoid', kernel_initializer='glorot_normal'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['binary_crossentropy', 'accuracy'])
    return model

model = KerasClassifier(build_fn=create_model, epochs=8, batch_size=128, verbose=0)

neurons = [10, 14]
dropout = [0.01, 0.26]
param_grid = dict(neurons=neurons, dropout=dropout)

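# Split by era so that all rows of an era end up in the same fold.
gkf = GroupKFold(n_splits=5)
kfold_split = gkf.split(X, Y, groups=complete_training_data.era)

# neg_log_loss matches the binary cross-entropy loss the network is trained on.
grid = GridSearchCV(estimator=model, param_grid=param_grid, cv=kfold_split, scoring='neg_log_loss', n_jobs=1, verbose=3)
grid_result = grid.fit(X.values, Y.values)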

print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Using TensorFlow backend.


Fitting 5 folds for each of 4 candidates, totalling 20 fits
[CV] dropout=0.01, neurons=10 ........................................
[CV]  dropout=0.01, neurons=10, score=-0.6920131037236666, total= 3.1min
[CV] dropout=0.01, neurons=10 ........................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  3.2min remaining:    0.0s


[CV]  dropout=0.01, neurons=10, score=-0.6931060224708179, total= 3.1min
[CV] dropout=0.01, neurons=10 ........................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  6.4min remaining:    0.0s


[CV]  dropout=0.01, neurons=10, score=-0.692674851594345, total= 3.1min
[CV] dropout=0.01, neurons=10 ........................................
[CV]  dropout=0.01, neurons=10, score=-0.6913594624863721, total= 3.1min
[CV] dropout=0.01, neurons=10 ........................................
[CV]  dropout=0.01, neurons=10, score=-0.6926817816201736, total= 3.1min
[CV] dropout=0.01, neurons=14 ........................................
[CV]  dropout=0.01, neurons=14, score=-0.6920294021313386, total= 3.1min
[CV] dropout=0.01, neurons=14 ........................................
[CV]  dropout=0.01, neurons=14, score=-0.6928563954308723, total= 3.1min
[CV] dropout=0.01, neurons=14 ........................................
[CV]  dropout=0.01, neurons=14, score=-0.6926209476269657, total= 3.1min
[CV] dropout=0.01, neurons=14 ........................................
[CV]  dropout=0.01, neurons=14, score=-0.6914420946625519, total= 3.0min
[CV] dropout=0.01, neurons=14 ..................................

[Parallel(n_jobs=1)]: Done  20 out of  20 | elapsed: 64.0min finished


Best: -0.692028 using {'dropout': 0.26, 'neurons': 14}
-0.692354 (0.000616) with: {'dropout': 0.01, 'neurons': 10}
-0.692419 (0.000631) with: {'dropout': 0.01, 'neurons': 14}
-0.692078 (0.000360) with: {'dropout': 0.26, 'neurons': 10}
-0.692028 (0.000464) with: {'dropout': 0.26, 'neurons': 14}


When we run the fit method of the GridSearchCV object, we are told the following:

> Fitting 5 folds for each of 4 candidates, totalling 20 fits

This means that our neural network will be trained 20 times. For each of the 4 combinations of our parameter grid, 5 different folds of training data will be used. The latter refers to the 5-fold cross validation.

Yes, this process takes a long time, even with a new high-end GPU. On my PC with an Nvidia GTX 1080 GPU, every ‘fit’ takes roughly 3.1 min, giving a total of 64.0 min.

In [15]:
grid.best_estimator_.model.save('./my_model_2017-11-07_IV.h5')

In [16]:
grid.best_estimator_.model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_41 (Dense)             (None, 14)                700       
_________________________________________________________________
batch_normalization_21 (Batc (None, 14)                56        
_________________________________________________________________
dropout_21 (Dropout)         (None, 14)                0         
_________________________________________________________________
activation_21 (Activation)   (None, 14)                0         
_________________________________________________________________
dense_42 (Dense)             (None, 1)                 15        
Total params: 771
Trainable params: 743
Non-trainable params: 28
_________________________________________________________________


# Checking the Performance

We can now check the performance for the tournament of the best model found by the GridSearchCV object. Let’s go over the items mentioned in the Numerai documentation:

> The leaderboard displayed is based only on validation data. To be on the leaderboard, models are required to have concordance, originality, and consistency.

In order for us to earn money in the regular tournament, we need to be on the leaderboard. Hence, our model needs to satisfy the following 3 criteria — concordance, originality and consistency:

> Concordance is a measure of whether predictions on the validation set, test set, and live set appear to be generated by the same model.

This should not be a problem for our model.

> Originality is a measure of whether a set of predictions is uncorrelated with predictions already submitted.

This is the trickiest part. If you submit early in the round, then chances are very high that your submission will be original. On the other hand, this probability drops over time. This is why I put values like 0.01 and 0.26 instead of 0.0 and 0.25 for the dropout rates in the parameter grid. Chances are small that anyone else has these values for their model. This increases the uniqueness of our model.

> Consistency measures the percentage of eras in which a model achieves a logloss < -ln(0.5). … Only models with consistency above 75% are considered consistent.

We can check the consistency with our own function.

In [17]:
def check_consistency(model, valid_data):
    # Report the share of eras whose log loss beats the -ln(0.5) threshold.
    eras = valid_data.era.unique()
    features = [f for f in list(valid_data) if "feature" in f]
    count = 0
    count_consistent = 0
    for era in eras:
        count += 1
        # Filter on the function argument, not on the global validation_data.
        current_valid_data = valid_data[valid_data.era == era]
        X_valid = current_valid_data[features]
        Y_valid = current_valid_data["target"]
        # evaluate() returns the loss first, followed by the compiled metrics.
        loss = model.evaluate(X_valid.values, Y_valid.values, batch_size=128, verbose=0)[0]
        consistent = loss < -np.log(.5)
        if consistent:
            count_consistent += 1
        print("{}: loss - {} consistent: {}".format(era, loss, consistent))
    print("Consistency: {}".format(count_consistent / count))

check_consistency(grid.best_estimator_.model, validation_data)

era86: loss - 0.6902206643444626 consistent: True
era87: loss - 0.6870789562210595 consistent: True
era88: loss - 0.6924690243949092 consistent: True
era89: loss - 0.6919184580327339 consistent: True
era90: loss - 0.6882611462892579 consistent: True
era91: loss - 0.692048743982823 consistent: True
era92: loss - 0.6956388723282587 consistent: False
era93: loss - 0.6937774828073401 consistent: False
era94: loss - 0.6880148011334818 consistent: True
era95: loss - 0.6889040892584941 consistent: True
era96: loss - 0.6909991045190831 consistent: True
era97: loss - 0.6920358447774848 consistent: True
Consistency: 0.8333333333333334


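Our model achieves a consistency above 75%, so our submission passes this criterion. Keep in mind that the validation eras were part of the ‘complete’ training set, so these losses are measured on data the model has already seen and are likely optimistic.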
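
To make the threshold concrete: -ln(0.5) = ln(2) ≈ 0.6931 is exactly the log loss of a model that predicts 0.5 for every row. The sketch below checks this with a hand-written helper (`log_loss_binary` is a name made up for this example) and recomputes the consistency from the per-era losses printed above, rounded to four decimals:

```python
import math

def log_loss_binary(y_true, y_prob):
    """Mean binary cross-entropy for labels in {0, 1} and probabilities in (0, 1)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

threshold = -math.log(0.5)  # equals ln(2)

# A model that always predicts 0.5 lands on the threshold, whatever the labels are.
y_true = [1, 0, 1, 1, 0]
coin_flip_loss = log_loss_binary(y_true, [0.5] * len(y_true))
print(coin_flip_loss, threshold)

# Per-era losses from the output above (era86 to era97), rounded to 4 decimals.
era_losses = [0.6902, 0.6871, 0.6925, 0.6919, 0.6883, 0.6920,
              0.6956, 0.6938, 0.6880, 0.6889, 0.6910, 0.6920]
consistency = sum(loss < threshold for loss in era_losses) / len(era_losses)
print(consistency)
```

An era counts as consistent only if the model does better than this uninformed baseline on it.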

# Submitting the Predictions

We’ll use the following statements to create predictions and write them to a CSV file.

In [18]:
from time import gmtime, strftime

# Predict on all tournament rows (validation, test and live).
x_prediction = tournament_data[features]
t_id = tournament_data["id"]
y_prediction = grid.best_estimator_.model.predict_proba(x_prediction.values, batch_size=128)

# Flatten the model output into a 1-D array and pair each probability with its id.
results = np.reshape(y_prediction, -1)
results_df = pd.DataFrame(data={'probability': results})
joined = pd.DataFrame(t_id).join(results_df)

# Name the file after the current UTC time.
path = 'predictions_{}.csv'.format(strftime("%Y-%m-%d_%Hh%Mm%Ss", gmtime()))
print()
print("Writing predictions to " + path.strip())
# Save the predictions out to a CSV file
joined.to_csv(path, float_format='%.15f', index=False)

Writing predictions to predictions_2017-11-08_14h09m41s.csv


Now we can upload the CSV file to Numerai and see which score we get.
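
Before uploading, it can be worth reading the file back and running a few sanity checks. The sketch below works on an in-memory stand-in with invented ids and values; `looks_valid` is a hypothetical helper written for this example and is not Numerai’s own validation:

```python
import io
import pandas as pd

# Stand-in for a written predictions file; ids and values are invented.
csv_text = "id,probability\nn0001,0.512345678901234\nn0002,0.498765432109876\n"
submission = pd.read_csv(io.StringIO(csv_text))

def looks_valid(df):
    """Basic sanity checks on a predictions frame (hypothetical helper)."""
    return bool(
        list(df.columns) == ["id", "probability"]
        and df["id"].is_unique
        and not df["probability"].isna().any()
        and df["probability"].between(0, 1).all()
    )

print(looks_valid(submission))
```

For the real file, replace the `io.StringIO(...)` argument with the path that was printed when the predictions were written.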

# Further Explorations

The most obvious thing to try now is, of course, to change the lists of hyper-parameter values. You could also vary other settings, like the number of layers in the neural network or the type of optimizer used. You can find some inspiration for this here: https://machinelearningmastery.com/grid-search-hyperparameters-deep-learning-models-python-keras/

For the cross-validation you could try 10-fold cross-validation instead of the 5-fold cross-validation which I showed here. Do take into account that it will take at least twice as long to run. But is the extra running time worth it?

Use RandomizedSearchCV from the scikit-learn library instead of GridSearchCV. Or write your own search function. One reason to do that is so that you can save the progress as checkpoints after every fold. More on that here: https://machinelearningmastery.com/check-point-deep-learning-models-keras/
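
As a rough sketch of the randomized variant, the example below runs `RandomizedSearchCV` with era-based folds on synthetic data. To keep it quick and free of Keras, it uses `LogisticRegression` as a stand-in estimator; the data, the era labels and the search range for `C` are all assumptions made for illustration:

```python
import numpy as np
from scipy.stats import uniform
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, RandomizedSearchCV

# Synthetic stand-in data: 120 rows, 5 features, 12 made-up eras of 10 rows each.
rng = np.random.default_rng(0)
X_toy = rng.random((120, 5))
y_toy = rng.integers(0, 2, size=120)
eras = np.repeat(np.arange(12), 10)

search = RandomizedSearchCV(
    estimator=LogisticRegression(max_iter=200),
    # Sample C uniformly from [0.01, 10.01] instead of listing fixed values.
    param_distributions={"C": uniform(loc=0.01, scale=10.0)},
    n_iter=5,                   # number of sampled parameter settings
    scoring="neg_log_loss",
    cv=GroupKFold(n_splits=3),  # folds are built from the groups passed to fit()
    random_state=0,
)
search.fit(X_toy, y_toy, groups=eras)
print(search.best_params_)
```

Unlike a grid, the number of candidates is set directly with `n_iter`, so the running time can be capped regardless of how wide the search ranges are.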

Try other machine learning methods like random forests or SVMs.

# References

https://keras.io/layers/core/

http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

http://scikit-learn.org/stable/modules/cross_validation.html#group-k-fold

https://numer.ai/help

https://machinelearningmastery.com/grid-search-hyperparameters-deep-learning-models-python-keras/

https://machinelearningmastery.com/check-point-deep-learning-models-keras/