# Numerai

Numerai is a hedge fund that crowdsources its market predictions. It distributes anonymized data, so the data scientists working on the forecasts do not know what the features represent. The prediction problem is reduced to binary classification: predicting whether an asset will gain or lose.

In this tutorial we will fit a variety of models to Numerai data.

### Importing Data

Let's begin by importing the data and organizing our features and labels.

In [1]:
import pandas as pd
import numpy as np

In [2]:
df_data = pd.read_csv('numerai_training_data.csv')
df_data.head()

Unnamed: 0,feature1,feature2,feature3,feature4,feature5,feature6,feature7,feature8,feature9,feature10,...,feature13,feature14,feature15,feature16,feature17,feature18,feature19,feature20,feature21,target
0,0.499664,0.951271,0.12711,0.469706,0.188336,0.11383,0.917618,0.398412,0.41891,0.452983,...,0.137192,0.201437,0.507708,0.919475,0.978169,0.17708,0.101372,0.722138,0.832319,0
1,0.099515,0.682824,0.867939,0.943828,0.505526,0.886766,0.530862,0.531002,0.980002,0.941859,...,0.64264,0.533367,0.616879,0.697038,0.741461,0.08669,0.109533,0.324666,0.552276,1
2,0.671993,0.383901,0.533011,0.690863,0.176539,0.600196,0.381543,0.648849,0.831643,0.861746,...,0.520068,0.660924,0.538882,0.160117,0.765317,0.301772,0.352097,0.638205,0.383552,0
3,0.578177,0.872357,0.679625,0.108961,0.94591,0.571062,0.891958,0.916592,0.141508,0.258504,...,0.037959,0.604539,0.974103,0.187519,0.938254,0.560129,0.136483,0.284507,0.199446,1
4,0.474311,0.639613,0.563562,0.169508,0.456858,0.58071,0.969811,0.357417,0.157594,0.251147,...,0.038095,0.7702,0.697395,0.792327,0.71165,0.17708,0.247403,0.666598,0.755557,0


In [3]:
df_X = df_data.drop(columns='target').copy()

In [4]:
df_y = df_data['target'].copy()

### The Data is Clean

One nice thing about working with the Numerai data is that it is clean and normalized.  

1. All the features are in the range $[0, 1]$.
1. All the features have a mean of about 0.50 and a standard deviation of about 0.29.
1. The labels are roughly balanced, with about 50% gains and 50% losses.

In [5]:
df_X.mean()

feature1     0.511372
feature2     0.492770
feature3     0.492105
feature4     0.499420
feature5     0.502291
feature6     0.493039
feature7     0.480280
feature8     0.494526
feature9     0.492926
feature10    0.489265
feature11    0.495725
feature12    0.510969
feature13    0.489852
feature14    0.509350
feature15    0.487469
feature16    0.509012
feature17    0.488944
feature18    0.484929
feature19    0.491757
feature20    0.509223
feature21    0.498371
dtype: float64

In [6]:
df_X.std()

feature1     0.282260
feature2     0.287446
feature3     0.282481
feature4     0.284493
feature5     0.289867
feature6     0.287061
feature7     0.287526
feature8     0.288087
feature9     0.293945
feature10    0.287046
feature11    0.290922
feature12    0.285451
feature13    0.291276
feature14    0.290140
feature15    0.286997
feature16    0.289279
feature17    0.284790
feature18    0.290445
feature19    0.283742
feature20    0.291001
feature21    0.289637
dtype: float64
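The mean and standard deviation are checked above; the range claim can be checked the same way. The following is a minimal sketch, assuming a hypothetical helper `summarize_features` and a synthetic stand-in frame (`demo_X`) rather than the real Numerai file.

```python
import numpy as np
import pandas as pd

def summarize_features(df):
    """Return one row per column with its min, max, mean, and standard deviation."""
    return df.agg(['min', 'max', 'mean', 'std']).T

# Synthetic stand-in for the feature frame (assumption: uniform noise in [0, 1)).
rng = np.random.default_rng(0)
demo_X = pd.DataFrame(rng.uniform(0, 1, size=(1000, 3)),
                      columns=['feature1', 'feature2', 'feature3'])

summary = summarize_features(demo_X)

# True only if every column stays inside the unit interval.
in_unit_range = bool((summary['min'] >= 0).all() and (summary['max'] <= 1).all())
print(summary)
print('all features in [0, 1]:', in_unit_range)
```

On the real data, the same call would be `summarize_features(df_X)`. As a point of comparison, a uniform distribution on $[0, 1]$ has a standard deviation of $1/\sqrt{12} \approx 0.289$, which is close to the values printed above, though that alone does not show the features are uniformly distributed.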

Notice that guessing a gain for every asset would yield an accuracy of 50.5%, so this is the baseline that any model needs to beat.

In [7]:
df_y.mean()

0.5051702657807309
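This always-guess-a-gain baseline can also be scored with the same cross-validation machinery used for the models below. Here is a minimal sketch, assuming scikit-learn's `DummyClassifier` and synthetic stand-in data (`demo_X`, `demo_y`) with 50.5% positive labels; it is an illustration, not part of the original analysis.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption): 21 uninformative features and
# exactly 50.5% positive labels, shuffled.
rng = np.random.default_rng(0)
demo_X = rng.uniform(0, 1, size=(2000, 21))
demo_y = np.array([1] * 1010 + [0] * 990)
rng.shuffle(demo_y)

# 'most_frequent' always predicts the majority class seen during training.
baseline = DummyClassifier(strategy='most_frequent')
baseline_scores = cross_val_score(baseline, demo_X, demo_y, cv=10, scoring='accuracy')

majority_share = max(demo_y.mean(), 1 - demo_y.mean())
print('baseline CV accuracy:', baseline_scores.mean())
print('majority class share:', majority_share)
```

Any model whose cross-validated accuracy is not clearly above this number has not learned anything beyond the class balance.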

### Logistic Regression

The first model that we will fit is a simple logistic regression.  We will use a 10-fold cross-validation accuracy as our goodness of fit metric.

In [8]:
from sklearn.linear_model import LogisticRegression
mdl_logistic_regression = LogisticRegression(C=1.0, random_state=0)

In [9]:
%%time
from sklearn.model_selection import cross_val_score
scores = cross_val_score(mdl_logistic_regression, df_X, df_y, cv=10, verbose=0)

CPU times: user 22.9 s, sys: 15.8 s, total: 38.7 s
Wall time: 4.94 s


We get a mean accuracy of about 52%, which is only slightly higher than guessing a gain for all assets.

In [10]:
np.mean(scores)

0.5219476744186047
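To make the 10-fold cross-validation accuracy concrete, the sketch below spells out what `cross_val_score` does for a classifier: split the data into 10 stratified folds, fit on 9, score on the held-out fold, and repeat. It uses synthetic stand-in data (`demo_X`, `demo_y`, an assumption for illustration) so that it runs without the Numerai file.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (assumption): one weakly informative feature plus noise.
rng = np.random.default_rng(0)
demo_X = rng.uniform(0, 1, size=(1000, 5))
demo_y = (demo_X[:, 0] + rng.normal(0, 1, size=1000) > 0.5).astype(int)

# Manual loop: fit a fresh model on each training split, score on the held-out fold.
manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=10).split(demo_X, demo_y):
    fold_model = LogisticRegression(C=1.0, random_state=0)
    fold_model.fit(demo_X[train_idx], demo_y[train_idx])
    fold_preds = fold_model.predict(demo_X[test_idx])
    manual_scores.append(accuracy_score(demo_y[test_idx], fold_preds))

# Library call: for a classifier and an integer cv, the folds are stratified
# and unshuffled, and the default score is accuracy.
library_scores = cross_val_score(LogisticRegression(C=1.0, random_state=0),
                                 demo_X, demo_y, cv=10)

print(np.mean(manual_scores), np.mean(library_scores))
```

Both approaches should give the same per-fold scores; the library call is simply the shorter way to write the loop.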

### Random Forest

Next, let's fit a random forest.  We use 5-fold rather than 10-fold cross-validation because these models take longer to fit.

In [11]:
from sklearn.ensemble import RandomForestClassifier
mdl_random_forest = RandomForestClassifier(n_estimators=100, max_depth=7, random_state=0)

In [12]:
%%time
scores = cross_val_score(mdl_random_forest, df_X, df_y, cv=5, scoring='accuracy', verbose=0)
scores

CPU times: user 44.8 s, sys: 58.9 ms, total: 44.9 s
Wall time: 44.7 s


array([0.51858389, 0.51723422, 0.52450166, 0.51650748, 0.52460548])

Similar to logistic regression, we get a mean score of around 52%.

In [13]:
np.mean(scores)

0.5202865448504983

### XGBoost

The next model that we will fit is a gradient boosted tree with the `xgboost` package.  We use 5-fold cross-validation to assess model performance.

In [14]:
from xgboost import XGBClassifier
mdl_xgboost = XGBClassifier(n_estimators=100, max_depth=10, learning_rate=0.01, use_label_encoder=False, eval_metric='logloss')

In [15]:
scores = cross_val_score(mdl_xgboost, df_X, df_y, cv=5, scoring='accuracy', verbose=0)
scores



array([0.51292566, 0.51775332, 0.51977782, 0.51629983, 0.51567691])

Our mean accuracy score is slightly lower, at about 51.6%.

In [16]:
np.mean(scores)

0.5164867109634551
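When mean accuracies are this close, it helps to look at the fold-to-fold spread before ranking the models. The sketch below reuses the per-fold scores printed above for the random forest and XGBoost; the variable names are illustrative.

```python
import numpy as np

# Per-fold accuracies copied from the cross-validation outputs above.
rf_scores = np.array([0.51858389, 0.51723422, 0.52450166, 0.51650748, 0.52460548])
xgb_scores = np.array([0.51292566, 0.51775332, 0.51977782, 0.51629983, 0.51567691])

for name, fold_scores in [('random forest', rf_scores), ('xgboost', xgb_scores)]:
    # ddof=1 gives the sample standard deviation across the 5 folds.
    print(f'{name}: mean={fold_scores.mean():.4f}, std={fold_scores.std(ddof=1):.4f}')

# Difference between the two mean accuracies.
gap = rf_scores.mean() - xgb_scores.mean()
print(f'gap between means: {gap:.4f}')
```

The gap between the two means is of a similar size to the spread across folds, so with only five folds each it would be hard to claim that one model is reliably better than the other.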

### Validation Set

In the subsequent sections we will try a variety of neural networks.  In order to check out-of-sample accuracy, let's first create a holdout set with `train_test_split()`.

In [17]:
from sklearn.model_selection import train_test_split

In [18]:
X_train, X_test, y_train, y_test = train_test_split(df_X, df_y, test_size=0.20, random_state=0)
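As a quick sanity check on a holdout split, one can confirm the 80/20 sizes and compare the label balance on each side. The sketch below is a variation rather than the split used here: it passes `stratify`, which the cell above does not, and it runs on synthetic stand-in data (`demo_X`, `demo_y`).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 21 features, 50.5% positive labels.
rng = np.random.default_rng(0)
demo_X = rng.uniform(0, 1, size=(1000, 21))
demo_y = np.array([1] * 505 + [0] * 495)
rng.shuffle(demo_y)

# stratify=demo_y keeps the class proportions the same in both splits.
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    demo_X, demo_y, test_size=0.20, random_state=0, stratify=demo_y)

print('train shape:', Xd_train.shape, 'test shape:', Xd_test.shape)
print('train positive rate:', yd_train.mean(), 'test positive rate:', yd_test.mean())
```

With labels this close to balanced, an unstratified split will usually be fine as well; stratifying only removes a small source of noise from the holdout accuracy.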

### Initial Neural Network

We are now ready to fit our first neural network.

In [19]:
import random
import tensorflow as tf
from keras.layers import Dense
from keras.models import Sequential
from keras.optimizers import Adam
from sklearn.metrics import accuracy_score

Let's set our random seeds to get reproducible results.

In [20]:
def set_seeds(seed=100):
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)

In [21]:
set_seeds()

Next, let's build our dense feed-forward neural network.  We use three hidden layers with 16, 8, and 4 units, respectively.

In [22]:
model = Sequential()
# Only the first layer needs the input dimension; later layers infer it.
model.add(Dense(units=16, input_dim=len(df_X.columns), activation='relu'))
model.add(Dense(units=8, activation='relu'))
model.add(Dense(units=4, activation='relu'))
# A single sigmoid unit outputs the predicted probability that target is 1.
model.add(Dense(units=1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer=Adam(learning_rate=0.001), metrics=['accuracy'])
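To get a feel for how small this network is, we can count its trainable parameters by hand. Each dense layer has one weight per input-unit pair plus one bias per unit. The helper `dense_param_count` below is a hypothetical illustration, not a Keras function.

```python
def dense_param_count(n_inputs, layer_units):
    """Count weights and biases in a stack of fully connected layers."""
    total = 0
    for units in layer_units:
        # weights (n_inputs * units) plus biases (units)
        total += n_inputs * units + units
        # this layer's output width is the next layer's input width
        n_inputs = units
    return total

# 21 input features feeding hidden layers of 16, 8, and 4 units, then 1 output unit.
n_params = dense_param_count(21, [16, 8, 4, 1])
print(n_params)  # 352 + 136 + 36 + 5 = 529
```

Calling `model.summary()` on the compiled model should report the same total.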

Now we can fit our model.

In [23]:
%%time
h = model.fit(X_train, y_train, epochs=10, verbose=False)

CPU times: user 17 s, sys: 1.93 s, total: 18.9 s
Wall time: 8.84 s


As we can see, our model performs about the same as the previous models.

In [24]:
model.evaluate(X_test, y_test)



[0.6913726925849915, 0.5212832093238831]
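The two numbers returned by `model.evaluate` are the loss followed by the metrics passed to `compile`, so here they are the binary cross-entropy and the accuracy. A useful reference point for the loss is a model that predicts a probability of 0.5 for every row. The sketch below computes that value with NumPy; `binary_crossentropy` is a hand-written illustration of the formula, not the Keras implementation.

```python
import numpy as np

def binary_crossentropy(y_true, p_pred, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    p = np.clip(p_pred, eps, 1 - eps)  # avoid log(0)
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))

# Any set of 0/1 labels gives the same loss when every prediction is 0.5.
y_demo = np.array([0, 1, 1, 0, 1])
chance_loss = binary_crossentropy(y_demo, np.full(len(y_demo), 0.5))
print(chance_loss)  # ln(2), approximately 0.6931
```

The test loss of about 0.6914 is only slightly below $\ln 2 \approx 0.6931$, which tells the same story as the accuracy: the network's predicted probabilities are barely more informative than a constant 0.5.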

### Dropout

In this section we implement dropout regularization in our neural network.  We use a slightly different architecture with two hidden layers of 64 and 32 units, respectively.

In [25]:
from keras.layers import Dropout

In [26]:
set_seeds()

In [27]:
model = Sequential()
model.add(Dense(units=64, input_dim=len(df_X.columns), activation='relu'))
# Randomly zero 10% of the previous layer's outputs during training.
model.add(Dropout(rate=0.1, seed=100))
model.add(Dense(units=32, activation='relu'))
model.add(Dropout(rate=0.1, seed=100))
model.add(Dense(units=1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer=Adam(learning_rate=0.001), metrics=['accuracy'])

In [28]:
%%time
model.fit(X_train, y_train, epochs=10, verbose=False)

CPU times: user 21.7 s, sys: 3.27 s, total: 25 s
Wall time: 10.1 s


<tensorflow.python.keras.callbacks.History at 0x7fa5f06b9280>

Dropout regularization doesn't seem to have much of an effect on our model.

In [29]:
model.evaluate(X_test, y_test)



[0.6915660500526428, 0.5227367281913757]
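For intuition about what a dropout layer does, here is a minimal NumPy sketch of inverted dropout: during training each activation is zeroed with probability `rate`, and the survivors are scaled by `1 / (1 - rate)` so the expected activation is unchanged; at inference the layer passes its input through. The function `dropout_forward` is an illustration written for this tutorial, not the Keras code.

```python
import numpy as np

def dropout_forward(activations, rate, rng, training=True):
    """Apply inverted dropout to an array of activations."""
    if not training or rate == 0.0:
        return activations  # dropout is a no-op at inference time
    # keep each unit with probability 1 - rate
    keep_mask = rng.uniform(size=activations.shape) >= rate
    # rescale the kept units so the expected value matches the input
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(100)
acts = np.ones((10000, 64))  # stand-in activations from a 64-unit layer
dropped = dropout_forward(acts, rate=0.1, rng=rng)

print('fraction zeroed:', np.mean(dropped == 0))
print('mean activation after dropout:', dropped.mean())
```

With `rate=0.1`, about 10% of the activations are zeroed and the mean stays close to 1, which is why no extra rescaling is needed when the model is evaluated.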

### Regularization

Finally, we apply `l1` regularization to our neural network.  Here it is applied to the hidden layers' outputs through `activity_regularizer`.

In [30]:
from keras.regularizers import l1, l2

In [31]:
set_seeds()

In [32]:
model = Sequential()
# activity_regularizer penalizes the layer's outputs, not its weights.
model.add(Dense(units=64, input_dim=len(df_X.columns), activation='relu', activity_regularizer=l1(0.0005)))
model.add(Dense(units=32, activation='relu', activity_regularizer=l1(0.0005)))
model.add(Dense(units=1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer=Adam(learning_rate=0.001), metrics=['accuracy'])

In [33]:
%%time
model.fit(X_train, y_train, epochs=10, verbose=False)

CPU times: user 21.7 s, sys: 3.52 s, total: 25.2 s
Wall time: 10.2 s


<tensorflow.python.keras.callbacks.History at 0x7fa5f0598a00>

Once again, `l1` regularization doesn't seem to improve our model.

In [34]:
model.evaluate(X_test, y_test)



[0.6923099756240845, 0.5174937844276428]
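Note that the model above passes `l1(0.0005)` as an `activity_regularizer`, which penalizes the layer's outputs; a `kernel_regularizer` would penalize the weights instead. The sketch below shows the extra loss term an `l1` activity penalty contributes for one batch. The helper `l1_activity_penalty` is hypothetical, and dividing by the batch size is an assumption about how the Keras version used here normalizes the term.

```python
import numpy as np

def l1_activity_penalty(activations, coef):
    """L1 penalty on a batch of layer outputs, averaged over the batch.

    Assumption: the summed penalty is divided by the batch size.
    """
    batch_size = activations.shape[0]
    return coef * np.sum(np.abs(activations)) / batch_size

# Two samples, two units: the absolute values sum to 6, so the mean per sample is 3.
demo_acts = np.array([[1.0, -2.0],
                      [0.0, 3.0]])
penalty = l1_activity_penalty(demo_acts, coef=0.0005)
print(penalty)  # 0.0005 * 6 / 2 = 0.0015
```

This term is added to the binary cross-entropy during training, so it pushes hidden activations toward zero. With ReLU units, that tends to encourage sparse activations.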

**Discussion Question:** Based on our results so far, which model would you select?

**Code Challenge:** Try some different neural architectures and see if you can improve model performance.