# The Numerai data description

The Numerai data is divided into 3 subsets:
 - training data
 - validation data
 - live data
 
For the training and validation data sets the target is known. The target for live data is not provided: no one knows it yet, because it lies in the future. The goal of the competition is to predict the target values for live data. Numerai uses the users' predictions to compute the Meta Model (an ensemble of all models in the competition) and uses the Meta Model to submit positions to the market.

Rewards (payouts) are distributed to participants in the Numerai token (Numeraire, ticker NMR) based on the accuracy of their predictions. The reward depends on the quality of the predictions and on the stake value, which is the amount of NMR the user is willing to bet on the predictions. Every week a new portion of live data is released. Users compute their predictions for the live data and submit them to the Numerai service. The true target values for live data become known 4 weeks after release, and the rewards are then computed.

The reward can be computed in two ways (the user decides which one is used):
- reward based on correlation between predictions and targets on live data
- reward based on correlation between predictions and targets on live data and MMC contribution 

Three things need to be clarified:
- Rewards are clipped on both sides: the absolute value of the correlation (or corr + MMC) is clipped at 0.25.
- The reward can be negative. If the correlation (or corr + MMC) is negative, your stake is decreased according to the clipped correlation (or corr + MMC) value.
- What is MMC? MMC stands for Meta Model Contribution. It is a metric that measures how much your predictions contribute to the Numerai Meta Model.
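
The clipping rule above can be illustrated with a minimal sketch. The helper `stake_change` is hypothetical, and it assumes the change in stake is simply the stake multiplied by the clipped score; the actual payout rules may include additional factors that are not described here.

```python
import numpy as np

CLIP = 0.25  # clipping threshold stated in the text above


def stake_change(stake, score, clip=CLIP):
    """Hypothetical helper: change in stake (in NMR) for one round.

    Assumption: the change equals stake * clipped score. This is a
    simplification for illustration, not the official payout formula.
    """
    clipped = float(np.clip(score, -clip, clip))
    return stake * clipped


# Small scores pass through unchanged; large ones are capped at +/- 0.25.
for score in (0.03, -0.02, 0.40, -0.40):
    change = stake_change(100.0, score)
    print(f"score={score:+.2f} -> stake change={change:+.2f} NMR")
```

Under this assumption, a score of 0.40 pays the same as a score of 0.25, and a negative score reduces the stake by at most 25%.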

## Why is it the hardest data science tournament on the planet?

The number of possible solutions is **endless**. The user's goal is to select the solution that will be the most profitable in the long run. This is hard because the signal-to-noise ratio is very low.

### How to do this?
You need to check as many different solutions as possible and select the best one (or the best few, because in the tournament you can stake on up to 10 models). In this tutorial I will show you how to create a set of solutions and then select the models that **minimize the risk** and **maximize the profit**.

In [3]:
# load packages
import gc
import pickle
import pandas as pd
import numpy as np
from scipy.stats import spearmanr, uniform
from xgboost import XGBRegressor

## Load the data

In [4]:
print("# Loading data...")
# The training data is used to train your model to predict the targets.
training_data = pd.read_csv("https://numerai-public-datasets.s3-us-west-2.amazonaws.com/latest_numerai_training_data.csv.xz").set_index("id", drop=True).astype("float16", errors="ignore")
# The tournament data is the data that Numerai uses to evaluate your model.
tournament_data = pd.read_csv("https://numerai-public-datasets.s3-us-west-2.amazonaws.com/latest_numerai_tournament_data.csv.xz").set_index("id", drop=True).astype("float16", errors="ignore")
validation_data = tournament_data[tournament_data["data_type"] == "validation"]
# free the full tournament data; only the validation rows are used below
del tournament_data
gc.collect()

feature_names = [f for f in training_data.columns if f.startswith("feature")]
print(f"Loaded {len(feature_names)} features")

# Loading data...
Loaded 310 features


## Train models and collect predictions

In [154]:
# prepare data for 2-fold CV: split the eras into a first and a second half
era_list = training_data["era"].unique().tolist()
h1_eras = era_list[:len(era_list)//2]
h2_eras = era_list[len(era_list)//2:]

h1_data = training_data[training_data["era"].isin(h1_eras)]
h2_data = training_data[training_data["era"].isin(h2_eras)]
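
The split above is done on eras rather than on individual rows, so every row of a given era ends up in the same fold. A minimal sketch on toy data (the frame `toy` and its values are made up for illustration) shows how to check that the two folds share no era and together cover every row:

```python
import pandas as pd

# Toy stand-in for training_data: four eras with two rows each.
toy = pd.DataFrame({
    "era": ["era1", "era1", "era2", "era2", "era3", "era3", "era4", "era4"],
    "feature_a": [0.0, 0.25, 0.5, 0.75, 1.0, 0.5, 0.25, 0.0],
})

# Same splitting logic as in the cell above.
toy_eras = toy["era"].unique().tolist()
first_half = toy_eras[:len(toy_eras) // 2]
second_half = toy_eras[len(toy_eras) // 2:]

toy_h1 = toy[toy["era"].isin(first_half)]
toy_h2 = toy[toy["era"].isin(second_half)]

# Sanity checks: no era appears in both folds, and no row is lost.
overlap = set(toy_h1["era"]) & set(toy_h2["era"])
print("fold 1 eras:", first_half)
print("fold 2 eras:", second_half)
print("shared eras:", overlap)
print("rows covered:", len(toy_h1) + len(toy_h2), "of", len(toy))
```

Because `unique()` keeps the order of first appearance, the halves follow the order in which eras occur in the data.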

## Do a random search in the space of available models

In [None]:
np.random.seed(1776)

predictions = []
for i in range(300):
    name = f"Xgboost_{i}"
    max_depth = np.random.choice([2, 3, 4, 5])
    learning_rate = np.random.choice([0.001, 0.005, 0.01, 0.05, 0.1])
    n_estimators = np.random.choice([10, 50, 100, 200])
    colsample_bytree = np.random.choice([0.1, 0.2, 0.4, 0.7, 1.0])

    features = feature_names # default, use all features
    select_eras = era_list # default, use all eras
    if i == 0:
        print("Example Predictions")
        max_depth = 5
        learning_rate = 0.01
        n_estimators = 2000
        colsample_bytree = 0.1
    
    if i >= 100:
        print("Subsample columns")
        cols_count = np.random.randint(30, 270) # arbitrary numbers
        features = np.random.choice(feature_names, cols_count, replace=False)
        
    if i >= 200:
        print("Subsample eras")
        eras_count = np.random.randint(40, 110) # arbitrary numbers
        select_eras = np.random.choice(era_list, eras_count, replace=False)
        
    print(f"{name}, max_depth={max_depth}, learning_rate={learning_rate}, n_estimators={n_estimators}, colsample_bytree={colsample_bytree}, eras count={len(select_eras)}, features count={len(features)}")

    model = XGBRegressor(max_depth=max_depth,
                         learning_rate=learning_rate,
                         n_estimators=n_estimators,
                         colsample_bytree=colsample_bytree,
                         n_jobs=-1,
                         verbosity=1,
                         random_state=12)

    # train on fold 1
    model.fit(h1_data[features][h1_data["era"].isin(select_eras)], h1_data["target_kazutsugi"][h1_data["era"].isin(select_eras)])
    h2_preds = pd.Series(model.predict(h2_data[features]), index=h2_data.index)
    validation_preds_fold_1 = pd.Series(model.predict(validation_data[features]), index=validation_data.index)
    
    

    # train on fold 2
    model.fit(h2_data[features][h2_data["era"].isin(select_eras)], h2_data["target_kazutsugi"][h2_data["era"].isin(select_eras)])
    h1_preds = pd.Series(model.predict(h1_data[features]), index=h1_data.index)
    validation_preds_fold_2 = pd.Series(model.predict(validation_data[features]), index=validation_data.index)

    # train on all training data
    model.fit(training_data[features][training_data["era"].isin(select_eras)], training_data["target_kazutsugi"][training_data["era"].isin(select_eras)])
    validation_preds = pd.Series(model.predict(validation_data[features]), index=validation_data.index)

    predictions += [{
        "name": name,
        "training_preds": pd.concat([h1_preds, h2_preds]),
        "validation_preds_cv": (validation_preds_fold_1 + validation_preds_fold_2) / 2.0,
        "validation_preds": validation_preds,
    }]
    # back up all predictions collected so far; the context manager closes the file
    with open("backup.pickle", "wb") as f:
        pickle.dump(predictions, f)


Example Predictions
Xgboost_0, max_depth=5, learning_rate=0.01, n_estimators=2000, colsample_bytree=0.1, eras count=120, features count=310
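
The hyperparameter draws in the loop above can also be isolated in a small function, which makes it easy to inspect configurations before training anything. This is only a sketch: `sample_params` is a hypothetical helper, and it uses `np.random.default_rng` instead of `np.random.seed`, so the drawn values will differ from those produced by the loop above.

```python
import numpy as np


def sample_params(rng):
    """Hypothetical helper: draw one configuration from the grids used above."""
    return {
        # cast NumPy scalars to plain Python types for readable printing
        "max_depth": int(rng.choice([2, 3, 4, 5])),
        "learning_rate": float(rng.choice([0.001, 0.005, 0.01, 0.05, 0.1])),
        "n_estimators": int(rng.choice([10, 50, 100, 200])),
        "colsample_bytree": float(rng.choice([0.1, 0.2, 0.4, 0.7, 1.0])),
    }


rng = np.random.default_rng(1776)
configs = [sample_params(rng) for _ in range(5)]
for config in configs:
    print(config)
```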


## Prepare prediction DataFrames

In [138]:
training_preds_df = pd.DataFrame()
validation_preds_cv_df = pd.DataFrame()
validation_preds_df = pd.DataFrame()

In [139]:
for pred in predictions:
    training_preds_df[pred["name"]] = pred["training_preds"]
    validation_preds_cv_df[pred["name"]] = pred["validation_preds_cv"]
    validation_preds_df[pred["name"]] = pred["validation_preds"]

In [140]:
training_preds_df["era"] = training_data["era"]
validation_preds_cv_df["era"] = validation_data["era"]
validation_preds_df["era"] = validation_data["era"]

In [155]:
training_preds_df["target"] = training_data["target_kazutsugi"]
validation_preds_cv_df["target"] = validation_data["target_kazutsugi"]
validation_preds_df["target"] = validation_data["target_kazutsugi"]

In [157]:
# save them
training_preds_df.to_csv("training_preds.csv")
validation_preds_cv_df.to_csv("validation_preds_cv.csv")
validation_preds_df.to_csv("validation_preds.csv")

## Compute Scores per Era

In [142]:
model_names = [c for c in validation_preds_df if c not in ("era", "target")]

In [143]:
val_scores_per_era = validation_preds_df.groupby("era").apply(lambda d: d[model_names].corrwith(d["target"], method="spearman"))

In [144]:
training_scores_per_era = training_preds_df.groupby("era").apply(lambda d: d[model_names].corrwith(d["target"], method="spearman"))
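
To see what the per-era scores look like, here is a minimal sketch on synthetic data. The frame `toy`, the columns `model_good` and `model_noise`, and the helper `era_scores` are all made up for illustration. The sketch computes the Spearman correlation as the Pearson correlation of ranks, which is equivalent to `method="spearman"` and keeps the example free of extra dependencies.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic data: two eras, a five-level target, and two fake "models".
toy = pd.DataFrame({
    "era": np.repeat(["era1", "era2"], n // 2),
    "target": rng.choice([0.0, 0.25, 0.5, 0.75, 1.0], size=n),
})
toy["model_good"] = toy["target"] + rng.normal(0, 0.1, size=n)  # follows the target
toy["model_noise"] = rng.normal(0, 1, size=n)                   # unrelated to the target
toy_models = ["model_good", "model_noise"]


def era_scores(d):
    # Spearman correlation = Pearson correlation computed on ranks
    ranked_target = d["target"].rank()
    return d[toy_models].rank().corrwith(ranked_target)


# One row per era, one column per model.
toy_scores = toy.groupby("era")[toy_models + ["target"]].apply(era_scores)
print(toy_scores)
```

The result has the same layout as `val_scores_per_era`: eras in the index and one column per model, which is the shape the metrics in the next section expect.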

## Overall Metrics

In [145]:
metrics_df = pd.DataFrame(index=model_names, columns=[])

In [146]:
# Calculate mean
metrics_df["val_mean"] = val_scores_per_era.mean()
metrics_df["training_mean"] = training_scores_per_era.mean()
# Calculate Sharpe ratio (mean per-era score divided by its standard deviation)
metrics_df["val_sharpe"] = val_scores_per_era.mean()/val_scores_per_era.std()
metrics_df["training_sharpe"] = training_scores_per_era.mean()/training_scores_per_era.std()
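
The difference between the mean and the Sharpe ratio is easiest to see on a small example. In the sketch below the per-era scores are made up: both hypothetical models have the same mean score, but `model_a` is far more consistent across eras, so it gets the higher Sharpe ratio. Ranking by Sharpe is shown as one possible way to prefer lower-risk models, not as the only selection rule.

```python
import pandas as pd

# Made-up per-era scores for two hypothetical models.
toy_scores = pd.DataFrame(
    {
        "model_a": [0.02, 0.03, 0.02, 0.03],    # steady
        "model_b": [0.10, -0.06, 0.09, -0.03],  # volatile
    },
    index=["era1", "era2", "era3", "era4"],
)

toy_metrics = pd.DataFrame(index=toy_scores.columns)
toy_metrics["mean"] = toy_scores.mean()
# Same definition as above: mean per-era score over its standard deviation.
toy_metrics["sharpe"] = toy_scores.mean() / toy_scores.std()

# Rank models from most to least consistent.
ranked = toy_metrics.sort_values("sharpe", ascending=False)
print(ranked)
```

The same `sort_values` call can be applied to `metrics_df["val_sharpe"]` to list the candidate models in order.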

## Feature Neutrality

In [None]:
# work in progress ...