<a href="https://colab.research.google.com/github/parmarsuraj99/numerai-guides/blob/master/KCL/Intro_to_Numerai_KCL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## A code-first introduction to the Numerai tournament

## Overview 

1. Importing required tools and libraries
2. Loading the data
3. EDA
4. Simple modelling
5. Tuning some hyperparameters
6. Making and evaluating predictions
7. Submitting the predictions

To speed up computation, we can use GPU acceleration.

`Runtime -> Change runtime type -> Hardware accelerator -> GPU -> Save`

In [None]:
!nvidia-smi

## Importing tools and libraries

1. Data loading and EDA:

    - numerapi (Numerai's API for downloading latest files)
    - Pandas
    - numpy
    - matplotlib

2. Modelling:
    
    - sklearn
    - catboost

3. Evaluation

    - scipy

`pip install` installs libraries into the Python environment.

In [None]:
!pip install numerapi
!pip install catboost

In [None]:
import os   # OS utilities
import gc   # garbage collector
import csv  # reading the CSV header row

import numpy as np   # fast vectorized operations
import pandas as pd  # loading .csv files into DataFrames
import matplotlib.pyplot as plt  # visualizations

import numerapi  # programmatically downloading data and uploading predictions

In [None]:
napi = numerapi.NumerAPI(verbosity="info")
# download current dataset
napi.download_current_dataset(unzip=True)

current_round = napi.get_current_round()
print(f"Current round: {current_round}")

In [None]:
TOURNAMENT_NAME = "kazutsugi"
TARGET_NAME = f"target_{TOURNAMENT_NAME}"
PREDICTION_NAME = f"prediction_{TOURNAMENT_NAME}"

# Submissions are scored by rank correlation: predictions are ranked,
# then Pearson-correlated with the targets
def correlation(predictions, targets):
    ranked_preds = predictions.rank(pct=True, method="first")
    return np.corrcoef(ranked_preds, targets)[0, 1]


# convenience method for scoring
def score(df):
    return correlation(df[PREDICTION_NAME], df[TARGET_NAME])


# Payout is just the score clipped at +/-25%
def payout(scores):
    return scores.clip(lower=-0.25, upper=0.25)


# Read the CSV file into a pandas DataFrame, loading feature and target
# columns as float16 to save memory
def read_csv(file_path):
    with open(file_path, 'r') as f:
        column_names = next(csv.reader(f))
        dtypes = {x: np.float16 for x in column_names if
                  x.startswith(('feature', 'target'))}
    return pd.read_csv(file_path, dtype=dtypes)

In [None]:
%%time
print("# Loading data...")
# The training data is used to train your model how to predict the targets.
training_data = read_csv(f"/content/numerai_dataset_{current_round}/numerai_training_data.csv").set_index("id")
# The tournament data is the data that Numerai uses to evaluate your model.
tournament_data = read_csv(f"/content/numerai_dataset_{current_round}/numerai_tournament_data.csv").set_index("id")

example_preds = read_csv(f"/content/numerai_dataset_{current_round}/example_predictions_target_kazutsugi.csv")

validation_data = tournament_data[tournament_data.data_type == "validation"]

## Scoring function

Your predictions are scored on their correlation with live targets. 
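To make the scoring idea concrete, here is a minimal sketch on synthetic data. It assumes toy targets in five evenly spaced buckets between 0 and 1; `rank_correlation`, `toy_targets`, and `toy_preds` are illustrative names and not part of Numerai's API. Because predictions are ranked before being correlated, a strictly increasing transformation of the predictions should leave the score unchanged.

```python
import numpy as np
import pandas as pd


def rank_correlation(predictions: pd.Series, targets: pd.Series) -> float:
    """Rank the predictions, then Pearson-correlate the ranks with the targets."""
    ranked = predictions.rank(pct=True, method="first")
    return float(np.corrcoef(ranked, targets)[0, 1])


rng = np.random.default_rng(0)

# Assumption for illustration: targets fall into five buckets in [0, 1]
toy_targets = pd.Series(rng.choice([0.0, 0.25, 0.5, 0.75, 1.0], size=1000))
# Noisy predictions that carry some information about the targets
toy_preds = toy_targets + rng.normal(scale=1.0, size=1000)

toy_score = rank_correlation(toy_preds, toy_targets)
# exp() is strictly increasing, so the ranks (and the score) stay the same
toy_score_monotone = rank_correlation(np.exp(toy_preds), toy_targets)

print(f"score: {toy_score:.4f}, after exp(): {toy_score_monotone:.4f}")
```

Only the ordering of your predictions matters for this kind of score, not their scale.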

## Data Exploration

In [None]:
training_data.head()

In [None]:
training_data.columns

In [None]:
feature_names = [feature for feature in training_data.columns if feature.startswith("feature")]
print(len(feature_names),"\n",feature_names)

In [None]:
feature_types = ["intelligence", "charisma", "strength", "dexterity", "constitution", "wisdom"]

### Era

In [None]:
training_data["erano"] = training_data.era.str.slice(3).astype(int)
eras = training_data.erano

print(np.unique(training_data['era']), "\n Total Eras in training data", len(np.unique(training_data['era'])))

In [None]:
training_data.groupby(eras).size().plot()

In [None]:
training_data.groupby(TARGET_NAME).size()

Numerai features are non-stationary, i.e., a feature may be highly correlated with the target in some eras while it may even hurt performance in others. The plots below compare feature-to-feature correlations across a few eras.
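As a rough illustration of what non-stationarity can look like, the sketch below builds a synthetic dataset in which one feature's relationship to the target flips sign from era to era. Everything here (`feature_toy`, `target_toy`, the flipping rule, the `spearman` helper) is an assumption made for the example, not a property of the real data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_eras, rows_per_era = 6, 500

frames = []
for era in range(1, n_eras + 1):
    feature = rng.normal(size=rows_per_era)
    # Assumption for illustration: the relationship flips sign every era
    sign = 1.0 if era % 2 == 1 else -1.0
    target = sign * 0.5 * feature + rng.normal(size=rows_per_era)
    frames.append(
        pd.DataFrame({"era": f"era{era}", "feature_toy": feature, "target_toy": target})
    )
toy = pd.concat(frames, ignore_index=True)


def spearman(a: pd.Series, b: pd.Series) -> float:
    """Spearman correlation computed as the Pearson correlation of ranks."""
    return float(a.rank().corr(b.rank()))


# Correlation of the feature with the target inside each era
per_era_corr = pd.Series(
    {era: spearman(g["feature_toy"], g["target_toy"]) for era, g in toy.groupby("era")}
)
# Correlation over all eras pooled together
pooled_corr = spearman(toy["feature_toy"], toy["target_toy"])

print(per_era_corr)
print(f"pooled: {pooled_corr:.4f}")
```

In this toy setup the pooled correlation is close to zero even though the feature is informative within every era, which is why per-era evaluation is used throughout this notebook.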

In [None]:
era_ = [1, 10, 22, 37, 50, 111]

fig = plt.figure(figsize=(20, 12))

for i in range(1, len(era_)+1):

    feature_corr = training_data[training_data["erano"]==era_[i-1]][feature_names[:20]].corr(method="spearman")

    ax = fig.add_subplot(2, 3, i)
    ax.set_title(f"Era: {era_[i-1]}")
    ax.matshow(feature_corr)

plt.show()

An overfitted model may perform exceptionally well for 2-3 rounds and then burn heavily in the next round.

You want your model to perform well across eras in live data.

## Simple model

In [None]:
from sklearn import linear_model

In [None]:
%%time
lin_reg = linear_model.LinearRegression()
lin_reg.fit(training_data[feature_names], training_data[TARGET_NAME])

In [None]:
tr_preds = lin_reg.predict(training_data[feature_names])
tour_preds = lin_reg.predict(tournament_data[feature_names])

training_data[PREDICTION_NAME] = tr_preds
tournament_data[PREDICTION_NAME] = tour_preds

In [None]:
#FEATURE_EXPOSURE
validation_data = tournament_data[tournament_data.data_type == "validation"]
corr_list = []
for feature in feature_names:
    # Rank-correlate each feature with the predictions on the validation rows
    corr_list.append(correlation(validation_data[feature], 
                               validation_data[PREDICTION_NAME]))
corr_series = pd.Series(corr_list, index=feature_names)
print("Max Feat. exposure: ", corr_series.describe()["max"])

top_k_feats = list(corr_series.nlargest(100).index)
print(top_k_feats[:10])

# Check the per-era correlations on the training set
train_correlations = training_data.groupby("era").apply(score)
print(f"\nOn training the correlation has mean {train_correlations.mean()} and std {train_correlations.std()}")

# Check the per-era correlations on the validation set
validation_data = tournament_data[tournament_data.data_type == "validation"]
validation_correlations = validation_data.groupby("era").apply(score)
print(f"\nOn validation the correlation has mean {validation_correlations.mean()} and std {validation_correlations.std()}")

Models with large exposures to individual features tend to perform poorly or inconsistently out of sample.
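The exposure check used in this notebook can be wrapped in a small helper. The sketch below is one possible variant (it ranks both the feature and the predictions) and runs on synthetic data; `feature_exposures`, the `feature_toy*` columns, and the two prediction columns are hypothetical names introduced only for this example.

```python
import numpy as np
import pandas as pd


def feature_exposures(df: pd.DataFrame, feature_cols, prediction_col: str) -> pd.Series:
    """Rank correlation between each feature and the predictions."""
    ranked_preds = df[prediction_col].rank(pct=True, method="first")
    return pd.Series(
        {
            col: np.corrcoef(df[col].rank(pct=True, method="first"), ranked_preds)[0, 1]
            for col in feature_cols
        }
    )


rng = np.random.default_rng(1)
toy_features = [f"feature_toy{i}" for i in range(4)]
toy = pd.DataFrame(rng.normal(size=(2000, 4)), columns=toy_features)

# A prediction that leans almost entirely on one feature
toy["prediction_concentrated"] = toy["feature_toy0"] + 0.1 * rng.normal(size=2000)
# A prediction that spreads its weight evenly over all features
toy["prediction_balanced"] = toy[toy_features].mean(axis=1)

concentrated = feature_exposures(toy, toy_features, "prediction_concentrated")
balanced = feature_exposures(toy, toy_features, "prediction_balanced")

print("max exposure (concentrated):", concentrated.abs().max())
print("max exposure (balanced):    ", balanced.abs().max())
```

The concentrated prediction shows a much larger maximum exposure than the balanced one, which is the pattern the exposure printouts in this notebook are meant to surface.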

## Let's do some pre-processing

We apply some transformations to the data to see how they affect performance.

### PCA

In [None]:
from sklearn.decomposition import PCA

In [None]:
pca = PCA(n_components=100)

pca.fit(training_data[feature_names])

pca_train = pca.transform(training_data[feature_names])
pca_tour = pca.transform(tournament_data[feature_names])
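The choice of `n_components=100` above is one option among many. A common way to pick the number of components is to look at the cumulative explained variance. The sketch below does this on synthetic data generated from three latent factors; the data, the 95% threshold, and the names `toy_features`, `toy_pca`, and `n_for_95` are all assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: 20 observed features driven by 3 latent factors plus small noise
latent = rng.normal(size=(1000, 3))
mixing = rng.normal(size=(3, 20))
toy_features = latent @ mixing + 0.05 * rng.normal(size=(1000, 20))

toy_pca = PCA(n_components=10).fit(toy_features)

# Fraction of total variance explained by the first k components, for k = 1..10
cumulative = np.cumsum(toy_pca.explained_variance_ratio_)

# Smallest number of components whose cumulative explained variance reaches 95%
n_for_95 = int(np.searchsorted(cumulative, 0.95) + 1)

print(np.round(cumulative, 4))
print("components needed for 95% variance:", n_for_95)
```

On the real features you could run the same check with `pca.explained_variance_ratio_` after fitting, and compare a few values of `n_components` by their validation correlation.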

In [None]:
%%time
lin_reg = linear_model.LinearRegression()
lin_reg.fit(pca_train, training_data[TARGET_NAME])

In [None]:
tr_preds = lin_reg.predict(pca_train)
tour_preds = lin_reg.predict(pca_tour)

training_data[PREDICTION_NAME] = tr_preds
tournament_data[PREDICTION_NAME] = tour_preds

In [None]:
#FEATURE_EXPOSURE
validation_data = tournament_data[tournament_data.data_type == "validation"]
corr_list = []
for feature in feature_names:
    # Rank-correlate each feature with the predictions on the validation rows
    corr_list.append(correlation(validation_data[feature], 
                               validation_data[PREDICTION_NAME]))
corr_series = pd.Series(corr_list, index=feature_names)
print("Feat. exposure: ", corr_series.describe()["std"])
print("Max Feat. exposure: ", corr_series.describe()["max"])

top_k_feats = list(corr_series.nlargest(100).index)
print(top_k_feats[:10])


# Check the per-era correlations on the training set
train_correlations = training_data.groupby("era").apply(score)
print(f"\nOn training the correlation has mean {train_correlations.mean()} and std {train_correlations.std()}")
print(f"On training the average per-era payout is {payout(train_correlations).mean()}")

# Check the per-era correlations on the validation set
validation_data = tournament_data[tournament_data.data_type == "validation"]
validation_correlations = validation_data.groupby("era").apply(score)
print(f"\nOn validation the correlation has mean {validation_correlations.mean()} and std {validation_correlations.std()}")
print(f"On validation the average per-era payout is {payout(validation_correlations).mean()}")

### Using only some features

We can also try modelling with only a subset of the features.

Some options:

- Use a combination of feature groups (e.g., intelligence and constitution)
- Use the top-k features most correlated with the target
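The first option can be sketched with plain string matching. This assumes feature columns are named `feature_<group><number>` (for example `feature_intelligence1`); that naming pattern, the helper `features_in_groups`, and the toy column list are assumptions for illustration, so check them against `training_data.columns` before relying on them.

```python
def features_in_groups(columns, groups):
    """Return the columns whose names start with 'feature_<group>' for any given group."""
    prefixes = tuple(f"feature_{group}" for group in groups)
    return [col for col in columns if col.startswith(prefixes)]


# Toy column list standing in for training_data.columns
toy_columns = [
    "era",
    "data_type",
    "feature_intelligence1",
    "feature_intelligence2",
    "feature_charisma1",
    "feature_constitution1",
    "target_kazutsugi",
]

selected_group_features = features_in_groups(toy_columns, ["intelligence", "constitution"])
print(selected_group_features)
```

The resulting list can be used anywhere `feature_names` is used in this notebook.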

In [None]:
corr_list = []
for feature in feature_names:
    corr_list.append(correlation(training_data[feature],
                     training_data[TARGET_NAME]))
    
corr_series = pd.Series(corr_list, index=feature_names)

In [None]:
# Here, top-k is set to 100. Note that nlargest keeps only the most
# positively correlated features.
selected_features = corr_series.nlargest(100).index
print(selected_features)

In [None]:
#Exercise: Select top-k features and train a model using them.
#use training_data[selected_features] instead of feature_names

## Optimization

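We tune parameters with cross-validation on groups of eras: all rows from the same era are kept together in one fold, so no era is split between the training part and the validation part.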
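Keeping whole eras together can be checked on a tiny example. The sketch below uses scikit-learn's `GroupKFold` on made-up data with six eras of ten rows each; `toy_eras`, `toy_X`, and `toy_y` are placeholders for illustration.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Six toy eras with ten rows each; the feature and target values do not matter here
toy_eras = np.repeat(np.arange(1, 7), 10)
toy_X = np.zeros((len(toy_eras), 3))
toy_y = np.zeros(len(toy_eras))

splitter = GroupKFold(n_splits=3)

fold_eras = []
for fold, (train_idx, test_idx) in enumerate(splitter.split(toy_X, toy_y, groups=toy_eras)):
    train_eras = set(toy_eras[train_idx].tolist())
    test_eras = set(toy_eras[test_idx].tolist())
    fold_eras.append((train_eras, test_eras))
    print(f"fold {fold}: train eras {sorted(train_eras)}, held-out eras {sorted(test_eras)}")
```

Note that `GroupKFold` does not respect time order, so a held-out era may come before some of the eras used for training.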

In [None]:
from sklearn import model_selection

In [None]:
CV = model_selection.GroupKFold(n_splits=3)
grp = list(CV.split(X = training_data[feature_names], y = training_data[TARGET_NAME],  groups = eras))

In [None]:
grp

Optimising [Ridge Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html#sklearn.linear_model.Ridge) for alpha

In [None]:
R = linear_model.Ridge(copy_X=True, fit_intercept=True, max_iter=None,
                       random_state=None, solver='auto')
# Make sure you omit the keyword arguments for the parameter(s) you wish to optimize

params1 = {'alpha': [0.001, 0.01, 0.1]}
GS = model_selection.GridSearchCV(estimator = R, param_grid = params1, 
                                  cv = grp, return_train_score = True)

GS.fit(training_data[feature_names].values, training_data[TARGET_NAME].values)



In [None]:
GS.best_params_

Exercise: tune more parameters.
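One possible way to approach this exercise is sketched below on synthetic data: it searches over two Ridge parameters and scores each candidate with a rank correlation instead of the default R². The scorer `rank_corr`, the toy data, and the parameter grid are assumptions for illustration, not recommended values.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV, GroupKFold


def rank_corr(y_true, y_pred):
    """Rank the predictions, then Pearson-correlate the ranks with the targets."""
    ranked = pd.Series(y_pred).rank(pct=True, method="first")
    return float(np.corrcoef(ranked, y_true)[0, 1])


rng = np.random.default_rng(0)

# Toy regression problem: 6 groups ("eras") of 100 rows, 5 features
toy_X = rng.normal(size=(600, 5))
toy_y = toy_X @ np.array([0.5, -0.3, 0.0, 0.0, 0.2]) + rng.normal(size=600)
toy_groups = np.repeat(np.arange(6), 100)

search = GridSearchCV(
    estimator=Ridge(),
    param_grid={"alpha": [0.1, 1.0, 10.0], "fit_intercept": [True, False]},
    scoring=make_scorer(rank_corr, greater_is_better=True),
    cv=GroupKFold(n_splits=3),
)
# The groups are forwarded to GroupKFold so that each toy era stays in one fold
search.fit(toy_X, toy_y, groups=toy_groups)

print(search.best_params_, round(search.best_score_, 4))
```

With the real data you would pass `training_data[feature_names].values`, the targets, and `groups=eras` in the same way.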

In [None]:
tr_preds = GS.predict(training_data[feature_names].values)
tour_preds = GS.predict(tournament_data[feature_names].values)

In [None]:
training_data[PREDICTION_NAME] = tr_preds
tournament_data[PREDICTION_NAME] = tour_preds

In [None]:
#FEATURE_EXPOSURE
validation_data = tournament_data[tournament_data.data_type == "validation"]
corr_list = []
for feature in feature_names:
    # Rank-correlate each feature with the predictions on the validation rows
    corr_list.append(correlation(validation_data[feature], 
                               validation_data[PREDICTION_NAME]))
corr_series = pd.Series(corr_list, index=feature_names)
print("Feat. exposure: ", corr_series.describe()["std"])
print("Max Feat. exposure: ", corr_series.describe()["max"])

top_k_feats = list(corr_series.nlargest(100).index)
print(top_k_feats[:10])


# Check the per-era correlations on the training set
train_correlations = training_data.groupby("era").apply(score)
print(f"\nOn training the correlation has mean {train_correlations.mean()} and std {train_correlations.std()}")

# Check the per-era correlations on the validation set
validation_data = tournament_data[tournament_data.data_type == "validation"]
validation_correlations = validation_data.groupby("era").apply(score)
print(f"\nOn validation the correlation has mean {validation_correlations.mean()} and std {validation_correlations.std()}")

## Boosting Models

- CatBoost (because it comes with GPU support on Colab)
- You can try other libraries like XGBoost and LightGBM too

In [None]:
from catboost import CatBoostRegressor

In [None]:
# Mostly default parameters: 500 boosting iterations, trained on the GPU
params = {
    "iterations":500,
    "task_type":"GPU"
}

cat_reg = CatBoostRegressor(**params)

In [None]:
cat_reg.fit(training_data[feature_names].values, training_data[TARGET_NAME].values,
            eval_set=(validation_data[feature_names].values, validation_data[TARGET_NAME].values))

In [None]:
%%time

tr_preds = cat_reg.predict(training_data[feature_names])
tour_preds = cat_reg.predict(tournament_data[feature_names])

training_data[PREDICTION_NAME] = tr_preds
tournament_data[PREDICTION_NAME] = tour_preds

In [None]:
#FEATURE_EXPOSURE
validation_data = tournament_data[tournament_data.data_type == "validation"]
corr_list = []
for feature in feature_names:
    # Rank-correlate each feature with the predictions on the validation rows
    corr_list.append(correlation(validation_data[feature], 
                               validation_data[PREDICTION_NAME]))
corr_series = pd.Series(corr_list, index=feature_names)
print("Feat. exposure: ", corr_series.describe()["std"])
print("Max Feat. exposure: ", corr_series.describe()["max"])

top_k_feats = list(corr_series.nlargest(100).index)
print(top_k_feats[:10])


# Check the per-era correlations on the training set
train_correlations = training_data.groupby("era").apply(score)
print(f"\nOn training the correlation has mean {train_correlations.mean()} and std {train_correlations.std()}")

# Check the per-era correlations on the validation set
validation_data = tournament_data[tournament_data.data_type == "validation"]
validation_correlations = validation_data.groupby("era").apply(score)
print(f"\nOn validation the correlation has mean {validation_correlations.mean()} and std {validation_correlations.std()}")

Exercise: tune the CatBoost parameters.

https://www.dezyre.com/recipes/find-optimal-parameters-for-catboost-using-gridsearchcv-for-regression


Let's see how the example predictions perform.

In [None]:
tournament_data[PREDICTION_NAME] = example_preds["prediction_kazutsugi"].values

In [None]:
#FEATURE_EXPOSURE
validation_data = tournament_data[tournament_data.data_type == "validation"]
corr_list = []
for feature in feature_names:
    # Rank-correlate each feature with the predictions on the validation rows
    corr_list.append(correlation(validation_data[feature], 
                               validation_data[PREDICTION_NAME]))
corr_series = pd.Series(corr_list, index=feature_names)
print("Feat. exposure: ", corr_series.describe()["std"])
print("Max Feat. exposure: ", corr_series.describe()["max"])

top_k_feats = list(corr_series.nlargest(100).index)
print(top_k_feats[:10])

# Check the per-era correlations on the validation set
validation_data = tournament_data[tournament_data.data_type == "validation"]
validation_correlations = validation_data.groupby("era").apply(score)
print(f"\nOn validation the correlation has mean {validation_correlations.mean()} and std {validation_correlations.std()}")

These are strong scores. Try to get comparable results with your own model.


## Making Final Predictions

In [None]:
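# Write only the id index and the prediction column; the feature columns are not part of a submission
tournament_data[[PREDICTION_NAME]].to_csv("sub_model_name_" + TOURNAMENT_NAME + "_submission.csv")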

In [None]:

# Fill in your own API keys and model id from your Numerai account
public_id = ""
secret_key = ""
model_id = ""
napi = numerapi.NumerAPI(public_id=public_id, secret_key=secret_key)

In [None]:
submission_id = napi.upload_predictions("sub_model_name_" + TOURNAMENT_NAME + "_submission.csv", model_id=model_id)