<a href="https://colab.research.google.com/github/parmarsuraj99/numerai-guides/blob/master/easy_guide/Numerai_e2e_CatBoost.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# An End-to-end guide to making your first Numer.ai Submission

The goal of this Colab notebook is to get you up and running with the tournament in the easiest way possible. Numerai already provides many helpful scripts, and this notebook is inspired by [example-scripts](https://github.com/numerai/example-scripts).

Colab gives everyone free access to a GPU/TPU ⚡. To use a GPU for your model, go to `Runtime > Change runtime type > GPU > Save`.

---

All you have to do to make your first submission is:

- Make sure you have signed up on [Numerai](https://numer.ai/signup)
- Create and set up your API keys (which is super easy)
- Click `Runtime > Run all`

## Loading required libraries 📔 and dataset 🗄️🔽

In [None]:
# installing required libraries
# numerapi, for facilitating data download and predictions uploading
# catboost, for modeling and making predictions
!pip install numerapi
!pip install catboost

In [None]:
import os
import gc
import csv
import glob
import time
from pathlib import Path

import numerapi

import scipy
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from catboost import CatBoostRegressor

In [None]:
napi = numerapi.NumerAPI(verbosity="info")
# download current dataset
napi.download_current_dataset(unzip=True)

current_ds = napi.get_current_round()
# folder the dataset was unzipped into, e.g. numerai_dataset_<round number>
latest_round = f"numerai_dataset_{current_ds}"

## Helper functions for efficient loading and evaluation 📐

In [None]:
TOURNAMENT_NAME = "kazutsugi"
TARGET_NAME = f"target_{TOURNAMENT_NAME}"
PREDICTION_NAME = f"prediction_{TOURNAMENT_NAME}"

BENCHMARK = 0
BAND = 0.2

#-----------------------------------------------------

# Submissions are scored by spearman correlation
def score(df):
    # method="first" breaks ties based on order in array
    return np.corrcoef(
        df[TARGET_NAME],
        df[PREDICTION_NAME].rank(pct=True, method="first")
    )[0, 1]


# The payout function
def payout(scores):
    return ((scores - BENCHMARK) / BAND).clip(lower=-1, upper=1)


# Read the csv file into a pandas Dataframe
def read_csv(file_path):
    with open(file_path, 'r') as f:
        column_names = next(csv.reader(f))
        dtypes = {x: np.float16 for x in column_names if
                  x.startswith(('feature', 'target'))}
    return pd.read_csv(file_path, dtype=dtypes)
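
To get a feel for what `score` and `payout` return, here is a minimal sketch on made-up data. The column names, the `toy` frame, and its values are all hypothetical and only there for illustration. Note that `score` ranks only the predictions, then takes the Pearson correlation with the raw target.

```python
import numpy as np
import pandas as pd

# Hypothetical column names for this toy example
TARGET = "target"
PREDICTION = "prediction"


def score(df):
    # Pearson correlation between the target and the percentile rank of the predictions
    ranked = df[PREDICTION].rank(pct=True, method="first")
    return np.corrcoef(df[TARGET], ranked)[0, 1]


def payout(scores, benchmark=0.0, band=0.2):
    # Scale the correlation by the band, then cap it to [-1, 1]
    return ((scores - benchmark) / band).clip(lower=-1, upper=1)


# Two made-up eras: predictions perfectly ordered in era1, perfectly reversed in era2
toy = pd.DataFrame({
    "era": ["era1"] * 5 + ["era2"] * 5,
    TARGET: [0.0, 0.25, 0.5, 0.75, 1.0, 0.0, 0.25, 0.5, 0.75, 1.0],
    PREDICTION: [0.1, 0.2, 0.3, 0.4, 0.5, 0.9, 0.7, 0.5, 0.3, 0.1],
})

per_era = toy.groupby("era")[[TARGET, PREDICTION]].apply(score)
print(per_era)
print(payout(per_era))
```

Because the payout is clipped, any per-era correlation beyond `BAND` in either direction is capped at +1 or -1.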

## Loading and exploring dataset into memory 🖥️

In [None]:
%%time
print("# Loading data...")
# The training data is used to train your model how to predict the targets.
training_data = read_csv(os.path.join(latest_round, "numerai_training_data.csv")).set_index("id")
# The tournament data is the data that Numerai uses to evaluate your model.
tournament_data = read_csv(os.path.join(latest_round, "numerai_tournament_data.csv")).set_index("id")

example_preds = read_csv(os.path.join(latest_round, f"example_predictions_{TARGET_NAME}.csv"))

validation_data = tournament_data[tournament_data.data_type == "validation"]

In [None]:
feature_names = [f for f in training_data.columns if f.startswith("feature")]
print(f"Loaded {len(feature_names)} features")

cols = feature_names+[TARGET_NAME]

Training data | Sample submission
- | - 
![alt](https://gblobscdn.gitbook.com/assets%2F-LmGruQ_-ZYj9XMQUd5x%2F-LrjUJcZGLBAGyzvX2tl%2F-LrlScdEXnDEVhYpSsIN%2FEx_data.png?alt=media&token=66e1ed15-abca-4fda-8485-cc72b7662bdb) | ![alt](https://gblobscdn.gitbook.com/assets%2F-LmGruQ_-ZYj9XMQUd5x%2F-LrjUJcZGLBAGyzvX2tl%2F-LrlT5EetbUvp5qr9MBy%2Fimage.png?alt=media&token=cab0eef4-759f-4412-8a8c-86b211e85917)

In [None]:
training_data.head()

In [None]:
tournament_data.head()

## Training our model 🤖⚙️

This is where most of the tweaking will happen. You can try other models by swapping in a different model and adjusting the data pipeline to suit that architecture.

In [None]:
%%time
MODEL_FILE = "example_model.cbm"

params = {
    'task_type': 'GPU'
    }

model = CatBoostRegressor(**params)

if os.path.isfile(MODEL_FILE):
    print("Loading pre-trained model...")
    model.load_model(MODEL_FILE)
else:
    print("Training model...")
    model.fit(training_data[feature_names].astype(np.float32), training_data[TARGET_NAME].astype(np.float32),
         eval_set=(validation_data[feature_names].astype(np.float32), validation_data[TARGET_NAME].astype(np.float32))
         )
    model.save_model(MODEL_FILE)
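
The training cell above hard-codes `'task_type': 'GPU'`, which will fail on a CPU-only runtime. A minimal sketch of a fallback is shown below. The `build_params` helper is hypothetical, it assumes that finding `nvidia-smi` on the path is a good-enough sign that a GPU is available, and the hyperparameter values are placeholders rather than tuned settings.

```python
import shutil


def build_params(gpu_available=None):
    # Assumption: nvidia-smi on PATH is a good-enough proxy for a usable GPU
    if gpu_available is None:
        gpu_available = shutil.which("nvidia-smi") is not None
    return {
        "task_type": "GPU" if gpu_available else "CPU",
        # Illustrative values only, not tuned for the tournament
        "iterations": 1000,
        "learning_rate": 0.03,
        "depth": 6,
    }


params = build_params()
print(params)
```

The resulting dictionary can be passed the same way as before, with `CatBoostRegressor(**params)`.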

## Predictions. Evaluation. ➡️

In [None]:
%%time
print("Generating predictions on training data...")
training_preds = model.predict(training_data[feature_names].astype(np.float32))
training_data[PREDICTION_NAME] = training_preds
gc.collect()

print("Generating predictions on tournament data...")
tournament_preds = model.predict(tournament_data[feature_names].astype(np.float32))
tournament_data[PREDICTION_NAME] = tournament_preds

In [None]:
# Check the per-era correlations on the training set (in sample)
train_correlations = training_data.groupby("era").apply(score)
print(f"On training the correlation has mean {train_correlations.mean()} and std {train_correlations.std()}")
print(f"On training the average per-era payout is {payout(train_correlations).mean()}")

# Check the per-era correlations on the validation set (out of sample)
validation_data = tournament_data[tournament_data.data_type == "validation"]
validation_correlations = validation_data.groupby("era").apply(score)
print(f"On validation the correlation has mean {validation_correlations.mean()} and "
        f"std {validation_correlations.std()}")
print(f"On validation the average per-era payout is {payout(validation_correlations).mean()}")

In [None]:

# FEAT_EXPOSURE: the standard deviation of your predictions' correlations with each feature.
corr_list = []
for feature in feature_names:
    corr_list.append(np.corrcoef(tournament_data[feature].values, tournament_data[PREDICTION_NAME])[0,1])
corr_series = pd.Series(corr_list, index=feature_names)
print("Feat. exposure: ", corr_series.describe()['std'])
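
The same feature exposure number can be computed without an explicit loop. This is a sketch on random toy data, assuming pandas' `DataFrame.corrwith`; the feature names and the way the toy predictions are built are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data: three random features and a prediction that leans heavily on feature_a
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame(rng.random((n, 3)), columns=["feature_a", "feature_b", "feature_c"])
toy["prediction"] = 0.8 * toy["feature_a"] + 0.2 * rng.random(n)

feature_cols = [c for c in toy.columns if c.startswith("feature")]

# Correlation of every feature column with the predictions, in one call
exposures = toy[feature_cols].corrwith(toy["prediction"])
feature_exposure = exposures.std()

print(exposures)
print("Feat. exposure:", feature_exposure)
```

A prediction that depends mostly on one feature, like the toy one here, shows up as one large correlation and therefore a large standard deviation.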

In [None]:
tournament_data[PREDICTION_NAME].to_csv(f"{TOURNAMENT_NAME}_{current_ds}_submission.csv")
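
Before uploading, it can be worth running a quick sanity check on the predictions. The helper below is a hypothetical sketch, not part of numerapi. It assumes that predictions should lie within [0, 1] and that the submission needs exactly one row per tournament id.

```python
import pandas as pd


def check_submission(preds, expected_ids):
    """Return a list of problems found in a Series of predictions indexed by id."""
    problems = []
    if preds.isna().any():
        problems.append("predictions contain NaN")
    if preds.index.has_duplicates:
        problems.append("duplicate ids")
    if set(preds.index) != set(expected_ids):
        problems.append("ids do not match the tournament data")
    # Assumption: predictions are expected to lie within [0, 1]
    if ((preds < 0) | (preds > 1)).any():
        problems.append("predictions outside [0, 1]")
    return problems


# Toy usage with made-up ids and values
good = pd.Series([0.4, 0.6, 0.5], index=["id_a", "id_b", "id_c"])
bad = pd.Series([0.4, 1.3], index=["id_a", "id_b"])
print(check_submission(good, ["id_a", "id_b", "id_c"]))
print(check_submission(bad, ["id_a", "id_b", "id_c"]))
```

In this notebook the call would look like `check_submission(tournament_data[PREDICTION_NAME], tournament_data.index)`, and an empty list means none of these checks failed.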

## Uploading predictions using your API keys 🚀

To create a key that can only be used for uploading submissions, go to:

`Settings -> Create API key -> select "Upload Predictions" -> Save`


In [None]:
# NameOfYourAI
# Add keys between the quotes
public_id = "YourKeys"
secret_key = "YourKeys"
model_id = "YourKeys"
napi = numerapi.NumerAPI(public_id=public_id, secret_key=secret_key)
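
If you would rather not paste keys into the notebook, one option is to read them from environment variables. This is a sketch only: the `load_keys` helper and the variable names are assumptions chosen for illustration, and the values in the usage example are dummies.

```python
import os


def load_keys(env=os.environ):
    # Hypothetical variable names; use whatever names you set in your own environment
    names = ("NUMERAI_PUBLIC_ID", "NUMERAI_SECRET_KEY", "NUMERAI_MODEL_ID")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return tuple(env[name] for name in names)


# Toy usage with a plain dict standing in for the real environment
demo_env = {
    "NUMERAI_PUBLIC_ID": "dummy-public-id",
    "NUMERAI_SECRET_KEY": "dummy-secret-key",
    "NUMERAI_MODEL_ID": "dummy-model-id",
}
public_id, secret_key, model_id = load_keys(demo_env)
print(public_id, model_id)
```

Calling `load_keys()` with no arguments reads from the real environment and raises a `KeyError` listing any variable that is not set.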

In [None]:
submission_id = napi.upload_predictions(f"{TOURNAMENT_NAME}_{current_ds}_submission.csv", model_id=model_id)

And it's done. Congratulations 🎉! Your predictions for the latest round are submitted.


Check some information about your latest predictions on [Numerai Tournament](https://numer.ai/tournament). It will show some metrics like this:

![Submission](https://cdn-images-1.medium.com/max/600/1*3pb7M7utM21d3RXnhjx5KA.png)

Note: This screenshot is from one of my other submissions.


## Let's check out how well the `example_predictions` perform 💭
You can compare your model with `example_predictions` and try to beat it on some metrics. At the very least, aim for a positive correlation in your initial submissions.

In [None]:
#@title
tournament_data[PREDICTION_NAME]=example_preds['prediction_kazutsugi'].values

In [None]:
#@title
# Check the per-era correlations on the validation set (out of sample)
validation_data = tournament_data[tournament_data.data_type == "validation"]
validation_correlations = validation_data.groupby("era").apply(score)
print(f"On validation the correlation has mean {validation_correlations.mean()} and "
        f"std {validation_correlations.std()}")
print(f"On validation the average per-era payout is {payout(validation_correlations).mean()}")

## Some useful tips from my experience for using colab efficiently ✨
- You can do simple data exploration without any accelerators (GPU/TPU).
- Use a GPU/TPU only when everything is ready for execution.
- You can mount your Google Drive to save any work done here.
- Make sure to terminate the session once your work is complete and you no longer need it.


Created by Suraj Parmar

- Numerai: [SurajP](https://numer.ai/surajp)

- Twitter: [@parmarsuraj99](https://twitter.com/parmarsuraj99)


Thanks to [@NJ](https://twitter.com/tasha_jade) and [@MikeP](https://twitter.com/EasyMikeP) for the feedback
