
# Making your first submission on Numerai

## Introduction 
This tutorial will go over how to create your first submission on Numerai.

## Overview

1. Using this notebook
2. Download the datasets
3. Train your first model
4. Generate your first predictions
5. Make your first submission


---



## 1. Using this notebook 

This is an interactive notebook. You can execute the code in each cell by pressing `shift+enter`. Running cells requires you to log in with your Google account.

To make changes, save your own copy via `File -> Save a copy in Drive`.

Let's start off by installing and importing our dependencies.

In [1]:
# install dependencies
!pip install pandas scikit-learn numerapi



In [2]:
# import dependencies
import pandas as pd
import numerapi
import sklearn.linear_model

## 2. Download the datasets

### Datasets 
*   `training_data` is used to train your model
*   `tournament_data` is used to evaluate your model

### Column descriptions
*   id: a randomized id that corresponds to a stock 
*   era: a period of time
*   data_type: either `train`, `validation`, `test`, or `live` 
*   feature_*: abstract financial features of the stock 
*   target_kazutsugi: abstract measure of stock performance
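
To make the column layout concrete, here is a minimal sketch on a tiny hand-made frame. The ids, era labels, values, and the two feature names (`feature_x`, `feature_y`) are invented for illustration; only the column naming scheme comes from the descriptions above.

```python
import pandas as pd

# Toy frame mimicking the column layout described above (all values are made up).
toy = pd.DataFrame({
    "id": ["a1", "a2", "a3", "a4"],
    "era": ["era1", "era1", "era2", "era2"],
    "data_type": ["train", "train", "validation", "live"],
    "feature_x": [0.25, 0.50, 0.75, 1.00],
    "feature_y": [0.00, 0.25, 0.50, 0.75],
    "target_kazutsugi": [0.25, 0.75, 0.50, 0.00],
})

# Feature columns share the "feature" prefix.
toy_feature_cols = [c for c in toy.columns if c.startswith("feature")]

# Count rows per data_type and per era.
rows_per_type = toy["data_type"].value_counts().to_dict()
rows_per_era = toy.groupby("era").size().to_dict()

print(toy_feature_cols)
print(rows_per_type)
print(rows_per_era)
```

The same prefix test is what the notebook later uses to pick the feature columns out of the real training data.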




In [3]:
# download the latest training dataset (takes around 30s)
import os
if os.path.exists("training.csv"):
  training_data = pd.read_csv("training.csv")
else:
  training_data = pd.read_csv("https://numerai-public-datasets.s3-us-west-2.amazonaws.com/latest_numerai_training_data.csv.xz")
  # cache locally; index=False avoids an extra "Unnamed: 0" column on reload
  training_data.to_csv("training.csv", index=False)

In [4]:
# download the latest tournament dataset (takes around 30s)
if os.path.exists("tournament.csv"):
  tournament_data = pd.read_csv("tournament.csv")
else:
  tournament_data = pd.read_csv("https://numerai-public-datasets.s3-us-west-2.amazonaws.com/latest_numerai_tournament_data.csv.xz")
  # cache locally; index=False avoids an extra "Unnamed: 0" column on reload
  tournament_data.to_csv("tournament.csv", index=False)
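
The two cells above repeat the same read-from-cache-or-download pattern. One way to factor it out is sketched below; `load_cached` is a hypothetical helper name, not part of pandas or numerapi.

```python
import os
import pandas as pd

def load_cached(url, path):
    """Read `path` if it exists; otherwise read `url` and cache the result at `path`."""
    if os.path.exists(path):
        return pd.read_csv(path)
    data = pd.read_csv(url)
    # index=False keeps the row index out of the cached file
    data.to_csv(path, index=False)
    return data
```

For example, `load_cached(training_url, "training.csv")` would replace the first cell, with `training_url` set to the training dataset URL used above.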

In [5]:
os.listdir()

['.config', 'tournament.csv', 'training.csv', 'sample_data']

## 3. Train your first model
Let's create a basic model using sklearn's linear regression.

In [6]:
# find only the feature columns
feature_cols = training_data.columns[training_data.columns.str.startswith('feature')]

In [7]:
# select those columns out of the training dataset
training_features = training_data[feature_cols]

In [8]:
# create a model and fit the training data (~30 sec to run)
model = sklearn.linear_model.LinearRegression()
model.fit(training_features, training_data.target_kazutsugi)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

## 4. Generate your first predictions
Now that we have a trained model, we can use it to make predictions on the tournament data.



In [9]:
# compare the number of SVD components used ahead of a linear regression model
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LinearRegression
from matplotlib import pyplot

N = 1000
X = training_data[feature_cols][:N]
# select the target by name rather than relying on it being the last column
y = training_data["target_kazutsugi"][:N]
print(f"x shape= {X.shape}, y shape= {y.shape}")
print(y[:4])

x shape= (1000, 310), y shape= (1000,)
0    0.75
1    0.25
2    0.00
3    0.00
Name: target_kazutsugi, dtype: float64


In [10]:

# get a list of models to evaluate
def get_models():
	models = dict()
	for i in range(25,50,5):
		steps = [('svd', TruncatedSVD(n_components=i)), ('m', LinearRegression())]
		models[str(i)] = Pipeline(steps=steps)
	return models

steps = [('svd', TruncatedSVD(n_components=25)), ('m', LinearRegression())]
model = Pipeline(steps=steps)

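# evaluate a given model using cross-validation
def evaluate_model(model, X, y):
	# random_state only applies when shuffle=True, so it is omitted here
	cv = KFold(n_splits=2)
	scores = cross_val_score(model, X, y, scoring='neg_mean_squared_error', cv=cv, error_score='raise')
	return scores

# cross-validated scores (negative MSE) for the 25-component pipeline
scores = evaluate_model(model, X, y)
# fit the pipeline on the sample so it can be used for prediction below
res = model.fit(X=X, y=y)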
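
`get_models` builds one pipeline per component count, but the notebook goes on to use only the 25-component pipeline. A minimal sketch of how such a comparison could be run is shown below. It uses small synthetic data as a stand-in (an assumption, so it runs without the Numerai files), and `make_pipeline_for` is a hypothetical helper mirroring the pipeline steps above.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (assumption): 200 rows, 20 features.
rng = np.random.default_rng(0)
X_demo = rng.random((200, 20))
y_demo = X_demo[:, :3].mean(axis=1) + rng.normal(0.0, 0.05, size=200)

def make_pipeline_for(k):
    # SVD reduces the features to k components before the regression step
    return Pipeline(steps=[("svd", TruncatedSVD(n_components=k, random_state=0)),
                           ("m", LinearRegression())])

cv = KFold(n_splits=2)
mean_mse = {}
for k in (2, 5, 10):
    scores_k = cross_val_score(make_pipeline_for(k), X_demo, y_demo,
                               scoring="neg_mean_squared_error", cv=cv)
    # scores are negative MSE, so flip the sign to report MSE
    mean_mse[k] = float(-scores_k.mean())

for k, mse in mean_mse.items():
    print(f"n_components={k}: mean MSE={mse:.4f}")
```

On the real data, the same loop could iterate over `get_models().items()` with `X` and `y` from the earlier cell.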


In [17]:
scores

array([-0.12442037, -0.12459488])
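
The scores are negative because scikit-learn scorers follow a "higher is better" convention, so mean squared error is reported with its sign flipped. A short sketch of converting the two fold scores above into MSE and RMSE:

```python
import numpy as np

# The two fold scores shown above (negative mean squared error).
neg_mse = np.array([-0.12442037, -0.12459488])

mse = -neg_mse          # flip the sign to get the mean squared error
rmse = np.sqrt(mse)     # root mean squared error, in the target's own units

print("MSE per fold: ", mse)
print("RMSE per fold:", rmse)
```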

In [None]:
# predict the target on the tournament features
X_test = tournament_data[feature_cols]
predictions = model.predict(X_test)

In [None]:
# predictions must have an `id` column and a `prediction_kazutsugi` column
predictions_df = tournament_data["id"].to_frame()
predictions_df["prediction_kazutsugi"] = predictions
predictions_df.head()

                 id  prediction_kazutsugi
0  n0003aa52cab36c2              0.472981
1  n000920ed083903f              0.492854
2  n0038e640522c4a6              0.556868
3  n004ac94a87dc54b              0.496384
4  n0052fe97ea0c05f              0.497034
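
Before uploading, it can help to sanity-check the frame. The sketch below uses `check_submission`, a hypothetical helper: the required column names come from the comment in the cell above, while the duplicate-id and missing-value checks are assumptions about what a clean file looks like, not documented Numerai rules.

```python
import pandas as pd

def check_submission(df, id_col="id", pred_col="prediction_kazutsugi"):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    for col in (id_col, pred_col):
        if col not in df.columns:
            problems.append(f"missing column: {col}")
    if problems:
        return problems
    # Assumed checks (not taken from the tutorial): unique ids, no missing values.
    if df[id_col].duplicated().any():
        problems.append("duplicate ids")
    if df[pred_col].isna().any():
        problems.append("missing predictions")
    return problems

# Tiny example using two rows from the output above.
demo = pd.DataFrame({
    "id": ["n0003aa52cab36c2", "n000920ed083903f"],
    "prediction_kazutsugi": [0.472981, 0.492854],
})
print(check_submission(demo))
```

In the notebook, `check_submission(predictions_df)` could be run just before writing `predictions.csv`.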


## 5. Make your first submission
To enter the tournament, we must submit the predictions back to Numerai. We will use the `numerapi` library to do this.

In [None]:
# Get your API keys and model_id from https://numer.ai/submit
public_id = "REPLACEME"
secret_key = "REPLACEME"
model_id = "REPLACEME"
napi = numerapi.NumerAPI(public_id=public_id, secret_key=secret_key)
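
If you would rather not paste keys into a notebook that might be shared, one option is to read them from environment variables. This is a sketch only: `load_credentials` is a hypothetical helper, and the variable names are assumptions chosen for this example.

```python
import os

def load_credentials(env=os.environ):
    """Read API credentials from environment variables (names are assumptions)."""
    names = {
        "public_id": "NUMERAI_PUBLIC_ID",
        "secret_key": "NUMERAI_SECRET_KEY",
        "model_id": "NUMERAI_MODEL_ID",
    }
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {key: env[var] for key, var in names.items()}

# Intended use in this notebook (not executed here):
# creds = load_credentials()
# napi = numerapi.NumerAPI(public_id=creds["public_id"], secret_key=creds["secret_key"])
```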

In [None]:
# Upload your predictions
predictions_df.to_csv("predictions.csv", index=False)
submission_id = napi.upload_predictions("predictions.csv", model_id=model_id)

# Done 🚀
Good job! You just made your first submission on Numerai!

Head back over to https://numer.ai/submit to continue.