# I started with this notebook and adapted a bunch of things

Source: 
https://www.kaggle.com/code1110/numerai-tournament

## This notebook does the following:
1. Get the training and tournament data from the Numerai static links.
2. Load the data into pandas DataFrames and limit memory use.
3. Train a single LightGBM (light gradient boosting machine) model on the training data.
4. Print diagnostics of the model on two subsets of the validation data.
5. Write the predictions to a file.
6. Submit the predictions to Numerai under the MRQUANTSALOT model.


Known Problem 

Loa

In [None]:
!pip install numerapi
import numerapi



In [None]:
import numpy as np
import pandas as pd
import os, sys
import gc
import pathlib
import json
from typing import List, NoReturn, Union, Tuple, Optional, Text, Generic, Callable, Dict
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder, QuantileTransformer
from sklearn.model_selection import KFold, StratifiedKFold, TimeSeriesSplit
from sklearn.metrics import accuracy_score, roc_auc_score, log_loss, mean_squared_error, mean_absolute_error, f1_score
from scipy.stats import spearmanr # Spearman rank correlation, used below for feature exposure
import joblib

# model
import lightgbm as lgb
import xgboost as xgb
import operator

# visualize
import matplotlib.pyplot as plt
import matplotlib.style as style
import seaborn as sns
from matplotlib import pyplot
from matplotlib.ticker import ScalarFormatter
sns.set_context("talk")
style.use('seaborn-colorblind')



Methods

In [None]:
def cast_eras_as_int(x): # cast the era labels from strings to ints
    try:
        return int(x[3:]) # the eras look like era####
    except (ValueError, TypeError):
        return 1000 # fallback for labels without a numeric suffix

# unclear if numerapi.download_latest_data() would be faster
def read_data(data='train'):
    # get data 
    if data == 'train':
        df = pd.read_csv('https://numerai-public-datasets.s3-us-west-2.amazonaws.com/latest_numerai_training_data.csv.xz')

        # The test data is significantly larger.
        # The test data is the live tournament data.
    elif data == 'test':
        df = pd.read_csv('https://numerai-public-datasets.s3-us-west-2.amazonaws.com/latest_numerai_tournament_data.csv.xz')
    else:
        raise ValueError("data must be 'train' or 'test'")
    
    # features
    feature_cols = df.columns[df.columns.str.startswith('feature')]
    
    # map to int, to reduce the memory demand
    mapping = {0.0 : 0, 0.25 : 1, 0.5 : 2, 0.75 : 3, 1.0 : 4} # this is very clever -P
    for c in feature_cols:
        df[c] = df[c].map(mapping).astype(np.uint8)
        
    df["era"] = df["era"].apply(cast_eras_as_int)# also cast era to int
    return df

In [None]:
%%time
# load in the Training data
train = read_data('train')
print(train.shape)
train.head()

(501808, 314)
CPU times: user 1min 28s, sys: 6.1 s, total: 1min 35s
Wall time: 1min 36s
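
The memory saving from the `mapping` step in `read_data` can be checked on a toy frame. This is a minimal sketch, assuming synthetic data and made-up column names (`feature_toy1` and so on are not real Numerai columns); it only shows that five float levels fit losslessly into `uint8`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Numerai features: values drawn from the five allowed levels.
rng = np.random.default_rng(0)
levels = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
toy = pd.DataFrame(
    rng.choice(levels, size=(10_000, 4)),
    columns=[f"feature_toy{i}" for i in range(1, 5)],  # hypothetical column names
)

# Same mapping as in read_data: each float level becomes a small integer.
mapping = {0.0: 0, 0.25: 1, 0.5: 2, 0.75: 3, 1.0: 4}
compact = toy.copy()
for c in compact.columns:
    compact[c] = compact[c].map(mapping).astype(np.uint8)

bytes_before = toy.memory_usage(index=False).sum()
bytes_after = compact.memory_usage(index=False).sum()
print(f"float64: {bytes_before:,} bytes, uint8: {bytes_after:,} bytes")

# The mapping is lossless: dividing by 4 recovers the original values.
restored = compact.astype(np.float64) / 4
```

Each `float64` cell takes 8 bytes and each `uint8` cell takes 1 byte, so the feature columns shrink by a factor of eight.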


In [None]:
%%time
# the testing data is the tournament data
test = read_data('test')
print(test.columns)

Index(['id', 'era', 'data_type', 'feature_intelligence1',
       'feature_intelligence2', 'feature_intelligence3',
       'feature_intelligence4', 'feature_intelligence5',
       'feature_intelligence6', 'feature_intelligence7',
       ...
       'feature_wisdom38', 'feature_wisdom39', 'feature_wisdom40',
       'feature_wisdom41', 'feature_wisdom42', 'feature_wisdom43',
       'feature_wisdom44', 'feature_wisdom45', 'feature_wisdom46', 'target'],
      dtype='object', length=314)
CPU times: user 4min 36s, sys: 16.2 s, total: 4min 52s
Wall time: 4min 56s


Create a separate validation split.
Read more: https://towardsdatascience.com/train-validation-and-test-sets-72cb40cba9e7

Train: the model sees and learns from this data.

Validation: use this to see your score. This is what you use to tune hyperparameters.

Validation is a subset of the test data: the rows where data_type is 'validation'.

### I don't have good evidence for splits of this size. Other splits might work better.
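
To make the era-based split concrete, here is a minimal sketch on a toy frame. The era labels and the `era_to_int` helper are made up for illustration (the notebook's own helper is `cast_eras_as_int`), and the threshold of 180 simply mirrors the cell below rather than being a recommended value.

```python
import pandas as pd

def era_to_int(era: str, default: int = 1000) -> int:
    """Parse 'era123' -> 123; anything unparsable (e.g. 'eraX') -> default."""
    try:
        return int(era[3:])
    except (ValueError, TypeError):
        return default

# Made-up rows: four validation eras and one live row.
toy = pd.DataFrame({
    "era": ["era121", "era132", "era197", "era212", "eraX"],
    "data_type": ["validation", "validation", "validation", "validation", "live"],
})
toy["era"] = toy["era"].apply(era_to_int)

# Keep only the validation rows, then flag the later eras.
valid = toy[toy["data_type"] == "validation"].reset_index(drop=True)
SPLIT_ERA = 180  # assumption: same threshold as the notebook cell
valid["valid2"] = valid["era"] > SPLIT_ERA

print(valid.groupby("valid2")["era"].agg(["min", "max", "count"]))
```

Printing the era range per subset is a quick way to confirm that the threshold actually separates the eras you expect.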

In [None]:
%%time
# validation is derived from the live tournament data
valid = test[test["data_type"] == "validation"].reset_index(drop = True)

print(valid.columns) # want to see if there is a target column here
# validation split: flag each row by era
valid.loc[valid["era"] > 180, "valid2"] = True # eras after 180 are flagged as the second validation subset
valid.loc[valid["era"] <= 180, "valid2"] = False # eras up to and including 180 are not


Index(['id', 'era', 'data_type', 'feature_intelligence1',
       'feature_intelligence2', 'feature_intelligence3',
       'feature_intelligence4', 'feature_intelligence5',
       'feature_intelligence6', 'feature_intelligence7',
       ...
       'feature_wisdom38', 'feature_wisdom39', 'feature_wisdom40',
       'feature_wisdom41', 'feature_wisdom42', 'feature_wisdom43',
       'feature_wisdom44', 'feature_wisdom45', 'feature_wisdom46', 'target'],
      dtype='object', length=314)
CPU times: user 744 ms, sys: 12 ms, total: 756 ms
Wall time: 755 ms


In [None]:
# remove data_type to save memory
train.drop(columns=["data_type"], inplace=True)
valid.drop(columns=["data_type"], inplace=True)
test.drop(columns=["data_type"], inplace=True)

print('The number of records: train {:,}, valid {:,}, test {:,}'.format(train.shape[0], valid.shape[0], test.shape[0])) # df.shape[0] is number of rows.

The number of records: train 501,808, valid 137,779, test 1,671,309


# EDA (Exploratory Data Analysis)


## Determine features

In [None]:
# features
features = [f for f in train.columns.values.tolist() if 'feature' in f] # list comprehension that collects all the feature names
print('There are {} features.'.format(len(features)))
features

There are 310 features.


['feature_intelligence1',
 'feature_intelligence2',
 'feature_intelligence3',
 'feature_intelligence4',
 'feature_intelligence5',
 'feature_intelligence6',
 'feature_intelligence7',
 'feature_intelligence8',
 'feature_intelligence9',
 'feature_intelligence10',
 'feature_intelligence11',
 'feature_intelligence12',
 'feature_charisma1',
 'feature_charisma2',
 'feature_charisma3',
 'feature_charisma4',
 'feature_charisma5',
 'feature_charisma6',
 'feature_charisma7',
 'feature_charisma8',
 'feature_charisma9',
 'feature_charisma10',
 'feature_charisma11',
 'feature_charisma12',
 'feature_charisma13',
 'feature_charisma14',
 'feature_charisma15',
 'feature_charisma16',
 'feature_charisma17',
 'feature_charisma18',
 'feature_charisma19',
 'feature_charisma20',
 'feature_charisma21',
 'feature_charisma22',
 'feature_charisma23',
 'feature_charisma24',
 'feature_charisma25',
 'feature_charisma26',
 'feature_charisma27',
 'feature_charisma28',
 'feature_charisma29',
 'feature_charisma30',
 'fe

## Target

In [None]:
# take the first column whose name starts with 'target'; here that is just the string 'target'
target = train.columns[train.columns.str.startswith('target')].values.tolist()[0]
print(target)

target


# Modeling
The example script of xgboost made by numerai [Example](https://github.com/numerai/example-scripts/blob/master/example_model.py)



Link to hyperparameters of xgboost 
https://xgboost.readthedocs.io/en/latest/parameter.html


In [None]:
# # create a model and fit (official example)

# untested. -p
# model = xgb.XGBRegressor(max_depth=5, learning_rate=0.01, n_estimators=2000, n_jobs=-1, colsample_bytree=0.1)
# model.fit(train[features], train[target])

In [None]:
%%time

# create a model based on these params.
# for all intents and purposes these params are arbitrary.
# once this step is automated, I need to verify which params are best.
# Figure out how to do this on Google Cloud
params1 = {
            'n_estimators': 5000,
            'objective': 'regression',
            'boosting_type': 'gbdt',
            'max_depth': 55,
            'learning_rate': 0.013, 
            'feature_fraction': 0.095,
            'seed': 52
            }
model = lgb.LGBMRegressor(**params1) # now increasing the model's n_estimators and submitting that under tutmodel
model.fit(train[features], train[target]) # usually takes ~9 minutes to run.

CPU times: user 9min 15s, sys: 1.33 s, total: 9min 16s
Wall time: 2min 21s


# Feature importance

You can see which features are important and the distribution of importance scores assigned to the features.
You might use this as part of a meta-model built on LightGBM or XGBoost.



In [None]:
feature_importance = pd.DataFrame(model.feature_importances_, index=features, columns=['importance'])
feature_importance.describe()

       importance
count  310.0
mean   483.870968
std    56.475009
min    314.0
25%    450.0
50%    484.0
75%    521.0
max    640.0


It might make sense to group your models into clusters based on this stat. 
You would want to scale them down to 1 when you group them. 
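
One reading of the note above is: scale the importances so they sum to 1, then add them up per feature family. A minimal sketch under that assumption, using made-up importance numbers rather than the real `model.feature_importances_`:

```python
import pandas as pd

# Made-up split counts standing in for model.feature_importances_.
importance = pd.Series(
    {
        "feature_intelligence1": 500,
        "feature_intelligence2": 300,
        "feature_charisma1": 450,
        "feature_charisma2": 350,
        "feature_wisdom1": 400,
    },
    name="importance",
)

# Scale so that all importances sum to 1.
share = importance / importance.sum()

# Group by family: strip the 'feature_' prefix and the trailing digits.
family = share.index.str.replace(r"^feature_|\d+$", "", regex=True)
family_share = share.groupby(family).sum().sort_values(ascending=False)
print(family_share)
```

With real importances, `family_share` would show how much of the model's splitting is spent on each group of features (intelligence, charisma, wisdom, and so on).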

# Validation Score
These are the methods used to evaluate your model.
I did not write these methods, though I wrote many of the comments.
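
The correlation score below is a Pearson correlation computed on percentile-ranked predictions, so only the ordering of the predictions matters. A minimal sketch on synthetic data (the signal strength and noise level are arbitrary assumptions) that shows the ranked score is unchanged by a monotonic transform of the raw predictions:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 5_000

# Synthetic target with five levels and a weakly informative prediction.
target = rng.choice([0.0, 0.25, 0.5, 0.75, 1.0], size=n)
raw_pred = 0.5 + 0.05 * (target - 0.5) + rng.normal(scale=0.1, size=n)
pred = pd.Series(raw_pred)

# Score used in this notebook: Pearson correlation on percentile ranks.
ranked = pred.rank(pct=True, method="first")
corr_ranked = np.corrcoef(target, ranked)[0, 1]

# A monotonic transform keeps the ordering, so the ranked score stays the same.
ranked_exp = np.exp(pred).rank(pct=True, method="first")
corr_ranked_exp = np.corrcoef(target, ranked_exp)[0, 1]

# Pearson on the raw values is sensitive to the same transform.
corr_raw = np.corrcoef(target, pred)[0, 1]
corr_raw_exp = np.corrcoef(target, np.exp(pred))[0, 1]

print(f"ranked: {corr_ranked:.6f} vs {corr_ranked_exp:.6f}")
print(f"raw:    {corr_raw:.6f} vs {corr_raw_exp:.6f}")
```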


In [None]:
# naming conventions
PREDICTION_NAME = 'prediction'
TARGET_NAME = target # target holds the column name, the string 'target'
EXAMPLE_PRED = 'example_prediction' # column name used by valid4score(load_example=True) and compute_val_mmc

# ---------------------------
# Functions
# ---------------------------
def valid4score(valid : pd.DataFrame, pred : np.ndarray, load_example: bool=True, save : bool=False) -> pd.DataFrame:
    """
    Generate new valid pandas dataframe for computing scores
    
    :INPUT:
    - valid : pd.DataFrame extracted from tournament data (data_type='validation')
    
    """
    valid_df = valid.copy() # the validation dataframe you use this to test the CORR and other values

    # Your model outputs an array of floats. The rank method converts them to
    # percentile ranks in (0, 1], with ties broken by order of appearance.
    # The ranked predictions are then compared to the target column.
    valid_df['prediction'] = pd.Series(pred).rank(pct=True, method="first") # pred is the array of predictions your model creates for the validation rows
    # After ranking, the predictions are floats in (0, 1], not the integer levels 0-4 used for the features.
    valid_df.rename(columns={target: 'target'}, inplace=True)
    
    # load_example adds the example predictions as a column, which compute_val_mmc needs.
    # EXP_DIR is not defined in this notebook, so keep load_example=False here.
    if load_example:
        valid_df[EXAMPLE_PRED] = pd.read_csv(EXP_DIR + 'valid_df.csv')['prediction'].values
    
    if save:
        valid_df.to_csv(OUTPUT_DIR + 'valid_df.csv', index=False)
        print('Validation dataframe saved!')
    
    return valid_df

def compute_corr(valid_df : pd.DataFrame):
    """
    Compute rank correlation

    THIS IS WHAT YOU ARE PRIMARILY PAID ON 
    
    :INPUT:
    - valid_df : pd.DataFrame where at least 2 columns ('prediction' & 'target') exist
    
    """
    # np.corrcoef computes the Pearson correlation.
    # The 'prediction' column was already converted to percentile ranks in valid4score,
    # so this is a rank-based correlation in the spirit of Spearman: the ordering of the
    # predictions matters, not their raw values.
    # see: https://support.minitab.com/en-us/minitab-express/1/help-and-how-to/modeling-statistics/regression/supporting-topics/basics/a-comparison-of-the-pearson-and-spearman-correlation-methods/
    return np.corrcoef(valid_df["target"], valid_df['prediction'])[0, 1]

def compute_max_drawdown(validation_correlations : pd.Series):
    """
    Compute max drawdown
    
    :INPUT:
    - validation_correlations : pd.Series of per-era correlations
    """
    
    rolling_max = (validation_correlations + 1).cumprod().rolling(window=100, min_periods=1).max()
    daily_value = (validation_correlations + 1).cumprod()
    max_drawdown = -(rolling_max - daily_value).max()
    
    return max_drawdown

def compute_val_corr(valid_df : pd.DataFrame):
    """
    Compute rank correlation for valid periods
    
    :INPUT:
    - valid_df : pd.DataFrame where at least 2 columns ('prediction' & 'target') exist
    """
    
    # all validation
    correlation = compute_corr(valid_df)
    print("rank corr = {:.4f}".format(correlation))
    return correlation
    
def compute_val_sharpe(valid_df : pd.DataFrame):
    """
    Compute sharpe ratio for valid periods
    
    :INPUT:
    - valid_df : pd.DataFrame where at least 2 columns ('prediction' & 'target') exist
    """
    # all validation
    d = valid_df.groupby('era')[['target', 'prediction']].corr().iloc[0::2,-1].reset_index()
    me = d['prediction'].mean()
    sd = d['prediction'].std()
    max_drawdown = compute_max_drawdown(d['prediction'])
    print('sharpe ratio = {:.4f}, corr mean = {:.4f}, corr std = {:.4f}, max drawdown = {:.4f}'.format(me / sd, me, sd, max_drawdown))
    
    return me / sd, me, sd, max_drawdown
    
def feature_exposures(valid_df : pd.DataFrame):
    """
    Compute feature exposure
    
    :INPUT:
    - valid_df : pd.DataFrame where at least 2 columns ('prediction' & 'target') exist
    """
    feature_names = [f for f in valid_df.columns
                     if f.startswith("feature")]
    exposures = []
    for f in feature_names:
        fe = spearmanr(valid_df['prediction'], valid_df[f])[0]
        exposures.append(fe)
    return np.array(exposures)

def max_feature_exposure(fe : np.ndarray):
    return np.max(np.abs(fe))

def feature_exposure(fe : np.ndarray):
    return np.sqrt(np.mean(np.square(fe)))

def compute_val_feature_exposure(valid_df : pd.DataFrame):
    """
    Compute feature exposure for valid periods
    
    :INPUT:
    - valid_df : pd.DataFrame where at least 2 columns ('prediction' & 'target') exist
    """
    # all validation
    fe = feature_exposures(valid_df)
    fe1, fe2 = feature_exposure(fe), max_feature_exposure(fe)
    print('feature exposure = {:.4f}, max feature exposure = {:.4f}'.format(fe1, fe2))
     
    return fe1, fe2

# to neutralize a column in a df by many other columns:
# regress the scores on the `by` columns (plus a constant) and subtract `proportion`
# of the fitted values, so the result has reduced linear exposure to those columns.
def neutralize(df, columns, by, proportion=1.0):
    scores = df.loc[:, columns]
    exposures = df[by].values

    # constant column to make sure the series is completely neutral to exposures
    exposures = np.hstack(
        (exposures,
         np.asarray(np.mean(scores)) * np.ones(len(exposures)).reshape(-1, 1)))

    scores = scores - proportion * exposures.dot(
        np.linalg.pinv(exposures).dot(scores))
    return scores / scores.std()


# to neutralize any series by any other series
def neutralize_series(series, by, proportion=1.0):
    scores = series.values.reshape(-1, 1)
    exposures = by.values.reshape(-1, 1)

    # this line makes series neutral to a constant column so that it's centered and for sure gets corr 0 with exposures
    exposures = np.hstack(
        (exposures,
         np.array([np.mean(series)] * len(exposures)).reshape(-1, 1)))

    correction = proportion * (exposures.dot(
        np.linalg.lstsq(exposures, scores, rcond=None)[0]))
    corrected_scores = scores - correction
    neutralized = pd.Series(corrected_scores.ravel(), index=series.index)
    return neutralized


def unif(df):
    x = (df.rank(method="first") - 0.5) / len(df)
    return pd.Series(x, index=df.index)

def get_feature_neutral_mean(df):
    feature_cols = [c for c in df.columns if c.startswith("feature")]
    df.loc[:, "neutral_sub"] = neutralize(df, [PREDICTION_NAME],
                                          feature_cols)[PREDICTION_NAME]
    scores = df.groupby("era").apply(
        lambda x: np.corrcoef(x["neutral_sub"].rank(pct=True, method="first"), x[TARGET_NAME])[0, 1])
    return np.mean(scores)

def compute_val_mmc(valid_df : pd.DataFrame):    
    # MMC over validation
    mmc_scores = []
    corr_scores = []
    for _, x in valid_df.groupby("era"):
        series = neutralize_series(pd.Series(unif(x[PREDICTION_NAME])),
                                   pd.Series(unif(x[EXAMPLE_PRED])))
        mmc_scores.append(np.cov(series, x[TARGET_NAME])[0, 1] / (0.29 ** 2))
        corr_scores.append(np.corrcoef(unif(x[PREDICTION_NAME]).rank(pct=True, method="first"), x[TARGET_NAME])[0, 1]) # [0, 1] picks the correlation out of the 2x2 matrix

    val_mmc_mean = np.mean(mmc_scores)
    val_mmc_std = np.std(mmc_scores)
    val_mmc_sharpe = val_mmc_mean / val_mmc_std
    corr_plus_mmcs = [c + m for c, m in zip(corr_scores, mmc_scores)]
    corr_plus_mmc_sharpe = np.mean(corr_plus_mmcs) / np.std(corr_plus_mmcs)
    corr_plus_mmc_mean = np.mean(corr_plus_mmcs)

    print("MMC Mean = {:.6f}, MMC Std = {:.6f}, CORR+MMC Sharpe = {:.4f}".format(val_mmc_mean, val_mmc_std, corr_plus_mmc_sharpe))

    # Check correlation with example predictions
    corr_with_example_preds = np.corrcoef(valid_df[EXAMPLE_PRED].rank(pct=True, method="first"),
                                          valid_df[PREDICTION_NAME].rank(pct=True, method="first"))[0, 1]
    print("Corr with example preds: {:.4f}".format(corr_with_example_preds))
    
    return val_mmc_mean, val_mmc_std, corr_plus_mmc_sharpe, corr_with_example_preds


# this is the main method. The rest are just called internally.
def score_summary(valid_df : pd.DataFrame):
    score_df = {}
    
    try:
        score_df['correlation'] = compute_val_corr(valid_df)
    except Exception:
        print('ERR: computing correlation')
    try:
        score_df['corr_sharpe'], score_df['corr_mean'], score_df['corr_std'], score_df['max_drawdown'] = compute_val_sharpe(valid_df)
    except Exception:
        print('ERR: computing sharpe')
    try:
        score_df['feature_exposure'], score_df['max_feature_exposure'] = compute_val_feature_exposure(valid_df)
    except Exception:
        print('ERR: computing feature exposure')
    try:
        score_df['mmc_mean'], score_df['mmc_std'], score_df['corr_mmc_sharpe'], score_df['corr_with_example_xgb'] = compute_val_mmc(valid_df)
    except Exception:
        print('ERR: computing MMC')
    
    return pd.DataFrame.from_dict(score_df, orient='index')

In [None]:
# predictions for the validation eras
# peek at the number of predictions
pred = model.predict(valid[features])
print(f'You made {len(pred)} total predictions')
print(type(pred))
print(pred[0])

You made 137779 total predictions
<class 'numpy.ndarray'>
0.486884289595734


In [None]:
# scores


valid_df = valid4score(valid, pred, load_example=False, save=False)

score_df = pd.DataFrame()
print('------------------')
print('ALL:')
print('------------------')
all_ = score_summary(valid_df).rename(columns={0: 'all'})

print('------------------')
print('VALID 1:')
print('------------------')
val1_ = score_summary(valid_df.query('era < 150')).rename(columns={0: 'val1'}) 
# Note: this split uses era 150, while the valid2 flag set earlier uses era 180. The two should be reconciled; the valid2 column is not used here.

print('------------------')
print('VALID 2:')
print('------------------')
val2_ = score_summary(valid_df.query('era > 150')).rename(columns={0: 'val2'})

------------------
ALL:
------------------
rank corr = 0.0238
sharpe ratio = 0.8472, corr mean = 0.0243, corr std = 0.0286, max drawdown = -0.0723
feature exposure = 0.0825, max feature exposure = 0.3018
ERR: computing MMC
------------------
VALID 1:
------------------
rank corr = 0.0351
sharpe ratio = 1.2671, corr mean = 0.0351, corr std = 0.0277, max drawdown = -0.0290
feature exposure = 0.0741, max feature exposure = 0.2894
ERR: computing MMC
------------------
VALID 2:
------------------
rank corr = 0.0160
sharpe ratio = 0.5896, corr mean = 0.0162, corr std = 0.0274, max drawdown = -0.0480
feature exposure = 0.0942, max feature exposure = 0.3105
ERR: computing MMC


# See Scores

In [None]:
# scores
score_df = pd.concat([all_, val1_, val2_], axis=1)
score_df.style.background_gradient(cmap='viridis', axis=0)

                            all       val1       val2
correlation             0.02383   0.035123   0.016025
corr_sharpe            0.847194    1.26712   0.589563
corr_mean              0.024271   0.035079   0.016164
corr_std               0.028648   0.027684   0.027417
max_drawdown          -0.072331  -0.028963  -0.048014
feature_exposure       0.082518   0.074126   0.094228
max_feature_exposure   0.301773   0.289383   0.310453
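
The `corr_sharpe` and `max_drawdown` rows come from the per-era correlations. A minimal sketch of the same arithmetic used in `compute_val_sharpe` and `compute_max_drawdown`, on six made-up per-era correlations (not values from this model):

```python
import pandas as pd

# Made-up per-era correlations, one value per era.
era_corr = pd.Series([0.03, -0.01, 0.02, 0.05, -0.04, 0.01])

mean = era_corr.mean()
std = era_corr.std()   # pandas uses the sample standard deviation (ddof=1)
sharpe = mean / std

# Compound the per-era scores, then measure the largest drop from a running peak.
equity = (era_corr + 1).cumprod()
running_peak = equity.rolling(window=100, min_periods=1).max()
max_drawdown = -(running_peak - equity).max()

print(f"mean={mean:.4f}, std={std:.4f}, sharpe={sharpe:.4f}, max drawdown={max_drawdown:.4f}")
```

Here the worst drop follows the single bad era (-0.04) right after the peak, which is why a lower standard deviation across eras tends to improve both numbers.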


# Submission


1. Create the prediction list.
2. Link those predictions with the tournament data
3. Write the id, prediction to a csv file.
4. Use numerai wrapper to submit that .csv file as your current model
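
Steps 1 to 3 can be sanity-checked without touching the API. A minimal sketch, assuming made-up ids and predictions; the upload in step 4 (`napi.upload_predictions` in the helpers below) is left out because it needs real credentials.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical ids and predictions standing in for the tournament rows.
ids = ["n0001", "n0002", "n0003"]
preds = np.array([0.5429, 0.5051, 0.4877])

# Steps 1 and 2: pair each id with its prediction.
submission = pd.DataFrame({"id": ids, "prediction": preds})

# Step 3: write the csv, then read it back to check what was written.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "myPredictions.csv")
    submission.to_csv(path, index=False)
    round_trip = pd.read_csv(path)

# Basic checks worth running before uploading anything.
columns_ok = list(round_trip.columns) == ["id", "prediction"]
range_ok = bool(round_trip["prediction"].between(0, 1).all())
ids_ok = bool(round_trip["id"].is_unique)
print(columns_ok, range_ok, ids_ok)
```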



### Methods to handle submission

In [None]:
def load_api_creds_into_dict():
    """
    Read creds.json and return a dictionary of your API keys.
    """
    with open('creds.json', 'r') as creds: # the file is closed automatically
        api_keys_dict = json.load(creds)
    return api_keys_dict

api_keys_dict = load_api_creds_into_dict()
mrquantsalot_model_id = api_keys_dict['mr_quants_model_id']
tutModel_model_id = api_keys_dict['tutmodel_model_id']
PREDICTION_NAME = "prediction" # this is the header of the csv file you are creating
OUTPUT_DIR = '' # just the root of your local folder in this instance of google colab

def open_api_access():
    """
    Read in my private key from creds.json and return the numerai api wrapper
    """
    api_keys_dict = load_api_creds_into_dict()
    my_secret_key = api_keys_dict['secret_key']
    my_public_id = api_keys_dict['public_id'] # unclear if this is your public id (the long string or the public_id ) 
    napi = numerapi.NumerAPI(secret_key=my_secret_key, public_id=my_public_id)
    return napi


def merge_predictions_id(tournament_vectors: pd.DataFrame, tournament_pred : np.ndarray):
    """
    Create a dataframe that looks like 
    id,prediction
    asdfads,.5429
    asdfaddsss,.5051
    ...
    """
    predictions_df = tournament_vectors["id"].to_frame() # get all the Ids and cast them to a frame
    predictions_df[PREDICTION_NAME] = tournament_pred
    return predictions_df


def write_predictions_to_file(predictions_df: pd.DataFrame):
    file_name = 'myPredictions.csv'
    # to_csv creates the file or overwrites an existing one
    predictions_df.to_csv(file_name, index=False)
    return file_name # the file name where you save the predictions


def submit_predictions_to_numerai(file_of_predictions='myPredictions.csv', model_id=mrquantsalot_model_id):
    napi = open_api_access() # open a connection to the numerai API with your creds.
    submission_id = napi.upload_predictions(file_of_predictions, model_id=model_id) # use the model_id that was passed in
    print(f'You successfully submitted for {model_id}')
    return submission_id

print('your helper methods work correctly')
    

your helper methods work correctly


### Run your model on the real tournament data.

In [None]:
%%time
# It might make more sense to split this into several cells
# predict using your model on tournament data
tournament_pred = model.predict(test[features]) # it takes ~16 minutes to run your predictions
print(f'The live tournament data has shape {test.shape}')
print('It took this long to run your model on the tournament data:')

In [None]:
# create a dataframe based on the ids and predictions
predictions_df = merge_predictions_id(tournament_vectors=test, tournament_pred=tournament_pred)
print(predictions_df.head(10))
predictions_file = write_predictions_to_file(predictions_df)
print('wrote predictions to file')
submit_predictions_to_numerai(file_of_predictions=predictions_file, model_id=tutModel_model_id) # model_id defaults to mrquantsalot

### Results
From the "Diagnostics" results of MRQUANTSALOT on 3/19/2021:

CORR = 0.0260

Std dev of CORR = 0.0279

This is very bad.

I need to increase my feature exposure and decrease the std dev of the per-era correlations between my predictions and the true results.

Unsure yet how to do that.
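
One knob that directly moves feature exposure is the `proportion` argument of the `neutralize` helper defined earlier. A minimal sketch on synthetic data, assuming a simplified re-implementation (it skips the final division by the standard deviation and measures exposure with Pearson rather than `spearmanr`), so treat it as an illustration and not as the notebook's exact method:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 2_000
features = rng.normal(size=(n, 3))                 # synthetic feature matrix
pred = 0.8 * features[:, 0] + rng.normal(size=n)   # predictions leaning on feature 0

def neutralize(scores: np.ndarray, exposures: np.ndarray, proportion: float = 1.0) -> np.ndarray:
    """Subtract `proportion` of the least-squares fit of scores on exposures plus a constant."""
    X = np.hstack([exposures, np.ones((len(exposures), 1))])
    fitted = X @ np.linalg.lstsq(X, scores, rcond=None)[0]
    return scores - proportion * fitted

def max_exposure(scores: np.ndarray, exposures: np.ndarray) -> float:
    """Largest absolute Pearson correlation between scores and any feature column."""
    return max(abs(np.corrcoef(scores, exposures[:, j])[0, 1])
               for j in range(exposures.shape[1]))

before = max_exposure(pred, features)
half = max_exposure(neutralize(pred, features, 0.5), features)
full = max_exposure(neutralize(pred, features, 1.0), features)
print(f"max exposure: before={before:.3f}, half={half:.3f}, full={full:.3f}")
```

Moving `proportion` from 0 toward 1 lowers the exposure step by step, so comparing the per-era correlation std at a few values of `proportion` is one way to see how exposure and consistency trade off.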

params1 = {
            'n_estimators': 2000,
            'objective': 'regression',
            'boosting_type': 'gbdt',
            'max_depth': 55,
            'learning_rate': 0.013, 
            'feature_fraction': 0.095,
            'seed': 52
            }

These are the current params for MRQUANTSALOT.


Let's see how increasing n_estimators changes the feature exposure.
