## Correlation analysis among Numerai targets

This notebook computes, era by era, Spearman's correlation between the main target (`target`) and each of the other Numerai targets.

The Numerai CEO said that "jerome_60 is a good target for blending."

https://zenn.dev/katsu1110/articles/60c777d15e01d5 (Japanese Page)
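
As a rough illustration of the per-era calculation, here is a minimal sketch on toy data. The column names `target_toy_20` and `target_toy_60` and the helper `spearman_omit_nan` are hypothetical and exist only for this example. The helper computes Spearman's rho as the Pearson correlation of ranks after dropping rows where either value is NaN, which is intended to mirror `nan_policy="omit"` in `scipy.stats.spearmanr`.

```python
import numpy as np
import pandas as pd


def spearman_omit_nan(a: pd.Series, b: pd.Series) -> float:
    """Spearman's rho as the Pearson correlation of ranks, skipping NaN pairs."""
    pair = pd.concat([a, b], axis=1).dropna()
    return pair.iloc[:, 0].rank().corr(pair.iloc[:, 1].rank())


# Toy data (not Numerai data): two eras, the main target and two
# hypothetical auxiliary targets, one of which contains NaN values.
toy = pd.DataFrame({
    "era": ["0001"] * 5 + ["0002"] * 5,
    "target": [0.0, 0.25, 0.5, 0.75, 1.0, 0.0, 0.25, 0.5, 0.75, 1.0],
    "target_toy_20": [0.0, 0.25, 0.5, 0.75, 1.0, 1.0, 0.75, 0.5, 0.25, 0.0],
    "target_toy_60": [0.0, 0.5, np.nan, 0.75, 1.0, 0.25, 0.0, 0.5, np.nan, 1.0],
})

aux_cols = ["target_toy_20", "target_toy_60"]

# Rows are auxiliary targets and columns are eras, the same layout that
# pd.DataFrame() produces from a nested {era: {target: rho}} dictionary.
rho_by_era = pd.DataFrame({
    era: {col: spearman_omit_nan(g["target"], g[col]) for col in aux_cols}
    for era, g in toy.groupby("era")
})
print(rho_by_era)
```

Each cell of `rho_by_era` is the correlation of one auxiliary target with `target` within a single era, so a row can be read as a time series of how closely that target tracks the main one.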

In [None]:
!pip install -Uqq numerapi

In [None]:
%matplotlib inline
import os
import pandas as pd
import scipy.stats as st
import matplotlib.pyplot as plt

from numerapi import NumerAPI
from collections import defaultdict
from contextlib import redirect_stderr
from tqdm.notebook import tqdm

In [None]:
napi = NumerAPI()
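# Send stderr to os.devnull to silence the download progress output,
# and close the file handle when the block exits.
with open(os.devnull, 'w') as devnull, redirect_stderr(devnull):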
    napi.download_dataset("v3/numerai_training_data_int8.parquet", "numerai_training_data.parquet")
    napi.download_dataset("v3/numerai_validation_data_int8.parquet", "numerai_validation_data.parquet")
    napi.download_dataset("v3/numerai_live_data_int8.parquet", "numerai_live_data.parquet")
    napi.download_dataset("v3/numerai_datasets.zip", "numerai_datasets.zip")

In [None]:
df_train = pd.read_parquet("numerai_training_data.parquet")
df_valid = pd.read_parquet("numerai_validation_data.parquet")

In [None]:
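# Copy the selection so that adding the "era" column later does not trigger SettingWithCopyWarning.
df_train_targets = df_train.loc[:, df_train.columns.str.startswith('target')].copy()
df_valid_targets = df_valid.loc[:, df_valid.columns.str.startswith('target')].copy()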

In [None]:
df_train_targets["era"] = df_train["era"]
df_valid_targets["era"] = df_valid["era"]

In [None]:
era_set_train = sorted(set(df_train["era"]))
era_set_valid = sorted(set(df_valid["era"]))

In [None]:
dd_train = defaultdict(lambda: defaultdict(int))
for era in tqdm(era_set_train):
    df_train_targets_era = df_train_targets.query("era == @era").drop("era", axis=1)
    for col in df_train_targets_era.columns:
        if col == "target":
            continue
        
        # The 60-day targets contain NaN values, so those rows are ignored (nan_policy="omit").
        dd_train[era][col] = st.spearmanr(df_train_targets_era["target"], df_train_targets_era[col], nan_policy="omit")[0]

dd_valid = defaultdict(lambda: defaultdict(int))
for era in tqdm(era_set_valid):
    df_valid_targets_era = df_valid_targets.query("era == @era").drop("era", axis=1)
    for col in df_valid_targets_era.columns:
        if col == "target":
            continue
        
        # The 60-day targets contain NaN values, so those rows are ignored (nan_policy="omit").
        dd_valid[era][col] = st.spearmanr(df_valid_targets_era["target"], df_valid_targets_era[col], nan_policy="omit")[0]

In [None]:
df_spearmanr_train = pd.DataFrame(dd_train)
df_spearmanr_valid = pd.DataFrame(dd_valid)

In [None]:
df_spearmanr_train_mean_std = df_spearmanr_train.drop("target_nomi_20").T.agg(["mean", "std"]).T
df_spearmanr_valid_mean_std = df_spearmanr_valid.drop("target_nomi_20").T.agg(["mean", "std"]).T

In [None]:
df_spearmanr_train_mean_std["sharpe_ratio"] = df_spearmanr_train_mean_std["mean"] / df_spearmanr_train_mean_std["std"]
df_spearmanr_valid_mean_std["sharpe_ratio"] = df_spearmanr_valid_mean_std["mean"] / df_spearmanr_valid_mean_std["std"]

In [None]:
display(df_spearmanr_train_mean_std.sort_values("sharpe_ratio", ascending=False))

In [None]:
display(df_spearmanr_valid_mean_std.sort_values("sharpe_ratio", ascending=False))

In [None]:
df_spearmanr_train.loc[df_spearmanr_train.index.str.endswith("_20"), :].T.drop("target_nomi_20", axis=1).plot()
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0, fontsize=12)
plt.title("data: train, target: 20 days")
plt.xlabel("Era")
plt.ylabel("Spearman's Rho")
plt.show()

df_spearmanr_train.loc[df_spearmanr_train.index.str.endswith("_60"), :].T.plot()
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0, fontsize=12)
plt.title("data: train, target: 60 days")
plt.xlabel("Era")
plt.ylabel("Spearman's Rho")
plt.show()

df_spearmanr_valid.loc[df_spearmanr_valid.index.str.endswith("_20"), :].T.drop("target_nomi_20", axis=1).plot()
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0, fontsize=12)
plt.title("data: valid, target: 20 days")
plt.xlabel("Era")
plt.ylabel("Spearman's Rho")
plt.show()

df_spearmanr_valid.loc[df_spearmanr_valid.index.str.endswith("_60"), :].T.plot()
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0, fontsize=12)
plt.title("data: valid, target: 60 days")
plt.xlabel("Era")
plt.ylabel("Spearman's Rho")
plt.show()

## Next steps

As a next step, it may be worth using targets with a high Sharpe ratio (the mean of the per-era correlation with `target` divided by its standard deviation), such as "target_arthur_20", "target_william_20", and "target_jerome_20".
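
One possible way to shortlist such targets is sketched below. The target names and correlation values are placeholders invented for this example, not Numerai results, and the cutoff of two targets is an arbitrary assumption.

```python
import pandas as pd

# Placeholder per-era correlations (rows: hypothetical targets, columns: eras).
# These numbers are made up for illustration only.
toy_rho = pd.DataFrame(
    {"0001": [0.9, 0.5, 0.7], "0002": [0.8, 0.9, 0.7], "0003": [0.7, 0.1, 0.4]},
    index=["target_a_20", "target_b_20", "target_c_20"],
)

# Mean and (sample) standard deviation across eras for each target.
summary = toy_rho.T.agg(["mean", "std"]).T
summary["sharpe_ratio"] = summary["mean"] / summary["std"]

# Keep the two targets with the highest ratio (the cutoff is arbitrary).
ranked = summary.sort_values("sharpe_ratio", ascending=False)
top_targets = ranked.head(2).index.tolist()
print(top_targets)
```

A target with a lower mean can still rank above one with a higher mean if its correlation is more stable across eras, which is the point of dividing by the standard deviation.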

