# Unit 2 - Tournament Data
In this notebook we will cover:
   1. The new Super Massive Data launch
   2. How to download the data
   3. A brief overview of the data structure

<img src="images/DEYTA.jpg" />

# Super Massive Data Launch
     
<img src="images/supamassivedatapng.png" />

# Accessing the Data
- Two main ways of accessing training & tournament data:
    1. Manual download from the Numerai website
    2. Using NumerAPI


- Use pandas or Dask to read the Parquet files into a DataFrame
     

### 1. Manual Download

- Legacy and current data sets available to download on the dashboard

<img src="images/manualDLDashboard.png" />     

### 2. NumerAPI
- We can list and download the datasets with NumerAPI
- Make sure to run ***!pip install numerapi*** if you haven't already!
- Once downloaded, use pandas or Dask (used in this notebook) to read the Parquet files into a DataFrame
     

In [2]:
#!pip install numerapi
from pathlib import Path
import dask.dataframe as dd
import numerapi
import matplotlib.pyplot as plt


In [3]:
#Create instance of NumerAPI
napi = numerapi.NumerAPI()

#List available datasets
napi.list_datasets()


['example_predictions.csv',
 'example_predictions.parquet',
 'example_validation_predictions.csv',
 'example_validation_predictions.parquet',
 'features.json',
 'numerai_datasets.zip',
 'numerai_live_data.csv',
 'numerai_live_data.parquet',
 'numerai_live_data_int8.csv',
 'numerai_live_data_int8.parquet',
 'numerai_tournament_data.csv',
 'numerai_tournament_data.parquet',
 'numerai_tournament_data_int8.csv',
 'numerai_tournament_data_int8.parquet',
 'numerai_training_data.csv',
 'numerai_training_data.parquet',
 'numerai_training_data_int8.csv',
 'numerai_training_data_int8.parquet',
 'numerai_validation_data.csv',
 'numerai_validation_data.parquet',
 'numerai_validation_data_int8.csv',
 'numerai_validation_data_int8.parquet']

In [4]:
# Use NumerAPI to download a single file
train_pq_path = "numerai_training_data_int8.parquet"

# Arguments: the dataset name from list_datasets(), then the local destination path
napi.download_dataset("numerai_training_data_int8.parquet", train_pq_path)


2021-11-07 15:10:30,953 INFO numerapi.utils: target file already exists
2021-11-07 15:10:30,954 INFO numerapi.utils: download complete


In [5]:
# Get the current round
# Shoutout to RocketChat for this code snippet!
CURRENT_ROUND = napi.get_current_round()

# Training and validation files are saved directly under data/;
# the other files go into a per-round folder. Create the folders up front.
Path(f"data/{CURRENT_ROUND}").mkdir(exist_ok=True, parents=True)

# Download every dataset that is both parquet and int8
for file in napi.list_datasets():
    if "parquet" in file and "int8" in file:
        if "training" in file or "validation" in file:
            napi.download_dataset(file, f"data/{file}")
        else:
            napi.download_dataset(file, f"data/{CURRENT_ROUND}/{file}")


2021-11-07 15:10:33,007 INFO numerapi.utils: target file already exists
2021-11-07 15:10:33,007 INFO numerapi.utils: download complete
2021-11-07 15:10:33,823 INFO numerapi.utils: target file already exists
2021-11-07 15:10:33,824 INFO numerapi.utils: download complete
2021-11-07 15:10:34,572 INFO numerapi.utils: target file already exists
2021-11-07 15:10:34,573 INFO numerapi.utils: download complete
2021-11-07 15:10:35,446 INFO numerapi.utils: target file already exists
2021-11-07 15:10:35,447 INFO numerapi.utils: download complete
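
The routing rule used in the download loop can be pulled out into a small pure function, which makes it easy to check without calling the API. This is only a sketch: `destination_for` is a hypothetical helper (not part of NumerAPI), and the round number 289 is an arbitrary example value.

```python
from pathlib import PurePosixPath
from typing import Optional


def destination_for(filename: str, current_round: int, root: str = "data") -> Optional[str]:
    """Return the local path for a dataset file, or None if it should be skipped.

    Hypothetical helper that mirrors the download loop: only int8 parquet
    files are kept; training and validation files go directly under `root`,
    everything else goes into a per-round subfolder.
    """
    if "parquet" not in filename or "int8" not in filename:
        return None
    if "training" in filename or "validation" in filename:
        return str(PurePosixPath(root) / filename)
    return str(PurePosixPath(root) / str(current_round) / filename)


# A few names from the dataset listing; 289 is a made-up round number.
names = [
    "features.json",
    "numerai_live_data_int8.parquet",
    "numerai_training_data_int8.parquet",
    "numerai_training_data.parquet",
]
plan = {name: destination_for(name, 289) for name in names}
for name, dest in plan.items():
    print(name, "->", dest)
```

Files mapped to `None` are skipped, so only the int8 parquet variants end up on disk.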


In [7]:
# Read parquet files into Dask DataFrames (lazy: data is only loaded when needed)
df_train = dd.read_parquet('data/numerai_training_data_int8.parquet')
df_val = dd.read_parquet('data/numerai_validation_data_int8.parquet')

df_tournament = dd.read_parquet(f'data/{CURRENT_ROUND}/numerai_tournament_data_int8.parquet')


In [8]:
df_train.head()

id,era,data_type,feature_dichasial_hammier_spawner,feature_rheumy_epistemic_prancer,feature_pert_performative_hormuz,feature_hillier_unpitied_theobromine,feature_perigean_bewitching_thruster,feature_renegade_undomestic_milord,feature_koranic_rude_corf,feature_demisable_expiring_millepede,...,target_paul_20,target_paul_60,target_george_20,target_george_60,target_william_20,target_william_60,target_arthur_20,target_arthur_60,target_thomas_20,target_thomas_60
n003bba8a98662e4,1,train,4,2,4,4,0,0,4,4,...,0.25,0.25,0.25,0.0,0.166667,0.0,0.166667,0.0,0.166667,0.0
n003bee128c2fcfc,1,train,2,4,1,3,0,3,2,3,...,1.0,1.0,1.0,1.0,0.833333,0.666667,0.833333,0.666667,0.833333,0.666667
n0048ac83aff7194,1,train,2,1,3,0,3,0,3,3,...,0.5,0.25,0.25,0.25,0.5,0.333333,0.5,0.333333,0.5,0.333333
n00691bec80d3e02,1,train,4,2,2,3,0,4,1,4,...,0.5,0.5,0.5,0.5,0.666667,0.5,0.5,0.5,0.666667,0.5
n00b8720a2fdc4f2,1,train,4,3,4,4,0,0,4,2,...,0.5,0.5,0.5,0.5,0.666667,0.5,0.5,0.5,0.666667,0.5


In [9]:
df_val.head()

id,era,data_type,feature_dichasial_hammier_spawner,feature_rheumy_epistemic_prancer,feature_pert_performative_hormuz,feature_hillier_unpitied_theobromine,feature_perigean_bewitching_thruster,feature_renegade_undomestic_milord,feature_koranic_rude_corf,feature_demisable_expiring_millepede,...,target_paul_20,target_paul_60,target_george_20,target_george_60,target_william_20,target_william_60,target_arthur_20,target_arthur_60,target_thomas_20,target_thomas_60
n000777698096000,857,validation,2,2,1,1,0,0,0,3,...,0.25,0.5,0.25,0.5,0.166667,0.5,0.166667,0.5,0.166667,0.5
n0009793a3b91c27,857,validation,3,1,2,3,4,0,1,1,...,0.5,0.5,0.5,0.5,0.666667,0.833333,0.666667,0.833333,0.5,0.5
n00099ccd6698ab0,857,validation,1,3,0,3,4,3,4,4,...,0.25,0.25,0.25,0.25,0.166667,0.333333,0.166667,0.333333,0.166667,0.166667
n0019e36bbb8702b,857,validation,2,4,1,3,3,2,0,3,...,0.75,0.5,0.5,0.5,0.666667,0.5,0.666667,0.5,0.5,0.5
n0028cb874439df8,857,validation,0,3,0,1,4,2,0,0,...,0.25,0.5,0.5,0.75,0.333333,0.5,0.333333,0.5,0.5,0.666667


In [10]:
df_tournament.head()

id,era,data_type,feature_dichasial_hammier_spawner,feature_rheumy_epistemic_prancer,feature_pert_performative_hormuz,feature_hillier_unpitied_theobromine,feature_perigean_bewitching_thruster,feature_renegade_undomestic_milord,feature_koranic_rude_corf,feature_demisable_expiring_millepede,...,target_paul_20,target_paul_60,target_george_20,target_george_60,target_william_20,target_william_60,target_arthur_20,target_arthur_60,target_thomas_20,target_thomas_60
n000101811a8a843,575,test,2,0,4,0,3,0,4,1,...,,,,,,,,,,
n001e1318d5072ac,575,test,1,4,2,2,1,3,3,0,...,,,,,,,,,,
n002a9c5ab785cbb,575,test,1,2,2,3,1,1,3,0,...,,,,,,,,,,
n002ccf6d0e8c5ad,575,test,2,4,2,4,2,4,3,2,...,,,,,,,,,,
n0051ab821295c29,575,test,2,0,0,1,0,4,2,1,...,,,,,,,,,,


<img src="images/soccertargets.jpg" width=550/>

# Breaking Down the Data
- Let's take a deeper look at the Features, Eras and structure of the training data!

<img src="images/data.jpg" />

In [None]:
# There are 1050 features with fun names generated by a hashing function
features = [c for c in df_train.columns if c.startswith("feature")]
targets = [c for c in df_train.columns if c.startswith("target")]

# Add an integer copy of the era column for filtering and sorting
df_train["erano"] = df_train.era.astype(int)
eras = df_train.erano

target = "target"

print(len(features))
print(features[:5])
print(eras.max().compute())  # Dask is lazy, so .compute() triggers the calculation


### Features
- 1050 randomly named features, with no feature groups!
- Values take five levels: floats (0, 0.25, 0.5, 0.75, 1) in the standard files or integers (0-4) in the int8 files, and are adjusted for biases
- Suggested starting point: a regression approach
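
To relate the two encodings, here is a minimal sketch that assumes the int8 levels line up one-to-one with the float levels (0 with 0, 1 with 0.25, and so on up to 4 with 1). That correspondence is an assumption made for illustration, and `int8_to_float` is a hypothetical helper.

```python
def int8_to_float(level: int) -> float:
    """Map an int8 feature level (0-4) to the 0-1 scale, assuming a one-to-one match."""
    if level not in range(5):
        raise ValueError(f"expected a level in 0..4, got {level}")
    return level / 4


# Feature values of the first row in the training preview.
row_int8 = [4, 2, 4, 4, 0, 0, 4, 4]
row_float = [int8_to_float(v) for v in row_int8]
print(row_float)  # [1.0, 0.5, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0]
```

The int8 files are attractive mainly because each value fits in a single byte, so the conversion is only needed if a model expects the 0-1 scale.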

In [None]:
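# Visualize the feature correlation matrix for era 1. Feel free to construct your own groupings!
# .compute() turns the lazy Dask result into a pandas DataFrame before plotting
corr = df_train[df_train.erano == 1][features].corr().compute()
plt.figure(figsize=(8, 8))
plt.imshow(corr)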


<img src="images/fcorr.png" />
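
One possible way to build your own groupings from correlations is a greedy pass: each feature joins the first group whose seed feature it correlates with strongly, otherwise it starts a new group. This is a sketch of one approach among many, not a Numerai method; the 0.8 threshold, the helper names, and the toy feature names are all made up for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (constant columns not handled)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def greedy_groups(columns, threshold=0.8):
    """Put each feature in the first group whose seed it correlates with strongly."""
    groups = []  # each group is a list of names; the first name is the seed
    for name, values in columns.items():
        for group in groups:
            if abs(pearson(columns[group[0]], values)) >= threshold:
                group.append(name)
                break
        else:
            # No existing seed is close enough, so this feature seeds a new group.
            groups.append([name])
    return groups


# Toy data with made-up feature names (not real Numerai columns).
toy = {
    "feature_a": [0, 1, 2, 3, 4],
    "feature_b": [0, 1, 2, 4, 3],  # close to feature_a
    "feature_c": [4, 0, 3, 1, 2],  # weakly related to feature_a
    "feature_d": [0, 4, 1, 3, 2],  # mirror image of feature_c (4 - value)
}
groups = greedy_groups(toy)
print(groups)  # [['feature_a', 'feature_b'], ['feature_c', 'feature_d']]
```

The result depends on column order and on the threshold, so treat the groups as a starting point for exploration rather than a fixed structure.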

### Eras

- Eras are now numbered by week!
    - Can be rolled up to monthly by subsampling (keeping every 4th era), which reduces computing cost
    - Target is four weeks out (20 trading days)

- Targets overlap in time across eras: with weekly eras and a four-week target, neighboring eras share part of their target window

- Although we are given the era, starting with a time-series-based model is meh
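
The overlap and the every-4th-era subsampling can be sketched with plain integers. The window alignment below (era `e` covering weeks `e` through `e + 3`) is a deliberate simplification, and all helper names are hypothetical.

```python
def target_weeks(era: int, horizon_weeks: int = 4) -> range:
    """Weeks covered by an era's target (simplified: era e spans weeks e..e+3)."""
    return range(era, era + horizon_weeks)


def targets_overlap(era_a: int, era_b: int, horizon_weeks: int = 4) -> bool:
    """True when the two eras' target windows share at least one week."""
    weeks_a = set(target_weeks(era_a, horizon_weeks))
    weeks_b = set(target_weeks(era_b, horizon_weeks))
    return bool(weeks_a & weeks_b)


def every_nth_era(eras, step: int = 4):
    """Keep the first era and then every `step`-th one after it."""
    ordered = sorted(set(eras))
    return ordered[::step]


eras = list(range(1, 13))      # twelve consecutive weekly eras
monthly = every_nth_era(eras)  # roughly one era per month
print(monthly)                 # [1, 5, 9]
print(targets_overlap(1, 4))   # True: both windows include week 4
print(targets_overlap(1, 5))   # False: windows are weeks 1-4 and 5-8
```

With a DataFrame, the same idea would amount to keeping only the rows whose integer era is in the subsampled list.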
     

<img src="images/overlap.png" />

### Data Types (Traditional)
- Loosely follows conventional machine learning dataset splits:
    - **Training**: The sample of data used to fit the model.
    - **Validation**: The sample of data used to provide an unbiased evaluation of a model fit on the training dataset while tuning model hyperparameters.
    - **Test**: The sample of data used to provide an unbiased evaluation of a final model.

     

<img src="images/dataSplits_TRAD.png"/>

### Data Types (Numerai)
- **training_data**
    - One continuous period of historical data
    - Has targets provided
- **tournament_data**
    - Consists of “test” and “live”
    - All of these rows must be predicted for a submission to be valid
    - No targets provided
    - Test is used for internal testing, but is not part of the tournament scoring and payouts
    - Live is what users stake on and are scored on in the tournament
- **validation_data**
    - A separate file. Predictions on these rows are not required for submission
    - It can be submitted at any time to receive diagnostics on your predictions
    - Has targets provided
    - This is the most recent data that we provide, far removed from the training data. This makes it particularly useful for seeing how your models’ performance declines over time, and how they would have performed lately.
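
Since every tournament row must be predicted for a submission to be valid, a quick coverage check before uploading can help. This is a minimal sketch: `missing_predictions` is a hypothetical helper, and the prediction values are made up.

```python
def missing_predictions(tournament_ids, predicted_ids):
    """Return the tournament ids that have no prediction, in their original order."""
    predicted = set(predicted_ids)
    return [row_id for row_id in tournament_ids if row_id not in predicted]


# The first three ids from the tournament preview.
tournament_ids = ["n000101811a8a843", "n001e1318d5072ac", "n002a9c5ab785cbb"]

# Made-up prediction values keyed by id; one row is deliberately left out.
predictions = {"n000101811a8a843": 0.51, "n002a9c5ab785cbb": 0.48}

missing = missing_predictions(tournament_ids, predictions)
print(missing)  # ['n001e1318d5072ac']
```

An empty list means every tournament row, both test and live, has a prediction.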


     

<img src="images/dataSplits_NMR.png"/>

<img src="images/supaval.png" />

# Thank You and Good Luck!
- Like & Subscribe for more!
- GitHub with this notebook + links [here]
- Find my socials [here](https://linktr.ee/peterling) for more numer.ai related content

<img src="images/TAF.jpg"/>