<a href="https://colab.research.google.com/github/numerai/example-scripts/blob/master/Cross_Validation_and_Model_Optimization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
#import your packages and modules
import numpy as np
import pandas as pd
import sklearn

In [0]:
train_datalink = 'https://numerai-public-datasets.s3-us-west-2.amazonaws.com/latest_numerai_training_data.csv.xz'

df = pd.read_csv(train_datalink, nrows=50000) #download the training data and keep only the first 50,000 rows

We have an additional data step to consider before we define X and y.

Numerai's training data consists of eras, which are time-ordered groupings of data.

How many eras does our subsample contain?

We can use pandas to identify the unique observations in a column.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.nunique.html

We've done similar things in the past, but this may refresh your memory:

`df.era.nunique()`

In [0]:
#identify how many eras are contained in the subsample
df.era.nunique()

We need to change the 'era' column to contain only numbers so that we can easily split the data into a train and test set.

In [0]:
df['era'] = df.loc[:, 'era'].str[3:].astype('int32')

#what this line does:

#1 - over-write the existing 'era' column: df['era'] =
#2 - select only the column named 'era' and all of the observations in that column: df.loc[:, 'era']
#3 - for each observation in the selected column, remove the first 3 characters: .str[3:]
#4 - convert the data type to a 32-bit integer, which uses less memory than the default 64-bit integer: .astype('int32')

#this is called method chaining
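To see what the chained expression does step by step, here is a minimal sketch on a tiny hand-made frame. It assumes the era labels look like `era1`, `era12`, and so on (which is what removing the first three characters implies); the frame `toy` is hypothetical and is not part of the Numerai data.

```python
import pandas as pd

# Hypothetical miniature of the 'era' column; labels assumed to look like 'era1', 'era12'
toy = pd.DataFrame({'era': ['era1', 'era1', 'era2', 'era12']})

# Same chain as above: drop the first 3 characters, then cast to a 32-bit integer
toy['era'] = toy.loc[:, 'era'].str[3:].astype('int32')

print(toy['era'].tolist())   # [1, 1, 2, 12]
print(toy['era'].dtype)      # int32
```

Once the column is numeric, comparisons such as `toy.era < 10` behave as expected, which would not be the case for the original strings.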

This is a small subset of the 120 total training eras available.

We need to manually partition the data into training and testing sets. Let's train on eras 1-9 and set aside the remaining 8 eras (10-17) for testing.

`df_train = df[df.era < 10].copy()`

In [0]:
df_train = df[df.era < 10].copy() #nothing is printed. if you want to see the results of this operation, then add a new line and print the first 5 rows of df_train.

In [0]:
df_train.head()

In [0]:
df_train.era.unique() #this shows you the actual eras that were selected in the code to define df_train

In [0]:
#now define your test set.
#HINT: above, you defined df_train. define df_test, and convert the operation above to something that would select all eras other than what was already selected.
#Verify you got the right answer by following the code directly above this cell but change df_train to df_test

df_test = df[df.era > 9].copy() #this code tells python to select all eras that are greater than 9, whereas the train set was defined as all eras less than 10. This is a simple logic test.
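A quick sanity check for this kind of split is to confirm that no era appears in both sets and that no rows were lost. The sketch below uses a small hypothetical frame (`toy`) in place of `df`, so the era numbers and targets are made up.

```python
import pandas as pd

# Hypothetical stand-in for df, with a numeric 'era' column
toy = pd.DataFrame({
    'era':    [1, 5, 9, 10, 13, 17],
    'target': [0.0, 0.25, 0.5, 0.75, 1.0, 0.5],
})

toy_train = toy[toy.era < 10].copy()   # eras 1-9
toy_test = toy[toy.era > 9].copy()     # eras 10 and above

# Eras that appear in both sets; this should be empty
overlap = set(toy_train.era) & set(toy_test.era)

print(sorted(toy_train.era.unique()))  # [1, 5, 9]
print(sorted(toy_test.era.unique()))   # [10, 13, 17]
print(overlap)                         # set()
```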

In [0]:
#Now we have two dataframes, and need to define X, y and era twice. Let's do that now.

X_train = df_train.iloc[:, 3:313].values

X_test = df_test.iloc[:, 3:313].values

y_train = df_train.target.values

y_test = df_test.target.values

era_train = df_train.era.values

era_test = df_test.era.values

I mentioned that cross validation helps to evaluate models and to mitigate the effects of overfitting.

We need to further partition our data into cross validation folds.

Thankfully, Scikit-Learn has functions to do that for us. Let's take a look at the Cross Validation functions in scikit-learn.

https://scikit-learn.org/stable/modules/cross_validation.html

We know already that our data has groups, so we should prefer cross validators that account for groups.

https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-iterators-for-grouped-data

Within this section, there are a couple of strategies.

We can visualize the methodology:

![alt text](https://scikit-learn.org/stable/_images/sphx_glr_plot_cv_indices_005.png)
 
Please note that there are four different cross validation iterators for grouped data.

I want you to read about each one and think about your future production algorithm. Which one do you want to rely on to generate the best possible inference from the training data?

I will demonstrate how to use a cross-validation iterator using the first option. This may not be appropriate for your model!

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GroupKFold.html#sklearn.model_selection.GroupKFold

This time, we'll need to choose parameters and not rely on the default settings. Visit the link above and read the Parameters section.

 
You'll notice that the default number of splits is 5. We happen to have 9 training eras, which split evenly into 3 folds of 3 eras each.

The warning in the parameter section tells you the default setting changed in version 0.22 from 3 to 5. This is why it is important to define the settings for production algorithms which may require stability over several years.

While we could rely on the default settings for now, let's instead use n_splits=3

![alt text](https://i.postimg.cc/WpqxTCv7/image.png)

In [0]:
#import the model_selection module

from sklearn import model_selection

#define your cross-validation iterator

CV = model_selection.GroupKFold(n_splits = 3)

Now that we've defined our splitter, we have to use it to generate our cross-validation partitions.

I demonstrated last lecture that you can access the functions of a model (or iterator) by viewing the Methods section.

![alt text](https://i.postimg.cc/mZc3qGsT/Capture.jpg)
 
We can access the methods using '.' on CV:

![alt text](https://i.postimg.cc/BbZrSQc7/method-parameters.jpg)
 
Behind the scenes, the training data is sliced into further train and test partitions. `.split()` returns a generator that yields one pair of arrays per fold: the row indices to train on and the row indices to test on. The models in the next step can use these pairs automatically. This isn't easy to visualize, but the example code in the link gives a pretty good explanation of the output.
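A small sketch may make the output of `.split()` easier to picture. The arrays `X_toy`, `y_toy` and `eras_toy` below are hypothetical stand-ins for the real training data: 9 eras with 2 rows each.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical toy data: 18 rows, 2 rows for each of 9 eras
eras_toy = np.repeat(np.arange(1, 10), 2)
X_toy = np.arange(36).reshape(18, 2)
y_toy = np.arange(18)

cv_toy = GroupKFold(n_splits=3)

# list(...) consumes the generator so the folds can be reused later
folds = list(cv_toy.split(X_toy, y_toy, groups=eras_toy))

for fold_train_idx, fold_test_idx in folds:
    # Each fold is a pair of integer row-index arrays
    print('train eras:', np.unique(eras_toy[fold_train_idx]),
          'test eras:', np.unique(eras_toy[fold_test_idx]))
```

Every era lands in exactly one test fold, and no era is ever split between the train and test side of the same fold. Note that GroupKFold does not promise that the eras in a fold are consecutive in time.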

In [0]:
#We can now define a new variable which stores these row identifiers for algorithms to use.

grp = list(CV.split(X = X_train, y = y_train,  groups = era_train))

#what this line does is:

#1 - define a new variable: grp = 
#2 - use the python 'list' function: list(
#3 - within the list function, we use the .split() method on our group k-fold iterator, which we defined as CV: CV.split(
#4 - within the .split() method, we defined the required parameters and closed the .split parenthesis: X = X_train, y = y_train, groups = era_train)
#5 - we enclosed the list function: )

#this is called a nested function call.

You may have noticed that many models include parameters which can be changed from the default value. Doing so can increase the model's performance, but it can also lead to overfitting. The benefit of cross validation, and of models with built-in cross validation optimization, is that you don't have to work out by hand which value performs best.

Let's use Ridge Regression as an example.

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html#sklearn.linear_model.RidgeCV

![alt text](https://i.postimg.cc/sxRp7v0K/ridge-CV.jpg)

The model may improve by changing the alpha parameter, but we don't know which value to choose in advance. So we should just try a whole bunch and let the math wizards who wrote this code figure it out!

We can define the search space ourselves by creating a new variable and listing the values to evaluate explicitly like so:

`alphas = (0.1, 0.5, 1.0, 10.0)`

Since we are altering the default settings of RidgeCV, we have to set those parameters within the parentheses:

`RidgeCV(alphas = alphas, cv = grp)`

Above, we deviate from the default parameters for alphas and cv, using the variables we defined ourselves. Your python environment has stored those variables in memory, and RidgeCV will use them automatically when you call the fit function. Let's put this to work and determine which alpha gives the best cross-validated performance on the training data.

In [0]:
from sklearn import linear_model
alphas = (0.1, 0.5, 1.0, 10.0)
REG1 = linear_model.RidgeCV(alphas = alphas, cv = grp)
REG1.fit(X_train, y_train)
#you can just ignore the output that is generated below once the code finishes

We can access stored values:

![alt text](https://i.postimg.cc/LX4SXTmV/access.jpg)

In [0]:
REG1.alpha_ #This line prints the alpha value that gave the model the best fit

In [0]:
REG1.score(X_train, y_train) #scoring the model gives you the R-sq of the model using the best alpha parameter

This may be insufficient for your analysis. What if you wanted to see the performance of each alpha value?

We can use a model selection tool called Grid Search CV.

Let's turn this up to 11. 

![alt text](https://i.postimg.cc/SxjDyz0z/to-11.jpg)

https://scikit-learn.org/stable/modules/grid_search.html

Grid search allows us to "span the space" and determine which parameter value is best, and it also reports the performance of every candidate in a table. This is a very powerful tool!

Let's keep using Ridge, but now we will use the non-CV version.

In [0]:
REG2 = linear_model.Ridge()

#we also have to create a python dictionary for grid search to use:

params1 = {'alpha': [0.1, 0.3, 0.5, 0.8, 1.0, 2.0, 5.0, 10.0]}

#what this line does is:

#a dictionary maps keys to values; here the key is a parameter name and the value is the list of settings to try

#1 - define a new variable: params1 = 
#2 - open a python 'dictionary': {
#3 - define the key, which must match the name of the model parameter: 'alpha':
#4 - the value for that key is a list: [
#5 - inside the list are the values to try: 0.1, 0.3, 0.5, 0.8, 1.0, 2.0, 5.0, 10.0]
#6 - close the dictionary: }

#we created a dictionary that contains a list: https://docs.python.org/3/tutorial/datastructures.html#dictionaries
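The sketch below shows how a dictionary like this can be inspected, and how scikit-learn expands it into individual candidates. `params_demo` is a hypothetical copy of the grid; `ParameterGrid` is the scikit-learn helper for enumerating the combinations in a grid.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical copy of the parameter grid
params_demo = {'alpha': [0.1, 0.3, 0.5, 0.8, 1.0, 2.0, 5.0, 10.0]}

print(list(params_demo.keys()))    # ['alpha']
print(params_demo['alpha'][0])     # 0.1
print(len(params_demo['alpha']))   # 8

# Each candidate is a small dictionary holding one setting per parameter
candidates = list(ParameterGrid(params_demo))
print(len(candidates))             # 8
print(candidates[0])               # {'alpha': 0.1}
```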

In [0]:
GS1 = model_selection.GridSearchCV(estimator = REG2, param_grid = params1, cv = grp, return_train_score = True)
GS1.fit(X_train, y_train)
#you can ignore the output that is generated below after the code finishes

In [0]:
scores1 = pd.DataFrame(GS1.cv_results_); scores1

#if we type GS1.cv_results_ then we are given a dictionary. We can use Pandas to convert that dictionary to a dataframe.

#the code above defines a new variable called scores1, and we use the DataFrame() constructor from Pandas to interpret the GS1.cv_results_ attribute as a dataframe.
#we then used the shortcut: ; scores1 to display the table below.
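The full results table has many columns, so it can help to keep only a few and sort by rank. The following is a minimal sketch on synthetic data; `X_demo`, `y_demo`, `groups_demo` and `gs_demo` are hypothetical names, and the numbers it produces say nothing about the Numerai data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, GroupKFold

# Synthetic regression data with 9 groups of 10 rows (assumption for illustration)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(90, 5))
y_demo = 0.5 * X_demo[:, 0] + rng.normal(scale=0.1, size=90)
groups_demo = np.repeat(np.arange(9), 10)

folds_demo = list(GroupKFold(n_splits=3).split(X_demo, y_demo, groups=groups_demo))

gs_demo = GridSearchCV(Ridge(), {'alpha': [0.1, 1.0, 10.0]},
                       cv=folds_demo, return_train_score=True)
gs_demo.fit(X_demo, y_demo)

# Keep the most useful columns and put the best-ranked candidate first
cols = ['param_alpha', 'mean_train_score', 'mean_test_score',
        'std_test_score', 'rank_test_score']
summary = pd.DataFrame(gs_demo.cv_results_)[cols].sort_values('rank_test_score')
print(summary)
```

A large gap between `mean_train_score` and `mean_test_score` for a candidate is one hint that it is overfitting.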

![alt text](https://i.postimg.cc/PfQjD0CQ/winning-jpg.jpg)

WE HAVE DATA!!!



In [0]:
#which model is the best?

GS1.best_estimator_

#the output below contains the parameters used in the best model. You could copy this and define it as a new variable for production if you wanted to.
#alpha=10.0, copy_X=True, fit_intercept=True, max_iter=None,
#      normalize=False, random_state=None, solver='auto', tol=0.001

In [0]:
GS1.best_score_

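#this is the mean cross-validated score (R-sq) of the model above. Our score is negative! A negative R-sq means the model is worse than always predicting the average of the target. The model has not found a signal that generalizes across eras, and may have overfit.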
#However, this statement only holds for eras 1-9; perhaps the model would perform better with more data.
#I leave that to you to evaluate with a computer that has more resources available.
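To make the meaning of a negative R-sq concrete: a model that always predicts the average of the target scores exactly 0, and anything worse than that scores below 0. A minimal sketch with made-up numbers (`y_true_demo` and both prediction arrays are hypothetical):

```python
import numpy as np
from sklearn.metrics import r2_score

y_true_demo = np.array([0.0, 0.25, 0.5, 0.75, 1.0])

# Baseline: always predict the mean of the target
mean_pred = np.full_like(y_true_demo, y_true_demo.mean())

# A deliberately bad prediction: the targets in reverse order
bad_pred = y_true_demo[::-1]

r2_mean = r2_score(y_true_demo, mean_pred)
r2_bad = r2_score(y_true_demo, bad_pred)

print(r2_mean)  # 0.0
print(r2_bad)   # -3.0
```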

In [0]:
#this code will take almost 10 minutes to run. Please be patient!

#####################################################################################################################################################################################################################################

#What if a model has more than one parameter which can be optimized? Grid Search CV can handle that scenario as well. Here's how we handle more than one parameter optimization.
#Let's find the best combination of l1_ratio and tol in the ElasticNet model:

#https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html

params2 = {'l1_ratio': [0.0, 0.25, 0.33, 0.5, 0.66, 0.75, 1], 'tol': [0.01, 0.001, 0.0001, 0.00001]}

#what will happen in the background is that grid search will use l1_ratio of 0.0 and evaluate performance for each value of 'tol', and so on for every value of l1_ratio provided. This takes a long time with just 9 eras; imagine how long it would take to do this with all 120 eras! While trying a lot of parameters is a good idea, you've got to balance efficiency with your desire to find the best settings for the model. This is an art, not a science.

#You'll see some warning messages about convergence. You can ignore them for now.

REG3 = linear_model.ElasticNet(max_iter = 3000) #note I changed a default parameter! I am using a maximum of 3,000 training iterations to find the best fit for each run of the model.

GS2 = model_selection.GridSearchCV(estimator = REG3, param_grid = params2, cv = grp, return_train_score = True)

GS2.fit(X_train, y_train)
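The run time is easier to anticipate once you count the model fits. The sketch below repeats the grid under the hypothetical name `grid_demo` and assumes 3 cross-validation folds plus the single final refit that GridSearchCV performs by default.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical copy of the two-parameter grid
grid_demo = {'l1_ratio': [0.0, 0.25, 0.33, 0.5, 0.66, 0.75, 1],
             'tol': [0.01, 0.001, 0.0001, 0.00001]}

n_candidates = len(ParameterGrid(grid_demo))  # 7 values of l1_ratio x 4 values of tol
n_folds = 3                                   # assumed number of folds
n_cv_fits = n_candidates * n_folds
n_total_fits = n_cv_fits + 1                  # plus one refit of the best candidate

print(n_candidates)   # 28
print(n_cv_fits)      # 84
print(n_total_fits)   # 85
```

Adding one more parameter with, say, 5 values would multiply the number of candidates by 5, which is why large grids become slow so quickly.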

In [0]:
scores2 = pd.DataFrame(GS2.cv_results_); scores2

In [0]:
#determine which model was the best
GS2.best_estimator_

In [0]:
#determine the score of the best model
GS2.best_score_

In [0]:
#Now that we've determined which model is the best, we can evaluate the performance of the model on the out-of-sample test set.

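#GridSearchCV already refit the best model on all of the training data (refit=True is the default), so fitting again below is optional, but it makes the step explicit.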

REG4 = GS2.best_estimator_ #we can easily define the best model without writing very much, since GridSearchCV allows us to access the best model directly
REG4.fit(X_train, y_train)

In [0]:
#Let's double check the in-sample performance
REG4.score(X_train, y_train)

Numerai scores your predictions based on correlation with the target variable. Until this point, we've only considered the performance of the model based on the R-sq. Numerai provides the scoring code.

We can define a python function (https://docs.python.org/3/tutorial/controlflow.html#defining-functions) to calculate the correlation score:

In [0]:
def correlation_score(y_true, y_pred):
    return np.corrcoef(y_true, y_pred)[0,1]
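A quick check on made-up numbers shows how this function behaves. The function is repeated so the snippet runs on its own; `y_true_demo` is a hypothetical target.

```python
import numpy as np

def correlation_score(y_true, y_pred):
    # Pearson correlation between the target and the predictions
    return np.corrcoef(y_true, y_pred)[0, 1]

y_true_demo = np.array([0.0, 0.25, 0.5, 0.75, 1.0])

# Predictions that rise in lockstep with the target: correlation of about 1
rescaled = correlation_score(y_true_demo, 2 * y_true_demo + 3)

# Predictions that move in the opposite direction: correlation of about -1
flipped = correlation_score(y_true_demo, -y_true_demo)

print(rescaled, flipped)
```

Unlike R-sq, the correlation ignores the scale and offset of the predictions: `2 * y + 3` is far from `y` in absolute terms but is still perfectly correlated with it.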

In [0]:
#Let's generate some predictions. Since we've fit the model to the training data, we can generate predictions based on the test set.
#We want the highest possible correlation score with the understanding that a very high score may be a sign that the model has overfit.

preds1 = REG4.predict(X_test)

In [0]:
#Now let's calculate the out-of-sample correlation using Numerai's scoring function and print the score:

OOS_score1 = correlation_score(y_test, preds1); OOS_score1

![alt text](https://i.postimg.cc/7LR1Yx8P/thumbs-up.jpg)

5.7% correlation is pretty good! There are some caveats to remember. You trained the model on 9 out of 120 eras, and evaluated the performance based on 8 of the 111 eras remaining. We know that there is a time-series component to the data, so perhaps eras 1-9 are similar to eras 10-17. We still don't know if the model is any good, but we did take a scientific approach to the evaluation.

Because we held out data that the model has never seen, we can infer that if live stock market data is similar to eras 10-17, then we should expect our models to generate a 5.7% correlation, on average. This is highly unlikely in practice!

We have another problem. We trained the model on eras 1-9, so the assumption is that each era is nearly identical in composition. We know this to be false as well. The more likely case is that some of the eras are more similar than others. What if we used grid search to identify the best parameters to use on eras 1-3, 4-6, and 7-9? What if we fit those best models to 1-3, 4-6, and 7-9 and then generated predictions on the test set for each of them and averaged the predictions across the three models? That would be a much better approach, wouldn't it?

Here's the thing: Even if you did average across groups like that, there's no guarantee that eras 1-3 are sufficiently similar to each other but sufficiently different from 4-6 and 7-9 to improve the forecast. If 1-3, 4-6, and 7-9 are actually similar, then averaging the 3 prediction sets would yield an answer that was no different from any one of the 3 prediction sets.
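For reference, the idea of fitting one model per block of eras and averaging the predictions could be sketched as follows. Everything here is synthetic and hypothetical (`era_demo`, `X_demo`, `y_demo`, `X_new`, the choice of Ridge and its alpha); it only illustrates the mechanics, not whether the approach helps.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data: 9 eras with 20 rows each, 4 features (assumption for illustration)
rng = np.random.default_rng(0)
era_demo = np.repeat(np.arange(1, 10), 20)
X_demo = rng.normal(size=(era_demo.size, 4))
y_demo = X_demo @ np.array([0.3, -0.2, 0.0, 0.1]) + rng.normal(scale=0.5, size=era_demo.size)
X_new = rng.normal(size=(15, 4))   # stands in for the held-out test features

era_blocks = [(1, 3), (4, 6), (7, 9)]
block_preds = []
for first, last in era_blocks:
    # Fit one model on the rows belonging to this block of eras only
    mask = (era_demo >= first) & (era_demo <= last)
    model = Ridge(alpha=1.0).fit(X_demo[mask], y_demo[mask])
    block_preds.append(model.predict(X_new))

# Average the three prediction vectors row by row
avg_pred = np.mean(block_preds, axis=0)
print(avg_pred.shape)   # (15,)
```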

And another layer of complexity exists that we haven't even addressed yet. Some of the features are highly correlated. If your model does not address this multicollinearity then the model is likely to overfit those features, causing bad inference.
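One simple way to look for highly correlated features is to compute the pairwise correlation matrix and list the pairs above a threshold. The sketch below uses three synthetic features (`f1`, `f2`, `f3` are hypothetical names) and an arbitrary threshold of 0.9.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)

# f2 is almost a copy of f1; f3 is independent noise
feats = pd.DataFrame({
    'f1': base,
    'f2': base + rng.normal(scale=0.05, size=200),
    'f3': rng.normal(size=200),
})

corr = feats.corr().abs()
cols = list(corr.columns)

# Visit each pair once and keep those above the (arbitrary) 0.9 threshold
high_pairs = [(a, b, corr.loc[a, b])
              for i, a in enumerate(cols)
              for b in cols[i + 1:]
              if corr.loc[a, b] > 0.9]

for a, b, value in high_pairs:
    print(a, b, round(value, 3))
```

With 310 features the same loop visits tens of thousands of pairs, so in practice you may prefer to inspect a summary (for example, how many pairs exceed the threshold) rather than print them all.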

ASSIGNMENT: Use the scikit-learn website to identify three models (in this case, a model is a sequence of tasks that eventually generate predictions) which work with our dataset (regression algorithms) that:

- either automatically handle multicollinearity or offer some way to mitigate multicollinearity
- mitigate problems with training on the entire dataset, such as averaging across a number of models (hint: ensemble methods)
- or a combination of a feature selection tool followed by an algorithm which will work with our dataset

You must also identify which parameters can be optimized.
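If you choose the "feature selection followed by an algorithm" route, scikit-learn's `Pipeline` can chain the two steps, and `get_params()` lists every setting that can be tuned. This is only a sketch of the mechanics: the step names `select` and `reg`, the choice of `SelectKBest` with `f_regression`, and the grid values are all assumptions, not a recommended answer.

```python
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline

# Step 1 keeps the k highest-scoring features; step 2 fits a regression on them
pipe_demo = Pipeline(steps=[
    ('select', SelectKBest(score_func=f_regression, k=10)),
    ('reg', Ridge()),
])

# Every key returned here can be used in a grid search
tunable = sorted(pipe_demo.get_params().keys())
print(tunable)

# Parameters of a step are addressed as <step name>__<parameter name>
grid_demo = {'select__k': [10, 50, 100], 'reg__alpha': [0.1, 1.0, 10.0]}
```

A dictionary like `grid_demo` can be passed as `param_grid` to GridSearchCV with the pipeline as the estimator, in the same way as the single-model examples earlier.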

![alt text](https://i.postimg.cc/V6D8bHgN/supervised-learning.jpg)
 
Below is an example of a correct answer. Note that just because there are three models within the answer, that doesn't mean that you are finished. You still must develop 2 additional models. Perhaps you find an algorithm that does everything you want and doesn't require a long answer such as below. That's fine, but you should justify that decision in the description. This is an opportunity for you to think about your production model and to show me some of your ideas, and also that you understand the concepts we've discussed. If you find one you like but are struggling to understand how to set it up properly, then please join Numerai's Rocket Chat in the #newusers channel and I can help you.

In [0]:
#MODEL 1: Voting Regressor

#LINK: https://scikit-learn.org/stable/modules/ensemble.html#voting-regressor

#Description of methodology and how the model helps to give better performance:

#Voting Regressor requires that several models be defined which will each be fit on the training data.
#The voting regressor then averages the predictions made by each model, which balances out the individual weaknesses of the models.
#This will help to reduce overfitting, but only if the models are sufficiently different.

#Import the relevant modules, define the model(s), identify parameters to be optimized:

from sklearn import ensemble
from sklearn import kernel_ridge
from sklearn import neural_network
VR1 = kernel_ridge.KernelRidge() #kernel ridge allows for optimization of the alpha parameter.
VR2 = neural_network.MLPRegressor() #MLP regressor allows for optimization of many parameters: the activation function, the solver, alpha, batch_size, learning_rate_init, max_iter, tol, and many other parameters. So many, that if I choose to use this model in production, then I will ask the Professor for help before I try to use this in practice!
VR3 = linear_model.ElasticNet() #ElasticNet allows for optimization of these parameters: alpha, l1_ratio, tol

#the example code given for Voting Regressor tells me that I have to provide a list of estimators within the parenthesis, so I follow the example like this:

META = ensemble.VotingRegressor(estimators=[('VR1', VR1), ('VR2', VR2), ('VR3', VR3)])

#Write the correct code to fit your model:

META.fit(X_train, y_train)

#Write the code to generate and store predictions from your fitted model:

#you can name this whatever you want, here are some examples:

#PREDS1
#y_pred
#predictions

VR_pred = META.predict(X_test)

#Write the code to score the out-of-sample predictions using the correlation function and print the correlation score:

#you can name this whatever you want

OOS_VR_score = correlation_score(y_test, VR_pred); OOS_VR_score

In [0]:
#MODEL 1: Elastic Net

#LINK: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html#sklearn.linear_model.ElasticNet

#Description of methodology and how the model helps to give better performance:

#Elastic-net is useful when there are multiple features which are correlated with one another. Lasso is likely to pick one of these at random, while elastic-net is likely to pick both.

#Import the relevant modules, define the models, identify parameters to be optimized:

M1 = linear_model.ElasticNet() # parameters that can be optimized are: alpha, l1_ratio, tol

#Write the correct code to fit your model:

M1.fit(X_train, y_train)

#Write the code to generate and store predictions from your fitted model:

OOS_M1 = M1.predict(X_test)

#Write the code to score the out-of-sample predictions using the correlation function and print the correlation score:

OOS_M1_score = correlation_score(y_test, OOS_M1); OOS_M1_score

In [0]:
#MODEL 2: 

#LINK:

#Description of methodology and how the model helps to give better performance:

#Import the relevant modules, define the models, identify parameters to be optimized:
#parameters that can be optimized are: base_estimator, n_estimators, learning_rate, loss, random_state

#Write the correct code to fit your model:

#Write the code to generate and store predictions from your fitted model:

#Write the code to score the out-of-sample predictions using the correlation function and print the correlation score:


In [0]:
#MODEL 3: 

#LINK:

#Description of methodology and how the model helps to give better performance:

#Import the relevant modules, define the models, identify parameters to be optimized:

#Write the correct code to fit your model:

#Write the code to generate and store predictions from your fitted model:

#Write the code to score the out-of-sample predictions using the correlation function and print the correlation score: