<a href="https://colab.research.google.com/github/jonrtaylor/example-scripts/blob/master/Working_with_Numerai_data_and_SKLearn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

ATTENTION! YOU MUST RUN EACH CODE BLOCK IN ORDER, FROM TOP TO BOTTOM. LATER BLOCKS DEPEND ON VARIABLES DEFINED IN EARLIER ONES!

In [0]:
#import statements always go first

import pandas as pd
import numpy as np
import sklearn #this is the Scikit-Learn package (sklearn for short)

Scikit-Learn has conventions that your code must follow. For instance, an algorithm requires training data. We already know that the Numerai data is labeled as features and targets, and that observations are grouped into eras.
So, our rows are the observations, the features are the independent variables, and the target is the dependent variable.
We will have a matrix X with n rows (one per observation) and one column per feature, plus a target vector y with one value per observation. Your data should conform to this convention in every step of your process!
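
To make the shape convention concrete, here is a minimal sketch using made-up numbers rather than the Numerai data. The names `X_demo` and `y_demo` are placeholders invented for this illustration.

```python
import numpy as np

# Synthetic stand-in for training data: 6 observations, 3 features.
# The values are random and only serve to show the shapes.
rng = np.random.default_rng(0)
X_demo = rng.random((6, 3))   # shape (n_observations, n_features)
y_demo = rng.random(6)        # shape (n_observations,)

print(X_demo.shape)  # (6, 3)
print(y_demo.shape)  # (6,)

# The number of rows in X must match the length of y.
assert X_demo.shape[0] == y_demo.shape[0]
```

If the row count of X and the length of y ever disagree, scikit-learn will refuse to fit, so this is a useful thing to check first.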

In [0]:
#define variables used inside functions above those functions

train_datalink = 'https://numerai-public-datasets.s3-us-west-2.amazonaws.com/latest_numerai_training_data.csv.xz'

df_train = pd.read_csv(train_datalink, nrows=50000) #download the training data and keep only the first 50,000 rows
#df_train = pd.read_csv('https://numerai-public-datasets.s3-us-west-2.amazonaws.com/latest_numerai_training_data.csv.xz', nrows=50000) #this line of code would also work

Let's break the training data into the parts we need. First, let's create our dependent variable as a new object, y.

In [0]:
y = df_train.target.values  #we use the .values attribute to extract the 'target' column as a NumPy array of numbers

Let's also create a shortcut to access the feature columns.

In [0]:
#create a list of the names of the 310 feature columns
#there are several ways to do this! This way keeps the dataframe's structure and lets us isolate the feature columns with df_train[features].
features = [c for c in df_train if c.startswith("feature")]
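
The list comprehension above works because iterating over a DataFrame yields its column names. A small sketch on a toy frame shows the idea; the frame `df_demo` and its column names are hypothetical and are not the real Numerai columns.

```python
import pandas as pd

# Hypothetical toy frame; the column names are invented for illustration.
df_demo = pd.DataFrame({
    "id": ["a", "b"],
    "era": ["era1", "era1"],
    "feature_one": [0.25, 0.50],
    "feature_two": [0.75, 1.00],
    "target": [0.5, 0.25],
})

# Iterating over a DataFrame yields its column names (strings here).
feature_cols = [c for c in df_demo if c.startswith("feature")]
print(feature_cols)  # ['feature_one', 'feature_two']

# Indexing with the list keeps a DataFrame; .values gives the NumPy matrix.
X_small = df_demo[feature_cols].values
print(X_small.shape)  # (2, 2)
```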

Have we accomplished our goal? Let's check!

In [0]:
df_train[features].head(10) #first 10 rows

Above, you'll notice that 10 rows x 310 columns are printed for your review. We have specified that features represents every column in the dataframe whose name starts with the string "feature".
What do you notice about the values in each column?

In [0]:
df_train.feature_intelligence1.describe()

Let's look at just one era.

In [0]:
era1 = df_train[df_train.era == "era1"].copy()

In [0]:
era1.describe()

You do it! Follow the two code blocks above, and print the descriptive statistics for the second era.

In [0]:
#hint: You have to change some things in order to select the second era, but the code above works and can be modified for your use.


Now that you've explored the data a little bit, let's begin to use Scikit-Learn with our subsample.
Begin with Generalized Linear Models: https://scikit-learn.org/stable/modules/linear_model.html
Scikit-learn is extremely well documented. Most algorithms come with code examples. Let's first take a look at Ordinary Least Squares.
The documentation shows that LinearRegression is part of the linear_model module. To use any algorithm in linear_model, we must first import the module.

In [0]:
from sklearn import linear_model #This is how you import a specific module to your python environment. In order to use LinearRegression, we must access LinearRegression within the linear_model module.

LinearRegression takes several parameters.

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression

Visiting the above link shows you the documentation for LinearRegression.

You can see that:
    
    sklearn.linear_model.LinearRegression(fit_intercept=True, normalize=False, copy_X=True, n_jobs=None)

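represents the default settings for the algorithm. This is a standard presentation format throughout Scikit-Learn. Note that the signature above comes from an older release: the `normalize` parameter was deprecated in scikit-learn 1.0 and removed in 1.2, so the current documentation shows a slightly different list.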
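
Because the default parameters can differ between scikit-learn versions, one way to see the defaults in your own environment is to ask the estimator directly. This is a small sketch; `reg_demo` is a placeholder name chosen for this example.

```python
from sklearn import linear_model

# Create the estimator without passing any arguments, then ask it
# which parameter values it ended up with.
reg_demo = linear_model.LinearRegression()
params = reg_demo.get_params()

# Print every parameter and its default value for the installed version.
for name, value in sorted(params.items()):
    print(name, "=", value)
```

The exact list printed depends on the installed version, which is why no expected output is shown here.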

We can leave the parameters at their defaults and focus on using the algorithm, since LinearRegression is such a basic algorithm.

Below the explanation of the parameters, you'll see an "Examples" section.

We need to define our explanatory variables (independent variables, "features") and our dependent variable, y. We can define X as:

`X = df_train[features].values`

You need to copy the line of code above and paste it in the cell below in order to define X. Run the cell.

In [0]:
#Use this code cell to define X


Also, we can assign the algorithm to its own Python variable. The example names LinearRegression as reg.

You can do this as well, or call the algorithm whatever you want. Personally, I like to use a systematic naming method, because I often evaluate several versions of the algorithm at once.

`REG1 = linear_model.LinearRegression()`

You need to copy the line of code above and paste it in the cell below in order to define REG1. Run the cell.

In [0]:
#Use this code cell to define REG1


Now REG1 is my algorithm, and I can access its methods and attributes using "." (dot notation).

We have to fit the algorithm to the training data.

In [0]:
#We have to fit the model to the training data using the following convention:
#REG1.fit(X, y)
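
To see what fitting does without touching the Numerai data, here is a sketch on made-up data where the true relationship is known in advance. The names `X_demo`, `y_demo`, and `reg_demo` are placeholders, and the coefficients 2, -1, and 0.5 were chosen arbitrarily for the illustration.

```python
import numpy as np
from sklearn import linear_model

# Made-up data with a known, noise-free relationship:
# y = 2*x0 - 1*x1 + 0.5
rng = np.random.default_rng(0)
X_demo = rng.random((100, 2))
y_demo = 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + 0.5

reg_demo = linear_model.LinearRegression()
reg_demo.fit(X_demo, y_demo)   # learn the coefficients from the data

# Fitted attributes end with an underscore by scikit-learn convention.
print(reg_demo.coef_)       # approximately [ 2. -1.]
print(reg_demo.intercept_)  # approximately 0.5

# .predict() is another method reached through dot notation.
print(reg_demo.predict(X_demo[:3]))
```

Since the data has no noise, the fitted coefficients should match the ones used to build `y_demo` up to floating-point error. Real data will never be this clean.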

In [0]:
#YOU DO IT! Fit your algorithm to the training data.


Let's see how the model performed by using the .score() method.
After you've fit an algorithm in scikit-learn, you can evaluate its in-sample performance by calling .score() on the same data you used for fitting.

`REG1.score(X, y)`

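Running the code snippet above will give you the R-squared value for the model, which is the fraction of the variance in y that the model explains. A score close to 1 means the model explains almost all of the variance. A score close to zero means it explains almost none. But remember! This is only an in-sample estimate!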
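
Why does "in-sample" matter? The sketch below fits LinearRegression to pure noise, where there is nothing real to learn, and compares the score on the rows used for fitting with the score on rows held back. The data is synthetic, the 75/25 split is an arbitrary choice for this illustration, and the variable names are placeholders.

```python
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import train_test_split

# Pure noise: the target has no relationship to the features at all.
rng = np.random.default_rng(0)
X_noise = rng.normal(size=(200, 100))
y_noise = rng.normal(size=200)

# Hold out 25% of the rows so the model never sees them during fitting.
X_fit, X_held, y_fit, y_held = train_test_split(
    X_noise, y_noise, test_size=0.25, random_state=0
)

reg_demo = linear_model.LinearRegression().fit(X_fit, y_fit)

in_sample = reg_demo.score(X_fit, y_fit)     # R-squared on rows used for fitting
held_out = reg_demo.score(X_held, y_held)    # R-squared on unseen rows

print("in-sample R-squared:", in_sample)
print("held-out R-squared: ", held_out)
```

With many features and few rows, the in-sample score should come out well above zero even though the target is random, while the held-out score is typically much lower and can be negative. That gap is the reason an in-sample score alone should not be trusted.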

You do it! Run your model's .score() and evaluate the R-squared value.

In [0]:
#Copy the code from the cell above to produce the model's score



You've now fit a model to a subset of the training data. If you can do this using LinearRegression, then what other algorithms can you use within the linear_model module?

https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model

ASSIGNMENT:

```
PICK ANY ALGORITHM FROM THE LINK ABOVE OTHER THAN LINEARREGRESSION AND FIT THE ALGORITHM TO THE TRAINING DATA.

SCORE YOUR MODEL USING THE .SCORE() FUNCTION.
```

Congratulations! You have successfully demonstrated how to: define your training data, fit a machine learning model to the data, and evaluate the model's in-sample performance.

Don't stop now! Read other sections of scikit-learn and see which models you can use for a regression task.

ASSIGNMENT:

```
NAME 5 ALGORITHMS WHICH WILL WORK WITH A REGRESSION TASK.
IMPORT THEM TO THIS SHEET. AT LEAST ONE MUST COME FROM A MODULE OTHER THAN LINEAR_MODEL.

REMEMBER, SCIKIT-LEARN IS VERY WELL DOCUMENTED! GOOGLE IS YOUR FRIEND! DO SOME WORK!
```