{[Click here to read this notebook in Google Colab](https://colab.research.google.com/drive/1z_VSTnTAueUPF4kNXWSqaPr4DAOxyQ7z)}

<head><link rel = "stylesheet" href = "https://drive.google.com/uc?id=1zYOH-_Mb9jOjRbQmghdhsmZ2g6xAwakk"></head>

<table class = "header"><tr>
    <th align = "left">EPAT Batch 45 | MLT 2, 2020/05/04</th>
    <th align = "right">Written by: Gaston Solari Loudet</th>
</tr></table>

### Predictive Models

> Welcome! You have recently joined as a quant analyst in a trading firm. Your supervisor has asked you to build and compare two machine learning models to predict the direction (up or down) of the next day's return of stock ZXV, given the traded volume, return and deliverable quantity for the current day. Being a systematic and well-organized analyst, you tackle this problem in a stepwise fashion, which is laid out in this notebook. Go through steps 1 to 7 and answer the questions along the way.

#### Imports and downloading

> [**[1.A]**] In this assignment we will be working with the Scikit Learn library for machine learning, along with Numpy and Pandas. In the next cell, fill in the blanks to import "``sklearn``" library:

In [46]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
import pandas, numpy, sklearn.metrics, sklearn.linear_model, sklearn.preprocessing, sklearn.neighbors, sklearn.model_selection

> [**[1.B]**] Read the .CSV data file called "``MLT_Assignment_data.csv``" such that the "``Date``" column is the index, and save it in a DataFrame called "``df``". This is the daily data for a stock ZXV for 491 days. Check out the shape of the DataFrame.

In [63]:
############################################################################
# Download CSV file from Google Drive for use outside Jupyter.
URL = "https://drive.google.com/uc?id=1fDesPqg-Er4NHoIbehHy809r-2nUlM6L"
############################################################################
df = pandas.read_csv(URL, index_col = [0], parse_dates = True)
# Shape of "df"
print(f"DataFrame for market data on stock ZXV: {df.shape[0]} rows; {df.shape[1]} columns")
assert(df.shape[0] == 491) ## Verify that the answer above is correct.
print(f"Column names:\n{[x for x in df.columns]}")

DataFrame for market data on stock ZXV: 491 rows; 7 columns
Column names:
['Previous Close', 'Close Price', 'return', 'Traded Volume', 'Deliverable Qty', 'Next_day_return', 'Next_day_direction']


> [**[1.C]**] In this step you get to know about the dataset you are working with better and answer some important questions. The DataFrame "``df``" consists of the following seven columns:

> 1. "``Previous Close``": Previous day close price for the stock.
  2. "``Close Price``": Current day close price for the stock.
  3. "``return``": Current day return.
  4. "``Traded Volume``": Current day trading volume in number of shares.
  5. "``Deliverable Qty``": Quantity of shares which actually move from one investor's demat account to other on current day.
  6. "``Next_day_return``": Next day's return.
  7. "``Next_day_direction``": Direction of next day's return. "1 = up" and "0 = down".

> You have to create a ML model that predicts the direction of the return on the next trading day. Between linear regression and logistic regression, which is a better algorithm for this task and why? What are some other ML algorithms that can be used for this task?

Linear regression models an approximately linear relationship between a set of explanatory variables and a continuous explained variable. The error term ("epsilon") is the distance between the modelled value of the explained variable and the real one.

Logistic regression models the relationship between a set of explanatory variables and the probability that a boolean explained variable takes one of its two possible values. The output is therefore continuous (a probability between 0 and 1), but it expresses "proximity to 1", so it also serves as a measure of how certain the classification is.

What we are trying to predict is a "direction": the sign of the next day's return, positive or negative. The explained variable is therefore boolean, and logistic regression suits our need to measure how certain we are that it is positive or negative. A linear regression would instead produce unbounded values that cannot be read as probabilities.

Note that a boolean variable is a non-ordinal discrete variable (e.g. a category) restricted to two possible values. Hence, any classification algorithm that handles two classes should be applicable. For example: support vector machines, dense neural networks with softmax activation functions, Naive Bayes, etc.
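
To make the contrast concrete, here is a minimal sketch of how a logistic model turns a linear score into a probability and then into a direction. The weights, bias and feature values are made up for illustration; they are not fitted to the ZXV data.

```python
import numpy

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + numpy.exp(-z))

# Hypothetical weights, bias and (already standardized) features: illustration only.
weights = numpy.array([0.8, -0.3, 0.1])
bias = 0.05
features = numpy.array([[1.2, -0.5, 0.3],     # one row per day
                        [-0.9, 0.4, -1.1]])

linear_score = features @ weights + bias      # unbounded, like a linear regression output
prob_up = sigmoid(linear_score)               # logistic regression: P(direction = 1)
direction = (prob_up >= 0.5).astype(int)      # threshold at 0.5 to obtain 0/1 labels

print(prob_up.round(3), direction)
```

The first row gets a positive score and is labelled "up"; the second gets a negative score and is labelled "down".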

#### Preparing the data

> [**[2.A]**] The ML model that predicts the direction of the return on the next trading day should use 3 features. Namely: "``return``", "``Traded Volume``" and "``Deliverable Qty``". Which column in DataFrame 'df' is the target variable?

Column "``Next_day_direction``" is the target variable. The learning algorithm should output binary values such as "0/1" (the encoding used in this dataset) or "-1/1".

> [**[2.B]**] Before we use the data with the <a href = "https://scikit-learn.org/">Scikit Learn</a> library, we need to get it in the right form. We need to prepare 2 Numpy arrays, "``X``" for the features and "``y``" for the target variable. In the cell below, we create a feature matrix "``X``" containing all the 3 aforementioned features. In the following cell, replace the blanks with the target column name to create the target array.

In [64]:
# feature matrix X
X = df[["return", "Traded Volume", "Deliverable Qty"]].values
print(f"Feature matrix 'X' ({X.ndim}-dimensional array): {X.shape[0]} rows, {X.shape[1]} columns.")
# target variable y
y = df["Next_day_direction"].values
print(f" Target vector 'y' ({y.ndim}-dimensional array): {y.shape[0]} rows")
assert(X.shape[0] == y.shape[0])

Feature matrix 'X' (2-dimensional array): 491 rows, 3 columns.
 Target vector 'y' (1-dimensional array): 491 rows


#### Train-test split

> [**[3]**] The next task is to split the available data into train and test datasets. Once you train a model on the training dataset, you can check its out-of-sample predictive accuracy on the test dataset. You are asked by your supervisor to keep 20% of your data as test data (80:20 train-test split). Scikit Learn provides a functionality to split the data using the "``train_test_split``" function. Put the correct value of the "``test_size``" parameter to conduct an 80 to 20 train-test split.

In [65]:
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
                                    X, y, test_size = 1/5, random_state = 0)
# Shapes of "X_train", "X_test", "y_train" and "y_test".
print(f"Training feature matrix: {X_train.shape[0]} rows, {X_train.shape[1]} columns.")
print(f" Training target matrix: {y_train.shape[0]} elements.")
print(f" Testing feature matrix: {X_test.shape[0]} rows, {X_test.shape[1]} columns.")
print(f"  Testing target matrix: {y_test.shape[0]} elements.")

Training feature matrix: 392 rows, 3 columns.
 Training target matrix: 392 elements.
 Testing feature matrix: 99 rows, 3 columns.
  Testing target matrix: 99 elements.
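
The split sizes above follow from how "``train_test_split``" rounds: with 491 rows and "``test_size = 0.2``", the test set gets 99 rows and the training set the remaining 392. The sketch below reproduces this on placeholder arrays (not the ZXV data). It also shows "``shuffle = False``", which keeps rows in their original order; for daily market data a chronological split is sometimes preferred, but that is an optional variation and not what this notebook does.

```python
import numpy
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as "df" (values are arbitrary).
X_demo = numpy.arange(491 * 3).reshape(491, 3)
y_demo = numpy.arange(491) % 2

# Shuffled 80:20 split, as in the notebook.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
print(X_tr.shape, X_te.shape)

# Optional variation: keep the original (chronological) order,
# so the test set is simply the last 99 rows.
X_tr_c, X_te_c, y_tr_c, y_te_c = train_test_split(X_demo, y_demo, test_size=0.2, shuffle=False)
print(X_te_c[0], X_te_c[-1])
```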


#### Feature scaling

> [**[4]**] As different features can have different scales and variance, we rescale all the features so that all of them have zero mean and unit standard deviation. This process is called "feature standardization" and can be performed using "``StandardScaler``"  from Scikit Learn. In the cell below, we standardize the "``X_train``" feature matrix. In the following cell, repeat the procedure for "``X_test``" so as to get a scaled version of "``X_test``".

In [66]:
# Standardizing features in X_train
X_train_scaled = sklearn.preprocessing.StandardScaler().fit_transform(X_train)
# Standardizing features in X_test
# NOTE: this fits a second scaler on the test data, repeating the procedure as asked.
# The usual practice is to fit the scaler on the training data only and reuse it here.
X_test_scaled = sklearn.preprocessing.StandardScaler().fit_transform(X_test)
# CHECK: overall mean and standard deviation before and after scaling.
c = ["X_train", "X_train_scaled", "X_test", "X_test_scaled"]
i = ["Mean", "StDv"]
check = pandas.DataFrame(columns = c, index = i,
        data = [[X_train.mean(), X_train_scaled.mean(),
                 X_test.mean(), X_test_scaled.mean()],
                [X_train.std(), X_train_scaled.std(),
                 X_test.std(), X_test_scaled.std()]]).round(0)
check.head()

| | X_train | X_train_scaled | X_test | X_test_scaled |
|---|---|---|---|---|
| Mean | 4512257.0 | -0.0 | 4486566.0 | 0.0 |
| StDv | 5173718.0 | 1.0 | 4812810.0 | 1.0 |
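
The cell above fits a separate "``StandardScaler``" on "``X_test``". A common alternative is to fit the scaler on the training data only and apply the same transformation to the test data, so that no information from the test set enters preprocessing. The sketch below illustrates that pattern on synthetic data (the means and scales are arbitrary assumptions, not taken from the ZXV file), so the results in this notebook are left as they are.

```python
import numpy
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for three features with very different scales.
rng = numpy.random.default_rng(0)
loc, scale = [0.0, 5e6, 2e6], [0.02, 4e6, 1e6]
X_train_demo = rng.normal(loc=loc, scale=scale, size=(392, 3))
X_test_demo = rng.normal(loc=loc, scale=scale, size=(99, 3))

# Fit on the training data only, then reuse the same means and deviations.
scaler = StandardScaler().fit(X_train_demo)
X_train_demo_scaled = scaler.transform(X_train_demo)
X_test_demo_scaled = scaler.transform(X_test_demo)

# Training columns have exactly zero mean and unit deviation;
# test columns are only approximately standardized, which is expected.
print(X_train_demo_scaled.mean(axis=0).round(3), X_train_demo_scaled.std(axis=0).round(3))
print(X_test_demo_scaled.mean(axis=0).round(3), X_test_demo_scaled.std(axis=0).round(3))
```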


#### Fitting the models

> [**[5]**] Based on past research on similar data, your supervisor has asked you to fit and evaluate the following two models on the scaled training data:

> 1. "``Model_1``": A Logistic Regression model, with which you are already familiar.
  2. "``Model_2``": k-Nearest-Neighbors, with "$k = 5$"... But what is this?

The k-Nearest-Neighbors ("<u>**kNN**</u>") is also a classification algorithm like logistic regression, but uses a completely different methodology to classify the samples. It takes in a hyperparameter "$k$", decided by the user. Different values of "$k$" affect the decision boundary for the classification. In Scikit Learn, hyperparameter "$k$" is called "``n_neighbors``". From past experience, your supervisor tells you to use "$k = 5$".  

In the next cell we write the code to create "``Model_1``". Execute the cell. Along similar lines, we have written the code for "``Model_2``" in the cell after that. Fill in the blanks in the following cell to fit the data to "``Model_2``" (i.e.: kNN with "$k = 5$") and execute both cells.

In [67]:
# Creating the model
Model_1 = sklearn.linear_model.LogisticRegression(solver = 'liblinear', random_state = None)
Model_2 = sklearn.neighbors.KNeighborsClassifier(n_neighbors = 5)
# Training the model
Model_1.fit(X_train_scaled, y_train);
Model_2.fit(X_train_scaled, y_train);
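
To show what the kNN classifier fitted above does conceptually, here is a minimal from-scratch sketch: a new point takes the majority label among its "$k$" nearest training points. The function "``knn_predict``" and the toy dataset are hypothetical and for illustration only; "``KNeighborsClassifier``" is the implementation actually used in this notebook.

```python
import numpy

def knn_predict(X_fit, y_fit, x_new, k=5):
    """Classify one point by majority vote among its k nearest training points."""
    distances = numpy.linalg.norm(X_fit - x_new, axis=1)   # Euclidean distance to every row
    nearest = numpy.argsort(distances)[:k]                 # indices of the k closest rows
    votes = numpy.bincount(y_fit[nearest], minlength=2)    # number of 0 and 1 labels
    return int(votes.argmax())

# Toy dataset (made up): three points labelled 0 near the origin, four labelled 1 near (1, 1).
X_toy = numpy.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                     [1.0, 1.0], [0.9, 1.1], [1.1, 0.9], [1.0, 0.8]])
y_toy = numpy.array([0, 0, 0, 1, 1, 1, 1])

print(knn_predict(X_toy, y_toy, numpy.array([0.95, 0.95]), k=5))
print(knn_predict(X_toy, y_toy, numpy.array([0.05, 0.05]), k=3))
```

A larger "$k$" smooths the decision boundary because more neighbours vote, while a smaller "$k$" follows individual training points more closely.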

#### k-fold cross validation on train data

> [**[6]**] Now that both "``Model_1``" and "``Model_2``" are trained, we want to know the average accuracy of each model using k-fold cross validation on the training data itself. This will help us to gauge the likely out of sample performance of each model. In the cell below, we calculate the mean cross validated score (for a 10-fold cross validation) for "``Model_1``". Repeat the same for "``Model_2``" in the following cell. Based on these scores, which of the two models is likely to generalize well, i.e. perform better on out of sample data?

In [68]:
cv = 10 ## Partitions.
# average cross-validated score for Model_1 and Model_2
ac1 = sklearn.model_selection.cross_val_score(Model_1, X_train_scaled, y_train, cv = cv, scoring = 'accuracy').mean().round(3)
ac2 = sklearn.model_selection.cross_val_score(Model_2, X_train_scaled, y_train, cv = cv, scoring = 'accuracy').mean().round(3)
print("Scores from 10-fold cross validation.")
print(f"---> Model_1: {numpy.round(1e4*ac1)/1e2}%  |  Model_2: {numpy.round(1e4*ac2)/1e2}%")

Scores from 10-fold cross validation.
---> Model_1: 54.9%  |  Model_2: 53.6%


The kNN model with "``k = 5``" scored lower (53.6%) than the logistic regression (54.9%) in cross validation. Based on the training data alone, the latter will therefore probably generalize slightly better to unseen (test) data, though the gap of 1.3 percentage points is small and both scores are only slightly above 50%.
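
For reference, the sketch below spells out what a 10-fold cross-validated accuracy amounts to: the data is cut into 10 folds, the model is refitted 10 times on 9 folds, and each time it is scored on the held-out fold. It uses synthetic data (an assumption, not the ZXV set) and "``StratifiedKFold``", which is the splitter Scikit Learn applies to classifiers when "``cv``" is given as an integer.

```python
import numpy
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary classification problem: illustration only.
rng = numpy.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + rng.normal(scale=1.0, size=200) > 0).astype(int)

model = LogisticRegression(solver="liblinear")
folds = StratifiedKFold(n_splits=10)

# Manual loop: fit on 9 folds, score on the remaining one.
manual_scores = []
for fit_idx, val_idx in folds.split(X_demo, y_demo):
    model.fit(X_demo[fit_idx], y_demo[fit_idx])
    manual_scores.append(model.score(X_demo[val_idx], y_demo[val_idx]))

# Same computation through the helper used in the notebook.
auto_scores = cross_val_score(model, X_demo, y_demo, cv=10, scoring="accuracy")
print(numpy.mean(manual_scores), auto_scores.mean())
```

Looking at the spread of the 10 fold scores, and not only their mean, helps judge whether a gap between two models is meaningful.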

#### Prediction and accuracy scores for test data

> [**[7]**] In the cell below, we use "``Model_1``" to do out-of-sample prediction (i.e.: on test data) and print the accuracy scores. We are also printing the corresponding confusion matrix.  Repeat this procedure for "``Model_2``" in the following cell. Based on the accuracy scores for test data, which of the two models would you recommend?

In [69]:
# Model_1
# Prediction on test dataset
y_pred_test_1 = Model_1.predict(X_test_scaled)
score_1 = Model_1.score(X_test_scaled, y_test)
# Accuracy score 
print("Accuracy for 'Model_1' on test data:")
print(f"---> {100*score_1:.2f}%")
# Confusion matrix for prediction on test data
CM_1 = sklearn.metrics.confusion_matrix(y_test, y_pred_test_1)
print(f"Confusion matrix for test prediction:\n{CM_1}")

Accuracy for 'Model_1' on test data:
---> 48.48%
Confusion matrix for test prediction:
[[27 16]
 [35 21]]


In [70]:
# Model_2
# Prediction on test dataset
y_pred_test_2 = Model_2.predict(X_test_scaled)
score_2 = Model_2.score(X_test_scaled, y_test)
# Accuracy score 
print("Accuracy for 'Model_2' on test data:")
print(f"---> {100*score_2:.2f}%")
# Confusion matrix for prediction on test data
CM_2 = sklearn.metrics.confusion_matrix(y_test, y_pred_test_2)
print("Confusion matrix for test prediction:\n", CM_2)

Accuracy for 'Model_2' on test data:
---> 37.37%
Confusion matrix for test prediction:
 [[19 24]
 [38 18]]


#### Conclusions

The difference in accuracy is about 11 percentage points in favour of "``Model_1``" (48.48% against 37.37%), so Logistic Regression is likely to perform better for this particular task. Note, however, that both models score below 50% on the test data, while always predicting "up" would have been right on 56 of the 99 test days, so neither model shows real predictive power here. Logistic regression is designed specifically for binary classification, whereas "kNN" depends on local neighbourhoods in feature space, which may be noisy for this kind of data. Both methods had most of their errors in "false negatives": actual bull days which were predicted as bearish (lower-left numbers of the matrices, 35 and 38). A possible next step would be a deeper model, for example a dense layer of "tanh" activation functions before the "sigmoid" logistic regressor, though whether it would help remains to be tested.
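
As a cross-check, the sketch below recomputes accuracy, precision and recall directly from the two confusion matrices printed above. It relies on the Scikit Learn convention that rows are true labels and columns are predicted labels, so each matrix reads "``[[TN, FP], [FN, TP]]``". The helper "``summarize``" is a hypothetical name used only for this illustration.

```python
import numpy

def summarize(cm):
    """Accuracy, precision and recall from a 2x2 matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    return {"accuracy": (tn + tp) / cm.sum(),
            "precision": tp / (tp + fp),    # share of predicted "up" days that were up
            "recall": tp / (tp + fn)}       # share of actual "up" days that were caught

# Confusion matrices copied from the notebook outputs.
CM_1 = numpy.array([[27, 16], [35, 21]])
CM_2 = numpy.array([[19, 24], [38, 18]])

for name, cm in [("Model_1", CM_1), ("Model_2", CM_2)]:
    stats = summarize(cm)
    print(name, {key: round(float(value), 4) for key, value in stats.items()})
```

The low recall of both models reflects the large lower-left entries: many actual "up" days were predicted as "down".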