# Machine Learning in Finance - Model Training

This notebook focuses on *how well we can forecast stock closing prices using machine learning methods*.

*Authors:* [Mina Attia](https://people.epfl.ch/mina.attia), [Arnaud Felber](https://people.epfl.ch/arnaud.felber), [Rami Atassi](https://people.epfl.ch/rami.atassi) & [Paulo Ribeiro](https://people.epfl.ch/paulo.ribeirodecarvalho)

## Import

Import the Python scripts and libraries needed.

In [ ]:
import pandas as pd
from models.linear_regression import OLS, OLSLasso
from models.pearson_correlation import PearsonCorrelation
from models.random_forest import RandomForest

%load_ext autoreload
%autoreload 2


## Data

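The data is loaded from `data/data_imputed.csv`, and the label is `return`. The identifier columns (`permno`, `date`) and the return- and price-derived columns are excluded from the predictors. A fuller description is given in the README.md file.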

In [ ]:
file_path = 'data/data_imputed.csv'
label = 'return'

data = pd.read_csv(filepath_or_buffer=file_path).head(10000)  # TODO: remove .head(10000); it is only here to test the implementation quickly
X = data.drop(columns=['permno', 'date', 'return', 'log_return', 'price', 'log_price'])
y = data[label]

## Methods

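We train and test the following methods: a Pearson correlation analysis of the predictors, OLS (simple and with Lasso regularization), and a Random Forest.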

### Pearson Correlation Matrix

The correlation matrix shows which predictors are highly correlated with each other. If two predictors are highly correlated, we may not want to keep both; in that case we drop the one that is less correlated with our label (return).
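The pruning rule described above can be sketched with plain pandas. This is only an illustration under stated assumptions: the helper `drop_correlated_predictors`, the `0.9` threshold, and the synthetic columns `x1`, `x2`, `x3` are hypothetical and are not part of the project's `PearsonCorrelation` class.

```python
import numpy as np
import pandas as pd


def drop_correlated_predictors(df, label, threshold=0.9):
    """Return the predictor names kept after pruning highly correlated pairs.

    For each pair of predictors whose absolute Pearson correlation exceeds
    `threshold`, drop the one that is less correlated with `label`.
    """
    corr = df.corr(method="pearson")
    label_corr = corr[label].abs().drop(label)
    predictors = list(label_corr.index)
    dropped = set()
    for i, a in enumerate(predictors):
        for b in predictors[i + 1:]:
            if a in dropped or b in dropped:
                continue
            if abs(corr.loc[a, b]) > threshold:
                # Drop whichever of the two tracks the label less closely.
                dropped.add(a if label_corr[a] < label_corr[b] else b)
    return [p for p in predictors if p not in dropped]


# Synthetic data (hypothetical columns): x2 is a noisy copy of x1.
rng = np.random.default_rng(0)
n = 1_000
x1 = rng.normal(size=n)
x2 = x1 + 0.3 * rng.normal(size=n)
x3 = rng.normal(size=n)
toy = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})
toy["return"] = 0.5 * x1 + 0.2 * x3 + 0.1 * rng.normal(size=n)

kept = drop_correlated_predictors(toy, label="return")
print(kept)
```

Here `x1` and `x2` form the only highly correlated pair, so one of them is removed while `x3` is left untouched.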

In [ ]:
correlation = PearsonCorrelation(data=data) # Takes a minute

In [ ]:
correlation_matrix = correlation.get_sorted_correlation_pairs(ascending=False, top_k_pairs=10)
display(correlation_matrix)

In [ ]:
correlation.plot(plot_type='median')

### OLS

A basic approach is to regress our label on the predictors and inspect the weight given to each predictor. As seen in class, regularization can help with predictor selection. We therefore implement:

1) Simple OLS
2) Lasso OLS
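To illustrate why regularization helps with predictor selection, here is a minimal sketch using scikit-learn on synthetic data. It is an assumption that scikit-learn is a reasonable stand-in here; the project's own `OLS` and `OLSLasso` classes may be implemented differently, and the data and variable names below are made up for the example.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: only the first two of eight predictors drive the label.
rng = np.random.default_rng(0)
n_samples, n_features = 500, 8
X_toy = rng.normal(size=(n_samples, n_features))
true_weights = np.array([1.5, -2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y_toy = X_toy @ true_weights + 0.5 * rng.normal(size=n_samples)

# Standardize so the L1 penalty treats every predictor on the same scale.
X_scaled = StandardScaler().fit_transform(X_toy)

# Plain OLS: every predictor gets some non-zero weight.
ols = LinearRegression().fit(X_scaled, y_toy)

# Lasso with the penalty strength alpha chosen by 5-fold cross-validation.
lasso = LassoCV(cv=5, random_state=0).fit(X_scaled, y_toy)

print("OLS weights:  ", np.round(ols.coef_, 3))
print("Lasso weights:", np.round(lasso.coef_, 3))
print("Chosen alpha: ", lasso.alpha_)
print("Predictors with non-zero Lasso weight:", np.flatnonzero(lasso.coef_))
```

The L1 penalty shrinks all weights toward zero and can set the weights of uninformative predictors exactly to zero, which is what makes Lasso usable as a selection step. How many weights are zeroed depends on the alpha picked by cross-validation.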


#### Simple OLS

In [ ]:
ols_model = OLS(predictors=X,
                label=y)
ols_model.fit()
ols_model.show_weights()

print(f'R-Square of the Simple OLS: {ols_model.r_square:.3f}')

#### Lasso OLS

In [ ]:
ols_lasso_model = OLSLasso(predictors=X,
                           label=y)

ols_lasso_model.alpha_cross_validation(from_=0.01, to_=0.015, val_number=10)
ols_lasso_model.fit()
ols_lasso_model.show_weights()

print(f'R-Square of the Lasso OLS: {ols_lasso_model.r_square:.3f}')

### Random Forest

A Random Forest can also be a good way to select predictors. We therefore fit a Random Forest to our label using the whole dataset and look at the importance of each predictor.
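As a rough sketch of how feature importances can be read off a fitted forest, the example below uses scikit-learn's `RandomForestRegressor` on synthetic data. The column names `f0` to `f3` and the data-generating process are hypothetical, and the project's `RandomForest` wrapper may expose this differently.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the label depends strongly on f0, weakly on f1,
# and not at all on f2 and f3.
rng = np.random.default_rng(0)
n = 1_000
feature_names = ["f0", "f1", "f2", "f3"]
toy = pd.DataFrame(rng.normal(size=(n, 4)), columns=feature_names)
toy["return"] = 2.0 * toy["f0"] + np.sin(toy["f1"]) + 0.1 * rng.normal(size=n)

forest = RandomForestRegressor(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(toy[feature_names], toy["return"])

# Impurity-based importances: one non-negative value per predictor, summing to 1.
importances = pd.Series(forest.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances are computed on the training data and can be biased toward high-cardinality or correlated predictors, so they are best treated as a first indication rather than a final ranking.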

In [ ]:
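# Cap the sample size at len(data): asking for more rows than exist raises a ValueError.
# Note: test_data is not used below; the forest is trained on the full data.
test_data = data.sample(n=min(300000, len(data)))
rf = RandomForest(data=data, target='return')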

In [ ]:
rf.hyperparameter_tuning_with_crossvalidation(max_features=['sqrt', 'log2'], cv_splits=3)

In [ ]:
rf.fit_predict_and_print_score()

In [ ]:
rf.plot_feature_importance()

In [ ]:
rf.plot_decision_tree()

## Results

We will test our models and compare them to see which one outperforms the others.