# Machine Learning in Finance - Data Processing

This notebook pre-processes the data by imputing missing values with a Random Forest based method, before ML techniques are applied to find the top characteristics for price prediction.

***Note: This part uses the `missingpy` package, which requires old versions of scipy. To avoid dependency conflicts, please uncomment the first cell to create a conda environment dedicated to this pre-processing task. Do not forget to set the created environment as your IDE interpreter.***

*Authors:* [Mina Attia](https://people.epfl.ch/mina.attia), [Arnaud Felber](https://people.epfl.ch/arnaud.felber), [Milos Novakovic](https://people.epfl.ch/milos.novakovic), [Rami Atassi](https://people.epfl.ch/rami.atassi) & [Paulo Ribeiro](https://people.epfl.ch/paulo.ribeirodecarvalho)

In [1]:
#!conda create --name impute python=3.8
#!conda activate impute
#!/opt/anaconda3/envs/impute/bin/pip install missingpy
#!/opt/anaconda3/envs/impute/bin/pip install scikit_learn
#!/opt/anaconda3/envs/impute/bin/pip install pandas

## Import

In [2]:
import pandas as pd
import numpy as np
from missingpy.missforest import MissForest
from helpers import load_data_df, select_time_window

%load_ext autoreload
%autoreload 2

## Data

Load the dataset.

In [3]:
#file_path = 'data/signed_predictors_all_wide.csv'

#data = load_data_df(file_path=file_path)

## Impute

Impute the missing values using machine learning. To do so, we use the `missingpy` package and its MissForest algorithm, which imputes missing values iteratively with Random Forests: each column that contains missing values is predicted from the other columns, and the procedure repeats until the imputed values stop improving.

In [4]:
nan = np.nan
data = pd.DataFrame([[1, 2, nan], [3, 4, 3], [nan, 6, 5], [8, 8, 7]])

print(data)

     0  1    2
0  1.0  2  NaN
1  3.0  4  3.0
2  NaN  6  5.0
3  8.0  8  7.0
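
Before imputing, it can help to check how much data is actually missing in each column. The cell below is a minimal sketch on the same toy frame; the names `toy`, `missing_per_column` and `missing_share` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Same toy frame as above, rebuilt so this cell is self-contained.
toy = pd.DataFrame([[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]])

# Number of missing entries in each column.
missing_per_column = toy.isna().sum()

# Fraction of missing entries in each column (between 0 and 1).
missing_share = toy.isna().mean()

print(missing_per_column)
print(missing_share)
```

Columns with a very high share of missing values are candidates for dropping rather than imputing, since there is little observed data for the forest to learn from.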


In [5]:
imputer = MissForest(missing_values=np.nan,
                     criterion=('squared_error', 'gini'),
                     max_features='sqrt',
                     random_state=1337)
data_imputed = imputer.fit_transform(data)
data_imputed

Iteration: 0
Iteration: 1
Iteration: 2


array([[1.  , 2.  , 3.88],
       [3.  , 4.  , 3.  ],
       [2.67, 6.  , 5.  ],
       [8.  , 8.  , 7.  ]])
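
If `missingpy` cannot be installed, a similar iterative, forest-based imputation can be approximated with scikit-learn alone. The sketch below is not MissForest itself: it swaps in scikit-learn's `IterativeImputer` with a `RandomForestRegressor` as the per-column estimator, and it assumes a recent scikit-learn version (not the old one pinned for `missingpy`). The variable names and the hyperparameters (`n_estimators=50`, `max_iter=10`) are illustrative choices, and the imputed values will generally differ from the MissForest output above.

```python
import numpy as np
import pandas as pd

# IterativeImputer is still experimental and must be enabled explicitly.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

# Same toy frame as above, rebuilt so this cell is self-contained.
toy = pd.DataFrame([[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]])

# Each column with missing values is regressed on the other columns
# with a random forest, and the rounds repeat up to max_iter times.
rf_imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=1337),
    max_iter=10,
    random_state=1337,
)

toy_imputed = pd.DataFrame(
    rf_imputer.fit_transform(toy),
    columns=toy.columns,
    index=toy.index,
)
print(toy_imputed)
```

Only the entries that were missing are filled in; the observed values are passed through unchanged. On such a tiny frame scikit-learn may emit a convergence warning, which can be ignored for this illustration.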