# Prediction with Machine Learning: Part 3
In the first part (MarketPlaceSimulator.ipynb) we briefly described the main assumptions and strategy of the MarketPlace simulator and presented the source code. In the second part (Exploratory_data_analysis.ipynb), we started using the data generated in part one to better understand the correlations and trends of the data. In this part of the project, we will use ML algorithms to answer the following questions:

 1. Can we predict **when** a trader will be matched, given their chosen set of parameters?
 2. Which method is the best method for this task?
 
For this task we will be using the following **Python** packages: **matplotlib**, **graphlab create**, **SFrame**.
As mentioned in Part 1 of the project, the MarketPlace simulator generates a set of files called **dayX_relinfo.csv**, where **X** corresponds to a particular day. These data sets are already balanced, i.e., the positive and negative observations of **Match_day** occur in equal numbers.
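
A quick way to confirm that a data set is balanced is to count the values of its label column. The sketch below does this with the standard library only; the `class_balance` helper and the sample `match_day` values are illustrative assumptions, not part of the simulator.

```python
from collections import Counter

def class_balance(labels):
    """Return the fraction of observations that fall in each class."""
    counts = Counter(labels)
    total = float(len(labels))
    return {label: count / total for label, count in counts.items()}

# Hypothetical Match_day values (1 = matched on that day, 0 = not matched)
match_day = [0, 1, 0, 0, 1, 1, 0, 1]

balance = class_balance(match_day)
print(balance)
```

A balanced binary data set should give fractions close to 0.5 for both classes.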

This notebook is structured as:
    1. Data Overview
    2. Applying different Machine Learning algorithms to different models 
       i. Regression methods 
          Results
       ii. Classification methods
          Results

## 1. Data Overview

In [405]:
import graphlab as gl
import matplotlib.pyplot as plt
import pandas as pd
import scipy
import numpy
import sklearn
gl.canvas.set_target('ipynb')
%matplotlib inline

In [421]:
# Loading the data
data_all = gl.SFrame.read_csv('Source_Code/TradeLog_relinfo.csv')
data_day1 = gl.SFrame.read_csv('Source_Code/day1_relinfo.csv')
data_day2 = gl.SFrame.read_csv('Source_Code/day2_relinfo.csv')
data_day3 = gl.SFrame.read_csv('Source_Code/day3_relinfo.csv')

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,float,float,float,float,int,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,float,float,float,int,int,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,float,float,float,int,int,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,float,float,float,int,int,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


In [424]:
# Quick look at the full trade log (TradeLog_relinfo.csv)
data_all

CurrencySell,avgIBR,OrdRate_IBR,wavgMatchRate_IBR,AmountOrd,TimeStampPlaced,Time_diff
EURO,0.778787670271,0.7764,0.780645237163,70300.0,8053,970
GBP,1.28214298363,1.2844,1.2853470437,116100.0,0,14811
GBP,1.29415696773,1.2931,1.2962761421,36800.0,0,4215
GBP,1.28908591455,1.2888,1.28932439402,54400.0,0,7681
EURO,0.778015393152,0.7778,0.778816199377,60400.0,14249,4
GBP,1.27861019473,1.2787,1.28336755647,2800.0,0,17154
EURO,0.783349861192,0.7796,0.780125386965,147900.0,8678,9289
EURO,0.782811168004,0.78,0.78063902068,134900.0,1259,17513
EURO,0.771292037295,0.7696,0.771663156835,99000.0,3526,0
GBP,1.29322457642,1.2909,1.29123797431,113100.0,0,1341


In [425]:
data_day1

CurrencySell,avgIBR,OrdRate_IBR,wavgMatchRate_IBR,AmountOrd,TimeStampPlaced,Match_day
GBP,1.28867470926,1.2896,1.29032258065,18800,0,0
EURO,0.772660362928,0.7715,0.773335395561,92500,4304,1
GBP,1.28501839922,1.2809,1.28766359505,147500,0,0
EURO,0.782007605574,0.7791,0.781606216338,77400,4606,0
EURO,0.776004350398,0.7724,0.776276975625,9500,11144,1
GBP,1.29253210267,1.2869,1.29115558425,14500,0,0
EURO,0.783022997104,0.7799,0.782131738464,120000,4205,0
EURO,0.77178992695,0.7734,0.773874013311,133700,4712,1
GBP,1.28104663896,1.2832,1.28435653737,12000,0,0
EURO,0.781611445232,0.779,0.78339208774,59900,1011,0


In [426]:
my_features = ['avgIBR', 'OrdRate_IBR', 'wavgMatchRate_IBR', 'AmountOrd','TimeStampPlaced']

In [427]:
# Filtering Euro and GBP
data_day1_Euro = data_day1[data_day1['CurrencySell'] == 'EURO']
data_day1_GBP  = data_day1[data_day1['CurrencySell'] == 'GBP' ]
data_day1_Euro[my_features].show()
data_day1_GBP[my_features].show()
data_day1['Match_day'].show(view='Categorical')

## 2. Applying different Machine Learning algorithms to different models 
### i. Regression methods 
We will assess how well different regression algorithms predict when a matching event will occur. For this, we will use **Time_diff** as our continuous target variable.
In particular, we will use the Linear Regression, Random Forest, and Boosted Trees algorithms.
We will now apply each of these algorithms to the chosen predictor variables.
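
The models in this section are trained on a random 80% of the rows and evaluated on the remaining 20%. As a rough illustration of what a seeded random split does, here is a minimal standard-library sketch. The `random_split` helper below is hypothetical and is not GraphLab's implementation; it only assumes that each row is assigned independently, so the split sizes are approximate.

```python
import random

def random_split(rows, fraction, seed=0):
    """Send each row to the first set with probability `fraction`.

    A fixed seed makes the split reproducible across runs.
    """
    rng = random.Random(seed)
    first, second = [], []
    for row in rows:
        if rng.random() < fraction:
            first.append(row)
        else:
            second.append(row)
    return first, second

# Hypothetical row identifiers standing in for the trade log
rows = list(range(1000))

train_rows, test_rows = random_split(rows, 0.8, seed=0)
print("train: %d rows, test: %d rows" % (len(train_rows), len(test_rows)))
```

Because every row lands in exactly one of the two sets, no observation used for training is reused for testing.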

In [443]:
# Splitting the data into train and test sets
train_set, test_set = data_all.random_split(.8,seed=0)

# 1) Linear Regression
predictors = ['CurrencySell','avgIBR', 'OrdRate_IBR', 'wavgMatchRate_IBR', 'AmountOrd','TimeStampPlaced']

LinearReg_model = gl.linear_regression.create(train_set, target='Time_diff',
                                              features=predictors, validation_set=test_set);

In [444]:
# 2) RandomForest Regression
RFReg_model = gl.random_forest_regression.create(train_set, target='Time_diff', max_iterations=500, max_depth=3,
                                              features=predictors, validation_set=test_set);

In [445]:
# 3) Boosted Trees
BTReg_model = gl.boosted_trees_regression.create(train_set, target='Time_diff', max_iterations=500, max_depth=3,
                                              features=predictors, validation_set=test_set)

### Results
As the output below shows, the best model among the regressors is **Boosted Trees**, with an **RMSE** of about 1001 (*TimeStampMatched = 1001 ~ 2 days*), compared with about 2611 for Random Forest and about 4823 for Linear Regression.

In [454]:

print '--------------------------------'
print 'Results for Regression models: '
print '--------------------------------'
print 'Linear Regression: ', LinearReg_model.evaluate(test_set,metric='auto')
#LinearReg_model.show(view='Evaluation')

print '--------------------------------'
print 'Results for Regression models: '
print '--------------------------------'
print 'RandomForest : ', RFReg_model.evaluate(test_set,metric='auto')
#RFReg_model.show(view='Evaluation')


print '--------------------------------'
print 'Results for Regression models: '
print '--------------------------------'
print 'Boosted Trees: ', BTReg_model.evaluate(test_set,metric='auto')
#BTReg_model.show(view='Evaluation')

--------------------------------
Results for Regression models: 
--------------------------------
Linear Regression:  {'max_error': 15842.107631492778, 'rmse': 4822.789059864433}
--------------------------------
Results for Regression models: 
--------------------------------
RandomForest :  {'max_error': 7363.248046875, 'rmse': 2610.561045707597}
--------------------------------
Results for Regression models: 
--------------------------------
Boosted Trees:  {'max_error': 6568.65966796875, 'rmse': 1001.096981365979}
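
The two metrics reported above can be reproduced by hand. The sketch below computes the root mean squared error and the maximum absolute error for a handful of observations; the `rmse` and `max_error` helpers are written here for illustration, and the `predicted` values are made up rather than taken from any of the models.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: sqrt of the mean squared residual."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / float(len(squared)))

def max_error(actual, predicted):
    """Largest absolute residual over all observations."""
    return max(abs(a - p) for a, p in zip(actual, predicted))

# Time_diff values from the first rows of the trade log shown earlier
actual = [970, 14811, 4215, 7681]
# Hypothetical predictions, for illustration only
predicted = [1000, 14000, 4500, 7000]

print("rmse: %.2f" % rmse(actual, predicted))
print("max_error: %d" % max_error(actual, predicted))
```

RMSE penalises large residuals more heavily than small ones, while `max_error` reports only the single worst prediction, which is why the two numbers can rank models differently.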


### ii. Classification methods
If we reformulate the question as *"Will a matching event occur in day1 or day2 or day3 ...?"*, the problem becomes a binary classification problem. For this, we will use the logical variable **Match_day** from the **day1, day2, day3** data sets as our target variable.
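
To make the reformulation concrete, the sketch below shows one possible way to turn a continuous time-to-match into a binary label per day. This is an assumption for illustration, not the simulator's actual rule: the `matched_by_day` helper, the "matched by the end of day X" convention, and the `TICKS_PER_DAY` value are all hypothetical.

```python
def matched_by_day(time_diff, day, ticks_per_day):
    """Return 1 if the match happened by the end of `day`, else 0."""
    return 1 if time_diff <= day * ticks_per_day else 0

# Time_diff values from the first rows of the trade log shown earlier
time_diffs = [970, 14811, 4215, 7681, 4]

# Hypothetical length of one simulated day, in the same units as Time_diff
TICKS_PER_DAY = 5000

labels_day1 = [matched_by_day(t, 1, TICKS_PER_DAY) for t in time_diffs]
labels_day2 = [matched_by_day(t, 2, TICKS_PER_DAY) for t in time_diffs]

print(labels_day1)
print(labels_day2)
```

With labels of this kind, each day gets its own 0/1 target, and a separate classifier can be trained per day.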

In [465]:
# Splitting the data into train and test sets
train_set, test_set = data_day2.random_split(.8,seed=0)

gl.canvas.set_target('ipynb')

predictors = ['CurrencySell','avgIBR', 'OrdRate_IBR', 'wavgMatchRate_IBR', 'AmountOrd','TimeStampPlaced']

max_it = 50

# 1) Logistic Regression
LogReg_model = gl.logistic_classifier.create(train_set, target='Match_day',
                                              features=predictors, max_iterations=max_it,
                                              feature_rescaling=True,
                                              verbose=True, validation_set=test_set,
                                              l2_penalty=0);
# 2) RandomForest Classifier
RFClass_model = gl.random_forest_classifier.create(train_set, target='Match_day',
                                                   max_iterations=500, max_depth=1,
                                                  features=predictors,
                                                   row_subsample=0.5,column_subsample=0.5,
                                                  validation_set=test_set,random_seed=0,
                                                  )
# 3) Boosted Trees
BTClass_model = gl.boosted_trees_classifier.create(train_set, target='Match_day',
                                                   max_iterations=500,# max_depth=5,
                                                  features=predictors,
                                                   row_subsample=0.5,column_subsample=0.5,
                                                  validation_set=test_set,random_seed=0,
                                                  )

### Results
Next we show the metrics for each model and discuss the results.
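
For reference, the most common classification metrics can be derived from the counts of correct and incorrect predictions. The following is a minimal standard-library sketch; the `classification_metrics` helper and the `predicted` labels are illustrative assumptions and do not come from the models in this notebook.

```python
def classification_metrics(actual, predicted, positive=1):
    """Compute accuracy, precision and recall for a binary problem."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    return {
        'accuracy': (tp + tn) / float(len(pairs)),
        # Guard against division by zero when a class is never predicted/present
        'precision': tp / float(tp + fp) if (tp + fp) else 0.0,
        'recall': tp / float(tp + fn) if (tp + fn) else 0.0,
    }

# Match_day values from the day1 sample shown earlier
actual = [0, 1, 0, 0, 1, 0, 0, 1, 0, 0]
# Hypothetical model predictions, for illustration only
predicted = [0, 1, 0, 1, 1, 0, 0, 0, 0, 0]

metrics = classification_metrics(actual, predicted)
print(metrics)
```

On a balanced data set such as the ones used here, accuracy is a reasonable headline number, but precision and recall still show whether the errors are mostly false matches or missed matches.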


In [466]:
print 'Logistic Regression '
print '--------------------'
LogReg_model.evaluate(test_set, metric='auto')
LogReg_model.show(view='Evaluation')

print 'RandomForest        '
print '--------------------'
RFClass_model.evaluate(test_set, metric='auto')
RFClass_model.show(view='Evaluation')

print 'Boosted Trees        '
print '--------------------'
BTClass_model.evaluate(test_set, metric='auto')
BTClass_model.show(view='Evaluation')

Logistic Regression 
--------------------


RandomForest        
--------------------


Boosted Trees        
--------------------
