# Block 6 Exercise 2: finding the best parameters for predicting the fare of taxi rides
We return to our Random Forest regression and now want to optimize its free parameters automatically.

In [15]:
import pandas as pd
import numpy as np
import folium

from skopt import BayesSearchCV
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn import svm, metrics
from sklearn.linear_model import LinearRegression

In [2]:
# we load the data we have saved after wrangling and pre-processing in block I
X=pd.read_csv('../../DATA/train_cleaned.csv')
drop_columns=['Unnamed: 0','Unnamed: 0.1','Unnamed: 0.1.1','key','pickup_datetime','pickup_date','pickup_latitude_round3','pickup_longitude_round3','dropoff_latitude_round3','dropoff_longitude_round3']
X=X.drop(drop_columns,axis=1)
X=pd.get_dummies(X)# one hot coding
#generate labels
y=X['fare_amount']
X=X.drop(['fare_amount'],axis=1)

### Scikit Optimize
Scikit Optimize (https://scikit-optimize.github.io/stable/index.html) is an AutoML toolbox built around Scikit-Learn. It lets us run automatic hyper-parameter optimization on top of our learning algorithms.



In [7]:
# install (the package name on PyPI is hyphenated)
!pip install scikit-optimize

zsh:1: command not found: pip


### E 2.1 Bayesian Optimization of a Random Forest Regression Model
Use Bayesian Optimization with Cross-Validation (https://scikit-optimize.github.io/stable/modules/generated/skopt.BayesSearchCV.html#skopt.BayesSearchCV) to find the best regression model. Compare:
* linear regression (https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression) 
* Random Forest regression (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor)
* and SVM regression (https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html#sklearn.svm.SVR)

NOTES: this can become quite compute intensive! Hence,
* use a smaller subset of the training data to run the experiments 
* think about the range of your parameters (e.g. a large number of trees in RF or high C values in SVM will make models expensive)
* optimize only the following parameters per model type:
    * linear: no parameters to optimize
    * RF: number of trees (`n_estimators`) and depth (`max_depth`)
    * SVM: `C` and `gamma` (use the RBF kernel)
* parallelize by setting `n_jobs`
* use Colab to run the job for up to 12h
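
The notes above can be tried on synthetic data before touching the taxi set. The snippet below is a minimal sketch, not the exercise solution: `make_regression` stands in for the taxi features, the helper name `baseline_scores`, the subset size of 200 rows and the fixed RF settings are assumptions, and plain cross-validation replaces the Bayesian search so that it runs in seconds.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR


def baseline_scores(X, y, n_subset=200, n_jobs=1, seed=0):
    """Mean 3-fold CV R^2 of the three model families on a random subset.

    Hypothetical helper: X and y are assumed to be NumPy arrays.
    """
    rng = np.random.default_rng(seed)
    n_rows = min(n_subset, len(X))
    # Draw a small random subset to keep the experiment cheap
    idx = rng.choice(len(X), size=n_rows, replace=False)
    X_small, y_small = X[idx], y[idx]

    models = {
        "linear": LinearRegression(),
        # Few, shallow trees: the cost grows with n_estimators and max_depth
        "random_forest": RandomForestRegressor(
            n_estimators=50, max_depth=4, random_state=seed
        ),
        # SVR is scale-sensitive, so standardize the features first
        "svr_rbf": make_pipeline(StandardScaler(), SVR(kernel="rbf")),
    }
    return {
        name: float(
            cross_val_score(model, X_small, y_small, cv=3, n_jobs=n_jobs).mean()
        )
        for name, model in models.items()
    }


# Synthetic stand-in for the taxi data (assumption, not the real features)
X_demo, y_demo = make_regression(
    n_samples=1000, n_features=8, noise=10.0, random_state=0
)
scores = baseline_scores(X_demo, y_demo)
for name, score in scores.items():
    print(f"{name}: mean CV R^2 = {score:.3f}")
```

Passing `n_jobs=-1` would spread the cross-validation folds over all available cores; the default of 1 here only keeps the sketch predictable on small machines.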


## Linear Regression

In [11]:
# We separate the data into small train and test subsets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=100,test_size = 100)

In [37]:
reg = LinearRegression().fit(X_train, y_train)

# score() returns R^2 on the held-out test split; the fitted model can also be saved or used for predictions
print("The test score of the Linear Regression is : ",reg.score(X_test, y_test))

The test score of the Linear Regression is :  0.7963767253513719
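
For regressors, `score` reports the coefficient of determination R^2, which is 1 minus the ratio of the residual sum of squares to the total sum of squares. The following is a small sketch of that formula with made-up numbers (the arrays and the helper name `r2_by_hand` are illustrative, not taxi data):

```python
import numpy as np


def r2_by_hand(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # variation around the mean
    return 1.0 - ss_res / ss_tot


y_true = [10.0, 12.0, 8.0, 15.0]
y_pred = [11.0, 11.0, 9.0, 14.0]
r2 = r2_by_hand(y_true, y_pred)
print(r2)

# A model that always predicts the mean fare scores exactly 0
r2_mean = r2_by_hand(y_true, [np.mean(y_true)] * len(y_true))
print(r2_mean)
```

A score of 1 means perfect predictions, 0 means no better than predicting the mean, and negative values are possible. Because `train_test_split` is called without `random_state` and with only 100 test rows, the reported score will change from run to run.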


## Random Forest Regression

In [38]:
# We separate the data again (this draws a new random split)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=100,test_size = 100)

In [39]:
from sklearn.ensemble import RandomForestRegressor

In [42]:
%%time

regr = RandomForestRegressor()

param_grid = {
    'max_depth': (2,8),
    'n_estimators': (100,200)
}

# set up our optimiser to find the best params in 40 searches
opt = BayesSearchCV(
    regr,
    param_grid,
    n_iter=40,
)

opt.fit(X_train, y_train)

print('Best params achieve a test score of', opt.score(X_test, y_test), ':')

opt.best_params_



Best params achieve a test score of 0.5656799364852287 :
CPU times: user 1min 48s, sys: 1.72 s, total: 1min 50s
Wall time: 49.1 s


OrderedDict([('max_depth', 5), ('n_estimators', 100)])

## SVM Regression

In [45]:
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

In [48]:
# Without optimization (default C and gamma)
regr = make_pipeline(StandardScaler(), SVR(kernel = 'rbf'))
regr.fit(X_train, y_train)

print('The score of the SVR regression without optimized parameters is :',regr.score(X_test,y_test))

The score of the SVR regression without optimized parameters is : 0.22124446339891313


In [61]:
%%time
# With optimized parameters


regr = make_pipeline(StandardScaler(), SVR(kernel = 'rbf'))

svr__C = (1,5)
svr__gamma = ["auto","scale"]

parameters = {'svr__C': svr__C,'svr__gamma': svr__gamma}

CPU times: user 266 µs, sys: 79 µs, total: 345 µs
Wall time: 353 µs


In [62]:
regr.get_params().keys()

dict_keys(['memory', 'steps', 'verbose', 'standardscaler', 'svr', 'standardscaler__copy', 'standardscaler__with_mean', 'standardscaler__with_std', 'svr__C', 'svr__cache_size', 'svr__coef0', 'svr__degree', 'svr__epsilon', 'svr__gamma', 'svr__kernel', 'svr__max_iter', 'svr__shrinking', 'svr__tol', 'svr__verbose'])

In [63]:
# set up our optimiser to find the best params in 40 searches
opt = BayesSearchCV(
    regr,
    parameters,
    n_iter=40,
    n_jobs=-2
)

opt.fit(X_train, y_train)

print('Best params achieve a test score of', opt.score(X_test, y_test), ':')

opt.best_params_



Best params achieve a test score of 0.49663021493516446 :


OrderedDict([('svr__C', 5), ('svr__gamma', 'auto')])
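
The search above treats `gamma` as a choice between two presets and `C` as an integer between 1 and 5, and the best `C` sits on the upper bound of its range. One possible variation, sketched here under assumptions, samples both parameters on a continuous log scale. To stay within scikit-learn, this sketch swaps Bayesian optimization for random search (`RandomizedSearchCV` with `scipy.stats.loguniform`); the ranges, the synthetic data and `n_iter=15` are illustrative choices, not tuned values.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_regression
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the taxi data (assumption, not the real features)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=6, noise=5.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

pipe = make_pipeline(StandardScaler(), SVR(kernel="rbf"))

# C and gamma span orders of magnitude, so sample them log-uniformly
param_distributions = {
    "svr__C": loguniform(1e-1, 1e3),
    "svr__gamma": loguniform(1e-4, 1e0),
}

search = RandomizedSearchCV(
    pipe,
    param_distributions,
    n_iter=15,       # number of sampled parameter settings
    cv=3,
    random_state=0,  # reproducible sampling
    n_jobs=1,
)
search.fit(X_tr, y_tr)

print("best params:", search.best_params_)
print("test R^2:", search.score(X_te, y_te))
```

With `BayesSearchCV`, the analogous search space would likely use `skopt.space.Real(low, high, prior='log-uniform')` for both parameters; that variant is not run here.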