## Chapter 9 : Hyper-Parameter Tuning with Cross-Validation

## Introduction

Hyper-parameter tuning is an essential step in fitting machine learning algorithms. The tuning process may look no different in finance than elsewhere, but if it is not done properly the algorithm is likely to overfit and perform poorly. Because models in finance are particularly prone to overfitting, this notebook works through the key points raised in the chapter; the main takeaways are summarized at the end.

In [9]:
# Importing packages
import warnings
import random
import time

import numpy as np
import pandas as pd
from scipy.stats import rv_continuous,kstest
from sklearn.metrics import *
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.ensemble import BaggingClassifier

# Importing MlFinLab tools
from mlfinlab.cross_validation.cross_validation import PurgedKFold

# Setting seeds and silencing some warnings
random.seed(0)
np.random.seed(0)  # scikit-learn and scipy draw from NumPy's global random state by default
warnings.filterwarnings('ignore')

### Question 9.1

Using the function getTestData from Chapter 8, form a synthetic dataset of 10,000 observations with 10 features, where 5 are informative and 5 are noise.

In [10]:
# This is the function used in Chapter 8

def get_test_data(n_features=40, n_informative=10, n_redundant=10, n_samples=10000):
    """
    Generate a random dataset for a classification problem.
    """
 
    trnsX, cont = make_classification(n_samples=n_samples, n_features=n_features, 
                                      n_informative=n_informative, n_redundant=n_redundant, 
                                      random_state=0, shuffle=False) 
    df0 = pd.date_range(periods=n_samples, freq=pd.tseries.offsets.Minute(), end=pd.to_datetime(pd.Timestamp('today').date()).round('S'))
    trnsX = pd.DataFrame(trnsX, index=df0)
    cont = pd.Series(cont, index=df0).to_frame('bin')
    df0 = ['I_%s' % i for i in range(n_informative)] + ['R_%s' % i for i in range(n_redundant)]
    df0 += ['N_%s' % i for i in range(n_features - len(df0))]
    trnsX.columns = df0
    cont['w'] = 1.0 / cont.shape[0]
    cont['t1'] = pd.Series(cont.index, index=cont.index)

    return trnsX, cont

X, cont = get_test_data(n_features=10, n_informative=5, n_redundant=0, n_samples=10000)

In [11]:
X.head(3)

| | I_0 | I_1 | I_2 | I_3 | I_4 | N_0 | N_1 | N_2 | N_3 | N_4 |
|---|---|---|---|---|---|---|---|---|---|---|
| 2023-06-13 01:21:00 | 2.105359 | 2.861661 | 0.104159 | 0.686149 | 1.369429 | -0.868903 | -1.297125 | -0.160205 | -0.481024 | 0.841338 |
| 2023-06-13 01:22:00 | -0.330754 | 1.464379 | -1.405119 | 0.396713 | -1.722305 | 0.471952 | -1.443687 | -0.433773 | 0.123114 | -0.10297 |
| 2023-06-13 01:23:00 | -0.461334 | -0.160432 | -2.169501 | -0.137535 | 0.398229 | -0.278979 | -1.860566 | 0.90954 | -0.396742 | 2.455228 |


**(a)**
Use ```GridSearchCV``` on 10-fold CV to find the ```C, gamma``` optimal hyperparameters on a SVC with RBF kernel, where ```param_grid={'C':[1E-2,1E-1,1,10,100],'gamma':[1E-2,1E-1,1,10,100]}``` and the scoring function is ```neg_log_loss```.

In [14]:
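# Settings
# Note: the exercise asks for 10-fold CV; n_splits=2 is used here, and the outputs below
# come from that setting. Set n_splits=10 to follow the exercise to the letter.
cv_gen = PurgedKFold(n_splits=2, samples_info_sets=cont['t1'])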
param={'C':[1e-2,1e-1,1,10,100], 'gamma':[1e-2,1e-1,1,10,100]}
est = SVC(kernel='rbf', probability=True)
scoring = 'neg_log_loss'

# GridSearchCV
gs_cv = GridSearchCV(estimator=est, param_grid=param, cv=cv_gen, scoring=scoring, n_jobs=-1)

# Run grid search
start = time.time()
pipe_gs = gs_cv.fit(X, cont['bin'], sample_weight=cont['w'])
end = time.time()
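
`PurgedKFold` is used as the CV generator to keep information from the test fold out of the training fold. The idea can be sketched in plain Python, assuming integer time steps and a hypothetical helper named `purged_train_indices`. This is a simplified illustration of purging without an embargo, not mlfinlab's implementation:

```python
def purged_train_indices(t1, test_start, test_end):
    """
    Observation i starts at time i and its label is resolved at time t1[i].
    The test set is the contiguous block of positions [test_start, test_end).
    Returns the training positions whose label interval does not overlap the test block.
    """
    last_test_label = max(t1[test_start:test_end])
    train = []
    for i, label_end in enumerate(t1):
        if test_start <= i < test_end:
            continue  # belongs to the test set
        if i < test_start and label_end >= test_start:
            continue  # label is resolved inside (or after) the test block
        if i >= test_end and i <= last_test_label:
            continue  # starts before all test labels are resolved
        train.append(i)
    return train

# Hypothetical toy data: 10 observations whose labels look 2 steps ahead.
t1_overlap = [min(i + 2, 9) for i in range(10)]
train_overlap = purged_train_indices(t1_overlap, test_start=4, test_end=6)
print(train_overlap)  # [0, 1, 8, 9]

# Instantaneous labels: t1[i] == i, so nothing overlaps and nothing is purged.
t1_instant = list(range(10))
train_instant = purged_train_indices(t1_instant, test_start=4, test_end=6)
print(train_instant)  # [0, 1, 2, 3, 6, 7, 8, 9]
```

The synthetic dataset above sets `t1` to each observation's own timestamp, which corresponds to the second case in this sketch: labels are instantaneous, so the sketch removes nothing.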

**(b)**
How many nodes are there in the grid?

In [15]:
print('The number of nodes in the grid is', len(param['C'])*len(param['gamma']))

The number of nodes in the grid is 25


**(c)**
How many fits did it take to find the optimal solution?

In [16]:
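# GridSearchCV is exhaustive: every node is fitted once per CV split, followed by one final refit.
# Note: best_index_ is only the position of the best candidate in cv_results_, not a fit count.
n_fits = len(pipe_gs.cv_results_['params']) * pipe_gs.n_splits_
print(n_fits, 'CV fits (plus 1 refit) taken to find the optimal solution')

50 CV fits (plus 1 refit) taken to find the optimal solution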


**(d)**
How long did it take to find this solution?

In [17]:
print(f'Time taken to find the solution is {end-start} secs')

Time taken to find the solution is 79.12290120124817 secs


**(e)**
How can you access the optimal result?

In [18]:
# The optimal result can be accessed in the following way
pipe_gs.best_estimator_

**(f)**
What is the CV score of the optimal parameter combination?

In [19]:
print(f'The best CV score for GridSearchCV is {abs(pipe_gs.best_score_)} log_loss')

The best CV score for GridSearchCV is 0.6901030495308973 log_loss
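
To put this score in context, it helps to compare it with an uninformative classifier. The sketch below uses only the standard library and a hypothetical helper named `log_loss_binary` (the labels are made up for illustration). It computes the binary log loss of a model that always predicts a probability of 0.5, which equals ln 2 ≈ 0.6931 whatever the labels are:

```python
import math

def log_loss_binary(y_true, p_pred):
    """Mean binary cross-entropy; p_pred holds the predicted P(y = 1)."""
    losses = [-(y * math.log(p) + (1 - y) * math.log(1 - p))
              for y, p in zip(y_true, p_pred)]
    return sum(losses) / len(losses)

# Illustrative labels (not the notebook's data); any 0/1 labels give the same result.
y_demo = [0, 1, 1, 0, 1]
baseline = log_loss_binary(y_demo, [0.5] * len(y_demo))
print(baseline, math.log(2))
```

A CV log loss of about 0.69 is therefore only marginally better than this baseline.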


**(g)**
How can you pass sample weights to the SVC?

In [20]:
# Sample weights can be passed in following way 
# First we get the best estimator (SVC)
best_svc = pipe_gs.best_estimator_

# Then we fit the SVC
best_svc.fit(X, cont['bin'], sample_weight=cont['w'])
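
When the SVC is one step of a scikit-learn `Pipeline` rather than a bare estimator, the weights have to be addressed to that step by prefixing the argument with the step name. A minimal sketch, assuming a hypothetical two-step pipeline with steps named `scale` and `svc`, and a small synthetic dataset with made-up weights (none of these are part of the notebook above):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Small illustrative dataset and arbitrary positive weights.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
w_demo = np.linspace(0.5, 1.5, len(y_demo))

pipe = Pipeline([('scale', StandardScaler()), ('svc', SVC(kernel='rbf'))])

# '<step name>__sample_weight' passes the weights to that step's fit method.
pipe.fit(X_demo, y_demo, svc__sample_weight=w_demo)
pred_demo = pipe.predict(X_demo)
print(pred_demo.shape)
```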

### Question 9.2 
Using the same dataset from exercise 1

**(a)** Use ```RandomizedSearchCV``` on 10-fold CV to find the ```C, gamma``` optimal hyper-parameters on an SVC with RBF kernel, where ```param_distributions={'C':logUniform(a=1E-2,b=1E2), 'gamma':logUniform(a=1E-2, b=1E2)}, n_iter=25``` and ```neg_log_loss``` is the scoring function.

In [21]:
# Code Snippet 9.4 - The logUniform_gen class

class logUniform_gen(rv_continuous):
    """
    Log-uniform random variable on [a, b]; with the default bounds set
    in logUniform, values lie between 1 and e.
    """
    
    def _cdf(self,x):
        
        return np.log(x/self.a)/np.log(self.b/self.a)
    
def logUniform(a=1, b=np.exp(1)):
    """
    Create a log-uniform random variable on [a, b], i.e. one whose
    logarithm is uniformly distributed between log(a) and log(b).
    """

    return logUniform_gen(a=a,b=b,name='logUniform')
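
The `_cdf` method in `logUniform_gen` follows from the definition of a log-uniform variable. As a short derivation (the symbols $X$ and $u$ are introduced here only for illustration): if $\log X$ is uniformly distributed on $[\log a, \log b]$ with $0 < a < b$, then

```latex
F(x) = \mathrm{P}[X \le x]
     = \frac{\log x - \log a}{\log b - \log a}
     = \frac{\log(x/a)}{\log(b/a)}, \qquad a \le x \le b,

f(x) = \frac{dF}{dx} = \frac{1}{x \, \log(b/a)}, \qquad
F^{-1}(u) = a \left(\frac{b}{a}\right)^{u}, \quad u \in [0, 1].
```

The density is proportional to $1/x$, which is what spreads the draws evenly across orders of magnitude.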

In [23]:
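# Setting up RandomizedSearchCV
# Note: the exercise asks for 10-fold CV; n_splits=2 is used here, and the outputs below
# come from that setting. Set n_splits=10 to follow the exercise to the letter.
cv_gen = PurgedKFold(n_splits=2, samples_info_sets=cont['t1'])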
param_dist = {'C':logUniform(a=1e-2, b=1e2), 'gamma':logUniform(a=1e-2, b=1e2)}
n_iter = 25
est = SVC(kernel='rbf', probability=True)
scoring = 'neg_log_loss'

rs_cv = RandomizedSearchCV(estimator=est, param_distributions=param_dist, n_iter=n_iter, cv=cv_gen, 
                           scoring=scoring, n_jobs=-1)
# Run RandomizedSearchCV
start = time.time()
pipe_rs = rs_cv.fit(X, cont['bin'], sample_weight=cont['w'])
end = time.time()

**(b)**
How long did it take to find this solution?

In [24]:
print(f'Time taken to find the solution is {end-start} secs')

Time taken to find the solution is 83.5537736415863 secs


**(c)**
Is the optimal parameter combination similar to the one found in exercise 1?

In [25]:
pipe_rs.best_estimator_

The optimal parameters ```{C, gamma}``` obtained from RandomizedSearchCV are *different* from those found by GridSearchCV.

**(d)**
What is the CV score of the optimal parameter combination? How does it
compare to the CV score from exercise 1?

In [26]:
print(f'The best CV score for RandomSearchCV is {abs(pipe_rs.best_score_)} log_loss')

The best CV score for RandomSearchCV is 0.6930918601323244 log_loss


RandomizedSearchCV gives a slightly *higher* log loss than GridSearchCV (0.6931 vs. 0.6901), i.e. marginally worse performance in this run.

### Question 9.3
From exercise 1,

**(a)** Compute the Sharpe ratio of the resulting in-sample forecasts, from point 1.a
(see Chapter 14 for a definition of Sharpe ratio).

In [27]:
def sharpe_ratio(y_true : np.array, y_pred : np.array) -> float:
    """
    Compute a Sharpe ratio from model predictions, using equal-sized bets:
    - prediction 1 and label 1: the bet wins, return +1;
    - prediction 1 and label 0: the bet loses, return -1;
    - prediction 0: no bet is taken, so the observation is skipped
      (it is not recorded as a 0 return).
    The ratio is the mean of the recorded returns divided by their standard deviation.
    """

    if len(y_true) != len(y_pred):
        raise ValueError('y_true and y_pred must have the same length')

    returns = []

    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            returns.append(1)

        elif t == 0 and p == 1:
            returns.append(-1)

    return np.mean(returns)/np.std(returns)
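
As a small worked example of the convention used by `sharpe_ratio`, where only observations with a prediction of 1 contribute a return, consider four made-up labels and predictions. The function `sharpe_of_bets` is a hypothetical standard-library re-implementation written for this illustration, not the notebook's function:

```python
from statistics import mean, pstdev

def sharpe_of_bets(y_true, y_pred):
    """Mean over population std of the +1/-1 returns of the bets taken (prediction == 1)."""
    returns = [1 if t == 1 else -1 for t, p in zip(y_true, y_pred) if p == 1]
    return mean(returns) / pstdev(returns)

y_true_demo = [1, 1, 0, 1]
y_pred_demo = [1, 1, 1, 0]  # three bets taken: two hits, one miss; the last observation is skipped
sr_demo = sharpe_of_bets(y_true_demo, y_pred_demo)
print(round(sr_demo, 4))  # 0.3536
```

The recorded returns are `[1, 1, -1]`, with mean 1/3 and population standard deviation sqrt(8/9), so the ratio is 1/sqrt(8) ≈ 0.3536.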

In [28]:
# Getting the best SVC obtained from GridSearchCV with neg_log_loss scoring
gs_svc = pipe_gs.best_estimator_

# Then we fit the SVC
gs_svc.fit(X.values, cont['bin'].values, sample_weight=cont['w'])

# Getting the in-sample predictions
pred = gs_svc.predict(X.values)

In [29]:
SR = sharpe_ratio( cont['bin'].values, pred)
print(f'Sharpe ratio of the resulting in-sample forecasts is {SR}')

Sharpe ratio of the resulting in-sample forecasts is 0.322747512496089


**(b)** Repeat point 1.a, this time with accuracy as the scoring function. Compute
the in-sample forecasts derived from the hyper-tuned parameters.

In [30]:
scoring = 'accuracy'

# GridSearchCV with accuracy scoring
gs_cv_acc = GridSearchCV(estimator=est, param_grid=param, cv=cv_gen, scoring=scoring, n_jobs=-1)
pipe_gs_acc = gs_cv_acc.fit(X, cont['bin'], sample_weight=cont['w'])

In [31]:
# Getting the best SVC obtained from GridSearchCV with accuracy scoring
gs_svc_acc = pipe_gs_acc.best_estimator_

# Then we fit the SVC
gs_svc_acc.fit(X.values, cont['bin'].values, sample_weight=cont['w'])

# Getting the in-sample predictions
pred = gs_svc_acc.predict(X.values)

In [32]:
SR = sharpe_ratio( cont['bin'].values, pred)
print(f'Sharpe ratio of the resulting in-sample forecasts is {SR}')

Sharpe ratio of the resulting in-sample forecasts is 0.8386441111526416


**(c)** What scoring method leads to higher (in-sample) Sharpe ratio?

The accuracy scoring leads to a higher in-sample Sharpe ratio (0.8386 vs. 0.3227), given that the sizes of all bets are equal (regardless of the forecast confidence).

### Question 9.4
From exercise 2,

**(a)**
Compute the Sharpe ratio of the resulting in-sample forecasts, from point 2.a.

In [33]:
# Getting the best SVC obtained from RandomizedSearchCV with neg_log_loss scoring
rs_svc = pipe_rs.best_estimator_

# Then we fit the SVC
rs_svc.fit(X.values, cont['bin'].values, sample_weight=cont['w'])

# Getting the in-sample predictions
pred = rs_svc.predict(X.values)

In [34]:
SR = sharpe_ratio( cont['bin'].values, pred)
print(f'Sharpe ratio of the resulting in-sample forecasts is {SR}')

Sharpe ratio of the resulting in-sample forecasts is 0.0004000000320000038


**(b)** Repeat point 2.a, this time with accuracy as the scoring function. Compute
the in-sample forecasts derived from the hyper-tuned parameters.

In [35]:
scoring = 'accuracy'

rs_cv_acc = RandomizedSearchCV(estimator=est, param_distributions=param_dist, n_iter=n_iter, cv=cv_gen, 
                               scoring=scoring, n_jobs=-1)
pipe_rs_acc = rs_cv_acc.fit(X, cont['bin'], sample_weight=cont['w'])

In [36]:
# Getting the best SVC obtained from RandomizedSearchCV with accuracy scoring
rs_svc_acc = pipe_rs_acc.best_estimator_

# Then we fit the SVC
rs_svc_acc.fit(X.values, cont['bin'].values, sample_weight=cont['w'])

# Getting the in-sample predictions
pred = rs_svc_acc.predict(X.values)

In [37]:
SR_acc = sharpe_ratio( cont['bin'].values, pred)
print(f'Sharpe ratio of the resulting in-sample forecasts is {SR_acc}')

Sharpe ratio of the resulting in-sample forecasts is 0.7694564317467201


**(c)** What scoring method leads to higher (in-sample) Sharpe ratio?

Accuracy scoring produces the higher in-sample Sharpe ratio (0.7695 vs. 0.0004 for neg_log_loss), given that all bet sizes are equal regardless of confidence (probability).

### Question 9.5
Read the definition of log loss, ```L [Y, P]```.

**(a)** Why is the scoring function ```neg_log_loss``` defined as the negative log loss, ```−L [Y, P]```?

scikit-learn's search utilities follow the convention that a higher score is better, so they always *maximize* the scoring function (as they do with accuracy). Log loss is a loss, where lower is better; flipping its sign turns it into a score, so maximizing the negative log loss is the same as minimizing the log loss. We prefer negative log loss to accuracy for hyper-parameter optimization here because we are tuning the model of an investment strategy *(see Section 9.4, page 134 of AFML, for more details).*

**(b)** What would be the outcome of maximizing the log loss, rather than the negative
log loss?

This would give the opposite of what we want. Searching for the parameters that produce the highest possible log loss selects the worst combination of parameters. (For background on log loss, see https://www.kaggle.com/dansbecker/what-is-log-loss.)
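
The sign convention can be illustrated with a few made-up numbers. In the sketch below the three CV log-loss values and their labels are hypothetical. Selecting the maximum of the negated loss returns the candidate with the lowest loss, whereas selecting the maximum of the loss itself returns the worst one:

```python
# Hypothetical mean CV log losses for three candidate settings (illustrative numbers only).
cv_log_loss = {'C=0.1': 0.71, 'C=1': 0.64, 'C=10': 0.69}

# A search that keeps the highest score must be fed -L so that it keeps the lowest L.
scores = {name: -loss for name, loss in cv_log_loss.items()}
best = max(scores, key=scores.get)

# Maximizing L itself would keep the worst candidate instead.
worst = max(cv_log_loss, key=cv_log_loss.get)
print(best, worst)  # C=1 C=0.1
```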

### Question 9.6
Consider an investment strategy that sizes its bets equally, regardless of the forecast’s
confidence. In this case, what is a more appropriate scoring function for
hyper-parameter tuning, accuracy or cross-entropy loss?

Log loss, also known as cross-entropy loss, takes the confidence of each prediction into account when scoring it. A classifier may issue a low-confidence prediction that turns out to be a hit, and a high-confidence prediction that turns out to be a miss. Under log loss the confident miss is penalized far more heavily than the hesitant hit is rewarded, so the two do not offset each other *(see page 134 of AFML)*. <br>
Accuracy, on the other hand, ignores the confidence of the predictions: a miss made with high probability is offset exactly by a hit made with low probability. Since this investment strategy sizes all bets equally and does not use the confidence of a prediction, **accuracy** is the more appropriate scoring function.
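
The difference between the two scores can be seen on a toy case. The sketch below uses made-up labels and probabilities, with hypothetical helpers `accuracy` and `log_loss_binary` written with the standard library only. Two classifiers make exactly the same hard predictions, and hence the same equal-sized bets, but one of them makes its single miss with very high confidence:

```python
import math

def accuracy(y_true, p_pred):
    """Share of observations where the 0.5-thresholded prediction matches the label."""
    hits = [int((p >= 0.5) == (y == 1)) for y, p in zip(y_true, p_pred)]
    return sum(hits) / len(hits)

def log_loss_binary(y_true, p_pred):
    """Mean binary cross-entropy; p_pred holds the predicted P(y = 1)."""
    return -sum(math.log(p if y == 1 else 1 - p)
                for y, p in zip(y_true, p_pred)) / len(y_true)

y_toy = [1, 1, 0, 0]
p_a = [0.6, 0.6, 0.4, 0.6]    # one miss, made with low confidence
p_b = [0.6, 0.6, 0.4, 0.99]   # the same miss, made with high confidence

print(accuracy(y_toy, p_a), accuracy(y_toy, p_b))                # 0.75 0.75
print(log_loss_binary(y_toy, p_a) < log_loss_binary(y_toy, p_b))  # True
```

With equal bet sizes both classifiers earn the same profit, which accuracy reflects and log loss does not.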

## Conclusion

**Key Takeaways:**

* Use the PurgedKFold class as the CV generator while tuning, to prevent the ML estimator from overfitting to leaked information. (9.2; 2nd paragraph)
* Use scoring='f1' in the context of meta-labeling applications. (9.2; 3rd paragraph)
* Use neg_log_loss when tuning hyper-parameters for an investment strategy, as it accounts for the probability attached to each hit and miss, which accuracy scoring ignores. (9.4; 1st paragraph)
* Sampling from a uniform distribution would be inefficient; e.g. if we sample a parameter from a uniform distribution ```U[0,100]```, 99% of the values would be expected to be greater than 1, which is not an effective way of exploring the feasibility region of the parameters. Hence the author suggests using a log-uniform distribution. (9.3.1)
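
The last point can be checked with a quick simulation. The sketch below is a minimal illustration using only the standard library; the sample size, the seed, and the bounds `a = 1e-2`, `b = 1e2` (taken from the exercises above) are choices made for this example. It compares the share of draws above 1 under a uniform and under a log-uniform distribution:

```python
import math
import random

rng = random.Random(0)
n = 100_000
a, b = 1e-2, 1e2

uniform_draws = [rng.uniform(0, 100) for _ in range(n)]
# Log-uniform draw: uniform in log space, then exponentiate.
log_uniform_draws = [math.exp(rng.uniform(math.log(a), math.log(b))) for _ in range(n)]

share_uniform = sum(x > 1 for x in uniform_draws) / n
share_log_uniform = sum(x > 1 for x in log_uniform_draws) / n
print(share_uniform, share_log_uniform)
```

The first share should come out close to 0.99 and the second close to 0.5, since 1 is the midpoint of [1e-2, 1e2] on a log scale.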

## References

- López de Prado, M. (2018). Advances in Financial Machine Learning, Chapter 9. Wiley.