
In [0]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [0]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import BaggingClassifier
from sklearn.svm import SVC
from sklearn import metrics

Exercises

# 9.1. Using the function getTestData from Chapter 8, form a synthetic dataset of 10,000 observations with 10 features, where 5 are informative and 5 are noise.

In [0]:
def getTestData(n_features=40, n_informative=10, n_redundant=10, n_samples=10000):
  # Generate a random dataset for a classification problem
  from sklearn.datasets import make_classification
  trnsX, cont = make_classification(n_samples=n_samples, n_features=n_features,
                                    n_informative=n_informative, n_redundant=n_redundant,
                                    random_state=0, shuffle=False)
  # Business-day index ending today. pd.date_range replaces the removed
  # pd.DatetimeIndex(periods=...) constructor and the removed pd.datetime alias.
  df0 = pd.date_range(periods=n_samples, freq=pd.tseries.offsets.BDay(), end=pd.Timestamp.today())
  trnsX, cont = pd.DataFrame(trnsX, index=df0), pd.Series(cont, index=df0).to_frame('bin')
  # Column names: I_ = informative, R_ = redundant, N_ = noise
  df0 = ['I_'+str(i) for i in range(n_informative)] + ['R_'+str(i) for i in range(n_redundant)]
  df0 += ['N_'+str(i) for i in range(n_features-len(df0))]
  trnsX.columns = df0
  cont['w'] = 1./cont.shape[0]                           # equal sample weights
  cont['t1'] = pd.Series(cont.index, index=cont.index)   # label end time equals observation time
  return trnsX, cont

In [135]:
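# 10 features: 5 informative, 0 redundant, 5 noise.
# Note: 3,000 observations are generated here, not the 10,000 in the exercise statement.
trnsX, cont = getTestData(10,5,0,3000)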



In [0]:
y = cont['bin']  # 1-D Series of labels; a one-column DataFrame triggers sklearn's column_or_1d warning
t1 = cont['t1']

## (a) Use GridSearchCV on 10-fold CV to find the C, gamma optimal hyperparameters on a SVC with RBF kernel, where param_grid={'C':[1E-2,1E-1,1,10,100],'gamma':[1E-2,1E-1,1,10,100]} and the scoring function is neg_log_loss.

Grid search cross-validation conducts an exhaustive search for the combination of parameters that maximizes the CV performance, according to some user-defined
score function.

In [7]:
#!pip install -q mlfinlab

In [0]:
from mlfinlab import cross_validation
from mlfinlab.cross_validation import PurgedKFold

We need to pass our PurgedKFold class to prevent GridSearchCV from overfitting the ML estimator to leaked information.

In [0]:
n=5  # number of folds (the exercise statement asks for 10-fold CV; 5 folds are used here)
pkf = PurgedKFold(n, t1, pct_embargo=0.01)

In [0]:
svc = SVC(kernel='rbf',probability=True)

In [0]:
param_grid = {'C':[1E-2,1E-1,1,10,100],'gamma':[1E-2,1E-1,1,10,100]}

In [0]:
# The iid argument was removed in scikit-learn 0.24, so it is no longer passed
gs0 = GridSearchCV(estimator=svc, param_grid=param_grid, cv=pkf, scoring='neg_log_loss', n_jobs=-1, return_train_score=True)

In [180]:
gsfit = gs0.fit(trnsX, y)



### About scoring:

I advise you to use scoring='f1' in the context of meta-labeling applications, for the following reason. Suppose a sample with a very large number of negative (i.e., label '0') cases. A classifier that predicts all cases to be negative will achieve high 'accuracy' or 'neg_log_loss', even though it has not learned from the features how to discriminate between cases. In fact, such a model achieves zero recall and undefined precision. The 'f1' score corrects for that performance inflation by scoring the classifier in terms of precision and recall.

For other (non-meta-labeling) applications, it is fine to use 'accuracy' or 'neg_log_loss', because we are equally interested in predicting all cases. Note that a relabeling of cases has no impact on 'accuracy' or 'neg_log_loss'; however, it will have an impact on 'f1'.

The key conceptual difference between accuracy and negative log-loss is that negative log-loss takes into account not only whether our predictions were correct or not, but the probability of those predictions as well.
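
The performance inflation described above can be illustrated with a small sketch. The labels, the 5% share of positives, and the constant predicted probability of 0.05 are made-up assumptions, not data from this notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, log_loss

# Hypothetical imbalanced sample: 950 negatives (0) and 50 positives (1)
y_true = np.array([0] * 950 + [1] * 50)

# A "classifier" that ignores the features and always predicts the negative class
y_pred = np.zeros_like(y_true)
# It reports the base rate as the probability of a positive for every case
p_pos = np.full(y_true.shape, 0.05)

acc = accuracy_score(y_true, y_pred)            # share of correct predictions
ll = log_loss(y_true, p_pos)                    # average negative log-likelihood
f1 = f1_score(y_true, y_pred, zero_division=0)  # precision is undefined here, so f1 is reported as 0

print(f"accuracy={acc:.3f}  log_loss={ll:.3f}  f1={f1:.3f}")
```

Accuracy and log loss both look respectable even though no positive case is ever identified, whereas f1 collapses to zero.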

In [0]:
gsresult = pd.DataFrame(gsfit.cv_results_)

In [182]:
gsresult.columns

Index(['mean_fit_time', 'std_fit_time', 'mean_score_time', 'std_score_time',
       'param_C', 'param_gamma', 'params', 'split0_test_score',
       'split1_test_score', 'split2_test_score', 'split3_test_score',
       'split4_test_score', 'mean_test_score', 'std_test_score',
       'rank_test_score', 'split0_train_score', 'split1_train_score',
       'split2_train_score', 'split3_train_score', 'split4_train_score',
       'mean_train_score', 'std_train_score'],
      dtype='object')

## (c)

In [183]:
len(gsresult)
# 5x5

25
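
As a cross-check on the 5x5 count, here is a minimal sketch using sklearn's ParameterGrid. It assumes the same param_grid and the n=5 folds used in this notebook. Each node is fitted once per fold, and GridSearchCV refits the best node once on the full sample when refit=True (the default).

```python
from sklearn.model_selection import ParameterGrid

param_grid = {'C': [1E-2, 1E-1, 1, 10, 100], 'gamma': [1E-2, 1E-1, 1, 10, 100]}
n_splits = 5  # number of CV folds assumed here (n=5 in this notebook)

n_nodes = len(ParameterGrid(param_grid))  # distinct (C, gamma) combinations
n_cv_fits = n_nodes * n_splits            # one fit per node and per fold
n_total_fits = n_cv_fits + 1              # plus the final refit on the full sample

print(n_nodes, n_cv_fits, n_total_fits)
```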

## (d)

In [184]:
gsresult['mean_fit_time'].sum()

45.315320062637326

In [185]:
gsresult['mean_score_time'].sum()

1.5529106616973876

## (e)

In [186]:
gsfit.best_params_

{'C': 10, 'gamma': 0.1}

## (f)

In [187]:
gsresult.sort_values('rank_test_score').head(1)[['rank_test_score','mean_test_score','mean_train_score']]

| | rank_test_score | mean_test_score | mean_train_score |
|---|---|---|---|
| 16 | 1 | -0.340678 | -0.053253 |


In [188]:
gsfit.best_score_

-0.34067817387878235

## (g)

This example nicely introduces one limitation of sklearn's Pipelines: their fit method does not expect a sample_weight argument. Instead, it expects a fit_params keyword argument. That is a bug that has been reported on GitHub; however, it may take some time to fix it, as it involves rewriting and testing much functionality. Until then, feel free to use the workaround in Snippet 9.2. It creates a new class, called MyPipeline, which inherits all methods from sklearn's Pipeline. It overwrites the inherited fit method with a new one that handles the argument sample_weight, after which it redirects to the parent class.

-------------
```
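# Snippet 9.2
from sklearn.pipeline import Pipeline

class MyPipeline(Pipeline):
  def fit(self, X, y, sample_weight=None, **fit_params):
    # Route sample_weight to the final step as '<last step name>__sample_weight'
    if sample_weight is not None:
      fit_params[self.steps[-1][0]+'__sample_weight'] = sample_weight
    return super(MyPipeline, self).fit(X, y, **fit_params)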
```
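
A possible usage of this workaround is sketched below. The toy dataset, the step names 'scaler' and 'svc', the hyper-parameter values, and the weighting scheme are illustrative assumptions. The sketch also fits a plain Pipeline with the weights passed under the key svc__sample_weight, which is the form that MyPipeline builds internally, and checks that both routes give the same predictions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

class MyPipeline(Pipeline):
    def fit(self, X, y, sample_weight=None, **fit_params):
        # Forward sample_weight to the last step of the pipeline
        if sample_weight is not None:
            fit_params[self.steps[-1][0] + '__sample_weight'] = sample_weight
        return super().fit(X, y, **fit_params)

# Toy data (assumption: any small binary classification set would do)
X, y = make_classification(n_samples=200, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)
# Illustrative weights: positives count twice as much as negatives
w = np.where(y == 1, 2.0, 1.0)

def make_steps():
    return [('scaler', StandardScaler()), ('svc', SVC(kernel='rbf', C=1.0, gamma=0.1))]

# Route 1: the workaround class accepts sample_weight directly
pipe = MyPipeline(make_steps()).fit(X, y, sample_weight=w)

# Route 2: a plain Pipeline with the step-prefixed keyword
plain = Pipeline(make_steps()).fit(X, y, svc__sample_weight=w)

same = np.array_equal(pipe.predict(X), plain.predict(X))
print(same)
```

The benefit of the subclass is that callers which only know the generic sample_weight keyword do not need to know the name of the final step.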



# 9.2.

For ML algorithms with a large number of parameters, a grid search cross-validation (CV) becomes computationally intractable. In this case, an alternative with good statistical properties is to sample each parameter from a distribution.

## (a) Use RandomizedSearchCV on 10-fold CV to find the C, gamma optimal hyper-parameters on an SVC with RBF kernel, where param_distributions={'C':logUniform(a=1E-2,b=1E2),'gamma':logUniform(a=1E-2,b=1E2)}, n_iter=25 and neg_log_loss is the scoring function.

In [0]:
from scipy.stats import rv_continuous, kstest
class logUniform_gen(rv_continuous):
  # random numbers log-uniformly distributed between a and b (defaults: 1 and e)
  def _cdf(self, x):
    return np.log(x/self.a)/np.log(self.b/self.a)
def logUniform(a=1, b=np.exp(1)):
  return logUniform_gen(a=a, b=b, name='logUniform')
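
A quick sanity check of what "log-uniform" means is sketched below. It uses scipy.stats.loguniform (available in SciPy 1.4 and later) as a stand-in for the custom logUniform class above. The sample size and seed are arbitrary assumptions.

```python
import numpy as np
from scipy.stats import kstest, loguniform

a, b = 1E-2, 1E2
draws = loguniform(a, b).rvs(size=10000, random_state=0)

# If the draws are log-uniform on [a, b], then log(draws) is uniform on [log(a), log(b)].
# scipy's 'uniform' takes (loc, scale), i.e. support [loc, loc + scale].
log_draws = np.log(draws)
ks_stat, p_value = kstest(log_draws, 'uniform', args=(np.log(a), np.log(b / a)))

# Each of the four decades between 1E-2 and 1E2 should receive about a quarter of the draws
share_per_decade = [np.mean((draws >= 10.0**k) & (draws < 10.0**(k + 1))) for k in range(-2, 2)]

print(draws.min(), draws.max(), ks_stat, share_per_decade)
```

Equal mass per decade is what makes this distribution a natural choice for parameters such as C and gamma, whose plausible values span several orders of magnitude.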

In [0]:
param_distributions={'C':logUniform(a=1E-2,b= 1E2),'gamma':logUniform(a=1E-2,b=1E2)}

In [0]:
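# Note: cv=n requests sklearn's default (non-purged) 5-fold split, not the PurgedKFold object pkf used in 9.1.
# The iid argument was removed in scikit-learn 0.24, so it is no longer passed.
rs0 = RandomizedSearchCV(svc, param_distributions, n_iter=25, n_jobs=-1, return_train_score=True, scoring='neg_log_loss', cv=n)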

In [223]:
rsfit = rs0.fit(trnsX,y)



In [0]:
rsresult = pd.DataFrame(rsfit.cv_results_)

## (b)

In [225]:
rsresult['mean_fit_time'].sum()

37.067204332351686

In [226]:
rsresult['mean_score_time'].sum()

1.2519288539886475

This took less time than the grid search (about 37 s versus 45 s of summed mean fit time).

## (c)

In [227]:
rsfit.best_params_

{'C': 3.6091960984355933, 'gamma': 0.10004345852454348}

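Quite different from the grid-search optimum: C is about 3.6 instead of 10, while gamma is almost the same (about 0.1).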

## (d)

In [228]:
rsresult.sort_values('rank_test_score').head(1)[['rank_test_score','mean_test_score','mean_train_score']] #this is neg-log-loss

| | rank_test_score | mean_test_score | mean_train_score |
|---|---|---|---|
| 10 | 1 | -0.190361 | -0.074954 |


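Better than the grid-search score (-0.190 versus -0.341). Note, however, that this search was run with cv=n (a plain 5-fold split) rather than the purged splitter pkf, so the two scores are not strictly comparable.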

# 9.3. Continuing from exercise 9.1.

## (a) Compute the Sharpe ratio of the resulting in-sample forecasts, from point 1.a (see Chapter 14 for a definition of Sharpe ratio).

In [0]:
def sharpe(r):
  s = r.mean()/r.std()
  return s

In [214]:
#in-sample forecast

pred3a = gsfit.best_estimator_.predict(trnsX)
pred3a

array([0, 0, 0, ..., 1, 1, 1])

In [215]:
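#where is return in the data?
#There is none, so define one for illustration: if the prediction is 0, r=0 (no bet);
#if the prediction is 1, r=+1 when the true label is 1 (correct) and r=-1 when it is 0 (wrong).
#The true labels are in cont['bin'].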

r3a=pred3a*(2*cont['bin']-1)

sharpe(r3a)

0.948038314210756

##  (b)  Repeat point 1.a, this time with accuracy as the scoring function. Compute the in-sample forecasts derived from the hyper-tuned parameters.

In [0]:
# The iid argument was removed in scikit-learn 0.24, so it is no longer passed
gs1 = GridSearchCV(estimator=svc, param_grid=param_grid, cv=pkf, scoring='accuracy', n_jobs=-1, return_train_score=True)

In [206]:
gsfit2 = gs1.fit(trnsX,y)



In [216]:
pred3b = gsfit2.best_estimator_.predict(trnsX)
r3b = pred3b*(2*cont['bin']-1)
sharpe(r3b)

0.948038314210756

Identical to the Sharpe ratio in (a).

# 9.4.

## (a) Compute the Sharpe ratio of the resulting in-sample forecasts, from point 2.a.

In [229]:
pred4a = rsfit.best_estimator_.predict(trnsX)
r4a = pred4a*(2*cont['bin']-1)
sharpe(r4a)

0.906000957368751

## (b)

In [0]:
# The iid argument was removed in scikit-learn 0.24, so it is no longer passed
rs1 = RandomizedSearchCV(svc, param_distributions, n_iter=25, n_jobs=-1, return_train_score=True, scoring='accuracy', cv=n)

In [231]:
rsfit2 = rs1.fit(trnsX,y)



In [232]:
pred4b = rsfit2.best_estimator_.predict(trnsX)
r4b = pred4b*(2*cont['bin']-1)
sharpe(r4b)

0.9383503729346667

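Here the result differs again: tuning on accuracy gives a higher Sharpe ratio (0.938) than tuning on neg_log_loss (0.906).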

# 9.5. Read the definition of log loss, L[Y, P]

https://towardsdatascience.com/understanding-negative-log-loss-8c3e77fafb79

Logarithmic loss (related to cross-entropy) measures the performance of a classification model whose prediction is a probability value between 0 and 1. The goal of our machine learning models is to minimize this value. A perfect model would have a log loss of 0. Log loss increases as the predicted probability diverges from the actual label. So predicting a probability of 0.012 when the actual observation label is 1 would be bad and result in a high log loss.

http://wiki.fast.ai/index.php/Log_Loss

Why negative?

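Log loss uses the negative log to provide an easy metric for comparison. It takes this approach because the log of a number below 1 is negative, which is confusing to work with when comparing the performance of two models. Negating it yields a positive loss for which lower is better. sklearn's 'neg_log_loss' scorer flips the sign once more so that, like every sklearn scorer, higher is better; this is why the CV scores reported in the earlier exercises are negative.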
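
A minimal sketch of the computation for binary labels follows, assuming the usual definition of log loss as the average negative log-probability assigned to the true label. The labels and probabilities are made up for illustration, and binary_log_loss is a hypothetical helper name; the result is compared with sklearn.metrics.log_loss.

```python
import numpy as np
from sklearn.metrics import log_loss

def binary_log_loss(y_true, p_pos):
    """Average of -log(probability assigned to the true label)."""
    y_true = np.asarray(y_true, dtype=float)
    p_pos = np.asarray(p_pos, dtype=float)
    # Probability the model gave to whichever label actually occurred
    p_true = np.where(y_true == 1, p_pos, 1.0 - p_pos)
    return -np.mean(np.log(p_true))

y_true = [1, 0, 1, 1, 0]
p_pos = [0.9, 0.2, 0.6, 0.012, 0.4]  # predicted probability of label 1

manual = binary_log_loss(y_true, p_pos)
reference = log_loss(y_true, p_pos)

# Contribution of the single confident mistake: p=0.012 for a true label of 1
single_bad = -np.log(0.012)

print(manual, reference, single_bad)
```

The one observation predicted at 0.012 contributes a loss of about 4.42 on its own, far more than any of the other cases.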

# 9.6. Consider an investment strategy that sizes its bets equally, regardless of the forecast’s confidence. In this case, what is a more appropriate scoring function for hyper-parameter tuning, accuracy or cross-entropy loss?

The key conceptual difference between accuracy and negative log-loss is that negative log-loss takes into account not only whether our predictions were correct or not, but the probability of those predictions as well.

Exercise 4 above can be seen as such a case. Tuning the hyper-parameters with accuracy as the scoring function produced a higher Sharpe ratio. In other words, if bet sizes can only be 1 or 0, accuracy may be the more appropriate choice, because all that matters is whether each prediction is right or wrong.
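
The point can be illustrated with a small sketch. The two forecasters and their probabilities below are made up for illustration, and the payoff rule reuses this notebook's convention r = pred*(2*bin-1). Both forecasters fall on the same side of 0.5 for every case, so their equal-sized bets, accuracy, and P&L coincide, while their log loss differs.

```python
import numpy as np
from sklearn.metrics import accuracy_score, log_loss

y_true = np.array([1, 1, 0, 0, 1, 0, 1, 0])

# Two hypothetical forecasters: same side of 0.5 on every case, different confidence
p_cautious = np.array([0.60, 0.60, 0.40, 0.40, 0.40, 0.60, 0.60, 0.40])
p_bold = np.array([0.95, 0.95, 0.05, 0.05, 0.05, 0.95, 0.95, 0.05])

def evaluate(p_pos):
    pred = (p_pos > 0.5).astype(int)  # equal-sized bet: 1 (bet) or 0 (no bet)
    pnl = pred * (2 * y_true - 1)     # +1 if the bet was right, -1 if wrong, 0 if no bet
    return accuracy_score(y_true, pred), log_loss(y_true, p_pos), int(pnl.sum())

acc_c, ll_c, pnl_c = evaluate(p_cautious)
acc_b, ll_b, pnl_b = evaluate(p_bold)

print(acc_c, ll_c, pnl_c)
print(acc_b, ll_b, pnl_b)
```

Log loss penalizes the bold forecaster for its two confident mistakes, yet with equal bet sizes that confidence never reaches the P&L, so accuracy is the score that tracks the strategy's outcome.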