<div style="border-radius: 10px; border: 1px solid #0F9CF5; background-color: #232323; white-space: nowrap;">
    <p style="margin-top: -10px; margin-bottom: 0px; margin-left: 10px; font-size: 1.15em; padding: 10px; overflow: hidden;">
        <span style="color: orange; font-size: 2em;">&#9432;  </span>
        Click the <span style="color: orange;">Run All</span> <img style="max-height: 1.5em; border: 1px solid orange;" src="../img/RunAll.png" /> button in the toolbar above to run the code in this notebook 
    </p>
</div>

<a id="document-top"></a>
# BQuant Machine Learning Series Part 4 - Decision Trees and Random Forests


<a href='https://bloombergslides.com/view/mail?iID=PhKMSzF7kdZqGgqVgq8q'>Link to Episode 4 - ML Series Video - Decision Trees and Random Forests</a>

In [None]:
import bql
import numpy as np
import pandas as pd
import bqviz as bqv
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import seaborn as sns
from datetime import date

# cache bql request on disk
import src.cache as cachereq
from src.shared import * ## Shared library for retrieving data via BQL for Machine Learning Series

%load_ext autoreload
%autoreload 2

In [None]:
# classifiers and regressor
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier

# scoring the model
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# encode target labels with value between 0 and n_classes-1.
from sklearn.preprocessing import LabelEncoder

# visualize a tree
from sklearn.tree import plot_tree

### Decision Tree Theory
Decision Trees (DTs) are a form of supervised machine learning that builds a simple set of decision rules to make predictions.
* DTs are among the most popular machine learning algorithms because of their interpretability and simplicity
* DTs can be applied to both classification (the predicted outcome is a class or category) and regression (the predicted outcome is a continuous number), but they are mainly used for classification
* DTs are the foundation of Random Forests, which use an ensemble of different DTs to correct overfitting
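
To make "a simple set of decision rules" concrete, a fitted tree can be read as nested if/else statements. The sketch below is hand-written for illustration only: the function name `classify_bond` and every threshold in it are hypothetical and were not learned from the data in this notebook.

```python
def classify_bond(spread_oas, duration):
    """Hand-written stand-in for a depth-3 decision tree (hypothetical thresholds)."""
    if spread_oas <= 300:                      # root node: split on spread
        if duration <= 5:                      # second level: split on duration
            # third level for short bonds: split on spread again
            return 'IG High' if spread_oas <= 80 else 'IG Low'
        # third level for longer bonds: a wider spread is still tolerated
        return 'IG High' if spread_oas <= 130 else 'IG Low'
    return 'HY'                                # wide spreads end in a single leaf


for spread, dur in [(60, 3), (100, 3), (100, 8), (500, 4)]:
    print(spread, dur, '->', classify_bond(spread, dur))
```

Each path from the root to a return statement corresponds to one leaf, which is why a shallow tree is easy to explain.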

### Decision Tree

<img src="img/decision_tree.png" width="80%" height="80%">

<img src="img/decision_boundary.png" width="60%" height="60%">

[Source: scikit-learn](https://scikit-learn.org/stable/modules/tree.html) <br>
**Advantages:**
* simple to understand and interpret; trees can be visualized
* require little data preparation
* able to handle both numerical and categorical data
* use a white-box model ~ results are easy to interpret
* the cost of using the tree (i.e., predicting data) is logarithmic in the number of data points used to train the tree
* able to handle multi-output problems
* possible to validate a model using statistical tests
* perform well even if their assumptions are somewhat violated by the true model from which the data were generated

**Disadvantages:**
* decision-tree learners can create over-complex trees that do not generalize well to new data. This is called overfitting.
* DTs can be unstable (small changes in the data can produce a very different tree) ~ resolve by using an ensemble of DTs
* practical tree-learning algorithms are greedy and cannot guarantee to return the globally optimal decision tree
* predictions of DTs are neither smooth nor continuous, but piecewise constant approximations as seen in the above figure. Therefore, they are not good at extrapolation
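
The last point can be checked on a toy problem. The sketch below fits a `DecisionTreeRegressor` to synthetic data (a straight line, chosen here purely for illustration) and then asks for predictions far outside the training range.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# synthetic data (assumption): a straight line y = 2x observed for x = 0, 1, ..., 10
X = np.arange(0, 11, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel()

reg = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

inside = reg.predict(X)                               # predictions on the training range
outside = reg.predict(np.array([[20.0], [100.0]]))    # far beyond the training range

n_levels = len(np.unique(inside))
print('distinct predicted values inside the training range:', n_levels)
print('predictions at x=20 and x=100:', outside)
```

A tree of depth 3 has at most 8 leaves, so it can return at most 8 distinct values, and every point beyond the training range falls into the same outermost leaf.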

**Some mechanisms to avoid overfitting:** 
* pruning, i.e. reducing the size of a DT by removing sections of the tree that are non-critical or redundant for classifying instances
* setting the minimum number of samples required at a leaf node 
* setting the maximum depth of the tree
* creating an ensemble of DTs ~ Random Forest
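
In scikit-learn the first three mechanisms map to constructor parameters of the tree estimators: `ccp_alpha` (cost-complexity pruning), `min_samples_leaf` and `max_depth`. The sketch below compares them on synthetic data from `make_classification`; the parameter values are arbitrary choices for illustration, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# synthetic, noisy classification data (assumption: not the bond data)
X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           n_redundant=0, flip_y=0.1, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X, y)                     # no constraint
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)     # depth limit
big_leaves = DecisionTreeClassifier(min_samples_leaf=20, random_state=0).fit(X, y)
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X, y)   # pruning

models = [('unconstrained', full), ('max_depth=3', shallow),
          ('min_samples_leaf=20', big_leaves), ('ccp_alpha=0.01', pruned)]
for name, model in models:
    print(f'{name:>20}: depth={model.get_depth()}, leaves={model.get_n_leaves()}')
```

Each constraint yields a smaller tree than the unconstrained fit; which one generalizes best has to be checked on held-out data.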


At each step in building the tree, the algorithm calculates a numerical measure to decide which feature to split on and at which threshold.<br>
These measures generally quantify the homogeneity of the target variable within the resulting subsets.
* Gini impurity (not to be confused with the Gini coefficient) measures how mixed the classes are in a set of categorical data, where $p_j$ is the share of class $j$ among $r$ classes
$$Gini=1-\sum_{j=1}^{r} p_j^2$$
* for example, if 30% of the sample is classified as CLASS1 and 70% as CLASS2, Gini impurity = 1 - (0.3 * 0.3 + 0.7 * 0.7) = 0.42
* entropy is a measure of the disorder of a variable
$$\mathrm{Entropy}(X)=-\sum_{i=1}^{n}\mathrm{P}(x_{i})\log_{b}\mathrm{P}(x_{i})$$
* for the same example, with base b = 2, entropy = - 0.3 * log2(0.3) - 0.7 * log2(0.7) = 0.88
* the goal is to find the split that most reduces the impurity (Gini or entropy) of the resulting subsets
* Gini impurity and entropy tend to generate similar results, i.e. similar trees
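
The two worked examples above, and the search for a split threshold, can be reproduced with a few lines of plain Python. This is a minimal sketch: the helper names `gini`, `entropy` and `best_threshold` are illustrative and are not part of scikit-learn, and the spread values are made up.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum of squared class shares."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Entropy in bits (log base 2)."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n) for count in Counter(labels).values())


def best_threshold(values, labels, impurity=gini):
    """Try the midpoint between each pair of neighbouring values and
    return the (threshold, weighted impurity) pair with the lowest impurity."""
    pairs = sorted(zip(values, labels))
    best = (None, float('inf'))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                      # no split between identical values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * impurity(left) + len(right) * impurity(right)) / len(pairs)
        if score < best[1]:
            best = (threshold, score)
    return best


sample = ['CLASS1'] * 3 + ['CLASS2'] * 7
print(round(gini(sample), 2))      # 0.42
print(round(entropy(sample), 2))   # 0.88

# made-up spreads: the three tight names are IG, the three wide names are HY
spreads = [60, 80, 110, 320, 410, 550]
categories = ['IG', 'IG', 'IG', 'HY', 'HY', 'HY']
threshold, score = best_threshold(spreads, categories)
print(threshold, score)
```

The search picks the midpoint between 110 and 320 because both resulting subsets are pure, so the weighted impurity drops to zero.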

### Initial set up
<font color='magenta'>The data for the given parameters is pre-cached on disk and will be fetched automatically as long as the query stays the same.
Please do not change the parameters (and, as a result, the query) to avoid a BQL run. The query may source a significant amount of data and can lead to data limit issues. You still have access to the BQL query and can study and run it as needed.
</font>

In [None]:
# do not change the dates if you want to use pre-cached data
start_date  = date(2017, 10, 1)
periods     = 9 # source 9 quarters
dates = pd.date_range(start_date, periods=periods, freq='QS')
dates = [ x.date() for x in dates ]
bq          = bql.Service()

# we will use this object to automatically cache the data on disk or read pre-saved data
cache = cachereq.CacheRequest(bq, {'cache_folder': 'data', 'cache_data_on_disk': True})

print(f'We will source the data for the following dates : {[ str(x) for x in dates ]}')
# dates

### Source Bond Data
- start with only one date to test

In [None]:
as_of_date = dates[0]

df = get_bond_data(as_of_date, cache=cache)
print(df.shape)
df.head()

### Source data for all dates

In [None]:
data = []

for as_of_date in dates:
    df = get_bond_data(as_of_date, cache=cache)
    data.append( df )
    
data = pd.concat(data)
data_orig = data.copy()
data.head()

### Enriching the data with extra information

#### Adding rating at the next period as we will predict the FUTURE and not the current rating
- this approach makes the model design cleaner, but it is not strictly necessary, as only 6% of securities change rating
- about 3.5% of securities drop out of the rated universe from one period to the next
- finally, we will predict the rating category (high yield, investment grade etc.), and even fewer securities change category
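
The cell below attaches the future rating through a dictionary lookup keyed on security and date. As an alternative sketch, the same "rating one quarter ahead" column can be built with a self-merge in pandas. The toy frame and its `ID` column are assumptions made for this example (the notebook keeps the security identifier in the index).

```python
import pandas as pd

# toy data (assumption): two bonds observed on two consecutive quarter starts
toy = pd.DataFrame({
    'ID': ['BOND_A', 'BOND_A', 'BOND_B', 'BOND_B'],
    'As_of_date': pd.to_datetime(['2017-10-01', '2018-01-01', '2017-10-01', '2018-01-01']),
    'Rating': ['BBB', 'BBB-', 'A', 'A'],
})

# shift every observation back by three months so it lines up with the prior quarter
nxt = toy[['ID', 'As_of_date', 'Rating']].copy()
nxt['As_of_date'] = nxt['As_of_date'] - pd.DateOffset(months=3)
nxt = nxt.rename(columns={'Rating': 'Rating_future'})

# a left merge keeps every original row; rows without a next quarter get NaN
toy = toy.merge(nxt, on=['ID', 'As_of_date'], how='left')
print(toy)
```

Rows for the last available date have no match and end up with a missing `Rating_future`, which mirrors why the last date has to be dropped.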

In [None]:
from dateutil.relativedelta import relativedelta

# prepare the dict with security + date key and rating as a value
current_date_rating = dict( zip(data.index + '_' + data['As_of_date'].astype(str), data['Rating']) )

# create a column with the next period/quarter rating (future rating)
# tech details: look up in current_date_rating the key equal to security + (date + 3 months)
rating_future = data[['As_of_date']].reset_index().apply( lambda x : current_date_rating.get(x.iloc[0] + '_' + str(x.iloc[1] + relativedelta(months=+3)), ''), axis = 1)
data.insert(data.columns.tolist().index('Rating') + 1, 'Rating_future', rating_future.values)

# remove the last date, as the future rating is not known for this set by definition
data = data[~(data['As_of_date'] == dates[-1])]

data.head()

### Let's check the future rating distribution

In [None]:
pd.DataFrame( data['Rating_future'].value_counts() ).sort_index()

#### Remapping rating to the category (high yield, investment grade etc.)

In [None]:
# remove distressed and defaulted bonds ~ 3.5% of population 
# as well as those with an empty future rating ~ 3% (excluding the last date, for which the future rating is empty by definition)

data = data[ ~data['Rating_future'].isin(['CCC+', 'CCC', 'CCC-', 'CC+', 'CC', 'CC-', 'C+', 'C', 'C-', 'DD+', 'DDD+', ''])]

# remapping the rating to Investment Grade High, Investment Grade Low and High Yield categories
rating_map = { 'AAA' : 'IG High', 'AA+' : 'IG High', 'AA' : 'IG High', 'AA-' : 'IG High', 'A+' : 'IG High', 'A' : 'IG High', 'A-' : 'IG High',
               'BBB+' : 'IG Low', 'BBB' : 'IG Low', 'BBB-' : 'IG Low', 
               'BB+' : 'HY', 'BB' : 'HY', 'BB-' : 'HY', 'B+' : 'HY', 'B' : 'HY', 'B-' : 'HY'  }

data.insert(data.columns.tolist().index('Rating_future') + 1, 'Category_future', data['Rating_future'].apply( lambda x : rating_map[x] ))

data.head()

### Let's check count of the predicted category
- we have three approximately equal categories

In [None]:
display( pd.DataFrame( data['Category_future'].value_counts() ) )

In [None]:
print( f'Dataframe shape={data.shape}' )
data.head()

### Normalization
- with decision trees, the impact of outliers in the predictor variables (not the target variable) is small
- so we skip the normalization step (winsorization, z-score etc.): it is not necessary, which is one of the model's advantages
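
A related property can be checked directly: a tree only uses the ordering of each feature's values, so rescaling a feature does not change its predictions. The sketch below is an illustration on synthetic integer-valued data (an assumption, chosen so the rescaling is exact); it fits the same tree before and after a change of units.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# synthetic features with integer values between 0 and 99 (assumption)
X = rng.integers(0, 100, size=(300, 2)).astype(float)
y = (X[:, 0] + X[:, 1] > 100).astype(int)

# change of units: every feature multiplied by 1000
X_rescaled = X * 1000.0

tree_raw = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
tree_rescaled = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_rescaled, y)

# compare predictions of the two trees on their respective training data
same = np.array_equal(tree_raw.predict(X), tree_rescaled.predict(X_rescaled))
print('identical predictions after rescaling:', same)
```

Models based on distances or gradients would generally not pass this check without normalization.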

### Visualize data

In [None]:
data_to_display = data[data['As_of_date'] == dates[-2]] # check only one date for performance reasons
data_to_display = data_to_display.drop(columns = ['As_of_date', 'Rating_future', 'Issuer', 'Industry', 'Rating'])

plot = bqv.InteractiveScatterPlot(data_to_display,  color_field='Category_future',  reg_line=False)
plot.y_control.value = 'Spread_OAS'
plot.x_control.value = 'Duration'
plot.show()

### Use two features to predict Bond Category
1. Define features and labels
2. Split data into train and test
3. Train the model
4. Make a prediction
5. Check the probabilities
6. Check the feature importance
7. Check the model score
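
Step 2 is done by date rather than at random, so the model is always evaluated on quarters that come after the ones it was trained on. A minimal sketch of that split on a toy frame follows; the frame contents are made up, while the date grid matches the one defined in the set-up. Note that in the notebook the final date has already been removed from `data`, so fewer than three quarters actually land in the test set.

```python
from datetime import date

import pandas as pd

# same quarterly grid as in the set-up cell
dates = [d.date() for d in pd.date_range(date(2017, 10, 1), periods=9, freq='QS')]

# toy frame (assumption): two observations per quarter
toy = pd.DataFrame({'As_of_date': [d for d in dates for _ in range(2)],
                    'Spread_OAS': range(18)})

# hold out the most recent quarters; everything earlier is used for training
test_mask = toy['As_of_date'].isin(dates[-3:])
train, test = toy[~test_mask], toy[test_mask]

print('train rows:', len(train), ' last train date:', max(train['As_of_date']))
print('test rows: ', len(test), ' first test date:', min(test['As_of_date']))
```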

In [None]:
# Features or independent variables
features = ['Spread_OAS', 'Duration']
# Our target or dependent variable
target = 'Category_future'

# set the first six quarters as the train set
data_dropna = data.dropna(subset=features)
X_train = data_dropna.loc[ ~data_dropna['As_of_date'].isin(dates[-3:]), features]
y_train = data_dropna.loc[ ~data_dropna['As_of_date'].isin(dates[-3:]), target]

# set the remaining, most recent quarters (dates[-3:]) as the test set
X_test  = data_dropna.loc[ data_dropna['As_of_date'].isin(dates[-3:]), features]
y_test  = data_dropna.loc[ data_dropna['As_of_date'].isin(dates[-3:]), target]

In [None]:
print("X_train shape = ",  X_train.shape, "   y_train shape = ", y_train.shape[0])
print("X_test shape  = ",  X_test.shape,  "   y_test shape  = ", y_test.shape[0])

### Fit the model and get out-of-sample prediction

In [None]:
tree_classifier = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=63)
# tree_classifier = DecisionTreeClassifier(criterion='gini', random_state=63)
# fit the model on training data
tree_classifier.fit(X_train, y_train)

In [None]:
# make a prediction
prediction = tree_classifier.predict( X_test )

prediction

In [None]:
# check probabilities,
# for example the first column represents the probability of belonging to class 'HY', i.e. tree_classifier.classes_[0]
print( f'Classes names: {tree_classifier.classes_}' )

probabilities = tree_classifier.predict_proba(X_test)

probabilities
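
The layout of the `predict_proba` output is easy to misread, so here is a small sketch on synthetic data (the features and labelling rule are assumptions for illustration). It shows that there is one row per sample and one column per class, in the order given by `classes_`, and that `predict` returns the class with the highest probability.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
# made-up labelling rule producing three classes
y = np.where(X[:, 0] > 0.5, 'HY', np.where(X[:, 1] > 0, 'IG High', 'IG Low'))

clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
proba = clf.predict_proba(X)

print(clf.classes_)     # column order of proba
print(proba.shape)      # one row per sample, one column per class
print(proba[:3])        # each row sums to 1

# the predicted label is the class whose column holds the largest probability
labels_from_proba = clf.classes_[proba.argmax(axis=1)]
```

For a single tree, each probability is the share of training samples of that class in the leaf the observation falls into, which is why only a handful of distinct values appear.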

In [None]:
def plot_distr_of_prob(classifier, probabilities):
    plt.figure(figsize=(10,5))

    for i, k in enumerate(classifier.classes_):
        # `fill` replaces the deprecated `shade` argument in recent seaborn versions
        sns.kdeplot(probabilities[:,i], fill=True, label = k)

    plt.legend(fontsize=14)
    plt.title('Distribution of Probabilities by class', fontsize=14)
    plt.show()
    
plot_distr_of_prob(tree_classifier, probabilities)

In [None]:
# feature importance
feature_importances = pd.DataFrame({'Feature': X_train.columns.tolist(),
                                    'Importance': tree_classifier.feature_importances_.round(3)}).\
                                    sort_values('Importance', ascending = False)

feature_importances

In [None]:
# Check the model score
def print_score(classifier, X, y):
    # predict once and reuse the result for all metrics
    y_pred = classifier.predict(X)

    # accuracy classification score
    print('Accuracy Score: {0:.4f}'.format( accuracy_score(y, y_pred) ))
        
    # Precision is the ratio of true positives to total predicted positives (=TP / (TP + FP))
    # Recall is the ratio of true positives to total actual positives (=TP / (TP + FN))
    # F1-score provides a single score that balances both the concerns of precision and recall in one number
    print('Classification Report: \n {}\n'.format( classification_report(y, y_pred) ))
    
    # CM summarizes the performance of a classification algorithm
    # rows are actual classes and columns are predicted classes:
    # correct predictions are on the diagonal, misclassifications elsewhere
    print('Confusion Matrix: \n {}\n'.format( confusion_matrix(y, y_pred) ))
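
The metrics printed by `print_score` can all be derived from the confusion matrix. The sketch below does the arithmetic by hand with NumPy on made-up counts, using the scikit-learn convention that rows are actual classes and columns are predicted classes.

```python
import numpy as np

labels = ['HY', 'IG High', 'IG Low']
# made-up counts: rows = actual class, columns = predicted class
cm = np.array([[50,  0, 10],
               [ 0, 40, 10],
               [ 5,  5, 30]])

true_pos = np.diag(cm)                    # correct predictions per class
precision = true_pos / cm.sum(axis=0)     # TP / everything predicted as the class
recall = true_pos / cm.sum(axis=1)        # TP / everything actually in the class
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_pos.sum() / cm.sum()

for name, p, r, f in zip(labels, precision, recall, f1):
    print(f'{name:>8}: precision={p:.2f} recall={r:.2f} f1={f:.2f}')
print(f'accuracy={accuracy:.2f}')
```

Column sums give the denominators for precision and row sums give the denominators for recall, so a class can score well on one and poorly on the other.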

In [None]:
# Check the model score
print('************ Train set ************')
print_score(tree_classifier, X_train, y_train)

print('************ Test set ************')
print_score(tree_classifier, X_test, y_test)

### Function to plot decision regions

In [None]:
def plot_decision_boundary(X, y, classifier, resolution=0.1, xlabel=None, ylabel=None, title=None):
    # setup marker generator and color map
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap( colors[:len(np.unique(y))] )

    # plot the decision surface
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                           np.arange(x2_min, x2_max, resolution))

    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)

    # encode labels as numerical values for plotting in case they are not numerical already
    if isinstance(Z[0], str):
        label_encoder = LabelEncoder()
        label_encoder.fit(y)
        Z = label_encoder.transform( Z )

    Z = Z.reshape(xx1.shape)

    plt.figure(figsize=[6,4], dpi=100)
    plt.contourf(xx1, xx2, Z, alpha=0.4, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())

    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0], 
                    y=X[y == cl, 1],
                    alpha=0.8, 
                    c=colors[idx],
                    marker=markers[idx],
                    label=cl,
                    edgecolor='black')

    plt.legend(loc='upper left')
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title)
#     plt.savefig('img/decision_boundary.png', format='png', dpi=600)

### Plotting decision boundary of a trained model

In [None]:
plot_decision_boundary(X_train.values, y_train.values, tree_classifier, resolution=0.1, xlabel=X_train.columns[0], ylabel=X_train.columns[1], title = 'Decision Boundary, Train Data')

### Plotting decision boundary applied to the out-of-sample data

In [None]:
plot_decision_boundary(X_test.values, y_test.values, tree_classifier, resolution=0.1, xlabel=X_test.columns[0], ylabel=X_test.columns[1], title = 'Decision Boundary, Test Data')

### Plot decision tree

In [None]:
plt.figure(figsize=[8,4], dpi=600)
# plot_tree(tree_classifier, feature_names = feature_names, class_names =class_names, filled=True)
feature_names = [x[0:10] for x in X_train.columns.tolist()]
plot_tree(tree_classifier, filled=True, feature_names=feature_names, class_names=list(tree_classifier.classes_))
# plt.savefig('img/decision_tree.png', format='png', dpi=600)
plt.show()
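
When a picture is not convenient (for example in logs or a code review), the same rules can be printed as text with `sklearn.tree.export_text`. The sketch below uses synthetic data; the feature names are borrowed from this notebook, while the values and the labelling rule are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
# synthetic features (assumption): a spread in bps and a duration in years
X = np.column_stack([rng.uniform(20, 800, 200), rng.uniform(0, 15, 200)])
y = np.where(X[:, 0] > 300, 'HY', 'IG')     # made-up rule based on the spread only

clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# one line per node, indented by depth; leaves show the predicted class
rules = export_text(clf, feature_names=['Spread_OAS', 'Duration'])
print(rules)
```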

### Decision Tree Regression

In [None]:
data_reg = data_orig.copy()

# prepare the dict with security + date key and spread as a value
current_date_spread = dict( zip(data_reg.index + '_' + data_reg['As_of_date'].astype(str), data_reg['Spread_OAS']) )

# create a column with the next period/quarter spread (future spread)
# tech details: look up in current_date_spread the key equal to security + (date + 3 months)
spread_future = data_reg[['As_of_date']].reset_index().apply( lambda x : current_date_spread.get(x.iloc[0] + '_' + str(x.iloc[1] + relativedelta(months=+3)), np.nan), axis = 1)
data_reg.insert(data_reg.columns.tolist().index('Spread_OAS') + 1, 'Spread_OAS_future', spread_future.values)

data_reg = data_reg.dropna(subset=['Spread_OAS', 'Spread_OAS_future'])
data_reg = data_reg[(data_reg['Spread_OAS'] >0) & (data_reg['Spread_OAS_future'] >0) ]

data_reg.insert(data_reg.columns.tolist().index('Spread_OAS_future') + 1, 'Spread_OAS_future_change', data_reg['Spread_OAS_future'] / data_reg['Spread_OAS'] -1  )

data_reg['Spread_OAS_future_change'] = data_reg['Spread_OAS_future_change'].clip(upper=3)

data_reg.head()
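
The regression target built above is the relative change in spread over the next quarter, capped at +300%. A worked example on three made-up bonds shows the arithmetic, including one case where the cap applies.

```python
import pandas as pd

# made-up spreads in basis points
toy = pd.DataFrame({'Spread_OAS':        [100.0,  50.0, 200.0],
                    'Spread_OAS_future': [150.0, 400.0, 160.0]})

# relative change: future / current - 1
change = toy['Spread_OAS_future'] / toy['Spread_OAS'] - 1

# cap extreme widenings at 3 (i.e. +300%), as in the cell above
toy['Spread_OAS_future_change'] = change.clip(upper=3)
print(toy)
```

The second bond widens from 50 to 400 bps, a raw change of 7, and is clipped to 3. Tightenings are bounded below by -1 because both spreads are required to be positive, so no lower clip is needed.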

In [None]:
plt.figure(figsize=(10,5))

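# `fill` replaces the deprecated `shade` argument; the label gives the legend an entry to show
sns.kdeplot(data_reg['Spread_OAS_future_change'], fill=True, label='Spread_OAS_future_change')

plt.legend(fontsize=14)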
plt.title('Distribution of OAS Spread change', fontsize=14)
plt.show()

In [None]:
# Features or independent variables
features = ['Spread_OAS', 'Duration']
# Our target or dependent variable
target = 'Spread_OAS_future_change'

# set the first six quarters as the train set
data_dropna = data_reg.dropna(subset=features)
X_train = data_dropna.loc[ ~data_dropna['As_of_date'].isin(dates[-3:]), features]
y_train = data_dropna.loc[ ~data_dropna['As_of_date'].isin(dates[-3:]), target]

# set the remaining, most recent quarters (dates[-3:]) as the test set
X_test  = data_dropna.loc[ data_dropna['As_of_date'].isin(dates[-3:]), features]
y_test  = data_dropna.loc[ data_dropna['As_of_date'].isin(dates[-3:]), target]

In [None]:
tree_regressor = DecisionTreeRegressor(max_depth=3, random_state=63)
# fit the model on training data
tree_regressor.fit(X_train, y_train)

# make a prediction
prediction = tree_regressor.predict( X_test )
print('DecisionTreeRegressor prediction:', prediction)

### Ensemble Techniques
* An ensemble combines several different predictors into one.

There are several ensemble techniques, among them
* **bagging** uses sampling with replacement, also called bootstrapping
    * predictions are combined by averaging for regression and by voting for classification
    * the models can be trained in parallel, as each model is trained on its own random subset of the data
* **pasting** uses sampling without replacement
* **boosting** - first we train weak classifiers and then combine them into a strong classifier, weighting them by accuracy
    * the individual models are trained sequentially, as each model learns from the mistakes made by the previous one
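
The difference between bagging and pasting comes down to how each model's training sample is drawn. The sketch below uses only the Python standard library; the sample size, the seed and the helper `majority_vote` are illustrative choices.

```python
import random
from collections import Counter

random.seed(63)
n = 10_000
indices = list(range(n))

# bagging: draw n observations WITH replacement (a bootstrap sample)
bootstrap = random.choices(indices, k=n)
# pasting: draw a subset WITHOUT replacement
pasting = random.sample(indices, k=n // 2)

unique_share = len(set(bootstrap)) / n
print(f'share of distinct observations in the bootstrap sample: {unique_share:.3f}')
print('duplicates in the pasting sample:', len(pasting) - len(set(pasting)))


def majority_vote(predictions):
    """Combine class predictions from several models by voting."""
    return Counter(predictions).most_common(1)[0][0]


vote = majority_vote(['HY', 'IG Low', 'HY'])     # classification: vote
average = sum([0.10, 0.20, 0.30]) / 3            # regression: average
print(vote, round(average, 2))
```

A bootstrap sample of size n contains roughly 63% of the distinct observations (about 1 - 1/e), so every model sees a somewhat different data set.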

### Random Forest Theory
* Random forest builds multiple decision trees and merges them together to get a more accurate and stable prediction. Random forests correct for decision trees' habit of overfitting to their training set
* a random forest is trained with the bagging method, using random sampling of both observations and features
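
Both sources of randomness are exposed as parameters of scikit-learn's `RandomForestClassifier`: `bootstrap` controls the sampling of observations and `max_features` the number of features considered at each split. The sketch below uses synthetic data and arbitrary settings; `oob_score=True` additionally scores each tree on the observations left out of its bootstrap sample.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic data (assumption: not the bond data)
X, y = make_classification(n_samples=400, n_features=8, n_informative=4, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,       # number of trees
    bootstrap=True,         # each tree is fitted on a bootstrap sample of the rows
    max_features='sqrt',    # each split considers a random subset of the features
    oob_score=True,         # evaluate on the rows each tree did not see
    random_state=0,
).fit(X, y)

print('number of trees:', len(forest.estimators_))
print('out-of-bag accuracy:', round(forest.oob_score_, 3))
```

The out-of-bag score gives a rough estimate of generalization without a separate validation set, although a time-based hold-out like the one used in this notebook remains the stricter test.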

### Random Forest Classifier
1. Define features and labels
2. Split data into train and test
3. Train the model
4. Make a prediction
5. Check the probabilities
6. Check the feature importance
7. Check the model score

In [None]:
# Features or independent variables
features = [ 'Spread_OAS', 'Prob_of_default', 'Yield', 'Coupon', 'Amt_Outst', 'Duration', 'Maturity_Years', 'EBITDA_Margin', 'Return_on_Assets', 
            'Current_Ratio', 'Quick_Ratio', 'Debt_to_EBITDA', 'Debt_to_Assets', 'Debt_to_Equity', 'Return_on_Equity_3yr_avg']

# Our target or dependent variable
target = 'Category_future'

# set the first six quarters as the train set
data_dropna = data.dropna(subset=features)
X_train = data_dropna.loc[ ~data_dropna['As_of_date'].isin(dates[-3:]), features]
y_train = data_dropna.loc[ ~data_dropna['As_of_date'].isin(dates[-3:]), target]

# set the remaining, most recent quarters (dates[-3:]) as the test set
X_test  = data_dropna.loc[ data_dropna['As_of_date'].isin(dates[-3:]), features]
y_test  = data_dropna.loc[ data_dropna['As_of_date'].isin(dates[-3:]), target]

In [None]:
print("X_train shape = ",  X_train.shape, "   y_train shape = ", y_train.shape[0])
print("X_test shape  = ",  X_test.shape,  "   y_test shape  = ", y_test.shape[0])

In [None]:
# n_estimators - the number of trees in the forest.
# max_depth - the maximum depth of the tree
# bootstrap - whether bootstrap samples are used when building trees and the samples are drawn with replacement
rf_classifier = RandomForestClassifier(n_estimators=1000, bootstrap = True, max_depth = 6, random_state = 63)

# fit the model on training data
rf_classifier.fit(X_train, y_train)

# make a prediction
prediction = rf_classifier.predict(X_test)
prediction

In [None]:
# check probabilities,
# for example the first column represents the probability of belonging to class 'HY', i.e. rf_classifier.classes_[0]
print( f'Classes names: {rf_classifier.classes_}' )
probabilities = rf_classifier.predict_proba(X_test)
probabilities

In [None]:
plot_distr_of_prob(rf_classifier, probabilities)

#### Feature Importance
- with a Random Forest it is very easy to measure the relative importance of each feature for the prediction
- this is achieved by looking at how much the tree nodes that use that feature reduce impurity across all trees in the forest
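
The sketch below illustrates this on synthetic data in which only the first of four features drives the label (an assumption made so the expected ranking is obvious). It also compares the forest-level importances with the average of the per-tree importances, which is how current scikit-learn versions appear to compute them.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] > 0).astype(int)      # only feature 0 carries information

forest = RandomForestClassifier(n_estimators=50, max_depth=4, random_state=0).fit(X, y)

# impurity-based importances of every individual tree, one row per tree
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])

print('forest importances:   ', forest.feature_importances_.round(3))
print('mean of tree values:  ', per_tree.mean(axis=0).round(3))
print('most important column:', int(forest.feature_importances_.argmax()))
```

The importances are normalized to sum to 1, so they describe relative rather than absolute contribution.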

In [None]:
feature_importances = pd.DataFrame({'Feature': X_train.columns.tolist(),
                                    'Importance': rf_classifier.feature_importances_.round(3)}).\
                                    sort_values('Importance', ascending = False)

feature_importances

In [None]:
# Check the model score
print('************ Train set ************')
print_score(rf_classifier, X_train, y_train)

print('************ Test set ************')
print_score(rf_classifier, X_test, y_test)