<a href="https://colab.research.google.com/github/mnijhuis-dnb/Artificial_Intelligence_and_Machine_Learning_for_SupTech/blob/main/Tutorials/Tutorial%205%20Decision%20trees%20and%20random%20forests.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Artificial Intelligence and Machine Learning for SupTech  
Tutorial 5: Decision trees and random forests

*	Growing your own decision tree
*	How deep? How many splits? How big are the leaves?
*	From trees to random forests
*	Comparing performance with the confusion matrix

<br/>

14 March 2023  

**Instructors**  
Prof. Iman van Lelyveld (iman.van.lelyveld@vu.nl)<br/>
Dr. Michiel Nijhuis (m.nijhuis@dnb.nl)  

## Company performance
In this tutorial we work with performance data on companies over a 5-year period. We are going to see whether we can predict the sector a company operates in from its performance data.

In [None]:
!gdown 1PCu4jNahysRpZ72z31KHpVkyAOp6nrKj

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Let's start by reading in the data.

In [None]:
df = pd.read_csv('/content/company_data.csv', index_col=0)

In [None]:
df = df.fillna(-1)

In [None]:
df

In this case we are going to use only the data for 2017.

In [None]:
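# Take an explicit copy so that later column assignments do not trigger SettingWithCopyWarning
df_2017 = df[df['year']==2017].copy()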

Explore the data a bit to get a better feel for what it contains.

Some sectors have relatively few observations, so let's combine them into a single 'Other' category.

In [None]:
df_2017['Sector'] = df_2017['Sector'].replace({'Consumer Defensive':'Other', 'Utilities':'Other', 'Communication Services':'Other'})
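Rather than listing the small sectors by hand, they could also be found from their counts. The sketch below uses made-up sector labels and a hypothetical threshold `min_count`; neither comes from the tutorial data.

```python
import pandas as pd

# Hypothetical sector labels standing in for df_2017['Sector']
sectors = pd.Series(
    ["Technology"] * 50 + ["Energy"] * 30
    + ["Utilities"] * 3 + ["Communication Services"] * 2
)

# Count observations per sector and pick those below an assumed threshold
counts = sectors.value_counts()
min_count = 10  # hypothetical cut-off, choose it after inspecting the counts
rare = counts[counts < min_count].index

# Keep common sectors as they are and relabel rare ones as 'Other'
combined = sectors.where(~sectors.isin(rare), "Other")
print(combined.value_counts())
```

In the notebook, the same idea would be applied to `df_2017['Sector']` instead of the toy series.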

Now we can import the decision tree model, scale the data and make a train/test split.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier 

In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
scaler = MinMaxScaler()
numeric_columns = df_2017.select_dtypes(include=[np.float64, np.int64]).columns
df_2017.loc[:,numeric_columns] = scaler.fit_transform(df_2017[numeric_columns])

In [None]:
df_2017

In [None]:
x_train, x_test, y_train, y_test = train_test_split(df_2017.drop(columns=['Sector']), df_2017['Sector'], test_size=0.33, random_state=42)

Let's check whether both the train and the test data contain enough data points for each sector.

In [None]:
pd.concat([y_test.value_counts().rename('Test'), y_train.value_counts().rename('Train')], axis=1)
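If a sector turns out to be thin in one of the two sets, one option is the `stratify` argument of `train_test_split`, which keeps the class shares roughly equal in both parts. This is a minimal sketch on made-up labels, not on the company data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical, imbalanced labels standing in for the 'Sector' column
rng = np.random.default_rng(0)
y_toy = pd.Series(["Technology"] * 60 + ["Energy"] * 30 + ["Other"] * 10)
X_toy = pd.DataFrame({"feature": rng.normal(size=len(y_toy))})

# stratify=y_toy asks for the same class proportions in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=42, stratify=y_toy
)

# Compare the share of each class in both parts
share = pd.concat(
    [y_te.value_counts(normalize=True).rename("Test"),
     y_tr.value_counts(normalize=True).rename("Train")],
    axis=1,
)
print(share)
```

In the notebook this would correspond to adding `stratify=df_2017['Sector']` to the existing split.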

We can set the parameters of the decision tree.

In [None]:
dtc = DecisionTreeClassifier(criterion = "gini", 
                             splitter = "best", 
                             max_depth = 7, 
                             min_samples_split = 10, 
                             min_samples_leaf = 5, 
                             min_weight_fraction_leaf = 0, 
                             max_features = None, 
                             random_state = 11, 
                             max_leaf_nodes = None, 
                             min_impurity_decrease = 0, 
                             class_weight = None, 
                             ccp_alpha = 0)

In [None]:
dtc = dtc.fit(x_train, y_train)

In [None]:
dtc.predict(x_test)

To test how the model is performing, we can simply count the number of correct classifications and divide it by the number of samples. Can you do this?
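As a hint, the calculation can be sketched on a handful of made-up labels (the arrays below are illustrative only):

```python
import numpy as np

# Hypothetical true and predicted sectors, for illustration only
y_true = np.array(["Technology", "Energy", "Technology", "Other", "Energy"])
y_pred = np.array(["Technology", "Technology", "Technology", "Other", "Energy"])

# Count matching positions and divide by the number of samples
n_correct = (y_true == y_pred).sum()
accuracy = n_correct / len(y_true)
print(accuracy)
```

In the notebook, `y_true` would be `y_test` and `y_pred` would be `dtc.predict(x_test)`. The result should agree with `dtc.score(x_test, y_test)`, which reports the mean accuracy.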

We can analyse the results using a confusion matrix, in which the rows are the true sectors and the columns are the predicted sectors.

In [None]:
from sklearn.metrics import confusion_matrix

In [None]:
pd.DataFrame(confusion_matrix(y_test, dtc.predict(x_test), labels=y_test.unique()), index=y_test.unique(), columns=y_test.unique())
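To read more out of the matrix than the raw counts, per-class recall and precision can be derived from its rows and columns. The labels below are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted sectors
y_true = ["Technology", "Technology", "Technology", "Energy", "Energy", "Other"]
y_pred = ["Technology", "Technology", "Energy", "Energy", "Technology", "Other"]
labels = ["Energy", "Other", "Technology"]

# Rows are true classes, columns are predicted classes
cm = pd.DataFrame(
    confusion_matrix(y_true, y_pred, labels=labels),
    index=labels, columns=labels,
)

# Recall: share of each true class that was predicted correctly (row-wise)
recall = np.diag(cm.values) / cm.sum(axis=1)
# Precision: share of each predicted class that was correct (column-wise)
precision = np.diag(cm.values) / cm.sum(axis=0)

print(cm)
print(pd.DataFrame({"recall": recall, "precision": precision}))
```

A class with low recall is often missed by the model, while a class with low precision is often predicted when it should not be.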

We can also plot the whole decision tree to get a better understanding of what it is doing. Each box in the plot shows, from top to bottom:

*	The rule used to split the samples
*	The Gini impurity, which indicates how mixed the classes are at this node: 0 means only one class occurs, and the value rises towards 1 - 1/K when K classes occur equally often
*	The number of samples present at the node
*	The number of samples of each class at the node
*	The dominant class at the node
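The Gini value in each box can be reproduced by hand from the class counts at the node. A minimal sketch, where `gini_impurity` is a helper written for this illustration and not a scikit-learn function:

```python
import numpy as np

def gini_impurity(class_counts):
    """Gini impurity 1 - sum(p_k^2) for the class counts at one node."""
    counts = np.asarray(class_counts, dtype=float)
    p = counts / counts.sum()          # class shares at the node
    return 1.0 - np.sum(p ** 2)

print(gini_impurity([10, 0, 0]))     # only one class present: 0.0
print(gini_impurity([5, 5]))         # two classes, evenly mixed: 0.5
print(gini_impurity([5, 5, 5, 5]))   # four classes, evenly mixed: 0.75
```

The evenly mixed cases show why the maximum depends on the number of classes: with K classes it is 1 - 1/K.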


In [None]:
from sklearn.tree import plot_tree

In [None]:
fig, ax = plt.subplots(figsize=(200, 20))
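# class_names must follow the sorted order in dtc.classes_, not the order of appearance from unique()
plot_tree(dtc, ax=ax, label='root', fontsize=12, feature_names=list(x_train.columns), class_names=list(dtc.classes_));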

If you look at the tree, you can see that it keeps on splitting the data even if the dominant class does not change. With such a split, the 'purity' of the data increases in one node and decreases in the other. This way you can distinguish between instances whose classification you can be more confident about and instances that are more uncertain.

We can also look at the predicted probabilities for each class.

In [None]:
# The columns of predict_proba follow dtc.classes_ (sorted), not y_train.unique()
pd.DataFrame(dtc.predict_proba(x_test), columns=dtc.classes_, index=x_test.index).head()

This is still a relatively small decision tree, but it is already too large to plot comfortably. Can you find better parameters for the model?
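One possible way to look for better parameters is a cross-validated grid search. The sketch below runs on synthetic data from `make_classification`, and the grid values are arbitrary assumptions, not recommendations for the company data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the company data (3 classes, 8 features)
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y
)

# Hypothetical grid: every combination is scored with 5-fold cross-validation
param_grid = {"max_depth": [3, 5, 7], "min_samples_leaf": [1, 5, 10]}
search = GridSearchCV(DecisionTreeClassifier(random_state=11),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

print(search.best_params_)
# Accuracy of the refitted best model on data not used in the search
print(search.score(X_te, y_te))
```

Because the search only sees the training data, the test set stays available for an honest final comparison.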

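In the previous code we used just a single decision tree, but we can also combine many different decision trees to predict the sector of the companies. We do this with a random forest, in which each tree is trained on a bootstrap sample of the data and considers only a random subset of the features at each split.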
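To see how a forest combines its trees, the class probabilities of the individual trees can be averaged by hand and compared with the forest's own output. This sketch uses synthetic data, not the company data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (3 classes, 8 features)
X, y = make_classification(n_samples=200, n_features=8, n_informative=4,
                           n_classes=3, random_state=0)

forest = RandomForestClassifier(n_estimators=10, max_depth=5,
                                random_state=0).fit(X, y)

# Class probabilities from each individual tree: shape (n_trees, n_samples, n_classes)
per_tree = np.stack([tree.predict_proba(X) for tree in forest.estimators_])

# Average over the trees and compare with the forest's prediction
averaged = per_tree.mean(axis=0)
print(np.allclose(averaged, forest.predict_proba(X)))
```

The predicted class is then the one with the highest averaged probability, so a sample on which the trees disagree ends up with a flatter probability profile.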

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
rfc = RandomForestClassifier(n_estimators = 10,
                             criterion = "gini", 
                             max_depth = 5, 
                             min_samples_split = 10, 
                             min_samples_leaf = 5, 
                             min_weight_fraction_leaf = 0, 
                             max_features = "sqrt", 
                             max_leaf_nodes = None, 
                             min_impurity_decrease = 0, 
                             bootstrap = True, 
                             oob_score = False, 
                             n_jobs = None, 
                             random_state = None, 
                             verbose = 0, 
                             warm_start = False, 
                             class_weight = None, 
                             ccp_alpha = 0, 
                             max_samples = None)

In [None]:
rfc = rfc.fit(x_train, y_train)

Can you analyse the performance of the random forest model?

In [None]:
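# One subplot per tree in the forest (n_estimators = 10)
fig, ax = plt.subplots(figsize=(50, 200), nrows=10)
for i in range(10):
  # Use the sorted class order in rfc.classes_ so the box labels match the model
  plot_tree(rfc.estimators_[i], ax=ax[i], label='root', fontsize=12, feature_names=list(x_train.columns), class_names=list(rfc.classes_));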