### Decision Tree Classification Model

A decision tree model is a machine learning algorithm that can be used for classification tasks.
<br>This classification model is easy to interpret and can handle both categorical and numerical features, thus explaining our choice for this algorithm. 
<br>The model represents a series of decisions that lead to a final classification decision as a tree-like graph with several nodes. 
<br>At each node of the tree, a decision is made based on a specific attribute, and each branch represents a different decision or outcome. 
<br>
<br>In classification, the goal is to predict a categorical variable based on a set of input features.
<br>For our case - predicting the cost of damage for a fire incident - the set of input features is as follows:
<li><b>DateOfCall</b>: the month of the date when the fire incident was reported 
<li><b>PropertyType</b>: the type of location where the fire incident occured
<li><b>NumPumpsAttending</b>: the number of total fire pumps that were deployed to the fire incident location
<li><b>PumpHoursRoundUp</b>: the number of hours the fire pumps were used during the fire incident
<li><b>mean_temp</b>: mean daily temperature in Celsius (Cº)

<br>
<br>

#### Scikit-Learn Library

Our prediction model using the decision tree classifier is implemented with <b><i>scikit-learn</i></b> machine learning library in Python.<br>
Please follow the <b>scikit-learn</b>'s installation guide ([https://scikit-learn.org/stable/install.html](Hidden_landing_URL)) and have the library ready before running the code.<br>



#### 1. Importing Libraries

Following libraries and function are necessary to implement the decision tree prediction model.
<br>
<br>
<b> Pandas</b>: 
<li> Used for data cleaning in preparation for the model training
<br><br>
<b> tree </b> from sklearn:
<li> Scikit-learn's decision tree classifier model library
<br><br>
<b> train_test_split </b> from sklearn.model_selection:
<li> Dividing the data into training and testing sets for model training and peformance analysis
<br><br>
<b> classification_report</b> from sklearn.metrics:
<li> Visualizing and measure the performance of the prediction model
<br><br>
<b> GridSearchCV</b> from sklearn.model_selection:
<li> Hyper parameter tuning
<br><br>
<b> export_graphviz</b> from sklearn.tree:
<li> Vizualize the decision tree model

In [3]:
#import all necessary libraries for decision tree model training and testing

#basic libraries for the model building
import pandas as pd
from sklearn import tree
from sklearn.model_selection import train_test_split

#library for measuring the performance metrics
from sklearn.metrics import classification_report

#library for hyper parameter tuning
from sklearn.model_selection import GridSearchCV

#library for vizualization 
from sklearn.tree import export_graphviz

#### 2. Load Dataset

Loading the fire incident and weather dataset into <i>oandas</i> dataframe from CSV file.<br>
The features present in the cleaned dataset are listed below, alongside the type of data each of them holds.
<br>Only selected features will be used for the prediction model.
<br>Please refer to below for each data types' equivalent in Python.
<br><br>
Pandas' datatypes and their Python equivalents:
<li> int64 = int
<li> float64 = float
<li> object = string
<br><br>*Please refer to the <b><i>preprocessing</i></b> folder for detailed implementation on data cleaning.

In [4]:
# load dataset, print types
df = pd.read_csv('preprocessing/data/london_clean.csv')
df.dtypes

DateOfCall             int64
CalYear                int64
HourOfCall             int64
IncidentGroup          int64
PropertyCategory       int64
PropertyType           int64
NumPumpsAttending      int64
PumpHoursRoundUp       int64
Notional Cost (£)      int64
Date                  object
cloud_cover          float64
sunshine             float64
global_radiation     float64
max_temp             float64
mean_temp            float64
min_temp             float64
precipitation        float64
pressure             float64
snow_depth           float64
CostCat                int64
dtype: object

#### 3. Feature Selection and Dataset Split

First, the entire dataset is split into input and output features - X and y respectively.
<br>We wish to predict feature in <i>y</i> based on features grouped in <i>X</i> used as input values.
<br>
<Br>
Both groups of input and output features are split into two subsets for their respective use with the help of Scikit-Learn's <i>train_test_split</i> function:
<li>67% of the dataset is used to train the decision tree model
<li>33% of the dataset is used to test the performance of the decision tree model
<br>
<br>
In <i>train_test_split</i> function, the <i>random_state</i> variable is specified to 42.
<br>This specification allows <i>train_test_split</i> function to generate the identical training and testing subsets every time it is called.
<br>This functionality allows all three prediction models to be trained on the same dataset, allowing better comparison between thier performance in the later step.

In [5]:
# do train and test split
X = df[['DateOfCall', 'PropertyType', 'NumPumpsAttending', 'PumpHoursRoundUp', 'mean_temp']]
y = df[['CostCat']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# print a small sample of train X and y
print(X_train[0:5])
print(y_train[0:5])

         DateOfCall  PropertyType  NumPumpsAttending  PumpHoursRoundUp  \
706975            5            12                  1                 1   
450954            9             6                  2                 1   
525760            6            12                  2                 1   
20577             3            83                  2                 1   
1069644          11            37                  1                 1   

         mean_temp  
706975        15.0  
450954        13.3  
525760        16.4  
20577          7.6  
1069644        8.3  
         CostCat
706975         0
450954         0
525760         0
20577          0
1069644        1


#### 4. Model implementation and training
Scikit-Learn's deicision tree classifier offers a range of parameters that can be tuned to control the behaviour of the decision tree model.
<br>
The following are the parameters we are interested in:
<ol>
<li><b><i>max_depth</i></b>: 
<br> - Sets the maximum depth of the decision tree
<br> - A deeper tree can capture more complex relationships in the data, but can also lead to overfitting
<br> - Default value = "None"
<li><b><i>min_samples_split</i></b>: 
<br> - Sets the minimum number of samples required to split an internal node. 
<br> - A higher value can prevent the tree from splitting too early, leading to more robust models.
<br> - Default value = 2
<li><b><i>criterion</i></b>: 
<br> - Specifies the function used to measure the quality of a split. 
<br> - The two options are "gini" and "entropy", which correspond to the Gini impurity and information gain criteria, respectively.
<br> - Default value = "gini"
</ol>


The first training and testing of the decision tree model is done with default parameters set by the Scikit-Learn library.
<br>
These hyper parameters will be specified in the following step, once the decision tree model provides an acceptable performance score with the current training and testing datasets.


In [6]:
# create and train decision tree
clf = tree.DecisionTreeRegressor()
clf = clf.fit(X_train, y_train)

In [None]:
#tree vizualization
tree.plot_tree(clf)

In [7]:
# test
y_pred = clf.predict(X_test).round()
print("Default Decision Tree Model Description")
print("Depth: " + clf.get_depth())
print("Number of Leaves: " + clf.get_n_leaves())
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.78      0.85      0.81    231390
           1       0.71      0.61      0.66    140418
           2       1.00      1.00      1.00     34444
           3       0.74      0.70      0.72      5421
           4       0.57      0.60      0.58      4962
           5       0.92      0.92      0.92      7949

    accuracy                           0.78    424584
   macro avg       0.79      0.78      0.78    424584
weighted avg       0.77      0.78      0.77    424584



#### 5. Hyper-parameter Tuning

Now that the decision tree model has proven to produce acceptable performance, we will try to specify its hyper parameters to determine the combination that maxmizes the model's performance score.

Scikit-Learn offers an exhuastive search algorithm enabling us to easily determine the best hyper parameter combination called <b><i>GridSearchCV</i></b>
The algorithm iterate over every combination possible of the input hyper parameter values and select the best one 

In [None]:
tree_params = {'criterion':['gini','entropy'],'max_depth':[3,5],'min_samples_split':[10,50,100]}
dtc_top = GridSearchCV(DecisionTreeClassifier(), tree_params, cv=5)
# Training the model for emotion classification
model2_e_top = dtc_top.fit(vector_training_post, emotion_train)
# Predict emotion class of test data
predict_emotion5 = model2_e_top.predict(vector_test_post)
print("Best hyperparameters for emotion classification: ", dtc_top.best_params_)

# Training the model for sentiment classification
model2_s_top = dtc_top.fit(vector_training_post, sentiment_train)
# Predict sentiment class of new email
predict_sentiment5 = model2_s_top.predict(vector_test_post)
print("Best hyperparameters for sentiment classification: ", dtc_top.best_params_)