### Decision Tree Classification Model

A decision tree model is a machine learning algorithm that can be used for classification tasks.
<br>It is easy to interpret and handles both categorical and numerical features, which is why we chose it.
<br>The model represents a series of decisions that lead to a final classification decision as a tree-like graph with several nodes. 
<br>At each node of the tree, a decision is made based on a specific attribute, and each branch represents a different decision or outcome. 
<br>
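
To make the node-and-branch structure concrete, here is a minimal sketch that fits a shallow tree and prints its decision rules as text. It uses scikit-learn's bundled iris dataset as a stand-in (an assumption for illustration, not the fire incident data), and the name `toy_tree` is hypothetical.

```python
from sklearn import tree
from sklearn.datasets import load_iris

# Stand-in data: the iris dataset shipped with scikit-learn (not the fire incident data)
iris = load_iris()

# A depth-2 tree keeps the printed rules short enough to read
toy_tree = tree.DecisionTreeClassifier(max_depth=2, random_state=0)
toy_tree.fit(iris.data, iris.target)

# Each indented line is a test on one feature; each "class:" line is a leaf
rules = tree.export_text(toy_tree, feature_names=list(iris.feature_names))
print(rules)
```

Each level of indentation in the printed rules corresponds to one decision node, and every path from the top to a `class:` line is one sequence of decisions ending in a classification.
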
<br>In classification, the goal is to predict a categorical variable based on a set of input features.
<br>For our case - predicting the cost of damage for a fire incident - the set of input features is as follows:
<li><b>DateOfCall</b>: the month in which the fire incident was reported
<li><b>PropertyType</b>: the type of location where the fire incident occurred
<li><b>NumPumpsAttending</b>: the total number of fire pumps deployed to the incident location
<li><b>PumpHoursRoundUp</b>: the number of hours the fire pumps were in use during the incident
<li><b>mean_temp</b>: mean daily temperature in degrees Celsius (°C)
<br>
<br>
The output of our classification model is the cost of damage in pound sterling (£).
<br>The cost was originally a continuous numerical variable, but we converted it to a categorical variable by binning it into intervals of £100.
<br>For example, all records with a cost between £0.00 and £100.00 fall under category 1, all records with a cost between £100.01 and £200.00 fall under category 2, and so on. All records with a cost of £1000.01 or more fall under category 11.
<br>
<br>
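
The actual conversion is part of the preprocessing code and is not shown in this notebook. The following is only a minimal sketch of the binning rule described above, assuming costs are non-negative amounts in pounds; the helper name `cost_to_category` and the sample values are hypothetical.

```python
import numpy as np
import pandas as pd

def cost_to_category(cost):
    """Map a cost in pounds to a category from 1 to 11 using £100-wide bins."""
    # Rounding up cost / 100 gives 1 for (0, 100], 2 for (100, 200], and so on
    category = np.ceil(cost / 100.0)
    # A cost of exactly 0 belongs to category 1; anything above £1000 is category 11
    return category.clip(lower=1, upper=11).astype(int)

# Hypothetical sample costs chosen to sit on and around the bin edges
costs = pd.Series([0.0, 100.0, 100.01, 250.0, 1000.0, 1000.01, 5000.0])
categories = cost_to_category(costs)
print(categories.tolist())
```
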

#### Scikit-Learn Library

Our prediction model is implemented with the decision tree classifier from the <b><i>scikit-learn</i></b> machine learning library in Python.<br>
Please follow <b>scikit-learn</b>'s installation guide ([https://scikit-learn.org/stable/install.html](https://scikit-learn.org/stable/install.html)) and have the library ready before running the code.<br>



#### 1. Importing Libraries

The following libraries and functions are needed to implement the decision tree prediction model.
<br>
<br>
<b> Pandas</b>: 
<li> Data manipulation library
<br><br>
<b> NumPy</b>: 
<li> Numerical computing library
<br><br>
<b> train_test_split </b> from sklearn.model_selection:
<li> Dividing the data into training and testing sets for model training and performance analysis
<br><br>
<b> tree </b> from sklearn:
<li> Scikit-learn's decision tree classifier model library
<br><br>
<b> classification_report</b> from sklearn.metrics:
<li> Measuring and summarizing the performance of the prediction model
<br><br>
<b> GridSearchCV</b> from sklearn.model_selection:
<li> Hyperparameter tuning
<br><br>
<b>pickle</b>:
<li> Saving and loading the machine learning model

In [1]:
#import all necessary libraries for decision tree model training and testing

#libraries for the data manipulation
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

#library for decision tree model
from sklearn import tree

#library for measuring the performance metrics
from sklearn.metrics import classification_report

#library for hyper parameter tuning
from sklearn.model_selection import GridSearchCV

#library for saving the final classifier model
import pickle

#### 2. Load Dataset

The fire incident and weather dataset is loaded from a CSV file into a <i>pandas</i> DataFrame.<br>
The features present in the cleaned dataset are listed below, alongside the type of data each of them holds.
<br>Only selected features will be used for the prediction model.
<br>The list below gives the Python equivalent of each pandas data type.
<br><br>
Pandas data types and their Python equivalents:
<li> int64 = int
<li> float64 = float
<li> object = str (string)
<br><br>*Please refer to the <b><i>preprocessing</i></b> folder for the detailed implementation of the data cleaning.

In [2]:
# load dataset, print types
df = pd.read_csv('preprocessing/data/london_clean_weather.csv')
df.dtypes

DateOfCall             int64
CalYear                int64
HourOfCall             int64
IncidentGroup          int64
PropertyCategory       int64
PropertyType           int64
NumPumpsAttending      int64
PumpHoursRoundUp       int64
Notional Cost (£)      int64
Date                  object
CostCat                int64
cloud_cover          float64
sunshine             float64
global_radiation     float64
max_temp             float64
mean_temp            float64
min_temp             float64
precipitation        float64
pressure             float64
snow_depth           float64
dtype: object

#### 3. Feature Selection and Dataset Split

First, the dataset is split into input and output features, <i>X</i> and <i>y</i> respectively.
<br>We wish to predict the feature in <i>y</i> from the input features grouped in <i>X</i>.
<br>
<br>
Both the input and output features are then split into two subsets with Scikit-Learn's <i>train_test_split</i> function:
<li>67% of the dataset is used to train the decision tree model
<li>33% of the dataset is used to test the performance of the decision tree model
<br>
<br>
In the <i>train_test_split</i> function, the <i>random_state</i> parameter is set to 42.
<br>With a fixed <i>random_state</i>, <i>train_test_split</i> generates identical training and testing subsets every time it is called on the same data.
<br>This allows all three prediction models to be trained on the same data, enabling a fairer comparison of their performance in a later step.
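
The reproducibility claim can be checked on a small example. The sketch below calls <i>train_test_split</i> twice with the same <i>random_state</i> on a toy array (hypothetical data, not the fire incident dataset) and compares the resulting subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features each (hypothetical, for illustration only)
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two independent calls with the same seed
split_a = train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)

# Compare X_train, X_test, y_train, y_test pairwise
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print("Identical subsets:", same_split)
```
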

In [3]:
# do train and test split
X = df[['DateOfCall', 'PropertyType', 'NumPumpsAttending', 'PumpHoursRoundUp', 'mean_temp']]
y = df[['CostCat']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# print a small sample of train X and y
print(X_train[0:5])
print(y_train[0:5])

         DateOfCall  PropertyType  NumPumpsAttending  PumpHoursRoundUp  \
706975            5            12                  1                 1   
450954            9             6                  2                 1   
525760            6            12                  2                 1   
20577             3            83                  2                 1   
1069644          11            37                  1                 1   

         mean_temp  
706975        15.0  
450954        13.3  
525760        16.4  
20577          7.6  
1069644        8.3  
         CostCat
706975         3
450954         3
525760         3
20577          3
1069644        4


#### 4. Model Implementation and Training

Scikit-Learn's decision tree classifier offers a range of parameters that can be tuned to control the behaviour of the decision tree model.
<br>
The following are the parameters we are interested in:
<ol>
<li><b><i>max_depth</i></b>: 
<br> - Sets the maximum depth of the decision tree
<br> - A deeper tree can capture more complex relationships in the data, but can also lead to overfitting
<br> - Default value = None (the tree grows until no further split is possible)
<li><b><i>min_samples_split</i></b>: 
<br> - Sets the minimum number of samples required to split an internal node. 
<br> - A higher value stops the tree from splitting nodes that contain only a few samples, which can reduce overfitting and lead to more robust models.
<br> - Default value = 2
<li><b><i>criterion</i></b>: 
<br> - Specifies the function used to measure the quality of a split. 
<br> - The options considered here are "gini" and "entropy", which correspond to the Gini impurity and information gain criteria, respectively.
<br> - Default value = "gini"
</ol>
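
As a minimal sketch of how these three parameters are passed and what effect they have, the example below compares a default tree with a restricted one. It uses synthetic data from `make_classification` (an assumption for illustration), and the restricted values are arbitrary rather than recommended settings.

```python
from sklearn import tree
from sklearn.datasets import make_classification

# Synthetic stand-in data (not the fire incident dataset)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=5, n_informative=3, random_state=0
)

# Defaults reported by the library for the three parameters of interest
defaults = tree.DecisionTreeClassifier().get_params()
print({name: defaults[name] for name in ("max_depth", "min_samples_split", "criterion")})

# Default tree: grows until no further split is possible
unrestricted = tree.DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# Restricted tree: arbitrary illustrative values for the three parameters
restricted = tree.DecisionTreeClassifier(
    max_depth=3, min_samples_split=20, criterion="entropy", random_state=0
).fit(X_demo, y_demo)

print("Unrestricted depth/leaves:", unrestricted.get_depth(), unrestricted.get_n_leaves())
print("Restricted depth/leaves:  ", restricted.get_depth(), restricted.get_n_leaves())
```
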


The decision tree model is first trained and tested with the default parameters set by the Scikit-Learn library.
<br>
The hyperparameters are tuned in the following step, once the default model reaches an acceptable performance score on the current training and testing datasets.


In [4]:
# create and train a decision tree classifier with default hyperparameters
clf = tree.DecisionTreeClassifier(random_state=0)
clf = clf.fit(X_train, y_train)

In [5]:
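# predict on the test set, then report the tree size and performance
# (the classifier already returns class labels, so no rounding is needed)
y_pred = clf.predict(X_test)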
print("Default Decision Tree Model Description")
print("Depth: %d" % clf.get_depth())
print("Number of Leaves: %d" % clf.get_n_leaves())
print("*******************************************")
print("Classification Report of the default model")
print(classification_report(y_test, y_pred))

Default Decision Tree Model Description
Depth: 52
Number of Leaves: 106173
*******************************************
Classification Report of the default model
              precision    recall  f1-score   support

           3       0.78      0.85      0.81    231390
           4       0.71      0.61      0.66    140418
           6       0.75      0.79      0.77     20572
           7       0.66      0.60      0.63     13872
           8       0.58      0.61      0.60      3444
           9       0.43      0.43      0.43      1977
          10       0.46      0.43      0.45      2228
          11       0.93      0.93      0.93     10683

    accuracy                           0.75    424584
   macro avg       0.66      0.66      0.66    424584
weighted avg       0.75      0.75      0.75    424584
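
The two average rows in the report above can be reproduced from the per-class rows. The sketch below does this for the f1-score column, using the values copied from the report; because those values are already rounded to two decimals, the result is an approximation of what scikit-learn computes internally.

```python
import numpy as np

# Per-class f1-score and support, copied from the default model's report above
f1 = np.array([0.81, 0.66, 0.77, 0.63, 0.60, 0.43, 0.45, 0.93])
support = np.array([231390, 140418, 20572, 13872, 3444, 1977, 2228, 10683])

# Macro average: every class counts equally, regardless of its size
macro_f1 = f1.mean()

# Weighted average: each class counts in proportion to its number of test records
weighted_f1 = np.average(f1, weights=support)

print(f"macro avg f1:    {macro_f1:.2f}")
print(f"weighted avg f1: {weighted_f1:.2f}")
```

The macro average is lower than the weighted average because the small classes 9 and 10, which have the lowest scores, count as much as the two large classes 3 and 4.
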



#### 5. Hyper-parameter Tuning

Now that the default decision tree model performs acceptably, we tune its hyperparameters to find the combination that maximizes the model's performance score.

Scikit-Learn offers an exhaustive search algorithm, <b><i>GridSearchCV</i></b>, that makes it easy to determine the best hyperparameter combination.
<br>The algorithm iterates over every possible combination of the supplied hyperparameter values, scores each one with cross-validation, and selects the best.

The following are the values of hyperparameters tested:
<ol>
<li><b><i>max_depth</i></b>: 
<br> - every integer from 45 to 55
<li><b><i>min_samples_split</i></b>: 
<br> - from 2 to 9 in steps of 1: (2, 3, 4, 5, 6, 7, 8, 9)
<br> - from 10 to 90 in steps of 10: (10, 20, 30, 40, 50, 60, 70, 80, 90)
<li><b><i>criterion</i></b>: 
<br> - "gini" and "entropy"
</ol>
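
Because the search is exhaustive, its cost grows with the product of the list lengths. As a rough check before launching it, the number of candidate combinations can be counted with scikit-learn's `ParameterGrid`; the sketch below uses the same grid expression as the tuning cell and assumes 5-fold cross-validation, as set there.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same grid expression as in the tuning cell
tree_params = {
    'criterion': ['gini', 'entropy'],
    'max_depth': list(range(45, 56)),
    'min_samples_split': list(range(2, 10)) + np.arange(10, 100, 10).tolist(),
}

# Number of distinct hyperparameter combinations in the grid
n_candidates = len(ParameterGrid(tree_params))

# Each combination is trained once per cross-validation fold (cv=5)
n_fits = n_candidates * 5

print("Candidates:", n_candidates)
print("Model fits during the search:", n_fits)
```
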



In [7]:
#specify the values to test for each hyperparameter of interest
tree_params = {
    'criterion': ['gini', 'entropy'],
    'max_depth': list(range(45, 56)),  # 45, 46, ..., 55
    'min_samples_split': list(range(2, 10)) + np.arange(10, 100, 10).tolist(),  # 2..9, then 10, 20, ..., 90
}

#determine the best hyper parameter
dtc_top = GridSearchCV(tree.DecisionTreeClassifier(), tree_params, cv=5)

# Training the model for classification with the same dataset 
model2 = dtc_top.fit(X_train, y_train)

print("Best hyperparameters for decision tree classification: ", dtc_top.best_params_)


Best hyperparameters for decision tree classification:  {'criterion': 'gini', 'max_depth': 46, 'min_samples_split': 90}
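
Besides `best_params_`, a fitted `GridSearchCV` object also exposes `best_score_` (the mean cross-validated score of the winning combination) and, with the default `refit=True`, `best_estimator_` (a model already retrained on the full training data with those parameters). The sketch below shows these attributes on a deliberately small grid and synthetic data, both of which are assumptions for illustration rather than the settings used in this notebook.

```python
from sklearn import tree
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (not the fire incident dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

# A small illustrative grid: 2 x 2 x 2 = 8 combinations
small_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 4],
    "min_samples_split": [2, 10],
}

search = GridSearchCV(tree.DecisionTreeClassifier(random_state=0), small_grid, cv=5)
search.fit(X_tr, y_tr)

# The winning combination, its mean cross-validated accuracy, and the refit model
print("Best parameters:", search.best_params_)
print("Best CV score:", search.best_score_)
best_tree = search.best_estimator_
test_accuracy = best_tree.score(X_te, y_te)
print("Test accuracy of the refit model:", test_accuracy)
```
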


#### 6. Best Performing Model

The following are the selected hyperparameter values:
<ol>
<li><b><i>max_depth</i></b>: 46
<li><b><i>min_samples_split</i></b>: 90
<li><b><i>criterion</i></b>: gini
</ol>

With the best hyperparameter combination found in the previous step, we can now build our final decision tree classifier and measure its performance.
<br>Using the <i>pickle</i> library, the model trained with the best hyperparameters is saved to the project's <i>models</i> folder for future use.



In [8]:
#path where the final model is saved
filename = "models/DT_model.pickle"

#train the model with the best hyperparameters
model_final = tree.DecisionTreeClassifier(criterion = "gini", max_depth = 46, min_samples_split = 90)
model_final.fit(X_train, y_train)

# save model; the with block closes the file once the dump completes
with open(filename, "wb") as f:
    pickle.dump(model_final, f)
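
A saved model can later be restored with `pickle.load` and used for prediction without retraining. The following is a minimal round-trip sketch under stated assumptions: it trains a small tree on synthetic data and writes to a temporary directory instead of the project's `models` folder, and the file name `DT_model_demo.pickle` is hypothetical.

```python
import os
import pickle
import tempfile

from sklearn import tree
from sklearn.datasets import make_classification

# Synthetic stand-in data and a small model (not the final model from this notebook)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, n_informative=3, random_state=0
)
model = tree.DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "DT_model_demo.pickle")

    # Save the fitted model to disk
    with open(path, "wb") as f:
        pickle.dump(model, f)

    # Load it back as a new object
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should give the same predictions as the original
same_predictions = bool((restored.predict(X_demo) == model.predict(X_demo)).all())
print("Predictions match after reload:", same_predictions)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that was used to save them.
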

In [9]:
# Measure performance of the model with the best hyperparameters
y_top_pred = model_final.predict(X_test)
print("Classification Report of the tuned model")
print(classification_report(y_test, y_top_pred))

Classification Report of the tuned model
              precision    recall  f1-score   support

           3       0.79      0.85      0.82    231390
           4       0.71      0.62      0.66    140418
           6       0.73      0.83      0.77     20572
           7       0.68      0.53      0.60     13872
           8       0.53      0.73      0.61      3444
           9       0.42      0.31      0.36      1977
          10       0.43      0.37      0.40      2228
          11       0.95      0.91      0.93     10683

    accuracy                           0.76    424584
   macro avg       0.65      0.64      0.64    424584
weighted avg       0.75      0.76      0.75    424584

