### Decision Tree Classification Model

Our prediction model, based on a decision tree, is implemented with the <b><i>scikit-learn</i></b> machine learning library in Python.


#### 1. Importing Libraries

The following libraries and functions are needed to implement the decision tree prediction model.
<br>
<br>
<b> Pandas</b>: 
<li> Used for data cleaning in preparation for the model training
<br><br>
<b> tree </b> from sklearn:
<li> scikit-learn's decision tree module, which provides the decision tree models
<br><br>
<b> train_test_split </b> from sklearn.model_selection:
<li> Dividing the data into training and testing sets for model training and performance analysis
<br><br>
<b> classification_report</b> from sklearn.metrics:
<li> Summarizing the performance of the prediction model (precision, recall, and F1-score per class)
<br><br>

In [3]:
import pandas as pd
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

#### 2. Load Dataset

The fire incident and weather dataset is loaded from a CSV file into a <i>pandas</i> dataframe.<br>
The features present in the cleaned dataset are listed below, alongside the type of data each of them holds.
<br>Only selected features will be used for the prediction model.
<br>The types shown are pandas data types; their Python equivalents are given below.
<br><br>
Pandas' datatypes and their Python equivalents:
<li> int64 = int
<li> float64 = float
<li> object = str (string)
<br><br>*Please refer to the <b><i>preprocessing</i></b> folder for the detailed implementation of the data cleaning.

In [4]:
# load dataset, print types
df = pd.read_csv('preprocessing/data/london_clean.csv')
df.dtypes

DateOfCall             int64
CalYear                int64
HourOfCall             int64
IncidentGroup          int64
PropertyCategory       int64
PropertyType           int64
NumPumpsAttending      int64
PumpHoursRoundUp       int64
Notional Cost (£)      int64
Date                  object
cloud_cover          float64
sunshine             float64
global_radiation     float64
max_temp             float64
mean_temp            float64
min_temp             float64
precipitation        float64
pressure             float64
snow_depth           float64
CostCat                int64
dtype: object
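
The type mapping listed above can be checked on a small frame. The sketch below uses a hypothetical three-column sample (the column names mirror the dataset, the values are made up) and is only meant to illustrate how pandas column types relate to Python types. Note that integer values pulled out of an `int64` column are NumPy scalars, so an explicit `int()` conversion is used where a plain Python `int` is wanted.

```python
import pandas as pd

# Hypothetical miniature frame: column names mirror the dataset, values are made up.
sample = pd.DataFrame({
    "HourOfCall": [1, 14],
    "mean_temp": [15.0, 7.6],
    "Date": ["2019-05-01", "2019-03-12"],
})

# Column-level types, as reported by pandas.
print(sample.dtypes)

# Pull one value out of each column and inspect it.
hour = sample["HourOfCall"].iloc[0]   # NumPy int64 scalar
temp = sample["mean_temp"].iloc[0]    # NumPy float64 scalar (a subclass of Python float)
date = sample["Date"].iloc[0]         # plain Python str

# Convert explicitly when a built-in Python int is required.
hour_as_int = int(hour)
print(type(hour_as_int), type(temp), type(date))
```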

#### 3. Feature Selection and Dataset Split

The features selected as input for the model implementation are as follows:
<li><b>DateOfCall</b>: the month in which the fire incident was reported
<li><b>PropertyType</b>: the type of location where the fire incident occurred
<li><b>NumPumpsAttending</b>: the total number of fire pumps deployed to the fire incident location
<li><b>PumpHoursRoundUp</b>: the number of hours the fire pumps were used during the fire incident
<li><b>mean_temp</b>: mean daily temperature in degrees Celsius (°C)
<br><br>
The entire dataset is split into two subsets for their respective use:
<li>67% of the dataset is used to train the decision tree model
<li>33% of the dataset is used to test the performance of the decision tree model
<br>
<br>
In the <i>train_test_split</i> function, the <i>random_state</i> parameter is set to 42.
<br>With a fixed value, <i>train_test_split</i> generates identical training and testing subsets every time it is called on the same data.
<br>As a result, all three prediction models are trained on the same subset, which allows a fairer comparison of their performance in a later step.
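
The effect of a fixed <i>random_state</i> can be demonstrated on a small example. The sketch below uses a hypothetical toy frame (not the London dataset) and calls <i>train_test_split</i> twice with the same seed, then checks that both calls select the same rows.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the real feature table.
toy = pd.DataFrame({"feature": range(100), "label": [i % 2 for i in range(100)]})
X_toy, y_toy = toy[["feature"]], toy["label"]

# Two independent calls with the same seed ...
split_a = train_test_split(X_toy, y_toy, test_size=0.33, random_state=42)
split_b = train_test_split(X_toy, y_toy, test_size=0.33, random_state=42)

# ... select exactly the same rows for training (position 0) and testing (position 1).
same_train = split_a[0].index.equals(split_b[0].index)
same_test = split_a[1].index.equals(split_b[1].index)
print(same_train, same_test)

# Roughly one third of the rows end up in the test set.
print(len(split_a[0]), len(split_a[1]))
```

Omitting <i>random_state</i> (or passing a different value) would generally produce a different shuffle on each call, so models trained in separate runs would not see the same rows.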

In [5]:
# do train and test split
X = df[['DateOfCall', 'PropertyType', 'NumPumpsAttending', 'PumpHoursRoundUp', 'mean_temp']]
y = df[['CostCat']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# print a small sample of train X and y
print(X_train[0:5])
print(y_train[0:5])

         DateOfCall  PropertyType  NumPumpsAttending  PumpHoursRoundUp  \
706975            5            12                  1                 1   
450954            9             6                  2                 1   
525760            6            12                  2                 1   
20577             3            83                  2                 1   
1069644          11            37                  1                 1   

         mean_temp  
706975        15.0  
450954        13.3  
525760        16.4  
20577          7.6  
1069644        8.3  
         CostCat
706975         0
450954         0
525760         0
20577          0
1069644        1


#### 4. Model Implementation and Training


In [6]:
# create and train the decision tree
# note: DecisionTreeRegressor treats the CostCat labels as numeric values
clf = tree.DecisionTreeRegressor()
clf = clf.fit(X_train, y_train)
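
The cell above fits `tree.DecisionTreeRegressor`, which treats the integer `CostCat` labels as numbers, so its predictions have to be rounded back to labels afterwards. scikit-learn also provides `tree.DecisionTreeClassifier`, which predicts class labels directly. The following is a minimal sketch of that alternative on synthetic stand-in data (an assumption made for illustration); it is not the configuration that produced the results reported in this notebook.

```python
import numpy as np
from sklearn import tree
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption for illustration, not the London dataset).
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 10, size=(500, 3))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 9).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

# A classifier learns the set of labels seen during training ...
clf_demo = tree.DecisionTreeClassifier(random_state=42)
clf_demo.fit(X_tr, y_tr)
print(clf_demo.classes_)

# ... and only ever predicts one of those labels, so no rounding step is needed.
y_hat = clf_demo.predict(X_te)
print(set(y_hat) <= set(clf_demo.classes_))
```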

In [7]:
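# predict on the test set; round the regressor's numeric output to the nearest CostCat label
y_pred = clf.predict(X_test).round()
# print per-class precision, recall, and F1-score
print(classification_report(y_test, y_pred))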

              precision    recall  f1-score   support

           0       0.78      0.85      0.81    231390
           1       0.71      0.61      0.66    140418
           2       1.00      1.00      1.00     34444
           3       0.74      0.70      0.72      5421
           4       0.57      0.60      0.58      4962
           5       0.92      0.92      0.92      7949

    accuracy                           0.78    424584
   macro avg       0.79      0.78      0.78    424584
weighted avg       0.77      0.78      0.77    424584

