### Random Forest Classification Model

A random forest model is a type of machine learning model that is used for both classification and regression tasks. 
<br>It is an ensemble model, meaning that it is made up of a collection of decision trees.
<br>Multiple decision trees are created, each trained on a random subset of the data points and considering a random subset of the available features. 
<br>The predictions of the individual trees are then combined to produce a final prediction. 
<br>
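
As a rough illustration of the ensemble idea described above, the sketch below builds a small forest by hand on synthetic data: each tree is fitted on a bootstrap sample of the rows and considers a random subset of the features at each split, and the trees' predictions are combined by majority vote. The data and all names (`X_demo`, `y_demo`, `trees`) are made up for the example. Note that scikit-learn's own `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than by counting hard votes, so this is a simplification and not the library's implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data (illustrative only, unrelated to the fire dataset)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

rng = np.random.default_rng(0)
trees = []
for seed in range(15):
    # Bootstrap sample: draw row indices with replacement
    rows = rng.integers(0, len(X_demo), size=len(X_demo))
    # max_features="sqrt" makes each split consider a random subset of features
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=seed)
    trees.append(tree.fit(X_demo[rows], y_demo[rows]))

# Collect one row of predictions per tree, then take the most common label per sample
votes = np.stack([tree.predict(X_demo) for tree in trees]).astype(int)
majority = np.array([np.bincount(column).argmax() for column in votes.T])

ensemble_accuracy = (majority == y_demo).mean()
print("Ensemble accuracy on the toy data:", ensemble_accuracy)
```
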
<br>In classification, the goal is to predict a categorical variable based on a set of input features.
<br>In our case, predicting the cost of damage for a fire incident, the input features are as follows:
<li><b>DateOfCall</b>: the month in which the fire incident was reported 
<li><b>PropertyType</b>: the type of location where the fire incident occurred
<li><b>NumPumpsAttending</b>: the total number of fire pumps that were deployed to the fire incident location
<li><b>PumpHoursRoundUp</b>: the number of hours the fire pumps were used during the fire incident
<li><b>mean_temp</b>: mean daily temperature in degrees Celsius (°C)
<br>
<br>
The output of our classification model is the category of the cost of damage, where the cost is measured in pounds sterling (£).
<br>The cost value was originally a continuous numerical variable, but we converted it to a categorical variable by binning it into intervals of £100. 
<br>For example, all records of cost between £0.00 and £100.00 fall under category 1, all records of cost between £100.01 and £200.00 fall under category 2, and so on. All records with costs of £1000.01 or more fall under category 11. 
<br>
<br>
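
The binning described above can be sketched with `pandas.cut`. This is only an illustration under the interval boundaries stated in the text; the actual conversion is part of the preprocessing code, and the sample values and variable names below (`costs`, `cost_cat`) are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical cost values in pounds sterling, chosen to sit on the interval boundaries
costs = pd.Series([0.0, 100.0, 100.01, 200.0, 999.99, 1000.0, 1000.01, 5400.0])

# Bin edges 0, 100, ..., 1000, plus an open-ended last bin for anything above 1000
bin_edges = list(range(0, 1001, 100)) + [np.inf]
categories = list(range(1, 12))  # categories 1..11

# Intervals are closed on the right, so exactly 100.00 belongs to category 1;
# include_lowest=True makes the first interval also contain 0.00
cost_cat = pd.cut(costs, bins=bin_edges, labels=categories, include_lowest=True).astype(int)

print(pd.DataFrame({"cost": costs, "category": cost_cat}))
```
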

#### Scikit-Learn Library

Our prediction model, based on the random forest classifier, is implemented with the <b><i>scikit-learn</i></b> machine learning library in Python.<br>
Please follow <b>scikit-learn</b>'s installation guide ([https://scikit-learn.org/stable/install.html](https://scikit-learn.org/stable/install.html)) and have the library ready before running the code.<br>



#### 1. Importing Libraries

The following libraries and functions are necessary to implement the random forest prediction model.
<br>
<br>
<b> Pandas</b>: 
<li> Data manipulation library
<br><br>
<b> Numpy</b>: 
<li> Numerical computing library
<br><br>
<b> train_test_split </b> from sklearn.model_selection:
<li> Dividing the data into training and testing sets for model training and performance analysis
<br><br>
<b> RandomForestClassifier </b> from sklearn.ensemble:
<li> Scikit-learn's random forest classifier
<br><br>
<b> classification_report</b> from sklearn.metrics:
<li> Summarizing and measuring the performance of the prediction model
<br><br>
<b> GridSearchCV</b> from sklearn.model_selection:
<li> Hyperparameter tuning
<br><br>
<b>pickle</b>:
<li> Saving and loading machine learning model

In [2]:
#import all necessary libraries for random forest model training and testing

#libraries for the data manipulation
import os
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# library for data visualization
from matplotlib import pyplot as plt
# used for matplotlib in jupyter notebooks
%matplotlib inline 
import seaborn as sns

#library for random forest model
from sklearn.ensemble import RandomForestClassifier

# library for measuring the performance metrics
from sklearn.metrics import accuracy_score, confusion_matrix, r2_score, classification_report

#for hyperparameter tuning
from sklearn.model_selection import GridSearchCV

#library for saving the final classifier model
import pickle

# To change scientific numbers to float
np.set_printoptions(formatter={'float_kind':'{:f}'.format})
# Increases the size of sns plots
sns.set(rc={'figure.figsize':(8,6)})
import warnings
warnings.filterwarnings('ignore')

# from sklearn.preprocessing import MinMaxScaler
# from sklearn import tree
# from sklearn.tree import DecisionTreeClassifier, export_graphviz
# Datetime lib
from pandas import to_datetime
import itertools
import datetime

#### 2. Load Dataset

The fire incident and weather dataset is loaded from a CSV file into a <i>pandas</i> DataFrame.<br>
The features present in the cleaned dataset are listed below, alongside the type of data each of them holds.
<br>Only selected features will be used for the prediction model.
<br>The list below gives the Python equivalent of each pandas data type.
<br><br>
Pandas' datatypes and their Python equivalents:
<li> int64 = int
<li> float64 = float
<li> object = string
<br><br>*Please refer to the <b><i>preprocessing</i></b> folder for the detailed implementation of the data cleaning.

In [3]:
data = pd.read_csv('preprocessing/data/london_clean.csv')
data.dtypes

DateOfCall             int64
CalYear                int64
HourOfCall             int64
IncidentGroup          int64
PropertyCategory       int64
PropertyType           int64
NumPumpsAttending      int64
PumpHoursRoundUp       int64
Notional Cost (£)      int64
Date                  object
cloud_cover          float64
sunshine             float64
global_radiation     float64
max_temp             float64
mean_temp            float64
min_temp             float64
precipitation        float64
pressure             float64
snow_depth           float64
CostCat                int64
dtype: object

#### 3. Feature Selection and Dataset Split

First, the entire dataset is split into input and output features, <i>X</i> and <i>y</i> respectively.
<br>We wish to predict the feature in <i>y</i> from the features grouped in <i>X</i>.
<br>
<br>
Both the input and the output features are then split into two subsets with Scikit-Learn's <i>train_test_split</i> function:
<li>67% of the dataset is used to train the random forest model
<li>33% of the dataset is used to test the performance of the random forest model
<br>
<br>
In the <i>train_test_split</i> call, the <i>random_state</i> parameter is set to 42.
<br>This makes <i>train_test_split</i> generate identical training and testing subsets every time it is called.
<br>As a result, all three prediction models can be trained on the same data, which allows a fairer comparison of their performance in a later step.

In [4]:
X = data[['DateOfCall', 'PropertyType', 'NumPumpsAttending', 'PumpHoursRoundUp', 'mean_temp']]
y = data[['CostCat']]

print('X shape: {}'.format(np.shape(X)))
print('y shape: {}'.format(np.shape(y)))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

X shape: (1286617, 5)
y shape: (1286617, 1)
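
The reproducibility of the split can be checked with a small experiment. The sketch below uses toy arrays (`X_demo` and `y_demo` are made up for the example) and the same `test_size` and `random_state` values as above; it calls `train_test_split` twice and confirms that both calls return identical subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 samples with 2 features each (illustrative only)
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20) % 2

# Two independent calls with the same random_state
split_a = train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)

# Each call returns X_train, X_test, y_train, y_test; compare them pairwise
identical = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print("Identical splits:", identical)
print("Train size:", len(split_a[0]), "Test size:", len(split_a[1]))
```
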


#### 4. Model Implementation and Training
Scikit-Learn's Random Forest classifier offers a range of parameters that can be tuned to control the behaviour of the random forest model.
<br>
The following are the parameters we are interested in:
<ol>
<li><b><i>n_estimators</i></b>: 
<br> - The number of trees in the forest.
<br> - Builds multiple decision trees on different sub-samples of the dataset and averages their predictions to obtain a more stable and accurate prediction.
<br> - Default value = 100
<li><b><i>criterion</i></b>: 
<br> - Specifies the function used to measure the quality of a split. 
<br> - The three options are "gini", "entropy" and "log_loss": "gini" uses the Gini impurity, while "entropy" and "log_loss" both use the Shannon information gain.
<br> - Default value = "gini"
<li><b><i>max_depth</i></b>: 
<br> - Sets the maximum depth of each decision tree
<br> - A deeper tree can capture more complex relationships in the data, but can also lead to overfitting
<br> - Default value = None (nodes are expanded until all leaves are pure or contain fewer than <i>min_samples_split</i> samples)
<li><b><i>max_features</i></b>: 
<br> - The number of features to consider when looking for the best split. 
<br> - Helps prevent overfitting and improves the generalization of the model.
<br> - Default value = "sqrt"

</ol>
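
To make the effect of these parameters concrete, here is a minimal sketch on synthetic data (not the fire dataset; `X_demo`, `y_demo` and `clf` are names invented for the example). It sets all four parameters explicitly and then inspects the fitted forest to confirm that the number of trees and their depth respect the chosen limits.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic three-class problem with 5 features (illustrative only)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0,
    n_classes=3, random_state=42,
)

clf = RandomForestClassifier(
    n_estimators=20,      # number of trees in the forest
    criterion="gini",     # split quality measure
    max_depth=4,          # no tree may grow deeper than 4 levels
    max_features="sqrt",  # features considered at each split
    random_state=42,
)
clf.fit(X_demo, y_demo)

# The fitted trees are available in clf.estimators_
n_trees = len(clf.estimators_)
deepest = max(tree.get_depth() for tree in clf.estimators_)
print("Number of trees:", n_trees)
print("Deepest tree:", deepest)
```
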


In [5]:
rf = RandomForestClassifier(n_estimators=10, criterion='entropy')
rf.fit(X_train, y_train)
prediction_test = rf.predict(X=X_test)

# source: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

# Accuracy on the training and testing sets
print("Training Accuracy is: ", rf.score(X_train, y_train))
print("Testing Accuracy is: ", rf.score(X_test, y_test))
print(classification_report(y_test, prediction_test))

Training Accuracy is:  0.839866919247871
Testing Accuracy is:  0.7713079155125958
              precision    recall  f1-score   support

           0       0.78      0.83      0.80    231390
           1       0.69      0.62      0.65    140418
           2       1.00      1.00      1.00     34444
           3       0.70      0.78      0.74      5421
           4       0.58      0.51      0.54      4962
           5       0.92      0.92      0.92      7949

    accuracy                           0.77    424584
   macro avg       0.78      0.78      0.78    424584
weighted avg       0.77      0.77      0.77    424584
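
For readers unfamiliar with the columns of the report, the sketch below computes precision and recall for one class by hand on a tiny made-up example and compares the result with scikit-learn's `precision_score` and `recall_score`. The labels in `y_true` and `y_pred` are invented for illustration and have no relation to the results above.

```python
from sklearn.metrics import classification_report, precision_score, recall_score

# Made-up labels for six samples and three classes
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]

# Class 1 by hand:
# precision = correct predictions of class 1 / all predictions of class 1
# recall    = correct predictions of class 1 / all true members of class 1
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
predicted_as_1 = sum(p == 1 for p in y_pred)
actually_1 = sum(t == 1 for t in y_true)
precision_1 = true_positives / predicted_as_1
recall_1 = true_positives / actually_1

# The same quantities from scikit-learn, one value per class
sk_precision = precision_score(y_true, y_pred, average=None)
sk_recall = recall_score(y_true, y_pred, average=None)

print("Class 1 precision:", precision_1, "recall:", recall_1)
print(classification_report(y_true, y_pred))
```
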



#### 5. Hyper-parameter Tuning

Now that the random forest model has been shown to produce acceptable performance, we tune its hyperparameters to find the combination that maximizes the model's performance score.

Scikit-Learn offers an exhaustive search algorithm called <b><i>GridSearchCV</i></b> that lets us easily determine the best hyperparameter combination.
<br>The algorithm iterates over every possible combination of the input hyperparameter values, scores each one with cross-validation, and selects the best one.

The following are the values of hyperparameters tested:
<ol>
<li><b><i>n_estimators</i></b>: 
<br> - [2, 3, 5]
<li><b><i>criterion</i></b>: 
<br> - "gini", "entropy" and "log_loss"
<li><b><i>max_depth</i></b>: 
<br> - [5, 7, 10]
<li><b><i>max_features</i></b>: 
<br> - "sqrt", "log2" and None
</ol>




In [6]:
tree_params = {'n_estimators':[2, 3, 5],
               'criterion':['gini','entropy', 'log_loss'],
               'max_depth':[5, 7, 10], 
               'max_features':['sqrt', 'log2', None]}
rf_top = GridSearchCV( RandomForestClassifier(), tree_params, cv=5)

# Training the model with each combination
rf_top2 = rf_top.fit(X_train, y_train)

# Display the best hyperparameters
print("Best hyperparameters for random forest classification: ", rf_top.best_params_)


Best hyperparameters for random forest classification:  {'criterion': 'log_loss', 'max_depth': 10, 'max_features': None, 'n_estimators': 3}
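
The cost of the exhaustive search grows with the product of the value counts. The sketch below counts the candidates for the grid used in the code cell of this section with `ParameterGrid`, and then runs a much smaller search on synthetic data to show how `best_params_` and `best_score_` are read. The synthetic data, the reduced grid (`small_grid`) and the variable names are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, ParameterGrid

# Same grid as in the search above
tree_params = {'n_estimators': [2, 3, 5],
               'criterion': ['gini', 'entropy', 'log_loss'],
               'max_depth': [5, 7, 10],
               'max_features': ['sqrt', 'log2', None]}

# 3 * 3 * 3 * 3 candidate combinations, each fitted once per fold with cv=5
n_candidates = len(ParameterGrid(tree_params))
n_fits = n_candidates * 5
print("Candidates:", n_candidates, "Total fits with cv=5:", n_fits)

# A much smaller search on synthetic data (illustrative only)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=42
)
small_grid = {'n_estimators': [2, 5], 'max_depth': [3, 5]}
search = GridSearchCV(RandomForestClassifier(random_state=42), small_grid, cv=5)
search.fit(X_demo, y_demo)

# best_score_ is the mean cross-validated accuracy of the best candidate
print("Best parameters:", search.best_params_)
print("Best mean CV accuracy:", search.best_score_)
```
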


#### 6. Best Performing Model

The following are the selected hyperparameter values:
<ol>
<li><b><i>criterion</i></b>: log_loss
<li><b><i>max_depth</i></b>: 10
<li><b><i>max_features</i></b>: None
<li><b><i>n_estimators</i></b>: 3
</ol>

With the best hyperparameter combination found in the previous step, we can now build our final random forest classifier and measure its performance.
<br>Using the <i>pickle</i> library, the model with the best hyperparameters is saved to a file in the project for future use.



In [10]:
# path where the model with the best-performing hyperparameters is saved
filename = "models/RF_model.pickle"

# train the model with the best hyperparameters
model_final = RandomForestClassifier(criterion="log_loss", max_depth=10, max_features=None, n_estimators=3)
model_final.fit(X_train, y_train)

# save the model; the context manager closes the file after writing
with open(filename, "wb") as model_file:
    pickle.dump(model_final, model_file)
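
Loading the saved model back is the mirror image of saving it. The sketch below shows a save and load round trip on a small stand-in model written to a temporary directory, so it does not touch the project's model file; the file name `RF_model_demo.pickle` and the synthetic data are hypothetical. A pickled model should only be loaded from a trusted source and, ideally, with the same scikit-learn version that created it.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small stand-in model trained on synthetic data (illustrative only)
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)
model = RandomForestClassifier(n_estimators=3, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "RF_model_demo.pickle")

    # Save the fitted model
    with open(path, "wb") as model_file:
        pickle.dump(model, model_file)

    # Load it back into a new object
    with open(path, "rb") as model_file:
        restored = pickle.load(model_file)

# The restored model should reproduce the original predictions exactly
same_predictions = bool((restored.predict(X_demo) == model.predict(X_demo)).all())
print("Restored model gives identical predictions:", same_predictions)
```
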

In [11]:
# Measure the performance of the model with the best hyperparameters
y_top_pred = model_final.predict(X_test)
print("Classification Report of the tuned model")
print(classification_report(y_test, y_top_pred))

Classification Report of the tuned model
              precision    recall  f1-score   support

           0       0.65      0.97      0.78    231390
           1       0.73      0.14      0.23    140418
           2       1.00      1.00      1.00     34444
           3       0.66      0.94      0.77      5421
           4       0.68      0.27      0.39      4962
           5       0.89      0.96      0.92      7949

    accuracy                           0.69    424584
   macro avg       0.77      0.71      0.68    424584
weighted avg       0.71      0.69      0.61    424584



#### 7. Cross Validation
While GridSearchCV already ran cross-validation, we ran it again with different numbers of folds (8 and 10) to make sure the accuracy results are not being influenced by one particular split of the data. We observed that the accuracy of the model remains close to the values we obtained with 5-fold GridSearchCV.

In [9]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from numpy import mean
from numpy import std

folds = [8, 10]
for i in folds:
    cross_val = KFold(n_splits=i, random_state=42, shuffle=True)
    scores = cross_val_score(rf_top, X, y, scoring='accuracy', cv=cross_val, n_jobs=4)
    print("Testing with {} fold:".format(i))
    print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Testing with 8 fold:
Accuracy: 0.670 (0.002)
Testing with 10 fold:
Accuracy: 0.669 (0.002)
