### Data preprocessing

Before we start building and training our prediction model, we need to prepare and clean the dataset to ensure the viability of the results and analysis of our experiment.

<br>To reproduce these steps on your own machine, please download the relevant files beforehand.
<br>Refer to README.md for access to the data CSV files.

### 1. Import libraries

This preprocessing step requires the following libraries:

<b>Pandas</b>: 
<li> Data manipulation library</li>
<li> Loading the dataset</li>

<b>datetime</b>: 
<li> Date data formatting</li>

<br>The specific operations needed from these libraries will be explained as we advance in the tutorial.

In [1]:
import pandas as pd
from datetime import datetime

### 2. Load the dataset 

The <b><i>lfb_incident.csv</i></b> file is not included in this repository's <i>data</i> folder due to its size.
<br>Please download it from Kaggle via the links provided in the README at the root of the repository.
<br><b>* Note: </b><i>You are not required to save the CSV file under the 'data' folder of the repository. Please make sure the path on line 4 of the Python code cell below points to wherever you saved the file.</i>
<br>
<br>The <i>used_cols</i> variable contains the list of columns we intend to use for our prediction model.
<br>The choice of columns was at first based on what we thought would work best for predicting fire incidents in a real-world scenario. 
<br>We asked ourselves: right after the fire department receives a fire call and sends the firefighters to the incident location, what kind of data do they have about the incident?
- Date and time
- Location (e.g. they know if it is in a building, a house, a park, etc.)
- Number of fire engines sent to work on the incident
- Number of firefighters sent to the incident location

<br>Based on this assumption, we selected the corresponding columns of the dataset, as well as the variable we want to predict at the end: the cost of operation. 
<br>After preprocessing the data, we will check the correlation between these variables in another notebook (preprocessing/2_check_correlation.ipynb) to make sure our assumptions were correct. 

<br>At this point in the project we have not done any filtering, so the dataset contains 1,465,060 rows in total.

In [2]:
used_cols = ['DateOfCall', 'CalYear', 'HourOfCall', 'IncidentGroup', 
'PropertyType', 'PumpHoursRoundUp', 'NumPumpsAttending', 
'Notional Cost (£)', 'PropertyCategory']
ds = pd.read_csv('data/lfb_incident.csv', usecols=used_cols)
print("Count:", len(ds))

Count: 1465060


### 2. Drop NaN values

The first step of our data preprocessing is to drop any rows containing <i>NaN</i> values.
<br>After this first step, the dataset contains 1,453,312 rows in total.

In [3]:
ds = ds.dropna()
print("Count:", len(ds))

Count: 1453312
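
Before dropping rows, it can help to see which columns actually contain the missing values. The following is a minimal sketch on a made-up toy frame (the column values are illustrative, not taken from the real dataset); it uses the same `dropna()` call as the cell above.

```python
import pandas as pd

# Toy frame standing in for the incident data (values are made up).
toy = pd.DataFrame({
    "HourOfCall": [0, 1, None, 3],
    "PropertyType": ["Car", None, "House", "Car"],
})

# Number of missing values per column, before dropping anything.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Same operation as in the notebook: drop every row with at least one NaN.
cleaned = toy.dropna()
print("Rows before:", len(toy), "after:", len(cleaned))
```

Running `ds.isna().sum()` on the real dataset before the drop would show, per column, where the removed rows come from.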


### 3. Duplicate DateOfCall column for reformatting

The next step is to convert the values in DateOfCall column into a format that is more commonly used: <b>month/day/year</b>.
<br>This conversion is necessary in order to join the fire incident data with the weather data. 
<br>Refer to step 9 for details.

In [4]:
# work on an explicit copy so that new columns are not added to ds as well
ds2 = ds.copy()
ds2['Date'] = ds2['DateOfCall']
ds2['Date'] = pd.to_datetime(ds2['Date']).dt.strftime('%m/%d/%Y')
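
The parse-then-format pattern used above can be sketched on a tiny example. The raw date strings below are hypothetical (the actual format of DateOfCall in the CSV is not shown in this notebook), and passing an explicit `format` is an optional safeguard rather than something the original code does.

```python
import pandas as pd

# Hypothetical raw values; the real DateOfCall format may differ.
raw_dates = pd.Series(["2009-01-01", "2010-03-15"])

# An explicit format avoids any day/month ambiguity during parsing.
parsed = pd.to_datetime(raw_dates, format="%Y-%m-%d")

# strftime returns plain strings in month/day/year order.
formatted = parsed.dt.strftime("%m/%d/%Y")
print(formatted.tolist())
```

Note that the result is a column of strings, not datetimes, so the later join on this column is a string comparison and both datasets must use exactly the same format.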

### 4. Convert DateOfCall column
Next, we bin the DateOfCall column's values into 12 categories representing the month of the incident date. 
<br>This step is necessary because we need the date feature to be a discrete variable (one with distinct, countable values).
<br>However, the column currently contains too many distinct dates (day/month/year combinations). 
<br>Thus, we decided to reduce these values to 12 categories, each representing a month of the year. 

In [5]:
ds2['DateOfCall'] = pd.to_datetime(ds2['DateOfCall']).dt.month
ds2[0:5]

Unnamed: 0,DateOfCall,CalYear,HourOfCall,IncidentGroup,PropertyCategory,PropertyType,NumPumpsAttending,PumpHoursRoundUp,Notional Cost (£),Date
0,1,2009,0,Special Service,Road Vehicle,Car,2.0,1.0,255.0,01/01/2009
2,1,2009,0,Fire,Outdoor,Road surface/pavement,1.0,1.0,255.0,01/01/2009
3,1,2009,0,Fire,Outdoor,Domestic garden (vegetation not equipment),1.0,1.0,255.0,01/01/2009
4,1,2009,0,Fire,Outdoor,Cycle path/public footpath/bridleway,2.0,1.0,255.0,01/01/2009
5,1,2009,0,False Alarm,Dwelling,Purpose Built Flats/Maisonettes - Up to 3 stor...,2.0,1.0,255.0,01/01/2009


### 5. Standardizing numerical variables
<br> In this step, we cast all numerical variables (e.g. number of pumps, cost, etc.) to the standard 64-bit integer type of the Pandas library: <b>int64</b>.

In [6]:
ds2['NumPumpsAttending'] = ds2['NumPumpsAttending'].astype('int64')
ds2['Notional Cost (£)'] = ds2['Notional Cost (£)'].astype('int64')
ds2['PumpHoursRoundUp'] = ds2['PumpHoursRoundUp'].astype('int64')
ds2['HourOfCall'] = ds2['HourOfCall'].astype('int64')
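
A plain cast to `int64` would silently discard any fractional part, so a quick check that the values are whole numbers can be worthwhile. This is a minimal sketch on made-up values, not part of the original pipeline.

```python
import pandas as pd

# Made-up float values similar to those in NumPumpsAttending.
pumps = pd.Series([2.0, 1.0, 3.0])

# True only if every value is a whole number, i.e. the cast loses nothing.
is_whole = bool((pumps % 1 == 0).all())
print("All values whole:", is_whole)

# Same cast as in the notebook.
pumps_int = pumps.astype("int64")
print(pumps_int.dtype)
```

The cast also requires that no NaN values remain, which is why it comes after the drop step.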

### 6. Standardizing categorical variables

All categorical variables (e.g. incident group, property type, etc.) of the original dataset were strings (e.g. "building" or "house" for PropertyType).
<br> We factorize them or, in other words, we convert them to categorical integer values in order of first appearance, starting from 0 (e.g. "building" becomes 0, "house" becomes 1, etc.) to allow easy computation in later steps.

In [7]:
ds2['IncidentGroup'] = pd.factorize(ds2['IncidentGroup'])[0]
ds2['PropertyType'] = pd.factorize(ds2['PropertyType'])[0]
ds2['PropertyCategory'] = pd.factorize(ds2['PropertyCategory'])[0]
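
The cell above keeps only the integer codes (element `[0]` of the result). `pd.factorize` also returns the unique labels, which can be kept to translate codes back to names later. A minimal sketch, using a small made-up series and a hypothetical `mapping` variable:

```python
import pandas as pd

# Small illustrative series of incident groups.
groups = pd.Series(["Special Service", "Fire", "Fire", "False Alarm"])

# codes: integers in order of first appearance, starting at 0.
# uniques: the original labels, indexed by their code.
codes, uniques = pd.factorize(groups)

# Hypothetical lookup table from code back to label.
mapping = dict(enumerate(uniques))
print(codes.tolist())
print(mapping)
```

Storing such a mapping alongside the cleaned data makes model outputs easier to interpret.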

### 7. Save

A lot of changes have been made to the dataset so far.
<br>At this point we save the new dataset to the disk. 
<br>Next, we will join this dataset with another one, containing weather data. 
<br>The new dataset contains a total of 1,453,312 rows.

In [8]:
ds2.to_csv('data/london_clean.csv', index=False)
len(ds2)

1453312

### 8. Load and prepare weather data

As was the case for the fire incident dataset, please download <b><i>london_weather.csv</i></b> from the link provided in the README.
<br>After loading the dataset, we need to convert the <i>date</i> column so its values are in the same format as the date values of the previous dataset (month/day/year), and rename it to <i>Date</i>.
<br>This step is very important because this column will be used to join both datasets in the following step.

In [9]:
weather_data = pd.read_csv('data/london_weather.csv')
weather_data['date'] = pd.to_datetime(weather_data['date'], format='%Y%m%d').dt.strftime('%m/%d/%Y')
weather_data = weather_data.rename(columns={'date': 'Date'})
weather_data[0:5]

Unnamed: 0,Date,cloud_cover,sunshine,global_radiation,max_temp,mean_temp,min_temp,precipitation,pressure,snow_depth
0,01/01/1979,2.0,7.0,52.0,2.3,-4.1,-7.5,0.4,101900.0,9.0
1,01/02/1979,6.0,1.7,27.0,1.6,-2.6,-7.5,0.0,102530.0,8.0
2,01/03/1979,5.0,0.0,13.0,1.3,-2.8,-7.2,0.0,102050.0,4.0
3,01/04/1979,8.0,0.0,13.0,-0.3,-2.6,-6.5,0.0,100840.0,2.0
4,01/05/1979,6.0,2.0,29.0,5.6,-0.8,-1.4,0.0,102250.0,1.0


### 9. Join the datasets

While this operation is frequently called "join" in the context of relational databases, it is called <i>merge</i> in the Pandas library. 
<br>The principle, however, remains the same: we choose a column that exists in both datasets and we perform an <b><i>inner join</i></b> (intersection of keys from both datasets), which is the default behavior of <i>pd.merge</i>.

In [10]:
ds_merged = pd.merge(ds2, weather_data, on='Date')
ds_merged[0:5]

Unnamed: 0,DateOfCall,CalYear,HourOfCall,IncidentGroup,PropertyCategory,PropertyType,NumPumpsAttending,PumpHoursRoundUp,Notional Cost (£),Date,cloud_cover,sunshine,global_radiation,max_temp,mean_temp,min_temp,precipitation,pressure,snow_depth
0,1,2009,0,0,0,0,2,1,255,01/01/2009,8.0,0.0,13.0,3.5,1.5,-0.5,0.0,103010.0,0.0
1,1,2009,0,1,1,1,1,1,255,01/01/2009,8.0,0.0,13.0,3.5,1.5,-0.5,0.0,103010.0,0.0
2,1,2009,0,1,1,2,1,1,255,01/01/2009,8.0,0.0,13.0,3.5,1.5,-0.5,0.0,103010.0,0.0
3,1,2009,0,1,1,3,2,1,255,01/01/2009,8.0,0.0,13.0,3.5,1.5,-0.5,0.0,103010.0,0.0
4,1,2009,0,2,2,4,2,1,255,01/01/2009,8.0,0.0,13.0,3.5,1.5,-0.5,0.0,103010.0,0.0
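
Because an inner join silently drops incidents whose date has no match in the weather data, it can be useful to count those rows. The sketch below uses two made-up frames; the `validate` and `indicator` arguments are optional extras of `pd.merge` that the original cell does not use.

```python
import pandas as pd

# Made-up stand-ins for the incident and weather data.
incidents = pd.DataFrame({
    "Date": ["01/01/2009", "01/01/2009", "01/02/2009"],
    "NumPumpsAttending": [2, 1, 1],
})
weather = pd.DataFrame({
    "Date": ["01/01/2009", "01/03/2009"],
    "mean_temp": [1.5, 2.0],
})

# Inner join as in the notebook; validate checks that each date
# appears at most once in the weather data.
inner = pd.merge(incidents, weather, on="Date", how="inner",
                 validate="many_to_one")

# A left join with indicator=True marks incidents without weather data.
check = pd.merge(incidents, weather, on="Date", how="left", indicator=True)
unmatched = int((check["_merge"] == "left_only").sum())
print("Joined rows:", len(inner), "unmatched incidents:", unmatched)
```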


### 10. Remove rows without temperature data
Temperature is an important factor in fire incidents.
<br> For example, higher temperatures can make a fire harder to put out and can increase the probability of some kinds of fire incidents. 
<br>This is why we selected this specific column for our prediction model and remove all rows that do not contain this data. 
<br> After this step, the dataset contains 1,286,617 rows.

In [11]:
ds_merged = ds_merged[ds_merged['mean_temp'].notna()]
print(len(ds_merged))

1286617


### 11. Creating Cost Column

Our output feature is the fire pumps' notional cost in pound sterling (£). 
<br>The cost value was originally a continuous numerical variable, but we convert it to a categorical variable by binning it into intervals: the first interval covers costs below £300, and each following interval is £200 wide. 
<br>For example, all costs below £300 fall under category 0, all costs from £300 up to (but not including) £500 fall under category 1, and so on. 
<br>All costs of £1100 or more fall under category 5. 
<br><br>
<b>IMPORTANT</b>: This step MUST be performed after the join and all other filtering steps. 
<br>If we do this step first and filter afterwards, we risk not having all categories from 0 to 5 represented in the dataset, which will cause problems when training the tree-based prediction models.


In [12]:
# determine which category each cost belongs to
cost_cat = []
for item in ds_merged['Notional Cost (£)'].values:
    if item < 300:
        cost_cat.append(0)
    elif 300 <= item < 500:
        cost_cat.append(1)
    elif 500 <= item < 700:
        cost_cat.append(2)
    elif 700 <= item < 900:
        cost_cat.append(3)
    elif 900 <= item < 1100:
        cost_cat.append(4)
    else:
        # costs of 1100 or more
        cost_cat.append(5)

# add cost categories to dataset
ds_merged['CostCat'] = cost_cat

# check whether the categories exist and how many corresponding rows they have
print(ds_merged['CostCat'].value_counts())
print(len(ds_merged))

0    701094
1    425134
2    104288
5     24706
3     16375
4     15020
Name: CostCat, dtype: int64
1286617
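
The same binning rule can also be expressed without an explicit loop. The following is a sketch using `pd.cut` on made-up cost values; it is an alternative formulation, not the code used to produce the counts above, and `cost_category` is a hypothetical helper that restates the loop's rule for comparison.

```python
import pandas as pd

def cost_category(cost):
    """Hypothetical helper restating the loop's rule for one value."""
    for category, upper in enumerate([300, 500, 700, 900, 1100]):
        if cost < upper:
            return category
    return 5

# Made-up costs chosen to hit every interval boundary.
costs = pd.Series([255, 300, 499, 500, 899, 900, 1100, 5000])

# Left-closed intervals [a, b): 300 goes to category 1, 1100 to category 5.
edges = [float("-inf"), 300, 500, 700, 900, 1100, float("inf")]
cats = pd.cut(costs, bins=edges, right=False, labels=False)
print(cats.tolist())
```

On a dataset of more than a million rows, a vectorized call like this is typically much faster than appending to a Python list row by row.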


### 12. Save
Once all steps are completed successfully, we save the result to the disk. 
<br>The new dataset will be used to train our prediction models.

In [13]:
ds_merged.to_csv('data/london_clean.csv', index=False)