# Data mining TorontoFireIncidents

### Custom Modules

- data_clean

  Classes:
  
  - `DataCleaner`:
  
    **Functions**:
    
    - `createPipeline(df)`: Returns a pipeline with an imputer.
    - `cleanse_dataframe(df)`: Returns a cleansed dataframe.
    
---
- data_reduction

  Classes:
  
  - `FeatureAnalysis`:
  
    **Functions**:
    
    - `keepStrongestFeaturesInDataFrame(responseVariable, df)`: Returns a dataframe with variables that have a strong correlation to `responseVariable`.
---
- feature_transformers

  Classes:
  
  - `FeatureTransformer`:
  
    **Functions**:
    
    - `createTransformerPipeline(df)`: Returns a pipeline with ordinal and one-hot encoders that also scales numerical features and log-transforms the response variable.

### Preprocessing Pipeline

The pipeline is designed as follows:

- `Pipeline`
  - `Preprocessor`
    - `Data_cleaning (Ryan)`
      - Drop Null rows (inside data_clean.cleanse_dataframe() method)
      - Drop False positives (inside data_clean.cleanse_dataframe() method)
      - Impute missing values
      - Remove outliers (inside data_clean.cleanse_dataframe() method)
    - `Feature_Engineering` (may live in a helper function outside the sklearn pipeline)
      - Create a new feature, control time (how long the fire burned)
      - Create a new feature, response time (how long the first arriving unit took to reach the incident)
    - `feature_transformers (Ryan)`
      - categorical one hot encoding
      - categorical ordinal encoding
      - numerical scaler
      - log transformation on response variable
      - normalize features
    - `Data_reduction (Ryan)` (will use SelectKBest)
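      - `SelectKBest` defaults to `f_classif` (the ANOVA F-value, intended for classification); since the response variable here is continuous, `f_regression` is the appropriate score function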
  - `model(regressor)` 
    - Linear models
      - Multiple Linear Regression (OLS - Ordinary Least Squares): `(Maurilio)`
      - Lasso (Least Absolute Shrinkage and Selection Operator): `(Maurilio)`
      - Elastic-Net: `(Maurilio)`
      - Huber Regressor:
    - Ensemble methods
      - XGBoost Regressor
    - Non-Linear Models
      - Neural networks (MLP - Multi-layer Perceptron):
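
The layout above can be sketched end to end on toy data. This is only an illustration under stated assumptions: the toy columns (`extent`, `property_use`, `personnel`, `loss`) are hypothetical stand-ins, and the custom modules are replaced here by plain scikit-learn components (`ColumnTransformer`, `SelectKBest` with `f_regression`, and `TransformedTargetRegressor` for the log transform of the response variable).

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Toy stand-in for the fire incidents data (hypothetical columns and values)
toy = pd.DataFrame({
    "extent": ["1 - small", "2 - medium", "3 - large",
               "1 - small", "2 - medium", "3 - large"],
    "property_use": ["house", "office", "house", "shop", "office", "shop"],
    "personnel": [4.0, 10.0, 30.0, 6.0, 12.0, 25.0],
    "loss": [500.0, 5000.0, 90000.0, 800.0, 7000.0, 60000.0],
})
X_toy, y_toy = toy.drop(columns="loss"), toy["loss"]

# Preprocessor: ordinal encoding, one-hot encoding and numerical scaling
preprocessor = ColumnTransformer(transformers=[
    ("ordinal", OrdinalEncoder(), ["extent"]),
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["property_use"]),
    ("numeric", StandardScaler(), ["personnel"]),
])

sketch_pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    # Data reduction: keep the k features with the highest regression F-value
    ("feature_selection", SelectKBest(score_func=f_regression, k=3)),
    # Model: fit on log1p(y), predictions are mapped back with expm1
    ("model", TransformedTargetRegressor(
        regressor=LinearRegression(), func=np.log1p, inverse_func=np.expm1)),
])

sketch_pipeline.fit(X_toy, y_toy)
predictions = sketch_pipeline.predict(X_toy)
print(predictions)
```

Wrapping the regressor in `TransformedTargetRegressor` keeps the log transform of the response inside the pipeline, so predictions come back on the original dollar scale.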


In [1]:
# Third-party libraries
import pandas as pd

from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder, StandardScaler

In [2]:
# Load DF
df = pd.read_csv('../../data/raw/Fire_Incidents_Data.csv', low_memory=False)

### Data Cleaning
The following data will be removed:
- False positives (Final_Incident_Type: 03 - NO LOSS OUTDOOR fire (exc: Sus.arson,vandal,child playing,recycling, or dump fires))
- Rows with null values for Estimated_Dollar_Loss (the response variable) or Area_of_Origin

Missing data will be imputed using `SimpleImputer` with the 'most_frequent' strategy.
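
As a rough illustration of these cleaning rules, the sketch below applies them to a small hand-made frame. It is not the code inside `DataCleaner` (which is not shown here); the toy values are invented, and only the column names are taken from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy rows (invented values) using column names from the dataset
toy = pd.DataFrame({
    "Final_Incident_Type": ["01 - Fire", "03 - NO LOSS OUTDOOR fire", "01 - Fire",
                            "01 - Fire", "01 - Fire", "01 - Fire"],
    "Area_of_Origin": ["28 - Office", "71 - Open Area", None,
                       "28 - Office", "24 - Cooking Area or Kitchen", "28 - Office"],
    "Estimated_Dollar_Loss": [1000.0, 0.0, 500.0, np.nan, 2500.0, 300.0],
    "Incident_Ward": [22.0, 8.0, 8.0, 1.0, np.nan, 22.0],
})

# 1) Drop false positives (incident type 03)
is_false_positive = toy["Final_Incident_Type"].str.startswith("03")
cleaned = toy[~is_false_positive]

# 2) Drop rows with a null response variable or a null Area_of_Origin
cleaned = cleaned.dropna(subset=["Estimated_Dollar_Loss", "Area_of_Origin"])

# 3) Impute the remaining gaps with the most frequent value of each column
imputer = SimpleImputer(strategy="most_frequent")
imputed = pd.DataFrame(
    imputer.fit_transform(cleaned), columns=cleaned.columns, index=cleaned.index
)
print(imputed)
```

The 'most_frequent' strategy works for both string and numeric columns, which is why a single imputer can cover the whole frame.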


In [3]:
# Import Data_cleaning module
from modules.data_clean import DataCleaner

# Cleanse Dataframe
df = DataCleaner.cleanse_dataframe(df)

# Data_cleaning pipeline (contains imputer)
missing_data_imputer = DataCleaner.createPipeline(df) 

### Feature Engineering

In [4]:
# feature_engineering pipeline
feature_engineering = Pipeline([])
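
The feature engineering step is still a to-do. A minimal sketch of the two planned features could look like the following. The timestamp column names (`TFS_Alarm_Time`, `TFS_Arrival_Time`, `Fire_Under_Control_Time`), the output column names, and the choice to measure control time from the alarm are all assumptions made for illustration; they do not appear in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

def add_time_features(frame,
                      alarm_col="TFS_Alarm_Time",
                      arrival_col="TFS_Arrival_Time",
                      control_col="Fire_Under_Control_Time"):
    """Return a copy of frame with response and control time in minutes.

    The column names are assumptions; adjust them to the raw data.
    """
    out = frame.copy()
    alarm = pd.to_datetime(out[alarm_col])
    arrival = pd.to_datetime(out[arrival_col])
    control = pd.to_datetime(out[control_col])
    # Response time: alarm until the first unit arrives
    out["Response_Time_Min"] = (arrival - alarm).dt.total_seconds() / 60
    # Control time: alarm until the fire is under control (assumed definition)
    out["Control_Time_Min"] = (control - alarm).dt.total_seconds() / 60
    return out

# Wrapping the helper lets it act as a step in an sklearn Pipeline
feature_engineering_sketch = FunctionTransformer(add_time_features)

toy = pd.DataFrame({
    "TFS_Alarm_Time": ["2018-02-24T21:04:29", "2018-02-24T21:24:43"],
    "TFS_Arrival_Time": ["2018-02-24T21:10:11", "2018-02-24T21:29:42"],
    "Fire_Under_Control_Time": ["2018-02-24T21:15:40", "2018-02-24T21:32:24"],
})
engineered = feature_engineering_sketch.fit_transform(toy)
print(engineered[["Response_Time_Min", "Control_Time_Min"]])
```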

### Feature Transformers

In [5]:
from modules.feature_transformers import FeatureTransformer

# create feature transformer pipeline
feature_transformers = FeatureTransformer.createTransformerPipeline(df)

### Data Reduction
Data reduction focuses on selecting the best predictors for our model.

Using correlation analysis, we will identify the variables that are strongly correlated with the response variable. The Kruskal-Wallis test, the Spearman coefficient, and the Chi-squared (Χ²) test will be used.
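
To make two of these tests concrete, here is a small sketch with `scipy.stats` on invented data: Spearman's coefficient for a numerical predictor and the Kruskal-Wallis test for a categorical predictor, both against the response variable. The values and the simplified category labels are made up, and this is not the logic inside `FeatureAnalysis`.

```python
import pandas as pd
from scipy.stats import kruskal, spearmanr

# Invented values; column names follow the dataset
toy = pd.DataFrame({
    "Number_of_responding_personnel": [4, 10, 30, 6, 12, 25, 8, 40],
    "Sprinkler_System_Presence": ["none", "full", "none", "full",
                                  "none", "none", "full", "none"],
    "Estimated_Dollar_Loss": [500, 5000, 90000, 800, 7000, 60000, 6000, 120000],
})

# Numerical predictor vs. response: Spearman rank correlation
rho, rho_p = spearmanr(
    toy["Number_of_responding_personnel"], toy["Estimated_Dollar_Loss"]
)

# Categorical predictor vs. response: Kruskal-Wallis H-test across categories
groups = [
    group["Estimated_Dollar_Loss"].to_numpy()
    for _, group in toy.groupby("Sprinkler_System_Presence")
]
h_stat, h_p = kruskal(*groups)

print(f"Spearman rho={rho:.3f} (p={rho_p:.4f})")
print(f"Kruskal-Wallis H={h_stat:.3f} (p={h_p:.4f})")
```

A feature would then be kept when its coefficient or p-value passes a chosen threshold; the threshold itself is a modelling decision that this notebook does not state.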

In [6]:
# Import Data_reduction module
from modules.data_reduction import FeatureAnalysis

# Helper function that drops weakly correlated variables from the dataset
#df_reduced = FeatureAnalysis.keepStrongestFeaturesInDataFrame('Estimated_Dollar_Loss', df)

## Assembling sklearn pipeline

In [7]:

# Assemble final pipeline -- Set model before using!
pipeline = Pipeline(steps=[
                            #('feature engineering', feature_engineering), --to-do
                            ('feature transformers', feature_transformers) ,
                            ('imputer', missing_data_imputer),
                            #('feature_selection', data_reduction), --to-do
                            ('model', LinearRegression())
                           ])

pipeline

## Models

### Functions

In [8]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Function: Print results of a model (R2, mean absolute error, mean squared error, coefficients and intercept)
def print_results(r2, mae, mse, coefficients, intercept):
    print(f"R-squared score: {r2:.4f}")
    print(f"Mean Absolute Error: {mae}")
    print(f"Mean Squared Error: {mse}")
    print(f"Coefficients: {coefficients}")
    print(f"Intercept: {intercept}")

# Function: Return results (R2, mean absolute error, mean squared error, coefficients and intercept)
# `model` must be a fitted linear estimator exposing `coef_` and `intercept_`
def get_results(model, y_test, y_pred):
    r2 = r2_score(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    coefficients = model.coef_
    intercept = model.intercept_
    return (r2, mae, mse, coefficients, intercept)

### Split training and test data

In [9]:
from sklearn.model_selection import train_test_split

# List all columns in the DataFrame
all_columns = df.columns.tolist()

# Use every other column in the df except for the response variable
features = [col for col in all_columns if col != 'Estimated_Dollar_Loss']

# Separate features from response variable
X, y = df[features], df['Estimated_Dollar_Loss']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)


In [10]:
X_train


Unnamed: 0,Area_of_Origin,Building_Status,Business_Impact,Civilian_Casualties,Count_of_Persons_Rescued,Estimated_Number_Of_Persons_Displaced,Extent_Of_Fire,Final_Incident_Type,Fire_Alarm_System_Impact_on_Evacuation,Fire_Alarm_System_Operation,Fire_Alarm_System_Presence,Ignition_Source,Incident_Ward,Initial_CAD_Event_Type,Material_First_Ignited,Method_Of_Fire_Control,Number_of_responding_apparatus,Number_of_responding_personnel,Possible_Cause,Property_Use,Smoke_Alarm_at_Fire_Origin,Smoke_Alarm_at_Fire_Origin_Alarm_Failure,Smoke_Alarm_at_Fire_Origin_Alarm_Type,Smoke_Alarm_Impact_on_Persons_Evacuating_Impact_on_Evacuation,Smoke_Spread,Sprinkler_System_Operation,Sprinkler_System_Presence,Status_of_Fire_On_Arrival,TFS_Firefighter_Casualties
25822,97 - Other - unclassified,01 - Normal (no change),1 - No business interruption,0.0,0.0,0.0,1 - Confined to object of origin,01 - Fire,9 - Undetermined,8 - Not applicable (no system),1 - Fire alarm system present,"28 - Cord, Cable for Appliance, Electrical Art...",22.0,Medical - Other,97 - Other,3 - Extinguished by occupant,6.0,22.0,52 - Electrical Failure,323 - Multi-Unit Dwelling - Over 12 Units,3 - Floor/suite of fire origin: Smoke alarm pr...,4 - Remote from fire – smoke did not reach alarm,4 - Interconnected,"8 - Not applicable: No alarm, no persons present",2 - Confined to part of room/area of origin,3 - Did not activate: fire too small to trigge...,2 - Partial sprinkler system present,1 - Fire extinguished prior to arrival,0.0
12933,24 - Cooking Area or Kitchen,01 - Normal (no change),1 - No business interruption,0.0,0.0,0.0,1 - Confined to object of origin,01 - Fire,7 - Not applicable: Occupant(s) first alerted ...,9 - Fire alarm system operation undetermined,9 - Undetermined,12 - Oven,1.0,FIR,54 - Plastic,1 - Extinguished by fire department,7.0,26.0,44 - Unattended,301 - Detached Dwelling,4 - Floor/suite of fire origin: Smoke alarm pr...,4 - Remote from fire – smoke did not reach alarm,1 - Battery operated,7 - Not applicable: Occupant(s) first alerted ...,"4 - Spread beyond room of origin, same floor",8 - Not applicable - no sprinkler system present,3 - No sprinkler system,2 - Fire with no evidence from street,0.0
8767,"44 - Trash, Rubbish Storage (inc garbage chute...",01 - Normal (no change),8 - Not applicable (not a business),0.0,0.0,0.0,1 - Confined to object of origin,01 - Fire,7 - Not applicable: Occupant(s) first alerted ...,2 - Fire alarm system did not operate,1 - Fire alarm system present,"71 - Smoker's Articles (eg. cigarettes, cigars...",43.0,FIG,"46 - Rubbish, Trash, Waste",1 - Extinguished by fire department,1.0,4.0,45 - Improperly Discarded,323 - Multi-Unit Dwelling - Over 12 Units,1 - Floor/suite of fire origin: No smoke alarm,98 - Not applicable: Alarm operated OR presenc...,8 - Not applicable - no smoke alarm or presenc...,"8 - Not applicable: No alarm, no persons present",2 - Confined to part of room/area of origin,2 - Did not activate: remote from fire,2 - Partial sprinkler system present,4 - Flames showing from small area (one storey...,0.0
3626,"71 - Open Area (inc lawn, field, farmyard, par...",01 - Normal (no change),1 - No business interruption,1.0,0.0,0.0,1 - Confined to object of origin,01 - Fire,7 - Not applicable: Occupant(s) first alerted ...,9 - Fire alarm system operation undetermined,1 - Fire alarm system present,999 - Undetermined,20.0,Fire - Commercial/Industrial,54 - Plastic,1 - Extinguished by fire department,15.0,52.0,"98 - Unintentional, cause undetermined",218 - Hospice,4 - Floor/suite of fire origin: Smoke alarm pr...,98 - Not applicable: Alarm operated OR presenc...,2 - Hardwired (standalone),7 - Not applicable: Occupant(s) first alerted ...,9 - Confined to roof/exterior structure,9 - Activation/operation undetermined,9 - Undetermined,3 - Fire with smoke showing only - including v...,0.0
17644,22 - Sleeping Area or Bedroom (inc. patients r...,01 - Normal (no change),1 - No business interruption,0.0,0.0,0.0,"4 - Spread beyond room of origin, same floor",01 - Fire,9 - Undetermined,2 - Fire alarm system did not operate,1 - Fire alarm system present,34 - Space Heater - Portable,8.0,FAHR,"39 - Other Soft Goods, Wearing Apparel",1 - Extinguished by fire department,25.0,81.0,52 - Electrical Failure,323 - Multi-Unit Dwelling - Over 12 Units,4 - Floor/suite of fire origin: Smoke alarm pr...,9 - Other reason,4 - Interconnected,9 - Undetermined,"7 - Spread to other floors, confined to building",8 - Not applicable - no sprinkler system present,3 - No sprinkler system,3 - Fire with smoke showing only - including v...,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22787,"44 - Trash, Rubbish Storage (inc garbage chute...",01 - Normal (no change),1 - No business interruption,0.0,0.0,0.0,1 - Confined to object of origin,01 - Fire,"8 - Not applicable: No fire alarm system, no p...",9 - Fire alarm system operation undetermined,1 - Fire alarm system present,999 - Undetermined,24.0,Fire - Highrise Residential,"46 - Rubbish, Trash, Waste",1 - Extinguished by fire department,6.0,22.0,"98 - Unintentional, cause undetermined",323 - Multi-Unit Dwelling - Over 12 Units,1 - Floor/suite of fire origin: No smoke alarm,4 - Remote from fire – smoke did not reach alarm,4 - Interconnected,"8 - Not applicable: No alarm, no persons present",5 - Multi unit bldg: spread beyond suite of or...,3 - Did not activate: fire too small to trigge...,1 - Full sprinkler system present,3 - Fire with smoke showing only - including v...,0.0
7217,28 - Office,01 - Normal (no change),3 - May resume operations within a month,0.0,0.0,0.0,8 - Entire Structure,01 - Fire,"8 - Not applicable: No fire alarm system, no p...",8 - Not applicable (no system),8 - Not applicable (bldg not classified by OBC...,55 - Candle,11.0,FICI,99 - Undetermined (formerly 98),1 - Extinguished by fire department,16.0,59.0,44 - Unattended,531 - Florist,1 - Floor/suite of fire origin: No smoke alarm,98 - Not applicable: Alarm operated OR presenc...,8 - Not applicable - no smoke alarm or presenc...,"8 - Not applicable: No alarm, no persons present",10 - Spread beyond building of origin,8 - Not applicable - no sprinkler system present,3 - No sprinkler system,"7 - Fully involved (total structure, vehicle, ...",0.0
16905,50 - Basement/cellar (not partitioned),01 - Normal (no change),9 - Undetermined,0.0,0.0,0.0,2 - Confined to part of room/area of origin,01 - Fire,7 - Not applicable: Occupant(s) first alerted ...,1 - Fire alarm system operated,1 - Fire alarm system present,"26 - Terminations-Copper (incl receptacles, sw...",16.0,FACI,"36 - Rug, Carpet",1 - Extinguished by fire department,10.0,35.0,52 - Electrical Failure,"537 - Rug, floor covering store",4 - Floor/suite of fire origin: Smoke alarm pr...,98 - Not applicable: Alarm operated OR presenc...,2 - Hardwired (standalone),"8 - Not applicable: No alarm, no persons present","7 - Spread to other floors, confined to building",1 - Sprinkler system activated,1 - Full sprinkler system present,2 - Fire with no evidence from street,0.0
17917,46 - Product Storage (inc products or material...,01 - Normal (no change),2 - May resume operations within a week,0.0,0.0,0.0,2 - Confined to part of room/area of origin,01 - Fire,4 - Fire Alarm system operated but failed to a...,1 - Fire alarm system operated,1 - Fire alarm system present,52 - Florescent Lamp (includes ballast),8.0,FACI,48 - Multiple Objects or Materials,2 - Extinguished by automatic system,6.0,18.0,52 - Electrical Failure,653 - Mfg: Secondary Processing (eg finished g...,1 - Floor/suite of fire origin: No smoke alarm,98 - Not applicable: Alarm operated OR presenc...,8 - Not applicable - no smoke alarm or presenc...,9 - Undetermined,5 - Multi unit bldg: spread beyond suite of or...,1 - Sprinkler system activated,1 - Full sprinkler system present,2 - Fire with no evidence from street,0.0


### Multiple Linear Regression

In [11]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# FIXME: ordinal encoding fails when it encounters N/A values
lr = LinearRegression() # TODO: set hyperparameters
pipeline.set_params(model=lr)

# Fit pipeline on training data
pipeline.fit(X_train, y_train)


ValueError: could not convert string to float: '1 - Confined to object of origin'
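
The `ValueError` above means a raw string category reached a step that expects numbers. Because the internals of `FeatureTransformer.createTransformerPipeline` are not shown here, the following is only one possible arrangement, sketched on toy data: impute inside each column branch first, then encode, so that no string or missing value reaches the regressor. The toy values and the choice of `median` for the numeric branch are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Toy data with missing values in both a categorical and a numerical column
toy_X = pd.DataFrame({
    "Extent_Of_Fire": ["1 - Confined to object of origin", np.nan,
                       "8 - Entire Structure", "1 - Confined to object of origin"],
    "Number_of_responding_personnel": [22.0, 26.0, np.nan, 4.0],
})
toy_y = pd.Series([1000.0, 2000.0, 50000.0, 500.0])

# Categorical branch: fill gaps first, then map categories to integers;
# categories unseen during fit are encoded as -1 instead of raising
categorical_branch = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("ordinal", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)),
])

# Numerical branch: fill gaps, then scale
numerical_branch = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

preprocessor = ColumnTransformer(transformers=[
    ("cat", categorical_branch, ["Extent_Of_Fire"]),
    ("num", numerical_branch, ["Number_of_responding_personnel"]),
])

sketch_model = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("model", LinearRegression()),
])
sketch_model.fit(toy_X, toy_y)

# Every column is numeric and complete by the time it reaches the model
transformed = sketch_model.named_steps["preprocessor"].transform(toy_X)
print(transformed)
```

Any column not listed in the `ColumnTransformer` is dropped by default, so a string column can only reach the model if it is passed through explicitly.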

In [None]:
# Get predictions
y_pred_lr = pipeline.predict(X_test)

residuals_lr = y_test - y_pred_lr

(r2_lr, mae_lr, mse_lr, coefficients_lr, intercept_lr) = get_results(lr, y_test, y_pred_lr)

# Results - Another option would be to use statsmodels to display a summary
print("---------MLR----------")
print_results(r2_lr, mae_lr, mse_lr, coefficients_lr, intercept_lr)
print("----------------------")

# Note: accuracy, precision and recall are classification metrics and raise a
# ValueError on a continuous target, so only the regression metrics above are reported.

---------MLR----------
R-squared score: 0.0025
Mean Absolute Error: 70540.54249502197
Mean Squared Error: 625891552510.4305
Coefficients: [   5672.53535934  -17292.37864796   27285.76993965  -46938.88225852
   -6074.31432662  -38152.67952924   37113.45820279  105643.26103038
 -101996.25486574  -45477.62436891  -46600.68036953  -50976.10839712
  -11170.63954034  -21204.40521799  -17566.33499097  -31062.35566627
  -16257.14595257  -13617.73451003  -19023.60307736  -12191.24374637
   23017.47232009  318972.81596159]
Intercept: -9093.91551040113
----------------------

In [None]:
# from sklearn.model_selection import train_test_split
# from sklearn.linear_model import LinearRegression
# from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=12) # random_state should be 12 for all models

# lr = LinearRegression() # to do - settings hyperparameters 
# lr.fit(X_train, y_train)

# y_pred_lr = lr.predict(X_test)
# residuals_lr = y_test - y_pred_lr

# (r2_lr, mae_lr, mse_lr, coefficients_lr, intercept_lr) = get_results(lr, y_test, y_pred_lr)

# # Results - Another option would be to use statsmodels to display a summary
# print("---------MLR----------")
# print_results(r2_lr, mae_lr, mse_lr, coefficients_lr, intercept_lr)
# print("----------------------")

### Lasso

In [None]:
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Features need to be scaled before using Lasso; it is unclear whether preprocessing already does this
# from sklearn.preprocessing import StandardScaler
# scaler = StandardScaler()
# X_train = scaler.fit_transform(X_train)
# X_test = scaler.transform(X_test)  # reuse the training statistics; do not refit on the test set

# Lasso Model 1 = basic

lasso = Lasso()
lasso.fit(X_train, y_train)

y_pred_lasso = lasso.predict(X_test)

(r2_lasso, mae_lasso, mse_lasso, coefficients_lasso, intercept_lasso) = get_results(lasso, y_test, y_pred_lasso)


print("---------LASSO----------")
print_results(r2_lasso, mae_lasso, mse_lasso, coefficients_lasso, intercept_lasso)
print("------------------------")

# Lasso Model 2 = Testing different parameters (CROSS-VALIDATOR: CV)

param_grid = {
	'alpha': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000]	
}

lasso_cv = GridSearchCV(lasso, param_grid, cv=3, n_jobs=-1)
lasso_cv.fit(X_train, y_train)

y_pred_lasso_cv = lasso_cv.predict(X_test)

# GridSearchCV itself has no coef_/intercept_, so pass the refitted best estimator
(r2_lasso_cv, mae_lasso_cv, mse_lasso_cv, coefficients_lasso_cv, intercept_lasso_cv) = get_results(lasso_cv.best_estimator_, y_test, y_pred_lasso_cv)

print("---------LASSO CV----------")
print_results(r2_lasso_cv, mae_lasso_cv, mse_lasso_cv, coefficients_lasso_cv, intercept_lasso_cv)
print("---------------------------")

# Save the best alpha parameter
best_alpha = lasso_cv.best_params_['alpha']

# Lasso Model 3 = Lasso model with the best parameters
lasso_best = Lasso(alpha=best_alpha)
lasso_best.fit(X_train, y_train)
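
Regarding the scaling question raised in this cell: one way to avoid scaling by hand is to put the scaler and the Lasso in the same pipeline and tune `alpha` through it, so the scaler is refit on each training fold only. The sketch below uses synthetic numerical data (the real features would first need encoding), and the step names `scaler` and `lasso` are arbitrary.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic numerical features on very different scales
rng = np.random.default_rng(0)
X_num = rng.normal(size=(60, 4)) * np.array([1, 10, 100, 1000])
y_num = 3.0 * X_num[:, 0] + 0.5 * X_num[:, 1] + rng.normal(size=60)

# Scaling happens inside the pipeline, so each CV fold is scaled
# with statistics from its own training portion
scaled_lasso = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("lasso", Lasso(max_iter=10000)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
alpha_grid = {"lasso__alpha": [0.001, 0.01, 0.1, 1, 10]}
search = GridSearchCV(scaled_lasso, alpha_grid, cv=3)
search.fit(X_num, y_num)

best_alpha_sketch = search.best_params_["lasso__alpha"]
best_coefficients = search.best_estimator_.named_steps["lasso"].coef_
print(best_alpha_sketch, best_coefficients)
```

With `refit=True` (the default), `search.best_estimator_` is already refit on all of the training data, so a separate "model 3" fit is optional.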

### ElasticNet

In [None]:
from sklearn.linear_model import ElasticNet

# X_train and X_test should be scaled

# ElasticNet Model 1 = basic

elastic_net = ElasticNet()
elastic_net.fit(X_train, y_train)

y_pred_elastic_net = elastic_net.predict(X_test)

(r2_elastic_net, mae_elastic_net, mse_elastic_net, coefficients_elastic_net, intercept_elastic_net) = get_results(elastic_net, y_test, y_pred_elastic_net)

print("---------ELASTIC NET BASIC----------")
print_results(r2_elastic_net, mae_elastic_net, mse_elastic_net, coefficients_elastic_net, intercept_elastic_net)
print("---------------------------")

# ElasticNet Model 2 = Testing different parameters (CROSS-VALIDATOR: CV)

param_grid = {
	'alpha': [0.1, 0.3, 0.5, 0.7, 0.9, 1.0],
	'l1_ratio': [0.1, 0.3, 0.5, 0.7, 0.9, 1.0], 	
}

elastic_cv = GridSearchCV(elastic_net, param_grid, scoring='neg_mean_squared_error', cv=3, n_jobs=-1)
elastic_cv.fit(X_train, y_train)

y_pred_elastic_cv = elastic_cv.predict(X_test)
# GridSearchCV itself has no coef_/intercept_, so pass the refitted best estimator
(r2_elastic_cv, mae_elastic_cv, mse_elastic_cv, coefficients_elastic_cv, intercept_elastic_cv) = get_results(elastic_cv.best_estimator_, y_test, y_pred_elastic_cv)

print("---------ELASTIC NET CV----------")
print_results(r2_elastic_cv, mae_elastic_cv, mse_elastic_cv, coefficients_elastic_cv, intercept_elastic_cv)
print("---------------------------")


# Best parameters
best_estimator = elastic_cv.best_estimator_
print(best_estimator)

In [None]:
# # Sample imports
# from sklearn.compose import ColumnTransformer
# from sklearn.preprocessing import OrdinalEncoder
# from sklearn.pipeline import Pipeline
# from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
# from sklearn.impute import SimpleImputer
# from sklearn.impute import KNNImputer
# from sklearn.neighbors import KNeighborsClassifier
# import pandas as pd

# df = load_data('data/loan.csv')

# # Encode the target variable using LabelEncoder
# label_encoder = LabelEncoder()
# df['Loan_Status'] = label_encoder.fit_transform(df['Loan_Status'])


# # Define categorical and numerical features
# ordinal_categorical_features = ['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed']
# c1_idx = [df.columns.get_loc(item) for item in ordinal_categorical_features]
# onehot_categorical_features = ['Property_Area']
# c2_idx = [df.columns.get_loc(item) for item in onehot_categorical_features]
# numerical_features = df.columns.difference(ordinal_categorical_features + onehot_categorical_features + ['Loan_Status'])
# n_idx = [df.columns.get_loc(item) for item in numerical_features]

# # Create transformers for numerical and categorical features
# numerical_transformer = Pipeline(steps=[
#     ('scaler', StandardScaler())
# ])

# ordinal_categorical_transformer = Pipeline(steps=[
#     ('ordinal', OrdinalEncoder(handle_unknown='error'))
# ])

# onehot_categorical_transformer = Pipeline(steps=[
#     ('onehot', OneHotEncoder(handle_unknown='ignore'))
# ])

# column_imputer = Pipeline(steps=[
#     ('imputer0', KNNImputer())
# ])

# # Apply transformers to features using ColumnTransformer
# feature_transformer = ColumnTransformer(
#     transformers=[
#         ('cat1', ordinal_categorical_transformer, c1_idx),
#         ('cat2', onehot_categorical_transformer, c2_idx),
#         ('num', numerical_transformer, n_idx),
#     ])

# missing_value_imputer = ColumnTransformer(
#     transformers=[
#         ('imputer', column_imputer, c1_idx + c2_idx + n_idx)
#     ])


# # Define the KNN model
# knn_model = KNeighborsClassifier(n_neighbors=3)  # You can adjust the number of neighbors

# # Create the pipeline
# # Create preprocessing and training pipeline
# pipeline = Pipeline(steps=[('transformer', feature_transformer),
#                            ('imputer', missing_value_imputer),
#                            ('classifier', knn_model)])
# pipeline
