# Data mining TorontoFireIncidents

### Custom Modules

- data_clean

  Classes:
  
  - `DataCleaner`:
  
    **Functions**:
    
    - `createPipeline()`: Returns a pipeline with an imputer.
    - `cleanse_dataframe()`: Returns a cleansed dataframe.
    
---
- data_reduction

  Classes:
  
  - `FeatureAnalysis`:
  
    **Functions**:
    
    - `keepStrongestFeaturesInDataFrame(responseVariable, df)`: Returns a dataframe with variables that have a strong correlation to `responseVariable`.




### Preprocessing Pipeline

The pipeline is designed as follows:

- `Pipeline`
  - `Preprocessor`
    - `Data_cleaning (Ryan)`
      - Drop Null rows (inside data_clean.cleanse_dataframe() method)
      - Drop False positives (inside data_clean.cleanse_dataframe() method)
      - Impute missing values
      - Remove outliers (inside data_clean.cleanse_dataframe() method)
    - `Data_reduction (Ryan)` (lives mostly in the data_reduction module, outside the sklearn pipeline)
      - Select the best predictors: identify the variables that correlate strongly with the response variable
      - Tests used: Kruskal-Wallis test, Spearman coefficient, Chi-squared (Χ²) test
    - `Feature_Engineering` (may live in a helper function outside the sklearn pipeline)
      - Create a new feature, control time (how long the fire burned)
      - Create a new feature, response time (how long the first arriving unit took to reach the incident)
    - `feature_transformers`
      - categorical one hot encoding
      - categorical ordinal encoding
      - numerical scaler
      - log transformation on response variable
      - normalize features
  - `model(regressor)` 
    - Linear models
      - Multiple Linear Regression (OLS - Ordinary Least Squares): `(Maurilio)`
      - Lasso (Least Absolute Shrinkage and Selection Operator): `(Maurilio)`
      - Elastic-Net: `(Maurilio)`
      - Huber Regressor:
    - Ensemble methods
      - XGBoost Regressor
    - Non-Linear Models
      - Neural networks (MLP - Multi-layer Perceptron):


In [1]:
# Third-party libraries
import pandas as pd

from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder, StandardScaler


In [8]:
# Load DF
df = pd.read_csv('../../data/raw/Fire_Incidents_Data.csv', low_memory=False)

### Data Cleaning
The following data will be removed:
- False positives (Final_Incident_Type: 03 - NO LOSS OUTDOOR fire (exc: Sus.arson,vandal,child playing,recycling, or dump fires))
- Rows with null values for Estimated Loss (the response variable) or Area_of_Origin

Missing data will be imputed using KNN.
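
As a rough illustration of KNN imputation, here is a minimal sketch on a toy frame. The column names and `n_neighbors=2` are assumptions made for the example, not settings taken from `DataCleaner.createPipeline()`. `KNNImputer` works on numeric columns only, so categorical columns would have to be encoded or handled separately.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline

# Toy numeric frame with gaps; the column names are made up for illustration.
toy = pd.DataFrame({
    "feature_a": [1.0, 2.0, np.nan, 4.0, 5.0],
    "feature_b": [10.0, np.nan, 30.0, 40.0, 50.0],
})

# A one-step pipeline wrapping the imputer (hypothetical stand-in for the real one).
imputer_pipeline = Pipeline(steps=[("imputer", KNNImputer(n_neighbors=2))])

# Each missing value is filled with the mean of that column over the
# nearest rows, where distance is computed on the columns both rows have.
imputed = pd.DataFrame(imputer_pipeline.fit_transform(toy), columns=toy.columns)
print(imputed)
```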


In [6]:
# Import Data_cleaning module
from modules.data_clean import DataCleaner

# Cleanse Dataframe
df = DataCleaner.cleanse_dataframe(df)

# Data_cleaning pipeline (contains imputer).
# createPipeline() is an instance method, so it must be called on an instance;
# this assumes DataCleaner can be constructed without arguments.
data_cleaning = DataCleaner().createPipeline()

### Data Reduction
Data reduction focuses on selecting the best predictors for our model.

Using correlation analysis, we identify the variables that correlate strongly with the response variable. The Kruskal-Wallis test, the Spearman coefficient, and the Chi-squared (Χ²) test will be used.
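
To show how the three tests might be applied, here is a minimal sketch with `scipy.stats` on synthetic data. The columns `group`, `numeric_feature`, and `loss` are invented for the example, and binning the response into quartiles for the Chi-squared test is one possible choice; none of this is taken from `FeatureAnalysis`.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency, kruskal, spearmanr

rng = np.random.default_rng(12)
n = 300

# Synthetic stand-in data; none of these columns come from the real dataset.
group = rng.choice(["A", "B", "C"], size=n)
numeric_feature = rng.normal(size=n)
loss = np.exp(0.8 * numeric_feature + (group == "C") * 1.5
              + rng.normal(scale=0.5, size=n))
toy = pd.DataFrame({"group": group, "numeric_feature": numeric_feature, "loss": loss})

# Spearman: monotonic association between a numeric feature and the response.
rho, p_spearman = spearmanr(toy["numeric_feature"], toy["loss"])

# Kruskal-Wallis: does the response distribution differ across categories?
samples = [g["loss"].to_numpy() for _, g in toy.groupby("group")]
h_stat, p_kruskal = kruskal(*samples)

# Chi-squared: association between two categorical variables
# (the response is binned into quartiles here to make it categorical).
loss_bin = pd.qcut(toy["loss"], q=4, labels=False)
table = pd.crosstab(toy["group"], loss_bin)
chi2, p_chi2, dof, _ = chi2_contingency(table)

print(f"Spearman rho={rho:.3f}, p={p_spearman:.3g}")
print(f"Kruskal-Wallis H={h_stat:.3f}, p={p_kruskal:.3g}")
print(f"Chi-squared={chi2:.3f}, dof={dof}, p={p_chi2:.3g}")
```

A feature would then be kept when its test shows a sufficiently strong association with the response; the exact threshold is a project decision.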

In [None]:
# Import Data_reduction module
from modules.data_reduction import FeatureAnalysis

# Helper function that drops weakly correlated variables from the dataset
df = FeatureAnalysis.keepStrongestFeaturesInDataFrame('estimated_loss', df)

### Feature Engineering

In [None]:
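# feature_engineering pipeline
# Placeholder until real steps are added: an empty step list cannot be fitted,
# so a single 'passthrough' step is used instead.
feature_engineering = Pipeline([('placeholder', 'passthrough')])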
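
The two planned features, response time and control time, could be derived from timestamp columns along these lines. This is a sketch under assumptions: the helper `add_time_features` and the column names `alarm`, `arrival`, and `under_control` are hypothetical, and control time is measured here from the alarm to the moment the fire is under control, which is one reading of "how long the fire burned".

```python
import pandas as pd

def add_time_features(frame, alarm_col, arrival_col, control_col):
    """Return a copy of `frame` with response_time and control_time in minutes."""
    out = frame.copy()
    # Unparseable or missing timestamps become NaT instead of raising.
    alarm = pd.to_datetime(out[alarm_col], errors="coerce")
    arrival = pd.to_datetime(out[arrival_col], errors="coerce")
    control = pd.to_datetime(out[control_col], errors="coerce")
    # Response time: alarm until the first unit arrives.
    out["response_time"] = (arrival - alarm).dt.total_seconds() / 60
    # Control time (assumed definition): alarm until the fire is under control.
    out["control_time"] = (control - alarm).dt.total_seconds() / 60
    return out

# Toy rows with hypothetical column names.
toy = pd.DataFrame({
    "alarm": ["2020-01-01 10:00:00", "2020-01-01 12:00:00"],
    "arrival": ["2020-01-01 10:06:00", "2020-01-01 12:04:30"],
    "under_control": ["2020-01-01 10:36:00", None],
})
result = add_time_features(toy, "alarm", "arrival", "under_control")
print(result[["response_time", "control_time"]])
```

If this logic should run inside the sklearn pipeline rather than as a helper, wrapping the function in `sklearn.preprocessing.FunctionTransformer` is one option.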

### Feature Transformers

In [None]:
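# feature_transformers pipeline
# Placeholder until real steps are added: an empty step list cannot be fitted,
# so a single 'passthrough' step is used instead.
feature_transformers = Pipeline([('placeholder', 'passthrough')])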
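
One way the planned transformers could fit together is sketched below: a `ColumnTransformer` for one-hot encoding, ordinal encoding, and scaling, with the log transformation of the response handled by `TransformedTargetRegressor`. The toy columns, the category order, and the use of `log1p`/`expm1` (chosen so that a loss of zero stays valid) are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Toy data; the column names and category ordering are invented.
toy = pd.DataFrame({
    "property_use": ["house", "office", "house", "shop", "office", "shop"],
    "alarm_level": ["low", "high", "medium", "low", "high", "medium"],
    "response_time": [4.0, 6.5, 5.0, 7.2, 3.8, 6.0],
})
loss = pd.Series([1000.0, 50000.0, 8000.0, 1500.0, 42000.0, 9000.0])

transformers = ColumnTransformer(transformers=[
    # Nominal categories: one column per category.
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["property_use"]),
    # Ordered categories: integers that follow the stated order.
    ("ordinal", OrdinalEncoder(categories=[["low", "medium", "high"]]), ["alarm_level"]),
    # Numeric columns: zero mean, unit variance.
    ("scale", StandardScaler(), ["response_time"]),
])

# The regressor is fitted on log1p(y); predictions are mapped back with expm1.
model = TransformedTargetRegressor(
    regressor=Pipeline(steps=[("features", transformers), ("ols", LinearRegression())]),
    func=np.log1p,
    inverse_func=np.expm1,
)
model.fit(toy, loss)
predictions = model.predict(toy)
print(predictions)
```

Keeping the target transformation inside the estimator means metrics computed on `predictions` are in the original units of the response.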

## Assembling pipeline

In [None]:

# Preprocessor pipeline
preprocessor = Pipeline(steps=[('data cleaning', data_cleaning),
                               #('data reduction', data_reduction),
                               ('feature engineering', feature_engineering),
                               ('feature transformers', feature_transformers),
                                ])
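# Sample model: a regressor, since the response variable is continuous
from sklearn.neighbors import KNeighborsRegressor
model = KNeighborsRegressor(n_neighbors=3)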

# Assemble final pipeline
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('regressor', model)])

pipeline

## Models

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Function: Print Results
def print_results(r2, mae, mse, coefficients, intercept):
    print(f"R-squared score: {r2:.4f}")
    print(f"Mean Absolute Error: {mae}")
    print(f"Mean Squared Error: {mse}")
    print(f"Coefficients: {coefficients}")
    print(f"Intercept: {intercept}")

# Function: Return Results
# `model` must be a fitted linear estimator exposing coef_ and intercept_
# (for a GridSearchCV object, pass its best_estimator_).
def get_results(model, y_test, y_pred):
    r2 = r2_score(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    coefficients = model.coef_
    intercept = model.intercept_
    return (r2, mae, mse, coefficients, intercept)

## Multiple Linear Regression

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# X (features) and y (response variable) are expected to come from the preprocessed dataframe
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=12) # random_state should be 12 for all models

lr = LinearRegression() # TODO: set hyperparameters
lr.fit(X_train, y_train)

y_pred_lr = lr.predict(X_test)
residuals_lr = y_test - y_pred_lr

(r2_lr, mae_lr, mse_lr, coefficients_lr, intercept_lr) = get_results(lr, y_test, y_pred_lr)

# Results - Another option would be to use statsmodels to display a summary
print("---------MLR----------")
print_results(r2_lr, mae_lr, mse_lr, coefficients_lr, intercept_lr)
print("----------------------")

## Lasso

In [None]:
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Features need to be scaled before using Lasso. It is unclear whether preprocessing already does this.
# Fit the scaler on the training set only, then apply it to the test set.
# from sklearn.preprocessing import StandardScaler
# scaler = StandardScaler()
# X_train = scaler.fit_transform(X_train)
# X_test = scaler.transform(X_test)

# Lasso Model 1 = basic

lasso = Lasso()
lasso.fit(X_train, y_train)

y_pred_lasso = lasso.predict(X_test)

(r2_lasso, mae_lasso, mse_lasso, coefficients_lasso, intercept_lasso) = get_results(lasso, y_test, y_pred_lasso)


print("---------LASSO----------")
print_results(r2_lasso, mae_lasso, mse_lasso, coefficients_lasso, intercept_lasso)
print("------------------------")

# Lasso Model 2 = Testing different parameters (CROSS-VALIDATOR: CV)

param_grid = {
	'alpha': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000]	
}

lasso_cv = GridSearchCV(lasso, param_grid, cv=3, n_jobs=-1)
lasso_cv.fit(X_train, y_train)

y_pred_lasso_cv = lasso_cv.predict(X_test)

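# GridSearchCV itself has no coef_ or intercept_, so pass the refitted best estimator
(r2_lasso_cv, mae_lasso_cv, mse_lasso_cv, coefficients_lasso_cv, intercept_lasso_cv) = get_results(lasso_cv.best_estimator_, y_test, y_pred_lasso_cv)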

print("---------LASSO CV----------")
print_results(r2_lasso_cv, mae_lasso_cv, mse_lasso_cv, coefficients_lasso_cv, intercept_lasso_cv)
print("---------------------------")

# Save the best alpha parameter
best_alpha = lasso_cv.best_params_['alpha']

# Lasso Model 3 = Lasso model with the best parameters
lasso_best = Lasso(alpha=best_alpha)
lasso_best.fit(X_train, y_train)
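
Since it is still open whether preprocessing scales the features, one option is to put the scaler and Lasso in a single pipeline and tune `alpha` on that pipeline, so that each cross-validation fold fits the scaler on its own training part only. The sketch below uses synthetic data from `make_regression` in place of the real `X` and `y`; the alpha grid and `max_iter` are arbitrary choices for the example.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data stands in for the real X and y.
X_demo, y_demo = make_regression(n_samples=200, n_features=10, n_informative=4,
                                 noise=10.0, random_state=12)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=12)

# Scaling happens inside the pipeline, so it is refitted within every CV fold.
scaled_lasso = Pipeline(steps=[("scaler", StandardScaler()),
                               ("lasso", Lasso(max_iter=10000))])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
alpha_grid = [0.001, 0.01, 0.1, 1, 10]
search = GridSearchCV(scaled_lasso, {"lasso__alpha": alpha_grid}, cv=3)
search.fit(X_tr, y_tr)

best_alpha_demo = search.best_params_["lasso__alpha"]
best_lasso = search.best_estimator_.named_steps["lasso"]
test_r2 = search.score(X_te, y_te)
print(best_alpha_demo, test_r2)
```

The fitted Lasso step (`best_lasso`) exposes `coef_` and `intercept_`, so it can be passed to a helper that reads those attributes.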

## ElasticNet

In [None]:
from sklearn.linear_model import ElasticNet

# X_train and X_test should be scaled

# ElasticNet Model 1 = basic

elastic_net = ElasticNet()
elastic_net.fit(X_train, y_train)

y_pred_elastic_net = elastic_net.predict(X_test)

(r2_elastic_net, mae_elastic_net, mse_elastic_net, coefficients_elastic_net, intercept_elastic_net) = get_results(elastic_net, y_test, y_pred_elastic_net)

print("---------ELASTIC NET BASIC----------")
print_results(r2_elastic_net, mae_elastic_net, mse_elastic_net, coefficients_elastic_net, intercept_elastic_net)
print("---------------------------")

# ElasticNet Model 2 = Testing different parameters (CROSS-VALIDATOR: CV)

param_grid = {
	'alpha': [0.1, 0.3, 0.5, 0.7, 0.9, 1.0],
	'l1_ratio': [0.1, 0.3, 0.5, 0.7, 0.9, 1.0], 	
}

elastic_cv = GridSearchCV(elastic_net, param_grid, scoring='neg_mean_squared_error', cv=3, n_jobs=-1)
elastic_cv.fit(X_train, y_train)

y_pred_elastic_cv = elastic_cv.predict(X_test)
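# GridSearchCV itself has no coef_ or intercept_, so pass the refitted best estimator
(r2_elastic_cv, mae_elastic_cv, mse_elastic_cv, coefficients_elastic_cv, intercept_elastic_cv) = get_results(elastic_cv.best_estimator_, y_test, y_pred_elastic_cv)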

print("---------ELASTIC NET CV----------")
print_results(r2_elastic_cv, mae_elastic_cv, mse_elastic_cv, coefficients_elastic_cv, intercept_elastic_cv)
print("---------------------------")


# Best parameters
best_estimator = elastic_cv.best_estimator_
print(best_estimator)

In [None]:
# # Sample imports
# from sklearn.compose import ColumnTransformer
# from sklearn.preprocessing import OrdinalEncoder
# from sklearn.pipeline import Pipeline
# from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
# from sklearn.impute import SimpleImputer
# from sklearn.impute import KNNImputer
# from sklearn.neighbors import KNeighborsClassifier
# import pandas as pd

# df = load_data('data/loan.csv')

# # Encode the target variable using LabelEncoder
# label_encoder = LabelEncoder()
# df['Loan_Status'] = label_encoder.fit_transform(df['Loan_Status'])


# # Define categorical and numerical features
# ordinal_categorical_features = ['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed']
# c1_idx = [df.columns.get_loc(item) for item in ordinal_categorical_features]
# onehot_categorical_features = ['Property_Area']
# c2_idx = [df.columns.get_loc(item) for item in onehot_categorical_features]
# numerical_features = df.columns.difference(ordinal_categorical_features + onehot_categorical_features + ['Loan_Status'])
# n_idx = [df.columns.get_loc(item) for item in numerical_features]

# # Create transformers for numerical and categorical features
# numerical_transformer = Pipeline(steps=[
#     ('scaler', StandardScaler())
# ])

# ordinal_categorical_transformer = Pipeline(steps=[
#     ('ordinal', OrdinalEncoder(handle_unknown='error'))
# ])

# onehot_categorical_transformer = Pipeline(steps=[
#     ('onehot', OneHotEncoder(handle_unknown='ignore'))
# ])

# column_imputer = Pipeline(steps=[
#     ('imputer0', KNNImputer())
# ])

# # Apply transformers to features using ColumnTransformer
# feature_transformer = ColumnTransformer(
#     transformers=[
#         ('cat1', ordinal_categorical_transformer, c1_idx),
#         ('cat2', onehot_categorical_transformer, c2_idx),
#         ('num', numerical_transformer, n_idx),
#     ])

# missing_value_imputer = ColumnTransformer(
#     transformers=[
#         ('imputer', column_imputer, c1_idx + c2_idx + n_idx)
#     ])


# # Define the KNN model
# knn_model = KNeighborsClassifier(n_neighbors=3)  # You can adjust the number of neighbors

# # Create the pipeline
# # Create preprocessing and training pipeline
# pipeline = Pipeline(steps=[('transformer', feature_transformer),
#                            ('imputer', missing_value_imputer),
#                            ('classifier', knn_model)])
# pipeline
