# Data mining TorontoFireIncidents

### Custom Modules

- data_clean

  Classes:
  
  - `DataCleaner`:
  
    **Functions**:
    
    - `createPipeline()`: Returns a pipeline with an imputer.
    - `cleanse_dataframe()`: Returns a cleansed dataframe.
    
---
- data_reduction

  Classes:
  
  - `FeatureAnalysis`:
  
    **Functions**:
    
    - `keepStrongestFeaturesInDataFrame(responseVariable, df)`: Returns a dataframe with variables that have a strong correlation to `responseVariable`.




### Preprocessing Pipeline

The pipeline is designed as follows:

- `Pipeline`
  - `Preprocessor`
    - `Data_cleaning (Ryan)`
      - Drop Null rows (inside data_clean.cleanse_dataframe() method)
      - Drop False positives (inside data_clean.cleanse_dataframe() method)
      - Impute missing values
      - Remove outliers (inside data_clean.cleanse_dataframe() method)
    - `Data_reduction (Ryan)` (mostly lives in the data_reduction module, outside the sklearn pipeline)
      - Select the best predictors: identify the variables that correlate strongly with the response variable, using the Kruskal-Wallis test, the Spearman coefficient and the Chi-Squared (Χ²) test
    - `Feature_Engineering` (may live in a helper function outside the sklearn pipeline)
      - Create a new feature, control time (how long the fire burned)
      - Create a new feature, response time (how long the first arriving unit took to reach the incident)
    - `feature_transformers`
      - categorical one hot encoding
      - categorical ordinal encoding
      - numerical scaler
      - log transformation on response variable
      - normalize features
  - `model(regressor)`
    - Linear models
      - Multiple Linear Regression (OLS - Ordinary Least Squares)
      - Lasso (Least Absolute Shrinkage and Selection Operator)
      - Elastic-Net
      - Huber Regressor
    - Ensemble methods
      - XGBoost Regressor
    - Non-linear models
      - Neural networks (MLP - Multi-layer Perceptron)
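
The model step and the log transformation of the response variable could be combined roughly as follows. This is a minimal sketch, not the project's implementation: the function name `build_candidate_regressors` and all hyperparameter values are assumptions chosen for illustration, and XGBoost is left out because it ships in a separate package.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import ElasticNet, HuberRegressor, Lasso, LinearRegression
from sklearn.neural_network import MLPRegressor


def build_candidate_regressors():
    """Return scikit-learn candidates, each fitted on log1p(y).

    Hypothetical helper: names and hyperparameters are illustrative only.
    """
    base_models = {
        "ols": LinearRegression(),
        "lasso": Lasso(alpha=0.1),
        "elastic_net": ElasticNet(alpha=0.1, l1_ratio=0.5),
        "huber": HuberRegressor(),
        "mlp": MLPRegressor(hidden_layer_sizes=(32,), max_iter=500, random_state=0),
    }
    # TransformedTargetRegressor applies log1p to y before fitting and
    # maps predictions back to the original scale with expm1.
    return {
        name: TransformedTargetRegressor(
            regressor=model, func=np.log1p, inverse_func=np.expm1
        )
        for name, model in base_models.items()
    }
```

Wrapping the regressor keeps the target transformation out of the feature pipeline, so predictions come back in the original units of the response variable. `log1p` needs a response greater than -1, which a non-negative loss satisfies.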


In [None]:
# Third-party libraries
import pandas as pd

from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder, StandardScaler


In [None]:
# Load the raw fire incidents data into a dataframe
df = pd.read_csv('../../data/raw/Fire_Incidents_Data.csv')

### Data Cleaning
The following data will be removed:
- False positives (Final_Incident_Type: 03 - NO LOSS OUTDOOR fire (exc: Sus.arson,vandal,child playing,recycling, or dump fires))
- Rows with null values for Estimated Loss (the response variable) or Area_of_Origin

The remaining missing data will be imputed using KNN.
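
The cleaning rules above might look roughly like this in pandas. This is a sketch under stated assumptions, not the actual `DataCleaner` code: the helper names `drop_unusable_rows` and `impute_numeric_with_knn` are hypothetical, the column names are passed in by the caller because the exact names in the CSV are not confirmed here, and only numeric columns are imputed.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Assumption: false positives are identified by the "03" code prefix
# of the incident type ("03 - NO LOSS OUTDOOR fire ...").
FALSE_POSITIVE_PREFIX = "03"


def drop_unusable_rows(df, type_col, required_cols):
    """Drop false-positive incidents and rows missing any required column."""
    is_false_positive = df[type_col].astype(str).str.startswith(FALSE_POSITIVE_PREFIX)
    kept = df.loc[~is_false_positive].dropna(subset=list(required_cols))
    return kept.reset_index(drop=True)


def impute_numeric_with_knn(df, n_neighbors=5):
    """Fill missing values in numeric columns with KNNImputer.

    Assumes no numeric column is entirely empty, since KNNImputer
    drops such columns from its output.
    """
    numeric_cols = df.select_dtypes(include="number").columns
    result = df.copy()
    imputer = KNNImputer(n_neighbors=n_neighbors)
    result[numeric_cols] = imputer.fit_transform(result[numeric_cols])
    return result
```

A call such as `drop_unusable_rows(df, "Final_Incident_Type", ["Estimated_Loss", "Area_of_Origin"])` would apply the two removal rules; the response column name in that call is an assumption.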


In [None]:
# Import Data_cleaning module
from modules.data_clean import DataCleaner

# Cleanse Dataframe
df = DataCleaner.cleanse_dataframe(df)

# Data_cleaning pipeline (contains imputer)
data_cleaning = DataCleaner.createPipeline() 

### Data Reduction
Data reduction focuses on selecting the best predictors to use in our model.

Using correlation analysis, we will identify the variables that correlate strongly with the response variable. The Kruskal-Wallis test, the Spearman coefficient and the Chi-Squared (Χ²) test will be used.
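
As an illustration of the three tests named above, the following sketch uses `scipy.stats`. The helper names are hypothetical and this is not the `FeatureAnalysis` implementation. It assumes Spearman is applied to numeric features, Kruskal-Wallis to categorical features against the numeric response, and Chi-Squared to a pair of categorical columns (for example a categorical feature and a binned response).

```python
import pandas as pd
from scipy.stats import chi2_contingency, kruskal, spearmanr


def spearman_with_response(df, feature, response):
    """Spearman rank correlation between a numeric feature and the response."""
    pair = df[[feature, response]].dropna()
    rho, p_value = spearmanr(pair[feature], pair[response])
    return rho, p_value


def kruskal_by_category(df, feature, response):
    """Kruskal-Wallis H test of the response across the feature's categories."""
    pair = df[[feature, response]].dropna()
    groups = [group[response].to_numpy() for _, group in pair.groupby(feature)]
    statistic, p_value = kruskal(*groups)
    return statistic, p_value


def chi_squared_between(df, feature_a, feature_b):
    """Chi-squared test of independence between two categorical columns."""
    table = pd.crosstab(df[feature_a], df[feature_b])
    statistic, p_value, _dof, _expected = chi2_contingency(table)
    return statistic, p_value
```

A feature could then be kept when its p-value falls below a chosen significance level or its absolute Spearman coefficient exceeds a chosen threshold; the thresholds themselves are a modelling decision not specified here.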

In [None]:
# Import Data_reduction module
from modules.data_reduction import FeatureAnalysis

# Helper function that drops weakly correlated variables from the dataset
df = FeatureAnalysis.keepStrongestFeaturesInDataFrame('estimated_loss', df)

### Feature Engineering
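
The two planned features, control time and response time, could be derived from incident timestamps along these lines. This is a sketch only: the helper name `add_time_features` is hypothetical, the default column names are assumptions about the dataset, and measuring control time from the alarm (rather than from arrival) is one possible reading of "how long the fire burned".

```python
import pandas as pd


def add_time_features(
    df,
    alarm_col="TFS_Alarm_Time",            # assumed column name
    arrival_col="TFS_Arrival_Time",        # assumed column name
    control_col="Fire_Under_Control_Time", # assumed column name
):
    """Add response and control times, in minutes, to a copy of df."""
    result = df.copy()
    # Unparseable timestamps become NaT, so the derived value becomes NaN.
    alarm = pd.to_datetime(result[alarm_col], errors="coerce")
    arrival = pd.to_datetime(result[arrival_col], errors="coerce")
    control = pd.to_datetime(result[control_col], errors="coerce")
    # Response time: alarm until the first unit arrives.
    result["response_time_min"] = (arrival - alarm).dt.total_seconds() / 60
    # Control time: alarm until the fire is under control (assumption).
    result["control_time_min"] = (control - alarm).dt.total_seconds() / 60
    return result
```

If the step needs to live inside the sklearn pipeline, wrapping the function with `sklearn.preprocessing.FunctionTransformer(add_time_features)` is one option.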

In [None]:
# feature_engineering pipeline
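# Placeholder step until the feature engineering transformers are implemented
# (a Pipeline with an empty step list cannot be fitted)
feature_engineering = Pipeline(steps=[('placeholder', 'passthrough')])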

### Feature Transformers
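
The one-hot encoding, ordinal encoding and numerical scaling listed in the pipeline design could be assembled with a `ColumnTransformer`, roughly as below. The function name `build_feature_transformers` is hypothetical, and which columns go into which group is an assumption left to the caller.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler


def build_feature_transformers(onehot_cols, ordinal_cols, numeric_cols):
    """Encode categorical columns and standardize numeric ones."""
    return ColumnTransformer(
        transformers=[
            # Categories unseen during fit are encoded as all zeros.
            ("onehot", OneHotEncoder(handle_unknown="ignore"), onehot_cols),
            # Categories unseen during fit are encoded as -1.
            (
                "ordinal",
                OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
                ordinal_cols,
            ),
            # Zero mean and unit variance per numeric column.
            ("scale", StandardScaler(), numeric_cols),
        ],
        remainder="drop",  # columns not listed above are discarded
    )
```

By default `OrdinalEncoder` orders categories alphabetically; when the categories have a meaningful order, passing it explicitly through the `categories` parameter is usually the safer choice.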

In [None]:
# feature_transformers pipeline
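# Placeholder step until the encoders and scalers are implemented
# (a Pipeline with an empty step list cannot be fitted)
feature_transformers = Pipeline(steps=[('placeholder', 'passthrough')])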

## Assembling the pipeline

In [None]:

# Preprocessor pipeline
preprocessor = Pipeline(steps=[('data cleaning', data_cleaning),
                               #('data reduction', data_reduction),
                               ('feature engineering', feature_engineering),
                               ('feature transformers', feature_transformers),
                               ])
# Sample model (placeholder: estimated loss is continuous, so a regressor
# from the list above should replace this classifier)
model = KNeighborsClassifier(n_neighbors=3)

# Assemble final pipeline
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('regressor', model)])

pipeline

In [None]:
# # Sample imports
# from sklearn.compose import ColumnTransformer
# from sklearn.preprocessing import OrdinalEncoder
# from sklearn.pipeline import Pipeline
# from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
# from sklearn.impute import SimpleImputer
# from sklearn.impute import KNNImputer
# from sklearn.neighbors import KNeighborsClassifier
# import pandas as pd

# df = load_data('data/loan.csv')

# # Encode the target variable using LabelEncoder
# label_encoder = LabelEncoder()
# df['Loan_Status'] = label_encoder.fit_transform(df['Loan_Status'])


# # Define categorical and numerical features
# ordinal_categorical_features = ['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed']
# c1_idx = [df.columns.get_loc(item) for item in ordinal_categorical_features]
# onehot_categorical_features = ['Property_Area']
# c2_idx = [df.columns.get_loc(item) for item in onehot_categorical_features]
# numerical_features = df.columns.difference(ordinal_categorical_features + onehot_categorical_features + ['Loan_Status'])
# n_idx = [df.columns.get_loc(item) for item in numerical_features]

# # Create transformers for numerical and categorical features
# numerical_transformer = Pipeline(steps=[
#     ('scaler', StandardScaler())
# ])

# ordinal_categorical_transformer = Pipeline(steps=[
#     ('ordinal', OrdinalEncoder(handle_unknown='error'))
# ])

# onehot_categorical_transformer = Pipeline(steps=[
#     ('onehot', OneHotEncoder(handle_unknown='ignore'))
# ])

# column_imputer = Pipeline(steps=[
#     ('imputer0', KNNImputer())
# ])

# # Apply transformers to features using ColumnTransformer
# feature_transformer = ColumnTransformer(
#     transformers=[
#         ('cat1', ordinal_categorical_transformer, c1_idx),
#         ('cat2', onehot_categorical_transformer, c2_idx),
#         ('num', numerical_transformer, n_idx),
#     ])

# missing_value_imputer = ColumnTransformer(
#     transformers=[
#         ('imputer', column_imputer, c1_idx + c2_idx + n_idx)
#     ])


# # Define the KNN model
# knn_model = KNeighborsClassifier(n_neighbors=3)  # You can adjust the number of neighbors

# # Create the pipeline
# # Create preprocessing and training pipeline
# pipeline = Pipeline(steps=[('transformer', feature_transformer),
#                            ('imputer', missing_value_imputer),
#                            ('classifier', knn_model)])
# pipeline
