# Introduction

This notebook analyzes two datasets of different types: experimental and simulated. Both describe a comparable building, a small commercial building located in Iowa, so the experimental and simulated scenarios can be compared directly. The data cover summer, winter, and transitional periods, which allows the building's behavior to be assessed under different conditions.

The datasets contain the following data:

![obraz.png](attachment:obraz.png)
![obraz-2.png](attachment:obraz-2.png)

In the actual building, faults were manually introduced into the control system for a duration of one day, as outlined in the table below:

![obraz.png](attachment:obraz.png)

Similarly, within the simulation context, faults were manually introduced into the system for a single day, mirroring the approach delineated in the following table:

![obraz.png](attachment:obraz.png)

The simulated dataset clearly covers a broader spectrum of fault types. Some faults, however, appear in both the experimental and simulated datasets. This overlap makes it possible to compare models built from the two sources and to check whether a model trained on simulated data generalizes to real building data. The comparison should show how well the models adapt and perform across the two scenarios.

# Import of the relevant libraries and notebook preparation

In [2]:
import sys
import os

In [3]:
src_path = os.path.join(os.getcwd(), "..", "src")
sys.path.append(src_path)

In [4]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
import warnings
from helper_functions import convert_date, train_evaluate_classification_models, train_evaluate_regression_models
import copy

In [5]:
from catboost import CatBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import GradientBoostingClassifier,AdaBoostClassifier,BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

In [6]:
# Preparing a list of classification algorithms to compare so the best one can be chosen
classifiers = [['KNeighborsClassifier', KNeighborsClassifier()],
              ['MLPClassifier',MLPClassifier()],
              ['AdaBoostClassifier',AdaBoostClassifier()],
              ['GradientBoostingClassifier',GradientBoostingClassifier()],
              ['CatBoostClassifier', CatBoostClassifier()],
              ['XGBClassifier', XGBClassifier()],
              ['BaggingClassifier', BaggingClassifier()],
              ['RandomForestClassifier', RandomForestClassifier()],
              ['DecisionTreeClassifier', DecisionTreeClassifier()],
              ['LogisticRegression', LogisticRegression()]]

In [7]:
warnings.filterwarnings("ignore")

In [8]:
csv_file_path_1 = os.path.join("..", "data", "MZVAV-2-1.csv")
csv_file_path_2 = os.path.join("..", "data", "MZVAV-2-2.csv")

In [9]:
raw_data_exp = pd.read_csv(csv_file_path_1)

In [10]:
raw_data_sim = pd.read_csv(csv_file_path_2)

# Experimental dataset

## Data exploration and feature engineering

In [11]:
raw_data_exp.sample(5)

Unnamed: 0,Datetime,AHU: Supply Air Temperature,AHU: Supply Air Temperature Set Point,AHU: Outdoor Air Temperature,AHU: Mixed Air Temperature,AHU: Return Air Temperature,AHU: Supply Air Fan Status,AHU: Return Air Fan Status,AHU: Supply Air Fan Speed Control Signal,AHU: Return Air Fan Speed Control Signal,AHU: Exhaust Air Damper Control Signal,AHU: Outdoor Air Damper Control Signal,AHU: Return Air Damper Control Signal,AHU: Cooling Coil Valve Control Signal,AHU: Heating Coil Valve Control Signal,AHU: Supply Air Duct Static Pressure Set Point,AHU: Supply Air Duct Static Pressure,Occupancy Mode Indicator,Fault Detection Ground Truth
13518,5/3/2009 9:18,55.04,55,59.83,62.19,74.18,1,1,0.71,0.56,1.0,1.0,1.0,0.31,0.0,1.4,1.36,1,0
12962,5/3/2009 0:02,71.29,55,66.77,69.62,72.71,0,0,0.2,0.2,0.0,0.0,0.0,0.0,0.0,1.4,0.0,0,0
1808,8/29/2007 6:08,69.79,55,67.87,73.86,73.97,1,1,0.65,0.51,0.39,0.39,0.61,0.06,0.0,1.4,0.76,1,1
16274,5/5/2009 7:14,53.32,55,44.06,51.37,69.76,1,1,0.66,0.53,0.47,0.63,0.63,0.0,0.0,1.4,1.49,1,0
10638,2/16/2009 9:18,64.08,65,21.95,52.43,71.87,1,1,0.71,0.57,0.72,0.47,0.47,0.0,0.83,1.4,1.4,1,0


In [12]:
raw_data_exp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21600 entries, 0 to 21599
Data columns (total 19 columns):
 #   Column                                          Non-Null Count  Dtype  
---  ------                                          --------------  -----  
 0   Datetime                                        21600 non-null  object 
 1   AHU: Supply Air Temperature                     21600 non-null  float64
 2   AHU: Supply Air Temperature Set Point           21600 non-null  int64  
 3   AHU: Outdoor Air Temperature                    21600 non-null  float64
 4   AHU: Mixed Air Temperature                      21600 non-null  float64
 5   AHU: Return Air Temperature                     21600 non-null  float64
 6   AHU: Supply Air Fan Status                      21600 non-null  int64  
 7   AHU: Return Air Fan Status                      21600 non-null  int64  
 8   AHU: Supply Air Fan Speed Control Signal        21600 non-null  float64
 9   AHU: Return Air Fan Speed Control Signa

The dataset contains no null values, and the only object-type column is "Datetime", which will be converted to a datetime type. No other cleaning is needed in these respects.

In [None]:
# Changing data type of the Datetime column
raw_data_exp['Datetime'] = pd.to_datetime(raw_data_exp['Datetime'])

In [None]:
raw_data_exp.describe()

The data appear to be clean. The next step is to select the relevant data for model building.

In [None]:
# Making copy of dataset for further data transformation
data_exp = raw_data_exp.copy()

Observing the dataset reveals that the column labeled "AHU: Supply Air Duct Static Pressure Set Point" maintains a constant value throughout. As a result, it is necessary to eliminate this column from the dataset, as it imparts negligible information for modeling purposes.

In [None]:
data_exp.drop('AHU: Supply Air Duct Static Pressure Set Point', axis=1, inplace=True)
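
The constant column mentioned above can also be found programmatically rather than by inspection. The following is a minimal sketch, assuming a pandas DataFrame; the helper name `constant_columns` and the toy values are illustrative and do not come from the dataset.

```python
import pandas as pd


def constant_columns(df: pd.DataFrame) -> list:
    """Return the names of columns that hold a single distinct value."""
    # nunique(dropna=False) counts distinct values per column, treating NaN as a value
    counts = df.nunique(dropna=False)
    return counts[counts <= 1].index.tolist()


# Toy frame standing in for the real data (values are made up)
demo = pd.DataFrame({
    "setpoint": [1.4, 1.4, 1.4],
    "pressure": [1.36, 0.0, 0.76],
})
constant = constant_columns(demo)
print(constant)
```

Applied to the notebook's data, the returned list could be passed straight to `DataFrame.drop(columns=...)`.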

In [None]:
plt.figure(figsize=(16,16))
sns.heatmap(data_exp.corr(), annot=True)

Upon reviewing the dataset, it becomes evident that five columns—namely, "AHU: Supply Air Fan Status," "AHU: Return Air Fan Status," "AHU: Supply Air Fan Speed Control Signal," "AHU: Return Air Fan Speed Control Signal," and "AHU: Supply Air Duct Static Pressure"—exhibit significant correlation, likely attributed to the control sequence. To mitigate multicollinearity, it is rational to retain only one of these columns.

Notably, the dataset description indicates that fans are deactivated during unoccupied mode. Additionally, a strong correlation is evident between the occupancy mode indicator and the supply air fan status. Consequently, it is reasonable to omit the occupancy mode indicator column as well.

In [None]:
data_exp.drop(['AHU: Return Air Fan Status', 'AHU: Supply Air Fan Speed Control Signal','AHU: Return Air Fan Speed Control Signal',
                 'AHU: Supply Air Duct Static Pressure ', 'Occupancy Mode Indicator'], axis=1, inplace=True)
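
Reading strongly correlated groups off a large heatmap is error-prone, so listing the pairs explicitly can help. The sketch below assumes a threshold of 0.9, which is an arbitrary illustrative choice, and uses made-up column names and values.

```python
import numpy as np
import pandas as pd


def correlated_pairs(df: pd.DataFrame, threshold: float = 0.9) -> pd.Series:
    """List column pairs whose absolute Pearson correlation exceeds the threshold."""
    corr = df.select_dtypes("number").corr().abs()
    # Keep only the strict upper triangle so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # The comparison also discards any NaN entries left over from masking
    return pairs[pairs > threshold].sort_values(ascending=False)


# Toy data: fan speed follows fan status exactly, outdoor temperature does not
demo = pd.DataFrame({
    "fan_status": [0, 0, 1, 1, 1],
    "fan_speed": [0.2, 0.2, 0.7, 0.7, 0.7],
    "outdoor_temp": [60.0, 45.0, 70.0, 52.0, 66.0],
})
pairs = correlated_pairs(demo)
print(pairs)
```

Which column of each pair to keep remains a judgment call based on knowledge of the control sequence.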

In [None]:
plt.figure(figsize=(16,16))
sns.heatmap(data_exp.corr(), annot=True)

The dataset appears to be ready for the subsequent model-building phase. The heatmap above shows correlations between the "Fault Detection Ground Truth" column and several other columns. These correlations indicate that the "Fault Detection Ground Truth" column is well-suited as a target for a classification algorithm.

## Model building

### Feature and target data preparation

In [None]:
X_exp = data_exp.drop(['Datetime','Fault Detection Ground Truth'], axis=1)

In [None]:
y_exp = data_exp['Fault Detection Ground Truth']

In [None]:
X_train_exp, X_test_exp, y_train_exp, y_test_exp = train_test_split(X_exp, y_exp, test_size=0.1, shuffle=True)

### Classification model training

In [None]:
models_exp, Acc_exp = train_evaluate_classification_models(X_train_exp, X_test_exp, y_train_exp, y_test_exp, classifiers)
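
The helper `train_evaluate_classification_models` is imported from `helper_functions` and its body is not shown in this notebook. As a rough idea of what a helper with this call signature might do, here is a hypothetical sketch: the function name `train_evaluate_sketch`, the chosen metrics, and the synthetic data are all assumptions, not the author's implementation.

```python
import pandas as pd
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def train_evaluate_sketch(X_train, X_test, y_train, y_test, named_models):
    """Hypothetical helper: fit each (name, estimator) pair and score it on the test set."""
    fitted, rows = [], []
    for name, estimator in named_models:
        # clone() leaves the template estimator in the list unfitted
        model = clone(estimator).fit(X_train, y_train)
        y_pred = model.predict(X_test)
        rows.append({
            "model": name,
            "accuracy": accuracy_score(y_test, y_pred),
            "precision": precision_score(y_test, y_pred, zero_division=0),
            "recall": recall_score(y_test, y_pred, zero_division=0),
            "f1": f1_score(y_test, y_pred, zero_division=0),
        })
        fitted.append(model)
    # Models come back in input order, so positional indexing stays meaningful
    return fitted, pd.DataFrame(rows)


# Synthetic data stands in for the building measurements
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=0)
demo_models = [["LogisticRegression", LogisticRegression(max_iter=1000)],
               ["DecisionTreeClassifier", DecisionTreeClassifier(random_state=0)]]
fitted, scores = train_evaluate_sketch(X_tr, X_te, y_tr, y_te, demo_models)
print(scores)
```

Returning the fitted models in list order matters for any later positional access, such as picking one model out of the list by index.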

### Models evaluation

In [None]:
print(Acc_exp)

The outcomes highlight the remarkable performance of most models in this task, even without balancing or fine-tuning efforts. Interestingly, even the Logistic Regression model exhibits relatively strong performance. This trend suggests that the relationships between features and target data are easily recognizable, and the dataset is well-prepared. This implies that utilizing machine learning models for detecting such faults is likely to be straightforward and practical, as long as we have high quality data.

Building such a system, however, would require a sequence of functional tests and sound procedures for producing high-quality data, which could be costly. The result would nevertheless be a highly effective fault detection tool whose benefits could substantially outweigh the implementation costs, particularly when compared with the financial consequences of overlooked faults or inefficiencies in the building management system.

### Feature importances

In [None]:
plt.figure(figsize=(10,16))
plt.barh(X_exp.columns, models_exp[4].get_feature_importance())

An intriguing observation emerges from the graph above: the heating coil valve signal holds minimal importance in detecting faults associated with that valve. This unexpected finding could imply that the faulty element's own behavior offers few direct clues about its malfunction, and that fault indications instead arise from the atypical behavior of other components. For instance, the cooling coil valve signal carries more weight, potentially because a faulty heating valve increases the demand for cooling and so intensifies the cooling valve's activity.

This insight underscores the complex interplay between system components and their fault indicators. It emphasizes the need to consider not only the individual behavior of components but also their collective responses in identifying anomalies and failures. It also highlights the usefulness of machine learning tools.

# Simulated dataset

## Data exploration and feature engineering

In [None]:
raw_data_sim.sample(5)

In [None]:
raw_data_sim.info()

It appears that the dataset is free from any major cleaning requirements, except for the modification of the "Datetime" column's datatype to datetime. 

In [None]:
raw_data_sim['Datetime'] = pd.to_datetime(raw_data_sim['Datetime'])

In [None]:
raw_data_sim.describe()

In [None]:
# Making copy of dataset for further data transformation
data_sim = raw_data_sim.copy()

The "AHU: Supply Air Duct Static Pressure Set Point" column has a constant value, so it is excluded from the dataset.

In [None]:
data_sim.drop('AHU: Supply Air Duct Static Pressure Set Point', axis=1, inplace=True)

In [None]:
plt.figure(figsize=(16,16))
sns.heatmap(data_sim.corr(), annot=True)

The "AHU: Supply Air Fan Status" and "Occupancy Mode Indicator" columns exhibit complete correlation, rendering it unnecessary to retain both. Therefore, only the "AHU: Supply Air Fan Status" column will be retained.

Among the columns "AHU: Exhaust Air Damper Control Signal," "AHU: Outdoor Air Damper Control Signal," and "AHU: Return Air Damper Control Signal," a 100% correlation is also evident, necessitating the preservation of only one. Hence, the "AHU: Exhaust Air Damper Control Signal" column will be retained.

These modifications align with the objective of refining the dataset for analysis and modeling by removing redundant and correlated features.

In [None]:
data_sim.drop(['Occupancy Mode Indicator','AHU: Outdoor Air Damper Control Signal  ','AHU: Return Air Damper Control Signal'], axis=1,inplace=True)

In [None]:
plt.figure(figsize=(16,16))
sns.heatmap(data_sim.corr(), annot=True)

The dataset appears to be well-prepared for the subsequent phase of model building. The "Fault Detection Ground Truth" column exhibits correlations with certain other columns. However, the dataset description shows that this column covers a diverse array of fault types. Given this, it is more valuable to identify not only the presence of a fault but also its type. Consequently, three separate target columns will be prepared, one per fault type, which enables a more informative modeling approach.

In [None]:
def convert_date(date_str):
    return pd.to_datetime(date_str, format='%m/%d/%Y')

In [None]:
# Creating lists of dates when faults occurred according to the description.
OA_fault_dates = [convert_date('2/12/2008'), 
                  convert_date('5/7/2008'), 
                  convert_date('5/8/2008'), 
                  convert_date('9/5/2007'), 
                  convert_date('9/6/2007')]
heat_vlv_fault_dates = [convert_date('8/28/2007'), 
                        convert_date('8/29/2007'), 
                        convert_date('8/30/2007')]
cool_vlv_fault_dates = [convert_date('5/6/2008'), 
                        convert_date('8/31/2007'), 
                        convert_date('5/15/2008'),
                        convert_date('9/1/2007'),
                        convert_date('9/2/2007')]

In [None]:
# Creating and filling fault column with values according to the dates given earlier.
data_sim['OA_fault'] = 0
data_sim['heat_vlv_fault'] = 0
data_sim['cool_vlv_fault'] = 0
for date in OA_fault_dates:
    data_sim.loc[data_sim['Datetime'].dt.date == date.date(), 'OA_fault'] = 1
for date in heat_vlv_fault_dates:
    data_sim.loc[data_sim['Datetime'].dt.date == date.date(), 'heat_vlv_fault'] = 1
for date in cool_vlv_fault_dates:
    data_sim.loc[data_sim['Datetime'].dt.date == date.date(), 'cool_vlv_fault'] = 1
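
The day-by-day loops above can also be written as one vectorized membership test per fault type. This is a minimal sketch on a toy frame; the timestamps and the two fault days used here are illustrative only.

```python
import pandas as pd

# Toy timestamps standing in for the simulated dataset
demo = pd.DataFrame({
    "Datetime": pd.to_datetime([
        "2007-08-28 00:00",
        "2007-08-28 13:05",
        "2007-08-31 09:00",
        "2007-09-05 23:59",
    ])
})

# One list of fault days per target column (illustrative subset)
fault_days = {
    "heat_vlv_fault": ["2007-08-28"],
    "OA_fault": ["2007-09-05"],
}

# normalize() sets every timestamp to midnight, so whole days can be compared
day = demo["Datetime"].dt.normalize()
for column, dates in fault_days.items():
    demo[column] = day.isin(pd.to_datetime(dates)).astype(int)

print(demo)
```

Keeping the dates in a dictionary keyed by column name makes it harder for a fault list and its target column to drift apart.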

## Model building

### Feature and target data preparation

In [None]:
X_sim = data_sim.drop(['Datetime','Fault Detection Ground Truth','OA_fault','heat_vlv_fault','cool_vlv_fault'], axis=1)

In [None]:
y_oa = data_sim['OA_fault']
y_heat = data_sim['heat_vlv_fault']
y_cool = data_sim['cool_vlv_fault']

In [None]:
X_train_oa, X_test_oa, y_train_oa, y_test_oa = train_test_split(X_sim, y_oa, test_size=0.1, shuffle=True)
X_train_heat, X_test_heat, y_train_heat, y_test_heat = train_test_split(X_sim, y_heat, test_size=0.1, shuffle=True)
X_train_cool, X_test_cool, y_train_cool, y_test_cool = train_test_split(X_sim, y_cool, test_size=0.1, shuffle=True)

### Classification model training

In [None]:
models_oa, Acc_oa = train_evaluate_classification_models(X_train_oa, X_test_oa, y_train_oa, y_test_oa, classifiers)
models_heat, Acc_heat = train_evaluate_classification_models(X_train_heat, X_test_heat, y_train_heat, y_test_heat, classifiers)
models_cool, Acc_cool = train_evaluate_classification_models(X_train_cool, X_test_cool, y_train_cool, y_test_cool, classifiers)

## Models evaluation

In [None]:
print(Acc_oa)

In [None]:
print(Acc_heat)

In [None]:
print(Acc_cool)

The outcomes reveal the remarkable performance of most models across all tasks. This success suggests that the data's underlying dependencies were easily learned by the models, and it points to the potential of training such models for integration into building management systems. It is important to bear in mind, however, that these models are trained on simulated data and are therefore effectively models of a model.

Given this context, it is necessary to evaluate the similarity between models trained on experimental and simulated data. This assessment will verify whether these models capture analogous dependencies and can effectively translate to real-world data and systems.

For the models that predict damper and cooling valve faults, performance can likely be improved by balancing class weights to compensate for class imbalance. The CatBoostClassifier has built-in support for this through its auto_class_weights parameter.
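
To make the idea of balanced class weights concrete, the sketch below uses scikit-learn's `compute_class_weight`, which weights each class by `n_samples / (n_classes * class_count)`. This illustrates the general principle only; CatBoost computes its own weights internally and its exact formula may differ. The 90/10 class split is a made-up example.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up imbalanced target: 90 normal samples, 10 fault samples
y = np.array([0] * 90 + [1] * 10)
classes = np.unique(y)

# 'balanced' assigns n_samples / (n_classes * count_of_class) to each class
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
class_weights = dict(zip(classes.tolist(), weights.tolist()))
print(class_weights)
```

The rare fault class receives the larger weight, so misclassifying a fault costs the model more during training.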

## Class balancing

In [None]:
model_oa = CatBoostClassifier(auto_class_weights='Balanced')

In [None]:
model_oa.fit(X_train_oa, y_train_oa)

In [None]:
y_pred_train = model_oa.predict(X_train_oa)
y_pred_test = model_oa.predict(X_test_oa)
acc_train = accuracy_score(y_train_oa, y_pred_train)
acc_test = accuracy_score(y_test_oa, y_pred_test)
prec_train = precision_score(y_train_oa, y_pred_train)
prec_test = precision_score(y_test_oa, y_pred_test)    
rec_train = recall_score(y_train_oa, y_pred_train)
rec_test = recall_score(y_test_oa, y_pred_test) 
f1_train = f1_score(y_train_oa, y_pred_train)
f1_test = f1_score(y_test_oa, y_pred_test)

In [None]:
print(pd.Series({               'train_accuracy': acc_train,
                                'test_accuracy': acc_test,
                                'train_precision': prec_train,
                                'test_precision': prec_test,
                                'train_recall': rec_train,
                                'test_recall': rec_test,
                                'train_f1': f1_train,
                                'test_f1': f1_test}))

The model with balanced classes performed slightly better overall. The gain is most visible in recall, which improved significantly: the model identifies more instances of the positive class (fault occurrences). Precision dropped slightly as a result, but the trade-off is favorable. This configuration seems promising for deployment, as it balances catching faults (few false negatives) against raising false alarms (false positives).

In [None]:
model_cool = CatBoostClassifier(auto_class_weights='Balanced')

In [None]:
model_cool.fit(X_train_cool, y_train_cool)

In [None]:
y_pred_train = model_cool.predict(X_train_cool)
y_pred_test = model_cool.predict(X_test_cool)
acc_train = accuracy_score(y_train_cool, y_pred_train)
acc_test = accuracy_score(y_test_cool, y_pred_test)
prec_train = precision_score(y_train_cool, y_pred_train)
prec_test = precision_score(y_test_cool, y_pred_test)    
rec_train = recall_score(y_train_cool, y_pred_train)
rec_test = recall_score(y_test_cool, y_pred_test) 
f1_train = f1_score(y_train_cool, y_pred_train)
f1_test = f1_score(y_test_cool, y_pred_test)

In [None]:
print(pd.Series({               'train_accuracy': acc_train,
                                'test_accuracy': acc_test,
                                'train_precision': prec_train,
                                'test_precision': prec_test,
                                'train_recall': rec_train,
                                'test_recall': rec_test,
                                'train_f1': f1_train,
                                'test_f1': f1_test}))

Similar to the preceding model, the current model also demonstrates improved performance in terms of recall, while precision exhibits a reduction. This pattern suggests that the model is adept at correctly identifying instances of the positive class (fault occurrences), albeit at the expense of a slightly increased rate of false positives. This configuration is particularly useful when prioritizing sensitivity in fault detection.
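
The recall and precision trade-off described above can be seen on a small hand-made example. The labels below are invented for illustration: one set of predictions is cautious and misses faults, the other is sensitive and raises false alarms.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_cautious = [0, 0, 0, 0, 0, 0, 1, 1, 0, 0]   # misses two faults, no false alarms
y_sensitive = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]  # catches every fault, two false alarms

results = {}
for label, y_pred in [("cautious", y_cautious), ("sensitive", y_sensitive)]:
    # For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    results[label] = {
        "false_positives": int(fp),
        "false_negatives": int(fn),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
    }
    print(label, results[label])
```

The sensitive predictions reach full recall at the cost of lower precision, which mirrors the behavior of the class-balanced models.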

## Feature importances

In [None]:
plt.figure(figsize=(10,16))
plt.barh(X_sim.columns, models_heat[4].get_feature_importance())

The weights above appear to align logically with the system's operational dynamics. Notably, differences exist between this model and the model constructed using experimental data. These differences could arise from distinct characteristics of the simulated and experimental datasets, reflecting the variability between these two sources of information.

# Simulation and experimental model comparison

We now have two sets of models, each trained on a different data source. Despite their distinct origins, they share a common objective: detecting heating valve leakage faults in similar systems. It is therefore possible to compare how each set of models performs when applied to the other dataset. This assessment will show how well the models adapt across data contexts.

## Model trained on experimental data - prediction on simulated data

In [None]:
Acc_exp_on_sim = pd.DataFrame(index=None, columns=['model','accuracy','precision','recall','f1'])

In [None]:
# Preparing feature data with columns from the experimental data model.
X_exp_on_sim = raw_data_sim[X_exp.columns]

In [None]:
rows = []
for model in models_exp:
    # The class name gives a consistent label for every estimator type.
    name = type(model).__name__
    y_pred = model.predict(X_exp_on_sim)
    rows.append({'model': name,
                 'accuracy': accuracy_score(y_heat, y_pred),
                 'precision': precision_score(y_heat, y_pred),
                 'recall': recall_score(y_heat, y_pred),
                 'f1': f1_score(y_heat, y_pred)})

# DataFrame.append was removed in pandas 2.0, so the table is built in one step.
Acc_exp_on_sim = pd.DataFrame(rows, columns=['model','accuracy','precision','recall','f1'])

In [None]:
print(Acc_exp_on_sim)

The outcomes reveal a substantial decline in the predictive performance of all models, particularly in terms of precision. The models predicted significantly more faults than actually occurred. This trend is consistent across all models, which suggests that the decline is not caused by a particular algorithm or by poor training. Rather, it likely results from the dependencies in the experimental data differing from those in the simulated data.

To verify the consistency of this behavior, it is necessary to investigate whether the pattern persists in the reverse scenario.

## Model trained on simulated data - prediction on experimental data

In [None]:
Acc_sim_on_exp = pd.DataFrame(index=None, columns=['model','accuracy','precision','recall','f1'])

In [None]:
# Preparing feature data with columns from the simulated data model.
X_sim_on_exp = raw_data_exp[X_sim.columns]

In [None]:
rows = []
for model in models_heat:
    # The class name gives a consistent label for every estimator type.
    name = type(model).__name__
    y_pred = model.predict(X_sim_on_exp)
    rows.append({'model': name,
                 'accuracy': accuracy_score(y_exp, y_pred),
                 'precision': precision_score(y_exp, y_pred),
                 'recall': recall_score(y_exp, y_pred),
                 'f1': f1_score(y_exp, y_pred)})

# DataFrame.append was removed in pandas 2.0, so the table is built in one step.
Acc_sim_on_exp = pd.DataFrame(rows, columns=['model','accuracy','precision','recall','f1'])

In [None]:
print(Acc_sim_on_exp)

In this instance, the models' performance further deteriorated, reaffirming the previous assumptions that the relationships present in real-world data differ from those within the simulated data. This confirmation underscores the significance of data source variability and its impact on the models' ability to generalize effectively.
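
One simple way to probe where two data sources differ is to compare the distribution of each shared feature. The sketch below computes a standardized mean shift per column. The function name `standardized_mean_shift`, the toy frames, and the choice of metric are assumptions for illustration, and a shift in feature distributions is only one possible cause of poor transfer.

```python
import pandas as pd


def standardized_mean_shift(source: pd.DataFrame, target: pd.DataFrame) -> pd.Series:
    """Per-column (target mean - source mean) / source std over shared numeric columns."""
    shared = source.select_dtypes("number").columns.intersection(target.columns)
    shift = (target[shared].mean() - source[shared].mean()) / source[shared].std()
    # Largest absolute shifts first
    return shift.sort_values(key=abs, ascending=False)


# Toy frames: the temperature differs by 10 units, the valve signal is identical
source = pd.DataFrame({"supply_temp": [54.0, 55.0, 56.0], "valve": [0.0, 0.5, 1.0]})
target = pd.DataFrame({"supply_temp": [64.0, 65.0, 66.0], "valve": [0.0, 0.5, 1.0]})
shift = standardized_mean_shift(source, target)
print(shift)
```

Features with a large shift would be natural candidates for closer inspection before concluding that the simulated and real systems behave differently.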

# Final conclusions

In summation, the outcomes emphasize the viability of employing machine learning models for fault detection within a system. Nevertheless, the choice of data source significantly influences model performance. When utilizing well-constructed data, the majority of models exhibit outstanding performance even without extensive model tuning or data balancing.

A limitation emerges, however: simulation data may not be suitable for training models intended for real buildings. Generating data through simulation is cost-effective, but the resulting model's efficacy is constrained by the accuracy of the simulation. The outcome is a model built on another model, which limits its real-world practicality. The ultimate requirement is a model that closely mirrors reality, which maximizes its usefulness in real-world applications.