## Problem Statement

The goal of this analysis is to:

- Explore a custom synthetic HVAC dataset.
- Identify key patterns, trends, and anomalies in the data.
- Apply suitable preprocessing steps, including:
  - Handling missing values,
  - Normalization,
  - Outlier treatment.
- Compute core performance metrics such as:
  - Energy usage,
  - Power consumption.
- Derive additional Key Performance Indicators (KPIs), if applicable, including:
  - HVAC efficiency,
  - Comfort scoring.
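
As a rough illustration of the energy-usage metric listed above: energy is power integrated over time, so for evenly spaced samples it reduces to the sum of the power readings multiplied by the interval length. The sketch below assumes power is measured in watts and that samples are 15 minutes apart. Both are assumptions made for illustration, the helper name `energy_kwh` is hypothetical, and the readings are made-up values.

```python
def energy_kwh(power_watts, interval_hours=0.25):
    """Approximate energy in kWh from evenly spaced power samples.

    Assumes each reading (in watts) holds for one full interval.
    """
    return sum(power_watts) * interval_hours / 1000.0


# Four hypothetical 15-minute readings, in watts (not taken from the dataset)
sample_power = [1400.0, 1450.0, 1500.0, 1480.0]
total_kwh = energy_kwh(sample_power)
print(f"Energy over one hour: {total_kwh:.4f} kWh")
```

If the actual unit of the power column turns out to be kilowatts, the division by 1000 would simply be dropped.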




## Initial Data Exploration and Insights

<b>Import the Necessary Libraries Here</b>
<hr>

In [1]:
# Import necessary libraries
import pandas as pd              # Data manipulation and analysis
import numpy as np               # Numerical operations
import matplotlib.pyplot as plt  # Data visualization
import seaborn as sns            # Enhanced visualizations
import os                        # File path handling
# Display settings for cleaner output
pd.set_option('display.max_columns', None)
sns.set(style="whitegrid")

<b> Data Exploration and Preprocessing </b>
<hr>

In [2]:
# Load the HVAC synthetic dataset
file_name = "hvac_synth.csv"
data_folder = 'data'
data_path = os.path.join(data_folder, file_name)
hvac_data = pd.read_csv(data_path)

# Preview the first few rows
hvac_data.head()

Unnamed: 0,timestamp,indoor_temp,supply_temp,hvac_control,airflow,power_usage,outdoor_temp,solar_radiation,occupancy,price,temp_error,cooling_demand,heating_demand,indoor_temp_savgol,supply_temp_savgol,outdoor_temp_savgol,temp_error_savgol,hvac_control_sma,airflow_sma,power_usage_sma,solar_radiation_sma,occupancy_sma,price_sma,cooling_demand_sma,heating_demand_sma,hvac_control_ema,airflow_ema,power_usage_ema,solar_radiation_ema,occupancy_ema,price_ema,cooling_demand_ema,heating_demand_ema,indoor_temp_robust,supply_temp_robust,outdoor_temp_robust,temp_error_robust,hvac_control_minmax,airflow_minmax,power_usage_minmax,solar_radiation_minmax,occupancy_minmax,price_minmax,cooling_demand_minmax,heating_demand_minmax
0,2020-01-01 00:00:00,,,,,,0.537145,0.0,0.245344,0.160679,,0.0,,5.815322,-0.184678,-0.084103,-16.184678,1.0,0.901447,1496.661944,0.0,0.236733,0.163987,0.0,16.200316,,,,0.0,0.245344,0.160679,0.0,,2.501707,2.501695,-1.397942,2.501707,0.0,0.263284,0.01257,0.0,0.245344,0.453302,0.0,0.040649
1,2020-01-01 00:15:00,,,,0.851765,1416.817844,1.245409,0.0,0.189761,0.163932,,0.0,,5.815322,-0.184678,2.362373,-16.184678,1.0,0.901447,1496.661944,0.0,0.236733,0.163987,0.0,16.200316,,0.851765,1416.817844,0.0,0.213582,0.162538,0.0,,2.501707,2.501695,-1.33023,2.501707,0.0,0.263284,0.01257,0.0,0.189761,0.487622,0.0,0.040649
2,2020-01-01 00:30:00,,,,0.881392,1441.962322,3.160134,0.0,0.050169,0.182369,,0.0,,5.815322,-0.184678,3.536728,-16.184678,1.0,0.901447,1496.661944,0.0,0.236733,0.163987,0.0,16.200316,,0.868695,1431.186117,0.0,0.142917,0.171114,0.0,,2.501707,2.501695,-1.147177,2.501707,0.0,0.410525,0.104629,0.0,0.050169,0.682176,0.0,0.040649
3,2020-01-01 00:45:00,,,,0.919333,1488.323327,5.05805,0.0,0.215428,0.169229,,0.0,,5.815322,-0.184678,3.720391,-16.184678,1.0,0.901447,1496.661944,0.0,0.236733,0.163987,0.0,16.200316,,0.890593,1455.8941,0.0,0.169435,0.170424,0.0,,2.501707,2.501695,-0.96573,2.501707,0.0,0.599093,0.274367,0.0,0.215428,0.543514,0.0,0.040649
4,2020-01-01 01:00:00,,,,0.866692,1479.400149,1.32229,0.0,0.318407,0.163675,,0.0,,5.815322,-0.184678,2.620375,-16.184678,1.0,0.901447,1496.661944,0.0,0.21475,0.16547,0.0,16.200316,,0.881852,1464.490598,0.0,0.218266,0.168212,0.0,,2.501707,2.501695,-1.32288,2.501707,0.0,0.337469,0.241697,0.0,0.318407,0.484911,0.0,0.040649


In [3]:
print(f"Dataset shape: {hvac_data.shape}")


Dataset shape: (408000, 45)


> As I began exploring the dataset, I quickly noticed its scale: **408,000 rows** across **45 columns**. That size offers plenty of room to uncover meaningful patterns, but it also calls for careful preprocessing to keep the complexity manageable.


In [4]:
hvac_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 408000 entries, 0 to 407999
Data columns (total 45 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   timestamp               408000 non-null  object 
 1   indoor_temp             405616 non-null  float64
 2   supply_temp             405615 non-null  float64
 3   hvac_control            407974 non-null  float64
 4   airflow                 407999 non-null  float64
 5   power_usage             407999 non-null  float64
 6   outdoor_temp            408000 non-null  float64
 7   solar_radiation         408000 non-null  float64
 8   occupancy               408000 non-null  float64
 9   price                   408000 non-null  float64
 10  temp_error              405616 non-null  float64
 11  cooling_demand          408000 non-null  float64
 12  heating_demand          405616 non-null  float64
 13  indoor_temp_savgol      408000 non-null  float64
 14  supply_temp_savgol  

> The `timestamp` column is currently stored as an object type and needs to be converted to a proper `datetime` format for accurate time-based analysis.


In [5]:
# Make sure timestamp column is datetime
hvac_data['timestamp'] = pd.to_datetime(hvac_data['timestamp'])

# Determine the temporal coverage of the dataset
min_timestamp = hvac_data['timestamp'].min()
max_timestamp = hvac_data['timestamp'].max()

print(f"The dataset covers a period from {min_timestamp:%Y-%m-%d %H:%M:%S} to {max_timestamp:%Y-%m-%d %H:%M:%S}.")


The dataset covers a period from 2020-01-01 00:00:00 to 2031-08-20 23:45:00.


In [6]:
# Calculate time differences between consecutive rows
time_diffs = hvac_data['timestamp'].diff().dropna()

# View the time differences
print(time_diffs.value_counts())

timestamp
0 days 00:15:00    407999
Name: count, dtype: int64


> The analysis above indicates that the HVAC data is recorded at **15-minute intervals**, spanning from **2020-01-01 00:00:00** to **2031-08-20 23:45:00**. Because all 407,999 consecutive differences equal exactly 15 minutes, the sampling frequency is regular throughout the dataset, with no gaps or duplicate timestamps.
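
The regularity can also be cross-checked with simple arithmetic: if sampling is uninterrupted, the last timestamp should equal the first one plus (rows − 1) steps of 15 minutes. The sketch below uses only the row count and start time reported above. The variable names are illustrative.

```python
from datetime import datetime, timedelta

n_rows = 408_000                      # row count reported by hvac_data.shape
start = datetime(2020, 1, 1, 0, 0)    # first timestamp in the dataset
step = timedelta(minutes=15)          # sampling interval found above

# With no gaps, the final sample sits (n_rows - 1) steps after the first
expected_last = start + (n_rows - 1) * step
print(f"Expected last timestamp: {expected_last}")
```

The result agrees with the maximum timestamp printed earlier, which is consistent with a gap-free series.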


In [7]:
# Identify columns with zero variance (a single distinct value; nunique() ignores NaN by default)
singleton_columns = hvac_data.columns[hvac_data.nunique() == 1]

# If any such columns exist, display and drop them
if not singleton_columns.empty:
    print(f"Removing the following zero-variance columns: {singleton_columns.tolist()}")
    hvac_data = hvac_data.drop(columns=singleton_columns)
else:
    print("No zero-variance columns found.")
   

Removing the following zero-variance columns: ['hvac_control', 'cooling_demand', 'hvac_control_sma', 'cooling_demand_sma', 'hvac_control_ema', 'cooling_demand_ema', 'hvac_control_minmax', 'cooling_demand_minmax']


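> The analysis above shows that the dataset contains variables with only a single distinct value. Note that `nunique()` ignores missing entries, so a column such as `hvac_control` (407,974 non-null values out of 408,000) still counts as constant. Such predictors provide no meaningful information for modeling and have therefore been removed to simplify the dataset.  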
>  
> The removed variables are:  
> `hvac_control`, `cooling_demand`, `hvac_control_sma`, `cooling_demand_sma`, `hvac_control_ema`, `cooling_demand_ema`, `hvac_control_minmax`, `cooling_demand_minmax`.
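
To make the treatment of missing values in this check concrete, here is a minimal sketch on a toy frame (the column names are hypothetical). It contrasts the default `nunique()`, which skips NaN, with `nunique(dropna=False)`, which counts NaN as its own value.

```python
import numpy as np
import pandas as pd

# Toy data: one truly constant column, one constant apart from a gap, one varying
toy = pd.DataFrame({
    "constant": [1.0, 1.0, 1.0, 1.0],
    "constant_with_gap": [1.0, np.nan, 1.0, 1.0],
    "varying": [0.1, 0.2, 0.3, 0.4],
})

# Default behaviour: NaN is ignored when counting distinct values
ignoring_nan = toy.columns[toy.nunique() == 1].tolist()

# Stricter variant: NaN counts as a distinct value
counting_nan = toy.columns[toy.nunique(dropna=False) == 1].tolist()

print("Constant (NaN ignored):", ignoring_nan)
print("Constant (NaN counted):", counting_nan)
```

Which variant is appropriate depends on whether the missing entries themselves carry information. For dropping uninformative predictors, the default is usually sufficient.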


In [8]:
hvac_data.head()

Unnamed: 0,timestamp,indoor_temp,supply_temp,airflow,power_usage,outdoor_temp,solar_radiation,occupancy,price,temp_error,heating_demand,indoor_temp_savgol,supply_temp_savgol,outdoor_temp_savgol,temp_error_savgol,airflow_sma,power_usage_sma,solar_radiation_sma,occupancy_sma,price_sma,heating_demand_sma,airflow_ema,power_usage_ema,solar_radiation_ema,occupancy_ema,price_ema,heating_demand_ema,indoor_temp_robust,supply_temp_robust,outdoor_temp_robust,temp_error_robust,airflow_minmax,power_usage_minmax,solar_radiation_minmax,occupancy_minmax,price_minmax,heating_demand_minmax
0,2020-01-01 00:00:00,,,,,0.537145,0.0,0.245344,0.160679,,,5.815322,-0.184678,-0.084103,-16.184678,0.901447,1496.661944,0.0,0.236733,0.163987,16.200316,,,0.0,0.245344,0.160679,,2.501707,2.501695,-1.397942,2.501707,0.263284,0.01257,0.0,0.245344,0.453302,0.040649
1,2020-01-01 00:15:00,,,0.851765,1416.817844,1.245409,0.0,0.189761,0.163932,,,5.815322,-0.184678,2.362373,-16.184678,0.901447,1496.661944,0.0,0.236733,0.163987,16.200316,0.851765,1416.817844,0.0,0.213582,0.162538,,2.501707,2.501695,-1.33023,2.501707,0.263284,0.01257,0.0,0.189761,0.487622,0.040649
2,2020-01-01 00:30:00,,,0.881392,1441.962322,3.160134,0.0,0.050169,0.182369,,,5.815322,-0.184678,3.536728,-16.184678,0.901447,1496.661944,0.0,0.236733,0.163987,16.200316,0.868695,1431.186117,0.0,0.142917,0.171114,,2.501707,2.501695,-1.147177,2.501707,0.410525,0.104629,0.0,0.050169,0.682176,0.040649
3,2020-01-01 00:45:00,,,0.919333,1488.323327,5.05805,0.0,0.215428,0.169229,,,5.815322,-0.184678,3.720391,-16.184678,0.901447,1496.661944,0.0,0.236733,0.163987,16.200316,0.890593,1455.8941,0.0,0.169435,0.170424,,2.501707,2.501695,-0.96573,2.501707,0.599093,0.274367,0.0,0.215428,0.543514,0.040649
4,2020-01-01 01:00:00,,,0.866692,1479.400149,1.32229,0.0,0.318407,0.163675,,,5.815322,-0.184678,2.620375,-16.184678,0.901447,1496.661944,0.0,0.21475,0.16547,16.200316,0.881852,1464.490598,0.0,0.218266,0.168212,,2.501707,2.501695,-1.32288,2.501707,0.337469,0.241697,0.0,0.318407,0.484911,0.040649


In [9]:
hvac_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 408000 entries, 0 to 407999
Data columns (total 37 columns):
 #   Column                  Non-Null Count   Dtype         
---  ------                  --------------   -----         
 0   timestamp               408000 non-null  datetime64[ns]
 1   indoor_temp             405616 non-null  float64       
 2   supply_temp             405615 non-null  float64       
 3   airflow                 407999 non-null  float64       
 4   power_usage             407999 non-null  float64       
 5   outdoor_temp            408000 non-null  float64       
 6   solar_radiation         408000 non-null  float64       
 7   occupancy               408000 non-null  float64       
 8   price                   408000 non-null  float64       
 9   temp_error              405616 non-null  float64       
 10  heating_demand          405616 non-null  float64       
 11  indoor_temp_savgol      408000 non-null  float64       
 12  supply_temp_savgol      408000

### Analyze correlations between variables

In this dataset, some variables appear to be scaled versions of original features — for example, `power_usage` and `power_usage_minmax`. Identifying such strong correlations can help detect redundancy and simplify the feature space.
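
The reason a scaled copy correlates perfectly with its source is that min-max scaling is an affine transform with a positive slope, and Pearson correlation is invariant under such transforms. A minimal sketch on synthetic numbers (not drawn from the dataset) illustrates this:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for a raw measurement column
x = rng.normal(loc=1500.0, scale=50.0, size=1_000)

# Min-max scaling: subtract the minimum, divide by the range
x_minmax = (x - x.min()) / (x.max() - x.min())

# Pearson correlation between the raw values and the scaled copy
r = np.corrcoef(x, x_minmax)[0, 1]
print(f"Pearson r between x and its min-max scaled copy: {r:.6f}")
```

Smoothed variants (moving averages, Savitzky-Golay) are not affine transforms, so their correlation with the raw signal can be high but is not guaranteed to be exactly 1.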


In [10]:
# Compute correlation matrix
corr_matrix = np.round(hvac_data.corr().abs(), 2)

# Select upper triangle of correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Find columns whose rounded absolute correlation with an earlier column equals 1.0
perfect_corr = [column for column in upper.columns if any(upper[column] == 1.0)]

print("Perfectly correlated columns to consider dropping:", perfect_corr)

Perfectly correlated columns to consider dropping: ['temp_error', 'heating_demand', 'indoor_temp_savgol', 'supply_temp_savgol', 'temp_error_savgol', 'indoor_temp_robust', 'supply_temp_robust', 'outdoor_temp_robust', 'temp_error_robust', 'airflow_minmax', 'power_usage_minmax', 'solar_radiation_minmax', 'occupancy_minmax', 'price_minmax', 'heating_demand_minmax']


In [11]:
# Round the correlation matrix to 2 decimal places so that near-perfect correlations are not
# missed due to floating-point precision (e.g., 0.999999). Note: any |r| >= 0.995 rounds to 1.0.
corr_matrix = np.round(hvac_data.corr().abs(), 2)


# Select the upper triangle of the correlation matrix to avoid duplicate pairs
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Find all variable pairs with perfect correlation (correlation = 1.0)
perfect_corr_pairs = [
    (col1, col2) 
    for col1 in upper.columns 
    for col2 in upper.index 
    if upper.loc[col2, col1] == 1.0
]

# Display the pairs
if perfect_corr_pairs:
    print("Perfectly correlated variable pairs:")
    for pair in perfect_corr_pairs:
        print(f"{pair[0]} <--> {pair[1]}")
else:
    print("No perfectly correlated variable pairs found.")


Perfectly correlated variable pairs:
temp_error <--> indoor_temp
heating_demand <--> indoor_temp
heating_demand <--> temp_error
indoor_temp_savgol <--> indoor_temp
indoor_temp_savgol <--> temp_error
indoor_temp_savgol <--> heating_demand
supply_temp_savgol <--> supply_temp
temp_error_savgol <--> indoor_temp
temp_error_savgol <--> temp_error
temp_error_savgol <--> heating_demand
temp_error_savgol <--> indoor_temp_savgol
indoor_temp_robust <--> indoor_temp
indoor_temp_robust <--> temp_error
indoor_temp_robust <--> heating_demand
indoor_temp_robust <--> indoor_temp_savgol
indoor_temp_robust <--> temp_error_savgol
supply_temp_robust <--> supply_temp
supply_temp_robust <--> supply_temp_savgol
outdoor_temp_robust <--> outdoor_temp
temp_error_robust <--> indoor_temp
temp_error_robust <--> temp_error
temp_error_robust <--> heating_demand
temp_error_robust <--> indoor_temp_savgol
temp_error_robust <--> temp_error_savgol
temp_error_robust <--> indoor_temp_robust
airflow_minmax <--> airflow
power

> Correlation analysis reveals several pairs of variables whose absolute correlation rounds to 1.00 (that is, |r| ≥ 0.995 after rounding to two decimals). These features carry essentially the same linear information, which indicates redundancy. To ensure model robustness and avoid multicollinearity, especially in models sensitive to feature independence, it is advisable to keep only one variable from each such pair.

> Additionally, the dataset includes both original variables and their scaled versions (e.g., min-max scaled features). Keeping both may lead to information duplication and bias model learning. We should retain only one version of each feature based on the modeling needs and scaling requirements.
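
One way to act on this is sketched below, under stated assumptions: a greedy pass over the upper triangle of the absolute correlation matrix keeps the first column of each highly correlated group and drops the later ones. The helper name `drop_highly_correlated`, the 0.995 threshold, and the toy columns are illustrative choices, not part of the original analysis.

```python
import numpy as np
import pandas as pd


def drop_highly_correlated(df, threshold=0.995):
    """Drop every column whose |r| with an earlier column reaches the threshold."""
    corr = df.corr(numeric_only=True).abs()
    # Keep only entries above the diagonal so each pair is considered once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    to_drop = [col for col in upper.columns if (upper[col] >= threshold).any()]
    return df.drop(columns=to_drop), to_drop


# Toy data: 'a_minmax' is a rescaled copy of 'a'; 'b' is independent noise
rng = np.random.default_rng(42)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": rng.normal(size=200),
    "a_minmax": (a - a.min()) / (a.max() - a.min()),
})

reduced, dropped = drop_highly_correlated(toy)
print("Dropped:", dropped)
print("Remaining:", reduced.columns.tolist())
```

Because the pass is order-dependent, the column that appears first is the one retained. Reordering the columns beforehand is a simple way to express a preference, for example raw features before scaled ones.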


---

## 🚩 Why Are There So Many Perfectly Correlated Variables in This Dataset?

Instead of just removing these variables, **let’s dive deeper to understand the underlying reasons!**

Exploring these correlations can reveal important insights about data preprocessing, feature engineering, or measurement methods used.

---


> Most columns are transformed versions of a small set of original features. The transformations include **Savitzky-Golay smoothing (savgol), Simple Moving Average (sma), Exponential Moving Average (ema), Robust filtering (robust), and Min-Max scaling (minmax)**.  
>  
> Let’s categorize these transformations to gain a clearer understanding of the dataset!
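
One way to sketch this categorization programmatically, assuming every transformed column follows the `<feature>_<suffix>` naming pattern described above, is to split each column name at its last underscore. The helper name `group_columns_by_suffix` is hypothetical, and the column list is a small illustrative subset.

```python
SUFFIXES = ("savgol", "sma", "ema", "robust", "minmax")


def group_columns_by_suffix(columns, suffixes=SUFFIXES):
    """Group column names by transformation suffix; anything else is 'original'."""
    groups = {"original": []}
    groups.update({suffix: [] for suffix in suffixes})
    for col in columns:
        # rpartition splits at the last underscore: 'indoor_temp_savgol' -> 'savgol'
        base, _, suffix = col.rpartition("_")
        key = suffix if base and suffix in suffixes else "original"
        groups[key].append(col)
    return groups


# Illustrative subset of the dataset's column names
example_columns = [
    "timestamp", "indoor_temp", "indoor_temp_savgol", "indoor_temp_robust",
    "airflow", "airflow_sma", "airflow_ema", "airflow_minmax",
]

groups = group_columns_by_suffix(example_columns)
for name, cols in groups.items():
    print(f"{name}: {cols}")
```

Deriving the groups from names avoids hand-maintained lists going out of sync with the data, at the cost of relying on the naming convention being consistent.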


In [12]:
# Original variables (raw measurements)
original_variables = [
    'timestamp', 'indoor_temp', 'supply_temp', 'outdoor_temp', 'temp_error',
    'airflow', 'power_usage', 'solar_radiation', 'occupancy', 'price', 'heating_demand'
]

# Variables smoothed with Savitzky-Golay filter (smoothing + shape-preserving)
savgol_variables = [
    'indoor_temp_savgol', 'supply_temp_savgol', 'outdoor_temp_savgol', 'temp_error_savgol'
]

# Variables processed with Robust filter (resistant to outliers/spikes)
robust_variables = [
    'indoor_temp_robust', 'supply_temp_robust', 'outdoor_temp_robust', 'temp_error_robust'
]

# Variables smoothed with Simple Moving Average (SMA)
sma_variables = [
    'airflow_sma', 'power_usage_sma', 'solar_radiation_sma', 'occupancy_sma', 'price_sma', 'heating_demand_sma'
]

# Variables smoothed with Exponential Moving Average (EMA)
ema_variables = [
    'airflow_ema', 'power_usage_ema', 'solar_radiation_ema', 'occupancy_ema', 'price_ema', 'heating_demand_ema'
]

# Variables scaled with Min-Max normalization
minmax_variables = [
    'airflow_minmax', 'power_usage_minmax', 'solar_radiation_minmax', 'occupancy_minmax', 'price_minmax', 'heating_demand_minmax'
]