Student Name : Om Dave

Student ID : 222311692

Subject Name : SIT - 307 : Machine Learning

Task : 7.2 HD

# Introduction
This notebook demonstrates the process of loading, preprocessing, and modeling data for power consumption prediction using various machine learning algorithms. We will explore different regression models, including Random Forest, Decision Tree, Support Vector Regression, Linear Regression, and a Feedforward Neural Network, comparing their performance using RMSE and MAE metrics.

# Install Required Libraries
The following command installs necessary Python libraries: TensorFlow, Keras, Pandas, NumPy, and Scikit-learn.

In [1]:
# Install necessary libraries
!pip3 install tensorflow   # TensorFlow library for deep learning
!pip3 install keras        # Keras library for neural networks (now part of TensorFlow)
!pip3 install pandas       # Pandas library for data manipulation and analysis
!pip3 install numpy        # NumPy library for numerical operations
!pip3 install scikit-learn # Scikit-learn library for machine learning

import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam




# Load Data and Select Features
Load the dataset from the provided file path. Every column except DateTime and the three zone power consumption columns is used as a feature, and the target is the sum of the three zone power consumption columns.


In [2]:
# Loading the data from the csv file
file_path = '/Users/s222311692/Desktop/HD/data.csv'
data = pd.read_csv(file_path)


In [3]:
print(data)  # Print the data to check that it was imported correctly

               DateTime  Temperature  Humidity  Wind Speed  \
0         1/1/2017 0:00        6.559      73.8       0.083   
1         1/1/2017 0:10        6.414      74.5       0.083   
2         1/1/2017 0:20        6.313      74.5       0.080   
3         1/1/2017 0:30        6.121      75.0       0.083   
4         1/1/2017 0:40        5.921      75.7       0.081   
...                 ...          ...       ...         ...   
52411  12/30/2017 23:10        7.010      72.4       0.080   
52412  12/30/2017 23:20        6.947      72.6       0.082   
52413  12/30/2017 23:30        6.900      72.8       0.086   
52414  12/30/2017 23:40        6.758      73.0       0.080   
52415  12/30/2017 23:50        6.580      74.1       0.081   

       general diffuse flows  diffuse flows  Zone 1 Power Consumption  \
0                      0.051          0.119               34055.69620   
1                      0.070          0.085               29814.68354   
2                      0.062        

In [4]:
# Selecting the targeted features and the target
# i) Using the same set of features used by the authors
features = data.drop(columns=['DateTime', 'Zone 1 Power Consumption', 'Zone 2  Power Consumption', 'Zone 3  Power Consumption'])
target = data[['Zone 1 Power Consumption', 'Zone 2  Power Consumption', 'Zone 3  Power Consumption']].sum(axis=1)
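
The same drop-and-sum pattern can be checked on a tiny frame before running it on the full dataset. The sketch below uses made-up values; only the column names are copied from the notebook, and `toy`, `toy_features`, and `toy_target` are hypothetical names used for illustration.

```python
import pandas as pd

# Toy frame with made-up values; column names copied from the notebook
# (note the double space in the Zone 2 and Zone 3 names).
toy = pd.DataFrame({
    "DateTime": ["1/1/2017 0:00", "1/1/2017 0:10"],
    "Temperature": [6.5, 6.4],
    "Humidity": [73.8, 74.5],
    "Zone 1 Power Consumption": [100.0, 110.0],
    "Zone 2  Power Consumption": [50.0, 55.0],
    "Zone 3  Power Consumption": [25.0, 20.0],
})

zone_cols = [
    "Zone 1 Power Consumption",
    "Zone 2  Power Consumption",
    "Zone 3  Power Consumption",
]

# Features: everything except the timestamp and the three zone readings.
toy_features = toy.drop(columns=["DateTime"] + zone_cols)

# Target: row-wise sum of the three zones (the aggregated consumption).
toy_target = toy[zone_cols].sum(axis=1)

print(list(toy_features.columns))  # ['Temperature', 'Humidity']
print(toy_target.tolist())         # [175.0, 185.0]
```

A misspelled column name (for example a single space instead of a double space) makes `drop` raise a `KeyError`, so a check like this catches naming problems early.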

# Normalize the Features
Normalize the features with Min-Max scaling, which maps each feature into the range [0, 1].


In [5]:
# Normalizing the data using Min-Max Normalization
# iv) Using the same pre/post processing, used by the authors
scaler = MinMaxScaler()
features_scaled = scaler.fit_transform(features)
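
To make the transformation concrete, here is a minimal NumPy sketch of column-wise Min-Max scaling, (x - min) / (max - min). The function name `min_max_scale` and the toy values are assumptions for illustration; the notebook itself relies on `MinMaxScaler`.

```python
import numpy as np

def min_max_scale(X):
    """Column-wise Min-Max scaling: (x - min) / (max - min)."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    # Guard against constant columns to avoid division by zero.
    col_range[col_range == 0] = 1.0
    return (X - col_min) / col_range

# Made-up values: a temperature-like column and a humidity-like column.
toy = np.array([[5.0, 70.0],
                [10.0, 80.0],
                [15.0, 90.0]])

scaled = min_max_scale(toy)
print(scaled)
# Each column now runs from 0.0 (its minimum) to 1.0 (its maximum),
# with the middle row mapped to 0.5.
```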

# Split the Data into Training and Testing Sets
Split the normalized data into training (75%) and testing (25%) sets to evaluate model performance.


In [6]:
# Split the data into training and testing sets (75% train, 25% test)
# iii) Using the same training/test splitting approach as used by the authors
X_train, X_test, y_train, y_test = train_test_split(features_scaled, target, test_size=0.25, random_state=42)
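
As a quick sanity check of the 75/25 split, the sketch below applies the same `train_test_split` arguments to a small synthetic array (the array contents are made up) and confirms that a fixed `random_state` reproduces the same partition.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 20 samples with 2 features each.
X = np.arange(40).reshape(20, 2)
y = np.arange(20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=42
)
print(X_tr.shape, X_te.shape)  # (15, 2) (5, 2)

# Repeating the call with the same seed yields the same test rows.
_, X_te_again, _, _ = train_test_split(
    X, y, test_size=0.25, random_state=42
)
print(np.array_equal(X_te, X_te_again))  # True
```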

# Define Parameter Grids for Model Tuning
Specify parameter grids for Random Forest, Decision Tree, and Support Vector Regression for hyperparameter tuning.


# Initialize and Train Models
Initialize base models with default parameters and perform grid search for models requiring hyperparameter tuning.
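
For readers new to grid search, the following is a small, self-contained sketch of `GridSearchCV` on synthetic data. The data, the grid values, and the variable names are assumptions chosen to keep the example fast; they are not the configuration used in this notebook.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem: y depends mostly on the first feature.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

# Illustrative grid: 3 x 2 = 6 candidate settings.
grid = {"max_depth": [2, 4, None], "min_samples_leaf": [1, 10]}

search = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    grid,
    cv=5,                              # 5-fold cross-validation
    scoring="neg_mean_squared_error",  # higher (closer to 0) is better
)
search.fit(X, y)

print(search.best_params_)
# best_score_ is the mean negated MSE across folds; negate it and take
# the square root to read it on the scale of the target.
print(np.sqrt(-search.best_score_))
print(len(search.cv_results_["params"]))  # number of candidates tried
```

After fitting, `search.best_estimator_` is the model refit on all of the data passed to `fit` using the best setting, which is the object the notebook stores for each tuned model.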


# Define and Train the Feedforward Neural Network
Define a Feedforward Neural Network using Keras and train it using the training data.


In [7]:
# Define parameter grids for each model
# ii) Using the same classifier with exact parameter values
param_grids = {
    # Random Forest parameters as per the authors' configuration
    "Random Forest": {
        'n_estimators': [10, 20, 30, 50, 100, 200, 300],  # Number of trees in the forest
        'max_features': [1, 2, 3, 4, 5, 6, 7, 8, 9],  # Number of features to consider at each split
        'min_samples_split': [2],  # Minimum number of samples required to split an internal node
        'min_samples_leaf': [1]  # Minimum number of samples required to be at a leaf node
    },
    # Decision Tree parameters as per the authors' configuration
    "Decision Tree": {
        'max_depth': [None],  # Maximum depth of the tree (None means nodes are expanded until all leaves are pure)
        'min_samples_split': [10],  # Minimum number of samples required to split an internal node
        'min_samples_leaf': [10],  # Minimum number of samples required to be at a leaf node
        'max_features': [9]  # Number of features to consider when looking for the best split
    },
    # Support Vector Regression (SVR) parameters as per the authors' configuration
    "Support Vector Regression": {
        'C': [1, 10, 100, 1000],  # Regularization parameter
        'gamma': [0.01, 0.001, 0.0001]  # Kernel coefficient for 'rbf', 'poly', and 'sigmoid'
    }
}

# Initializing models with default parameters
base_models = {
    "Random Forest": RandomForestRegressor(random_state=42),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Support Vector Regression": SVR(),
    "Linear Regression": LinearRegression()
}

# Perform grid search for the models that need hyperparameter tuning
# Additionally, train the Linear Regression model
models = {}
for name in param_grids:
    grid_search = GridSearchCV(base_models[name], param_grids[name], cv=5, scoring='neg_mean_squared_error', n_jobs=-1)
    grid_search.fit(X_train, y_train)
    models[name] = grid_search.best_estimator_

# For Linear Regression, we do not need grid search
models["Linear Regression"] = base_models["Linear Regression"].fit(X_train, y_train)

# Define and train the Feedforward Neural Network (FFNN) using Keras
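from keras.layers import Input  # explicit input layer instead of the deprecated input_dim argument

ffnn = Sequential()
ffnn.add(Input(shape=(X_train.shape[1],)))  # one input per feature
ffnn.add(Dense(10, activation='selu'))      # single hidden layer with 10 SELU units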
ffnn.add(Dense(1))
optimizer = Adam(learning_rate=0.001)
ffnn.compile(loss='mean_squared_error', optimizer=optimizer)
ffnn.fit(X_train, y_train, epochs=100, batch_size=10, verbose=0)


# Note: this cell took about 21-22 minutes to run;
# on a MacBook it took about 62 minutes. Wait for it to finish before proceeding.



# Calculate Performance Metrics
Define a function to calculate RMSE and MAE for both training and testing datasets.
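
As a worked example of the two metrics, the sketch below computes RMSE and MAE by hand with NumPy on four made-up predictions. The helper names `rmse` and `mae` are hypothetical; the notebook uses the scikit-learn metric functions instead.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error: mean(|y - y_hat|)."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(diff)))

# Made-up values; the errors are -10, 10, 0 and -40.
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 300.0, 440.0]

print(mae(y_true, y_pred))   # (10 + 10 + 0 + 40) / 4 = 15.0
print(rmse(y_true, y_pred))  # sqrt((100 + 100 + 0 + 1600) / 4) = sqrt(450), about 21.21
```

The single large error (40) pulls RMSE well above MAE, which is why the two metrics are usually reported together: RMSE penalizes large misses more heavily.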


In [13]:
# Function to calculate RMSE and MAE
def calculate_metrics(model, X_train, X_test, y_train, y_test):
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))
    rmse_test = np.sqrt(mean_squared_error(y_test, y_test_pred))
    mae_train = mean_absolute_error(y_train, y_train_pred)
    mae_test = mean_absolute_error(y_test, y_test_pred)
    return rmse_train, rmse_test, mae_train, mae_test

# Calculate performance metrics for each model
metrics = {}
for name, model in models.items():
    rmse_train, rmse_test, mae_train, mae_test = calculate_metrics(model, X_train, X_test, y_train, y_test)
    metrics[name] = {
        'RMSE_Train': rmse_train,
        'RMSE_Test': rmse_test,
        'MAE_Train': mae_train,
        'MAE_Test': mae_test
    }

# Calculate metrics for FFNN
y_train_pred = ffnn.predict(X_train).flatten()
y_test_pred = ffnn.predict(X_test).flatten()
metrics["Feedforward Neural Network"] = {
    'RMSE_Train': np.sqrt(mean_squared_error(y_train, y_train_pred)),
    'RMSE_Test': np.sqrt(mean_squared_error(y_test, y_test_pred)),
    'MAE_Train': mean_absolute_error(y_train, y_train_pred),
    'MAE_Test': mean_absolute_error(y_test, y_test_pred)    
}

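# Define the distributions
# NOTE: `metrics` was computed once above, for the summed (aggregated) target,
# so each table printed below holds the same values under a different label.
distributions = ['Quads Distribution', 'Smir Distribution', 'Boussafou Distribution', 'Aggregated Distribution']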

# Print output for each distribution individually in a clear table format
# v) Reporting the same performance metric (RMSE and MAE) as shown in Table II
for dist in distributions:
    columns = pd.MultiIndex.from_product([[dist], ['RMSE', 'MAE'], ['Train', 'Test']])
    metrics_df = pd.DataFrame(columns=columns)
    
    for model_name, model_metrics in metrics.items():
        metrics_df.loc[model_name, (dist, 'RMSE', 'Train')] = model_metrics['RMSE_Train']
        metrics_df.loc[model_name, (dist, 'RMSE', 'Test')] = model_metrics['RMSE_Test']
        metrics_df.loc[model_name, (dist, 'MAE', 'Train')] = model_metrics['MAE_Train']
        metrics_df.loc[model_name, (dist, 'MAE', 'Test')] = model_metrics['MAE_Test']
    
    print(f"\nMetrics for {dist}")
    print(metrics_df.to_string())
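
Reporting a separate table per distribution requires a separate target per distribution. The sketch below shows one way this could be organized, using synthetic data and a single `LinearRegression` model to keep it short. It assumes, without confirmation from this notebook, that Quads, Smir, and Boussafou correspond to Zone 1, Zone 2, and Zone 3; all values and helper names are made up.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset (values are made up).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Temperature": rng.uniform(0, 40, n),
    "Humidity": rng.uniform(10, 90, n),
})
toy["Zone 1 Power Consumption"] = 300 * toy["Temperature"] + rng.normal(0, 50, n)
toy["Zone 2  Power Consumption"] = 200 * toy["Temperature"] + rng.normal(0, 80, n)
toy["Zone 3  Power Consumption"] = 100 * toy["Humidity"] + rng.normal(0, 30, n)

zone_cols = [
    "Zone 1 Power Consumption",
    "Zone 2  Power Consumption",
    "Zone 3  Power Consumption",
]

# ASSUMPTION: this name-to-column mapping is not stated in the notebook.
targets = {
    "Quads Distribution": toy[zone_cols[0]],
    "Smir Distribution": toy[zone_cols[1]],
    "Boussafou Distribution": toy[zone_cols[2]],
    "Aggregated Distribution": toy[zone_cols].sum(axis=1),
}

X = toy[["Temperature", "Humidity"]]
rows = {}
for name, y in targets.items():
    # Same split settings as the notebook, applied per target.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=42
    )
    pred = LinearRegression().fit(X_tr, y_tr).predict(X_te)
    rows[name] = {
        "RMSE_Test": np.sqrt(mean_squared_error(y_te, pred)),
        "MAE_Test": mean_absolute_error(y_te, pred),
    }

per_dist = pd.DataFrame.from_dict(rows, orient="index")
print(per_dist.round(2))
```

With this layout each row comes from a model fitted to its own target, so the four tables would no longer share the same numbers.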


Metrics for Quads Distribution
                            Quads Distribution
                                            RMSE                 MAE
                                 Train      Test     Train      Test
Random Forest                   641.87   3152.47    458.23   2662.65
Decision Tree                   808.29   4594.97    508.79   3955.34
Support Vector Regression      4084.79   3872.55   3176.91   3023.49
Feedforward Neural Network     2517.87   3175.24   1935.58   2586.11
Linear Regression              4389.31   3921.30     3521.

# Conclusion
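This notebook compared five regression models on the power consumption data using RMSE and MAE on both the training and test sets. In the table above, Random Forest has the lowest test RMSE (3152.47), with the Feedforward Neural Network close behind (3175.24), while Decision Tree has the highest (4594.97). Random Forest and Decision Tree also show a large gap between training and test error, which suggests overfitting, whereas Support Vector Regression and Linear Regression score similarly on both sets. These metrics can guide the choice of model for future predictions.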
