# **Appliance Energy Prediction**

# **Context of the Problem Statement:**

This dataset contains measurements related to energy consumption within a house over a period of 4.5 months.

It includes features representing indoor temperature and humidity conditions from various rooms, as well as weather data from a nearby weather station (Chièvres Airport, Belgium).

The main objective is to predict the energy use of appliances based on environmental and temporal features.


## **Objective:**

• Clean and preprocess the dataset.

• Perform exploratory data analysis (univariate, bivariate, and multivariate) to explore and understand the data.

• Decide on the appropriate machine learning model to use.

• Train the model and evaluate its performance.

• Interpret the model results and key features.

• Draw conclusions and suggest possible improvements.

## **Loading required packages and Importing dataset:**

In [None]:
import numpy as np
import pandas as pd
energy_data = pd.read_csv('/content/Appliance_Energy.csv')

## **View dataset**

In [None]:
energy_data.head(20)

In [None]:
energy_data.tail(20)

## **Shape of the Dataframe**

In [None]:
energy_data.shape

## **Descriptive Analysis:**

In [None]:
energy_data.describe()

## **Summary of the Data:**

In [None]:
energy_data.info()

In [None]:
energy_data.columns

# **Removing Date Column**

In [None]:
energy_data = energy_data.drop('date', axis=1)

In [None]:
energy_data.head()
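
The problem statement mentions temporal features, yet dropping `date` discards the only time information in the data. A minimal sketch of how hour-of-day and day-of-week features could be derived before the drop, assuming `date` parses as a timestamp. The frame `demo_df` and its values are synthetic and purely illustrative; the new column names (`hour`, `day_of_week`, `is_weekend`) are hypothetical.

```python
import pandas as pd

# Synthetic stand-in for the raw data (illustrative values only)
demo_df = pd.DataFrame({
    "date": pd.date_range("2016-01-11 17:00:00", periods=6, freq="10min"),
    "Appliances": [60, 60, 50, 50, 60, 50],
})

# Parse the timestamp and derive simple calendar features
demo_df["date"] = pd.to_datetime(demo_df["date"])
demo_df["hour"] = demo_df["date"].dt.hour
demo_df["day_of_week"] = demo_df["date"].dt.dayofweek  # Monday = 0
demo_df["is_weekend"] = (demo_df["day_of_week"] >= 5).astype(int)

# The raw timestamp can then be dropped, as in the cell above
demo_df = demo_df.drop(columns="date")
print(demo_df)
```

Whether such features help on this dataset is untested here; they would have to go through the same significance checks as the other predictors.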

## **Checking Null Values:**

In [None]:
energy_data.isnull().sum()

## **Check for duplicate rows:**

In [None]:
energy_data.duplicated().sum()

## **Outlier Detection: Through Box Plot**

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

numeric_cols = energy_data.select_dtypes(include=['number']).columns

for col in numeric_cols:
  # Calculate Q1, Q3, and IQR
  Q1 = energy_data[col].quantile(0.25)
  Q3 = energy_data[col].quantile(0.75)
  IQR = Q3 - Q1

  # Define bounds for outliers
  lower_bound = Q1 - 1.5 * IQR
  upper_bound = Q3 + 1.5 * IQR

  # Identify outliers
  outliers = energy_data[(energy_data[col] < lower_bound) | (energy_data[col] > upper_bound)]

  print(f"Outliers in {col}:")
  print(outliers)

  # Plot box plot
  plt.figure(figsize=(8, 6))
  sns.boxplot(x=energy_data[col])
  plt.title(f"Box Plot of {col}")
  plt.show()


## **Treating Outliers**

In [None]:
def treat_outliers_iqr(df, col):
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Replace outliers with the bounds
    df[col] = df[col].clip(lower=lower_bound, upper=upper_bound)
    return df


numeric_cols = energy_data.select_dtypes(include=['number']).columns

for col in numeric_cols:
  energy_data = treat_outliers_iqr(energy_data, col)

# Now energy_data has outliers replaced by the bounds.  You can verify this:
for col in numeric_cols:
  # Calculate Q1, Q3, and IQR
  Q1 = energy_data[col].quantile(0.25)
  Q3 = energy_data[col].quantile(0.75)
  IQR = Q3 - Q1

  # Define bounds for outliers
  lower_bound = Q1 - 1.5 * IQR
  upper_bound = Q3 + 1.5 * IQR

  # Identify outliers
  outliers = energy_data[(energy_data[col] < lower_bound) | (energy_data[col] > upper_bound)]

  print(f"Outliers in {col} after treatment:")
  print(outliers)

  # Plot box plot
  plt.figure(figsize=(8, 6))
  sns.boxplot(x=energy_data[col])
  plt.title(f"Box Plot of {col} after treatment")
  plt.show()


# **Scaling the data**

In [None]:
from sklearn.preprocessing import MinMaxScaler

# Create a MinMaxScaler object
scaler = MinMaxScaler()

# Fit and transform the data
scaled_data = scaler.fit_transform(energy_data)

# Create a new DataFrame with the scaled data
energy_data_scaled = pd.DataFrame(scaled_data, columns=energy_data.columns)

# Display the scaled DataFrame (note: the later steps continue to use the unscaled energy_data)
energy_data_scaled
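
Fitting the scaler on the full dataset lets information from rows later used for evaluation leak into the scaling parameters. A minimal sketch of the usual alternative, assuming a train/test split is wanted: fit `MinMaxScaler` on the training rows only and reuse it on the test rows. The frame and its column names (`feature_a`, `feature_b`) are synthetic placeholders, not columns of the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic placeholder data (not the real dataset)
rng = np.random.default_rng(4)
demo_features = pd.DataFrame({
    "feature_a": rng.normal(21, 2, 200),
    "feature_b": rng.normal(40, 4, 200),
})

train_part, test_part = train_test_split(demo_features, test_size=0.25, random_state=0)

# Learn min/max from the training rows only, then apply to both parts
demo_scaler = MinMaxScaler()
train_scaled = pd.DataFrame(
    demo_scaler.fit_transform(train_part), columns=train_part.columns, index=train_part.index
)
test_scaled = pd.DataFrame(
    demo_scaler.transform(test_part), columns=test_part.columns, index=test_part.index
)

print(train_scaled.describe().loc[["min", "max"]])
```

With this approach the test rows can fall slightly outside the [0, 1] range, which is expected.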


## **Defining Independent variables as 'x' and dependent variable as 'y':**

In [None]:
# Define y as the 'Appliances' column and x as all remaining columns of energy_data

y = energy_data['Appliances']
x = energy_data.drop('Appliances', axis=1)


In [None]:
x

In [None]:
y

# **Univariate Analysis**

In [None]:
#for x
for col in x.columns:
    plt.figure(figsize=(8, 6))
    if x[col].dtype == 'float64' or x[col].dtype == 'int64':
        # Histogram
        sns.histplot(x[col], kde=True)
        plt.title(f"Distribution of {col}")
        plt.show()

        # Density plot
        sns.kdeplot(x[col])
        plt.title(f"Density plot of {col}")
        plt.show()
    else:
      # Count plot for categorical features
        sns.countplot(x=x[col])
        plt.title(f"Count Plot of {col}")
        plt.show()
#for y
plt.figure(figsize=(8, 6))

if y.dtype == 'float64' or y.dtype == 'int64':
    # Histogram
    sns.histplot(y, kde=True)
    plt.title("Distribution of Appliances")
    plt.show()

    # Density plot
    sns.kdeplot(y)
    plt.title("Density plot of Appliances")
    plt.show()


# **Bivariate Analysis**

In [None]:
# Violin plots for each independent variable against the dependent variable
for col in x.columns:
    plt.figure(figsize=(8, 6))
    sns.violinplot(x=energy_data[col], y=y)
    plt.title(f'Violin Plot of {col} vs. Appliances')
    plt.show()

## **Checking linearity using Scatter Plot**

In [None]:
#scatter plots
for col in x.columns:
  plt.figure(figsize=(8, 6))
  plt.scatter(x[col], y)
  plt.xlabel(col)
  plt.ylabel('Appliances')
  plt.title(f'Scatter Plot of {col} vs. Appliances')
  plt.show()


## **Extracting Correlation Matrix:**

In [None]:
# heatmap
correlation_matrix = x.corrwith(y)

print("Correlation of each feature with Appliances:")
print(correlation_matrix)

plt.figure(figsize=(25, 20))
sns.heatmap(x.corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Matrix of Features')
plt.show()


# **Multivariate Analysis**

In [None]:
# pairplot
df_pairplot = pd.concat([x, y], axis=1)

sns.pairplot(df_pairplot)
plt.show()


## **Checking Multicollinearity using Variance Inflation Factor**

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Calculate VIF for each feature
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(x.values, i) for i in range(x.shape[1])]
vif["features"] = x.columns

vif

## **Dropping the Columns Exhibiting Multicollinearity**

In [None]:
# Drop columns 'rv1' and 'rv2'
energy_data = energy_data.drop(['rv1', 'rv2'], axis=1)

In [None]:
energy_data.columns

## **Fitting a Multiple Linear Regression Model**

In [None]:
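import statsmodels.api as sm

# Rebuild x from energy_data so that the dropped columns ('rv1', 'rv2') are excluded
x = energy_data.drop('Appliances', axis=1)
x = sm.add_constant(x) # Adding a constant term to the independent variables
model = sm.OLS(y, x).fit()
print(model.summary())

print("\nP-values for each coefficient:")
model.pvalues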


## **Dropping All Insignificant Independent Variables Stepwise**

In [None]:
x = energy_data.drop('Appliances', axis=1)
y = energy_data['Appliances']

x = sm.add_constant(x)  # Adding a constant term to the independent variables

# Iterate and remove columns based on p-values
columns_to_remove = []
while True:
    model = sm.OLS(y, x).fit()
    print(model.summary())
    print("\nP-values for each coefficient:\n", model.pvalues)

    p_values = model.pvalues.drop('const') #exclude intercept
    max_p_value = p_values.max()

    if max_p_value > 0.05:
        column_to_remove = p_values.idxmax()
        print(f"\nRemoving column '{column_to_remove}' with p-value {max_p_value:.3f}")
        columns_to_remove.append(column_to_remove)
        x = x.drop(column_to_remove, axis=1)
    else:
        break

print("\nFinal Model Summary:")
final_model = sm.OLS(y, x).fit()
print(final_model.summary())
print("\nFinal p-values:\n", final_model.pvalues)
print("\nColumns removed:", columns_to_remove)


Interpretation: Of the 27 independent variables, only 18 remain statistically significant. An R-squared of 0.190 means that these 18 variables together explain about 19% of the variance in the dependent variable (Appliances).

# **Model Diagnostics: Testing the model**

### **Checking whether Residuals are Normally Distributed**

In [None]:
# Get the residuals
residuals = final_model.resid

# Display the residuals
print("Residuals:\n", residuals)

# Plot the residuals on a histogram
plt.figure(figsize=(8, 6))
sns.histplot(residuals, kde=True)
plt.title('Histogram of Residuals')
plt.xlabel('Residuals')
plt.ylabel('Frequency')

# Mark the mean of the residuals (close to zero for an OLS fit with an intercept)
plt.axvline(residuals.mean(), color='r', linestyle='dashed', linewidth=2, label='Mean')
plt.legend()
plt.show()
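
A histogram gives only a visual impression of normality. One way to complement it is a formal test such as Jarque-Bera, which checks whether the sample skewness and kurtosis match those of a normal distribution. A minimal sketch using `scipy.stats.jarque_bera` on two synthetic residual vectors; applying it to `final_model.resid` would be the analogous step, which is an assumption and not something the notebook does.

```python
import numpy as np
from scipy import stats

# Two synthetic "residual" samples: one normal, one clearly right-skewed
rng = np.random.default_rng(2)
demo_residuals = {
    "normal": rng.normal(size=2000),
    "skewed": rng.exponential(size=2000) - 1.0,
}

jb_results = {}
for name, values in demo_residuals.items():
    jb_stat, jb_pvalue = stats.jarque_bera(values)
    jb_results[name] = (jb_stat, jb_pvalue)
    print(f"{name}: JB statistic = {jb_stat:.2f}, p-value = {jb_pvalue:.4f}")
```

A small p-value rejects the hypothesis of normally distributed residuals. With many observations the test becomes very sensitive, so it is best read alongside the plot.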


## **Checking whether the Residuals Exhibit Homoscedasticity**

## **Using Scatter Plot**

In [None]:
# Get the predicted values
predicted_values = final_model.fittedvalues

# Create a scatter plot of residuals vs. predicted values
plt.figure(figsize=(10, 6))
sns.scatterplot(x=predicted_values, y=residuals)
plt.xlabel("Predicted Values")
plt.ylabel("Residuals")
plt.title("Residuals vs. Predicted Values")
plt.axhline(y=0, color='r', linestyle='--')  # Add a horizontal line at y=0
plt.show()


# **Conclusion:**

The scatter plot of residuals against predicted values shows heteroscedasticity in the residuals.

R-squared remained at 0.190, which implies that a multiple linear regression (MLR) model is not a good fit for this data.

Other regression models, such as random forests or neural networks, may capture non-linear relationships and could improve predictive accuracy.
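
As a rough illustration of how such a comparison could be set up, the sketch below fits a linear regression and a random forest on a held-out split and compares their test-set R². The data are synthetic with a deliberately non-linear signal, so the numbers say nothing about the appliance dataset; the hyperparameters are arbitrary assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data with a non-linear relationship (illustrative only)
rng = np.random.default_rng(5)
demo_inputs = rng.uniform(-2, 2, size=(1000, 2))
demo_output = (
    np.sin(3 * demo_inputs[:, 0]) + demo_inputs[:, 1] ** 2 + rng.normal(scale=0.1, size=1000)
)

X_train, X_test, y_train, y_test = train_test_split(
    demo_inputs, demo_output, test_size=0.2, random_state=0
)

# Fit both models on the training split only
lin_model = LinearRegression().fit(X_train, y_train)
rf_model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

# Evaluate on rows the models have not seen
lin_r2 = r2_score(y_test, lin_model.predict(X_test))
rf_r2 = r2_score(y_test, rf_model.predict(X_test))
print(f"Linear regression test R^2: {lin_r2:.3f}")
print(f"Random forest test R^2:     {rf_r2:.3f}")
```

Evaluating on a held-out split, not on the training rows, keeps the comparison honest, since a random forest can fit its training data almost perfectly.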