<a href="https://colab.research.google.com/github/hamzatayel/Hamza-Tayel-s-Repository/blob/main/Regression_Mini-Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Mining — Linear Regression — Mini-Project 1 (Energy Consumption)

**Course:** [CSEN911] Data Mining (Winter 2025)  
**Instructor:** Dr. Ayman Alserafi  
**Due:** 24 October 2025, 11:59 PM

**Dataset:** `energy_data.csv`  


> **Instructions:** For every step, write your own explanations, justifications, and visualizations in the provided Markdown prompts.




***Edit this cell with your name(s), tutorial number(s) and ID(s)***

---

Name: Hamza Tayel

ID: 58-1702

Tutorial: 02

---

Name: Hazem Mowafi  

ID: 58-0930

Tutorial: 02

---


The dataset contains building-level energy readings and contextual attributes.

Each row represents a building observation. Columns include:

<div style="font-size:20px;">

| **Column** | **Description** |
|-------------|-----------------|
| **Building_ID** | Unique identifier for each building record. Used to distinguish one building entry from another. |
| **Building_Type** | Category describing the primary use of the building (e.g., Residential, Commercial, Industrial, Educational, etc.). |
| **Governorate** | The administrative region (governorate) where the building is located (e.g., Cairo, Giza, Alexandria). |
| **Neighborhood** | The smaller district or local area within the governorate where the building is located. |
| **Day_of_Week** | The day on which the energy consumption measurement was recorded (e.g., Sunday, Monday, etc.). |
| **Occupancy_Level** | The relative number of occupants or activity level in the building, typically categorized as *Low*, *Medium*, or *High*. |
| **Appliances_Usage_Level** | How intensively appliances are used in the building, categorized as *Low*, *Medium*, or *High*. |
| **SquareFootage** | The total floor area of the building (numeric). Serves as a proxy for building size, often influencing energy usage. |
| **Last_Maintenance_Date** | The date of the most recent maintenance performed on the building. |
| **Average_Temperature** | The average ambient temperature (in °C) recorded during the data period. |
| **Energy_Consumption** | The total energy used by the building, typically measured in kilowatt-hours (kWh). |

</div>


## Importing Libraries & Dataset

In [65]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

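try:
    df = pd.read_csv('energy_data.csv')  # prefer a local copy when one exists
except FileNotFoundError:
    # fall back to the course repository when no local copy is available
    df = pd.read_csv('https://raw.githubusercontent.com/GUC-DM/W2025/refs/heads/main/data/energy_data.csv')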

df.head()

Unnamed: 0,Building_ID,Building_Type,Governorate,Neighborhood,Day_of_Week,Occupancy_Level,Appliances_Usage_Level,SquareFootage,Last_Maintenance_Date,Average_Temperature,Energy_Consumption
0,BLD-1000-UD,Residential,Alexandria,Smouha,WeDnesday,High,Low,7063m2,2020-01-01,28.61,2713.95 kWh
1,BLD-1001-AX,Commercial,Giza,+Mohandessin14,tuesDAY,High,High,44372m2,2022-02-24,,5744.99 kWh
2,BLD-1002-IH,Industrial,Cairo,New Cairo,SunDay,Medium,Low,19255,2021-02-22,37.88,4101.24 kWh
3,BLD-1003-HE,,,+92Dokki,TuesDay,Low,High,13265,2023-07-30,35.06,3009.14 kWh
4,BLD-1004-XD,Commercial,Alexandria,Smouha,Monday,Low,Low,13375,2022-08-12,28.82,3279.17 kWh


## Data Inspection

Perform data inspection tasks here (recommended for data understanding).

In [66]:
df.describe()

Unnamed: 0,Average_Temperature
count,990.0
mean,33.499404
std,10.703806
min,-4.91
25%,29.4225
50%,35.26
75%,39.97
max,50.0


In [67]:
df.shape

(1100, 11)

In [68]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1100 entries, 0 to 1099
Data columns (total 11 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Building_ID             1100 non-null   object 
 1   Building_Type           990 non-null    object 
 2   Governorate             873 non-null    object 
 3   Neighborhood            1100 non-null   object 
 4   Day_of_Week             1100 non-null   object 
 5   Occupancy_Level         1100 non-null   object 
 6   Appliances_Usage_Level  1100 non-null   object 
 7   SquareFootage           1100 non-null   object 
 8   Last_Maintenance_Date   1100 non-null   object 
 9   Average_Temperature     990 non-null    float64
 10  Energy_Consumption      1100 non-null   object 
dtypes: float64(1), object(10)
memory usage: 94.7+ KB
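
The `info()` output above shows gaps in three columns: `Building_Type` (990 of 1100 non-null), `Governorate` (873 of 1100) and `Average_Temperature` (990 of 1100). A per-column missing-value report makes this easier to read at a glance. The following is a minimal sketch on a small toy frame; the toy values and the name `missing_report` are illustrative assumptions, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are illustrative only).
toy = pd.DataFrame({
    "Building_Type": ["Residential", None, "Industrial", "Commercial"],
    "Governorate": ["Cairo", None, None, "Giza"],
    "Average_Temperature": [28.6, np.nan, 37.9, 35.1],
    "SquareFootage": ["7063m2", "44372m2", "19255", "13265"],
})

# Count missing entries per column and express them as a percentage of rows.
missing = toy.isna().sum()
missing_report = pd.DataFrame({
    "missing": missing,
    "percent": (missing / len(toy) * 100).round(1),
}).sort_values("missing", ascending=False)

print(missing_report)
```

Running the same two steps on `df` would list the columns that need imputation first.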


In [69]:
df['Building_Type'].value_counts()

Building_Type,count
Residential,349
Commercial,325
Industrial,316


In [70]:
df['Building_Type'].unique()

array(['Residential', 'Commercial', 'Industrial', nan], dtype=object)

In [71]:
df['Governorate'].unique()

array(['Alexandria', 'Giza', 'Cairo', nan], dtype=object)

## Data Pre-Processing & Cleaning

_Apply any data preprocessing and/or feature engineering below. Show/output the changes to the dataset._

In [72]:
df.head()

Unnamed: 0,Building_ID,Building_Type,Governorate,Neighborhood,Day_of_Week,Occupancy_Level,Appliances_Usage_Level,SquareFootage,Last_Maintenance_Date,Average_Temperature,Energy_Consumption
0,BLD-1000-UD,Residential,Alexandria,Smouha,WeDnesday,High,Low,7063m2,2020-01-01,28.61,2713.95 kWh
1,BLD-1001-AX,Commercial,Giza,+Mohandessin14,tuesDAY,High,High,44372m2,2022-02-24,,5744.99 kWh
2,BLD-1002-IH,Industrial,Cairo,New Cairo,SunDay,Medium,Low,19255,2021-02-22,37.88,4101.24 kWh
3,BLD-1003-HE,,,+92Dokki,TuesDay,Low,High,13265,2023-07-30,35.06,3009.14 kWh
4,BLD-1004-XD,Commercial,Alexandria,Smouha,Monday,Low,Low,13375,2022-08-12,28.82,3279.17 kWh


Handle missing values in the `Building_Type` column by filling them with `'Unknown'`.

In [73]:
df['Building_Type'] = df['Building_Type'].fillna('Unknown')  # assign back instead of a chained inplace fillna
df.head()


Unnamed: 0,Building_ID,Building_Type,Governorate,Neighborhood,Day_of_Week,Occupancy_Level,Appliances_Usage_Level,SquareFootage,Last_Maintenance_Date,Average_Temperature,Energy_Consumption
0,BLD-1000-UD,Residential,Alexandria,Smouha,WeDnesday,High,Low,7063m2,2020-01-01,28.61,2713.95 kWh
1,BLD-1001-AX,Commercial,Giza,+Mohandessin14,tuesDAY,High,High,44372m2,2022-02-24,,5744.99 kWh
2,BLD-1002-IH,Industrial,Cairo,New Cairo,SunDay,Medium,Low,19255,2021-02-22,37.88,4101.24 kWh
3,BLD-1003-HE,Unknown,,+92Dokki,TuesDay,Low,High,13265,2023-07-30,35.06,3009.14 kWh
4,BLD-1004-XD,Commercial,Alexandria,Smouha,Monday,Low,Low,13375,2022-08-12,28.82,3279.17 kWh


Handle missing values in the `Governorate` column by filling them with `'Unknown'`.

In [74]:
df['Governorate'] = df['Governorate'].fillna('Unknown')  # assign back instead of a chained inplace fillna
df.head()


Unnamed: 0,Building_ID,Building_Type,Governorate,Neighborhood,Day_of_Week,Occupancy_Level,Appliances_Usage_Level,SquareFootage,Last_Maintenance_Date,Average_Temperature,Energy_Consumption
0,BLD-1000-UD,Residential,Alexandria,Smouha,WeDnesday,High,Low,7063m2,2020-01-01,28.61,2713.95 kWh
1,BLD-1001-AX,Commercial,Giza,+Mohandessin14,tuesDAY,High,High,44372m2,2022-02-24,,5744.99 kWh
2,BLD-1002-IH,Industrial,Cairo,New Cairo,SunDay,Medium,Low,19255,2021-02-22,37.88,4101.24 kWh
3,BLD-1003-HE,Unknown,Unknown,+92Dokki,TuesDay,Low,High,13265,2023-07-30,35.06,3009.14 kWh
4,BLD-1004-XD,Commercial,Alexandria,Smouha,Monday,Low,Low,13375,2022-08-12,28.82,3279.17 kWh
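
Both categorical columns receive the same placeholder, so they can also be filled in a single call by passing a column-to-value dictionary to `DataFrame.fillna`. This is a minimal sketch on a toy frame (the values are illustrative assumptions); columns not named in the dictionary are left untouched.

```python
import pandas as pd

# Toy frame with gaps in two categorical columns and one numeric column.
toy = pd.DataFrame({
    "Building_Type": ["Residential", None, "Industrial"],
    "Governorate": [None, "Giza", None],
    "Average_Temperature": [28.61, None, 37.88],
})

# Fill only the listed columns; the result is a new frame, so assign it back.
filled = toy.fillna({"Building_Type": "Unknown", "Governorate": "Unknown"})

print(filled)
```

The numeric gap in `Average_Temperature` survives, which leaves room for a separate numeric imputation strategy later.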


Remove symbols and digits from the `Neighborhood` column.

In [75]:
df['Neighborhood'] = (
    df['Neighborhood']
    .astype(str)                                  # cast every value to a string
    .str.strip()                                  # remove leading and trailing whitespace
    .str.replace(r'[^A-Za-z\s]', '', regex=True)  # remove digits and symbols
    .str.replace(r'\s+', ' ', regex=True)         # collapse repeated whitespace into one space
)


In [76]:
df.head()

Unnamed: 0,Building_ID,Building_Type,Governorate,Neighborhood,Day_of_Week,Occupancy_Level,Appliances_Usage_Level,SquareFootage,Last_Maintenance_Date,Average_Temperature,Energy_Consumption
0,BLD-1000-UD,Residential,Alexandria,Smouha,WeDnesday,High,Low,7063m2,2020-01-01,28.61,2713.95 kWh
1,BLD-1001-AX,Commercial,Giza,Mohandessin,tuesDAY,High,High,44372m2,2022-02-24,,5744.99 kWh
2,BLD-1002-IH,Industrial,Cairo,New Cairo,SunDay,Medium,Low,19255,2021-02-22,37.88,4101.24 kWh
3,BLD-1003-HE,Unknown,Unknown,Dokki,TuesDay,Low,High,13265,2023-07-30,35.06,3009.14 kWh
4,BLD-1004-XD,Commercial,Alexandria,Smouha,Monday,Low,Low,13375,2022-08-12,28.82,3279.17 kWh


In [77]:
df['Neighborhood'].unique()

array(['Smouha', 'Mohandessin', 'New Cairo', 'Dokki', 'Heliopolis',
       'Gleem', 'Maadi'], dtype=object)
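
The cleaning chain can be checked in isolation on a few raw strings of the kind seen in the first rows (`+Mohandessin14`, `+92Dokki`). The sketch below wraps the same regular expressions in a hypothetical helper, `clean_neighborhood`, and adds a final `strip()` as an assumption, in case removing a symbol leaves a stray space at either end.

```python
import pandas as pd

def clean_neighborhood(series: pd.Series) -> pd.Series:
    """Keep only letters and single spaces in neighborhood names."""
    return (
        series.astype(str)
        .str.replace(r"[^A-Za-z\s]", "", regex=True)  # drop digits and symbols
        .str.replace(r"\s+", " ", regex=True)         # collapse whitespace runs
        .str.strip()                                  # trim both ends last
    )

raw = pd.Series(["+Mohandessin14", "+92Dokki", "  New   Cairo ", "Smouha#"])
cleaned = clean_neighborhood(raw)

print(cleaned.tolist())
```

Note that this pattern would also strip accented or non-Latin letters, which is acceptable only if every neighborhood name is written in plain ASCII.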

Normalize the capitalization of `Day_of_Week` (for example, `WeDnesday` becomes `Wednesday`).

In [78]:
df['Day_of_Week'] = df['Day_of_Week'].str.capitalize()
df.head()

Unnamed: 0,Building_ID,Building_Type,Governorate,Neighborhood,Day_of_Week,Occupancy_Level,Appliances_Usage_Level,SquareFootage,Last_Maintenance_Date,Average_Temperature,Energy_Consumption
0,BLD-1000-UD,Residential,Alexandria,Smouha,Wednesday,High,Low,7063m2,2020-01-01,28.61,2713.95 kWh
1,BLD-1001-AX,Commercial,Giza,Mohandessin,Tuesday,High,High,44372m2,2022-02-24,,5744.99 kWh
2,BLD-1002-IH,Industrial,Cairo,New Cairo,Sunday,Medium,Low,19255,2021-02-22,37.88,4101.24 kWh
3,BLD-1003-HE,Unknown,Unknown,Dokki,Tuesday,Low,High,13265,2023-07-30,35.06,3009.14 kWh
4,BLD-1004-XD,Commercial,Alexandria,Smouha,Monday,Low,Low,13375,2022-08-12,28.82,3279.17 kWh


Encode `Day_of_Week` as an integer, with Sunday = 1 through Saturday = 7.

In [79]:
DOW_map = {'Sunday': 1, 'Monday': 2, 'Tuesday': 3, 'Wednesday': 4, 'Thursday': 5, 'Friday': 6, 'Saturday': 7}
df['Day_of_Week'] = df['Day_of_Week'].map(DOW_map)
df.head()

Unnamed: 0,Building_ID,Building_Type,Governorate,Neighborhood,Day_of_Week,Occupancy_Level,Appliances_Usage_Level,SquareFootage,Last_Maintenance_Date,Average_Temperature,Energy_Consumption
0,BLD-1000-UD,Residential,Alexandria,Smouha,4,High,Low,7063m2,2020-01-01,28.61,2713.95 kWh
1,BLD-1001-AX,Commercial,Giza,Mohandessin,3,High,High,44372m2,2022-02-24,,5744.99 kWh
2,BLD-1002-IH,Industrial,Cairo,New Cairo,1,Medium,Low,19255,2021-02-22,37.88,4101.24 kWh
3,BLD-1003-HE,Unknown,Unknown,Dokki,3,Low,High,13265,2023-07-30,35.06,3009.14 kWh
4,BLD-1004-XD,Commercial,Alexandria,Smouha,2,Low,Low,13375,2022-08-12,28.82,3279.17 kWh
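
`Series.map` silently returns `NaN` for any label that is not a key of the dictionary, so a misspelled or padded day name would become a missing value without raising an error. The sketch below shows one way to surface such labels before trusting the encoding. The sample strings (including the deliberately invalid `Thurs`) and the extra `strip()` are assumptions for illustration.

```python
import pandas as pd

DOW_MAP = {"Sunday": 1, "Monday": 2, "Tuesday": 3, "Wednesday": 4,
           "Thursday": 5, "Friday": 6, "Saturday": 7}

days = pd.Series(["WeDnesday", "tuesDAY", " SunDay", "Thurs"])

# Normalize spacing and case, then map to integer codes.
normalized = days.str.strip().str.capitalize()
codes = normalized.map(DOW_MAP)

# Labels that did not match any key end up as NaN; list them for inspection.
unmapped = list(normalized[codes.isna()].unique())

print(codes.tolist())
print("Unmapped labels:", unmapped)
```

An empty `unmapped` list on the real column would confirm that every row received a valid code.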


Encode `Occupancy_Level` and `Appliances_Usage_Level` as ordinal integers (Low = 1, Medium = 2, High = 3) so they can be used in linear regression.

In [80]:
occupancy_map = {'Low': 1, 'Medium': 2, 'High': 3}
df['Occupancy_Level'] = df['Occupancy_Level'].map(occupancy_map)
appliances_map = {'Low': 1, 'Medium': 2, 'High': 3}
df['Appliances_Usage_Level'] = df['Appliances_Usage_Level'].map(appliances_map)
df.head()

Unnamed: 0,Building_ID,Building_Type,Governorate,Neighborhood,Day_of_Week,Occupancy_Level,Appliances_Usage_Level,SquareFootage,Last_Maintenance_Date,Average_Temperature,Energy_Consumption
0,BLD-1000-UD,Residential,Alexandria,Smouha,4,3,1,7063m2,2020-01-01,28.61,2713.95 kWh
1,BLD-1001-AX,Commercial,Giza,Mohandessin,3,3,3,44372m2,2022-02-24,,5744.99 kWh
2,BLD-1002-IH,Industrial,Cairo,New Cairo,1,2,1,19255,2021-02-22,37.88,4101.24 kWh
3,BLD-1003-HE,Unknown,Unknown,Dokki,3,1,3,13265,2023-07-30,35.06,3009.14 kWh
4,BLD-1004-XD,Commercial,Alexandria,Smouha,2,1,1,13375,2022-08-12,28.82,3279.17 kWh


Clean the `SquareFootage` column: some values carry an `m2` suffix, which keeps the column from being numeric.

In [81]:
df.tail(10)

Unnamed: 0,Building_ID,Building_Type,Governorate,Neighborhood,Day_of_Week,Occupancy_Level,Appliances_Usage_Level,SquareFootage,Last_Maintenance_Date,Average_Temperature,Energy_Consumption
1090,BLD-2090-LV,Residential,Unknown,Gleem,2,1,1,31178,2023-09-08,30.62,3451.58 kWh
1091,BLD-2091-XH,Residential,Alexandria,Gleem,1,3,1,33642,2020-01-01,34.03,3977.63 kWh
1092,BLD-2092-OT,Residential,Unknown,Heliopolis,4,3,1,34160,2024-01-16,36.56,3830.68 kWh
1093,BLD-2093-JY,Unknown,Giza,Mohandessin,7,3,3,2091m2,2024-11-10,36.12,4250.29 kWh
1094,BLD-2094-XZ,Industrial,Cairo,New Cairo,7,1,1,30211,2023-08-27,38.48,4137.66 kWh
1095,BLD-2095-OH,Commercial,Giza,Dokki,7,3,1,1161m2,2022-04-21,27.85,3010.81 kWh
1096,BLD-2096-RH,Residential,Unknown,Dokki,1,2,2,37943m2,2024-10-31,36.23,4248.49 kWh
1097,BLD-2097-JZ,Commercial,Giza,Mohandessin,1,1,2,1558,2021-04-18,20.0,2843.6 kWh
1098,BLD-2098-ZP,Industrial,Alexandria,Smouha,7,2,1,2145,2023-09-14,34.43,3348.39 kWh
1099,BLD-2099-GL,Residential,Cairo,New Cairo,6,3,2,42414,2020-12-09,40.37,4722.59 kWh


In [82]:
df['SquareFootage'].isnull().sum()

np.int64(0)

In [83]:
df['SquareFootage'] = df['SquareFootage'].astype(str).str.replace('m2', '', regex=False)  # cast to string and strip the 'm2' suffix so only digits remain
df.head()

Unnamed: 0,Building_ID,Building_Type,Governorate,Neighborhood,Day_of_Week,Occupancy_Level,Appliances_Usage_Level,SquareFootage,Last_Maintenance_Date,Average_Temperature,Energy_Consumption
0,BLD-1000-UD,Residential,Alexandria,Smouha,4,3,1,7063,2020-01-01,28.61,2713.95 kWh
1,BLD-1001-AX,Commercial,Giza,Mohandessin,3,3,3,44372,2022-02-24,,5744.99 kWh
2,BLD-1002-IH,Industrial,Cairo,New Cairo,1,2,1,19255,2021-02-22,37.88,4101.24 kWh
3,BLD-1003-HE,Unknown,Unknown,Dokki,3,1,3,13265,2023-07-30,35.06,3009.14 kWh
4,BLD-1004-XD,Commercial,Alexandria,Smouha,2,1,1,13375,2022-08-12,28.82,3279.17 kWh


In [84]:
df['SquareFootage'] = pd.to_numeric(df['SquareFootage'], errors='coerce') # turning back to numeric


In [85]:
df.head()

Unnamed: 0,Building_ID,Building_Type,Governorate,Neighborhood,Day_of_Week,Occupancy_Level,Appliances_Usage_Level,SquareFootage,Last_Maintenance_Date,Average_Temperature,Energy_Consumption
0,BLD-1000-UD,Residential,Alexandria,Smouha,4,3,1,7063,2020-01-01,28.61,2713.95 kWh
1,BLD-1001-AX,Commercial,Giza,Mohandessin,3,3,3,44372,2022-02-24,,5744.99 kWh
2,BLD-1002-IH,Industrial,Cairo,New Cairo,1,2,1,19255,2021-02-22,37.88,4101.24 kWh
3,BLD-1003-HE,Unknown,Unknown,Dokki,3,1,3,13265,2023-07-30,35.06,3009.14 kWh
4,BLD-1004-XD,Commercial,Alexandria,Smouha,2,1,1,13375,2022-08-12,28.82,3279.17 kWh
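
The strip-then-convert pattern used for `SquareFootage` also applies to any other column that mixes numbers with a unit suffix. A minimal sketch of a reusable helper follows; the name `to_number` and the sample values are hypothetical. It extracts the first number in each string and coerces anything else to `NaN`. It only removes the suffix and does not convert units, since the source does not say whether suffixed and unsuffixed values share a unit.

```python
import pandas as pd

def to_number(series: pd.Series) -> pd.Series:
    """Extract the first integer or decimal number from each string."""
    extracted = series.astype(str).str.extract(r"(-?\d+(?:\.\d+)?)", expand=False)
    return pd.to_numeric(extracted, errors="coerce")

raw = pd.Series(["7063m2", "19255", "2713.95 kWh", "n/a"])
cleaned = to_number(raw)

print(cleaned.tolist())
```

Because only the first number is kept, values with thousands separators such as `1,200` would be truncated, so the pattern should be checked against the raw column before use.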


Feature engineering: derive `Days_Since_Last_Maintenance` from `Last_Maintenance_Date`.

In [86]:
from datetime import datetime
today = pd.Timestamp(datetime.today().date()) # getting today actual date
df['Last_Maintenance_Date'] = pd.to_datetime(df['Last_Maintenance_Date'], errors='coerce') # transforming the column into datetime
df['Days_Since_Last_Maintenance'] = (today - df['Last_Maintenance_Date']).dt.days # finding the difference in days
df.head()

Unnamed: 0,Building_ID,Building_Type,Governorate,Neighborhood,Day_of_Week,Occupancy_Level,Appliances_Usage_Level,SquareFootage,Last_Maintenance_Date,Average_Temperature,Energy_Consumption,Days_Since_Last_Maintenance
0,BLD-1000-UD,Residential,Alexandria,Smouha,4,3,1,7063,2020-01-01,28.61,2713.95 kWh,2121
1,BLD-1001-AX,Commercial,Giza,Mohandessin,3,3,3,44372,2022-02-24,,5744.99 kWh,1336
2,BLD-1002-IH,Industrial,Cairo,New Cairo,1,2,1,19255,2021-02-22,37.88,4101.24 kWh,1703
3,BLD-1003-HE,Unknown,Unknown,Dokki,3,1,3,13265,2023-07-30,35.06,3009.14 kWh,815
4,BLD-1004-XD,Commercial,Alexandria,Smouha,2,1,1,13375,2022-08-12,28.82,3279.17 kWh,1167
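
Because the cell uses the current date, the feature changes every time the notebook is rerun. The values shown above (2121 days for 2020-01-01, 1336 days for 2022-02-24) are consistent with a run on 2025-10-22. Pinning a reference date keeps the feature reproducible. The sketch below assumes that date as a constant named `REFERENCE_DATE`; both the name and the choice of date are assumptions, not part of the original notebook.

```python
import pandas as pd

# Assumed fixed reference date, chosen to match the outputs shown above.
REFERENCE_DATE = pd.Timestamp("2025-10-22")

dates = pd.to_datetime(
    pd.Series(["2020-01-01", "2022-02-24", "not a date"]),
    errors="coerce",  # unparseable entries become NaT
)

# Day difference; rows with NaT yield NaN instead of raising.
days_since = (REFERENCE_DATE - dates).dt.days

print(days_since.tolist())
```

Since a fixed offset only shifts every value by the same constant, the regression slope for this feature is unaffected by the exact date chosen; only the intercept moves.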


Clean the `Average_Temperature` column by imputing its missing values.

In [87]:
df['Average_Temperature'].isnull().sum()

np.int64(110)

In [88]:
print(df['Average_Temperature'].max())
print(df['Average_Temperature'].min())
print(df['Average_Temperature'].mean())
print(df['Average_Temperature'].median())

50.0
-4.91
33.499404040404045
35.26


In [89]:
df['Average_Temperature'] = df['Average_Temperature'].fillna(df['Average_Temperature'].mean())  # impute with the column mean
df['Average_Temperature'].isnull().sum()


np.int64(0)

In [90]:
df.head()

Unnamed: 0,Building_ID,Building_Type,Governorate,Neighborhood,Day_of_Week,Occupancy_Level,Appliances_Usage_Level,SquareFootage,Last_Maintenance_Date,Average_Temperature,Energy_Consumption,Days_Since_Last_Maintenance
0,BLD-1000-UD,Residential,Alexandria,Smouha,4,3,1,7063,2020-01-01,28.61,2713.95 kWh,2121
1,BLD-1001-AX,Commercial,Giza,Mohandessin,3,3,3,44372,2022-02-24,33.499404,5744.99 kWh,1336
2,BLD-1002-IH,Industrial,Cairo,New Cairo,1,2,1,19255,2021-02-22,37.88,4101.24 kWh,1703
3,BLD-1003-HE,Unknown,Unknown,Dokki,3,1,3,13265,2023-07-30,35.06,3009.14 kWh,815
4,BLD-1004-XD,Commercial,Alexandria,Smouha,2,1,1,13375,2022-08-12,28.82,3279.17 kWh,1167
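
The summary statistics printed earlier show a mean (33.50) below the median (35.26) and a minimum of -4.91, which suggests that a few low readings pull the mean down. In that situation the median is a common alternative for imputation because it is less sensitive to extreme values. The sketch below compares both on toy values (illustrative assumptions, not dataset rows).

```python
import numpy as np
import pandas as pd

# Toy temperatures with one unusually low reading and two gaps.
temps = pd.Series([28.61, np.nan, 37.88, 35.06, -4.91, np.nan, 40.0])

mean_filled = temps.fillna(temps.mean())      # pulled down by the -4.91 reading
median_filled = temps.fillna(temps.median())  # robust to the low reading

print("mean used:  ", round(temps.mean(), 3))
print("median used:", temps.median())
```

Which statistic is preferable depends on whether the low readings are genuine or data-entry errors, which the source does not state.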


Feature engineering: add a `Day_Type` column that labels Friday (6) and Saturday (7) as the weekend.

In [91]:
df['Day_Type'] = df['Day_of_Week'].apply(lambda x: 'Weekend' if x in [6, 7] else 'Weekday')
df.head()

Unnamed: 0,Building_ID,Building_Type,Governorate,Neighborhood,Day_of_Week,Occupancy_Level,Appliances_Usage_Level,SquareFootage,Last_Maintenance_Date,Average_Temperature,Energy_Consumption,Days_Since_Last_Maintenance,Day_Type
0,BLD-1000-UD,Residential,Alexandria,Smouha,4,3,1,7063,2020-01-01,28.61,2713.95 kWh,2121,Weekday
1,BLD-1001-AX,Commercial,Giza,Mohandessin,3,3,3,44372,2022-02-24,33.499404,5744.99 kWh,1336,Weekday
2,BLD-1002-IH,Industrial,Cairo,New Cairo,1,2,1,19255,2021-02-22,37.88,4101.24 kWh,1703,Weekday
3,BLD-1003-HE,Unknown,Unknown,Dokki,3,1,3,13265,2023-07-30,35.06,3009.14 kWh,815,Weekday
4,BLD-1004-XD,Commercial,Alexandria,Smouha,2,1,1,13375,2022-08-12,28.82,3279.17 kWh,1167,Weekday


Clean the `Energy_Consumption` column by removing the `kWh` suffix and converting it to numeric.

In [92]:
df['Energy_Consumption'].isnull().sum()

np.int64(0)

In [93]:
df['Energy_Consumption'] = df['Energy_Consumption'].astype(str).str.replace('kWh', '', regex=False)
df['Energy_Consumption'] = df['Energy_Consumption'].str.strip()
df['Energy_Consumption'] = pd.to_numeric(df['Energy_Consumption'], errors='coerce')
df.head()

Unnamed: 0,Building_ID,Building_Type,Governorate,Neighborhood,Day_of_Week,Occupancy_Level,Appliances_Usage_Level,SquareFootage,Last_Maintenance_Date,Average_Temperature,Energy_Consumption,Days_Since_Last_Maintenance,Day_Type
0,BLD-1000-UD,Residential,Alexandria,Smouha,4,3,1,7063,2020-01-01,28.61,2713.95,2121,Weekday
1,BLD-1001-AX,Commercial,Giza,Mohandessin,3,3,3,44372,2022-02-24,33.499404,5744.99,1336,Weekday
2,BLD-1002-IH,Industrial,Cairo,New Cairo,1,2,1,19255,2021-02-22,37.88,4101.24,1703,Weekday
3,BLD-1003-HE,Unknown,Unknown,Dokki,3,1,3,13265,2023-07-30,35.06,3009.14,815,Weekday
4,BLD-1004-XD,Commercial,Alexandria,Smouha,2,1,1,13375,2022-08-12,28.82,3279.17,1167,Weekday


Encode the `Day_Type` column as binary (0 = weekday, 1 = weekend).

In [101]:
daytype_map = {
    1: 0,
    2: 0,
    3: 0,
    4: 0,
    5: 0,
    6: 1,
    7: 1
}

df['Day_Type'] = df['Day_of_Week'].map(daytype_map)
df.head(15)

Unnamed: 0,Building_ID,Building_Type,Governorate,Neighborhood,Day_of_Week,Occupancy_Level,Appliances_Usage_Level,SquareFootage,Last_Maintenance_Date,Average_Temperature,Energy_Consumption,Days_Since_Last_Maintenance,Day_Type
0,BLD-1000-UD,Residential,Alexandria,Smouha,4,3,1,7063,2020-01-01,28.61,2713.95,2121,0
1,BLD-1001-AX,Commercial,Giza,Mohandessin,3,3,3,44372,2022-02-24,33.499404,5744.99,1336,0
2,BLD-1002-IH,Industrial,Cairo,New Cairo,1,2,1,19255,2021-02-22,37.88,4101.24,1703,0
3,BLD-1003-HE,Unknown,Unknown,Dokki,3,1,3,13265,2023-07-30,35.06,3009.14,815,0
4,BLD-1004-XD,Commercial,Alexandria,Smouha,2,1,1,13375,2022-08-12,28.82,3279.17,1167,0
5,BLD-1005-VX,Commercial,Unknown,New Cairo,1,1,2,37377,2022-07-31,37.54,4687.67,1179,0
6,BLD-1006-RC,Industrial,Cairo,Heliopolis,7,3,1,38638,2023-07-07,50.0,5526.83,838,1
7,BLD-1007-SN,Residential,Cairo,New Cairo,3,2,1,34950,2020-07-29,38.51,4116.32,1911,0
8,BLD-1008-BA,Industrial,Alexandria,Gleem,2,3,3,29741,2024-12-31,43.62,5841.65,295,0
9,BLD-1009-CG,Residential,Unknown,Mohandessin,6,2,3,17467,2023-01-14,33.18,3419.13,1012,1
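
The same binary flag can be derived without a seven-entry dictionary by testing membership in the weekend codes. This is a minimal sketch on toy day codes (illustrative assumptions). One behavioral difference is worth noting: `isin` turns a missing day code into 0, whereas `map` would leave it as `NaN`.

```python
import numpy as np
import pandas as pd

# Day codes follow the notebook's convention: Sunday = 1 ... Saturday = 7.
day_codes = pd.Series([4, 3, 1, 7, 6, np.nan])

# True for Friday (6) and Saturday (7), then cast to 0/1.
day_type = day_codes.isin([6, 7]).astype(int)

print(day_type.tolist())
```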


## Exploratory Data Analysis

**Q1:** What are the most popular neighborhoods? Plot all of them in order on the graph, and name the top 3.

**Visualization**
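
One possible approach is to count the records per neighborhood with `value_counts()`, which already sorts in descending order, and pass the result to a bar chart. The sketch below uses a short toy series (illustrative assumptions), so its counts say nothing about the real ranking.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy neighborhood labels; the real column would be df['Neighborhood'].
toy = pd.Series(
    ["Smouha", "Dokki", "Smouha", "Maadi", "Dokki", "Smouha", "Gleem"],
    name="Neighborhood",
)

counts = toy.value_counts()  # sorted from most to least frequent
top3 = counts.head(3)

fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(list(counts.index), counts.to_numpy())
ax.set_xlabel("Neighborhood")
ax.set_ylabel("Number of records")
ax.set_title("Records per neighborhood (toy data)")
fig.tight_layout()

print(top3)
```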

**Answer for Q1:** _Your answer here_

**Q2:** Show the distribution of the energy consumption of each Building type.

Which type has the widest distribution of energy consumption?

Which type has the highest consumption on average?

**Visualization**
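
Alongside a box plot or histogram, a grouped summary table can back up the answer numerically: the range or standard deviation indicates spread, and the mean indicates typical consumption. The sketch below runs on toy values (illustrative assumptions), so the winners it reports do not describe the real dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    "Building_Type": ["Residential"] * 3 + ["Commercial"] * 3 + ["Industrial"] * 3,
    "Energy_Consumption": [2700, 3400, 3900, 3000, 3300, 5700, 4100, 4200, 5800],
})

# Per-type summary: centre (mean) and spread (std, range).
summary = toy.groupby("Building_Type")["Energy_Consumption"].agg(
    ["mean", "std", "min", "max"]
)
summary["range"] = summary["max"] - summary["min"]

widest = summary["range"].idxmax()
highest_mean = summary["mean"].idxmax()

print(summary)
print("Widest spread:", widest, "| Highest mean:", highest_mean)
```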

**Answer for Q2:** _Your answer here_

**Q3:** How does the building size affect energy consumption?

**Visualization**
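
A scatter plot is the natural visualization here, and a correlation coefficient plus a fitted slope can quantify what it shows. The sketch below uses only the five cleaned rows displayed earlier in the notebook, which is far too few to draw conclusions from; it illustrates the calculation rather than answering the question.

```python
import numpy as np

# SquareFootage and Energy_Consumption from the first five rows shown above.
size = np.array([7063, 44372, 19255, 13265, 13375], dtype=float)
energy = np.array([2713.95, 5744.99, 4101.24, 3009.14, 3279.17])

# Pearson correlation between size and consumption.
r = np.corrcoef(size, energy)[0, 1]

# Least-squares line: energy is approximately slope * size + intercept.
slope, intercept = np.polyfit(size, energy, 1)

print(f"r = {r:.3f}, slope = {slope:.4f} kWh per unit of area")
```

The slope can be read as the change in consumption per additional unit of floor area, under the assumption that the relationship is roughly linear.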

**Answer for Q3:** _Your answer here_

**Q4:** Do buildings consume more energy if not maintained frequently?

**Visualization**

**Answer for Q4:** _Your answer here_

**Q5:** Are all the numerical variables normally distributed, or is there any skewness?

**Visualization**
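
Histograms show the shape of each distribution, and `DataFrame.skew()` summarizes it as a number: values near 0 indicate symmetry, positive values a long right tail, and negative values a long left tail. As a hint from the earlier statistics, the mean of `Average_Temperature` (33.50) sits below its median (35.26), which would be consistent with a left tail. The sketch below uses two toy columns (illustrative assumptions) to show how the statistic behaves.

```python
import pandas as pd

toy = pd.DataFrame({
    "symmetric": [1, 2, 3, 4, 5, 6, 7],
    "right_tail": [1, 2, 2, 3, 3, 3, 50],  # one large value creates a right tail
})

# Sample skewness per numeric column.
skewness = toy.skew()

print(skewness)
```

On the real data, `df.select_dtypes('number').skew()` would report the same statistic for every numeric column at once.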

**Answer for Q5:** _Your answer here_

**Q6:** What is multicollinearity? And why is it a problem for linear regression? Does this problem exist in this dataset?

**Visualization**
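
A correlation heatmap is the usual visualization, and the variance inflation factor (VIF) gives a per-feature number to go with it. For standardized predictors, the VIF of each feature equals the corresponding diagonal entry of the inverse correlation matrix, so it can be computed with NumPy alone. The sketch below uses synthetic columns (assumptions for illustration) in which `c` is almost a copy of `a`. In this notebook, `Day_Type` is derived directly from `Day_of_Week`, so that pair is one place where the check may be informative.

```python
import numpy as np
import pandas as pd

def vif_table(X: pd.DataFrame) -> pd.Series:
    """Variance inflation factors from the inverse correlation matrix."""
    corr = X.corr().to_numpy()
    return pd.Series(np.diag(np.linalg.inv(corr)), index=X.columns, name="VIF")

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)             # independent of a
c = a + 0.05 * rng.normal(size=200)  # nearly identical to a

vif = vif_table(pd.DataFrame({"a": a, "b": b, "c": c}))

print(vif.round(2))
```

A common rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic collinearity, although the threshold is a convention rather than a strict rule.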

**Answer for Q6:** _Your answer here_

## Data Preparation for Modelling

_Apply any additional data preparation steps needed before modelling below. Show/output the changes to the dataset._
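
After cleaning, `Building_Type`, `Governorate` and `Neighborhood` are still text, and `Building_ID` is an identifier rather than a predictor. One common preparation step is to drop the identifier and one-hot encode the nominal columns. The sketch below is one possible approach, not the required one; it uses the first five cleaned rows shown earlier and drops the first dummy of each column so the dummies are not perfectly collinear with the intercept.

```python
import pandas as pd

toy = pd.DataFrame({
    "Building_ID": ["BLD-1000-UD", "BLD-1001-AX", "BLD-1002-IH", "BLD-1003-HE", "BLD-1004-XD"],
    "Building_Type": ["Residential", "Commercial", "Industrial", "Unknown", "Commercial"],
    "Governorate": ["Alexandria", "Giza", "Cairo", "Unknown", "Alexandria"],
    "SquareFootage": [7063, 44372, 19255, 13265, 13375],
    "Energy_Consumption": [2713.95, 5744.99, 4101.24, 3009.14, 3279.17],
})

y = toy["Energy_Consumption"]                                # target
X = toy.drop(columns=["Building_ID", "Energy_Consumption"])  # candidate predictors

# One-hot encode nominal columns; drop_first removes one reference level each.
X = pd.get_dummies(X, columns=["Building_Type", "Governorate"],
                   drop_first=True, dtype=int)

print(X.columns.tolist())
```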

## Modelling

_Apply the linear regression model below._
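
As a reference for what the fitting step does, the sketch below estimates an ordinary least-squares model with NumPy on synthetic data and a simple train/test split. The data-generating numbers (intercept 2000, slope 0.08, noise level 100) are assumptions made up for the example, and a modelling library could be used instead of the direct solver shown here.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200

# Synthetic data: consumption grows linearly with floor area, plus noise.
size = rng.uniform(1_000, 45_000, n)
energy = 2_000 + 0.08 * size + rng.normal(0, 100, n)

# Shuffle indices and hold out 20% of the rows for testing.
idx = rng.permutation(n)
train, test = idx[:160], idx[160:]

def design(x):
    """Design matrix with an intercept column followed by the feature."""
    return np.column_stack([np.ones(len(x)), x])

# Solve the least-squares problem on the training rows only.
coef, *_ = np.linalg.lstsq(design(size[train]), energy[train], rcond=None)
intercept, slope = coef

pred_test = design(size[test]) @ coef

print(f"intercept = {intercept:.1f}, slope = {slope:.4f}")
```

Fitting on the training rows only keeps the held-out rows available for an unbiased check of the predictions.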

## Model Evaluation

Evaluate the model you applied.
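
Three metrics commonly reported for regression are the mean absolute error (MAE), the root mean squared error (RMSE) and the coefficient of determination (R²). The sketch below computes them from their definitions; the helper name `regression_metrics` and the four toy values are assumptions for illustration.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and R^2 for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    mae = np.mean(np.abs(residuals))
    rmse = np.sqrt(np.mean(residuals ** 2))
    ss_res = np.sum(residuals ** 2)                  # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    r2 = 1 - ss_res / ss_tot

    return {"MAE": mae, "RMSE": rmse, "R2": r2}

metrics = regression_metrics([3, 5, 7, 9], [2, 5, 8, 9])

print(metrics)
```

MAE and RMSE are expressed in the units of the target (kWh here), while R² is unitless, so reporting both kinds gives a fuller picture.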

## Conclusion and Recommendations

Comment on the model performance and your findings from model evaluation. State the problems (if any) and suggest possible solutions. Would you recommend this model for an electricity company aiming to estimate the energy levels of each building?

**Answer**: your answer here.