
# Data Preprocessing

In [None]:
import numpy as np
import pandas as pd
import glob

path = '../input/building-energy-dataset'
all_files = glob.glob(path + "/*.csv") #Reading all the files

li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col="Time",parse_dates=True, header=0)
    li.append(df)

building = pd.concat(li, axis=0, ignore_index=False)
building.sort_index(inplace= True)

building.info()
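
The loading pattern above can be tried without the Kaggle data. The following is a minimal sketch under stated assumptions: the two CSV files, their names, and the `load` column are hypothetical stand-ins written to a temporary directory. It illustrates why the index is sorted afterwards: `glob` does not guarantee any file order.

```python
import glob
import os
import tempfile

import pandas as pd

# Hypothetical stand-in data: two small CSV files with a "Time" column
with tempfile.TemporaryDirectory() as tmp_dir:
    pd.DataFrame(
        {"Time": ["2016-01-02 00:00", "2016-01-02 00:15"], "load": [3.0, 4.0]}
    ).to_csv(os.path.join(tmp_dir, "part_b.csv"), index=False)
    pd.DataFrame(
        {"Time": ["2016-01-01 00:00", "2016-01-01 00:15"], "load": [1.0, 2.0]}
    ).to_csv(os.path.join(tmp_dir, "part_a.csv"), index=False)

    # Read every file while the temporary directory still exists
    frames = [
        pd.read_csv(name, index_col="Time", parse_dates=True)
        for name in glob.glob(os.path.join(tmp_dir, "*.csv"))
    ]

# Stack the rows and restore chronological order, whatever order glob returned
combined = pd.concat(frames, axis=0).sort_index()
print(combined)
```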

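Now resample the data set to an hourly resolution. Beware of the units: summing is appropriate for quantities that accumulate over time (such as energy in kWh), whereas averaged quantities (such as power in kW) should be resampled with the mean.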

In [None]:
building = building.resample('H').sum() #The obj must have a date-time like index.
building.head()
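
To illustrate the remark about units, here is a small sketch on made-up 15-minute readings (the values and variable names are assumptions, not taken from the building data set). Average power in kW is resampled with the mean, while energy in kWh is resampled with the sum; both then describe the same hour consistently.

```python
import pandas as pd

# Hypothetical 15-minute readings of average power in kW
index = pd.date_range("2016-01-01 00:00", periods=8, freq="15min")
power_kw = pd.Series([4.0, 4.0, 8.0, 8.0, 2.0, 2.0, 2.0, 2.0], index=index)

# Energy used in each 15-minute interval (15 minutes = 0.25 h)
energy_kwh = power_kw * 0.25

# Power is averaged over the hour, energy is accumulated over the hour
hourly_power_kw = power_kw.resample("60min").mean()
hourly_energy_kwh = energy_kwh.resample("60min").sum()

# Summing the power readings instead gives a number that is neither kW nor kWh
summed_power = power_kw.resample("60min").sum()

print(hourly_power_kw)
print(hourly_energy_kwh)
print(summed_power)
```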

Below we load the weather data and rename its abbreviated columns to descriptive names that include the units.

In [None]:
path = '../input/weather-data/Weather_data.csv'  # use your path
weather_data = pd.read_csv(path, index_col="Datetime",parse_dates=True, header=0)
column_names = {'GHI':'Global Horizontal Irradiance [W/m2]', 'DIF':'Diffuse Horizontal Irradiance [W/m2]', 'DNI':'Direct Normal Irradiance [W/m2]', 'SE':'Sun elevation angle [°]', 'SA':'Sun azimuth angle [°]', 'TMOD':'Module temperature [°C]', 
                'TEMP':'Air temperature [°C]', 'WS':'Wind speed [m/s]', 'WD':'Wind direction [°]', 'RH':'Relative humidity [%]', 'AP':'Atmospheric pressure [hPa]', 'PWAT':'Precipitable Water [kg/m2]', 'SWE':'Snow water equivalent [kg/m2]', 'WG':'Wind gust [m/s]'}
weather_data.rename(columns=column_names, inplace=True)
weather_data.head(5)

Then resample weather measurements to hourly values as well.

In [None]:
weather_data = weather_data.resample('H').mean()
weather_data.head()

Let us combine both data frames column-wise with the pd.concat() function; with axis=1 the rows are aligned on their shared datetime index.

In [None]:
df = pd.concat([building, weather_data], axis=1)
df.head()
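
The alignment behaviour of `pd.concat(..., axis=1)` can be seen on a toy example. This is a sketch with hypothetical frames (`left`, `right`) whose time ranges only partly overlap; by default the result keeps the union of both indexes and fills the gaps with NaN.

```python
import pandas as pd

# Two hypothetical hourly frames whose time ranges overlap only partly
left = pd.DataFrame(
    {"load": [1.0, 2.0, 3.0]},
    index=pd.date_range("2016-01-01 00:00", periods=3, freq="60min"),
)
right = pd.DataFrame(
    {"temp": [10.0, 11.0, 12.0]},
    index=pd.date_range("2016-01-01 01:00", periods=3, freq="60min"),
)

# axis=1 places the columns side by side and aligns rows on the index
joined = pd.concat([left, right], axis=1)
print(joined)

# Rows where both sources have a value
overlap = joined.dropna()
print(overlap)
```

If only the overlapping timestamps are wanted, `join="inner"` can be passed to `pd.concat` instead of dropping the NaN rows afterwards.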

# Exploratory data analysis

With our data set now preprocessed, we would like to visually explore a few selected features.

### Run charts

Let us start with a simple run chart of the building energy consumption over a single year, so that the plot does not become too large.

In [None]:
# Select a subset of the building dataframe using the column 'HV light Power [kW]' and a year of your choosing
building.info()
data_subset = building['HV light Power [kW]'].loc['2016']
print(data_subset)
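
Selecting a year with `.loc['2016']` relies on partial string indexing of a datetime index. A minimal sketch with a hypothetical hourly series that starts before and ends after 2016:

```python
import pandas as pd

# Hypothetical hourly series spanning slightly more than the year 2016
index = pd.date_range("2015-12-31 00:00", "2017-01-01 23:00", freq="60min")
series = pd.Series(range(len(index)), index=index, name="hypothetical load")

# A year string selects every timestamp that falls within that year
year_2016 = series.loc["2016"]

print(year_2016.index.min(), year_2016.index.max(), len(year_2016))
```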

In [None]:
# Now plot the subset
import matplotlib.pyplot as plt

# Plot the data
plt.figure(figsize=(10, 6))
plt.plot(data_subset.index, data_subset.values, label='HV light Power [kW]')
plt.title('HV Light Power Consumption for 2016')
plt.xlabel('Date')
plt.ylabel('Power [kW]')
plt.legend()
plt.grid(True)
plt.xticks(rotation = 45)
plt.show()

To automate plotting for multiple years of data, we group the data by year and plot one year at a time.

In [None]:
#group data by year
groups = building['HV light Power [kW]'].groupby(pd.Grouper(freq= 'Y'))
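
As an illustration of yearly grouping, the sketch below uses a different but related technique: grouping on `index.year` rather than on `pd.Grouper`. The series is a hypothetical stand-in. Note one difference: here the group key is a plain integer year, whereas with `pd.Grouper` it is a timestamp, which is why the plotting code reads `name.year`.

```python
import pandas as pd

# Hypothetical hourly series that crosses a year boundary
index = pd.date_range("2016-12-31 22:00", periods=4, freq="60min")
series = pd.Series([1.0, 2.0, 3.0, 4.0], index=index)

# Group by the calendar year of each timestamp
groups = series.groupby(series.index.year)

# Iterating a groupby yields (key, sub-series) pairs
sizes = {int(year): len(group) for year, group in groups}
print(sizes)
```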

In [None]:
import matplotlib.pyplot as plt

# Create one subplot per year
fig, axs = plt.subplots(len(groups), 1, figsize=(15, 15))

# Loop over the yearly groups and plot each one against its hour of the year
for ax, (name, group) in zip(axs, groups):
    ax.plot(group.values)
    ax.set_xlabel('Hour of Year')
    ax.set_ylabel('HV light Power [kW]')
    ax.set_title(name.year)

# Adjust the vertical spacing once, after all subplots are drawn
plt.subplots_adjust(hspace=0.5)

### Correlation heatmap

Let us follow the instructions of https://www.python-graph-gallery.com/91-customize-seaborn-heatmap to plot a correlation heatmap of all our available measurements.

In [None]:
# Calculate the correlation matrix of the aggregated dataframe df using a built-in function of pandas
# Each cell of a correlation matrix holds the correlation between one pair of features.
correlation = df.corr()
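
What `DataFrame.corr()` returns can be checked on a tiny example. The frame below is hypothetical; by default pandas computes the Pearson correlation for every pair of columns, so a perfectly linear rising relation gives 1 and a perfectly linear falling relation gives -1.

```python
import pandas as pd

# Hypothetical features: one rises linearly with x, one falls linearly with x
frame = pd.DataFrame(
    {
        "x": [1.0, 2.0, 3.0, 4.0],
        "rising": [2.0, 4.0, 6.0, 8.0],
        "falling": [4.0, 3.0, 2.0, 1.0],
    }
)

# Square, symmetric matrix with ones on the diagonal
corr = frame.corr()
print(corr)
```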

In [None]:
import seaborn as sns

# Display the correlation matrix as a colour-coded table using the pandas Styler;
# sns.heatmap(correlation, cmap='coolwarm') draws the seaborn heatmap described under the link
correlation.style.background_gradient(cmap='coolwarm')

Going even further, we could produce a hierarchical clustering of the correlation matrix as described in https://www.python-graph-gallery.com/405-dendrogram-with-heatmap-and-coloured-leaves

In [None]:
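# One possible completion: cluster the rows and columns of the correlation matrix
# (requires scipy and fails if the matrix contains NaN values)
sns.clustermap(correlation, cmap='coolwarm')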
plt.show()

### Heatmaps for time series EDA

Let us follow the steps described in https://www.python-graph-gallery.com/heatmap-for-timeseries-matplotlib

In [None]:
# Select a subset of the data set - over a specific month and year
subset = building[(building.index.year == 2019) & (building.index.month == 8)]

In [None]:
# define which feature you would like to visually explore
feature = df['HVAC Actual [kW]']

In [None]:
# Extract hour, day, and the HVAC load
hour = subset.index.hour
day = subset.index.day
data = subset['HVAC Actual [kW]']

# Re-arrange the values into a 24 x n_days grid (one column per day, filled column by column)
data = data.values.reshape(24, len(day.unique()), order="F")

# Compute x and y grids, passed to `ax.pcolormesh()`.

# The inner + 1 increases the length by one, because the grid holds cell edges
# The outer + 1 ensures days start at 1, and not at 0.
xgrid = np.arange(day.max() + 1) + 1

# Hours start at 0; 25 edges bound the 24 hourly rows
ygrid = np.arange(25)
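
The reshape with `order="F"` and the sizes of the two grids can be verified with NumPy alone. This sketch assumes a hypothetical two-day series of hourly values numbered 0 to 47, so each value's position in the grid is easy to read off.

```python
import numpy as np

# Hypothetical hourly values for two complete days, numbered 0..47
n_days = 2
values = np.arange(24 * n_days)

# Column-major fill: each column holds one day, each row one hour of the day
grid = values.reshape(24, n_days, order="F")

# Cell edges: one more edge than cells along each axis
xgrid = np.arange(n_days + 1) + 1  # day edges 1, 2, 3
ygrid = np.arange(25)              # hour edges 0..24

print(grid[:3])
print(xgrid, len(ygrid))
```

With the default row-major order, consecutive hours would instead run along the rows, so the days would no longer line up as columns.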

In [None]:
fig, ax = plt.subplots()
ax.pcolormesh(xgrid, ygrid, data)
ax.set_frame_on(False)

Making this a little more complex and plotting every month of every year, we get

In [None]:
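# Global colour-scale limits so that all monthly panels are comparable
# (the values are HVAC power in kW, not temperatures, despite the names)
MIN_TEMP = building["HVAC Actual [kW]"].min()
MAX_TEMP = building["HVAC Actual [kW]"].max()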

# Define a function for creating a single plot
def single_plot(data, month, year, ax):
    # Filter data by year and month
    data = data[(data.index.year == year) & (data.index.month == month)]

    # Extract hour and day, then arrange the values as a 24 x n_days grid (one column per day)
    hour = data.index.hour
    day = data.index.day
    temp = data.values.reshape(24, len(day.unique()), order="F")
    
    # Create x and y grids
    xgrid = np.arange(day.max() + 1) + 1
    ygrid = np.arange(25)
    
    # Create a pseudocolor plot with specific settings
    ax.pcolormesh(xgrid, ygrid, temp, cmap="magma", vmin=MIN_TEMP, vmax=MAX_TEMP)
    
    # Invert the vertical axis
    ax.set_ylim(24, 0)
    
    # Set tick positions for both axes
    ax.yaxis.set_ticks([i for i in range(24)])
    ax.xaxis.set_ticks([10, 20, 30])
    
    # Remove ticks by setting their length to 0
    ax.yaxis.set_tick_params(length=0)
    ax.xaxis.set_tick_params(length=0)
    
    # Remove the frame (axes border and background patch)
    ax.set_frame_on(False)
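
Reshaping to 24 rows by the number of days only works when the selected month contains every hour of every day; otherwise NumPy raises an error. The sketch below is one possible way to check this beforehand, using a hypothetical series whose first month is incomplete. The names `series` and `complete` are illustrative and not part of the notebook.

```python
import calendar

import pandas as pd

# Hypothetical hourly series that starts in the middle of January 2019
index = pd.date_range("2019-01-15 00:00", "2019-03-31 23:00", freq="60min")
series = pd.Series(1.0, index=index)

# Count the hourly rows available for each (year, month) pair
counts = series.groupby([series.index.year, series.index.month]).size()

# A month is complete when it has 24 rows for each of its calendar days
complete = {
    (int(year), int(month)): int(n_hours) == 24 * calendar.monthrange(int(year), int(month))[1]
    for (year, month), n_hours in counts.items()
}
print(complete)
```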

In [None]:
# Number of years to plot: every year from the first up to, but not including, the last year in the index
number_of_years_to_plot = building.index.year.max() - building.index.year.min()

# Create a figure and an array of subplots with specified settings
fig, axes = plt.subplots(number_of_years_to_plot, 12, figsize=(30, 20), sharey=True)

# Iterate over years (the last year in the index is excluded by range) and months,
# drawing each panel with the single_plot function
for i, year in enumerate(range(building.index.year.min(), building.index.year.max())):
    for j, month in enumerate(range(1, 13)):
        single_plot(building["HVAC Actual [kW]"], month, year, axes[i, j])

# Adjust margin and space between subplots
# Extra space is on the left to add a label
fig.subplots_adjust(left=0.05, right=0.98, top=0.9, hspace=0.08, wspace=0.04)