# **Clustering Analysis of Weather Data**
### Author: *Phani Arvind Vadali*

In this study, weather data for Boulder, CO, obtained from an EnergyPlus Weather (EPW) file, is analyzed in an unsupervised manner. The objective of the study is to find patterns in the weather data.

## Mount Drive

In [None]:
from google.colab import drive  # The google colab module to access folders on GDrive
import os  # the python module for all things related to the OS.

# we mount our gDrive drive at the startpoint
drive.mount('/content/drive')

# change that into the path you want to change into (as if you were starting in your current root folder in drive)
my_folder_path = "AREN5030/HOMEWORKS/HW6"

# we navigate to the target folder
os.chdir("drive/My Drive/" + my_folder_path)

Mounted at /content/drive


## Import all packages

In [None]:
import pandas as pd
import numpy as np
import plotly.graph_objects as go
import plotly.express as px
import plotly.subplots as sp
import random
import itertools
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split, cross_val_score
from sklearn.metrics import confusion_matrix, accuracy_score, recall_score, make_scorer, r2_score
from sklearn.cluster import KMeans, AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, cut_tree

## Import Data
Extract data from a csv file to a dataframe and set index in datetime format.

In [None]:
df = pd.read_csv("BoulderWeather.csv")
TimeRange = pd.date_range(start = pd.to_datetime("2023-01-01 01:00:00"),end=pd.to_datetime("2024-01-01 00:00:00"),freq="1H")
df = df.set_index(TimeRange)
df = df.drop(columns=["HH:MM","Date"])
df = pd.concat([pd.DataFrame(index=[pd.to_datetime("2023-01-01 00:00:00")],columns=df.columns,data=df.iloc[[-1]].values),df])
df = df.iloc[:-1]
df.head()

Unnamed: 0,Dry Bulb Temperature {C},Dew Point Temperature {C},Relative Humidity {%},Atmospheric Pressure {Pa},Extraterrestrial Horizontal Radiation {Wh/m2},Extraterrestrial Direct Normal Radiation {Wh/m2},Horizontal Infrared Radiation Intensity from Sky {Wh/m2},Global Horizontal Radiation {Wh/m2},Direct Normal Radiation {Wh/m2},Diffuse Horizontal Radiation {Wh/m2},...,Visibility {km},Ceiling Height {m},Present Weather Observation,Precipitable Water {mm},Aerosol Optical Depth {.001},Snow Depth {cm},Days Since Last Snow,Albedo {.01},Liquid Precipitation Depth {mm},Liquid Precipitation Quantity {hr}
2023-01-01 00:00:00,-13.0,-15.0,83.0,82331.0,0.0,0.0,212.0,0.0,0.0,0.0,...,8047.0,457.0,0.0,2.0,0.085,11.0,88.0,0.55,0.1,0.0
2023-01-01 01:00:00,-13.0,-16.0,76.0,82331.0,0.0,0.0,207.0,0.0,0.0,0.0,...,777.7,610.0,0.0,2.0,0.085,12.0,88.0,0.56,0.1,0.0
2023-01-01 02:00:00,-14.0,-16.0,83.0,82267.0,0.0,0.0,200.0,0.0,0.0,0.0,...,6437.0,427.0,0.0,2.0,0.085,12.0,88.0,0.56,0.0,0.0
2023-01-01 03:00:00,-14.0,-16.0,83.0,82267.0,0.0,0.0,210.0,0.0,0.0,0.0,...,6437.0,792.0,0.0,2.0,0.085,12.0,88.0,0.56,0.0,0.0
2023-01-01 04:00:00,-15.0,-17.0,83.0,82202.0,0.0,0.0,214.0,0.0,0.0,0.0,...,6437.0,579.0,0.0,2.0,0.085,11.0,88.0,0.54,0.0,0.0


Create columns to represent the hour of day and day of the year.


In [None]:
df['day_of_year'] = np.repeat(np.arange(1, 366), 24)  # This will repeat each element of the array 24 times.
df['hour'] = df.index.hour

Pivot the current dataframe such that the hours of the day are the columns and the days of the year are rows with each cell containing the corresponding dry bulb temperature.

In [None]:
df_pivoted = df.pivot(index="day_of_year", columns="hour", values='Dry Bulb Temperature {C}')
df_pivoted.head()

day_of_year \ hour,0,1,2,3,4,5,6,7,8,9,...,14,15,16,17,18,19,20,21,22,23
1,-13.0,-13.0,-14.0,-14.0,-15.0,-15.0,-15.0,-15.0,-16.0,-15.0,...,-11.0,-10.0,-10.0,-11.0,-12.0,-13.0,-13.0,-15.0,-14.0,-14.0
2,-15.0,-15.0,-13.0,-13.0,-12.0,-11.0,-11.0,-13.0,-10.0,-10.0,...,1.0,0.4,1.0,1.0,-1.0,-2.0,-2.0,-4.0,-4.0,-3.0
3,-5.0,-6.0,-5.0,-4.0,-6.0,-6.0,-4.0,-5.0,-5.0,-3.0,...,5.1,6.0,5.7,3.8,1.3,6.0,8.0,3.0,4.0,4.0
4,6.0,8.0,10.0,5.0,1.0,0.0,0.0,-2.0,1.0,2.0,...,11.1,15.0,10.9,8.8,4.4,2.4,1.2,4.0,6.0,3.0
5,4.0,3.0,2.0,2.0,1.0,2.0,1.0,0.0,0.0,2.0,...,13.0,13.0,9.8,8.7,8.0,8.0,8.0,4.8,4.7,4.0
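
The reshaping step can be illustrated on a small synthetic frame. This is a minimal sketch, not the notebook's data: the `toy` frame, its `temp` column, and the two-day range are made up for illustration.

```python
import numpy as np
import pandas as pd

# Two days of synthetic hourly data (values are placeholders, not weather data)
idx = pd.date_range("2023-01-01 00:00", periods=48, freq="h")
toy = pd.DataFrame({"temp": np.arange(48, dtype=float)}, index=idx)
toy["day_of_year"] = toy.index.dayofyear
toy["hour"] = toy.index.hour

# One row per day, one column per hour of the day
toy_pivoted = toy.pivot(index="day_of_year", columns="hour", values="temp")
print(toy_pivoted.shape)
```

Each row of the pivoted frame is then a 24-dimensional observation (one daily profile), which is the form the clustering steps below operate on.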


## K - Means Clustering
K-means clustering can now be used to divide the days of the year into clusters. Four clusters are chosen here, in the hope that each one represents a distinct temperature range.

In [None]:
DryBulb_Kmeans = KMeans(n_clusters=4,random_state=3,n_init=20).fit(df_pivoted)
df_pivoted["KCluster"] = DryBulb_Kmeans.labels_        # Labelling each data point
cluster_centers = DryBulb_Kmeans.cluster_centers_   # Mean of all points in the cluster
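
The choice of four clusters can also be checked numerically. The sketch below is one possible way to do that, assuming synthetic data in place of `df_pivoted`: it compares the K-means inertia and the silhouette score over a range of cluster counts. The array `X`, the `levels` list, and the range of `k` are illustrative assumptions, not part of the original analysis.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Synthetic stand-in for the pivoted table: 4 groups of 30 "days" x 24 "hours"
levels = [-5.0, 5.0, 15.0, 25.0]
X = np.vstack([lvl + rng.normal(scale=1.0, size=(30, 24)) for lvl in levels])

scores = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, random_state=3, n_init=20).fit(X)
    # inertia_: within-cluster sum of squares; silhouette: separation vs cohesion
    scores[k] = (km.inertia_, silhouette_score(X, km.labels_))
    print(k, round(scores[k][0], 1), round(scores[k][1], 3))

# Cluster count with the highest silhouette score
best_k = max(scores, key=lambda k: scores[k][1])
```

On real weather data the groups are not as cleanly separated as in this synthetic example, so the scores should be read together with the plots rather than treated as a definitive answer.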

To check if the four clusters were necessary or sufficient the daily average temperature is plotted against the day of the year for all the days of each cluster.

In [None]:
df_pivoted["DailyAvgTemp"] = df_pivoted.loc[:,~df_pivoted.columns.isin(["KCluster"])].mean(axis=1)
fig = go.Figure()
colors = ['rgb(31, 119, 180)', 'rgb(255, 127, 14)', 'rgb(44, 160, 44)', 'rgb(214, 39, 40)']

for class_label in {0,1,2,3}:
    df_class = df_pivoted[df_pivoted['KCluster'] == class_label]
    fig.add_trace(go.Scatter(x=df_class.index, y=df_class['DailyAvgTemp'],
                             mode='markers',
                             marker=dict(size=10,color=colors[class_label]),
                             name=f'Cluster {class_label}'))
fig.update_layout(title='Daily average temperature across the year segregated by cluster', xaxis_title='Day of the year', yaxis_title='Temperature (C)')

fig.update_layout(
    margin=dict(l=50, r=50, b=50, t=75),  # Adjust margins to make the plot tighter
    autosize=False,  # Disable autosizing
    width=1000,       # Set the width of the plot
    height=400,       # Set the height of the plot
    title_x=0.5,
    title_y=0.95
)
fig.show()


In the plot above, a clear demarcation can be seen among the four clusters, with daily average temperature decreasing from one cluster to the next. The plot also suggests that four clusters are both necessary and sufficient to capture the different groupings in the data.

Although K-means clustering reports the center of each cluster, a more useful piece of information is the data point closest to each center, since that point is an actual day that can serve as the representative day for the whole cluster.

In [None]:
df_rep = pd.DataFrame()
for i in range(0,4):
  dfi = df_pivoted.loc[df_pivoted["KCluster"]==i]
  dfi_rep = dfi.iloc[np.argmin(np.linalg.norm(dfi.iloc[:,:-2].values - cluster_centers[[i]], axis=1))]  #Identify the data point closest to the center
  df_rep = pd.concat([df_rep,dfi_rep],axis=1)
sorted_columns = sorted(df_rep.columns, reverse=True)
df_rep = df_rep.reindex(columns=sorted_columns) # Sort the column names which is the day of the year in descending order
rep_days = sorted(np.unique(df.loc[df["day_of_year"].isin(df_rep.columns)].index.date),reverse=True) # Find the dates of the representative days based on day of year column and sort them in descending order
df_rep.rename(columns=dict(zip(df_rep.columns, rep_days)), inplace=True)
df_rep = df_rep[df_rep.iloc[0].sort_values().index]
df_rep.head()

hour,2023-12-26,2023-11-01,2023-10-08,2023-07-16
0,-1.0,2.9,8.2,18.0
1,-1.0,3.8,8.1,16.0
2,-1.0,3.1,7.4,15.0
3,-2.0,3.3,7.0,15.0
4,-2.0,3.1,6.8,11.0
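
The selection of the member closest to each cluster center can be written as a small standalone function. This is a minimal sketch on made-up two-dimensional points; the helper name `closest_member` and the toy arrays are hypothetical and only illustrate the idea used in the cell above.

```python
import numpy as np

def closest_member(X, labels, centers):
    """For each cluster, return the row index of X nearest to that cluster's center."""
    reps = []
    for k, center in enumerate(centers):
        members = np.flatnonzero(labels == k)           # row indices belonging to cluster k
        dists = np.linalg.norm(X[members] - center, axis=1)
        reps.append(int(members[np.argmin(dists)]))     # map back to a row index of X
    return reps

# Toy data: two clusters of three points each
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.4, 0.6],
              [10.0, 10.0], [12.0, 12.0], [10.8, 11.0]])
labels = np.array([0, 0, 0, 1, 1, 1])
centers = np.array([X[labels == k].mean(axis=0) for k in range(2)])

rep_rows = closest_member(X, labels, centers)
print(rep_rows)
```

Restricting the search to the members of each cluster guarantees that the representative point actually belongs to the cluster it represents.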


The representative days identified above are plotted below along with their hourly data. The days are July 16, which is summer; October 8, which is representative of mild summer/spring weather; November 1, which is fall; and December 26, which represents winter weather.

In [None]:
fig = go.Figure()

colors = ['rgb(214, 39, 40)', 'rgb(255, 127, 14)', 'rgb(44, 160, 44)', 'rgb(31, 119, 180)']
for i, column in enumerate(df_rep.columns):
    fig.add_trace(go.Scatter(
        x=df_rep.index,
        y=df_rep[column],
        mode='lines+markers',
        name=str(column),
        marker=dict(color=colors[i])  # Use 'color' instead of 'colors'
    ))

fig.update_layout(
    title='Hourly Variation in Outdoor Temperature for 4 Representative Days',
    xaxis_title='Hour of Day',
    yaxis_title='Temperature (C)',
    margin=dict(l=50, r=50, b=50, t=75),  # Adjust margins
    autosize=False,  # Disable autosizing
    width=1000,      # Set plot width
    height=400,      # Set plot height
    title_x=0.5,     # Center the title horizontally
    title_y=0.95     # Adjust the title position vertically
)

fig.show()

As a final test, the variance of the whole year's worth of hourly temperature data can be compared with the variance of the hourly temperature data within each individual cluster using box plots.

In [None]:
fig = go.Figure()
xlab = [1,2,3,4,5]
fig.add_trace(go.Box(y=df["Dry Bulb Temperature {C}"], name='Whole Year'))
#fig.update_layout(title='Whole Dataset Box Plot')

# Create box plots for each cluster sequentially
colors = ['rgb(31, 119, 180)', 'rgb(255, 127, 14)', 'rgb(44, 160, 44)', 'rgb(214, 39, 40)']

for i in range(4):
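    # Keep only the 24 hourly temperature columns so the cluster label and daily average are not plotted
    cluster_data = df_pivoted[df_pivoted['KCluster'] == i].iloc[:, :24]
    fig.add_trace(go.Box(y=cluster_data.values.flatten(),name=f"Cluster {i}",marker_color=colors[i]))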

fig.update_layout(title='Box plots of hourly temperature data from the whole year as well as from each individual cluster', yaxis_title='Temperature (C)')
fig.update_layout(
    margin=dict(l=50, r=50, b=50, t=75),  # Adjust margins to make the plot tighter
    autosize=False,  # Disable autosizing
    width=1000,       # Set the width of the plot
    height=400,       # Set the height of the plot
    title_x=0.5,
    title_y=0.95
)
# Show plot
fig.show()

The plots above show that each individual cluster has a much lower variance than the whole dataset. This difference would be even more apparent if daily average temperature were used, because the variation within each individual day would then be removed. Either way, the conclusion is that clustering can segregate the data into groups with much lower variance than the original data.
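
The same comparison can be made numerically instead of visually. The sketch below uses synthetic temperatures with assumed cluster means and spread; the `toy` frame and its column names are illustrative and are not taken from the notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Four synthetic clusters of 200 hourly values each, with different mean temperatures
toy = pd.DataFrame({
    "cluster": np.repeat([0, 1, 2, 3], 200),
    "temp": np.concatenate([rng.normal(m, 3.0, 200) for m in (-5, 5, 15, 25)]),
})

overall_var = toy["temp"].var()                    # variance of the pooled data
within_var = toy.groupby("cluster")["temp"].var()  # variance inside each cluster

print(round(overall_var, 1))
print(within_var.round(1))
```

The pooled variance contains both the spread within each cluster and the spread between the cluster means, which is why it is larger than any of the within-cluster variances when the means differ.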

## Hierarchical Clustering

To check how the results of the two clustering techniques differ, hierarchical clustering is applied to the same pivoted dataset. Hierarchical clustering is typically a bottom-up approach in which a **dissimilarity** metric is computed between data points and the most similar points or clusters are merged step by step.

In [None]:
HClust = AgglomerativeClustering(distance_threshold=0,n_clusters=None,linkage='complete')
hc_comp = HClust.fit_predict(df_pivoted.iloc[:,:-2])
linkage_comp = linkage(df_pivoted.iloc[:,:-2],method="complete")  #Using complete linkage
cut_labels = cut_tree(linkage_comp, n_clusters=4).flatten()   # Like above 4 clusters were chosen

# Assign the cluster labels back to the DataFrame
df_pivoted['HCluster'] = cut_labels
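
The bottom-up merging can be seen on a small example. This is a minimal sketch on synthetic two-dimensional points, using SciPy's `linkage` with complete linkage and `fcluster` to cut the tree into a fixed number of clusters; the data and the choice of three groups are assumptions made for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(2)
# Three well-separated synthetic groups of 10 points each
group_centers = ([0, 0], [10, 0], [0, 10])
X = np.vstack([np.array(c) + rng.normal(scale=0.5, size=(10, 2)) for c in group_centers])

# Each row of Z records one merge: the two merged clusters, the merge height, and the new cluster size
Z = linkage(X, method="complete", metric="euclidean")

# Cut the tree so that at most 3 clusters remain (labels start at 1)
labels = fcluster(Z, t=3, criterion="maxclust")
print(Z.shape, sorted(set(labels)))
```

The third column of `Z` holds the merge heights, which is what a dendrogram draws on its vertical axis.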

Like K-means clustering, the daily average temperature across the days of the year can be plotted and segregated into different clusters based on hierarchical clustering.

In [None]:
fig = go.Figure()
colors = ['rgb(31, 119, 180)', 'rgb(255, 127, 14)', 'rgb(44, 160, 44)', 'rgb(214, 39, 40)']

for class_label in {0,1,2,3}:
    df_class = df_pivoted[df_pivoted['HCluster'] == class_label]
    fig.add_trace(go.Scatter(x=df_class.index, y=df_class['DailyAvgTemp'],
                             mode='markers',
                             marker=dict(size=10,color=colors[class_label]),
                             name=f'Cluster {class_label}'))
fig.update_layout(title='Daily average temperature across the year segregated by cluster', xaxis_title='Day of the year', yaxis_title='Temperature (C)')

fig.update_layout(
    margin=dict(l=50, r=50, b=50, t=75),  # Adjust margins to make the plot tighter
    autosize=False,  # Disable autosizing
    width=1000,       # Set the width of the plot
    height=400,       # Set the height of the plot
    title_x=0.5,
    title_y=0.95
)
fig.show()

The plot above indicates that the two methods do assign some days to different clusters. To visualize the differences, the same plot is redrawn with the days that were assigned to different clusters by the two methods highlighted.

In [None]:
# First, both sets of cluster labels need to be put in the same order.
# Swap hierarchical labels 0 and 3 in a single mapping so the two replacements cannot interfere.
df_pivoted["HClustNew"] = df_pivoted["HCluster"].map({0: 3, 1: 1, 2: 2, 3: 0})
df_pivoted = df_pivoted.drop(columns=["HCluster"])

In [None]:
fig = go.Figure()
df_same = df_pivoted[df_pivoted['HClustNew'] == df_pivoted['KCluster']]
colors = ['rgb(31, 119, 180)', 'rgb(255, 127, 14)', 'rgb(44, 160, 44)', 'rgb(214, 39, 40)']
for class_label in {0,1,2,3}:
    df_class = df_same[df_same['HClustNew'] == class_label]
    fig.add_trace(go.Scatter(x=df_class.index, y=df_class['DailyAvgTemp'],
                             mode='markers',
                             marker=dict(size=10,color=colors[class_label],symbol="circle"),
                             name=f'Same Cluster: {class_label}'))

df_diff = df_pivoted[df_pivoted['HClustNew'] != df_pivoted['KCluster']]
fig.add_trace(go.Scatter(x=df_diff.index, y=df_diff['DailyAvgTemp'],
                             mode='markers',
                             marker=dict(size=10,symbol="star",color="black"),
                             name=f'Different clusters'))

fig.update_layout(title='Difference in cluster assigned to each day by K-means and Hierarchical clustering', xaxis_title='Day of the year', yaxis_title='Daily Average Temperature (C)')

fig.update_layout(
    margin=dict(l=50, r=50, b=50, t=75),  # Adjust margins to make the plot tighter
    autosize=False,  # Disable autosizing
    width=1000,       # Set the width of the plot
    height=400,       # Set the height of the plot
    title_x=0.5,
    title_y=0.95
)
fig.show()

While most of the days are assigned to the same cluster by both methods, most of the differences occur in the boundary regions between clusters 1 and 2 and between clusters 2 and 3.
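
The label matching above was done by hand. As an alternative sketch, the labels of one clustering can be aligned with the other automatically by building a contingency table and solving an assignment problem. The label arrays below are made up, and the use of `scipy.optimize.linear_sum_assignment` is a swapped-in technique, not the method used in this notebook.

```python
import numpy as np
import pandas as pd
from scipy.optimize import linear_sum_assignment

# Toy labels: same grouping, but ids 0 and 3 are swapped and one point disagrees
labels_a = np.array([0, 0, 1, 1, 2, 2, 3, 3, 3])
labels_b = np.array([3, 3, 1, 1, 2, 1, 0, 0, 0])

# Rows: labels_b ids, columns: labels_a ids, cells: number of shared points
overlap = pd.crosstab(labels_b, labels_a)

# Pick the one-to-one matching that maximises the total overlap
rows, cols = linear_sum_assignment(-overlap.values)
mapping = {overlap.index[r]: overlap.columns[c] for r, c in zip(rows, cols)}

labels_b_aligned = np.array([mapping[b] for b in labels_b])
n_diff = int((labels_b_aligned != labels_a).sum())
print(mapping, n_diff)
```

After alignment, `n_diff` counts the points that the two clusterings genuinely assign to different groups, independent of how each method happened to number its clusters.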

As with K-means, the data points closest to the cluster centers, which correspond to the representative days of the clusters, are found.

In [None]:
Hcluster_centers = df_pivoted.loc[:,~df_pivoted.columns.isin(["KCluster","DailyAvgTemp"])].groupby("HClustNew").mean().values
df_rep_hclust = pd.DataFrame()
for i in range(0,4):
  dfi = df_pivoted.loc[df_pivoted["HClustNew"]==i].iloc[:,:-3]
  dfi_rep = dfi.iloc[np.argmin(np.linalg.norm(dfi.values - Hcluster_centers[[i]], axis=1))]
  df_rep_hclust = pd.concat([df_rep_hclust,dfi_rep],axis=1)
sorted_columns = sorted(df_rep_hclust.columns, reverse=True)
df_rep_hclust = df_rep_hclust.reindex(columns=sorted_columns) # Sort the column names which is the day of the year in descending order
HClust_rep_days = sorted(np.unique(df.loc[df["day_of_year"].isin(df_rep_hclust.columns)].index.date),reverse=True) # Find the dates of the representative days based on day of year column and sort them in descending order
df_rep_hclust.rename(columns=dict(zip(df_rep_hclust.columns, HClust_rep_days)), inplace=True)
df_rep_hclust = df_rep_hclust[df_rep_hclust.iloc[0].sort_values().index] # Rearrange the df in increasing order of temperature

In [None]:
fig = go.Figure()
custom_colors = ['rgb(214, 39, 40)', 'rgb(255, 127, 14)', 'rgb(44, 160, 44)','rgb(31, 119, 180)' ]


for i,column in enumerate(df_rep_hclust.columns):
    fig.add_trace(go.Scatter(x=df_rep_hclust.index, y=df_rep_hclust[column], mode='markers+lines', name="Hierarchical:" + str(column), line=dict(color=custom_colors[i])))
for i,column in enumerate(df_rep.columns):
    fig.add_trace(go.Scatter(x=df_rep.index, y=df_rep[column], mode='lines+markers', line=dict(dash='dash',color=custom_colors[i]), name="K-Means:"+str(column)))

SDD =  df.loc[df.index.date==pd.to_datetime("2023-07-21").date()]
WDD = df.loc[df.index.date == pd.to_datetime("2023-12-21").date()]
fig.add_trace(go.Scatter(x=df_rep_hclust.index, y=SDD["Dry Bulb Temperature {C}"], mode='lines+markers', name="Summer Design Day",marker=dict(size=10,symbol="star")))
fig.add_trace(go.Scatter(x=df_rep_hclust.index, y=WDD["Dry Bulb Temperature {C}"], mode='lines+markers', name="Winter Design Day",marker=dict(size=10,symbol="star")))

# Update layout for better readability
fig.update_layout(title='Daily variation in outdoor temperature for 4 representative days', xaxis_title='Hour of day', yaxis_title='Temperature (C)')

fig.update_layout(
    margin=dict(l=50, r=50, b=50, t=75),  # Adjust margins to make the plot tighter
    autosize=False,  # Disable autosizing
    width=1200,       # Set the width of the plot
    height=400,       # Set the height of the plot
    title_x=0.5,
    title_y=0.95
)
fig.show()

Both methods select the same hottest representative day, July 16. For the remaining clusters, hierarchical clustering consistently chooses days with lower temperatures than those chosen by K-means. Summer and winter design days are typically taken as July 21 and December 21. The plot above shows that while July 21 is a good estimate of extreme weather, December 21 is not, so a more appropriate winter design day should be chosen in this case.

#### Hierarchical clustering using correlation
Hierarchical clustering depends on the dissimilarity metric used to compare observations and on the linkage used to compare clusters. The method above uses Euclidean distance with complete linkage, in which the distance between two clusters is the distance between their two farthest-apart points. Replacing Euclidean distance with a dissimilarity based on the correlation coefficient may better identify days with similar patterns.

In [None]:
HClust = AgglomerativeClustering(distance_threshold=0, n_clusters=None, linkage='complete', metric='correlation')
hc_comp = HClust.fit(df_pivoted.iloc[:,:-3])
linkage_comp = linkage(df_pivoted.iloc[:,:-3], method='complete', metric='correlation')
cut_labels = cut_tree(linkage_comp, n_clusters=4).flatten()

# Assign the cluster labels back to the DataFrame
df_pivoted['HCluster_Corr'] = cut_labels
# Exclude the other cluster labels and the daily average when computing the cluster centers
Hcluster_centers_corr = df_pivoted.loc[:,~df_pivoted.columns.isin(["KCluster","HClustNew","DailyAvgTemp"])].groupby("HCluster_Corr").mean().values


In [None]:
fig = go.Figure()
colors = ['rgb(31, 119, 180)', 'rgb(255, 127, 14)', 'rgb(44, 160, 44)', 'rgb(214, 39, 40)']

for class_label in {0,1,2,3}:
    df_class = df_pivoted[df_pivoted['HCluster_Corr'] == class_label]
    fig.add_trace(go.Scatter(x=df_class.index, y=df_class['DailyAvgTemp'],
                             mode='markers',
                             marker=dict(size=10,color=colors[class_label]),
                             name=f'Cluster {class_label}'))
fig.update_layout(title='Daily average temperature across the year segregated by clusters <br> identified using correlation as the metric of hierarchical clustering', xaxis_title='Day of the year', yaxis_title='Temperature (C)')

fig.update_layout(
    margin=dict(l=50, r=50, b=50, t=75),  # Adjust margins to make the plot tighter
    autosize=False,  # Disable autosizing
    width=1000,       # Set the width of the plot
    height=400,       # Set the height of the plot
    title_x=0.5,
    title_y=0.95
)
fig.show()

As before, the plot above shows the variation of daily average temperature over the course of the year for the clusters identified using the correlation coefficient. Because all days follow a similar profile in terms of hourly temperature variation, the correlation-based clusters overlap heavily and it is difficult to distinguish four separate groups.
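
The reason correlation struggles here can be shown with two idealised days. In this sketch, which uses made-up sinusoidal profiles rather than the Boulder data, a cold day and a hot day share the same daily shape; the correlation-based dissimilarity (one minus the Pearson correlation, as computed by SciPy's `pdist`) treats them as nearly identical, while Euclidean distance separates them clearly.

```python
import numpy as np
from scipy.spatial.distance import pdist

hours = np.arange(24)
# Idealised diurnal swing that peaks at 15:00 (an assumption for illustration)
shape = np.sin((hours - 9) * np.pi / 12)

winter_day = -5.0 + 4.0 * shape   # cold day, small swing
summer_day = 25.0 + 8.0 * shape   # hot day, larger swing
days = np.vstack([winter_day, summer_day])

d_euclid = pdist(days, metric="euclidean")[0]
d_corr = pdist(days, metric="correlation")[0]   # 1 - Pearson correlation
print(d_euclid, d_corr)
```

Correlation is unaffected by shifting or rescaling a profile, so it discards exactly the temperature level that distinguishes seasons in this dataset.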

In [None]:
df_rep_hclust_corr = pd.DataFrame()
for i in range(0,4):
  dfi = df_pivoted.loc[df_pivoted["HCluster_Corr"]==i].iloc[:,:-4]
  dfi_rep = dfi.iloc[np.argmin(np.linalg.norm(dfi.values - Hcluster_centers_corr[[i]], axis=1))]
  df_rep_hclust_corr = pd.concat([df_rep_hclust_corr,dfi_rep],axis=1)

sorted_columns = sorted(df_rep_hclust_corr.columns, reverse=True)
df_rep_hclust_corr = df_rep_hclust_corr.reindex(columns=sorted_columns) # Sort the column names which is the day of the year in descending order
HClust_corr_rep_days = sorted(np.unique(df.loc[df["day_of_year"].isin(df_rep_hclust_corr.columns)].index.date),reverse=True) # Find the dates of the representative days based on day of year column and sort them in descending order
df_rep_hclust_corr.rename(columns=dict(zip(df_rep_hclust_corr.columns, HClust_corr_rep_days)), inplace=True)
df_rep_hclust_corr = df_rep_hclust_corr[df_rep_hclust_corr.iloc[0].sort_values().index] # Rearrange the df in increasing order of temperature

In [None]:
fig = go.Figure()
custom_colors = ['rgb(214, 39, 40)','rgb(255, 127, 14)', 'rgb(44, 160, 44)', 'rgb(31, 119, 180)']

for i,column in enumerate(df_rep_hclust.columns):
    fig.add_trace(go.Scatter(x=df_rep_hclust.index, y=df_rep_hclust[column], mode='markers+lines', name="Hierarchical:" + str(column), line=dict(color=custom_colors[i])))
for i,column in enumerate(df_rep_hclust_corr.columns):
    fig.add_trace(go.Scatter(x=df_rep_hclust_corr.index, y=df_rep_hclust_corr[column], mode='markers+lines', line=dict(dash='dash',color=custom_colors[i]), name="Hierarchical (Corr):"+str(column)))

fig.update_layout(title='Daily variation in outdoor temperature for 4 representative days found using hierarchical clustering<br>with a distance and correlation based dissimilarity metrics', xaxis_title='Hour of day', yaxis_title='Temperature (C)')

fig.update_layout(
    margin=dict(l=50, r=50, b=50, t=75),  # Adjust margins to make the plot tighter
    autosize=False,  # Disable autosizing
    width=1200,       # Set the width of the plot
    height=400,       # Set the height of the plot
    title_x=0.5,
    title_y=0.95
)
fig.show()

Comparing hierarchical clustering with Euclidean distance as the dissimilarity measure against hierarchical clustering with correlation as the dissimilarity measure, the former is considerably better at identifying distinct temperature profiles. The latter identifies representative days that are all very close to each other and hence not indicative of different groupings. To reinforce this point, the box plots of each cluster can be compared with the box plot of the data from the whole year.

In [None]:
fig = go.Figure()
xlab = [1,2,3,4,5]
fig.add_trace(go.Box(y=df["Dry Bulb Temperature {C}"], name='Whole Year'))
#fig.update_layout(title='Whole Dataset Box Plot')

# Create box plots for each cluster sequentially
colors = ['rgb(31, 119, 180)', 'rgb(255, 127, 14)', 'rgb(44, 160, 44)', 'rgb(214, 39, 40)']

for i in range(4):
    # Keep only the 24 hourly temperature columns so cluster labels and the daily average are not plotted
    cluster_data = df_pivoted[df_pivoted['HCluster_Corr'] == i].iloc[:, :24]
    fig.add_trace(go.Box(y=cluster_data.values.flatten(),name=f"Cluster {i}",marker_color=colors[i]))

fig.update_layout(title='Box plots of hourly temperature data from the whole year as well as from each individual cluster<br> as identified using correlation based hierarchical clustering', yaxis_title='Temperature (C)')
fig.update_layout(
    margin=dict(l=50, r=50, b=50, t=75),  # Adjust margins to make the plot tighter
    autosize=False,  # Disable autosizing
    width=1000,       # Set the width of the plot
    height=400,       # Set the height of the plot
    title_x=0.5,
    title_y=0.95
)
# Show plot
fig.show()

The box plots clearly show that correlation based hierarchical clustering is not good at reducing the variance in the data unlike the methods considered previously.

## Principal Component Analysis
In the above analysis, only dry bulb temperature data was used to identify representative days. However, the weather depends on many factors besides temperature, as indicated by the various fields of the weather data file. There are 30 predictors in total in the original weather data file, so a principal component analysis can now be carried out to find the directions of greatest variability.

In [None]:
df = df.loc[:,~df.columns.isin(["day_of_year","hour"])]
print(df.columns)

Index(['Dry Bulb Temperature {C}', 'Dew Point Temperature {C}',
       'Relative Humidity {%}', 'Atmospheric Pressure {Pa}',
       'Extraterrestrial Horizontal Radiation {Wh/m2}',
       'Extraterrestrial Direct Normal Radiation {Wh/m2}',
       'Horizontal Infrared Radiation Intensity from Sky {Wh/m2}',
       'Global Horizontal Radiation {Wh/m2}',
       'Direct Normal Radiation {Wh/m2}',
       'Diffuse Horizontal Radiation {Wh/m2}',
       'Global Horizontal Illuminance {lux}',
       'Direct Normal Illuminance {lux}',
       'Diffuse Horizontal Illuminance {lux}', 'Zenith Luminance {Cd/m2}',
       'Wind Direction {deg}', 'Wind Speed {m/s}', 'Total Sky Cover {.1}',
       'Opaque Sky Cover {.1}', 'Visibility {km}', 'Ceiling Height {m}',
       'Present Weather Observation', 'Precipitable Water {mm}',
       'Aerosol Optical Depth {.001}', 'Snow Depth {cm}',
       'Days Since Last Snow', 'Albedo {.01}',
       'Liquid Precipitation Depth {mm}',
       'Liquid Precipitation Quanti

Let us look at the standard deviation of the various columns of the weather data.

In [None]:
print(df.std())

Dry Bulb Temperature {C}                                       10.279289
Dew Point Temperature {C}                                       7.393292
Relative Humidity {%}                                          21.765141
Atmospheric Pressure {Pa}                                     571.850298
Extraterrestrial Horizontal Radiation {Wh/m2}                 413.133716
Extraterrestrial Direct Normal Radiation {Wh/m2}              660.697684
Horizontal Infrared Radiation Intensity from Sky {Wh/m2}       49.079165
Global Horizontal Radiation {Wh/m2}                           286.348478
Direct Normal Radiation {Wh/m2}                               352.139673
Diffuse Horizontal Radiation {Wh/m2}                           77.711478
Global Horizontal Illuminance {lux}                         30409.993874
Direct Normal Illuminance {lux}                             34590.554206
Diffuse Horizontal Illuminance {lux}                         9878.988499
Zenith Luminance {Cd/m2}                           

To make the data more uniform, it can be scaled with respect to the respective column mean and standard deviation.

In [None]:
scaler = StandardScaler(with_std=True,with_mean=True)
df_scaled = scaler.fit_transform(df)
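
The scaling step is equivalent to subtracting each column's mean and dividing by its standard deviation. The sketch below checks this on a small made-up array with two columns on very different scales; the numbers are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two columns on very different scales (illustrative values)
X = np.array([[82331.0, -13.0],
              [82267.0, -14.0],
              [82202.0, -15.0],
              [82400.0,   2.0]])

X_scaled = StandardScaler().fit_transform(X)

# Manual z-score; np.std uses the population standard deviation (ddof=0) by default
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(X_scaled, X_manual))
```

Note that `StandardScaler` uses the population standard deviation (`ddof=0`), whereas pandas `DataFrame.std()` defaults to the sample standard deviation (`ddof=1`), so the two differ slightly for the same column.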

The Principal Components of the scaled weather data which has 30 columns and 8760 data points can now be identified.

In [None]:
pca_weather = PCA()
pca_weather.fit(df_scaled)

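To see how the amount of variance explained changes with the number of principal components, the explained variance ratio of each component and its cumulative sum are plotted.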

In [None]:
# Get explained variance ratio
explained_variance_ratio = pca_weather.explained_variance_ratio_

# Calculate cumulative explained variance ratio
cumulative_variance_ratio = explained_variance_ratio.cumsum()
# Plot proportion of variance explained
fig = sp.make_subplots(rows=1, cols=2)


# Create trace for the elbow plot
fig.add_trace(go.Scatter(
    x=list(range(1, len(explained_variance_ratio) + 1)),
    y=explained_variance_ratio,
    mode='lines+markers',
    marker=dict(color='blue'),
    name='Explained<br>Variance Ratio'
), row=1, col=1)

fig.add_trace(go.Scatter(
    x=list(range(1, len(cumulative_variance_ratio) + 1)),
    y=cumulative_variance_ratio,
    mode='lines+markers',
    marker=dict(color='red'),
    name='Cumulative<br>Explained Variance Ratio'
), row=1, col=2)

fig.update_layout(title='Proportion of variance explained by different PCs')
# Update layout for the first subplot
fig.update_xaxes(title_text='Number of Components', row=1, col=1)
fig.update_yaxes(title_text='Explained Variance Ratio', row=1, col=1)

# Update layout for the second subplot
fig.update_xaxes(title_text='Number of Components', row=1, col=2)
fig.update_yaxes(title_text='Cumulative Explained Variance Ratio', row=1, col=2)

fig.update_layout(
    margin=dict(l=50, r=50, b=50, t=75),  # Adjust margins to make the plot tighter
    autosize=False,  # Disable autosizing
    width=1200,       # Set the width of the plot
    height=400,       # Set the height of the plot
    title_x=0.5,
    title_y=0.95
)
fig.show()

The above figure shows two subplots. The first one is the elbow plot which shows how the proportion of variance explained changes with PCs. A clear "elbow" can be noted at the 4th PC. A cumulative distribution of proportion of variance explained by various principal components is shown in the second subplot. To explain at least 95% of the variance we'd need to have at least 12 components.
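
The number of components needed for a given share of variance can also be read off programmatically rather than from the plot. This is a minimal sketch on synthetic data with three underlying factors; the array `X` and the 95% target are assumptions for illustration. It shows two routes: a cumulative sum over `explained_variance_ratio_`, and passing a fraction as `n_components` to scikit-learn's `PCA`.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 10 observed columns driven by 3 latent factors plus a little noise
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.05 * rng.normal(size=(500, 10))

# Route 1: fit all components, then find the first count whose cumulative ratio reaches 0.95
pca_full = PCA().fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
n_needed = int(np.searchsorted(cumulative, 0.95) + 1)

# Route 2: let PCA choose the smallest number of components explaining at least 95%
pca_auto = PCA(n_components=0.95, svd_solver="full").fit(X)

print(n_needed, pca_auto.n_components_)
```

Either route avoids hard-coding the component count, which is useful if the input columns or the variance target change.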

In [None]:
# Create a heatmap for loadings
loadings = pca_weather.components_

loadings_df = pd.DataFrame(loadings.T, columns=[f'PC{i}' for i in range(1, len(loadings) + 1)], index=df.columns)

fig = go.Figure(data=go.Heatmap(z=loadings_df.values, x=loadings_df.columns, y=loadings_df.index, colorscale='Viridis'))

fig.update_layout(title='Principal Component Loadings',
                  xaxis_title='Principal Component',
                  yaxis_title='Original Features')

fig.show()

Now we calculate the Principal Component scores for each of the 8760 hours using the first 12 principal components only.

In [None]:
n_components = 12  # Choose the number of principal components
pca = PCA(n_components=n_components)
pca_result = pca.fit_transform(df_scaled)
df_pca = pd.DataFrame(data=pca_result,index=df.index)

Therefore, we were able to successfully reduce the data with 30 predictors into a small dataset with only 12 predictors that can capture about 95% of the variation in the original dataset. This smaller dimension dataframe can now be used for analysis instead of the large one.

## Conclusion
Here we used weather data from Boulder to identify representative days, so that building energy simulations could be run for four representative days instead of for every day of the year. K-means and hierarchical clustering both appear to be good at identifying such representative days, with the days chosen by K-means having a higher temperature on average. Changing the dissimilarity metric for hierarchical clustering from distance to correlation did not help identify distinct clusters or reduce the variance, and hence it should not be used when clustering weather data. Using all the predictors in the data and carrying out a Principal Component Analysis can reduce the dimension of the data from 30 to 12 while still capturing about 95% of the original variation.