## Day Type Identification of Algerian Electricity Load

**Introduction:**

This Jupyter notebook explores the identification of different day types based on electricity load patterns in an Algerian city. We will analyze a dataset containing hourly recordings of Maximum Power Demand (PMA) and Temperature for two years, from January 1st, 2016, to December 31st, 2017.

**Data Source:**

The dataset used in this analysis is stored in a file named `pma.xlsx`. It contains three columns:

* `time`: Date and time (hourly)
* `pma`: Maximum Power Demand (MW)
* `tmp`: Temperature (°C)

**Software and Tools:**

This project will utilize Python libraries such as:

* `pandas` for data manipulation and analysis
* `numpy` for scientific computing
* `matplotlib` and `seaborn` for data visualization
* `scikit-learn` for machine learning and clustering algorithms

**Let's begin by importing the necessary libraries and reading the data into a Pandas dataframe.**


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics import silhouette_score

In [3]:
# Load the data
df = pd.read_excel("pma.xlsx", skiprows=1)

# Rename columns
df.columns = ["time", "pma", "tmp"]

# Convert time column to datetime
df['time'] = pd.to_datetime(df['time'], format='%Y-%m-%d %H:%M:%S')
# Create a column for date only without time
df['date'] = df['time'].dt.date
# Create a column for hour of the day
df['hour'] = df['time'].dt.hour
# Create a column for day of the week
df['day'] = df['time'].dt.day_name()
# Keep the time column: later cells rely on df.time
# (the original df.drop('time', axis=1) discarded its result, so it had no effect)

df.head()



| | time | pma | tmp | date | hour | day |
|---|---|---|---|---|---|---|
| 0 | 2016-01-01 01:00:00 | 982.529002 | 6.405644 | 2016-01-01 | 1 | Friday |
| 1 | 2016-01-01 02:00:00 | 983.240592 | 5.932445 | 2016-01-01 | 2 | Friday |
| 2 | 2016-01-01 03:00:00 | 1002.780354 | 5.503807 | 2016-01-01 | 3 | Friday |
| 3 | 2016-01-01 04:00:00 | 1011.657004 | 5.112056 | 2016-01-01 | 4 | Friday |
| 4 | 2016-01-01 05:00:00 | 999.13723 | 4.751342 | 2016-01-01 | 5 | Friday |


## Exploratory Data Analysis (EDA)

In [15]:
# resample() only works on a DatetimeIndex, so index by time first
# and keep only the numeric columns before taking weekly means
df_weekly = df.set_index("time")[["pma", "tmp"]].resample("W-SUN").mean()

# Plot time series of weekly PMA and Temperature
plt.figure(figsize=(12, 6))
plt.plot(df_weekly.index, df_weekly["pma"], label="PMA")
plt.plot(df_weekly.index, df_weekly["tmp"], label="Temperature")
plt.legend()
plt.xlabel("Week")
plt.ylabel("Value")
plt.title("Weekly Time Series of PMA and Temperature")
plt.show()

## Exploratory Data Analysis

#### shape of the data

In [None]:
print(f'number of rows: {df.shape[0]}')
print(f'number of columns: {df.shape[1]}')

#### description

In [None]:
df.describe()

#### missing data

In [None]:
total_missing_values = df.isna().sum().sum()
print(f'total of missing values: {total_missing_values}')

#### duplicates

In [None]:
total_duplicates = df.duplicated().sum()
print(f'total of duplicated rows: {total_duplicates}')
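
Beyond missing values and duplicates, an hourly series can also contain gaps, that is, timestamps that never appear at all. The sketch below is one possible check, not part of the original analysis: the helper name `missing_hours` and the toy series are assumptions made for illustration.

```python
import pandas as pd

def missing_hours(timestamps: pd.Series) -> pd.DatetimeIndex:
    """Return the hourly timestamps absent between the first and last observation."""
    ts = pd.to_datetime(timestamps)
    # Every hour that should exist between the first and last reading
    expected = pd.date_range(ts.min(), ts.max(), freq=pd.Timedelta(hours=1))
    return expected.difference(pd.DatetimeIndex(ts))

# Toy series (hypothetical data) with one missing hour at 03:00
toy = pd.Series(pd.to_datetime([
    "2016-01-01 01:00", "2016-01-01 02:00", "2016-01-01 04:00",
]))
gaps = missing_hours(toy)
print(gaps)
```

In the notebook, the same helper could be called as `missing_hours(df['time'])`, assuming the `time` column is still present.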

In [None]:
# spread time into several columns for easier grouping
df['hour'] = df.time.dt.hour
df['day'] = df.time.dt.day
df['month'] = df.time.dt.month
df['year'] = df.time.dt.year
df

### Trends

#### PMA over all time

Here, we look at how the maximum power demand evolves month by month over 2016 and 2017.

In [None]:
# group
year_month_grouped_pma = df.groupby(['year', 'month']).pma.agg(years_monthly_pma='mean')

title_style = {'family':'serif','color':'darkblue','size':18, 'weight':'bold'}
labels_style = {'family':'serif','color':'black','size':15}
def line_plot(data, x, y, title='', xlabel='', ylabel='', rotate_x=0):
    plt.figure(figsize=[12, 6])
    plt.title(title, fontdict=title_style)
    plt.ylabel(ylabel, fontdict=labels_style)
    plt.xlabel(xlabel, fontdict=labels_style)
    plt.xticks(rotation=rotate_x)
    sns.lineplot(data=data, x=x, y=y)
    plt.show()

# adding a column for the combination year-month
year_month_grouped_pma['year_month'] = year_month_grouped_pma.index.map(lambda x: f'{x[0]}' + '-' + f'{x[1]}')
# year_month_grouped_pma

# plotting the data
line_plot(year_month_grouped_pma, 'year_month', 'years_monthly_pma', 'monthly pma between 2016 and 2017', 'year-month', 'pma', 90)

As we can see, maximum power demand is highest during the warm months, roughly from May to October.  
To see this better, let us take the average pma over the two years for each month:

#### PMA over months

In [None]:
monthly_grouped_pma = year_month_grouped_pma.groupby('month').years_monthly_pma.agg(average_monthly_pma='mean')
# monthly_grouped_pma
line_plot(monthly_grouped_pma, 'month', 'average_monthly_pma', 'Average Monthly pma', 'months', 'pma')

Great, the monthly pattern is clear enough now.

#### PMA over days

Now, let us find the hours of the day with most power demand.

In [None]:
hourly_grouped_pma = df.groupby('hour').pma.agg(hourly_pma='mean')
#hourly_grouped_pma
line_plot(hourly_grouped_pma, 'hour', 'hourly_pma', 'Average Hourly PMA During One Day', 'Hour', 'pma')

However, this plot averages over both years. Let us repeat it for each season and compare.

In [None]:
seasons = ['Winter', 'Spring', 'Summer', 'Fall']

def map_season(month, day):
    if (month == 12 and day >= 21) or (month == 1) or (month == 2) or (month == 3 and day < 21):
        return seasons[0]  # Winter
    elif (month == 3 and day >= 21) or (month == 4) or (month == 5) or (month == 6 and day < 21):
        return seasons[1]  # Spring
    elif (month == 6 and day >= 21) or (month == 7) or (month == 8) or (month == 9 and day < 21):
        return seasons[2]  # Summer
    else:
        return seasons[3]  # Fall

# Assign a season to each row based on its month and day of the month
df['season'] = df.apply(lambda row: map_season(row.month, row.day), axis=1)

In [None]:
seasoned_hourly_grouped_pma = df.groupby(['season', 'hour']).pma.agg(seasoned_hourly_pma='mean')
fall_hourly = seasoned_hourly_grouped_pma[seasoned_hourly_grouped_pma.index.get_level_values('season')=='Fall']
winter_hourly = seasoned_hourly_grouped_pma[seasoned_hourly_grouped_pma.index.get_level_values('season')=='Winter']
spring_hourly = seasoned_hourly_grouped_pma[seasoned_hourly_grouped_pma.index.get_level_values('season')=='Spring']
summer_hourly = seasoned_hourly_grouped_pma[seasoned_hourly_grouped_pma.index.get_level_values('season')=='Summer']

plt.figure(figsize=[12, 6])
plt.title('Average Hourly PMA During Each Season', fontdict=title_style)
plt.ylabel('seasoned pma', fontdict=labels_style)
plt.xlabel('hour', fontdict=labels_style)

for season in seasons:
    seasoned_hourly = seasoned_hourly_grouped_pma[seasoned_hourly_grouped_pma.index.get_level_values('season')==season]
    #line_plot(seasoned_hourly, 'hour', 'seasoned_hourly_pma', f'Average Hourly Demand in {season}', 'Hour', 'pma')
    sns.lineplot(data=seasoned_hourly, x='hour', y='seasoned_hourly_pma', label=season)

plt.show()

#### Key notes

* Power demand reaches its highest values during summer. A plausible explanation is that most people are on holiday and therefore stay at home more than during the rest of the year.
* Within a day, PMA peaks at around 8 pm.
* In winter, demand peaks before 8 pm and then starts decreasing. One possible reason is that people tend to go to sleep earlier in winter. In summer, demand peaks after 8 pm and is also high around 1 pm and 4 pm, probably because most people are at home with their air conditioners (AC) on at those times due to the high temperatures outside.
* Demand is low at night in all seasons, as expected, and moderate during the day, when people carry out their activities and daily tasks.
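
The peak hours mentioned above can also be read off numerically instead of from the plot. The following is a minimal sketch under stated assumptions: the helper `peak_hour_by_group` is hypothetical, and the toy frame only mimics the `season`, `hour`, and `pma` columns of `df` with made-up values.

```python
import pandas as pd

def peak_hour_by_group(frame, group_col="season", hour_col="hour", value_col="pma"):
    """Return, for each group, the hour whose mean value is highest."""
    hourly_mean = frame.groupby([group_col, hour_col])[value_col].mean()
    # Rows become groups, columns become hours; idxmax picks the hour column
    return hourly_mean.unstack(hour_col).idxmax(axis=1)

# Toy data (hypothetical values, not taken from pma.xlsx)
toy = pd.DataFrame({
    "season": ["Winter"] * 3 + ["Summer"] * 3,
    "hour": [19, 20, 21, 19, 20, 21],
    "pma": [1000.0, 950.0, 900.0, 1100.0, 1200.0, 1250.0],
})
peaks = peak_hour_by_group(toy)
print(peaks)
```

Applied to the notebook's data, `peak_hour_by_group(df)` would return one peak hour per season, assuming the `season` column has already been created.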

### Relationships

In [None]:
#df
def scatter_plot(data, x, y, title='', xlabel='', ylabel='', rotate_x=0):
    title_style = {'family':'serif','color':'darkblue','size':18, 'weight':'bold'}
    labels_style = {'family':'serif','color':'black','size':15}
    plt.figure(figsize=[12, 6])
    plt.title(title, fontdict=title_style)
    plt.ylabel(ylabel, fontdict=labels_style)
    plt.xlabel(xlabel, fontdict=labels_style)
    plt.xticks(rotation=rotate_x)
    sns.regplot(data=data, x=x, y=y)
    plt.show()
scatter_plot(df, 'tmp', 'pma')

## Clustering

First of all, let us restate the aim of the study: we are trying to identify the day types of the Algerian electricity load. Given the maximum power demand (pma) and temperature for each hour of each day from January 1st, 2016 to December 31st, 2017, we group and reshape the data so that each day becomes one data object with its average pma and average temperature.  
**Note:** be careful when working with dates.

In [None]:
# setting up another column in order to group by it
df['day'] = df.time.dt.day
df['month'] = df.time.dt.month
df['year'] = df.time.dt.year
df['fullday'] = pd.to_datetime(df[['year', 'month', 'day']]).dt.date

In [None]:
# building the new dataframe
df_daily = df.groupby('fullday').agg({'pma': 'mean', 'tmp': 'mean'}).reset_index()
df_daily.set_index('fullday', inplace=True)
df_daily.head()
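
As an alternative to rebuilding a date from `year`, `month`, and `day`, daily means can be computed by resampling on a `DatetimeIndex`, which sidesteps most date-handling pitfalls. This is a sketch, not the notebook's method: the helper `daily_means` and the toy data are assumptions.

```python
import numpy as np
import pandas as pd

def daily_means(frame, time_col="time", value_cols=("pma", "tmp")):
    """Average hourly readings per calendar day using a DatetimeIndex."""
    indexed = frame.set_index(pd.to_datetime(frame[time_col]))
    return indexed[list(value_cols)].resample("D").mean()

# Toy data (hypothetical): two full days of hourly readings
hours = pd.date_range("2016-01-01 00:00", periods=48, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame({"time": hours, "pma": np.arange(48.0), "tmp": 10.0})
toy_daily = daily_means(toy)
print(toy_daily)
```

One difference from `groupby('fullday')`: `resample("D")` also emits a row (filled with NaN) for any calendar day that has no readings, so such days become visible rather than silently absent.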

#### KMeans

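Before clustering, we apply PCA (principal component analysis) to the data. Note that `df_daily` has only two features, so with `n_components=2` PCA rotates the data onto its principal axes without actually reducing its dimensionality.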

In [None]:
pca = PCA(n_components=2)
df_daily_transformed = pca.fit_transform(df_daily)
df_daily_transformed

Then, standardization of the result:

In [None]:
scaler = StandardScaler()
df_daily_transformed = scaler.fit_transform(df_daily_transformed)
df_daily_transformed
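
The cells above apply PCA first and standardize afterwards. A common alternative is to standardize first, because PCA on raw values is dominated by the feature with the largest variance (here demand in MW versus temperature in °C). The sketch below illustrates this effect on synthetic data; the numbers are made up and only stand in for `df_daily`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for daily data (assumed scales, not real measurements):
# demand with a large spread, temperature with a small spread
pma = rng.normal(1000.0, 150.0, size=200)
tmp = rng.normal(20.0, 8.0, size=200)
X = np.column_stack([pma, tmp])

# PCA on raw values versus PCA after standardization
raw_pca = PCA(n_components=2).fit(X)
scaled_pca = make_pipeline(StandardScaler(), PCA(n_components=2)).fit(X)

raw_ratio = raw_pca.explained_variance_ratio_
scaled_ratio = scaled_pca.named_steps["pca"].explained_variance_ratio_
print("raw:   ", raw_ratio)
print("scaled:", scaled_ratio)
```

On raw values, the first component captures almost all of the variance simply because `pma` has the larger scale; after standardization, both features contribute comparably.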

Now, we are going to find the best k number of clusters for K-Means.

##### **Elbow Method**

We cluster the transformed data for several values of k and plot the Sum of Squared Errors (SSE) against k to get a Scree Plot. Based on it, we choose the k at the bend of the curve (the elbow).
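
To make the SSE concrete: it is the sum of squared distances between each point and the center of the cluster it is assigned to, which scikit-learn exposes as `inertia_`. The sketch below checks this on synthetic data; the two blobs are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two synthetic, well-separated blobs (hypothetical data)
X = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.3, size=(50, 2)),
    rng.normal(loc=(3.0, 3.0), scale=0.3, size=(50, 2)),
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Look up the center assigned to each point, then sum the squared distances
assigned_centers = kmeans.cluster_centers_[kmeans.labels_]
manual_sse = float(((X - assigned_centers) ** 2).sum())
print(manual_sse, kmeans.inertia_)
```

The SSE can only decrease as k grows, which is why the elbow method looks for the point where the decrease levels off rather than for a minimum.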

In [None]:
sse = []
k_values = range(1, 11)

for k in k_values:
    kmeans = KMeans(n_clusters=k, n_init=10)
    kmeans.fit(df_daily_transformed)
    sse.append(kmeans.inertia_) # inertia is SSE

plt.figure(figsize=[10, 6])
plt.plot(k_values, sse, 'bx-')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Sum of Squared Errors')
plt.title('Elbow Method To Find Optimal Number Of Clusters')
plt.show()

Thus, the optimal number of clusters based on the Scree Plot is **k=5**.

##### **Silhouette Metric**

The silhouette coefficient (or silhouette score) measures how similar a data point is to its own cluster (cohesion) compared to other clusters (separation). It ranges from -1 to 1, and higher values indicate better-separated clusters.
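
For a single point, the coefficient is `(b - a) / max(a, b)`, where `a` is the mean distance to the other points of its own cluster and `b` is the mean distance to the points of the nearest other cluster. The sketch below computes it by hand on four made-up points and compares the result with scikit-learn's `silhouette_samples`; the helper `silhouette_of_point` is hypothetical and assumes every cluster has at least two points.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Four hypothetical points forming two obvious clusters
X = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]])
labels = np.array([0, 0, 1, 1])

def silhouette_of_point(X, labels, i):
    """Silhouette of point i (assumes its cluster has at least two points)."""
    dists = np.linalg.norm(X - X[i], axis=1)
    same = labels == labels[i]
    same[i] = False  # exclude the point itself from its own cluster
    a = dists[same].mean()  # cohesion
    # separation: smallest mean distance to any other cluster
    b = min(dists[labels == other].mean()
            for other in np.unique(labels) if other != labels[i])
    return (b - a) / max(a, b)

manual = silhouette_of_point(X, labels, 0)
reference = silhouette_samples(X, labels)[0]
print(manual, reference)
```

`silhouette_score`, used in the next cell, is the mean of these per-point values over the whole dataset.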

In [None]:
silhouette = []
k_values = range(2, 12) # silhouette needs at least 2 clusters

for k in k_values:
    kmeans = KMeans(n_clusters=k, n_init=10)
    kmeans.fit(df_daily_transformed)
    cluster_labels = kmeans.labels_
    silhouette.append(silhouette_score(df_daily_transformed, cluster_labels))

plt.figure(figsize=[10, 6])
plt.plot(k_values, silhouette, 'bx-')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Silhouette')
plt.title('Silhouette Metric To Find Optimal Number Of Clusters')
plt.show()

Using the Silhouette metric, the optimal value for **k** is **2**.  
Nevertheless, we choose **k=5**, as deduced from the Elbow method, and we cluster based on it:

In [None]:
k = 5
kmeans = KMeans(n_clusters=k, n_init=10)
cluster = kmeans.fit(df_daily_transformed)
cluster_labels = kmeans.labels_
cluster_centers = kmeans.cluster_centers_

# plotting the clustering results
plt.figure(figsize=[10, 6])
plt.scatter(df_daily_transformed[:, 0], df_daily_transformed[:, 1], c=cluster_labels)
plt.title(f'K-Means Clustering with k={k}')
plt.show()
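
Since the clustering runs on transformed values, the clusters are easier to interpret once the labels are attached back to the daily table and summarized in the original units. The sketch below shows one way to do this on synthetic data; the frame `toy_daily`, the choice of 3 clusters, and the column names `cluster` and `weekday` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for df_daily (hypothetical values)
dates = pd.date_range("2016-01-01", periods=60, freq="D")
toy_daily = pd.DataFrame(
    {"pma": rng.normal(1000.0, 100.0, size=60), "tmp": rng.normal(20.0, 5.0, size=60)},
    index=dates,
)

features = StandardScaler().fit_transform(toy_daily[["pma", "tmp"]])
toy_daily["cluster"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)
toy_daily["weekday"] = toy_daily.index.day_name()

# Cluster profile in original units (MW and °C)
profile = toy_daily.groupby("cluster")[["pma", "tmp"]].agg(["mean", "count"])
# How the days of the week are distributed over the clusters
weekday_table = pd.crosstab(toy_daily["cluster"], toy_daily["weekday"])
print(profile)
print(weekday_table)
```

In the notebook, the equivalent step would assign `cluster_labels` as a new column of `df_daily`, since the rows of `df_daily_transformed` keep the same order as `df_daily`.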

#### DBSCAN

We will now try density-based clustering with DBSCAN, directly on our daily dataframe.

In [None]:
df_daily.plot.scatter(x='tmp', y='pma')

Yet, we ought to scale pma and tmp to be approximately on the same scale. For this, we will use **min-max scaling**.

In [None]:
minmax = MinMaxScaler()

df_daily_scaled = df_daily.copy()
df_daily_scaled[['pma', 'tmp']] = minmax.fit_transform(df_daily[['pma', 'tmp']])

df_daily_scaled.plot.scatter(x='tmp', y='pma')

Now, we cluster using DBSCAN:

In [None]:
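epsilon = 20 # change
min_samples = 6 # change
dbscan = DBSCAN(eps=epsilon, min_samples=min_samples)
# The model is fitted on the unscaled df_daily, so eps is in the original units;
# fitting on df_daily_scaled instead would need a much smaller eps (features lie in [0, 1])
dbscan.fit(df_daily)
labels = dbscan.labels_ # -1 marks noise points
df_daily_scaled['cluster_id'] = labels
print(df_daily_scaled.cluster_id.unique())
df_daily_scaled.plot.scatter(x='tmp', y='pma', c=labels, cmap='viridis')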
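
The values of `epsilon` and `min_samples` above are marked as placeholders. A common heuristic for choosing `eps` is the k-distance curve: sort the distance from each point to its `min_samples`-th nearest neighbor and look for the knee. The sketch below computes that curve on synthetic data; the helper `k_distances` and the toy points are assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def k_distances(X, min_samples):
    """Sorted distance from each point to its min_samples-th nearest neighbor.

    The point itself counts as its own first neighbor, which matches how
    DBSCAN counts min_samples.
    """
    nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
    distances, _ = nn.kneighbors(X)
    return np.sort(distances[:, -1])

rng = np.random.default_rng(0)
# Two dense synthetic groups plus a few scattered points acting as noise
X = np.vstack([
    rng.normal((0.2, 0.2), 0.03, size=(40, 2)),
    rng.normal((0.8, 0.8), 0.03, size=(40, 2)),
    rng.uniform(0.0, 1.0, size=(5, 2)),
])
kd = k_distances(X, min_samples=6)
print(kd[:5], kd[-5:])
```

Plotting `kd` (for example with `plt.plot(kd)`) shows a flat region for points inside dense groups and a sharp rise for isolated points; a value of `eps` near the start of that rise is a reasonable starting point, to be computed on whichever version of the data (scaled or unscaled) DBSCAN is actually fitted on.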

As a result, we deduce five types of days:  
{very high demand day, high demand day, seasonal demand day, low demand day, very low demand day}