In [17]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

# Task 1: Data Exploration

In [18]:
df = pd.read_csv("./electricity-consumption-processed.csv")


### First rows and useful statistics

Displaying the first rows of the dataset.

In [None]:
df.head(10)

Displaying the shape of the dataset in the form (rows, columns).

In [None]:
df.shape

From this we can see that the time series is sampled at 1-hour intervals, giving 24 data points per day. Each row also identifies a substation. The data for substation A is selected as follows:

In [None]:
substation_df = df[df['substation'] == 'A']
substation_df.shape

With 70 128 rows of data for one substation, the dataset spans the following number of days, weeks and years:

In [None]:
points_per_day = 24
dataset_in_days = substation_df.shape[0] / points_per_day

print(f"Days in dataset: {dataset_in_days}")

dataset_in_weeks = dataset_in_days / 7 # 7 days per week

print(f"Weeks in dataset: {dataset_in_weeks}")

dataset_in_years = dataset_in_days / 365.25 # average year length, leap years included

print(f"Years in dataset: {dataset_in_years}")
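
The span can also be read directly from the timestamps instead of being derived from the row count, which keeps the result valid if a substation has several rows per hour (for example one per feeder). The following is a minimal sketch on a toy frame; it assumes the `datetime` column parses with `pd.to_datetime` and that readings are hourly, and the frame `toy` is a hypothetical stand-in for `substation_df`.

```python
import pandas as pd

# Hypothetical stand-in for substation_df: hourly readings over exactly two days.
toy = pd.DataFrame({
    "datetime": pd.date_range("2020-01-01 00:00", periods=48, freq=pd.Timedelta(hours=1)),
    "consumption": range(48),
})

timestamps = pd.to_datetime(toy["datetime"])

# Add one hour so that the last reading's interval is counted as well.
span = timestamps.max() - timestamps.min() + pd.Timedelta(hours=1)

span_days = span / pd.Timedelta(days=1)
span_weeks = span_days / 7
span_years = span_days / 365.25  # average year length, leap years included

print(f"Days: {span_days}, weeks: {span_weeks:.2f}, years: {span_years:.4f}")
```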


Summary statistics of the dataset.

In [None]:
df.describe()

**Identifying missing values**

In [None]:
missing_values = df.isnull().sum()
print(missing_values)

**Identifying unique values**

In [None]:
unique_consumption = df['consumption'].unique()
unique_consumption.size

There are 113 864 unique values in the consumption column.

**Outliers**

We have chosen to use the interquartile range (IQR) because it is better suited to skewed data. That is the case here, as electricity consumption varies from day to day as well as with the season and weather conditions. If the data had been normally distributed, identifying outliers with the z-score would have been more useful.

In [None]:
q1 = df['consumption'].quantile(0.25)
q3 = df['consumption'].quantile(0.75)

iqr = q3 - q1

lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

outliers = df[(df['consumption'] < lower_bound) | (df['consumption'] > upper_bound)]
print(outliers)
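
To illustrate why the choice of rule matters on skewed data, the sketch below applies both the IQR rule and a z-score rule to synthetic log-normal values. The data is generated for illustration only and is not the consumption data; the helper `iqr_bounds` and the z-score threshold of 3 are assumptions of this example.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed data (log-normal), used only for illustration.
rng = np.random.default_rng(0)
skewed = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=10_000))


def iqr_bounds(series, k=1.5):
    """Return (lower, upper) bounds of the IQR rule with multiplier k."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


low, high = iqr_bounds(skewed)
iqr_mask = (skewed < low) | (skewed > high)

# z-score rule: mean and standard deviation are themselves pulled by the long tail.
z = (skewed - skewed.mean()) / skewed.std()
z_mask = z.abs() > 3

print(f"IQR rule flags {iqr_mask.sum()} points, z-score rule flags {z_mask.sum()}")
```

The quartiles depend only on the middle of the distribution, so the IQR bounds are not inflated by the extreme values they are meant to detect, whereas the mean and standard deviation used by the z-score are.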

Of the 1 928 520 entries in total, 17 444 are outliers.

In [None]:
round(outliers.size / df.size * 100, 4)

This accounts for approximately 0.9045 % of the dataset.

# Task 2: Data Cleaning

Displaying where the missing values are located in the dataset.

In [None]:
fig, ax = plt.subplots()

consumption_null = df['consumption'].isnull()
ax.plot(consumption_null)

ax.legend(['consumption'])


We chose to interpolate linearly between the last known value and the next known value, mainly because the missing data points seem to occur at random. Interpolation also reflects the real world: consumption changes with the seasons, but values within a short time period should be more or less correlated.

In [None]:
df['consumption'] = df['consumption'].interpolate(method='linear')
df['consumption']
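
One thing to keep in mind is that interpolating the whole column at once can blend values across series boundaries if rows from different feeders are stacked in the same column. The sketch below shows the difference on a toy frame; that the real data is laid out this way is an assumption, and the frame `toy` is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical layout: two feeders stacked one after the other in the same column.
toy = pd.DataFrame({
    "feeder": ["F1", "F1", "F1", "F2", "F2", "F2"],
    "consumption": [10.0, np.nan, 30.0, np.nan, 200.0, 400.0],
})

# Whole-column interpolation: the first F2 value is filled from the last F1 value.
global_fill = toy["consumption"].interpolate(method="linear")

# Per-feeder interpolation: each series is filled only from its own neighbours.
grouped_fill = toy.groupby("feeder")["consumption"].transform(
    lambda s: s.interpolate(method="linear")
)

print(pd.DataFrame({"global": global_fill, "per_feeder": grouped_fill}))
```

With per-feeder interpolation a gap at the very start of a series stays missing, since there is no earlier value to interpolate from, and would need separate handling.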


# Task 3: Handling Outliers

The plot below shows the consumption series, with the outliers identified by the IQR (interquartile range) rule marked as red circles.

In [None]:
plt.figure(figsize=(10, 6))

plt.plot(df.index, df['consumption'], label='Consumption', color='blue')

plt.scatter(outliers.index, outliers['consumption'], color='red', label='Outliers', zorder=5)

plt.title('Electricity Consumption with Outliers')
plt.xlabel('Row index')  # the x-axis is df.index, not a date column
plt.ylabel('Consumption')
plt.legend()

plt.show()

We chose to cap the data because the number of outliers is rather large (17 444 rows, about 0.9 % of the dataset) and the distribution is skewed. Capping reduces the impact of extreme values without removing the rows entirely. Furthermore, our interest in this dataset is to make predictions and spot general trends. In this case outliers are not particularly interesting, and we therefore find it natural to minimise their effect without completely removing them. Capping the dataset rather than transforming it also makes it easier to explain the data to a potential client.

In [None]:
df['consumption'] = df['consumption'].clip(lower=lower_bound, upper=upper_bound)

plt.figure(figsize=(10, 6))

plt.plot(df.index, df['consumption'], label='Consumption', color='blue')

plt.title('Electricity Consumption when capped')
plt.xlabel('Row index')  # the x-axis is df.index, not a date column
plt.ylabel('Consumption')
plt.legend()

plt.show()

# Task 4: Data Transformation

From the information we have available about this dataset, the substations are not ordinal; they are only unique identifiers for the specific substations. We have therefore chosen to apply one-hot encoding to them. Since the feeders also appear to be identifiers, we have applied one-hot encoding to them as well. This ensures there is no risk of the model assuming false ordinal relationships, which might have occurred if we had used label encoding.

In [None]:
categorical_columns = ['substation', 'feeder']

df_encoded = pd.get_dummies(df, columns=categorical_columns, dtype='int')

print(df_encoded.head())
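
The difference between the two encodings can be seen on a small example. This is a sketch on a hypothetical toy frame with three substations, not on the actual dataset.

```python
import pandas as pd

# Hypothetical toy data with three substation identifiers.
toy = pd.DataFrame({"substation": ["A", "B", "C", "A"]})

# Label encoding: integers imply an order (A < B < C) that the identifiers do not have.
label_encoded = toy["substation"].astype("category").cat.codes

# One-hot encoding: one indicator column per identifier, with no implied order.
one_hot = pd.get_dummies(toy, columns=["substation"], dtype="int")

print(label_encoded.tolist())
print(one_hot)
```

Each row of the one-hot result contains exactly one 1, so no substation is numerically "larger" than another.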

Because the data is skewed and large, and the outliers have already been capped, we opted for min-max feature scaling: its sensitivity to outliers is no longer a concern. Min-max scaling is also a linear transformation, so it preserves the relative distances between the data points.

In [33]:
numerical_columns = ['consumption']

scaler = MinMaxScaler()

df_encoded[numerical_columns] = scaler.fit_transform(df_encoded[numerical_columns])
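
For reference, min-max scaling maps each value x to (x - min) / (max - min), which is what `MinMaxScaler` computes per column with its default `feature_range=(0, 1)`. A minimal sketch of the same arithmetic on made-up values:

```python
import pandas as pd

# Made-up values, used only to show the arithmetic.
values = pd.Series([2.0, 4.0, 6.0, 10.0])

value_min, value_max = values.min(), values.max()

# Forward transform: the smallest value maps to 0 and the largest to 1.
scaled = (values - value_min) / (value_max - value_min)

# Inverse transform: recovers the original units, e.g. for reporting predictions.
restored = scaled * (value_max - value_min) + value_min

print(scaled.tolist())
```

Because the transform is linear, the ordering of the values and the ratios between their differences are unchanged.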

In [None]:
plt.figure(figsize=(10, 6))

plt.plot(df_encoded.index, df_encoded['consumption'], label='Consumption', color='blue')

plt.title('After min-max scaling')
plt.xlabel('Row index')  # the x-axis is df_encoded.index, not a date column
plt.ylabel('Consumption (scaled)')
plt.legend()

plt.show()

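Feature scaling is useful because it puts features on a comparable range, which prevents features with large magnitudes from dominating and typically speeds up convergence for gradient-based algorithms. Distance-based and gradient-based models in particular tend to perform better on scaled features, whereas tree-based models are largely unaffected.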

# Task 5: Data Splitting

In [36]:
# Define features (X) and target (y)
X = df[['datetime', 'substation', 'feeder']]  # Feature columns
y = df['consumption']  # Target column

# Split the data: 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Check the shapes of the training and testing sets
print("Training set size:", X_train.shape)
print("Testing set size:", X_test.shape)

Training set size: (1542816, 3)
Testing set size: (385704, 3)


Splitting data into training and testing sets is important in training machine learning models. When we train a model, we want it to learn underlying patterns from the data that can be applied to new data, not just memorize the training data. This is where the concept of overfitting comes into play.

Overfitting occurs when a model becomes too complex, capturing not only the true patterns in the data but also noise or random fluctuations. This means the model performs very well on the training data but poorly on new data. Essentially, it memorizes the training data instead of generalizing well to other datasets. Overfitting leads to poor performance in real-world scenarios where the model is applied to new data.

By splitting the dataset into a training set and a testing set, we can detect overfitting. The training set is used to fit the model. The testing set, which the model has never seen, stands in for real-world data. After training, the model is evaluated on the testing set, giving an estimate of how well it is likely to perform on new data.

This split lets us evaluate the model’s ability to generalize. If a model performs well on both the training and testing sets, it is likely capturing the true underlying patterns. If it performs well on the training set but poorly on the testing set, overfitting is likely, and we may need to adjust the model by reducing its complexity, using regularization techniques, or collecting more data.
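
For time-series data, one caveat is that `train_test_split` shuffles rows by default, so a random split places observations from the future of the test period in the training set. A common alternative is a chronological split, with the scaling parameters estimated from the training part only. The following is a sketch of that alternative on a hypothetical toy frame, not the approach used above; the 80 % cutoff mirrors the split ratio used in this task.

```python
import pandas as pd

# Hypothetical toy series: ten hourly readings.
toy = pd.DataFrame({
    "datetime": pd.date_range("2020-01-01", periods=10, freq=pd.Timedelta(hours=1)),
    "consumption": [5.0, 7.0, 6.0, 9.0, 8.0, 10.0, 12.0, 11.0, 20.0, 15.0],
})

# Chronological split: the first 80 % of the timeline trains, the last 20 % tests.
toy = toy.sort_values("datetime").reset_index(drop=True)
cutoff = int(len(toy) * 0.8)
train, test = toy.iloc[:cutoff].copy(), toy.iloc[cutoff:].copy()

# Estimate min and max on the training part only, then apply them to both parts.
train_min, train_max = train["consumption"].min(), train["consumption"].max()
train["scaled"] = (train["consumption"] - train_min) / (train_max - train_min)
test["scaled"] = (test["consumption"] - train_min) / (train_max - train_min)

print("Training set size:", train.shape)
print("Testing set size:", test.shape)
```

Scaled test values can fall outside the range 0 to 1 in this setup, which is expected: the test set is meant to behave like unseen data.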