In [2]:
import warnings

# Suppress all warnings
warnings.filterwarnings('ignore')

In [3]:
from IPython.display import display, HTML

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

## 4.3 Overview of the explanatory variables

Building a good forecasting model for energy consumption requires knowing what each variable represents and what properties it has. Each variable captures a factor that may contribute to energy consumption, so understanding the variables gives a clearer picture of energy usage patterns. The following section therefore introduces and describes the variables contained in the dataset.

| Variable Name | Data Type | Variable Type | Description | Source | Additional Notes |
|---------------|-----------|---------------|-------------|--------|------------------|
| start_time | datetime | Numerical | Start time of the observation | Gathered from energidataservice.dk | - |
| is_holiday | int | Binary | Indicates if the time is a holiday | Python library *holidays* | 1 for holiday, 0 for non-holiday |
| month | str | Categorical | Month of the observation | Python library *calendar* | - |
| weekday | str | Categorical | Weekday of the observation | Python library *calendar* | - |
| EC_pct_change | float | Numerical | Percent change in electric cars registered | Gathered from DST.dk | - |
| HC_pct_change | float | Numerical | Percent change in plug-in hybrid cars registered | Gathered from DST.dk | - |
| humidity_past1h | float | Numerical | Humidity level in the past hour | Gathered from DMI.dk | - |
| temp_mean_past1h | float | Numerical | Mean temperature in the past hour | Gathered from DMI.dk | - |
| wind_speed_past1h | float | Numerical | Wind speed in the past hour | Gathered from DMI.dk | - |
| EL_price | float | Numerical | Price of electricity | Gathered from DST.dk | - |
| GrossConsumptionMWh | float | Numerical | Sum of energy consumption | Gathered from energidataservice.dk | Dependent variable for analysis |

- **start_time**: Timestamp marking the beginning of the recorded hour.
- **is_holiday**: A binary variable indicating whether the observation falls on a holiday.
- **month & weekday**: Time-related categorical variables aiding in temporal pattern identification.
- **EC_pct_change & HC_pct_change**: Percentage changes in the number of registered electric cars and plug-in hybrid cars, respectively, offering insights into variability and trends in the vehicle fleet.
- **humidity_past1h, temp_mean_past1h, wind_speed_past1h**: Meteorological variables from the previous hour that might influence energy usage patterns.
- **EL_price**: Price of electricity, a potential determinant of consumption behavior.
- **GrossConsumptionMWh**: Energy consumption, serving as the dependent variable for analysis.
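
As a rough illustration of how the derived variables described above could be built, the sketch below adds `month`, `weekday` and `is_holiday` columns and computes a percentage change. It is not the pipeline used for this dataset: the helper `add_calendar_features`, the set `ILLUSTRATIVE_HOLIDAYS` (a hand-written stand-in for the *holidays* library) and the registration counts are all assumptions made for the example.

```python
import calendar
from datetime import date

import pandas as pd

# Hypothetical stand-in for the `holidays` library: one hand-picked date
ILLUSTRATIVE_HOLIDAYS = {date(2020, 1, 1)}


def add_calendar_features(frame):
    """Return a copy of `frame` with month, weekday and is_holiday columns."""
    out = frame.copy()
    ts = out["start_time"]
    # calendar.month_name / calendar.day_name map integers to English names
    out["month"] = ts.dt.month.map(lambda m: calendar.month_name[int(m)])
    out["weekday"] = ts.dt.weekday.map(lambda d: calendar.day_name[int(d)])
    # 1 if the calendar date is in the holiday set, otherwise 0
    out["is_holiday"] = ts.dt.date.isin(ILLUSTRATIVE_HOLIDAYS).astype(int)
    return out


frame = pd.DataFrame(
    {"start_time": pd.to_datetime(["2020-01-01 00:00:00", "2020-01-02 00:00:00"])}
)
features = add_calendar_features(frame)
print(features)

# Percentage change between consecutive observations (made-up counts)
registered_cars = pd.Series([1000, 1010, 1030])
pct_change = registered_cars.pct_change()  # first value is NaN by construction
print(pct_change)
```

Note that `pct_change` returns fractions rather than percentages, so a value of 0.01 corresponds to a 1% increase.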

The foundation of any data analysis lies in understanding the basic statistical properties of the dataset. Summary statistics provide essential information that can guide initial conclusions and point to directions for further analysis. The dataset's summary statistics are therefore provided to give an overview of the variables, each of which plays a role in understanding the energy consumption patterns. The summary statistics of the numerical variables are as follows:

## 4.4 Summary statistics insights of numerical variables

In [4]:
df = pd.read_csv("C:/Users/madsh/OneDrive/Dokumenter/kandidat/Fællesmappe/Forecasting-energy-consumption/Data Cleaning/output_file.csv", encoding="utf-8")
df

Unnamed: 0,HourDK,GrossConsumptionMWh,DayOfWeek,Hour,IsHoliday,humidity_past1h,temp_mean_past1h,wind_speed_past1h,Electric cars,Plug-in hybrid cars,EL_price
0,2020-01-01 00:00:00,3331.347290,Wednesday,0,1,83.916667,0.187500,4.916667,0.016835,0.038498,0.249650
1,2020-01-01 01:00:00,3257.505005,Wednesday,1,1,84.459016,2.295000,5.098246,-0.011556,-0.020534,0.237330
2,2020-01-01 02:00:00,3161.865601,Wednesday,2,1,84.016393,2.451667,5.159649,0.002639,0.008969,0.235830
3,2020-01-01 03:00:00,3074.752442,Wednesday,3,1,82.885246,2.671667,5.585965,0.002639,0.008963,0.233660
4,2020-01-01 04:00:00,3009.696167,Wednesday,4,1,81.754098,2.873333,5.877193,0.002640,0.008957,0.230450
...,...,...,...,...,...,...,...,...,...,...,...
26299,2022-12-31 19:00:00,4453.297648,Saturday,19,0,93.116667,4.655000,4.362500,0.002892,0.001197,0.305240
26300,2022-12-31 20:00:00,4245.987671,Saturday,20,0,93.650000,4.488333,4.130357,0.002893,0.001197,0.193615
26301,2022-12-31 21:00:00,4092.871013,Saturday,21,0,93.850000,4.443333,3.760714,0.002893,0.001198,0.110730
26302,2022-12-31 22:00:00,3918.759766,Saturday,22,0,94.200000,4.376667,3.714286,0.002894,0.001199,0.073920


In [5]:
# Columns of interest for the summary statistics
columns_of_interest = ['Electric cars', 'Plug-in hybrid cars', 'humidity_past1h', 'temp_mean_past1h', 'wind_speed_past1h', 'EL_price', 'GrossConsumptionMWh']

# describe() already returns count, mean, std, min, the quartiles and max,
# so only the row labels need renaming to match the reported table
summary_stats = df[columns_of_interest].describe()
summary_stats.index = ['Count', 'Mean', 'Std', 'Min', '25%', '50%', '75%', 'Max']
summary_stats

Unnamed: 0,Electric cars,Plug-in hybrid cars,humidity_past1h,temp_mean_past1h,wind_speed_past1h,EL_price,GrossConsumptionMWh
Count,26304.0,26304.0,26304.0,26304.0,26304.0,26304.0,26304.0
Mean,0.007992,0.00854,81.429244,8.33172,4.834021,0.816396,4077.927643
Std,0.004053,0.005619,10.658195,6.150023,1.990381,0.904135,771.035124
Min,-0.012073,-0.027599,39.8,-9.755,0.98,-0.365245,2396.63269
25%,0.005086,0.003659,75.193548,3.242623,3.319643,0.219713,3499.425812
50%,0.007145,0.008175,84.360656,8.131967,4.551724,0.465135,4068.527038
75%,0.010764,0.010722,89.606557,13.29502,6.061507,1.107715,4629.730163
Max,0.036147,0.062953,97.934426,26.935593,16.248214,6.47824,6664.007813


Looking at the growth patterns in the table above, the mean changes for Electric cars and Plug-in hybrid cars indicate a steady growth trend, with means of 0.007992 and 0.008540 (roughly 0.8% and 0.85%), respectively. This growth, combined with the low standard deviations (0.004053 for Electric cars and 0.005619 for Plug-in hybrid cars), suggests a consistent, albeit modest, increase in electric and hybrid vehicle registrations. The range between the minimum and maximum values is relatively narrow: Electric cars range from -0.012073 to 0.036147, while Plug-in hybrid cars range from -0.027599 to 0.062953. This indicates that the market expanded at a fairly steady rate during the period of interest, without large spikes or drops. The difference in growth rates between electric and hybrid vehicles may reflect varied consumer preferences or policy incentives targeted at specific vehicle types. The slightly higher growth rate for hybrid vehicles might indicate a transitional preference among consumers, who move from conventional combustion engines towards fully electric options as technology and infrastructure improve.

However, the period shortly after our scope suggests that the growth rate of electric cars came to exceed that of hybrid cars (Elbiler Udgjorde 36 Pct. Af De Nye Biler I 2023, n.d.).

The environmental factors (humidity_past1h, temp_mean_past1h, and wind_speed_past1h) show a high degree of stability, with relatively low standard deviations (10.658195 for humidity_past1h, 6.150023 for temp_mean_past1h, and 1.990381 for wind_speed_past1h). This suggests that the observed period did not have environmental fluctuations large enough to dominate energy consumption patterns. The direct impact of these conditions on energy consumption, especially through the need for heating or cooling, could be more nuanced. For instance, moderate temperatures might reduce the need for extensive heating or cooling, potentially moderating the energy that private consumers use for climate control.

Lastly, the electricity dynamics consist of EL_price and energy consumption (GrossConsumptionMWh). The standard deviation of the electricity price is 0.904135, while that of energy consumption is 771.035124 MWh. Because the two variables are measured in different units, these figures cannot be compared directly: relative to the respective means (0.816396 and 4077.927643), the price fluctuates more than consumption does. Consumption nonetheless varies considerably over time, and this variability could reflect aspects of consumer behavior that are not directly tied to price changes, such as seasonality and temporal patterns. The dataset includes private customers, who likely have clear usage patterns, using minimal electricity while at work and more when at home. The substantial difference between the 75th percentile (4629.730163 MWh) and the maximum value (6664.007813 MWh) for GrossConsumptionMWh points to sporadic high consumption events, possibly due to holidays like Christmas when consumers use more electricity-demanding systems simultaneously (Juleaften: Øerne Bruger Mest Ekstra Strøm, Mens Storbykommuner Bruger Mindst, 2022). This is accounted for with the IsHoliday variable, which helps capture these outliers.
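
Since EL_price and GrossConsumptionMWh are measured on very different scales, one way to put their spread on a common footing is the coefficient of variation (standard deviation divided by the mean). The sketch below is an illustration added for this purpose, not part of the original analysis. It uses only the means and standard deviations from the summary table, and the helper name `coefficient_of_variation` is an assumption.

```python
# Mean and standard deviation copied from the summary table above
stats = {
    "EL_price": {"mean": 0.816396, "std": 0.904135},
    "GrossConsumptionMWh": {"mean": 4077.927643, "std": 771.035124},
}


def coefficient_of_variation(mean, std):
    """Spread relative to the mean; unit-free, so comparable across variables."""
    return std / abs(mean)


cv = {
    name: coefficient_of_variation(values["mean"], values["std"])
    for name, values in stats.items()
}

for name, value in cv.items():
    print(f"{name}: CV = {value:.3f}")
```

With these inputs the coefficient of variation is about 1.11 for EL_price and about 0.19 for GrossConsumptionMWh, so in relative terms the price is the more variable of the two.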

The variable "GrossConsumptionMWh" represents the total energy consumption measured in megawatt-hours. This metric provides an overview of the overall energy demand within the observed period. The mean value of 4077.927643 MWh, with a standard deviation of 771.035124 MWh, indicates that there is considerable variation in energy consumption. The minimum and maximum values of 2396.632690 MWh and 6664.007813 MWh, respectively, further highlight this variability. The 25th percentile (3499.425812 MWh) and 75th percentile (4629.730163 MWh) show that the middle half of the observations lies within this range, suggesting that while there are periods of high consumption, a large share of the data points falls within a more consistent band. This variability in GrossConsumptionMWh can reflect differing energy needs across various times, influenced by factors such as seasonal changes, consumer behavior, and possibly the number of electric and hybrid cars being charged. Understanding this variable is crucial for energy providers and policymakers to ensure a stable supply and to implement strategies for efficient energy distribution and consumption.
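
To make the notion of "periods of high consumption" more concrete, the quartiles from the summary table can be turned into outlier fences. The sketch below applies Tukey's 1.5 × IQR rule, a common convention that is assumed here and is not taken from the original analysis. It uses only the reported quartiles, minimum and maximum.

```python
# Quartiles and extremes of GrossConsumptionMWh from the summary table (MWh)
minimum = 2396.632690
q1 = 3499.425812
q3 = 4629.730163
maximum = 6664.007813

# Interquartile range: the width of the middle half of the observations
iqr = q3 - q1

# Tukey fences: values outside these bounds are flagged as potential outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

has_high_outliers = maximum > upper_fence
has_low_outliers = minimum < lower_fence

print(f"IQR: {iqr:.2f} MWh")
print(f"Fences: [{lower_fence:.2f}, {upper_fence:.2f}] MWh")
print(f"High outliers present: {has_high_outliers}")
print(f"Low outliers present: {has_low_outliers}")
```

Under this rule the maximum lies above the upper fence while the minimum stays inside the lower fence, which is consistent with occasional consumption peaks rather than unusually low readings.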

Furthermore, the table highlights the different scales of the numerical variables, which can hurt the prediction accuracy of many machine learning models. The numerical variables should therefore be brought to a comparable scale before such models are applied. Analyzing the summary statistics also prepares a deeper exploration of the dataset's essential aspects: interrelations among variables, aggregation of data over customer groups, and time-based dimensions. The next chapter is dedicated to examining these correlations, providing insight into the impact of these dimensions.
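
One possible way to bring the numerical variables onto a comparable scale is z-score standardization, sketched below. The choice of method is an assumption, since the text does not state which scaler is used, and the helper name `standardize` is hypothetical. The sample values are the first five rows of `temp_mean_past1h` and `GrossConsumptionMWh` shown in the dataframe output above.

```python
import pandas as pd

# First five observations of two differently scaled columns
sample = pd.DataFrame(
    {
        "temp_mean_past1h": [0.187500, 2.295000, 2.451667, 2.671667, 2.873333],
        "GrossConsumptionMWh": [
            3331.347290,
            3257.505005,
            3161.865601,
            3074.752442,
            3009.696167,
        ],
    }
)


def standardize(frame):
    """Column-wise z-score: subtract the mean, divide by the population std."""
    return (frame - frame.mean()) / frame.std(ddof=0)


scaled = standardize(sample)
print(scaled.round(3))
```

After this transformation each column has mean 0 and standard deviation 1. In a forecasting setup the mean and standard deviation would normally be estimated on the training period only and then reused for later data, to avoid leaking future information into the model.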