#  Project

**Project Goal:** 

**Dataset Period:**

**Methodology:** 

---

## 1. Business Understanding

### 1.1 Business Objectives
TODO

### 1.2 Project Goals and Success Criteria
TODO


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import math

---
## 2. Data Understanding

### 2.1 Data Collection

The first step is to load the datasets into the working environment. Each CSV file is read into a pandas DataFrame, with the `time` column parsed as a timestamp and then set as the index.

In [None]:
# Load datasets
weather = pd.read_csv("weather.csv", parse_dates=["time"])
energy = pd.read_csv("energy_dataset.csv", parse_dates=["time"])

# Set 'time' as index
weather = weather.set_index("time")
energy = energy.set_index("time")
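
A quick check after loading is whether the index really is a `DatetimeIndex`. If the `time` strings carry UTC offsets that change with daylight saving time (an assumption about these files, not something confirmed here), `parse_dates` can leave the column as plain objects. The sketch below uses a hypothetical helper, `ensure_utc_index`, and made-up sample rows to show one way to normalise such an index to UTC. Note that timestamps without an offset would simply be treated as UTC.

```python
import pandas as pd

def ensure_utc_index(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with a UTC DatetimeIndex, sorted chronologically."""
    out = df.copy()
    # utc=True converts every timestamp to UTC, so mixed offsets are handled
    out.index = pd.to_datetime(out.index, utc=True)
    return out.sort_index()

# Made-up rows around a daylight-saving change (illustration only)
raw = pd.DataFrame(
    {"price": [50.0, 48.0]},
    index=["2015-03-29 03:00:00+02:00", "2015-03-29 01:00:00+01:00"],
)
fixed = ensure_utc_index(raw)
print(type(fixed.index).__name__, fixed.index.tz)
```

Applied to the notebook, this would look like `energy = ensure_utc_index(energy)`, but only if the check shows it is needed.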

### 2.2 Data Description

##### Basic dataset information

In [None]:
energy.info()

The *energy* dataset contains 35064 entries and 24 features, representing hourly records of electricity generation from various sources, total system load, and the day-ahead market price. Each row corresponds to one hour of energy system operation, and the goal is to forecast the electricity price one hour and one day ahead.

In [None]:
energy.describe()

In [None]:
energy.head()

In [None]:
weather.info()

The *weather* dataset also contains 35064 entries and 5 features representing hourly meteorological measurements such as temperature, pressure, humidity, and wind speed. Each row corresponds to one hour of weather conditions, and these variables are used as exogenous inputs to improve electricity price forecasting.

In [None]:
weather.describe()

In [None]:
weather.head()

### 2.3 Data Exploration

#### 2.3.1 Target variable analysis

The target variable `price_day_ahead` is the hourly day-ahead electricity market price, in €/MWh.

In [None]:
plt.figure(figsize=(8,5))
plt.hist(energy['price_day_ahead'], bins=50, edgecolor='black')
plt.title("Histogram of Day-Ahead Electricity Price")
plt.xlabel("Price (€/MWh)")
plt.ylabel("Frequency")
plt.grid(True)
plt.show()

The histogram is roughly bell-shaped with a slight right skew, so the distribution is close to normal but not exactly normal.
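
The visual impression can be backed by numbers. The sketch below computes sample skewness and excess kurtosis with pandas. The helper name `shape_summary` and the synthetic log-normal prices are assumptions for illustration; in the notebook the function would be called on `energy['price_day_ahead']`.

```python
import numpy as np
import pandas as pd

def shape_summary(series: pd.Series) -> dict:
    """Return sample skewness and excess kurtosis, ignoring missing values."""
    clean = series.dropna()
    return {
        "skewness": float(clean.skew()),         # about 0 for a symmetric distribution
        "excess_kurtosis": float(clean.kurt()),  # about 0 for a normal distribution
    }

# Synthetic right-skewed stand-in for the price column (illustration only)
rng = np.random.default_rng(0)
demo_prices = pd.Series(rng.lognormal(mean=4.0, sigma=0.25, size=5000))
summary = shape_summary(demo_prices)
print(summary)
```

A positive skewness supports the "right-skewed" reading; values near zero for both statistics would support the comparison with a normal distribution.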

In [None]:
plt.figure(figsize=(14, 6))
plt.plot(energy['price_day_ahead'])
plt.title("Time Series of Electricity Price (Day-Ahead)")
plt.xlabel("Time")
plt.ylabel("Price (€/MWh)")
plt.grid(True)
plt.show()

The time series plot shows strong short-term fluctuations, clear seasonal patterns, and occasional price spikes. The price level and its variability change over time, which suggests non-stationarity together with volatility and periodic behavior.
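
One informal way to probe these observations is to look at rolling statistics and at the average price per hour of day. The following is a minimal sketch, not a formal stationarity test: the helper names, the 30-day window, and the synthetic series are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

def rolling_profile(series: pd.Series, window: str = "30D") -> pd.DataFrame:
    """Rolling mean and standard deviation over a time-based window."""
    roll = series.rolling(window)
    return pd.DataFrame({"rolling_mean": roll.mean(), "rolling_std": roll.std()})

def hourly_profile(series: pd.Series) -> pd.Series:
    """Average value for each hour of the day (0-23)."""
    return series.groupby(series.index.hour).mean()

# Synthetic hourly series with a daily cycle and a slow upward drift
n_hours = 24 * 90
idx = pd.date_range("2015-01-01", periods=n_hours, freq=pd.Timedelta(hours=1))
hours = np.arange(n_hours)
demo = pd.Series(50 + 0.01 * hours + 10 * np.sin(2 * np.pi * hours / 24), index=idx)

profile = rolling_profile(demo)
by_hour = hourly_profile(demo)
print(profile.tail(1))
print(by_hour.idxmax())
```

A rolling mean or standard deviation that drifts over time points towards non-stationarity, while a pronounced hourly profile points towards daily seasonality.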

#### 2.3.2 Feature distribution analysis

Next, we perform feature distribution analysis to examine how the values are spread across the datasets, using histograms to visualise the distribution of each feature. Box plots are skipped in this step, as they are used specifically for outlier detection in section `2.4.2`.

In [None]:
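def feat_distribution(nonDiscreteFeatures, df):
    """Plot a histogram for each listed column of df on a two-column grid."""
    ncols = 2
    nrows = math.ceil(len(nonDiscreteFeatures) / ncols)
    fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=(16, nrows * 5))
    axes = axes.flatten()

    for i, col in enumerate(nonDiscreteFeatures):
        df[col].hist(bins=30, ax=axes[i])
        axes[i].set_title(f'{col} Distribution')
        axes[i].set_xlabel(col)
        axes[i].set_ylabel('Frequency')

    # Hide the unused subplot when the number of features is odd
    for ax in axes[len(nonDiscreteFeatures):]:
        ax.set_visible(False)

    fig.tight_layout()
    plt.show()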

All of the features are numerical and continuous.

##### Energy dataset

In [None]:
# 'time' is the index, so only the target column needs to be excluded
energy_features = [col for col in energy.columns if col != 'price_day_ahead']
feat_distribution(energy_features, energy)

##### Weather dataset

In [None]:
# 'time' is the index, so every remaining column is a feature
weather_features = list(weather.columns)
feat_distribution(weather_features, weather)

TODO - INTERPRETATION

### 2.4 Data Quality Assessment

#### 2.4.1 Identify missing values
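
A possible starting point for this step is a per-column count of missing values. The sketch below is only an illustration: `missing_summary` is a hypothetical helper and the small `demo` frame is made up. In the notebook it would be applied to `energy` and `weather`.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, worst first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({"missing": counts, "percent": 100 * counts / len(df)})
    return summary.sort_values("missing", ascending=False)

# Made-up frame with two gaps in one column (illustration only)
demo = pd.DataFrame(
    {"load": [1.0, np.nan, 3.0, np.nan], "price": [10.0, 11.0, 12.0, 13.0]}
)
report = missing_summary(demo)
print(report)
```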

#### 2.4.2 Identify outliers

In [None]:
plt.figure(figsize=(6,5))
plt.boxplot(energy['price_day_ahead'].dropna())  # boxplot does not skip NaN values
plt.title("Boxplot of Day-Ahead Electricity Price")
plt.ylabel("Price (€/MWh)")
plt.grid(True)
plt.show()

The boxplot shows many points beyond the whiskers, indicating a large number of potential outliers in the target variable.
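
The points drawn beyond the whiskers follow the usual 1.5 × IQR rule, so they can also be counted directly. The sketch below shows that rule on made-up values. The helper `iqr_outlier_mask` is hypothetical, and the factor `k = 1.5` matches the boxplot default rather than a threshold chosen for this project.

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]; NaN is never flagged."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Made-up prices with one obvious spike (illustration only)
demo = pd.Series([48.0, 50.0, 51.0, 52.0, 53.0, 55.0, 120.0])
mask = iqr_outlier_mask(demo)
print(int(mask.sum()), "value(s) flagged")
```

On the real data, `iqr_outlier_mask(energy['price_day_ahead']).mean()` would give the share of flagged observations, which helps decide whether these are errors or genuine price spikes.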

#### 2.4.3 Check data duplication
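
For hourly data, duplication can occur both as repeated timestamps in the index and as repeated rows of values. The sketch below counts both on a made-up frame. `duplicate_report` is a hypothetical helper, and it assumes `time` has already been set as the index, as in section 2.1.

```python
import pandas as pd

def duplicate_report(df: pd.DataFrame) -> dict:
    """Count repeated index timestamps and repeated rows of column values."""
    return {
        "duplicate_timestamps": int(df.index.duplicated().sum()),
        # DataFrame.duplicated compares column values only, not the index
        "duplicate_rows": int(df.duplicated().sum()),
    }

# Made-up frame in which the 01:00 record appears twice (illustration only)
idx = pd.to_datetime(["2015-01-01 00:00", "2015-01-01 01:00", "2015-01-01 01:00"])
demo = pd.DataFrame({"price": [50.0, 52.0, 52.0]}, index=idx)
result = duplicate_report(demo)
print(result)
```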

---
## 3. Data Preparation

### 3.1 Data Cleaning

#### 3.1.1 Missing values

#### 3.1.2 Duplicates Deletion

#### 3.1.3 Outliers Removal

### 3.2 Bivariate analysis

### 3.3 Data Transformation

### 3.4 Data Splitting

---
## 4. Modelling

### 4.1 Statistical Models

### 4.2 Machine Learning Models

### 4.3 Deep Learning Models