# 2_clean.ipynb  
### Team: *Team Yunus*  
### Made By: *Yunus Eren Ertas*

This notebook cleans the combined electricity demand dataset created in 
`1_combine.ipynb`.

It performs the following steps:

1. Load combined half-hourly dataset  
2. Clean timestamps and demand values  
3. Reconstruct the full half-hour time grid (fill missing 30-min intervals)  
4. Interpolate missing half-hours  
5. Convert to hourly values  
6. Remove extreme outliers  
7. Save a clean hourly dataset for EDA and modeling  


### Load the combined raw dataset

We load the file produced in the combining notebook.  
This dataset contains half-hourly electricity demand from 2001–2025.


In [31]:
import pandas as pd
from pathlib import Path

df = pd.read_csv("../data/raw/uk_electricity_combined.csv")

df.head()


|   | timestamp | demand_mw |
|---|---|---|
| 0 | 2001-01-01 00:00:00 | 34060 |
| 1 | 2001-01-01 00:30:00 | 35370 |
| 2 | 2001-01-01 01:00:00 | 35680 |
| 3 | 2001-01-01 01:30:00 | 35029 |
| 4 | 2001-01-01 02:00:00 | 34047 |


### Convert timestamp to datetime and sort

We ensure the `timestamp` column is a valid datetime type,  
and then sort the dataset chronologically.


In [32]:
df["timestamp"] = pd.to_datetime(df["timestamp"], errors="coerce")
df = df.dropna(subset=["timestamp"]).sort_values("timestamp").reset_index(drop=True)

df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 435408 entries, 0 to 435407
Data columns (total 2 columns):
 #   Column     Non-Null Count   Dtype         
---  ------     --------------   -----         
 0   timestamp  435408 non-null  datetime64[ns]
 1   demand_mw  435408 non-null  int64         
dtypes: datetime64[ns](1), int64(1)
memory usage: 6.6 MB
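
To make the effect of `errors="coerce"` visible, here is a minimal sketch on toy data. The three rows and the malformed string are invented for illustration and are not taken from the NESO files; the point is only that unparseable timestamps become `NaT` and can be counted before they are dropped.

```python
import pandas as pd

# Toy data (hypothetical); the second timestamp is deliberately malformed.
raw = pd.DataFrame({
    "timestamp": ["2001-01-01 00:30:00", "not a date", "2001-01-01 00:00:00"],
    "demand_mw": [35370, 99999, 34060],
})

# errors="coerce" turns anything that cannot be parsed into NaT instead of raising.
parsed = pd.to_datetime(raw["timestamp"], errors="coerce")
n_bad = int(parsed.isna().sum())

# Same chain as the notebook cell: drop NaT rows, sort, reset the index.
clean = (
    raw.assign(timestamp=parsed)
       .dropna(subset=["timestamp"])
       .sort_values("timestamp")
       .reset_index(drop=True)
)

print(f"dropped {n_bad} unparseable timestamp(s)")
print(clean)
```

Counting the `NaT` values first is a cheap way to notice if a formatting change in a source file silently removes a large block of rows.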


### Clean and validate demand values

- Convert demand to numeric  
- Remove missing values  
- Remove negative or unrealistic values: keep only readings strictly between 5,000 and 100,000 MW  
  (deliberately loose bounds; UK national demand normally ranges between ~15,000 and 60,000 MW)


In [33]:
df["demand_mw"] = pd.to_numeric(df["demand_mw"], errors="coerce")
df = df.dropna(subset=["demand_mw"])

# Basic validity filtering
df = df[(df["demand_mw"] > 5_000) & (df["demand_mw"] < 100_000)]

df.describe()


|   | timestamp | demand_mw |
|---|---|---|
| count | 435397 | 435397.0 |
| mean | 2013-06-02 11:43:17.495618816 | 30258.799243 |
| min | 2001-01-01 00:00:00 | 7654.0 |
| 25% | 2007-03-18 16:30:00 | 24047.0 |
| 50% | 2013-06-02 10:30:00 | 29880.0 |
| 75% | 2019-08-18 08:00:00 | 35992.0 |
| max | 2025-11-01 23:30:00 | 54430.0 |
| std |  | 7674.985551 |
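
As a sketch of how the validity filter behaves, the toy example below applies the same strict bounds and counts how many readings fall outside them. The values and the constant names `LOWER_MW` / `UPPER_MW` are assumptions made for illustration; `inclusive="neither"` assumes pandas 1.3 or newer.

```python
import pandas as pd

# Same loose bounds as the filtering cell (constant names are hypothetical).
LOWER_MW, UPPER_MW = 5_000, 100_000

# Toy readings: two plausible values, plus a negative, a zero and a huge spike.
demand = pd.Series([34060, -1, 0, 35370, 250_000], name="demand_mw")

# inclusive="neither" reproduces the strict > and < comparisons.
in_range = demand.between(LOWER_MW, UPPER_MW, inclusive="neither")
n_removed = int((~in_range).sum())
kept = demand[in_range]

print(f"removed {n_removed} of {len(demand)} readings")
print(kept.tolist())
```

Logging the number of removed readings makes it easier to spot a year in which the bounds discard far more rows than expected.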


### Set timestamp as the index

Time-series operations like resampling and reindexing require 
a datetime index, not a normal column.


In [34]:
df = df.set_index("timestamp").sort_index()

df.head()


| timestamp | demand_mw |
|---|---|
| 2001-01-01 00:00:00 | 34060 |
| 2001-01-01 00:30:00 | 35370 |
| 2001-01-01 01:00:00 | 35680 |
| 2001-01-01 01:30:00 | 35029 |
| 2001-01-01 02:00:00 | 34047 |


### Remove duplicated timestamps

Reindexing requires a unique time index.  
NESO files sometimes contain repeated timestamps,  
so we drop duplicates and keep the first occurrence.


In [35]:
# Ensure index is unique before reindexing
df = df[~df.index.duplicated(keep="first")]
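
The following sketch shows what `keep="first"` does on a small invented example, and how the conflicting rows can be inspected before they are discarded. The timestamps and values are hypothetical.

```python
import pandas as pd

# Toy frame (hypothetical values) where 00:30 appears twice with different readings.
idx = pd.to_datetime([
    "2001-01-01 00:00", "2001-01-01 00:30",
    "2001-01-01 00:30", "2001-01-01 01:00",
])
toy = pd.DataFrame({"demand_mw": [34060, 35370, 35999, 35680]}, index=idx)

# keep=False flags every row that shares its timestamp, useful for inspection.
conflicts = toy[toy.index.duplicated(keep=False)]

# keep="first" flags only the repeats, so negating the mask keeps the first row.
dup_mask = toy.index.duplicated(keep="first")
n_dupes = int(dup_mask.sum())
deduped = toy[~dup_mask]

print(conflicts)
print(f"dropped {n_dupes} duplicate row(s)")
```

If the duplicated rows carry different readings, as in this toy case, keeping the first one is an arbitrary choice; averaging them would be another option.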


### Rebuild the full half-hour time grid

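The raw NESO files occasionally miss individual 30-minute periods.
If one half-hour of an hour is missing, the hourly mean is silently
computed from the remaining reading alone; if both are missing, the
hourly value becomes NaN.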

To fix this:

1. Create a complete half-hourly index  
2. Reindex the dataset to this full timeline  
3. Interpolate missing half-hours (small gaps only)


In [36]:
full_index = pd.date_range(
    start=df.index.min(),
    end=df.index.max(),
    freq="30min"
)

df = df.reindex(full_index)
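
A minimal sketch of the same regridding on toy data, which also lists which half-hours were absent before reindexing. The four readings are taken from the head of the dataset shown earlier, with the 01:00 row removed on purpose for illustration.

```python
import pandas as pd

# Toy half-hourly frame with the 01:00 reading deliberately left out.
idx = pd.to_datetime([
    "2001-01-01 00:00", "2001-01-01 00:30",
    "2001-01-01 01:30", "2001-01-01 02:00",
])
toy = pd.DataFrame({"demand_mw": [34060.0, 35370.0, 35029.0, 34047.0]}, index=idx)

# Complete 30-minute grid between the first and last observed timestamps.
full_index = pd.date_range(toy.index.min(), toy.index.max(), freq="30min")

# Timestamps on the grid that the data does not contain.
missing = full_index.difference(toy.index)

# Reindexing inserts those timestamps as rows filled with NaN.
regridded = toy.reindex(full_index)

print("missing half-hours:", list(missing))
print(regridded)
```

Inspecting `missing` before interpolating shows whether the gaps are isolated half-hours or longer outages.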


### Interpolate small gaps in half-hour data

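We use linear interpolation with `limit=4`, so at most four consecutive 
missing half-hours (two hours) are filled in any gap. Note that pandas 
still fills the first four values of a longer gap rather than skipping it; 
the NaN count of zero below confirms that no longer gap was present. 
This restores the missing half-hour values needed for correct hourly resampling.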


In [37]:
df["demand_mw"] = df["demand_mw"].interpolate(limit=4)

df.isna().sum()


demand_mw    0
dtype: int64
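
To illustrate how `limit=4` treats gaps of different lengths, here is a small sketch on invented data: a one-step gap is filled completely, while a six-step gap is only partly filled. The run-length calculation for finding the longest gap is one possible approach, not something taken from the notebook.

```python
import pandas as pd

nan = float("nan")
idx = pd.date_range("2001-01-01", periods=10, freq="30min")

# Toy series: a 1-step gap at position 1 and a 6-step gap at positions 3..8.
s = pd.Series([0.0, nan, 2.0, nan, nan, nan, nan, nan, nan, 9.0], index=idx)

# Length of each run of consecutive NaNs (non-NaN runs contribute 0).
is_na = s.isna()
run_id = (is_na != is_na.shift()).cumsum()
longest_gap = int(is_na.groupby(run_id).sum().max())

# Linear interpolation, filling at most 4 consecutive NaNs per gap.
filled = s.interpolate(limit=4)
n_left = int(filled.isna().sum())

print("longest gap:", longest_gap)
print("NaNs left after interpolation:", n_left)
print(filled)
```

Checking the longest gap before interpolating makes it clear whether the limit is ever reached on the real data.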

### Convert the clean half-hourly series to hourly

Each hour is computed as the mean of its two half-hour intervals.

Because we reconstructed the full 30-minute grid, no NaNs 
should appear in the hourly series.


In [38]:
hourly = df["demand_mw"].resample("1H").mean()

hourly.head(10)


2001-01-01 00:00:00    34715.0
2001-01-01 01:00:00    35354.5
2001-01-01 02:00:00    34112.0
2001-01-01 03:00:00    32361.0
2001-01-01 04:00:00    29971.5
2001-01-01 05:00:00    28214.0
2001-01-01 06:00:00    27230.5
2001-01-01 07:00:00    25309.0
2001-01-01 08:00:00    24861.0
2001-01-01 09:00:00    26999.0
Freq: H, Name: demand_mw, dtype: float64
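
As a worked check of the hourly mean, the sketch below resamples the first four half-hourly readings shown at the top of the notebook and also counts how many readings feed each hour. It uses `"60min"` as the frequency string, an equivalent spelling chosen here because it is accepted across pandas versions.

```python
import pandas as pd

# First four half-hourly readings from the head of the dataset.
idx = pd.date_range("2001-01-01 00:00", periods=4, freq="30min")
half_hourly = pd.Series([34060, 35370, 35680, 35029], index=idx, name="demand_mw")

resampled = half_hourly.resample("60min")
hourly = resampled.mean()      # (34060 + 35370) / 2 and (35680 + 35029) / 2
per_hour = resampled.count()   # readings contributing to each hourly mean

print(hourly)
print(per_hour)
```

On the full grid, any hour whose count is not 2 would indicate a half-hour that was neither observed nor interpolated.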

### Remove unrealistic spikes

Outliers may come from:
- Faulty NESO readings  
- Missing data  
- Unusual file formatting  

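Hourly values outside **10,000–70,000 MW** are dropped. The summary below 
shows that every hourly value in this run already lies within that range 
(min 12,158 MW, max 54,073 MW).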


In [39]:
hourly = hourly[(hourly > 10_000) & (hourly < 70_000)]

hourly.describe()


count    217704.000000
mean      30258.740200
std        7660.220938
min       12158.000000
25%       24059.500000
50%       29869.500000
75%       35982.000000
max       54073.000000
Name: demand_mw, dtype: float64
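
Filtering with a boolean mask removes the offending rows, which would leave holes in the hourly index if any value were out of range. The sketch below contrasts that with masking via `Series.where`, an alternative (not the notebook's method) that keeps the timestamp and stores NaN instead. The spike value is invented.

```python
import pandas as pd

idx = pd.date_range("2001-01-01", periods=4, freq="60min")

# Toy hourly series with one invented spike at 01:00.
hourly = pd.Series([34715.0, 250000.0, 34112.0, 32361.0], index=idx, name="demand_mw")

in_range = (hourly > 10_000) & (hourly < 70_000)

filtered = hourly[in_range]      # notebook approach: the 01:00 row disappears
masked = hourly.where(in_range)  # alternative: the 01:00 row stays, value is NaN

print(len(filtered), "rows after filtering")
print(len(masked), "rows after masking,", int(masked.isna().sum()), "NaN")
```

Which variant is preferable depends on whether later notebooks expect a gap-free hourly index or a series without NaNs.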

### Save the cleaned hourly dataset

The cleaned dataset is stored in the `data/clean` folder 
as a Parquet file for efficient loading in later notebooks 
(EDA, modeling, deployment).


In [40]:
out_path = Path("../data/clean/uk_electricity_hourly.parquet")
out_path.parent.mkdir(parents=True, exist_ok=True)

hourly.to_frame(name="demand_mw").to_parquet(out_path)

print("✔ Clean hourly dataset saved to:", out_path)


✔ Clean hourly dataset saved to: ..\data\clean\uk_electricity_hourly.parquet
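
Before relying on the saved file, a few structural checks on the hourly series can be useful. The helper below is a hypothetical sketch (the name `check_hourly` and the returned keys are assumptions, not part of the notebook) and is demonstrated on toy data.

```python
import pandas as pd

def check_hourly(series: pd.Series) -> dict:
    """Hypothetical helper: summarize basic structural properties of an hourly series."""
    idx = series.index
    # Every hour we would expect between the first and last timestamp.
    expected = pd.date_range(idx.min(), idx.max(), freq="60min")
    return {
        "sorted": bool(idx.is_monotonic_increasing),
        "unique": bool(idx.is_unique),
        "missing_hours": len(expected.difference(idx)),
        "nan_values": int(series.isna().sum()),
    }

# Toy series with the 02:00 hour absent.
idx = pd.to_datetime(["2001-01-01 00:00", "2001-01-01 01:00", "2001-01-01 03:00"])
toy = pd.Series([34715.0, 35354.5, 32361.0], index=idx, name="demand_mw")

report = check_hourly(toy)
print(report)
```

Running such a check on `hourly` just before `to_parquet` would flag a broken index early, rather than in the EDA notebook.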


# ✔ Cleaning Complete!

The cleaning steps have produced:
- A complete half-hourly timeline  
- Short gaps filled by linear interpolation  
- Clean hourly electricity demand  
- No values outside 10,000–70,000 MW  

The hourly dataset is ready for EDA and modeling.

Proceed to **03-eda.ipynb**.
