# EPF Toolbox: load data demo

This first notebook fetches market datasets with `epftoolbox.data.read_data`.
- Uses the library defaults (`datasets/` at repo root) and auto-downloads from the Zenodo archive on first run.
- Subsequent runs load the cached CSVs from the same folder.
- Default market shown below: PJM.

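The caching behavior in the list above can be checked before calling `read_data`. The sketch below is a hypothetical helper (`split_cached` is not part of `epftoolbox`) and assumes each market is stored as `<market>.csv` inside the data folder; adjust the file naming if your local copy differs.

```python
from pathlib import Path

MARKETS = ["PJM", "NP", "FR", "BE", "DE"]


def split_cached(data_dir, markets=MARKETS):
    """Return (cached, missing) market names.

    Assumption: one file named `<market>.csv` per market in `data_dir`.
    """
    data_dir = Path(data_dir)
    # A market counts as cached only if its CSV exists as a regular file
    cached = [m for m in markets if (data_dir / f"{m}.csv").is_file()]
    missing = [m for m in markets if m not in cached]
    return cached, missing
```

Calling `split_cached("datasets")` on a fresh checkout should report every market as missing; after the first full run, all five should be cached.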

In [1]:
from pathlib import Path
from epftoolbox.data import read_data

# Resolve repo root even if the notebook is opened from Notebooks/
repo_root = Path.cwd().resolve()
if repo_root.name == "Notebooks":
    repo_root = repo_root.parent

data_path = repo_root / "datasets"  # matches library defaults and examples

## Download/cached load for multiple markets
This cell loops over the five built-in markets. For each one, `read_data` downloads the CSV to `datasets/` if it is missing and then loads it. Reruns skip the download because the CSVs remain on disk.

In [2]:
markets = ["PJM", "NP", "FR", "BE", "DE"]
shapes = {}

for mkt in markets:
    df_tr, df_te = read_data(path=str(data_path), dataset=mkt, years_test=2)
    shapes[mkt] = (df_tr.shape, df_te.shape)

print(f"Data folder: {data_path}")
for mkt, (tr_shape, te_shape) in shapes.items():
    print(f"{mkt}: train {tr_shape}, test {te_shape}")


Data folder: /home/llinfeng/GitRepo/1_Projects/DianLi_电力/Benchmark1-epftoolbox/datasets
PJM: train (34944, 3), test (17472, 3)
NP: train (34944, 3), test (17472, 3)
FR: train (34944, 3), test (17472, 3)
BE: train (34944, 3), test (17472, 3)
DE: train (34944, 3), test (17472, 3)

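The row counts above are consistent with hourly data and a "year" of 364 days (52 whole weeks). The short check below reproduces them under that assumption; the four-year training span is inferred from the printed shapes, not read from the library.

```python
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 364  # assumption: a "year" is 52 whole weeks


def rows_for_years(years):
    """Number of hourly rows in `years` years of 364 days each."""
    return years * DAYS_PER_YEAR * HOURS_PER_DAY


test_rows = rows_for_years(2)   # years_test=2
train_rows = rows_for_years(4)  # remaining span, inferred from the shapes
print(test_rows, train_rows)    # 17472 34944
```

The two values sum to 52416 rows per market, which matches the train and test shapes printed for every market.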

In [3]:
dataset = "PJM"
df_train, df_test = read_data(path=str(data_path), dataset=dataset, years_test=2)

print(f"Data folder: {data_path}")
print(f"Train shape: {df_train.shape} | Test shape: {df_test.shape}")
df_train.head()


Data folder: /home/llinfeng/GitRepo/1_Projects/DianLi_电力/Benchmark1-epftoolbox/datasets
Train shape: (34944, 3) | Test shape: (17472, 3)


Date,Price,Exogenous 1,Exogenous 2
2013-01-01 00:00:00,25.464211,85049.0,11509.0
2013-01-01 01:00:00,23.554578,82128.0,10942.0
2013-01-01 02:00:00,22.122277,80729.0,10639.0
2013-01-01 03:00:00,21.592066,80248.0,10476.0
2013-01-01 04:00:00,21.546501,80850.0,10445.0


In [4]:
# Stack all markets into one long DataFrame for reuse (written to CSV in the next cell)
import pandas as pd

markets = ["PJM", "NP", "FR", "BE", "DE"]
_collector = []
for mkt in markets:
    print(mkt)
    df_tr, df_te = read_data(path=str(data_path), dataset=mkt, years_test=2)
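    # Tag each row with its split before stacking train and test
    df_tr['label'] = 'train'
    df_te['label'] = 'test'
    df = pd.concat([df_tr, df_te])
    df['mkt'] = mkt
    # 'Date' is the index, not a column: move it into a column, then name all columns explicitly
    df = df.reset_index()
    df.columns = ['DateTime', 'Price', 'Exogenous 1', 'Exogenous 2', 'label', 'mkt']
    _collector.append(df)

# ignore_index gives one continuous row index instead of repeating 0..N-1 per market
df_out = pd.concat(_collector, ignore_index=True)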

PJM
NP
FR
BE
DE


In [5]:
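# Write next to the cached CSVs; a relative "datasets/" path breaks when the notebook runs from Notebooks/
df_out.to_csv(data_path / "hourly_data_all_markets.csv", index=False)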

In [6]:
df_out.shape

(262080, 6)

In [7]:
df_out['mkt'].value_counts()

mkt
PJM    52416
NP     52416
FR     52416
BE     52416
DE     52416
Name: count, dtype: int64
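
To reuse the combined file later, one option is to read it back and split it by `mkt` and `label`. The sketch below is a minimal example, assuming the column names written above; `load_market` is a hypothetical helper, and the sample rows are placeholders, not real market data.

```python
import io

import pandas as pd


def load_market(csv_source, market):
    """Return (train, test) frames for one market from the combined CSV."""
    df = pd.read_csv(csv_source, parse_dates=["DateTime"])
    # Keep one market and restore the timestamp as the index
    df = df[df["mkt"] == market].set_index("DateTime")
    train = df[df["label"] == "train"].drop(columns=["label", "mkt"])
    test = df[df["label"] == "test"].drop(columns=["label", "mkt"])
    return train, test


# Placeholder rows with the same columns as the CSV (not real market data)
sample = io.StringIO(
    "DateTime,Price,Exogenous 1,Exogenous 2,label,mkt\n"
    "2013-01-01 00:00:00,1.0,10.0,100.0,train,PJM\n"
    "2013-01-01 01:00:00,2.0,20.0,200.0,test,PJM\n"
    "2013-01-01 00:00:00,3.0,30.0,300.0,train,NP\n"
)
train, test = load_market(sample, "PJM")
print(train.shape, test.shape)  # (1, 3) (1, 3)
```

Passing the path of `hourly_data_all_markets.csv` instead of `sample` should return frames with the same three columns as the `read_data` output.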