<a href="https://colab.research.google.com/github/khaled-sawaid/battery-rl-env/blob/main/BatteryEnv.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Before starting, clone the BatteryEnv GitHub repo and install its dependencies.

In [None]:
# clone the BatteryEnv GitHub repo
!git clone https://github.com/khaled-sawaid/battery-rl-env.git

# move into the repo
%cd battery-rl-env

# install dependencies
!pip install -r requirements.txt

Cloning into 'battery-rl-env'...
remote: Enumerating objects: 92, done.
remote: Counting objects: 100% (92/92), done.
remote: Compressing objects: 100% (74/74), done.
remote: Total 92 (delta 30), reused 68 (delta 14), pack-reused 0 (from 0)
Receiving objects: 100% (92/92), 5.18 MiB | 22.46 MiB/s, done.
Resolving deltas: 100% (30/30), done.
/content/battery-rl-env/battery-rl-env
Collecting stable_baselines3 (from -r requirements.txt (line 4))
  Downloading stable_baselines3-2.7.0-py3-none-any.whl.metadata (4.8 kB)
Collecting supersuit (from -r requirements.txt (line 5))
  Downloading supersuit-3.10.0-py3-none-any.whl.metadata (3.1 kB)
Collecting pettingzoo (from -r requirements.txt (line 6))
  Downloading pettingzoo-1.25.0-py3-none-any.whl.metadata (8.9 kB)
Collecting tinyscaler>=1.2.6 (from supersuit->-r requirements.txt (line 5))
  Downloading tinyscaler-1.2.8-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.2 kB)

# BatteryEnv — Environment Overview

This notebook section documents the custom Gymnasium environment **`BatteryEnv`** used for battery arbitrage with real electricity prices.

---

## What is being learned?

An agent controls a battery that can **charge** when prices are low and **discharge** when prices are high. The goal is to **maximize profit** over an episode.

---

## Observation, Action, Reward

### Observation `obs ∈ ℝ^2` (default)
1. **`battery_pct`** ∈ [0, 1] — state of charge (SoC) as a fraction of max capacity.  
2. **`current_price`** — energy price at the current step (currency/kWh).  
   - Negative prices are allowed (e.g., oversupply markets).

> *(Optional future extension)*: add time features such as `sin(hour*2π/24)` and `cos(hour*2π/24)` to help the agent learn daily cycles. This should be easy to implement with the provided datasets.

### Action
- **Continuous mode (`continuous_action=True`)**:  
  `a ∈ [-1, 1]`  
  - `a >= 0` ⇒ **charge** with power `a * max_charge_rate_kw`  
  - `a < 0` ⇒ **discharge** with power `|a| * max_discharge_rate_kw`  
  - Internally converted to energy per step using `step_hours` (kW × h = kWh).

- **Discrete mode (`continuous_action=False`)**:  
  `a ∈ {0, 1, 2}` mapped to {**discharge**, **hold**, **charge**} at the **max** rates.
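
The mapping above can be sketched in a few lines. This is only an illustration of the description, not the environment's actual code: the helper name `action_to_energy_kwh` and its sign convention (positive for charging, negative for discharging) are assumptions made for this example.

```python
def action_to_energy_kwh(action, *, continuous, max_charge_rate_kw=25.0,
                         max_discharge_rate_kw=25.0, step_hours=1.0):
    """Requested energy for one step in kWh (hypothetical helper).

    Positive values mean charging, negative values mean discharging.
    """
    if continuous:
        a = max(-1.0, min(1.0, float(action)))      # keep a inside [-1, 1]
    else:
        a = {0: -1.0, 1: 0.0, 2: 1.0}[int(action)]  # discharge, hold, charge
    rate_kw = max_charge_rate_kw if a >= 0 else max_discharge_rate_kw
    return a * rate_kw * step_hours                 # kW * h = kWh


print(action_to_energy_kwh(0.4, continuous=True))                   # 10.0
print(action_to_energy_kwh(-1.0, continuous=True, step_hours=0.5))  # -12.5
print(action_to_energy_kwh(1, continuous=False))                    # 0.0
```

Note that this is the *requested* energy only; capacity limits and efficiencies are applied afterwards.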

### Reward (per step)
`reward = revenue_from_energy_sold − cost_of_energy_bought`

- **Charging**: you pay `requested_kWh * price` (after any clipping, see the last bullet). Charging has a **one-way efficiency**: you pay for the input kWh, but only `charge_efficiency` times that amount is stored.  
- **Discharging**: you **sell** `battery_out_kWh * discharge_efficiency` at the current price.  
- The environment enforces **capacity** and **rate** constraints; any infeasible portion of the requested action is clipped. For example, if you request more charge than the battery can store, only the storable amount is charged and you pay only for that.
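
To make the accounting concrete, here is a minimal sketch of one step under the rules above. The function `apply_step` is hypothetical and written for this example; it assumes the request is already within the rate limits and that capacity clipping is applied on the battery side, which may differ in detail from what `BatteryEnv` does internally.

```python
def apply_step(soc_kwh, requested_kwh, price, *, max_capacity_kwh=100.0,
               min_capacity_kwh=0.0, charge_efficiency=0.95,
               discharge_efficiency=0.95):
    """Return (new_soc_kwh, reward) for one step (illustrative only).

    requested_kwh > 0 charges from the grid, requested_kwh < 0 discharges.
    """
    if requested_kwh >= 0:
        # Largest grid purchase that still fits into the battery after losses.
        max_buy = (max_capacity_kwh - soc_kwh) / charge_efficiency
        bought = min(requested_kwh, max_buy)
        return soc_kwh + bought * charge_efficiency, -bought * price
    # Discharging: cannot take out more than is available above the minimum.
    battery_out = min(-requested_kwh, soc_kwh - min_capacity_kwh)
    sold = battery_out * discharge_efficiency
    return soc_kwh - battery_out, sold * price


# Round trip: buy 25 kWh at 0.05, then sell everything that was stored at 0.10.
soc, cost = apply_step(50.0, 25.0, 0.05)
stored = soc - 50.0
soc, revenue = apply_step(soc, -stored, 0.10)
print(f"stored={stored:.2f} kWh, net profit={cost + revenue:.3f}")
```

With both efficiencies at 0.95, the 25 kWh bought become 23.75 kWh stored and 22.5625 kWh sold, so the price has to rise by more than about 11% between charging and discharging for the round trip to break even.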

> **Intuition:** If prices go up later, you profit by charging now and discharging later. If prices are low or negative, charging is encouraged.

---

## Episode Mechanics

- Each episode has `episode_length` steps.  
- Each step advances time by `step_hours` (e.g., `step_hours = 1.0` with `episode_length = 24` gives a 24-hour daily episode).  
- The episode starts at a (possibly random) index into the `price_series`.  
- Termination occurs after `episode_length` steps.  
- `info["cumulative_profit"]` tracks total profit for the episode.
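
As a rough illustration of how an episode window can be drawn from the price series, the sketch below picks a start index so that a full episode always fits. The helper `sample_episode_prices` is hypothetical, and uniform sampling of the start index is an assumption; the environment may choose its start index differently.

```python
import numpy as np


def sample_episode_prices(price_series, episode_length, rng):
    """Pick a random start index and return (start, price window)."""
    last_start = len(price_series) - episode_length
    if last_start < 0:
        raise ValueError("price_series is shorter than episode_length")
    start = int(rng.integers(0, last_start + 1))  # upper bound is exclusive
    return start, price_series[start:start + episode_length]


rng = np.random.default_rng(42)
# Synthetic prices, for illustration only (not real market data).
prices = np.linspace(0.02, 0.12, 100, dtype=np.float32)
start, window = sample_episode_prices(prices, episode_length=24, rng=rng)
print(start, len(window))
```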

---

## Core Parameters (all easily tweakable)

| Parameter | Meaning | Typical Values |
|---|---|---|
| `price_series` | 1-D array of prices (currency/kWh) | **Required** |
| `episode_length` | # of steps per episode | 24, 48 |
| `step_hours` | Hours per step | 0.5, 1.0 |
| `max_capacity_kwh` / `min_capacity_kwh` | Battery energy bounds | 100 / 0 |
| `max_charge_rate_kw` / `max_discharge_rate_kw` | Power limits | 25 / 25 |
| `initial_soc_frac` | Start SoC as fraction of max | 0.5 |
| `charge_efficiency` / `discharge_efficiency` | One-way efficiencies | 0.95 / 0.95 |
| `continuous_action` | Continuous vs. discrete control | `True` / `False` |
| `seed` | RNG seed for reproducibility | e.g., 42 |

**Notes & Constraints**
- `price_series` must be at least `episode_length` long, 1-D, without NaNs.  
- Efficiencies must be in `(0, 1]`.  
- `min_capacity_kwh < max_capacity_kwh`.  
- Negative prices are supported.
- All parameter values can be set when initialising the class.
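
The constraints listed above can be checked before constructing the environment. The function below is a standalone sketch with a hypothetical name (`validate_battery_config`); it mirrors the listed rules but is not taken from the `BatteryEnv` source, whose own checks and error messages may differ.

```python
import numpy as np


def validate_battery_config(price_series, episode_length, min_capacity_kwh,
                            max_capacity_kwh, charge_efficiency,
                            discharge_efficiency):
    """Raise ValueError if a listed constraint is violated; return prices as an array."""
    prices = np.asarray(price_series, dtype=np.float64)
    if prices.ndim != 1:
        raise ValueError("price_series must be 1-D")
    if len(prices) < episode_length:
        raise ValueError("price_series must be at least episode_length long")
    if np.isnan(prices).any():
        raise ValueError("price_series must not contain NaNs")
    for name, eff in (("charge_efficiency", charge_efficiency),
                      ("discharge_efficiency", discharge_efficiency)):
        if not (0.0 < eff <= 1.0):
            raise ValueError(f"{name} must be in (0, 1]")
    if not (min_capacity_kwh < max_capacity_kwh):
        raise ValueError("min_capacity_kwh must be below max_capacity_kwh")
    return prices  # negative prices are deliberately allowed


# Synthetic example input, including one negative price.
checked = validate_battery_config([0.05, -0.01, 0.08] * 8, 24, 0.0, 100.0, 0.95, 0.95)
print(checked.shape)
```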

---

## How to tweak behavior

- **Use different datasets:** just pass a new `price_series` array.  
- **Change horizon:** set `episode_length` and `step_hours` for 24h vs. 48h, etc.  
- **Battery dynamics:** modify capacity, rates, and (dis)charge efficiencies.  
- **Action space:** toggle `continuous_action` to switch between continuous and discrete control.  
- **Reproducibility:** set `seed` at env construction (and in the RL algorithm).

> **Future To-Do:** Add **time-of-day** features (e.g., `sin/cos`) to the observation. This is straightforward with the current dataset and typically improves learning of daily price cycles.

> **Note:** The environment's dynamics are Markovian (the next state depends only on the current state and action),
> but the agent's observation (battery charge and current price) captures only part of that state.
> The agent does not directly observe the time of day or any future price context, so from the agent's perspective the task is a partially observable MDP (POMDP).
> Adding time features (sin, cos) in future work would make the observations more informative.
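
For reference, a cyclical time-of-day encoding could look like the sketch below. Both helper names are hypothetical, and the 4-element observation layout is an assumption for illustration; the current environment returns only `battery_pct` and `current_price`.

```python
import numpy as np


def time_of_day_features(hour):
    """Map an hour in [0, 24) to a (sin, cos) pair on the unit circle."""
    angle = 2.0 * np.pi * (np.asarray(hour, dtype=np.float64) % 24.0) / 24.0
    return np.sin(angle), np.cos(angle)


def augment_observation(battery_pct, current_price, hour):
    """Hypothetical extended observation: [battery_pct, current_price, sin, cos]."""
    sin_h, cos_h = time_of_day_features(hour)
    return np.array([battery_pct, current_price, sin_h, cos_h], dtype=np.float32)


print(augment_observation(0.5, 0.08, hour=6))
```

Using the sin/cos pair instead of the raw hour keeps 23:00 and 00:00 close together in feature space, which a plain integer hour would not.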

---

## Quickstart (minimal code)
(run the code below multiple times to see different results)

In [None]:
import numpy as np
import pandas as pd
from envs.single_agent import BatteryEnv

# Load prices (EUR/MWh -> EUR/kWh)
COL = "Day-ahead Price (EUR/MWh)"
prices_mwh = pd.read_csv("datasets/energy_prices_2024_france.csv", usecols=[COL])[COL]
prices_kwh = (pd.to_numeric(prices_mwh, errors="coerce").dropna() / 1000.0).to_numpy(np.float32)

# Construct environment (continuous control)
env = BatteryEnv(
    price_series=prices_kwh,
    episode_length=24,
    step_hours=1.0,
    max_capacity_kwh=100.0,
    min_capacity_kwh=0.0,
    max_charge_rate_kw=25.0,
    max_discharge_rate_kw=25.0,
    initial_soc_frac=0.5,
    charge_efficiency=0.95,
    discharge_efficiency=0.95,
    continuous_action=True,
    seed=42,
)

# Run a single episode with random actions
obs, info = env.reset()
done = False
ep_return = 0.0
while not done:
    action = env.action_space.sample()   # replace with your policy later
    obs, reward, terminated, truncated, info = env.step(action)
    ep_return += reward
    done = terminated or truncated

print("Episode return:", ep_return, "| Final SoC:", info["battery_pct"], "| Cumulative profit:", info["cumulative_profit"])

Episode return: -1.0912872822189266 | Final SoC: 0.8310473706573248 | Cumulative profit: -1.0912872822189266
