**5. Reinforcement Learning for Decarbonization**

Step 1: RL Problem Definition


State: [current_emissions, remaining_budget], updated after each action.

Actions:

    0: Invest in renewables (high cost, large emission cut)

    1: Improve efficiency (moderate cost, moderate emission cut)

    2: Do nothing (no cost, no emission cut)

Reward: emission_reduction - 0.1 × cost, which rewards CO₂ cuts net of a cost penalty.

Termination: when budget <= 0 or emissions <= 0.
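To make the reward concrete, the per-action values can be worked out from the (cost, reduction) pairs that the environment in Step 2 uses. This is only an illustrative sketch: the names `ACTION_EFFECTS` and `net_reward` are introduced here and are not part of the original notebook.

```python
# Illustrative arithmetic for the reward definition above.
# (cost, emission reduction) pairs are copied from the DecarbEnv class in Step 2.
ACTION_EFFECTS = {
    0: (100, 150),  # renewables
    1: (50, 60),    # efficiency
    2: (0, 0),      # do nothing
}

def net_reward(cost, reduction, cost_weight=0.1):
    """reward = emission_reduction - cost_weight * cost"""
    return reduction - cost_weight * cost

for action, (cost, reduction) in ACTION_EFFECTS.items():
    r = net_reward(cost, reduction)
    # Reward earned per unit of budget spent (0 for the free action).
    per_budget = r / cost if cost else 0.0
    print(f"action {action}: reward={r:.1f}, reward per budget unit={per_budget:.2f}")
```

Under these numbers renewables earn 140 per step (1.40 per budget unit) and efficiency earns 55 per step (1.10 per budget unit), so spending the budget on renewables gives the higher return.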

Step 2: Fixed DecarbEnv Environment

In [4]:
import gymnasium as gym
from gymnasium import spaces
import numpy as np

class DecarbEnv(gym.Env):
    def __init__(self):
        super().__init__()
        self.action_space = spaces.Discrete(3)
        # Observation: [current_emissions, remaining_budget]
        self.observation_space = spaces.Box(low=0, high=10000, shape=(2,), dtype=np.float32)

        # Action effects: (cost, emission reduction)
        self.action_effects = {
            0: (100, 150),  # Renewables
            1: (50, 60),    # Efficiency
            2: (0, 0)       # Do nothing
        }

        # Initialise self.state once all attributes are defined
        self.reset()

    def step(self, action):
        action = int(action) 
        emissions, budget = self.state
        cost, reduction = self.action_effects[action]

        # If action exceeds budget, treat as no-op
        if cost > budget:
            cost, reduction = 0, 0

        emissions = max(emissions - reduction, 0)
        budget = max(budget - cost, 0)
        self.state = [emissions, budget]

        reward = reduction - 0.1 * cost

        # Episode ends when emissions are zero or no budget is left
        terminated = emissions <= 0 or budget <= 0
        truncated = False

        return np.array(self.state, dtype=np.float32), reward, terminated, truncated, {}

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.state = [1000, 500]
        return np.array(self.state, dtype=np.float32), {}
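Because the environment is small and deterministic, its best achievable return can be checked by exhaustive search, which gives a reference point for judging a trained agent. The sketch below mirrors the `step` logic in plain Python, with no Gymnasium dependency. The helper names `transition` and `best_return` are hypothetical, and the search assumes an undiscounted return and skips actions that leave the state unchanged.

```python
from functools import lru_cache

# (cost, emission reduction) pairs copied from DecarbEnv.action_effects
ACTION_EFFECTS = {0: (100, 150), 1: (50, 60), 2: (0, 0)}

def transition(emissions, budget, action):
    """Pure-Python mirror of DecarbEnv.step.

    Returns (emissions, budget, reward, terminated).
    """
    cost, reduction = ACTION_EFFECTS[action]
    if cost > budget:  # unaffordable actions are treated as no-ops
        cost, reduction = 0, 0
    emissions = max(emissions - reduction, 0)
    budget = max(budget - cost, 0)
    reward = reduction - 0.1 * cost
    terminated = emissions <= 0 or budget <= 0
    return emissions, budget, reward, terminated

@lru_cache(maxsize=None)
def best_return(emissions, budget):
    """Maximum undiscounted return reachable from a state."""
    best = 0.0
    for action in ACTION_EFFECTS:
        e, b, r, done = transition(emissions, budget, action)
        if (e, b) == (emissions, budget):
            continue  # no-op: zero reward and same state, cannot improve the return
        total = r if done else r + best_return(e, b)
        best = max(best, total)
    return best

print(best_return(1000, 500))  # 700.0 (five renewable investments at 140 each)
```

Note that `transition(1000, 500, 2)` returns the same state with `terminated=False`, so a policy that always does nothing never ends an episode on its own; any rollout loop over this environment needs its own step limit.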


Step 3: PPO Training Script (Stable-Baselines3)

In [1]:
%pip install "stable-baselines3[extra]" gymnasium


Collecting stable-baselines3[extra]
  Downloading stable_baselines3-2.6.0-py3-none-any.whl.metadata (4.8 kB)
Collecting gymnasium
  Downloading gymnasium-1.1.1-py3-none-any.whl.metadata (9.4 kB)
Collecting pygame (from stable-baselines3[extra])
  Downloading pygame-2.6.1-cp311-cp311-win_amd64.whl.metadata (13 kB)
Collecting ale-py>=0.9.0 (from stable-baselines3[e



In [5]:
from stable_baselines3 import PPO
from stable_baselines3.common.env_checker import check_env
from stable_baselines3.common.env_util import make_vec_env

# Validate the environment against the Gymnasium API
env = DecarbEnv()
check_env(env)

# Wrap in a vectorized env
vec_env = make_vec_env(DecarbEnv, n_envs=1)

# Train PPO agent
model = PPO("MlpPolicy", vec_env, verbose=1)
model.learn(total_timesteps=10000)

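# Evaluate one deterministic episode
obs, _ = env.reset()
done = False
total_reward = 0.0
steps = 0
max_eval_steps = 100  # safety cap: "do nothing" never ends the episode by itself
while not done and steps < max_eval_steps:
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated
    total_reward += reward
    steps += 1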

print(f"\n✅ Final Emissions: {obs[0]:.2f}, Remaining Budget: {obs[1]:.2f}")
print(f"🧾 Total Reward Collected: {total_reward:.2f}")


Using cpu device
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 11.6     |
|    ep_rew_mean     | 644      |
| time/              |          |
|    fps             | 1459     |
|    iterations      | 1        |
|    time_elapsed    | 1        |
|    total_timesteps | 2048     |
---------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 9.67        |
|    ep_rew_mean          | 642         |
| time/                   |             |
|    fps                  | 912         |
|    iterations           | 2           |
|    time_elapsed         | 4           |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.021810021 |
|    clip_fraction        | 0.24        |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.08       |
|    explained_variance   | -0.000251   |
|    learning
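The first logged iteration is collected before any policy update, so its `ep_rew_mean` of 644 reflects a policy that is still near its initial, roughly random behaviour. For comparison, a uniformly random policy can be simulated without any RL library. The sketch below is a hypothetical baseline: the name `random_episode`, the seed, and the episode count are arbitrary choices, and it reimplements the environment's transition rule in plain Python.

```python
import random

# (cost, emission reduction) pairs copied from DecarbEnv.action_effects
ACTION_EFFECTS = {0: (100, 150), 1: (50, 60), 2: (0, 0)}

def random_episode(rng, emissions=1000, budget=500, max_steps=1000):
    """Play one episode with uniformly random actions.

    Returns (total_reward, steps). max_steps guards against long runs
    of the non-terminating "do nothing" action.
    """
    total, steps = 0.0, 0
    while emissions > 0 and budget > 0 and steps < max_steps:
        cost, reduction = ACTION_EFFECTS[rng.randrange(3)]
        if cost > budget:  # unaffordable actions are treated as no-ops
            cost, reduction = 0, 0
        emissions = max(emissions - reduction, 0)
        budget -= cost
        total += reduction - 0.1 * cost
        steps += 1
    return total, steps

rng = random.Random(0)  # fixed seed so the baseline is repeatable
results = [random_episode(rng) for _ in range(2000)]
mean_return = sum(r for r, _ in results) / len(results)
mean_length = sum(s for _, s in results) / len(results)
print(f"random policy: mean return {mean_return:.1f}, mean length {mean_length:.1f}")
```

With a starting budget of 500, every episode that spends the full budget returns between 550 (efficiency only) and 700 (renewables only). The gap between a random and a best-case policy is therefore narrow, which is worth keeping in mind when reading the `ep_rew_mean` column.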

Summary: In this RL setup, the agent chooses actions that trade off emission reductions against cost. The environment encodes that trade-off directly: renewables cut the most emissions but cost the most, efficiency is moderate on both counts, and doing nothing costs nothing but earns no reward. Training a PPO agent on this environment lets it learn a decarbonization policy from the reward signal alone; the 10,000-timestep run shown here is a small demonstration and does not by itself establish that the learned policy is optimal. More realistic versions could add carbon pricing, uncertainty, and long-term planning.