# ACCEL - Adversarially Compounding Complexity by Editing Levels

### 1.1 Unsupervised Environment Design (UED)
UED addresses the question: **How can we generate a curriculum of environments (levels) that maximally challenge an agent so that it generalizes well?**

- Levels are parameterized environments (e.g., different maze layouts, obstacle configurations, etc.).
- A teacher (level-generator) creates or selects levels that push the agent (the student).
- A regret-based objective is often used to identify the levels that the agent finds hardest.

### 1.2 Regret & Minimax Regret
- Regret is the difference between the return the optimal policy *could achieve* on a given level vs. what the current agent’s policy achieves.
- Minimax Regret means the agent’s policy minimizes the worst-case regret across all levels. In practice, this encourages the agent to be robust to the most difficult levels.

### 1.3 PLR (Prioritized Level Replay)
PLR is a simpler form of UED that:

- Randomly samples levels from a design space.
- Scores each level by an approximate regret measure (e.g., large “TD-error” or “value loss” = “hard” level).
- Curates a buffer of the highest-scoring (hardest) levels.
- Trains the agent only (or primarily) on those curated levels.


### 1.4 ACCEL (Adversarially Compounding Complexity by Editing Levels) 
ACCEL extends PLR by **evolving**  previously discovered levels rather than always sampling new random ones. Specifically: 
1. **Start**  with a generator that can randomly produce levels.
 
2. **Keep**  a replay buffer $\Lambda$ of highest-regret levels (like PLR).
 
3. **Replay**  from $\Lambda$ (train on those levels).
 
4. **Edit (mutate)**  these replayed levels—small changes that typically make them *more* challenging.
 
5. **Re-evaluate**  mutated levels for regret. If they’re also high-regret, add them to $\Lambda$.
 
6. **Iterate**  this process, thus “compounding complexity” because each level can build on previously discovered “frontier” levels.

**Outcome** : Over time, the environment buffer $\Lambda$ accumulates an increasingly difficult and diverse set of levels, pushing the agent’s capabilities further.

# Example usage of gymnasium with minigrid

In [4]:
import gymnasium as gym
import minigrid

# Create the environment
env = gym.make("MiniGrid-Empty-5x5-v0", render_mode="human")

# Reset the environment to start a new episode
obs, info = env.reset()

# Render the initial state of the environment
env.render()

# Take num_steps random actions in the environment
for _ in range(50):
    # Sample a random action
    action = env.action_space.sample()
    
    # Step through the environment with the chosen action
    obs, reward, terminated, truncated, info = env.step(action)
    
    # Render the environment after each action
    env.render()
    
    # Check if the episode is done
    if terminated or truncated:
        print("Episode finished")
        break

# Close the environment
env.close()


Episode finished


# Example usage of gymnasium with CartPole-v1 and PPO

Model train

In [5]:
import gymnasium as gym
import numpy as np

from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.callbacks import EvalCallback

from stable_baselines3.common.monitor import Monitor

# Create the Gymnasium environment
env_id = "CartPole-v1"
env = gym.make(env_id)

# Optionally, you can create a vectorized environment for parallel training
# This can speed up training by using multiple environments simultaneously
# Here, we create 4 parallel environments
vec_env = make_vec_env(env_id, n_envs=4)

# Initialize the PPO agent
model = PPO(
    "MlpPolicy",          # Multi-layer Perceptron policy
    vec_env,              # Vectorized environment
    verbose=1,            # Verbosity level (0: no output, 1: info)
    tensorboard_log="./ppo_cartpole_tensorboard/"  # Path for TensorBoard logs
)

# Set up an evaluation callback to monitor the agent's performance
eval_env = gym.make(env_id)
eval_callback = EvalCallback(
    eval_env,
    best_model_save_path='./logs/',
    log_path='./logs/',
    eval_freq=500,        # Evaluate the agent every 500 steps
    n_eval_episodes=5,    # Number of episodes for evaluation
    deterministic=True,
    render=False
)

# Train the agent for a total of 100,000 steps, use the eval_callback every 500 steps
model.learn(total_timesteps=100, callback=eval_callback)

# Save the trained model
model.save("ppo_cartpole")

Using cpu device


KeyboardInterrupt: 

Model eval

In [2]:
# Load the environment
env_id = "CartPole-v1"
env = gym.make(env_id, render_mode="human")

# To demonstrate loading the model, you can reload it as follows:
model = PPO.load("ppo_cartpole", env=vec_env)

# Evaluate the trained agent
episodes = 10
for episode in range(1, episodes + 1):
    obs, info = env.reset()
    done = False
    total_reward = 0
    while not done:
        action, _states = model.predict(obs, deterministic=True)
        obs, reward, done, truncated, info = env.step(action)
        total_reward += reward
        # Optionally, render the environment (requires a display)
        # env.render()
    print(f"Episode {episode}: Total Reward = {total_reward}")

# Close the environments
env.close()
eval_env.close()
vec_env.close()


Episode 1: Total Reward = 283.0
Episode 2: Total Reward = 995.0
Episode 3: Total Reward = 397.0
Episode 4: Total Reward = 423.0
Episode 5: Total Reward = 601.0
Episode 6: Total Reward = 124.0
Episode 7: Total Reward = 278.0
Episode 8: Total Reward = 1015.0
Episode 9: Total Reward = 171.0
Episode 10: Total Reward = 132.0


**Input**:  
- Level buffer size $K$, initial fill ratio $\rho$, level generator

**Initialize**:  
1. Initialize policy $\pi(\phi)$
2. Initialize level buffer $\Lambda$
3. Sample $K \cdot \rho$ initial levels to populate $\Lambda$

**while** not converged **do**  
   1. Sample replay decision $d \sim P_D(d)$  
   2. **if** $d = 0$ **then**  
      - Sample level $\theta$ from level generator  
      - Collect $\pi$'s trajectory $\tau$ on $\theta$, with stop-gradient $\phi^\perp$  
      - Compute regret score $S$ for $\theta$ (Equation 5)  
      - Update $\Lambda$ with $\theta$ if score $S$ meets threshold  
   3. **else**  
      - Sample a replay level, $\theta \sim \Lambda$  
      - Collect policy trajectory $\tau$ on $\theta$  
      - Update $\pi$ with rewards $R(\tau)$  
      - Edit $\theta$ to produce $\theta'$  
      - Collect $\pi$'s trajectory $\tau$ on $\theta'$, with stop-gradient $\phi^\perp$  
      - Compute regret score $S'$ for $\theta'$  
      - Update $\Lambda$ with $\theta$ or $\theta'$ if score $S$ or $S'$ meets threshold  
      - (Optionally) Update Editor using score $S$  
   4. **end**  
**end**


# ACCEL CODE

In [None]:
import gymnasium as gym
import minigrid
import numpy as np
from stable_baselines3 import PPO


K   = 100   # Maximum number of levels to store in the replay buffer (levels to train on)
pho = 0.5   # Initial fill rate of the replay buffer (initial ratio of random levels to store)
p   = 0.5   # Probability of requesting a new level to be generated




class LevelGenerator:
    def __init__(self, level_buffer):
        # Fill the replay buffer with K levels generated with the initial fill rate
        for _ in range(int(K * pho)):
            self.generate(level_buffer)

    def generate(self, level_buffer):
        if len(level_buffer) < K:
            level_buffer.append(gym.make("MiniGrid-LavaGapS5-v0", render_mode="human"))
        else:
            print("Replay buffer is full")




def main():

    levels = [] # Replay buffer to store the levels

    # Initialize the level generator
    level_generator = LevelGenerator(levels)

    # Initialize the agent
    model = PPO("MlpPolicy", levels[0], verbose=1)

    # Train the agent
    for _ in range(1000):
        # With probability p, generate a new level and add it to the replay buffer
        if np.random.rand() < p:
            level_generator.generate()
        else:
            # Sample a level from the replay buffer
            level = np.random.choice(levels)
            obs, reward, done, info = level.reset(), 0, False, {}
            while not done:
                action, _ = model.predict(obs, deterministic=True)
                obs, reward, done, info = level.step(action)
                level.render()
            print("Episode finished")


    # Close all the environments in the replay buffer
    for env in level_generator.levels:
        env.close()


if __name__ == "__main__":
    main()

# NEW ACCEL CODE

In [8]:
import numpy as np
import gymnasium as gym
from gymnasium.spaces import Discrete, Box, Text
import gymnasium.spaces as spaces
from minigrid.core.mission import MissionSpace
from minigrid.core.world_object import Goal, Wall
from minigrid.minigrid_env import MiniGridEnv, Grid
from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import BaseCallback


# ====================================================
# 1. Custom MiniGrid Environment that takes a config 
# (https://minigrid.farama.org/content/create_env_tutorial/)
# ====================================================
class MyCustomGrid(MiniGridEnv):
    """
    Simple MiniGrid environment that places random wall tiles
    according to a config dict.
    """

    def __init__(self, config=None, **kwargs):
        """
        config: a dict with fields like:
            {
              "width": 8,
              "height": 8,
              "num_blocks": 5,
              "seed_val": 42,
              "agent_start": (1,1)
            }
        """
        if config is None:
            config = {}
        self.config = config

        width = config.get("width", 8)
        height = config.get("height", 8)
        self.num_blocks = config.get("num_blocks", 5)
        self.custom_seed = config.get("seed_val", None)
        self.agent_start = config.get("agent_start", None)

        # Pass a minimal MissionSpace to the parent for initialization
        super().__init__(
            width=width,
            height=height,
            max_steps=width * height * 2,
            see_through_walls = False,
            agent_view_size = 5,
            mission_space = MissionSpace(mission_func=lambda: "get to the green goal square"),
            **kwargs
        )
        
        # If you'd like to fix the seed for every reset call, you can do:
        # (Otherwise, minigrid will handle seeds via env.reset(seed=...))
        if self.custom_seed is not None:
            self.seed(self.custom_seed)

    def _gen_grid(self, width, height):
        """
        Generate the grid layout for a new episode.
        """
        # 1) Create an empty grid
        self.grid = Grid(width, height)

        # 2) Surround the grid with walls
        self.grid.wall_rect(0, 0, width, height) # Fill the whole grid with walls
        
        # Manual wall placement:
        #for i in range(width):
        #    self.put_obj(Wall(), i, 0)
        #    self.put_obj(Wall(), i, height - 1)
        #for j in range(height):
        #    self.put_obj(Wall(), 0, j)
        #    self.put_obj(Wall(), width - 1, j)

        # 3) Place random num_blocks (walls) inside
        for _ in range(self.num_blocks):
            r = self._rand_int(1, height - 1) # row
            c = self._rand_int(1, width - 1) # column
            # put_obj takes (object, i, j)
            self.put_obj(Wall(), c, r)

        # 4) Place the agent
        if self.agent_start is not None:
            ax, ay = self.agent_start
            # place_agent(...) can put the agent exactly at (ax, ay)
            # by specifying top=(ax, ay) and a size of (1,1)
            self.place_agent(
                top=(ax, ay),
                size=(1, 1),
                rand_dir=False  # Do not randomize direction
            )
        else:
            # Let parent place agent randomly
            self.place_agent()

        # 5) Place a goal object somewhere randomly
        self.place_obj(Goal())

# ====================================================
# 2. Simple “level buffer” 
# # Contains a list of (config, score) tuples
# ====================================================
class LevelBuffer:
    def __init__(self, max_size=50):
        self.max_size = max_size
        # will store tuples of (config_dict, score)
        self.data = []

    def add(self, config, score):
        # store
        self.data.append((config, score))
        # if over capacity, remove lowest-score
        if len(self.data) > self.max_size:
            self.data.sort(key=lambda x: x[1], reverse=True)
            self.data = self.data[: self.max_size]

    def sample_config(self):
        # Weighted sampling by score
        # Higher score level has more chance of being sampled
        if len(self.data) == 0:
            return None
        scores = [item[1] for item in self.data]
        total = sum(scores)
        if total <= 1e-9:
            # fallback to uniform
            idx = np.random.randint(len(self.data))
            return self.data[idx][0]
        probs = [s / total for s in scores]
        idx = np.random.choice(len(self.data), p=probs)
        return self.data[idx][0]


# ====================================================
# 3. Utility: generate random config
# ====================================================
def random_config():
    return {
        "width": np.random.randint(5, 10),
        "height": np.random.randint(5, 10),
        "num_blocks": np.random.randint(0, 15),
        "seed_val": np.random.randint(0, 999999),
        # "agent_start": (1,1)  # optional
    }


# ====================================================
# 4. Utility: mutate an existing config
# ====================================================
def edit_config(old_config):
    new_config = dict(old_config)
    # randomly tweak width/height a tiny bit
    if np.random.rand() < 0.5:
        new_config["width"] = max(5, old_config["width"] + np.random.choice([-1, 1]))
    else:
        new_config["height"] = max(5, old_config["height"] + np.random.choice([-1, 1]))

    # tweak num_blocks
    new_config["num_blocks"] = max(0, old_config["num_blocks"] + np.random.choice([-2, -1, 1, 2]))

    # optionally change seed
    new_config["seed_val"] = np.random.randint(0, 999999)
    return new_config


# ====================================================
# 5. Evaluate “difficulty” or “regret”
#    (Simplified: we do a short rollout with
#    the *current policy* and measure reward or steps)
#    In a real ACCEL, you'd measure "positive value loss."
# ====================================================
def measure_difficulty(config, model, max_steps=200):
    """
    Run 1 episode with the current model. Return a 'difficulty' score.
    Higher means the agent did poorly, so it's "harder."
    """
    env = MyCustomGrid(config)
    obs, _ = env.reset()

    total_reward = 0
    for _ in range(max_steps):
        action, _states = model.predict(obs, deterministic=True)
        obs, reward, done, truncated, info = env.step(action)
        total_reward += reward
        if done or truncated:
            break

    # e.g. define difficulty as 1 - (normalized reward)
    # or just do "steps used," or negative of final reward, etc.
    # We'll do a naive approach:
    difficulty = max(0, (1.0 - (total_reward / 1.0)))
    return difficulty


# ====================================================
# 6. Main demonstration
# ====================================================
def main_accel_demo(total_iterations=30, replay_prob=0.7, train_steps=2000):
    """
    - Create a dummy environment (for stable-baselines model init).
    - Create a PPO model
    - Create a buffer
    - Repeatedly do:
       * with prob (1 - p): sample new config, measure difficulty -> store
       * with prob p: sample from buffer, train some steps, then mutate -> measure -> store
    """

    # 6.1. Create a dummy environment for model initialization
    dummy_env = MyCustomGrid(config={"width": 5, "height": 5, "num_blocks": 1})
    
    # Initialize and render the dummy environment
    dummy_env.reset()     
    dummy_env.render()

    # Initialize PPO with MlpPolicy
    model = PPO("MlpPolicy", dummy_env, verbose=1, n_steps=128, batch_size=64)

    # 6.2. Create the level buffer
    level_buffer = LevelBuffer(max_size=50)

    # Fill with some random configs initially
    for _ in range(10):
        cfg = random_config()
        level_buffer.add(cfg, score=1.0)  # dummy score

    # 6.3. Outer loop
    for iteration in range(total_iterations):
        print(f"\n=== ITERATION {iteration+1}/{total_iterations} ===")
        use_replay = np.random.rand() < replay_prob

        # A) If we do NOT replay from buffer, we just sample a new config
        if (not use_replay) or (len(level_buffer.data) == 0):
            cfg = random_config()
            # measure difficulty with current PPO
            diff_score = measure_difficulty(cfg, model)
            level_buffer.add(cfg, diff_score)
            print(f"  Sampled new config, difficulty={diff_score:.3f}")
        else:
            # B) Replay from buffer: pick a config, do short train, mutate
            old_cfg = level_buffer.sample_config()
            # 1) Train the PPO model on old_cfg
            env = MyCustomGrid(old_cfg)
            model.set_env(env)
            model.learn(total_timesteps=train_steps)  # short training

            # 2) Now mutate config
            new_cfg = edit_config(old_cfg)
            diff_score = measure_difficulty(new_cfg, model)
            level_buffer.add(new_cfg, diff_score)
            print(f"  Replayed + mutated. old_cfg -> new_cfg. difficulty={diff_score:.3f}")

    print("\nDone. Final buffer size:", len(level_buffer.data))
    print("Buffer top-5 hardest levels (config, score):")
    level_buffer.data.sort(key=lambda x: x[1], reverse=True)
    for i, (cfg, sc) in enumerate(level_buffer.data[:5]):
        print(f"{i+1}. score={sc:.3f}, config={cfg}")


if __name__ == "__main__":
    main_accel_demo()


Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


TypeError: 'NoneType' object is not iterable