# ACCEL - Adversarially Compounding Complexity by Editing Levels

### 1.1 Unsupervised Environment Design (UED)
UED addresses the question: **How can we generate a curriculum of environments (levels) that maximally challenge an agent so that it generalizes well?**

- Levels are parameterized environments (e.g., different maze layouts, obstacle configurations, etc.).
- A teacher (level-generator) creates or selects levels that push the agent (the student).
- A regret-based objective is often used to identify the levels that the agent finds hardest.

### 1.2 Regret & Minimax Regret
- Regret is the difference between the return the optimal policy *could achieve* on a given level vs. what the current agent’s policy achieves.
- Minimax Regret means the agent’s policy minimizes the worst-case regret across all levels. In practice, this encourages the agent to be robust to the most difficult levels.

### 1.3 PLR (Prioritized Level Replay)
PLR is a simpler form of UED that:

- Randomly samples levels from a design space.
- Scores each level by an approximate regret measure (e.g., large “TD-error” or “value loss” = “hard” level).
- Curates a buffer of the highest-scoring (hardest) levels.
- Trains the agent only (or primarily) on those curated levels.


### 1.4 ACCEL (Adversarially Compounding Complexity by Editing Levels) 
ACCEL extends PLR by **evolving**  previously discovered levels rather than always sampling new random ones. Specifically: 
1. **Start**  with a generator that can randomly produce levels.
 
2. **Keep**  a replay buffer $\Lambda$ of highest-regret levels (like PLR).
 
3. **Replay**  from $\Lambda$ (train on those levels).
 
4. **Edit (mutate)**  these replayed levels—small changes that typically make them *more* challenging.
 
5. **Re-evaluate**  mutated levels for regret. If they’re also high-regret, add them to $\Lambda$.
 
6. **Iterate**  this process, thus “compounding complexity” because each level can build on previously discovered “frontier” levels.

**Outcome** : Over time, the environment buffer $\Lambda$ accumulates an increasingly difficult and diverse set of levels, pushing the agent’s capabilities further.

# Example usage of gymnasium with minigrid

In [4]:
import gymnasium as gym
import minigrid

# Create the environment
env = gym.make("MiniGrid-Empty-5x5-v0", render_mode="human")

# Reset the environment to start a new episode
obs, info = env.reset()

# Render the initial state of the environment
env.render()

# Take num_steps random actions in the environment
for _ in range(50):
    # Sample a random action
    action = env.action_space.sample()
    
    # Step through the environment with the chosen action
    obs, reward, terminated, truncated, info = env.step(action)
    
    # Render the environment after each action
    env.render()
    
    # Check if the episode is done
    if terminated or truncated:
        print("Episode finished")
        break

# Close the environment
env.close()


Episode finished


# Example usage of gymnasium with CartPole-v1 and PPO

Model train

In [1]:
import gymnasium as gym
import numpy as np

from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.callbacks import EvalCallback

from stable_baselines3.common.monitor import Monitor

# Create the Gymnasium environment
env_id = "CartPole-v1"
env = gym.make(env_id)

# Optionally, you can create a vectorized environment for parallel training
# This can speed up training by using multiple environments simultaneously
# Here, we create 4 parallel environments
vec_env = make_vec_env(env_id, n_envs=4)

# Initialize the PPO agent
model = PPO(
    "MlpPolicy",          # Multi-layer Perceptron policy
    vec_env,              # Vectorized environment
    verbose=1,            # Verbosity level (0: no output, 1: info)
    tensorboard_log="./ppo_cartpole_tensorboard/"  # Path for TensorBoard logs
)

# Set up an evaluation callback to monitor the agent's performance
eval_env = gym.make(env_id)
eval_callback = EvalCallback(
    eval_env,
    best_model_save_path='./logs/',
    log_path='./logs/',
    eval_freq=500,        # Evaluate the agent every 500 steps
    n_eval_episodes=5,    # Number of episodes for evaluation
    deterministic=True,
    render=False
)

# Train the agent for a total of 100,000 steps, use the eval_callback every 500 steps
model.learn(total_timesteps=100, callback=eval_callback)

# Save the trained model
model.save("ppo_cartpole")

Using cpu device
Logging to ./ppo_cartpole_tensorboard/PPO_4




Eval num_timesteps=2000, episode_reward=112.60 +/- 13.18
Episode length: 112.60 +/- 13.18
---------------------------------
| eval/              |          |
|    mean_ep_length  | 113      |
|    mean_reward     | 113      |
| time/              |          |
|    total_timesteps | 2000     |
---------------------------------
New best mean reward!
Eval num_timesteps=4000, episode_reward=126.20 +/- 37.93
Episode length: 126.20 +/- 37.93
---------------------------------
| eval/              |          |
|    mean_ep_length  | 126      |
|    mean_reward     | 126      |
| time/              |          |
|    total_timesteps | 4000     |
---------------------------------
New best mean reward!
Eval num_timesteps=6000, episode_reward=141.40 +/- 35.57
Episode length: 141.40 +/- 35.57
---------------------------------
| eval/              |          |
|    mean_ep_length  | 141      |
|    mean_reward     | 141      |
| time/              |          |
|    total_timesteps | 6000     |
------

Model eval

In [2]:
# Load the environment
env_id = "CartPole-v1"
env = gym.make(env_id, render_mode="human")

# To demonstrate loading the model, you can reload it as follows:
model = PPO.load("ppo_cartpole", env=vec_env)

# Evaluate the trained agent
episodes = 10
for episode in range(1, episodes + 1):
    obs, info = env.reset()
    done = False
    total_reward = 0
    while not done:
        action, _states = model.predict(obs, deterministic=True)
        obs, reward, done, truncated, info = env.step(action)
        total_reward += reward
        # Optionally, render the environment (requires a display)
        # env.render()
    print(f"Episode {episode}: Total Reward = {total_reward}")

# Close the environments
env.close()
eval_env.close()
vec_env.close()


Episode 1: Total Reward = 283.0
Episode 2: Total Reward = 995.0
Episode 3: Total Reward = 397.0
Episode 4: Total Reward = 423.0
Episode 5: Total Reward = 601.0
Episode 6: Total Reward = 124.0
Episode 7: Total Reward = 278.0
Episode 8: Total Reward = 1015.0
Episode 9: Total Reward = 171.0
Episode 10: Total Reward = 132.0


**Input**:  
- Level buffer size $K$, initial fill ratio $\rho$, level generator

**Initialize**:  
1. Initialize policy $\pi(\phi)$
2. Initialize level buffer $\Lambda$
3. Sample $K \cdot \rho$ initial levels to populate $\Lambda$

**while** not converged **do**  
   1. Sample replay decision $d \sim P_D(d)$  
   2. **if** $d = 0$ **then**  
      - Sample level $\theta$ from level generator  
      - Collect $\pi$'s trajectory $\tau$ on $\theta$, with stop-gradient $\phi^\perp$  
      - Compute regret score $S$ for $\theta$ (Equation 5)  
      - Update $\Lambda$ with $\theta$ if score $S$ meets threshold  
   3. **else**  
      - Sample a replay level, $\theta \sim \Lambda$  
      - Collect policy trajectory $\tau$ on $\theta$  
      - Update $\pi$ with rewards $R(\tau)$  
      - Edit $\theta$ to produce $\theta'$  
      - Collect $\pi$'s trajectory $\tau$ on $\theta'$, with stop-gradient $\phi^\perp$  
      - Compute regret score $S'$ for $\theta'$  
      - Update $\Lambda$ with $\theta$ or $\theta'$ if score $S$ or $S'$ meets threshold  
      - (Optionally) Update Editor using score $S$  
   4. **end**  
**end**


In [None]:
import gymnasium as gym
import minigrid
import numpy as np
from stable_baselines3 import PPO


K = 100 # Replay buffer, maximum number of levels to store
pho = 0.5 # Initial fill rate of the replay buffer
p = 0.5 # Probability of requesting a new level to be generated



class LevelGenerator:
    def __init__(self, level_buffer):
        # Fill the replay buffer with K levels generated with the initial fill rate
        for _ in range(int(K * pho)):
            self.generate(level_buffer)

    def generate(self, level_buffer):
        if len(level_buffer) < K:
            level_buffer.append(gym.make("MiniGrid-LavaGapS5-v0", render_mode="human"))
        else:
            print("Replay buffer is full")


class LevelEditor(gym.Wrapper):
    



def main():

    levels = [] # Replay buffer to store the levels

    # Initialize the level generator
    level_generator = LevelGenerator(levels)

    # Initialize the agent
    model = PPO("MlpPolicy", levels[0], verbose=1)

    # Train the agent
    for _ in range(1000):
        # With probability p, generate a new level and add it to the replay buffer
        if np.random.rand() < p:
            level_generator.generate()
        else:
            # Sample a level from the replay buffer
            level = np.random.choice(levels)
            obs, reward, done, info = level.reset(), 0, False, {}
            while not done:
                action, _ = model.predict(obs, deterministic=True)
                obs, reward, done, info = level.step(action)
                level.render()
            print("Episode finished")


    # Close all the environments in the replay buffer
    for env in level_generator.levels:
        env.close()


if __name__ == "__main__":
    main()