# RL Environments

By default, there are 4 off-the-shelf RL environments:
- Generalization environment
- Safe RL environment
- MARL environment
- Real-world environment


## Generalization Environment

<iframe width="560" height="315" src="https://www.youtube.com/embed/hL0XDfNHYjA?si=7cn1CpzgpNAf8OAd" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>

We developed an RL environment based on procedural generation: maps are composed by connecting various types of blocks, and traffic vehicles are then scattered randomly across them.
The environment can therefore generate an unlimited number of diverse driving scenarios.
By training RL agents on one set of scenarios and testing them on a held-out set, we can benchmark how well the driving policy generalizes.


<img src="figs/blocks_and_big_case_page.jpg" width="600" class="center">


The following script creates a basic environment that can be used for this purpose:


In [None]:
from metadrive import MetaDriveEnv
import tqdm

training_env = MetaDriveEnv(dict(
    num_scenarios=1000,
    start_seed=1000,
    random_lane_width=True,
    random_agent_model=True,
    random_lane_num=True
))


test_env = MetaDriveEnv(dict(
    num_scenarios=200,
    start_seed=0,
    random_lane_width=True,
    random_agent_model=True,
    random_lane_num=True
))


You can specify a training set of 1000 driving scenarios by setting `num_scenarios=1000` and `start_seed=1000`, and a test set by setting `num_scenarios=200` and `start_seed=0`.
In this case, the scenarios generated by random seeds [1000, 1999] are used to train the agent, and those generated by seeds [0, 199] are used to test it.
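
The seed arithmetic above can be checked with a few lines of plain Python. This is only a minimal sketch: `scenario_seeds` and `seeds_overlap` are hypothetical helpers written for illustration, not part of MetaDrive, and they only assume that a config covers the seeds from `start_seed` up to `start_seed + num_scenarios - 1`.

```python
def scenario_seeds(config):
    """Return the seeds covered by a config with `start_seed` and `num_scenarios`."""
    start = config["start_seed"]
    return range(start, start + config["num_scenarios"])


def seeds_overlap(config_a, config_b):
    """Return True if the two configs share at least one scenario seed."""
    a, b = scenario_seeds(config_a), scenario_seeds(config_b)
    return a.start < b.stop and b.start < a.stop


train_config = dict(num_scenarios=1000, start_seed=1000)
test_config = dict(num_scenarios=200, start_seed=0)

train_seeds = scenario_seeds(train_config)
test_seeds = scenario_seeds(test_config)
print("train seeds: [{}, {}]".format(train_seeds[0], train_seeds[-1]))
print("test seeds: [{}, {}]".format(test_seeds[0], test_seeds[-1]))
print("overlap:", seeds_overlap(train_config, test_config))
```

A check like this is useful before a long training run, because a test set that shares seeds with the training set no longer measures generalization.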

**Note: each process should host only a single MetaDrive instance because of a limit of the underlying simulation engine, but sometimes we want one environment for training and another for testing.** There are two ways to work around this:
1. Launch the training environment and the test environment in two separate processes using tools such as `Ray` or `SubprocVecEnv` from `stable_baselines3`. We generally use [Ray/RLLib](https://docs.ray.io/en/latest/rllib.html) to train RL agents. The training and test environments are then naturally hosted in separate training workers (processes) and evaluation workers, so the singleton problem does not arise.
2. Close the training environment with `training_env.close()` before launching the test environment via `test_env.reset()`. After evaluation, you can restore the training environment by closing the test environment and then calling `training_env.reset()`. An example follows.

In [3]:
def print_env_info(env, name):
    print("{} env, start_seed: {}, end_seed: {}".format(name, env.start_seed, env.start_seed+env.num_scenarios))

print_env_info(training_env, "training")
print_env_info(test_env, "test")

for training_epoch in range(2):
    # training
    training_env.reset()
    print("\nStart fake training epoch {}...".format(training_epoch))
    for _ in range(10):
        # execute 10 training steps with random actions
        training_env.step(training_env.action_space.sample())
    training_env.close()

    # evaluation
    print("Evaluate checkpoint for training epoch {}...\n".format(training_epoch))
    test_env.reset()
    for _ in range(10):
        # execute 10 evaluation steps with random actions
        test_env.step(test_env.action_space.sample())
    test_env.close()

assert test_env.config is not training_env.config


[INFO] Assets version: 0.4.1.2
[INFO] Known Pipes: glxGraphicsPipe
[INFO] Start Scenario Index: 1000, Num Scenarios : 1000
[INFO] Assets version: 0.4.1.2
[INFO] Known Pipes: glxGraphicsPipe
[INFO] Start Scenario Index: 0, Num Scenarios : 200


training env, start_seed: 1000, end_seed: 2000
test env, start_seed: 0, end_seed: 200

Start fake training epoch 0...
Evaluate checkpoint for training epoch 0...



[INFO] Assets version: 0.4.1.2
[INFO] Known Pipes: glxGraphicsPipe
[INFO] Start Scenario Index: 1000, Num Scenarios : 1000
[INFO] Assets version: 0.4.1.2
[INFO] Known Pipes: glxGraphicsPipe
[INFO] Start Scenario Index: 0, Num Scenarios : 200



Start fake training epoch 1...
Evaluate checkpoint for training epoch 1...



The remaining config options, `dict(random_lane_width=True, random_agent_model=True, random_lane_num=True)`, specify that the agent model, the number of lanes and the lane width are randomized to make the scenarios more diverse. In the following example, we sample 50 scenarios from the training set and show the statistics.

In [None]:
from metadrive.component.vehicle.vehicle_type import vehicle_type

env_seed=1000
lane_nums = set()
lane_widths = set()
vehicle_models = set()
traffic_vehicle_models = set()

maps_to_sample = 50
end_seed = training_env.config["start_seed"] + maps_to_sample
for env_seed in tqdm.tqdm(range(training_env.config["start_seed"], end_seed)):
    
    # use `seed` argument to choose which scenario to run
    training_env.reset(seed=env_seed)
    
    # collect statistics
    lane_nums.add(training_env.current_map.config["lane_num"]) 
    lane_widths.add(training_env.current_map.config["lane_width"])
    vehicle_models.add(training_env.vehicle.__class__.__name__)
    traffic_models = set([obj.__class__ for obj in training_env.engine.traffic_manager.spawned_objects.values()])
    traffic_vehicle_models = traffic_vehicle_models.union(traffic_models)
    assert vehicle_type[training_env.vehicle.config["vehicle_model"]] is training_env.vehicle.__class__
    
training_env.close()

print("Number of lanes in {} maps are: {}".format(maps_to_sample, lane_nums))
print("{} maps have {} different widths".format(maps_to_sample, len(lane_widths)))
print("The policy is learning to drive {} types of vehicles".format(len(vehicle_models)))
print("There are {} types of traffic vehicles".format(len(traffic_vehicle_models)))


assert lane_nums == {2, 3}
assert len(lane_widths) == 50
assert len(vehicle_models) == 5
assert len(traffic_vehicle_models) == len(vehicle_models) - 1


We also provide an upgraded version of the generalization environment that keeps the full PG functionality and can additionally randomize the dynamics of the ego vehicle.
The environment is called `VaryingDynamicsEnv`, and you can use the `random_dynamics` dict in the config
to adjust the randomization range of specific dynamics parameters.
In the example below (which is also the default config), we randomize the vehicle dynamics between the lowest and highest limits we recommend.

In [None]:
from metadrive.envs.varying_dynamics_env import VaryingDynamicsEnv
from metadrive.component.vehicle.vehicle_type import vehicle_type
import tqdm

training_env = VaryingDynamicsEnv(dict(
        num_scenarios=1000,  
        
        # Stop randomizing them
        # random_lane_width=True,
        # random_agent_model=True,
        # random_lane_num=True
    
        # We will sample each parameter from (min_value, max_value)
        # You can set it to None to stop randomizing the parameter.
        random_dynamics=dict(
            max_engine_force=(100, 3000),
            max_brake_force=(20, 600),
            wheel_friction=(0.1, 2.5),
            max_steering=(10, 80),  # The maximum steering angle if action = +-1
            mass=(300, 3000)
        )
    ))

In [None]:
env_seed=1000
lane_nums = set()
lane_widths = set()
vehicle_models = set()
traffic_vehicle_models = set()

# collect more
to_collect = ["max_engine_force", "max_brake_force", "wheel_friction", "max_steering", "mass"]
to_collect_set = {k: set() for k in to_collect}

maps_to_sample = 50
end_seed = training_env.config["start_seed"] + maps_to_sample
for env_seed in tqdm.tqdm(range(training_env.config["start_seed"], end_seed)):
    
    # use `seed` argument to choose which scenario to run
    training_env.reset(seed=env_seed)
    
    # collect statistics
    lane_nums.add(training_env.current_map.config["lane_num"]) 
    lane_widths.add(training_env.current_map.config["lane_width"])
    vehicle_models.add(training_env.vehicle.__class__.__name__)
    traffic_models = set([obj.__class__ for obj in training_env.engine.traffic_manager.spawned_objects.values()])
    traffic_vehicle_models = traffic_vehicle_models.union(traffic_models)
    assert vehicle_type[training_env.vehicle.config["vehicle_model"]] is training_env.vehicle.__class__
    
    # collect more
    for k, v in to_collect_set.items():
        v.add(training_env.vehicle.config[k])
    
training_env.close()

print("Number of lanes in {} maps are: {}".format(maps_to_sample, lane_nums))
print("{} maps have {} different widths".format(maps_to_sample, len(lane_widths)))
print("The policy is learning to drive vehicles with {} different dynamics".format(len(to_collect_set["wheel_friction"])))

assert all([len(s)==50 for s in to_collect_set.values()])
assert lane_nums == {3}
assert len(lane_widths) == 1
assert vehicle_models == set([vehicle_type["varying_dynamics"].__name__])
assert len(traffic_vehicle_models) == 4


In the very early stage of MetaDrive, we experimented with randomizing `wheel_friction` in the training environment.
We found that `wheel_friction > 1.2` has little impact on performance. So you can try a training environment
with `wheel_friction in [1.0, 1.4)` and test the trained agent with `wheel_friction in [0.6, 1.0)`.
In this setup, the training environment is significantly easier than the test environment.
We expect an agent trained on fewer training scenarios to perform poorly in the test environment.
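
One way to express this friction split is with two `random_dynamics` dicts, sketched below. This is an untested sketch rather than a recommended setup: `friction_only_dynamics` is a hypothetical helper, the other parameters are set to `None` on the assumption (taken from the comment in the config example above) that `None` disables randomization for that parameter, and the seed split is simply reused from the earlier example.

```python
def friction_only_dynamics(low, high):
    """Build a `random_dynamics` dict that randomizes only `wheel_friction`.

    Assumption: setting a parameter to None stops randomizing it.
    """
    return dict(
        max_engine_force=None,
        max_brake_force=None,
        wheel_friction=(low, high),
        max_steering=None,
        mass=None,
    )


# Easier training range and harder (lower-friction) test range.
train_config = dict(
    num_scenarios=1000,
    start_seed=1000,
    random_dynamics=friction_only_dynamics(1.0, 1.4),
)
test_config = dict(
    num_scenarios=200,
    start_seed=0,
    random_dynamics=friction_only_dynamics(0.6, 1.0),
)

# The configs would then be passed to the environment, one instance per process:
# from metadrive.envs.varying_dynamics_env import VaryingDynamicsEnv
# training_env = VaryingDynamicsEnv(train_config)
print(train_config["random_dynamics"]["wheel_friction"])
print(test_config["random_dynamics"]["wheel_friction"])
```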



------------

<img src="figs/metadrive-envs.jpg" width="600" class="center">


## Safety Environments



<iframe width="560" height="315" src="https://www.youtube.com/embed/6YNgwxEvYtg" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>


Safety is a major concern given the trial-and-error nature of RL.
As driving itself is a safety-critical application, it is essential to evaluate constrained optimization methods in the domain of autonomous driving.
We therefore define a new suite of environments to benchmark **safe exploration** in RL.


As shown in the left panel of the figure above, we randomly place static and movable obstacles in the traffic.
Unlike in the generalization task, we do not terminate the episode if the agent collides with those obstacles or with traffic vehicles.
Instead, we allow the agent to continue driving but flag each crash with a cost of +1.
As a safe exploration task, the learning agent must therefore balance reward against cost to solve the constrained optimization problem.
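
To make the reward/cost trade-off concrete, here is a minimal sketch of one common way to handle it, a Lagrangian penalty. This is not presented as the method used by the authors: the functions `penalized_reward` and `update_multiplier`, the learning rate, the cost limit and the per-step data are all hypothetical and chosen only for illustration.

```python
def penalized_reward(reward, cost, multiplier):
    """Trade reward against cost with a Lagrange multiplier."""
    return reward - multiplier * cost


def update_multiplier(multiplier, episode_cost, cost_limit, lr=0.05):
    """Raise the multiplier when the episode cost exceeds the limit, lower it otherwise."""
    return max(0.0, multiplier + lr * (episode_cost - cost_limit))


# Hypothetical (reward, cost) pairs from one episode; cost is 1 on a crash step.
episode = [(0.5, 0), (0.4, 1), (0.6, 0), (0.3, 1)]

multiplier = 1.0
episode_cost = sum(cost for _, cost in episode)
shaped_rewards = [penalized_reward(r, c, multiplier) for r, c in episode]

# Two crashes against a limit of one: the penalty on future crashes grows.
new_multiplier = update_multiplier(multiplier, episode_cost, cost_limit=1)
print(episode_cost, shaped_rewards, new_multiplier)
```

The multiplier acts as an adaptive price on crashes: it grows while the agent is over the cost limit and shrinks back toward zero once the agent drives safely enough.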


The following script sets up such an environment. As in the generalization environment, you can specify the number of scenarios and the start seed to initialize two sets of environments for training and testing RL agents and benchmarking their safety generalization. The environment-specific parameter is `accident_prob`, which controls the density of obstacles on the road. Apart from this, all parameters are the same as in the generalization environment.

In [None]:
from metadrive import SafeMetaDriveEnv

env = SafeMetaDriveEnv(dict(
    num_scenarios=1000,
    start_seed=0,
    accident_prob=0.8,  # accepted values are in [0, 1.0]
))

You can also try out the safety environment via:

```bash
python -m metadrive.examples.drive_in_safe_metadrive_env
```

## Multi-agent Environments

<iframe width="560" height="315" src="https://www.youtube.com/embed/1-sXZv2ZzXM" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>


As shown in the above figure,
we develop a set of environments to evaluate MARL methods for simulating traffic flow.
The descriptions and typical settings of the six traffic environments are as follows:

1. **Roundabout**: A four-way roundabout with two lanes. 40 vehicles spawn during environment reset. This environment includes merge and split junctions.
2. **Intersection**: An unprotected four-way intersection allowing bi-directional traffic as well as U-turns. Negotiation and social behaviors are expected to be needed to solve this environment. We initialize 30 vehicles.
3. **Tollgate**: Tollgate includes narrow roads where agents spawn and ample space in the middle with multiple tollgates. The tollgates act as static obstacles that agents must not crash into. We force each agent to stop in the middle of a tollgate for 3s. An agent fails if it exits the tollgate before being allowed to pass. 40 vehicles are initialized. Complex behaviors such as deceleration and queuing are expected. Additional states, such as whether the vehicle is in a tollgate and whether the tollgate is blocked, are given.
4. **Bottleneck**: Complementary to Tollgate, Bottleneck contains a narrow bottleneck lane in the middle that forces vehicles to yield to others. We initialize 20 vehicles.
5. **Parking Lot**: A compact environment with 8 parking slots. Spawn points are scattered both in the parking lots and on external roads. 10 vehicles spawn initially and need to navigate toward external roads or enter parking lots. In this environment, we allow agents to reverse to make space for others. Maneuvering and yielding are the keys to solving this task.
6. **PGMA** (Procedural Generation Multi-Agent environment): We reuse the procedurally generated scenarios from the generalization environment and replace the traffic vehicles with controllable target vehicles. These environments contain rich interactions between agents and complex road structures. This multi-agent environment introduces a new challenge under the setting of mixed-motive RL. Each constituent agent in this traffic system is self-interested, and the relationship between agents is constantly changing.

In the multi-agent environments, the termination criterion for each vehicle is identical to that in the single-agent environment.
We explicitly add two config options to adjust termination handling in MARL: `crash_done = True` and `out_of_road_done = True`.
They denote whether to terminate an agent's episode when it crashes or drives out of the road.

In addition, in the multi-agent environments, new controllable target vehicles keep respawning in the scene whenever old target vehicles are terminated.
To limit the length of the *environmental episode*, we also introduce a config option `horizon = 1000` in the MARL environments.
The environmental episode has a **minimal length** of `horizon` steps, and the environment stops spawning new target vehicles once this horizon is exceeded.
If you wish to disable the respawning mechanism in MARL, set the config `allow_respawn = False`. In this case, the environmental episode terminates when no active vehicles are left in the scene.
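
The four options discussed above can be collected in a single config dict, as in the sketch below. The values are the ones quoted in the text; passing the dict to `MultiAgentRoundaboutEnv` in the commented lines is an untested illustration, and any of the multi-agent environment classes could be used instead.

```python
# Settings quoted in the text above.
marl_config = dict(
    crash_done=True,        # terminate an agent's episode when it crashes
    out_of_road_done=True,  # terminate an agent's episode when it leaves the road
    horizon=1000,           # stop spawning new target vehicles after this many steps
    allow_respawn=True,     # keep respawning target vehicles until the horizon
)

# Variant without respawning: the episode ends once no active vehicles remain.
no_respawn_config = dict(marl_config, allow_respawn=False)

# from metadrive import MultiAgentRoundaboutEnv
# env = MultiAgentRoundaboutEnv(no_respawn_config)
print(no_respawn_config)
```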


You can try driving a vehicle in a multi-agent environment with this example:
```bash
# Options for --env: roundabout, intersection, tollgate, bottleneck, parkinglot, pgma
python -m metadrive.examples.drive_in_multi_agent_env --env pgma
```

The following script initializes each of the multi-agent environments:

In [None]:
from metadrive import (
    MultiAgentMetaDrive,
    MultiAgentTollgateEnv,
    MultiAgentBottleneckEnv,
    MultiAgentIntersectionEnv,
    MultiAgentRoundaboutEnv,
    MultiAgentParkingLotEnv
)

envs_classes = dict(
    roundabout=MultiAgentRoundaboutEnv,
    intersection=MultiAgentIntersectionEnv,
    tollgate=MultiAgentTollgateEnv,
    bottleneck=MultiAgentBottleneckEnv,
    parkinglot=MultiAgentParkingLotEnv,
    pgma=MultiAgentMetaDrive
)
envs = [env_class() for env_class in envs_classes.values()]

## Real-world environment




We are developing new environments for benchmarking novel and challenging RL tasks! Ideas on the design of new tasks are welcome!