# Reward, Cost, Termination, and Step Information

[![Click and Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/metadriverse/metaurban/blob/main/documentation/source/reward_cost_done.ipynb)



Following the standard OpenAI Gym API, each call to `env.step(...)` returns a tuple of five items: `(obs, reward, terminated, truncated, info)`. This page discusses the design of the reward function `reward`, the cost function `info["cost"]`, the termination criterion `terminated` in various settings, the truncation flag `truncated`, and the details of the step information `info`.
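As a minimal sketch of how the five returned items are typically consumed, the loop below runs one episode against a stand-in environment. `StubEnv` and `run_episode` are hypothetical names that exist only so the example runs without the simulator; the stub's rewards and flags are made up and say nothing about MetaUrban's actual values.

```python
import random


class StubEnv:
    """Hypothetical stand-in that mimics the five-item step API (not a MetaUrban class)."""

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return [0.0], {}  # (obs, info)

    def step(self, action):
        self.t += 1
        obs = [float(self.t)]
        reward = 0.1                        # made-up constant reward
        terminated = False                  # would be set by the termination criterion
        truncated = self.t >= self.horizon  # the stub only ends via its step limit
        info = {"cost": 0, "episode_length": self.t}
        return obs, reward, terminated, truncated, info


def run_episode(env):
    """Roll out one episode and return (total reward, total cost, number of steps)."""
    env.reset()
    total_reward, total_cost, steps = 0.0, 0.0, 0
    while True:
        action = [random.uniform(-1, 1), random.uniform(-1, 1)]
        obs, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        total_cost += info["cost"]
        steps += 1
        # An episode is over as soon as either flag is set.
        if terminated or truncated:
            return total_reward, total_cost, steps


total_reward, total_cost, steps = run_episode(StubEnv(horizon=5))
print(steps, round(total_reward, 2), total_cost)
```

With a real environment, the body of `run_episode` would presumably stay the same; only the construction of `env` and the action source would change.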

## Reward Function

For all environments, the reward function generally consists of a dense driving reward and a sparse terminal reward. The dense reward is the longitudinal movement along the reference line or lane toward the destination. When the episode terminates, e.g. because the agent arrives at the destination or drives out of the road, a sparse reward is added to the dense reward. In practice, the concrete implementation of the reward function differs slightly across environments.

The reward functions are implemented as follows.

```python
from metaurban.envs import SidewalkStaticMetaUrbanEnv
from metaurban.utils import print_source
print_source(SidewalkStaticMetaUrbanEnv.reward_function)
```

[34mdef[39;49;00m[37m [39;49;00m[35mreward_function[39;49;00m[35m([39;49;00m[35mself[39;49;00m[35m,[39;49;00m [30mvehicle_id[39;49;00m[35m:[39;49;00m [35mstr[39;49;00m[35m)[39;49;00m[35m:[39;49;00m[37m[39;49;00m
[37m    [39;49;00m[33m"""[39;49;00m
[33m    Override this func to get a new reward function[39;49;00m
[33m    :param vehicle_id: id of BaseVehicle[39;49;00m
[33m    :return: reward[39;49;00m
[33m    """[39;49;00m[37m[39;49;00m
    [30mvehicle[39;49;00m = [35mself[39;49;00m.[30magents[39;49;00m[35m[[39;49;00m[30mvehicle_id[39;49;00m[35m][39;49;00m[37m[39;49;00m
    [30mstep_info[39;49;00m = [35mdict[39;49;00m[35m([39;49;00m[35m)[39;49;00m[37m[39;49;00m
[37m[39;49;00m
    [32m# Reward for moving forward in current lane[39;49;00m[37m[39;49;00m
    [30mcurrent_lane[39;49;00m = [30mvehicle[39;49;00m.[30mlane[39;49;00m[37m[39;49;00m
    [30mlong_last[39;49;00m = [30mvehicle[39;49;00m.[30mnavigation[39;4

This reward function is composed of five terms:

$R = R_{term} + c_1 R_{disp} + c_2 R_{lateral} + c_3 R_{steering} + c_4 R_{crash}$

- **Terminal reward** $R_{term}$: a sparse reward set to $+5$ if the vehicle reaches the destination, and $-5$ if it goes out of route. If $R_{term}\neq 0$ at any time step $t$, the episode is terminated at $t$ immediately.

- **Displacement reward** $R_{disp}$: a dense reward defined as $R_{disp}=d_t-d_{t-1}$, where $d_t$ and $d_{t-1}$ denote the longitudinal position of the ego agent in the Frenet coordinates of the current lane at time $t$ and $t-1$, respectively. We set the weight of $R_{disp}$ to $c_1=0.5$.

- **Lateral reward** $R_{lateral}$: a dense reward defined as $R_{lateral}=-||l_t||$, where $l_t$ denotes the lateral offset of the ego agent in the Frenet coordinates of the current lane at time $t$. It is designed to prevent the agent from driving on non-walkable areas. We set the weight of $R_{lateral}$ to $c_2=1.0$.

- **Steering smoothness reward** $R_{steering}$: a dense reward defined as $R_{steering}=-||s_t-s_{t-1}||\cdot v_t$, where $s_t$ and $s_{t-1}$ denote the steering of the agent at $t$ and $t-1$, respectively, and $v_t$ denotes the speed of the agent at time $t$. This term acts as a regularizer that discourages the agent from changing the steering too frequently. We set the weight of $R_{steering}$ to $c_3=0.1$.

- **Crash reward** $R_{crash}$: a dense negative reward defined as $R_{crash}=-\mathbb{1}(c_{t})$, where $c_{t}$ denotes a collision between the agent and any other object at time $t$ and $\mathbb{1}(\cdot)$ is the indicator function. Note that we do not use the termination strategy for collision as in MetaDrive (Li et al., 2022). We set the weight of $R_{crash}$ to $c_4=1.0$.
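To make the weighted sum concrete, the sketch below transcribes the formula and the stated weights directly into Python. It is not the library's `reward_function` (whose printed source is truncated above); `StepState` and `composite_reward` are hypothetical names, and the example numbers are made up.

```python
from dataclasses import dataclass


@dataclass
class StepState:
    """Hypothetical container for the quantities that appear in the formula."""
    long_now: float       # d_t, longitudinal position at time t
    long_last: float      # d_{t-1}, longitudinal position at time t-1
    lateral: float        # l_t, lateral offset at time t
    steering_now: float   # s_t
    steering_last: float  # s_{t-1}
    speed: float          # v_t
    crashed: bool = False
    arrived: bool = False
    out_of_route: bool = False


# Weights c_1..c_4 as stated in the text.
C_DISP, C_LATERAL, C_STEERING, C_CRASH = 0.5, 1.0, 0.1, 1.0


def composite_reward(s: StepState) -> float:
    """R = R_term + c1*R_disp + c2*R_lateral + c3*R_steering + c4*R_crash."""
    r_disp = s.long_now - s.long_last
    r_lateral = -abs(s.lateral)
    r_steering = -abs(s.steering_now - s.steering_last) * s.speed
    r_crash = -1.0 if s.crashed else 0.0
    if s.arrived:
        r_term = 5.0
    elif s.out_of_route:
        r_term = -5.0
    else:
        r_term = 0.0
    return (r_term + C_DISP * r_disp + C_LATERAL * r_lateral
            + C_STEERING * r_steering + C_CRASH * r_crash)


# Made-up step: 1 m of progress, 0.2 lateral offset, small steering change at speed 2.
example = StepState(long_now=10.0, long_last=9.0, lateral=0.2,
                    steering_now=0.1, steering_last=0.0, speed=2.0)
reward = composite_reward(example)
print(round(reward, 4))
```

For this step the terms are $0.5\cdot 1.0$, $1.0\cdot(-0.2)$, and $0.1\cdot(-0.2)$, with no crash or terminal contribution.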

## Cost Function

Similar to the reward function, we also provide a default cost function to measure safety during driving. The cost is placed in the returned information dict as `info["cost"]` after each call to `env.step`.

- `crash_vehicle_cost = 1.0`: cost incurred when crashing into other vehicles.
- `crash_human_cost = 1.0`: cost incurred when crashing into humans.
- `crash_object_cost = 1.0`: cost incurred when crashing into objects, such as cones and triangles.

The implementation of the cost function is simple:

```python
from metaurban.utils import print_source
from metaurban.envs import SidewalkStaticMetaUrbanEnv
print_source(SidewalkStaticMetaUrbanEnv.cost_function)
```

[34mdef[39;49;00m[37m [39;49;00m[35mcost_function[39;49;00m[35m([39;49;00m[35mself[39;49;00m[35m,[39;49;00m [30mvehicle_id[39;49;00m[35m:[39;49;00m [35mstr[39;49;00m[35m)[39;49;00m[35m:[39;49;00m[37m[39;49;00m
    [30mvehicle[39;49;00m = [35mself[39;49;00m.[30magents[39;49;00m[35m[[39;49;00m[30mvehicle_id[39;49;00m[35m][39;49;00m[37m[39;49;00m
    [30mstep_info[39;49;00m = [35mdict[39;49;00m[35m([39;49;00m[35m)[39;49;00m[37m[39;49;00m
    [30mstep_info[39;49;00m[35m[[39;49;00m[33m"[39;49;00m[33mcost[39;49;00m[33m"[39;49;00m[35m][39;49;00m = [34m0[39;49;00m[37m[39;49;00m
    [34mif[39;49;00m [35mself[39;49;00m.[30m_is_out_of_road[39;49;00m[35m([39;49;00m[30mvehicle[39;49;00m[35m)[39;49;00m[35m:[39;49;00m[37m[39;49;00m
        [30mstep_info[39;49;00m[35m[[39;49;00m[33m"[39;49;00m[33mcost[39;49;00m[33m"[39;49;00m[35m][39;49;00m = [35mself[39;49;00m.[30mconfig[39;49;00m[35m[[39;49;00m[33m"[3

You can modify this function to add more information to the `step_info` dict. For example, you can log what kind of object raises the cost. You can then calculate how many cars the ego vehicle collides with in one episode by summing the number of vehicle crashes over all steps.
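A minimal sketch of that bookkeeping, assuming each step's `info` dict carries the `cost` and `crash_vehicle` keys listed in the Step Information summary. `summarize_episode` is a hypothetical helper and the `infos` list is made-up data for illustration, not simulator output.

```python
def summarize_episode(infos):
    """Aggregate per-step info dicts into episode-level safety statistics."""
    total_cost = sum(info.get("cost", 0) for info in infos)
    # Counts steps on which a vehicle crash was flagged.
    vehicle_crash_steps = sum(1 for info in infos if info.get("crash_vehicle", False))
    return {"total_cost": total_cost, "vehicle_crash_steps": vehicle_crash_steps}


# Made-up per-step infos: one vehicle crash and one other cost event.
infos = [
    {"cost": 0, "crash_vehicle": False},
    {"cost": 1.0, "crash_vehicle": True},
    {"cost": 1.0, "crash_vehicle": False},
    {"cost": 0, "crash_vehicle": False},
]
summary = summarize_episode(infos)
print(summary)
```

Counting flagged steps may overcount if a single contact persists across several steps; counting only the steps where the flag switches from `False` to `True` is one way to count distinct collisions instead.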

## Termination and Truncation

MetaUrban will terminate an episode of a vehicle if:

1. the target vehicle arrives at its destination,
2. the vehicle drives out of the road,
3. the vehicle crashes into other agents (vehicles),
4. the vehicle crashes into obstacles,
5. the vehicle crashes into a human,
6. the episode reaches the maximum step (horizon) limit.

The above termination function is implemented as:

```python
print_source(SidewalkStaticMetaUrbanEnv.done_function)
```

[34mdef[39;49;00m[37m [39;49;00m[35mdone_function[39;49;00m[35m([39;49;00m[35mself[39;49;00m[35m,[39;49;00m [30mvehicle_id[39;49;00m[35m:[39;49;00m [35mstr[39;49;00m[35m)[39;49;00m[35m:[39;49;00m[37m[39;49;00m
    [30mvehicle[39;49;00m = [35mself[39;49;00m.[30magents[39;49;00m[35m[[39;49;00m[30mvehicle_id[39;49;00m[35m][39;49;00m[37m[39;49;00m
    [30mdone[39;49;00m = [34mFalse[39;49;00m[37m[39;49;00m
    [30mmax_step[39;49;00m = [35mself[39;49;00m.[30mconfig[39;49;00m[35m[[39;49;00m[33m"[39;49;00m[33mhorizon[39;49;00m[33m"[39;49;00m[35m][39;49;00m [35mis[39;49;00m [35mnot[39;49;00m [34mNone[39;49;00m [35mand[39;49;00m [35mself[39;49;00m.[30mepisode_lengths[39;49;00m[35m[[39;49;00m[30mvehicle_id[39;49;00m[35m][39;49;00m >= [35mself[39;49;00m.[30mconfig[39;49;00m[35m[[39;49;00m[33m"[39;49;00m[33mhorizon[39;49;00m[33m"[39;49;00m[35m][39;49;00m[37m[39;49;00m
    [30mdone_info[39;49;00m = [35m{
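On the user side, the reason an episode ended can usually be read back from the flags in `info`. The helper below is a minimal sketch, assuming the flag names shown in the step information dict (`arrive_dest`, `out_of_road`, `crash`, `max_step`); `episode_end_reason` is a hypothetical name and the priority order is an arbitrary choice, not the library's.

```python
def episode_end_reason(info):
    """Return the name of the first end-of-episode flag set in `info`, or None."""
    # Checked in an arbitrary priority order chosen for this sketch.
    for key in ("arrive_dest", "out_of_road", "crash", "max_step"):
        if info.get(key, False):
            return key
    return None


# Made-up info dicts for illustration.
running = {"arrive_dest": False, "out_of_road": False, "crash": False, "max_step": False}
left_road = {"arrive_dest": False, "out_of_road": True, "crash": False, "max_step": False}
print(episode_end_reason(running), episode_end_reason(left_road))
```

Calling this on the last `info` of an episode gives a compact label that can be tallied across episodes, for example to estimate success and out-of-road rates.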

## Step Information

The step information dict `info` contains rich information about the current state of the environment and the target vehicle.
The step info is collected from various sources, such as the engine, the reward function, the termination function, the traffic manager, and the agent manager.
We summarize the dict as follows:
```
    {
        # Number of vehicles being overtaken by ego vehicle in this episode
        'overtake_vehicle_num': 0,

        # Current velocity in km/h
        'velocity': 0.0,

        # The current normalized steering signal in [-1, 1]
        'steering': -0.06901532411575317,

        # The current normalized acceleration signal in [-1, 1]
        'acceleration': -0.2931942343711853,

        # The normalized action, after clipping, that is applied to the ego vehicle
        'raw_action': (-0.06901532411575317, -0.2931942343711853),

        # Whether crash to vehicle / object / building
        'crash_vehicle': False,
        'crash_object': False,
        'crash_building': False,
        'crash': False,  # Whether any kind of crash happens

        # Whether going out of the road / arrive destination
        # or exceeding the maximal episode length
        'out_of_road': False,
        'arrive_dest': False,
        'max_step': False,

        # The reward in this time step / the whole episode so far
        'step_reward': 0.0,
        'episode_reward': 0.0,

        # The cost in this time step
        'cost': 0,

        # The length of current episode
        'episode_length': 1
    }
```

The content of this dict changes as the project evolves, so the summary above may be out of date.
We encourage users to write customized data to this dict, so that more status can be exposed to monitor the simulation even without visualization.

## Customization
To compose your own reward, cost, and termination functions, create a new environment class and override the `reward_function`, `cost_function`, and `done_function` of the base environment class. You can also record more information in the `step_info` returned by these functions and deliver it outside the simulator.

```python
from metaurban.envs import SidewalkStaticMetaUrbanEnv


class MyEnv(SidewalkStaticMetaUrbanEnv):
    def reward_function(self, vehicle_id: str):
        # Return a constant reward plus a custom entry that shows up in `info`.
        return -10, {"is_customized": True}


env = MyEnv({"object_density": 0.1})
env.reset()
_, r, _, _, info = env.step([0, 0])
assert r == -10 and info["is_customized"]
print("reward: {}, `is_customized` in info: {}".format(r, info["is_customized"]))
env.close()
```

```
[INFO] Environment: MyEnv
[INFO] MetaUrban version: 0.0.1
[INFO] Sensors: [lidar: Lidar(), side_detector: SideDetector(), lane_line_detector: LaneLineDetector()]
[INFO] Render Mode: none
[INFO] Horizon (Max steps per agent): None
[INFO] Assets version: 0.0.1
[INFO] Known Pipes: glxGraphicsPipe
[INFO] Start Scenario Index: 0, Num Scenarios : 1
[INFO] Agents can walk on all regions
[INFO] Agents can walk on all regions

reward: -10, `is_customized` in info: True
```
