# Outlook

In this notebook we explain how to properly deal with time limits when interacting with a gym environment.

## Dealing with time limits

Many RL problems are episodic: the agent must achieve a task, and the episode stops when the task is done. To handle the case where the agent does not succeed, and to accelerate learning, these episodic problems usually come with a time limit: if the agent has not succeeded after a given number of time steps, the episode stops.

As explained in [this paper](http://proceedings.mlr.press/v80/pardo18a/pardo18a.pdf), this situation is a source of instability in RL, as it creates some non-stationarity in the underlying MDP. The point is that, in the same state, the agent may either continue and receive some later reward, or get stopped by the time limit and receive nothing.

The proper way to deal with time limits is to distinguish two ways in which an episode can stop. When the episode is stopped by a time limit, the critic should still propagate the value of the next state to the current state (a Bellman backup) over the last transition. When the episode stops because the task is done, the value of the next state should be ignored.
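Written as a one-step target for the critic, the distinction above can be sketched as follows. The symbols $y_t$ (the target) and $\gamma$ (the discount factor) are introduced here for illustration; $r_t$ and $V(s_{t+1})$ are the reward and the critic value of the next state.

$$
y_t =
\begin{cases}
r_t & \text{if the task is done in } s_{t+1}, \\
r_t + \gamma V(s_{t+1}) & \text{otherwise, including when the episode is stopped by the time limit.}
\end{cases}
$$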

In gym environments, the variable `done` is set to True when the task is done, but also when the time limit is reached. To distinguish the two cases, the `TimeLimit` wrapper adds a `TimeLimit.truncated` entry to the `info` dictionary: it is True when the time limit is reached and the task is not done, and False when the time limit is reached and the task is done at the same step. This is not very intuitive, but that is how it works.

So the rules to apply when an episode stops are the following:
- if `TimeLimit.truncated` is True, the value of the next state should be bootstrapped whatever the value of `done`,
- otherwise, if the task is done, the value of the next state should be ignored in the Bellman backup.
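The two rules above can be sketched for a single transition as follows. This is a minimal illustration, not part of any library: the helper name `must_bootstrap` is hypothetical, and it assumes that the flag is read from the `info` dictionary under the key `"TimeLimit.truncated"` and is simply absent when the time limit has not been reached.

```python
def must_bootstrap(done: bool, info: dict) -> bool:
    """Return True when the value of the next state should be used in the backup."""
    # Assumption: the flag is absent from `info` unless the time limit was hit.
    truncated = info.get("TimeLimit.truncated", False)
    return (not done) or truncated


# The four situations that can occur at the end of a transition.
cases = {
    "episode continues": (False, {}),
    "task done": (True, {}),
    "time limit reached, task not done": (True, {"TimeLimit.truncated": True}),
    "time limit reached and task done": (True, {"TimeLimit.truncated": False}),
}
results = {name: must_bootstrap(done, info) for name, (done, info) in cases.items()}
for name, value in results.items():
    print(f"{name}: bootstrap={value}")
```

Only the two cases where the task is actually done yield False; a pure time-limit stop is treated like an ordinary, non-final transition.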

To implement the above, rather than using a complicated "if... then... " set of rules, we compute a boolean `must_bootstrap` determining whether the value of the next state should be bootstrapped or not, and then we multiply the value of the next state by this boolean (that is, 1 if the boolean is true, 0 otherwise). This results in the following piece of code:

In [None]:
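import torch

# `train_workspace`, `reward`, `critic`, `max_episode_steps` and `discount_factor`
# are assumed to be defined earlier, with tensors of shape [time, batch].
done = train_workspace["env/done"]
# An episode is truncated when it reaches the time limit
truncated = train_workspace["env/timestep"] == max_episode_steps
# Bootstrap if the episode is not over, or if it was only cut by the time limit
must_bootstrap = torch.logical_or(~done, truncated)

# reward[:-1] holds the rewards of all steps but the last one,
# critic[1:] holds the values of the critic at the next state.
# The flags are sliced with [1:] to align with critic[1:], assuming that
# done[t] and truncated[t] describe the state at index t.
target = reward[:-1] + discount_factor * critic[1:].detach() * must_bootstrap[1:].float()
td = target - critic[:-1]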

Note that `reward[:-1]` stands for "all the rewards but the last one", whereas `critic[1:]` stands for "all the critic values but the first one". Taken pairwise, these two slices give $r_t$ and $V(s_{t+1})$ for every transition.

Note also the `critic[1:].detach()`: the gradient is computed with respect to $V(s_t)$ but not with respect to $V(s_{t+1})$.
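To make the two notes above concrete, here is a small self-contained sketch on made-up tensors. All values are invented for illustration. It assumes tensors of shape [time, batch] and that the flags at index $t$ describe the state at index $t$, which is why they are sliced with `[1:]` to line up with `critic[1:]`.

```python
import torch

discount_factor = 0.9

# Toy trajectory of T=3 steps for a batch of B=2 environments (shape [T, B]).
reward = torch.tensor([[1.0, 1.0], [1.0, 1.0], [0.0, 0.0]])
critic = torch.tensor([[0.5, 0.5], [2.0, 2.0], [4.0, 4.0]], requires_grad=True)
# Environment 0 ends at the last step because the task is done,
# environment 1 ends at the last step because of the time limit.
done = torch.tensor([[False, False], [False, False], [True, True]])
truncated = torch.tensor([[False, False], [False, False], [False, True]])

must_bootstrap = torch.logical_or(~done, truncated)

# The flags are sliced with [1:] so that they describe the next state s_{t+1}.
target = reward[:-1] + discount_factor * critic[1:].detach() * must_bootstrap[1:].float()
td = target - critic[:-1]

# Squared TD error as a stand-in critic loss, only used to inspect gradients.
td.pow(2).mean().backward()

print(target)
print(critic.grad)
```

On the last transition, the target is 1.0 in environment 0 (the value of the final state is ignored) and 4.6 in environment 1 (the value 4.0 of the final state is bootstrapped). The last row of `critic.grad` is zero: because of `detach()`, the final critic values only appear as constants in the target.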

Finally, note that if the environment is wrapped in several time limit wrappers, the `TimeLimit.truncated` variable does not work properly, even if the various time limits are set to the same number of steps. So one should avoid putting a `TimeLimit` wrapper around an environment that already contains one.
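One plausible explanation for this behaviour can be sketched with toy classes. The classes `ToyEnv` and `ToyTimeLimit` below are hypothetical and do not reproduce gym's actual implementation; they only assume that a time limit wrapper sets the flag to `not done` when its limit is reached, as described earlier. Under that assumption, the outer wrapper sees `done` already set to True by the inner one and overwrites the flag with False.

```python
class ToyEnv:
    """Hypothetical environment whose task is never completed."""

    def step(self):
        return False, {}  # (done, info)


class ToyTimeLimit:
    """Simplified stand-in for a time limit wrapper (not gym's actual code)."""

    def __init__(self, env, max_episode_steps):
        self.env = env
        self.max_episode_steps = max_episode_steps
        self.elapsed_steps = 0

    def step(self):
        done, info = self.env.step()
        self.elapsed_steps += 1
        if self.elapsed_steps >= self.max_episode_steps:
            # True if the task is not done, False if it is done at the same step.
            info["TimeLimit.truncated"] = not done
            done = True
        return done, info


def run_episode(env):
    """Step until the episode stops and return the final truncation flag."""
    done, info = False, {}
    while not done:
        done, info = env.step()
    return info["TimeLimit.truncated"]


single = run_episode(ToyTimeLimit(ToyEnv(), 3))
nested = run_episode(ToyTimeLimit(ToyTimeLimit(ToyEnv(), 3), 3))
print("single wrapper:", single)
print("nested wrappers:", nested)
```

With a single wrapper the flag is True at the end of the episode, whereas with two nested wrappers it ends up False although the task was never done, so the last transition would wrongly be treated as terminal.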