# TorchRL
A modular, primitive-first, Python-first PyTorch library for Reinforcement Learning.

![Image Title](./torchrl.png)

TorchRL overview. The left side shows the key components of the library and the data flow, with `TensorDict` instances passing between modules. The right side provides a code snippet as a toy example that trains DDPG. The script gives users full control over the algorithm's hyperparameters while remaining concise. Replacing a small number of components in the script is enough to switch to a similar algorithm, such as SAC or REDQ. As an alternative to a custom training loop, a `Trainer` class is also available.

What sets TorchRL apart from other RL libraries is its modularity: it provides well-integrated, standalone components.

## The TensorDict

The `TensorDict` is the core PyTorch primitive used in `torchrl`. `TensorDict`s serve as the communication object between the independent components of the `torchrl` library.

*  A `TensorDict` is a dictionary-like object that stores tensor(-like) objects. It includes additional features that optimize its use with PyTorch.
*  Because every component reads and writes a `TensorDict`, function signatures stay generic, which eliminates the challenge of accommodating different data formats.

In [55]:
import torch
from tensordict import TensorDict

In [56]:
batch_size = 5
tensordict = TensorDict(
    source={
        "key 1": torch.zeros(batch_size, 3),
        "key 2": torch.zeros(batch_size, 5, 6, dtype=torch.bool),
    },
    batch_size=[batch_size],
)
print(tensordict)

TensorDict(
    fields={
        key 1: Tensor(shape=torch.Size([5, 3]), device=cpu, dtype=torch.float32, is_shared=False),
        key 2: Tensor(shape=torch.Size([5, 5, 6]), device=cpu, dtype=torch.bool, is_shared=False)},
    batch_size=torch.Size([5]),
    device=None,
    is_shared=False)


You can index a `TensorDict` along its batch dimension as well as query it by key.

In [57]:
print(tensordict[2])
print(tensordict["key 1"] is tensordict.get("key 1"))

TensorDict(
    fields={
        key 1: Tensor(shape=torch.Size([3]), device=cpu, dtype=torch.float32, is_shared=False),
        key 2: Tensor(shape=torch.Size([5, 6]), device=cpu, dtype=torch.bool, is_shared=False)},
    batch_size=torch.Size([]),
    device=None,
    is_shared=False)
True


## Environment Wrappers

`torchrl` offers wrappers for environments from different libraries. For `gymnasium` environments, there is the `GymEnv` wrapper. Among other things, it converts relevant information such as trajectories into `TensorDict`s and allows multiple instances of the environment to run in parallel, which can speed up training. The most important thing to remember is that in `torchrl`, environment methods read and write `TensorDict` instances. This means that methods like `env.reset()` and `env.rand_action(state)` return `TensorDict`s.

In [12]:
import torch


def fetch_device() -> str:
    """
    Returns 'cuda' if GPU is available otherwise 'cpu'.
    Can be used to determine where to move tensors to.
    :return: string 'cuda' or 'cpu'.
    """
    return 'cuda' if torch.cuda.is_available() else 'cpu'

In [14]:
from torchrl.envs import GymEnv
import warnings
warnings.filterwarnings("ignore")

env = GymEnv("Pendulum-v1", device=fetch_device())

In [15]:
reset = env.reset()
print(reset)

TensorDict(
    fields={
        done: Tensor(shape=torch.Size([1]), device=cuda:0, dtype=torch.bool, is_shared=True),
        observation: Tensor(shape=torch.Size([3]), device=cuda:0, dtype=torch.float32, is_shared=True),
        terminated: Tensor(shape=torch.Size([1]), device=cuda:0, dtype=torch.bool, is_shared=True),
        truncated: Tensor(shape=torch.Size([1]), device=cuda:0, dtype=torch.bool, is_shared=True)},
    batch_size=torch.Size([]),
    device=cuda,
    is_shared=True)


In [16]:
reset_with_action = env.rand_action(reset)
print(reset_with_action)

TensorDict(
    fields={
        action: Tensor(shape=torch.Size([1]), device=cuda:0, dtype=torch.float32, is_shared=True),
        done: Tensor(shape=torch.Size([1]), device=cuda:0, dtype=torch.bool, is_shared=True),
        observation: Tensor(shape=torch.Size([3]), device=cuda:0, dtype=torch.float32, is_shared=True),
        terminated: Tensor(shape=torch.Size([1]), device=cuda:0, dtype=torch.bool, is_shared=True),
        truncated: Tensor(shape=torch.Size([1]), device=cuda:0, dtype=torch.bool, is_shared=True)},
    batch_size=torch.Size([]),
    device=cuda,
    is_shared=True)


Environment-related information is now all converted to PyTorch tensors. For example, the action is a tensor rather than a plain float.

In [17]:
print(reset_with_action['action'])

tensor([0.3574], device='cuda:0')


Let's take a look at what the `step()` method outputs. It returns a new `TensorDict` that carries over the entries of its input (such as `observation`) and contains:
* `action`: the action our agent performed.
* `done`: as defined in Gymnasium environments.
* `next`: another tensordict with `done`, `observation`, `reward`, `terminated`, and `truncated` fields. All of its elements are tensors as well.

In [20]:
stepped_data = env.step(reset_with_action)
print(stepped_data)

TensorDict(
    fields={
        action: Tensor(shape=torch.Size([1]), device=cuda:0, dtype=torch.float32, is_shared=True),
        done: Tensor(shape=torch.Size([1]), device=cuda:0, dtype=torch.bool, is_shared=True),
        next: TensorDict(
            fields={
                done: Tensor(shape=torch.Size([1]), device=cuda:0, dtype=torch.bool, is_shared=True),
                observation: Tensor(shape=torch.Size([3]), device=cuda:0, dtype=torch.float32, is_shared=True),
                reward: Tensor(shape=torch.Size([1]), device=cuda:0, dtype=torch.float32, is_shared=True),
                terminated: Tensor(shape=torch.Size([1]), device=cuda:0, dtype=torch.bool, is_shared=True),
                truncated: Tensor(shape=torch.Size([1]), device=cuda:0, dtype=torch.bool, is_shared=True)},
            batch_size=torch.Size([]),
            device=cuda,
            is_shared=True),
        observation: Tensor(shape=torch.Size([3]), device=cuda:0, dtype=torch.float32, is_shared=True),
 

In [23]:
# Extracting the next state tensor
stepped_data = env.step(reset_with_action)
print(stepped_data['next']['observation'])

tensor([-0.9552, -0.2959, -3.1185], device='cuda:0')


This format returned by the `step()` method is called `TED` (TorchRL Episode Data format). This format is crucial for the seamless integration and functioning of various components within TorchRL.

The last piece of information you need to run a rollout in the environment is how to bring that `"next"` entry to the root in order to perform the next step. TorchRL provides a dedicated `step_mdp()` function that does just that: it filters out the information you won't need and delivers a data structure corresponding to your observation after a step in the Markov Decision Process (MDP).

In [24]:
from torchrl.envs import step_mdp

data = step_mdp(stepped_data)
print(data)

TensorDict(
    fields={
        done: Tensor(shape=torch.Size([1]), device=cuda:0, dtype=torch.bool, is_shared=True),
        observation: Tensor(shape=torch.Size([3]), device=cuda:0, dtype=torch.float32, is_shared=True),
        terminated: Tensor(shape=torch.Size([1]), device=cuda:0, dtype=torch.bool, is_shared=True),
        truncated: Tensor(shape=torch.Size([1]), device=cuda:0, dtype=torch.bool, is_shared=True)},
    batch_size=torch.Size([]),
    device=cuda,
    is_shared=True)
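
To make this bookkeeping concrete, here is a minimal sketch of what `step_mdp()` does conceptually, written against plain nested dictionaries rather than `TensorDict` instances. The helper `step_mdp_sketch` and the sample values are hypothetical; they only mirror the behaviour visible in the output above (the entries under `next` move to the root, while `action` and `reward` are dropped). The real function offers more options.

```python
def step_mdp_sketch(step_output: dict) -> dict:
    """Return the root-level data for the next step (illustrative only)."""
    next_entries = step_output["next"]
    # Keep everything stored under "next" except the reward, which describes
    # the transition that just happened rather than the new state.
    return {key: value for key, value in next_entries.items() if key != "reward"}


# Hypothetical output of one environment step, using plain Python values.
stepped = {
    "action": [0.36],
    "observation": [0.1, 0.2, 0.3],
    "done": False,
    "next": {
        "observation": [0.4, 0.5, 0.6],
        "reward": -1.0,
        "done": False,
        "terminated": False,
        "truncated": False,
    },
}

data = step_mdp_sketch(stepped)
print(sorted(data))
```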


### Collecting Environment Interaction Data with Rollouts

Writing out those three steps (computing an action, making a step, moving in the MDP) by hand can be tedious and repetitive. Fortunately, TorchRL environments provide a `rollout()` method that runs them in a closed loop for you.
The `rollout()` method works across all kinds of use cases (single-agent, parallel, multi-agent).
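
The loop that a rollout automates can be sketched in plain Python. Everything below is a hypothetical stand-in: `ToyEnv`, `random_policy` and `rollout_sketch` are illustrative names, not TorchRL APIs, and plain dictionaries take the place of `TensorDict` instances.

```python
import random


class ToyEnv:
    """Hypothetical counter environment: the episode ends after `horizon` steps."""

    def __init__(self, horizon: int = 5):
        self.horizon = horizon
        self.t = 0

    def reset(self) -> dict:
        self.t = 0
        return {"observation": self.t, "done": False}

    def step(self, data: dict) -> dict:
        self.t += 1
        nxt = {
            "observation": self.t,
            "reward": float(data["action"]),
            "done": self.t >= self.horizon,
        }
        # Keep the input entries and nest the result of the step under "next".
        return {**data, "next": nxt}


def random_policy(data: dict) -> dict:
    # Write a random binary action next to the observation.
    return {**data, "action": random.choice([0, 1])}


def rollout_sketch(env: ToyEnv, policy, max_steps: int) -> list:
    trajectory = []
    data = env.reset()
    for _ in range(max_steps):
        data = policy(data)      # 1) compute an action
        data = env.step(data)    # 2) make a step
        trajectory.append(data)
        if data["next"]["done"]:
            break
        # 3) move in the MDP: promote "next" to the root, drop the reward
        data = {k: v for k, v in data["next"].items() if k != "reward"}
    return trajectory


trajectory = rollout_sketch(ToyEnv(horizon=5), random_policy, max_steps=10)
print(len(trajectory))
```

Unlike this sketch, which returns a list of per-step dictionaries, the real method gathers the steps into a single `TensorDict` with an extra time dimension.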

`TensorDict` automatically checks whether the index you provide is a key (in which case it indexes along the key dimension) or a positional index (e.g., `[3]`).

In [25]:
# Returns a tensordict with the same structure as the output of the `step()` method,
# but with a batch size of 10 (one entry per time step).
rollout = env.rollout(max_steps=10)    # Uses a random policy by default.
print(rollout)

# rollout[3] selects the step at index 3 (the fourth step) of the trajectory.

TensorDict(
    fields={
        action: Tensor(shape=torch.Size([10, 1]), device=cuda:0, dtype=torch.float32, is_shared=True),
        done: Tensor(shape=torch.Size([10, 1]), device=cuda:0, dtype=torch.bool, is_shared=True),
        next: TensorDict(
            fields={
                done: Tensor(shape=torch.Size([10, 1]), device=cuda:0, dtype=torch.bool, is_shared=True),
                observation: Tensor(shape=torch.Size([10, 3]), device=cuda:0, dtype=torch.float32, is_shared=True),
                reward: Tensor(shape=torch.Size([10, 1]), device=cuda:0, dtype=torch.float32, is_shared=True),
                terminated: Tensor(shape=torch.Size([10, 1]), device=cuda:0, dtype=torch.bool, is_shared=True),
                truncated: Tensor(shape=torch.Size([10, 1]), device=cuda:0, dtype=torch.bool, is_shared=True)},
            batch_size=torch.Size([10]),
            device=cuda,
            is_shared=True),
        observation: Tensor(shape=torch.Size([10, 3]), device=cuda:0, dtype

### Environment Transformations

Often you want to modify the output of the environment to better suit your requirements. This can include:
* Monitoring the number of steps executed since the last reset.
* Resizing images or stacking consecutive observations together.

In [26]:
from torchrl.envs import StepCounter, TransformedEnv

transformed_env = TransformedEnv(env, StepCounter(max_steps=10))
rollout = transformed_env.rollout(max_steps=100)
print(rollout)

TensorDict(
    fields={
        action: Tensor(shape=torch.Size([10, 1]), device=cuda:0, dtype=torch.float32, is_shared=True),
        done: Tensor(shape=torch.Size([10, 1]), device=cuda:0, dtype=torch.bool, is_shared=True),
        next: TensorDict(
            fields={
                done: Tensor(shape=torch.Size([10, 1]), device=cuda:0, dtype=torch.bool, is_shared=True),
                observation: Tensor(shape=torch.Size([10, 3]), device=cuda:0, dtype=torch.float32, is_shared=True),
                reward: Tensor(shape=torch.Size([10, 1]), device=cuda:0, dtype=torch.float32, is_shared=True),
                step_count: Tensor(shape=torch.Size([10, 1]), device=cuda:0, dtype=torch.int64, is_shared=True),
                terminated: Tensor(shape=torch.Size([10, 1]), device=cuda:0, dtype=torch.bool, is_shared=True),
                truncated: Tensor(shape=torch.Size([10, 1]), device=cuda:0, dtype=torch.bool, is_shared=True)},
            batch_size=torch.Size([10]),
            devi

As you can see, our environment now has one more entry, `"step_count"`, that tracks the number of steps since the last reset. Because we passed the optional argument `max_steps=10` to the transform constructor, the trajectory was also truncated after 10 steps (instead of completing the full rollout of 100 steps we asked for in the `rollout()` call). We can see that the trajectory was truncated by looking at the `truncated` entry:

In [27]:
print(rollout["next", "truncated"])

tensor([[False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [ True]], device='cuda:0')


Many other transforms are available. Beyond transforms, `ParallelEnv` lets you run multiple copies of the same environment (or different ones!) in separate processes.

## TorchRL modules

### TensorDictModules

Just as environments interact with instances of `TensorDict`, so do the modules used to represent policies and value functions. The core idea is simple: encapsulate a standard `Module` (or any other function) within a class that knows which entries need to be read and passed to the module, and that then writes the results to the assigned entries. To illustrate this, we will use the simplest policy possible: a deterministic map from the observation space to the action space. For maximum generality, we will use a `LazyLinear` module with the Pendulum environment introduced above.

Primitives like `TensorDictModule` and `TensorDictSequential` allow one to design complex PyTorch operations in an explicit and programmable way. `TensorDictModule` wraps PyTorch modules, turning them into tensordict-compatible objects that integrate easily into the TorchRL framework. `TensorDictSequential`, in turn, chains `TensorDictModule`s together, functioning similarly to `nn.Sequential`.
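
The read-compute-write pattern described above can be sketched without any library code. The classes `KeyedModule` and `KeyedSequential` below are hypothetical stand-ins that mimic the idea behind `TensorDictModule` and `TensorDictSequential` on plain dictionaries; they are not the actual implementations.

```python
from typing import Callable, Sequence


class KeyedModule:
    """Wrap a function: read `in_keys` from a dict, write results to `out_keys`."""

    def __init__(self, fn: Callable, in_keys: Sequence[str], out_keys: Sequence[str]):
        self.fn = fn
        self.in_keys = list(in_keys)
        self.out_keys = list(out_keys)

    def __call__(self, data: dict) -> dict:
        outputs = self.fn(*(data[key] for key in self.in_keys))
        if len(self.out_keys) == 1:
            outputs = (outputs,)
        # Write each result under its assigned key, keeping existing entries.
        data.update(zip(self.out_keys, outputs))
        return data


class KeyedSequential:
    """Apply several keyed modules in order on the same dict."""

    def __init__(self, *modules: KeyedModule):
        self.modules = modules

    def __call__(self, data: dict) -> dict:
        for module in self.modules:
            data = module(data)
        return data


# A toy "policy": the action is twice the observation, clipped to [-1, 1].
scale = KeyedModule(lambda obs: 2.0 * obs, in_keys=["observation"], out_keys=["raw_action"])
clip = KeyedModule(lambda a: max(-1.0, min(1.0, a)), in_keys=["raw_action"], out_keys=["action"])
policy_sketch = KeyedSequential(scale, clip)

result = policy_sketch({"observation": 0.8})
print(sorted(result))
```

Because each module only declares which keys it reads and writes, modules can be reordered or swapped without changing any function signature.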

In [34]:
import torch

from tensordict.nn import TensorDictModule
from torchrl.envs import GymEnv

env = GymEnv("Pendulum-v1", device=fetch_device())
# A lazy module infers the size of the observation space on its first call, so we do not need to fetch it ourselves.
module = torch.nn.LazyLinear(out_features=env.action_spec.shape[-1], device=fetch_device())

# This TensorDictModule declares that our policy is a function from `observation` to `action`, implemented by the module defined above.
policy = TensorDictModule(
    module,                     # Function approximator for our policy
    in_keys=["observation"],    # Key(s) read from the input tensordict
    out_keys=["action"],        # Key(s) under which the output is written
)

In [35]:
rollout = env.rollout(max_steps=10, policy=policy)
print(rollout)

TensorDict(
    fields={
        action: Tensor(shape=torch.Size([10, 1]), device=cuda:0, dtype=torch.float32, is_shared=True),
        done: Tensor(shape=torch.Size([10, 1]), device=cuda:0, dtype=torch.bool, is_shared=True),
        next: TensorDict(
            fields={
                done: Tensor(shape=torch.Size([10, 1]), device=cuda:0, dtype=torch.bool, is_shared=True),
                observation: Tensor(shape=torch.Size([10, 3]), device=cuda:0, dtype=torch.float32, is_shared=True),
                reward: Tensor(shape=torch.Size([10, 1]), device=cuda:0, dtype=torch.float32, is_shared=True),
                terminated: Tensor(shape=torch.Size([10, 1]), device=cuda:0, dtype=torch.bool, is_shared=True),
                truncated: Tensor(shape=torch.Size([10, 1]), device=cuda:0, dtype=torch.bool, is_shared=True)},
            batch_size=torch.Size([10]),
            device=cuda,
            is_shared=True),
        observation: Tensor(shape=torch.Size([10, 3]), device=cuda:0, dtype

TorchRL also provides regular modules that can be used without resorting to tensordict features. The two most common networks you will encounter are the `MLP` and the `ConvNet` (CNN) modules. We can substitute our policy module with one of these:

In [38]:
from torchrl.modules import MLP, Actor

module = MLP(
    out_features=env.action_spec.shape[-1],
    num_cells=[32, 64],
    activation_class=torch.nn.Tanh,
    device=fetch_device()
)
policy = Actor(module)    # Actor is a TensorDictModule that defaults to in_keys=["observation"] and out_keys=["action"].
rollout = env.rollout(max_steps=10, policy=policy)

#### Probabilistic Policies

Policy-optimization algorithms like PPO require the policy to be stochastic: unlike in the examples above, the module now encodes a map from the observation space to a parameter space that describes a distribution over the possible actions. TorchRL facilitates the design of such modules by grouping the various operations under a single class: building the distribution from the parameters, sampling from that distribution, and retrieving the log-probability. Here, we'll build an actor that relies on a normal distribution using three components:
* An `MLP` backbone that reads observations of size [3] and outputs a single tensor of size [2].
* A `NormalParamExtractor` module that splits this output into two chunks, a mean and a standard deviation, each of size [1].
* A `ProbabilisticActor` that reads those parameters as `in_keys`, creates a distribution with them, and populates our tensordict with samples and log-probabilities.
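
The three components above can be sketched in plain Python to show what each one contributes. This is an illustration only: the helper names are hypothetical, the backbone output is a made-up value, and the use of a plain softplus to keep the standard deviation positive is an assumption rather than a description of what `NormalParamExtractor` does internally.

```python
import math
import random


def softplus(x: float) -> float:
    """Smooth map from the real line to positive numbers."""
    return math.log1p(math.exp(x))


def extract_normal_params(output: list) -> tuple:
    """Split a flat output into a mean half and a (positive) scale half."""
    half = len(output) // 2
    loc = output[:half]
    scale = [softplus(value) for value in output[half:]]
    return loc, scale


def normal_log_prob(x: float, loc: float, scale: float) -> float:
    """Log-density of a univariate normal distribution."""
    return -0.5 * ((x - loc) / scale) ** 2 - math.log(scale) - 0.5 * math.log(2 * math.pi)


backbone_output = [0.3, -1.2]            # hypothetical backbone output of size [2]
loc, scale = extract_normal_params(backbone_output)
action = random.gauss(loc[0], scale[0])  # sample an action from the distribution
log_prob = normal_log_prob(action, loc[0], scale[0])
print(loc, scale, action, log_prob)
```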


In [41]:
from tensordict.nn.distributions import NormalParamExtractor
from torch.distributions import Normal
from torchrl.modules import ProbabilisticActor

backbone = MLP(in_features=3, out_features=2, device=fetch_device())
extractor = NormalParamExtractor()
module = torch.nn.Sequential(backbone, extractor)
td_module = TensorDictModule(module, in_keys=["observation"], out_keys=["loc", "scale"])
policy = ProbabilisticActor(
    td_module,
    in_keys=["loc", "scale"],    # Parameters of Gaussian distribution.
    out_keys=["action"],
    distribution_class=Normal,
    return_log_prob=True,
)

rollout = env.rollout(max_steps=10, policy=policy)
print(rollout)

# Since we asked for it during the construction of the actor, the log-probability of the actions given the distribution at that time is also written. This is necessary for algorithms like PPO.

# The parameters of the distribution are also returned in the output tensordict, under the "loc" and "scale" entries.

TensorDict(
    fields={
        action: Tensor(shape=torch.Size([10, 1]), device=cuda:0, dtype=torch.float32, is_shared=True),
        done: Tensor(shape=torch.Size([10, 1]), device=cuda:0, dtype=torch.bool, is_shared=True),
        loc: Tensor(shape=torch.Size([10, 1]), device=cuda:0, dtype=torch.float32, is_shared=True),
        next: TensorDict(
            fields={
                done: Tensor(shape=torch.Size([10, 1]), device=cuda:0, dtype=torch.bool, is_shared=True),
                observation: Tensor(shape=torch.Size([10, 3]), device=cuda:0, dtype=torch.float32, is_shared=True),
                reward: Tensor(shape=torch.Size([10, 1]), device=cuda:0, dtype=torch.float32, is_shared=True),
                terminated: Tensor(shape=torch.Size([10, 1]), device=cuda:0, dtype=torch.bool, is_shared=True),
                truncated: Tensor(shape=torch.Size([10, 1]), device=cuda:0, dtype=torch.bool, is_shared=True)},
            batch_size=torch.Size([10]),
            device=cuda,
    

If your application requires it, you can make the policy use the expected value or other properties of the distribution instead of random samples. This is controlled via the `set_exploration_type()` function:
```
from torchrl.envs.utils import ExplorationType, set_exploration_type

with set_exploration_type(ExplorationType.MEAN):
    # takes the mean as action
    rollout = env.rollout(max_steps=10, policy=policy)
with set_exploration_type(ExplorationType.RANDOM):
    # Samples actions according to the dist
    rollout = env.rollout(max_steps=10, policy=policy)
```

### Epsilon-Greedy Action Selection for Value-Based algorithms

Stochastic policies like this naturally trade off exploration and exploitation to some extent, but deterministic policies do not. Fortunately, TorchRL can remedy this with its exploration modules. We will take the `EGreedyModule` exploration module as an example (see also `AdditiveGaussianWrapper` and `OrnsteinUhlenbeckProcessWrapper`). To see this module in action, let's revert to a deterministic policy:

In [42]:
from tensordict.nn import TensorDictSequential
from torchrl.modules import EGreedyModule

policy = Actor(MLP(3, 1, num_cells=[32, 64], device=fetch_device()))

Our $\epsilon$-greedy exploration module is usually customized with a number of annealing frames and an initial value for the $\epsilon$ parameter. A value of $\epsilon = 1$ means that every action taken is random, while $\epsilon = 0$ means that there is no exploration at all. To anneal (i.e., decrease) the exploration factor, a call to `step()` is required (see the last tutorial for an example).

Because it must be able to sample random actions in the action space, the `EGreedyModule` must be given the environment's action spec (`env.action_spec`) so that it knows how to sample actions randomly.
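
The mechanics of $\epsilon$-greedy selection with annealing can be sketched in a few lines of plain Python. The class `EpsilonGreedySketch` is hypothetical; its parameter names are chosen for illustration, and the linear annealing schedule is an assumption made for this sketch rather than a statement about the library's internals.

```python
import random


class EpsilonGreedySketch:
    """Hypothetical epsilon-greedy wrapper with linear annealing."""

    def __init__(self, num_actions, eps_init=1.0, eps_end=0.1, annealing_num_steps=1000):
        self.num_actions = num_actions
        self.eps = eps_init
        self.eps_end = eps_end
        # Amount by which epsilon shrinks on every call to step().
        self.decrement = (eps_init - eps_end) / annealing_num_steps

    def __call__(self, greedy_action: int) -> int:
        # With probability eps, replace the greedy action with a random one.
        if random.random() < self.eps:
            return random.randrange(self.num_actions)
        return greedy_action

    def step(self) -> None:
        # Anneal epsilon, never going below eps_end.
        self.eps = max(self.eps_end, self.eps - self.decrement)


explorer = EpsilonGreedySketch(num_actions=2, eps_init=0.5, eps_end=0.0, annealing_num_steps=1000)
for _ in range(1000):
    explorer.step()
print(explorer.eps)
```

In this sketch, forgetting to call `step()` leaves $\epsilon$ at its initial value, so the amount of exploration never decreases.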

In [43]:
exploration_module = EGreedyModule(
    spec=env.action_spec, annealing_num_steps=1000, eps_init=0.5
)

To build our explorative policy, we only need to chain the deterministic policy module and the exploration module within a `TensorDictSequential` module (the analogue of `nn.Sequential` in the tensordict realm).

In [44]:
from torchrl.envs.utils import ExplorationType, set_exploration_type

exploration_policy = TensorDictSequential(policy, exploration_module)

with set_exploration_type(ExplorationType.MEAN):
    # Turns off exploration
    rollout = env.rollout(max_steps=10, policy=exploration_policy)
with set_exploration_type(ExplorationType.RANDOM):
    # Turns on exploration
    rollout = env.rollout(max_steps=10, policy=exploration_policy)

### Q-Value actors

In some settings, the policy isn't a standalone module but is constructed on top of another module. This is the case with Q-value actors. In short, these actors require an estimate of the value of each action (the action space is discrete most of the time) and greedily pick the action with the highest value. In some settings (finite discrete action space and finite discrete state space), one can simply store a 2D table of state-action values and pick the action with the highest value. The innovation brought by DQN was to scale this up to continuous state spaces by using a neural network to encode the Q(s, a) value map. Let's consider another environment with a discrete action space for a clearer understanding:

In [45]:
env = GymEnv("CartPole-v1", device=fetch_device())
print(env.action_spec)

OneHotDiscreteTensorSpec(
    shape=torch.Size([2]),
    space=DiscreteBox(n=2),
    device=cuda,
    dtype=torch.int64,
    domain=discrete)
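
Before replacing the table with a network, the tabular case mentioned above can be sketched in plain Python. The Q-table values are made up for illustration, and `greedy_action` and `one_hot` are hypothetical helpers; the one-hot encoding is included only because the action spec printed above is one-hot.

```python
# Hypothetical Q-table: rows are states, columns are actions.
q_table = [
    [0.1, 0.5],   # state 0: action 1 looks best
    [0.7, 0.2],   # state 1: action 0 looks best
    [0.0, 0.0],   # state 2: tie, the first action wins
]


def greedy_action(q_values: list) -> int:
    """Index of the highest action value (argmax)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def one_hot(index: int, size: int) -> list:
    """One-hot encoding of the chosen action."""
    return [1 if i == index else 0 for i in range(size)]


for state, q_values in enumerate(q_table):
    action = greedy_action(q_values)
    # Print the state, the greedy action, its one-hot form and its value.
    print(state, action, one_hot(action, len(q_values)), q_values[action])
```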


We build a value network that produces one value per action when it reads a state from the environment:

In [51]:
num_actions = 2
value_net = TensorDictModule(
    MLP(out_features=num_actions, num_cells=[32, 32], device=fetch_device()),
    in_keys=["observation"],
    out_keys=["action_value"],
)

We can easily build our Q-Value actor by adding a QValueModule after our value network:

In [52]:
from torchrl.modules import QValueModule

policy = TensorDictSequential(
    value_net,  # writes action values in our tensordict
    QValueModule(
        action_space=env.action_spec
    ),  # Reads the "action_value" entry by default
)

Let’s check it out! We run the policy for a couple of steps and look at the output. We should find an `"action_value"` entry as well as a `"chosen_action_value"` entry in the rollout that we obtain:

In [53]:
rollout = env.rollout(max_steps=3, policy=policy)
print(rollout)

TensorDict(
    fields={
        action: Tensor(shape=torch.Size([3, 2]), device=cuda:0, dtype=torch.int64, is_shared=True),
        action_value: Tensor(shape=torch.Size([3, 2]), device=cuda:0, dtype=torch.float32, is_shared=True),
        chosen_action_value: Tensor(shape=torch.Size([3, 1]), device=cuda:0, dtype=torch.float32, is_shared=True),
        done: Tensor(shape=torch.Size([3, 1]), device=cuda:0, dtype=torch.bool, is_shared=True),
        next: TensorDict(
            fields={
                done: Tensor(shape=torch.Size([3, 1]), device=cuda:0, dtype=torch.bool, is_shared=True),
                observation: Tensor(shape=torch.Size([3, 4]), device=cuda:0, dtype=torch.float32, is_shared=True),
                reward: Tensor(shape=torch.Size([3, 1]), device=cuda:0, dtype=torch.float32, is_shared=True),
                terminated: Tensor(shape=torch.Size([3, 1]), device=cuda:0, dtype=torch.bool, is_shared=True),
                truncated: Tensor(shape=torch.Size([3, 1]), device=

Because it relies on the argmax operator, this policy is deterministic. During data collection, we will need to explore the environment. For that, we use the `EGreedyModule` once again:

In [54]:
policy_explore = TensorDictSequential(policy, EGreedyModule(env.action_spec))

with set_exploration_type(ExplorationType.RANDOM):
    rollout_explore = env.rollout(max_steps=3, policy=policy_explore)

## Model Optimization