# Notes on Stable Baselines 3 - v1.0
#### 2022/09/14, AJ Zerouali
#### Updated: 22/10/17

**<span style='color:red'> To do (22/10/03) </span>**:

* Order of function calls for training.
* Order of function calls for prediction (backtesting).
* The probability distributions submodule.
* The replay buffer submodule.
* Stochastic policies.
* Deterministic policies.
* Saving and loading.

### Contents:
1) [The FinRL algorithmic trading pipeline](#1)
    * a) [Stock trading pipeline](#1.a)
    * b) [Portfolio optimization pipeline](#1.b)
    * c) [Paper trading with Alpaca](#1.c)
2) [Stable Baselines 3](#2)
    * a) [Environment classes](#2.a)
    * b) [Models I - Algorithm classes](#2.b)
    * c) [Models II - Policy classes](#2.c)
    * d) [Loggers](#2.d)
    * e) [TensorBoard and Callbacks](#2.e)
3) [The portfolio optimization environment](#3)
    * a) [FinRL's updated environment](#3.a)
    * b) [Attributes](#3.b)
    * c) [The *step()* and *reset()* methods](#3.c)
4) [Deep RL Agents](#4)
    * a) [The *DRLAgent* class](#4.a)
    * b) [Training - The *DRLAgent.train_model()* method](#4.b)
    * c) [Backtesting - The *DRLAgent.DRL_prediction()* method](#4.c)
    * d) [Backtest plots and statistics](#4.d)
5) [Financial data - Acquisition and Preprocessing](#5)
    * a) [Downloader](#5.a)
    * b) [The *FeatureEngineer()* function](#5.b)
    * c) [Return covariances](#5.c)

<a id='2'></a>
## 2 - Stable Baselines 3

FinRL is mainly a library of wrappers for financial applications, so for the RL implementations themselves this section is arguably the most important part of these notes. We address the following topics.

   * a) [Environment classes](#2.a)
   * b) [Models I - Algorithm classes](#2.b)
   * c) [Models II - Policy classes](#2.c)
   * d) [Loggers](#2.d)
   * e) [TensorBoard and Callbacks](#2.e)

    
The main references for the discussion below are the following pages:
* SB3 official docs: https://stable-baselines3.readthedocs.io/en/master/index.html
* SB3 GitHub: https://github.com/DLR-RM/stable-baselines3
* Tutorial repo: https://github.com/araffin/rl-tutorial-jnrr19

<a id='2.a'></a>
### a) Environment classes


#### The *VecEnv* and *DummyVecEnv* classes

The main abstract environment class of SB3 is the vectorized environment *VecEnv*.
* A description of vectorized envs is given here: https://stable-baselines3.readthedocs.io/en/master/guide/vec_envs.html;
* The class is implemented at: https://github.com/DLR-RM/stable-baselines3/blob/master/stable_baselines3/common/vec_env/base_vec_env.py.

The idea is to allow the possibility of having multiple agents evolving in stacked environments (particularly important when dealing with Atari frames for states), and thus get vectorized actions, rewards and state spaces with various shapes. Just as with Gym, there is a plethora of wrappers that the library uses, which are briefly described at the first link above. 

There are two subclasses of *VecEnv* that are called in practice. From the docs: 
* ***SubprocVecEnv***: Creates a multiprocess vectorized wrapper for multiple environments, distributing each environment to its own process, allowing significant speed up when the environment is computationally complex.
* ***DummyVecEnv***: Creates a simple vectorized wrapper for multiple environments, calling each environment in sequence on the current Python process. This is useful for computationally simple environments such as cartpole-v1, as the overhead of multiprocess or multithread outweighs the environment computation time. This can also be used for RL methods that require a vectorized environment, but where you want **a single environment** to train with. 

The latter is the one used by FinRL for SB3 environments. The implementation is here:
* https://github.com/DLR-RM/stable-baselines3/blob/master/stable_baselines3/common/vec_env/dummy_vec_env.py

Comments and facts on *DummyVecEnv* and *VecEnv*:
* *VecEnv* is quite similar to the *VectorEnv* class of Gym (https://github.com/openai/gym/blob/master/gym/vector/vector_env.py), but in contrast to the latter, *VecEnv* is an **abstract base class** (https://docs.python.org/3/library/abc.html) in SB3.
* Concerning the classes representing state and action spaces, *VecEnv* declares these as *gym.spaces.Space* objects in the constructor. 
* Here is an important note from SB3's Vectorized Envs. docs: When using vectorized environments, the environments are automatically reset at the end of each episode. Thus, the observation returned for the i-th environment when *done[i]* is true will in fact be the first observation of the next episode, **not** the last observation of the episode that has just terminated. You can access the “real” final observation of the terminated episode—that is, the one that accompanied the *done* event provided by the underlying environment—using the *terminal_observation* keys in the info dicts returned by the *VecEnv*.
* The *step()* method itself is not implemented in *DummyVecEnv*; it is inherited from *VecEnv*, where it simply chains two abstract methods, *step_async()* and *step_wait()* (the type annotations come from the *typing* module):

In [None]:
            # Imports needed by this excerpt
            from abc import ABC
            from typing import Any, Dict, Iterable, List, Optional, Sequence, Tuple, Type, Union
            import numpy as np

            # A batch of observations: an array, a dict of arrays, or a tuple of arrays
            VecEnvObs = Union[np.ndarray, Dict[str, np.ndarray], Tuple[np.ndarray, ...]]
            # VecEnvStepReturn is what is returned by the step() method
            # it contains the observation, reward, done, info for each env
            VecEnvStepReturn = Tuple[VecEnvObs, np.ndarray, np.ndarray, List[Dict]]

            # Class def (excerpt; the other methods are elided)
            class VecEnv(ABC):
                ...
                def step(self, actions: np.ndarray) -> VecEnvStepReturn:
                    """
                    Step the environments with the given action
                    :param actions: the action
                    :return: observation, reward, done, information
                    """
                    self.step_async(actions)
                    return self.step_wait()

   Similarly, *reset()* and *close()* are abstract methods of *VecEnv* that are implemented in *DummyVecEnv*, and *DummyVecEnv* also implements the *step_async()* and *step_wait()* methods called by *VecEnv*'s *step()* method. The *render()* method of *VecEnv* relies in turn on the *get_images()* helper implemented in *DummyVecEnv*.
* In the end, a *DummyVecEnv* instance is built from a list of callables, each returning a *gym.Env* object, as one can see from its constructor:

In [None]:
            def __init__(self, env_fns: List[Callable[[], gym.Env]]):
                self.envs = [fn() for fn in env_fns]
                env = self.envs[0]
                VecEnv.__init__(self, len(env_fns), env.observation_space, env.action_space)
                obs_space = env.observation_space
                self.keys, shapes, dtypes = obs_space_info(obs_space)
                ...

In summary, and in relation to our use in FinRL, the *DummyVecEnv* class is simply a wrapper for the portfolio optimization environment that allows us to use SB3 agents. Concretely, this means that the state/action spaces and rewards constituting our MDP are specified in the *StockPortfolioEnv* class, which subclasses the foundational *gym.Env* class. In the next part we give some reminders on the main moving parts of this class.

<span style="color:red">**To do (22/09/30):**</span> Add a comment on how **custom** *step()*/*reset()*/*\_\_init\_\_()* methods from a gym environment are adopted by the *VecEnv*/*DummyVecEnv* classes. This is particularly important for the discussion of FinRL's *StockPortfolioEnv* below, since:

* At first glance, this is not obvious from the code or the docs. What's the mechanism? Check:
    * https://github.com/DLR-RM/stable-baselines3/blob/master/stable_baselines3/common/vec_env/dummy_vec_env.py
    * https://github.com/DLR-RM/stable-baselines3/blob/master/stable_baselines3/common/vec_env/base_vec_env.py
* Unclear at the moment on what the argument 
    
        env_fns: List[Callable[[], gym.Env]]
  means.
* Comment on the *DummyVecEnv.env_method(method_name, *method_args)* method. For instance, it is used for non-gym methods proper to *StockPortfolioEnv*.
* Might be useful to give a sequence of method calls here.
* There's more to say about environments in the *BaseAlgorithm* section. Algorithms have an *env* parameter that is wrapped by SB3 before being assigned to *BaseAlgorithm.env*. 
* In FinRL, *StockPortfolioEnv* is converted to a *DummyVecEnv* before being passed to the SB3 algorithm. This closes the loop between FinRL's *DRLAgent* class and SB3 models.
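
Regarding the *env_fns* question in the list above: each element of the list is a zero-argument callable that builds one environment when called. The following is a minimal pure-Python sketch of that mechanism, not SB3's actual code. The classes `CountdownEnv` and `ToyDummyVecEnv` are hypothetical stand-ins introduced only for illustration. The sketch also mimics the automatic reset with *terminal_observation* described earlier, and the forwarding of custom methods done by *env_method()*.

```python
from typing import Any, Callable, Dict, List, Tuple


class CountdownEnv:
    """Hypothetical stand-in for a gym.Env: the episode ends after `horizon` steps."""

    def __init__(self, horizon: int):
        self.horizon = horizon
        self.t = 0

    def reset(self) -> int:
        self.t = 0
        return self.t

    def step(self, action: int) -> Tuple[int, float, bool, Dict[str, Any]]:
        self.t += 1
        done = self.t >= self.horizon
        return self.t, float(action), done, {}

    def remaining(self) -> int:
        """A custom, non-gym method (in the spirit of those of StockPortfolioEnv)."""
        return self.horizon - self.t


class ToyDummyVecEnv:
    """Simplified sketch of the DummyVecEnv mechanism (an assumption, not SB3's code)."""

    def __init__(self, env_fns: List[Callable[[], CountdownEnv]]):
        # Each element of env_fns is a factory; calling it builds one environment
        self.envs = [fn() for fn in env_fns]

    def reset(self) -> List[int]:
        return [env.reset() for env in self.envs]

    def step(self, actions: List[int]):
        obs, rewards, dones, infos = [], [], [], []
        for env, action in zip(self.envs, actions):
            # The custom step() of the wrapped environment is called here, unchanged
            o, r, d, info = env.step(action)
            if d:
                # Auto-reset: keep the true last observation in the info dict
                info["terminal_observation"] = o
                o = env.reset()
            obs.append(o)
            rewards.append(r)
            dones.append(d)
            infos.append(info)
        return obs, rewards, dones, infos

    def env_method(self, method_name: str, *method_args: Any) -> List[Any]:
        # Forward a call to a custom method of every wrapped environment
        return [getattr(env, method_name)(*method_args) for env in self.envs]


vec_env = ToyDummyVecEnv([lambda: CountdownEnv(horizon=2)])
first_obs = vec_env.reset()
step_1 = vec_env.step([1])
remaining_after_1 = vec_env.env_method("remaining")
step_2 = vec_env.step([1])  # episode ends here, so the env is reset automatically
```

With SB3 itself, the analogous call would presumably take the form `DummyVecEnv([lambda: my_env])`, where `my_env` is an already constructed *gym.Env* instance.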

#### Custom environments
**Note (22/09/19):** See: https://stable-baselines3.readthedocs.io/en/master/guide/custom_env.html

#### The *gym.Env* class

To summarize, this class consists of:

* ***action_space:*** This attribute encodes the actions in the environment, and is typically a subclass of *Space* discussed below.
* ***observation_space:*** Same as above, but this attribute encodes the state space.
* ***reset():*** This method resets the agent to the initial state of the environment. It should always be called after instantiating the *Env* object to obtain the initial observation of the episode.
* ***step():*** This method is used as follows:
        

In [None]:
        next_state, reward, done, info = env.step(action)

  The input is the action of the agent, and the output consists of the new state (NumPy array), the corresponding reward (float), and the boolean *done* saying whether or not *next_state* is terminal. The *info* output contains information relevant for debugging and learning.
        
* ***render():*** This method visualizes the evolution of the agent. It calls the PyGame package which in turn calls upon some lower-level C functions (one of the prerequisite packages of Gym is *SWIG*, which calls a C++ compiler). One of the args of *render()* is *mode*, which equals *'human'* by default and displays the game screen. Once done with rendering, one must call ***Env.close()*** to shut down PyGame.

The action space and the state space are represented by the abstract Gym class entitled ***Space***, whose two most relevant methods are:
* ***sample()***, which returns a random sample from the space. This method is typically used to draw random actions.
* ***contains(x)***, which checks whether the element $x$ belongs to the space.

These abstract methods are reimplemented in the three usual subclasses of *Space*:
* ***Discrete:*** This space class models finite spaces, with elements labelled from $0$ to $n-1$. The values assigned to each label are described in the environment subclass (see example below).

* ***Box:*** A *Box* represents an $n$-dimensional array of real numbers whose components lie in intervals $[x_\min, x_\max]$ (possibly unbounded). Its constructor takes a *dtype* parameter that sets the numeric type, and hence the precision, of the elements. There is also a *shape* argument in the constructor, which for example is used when the states are represented by screenshots of a game (think of the Atari environments). In the case of RGB images of size 210x160 pixels, one calls

In [None]:
            from gym.spaces import Box
            Box(low=0, high=255, shape=(210, 160, 3))

          
    where 3 stands for the RGB channels.
    
* ***Tuple:*** Some spaces are more complex, for instance those with both discrete and continuous components. The *Tuple* class allows one to define such spaces in a nested way, for instance by combining the previous classes (see example of car controls).

In terms of RL algorithms, the *step()* method is the crucial one, as it implements the environment dynamics: This method is typically where the state transitions and rewards are computed. This remark will be of particular importance when returning to FinRL.
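
To make the interface above concrete, here is a minimal sketch of an environment that follows the *gym.Env* conventions without importing Gym. `ToyDiscrete` and `CorridorEnv` are hypothetical classes written for these notes: the former imitates the *sample()*/*contains()* interface of *Discrete*, and the latter shows that the state transition and the reward are both computed inside *step()*.

```python
import random
from typing import Dict, Tuple


class ToyDiscrete:
    """Hypothetical stand-in for gym.spaces.Discrete(n): the labels 0, 1, ..., n - 1."""

    def __init__(self, n: int, seed: int = 0):
        self.n = n
        self._rng = random.Random(seed)

    def sample(self) -> int:
        return self._rng.randrange(self.n)

    def contains(self, x) -> bool:
        return isinstance(x, int) and 0 <= x < self.n


class CorridorEnv:
    """Toy environment with the gym.Env interface: walk from cell 0 to cell `length`."""

    def __init__(self, length: int = 3):
        self.length = length
        self.action_space = ToyDiscrete(2)                 # 0 = stay, 1 = move right
        self.observation_space = ToyDiscrete(length + 1)   # positions 0, ..., length
        self.position = 0

    def reset(self) -> int:
        self.position = 0
        return self.position

    def step(self, action: int) -> Tuple[int, float, bool, Dict]:
        assert self.action_space.contains(action)
        # Environment dynamics: the transition and the reward are computed here
        self.position = min(self.position + action, self.length)
        done = self.position == self.length
        reward = 1.0 if done else -0.1
        return self.position, reward, done, {}


env = CorridorEnv(length=3)
state = env.reset()
done = False
episode_return = 0.0
n_steps = 0
while not done:
    action = env.action_space.sample()  # random policy
    state, reward, done, info = env.step(action)
    episode_return += reward
    n_steps += 1
```

An environment written against the real *gym.Env* base class would have the same shape, with *gym.spaces* objects in place of `ToyDiscrete`.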

**Some remarks on *gym.Env*:**
- When calling *gym.make('Env_name')*, Gym actually calls a wrapper to create the environment, and not exactly the class itself.
- Gym's pre-built environments do not readily give access to the possible actions in a given state or the rewards corresponding to a new state. To access these, one can use the ***env.unwrapped*** attribute, such as for instance:

In [None]:
        env = gym.make('CartPole-v1')
        # Get action space
        cartpole_action_space = env.unwrapped.action_space
        # Get no. of possible actions
        cartpole_n_actions = env.unwrapped.action_space.n

<a id='2.b'></a>
### b) Models I - Algorithm classes

#### b.1 - Basics

The main documentation can be found at the following links: 

* Getting started: https://stable-baselines3.readthedocs.io/en/master/guide/quickstart.html
* Developer guide: https://stable-baselines3.readthedocs.io/en/master/guide/developer.html#
* Algos: https://stable-baselines3.readthedocs.io/en/master/guide/algos.html
* Base RL class: https://stable-baselines3.readthedocs.io/en/master/modules/base.html
* Saving and loading: https://stable-baselines3.readthedocs.io/en/master/guide/save_format.html
* More algorithms: https://github.com/Stable-Baselines-Team/stable-baselines3-contrib

One of the most interesting features of SB3 is the stable implementation of policy gradient and actor-critic methods, along with the intuitive interface for their training and deployment.

Before delving into details, let's start with the algorithms structure discussed in the developer guide above: 
* The DRL algorithms that train agents are stored in individual submodules of stable_baselines3. 
* Each of these submodules contains two files: policy.py and algo_name.py. The former implements the policy, while the latter implements the training algorithm for the agent.
* In SB3, "policy" refers to all neural nets involved in the algorithm, and not only the learned policy network.
* Each algorithm has **two main methods**. First is *collect_rollouts()* that defines how the samples are collected, and stores them in a *RolloutBuffer* (discarded after grad. update) or a *ReplayBuffer*. Second is the *train()* method that updates the parameters using samples from the buffer.
* All the environments handled by agents are instances of *VecEnv* subclasses (a plain *gym.Env* passed to an algorithm is wrapped automatically).
* At present, the algorithms that SB3 supports (on- and off-policy versions) are A2C, DDPG, DQN, HER, PPO, SAC and TD3.
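
The split between the two main methods can be sketched with a toy on-policy agent. This is a pure-Python illustration under simplifying assumptions, not SB3's implementation: `ToyOnPolicyAlgo` is a hypothetical class, the "environment" is a coin flip, and the only learned parameter is a running value estimate.

```python
import random
from typing import List, Tuple


class ToyOnPolicyAlgo:
    """Sketch of the collect_rollouts()/train() split (hypothetical, not SB3's code)."""

    def __init__(self, n_steps: int, seed: int = 0):
        self.n_steps = n_steps  # number of transitions collected per update
        self.rollout_buffer: List[Tuple[int, float]] = []
        self.num_timesteps = 0
        self.n_updates = 0
        self.value_estimate = 0.0  # the only "parameter" of this toy agent
        self._rng = random.Random(seed)

    def collect_rollouts(self) -> None:
        # On-policy: empty the buffer first, since old samples came from an older policy
        self.rollout_buffer.clear()
        for _ in range(self.n_steps):
            action = self._rng.randrange(2)
            reward = float(action)  # toy reward: 1 for action 1, 0 otherwise
            self.rollout_buffer.append((action, reward))
            self.num_timesteps += 1

    def train(self) -> None:
        # Update the parameters using only the samples currently in the buffer
        mean_reward = sum(r for _, r in self.rollout_buffer) / len(self.rollout_buffer)
        self.value_estimate += 0.5 * (mean_reward - self.value_estimate)
        self.n_updates += 1

    def learn(self, total_timesteps: int) -> "ToyOnPolicyAlgo":
        # Alternate between sample collection and parameter updates
        while self.num_timesteps < total_timesteps:
            self.collect_rollouts()
            self.train()
        return self


algo = ToyOnPolicyAlgo(n_steps=8).learn(total_timesteps=32)
```

An off-policy variant would keep the buffer across updates and sample minibatches from it instead of clearing it.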

On top of the algorithms implemented in the main SB3 library, there is a second library called *stable_baselines3-contrib* (sb3-contrib), where there are more experimental algorithms/agents implemented. Among the presently supported algorithms are ARS (Aug. Rand. Search), TRPO, and Recurrent PPO. The GitHub repo is here:

https://github.com/Stable-Baselines-Team/stable-baselines3-contrib

At this point, it's helpful to give an example of how to use SB3 models. Say we would like to train and deploy a PPO agent in the cartpole environment. The code will look like this:

In [None]:
        import gym
        from stable_baselines3 import PPO
        from stable_baselines3.common.env_util import make_vec_env

        # Create 4 parallel environments
        env = make_vec_env("CartPole-v1", n_envs=4)
        
        # Instantiate agent
        model = PPO("MlpPolicy", env, verbose=1)
        
        # Train
        model.learn(total_timesteps=25000)
        
        # Save
        model.save("ppo_cartpole")

        obs = env.reset()
        while True:
            action, _states = model.predict(obs)
            obs, rewards, dones, info = env.step(action)
            env.render()

Loading a trained agent is done as follows:
 

In [None]:
        from stable_baselines3 import PPO
        model = PPO.load("ppo_cartpole")

As mentioned above, any DRL algorithm consists of two parts: 
* (A) An algorithm class inherited from **BaseAlgorithm(ABC)**, which implements the sample collection (*collect_rollouts()*) and the corresponding learning algorithm (*train()* and *learn()*).
* (B) A policy class inherited from **BasePolicy(torch.nn.Module)**, which gathers all neural nets used in the *collect_rollouts()* and *learn()* methods of the algorithm class. Any instance of a *BaseAlgorithm* subclass has a *policy* attribute that is a *BasePolicy* object.

The *BasePolicy* class is discussed in Section (2.c). For the remainder of Section (2.b) we focus on the specifics of the subclasses of *BaseAlgorithm*.

#### b.2 - Abstract classes - *BaseAlgorithm*, *On/OffPolicyAlgorithm*

The abstract base class underlying all SB3 algorithms is the *BaseAlgorithm* class in *stable_baselines3.common.base_class*. The implementation can be found here:

https://github.com/DLR-RM/stable-baselines3/blob/master/stable_baselines3/common/base_class.py

Stable Baselines 3 separates its DRL implementations into on-policy algorithms and off-policy ones. As such, any algorithm implemented in this library is a subclass of either *OnPolicyAlgorithm* or *OffPolicyAlgorithm*, both of which inherit from *BaseAlgorithm*. These are also in the submodule *stable_baselines3.common*, and their implementations are at:

* On: https://github.com/DLR-RM/stable-baselines3/blob/88e1be9ff5e04b7688efa44951f845b7daf5717f/stable_baselines3/common/on_policy_algorithm.py
* Off: https://github.com/DLR-RM/stable-baselines3/blob/88e1be9ff5e04b7688efa44951f845b7daf5717f/stable_baselines3/common/off_policy_algorithm.py

We have the following class diagram for the algorithms implemented at the time of writing:

<h3 align="center"><img src = "Figures/SB3_BaseAlgorithm_Class_Diagram.png"  height= 340 width = 340></h3>

For the sake of completeness and for future reference, we close this part by listing some of the interface components of SB3 on/off-policy classes.

**Attributes:**
* *policy*: *Policy* object, discussed in more detail below. It is built at instantiation if *_init_setup_model*=True, and can be customized by passing the argument *policy_kwargs* to the constructor.
* *env*: The *gym.Env* environment to learn from. <span style="color:red">**(Requires comments)**</span>
* *learning_rate*: Learning rate for the optimizer, can be a function of the current progress remaining.
* *gamma*: Discount factor.
* *action_noise*: Action noise type (*None* by default) for hard exploration problems. See *common.noise* for the different action noise types.
* *seed*: Seed for the pseudo-random generators.
* *use_sde*: Whether to use generalized State Dependent Exploration (gSDE) instead of action noise exploration (default: False). The sample frequency can be customized with *sde_sample_freq*.
* *device*: Specify CPU or GPU. Will try to use a Cuda compatible device by default and fallback to CPU otherwise.
* For *OnPolicyAlgorithm*, one has attributes *ent_coef* (entropy coeff.), *vf_coef* (value f'n coeff.), and *max_grad_norm*.
* For *OffPolicyAlgorithm*, one has attributes *batch_size*, *buffer_size*, *learning_starts*, *tau* (soft update coeff.), *train_freq*, *gradient_steps* etc.
* *supported_action_spaces*: The action spaces supported by the algorithm.
* In addition to the above, we also have the following attributes:
    * For training monitoring: *monitor_wrapper*, *tensorboard_log*, and *verbose*.
    * More environment attributes: e.g. *support_multi_env*, *create_eval_env*.
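
As an illustration of the remark on *learning_rate* above, here is a small sketch of a schedule written as a function of the remaining progress, which decreases from 1 at the start of training to 0 at the end. The name `linear_schedule` is chosen here for illustration only.

```python
from typing import Callable


def linear_schedule(initial_value: float) -> Callable[[float], float]:
    """Return a schedule mapping the remaining progress (1 -> 0) to a learning rate."""

    def schedule(progress_remaining: float) -> float:
        # Linear decay: full learning rate at the start, zero at the end of training
        return progress_remaining * initial_value

    return schedule


lr = linear_schedule(3e-4)
lr_start, lr_mid, lr_end = lr(1.0), lr(0.5), lr(0.0)
```

Such a callable would then be passed in place of a float, e.g. `learning_rate=linear_schedule(3e-4)` in the constructor of an algorithm.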

The policies are handled using the *stable_baselines3.common.policies* submodule, and the replay buffer classes are in the *stable_baselines3.common.buffers* submodule.

**Methods:**

* *\_\_init\_\_()*: The specific parameters of the constructor depend on the algorithm type (on/off-policy).
* ***collect_rollouts()***: Collects experiences using the current policy and fills a *RolloutBuffer* (on-policy) or a *ReplayBuffer* (off-policy). "Rollout" here refers to the model-free RL notion and **should not be confused** with the rollout concept of model-based RL or planning. This method is implemented in the *On/OffPolicyAlgorithm* classes (not in the Base), and calls the *step()* method of the environment.
* ***train()***: This is an abstract method of *On/OffPolicyAlgorithm*, and is **implemented for each deep RL algorithm individually**. In particular, this is the method containing the usual PyTorch training loop with *loss.backward()* and *optimizer.step()* (for a concrete example, see implementation at https://github.com/DLR-RM/stable-baselines3/blob/master/stable_baselines3/dqn/dqn.py).
* ***learn()***: Returns a trained model, included in the Base class and implemented differently in *On/OffPolicyAlgorithm*. The purpose of this method is to manage the *logger* and *callback* objects during training. It first calls *_setup_learn* to initialize training parameters, and next loops over time steps to collect rollouts using the method above. Once all the callback/logger instructions are done, the method *train* is called to update network weights.
* ***predict()***: This is a method of *BaseAlgorithm* that wraps the *predict()* method of the *policy* attribute (see below).
* *get_parameters()*: Returns the parameters of the agent. This includes parameters from different networks, e.g. critics (value functions) and policies (pi functions).
* The remaining methods manage several other aspects of the interface. For instance:
    * More deep learning functions: *_setup_model*, *_setup_lr_schedule*, *_update_current_progress_remaining*, *_update_learning_rate*, *_get_torch_save_params*, *_setup_learn*, *set_random_seed*.
    * Saving and loading: *set_parameters*, *load*, *save*.
    * Environment methods: *_wrap_env*, *_get_eval_env*, *get_env*, *set_env*, *get_vec_normalize_env*.
    * Memory buffer methods: *_update_info_buffer*.
    * Callback methods: *_init_callback*.
    * Logger methods: *set_logger*, *logger*, *_dump_logs*.

#### b.3 - The *learn()* and *train()* methods

Regarding these two methods, here are the points to keep in mind:

* The first notable difference between the *On/OffPolicyAlgorithm* classes is their *learn()* methods. Both call the methods *_setup_learn()*, *callback.on_training_start()*, *callback.on_training_end()*, *collect_rollouts()*, and both contain the main training loop where the *train()* method is called. The main difference is that *OnPolicyAlgorithm.learn()* records the training process in the *logger* attribute.

* The *train()* method is where the deep reinforcement learning algorithms are effectively implemented. Indeed, *train()* is an abstract method of the *On/OffPolicyAlgorithm* classes that is called by *learn()*, and it is only implemented in their subclasses (A2C, PPO, TD3, DDPG, SAC and DQN).

* The *collect_rollouts()* method is discussed in Sec. 2.b.4 below.
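
The relation between *learn()* and *train()* described above is an instance of the template-method pattern, which the following sketch reproduces with Python's *abc* module. All class names (`ToyBaseAlgo`, `ToyPPO`, `RecordingCallback`) are hypothetical, and the loop is a simplification of the actual *learn()* implementations. The sketch only records the order of calls.

```python
from abc import ABC, abstractmethod
from typing import List


class RecordingCallback:
    """Hypothetical callback that only records when its hooks are invoked."""

    def __init__(self, log: List[str]):
        self.log = log

    def on_training_start(self) -> None:
        self.log.append("on_training_start")

    def on_training_end(self) -> None:
        self.log.append("on_training_end")


class ToyBaseAlgo(ABC):
    """Simplified stand-in for On/OffPolicyAlgorithm: learn() is shared, train() is not."""

    def __init__(self, n_steps: int):
        self.n_steps = n_steps
        self.num_timesteps = 0
        self.calls: List[str] = []

    def _setup_learn(self) -> RecordingCallback:
        self.calls.append("_setup_learn")
        self.num_timesteps = 0
        return RecordingCallback(self.calls)

    def collect_rollouts(self) -> None:
        self.calls.append("collect_rollouts")
        self.num_timesteps += self.n_steps

    @abstractmethod
    def train(self) -> None:
        """Implemented by each concrete algorithm."""

    def learn(self, total_timesteps: int) -> "ToyBaseAlgo":
        callback = self._setup_learn()
        callback.on_training_start()
        while self.num_timesteps < total_timesteps:
            self.collect_rollouts()
            self.train()  # dispatches to the subclass implementation
        callback.on_training_end()
        return self


class ToyPPO(ToyBaseAlgo):
    def train(self) -> None:
        # A real algorithm would run its PyTorch update here
        self.calls.append("train")


model = ToyPPO(n_steps=4).learn(total_timesteps=8)
call_order = model.calls
```

Because `train()` is abstract, `ToyBaseAlgo` itself cannot be instantiated; only subclasses that implement it can.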

<span style="color:red">**(Requires more comments)**</span>

#### b.4 - The *collect_rollouts()* method

This method is implemented in the *On/OffPolicyAlgorithm* classes. <span style="color:red">**(how different are they)**</span>

#### b.5 - Saving and loading models

* When loading a trained model, you do not need to have an environment argument.

* To train a model you need an environment attribute.

<span style="color:red">**(Finish this)**</span>

    
Our next topic is the handling of neural networks and policies in *stable_baselines3*.

<a id='2.c'></a>
### c) Models II - Policy classes



The submodule of interest here is *stable_baselines3.common.policies*, which can be found at:

https://github.com/DLR-RM/stable-baselines3/blob/master/stable_baselines3/common/policies.py.

More information on the structure of SB3's neural nets can be found at the article on customization of policies:

https://stable-baselines3.readthedocs.io/en/master/guide/custom_policy.html

Here we look at the classes from which the *policy* attribute of *BaseAlgorithm* inherits. As previously stated, this attribute stores and manages **all** of the neural networks involved in a given algorithm, not only the policy network. Here is the class diagram of the submodule above:

<h3 align="center"><img src = "Figures/SB3_BaseModel_Class_Diagram.png" height= 340 width = 340></h3>

The next paragraphs discuss the following specifics: 

1) How SB3 subdivides neural nets.
2) Probabilistic policies of actor-critic algorithms
3) The critic network of a deterministic policy algorithm
4) The training process of SB3 policies

<span style="color:red">**Note (22/09/30): This section to be re-written. Each of the points in the above list should in fact be a subsection**</span>

**(1) Neural networks** in SB3 are separated into two main parts:
* A **features extractor** network that is usually shared between the actor and the critic (for computational efficiency). This could be a CNN for instance when dealing with images. This part of the model is controlled by the ***features_extractor***, ***features_extractor_class***, ***share_features_extractor*** and ***features_extractor_kwargs*** attributes.
* A **fully-connected network** mapping features to the policy network (actions) or to the Q-function (values). The attribute specifying this DNN is the ***net_arch*** parameter.

The feature extractor of an SB3 "policy" is implemented as a separate class in the *common.torch_layers* submodule:

https://github.com/DLR-RM/stable-baselines3/blob/master/stable_baselines3/common/torch_layers.py.

As discussed at the "Custom Policy Network" page, writing a custom policy could involve writing a custom feature extractor class. For a concrete example of CNNs as feature extractors, one can look at the *ActorCriticCnnPolicy* class.

**(2)** For algorithms with **probabilistic policies**, 

https://github.com/DLR-RM/stable-baselines3/blob/master/stable_baselines3/common/distributions.py

**finish me...**

(3) The ContinuousCritic class

**finish me...**

(4) Training

**finish me...**

**Attributes:**

BaseModel/BasePolicy:

* *observation_space*, *action_space*
* *features_extractor*, *features_extractor_class*, *features_extractor_kwargs*
* *optimizer*, *optimizer_class*, *optimizer_kwargs*
* *normalize_images*

ActorCriticPolicy:
* *net_arch*, *activation_fn*, *ortho_init*
* *dist_kwargs*, *action_dist*

ContinuousCritic:

**Methods:**

BaseModel/BasePolicy:
* *_get_constructor_parameters()*
* *_update_features_extractor()*, *make_features_extractor()*, *extract_features()*
* *set_training_mode()* (policy, calls nn.Module.train())
* *device()*
* *save()*, *load()*
* *init_weights()*, *load_from_vector()*
* *_predict()* (abstract), *predict()*

ActorCriticPolicy:
* *_build_mlp_extractor()*: Creates the feature extractor layers for the policy and value networks. Part of the layers can be shared.
* *_build()*: Creates the policy network (*action_net*), the critic network (*value_net*), and the optimizer (*torch.optim*). Calls *_build_mlp_extractor()*.
* *reset_noise()*: (gSDE) Samples new weights for the exploration matrix.
* *forward()*: Forward pass in all the networks (actor and critic).
* *_predict()*: Get the action according to the policy for a given observation.
* *_get_action_dist_from_latent()*: Retrieve action distribution given the latent codes.
* *evaluate_actions()*: Evaluate actions according to the current policy, given the observations.
* *get_distribution()*: Get the current policy distribution given the observations.

ContinuousCritic:



#### Saving and loading

<a id='2.d'></a>
### d) Loggers
**Note (22/09/19):** This is about the information logged during model training. In *BaseAlgorithm*, one can find the *logger* property and the *set_logger()* method, for instance. See: https://stable-baselines3.readthedocs.io/en/master/common/logger.html#logger


<a id='2.e'></a>
### e) TensorBoard and Callbacks
**Note (22/09/19):** This is mostly about Tensorboard integration. See:

* https://stable-baselines3.readthedocs.io/en/master/guide/callbacks.html
* https://stable-baselines3.readthedocs.io/en/master/guide/tensorboard.html


# ActorCriticPolicy
##### 22/10/02

Focus for now on stochastic policies. Subdivide discussion into:
- Feature extraction layer (torch_layers)
- Probability distributions (distributions)
- Attributes (feature_extr-actor-critic)
- Methods (initializations-training-prediction)

## Probability distributions



## Attributes

These will be organized by neural network.

### Environment, training, and general attributes:
* Env.: *observation_space*, *action_space*
* Training params: *lr_schedule*, *ortho_init*, *use_sde*, *use_expln*, *normalize_images*.
* Optimizer: *optimizer_class*, *optimizer_kwargs*

### Feature extractor
* *features_extractor_class*, *features_extractor_kwargs*, *features_dim*, *net_arch* and *activation_fn*: Feature extractor parameters. These are modified in the subclasses of *ActorCriticPolicy* (namely *ActorCriticCnnPolicy* and *MultiInputActorCriticPolicy*). Obviously, this imp
* *mlp_extractor*: The feature extractor for the policy. Instantiated by *_build_mlp_extractor()* (called by constructor and *_build()*).


### Actor
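* *action_dist*: Action distribution. Instantiated in the constructor using *common.distributions.make_proba_distribution*. See *common.distributions* for more details.
* *action_net* (and *log_std_init*): Instantiated in the *_build()* method, which is called by the constructor. For a diagonal Gaussian distribution this network outputs the mean of the action, while the log standard deviation is a separate learnable parameter initialized to *log_std_init*. For a categorical distribution it outputs the logits.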

### Critic

* *value_net*: Instantiated in the *_build()* method (and hence by the constructor). It is an *nn.Linear* layer mapping the critic's latent features (from *mlp_extractor*) to a scalar value.


## Methods

These will be organized by task.

### Initializations

* *\_\_init\_\_()*:
* *_build_mlp_extractor()*: Create the feature extraction layers of actor and critic network.
* *_build()*: Instantiate the actor and critic networks and learning optimizer.

* *_get_constructor_parameters()*. Self-explanatory.

* *reset_noise()*: Resamples noise for exploration matrix of state-dependent exploration.

### Training

Finding the methods used for training requires tracking function calls.

* *evaluate_actions()*: Given an observation and actions, get the corresponding values, log probabilities and distribution entropy. See ***PPO.train()*** for a concrete example.

* *predict_values()*: Get the estimated state values $V(s)$ according to the current policy, given an observation. (See ***OnPolicyAlgorithm.collect_rollouts()***)

### Prediction

Strictly speaking, evaluation refers to *forward()* methods and the like, which are also used during training. The following methods are in the *common.policies* submodule.

* *extract_features()*: Inherited from *BaseModel*. Preprocesses the observation if needed and calls the *forward()* method of *features_extractor*.

* *get_distribution()*: Get current policy distribution corresponding to observation.

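* *_get_action_dist_from_latent()*: Builds the action distribution from *latent_pi*, the output of the policy branch of *mlp_extractor* ("latent" refers to this intermediate feature vector). The method passes *latent_pi* through *action_net* and uses the result to parametrize *action_dist*.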

* *forward()*: Forward pass in all networks: feature extractor(s), value_net, and action_net. Calls *extract_features()*, then *self.mlp_extractor.forward()*, then *self.value_net.forward()*, then actions from the actor distribution. (See ***OnPolicyAlgorithm.collect_rollouts()***)

* *_predict()*: Given an observation, get action from the policy distribution. Wraps the instruction self.get_distribution(observation).get_actions(deterministic=deterministic).





# PPO algorithm class
##### 22/10/03

Subclass of *OnPolicyAlgorithm*. Will have to comment on:

* collect_rollouts() and where it is called.
* Environment argument for training.
* Environment call for prediction and *step()*.


**Notes - PPO train()/learn():**

*PPO* is a subclass of *OnPolicyAlgorithm*. **Crucial:** *learn()* is implemented in *OnPolicyAlgorithm*, and calls the abstract method *train()* at the end of each iteration of its main loop. The latter is implemented in the *PPO* class, where *train()* calls the usual PyTorch training functions. 

The *PPO.learn()* method just returns the output of *OnPolicyAlgorithm.learn()*.

**Notes - PPO train() and policy calls**
* Reference: https://github.com/DLR-RM/stable-baselines3/blob/master/stable_baselines3/ppo/ppo.py
* The PPO algorithm is really implemented in the *train()* method.
* There are two losses that are computed in the main loop.
    
    (i) The value loss (line 243)
        
        value_loss = F.mse_loss(rollout_data.returns, values_pred)
        
    (ii) The PPO objective function loss (line 255):
    
        loss = policy_loss + self.ent_coef * entropy_loss + self.vf_coef * value_loss
        
* The optimization step has the following calls:

        # Optimization step
        self.policy.optimizer.zero_grad()
        loss.backward()
        # Clip grad norm
        th.nn.utils.clip_grad_norm_(self.policy.parameters(),
                                    self.max_grad_norm)
        self.policy.optimizer.step()
        
    Note here that the optimization is also done on the critic network, since its parameters are used for the computation of the loss function.
        
* Additional *PPO* class attributes: *target_kl*, *ent_coef*, *vf_coef*, *max_grad_norm*, *policy_kwargs*, *_init_setup_model*

* *PPO.train()* calls the *policy.evaluate_actions()* method of the *ActorCriticPolicy*. Recall:

        def evaluate_actions(self, obs, actions):
            # Preprocess the observation if needed
            features = self.extract_features(obs)
            latent_pi, latent_vf = self.mlp_extractor(features)
            distribution = self._get_action_dist_from_latent(latent_pi)
            log_prob = distribution.log_prob(actions)
            values = self.value_net(latent_vf)
            return values, log_prob, distribution.entropy()
    
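    This looks like the evaluation of a Q-function, but *values* is the state value V(s) output by *value_net*: the actions only enter through *log_prob*.
* The above is inside a loop over minibatches of transitions sampled from the rollout buffer *PPO.rollout_buffer*.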
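
To make the two losses above concrete, here is a minimal pure-Python sketch of the loss computed on one minibatch. It assumes the standard clipped-surrogate form of the policy loss, which is not quoted in these notes, and the function name *ppo_losses* as well as the default values of *clip_range*, *ent_coef* and *vf_coef* are choices made for this illustration. The inputs play the role of the outputs of *evaluate_actions()* and of the rollout data.

```python
import math


def ppo_losses(log_prob, old_log_prob, advantages, values_pred, returns,
               entropy, clip_range=0.2, ent_coef=0.0, vf_coef=0.5):
    """Return (policy_loss, value_loss, entropy_loss, loss) for one minibatch.

    All arguments except the three coefficients are lists of floats
    of the same length (one entry per sampled transition).
    """
    n = len(log_prob)
    surrogate = []
    for lp, old_lp, adv in zip(log_prob, old_log_prob, advantages):
        # Probability ratio between the current and the old policy
        ratio = math.exp(lp - old_lp)
        clipped_ratio = min(max(ratio, 1.0 - clip_range), 1.0 + clip_range)
        # Pessimistic (clipped) surrogate objective
        surrogate.append(min(adv * ratio, adv * clipped_ratio))
    policy_loss = -sum(surrogate) / n

    # Mean squared error between returns and predicted values
    value_loss = sum((r - v) ** 2 for r, v in zip(returns, values_pred)) / n

    # Entropy bonus, written as a loss to minimize
    entropy_loss = -sum(entropy) / n

    loss = policy_loss + ent_coef * entropy_loss + vf_coef * value_loss
    return policy_loss, value_loss, entropy_loss, loss


# Toy minibatch of two transitions where the policy has not changed yet
losses = ppo_losses(log_prob=[-1.0, -2.0], old_log_prob=[-1.0, -2.0],
                    advantages=[1.0, -1.0], values_pred=[0.0, 2.0],
                    returns=[1.0, 2.0], entropy=[1.0, 1.0], ent_coef=0.01)
print(losses)
```

Since a single optimizer step is taken on this combined loss, the actor and critic parameters are updated together.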

**Notes - OnPolicyAlgorithm.collect_rollouts():**
* Ref: https://github.com/DLR-RM/stable-baselines3/blob/master/stable_baselines3/common/on_policy_algorithm.py.
* This method calls *policy.forward()* (line 170) and *policy.predict_values()* (lines 205 and 214), where *policy*=*ActorCriticPolicy*.

## Environment and step() - FinRL
##### 22/10/03

Here is the order of calls during model deployment (backtesting in my case). If *model* is an SB3 algorithm, before calling *env.step(actions)*:

1) *actions* is an output of *model.predict(obs, deterministic)*.
2) *BaseAlgorithm.predict()* wraps *BaseAlgorithm.policy.predict()*. *BaseAlgorithm.policy* = *BasePolicy*.
3) *BasePolicy.predict()* in turn calls *BasePolicy._predict()*, which is an abstract method of *BasePolicy*.
4) Several cases here:
    
    * a - Stochastic policy: *policy* = *ActorCriticPolicy*. Here: *ActorCriticPolicy._predict()* wraps *ActorCriticPolicy.get_distribution(obs).get_actions(deterministic)*.
    
    * b - TD3 deterministic policy: *policy* = *TD3Policy*. Here *TD3Policy._predict()* wraps *TD3Policy.actor.forward()*. (Note also that *TD3Policy.forward()* wraps *TD3Policy._predict()*.)  
    
    * c - SAC policy: *policy* = *SACPolicy*. Here *SACPolicy._predict()* wraps *SACPolicy.actor.forward()*. (Note also that *SACPolicy.forward()* wraps *SACPolicy._predict()*.)  
    
    * d - DQN case...
    
5) With the same order of cases above:
    
    * a - *ActorCriticPolicy.get_distribution(obs).get_actions(deterministic)* wraps ....
    
    * b - TD3 deterministic policy: *TD3Policy.actor* = *Actor*. Here *Actor.forward()* wraps *Actor.mu.forward()*. The *Actor.mu* attribute is an *nn.Sequential* object (see TD3.policies).
    
    * c - SAC policy...
    
    * d - DQN case...
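
The chain of calls above is an instance of the template-method pattern: the public *predict()* lives in the base class and delegates to a subclass-specific *_predict()*. The sketch below illustrates this with scalar observations and actions. All class names (*ToyBasePolicy*, *ToyDeterministicPolicy*, *ToyStochasticPolicy*) and the "networks" inside them are hypothetical placeholders, not SB3 code.

```python
import random
from abc import ABC, abstractmethod


class ToyBasePolicy(ABC):
    """Plays the role of BasePolicy: predict() wraps the abstract _predict()."""

    @abstractmethod
    def _predict(self, obs: float, deterministic: bool = False) -> float:
        """Subclass-specific computation of the action."""

    def predict(self, obs: float, deterministic: bool = False) -> float:
        # The real method also converts observations to tensors, etc.
        return self._predict(obs, deterministic)


class ToyDeterministicPolicy(ToyBasePolicy):
    """TD3-like case: _predict() simply evaluates the actor network."""

    def actor(self, obs: float) -> float:
        return 0.5 * obs  # placeholder for Actor.forward()

    def _predict(self, obs: float, deterministic: bool = False) -> float:
        return self.actor(obs)


class ToyStochasticPolicy(ToyBasePolicy):
    """ActorCriticPolicy-like case: build a distribution, then sample or take its mode."""

    def __init__(self, std: float = 0.1, seed: int = 0):
        self.std = std
        self.rng = random.Random(seed)

    def _predict(self, obs: float, deterministic: bool = False) -> float:
        mean = 0.5 * obs  # placeholder for the mean computed by action_net
        return mean if deterministic else self.rng.gauss(mean, self.std)


for policy in (ToyDeterministicPolicy(), ToyStochasticPolicy()):
    print(type(policy).__name__, policy.predict(1.0, deterministic=True))
```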

# ***torch_layers* submodule** 
##### 22/10/14

Source: https://github.com/DLR-RM/stable-baselines3/blob/master/stable_baselines3/common/torch_layers.py

This one contains several classes and helper functions:

### Classes:

* *BaseFeaturesExtractor(nn.Module)*: Base class for feature extractors.

* *FlattenExtractor(BaseFeaturesExtractor)*: Feature extractor flattening input. Used as placeholder when feature extraction not needed.

* *NatureCNN(BaseFeaturesExtractor)*: The CNN from Mnih et al 2015 DQN paper.

* *CombinedExtractor(BaseFeaturesExtractor)*: Used with *Dict* observation spaces. Builds a feature extractor for each key of the dictionary, so that inputs are fed through separate submodules, and then concatenated through an additional "combined" MLP net.

* *MlpExtractor(nn.Module)*: Constructs an MLP whose inputs are outputs of a previous feature extractor (e.g. CNN) or environment observations. Outputs a latent representation for the policy and value networks. Has a *net_arch* parameter that specifies the no. and size of hidden layers, as well as the no. of shared layers between the policy and value nets.

### Helper functions:

* *create_mlp(input_dim, output_dim, net_arch, activation_fn, squash_output)*: Creates a multi layer perceptron (MLP), which is a collection of fully-connected layers each followed by an activation function. Output is a list of *nn.Module* objects.

* *get_actor_critic_arch(net_arch)*: Gets the policy and value network architectures for the off-policy actor-critic algorithms (SAC, DDPG, TD3). Has a *net_arch* argument that specifies the no. and size of hidden layers. Note: Other than the feature extraction layers, no other layers can be shared by the actor and critic networks. This is to avoid evaluation issues with the target networks.
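
As an illustration of how *net_arch* translates into layers, the following torch-free sketch lists the layers that a *create_mlp()*-style helper would stack. It returns strings instead of *nn.Module* objects, and the function name *mlp_layer_specs* is hypothetical. The layer ordering reflects my reading of the SB3 source and should be checked against it.

```python
from typing import List


def mlp_layer_specs(input_dim: int, output_dim: int, net_arch: List[int],
                    activation: str = "ReLU",
                    squash_output: bool = False) -> List[str]:
    """Describe the layers of an MLP as a list of strings."""
    layers: List[str] = []
    last_dim = input_dim
    # One fully-connected layer followed by an activation per hidden size
    for hidden_dim in net_arch:
        layers.append(f"Linear({last_dim}, {hidden_dim})")
        layers.append(activation)
        last_dim = hidden_dim
    # Optional output layer (no activation after it)
    if output_dim > 0:
        layers.append(f"Linear({last_dim}, {output_dim})")
    # Optional squashing of the output to [-1, 1]
    if squash_output:
        layers.append("Tanh")
    return layers


# An actor-like network: 3 input features, 1 action, two hidden layers
specs = mlp_layer_specs(3, 1, [400, 300], squash_output=True)
print(specs)
```

In SB3 the resulting list is then unpacked into *nn.Sequential*, as in the *Actor* and *ContinuousCritic* classes discussed later.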

**To do (22/10/14):** Clarify when/where this submodule is called.

**Notes:** Is this *ActorCriticPolicy.forward()* method really important?

# The *TD3.td3* and *TD3.policies* submodules
##### 22/10/17

The source code of this part is at the following links:

* TD3 algorithm: https://github.com/DLR-RM/stable-baselines3/blob/master/stable_baselines3/td3/td3.py
* TD3 policy: https://github.com/DLR-RM/stable-baselines3/blob/master/stable_baselines3/td3/policies.py
* Off-policy algorithms: https://github.com/DLR-RM/stable-baselines3/blob/88e1be9ff5e04b7688efa44951f845b7daf5717f/stable_baselines3/common/off_policy_algorithm.py

We note that our goal here is to give an idea of how the library is organized, and to give the readers a starting point to implement their own policies. As such, the present description of SB3 is in no way exhaustive, as there are several features that are not discussed.

In Section (2.b.1), we gave a template of how an SB3 algorithm is instantiated, trained and deployed. We revisit this template below with comments on each step, where we give more details on the objects, functions and submodules involved.

In [None]:
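# Imports
from stable_baselines3 import TD3
from stable_baselines3.common.env_util import make_vec_env

# Create the environment (TD3 requires a continuous action space,
# so CartPole-v1, whose actions are discrete, would raise an error)
env = make_vec_env("Pendulum-v1", n_envs=1)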
'''
    Same as gym.make("env_name"), except that we are 
    wrapping the gym.Env object by a SB3 VecEnv class.
    Comment: 1) When instantiating SB3
    models with an argument env of class gym.Env,
    the latter is wrapped by an SB3 VecEnv object.
    2) Continuing (1). In BaseAlgorithm.__init__(),
    find the instruction:
        env = self._wrap_env(env, self.verbose, monitor_wrapper)
    The BaseAlgorithm._wrap_env() method looks for appropriate
    environment and monitor wrappers and re-orders
    image channels.
'''

# Instantiate TD3 algorithm object
model = TD3(policy = "MlpPolicy", env=env, verbose=1)
'''
    This instruction creates a TD3 object, with arguments specifying those of
    the corresponding TD3Policy object (TD3.policy attribute).
    
    I - TD3Policy Constructor arguments:
    ------------------------------------
    1) For BaseAlgorithm, the policy argument is either "MlpPolicy", "CnnPolicy",
    or "MultiInputPolicy". These 3 aliases are always defined in the corresponding 
    algorithm submodule. In the case of TD3, MlpPolicy = TD3Policy, and the feature
    extractor class is FlattenExtractor. "CnnPolicy" is a subclass of TD3Policy for 
    which the feature extractor is a CNN (NatureCNN by default), typically used for
    image data. "MultiInputPolicy" is a subclass of MlpPolicy for which the 
    environment has a Dict observation space, and whose feature extractor class
    is CombinedExtractor.
    
    2) The env parameter could be a gym.Env, VecEnv or DummyVecEnv object, see comments
    above. This parameter is not needed to load a trained SB3 model.
    
    3) To further customize the network architecture, one specifies the following
    parameters: net_arch, activation_fn, features_extractor_class, 
    features_extractor_kwargs, and share_features_extractor.
    
    4) The optimizer parameters are optimizer_class and optimizer_kwargs. By 
    default, the optimizer class is torch.optim.Adam with PyTorch's default 
    arguments (the eps = 1e-5 override is specific to ActorCriticPolicy).
    
    5) The remaining training-related parameters are lr_schedule and 
    normalize_images.
    
    II - TD3Policy Neural nets:
    ---------------------------
    1) The TD3Policy constructor calls TD3Policy._build() to initialize 4 neural nets, 
    each of which is an attribute of the class TD3Policy. The helper functions used 
    in _build() are make_features_extractor(), make_actor(), and make_critic(). 
    Unlike the last 2 functions, make_features_extractor() is inherited from 
    BaseModel (see line 114 of the common.policies submodule).
    
    2) Actor networks: The TD3Policy.actor and TD3Policy.actor_target attributes are
    instances of the Actor(BasePolicy) class from the TD3.policies submodule. 
    In detail, the constructor of Actor calls common.torch_layers.create_mlp() to 
    build the fully-connected layers and activation functions applied to the 
    extracted features, then assembles them into a single module with:
        (Line 60) self.mu = nn.Sequential(*actor_net) 
    The function TD3Policy.actor.forward() extracts the features and wraps mu.forward().
    
    3) Critic: The TD3Policy.critic and TD3Policy.critic_target attributes are
    instances of the ContinuousCritic(BaseModel) class from the common.policies submodule.
    This class allows for several critic networks, and lets one decide whether or not
    to share the feature extraction layers with the actor. 
    The main attribute of ContinuousCritic is the list q_networks, whose entries, as for
    the actor, are created using common.torch_layers.create_mlp() and by setting:
        (Line 875) q_net = nn.Sequential(*q_net)
    for each critic (see common.policies).
    The ContinuousCritic.forward() function returns a tuple containing the output
    of each of these networks, while the ContinuousCritic.q1_forward() function
    returns only the output of the first entry in the ContinuousCritic.q_networks list.
    
    4) Aliases: The TD3 model constructor calls a helper method called _setup_model(),
    which initializes the TD3.policy object as a TD3Policy instance, and then calls 
    _create_aliases(), which initializes the TD3.actor, TD3.actor_target, TD3.critic, and 
    TD3.critic_target attributes. These are equated to the attributes of the same name 
    of TD3.policy, and used in the TD3.train() method discussed below.
    
'''
        
# Train TD3 model
model.learn(total_timesteps=25000)
'''
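    1) OffPolicyAlgorithm.learn() is implemented in the base class, and its main
    loop calls:
        a) OffPolicyAlgorithm.collect_rollouts()
        b) TD3.train(), which implements the abstract train() method
        
    2) For (a) above, collect_rollouts() calls OffPolicyAlgorithm._sample_action(),
    which samples random actions during the warm-up phase (learning_starts) and
    calls the predict() method inherited from BaseAlgorithm afterwards. The details
    on the predict() method are given at the deployment step below.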
    
    3) For (b) above, TD3.train() implements the TD3 algorithm. It uses all the
    neural networks built by the constructor with the following
    calls in the main loop:
        (Line 168) next_actions = (self.actor_target(replay_data.next_observations)
                                  + noise).clamp(-1, 1)
        (Line 171) next_q_values = th.cat(self.critic_target(replay_data.next_observations, 
                                   next_actions), dim=1)
        (Line 176)  current_q_values = self.critic(replay_data.observations, 
                                       replay_data.actions)
        (Line 190) actor_loss = -self.critic.q1_forward(replay_data.observations, 
                                self.actor(replay_data.observations)).mean()
    All of these are calls to the forward() method of the corresponding network.
    (The line numbers refer to the TD3.td3 submodule.)
    
    4) The critic loss computation and its gradient step are performed in
    lines 178 to 185 of the TD3.td3 submodule.
    
    5) The actor loss computation and its gradient step are performed in 
    lines 189 to 196 of the TD3.td3 submodule.
'''
        
# Deploy model
obs = env.reset()
while True:
    # Predict actions
    action, _states = model.predict(obs)
    '''
        1) At the level of abstract classes, the above is BaseAlgorithm.predict(),
        which wraps BaseAlgorithm.policy.predict(), which in turn wraps the abstract
        method BasePolicy._predict() of the policy class.
        
        2) In the case of TD3: TD3Policy._predict() wraps the method:
            TD3Policy.actor.forward()
        (which is the same here as TD3Policy.forward().)
        
        3) The above wraps the method :
            Actor.mu.forward()
    '''
    # One time step
    obs, rewards, dones, info = env.step(action)
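
The target computation inside *TD3.train()* (lines 168 to 176 quoted in the training step) can be sketched for a single transition with scalar observations and actions. This is a hedged illustration, not SB3 code: the function name *td3_target_q*, the toy networks and the default values of *gamma*, *target_policy_noise* and *target_noise_clip* are assumptions made for this example, and the minimum over the critics and the Bellman target follow the usual statement of TD3.

```python
import random


def clamp(x: float, low: float, high: float) -> float:
    return max(low, min(high, x))


def td3_target_q(reward, done, next_obs, actor_target, critic_targets,
                 gamma=0.99, target_policy_noise=0.2, target_noise_clip=0.5,
                 rng=random):
    """Clipped double-Q target with target policy smoothing (one transition)."""
    # Target policy smoothing: clipped Gaussian noise added to the target action
    noise = clamp(rng.gauss(0.0, target_policy_noise),
                  -target_noise_clip, target_noise_clip)
    next_action = clamp(actor_target(next_obs) + noise, -1.0, 1.0)
    # Clipped double-Q learning: minimum over the target critics
    next_q = min(q(next_obs, next_action) for q in critic_targets)
    # Bellman target; no bootstrapping on terminal transitions
    return reward + (1.0 - done) * gamma * next_q


# Toy "networks": plain functions of the observation and the action
def toy_actor_target(obs):
    return 0.5 * obs


def toy_q1(obs, action):
    return obs + action


def toy_q2(obs, action):
    return obs + action + 1.0


target = td3_target_q(reward=1.0, done=0.0, next_obs=1.0,
                      actor_target=toy_actor_target,
                      critic_targets=[toy_q1, toy_q2],
                      target_policy_noise=0.0)  # noise disabled for readability
print(target)
```

The critic loss is then the squared error between each current Q-value and this target, while the actor loss only uses the first critic through *q1_forward()*.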
    

<a id='2.d'></a>
### d) Models III - Neural nets for stochastic policies

<a id='2.d.1'></a>
#### d.1 - The *ActorCriticPolicy* class

In this section we discuss the second design pattern used by Stable Baselines 3 for policies. We will mostly focus on the stochastic policies of on-policy algorithms such as A2C and PPO, by looking into the details of the *ActorCriticPolicy* class in the *common.policies* submodule.

In contrast to the previous section, where the actor and critic networks were represented by distinct classes, the *ActorCriticPolicy* class uses several common layers between the actor and critic networks. In more detail:
* On top of the *features_extractor* layers, *ActorCriticPolicy* has a *mlp_extractor* attribute that computes auxiliary tensors *latent_vf* and *latent_pi*. These layers feed the *value_net* and *action_net* attributes.
* The *features_extractor* and *mlp_extractor* layers are specified by the *features_extractor_class*, *features_extractor_kwargs*, *net_arch*, and *activation_fn* parameters of the constructor.
* The stochastic policy is encoded by the *action_dist* attribute, which is a *Distribution* object from the *common.distributions* submodule. 
* The *action_net* attribute represents the neural net computing the mean of *action_dist*, while *log_std* holds the log-standard deviation of this distribution (a state-independent *nn.Parameter* in the Gaussian case).

Section 2.(d.2) gives more details on the *Distribution* class of which *action_dist* is an instance. This is particularly important here, as it solves the additional layer of complexity coming from encoding a probability distribution and appropriately evaluating random actions during training.

When instantiating an *ActorCriticPolicy* object, the attributes discussed above are initialized in the following order:
1) *ActorCriticPolicy.__init__()* first initializes *features_extractor*.
2) *action_dist* is initialized using *common.distributions.make_proba_distribution()*, according to the class of the action space (see next subsection for more details).
3) The constructor calls the *_build()* method that calls *_build_mlp_extractor()*, which in turn initializes *mlp_extractor* (an *MlpExtractor* object, implemented in *common.torch_layers*).
4) Within *_build()*, the attributes *action_net* and *log_std* are initialized using *action_dist.proba_distribution_net()*, depending on the distribution class used.
5) *_build()* initializes *value_net* as a *nn.Linear* object with one output feature.

The precise outputs of *value_net*, *action_net*, and *log_std* are computed in the *ActorCriticPolicy.forward()* method, reproduced in the next cell with comments.

In [None]:
    def forward(self, obs: th.Tensor, deterministic: bool = False) -> Tuple[th.Tensor, th.Tensor, th.Tensor]:
        """
        Forward pass in all the networks (actor and critic)
        :param obs: Observation
        :param deterministic: Whether to sample or use deterministic actions
        :return: action, value and log probability of the action
        """
        ### This part calls the common.torch_layers submodule
        # Preprocess the observation if needed
        features = self.extract_features(obs)
        '''
            This is a method inherited from BaseModel that
            performs the following sequence of calls:
            BaseModel.extract_features()->features_extractor.forward(),
            where features_extractor is a BaseFeaturesExtractor object.
        '''
        
        # Compute latent policy and latent value function
        latent_pi, latent_vf = self.mlp_extractor(features)
        '''
            1) Here, "latent" refers to the intermediate features
            fed to the actor and critic heads, which are computed
            by the MLP extractor from the extracted features.
            2) The line above is in fact a call to
            MlpExtractor.forward(), which returns the outputs
            of:
            - MlpExtractor.policy_net.forward()
            - MlpExtractor.value_net.forward()
        '''
        
        # Evaluate the values for the given observations
        values = self.value_net(latent_vf)
        '''
            This line calls ActorCriticPolicy.value_net, which is 
            initialized in ActorCriticPolicy._build() as:
               self.value_net = nn.Linear(self.mlp_extractor.latent_dim_vf, 1)
        '''
        
        ### This part calls the common.distributions submodule
        # Get action_distribution from latent features
        distribution = self._get_action_dist_from_latent(latent_pi)
        '''
            1) This line involves three attributes:
            - ActorCriticPolicy.action_dist 
            - ActorCriticPolicy.action_net 
            - ActorCriticPolicy.log_std
            and wraps the function:
                ActorCriticPolicy.action_dist.proba_distribution(),
            which returns the action distribution parametrized by the
            output of action_net (and by log_std in the Gaussian case).
            2) For continuous actions distributed according to a Gaussian, 
            action_dist.proba_distribution_net() sets up the mean and std
            via:
                action_net =  nn.Linear(latent_dim, self.action_dim) 
                log_std = nn.Parameter(th.ones(self.action_dim) * log_std_init, 
                                       requires_grad=True)
                (see common.distributions submodule).
        '''
        
        # Compute actions from the action distribution and their log probabilities
        actions = distribution.get_actions(deterministic=deterministic)
        '''
            This calls the Distribution.get_actions() method that
            calls one of the following 2 abstract methods: 
            - sample() if deterministic = False
            - mode() if deterministic = True.
        '''
        log_prob = distribution.log_prob(actions)
        actions = actions.reshape((-1,) + self.action_space.shape)
        
        
        # return output
        return actions, values, log_prob

One of the confusing points about *ActorCriticPolicy* is the variety of methods used to compute the outputs of the neural nets involved. These are as follows:

* *forward()*: Forward pass in all networks: feature extractor, MLP extractor, value_net, and action_net. Used when collecting rollouts (see ***OnPolicyAlgorithm.collect_rollouts()***).
* *evaluate_actions()*: Given an observation and actions, get corresponding values and log probabilities. Used during training (see ***PPO.train()*** for a concrete example).
* *extract_features()*: Inherited from *BaseModel*. Preprocesses the observation if needed and calls the *forward()* method of *features_extractor*.
* *get_distribution()*: Get current policy distribution corresponding to observation.
* *_get_action_dist_from_latent()*: Returns the action distribution from the latent values of features (*latent_pi* output by *mlp_extractor*).
* *_predict()*: Given an observation, get action from the policy distribution. Wraps the instruction self.get_distribution(observation).get_actions(deterministic=deterministic), and is wrapped by *BasePolicy.predict()*. Used when deploying a trained model for instance.
* *predict_values()*: Get the estimated state value V(s) according to the current policy, given an observation. (See ***OnPolicyAlgorithm.collect_rollouts()***)
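
The relation between these entry points can be illustrated with a toy policy acting on scalar observations and actions. This is a sketch under strong simplifying assumptions: the class *ToyActorCritic* and its private helpers are hypothetical, the "networks" are fixed linear maps, and the action distribution is a one-dimensional Gaussian with constant log-standard deviation.

```python
import math
import random


class ToyActorCritic:
    """Hypothetical 1-D actor-critic policy mimicking the methods listed above."""

    def __init__(self, log_std: float = 0.0, seed: int = 0):
        self.log_std = log_std
        self.rng = random.Random(seed)

    def _latent(self, obs):
        # Stands for extract_features() followed by mlp_extractor();
        # here the latent values directly play the role of mean and value.
        latent_pi, latent_vf = 2.0 * obs, -1.0 * obs
        return latent_pi, latent_vf

    def _log_prob(self, mean, action):
        std = math.exp(self.log_std)
        return (-0.5 * ((action - mean) / std) ** 2
                - self.log_std - 0.5 * math.log(2.0 * math.pi))

    def _get_action(self, mean, deterministic):
        # mode() if deterministic, sample() otherwise
        return mean if deterministic else self.rng.gauss(mean, math.exp(self.log_std))

    def forward(self, obs, deterministic=False):
        """Used when collecting rollouts: action, value and log-probability."""
        mean, value = self._latent(obs)
        action = self._get_action(mean, deterministic)
        return action, value, self._log_prob(mean, action)

    def evaluate_actions(self, obs, action):
        """Used during training: re-evaluate actions that were already taken."""
        mean, value = self._latent(obs)
        entropy = 0.5 + 0.5 * math.log(2.0 * math.pi) + self.log_std
        return value, self._log_prob(mean, action), entropy

    def predict_values(self, obs):
        """Critic only."""
        return self._latent(obs)[1]

    def _predict(self, obs, deterministic=False):
        """Actor only: used at deployment through predict()."""
        return self._get_action(self._latent(obs)[0], deterministic)


toy_policy = ToyActorCritic()
action, value, log_prob = toy_policy.forward(1.0, deterministic=True)
print(action, value, log_prob)
print(toy_policy.evaluate_actions(1.0, action))
```

All four methods go through the same latent computation. They differ only in which heads are evaluated and in whether the action is produced or given.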

<a id='2.d.2'></a>
#### d.2 - The *common.distributions* submodule

As one would expect, the *distributions* submodule is central to the implementation of stochastic policies:

https://github.com/DLR-RM/stable-baselines3/blob/master/stable_baselines3/common/distributions.py

It might be important to note that *common.distributions* imports PyTorch's distributions submodule *torch.distributions*, which is itself inspired by TensorFlow's distributions package. Some useful links:
* Torch distributions docs: https://pytorch.org/docs/stable/distributions.html.
* TensorFlow distributions paper: https://arxiv.org/pdf/1711.10604.pdf.
* PyTorch abstract *Distribution* class: https://github.com/pytorch/pytorch/blob/master/torch/distributions/distribution.py.
* PyTorch *Normal* class: https://github.com/pytorch/pytorch/blob/master/torch/distributions/normal.py.
* PyTorch Kullback-Leibler submodule: https://github.com/pytorch/pytorch/blob/master/torch/distributions/kl.py

In particular, SB3's distributions submodule imports the *Bernoulli*, *Categorical*, and *Normal* classes from PyTorch.

For reference, this submodule contains the following:

**Classes:**

* *Distribution(ABC)*: Base abstract class for SB3 distributions. Has one attribute *distribution*, and declares most of the abstract **FINISH THIS**.
* *DiagGaussianDistribution(Distribution)*: To model Gaussian distributions with diagonal covariance matrix on continuous action spaces. Superclass of *SquashedDiagGaussianDistribution*.
* *CategoricalDistribution(Distribution)*: To model discrete distributions on *Discrete* action spaces. Its counterpart for *MultiDiscrete* action spaces is *MultiCategoricalDistribution*.
* *BernoulliDistribution(Distribution)*: Bernoulli distribution for MultiBinary action spaces.
* *StateDependentNoiseDistribution(Distribution)*: Used for state-dependent noise exploration, as used elsewhere in SB3.
* *TanhBijector*: Bijective transformation of a probability distribution. Used in implementation of SAC.

**Helper functions:**

* *sum_independent_dims()*: Sums the per-dimension log probabilities (or entropies) of a distribution whose action components are independent, e.g. a diagonal Gaussian.
* *make_proba_distribution(action_space, use_sde, dist_kwargs)*: Takes a *gym.spaces.Space* argument and distribution arguments to return a *Distribution* instance adapted to the action space class. Notably, if the action space class is *gym.spaces.Box*, then the output is a *DiagGaussianDistribution* object of the appropriate action space dimension. If the action space is of *gym.spaces.Discrete* class, then the output is of *CategoricalDistribution* class.
* *kl_divergence(dist_true, dist_pred)*: Wrapper for PyTorch's *torch.distributions.kl_divergence(P,Q)*. Inputs are SB3 distributions, and output is a *torch.Tensor*. **Note:** Comment on how SB3's distributions wrap PyTorch's.
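
As a first step towards the discussion of sampling and *mode()*, here is a pure-Python sketch of a diagonal Gaussian exposing the methods discussed in this section (*sample()*, *mode()*, *get_actions()*, *log_prob()*, *entropy()*). The class *ToyDiagGaussian* is hypothetical and works on lists of floats instead of tensors. It illustrates the interface only and does not reproduce the SB3 implementation.

```python
import math
import random
from typing import List


class ToyDiagGaussian:
    """Gaussian with diagonal covariance: independent action components."""

    def __init__(self, mean: List[float], log_std: List[float], seed: int = 0):
        self.mean = list(mean)
        self.log_std = list(log_std)
        self.rng = random.Random(seed)

    def sample(self) -> List[float]:
        # Stochastic action: one independent draw per dimension
        return [self.rng.gauss(m, math.exp(s))
                for m, s in zip(self.mean, self.log_std)]

    def mode(self) -> List[float]:
        # Deterministic action: the mode of a Gaussian is its mean
        return list(self.mean)

    def get_actions(self, deterministic: bool = False) -> List[float]:
        return self.mode() if deterministic else self.sample()

    def log_prob(self, actions: List[float]) -> float:
        # Independent dimensions: the joint log-density is the sum
        # of the per-dimension log-densities
        total = 0.0
        for a, m, s in zip(actions, self.mean, self.log_std):
            z = (a - m) / math.exp(s)
            total += -0.5 * z * z - s - 0.5 * math.log(2.0 * math.pi)
        return total

    def entropy(self) -> float:
        # Same remark: entropies of independent components add up
        return sum(0.5 + 0.5 * math.log(2.0 * math.pi) + s for s in self.log_std)


dist = ToyDiagGaussian(mean=[0.0, 1.0], log_std=[0.0, 0.0])
print(dist.get_actions(deterministic=True))
print(dist.get_actions(deterministic=False))
print(dist.log_prob(dist.mode()), dist.entropy())
```

The sums in *log_prob()* and *entropy()* correspond to the role played by *sum_independent_dims()*.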

<span style="color:red">**Finish this. Discuss sampling and mode().**</span>

<a id='2.d.3'></a>
#### d.3 - Example: The *PPO.ppo* and *PPO.policies* submodules

<span style="color:red">**Finish this...**</span>

In [None]:
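# Imports
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

# Create the environment
env = make_vec_env("CartPole-v1", n_envs=1)

# Instantiate PPO algorithm object
model = PPO(policy = "MlpPolicy", env=env, verbose=1)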
'''
    1) Differences between "MlpPolicy", "CnnPolicy", and "MultiInputPolicy" for PPO
    
    2) Action distribution
    
    3) Actor network: 
    
    4) Critic: 
    
    ?) Aliases?
    
'''
        
# Train PPO model
model.learn(total_timesteps=25000)
'''
    1) OnPolicyAlgorithm.learn() is implemented in the base class, and its main
    loop calls:
        a) OnPolicyAlgorithm.collect_rollouts()
        b) PPO.train(), which implements the abstract train() method
        
    2) For (a) above, collect_rollouts() calls 
        (Line 166) actions, values, log_probs = self.policy(obs_tensor)
        (Line 210) values = self.policy.predict_values(obs_as_tensor(new_obs, self.device))
        (in common.OnPolicyAlgorithm submodule)
    
    3) For (b) above, PPO.train() implements the PPO algorithm. 
        (Line 208) values, log_prob, entropy = self.policy.evaluate_actions(rollout_data.observations, actions)
    
    4) Computation of loss functions and gradient step.
'''
        
# Deploy model
obs = env.reset()
while True:
    # Predict actions
    action, _states = model.predict(obs)
    '''
        This line wraps the ActorCriticPolicy._predict() method, which
        returns:
            (Line 613) ActorCriticPolicy.get_distribution(observation).get_actions(deterministic=deterministic)
            (see common.policies and common.distributions)
    '''
    # One time step
    obs, rewards, dones, info = env.step(action)
    