# Getting our snake into the zoo!

<img src="stablebaselineszoo.jpg" alt="drawing" width="600"/>

Stable Baselines Zoo is a training framework for RL. It provides scripts for training, evaluating agents, tuning hyperparameters, plotting results and recording videos. Tuning hyperparameters is done via [Optuna](https://optuna.org/), a hyperparameter optimization framework.

See the Zoo [documentation](https://stable-baselines3.readthedocs.io/en/master/guide/rl_zoo.html) and [repo README](https://github.com/DLR-RM/rl-baselines3-zoo).

## Using Zoo locally

### Steps to install zoo

* `git clone --recursive https://github.com/DLR-RM/rl-baselines3-zoo` to install zoo
* Installing `swig`, `cmake` and `ffmpeg` is very easy on Linux, but quite hard on Windows. As they are used for video capturing only, you can skip them.
* `cd rl-baselines3-zoo`
* `pip install -r .\requirements.txt` This will produce error messages due to the missing swig and cmake; just ignore them. Because of the errors the install stops halfway, so some modules, like seaborn, have to be installed manually. Look in the file `requirements.txt` for the names of the modules.

To use Stable Baselines Zoo, the environment must be a registered OpenAI Gym environment. All the necessary changes to the source code to enable this have already been made (see this [info](https://medium.com/@apoddar573/making-your-own-custom-environment-in-gym-c3b65ff8cdaa) on how to register a custom environment with OpenAI Gym).

### Steps to prepare our gym_snake environment for usage in zoo and prepare zoo itself

* `cd ../gym_snake` This assumes that `rl-baselines3-zoo` is in a folder next to `gym_snake`.
* `pip install -e .` This installs the gym_snake module and registers the snake environment with OpenAI Gym under the name `snake-v0`.
* In file `utils/import_envs.py` add the following lines to import a custom environment, in this case the snake environment:
```
import sys
# insert at 1, 0 is the script path (or '' in REPL)
sys.path.insert(1, '..')
import gym_snake
```
* In file `hyperparameters/dqn.yml` add at the end (after 1st line indented with two spaces):
```
# not yet tuned
snake-v0:
    n_timesteps: !!float 1e6
    policy: 'MlpPolicy'
    learning_rate: !!float 5e-4
    batch_size: 32
    buffer_size: 50000
    learning_starts: 1000
    gamma: 0.99
    target_update_interval: 500
    train_freq: 1
    gradient_steps: -1
    exploration_fraction: 0.1
    exploration_final_eps: 0.02
    policy_kwargs: "dict(net_arch=[64, 64])"
    normalize: "{'norm_obs': True, 'norm_reward': True}"
    device: 'cpu'
```
* In file `hyperparameters/ppo.yml` add at the end (after 1st line indented with two spaces):
```
# not yet tuned
snake-v0:
    n_envs: 1
    policy: 'MlpPolicy'
    n_timesteps: !!float 1e6
    batch_size: 64
    n_steps: 2048
    gamma: 0.99
    learning_rate: 0.0003
    ent_coef: 0.0
    clip_range: 0.2
    n_epochs: 10
    gae_lambda: 0.95
    max_grad_norm: 0.5
    vf_coef: 0.5
    normalize: "{'norm_obs': True, 'norm_reward': True}"
    device: 'cpu'
```
* remove `device: 'cpu'` from `hyperparameters/dqn.yml` and `hyperparameters/ppo.yml` if your GPU is faster than your CPU. You can easily test this by doing a training run on CPU and one on GPU.


### Some examples of using zoo

**Training:**
  * `python train.py --verbose 1 --tensorboard-log tensorboard_log --algo ppo --env snake-v0 --env-kwargs "grid_size:[6, 6]" snake_size:2 --n-timesteps 3000` trains the snake with zoo using the hyperparameters specified in `hyperparameters/ppo.yml`, evaluates the model every 10000 steps (default eval_freq is 10000) and saves the model at the end of the session in `snake-v0.zip`. Evaluating means: the performance of the agent is measured over 5 evaluation episodes and the model is saved in `best_model.zip` if it is better than the best model of all previous evaluations.
  * `python train.py --verbose 1 --tensorboard-log tensorboard_log --algo ppo --env snake-v0 --env-kwargs "grid_size:[6, 6]" snake_size:2 --eval-freq 1000 --save-freq 2000 --n-timesteps 10000` as previous, but with intermediary evaluation every 1000 steps and intermediary saving of the model every 2000 steps, ending up with 5 saved models as the number of steps is 10000. 
  * `python train.py --verbose 1 --tensorboard-log tensorboard_log --algo ppo --env snake-v0 --env-kwargs "grid_size:[6, 6]" snake_size:2 -i logs/ppo/snake-v0_1/best_model.zip --n-timesteps 3000` as previous but continue training a preloaded model, in this case the best model of the mentioned experiment.
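
The `--env-kwargs` option takes `key:value` pairs, which is why `grid_size:[6, 6]` has to be quoted: it contains a space. As a rough illustration of how such pairs end up as keyword arguments for the environment, here is a minimal sketch. `parse_env_kwargs` is a hypothetical helper, not Zoo's own parser, and it assumes that all values are plain Python literals.

```python
import ast


def parse_env_kwargs(pairs):
    """Turn ['grid_size:[6, 6]', 'snake_size:2'] into a keyword dict.

    Hypothetical helper for illustration only; Zoo's parser may differ.
    """
    kwargs = {}
    for pair in pairs:
        # Split on the first colon only, so values may contain colons.
        key, _, value = pair.partition(":")
        # literal_eval only accepts Python literals (numbers, lists, ...).
        kwargs[key] = ast.literal_eval(value)
    return kwargs


env_kwargs = parse_env_kwargs(["grid_size:[6, 6]", "snake_size:2"])
print(env_kwargs)
```

The resulting dict is what would be passed on to the constructor of the snake environment.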

**Enjoying a trained model:**
  * `python enjoy.py --verbose 1 --algo ppo --env snake-v0 --folder logs/` enjoy the **last** saved model from the **last** experiment.
  * `python enjoy.py --verbose 1 --algo ppo --env snake-v0 --folder logs/ --load-best` enjoy the **best** saved model from the **last** experiment.
  * `python enjoy.py --verbose 1 --algo ppo --env snake-v0 --folder logs/ --load-best --exp-id <exp-id>` enjoy the **best** saved model from the experiment with id `<exp-id>`.

**Hyperparameter optimization (typically prior to training):**
  * `python train.py --verbose 1 --algo ppo --env snake-v0 --env-kwargs "grid_size:[6, 6]" snake_size:2 -n 50000 -optimize --n-trials 1000 --n-jobs 2 --sampler random --pruner median` use Optuna for hyperparameter optimization (1000 trials of 50000 steps each, 2 parallel jobs).

The latter command first loads the hyperparameters specified in `hyperparameters/ppo.yml` (the command uses `--algo ppo`) and then uses the values in `utils/hyperparams_opt.py` as the hyperparameter search space. It uses a random sampler and median pruner ([Optuna tutorial](https://optuna.readthedocs.io/en/stable/tutorial/10_key_features/003_efficient_optimization_algorithms.html#)). It takes several days to run (°◇°).
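
To get a feeling for what a random sampler combined with a median pruner does, here is a toy sketch in plain Python. It is not Optuna's implementation: `toy_score` is a made-up stand-in for "train for a while, then evaluate", and the search space (a single learning rate) is an assumption for illustration.

```python
import math
import random
import statistics


def toy_score(learning_rate, checkpoint):
    """Made-up stand-in for 'train a while, then evaluate the agent'."""
    distance = abs(math.log10(learning_rate) + 3.5)  # best near 3e-4
    return (checkpoint + 1) * (1.0 - distance)


def random_search(n_trials=20, n_checkpoints=5, seed=0):
    rng = random.Random(seed)
    finished = []  # per finished trial: its score at every checkpoint
    best_lr, best_score, n_pruned = None, -math.inf, 0
    for _ in range(n_trials):
        # Random sampler: draw the learning rate log-uniformly.
        lr = 10 ** rng.uniform(-5, -2)
        scores = []
        for checkpoint in range(n_checkpoints):
            score = toy_score(lr, checkpoint)
            scores.append(score)
            # Median rule: compare with finished trials at this checkpoint.
            earlier = [s[checkpoint] for s in finished]
            if earlier and score < statistics.median(earlier):
                n_pruned += 1
                break  # stop this trial early
        else:
            # The trial was never pruned, so it counts as finished.
            finished.append(scores)
            if scores[-1] > best_score:
                best_lr, best_score = lr, scores[-1]
    return best_lr, best_score, n_pruned


best_lr, best_score, n_pruned = random_search()
print(f"best lr={best_lr:.2e}, score={best_score:.2f}, pruned={n_pruned}")
```

The point of pruning is that unpromising trials are stopped at an early checkpoint, so most of the compute goes to trials that are at least as good as the median so far.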

All results of a training run or optimization run (used hyperparameter settings, best hyperparameter settings, tensorboard logs, trained models) are stored in a separate folder for each run, e.g. `logs/ppo/snake-v0_1`.
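
When many runs have piled up, it can be handy to look up the newest run folder from a script. The sketch below assumes the naming scheme seen above (`logs/<algo>/<env>_<id>` with a numeric id); `latest_run` is a hypothetical helper, not part of Zoo.

```python
import tempfile
from pathlib import Path


def latest_run(log_root, algo, env_id):
    """Return the run folder with the highest numeric id, or None."""
    runs = []
    for folder in Path(log_root, algo).glob(f"{env_id}_*"):
        suffix = folder.name[len(env_id) + 1:]
        if folder.is_dir() and suffix.isdigit():
            runs.append((int(suffix), folder))
    # Compare ids as numbers, so that run 10 is newer than run 2.
    return max(runs, key=lambda run: run[0])[1] if runs else None


# Demo on a throw-away folder structure.
with tempfile.TemporaryDirectory() as root:
    for run_id in (1, 2, 10):
        Path(root, "ppo", f"snake-v0_{run_id}").mkdir(parents=True)
    newest = latest_run(root, "ppo", "snake-v0").name
    missing = latest_run(root, "dqn", "snake-v0")
print(newest)
```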


### Source code changes made to `gym` and `zoo`

These source code changes are not essential.

* added print statement in `miniconda3\envs\sb3\Lib\site-packages\gym\wrappers\time_limit.py` when max episode steps has been reached
* changed the location where tensorboard logs are stored in `rl-baselines3-zoo\utils\exp_manager.py`, lines 143 and 155.


## Getting learning behavior!

We've got a basic understanding of RL. We've set up Zoo, allowing us to efficiently run experiments. We're fully equipped to teach our snake to behave! Let the fun begin! This section discusses pitfalls, strategies, considerations and untested hypotheses. Note that the strategies are not mutually exclusive; they can be combined.

**Strategy: play with number of steps**
The first basic strategy that you'll probably try:
1. Make an intelligent guess for the values of the hyperparameters. Very often these will be the default values. Unfortunately, the default values of Stable Baselines have not been chosen to give decent initial performance. Maybe this is because different environments ask for very different hyperparameter values, meaning that smart default values are simply not possible. I'm not sure.
2. Choose a number of steps, train the agent and monitor how `ep_rew_mean` is evolving in Tensorboard. As long as `ep_rew_mean` increases, longer training makes sense. Here it comes in handy that interrupting a training session with Ctrl-c will save the model before stopping the session.
3. Enjoy the result :).

**Strategy: tune hyperparameters before training**
1. Prepare `hyperparameters/ppo.yml` to your best knowledge.
2. Do an optimize run with a limited number of timesteps (e.g. 50000) for, let's say, 100 trials. This takes 4-5 hours for gym_snake. Monitor that `max steps reached` is not occurring on a regular basis, as this might mean that you're cutting off wanted snake behavior.
3. Do a long training run of 5000000 timesteps with tensorboard logging enabled, using the hyperparameter values of the best trial of the optimization step. Monitor how `ep_rew_mean` is evolving. Increase the number of steps if `ep_rew_mean` is still increasing (in Tensorboard, move the smoothing slider to the right to make this easier to assess).
4. Enjoy the result :).
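
The smoothing slider mentioned in step 3 boils down to an exponential moving average over the logged values. The sketch below is a simplified version for intuition; Tensorboard's own smoothing may differ in details, and the reward values are made up.

```python
def smooth(values, weight=0.6):
    """Exponential moving average; weight in [0, 1), higher = smoother."""
    if not values:
        return []
    smoothed, last = [], values[0]
    for value in values:
        last = weight * last + (1.0 - weight) * value
        smoothed.append(last)
    return smoothed


# Made-up, noisy but rising ep_rew_mean values.
rewards = [1.0, 3.0, 2.0, 4.0, 3.0, 5.0, 4.0, 6.0]
curve = smooth(rewards, weight=0.9)
still_increasing = curve[-1] > curve[len(curve) // 2]
print(curve, still_increasing)
```

With a high weight the zig-zag disappears and only the trend remains, which is what you need to decide whether longer training still makes sense.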

A possible (i.e. not tested) pitfall of this strategy is that the optimal hyperparameter settings are determined using rather short training sessions. For example, in step 2. Optuna might suggest a small neural network as, during the (short) trial, it showed fast learning. However, during the long training session of step 3. the snakes will be longer in general and a bigger neural network might be needed to deal with such long snakes.

**Strategy: staged learning**
Instead of starting each training session with a new, empty model, start off with an already trained model. This provides many possibilities:
* start off with a model that has expert knowledge in it to get decent behavior. Continue with a training session without expert knowledge to allow finding the optimal policy.
* start with a training session with subgoals to at least find some rewards. Continue with a training session without subgoals to allow finding the optimal policy. Typically useful for problems where there's only a reward at the very end of an episode, like chess.
* start with a training session with a small initial snake length. Continue with a training session with a longer initial snake length (requires source code changes in gym_snake). This allows later training runs to focus on long snakes. Without this, every episode starts with a small snake, so a lot of training time is lost on small snakes, for which the model is hardly learning anything any more. It is important that the long snake is initialized at a random position, otherwise a large part of the state space is not examined. It is also important not to increase the initial snake length too fast, again to avoid missing a large part of the state space.

Note that the hyperparameters that vary during a training session (e.g. exploration rate) are reinitialized with every training session. Splitting the training into multiple sessions means that at the start of every session there's a lot of exploration. This might ruin the neural network. Or ... it might be beneficial, as the snake examines new, possibly better parts of the state space.

Note that some hyperparameters (e.g. learning rate) and also aspects of the environment, like the reward signal, can be changed between staged training sessions, but some hyperparameters (e.g. network architecture) cannot.
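
The effect of reinitializing the exploration rate can be made concrete with a small sketch. It assumes a linear schedule that starts at 1.0 (an assumption) and uses the values `exploration_fraction: 0.1` and `exploration_final_eps: 0.02` from the DQN settings above; `epsilon` is a hypothetical helper, not the library's code.

```python
def epsilon(step, total_steps, fraction=0.1, initial=1.0, final=0.02):
    """Linearly anneal exploration over the first `fraction` of a session."""
    progress = step / total_steps
    if progress >= fraction:
        return final
    return initial + (final - initial) * progress / fraction


# One long session of 200000 steps: halfway, exploration is at its minimum.
one_session = epsilon(100_000, 200_000)

# Two staged sessions of 100000 steps: the second one starts exploring again.
second_session_start = epsilon(0, 100_000)

print(one_session, second_session_start)
```

So at the same overall step count, the staged variant acts almost randomly for a while, which is exactly the risk (or opportunity) described above.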

**Literature intermezzo**
An environment that returns few rewards (like chess, where there's only a reward at the very end of an episode) is called an 'RL problem with sparse rewards' in literature. Sparse rewards are the reason that getting good results is so much harder in RL than in supervised learning: the feedback on whether an action was good or bad only arrives in the far future, so the agent needs a lot of samples to learn. This is called 'sample inefficiency'. In contrast, in supervised learning the system gets feedback on every sample, namely whether it predicted the label correctly. Adding a subgoal like 'occupy the center of the board' is called 'adding a dense reward' in literature, resulting in a system that more closely resembles supervised learning. An example of a dense reward often seen in snake is the distance + direction to the food. In this way the snake gets a reward every step. Designing reward signals is called 'reward shaping' in literature. A second problem of sparse, future rewards is: what if the agent performed a chain of good actions, but made an error with the last action, therefore missing the reward? All actions taken, including the whole chain of good ones, will be discredited. This is called 'the credit assignment problem' in literature.
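
The difference between a sparse and a dense reward for snake can be sketched as follows. This is not the reward function of gym_snake; the function names and the `step_bonus` value are assumptions for illustration, and only the distance to the food is used.

```python
def manhattan(a, b):
    """Grid distance between two (row, column) cells."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def sparse_reward(new_head, food):
    """Feedback only when the food is actually reached."""
    return 1.0 if new_head == food else 0.0


def dense_reward(old_head, new_head, food, step_bonus=0.1):
    """Feedback on every step: did the head move towards the food?"""
    if new_head == food:
        return 1.0
    moved_closer = manhattan(new_head, food) < manhattan(old_head, food)
    return step_bonus if moved_closer else -step_bonus


food = (0, 3)
print(sparse_reward((0, 1), food))         # no information yet
print(dense_reward((0, 0), (0, 1), food))  # small positive hint
```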

**Strategy: tweak the reward signal (reward shaping)**
The reward signal is central in reinforcement learning and therefore the design of the reward signal deserves a lot of attention. Try philosophizing about the resulting snake behavior of a certain reward signal. Some examples of unexpected results (called 'the alignment problem' in literature):
* dying is better than walking (reward: -1 per step and a -1 for dying)
* turn-based duosnake: one snake kills itself immediately, after which the other snake quietly collects food (reward: the snake that has the turn gets +1 for finding food; -1 for dying)
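
The first example can be checked with a bit of arithmetic. The sketch below computes the discounted return of an episode in which the snake never finds food (an assumption), with -1 per step, -1 for dying and `gamma` 0.99 as in the settings above.

```python
def episode_return(n_steps, step_reward=-1.0, death_reward=-1.0, gamma=0.99):
    """Discounted return of an episode that ends in death at step n_steps."""
    rewards = [step_reward] * n_steps
    rewards[-1] += death_reward  # the final step also carries the death penalty
    return sum(gamma ** t * r for t, r in enumerate(rewards))


die_immediately = episode_return(1)
walk_then_die = episode_return(50)
print(die_immediately, walk_then_die)
```

Dying on the very first step gives a higher return than walking around for 50 steps, so an agent that maximizes return learns to die as fast as possible.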

Actually the holy grail of RL is ***not*** needing reward shaping (== adding subgoals), because (1) it is a custom process that has to be done for every new environment again, (2) because of the alignment problem, and (3) because it constrains the agent policy to the behavior of humans, which is not true optimal behavior.

**Strategy: start off simple, BUT ...**
Already mentioned several times: KISS. Start off with a simple version of the problem (e.g. small grid for snake). Only when you observe decent learning behavior, increase complexity. There's one important pitfall here! If you start off with a problem that's **too simple**, finding a reward is not a matter of skill but of luck!! This means that the agent receives random rewards and will not learn. We saw this in the workshop about neuroevolution. 

**Problem: decreasing rewards**
During a training run the reward first increases, but then decreases. Possible counter strategies ([source with explanation of causes](https://stackoverflow.com/questions/51960225/dqn-average-reward-decrease-after-training-for-a-period-of-time) and a [second source with explanation of causes](https://www.reddit.com/r/reinforcementlearning/comments/9zwr0r/why_do_rewards_start_to_drop_after_a_certain/)):
* (not tested) rather than a constant learning rate, decrease the learning rate during the training session
* (not tested) rather than a constant batch size, increase the batch size during the training session
* (not tested) rather than a constant amount of prioritized experience replay, decrease the prioritization during the training session
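
The first counter strategy, a decreasing learning rate, can be sketched as a function of the remaining training progress (1.0 at the start, 0.0 at the end). Stable Baselines3 accepts such a callable as `learning_rate`; the initial value 3e-4 is taken from the PPO settings above, and the rest is a minimal sketch (also not tested on the snake).

```python
def linear_schedule(initial_value):
    """Return a schedule that decays linearly from initial_value to 0."""

    def schedule(progress_remaining):
        # progress_remaining goes from 1.0 (start) to 0.0 (end of training).
        return progress_remaining * initial_value

    return schedule


learning_rate = linear_schedule(3e-4)
print(learning_rate(1.0), learning_rate(0.5), learning_rate(0.0))
```

As far as I know, the Zoo hyperparameter files also accept a `lin_` prefix for this (e.g. `learning_rate: lin_3e-4`); check the Zoo README of your version to be sure.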

**Pitfall: max steps per episode**
It is vital that an episode stops after a number of steps, otherwise a lot of training time can be lost on a snake that cycles forever; only an exploration step can get the snake out of this endless loop. This functionality need not be implemented by the environment itself, as it is provided by an OpenAI Gym wrapper (`miniconda3\envs\sb3\Lib\site-packages\gym\wrappers\time_limit.py`). The maximum number of steps per episode can be set in `gym_snake/__init__.py` by means of the line `max_episode_steps=1500,`. Too low a value might cut off wanted snake behavior. Too high a value means that a lot of training time is lost in endless loops. Monitor that `max steps reached` is not happening too often.
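
What such a wrapper does can be shown with a stripped-down stand-in. This is a minimal sketch, not gym's own code: `LoopingEnv` is a hypothetical environment whose episodes never end, and `StepLimit` only mimics the idea of the time limit wrapper, using the old four-value `step` interface.

```python
class LoopingEnv:
    """Hypothetical stand-in for a snake that cycles forever."""

    def reset(self):
        return 0

    def step(self, action):
        return 0, 0.0, False, {}  # observation, reward, done, info


class StepLimit:
    """Minimal sketch of a time-limit wrapper (not gym's implementation)."""

    def __init__(self, env, max_episode_steps):
        self.env = env
        self.max_episode_steps = max_episode_steps
        self.elapsed = 0

    def reset(self):
        self.elapsed = 0
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.elapsed += 1
        if self.elapsed >= self.max_episode_steps and not done:
            # Mark that the episode was cut off rather than finished.
            info["TimeLimit.truncated"] = True
            done = True
        return obs, reward, done, info


env = StepLimit(LoopingEnv(), max_episode_steps=1500)
env.reset()
steps, done = 0, False
while not done:
    _, _, done, info = env.step(action=0)
    steps += 1
print(steps, info)
```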

**Pitfall: normalized observations and normalized rewards**
Many RL algorithms rely on a Gaussian underlying distribution. Therefore it is important to normalize observations and rewards. This can be hard-coded in the environment, but it is also possible to use the wrapper `VecNormalize`. In Zoo, this wrapper can be enabled by hyperparameter `normalize: "{'norm_obs': True, 'norm_reward': True}"`. [More info in section 1.3.3](https://buildmedia.readthedocs.org/media/pdf/stable-baselines/master/stable-baselines.pdf).
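
To show what normalizing observations with running statistics means, here is a minimal sketch for a single scalar observation. It is not the `VecNormalize` implementation; the class name is hypothetical and the `epsilon` and `clip` values are assumptions.

```python
import math


class RunningNormalizer:
    """Keeps a running mean and variance and rescales new values with them."""

    def __init__(self, epsilon=1e-8, clip=10.0):
        self.count, self.mean, self.m2 = 0, 0.0, 0.0
        self.epsilon, self.clip = epsilon, clip

    def update(self, x):
        # Welford's algorithm: update mean and squared deviations in one pass.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def normalize(self, x):
        variance = self.m2 / self.count if self.count else 1.0
        z = (x - self.mean) / math.sqrt(variance + self.epsilon)
        # Clip outliers so a single extreme value cannot dominate.
        return max(-self.clip, min(self.clip, z))


normalizer = RunningNormalizer()
for observation in (2.0, 4.0, 6.0):
    normalizer.update(observation)
print(normalizer.mean, normalizer.normalize(4.0), normalizer.normalize(1e9))
```

Because the statistics are learned during training, they have to be saved and reused together with the model when enjoying it.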

**Pitfall: normalized actions**
Many RL algorithms, typically those for continuous action spaces, rely on a normalized and symmetric action space. This must be hard-coded in the environment. [More info in section 1.3.3](https://buildmedia.readthedocs.org/media/pdf/stable-baselines/master/stable-baselines.pdf).
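
Hard-coding this in an environment usually means that the agent acts in [-1, 1] and the environment maps that to its own range. The snake has discrete actions, so it does not need this; the sketch below uses a hypothetical continuous action (e.g. a steering angle between 0 and 30) purely for illustration.

```python
def rescale_action(action, low, high):
    """Map an action from [-1, 1] to the environment's own range [low, high]."""
    action = max(-1.0, min(1.0, action))  # guard against out-of-range actions
    return low + 0.5 * (action + 1.0) * (high - low)


print(rescale_action(-1.0, 0.0, 30.0))
print(rescale_action(0.0, 0.0, 30.0))
print(rescale_action(1.0, 0.0, 30.0))
```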


## Using Zoo in Google Colab

Instead of installing Zoo locally, it is easier to create a copy of this [Google Colab notebook](https://colab.research.google.com/github/Stable-Baselines-Team/rl-colab-notebooks/blob/sb3/rl-baselines-zoo.ipynb), but ... the notebook disconnects after 90 minutes of idle time and has a maximum running time of 12 hours. All data is lost if you didn't download it before the disconnect. So the ease of installation comes at a price. It's a good option for quickly assessing the usefulness of Zoo.

* create a zip of the gym_snake folder, named `gym_snake.zip`
* upload the zip to Google Colab
* within Google Colab:
```
%cd /content/rl-baselines3-zoo/
!unzip /content/gym_snake.zip
%cd /content/rl-baselines3-zoo/gym_snake/
!pip install -e .
```
`pip install -e .` installs the gym_snake module and registers the snake environment with OpenAI Gym under the name `snake-v0`.

* In file `train.py` add the line `import gym_snake` below the line `import gym`. 
* within Google Colab:
```
%cd /content/rl-baselines3-zoo/
!python train.py --algo dqn --env snake-v0 --n-timesteps 100000
```
* An alternative for uploading a zip to Google Colab is to mount your Google Drive. To do this, within Google Colab:
```
from google.colab import drive
drive.mount('/content/gdrive')
```
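
The `gym_snake.zip` from the first step can be created by hand, or with a few lines of Python on your local machine. The sketch below is one way to do it; `zip_package` is a hypothetical helper, and the demo uses a throw-away folder instead of the real gym_snake folder. The archive keeps `gym_snake/` as its top-level folder, which is what the `unzip` and `%cd` commands above expect.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def zip_package(package_dir, out_dir):
    """Zip package_dir (including the folder itself) into out_dir."""
    package_dir = Path(package_dir).resolve()
    out_dir = Path(out_dir).resolve()
    out_dir.mkdir(parents=True, exist_ok=True)
    archive = shutil.make_archive(
        str(out_dir / package_dir.name),   # archive path without '.zip'
        "zip",
        root_dir=str(package_dir.parent),  # paths are relative to the parent
        base_dir=package_dir.name,         # so the zip contains 'gym_snake/...'
    )
    return Path(archive)


# Demo with a placeholder package in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    package = Path(tmp, "gym_snake")
    package.mkdir()
    (package / "setup.py").write_text("# placeholder\n")
    archive = zip_package(package, Path(tmp, "out"))
    archive_name = archive.name
    with zipfile.ZipFile(archive) as zf:
        names = zf.namelist()
print(archive_name, names)
```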

### Preventing a Google Colab notebook from closing if untouched for a while

**Not sure whether this workaround still works.**

* in the Chrome window where Colab is running, right-click and choose 'inspect'
* paste the javascript code below in the console: 
```
function ConnectButton(){
    console.log("Connect pushed"); 
    document.querySelector("#top-toolbar > colab-connect-button").shadowRoot.querySelector("#connect").click() 
}
var colab = setInterval(ConnectButton,60000);
```

* when your Colab session is finished, you'll want to stop the timer. To do so, paste the javascript code below in the console:
```
clearInterval(colab)
```