# Tensor2Tensor Reinforcement Learning

The `rl` package provides the ability to run model-free and model-based reinforcement learning algorithms.

Currently, we support Proximal Policy Optimization ([PPO](https://arxiv.org/abs/1707.06347)) and Simulated Policy Learning ([SimPLe](https://arxiv.org/abs/1903.00374)).

Below you will find examples of PPO training using `trainer_model_free.py` and SimPLe training using `trainer_model_based.py`.


In [1]:
#@title
# Copyright 2018 Google LLC.

# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at

# https://www.apache.org/licenses/LICENSE-2.0

# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

In [None]:
!pip install -q tensorflow==1.13.1
!pip install -q tensorflow_probability==0.6.0
!pip install -q tensor2tensor==1.13.1
!pip install -q gym[atari]

ERROR: Could not find a version that satisfies the requirement tensorflow==1.13.1 (from versions: 2.16.0rc0, 2.16.1, 2.16.2, 2.17.0rc0, 2.17.0rc1, 2.17.0, 2.17.1, 2.18.0rc0, 2.18.0rc1, 2.18.0rc2, 2.18.0, 2.18.1, 2.19.0rc0, 2.19.0, 2.19.1, 2.20.0rc0, 2.20.0)
ERROR: No matching distribution found for tensorflow==1.13.1

In [None]:
# Helper function for playing videos in the colab.
def play_video(path):
  from IPython.display import HTML
  display_path = "/nbextensions/vid.mp4"
  display_abs_path = "/usr/local/share/jupyter" + display_path
  !rm -f $display_abs_path
  !ffmpeg -loglevel error -i $path $display_abs_path
  return HTML("""
    <video width="640" height="480" controls>
      <source src="{}" type="video/mp4">
    </video>
  """.format(display_path))

# Play using a pre-trained policy

We provide pretrained policies for the following games from the Arcade Learning Environment ([ALE](https://github.com/mgbellemare/Arcade-Learning-Environment)): alien, amidar, assault, asterix, asteroids, atlantis, bank_heist, battle_zone, beam_rider, bowling, boxing, breakout, chopper_command, crazy_climber, demon_attack, fishing_derby, freeway, frostbite, gopher, gravitar, hero, ice_hockey, jamesbond, kangaroo, krull, kung_fu_master, ms_pacman, name_this_game, pong, private_eye, qbert, riverraid, road_runner, seaquest, up_n_down, yars_revenge.

We have 5 checkpoints for each game saved on Google Cloud Storage. Run the following cell to get the storage path:

In [None]:
# experiment_id is an integer from [0, 4].
def get_run_dir(game, experiment_id):
  from tensor2tensor.data_generators.gym_env import ATARI_GAMES_WITH_HUMAN_SCORE_NICE
  EXPERIMENTS_PER_GAME = 5
  run_id = ATARI_GAMES_WITH_HUMAN_SCORE_NICE.index(game) * EXPERIMENTS_PER_GAME + experiment_id + 1
  return "gs://tensor2tensor-checkpoints/modelrl_experiments/train_sd/{}".format(run_id)

get_run_dir('pong', 2)
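
The run ID arithmetic used by `get_run_dir` can be illustrated without installing `tensor2tensor`. The sketch below is only an illustration: `EXAMPLE_GAMES` and `example_run_id` are hypothetical names, and the three-game list is a stand-in for `ATARI_GAMES_WITH_HUMAN_SCORE_NICE`, whose actual contents and order may differ.

```python
# Hypothetical, shortened game list used only for illustration; the real
# ATARI_GAMES_WITH_HUMAN_SCORE_NICE list in tensor2tensor may differ in
# content and order.
EXAMPLE_GAMES = ["alien", "amidar", "assault"]
EXPERIMENTS_PER_GAME = 5


def example_run_id(game, experiment_id, games=EXAMPLE_GAMES):
  """Mirrors the run_id arithmetic of get_run_dir on a toy game list."""
  if not 0 <= experiment_id < EXPERIMENTS_PER_GAME:
    raise ValueError("experiment_id must be in [0, 4]")
  # Each game owns a block of EXPERIMENTS_PER_GAME consecutive, 1-based IDs.
  return games.index(game) * EXPERIMENTS_PER_GAME + experiment_id + 1


print(example_run_id("alien", 0))    # 0 * 5 + 0 + 1 = 1
print(example_run_id("amidar", 2))   # 1 * 5 + 2 + 1 = 8
print(example_run_id("assault", 4))  # 2 * 5 + 4 + 1 = 15
```

Each game occupies five consecutive run IDs, so the position of a game in the list determines which checkpoint directories belong to it.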

To evaluate and generate videos for a pretrained policy on Pong:

In [None]:
game = 'pong'
run_dir = get_run_dir(game, 1)
!python -m tensor2tensor.rl.evaluator \
  --loop_hparams_set=rlmb_long_stochastic_discrete \
  --loop_hparams=game=$game,eval_max_num_noops=8,eval_sampling_temps=[0.5] \
  --policy_dir=$run_dir/policy \
  --eval_metrics_dir=pong_pretrained \
  --debug_video_path=pong_pretrained \
  --num_debug_videos=4

The above command runs a single evaluation setting to get results quickly. We usually run a grid of different settings (sampling temperatures and whether to do initial no-ops). To do that, remove `eval_max_num_noops=8,eval_sampling_temps=[0.5]` from the command. You can also override the evaluation settings, for example:

```
  --loop_hparams=game=pong,eval_max_num_noops=0,eval_sampling_temps=[0.0]
 ```
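
If you want to try several settings one at a time, a small helper can assemble the override string. This is a minimal sketch: `format_loop_hparams` is a hypothetical helper, not part of Tensor2Tensor, and it only reproduces the string format shown above.

```python
def format_loop_hparams(game, max_num_noops, sampling_temp):
  """Builds a --loop_hparams value (hypothetical helper, not part of T2T)."""
  return "game={},eval_max_num_noops={},eval_sampling_temps=[{}]".format(
      game, max_num_noops, float(sampling_temp))


print(format_loop_hparams("pong", 0, 0.0))
# game=pong,eval_max_num_noops=0,eval_sampling_temps=[0.0]
```

The resulting string can be passed to the evaluator through `--loop_hparams`.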

 The evaluator generates videos from the environment:

In [None]:
play_video('pong_pretrained/0.avi')
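
To see which videos are available before playing one, you can list the output directory. The sketch below assumes the evaluator writes its debug videos as `.avi` files directly into the `--debug_video_path` directory, as the `0.avi` path above suggests; `list_debug_videos` is a hypothetical helper.

```python
import glob
import os


def list_debug_videos(video_dir):
  """Returns the sorted paths of .avi files directly inside video_dir."""
  # Assumption: debug videos are stored as <name>.avi in video_dir.
  return sorted(glob.glob(os.path.join(video_dir, "*.avi")))


for path in list_debug_videos("pong_pretrained"):
  print(path)
```

Any of the printed paths can then be passed to `play_video`.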

# Train your policy (model-free training)
Training model-free on Pong (it takes a few hours):

In [None]:
!python -m tensor2tensor.rl.trainer_model_free \
  --hparams_set=rlmf_base \
  --hparams=game=pong \
  --output_dir=mf_pong

Hyperparameter sets are defined in `tensor2tensor/models/research/rl.py`. You can override them using the `--hparams` flag, e.g.:

```
--hparams=game=kung_fu_master,frame_stack_size=5
```

As in model-based training, the periodic evaluation runs with a timestep limit of 1000. To run a full evaluation after training:

In [None]:
!python -m tensor2tensor.rl.evaluator \
  --loop_hparams_set=rlmf_tiny \
  --hparams=game=pong \
  --policy_dir=mf_pong \
  --debug_video_path=mf_pong \
  --num_debug_videos=4 \
  --eval_metrics_dir=mf_pong/full_eval_metrics

In [None]:
play_video('mf_pong/0.avi')

# Model-based training

The `rl` package offers many more features, including model-based training. For instructions on how to use them, go to our [README](https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/rl/README.md).