# Single Goal Environment

In this notebook we train an agent to reach a single goal using a reinforcement learning algorithm.

## Warm-up
First, we install the required system packages and Python dependencies.

In [None]:
%tensorflow_version 1.x

!apt-get install graphviz libgraphviz-dev pkg-config
!apt-get install -y xvfb python-opengl ffmpeg
!pip install git+https://github.com/mhtb32/tl-env.git#egg=tl-env
!pip install stable-baselines==2.10.0 pyvirtualdisplay

Next, we import the modules we need.

In [1]:
from pathlib import Path

import gym
import numpy as np
from stable_baselines.ddpg.policies import MlpPolicy
from stable_baselines.common.noise import OrnsteinUhlenbeckActionNoise
from stable_baselines import DDPG
from stable_baselines.common.callbacks import EvalCallback
from stable_baselines.common.evaluation import evaluate_policy

# noinspection PyUnresolvedReferences
import tl_env




Then, we specify a path for saving the trained model and create the training and evaluation environments.

In [2]:
save_path = Path.cwd().parent / 'out'
save_path.mkdir(exist_ok=True)

env = gym.make('tl_env:SingleGoalIDM-v0')
eval_env = gym.make('tl_env:SingleGoalIDM-v0')



## Training
Now we train the agent using the Deep Deterministic Policy Gradient (DDPG) algorithm.
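
The training cell below adds Ornstein-Uhlenbeck noise to the actions for exploration. As a rough illustration of what such a process looks like, here is a minimal NumPy sketch of a discretized Ornstein-Uhlenbeck process. It is not the library's implementation: the function name `ou_noise_path`, the values of `theta`, `dt` and `seed`, and the two action dimensions are all assumptions made for this example.

```python
import numpy as np


def ou_noise_path(n_steps, mean, sigma, theta=0.15, dt=1e-2, seed=0):
    """Simulate a discretized Ornstein-Uhlenbeck process (illustrative sketch).

    theta, dt and seed are assumed values for this example only.
    """
    rng = np.random.default_rng(seed)
    mean = np.asarray(mean, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    x = np.zeros_like(mean)  # start the process at zero
    path = np.empty((n_steps,) + mean.shape)
    for t in range(n_steps):
        # Pull x back toward the mean, then add scaled Gaussian noise.
        drift = theta * (mean - x) * dt
        diffusion = sigma * np.sqrt(dt) * rng.standard_normal(mean.shape)
        x = x + drift + diffusion
        path[t] = x
    return path


# Two action dimensions are assumed here purely for illustration.
path = ou_noise_path(n_steps=1000, mean=np.zeros(2), sigma=0.5 * np.ones(2))
print(path.shape)
```

Unlike independent Gaussian noise, consecutive samples are correlated, so the perturbation drifts smoothly instead of jumping at every step.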

In [3]:
n_actions = env.action_space.shape[-1]
action_noise = OrnsteinUhlenbeckActionNoise(mean=np.zeros(n_actions), sigma=0.5 * np.ones(n_actions))

eval_callback = EvalCallback(eval_env, eval_freq=2000, best_model_save_path=str(save_path))

model = DDPG(MlpPolicy, env, verbose=1, action_noise=action_noise, buffer_size=5000)
model.learn(total_timesteps=40000, log_interval=100, callback=eval_callback)
model.save(str(save_path / 'final_ddpg_model_40000'))
del model







Eval num_timesteps=2000, episode_reward=-2.56 +/- 0.12
Episode length: 23.60 +/- 2.80
New best mean reward!
Eval num_timesteps=4000, episode_reward=-2.50 +/- 0.00
Episode length: 25.00 +/- 0.00
New best mean reward!
Eval num_timesteps=6000, episode_reward=-2.50 +/- 0.00
Episode length: 25.00 +/- 0.00
Eval num_timesteps=8000, episode_reward=-1.48 +/- 0.84
Episode length: 20.80 +/- 3.49
New best mean reward!
Eval num_timesteps=10000, episode_reward=-1.32 +/- 1.14
Episode length: 15.20 +/- 4.49
New best mean reward!
--------------------------------------
| reference_Q_mean        | -0.411   |
| reference_Q_std         | 0.147    |
| reference_action_mean   | 0.584    |
| reference_action_std    | 0.453    |
| reference_actor_Q_mean  | -0.381   |
| reference_actor_Q_std   | 0.167    |
| rollout/Q_mean          | -0.284   |
| rollout/actions_mean    | 0.0928   |
| rollout/actions_std     | 0.312    |
| rollout/episode_steps   | 22.1     |
| rollout/episodes        | 451      |
| rollout/ret
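
The `New best mean reward!` lines in the log above mark the evaluations at which the mean reward improved, which is when the callback saves a new best model. The snippet below is a minimal sketch of that bookkeeping, not the library's implementation. The (timestep, reward) pairs are copied from the log; the helper name `track_best` is hypothetical, and the strict comparison is an assumption that is consistent with the log (no new best at 6000 steps, where the reward only tied the previous best).

```python
# (timestep, mean episode reward) pairs copied from the evaluation log above.
eval_log = [(2000, -2.56), (4000, -2.50), (6000, -2.50), (8000, -1.48), (10000, -1.32)]


def track_best(evaluations):
    """Return the best mean reward and the timesteps where it improved."""
    best = float("-inf")
    improved_at = []
    for timestep, mean_reward in evaluations:
        # Assumed: only a strictly higher reward counts as a new best.
        if mean_reward > best:
            best = mean_reward
            improved_at.append(timestep)
    return best, improved_at


best_reward, improved_at = track_best(eval_log)
print(best_reward, improved_at)  # -1.32 [2000, 4000, 8000, 10000]
```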

## Testing
Now we test the agent for a few episodes to see how it is doing. We first define a simple helper function for
visualization of episodes:

In [4]:
from IPython import display as ipythondisplay
from pyvirtualdisplay import Display
from gym.wrappers import Monitor
import base64
from tqdm.notebook import trange

display = Display(visible=0, size=(1400, 900))
display.start()

def show_video():
    html = []
    for mp4 in Path("../out/video").glob("*.mp4"):
        video_b64 = base64.b64encode(mp4.read_bytes())
        html.append('''<video alt="{}" autoplay
                      loop controls style="height: 400px;">
                      <source src="data:video/mp4;base64,{}" type="video/mp4" />
                 </video>'''.format(mp4, video_b64.decode('ascii')))
    # noinspection PyTypeChecker
    ipythondisplay.display(ipythondisplay.HTML(data="<br>".join(html)))
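
The helper embeds each recording directly in the page as a base64 data URI, which has the form `data:<media type>;base64,<payload>`. The following standalone sketch shows only that encoding step, so it can be checked without a display or a recorded file. The function name `video_tag` and the placeholder bytes are hypothetical.

```python
import base64


def video_tag(video_bytes, height_px=400):
    """Build an HTML <video> tag that embeds the bytes as a base64 data URI."""
    payload = base64.b64encode(video_bytes).decode("ascii")
    return (
        '<video autoplay loop controls style="height: {}px;">'
        '<source src="data:video/mp4;base64,{}" type="video/mp4" />'
        "</video>"
    ).format(height_px, payload)


# Placeholder bytes stand in for the contents of a real .mp4 file.
tag = video_tag(b"not a real video")
print(tag[:60])
```

Embedding the data inline keeps the notebook self-contained, at the cost of a larger notebook file when the recordings are long.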

Now we test the policy:

In [7]:
env = Monitor(eval_env, '../out/video', force=True, video_callable=lambda episode: True)
model = DDPG.load(str(save_path / 'final_ddpg_model_40000'))
for episode in trange(3, desc="Test episodes"):
    obs, done = env.reset(), False
    env.unwrapped.automatic_rendering_callback = env.video_recorder.capture_frame
    while not done:
        action, _ = model.predict(obs)
        obs, reward, done, info = env.step(action)
env.close()
show_video()

Loading a model without an environment, this model cannot be trained until it has a valid environment.






## Evaluation

Finally, we evaluate the policy over several episodes to get a quantitative sense of how well it performs.

In [None]:
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10)

print(f"10-episode reward is {mean_reward:.2f} +/- {std_reward:.2f}")
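
The two returned values are the mean and standard deviation of the total reward per episode. To make that computation concrete, here is a minimal sketch of the same loop on a toy problem. It is not the library's implementation: `ToyGoalEnv`, its reward of -0.1 per step, and the `evaluate` helper are hypothetical stand-ins and do not describe the real `tl_env` environment.

```python
import numpy as np


class ToyGoalEnv:
    """Hypothetical 1-D stand-in: reach `goal`, paying -0.1 per step."""

    def __init__(self, goal=5, max_steps=25):
        self.goal = goal
        self.max_steps = max_steps
        self.pos = 0
        self.steps = 0

    def reset(self):
        self.pos, self.steps = 0, 0
        return self.pos

    def step(self, action):
        self.pos += action
        self.steps += 1
        # The episode ends at the goal or when the step budget runs out.
        done = self.pos >= self.goal or self.steps >= self.max_steps
        return self.pos, -0.1, done, {}


def evaluate(policy, env, n_eval_episodes=10):
    """Run full episodes and return mean and std of the episode rewards."""
    episode_rewards = []
    for _ in range(n_eval_episodes):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            obs, reward, done, _ = env.step(policy(obs))
            total += reward
        episode_rewards.append(total)
    return float(np.mean(episode_rewards)), float(np.std(episode_rewards))


# A policy that always moves one step toward the goal.
mean_reward, std_reward = evaluate(lambda obs: 1, ToyGoalEnv())
print(f"{mean_reward:.2f} +/- {std_reward:.2f}")  # -0.50 +/- 0.00
```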
