<a href="https://colab.research.google.com/github/iskra3138/tf-agent/blob/main/1_TF_Agent_DQN_Tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##### Copyright 2018 The TF-Agents Authors.

In [None]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Train a Deep Q Network with TF-Agents

<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://www.tensorflow.org/agents/tutorials/1_dqn_tutorial">
    <img src="https://www.tensorflow.org/images/tf_logo_32px.png" />
    View on TensorFlow.org</a>
  </td>
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/tensorflow/agents/blob/master/docs/tutorials/1_dqn_tutorial.ipynb">
    <img src="https://www.tensorflow.org/images/colab_logo_32px.png" />
    Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/tensorflow/agents/blob/master/docs/tutorials/1_dqn_tutorial.ipynb">
    <img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />
    View source on GitHub</a>
  </td>
  <td>
    <a href="https://storage.googleapis.com/tensorflow_docs/agents/docs/tutorials/1_dqn_tutorial.ipynb"><img src="https://www.tensorflow.org/images/download_logo_32px.png" />Download notebook</a>
  </td>
</table>

This Colab notebook is based on `1_dqn_tutorial.ipynb` from the TF-Agents tutorials.

# TF-Agents

TF-Agents is a library released by Google that makes it easier to implement RL algorithms on top of TensorFlow. It currently provides the following seven algorithms:

*   [DQN: __Human level control through deep reinforcement learning__ Mnih et
    al., 2015](https://deepmind.com/research/dqn/)
*   [DDQN: __Deep Reinforcement Learning with Double Q-learning__ Hasselt et
    al., 2015](https://arxiv.org/abs/1509.06461)
*   [DDPG: __Continuous control with deep reinforcement learning__ Lillicrap et
    al., 2015](https://arxiv.org/abs/1509.02971)
*   [TD3: __Addressing Function Approximation Error in Actor-Critic Methods__
    Fujimoto et al., 2018](https://arxiv.org/abs/1802.09477)
*   [REINFORCE: __Simple Statistical Gradient-Following Algorithms for
    Connectionist Reinforcement Learning__ Williams,
    1992](http://www-anw.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf)
*   [PPO: __Proximal Policy Optimization Algorithms__ Schulman et al., 2017](https://arxiv.org/abs/1707.06347)
*   [SAC: __Soft Actor Critic__ Haarnoja et al., 2018](https://arxiv.org/abs/1812.05905)

## Introduction


This example shows how to train a [DQN (Deep Q Networks)](https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf) agent on OpenAI's Cartpole environment using the TF-Agents library.

![Cartpole environment](https://raw.githubusercontent.com/tensorflow/agents/master/docs/tutorials/images/cartpole.png)

It will walk you through all the components in a Reinforcement Learning (RL) pipeline for training, evaluation and data collection.


To run this code live, click the 'Run in Google Colab' link above.


## Setup

Install the required libraries.

If you haven't installed the following dependencies, run:

In [None]:
!sudo apt-get install -y xvfb ffmpeg
!pip install gym
!pip install 'imageio==2.4.0'
!pip install PILLOW
!pip install pyglet
!pip install pyvirtualdisplay
!pip install tf-agents

In [None]:
from __future__ import absolute_import, division, print_function

import base64
import imageio
import IPython
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import PIL.Image
import pyvirtualdisplay

import tensorflow as tf


from tf_agents.drivers import dynamic_step_driver

from tf_agents.environments import tf_py_environment
from tf_agents.eval import metric_utils
from tf_agents.metrics import tf_metrics

from tf_agents.policies import random_tf_policy

from tf_agents.trajectories import trajectory
from tf_agents.utils import common

In [None]:
tf.compat.v1.enable_v2_behavior()

# Set up a virtual display for rendering OpenAI gym environments.
display = pyvirtualdisplay.Display(visible=0, size=(1400, 900)).start()

In [None]:
tf.version.VERSION

## Hyperparameters

In [None]:
num_iterations = 20000 # @param {type:"integer"}

# Hyperparameters for data collection
initial_collect_steps = 100  # @param {type:"integer"} 
collect_steps_per_iteration = 1  # @param {type:"integer"}
replay_buffer_max_length = 100000  # @param {type:"integer"}


batch_size = 64  # @param {type:"integer"}
learning_rate = 1e-3  # @param {type:"number"}

# During training, how often (in steps) to print the loss
log_interval = 200  # @param {type:"integer"}

# During evaluation, how many episodes to average the return over
num_eval_episodes = 10  # @param {type:"integer"}

# During training, how often (in steps) to run an evaluation
eval_interval = 1000  # @param {type:"integer"}

## Environment

In Reinforcement Learning (RL), an environment represents the task or problem to be solved. Standard environments can be created in TF-Agents using `tf_agents.environments` suites. TF-Agents has suites for loading environments from sources such as the OpenAI Gym, Atari, and DM Control.

Load the CartPole environment from the OpenAI Gym suite, using `suite_gym` from `tf_agents.environments` to load `'CartPole-v0'`.

In [None]:
from tf_agents.environments import suite_gym
env_name = 'CartPole-v0'
env = suite_gym.load(env_name)

In [8]:
# Inspect the environment object
env

<tf_agents.environments.wrappers.TimeLimit at 0x7fa5fdbc5da0>

You can render this environment to see how it looks. A free-swinging pole is attached to a cart.  The goal is to move the cart right or left in order to keep the pole pointing up.

`env.reset()` resets the state, and `env.render()` returns the current state of the environment as an image.

In [9]:
env.reset()
# The returned TimeStep consists of step_type, reward, discount, and observation

TimeStep(step_type=array(0, dtype=int32), reward=array(0., dtype=float32), discount=array(1., dtype=float32), observation=array([ 0.01580044, -0.02399861, -0.02483287, -0.01644955], dtype=float32))

In [10]:
env.render()

array([[[255, 255, 255],
        [255, 255, 255],
        [255, 255, 255],
        ...,
        [255, 255, 255],
        [255, 255, 255],
        [255, 255, 255]],

       [[255, 255, 255],
        [255, 255, 255],
        [255, 255, 255],
        ...,
        [255, 255, 255],
        [255, 255, 255],
        [255, 255, 255]],

       [[255, 255, 255],
        [255, 255, 255],
        [255, 255, 255],
        ...,
        [255, 255, 255],
        [255, 255, 255],
        [255, 255, 255]],

       ...,

       [[255, 255, 255],
        [255, 255, 255],
        [255, 255, 255],
        ...,
        [255, 255, 255],
        [255, 255, 255],
        [255, 255, 255]],

       [[255, 255, 255],
        [255, 255, 255],
        [255, 255, 255],
        ...,
        [255, 255, 255],
        [255, 255, 255],
        [255, 255, 255]],

       [[255, 255, 255],
        [255, 255, 255],
        [255, 255, 255],
        ...,
        [255, 255, 255],
        [255, 255, 255],
        [255, 255, 255]]

In [None]:
#@test {"skip": true}
env.reset()
PIL.Image.fromarray(env.render())

The `environment.step` method takes an `action` in the environment and returns a `TimeStep` tuple containing the next observation of the environment and the reward for the action.

The `time_step_spec()` method returns the specification for the `TimeStep` tuple. Its `observation` attribute shows the shape of observations, the data types, and the ranges of allowed values. The `reward` attribute shows the same details for the reward.
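As a rough illustration of the `reset`/`step` contract described above, here is a minimal pure-Python sketch. `ToyTimeStep` and `ToyCountdownEnv` are hypothetical names invented for this example and are not part of TF-Agents. The integer step types (0 for the first step, 1 for intermediate steps, 2 for the last step) and the reward and discount values follow the pattern visible in the step outputs printed in this notebook.

```python
from collections import namedtuple

# Hypothetical stand-in for the TimeStep tuple (not the TF-Agents class).
ToyTimeStep = namedtuple(
    "ToyTimeStep", ["step_type", "reward", "discount", "observation"])

FIRST, MID, LAST = 0, 1, 2


class ToyCountdownEnv:
    """Toy environment: the episode ends after `horizon` steps."""

    def __init__(self, horizon=3):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        # The first step carries no reward and a discount of 1.
        return ToyTimeStep(FIRST, 0.0, 1.0, [float(self.t)])

    def step(self, action):
        # The action is ignored in this toy example.
        self.t += 1
        done = self.t >= self.horizon
        # +1 reward per step; the discount drops to 0 on the last step.
        return ToyTimeStep(
            LAST if done else MID, 1.0, 0.0 if done else 1.0, [float(self.t)])


env_sketch = ToyCountdownEnv(horizon=3)
steps = [env_sketch.reset()] + [env_sketch.step(action=1) for _ in range(3)]
step_types = [ts.step_type for ts in steps]
```

Here `step_types` traces one full episode from the first step to the last one.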


`.time_step_spec()` shows how the environment is defined.

In [12]:
# YJ
# A CartPole TimeStep consists of the following four fields
print(env.time_step_spec())
# TimeStep(
#   step_type=ArraySpec(shape=(), dtype=dtype('int32'), name='step_type'),
#   reward=ArraySpec(shape=(), dtype=dtype('float32'), name='reward'), 
#   discount=BoundedArraySpec(shape=(), dtype=dtype('float32'), name='discount', minimum=0.0, maximum=1.0),
#   observation=BoundedArraySpec(shape=(4,), dtype=dtype('float32'), name='observation', minimum=[-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38], maximum=[4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38]))

TimeStep(step_type=ArraySpec(shape=(), dtype=dtype('int32'), name='step_type'), reward=ArraySpec(shape=(), dtype=dtype('float32'), name='reward'), discount=BoundedArraySpec(shape=(), dtype=dtype('float32'), name='discount', minimum=0.0, maximum=1.0), observation=BoundedArraySpec(shape=(4,), dtype=dtype('float32'), name='observation', minimum=[-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38], maximum=[4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38]))


In [None]:
print('Observation Spec:')
print(env.time_step_spec().observation)

In [None]:
print('Reward Spec:')
print(env.time_step_spec().reward)

The `action_spec()` method returns the shape, data types, and allowed values of valid actions.

`.action_spec()` provides information about the action set defined by the environment.

In [None]:
print('Action Spec:')
print(env.action_spec())


In the Cartpole environment:

-   `observation` is an array of 4 floats: 
    -   the position and velocity of the cart
    -   the angular position and velocity of the pole 
-   `reward` is a scalar float value
-   `action` is a scalar integer with only two possible values:
    -   `0` — "move left"
    -   `1` — "move right"

This matches the specs printed by `.time_step_spec()` and `.action_spec()` above.

In [None]:
# Reset the environment
time_step = env.reset()
print('Time step:')
print(time_step)

# Arbitrarily choose the "move right" action
action = np.array(1, dtype=np.int32)

# Pass the action to .step to get the next time step
next_time_step = env.step(action)
print('Next time step:')
print(next_time_step)

In [17]:
# Step 50 times and watch how the time-step fields change:
## step_type is 0 (start of an episode), 1 (middle of an episode), or 2 (end of an episode)
## discount is 1 when step_type is 0 or 1, and 0 when step_type is 2
## reward is 0 when step_type is 0, and 1 when step_type is 1 or 2
## observation is the 4-dimensional float vector described above

print ('{}\t{}\t{}\t{}'.format('type','discount','reward','obs'))
next_time_step = env.reset()
for i in range(50):
  print('{}\t{}\t\t{}\t{}'.format
        (next_time_step.step_type, 
        next_time_step.discount, 
        next_time_step.reward, 
        next_time_step.observation)
  )
  next_time_step = env.step(action)

type	discount	reward	obs
0	1.0		0.0	[ 0.03486604  0.0424347  -0.02102845 -0.02165117]
1	1.0		1.0	[ 0.03571473  0.23785181 -0.02146148 -0.32089394]
1	1.0		1.0	[ 0.04047177  0.43327272 -0.02787935 -0.620267  ]
1	1.0		1.0	[ 0.04913722  0.62877274 -0.04028469 -0.9215985 ]
1	1.0		1.0	[ 0.06171268  0.8244152  -0.05871666 -1.2266645 ]
1	1.0		1.0	[ 0.07820098  1.0202417  -0.08324996 -1.5371511 ]
1	1.0		1.0	[ 0.09860582  1.2162613  -0.11399297 -1.8546093 ]
1	1.0		1.0	[ 0.12293104  1.4124365  -0.15108517 -2.1804037 ]
1	1.0		1.0	[ 0.15117978  1.608669   -0.19469324 -2.5156496 ]
2	0.0		1.0	[ 0.18335316  1.8047816  -0.24500623 -2.8611395 ]
0	1.0		0.0	[-0.00711055 -0.01558679 -0.00643361 -0.01728525]
1	1.0		1.0	[-0.00742228  0.17962682 -0.00677932 -0.3119911 ]
1	1.0		1.0	[-0.00382974  0.3748447  -0.01301914 -0.60680425]
1	1.0		1.0	[ 0.00366715  0.57014626 -0.02515523 -0.90355927]
1	1.0		1.0	[ 0.01507007  0.7655997  -0.04322641 -1.2040416 ]
1	1.0		1.0	[ 0.03038207  0.961253   -0.06730724 -1.509952  ]

Usually two environments are instantiated: one for training and one for evaluation. 

Below, one environment is declared for training and another for evaluation.

In [None]:
train_py_env = suite_gym.load(env_name)
eval_py_env = suite_gym.load(env_name)

The Cartpole environment, like most environments, is written in pure Python. This is converted to TensorFlow using the `TFPyEnvironment` wrapper.

The original environment's API uses NumPy arrays. The `TFPyEnvironment` wrapper converts these to `Tensors` so that the environment is compatible with TensorFlow agents and policies.


The Python OpenAI Gym environments are wrapped below so that they can be used with TensorFlow.

In [None]:
train_env = tf_py_environment.TFPyEnvironment(train_py_env)
eval_env = tf_py_environment.TFPyEnvironment(eval_py_env)

## Agent

The algorithm used to solve an RL problem is represented by an `Agent`. TF-Agents provides standard implementations of a variety of `Agents`, including:

-   [DQN](https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf) (used in this tutorial)
-   [REINFORCE](http://www-anw.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf)
-   [DDPG](https://arxiv.org/pdf/1509.02971.pdf)
-   [TD3](https://arxiv.org/pdf/1802.09477.pdf)
-   [PPO](https://arxiv.org/abs/1707.06347)
-   [SAC](https://arxiv.org/abs/1801.01290).

The DQN agent can be used in any environment which has a discrete action space.

At the heart of a DQN Agent is a `QNetwork`, a neural network model that can learn to predict `QValues` (expected returns) for all actions, given an observation from the environment.

Use `tf_agents.networks.q_network` to create a `QNetwork`, passing in the `observation_spec`, `action_spec`, and a tuple describing the number and size of the model's hidden layers.
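To make the idea of "one Q-value per action" concrete, the following is a small pure-Python sketch of a network with a single hidden layer of 100 units, mirroring the `(100,)` layer tuple used in this tutorial. The helper names, the ReLU activation, and the random initialisation are assumptions made for illustration; they are not taken from the `QNetwork` implementation.

```python
import math
import random


def make_dense(n_in, n_out, rng):
    """Create a randomly initialised dense layer as (weights, biases)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(layer, x):
    """Apply y = W x + b."""
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def toy_q_values(observation, hidden_layer, output_layer):
    """Map one observation to one Q-value per action."""
    hidden = [max(0.0, h) for h in dense(hidden_layer, observation)]  # ReLU
    return dense(output_layer, hidden)


rng = random.Random(0)
hidden_layer = make_dense(4, 100, rng)   # 4 observation features -> 100 hidden units
output_layer = make_dense(100, 2, rng)   # 100 hidden units -> 2 actions

q_values = toy_q_values([0.01, -0.02, 0.03, 0.04], hidden_layer, output_layer)
greedy_action = max(range(len(q_values)), key=lambda a: q_values[a])
```

The output has two entries because Cartpole has two actions; picking the index of the larger entry gives the greedy action for that observation.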


Define the Q-network for DQN using `QNetwork` from `tf_agents.networks.q_network`.

In [None]:
from tf_agents.networks import q_network

fc_layer_params = (100,)

# Pass observation_spec (the network input) and action_spec as arguments
q_net = q_network.QNetwork(
    train_env.observation_spec(),
    train_env.action_spec(),
    fc_layer_params=fc_layer_params)

Now use `tf_agents.agents.dqn.dqn_agent` to instantiate a `DqnAgent`. In addition to the `time_step_spec`, `action_spec` and the QNetwork, the agent constructor also requires an optimizer (in this case, `AdamOptimizer`), a loss function, and an integer step counter.
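For intuition about what the loss function is applied to, here is a sketch of the standard one-step Q-learning target and the squared TD error. This is the textbook formulation, written under the assumption that it is what the loss above is comparing; the text does not show how `DqnAgent` computes its targets internally, and `dqn_td_target`, `squared_td_loss`, and the `gamma` parameter are hypothetical names for this example.

```python
def dqn_td_target(reward, discount, next_q_values, gamma=1.0):
    """One-step target: r + gamma * discount * max_a' Q(s', a')."""
    return reward + gamma * discount * max(next_q_values)


def squared_td_loss(q_value_taken, target):
    """Squared error between the predicted Q-value and the target."""
    return (q_value_taken - target) ** 2


# Mid-episode transition: the bootstrap term is kept (discount = 1).
mid_target = dqn_td_target(
    reward=1.0, discount=1.0, next_q_values=[2.0, 3.5], gamma=0.9)
# Final transition: discount = 0 removes the bootstrap term.
last_target = dqn_td_target(
    reward=1.0, discount=0.0, next_q_values=[2.0, 3.5], gamma=0.9)
loss = squared_td_loss(q_value_taken=3.0, target=mid_target)
```

The `discount` field of the time step is what switches off bootstrapping at the end of an episode, which is why it is 0 on the last step in the environment outputs shown earlier.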

Declare the DQN agent using `DqnAgent` from `tf_agents.agents.dqn.dqn_agent`.

In [None]:
from tf_agents.agents.dqn import dqn_agent

optimizer = tf.compat.v1.train.AdamOptimizer(learning_rate=learning_rate)

train_step_counter = tf.Variable(0)

agent = dqn_agent.DqnAgent(
    train_env.time_step_spec(),
    train_env.action_spec(),
    q_network=q_net,
    optimizer=optimizer,
    td_errors_loss_fn=common.element_wise_squared_loss,
    train_step_counter=train_step_counter)

agent.initialize()

## Policies

A policy defines the way an agent acts in an environment. Typically, the goal of reinforcement learning is to train the underlying model until the policy produces the desired outcome.

In this tutorial:

-   The desired outcome is keeping the pole balanced upright over the cart.
-   The policy returns an action (left or right) for each `time_step` observation.

Agents contain two policies: 

-   `agent.policy` — The main policy that is used for evaluation and deployment.
-   `agent.collect_policy` — A second policy that is used for data collection.
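One common way to arrange these two roles in DQN, assumed here for illustration rather than confirmed by the text above, is to act greedily during evaluation and epsilon-greedily during data collection. The function names and the value `epsilon=0.1` in this sketch are hypothetical.

```python
import random


def greedy_action(q_values):
    """Pick the action with the largest Q-value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def epsilon_greedy_action(q_values, epsilon, rng):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return greedy_action(q_values)


rng = random.Random(0)
q_values = [0.2, 1.3]

# Evaluation: always exploit the current Q-value estimates.
eval_action = greedy_action(q_values)

# Collection: mostly exploit, but explore occasionally.
collect_actions = [epsilon_greedy_action(q_values, 0.1, rng) for _ in range(1000)]
```

The occasional random action is what lets the collection policy visit states that the greedy policy would never reach.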


Declare one policy for evaluation and one for data collection.

In [None]:
eval_policy = agent.policy
collect_policy = agent.collect_policy

Policies can be created independently of agents. For example, use `tf_agents.policies.random_tf_policy` to create a policy which will randomly select an action for each `time_step`.

A policy that outputs random actions can be declared without any agent. The specs describing its input and output are passed as arguments.

In [None]:
random_policy = random_tf_policy.RandomTFPolicy(train_env.time_step_spec(),
                                                train_env.action_spec())

To get an action from a policy, call the `policy.action(time_step)` method. The `time_step` contains the observation from the environment. This method returns a `PolicyStep`, which is a named tuple with three components:

-   `action` — the action to be taken (in this case, `0` or `1`)
-   `state` — used for stateful (that is, RNN-based) policies
-   `info` — auxiliary data, such as log probabilities of actions
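The three-field result described in the list above can be sketched in plain Python as follows. `ToyPolicyStep` and `ToyRandomPolicy` are hypothetical stand-ins written for this example, not the TF-Agents classes; the empty tuples for `state` and `info` are an assumption that suits a stateless policy with no auxiliary data.

```python
import random
from collections import namedtuple

# Hypothetical stand-in for PolicyStep (not the TF-Agents class).
ToyPolicyStep = namedtuple("ToyPolicyStep", ["action", "state", "info"])


class ToyRandomPolicy:
    """Stateless policy that ignores the observation and samples uniformly."""

    def __init__(self, num_actions, seed=None):
        self.num_actions = num_actions
        self.rng = random.Random(seed)

    def action(self, time_step):
        # state and info are empty because this policy is stateless
        # and records no auxiliary data.
        return ToyPolicyStep(self.rng.randrange(self.num_actions), (), ())


toy_policy = ToyRandomPolicy(num_actions=2, seed=0)
sampled = [toy_policy.action(time_step=None).action for _ in range(10)]
```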

In other words, calling `policy.action(time_step)` computes $a = \pi(s)$.

In [None]:
# Define an example CartPole environment
example_environment = tf_py_environment.TFPyEnvironment(
    suite_gym.load('CartPole-v0'))

In [None]:
# Create the initial state and print it
time_step = example_environment.reset()
print(time_step)

In [None]:
# Sample random actions from the initial state
for i in range(10):
  print (random_policy.action(time_step))

## Metrics and Evaluation

The most common metric used to evaluate a policy is the average return. The return is the sum of rewards obtained while running a policy in an environment for an episode. Several episodes are run, creating an average return.
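As a small worked example of this definition, the sketch below averages the undiscounted returns of three made-up episodes. The function name `average_return` and the episode data are hypothetical and only illustrate the arithmetic.

```python
def average_return(episode_rewards):
    """Average of the undiscounted per-episode reward sums."""
    returns = [sum(rewards) for rewards in episode_rewards]
    return sum(returns) / len(returns)


# Three hypothetical episodes that lasted 10, 20 and 30 steps with +1 per step.
episodes = [[1.0] * 10, [1.0] * 20, [1.0] * 30]
avg = average_return(episodes)
```

With a reward of +1 per step, each episode's return equals its length, so the average return here is the average episode length.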

The following function computes the average return of a policy, given the policy, environment, and a number of episodes.


That is, the function takes an environment, a policy, and the number of episodes to test, and returns the average return obtained by following that policy.

In [None]:
#@test {"skip": true}
def compute_avg_return(environment, policy, num_episodes=10):

  total_return = 0.0
  for _ in range(num_episodes):

    time_step = environment.reset()
    episode_return = 0.0

    while not time_step.is_last():  # Until the episode terminates
      action_step = policy.action(time_step)  # Choose an action from the policy for this state
      time_step = environment.step(action_step.action)  # Step to the next state with that action
      episode_return += time_step.reward  # Accumulate the reward (no discounting is applied here)
    total_return += episode_return  # Sum of returns over all episodes

  avg_return = total_return / num_episodes  # Average return
  return avg_return.numpy()[0]


# See also the metrics module for standard implementations of different metrics.
# https://github.com/tensorflow/agents/tree/master/tf_agents/metrics

Running this computation on the `random_policy` shows a baseline performance in the environment.

In [None]:
# Evaluate the random policy (the maximum possible return is 200)
compute_avg_return(eval_env, random_policy, num_eval_episodes)

## Replay Buffer

The replay buffer keeps track of data collected from the environment. This tutorial uses `tf_agents.replay_buffers.tf_uniform_replay_buffer.TFUniformReplayBuffer`, as it is the most common. 

The constructor requires the specs for the data it will be collecting. This is available from the agent using the `collect_data_spec` method. The batch size and maximum buffer length are also required.
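The behaviour of a fixed-capacity buffer with uniform sampling can be sketched in a few lines of plain Python. `ToyUniformReplayBuffer` is a hypothetical class for illustration only; it says nothing about how `TFUniformReplayBuffer` is implemented, and it stores plain Python objects rather than tensors described by a data spec.

```python
import random
from collections import deque


class ToyUniformReplayBuffer:
    """Fixed-capacity buffer that drops the oldest item and samples uniformly."""

    def __init__(self, max_length, seed=None):
        self.storage = deque(maxlen=max_length)
        self.rng = random.Random(seed)

    def add(self, item):
        # Once the buffer is full, the oldest item is discarded.
        self.storage.append(item)

    def sample(self, batch_size):
        # Uniform sampling with replacement.
        return [self.rng.choice(self.storage) for _ in range(batch_size)]


toy_buffer = ToyUniformReplayBuffer(max_length=5, seed=0)
for step in range(8):
    toy_buffer.add(step)

contents = list(toy_buffer.storage)
batch = toy_buffer.sample(batch_size=3)
```

After eight insertions into a buffer of length five, only the five most recent items remain, which is the role the maximum buffer length plays.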


Declare the replay buffer using `TFUniformReplayBuffer` from `tf_agents.replay_buffers.tf_uniform_replay_buffer`.

In [None]:
from tf_agents.replay_buffers import tf_uniform_replay_buffer

replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
    data_spec=agent.collect_data_spec, # the agent's collect_data_spec (see the code cells below)
    batch_size=train_env.batch_size, # 1 in this case
    max_length=replay_buffer_max_length) # taken from the hyperparameters

For most agents, `collect_data_spec` is a named tuple called `Trajectory`, containing the specs for observations, actions, rewards, and other items.

In [None]:
agent.collect_data_spec

In [None]:
agent.collect_data_spec._fields

In [32]:
# YJ
print (train_env.batch_size)
print (replay_buffer_max_length)

1
100000


## Data Collection

Now execute the random policy in the environment for a few steps, recording the data in the replay buffer.

Define the data collection functions.

In [None]:
#@test {"skip": true}
def collect_step(environment, policy, buffer):
  time_step = environment.current_time_step() # current time step
  action_step = policy.action(time_step) # action chosen by the policy at the current time step
  next_time_step = environment.step(action_step.action) # next time step resulting from that action
  traj = trajectory.from_transition(time_step, action_step, next_time_step) # (s, a, s') transition

  # Add trajectory to the replay buffer
  buffer.add_batch(traj)

def collect_data(env, policy, buffer, steps):
  for _ in range(steps):
    collect_step(env, policy, buffer)

collect_data(train_env, random_policy, replay_buffer, initial_collect_steps)

# This loop is so common in RL, that we provide standard implementations. 
# For more details see the drivers module.
# https://www.tensorflow.org/agents/api_docs/python/tf_agents/drivers

The replay buffer is now a collection of Trajectories.

In [None]:
# For the curious:
# Uncomment to peel one of these off and inspect it.
# iter(replay_buffer.as_dataset()).next()

In [35]:
# YJ
collect_data(train_env, random_policy, replay_buffer, initial_collect_steps)
replay_buffer_ds = replay_buffer.as_dataset() # expose the replay buffer as a dataset
replay_buffer_it = iter(replay_buffer.as_dataset()) # turn it into a Python iterator

next_trj = next(replay_buffer_it) # draw one item from the replay buffer
print (next_trj)

Instructions for updating:
Use `as_dataset(..., single_deterministic_pass=False) instead.
(Trajectory(step_type=<tf.Tensor: shape=(), dtype=int32, numpy=1>, observation=<tf.Tensor: shape=(4,), dtype=float32, numpy=array([ 0.03103212, -0.16749024, -0.02390645,  0.3096822 ], dtype=float32)>, action=<tf.Tensor: shape=(), dtype=int64, numpy=1>, policy_info=(), next_step_type=<tf.Tensor: shape=(), dtype=int32, numpy=1>, reward=<tf.Tensor: shape=(), dtype=float32, numpy=1.0>, discount=<tf.Tensor: shape=(), dtype=float32, numpy=1.0>), BufferInfo(ids=<tf.Tensor: shape=(), dtype=int64, numpy=20>, probabilities=<tf.Tensor: shape=(), dtype=float32, numpy=0.005>))


In [36]:
# Each item drawn from the replay buffer is a (Trajectory, BufferInfo) pair
## Trajectory has the fields 'step_type', 'observation', 'action', 'policy_info', 'next_step_type', 'reward', 'discount'
## BufferInfo has the fields 'ids' and 'probabilities'
print (next_trj[0])
print (next_trj[1])

Trajectory(step_type=<tf.Tensor: shape=(), dtype=int32, numpy=1>, observation=<tf.Tensor: shape=(4,), dtype=float32, numpy=array([ 0.03103212, -0.16749024, -0.02390645,  0.3096822 ], dtype=float32)>, action=<tf.Tensor: shape=(), dtype=int64, numpy=1>, policy_info=(), next_step_type=<tf.Tensor: shape=(), dtype=int32, numpy=1>, reward=<tf.Tensor: shape=(), dtype=float32, numpy=1.0>, discount=<tf.Tensor: shape=(), dtype=float32, numpy=1.0>)
BufferInfo(ids=<tf.Tensor: shape=(), dtype=int64, numpy=20>, probabilities=<tf.Tensor: shape=(), dtype=float32, numpy=0.005>)


The agent needs access to the replay buffer. This is provided by creating an iterable `tf.data.Dataset` pipeline which will feed data to the agent.

Each row of the replay buffer only stores a single observation step. But since the DQN Agent needs both the current and next observation to compute the loss, the dataset pipeline will sample two adjacent rows for each item in the batch (`num_steps=2`).
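The idea of sampling two adjacent rows per batch item can be sketched as follows. `sample_adjacent_pairs` and the row dictionaries are hypothetical and exist only for this illustration; the sketch also ignores details a real buffer has to handle, such as pairs that straddle an episode boundary.

```python
import random


def sample_adjacent_pairs(rows, batch_size, rng):
    """Sample `batch_size` windows of two consecutive rows."""
    starts = [rng.randrange(len(rows) - 1) for _ in range(batch_size)]
    return [(rows[i], rows[i + 1]) for i in starts]


# Ten hypothetical single-step rows, stored in the order they were collected.
rows = [{"id": i, "observation": [0.1 * i]} for i in range(10)]
pairs = sample_adjacent_pairs(rows, batch_size=4, rng=random.Random(0))
```

Each pair supplies the current observation from its first row and the next observation from its second row, which is what a one-step loss needs.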

This dataset is also optimized by running parallel calls and prefetching data.

In [37]:
# Dataset generates trajectories with shape [Bx2x...]
dataset = replay_buffer.as_dataset(
    num_parallel_calls=3, # number of parallel calls
    sample_batch_size=1, # number of items per sampled batch
    num_steps=2) # sample two adjacent steps to form (s, a, s')

dataset = dataset.prefetch(3) # number of elements to prefetch

iterator = iter(dataset)
print(iterator.next())

(Trajectory(step_type=<tf.Tensor: shape=(1, 2), dtype=int32, numpy=array([[1, 1]], dtype=int32)>, observation=<tf.Tensor: shape=(1, 2, 4), dtype=float32, numpy=
array([[[ 0.07531126, -0.41026783, -0.10986915,  0.3779574 ],
        [ 0.0671059 , -0.213771  , -0.10231   ,  0.05275274]]],
      dtype=float32)>, action=<tf.Tensor: shape=(1, 2), dtype=int64, numpy=array([[1, 0]])>, policy_info=(), next_step_type=<tf.Tensor: shape=(1, 2), dtype=int32, numpy=array([[1, 1]], dtype=int32)>, reward=<tf.Tensor: shape=(1, 2), dtype=float32, numpy=array([[1., 1.]], dtype=float32)>, discount=<tf.Tensor: shape=(1, 2), dtype=float32, numpy=array([[1., 1.]], dtype=float32)>), BufferInfo(ids=<tf.Tensor: shape=(1, 2), dtype=int64, numpy=array([[136, 137]])>, probabilities=<tf.Tensor: shape=(1,), dtype=float32, numpy=array([0.00502513], dtype=float32)>))


In [None]:
# Dataset generates trajectories with shape [Bx2x...]
dataset = replay_buffer.as_dataset(
    num_parallel_calls=3, 
    sample_batch_size=batch_size, 
    num_steps=2).prefetch(3)


dataset


In [None]:
iterator = iter(dataset)

print(iterator)


In [None]:
# For the curious:
# Uncomment to see what the dataset iterator is feeding to the agent.
# Compare this representation of replay data 
# to the collection of individual trajectories shown earlier.

#iterator.next()

## Training the agent

Two things must happen during the training loop:

-   collect data from the environment
-   use that data to train the agent's neural network(s)

This example also periodically evaluates the policy and prints the current score.

The following will take ~5 minutes to run.

In [41]:
#@test {"skip": true}
try:
  %%time
except:
  pass

dataset = replay_buffer.as_dataset(
  num_parallel_calls=3, 
  sample_batch_size=batch_size, 
  num_steps=2).prefetch(3)
iterator = iter(dataset)


# (Optional) Optimize by wrapping some of the code in a graph using TF function.
agent.train = common.function(agent.train)

# Reset the train step
agent.train_step_counter.assign(0)

# Evaluate the agent's policy once before training.
avg_return = compute_avg_return(eval_env, agent.policy, num_eval_episodes)
returns = [avg_return]

for _ in range(num_iterations):

  # Collect a few steps using collect_policy and save to the replay buffer.
  collect_data(train_env, agent.collect_policy, replay_buffer, collect_steps_per_iteration)

  # Sample a batch of data from the buffer and update the agent's network.
  experience, unused_info = next(iterator)
  train_loss = agent.train(experience).loss

  step = agent.train_step_counter.numpy()

  if step % log_interval == 0:
    print('step = {0}: loss = {1}'.format(step, train_loss))

  if step % eval_interval == 0:
    avg_return = compute_avg_return(eval_env, agent.policy, num_eval_episodes)
    print('step = {0}: Average Return = {1}'.format(step, avg_return))
    returns.append(avg_return)

CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 5.96 µs
Instructions for updating:
back_prop=False is deprecated. Consider using tf.stop_gradient instead.
Instead of:
results = tf.foldr(fn, elems, back_prop=False)
Use:
results = tf.nest.map_structure(tf.stop_gradient, tf.foldr(fn, elems))
step = 200: loss = 10.710550308227539
step = 400: loss = 26.235721588134766
step = 600: loss = 7.133241653442383
step = 800: loss = 16.48409080505371
step = 1000: loss = 17.304441452026367
step = 1000: Average Return = 16.5
step = 1200: loss = 15.005210876464844
step = 1400: loss = 15.476096153259277
step = 1600: loss = 7.5435075759887695
step = 1800: loss = 11.7515287399292
step = 2000: loss = 36.65065383911133
step = 2000: Average Return = 29.5
step = 2200: loss = 15.634055137634277
step = 2400: loss = 74.43091583251953
step = 2600: loss = 37.68577575683594
step = 2800: loss = 8.644510269165039
step = 3000: loss = 7.891654968261719
step = 3000: Average Return = 38.400001525878906
step = 3200

## Visualization


### Plots

Use `matplotlib.pyplot` to chart how the policy improved during training.

One episode of `CartPole-v0` lasts at most 200 time steps. The environment gives a reward of `+1` for each step the pole stays up, so the maximum return for one episode is 200. The chart shows the return increasing towards that maximum each time it is evaluated during training. (It may be a little unstable and not increase monotonically each time.)
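The arithmetic behind the chart can be checked with a short sketch. It reuses the hyperparameter values from this notebook (`num_iterations = 20000`, `eval_interval = 1000`); the names `eval_steps` and `max_episode_return` are introduced here only for illustration.

```python
num_iterations = 20000
eval_interval = 1000

# Training steps at which an average return is recorded. Step 0 is the
# evaluation done once before training starts.
eval_steps = list(range(0, num_iterations + 1, eval_interval))

# Undiscounted return of an episode that survives all 200 steps at +1 each.
max_episode_return = sum(1.0 for _ in range(200))
```

A completed training run therefore yields 21 evaluation points for the x-axis, and no point on the y-axis can exceed 200.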

In [None]:
#@test {"skip": true}

iterations = range(0, num_iterations + 1, eval_interval)
plt.plot(iterations, returns)
plt.ylabel('Average Return')
plt.xlabel('Iterations')
plt.ylim(top=250)

### Videos

Charts are nice. But more exciting is seeing an agent actually performing a task in an environment. 

First, create a function to embed videos in the notebook.

In [None]:
def embed_mp4(filename):
  """Embeds an mp4 file in the notebook."""
  video = open(filename,'rb').read()
  b64 = base64.b64encode(video)
  tag = '''
  <video width="640" height="480" controls>
    <source src="data:video/mp4;base64,{0}" type="video/mp4">
  Your browser does not support the video tag.
  </video>'''.format(b64.decode())

  return IPython.display.HTML(tag)

Now iterate through a few episodes of the Cartpole game with the agent. The underlying Python environment (the one "inside" the TensorFlow environment wrapper) provides a `render()` method, which outputs an image of the environment state. These can be collected into a video.

In [None]:
def create_policy_eval_video(policy, filename, num_episodes=5, fps=30):
  filename = filename + ".mp4"
  with imageio.get_writer(filename, fps=fps) as video:
    for _ in range(num_episodes):
      time_step = eval_env.reset()
      video.append_data(eval_py_env.render())
      while not time_step.is_last():
        action_step = policy.action(time_step)
        time_step = eval_env.step(action_step.action)
        video.append_data(eval_py_env.render())
  return embed_mp4(filename)




create_policy_eval_video(agent.policy, "trained-agent")

For fun, compare the trained agent (above) to an agent moving randomly. (It does not do as well.)

In [None]:
create_policy_eval_video(random_policy, "random-agent")