##### Copyright 2021 The TF-Agents Authors.

In [None]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Train a Deep Q Network with TF-Agents

<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://www.tensorflow.org/agents/tutorials/1_dqn_tutorial">
    <img src="https://www.tensorflow.org/images/tf_logo_32px.png" />
    View on TensorFlow.org</a>
  </td>
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/tensorflow/agents/blob/master/docs/tutorials/1_dqn_tutorial.ipynb">
    <img src="https://www.tensorflow.org/images/colab_logo_32px.png" />
    Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/tensorflow/agents/blob/master/docs/tutorials/1_dqn_tutorial.ipynb">
    <img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />
    View source on GitHub</a>
  </td>
  <td>
    <a href="https://storage.googleapis.com/tensorflow_docs/agents/docs/tutorials/1_dqn_tutorial.ipynb"><img src="https://www.tensorflow.org/images/download_logo_32px.png" />Download notebook</a>
  </td>
</table>

## Introduction


This example shows how to train a DQN (Deep Q Networks) agent on the Cartpole environment using the TF-Agents library.

![Cartpole environment](https://raw.githubusercontent.com/tensorflow/agents/master/docs/tutorials/images/cartpole.png)

It walks you through all the components of a Reinforcement Learning (RL) pipeline for training, evaluation, and data collection.

To run this code live, click the 'Run in Google Colab' link above.


## Setup

If you haven't installed the following dependencies, run:

In [None]:
!sudo apt-get update
!sudo apt-get install -y xvfb ffmpeg
!pip install 'imageio==2.4.0'
!pip install pyvirtualdisplay
!pip install tf-agents

In [None]:
from __future__ import absolute_import, division, print_function

import base64
import imageio
import IPython
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import PIL.Image
import pyvirtualdisplay

import tensorflow as tf

from tf_agents.agents.dqn import dqn_agent
from tf_agents.environments import suite_gym
from tf_agents.environments import tf_py_environment
from tf_agents.eval import metric_utils
from tf_agents.metrics import tf_metrics
from tf_agents.networks import sequential
from tf_agents.policies import random_tf_policy
from tf_agents.replay_buffers import tf_uniform_replay_buffer
from tf_agents.trajectories import trajectory
from tf_agents.specs import tensor_spec
from tf_agents.utils import common

In [None]:
# Set up a virtual display for rendering OpenAI gym environments.
display = pyvirtualdisplay.Display(visible=0, size=(1400, 900)).start()

In [None]:
tf.version.VERSION

## Hyperparameters

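`num_iterations` is the total number of training iterations.

`collect_steps_per_iteration` is the number of environment steps collected in each training iteration.

`log_interval` is the interval, in training steps, at which the loss is logged.

`num_eval_episodes` is the number of episodes used to compute the average return.

`eval_interval` is the interval, in training steps, at which the policy is evaluated.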

In [None]:
num_iterations = 20000 # @param {type:"integer"}

initial_collect_steps = 100  # @param {type:"integer"} 
collect_steps_per_iteration = 1  # @param {type:"integer"}
replay_buffer_max_length = 100000  # @param {type:"integer"}

batch_size = 64  # @param {type:"integer"}
learning_rate = 1e-3  # @param {type:"number"}
log_interval = 200  # @param {type:"integer"}

num_eval_episodes = 10  # @param {type:"integer"}
eval_interval = 1000  # @param {type:"integer"}

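Training starts by collecting data with the `collect_data()` function, which is defined later in this tutorial.

Training data is then drawn from the dataset, and the agent's `train()` method performs the update.

The loss is printed every `log_interval` steps and the average return every `eval_interval` steps.

For details on the Python string `format()` method used in those messages, see the Python documentation.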
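
As a rough sanity check on the hyperparameters above, the following sketch works out how much data the training loop touches and how many loss messages it prints. It assumes one training call per iteration, so that the train step counter equals the iteration number, and it copies the values from the hyperparameter cell; the names of the derived quantities are illustrative only.

```python
# Values copied from the hyperparameter cell above.
num_iterations = 20000
initial_collect_steps = 100
collect_steps_per_iteration = 1
batch_size = 64
log_interval = 200

# Environment steps written to the replay buffer over the whole run.
total_collected_steps = (
    initial_collect_steps + num_iterations * collect_steps_per_iteration)

# Batch items drawn from the buffer (one batch per training iteration).
total_sampled_items = num_iterations * batch_size

# Assumption: the step counter advances by 1 per iteration, so a loss message
# is printed whenever step % log_interval == 0 for step in 1..num_iterations.
num_loss_logs = num_iterations // log_interval

print('collected steps = {0}, sampled items = {1}'.format(
    total_collected_steps, total_sampled_items))
print('loss messages = {0}'.format(num_loss_logs))
```

Because far more items are sampled than collected, each stored step is expected to be reused many times during training.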

## Environment

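In Reinforcement Learning (RL), an environment represents the task or problem to be solved. Standard environments can be created in TF-Agents using the `tf_agents.environments` suites. TF-Agents has suites for loading environments from sources such as OpenAI Gym, Atari, and DM Control.

Load the CartPole environment from the OpenAI Gym suite.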

In [None]:
env_name = 'CartPole-v0'
env = suite_gym.load(env_name)

You can render this environment to see how it looks. A free-swinging pole is attached to a cart. The goal is to move the cart right or left so that the pole keeps pointing up.

In [None]:
#@test {"skip": true}
env.reset()
PIL.Image.fromarray(env.render())

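The `environment.step` method takes an `action` in the environment and returns a `TimeStep` tuple containing the next observation of the environment and the reward for the action.

The `time_step_spec()` method returns the specification for the `TimeStep` tuple. Its `observation` attribute shows the shape of observations, their data type, and the range of allowed values. The `reward` attribute shows the same details for the reward.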

In [None]:
print('Observation Spec:')
print(env.time_step_spec().observation)

In [None]:
print('Reward Spec:')
print(env.time_step_spec().reward)

The `action_spec()` method returns the shape, data type, and allowed values of valid actions.

In [None]:
print('Action Spec:')
print(env.action_spec())

In the Cartpole environment:

-   `observation` is an array of 4 floats:
    -   the position and velocity of the cart
    -   the angular position and velocity of the pole
-   `reward` is a scalar float value
-   `action` is a scalar integer with only two possible values:
    -   `0` — "move left"
    -   `1` — "move right"



In [None]:
time_step = env.reset()
print('Time step:')
print(time_step)

action = np.array(1, dtype=np.int32)

next_time_step = env.step(action)
print('Next time step:')
print(next_time_step)
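
To make the printed time steps easier to read, here is a minimal sketch of a `TimeStep`-like tuple in plain Python. `SketchTimeStep` is a hypothetical stand-in that mirrors the four field names but is not the library class, and all observation values are made up for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for the TimeStep tuple (not the TF-Agents class).
SketchTimeStep = namedtuple(
    'SketchTimeStep', ['step_type', 'reward', 'discount', 'observation'])

# Step type codes: first step of an episode, a middle step, the final step.
FIRST, MID, LAST = 0, 1, 2

def is_last(time_step):
  """Returns True when the episode has ended at this time step."""
  return time_step.step_type == LAST

# Made-up observations: cart position, cart velocity, pole angle,
# pole angular velocity.
reset_step = SketchTimeStep(FIRST, 0.0, 1.0, [0.01, -0.02, 0.03, 0.04])
next_step = SketchTimeStep(MID, 1.0, 1.0, [0.01, 0.17, 0.03, -0.25])
final_step = SketchTimeStep(LAST, 1.0, 0.0, [0.50, 1.20, 0.22, -1.10])

print(is_last(reset_step), is_last(next_step), is_last(final_step))
```

A check like `is_last` is what lets an evaluation loop know when to stop stepping and reset the environment.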

Usually two environments are instantiated: one for training and one for evaluation.

In [None]:
train_py_env = suite_gym.load(env_name)
eval_py_env = suite_gym.load(env_name)

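The Cartpole environment, like most environments, is written in pure Python. It is converted to TensorFlow using the `TFPyEnvironment` wrapper.

The original environment's API uses NumPy arrays. `TFPyEnvironment` converts these to `Tensors` so that the environment is compatible with TensorFlow agents and policies.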

In [None]:
train_env = tf_py_environment.TFPyEnvironment(train_py_env)
eval_env = tf_py_environment.TFPyEnvironment(eval_py_env)

## Agent

The algorithm used to solve an RL problem is represented by an `Agent`. TF-Agents provides standard implementations of a variety of `Agents`, including:


-   [DQN](https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf) (used in this tutorial)
-   [REINFORCE](https://www-anw.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf)
-   [DDPG](https://arxiv.org/pdf/1509.02971.pdf)
-   [TD3](https://arxiv.org/pdf/1802.09477.pdf)
-   [PPO](https://arxiv.org/abs/1707.06347)
-   [SAC](https://arxiv.org/abs/1801.01290)

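The DQN agent can be used in any environment that has a discrete action space.

At the heart of a DQN agent is a `QNetwork`, a neural network that predicts a Q-value (expected return) for every possible action, given an observation of the current state.

Use `tf_agents.networks.sequential` to build the Q-network. The network consists of a sequence of `tf.keras.layers.Dense` layers, where the final layer has one output for each possible action.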

In [None]:
fc_layer_params = (100, 50)
action_tensor_spec = tensor_spec.from_spec(env.action_spec())
num_actions = action_tensor_spec.maximum - action_tensor_spec.minimum + 1

print(action_tensor_spec)
print(num_actions)
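
The `maximum - minimum + 1` expression counts the integers in a closed interval. The sketch below checks that arithmetic on a hypothetical `SketchBoundedSpec` tuple that holds only the two bounds; it is not the TF-Agents spec class.

```python
from collections import namedtuple

# Hypothetical bounded spec holding only the two fields the formula needs.
SketchBoundedSpec = namedtuple('SketchBoundedSpec', ['minimum', 'maximum'])

def count_actions(spec):
  """Counts the integers in the closed interval [minimum, maximum]."""
  return spec.maximum - spec.minimum + 1

# CartPole-style bounds: actions 0 and 1.
cartpole_actions = SketchBoundedSpec(minimum=0, maximum=1)
print(count_actions(cartpole_actions))

# The + 1 matters because both bounds are valid actions: -1, 0 and 1.
shifted_actions = SketchBoundedSpec(minimum=-1, maximum=1)
print(count_actions(shifted_actions))
```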

In [None]:
# Define a helper function to create Dense layers configured with the right
# activation and kernel initializer.
def dense_layer(num_units):
  return tf.keras.layers.Dense(
      num_units,
      activation=tf.keras.activations.relu,
      kernel_initializer=tf.keras.initializers.VarianceScaling(
          scale=2.0, mode='fan_in', distribution='truncated_normal'))

# QNetwork consists of a sequence of Dense layers followed by a dense layer
# with `num_actions` units to generate one q_value per available action as
# its output.
dense_layers = [dense_layer(num_units) for num_units in fc_layer_params]

q_values_layer = tf.keras.layers.Dense(
    num_actions,
    activation=None,
    kernel_initializer=tf.keras.initializers.RandomUniform(
        minval=-0.03, maxval=0.03),
    bias_initializer=tf.keras.initializers.Constant(-0.2))

q_net = sequential.Sequential(dense_layers + [q_values_layer])

In [None]:
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)

train_step_counter = tf.Variable(0)

agent = dqn_agent.DqnAgent(
    train_env.time_step_spec(),
    train_env.action_spec(),
    q_network=q_net,
    optimizer=optimizer,
    td_errors_loss_fn=common.element_wise_squared_loss,
    train_step_counter=train_step_counter)

agent.initialize()
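
To make the role of `td_errors_loss_fn` more concrete, the sketch below computes a one-step TD target and its squared error for a single made-up transition in plain Python. It is a simplification and not the library's implementation: the Q-values are invented, `gamma` is an illustrative discount factor, and the separate target network that DQN normally uses for next-state values is left out.

```python
def td_target(reward, discount, next_q_values, gamma=0.99):
  """One-step target: reward + gamma * discount * max over next Q-values."""
  return reward + gamma * discount * max(next_q_values)

def squared_td_error(q_values, action, target):
  """Squared difference between the target and Q(s, a) for the taken action."""
  return (target - q_values[action]) ** 2

# Made-up Q-values for the two CartPole actions (0 = left, 1 = right).
q_values = [1.5, 2.0]
next_q_values = [2.5, 1.0]

# Mid-episode step: discount is 1.0, so the next state's value is bootstrapped.
target = td_target(reward=1.0, discount=1.0, next_q_values=next_q_values)
loss = squared_td_error(q_values, action=1, target=target)
print(target, loss)

# Terminal step: discount is 0.0, so the target reduces to the reward.
terminal_target = td_target(
    reward=1.0, discount=0.0, next_q_values=next_q_values)
print(terminal_target)
```

Training nudges the network so that `Q(s, a)` for the action actually taken moves toward the target, which is why only one entry of `q_values` appears in the error.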

## Policies

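A policy defines the way an agent acts in an environment. Typically, the goal of reinforcement learning is to train the underlying model until the policy produces the desired outcome.

In this tutorial:

-   The desired outcome is keeping the pole balanced upright over the cart.
-   The policy returns an action (left or right) for each `time_step` observation.

Agents contain two policies:

-   `agent.policy`: the main policy, used for evaluation and deployment.
-   `agent.collect_policy`: a second policy, used for data collection.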

In [None]:
eval_policy = agent.policy
collect_policy = agent.collect_policy
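
A common way for two such policies to differ is that the evaluation policy always picks the action with the highest Q-value, while the collect policy occasionally picks a random action in order to explore. The sketch below illustrates that idea in plain Python; the function names, Q-values, and `epsilon` value are illustrative assumptions rather than the agent's actual code.

```python
import random

def greedy_action(q_values):
  """Returns the index of the largest Q-value."""
  return max(range(len(q_values)), key=lambda a: q_values[a])

def epsilon_greedy_action(q_values, epsilon, rng):
  """With probability epsilon picks a random action, otherwise the greedy one."""
  if rng.random() < epsilon:
    return rng.randrange(len(q_values))
  return greedy_action(q_values)

rng = random.Random(0)  # Seeded so the sketch is repeatable.
q_values = [0.3, 0.9]   # Made-up values for actions 0 (left) and 1 (right).

print(greedy_action(q_values))

# Most, but not all, exploratory choices still match the greedy action.
actions = [epsilon_greedy_action(q_values, 0.1, rng) for _ in range(1000)]
print(sum(a == 1 for a in actions) / len(actions))
```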

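Policies can be created independently of agents. For example, use `tf_agents.policies.random_tf_policy` to create a policy that randomly selects an action for each `time_step`.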

In [None]:
random_policy = random_tf_policy.RandomTFPolicy(train_env.time_step_spec(),
                                                train_env.action_spec())

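To get an action from a policy, call the `policy.action(time_step)` method. The `time_step` contains the observation from the environment. The method returns a named tuple with three components: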

-   `action` — the action to be taken (in this case, `0` or `1`)
-   `state` — used for stateful (that is, RNN-based) policies
-   `info` — auxiliary data, such as log probabilities of actions

In [None]:
example_environment = tf_py_environment.TFPyEnvironment(
    suite_gym.load('CartPole-v0'))

In [None]:
time_step = example_environment.reset()

In [None]:
random_policy.action(time_step)

## Metrics and Evaluation

The most common metric used to evaluate a policy is the average return. The return is the sum of rewards obtained while running a policy in an environment for an episode. Several episodes are run, creating an average return.

The following function computes the average return of a policy, given the policy, environment, and a number of episodes.


In [None]:
#@test {"skip": true}
def compute_avg_return(environment, policy, num_episodes=10):

  total_return = 0.0
  for _ in range(num_episodes):

    time_step = environment.reset()
    episode_return = 0.0

    while not time_step.is_last():
      action_step = policy.action(time_step)
      time_step = environment.step(action_step.action)
      episode_return += time_step.reward
    total_return += episode_return

  avg_return = total_return / num_episodes
  return avg_return.numpy()[0]


# See also the metrics module for standard implementations of different metrics.
# https://github.com/tensorflow/agents/tree/master/tf_agents/metrics
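
As a worked example of the same arithmetic without TensorFlow, the sketch below averages the returns of three made-up episodes. The helper names and reward lists are illustrative only.

```python
def episode_return(rewards):
  """The return of one episode is the sum of its rewards."""
  return sum(rewards)

def average_return(episodes):
  """Averages the per-episode returns over all episodes."""
  return sum(episode_return(r) for r in episodes) / len(episodes)

# Three made-up CartPole episodes lasting 12, 30 and 18 steps, with a reward
# of 1.0 for every step.
episodes = [[1.0] * 12, [1.0] * 30, [1.0] * 18]
print(average_return(episodes))
```

Since every step earns the same reward here, the average return is simply the average episode length.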

Running this computation on the `random_policy` shows a baseline performance in the environment.

In [None]:
compute_avg_return(eval_env, random_policy, num_eval_episodes)

## Replay Buffer

The replay buffer keeps track of data collected from the environment. This tutorial uses `tf_agents.replay_buffers.tf_uniform_replay_buffer.TFUniformReplayBuffer`, as it is the most common. 

The constructor requires the specs for the data it will be collecting. These are available from the agent through its `collect_data_spec` attribute. The batch size and maximum buffer length are also required.


In [None]:
replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
    data_spec=agent.collect_data_spec,
    batch_size=train_env.batch_size,
    max_length=replay_buffer_max_length)
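
For intuition about what a fixed-capacity, uniformly sampled buffer does, here is a minimal pure-Python sketch. `SketchReplayBuffer` is a hypothetical class written for illustration and is much simpler than `TFUniformReplayBuffer`: it stores plain Python objects and ignores batching and specs.

```python
import collections
import random

class SketchReplayBuffer:
  """Hypothetical fixed-capacity buffer with uniform sampling."""

  def __init__(self, max_length):
    # A deque with maxlen drops the oldest item once it is full.
    self._items = collections.deque(maxlen=max_length)

  def add(self, item):
    self._items.append(item)

  def sample(self, batch_size, rng):
    """Draws batch_size items uniformly at random, with replacement."""
    return [rng.choice(self._items) for _ in range(batch_size)]

  def __len__(self):
    return len(self._items)

buffer = SketchReplayBuffer(max_length=5)
for step in range(8):
  buffer.add(step)  # Integers stand in for trajectories.

print(len(buffer))          # Capacity caps the size.
print(list(buffer._items))  # Only the most recent items remain.
print(buffer.sample(batch_size=3, rng=random.Random(0)))
```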

For most agents, `collect_data_spec` is a named tuple called `Trajectory`, containing the specs for observations, actions, rewards, and other items.

In [None]:
agent.collect_data_spec

In [None]:
agent.collect_data_spec._fields

## Data Collection

Now execute the random policy in the environment for a few steps, recording the data in the replay buffer.

In [None]:
#@test {"skip": true}
def collect_step(environment, policy, buffer):
  time_step = environment.current_time_step()
  action_step = policy.action(time_step)
  next_time_step = environment.step(action_step.action)
  traj = trajectory.from_transition(time_step, action_step, next_time_step)

  # Add trajectory to the replay buffer
  buffer.add_batch(traj)

def collect_data(env, policy, buffer, steps):
  for _ in range(steps):
    collect_step(env, policy, buffer)

collect_data(train_env, random_policy, replay_buffer, initial_collect_steps)

# This loop is so common in RL that TF-Agents provides standard implementations.
# For more details see tutorial 4 or the drivers module.
# https://github.com/tensorflow/agents/blob/master/docs/tutorials/4_drivers_tutorial.ipynb 
# https://www.tensorflow.org/agents/api_docs/python/tf_agents/drivers

The replay buffer is now a collection of Trajectories.

In [None]:
# For the curious:
# Uncomment to peel one of these off and inspect it.
# iter(replay_buffer.as_dataset()).next()

The agent needs access to the replay buffer. This is provided by creating an iterable `tf.data.Dataset` pipeline which will feed data to the agent.

Each row of the replay buffer only stores a single observation step. But since the DQN Agent needs both the current and next observation to compute the loss, the dataset pipeline will sample two adjacent rows for each item in the batch (`num_steps=2`).

This dataset is also optimized by running parallel calls and prefetching data.

In [None]:
# Dataset generates trajectories with shape [Bx2x...]
dataset = replay_buffer.as_dataset(
    num_parallel_calls=3, 
    sample_batch_size=batch_size, 
    num_steps=2).prefetch(3)


dataset
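
The effect of `num_steps=2` can be pictured as sampling windows of two consecutive rows. The sketch below shows that idea on a plain Python list; `sample_adjacent_pairs` is a hypothetical helper, and unlike a real replay buffer it does not deal with episode boundaries or batched storage.

```python
import random

def sample_adjacent_pairs(rows, batch_size, rng):
  """Samples batch_size windows of two consecutive rows."""
  batch = []
  for _ in range(batch_size):
    start = rng.randrange(len(rows) - 1)  # Leave room for the next row.
    batch.append((rows[start], rows[start + 1]))
  return batch

# Made-up buffer rows: each holds one observation step, labelled by index.
rows = ['obs_{0}'.format(i) for i in range(6)]

batch = sample_adjacent_pairs(rows, batch_size=4, rng=random.Random(0))
for current_row, next_row in batch:
  print(current_row, '->', next_row)
```

Each sampled pair supplies the current and next observation that the loss computation needs.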

In [None]:
iterator = iter(dataset)
print(iterator)

In [None]:
# For the curious:
# Uncomment to see what the dataset iterator is feeding to the agent.
# Compare this representation of replay data 
# to the collection of individual trajectories shown earlier.

# iterator.next()

## Training the agent

Two things must happen during the training loop:

-   collect data from the environment
-   use that data to train the agent's neural network(s)

This example also periodically evaluates the policy and prints the current score.

The following will take ~5 minutes to run.

In [None]:
#@test {"skip": true}
try:
  %%time
except:
  pass

# (Optional) Optimize by wrapping some of the code in a graph using TF function.
agent.train = common.function(agent.train)

# Reset the train step
agent.train_step_counter.assign(0)

# Evaluate the agent's policy once before training.
avg_return = compute_avg_return(eval_env, agent.policy, num_eval_episodes)
returns = [avg_return]

for _ in range(num_iterations):

  # Collect a few steps using collect_policy and save to the replay buffer.
  collect_data(train_env, agent.collect_policy, replay_buffer, collect_steps_per_iteration)

  # Sample a batch of data from the buffer and update the agent's network.
  experience, unused_info = next(iterator)
  train_loss = agent.train(experience).loss

  step = agent.train_step_counter.numpy()

  if step % log_interval == 0:
    print('step = {0}: loss = {1}'.format(step, train_loss))

  if step % eval_interval == 0:
    avg_return = compute_avg_return(eval_env, agent.policy, num_eval_episodes)
    print('step = {0}: Average Return = {1}'.format(step, avg_return))
    returns.append(avg_return)

## Visualization


### Plots

Use `matplotlib.pyplot` to chart how the policy improved during training.

An episode of `CartPole-v0` lasts at most 200 time steps. The environment gives a reward of `+1` for each step the pole stays up, so the maximum return for one episode is 200. The chart shows the return increasing towards that maximum each time it is evaluated during training. (It may be a little unstable and not increase monotonically each time.)

In [None]:
#@test {"skip": true}

iterations = range(0, num_iterations + 1, eval_interval)
plt.plot(iterations, returns)
plt.ylabel('Average Return')
plt.xlabel('Iterations')
plt.ylim(top=250)
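
The plot needs one x value per entry in `returns`. The short check below, using the hyperparameter values from this tutorial, shows why the range starts at 0 and ends at `num_iterations + 1`: it covers the evaluation before training plus one evaluation every `eval_interval` steps.

```python
num_iterations = 20000
eval_interval = 1000

# One x value per evaluation: step 0 (before training), then every
# eval_interval steps up to and including num_iterations.
iterations = range(0, num_iterations + 1, eval_interval)

# Expected number of entries in `returns` after a full training run.
expected_points = 1 + num_iterations // eval_interval

print(len(iterations), expected_points)
print(iterations[0], iterations[1], iterations[-1])
```

If training is interrupted early, `returns` will be shorter than `iterations` and the two sequences will no longer line up for plotting.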

### Videos

Charts are nice. But more exciting is seeing an agent actually performing a task in an environment. 

First, create a function to embed videos in the notebook.

In [None]:
def embed_mp4(filename):
  """Embeds an mp4 file in the notebook."""
  # Read the file inside a context manager so the handle is always closed.
  with open(filename, 'rb') as f:
    video = f.read()
  b64 = base64.b64encode(video)
  tag = '''
  <video width="640" height="480" controls>
    <source src="data:video/mp4;base64,{0}" type="video/mp4">
  Your browser does not support the video tag.
  </video>'''.format(b64.decode())

  return IPython.display.HTML(tag)
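
The embedding works because base64 turns arbitrary bytes into plain ASCII text that can sit inside an HTML attribute. The sketch below demonstrates the round trip with placeholder bytes instead of a real video; `build_video_tag` is a hypothetical, shortened variant of the function above that returns the raw HTML string.

```python
import base64

def build_video_tag(video_bytes):
  """Builds an HTML video tag with the bytes inlined as a base64 data URI."""
  b64_text = base64.b64encode(video_bytes).decode()
  return ('<video controls>'
          '<source src="data:video/mp4;base64,{0}" type="video/mp4">'
          '</video>').format(b64_text)

# Placeholder bytes, not a real mp4 file.
fake_video = b'\x00\x01not-a-real-video\xff'
tag = build_video_tag(fake_video)

# Pull the encoded payload back out of the tag and decode it.
encoded = tag.split('base64,')[1].split('"')[0]
print(base64.b64decode(encoded) == fake_video)
```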

Now iterate through a few episodes of the Cartpole game with the agent. The underlying Python environment (the one "inside" the TensorFlow environment wrapper) provides a `render()` method, which outputs an image of the environment state. These can be collected into a video.

In [None]:
def create_policy_eval_video(policy, filename, num_episodes=5, fps=30):
  filename = filename + ".mp4"
  with imageio.get_writer(filename, fps=fps) as video:
    for _ in range(num_episodes):
      time_step = eval_env.reset()
      video.append_data(eval_py_env.render())
      while not time_step.is_last():
        action_step = policy.action(time_step)
        time_step = eval_env.step(action_step.action)
        video.append_data(eval_py_env.render())
  return embed_mp4(filename)




create_policy_eval_video(agent.policy, "trained-agent")

For fun, compare the trained agent (above) to an agent moving randomly. (It does not do as well.)

In [None]:
create_policy_eval_video(random_policy, "random-agent")