# Day 4: Applying Neural Networks to Reinforcement Learning

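## DQN: Using a CNN

* Key ideas of DQN
  * Experience Replay: store the agent's action history and build each training mini-batch from states taken at different time steps of various episodes.
  * Fixed Target Q-Network: compute the value of the next state from parameters that are kept fixed for a set period, rather than from the model that is currently being trained.
  * Clipping: unify rewards across all games, with 1 for success and -1 for failure.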
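
Experience Replay and Clipping can be sketched with the standard library alone. The `ReplayBuffer` class and `clip_reward` function below are hypothetical helpers, not part of `fn_framework`. The field names `s`, `a`, `r`, `n_s`, `d` only mirror the attributes used by the agent code in this notebook.

```python
import random
from collections import deque, namedtuple

# Hypothetical container for one transition (an assumption for this sketch).
Experience = namedtuple("Experience", ["s", "a", "r", "n_s", "d"])


def clip_reward(reward):
    """Map any positive reward to 1, any negative reward to -1, and keep 0."""
    if reward > 0:
        return 1
    if reward < 0:
        return -1
    return 0


class ReplayBuffer:
    """Fixed-size store; the oldest experiences are dropped first."""

    def __init__(self, buffer_size):
        self._buffer = deque(maxlen=buffer_size)

    def add(self, experience):
        self._buffer.append(experience)

    def sample(self, batch_size):
        # Uniform sampling mixes time steps from different episodes.
        return random.sample(list(self._buffer), batch_size)

    def __len__(self):
        return len(self._buffer)


buffer = ReplayBuffer(buffer_size=5)
for step in range(8):
    raw_reward = (-10, 0, 25)[step % 3]  # made-up raw game rewards
    buffer.add(Experience(s=step, a=0, r=clip_reward(raw_reward),
                          n_s=step + 1, d=False))

batch = buffer.sample(3)
```

Because the buffer holds only the five most recent experiences, the sampled batch never contains steps 0 to 2.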

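* The CNN has six layers
  * conv:    32 filters of 8×8, stride 4, padding "same" (the output size is the input size divided by the stride, rounded up), ReLU activation
  * conv:    64 filters of 4×4, stride 2, padding "same", ReLU activation
  * conv:    64 filters of 3×3, stride 1, padding "same", ReLU activation
  * flatten: reshape the image-shaped data into a single one-dimensional vector
  * Dense:   256-dimensional output vector, ReLU activation
  * Dense:   output vector with as many dimensions as there are actions (no activation)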
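
To get a feel for the tensor sizes, the following sketch traces the spatial size through the three conv layers. It assumes an 80×80 input (the size passed to `CatcherObserver` in `main`) and the strides 4, 2, 1 used in `make_model`. With `padding="same"` the kernel size does not affect the output size, so only the stride is needed.

```python
import math


def same_padding_output(size, stride):
    """Spatial output size of a conv layer with padding="same"."""
    return math.ceil(size / stride)


size = 80  # assumed input height/width
layers = [("conv1", 4, 32), ("conv2", 2, 64), ("conv3", 1, 64)]  # (name, stride, filters)
shapes = []
for name, stride, filters in layers:
    size = same_padding_output(size, stride)
    shapes.append((size, size, filters))
    print(name, shapes[-1])

# Flatten turns the last feature map into a single vector.
flatten_units = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]
print("flatten", flatten_units)
```

Under these assumptions the feature maps are 20×20×32, 10×10×64, and 10×10×64, so the first Dense layer receives a 6400-dimensional vector.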

## Using a GPU with TensorFlow
First uninstall tensorflow, then install tensorflow-gpu.

```sh
$ pip uninstall tensorflow tensorflow-estimator
$ pip install tensorflow-gpu==1.14.0
```

In [1]:
# Check whether TensorFlow recognizes the GPU
from tensorflow.python.client import device_lib
device_lib.list_local_devices()

2022-11-16 11:33:04.708758: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2022-11-16 11:33:04.750807: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2100000000 Hz
2022-11-16 11:33:04.755093: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55a47c8e5f60 executing computations on platform Host. Devices:
2022-11-16 11:33:04.755118: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
2022-11-16 11:33:04.757134: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2022-11-16 11:33:05.203358: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55a47eb9d300 executing computations on platform CUDA. Devices:
2022-11-16 11:33:05.203389: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Tesla V100-PC

[name: "/device:CPU:0"
 device_type: "CPU"
 memory_limit: 268435456
 locality {
 }
 incarnation: 6708173589168198501,
 name: "/device:XLA_CPU:0"
 device_type: "XLA_CPU"
 memory_limit: 17179869184
 locality {
 }
 incarnation: 1972572433080393565
 physical_device_desc: "device: XLA_CPU device",
 name: "/device:XLA_GPU:0"
 device_type: "XLA_GPU"
 memory_limit: 17179869184
 locality {
 }
 incarnation: 4716525398441322657
 physical_device_desc: "device: XLA_GPU device",
 name: "/device:XLA_GPU:1"
 device_type: "XLA_GPU"
 memory_limit: 17179869184
 locality {
 }
 incarnation: 3992034414864283460
 physical_device_desc: "device: XLA_GPU device",
 name: "/device:XLA_GPU:2"
 device_type: "XLA_GPU"
 memory_limit: 17179869184
 locality {
 }
 incarnation: 9670817580364291828
 physical_device_desc: "device: XLA_GPU device",
 name: "/device:XLA_GPU:3"
 device_type: "XLA_GPU"
 memory_limit: 17179869184
 locality {
 }
 incarnation: 5841414615415302248
 physical_device_desc: "device: XLA_GPU device"]

In [2]:
import tensorflow as tf
config = tf.ConfigProto(
    gpu_options=tf.GPUOptions(
        visible_device_list="3", # specify GPU number
        allow_growth=True
    )
)
sess = tf.Session(config=config)

In [3]:
import random
import argparse
from collections import deque
import numpy as np
from tensorflow.python import keras as K
from PIL import Image

import gym
import gym_ple

import sys
sys.path.append("../FN")
from fn_framework import FNAgent, Trainer, Observer


class DeepQNetworkAgent(FNAgent):

    def __init__(self, epsilon, actions):
        super().__init__(epsilon, actions)
        self._scaler = None
        self._teacher_model = None

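    def initialize(self, experiences, optimizer):
        """Build and compile the model."""
        # Take the input shape from the first experience
        # (for Catcher this is the shape of the stacked-frame image).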
        feature_shape = experiences[0].s.shape
        self.make_model(feature_shape)
        self.model.compile(optimizer, loss="mse")
        self.initialized = True
        print("Done initialization. From now, begin training!")

    def make_model(self, feature_shape):
        """Create the Q-network and a teacher network with the same architecture."""
        normal = K.initializers.glorot_normal()
        model = K.Sequential()
        model.add(K.layers.Conv2D(
            32, kernel_size=8, strides=4, padding="same",
            input_shape=feature_shape, kernel_initializer=normal,
            activation="relu"))
        model.add(K.layers.Conv2D(
            64, kernel_size=4, strides=2, padding="same",
            kernel_initializer=normal,
            activation="relu"))
        model.add(K.layers.Conv2D(
            64, kernel_size=3, strides=1, padding="same",
            kernel_initializer=normal,
            activation="relu"))
        model.add(K.layers.Flatten())
        model.add(K.layers.Dense(256, kernel_initializer=normal,
                                 activation="relu"))
        model.add(K.layers.Dense(len(self.actions),
                                 kernel_initializer=normal))
        self.model = model
        self._teacher_model = K.models.clone_model(self.model)

    def estimate(self, state):
        return self.model.predict(np.array([state]))[0]

    def update(self, experiences, gamma):
        states = np.array([e.s for e in experiences])
        n_states = np.array([e.n_s for e in experiences])

        estimateds = self.model.predict(states)
        # Unlike the MLP version, the next-state values are computed by a DQN
        # that is kept fixed for a set period, which helps stabilize learning.
        future = self._teacher_model.predict(n_states)

        for i, e in enumerate(experiences):
            reward = e.r
            if not e.d:
                reward += gamma * np.max(future[i])
            estimateds[i][e.a] = reward

        loss = self.model.train_on_batch(states, estimateds)
        return loss
    # Copy the current model weights into the teacher (fixed target) network.
    def update_teacher(self):
        self._teacher_model.set_weights(self.model.get_weights())
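
The target that `update` writes into `estimateds` can be illustrated with plain numbers. This is a minimal sketch: `td_target` is a hypothetical helper and all Q-values below are made up for illustration.

```python
def td_target(reward, done, next_q_values, gamma):
    """Return r if the episode ended, else r + gamma * max_a' Q_teacher(s', a')."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


gamma = 0.99
# Hypothetical Q-values of the next state, as predicted by the fixed teacher network.
teacher_next_q = [0.2, 1.0, -0.5]

ongoing = td_target(reward=1.0, done=False, next_q_values=teacher_next_q, gamma=gamma)
terminal = td_target(reward=-1.0, done=True, next_q_values=teacher_next_q, gamma=gamma)

# Only the entry of the action actually taken is overwritten, so the
# squared error is zero for every other action.
estimated = [0.3, 0.4, 0.1]  # hypothetical output of the current model
action = 1
estimated[action] = ongoing
```

Since the other entries keep the model's own predictions, the `mse` loss only pushes the Q-value of the chosen action toward the target.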



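Code 4-16: Class for testing with CartPole  
Training the CNN built by make_model takes a long time, so this class swaps in a small model to check that the other parts work.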

In [4]:
class DeepQNetworkAgentTest(DeepQNetworkAgent):

    def __init__(self, epsilon, actions):
        super().__init__(epsilon, actions)

    def make_model(self, feature_shape):
        normal = K.initializers.glorot_normal()
        model = K.Sequential()
        model.add(K.layers.Dense(64, input_shape=feature_shape, kernel_initializer=normal, activation="relu"))
        model.add(K.layers.Dense(len(self.actions), kernel_initializer=normal, activation="relu"))
        self.model = model
        self._teacher_model = K.models.clone_model(self.model)  # Fixed Target Q-Network

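Code 4-17: Observer definition for the Catcher game  

The input is four screen frames in time order, so the observer stacks the four most recent frames into one feature.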

In [5]:
class CatcherObserver(Observer):

    def __init__(self, env, width, height, frame_count):
        super().__init__(env)
        self.width = width
        self.height = height
        self.frame_count = frame_count
        self._frames = deque(maxlen=frame_count)

    def transform(self, state):
        grayed = Image.fromarray(state).convert("L")
        resized = grayed.resize((self.width, self.height))
        resized = np.array(resized).astype("float")
        normalized = resized / 255.0  # scale to 0~1
        # At first, four frames are not yet available,
        # so copy the first frame four times.
        if len(self._frames) == 0:
            for i in range(self.frame_count):
                self._frames.append(normalized)
        else:
            self._frames.append(normalized)
        feature = np.array(self._frames)
        # Convert the feature shape (f, h, w) => (h, w, f).
        feature = np.transpose(feature, (1, 2, 0))

        return feature
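
The buffering behaviour of `transform` can be seen in isolation with strings in place of images. `FrameStacker` is a hypothetical helper written for this sketch. It reproduces only the `deque` logic, not the grayscale conversion or resizing.

```python
from collections import deque


class FrameStacker:
    """Keep the most recent `frame_count` frames (hypothetical helper)."""

    def __init__(self, frame_count):
        self.frame_count = frame_count
        self._frames = deque(maxlen=frame_count)

    def push(self, frame):
        if len(self._frames) == 0:
            # No history yet: repeat the first frame to fill the buffer.
            for _ in range(self.frame_count):
                self._frames.append(frame)
        else:
            # deque(maxlen=...) discards the oldest frame automatically.
            self._frames.append(frame)
        return list(self._frames)


stacker = FrameStacker(frame_count=4)
first = stacker.push("f0")
second = stacker.push("f1")
latest = None
for name in ("f2", "f3", "f4"):
    latest = stacker.push(name)
```

After the fifth frame the copies of the first frame have been pushed out, and the buffer holds the four most recent frames in time order.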

Code 4-18: Trainer definition

In [9]:
class DeepQNetworkTrainer(Trainer):

    def __init__(self, buffer_size=50000, batch_size=32,
                 gamma=0.99, initial_epsilon=0.5, final_epsilon=1e-3,
                 learning_rate=1e-3, teacher_update_freq=3, report_interval=10,
                 log_dir="", file_name=""):
        super().__init__(buffer_size, batch_size, gamma,
                         report_interval, log_dir)
        self.file_name = file_name if file_name else "dqn_agent.h5"
        self.initial_epsilon = initial_epsilon
        self.final_epsilon = final_epsilon
        self.learning_rate = learning_rate
        self.teacher_update_freq = teacher_update_freq
        self.loss = 0
        self.training_episode = 0
        self._max_reward = -10

    def train(self, env, episode_count=1200, initial_count=200,
              test_mode=False, render=False, observe_interval=100):
        actions = list(range(env.action_space.n))
        if not test_mode:
            agent = DeepQNetworkAgent(1.0, actions)
        else:
            agent = DeepQNetworkAgentTest(1.0, actions)
            observe_interval = 0
        self.training_episode = episode_count

        self.train_loop(env, agent, episode_count, initial_count, render,
                        observe_interval)
        return agent

    def episode_begin(self, episode, agent):
        self.loss = 0

    def begin_train(self, episode, agent):
        optimizer = K.optimizers.Adam(lr=self.learning_rate, clipvalue=1.0)
        agent.initialize(self.experiences, optimizer)
        self.logger.set_model(agent.model)
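        # Start from initial_epsilon; episode_end() then decays it step by step
        # so that the agent takes fewer random actions over time.
        agent.epsilon = self.initial_epsilon
        # Episodes left for training, used to compute the decay per episode.
        self.training_episode -= episode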

    def step(self, episode, step_count, agent, experience):
        if self.training:
            batch = random.sample(self.experiences, self.batch_size)
            self.loss += agent.update(batch, self.gamma)

    def episode_end(self, episode, step_count, agent):
        """Log the reward and loss, and save the model during training."""
        reward = sum([e.r for e in self.get_recent(step_count)])
        self.loss = self.loss / step_count
        self.reward_log.append(reward)
        if self.training:
            self.logger.write(self.training_count, "loss", self.loss)
            self.logger.write(self.training_count, "reward", reward)
            self.logger.write(self.training_count, "epsilon", agent.epsilon)
            if reward > self._max_reward:
                agent.save(self.logger.path_of(self.file_name))
                self._max_reward = reward
            if self.is_event(self.training_count, self.teacher_update_freq):
                agent.update_teacher()

            diff = (self.initial_epsilon - self.final_epsilon)
            decay = diff / self.training_episode
            agent.epsilon = max(agent.epsilon - decay, self.final_epsilon)

        if self.is_event(episode, self.report_interval):
            recent_rewards = self.reward_log[-self.report_interval:]
            self.logger.describe("reward", recent_rewards, episode=episode)
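
The epsilon decay in `episode_end` is linear. The sketch below reproduces it outside the Trainer using the default `initial_epsilon` and `final_epsilon`. The figure of 1000 training episodes is an assumption (`episode_count=1200` minus `initial_count=200`). The actual value depends on when `begin_train` is called.

```python
def epsilon_schedule(initial_epsilon, final_epsilon, training_episodes):
    """Yield epsilon after each training episode, decaying linearly to the floor."""
    decay = (initial_epsilon - final_epsilon) / training_episodes
    epsilon = initial_epsilon
    for _ in range(training_episodes):
        # Same update as in episode_end(): never drop below final_epsilon.
        epsilon = max(epsilon - decay, final_epsilon)
        yield epsilon


# 1000 training episodes is an assumption for this sketch.
values = list(epsilon_schedule(0.5, 1e-3, 1000))
print(values[0], values[499], values[-1])
```

Epsilon falls by the same amount every episode and reaches `final_epsilon` at the end of training, so exploration shrinks steadily rather than stopping abruptly.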

Code 4-19: Run the training

In [10]:
def main(play, is_test):
    file_name = "dqn_agent.h5" if not is_test else "dqn_agent_test.h5"
    trainer = DeepQNetworkTrainer(file_name=file_name)
    path = trainer.logger.path_of(trainer.file_name)
    agent_class = DeepQNetworkAgent

    if is_test:
        print("Train on test mode")
        obs = gym.make("CartPole-v0")
        agent_class = DeepQNetworkAgentTest
    else:
        env = gym.make("Catcher-v0")
        obs = CatcherObserver(env, 80, 80, 4)
        trainer.learning_rate = 1e-4

    if play:
        agent = agent_class.load(obs, path)
        agent.play(obs, render=True)
    else:
        trainer.train(obs, test_mode=is_test)

In [11]:
parser = argparse.ArgumentParser(description="DQN Agent")
parser.add_argument("--play", action="store_true",
                    help="play with trained model")
parser.add_argument("--test", action="store_true",
                    help="train by test mode")

args = parser.parse_args(args=[])
main(args.play, args.test)



At episode 10, reward is -6.8 (+/-1.72)
At episode 20, reward is -7.3 (+/-1.005)
At episode 30, reward is -7.4 (+/-0.8)
At episode 40, reward is -7.0 (+/-1.0)
At episode 50, reward is -7.3 (+/-0.781)
At episode 60, reward is -6.7 (+/-1.1)
At episode 70, reward is -7.3 (+/-1.005)
At episode 80, reward is -6.2 (+/-1.47)
At episode 90, reward is -6.6 (+/-1.114)
At episode 100, reward is -7.3 (+/-0.64)
At episode 110, reward is -7.1 (+/-0.831)
At episode 120, reward is -6.5 (+/-1.36)
At episode 130, reward is -6.9 (+/-1.044)
At episode 140, reward is -6.5 (+/-1.36)
At episode 150, reward is -7.0 (+/-0.632)
At episode 160, reward is -6.8 (+/-1.166)
At episode 170, reward is -7.1 (+/-1.221)
At episode 180, reward is -6.3 (+/-1.1)
At episode 190, reward is -6.8 (+/-1.077)
At episode 200, reward is -6.8 (+/-1.077)
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
Instructions for updating:
Call initializer instance with the dt