## DQN Implementation

- First, import the modules used throughout the code.

In [1]:
import gym
import random
import numpy as np
import tensorflow as tf
import tensorflow.keras.layers as kl
import tensorflow.keras.optimizers as ko
from collections import deque
from tqdm import tqdm, notebook  # library that displays training progress more cleanly.

### 1. Exploring the Gym CartPole environment

#### env.observation_space

- An object that describes the observable state. For CartPole it is Box(4).

Num | Observation | Min | Max
---|---|---|---
0 | Cart Position | -4.8 | 4.8
1 | Cart Velocity | -Inf | Inf
2 | Pole Angle | ~ -0.418 rad (-24&deg;) | ~ 0.418 rad (24&deg;)
3 | Pole Velocity At Tip | -Inf | Inf


#### env.action_space

- An object that describes the actions that can be applied to the environment. For CartPole it is Discrete(2).

Num | Action
--- | ---
0 | Push cart to the left
1 | Push cart to the right

In [2]:
# Create the CartPole-v0 gym environment and store it in env.
env = gym.make('CartPole-v0')

# Box(4,) denotes a vector with 4 elements.
print("Observation space:", env.observation_space)
print("Observation space Max:", env.observation_space.high)
print("Observation space Min:", env.observation_space.low)
# Discrete(2) means there are two distinct actions.
print("Action space:", env.action_space)
print("Action space num: ", env.action_space.n)

Observation space: Box(-3.4028234663852886e+38, 3.4028234663852886e+38, (4,), float32)
Observation space Max: [4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38]
Observation space Min: [-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38]
Action space: Discrete(2)
Action space num:  2
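
The printed angle bound is in radians, which is easy to misread. The short check below converts it to degrees using only the value shown in the output above. The termination thresholds in the comments (12 degrees and a cart position of 2.4) are taken from the Gym CartPole documentation as an assumption; they do not appear in this output.

```python
import math

# Upper bounds copied from the env.observation_space.high output above.
position_bound = 4.8            # cart position bound
angle_bound_rad = 0.41887903    # pole angle bound, in radians

angle_bound_deg = math.degrees(angle_bound_rad)
print(f"pole angle bound: {angle_bound_deg:.1f} degrees")

# Assumption (from the Gym CartPole documentation): the episode terminates
# at half of these bounds, i.e. roughly 12 degrees and a position of 2.4.
termination_angle_deg = angle_bound_deg / 2
termination_position = position_bound / 2
```

So the observation space is deliberately wider than the region in which an episode can continue.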


### 2. Interacting with env: env.reset(), env.step(action)


#### env.reset()
- Resets env to its initial setting.
- Returns the initial observation.


#### obs, reward, done, info =  env.step(action)
- Applies action to env.
- Returns obs (an element of env.observation_space), which is the new state after the action is applied, along with reward (float), done (bool), and metadata (dict).
- If done==True, the episode has ended and env must be reset again.
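
To make the reset/step contract concrete without depending on Gym, here is a minimal sketch with a hypothetical `ToyEnv` class. The class, its `max_steps` parameter, and the `run_episode` helper are illustrative assumptions and are not part of Gym. They only mimic the four-value return of `step` described above.

```python
import random


class ToyEnv:
    """Hypothetical stand-in that mimics the reset/step interface described above."""

    def __init__(self, max_steps=5):
        self.max_steps = max_steps
        self.t = 0

    def reset(self):
        # Start a new episode and return the initial observation.
        self.t = 0
        return [0.0, 0.0, 0.0, 0.0]

    def step(self, action):
        # Advance one step and return (obs, reward, done, info).
        self.t += 1
        obs = [float(self.t), 0.0, 0.0, 0.0]
        reward = 1.0                      # +1 per surviving step, as in CartPole
        done = self.t >= self.max_steps   # the episode ends after max_steps steps
        return obs, reward, done, {}


def run_episode(env, policy):
    """Run one episode and return the accumulated reward."""
    obs, done, ep_reward = env.reset(), False, 0.0
    while not done:
        obs, reward, done, info = env.step(policy(obs))
        ep_reward += reward
    return ep_reward


toy_env = ToyEnv(max_steps=5)
total = run_episode(toy_env, lambda obs: random.choice([0, 1]))
print("episode reward:", total)
```

Because `reset` is called at the start of `run_episode`, the same environment object can be reused for any number of episodes.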

In [3]:
# reset must be called at the start of every episode.
obs = env.reset()

# Sample a random action and apply it to the environment.
action = env.action_space.sample()
print("Sampled action: {}\n".format(action))
obs, reward, done, info = env.step(action)

# info is an empty dict here, but it can carry debugging information.
# reward is a scalar value.
print("obs : {}\nreward : {}\ndone : {}\ninfo : {}\n".format(obs, reward, done, info))

# Test a single episode
obs, done, ep_reward = env.reset(), False, 0

# Most gym environments follow this flow.
while True:
    action = env.action_space.sample()  # select an action
    obs, reward, done, info = env.step(action)  # apply the action to the environment
    ep_reward += reward
    if done:  # check whether the episode has ended
        break

env.close()
# In CartPole, the episode reward equals the number of steps the episode lasted.
print("episode reward : ", ep_reward)


print(obs.shape)
print(len(obs.shape))

Sampled action: 0

obs : [ 0.04930597 -0.21241097  0.03936892  0.32806232]
reward : 1.0
done : False
info : {}

episode reward :  20.0
(4,)
1


### 3. DQN implementation

## Network model

First, we implement the neural network model used for training.

- The network is used to approximate the Q value.
    - input : state
    - output : the action-value (Q value) of each action in the input state

**Model(num_of_actions)**
- Sets the output size to match the number of actions.
- CartPole has 2 discrete actions, so the model is initialized as Model(2).

**action_value(state)**
- Feeds state into the network and returns the action with the largest Q value (the best action).
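
The network always receives a batch, even when the batch holds a single state, so the Q values come back as one row per state. The sketch below uses plain Python lists and a hypothetical `best_actions` helper to show the same "argmax per row, then take element 0" pattern that `action_value` relies on. It is an illustration only and is not the Keras code.

```python
def best_actions(q_batch):
    """Return the index of the largest Q value in every row of the batch."""
    return [max(range(len(row)), key=row.__getitem__) for row in q_batch]


# One state, two actions: shape (1, 2), like the network output for obs[None].
q_batch = [[0.12, 0.57]]

# The result has one entry per state, so a single state needs index [0].
best_action = best_actions(q_batch)[0]
print(best_action)
```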

In [5]:
# Neural Network Model 
class Model(tf.keras.Model):
    def __init__(self, num_actions, units=[32, 32]):
        super().__init__()
        self.fc1 = kl.Dense(units[0], activation='relu', kernel_initializer='he_uniform')
        self.fc2 = kl.Dense(units[1], activation='relu', kernel_initializer='he_uniform')
        self.logits = kl.Dense(num_actions, name='q_values')

    # forward propagation
    def call(self, inputs):
        x = self.fc1(inputs)
        x = self.fc2(x)
        x = self.logits(x)
        return x

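    # return the best action, i.e. the one that maximizes the action-value (Q) from the network
    # a* = argmax_a' Q(s, a')
    def action_value(self, obs):
        q_values = self.predict(obs)
        best_action = np.argmax(q_values, axis=-1)
        return best_action[0]

    # assumed behavior: return the Q values of every action for the given observation(s)
    def get_value(self, obs):
        return self.predict(obs)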

## Hyperparameter settings

Declare the hyperparameters needed for training.

In [6]:
units=[32, 32]         # network structure: [32, 32] creates two hidden layers with 32 nodes each
epsilon=1.0            # initial value of epsilon
min_epsilon=.01        # minimum value of epsilon
epsilon_decay=0.995    # factor by which epsilon is multiplied at every step
train_nums=5000        # total number of training steps
gamma=0.95             # discount factor
start_learning = 20    # number of steps collected before training starts

buffer_size=5000       # size of the replay buffer
batch_size=8           # size of the transition minibatch sampled from the replay buffer

target_update_iter=400 # period (in steps) at which the target network is updated
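
With these values it helps to know how quickly exploration fades. The sketch below uses only the three epsilon settings above and computes the number of steps until epsilon reaches its floor. The names `eps_start`, `eps_min`, `eps_decay`, and `epsilon_at` are illustrative.

```python
import math

eps_start, eps_min, eps_decay = 1.0, 0.01, 0.995   # values from the cell above

# Smallest n such that eps_start * eps_decay**n <= eps_min.
steps_to_floor = math.ceil(math.log(eps_min / eps_start) / math.log(eps_decay))


def epsilon_at(step):
    """Epsilon after `step` multiplicative decays, clipped at the minimum."""
    return max(eps_start * eps_decay ** step, eps_min)


print("steps until epsilon reaches its minimum:", steps_to_floor)
print("epsilon after 300 steps: {:.2f}".format(epsilon_at(300)))
```

Exploration is therefore essentially over after roughly the first fifth of the 5000 training steps.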

In [7]:
network = Model(2)
print("network id ", id(network))
# Define the optimizer and loss function of the network.
opt = ko.Adam(learning_rate=.0015, clipvalue=10.0)  # clip gradients by value
network.compile(optimizer=opt, loss='mse')

network id  139823061000880


## Test network

Before training, run the CartPole environment with the freshly initialized network.

In [8]:
# test before train
epi_rewards = []
n_episodes = 10
for i in range(n_episodes):
    obs, done, epi_reward = env.reset(), False, 0.0 
    while not done:
        action = network.action_value(obs[None])  # get action from network; obs : (4,) -> obs[None] : (1, 4)
        next_obs, reward, done, _ = env.step(action)
        
        epi_reward += reward
        
        obs = next_obs
    
    print("{} episode reward : {}".format(i, epi_reward))
    epi_rewards.append(epi_reward)

mean_reward = np.mean(epi_rewards)
std_reward = np.std(epi_rewards)

print(f"mean_reward : {mean_reward:.2f} +/- {std_reward:.2f}")

0 episode reward : 8.0
1 episode reward : 9.0
2 episode reward : 15.0
3 episode reward : 12.0
4 episode reward : 9.0
5 episode reward : 11.0
6 episode reward : 9.0
7 episode reward : 11.0
8 episode reward : 8.0
9 episode reward : 11.0
mean_reward : 10.30 +/- 2.05


Now let's implement DQN in earnest.

## Train network

The target value is computed from state (obs) and the next_state, reward, and done values obtained from **env.step(action)**.

**np.amax()** = returns the largest value in an array.  
**network.train_on_batch(input, target)** = updates the network based on the loss so that its output for input moves closer to target.


Training proceeds in the following order.

step 1. Select action using epsilon-greedy  
step 2. Take step and store transition to replay buffer  
step 3. Train Network  
step 4. Target network update  
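
Step 1 can be sketched in isolation as follows. This is a minimal illustration in plain Python. The `epsilon_greedy` function and its `rng` parameter are hypothetical names, and the Q values are made-up numbers rather than network output.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                      # explore
    return max(range(len(q_values)), key=q_values.__getitem__)   # exploit


example_q = [0.3, 0.9]   # made-up Q values for the two CartPole actions
greedy = epsilon_greedy(example_q, epsilon=0.0)     # never explores
explored = epsilon_greedy(example_q, epsilon=1.0)   # always explores
print(greedy, explored)
```

Early in training epsilon is close to 1, so almost every action is random. Once epsilon has decayed to its minimum, nearly all actions follow the network.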

- The network is trained against the following target value.

$$
\begin{aligned}
& Y(s, a, r, s') = r + \gamma \max_{a'} Q_{\theta^{-}}(s', a') \\
& \mathcal{L}(\theta) = \mathbb{E}_{(s, a, r, s') \sim U(D)} \Big[ \big( Y(s, a, r, s') - Q_\theta(s, a) \big)^2 \Big]
\end{aligned}
$$
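
A small worked example of the target and the squared error, in plain Python. The `done` mask is not in the formula above. It follows the `(1 - done)` factor that the training code in this notebook uses, so that terminal transitions do not bootstrap from the next state. The function name `td_target` and all numbers are illustrative.

```python
def td_target(reward, next_q_values, gamma, done):
    """r + gamma * max_a' Q(s', a'); the bootstrap term is dropped when done."""
    return reward + gamma * max(next_q_values) * (1 - int(done))


gamma = 0.95
next_q = [2.0, 3.0]        # made-up Q values of the next state
current_q = 3.0            # made-up Q(s, a) for the action that was taken

y = td_target(1.0, next_q, gamma, done=False)          # 1 + 0.95 * 3
y_terminal = td_target(1.0, next_q, gamma, done=True)  # reward only
loss = (y - current_q) ** 2                            # squared TD error

print(y, y_terminal, loss)
```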


In [8]:
# initialize the initial observation of the agent
print("network id ", id(network))
obs = env.reset()
epi_reward = 0.0
epi = 0 # number of episode taken
epsilon=1.0

for t in notebook.tqdm(range(1, train_nums+1), desc='train with DQN'):
    # epsilon update
    if epsilon > min_epsilon:
        epsilon = max(epsilon * epsilon_decay, min_epsilon)

    #######################  step 1  ####################### 
    ####        Select action using epsilon-greedy       ### 
    ########################################################  

    # select the action that maximizes the Q value
    best_action = network.action_value(obs[None])  # input the obs to the network model // obs : (4, ) -> obs[None] : (1, 4)
    
    # e-greedy
    if np.random.rand() < epsilon:
        action = env.action_space.sample()  # with prob. epsilon, select a random action
    else:
        action = best_action                # otherwise, select the greedy action
    
    
    #######################  step 2  ####################### 
    #### Take step and store transition to replay buffer ### 
    ########################################################
    
    next_obs, reward, done, _ = env.step(action)    # Execute action in the env to return s' (next state), r, done
    epi_reward += reward
    
    #######################  step 3  ####################### 
    ####     Train network (perform gradient descent)    ### 
    ########################################################
    
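    # compute the target value r + gamma * max_a' Q(s', a')
    # np.amax returns the largest value in the array; (1 - done) drops the bootstrap term at terminal states
    target_q = reward + gamma * np.amax(network.predict(next_obs[None])) * (1 - done)
    
    # get action values from Q network
    q_values = network.predict(obs[None])
    
    # overwrite the Q value of the taken action with the target
    q_values[0][action] = target_q
    
    # perform a gradient descent step on the Q network
    # The loss measures the mean squared error between prediction and target
    network.train_on_batch(obs[None], q_values)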
 
    obs = next_obs  # s <- s'
    
    # if episode ends (done)
    if done:
        epi += 1  # increment the episode count
        if epi % 20 == 0:
            print("[Episode {:>5}] epi reward: {:>6.2f}  --eps : {:>4.2f} --steps : {:>5}".format(epi, epi_reward, epsilon, t))
        obs, done, epi_reward = env.reset(), False, 0.0  # Environment reset
            

network id  140503010370840



[Episode    20] epi reward:  11.00  --eps : 0.22 --steps :   300
[Episode    40] epi reward:   9.00  --eps : 0.08 --steps :   504
[Episode    60] epi reward:   9.00  --eps : 0.03 --steps :   689
[Episode    80] epi reward:  10.00  --eps : 0.01 --steps :   878
[Episode   100] epi reward:   9.00  --eps : 0.01 --steps :  1060
[Episode   120] epi reward:   9.00  --eps : 0.01 --steps :  1246
[Episode   140] epi reward:  10.00  --eps : 0.01 --steps :  1436
[Episode   160] epi reward:   9.00  --eps : 0.01 --steps :  1622
[Episode   180] epi reward:  10.00  --eps : 0.01 --steps :  1808
[Episode   200] epi reward:   9.00  --eps : 0.01 --steps :  1989
[Episode   220] epi reward:   8.00  --eps : 0.01 --steps :  2178
[Episode   240] epi reward:   9.00  --eps : 0.01 --steps :  2370
[Episode   260] epi reward:  10.00  --eps : 0.01 --steps :  2555
[Episode   280] epi reward:   9.00  --eps : 0.01 --steps :  2744
[Episode   300] epi reward:  11.00  --eps : 0.01 --steps :  2929
[Episode   320] epi rewar

In [9]:
epi_rewards = []
# After training    
for i in range(n_episodes):
    obs, done, epi_reward = env.reset(), False, 0.0
    while not done :
        action = network.action_value(obs[None])  # Using [None] to extend its dimension (4,) -> (1, 4)
        obs, reward, done, _ = env.step(action)
        epi_reward += reward
    print("{} episode reward : {}".format(i, epi_reward))
    epi_rewards.append(epi_reward)

mean_reward = np.mean(epi_rewards)
std_reward = np.std(epi_rewards)

print(f"mean_reward : {mean_reward:.2f} +/- {std_reward:.2f}")

0 episode reward : 11.0
1 episode reward : 9.0
2 episode reward : 9.0
3 episode reward : 8.0
4 episode reward : 10.0
5 episode reward : 11.0
6 episode reward : 10.0
7 episode reward : 10.0
8 episode reward : 10.0
9 episode reward : 8.0
mean_reward : 9.60 +/- 1.02


##  Experience Replay

- So far, the network was trained without Experience Replay, using only the transition (s, a, r, s', done) obtained at each step.
- Now we apply a Replay Buffer.

First, define the Replay Buffer as follows.

**store(s, a, r, s', done)** = stores the transition of each step in the buffer.  
**sample(batch_size)** = samples a mini-batch of size batch_size from the buffer.
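
A buffer with a fixed capacity can be built on `collections.deque` with `maxlen`, which silently drops the oldest item once the buffer is full. The sketch below shows that behavior and uniform sampling with dummy transitions. The names `demo_buffer` and `demo_batch` are illustrative.

```python
import random
from collections import deque

demo_buffer = deque(maxlen=3)          # capacity of 3 transitions
for step in range(5):
    # Dummy transition; a real one would be (s, a, r, done, s').
    demo_buffer.append((step, "transition"))

# Steps 0 and 1 were evicted when steps 3 and 4 arrived.
kept_steps = [step for step, _ in demo_buffer]
print(kept_steps)

# Uniform sampling without replacement from the stored transitions.
demo_batch = random.sample(list(demo_buffer), 2)
print(demo_batch)
```

Sampling uniformly from many past steps breaks the strong correlation between consecutive transitions, which is the motivation for replay in DQN.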

In [9]:
class ReplayBuffer:
    def __init__(self, buffer_size):
        self.buffer_size = buffer_size
        self.count = 0
        self.buffer = deque(maxlen=buffer_size) 

    # store transition of each step in replay buffer
    def store(self, s, a, r, next_s, d):
        experience = (s, a, r, d, next_s)
        self.buffer.append(experience)
        self.count += 1

    # Sample a random minibatch of transitions
    def sample(self, batch_size):
        # the deque holds at most buffer_size items, so check its actual length
        assert batch_size <= len(self.buffer)
        batch = random.sample(self.buffer, batch_size)
        s_batch, a_batch, r_batch, d_batch, s2_batch = map(np.array, list(zip(*batch)))
        
        return s_batch, a_batch, r_batch, s2_batch, d_batch
    
    def clear(self):
        self.buffer.clear()
        self.count = 0

Now let's include the Replay Buffer in training.

In step 2, instead of training right after taking a step, the transition is stored in the replay buffer.  
In step 3, a minibatch is sampled from the buffer and training proceeds in the same way on that minibatch.
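
The batched version of the update can be sketched with plain lists. For each sampled transition, only the Q value of the action that was taken is replaced by its target, and the other entries keep the predicted value, so they contribute zero error. The helper `build_training_targets` and all numbers are illustrative assumptions, not the notebook's Keras code.

```python
GAMMA = 0.95


def build_training_targets(q_batch, a_batch, r_batch, next_q_batch, done_batch):
    """Copy the predicted Q values and overwrite the taken action with its target."""
    targets = [row[:] for row in q_batch]   # copy, so predictions stay untouched
    for i, a in enumerate(a_batch):
        bootstrap = 0.0 if done_batch[i] else GAMMA * max(next_q_batch[i])
        targets[i][a] = r_batch[i] + bootstrap
    return targets


q_pred = [[0.5, 0.2], [0.1, 0.4]]        # made-up Q(s, .) for two transitions
next_q_pred = [[1.0, 2.0], [3.0, 0.0]]   # made-up Q(s', .)
targets = build_training_targets(
    q_pred, a_batch=[0, 1], r_batch=[1.0, 1.0],
    next_q_batch=next_q_pred, done_batch=[False, True],
)
print(targets)
```

The second transition is terminal, so its target is just the reward even though the next-state Q values are large.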

In [11]:
# initialize the initial observation of the agent
print("network id ", id(network))
replay_buffer = ReplayBuffer(buffer_size)  # assumed here: the buffer must exist before the loop
obs = env.reset()
epi_reward = 0.0
epi = 0 # number of episodes taken
epsilon=1.0

for t in notebook.tqdm(range(1, train_nums+1), desc='train with DQN'):
    # epsilon update
    if epsilon > min_epsilon:
        epsilon = max(epsilon * epsilon_decay, min_epsilon)

    #######################  step 1  ####################### 
    ####        Select action using epsilon-greedy       ### 
    ########################################################  

    # select the action that maximizes the Q value
    best_action = network.action_value(obs[None])  # input the obs to the network model // obs : (4, ) -> obs[None] : (1, 4)
    
    # e-greedy
    if np.random.rand() < epsilon:
        action = env.action_space.sample()  # with prob. epsilon, select a random action
    else:
        action = best_action                # otherwise, select the greedy action
    
    #######################  step 2  ####################### 
    #### Take step and store transition to replay buffer ### 
    ########################################################
    
    next_obs, reward, done, _ = env.step(action)    # Execute action in the env to return s' (next state), r, done
    epi_reward += reward
    # store transition (s, a, r, s', done)
    replay_buffer.store(obs, action, reward, next_obs, done)
    
    
    #######################  step 3  ####################### 
    ####     Train network (perform gradient descent)    ### 
    ########################################################
    
    if t > start_learning:
        # sample a minibatch of transitions from the replay buffer
        s_batch, a_batch, r_batch, ns_batch, done_batch = replay_buffer.sample(batch_size)
        # calculate target values; the batches already have shape (batch_size, 4), so [None] is not needed
        target_q = r_batch + gamma * np.amax(network.predict_on_batch(ns_batch), axis=1) * (1 - done_batch)
        
        # get action values from Q network
        q_values = network.predict(s_batch)
    
        # update q_values 'in batch': overwrite only the Q value of the action taken in each transition
        for i, a in enumerate(a_batch):
            q_values[i][a] = target_q[i]
     
        network.train_on_batch(s_batch, q_values)
 
    obs = next_obs  # s <- s'
    
    # if episode ends (done)
    if done:
        epi += 1  # increment the episode count
        if epi % 20 == 0:
            print("[Episode {:>5}] epi reward: {:>6.2f}  --eps : {:>4.2f} --steps : {:>5}".format(epi, epi_reward, epsilon, t))
        obs, done, epi_reward = env.reset(), False, 0.0  # Environment reset
            


[Episode    20] epi reward:  12.00  --eps : 0.29 --steps :   244
[Episode    40] epi reward:  12.00  --eps : 0.10 --steps :   456
[Episode    60] epi reward:  28.00  --eps : 0.03 --steps :   682
[Episode    80] epi reward: 172.00  --eps : 0.01 --steps :  2017



In [12]:
epi_rewards = []
# After training    
for i in range(n_episodes):
    obs, done, epi_reward = env.reset(), False, 0.0
    while not done :
        action = network.action_value(obs[None])  # Using [None] to extend its dimension (4,) -> (1, 4)
        obs, reward, done, _ = env.step(action)
        epi_reward += reward
    print("{} episode reward : {}".format(i, epi_reward))
    epi_rewards.append(epi_reward)

mean_reward = np.mean(epi_rewards)
std_reward = np.std(epi_rewards)

print(f"mean_reward : {mean_reward:.2f} +/- {std_reward:.2f}")

0 episode reward : 200.0
1 episode reward : 180.0
2 episode reward : 200.0
3 episode reward : 200.0
4 episode reward : 200.0
5 episode reward : 200.0
6 episode reward : 200.0
7 episode reward : 200.0
8 episode reward : 200.0
9 episode reward : 200.0
mean_reward : 198.00 +/- 6.00


## Target network

- Now let's add a target network.  

In step 3, the Q value of the next state used in the target value is taken from the target network.  
In step 4, the weights of the target network are updated at a fixed period so that it matches the network.
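
The periodic hard update can be illustrated without any deep learning library. In the sketch below, a dictionary stands in for the network weights and a constant increment stands in for a gradient step. Both are assumptions made purely for illustration. The point is that the target copy lags behind the online weights and only changes every `period` steps.

```python
def sync_steps(total_steps, period):
    """Steps at which the target network is overwritten with the online weights."""
    return [t for t in range(1, total_steps + 1) if t % period == 0]


period = 400                       # same value as target_update_iter
online = {"w": 0.0}                # stand-in for the online network weights
target = dict(online)              # the target starts as a copy

for t in range(1, 1001):
    online["w"] += 0.1             # stand-in for one training update
    if t % period == 0:
        target = dict(online)      # hard update: copy the weights as they are

print("online:", round(online["w"], 1), "target:", round(target["w"], 1))
print("number of syncs in 5000 steps:", len(sync_steps(5000, period)))
```

Between two syncs the target values are computed from frozen weights, which keeps the regression target from shifting at every gradient step.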

In [13]:
replay_buffer = ReplayBuffer(buffer_size)
network = Model(2)
opt = ko.Adam(learning_rate=.0015, clipvalue=10.0)  # clip gradients by value
network.compile(optimizer=opt, loss='mse')

obs = env.reset()
epi_reward = 0.0
epi = 0 # number of episodes taken
epsilon=1.0

# Target network (one possible setup, assumed here): same architecture as network.
# Both models are called once so that their weights exist before they are copied.
target_network = Model(2)
network.predict(obs[None])
target_network.predict(obs[None])
target_network.set_weights(network.get_weights())

for t in notebook.tqdm(range(1, train_nums+1), desc='train with DQN'):
    # epsilon update
    if epsilon > min_epsilon:
        epsilon = max(epsilon * epsilon_decay, min_epsilon)

    #######################  step 1  ####################### 
    ####        Select action using epsilon-greedy       ### 
    ########################################################
    # select the action that maximizes the Q value
    best_action = network.action_value(obs[None])  # input the obs to the network model // obs : (4, ) -> obs[None] : (1, 4)
    
    # e-greedy
    if np.random.rand() < epsilon:
        action = env.action_space.sample()  # with prob. epsilon, select a random action
    else:
        action = best_action                # otherwise, select the greedy action
    
    #######################  step 2  ####################### 
    #### Take step and store transition to replay buffer ### 
    ########################################################
    
    next_obs, reward, done, _ = env.step(action)    # Execute action in the env to return s' (next state), r, done
    epi_reward += reward
    replay_buffer.store(obs, action, reward, next_obs, done)
    
    #######################  step 3  ####################### 
    ####     Train network (perform gradient descent)    ### 
    ########################################################
    
    if t > start_learning:
        
        s_batch, a_batch, r_batch, ns_batch, done_batch = replay_buffer.sample(batch_size) 
        # calculate target values r + gamma * maxQ(s', a') using the target Q network
        target_q = r_batch + gamma * np.amax(target_network.predict_on_batch(ns_batch), axis=1) * (1 - done_batch)
        
        q_values = network.predict(s_batch)

        # overwrite only the Q value of the action taken in each transition
        for i, a in enumerate(a_batch):
            q_values[i][a] = target_q[i]
 
        network.train_on_batch(s_batch, q_values)
    
    #######################  step 4  ####################### 
    ####             Update target network               ### 
    ########################################################
      
    if t % target_update_iter == 0:
        # update the target network by copying the weights of network
        target_network.set_weights(network.get_weights())
 
 
    obs = next_obs  # s <- s'
    # if episode ends (done)
    if done:
        epi += 1  # increment the episode count
        if epi % 20 == 0:
            print("[Episode {:>5}] epi reward: {:>6.2f}  --eps : {:>4.2f} --steps : {:>5}".format(epi, epi_reward, epsilon, t))
        obs, done, epi_reward = env.reset(), False, 0.0  # Environment reset
            


[Episode    20] epi reward:  15.00  --eps : 0.25 --steps :   278
[Episode    40] epi reward:  11.00  --eps : 0.08 --steps :   492
[Episode    60] epi reward:   9.00  --eps : 0.03 --steps :   694
[Episode    80] epi reward:   9.00  --eps : 0.01 --steps :   897
[Episode   100] epi reward:  10.00  --eps : 0.01 --steps :  1092
[Episode   120] epi reward:  17.00  --eps : 0.01 --steps :  1314
[Episode   140] epi reward:  15.00  --eps : 0.01 --steps :  1547
[Episode   160] epi reward:  14.00  --eps : 0.01 --steps :  1791
[Episode   180] epi reward:  17.00  --eps : 0.01 --steps :  2071
[Episode   200] epi reward:  14.00  --eps : 0.01 --steps :  2402
[Episode   220] epi reward:  98.00  --eps : 0.01 --steps :  3758



In [14]:
epi_rewards = []
# After training    
for i in range(n_episodes):
    obs, done, epi_reward = env.reset(), False, 0.0
    while not done :
        action = network.action_value(obs[None])  # Using [None] to extend its dimension (4,) -> (1, 4)
        obs, reward, done, _ = env.step(action)
        epi_reward += reward
    print("{} episode reward : {}".format(i, epi_reward))
    epi_rewards.append(epi_reward)

mean_reward = np.mean(epi_rewards)
std_reward = np.std(epi_rewards)

print(f"mean_reward : {mean_reward:.2f} +/- {std_reward:.2f}")

0 episode reward : 200.0
1 episode reward : 200.0
2 episode reward : 200.0
3 episode reward : 200.0
4 episode reward : 200.0
5 episode reward : 200.0
6 episode reward : 200.0
7 episode reward : 200.0
8 episode reward : 200.0
9 episode reward : 200.0
mean_reward : 200.00 +/- 0.00


In [15]:
from gym.wrappers import Monitor
import glob
import io
import base64
from IPython.display import HTML

env = gym.make('CartPole-v0')
env = Monitor(env, './video', force=True)
epi_reward = 0
obs = env.reset()
while True:
    action = network.action_value(obs[None])
    obs, reward, done, _ = env.step(action)
    epi_reward += reward
    if done:
        print("episode reward : {}".format(epi_reward))
        break
env.close()

video = io.open('./video/openaigym.video.%s.video000000.mp4' % env.file_infix, 'r+b').read()
encoded = base64.b64encode(video)
HTML(data='''
    <video width="360" height="auto" alt="test" controls><source src="data:video/mp4;base64,{0}" type="video/mp4" /></video>'''
.format(encoded.decode('ascii')))

NoSuchDisplayException: Cannot connect to "None"