# Actor Critic Method with CartPole

## Reinforcement Learning

<img src='../img/RL01.png' width='600'>

- policy : $\pi(a|s)=P(a|s), \forall s, \forall a$

- value function : $v_{\pi}(s)=\sum_{z} P(z)R(z)=\sum_{a\in\mathbb{A}(s)}P(a|s)(r+\gamma v_{\pi}(s')), \forall s \in \mathbb{S}$

- return (discounted sum of rewards) : $R(z)=r_{t+1}+\gamma r_{t+2} + \gamma^2r_{t+3}+\cdots=\sum_{k=1}^{\infty}\gamma^{k-1}r_{t+k}$

- Q-Value : $Q_{\pi}(s,a)=E_{\pi}[r_{t+1}+\gamma r_{t+2}+\gamma^2r_{t+3}+\cdots|S_t=s,A_t=a]$
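
The discounted sum in the definitions above can be computed with a single backward pass over an episode's rewards. The following is a minimal pure-Python sketch; the helper name `discounted_returns` is made up for illustration and is not part of gym or PyTorch.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return [G_0, G_1, ...] where G_t = r_{t+1} + gamma * r_{t+2} + ..."""
    returns = []
    running = 0.0
    # Walk backwards so each return reuses the one that follows it.
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


# Three steps with reward 1 each and gamma = 0.5
print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

Earlier timesteps receive larger returns because more future rewards are still ahead of them.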

### Actor Critic

The agent takes an action, and the resulting state is passed to two outputs:

1. Actor : decides which action to take given the state

2. Critic : evaluates the value of the state

## Setup

In [24]:
!pip install gym
!apt-get install python-opengl -y
!apt install xvfb -y
!pip install pyvirtualdisplay
!pip install pyglet


In [2]:
import gym
from gym import wrappers
import numpy as np
import pandas as pd
import random
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.distributions import Categorical
import matplotlib.pyplot as plt
%matplotlib inline

In [3]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [4]:
# Create a virtual display for rendering game frames.
# Only needed in environments such as Colab or Jupyter; not needed locally.
import os
if not os.environ.get("DISPLAY"):
    !bash ../xvfb start
    %env DISPLAY=:1

bash: ../xvfb: No such file or directory
env: DISPLAY=:1


In [5]:
# Configuration parameters for the whole setup
seed = 42
gamma = 0.99  # Discount factor for future rewards
alpha = 0.01  # Learning rate
max_steps_per_episode = 10000
env = gym.make("CartPole-v0")  # Create the environment
env.seed(seed)
eps = np.finfo(np.float32).eps.item()  # float32 machine epsilon; avoids division by zero

- CartPole

<img src='../img/RL03.png' width='300'>

## Implementing the Actor Critic Network

1. Actor : takes the state as input and returns a probability for each action, $\pi(a|s)$

2. Critic : takes the state as input and returns an estimate of the total future reward, $V(s)$

<img src='../img/RL02.png' width='300'>
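
To make the actor's output concrete, the sketch below turns raw scores (logits) into action probabilities with a softmax and then samples an action from them. It uses only the standard library; `softmax` and `sample_action` are illustrative helper names, and the notebook itself relies on `F.softmax` and `Categorical` from PyTorch for the same steps.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(probs, rng=random):
    """Draw an action index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


probs = softmax([0.0, 0.0])  # equal scores give equal probabilities
action = sample_action(probs, random.Random(0))  # 0 = left, 1 = right
print(probs, action)
```

Sampling, rather than always taking the most probable action, keeps the agent exploring during training.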

In [6]:
num_inputs = 4  # Size of the state vector
num_actions = 2  # Actions: move left or right
num_hidden = 128  # Number of hidden-layer nodes

class ActorCritic(nn.Module):
    def __init__(self, num_inputs, num_actions, num_hidden, gamma=0.99, alpha=0.01):
        super(ActorCritic, self).__init__()
        self.num_inputs = num_inputs
        self.num_actions = num_actions
        self.gamma = gamma
        self.alpha = alpha

        self.common_layer = nn.Sequential(
            nn.Linear(self.num_inputs, num_hidden),
            nn.ReLU()
        )
        self.action_layer = nn.Linear(num_hidden, num_actions)
        self.critic_layer = nn.Linear(num_hidden, 1)

    def forward(self, input):
        common = self.common_layer(input)
        action_prob = F.softmax(self.action_layer(common), dim=-1)
        critic_value = self.critic_layer(common)

        return action_prob, critic_value

In [7]:
model = ActorCritic(num_inputs, num_actions, num_hidden, gamma)
optimizer = optim.Adam(model.parameters(), lr=alpha)

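- Running reward : $\mathrm{Running \; reward}_{t}=0.05\cdot\mathrm{episode \; reward}_{t}+(1-0.05)\cdot\mathrm{Running \;reward}_{t-1}$

- Return (discounted sum of rewards) : $R(z)=r_{t+1}+\gamma r_{t+2} + \gamma^2r_{t+3}+\cdots=\sum_{k=1}^{\infty}\gamma^{k-1}r_{t+k}$

- actor loss : $-\log\pi(a_t|s_t)$ multiplied by the difference between $R_{t}$ and $V(s_t)$, so that actions that turned out better than the critic expected become more likely

- critic loss : based on the residual between $R_{t}$ and $V(s_t)$, using the Huber loss (a combination of the squared error and the absolute error)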
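
The actor term can be sketched for a single timestep as follows. This is a pure-Python illustration under the assumption, taken from the training code further down, that the actor loss is the negative log-probability of the chosen action times $R_t - V(s_t)$; the function name `actor_loss_step` is hypothetical.

```python
import math


def actor_loss_step(action_prob, value, ret):
    """Actor loss for one timestep.

    action_prob: probability the policy assigned to the action taken
    value:       critic estimate V(s_t)
    ret:         return R_t observed from this timestep
    """
    advantage = ret - value  # positive if the outcome beat the estimate
    loss = -math.log(action_prob) * advantage
    return advantage, loss


# Outcome better than expected: minimizing the loss raises the action's probability
print(actor_loss_step(action_prob=0.5, value=0.2, ret=1.0))
# Outcome worse than expected: minimizing the loss lowers the action's probability
print(actor_loss_step(action_prob=0.5, value=1.0, ret=0.2))
```

The sign of the advantage decides whether the chosen action is reinforced or discouraged.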

Huber loss

A threshold $\delta$ is chosen: errors within the threshold are squared, and errors outside it are penalized by their absolute value.

$$L_{\delta}(e)=\left\{\begin{matrix}
\frac{1}{2}e^2, & \textrm{for } \left| e \right| \leq \delta\\
\delta(\left| e \right| - \frac{1}{2}\delta), & \mathrm{otherwise}
\end{matrix}\right.$$
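
The piecewise definition above translates directly into code. The following pure-Python sketch (the function name `huber` is illustrative) compares it with the plain squared error to show how large errors are penalized less steeply. With $\delta=1$ it matches the default behavior of PyTorch's `F.smooth_l1_loss`, which the training loop uses.

```python
def huber(e, delta=1.0):
    """Huber loss: quadratic for small errors, linear for large ones."""
    a = abs(e)
    if a <= delta:
        return 0.5 * e * e
    return delta * (a - 0.5 * delta)


for e in (0.5, 1.0, 3.0):
    # error, Huber loss, half squared error
    print(e, huber(e), 0.5 * e * e)
```

For an error of 3.0 the Huber loss is 2.5, while the half squared error is 4.5, so a single outlier return pulls the critic less strongly.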

In [9]:
action_probs_history = []
critic_value_history = []
rewards_history = []
running_reward = 0
episode_count = 0

while True:
    state = env.reset()
    episode_reward = 0

    for timestep in range(1, max_steps_per_episode):
        state = torch.FloatTensor(state)
        action_probs, critic_value = model(state)
        critic_value_history.append(critic_value)
        m = Categorical(action_probs)
        action = m.sample()
        action_probs_history.append(m.log_prob(action))

        # env.step(action) returns: state, reward, done, info
        state, reward, done, _ = env.step(action.item())  
        rewards_history.append(reward)
        episode_reward += reward

        if done:
            break
    
    running_reward = 0.05 * episode_reward + (1 - 0.05) * running_reward
    
    returns = []
    discounted_sum = 0
    for r in rewards_history[::-1]:
        discounted_sum = r + gamma * discounted_sum
        returns.insert(0, discounted_sum)
    
    # Normalize returns to zero mean and unit standard deviation
    returns = torch.tensor(returns)
    returns = (returns - returns.mean()) / (returns.std() + eps)
    returns = returns.unsqueeze(1)  # shape (T, 1), matching the critic output

    # Calculating loss values to update our network
    history = zip(action_probs_history, critic_value_history, returns)
    actor_losses = []
    critic_losses = []
    for log_prob, value, ret in history:
        diff = ret - value.item()
        actor_losses.append(-log_prob * diff)  # actor loss
        critic_losses.append(
            F.smooth_l1_loss(value, ret.clone().detach())
        )

    optimizer.zero_grad()

    # Backpropagation
    loss_value = torch.stack(actor_losses).sum() + torch.stack(critic_losses).sum()
    loss_value.backward()
    optimizer.step()

    # Clear the loss and reward history
    action_probs_history.clear()
    critic_value_history.clear()
    rewards_history.clear()

    # Log details
    episode_count += 1
    if episode_count % 10 == 0:
        template = "running reward: {:.2f} at episode {}"
        print(template.format(running_reward, episode_count))

    if running_reward > 195:  # Condition to consider the task solved
        print("Solved at episode {}!".format(episode_count))
        break

running reward: 76.09 at episode 10
running reward: 118.85 at episode 20
running reward: 144.52 at episode 30
running reward: 153.17 at episode 40
running reward: 154.83 at episode 50
running reward: 160.32 at episode 60
running reward: 168.81 at episode 70
running reward: 181.33 at episode 80
running reward: 188.82 at episode 90
running reward: 193.31 at episode 100
Solved at episode 106!
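
The running reward in the log above is an exponential moving average, so it lags behind the actual episode rewards. The sketch below, in pure Python with the hypothetical helper `update_running_reward`, counts how many episodes the average needs to pass 195 when starting from 0, under the assumption that every episode already scores the CartPole-v0 maximum of 200.

```python
def update_running_reward(running, episode_reward, weight=0.05):
    """One step of the exponential moving average used for logging."""
    return weight * episode_reward + (1 - weight) * running


running = 0.0
episodes = 0
while running <= 195:
    # Assumption for illustration: a perfect score in every episode
    running = update_running_reward(running, 200.0)
    episodes += 1

print(episodes)  # 72
```

Even a perfect agent would therefore need 72 episodes to meet the stopping condition, which gives some context for the episode count reported in the training log.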


----

## Visualization

In [12]:
# Render an episode and save as a GIF file

from IPython import display as ipythondisplay
from PIL import Image
from pyvirtualdisplay import Display


display = Display(visible=0, size=(400, 300))
display.start()


def render_episode(env, model, max_steps):
    # Reset first so that the initial frame belongs to this episode
    state = env.reset()
    screen = env.render(mode='rgb_array')
    images = [Image.fromarray(screen)]

    for i in range(1, max_steps + 1):
        state = torch.FloatTensor(state)
        with torch.no_grad():  # Inference only; no gradients needed
            action_probs, _ = model(state)
        action = Categorical(action_probs).sample()

        state, _, done, _ = env.step(action.item())

        # Render screen every 10 steps
        if i % 10 == 0:
            screen = env.render(mode='rgb_array')
            images.append(Image.fromarray(screen))

        if done:
            break

    return images


# Save GIF image
images = render_episode(env, model, max_steps_per_episode)
image_file = 'cartpole-v0.gif'
# loop=0: loop forever, duration=1: play each frame for 1ms
images[0].save(
    image_file, save_all=True, append_images=images[1:], loop=0, duration=1)

------