# Lab 13: Deep RL

This lab explores deep reinforcement learning with policy gradient
and actor-critic approaches.
The following references were used in today's lab:

- https://gym.openai.com
- Deep Reinforcement Learning Hands-On (Packtpub)
- https://github.com/pytorch/examples/blob/main/reinforcement_learning/reinforce.py
- https://towardsdatascience.com/breaking-down-richard-suttons-policy-gradient-9768602cb63b
- https://towardsdatascience.com/learning-reinforcement-learning-reinforce-with-pytorch-5e8ad7fc7da0
- https://github.com/woithook/A2C-Pytorch-implementations

In [1]:
id = 123012
name = 'Todsavad Tangtortan'

# Take-home exercise

Implement REINFORCE and A2C for one of the Atari games such as Space Invaders using a CNN for the policy
network and (for A2C) the value network.

## Setting Environment

In [None]:
# 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/.mujoco/mujoco210/bin' 
# mkdir ~/.mujoco
# wget -q https://mujoco.org/download/mujoco210-linux-x86_64.tar.gz -O mujoco.tar.gz
# rm mujoco.tar.gz
# pip install mujoco-py
# pip install gymnasium[mujoco]
# pip install gymnasium[classic-control]
# apt-get install libglew-dev patchelf libosmesa6-dev libgl1-mesa-glx
# apt-get install -y xvfb python-opengl 
# xvfb-run -a -s "-screen 0 1400x900x24" bash

## REINFORCE

In [2]:
# Gymnasium (the maintained fork of OpenAI Gym) is a toolkit for RL environments
from gymnasium.spaces import Box
from gymnasium.wrappers import FrameStack
import gymnasium as gym

import torch
import torch.nn as nn
from torch import optim
from torch.distributions import Categorical
import torch.autograd as autograd 
import torch.nn.functional as F
import torchvision.transforms as T

from collections import namedtuple
import matplotlib.pyplot as plt

import numpy as np
from utils import GrayScaleObservation, ResizeObservation, SkipFrame



In [3]:
class Policy(nn.Module):
    def __init__(self, env):
        super(Policy, self).__init__()
        # self.n_inputs = env.observation_space.shape[2]
        self.n_outputs = env.action_space.n

        self.conv1 = nn.Conv2d(in_channels=4, out_channels=32, kernel_size=8, stride=4)
        self.conv2 = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=4, stride=2)
        self.conv3 = nn.Conv2d(in_channels=64, out_channels=64, kernel_size=2, stride=1)
        
        self.affine1 = nn.Linear(4096, 128)
        self.dropout = nn.Dropout(p=0.6)
        self.affine2 = nn.Linear(128, self.n_outputs)

        self.saved_log_probs = []
        self.rewards = []
    
    def forward(self, x):
        x = x / 255.0  # normalize pixel values
        x = torch.relu(self.conv1(x))
        x = torch.relu(self.conv2(x))
        x = torch.relu(self.conv3(x))
        x = x.view(x.size(0), -1)
        
        x = self.affine1(x)
        x = self.dropout(x)
        x = F.relu(x)
        action_scores = self.affine2(x)
        return F.softmax(action_scores, dim=1)
    
    def select_action(self, state):
        state = torch.from_numpy(np.array(state)).float().unsqueeze(0)
        probs = self.forward(state)
        m = torch.distributions.Categorical(probs)
        action = m.sample()
        self.saved_log_probs.append(m.log_prob(action))
        return action.item()
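
The `4096` input size of `affine1` follows from the convolution arithmetic on an 84x84 frame. A minimal sketch in plain Python, assuming no padding and dilation 1 (the `nn.Conv2d` defaults); `conv_out` is a hypothetical helper, not part of the lab code:

```python
def conv_out(size, kernel_size, stride):
    """Output length along one spatial dimension for an unpadded convolution."""
    return (size - kernel_size) // stride + 1

side = 84  # frames are resized to 84x84 by the wrappers
for kernel_size, stride in [(8, 4), (4, 2), (2, 1)]:  # conv1, conv2, conv3
    side = conv_out(side, kernel_size, stride)

flat_features = 64 * side * side  # conv3 has 64 output channels
print(side, flat_features)
```

The spatial size shrinks 84 -> 20 -> 9 -> 8, so the flattened feature vector has 64 * 8 * 8 = 4096 entries.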

In [4]:
#RL environment parameters
gamma = 0.95
seed = 0
render = False
log_interval = 10

env = gym.make("ALE/SpaceInvaders-v5", render_mode="rgb_array")
# define a reward threshold
reward_threshold = 300
# register the reward threshold with the environment
env.spec.reward_threshold = reward_threshold
# env = gym.make("SpaceInvaders-v0")
env = FrameStack(ResizeObservation(GrayScaleObservation(SkipFrame(env, skip=4)), shape=84), num_stack=4)
env.reset(seed=seed)
torch.manual_seed(seed)

# Create our policy network
policy = Policy(env)
optimizer = optim.Adam(policy.parameters(), lr=1e-2)
eps = np.finfo(np.float32).eps.item()

A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
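
`SkipFrame` comes from the local `utils` module, which is not shown in this notebook. As a rough sketch of what such a wrapper typically does (an assumption, not the actual `utils` code), it repeats each action for `skip` frames and sums the rewards. `SkipFrameSketch` and `CountingEnv` are hypothetical names, written without a Gymnasium dependency so the logic is easy to check:

```python
class CountingEnv:
    """Hypothetical stand-in environment: reward 1.0 per frame, ends after `length` frames."""

    def __init__(self, length):
        self.length = length
        self.frame = 0

    def step(self, action):
        self.frame += 1
        terminated = self.frame >= self.length
        # Same 5-tuple layout as Gymnasium: obs, reward, terminated, truncated, info
        return self.frame, 1.0, terminated, False, {}


class SkipFrameSketch:
    """Repeat each action for `skip` frames and return the summed reward."""

    def __init__(self, env, skip):
        self.env = env
        self.skip = skip

    def step(self, action):
        total_reward = 0.0
        for _ in range(self.skip):
            obs, reward, terminated, truncated, info = self.env.step(action)
            total_reward += reward
            if terminated or truncated:
                break  # stop repeating once the episode is over
        return obs, total_reward, terminated, truncated, info


env_sketch = SkipFrameSketch(CountingEnv(length=6), skip=4)
first = env_sketch.step(0)   # consumes frames 1-4
second = env_sketch.step(0)  # consumes frames 5-6, then the episode ends
print(first[:3], second[:3])
```

With `skip=4`, the agent chooses an action only once every four emulator frames, which shortens episodes in terms of decision steps.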


In [5]:
def finish_episode():
    R = 0
    policy_loss = []
    returns = []
    for r in policy.rewards[::-1]:
        R = r + gamma * R
        returns.insert(0, R)
    returns = torch.tensor(returns)
    returns = (returns - returns.mean()) / (returns.std() + eps)
    for log_prob, R in zip(policy.saved_log_probs, returns):
        policy_loss.append(-log_prob * R)
    optimizer.zero_grad()
    policy_loss = torch.cat(policy_loss).sum()
    policy_loss.backward()
    optimizer.step()
    del policy.rewards[:]
    del policy.saved_log_probs[:]

from itertools import count
def reinforce():
    running_reward = 10
    for i_episode in count(1):
        (state, info), ep_reward = env.reset(), 0
        # print('Initial State', state)
        for t in range(1, 10000):  # Don't infinite loop while learning
            action = policy.select_action(state)
            state, reward, done, truncated, info = env.step(action)
            # print('New State', state)
            if render:
                env.render()
            policy.rewards.append(reward)
            ep_reward += reward
            if done or truncated:
                break

        # Update the exponential moving average of the episode reward (used for
        # logging and for the stopping criterion), then run the policy update.
        # finish_episode() takes the list of rewards for the whole episode and
        # computes the discounted return for every step. To do this efficiently,
        # it iterates over the reward list from the end: the last step's return
        # equals its own reward, and each earlier step's return is r_t + gamma * R_{t+1}.
        running_reward = 0.05 * ep_reward + (1 - 0.05) * running_reward
        finish_episode()
        if i_episode % log_interval == 0:
            print('Episode {}\tLast reward: {:.2f}\tAverage reward: {:.2f}'.format(
                  i_episode, ep_reward, running_reward))

        if running_reward > env.spec.reward_threshold:
            print("Solved! Running reward is now {} and "
                  "the last episode runs to {} time steps!".format(running_reward, t))
            break
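
To make the return computation in `finish_episode` concrete, here is a small standalone example in plain Python. It uses the same backward recursion `R = r + gamma * R`; the reward list and the helper name `discounted_returns` are made up for illustration:

```python
def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * G_{t+1} for every step, computed from the end."""
    returns = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns

# Three steps with a reward only at the end, and gamma = 0.5 for easy arithmetic
example = discounted_returns([0.0, 0.0, 8.0], gamma=0.5)
print(example)  # [2.0, 4.0, 8.0]
```

Each earlier step sees the final reward discounted once more, which is how credit is spread back over the actions that led to it.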

In [6]:
reinforce()
env.close()

Episode 10	Last reward: 180.00	Average reward: 95.77
Episode 20	Last reward: 285.00	Average reward: 181.60
Episode 30	Last reward: 485.00	Average reward: 244.85
Episode 40	Last reward: 220.00	Average reward: 267.49
Episode 50	Last reward: 135.00	Average reward: 242.66
Episode 60	Last reward: 155.00	Average reward: 243.16
Episode 70	Last reward: 135.00	Average reward: 207.00
Episode 80	Last reward: 85.00	Average reward: 190.89
Episode 90	Last reward: 290.00	Average reward: 178.00
Episode 100	Last reward: 65.00	Average reward: 139.06
Episode 110	Last reward: 125.00	Average reward: 129.55
Episode 120	Last reward: 305.00	Average reward: 130.85
Episode 130	Last reward: 175.00	Average reward: 135.96
Episode 140	Last reward: 70.00	Average reward: 142.77
Episode 150	Last reward: 110.00	Average reward: 125.79
Episode 160	Last reward: 45.00	Average reward: 122.52
Episode 170	Last reward: 15.00	Average reward: 94.65
Episode 180	Last reward: 125.00	Average reward: 118.72
Episode 190	Last reward: 3
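
The "Average reward" column above is not a plain mean. It is an exponential moving average with weight 0.05 on the newest episode, starting from the initial value of 10 set in `reinforce()`. A minimal sketch with made-up episode rewards (`update_running_reward` is a hypothetical helper):

```python
def update_running_reward(running_reward, ep_reward, alpha=0.05):
    """Exponential moving average, same form as the update in reinforce()."""
    return alpha * ep_reward + (1 - alpha) * running_reward

running = 10.0  # same starting value as reinforce()
for ep_reward in [100.0, 100.0, 100.0]:  # hypothetical episode rewards
    running = update_running_reward(running, ep_reward)
print(round(running, 2))
```

Because of the small weight, a single high-scoring episode moves the average only slightly, so the stopping criterion reacts to sustained performance rather than to one lucky episode.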

## A2C

In [1]:
# Gymnasium (the maintained fork of OpenAI Gym) is a toolkit for RL environments
from gymnasium.spaces import Box
from gymnasium.wrappers import FrameStack
import gymnasium as gym

import torch
import torch.nn as nn
from torch import optim
from torch.distributions import Categorical
import torch.autograd as autograd 
import torch.nn.functional as F
import torchvision.transforms as T

from collections import namedtuple
import matplotlib.pyplot as plt

import numpy as np
from utils import GrayScaleObservation, ResizeObservation, SkipFrame
from tqdm import tqdm



In [2]:
class PolicyNet(torch.nn.Module):
    def __init__(self, input_size, output_size, hidden_layer_size=64):
        super(PolicyNet, self).__init__()
        
        self.conv1 = nn.Conv2d(in_channels=input_size, out_channels=32, kernel_size=8, stride=4)
        self.conv2 = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=4, stride=2)
        self.conv3 = nn.Conv2d(in_channels=64, out_channels=64, kernel_size=2, stride=1)

        self.fc1 = torch.nn.Linear(4096, hidden_layer_size)
        self.fc2 = torch.nn.Linear(hidden_layer_size, output_size)
        self.softmax = torch.nn.Softmax(dim=-1)  # normalize over actions, not over the batch dimension

    def forward(self, x):
        x = x / 255.0  # normalize pixel values
        x = torch.from_numpy(x).float().unsqueeze(0)

        x = torch.relu(self.conv1(x))
        x = torch.relu(self.conv2(x))
        x = torch.relu(self.conv3(x))
        x = x.view(x.size(0), -1)

        return self.softmax(self.fc2(torch.nn.functional.relu(self.fc1(x))))

    def get_action_and_logp(self, x):
        x = x.__array__()  # LazyFrames -> ndarray; forward() already divides by 255
        action_prob = self.forward(x)
        m = torch.distributions.Categorical(action_prob)
        action = m.sample()
        logp = m.log_prob(action)
        return action.item(), logp

    def act(self, x):
        action, _ = self.get_action_and_logp(x)
        return action


class ValueNet(torch.nn.Module):
    def __init__(self, input_size, hidden_layer_size=64):
        super(ValueNet, self).__init__()

        self.conv1 = nn.Conv2d(in_channels=input_size, out_channels=32, kernel_size=8, stride=4)
        self.conv2 = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=4, stride=2)
        self.conv3 = nn.Conv2d(in_channels=64, out_channels=64, kernel_size=2, stride=1)

        self.fc1 = torch.nn.Linear(4096, hidden_layer_size)
        self.fc2 = torch.nn.Linear(hidden_layer_size, 1)

    def forward(self, x):
        x = x.__array__() / 255.0  # normalize pixel values
        x = torch.from_numpy(x).float().unsqueeze(0)

        x = torch.relu(self.conv1(x))
        x = torch.relu(self.conv2(x))
        x = torch.relu(self.conv3(x))
        x = x.view(x.size(0), -1)

        return self.fc2(torch.nn.functional.relu(self.fc1(x)))
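
`PolicyNet.forward` produces scores of shape `(1, n_actions)` because of the `unsqueeze(0)` batch dimension, so the softmax has to normalize along the action axis (the last one). Normalizing along a batch axis of size 1 would turn every score into 1.0. A small pure-Python illustration, with made-up scores and a hypothetical `softmax` helper:

```python
import math

def softmax(values):
    """Numerically stable softmax over a flat list of scores."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

scores = [[1.0, 2.0, 3.0]]  # shape (1, 3): one state, three actions

# Along the action axis (what the policy needs): one distribution per row
over_actions = [softmax(row) for row in scores]

# Along the batch axis: each column holds a single value, so softmax gives 1.0
columns = list(zip(*scores))
over_batch = [[softmax(list(col))[0] for col in columns]]

print(over_actions)
print(over_batch)  # [[1.0, 1.0, 1.0]]
```

In the second case the output no longer depends on the scores at all, so no gradient would reach the network through it.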

In [3]:
def vpg(env, num_iter=200, num_traj=10, max_num_steps=1000, gamma=0.98,
        policy_learning_rate=0.01, value_learning_rate=0.01,
        policy_saved_path='vpg_policy_invader.pt', value_saved_path='vpg_value_invader.pt'):
    input_size = env.observation_space.shape[0]  # 4 stacked frames; observations are Box(4, 84, 84) after the wrappers
    output_size = env.action_space.n
    print(f'input_size {input_size}')
    print(f'output_size {output_size} actions')
    Trajectory = namedtuple('Trajectory', 'states actions rewards dones logp')

    def collect_trajectory():
        state_list = []
        action_list = []
        reward_list = []
        dones_list = []
        logp_list = []
        state, info = env.reset()
        done = False
        steps = 0
        while not done and steps <= max_num_steps:
            action, logp = policy.get_action_and_logp(state)
            newstate, reward, terminated, truncated, info = env.step(action)
            done = terminated or truncated  # stop on either end-of-episode signal
            #reward = reward + float(state[0])
            state_list.append(state)
            action_list.append(action)
            reward_list.append(reward)
            dones_list.append(done)
            logp_list.append(logp)
            steps += 1
            state = newstate

        traj = Trajectory(states=state_list, actions=action_list,
                          rewards=reward_list, logp=logp_list, dones=dones_list)
        return traj

    def calc_returns(rewards):
        # Discounted reward-to-go, G_t = r_t + gamma * G_{t+1}, computed backwards
        returns, G = [], 0.
        for r in reversed(rewards):
            G = r + gamma * G
            returns.insert(0, G)
        return returns

    policy = PolicyNet(input_size, output_size)
    value = ValueNet(input_size)
    policy_optimizer = torch.optim.Adam(
        policy.parameters(), lr=policy_learning_rate)
    value_optimizer = torch.optim.Adam(
        value.parameters(), lr=value_learning_rate)

    mean_return_list = []
    for it in tqdm(range(num_iter)):
        traj_list = [collect_trajectory() for _ in range(num_traj)]
        returns = [calc_returns(traj.rewards) for traj in traj_list]

        policy_loss_terms = [-1. * traj.logp[j] * (returns[i][j] - value(traj.states[j]).detach())
                             for i, traj in enumerate(traj_list) for j in range(len(traj.actions))]

        policy_loss = 1. / num_traj * torch.cat(policy_loss_terms).sum()
        policy_optimizer.zero_grad()
        policy_loss.backward()
        policy_optimizer.step()

        value_loss_terms = [1. / len(traj.actions) * (value(traj.states[j]) - returns[i][j])**2.
                            for i, traj in enumerate(traj_list) for j in range(len(traj.actions))]
        value_loss = 1. / num_traj * torch.cat(value_loss_terms).sum()
        value_optimizer.zero_grad()
        value_loss.backward()
        value_optimizer.step()

        mean_return = 1. / num_traj * \
            sum([traj_returns[0] for traj_returns in returns])
        mean_return_list.append(mean_return)
        if it % 10 == 0:
            print('Iteration {}: Mean Return = {}'.format(it, mean_return))
            torch.save(policy.state_dict(), policy_saved_path)
            torch.save(value.state_dict(), value_saved_path)
    return policy, mean_return_list
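
The policy loss in `vpg` weights each log-probability by the advantage, that is, the return minus the value network's estimate for that state. A minimal numeric sketch of one such term, using made-up numbers and a hypothetical helper `policy_loss_term`:

```python
import math

def policy_loss_term(logp, ret, value_estimate):
    """-log pi(a|s) * (G_t - V(s_t)), the per-step term summed in the policy loss."""
    advantage = ret - value_estimate
    return -logp * advantage

logp = math.log(0.25)  # the sampled action had probability 0.25

# Return above the baseline: positive loss, so gradient descent raises logp
better = policy_loss_term(logp, ret=10.0, value_estimate=6.0)
# Return below the baseline: negative loss, so gradient descent lowers logp
worse = policy_loss_term(logp, ret=2.0, value_estimate=6.0)
print(better > 0, worse < 0)
```

Subtracting the baseline does not change which actions are favored on average, but it reduces the variance of the gradient estimate compared with weighting by the raw return.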

In [4]:
env = gym.make("ALE/SpaceInvaders-v5", render_mode="rgb_array")
env = FrameStack(ResizeObservation(GrayScaleObservation(SkipFrame(env, skip=4)), shape=84), num_stack=4)

agent, mean_return_list = vpg(env, num_iter=200, max_num_steps=500, gamma=1.0,
                              num_traj=5)

A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]


input_size 4
output_size 6 actions


  0%|          | 1/200 [00:11<36:35, 11.03s/it]

Iteration 0: Mean Return = 136.0


  6%|▌         | 11/200 [02:00<35:14, 11.19s/it]

Iteration 10: Mean Return = 183.0


 10%|█         | 21/200 [03:54<34:27, 11.55s/it]

Iteration 20: Mean Return = 190.0


 16%|█▌        | 31/200 [05:47<29:52, 10.60s/it]

Iteration 30: Mean Return = 135.0


 20%|██        | 41/200 [07:41<30:17, 11.43s/it]

Iteration 40: Mean Return = 132.0


 26%|██▌       | 51/200 [09:31<27:46, 11.18s/it]

Iteration 50: Mean Return = 164.0


 30%|███       | 61/200 [11:20<24:17, 10.48s/it]

Iteration 60: Mean Return = 70.0


 35%|███▌      | 70/200 [13:03<25:39, 11.84s/it]

In [None]:
state,_  = env.reset()
for t in range(1000):
    action = agent.act(state)
    env.render()
    state, reward, done, truncated, info  = env.step(action)
    if done or truncated:
        break
env.close()

## Conclusion

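In the Atari Space Invaders environment, the action space has 6 discrete actions and the raw observation space has shape (210, 160, 3), which refers to (height, width, channels).
For preprocessing, each action is repeated for 4 frames (frame skipping), frames are converted to grayscale and resized to 84x84, and the last 4 frames are stacked, so the network input has shape (4, 84, 84) instead of (3, 210, 160).
Both models use the same three convolutional layers in front of the fully connected layers: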

        self.conv1 = nn.Conv2d(in_channels=4, out_channels=32, kernel_size=8, stride=4)
        self.conv2 = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=4, stride=2)
        self.conv3 = nn.Conv2d(in_channels=64, out_channels=64, kernel_size=2, stride=1)
        
In the first model, REINFORCE, `reward_threshold` was set to 300. Training stopped once the running reward exceeded it: the running reward reached 307.94, and the last episode ran for 260 time steps.

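In the second model, A2C: the logged mean return fluctuated between 70 and 190 over the first 60 iterations with no clear upward trend, and the captured output ends at about 35% of the 200 iterations.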