# RTML Final 2022

Welcome to the RTML final exam, version 2022!

Prepare your answer to each question by writing it directly in this notebook, then print the notebook as a PDF and turn it in via Google Classroom by the deadline.

You have 2.5 hours to complete the exam. Good luck!

## Question 1 (20 points)

Suppose you have a dataset consisting of 500 essays on the assigned topic,
"Why is it so difficult to produce a computer program that can pass the Turing Test?"
The essays are by Data Science and AI students from all over Asia and are each 250-350 words long.

Suppose further that having taken RTML, you know a lot about GANs and RNNs, so you decide to build a recurrent
GAN to generate fresh essays on the same topic.

Explain in detail how you could use an LSTM-based RNN as the generator in a GAN with this goal.
Be sure to indicate the detailed structure of the generator and discriminator, the loss functions,
how the models are trained, how the cell/hidden state is initialized, what the input to the model is
during training, and how the resulting model is used for inference.

*Write your answer here.*

## Question 2 (10 points)

Explain how BERT could be fine-tuned on the task of Question 1 and how the resulting model would be used for inference.

*Write your answer here.*

## Question 3 (10 points)

In Lab 12, we implemented the basic REINFORCE algorithm on CartPole.
Run your trained REINFORCE model on CartPole. Show a screenshot of your trained
REINFORCE model playing the game here.

*Put your screenshot here*

In [6]:
# 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/.mujoco/mujoco210/bin' 
# mkdir ~/.mujoco
# wget -q https://mujoco.org/download/mujoco210-linux-x86_64.tar.gz -O mujoco.tar.gz
# rm mujoco.tar.gz
# pip install mujoco-py
# pip install gymnasium[mujoco]
# pip install gymnasium[classic-control]
# apt-get install libglew-dev patchelf libosmesa6-dev libgl1-mesa-glx
# apt-get install -y xvfb python-opengl 
# xvfb-run -a -s "-screen 0 1400x900x24" bash

import gymnasium as gym
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
from torch import optim
from torch.distributions import Categorical

class Policy(nn.Module):
    def __init__(self, env):
        super(Policy, self).__init__()
        self.n_inputs = env.observation_space.shape[0]
        self.n_outputs = env.action_space.n
        
        self.affine1 = nn.Linear(self.n_inputs, 128)
        self.dropout = nn.Dropout(p=0.6)
        self.affine2 = nn.Linear(128, self.n_outputs)

        self.saved_log_probs = []
        self.rewards = []

    def forward(self, x):
        x = self.affine1(x)
        x = self.dropout(x)
        x = F.relu(x)
        action_scores = self.affine2(x)
        return F.softmax(action_scores, dim=1)
    
    def select_action(self, state):
        state = torch.from_numpy(np.array(state)).float().unsqueeze(0)
        probs = self.forward(state)
        m = torch.distributions.Categorical(probs)
        action = m.sample()
        self.saved_log_probs.append(m.log_prob(action))
        return action.item()

#RL environment parameters
gamma = 0.95
seed = 0
render = False
log_interval = 10

#Set up 

# env = gym.make("ALE/SpaceInvaders-ram-v5")
env = gym.make("CartPole-v1", render_mode='rgb_array')
reward_threshold = env.spec.reward_threshold
print(reward_threshold)
env.reset(seed=seed)
torch.manual_seed(seed)

# Create our policy network
policy = Policy(env)
optimizer = optim.Adam(policy.parameters(), lr=1e-2)
eps = np.finfo(np.float32).eps.item()

# env.reset()
# x = env.render()
# print(x)

def finish_episode():
    R = 0
    policy_loss = []
    returns = []
    for r in policy.rewards[::-1]:
        R = r + gamma * R
        returns.insert(0, R)
    returns = torch.tensor(returns)
    returns = (returns - returns.mean()) / (returns.std() + eps)
    for log_prob, R in zip(policy.saved_log_probs, returns):
        policy_loss.append(-log_prob * R)
    optimizer.zero_grad()
    policy_loss = torch.cat(policy_loss).sum()
    policy_loss.backward()
    optimizer.step()
    del policy.rewards[:]
    del policy.saved_log_probs[:]

from itertools import count
def reinforce():
    running_reward = 10
    for i_episode in count(1):
        (state, info), ep_reward = env.reset(), 0
        # print('Initial State', state)
        for t in range(1, 10000):  # Don't infinite loop while learning
            action = policy.select_action(state)
            state, reward, done, truncated, info = env.step(action)
            # print('New State', state)
            if render:
                env.render()
            policy.rewards.append(reward)
            ep_reward += reward
            # NOTE: only `done` (termination) is checked; `truncated` is ignored, so an
            # episode can run past CartPole-v1's 500-step time limit.
            if done:
                break

        # Update the exponential moving average of the episode reward (smoothing factor 0.05).
        # The per-step discounted returns are computed in finish_episode(), which walks the
        # episode's reward list backwards: the last step's return equals its own reward, and
        # each earlier step's return is its reward plus gamma times the return of the next step.
        running_reward = 0.05 * ep_reward + (1 - 0.05) * running_reward
        finish_episode()
        if i_episode % log_interval == 0:
            print('Episode {}\tLast reward: {:.2f}\tAverage reward: {:.2f}'.format(
                  i_episode, ep_reward, running_reward))
            
        if running_reward > env.spec.reward_threshold: #475
            print("Solved! Running reward is now {} and "
                  "the last episode runs to {} time steps!".format(running_reward, t))
            break
        
reinforce()
env.close()

475.0
Episode 10	Last reward: 9.00	Average reward: 15.14
Episode 20	Last reward: 16.00	Average reward: 17.23
Episode 30	Last reward: 11.00	Average reward: 18.63
Episode 40	Last reward: 15.00	Average reward: 20.65
Episode 50	Last reward: 31.00	Average reward: 30.82
Episode 60	Last reward: 45.00	Average reward: 30.54
Episode 70	Last reward: 234.00	Average reward: 51.10
Episode 80	Last reward: 127.00	Average reward: 86.97
Episode 90	Last reward: 63.00	Average reward: 102.29
Episode 100	Last reward: 229.00	Average reward: 105.88
Episode 110	Last reward: 326.00	Average reward: 204.60
Episode 120	Last reward: 193.00	Average reward: 235.82
Episode 130	Last reward: 305.00	Average reward: 236.45
Episode 140	Last reward: 300.00	Average reward: 247.11
Episode 150	Last reward: 428.00	Average reward: 293.05
Episode 160	Last reward: 352.00	Average reward: 351.60
Episode 170	Last reward: 223.00	Average reward: 304.66
Episode 180	Last reward: 636.00	Average reward: 314.75
Solved! Running reward is now
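
The backward pass over the rewards in `finish_episode()` can be checked by hand on a very short episode. The sketch below is only an illustration: the helper name `discounted_returns`, the three-step reward list, and the choice of `gamma=0.5` are assumptions made to keep the arithmetic easy to follow, not values from the lab.

```python
def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = []
    running = 0.0
    # Walk the rewards backwards so each return reuses the one that follows it.
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()  # restore chronological order
    return returns

# Hypothetical three-step episode with a reward of 1 per step, as in CartPole.
example = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
print(example)  # [1.75, 1.5, 1.0]
```

Appending and reversing once at the end gives the same result as `returns.insert(0, R)` while avoiding the repeated front insertions; for episodes of a few hundred steps either form is fine.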

## Question 4 (20 points)

Next, let's replace the policy network, which currently works with the fully observed MDP state, with one that treats the task
as a POMDP and uses only the rendered image of the environment as its observation.

If you completed Lab 12, you should already have an implementation of REINFORCE on Space Invaders that you can reuse.

By default, CartPole will render at 600x400 resolution. We will want to downscale that, perhaps to 150x100, and stack subsequent
frames, perhaps 4 of them, in order to provide some history information.

Below, show a revision of your REINFORCE policy model that takes as input a stack of the four most recent downscaled grayscale
images and outputs an action. The model should have an appropriate series of convolutions, one or more fully connected layers,
and a linear/softmax layer that outputs an action.
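
When sizing the first fully connected layer, it can help to work out how many features the convolution stack produces for a given input resolution. The sketch below does that arithmetic in plain Python. The three (kernel, stride) pairs and the 64 output channels are an assumed DQN-style stack chosen for illustration; the question does not prescribe them, and `conv_out` and `flattened_features` are hypothetical helper names.

```python
def conv_out(size, kernel, stride, padding=0):
    """Output length of one spatial dimension after a convolution (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1

def flattened_features(height, width, layers, out_channels):
    """Apply each (kernel, stride) pair in turn and return channels * H * W."""
    for kernel, stride in layers:
        height = conv_out(height, kernel, stride)
        width = conv_out(width, kernel, stride)
    return out_channels * height * width

# Assumed stack: three unpadded convolutions ending in 64 channels.
layers = [(8, 4), (4, 2), (2, 1)]
n_features = flattened_features(100, 150, layers, out_channels=64)
print(n_features)  # 10240
```

With these assumed layers a 100x150 input shrinks to 24x36, then 11x17, then 10x16, so the first linear layer would need 64 * 10 * 16 = 10240 inputs. Swapping height and width (150x100) gives the same count.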

In [29]:
# Gymnasium (the maintained fork of OpenAI Gym) is a toolkit for RL
from gymnasium.spaces import Box
from gymnasium.wrappers import FrameStack
import gymnasium as gym

import torch
import torch.nn as nn
from torch import optim
from torch.distributions import Categorical
import torch.autograd as autograd 
import torch.nn.functional as F
import torchvision.transforms as T

from collections import namedtuple
import matplotlib.pyplot as plt

import numpy as np
from utils import GrayScaleObservation, ResizeObservation, SkipFrame

class Policy(nn.Module):
    def __init__(self, env):
        super(Policy, self).__init__()
        # self.n_inputs = env.observation_space.shape[2]
        self.n_outputs = env.action_space.n

        self.conv1 = nn.Conv2d(in_channels=4, out_channels=32, kernel_size=8, stride=4)
        self.conv2 = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=4, stride=2)
        self.conv3 = nn.Conv2d(in_channels=64, out_channels=64, kernel_size=2, stride=1)
        
        self.affine1 = nn.Linear(10240, 128)
        self.dropout = nn.Dropout(p=0.6)
        self.affine2 = nn.Linear(128, self.n_outputs)

        self.saved_log_probs = []
        self.rewards = []
    
    def forward(self, x):
        x = x / 255.0  # normalize pixel values
        x = torch.relu(self.conv1(x))
        x = torch.relu(self.conv2(x))
        x = torch.relu(self.conv3(x))
        x = x.view(x.size(0), -1)
        
        x = self.affine1(x)
        x = self.dropout(x)
        x = F.relu(x)
        action_scores = self.affine2(x)
        return F.softmax(action_scores, dim=1)
    
    def select_action(self, state):
        state = torch.from_numpy(np.array(state)).float().unsqueeze(0)
        probs = self.forward(state)
        m = torch.distributions.Categorical(probs)
        action = m.sample()
        self.saved_log_probs.append(m.log_prob(action))
        return action.item(), probs

In [46]:
#RL environment parameters
gamma = 0.95
seed = 0
render = False
log_interval = 10

env = gym.make("ALE/SpaceInvaders-v5", render_mode="rgb_array")
# define a reward threshold
reward_threshold = 250
# register the reward threshold with the environment
env.spec.reward_threshold = reward_threshold
# env = gym.make("SpaceInvaders-v0")
env = FrameStack(ResizeObservation(GrayScaleObservation(SkipFrame(env, skip=4)), shape=(150,100)), num_stack=4)
env.reset(seed=seed)
torch.manual_seed(seed)

# Create our policy network
policy = Policy(env)
optimizer = optim.Adam(policy.parameters(), lr=1e-2)
eps = np.finfo(np.float32).eps.item()    

In [20]:
def finish_episode():
    R = 0
    policy_loss = []
    returns = []
    for r in policy.rewards[::-1]:
        R = r + gamma * R
        returns.insert(0, R)
    returns = torch.tensor(returns)
    returns = (returns - returns.mean()) / (returns.std() + eps)
    for log_prob, R in zip(policy.saved_log_probs, returns):
        policy_loss.append(-log_prob * R)
    optimizer.zero_grad()
    policy_loss = torch.cat(policy_loss).sum()
    policy_loss.backward()
    optimizer.step()
    del policy.rewards[:]
    del policy.saved_log_probs[:]

from itertools import count
def reinforce():
    running_reward = 10
    for i_episode in count(1):
        (state, info), ep_reward = env.reset(), 0
        # print('Initial State', state)
        for t in range(1, 10000):  # Don't infinite loop while learning
            action, _ = policy.select_action(state)
            state, reward, done, truncated, info = env.step(action)
            # print('New State', state)
            if render:
                env.render()
            policy.rewards.append(reward)
            ep_reward += reward
            if done:
                break

        # Update the exponential moving average of the episode reward (smoothing factor 0.05).
        # The per-step discounted returns are computed in finish_episode(), which walks the
        # episode's reward list backwards: the last step's return equals its own reward, and
        # each earlier step's return is its reward plus gamma times the return of the next step.
        running_reward = 0.05 * ep_reward + (1 - 0.05) * running_reward
        finish_episode()
        if i_episode % log_interval == 0:
            print('Episode {}\tLast reward: {:.2f}\tAverage reward: {:.2f}'.format(
                  i_episode, ep_reward, running_reward))

        if running_reward >= env.spec.reward_threshold:
            print("Solved! Running reward is now {} and "
                  "the last episode runs to {} time steps!".format(running_reward, t))
            break

reinforce()
env.close()

Episode 10	Last reward: 285.00	Average reward: 119.22
Episode 20	Last reward: 285.00	Average reward: 180.98
Episode 30	Last reward: 180.00	Average reward: 210.65
Episode 40	Last reward: 285.00	Average reward: 240.48
Solved! Running reward is now 250.55427543651658 and the last episode runs to 182 time steps!


Next, demonstrate that your policy model, when given a 4x100x150 tensor of zeros, outputs an appropriately
shaped vector representing a multinomial distribution over the action space.
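
One way to run this check is sketched below. It is a minimal, self-contained stand-in rather than the notebook's own model: the class name `TinyVisualPolicy`, the layer sizes, and `n_actions = 2` (CartPole's two actions) are assumptions made for illustration. The model is put in eval mode and run under `torch.no_grad()` because only the output shape and normalization are being inspected.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVisualPolicy(nn.Module):
    """Hypothetical visual policy: 4 stacked frames in, action probabilities out."""
    def __init__(self, n_actions):
        super().__init__()
        self.conv1 = nn.Conv2d(4, 32, kernel_size=8, stride=4)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=4, stride=2)
        self.conv3 = nn.Conv2d(64, 64, kernel_size=2, stride=1)
        self.fc1 = nn.Linear(64 * 10 * 16, 128)  # 10240 features for a 100x150 input
        self.fc2 = nn.Linear(128, n_actions)

    def forward(self, x):
        x = F.relu(self.conv1(x / 255.0))  # scale pixel values to [0, 1]
        x = F.relu(self.conv2(x))
        x = F.relu(self.conv3(x))
        x = F.relu(self.fc1(x.flatten(start_dim=1)))
        return F.softmax(self.fc2(x), dim=1)

n_actions = 2  # assumption: CartPole's two actions (push left, push right)
model = TinyVisualPolicy(n_actions).eval()
with torch.no_grad():
    probs = model(torch.zeros(1, 4, 100, 150))  # batch of one 4x100x150 observation

print(probs.shape)         # torch.Size([1, 2])
print(probs.sum().item())  # should be very close to 1
```

The output has one row per batch element and one column per action, and each row sums to 1 up to floating-point error, which is what a categorical (multinomial) distribution over actions requires.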

## Question 5 (20 points)

Modify the `reinforce()` function to generate the visual input to the Policy model rather than the fully
observed state.

In your code, after each of the following lines

    state, ep_reward = env.reset(), 0
    
    ...
    
        state, reward, done, _ = env.step(action)
    
please add code to replace the fully observed state with the observation:

    obs_t = env.render(mode="rgb_array")
    obs_seq.append(obs_t)
    state = make_observation(obs_seq)

You'll have to add some code to initialize the `obs_seq` array appropriately and write the `make_observation()` function
to convert the four most recent observations to grayscale and stack them in a tensor that your policy
network can use.
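
A minimal NumPy sketch of such a `make_observation()` is shown below, under several stated assumptions: each rendered frame is an RGB `uint8` array of shape 400x600x3, `obs_seq` is non-empty, downscaling is done by simply keeping every fourth pixel (a crude substitute for a proper interpolating resize), and missing history at the start of an episode is filled by repeating the oldest available frame. The parameters `n_frames` and `step` are hypothetical names introduced here.

```python
import numpy as np

def make_observation(obs_seq, n_frames=4, step=4):
    """Stack the most recent frames as a (n_frames, H//step, W//step) uint8 array."""
    recent = list(obs_seq[-n_frames:])
    if not recent:
        raise ValueError("obs_seq must contain at least one frame")
    # Early in an episode there are fewer than n_frames frames: repeat the oldest one.
    while len(recent) < n_frames:
        recent.insert(0, recent[0])
    weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)  # BT.601 luma weights
    frames = []
    for rgb in recent:
        gray = rgb[..., :3].astype(np.float32) @ weights  # (H, W) grayscale image
        frames.append(gray[::step, ::step])               # keep every step-th pixel
    return np.stack(frames).round().clip(0, 255).astype(np.uint8)

# A fake 400x600 RGB frame standing in for one rendered CartPole image.
obs_seq = [np.full((400, 600, 3), 255, dtype=np.uint8)]
state = make_observation(obs_seq)
print(state.shape)  # (4, 100, 150)
```

The result stays in the 0 to 255 range, so a policy that divides its input by 255 can consume it directly after conversion to a float tensor.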

In [None]:
def make_observation(obs_seq):
    # TODO: convert the four most recent frames in obs_seq to grayscale and stack them
    pass

In [None]:
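# Place revised reinforce() code here
def reinforce():
    running_reward = 10
    for i_episode in count(1):
        (state, info), ep_reward = env.reset(), 0
        # Start each episode with a fresh frame history and replace the fully
        # observed state with the stacked visual observation.
        # In gymnasium the render mode is fixed in gym.make(..., render_mode="rgb_array"),
        # so env.render() takes no `mode` argument.
        obs_seq = [env.render()]
        state = make_observation(obs_seq)
        for t in range(1, 10000):  # Don't infinite loop while learning
            action, _ = policy.select_action(state)
            state, reward, done, truncated, info = env.step(action)
            # Always build the visual observation, regardless of the `render` flag.
            obs_seq.append(env.render())
            obs_seq = obs_seq[-4:]  # keep only the four most recent frames
            state = make_observation(obs_seq)
            policy.rewards.append(reward)
            ep_reward += reward
            if done or truncated:
                break

        # Exponential moving average of the episode reward, then one policy update.
        running_reward = 0.05 * ep_reward + (1 - 0.05) * running_reward
        finish_episode()
        if i_episode % log_interval == 0:
            print('Episode {}\tLast reward: {:.2f}\tAverage reward: {:.2f}'.format(
                  i_episode, ep_reward, running_reward))
        if running_reward > env.spec.reward_threshold:
            print("Solved! Running reward is now {} and "
                  "the last episode runs to {} time steps!".format(running_reward, t))
            break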

Show that the resulting policy model can be trained for a few episodes. You don't have to train the model to perfection; you can run it on your own PC in CPU mode and just
show that the policy model is learning.

In [None]:
# Code to train for a few episodes goes here


## Question 6 (10 points)

What are the major differences between this visual REINFORCE method and Mnih et al.'s DQN method?

*Your answer goes here.*

## Question 7 (10 points)

What are the major differences between this visual REINFORCE method and the A2C (Advantage-Actor-Critic) method? In your answer, assume we make the
same modification to A2C to use visual observations instead of full state observations.

*Your answer goes here.*