<a href="https://colab.research.google.com/github/JanNogga/rl_ss22/blob/main/RL_Assignment_06.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Robot Learning

## Assignment 6

Solutions are due on 24.05.2022 before the lecture.

### Introduction

On this assignment sheet, we will use an environment from [OpenAI Gym](https://gym.openai.com/). The setup requires some further packages. Since this installation is not trivial, and we could only test it in our setup, we strongly recommend that you execute the cells in this notebook in [Google Colab](https://research.google.com/colaboratory/). You should find a button which opens this file directly in Colab at the top of this notebook. If not, you can simply import the .ipynb manually.

If you have started your Colab session and are ready to proceed, uncomment the two lines in the code cell below. They will install everything required to simulate the Gym environment.

**Warning: This is unlikely to work on your own computer, and might even mess up your system! Please only use the following lines in Colab. If you insist on using your own machine, please refer to installation instructions for Gym, torch and the box2d environments for your system.**

In [None]:
#!apt-get -qq install python-opengl xvfb x11-utils &> /dev/null
#!pip install box2d-py pyvirtualdisplay moviepy pyglet PyOpenGL-accelerate --quiet &> /dev/null

The following cell imports packages required for the task. 

In [None]:
import numpy as np
import gym
import matplotlib.pyplot as plt
from pyvirtualdisplay import Display
from moviepy.editor import VideoClip
from moviepy.video.io.bindings import mplfig_to_npimage
import torch
import torch.nn as nn
from tqdm import tqdm
import random

## Task 6.1)

Your agent is in state $s_t$ and has the $Q$-values $Q(s_t,a) = [Q(s_t,a_0), Q(s_t,a_1), Q(s_t,a_2), Q(s_t,a_3)] = [4, 2, 0, 7]$. If the agent samples its action according to a probabilistic policy $\pi(s_t,a)$ which is created by softmax action selection from $Q(s_t,a)$, then what is the probability $Pr(a_3 | s_t)$ of taking action $a_3$ in state $s_t$?

<div style="text-align: right; font-weight:bold"> 3 Points </div>
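As a reminder of how softmax action selection turns action values into probabilities, here is a minimal sketch. The Q-values `[1.0, 2.0, 3.0]` and the helper name `softmax_probs` are hypothetical and deliberately differ from the values in the task; a temperature of 1 is assumed, since the task does not mention one.

```python
import math

def softmax_probs(q_values, temperature=1.0):
    # Hypothetical helper: probabilities proportional to exp(Q / temperature)
    # Subtracting the maximum only improves numerical stability, the result is unchanged
    q_max = max(q_values)
    exps = [math.exp((q - q_max) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical Q-values, not the ones from the task
example_q = [1.0, 2.0, 3.0]
probs = softmax_probs(example_q)
print(probs)
```

The action with the largest Q-value receives the largest probability, but every action keeps a non-zero chance of being selected.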

Please answer in this text cell.

## Task 6.2)

In this task, we will combine an actor-critic method, as on the previous sheet, with a policy gradient algorithm to control an agent in a challenging environment: the Lunar Lander.

In [None]:
# set up showing animations from the environment in Colab.
Display(visible=False).start()

<pyvirtualdisplay.display.Display at 0x7f87ed76a7d0>

Examine the code cell below. It has two distinct purposes:

* Showcase the agent-environment interaction for LunarLander-v2

* Show how you can capture frames from this environment to animate an episode afterwards.

Note that for training an agent in this environment, it is advisable to omit all code corresponding to the rendering. You can separately render an episode of your agent's play afterwards.


In [None]:
# Name of the environment, if you are having problems you can switch to 'CartPole-v1', which is easier to solve.
ENV_NAME = 'LunarLander-v2' #'CartPole-v1' 
# Dimension of the LunarLander state space. For 'CartPole-v1', use 4 instead
ENV_STATE_DIM =  8 # 4
# Lunar Lander has 4 discrete actions: [Do Nothing, Fire Left Booster, Fire Main Engine, Fire Right Booster], 'CartPole-v1' has 2
ENV_ACTION_DIM = 4 # 2
# If the agent reaches this score the task is seen as solved
SCORE_TO_SOLVE = 200 # 195
MAX_STEPS = None # 500
# Create the environment
env = gym.make(ENV_NAME)
# Reset the environment
state = env.reset() # state = [x, y, dx, dy, theta, dtheta, leg1_contact, leg2_contact]
# Track whether the episode is over
done = False
# List to append the frames produced by the environment renderer
frames = []
while not done:
  # Render current situation and append to frames
  frames.append(env.render('rgb_array'))
  # Select a random action
  action = env.action_space.sample()
  # Execute this action
  state, reward, done, info = env.step(action)
# Print the number of frames
print('Number of frames:', len(frames))
# Prevent the renderer from showing artifacts
plt.close()

Number of frames: 76


In [None]:
# Helper function to animate a list of frames as produced above
def visualize_trajectory(frames, fps=50):
  duration = int(len(frames) // fps + 1)
  fig, ax = plt.subplots()
  def make_frame(t, ind_max=len(frames)):
      ax.clear()
      ax.imshow(frames[min((int(fps*t),ind_max-1))])
      return mplfig_to_npimage(fig)
  plt.close()
  return VideoClip(make_frame, duration=duration)

In [None]:
# Get the animation from the frames of the played episode
animation = visualize_trajectory(frames)
# Show the animation
animation.ipython_display(fps=50, loop=True, autoplay=True)



The random agent will most likely crash instead of landing between the two flags. We would like you to improve upon this. Below, you are given a neural net with learnable weights $\theta$ which takes an environment state $s_t$ as input and can output either a state value $V_{\theta}(s_t)$ or a probability distribution over the actions $\pi_{\theta}(s_t,a)$.

In [None]:
# If you feel like it, you can, but you do not need to adapt this
class DualNet(nn.Module):
    def __init__(self, state_dim=ENV_STATE_DIM, action_dim=ENV_ACTION_DIM, hidden_layer_dim=42):
        super(DualNet, self).__init__()
        # Create some layers to encode the input state
        self.layers = [nn.Linear(state_dim, hidden_layer_dim),
                       nn.PReLU(num_parameters=hidden_layer_dim)]
        # Combine these layers into a net
        self.net = nn.Sequential(*self.layers)
        # Critic output layers to estimate V from the state encoding
        self.critic = nn.Sequential(*[nn.Linear(hidden_layer_dim,1)])
        # Actor output layers to estimate pi from the state encoding
        self.actor = nn.Sequential(*[nn.Linear(hidden_layer_dim,action_dim),
                                        nn.Softmax(dim=-1)])
    def forward(self, s, mode):
        # Convert input state to tensor
        x = torch.tensor(s).float().view(1,-1)
        # Encode state
        x = self.net(x)
        if mode == 'actor':
          # Return probability distribution over actions
          x = self.actor(x)
        else:
          # Return estimate of state value
          x = self.critic(x)
        return x.squeeze()

# Example usage:
# Create instance of the DualNet class
test_net = DualNet()
# Create a dummy state, round just for pretty printing
test_input = np.around(np.random.rand(ENV_STATE_DIM),2)
# Get the actor output
actor_out = test_net(test_input, mode='actor')
# Get the critic output
critic_out = test_net(test_input, mode='critic')
print('Dummy Input:', test_input)
print('Actor output:', actor_out)
print('Critic output:', critic_out)

Dummy Input: [0.32 0.84 0.62 0.89 0.79 0.32 0.05 0.59]
Actor output: tensor([0.3090, 0.2827, 0.2410, 0.1672], grad_fn=<SqueezeBackward0>)
Critic output: tensor(0.1115, grad_fn=<SqueezeBackward0>)


Now to the task: Play episodes according to the following scheme:

* For each visited state $s_t$, store the output of the critic $V_\theta(s_t)$ in a list.

* Select an action $a_t$ by sampling from the distribution $\pi_{\theta}(s_t,a)$ output by the actor. In a list, store the log-probability of the selected action: $l_t = \log(\pi_{\theta}[s_t,a_t])$.

* Execute the action and observe the reward $r_{t+1}$ provided by the environment.
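The sampling step in the scheme above could look roughly like the following sketch. It uses `torch.distributions.Categorical`, which is one possible choice and not prescribed by this sheet. The tensor `probs` is a hypothetical stand-in for the actor output $\pi_{\theta}(s_t,a)$.

```python
import torch
from torch.distributions import Categorical

# Hypothetical stand-in for the actor output of a net with 4 actions
probs = torch.tensor([0.1, 0.2, 0.3, 0.4], requires_grad=True)

# Build a categorical distribution over the actions and sample from it
dist = Categorical(probs=probs)
a = dist.sample()

# Log-probability of the sampled action, still attached to the gradient graph
log_prob = dist.log_prob(a)

# Plain Python int, in the form expected by env.step
action = a.item()
print(action, log_prob)
```

Because `log_prob` is computed with torch operations, gradients can later flow from the actor loss back to the parameters that produced `probs`.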

After each episode, use the stored rewards to calculate the Returns $R_t$ following each state $s_t$ using the discount factor $\gamma$. Next, calculate for each $t$

$$\delta_t = R_t - V_{\theta}(s_t)$$
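One way to compute the Returns is to walk backwards through the rewards of an episode. The sketch below assumes the usual definition $R_t = r_{t+1} + \gamma R_{t+1}$ with a Return of zero after the final step. The function name `discounted_returns` and the example rewards are hypothetical.

```python
def discounted_returns(rewards, gamma):
    # rewards[t] is the reward observed after acting in state s_t
    returns = []
    running = 0.0
    # Accumulate from the end of the episode: R_t = r_{t+1} + gamma * R_{t+1}
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    # Restore chronological order
    returns.reverse()
    return returns

# Hypothetical rewards of a three-step episode
example_rewards = [1.0, 0.0, 2.0]
example_returns = discounted_returns(example_rewards, gamma=0.5)
print(example_returns)  # [1.5, 1.0, 2.0]
```

The backward pass needs only one sweep over the episode, instead of re-summing all future rewards for every $t$.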

Then, calculate the loss of the critic as

$$L_{critic} = 0.5 \sum_t \delta_t^2$$

and the loss of the actor using the log probs as

$$L_{actor} = \sum_t - l_t  \delta_t$$

and finally the total loss

$$L = L_{critic} + L_{actor}$$
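The three loss terms can be sketched on dummy tensors as follows. All values are hypothetical. In a real training loop, `state_vals` and `log_probs` would be stacked from the lists collected during the episode. The sketch uses `detach()` so that the actor term does not contribute gradients to the critic output, which is an assumption about how to keep the two terms separate.

```python
import torch

# Hypothetical Returns R_t, critic outputs V(s_t) and log-probabilities l_t
returns = torch.tensor([1.5, 1.0, 2.0])
state_vals = torch.tensor([1.0, 1.0, 1.0], requires_grad=True)
log_probs = torch.tensor([-0.5, -1.0, -2.0], requires_grad=True)

# delta_t = R_t - V(s_t)
delta = returns - state_vals

# Critic loss: half the sum of squared deltas
L_critic = 0.5 * (delta ** 2).sum()

# Actor loss: delta is detached, so it acts as a constant weight here
L_actor = (-log_probs * delta.detach()).sum()

# Total loss and backward pass
L = L_critic + L_actor
L.backward()
print(L.item(), state_vals.grad, log_probs.grad)
```

With these numbers, the gradient with respect to `state_vals` comes from the critic term only, which is the effect the detaching is meant to achieve.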

Now, update the parameters $\theta$ using

$$\theta \leftarrow \theta - \alpha \nabla_{\theta}L$$

The Lunar Lander problem is considered solved when the agent achieves an average return of 200 over 100 episodes. Solve the problem, or play around 3000 episodes as described above and then report the average return of the final 100 episodes played. This means that you are welcome to stop training early once the average return is sufficient. Then, play one more episode and animate it as in the example above. Use $\gamma = 0.99$ and $\alpha \approx 0.001$ in your experiments. Save your final animation and place it in your sciebo folder along with this notebook. When we tested this in our Colab session, the environment was solved after around 1500 steps, taking around 10 minutes of training time. We used a learning rate of $0.005$, but please note that this parameter might be sensitive to the specifics of your implementation.
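The stopping criterion described above can be tracked with a fixed-size window over the most recent episode returns. This is only a sketch: the names `recent_returns` and `is_solved` are hypothetical, and the undiscounted episode return is assumed to be the sum of the rewards of that episode.

```python
from collections import deque

def is_solved(recent_returns, threshold=200, window=100):
    # Only decide once a full window of episodes has been played
    if len(recent_returns) < window:
        return False
    return sum(recent_returns) / window >= threshold

# Keeps only the returns of the last 100 episodes
recent_returns = deque(maxlen=100)

# Hypothetical episode returns, standing in for sum(rewards) after each episode
for episode_return in [150.0] * 50 + [250.0] * 100:
    recent_returns.append(episode_return)

print(is_solved(recent_returns))
```

Inside a training loop, a check of this kind after each episode could be used to break out of the loop early.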

### Hints:

The following tips might help you complete this task:

* You might need to convert your return $R_t$ to the correct datatype:

$$\delta_t = torch.tensor(R_t) - V_{\theta}(s_t)$$

* If you want to use numpy to sample from $\pi_{\theta}(s_t,a)$, you can get a numpy array by calling 

$$\pi_{\theta}(s_t,a).detach().numpy()$$

* When you calculate the log probs $l_t$, preserve the torch gradient graph by using the torch function

$$l_t = torch.log(\pi_{\theta}[s_t,a_t])$$

* When you calculate $L_{actor}$, use $\delta_t.item()$ to ensure that the actor's loss does not directly influence the critic's gradients.

* It might be easier to solve the task for the Cart-Pole environment first, just change ENV_NAME, ENV_STATE_DIM and ENV_ACTION_DIM in one of the previous code blocks. 

* Standardizing the Returns (zero mean and std 1) before calculating $\delta_t$ can boost performance.

* Below you are already given a rough structure for the algorithm. If you stick to it, torch will compute and apply $\nabla_{\theta}L$ for you!
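The standardization mentioned in the hints could be sketched like this. The helper name `standardize` and the small constant `eps`, which guards against division by zero when all Returns are equal, are assumptions and not part of the sheet.

```python
import numpy as np

def standardize(returns, eps=1e-8):
    # Shift to zero mean and scale to (approximately) unit standard deviation
    returns = np.asarray(returns, dtype=np.float64)
    return (returns - returns.mean()) / (returns.std() + eps)

# Hypothetical Returns of one episode
example = standardize([1.5, 1.0, 2.0])
print(example)
```

The result can then be converted with `torch.tensor` before calculating $\delta_t$, as suggested in the first hint.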

<div style="text-align: right; font-weight:bold"> 10 +  3 (animation) Points </div>

In [None]:
# Complete this code or write your own!
# Create the environment
env = gym.make(ENV_NAME)
# Get the combined actor and critic model
net = DualNet()
# Number of episodes to play, maybe use fewer at first
num_iter = 3000
# Learning rate for the parameter updates
alpha = 1e-3
# Discount factor
gamma = 0.99
# The optimizer will do the gradient updates for you
# It needs the trainable parameters and a learning rate
optimizer = torch.optim.Adam(net.parameters(), lr=alpha)

# This progress_bar is useful to know how far along the training is
progress_bar = tqdm(range(num_iter), total=num_iter, position=0, leave=True)
# For each episode (episode can be used like an int)
for episode in progress_bar:
    # Reset the accumulated gradients of the model parameters
    optimizer.zero_grad()
    # Reset the environment and observe the starting state
    s = env.reset()
    done = False
    # Collect the following terms during the episode
    rewards = []
    state_vals = []
    log_probs = []
    while not done:
        # During each Episode:
        # Evaluate the critic for s, store it
        
        # Evaluate the actor for s

        # Sample a from the distribution given by the actor
        
        # Store log_prob of a

        # Execute action a, observe next state, reward and done
        s, r, done, _ = env.step(a)
        # Store the reward
    

    # After each episode is done
    # Calculate the Returns from the episode. Use rewards and gamma

    # You might want to standardize the Returns

    # Calculate delta_t, L_actor, L_critic
    
    # Calculate the loss L
    L = L_critic + L_actor
    # Set the gradients with respect to the parameters
    L.backward()
    # Update the parameters based on the gradients
    optimizer.step()
    # If you want your progress bar to print info, you can use the following template
    # How often to update info
    if episode % 10 == 0:
        # When to first update info
        if episode > 99:
            # List of strings containing info
            episode_summary = [f"{episode+1}:"] + ['List of',  'further strings you might want in your progress bar']
            # Set progress bar
            progress_bar.set_description("".join(episode_summary))

In [None]:
 # Your code for showing the results can go here

## Task 6.3)

In the previous task, you calculated the loss of the policy network as

$$L_{actor} = \sum_t - \log(\pi_{\theta}[s_t,a_t])  \delta_t$$

with 

$$\delta_t = R_t - V_{\theta}(s_t).$$

Give an intuitive explanation of why minimizing this term makes actions with a good outcome more likely and actions with a bad outcome less likely.

<div style="text-align: right; font-weight:bold"> 4 Points </div>

Please answer in this text cell.