<a href="https://colab.research.google.com/github/portal-cornell/cs4756_robot_learning/blob/main/assignments/HW1/CS_4756_Assignment_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Introduction**

Welcome to your first official coding assignment of CS 4756. In this short notebook, we are going to train agents to play the LunarLander game from OpenAI Gym, using data provided by an expert agent. Specifically, you will gain exposure to implementing behavioral cloning (BC) and dataset aggregation (DAgger) methods.

**Evaluation:**
Your code will be tested for correctness, and for certain assignments, speed. Please remember that all assignments should be completed individually.

**Academic Integrity:** We will be checking your code against other submissions in the class for logical redundancy. If you copy someone else’s code and submit it with minor changes, we will know. These cheat detectors are quite hard to fool, so please don’t try. We trust you all to submit your own work only; please don’t let us down. If you do, we will pursue the strongest consequences available to us.

**Getting Help:** The [#resources](https://www.cs.cornell.edu/courses/cs4756/2023sp/#resources) section on the course website is your friend (especially for this first assignment)! If you ever feel stuck in these projects, please feel free to avail yourself of office hours and Edstem! If you are unable to make any of the office hours listed, please let the TAs know and we will be happy to assist. Since this is the first iteration of this course, please do not hesitate to reach out to the TAs if you find any errors in the assignments.

### Preliminaries

In this assignment we will be using modules and agents from OpenAI Gym. Please run the following cells to make sure this notebook is properly configured. You should only need to run the installation cell once, so feel free to comment it out after the packages have been installed the first time.

If the `pip install` leaves you with any messages at the bottom telling you to install more packages, please feel free to add those lines in a separate cell. 

In [None]:
!pip3 install gym[box2d]
!pip3 install -q stable-baselines3[extra]

In [None]:
# Part 1 imports
from typing import Type, List
import torch
from torch import nn
import numpy as np

# Part 2 imports
from torch import optim

# Part 3 imports
import gym
from stable_baselines3.ppo import PPO
import torch.nn as nn
import argparse

Please run the following cells to get the expert data onto your notebook. Then run `!ls` and verify that the file `lunarlander_expert.zip` exists.



In [None]:
!wget -nc https://github.com/portal-cornell/cs4756_robot_learning/blob/main/assignments/HW1/LunarLander-v2/lunarlander_expert.zip?raw=true

In [None]:
!mv lunarlander_expert.zip?raw=true lunarlander_expert.zip
!ls | grep "lunar"

To verify that everything has been downloaded correctly, please run the following cell and check for errors. If none appear, you're good to go! (You may ignore the `warnings.warn()` messages.)

In [None]:
env = gym.make("LunarLander-v2")
if PPO.load("./lunarlander_expert"):
    print("Success!")

**Please make sure there are no errors in the above import statements before continuing to the rest of the assignment**

### Part 1: Simple utilities for later use

In the following section, we will define most of the helper methods that will become useful to you in training and evaluating your imitation agents. There are three sets of utility functions: "NEURAL NET UTILS", "ENV UTILS", and "EVAL UTILS." Please take the time to understand what each function is doing, and also implement the `argmax_policy()` function in the first cell below.

**Note: you may ignore all functions that deal with truncate until you get to the extra credit.**

In [None]:
# ====== NEURAL NET UTILS ======

def create_mlp(input_dim: int, output_dim: int, architecture: List[int], squash=False, activation: Type[nn.Module]=nn.ReLU) -> List[nn.Module]:
    '''Creates a list of modules that define an MLP.'''
    if len(architecture) > 0:
        layers = [nn.Linear(input_dim, architecture[0]), activation()]
    else:
        layers = []
        
    for i in range(len(architecture) - 1):
        layers.append(nn.Linear(architecture[i], architecture[i+1]))
        layers.append(activation())
    
    if output_dim > 0:
        last_dim = architecture[-1] if len(architecture) > 0 else input_dim
        layers.append(nn.Linear(last_dim, output_dim))
        
    if squash:
        # squashes output down to (-1, 1)
        layers.append(nn.Tanh())
    
    return layers

def create_net(input_dim: int, output_dim: int, squash=False):
    layers = create_mlp(input_dim, output_dim, architecture=[64, 64], squash=squash)
    net = nn.Sequential(*layers)
    return net

def argmax_policy(net):
    # TODO: Return a FUNCTION that takes in a state, passes it through the network, and outputs index of the action with the highest probability.
    # Inputs:
    # - net: (type nn.Module). A neural network module, going from state dimension to number of actions.
    # Wanted output:
    # - argmax_fn: A function which takes in a state, and outputs argmax of the action vector.
    pass

def expert_policy(expert, s):
    '''Returns a one-hot encoded action of what the expert predicts at state s.'''
    action = expert.predict(s)[0]
    one_hot_action = np.eye(4)[action]
    return one_hot_action
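
As a small illustration of the action format used by `expert_policy()`, the sketch below shows how `np.eye(4)[action]` builds a one-hot vector and how `np.argmax` recovers the action index from it. The variable names are placeholders chosen for this example, not part of the assignment code.

```python
import numpy as np

n_actions = 4          # LunarLander-v2 has 4 discrete actions (matches np.eye(4) above)
action = 2             # an example action index

# Row `action` of the identity matrix is the one-hot encoding of that action.
one_hot = np.eye(n_actions)[action]

# np.argmax on the one-hot vector recovers the original index.
recovered = int(np.argmax(one_hot))

# Caution: np.argmax of a bare integer (a 0-d array) is always 0,
# so code that calls np.argmax expects a vector of per-action scores.
argmax_of_scalar = int(np.argmax(np.array(3)))
```

This is why helpers that call `np.argmax` on an action, such as `rollout()`, work with action vectors (network outputs or one-hot encodings) rather than raw action indices.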

In [None]:
# ====== ENV UTILS ======

def rollout(net, env, truncate=True):
    '''Rolls out a trajectory in the environment, with optional state masking.'''
    states = []
    actions = []
    
    ob = env.reset()
    done = False
    total_reward = 0
    
    while not done:
        states.append(ob.reshape(-1))
        ob_tensor = torch.from_numpy(np.array(ob))
        if truncate:
            action = net(ob_tensor[:-2].float())
        else:
            action = net(ob_tensor.float())
            
        # detach action and convert to np array
        if isinstance(action, torch.Tensor):
            action = action.detach().numpy()
        actions.append(action.reshape(-1))
        
        # step env
        ob, r, done, _ = env.step(np.argmax(action))
        total_reward += r
        
    states = np.array(states, dtype='float')
    actions = np.array(actions, dtype='float')
    return states, actions

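def expert_rollout(expert, env, truncate=False):
    '''Rolls out an expert trajectory in the environment, with optional state masking.'''
    # Return one-hot actions so that rollout()'s np.argmax recovers the expert's
    # choice; np.argmax of a bare integer action would always be 0.
    expert_net = lambda s: expert_policy(expert, np.asarray(s))
    return rollout(expert_net, env, truncate=truncate)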

In [None]:
# ====== EVAL UTILS ======

def eval_policy(policy, env, truncate=True):
    '''Evaluates policy with one trajectory in environment. Returns accumulated reward.'''
    done = False
    ob = env.reset()
    total_reward = 0
    while not done:
        if truncate:
            action = policy(ob[:-2])
        else:
            action = policy(ob)
        
        # detach action and convert to np array
        if isinstance(action, torch.Tensor):
            action = action.detach().numpy()
        
        # step env and observe reward
        ob, r, done, _ = env.step(action)
        total_reward += r
    
    return total_reward

### Part 2: Behavioral Cloning & DAgger

It is now time to build up our agents! Please read the directions carefully, and avail yourself of the many resources in this class if you feel stuck!

**Behavioral cloning:** Behavioral cloning is the simplest imitation learning algorithm, where we perform supervised learning on the given (offline) expert dataset. We either do this via log likelihood maximization (cross entropy minimization) in the discrete action case, or mean-squared error minimization (can also do MLE) in the continuous control setting.
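
For reference, the two objectives mentioned above can be written out as follows. The notation is an assumption introduced here for illustration: $\pi_\theta$ is the learner's policy with parameters $\theta$, and $\{(s_i, a_i^\star)\}_{i=1}^{N}$ are the expert's state-action pairs.

$$
\hat{\theta}_{\text{discrete}} = \arg\min_{\theta} \; -\frac{1}{N} \sum_{i=1}^{N} \log \pi_\theta\!\left(a_i^\star \mid s_i\right)
$$

$$
\hat{\theta}_{\text{continuous}} = \arg\min_{\theta} \; \frac{1}{N} \sum_{i=1}^{N} \left\lVert \pi_\theta(s_i) - a_i^\star \right\rVert_2^2
$$

The first is the average negative log-likelihood of the expert's actions, which is the cross-entropy loss when the targets are one-hot. The second treats $\pi_\theta(s_i)$ as a deterministic action prediction and penalizes its squared distance to the expert's action.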

Please implement the following `learn()` function for BC.

In [None]:
class BC:
    def __init__(self, net, loss_fn):
        self.net = net
        self.loss_fn = loss_fn
        
        self.opt = optim.Adam(self.net.parameters(), lr=3e-4)
        
    def learn(self, env, states, actions, n_steps=int(1e4), truncate=True):
        # TODO: Implement this method. Return the final greedy policy (argmax_policy).
        pass

**Dataset aggregation (DAgger):** DAgger is a fundamentally interactive algorithm, where we are able to query the expert any time we want to get information about how to proceed. This allows for significantly more freedom for the learner, as it can ask the expert anywhere and not be limited by the dataset that it is given to learn from.
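
To make the interactive loop concrete, here is a minimal sketch of the DAgger pattern on a toy problem. Everything in it is hypothetical and chosen only for illustration: a 1D chain of integer states, a hand-written `expert_action`, and a tabular majority-vote learner in place of a neural network. It is not the assignment solution, and it uses the simplest variant in which the learner's own policy drives every rollout after the first.

```python
import random
from collections import Counter, defaultdict

N_STATES, HORIZON, TARGET = 10, 8, 5  # toy problem sizes (assumptions)

def expert_action(state):
    """Toy expert: move right (1) below the target, otherwise left (0)."""
    return 1 if state < TARGET else 0

def step(state, action):
    """Toy dynamics: move one cell left or right, clipped to the chain."""
    move = 1 if action == 1 else -1
    return min(max(state + move, 0), N_STATES - 1)

def fit(dataset):
    """Supervised step: majority vote over expert labels seen at each state."""
    votes = defaultdict(Counter)
    for state, action in dataset:
        votes[state][action] += 1
    table = {s: c.most_common(1)[0][0] for s, c in votes.items()}
    return lambda s: table.get(s, 0)  # unseen states default to action 0

def run_dagger(n_iterations=5, seed=0):
    rng = random.Random(seed)
    dataset = []                              # aggregated across iterations
    policy = lambda s: rng.choice([0, 1])     # initial learner acts randomly
    for iteration in range(n_iterations):
        state = rng.randrange(N_STATES)
        for t in range(HORIZON):
            # Query the expert at the state the LEARNER reached...
            dataset.append((state, expert_action(state)))
            # ...but let the learner's action drive the environment.
            state = step(state, policy(state))
        policy = fit(dataset)                 # retrain on everything so far
    return policy, dataset

policy, dataset = run_dagger()
```

The key contrast with BC is in the inner loop: the states come from the learner's own rollouts, while the labels come from the expert, and the dataset keeps growing across iterations.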

Like BC, please implement the following `learn()` function for DAgger.

In [None]:
class DAgger:
    def __init__(self, net, loss_fn, expert):
        self.net = net
        self.loss_fn = loss_fn
        self.expert = expert
        
        self.opt = optim.Adam(self.net.parameters(), lr=3e-4)
        
    def learn(self, env, n_steps=int(1e4), truncate=True):
        # TODO: Implement this method. Return the final greedy policy (argmax_policy).
        # Make sure you are making the learning process fundamentally expert-interactive.
        pass

### Part 3: Training loop

Now with the hard part out of the way, it's time to see the performance of your networks! For imitation learning to work, all you need is access to some expert trajectories. The good news is, we've got you covered! 🙂

Please implement the training loop in `train()` according to the instructions in the code.
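
One detail that may help when structuring your data: `train()` uses `nn.CrossEntropyLoss`, which accepts either integer class indices or per-class probabilities as targets (the latter assumes PyTorch 1.10 or newer). The sketch below uses made-up logits to show that the two target formats give the same loss for one-hot labels. The tensor values and variable names are illustrative only.

```python
import torch
from torch import nn

loss_fn = nn.CrossEntropyLoss()

# Made-up network outputs for a batch of 2 states and 4 actions.
logits = torch.tensor([[2.0, 0.5, -1.0, 0.0],
                       [0.1, 0.2, 3.0, -0.5]])

# Targets as class indices (dtype int64).
index_targets = torch.tensor([0, 2])

# The same targets as one-hot rows (dtype float32, matching the logits).
one_hot_targets = torch.eye(4)[index_targets]

loss_from_indices = loss_fn(logits, index_targets)
loss_from_one_hot = loss_fn(logits, one_hot_targets)
```

Note that `rollout()` returns its arrays with `dtype='float'`, which is 64-bit in NumPy, so a cast to `float32` is likely needed before passing them to a network or loss that works in `float32`.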

In [None]:
def make_env():
    return gym.make("LunarLander-v2")

def get_expert():
    return PPO.load("./lunarlander_expert.zip")

def get_expert_performance(env, expert):
    '''Returns the expert's mean return over 100 episodes.'''
    Js = []
    for _ in range(100):
        obs = env.reset()
        J = 0
        done = False
        while not done:
            action, _ = expert.predict(obs)
            obs, reward, done, info = env.step(action)
            J += reward
        Js.append(J)
    ll_expert_performance = np.mean(Js)
    return ll_expert_performance

In [None]:
def train(train_bc=True, truncate=False, n_steps=10000):
    env = make_env()
    expert = get_expert()
    
    performance = get_expert_performance(env, expert)
    print('=' * 20)
    print(f'Expert performance: {performance}')
    print('=' * 20)
    
    # net + loss fn
    if truncate:
        net = create_net(input_dim=6, output_dim=4)
    else:
        net = create_net(input_dim=8, output_dim=4)
    
    loss_fn = nn.CrossEntropyLoss()
    
    if train_bc:
        # TODO: train BC
        # Things that need to be done:
        # - Roll out the expert for X number of trajectories (a standard amount is 10).
        # - Create our BC learner, and train BC on the collected trajectories.
        # - It's up to you how you want to structure your data!
        # - Evaluate the argmax_policy by printing the total rewards.
        pass
    else:
        # TODO: train DAgger
        # Things that need to be done.
        # - Create our DAgger learner.
        # - Set up the training loop. Make sure it is fundamentally interactive!
        # - It's up to you how you want to structure your data!
        # - Evaluate the argmax_policy by printing the total rewards.
        pass

In [None]:
train_bc = True
truncate = False
n_steps = 10_000

train(train_bc, truncate, n_steps)

### Extra Credit:

As a reminder, all extra credit problems are optional for students taking the 4xxx version of this course, but compulsory for students taking the 5xxx version.

Using the `truncate` option, create a “partially observable” lunar lander environment where the angular velocity is masked out and not available to the learner (it’s still available to the expert!). You may find yourself needing to add or modify some of your existing code, so we recommend saving a version beforehand so that the performance of the fully observable version is not affected by these changes.
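
As a starting point, the sketch below shows what the `[:-2]` slice in the provided helpers does to an 8-dimensional observation when `truncate=True`. The observation values are placeholders, not real environment output.

```python
import numpy as np

# Placeholder 8-dimensional observation (real values come from env.reset()/env.step()).
ob = np.arange(8, dtype=np.float32)

# The helpers pass ob[:-2] to the learner, i.e. everything except the last two entries.
truncated = ob[:-2]

# The truncated observation has 6 entries, matching create_net(input_dim=6, ...) in train().
n_inputs = truncated.shape[0]
```

Which physical quantities the dropped entries correspond to depends on the environment's observation layout, so it is worth checking the slice against the LunarLander documentation and adjusting the masking if needed.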


### Writeup

In a separate PDF, please include answers to the questions at the bottom of the Assignment [doc](https://docs.google.com/document/d/1YA8hpE7R8M0prgMtSMKHLdUzWZq3VSoW8mvMOw1EB90/edit#). Double check your work to make sure all cells are running properly. If you did the extra credit, there are two additional questions to answer in the writeup.

**Some questions will ask for an accompanying graph to support your response. Don't forget to include these!**