# HW 1 - Policy Gradients & Proximal Policy Optimization
This assignment builds up to a simple PPO (2017) implementation by progressing through PPO's predecessor algorithms: <br> REINFORCE (\~1992) and Vanilla Policy Gradients (\~1999). Note that many variations of these algorithms exist. Please use the math contained in this notebook for the coding sections.



## 0. Warm Up Questions [30 pts total; 2 pts each]
Answer each question concisely. One sentence, one formula, one line of code, etc. Use of $\LaTeX$ formatting for math is encouraged.

1.    What is an MDP and what are its four main parts? <br>
Type answer here ...

2.    What is the Markov property? <br>


3.    What is the formula for the objective aka sum of discounted rewards? <br>


4.    Complete the sentence: 'Policy gradient' is shorthand for the 'gradient of ??? with respect to ???'. <br>


5.    What does $\nabla _\theta J (\theta)$ mean in basic language? <br>


6.    What is the formula for gradient of the objective in REINFORCE? (policy gradient slides - Canvas/files/lec-4)

$$ \nabla _\theta J (\theta) \approx  ???$$


7.    Does subtracting a baseline from returns introduce bias in expectation? (policy gradient slides - Canvas/files/lec-4) <br>

8.    Do on-policy algorithms use a replay buffer? <br>

9.    What does $\pi _\theta (a_t | s_t)$ mean in basic language? <br>

10.    What is the log prob of getting heads when flipping a coin? <br>

11.    Finish this basic property of logs:
$\frac{A}{B} = \exp (\log A - ? )$

12. What is a logit in the DRL context? <br>

13. Is a Categorical Distribution continuous or discrete? <br>

14. Logits are used to construct a Categorical distribution. Finish the code to get the log probability of the actions that were sampled. Hint: https://pytorch.org/docs/stable/distributions.html

        logits = self.policy(obs)
        probs = categorical.Categorical(logits=logits)
        actions = probs.sample()
        log_probs = _____


15. In [CartPole-v1](https://www.gymlibrary.dev/environments/classic_control/cart_pole/) what are the physical meanings of states and actions and are they discrete or continuous?


## Imports and Set up
Installs gymnasium, imports deep learning libs, and sets the torch device. **You shouldn't need to change this code.** Your Colab runtime should default to CPU. To double check: **click Runtime (top left of notebook) -> Change runtime type -> select a CPU -> Save**. For simplicity, this notebook doesn't manage data transfers between CPU and GPU, so you need to use a CPU runtime for it to work unmodified. Feel free to experiment with GPUs after submitting.

In [11]:
!pip install gymnasium
import gymnasium as gym

import torch
from torch import nn
from torch.optim import Adam
from torch.distributions import categorical
from copy import deepcopy
from torch.utils.tensorboard import SummaryWriter
import random

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# random seeds for reproducibility
torch.manual_seed(0)
torch.cuda.manual_seed_all(0)
random.seed(0)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Using device: cuda


In [12]:
#@title Device check
def test_device_is_cpu():
    assert device.type == "cpu", "Test failed: Device is not CPU! Read Imports and Set up"

# Run the test
test_device_is_cpu()
print("Test passed: Device is CPU.")


AssertionError: Test failed: Device is not CPU! Read Imports and Set up

## Trajectory Data Storage
This is boilerplate code that lets your on-policy algorithms store their interactions with the environment. It also calculates the return at each time step as the discounted sum of the rewards from that step onward
$
R_t = \sum_{k=t}^{T-1} \gamma^{k-t} r_k
$,
where $T$ is the number of time steps in the rollout.
Note, when a terminal condition is reached (not_dones = 0), the running sum restarts, so rewards from the next episode are not counted. More sophisticated PPO implementations use GAE, but it will work without it. **You shouldn't need to change this code,** but you should understand the store() and calc_returns() functions.

In [4]:
class TrajData:
    def __init__(self, n_steps, n_envs, n_obs, n_actions):
        s, e, o, a = n_steps, n_envs, n_obs, n_actions
        from torch import zeros

        self.states = zeros((s, e, o))
        self.actions = zeros((s, e))
        self.rewards = zeros((s, e))
        self.not_dones = zeros((s, e))

        self.log_probs = zeros((s, e))
        self.returns = zeros((s, e))

        self.n_steps = s

    def detach(self):
        self.actions = self.actions.detach()
        self.log_probs = self.log_probs.detach()

    def store(self, t, s, a, r, lp, d):
        self.states[t] = s
        self.actions[t] = a
        self.rewards[t] = torch.Tensor(r)

        self.log_probs[t] = lp
        self.not_dones[t] = 1 - torch.Tensor(d)

    def calc_returns(self, gamma = .99):
        self.returns = deepcopy(self.rewards)

        for t in reversed(range(self.n_steps-1)):
            self.returns[t] += self.returns[t+1] * self.not_dones[t] * gamma

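As a small worked example of the backward recursion in calc_returns(), here is a sketch for a single environment using plain Python lists instead of tensors. The function name `discounted_returns` and the toy numbers are illustrative assumptions, not part of the assignment code.

```python
def discounted_returns(rewards, dones, gamma=0.99):
    """Reward-to-go for one environment, restarting at episode ends.

    Follows the same backward recursion as TrajData.calc_returns, but on
    plain Python lists (a simplification for illustration only).
    """
    returns = list(rewards)
    for t in reversed(range(len(rewards) - 1)):
        not_done = 1.0 - dones[t]  # 0 when the episode ended at step t
        returns[t] += gamma * not_done * returns[t + 1]
    return returns


rewards = [1.0, 1.0, 1.0, 1.0]
dones = [0, 0, 1, 0]  # the episode ends at t = 2
print(discounted_returns(rewards, dones, gamma=0.5))
# [1.75, 1.5, 1.0, 1.0]
```

The return at t = 2 is just that step's reward, because the done flag blocks the reward at t = 3 (which belongs to the next episode) from flowing backward.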
## DRL Rollout and Update Loop
This is more boilerplate code. It instantiates your parallel gym environments, neural nets (which you will define next), optimizer, and tensorboard logging. It also establishes the rollout/update cycle. During rollout, the agent collects $(s, a, r)$ tuples from the environment. During update, losses are calculated and the DRL agent is updated via gradient descent. **You shouldn't need to change this code.**

In [6]:
class DRL:
    def __init__(self):

        self.n_envs = 64
        self.n_steps = 256
        self.n_obs = 4

        self.envs = gym.vector.SyncVectorEnv([lambda: gym.make("CartPole-v1") for _ in range(self.n_envs)])

        self.traj_data = TrajData(self.n_steps, self.n_envs, self.n_obs, n_actions=1) # 1 action choice is made
        self.agent = Agent(self.n_obs, n_actions=2)  # 2 action choices are available
        self.optimizer = Adam(self.agent.parameters(), lr=1e-3)
        self.writer = SummaryWriter(log_dir=f'runs/{self.agent.name}')


    def rollout(self, i):

        obs, _ = self.envs.reset()
        obs = torch.Tensor(obs)

        for t in range(self.n_steps):
            # PPO doesn't use gradients here, but REINFORCE and VPG do.
            with torch.no_grad() if self.agent.name == 'PPO' else torch.enable_grad():
                actions, probs = self.agent.get_action(obs)
            log_probs = probs.log_prob(actions)
            next_obs, rewards, done, truncated, infos = self.envs.step(actions.numpy())
            done = done | truncated  # episodes don't truncate until t = 500, so truncation never happens here
            self.traj_data.store(t, obs, actions, rewards, log_probs, done)
            obs = torch.Tensor(next_obs)

        self.traj_data.calc_returns()

        self.writer.add_scalar("Reward", self.traj_data.rewards.mean(), i)
        self.writer.flush()


    def update(self):

        # A primary benefit of PPO is that it can train for
        # many epochs on 1 rollout without going unstable
        epochs = 10 if self.agent.name == 'PPO' else 1

        for _ in range(epochs):

            loss = self.agent.get_loss(self.traj_data)

            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()

        self.traj_data.detach()


## Tensorboard
This will launch an interactive tensorboard window within Colab. It will display rewards in (close to) real time while your agents are training. You'll likely have to refresh if it's not updating (circular arrow to the right in the orange bar). **You shouldn't need to change this code.**

In [15]:
# Launch TensorBoard
%load_ext tensorboard
%tensorboard --logdir runs


In [14]:
# @title Visualization code. Used later.

import os
from gymnasium.wrappers import RecordVideo  # use the gymnasium wrapper to match the gymnasium env
from IPython.display import Video, display, clear_output

def visualize(agent):

    video_dir = "./videos"  # Directory to save videos
    os.makedirs(video_dir, exist_ok=True)

    # Create environment with proper render_mode
    env = gym.make("CartPole-v1", render_mode="rgb_array")

    # Apply video recording wrapper
    env = RecordVideo(env, video_folder=video_dir, episode_trigger=lambda x: True)

    obs, _ = env.reset()


    for t in range(4096):
        actions, _ = agent.get_action(torch.Tensor(obs)[None, :])  # Get action from policy
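        # gymnasium's step() returns (obs, reward, terminated, truncated, info)
        obs, _, terminated, truncated, _ = env.step(actions.cpu().item())

        if terminated or truncated: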
            # self.writer.add_scalar("Duration", t, i)
            break

    env.close()

    # Display the latest video
    video_path = os.path.join(video_dir, sorted(os.listdir(video_dir))[-1])  # Get the latest video


    clear_output(wait=True)
    display(Video(video_path, embed=True))

--------------------------------------------------------------------------------
## 1. REINFORCE [30 pts]
1.   Define your policy network [10 pts]
2.   Define the reinforce policy loss using rollout data stored in traj_data [15 pts]
3.   Conceptual question [5 pts]

--------------------------------------------------------------------------------
HINTS:

If you're not super familiar with defining networks in PyTorch, check out this [tutorial](https://medium.com/writeasilearn/using-sequential-module-to-build-a-neural-network-a34ca3f37203).

#### Policy loss for REINFORCE:

$$
\mathcal{L}(\theta) = -\frac{1}{N \cdot T} \sum_{i=0}^{N-1} \sum_{t=0}^{T-1} \log \pi_\theta(a_{i,t} | s_{i,t}) \cdot R_{i,t}
$$

Where:
- $ \mathcal{L} $ is the policy loss; a function of network parameters $\theta$
- $N$ is the total number of environments
- $T$ is the total number of time steps (the slides don't divide by $T$, but doing so only rescales the gradient by a constant, and you need it to pass the unit tests)
- $ \log \pi_{\theta}(a_{i, t} | s_{i,t}) $ is the logarithm of the probability of the action $a$ that was taken in state $s$, given policy $\pi$ parametrized by $\theta$, at timestep t in environment i
- $ R_{i,t} $ is the return (sum of discounted rewards) for environment i at timestep t

<br>

For simplicity, expectation notation is often used, and the subscript $i$ is often dropped:
$$
\mathcal{L}(\theta) = -\mathbb{E}\left[ \log \pi_{\theta}(a_t | s_t) \cdot R_t \right]
$$

We follow this convention going forward.
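To make the averaging concrete, here is a minimal sketch that evaluates the REINFORCE loss on a tiny hand-made batch with $T = 2$ time steps and $N = 2$ environments. It uses nested Python lists rather than tensors; the function name `reinforce_loss` and the numbers are illustrative assumptions.

```python
def reinforce_loss(log_probs, returns):
    """Negative average of log-probability times return.

    log_probs and returns are nested lists shaped [T][N]
    (time steps by environments), mirroring the layout in TrajData.
    """
    n_steps = len(log_probs)
    n_envs = len(log_probs[0])
    total = 0.0
    for lp_row, ret_row in zip(log_probs, returns):
        for lp, ret in zip(lp_row, ret_row):
            total += lp * ret
    return -total / (n_steps * n_envs)


log_probs = [[-0.5, -1.0], [-2.0, -0.25]]
returns = [[2.0, 1.0], [1.0, 4.0]]
print(reinforce_loss(log_probs, returns))
# 1.25
```

Dividing the double sum by $N \cdot T$ is the same as taking the mean over every entry, which is why the expectation notation is a natural shorthand.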



In [22]:
class Agent(nn.Module):
    def __init__(self, n_obs, n_actions):  # use these
        super().__init__()
        self.name = 'REINFORCE'

        torch.manual_seed(0)  # needed before policy init for fair comparison

        # todo: student code here
        self.policy = nn.Sequential(
            nn.Linear(n_obs, 64),
            nn.ReLU(),
            nn.Linear(64, n_actions),
        ) 
        # end student code

    def get_loss(self, traj_data):
        # todo: student code here

        # Multiply each return by its log probability, sum everything, and divide by the number of time steps times the number of environments
        policy_loss = -torch.sum(traj_data.log_probs * traj_data.returns)/(traj_data.n_steps*traj_data.returns.shape[1])

        # end student code
        return policy_loss

    def get_action(self, obs):
        logits = self.policy(obs)
        probs = categorical.Categorical(logits=logits)
        actions = probs.sample()
        return actions, probs


In [23]:
# @title REINFORCE Unit Tests (must run REINFORCE Agent cell above first)
def REINFORCE_policy():
    a = Agent(16, 4)
    assert a.name == 'REINFORCE' and \
    a.policy(torch.randn(8, 16)).shape == (8, 4) and \
    isinstance(list(a.policy.children())[-1], nn.Linear), \
    f"Network not initialized correctly"
    print("Test passed: REINFORCE policy appears correct!")

REINFORCE_policy()

def REINFORCE_loss():
    n_steps, n_envs, n_obs, n_actions = 10, 5, 4, 1
    traj_data = TrajData(n_steps, n_envs, n_obs, n_actions)
    torch.manual_seed(0)
    traj_data.states = torch.rand_like(traj_data.states)
    traj_data.actions = torch.randint(0, n_actions, traj_data.actions.shape)
    traj_data.rewards = torch.rand_like(traj_data.rewards)
    traj_data.not_dones = torch.randint(0, 2, traj_data.not_dones.shape)
    traj_data.log_probs = torch.rand_like(traj_data.log_probs)
    traj_data.returns = torch.rand_like(traj_data.returns)
    a = Agent(n_obs=n_obs, n_actions=n_actions)
    assert abs(a.get_loss(traj_data).item() - (-0.2369)) < 1e-4, \
    "REINFORCE loss does not match expected value."
    print("Test passed: REINFORCE loss appears correct!")

REINFORCE_loss()

Test passed: REINFORCE policy appears correct!
Test passed: REINFORCE loss appears correct!


Run the REINFORCE Agent cell above, and then run the rollout/update cell below. <br> Scroll back up to tensorboard and refresh (circular white arrow in the right of the orange bar) to visualize your reward curve.



In [24]:
drl = DRL()
for i in range(250):
    drl.rollout(i)
    drl.update()

#### REINFORCE Conceptual question:
In 1 or 2 sentences, how does minimizing the REINFORCE loss above achieve our RL goal? <br> Hint: (policy gradient slides - Canvas/files/lec-4 - "What did we just do?")<br>

Minimizing the REINFORCE loss allows us to update our policy accordingly, thus maximizing the probability of choosing actions that provide the highest cumulative reward.

--------------------------------------------------------------------------------
## 2. Vanilla Policy Gradient (aka REINFORCE with Baseline) [40 pts]

1.   Define your networks [15 pts]
  *   Value network
  *   Policy network (same as before)


2.   Define your losses [20 pts]
  *   Value loss
  *   Policy loss (similar to before)
  *   Add them

3.   Conceptual question [5 pts]
--------------------------------------------------------------------------------
HINTS:

#### Value loss
Mean Squared Error (MSE) between the experienced returns and predicted value:

$$
\mathcal{L}_{\text{value}}(\theta) = \mathbb{E}\left[ (R_t - V_{\theta}(s_t))^2 \right]
$$

Where:
- $ \mathcal{L}_{\text{value}}(\theta) $ is the value network loss
- $ V_{\theta}(s_t) $ is the predicted value for state $ s_t $ from the value network
- $ R_t $ is the return (sum of discounted rewards)


#### Policy Loss

The VPG policy loss is quite similar to REINFORCE, but rather than using returns, we use returns minus a baseline value prediction. This quantity is known as the advantage $A(s_t, a_t)$. The advantage is usually defined as $A(s_t, a_t) = Q(s_t, a_t) - V(s_t)$. Returns act as the Q function in our case.

$$A(s_t, a_t) = R_t - V_{\theta}(s_t)$$

$$
\mathcal{L}_{\text{policy}}(\theta) = - \mathbb{E}\left[ \log \pi_{\theta}(a_t | s_t) \cdot A(s_t, a_t) \right]
$$

Where:
- $ \mathcal{L}_{\text{policy}}(\theta) $ is the policy loss
- $ \log \pi_{\theta}(a_t | s_t) $ is the logarithm of the probability of the action $a_t$ that was taken in state $s_t$
- $A(s_t, a_t)$ is the advantage of action $a_t$ that was taken in $s_t$, compared to the average for state $s_t$

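The two formulas above can be checked by hand on a few numbers. The sketch below does this with flat Python lists and a constant value prediction; it is not the Agent implementation, and the function name `vpg_losses` and all inputs are made up for illustration.

```python
def vpg_losses(log_probs, returns, values):
    """Evaluate the advantage, value loss, and policy loss formulas.

    All arguments are flat lists with one entry per (state, action) sample.
    """
    n = len(returns)
    # Advantage: how much better the return was than the predicted value.
    advantages = [ret - val for ret, val in zip(returns, values)]
    # Value loss: mean squared error between returns and predictions.
    value_loss = sum(adv * adv for adv in advantages) / n
    # Policy loss: negative mean of log-probability times advantage.
    policy_loss = -sum(lp * adv for lp, adv in zip(log_probs, advantages)) / n
    return advantages, value_loss, policy_loss


log_probs = [-0.5, -1.0, -2.0, -0.25]
returns = [3.0, 1.0, 2.0, 2.0]
values = [2.0, 2.0, 2.0, 2.0]  # hypothetical value predictions
advantages, value_loss, policy_loss = vpg_losses(log_probs, returns, values)
print(advantages)   # [1.0, -1.0, 0.0, 0.0]
print(value_loss)   # 0.5
print(policy_loss)  # -0.125
```

Samples whose return matches the prediction get zero advantage and contribute nothing to the policy loss, while better-than-predicted and worse-than-predicted samples push in opposite directions.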

In [None]:
class Agent(nn.Module):
    def __init__(self, n_obs, n_actions):  # use these
        super().__init__()
        self.name = 'VPG'

        torch.manual_seed(0)  # needed before network init for fair comparison

        # todo: student code here
        self.policy = None  # replace





        self.value = None  # replace





        # end student code

    def get_loss(self, traj_data):


        # todo: student code here




        loss = None  # replace
        # end student code

        return loss


    def get_action(self, obs):
        logits = self.policy(obs)
        probs = categorical.Categorical(logits=logits)
        actions = probs.sample()
        return actions, probs



In [None]:
# @title VPG Unit Tests (must run VPG Agent cell above first)
def VPG_networks():
    a = Agent(32, 6)
    assert a.name == 'VPG' and \
    a.policy(torch.randn(64, 32)).shape == (64, 6) and \
    a.value(torch.randn(64, 32)).shape == (64, 1) and \
    isinstance(list(a.policy.children())[-1], nn.Linear), \
    f"Networks not initialized correctly"
    print("Test passed: VPG Networks appear correct!")

VPG_networks()

def VPG_loss():
    n_steps, n_envs, n_obs, n_actions = 10, 5, 4, 1
    traj_data = TrajData(n_steps, n_envs, n_obs, n_actions)
    torch.manual_seed(0)
    traj_data.states = torch.rand_like(traj_data.states)
    traj_data.actions = torch.randint(0, n_actions, traj_data.actions.shape)
    traj_data.rewards = torch.rand_like(traj_data.rewards)
    traj_data.not_dones = torch.randint(0, 2, traj_data.not_dones.shape)
    traj_data.log_probs = torch.rand_like(traj_data.log_probs)
    traj_data.returns = torch.rand_like(traj_data.returns)
    a = Agent(4, 1)
    torch.manual_seed(0)
    a.policy = nn.Linear(4, 1)
    a.value = nn.Linear(4, 1)
    assert abs(a.get_loss(traj_data).item() - 0.0618) < 1e-4, \
    "VPG loss does not match expected value."
    print("Test passed: VPG loss appears correct!")

VPG_loss()

Run the VPG Agent cell above, and then run the rollout/update cell below. <br> Scroll back up to tensorboard and refresh (circular white arrow in the right of the orange bar) to visualize your reward curve.

In [None]:
drl = DRL()
for i in range(250):
    drl.rollout(i)
    drl.update()

#### VPG Conceptual Question:
In 2 or 3 sentences, why might subtracting a value network baseline improve performance of our RL agent? (Hint: policy gradient slides) Based on the tensorboard curves, what is the effect in this environment? Why? <br>

Type answer here...

--------------------------------------------------------------------------------
## 3. Optional Extra Credit: Proximal Policy Optimization <br> (aka REINFORCE with Baseline and Clipped Surrogate Objective) [10 pts]

1.   Define your networks [1 pt]
  *   Value network (same as VPG)
  *   Policy network (same as VPG)

2.   Define your losses [5 pts]
  *   Value loss (same as VPG)
  *   Policy loss (the heart of PPO)
  *   Add them

3. Generalized Advantage Estimation (GAE) [4 pts]
--------------------------------------------------------------------------------
HINTS:

#### Policy Loss

Our PPO policy loss still uses the advantage defined in VPG:

$$A(s_t, a_t)  = A_t = R_t - V_{\theta}(s_t)$$

But we maximize a clipped surrogate objective which is designed to keep policy updates bounded:

$$
\mathcal{L}_{\text{clip}}(\theta) = \mathbb{E}_t \left[ \min \left( \frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{\text{old}}}(a_t | s_t)} \cdot A_t, \text{clip}\left(\frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{\text{old}}}(a_t | s_t)}, 1 - \epsilon, 1 + \epsilon\right) \cdot A_t \right) \right].
$$

where:

- $ \pi_\theta(a_t | s_t) $: the probability of taking action $a_t$ in state $s_t$ under the current policy with parameters $ \theta $.
- $ \pi_{\theta_{\text{old}}}(a_t | s_t) $: the probability of taking action $a_t$ in state $s_t$ under the old policy before the update.
- $ A_t $: the advantage estimate at timestep $t$.
- $ \epsilon $: the clip range hyperparameter that limits policy updates.
- $ \text{clip}(x, 1 - \epsilon, 1 + \epsilon) $: clips $x$ to the range $[1 - \epsilon, 1 + \epsilon]$ to ensure conservative updates.

<br>

Let's break it down.


*   First, $\frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{\text{old}}}(a_t | s_t)}$ is the probability ratio between the policy being updated and the policy that was rolled out to collect the training data (traj_data). It is only meaningful when multiple epochs of training are performed on the training data from a single rollout. Indeed, in the first epoch, the current and old policies are the same so the ratio will be one.

*   For numerical stability, we leverage a basic property of logs ($\frac{A}{B} = \exp (\log A - \log B )$), and we calculate this ratio as

$$
\frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{\text{old}}}(a_t | s_t)} = \exp \left( \log \pi_\theta(a_t | s_t) - \log \pi_{\theta_{\text{old}}}(a_t | s_t) \right)
$$


*   Conceptually, defining policy loss as the product $\frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{\text{old}}}(a_t | s_t)} \cdot A_t$ is enough to train a policy. Feel free to try it and view your learning results in tensorboard.

*   However, after several epochs of training, the new policy probabilities $\pi_\theta(a_t | s_t)$ may deviate so far from the old policy probabilities $\pi_{\theta_{\text{old}}}(a_t | s_t)$ that the advantage data from the rollout (traj_data) is no longer valid. This can cause catastrophic collapse in the policy.

*   Enter $\text{clip}\left(\frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{\text{old}}}(a_t | s_t)}, 1 - \epsilon, 1 + \epsilon\right)$, which never lets the probability ratio become smaller than $1 - \epsilon$ or larger than $1 + \epsilon$, where common values of $\epsilon$ are .2, .1, or .05. It's applied pointwise across all $(s, a)$ pairs.

*   Finally, by taking the pointwise minimum of the unclipped and clipped products across all $(s, a)$ pairs, we ensure the largest policy update possible is made, while remaining conservatively close to the old policy.

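The breakdown above can be traced on single samples. This is a minimal scalar sketch of the term inside the expectation of $\mathcal{L}_{\text{clip}}$, using only the standard library; the function name `clipped_surrogate` and the probabilities are illustrative assumptions, and a real implementation would apply the same steps pointwise to tensors.

```python
import math


def clipped_surrogate(new_log_prob, old_log_prob, advantage, epsilon=0.2):
    """Clipped surrogate objective for one (state, action) sample."""
    # Probability ratio computed from log probabilities for stability.
    ratio = math.exp(new_log_prob - old_log_prob)
    # Keep the ratio inside [1 - epsilon, 1 + epsilon].
    clipped_ratio = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
    # Take the more pessimistic of the unclipped and clipped products.
    return min(ratio * advantage, clipped_ratio * advantage)


old_prob = 0.2  # hypothetical probability under the old policy
for new_prob, adv in [(0.3, 2.0), (0.3, -2.0), (0.1, 2.0), (0.1, -2.0)]:
    value = clipped_surrogate(math.log(new_prob), math.log(old_prob), adv)
    print(f"ratio={new_prob / old_prob:.1f}  A={adv:+.1f}  objective={value:+.2f}")
# ratio=1.5  A=+2.0  objective=+2.40
# ratio=1.5  A=-2.0  objective=-3.00
# ratio=0.5  A=+2.0  objective=+1.00
# ratio=0.5  A=-2.0  objective=-1.60
```

The clipped product is selected only in the first and last rows, where the ratio has already moved past the clip range in the direction the advantage favors. In the other two rows the unclipped product is smaller, so the minimum keeps it.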

In [None]:
class Agent(nn.Module):
    def __init__(self, n_obs, n_actions):
        super().__init__()
        self.name = 'PPO'

        torch.manual_seed(0)  # needed before network init for fair comparison

        # todo: student code here
        self.policy = None  # replace




        self.value = None  # replace




        # end student code

    def get_loss(self, traj_data, epsilon=.1):

        # todo: student code here







        loss = None  # replace
        # end student code

        return loss

    def get_action(self, obs):
        logits = self.policy(obs)
        probs = categorical.Categorical(logits=logits)
        actions = probs.sample()
        return actions, probs

In [None]:
# @title PPO Unit Tests (must run PPO Agent cell above first)
def PPO_networks():
    a = Agent(12, 7)
    assert a.name == 'PPO' and \
    a.policy(torch.randn(128, 12)).shape == (128, 7) and \
    a.value(torch.randn(4, 12)).shape == (4, 1) and \
    isinstance(list(a.policy.children())[-1], nn.Linear), \
    f"Networks not initialized correctly"
    print("Test passed: PPO Networks appear correct!")

PPO_networks()

def PPO_loss():
    n_steps, n_envs, n_obs, n_actions = 10, 5, 4, 1
    traj_data = TrajData(n_steps, n_envs, n_obs, n_actions)
    torch.manual_seed(0)
    traj_data.states = torch.rand_like(traj_data.states)
    traj_data.actions = torch.randint(0, n_actions, traj_data.actions.shape)
    traj_data.rewards = torch.rand_like(traj_data.rewards)
    traj_data.not_dones = torch.randint(0, 2, traj_data.not_dones.shape)
    traj_data.log_probs = torch.rand_like(traj_data.log_probs)
    traj_data.returns = 2*torch.rand_like(traj_data.returns) - 1
    a = Agent(4, 1)
    torch.manual_seed(0)
    a.policy = nn.Linear(4, 1)
    a.value = nn.Linear(4, 1)
    assert abs(a.get_loss(traj_data).item() - 0.9314) < 1e-4, \
    "PPO loss does not match expected value."
    print("Test passed: PPO loss appears correct!")

PPO_loss()

Run the PPO cell above, and then run this cell, to plot results in tensorboard.

In [None]:
drl = DRL()
for i in range(250):
    drl.rollout(i)
    drl.update()

#### PPO Conceptual Questions [ 0 pts / Ungraded / Not for credit]:
Useful questions to check your understanding, **do not contribute to your grade** ...
*   If the advantage for a state-action pair $A(s_t, a_t)$ is a large positive number and $\epsilon = .2$, how might the probability ratio for $(s_t, a_t)$ evolve over 10 training epochs with a large learning rate? What if the advantage is large and negative? What if it's zero? <br>


*   When the clipped expression is activated for a given state-action pair, what is the gradient of the loss function with respect to network parameters for that state action pair? How will the probability of that action be changed during back propagation? $$
\text{clip}\left(\frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{\text{old}}}(a_t | s_t)}, 1 - \epsilon, 1 + \epsilon\right)  
$$
<br>

## Optional Visualization

In [None]:
untrained = DRL()
visualize(untrained.agent)
print("untrained agent")

In [None]:
# each time this cell is run, a new random rollout is recorded
# optional: put this cell below REINFORCE or VPG and re-run their training if you'd like to visualize them.
visualize(drl.agent)
print("PPO trained agent")

#### More Optional Extra Credit: GAE [4 pts]
If you have time and you're up for a challenge... Implement Generalized Advantage Estimation for PPO  or VPG instead of our simple sum of discounted rewards. Plot the reward curves in tensorboard.

This is a free-form assignment. You must maintain the original functionality previously implemented, but beyond that, modify the code however you see fit. Copy-paste code below or modify it in place. For GAE, disregard these warnings: **You shouldn't need to change this code**. You will probably need to modify TrajData, DRL.rollout(), and implement a new Agent. Note, we haven't tested how easy or hard these changes are, so proceed with caution.


Here's a great lecture by Sergey Levine on [Eligibility traces and GAE](https://www.youtube.com/watch?v=quRjnkj-MA0), especially starting from 7:41. Good luck and happy coding!

