## Policy Gradient

The general organization of the homework is given below.

- pg
    - reinforce
        - model
        - box2d
    - a2c
        - model
        - vecenv
        - box2d
        - pong
    - common

In this homework, we will implement REINFORCE and A2C agents and run them in the CartPole and LunarLander environments. There will also be a Pong run with the A2C agent (luckily, a GPU is not a must for PG agents).

#### Running

Each experiment will be trained from scratch 5 times, each time with a different seed, to get a good understanding of the stochasticity involved in training. You can run your experiments with command-line arguments from the IPython notebook as shown below, or with a bash script. **Please do not change the default values of the arguments!**

In [None]:
!python pg/a2c/box2d.py --nenv 16 > logs/a2c/lunarlander/lunar_1.csv


You can then parse the CSV file with the parser method to use it in visualizations. Note that, except for the Pong experiment, which you will run only once, every experiment must have 5 different runs (with the same arguments and code but with different seeds).
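
The five runs can also be scripted instead of being launched one by one. The sketch below only builds and launches the command shown above with stdout redirected to a numbered CSV file. The helper names `build_runs` and `launch` are hypothetical, and the sketch assumes each invocation picks its own seed, since how the seed is set is not shown here.

```python
import subprocess
import sys
from pathlib import Path


def build_runs(script="pg/a2c/box2d.py", log_dir="logs/a2c/lunarlander",
               prefix="lunar", n_runs=5, extra_args=("--nenv", "16")):
    """Return one (command, log_path) pair per run."""
    runs = []
    for i in range(1, n_runs + 1):
        command = [sys.executable, script, *extra_args]
        log_path = Path(log_dir) / "{}_{}.csv".format(prefix, i)
        runs.append((command, log_path))
    return runs


def launch(runs):
    """Run each command and redirect its stdout to the CSV log file."""
    for command, log_path in runs:
        log_path.parent.mkdir(parents=True, exist_ok=True)
        with open(log_path, "w") as log_file:
            subprocess.run(command, stdout=log_file, check=True)


# launch(build_runs())  # uncomment to start the five runs sequentially
```

The runs are executed one after another, so the total time is roughly five times that of a single run.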

#### Plotting

When you are done with the experiments, you can plot the statistics. We are interested in seeing how much variation exists across training runs. **You need to keep the log files for the submission!** Do not rely on the plots in the IPython notebook!

In [None]:
import numpy as np
import matplotlib.pyplot as plt

def plot(x, train_rewards, title):
    """ Plot the median and the interquartile range over the runs.
        Arguments:
            - x: Shared x axis of (N) dim array
            - train_rewards: (5, N) dim array
            - title: Title of the figure
    """
    if len(train_rewards.shape) != 2:
        raise ValueError("train_rewards must be 2 dimensional")
    
    fig = plt.figure(figsize=(7, 5))

    plt.title(title)
    plt.xlabel("Iteration")
    plt.ylabel("Mean Episodic reward")
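    # Median over the runs at every iteration
    plt.plot(
        x,
        np.percentile(train_rewards, 50, axis=0),
        label="Median"
    )
    # Shaded band between the 25th and 75th percentiles
    plt.fill_between(
        x,
        np.percentile(train_rewards, 25, axis=0),
        np.percentile(train_rewards, 75, axis=0),
        alpha=0.5,
        label="Interquartile range"
    )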

    plt.legend()

## Example Usage

In [None]:
from pg.common import PrintWriter
import numpy as np

# Assuming that you have the files (lunar_1, lunar_2, ..., lunar_5)
runs = [PrintWriter.parse("logs/a2c/lunarlander/lunar_{}.csv".format(i)) for i in range(1, 6)]

In [None]:
train_rewards = np.array([r[" Reward"] for r in runs])
x_axis = np.array(runs[0]["Iteration"])

In [None]:
train_rewards[0] *= 0.8
train_rewards[1] *= 0.9
train_rewards[2] *= 1.1
train_rewards[3] *= 1.2

In [None]:
plot(x_axis, train_rewards, "A2C LunarLander")

### REINFORCE Implementation

The implementation of REINFORCE is pretty simple. There are two Python files, and we will begin by filling in ```model.py```. Since you are familiar with REINFORCE, we can start by implementing the ```learn``` method in the ```Reinforce``` class. It accepts ```args```, ```opt```, and ```env``` as parameters. We will loop for as many episodes as given by ```args.n_episodes```. Note that if an episode does not terminate within ```args.max_episode_len``` steps, we will terminate it manually.

We need to obtain a rollout that consists of a list of transitions. Each transition in the rollout needs to include the log probability of the selected (and performed) action and the reward. Simply call the ```policynet``` and create a Categorical distribution. Then, take a sample from the distribution and step the environment with it. Now, we just need to obtain the log probability of taking that particular action. If you implemented everything correctly, the rollout list that is initialized just above the rollout loop is filled with transitions of log probability and reward (the number of transitions must be equal to the length of the episode).
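
The sampling step described above can be sketched as follows. This is a minimal illustration rather than the homework's reference solution: the linear layer stands in for the real policy network, the state is a placeholder, and the sketch assumes the network returns logits (if it returns probabilities, pass `probs=` instead of `logits=`). The helper name `select_action` is hypothetical.

```python
import torch
from torch.distributions import Categorical

# Hypothetical stand-in for the policy network: 4 state features, 2 actions.
policynet = torch.nn.Linear(4, 2)


def select_action(policynet, state):
    """Sample an action and return it together with its log probability."""
    logits = policynet(state)            # unnormalized action scores
    dist = Categorical(logits=logits)    # softmax is applied internally
    action = dist.sample()               # tensor holding the action index
    log_prob = dist.log_prob(action)     # stays attached to the graph
    return action.item(), log_prob


state = torch.zeros(4)                   # placeholder state
action, log_prob = select_action(policynet, state)

# One transition of the rollout: (log probability, reward).
# The reward would come from stepping the environment with `action`.
rollout = [(log_prob, 1.0)]
```

The log probability is kept as a tensor (not converted to a float) so that gradients can later flow back to the policy parameters.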

Since we have the rollout, we can calculate gradients with respect to the policy parameters. Fill in the missing part in ```accumulate_gradient```. Don't forget to call ```.backward()``` for every log probability in order to minimize the negative log-likelihood. Note that we will not update the parameters in this method.
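
REINFORCE weights each log probability by the return that follows it. The sketch below computes discounted returns for one episode in plain Python. The function name `discounted_returns` and the `gamma` parameter are assumptions made for illustration; the homework code may name or organize this differently.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = []
    running = 0.0
    # Walk backwards so that each return reuses the one after it.
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

With these returns, a common form of the per-transition loss is `-log_prob * G_t`; calling `.backward()` on it accumulates the gradient without updating the parameters.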

Now, the only remaining part is the ```box2d.py``` file. Fill in the missing parts in the ```PolicyNet``` class.

After the implementation is completed, you can run experiments. Don't forget to tune the hyperparameters as they affect the performance of the training.

### Experiments

We will be running two experiments, each containing 5 runs as mentioned previously.

- Run REINFORCE in CartPole-v1 (Reach at least +400 mean reward for the last 20 episodes)
- Run REINFORCE in LunarLander-v2 (Reach at least +100 mean reward for the last 20 episodes)


By default, the writer logs the mean reward of the last 20 episodes. This can be changed by overriding the ```--log-window``` command-line argument.
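
For intuition, a mean over the last 20 episodes can be kept with a fixed-size queue. This is only an illustration of the statistic, not the writer's actual implementation; the class name `WindowMean` is hypothetical.

```python
from collections import deque


class WindowMean:
    """Mean of the most recent `window` episode rewards."""

    def __init__(self, window=20):
        # Old rewards are dropped automatically once the queue is full.
        self.rewards = deque(maxlen=window)

    def add(self, episode_reward):
        """Store a finished episode's reward and return the current mean."""
        self.rewards.append(episode_reward)
        return sum(self.rewards) / len(self.rewards)


tracker = WindowMean(window=2)
tracker.add(1.0)
tracker.add(3.0)
print(tracker.add(5.0))  # 4.0, the mean of the last two rewards (3.0 and 5.0)
```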

Plot these results (2 Plots). Also **keep** the ```.csv``` files.

In [None]:
# Plot Experiment 1

In [None]:
# Plot Experiment 2

### A2C Implementation

A2C is a synchronous version of the popular A3C algorithm. The synchronization is done via a wrapper that runs multiple environments in parallel. We use the wrapper provided in the ```pg/a2c/vecenv.py``` file.

#### Model

Fill in the missing part in the ```forward``` method in ```pg/a2c/model.py```. You may want to check the return value of the network in ```pg/a2c/box2d.py``` to know what to expect when you call ```self.network(.)``` in the ```forward``` method.

When filling in the ```forward``` method in ```pg/a2c/model.py```, you may want to use ```torch.distributions.categorical.Categorical``` to represent the categorical distribution using the ```logit``` output of the ```network``` call. Please check out the Categorical class in the torch documentation.

The next part to fill in is ```accumulate_gradient```, where the Rollout (a Rollout object) is used to calculate the loss per transition in the rollout. Since we are using a baseline now, you need to calculate the n-step target. Observe that the Rollout class is a namedtuple with the attributes ```list``` and ```target_value```. Here ```list``` is the list of Transitions (also defined in the class definition), while ```target_value``` must be the value of the next state at the last transition (simply, it is the (n+1)th state's value). Similar to REINFORCE, backpropagate the loss but do not update the parameters.
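
The n-step target described above can be sketched as follows. This is a minimal illustration in plain Python, assuming that per-step rewards and termination flags are available (the exact fields of Transition are not shown here). The names `n_step_targets`, `dones`, and `gamma` are hypothetical.

```python
def n_step_targets(rewards, dones, target_value, gamma=0.99):
    """Bootstrap targets backwards from the value of the (n+1)th state."""
    targets = []
    running = target_value
    for reward, done in zip(reversed(rewards), reversed(dones)):
        # A terminal step cuts off the bootstrap from later states.
        running = reward + gamma * running * (1.0 - float(done))
        targets.append(running)
    targets.reverse()
    return targets


print(n_step_targets([1.0, 1.0, 1.0], [False, True, False], 2.0, gamma=0.5))
# [1.5, 1.0, 2.0]
```

A common choice is then to use `target - value` as the advantage that weights the log probability, and the squared difference as the value loss. With vectorized environments, the same recursion applies elementwise to tensors with one entry per environment.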

The only remaining method to fill in is ```learn```, where all the training happens. Start with the first missing part, where you need to gather a rollout. Please read the comments under the missing part. Below that, we have the writer object. It is the same writer object as in REINFORCE.

The last part to fill in ```learn``` lies after the rollout loop. At this point, we have a rollout full of transitions. You must calculate ```target_value``` for the nth next state here. Once you have it, create a Rollout object with the list of Transitions and the target value. Note that we use lowercase rollout to denote a list of Transitions and uppercase Rollout to denote the Rollout namedtuple. After creating a Rollout object, you can call the ```accumulate_gradient``` method with it. Update the parameters by calling the ```step``` function of the optimizer and you are done.

#### Box2d

Now we need to create a neural network that represents the policy and the value. Unlike before, we will be using a recurrent layer (a GRU layer). That is why we have additional tensors like ```gru_hx``` in the ```learn``` method. We assume familiarity with recurrent networks.

Start by filling in the ```__init__``` method. You can use separate networks to represent the policy and the value, or a shared feature network with two separate head layers. Next, fill in the ```forward``` method. Remember to return the policy logits (no nonlinearity), the value (no nonlinearity), and the hidden vector for the GRU layer. The Categorical distribution will be created in the ```A2C``` class and not in the ```network```.
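
One possible layout for such a network is sketched below, using a shared feature layer, a GRU cell, and two linear heads. The class name `RecurrentActorCritic`, the layer sizes, and the choice of `nn.GRUCell` over `nn.GRU` are assumptions made for illustration; the homework template may expect different names or shapes.

```python
import torch
import torch.nn as nn


class RecurrentActorCritic(nn.Module):
    """Shared features -> GRU cell -> policy head and value head."""

    def __init__(self, in_size, out_size, hidden=64):
        super().__init__()
        self.feature = nn.Sequential(nn.Linear(in_size, hidden), nn.ReLU())
        self.gru = nn.GRUCell(hidden, hidden)
        self.policy_head = nn.Linear(hidden, out_size)  # logits, no nonlinearity
        self.value_head = nn.Linear(hidden, 1)          # value, no nonlinearity

    def forward(self, state, gru_hx):
        features = self.feature(state)
        gru_hx = self.gru(features, gru_hx)   # new hidden state
        logits = self.policy_head(gru_hx)
        value = self.value_head(gru_hx)
        return logits, value, gru_hx


# 16 parallel environments, 8 state features, 4 actions (placeholder sizes).
net = RecurrentActorCritic(in_size=8, out_size=4)
state = torch.zeros(16, 8)
gru_hx = torch.zeros(16, 64)
logits, value, gru_hx = net(state, gru_hx)
```

The hidden state returned by `forward` is fed back in at the next step, which is why it has to be carried through the rollout loop.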

When you have completed all the implementations mentioned above, you can start experimenting (and debugging) with the CartPole and LunarLander environments.

#### Experiments

We will be running two experiments, each containing 5 runs as mentioned previously.

- Run A2C in CartPole-v1 (Reach at least +400 mean reward for the last 20 episodes)
- Run A2C in LunarLander-v2 (Reach at least +100 mean reward for the last 20 episodes)


By default, the writer logs the mean reward of the last 20 episodes. This can be changed by overriding the ```--log-window``` command-line argument.

Plot these results (2 Plots). Also **keep** the ```.csv``` files.

In [None]:
# Plot Experiment 1

In [None]:
# Plot Experiment 2

#### Pong

Pong is a visual domain, so you may want to use convolutional layers (not mandatory). Other than that, there are only a handful of differences between the Pong network and the box2d network. We are using the same environment wrappers that are designed specifically for the Pong environment. You don't have to use a GPU in this experiment, as it learns very fast provided the implementation is correct.

#### Experiments

We will be running a **single run**.

- Run A2C in Pong (Reach at least +10 mean reward for the last 20 episodes)

By default, the writer logs the mean reward of the last 20 episodes. This can be changed by overriding the ```--log-window``` command-line argument.

Plot the result (1 Plot). Also **keep** the ```.csv``` file.

Note that you need to save the model parameters of the trained agent. It will be tested in the homework evaluation using the ```test``` function given in the ```model.py``` file. If your model file is too large to submit via Ninova, please give a Google Drive link below. Do not forget to keep the link available until the homework is graded, and **please** do not modify the file in the drive after the submission deadline!
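
Saving and restoring parameters in PyTorch is usually done through the state dictionary. The sketch below uses a small linear layer as a stand-in for the trained network and a temporary file with a hypothetical name; how the `test` function expects the parameters to be stored is not shown here, so check it before choosing a format.

```python
import os
import tempfile

import torch

# Hypothetical stand-in for the trained network.
model = torch.nn.Linear(4, 2)

# Hypothetical file name; use whatever path the submission requires.
path = os.path.join(tempfile.gettempdir(), "pong_a2c.pt")
torch.save(model.state_dict(), path)

# Restoring requires a model with the same architecture.
restored = torch.nn.Linear(4, 2)
restored.load_state_dict(torch.load(path))
restored.eval()
```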

In [None]:
# Plot Pong run

Put the Google Drive link if necessary!

In [None]:
# Link