## Policy Gradient

The general organization of the homework is given below.

- pg
    - reinforce
        - model
        - cartpole
    - a2c
        - model
        - vecenv
        - box2d
        - pong
    - ppo
        - model
        - box2d
        - walker
    - common

In this homework, you will implement REINFORCE, A2C, and PPO agents and run them in the CartPole, LunarLander, Pong, and BipedalWalker environments.

### Running

Each experiment is trained from scratch with 3 different seeds (except for Pong) to give a good understanding of the stochasticity involved in training. You can run your experiments with command-line arguments from the Jupyter notebook, as shown below, or from a bash script.


In [None]:
!python pg/a2c/box2d.py --nenv 16 --log-dir log/a2c_lunar


You should obtain scores higher than the following:
- CartPole: 400
- LunarLander: 200
- BipedalWalker: 100
- Pong: 10

The default hyperparameters have not been tuned, but they have been tested and work well. However, hyperparameters are sensitive to implementation details, so you may need to tune them.

If the ```--log-dir``` command-line argument is not given, the log file (named progress.csv) is saved to a temporary location by default. You can load the csv file into a Pandas dataframe object and use it to visualize the training with the given visualization script.

### Submission

The submission should include the log and models folders.

Example log folder:
- log
    - a2c_lunar
        - 05-23-2021-00-01-02
        - 05-23-2021-01-02-02
        - 05-23-2021-02-03-02
    - a2c_pong
        - 05-23-2021-03-01-02
    - ppo_lunar
        - 05-23-2021-04-01-02
        - 05-23-2021-05-02-02
        - 05-23-2021-06-03-02
    - ppo_walker
        - 05-23-2021-07-01-02
        - 05-23-2021-08-02-02
        - 05-23-2021-09-03-02
    - reinforce_cartpole
        - 05-23-2021-10-01-02
        - 05-23-2021-11-02-02
        - 05-23-2021-12-03-02
        
As long as you pass the ```--log-dir``` command-line argument with a value such as ```log/a2c_lunar```, the folder will be filled automatically.

### Plotting

When you are done with the experiments, you can plot the statistics. As long as the number of rows in every log within the given directory matches, ```Plotter``` draws the statistics of the runs. For example, if you have 3 runs for a certain experiment, ```Plotter``` draws a mean curve and a shaded region covering the area between the $\alpha$ and $1-\alpha$ quantiles of the runs.
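To make the shaded region concrete, here is a minimal sketch of how a mean curve and a quantile band across runs could be computed. This is not the `Plotter` implementation; the function name `mean_and_band` and the default `alpha` are assumptions made for illustration only.

```python
import numpy as np


def mean_and_band(runs, alpha=0.25):
    """Return the mean curve and the (alpha, 1 - alpha) quantile band.

    `runs` is a list of equal-length sequences, one per seed.
    """
    data = np.asarray(runs, dtype=float)  # shape: (n_runs, n_rows)
    mean = data.mean(axis=0)
    lower = np.quantile(data, alpha, axis=0)
    upper = np.quantile(data, 1.0 - alpha, axis=0)
    return mean, lower, upper


# Three hypothetical runs with two logged rows each.
mean, lower, upper = mean_and_band(
    [[0.0, 10.0], [1.0, 20.0], [2.0, 30.0]], alpha=0.0
)
```

With `alpha=0.0` the band spans the minimum and maximum of the runs; larger values of `alpha` narrow it. The computation needs every run to have the same number of rows, which is why the logs must match in length.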

Below is an example of plotting code.

In [None]:
import os
import pandas as pd
from pg.visualize import Plotter

log_dir = os.path.join("log", "a2c_lunar")
df_dict = {
    "lunar a2c": [pd.read_csv(os.path.join(log_dir, folder, "progress.csv"))
                for folder in os.listdir(log_dir)]
}

plotter = Plotter(df_dict)
plotter()

## Instructions

### Reinforce

The implementation of REINFORCE is fairly simple. There are two Python scripts, and we begin by filling in the "model.py" script. Since you are familiar with REINFORCE, we can start by implementing the ```learn``` method in the ```Reinforce``` class. It accepts ```args```, ```opt```, and ```env``` as parameters. We loop for as many episodes as given by ```args.n_episodes``` and collect rollouts to update the policy parameters. Note that if an episode does not terminate within ```args.max_episode_len``` steps, we terminate it manually.

We need to obtain a rollout that consists of a list of transitions. Each transition in the rollout needs to include the log probability of the selected (and performed) action and the intermediate reward. We expect the ```policynet``` to return a Categorical distribution over the actions for a given state whenever its forward method is called. At each step of an episode, use the action distribution to sample an action and step the environment with it. Store the log probability of the sampled action and the intermediate reward in a ```Transition``` object. The rollout is the list of ```Transition```s created during the episode.
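The loop described above can be sketched as follows. This is only an illustration under stated assumptions: `ToyEnv`, `uniform_policy`, and `run_episode` are hypothetical stand-ins, the `Transition` fields are guessed, and the environment uses the classic `(state, reward, done, info)` step signature, which may differ from the homework's code or your gym version.

```python
from collections import namedtuple

import torch
from torch.distributions import Categorical

# Hypothetical container; the homework's Transition may have other fields.
Transition = namedtuple("Transition", ["logprob", "reward"])


class ToyEnv:
    """Stand-in environment that gives reward 1 until a fixed horizon."""

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return torch.zeros(4)

    def step(self, action):
        self.t += 1
        done = self.t >= self.horizon
        return torch.zeros(4), 1.0, done, {}


def uniform_policy(state):
    """Stand-in policy: a uniform Categorical over two actions."""
    return Categorical(logits=torch.zeros(2))


def run_episode(env, policy, max_episode_len):
    """Collect one rollout, cutting the episode at max_episode_len steps."""
    rollout = []
    state = env.reset()
    for _ in range(max_episode_len):
        dist = policy(state)      # action distribution for the current state
        action = dist.sample()    # sample, do not take the argmax
        state, reward, done, _ = env.step(action.item())
        rollout.append(Transition(dist.log_prob(action), reward))
        if done:
            break
    return rollout


rollout = run_episode(ToyEnv(horizon=5), uniform_policy, max_episode_len=3)
```

Here the episode is cut after 3 steps even though the toy environment would run for 5, mirroring the manual termination mentioned earlier.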

Once we have the rollout, we can calculate the gradients with respect to the policy parameters. Fill in the missing part in ```accumulate_gradient```. Don't forget to call ```.backward()``` for each log probability to minimize the negative log-likelihood. Note that we do not update the parameters in this method; we only calculate the gradients.
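As a rough sketch of this step (not the reference solution), the gradients can be accumulated by weighting each negative log probability with the discounted return from that step onward. The signature below, the helper `discounted_returns`, and the absence of a baseline are assumptions for illustration; the homework's `accumulate_gradient` may take different arguments.

```python
import torch
from torch.distributions import Categorical


def discounted_returns(rewards, gamma):
    """Compute G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns, running = [], 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def accumulate_gradient(log_probs, rewards, gamma):
    """Accumulate policy gradients into .grad without an optimizer step."""
    for log_prob, ret in zip(log_probs, discounted_returns(rewards, gamma)):
        # Minimizing -log_prob * G_t is gradient ascent on the expected return.
        (-log_prob * ret).backward()


# Toy check: a two-action policy parameterized directly by its logits.
logits = torch.zeros(2, requires_grad=True)
log_probs = [
    Categorical(logits=logits).log_prob(torch.tensor(0)) for _ in range(2)
]
accumulate_gradient(log_probs, rewards=[1.0, 1.0], gamma=0.5)
```

Because each call to `.backward()` adds to the existing `.grad` buffers, the optimizer step (and the zeroing of gradients) can happen outside this function.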

Now, fill in the ```cartpole.py``` script. Fill in the missing parts of the ```PolicyNet``` class so that it returns a categorical distribution for a given state.
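For orientation, a network that returns a categorical distribution might look like the sketch below. The class name `TinyPolicyNet`, the layer sizes, and the constructor arguments are assumptions and need not match the skeleton in `cartpole.py`.

```python
import torch
from torch import nn
from torch.distributions import Categorical


class TinyPolicyNet(nn.Module):
    """Illustrative policy: a small MLP whose output parameterizes a Categorical."""

    def __init__(self, in_size, out_size, hidden_size=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, out_size),  # raw logits, no softmax
        )

    def forward(self, state):
        # Categorical normalizes the logits internally.
        return Categorical(logits=self.net(state))


dist = TinyPolicyNet(in_size=4, out_size=2)(torch.zeros(4))
action = dist.sample()
```

Passing raw logits to `Categorical` avoids applying a softmax by hand and tends to be numerically more stable than passing probabilities.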

After the implementation is completed, you can run the experiments. Don't forget to tune the hyperparameters as they affect the performance of the training.

### Experiments

- Run REINFORCE in CartPole-v1 (Reach at least +400 mean reward for the last 20 episodes)


By default, the writer logs the mean reward of the last 20 episodes. This can be changed by overriding the ```--log-window-length``` command-line argument.

Plot the results. Also, please **keep** the ```progress.csv``` files as shown under the submission section above.

#### used seeds:
- 42
- 5555
- 7

In [38]:
!python pg/reinforce/cartpole.py --seed 7 --log-dir log/reinforce_cartpole

  "Function `env.seed(seed)` is marked as deprecated and will be removed in the future. "
Logging at: log/reinforce_cartpole/05-23-2024-14-29-42
Episode   :     20, Reward    :  22.75
Episode   :     40, Reward    :   29.7
Episode   :     60, Reward    :   30.3
Episode   :     80, Reward    :  42.95
Episode   :    100, Reward    :   47.5
Episode   :    120, Reward    :   53.1
Episode   :    140, Reward    :  55.65
Episode   :    160, Reward    :   64.6
Episode   :    180, Reward    :  72.05
Episode   :    200, Reward    :   89.3
Episode   :    220, Reward    :   96.7
Episode   :    240, Reward    :  186.1
Episode   :    260, Reward    : 173.85
Episode   :    280, Reward    :  119.8
Episode   :    300, Reward    : 159.35
Episode   :    320, Reward    : 325.05
Episode   :    340, Reward    :  215.0
Episode   :    360, Reward    :  103.9
Episode   :    380, Reward    :  170.5
Episode   :    400, Reward    :  396.6
Episode   :    420, Reward    :  174.8
Episode   :    440, Reward    : 149.

In [40]:
import os
import pandas as pd
from pg.visualize import Plotter

# Plot Experiment 1
log_dir = os.path.join("log", "reinforce_cartpole")
df_dict = {
    "reinforce cartpole": [pd.read_csv(os.path.join(log_dir, folder, "progress.csv"))
                for folder in os.listdir(log_dir)]
}
plotter = Plotter(df_dict)
plotter()


### A2C Implementation

A2C is a synchronous version of the popular A3C algorithm. The synchronization is done via a wrapper that runs multiple environments in parallel. We use the wrapper provided in the ```pg/a2c/vecenv.py``` script. Unlike in the REINFORCE algorithm, we do not collect rollouts until termination. Instead, the rollouts in A2C consist of fixed-length sequences of transitions collected in parallel.

Although you do not need to fill in ```pg/a2c/vecenv.py```, please take a look at it to understand how the environments are vectorized.

Before starting, you may want to check the ```learn``` method in the ```pg/a2c/model.py``` script to see the overall structure of the algorithm.

#### Model

Fill in the missing part of the ```forward``` method in the ```pg/a2c/model.py``` script. You may want to check the return value of the network in ```pg/a2c/box2d.py``` to know what to expect when you call ```self.network(.)``` in the ```forward``` method. We want to sample actions and calculate the log probabilities of these actions, as well as the entropy of the action distribution, to use them later in the parameter update.
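A minimal sketch of the sampling step, assuming the network has already produced a batch of policy logits with one row per parallel environment (the helper name `sample_from_logits` and the shapes are illustrative, not taken from the homework code):

```python
import torch
from torch.distributions import Categorical


def sample_from_logits(logits):
    """Sample actions and return their log probabilities and entropies."""
    dist = Categorical(logits=logits)
    action = dist.sample()
    return action, dist.log_prob(action), dist.entropy()


logits = torch.zeros(16, 4)  # 16 parallel environments, 4 discrete actions
action, log_prob, entropy = sample_from_logits(logits)
```

All three outputs have one entry per environment. With uniform logits over 4 actions, every log probability equals log(1/4) and every entropy equals log(4).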

Next, you need to fill in the ```collect_rollout``` method in the ```pg/a2c/model.py``` script. This is the method that collects ```args.n-step``` many transitions in parallel to make a rollout. Note that you need to calculate the value of the last state (the one returned by the ```step``` function at the end of the rollout) so that we can calculate the target value later. By combining the list of transitions and the value of the last state, you can form a ```Rollout``` object. We also return the last state and the last GRU hidden state for future calls.

We continue with the ```calculate_gae``` method in the ```pg/a2c/model.py``` script. You need to read the GAE paper before implementing this part. This method returns a list of advantages and a list of returns (the capital letter G in the book notation). We will use the advantages to calculate the policy loss and the returns to calculate the value loss.
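The backward recursion of GAE can be sketched for a single environment with plain Python floats. The signature below is an assumption (the homework method works on rollout objects and batched tensors), and `dones[t]` is taken to mean that the episode ended after step `t`.

```python
def calculate_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Return (advantages, returns) for one fixed-length rollout.

    delta_t = r_t + gamma * V(s_{t+1}) * (1 - done_t) - V(s_t)
    A_t     = delta_t + gamma * lam * (1 - done_t) * A_{t+1}
    """
    advantages = [0.0] * len(rewards)
    next_value, next_advantage = last_value, 0.0
    for t in reversed(range(len(rewards))):
        not_done = 1.0 - float(dones[t])
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        next_advantage = delta + gamma * lam * not_done * next_advantage
        advantages[t] = next_advantage
        next_value = values[t]
    # The return used as the value target is advantage plus value estimate.
    returns = [adv + val for adv, val in zip(advantages, values)]
    return advantages, returns


advantages, returns = calculate_gae(
    rewards=[1.0, 1.0], values=[0.5, 0.5], dones=[False, False],
    last_value=1.0, gamma=0.5, lam=0.5,
)
```

The `not_done` mask stops both the bootstrap value and the accumulated advantage from leaking across an episode boundary inside the rollout. This is also why the value of the last state is needed: it bootstraps the final step.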

Now that we have a rollout and lists of advantages and returns, we can calculate a loss and update the parameters. Fill in the ```parameter_update``` method in the ```pg/a2c/model.py``` script. You can use the ```rollout_data_loader``` method to obtain flattened tensors.
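One common way to combine the three terms into a single scalar loss is sketched below. The function name `a2c_loss`, the coefficient names and their default values, and the sign convention of the entropy term are assumptions; the homework's arguments and logging conventions may differ.

```python
import torch


def a2c_loss(log_probs, values, entropies, advantages, returns,
             value_coef=0.5, entropy_coef=0.01):
    """Combine policy, value, and entropy terms into one scalar loss."""
    # Advantages are treated as constants for the policy gradient.
    policy_loss = -(log_probs * advantages.detach()).mean()
    # Regress the value estimates toward the returns.
    value_loss = (returns.detach() - values).pow(2).mean()
    # Negative entropy: minimizing it encourages exploration.
    entropy_loss = -entropies.mean()
    return policy_loss + value_coef * value_loss + entropy_coef * entropy_loss


loss = a2c_loss(
    log_probs=torch.tensor([-1.0, -1.0]),
    values=torch.tensor([0.0, 0.0]),
    entropies=torch.tensor([0.5, 0.5]),
    advantages=torch.tensor([1.0, 3.0]),
    returns=torch.tensor([1.0, 1.0]),
)
```

In this toy call the policy term is 2.0, the value term is 1.0, and the entropy term is -0.5, so the weighted sum is 2.495. In a real update the loss would be followed by zeroing the gradients, calling `.backward()`, and stepping the optimizer.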

Finally, fill in the ```evaluate``` method, which runs the policy for multiple episodes to obtain an average score. We use this method to measure the performance of a trained agent. Now that all of the missing parts are filled in, we can observe how they come together in the ```learn``` method in ```pg/a2c/model.py```.

#### Box2d

Now we need to create a neural network that represents the policy and the value functions. Unlike before, we will use a recurrent layer (a GRU layer). That is why we have additional tensors such as ```gru_hx``` in the ```forward``` and ```collect_rollout``` methods. We assume familiarity with recurrent networks for this part.

Start by filling in the ```__init__``` method. You can use separate networks to represent the policy and the value functions, or a shared feature network with two separate head layers. Next, fill in the ```forward``` method. Remember to return the policy logits (no nonlinearity, so that you can use them in the ```forward``` method in the ```pg/a2c/model.py``` script to create a Categorical distribution), the value (no nonlinearity), and the hidden vector for the GRU layer. The Categorical distribution will be created in the ```A2C``` class and not within the ```network```.

COMMON MISTAKE HINT: Notice that there are two (or maybe even more) different modules in PyTorch for GRU, ```torch.nn.GRU``` and ```torch.nn.GRUCell```, but only one of them is suitable as a layer here. Please read the documentation and select the suitable one for our ```pg/a2c/box2d.py```.

#### Pong

Pong is a visual domain, so you may want to use convolutional layers (not mandatory). Other than that, there are only a handful of differences between the Pong network and the box2d network. We use simple environment wrappers that are designed specifically for the Pong environment. You don't have to use a GPU in this experiment, as it trains quite fast provided the implementation is correct.


#### Experiments
When you complete all the implementations mentioned above, you can start experimenting (and debugging) with the LunarLander and Pong environments.

We will run two experiments.

- Run A2C in LunarLander-v2 (Reach at least +200 mean reward for the last 20 episodes)
- Run A2C in Pong (Reach at least 10 mean reward for the last 20 episodes) (a single seed is enough)

Plot these results (2 plots). Also **keep** the ```.csv``` files and the models (you can leave a Google Drive link for the model files if you cannot submit them through Ninova).

**seeds used** :
- 42 
- 555
- 7 

In [17]:
!python pg/a2c/box2d.py --log-dir ./log/a2c_lunar

Logging at: ./log/a2c_lunar/05-26-2024-14-56-06
  "Function `env.seed(seed)` is marked as deprecated and will be removed in the future. "

timestep  : 143920, episodic_reward: -213.50356397252548, value_loss: 17.42850112915039, policy_loss: -1.3027054071426392, entropy_loss: 0.10085280239582062
timestep  : 151920, episodic_reward: -208.2596437493355, value_loss: 17.35693359375, policy_loss: 1.7544673681259155, entropy_loss: 0.10781097412109375
timestep  : 159920, episodic_reward: -190.94532790442688, value_loss: 22.047239303588867, policy_loss: -2.9401001930236816, entropy_loss: 0.1280996948480606
timestep  : 167920, episodic_reward: -188.95272764223984, value_loss: 9.546184539794922, policy_loss: -2.3834116458892822, entropy_loss: 0.09802546352148056
timestep  : 175920, episodic_reward: -216.54743369904386, value_loss: 79.06524658203125, policy_loss: -2.4131383895874023, entropy_loss: 0.11327233165502548
timestep  : 183920, episodic_reward: -263.1541941226026, value_loss: 8.991231918334961, policy_loss: -1.0068137645721436, entropy_loss: 0.12429343909025192
timestep  : 191920, episodic_reward: -250.8871518353306, value_l

timestep  : 567920, episodic_reward: -248.9515150766261, value_loss: 29.959421157836914, policy_loss: 4.2107834815979, entropy_loss: 0.12655961513519287
timestep  : 575920, episodic_reward: -264.2442322777158, value_loss: 47.32240676879883, policy_loss: -5.840904235839844, entropy_loss: 0.12571287155151367
timestep  : 583920, episodic_reward: -234.8587693800624, value_loss: 15.2008056640625, policy_loss: -0.1749885082244873, entropy_loss: 0.11728958040475845
timestep  : 591920, episodic_reward: -242.2100806931079, value_loss: 20.042224884033203, policy_loss: 2.6322691440582275, entropy_loss: 0.12607644498348236
timestep  : 599920, episodic_reward: -233.38183929658436, value_loss: 25.088476181030273, policy_loss: -0.5120387077331543, entropy_loss: 0.13148878514766693
timestep  : 607920, episodic_reward: -250.3619194381108, value_loss: 32.60450744628906, policy_loss: -0.08639488369226456, entropy_loss: 0.13031965494155884
timestep  : 615920, episodic_reward: -246.82148417274857, value_lo

timestep  : 991920, episodic_reward: -45.18301952213519, value_loss: 8.012768745422363, policy_loss: 0.6600214838981628, entropy_loss: 0.10604751110076904
timestep  : 999920, episodic_reward: -53.81960770118278, value_loss: 5.4806389808654785, policy_loss: 1.0389084815979004, entropy_loss: 0.09696029126644135
timestep  : 1007920, episodic_reward: -61.43111897731998, value_loss: 12.758116722106934, policy_loss: 0.3336659371852875, entropy_loss: 0.1030082255601883
timestep  : 1015920, episodic_reward: -53.31594104796448, value_loss: 34.508243560791016, policy_loss: -0.45166879892349243, entropy_loss: 0.10674603283405304
timestep  : 1023920, episodic_reward: -60.64908076324517, value_loss: 10.737945556640625, policy_loss: -0.4682319164276123, entropy_loss: 0.10883714258670807
timestep  : 1031920, episodic_reward: -60.93519495893352, value_loss: 36.78978729248047, policy_loss: -1.9036462306976318, entropy_loss: 0.10691683739423752
timestep  : 1039920, episodic_reward: -58.06263888882274, v

timestep  : 1415920, episodic_reward: -10.013996393406982, value_loss: 1.9937089681625366, policy_loss: 0.32059744000434875, entropy_loss: 0.09521834552288055
timestep  : 1423920, episodic_reward: -22.33511312809521, value_loss: 2.407254934310913, policy_loss: 0.2437247931957245, entropy_loss: 0.10913797467947006
timestep  : 1431920, episodic_reward: -31.674617775810177, value_loss: 61.68798065185547, policy_loss: -1.7854864597320557, entropy_loss: 0.08743705600500107
timestep  : 1439920, episodic_reward: -24.742739956079767, value_loss: 2.92094087600708, policy_loss: 0.13459832966327667, entropy_loss: 0.09686441719532013
timestep  : 1447920, episodic_reward: 8.472940599531771, value_loss: 2.220209836959839, policy_loss: -1.0174839496612549, entropy_loss: 0.09626159816980362
timestep  : 1455920, episodic_reward: 28.810280723108285, value_loss: 137.9493408203125, policy_loss: 0.26693326234817505, entropy_loss: 0.09755031764507294
timestep  : 1463920, episodic_reward: 33.42064591840958, 

timestep  : 1839920, episodic_reward: -0.6364188147875374, value_loss: 2.1987202167510986, policy_loss: -0.6127950549125671, entropy_loss: 0.0915670096874237
timestep  : 1847920, episodic_reward: 6.658532114482855, value_loss: 1.9369741678237915, policy_loss: 0.006868789903819561, entropy_loss: 0.09590213000774384
timestep  : 1855920, episodic_reward: -9.143074680100307, value_loss: 2.0452609062194824, policy_loss: 0.34169381856918335, entropy_loss: 0.09129587560892105
timestep  : 1863920, episodic_reward: -15.733989629567336, value_loss: 1.4338865280151367, policy_loss: -0.6271981000900269, entropy_loss: 0.091999851167202
timestep  : 1871920, episodic_reward: -13.24852899865798, value_loss: 3.595991611480713, policy_loss: 0.057608626782894135, entropy_loss: 0.09809029847383499
timestep  : 1879920, episodic_reward: -12.552042481952487, value_loss: 2.2961792945861816, policy_loss: -0.7353233098983765, entropy_loss: 0.09521651268005371
timestep  : 1887920, episodic_reward: -18.7780847789

timestep  : 2263920, episodic_reward: 196.84349819495162, value_loss: 1.5050843954086304, policy_loss: -0.6442278027534485, entropy_loss: 0.07536616176366806
timestep  : 2271920, episodic_reward: 182.11224985399267, value_loss: 3.0177512168884277, policy_loss: -1.0490694046020508, entropy_loss: 0.06910540908575058
timestep  : 2279920, episodic_reward: 155.36906830503676, value_loss: 1.070870280265808, policy_loss: -0.24897520244121552, entropy_loss: 0.06882990896701813
timestep  : 2287920, episodic_reward: 145.62831780595448, value_loss: 0.5758696794509888, policy_loss: 0.02183401584625244, entropy_loss: 0.06490574777126312
timestep  : 2295920, episodic_reward: 184.58031783535907, value_loss: 8.667265892028809, policy_loss: 0.5502911806106567, entropy_loss: 0.07819308340549469
timestep  : 2303920, episodic_reward: 196.54585363739835, value_loss: 0.9148141741752625, policy_loss: -0.24581082165241241, entropy_loss: 0.05054480582475662
timestep  : 2311920, episodic_reward: 209.25402728990

timestep  : 2687920, episodic_reward: 139.00937774899774, value_loss: 1.5535619258880615, policy_loss: -0.45856133103370667, entropy_loss: 0.09612768143415451
timestep  : 2695920, episodic_reward: 111.4570003338275, value_loss: 1.1178534030914307, policy_loss: 0.3182411193847656, entropy_loss: 0.0854605957865715
timestep  : 2703920, episodic_reward: 65.48647327093107, value_loss: 5.663890838623047, policy_loss: 1.145372986793518, entropy_loss: 0.06868723034858704
timestep  : 2711920, episodic_reward: 53.487660186008306, value_loss: 1.5602896213531494, policy_loss: -0.7829585075378418, entropy_loss: 0.08939126878976822
timestep  : 2719920, episodic_reward: 55.629907245347255, value_loss: 2.389347553253174, policy_loss: -0.3297775685787201, entropy_loss: 0.10122491419315338
timestep  : 2727920, episodic_reward: 33.54442711015321, value_loss: 1.4381191730499268, policy_loss: 0.17193672060966492, entropy_loss: 0.10267817229032516
timestep  : 2735920, episodic_reward: 2.8472216294157864, va

In [18]:
# Plot Experiment 1
import os
import pandas as pd
from pg.visualize import Plotter

log_dir = os.path.join("log", "a2c_lunar")
df_dict = {
    "a2c lunar": [pd.read_csv(os.path.join(log_dir, folder, "progress.csv"))
                for folder in os.listdir(log_dir)]
}
plotter = Plotter(df_dict)
plotter()


In [48]:
!python pg/a2c/pong.py --log-dir ./log/a2c_pong

A.L.E: Arcade Learning Environment (version 0.7.5+db37282)
[Powered by Stella]
cuda
Logging at: ./log/a2c_pong/05-26-2024-16-30-08
  agent.learn()
timestep  :  15840, episodic_reward:  -21.0, value_loss: 0.025330621749162674, policy_loss: 0.05314486473798752, entropy_loss: 0.006906167604029179
timestep  :  31840, episodic_reward: -20.68, value_loss: 0.01362847350537777, policy_loss: -0.0019184043630957603, entropy_loss: 0.006870466284453869
timestep  :  47840, episodic_reward: -20.32, value_loss: 0.1817772090435028, policy_loss: 0.007553027011454105, entropy_loss: 0.005268241744488478
timestep  :  63840, episodic_reward: -20.88, value_loss: 0.034315403550863266, policy_loss: -0.04041040688753128, entropy_loss: 0.006691263988614082
timestep  :  79840, episodic_reward:  -20.8, value_loss: 0.022689130157232285, policy_loss: 0.005036736838519573, entropy_loss: 0.006779203191399574
timestep  :  95840, episodic_reward:  -20.6, value_loss: 0.015189980156719685, policy_loss: -0.057635866105556

KeyboardInterrupt: 

In [None]:
# Plot Experiment 2

In [None]:
# Evaluate Pong 5 times
import torch

from pg.a2c.pong import make_env, GruNet
from pg.a2c.model import A2C


model_data = torch.load("models/pong.b")
env = make_env()
in_size = env.observation_space.shape[0]
out_size = env.action_space.n

network = GruNet(in_size, out_size, model_data["args"]["hidden_size"])
model = A2C(network, None, None, None)
model.load_state_dict(model_data["state_dict"])
model.evaluate(make_env, n_episodes=5)


### PPO Implementation

PPO is very similar to A2C in terms of implementation steps. Most of the structure is the same, and you can follow the same order as you did in the A2C part. In the PPO experiments, we will not use a recurrent architecture, as that makes the implementation more challenging. There are two experiments with PPO: LunarLander and BipedalWalker. Note that BipedalWalker is a continuous action space environment.

#### Differences

- The ```parameter_update``` method updates the parameters multiple times, once per mini-batch.
- Unlike in A2C, we have a ```forward_given_actions``` method that is used within the ```parameter_update``` method to calculate the log probabilities, values, and entropies over multiple passes. Since the parameters are updated after every mini-batch, we cannot reuse the same log probabilities and need to recalculate them.
- The ```rollout_data_loader``` method needs to yield mini-batches of rollout data as opposed to the full rollout data.
- You need to fill in the ```linear_annealing``` method, which is used for scheduling the ```clip_range``` parameter.
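Two of the differences listed above, the annealed clip range and the recalculated log probabilities, meet in the clipped surrogate objective. The sketch below is one possible shape for them: the signatures of `linear_annealing` and `clipped_policy_loss` are assumptions and will likely differ from the method skeletons in the homework.

```python
import torch


def linear_annealing(initial_value, final_value, progress):
    """Interpolate linearly; `progress` runs from 0.0 (start) to 1.0 (end)."""
    progress = min(max(progress, 0.0), 1.0)
    return initial_value + (final_value - initial_value) * progress


def clipped_policy_loss(new_log_probs, old_log_probs, advantages, clip_range):
    """PPO clipped surrogate loss (to be minimized)."""
    # Probability ratio between the current policy and the rollout policy.
    ratio = torch.exp(new_log_probs - old_log_probs.detach())
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_range, 1.0 + clip_range) * advantages
    # Take the pessimistic (smaller) objective, then negate to get a loss.
    return -torch.min(unclipped, clipped).mean()


clip_range = linear_annealing(0.2, 0.0, progress=0.0)
loss = clipped_policy_loss(
    new_log_probs=torch.log(torch.tensor([2.0, 0.5])),
    old_log_probs=torch.zeros(2),
    advantages=torch.ones(2),
    clip_range=clip_range,
)
```

`old_log_probs` are the values stored during the rollout, while `new_log_probs` must be recomputed with the current parameters for every mini-batch, which is the role of `forward_given_actions`. In the toy call, the ratio of 2.0 is clipped to 1.2 and the ratio of 0.5 is left as is, giving a loss of -0.85.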

#### Box2d

Similar to A2C.

#### BipedalWalker

This is a continuous action space environment. Use a normal distribution to represent the action distribution.
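A minimal sketch of sampling from a diagonal Gaussian policy is given below. The helper name `gaussian_action` and the use of a state-independent log standard deviation are assumptions for illustration; the network could equally output the standard deviation per state.

```python
import torch
from torch.distributions import Normal


def gaussian_action(mean, log_std):
    """Sample continuous actions and return log probabilities and entropies."""
    dist = Normal(mean, log_std.exp())
    action = dist.sample()
    # Action dimensions are independent, so per-dimension terms are summed.
    log_prob = dist.log_prob(action).sum(dim=-1)
    entropy = dist.entropy().sum(dim=-1)
    return action, log_prob, entropy


mean = torch.zeros(8, 4)   # 8 parallel environments, 4 action dimensions
log_std = torch.zeros(4)   # standard deviation of 1 for every dimension
action, log_prob, entropy = gaussian_action(mean, log_std)
```

Summing over the last dimension yields one log probability per environment, which matches the shape used in the discrete case. Sampled actions may fall outside the bounds of the environment's action space, so clipping them before calling `step` is a common precaution.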

#### Experiments

We will run two experiments each containing 3 runs as mentioned previously.

- Run PPO in LunarLander-v2 (Reach at least +200 mean reward for the last 20 episodes)
- Run PPO in BipedalWalker-v2 (Reach at least +100 mean reward for the last 20 episodes)

> Notice: Recent gym versions may require BipedalWalker-v3 or BipedalWalker-v4 instead of BipedalWalker-v2. Please change it accordingly if necessary. 

Plot these results (2 Plots). Also **keep** the ```.csv``` files and the models (you can leave a Google Drive link for the model files).

In [None]:
!python pg/ppo/box2d.py --log-dir ./log/ppo_lunar

In [None]:
# Plot Experiment 1

In [None]:
!python pg/ppo/walker.py --log-dir ./log/ppo_walker

In [None]:
# Plot Experiment 2

In [None]:
# Evaluate Walker agent

### Comparison

Now that you have completed the implementations, you can compare the training performance of A2C and PPO.


In [None]:
import os
import pandas as pd
from pg.visualize import Plotter

a2c_log_dir = os.path.join("log", "a2c_lunar")
ppo_log_dir = os.path.join("log", "ppo_lunar")
df_dict = {
    "a2c Lunarlander": [pd.read_csv(os.path.join(a2c_log_dir, folder, "progress.csv"))
                for folder in os.listdir(a2c_log_dir)],
    "ppo Lunarlander": [pd.read_csv(os.path.join(ppo_log_dir, folder, "progress.csv"))
                for folder in os.listdir(ppo_log_dir)],
}

plotter = Plotter(df_dict)
plotter()

#### Your comments (Bonus + 5)

> Explain the score comparison you observe on the LunarLander environment.

> Explain the advantages and disadvantages of the methods you implemented in this homework.