# HW2 Policy Gradient


## Experiment 1: CartPole with PG
Example sim result:

| iter 0 | iter 90 |
| :---: | :---: |
| <img src="cartpole_iter0.gif" width="200"/> | <img src="cartpole.gif" width="270"/> |

**Learning Curves (Avg Return):**

| Small Batch (1000) | Large Batch (5000) |
| :---: | :---: |
| <img src="sb_cartpole.png" width="600"/> | <img src="lb_cartpole.png" width="600"/> |

* The reward-to-go estimator outperformed the trajectory-centric estimator in both cases. This makes sense, as it exploits causality (later actions cannot affect earlier rewards).
* Advantage standardization also helps (mostly for the small batch), as it acts as a baseline and reduces the variance of the Monte Carlo returns.
* Larger batch size improved performance, as it makes the Monte Carlo estimates of reward-to-go more accurate. Because the estimator is unbiased, it converges to the true expected reward-to-go as batch size goes to infinity.
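
The difference between the two estimators and the standardization step can be sketched in a few lines of plain Python. This is a minimal illustration, not the homework's implementation: the function names and the toy reward sequence are assumptions made up for the example.

```python
def trajectory_returns(rewards, gamma):
    """Trajectory-centric estimator: the full discounted return, repeated at every timestep."""
    total = sum(gamma ** t * r for t, r in enumerate(rewards))
    return [total] * len(rewards)


def reward_to_go(rewards, gamma):
    """Reward-to-go estimator: discounted sum of rewards from each timestep onward."""
    out = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out


def standardize(values, eps=1e-8):
    """Shift to zero mean and scale to (approximately) unit standard deviation."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / (var ** 0.5 + eps) for v in values]


rewards = [1.0, 1.0, 1.0]  # toy rollout, made up for illustration
full = trajectory_returns(rewards, gamma=0.5)
rtg = reward_to_go(rewards, gamma=0.5)
adv = standardize(rtg)
```

With reward-to-go, later timesteps are credited only with the rewards that follow them, whereas the trajectory-centric estimator assigns the same number to every step of the rollout.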

## Experiment 2: Inverted Pendulum

Able to achieve the full 1000-frame episode length with `batch = 100` and `learning rate = 0.01`

<img src="invertedpendulum_returns.png" width="800"/>

command:

```bash
python cs285/scripts/run_hw2.py -ngpu --video_log_freq 10 --env_name InvertedPendulum-v2 --ep_len 1000 --discount 0.9 -n 100 -l 2 -s 64 -b 200 -lr 0.01 -rtg --exp_name q2_b200_r0.01
```

## Experiment 3: Lunar Lander - NN Baseline
Example sim eval result. For each iteration:
* 40,000 training samples are collected with the current policy.
* This corresponds to a minimum of 40 rollouts (episode length is capped at 1000, so more rollouts are needed if e.g. the lander crashes early).
* Reward-to-go Q-values (via discounted cumsum) are assigned to trajectories.
* A single gradient step (learning rate 0.005) is taken in the direction of the approximated RL objective (a function of the policy).
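
The rollout count above follows from collecting episodes until the sample budget is met. Here is a minimal sketch of that loop; `collect_batch` and the random episode-length sampler are hypothetical stand-ins for the real environment, not the homework's code.

```python
import math
import random


def collect_batch(batch_size, max_ep_len, sample_episode_length):
    """Roll out episodes until at least batch_size env steps have been gathered."""
    lengths = []
    steps = 0
    while steps < batch_size:
        ep_len = min(sample_episode_length(), max_ep_len)  # episodes are capped
        lengths.append(ep_len)
        steps += ep_len
    return lengths


batch_size, max_ep_len = 40_000, 1000
# Lower bound: reached only if every episode runs to the cap.
min_rollouts = math.ceil(batch_size / max_ep_len)

rng = random.Random(0)
# Assumed toy dynamics: episodes end anywhere between 100 and 1000 steps.
lengths = collect_batch(batch_size, max_ep_len, lambda: rng.randint(100, 1000))
```

Early crashes shorten episodes, so `len(lengths)` is at least `min_rollouts` and usually larger.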


| iter 0 | iter 90 |
| :---: | :---: |
| <img src="lunarlander_iter0.gif" width="400"/> | <img src="lunarlander.gif" width="400"/> |

**Lunar Lander Avg Return:**
Blue shows the return with a state-dependent neural-net baseline (value function); red is without a baseline. The baseline improves performance, as it reduces variance.
<img src="lunarlandar_return.png" width="800"/>
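
As a side note on why a baseline can reduce variance without changing the expected gradient, here is the standard argument for a discrete action space. It is a textbook derivation, not something measured in this experiment, and the notation $\pi_\theta$, $b(s)$ is assumed for the sketch:

$$
\mathbb{E}_{a \sim \pi_\theta(\cdot|s)}\left[\nabla_\theta \log \pi_\theta(a|s)\, b(s)\right]
= b(s) \sum_a \pi_\theta(a|s)\, \nabla_\theta \log \pi_\theta(a|s)
= b(s)\, \nabla_\theta \sum_a \pi_\theta(a|s)
= b(s)\, \nabla_\theta 1
= 0
$$

Since the subtracted term has zero mean for any $b(s)$ that does not depend on the action, the baseline only affects the variance of the estimate.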


## Experiment 4: Half Cheetah
I was lazy and didn't search over batch sizes and learning rates (just chose `lr=0.02` `batch=50000`). Results below show:
* Reward-to-go (blue) has the biggest effect.
* Using a value-function baseline shows a slight improvement with reward-to-go but no distinguishable difference otherwise.
<img src="halfCheetah_returns.png" width="800"/>

Interestingly, the reward-to-go run that received the higher return found an undesirable solution where the cheetah flips onto its back and then flails around to make progress:

| traj based | reward-to-go |
| :---: | :---: |
| <img src="halfCheetah.gif" width="300"/> | <img src="halfCheetah_rtg.gif" width="300"/> |

## Experiment 5: Hopper - GAE
Avg returns shown below. Results suggest:
* $\lambda=0.0$ (one-step value-function backup) performed worst by far, possibly due to the high bias it adds.
* $\lambda=0.95$ and $\lambda=0.99$ performed best (balancing bias against variance).
* $\lambda=1.0$ performed nearly as well, and actually had the best returns by the end.
* The videos below do show the best looking policy to be $\lambda=0.95$, where the hopper performs multiple hops, vs $\lambda=1.0$, where it only performs a single long-jump hop.

<img src="hopper_returns.png" width="800"/>


| $\lambda=0.95$ | $\lambda=1.0$ |
| :---: | :---: |
| <img src="hopper_lambdap95.gif" width="400"/> | <img src="hopper_lambda1.gif" width="400"/> |
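
The role of $\lambda$ in the results above can be sketched with a small GAE routine. This is a minimal illustration assuming a single rollout that ends in a terminal state; the function name and the toy rewards and baseline values are made up for the example.

```python
def gae_advantages(rewards, values, gamma, lam):
    """Generalized Advantage Estimation for one rollout ending in a terminal state.

    values[t] is the baseline's estimate V(s_t); the value after the last step is taken as 0.
    """
    T = len(rewards)
    advantages = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # one-step TD error
        running = delta + gamma * lam * running  # exponentially weighted sum of TD errors
        advantages[t] = running
    return advantages


rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5]  # made-up baseline predictions
adv_td = gae_advantages(rewards, values, gamma=1.0, lam=0.0)  # one-step backup
adv_mc = gae_advantages(rewards, values, gamma=1.0, lam=1.0)  # Monte Carlo return minus baseline
```

With `lam=0.0` each advantage relies entirely on the value estimate of the next state (low variance, biased if the value function is poor); with `lam=1.0` it reduces to reward-to-go minus the baseline (unbiased, higher variance).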

### PG Pseudo-code

Example from Q5

---
**for** n_iter (e.g. 300)
<blockquote>
    collect_training_trajectories with current policy
    <blockquote>
        - with batch_size (2k) env samples with ep_len (1k) samples per rollout (i.e. 2 rollouts) <br>
        - sampling actions from $\pi(a|o)$ distribution
    </blockquote>    
    for num_agent_train_steps_per_iter (e.g. 1):   
    <blockquote>
        - sample train_batch_size (2k) most recent samples (need to be collected from curr policy due to on-policy)<br>
        - compute Monte-Carlo return estimator Q-vals<br>
        if reward-to go:
        <blockquote>
            rtg decreasing over rollout: $Q_{it}=\sum_{t'=t}^{T-1} \gamma^{t'-t}\, r(s_{it'},a_{it'})$
        </blockquote>
        else:
        <blockquote>
            return est is const over rollout: $Q_{it}=\sum_{t'=0}^{T-1} \gamma^{t'}\, r(s_{it'},a_{it'})$
        </blockquote>
        - Estimate advantage from return_est minus baseline $(Q_t - b_t)$. E.g. NN value fn estimator $b_t=V_{NN}(obs_t)$.<br>
        - Perform one grad step in policy with $\sum_{t=0}^{T-1} \nabla \left[\log \pi(a_t|o_t) \cdot (Q_t - b_t)\right]$<br>
        So `loss=-Sum(log_prob_of_action * advantages)` where gradient is not propagated through baseline NN used for advantage estimation<br>
        - Perform one grad step in NN baseline (value fn estimator, normalized)
    </blockquote>
</blockquote>

**end for**

---
where `MLPPolicyPG` has 2 hidden layers with `tanh` activation and 32 nodes each. With outputs:
- discrete: categorical distribution, e.g. `[0.2, 0.7, 0.1]` would represent the probabilities of actions `[a0, a1, a2]`