# Discrete CartPole-v0
```
python train_pg.py CartPole-v0 -n 100 -b 1000 -e 5 -dna --exp_name sb_no_rtg_dna
python train_pg.py CartPole-v0 -n 100 -b 1000 -e 5 -rtg -dna --exp_name sb_rtg_dna
python train_pg.py CartPole-v0 -n 100 -b 1000 -e 5 -rtg --exp_name sb_rtg_na
python train_pg.py CartPole-v0 -n 100 -b 5000 -e 5 -dna --exp_name lb_no_rtg_dna
python train_pg.py CartPole-v0 -n 100 -b 5000 -e 5 -rtg -dna --exp_name lb_rtg_dna
python train_pg.py CartPole-v0 -n 100 -b 5000 -e 5 -rtg --exp_name lb_rtg_na
```

### Small batch experiments
<img src="Figure_Cartpole-v0.png">

### Large batch experiments
<img src="Figure_lb_Cartpole-v0.png">


## Observations
- *reward_to_go=True* ("rtg") gives clearly better results than the trajectory-centric estimator ("no_rtg").
- Adding advantage centering, i.e. *normalize_advantages=True* ("rtg_na"), seems to help only marginally.
- A larger batch size (5000 instead of 1000) helps the algorithm converge faster, especially with the trajectory-centric PG estimator.
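
The difference between the two estimators, and what advantage centering does, can be sketched in NumPy. This is a minimal illustration for a single trajectory, not the code in `train_pg.py`; the function names `q_values` and `normalize` and the `eps` constant are assumptions made for the example.

```python
import numpy as np

def q_values(rewards, gamma, reward_to_go):
    """Per-timestep weights for the policy-gradient estimate of one trajectory."""
    rewards = np.asarray(rewards, dtype=float)
    n = len(rewards)
    if not reward_to_go:
        # Trajectory-centric ("no_rtg"): every timestep is weighted by the
        # full discounted return of the trajectory.
        total = np.sum(gamma ** np.arange(n) * rewards)
        return np.full(n, total)
    # Reward-to-go ("rtg"): timestep t only sees rewards from t onward.
    q = np.zeros(n)
    running = 0.0
    for t in reversed(range(n)):
        running = rewards[t] + gamma * running
        q[t] = running
    return q

def normalize(adv, eps=1e-8):
    """Advantage centering: rescale to zero mean and unit standard deviation."""
    adv = np.asarray(adv, dtype=float)
    return (adv - adv.mean()) / (adv.std() + eps)

rewards = [1.0, 1.0, 1.0]
print(q_values(rewards, 1.0, reward_to_go=False))  # same weight at every step
print(q_values(rewards, 1.0, reward_to_go=True))   # weight shrinks toward the end
print(normalize(q_values(rewards, 1.0, reward_to_go=True)))
```

With reward-to-go, an action is no longer credited with rewards collected before it was taken, which reduces the variance of the estimate.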

# InvertedPendulum-v1 continuous control
Ran with the Roboschool environment.

### Small batch
```
python train_pg.py RoboschoolInvertedPendulum-v1 -n 100 -b 1000 -e 5 -rtg --exp_name sb_rtg_na
python train_pg.py RoboschoolInvertedPendulum-v1 -n 100 -b 1000 -e 5 -rtg -dna --exp_name sb_rtg_dna
python train_pg.py RoboschoolInvertedPendulum-v1 -n 100 -b 1000 -e 5 --exp_name sb_no_rtg_na
python train_pg.py RoboschoolInvertedPendulum-v1 -n 100 -b 1000 -e 5 -dna --exp_name sb_no_rtg_dna
```
<img src="Figure_RoboInvertedPendulum-v1.png">

### Large batch
```
python train_pg.py RoboschoolInvertedPendulum-v1 -n 100 -b 5000 -e 5 -rtg --exp_name lb_rtg_na
python train_pg.py RoboschoolInvertedPendulum-v1 -n 100 -b 5000 -e 5 -rtg -dna --exp_name lb_rtg_dna
```
<img src="Figure_lb_RoboInvertedPendulum-v1.png">


## Observations
- *reward_to_go=True* ("rtg") gives better results than the trajectory-centric estimator ("no_rtg"), but only when advantage centering is turned off, i.e. *normalize_advantages=False* ("rtg_dna").
- The rewards drop after a certain number of iterations. It seems we are overfitting past about 60 iterations.
- The large batch improves both the convergence rate and stability.

These runs used the default parameters. It might be worth experimenting with the learning rate and batch size.
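
One way to set up such an experiment is to generate the command lines for a small grid. The sketch below reuses only the flags that appear in the commands above. The helper name `sweep_commands`, the experiment-name scheme, and the batch sizes are illustrative assumptions. The learning-rate flag of `train_pg.py` is not shown in this write-up, so the caller has to supply it as `lr_flag`.

```python
from itertools import product

def sweep_commands(env, batch_sizes, learning_rates=None, lr_flag=None):
    """Build train_pg.py command lines for a grid of batch sizes and learning rates."""
    if learning_rates and lr_flag is None:
        # The learning-rate flag name is not documented here, so never guess it.
        raise ValueError("pass lr_flag to sweep over learning rates")
    commands = []
    for b, lr in product(batch_sizes, learning_rates or [None]):
        name = f"b{b}" if lr is None else f"b{b}_lr{lr}"
        # -rtg -dna was the best-performing setting in the runs above.
        cmd = f"python train_pg.py {env} -n 100 -b {b} -e 5 -rtg -dna --exp_name {name}"
        if lr is not None:
            cmd += f" {lr_flag} {lr}"
        commands.append(cmd)
    return commands

for cmd in sweep_commands("RoboschoolInvertedPendulum-v1", [1000, 2500, 5000]):
    print(cmd)
```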


# Neural Network Baseline
Policy gradient with a learned, state-dependent baseline.

The baseline is fitted to a TD(0) target, i.e. `target_baseline[i] = reward[i] + gamma * prev_baseline[i+1]`.
For more context, see the [last equation on slide 21 of the Fall '17 lecture](http://rll.berkeley.edu/deeprlcourse/f17docs/lecture_5_actor_critic_pdf.pdf).
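
The target computation can be sketched for a single trajectory as follows. This is a minimal illustration, not the code used for the runs; the function name `td0_targets` is hypothetical, and the sketch assumes the value after the final step is zero (the episode ends there), which the formula above does not specify.

```python
import numpy as np

def td0_targets(rewards, prev_baseline, gamma):
    """TD(0) regression targets: reward[i] + gamma * prev_baseline[i+1]."""
    rewards = np.asarray(rewards, dtype=float)
    prev_baseline = np.asarray(prev_baseline, dtype=float)
    # Baseline prediction for the next state at each step.
    # Assumption: no bootstrapping past the last step, so its next value is 0.
    next_values = np.append(prev_baseline[1:], 0.0)
    return rewards + gamma * next_values

rewards = [1.0, 1.0, 1.0]
prev_baseline = [2.0, 1.5, 1.0]  # predictions from the previous baseline fit
print(td0_targets(rewards, prev_baseline, gamma=0.9))
```

The baseline network is then regressed onto these targets, and its predictions are subtracted from the reward-to-go values to form the advantages.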

Here's a plot comparing the outcomes for Inverted Pendulum with and without the baseline.
<img src="Figure_RoboInvertedPendulum-v1-baseline1.png">

## Observations
- The baseline does not seem to help, at least with the current parameter settings.
- That said, the rewards drop after a certain number of iterations. Past about 80 iterations the baseline run performs better, although in absolute terms both are poor.

These runs used the default parameters. It might be worth experimenting with the learning rate and batch size.


# Half Cheetah
Ran with the Roboschool environment.
```
python train_pg.py RoboschoolHalfCheetah-v1 -ep 150 --discount 0.9 -n 100 -b 1000 -e 5 -rtg --exp_name sb_rtg_na
python train_pg.py RoboschoolHalfCheetah-v1 -ep 150 --discount 0.9 -n 100 -b 1000 -e 5 -rtg -bl --exp_name sb_rtg_na_bl
python train_pg.py RoboschoolHalfCheetah-v1 -ep 150 --discount 0.9 -n 100 -b 1000 -e 5 -rtg --n_layers 2 -bl --exp_name sb_rtg_na_bl_2L
python train_pg.py RoboschoolHalfCheetah-v1 -ep 150 --discount 0.9 -n 100 -b 5000 -e 5 -rtg --n_layers 2 -bl --exp_name lb_rtg_na_bl_2L
python train_pg.py RoboschoolHalfCheetah-v1 -ep 150 --discount 0.9 -n 100 -b 1000 -e 5 -rtg -dna --exp_name sb_rtg_dna
```
<img src="Figure_RoboHalfCheetah-v1.png">

## Observations
- The rtg, baseline, and 2-layer baseline runs all give about the same performance (average return ~20).
- Increasing the batch size to 5000 helps marginally (average return ~30).