My implementation of PPO (Proximal Policy Optimization), based on OpenAI's Baselines implementation.
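For reference, the core of PPO is the clipped surrogate objective from Schulman et al. (2017). Below is a minimal TF 1.x sketch of that loss; the function and variable names are illustrative, not the exact code in ppo.py.

```python
import tensorflow as tf  # TF 1.x, matching the repo's use of TF flags

def clipped_surrogate(log_prob, old_log_prob, advantage, clip_eps=0.2):
    """Sketch of PPO's clipped surrogate loss (to be minimized)."""
    ratio = tf.exp(log_prob - old_log_prob)  # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantage
    clipped = tf.clip_by_value(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    # Pessimistic bound: take the smaller of the two surrogate terms
    return -tf.reduce_mean(tf.minimum(unclipped, clipped))
```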
Gym wrappers
- GymWrap: Standard wrapper for continuous-control environments such as RoboSchool.
- StackObs: Stacks consecutive observations in partially observable environments so that quantities such as velocity and acceleration can be inferred (see the sketch after this list).
- ResetEarly: Can be combined with GRU or LSTM models in partially observable environments; note, however, that this makes training very slow.
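As a rough illustration of what an observation-stacking wrapper like StackObs might look like, here is a sketch assuming vector observations and the classic gym API; the class and attribute names are assumptions, not the repo's actual implementation.

```python
import numpy as np
import gym

class StackObsSketch(gym.Wrapper):
    """Illustrative sketch: keep the last n observations and concatenate them."""

    def __init__(self, env, n=4):
        super().__init__(env)
        self.n = n
        space = env.observation_space
        # Repeat the bounds n times to match the concatenated observation
        self.observation_space = gym.spaces.Box(
            low=np.tile(space.low, n), high=np.tile(space.high, n),
            dtype=space.dtype)
        self.frames = None

    def reset(self, **kwargs):
        obs = self.env.reset(**kwargs)
        self.frames = [obs] * self.n          # pad the buffer with the first frame
        return np.concatenate(self.frames, axis=-1)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.frames = self.frames[1:] + [obs]  # drop oldest, append newest
        return np.concatenate(self.frames, axis=-1), reward, done, info
```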
Models
- MLP: Standard fully connected network with 2 hidden layers, where the actor and critic networks are independent (a minimal sketch follows this list).
- CNN: Shared CNN block followed by separate output layers for the actor and critic.
- LSTM and GRU: Recurrent models that can handle partially observable environments without the help of stacked frames. (Note: these should only be used when the number of episode steps is smaller than the minibatch size.)
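As a hedged illustration of the independent two-hidden-layer actor-critic idea in TF 1.x; the layer sizes, scope names, and Gaussian policy head below are assumptions, not necessarily the exact architecture in ppo.py.

```python
import tensorflow as tf  # TF 1.x

def mlp_actor_critic(obs, act_dim, hidden=64):
    """Sketch: independent 2-hidden-layer actor and critic on a [batch, obs_dim] input."""
    with tf.variable_scope("actor"):
        h = tf.layers.dense(obs, hidden, tf.nn.tanh)
        h = tf.layers.dense(h, hidden, tf.nn.tanh)
        mean = tf.layers.dense(h, act_dim)                  # Gaussian policy mean
        log_std = tf.get_variable("log_std", [act_dim],
                                  initializer=tf.zeros_initializer())
    with tf.variable_scope("critic"):                       # no shared weights
        h = tf.layers.dense(obs, hidden, tf.nn.tanh)
        h = tf.layers.dense(h, hidden, tf.nn.tanh)
        value = tf.layers.dense(h, 1)                       # state value V(s)
    return mean, log_std, value
```

Keeping the actor and critic independent avoids tuning a value-loss coefficient against the policy loss, at the cost of roughly doubling the parameter count.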
The models can be edited in ppo.py.
main_atari.py will run Breakout by default.
main_roboschool.py will run InvertedPendulum by default.
The environments and other hyperparameters can be changed via the TensorFlow flags.
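For illustration, this is how TF 1.x flags are typically defined and then overridden from the command line; the flag names below (env, lr) are assumptions, and the actual names defined in this repo may differ.

```python
import tensorflow as tf  # TF 1.x

flags = tf.app.flags
flags.DEFINE_string("env", "RoboschoolInvertedPendulum-v1", "Gym environment id")
flags.DEFINE_float("lr", 3e-4, "Adam learning rate")
FLAGS = flags.FLAGS

# Hypothetical command-line override (flag names are assumptions):
#   python main_roboschool.py --env=RoboschoolHopper-v1 --lr=1e-4
```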