cartpole ++

cartpole++ is a non-trivial 3d version of cartpole, simulated using bullet physics, where the pole isn't connected to the cart.

[cartpole.gif]

this repo contains a gym env for this cartpole as well as implementations for training with LRPG, DDPG & NAF.

we also train a deep q network from keras-rl as an externally implemented baseline.

for more info see the blog post. for experiments more related to potential transfer between sim and real see drivebot. for next-gen experiments in continuous control from pixels in minecraft see malmomo.

general environment

episodes are initialised with the pole standing upright and receiving a small push in a random direction

episodes are terminated when either the pole is further than a set angle from vertical or 200 steps have passed

there are two state representations available; a low dimensional one based on the cart & pole pose and a high dimensional one based on raw pixels (see below)

there are two options for controlling the cart; a discrete and continuous method (see below)

reward is simply 1.0 for each step in the episode
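
as a concrete picture of an episode, here's a minimal random-rollout loop against the standard gym API; make_env() is a hypothetical placeholder for however you construct the env from bullet_cartpole.py, not a function in this repo.

# minimal episode sketch; make_env() is a hypothetical stand-in for constructing
# the gym env defined in bullet_cartpole.py
env = make_env()
observation = env.reset()          # pole upright, small random push applied
done, total_reward = False, 0.0
while not done:                    # ends when the pole falls too far or 200 steps pass
    action = env.action_space.sample()
    observation, reward, done, _ = env.step(action)
    total_reward += reward         # 1.0 per step
print(total_reward)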

states

in both low and high dimensional representations we use the idea of action repeats; per env.step we apply the chosen action N times, take a state snapshot, and repeat this R times. the deltas between these snapshots provide enough information to infer velocity (or acceleration, or jerk) if the learning algorithm finds that useful.
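
the repeat logic inside a single env.step looks roughly like this; apply_action() and snapshot() are hypothetical helpers standing in for the bullet physics stepping and the pose/pixel capture.

def step_with_repeats(action, repeats, steps_per_repeat):
    # R repeats of N physics steps each, with one state snapshot per repeat
    snapshots = []
    for _ in range(repeats):
        for _ in range(steps_per_repeat):
            apply_action(action)        # hypothetical: apply force & tick the simulation
        snapshots.append(snapshot())    # hypothetical: record poses or a rendered frame
    # deltas between consecutive snapshots carry the velocity information
    return snapshots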

observation state in the low dimensional case is constructed from the poses of the cart & pole

  • it's shaped (R, 2, 7)
  • axis 0 represents the R repeats
  • axis 1 represents the object; 0=cart, 1=pole
  • axis 2 is the 7d pose; 3d position + 4d quaternion orientation
  • this representation is usually just flattened to (R*14,) when used (see the numpy sketch below)
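
for example, with R=3 repeats the flattening looks like this (a numpy sketch; the pose values are made up)

import numpy as np

R = 3                                                # action repeats
state = np.zeros((R, 2, 7))                          # (repeats, {cart, pole}, 7d pose)
state[0, 0] = [0.1, 0.0, 0.08, 0.0, 0.0, 0.0, 1.0]   # cart pose: 3d position + quaternion
flattened = state.reshape(-1)                        # -> shape (R * 14,)
print(flattened.shape)                               # (42,)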

the high dimensional state is a rendering of the scene

  • it's shaped (height, width, 3, R, C)
  • axis 0 & 1 are the rendering height/width in pixels
  • axis 2 represents the 3 colour channels; red, green and blue
  • axis 3 represents the R repeats
  • axis 4 represents which camera the image is from; we have the option of rendering with one camera or two (located at right angles to each other)
  • this representation is flattened to have shape (H, W, 3*R*C). we do this for ease of use of conv2d operations (see the reshape sketch below). (TODO: try conv3d instead)
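
for example, with a 120x160 render, R=3 repeats and C=2 cameras the channel stacking looks like this (a numpy sketch)

import numpy as np

H, W, R, C = 120, 160, 3, 2                  # height, width, repeats, cameras
state = np.zeros((H, W, 3, R, C))            # rendered state as described above
stacked = state.reshape(H, W, 3 * R * C)     # colour channels stacked for conv2d
print(stacked.shape)                         # (120, 160, 18)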

[eg_render.png: example of a high dimensional rendered state]

actions

in the discrete case the actions are: push the cart left, right, up or down, or do nothing.

in the continuous case the action is a 2d value representing the push force in the x and y directions (each in the range -1 to 1)
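
as a sketch, these two modes correspond to the following gym action spaces (the exact construction inside bullet_cartpole.py may differ)

from gym import spaces

# discrete: do nothing, or push the cart left / right / up / down
discrete_actions = spaces.Discrete(5)

# continuous: 2d push force in x & y, each component in [-1, 1]
continuous_actions = spaces.Box(low=-1.0, high=1.0, shape=(2,))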

rewards

in all cases we give a reward of 1 for each step and terminate the episode when either 200 steps have passed or the pole has fallen too far from the z-axis

agents

random agent

we use a random action agent (click through for video) to sanity check the setup. add --gui to any of these to get a rendering

[video link]

# no initial push and taking no action (action=0) results in episode timeout of 200 steps.
# this is a check of the stability of the pole under no forces
$ ./random_action_agent.py --initial-force=0 --actions="0" --num-eval=100 | ./deciles.py 
[ 200.  200.  200.  200.  200.  200.  200.  200.  200.  200.  200.]

# no initial push and random actions knocks pole over
$ ./random_action_agent.py --initial-force=0 --actions="0,1,2,3,4" --num-eval=100 | ./deciles.py
[ 16.   22.9  26.   28.   31.6  35.   37.4  42.3  48.4  56.1  79. ]

# initial push and no action knocks pole over
$ ./random_action_agent.py --initial-force=55 --actions="0" --num-eval=100 | ./deciles.py
[  6.    7.    7.    8.    8.6   9.   11.   12.3  15.   21.   39. ]

# initial push and random action knocks pole over
$ ./random_action_agent.py --initial-force=55 --actions="0,1,2,3,4" --num-eval=100 | ./deciles.py 
[  3.    5.9   7.    7.7   8.    9.   10.   11.   13.   15.   32. ]
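
deciles.py just summarises a stream of numbers; a minimal equivalent (an assumption based on the 11-value output above, not the actual script) would be

#!/usr/bin/env python
# read one number per line from stdin, print the 0th..100th percentiles in steps of 10
import sys
import numpy as np
values = [float(line) for line in sys.stdin if line.strip()]
print(np.percentile(values, range(0, 101, 10)))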

discrete control with a deep q network

$ ./dqn_cartpole.py \
 --num-train=2000000 --num-eval=0 \
 --save-file=ckpt.h5

result by numbers...

$ ./dqn_cartpole.py \
 --load-file=ckpt.h5 \
 --num-train=0 --num-eval=100 \
 | grep ^Episode | sed -es/.*steps:// | ./deciles.py 
[   5.    35.5   49.8   63.4   79.   104.5  122.   162.6  184.   200.   200. ]

result visually (click through for video)

[video link]

$ ./dqn_cartpole.py \
 --gui --delay=0.005 \
 --load-file=run11_50.weights.2.h5 \
 --num-train=0 --num-eval=100

discrete control with likelihood ratio policy gradient

policy gradient nails it

$ ./lrpg_cartpole.py --rollouts-per-batch=20 --num-train-batches=100 \
 --ckpt-dir=ckpts/foo

result by numbers...

# deciles
[  13.    70.6  195.8  200.   200.   200.   200.   200.   200.   200.   200. ]

result visually (click through for video)

[video link]

continuous control with deep deterministic policy gradient

$ ./ddpg_cartpole.py \
 --actor-hidden-layers="100,100,50" --critic-hidden-layers="100,100,50" \
 --action-force=100 --action-noise-sigma=0.1 --batch-size=256 \
 --max-num-actions=1000000 --ckpt-dir=ckpts/run43

result by numbers

# episode len deciles
[  30.    48.    56.8   65.    73.    86.   116.4  153.3  200.   200.   200. ]
# reward deciles
[  35.51154724  153.20243076  178.7908135   243.38630372  272.64655323
  426.95298195  519.25360223  856.9702368   890.72279221  913.21068417
  955.50168709]

result visually (click through for video)

[video link]

low dimensional continuous control with normalised advantage functions

$ ./naf_cartpole.py --action-force=100 \
 --action-repeats=3 --steps-per-repeat=4 \
 --optimiser=Momentum --optimiser-args='{"learning_rate": 0.0001, "momentum": 0.9}'

similar convergence to ddpg
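
the --optimiser / --optimiser-args flags suggest the optimiser is built by name from JSON kwargs; a sketch of that pattern, assuming TensorFlow (not necessarily the exact code in naf_cartpole.py)

import json
import tensorflow as tf

def make_optimiser(name, args_json):
    # e.g. name="Momentum", args_json='{"learning_rate": 0.0001, "momentum": 0.9}'
    optimiser_class = getattr(tf.train, name + "Optimizer")   # -> tf.train.MomentumOptimizer
    return optimiser_class(**json.loads(args_json))

optimiser = make_optimiser("Momentum", '{"learning_rate": 0.0001, "momentum": 0.9}')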

high dimensional continuous control with normalised advantage functions

does OK, but not perfect yet. even as a human it's hard to do... (see the blog post)

general utils

run a random agent, logging events to disk (outputs total rewards per episode)

note: for replay logging you'll first need to compile the protocol buffer: protoc event.proto --python_out=.

$ ./random_action_agent.py --event-log=test.log --num-eval=10 --action-type=continuous
12
14
...

review event.log (either from ddpg training or from the random agent)

$ ./event_log.py --log-file=test.log --echo
event {
  state {
    cart_pose: 0.116232253611
    cart_pose: 0.0877446383238
    cart_pose: 0.0748709067702
    cart_pose: 1.14359036161e-05
    cart_pose: 5.10180834681e-05
    cart_pose: 0.0653914809227
    cart_pose: 0.997859716415
    pole_pose: 0.000139251351357
    pole_pose: -0.0611916743219
    pole_pose: 0.344804286957
    pole_pose: -0.123383037746
    pole_pose: 0.00611496530473
    pole_pose: 0.0471726879478
    pole_pose: 0.991218447685
    render {
      height: 120
      width: 160
      rgba: "\211PNG\r\n\032\n\000\..."
    }
  }
  is_terminal: false
  action: -0.157108291984
  action: 0.330988258123
  reward: 4.0238070488
}
...

generate images from event.log

$ ./event_log.py --log-file=test.log --img-output-dir=eg_renders
$ find eg_renders -type f | sort
eg_renders/e_00000/s_00000.png
eg_renders/e_00000/s_00001.png
...
eg_renders/e_00009/s_00018.png
eg_renders/e_00009/s_00019.png

1000 events in an event_log is roughly 750K for the high dim case and 100K for low dim