<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#CartPole,-aka-Inverted-Pendulum" data-toc-modified-id="CartPole,-aka-Inverted-Pendulum-1">CartPole, aka Inverted Pendulum</a></span></li><li><span><a href="#OpenAI's-CartPole-Environment" data-toc-modified-id="OpenAI's-CartPole-Environment-2">OpenAI's CartPole Environment</a></span></li><li><span><a href="#Define-Environment" data-toc-modified-id="Define-Environment-3">Define Environment</a></span></li><li><span><a href="#Define-Neural-Network-" data-toc-modified-id="Define-Neural-Network--4">Define Neural Network </a></span></li><li><span><a href="#Define-Agent" data-toc-modified-id="Define-Agent-5">Define Agent</a></span></li><li><span><a href="#Train-the-model" data-toc-modified-id="Train-the-model-6">Train the model</a></span></li><li><span><a href="#Test-the-model" data-toc-modified-id="Test-the-model-7">Test the model</a></span></li><li><span><a href="#Grading-Submission-Notes" data-toc-modified-id="Grading-Submission-Notes-8">Grading Submission Notes</a></span></li><li><span><a href="#Bonus-Material" data-toc-modified-id="Bonus-Material-9">Bonus Material</a></span></li></ul></div>

<center><h2>CartPole, aka Inverted Pendulum</h2></center>
<br>
<center><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/0/00/Cart-pendulum.svg/300px-Cart-pendulum.svg.png" width="35%"/></center>
<br><br><br>
<center><a href="https://fluxml.ai/experiments/cartPole/">Demo!</a></center>



<center><h2>OpenAI's CartPole Environment</h2></center>

A pole is attached to a cart by an un-actuated joint, and the cart moves along a frictionless track. 

The system is controlled by applying a force of +1 or -1 to the cart. 

The pendulum starts upright, and the goal is to prevent it from falling over. 

A reward of +1 is provided for every time-step that the pole remains upright. 

The episode ends when:

- The pole is more than 15 degrees from the vertical.
- The cart moves more than 2.4 units from the center.
- The episode reaches 200 time-steps.
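
The termination rules and the +1 per-step reward can be sketched in plain Python. This is only an illustration, not gym's implementation: the names `episode_done` and `episode_return` and the constants are hypothetical, and the limits are copied from the list above (the gym source code remains the authoritative reference for the exact values).

```python
# Hypothetical limits, copied from the list above (not read from gym itself).
MAX_ANGLE_DEG = 15.0   # pole angle limit, in degrees from the vertical
MAX_POSITION = 2.4     # cart position limit, in units from the center
MAX_STEPS = 200        # episode length cap


def episode_done(cart_position, pole_angle_deg, steps_taken):
    """Return True if any of the three termination conditions holds."""
    return (
        abs(pole_angle_deg) > MAX_ANGLE_DEG
        or abs(cart_position) > MAX_POSITION
        or steps_taken >= MAX_STEPS
    )


def episode_return(steps_survived):
    """With +1 reward per time-step, the return equals the steps survived."""
    return float(min(steps_survived, MAX_STEPS))


print(episode_done(0.0, 3.0, 10))    # upright, centered, early: False
print(episode_done(0.0, 20.0, 10))   # pole tipped too far: True
print(episode_return(9))             # a 9-step episode earns 9.0
```

Because every surviving step earns exactly +1, "episode reward" and "episode steps" are the same number for this environment.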


In [82]:
reset -fs

In [83]:
import warnings
warnings.filterwarnings('ignore')

Define Environment
----

In [84]:
# Import OpenAI's gym (easy way or hard way)
try:
    import gym 
except ImportError:
    # Install gym into the current interpreter's environment, then retry
    import sys
    import subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'gym'])
    
    import gym

In [85]:
env = gym.make('CartPole-v0')

In [86]:
# Let's see how the RL problem is formulated
print('State size:  ', env.observation_space.shape[0] )
print('Action size: ', env.action_space.n)

State size:   4
Action size:  2


Let's read the docs:
https://github.com/openai/gym/wiki/CartPole-v0

Let's read the code: https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py

Define Neural Network 
-----

In [87]:
import numpy as np

from keras.models import Sequential
from keras.layers import Dense, Flatten
from keras.optimizers import Adam

In [88]:
# Define a sample model
model = Sequential()
model.add(Flatten(input_shape=(1,) + env.observation_space.shape))
model.add(Dense(24, activation='relu'))
model.add(Dense(env.action_space.n, activation='linear'))
model.compile(loss='mse', optimizer=Adam(lr=0.01))

In [89]:
# Define your model

# YOUR CODE HERE
# raise NotImplementedError()

model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_6 (Flatten)          (None, 4)                 0         
_________________________________________________________________
dense_11 (Dense)             (None, 24)                120       
_________________________________________________________________
dense_12 (Dense)             (None, 2)                 50        
Total params: 170
Trainable params: 170
Non-trainable params: 0
_________________________________________________________________
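
The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The helper `dense_params` below is a hypothetical name used only for this arithmetic check.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


state_size, hidden_units, n_actions = 4, 24, 2

hidden = dense_params(state_size, hidden_units)   # 4 * 24 + 24
output = dense_params(hidden_units, n_actions)    # 24 * 2 + 2

print(hidden, output, hidden + output)  # 120 50 170
```

The `Flatten` layer only reshapes its input, so it contributes no parameters.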


Define Agent
----

In [90]:
# Import keras-rl (easy way or hard way)
try:
    import rl
except ImportError:
    # Install keras-rl into the current interpreter's environment, then retry
    import sys
    import subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'keras-rl'])
    import rl

In [91]:
# Define memory
from rl.memory import SequentialMemory

memory = SequentialMemory(limit=50_000, 
                          window_length=1)
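
To get a feel for what a replay memory does, here is a toy version in plain Python. It is a sketch under simplifying assumptions, not keras-rl's `SequentialMemory`: the class name `ReplayBuffer` and its methods are hypothetical, and it ignores `window_length` (with `window_length=1`, each state is a single observation anyway).

```python
import random
from collections import deque


class ReplayBuffer:
    """Toy fixed-size experience buffer (hypothetical, not keras-rl's class)."""

    def __init__(self, limit):
        # A deque with maxlen silently drops the oldest item once it is full
        self.buffer = deque(maxlen=limit)

    def append(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random minibatch, without replacement
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


buffer = ReplayBuffer(limit=3)
for step in range(5):
    buffer.append(state=step, action=0, reward=1.0, next_state=step + 1, done=False)

print(len(buffer))                      # 3: the two oldest transitions were evicted
print([t[0] for t in buffer.buffer])    # [2, 3, 4]
```

Sampling past transitions at random breaks the correlation between consecutive steps, which is the usual motivation for training a DQN from a memory rather than from the latest step alone.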

In [92]:
# Define sample policy

from rl.policy import GreedyQPolicy

policy = GreedyQPolicy()
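
A purely greedy policy always picks the action with the highest current Q-value, so it never explores. The sketch below contrasts it with epsilon-greedy action selection in plain Python. The function names are hypothetical and this is not keras-rl's code; keras-rl ships its own exploration policies in `rl.policy` (for example `EpsGreedyQPolicy`).

```python
import random


def greedy_action(q_values):
    """Index of the largest Q-value (what a purely greedy policy picks)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def epsilon_greedy_action(q_values, eps, rng=random):
    """With probability eps explore uniformly; otherwise act greedily."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))
    return greedy_action(q_values)


q = [0.2, 0.7]                      # hypothetical Q-values for actions 0 and 1
print(greedy_action(q))             # 1

rng = random.Random(0)              # seeded generator for a repeatable demo
picks = [epsilon_greedy_action(q, eps=0.5, rng=rng) for _ in range(1000)]
print(picks.count(0) > 0)           # True: exploration sometimes picks action 0
```

With `eps=0.0` the two functions agree, so epsilon-greedy can be seen as the greedy policy plus a tunable amount of random exploration.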

In [93]:
# Define your policy

# YOUR CODE HERE
# raise NotImplementedError()


In [94]:
# Define a sample agent
from rl.agents.dqn import DQNAgent

dqn = DQNAgent(model=model, 
               nb_actions=env.action_space.n, 
               memory=memory,
               nb_steps_warmup=15,
               target_model_update=1e-2,
               policy=policy)

dqn.compile(Adam(lr=1e-3), metrics=['mae'])
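
Two ideas sit behind the agent's arguments: the one-step Q-learning target, and the slowly moving target network. As far as we understand keras-rl, a `target_model_update` value below 1 is treated as a soft-update factor. The sketch below shows both ideas on plain lists; the function names and the `gamma` default are assumptions made for illustration, not keras-rl's internals.

```python
def td_target(reward, next_q_values, done, gamma=0.99):
    """One-step Q-learning target: r + gamma * max_a' Q_target(s', a')."""
    if done:
        return reward          # no bootstrapping from a terminal state
    return reward + gamma * max(next_q_values)


def soft_update(target_weights, online_weights, tau):
    """Move each target weight a fraction tau toward the online weight."""
    return [(1.0 - tau) * t + tau * w
            for t, w in zip(target_weights, online_weights)]


print(td_target(1.0, [2.0, 3.0], done=False, gamma=0.5))   # 1 + 0.5 * 3 = 2.5
print(td_target(1.0, [2.0, 3.0], done=True))               # 1.0
print(soft_update([0.0, 10.0], [1.0, 0.0], tau=0.5))       # [0.5, 5.0]
```

A small `tau` such as `1e-2` makes the target network trail the online network slowly, which tends to stabilize the targets the network is regressing toward.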

In [95]:
# Define your agent

# YOUR CODE HERE
# raise NotImplementedError()


Train the model
----

In [96]:
# Train model
dqn.fit(env, 
        nb_steps=1_500, 
        visualize=False, 
        verbose=2 # Episode logging
       )

Training for 1500 steps ...
   10/1500: episode: 1, duration: 0.501s, episode steps: 10, steps per second: 20, episode reward: 10.000, mean reward: 1.000 [1.000, 1.000], mean action: 1.000 [1.000, 1.000], mean observation: -0.155 [-3.091, 1.935], loss: --, mean_absolute_error: --, mean_q: --
   20/1500: episode: 2, duration: 2.538s, episode steps: 10, steps per second: 4, episode reward: 10.000, mean reward: 1.000 [1.000, 1.000], mean action: 1.000 [1.000, 1.000], mean observation: -0.142 [-3.029, 1.951], loss: 0.526494, mean_absolute_error: 0.530830, mean_q: 0.384968
   33/1500: episode: 3, duration: 0.072s, episode steps: 13, steps per second: 180, episode reward: 13.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.846 [0.000, 1.000], mean observation: -0.139 [-2.931, 1.722], loss: 0.418333, mean_absolute_error: 0.468051, mean_q: 0.473482
   42/1500: episode: 4, duration: 0.060s, episode steps: 9, steps per second: 150, episode reward: 9.000, mean reward: 1.000 [1.000, 1.000], 

  289/1500: episode: 31, duration: 0.105s, episode steps: 11, steps per second: 105, episode reward: 11.000, mean reward: 1.000 [1.000, 1.000], mean action: 1.000 [1.000, 1.000], mean observation: -0.137 [-3.291, 2.106], loss: 0.328887, mean_absolute_error: 1.357420, mean_q: 2.473842
  299/1500: episode: 32, duration: 0.086s, episode steps: 10, steps per second: 116, episode reward: 10.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.900 [0.000, 1.000], mean observation: -0.123 [-2.732, 1.779], loss: 0.383178, mean_absolute_error: 1.416042, mean_q: 2.524676
  309/1500: episode: 33, duration: 0.097s, episode steps: 10, steps per second: 103, episode reward: 10.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.900 [0.000, 1.000], mean observation: -0.132 [-2.684, 1.730], loss: 0.300767, mean_absolute_error: 1.398077, mean_q: 2.582131
  319/1500: episode: 34, duration: 0.098s, episode steps: 10, steps per second: 102, episode reward: 10.000, mean reward: 1.000 [1.000, 1.000], m

  572/1500: episode: 61, duration: 0.073s, episode steps: 10, steps per second: 138, episode reward: 10.000, mean reward: 1.000 [1.000, 1.000], mean action: 1.000 [1.000, 1.000], mean observation: -0.130 [-3.072, 1.965], loss: 0.157984, mean_absolute_error: 1.229580, mean_q: 4.004125
  581/1500: episode: 62, duration: 0.064s, episode steps: 9, steps per second: 140, episode reward: 9.000, mean reward: 1.000 [1.000, 1.000], mean action: 1.000 [1.000, 1.000], mean observation: -0.160 [-2.844, 1.752], loss: 0.174511, mean_absolute_error: 1.249547, mean_q: 3.956228
  589/1500: episode: 63, duration: 0.076s, episode steps: 8, steps per second: 105, episode reward: 8.000, mean reward: 1.000 [1.000, 1.000], mean action: 1.000 [1.000, 1.000], mean observation: -0.142 [-2.510, 1.598], loss: 0.171492, mean_absolute_error: 1.259190, mean_q: 3.887810
  599/1500: episode: 64, duration: 0.082s, episode steps: 10, steps per second: 122, episode reward: 10.000, mean reward: 1.000 [1.000, 1.000], mean 

  851/1500: episode: 91, duration: 0.098s, episode steps: 9, steps per second: 92, episode reward: 9.000, mean reward: 1.000 [1.000, 1.000], mean action: 1.000 [1.000, 1.000], mean observation: -0.148 [-2.838, 1.786], loss: 0.068909, mean_absolute_error: 1.338026, mean_q: 4.586572
  861/1500: episode: 92, duration: 0.103s, episode steps: 10, steps per second: 97, episode reward: 10.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.900 [0.000, 1.000], mean observation: -0.136 [-2.771, 1.767], loss: 0.036889, mean_absolute_error: 1.362355, mean_q: 4.867910
  870/1500: episode: 93, duration: 0.086s, episode steps: 9, steps per second: 105, episode reward: 9.000, mean reward: 1.000 [1.000, 1.000], mean action: 1.000 [1.000, 1.000], mean observation: -0.135 [-2.795, 1.793], loss: 0.041245, mean_absolute_error: 1.351378, mean_q: 4.826426
  879/1500: episode: 94, duration: 0.096s, episode steps: 9, steps per second: 94, episode reward: 9.000, mean reward: 1.000 [1.000, 1.000], mean actio

 1142/1500: episode: 122, duration: 0.054s, episode steps: 8, steps per second: 149, episode reward: 8.000, mean reward: 1.000 [1.000, 1.000], mean action: 1.000 [1.000, 1.000], mean observation: -0.140 [-2.503, 1.558], loss: 0.019777, mean_absolute_error: 1.588554, mean_q: 5.243237
 1152/1500: episode: 123, duration: 0.061s, episode steps: 10, steps per second: 164, episode reward: 10.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.800 [0.000, 1.000], mean observation: -0.127 [-2.435, 1.597], loss: 0.085578, mean_absolute_error: 1.651400, mean_q: 5.336720
 1162/1500: episode: 124, duration: 0.056s, episode steps: 10, steps per second: 177, episode reward: 10.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.900 [0.000, 1.000], mean observation: -0.132 [-2.696, 1.714], loss: 0.051693, mean_absolute_error: 1.611200, mean_q: 5.196894
 1171/1500: episode: 125, duration: 0.053s, episode steps: 9, steps per second: 169, episode reward: 9.000, mean reward: 1.000 [1.000, 1.000], m

 1445/1500: episode: 154, duration: 0.067s, episode steps: 9, steps per second: 134, episode reward: 9.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.889 [0.000, 1.000], mean observation: -0.150 [-2.497, 1.527], loss: 0.056756, mean_absolute_error: 1.776425, mean_q: 4.981745
 1455/1500: episode: 155, duration: 0.053s, episode steps: 10, steps per second: 189, episode reward: 10.000, mean reward: 1.000 [1.000, 1.000], mean action: 1.000 [1.000, 1.000], mean observation: -0.129 [-2.984, 1.969], loss: 0.016073, mean_absolute_error: 1.841001, mean_q: 5.221635
 1463/1500: episode: 156, duration: 0.046s, episode steps: 8, steps per second: 176, episode reward: 8.000, mean reward: 1.000 [1.000, 1.000], mean action: 1.000 [1.000, 1.000], mean observation: -0.161 [-2.594, 1.538], loss: 0.022991, mean_absolute_error: 1.827514, mean_q: 5.114954
 1473/1500: episode: 157, duration: 0.055s, episode steps: 10, steps per second: 182, episode reward: 10.000, mean reward: 1.000 [1.000, 1.000], m

<keras.callbacks.History at 0xb3aa55240>
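
In the log above, the episode reward stays at roughly 8 to 13 for the whole run and the mean action sits near 1.0, so the sample agent pushes in one direction almost every step and never improves. A moving average makes such a trend (or the lack of one) easier to see. The helper below is a hypothetical sketch in plain Python; it assumes you have the per-episode rewards as a list, for example from the `History` object that `dqn.fit` returns.

```python
def moving_average(values, window):
    """Mean of each run of `window` consecutive values."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


# Episode rewards of the first four episodes in the log above
logged_rewards = [10.0, 10.0, 13.0, 9.0]
print(moving_average(logged_rewards, window=2))   # [10.0, 11.5, 11.0]
```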

Test the model
----

In [97]:
# Test model
test_results = dqn.test(env, nb_episodes=11, visualize=False)

Testing for 11 episodes ...
Episode 1: reward: 9.000, steps: 9
Episode 2: reward: 10.000, steps: 10
Episode 3: reward: 10.000, steps: 10
Episode 4: reward: 9.000, steps: 9
Episode 5: reward: 10.000, steps: 10
Episode 6: reward: 10.000, steps: 10
Episode 7: reward: 8.000, steps: 8
Episode 8: reward: 9.000, steps: 9
Episode 9: reward: 9.000, steps: 9
Episode 10: reward: 9.000, steps: 9
Episode 11: reward: 10.000, steps: 10


In [98]:
# The max is 200 steps per episode.
# The goal of the assignment is to train an agent that averages about 180 steps.

from statistics import mean

# Copy the rewards so re-running this cell does not drop more runs
episode_rewards = list(test_results.history['episode_reward'])

# Remove the worst run
episode_rewards.remove(min(episode_rewards))

# Take the average of the remaining runs
test_performance = mean(episode_rewards)

print(f"The current model gets {test_performance:.2f} steps.")

The current model gets 9.50 steps.


In [99]:
# 5 points for over 100 steps per episode.

assert test_performance > 100.00

Traceback (most recent call last):
  File "<ipython-input-99-e353aa7f68bd>", line 3, in <module>
    assert test_performance > 100.00
AssertionError

In [None]:
# 5 points for over 125 steps per episode.

assert test_performance > 125.00

In [None]:
# 10 points for over 170 steps per episode.

assert test_performance > 170.00

Grading Submission Notes
-------

If there is output, we'll grade your submitted lab without running the notebook. If there is __no__ output, we'll run the notebook to get output to grade.

It would behove you to submit a notebook with output.

Bonus Material
-----

Learn more about CartPole from the physics and control-model perspective [here](https://danielpiedrahita.wordpress.com/portfolio/cart-pole-control/).