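CartPole consists of a pole attached to a cart that can move left and right without friction. The pole (an inverted pendulum) starts out pointing straight up, and the cart moves left and right to keep it from falling over. Note: the force applied to the cart has a fixed magnitude, but the resulting increase or decrease in velocity is not fixed; it depends on the angle between the pole and the vertical at that moment. Different angles produce different velocities and displacements.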
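To make the note about the fixed force concrete, here is a minimal sketch of one simulation step. It assumes the classic cart-pole equations of motion with simple Euler integration; the constants (masses, pole length, force magnitude, time step) are illustrative assumptions and are not taken from this notebook.

```python
import math

# Assumed constants for the classic cart-pole formulation (illustrative only).
GRAVITY = 9.8
MASS_CART = 1.0
MASS_POLE = 0.1
TOTAL_MASS = MASS_CART + MASS_POLE
HALF_POLE_LENGTH = 0.5
FORCE_MAG = 10.0   # every push has this magnitude; only its sign changes
TIME_STEP = 0.02   # seconds per simulation step


def accelerations(theta, theta_dot, action):
    """Return (cart acceleration, pole angular acceleration) for one push."""
    force = FORCE_MAG if action == 1 else -FORCE_MAG
    sin_t, cos_t = math.sin(theta), math.cos(theta)
    pole_mass_length = MASS_POLE * HALF_POLE_LENGTH
    temp = (force + pole_mass_length * theta_dot ** 2 * sin_t) / TOTAL_MASS
    theta_acc = (GRAVITY * sin_t - cos_t * temp) / (
        HALF_POLE_LENGTH * (4.0 / 3.0 - MASS_POLE * cos_t ** 2 / TOTAL_MASS)
    )
    x_acc = temp - pole_mass_length * theta_acc * cos_t / TOTAL_MASS
    return x_acc, theta_acc


def step(state, action):
    """Advance (x, x_dot, theta, theta_dot) by one Euler step."""
    x, x_dot, theta, theta_dot = state
    x_acc, theta_acc = accelerations(theta, theta_dot, action)
    return (
        x + TIME_STEP * x_dot,
        x_dot + TIME_STEP * x_acc,
        theta + TIME_STEP * theta_dot,
        theta_dot + TIME_STEP * theta_acc,
    )


# The same push (action 1) at two different pole angles, in radians.
upright = accelerations(theta=0.0, theta_dot=0.0, action=1)
tilted = accelerations(theta=0.2, theta_dot=0.0, action=1)
print("upright:", upright)
print("tilted: ", tilted)
```

Although the force is identical in both calls, the two acceleration pairs differ, which is why the same action leads to different velocities and displacements depending on the pole angle.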

![](images/cartpole_g.gif)

CartPole is one of the simplest environments in OpenAI Gym (a toolkit of simulated environments for reinforcement learning). As you can see in the animation above, the goal of CartPole is to balance a pole that is attached by a single joint to the top of a moving cart.

Instead of pixel information, the state consists of four values: the position and velocity of the cart, and the angle and angular velocity of the pole. An agent moves the cart by choosing action 0 or 1 at each step, which pushes the cart left or right, respectively.
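As a small illustration of the state and action encoding, the sketch below names the four state values and implements a naive hand-written policy. `CartPoleState` and `lean_policy` are hypothetical names introduced here, and the sign convention (a positive angle is corrected by pushing right) is an assumption.

```python
from typing import NamedTuple


class CartPoleState(NamedTuple):
    """Hypothetical container naming the four state values."""
    cart_position: float
    cart_velocity: float
    pole_angle: float              # radians, 0 means upright
    pole_angular_velocity: float


LEFT, RIGHT = 0, 1  # the two discrete actions


def lean_policy(state: CartPoleState) -> int:
    """Push the cart toward the side the pole is leaning to."""
    return RIGHT if state.pole_angle > 0 else LEFT


state = CartPoleState(0.0, 0.0, 0.05, 0.0)
action = lean_policy(state)
print(action)
```

A rule like this ignores the velocities, so it is only a baseline to compare a learned agent against.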

We will use the keras-rl2 library here, which lets us implement deep Q-learning out of the box.
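Before relying on the library, it may help to see the quantity that deep Q-learning regresses toward. The sketch below computes the standard one-step Q-learning target in plain Python; the discount factor `GAMMA` and the function names are assumptions made for illustration, not keras-rl2 API.

```python
GAMMA = 0.99  # assumed discount factor


def td_target(reward, next_q_values, done, gamma=GAMMA):
    """One-step target: r + gamma * max_a' Q(s', a'), or just r at episode end."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


def squared_td_error(q_value, target):
    """Loss for a single transition: how far Q(s, a) is from its target."""
    return (target - q_value) ** 2


# One transition: reward 1, and the network's Q-values for the next state.
target = td_target(reward=1.0, next_q_values=[2.0, 3.0], done=False)
loss = squared_td_error(q_value=3.5, target=target)
print(target, loss)
```

In deep Q-learning the Q-values come from a neural network, and the network weights are adjusted to reduce this error over batches of stored transitions.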

Step 1: Install the keras-rl2 library

https://github.com/keras-rl/keras-rl/issues/371

In [1]:
!pip freeze

absl-py==0.12.0
argon2-cffi==20.1.0
astunparse==1.6.3
async-generator==1.10
attrs==20.3.0
backcall==0.2.0
bleach==3.3.0
cachetools==4.2.1
certifi==2020.12.5
cffi==1.14.5
chardet==4.0.0
cloudpickle==1.6.0
colorama==0.4.4
decorator==5.0.5
defusedxml==0.7.1
entrypoints==0.3
flatbuffers==1.12
future==0.18.2
gast==0.3.3
google-auth==1.28.0
google-auth-oauthlib==0.4.4
google-pasta==0.2.0
grpcio==1.32.0
gym==0.18.0
h5py==2.10.0
idna==2.10
ipykernel==5.5.3
ipython==7.22.0
ipython-genutils==0.2.0
ipywidgets==7.6.3
jedi==0.18.0
Jinja2==2.11.3
jsonschema==3.2.0
jupyter==1.0.0
jupyter-client==6.1.12
jupyter-console==6.4.0
jupyter-core==4.7.1
jupyterlab-pygments==0.1.2
jupyterlab-widgets==1.0.0
Keras==2.4.3
Keras-Preprocessing==1.1.2
keras-rl2==1.0.4
Markdown==3.3.4
MarkupSafe==1.1.1
mistune==0.8.4
nbclient==0.5.3
nbconvert==6.0.7
nbformat==5.1.3
nest-asyncio==1.5.1
notebook==6.3.0
numpy==1.19.5
oauthlib==3.1.0
opt-einsum==3.3.0
packaging==20.9
pandocfilters==1.4.3
parso==0.8.2
pickleshare==0.7.5
P

In [None]:
!pip install keras-rl2

Step 2: Install dependencies for the CartPole environment

In [None]:
!pip install gym
!pip install keras

In [None]:
#!pip freeze

Step 3: Let’s get started!

First, we have to import the necessary modules:

In [2]:
import numpy as np
import gym

from keras.models import Sequential
from keras.layers import Dense, Activation, Flatten
from keras.optimizers import Adam

from rl.agents.dqn import DQNAgent
from rl.policy import EpsGreedyQPolicy
from rl.memory import SequentialMemory

Then, set the relevant variables:

In [3]:
ENV_NAME = 'CartPole-v0'

# Create the environment and read the number of discrete actions available in CartPole
env = gym.make(ENV_NAME)
np.random.seed(123)
env.seed(123)
nb_actions = env.action_space.n

Next, we will build a very simple neural network model with a single hidden layer:

In [4]:
model = Sequential()
model.add(Flatten(input_shape=(1,) + env.observation_space.shape))
model.add(Dense(16))
model.add(Activation('relu'))
model.add(Dense(nb_actions))
model.add(Activation('linear'))
print(model.summary())

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten (Flatten)            (None, 4)                 0         
_________________________________________________________________
dense (Dense)                (None, 16)                80        
_________________________________________________________________
activation (Activation)      (None, 16)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 34        
_________________________________________________________________
activation_1 (Activation)    (None, 2)                 0         
Total params: 114
Trainable params: 114
Non-trainable params: 0
_________________________________________________________________
None
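The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. A minimal sketch, where `dense_params` is a helper name introduced here for illustration:

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


state_size = 4    # flattened observation
hidden_units = 16
nb_actions = 2

hidden_params = dense_params(state_size, hidden_units)
output_params = dense_params(hidden_units, nb_actions)
total_params = hidden_params + output_params
print(hidden_params, output_params, total_params)
```

The Flatten and Activation layers have no weights, so they contribute 0 parameters.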


Now we configure and compile our agent. We use an epsilon-greedy policy to choose actions and a SequentialMemory to store the actions we performed and the rewards we received for each of them.
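To show what these two components do conceptually, here is a plain-Python sketch. `epsilon_greedy` and `ReplayMemory` are hypothetical stand-ins written for this explanation; they are not the keras-rl2 classes, and the epsilon value is an assumption.

```python
import random
from collections import deque


def epsilon_greedy(q_values, epsilon=0.1, rng=random):
    """With probability epsilon pick a random action, otherwise the best one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


class ReplayMemory:
    """Hypothetical bounded buffer of past transitions."""

    def __init__(self, limit):
        # A deque with maxlen silently drops the oldest entry when full.
        self.buffer = deque(maxlen=limit)

    def append(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        """Draw a random batch of stored transitions without replacement."""
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


replay = ReplayMemory(limit=3)
for t in range(5):
    replay.append(state=t, action=0, reward=1.0, next_state=t + 1, done=False)

batch = replay.sample(2)
greedy_action = epsilon_greedy([0.2, 0.7], epsilon=0.0)
print(len(replay), batch, greedy_action)
```

Random exploration keeps the agent from committing too early to a poor action, and sampling from stored transitions lets each experience be reused across many training updates.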

In [5]:
policy = EpsGreedyQPolicy()
memory = SequentialMemory(limit=50000, window_length=1)
dqn = DQNAgent(model=model, nb_actions=nb_actions, memory=memory, nb_steps_warmup=10, 
               target_model_update=1e-2, policy=policy)
dqn.compile(Adam(lr=1e-3), metrics=['mae'])

# Okay, now it's time to learn something! We visualize the training here for show, but this slows down training quite a lot. 
dqn.fit(env, nb_steps=5000, visualize=True, verbose=2)

Training for 5000 steps ...
    9/5000: episode: 1, duration: 1.931s, episode steps:   9, steps per second:   5, episode reward:  9.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 1.000 [1.000, 1.000],  loss: --, mae: --, mean_q: --
   28/5000: episode: 2, duration: 0.973s, episode steps:  19, steps per second:  20, episode reward: 19.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.526 [0.000, 1.000],  loss: 0.464901, mae: 0.541042, mean_q: 0.304381
   37/5000: episode: 3, duration: 0.166s, episode steps:   9, steps per second:  54, episode reward:  9.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 1.000 [1.000, 1.000],  loss: 0.418581, mae: 0.508398, mean_q: 0.330029
   47/5000: episode: 4, duration: 0.183s, episode steps:  10, steps per second:  55, episode reward: 10.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.900 [0.000, 1.000],  loss: 0.357435, mae: 0.483128, mean_q: 0.478620
   55/5000: episode: 5, duration: 0.148s, episode steps:   8, steps per second:  54, episode reward:  8.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 1.000 [1.000, 1.000],  loss: 0.343553, mae: 0.486915, mean_q: 0.557740
   64/5000: episode: 6, duration: 0.174s, episode steps:   9, steps per second:  52, episode reward:  9.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 1.000 [1.000, 1.000],  loss: 0.309573, mae: 0.486496, mean_q: 0.718290
   73/5000: episode: 7, duration: 0.157s, episode steps:   9, steps per second:  57, episode reward:  9.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 1.000 [1.000, 1.000],  loss: 0.269646, mae: 0.456214, mean_q: 0.817970
   82/5000: episode: 8, duration: 0.167s, episode steps:   9, steps per seco

  392/5000: episode: 40, duration: 0.216s, episode steps:  12, steps per second:  56, episode reward: 12.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.917 [0.000, 1.000],  loss: 0.489871, mae: 1.468363, mean_q: 3.095916
  403/5000: episode: 41, duration: 0.199s, episode steps:  11, steps per second:  55, episode reward: 11.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 1.000 [1.000, 1.000],  loss: 0.567061, mae: 1.459492, mean_q: 3.217516
  414/5000: episode: 42, duration: 0.196s, episode steps:  11, steps per second:  56, episode reward: 11.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.909 [0.000, 1.000],  loss: 0.540188, mae: 1.464930, mean_q: 3.191311
  424/5000: episode: 43, duration: 0.166s, episode steps:  10, steps per second:  60, episode reward: 10.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.900 [0.000, 1.000],  loss: 0.506549, mae: 1.441760, mean_q: 3.208490
  433/5000: episode: 44, duration: 0.167s, episode steps:   9, steps per

  752/5000: episode: 76, duration: 0.249s, episode steps:  14, steps per second:  56, episode reward: 14.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.714 [0.000, 1.000],  loss: 0.381163, mae: 2.178760, mean_q: 4.942778
  760/5000: episode: 77, duration: 0.148s, episode steps:   8, steps per second:  54, episode reward:  8.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.875 [0.000, 1.000],  loss: 0.311668, mae: 2.189519, mean_q: 5.067178
  771/5000: episode: 78, duration: 0.197s, episode steps:  11, steps per second:  56, episode reward: 11.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.727 [0.000, 1.000],  loss: 0.351925, mae: 2.218431, mean_q: 5.078398
  782/5000: episode: 79, duration: 0.218s, episode steps:  11, steps per second:  50, episode reward: 11.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.818 [0.000, 1.000],  loss: 0.380443, mae: 2.199909, mean_q: 5.134449
  792/5000: episode: 80, duration: 0.224s, episode steps:  10, steps per

 1120/5000: episode: 113, duration: 0.173s, episode steps:  10, steps per second:  58, episode reward: 10.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.700 [0.000, 1.000],  loss: 0.201341, mae: 2.729678, mean_q: 5.672774
 1131/5000: episode: 114, duration: 0.183s, episode steps:  11, steps per second:  60, episode reward: 11.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.727 [0.000, 1.000],  loss: 0.190577, mae: 2.881839, mean_q: 6.022272
 1142/5000: episode: 115, duration: 0.184s, episode steps:  11, steps per second:  60, episode reward: 11.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.727 [0.000, 1.000],  loss: 0.181369, mae: 2.863822, mean_q: 5.990850
 1153/5000: episode: 116, duration: 0.201s, episode steps:  11, steps per second:  55, episode reward: 11.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.727 [0.000, 1.000],  loss: 0.199122, mae: 2.889940, mean_q: 6.004386
 1162/5000: episode: 117, duration: 0.163s, episode steps:   9, step

 1491/5000: episode: 150, duration: 0.199s, episode steps:  11, steps per second:  55, episode reward: 11.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.091 [0.000, 1.000],  loss: 1.311106, mae: 3.752714, mean_q: 7.076500
 1500/5000: episode: 151, duration: 0.167s, episode steps:   9, steps per second:  54, episode reward:  9.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.000 [0.000, 0.000],  loss: 1.324453, mae: 3.807926, mean_q: 7.153075
 1509/5000: episode: 152, duration: 0.167s, episode steps:   9, steps per second:  54, episode reward:  9.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.000 [0.000, 0.000],  loss: 0.911833, mae: 3.876981, mean_q: 7.329408
 1518/5000: episode: 153, duration: 0.165s, episode steps:   9, steps per second:  55, episode reward:  9.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.000 [0.000, 0.000],  loss: 1.850518, mae: 3.755797, mean_q: 6.982300
 1527/5000: episode: 154, duration: 0.166s, episode steps:   9, step

 1868/5000: episode: 186, duration: 0.436s, episode steps:  26, steps per second:  60, episode reward: 26.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  loss: 2.004313, mae: 4.585264, mean_q: 8.310884
 1882/5000: episode: 187, duration: 0.248s, episode steps:  14, steps per second:  57, episode reward: 14.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.571 [0.000, 1.000],  loss: 2.592759, mae: 4.611460, mean_q: 8.316271
 1903/5000: episode: 188, duration: 0.363s, episode steps:  21, steps per second:  58, episode reward: 21.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.524 [0.000, 1.000],  loss: 1.650994, mae: 4.587945, mean_q: 8.415720
 1913/5000: episode: 189, duration: 0.167s, episode steps:  10, steps per second:  60, episode reward: 10.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.100 [0.000, 1.000],  loss: 2.019940, mae: 4.749363, mean_q: 8.679503
 1926/5000: episode: 190, duration: 0.234s, episode steps:  13, step

 2601/5000: episode: 222, duration: 0.816s, episode steps:  36, steps per second:  44, episode reward: 36.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.500 [0.000, 1.000],  loss: 1.454843, mae: 5.228097, mean_q: 9.757804
 2621/5000: episode: 223, duration: 0.580s, episode steps:  20, steps per second:  34, episode reward: 20.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.450 [0.000, 1.000],  loss: 1.158839, mae: 5.330241, mean_q: 10.026169
 2662/5000: episode: 224, duration: 0.823s, episode steps:  41, steps per second:  50, episode reward: 41.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.512 [0.000, 1.000],  loss: 1.385427, mae: 5.411831, mean_q: 10.167161
 2694/5000: episode: 225, duration: 0.563s, episode steps:  32, steps per second:  57, episode reward: 32.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.531 [0.000, 1.000],  loss: 1.867020, mae: 5.482892, mean_q: 10.132372
 2716/5000: episode: 226, duration: 0.370s, episode steps:  22, s

 4479/5000: episode: 258, duration: 0.985s, episode steps:  58, steps per second:  59, episode reward: 58.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.534 [0.000, 1.000],  loss: 2.091609, mae: 8.147963, mean_q: 15.608634
 4568/5000: episode: 259, duration: 1.581s, episode steps:  89, steps per second:  56, episode reward: 89.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.517 [0.000, 1.000],  loss: 2.375388, mae: 8.253555, mean_q: 15.777792
 4732/5000: episode: 260, duration: 3.861s, episode steps: 164, steps per second:  42, episode reward: 164.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.506 [0.000, 1.000],  loss: 2.679690, mae: 8.570634, mean_q: 16.377707
 4777/5000: episode: 261, duration: 1.688s, episode steps:  45, steps per second:  27, episode reward: 45.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.533 [0.000, 1.000],  loss: 3.364439, mae: 8.684531, mean_q: 16.536690
 4846/5000: episode: 262, duration: 1.482s, episode steps:  69,

<tensorflow.python.keras.callbacks.History at 0x29e790a6160>
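The agent above was created with `target_model_update=1e-2`. As far as we understand keras-rl, a value below 1 is treated as a soft-update rate: after each training step the target network moves a small fraction of the way toward the online network. The sketch below shows that rule on plain lists of numbers; `soft_update` is a name introduced here for illustration.

```python
def soft_update(target_weights, online_weights, tau=1e-2):
    """Move each target weight a fraction tau toward the online weight."""
    return [
        tau * online + (1.0 - tau) * target
        for target, online in zip(target_weights, online_weights)
    ]


target = [0.0]
online = [1.0]
for _ in range(500):
    target = soft_update(target, online, tau=1e-2)
print(target)
```

Because the target network changes slowly, the values the online network is trained toward stay relatively stable from one step to the next.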

Test our reinforcement learning model:

In [6]:
dqn.test(env, nb_episodes=5, visualize=True)

Testing for 5 episodes ...
Episode 1: reward: 200.000, steps: 200
Episode 2: reward: 200.000, steps: 200
Episode 3: reward: 200.000, steps: 200
Episode 4: reward: 200.000, steps: 200
Episode 5: reward: 200.000, steps: 200


<tensorflow.python.keras.callbacks.History at 0x2240d280d30>
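To summarize the test run, we can parse the printed episode lines and average the rewards. This is a sketch that works on the text output shown above; the regular expression and the `parse_rewards` helper are assumptions written for this exact line format. As far as we know, CartPole-v0 ends an episode after at most 200 steps, so 200 is the best possible reward here.

```python
import re

TEST_OUTPUT = """\
Episode 1: reward: 200.000, steps: 200
Episode 2: reward: 200.000, steps: 200
Episode 3: reward: 200.000, steps: 200
Episode 4: reward: 200.000, steps: 200
Episode 5: reward: 200.000, steps: 200
"""

# Matches lines such as "Episode 1: reward: 200.000, steps: 200".
EPISODE_LINE = re.compile(r"Episode (\d+): reward: ([\d.]+), steps: (\d+)")


def parse_rewards(text):
    """Return the reward of every episode line found in the text."""
    return [float(match.group(2)) for match in EPISODE_LINE.finditer(text)]


rewards = parse_rewards(TEST_OUTPUT)
mean_reward = sum(rewards) / len(rewards)
print(f"{len(rewards)} episodes, mean reward {mean_reward:.1f}")
```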