# Solving the OpenAI Gym CartPole environment using DQNAgent

## Getting started

Documentation: https://gym.openai.com/docs/

### Setting up the gym
git clone https://github.com/openai/gym
cd gym
pip install -e . # minimal install

In [1]:
import numpy as np
import gym

from keras.models import Sequential
from keras.layers import Dense, Activation, Flatten
from keras.optimizers import Adam

from rl.agents.dqn import DQNAgent
from rl.policy import BoltzmannQPolicy
from rl.memory import SequentialMemory

Using TensorFlow backend.


#### Open environment
Create the environment and extract the number of actions.
There are two discrete actions: move left and move right.

In [5]:
env = gym.make('CartPole-v0')
env.seed(123)
nb_actions = env.action_space.n

[2017-10-30 16:48:52,445] Making new env: CartPole-v0


In [4]:
env.reset()
for _ in range(1000):
    env.render()
    env.step(env.action_space.sample()) # take a random action

[2017-10-30 16:48:36,076] You are calling 'step()' even though this environment has already returned done = True. You should always call 'reset()' once you receive 'done = True' -- any further steps are undefined behavior.
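
The warning above appears because the loop keeps calling `step()` after an episode has ended. A minimal sketch of a loop that resets on `done` follows. It assumes the classic Gym API used in this notebook, where `step()` returns `(observation, reward, done, info)`; the helper name `run_random_steps` is hypothetical and not part of Gym.

```python
def run_random_steps(env, n_steps, render=False):
    """Take n_steps random actions, resetting whenever an episode ends.

    Assumes the classic Gym API: reset() returns an observation and
    step(action) returns (observation, reward, done, info).
    Returns the number of episodes that finished during the run.
    """
    finished_episodes = 0
    env.reset()
    for _ in range(n_steps):
        if render:
            env.render()
        # Sample a random action and advance the environment by one step.
        _, _, done, _ = env.step(env.action_space.sample())
        if done:
            # The episode is over: start a new one before stepping again.
            finished_episodes += 1
            env.reset()
    return finished_episodes
```

With the environment created earlier, this could be called as `run_random_steps(env, 1000, render=True)`.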


In [6]:
env.observation_space.shape

(4,)

#### Build a simple NN

In [7]:
model = Sequential()
model.add(Flatten(input_shape=(1,) + env.observation_space.shape))
model.add(Dense(16))
model.add(Activation('relu'))
model.add(Dense(16))
model.add(Activation('relu'))
model.add(Dense(16))
model.add(Activation('relu'))
model.add(Dense(nb_actions))
model.add(Activation('linear'))
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_1 (Flatten)          (None, 4)                 0         
_________________________________________________________________
dense_1 (Dense)              (None, 16)                80        
_________________________________________________________________
activation_1 (Activation)    (None, 16)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 16)                272       
_________________________________________________________________
activation_2 (Activation)    (None, 16)                0         
_________________________________________________________________
dense_3 (Dense)              (None, 16)                272       
_________________________________________________________________
activation_3 (Activation)    (None, 16)                0         
__________
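
The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The sketch below reproduces the 80 and 272 shown above. The count for the output layer is computed here, not read from the truncated summary, and assumes `nb_actions` is 2, as stated earlier. The helper name `dense_params` is hypothetical.

```python
def dense_params(n_inputs, n_units):
    """Number of trainable parameters in a fully connected layer:
    one weight per (input, unit) pair plus one bias per unit."""
    return n_inputs * n_units + n_units

# 4 observation values -> three hidden layers of 16 units -> 2 actions (assumed).
layer_sizes = [4, 16, 16, 16, 2]
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(counts)

print(counts)  # [80, 272, 272, 34]
print(total)   # 658
```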

#### Configure and compile our agent
Any built-in Keras optimizer can be used here, and so can the Keras metrics.


In [10]:
memory = SequentialMemory(limit=50000, window_length=1)
policy = BoltzmannQPolicy()
dqn = DQNAgent(model=model, 
               nb_actions=nb_actions, 
               memory=memory, 
               nb_steps_warmup=10,
               target_model_update=1e-2, 
               policy=policy)
dqn.compile(Adam(lr=1e-3), metrics=['mae'])
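
For intuition about the policy chosen above: Boltzmann (softmax) exploration turns Q-values into action probabilities, so actions with higher Q-values are picked more often but never exclusively. The NumPy sketch below illustrates the general technique and is not the keras-rl source. The function names, the temperature `tau`, and the clipping range are assumptions made for this example.

```python
import numpy as np

def boltzmann_probabilities(q_values, tau=1.0, clip=(-500.0, 500.0)):
    """Softmax over Q-values with temperature tau.

    Scaled values are clipped before exponentiation to avoid overflow.
    """
    q = np.asarray(q_values, dtype=np.float64)
    exp_values = np.exp(np.clip(q / tau, clip[0], clip[1]))
    return exp_values / exp_values.sum()

def select_action(q_values, tau=1.0, rng=None):
    """Sample an action index according to the Boltzmann probabilities."""
    rng = np.random.default_rng() if rng is None else rng
    probs = boltzmann_probabilities(q_values, tau)
    return int(rng.choice(len(probs), p=probs))
```

A lower `tau` makes the choice greedier, while a higher `tau` pushes it toward uniform random sampling.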

#### Actual learning
Visualization is off (`visualize=False`) until we figure out how to export the display correctly.

In [11]:
# Name of the environment created above; used to build the weights filename.
ENV_NAME = 'CartPole-v0'

dqn.fit(env, nb_steps=50000, visualize=False, verbose=2)

# After training is done, we save the final weights.
dqn.save_weights('dqn_{}_weights.h5f'.format(ENV_NAME), overwrite=True)

Training for 50000 steps ...
    12/50000: episode: 1, duration: 1.224s, episode steps: 12, steps per second: 10, episode reward: 12.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.333 [0.000, 1.000], mean observation: 0.098 [-0.995, 1.587], loss: 0.730131, mean_absolute_error: 0.697327, mean_q: 0.141359
    36/50000: episode: 2, duration: 0.404s, episode steps: 24, steps per second: 59, episode reward: 24.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.417 [0.000, 1.000], mean observation: 0.123 [-0.755, 1.734], loss: 0.581815, mean_absolute_error: 0.627062, mean_q: 0.237459
    46/50000: episode: 3, duration: 0.186s, episode steps: 10, steps per second: 54, episode reward: 10.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.000 [0.000, 0.000], mean observation: 0.167 [-1.905, 3.089], loss: 0.380364, mean_absolute_error: 0.556484, mean_q: 0.419568
    90/50000: episode: 4, duration: 0.734s, episode steps: 44, steps per second: 60, episode reward: 44.000, mean reward: 1.000 [1.000, 1.000], mean

#### Evaluate our algorithm for 5 episodes.

In [None]:
dqn.test(env, nb_episodes=5, visualize=True)
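
To judge whether the agent has "solved" the task, the per-episode rewards can be compared against a threshold. The sketch below assumes the commonly cited criterion for CartPole-v0, an average reward of at least 195.0 over 100 consecutive episodes. The helper name `is_solved` is hypothetical, and the list of rewards has to be collected separately, for example from a longer test run than the 5 episodes above.

```python
def is_solved(episode_rewards, threshold=195.0, window=100):
    """Return True if the mean reward over the last `window` episodes
    reaches `threshold`. Returns False when fewer than `window`
    episodes are available, since the criterion cannot be evaluated yet."""
    if len(episode_rewards) < window:
        return False
    recent = episode_rewards[-window:]
    return sum(recent) / window >= threshold
```

Five test episodes are too few to evaluate this criterion, so they serve only as a quick sanity check.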