# Deep Q-Learning for FrozenLake-v0

This project closely follows what I did in qlearning.ipynb, but uses deep Q-Learning instead of tabular Q-Learning, via the keras-rl package. I first design a vanilla feed-forward network in Keras, and then I train it using the same update targets as in tabular Q-Learning. This way, the neural net (hopefully) approximates the values that would have been stored in the Q-Table.
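
The update mentioned above can be sketched in plain Python. This is only an illustration of the one-step Q-Learning target, not keras-rl's internal code: the function names, the values of `gamma` and `alpha`, and the use of a plain squared error are all assumptions made for the example.

```python
def q_learning_target(reward, next_q_values, done, gamma=0.99):
    """One-step Q-Learning target: r + gamma * max_a' Q(s', a')."""
    if done:
        # No future reward after a terminal state.
        return reward
    return reward + gamma * max(next_q_values)


def tabular_update(q_value, target, alpha=0.1):
    """Tabular rule: move the stored Q-value a fraction alpha toward the target."""
    return q_value + alpha * (target - q_value)


def squared_error(predicted_q, target):
    """Deep variant: the network's prediction is regressed toward the same target."""
    return (predicted_q - target) ** 2


# Hypothetical transition: reward 0, not terminal, next-state Q-values for 4 actions.
target = q_learning_target(reward=0.0, next_q_values=[0.1, 0.5, 0.2, 0.0],
                           done=False, gamma=0.9)
new_q = tabular_update(q_value=0.0, target=target, alpha=0.5)
loss = squared_error(predicted_q=0.0, target=target)
```

In the tabular case `new_q` is written back into the table; in the deep case the target stays the same, but the network weights are adjusted to reduce `loss` instead.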

Important: keras-rl only works with TensorFlow versions <= 1.14. You will probably have to downgrade TensorFlow or run the notebook in a virtual environment.
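
A quick way to guard against the wrong version is to compare the installed version string with the limit above before running anything else. The helper names below are hypothetical, and the sketch assumes a plain numeric `major.minor` prefix in the version string.

```python
def parse_major_minor(version):
    """Return (major, minor) from a version string such as '1.14.0'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def is_supported_tensorflow(version, newest_supported=(1, 14)):
    """True if the version is at or below the limit stated above."""
    return parse_major_minor(version) <= newest_supported


# Example version strings; only the first two pass the check.
for candidate in ["1.13.1", "1.14.0", "1.15.0", "2.0.0"]:
    print(candidate, is_supported_tensorflow(candidate))
```

In the notebook, the string to check would be `tensorflow.__version__`.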

In [25]:
import gym

from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

from rl.agents.dqn import DQNAgent
from rl.memory import SequentialMemory

In [26]:
env = gym.make("FrozenLake-v0")
env.render()


SFFF
FHFH
FFFH
HFFG


In [52]:
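# Small fully connected network: 16 input values in, one output per action.
model = Sequential()

model.add(Dense(16, input_shape=[16], activation='relu'))
model.add(Dense(8, activation='relu'))
# Note: DQN output layers are usually linear, since Q-values are not probabilities;
# the softmax used here forces the four outputs to sum to 1.
model.add(Dense(4, activation='softmax'))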

In [53]:
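# window_length=16 stacks the 16 most recent observations, matching the network's 16 inputs.
memory = SequentialMemory(limit=10000, window_length=16)
agent = DQNAgent(model=model, nb_actions=env.action_space.n, memory=memory)
agent.compile(Adam(lr=.00025), metrics=['mae'])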

In [54]:
agent.fit(env, nb_steps=10000, verbose=1, log_interval=2500)

Training for 10000 steps ...
Interval 1 (0 steps performed)
187 episodes - episode_reward: 0.005 [0.000, 1.000] - loss: 0.050 - mae: 0.146 - mean_q: 0.670 - prob: 0.333

Interval 2 (2500 steps performed)
199 episodes - episode_reward: 0.005 [0.000, 1.000] - loss: 0.047 - mae: 0.162 - mean_q: 0.623 - prob: 0.333

Interval 3 (5000 steps performed)
212 episodes - episode_reward: 0.000 [0.000, 0.000] - loss: 0.044 - mae: 0.191 - mean_q: 0.532 - prob: 0.333

Interval 4 (7500 steps performed)
done, took 64.106 seconds


<keras.callbacks.callbacks.History at 0x13e5df6d8>

In [64]:
# see how well it plays

rewards = []

for _ in range(1000):
    # start each episode from the observation returned by reset(),
    # rather than spending the first step on a fixed action
    observation = env.reset()

    for step in range(500):
        action = agent.forward(observation)
        observation, reward, done, info = env.step(action)

        if done:
            # the final reward is 1 if the goal was reached, 0 otherwise
            rewards.append(reward)
            break

print("Average Score:", sum(rewards)/len(rewards))

Average Score: 0.048
