<a id='index'></a>
# Q Learning

* [Taxi v2 - a deterministic game](#taxi-v2)
    * [Training](#taxi-v2-training)
    * [Test](#taxi-v2-test)
* [FrozenLake - a stochastic game](#frozen-lake)
    * [Training](#frozen-lake-training)
    * [Test](#frozen-lake-test)

In [27]:
from typing import Sequence

import gym
import numpy as np
from tqdm import tqdm_notebook

<a id='taxi-v2'></a>
## Taxi v2 - a deterministic game

There are four locations, each labeled with a different letter. Your job is to pick up the passenger at one location and drop them off at another. You receive +20 points for a successful drop-off and lose 1 point for every timestep it takes. Illegal pick-up and drop-off actions incur a 10-point penalty.

See the full [description](https://gym.openai.com/envs/Taxi-v2/).
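To make the reward arithmetic concrete, here is a minimal sketch of the total reward for an episode that ends in a successful drop-off. The helper `taxi_episode_return` is hypothetical, and it assumes that the +20 and the 10-point penalty are paid in place of the usual -1 on the steps where they occur.

```python
def taxi_episode_return(n_steps: int, n_illegal: int = 0) -> int:
    """Total reward of a Taxi episode that ends with a successful drop-off.

    Assumption: the final step pays +20 in place of the usual -1, and each
    illegal pick-up/drop-off pays -10 in place of -1.
    """
    if n_steps < 1 or not 0 <= n_illegal < n_steps:
        raise ValueError("need n_steps >= 1 and 0 <= n_illegal < n_steps")
    n_plain = n_steps - 1 - n_illegal  # ordinary moves, -1 each
    return 20 - n_plain - 10 * n_illegal


print(taxi_episode_return(6))     # 20 - 5 = 15
print(taxi_episode_return(13))    # 20 - 12 = 8
print(taxi_episode_return(8, 1))  # 20 - 6 - 10 = 4
```

Under these assumptions, a shorter route always scores higher, and a single illegal action costs as much as nine extra moves.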

In [29]:
# Environment, states, actions, and Q table initialisation

env = gym.make('Taxi-v2')
n_states = env.observation_space.n
n_actions = env.action_space.n
Q = np.zeros((n_states, n_actions))

print('No. states: {}\nNo. actions: {}'.format(Q.shape[0], Q.shape[1]))

No. states: 500
No. actions: 6


[back to index](#index)

<a id='taxi-v2-training'></a>
### Training
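The training function in this section relies on two ingredients: the tabular Q-learning update and an exponentially decaying exploration rate. The sketch below isolates both on a toy table so the arithmetic can be checked by hand. The helpers `q_update` and `epsilon_at` and the table `Q_demo` are illustrative names, not part of the notebook, and `epsilon_at` assumes the schedule is simply floored at the minimum rate.

```python
import numpy as np


def q_update(Q, s, a, reward, s_new, alpha=0.7, gamma=0.6):
    """Apply one tabular Q-learning update in place and return the TD error."""
    td_target = reward + gamma * np.max(Q[s_new, :])  # bootstrapped target
    td_error = td_target - Q[s, a]
    Q[s, a] += alpha * td_error
    return td_error


def epsilon_at(ep, epsilon_0=1.0, epsilon_min=0.01, decay=0.01):
    """Exploration rate at episode `ep`: exponential decay with a floor."""
    return max(epsilon_min, epsilon_0 * np.exp(-decay * ep))


# Toy table: 3 states, 2 actions; state 2 already has a known good action.
Q_demo = np.zeros((3, 2))
Q_demo[2, 0] = 5.0

q_update(Q_demo, s=0, a=1, reward=-1.0, s_new=2)
print(Q_demo[0, 1])       # 0.7 * (-1 + 0.6 * 5 - 0), approximately 1.4

print(epsilon_at(0))      # 1.0: fully random at the start
print(epsilon_at(1000))   # 0.01: the floor has been reached
```

With these defaults the floor is reached after roughly 460 episodes (ln(100) / 0.01), so a long training run is spent almost entirely in near-greedy mode.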

In [36]:
# Training: building the Q-table
def train_q_table(
    env: gym.wrappers.time_limit.TimeLimit,
    Q: Sequence[Sequence[float]],
    n_episodes: int,
    n_t_periods: int,
    alpha: float=0.7, # learning rate
    gamma: float=0.6, # discount rate
    epsilon_0: float=1.0, # starting exploration rate
    epsilon_min: float=0.01, # minimum exploration rate
    decay: float=0.01,  # exponential decay rate of epsilon
    random_state: int=None
):
    rng = np.random.RandomState(random_state)

    with tqdm_notebook(total=n_episodes) as pbar:
        for ep in range(n_episodes):
            # Exponentially decay the exploration rate, floored at epsilon_min
            epsilon = max(epsilon_min, epsilon_0 * np.exp(-decay * ep))
            s = env.reset() # state
            game_over = False

            for t in range(n_t_periods):
                # Epsilon-Greedy
                if rng.rand() > epsilon:
                    a = np.argmax(Q[s, :]) # action
                else:
                    a = rng.randint(env.action_space.n)  # explore with the seeded generator

                s_new, reward, game_over, info = env.step(a)
                Q[s, a] += alpha * (reward + gamma * np.max(Q[s_new, :]) - Q[s, a])
                s = s_new

                if game_over:
                    break

            pbar.update(1)
            

train_q_table(
    env,
    Q, 
    n_episodes=50000, 
    n_t_periods=100, 
    random_state=42
)        


[back to index](#index)

<a id='taxi-v2-test'></a>
### Test

In [37]:
def test_q_table(
    env: gym.wrappers.time_limit.TimeLimit,
    Q: Sequence[Sequence[float]],
    n_episodes: int,
    n_t_periods: int
):
    rewards = np.zeros(n_episodes)  # total reward collected in each test episode

    with tqdm_notebook(total=n_episodes) as pbar:
        for ep in range(n_episodes):
            s = env.reset()
            game_over = False
            total = 0

            for t in range(n_t_periods):
                a = np.argmax(Q[s, :])
                s_new, reward, game_over, info = env.step(a)
                total += reward
                s = s_new

                if game_over:
                    break

            rewards[ep] = total
            pbar.update(1)
    return rewards
    
    
rewards = test_q_table(env, Q, n_episodes=100, n_t_periods=100)    

print('Min reward:', rewards.min())
print('Avg. reward:', rewards.mean())
print('Max reward:', rewards.max())


Min reward: 3.0
Avg. reward: 8.44
Max reward: 15.0


[back to index](#index)

<a id='frozen-lake'></a>
## FrozenLake - a stochastic game

The agent controls the movement of a character in a grid world. Some tiles of the grid are walkable, and others lead to the agent falling into the water. Additionally, the movement direction of the agent is uncertain and only partially depends on the chosen direction. The agent is rewarded for finding a walkable path to a goal tile.

See the full [description](https://gym.openai.com/envs/FrozenLake-v0/).
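The uncertainty in the movement direction can be illustrated with a small simulation. This is a sketch under stated assumptions, not the environment's own code: it assumes actions are encoded as 0 = left, 1 = down, 2 = right, 3 = up, and that the agent moves in the intended direction or in one of the two perpendicular directions with probability 1/3 each. The helper `slippery_direction` is hypothetical.

```python
import numpy as np

LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3  # assumed action encoding


def slippery_direction(intended, rng):
    """Sample the direction actually taken on slippery ice.

    Assumption: intended direction and the two perpendicular ones are
    equally likely; the opposite direction never occurs.
    """
    offset = rng.choice([-1, 0, 1])
    return int((intended + offset) % 4)


rng = np.random.RandomState(0)
draws = [slippery_direction(RIGHT, rng) for _ in range(3000)]
counts = np.bincount(draws, minlength=4)

# Empirical frequencies for LEFT, DOWN, RIGHT, UP when RIGHT was intended
print(counts / counts.sum())
```

Under these assumptions, choosing RIGHT never moves the agent LEFT, but it moves DOWN or UP about two thirds of the time, which is why even a well-trained greedy policy cannot reach the goal in every episode.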

In [38]:
# Environment, states, actions, and Q table initialisation

env = gym.make('FrozenLake-v0')
n_states = env.observation_space.n
n_actions = env.action_space.n
Q = np.zeros((n_states, n_actions))

print('No. states: {}\nNo. actions: {}'.format(Q.shape[0], Q.shape[1]))

No. states: 16
No. actions: 4


[back to index](#index)

<a id='frozen-lake-training'></a>
### Training

In [39]:
train_q_table(
    env,
    Q, 
    n_episodes=10000, 
    n_t_periods=100, 
    random_state=42
)    


[back to index](#index)

<a id='frozen-lake-test'></a>
### Test

In [40]:
rewards = test_q_table(env, Q, n_episodes=100, n_t_periods=100)    

print('Min reward:', rewards.min())
print('Avg. reward:', rewards.mean())
print('Max reward:', rewards.max())


Min reward: 0.0
Avg. reward: 0.37
Max reward: 1.0
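
If each episode pays 1 for reaching the goal and 0 otherwise, which is an assumption consistent with the minimum and maximum above, the average reward is simply the fraction of successful episodes. The sketch below computes that fraction together with a binomial standard error, which gives a rough sense of how much a 100-episode estimate can move between runs. The helper `success_summary` and the `demo` array are hypothetical and are not results from this notebook.

```python
import numpy as np


def success_summary(rewards):
    """Return (success rate, binomial standard error) for 0/1 episode rewards."""
    rewards = np.asarray(rewards, dtype=float)
    p = (rewards > 0).mean()                    # fraction of successful episodes
    se = np.sqrt(p * (1 - p) / rewards.size)    # standard error of that fraction
    return p, se


# Hypothetical outcomes of 10 test episodes (1 = goal reached)
demo = np.array([0, 1, 0, 0, 1, 0, 1, 0, 0, 0], dtype=float)
p, se = success_summary(demo)
print('Success rate: {:.2f} +/- {:.2f}'.format(p, se))
```

Calling `success_summary(rewards)` on the array returned by `test_q_table` would give the same summary for the actual test run.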
