# An introduction to Q-learning

In this notebook, we use reinforcement learning to help a SINTEF employee who has just returned to the office find their way to the printer in the newly renovated offices.

The goal of the employee (agent) is to find the most efficient route from their office to the printer.

The employee does not have a map of the office, and must explore to find a solution.

For every step the employee takes that does not end their search, they receive a reward of -1.

If the employee finds the printer, they receive a reward of 10. The episode ends.

Renovation of the offices is not yet complete, and if the employee enters a construction area, they must immediately return to their office and make a report in Lydia. This yields a reward of -10. The episode ends.
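
To make the reward scheme concrete, the sketch below computes the total (undiscounted) reward of an episode from its length and how it ends. It assumes, as the rules above suggest, that the final step yields only the terminal reward and not an additional -1. The function name `episode_return` and the example path lengths are made up for illustration and are not part of `office_gym`.

```python
STEP_REWARD = -1           # every step that does not end the episode
PRINTER_REWARD = 10        # final step onto the printer
CONSTRUCTION_REWARD = -10  # final step into a construction area


def episode_return(n_steps, ends_at_printer):
    """Total undiscounted reward of an episode with n_steps >= 1 steps."""
    final_reward = PRINTER_REWARD if ends_at_printer else CONSTRUCTION_REWARD
    # All steps except the last one cost STEP_REWARD each
    return (n_steps - 1) * STEP_REWARD + final_reward


print(episode_return(6, ends_at_printer=True))    # 5
print(episode_return(12, ends_at_printer=True))   # -1
print(episode_return(3, ends_at_printer=False))   # -12
```

A shorter route to the printer always gives a higher total reward, which is why maximizing reward corresponds to finding the most efficient route.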




In [None]:
import numpy as np

import office_gym

In the file `office_gym.py` we find the `OfficeEnv` class. An instance of this class is an environment suited for reinforcement learning. The following methods are worth noting:

`OfficeEnv.step(action)`: Applies an action that transitions the environment to the next state, and returns the new state, the reward, a done flag, and an info object.

`OfficeEnv.reset(seed)`: Resets the environment, returning to the initial state and resetting the random state.

`OfficeEnv.render(mode)`: Displays the current state of the environment.

`OfficeEnv.showmap(mode)`: Displays a map of the environment.


### The environment has an 'easy' and a 'hard' mode. We will start with the easy one.

In [None]:
env_easy = office_gym.OfficeEnv(mode='easy')
env_easy.showmap();

### A very simple way to solve this problem is to try random actions for a number of episodes and keep the best sequence of actions we encounter:

In [None]:
class RandomAgent():
    def __init__(self, actions, seed=None):
        self.actions = actions
        self.best_action_sequence = []
        self.highest_reward_episode = -np.inf
        self.rng = np.random.default_rng(seed)

    def get_action(self):
        return self.rng.choice(self.actions)


In [None]:
agent = RandomAgent(actions=np.arange(4), seed=3)

# Number of episodes to search before we time out
nepisodes = 100

for i in range(nepisodes):
    state = env_easy.reset()

    reward_episode = 0
    done = False
    actions = []

    while not done:
        action = agent.get_action()

        state, reward, done, info = env_easy.step(action)

        reward_episode += reward
        actions.append(action)

    if reward_episode > agent.highest_reward_episode:
        agent.best_action_sequence = actions
        agent.highest_reward_episode = reward_episode
        print(f'Episode {i}: Found new best path of length {len(actions)} with reward {reward_episode}.')


In [None]:
reward_episode, steps, animated = office_gym.perform_action_sequence(
    agent.best_action_sequence, env_easy, render=True)
animated

### Now let us create an agent that estimates the Q-value of each state-action pair it observes

For convenience, we repeat the Q-value update rule, where $\alpha$ is the learning rate and $\gamma$ is the discount factor: $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left(R_t + \gamma \max_{a}Q(s_{t+1}, a) - Q(s_t, a_t)\right)$
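
As a worked example of the update rule, the sketch below applies a single update to a tiny, made-up Q-table with three states and two actions. The table contents and the observed transition are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np

alpha, gamma = 0.5, 0.99  # learning rate and discount factor

# Hypothetical Q-table: 3 states, 2 actions, mostly zeros
qtable = np.zeros((3, 2))
qtable[1] = [2.0, 4.0]

# Hypothetical transition: in state 0, take action 1, get reward -1, land in state 1
s, a, r, s_next = 0, 1, -1, 1

td_target = r + gamma * np.max(qtable[s_next])  # -1 + 0.99 * 4 = 2.96
td_error = td_target - qtable[s, a]             # 2.96 - 0 = 2.96
qtable[s, a] += alpha * td_error                # 0 + 0.5 * 2.96 = 1.48

print(qtable[s, a])  # approximately 1.48
```

With a learning rate of 0.5, the estimate moves halfway from its old value towards the target.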

In [None]:

class QLearningAgent():
    def __init__(self, nstates, actions, learning_rate=0.8, discount_factor=0.99, seed=None):
        self.actions = actions
        self.qtable = np.zeros((nstates, len(actions)))
        self.learning_rate = learning_rate
        self.discount_factor = discount_factor
        self.rng = np.random.default_rng(seed)

    def get_action(self, state, epsilon=0):
        if self.rng.uniform(0, 1) < epsilon:
            action = self.rng.choice(self.actions) # Explore action space
        else:
            action = np.argmax(self.qtable[state]) # Exploit learned value
        return action
    
    def update_qtable(self, current_state, next_state, action, reward):
        old_qvalue = self.qtable[current_state, action]
        new_qvalue = (old_qvalue + self.learning_rate*(reward
                                                       + self.discount_factor * np.max(self.qtable[next_state])
                                                       - old_qvalue))
        self.qtable[current_state, action] = new_qvalue


In [None]:
agent = QLearningAgent(nstates=env_easy.observation_space.n,
                       actions=np.arange(env_easy.action_space.n),
                       learning_rate=0.5,
                       discount_factor=0.99,
                       seed=5)

epsilon = 0.99

for i in range(100):
    state = env_easy.reset()

    steps, reward = 0, 0
    done = False
    
    while not done:
        action = agent.get_action(state, epsilon)
        next_state, reward, done, info = env_easy.step(action)
        agent.update_qtable(state, next_state, action, reward)
        state = next_state
    # Decay the exploration rate after each episode, with a floor of 0.2
    epsilon = max(0.2, epsilon**2)
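
The exploration schedule `epsilon = max(0.2, epsilon**2)` decays quickly. As a small standalone sketch, the snippet below lists the value of epsilon used in each of the first ten episodes when starting from 0.99, as in the training loop.

```python
epsilon = 0.99
schedule = []  # epsilon used in episode 0, 1, 2, ...

for episode in range(10):
    schedule.append(epsilon)
    # Same decay rule as in the training loop: square, but never below 0.2
    epsilon = max(0.2, epsilon**2)

print([round(e, 3) for e in schedule])
```

Repeated squaring makes the decay accelerate: epsilon stays close to 1 for the first few episodes and then drops to the floor of 0.2 from the ninth episode onward, so most of the 100 training episodes run with 20% random actions.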


In [None]:
office_gym.visualize_qvalues(agent, env_easy);

In [None]:
reward_episode, steps, animated = office_gym.play_episode(agent, env_easy, render=True, seed=None)
animated

## Spicing up the environment

As we all know, printers are not perfect machines, and there is always a risk of a printer not working.

We will now study the case where there are two printers in the office. The probability of a printer working varies between printers. In this case the closest printer works 30% of the time, while the printer further away works 90% of the time.

If the employee tries to use a printer and it works, they receive a reward of 10. However, if the printer does not work, the employee must return to their desk and make a report in Origo, which yields a reward of -10 (and ends the episode).
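
To see why the closest printer is not necessarily the best choice, the sketch below computes the expected reward of attempting to use each printer, based on the probabilities and rewards stated above. The function name `expected_printer_reward` is hypothetical, and the calculation ignores the -1 step costs along the way.

```python
def expected_printer_reward(p_works, reward_works=10, reward_broken=-10):
    """Expected reward of trying a printer that works with probability p_works."""
    return p_works * reward_works + (1 - p_works) * reward_broken


near = expected_printer_reward(0.3)  # 0.3 * 10 + 0.7 * (-10), about -4
far = expected_printer_reward(0.9)   # 0.9 * 10 + 0.1 * (-10), about 8

print(near, far)
```

The expected reward at the far printer is about 12 higher, so, ignoring discounting and other sources of randomness, walking to it pays off as long as it costs fewer than roughly 12 extra steps.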

In [None]:
env_hard = office_gym.OfficeEnv(mode='hard')
env_hard.showmap();

### Oh no! Someone spilled coffee on the floor!

Spaces with spilled coffee are slippery. When the agent moves away from such a space, it moves in the intended direction with probability 1/3, and in each of the two perpendicular directions with probability 1/3.
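
The slippery behaviour can be illustrated with a small simulation. This is only a sketch of the rule described above, not the implementation in `office_gym`: the function `slippery_outcome` and the action encoding (0 = up, 1 = right, 2 = down, 3 = left) are assumptions made for this example.

```python
import numpy as np

# Hypothetical action encoding, clockwise: 0 = up, 1 = right, 2 = down, 3 = left
N_ACTIONS = 4


def slippery_outcome(intended, rng):
    """Sample the direction actually taken when leaving a slippery space."""
    # With a clockwise encoding, the perpendicular directions are the neighbours
    candidates = [intended,
                  (intended - 1) % N_ACTIONS,
                  (intended + 1) % N_ACTIONS]
    return int(rng.choice(candidates))  # each candidate has probability 1/3


rng = np.random.default_rng(0)
outcomes = [slippery_outcome(0, rng) for _ in range(3000)]  # always intend "up"
counts = np.bincount(outcomes, minlength=N_ACTIONS)

print(counts / counts.sum())  # about 1/3 each for up, right, left; 0 for down
```

Since the outcome of an action is now random, a single observed transition says less about the true value of a state-action pair, which is one reason to train for many more episodes in the hard mode.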

In [None]:
agent = QLearningAgent(nstates=env_hard.observation_space.n,
                       actions=np.arange(env_hard.action_space.n),
                       learning_rate=0.01,
                       discount_factor=0.9)

epsilon = 0.99
episodes = 20000
ntestepisodes = 100

showvalues = [100, 1000, 10000, 20000]

for i in range(1, episodes+1):
    state = env_hard.reset()

    steps, reward = 0, 0
    done = False
    
    while not done:
        action = agent.get_action(state, epsilon)
        next_state, reward, done, info = env_hard.step(action)
        agent.update_qtable(state, next_state, action, reward)
        state = next_state
    if i in showvalues:
        fig = office_gym.visualize_qvalues(agent, env_hard)
        fig.suptitle(f'Max Q-values after {i} episodes.')
        fig.tight_layout()
        
    if i % 1000 == 0:
        epsilon = max(0.25, epsilon**2)
        avg_reward = 0
        if ntestepisodes > 0:
            for j in range(ntestepisodes):
                reward_episode, steps, _ = office_gym.play_episode(agent, env_hard, render=False, seed=None)
                avg_reward += reward_episode
            avg_reward /= ntestepisodes

            print(f'Episode {i}: Average test episode reward {avg_reward}')



In [None]:
reward_episode, steps, animated = office_gym.play_episode(agent, env_hard, render=True, seed=None)
animated