# Q-Learning Example

This notebook demonstrates how a reinforcement learning simulation can be built using rlutils. The following code outlines how agents and policies are combined, how an agent is trained on a task, and how different logging classes are integrated into the simulation.

We will use the [Q-learning algorithm](https://link.springer.com/article/10.1007/BF00992698) and the puddle world task (a variation of the tasks presented by [Boyan and Moore, 1995](https://www.ri.cmu.edu/pub_files/pub1/boyan_justin_1995_1/boyan_justin_1995_1.pdf)) as a simple running example. Note that the purpose of rlutils is to provide a framework for designing RL experiments, not the algorithms themselves. The same patterns could be used to design more complex simulations, including ones that integrate deep neural networks, but the user would have to implement these RL agents.

OpenAI gym environments are compatible with rlutils, and all classes that implement a task are sub-classes of `gym.Env`.

First we import the needed libraries. The `set_seeds` function sets the seed for all libraries used in rlutils, including numpy.

In [1]:
import numpy as np
import plotly.graph_objects as go
import rlutils as rl
rl.set_seeds(12345)

The puddle world task is a 10x10 grid world, where the agent has to navigate from the start state to the goal state but avoid a puddle:

<img src="assets/puddle_world_map.png" width=300px>

Once the green goal state is entered, the agent receives a reward of +1 and the interaction between the agent and task is terminated. Every time a puddle cell is entered, the agent receives a reward penalty of -1.

In [9]:
env = rl.environment.PuddleWorld()

We will use $\varepsilon$-greedy exploration and linearly anneal the exploration parameter $\varepsilon$ from 1 (uniform random action selection) to 0 (greedy action selection). In rlutils, a hyper-parameter that changes with an integer index is implemented using a schedule class. In this example, we use the linearly interpolated schedule implemented in `rl.schedule.LinearInterpolatedVariableSchedule`.

In [10]:
exploration_schedule = rl.schedule.LinearInterpolatedVariableSchedule([0, 120], [1., 0.])

go.Figure(
    data=go.Scatter(
        x=np.arange(140) + 1, 
        y=[exploration_schedule(i) for i in range(140)]
    ),
    layout=dict(
        xaxis_title='Episode',
        yaxis_title='Exploration Value'
    )
)
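
To make the shape of this schedule concrete, the following minimal sketch reproduces a linear interpolation between the breakpoints `[0, 120]` and the values `[1., 0.]` using `np.interp`. It is not the rlutils implementation: the function name `linear_schedule` is hypothetical, and holding the value constant outside the breakpoints is an assumption made for this illustration.

```python
import numpy as np

def linear_schedule(index, breakpoints=(0, 120), values=(1.0, 0.0)):
    """Hypothetical stand-in for a linearly interpolated schedule."""
    # np.interp interpolates linearly between the breakpoints and returns
    # the first/last value for indices outside of them (an assumption about
    # how the schedule should behave, not a statement about rlutils).
    return float(np.interp(index, breakpoints, values))

# Exploration values at a few episode indices.
epsilons = [linear_schedule(i) for i in (0, 30, 60, 120, 139)]
```

Under these assumptions, the value drops by 1/120 per episode until index 120 and stays at zero afterwards.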

Next, we construct the Q-learning agent by instantiating an `rl.agent.QLearning` object. In rlutils, all agents are sub-classes of the class `rl.agent.Agent`. This abstract class defines the interface between an RL agent and the remaining functions of rlutils.

In [4]:
agent = rl.agent.QLearning(env.num_states(), env.num_actions(), learning_rate=0.8, gamma=0.9)
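
For readers unfamiliar with the algorithm, the sketch below shows one tabular Q-learning update with the same learning rate and discount factor as above. It only illustrates the standard update rule and is not taken from rlutils: the function name `q_learning_update` and the `(num_states, num_actions)` table layout are assumptions.

```python
import numpy as np

def q_learning_update(q, s, a, r, s_next, terminal, learning_rate=0.8, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the TD error."""
    # Bootstrap from the best next-state value unless the episode has ended.
    target = r if terminal else r + gamma * np.max(q[s_next])
    td_error = target - q[s, a]
    q[s, a] += learning_rate * td_error
    return td_error

# Toy table with 3 states and 2 actions (sizes chosen only for illustration).
q = np.zeros((3, 2))
# Terminal transition with reward +1: q[1, 0] moves 80% of the way to 1.
q_learning_update(q, s=1, a=0, r=1.0, s_next=2, terminal=True)
# Non-terminal transition into state 1: the target is gamma * max(q[1]).
q_learning_update(q, s=0, a=1, r=0.0, s_next=1, terminal=False)
```

The second update shows how the reward information propagates backwards: state 0 receives a discounted share of the value that state 1 learned in the first update.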

An $\varepsilon$-greedy exploration policy is constructed using the class `rl.policy.EGreedyPolicy`. All policies are sub-classes of the abstract class `rl.policy.Policy`, which provides a common interface for all policies in rlutils. The $\varepsilon$-greedy exploration policy only needs the Q-values at a particular state to select an action. For finite action spaces, a policy that only needs Q-values to select an action should always sub-class `rl.policy.ValuePolicy`. Consequently, `rl.policy.EGreedyPolicy` is a sub-class of `rl.policy.ValuePolicy`.

In [5]:
policy = rl.policy.EGreedyPolicy(agent, 1.0)
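
The selection rule itself is short. The following sketch shows $\varepsilon$-greedy action selection from a vector of Q-values; it is an illustration rather than the rlutils code, and the function name `epsilon_greedy` and the explicit `rng` argument are assumptions.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a uniformly random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        # Explore: every action is equally likely.
        return int(rng.integers(len(q_values)))
    # Exploit: np.argmax returns the first index if several actions tie.
    return int(np.argmax(q_values))

rng = np.random.default_rng(0)
q_values = np.array([0.1, 0.7, 0.2, 0.0])
greedy_action = epsilon_greedy(q_values, epsilon=0.0, rng=rng)  # always the argmax
random_action = epsilon_greedy(q_values, epsilon=1.0, rng=rng)  # uniform over 4 actions
```

With `epsilon=1.0` the Q-values are ignored entirely, which matches the initial setting of the policy constructed above.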

The interaction between an MDP and an agent can be simulated using the function `rl.data.simulate`. The first parameter passed into this function is the MDP `env`, and the second parameter is the `policy` used for action selection. This function resets the MDP and simulates one episode until a terminal state is reached or until the maximum number of steps is reached (5000 steps in the example below). If this timeout threshold is hit, an `rl.data.SimulationTimeout` exception is raised.

The interaction between MDP and agent is a sequence of `(state, action, reward, next_state, terminal_flag, info_dict)` sextuples. To access this data during simulation, either to train an agent online or to log the trajectory itself, a transition listener can be passed into the function `rl.data.simulate`. All transition listeners are sub-classes of `rl.data.TransitionListener`. The Q-learning agent is also a transition listener, because this agent learns online and updates its values based on each observed transition.

To train the Q-learning agent and also log the length of each episode, the following code constructs an aggregate listener using the `rl.data.transition_listener` function. Internally, the `update` object will update the Q-learning agent and the episode length logger with each observed transition. This aggregate listener is then passed into the `rl.data.simulate` function.

After each episode, the $\varepsilon$ parameter of the exploration policy is updated using the exploration schedule.

In [7]:
logger_ep_len = rl.logging.LoggerEpisodeLength()
update = rl.data.transition_listener(agent, logger_ep_len)

for i in range(140):
    try:
        rl.data.simulate(env, policy, update, max_steps=5000)
    except rl.data.SimulationTimout:
        logger_ep_len.finish_trajectory()
    policy.set_epsilon(exploration_schedule(i))
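
The listener pattern used in this loop can be sketched without rlutils. In the toy example below, every class, function, and method name other than `finish_trajectory` is hypothetical, the corridor task is invented purely for illustration, and Python's built-in `TimeoutError` stands in for the library's timeout exception.

```python
class EpisodeLengthLogger:
    """Hypothetical listener that counts the transitions in each episode."""

    def __init__(self):
        self.lengths = []
        self._steps = 0

    def update_transition(self, s, a, r, s_next, terminal, info):
        self._steps += 1
        if terminal:
            self.finish_trajectory()

    def finish_trajectory(self):
        # Store the finished episode's length and start counting afresh.
        self.lengths.append(self._steps)
        self._steps = 0


class AggregateListener:
    """Forwards every transition to each wrapped listener in order."""

    def __init__(self, *listeners):
        self.listeners = listeners

    def update_transition(self, *transition):
        for listener in self.listeners:
            listener.update_transition(*transition)


class ChainTask:
    """Toy corridor: every step moves one state to the right, state 4 is terminal."""

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        self.state += 1
        terminal = self.state == 4
        return self.state, float(terminal), terminal, {}


def simulate(task, select_action, listener, max_steps=5000):
    """Run one episode and report each transition to the listener."""
    s = task.reset()
    for _ in range(max_steps):
        a = select_action(s)
        s_next, r, terminal, info = task.step(a)
        listener.update_transition(s, a, r, s_next, terminal, info)
        if terminal:
            return
        s = s_next
    raise TimeoutError("episode exceeded max_steps")


logger_a, logger_b = EpisodeLengthLogger(), EpisodeLengthLogger()
listener = AggregateListener(logger_a, logger_b)
for _ in range(3):
    simulate(ChainTask(), lambda s: 0, listener)
```

Because the aggregate forwards each transition to all wrapped listeners, both loggers record the same episode lengths. A learning agent could take the place of one logger without changing the simulation loop, which is the design idea behind passing a single listener into the simulation function.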

We now plot the episode length as a function of episode index. This information is stored in the episode length logger. The plot below shows how the episode length changes over the course of training.

In [8]:
episode_length = logger_ep_len.get_episode_length()
go.Figure(
    data=go.Scatter(
        x=np.arange(140) + 1, 
        y=episode_length
    ),
    layout=dict(
        xaxis_title='Episode',
        yaxis_title='Episode Length'
    )
)