## RLDMUU 2025
#### Q-learning and SARSA
jakub.tluczek@unine.ch

Today we are going to implement two fundamental temporal-difference algorithms: Q-learning and SARSA. Both algorithms choose actions based on the $Q(s,a)$ function, but update it in different ways. The Q-learning update is:

$$ Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a)\right] $$

SARSA, in contrast, updates $Q(s,a)$ as follows:

$$ Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma Q(s', a') - Q(s,a) \right] $$

where $\alpha$ is the learning rate and $\gamma$ is the discount factor.

The difference between the two arises when computing the discounted value of $Q$ for the next state. Since $Q : S \times A \rightarrow \mathbb{R}$, we need to pick a next action. We can either maximize over all actions, in an off-policy fashion (Q-Learning), or assume that the next action will be picked by the same policy $\pi$ we are currently following, in an on-policy fashion (SARSA).
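
To make the difference concrete, here is a minimal sketch of the two targets on a toy Q-table. The table values, the hyperparameters and the helper names (`q_learning_target`, `sarsa_target`, `td_update`) are assumptions made purely for illustration; they are not part of the exercise skeleton.

```python
# Toy Q-table: 2 states x 2 actions, values chosen only for illustration.
Q = [
    [0.0, 0.0],  # state 0
    [1.0, 5.0],  # state 1
]
alpha, gamma = 0.5, 0.9  # assumed learning rate and discount factor


def q_learning_target(Q, reward, next_state, gamma):
    # Off-policy: bootstrap from the best action available in the next state.
    return reward + gamma * max(Q[next_state])


def sarsa_target(Q, reward, next_state, next_action, gamma):
    # On-policy: bootstrap from the action the policy actually picked.
    return reward + gamma * Q[next_state][next_action]


def td_update(Q, state, action, target, alpha):
    # Move Q(s, a) a fraction alpha of the way towards the target.
    Q[state][action] += alpha * (target - Q[state][action])


# One transition: s=0, a=0, r=1, s'=1, and the policy picked a'=0 in s'.
q_target = q_learning_target(Q, reward=1.0, next_state=1, gamma=gamma)
s_target = sarsa_target(Q, reward=1.0, next_state=1, next_action=0, gamma=gamma)

# Apply the Q-learning version of the update to Q(0, 0).
td_update(Q, state=0, action=0, target=q_target, alpha=alpha)
```

Here the policy's next action is not the greedy one, so the two targets differ. Whenever $a'$ happens to be the greedy action, both targets coincide.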

Your task is to program both Q-Learning and SARSA from scratch:

In [None]:
class QLearning:
    def __init__(self, n_states, n_actions, alpha, gamma, epsilon):
        # TODO: Initialize the class
        pass 

    def act(self):
        # TODO: pick the action
        pass

    def update(self, action, reward, next_state):
        # TODO: Update the Q-table
        pass

    def reset(self):
        # TODO: Reset the Q-tables 
        pass

In [None]:
class SARSA:
    def __init__(self, n_states, n_actions, alpha, gamma, epsilon):
        # TODO: Initialize the class
        pass 

    def act(self):
        # TODO: pick the action
        pass

    def update(self, action, reward, next_state):
        # TODO: Update the Q-table
        pass

    def reset(self):
        # TODO: Reset the Q-tables 
        pass
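
Both skeletons take an `epsilon` argument, which suggests $\epsilon$-greedy action selection, although the exercise does not prescribe a particular exploration scheme. Under that assumption, a minimal sketch of such a policy over one row of the Q-table could look as follows. The function name `epsilon_greedy` and the random tie-breaking are illustrative choices, not part of the skeleton.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index given the Q-values of the current state."""
    # Explore: with probability epsilon take a uniformly random action.
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Exploit: take a greedy action, breaking ties at random so that
    # the lowest-numbered action is not systematically favoured.
    best = max(q_values)
    best_actions = [a for a, q in enumerate(q_values) if q == best]
    return rng.choice(best_actions)


# Example: Q-values of a single state with three actions.
greedy_action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0)
```

With `epsilon=0.0` the choice is purely greedy; with `epsilon=1.0` it is purely random.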

### Gymnasium

Now let's introduce the Python framework that you are going to work with over the course of this semester, namely `gymnasium`, the successor of OpenAI's `gym`. Let's go through the basic functionality of `gymnasium`-based environments. First, let's create a Frozen Lake environment:

In [10]:
import gymnasium as gym

env = gym.make('FrozenLake-v1')

Before the first use, and after each episode, we have to reset the environment. The `reset()` method returns the state representation and an additional dictionary `info`, in case we want to collect some additional data about the environment. For now we will ignore it.

In [11]:
state, info = env.reset()
# useful for checking if the environment terminated
done = False

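*Hint*: In order to create the Q-tables we have to know the sizes of the state and action spaces. We can inspect the spaces as shown here; for `Discrete` spaces the number of elements is available as the `.n` attribute: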

In [13]:
print(f"Observation space size: {env.observation_space}")
print(f"Action space size: {env.action_space}")

Observation space size: Discrete(16)
Action space size: Discrete(4)
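
Given those sizes, one possible layout for a Q-table is one row per state and one column per action. The sketch below uses plain nested lists and a hypothetical helper `make_q_table`; zero initialization is an assumption, and a NumPy array created with `np.zeros((n_states, n_actions))` is a common alternative.

```python
def make_q_table(n_states, n_actions, initial_value=0.0):
    """Return an n_states x n_actions table filled with initial_value."""
    # Build each row separately so that rows do not share the same list.
    return [[initial_value] * n_actions for _ in range(n_states)]


# Sizes taken from the output above: Discrete(16) states, Discrete(4) actions.
# With the environment at hand one would pass env.observation_space.n
# and env.action_space.n instead of the literals.
q_table = make_q_table(16, 4)
```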


Now let's act on the environment and observe the results (for now we simply sample a random action). We pass an action to the `step` method and observe the following:

- `next_state` to which we transition
- `reward` received
- `done` signal, indicating whether the episode terminated (the Gymnasium documentation calls this value `terminated`)
- `truncated` signal, indicating whether a timeout or other external constraint has been reached
- `info` dict with supplementary information

In [14]:
action = env.action_space.sample()

next_state, reward, done, truncated, info = env.step(action)
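
A single `step` call rarely finishes an episode, so in practice the call sits inside a loop that runs until `done` or `truncated` is set. The sketch below shows one way to write such a loop. The function `run_episode`, its `choose_action` callback and the `max_steps` safety cap are illustrative assumptions, and the loop only relies on the `reset` and `step` return values described above.

```python
def run_episode(env, choose_action, max_steps=1000):
    """Run one episode and return the total (undiscounted) reward."""
    state, info = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = choose_action(state)
        next_state, reward, done, truncated, info = env.step(action)
        total_reward += reward
        state = next_state
        # Stop on termination or on a time limit / external constraint.
        if done or truncated:
            break
    return total_reward
```

With the environment created earlier, a random rollout would be `run_episode(env, lambda state: env.action_space.sample())`. A learning agent would additionally update its Q-table inside the loop.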

Your task is to use both Q-Learning and SARSA to learn the optimal policy for an agent acting in this environment. After you're done, plot the rewards.
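
Per-episode rewards tend to be noisy, so it can help to smooth them before plotting. The helper below is a minimal sketch of a sliding-window average; the name `moving_average` and the choice of window are assumptions, and the toy reward list is made up for illustration.

```python
def moving_average(values, window):
    """Return the mean of every consecutive `window`-sized slice of `values`."""
    if window <= 0:
        raise ValueError("window must be positive")
    averages = []
    running_sum = 0.0
    for i, value in enumerate(values):
        running_sum += value
        # Drop the element that just left the window.
        if i >= window:
            running_sum -= values[i - window]
        # Emit an average once the first full window is available.
        if i >= window - 1:
            averages.append(running_sum / window)
    return averages


# Toy per-episode rewards, not real results.
episode_rewards = [0, 1, 1, 0, 1]
smoothed = moving_average(episode_rewards, window=2)
```

The resulting list can then be passed to a plotting call such as `matplotlib.pyplot.plot`, once for each algorithm, to compare the two learning curves.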