# 03 Algorithms

## 3.7 Q-Learning

### Q-Learning
Q-Learning:
$$
\begin{cases}
q_{t+1}(s_t, a_t) = q_t(s_t,a_t) - \alpha_t(s_t, a_t)[q_t(s_t,a_t) - (r_{t+1} + \gamma \max_{a \in \cal A(s_{t+1})}q_t(s_{t+1}, a))] & (s, t)=(s_t, a_t) \\
q_{t+1}(s, a) = q_t(s, a) & (s, a) \neq (s_t, a_t)
\end{cases}
$$

其中，$t=0,1,2,...$，$\alpha_t(s_t, a_t)$是一个很小的正数，代表学习率。

Q-Learning算法与Sarsa算法非常类似，区别在于更新Q值时，Sarsa是根据下一个状态和动作来计算的，而Q-Learning则是根据下一个状态的所有可能的动作的最大Q值来计算，即TD Target有差异：
- Sarsa TD Target: $r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1})$
- Q-Learning TD Target: $r_{t+1} + \gamma \max_{a' \in \cal A(s_{t+1})} q_t(s_{t+1}, a')$

Sarsa需要知道下一时刻$t+1$的$(r_{t+1}, s_{t+1}, a_{t+1})$；而Q-Learning仅需知道下一时刻$t+1$的$(r_{t+1}, s_{t+1})$。因此，Sarsa是一种On-Policy算法($a_{t+1}$仍然来自于当前策略)；而Q-Learning是一种Off-Policy算法($a'$的产生来自于greedy策略)，既可以用于on-policy也可以用于off-policy。

### Q-Learning Algorithm(On-Policy Version)
Goal：学习一个能够指导智能体从初始状态$s_0$到达目标状态的最优路径
- 初始化Q值表$q_0(s, a)$、初始化$\epsilon$-greedy策略$\pi_0(a|s)$、学习率$\alpha$、折扣因子$\gamma$、探索概率$\epsilon$；
- 对于每一个episode:
    - 如果$s_t(t=0,1,2,...)$不是终止状态，则：
      - 收集状态$s_t$上的经验样本$(a_t, r_{t+1}, s_{t+1})$：遵循策略$\pi(a|s)$，采取动作$a_t$，与环境交互得到奖励$r_{t+1}$，进入状态$s_{t+1}$；
      - 更新Q值表$(s_t, a_t)$：
        - $q_{t+1}(s_t, a_t) \leftarrow q_t(s_t, a_t) - \alpha [q_t(s_t, a_t) - (r_{t+1} + \gamma max_{a'} q_t(s_{t+1}, a'))]$
      - 更新策略$\pi(a|s)$：
        - $a = \arg\max_a q_{t+1}(s_t, a)$， $\pi_{t+1}(a|s_t)=1-\frac{\epsilon}{|\cal A(s_t)|}(|\cal A(s_t)| - 1)$，否则$\pi_{t+1}(a|s_t) = \frac{\epsilon}{|\cal A(s_t)|}$

### Example

In [1]:
import time

import numpy as np
import gymnasium as gym
from tqdm import tqdm

In [2]:
class QLearningOnPolicy:
    """ Q-Learning On-Policy Algorithm """
    def __init__(self, env, alpha=0.1, gamma=0.9, epsilon=0.1, epsilon_decay=0.99):
        self.env = env
        self.alpha = alpha
        self.gamma = gamma
        self.epsilon = epsilon
        self.epsilon_decay = epsilon_decay

        self.returns = []
        self.q_tables = np.zeros((env.observation_space.n, env.action_space.n))
        self.policy = np.ones((env.observation_space.n, env.action_space.n)) / env.action_space.n

    @staticmethod
    def custom_reward(done, reward):
        if done and reward == 1:
            return 10
        elif done and reward == 0:
            return -5
        else:
            return -0.1

    def take_action(self, state):
        """ Take action according to the current policy. """

        return np.random.choice(range(self.env.action_space.n), p=self.policy[state])

    def best_action(self, state):
        """ Return the best action based on the Q-table """

        return np.argmax(self.q_tables[state])

    def update_policy_and_values(self, state, action, reward, next_state, done):
        """ Update the policy based on the Q-table. """

        # Update the Q-value for the state and action pair
        td_target = reward + self.gamma * self.q_tables[next_state][self.best_action(next_state)]
        td_error = self.q_tables[state][action] - td_target
        self.q_tables[state][action] -= self.alpha * td_error

        # Update the policy based on the Q-table
        best_action = self.best_action(state)
        policy = np.ones(self.env.action_space.n) * self.epsilon / self.env.action_space.n
        policy[best_action] = 1 - self.epsilon / self.env.action_space.n * (self.env.action_space.n - 1)
        self.policy[state] = policy

    def train(self, episodes=1000):
        """ Train the agent using Q-learning. """

        for i in range(10):
            with tqdm(total=episodes // 10, desc=f'Episode {i + 1}') as pbar:
                for episode in range(episodes // 10):
                    state, info = self.env.reset()
                    action = self.take_action(state)
                    done = False

                    gamma_power = 1
                    episode_return = 0
                    while not done:
                        next_state, reward, terminated, truncated, info = self.env.step(action)

                        done = terminated or truncated
                        reward = self.custom_reward(done, reward)

                        self.update_policy_and_values(state, action, reward, next_state, done)
                        state, action = next_state, self.best_action(next_state)

                        episode_return += reward * gamma_power
                        gamma_power *= self.gamma

                    self.returns.append(episode_return)
                    if (episode + 1) % 10 == 0:
                        pbar.set_postfix(
                            {
                                'epoch': episodes / 10 * i + episode + 1,
                                'return': np.mean(self.returns),
                                'epsilon': self.epsilon
                            }
                        )
                    pbar.update(1)

                    self.epsilon *= self.epsilon_decay
                    self.epsilon = max(self.epsilon, 0.01)


    def visualize_policy(self, delay=0.5):
        """ Visualize the policy learned by the agent """
        state, info = self.env.reset()
        done = False

        while not done:
            self.env.render()
            action = np.argmax(self.policy[state])
            state, reward, terminated, truncated, info = self.env.step(action)
            done = terminated or truncated
            time.sleep(delay)

        self.env.render()
        self.env.close()

In [3]:
environment = gym.make('FrozenLake-v1', desc=None, map_name='4x4', is_slippery=True, render_mode='rgb_array')
environment.reset()

(0, {'prob': 1})

In [4]:
agent = QLearningOnPolicy(environment, gamma=0.9, epsilon=0.99, alpha=0.1, epsilon_decay=0.99)

In [5]:
agent.train(1000)
print(f"Optimal policy: {agent.policy}")
print(f"Optimal Q-tables: {agent.q_tables}")

Episode 1: 100%|██████████| 100/100 [00:00<00:00, 489.88it/s, epoch=100, return=-1.02, epsilon=0.366]
Episode 2: 100%|██████████| 100/100 [00:00<00:00, 468.57it/s, epoch=200, return=-0.904, epsilon=0.134]
Episode 3: 100%|██████████| 100/100 [00:00<00:00, 535.33it/s, epoch=300, return=-0.767, epsilon=0.049]
Episode 4: 100%|██████████| 100/100 [00:00<00:00, 486.34it/s, epoch=400, return=-0.667, epsilon=0.018]
Episode 5: 100%|██████████| 100/100 [00:00<00:00, 452.41it/s, epoch=500, return=-0.657, epsilon=0.01] 
Episode 6: 100%|██████████| 100/100 [00:00<00:00, 427.75it/s, epoch=600, return=-0.636, epsilon=0.01]
Episode 7: 100%|██████████| 100/100 [00:00<00:00, 442.73it/s, epoch=700, return=-0.6, epsilon=0.01] 
Episode 8: 100%|██████████| 100/100 [00:00<00:00, 488.31it/s, epoch=800, return=-0.598, epsilon=0.01]
Episode 9: 100%|██████████| 100/100 [00:00<00:00, 501.82it/s, epoch=900, return=-0.552, epsilon=0.01]
Episode 10: 100%|██████████| 100/100 [00:00<00:00, 482.37it/s, epoch=1e+3, retu

Optimal policy: [[0.9925 0.0025 0.0025 0.0025]
 [0.0025 0.0025 0.0025 0.9925]
 [0.9925 0.0025 0.0025 0.0025]
 [0.0025 0.0025 0.0025 0.9925]
 [0.9925 0.0025 0.0025 0.0025]
 [0.25   0.25   0.25   0.25  ]
 [0.9925 0.0025 0.0025 0.0025]
 [0.25   0.25   0.25   0.25  ]
 [0.0025 0.0025 0.0025 0.9925]
 [0.0025 0.9925 0.0025 0.0025]
 [0.9925 0.0025 0.0025 0.0025]
 [0.25   0.25   0.25   0.25  ]
 [0.25   0.25   0.25   0.25  ]
 [0.0025 0.0025 0.9925 0.0025]
 [0.0025 0.9925 0.0025 0.0025]
 [0.25   0.25   0.25   0.25  ]]
Optimal Q-tables: [[-0.52765605 -1.12882894 -1.24917297 -1.12042535]
 [-1.61576402 -1.76066102 -1.72640039 -0.66240109]
 [-1.16214959 -1.53644207 -1.53728894 -1.59172428]
 [-1.76609981 -1.70658526 -1.73265083 -1.12689073]
 [-0.38094698 -0.96539    -0.94199617 -1.00023535]
 [ 0.          0.          0.          0.        ]
 [-2.21006926 -3.25573529 -3.24559806 -3.22330874]
 [ 0.          0.          0.          0.        ]
 [-0.56172665 -0.51791    -0.53729618  0.04783567]
 [-0.5    




In [6]:
environment = gym.make('FrozenLake-v1', desc=None, map_name='4x4', is_slippery=True, render_mode='human')
environment.reset()

agent.env = environment
agent.visualize_policy(delay=0.005)

### Q-Learning Algorgorithm(Off-Policy Version)
Goal：从遵循行为策略$\pi_{behavior}(a|s)$生成的经验样本中学习最优策略$\pi_{target}(a|s)$。

- 初始化Q值表$q_0(s, a)$、初始化行为策略$\pi_{behavior}(a|s)$，$\alpha$为学习率，$\gamma$为折扣因子
- 对于每一个遵循策略$\pi_{behavior}(a|s)$生成的episode $\{ s_0,a_0,r_1,s_1,a_1,r_2,\cdots \}$:
  - 对于这个episode中的每一步$t$:
    - 更新Q值：
      - $q_{t+1}(s_t, a_t) \leftarrow (1-\alpha) q_t(s_t, a_t) + \alpha [r_{t} + \gamma q_t(s_{t+1}, \arg\max_a q_t(s_{t+1}, a))]$
    - 更新策略$\pi(a|s)$：
      - $\pi_{target,t+1}(a|s_t)=1, if \quad a = \arg\max_a q_{t+1}(s_t, a) \quad else \quad 0$

### Example

In [18]:
class QLearningOffPolicy:
    """ Q-Learning Off-Policy Algorithm """

    def __init__(self, env, alpha=0.1, gamma=0.9, epsilon=0.1, epsilon_decay=0.99):

        self.env = env
        self.alpha = alpha
        self.gamma = gamma
        self.epsilon = epsilon
        self.epsilon_decay = epsilon_decay

        self.returns = []
        self.q_tables = np.zeros((env.observation_space.n, env.action_space.n))
        self.policy = np.ones((env.observation_space.n, env.action_space.n)) / env.action_space.n

    @staticmethod
    def custom_reward(done, reward):
        if done and reward == 1:
            return 10
        elif done and reward == 0:
            return -5
        else:
            return -0.1

    def take_action(self, state):
        """ Take action according to the current policy. """

        # return np.random.choice(range(self.env.action_space.n), p=self.policy[state])  # on-policy vs off-policy

        if np.random.uniform(0, 1) < self.epsilon:
            # Explore (take a random action with probability epsilon)
            action = self.env.action_space.sample()
        else:
            # Exploit (take the best known action according to Q-values)
            action = np.argmax(self.q_tables[state])
        return action


    def best_action(self, state):
        """ Return the best action based on the Q-table """

        return np.argmax(self.q_tables[state])

    def generate_episode(self, state):
        episode = []
        done = False
        while not done:
            action = self.take_action(state)
            next_state, reward, terminated, truncated, info = self.env.step(action)

            done = terminated or truncated
            reward = self.custom_reward(done, reward)
            episode.append((state, action, reward, next_state))
            state = next_state
        return episode

    def get_optimal_policy(self):
        """ Get Optimal Policy from value function """

        for state in range(self.env.observation_space.n):
            policy = np.zeros(self.env.action_space.n)
            policy[self.best_action(state)] = 1.0
            self.policy[state] = policy

        return self.policy

    def update_policy_and_values(self, episode):
        """ Update the policy and values using the generated episode. """

        gamma_power = 1
        episode_return = 0
        for state, action, reward, next_state in reversed(episode):
            td_target = reward + self.gamma * np.max(self.q_tables[next_state])
            td_error = self.q_tables[state][action] - td_target
            self.q_tables[state][action] -= self.alpha * td_error

              # on-policy vs off-policy
            # policy = np.zeros(self.env.action_space.n)
            # policy[self.best_action(state)] = 1.
            # self.policy[state] = policy

            episode_return += reward * gamma_power
            gamma_power *= self.gamma

        return episode_return

    def train(self, episodes=1000):
        """ Train the agent for a specified number of episodes. """

        for i in range(10):
            with tqdm(total=episodes // 10, desc=f'Episode {i + 1}') as pbar:
                for idx in range(episodes // 10):
                    state, info = self.env.reset()

                    episode = self.generate_episode(state)
                    episode_return = self.update_policy_and_values(episode)
                    self.returns.append(episode_return)

                    if (idx + 1) % 10 == 0:
                        pbar.set_postfix(
                            {
                                'epoch': episodes / 10 * i + idx + 1,
                                'return': np.mean(self.returns),
                                'epsilon': self.epsilon
                            }
                        )
                    pbar.update(1)

                    self.epsilon *= self.epsilon_decay
                    self.epsilon = max(self.epsilon, 0.01)

    def visualize_policy(self, delay=0.5):
        """ Visualize the policy learned by the agent """
        state, info = self.env.reset()
        done = False

        while not done:
            self.env.render()
            action = np.argmax(self.policy[state])
            state, reward, terminated, truncated, info = self.env.step(action)
            done = terminated or truncated
            time.sleep(delay)

        self.env.render()
        self.env.close()

In [19]:
environment = gym.make('FrozenLake-v1', desc=None, map_name='4x4', is_slippery=True, render_mode='rgb_array')
environment.reset()

(0, {'prob': 1})

In [20]:
agent = QLearningOffPolicy(environment, gamma=0.9, epsilon=0.99, alpha=0.1, epsilon_decay=0.99)

In [21]:
agent.train(1000)
agent.get_optimal_policy()
print(f"Optimal policy: {agent.policy}")
print(f"Optimal Q-tables: {agent.q_tables}")

Episode 1: 100%|██████████| 100/100 [00:00<00:00, 1945.05it/s, epoch=100, return=-4.72, epsilon=0.366]
Episode 2: 100%|██████████| 100/100 [00:00<00:00, 1075.91it/s, epoch=200, return=-4.44, epsilon=0.134]
Episode 3: 100%|██████████| 100/100 [00:00<00:00, 708.21it/s, epoch=300, return=-2.99, epsilon=0.049]
Episode 4: 100%|██████████| 100/100 [00:00<00:00, 726.11it/s, epoch=400, return=-2.12, epsilon=0.018]
Episode 5: 100%|██████████| 100/100 [00:00<00:00, 509.10it/s, epoch=500, return=-0.968, epsilon=0.01]
Episode 6: 100%|██████████| 100/100 [00:00<00:00, 517.54it/s, epoch=600, return=-0.127, epsilon=0.01]
Episode 7: 100%|██████████| 100/100 [00:00<00:00, 540.46it/s, epoch=700, return=0.171, epsilon=0.01] 
Episode 8: 100%|██████████| 100/100 [00:00<00:00, 651.12it/s, epoch=800, return=0.604, epsilon=0.01]
Episode 9: 100%|██████████| 100/100 [00:00<00:00, 688.59it/s, epoch=900, return=0.944, epsilon=0.01]
Episode 10: 100%|██████████| 100/100 [00:00<00:00, 746.57it/s, epoch=1e+3, return=

Optimal policy: [[1. 0. 0. 0.]
 [0. 0. 0. 1.]
 [0. 0. 0. 1.]
 [0. 0. 0. 1.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [0. 0. 0. 1.]
 [0. 1. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [0. 0. 1. 0.]
 [0. 1. 0. 0.]
 [1. 0. 0. 0.]]
Optimal Q-tables: [[-0.26585427 -0.70959213 -0.61567585 -0.6655357 ]
 [-2.14415945 -2.29096442 -2.31466029 -0.82013835]
 [-1.47319414 -1.45481711 -1.45123551 -1.17734321]
 [-2.14334324 -1.59458302 -1.79249331 -1.17591872]
 [-0.10646646 -1.8313499  -1.40985262 -2.69788572]
 [ 0.          0.          0.          0.        ]
 [-0.88377165 -3.44290973 -3.10070478 -3.41517606]
 [ 0.          0.          0.          0.        ]
 [-2.5504639  -0.96482239 -1.7323844   0.33740344]
 [-2.11035683  1.15634603 -0.86769536 -1.86716592]
 [ 2.36330065 -1.50077207 -1.32236305 -1.37548157]
 [ 0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.        ]
 [-0.69089204  0.32482169  2.61287462  0.69128996]
 [ 1.259085




In [22]:
environment = gym.make('FrozenLake-v1', desc=None, map_name='4x4', is_slippery=True, render_mode='human')
environment.reset()

agent.env = environment
agent.visualize_policy(delay=0.005)