# 03 Algorithms

## 3.7 Q-Learning

### Q-Learning
Q-Learning:
$$
\begin{cases}
q_{t+1}(s_t, a_t) = q_t(s_t,a_t) - \alpha_t(s_t, a_t)[q_t(s_t,a_t) - (r_{t+1} + \gamma \max_{a \in \cal A(s_{t+1})}q_t(s_{t+1}, a))] & (s, t)=(s_t, a_t) \\
q_{t+1}(s, a) = q_t(s, a) & (s, a) \neq (s_t, a_t)
\end{cases}
$$

其中，$t=0,1,2,...$，$\alpha_t(s_t, a_t)$是一个很小的正数，代表学习率。

Q-Learning算法与Sarsa算法非常类似，区别在于更新Q值时，Sarsa是根据下一个状态和动作来计算的，而Q-Learning则是根据下一个状态的所有可能的动作的最大Q值来计算，即TD Target有差异：
- Sarsa TD Target: $r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1})$
- Q-Learning TD Target: $r_{t+1} + \gamma \max_{a' \in \cal A(s_{t+1})} q_t(s_{t+1}, a')$

Sarsa需要知道下一时刻$t+1$的$(r_{t+1}, s_{t+1}, a_{t+1})$；而Q-Learning仅需知道下一时刻$t+1$的$(r_{t+1}, s_{t+1})$。因此，Sarsa是一种On-Policy算法($a_{t+1}$仍然来自于当前策略)；而Q-Learning是一种Off-Policy算法($a'$的产生来自于greedy策略)，既可以用于on-policy也可以用于off-policy。

### Q-Learning Algorithm(On-Policy Version)
Goal：学习一个能够指导智能体从初始状态$s_0$到达目标状态的最优路径
- 初始化Q值表$q_0(s, a)$、初始化$\epsilon$-greedy策略$\pi_0(a|s)$、学习率$\alpha$、折扣因子$\gamma$、探索概率$\epsilon$；
- 对于每一个episode:
    - 如果$s_t(t=0,1,2,...)$不是终止状态，则：
      - 收集状态$s_t$上的经验样本$(a_t, r_{t+1}, s_{t+1})$：遵循策略$\pi(a|s)$，采取动作$a_t$，与环境交互得到奖励$r_{t+1}$，进入状态$s_{t+1}$；
      - 更新Q值表$(s_t, a_t)$：
        - $q_{t+1}(s_t, a_t) \leftarrow q_t(s_t, a_t) - \alpha [q_t(s_t, a_t) - (r_{t+1} + \gamma max_{a'} q_t(s_{t+1}, a'))]$
      - 更新策略$\pi(a|s)$：
        - $a = \arg\max_a q_{t+1}(s_t, a)$， $\pi_{t+1}(a|s_t)=1-\frac{\epsilon}{|\cal A(s_t)|}(|\cal A(s_t)| - 1)$，否则$\pi_{t+1}(a|s_t) = \frac{\epsilon}{|\cal A(s_t)|}$

### Example

In [1]:
import time

import numpy as np
import gymnasium as gym
from tqdm import tqdm

In [2]:
class QLearningOnPolicy:
    """ Q-Learning On-Policy Algorithm """
    def __init__(self, env, alpha=0.1, gamma=0.9, epsilon=0.1, epsilon_decay=0.99):
        self.env = env
        self.alpha = alpha
        self.gamma = gamma
        self.epsilon = epsilon
        self.epsilon_decay = epsilon_decay

        self.returns = []
        self.q_tables = np.zeros((env.observation_space.n, env.action_space.n))
        self.policy = np.ones((env.observation_space.n, env.action_space.n)) / env.action_space.n

    @staticmethod
    def custom_reward(done, reward):
        if done and reward == 1:
            return 10
        elif done and reward == 0:
            return -5
        else:
            return -0.1

    def take_action(self, state):
        """ Take action according to the current policy. """

        return np.random.choice(range(self.env.action_space.n), p=self.policy[state])

    def best_action(self, state):
        """ Return the best action based on the Q-table """

        return np.argmax(self.q_tables[state])

    def update_policy_and_values(self, state, action, reward, next_state, done):
        """ Update the policy based on the Q-table. """

        # Update the Q-value for the state and action pair
        td_target = reward + self.gamma * self.q_tables[next_state][self.best_action(next_state)]
        td_error = self.q_tables[state][action] - td_target
        self.q_tables[state][action] -= self.alpha * td_error

        # Update the policy based on the Q-table
        best_action = self.best_action(state)
        policy = np.ones(self.env.action_space.n) * self.epsilon / self.env.action_space.n
        policy[best_action] = 1 - self.epsilon / self.env.action_space.n * (self.env.action_space.n - 1)
        self.policy[state] = policy

    def train(self, episodes=1000):
        """ Train the agent using Q-learning. """

        for i in range(10):
            with tqdm(total=episodes // 10, desc=f'Episode {i + 1}') as pbar:
                for episode in range(episodes // 10):
                    state, info = self.env.reset()
                    action = self.take_action(state)
                    done = False

                    gamma_power = 1
                    episode_return = 0
                    while not done:
                        next_state, reward, terminated, truncated, info = self.env.step(action)

                        done = terminated or truncated
                        reward = self.custom_reward(done, reward)

                        self.update_policy_and_values(state, action, reward, next_state, done)
                        state, action = next_state, self.best_action(next_state)

                        episode_return += reward * gamma_power
                        gamma_power *= self.gamma

                    self.returns.append(episode_return)
                    if (episode + 1) % 10 == 0:
                        pbar.set_postfix(
                            {
                                'epoch': episodes / 10 * i + episode + 1,
                                'return': np.mean(self.returns),
                                'epsilon': self.epsilon
                            }
                        )
                    pbar.update(1)

                    self.epsilon *= self.epsilon_decay
                    self.epsilon = max(self.epsilon, 0.01)


    def visualize_policy(self, delay=0.5):
        """ Visualize the policy learned by the agent """
        state, info = self.env.reset()
        done = False

        while not done:
            self.env.render()
            action = np.argmax(self.policy[state])
            state, reward, terminated, truncated, info = self.env.step(action)
            done = terminated or truncated
            time.sleep(delay)

        self.env.render()
        self.env.close()

In [3]:
environment = gym.make('FrozenLake-v1', desc=None, map_name='4x4', is_slippery=True, render_mode='rgb_array')
environment.reset()

(0, {'prob': 1})

In [4]:
agent = QLearningOnPolicy(environment, gamma=0.9, epsilon=0.99, alpha=0.1, epsilon_decay=0.99)

In [5]:
agent.train(1000)
print(f"Optimal policy: {agent.policy}")
print(f"Optimal Q-tables: {agent.q_tables}")

Episode 1: 100%|██████████| 100/100 [00:00<00:00, 658.24it/s, epoch=100, return=-1.35, epsilon=0.366]
Episode 2: 100%|██████████| 100/100 [00:00<00:00, 443.63it/s, epoch=200, return=-0.99, epsilon=0.134]
Episode 3: 100%|██████████| 100/100 [00:00<00:00, 512.82it/s, epoch=300, return=-0.912, epsilon=0.049]
Episode 4: 100%|██████████| 100/100 [00:00<00:00, 363.82it/s, epoch=400, return=-0.872, epsilon=0.018]
Episode 5: 100%|██████████| 100/100 [00:00<00:00, 412.42it/s, epoch=500, return=-0.81, epsilon=0.01]  
Episode 6: 100%|██████████| 100/100 [00:00<00:00, 430.77it/s, epoch=600, return=-0.714, epsilon=0.01]
Episode 7: 100%|██████████| 100/100 [00:00<00:00, 421.47it/s, epoch=700, return=-0.645, epsilon=0.01]
Episode 8: 100%|██████████| 100/100 [00:00<00:00, 438.10it/s, epoch=800, return=-0.614, epsilon=0.01]
Episode 9: 100%|██████████| 100/100 [00:00<00:00, 509.39it/s, epoch=900, return=-0.567, epsilon=0.01]
Episode 10: 100%|██████████| 100/100 [00:00<00:00, 450.23it/s, epoch=1e+3, retu

Optimal policy: [[0.9925 0.0025 0.0025 0.0025]
 [0.0025 0.0025 0.0025 0.9925]
 [0.0025 0.0025 0.0025 0.9925]
 [0.0025 0.0025 0.0025 0.9925]
 [0.9925 0.0025 0.0025 0.0025]
 [0.25   0.25   0.25   0.25  ]
 [0.9925 0.0025 0.0025 0.0025]
 [0.25   0.25   0.25   0.25  ]
 [0.0025 0.0025 0.0025 0.9925]
 [0.0025 0.9925 0.0025 0.0025]
 [0.9925 0.0025 0.0025 0.0025]
 [0.25   0.25   0.25   0.25  ]
 [0.25   0.25   0.25   0.25  ]
 [0.0025 0.0025 0.9925 0.0025]
 [0.0025 0.9925 0.0025 0.0025]
 [0.25   0.25   0.25   0.25  ]]
Optimal Q-tables: [[-0.11236121 -1.44519913 -1.24586402 -1.15346968]
 [-1.70627948 -1.74060461 -1.59994484 -0.65336243]
 [-1.6849129  -1.65618128 -1.68614112 -0.89104787]
 [-2.0534549  -1.98766993 -2.07575064 -0.94098777]
 [ 0.2124232  -1.67738157 -1.65334506 -1.69444359]
 [ 0.          0.          0.          0.        ]
 [-1.47329121 -3.36391367 -3.44422237 -3.40541268]
 [ 0.          0.          0.          0.        ]
 [-0.95       -0.79728749 -0.71769354  0.95289711]
 [-0.52515




In [6]:
environment = gym.make('FrozenLake-v1', desc=None, map_name='4x4', is_slippery=True, render_mode='human')
environment.reset()

agent.env = environment
agent.visualize_policy(delay=0.005)

### Q-Learning Algorgorithm(Off-Policy Version)
Goal：从遵循行为策略$\pi_{behavior}(a|s)$生成的经验样本中学习最优策略$\pi_{target}(a|s)$。

- 初始化Q值表$q_0(s, a)$、初始化行为策略$\pi_{behavior}(a|s)$，$\alpha$为学习率，$\gamma$为折扣因子
- 对于每一个遵循策略$\pi_{behavior}(a|s)$生成的episode $\{ s_0,a_0,r_1,s_1,a_1,r_2,\cdots \}$:
  - 对于这个episode中的每一步$t$:
    - 更新Q值：
      - $q_{t+1}(s_t, a_t) \leftarrow (1-\alpha) q_t(s_t, a_t) + \alpha [r_{t} + \gamma q_t(s_{t+1}, \arg\max_a q_t(s_{t+1}, a))]$
    - 更新策略$\pi(a|s)$：
      - $\pi_{target,t+1}(a|s_t)=1, if \quad a = \arg\max_a q_{t+1}(s_t, a) \quad else \quad 0$







### Example

In [7]:
class QLearningOffPolicy:
    """ Q-Learning Off-Policy Algorithm """

    def __init__(self, env, alpha=0.1, gamma=0.9, epsilon=0.1, epsilon_decay=0.99):

        self.env = env
        self.alpha = alpha
        self.gamma = gamma
        self.epsilon = epsilon
        self.epsilon_decay = epsilon_decay

        self.returns = []
        self.q_tables = np.zeros((env.observation_space.n, env.action_space.n))
        self.policy = np.ones((env.observation_space.n, env.action_space.n)) / env.action_space.n

    @staticmethod
    def custom_reward(done, reward):
        if done and reward == 1:
            return 10
        elif done and reward == 0:
            return -5
        else:
            return -0.1

    def take_action(self, state):
        """ Take action according to the current policy. """

        return np.random.choice(range(self.env.action_space.n), p=self.policy[state])

    def best_action(self, state):
        """ Return the best action based on the Q-table """

        return np.argmax(self.q_tables[state])

    def generate_episode(self, state):
        episode = []
        done = False
        while not done:
            action = self.take_action(state)
            next_state, reward, terminated, truncated, info = self.env.step(action)

            done = terminated or truncated
            reward = self.custom_reward(done, reward)
            episode.append((state, action, reward, next_state))
            state = next_state
        return episode

    def update_policy_and_values(self, episode):
        """ Update the policy and values using the generated episode. """

        gamma_power = 1
        episode_return = 0
        for state, action, reward, next_state in reversed(episode):
            td_target = reward + self.gamma * np.max(self.q_tables[next_state])
            td_error = self.q_tables[state][action] - td_target
            self.q_tables[state][action] -= self.alpha * td_error

            policy = np.zeros(self.env.action_space.n)
            policy[self.best_action(state)] = 1.
            self.policy[state] = policy

            episode_return += reward * gamma_power
            gamma_power *= self.gamma

        return episode_return

    def train(self, episodes=1000):
        """ Train the agent for a specified number of episodes. """

        for i in range(10):
            with tqdm(total=episodes // 10, desc=f'Episode {i + 1}') as pbar:
                for idx in range(episodes // 10):
                    state, info = self.env.reset()

                    episode = self.generate_episode(state)
                    episode_return = self.update_policy_and_values(episode)
                    self.returns.append(episode_return)

                    if (idx + 1) % 10 == 0:
                        pbar.set_postfix(
                            {
                                'epoch': episodes / 10 * i + idx + 1,
                                'return': np.mean(self.returns),
                                'epsilon': self.epsilon
                            }
                        )
                    pbar.update(1)

                    self.epsilon *= self.epsilon_decay
                    self.epsilon = max(self.epsilon, 0.01)

    def visualize_policy(self, delay=0.5):
        """ Visualize the policy learned by the agent """
        state, info = self.env.reset()
        done = False

        while not done:
            self.env.render()
            action = np.argmax(self.policy[state])
            state, reward, terminated, truncated, info = self.env.step(action)
            done = terminated or truncated
            time.sleep(delay)

        self.env.render()
        self.env.close()

In [8]:
environment = gym.make('FrozenLake-v1', desc=None, map_name='4x4', is_slippery=True, render_mode='rgb_array')
environment.reset()

(0, {'prob': 1})

In [9]:
agent = QLearningOffPolicy(environment, gamma=0.9, epsilon=0.99, alpha=0.1, epsilon_decay=0.99)

In [10]:
agent.train(1000)
print(f"Optimal policy: {agent.policy}")
print(f"Optimal Q-tables: {agent.q_tables}")

Episode 1: 100%|██████████| 100/100 [00:00<00:00, 349.59it/s, epoch=100, return=-3.91, epsilon=0.366]
Episode 2: 100%|██████████| 100/100 [00:00<00:00, 286.70it/s, epoch=200, return=-0.519, epsilon=0.134]
Episode 3: 100%|██████████| 100/100 [00:00<00:00, 278.13it/s, epoch=300, return=0.51, epsilon=0.049]  
Episode 4: 100%|██████████| 100/100 [00:00<00:00, 297.71it/s, epoch=400, return=1.25, epsilon=0.018] 
Episode 5: 100%|██████████| 100/100 [00:00<00:00, 314.59it/s, epoch=500, return=1.55, epsilon=0.01] 
Episode 6: 100%|██████████| 100/100 [00:00<00:00, 287.76it/s, epoch=600, return=1.77, epsilon=0.01]
Episode 7: 100%|██████████| 100/100 [00:00<00:00, 270.07it/s, epoch=700, return=2.16, epsilon=0.01]
Episode 8: 100%|██████████| 100/100 [00:00<00:00, 281.46it/s, epoch=800, return=2.53, epsilon=0.01]
Episode 9: 100%|██████████| 100/100 [00:00<00:00, 257.15it/s, epoch=900, return=2.54, epsilon=0.01]
Episode 10: 100%|██████████| 100/100 [00:00<00:00, 273.29it/s, epoch=1e+3, return=2.6, ep

Optimal policy: [[1.   0.   0.   0.  ]
 [0.   0.   0.   1.  ]
 [1.   0.   0.   0.  ]
 [0.   0.   0.   1.  ]
 [1.   0.   0.   0.  ]
 [0.25 0.25 0.25 0.25]
 [0.   0.   1.   0.  ]
 [0.25 0.25 0.25 0.25]
 [0.   0.   0.   1.  ]
 [0.   1.   0.   0.  ]
 [1.   0.   0.   0.  ]
 [0.25 0.25 0.25 0.25]
 [0.25 0.25 0.25 0.25]
 [0.   0.   1.   0.  ]
 [0.   0.   0.   1.  ]
 [0.25 0.25 0.25 0.25]]
Optimal Q-tables: [[-0.46265042 -0.89551894 -0.89417804 -0.90693324]
 [-1.59152753 -1.62426729 -1.52414657 -1.40101201]
 [-1.48367304 -1.49488964 -1.51450835 -1.50448104]
 [-1.87720533 -1.74267177 -1.71980584 -1.23097771]
 [-0.26233403 -0.91601507 -0.89975679 -0.91031527]
 [ 0.          0.          0.          0.        ]
 [-3.16278232 -3.14220191 -1.90918928 -3.25324696]
 [ 0.          0.          0.          0.        ]
 [-1.20159305 -1.2688851  -0.9581      0.3306748 ]
 [-0.882329    1.08220984 -0.9581     -0.95      ]
 [ 2.37053739 -0.84036884 -1.03759593 -0.88341463]
 [ 0.          0.          0.       




In [11]:
environment = gym.make('FrozenLake-v1', desc=None, map_name='4x4', is_slippery=True, render_mode='human')
environment.reset()

agent.env = environment
agent.visualize_policy(delay=0.005)