# 03 Algorithms

## 3.2 Policy Iteration

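### The Matrix-Vector form of Policy Iteration
**Policy Iteration** is a core algorithm in reinforcement learning. Like **Value Iteration**, it computes an optimal policy for a Markov decision process (MDP). Whereas Value Iteration iterates directly on the value function, Policy Iteration alternates between evaluating the current policy and improving it. Each iteration consists of two steps: Policy Evaluation and Policy Improvement.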
#### Policy Evaluation:
In the policy evaluation step, we compute the state-value function $v_{\pi_k}$ of the current policy $\pi_k$ by solving its Bellman equation:
$$
v_{\pi_k} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}
$$
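Here $v_{\pi_k}$ is the state-value vector to be computed, $\pi_k$ is the most recently obtained policy, $r_{\pi_k}$ and $P_{\pi_k}$ are the expected immediate reward vector and the state-transition matrix under $\pi_k$, and $\gamma$ is the discount factor.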

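When we solved the Bellman equation earlier, we gave two methods:
- Closed-form solution (for theoretical analysis only): $$v_{\pi_k} = (I - \gamma P_{\pi_k})^{-1} r_{\pi_k}$$
- Iterative solution (used in practice): $$v_{\pi_k}^{(j+1)} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}^{(j)}, \quad j=0,1,2,\dots$$ where $v_{\pi_k}^{(j)}$ is the estimate of the state values after $j$ iterations.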
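
The two solution methods can be compared numerically. The following is a minimal sketch, assuming a hypothetical 3-state chain under a fixed policy; the matrix `P_pi`, the vector `r_pi`, and the tolerance are illustrative values that do not come from the text.

```python
import numpy as np

# Hypothetical 3-state chain under a fixed policy pi_k (illustrative numbers).
gamma = 0.9
P_pi = np.array([[0.5, 0.5, 0.0],
                 [0.0, 0.5, 0.5],
                 [0.0, 0.0, 1.0]])   # row-stochastic transition matrix P_{pi_k}
r_pi = np.array([0.0, 0.0, 1.0])     # expected immediate reward r_{pi_k}

# Closed form: solve (I - gamma * P) v = r rather than forming the inverse.
v_closed = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)

# Iterative form: v^(j+1) = r + gamma * P v^(j), starting from zeros.
v_iter = np.zeros(3)
for j in range(10_000):
    v_next = r_pi + gamma * P_pi @ v_iter
    gap = np.max(np.abs(v_next - v_iter))
    v_iter = v_next
    if gap < 1e-10:
        break

print(v_closed)
print(v_iter)
```

Both vectors should agree up to the tolerance. The linear solve costs roughly cubic time in the number of states, which is one reason the iterative form is usually preferred for large state spaces.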



#### Policy Improvement:
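In the policy improvement step, we use the current value function $v_{\pi_k}$ to update the policy. Specifically, we solve the following optimization problem:

$$ \pi_{k+1} = \arg\max_{\pi} (r_{\pi} + \gamma P_{\pi} v_{\pi_k}) $$
Here $r_{\pi}$ is the expected immediate reward vector and $P_{\pi}$ is the state-transition matrix under policy $\pi$. The maximization is taken elementwise, so in every state the new policy selects the action with the greatest expected long-term return given $v_{\pi_k}$.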
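
The improvement step can be sketched in array form. This sketch assumes the model is stored as a tensor `P[a, s, s']` and a matrix `R[a, s]`; that layout, the function name `greedy_improvement`, and the 2-state example are hypothetical choices made for illustration.

```python
import numpy as np

def greedy_improvement(P, R, v, gamma):
    """Return the greedy deterministic policy for a value estimate v.

    P has shape (A, S, S) with P[a, s, s2] = p(s2 | s, a).
    R has shape (A, S) with R[a, s] = expected immediate reward.
    """
    n_actions, n_states, _ = P.shape
    q = R + gamma * P @ v              # q[a, s], shape (A, S)
    best = np.argmax(q, axis=0)        # greedy action in each state
    policy = np.zeros((n_states, n_actions))
    policy[np.arange(n_states), best] = 1.0
    return policy, q

# Hypothetical model: action 0 stays put, action 1 switches state.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
R = np.array([[0.0, 1.0],
              [0.0, 0.0]])
policy, q = greedy_improvement(P, R, v=np.array([0.0, 10.0]), gamma=0.9)
print(policy)
```

In this example state 0 prefers to switch (q = 9 versus 0) and state 1 prefers to stay (q = 10 versus 0), so the returned policy is one-hot on those actions.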
#### Stopping Criterion:
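Policy iteration needs a stopping criterion to decide when to stop iterating. A common choice is to stop once successive value functions differ by less than a threshold:
$$ \max_{s} |v_{\pi_{k+1}}(s) - v_{\pi_k}(s)| < \epsilon $$
Here $\epsilon$ is a small positive number, the largest difference we tolerate. Once the difference falls below this threshold, we treat the policy as having converged to the optimal policy, up to that tolerance.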
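
A minimal sketch of this check follows. The helper name `stopping_checks` is hypothetical, and the second test it returns (whether the policy itself stopped changing) is an additional check that is also commonly used, not something the criterion above requires.

```python
import numpy as np

def stopping_checks(v_new, v_old, policy_new, policy_old, eps=1e-6):
    """Return (values_close, policy_stable) for two successive iterations."""
    value_gap = float(np.max(np.abs(np.asarray(v_new) - np.asarray(v_old))))
    values_close = value_gap < eps                       # sup-norm test from the text
    policy_stable = np.array_equal(policy_new, policy_old)
    return values_close, policy_stable

print(stopping_checks([1.0, 2.0], [1.0, 2.0 + 1e-9], [[1, 0]], [[1, 0]]))
print(stopping_checks([1.0, 2.0], [1.0, 3.0], [[1, 0]], [[0, 1]]))
```

For a finite MDP with exact policy evaluation, a greedy policy that equals the previous policy already satisfies the Bellman optimality equation, so the policy test can end the loop without choosing a threshold.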


### The Elementwise form of Policy Iteration
#### Policy Evaluation:
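In the policy evaluation step, we compute the state-value function $v_{\pi_k}$ of the current policy $\pi_k$ by repeatedly applying the Bellman expectation equation in elementwise form:
$$ v_{\pi_k}^{(j+1)}(s) = \sum_{a \in \mathcal{A}(s)} \pi_k(a|s) \left( \sum_r p(r|s, a) r + \gamma \sum_{s' \in \mathcal{S}} p(s'|s, a) v_{\pi_k}^{(j)}(s') \right), \quad j=0,1,2,\dots $$
Here $v_{\pi_k}^{(j)}(s)$ is the estimate of the value of state $s$ under $\pi_k$ after $j$ sweeps, $r$ is the immediate reward for taking action $a$ in state $s$, $p(s'|s, a)$ is the probability of moving from state $s$ to state $s'$ under action $a$, and $\gamma$ is the discount factor.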
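
One way to read the elementwise update is as three nested sums in code. The sketch below assumes the model is stored as `P[s][a] = [(prob, next_state, reward, terminated), ...]`, the same layout the FrozenLake example later in this section reads from `env.unwrapped.P`; the 2-state model and the function name `evaluate_policy` are hypothetical.

```python
def evaluate_policy(P, policy, gamma=0.9, tol=1e-8):
    """Synchronous sweeps of the elementwise Bellman expectation update."""
    n_states = len(P)
    v = [0.0] * n_states                      # v^(0)
    while True:
        v_new = []
        for s in range(n_states):
            total = 0.0
            for a, pi_as in enumerate(policy[s]):              # sum over actions
                for prob, s_next, reward, terminated in P[s][a]:
                    # prob is the joint probability of (s_next, reward);
                    # terminal transitions do not bootstrap from v.
                    total += pi_as * prob * (reward + gamma * v[s_next] * (not terminated))
            v_new.append(total)
        gap = max(abs(x - y) for x, y in zip(v_new, v))
        v = v_new
        if gap < tol:
            return v

# Hypothetical model: state 1 is terminal; in state 0, action 0 stays
# (reward 0) and action 1 moves to state 1 (reward 1).
P = {
    0: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 1.0, True)]},
    1: {0: [(1.0, 1, 0.0, True)], 1: [(1.0, 1, 0.0, True)]},
}
uniform = [[0.5, 0.5], [0.5, 0.5]]
v = evaluate_policy(P, uniform)
print(v)
```

This version computes every $v^{(j+1)}(s)$ from the previous sweep, exactly as in the formula. Updating the values in place within a sweep is a common variant that also converges.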
#### Policy Improvement:
In the policy improvement step, the goal is to solve $\pi_{k+1} = \arg\max_{\pi} (r_{\pi} + \gamma P_{\pi} v_{\pi_k})$, whose elementwise form is:
$$
\pi_{k+1}(s) = \arg\max_{\pi} \sum_{a \in \mathcal{A}(s)} \pi(a|s) \left( \sum_{r \in \mathcal{R}(s, a)} p(r|s, a) r + \gamma \sum_{s' \in \mathcal{S}} p(s'|s, a) v_{\pi_k}(s') \right)
$$
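
The step from a maximization over policies to a maximization over actions can be spelled out. Write $q_{\pi_k}(s, a)$ for the bracketed term (the same quantity the algorithm listing computes as a q-value). Because $\pi(\cdot|s)$ is a probability distribution over actions, a weighted average of q-values can never exceed the largest one:

$$
\sum_{a \in \mathcal{A}(s)} \pi(a|s) \, q_{\pi_k}(s, a) \le \max_{a \in \mathcal{A}(s)} q_{\pi_k}(s, a)
$$

Equality holds when $\pi$ places all of its probability on a maximizing action $a_k^*(s) = \arg\max_a q_{\pi_k}(s, a)$, so a greedy deterministic policy solves the problem:

$$
\pi_{k+1}(a|s) = \begin{cases} 1, & a = a_k^*(s) \\ 0, & a \ne a_k^*(s) \end{cases}
$$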



### Policy Iteration Algorithm
* Given: the model $p(r|s, a)$ and $p(s'|s, a)$ is known for every state-action pair $(s, a)$. Initialize $\pi_0$ arbitrarily.
* $for \ k = 0, 1, 2, \dots \ do:$
* $\qquad$ $Policy \ Evaluation:$
* $\qquad$ Initialize $v_{\pi_k}^{(0)}(s)$ arbitrarily for all $s$
* $\qquad$ $repeat \ for \ j = 0, 1, 2, \dots:$
* $\qquad\qquad$ $for \ s \in \mathcal{S} \ do:$
* $\qquad\qquad\qquad$ $v_{\pi_k}^{(j+1)}(s) \leftarrow \sum_{a \in \mathcal{A}(s)} \pi_k (a|s) \left (\sum_{r \in \mathcal{R}(s, a)} p(r|s, a) r + \gamma \sum_{s' \in \mathcal{S}} p(s'|s, a) v_{\pi_k}^{(j)}(s') \right )$
* $\qquad\qquad$ $end \ for$
* $\qquad$ $until \ \max_s |v_{\pi_k}^{(j+1)}(s) - v_{\pi_k}^{(j)}(s)| < \epsilon$, then take the last iterate as $v_{\pi_k}$
* $\qquad$ $if \ k \ge 1 \ and \ \max_s |v_{\pi_k}(s) - v_{\pi_{k-1}}(s)| < \epsilon$: stop and return $\pi_k$
* $\qquad$ $Policy \ Improvement:$
* $\qquad$ $for \ s \in \mathcal{S} \ do:$
* $\qquad\qquad$ $for \ a \in \mathcal{A}(s) \ do:$
* $\qquad\qquad\qquad$ $q_{\pi_k}(s, a) \leftarrow \sum_{r \in \mathcal{R}(s, a)} p(r|s, a) r + \gamma \sum_{s' \in \mathcal{S}} p(s'|s, a) v_{\pi_k}(s')$
* $\qquad\qquad$ $end \ for$
* $\qquad\qquad$ $a_k^*(s) \leftarrow \arg\max_a q_{\pi_k}(s, a)$
* $\qquad\qquad$ $\pi_{k+1}(a|s) = 1 \ if \ a = a_k^*(s), \ else \ 0$
* $\qquad$ $end \ for$
* $end \ for$
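
The listing can be condensed into a short array-based sketch. It assumes a hypothetical model layout (`P[a, s, s']` and `R[a, s]`), starts from the policy that always takes action 0, and stops when the greedy policy stops changing instead of comparing successive value functions; these are illustrative choices, not the only way to organize the loop.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, eps=1e-8):
    """P: (A, S, S) transition tensor, R: (A, S) expected rewards."""
    n_actions, n_states, _ = P.shape
    states = np.arange(n_states)
    actions = np.zeros(n_states, dtype=int)   # pi_0: action 0 everywhere
    v = np.zeros(n_states)
    while True:
        # Policy evaluation: iterate v <- r_pi + gamma * P_pi v.
        P_pi = P[actions, states]             # (S, S), row s is p(. | s, pi(s))
        r_pi = R[actions, states]             # (S,)
        while True:
            v_next = r_pi + gamma * P_pi @ v
            gap = np.max(np.abs(v_next - v))
            v = v_next
            if gap < eps:
                break
        # Policy improvement: act greedily with respect to q.
        q = R + gamma * P @ v                 # (A, S)
        new_actions = np.argmax(q, axis=0)
        if np.array_equal(new_actions, actions):
            return actions, v
        actions = new_actions

# Hypothetical 2-state model: action 0 stays, action 1 switches;
# staying in state 1 pays reward 1.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
R = np.array([[0.0, 1.0],
              [0.0, 0.0]])
actions, v = policy_iteration(P, R)
print(actions, v)
```

In this toy model the loop settles on switching out of state 0 and staying in state 1, with values close to 9 and 10. Each evaluation warm-starts from the previous value estimate, which usually shortens later evaluations.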

### Example

In [2]:
import time

import numpy as np
import gymnasium as gym

In [4]:
class PolicyIteration:
    """ Policy Iteration Algorithm for FrozenLake """

    def __init__(self, env, gamma=0.9, delta=0.001):
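        """Store the environment and hyperparameters.

        gamma is the discount factor; delta is the convergence threshold
        used by policy evaluation.
        """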

        self.env = env
        self.gamma = gamma
        self.delta = delta

        self.values = np.zeros(env.observation_space.n)
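        # All-zero start: this is not a valid distribution, so the first evaluation
        # leaves every value at 0 and the first improvement step yields the first
        # one-hot policy.
        self.policy = np.zeros((env.observation_space.n, env.action_space.n))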

    def policy_evaluation(self):
        """ Policy Evaluation """
        steps = 0
        while True:
            delta = 0
            for state in range(self.env.observation_space.n):
                value = 0
                for action, action_prob in enumerate(self.policy[state]):
                    # Add this action's expected return, weighted by pi(a|s)
                    for prob, next_state, reward, terminated in self.env.unwrapped.P[state][action]:
                        # prob is the joint probability of (next_state, reward), so it covers
                        # both p(r|s, a) and p(s'|s, a); terminal states do not bootstrap
                        value += action_prob * prob * (reward + self.gamma * self.values[next_state] * (1 - terminated))

                delta = max(abs(value - self.values[state]), delta)
                self.values[state] = value
            steps += 1
            if delta < self.delta:
                break
        print("Policy Evaluation Steps:", steps)

    def policy_improvement(self):
        """ Policy Improvement """
        new_policy = np.zeros((self.env.observation_space.n, self.env.action_space.n))
        for state in range(self.env.observation_space.n):
            q_values = np.zeros(self.env.action_space.n)
            for action in range(self.env.action_space.n):
                # Accumulate q(s, a) over all transitions of this state-action pair
                for prob, next_state, reward, terminated in self.env.unwrapped.P[state][action]:
                    # prob is the joint probability of (next_state, reward)
                    q_values[action] += prob * (reward + self.gamma * self.values[next_state] * (1 - terminated))
            best_action = np.argmax(q_values)
            new_policy[state][best_action] = 1.
        self.policy = new_policy

    def policy_iteration(self):
        """ Policy Iteration """

        while True:
            self.policy_evaluation()
            old_policy = self.policy
            self.policy_improvement()
            new_policy = self.policy
            # Stop once the greedy policy no longer changes
            if np.array_equal(old_policy, new_policy):
                break

    def visualize_policy(self, delay=0.5):
        """ Visualize the policy in the environment """
        state, info = self.env.reset()
        done = False

        while not done:
            self.env.render()
            action = np.argmax(self.policy[state])
            state, reward, terminated, truncated, info = self.env.step(action)
            done = terminated or truncated
            time.sleep(delay)

        self.env.render()
        self.env.close()

In [5]:
env = gym.make('FrozenLake-v1', desc=None, map_name='8x8', is_slippery=True, render_mode='human')
env.reset()

agent = PolicyIteration(env, gamma=0.95, delta=1e-6)
agent.policy_iteration()
print(f"Optimal Policy: {agent.policy}")
print(f"Optimal Value Function: {agent.values}")

Policy Evaluation Steps: 1
Policy Evaluation Steps: 17
Policy Evaluation Steps: 112
Policy Evaluation Steps: 25
Policy Evaluation Steps: 69
Policy Evaluation Steps: 44
Policy Evaluation Steps: 27
Policy Evaluation Steps: 42
Policy Evaluation Steps: 46
Policy Evaluation Steps: 48
Optimal Policy: [[0. 0. 0. 1.]
 [0. 0. 1. 0.]
 [0. 0. 1. 0.]
 [0. 0. 1. 0.]
 [0. 0. 1. 0.]
 [0. 0. 1. 0.]
 [0. 0. 1. 0.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]
 [0. 0. 0. 1.]
 [0. 0. 0. 1.]
 [0. 0. 0. 1.]
 [0. 0. 1. 0.]
 [0. 0. 1. 0.]
 [0. 0. 1. 0.]
 [0. 1. 0. 0.]
 [0. 0. 0. 1.]
 [0. 0. 0. 1.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]
 [0. 0. 1. 0.]
 [0. 1. 0. 0.]
 [0. 0. 0. 1.]
 [0. 0. 0. 1.]
 [0. 0. 0. 1.]
 [0. 1. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [0. 0. 1. 0.]
 [0. 1. 0. 0.]
 [0. 0. 0. 1.]
 [0. 0. 0. 1.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [0. 0. 1. 0.]
 [0. 1. 0. 0.]
 [0. 0. 0. 1.]
 [0. 0. 1. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 0. 1.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]

In [None]:
agent.visualize_policy(delay=0.005)