# Dynamic Programming
Now that we have the Bellman Optimality Equation, it is natural to solve an MDP with Dynamic Programming (DP).

There are two main types of DP for RL: policy iteration and value iteration. Policy iteration consists of two parts: policy evaluation and policy improvement. Policy evaluation uses the Bellman Expectation Equation to compute the state value function of a policy, which is itself a DP process. Value iteration, in contrast, applies DP directly to the Bellman Optimality Equation and converges to the optimal state values.

Unlike Monte Carlo methods, DP methods require that we know $P$ and $r$. However, such white-box environments are rare, which limits the use of DP. In addition, DP in this tabular form generally fits only discrete MDPs, where both the states and the actions are discrete.

# Cliff Walking Environment

This is a classic RL environment in which an agent walks along the edge of a cliff. The agent should avoid the cliff cells and walk from a given starting point to a given end point. Each step has a reward of -1, except that stepping into a cliff cell has a reward of -100. Reaching either the end point or the cliff ends the episode. The action space contains 4 actions: up, down, left and right. An action that would cross the boundary of the environment results in no move. Below is the code for the cliff walking environment.

In [1]:
import copy


class CliffWalkingEnv:
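    """Cliff Walking environment."""
    def __init__(self, ncol=12, nrow=4):
        self.ncol = ncol  # number of columns in the grid world
        self.nrow = nrow  # number of rows in the grid world
        # Transition table: P[state][action] = [(p, next_state, reward, done)],
        # which holds the next state and the reward of each outcome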
        self.P = self.createP()

    def createP(self):
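        # Initialize an empty outcome list for every (state, action) pair
        P = [[[] for j in range(4)] for i in range(self.nrow * self.ncol)]
        # 4 actions: change[0] = up, change[1] = down, change[2] = left, change[3] = right.
        # The origin (0, 0) of the coordinate system is the top-left corner.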
        change = [[0, -1], [0, 1], [-1, 0], [1, 0]]
        for i in range(self.nrow):
            for j in range(self.ncol):
                for a in range(4):
                    # Cliff and goal states allow no further interaction, so every action has reward 0
                    if i == self.nrow - 1 and j > 0:
                        P[i * self.ncol + j][a] = [(1, i * self.ncol + j, 0, True)]
                        continue
                    # All other positions
                    next_x = min(self.ncol - 1, max(0, j + change[a][0]))
                    next_y = min(self.nrow - 1, max(0, i + change[a][1]))
                    next_state = next_y * self.ncol + next_x
                    reward = -1
                    done = False
                    # The next position is a cliff cell or the goal
                    if next_y == self.nrow - 1 and next_x > 0:
                        done = True
                        if next_x != self.ncol - 1:  # the next position is a cliff cell
                            reward = -100
                    P[i * self.ncol + j][a] = [(1, next_state, reward, done)]
        return P

In [2]:
env = CliffWalkingEnv()
print(env.P[35][0])

[(1, 23, -1, False)]
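To make the indexing concrete: the environment flattens the cell at row `i`, column `j` into the state index `i * ncol + j`, so state 35 is the cell directly above the goal and state 36 is the bottom-left cell. The following is a minimal sketch that recomputes a single transition with the same rules as `createP`. The helper name `cliff_step` is hypothetical and is not part of the environment class.

```python
def cliff_step(state, action, ncol=12, nrow=4):
    """Hypothetical helper: one deterministic transition, mirroring createP."""
    # Same action encoding as the environment: 0 up, 1 down, 2 left, 3 right
    change = [(0, -1), (0, 1), (-1, 0), (1, 0)]
    row, col = divmod(state, ncol)  # invert state = row * ncol + col
    # Cliff and goal cells are absorbing with reward 0
    if row == nrow - 1 and col > 0:
        return state, 0, True
    # Clip the move so the agent stays inside the grid
    next_col = min(ncol - 1, max(0, col + change[action][0]))
    next_row = min(nrow - 1, max(0, row + change[action][1]))
    next_state = next_row * ncol + next_col
    reward, done = -1, False
    if next_row == nrow - 1 and next_col > 0:
        done = True
        if next_col != ncol - 1:  # stepped into the cliff rather than the goal
            reward = -100
    return next_state, reward, done


print(cliff_step(35, 0))  # up from the cell above the goal
print(cliff_step(36, 3))  # right from the bottom-left cell, into the cliff
print(cliff_step(35, 1))  # down from the cell above the goal, into the goal
```

The first call reproduces the entry `env.P[35][0]` printed above, without the leading probability.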


# Policy Iteration
## Policy Evaluation
Policy evaluation computes the state value function of a policy. Recall the Bellman Expectation Equation: $$V^\pi(s) = \sum_{a \in \mathcal{A}} \pi(a|s) (r(s, a) + \gamma \sum_{s' \in \mathcal{S}} p(s'|s, a)V^{\pi}(s'))$$

Using the idea of DP, we can compute the values at iteration $k+1$ from the values at iteration $k$: $$V^{k+1}(s) = \sum_{a \in \mathcal{A}} \pi(a|s) (r(s, a) + \gamma \sum_{s' \in \mathcal{S}} p(s'|s, a)V^{k}(s'))$$

We may choose any starting point $V^0$. According to the Bellman Expectation Equation, $V^k=V^\pi$ is a fixed point of the update above, and for $\gamma < 1$ the iterates $V^k$ converge to it as $k \to \infty$. Hence, once $\max_{s \in \mathcal{S}}|V^{k+1}(s)-V^{k}(s)| \leq \epsilon$ for a very small $\epsilon$, the evaluation may stop.
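As a small illustration of this stopping rule, the sketch below runs the update on a made-up 2-state problem in which the policy has already been folded into a transition matrix `P_pi` and a reward vector `r_pi`. These numbers are assumptions chosen for the example, not part of the cliff walking environment. The result is compared with the fixed point obtained by solving the linear system $(I - \gamma P_\pi)V = r_\pi$ directly.

```python
# Hypothetical 2-state example under a fixed policy pi:
# P_pi[s][t] is the probability of moving from s to t, r_pi[s] the expected reward.
P_pi = [[0.5, 0.5],
        [0.2, 0.8]]
r_pi = [1.0, 0.0]
gamma = 0.9
epsilon = 1e-10

V = [0.0, 0.0]  # any starting point V^0 works
num_sweeps = 0
while True:
    # Bellman expectation backup, computed from the previous iterate only
    V_next = [r_pi[s] + gamma * sum(P_pi[s][t] * V[t] for t in range(2))
              for s in range(2)]
    num_sweeps += 1
    delta = max(abs(V_next[s] - V[s]) for s in range(2))
    V = V_next
    if delta <= epsilon:
        break

# Closed form for comparison: solve (I - gamma * P_pi) V = r_pi by Cramer's rule
a, b = 1 - gamma * P_pi[0][0], -gamma * P_pi[0][1]
c, d = -gamma * P_pi[1][0], 1 - gamma * P_pi[1][1]
det = a * d - b * c
V_exact = [(r_pi[0] * d - b * r_pi[1]) / det,
           (a * r_pi[1] - c * r_pi[0]) / det]

print(num_sweeps, V, V_exact)
```

Solving the linear system is exact but scales poorly with the number of states, which is one reason the iterative update is usually preferred.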

## Policy Improvement
Once policy evaluation has given us $V^\pi$, we can improve the policy. Suppose that in some state $s$ there is an action $a$ with $Q^\pi(s, a) > V^\pi(s)$; then taking $a$ in $s$ and following $\pi$ afterwards yields a higher expected return than $V^\pi(s)$. More generally, if a deterministic policy $\pi'$ satisfies $Q^\pi(s, \pi'(s)) \geq V^\pi(s)$ for every state $s$, then $V^{\pi'}(s) \geq V^\pi(s)$ for every state $s$. This is the policy improvement theorem. Hence, we can improve the policy greedily: $$\pi'(s) = \arg \max_{a \in \mathcal{A}} Q^\pi(s, a) = \arg \max_{a \in \mathcal{A}} \{r(s, a) + \gamma \sum_{s' \in \mathcal{S}} p(s'|s, a)V^{\pi}(s')\}$$
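
The theorem follows by applying the assumption repeatedly inside the definition of $Q^\pi$. A sketch of the argument for a deterministic $\pi'$, in the notation used above and assuming $\gamma < 1$:

$$\begin{aligned} V^\pi(s) &\leq Q^\pi(s, \pi'(s)) \\ &= r(s, \pi'(s)) + \gamma \sum_{s' \in \mathcal{S}} p(s'|s, \pi'(s)) V^\pi(s') \\ &\leq r(s, \pi'(s)) + \gamma \sum_{s' \in \mathcal{S}} p(s'|s, \pi'(s)) Q^\pi(s', \pi'(s')) \\ &\leq \dots \\ &\leq V^{\pi'}(s) \end{aligned}$$

Each step replaces one more use of $\pi$ by $\pi'$. In the limit, the right-hand side is the expected discounted return of following $\pi'$ from $s$, which is $V^{\pi'}(s)$.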

## Policy Iteration Algorithm
Combining policy evaluation and policy improvement, we have the algorithm below:

- Randomly initialize $V(s)$ and $\pi(s)$

- do: # Policy Evaluation
    - $\Delta \leftarrow 0$
    - for every $s \in \mathcal{S}$:
        - $v \leftarrow V(s)$
        - $V(s) \leftarrow r(s, \pi(s)) + \gamma \sum_{s' \in \mathcal{S}} p(s'|s, \pi(s))V(s')$
        - $\Delta \leftarrow \max(\Delta, |v - V(s)|)$
    - end for
- while $\Delta > \theta$
- $\pi_{old} \leftarrow \pi$ # Policy Improvement
- for every $s \in \mathcal{S}$:
    - $\pi(s) \leftarrow \arg \max_{a \in \mathcal{A}} \{r(s, a) + \gamma \sum_{s' \in \mathcal{S}} p(s'|s, a)V(s')\}$
- end for
- if $\pi_{old} = \pi$:
    - return $\pi$
- else:
    - go back to the policy evaluation loop

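Below is an implementation of the policy iteration algorithm in the cliff walking environment. It stores the policy as a list of action probabilities per state, and it computes each round of values from the values of the previous round rather than updating them in place.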

In [3]:
import numpy as np
import math

In [8]:
class PolicyIteration:
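    """Policy iteration algorithm."""
    def __init__(self, env, theta, gamma):
        self.env = env
        self.v = [0] * self.env.ncol * self.env.nrow  # initialize state values to 0
        self.pi = [[0.25, 0.25, 0.25, 0.25]
                   for i in range(self.env.ncol * self.env.nrow)]  # start from the uniform random policy
        self.theta = theta  # convergence threshold for policy evaluation
        self.gamma = gamma  # discount factor

    def policy_evaluation(self):  # policy evaluation
        cnt = 0  # number of evaluation rounds
        # Sweep over all states until the largest value change drops below theta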
        while True:
            v_new = [0] * self.env.ncol * self.env.nrow
            for row in range(self.env.nrow):
                for col in range(self.env.ncol):
                    a_list = self.pi[row * self.env.ncol + col]
                    for a, a_prob in enumerate(a_list):
                        for p, next_s, r, done in self.env.P[row * self.env.ncol + col][a]:
                            if done:
                                v_new[row * self.env.ncol + col] += p * r * a_prob
                            else:
                                v_new[row * self.env.ncol + col] += p * (r + self.gamma * self.v[next_s]) * a_prob
                    
            cnt += 1
            delta = np.max(np.abs(np.array(v_new) - np.array(self.v)))
            # print(f"delta: {delta}")
            self.v = v_new

            if delta < self.theta:
                break

        print("Policy evaluation finished after %d rounds" % cnt)

    def policy_improvement(self):  # policy improvement
        # Compute Q(s, a) for every action from the current state values
        for row in range(self.env.nrow):
            for col in range(self.env.ncol):
                action_vals = []
                for a in range(4):
                    qsa = 0

                    for p, next_s, r, done in self.env.P[row * self.env.ncol + col][a]:
                        if done:
                            qsa += r * p
                        else:
                            qsa += (r + self.gamma * self.v[next_s]) * p
                    action_vals.append(qsa)

                # Important: split the probability evenly among all actions that attain the maximal Q(s, a)
                maxq = max(action_vals)
                cntq = action_vals.count(maxq)

                self.pi[row * self.env.ncol + col] = [1 / cntq if q == maxq else 0 for q in action_vals]

        print("Policy improvement finished")
        return self.pi

    def policy_iteration(self):  # policy iteration
        while True:
            self.policy_evaluation()
            old_pi = copy.deepcopy(self.pi)  # deep copy so the old policy can be compared afterwards
            new_pi = self.policy_improvement()
            if old_pi == new_pi:
                break

In [9]:
def print_agent(agent, action_meaning, disaster=[], end=[]):
    print("State values:")
    for i in range(agent.env.nrow):
        for j in range(agent.env.ncol):
            # Keep each value to 6 characters so the output lines up
            print('%6.6s' % ('%.3f' % agent.v[i * agent.env.ncol + j]), end=' ')
        print()

    print("Policy:")
    for i in range(agent.env.nrow):
        for j in range(agent.env.ncol):
            # Special states, such as the cliff cells in cliff walking
            if (i * agent.env.ncol + j) in disaster:
                print('****', end=' ')
            elif (i * agent.env.ncol + j) in end:  # goal state
                print('EEEE', end=' ')
            else:
                a = agent.pi[i * agent.env.ncol + j]
                pi_str = ''
                for k in range(len(action_meaning)):
                    pi_str += action_meaning[k] if a[k] > 0 else 'o'
                print(pi_str, end=' ')
        print()


env = CliffWalkingEnv()
action_meaning = ['^', 'v', '<', '>']
theta = 0.001
gamma = 0.9
agent = PolicyIteration(env, theta, gamma)
agent.policy_iteration()
print_agent(agent, action_meaning, list(range(37, 47)), [47])

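Policy evaluation finished after 60 rounds
Policy improvement finished
Policy evaluation finished after 72 rounds
Policy improvement finished
Policy evaluation finished after 44 rounds
Policy improvement finished
Policy evaluation finished after 12 rounds
Policy improvement finished
Policy evaluation finished after 1 rounds
Policy improvement finished
State values: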
-7.712 -7.458 -7.176 -6.862 -6.513 -6.126 -5.695 -5.217 -4.686 -4.095 -3.439 -2.710 
-7.458 -7.176 -6.862 -6.513 -6.126 -5.695 -5.217 -4.686 -4.095 -3.439 -2.710 -1.900 
-7.176 -6.862 -6.513 -6.126 -5.695 -5.217 -4.686 -4.095 -3.439 -2.710 -1.900 -1.000 
-7.458  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000 
Policy:
ovo> ovo> ovo> ovo> ovo> ovo> ovo> ovo> ovo> ovo> ovo> ovoo 
ovo> ovo> ovo> ovo> ovo> ovo> ovo> ovo> ovo> ovo> ovo> ovoo 
ooo> ooo> ooo> ooo> ooo> ooo> ooo> ooo> ooo> ooo> ooo> ovoo 
^ooo **** **** **** **** **** **** **** **** **** **** EEEE 
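
The printed state values can be checked by hand. Under the final policy the agent reaches the goal in $d$ steps, each with reward -1, so the value should be $-\sum_{k=0}^{d-1}\gamma^k = -(1-\gamma^d)/(1-\gamma)$. The sketch below evaluates this for three cells. The function name `shortest_path_value` is made up for illustration, and the step counts are read off the grid assuming the safe path shown in the policy printout.

```python
def shortest_path_value(num_steps, gamma=0.9):
    """Discounted return of num_steps rewards of -1 (hypothetical helper)."""
    return -sum(gamma ** k for k in range(num_steps))


# (cell, number of steps to the goal along the printed policy)
cells = [("above goal", 1), ("bottom-left", 13), ("top-left", 14)]
for name, num_steps in cells:
    print(name, round(shortest_path_value(num_steps), 3))
```

These agree with the entries -1.000, -7.458 and -7.712 in the state value table above.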


# Value Iteration Algorithm
The results above show that policy evaluation needs many rounds to converge to the state value function of a policy, which is computationally expensive. Do we have to finish policy evaluation before improving the policy? Not necessarily: it is possible that, even though the value function has not converged yet, the greedy policy will no longer change no matter how the value function is updated further.

The value iteration algorithm performs only one round of value updates and then directly improves the policy accordingly. Note that there is no explicit policy during the iteration; we only maintain the state value function.

To make this clearer, value iteration can be seen as DP applied to the Bellman Optimality Equation: $$V^*(s) = \max_{a \in \mathcal{A}}\{r(s, a) + \gamma \sum_{s'} P(s'|s, a)V^*(s')\}$$

Written as an iterative update: $$V^{k+1}(s) = \max_{a \in \mathcal{A}}\{r(s, a) + \gamma \sum_{s'} P(s'|s, a)V^{k}(s')\}$$

Once the iteration reaches the fixed point ($V^{k+1}=V^k$), we can recover the optimal policy by $$\pi^*(s) = \arg \max_a\{r(s, a) + \gamma \sum_{s'} P(s'|s, a)V^*(s')\}$$

The value iteration algorithm is given as:
- Initialize $V(s)$ randomly
- do:
    - $\Delta \leftarrow 0$
    - for each $s \in \mathcal{S}$:
        - $v \leftarrow V(s)$
        - $V(s) \leftarrow \max_{a} \{r(s, a) + \gamma \sum_{s'} P(s'|s, a)V(s')\}$
        - $\Delta \leftarrow \max(\Delta, |v-V(s)|)$
    - end for
- while $\Delta > \theta$
- return a deterministic policy $\pi(s) = \arg \max_a \{r(s, a) + \gamma \sum_{s'} P(s'|s, a)V(s')\}$
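
The loop above updates $V(s)$ in place, so later states in a sweep already see the new values of earlier states. A minimal sketch of that in-place variant is given here, using the same `P[s][a] = [(p, next_state, reward, done)]` layout as the cliff walking environment. The function name `value_iteration_in_place` and the 3-state chain are assumptions made for this example, and values start at 0 rather than at random.

```python
def value_iteration_in_place(P, theta, gamma):
    """In-place value iteration over P[s][a] = [(p, next_state, reward, done)]."""
    V = [0.0] * len(P)
    while True:
        delta = 0.0
        for s in range(len(P)):
            v_old = V[s]
            # Bellman optimality backup, using values already updated in this sweep
            V[s] = max(
                sum(p * (r + gamma * V[s_next] * (1 - done))
                    for p, s_next, r, done in P[s][a])
                for a in range(len(P[s]))
            )
            delta = max(delta, abs(v_old - V[s]))
        if delta <= theta:
            break
    # Extract a deterministic greedy policy from the final values
    policy = []
    for s in range(len(P)):
        q = [sum(p * (r + gamma * V[s_next] * (1 - done))
                 for p, s_next, r, done in P[s][a])
             for a in range(len(P[s]))]
        policy.append(q.index(max(q)))  # ties go to the lowest action index
    return V, policy


# Hypothetical 3-state chain: action 0 = left, action 1 = right; state 2 is terminal
chain = [
    [[(1.0, 0, -1, False)], [(1.0, 1, -1, False)]],
    [[(1.0, 0, -1, False)], [(1.0, 2, -1, True)]],
    [[(1.0, 2, 0, True)], [(1.0, 2, 0, True)]],
]
V, policy = value_iteration_in_place(chain, theta=1e-6, gamma=0.9)
print(V, policy)
```

On this chain both non-terminal states choose action 1 (right), with values close to -1.9 and -1.0. An in-place sweep and a synchronous sweep converge to the same fixed point; they may differ in how many sweeps they need.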

In [12]:
class ValueIteration:
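    """Value iteration algorithm."""
    def __init__(self, env, theta, gamma):
        self.env = env
        self.v = [0] * self.env.ncol * self.env.nrow  # initialize state values to 0
        self.theta = theta  # convergence threshold for the values
        self.gamma = gamma
        # Policy obtained after value iteration finishes
        self.pi = [None for i in range(self.env.ncol * self.env.nrow)]

    def value_iteration(self):
        cnt = 0
        # Apply the Bellman optimality backup to every state until the values stop changing
        while True: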
            v_new = [0] * self.env.ncol * self.env.nrow
            for s in range(len(self.v)):
                action_vals = []
                for a in range(4):
                    q = 0
                    for p, s_next, r, done in self.env.P[s][a]:
                        q += p * (r + self.gamma * self.v[s_next] * (1 - done))
                    action_vals.append(q)
                v_new[s] = max(action_vals)

            delta = np.max(np.abs(np.array(v_new) - np.array(self.v)))
            cnt += 1
            self.v = v_new
            if delta < self.theta:
                break

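        print("Value iteration ran for %d rounds in total" % cnt)
        self.get_policy()

    def get_policy(self):  # derive a greedy policy from the value function
        # Give equal probability to every action that attains the maximal Q(s, a)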
        for s in range(len(self.v)):
            action_vals = []
            for a in range(4):
                q = 0
                for p, s_next, r, done in self.env.P[s][a]:
                    q += p * (r + self.gamma * self.v[s_next] * (1 - done))
                action_vals.append(q)

            max_q = max(action_vals)
            cnt_q = action_vals.count(max_q)
            self.pi[s] = [1 / cnt_q if q == max_q else 0 for q in action_vals]

env = CliffWalkingEnv()
action_meaning = ['^', 'v', '<', '>']
theta = 0.001
gamma = 0.9
agent = ValueIteration(env, theta, gamma)
agent.value_iteration()
print_agent(agent, action_meaning, list(range(37, 47)), [47])

Value iteration ran for 15 rounds in total
State values:
-7.712 -7.458 -7.176 -6.862 -6.513 -6.126 -5.695 -5.217 -4.686 -4.095 -3.439 -2.710 
-7.458 -7.176 -6.862 -6.513 -6.126 -5.695 -5.217 -4.686 -4.095 -3.439 -2.710 -1.900 
-7.176 -6.862 -6.513 -6.126 -5.695 -5.217 -4.686 -4.095 -3.439 -2.710 -1.900 -1.000 
-7.458  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000 
Policy:
ovo> ovo> ovo> ovo> ovo> ovo> ovo> ovo> ovo> ovo> ovo> ovoo 
ovo> ovo> ovo> ovo> ovo> ovo> ovo> ovo> ovo> ovo> ovo> ovoo 
ooo> ooo> ooo> ooo> ooo> ooo> ooo> ooo> ooo> ooo> ooo> ovoo 
^ooo **** **** **** **** **** **** **** **** **** **** EEEE 


# Frozen Lake Environment
The frozen lake environment is similar to cliff walking, except that an action does not lead deterministically to the next state. In the frozen lake environment, the outcome of an action in a given state is sampled from a distribution over the states reachable from the current state under that action.

OpenAI Gym provides the frozen lake environment. Every step has a reward of 0, except for a step that reaches the goal, which has a reward of 1.

In [18]:
import gym
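env = gym.make("FrozenLake-v1")  # create the environment
env = env.unwrapped  # unwrap it to access the transition table P
env.render()  # render the environment, usually as a pop-up window or a printed visualization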

holes = set()
ends = set()
for s in env.P:
    for a in env.P[s]:
        for s_ in env.P[s][a]:
            if s_[2] == 1.0:  # a reward of 1 means this is the goal
                ends.add(s_[1])
            if s_[3] == True:
                holes.add(s_[1])
holes = holes - ends
print("Hole indices:", holes)
print("Goal indices:", ends)

for a in env.P[14]:  # inspect the transitions of the cell to the left of the goal
    print(env.P[14][a])

Hole indices: {11, 12, 5, 7}
Goal indices: {15}
[(0.3333333333333333, 10, 0.0, False), (0.3333333333333333, 13, 0.0, False), (0.3333333333333333, 14, 0.0, False)]
[(0.3333333333333333, 13, 0.0, False), (0.3333333333333333, 14, 0.0, False), (0.3333333333333333, 15, 1.0, True)]
[(0.3333333333333333, 14, 0.0, False), (0.3333333333333333, 15, 1.0, True), (0.3333333333333333, 10, 0.0, False)]
[(0.3333333333333333, 15, 1.0, True), (0.3333333333333333, 10, 0.0, False), (0.3333333333333333, 13, 0.0, False)]
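
Each printed row lists the three equally likely outcomes of one action in state 14. To illustrate what "sampled from a distribution" means here, the sketch below draws outcomes for action 1 from the second row. The helper `sample_transition` is hypothetical and only illustrates the idea; it makes no claim about how Gym samples internally.

```python
import random

# Outcomes printed above for state 14 and action 1: (p, next_state, reward, done)
transitions = [
    (1 / 3, 13, 0.0, False),
    (1 / 3, 14, 0.0, False),
    (1 / 3, 15, 1.0, True),
]


def sample_transition(transitions, rng):
    """Hypothetical helper: draw one outcome according to its probability."""
    weights = [p for p, _, _, _ in transitions]
    _, next_state, reward, done = rng.choices(transitions, weights=weights, k=1)[0]
    return next_state, reward, done


rng = random.Random(0)  # fixed seed so the run is repeatable
counts = {13: 0, 14: 0, 15: 0}
for _ in range(3000):
    next_state, reward, done = sample_transition(transitions, rng)
    counts[next_state] += 1
print(counts)
```

Each of the three next states should appear in roughly one third of the draws, so the same action from the same state reaches the goal only about one time in three.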


In [19]:
# These action meanings are predefined by the Gym library for the frozen lake environment
action_meaning = ['<', 'v', '>', '^']
theta = 1e-5
gamma = 0.9
agent = PolicyIteration(env, theta, gamma)
agent.policy_iteration()
print_agent(agent, action_meaning, [5, 7, 11, 12], [15])

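Policy evaluation finished after 25 rounds
Policy improvement finished
Policy evaluation finished after 58 rounds
Policy improvement finished
State values: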
 0.069  0.061  0.074  0.056 
 0.092  0.000  0.112  0.000 
 0.145  0.247  0.300  0.000 
 0.000  0.380  0.639  0.000 
Policy:
<ooo ooo^ <ooo ooo^ 
<ooo **** <o>o **** 
ooo^ ovoo <ooo **** 
**** oo>o ovoo EEEE 
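
With stochastic transitions, the greedy action is the one with the largest expected backup, not the one that points at the goal. As a rough check, the sketch below recomputes $Q(14, a)$ from the transitions printed earlier for state 14 and the rounded state values printed above. The variable names are chosen for this example, and the inputs are accurate only to the 3 printed decimals.

```python
gamma = 0.9

# State values printed above (rounded to 3 decimals) for the states reachable from 14
V = {10: 0.300, 13: 0.380, 14: 0.639, 15: 0.000}

# Transitions printed earlier for state 14: one list of (p, next_state, reward, done) per action
P14 = [
    [(1 / 3, 10, 0.0, False), (1 / 3, 13, 0.0, False), (1 / 3, 14, 0.0, False)],
    [(1 / 3, 13, 0.0, False), (1 / 3, 14, 0.0, False), (1 / 3, 15, 1.0, True)],
    [(1 / 3, 14, 0.0, False), (1 / 3, 15, 1.0, True), (1 / 3, 10, 0.0, False)],
    [(1 / 3, 15, 1.0, True), (1 / 3, 10, 0.0, False), (1 / 3, 13, 0.0, False)],
]

# Q(s, a) = sum over outcomes of p * (r + gamma * V(s')), with no bootstrap after a terminal step
q_values = [
    sum(p * (r + gamma * V[s_next] * (1 - done)) for p, s_next, r, done in outcomes)
    for outcomes in P14
]
best_action = q_values.index(max(q_values))
print([round(q, 3) for q in q_values], best_action)
```

The largest value belongs to action 1 (`v`) and is about 0.639, which matches both the `ovoo` entry for state 14 in the policy printout and the printed value of that state.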


In [20]:
action_meaning = ['<', 'v', '>', '^']
theta = 1e-5
gamma = 0.9
agent = ValueIteration(env, theta, gamma)
agent.value_iteration()
print_agent(agent, action_meaning, [5, 7, 11, 12], [15])

Value iteration ran for 61 rounds in total
State values:
 0.069  0.061  0.074  0.056 
 0.092  0.000  0.112  0.000 
 0.145  0.247  0.300  0.000 
 0.000  0.380  0.639  0.000 
Policy:
<ooo ooo^ <ooo ooo^ 
<ooo **** <o>o **** 
ooo^ ovoo <ooo **** 
**** oo>o ovoo EEEE 
