# Dynamic Programming
Now that we have the Bellman Optimality Equation, it is very natural that we come up with the idea to solve MDP with Dynamic Programming (DP). 

There are 2 main types of DP for RL: policy iteration and value iteration. Policy iteration consists of 2 parts: policy evaluation and policy improvement. The policy evalutaion in policy iteration utilizes the Bellman Expactation Equation to get a policy's state value function, which is a DP process; whereas value iteration directly uses the Bellman Optimality Equation to do DP and reaches the final optimality.

Different from Monte Carlo methods, DP methods requires we know $P$ and $r$. Howevver, such white-box environments are rare, which limits the usage of DP. Besides, DP may fits only on descrete MDP, where the states and actions are descrete.

# Cliff Walking Enverionment

This is a classic RL environment, where an agent walks next to a cliff. The agent should avoid the cliff parts and walk from a given starting point to the given end point. Each step has a reward of -1, when the agent entering the cliff area, the reward is -100. Reaching either the end point or the cliff ends the episode. The action space contains 4 actions: up, down, left and right. A action that tries to cross the boundary of the envrionment lead to no move. Below is the code for the cliff walking environment.

In [55]:
import copy


class CliffWalkingEnv:
    """ 悬崖漫步环境"""
    def __init__(self, ncol=12, nrow=4):
        self.ncol = ncol  # 定义网格世界的列
        self.nrow = nrow  # 定义网格世界的行
        # 转移矩阵P[state][action] = [(p, next_state, reward, done)]包含下一个状态和奖励
        self.P = self.createP()

    def createP(self):
        # 初始化
        P = [[[] for j in range(4)] for i in range(self.nrow * self.ncol)]
        # 4种动作, change[0]:上,change[1]:下, change[2]:左, change[3]:右。坐标系原点(0,0)
        # 定义在左上角
        change = [[0, -1], [0, 1], [-1, 0], [1, 0]]
        for i in range(self.nrow):
            for j in range(self.ncol):
                for a in range(4):
                    # 位置在悬崖或者目标状态,因为无法继续交互,任何动作奖励都为0
                    if i == self.nrow - 1 and j > 0:
                        P[i * self.ncol + j][a] = [(1, i * self.ncol + j, 0, True)]
                        continue
                    # 其他位置
                    next_x = min(self.ncol - 1, max(0, j + change[a][0]))
                    next_y = min(self.nrow - 1, max(0, i + change[a][1]))
                    next_state = next_y * self.ncol + next_x
                    reward = -1
                    done = False
                    # 下一个位置在悬崖或者终点
                    if next_y == self.nrow - 1 and next_x > 0:
                        done = True
                        if next_x != self.ncol - 1:  # 下一个位置在悬崖
                            reward = -100
                    P[i * self.ncol + j][a] = [(1, next_state, reward, done)]
        return P

In [56]:
env = CliffWalkingEnv()
print(env.P[35][0])

[(1, 23, -1, False)]


# Policy Iteration
## Policy Evalutaion 
Policy evaluation evaluates the value of a policy. Recall the Bellman Expectation euqation: $$V^\pi(s) = \sum_{a \in \mathcal{A}} \pi(a|s) (r(s, a) + \gamma \sum_{s' \in \mathcal{S}} p(s'|s, a)V^{\pi}(s'))$$

Using the idea of DP, we can update the value of this iteration by that of the last iteration: $$V^{k+1}(s) = \sum_{a \in \mathcal{A}} \pi(a|s) (r(s, a) + \gamma \sum_{s' \in \mathcal{S}} p(s'|s, a)V^{k}(s'))$$

We may choose any starting point $V^0$. According the Bellman Expectation Equation, $V^k=V^\pi$ is a fixed point of the updating equation above. Hence, when we have $\max_{s \in \mathcal{S}}|V^{k+1}-V^{k}| \leq \epsilon$ where $\epsilon$ is very small, the evalutaion may end.

## Policy Improvement
Now that we have $V^\pi$ in Policy Evaluation, we can improve it. Assume we have $Q^\pi(s, a) > V^\pi(s)$, we can replace the previous action with action $a$ to improve $V^\pi(s)$. Similarly, we can assume there exists a policy $\pi'$, such that $Q^\pi(s, \pi'(s)) \geq V^\pi(s)$, then we have $V^{\pi'}(s) \geq V^\pi(s)$. This is the policy improvement theorem. Hence, we can improve the policy greedily: $$\pi'(s) = \arg \max_{a \in \mathcal{A}} Q^\pi(s, a) = \arg \max_{a \in \mathcal{A}} \{r(s, a) + \gamma \sum_{s' \in \mathcal{S}} p(s'|s, a)V^{k}(s')\}$$

## Policy Iteration Algotithm
Combine policy evaluation and improvement together, we have the algorithm below: 

- Randomly init $V(s)$, $\pi(s)$

- while $\Delta > \theta$, do: # Pollicy Evaluation
    - $\Delta \leftarrow 0$
    - for every $s \in \mathcal{S}$:
        - $v \leftarrow V(s)$
        - $V(s) \leftarrow r(s, \pi(s)) + \gamma \sum_{s' \in \mathcal{S}} p(s'|s, \pi(s))V(s')$
        - $\Delta \leftarrow \max(\Delta, |v - V(s)|)$
    - end for
- end while
- $\pi_{old} = \pi$ # Policy Improvement
- for every $s \in \mathcal{S}$:
    - $\pi(s) \leftarrow \arg \max_{a \in \mathcal{A}} \{r(s, a) + \gamma \sum_{s' \in \mathcal{S}} p(s'|s, a)V(s')\}$
- if $\pi_{old} == \pi$:
    - return $\pi$
- else:
    - Redo since the while loop

Below is the implementation of Policy Iteration Algorithm in the cliff walking environment.

In [4]:
import numpy as np
import math

In [61]:
class PolicyIteration:
    """ 策略迭代算法 """
    def __init__(self, env, theta, gamma):
        self.env = env
        self.v = [0] * self.env.ncol * self.env.nrow  # 初始化价值为0
        self.pi = [[0.25, 0.25, 0.25, 0.25]
                   for i in range(self.env.ncol * self.env.nrow)]  # 初始化为均匀随机策略
        self.theta = theta  # 策略评估收敛阈值
        self.gamma = gamma  # 折扣因子

    def policy_evaluation(self):  # 策略评估
        cnt = 0  # 计数器
        # TODO: fill the blank
        while True:
            v_new = [0] * self.env.ncol * self.env.nrow
            for row in range(self.env.nrow):
                for col in range(self.env.ncol):
                    a_list = self.pi[row * self.env.ncol + col]
                    for a, a_prob in enumerate(a_list):
                        for p, next_s, r, done in self.env.P[row * self.env.ncol + col][a]:
                            if done:
                                v_new[row * self.env.ncol + col] += p * r * a_prob
                            else:
                                v_new[row * self.env.ncol + col] += p * (r + self.gamma * self.v[next_s]) * a_prob
                    
            cnt += 1
            delta = np.max(np.abs(np.array(v_new) - np.array(self.v)))
            print(f"delta: {delta}")
            self.v = v_new

            if delta < self.theta:
                break

        print("策略评估进行%d轮后完成" % cnt)

    def policy_improvement(self):  # 策略提升
        # TODO: fill the blank
        for row in range(self.env.nrow):
            for col in range(self.env.ncol):
                action_vals = []
                for a in range(4):
                    qsa = 0

                    for p, next_s, r, done in self.env.P[row * self.env.ncol + col][a]:
                        if done:
                            qsa += r * p
                        else:
                            qsa += (r + self.gamma * self.v[next_s]) * p
                    action_vals.append(qsa)
                self.pi[row * self.env.ncol + col] = action_vals

        print("策略提升完成")
        return self.pi

    def policy_iteration(self):  # 策略迭代
        while 1:
            self.policy_evaluation()
            old_pi = copy.deepcopy(self.pi)  # 将列表进行深拷贝,方便接下来进行比较
            new_pi = self.policy_improvement()
            if old_pi == new_pi: break

In [62]:
def print_agent(agent, action_meaning, disaster=[], end=[]):
    print("状态价值：")
    for i in range(agent.env.nrow):
        for j in range(agent.env.ncol):
            # 为了输出美观,保持输出6个字符
            print('%6.6s' % ('%.3f' % agent.v[i * agent.env.ncol + j]), end=' ')
        print()

    print("策略：")
    for i in range(agent.env.nrow):
        for j in range(agent.env.ncol):
            # 一些特殊的状态,例如悬崖漫步中的悬崖
            if (i * agent.env.ncol + j) in disaster:
                print('****', end=' ')
            elif (i * agent.env.ncol + j) in end:  # 目标状态
                print('EEEE', end=' ')
            else:
                a = agent.pi[i * agent.env.ncol + j]
                pi_str = ''
                for k in range(len(action_meaning)):
                    pi_str += action_meaning[k] if a[k] > 0 else 'o'
                print(pi_str, end=' ')
        print()


env = CliffWalkingEnv()
action_meaning = ['^', 'v', '<', '>']
theta = 0.001
gamma = 0.9
agent = PolicyIteration(env, theta, gamma)
agent.policy_iteration()
print_agent(agent, action_meaning, list(range(37, 47)), [47])

delta: 25.75
delta: 12.0375
delta: 8.0240625
delta: 5.937363281250001
delta: 4.089732714843752
delta: 3.27681584472656
delta: 2.5250907914428673
delta: 2.2306387758316024
delta: 1.9579752046523673
delta: 1.6901071782382928
delta: 1.4595255911432226
delta: 1.269339887163735
delta: 1.1028484997896264
delta: 0.9590274898704862
delta: 0.8288213673899314
delta: 0.7168213885465349
delta: 0.6178573650509378
delta: 0.5328496576682191
delta: 0.4586530697272728
delta: 0.39493595566952777
delta: 0.33968529971880557
delta: 0.2922321109942061
delta: 0.2512339482312349
delta: 0.21601524266326777
delta: 0.1856515498418645
delta: 0.15956478939328989
delta: 0.13710303972808546
delta: 0.117804224101004
delta: 0.10120084586155897
delta: 0.0869357609571999
delta: 0.07466993073367334
delta: 0.06413228860047937
delta: 0.05507516148160718
delta: 0.047294948142834414
delta: 0.040609898302744085
delta: 0.034868037283086295
delta: 0.029935629640885253
delta: 0.025699677389749098
delta: 0.022061624656650736
delt

KeyboardInterrupt: 