#  Dynamic Programming

Written by [Junkun Yuan](https://junkunyuan.github.io/) (yuanjk0921@outlook.com).

See paper reading list and notes [here](https://junkunyuan.github.io/paper_reading_list/paper_reading_list.html).

Last updated on Sep 06, 2025; &nbsp; First committed on Sep 06, 2025.

**References**

- [**Hands-on RL**](https://github.com/boyu-ai/Hands-on-RL/blob/main/%E7%AC%AC2%E7%AB%A0-%E5%A4%9A%E8%87%82%E8%80%81%E8%99%8E%E6%9C%BA%E9%97%AE%E9%A2%98.ipynb)

**Contents**
- Introduction

## Introduction

[**Dynamic programming**](https://en.wikipedia.org/wiki/Dynamic_programming) solves a complicated problem by recursively breaking it down into simpler sub-problems and reusing their solutions.
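A minimal illustration of the idea, not taken from the referenced material: computing Fibonacci numbers with memoization, so that each sub-problem is solved once and then reused. The function name `fib` is chosen here purely for illustration.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def fib(n):
    """Return the n-th Fibonacci number, caching every sub-problem."""
    if n < 2:
        return n
    # Each call reuses the cached answers of the two smaller sub-problems.
    return fib(n - 1) + fib(n - 2)


print(fib(10))  # 55
```

Without the cache the same sub-problems would be recomputed exponentially many times; with it, the work grows linearly in `n`.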

[**Value iteration**](https://en.wikipedia.org/wiki/Markov_decision_process#Value_iteration) repeatedly applies the Bellman optimality equation as an update rule until the state values converge to the optimal state value function.
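As a sketch, the update can be written as follows, where $r(s,a)$ is the reward function, $p(s'|s,a)$ the transition function, and $\gamma$ the discount factor. The superscript $k$ is an assumed name for the number of sweeps completed so far.

$$
V^{k+1}(s)=\max_{a\in\mathcal{A}}\Big(r(s,a)+\gamma\sum_{s'\in\mathcal{S}}p(s'|s,a)V^{k}(s')\Big)
$$

The update is applied to every state until the largest change between $V^{k}$ and $V^{k+1}$ falls below a chosen tolerance.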

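[**Policy iteration**](https://en.wikipedia.org/wiki/Markov_decision_process#Policy_iteration) alternates between **policy evaluation** and **policy improvement**. Policy evaluation computes the state value function of the current policy by iterating the Bellman equation, which is itself a dynamic programming process. Policy improvement then updates the policy to act greedily with respect to those values.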
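The improvement step is commonly written as a greedy update. In this sketch, $\pi'$ is an assumed name for the improved policy, $r(s,a)$ is the reward function, $p(s'|s,a)$ the transition function, and $\gamma$ the discount factor:

$$
\pi'(s)=\arg\max_{a\in\mathcal{A}}\Big(r(s,a)+\gamma\sum_{s'\in\mathcal{S}}p(s'|s,a)V^{\pi}(s')\Big)
$$

The two steps alternate until the policy no longer changes.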

Value iteration and policy iteration require (1) access to the **transition function** and **reward function**, and (2) a discrete and finite state space and action space.
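Both requirements are visible in a single Bellman backup: it loops over every action and every successor state, and it reads the transition and reward functions directly. The sketch below uses a hypothetical 3-state chain (states 0 and 1, plus a goal state 2) that is not part of the original material. The names `p`, `r`, `bellman_backup`, and `value_iteration` are illustrative assumptions.

```python
GAMMA = 0.9      # discount factor (assumed value)
N_STATES = 3     # states 0, 1 and the absorbing goal state 2
N_ACTIONS = 2    # action 0 = left, action 1 = right


def p(s, a):
    """Transition function: map each next state to its probability."""
    if s == 2:                       # the goal is absorbing
        return {2: 1.0}
    next_state = max(0, s - 1) if a == 0 else s + 1
    return {next_state: 1.0}


def r(s, a):
    """Reward function: -1 per step, 0 once the goal is reached."""
    return 0.0 if s == 2 else -1.0


def bellman_backup(V, s):
    """One optimality backup: enumerate all actions and successor states."""
    return max(
        r(s, a) + GAMMA * sum(prob * V[s2] for s2, prob in p(s, a).items())
        for a in range(N_ACTIONS)
    )


def value_iteration(n_sweeps=100):
    """Apply the backup to every state for a fixed number of sweeps."""
    V = [0.0] * N_STATES
    for _ in range(n_sweeps):
        V = [bellman_backup(V, s) for s in range(N_STATES)]
    return V


print(value_iteration())  # approximately [-1.9, -1.0, 0.0]
```

If `p` or `r` were unknown, or if the states could not be enumerated, neither the `max` over actions nor the sum over successors could be computed exactly.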

## Cliff Walking

The **Cliff Walking** problem is a reinforcement learning gridworld task where an agent must navigate from a *start state* to a *goal state* while avoiding a "cliff" of *terminal states* that yield large negative rewards.
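To visualize the layout, the sketch below prints the grid as text. It assumes the standard arrangement: the start state in the bottom-left corner, the goal in the bottom-right corner, and the cliff filling the bottom row between them. The helper name `render_cliff_grid` and the symbols `S`, `C`, `G`, and `.` are illustrative choices.

```python
def render_cliff_grid(ncol=12, nrow=4):
    """Return a text picture of the grid: S = start, C = cliff, G = goal."""
    rows = []
    for i in range(nrow):
        cells = []
        for j in range(ncol):
            if i == nrow - 1 and j == 0:
                cells.append("S")          # bottom-left corner: start
            elif i == nrow - 1 and j == ncol - 1:
                cells.append("G")          # bottom-right corner: goal
            elif i == nrow - 1:
                cells.append("C")          # rest of the bottom row: cliff
            else:
                cells.append(".")          # ordinary cell
        rows.append(" ".join(cells))
    return "\n".join(rows)


print(render_cliff_grid())
```

For the default 4 x 12 grid this prints three rows of dots followed by a bottom row that starts with `S`, contains ten `C` cells, and ends with `G`.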

```python
import copy


class CliffWalkingEnv:
    """Cliff Walking gridworld with a tabular transition model."""

    def __init__(self, ncol=12, nrow=4):
        self.ncol = ncol  # number of columns in the grid
        self.nrow = nrow  # number of rows in the grid
        # Transition table: P[state][action] = [(p, next_state, reward, done)]
        self.P = self.createP()

    def createP(self):
        # One (initially empty) list of outcomes per (state, action) pair
        P = [[[] for j in range(4)] for i in range(self.nrow * self.ncol)]
        # 4 actions as (dx, dy) offsets: 0 = up, 1 = down, 2 = left, 3 = right.
        # The origin (0, 0) is the top-left cell, so "up" decreases the row index.
        change = [[0, -1], [0, 1], [-1, 0], [1, 0]]
        for i in range(self.nrow):
            for j in range(self.ncol):
                for a in range(4):
                    # Cliff and goal cells are terminal: every action stays
                    # in place with reward 0
                    if i == self.nrow - 1 and j > 0:
                        P[i * self.ncol + j][a] = [(1, i * self.ncol + j, 0, True)]
                        continue
                    # All other cells: move one step, clipped to the grid
                    next_x = min(self.ncol - 1, max(0, j + change[a][0]))
                    next_y = min(self.nrow - 1, max(0, i + change[a][1]))
                    next_state = next_y * self.ncol + next_x
                    reward = -1
                    done = False
                    # The next cell is a cliff cell or the goal
                    if next_y == self.nrow - 1 and next_x > 0:
                        done = True
                        if next_x != self.ncol - 1:  # stepped into the cliff
                            reward = -100
                    P[i * self.ncol + j][a] = [(1, next_state, reward, done)]
        return P
```
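To inspect the transition table, a small helper can print the outcomes of every action in a given state. This is a sketch: `describe_transitions` and `ACTION_NAMES` are hypothetical names, and the helper only assumes the `P[state][action] = [(p, next_state, reward, done)]` layout used above.

```python
ACTION_NAMES = ["up", "down", "left", "right"]  # same order as the action indices


def describe_transitions(P, state, ncol):
    """Return one readable line per possible outcome of each action in `state`."""
    lines = []
    for a, outcomes in enumerate(P[state]):
        for prob, next_state, reward, done in outcomes:
            row, col = next_state // ncol, next_state % ncol
            lines.append(
                f"{ACTION_NAMES[a]}: p={prob} -> row {row}, col {col}, "
                f"reward {reward}, done {done}"
            )
    return lines
```

With the default environment, `describe_transitions(CliffWalkingEnv().P, 36, 12)` describes the bottom-left cell (row 3, column 0). Moving up leads to row 2 with reward -1. Moving down or left leaves the agent in place with reward -1. Moving right steps into the cliff with reward -100 and `done` set to `True`.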

## Policy Iteration

### Policy Evaluation

Policy evaluation computes the state value function of a given policy.

Given the Bellman equation:

$$
V^{\pi}(s)=\mathbb{E}_{\pi}[R_t+\gamma V^{\pi}(S_{t+1})\mid S_{t}=s]=\sum_{a\in\mathcal{A}}\pi(a|s)\Big(r(s,a)+\gamma\sum_{s'\in\mathcal{S}}p(s'|s,a)V^{\pi}(s')\Big),
$$