# 03 Algorithms

## 3.5 Temporal-Difference Methods

### Temporal-Difference Learning of State Values
给定策略$\pi$，我们的目标是：对于所有的$s \in \mathcal{S}$，估计状态价值$v_{\pi}(s)$。假设我们有遵循策略$\pi$生成的样本集$ (s_0, r_1, s_1, s_1, r_2, s_2, ... ,s_{T-1}, r_T, \mathcal s_T) $，也可以表示为$\{(s_i, r_{i+1}, s_{i+1})\}^{T}_{i=0}$，从状态价值函数的定义
$$
v_{\pi}(s_t) = E_{\pi}[r_{t+1} + \gamma v_{\pi}(s_{t+1}) | s_t = s]
$$
可令：
$$
g(v_{\pi}(s_t)) = v_{\pi}(s_t) - E_{\pi}[r_{t+1} + \gamma v_{\pi}(s_{t+1}) | s_t = s]
$$
利用Robbins-Monro算法进行求解，我们有：
$$
\tilde{g}(v_{\pi}(s_t)) = v_{\pi}(s_t) - [(r_{t+1} + \gamma v_{\pi}(s_{t+1}))]
$$
根据 Robbins-Monro算法，$g(v_{\pi}(s_t)) = 0$的解可以使用一下迭代过程来求解：
$$
\begin{aligned}
v_{t+1}(s_t) &= v_t(s_t) - \alpha_t(s_t) \tilde{g}(v_t(s_t)) \\
&= v_t(s_t) - \alpha_t(s_t)[v_t(s_t) - (r_{t+1} + \gamma v_t(s_{t+1}))]
\end{aligned}
$$
即为Temporal-Difference方法。


Temporal-Difference方法利用这些样本来估计状态价值:
$$
\begin{cases}
v_{t+1}(s_t) = v_t(s_t) - \alpha_t(s_t)[v_t(s_t) - (r_{t+1} + \gamma v_t(s_{t+1}))] & s=s_t \\
v_{t+1}(s) = v_t(s) & s \neq s_t
\end{cases}
$$

其中，$t=0,1,2,...$，$\alpha_t$是一个很小的正数，代表学习率。

Temporal-Difference中的重要名词解释：
$$
\underbrace{v_{t+1}(s_t)}_{prediction} = \underbrace{v_t(s_t)}_{current \ estimate} - \alpha_t(s_t)\overbrace{[v_t(s_t) - (\underbrace {r_{t+1} + \gamma v_t(s_{t+1})}_{TD \ target \ \bar v_t})]}^{TD \ error \ \delta_t}
$$

### Temporal-Difference Learning VS Monte Carlo Learning
| Temporal-Difference Learning                                       | Monte Carlo Learning                                             |
|--------------------------------------------------------------------|------------------------------------------------------------------|
| **Incremental:** TD算法是增量的，它能够在样本集上直接增量更新状态/动作价值；                   | **Non-incremental:** MC方法则是非增量的，必须等到一整个episode结束后才能进行更新；因为MC算法必须计算整个episode的return，这个过程通常比较耗时； |
| **Continuing task:** 因为TD算法是增量的，它既能够处理回合制和连续性任务；                   | **Episodic Tasks:** 由于MC是非增量的，它只能处理有限步的回合任务;                     |
| **Bootstrapping:** 采用bootstrapping的方法预测当前状态的价值函数（通常利用下一个时刻的状态-动作价值来近似当前的状态-动作价值）,要求对状态/动作价值的猜测值 | **Sampling:** 使用实际回报值进行更新，而不用估计的未来奖励值；                           |
| **Low estimation variance:** TD方法通常有较小的方差（因为它是单步更新）；               | **High estimation variance:** 由于MC方法是使用完整的episode，所以它的方差相对较高;    |
| **Fastly**: 只利用单步转移后的奖励来更新对状态价值的估计，不需要等待整个回合结束;                    |  **Slowly**：需要完整的轨迹才能计算期望的回报。它们通过对多次经历中的每个状态的实际返回值取平均来估计其期望价值;|

### Example

In [1]:
import time

import numpy as np
import gymnasium as gym
import matplotlib.pyplot as plt
from tqdm import tqdm

In [2]:
class TemporalDifference:
    """ Temporal Difference learning for estimating the state values"""

    def __init__(self, env, gamma=0.95, alpha=0.1, epsilon=0.05, episodes=500, epsilon_decay=0.99):
        """ Initialize the parameters of temporal difference method """

        self.env = env
        self.gamma = gamma
        self.alpha = alpha
        self.epsilon = epsilon
        self.episodes = episodes
        self.epsilon_decay = epsilon_decay

        self.values = np.zeros(env.observation_space.n)
        self.policy = np.ones((env.observation_space.n, env.action_space.n)) / env.action_space.n

    def select_best_action(self, state):
        """ Select the best action based on the current value function """
        q_values = np.zeros(self.env.action_space.n)
        for action in range(self.env.action_space.n):
            for prob, next_state, reward, done in self.env.unwrapped.P[state][action]:
                reward = self.custom_reward(done, reward)
                q_values[action] += prob * (reward + self.gamma * self.values[state] * (1 - done))
        return np.argmax(q_values)

    @staticmethod
    def custom_reward(done, reward):
        if done and reward == 1:
            return 10
        elif done and reward == 0:
            return -5
        else:
            return -0.1

    def take_action(self, state):
        """ Take an action based on the current policy """

        if np.random.uniform(0, 1) < self.epsilon:
            return self.env.action_space.sample()
        else:
            return self.select_best_action(state)

    def update(self):
        """ Update the value function using temporal difference method """

        state, _ = self.env.reset()
        done = False
        while not done:
            action = self.take_action(state)
            next_state, reward, terminated, truncated, info = self.env.step(action)

            done = terminated or truncated
            reward = self.custom_reward(done, reward)

            td_target = reward + self.gamma * self.values[next_state]
            td_error = self.values[state] - td_target
            self.values[state] += -self.alpha * td_error

            state = next_state

    def visualize_policy(self, delay=0.5):
        state, info = self.env.reset()
        done = False

        while not done:
            self.env.render()
            action = np.argmax(self.policy[state])
            state, reward, terminated, truncated, info = self.env.step(action)
            done = terminated or truncated
            time.sleep(delay)

        self.env.render()
        self.env.close()

    def get_optimal_policy(self):
        """ Get Optimal Policy from value function """

        for state in range(self.env.observation_space.n):
            policy = np.zeros(self.env.action_space.n)
            policy[self.select_best_action(state)] = 1.0
            self.policy[state] = policy

        return self.policy

    def train(self):
        """ Train the agent for a specified number of episodes """

        for episode in range(self.episodes):
            self.update()
            self.epsilon *= self.epsilon_decay
            self.epsilon = max(self.epsilon, 0.01)
            print(f'{episode} Episode Complete Values:', self.values, '\n')
            print(f'Epsilon:', self.epsilon, '\n')
        print('Training complete')

In [3]:
environment = gym.make('FrozenLake-v1', desc=None, map_name='4x4', is_slippery=True, render_mode='human')
environment.reset()

2025-02-13 13:46:50.704 python[49137:45113186] +[IMKClient subclass]: chose IMKClient_Modern
2025-02-13 13:46:50.704 python[49137:45113186] +[IMKInputSession subclass]: chose IMKInputSession_Modern


(0, {'prob': 1})

In [4]:
agent = TemporalDifference(environment, gamma=0.9, episodes=100, epsilon=0.9, alpha=0.1)

In [5]:
agent.epsilon = 0.99
agent.train()
print(f"Optimal Values: {agent.values}")

0 Episode Complete Values: [-0.0367309  0.         0.         0.        -0.01       0.
  0.         0.        -0.509      0.         0.         0.
  0.         0.         0.         0.       ] 

Epsilon: 0.9801 

1 Episode Complete Values: [-0.04395781  0.          0.          0.         -0.509       0.
  0.          0.         -0.509       0.          0.          0.
  0.          0.          0.          0.        ] 

Epsilon: 0.9702989999999999 

2 Episode Complete Values: [-0.13535498 -0.5         0.          0.         -0.48187688  0.
  0.          0.         -0.509       0.          0.          0.
  0.          0.          0.          0.        ] 

Epsilon: 0.96059601 

3 Episode Complete Values: [-0.19067419 -0.5         0.          0.         -0.9383523   0.
  0.          0.         -0.509       0.          0.          0.
  0.          0.          0.          0.        ] 

Epsilon: 0.9509900498999999 

4 Episode Complete Values: [-0.22660678 -0.46       -0.01       -0.5        -0

In [6]:
agent.get_optimal_policy()

array([[1., 0., 0., 0.],
       [0., 0., 0., 1.],
       [1., 0., 0., 0.],
       [0., 0., 0., 1.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [0., 0., 0., 1.],
       [0., 1., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [1., 0., 0., 0.]])

In [7]:
agent.visualize_policy(delay=0.005)