# Introduction
DP requires a known MDP. In that case, the agent does not need to interact with the environment to gather information about it; it can use DP directly to compute the optimal policy and value function. This is similar to a supervised learning task in which the data distribution is given, so that we can minimize the expected error directly instead of sampling data points.

However, this is unrealistic in most RL scenarios, where the MDP cannot be written down explicitly and its distribution is unknown. In this setting, the agent has to interact with the environment and learn from sampled data. Because such methods use no environment model, they are called model-free RL.

In this section we will learn two classic temporal difference methods that belong to model-free RL: Sarsa and Q-learning. We also introduce a pair of concepts: on-policy learning vs. off-policy learning. Generally speaking, on-policy learning forces the agent to learn from data sampled by its current policy; once the policy is updated, previous samples are no longer valid. Off-policy learning, on the other hand, collects previously sampled data into a replay pool for further use. Therefore, off-policy learning usually makes better use of historical data and has a smaller sample complexity (the number of samples required to reach convergence).

# Temporal Difference
Temporal difference (TD) is a method for estimating the value function that combines ideas from Monte Carlo (MC) methods and DP. TD is similar to MC in that both learn from sampled data without needing to know the environment. TD resembles DP in that it estimates the current state's value from the values of later states, according to the Bellman equation.

Recall the incremental update of the value function in MC: $$V(s_t) \leftarrow V(s_t) + \alpha[G_t - V(s_t)]$$ where we replace $\frac{1}{N(s)}$ with $\alpha$, a constant step size for the update. MC must wait until the episode ends to compute $G_t$, while TD can make its update as soon as a single step ends.
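To make the MC update concrete, here is a minimal sketch of computing $G_t$ for a finished episode and applying the constant-step update. The helper names `discounted_returns` and `mc_update`, and the toy episode, are assumptions made for illustration and are not part of the original notebook.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, ..., G_{T-1}] for one finished episode."""
    returns = [0.0] * len(rewards)
    g = 0.0
    # Sweep backwards so that G_t = r_t + gamma * G_{t+1}.
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns


def mc_update(V, states, rewards, alpha, gamma):
    """Constant-step MC update for every visited state of one finished episode."""
    for s, g in zip(states, discounted_returns(rewards, gamma)):
        V[s] += alpha * (g - V[s])
    return V


# Hypothetical episode: three states visited in order, reward -1 at each step.
returns = discounted_returns([-1, -1, -1], gamma=0.5)
V = mc_update({0: 0.0, 1: 0.0, 2: 0.0}, [0, 1, 2], [-1, -1, -1],
              alpha=0.5, gamma=0.5)
```

The backward sweep needs the whole reward sequence, which is why MC cannot update before the episode ends.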

Rather than waiting for the full return $G_t$, TD estimates the return of the current state as the current reward plus the discounted value of the next state: $$V(s_t) \leftarrow V(s_t) + \alpha[r_t + \gamma V(s_{t+1}) - V(s_t)], $$

where $r_t + \gamma V(s_{t+1}) - V(s_t)$ is called the TD error.
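A single TD update can be sketched as follows. The function name `td0_update`, the list-based value table, and the `done` flag (which treats a terminal next state as having zero value) are assumptions added for illustration.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9, done=False):
    """Apply one TD update to the value table V in place; return the TD error."""
    # The target bootstraps from the current estimate of the next state's value.
    # Assumption: a terminal next state contributes no future value.
    target = r + (0.0 if done else gamma * V[s_next])
    td_error = target - V[s]
    V[s] += alpha * td_error
    return td_error


# Hypothetical 3-state example in which all values start at zero.
V = [0.0, 0.0, 0.0]
delta = td0_update(V, s=0, r=-1.0, s_next=1)
```

Unlike the MC update, this needs only one transition $(s_t, r_t, s_{t+1})$, so it can run after every step.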

Here is why you can replace $G_t$ with $r_t + \gamma V(s_{t+1})$:

$$\begin{aligned} V_\pi(s) &= \mathbb{E}_\pi[G_t \mid S_t = s] \\ &= \mathbb{E}_\pi\left[\sum_{k=0}^\infty \gamma^k R_{t+k} \,\middle|\, S_t = s\right] \\ &= \mathbb{E}_\pi\left[R_t + \gamma \sum_{k=0}^\infty \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right] \\ &= \mathbb{E}_\pi[R_t + \gamma V_\pi(S_{t+1}) \mid S_t = s] \end{aligned}$$


# Sarsa Algorithm
With TD estimation in hand, it is natural to ask whether we can do RL in a way that resembles policy iteration. For policy evaluation, we can use TD estimation directly. But how do we perform policy improvement without knowing the reward function and the state transition function? The answer is to use TD estimation for the action value function $Q$: $$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha[r(s_t, a_t) + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)]$$

Then we can choose the best action greedily: $\arg \max_a Q(s, a)$. This seems to give a complete algorithm: act greedily to interact with the environment, and then use the data sampled from the interaction to update the action value function by TD estimation.

However, there are two issues to consider before we start the implementation. First, an accurate estimate requires a large number of samples. But recall what we did in value iteration: the estimate does not need to be very accurate before we update the policy. This idea is called generalized policy iteration. Second, if we always select actions greedily, some state-action pairs may never appear in the sampled data, in which case we cannot estimate their values and thus cannot guarantee that the updated policy is better than the old one. As in the bandit problem, we can replace greedy with $\epsilon$-greedy, which gives the policy below: $$\pi(a|s) = \begin{cases} \frac{\epsilon}{|\mathcal{A}|} + 1 - \epsilon \text{, if } a = \arg \max_{a' \in \mathcal{A}} Q(s, a') \\ \frac{\epsilon}{|\mathcal{A}|} \text{, otherwise}\end{cases}$$
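The $\epsilon$-greedy distribution above can be written out directly. This is a minimal sketch: the function names `epsilon_greedy_probs` and `sample_action` and the example action values are hypothetical, and ties are broken in favor of the first maximizing action, which the formula itself does not specify.

```python
import numpy as np


def epsilon_greedy_probs(q_values, epsilon):
    """Return the epsilon-greedy action distribution for one state."""
    q_values = np.asarray(q_values, dtype=float)
    n_actions = q_values.shape[0]
    # Every action receives the exploration share epsilon / |A|.
    probs = np.full(n_actions, epsilon / n_actions)
    # The greedy action (first maximizer on ties) gets the remaining 1 - epsilon.
    probs[np.argmax(q_values)] += 1.0 - epsilon
    return probs


def sample_action(q_values, epsilon, rng):
    """Draw one action index from the epsilon-greedy distribution."""
    probs = epsilon_greedy_probs(q_values, epsilon)
    return int(rng.choice(len(probs), p=probs))


# Hypothetical action values for a single state with four actions.
rng = np.random.default_rng(0)
probs = epsilon_greedy_probs([0.0, 1.0, 0.5, -2.0], epsilon=0.1)
action = sample_action([0.0, 1.0, 0.5, -2.0], epsilon=0.1, rng=rng)
```

With $\epsilon = 0.1$ and four actions, the greedy action has probability $0.925$ and each other action $0.025$, so every action keeps a nonzero chance of being sampled.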

Now we have the Sarsa algorithm. The algorithm is named after the quantities its update requires: current state $s$, current action $a$, reward $r$, next state $s'$, and next action $a'$. Below is the full description of Sarsa:

- Initialize $Q(s, a)$
- for $e \leftarrow 1$ to $E$ do:
    - get start state $s$
    - choose the action $a$ by $\epsilon$-greedy
    - for $t \leftarrow 1$ to $T$ do:
        - take action $a$ and get $r, s'$ from the environment
        - choose $a'$ by $\epsilon$-greedy
        - $Q(s, a) \leftarrow Q(s, a) + \alpha [r + \gamma Q(s', a') - Q(s, a)]$
        - $s \leftarrow s'$, $a \leftarrow a'$
    - end for
- end for
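The steps above can be sketched in a few lines of Python. This is an illustrative sketch rather than the notebook's own implementation: `CorridorEnv` is a hypothetical toy environment, the hyperparameter values are arbitrary, and the early exit on `done` (with a terminal state contributing zero future value) is an assumption that the pseudocode does not spell out.

```python
import numpy as np


class CorridorEnv:
    """Hypothetical 1-D corridor: start at cell 0, goal at the last cell."""

    def __init__(self, n_cells=5):
        self.n_cells = n_cells
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        # Action 0 moves left, action 1 moves right; walls clip the move.
        move = 1 if action == 1 else -1
        self.pos = min(self.n_cells - 1, max(0, self.pos + move))
        done = self.pos == self.n_cells - 1
        return self.pos, -1, done


def epsilon_greedy(Q, state, epsilon, rng):
    # With probability epsilon pick a uniformly random action, else the greedy one.
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[state]))


def sarsa(env, n_states, n_actions, episodes=200, max_steps=100,
          alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, epsilon, rng)
        for _ in range(max_steps):
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy(Q, s_next, epsilon, rng)
            # Assumption: a terminal next state has zero future value.
            future = 0.0 if done else gamma * Q[s_next, a_next]
            Q[s, a] += alpha * (r + future - Q[s, a])
            s, a = s_next, a_next
            if done:
                break
    return Q


env = CorridorEnv(n_cells=5)
Q = sarsa(env, n_states=5, n_actions=2)
greedy_policy = Q.argmax(axis=1)  # greedy action per state after training
```

Note that `a_next` is chosen by the same $\epsilon$-greedy policy that is being learned and is then actually executed on the next step, which is what makes the update use the tuple $(s, a, r, s', a')$.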

Now, let's try Sarsa on cliff walking! Notice that the environment is implemented differently this time, so that the agent can interact with it.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from tqdm import tqdm  # tqdm is a library for displaying loop progress bars


class CliffWalkingEnv:
    def __init__(self, ncol, nrow):
        self.nrow = nrow
        self.ncol = ncol
        self.x = 0  # x-coordinate of the agent's current position
        self.y = self.nrow - 1  # y-coordinate of the agent's current position

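    def step(self, action):  # called externally to change the current position
        # 4 actions: change[0] is up, change[1] is down, change[2] is left,
        # change[3] is right. The coordinate origin (0, 0) is at the top-left corner.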
        change = [[0, -1], [0, 1], [-1, 0], [1, 0]]
        self.x = min(self.ncol - 1, max(0, self.x + change[action][0]))
        self.y = min(self.nrow - 1, max(0, self.y + change[action][1]))
        next_state = self.y * self.ncol + self.x
        reward = -1
        done = False
        if self.y == self.nrow - 1 and self.x > 0:  # next position is a cliff cell or the goal
            done = True
            if self.x != self.ncol - 1:
                reward = -100
        return next_state, reward, done

    def reset(self):  # return to the initial state; the coordinate origin is at the top-left corner
        self.x = 0
        self.y = self.nrow - 1
        return self.y * self.ncol + self.x