---
toc: true
layout: post
description: A summary of "Understanding Deep Reinforcement Learning"
title: Temporal Difference Learning TD(0)
classes: wide
---
We want to estimate the state-value function $V^{\pi}(s)$ of a given policy $\pi$.
In incremental every-visit MC, we update the estimate using one sample of the return for the current state:

$$ V^{\pi}(s_t) = V^{\pi}(s_t) + \alpha \big(G_{i, t} - V^{\pi}(s_t)\big)$$
What if we use the following recursive formula:
(replacing $G_{i, t}$ with $r_t + \gamma V^{\pi}(s_{t+1})$)

$$ V^{\pi}(s_t) = V^{\pi}(s_t) + \alpha \big(r_t + \gamma V^{\pi}(s_{t+1}) - V^{\pi}(s_t)\big)$$
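For example (with made-up numbers, purely to illustrate one update): take $\alpha = 0.1$, $\gamma = 0.9$, $r_t = 1$, $V^{\pi}(s_t) = 0.5$, and $V^{\pi}(s_{t+1}) = 1.0$. Then

$$ V^{\pi}(s_t) = 0.5 + 0.1\big(1 + 0.9 \times 1.0 - 0.5\big) = 0.64 $$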
- Generate episodes $s_1, a_1, r_1, s_2, a_2, r_2, \cdots$ under policy $\pi$, where the actions are sampled from $\pi$, to make tuples $(s, a, r, s')$
- TD(0) learning: update the value estimate using another estimated value (bootstrapping)
In this formula, we can define the TD(0) error as

$$ \delta_t = r_t + \gamma V^{\pi}(s_{t+1}) - V^{\pi}(s_t)$$
As you can see, the value can be computed from the current reward and the estimated value of the next state, so TD(0) can immediately update the value estimate from a single tuple $(s_t, a_t, r_t, s_{t+1})$ without waiting for the episode to end.
The detailed process is as follows (a code sketch follows the list):
- Initialize $V^{\pi}(s), \forall s \in S$
- Repeat until convergence:
  - Sample a tuple $(s_t, a_t, r_t, s_{t+1})$
  - Update the value with learning rate $\alpha$:

    $$V^{\pi}(s_t) = V^{\pi}(s_t) + \alpha (\underbrace{[r_t + \gamma V^{\pi}(s_{t+1})]}_{\text{TD target}} - V^{\pi}(s_t))$$
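Here is a minimal sketch of this procedure in Python. It assumes a Gymnasium-style environment with a discrete state space and a `policy(state)` function that samples actions from $\pi$; the function name and interface are assumptions for illustration, not from the book.

```python
import numpy as np

def td0_policy_evaluation(env, policy, num_episodes=1000, alpha=0.1, gamma=0.99):
    """Tabular TD(0) evaluation of a fixed policy (sketch)."""
    V = np.zeros(env.observation_space.n)      # initialize V(s) for all s in S

    for _ in range(num_episodes):              # "repeat until convergence" (fixed budget here)
        state, _ = env.reset()
        done = False
        while not done:
            action = policy(state)             # a_t sampled from pi(. | s_t)
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            # TD target bootstraps from the current estimate of V(s_{t+1});
            # a terminal state is given value 0.
            td_target = reward + gamma * (0.0 if terminated else V[next_state])
            V[state] += alpha * (td_target - V[state])   # V(s_t) += alpha * TD error
            state = next_state

    return V
```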
Comparing TD and MC (a sketch contrasting the two updates follows this list):
- TD can learn online after every step
  - MC must wait until the end of the episode before the return is known
- TD can learn from incomplete sequences
  - MC can only learn from complete sequences
- TD works in continuing (non-episodic) environments
  - MC only works for episodic (terminating) environments
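To make the "learn online vs. wait for the episode" point concrete, here is a small sketch of both updates; `V` is assumed to be a dict or array already holding an entry for every state, and the function names are placeholders of my own.

```python
def mc_update(V, episode, alpha=0.1, gamma=0.99):
    """Incremental every-visit MC: needs the whole episode [(s_t, r_t), ...] first."""
    G = 0.0
    for state, reward in reversed(episode):    # accumulate the return G_t backwards
        G = reward + gamma * G
        V[state] += alpha * (G - V[state])     # update toward the sampled return

def td_update(V, state, reward, next_state, alpha=0.1, gamma=0.99):
    """TD(0): can be called after every single step (s_t, r_t, s_{t+1})."""
    td_target = reward + gamma * V[next_state]      # bootstrap from the current estimate
    V[state] += alpha * (td_target - V[state])      # update toward the TD target
```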
- MC has high variance but is unbiased (see the note after this list)
  - Simple to use
  - Good convergence properties (even with function approximation)
  - Not very sensitive to the initial value
  - Usually more effective in non-Markov environments
- TD has low variance but is biased
  - Usually more efficient than MC
  - TD(0) converges to $V_{\pi}(s)$ (but not always with function approximation)
  - More sensitive to the initial value
  - Usually more effective in Markov environments
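One way to see where the variance difference comes from: the MC target sums many random rewards until the end of the episode, while the TD target involves only one random reward plus the current (possibly wrong) estimate of the next state's value.

$$ G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^{T-t} r_T \quad \text{(MC target: many random terms, unbiased)} $$

$$ r_t + \gamma V^{\pi}(s_{t+1}) \quad \text{(TD target: one random reward plus a current estimate, biased)} $$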