# Model-Free Prediction

## 1. Introduction
* Previous lecture: Planning by dynamic programming -> solve a known MDP
* This lecture: Model-free prediction -> estimate the value function of an unknown MDP
* Next lecture: Model-free control -> optimise the value function of an unknown MDP

## 2. Monte-Carlo Learning
* MC methods learn directly from complete episodes of experience. The MDP must therefore be episodic: every episode must terminate.
* MC is model-free: no knowledge of MDP transitions / rewards.
* MC uses the simplest idea: value = mean return.

### 2.1. Monte-Carlo Policy Evaluation
* Goal: learn $v_\pi$ from episodes of experience under policy $\pi$.
$$S_1, A_1, R_2, \dots, S_k \sim \pi$$
* Monte-Carlo policy evaluation uses empirical mean return instead of expected return.

#### 2.1.1. First-Visit Monte-Carlo Policy Evaluation
* To evaluate state $s$
* Consider the first time-step $t$ that state $s$ is visited in an episode
* Increment counter $N(s) \leftarrow N(s)+1$
* Increment total return $S(s) \leftarrow S(s)+G_t$
* Value is estimated by mean return $V(s)=S(s)/N(s)$
* By law of large numbers, $V(s) \rightarrow v_\pi(s)$ as $N(s) \rightarrow \infty$

#### 2.1.2. Every-Visit Monte-Carlo Policy Evaluation
* To evaluate state $s$
* Every time-step $t$ that state $s$ is visited in an episode
* Increment counter $N(s) \leftarrow N(s)+1$
* Increment total return $S(s) \leftarrow S(s)+G_t$
* Value is estimated by mean return $V(s)=S(s)/N(s)$
* By law of large numbers, $V(s) \rightarrow v_\pi(s)$ as $N(s) \rightarrow \infty$
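
The first-visit and every-visit procedures differ only in whether repeat visits within an episode are counted. A minimal sketch of both, assuming each episode is supplied as a list of `(state, reward)` pairs in which `reward` is $R_{t+1}$, the reward received after leaving $S_t$. The function name, the episode format, and the `first_visit` flag are illustrative choices, not part of the lecture.

```python
from collections import defaultdict


def mc_policy_evaluation(episodes, gamma=1.0, first_visit=True):
    """Estimate V(s) as the mean return observed from s.

    episodes: list of complete episodes, each a list of (state, reward)
    pairs where reward is R_{t+1} (assumed format, for illustration only).
    """
    N = defaultdict(int)    # visit counter N(s)
    S = defaultdict(float)  # total return S(s)
    for episode in episodes:
        # Compute returns backwards: G_t = R_{t+1} + gamma * G_{t+1}.
        G = 0.0
        returns = []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns.append((state, G))
        returns.reverse()  # back to time order
        seen = set()
        for state, G in returns:
            if first_visit and state in seen:
                continue  # first-visit: ignore repeat visits in this episode
            seen.add(state)
            N[state] += 1
            S[state] += G
    return {s: S[s] / N[s] for s in N}


episodes = [
    [("A", 0.0), ("B", 1.0)],
    [("A", 0.0), ("A", 0.0), ("B", 2.0)],
]
v_first = mc_policy_evaluation(episodes, first_visit=True)
v_every = mc_policy_evaluation(episodes, first_visit=False)
```

In the second episode state `A` is visited twice, so the two variants average over a different number of returns for `A` and give different estimates.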

### 2.2. Incremental Monte-Carlo Updates
* Update $V(s)$ incrementally after episode $S_1, A_1, R_2, ..., S_T$
* For each state $S_t$ with return $G_t$
$$N(S_t) \leftarrow N(S_t)+1$$
$$V(S_t) \leftarrow V(S_t)+\frac{1}{N(S_t)}(G_t-V(S_t))$$
* In non-stationary problems, it can be useful to track a running mean, i.e. forget old episodes.
$$V(S_t) \leftarrow V(S_t)+ \alpha (G_t-V(S_t))$$
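
Both update rules can share one routine that differs only in the step size. The sketch below assumes the same illustrative `(state, reward)` episode format, with `reward` being $R_{t+1}$. Passing `alpha=None` uses the $1/N(S_t)$ step, and passing a float uses the constant step $\alpha$. All names are hypothetical.

```python
from collections import defaultdict


def incremental_mc_update(V, N, episode, gamma=1.0, alpha=None):
    """Apply one episode of incremental every-visit MC updates in place.

    alpha=None -> step size 1 / N(S_t) (exact running mean)
    alpha=float -> constant step size (recent episodes weigh more)
    """
    # Compute returns backwards: G_t = R_{t+1} + gamma * G_{t+1}.
    G = 0.0
    returns = []
    for state, reward in reversed(episode):
        G = reward + gamma * G
        returns.append((state, G))
    # Apply the updates in time order.
    for state, G in reversed(returns):
        N[state] += 1
        step = 1.0 / N[state] if alpha is None else alpha
        V[state] += step * (G - V[state])


V, N = defaultdict(float), defaultdict(int)
incremental_mc_update(V, N, [("A", 0.0), ("B", 1.0)])
incremental_mc_update(V, N, [("A", 0.0), ("B", 3.0)])
```

With the $1/N(S_t)$ step the result equals the plain mean of the observed returns. With a constant $\alpha$ the estimate keeps adapting, which is the behaviour wanted in non-stationary problems.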

## 3. Temporal-Difference Learning
* TD methods learn directly from episodes of experience and, unlike MC, can learn from incomplete episodes by bootstrapping (updating from an existing value estimate).
* TD is model-free: no knowledge of MDP transitions / rewards.
* TD updates a guess towards a guess.

### 3.1. MC and TD
* Goal: learn $v_\pi$ online from experience under policy $\pi$
* Incremental every-visit Monte-Carlo
* Update value $V(S_t)$ toward actual return $G_t$
$$V(S_t) \leftarrow V(S_t) + \alpha (G_t-V(S_t))$$
* Simplest temporal-difference learning algorithm: TD(0)
* Update value $V(S_t)$ toward estimated return $R_{t+1} + \gamma V(S_{t+1})$
$$V(S_t) \leftarrow V(S_t) + \alpha (R_{t+1}+\gamma V(S_{t+1})-V(S_t))$$
* $R_{t+1}+\gamma V(S_{t+1})$ is called the TD target
* $\delta_t = R_{t+1}+\gamma V(S_{t+1})-V(S_t)$ is called the TD error
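
The TD(0) update needs only a single transition $(S_t, R_{t+1}, S_{t+1})$, so it can be applied after every step. A minimal sketch, assuming tabular values stored in a dictionary and assuming the value of a terminal state is zero. The function name and the `done` flag are illustrative, not from the lecture.

```python
from collections import defaultdict


def td0_update(V, state, reward, next_state, alpha=0.1, gamma=1.0, done=False):
    """Apply one TD(0) update in place and return the TD error delta_t."""
    # Assumption: a terminal state has value zero.
    next_value = 0.0 if done else V[next_state]
    td_target = reward + gamma * next_value  # R_{t+1} + gamma * V(S_{t+1})
    td_error = td_target - V[state]          # delta_t
    V[state] += alpha * td_error
    return td_error


V = defaultdict(float)
V["B"] = 1.0
delta = td0_update(V, "A", 0.0, "B", alpha=0.5, gamma=0.9)
```

Here the TD target is $0 + 0.9 \cdot 1 = 0.9$, so $\delta_t = 0.9$ and $V(A)$ moves halfway from $0$ to $0.9$.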

### 3.2. Advantages and Disadvantages of MC vs. TD
* TD can learn before knowing the final outcome
* TD can learn online after every step
* MC must wait until end of episode before return is known
* TD can learn without the final outcome: it can learn from incomplete sequences
* MC can only learn from complete sequences
* TD works in continuing (non-terminating) environments
* MC only works for episodic (terminating) environments

### 3.3. Bias / Variance Trade-Off
* Return $G_t=R_{t+1}+\gamma R_{t+2}+\dots+\gamma^{T-t-1}R_T$ is an unbiased estimate of $v_\pi (S_t)$
* True TD target $R_{t+1}+\gamma v_\pi (S_{t+1})$ is an unbiased estimate of $v_\pi (S_t)$
* TD target $R_{t+1}+\gamma V(S_{t+1})$ is a biased estimate of $v_\pi (S_t)$
* TD target has much lower variance than the return: the return depends on many random actions, transitions, and rewards, whereas the TD target depends on only one random action, transition, and reward.
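
The variance claim can be checked empirically on a toy problem. The sketch below assumes a 5-state symmetric random walk (states 1 to 5, terminals 0 and 6, reward 1 only on reaching state 6, $\gamma = 1$), which is an illustrative environment chosen here and not defined in these notes. Its true values are $v_\pi(s) = s/6$, so the true TD target can be computed exactly and compared with the Monte-Carlo return from the same start state.

```python
import random
import statistics

# True values of the assumed random walk: v(s) = s / 6 for s = 1..5.
TRUE_V = {s: s / 6 for s in range(1, 6)}
TRUE_V[0] = TRUE_V[6] = 0.0  # terminal states


def sample_targets(rng, start=3):
    """Run one episode from `start`; return (MC return, true TD target)."""
    state = start
    G = 0.0
    td_target = None
    while state not in (0, 6):
        next_state = state + rng.choice((-1, 1))
        reward = 1.0 if next_state == 6 else 0.0
        if td_target is None:
            # Target built from the first transition only.
            td_target = reward + TRUE_V[next_state]
        G += reward  # undiscounted return (gamma = 1)
        state = next_state
    return G, td_target


rng = random.Random(0)
samples = [sample_targets(rng) for _ in range(10_000)]
mc_returns, td_targets = zip(*samples)
mc_var = statistics.pvariance(mc_returns)
td_var = statistics.pvariance(td_targets)
```

Both sample means should be close to $v_\pi(3) = 0.5$, since both targets are unbiased here. The return is either 0 or 1, while the TD target is either $1/3$ or $2/3$, so `td_var` comes out far smaller than `mc_var`.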


#### Prove that a fixed learning rate (step size $\alpha$) for MC is equivalent to an exponentially decaying average of episode returns

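$$\begin{split}
V_{k+1}(S_t) & = V_k(S_t) + \alpha (G_{k,t}-V_k(S_t)) \\
& = (1-\alpha)V_k(S_t) + \alpha G_{k,t} \\
& = (1-\alpha)((1-\alpha)V_{k-1}(S_t) + \alpha G_{k-1,t}) + \alpha G_{k,t} \\
& = (1-\alpha)^{k+1}V_0(S_t) + \alpha (1-\alpha)^{k}G_{0,t} + \dots + \alpha (1-\alpha)G_{k-1,t}+\alpha G_{k,t}
\end{split}$$

Here $G_{k,t}$ is the return observed for $S_t$ in episode $k$. The weight on each return shrinks by a factor $(1-\alpha)$ with every later episode, and the contribution of the initial estimate $V_0(S_t)$ vanishes as $k \rightarrow \infty$.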
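
The identity can also be checked numerically. Unrolling the recursion $V_{k+1} = (1-\alpha)V_k + \alpha G_k$ all the way down puts weight $\alpha(1-\alpha)^{k-i}$ on the return of episode $i$ and weight $(1-\alpha)^{k+1}$ on the initial estimate. The sketch below compares the recursive update with that explicit weighted sum. The function names and the sample returns are made up for illustration.

```python
def constant_alpha_mc(returns, alpha, v0=0.0):
    """Apply V <- V + alpha * (G - V) once per episode return."""
    v = v0
    for g in returns:
        v += alpha * (g - v)
    return v


def exponential_average(returns, alpha, v0=0.0):
    """Explicit exponentially weighted sum of the same returns."""
    k = len(returns) - 1  # index of the most recent episode
    total = (1 - alpha) ** (k + 1) * v0
    for i, g in enumerate(returns):
        total += alpha * (1 - alpha) ** (k - i) * g
    return total


returns = [1.0, 4.0, 2.0, 8.0]  # illustrative episode returns
recursive = constant_alpha_mc(returns, alpha=0.3, v0=0.5)
explicit = exponential_average(returns, alpha=0.3, v0=0.5)
```

The two values agree up to floating-point rounding, for any choice of returns, $\alpha$, and initial value.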

#### Prove that Offline Forward-View TD(Lambda) and Offline Backward View TD(Lambda) are equivalent. We covered the proof of Lambda = 1 in class. Do the proof for arbitrary Lambda (similar telescoping argument as done in class) for the case where a state appears only once in an episode.
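
A sketch of the argument for $0 \le \lambda < 1$, assuming the standard definitions, which these notes do not state: the $n$-step return $G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1}R_{t+n} + \gamma^n V(S_{t+n})$, the $\lambda$-return $G_t^\lambda = (1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1}G_t^{(n)}$, and the accumulating eligibility trace $E_t(s) = \gamma\lambda E_{t-1}(s) + \mathbf{1}(S_t = s)$ with $E_{-1}(s) = 0$. It also assumes that $V$ is held fixed during the episode (offline updates), that rewards after termination are zero, and that $V$ of the terminal state is zero.

Forward view. Using $(1-\lambda)\sum_{n \ge j}\lambda^{n-1} = \lambda^{j-1}$ to collect the reward terms:

$$\begin{split}
G_t^\lambda - V(S_t) & = -V(S_t) + \sum_{n=1}^{\infty}(\gamma\lambda)^{n-1}\left(R_{t+n} + \gamma V(S_{t+n}) - \gamma\lambda V(S_{t+n})\right) \\
& = \sum_{n=1}^{\infty}(\gamma\lambda)^{n-1}\left(R_{t+n} + \gamma V(S_{t+n}) - V(S_{t+n-1})\right) \\
& = \sum_{n=1}^{\infty}(\gamma\lambda)^{n-1}\delta_{t+n-1} \\
& = \sum_{j=t}^{T-1}(\gamma\lambda)^{j-t}\delta_j
\end{split}$$

The second line regroups the sum: the term $-V(S_t)$ and each $-(\gamma\lambda)^{n}V(S_{t+n})$ are shifted into the following bracket. The last line holds because every TD error after termination is zero.

Backward view. If state $s$ is visited only once, at time $k$, then $E_t(s) = 0$ for $t < k$ and $E_t(s) = (\gamma\lambda)^{t-k}$ for $t \ge k$. The total offline update to $V(s)$ over the episode is therefore

$$\sum_{t=0}^{T-1}\alpha\,\delta_t E_t(s) = \alpha\sum_{t=k}^{T-1}(\gamma\lambda)^{t-k}\delta_t = \alpha\left(G_k^\lambda - V(S_k)\right)$$

which is exactly the forward-view update for $s = S_k$.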

