# Model-Free Methods

#### Model-free prediction
Estimate the value function of an unknown MDP

#### Model-free control
Optimise the value function of an unknown MDP

## Monte-Carlo Learning

* learns directly from episodes of experience
* no knowledge of MDP transitions/rewards (model-free)
* learns from complete episodes: **no bootstrapping**
* value = mean return
* requires episodic MDP (must terminate)

### First-Visit Monte-Carlo Policy Evaluation

At the **first** time-step $t$ that state $s$ is visited in an episode:
* Increment counter $N(s) \leftarrow N(s)+1$
* Increment total return $S(s) \leftarrow S(s)+G_t$

Value is estimated by mean return $V(s)=S(s)/N(s)$

By law of large numbers, $V(s) \rightarrow v_\pi(s)$ as $N(s)\rightarrow \infty$
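The procedure above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the episode format (a list of `(state, reward)` pairs, where `reward` stands for $R_{t+1}$), the function name `first_visit_mc` and the toy episodes are all assumptions made for the example.

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    """Estimate V(s) as the mean return following the first visit to s.

    Assumed format: each episode is a list of (state, reward) pairs,
    where reward is R_{t+1}, the reward received after leaving the state.
    """
    N = defaultdict(int)    # visit counter N(s)
    S = defaultdict(float)  # total return S(s)
    for episode in episodes:
        # Compute the return G_t for every time-step by scanning backwards.
        returns = [0.0] * len(episode)
        G = 0.0
        for t in reversed(range(len(episode))):
            G = episode[t][1] + gamma * G
            returns[t] = G
        seen = set()
        for t, (state, _) in enumerate(episode):
            if state in seen:
                continue  # only the first visit in this episode counts
            seen.add(state)
            N[state] += 1
            S[state] += returns[t]
    return {s: S[s] / N[s] for s in N}

# Hypothetical toy data: state "A" is visited twice in the first episode.
episodes = [
    [("A", 1.0), ("B", 0.0), ("A", 2.0)],
    [("B", 1.0)],
]
V = first_visit_mc(episodes)
```

With $\gamma=1$, only the return from the first visit to `"A"` (time-step 0) is counted, so the second visit does not affect $V(A)$.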

### Every-Visit Monte-Carlo Policy Evaluation

At **every** time-step $t$ that state $s$ is visited in an episode:
* Increment counter $N(s) \leftarrow N(s)+1$
* Increment total return $S(s) \leftarrow S(s)+G_t$

Value is estimated by mean return $V(s)=S(s)/N(s)$

By law of large numbers, $V(s) \rightarrow v_\pi(s)$ as $N(s)\rightarrow \infty$
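A minimal every-visit sketch, under the same kind of assumptions (episodes as lists of `(state, reward)` pairs with `reward` standing for $R_{t+1}$; the name `every_visit_mc` and the data are hypothetical). Because every visit counts, a single backward scan over the episode is enough.

```python
from collections import defaultdict

def every_visit_mc(episodes, gamma=1.0):
    """Estimate V(s) as the mean return over every visit to s."""
    N = defaultdict(int)    # visit counter N(s)
    S = defaultdict(float)  # total return S(s)
    for episode in episodes:
        G = 0.0
        # Scan backwards so G holds the return from time-step t onwards.
        for state, reward in reversed(episode):
            G = reward + gamma * G
            N[state] += 1
            S[state] += G
    return {s: S[s] / N[s] for s in N}

# Hypothetical toy data: state "A" is visited twice in the first episode.
episodes = [
    [("A", 1.0), ("B", 0.0), ("A", 2.0)],
    [("B", 1.0)],
]
V = every_visit_mc(episodes)
```

Here both visits to `"A"` contribute (returns 3 and 2 with $\gamma=1$), so the estimate for `"A"` is their mean.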

### Incremental Monte-Carlo Updates

For each state $S_t$ with return $G_t$
* $N(S_t) \leftarrow N(S_t)+1$
* $V(S_t) \leftarrow V(S_t)+ \frac{1}{N(S_t)}(G_t-V(S_t))$

In non-stationary problems, it can be useful to track a running mean, i.e. to forget old episodes:

$$V(S_t) \leftarrow V(S_t)+\alpha(G_t-V(S_t))$$
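The two update rules can be compared on a single state with a short sketch. The function names and the list of returns are made up for illustration; the first rule reproduces the exact sample mean, while the constant step size $\alpha$ weights recent returns more heavily.

```python
def incremental_mean(returns):
    """Apply V <- V + (G - V) / N for each observed return G."""
    V, N = 0.0, 0
    for G in returns:
        N += 1
        V += (G - V) / N
    return V

def running_mean(returns, alpha, V=0.0):
    """Apply V <- V + alpha * (G - V): an exponentially weighted mean."""
    for G in returns:
        V += alpha * (G - V)
    return V

# Hypothetical returns observed for one state, oldest first.
returns = [4.0, 2.0, 6.0, 8.0]
v_exact = incremental_mean(returns)           # equals sum(returns) / len(returns)
v_recent = running_mean(returns, alpha=0.5)   # pulled toward the later returns
```

Note that the running mean also depends on the initial value of $V$ (zero here), whose influence decays by a factor $(1-\alpha)$ per update.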

## Temporal-Difference Learning

* learns directly from episodes of experience
* no knowledge of MDP transitions/rewards
* learns from incomplete episodes, by **bootstrapping**
* updates a guess towards a guess

### TD(0)

Update value $V(S_t)$ toward estimated return $R_{t+1}+\gamma V(S_{t+1})$
$$ V(S_t)\leftarrow V(S_t)+\alpha(R_{t+1}+\gamma V(S_{t+1})-V(S_t))$$

* $R_{t+1}+\gamma V(S_{t+1})$ is called the **TD target**
* $\delta_t=R_{t+1}+\gamma V(S_{t+1})-V(S_t)$ is called the **TD error**
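A single TD(0) update might look like the sketch below. The function name `td0_update`, the `terminal` flag and the numbers are assumptions for the example; treating the value of a terminal state as zero is the usual convention and is not stated in the notes above.

```python
from collections import defaultdict

def td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0, terminal=False):
    """Move V[s] toward the TD target and return the TD error."""
    # Assumption: the value of a terminal successor state is taken as 0.
    target = r + (0.0 if terminal else gamma * V[s_next])  # TD target
    delta = target - V[s]                                  # TD error
    V[s] += alpha * delta
    return delta

# Hypothetical transition A -> B with reward 0.5.
V = defaultdict(float)  # all values start at 0
V["B"] = 1.0
delta = td0_update(V, "A", r=0.5, s_next="B", alpha=0.1, gamma=0.9)
```

The update uses only one transition $(S_t, R_{t+1}, S_{t+1})$, so it can be applied online before the episode ends.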

### TD($\lambda$)

#### n-Step

n-step return:
$$G_t^{(n)} = R_{t+1}+\gamma R_{t+2}+\dots+\gamma^{n-1}R_{t+n}+\gamma^n V(S_{t+n})$$

n-step temporal difference learning:
$$ V(S_t)\leftarrow V(S_t)+\alpha(G_t^{(n)}-V(S_t))$$
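The n-step return can be computed from a recorded episode roughly as follows. The indexing convention is an assumption of this sketch: `rewards[k]` holds $R_{k+1}$, `values[k]` holds the current estimate $V(S_k)$, the episode terminates at $T$ = `len(rewards)`, and the terminal state has value zero, so the return is truncated to the Monte-Carlo return once $t+n \geq T$.

```python
def n_step_return(rewards, values, t, n, gamma=1.0):
    """Compute G_t^(n) for one recorded episode.

    Assumed convention: rewards[k] is R_{k+1}, values[k] is V(S_k),
    and the terminal state S_T (T = len(rewards)) has value 0.
    """
    T = len(rewards)
    G = 0.0
    # Discounted sum of up to n real rewards.
    for k in range(t, min(t + n, T)):
        G += gamma ** (k - t) * rewards[k]
    # Bootstrap from V(S_{t+n}) only if that state is not terminal.
    if t + n < T:
        G += gamma ** n * values[t + n]
    return G

# Hypothetical three-step episode.
rewards = [1.0, 2.0, 3.0]
values = [10.0, 20.0, 30.0]
g1 = n_step_return(rewards, values, t=0, n=1, gamma=0.5)  # the TD(0) target
g2 = n_step_return(rewards, values, t=0, n=2, gamma=0.5)
g3 = n_step_return(rewards, values, t=0, n=3, gamma=0.5)  # the full return
```

Setting $n=1$ recovers the TD target, and any $n \geq T-t$ recovers the Monte-Carlo return $G_t$.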

#### Forward View of TD($\lambda$)

The $\lambda$-return $G_t^\lambda$ combines all n-step returns $G_t^{(n)}$ using weight $(1-\lambda)\lambda^{n-1}$

$$ G_t^\lambda=(1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1}G_t^{(n)}$$

Forward-view TD($\lambda$)
$$ V(S_t)\leftarrow V(S_t)+\alpha(G_t^\lambda-V(S_t))$$
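For a terminating episode the infinite sum can be evaluated in closed form: every n-step return with $n \geq T-t$ equals the full return $G_t$, so the remaining weight $\lambda^{T-t-1}$ falls on $G_t$. The sketch below uses that fact. The helper `n_step_return`, the indexing convention (`rewards[k]` is $R_{k+1}$, `values[k]` is $V(S_k)$, terminal value zero) and the data are assumptions of the example.

```python
def n_step_return(rewards, values, t, n, gamma):
    """G_t^(n), with rewards[k] = R_{k+1} and values[k] = V(S_k)."""
    T = len(rewards)
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, min(t + n, T)))
    if t + n < T:  # bootstrap only from non-terminal states
        G += gamma ** n * values[t + n]
    return G

def lambda_return(rewards, values, t, lam, gamma=1.0):
    """Forward-view lambda-return G_t^lambda for a terminating episode."""
    T = len(rewards)
    G_lam = 0.0
    # Weighted n-step returns that still bootstrap (n = 1 .. T-t-1).
    for n in range(1, T - t):
        G_lam += (1 - lam) * lam ** (n - 1) * n_step_return(rewards, values, t, n, gamma)
    # All remaining weight goes to the full Monte-Carlo return.
    G_lam += lam ** (T - t - 1) * n_step_return(rewards, values, t, T - t, gamma)
    return G_lam

# Hypothetical three-step episode.
rewards = [1.0, 2.0, 3.0]
values = [10.0, 20.0, 30.0]
g_td = lambda_return(rewards, values, t=0, lam=0.0, gamma=0.5)   # TD(0) target
g_mc = lambda_return(rewards, values, t=0, lam=1.0, gamma=0.5)   # Monte-Carlo return
g_mix = lambda_return(rewards, values, t=0, lam=0.5, gamma=0.5)
```

The two extremes illustrate the role of $\lambda$: $\lambda=0$ gives the one-step TD target and $\lambda=1$ gives the Monte-Carlo return. The forward view needs the whole episode before any update can be made.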

#### Backward View of TD($\lambda$)

##### Eligibility Traces
* Frequency heuristic: assign credit to most frequent states
* Recency heuristic: assign credit to most recent states

Eligibility traces combine both heuristics
* $E_0(s)=0$
* $E_t(s)=\gamma\lambda E_{t-1}(s)+\mathbb{1}(S_t=s)$

* Keep an eligibility trace for every state $s$
* Update the value $V(s)$ for every state $s$ in proportion to TD-error $\delta_t$ and eligibility trace $E_t(s)$

$$ \delta_t = R_{t+1}+\gamma V(S_{t+1})-V(S_t) $$
$$ V(s)\leftarrow V(s)+\alpha\delta_t E_t(s)$$ 
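The backward view can be sketched as an online loop over one episode. The transition format `(s, r, s_next, done)`, the function name and the toy episode are assumptions for illustration, as is the convention that a terminal state has value zero.

```python
from collections import defaultdict

def td_lambda_episode(V, transitions, alpha=0.1, gamma=1.0, lam=0.9):
    """Run backward-view TD(lambda) over one episode, updating V in place.

    Assumed format: transitions is a list of (s, r, s_next, done) tuples.
    """
    E = defaultdict(float)  # eligibility traces, E_0(s) = 0
    for s, r, s_next, done in transitions:
        # E_t(s) = gamma * lambda * E_{t-1}(s) + 1(S_t = s)
        for state in E:
            E[state] *= gamma * lam
        E[s] += 1.0
        # TD error; a terminal successor is assumed to have value 0.
        delta = r + (0.0 if done else gamma * V[s_next]) - V[s]
        # Update every state with a trace in proportion to delta and its trace.
        for state, trace in E.items():
            V[state] += alpha * delta * trace
    return V

# Hypothetical episode A -> B -> terminal, with reward 1 on the last step.
V = defaultdict(float)
transitions = [("A", 0.0, "B", False), ("B", 1.0, "T", True)]
td_lambda_episode(V, transitions, alpha=0.5, gamma=1.0, lam=0.5)
```

In this toy run the final reward also updates `"A"`, scaled by its decayed trace. With `lam=0` the trace of `"A"` would be zero at that step and only `"B"` would change, which is the TD(0) update.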

