# Functional approximation

## Prelude: curses

### Curse of dimensionality

The state space grows very large very quickly. Take an Atari game: the input is an $84\times 84 \times 4$ stack of frames, where each pixel has one of 256 intensity values (0 to 255). How many states are possible?

$$|S| = 256^{84\cdot 84 \cdot 4} \approx 10^{67970}$$

This is far too many states to store in a table. This is the curse of dimensionality.

### Curse of modeling

The transition matrix $T(s, a, s')$ is much bigger still than the state space. For the Atari game, its size is:

$$|T| = |S| \times |A| \times |S| \approx 10^{136000}$$

The transition matrix is far too large to store in memory. It is also difficult to learn or identify such a huge model by sampling the environment, which is harder than exploring the state space itself. This is the curse of modeling. Model-free methods avoid this problem.
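
The orders of magnitude can be checked with a few lines of Python. This is a minimal sketch that assumes 256 intensity levels per pixel and, as a further assumption not stated above, an action set of 18 actions; it works in $\log_{10}$ space because the numbers themselves overflow any float.

```python
import math

# Frame stack: 84 x 84 pixels, 4 frames, 256 intensity levels per pixel.
n_pixels = 84 * 84 * 4
n_levels = 256
n_actions = 18  # assumption: size of the action set, not given in the text

# Work with exponents (log10) because the raw counts are far too large.
log10_states = n_pixels * math.log10(n_levels)
log10_transitions = 2 * log10_states + math.log10(n_actions)

print(f"|S| ~ 10^{log10_states:.0f}")
print(f"|T| ~ 10^{log10_transitions:.0f}")
```

The number of actions barely matters here: the two factors of $|S|$ dominate the size of the transition matrix.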

### Curse of credit assignment

When a longer episode is played or executed, it is not evident which actions were the most relevant to achieve the final result. This is the credit assignment problem.

## Motivation:

1. We've seen that RL finds optimal policies for arbitrary environments if the value functions $V(s)$ and $Q(s, a)$ can be represented exactly in tables
2. But the real world is too large and complex for tables
3. Will RL work with function approximators?


## Old and new setting in pseudo code
$$
\begin{array}{l}
\texttt{Q = np.zeros([n_states, n_actions])} \\
\texttt{a_p = Q[s, :]} \\
\# \ \texttt{action-value table is approximated:} \\
\texttt{a_p = DeepNeuralNetwork(s)}
\end{array}
$$
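
A runnable version of the same contrast might look as follows. It is only a sketch: the sizes are hypothetical, and a small linear model with hand-picked features (`features`, `approx_q`, both illustrative names) stands in for `DeepNeuralNetwork`.

```python
import numpy as np

n_states, n_actions = 5, 3   # hypothetical sizes for illustration
s = 2                        # current state index

# Old setting: one table row per state.
Q = np.zeros([n_states, n_actions])
a_p = Q[s, :]                # action values for state s

# New setting: a parametric function maps state features to action values.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(2, n_actions))  # stand-in for network weights

def features(state):
    """Hypothetical feature vector: a bias term and the scaled state index."""
    return np.array([1.0, state / n_states])

def approx_q(state, weights):
    """Linear stand-in for DeepNeuralNetwork(s)."""
    return features(state) @ weights

a_p_hat = approx_q(s, W)
greedy_action = int(np.argmax(a_p_hat))
```

The table needs `n_states * n_actions` entries, while the approximator only needs as many weights as its parameterization has, independent of the number of states.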

## Definition: Function approximation
There are too many states and actions to fit in memory, and processing each of them individually is too slow. Therefore we estimate the value function:
$$
\hat{v}(S, \mathbf{w}) \approx v_{\pi}(S)
$$
or for control we do:
$$
\hat{q}(S, A, \mathbf{w}) \approx q_{\pi}(S, A)
$$

where $\mathbf{w}$ are the learned weights that parameterize the approximating function.

TD or Monte Carlo methods supply the targets that we fit our function approximators to; these targets enter the loss function.

## Types of functional approximations

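(1) State-value approximation: the state is the input and a single value $\hat{v}(s, \mathbf{w})$ is the output

(2) Action-value approximation, action in: the state and the action are both inputs and a single value $\hat{q}(s, a, \mathbf{w})$ is the output

(3) Action-value approximation, action out: only the state is the input and one value $\hat{q}(s, a, \mathbf{w})$ per action is the output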
<img src="http://drive.google.com/uc?export=view&id=1AVRvfT1Cxhs_my-AM9h51Y70vBGhSRlc" width=35%>
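
The three variants differ mainly in the shapes of their inputs and outputs. The sketch below illustrates this with linear approximators; all sizes, weights, and function names are hypothetical, and the one-hot encoding of the action is just one possible way to feed an action in.

```python
import numpy as np

n_features, n_actions = 4, 3          # hypothetical sizes
rng = np.random.default_rng(0)
x = rng.normal(size=n_features)       # feature vector of one state

# (1) State-value approximation: state in, one scalar out.
w_v = rng.normal(size=n_features)
v_hat = x @ w_v

# (2) Action in: state and one-hot action go in, one scalar q(s, a) comes out.
w_in = rng.normal(size=n_features + n_actions)

def q_action_in(x, a):
    one_hot = np.eye(n_actions)[a]
    return np.concatenate([x, one_hot]) @ w_in

# (3) Action out: only the state goes in, one q-value per action comes out.
W_out = rng.normal(size=(n_features, n_actions))

def q_action_out(x):
    return x @ W_out

q_all_in = np.array([q_action_in(x, a) for a in range(n_actions)])  # one pass per action
q_all_out = q_action_out(x)                                         # a single pass
```

With action-out, finding the greedy action takes one evaluation of the approximator, which is convenient when the action set is small and discrete.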



## Challenges: functional approximation

1. Data is non-stationary

    * As you explore and discover new rewards, the value function can change drastically, e.g. when suddenly being crushed by a boulder in Zelda
    * As the policy changes there is a constant drift


2. Data is not i.i.d.

    * When playing a game, all subsequent experiences are highly correlated



## Definition: incremental SGD for prediction
We can approximate the value function incrementally by using stochastic gradient descent (SGD), updating the weights after each sample:
$$
\begin{aligned}
\mathbf{w}_{t+1} &=\mathbf{w}_{t}-\frac{1}{2} \alpha \nabla_{\mathbf{w}}\left(v_{\pi}\left(S_{t}\right)-\hat{v}\left(S_{t}, \mathbf{w}_{t}\right)\right)^{2} \\
&=\mathbf{w}_{t}+\alpha\left(v_{\pi}\left(S_{t}\right)-\hat{v}\left(S_{t}, \mathbf{w}_{t}\right)\right) \nabla_{\mathbf{w}} \hat{v}\left(S_{t}, \mathbf{w}_{t}\right)
\end{aligned}
$$
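How do we compute $v_{\pi}\left(S_{t}\right)$? We do not know it, so we substitute it with a target.

Note that the second line is just an alternative formulation of the first one that is common in reinforcement learning. It follows from taking the derivative in the first line with the chain rule: the factor 2 from the square cancels the $\frac{1}{2}$, and the inner derivative $-\nabla_{\mathbf{w}} \hat{v}$ flips the sign.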
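
The equivalence of the two lines can be checked numerically. The following is a minimal sketch for a linear approximator $\hat{v}(s, \mathbf{w}) = \mathbf{x}(s)^\top \mathbf{w}$, where the feature vector, weights, and target are made-up numbers and the function names are illustrative.

```python
import numpy as np

def v_hat(x, w):
    """Linear value estimate; its gradient with respect to w is simply x."""
    return x @ w

def sgd_step(w, x, target, alpha):
    """Second form: w + alpha * (target - v_hat) * grad v_hat."""
    return w + alpha * (target - v_hat(x, w)) * x

def numeric_step(w, x, target, alpha, eps=1e-6):
    """First form: w - 0.5 * alpha * grad of the squared error (central differences)."""
    grad = np.zeros_like(w)
    for i in range(len(w)):
        d = np.zeros_like(w)
        d[i] = eps
        loss_plus = (target - v_hat(x, w + d)) ** 2
        loss_minus = (target - v_hat(x, w - d)) ** 2
        grad[i] = (loss_plus - loss_minus) / (2 * eps)
    return w - 0.5 * alpha * grad

w = np.array([0.5, -0.2, 0.1])
x = np.array([1.0, 2.0, 0.5])
target = 1.0   # hypothetical stand-in for v_pi(S_t)
alpha = 0.1

w_analytic = sgd_step(w, x, target, alpha)
w_numeric = numeric_step(w, x, target, alpha)
print(np.allclose(w_analytic, w_numeric, atol=1e-6))
```

Both functions move the weights by the same amount, up to the rounding error of the finite-difference gradient.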

## Definition: substituting $v_{\pi}\left(S_{t}\right)$


For **MC** learning, the target is the return $G_{t}$ :
$$
\mathbf{w}_{t+1}=\mathbf{w}_{t}-\frac{1}{2} \alpha \nabla_{\mathbf{w}} \left(\color{red}{G_{t}}-\hat{v}\left(S_{t}, \mathbf{w}_{t}\right)\right)^{2}
$$
For $\mathbf{TD}(0)$, the target is $R_{t+1}+\gamma \hat{v}\left(S_{t+1}, \mathbf{w}_{t}\right)$. The target is treated as a constant when taking the gradient, even though it depends on the weights (a semi-gradient update):
$$
\mathbf{w}_{t+1}=\mathbf{w}_{t}-\frac{1}{2} \alpha \nabla_{\mathbf{w}}\left(\color{red}{R_{t+1}+\gamma \hat{v}\left(S_{t+1}, \mathbf{w}_{t}\right)}\color{black}-\hat{v}\left(S_{t}, \mathbf{w}_{t}\right)\right)^{2}
$$


For $\operatorname{TD}(\lambda)$, the target is the $\lambda$ return $G_{t}^{\lambda}$.
$$
\mathbf{w}_{t+1}=\mathbf{w}_{t}-\frac{1}{2} \alpha \nabla_{\mathbf{w}}\left(\color{red}{G_{t}^{\lambda}}-\hat{v}\left(S_{t}, \mathbf{w}_{t}\right)\right)^{2}
$$
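
To make the difference between the targets concrete, here is a small sketch of the MC and TD(0) updates for a linear approximator. The three-step episode, the one-hot features, and the function names are assumptions chosen for illustration; the TD(0) step treats the bootstrapped target as a constant.

```python
import numpy as np

def mc_update(w, features, rewards, alpha, gamma):
    """Gradient Monte Carlo: move v_hat(S_t) toward the sampled return G_t."""
    w = w.copy()
    g = 0.0
    # Walk the episode backwards so that G_t = R_{t+1} + gamma * G_{t+1}.
    for x, r in zip(reversed(features), reversed(rewards)):
        g = r + gamma * g
        w += alpha * (g - x @ w) * x
    return w

def td0_update(w, x, r, x_next, done, alpha, gamma):
    """Semi-gradient TD(0): bootstrap from v_hat(S_{t+1}), held constant."""
    target = r if done else r + gamma * (x_next @ w)
    return w + alpha * (target - x @ w) * x

# Hypothetical episode: three states with one-hot features, reward only at the end.
features = list(np.eye(3))
rewards = [0.0, 0.0, 1.0]
alpha, gamma = 0.5, 1.0

w_mc = mc_update(np.zeros(3), features, rewards, alpha, gamma)

w_td = np.zeros(3)
for t in range(3):
    done = t == 2
    x_next = np.zeros(3) if done else features[t + 1]
    w_td = td0_update(w_td, features[t], rewards[t], x_next, done, alpha, gamma)

print(w_mc, w_td)
```

After one episode, MC has moved the estimates of all three states toward the final reward, while TD(0) has only updated the last state; the reward reaches earlier states in later episodes through bootstrapping.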


## Definition: action-value function approximation for control

For control, we wish to approximate the action-value function $\hat{q}(S, A, \mathbf{w}) \approx q_{\pi}(S, A)$:
$$
\begin{aligned}
\mathbf{w}_{t+1} &=\mathbf{w}_{t}-\frac{1}{2} \alpha \nabla_{\mathbf{w}}\left(q_{\pi}\left(S_{t}, A_{t}\right)-\hat{q}\left(S_{t}, A_{t}, \mathbf{w}_{t}\right)\right)^{2} \\
&=\mathbf{w}_{t}+\alpha\left(q_{\pi}\left(S_{t}, A_{t}\right)-\hat{q}\left(S_{t}, A_{t}, \mathbf{w}_{t}\right)\right) \nabla_{\mathbf{w}} \hat{q}\left(S_{t}, A_{t}, \mathbf{w}_{t}\right)
\end{aligned}
$$
Similarly we substitute $q_{\pi}\left(S_{t}, A_{t}\right)$ with a target.



## Definition: substituting $q_{\pi}\left(S_{t}, A_{t}\right)$

For $\mathrm{MC}$ learning, the target is the return $G_{t}$ :
$$
\mathbf{w}_{t+1}=\mathbf{w}_{t}-\frac{1}{2} \alpha \nabla_{\mathbf{w}}\left(\color{red}{G_{t}}-\hat{q}\left(S_{t}, A_{t}, \mathbf{w}_{t}\right)\right)^{2}
$$
For $\mathrm{TD}(0)$, the target is $R_{t+1}+\gamma \hat{q}\left(S_{t+1}, A_{t+1}, \mathbf{w}_{t}\right)$ :
$$
\mathbf{w}_{t+1}=\mathbf{w}_{t}-\frac{1}{2} \alpha \nabla_{\mathbf{w}}\left(\color{red}{R_{t+1}+\gamma \hat{q}\left(S_{t+1}, A_{t+1}, \mathbf{w}_{t}\right)}-\hat{q}\left(S_{t}, A_{t}, \mathbf{w}_{t}\right)\right)^{2}
$$
For $\operatorname{TD}(\lambda)$, the target is the $\lambda$ return:
$$
\mathbf{w}_{t+1}=\mathbf{w}_{t}-\frac{1}{2} \alpha \nabla_{\mathbf{w}}\left(\color{red}{q_{t}^{\lambda}}-\hat{q}\left(S_{t}, A_{t}, \mathbf{w}_{t}\right)\right)^{2}
$$
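
As an illustration, the TD(0) update for action values (commonly called semi-gradient SARSA, a name not used above) could be sketched as follows for a linear $\hat{q}$. The state-action feature vectors and the numbers are made up, and the target is again treated as a constant when differentiating.

```python
import numpy as np

def q_hat(x_sa, w):
    """Linear action-value estimate for a state-action feature vector."""
    return x_sa @ w

def sarsa_update(w, x_sa, r, x_sa_next, done, alpha, gamma):
    """One weight update with target R_{t+1} + gamma * q_hat(S_{t+1}, A_{t+1}, w)."""
    target = r if done else r + gamma * q_hat(x_sa_next, w)
    td_error = target - q_hat(x_sa, w)
    return w + alpha * td_error * x_sa   # gradient of q_hat w.r.t. w is x_sa

# Hypothetical transition: features of (S_t, A_t) and (S_{t+1}, A_{t+1}).
x_sa = np.array([1.0, 0.0])
x_sa_next = np.array([0.0, 1.0])
alpha, gamma = 0.5, 0.9

w0 = np.zeros(2)
w1 = sarsa_update(w0, x_sa, 1.0, x_sa_next, False, alpha, gamma)
w2 = sarsa_update(w1, x_sa, 1.0, x_sa_next, False, alpha, gamma)
print(w1, w2)
```

Swapping the target for $G_t$ or the $\lambda$-return changes only the line that computes `target`; the rest of the update stays the same.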

The preceding sections borrow from [Wilcocks, 2021](https://cwkx.github.io/data/teaching/dl-and-rl/rl-lecture7.pdf).

### Feature extraction for linear methods

The question is how to create a compressed representation of a large state space. One option is to pick specific states in the state space and describe every state that is close to one of them in the same way. This is similar to discretization. Can we do better?

**Polynomials**

Suppose each state $s$ corresponds to $k$ numbers, $s_1, s_2, \ldots, s_k$, with each $s_j \in \mathbb{R}$. For this $k$-dimensional state space, each order-$n$ polynomial-basis feature $x_i$ can be written as

$$x_i(s) = \prod_{j=1}^k{s_j^{c_{i, j}}}$$

where each $c_{i, j}$ is an integer in the set $\{ 0, 1, \ldots, n \}$ for an integer $n \ge 0$. This gives $(n+1)^k$ distinct features.
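
A direct way to build these features is to enumerate every exponent combination. The sketch below does this with `itertools.product`; the function name and the example state are illustrative, and the enumeration becomes expensive quickly because the number of features grows exponentially in $k$.

```python
import itertools
import numpy as np

def polynomial_features(s, n):
    """All products s_1^c_1 * ... * s_k^c_k with each exponent c_j in {0, ..., n}."""
    s = np.asarray(s, dtype=float)
    exponents = itertools.product(range(n + 1), repeat=len(s))
    return np.array([np.prod(s ** np.array(c)) for c in exponents])

# Example: k = 2, n = 1 gives the features 1, s_2, s_1, s_1 * s_2.
x = polynomial_features([2.0, 3.0], n=1)
print(x)  # [1. 3. 2. 6.]
```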

**Radial Basis Functions**

Radial basis functions (RBFs) are useful for continuous-valued features. The RBF feature $x_i$ depends on the distance between the state $s$ and the corresponding center state $c_i$, and on the feature's width $\sigma_i$:

$$x_i(s) = e^{\left( -\frac{||s-c_i||^2}{2\sigma_i^2} \right)}$$

Note that for a two-dimensional state $s = (s_1, s_2)$ with center $c_i = (c_{i,1}, c_{i,2})$, the RBF feature looks as follows:
$$x_i(s) = e^{\left( -\frac{(s_1-c_{i,1})^2 + (s_2-c_{i,2})^2}{2\sigma_i^2} \right)}$$

<img src="http://drive.google.com/uc?export=view&id=16eO17_rp7JpaGE6jdlOxz7GH-8dpsgY8" width=75%>
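
The RBF features can be computed for all centers at once with NumPy broadcasting. This is a minimal sketch; the 3x3 grid of centers over the unit square and the shared width of 0.5 are assumptions chosen only for illustration.

```python
import numpy as np

def rbf_features(s, centers, sigmas):
    """x_i(s) = exp(-||s - c_i||^2 / (2 * sigma_i^2)) for every center c_i."""
    s = np.asarray(s, dtype=float)
    sq_dist = np.sum((centers - s) ** 2, axis=1)   # squared distance to each center
    return np.exp(-sq_dist / (2.0 * sigmas ** 2))

# Hypothetical 3x3 grid of centers over the unit square, all with the same width.
grid = np.linspace(0.0, 1.0, 3)
centers = np.array([[a, b] for a in grid for b in grid])
sigmas = np.full(len(centers), 0.5)

x = rbf_features([0.5, 0.5], centers, sigmas)
print(x.round(3))
```

The feature whose center coincides with the state equals 1, and the others decay smoothly with distance, so nearby states get similar feature vectors.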