# Function Approximation

For large state spaces, it is not feasible to store value functions in tabular form:
* The number of states can be enormous or even infinite (e.g., when the state is a camera image).
* Memory is limited, so we cannot store a value for every state.
* Time and data are also limited.
* Many states may never be visited.

This can be addressed by using *parameterized functions* for approximating the value functions.


## Linear value-function approximation

Let $\hat{v}$ represent an approximate state-value function.

$$ \hat{v}(s,\textbf{w}) \doteq \sum_i w_i x_i(s) $$

$$  \hat{v}(s,\textbf{w}) \doteq \textbf{w}^\text{T}\textbf{x}(s) $$

where
\
$\textbf{w}$ $-$ weight vector
\
$\textbf{x}(s)$ $-$ feature vector
\
$x_i(s)$ $-$ $i^{th}$ feature
\
$w_i$ $-$ $i^{th}$ weight

$\textbf{w}$ and $\textbf{x}(s)$ have the same dimension.

#### Tabular value functions as Linear functions

If we use one-hot vectors as feature vectors, the weight corresponding to the $1$ in each vector becomes the value of the state that vector represents.

**INSERT TABLE FOR EXAMPLE**

$$\textbf{x}(s_1) = \begin{bmatrix}
1\\ 
0\\ 
0
\end{bmatrix}\ \ \ 
\textbf{x}(s_2) = \begin{bmatrix}
0\\ 
1\\ 
0
\end{bmatrix}\ \ \ 
\textbf{x}(s_3) = \begin{bmatrix}
0\\ 
0\\ 
1
\end{bmatrix}$$

$$\hat{v}(s_1, \textbf{w}) = w_1\ \ \ \hat{v}(s_2, \textbf{w}) = w_2\ \ \ \hat{v}(s_3, \textbf{w}) = w_3$$
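As a minimal Python sketch of this equivalence (the helper names `v_hat` and `one_hot` and the weight values are illustrative assumptions, not part of any library), the linear estimate with one-hot features simply reads back one weight per state:

```python
def v_hat(w, x):
    """Linear value estimate: the dot product of weights and features."""
    return sum(w_i * x_i for w_i, x_i in zip(w, x))


def one_hot(index, size):
    """Feature vector with a single 1 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


# Made-up weights; with one-hot features each weight is exactly one state's value.
w = [0.5, -1.0, 2.0]
values = [v_hat(w, one_hot(s, len(w))) for s in range(len(w))]
print(values)  # [0.5, -1.0, 2.0]
```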

#### Generalization and Discrimination

We need to do better than one-hot vectors for features. We want two properties in the features we choose:

* High __Generalization__ - representing multiple states using their common features
* High __Discrimination__ - being able to identify different states based on their differences


## The Prediction Objective

In the tabular case, the learned value of each state could become exactly equal to its true value, and the values of different states were decoupled.

With function approximation, we typically have far more states than weights. The weights generalize across states, so changing the value of one state also changes the values of other states.

We must specify a distribution that describes how much we care about the error in each state.

$$ \mu(s) \geq 0, \sum_s \mu(s) = 1 $$

Using this, we describe the *Mean Squared Value Error* as:

$$ \overline{\text{VE}} = \sum_s \mu(s) \left [ v_{\pi}(s) - \hat{v}(s,\textbf{w}) \right ]^2 $$

We want to minimize this error. We could use gradient descent...

$$ \bigtriangledown \overline{\text{VE}} = \bigtriangledown  \sum_s \mu(s) \left [ v_{\pi}(s) - \hat{v}(s,\textbf{w}) \right ]^2 $$

$$ \bigtriangledown \overline{\text{VE}} = \sum_s \mu(s) \bigtriangledown \left [ v_{\pi}(s) - \hat{v}(s,\textbf{w}) \right ]^2 $$

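$$ \bigtriangledown \overline{\text{VE}} = - \sum_s \mu(s)\ 2 \left [ v_{\pi}(s) - \hat{v}(s,\textbf{w}) \right ] \bigtriangledown \hat{v}(s, \textbf{w}) $$

Gradient descent steps against this gradient, so $\textbf{w}$ moves in the direction of $\left [ v_{\pi}(s) - \hat{v}(s,\textbf{w}) \right ] \bigtriangledown \hat{v}(s, \textbf{w})$, with the constant $2$ absorbed into the step size.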
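To make the objective concrete, here is a small sketch of one gradient-descent step on $\overline{\text{VE}}$ for a linear $\hat{v}$, where $\bigtriangledown \hat{v}(s, \textbf{w}) = \textbf{x}(s)$. It assumes we are handed $\mu$ and $v_{\pi}$ for three states, which is unrealistic in practice; the feature vectors, numbers, and function names are all made up for illustration.

```python
def dot(a, b):
    return sum(a_i * b_i for a_i, b_i in zip(a, b))


def value_error(w, mu, v_true, features):
    """VE(w) = sum_s mu(s) * (v_pi(s) - w^T x(s))^2."""
    return sum(m * (vt - dot(w, x)) ** 2 for m, vt, x in zip(mu, v_true, features))


def value_error_gradient(w, mu, v_true, features):
    """Gradient of VE: -2 * sum_s mu(s) * (v_pi(s) - w^T x(s)) * x(s)."""
    grad = [0.0] * len(w)
    for m, vt, x in zip(mu, v_true, features):
        error = vt - dot(w, x)
        for i, x_i in enumerate(x):
            grad[i] -= 2.0 * m * error * x_i
    return grad


# Three states described by two features (all numbers are made up).
features = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
mu = [0.5, 0.25, 0.25]
v_true = [1.0, 2.0, 4.0]
w = [0.0, 0.0]

alpha = 0.1
before = value_error(w, mu, v_true, features)
grad = value_error_gradient(w, mu, v_true, features)
w = [w_i - alpha * g_i for w_i, g_i in zip(w, grad)]  # one gradient-descent step
after = value_error(w, mu, v_true, features)
print(before, after)  # the error decreases after the step
```

Note that the second state shares a feature with each of the other two, so a single step changes all three value estimates at once.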

## Monte Carlo with Function Approximation

The expression above sums over every state, and we usually don't know $\mu(s)$ because we only observe states online as the agent follows the policy. Enter *Monte Carlo*! We __sample states__ and update the weights one sample at a time.

Let's assume that we have access to $v_{\pi}$ values and we sample states: $ \left ( S_1, v_{\pi}(S_1) \right ), \left ( S_2, v_{\pi}(S_2) \right ), \ldots $

The *Stochastic Gradient Descent update* for one sample can be written as:

$$ \textbf{w} \leftarrow \textbf{w} + \alpha \left [ v_{\pi}(S_t) - \hat{v}(S_t, \textbf{w}) \right ] \bigtriangledown \hat{v}(S_t, \textbf{w}) $$

But we don't actually know $v_{\pi}$. So we use the sampled return $G_t$ instead, whose expected value is $v_{\pi}(S_t)$.

$$ \textbf{w} \leftarrow \textbf{w} + \alpha \left [ G_t - \hat{v}(S_t, \textbf{w}) \right ] \bigtriangledown \hat{v}(S_t, \textbf{w}) $$

Let's rewrite the MC SGD update equation with a generic notation for a target $U_t$:

$$ \textbf{w} \leftarrow \textbf{w} + \alpha \left [ U_t - \hat{v}(S_t, \textbf{w}) \right ] \bigtriangledown \hat{v}(S_t, \textbf{w}) $$

If $U_t$ is an unbiased estimate of the true value (as $G_t$ is), then $\textbf{w}$ will converge to a local optimum, provided the step size $\alpha$ is decreased appropriately over time.
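The Monte Carlo update with $U_t = G_t$ can be sketched as follows for a linear $\hat{v}$. This is a minimal illustration, not a reference implementation: the function name `gradient_mc_episode`, the episode format, and the toy two-state episode are assumptions made for the example.

```python
def dot(a, b):
    return sum(a_i * b_i for a_i, b_i in zip(a, b))


def gradient_mc_episode(w, episode, alpha, gamma):
    """Update w from one finished episode.

    `episode` is a list of (x_t, reward) pairs, where x_t is the feature
    vector of S_t and reward is R_{t+1}.
    """
    # Compute the returns backwards: G_t = R_{t+1} + gamma * G_{t+1}.
    returns = []
    g = 0.0
    for _, reward in reversed(episode):
        g = reward + gamma * g
        returns.append(g)
    returns.reverse()

    # One SGD step per visited state, with target U_t = G_t.
    for (x, _), g_t in zip(episode, returns):
        error = g_t - dot(w, x)
        w = [w_i + alpha * error * x_i for w_i, x_i in zip(w, x)]  # grad of v_hat is x
    return w


# Toy episode (made up): s0 -> s1 -> terminal with rewards 0 and 1, one-hot features.
episode = [([1.0, 0.0], 0.0), ([0.0, 1.0], 1.0)]
w = [0.0, 0.0]
for _ in range(200):
    w = gradient_mc_episode(w, episode, alpha=0.1, gamma=0.5)
print(w)  # approaches [0.5, 1.0], the returns from s0 and s1
```

The weights can only change once the episode has finished, because every return depends on all later rewards.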

## TD with Function Approximation

As in the tabular case, Monte Carlo methods using function approximation require the episode to complete before the weights (and through them the state values) can be updated. We can do better with Temporal Difference learning.

We can use the one-step TD target:

$$ U_t \doteq R_{t+1} + \gamma . \hat{v}(S_{t+1}, \textbf{w}) $$

Bootstrapped targets such as this one use the current value of $\textbf{w}$, through $\hat{v}(S_{t+1}, \textbf{w})$, and this makes $U_t$ a biased estimate of $v_{\pi}(S_t)$.

Bootstrapping methods are not instances of true gradient descent. They are called __semi-gradient__ methods because they take into account the effect of changing the weights on the estimate $\hat{v}(S_t, \textbf{w})$ but not on the target $U_t$.

$$ \bigtriangledown \left [ U_t - \hat{v}(S_t, \textbf{w}) \right ]^2 = 2 \left ( U_t - \hat{v}(S_t, \textbf{w}) \right ) \left ( \bigtriangledown U_t - \bigtriangledown \hat{v}(S_t, \textbf{w}) \right) $$

This update term becomes the gradient descent update term when

$$ \bigtriangledown U_t = 0 $$

But, for one-step TD, 

$$ \bigtriangledown U_t = \gamma . \bigtriangledown \hat{v}(S_{t+1}, \textbf{w}) \neq 0 $$ 

i.e. it depends on $\textbf{w}$

Semi-gradient methods are not as robust as true gradient methods. So __why do we prefer semi-gradient methods__?
* They converge reliably in important cases such as linear function approximation.
* They typically enable significantly faster learning.
* They allow learning to be continual and online, without waiting for the end of an episode (so they are not limited to episodic tasks).
* They offer computational advantages.

## Semi-gradient methods with Linear function approximation

Recall the update equation for weights

$$ \textbf{w} \leftarrow \textbf{w} + \alpha \left [ U_t - \hat{v}(S_t, \textbf{w}) \right ] \bigtriangledown \hat{v}(S_t, \textbf{w}) $$

For the one-step TD target $ U_t = R_{t+1} + \gamma . \hat{v}(S_{t+1}, \textbf{w}) $, we can rewrite the update equation as:

$$ \textbf{w} \leftarrow \textbf{w} + \alpha\ \delta_t \bigtriangledown \hat{v}(S_t, \textbf{w}) $$

where

$$ \delta_t \doteq R_{t+1} + \gamma . \hat{v}(S_{t+1},\textbf{w}) - \hat{v}(S_{t},\textbf{w}) $$

If we use linear approximation functions:

$$ \hat{v}(S_t,\textbf{w}) \doteq \textbf{w}^\text{T}\textbf{x}(S_t) $$

$$ \bigtriangledown \hat{v}(S_t,\textbf{w}) = \textbf{x}(S_t) $$

The weight update equation now becomes:

$$ \textbf{w} \leftarrow \textbf{w} + \alpha\ \delta_t \textbf{x}(S_t) $$

Let $\textbf{x}_t = \textbf{x}(S_t)$,

$$ \textbf{w} \leftarrow \textbf{w} + \alpha\ \delta_t \textbf{x}_t $$

$$ \textbf{w} \leftarrow \textbf{w} + \alpha\ \left [ R_{t+1} + \gamma . \hat{v}(S_{t+1},\textbf{w}) - \hat{v}(S_{t},\textbf{w}) \right ] \textbf{x}_t $$

Expand $ \hat{v} $,

$$ \textbf{w} \leftarrow \textbf{w} + \alpha\ \left [ R_{t+1} + \gamma . \textbf{w}^\text{T}\textbf{x}_{t+1} - \textbf{w}^\text{T}\textbf{x}_t \right ] \textbf{x}_t $$

$$ \textbf{w} \leftarrow \textbf{w} + \alpha\ \left [ R_{t+1} \textbf{x}_t - \textbf{x}_t (\textbf{x}_t - \gamma\ \textbf{x}_{t+1} )^\text{T}\ \textbf{w} \right ] $$
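The linear semi-gradient TD(0) update $\textbf{w} \leftarrow \textbf{w} + \alpha\ \delta_t \textbf{x}_t$ is short enough to sketch directly. The function name `td0_update`, the toy chain, and the convention of giving the terminal state an all-zero feature vector are assumptions made for this example.

```python
def dot(a, b):
    return sum(a_i * b_i for a_i, b_i in zip(a, b))


def td0_update(w, x_t, reward, x_next, alpha, gamma):
    """One semi-gradient TD(0) step for a linear value function."""
    delta = reward + gamma * dot(w, x_next) - dot(w, x_t)  # TD error
    return [w_i + alpha * delta * x_i for w_i, x_i in zip(w, x_t)]


# Toy chain (made up): s0 -> s1 -> terminal, rewards 0 then 1.
# The terminal state gets an all-zero feature vector so its value is 0.
x0, x1, x_terminal = [1.0, 0.0], [0.0, 1.0], [0.0, 0.0]
transitions = [(x0, 0.0, x1), (x1, 1.0, x_terminal)]

w = [0.0, 0.0]
for _ in range(500):
    for x_t, reward, x_next in transitions:
        w = td0_update(w, x_t, reward, x_next, alpha=0.1, gamma=0.5)
print(w)  # approaches [0.5, 1.0]
```

Unlike the Monte Carlo update, each step uses only a single transition, so the weights can be updated online after every time step.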

#### Update Expectation

Let vector $ \textbf{b} \doteq \mathop{{}\mathbb{E}} \left [ R_{t+1} \textbf{x}_t \right ] $

and matrix $ \textbf{A} \doteq \mathop{{}\mathbb{E}} \left [ \textbf{x}_t (\textbf{x}_t - \gamma\ \textbf{x}_{t+1} )^\text{T} \right ] $

The expected update can be written as

$$ \mathop{{}\mathbb{E}}  \left [ \Delta \textbf{w}_t  \right ] = \alpha \left ( \textbf{b} - \textbf{Aw} \right ) $$

The weights are said to converge when $ \mathop{{}\mathbb{E}}  \left [ \Delta \textbf{w}_t  \right ] = 0 $.

The weights at this *TD Fixed Point* are given by:

$$ \textbf{w}_{\text{TD}} = \textbf{A}^{-1}\textbf{b} $$

At this TD fixed point, $\overline{\text{VE}}$ is within a bounded expansion of the lowest possible error.

$$ \overline{\text{VE}}\ (\textbf{w}_{\text{TD}}) \leq \frac{1}{1-\gamma}\ \underset{\textbf{w}}{\min}\ \overline{\text{VE}}\ (\textbf{w}) $$

Since we often use $\gamma$ values near one, this expansion factor can be quite large. 
\
But TD methods often have much lower variance than MC methods and therefore often learn faster, even though linear gradient MC converges to the minimum of $\overline{\text{VE}}$ itself.
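As a hedged illustration of the fixed point $\textbf{w}_{\text{TD}} = \textbf{A}^{-1}\textbf{b}$, the sketch below builds $\textbf{A}$ and $\textbf{b}$ as expectations over two equally likely transitions of a made-up chain and solves the resulting $2 \times 2$ system. The helpers `outer_minus` and `solve_2x2` are hypothetical names written for this example; a real problem would use a linear-algebra library and would estimate the expectations from data.

```python
def outer_minus(x_t, x_next, gamma):
    """Outer product x_t (x_t - gamma * x_next)^T as a nested list."""
    diff = [a - gamma * b for a, b in zip(x_t, x_next)]
    return [[x_i * d_j for d_j in diff] for x_i in x_t]


def solve_2x2(a, b):
    """Solve a 2x2 linear system A w = b with Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [
        (b[0] * a[1][1] - a[0][1] * b[1]) / det,
        (a[0][0] * b[1] - a[1][0] * b[0]) / det,
    ]


gamma = 0.5
# (probability, x_t, R_{t+1}, x_{t+1}) for a made-up chain s0 -> s1 -> terminal.
transitions = [
    (0.5, [1.0, 0.0], 0.0, [0.0, 1.0]),
    (0.5, [0.0, 1.0], 1.0, [0.0, 0.0]),
]

A = [[0.0, 0.0], [0.0, 0.0]]
b = [0.0, 0.0]
for p, x_t, reward, x_next in transitions:
    m = outer_minus(x_t, x_next, gamma)
    for i in range(2):
        b[i] += p * reward * x_t[i]          # b = E[R_{t+1} x_t]
        for j in range(2):
            A[i][j] += p * m[i][j]           # A = E[x_t (x_t - gamma x_{t+1})^T]

w_td = solve_2x2(A, b)
print(w_td)  # [0.5, 1.0]
```

Because the features here are one-hot, the fixed point coincides with the true values of the two states; with fewer features than states it would generally only be the best the TD update can settle on.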



## Feature construction

* Features are an important way of adding domain knowledge into Reinforcement Learning systems.
* Since linear methods cannot represent interactions between features, we must construct features that are combinations of the state dimensions.
* Features should allow high generalization and discrimination.

#### Some general methods of Feature Construction:

* __Polynomials__: Combinations of the state dimensions. The polynomial coefficients will be the weights.
* __Fourier Bases__: Any periodic function can be expressed as a weighted sum of sines and cosines of different frequencies.
* __Coarse coding__: Use overlapping N-dimensional shapes as binary features; a feature is 1 when the state lies inside its shape. Large shapes provide generalization, and the intersections of overlapping shapes provide discrimination.
* __Tile coding__: Coarse coding with arbitrary shapes in N dimensions is hard to implement, so this strategy uses several regular grids (tilings), each offset from the others by a fraction of a tile width.
* __Neural Networks__: Let the network learn to identify useful features during training.
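To illustrate tile coding from the list above, here is a minimal one-dimensional sketch. The function names, the uniform offsets, and the extra tile per tiling are simplifying assumptions of this example rather than a standard implementation.

```python
import math


def tile_features(s, low, high, tiles_per_tiling, num_tilings):
    """Binary features for a scalar state s in [low, high] using offset 1-D tilings."""
    width = (high - low) / tiles_per_tiling
    tiles = tiles_per_tiling + 1  # one extra tile so shifted tilings still cover the range
    features = [0.0] * (num_tilings * tiles)
    for k in range(num_tilings):
        offset = k * width / num_tilings  # shift each tiling by a fraction of a tile
        index = math.floor((s - low + offset) / width)
        index = min(max(index, 0), tiles - 1)  # clip states on the boundary
        features[k * tiles + index] = 1.0
    return features


def shared_tiles(a, b):
    """Number of active tiles two feature vectors have in common."""
    return int(sum(x * y for x, y in zip(a, b)))


ref = tile_features(0.30, 0.0, 1.0, 4, 2)
near = shared_tiles(ref, tile_features(0.35, 0.0, 1.0, 4, 2))
mid = shared_tiles(ref, tile_features(0.40, 0.0, 1.0, 4, 2))
far = shared_tiles(ref, tile_features(0.90, 0.0, 1.0, 4, 2))
print(near, mid, far)  # 2 1 0
```

Nearby states share most of their active tiles (generalization), while distant states share none (discrimination).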


## Control with Function Approximation

For control, we need action-value functions. We have been estimating state-value functions thus far. We can estimate the action-values in the same way.

We know,
$$  \hat{v}(s,\textbf{w}) \doteq \textbf{w}^\text{T}\textbf{x}(s) $$
Similarly,
$$  \hat{q}(s,a,\textbf{w}) \doteq \textbf{w}^\text{T}\textbf{x}(s,a) $$
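One simple way to build $\textbf{x}(s,a)$ from state features, assuming a small discrete action set, is to keep a separate block of weights per action and copy the state features into the block of the chosen action. The sketch below shows this stacking; the function names and numbers are illustrative assumptions, and other constructions are possible.

```python
def dot(a, b):
    return sum(a_i * b_i for a_i, b_i in zip(a, b))


def action_features(x_state, action, num_actions):
    """Stack the state features into the block belonging to `action`."""
    d = len(x_state)
    x = [0.0] * (d * num_actions)
    x[action * d:(action + 1) * d] = x_state
    return x


def q_hat(w, x_state, action, num_actions):
    """Linear action-value estimate w^T x(s, a)."""
    return dot(w, action_features(x_state, action, num_actions))


w = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # one block of two weights per action
x_state = [1.0, 0.5]
q_values = [q_hat(w, x_state, a, 3) for a in range(3)]
print(q_values)  # [2.0, 5.0, 8.0]
```

With this construction there is no generalization across actions, only across states.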

### Sarsa

The weight update for TD learning was,

$$ \textbf{w} \leftarrow \textbf{w} + \alpha\ \left [ R_{t+1} + \gamma . \hat{v}(S_{t+1},\textbf{w}) - \hat{v}(S_{t},\textbf{w}) \right ] \bigtriangledown \hat{v}(S_t, \textbf{w}) $$

For TD with GPI (Sarsa), it becomes,

$$ \textbf{w} \leftarrow \textbf{w} + \alpha\ \left [ R_{t+1} + \gamma . \hat{q}(S_{t+1},A_{t+1},\textbf{w}) - \hat{q}(S_{t},A_{t},\textbf{w}) \right ] \bigtriangledown \hat{q}(S_t, A_t, \textbf{w}) $$

Note the similarity to the Tabular Sarsa update rule:

$$ Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha \left [ R_{t+1} + \gamma . Q(S_{t+1},A_{t+1}) - Q(S_t,A_t) \right ] $$
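For a linear $\hat{q}$, the gradient $\bigtriangledown \hat{q}(S_t, A_t, \textbf{w})$ is just the feature vector $\textbf{x}(S_t, A_t)$, so one semi-gradient Sarsa step can be sketched as below. The function name and the one-hot state-action features are assumptions made for the example.

```python
def dot(a, b):
    return sum(a_i * b_i for a_i, b_i in zip(a, b))


def sarsa_update(w, x_sa, reward, x_next_sa, alpha, gamma):
    """One semi-gradient Sarsa step for a linear action-value function."""
    delta = reward + gamma * dot(w, x_next_sa) - dot(w, x_sa)
    return [w_i + alpha * delta * x_i for w_i, x_i in zip(w, x_sa)]


# Made-up one-hot features over four (state, action) pairs.
w = [0.0, 0.0, 0.0, 2.0]
x_sa = [1.0, 0.0, 0.0, 0.0]       # features of (S_t, A_t)
x_next_sa = [0.0, 0.0, 0.0, 1.0]  # features of (S_{t+1}, A_{t+1})
w = sarsa_update(w, x_sa, reward=1.0, x_next_sa=x_next_sa, alpha=0.5, gamma=0.9)
print(w)  # only the weight of (S_t, A_t) changes
```

With one-hot features only the weight for $(S_t, A_t)$ moves, which is exactly the tabular Sarsa rule with $Q(S_t, A_t)$ stored in that weight.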

### Expected Sarsa

Compute the expectation of the next action's value using the current policy (instead of sampling actions from the policy):

$$ \textbf{w} \leftarrow \textbf{w} + \alpha\ \left [ R_{t+1} + \gamma . \sum_{a'} \pi (a'|S_{t+1})\ \hat{q}(S_{t+1},a',\textbf{w}) - \hat{q}(S_{t},A_{t},\textbf{w}) \right ] \bigtriangledown \hat{q}(S_t, A_t, \textbf{w}) $$

$$ Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha \left [ R_{t+1} + \gamma . \sum_{a'} \pi (a'|S_{t+1}) . Q(S_{t+1},a') - Q(S_t,A_t) \right ] $$

### Q-learning

Off-policy learning in which the target policy is greedy with respect to the current action-value estimates; the behavior policy can be different (e.g., $\epsilon$-greedy).

$$ \textbf{w} \leftarrow \textbf{w} + \alpha\ \left [ R_{t+1} + \gamma . \underset{a'}{max}\ \hat{q}(S_{t+1},a',\textbf{w}) - \hat{q}(S_{t},A_{t},\textbf{w}) \right ] \bigtriangledown \hat{q}(S_t, A_t, \textbf{w}) $$

$$ Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha \left [ R_{t+1} + \gamma . \underset{a'}{max}\ Q(S_{t+1},a') - Q(S_t,A_t) \right ] $$
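The three control methods differ only in the target inside the brackets. The sketch below computes all three targets for the same transition so they can be compared side by side; the function name `td_targets` and the numbers are made up for illustration.

```python
def td_targets(reward, gamma, q_next, pi_next, next_action):
    """Return the (Sarsa, Expected Sarsa, Q-learning) targets for one transition.

    q_next[a]   : q_hat(S_{t+1}, a, w) for every action a
    pi_next[a]  : pi(a | S_{t+1})
    next_action : the action A_{t+1} actually sampled
    """
    sarsa = reward + gamma * q_next[next_action]
    expected_sarsa = reward + gamma * sum(p * q for p, q in zip(pi_next, q_next))
    q_learning = reward + gamma * max(q_next)
    return sarsa, expected_sarsa, q_learning


targets = td_targets(
    reward=0.0, gamma=1.0, q_next=[1.0, 3.0], pi_next=[0.5, 0.5], next_action=0
)
print(targets)  # (1.0, 2.0, 3.0)
```

In every case the resulting target is then plugged into the same update, $\textbf{w} \leftarrow \textbf{w} + \alpha \left [ U_t - \hat{q}(S_t, A_t, \textbf{w}) \right ] \bigtriangledown \hat{q}(S_t, A_t, \textbf{w})$.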


## Exploration and exploitation in Function Approximation

### Optimistic initialization

* Initialize the weights $\textbf{w}$ so that the initial value estimates are optimistic.
* It is unclear how to choose optimistic weights for non-linear approximators (e.g., neural nets).
* Due to generalization, optimism can be lost quickly, before some states are even visited.
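The last point can be seen in a tiny sketch: two states share one feature, and a single update on state A already lowers the optimistic estimate for state B, which has never been visited. The states, features, and numbers are made up for illustration.

```python
def dot(a, b):
    return sum(a_i * b_i for a_i, b_i in zip(a, b))


# Two states that share the middle feature (hypothetical features).
features = {"A": [1.0, 1.0, 0.0], "B": [0.0, 1.0, 1.0]}
w = [5.0, 5.0, 5.0]  # optimistic start: both states are valued at 10

v_b_before = dot(w, features["B"])

# One update on state A only, towards an observed return of 0.
alpha = 0.25
error = 0.0 - dot(w, features["A"])
w = [w_i + alpha * error * x_i for w_i, x_i in zip(w, features["A"])]

v_b_after = dot(w, features["B"])
print(v_b_before, v_b_after)  # 10.0 7.5
```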

### $\epsilon-$greedy action selection

* Can be applied with any function approximator, just as in tabular methods.
* Exploration is random rather than systematic.
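A minimal sketch of $\epsilon$-greedy selection over approximate action values is shown below. The function name, the explicit `rng` argument, and the seeded generator are choices made for this example so that it is reproducible.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Greedy choice; ties go to the lowest action index.
    return max(range(len(q_values)), key=lambda a: q_values[a])


rng = random.Random(0)  # seeded only to make the sketch reproducible
q_values = [0.1, 0.9, 0.3]  # e.g. q_hat(s, a, w) for each action a

greedy_action = epsilon_greedy(q_values, 0.0, rng)  # epsilon = 0 is purely greedy
actions = [epsilon_greedy(q_values, 0.1, rng) for _ in range(1000)]
greedy_fraction = actions.count(1) / len(actions)
print(greedy_action, greedy_fraction)
```

The exploratory actions are chosen uniformly at random, without regard to how often each action has already been tried.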

## The Deadly Triad

Using the following three elements together can lead to instability and divergence.

* __Function Approximation__ (Ex: Linear function approximation, Artificial Neural Nets).
  * Can't let go of this one.
  

* __Bootstrapping__: Using existing estimates in targets (Dynamic Programming, Temporal Difference).
  * Bootstrapping can be avoided by using MC methods, at the cost of extra memory.
  * Learning is typically slower without bootstrapping.


* __Off-policy learning__: Training on a distribution of transitions other than that produced by the target policy.
  * In large-scale learning, with many agents learning in parallel, Off-policy learning becomes essential.
  * On-policy methods are often adequate for small-scale problems.
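A well-known two-state example shows how the three elements interact. Both states share a single weight $w$ (function approximation), the update is semi-gradient TD(0) (bootstrapping), and only the transition from the first state to the second is ever updated (a distribution no on-policy run would produce). The sketch below is an illustration under those assumptions; the function names and constants are made up.

```python
def off_policy_td_step(w, alpha, gamma):
    """Semi-gradient TD(0) on a single transition between two states.

    The first state has feature 1 (value w), the second has feature 2
    (value 2w), and the reward is 0. Only this transition is ever updated.
    """
    delta = 0.0 + gamma * 2.0 * w - 1.0 * w
    return w + alpha * delta * 1.0


def run(gamma, steps=100, alpha=0.1, w=1.0):
    for _ in range(steps):
        w = off_policy_td_step(w, alpha, gamma)
    return w


w_diverging = run(gamma=0.99)  # each step multiplies w by 1 + alpha * (2 * gamma - 1) > 1
w_shrinking = run(gamma=0.4)   # the factor is below 1 when gamma < 0.5
print(w_diverging, w_shrinking)
```

The weight grows without bound for $\gamma > 0.5$ because raising $w$ to chase the target also raises the target itself.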
  