# Chapter 09 - Exercises

### Exercise 9.1

**Q**

Show that tabular methods such as presented in Part I of this book are a special case of linear function approximation. What would the feature vectors be?

**A**

Consider the equation 9.15:

$$
\textbf{w}_{t+n} \doteq \textbf{w}_{t+n-1} + \alpha [G_{t:t+n} - \widehat{v}(S_t, \textbf{w}_{t+n-1})] \nabla \widehat{v}(S_t, \textbf{w}_{t+n-1})
$$

The corresponding equation of the tabular method update is:

$$
V_{t+n}(S_t) \doteq V_{t+n-1}(S_t) + \alpha [G_{t:t+n} - V_{t+n-1}(S_t)]
$$

In the first equation, $\widehat{v}(S_t, \textbf{w})$ returns the (approximate) value of the given state, which plays the same role as the direct lookup $V(S_t)$ in the tabular method. The difference lies in what is updated: the tabular method updates the single state-value of $S_t$, whereas the function approximation method updates the weight vector $\textbf{w}$.

This suggests that the tabular method is the special case in which $\textbf{w}$ has exactly $|S|$ elements, each equal to the corresponding state-value, that is, $V(s) = \textbf{w}_s$.

Consequently, the weights vector is $[V(s_0), V(s_1), ..., V(s_{|S| - 1})] = V$, that is, $\textbf{w} = V$.

The objective is to:

1. Consider the weight vector $\textbf{w}$ (not to be confused with the feature vector) as the vector of state-values;
2. Choose a function $\widehat{v}$ that receives a state and the weight vector as inputs and gives the state-value of that state as output, so $\widehat{v}(S_t, \textbf{w}) = V(S_t)$;
3. Ensure that the only state-value updated in time step $t$ is $V(S_t)$; this can be guaranteed if $\nabla \widehat{v}(S_t, \textbf{w}_{t+n-1})$ is a vector with value 1 at the element corresponding to the state $S_t$, and 0 for all other elements.

To recover the tabular update from the function approximation update, let $\widehat{v}$ return the state-value of the given state. For the *k*th state, define the feature vector $x_k$ (with $|S|$ elements) to have 1 at element *k* and 0 elsewhere. Taking the inner product of $x_k$ with $\textbf{w}$ is then the same as multiplying the *k*th row of the $|S| \times |S|$ identity matrix, call it $I_k$, by $\textbf{w}$:

$$
\widehat{v}(s_k, \textbf{w}) = x_k \times \textbf{w} = I_k \times \textbf{w} = w_k
$$

or, stacking the values of all states into a single vector:

$$
\widehat{v} = I \times \textbf{w} = \textbf{w}
$$
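The one-hot construction above can be checked with a minimal Python sketch. The helper names `one_hot` and `v_hat` and the weight values are hypothetical, chosen only for illustration; states are assumed to be indexed from 0 to $|S| - 1$.

```python
def one_hot(k, n_states):
    """Feature vector x_k: 1 at index k, 0 elsewhere (row k of the identity)."""
    return [1.0 if i == k else 0.0 for i in range(n_states)]


def v_hat(k, w):
    """Linear value estimate: inner product of x_k and the weight vector w."""
    x = one_hot(k, len(w))
    return sum(x_i * w_i for x_i, w_i in zip(x, w))


# Hypothetical weights, one per state (illustrative values only).
w = [0.5, -1.0, 2.0, 0.25]

# Evaluating every state just reads back the weights: v_hat(s_k, w) = w_k.
values = [v_hat(k, w) for k in range(len(w))]
```

With these features the linear approximator is nothing more than a table lookup, which is the content of $\widehat{v} = I \times \textbf{w} = \textbf{w}$.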

Keep in mind that each weight $w_k$ corresponds to exactly one state $s_k$ (each weight is the state-value of its own state) and that the weights are independent of one another, which means that $\frac{\partial \textbf{w}_a}{\partial \textbf{w}_b} = 1$ if $a = b$ and $0$ otherwise.

We have:

$$
\nabla \widehat{v}(s_k, \textbf{w}) = \left[ \frac{\partial \widehat{v}(s_k, \textbf{w})}{\partial \textbf{w}_0}, ..., \frac{\partial \widehat{v}(s_k, \textbf{w})}{\partial \textbf{w}_k}, ..., \frac{\partial \widehat{v}(s_k, \textbf{w})}{\partial \textbf{w}_{|S| - 1}} \right] = \left[ \frac{\partial \textbf{w}_k}{\partial \textbf{w}_0}, ..., \frac{\partial \textbf{w}_k}{\partial \textbf{w}_k}, ..., \frac{\partial \textbf{w}_k}{\partial \textbf{w}_{|S| - 1}} \right] = \left[ 0, ..., 1, ..., 0 \right] = I_k
$$
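As a sanity check on this gradient, the following sketch approximates $\nabla \widehat{v}(s_k, \textbf{w})$ by central finite differences and compares it with $I_k$. The function names, the weight values, and the step size `eps` are assumptions made for this example.

```python
def one_hot(k, n_states):
    """Feature vector x_k: 1 at index k, 0 elsewhere."""
    return [1.0 if i == k else 0.0 for i in range(n_states)]


def v_hat(k, w):
    """Linear value estimate with one-hot features: x_k . w."""
    return sum(x_i * w_i for x_i, w_i in zip(one_hot(k, len(w)), w))


def numerical_gradient(k, w, eps=1e-6):
    """Central finite-difference estimate of the gradient of v_hat(s_k, w) w.r.t. w."""
    grad = []
    for i in range(len(w)):
        w_plus = list(w)
        w_minus = list(w)
        w_plus[i] += eps
        w_minus[i] -= eps
        grad.append((v_hat(k, w_plus) - v_hat(k, w_minus)) / (2 * eps))
    return grad


w = [0.5, -1.0, 2.0, 0.25]       # hypothetical weights
k = 2
grad = numerical_gradient(k, w)  # approximately equal to one_hot(k, len(w))
```

Up to floating-point error, `grad` matches the *k*th row of the identity matrix, in agreement with the analytic result.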

Let $V_t$ be the vector of state-values at time step $t$, let the state $S_t$ be the *k*th state $s_k$ ($0 \leq k \leq |S| - 1$), and let $\textbf{w}_{k|t}$ be the corresponding weight (the state-value in this case) of $s_k$ at time step $t$.

Applying these considerations to the first equation reduces it to a more specific form:

\begin{align*}
\textbf{w}_{t+n} &\doteq \textbf{w}_{t+n-1} + \alpha [G_{t:t+n} - \widehat{v}(S_t, \textbf{w}_{t+n-1})] \nabla \widehat{v}(S_t, \textbf{w}_{t+n-1}) \\
&= \textbf{w}_{t+n-1} + \alpha [G_{t:t+n} - \widehat{v}(s_k, \textbf{w}_{t+n-1})] \times I_k \\
&= V_{t+n-1} + \alpha [G_{t:t+n} - V_{t+n-1}(s_k)] \times I_k
\end{align*}

The above update changes only the value of the *k*th state, because $I_k$ has value 1 at the *k*th element and 0 at all other elements, so it can be written as:

\begin{align*}
\textbf{w}_{t+n} &= V_{t+n-1} + \alpha [G_{t:t+n} - V_{t+n-1}(s_k)] \times I_k \\
&= [V_{t+n-1}(s_0) + 0, ..., V_{t+n-1}(s_k) + \alpha [G_{t:t+n} - V_{t+n-1}(s_k)], ..., V_{t+n-1}(s_{|S| - 1}) + 0] \\
&= [V_{t+n-1}(s_0), ..., V_{t+n-1}(s_k) + \alpha [G_{t:t+n} - V_{t+n-1}(s_k)], ..., V_{t+n-1}(s_{|S| - 1})]
\end{align*}

with the only change being in the *k*th state, which can be generalised to any state (*k* is just to identify the index of the state $S_t$).

So:

$$
V_{t+n}(s) = V_{t+n-1}(s), \quad \text{if } s \neq S_t
$$

and:

$$
V_{t+n}(s) = V_{t+n-1}(s) + \alpha [G_{t:t+n} - V_{t+n-1}(s)], \quad \text{if } s = S_t
$$

which corresponds to the tabular update (the second equation).
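The equivalence derived above can also be verified numerically. The sketch below applies one tabular update and one semi-gradient update with a one-hot feature vector to the same starting values and compares the results. The function names and the values of `alpha`, `G`, and the initial state-values are hypothetical, and `G` stands in for the n-step return $G_{t:t+n}$.

```python
def tabular_update(V, k, G, alpha):
    """Tabular update: only the entry of the visited state s_k changes."""
    V_new = list(V)
    V_new[k] = V[k] + alpha * (G - V[k])
    return V_new


def semi_gradient_update(w, x, G, alpha):
    """Linear semi-gradient update: w + alpha * (G - x.w) * x, since grad v_hat = x."""
    v = sum(x_i * w_i for x_i, w_i in zip(x, w))
    return [w_i + alpha * (G - v) * x_i for w_i, x_i in zip(w, x)]


V = [0.0, 1.0, -0.5, 2.0]  # hypothetical state-values, also used as the weights
k, G, alpha = 2, 1.5, 0.1  # visited state index, target return, step size

# One-hot feature vector of s_k (row k of the identity matrix).
x_k = [1.0 if i == k else 0.0 for i in range(len(V))]

V_tab = tabular_update(V, k, G, alpha)
w_lin = semi_gradient_update(V, x_k, G, alpha)
```

Both updates move only the entry at index `k`, by the same amount, and leave every other entry untouched.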

The feature vector of the *k*th state is therefore the vector whose *k*th element is 1 and whose other elements are all 0, that is, $x_i(s_k) = 1$ if $i = k$ and 0 otherwise, so $x(s_k) = I_k$.