# Chapter 09 - Exercises

### Exercise 9.1

**Q**

Show that tabular methods such as presented in Part I of this book are a special case of linear function approximation. What would the feature vectors be?

**A**

Consider the equation 9.15:

$$
\textbf{w}_{t+n} \doteq \textbf{w}_{t+n-1} + \alpha [G_{t:t+n} - \widehat{v}(S_t, \textbf{w}_{t+n-1})] \nabla \widehat{v}(S_t, \textbf{w}_{t+n-1})
$$

The corresponding equation of the tabular method update is:

$$
V_{t+n}(S_t) \doteq V_{t+n-1}(S_t) + \alpha [G_{t:t+n} - V_{t+n-1}(S_t)]
$$

The approximation function in the first equation also returns the (approximate) value of the given state, just as the tabular method retrieves the state-value directly. The difference is that the tabular update changes only the state-value of $S_t$, while the function approximation update changes the weight vector that is applied to the features of the states.

This means that the tabular method is the special case in which $\textbf{w}$ has exactly $|S|$ elements, each equal to the corresponding state-value, that is, $V(s) = \textbf{w}_s$.

Consequently, the weights vector is $[V(s_0), V(s_1), ..., V(s_{|S| - 1})] = V$, that is, $\textbf{w} = V$.

The objective is to:

1. Consider the vector of weights $\textbf{w}$ as the vector of state-values;
2. Choose a transformation function that receives a state and the weight vector as inputs and gives the state-value of that state as output, so $\widehat{v}(S_t, \textbf{w}) = V(S_t)$;
3. Ensure that the only state-value updated in time step $t$ is $V(S_t)$; this can be guaranteed if $\nabla \widehat{v}(S_t, \textbf{w}_{t+n-1})$ is a vector with value 1 at the element corresponding to the state $S_t$, and 0 for all other elements.

To express the tabular update as a function approximation update, let the transformation function $\widehat{v}$ return the state-value of the given state. For the *k*th state, define the feature vector $x_k$ (with $|S|$ elements) to have 1 at element *k* and 0 elsewhere. Computing $\widehat{v}$ is then the same as multiplying the *k*th row of the $|S| \times |S|$ identity matrix, call it $I_k$, by $\textbf{w}$:

$$
\widehat{v}(s_k, \textbf{w}) = x_k \times \textbf{w} = I_k \times \textbf{w} = w_k
$$

or more generally:

$$
\widehat{v} = I \times \textbf{w} = \textbf{w}
$$

It's important to keep in mind that the weight $w_k$ depends only on $s_k$ (that is, it corresponds to a different and single state, because each weight is the state-value of the corresponding state), which means that $\frac{\partial \textbf{w}_a}{\partial \textbf{w}_b} = 1$ if $a = b$ and $0$ otherwise.

We have:

$$
\nabla \widehat{v}(s_k, \textbf{w}) = \left[ \frac{\partial \widehat{v}(s_k, \textbf{w})}{\partial \textbf{w}_0}, ..., \frac{\partial \widehat{v}(s_k, \textbf{w})}{\partial \textbf{w}_k}, ..., \frac{\partial \widehat{v}(s_k, \textbf{w})}{\partial \textbf{w}_{|S| - 1}} \right] = \left[ \frac{\partial \textbf{w}_k}{\partial \textbf{w}_0}, ..., \frac{\partial \textbf{w}_k}{\partial \textbf{w}_k}, ..., \frac{\partial \textbf{w}_k}{\partial \textbf{w}_{|S| - 1}} \right] = \left[ 0, ..., 1, ..., 0 \right] = I_k
$$

Let $V_t$ be the vector of the state-values at the time-step $t$, the state $S_t$ be the *k*th state $s_k$ ($0 \leq k \leq |S| - 1$) and $\textbf{w}_{k|t}$ be the corresponding weight (state-value in this case) of $s_k$ at the time-step $t$.

Applying the above considerations to the first equation, we get its more specific form as the second equation:

\begin{align*}
\textbf{w}_{t+n} &\doteq \textbf{w}_{t+n-1} + \alpha [G_{t:t+n} - \widehat{v}(S_t, \textbf{w}_{t+n-1})] \nabla \widehat{v}(S_t, \textbf{w}_{t+n-1}) \\
&= \textbf{w}_{t+n-1} + \alpha [G_{t:t+n} - \widehat{v}(s_k, \textbf{w}_{t+n-1})] \times I_k \\
&= V_{t+n-1} + \alpha [G_{t:t+n} - V_{t+n-1}(s_k)] \times I_k
\end{align*}

The above update changes only the value of the *k*th state, because $I_k$ has value 1 at the *k*th element and 0 at all other elements, so it can be represented as:

\begin{align*}
\textbf{w}_{t+n} &= V_{t+n-1} + \alpha [G_{t:t+n} - V_{t+n-1}(s_k)] \times I_k \\
&= [V_{t+n-1}(s_0) + 0, ..., V_{t+n-1}(s_k) + \alpha [G_{t:t+n} - V_{t+n-1}(s_k)], ..., V_{t+n-1}(s_{|S| - 1}) + 0] \\
&= [V_{t+n-1}(s_0), ..., V_{t+n-1}(s_k) + \alpha [G_{t:t+n} - V_{t+n-1}(s_k)], ..., V_{t+n-1}(s_{|S| - 1})]
\end{align*}

with the only change being in the *k*th state, which can be generalised to any state (*k* is just to identify the index of the state $S_t$).

So:

$$
V_{t+n}(s) = V_{t+n-1}(s), \quad \text{if } s \neq S_t
$$

and:

$$
V_{t+n}(s) = V_{t+n-1}(s) + \alpha [G_{t:t+n} - V_{t+n-1}(s)], \quad \text{if } s = S_t
$$

which corresponds to the tabular update (the second equation).

The feature vector of the *k*th state is the vector with the *k*th element 1 and all other elements 0, that is, $x_i(s_j) = 1$ if $i = j$ and 0 otherwise ($x_i = I_i$ or, more generally, $x = I$).
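The equivalence can also be checked numerically. The following is a minimal sketch, not part of the book: the function names, the four state values, the target (standing in for $G_{t:t+n}$) and the step size are all hypothetical. It applies the linear semi-gradient update with a one-hot feature vector and the tabular update to the same starting values.

```python
def one_hot(k, num_states):
    """Feature vector x(s_k): 1 at index k, 0 elsewhere (row k of the identity)."""
    return [1.0 if i == k else 0.0 for i in range(num_states)]


def v_hat(x, w):
    """Linear value estimate: inner product of features and weights."""
    return sum(xi * wi for xi, wi in zip(x, w))


def linear_update(w, x, target, alpha):
    """Semi-gradient update; for a linear v_hat the gradient is x itself."""
    error = target - v_hat(x, w)
    return [wi + alpha * error * xi for wi, xi in zip(w, x)]


def tabular_update(V, k, target, alpha):
    """Tabular update: only the entry of the visited state changes."""
    V = list(V)
    V[k] += alpha * (target - V[k])
    return V


values = [0.0, 0.5, -1.0, 2.0]   # hypothetical state values, also used as weights
k, target, alpha = 2, 1.0, 0.1   # hypothetical visited state, return G, step size

w_new = linear_update(values, one_hot(k, len(values)), target, alpha)
V_new = tabular_update(values, k, target, alpha)
print(w_new)
print(V_new)
```

Both printed lists should agree: only the entry at index `k` moves toward the target, and every other entry is left unchanged.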

### Exercise 9.2

**Q**

Why does (9.17) define $(n + 1)^k$ distinct features for dimension k?

**A**

Each feature can be defined in the form:

$$
s_1^{i_1} \cdot s_2^{i_2} \cdot ... \cdot s_{k-1}^{i_{k-1}} \cdot s_k^{i_k}
$$

where $i_j$ is an integer in the set $\{0, 1, ..., n\}$.

Each feature is determined by its choice of exponents, one per state dimension. There are $n+1$ possible exponents (the values from 0 to $n$) for each of the $k$ state dimensions, and the choices are independent, so the number of distinct exponent combinations, and therefore of distinct features, is $n+1$ to the power of $k$.

Considering an ordered set of features, it can be represented as:

\begin{align*}
&s_1^0 s_2^0 ... s_{k-1}^0 s_k^0 = 1 \\
&s_1^0 s_2^0 ... s_{k-1}^0 s_k^1 = s_k \\
&... \\
&s_1^0 s_2^0 ... s_{k-1}^0 s_k^n = s_k^n \\
&s_1^0 s_2^0 ... s_{k-1}^1 s_k^0 = s_{k-1} \\
&s_1^0 s_2^0 ... s_{k-1}^1 s_k^1 = s_{k-1} s_k \\
&... \\
&s_1^0 s_2^1 ... s_{k-1}^0 s_k^0 = s_2 \\
&s_1^0 s_2^1 ... s_{k-1}^0 s_k^1 = s_2  s_k \\
&... \\
&s_1^1 s_2^0 ... s_{k-1}^0 s_k^0 = s_1 \\
&s_1^1 s_2^0 ... s_{k-1}^0 s_k^1 = s_1  s_k \\
&... \\
&s_1^1 s_2^0 ... s_{k-1}^0 s_k^n = s_1 s_k^n \\
&s_1^2 s_2^0 ... s_{k-1}^0 s_k^0 = s_1^2 \\
&... \\
&s_1^n s_2^{n-1} ... s_{k-1}^n s_k^n = s_1^n s_2^{n-1} \prod_{i=3}^k s_i^n \\
&s_1^n s_2^n ... s_{k-1}^n s_k^n = \prod_{i=1}^k s_i^n \\
\end{align*}

The number of different features is:

$$
\prod_{j=1}^k \sum_{i=0}^n 1 = \prod_{j=1}^k (n+1) = (n+1)^k
$$
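The count can be confirmed by enumerating the exponent tuples directly. This is a small sketch under assumed values ($k = 3$, $n = 2$); the function name `polynomial_exponents` is hypothetical. `itertools.product` varies the last exponent fastest, which matches the ordering listed above.

```python
from itertools import product


def polynomial_exponents(k, n):
    """All exponent tuples (i_1, ..., i_k) with each i_j in {0, ..., n}."""
    return list(product(range(n + 1), repeat=k))


k, n = 3, 2  # hypothetical: 3 state dimensions, order-2 polynomial basis
exponents = polynomial_exponents(k, n)
print(len(exponents), (n + 1) ** k)  # both are 27
print(exponents[:4])                 # [(0, 0, 0), (0, 0, 1), (0, 0, 2), (0, 1, 0)]
```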

### Exercise 9.3 

**Q**

What $n$ and $c_{i,j}$ produce the feature vectors $x(s) = (1, s_1, s_2, s_1 s_2, s_1^2, s_2^2, s_1 s_2^2, s_1^2 s_2, s_1^2 s_2^2)^{\textsf{T}}$?

**A**

The value of $n$ is $2$ (the highest power).

Considering $x_1(s)$ as the first element (in this case, $s_1^0 s_2^0 = 1$), and so on, we have:

\begin{align*}
x_1(s) &= s_1^0 s_2^0 = 1 \\
x_2(s) &= s_1^1 s_2^0 = s_1 \\
x_3(s) &= s_1^0 s_2^1 = s_2 \\
x_4(s) &= s_1^1 s_2^1 = s_1 s_2 \\
x_5(s) &= s_1^2 s_2^0 = s_1^2 \\
x_6(s) &= s_1^0 s_2^2 = s_2^2 \\
x_7(s) &= s_1^1 s_2^2 = s_1 s_2^2 \\
x_8(s) &= s_1^2 s_2^1 = s_1^2 s_2 \\
x_9(s) &= s_1^2 s_2^2 = s_1^2 s_2^2 \\
\end{align*}

For $j$ in $\{ 1, 2 \}$, the values of $c_{i,j}$ used in $x_i(s) = \prod_{j=1}^2 s_j^{c_{i,j}} = s_1^{c_{i,1}} \cdot s_2^{c_{i,2}}$ are:

\begin{align*}
c_{1, 1} = 0 \quad &| \quad c_{1, 2} = 0 \\
c_{2, 1} = 1 \quad &| \quad c_{2, 2} = 0 \\
c_{3, 1} = 0 \quad &| \quad c_{3, 2} = 1 \\
c_{4, 1} = 1 \quad &| \quad c_{4, 2} = 1 \\
c_{5, 1} = 2 \quad &| \quad c_{5, 2} = 0 \\
c_{6, 1} = 0 \quad &| \quad c_{6, 2} = 2 \\
c_{7, 1} = 1 \quad &| \quad c_{7, 2} = 2 \\
c_{8, 1} = 2 \quad &| \quad c_{8, 2} = 1 \\
c_{9, 1} = 2 \quad &| \quad c_{9, 2} = 2 \\
\end{align*}

or, represented as the matrix $C$:

$$
C = \begin{bmatrix}
c_{1, 1} & c_{1, 2} \\
c_{2, 1} & c_{2, 2} \\
c_{3, 1} & c_{3, 2} \\
c_{4, 1} & c_{4, 2} \\
c_{5, 1} & c_{5, 2} \\
c_{6, 1} & c_{6, 2} \\
c_{7, 1} & c_{7, 2} \\
c_{8, 1} & c_{8, 2} \\
c_{9, 1} & c_{9, 2}
\end{bmatrix} = \begin{bmatrix}
0 & 0 \\
1 & 0 \\
0 & 1 \\
1 & 1 \\
2 & 0 \\
0 & 2 \\
1 & 2 \\
2 & 1 \\
2 & 2
\end{bmatrix}
$$
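As a quick check, the matrix $C$ can be used to build the feature vector for a concrete state. In this sketch the state $(s_1, s_2) = (2, 3)$ and the function name `polynomial_features` are assumptions chosen only for illustration.

```python
# Exponent matrix C from the answer above: row i holds (c_{i,1}, c_{i,2}).
C = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 0), (0, 2), (1, 2), (2, 1), (2, 2)]


def polynomial_features(s, C):
    """x_i(s) = prod_j s_j ** c_{i,j} for each row of C."""
    features = []
    for row in C:
        value = 1
        for s_j, c_ij in zip(s, row):
            value *= s_j ** c_ij
        features.append(value)
    return features


s = (2, 3)  # hypothetical state (s_1, s_2)
x = polynomial_features(s, C)
print(x)  # [1, 2, 3, 6, 4, 9, 18, 12, 36]
```

The output follows the order $1, s_1, s_2, s_1 s_2, s_1^2, s_2^2, s_1 s_2^2, s_1^2 s_2, s_1^2 s_2^2$ given in the question.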


### Exercise 9.4

**Q**

Suppose we believe that one of two state dimensions is more likely to have an effect on the value function than is the other, that generalization should be primarily across this dimension rather than along it. What kind of tilings could be used to take advantage of this prior knowledge?

**A**

Tilings that are orthogonal to the given dimension, which produce several different tiles across the dimension of interest. For example, if the two state dimensions are horizontal and vertical, and the horizontal dimension is more likely to have an effect on the value function, then vertical stripes or vertical log stripes (like in Figure 9.12 (middle)) are good choices. Rectangular tiles with the shorter side along this state dimension are also a good choice (there will be more tiles across this state dimension).
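A vertical stripe tiling can be sketched in a few lines. This is only an illustration under assumptions: the function name `stripe_tile`, the unit-square state space and the stripe width are all hypothetical. Because the tile index ignores the vertical coordinate, states that differ only vertically share a tile, while states far apart horizontally fall in different tiles.

```python
import math


def stripe_tile(state, width):
    """Index of the vertical stripe containing the state.

    Only the first (horizontal) coordinate is used, so every state with a
    similar horizontal position shares a tile regardless of its vertical one.
    """
    horizontal, _vertical = state
    return math.floor(horizontal / width)


width = 0.25  # hypothetical stripe width on a unit square
print(stripe_tile((0.10, 0.05), width), stripe_tile((0.10, 0.95), width))  # same stripe
print(stripe_tile((0.10, 0.50), width), stripe_tile((0.60, 0.50), width))  # different stripes
```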

### Exercise 9.5

**Q**

Suppose you are using tile coding to transform a seven-dimensional continuous state space into binary feature vectors to estimate a state value function $\widehat{v}(s, \textbf{w}) \approx v_{\pi}(s)$. You believe that the dimensions do not interact strongly, so you decide to use eight tilings of each dimension separately (stripe tilings), for $7 \times 8 = 56$ tilings. In addition, in case there are some pairwise interactions between the dimensions, you also take all $\binom{7}{2} = 21$ pairs of dimensions and tile each pair conjunctively with rectangular tiles. You make two tilings for each pair of dimensions, making a grand total of $21 \times 2 + 56 = 98$ tilings. Given these feature vectors, you suspect that you still have to average out some noise, so you decide that you want learning to be gradual, taking about 10 presentations with the same feature vector before learning nears its asymptote. What step-size parameter $\alpha$ should you use? Why?

**A**

The feature vectors are binary, with exactly one active tile (value 1) per tiling: the tile in which the state falls. The expression $\textbf{x}^T \textbf{x}$ is the inner product of the feature vector with itself, that is, the sum of its squared elements. Whatever the total number of tiles, exactly 98 features have value 1 (one per tiling) and the rest are 0, so:

$$
\textbf{x}^T \textbf{x} = \sum_{i=1}^{\text{tiles}} x_i^2 = \sum_{i=1}^{98} 1^2 + \sum_{i=1}^{\text{tiles} - 98} 0^2 = 98
$$

For the intended gradual learning, about 10 presentations with the same feature vector should be needed before learning nears its asymptote, that is, $\tau = 10$. Using the rule of thumb to calculate $\alpha$:

$$
\alpha \doteq (\tau \mathbb{E}[\textbf{x}^T \textbf{x}])^{-1} = (10 \cdot \mathbb{E}[98])^{-1} = 980^{-1} = \frac{1}{980}
$$
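The effect of this step size can be illustrated with a short simulation. This is a sketch under assumptions: the fixed target of 1.0 and the zero initial weights are hypothetical, and only the 98 active weights are tracked since the inactive ones never change. With $\alpha \, \textbf{x}^T \textbf{x} = 1/\tau$, each presentation of the same feature vector removes one tenth of the remaining error.

```python
num_tilings = 98   # one active binary feature per tiling (from the exercise)
tau = 10           # desired number of presentations

x_norm_sq = num_tilings * 1 ** 2   # x^T x: 98 ones, all other entries zero
alpha = 1 / (tau * x_norm_sq)

# Repeatedly present the same feature vector with a fixed target.
target = 1.0                       # hypothetical target value
weights = [0.0] * num_tilings      # weights of the 98 active tiles only
errors = []
for _ in range(tau):
    estimate = sum(weights)        # v_hat = w^T x with x_i = 1 on active tiles
    error = target - estimate
    errors.append(error)
    weights = [w + alpha * error for w in weights]

print(alpha)                       # 1/980
print(target - sum(weights))       # remaining error, about 0.9 ** 10
```

After 10 presentations the remaining error is roughly $0.9^{10} \approx 0.35$ of the initial error, which is the kind of gradual approach the rule of thumb is meant to produce.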