# Chapter 3

> _Exercise 3.1_ Devise three example tasks of your own that fit into the MDP framework, identifying for each its states, actions, and rewards. Make the three examples as different from each other as possible. The framework is abstract and flexible and can be applied in many different ways. Stretch its limits in some way in at least one of your examples.

1. Consider a possibly unknown mass $m$ hanging from a solid beam by a spring of stiffness $k$, together with an actuator that can accelerate the beam. The state can be represented as the tuple $(x_m, \dot x_m , x_b, \dot x_b)$ of positions and velocities of the mass and the beam. The action is any continuous value $a(t) = \ddot x_b$. One possible reward is the negative deviation between the position of the mass and some desired reference signal, $r(t) = -| x_m(t) - x_r(t) |$, so that tracking the reference more closely yields higher reward.

2. Consider a robot playing a game of Jenga. The state is the physical configuration of the current stack of Jenga blocks, and the action is the choice of block to move together with the precise motor voltages required to move it. The reward is the total number of blocks moved without the tower collapsing.

3. Consider an automated penetration tester. We can consider the state space to be the current machines that the tester has control of, as well as the current level of access to each machine. The action space can be represented as the possible network packets that the tester can send to other machines or the same machines, as well as the commands that can be run on the machines.

> _Exercise 3.2_ Is the MDP framework adequate to usefully represent all goal-directed learning tasks? Can you think of any clear exceptions?

The MDP framework does not cover non-Markov problems, where the state is no longer a sufficient statistic for the distribution of future states. In other words, the MDP framework does not allow the observed state to be incomplete. As a concrete example, consider the driving task from _Exercise 3.3_. While driving, you may have vision sensors that face only one direction, or sensors with blind spots. In that case the agent only receives data that is a function of the true state and does not fully represent it (e.g., there could be vehicles in the sensors' blind spots).

> _Exercise 3.3_ Consider the problem of driving. You could define the actions in terms of the accelerator, steering wheel, and brake, that is, where your body meets the machine. Or you could define them farther out—say, where the rubber meets the road, considering your actions to be tire torques. Or you could define them farther in—say, where your brain meets your body, the actions being muscle twitches to control your limbs. Or you could go to a really high level and say that your actions are your choices of where to drive. What is the right level, the right place to draw the line between agent and environment? On what basis is one location of the line to be preferred over another? Is there any fundamental reason for preferring one location over another, or is it a free choice?

There is no single correct line to draw between the agent and the environment. For any given task, there is likely to be a spectrum of ways of defining the agent-environment boundary. In practice, the boundary is likely to be fixed by which sensors and actuators are actually available for the task, how accurate those sensors and actuators are, and how difficult each of the resulting reinforcement learning problems is.

> _Exercise 3.4_ Give a table analogous to that in Example 3.3, but for $p(s', r|s, a)$. It should have columns for $s, a, s', r$, and $p(s', r|s, a)$, and a row for every 4-tuple for which $p(s', r|s, a) > 0$.

![Screen%20Shot%202022-01-04%20at%201.24.05%20PM.png](attachment:Screen%20Shot%202022-01-04%20at%201.24.05%20PM.png)

> _Exercise 3.5_ The equations in Section 3.1 are for the continuing case and need to be modified (very slightly) to apply to episodic tasks. Show that you know the modifications needed by giving the modified version of (3.3).

$$
\sum_{s'\in \mathcal S^+} \sum_{r \in \mathcal R} { p(s', r | s, a) } = 1 \mathrm{\; for\; all\; } s \in \mathcal S, a \in \mathcal A(s)
$$

> _Exercise 3.6_ Suppose you treated pole-balancing as an episodic task but also used discounting, with all rewards zero except for $-1$ upon failure. What then would the return be at each time? How does this return differ from that in the discounted, continuing formulation of this task?

The return would be:

$$
G_t = \sum_{t'=t+1}^T{\gamma^{t'-t-1}R_{t'}} = \begin{cases} 0 & \text{for } K_t > T \\ -\gamma^{K_t-t-1} & \text{otherwise} \end{cases}
$$

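where $K_t$ is the time of the next failure. In the discounted, continuing formulation the return would additionally include a discounted $-1$ for every later failure, rather than only for the next one.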

> _Exercise 3.7_ Imagine that you are designing a robot to run a maze. You decide to give it a reward of $+1$ for escaping from the maze and a reward of zero at all other times. The task seems to break down naturally into episodes (the successive runs through the maze), so you decide to treat it as an episodic task, where the goal is to maximize expected total reward (3.7). After running the learning agent for a while, you find that it is showing no improvement in escaping from the maze. What is going wrong? Have you effectively communicated to the agent what you want it to achieve?

The problem is that the robot is not incentivized to escape from the maze quickly. The sum of rewards for any successful exit from the maze will always be $1$, since there is no discounting in (3.7). Therefore, any policy that eventually escapes, even a random one, already achieves the maximum return of $1$.

> _Exercise 3.8_ Suppose $\gamma = 0.5$ and the following sequence of rewards is received: $R_1 = -1, R_2 = 2, R_3 = 6, R_4 = 3$, and $R_5 = 2$, with $T = 5$. What are $G_0, G_1, \ldots, G_5$? Hint: Work backwards.

$t$ | $R_t$ | $G_t$
--- | --- | ---
5 | 2 | 0
4 | 3 | 2
3 | 6 | $3 + \gamma G_4 = 4$ 
2 | 2 | $6 + \gamma G_3 = 8$
1 | -1 | $2 + \gamma G_2 = 6$
0 | - | $-1 + \gamma G_1 = 2$
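
The backward recursion used in the table, $G_t = R_{t+1} + \gamma G_{t+1}$ with $G_T = 0$, can be checked with a few lines of Python. This is only a minimal sketch; the function name `returns_backward` is a hypothetical helper, not something from the book.

```python
def returns_backward(rewards, gamma):
    """Return [G_0, ..., G_T] given rewards [R_1, ..., R_T].

    Uses the recursion G_t = R_{t+1} + gamma * G_{t+1}, starting from G_T = 0.
    """
    T = len(rewards)
    G = [0.0] * (T + 1)  # G[T] = 0 by definition
    for t in range(T - 1, -1, -1):
        # rewards[t] holds R_{t+1} because the list is zero-indexed
        G[t] = rewards[t] + gamma * G[t + 1]
    return G


print(returns_backward([-1, 2, 6, 3, 2], 0.5))
# [2.0, 6.0, 8.0, 4.0, 2.0, 0.0]
```

Reading the output left to right gives $G_0, \ldots, G_5$, matching the table read bottom to top.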

> _Exercise 3.9_ Suppose  $\gamma = 0.9$ and the reward sequence is $R_1 = 2$ followed by an infinite sequence of 7s. What are $G_1$ and $G_0$?

$$
G_1 = \sum_{i=0}^\infty{7\gamma^i} = 7\frac{1}{1-\gamma} = 70 \\
G_0 = R_1 + \gamma G_1 = 2 + 0.9\cdot70 = 65
$$
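
As a rough numerical sanity check, the infinite tail of 7s can be truncated after a large number of terms. The sketch below assumes a cutoff of 1000 terms (an arbitrary choice, at which $\gamma^{1000}$ is negligible); `constant_tail_return` is a hypothetical helper name.

```python
def constant_tail_return(reward, gamma, horizon=1000):
    """Approximate sum_{i>=0} gamma^i * reward by truncating at `horizon` terms."""
    return sum(reward * gamma**i for i in range(horizon))


gamma = 0.9
G1 = constant_tail_return(7, gamma)  # approximately 70
G0 = 2 + gamma * G1                  # approximately 65
print(round(G1, 6), round(G0, 6))
```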

> _Exercise 3.10_ Prove the second equality in (3.10).

$$
S_N = \sum_{k=0}^N{\gamma^k} \\
\Rightarrow \gamma S_N = \sum_{k=1}^{N+1}{\gamma^k} \\
\Rightarrow S_N - \gamma S_N = 1 - \gamma^{N+1} \\
\Rightarrow S_N = \frac{1 - \gamma^{N+1}}{1 - \gamma} \\
\Rightarrow \lim_{N\to\infty}S_N = \frac{1}{1-\gamma} \quad \text{since } \gamma^{N+1} \to 0 \text{ for } 0 \le \gamma < 1
$$

> _Exercise 3.11_ If the current state is $S_t$, and actions are selected according to a stochastic policy $\pi$, then what is the expectation of $R_{t+1}$ in terms of $\pi$ and the four-argument function $p$ (3.2)?

$$
\mathbb{E}_\pi[R_{t+1} | S_t = s] = \sum_{a\in\mathcal A}\pi(a|s)\sum_{r\in\mathcal R}r\sum_{s'\in\mathcal S}p(s', r | s, a) = \sum_{a\in \mathcal A}\pi(a|s)\sum_{s', r}r\,p(s', r | s, a)
$$

> _Exercise 3.12_ Give an equation for $v_\pi$ in terms of $q_\pi$ and $\pi$.

$$
v_\pi(s) = \mathbb{E}_{a\sim\pi(a|s)}[q_\pi(s,a)] = \sum_{a\in\mathcal A}{q_\pi(s, a)\pi(a|s)}
$$

$$
\sum_{a\in\mathcal A}{\pi(a|s)} = 1 \mathrm{\; for\; all\; } s \in \mathcal S
$$

> _Exercise 3.13_ Give an equation for $q_\pi$ in terms of $v_\pi$ and the four-argument $p$.

$$
q_\pi(s, a) = \mathbb{E}_{S',R\sim p(s', r|s, a)}[R + \gamma v_\pi(S')] = \sum_{s', r}(r + \gamma v_\pi(s'))p(s', r | s, a)
$$

> _Exercise 3.14_ The Bellman equation (3.14) must hold for each state for the value function $v_\pi$ shown in Figure 3.2 (right) of Example 3.5. Show numerically that this equation holds for the center state, valued at +0.7, with respect to its four neighboring states, valued at +2.3, +0.4, -0.4, and +0.7. (These numbers are accurate only to one decimal place.)

$$
v_\pi(s) = \sum_a{\pi(a|s) \sum_{s',r}{p(s',r|s,a)\big(r + \gamma v_\pi(s')\big)}} \\
= \sum_a{\frac{1}{4}\sum_{s',r}{p(s',r|s,a)\big(r + \gamma v_\pi(s')\big)}} \\
v_\pi(s) = \frac{1}{4}\gamma\big(v_\pi(s_{\verb|up|}) + v_\pi(s_{\verb|right|}) + v_\pi(s_{\verb|down|}) + v_\pi(s_{\verb|left|})\big) \\
= \frac{1}{4}\gamma\big(2.3 + 0.4 + (-0.4) + 0.7\big) \\
= \frac{1}{4} \cdot 0.9 \cdot 3.0 \approx 0.7
$$
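
The same arithmetic can be reproduced in code. The sketch below assumes, as in the gridworld of Example 3.5, that the four moves from the center state are equally likely, deterministic, and give zero reward; `bellman_backup_equiprobable` is a hypothetical helper name.

```python
def bellman_backup_equiprobable(neighbor_values, gamma, reward=0.0):
    """One Bellman expectation backup for a state whose moves are
    equally likely and each lead deterministically to one neighbor."""
    prob = 1.0 / len(neighbor_values)
    return sum(prob * (reward + gamma * v) for v in neighbor_values)


# Neighbor values in the order up, right, down, left
v_center = bellman_backup_equiprobable([2.3, 0.4, -0.4, 0.7], gamma=0.9)
print(f"{v_center:.3f}")  # 0.675, which rounds to the +0.7 shown in the figure
```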

> _Exercise 3.15_ In the gridworld example, rewards are positive for goals, negative for running into the edge of the world, and zero the rest of the time. Are the signs of these rewards important, or only the intervals between them? Prove, using (3.8), that adding a constant $c$ to all the rewards adds a constant, $v_c$, to the values of all states, and thus does not affect the relative values of any states under any policies. What is $v_c$ in terms of $c$ and $\gamma$?

$$
\tilde{G}_t \doteq \sum_{k=0}^\infty{\gamma^k \tilde{R}_{t+k+1}} \\
= \sum_{k=0}^\infty{\gamma^k (c + R_{t+k+1})} \\
= \sum_{k=0}^\infty{c\gamma^k} + \sum_{k=0}^\infty{\gamma^k R_{t+k+1}} \\
= \frac{c}{1-\gamma} + G_t \Rightarrow v_c = \frac{c}{1-\gamma}
$$

> _Exercise 3.16_ Now consider adding a constant c to all the rewards in an episodic task, such as maze running. Would this have any effect, or would it leave the task unchanged as in the continuing task above? Why or why not? Give an example.

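This would have an effect, since the total added reward depends on the length of the episode. For an episode that terminates at time $T$ (and $\gamma < 1$):

$$
\tilde{G}_t = \sum_{k=0}^{T-t-1}{c\gamma^k} + \sum_{k=0}^{T-t-1}{\gamma^k R_{t+k+1}} \\
= c\,\frac{1 - \gamma^{T-t}}{1-\gamma} + G_t
$$

When $c > 0$, the added term grows with the remaining episode length $T - t$, so longer trajectories acquire more reward and therefore have higher value. In maze running, for example, a sufficiently large positive $c$ would make wandering in the maze more rewarding than escaping quickly.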
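
A small numerical illustration may help here. The sketch below assumes a maze-like episodic task with reward $+1$ on the final (escaping) step and zero before that, and compares a short and a long episode with and without an added constant $c$. The episode lengths, $\gamma = 0.9$, and $c = 1$ are arbitrary illustrative choices, and `episode_return` is a hypothetical helper.

```python
def episode_return(length, gamma, c=0.0):
    """Discounted return of an episode with `length` steps:
    reward c on every step, plus 1 on the final (escaping) step."""
    rewards = [c] * (length - 1) + [1.0 + c]
    return sum(gamma**k * r for k, r in enumerate(rewards))


gamma = 0.9
for c in (0.0, 1.0):
    short, long_ = episode_return(2, gamma, c), episode_return(10, gamma, c)
    better = "short" if short > long_ else "long"
    print(f"c={c}: short={short:.3f}, long={long_:.3f}, preferred={better}")
```

With $c = 0$ the short episode has the higher return, while with $c = 1$ the long episode does, so the shift changes which behavior is preferred.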

> _Exercise 3.17_ What is the Bellman equation for action values, that is, for $q_\pi$? It must give the action value $q_\pi(s, a)$ in terms of the action values, $q_\pi(s', a')$, of possible successors to the state–action pair $(s, a)$. Hint: The backup diagram to the right corresponds to this equation. Show the sequence of equations analogous to (3.14), but for action values.

$$
q_\pi(s, a) = \mathbb{E}_\pi[G_t | S_t = s, A_t = a] \\
= \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} | S_t = s, A_t = a] \\
= \sum_{s', r}{p(s', r | s, a)\big(r + \gamma v_\pi(s')\big)} \\
= \sum_{s', r}{p(s', r | s, a)\big(r + \gamma \sum_{a'\in\mathcal A}{q_\pi(s', a')\pi(a'|s')} \big)}
$$

## Bellman Equations
\begin{align}
v_\mu(s) &= \sum_{s',r}{p(s', r | s, \mu(s))\big(r + \gamma v_\mu(s')\big)}\\\\
q_\mu(s, a) &= \sum_{s', r}{p(s', r | s, a)\big(r + \gamma q_\mu(s', \mu(s')) \big)}
\end{align}

\begin{align}
v_\pi(s) &= \sum_{s',r,a}{p(s', r | s, a)\pi(a|s)\big(r + \gamma v_\pi(s')\big)}\\\\
q_\pi(s, a) &= \sum_{s', r}{p(s', r | s, a)\big(r + \gamma \sum_{a'\in\mathcal A}{q_\pi(s', a')\pi(a'|s')} \big)}
\end{align}

## Bellman _Optimality_ Equations
$$
v_*(s) = \max_a{\sum_{s',r}{p(s', r | s, a)\big(r + \gamma v_*(s')\big)}}\\\\
q_*(s, a) = \sum_{s',r}{p(s', r | s, a)\big(r + \gamma \max_{a'}{q_*(s', a')}\big)}
$$

> _Exercise 3.18_ The value of a state depends on the values of the actions possible in that state and on how likely each action is to be taken under the current policy. We can think of this in terms of a small backup diagram rooted at the state and considering each possible action:

> Give the equation corresponding to this intuition and diagram for the value at the root node, $v_\pi(s)$, in terms of the value at the expected leaf node, $q_\pi(s, a)$, given $S_t = s$. This equation should include an expectation conditioned on following the policy, $\pi$. Then give a second equation in which the expected value is written out explicitly in terms of $\pi(a|s)$ such that no expected value notation appears in the equation.

$$
v_\pi(s) = \mathbb{E}_\pi[q_\pi(S_t, A_t) | S_t = s]\\
= \sum_{a\in\mathcal A}{\pi(a|s)\,q_\pi(s, a)}
$$

> _Exercise 3.19_ The value of an action, $q_\pi(s, a)$, depends on the expected next reward and the expected sum of the remaining rewards. Again we can think of this in terms of a small backup diagram, this one rooted at an action (state–action pair) and branching to the possible next states:

> Give the equation corresponding to this intuition and diagram for the action value, $q_\pi(s, a)$, in terms of the expected next reward, $R_{t+1}$, and the expected next state value, $v_\pi(S_{t+1})$, given that $S_t = s$ and $A_t = a$. This equation should include an expectation but not one conditioned on following the policy. Then give a second equation, writing out the expected value explicitly in terms of $p(s', r|s, a)$ defined by (3.2), such that no expected value notation appears in the equation.

$$
q_\pi(s, a) = \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1})|S_t = s, A_t = a]\\
= \sum_{s', r}p(s',r|s,a)\big({r + \gamma v_\pi(s')}\big)
$$

> _Exercise 3.20_ Draw or describe the optimal state-value function for the golf example.

The optimal state-value function for the golf example is a map of every location such that:

1. Everywhere on the green is $-1$ since we can take a single putt.
2. Everywhere that is one driver distance from the green is $-2$, everywhere that is two driver distances from the green is $-3$, and so on. Generalizing, any location that is $N$ driver distances from the green has optimal value $-N-1$.

> _Exercise 3.21_ Draw or describe the contours of the optimal action-value function for putting, $q_*(s, \verb|putter|)$, for the golf example.

The optimal action-value function for putting, $q_*(s, \verb|putter|)$, is:

1. $-1$ everywhere on the green.
2. $-2$ everywhere that is one putt away from the green.
3. $-N-1$ anywhere that is one putt away from a region of optimal value $-N$.

> _Exercise 3.22_ Consider the continuing MDP shown to the right. The only decision to be made is that in the top state, where two actions are available, left and right. The numbers show the rewards that are received deterministically after each action. There are exactly two deterministic policies, $\pi_{\verb|left|}$ and $\pi_{\verb|right|}$. What policy is optimal if $\gamma = 0$? If $\gamma = 0.9$? If $\gamma = 0.5$?

If we choose the $\pi_{\verb|left|}$ policy, then we will have the following dynamics:

![Screen%20Shot%202022-02-08%20at%204.44.21%20PM.png](attachment:Screen%20Shot%202022-02-08%20at%204.44.21%20PM.png)

| $t$ | $S_t$ | $A_t$ | $R_{t+1}$ |
|-----|-------|-------|------|
|  1  | $s_1$ | $\verb|left|$ | 1 |
|  2  | $s_\verb|left|$ | x | 0 |
|  3  | $s_1$ | $\verb|left|$ | 1 |
|  4  | $s_\verb|left|$ | x | 0 |
|  5  | $s_1$ | $\verb|left|$ | 1 |
| $\dots$ | | | |
| $2k$ | $s_\verb|left|$ | x | 0 |
| $2k+1$ | $s_1$ | $\verb|left|$ | 1 |
A reward of $1$ is received on every second step, starting with the first action taken from $s_1$. The discounted return from $s_1$ is therefore

$$
E_{\pi_{\verb|left|}}[G_t] = 1 + \gamma \cdot 0 + \gamma^2 \cdot 1 + \gamma^3 \cdot 0 + \gamma^4 \cdot 1 + \ldots \\
= \sum_{i=0}^\infty{(\gamma^2)^i} = \frac{1}{1 - \gamma^2}
$$

Similarly, for the policy $\pi_{\verb|right|}$:

| $t$ | $S_t$ | $A_t$ | $R_{t+1}$ |
|-----|-------|-------|------|
|  1  | $s_1$ | $\verb|right|$ | 0 |
|  2  | $s_\verb|right|$ | x | 2 |
|  3  | $s_1$ | $\verb|right|$ | 0 |
|  4  | $s_\verb|right|$ | x | 2 |
|  5  | $s_1$ | $\verb|right|$ | 0 |
| $\dots$ | | | |
| $2k$ | $s_\verb|right|$ | x | 2 |
| $2k+1$ | $s_1$ | $\verb|right|$ | 0 |
$$
E_{\pi_{\verb|right|}}[G_t] = 0 + \gamma \cdot 2 + \gamma^2 \cdot 0 + \gamma^3 \cdot 2 + \gamma^4 \cdot 0 + \ldots \\
= \sum_{i=0}^\infty{2\gamma^{2i+1}} = \frac{2\gamma}{1 - \gamma^2}
$$

|$\gamma$ | $E_{\pi_{\verb|left|}}[G_t]$ | $E_{\pi_{\verb|right|}}[G_t]$| $\pi_*$
|--- |--- |--- |---
|$\gamma=0$ | 1 | 0 | $\pi_{\verb|left|}$
|$\gamma=0.9$ | 5.263 | 9.474 | $\pi_{\verb|right|}$
|$\gamma=0.5$ | 1.333 | 1.333 | Tie
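
The entries in this table can be cross-checked by summing the alternating reward sequences directly. The sketch below truncates the infinite sum after 2000 steps (an arbitrary cutoff) and assumes the reward pattern described in the exercise: $1, 0, 1, 0, \ldots$ for $\pi_{\verb|left|}$ and $0, 2, 0, 2, \ldots$ for $\pi_{\verb|right|}$. The helper `alternating_return` is hypothetical.

```python
def alternating_return(first_reward, second_reward, gamma, steps=2000):
    """Truncated discounted return from the top state when rewards
    alternate first_reward, second_reward, first_reward, ..."""
    return sum(
        gamma**k * (first_reward if k % 2 == 0 else second_reward)
        for k in range(steps)
    )


for gamma in (0.0, 0.9, 0.5):
    left = alternating_return(1, 0, gamma)
    right = alternating_return(0, 2, gamma)
    print(f"gamma={gamma}: left={left:.3f}, right={right:.3f}")
```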

> _Exercise 3.23_ Give the Bellman equation for $q_*$ for the recycling robot.

$$
q_*(s, a) = \sum_{s',r}{p(s',r|s,a)\bigg[r + \gamma\max_{a'}{q_*(s', a')}\bigg]}
$$

\begin{align}
q_*(\verb|high|, \verb|search|) &= \alpha\bigg(r_{\verb|search|} + \gamma\max{\{q_*(\verb|high|, \verb|search|), q_*(\verb|high|, \verb|wait|)\}}\bigg) + (1-\alpha)\bigg(r_{\verb|search|} + \gamma\max{\{q_*(\verb|low|, \verb|search|), q_*(\verb|low|, \verb|wait|), q_*(\verb|low|, \verb|recharge|)\}}\bigg)\\
q_*(\verb|high|, \verb|wait|) &= r_{\verb|wait|} + \gamma\max{\{q_*(\verb|high|, \verb|search|), q_*(\verb|high|, \verb|wait|)\}}\\
q_*(\verb|low|, \verb|search|) &= \beta\bigg(r_{\verb|search|} + \gamma\max{\{q_*(\verb|low|, \verb|search|), q_*(\verb|low|, \verb|wait|), q_*(\verb|low|, \verb|recharge|)\}}\bigg) + (1-\beta)\bigg(-3 + \gamma\max{\{q_*(\verb|high|, \verb|search|), q_*(\verb|high|, \verb|wait|)\}}\bigg)\\
q_*(\verb|low|, \verb|recharge|) &= \gamma\max{\{q_*(\verb|high|, \verb|search|), q_*(\verb|high|, \verb|wait|)\}}\\
q_*(\verb|low|, \verb|wait|) &= r_{\verb|wait|} + \gamma\max{\{q_*(\verb|low|, \verb|search|), q_*(\verb|low|, \verb|wait|), q_*(\verb|low|, \verb|recharge|)\}}
\end{align}
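
One way to get a feel for these five coupled equations is to iterate them as update rules until they stop changing (value iteration on $q$). The sketch below is only illustrative: the parameter values $\alpha = 0.8$, $\beta = 0.6$, $r_{\verb|search|} = 2$, $r_{\verb|wait|} = 1$, and $\gamma = 0.9$ are assumptions chosen for the example, not values given in the exercise, and `recycling_q_star` is a hypothetical helper.

```python
def recycling_q_star(alpha, beta, r_search, r_wait, gamma, sweeps=1000):
    """Iterate the five Bellman optimality equations for the recycling robot."""
    q = {
        ("high", "search"): 0.0, ("high", "wait"): 0.0,
        ("low", "search"): 0.0, ("low", "wait"): 0.0, ("low", "recharge"): 0.0,
    }
    for _ in range(sweeps):
        # Optimal state values implied by the current q estimates
        v_high = max(q["high", "search"], q["high", "wait"])
        v_low = max(q["low", "search"], q["low", "wait"], q["low", "recharge"])
        q = {
            ("high", "search"): alpha * (r_search + gamma * v_high)
            + (1 - alpha) * (r_search + gamma * v_low),
            ("high", "wait"): r_wait + gamma * v_high,
            ("low", "search"): beta * (r_search + gamma * v_low)
            + (1 - beta) * (-3 + gamma * v_high),
            ("low", "recharge"): gamma * v_high,
            ("low", "wait"): r_wait + gamma * v_low,
        }
    return q


# Assumed, illustrative parameter values
q_star = recycling_q_star(alpha=0.8, beta=0.6, r_search=2.0, r_wait=1.0, gamma=0.9)
for (state, action), value in sorted(q_star.items()):
    print(f"q*({state}, {action}) = {value:.3f}")
```

Because each sweep is a $\gamma$-contraction, the iterates converge for any $\gamma < 1$; the fixed number of sweeps here is a simple stand-in for a proper convergence test.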

> _Exercise 3.24_ Figure 3.5 gives the optimal value of the best state of the gridworld as 24.4, to one decimal place. Use your knowledge of the optimal policy and (3.8) to express this value symbolically, and then to compute it to three decimal places.

Let $s$ be the best state (state $A$ in the gridworld).

$$
v_*(s) = 10 + \gamma \cdot 0 + \gamma^2 \cdot 0 + \gamma^3 \cdot 0  + \gamma^4 \cdot 0 + \gamma^5 v_*(s) = 10 + \gamma^5 v_*(s)\\
\Rightarrow v_*(s) = \frac{10}{1-\gamma^5} \approx 24.419
$$
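
The closed form can be evaluated directly, and cross-checked against a truncated version of the underlying sum (a reward of 10 every five steps). This is a small sketch assuming $\gamma = 0.9$ as in the gridworld example; both function names are hypothetical.

```python
def best_state_value(gamma, reward=10.0, cycle=5):
    """Closed form reward / (1 - gamma^cycle)."""
    return reward / (1.0 - gamma**cycle)


def best_state_value_truncated(gamma, reward=10.0, cycle=5, cycles=500):
    """Truncated sum of `reward` received once every `cycle` steps."""
    return sum(reward * gamma ** (cycle * i) for i in range(cycles))


v_A = best_state_value(0.9)
print(f"{v_A:.3f}")  # 24.419
```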

> _Exercise 3.25_ Give an equation for $v_*$ in terms of $q_*$.

$$
v_*(s) = \max_a{q_*(s, a)}
$$

> _Exercise 3.26_ Give an equation for $q_*$ in terms of $v_*$ and the four-argument $p$.

$$
q_*(s, a) = \mathbb{E}[R_{t+1} + \gamma v_*(S_{t+1}) | S_t = s, A_t = a]\\
= \sum_{s', r}p(s', r| s, a)(r + \gamma v_*(s'))
$$

> _Exercise 3.27_ Give an equation for $\pi_*$ in terms of $q_*$.

$$
\pi_*(a|s) = \begin{cases} 1 & \mathrm{if}\; a = \arg\max_{a'}{q_*(s, a')} \\
0 & \mathrm{otherwise} \end{cases}
$$

> _Exercise 3.28_ Give an equation for $\pi_*$ in terms of $v_*$ and the four-argument $p$.

$$
\pi_*(a|s) = \begin{cases} 1 & \mathrm{if}\; a = \arg\max_{a'}{\sum_{s', r}p(s', r| s, a')(r + \gamma v_*(s'))}\\ 0 & \mathrm{otherwise} \end{cases}
$$
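
The greedy rule above translates directly into a one-step lookahead. The sketch below is illustrative only: it assumes the dynamics are stored as a dictionary `p` mapping `(state, action)` to a list of `(probability, next_state, reward)` triples, which is a hypothetical representation, and it reuses the two-action MDP of Exercise 3.22 with $\gamma = 0.9$ as test data.

```python
def greedy_action(state, actions, p, v, gamma):
    """Pick the action maximizing sum_{s',r} p(s',r|s,a) * (r + gamma * v(s'))."""
    def backup(action):
        return sum(
            prob * (reward + gamma * v[next_state])
            for prob, next_state, reward in p[state, action]
        )
    return max(actions, key=backup)


gamma = 0.9
# Dynamics of the Exercise 3.22 MDP: (probability, next_state, reward)
p = {
    ("top", "left"): [(1.0, "left", 1.0)],
    ("top", "right"): [(1.0, "right", 0.0)],
    ("left", "go"): [(1.0, "top", 0.0)],
    ("right", "go"): [(1.0, "top", 2.0)],
}
# Optimal values for gamma = 0.9, using v_*(top) = 2*gamma / (1 - gamma^2)
v_top = 2 * gamma / (1 - gamma**2)
v = {"top": v_top, "left": gamma * v_top, "right": 2 + gamma * v_top}

print(greedy_action("top", ["left", "right"], p, v, gamma))  # right
```

When several actions tie, `max` returns the first one in `actions`; any distribution over the tied actions would be equally optimal.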

> _Exercise 3.29_ Rewrite the four Bellman equations for the four value functions ($v_\pi$, $v_*$, $q_\pi$, and $q_*$) in terms of the three argument function $p$ (3.4) and the two-argument function $r$ (3.5).

$$
v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} | S_t = s] = \sum_{s',a}{p(s'|s, a)\pi(a|s)\big(r(s,a) + \gamma v_\pi(s')\big)}\\
v_*(s) = \max_a{\sum_{s'}{p(s'|s, a)\big(r(s, a) + \gamma v_*(s')\big)}}\\
q_\pi(s, a) = \sum_{s'}{p(s'|s,a)\bigg(r(s, a) + \gamma\sum_{a'\in\mathcal A}{q_\pi(s', a')\pi(a'|s')}\bigg)}\\
q_*(s, a) = \sum_{s'}{p(s'|s,a)\bigg(r(s, a) + \gamma\max_{a'\in\mathcal A}{q_*(s', a')}\bigg)}
$$