# Chapter 5 - Monte Carlo Methods
This chapter introduces Monte Carlo (MC) methods, which differ from the dynamic programming (DP) methods of Chapter 4 in a few key aspects:

- They learn from experience (sampled sequences of states, actions and rewards obtained by interacting with an environment or a simulation) rather than from complete knowledge of the environment, i.e. the values $\mathcal{P}_{ss'}^a$ and $\mathcal{R}_{ss'}^a$ need not be known.

- They average complete returns from a given state to a terminal state. The estimate for each state is therefore independent of the estimates for all other states, unlike in DP, where each estimate builds upon the estimates of successor states. In other words, MC methods do not bootstrap.

- If a model of the environment is not available, it is more useful to estimate action values $Q^{\pi}(s, a)$ than state values $V^{\pi}(s)$. With action values, the greedy action in a given state $s$ is simply $\underset{a} {\arg \max} Q(s, a)$, whereas $V(s)$ alone is not enough: choosing an action from it requires a one-step lookahead over the possible next states, which needs the model.
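
The points above can be illustrated with a minimal sketch of first-visit MC estimation of action values, followed by greedy action selection. The function names, the episode format (lists of `(state, action, reward)` triples) and the toy episodes are assumptions made for this illustration only; they do not come from the book.

```python
from collections import defaultdict


def first_visit_mc_q(episodes, gamma=1.0):
    """Estimate Q(s, a) by averaging first-visit returns.

    `episodes` is a list of episodes; each episode is a list of
    (state, action, reward) triples, where `reward` is the reward
    received after taking `action` in `state` (assumed format).
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        # Compute the return following each time step, working backwards.
        g = 0.0
        returns = []
        for state, action, reward in reversed(episode):
            g = reward + gamma * g
            returns.append(((state, action), g))
        returns.reverse()
        # Only the first visit to each (state, action) pair contributes.
        seen = set()
        for pair, g in returns:
            if pair in seen:
                continue
            seen.add(pair)
            returns_sum[pair] += g
            returns_count[pair] += 1
    # No bootstrapping: each estimate is a plain average of observed returns.
    return {pair: returns_sum[pair] / returns_count[pair] for pair in returns_sum}


def greedy_action(q, state, actions):
    """Pick argmax_a Q(state, a); no model of the environment is needed."""
    return max(actions, key=lambda a: q.get((state, a), float("-inf")))


# Hypothetical toy episodes, for illustration only.
episodes = [
    [("s0", "left", 0.0), ("s1", "right", 1.0)],
    [("s0", "right", 0.0), ("s1", "left", -1.0)],
    [("s0", "left", 0.0), ("s1", "left", -1.0)],
]
q = first_visit_mc_q(episodes)
best_in_s0 = greedy_action(q, "s0", ["left", "right"])
```

Note that the estimate for each pair only uses returns observed after that pair, never the current estimate of another pair.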

### Exercise 5.1
Question on the approximate state-value functions for blackjack (Figure 5.2):

Q: Why does the value function jump up for the last two rows in the rear? <br>
A: Because when the player's sum (state) is 20 or 21, the policy prescribes _sticking_, and a sum that close to 21 has a high chance of winning. So the value of being in those states is high (if following this policy).

Q: Why does it drop off for the last row on the left? <br>
A: That row corresponds to the dealer showing an _ace_, which puts the dealer in a better position overall (as the ace can be counted as either 1 or 11) regardless of the player's cards. So the value of being in any of those states is correspondingly reduced for the player.

Q: Why are the frontmost values higher in the upper diagram than in the lower? <br>
A: Because, all else being equal, having a low sum (12, 13, ...) _with a usable ace_ is a better position for the player: the policy prescribes _hitting_ in those states, and the chances of overshooting 21 are lower because the _ace_ can be recounted as 1 instead of 11.
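
The effect of a usable ace can be checked with a small sketch. It assumes cards are drawn with replacement from an infinite deck in which ten-valued cards are four times as likely as any other value, and it represents an ace as the value 1; `hand_value`, `bust_probability` and `DECK` are hypothetical names introduced here.

```python
def hand_value(cards):
    """Return (total, usable_ace) for a list of card values, aces given as 1."""
    total = sum(cards)
    # One ace may count as 11 if that does not push the total above 21.
    if 1 in cards and total + 10 <= 21:
        return total + 10, True
    return total, False


# Assumed infinite deck: ace = 1, then 2-9, and four ten-valued cards.
DECK = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10]


def bust_probability(cards):
    """Probability that drawing one more card takes the hand above 21."""
    busts = sum(hand_value(cards + [c])[0] > 21 for c in DECK)
    return busts / len(DECK)


soft_12 = [1, 1]    # ace + ace: sum 12 with a usable ace
hard_12 = [10, 2]   # sum 12 without a usable ace

p_soft = bust_probability(soft_12)  # 0.0: the ace falls back to 1
p_hard = bust_probability(hard_12)  # 4/13: any ten-valued card busts
```

With the same visible sum, the hand with a usable ace cannot go bust on the next card, while the hand without one busts on any ten-valued card.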

### Exercise 5.2
What is the backup diagram for MC estimation of $Q^{\pi}$? (analogous to Figure 5.3 for $V^{\pi}$)

The backup diagram for $Q^{\pi}$ is similarly linear, since we only look at experience, i.e. pairs of $(state,\: action)$ that were actually played / simulated. The only difference is that the nodes (except the terminal state) are of a single type, which is tuples of $(state,\: action)$.

### Exercise 5.3
What is the MC estimate analogous to equation 5.3 for _action_-values of a policy $\pi$, given returns generated using $\pi'$?

First we need to redefine the variables similarly to those in Section 5.6, but for the complete sequence of $(state, action)$ pairs following $(s, a)$ on the $i^{th}$ visit to $(s, a)$. Let $p_i(s, a)$ be the probability of the sequence of all states and actions that follow action $a$ in state $s$ under policy $\pi$, and let $p'_i(s, a)$ be the probability of the same sequence under policy $\pi'$. Also, let $R_i(s, a)$ be the observed return after the $i^{th}$ visit to $(s, a)$, with $n_s$ here denoting the number of such returns. We then average all $R_i(s, a)$, each weighted by the relative probability of its sequence occurring under $\pi$ and $\pi'$, i.e. $\dfrac {p_i(s, a)} {p'_i(s, a)}$:

$$
Q(s, a) = \frac  {\sum_{i=1}^{n_s} \frac {p_i(s, a)}{p'_i(s, a)} R_i(s, a)} {\sum_{i=1}^{n_s} \frac {p_i(s, a)}{p'_i(s, a)}}
$$
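
As a minimal sketch, the weighted average above can be computed as follows once the ratios $\dfrac{p_i(s, a)}{p'_i(s, a)}$ are available. The function name and the example numbers are made up for illustration.

```python
def weighted_is_estimate(weights, returns):
    """Weighted importance-sampling estimate of Q(s, a).

    weights[i] is the ratio p_i(s, a) / p'_i(s, a) and returns[i] is the
    return R_i(s, a) observed after the i-th visit to (s, a).
    """
    total_weight = sum(weights)
    if total_weight == 0:
        # Every observed sequence has probability zero under pi,
        # so the data say nothing about Q^pi(s, a).
        raise ValueError("all importance weights are zero")
    weighted_sum = sum(w * r for w, r in zip(weights, returns))
    return weighted_sum / total_weight


# Hypothetical weights and returns for three visits to the same (s, a).
estimate = weighted_is_estimate([2.0, 0.5, 0.0], [1.0, -1.0, 1.0])
# (2.0 * 1.0 + 0.5 * -1.0 + 0.0 * 1.0) / 2.5 = 0.6
```

A return whose sequence is impossible under $\pi$ gets weight zero and therefore does not influence the estimate.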

As in the case of $V(s)$ in equation (5.3), the ratio $\dfrac {p_i(s, a)}{p'_i(s, a)}$ can be computed without knowing the environment's dynamics. Let $T_i(s, a)$ be the time of termination of the $i^{th}$ episode involving $(s, a)$, and let $t$ be the time of the visit, so that $(s_t, a_t) = (s, a)$. Then we have:

$$
\begin{aligned}
    p_i(s_t, a_t) &= \mathcal{P}_{s_t s_{t+1}}^{a_t} \cdot \pi(s_{t+1}, a_{t+1}) \cdot \mathcal{P}_{s_{t+1} s_{t+2}}^{a_{t+1}} \cdot \pi(s_{t+2}, a_{t+2}) \cdot \ldots \cdot \mathcal{P}_{s_{T_i(s, a)-2} s_{T_i(s, a)-1}}^{a_{T_i(s, a) - 2}} \cdot \pi(s_{T_i(s, a)-1}, a_{T_i(s, a) - 1}) \cdot \mathcal{P}_{s_{T_i(s, a)-1} s_{T_i(s, a)}}^{a_{T_i(s, a) - 1}} \\[1ex]
    &= \mathcal{P}_{s_t s_{t+1}}^{a_t} \cdot \left[ \prod_{k=t+1}^{T_i(s, a) - 1} \pi(s_{k}, a_{k}) \cdot \mathcal{P}_{s_k s_{k+1}}^{a_k} \right]
\end{aligned}
$$

Therefore:
$$
\frac {p_i(s_t, a_t)}{p'_i(s_t, a_t)} = \frac {\mathcal{P}_{s_t s_{t+1}}^{a_t} \cdot \left[ \prod_{k=t+1}^{T_i(s, a) - 1} \pi(s_{k}, a_{k}) \cdot \mathcal{P}_{s_k s_{k+1}}^{a_k} \right]} {\mathcal{P}_{s_t s_{t+1}}^{a_t} \cdot \left[ \prod_{k=t+1}^{T_i(s, a) - 1} \pi'(s_{k}, a_{k}) \cdot \mathcal{P}_{s_k s_{k+1}}^{a_k} \right]} = \frac {\prod_{k=t+1}^{T_i(s, a) - 1} \pi(s_{k}, a_{k})} {\prod_{k=t+1}^{T_i(s, a) - 1} \pi'(s_{k}, a_{k})}
$$

which is very similar to the weight derived for $V(s)$ in the book; the only difference is that the index $k$ starts from $t+1$ instead of $t$, since $a_t$ is now fixed rather than chosen by the policy.
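
The final ratio depends only on the two policies, which a short sketch can make concrete. It assumes both policies are stored as nested dictionaries `policy[state][action] -> probability`, and that `tail` holds the $(state, action)$ pairs visited after $(s_t, a_t)$, i.e. for $k = t+1, \ldots, T_i(s, a) - 1$. These names and the toy policies are illustrative assumptions.

```python
from math import prod


def importance_ratio(tail, pi, pi_prime):
    """Ratio p_i(s_t, a_t) / p'_i(s_t, a_t) for one observed episode tail.

    `tail` lists the (state, action) pairs that follow (s_t, a_t); the pair
    (s_t, a_t) itself is excluded because a_t is fixed. The transition
    probabilities cancel, so only the policy probabilities are needed.
    """
    numerator = prod(pi[s][a] for s, a in tail)
    denominator = prod(pi_prime[s][a] for s, a in tail)
    return numerator / denominator


# Hypothetical policies: pi is deterministic, pi_prime is uniformly random.
pi = {
    "s1": {"hit": 1.0, "stick": 0.0},
    "s2": {"hit": 0.0, "stick": 1.0},
}
pi_prime = {
    "s1": {"hit": 0.5, "stick": 0.5},
    "s2": {"hit": 0.5, "stick": 0.5},
}

ratio = importance_ratio([("s1", "hit"), ("s2", "stick")], pi, pi_prime)
# (1.0 * 1.0) / (0.5 * 0.5) = 4.0
```

If the episode terminates right after $(s_t, a_t)$, the tail is empty and the ratio is 1, since both products are empty. The sketch also assumes that $\pi'$ gives non-zero probability to every action in the tail, which holds for any action actually observed while following $\pi'$.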