# Chapter 5 - Monte Carlo Methods
This chapter introduces Monte Carlo (MC) methods, which differ from dynamic programming (DP) methods (Chapter 4) in a few key respects:

- They learn from experience (sample sequences of states, actions and rewards obtained by interacting with an environment or a simulation), not from complete knowledge of the environment, i.e. the transition probabilities $\mathcal{P}_{ss'}^a$ and expected rewards $\mathcal{R}_{ss'}^a$ do not need to be known.

- They average complete returns observed from a given state until a terminal state. The estimate for each state is therefore independent: it does not build upon the estimate of any other state, as is the case in DP. In other words, MC does not bootstrap.

- If a model of the environment is not available, it is more useful to estimate action values $Q^{\pi}(s, a)$ than state values $V^{\pi}(s)$. With action values, the greedy action in a given state $s$ is simply $\underset{a} {\arg \max} Q(s, a)$, whereas knowing just $V(s)$ is not enough: without a model there is no way to calculate the value of each action available from that state.
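
The points above can be illustrated with a minimal sketch of first-visit MC estimation of action values, followed by greedy action selection. The episode format (lists of `(state, action, reward)` triples), the function names, and the toy episodes are assumptions made for this example, not something prescribed by the chapter.

```python
from collections import defaultdict


def first_visit_mc_q(episodes, gamma=1.0):
    """Estimate Q(s, a) by averaging first-visit returns over complete episodes.

    Each episode is a list of (state, action, reward) triples, where `reward`
    is the reward received after taking `action` in `state` (assumed format).
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        # Compute the return that follows each time step, working backwards.
        g = 0.0
        returns = []
        for _, _, reward in reversed(episode):
            g = reward + gamma * g
            returns.append(g)
        returns.reverse()

        # Only the first occurrence of each (state, action) pair contributes.
        seen = set()
        for (state, action, _), g_t in zip(episode, returns):
            if (state, action) in seen:
                continue
            seen.add((state, action))
            returns_sum[(state, action)] += g_t
            returns_count[(state, action)] += 1
    # No estimate of another state is used anywhere: no bootstrapping.
    return {sa: returns_sum[sa] / returns_count[sa] for sa in returns_sum}


def greedy_action(q, state, actions):
    """Pick argmax_a Q(state, a); unseen pairs default to 0.0 (an assumption)."""
    return max(actions, key=lambda a: q.get((state, a), 0.0))


# Toy, made-up episodes purely for illustration.
episodes = [
    [("s0", "hit", 0.0), ("s1", "stick", 1.0)],
    [("s0", "stick", -1.0)],
    [("s0", "hit", 0.0), ("s1", "stick", -1.0)],
]
q = first_visit_mc_q(episodes)
best = greedy_action(q, "s0", ["hit", "stick"])
```

Note that `greedy_action` needs only the table `q`; no transition probabilities or expected rewards appear anywhere in the sketch.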

### Exercise 5.1
Question on the approximate state-value functions for blackjack (Figure 5.2):

Q: Why does the value function jump up for the last two rows in the rear? <br>
A: Because when the player's sum (state) is 20 or 21, the policy prescribes _sticking_, and a sum that close to 21 has a high chance of winning. So the value of being in those states is high (if following this policy).

Q: Why does it drop off for the last row on the left? <br>
A: That corresponds to the dealer showing an _ace_, which puts the dealer in a better position overall (as the ace can be counted either as 1 or as 11) regardless of the player's cards. So the value of being in any of those states is correspondingly reduced for the player.

Q: Why are the frontmost values higher in the upper diagram than in the lower? <br>
A: Because, all else being equal, having a low sum (12, 13, ...) _with a usable ace_ is a better position for the player: the policy prescribes _hitting_ in those states, and the chances of overshooting 21 are lower because the _ace_ can be counted as 1 instead of 11.
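
The effect of a usable ace on the risk of going bust can be checked with a small sketch. The card encoding (integers 1 to 10, with 1 for an ace and 10 for any face card) and the helper names `hand_value` and `bust_card_values` are assumptions made for this example.

```python
def hand_value(cards):
    """Return (total, usable_ace) for a blackjack hand.

    Cards are integers 1-10: 1 stands for an ace, 10 for any ten or face card.
    An ace is "usable" if counting it as 11 does not push the total above 21.
    """
    total = sum(cards)
    usable_ace = 1 in cards and total + 10 <= 21
    return (total + 10, True) if usable_ace else (total, False)


def bust_card_values(cards):
    """List the card values (1..10) that would bust the hand on the next hit."""
    return [c for c in range(1, 11) if hand_value(cards + [c])[0] > 21]


soft_13 = [1, 2]   # ace + 2: sum 13 with a usable ace
hard_13 = [10, 3]  # ten + 3: sum 13 without a usable ace

soft_busts = bust_card_values(soft_13)
hard_busts = bust_card_values(hard_13)
```

With the same sum of 13, no single card can bust the hand that holds a usable ace, while the hand without one goes bust on a 9 or on any ten-valued card.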

### Exercise 5.2
What is the backup diagram for MC estimation of $Q^{\pi}$? (analogous to Figure 5.3 for $V^{\pi}$)

The backup diagram for $Q^{\pi}$ is similarly linear, since we only look at experience, i.e. pairs of $(state,\: action)$ that were actually played / simulated, from the root pair all the way to the terminal state. The only difference is that the nodes (except the terminal state) are of a single type: $(state,\: action)$ pairs.
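
As a rough text rendering of this answer, the sketch below prints one sampled episode as a single chain of $(state,\: action)$ nodes ending in a terminal node. The function name `q_backup_chain`, the terminal label `T`, and the sample episode are all hypothetical.

```python
def q_backup_chain(episode, terminal="T"):
    """Render an episode of (state, action) pairs as a linear backup chain."""
    nodes = [f"({state}, {action})" for state, action in episode]
    return " -> ".join(nodes + [terminal])


chain = q_backup_chain([("s0", "hit"), ("s1", "hit"), ("s2", "stick")])
print(chain)  # (s0, hit) -> (s1, hit) -> (s2, stick) -> T
```

There is no branching at any node, because only the pairs that actually occurred in the episode are backed up.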