## Chapter 5 Monte Carlo Methods

In contrast to Chapter 4, we don't assume complete knowledge of the environment here. Monte Carlo (MC) methods need only experience: "sample sequences of states, actions, and rewards from actual or simulated interaction with an environment." Learning from actual experience is a big deal because no prior knowledge of the environment's dynamics is needed to learn optimal behavior. Learning from simulated experience is also very useful: although we need a model of the environment, the model only has to generate sample transitions, not the complete probability distribution over all possible transitions that dynamic programming requires.

"Monte Carlo methods enable us to solve the RL problem by averaging sample returns." We'll only look at Monte Carlo methods for episodic tasks so that we can be certain that we're only dealing with well-defined returns. Policies and value estimates are only changed at the end of an episode. 

MC methods sample and average returns for each state-action pair. This is similar to the k-armed bandit methods from Chapter 2, which sampled and averaged rewards for each action. The big difference now is that there are multiple states, each acting like a different bandit problem, and these bandit problems are interrelated. Recall that this is the full RL problem alluded to at the end of Chapter 2 and covered throughout Chapter 3. "The return after taking an action in one state depends on the actions taken in later states in the same episode." Because all the action selections are themselves being learned, the problem is nonstationary from the point of view of any earlier state.
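
For reference, the quantity being sampled and averaged is the return. The block below is a reminder written under the assumption that the notation follows Chapter 3: $R_{t+1}$ is the reward received after time step $t$, $\gamma$ is the discount rate, and $T$ is the final time step of the episode.

$$
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots + \gamma^{T-t-1} R_T = R_{t+1} + \gamma G_{t+1}, \qquad G_T = 0
$$

Because $T$ is finite in an episodic task, this sum always has finitely many terms, which is what makes the returns well defined. The recursive form on the right is also what lets the returns of an episode be computed in a single backward pass once the episode has ended.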

To handle the nonstationarity, we adapt the idea of Generalized Policy Iteration (GPI) from Chapter 4. There, dynamic programming computed value functions directly from knowledge of the MDP; here we learn value functions from sample returns. Each piece of GPI carries over from dynamic programming to MC, except that we use sample experience to learn $v_\pi$, $q_\pi$, and ultimately the policy $\pi$ itself.

### 5.1: Monte Carlo Prediction

We will use the Monte Carlo method to learn the state-value function for a given policy. To estimate a state's value from experience, we average the returns observed after visits to that state; as more returns are observed, this average should converge to the expected value. Imagine we want to estimate $v_\pi(s)$ given a set of episodes gathered by following $\pi$ and passing through $s$. Each occurrence of state $s$ in an episode is called a visit to $s$. $s$ may be visited multiple times in an episode; the first occurrence is called the first visit to $s$. The first-visit Monte Carlo method estimates $v_\pi(s)$ as the average of the returns following first visits to $s$. The every-visit Monte Carlo method averages the returns following all visits to $s$.
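
The difference between the two estimators can be made concrete with a minimal sketch. The function name `mc_prediction`, the episode format (a list of `(state, reward)` pairs, where the reward is the one received on leaving that state), and the toy episodes are all assumptions made for illustration; they are not taken from the book.

```python
from collections import defaultdict


def mc_prediction(episodes, gamma=1.0, first_visit=True):
    """Estimate v_pi(s) by averaging sampled returns.

    episodes: list of episodes, each a list of (state, reward) pairs
    (hypothetical format; reward is the one received after leaving state).
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        # Compute the return following each time step, working backwards.
        G = 0.0
        visits = []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            visits.append((state, G))
        visits.reverse()  # back to chronological order

        seen = set()
        for state, G in visits:
            if first_visit and state in seen:
                continue  # first-visit MC ignores later visits in an episode
            seen.add(state)
            returns_sum[state] += G
            returns_count[state] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}


# Two made-up episodes; state "A" is visited twice in the first one.
episodes = [
    [("A", 1.0), ("B", 0.0), ("A", 2.0)],
    [("B", 1.0), ("A", 0.0)],
]
v_first = mc_prediction(episodes, first_visit=True)
v_every = mc_prediction(episodes, first_visit=False)
```

With $\gamma = 1$, the returns following visits to `A` are 3 and 2 in the first episode and 0 in the second. First-visit MC averages 3 and 0, while every-visit MC averages all three, so the two estimates of `A` differ; the estimates of `B` agree because `B` is visited once per episode.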

First-visit and every-visit Monte Carlo both converge to $v_\pi(s)$ as the number of visits (or first visits) to $s$ goes to infinity. For first-visit Monte Carlo this is easy to see, because each return is an i.i.d. estimate of $v_\pi(s)$ with finite variance. By the law of large numbers, the sequence of averages converges to the expected value. Every-visit Monte Carlo isn't as simple, but its estimates of $v_\pi(s)$ also converge quadratically.
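
The first-visit argument can be written out as follows. This is a sketch that assumes $G_1, \dots, G_n$ denote the returns following the first visits to $s$ in $n$ episodes and $\sigma^2$ denotes their finite variance; both symbols are introduced here for illustration.

$$
V_n(s) = \frac{1}{n}\sum_{i=1}^{n} G_i, \qquad
\mathbb{E}[V_n(s)] = v_\pi(s), \qquad
\mathrm{Var}[V_n(s)] = \frac{\sigma^2}{n}
$$

Each average is therefore an unbiased estimate, and the standard deviation of its error falls as $1/\sqrt{n}$. Returns from several visits within the same episode overlap and are not independent, which is why the every-visit case needs a separate argument.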


### 5.2: Monte Carlo Estimation of Action Values

When we don't have a model of the environment, it's useful to estimate action values rather than state values. With a model, state values alone are enough to determine a policy: we look ahead one step and choose the action that leads to the best combination of reward and next state. Without a model, state values alone aren't enough; we need to estimate the value of each action directly for the values to be useful in suggesting a policy. So one of the main goals of Monte Carlo methods is to estimate $q_*$. To do this, we first look at the policy evaluation problem for action values.
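
A small sketch may help show why the model matters. The function names, the dictionary-based `model` representation, and the toy numbers below are all hypothetical, chosen only to contrast the two ways of picking a greedy action.

```python
def greedy_from_v(state, actions, model, v, gamma=1.0):
    """Pick an action by one-step lookahead; this needs a model.

    model[(state, action)] is a list of (probability, next_state, reward)
    triples (a hypothetical representation chosen for this sketch).
    """
    def expected_backup(action):
        return sum(p * (r + gamma * v[s2]) for p, s2, r in model[(state, action)])
    return max(actions, key=expected_backup)


def greedy_from_q(state, actions, q):
    """Pick an action directly from action values; no model required."""
    return max(actions, key=lambda a: q[(state, a)])


# Toy example with made-up numbers.
actions = ["left", "right"]
model = {
    ("s", "left"): [(1.0, "L", 0.0)],
    ("s", "right"): [(0.5, "R", 1.0), (0.5, "L", 0.0)],
}
v = {"L": 2.0, "R": 0.0}
q = {("s", "left"): 2.0, ("s", "right"): 1.5}

a_from_v = greedy_from_v("s", actions, model, v)
a_from_q = greedy_from_q("s", actions, q)
```

Both calls select the same action here, but `greedy_from_v` can only do so because it is handed the transition probabilities and rewards. With action values, the lookahead has in effect already been folded into the estimates.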

Policy evaluation for action values means estimating $q_\pi(s, a)$, the expected return when starting in state $s$, taking action $a$, and afterwards following policy $\pi$. "The Monte Carlo methods for this are essentially the same as the state values, except now we talk about visits to a state-action pair rather than to a state." Both every-visit and first-visit MC still converge quadratically to the true expected values as the number of visits to each state-action pair goes to infinity.

The only complication is that many state-action pairs may never be visited. If $\pi$ is a deterministic policy, then following $\pi$ we will observe returns for only one action from each state. With no returns to average, the MC estimates for the other actions never improve with experience. This is a serious problem, because the purpose of learning action values is to help choose among the actions available in each state: we need to estimate the value of all of them, not only the one the policy currently prefers.
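
The gap left by a deterministic policy shows up directly in a first-visit estimate of action values. The sketch below assumes a hypothetical episode format of `(state, action, reward)` triples and made-up episodes in which the policy always chooses `"right"`.

```python
from collections import defaultdict


def first_visit_mc_q(episodes, gamma=1.0):
    """Estimate q_pi(s, a) by averaging first-visit returns.

    episodes: list of episodes, each a list of (state, action, reward)
    triples (hypothetical format chosen for this sketch).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for episode in episodes:
        G = 0.0
        first_return = {}
        # Walk backwards; each overwrite comes from an earlier time step,
        # so the value left at the end is the return after the first visit.
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            first_return[(state, action)] = G
        for pair, G in first_return.items():
            totals[pair] += G
            counts[pair] += 1
    return {pair: totals[pair] / counts[pair] for pair in totals}


# Episodes generated by a deterministic policy that always picks "right".
episodes = [
    [("s0", "right", 0.0), ("s1", "right", 1.0)],
    [("s0", "right", 0.0), ("s1", "right", 3.0)],
]
q = first_visit_mc_q(episodes)
```

The result contains estimates for `("s0", "right")` and `("s1", "right")` only. No amount of additional experience under this policy would ever produce an entry for `("s0", "left")`.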

This is the general problem of maintaining exploration. For policy evaluation to work for action values, we must ensure that exploration continues. One way is to specify that each episode starts in a state-action pair, and that every pair has a nonzero probability of being selected as the start. This assumption is called exploring starts, and it guarantees that every pair will be visited an infinite number of times as the number of episodes goes to infinity. Exploring starts cannot always be relied upon, however, particularly when learning directly from real interaction with an environment, where the starting conditions are unlikely to be so convenient. The most common alternative is to consider only stochastic policies that have a nonzero probability of selecting every action in each state.
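
Both ways of keeping exploration alive can be sketched in a few lines. The helper names are hypothetical, the sketch assumes every action is available in every state, and an $\varepsilon$-greedy rule is used as just one example of a stochastic policy with nonzero probability for every action.

```python
import random


def exploring_start(states, actions, rng=random):
    """Sample the first state-action pair of an episode uniformly.

    Uniform sampling is one simple way to give every pair a nonzero
    probability (assumes the same actions are available in every state).
    """
    return rng.choice(states), rng.choice(actions)


def epsilon_soft_probs(q, state, actions, epsilon=0.1):
    """Action probabilities that are greedy w.r.t. q but never zero."""
    greedy = max(actions, key=lambda a: q.get((state, a), 0.0))
    base = epsilon / len(actions)
    return {a: base + (1.0 - epsilon if a == greedy else 0.0) for a in actions}


def sample_action(probs, rng=random):
    """Draw one action according to the given probabilities."""
    actions = list(probs)
    return rng.choices(actions, weights=[probs[a] for a in actions], k=1)[0]


# Made-up action values for a single state "s".
q = {("s", "a"): 1.0, ("s", "b"): 0.0}
probs = epsilon_soft_probs(q, "s", ["a", "b"], epsilon=0.2)
start = exploring_start(["s", "t"], ["a", "b"])
action = sample_action(probs)
```

With `epsilon=0.2` and two actions, the greedy action gets probability 0.9 and the other gets 0.1, so returns for both actions will eventually be observed. Unlike exploring starts, this needs no control over how episodes begin.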

### 5.3: Monte Carlo Control

Monte Carlo estimation can be used to approximate an optimal policy. 