# Monte Carlo Methods

* __Monte Carlo__ $-$ any estimation that relies on repeated random sampling

The problem with dynamic programming (DP) was that it assumed a perfect model, but we don't usually know the environment dynamics. Monte Carlo methods, on the other hand, __require only *experience*__, i.e., samples of states, actions and rewards.

The agent can learn from *actual* and *simulated* experiences. While we still need a model to generate simulated experiences, that model only has to produce sample transitions. We don't need the complete probability distribution of the environment dynamics, $p(s',r|s,a)$.

## Monte Carlo Prediction

Recall that the value of a state is the *expected return* (*expected cumulative future discounted reward*) starting from that state.

$$ v_{\pi}(s) \doteq \mathop{{}\mathbb{E}_{\pi}} \left[ G_t | S_t = s \right ] $$

In order to estimate the value function $v_{\pi}(s)$, we average the returns observed after *visits* to state $s$ until termination while following the policy $\pi$.

* If we average the returns from the first visit to $s$ in every episode, we call this the *First Visit MC method*.
* If we average the returns from every visit to $s$ in every episode, we call this the *Every Visit MC method*.
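
The two variants can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `mc_prediction`, the episode format (a list of `(state, reward)` pairs, where the reward is the one received after leaving that state) and the toy episode are all assumptions made for the example.

```python
from collections import defaultdict


def mc_prediction(episodes, gamma=1.0, first_visit=True):
    """Estimate v_pi(s) by averaging sampled returns (hypothetical helper).

    episodes: list of episodes, each a list of (state, reward) pairs,
    generated by following the policy being evaluated.
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        # Time step of the first occurrence of each state in this episode
        first_index = {}
        for t, (state, _) in enumerate(episode):
            first_index.setdefault(state, t)
        g = 0.0
        # Walk backwards so the return G_t is accumulated incrementally
        for t in range(len(episode) - 1, -1, -1):
            state, reward = episode[t]
            g = reward + gamma * g
            if first_visit and first_index[state] != t:
                continue  # first-visit MC skips later visits to the state
            returns_sum[state] += g
            returns_count[state] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}


# Toy episode (made up): A -> B -> A -> terminal, rewards 1, 0, 2
episodes = [[("A", 1.0), ("B", 0.0), ("A", 2.0)]]
print(mc_prediction(episodes, first_visit=True))   # A uses only the return 3
print(mc_prediction(episodes, first_visit=False))  # A averages returns 3 and 2
```

In the toy episode, state `A` is visited twice, so the two methods disagree on its estimate while agreeing on `B`.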

With a model, we could determine a policy from state values alone since the model could be used to look ahead one step and choose the action that leads to the best reward and next state.

Without a model, state values alone are not enough, so we must explicitly estimate the value of each action. We do this the same way as state-value estimation: instead of considering a state to be visited, we consider a *state-action pair* to be visited and sampled from.

## Exploration

While estimating action values using the Monte Carlo method, it is possible that many state-action pairs are never visited. If the policy used is deterministic then there is only one action that is observed from each state. This makes it necessary to think of strategies to maintain exploration.

* __Exploring starts__ $-$ By starting each episode from a randomly selected state-action pair, with every pair having a nonzero probability of being selected, we can ensure that all pairs are eventually visited.

* __Use $\epsilon-$soft policies__ $-$ by requiring every action in every state to have probability at least $\epsilon / |\mathcal{A}(s)|$, we keep exploring all actions. (e.g., $\epsilon$-greedy, which can slowly become more deterministic over time by decaying $\epsilon$)
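
An $\epsilon$-greedy policy is one common example of an $\epsilon$-soft policy. The sketch below shows how its action probabilities could be built from a list of action-value estimates. The helper names `epsilon_greedy_probs` and `sample_action` and the sample values are assumptions for illustration only.

```python
import random


def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities of an epsilon-greedy policy for one state.

    Every action gets at least epsilon / n, and the greedy action
    receives the remaining 1 - epsilon on top of that.
    """
    n = len(q_values)
    best = max(range(n), key=lambda a: q_values[a])  # ties go to the lowest index
    probs = [epsilon / n] * n
    probs[best] += 1.0 - epsilon
    return probs


def sample_action(probs, rng=random):
    """Draw an action index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


probs = epsilon_greedy_probs([0.1, 0.5, 0.2], epsilon=0.3)
action = sample_action(probs, random.Random(0))
```

With three actions and $\epsilon = 0.3$, each non-greedy action keeps probability $0.1$, so no action is ever ruled out.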

## Monte Carlo Control

We use Generalized Policy Iteration for control, and maintain both an approximate policy and an approximate value function.

__Policy evaluation__ is done by experiencing many episodes and updating the action-value function toward the observed returns.
\
__Policy improvement__ is done by making the policy greedy with respect to the current action-value function.
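
The loop of evaluation and improvement can be sketched as on-policy first-visit Monte Carlo control with an $\epsilon$-greedy policy. Everything concrete here is an assumption made for illustration: the four-state corridor environment, the function names, and the hyperparameters are not part of these notes.

```python
import random
from collections import defaultdict

LEFT, RIGHT = 0, 1
TERMINALS = {0, 3}


def step(state, action):
    """Hypothetical corridor with states 0..3; reward 1 only on reaching state 3."""
    next_state = state + 1 if action == RIGHT else state - 1
    reward = 1.0 if next_state == 3 else 0.0
    return next_state, reward


def greedy(q, state):
    """Greedy action with respect to the current action-value estimates."""
    return max((LEFT, RIGHT), key=lambda a: q[(state, a)])


def mc_control(num_episodes=5000, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = defaultdict(float)     # action-value estimates Q(s, a)
    counts = defaultdict(int)  # number of returns averaged into each Q(s, a)
    for _ in range(num_episodes):
        # Generate one episode with the epsilon-greedy policy, starting in state 1
        state, episode = 1, []
        while state not in TERMINALS:
            if rng.random() < epsilon:
                action = rng.choice((LEFT, RIGHT))
            else:
                action = greedy(q, state)
            next_state, reward = step(state, action)
            episode.append((state, action, reward))
            state = next_state
        # Policy evaluation: average first-visit returns into Q
        g = 0.0
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            g = r + gamma * g
            if any((s, a) == (s2, a2) for s2, a2, _ in episode[:t]):
                continue  # not the first visit to (s, a) in this episode
            counts[(s, a)] += 1
            q[(s, a)] += (g - q[(s, a)]) / counts[(s, a)]
        # Policy improvement is implicit: greedy() reads the updated Q
    return q


q = mc_control()
policy = {s: greedy(q, s) for s in (1, 2)}
```

In this toy corridor the only reward lies to the right, so the greedy policy extracted from `q` should move right from both non-terminal states.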


### On-policy and Off-policy learning

* __On-policy__ $-$ improve the same policy that is used for making decisions while learning
* __Off-policy__ $-$ use different policies for making decisions and learning.

Off-policy learning allows us to do Monte Carlo control without exploring starts.

The policy that makes decisions is called the __behavior policy__ $b$, and the policy that is learned is called the __target policy__ $\pi$.

* We can make $b$ favor exploration
* $b$ must cover $\pi$, i.e., $\pi(a|s) > 0 \Rightarrow b(a|s) > 0 $
* In On-policy learning, $b=\pi$

### Importance Sampling

We need to use Importance Sampling to estimate the expected values under one distribution ($\pi$) given samples from another ($b$).

Suppose the samples $x$ of a random variable $X$ are drawn from distribution $b$.
\
The expectation of $X$ under $\pi$ is:
$$ \mathop{{}\mathbb{E}_{\pi}} [X] \doteq \sum_{x \in X} x.\pi(x) $$

$$ \mathop{{}\mathbb{E}_{\pi}} [X] = \sum_{x \in X} x.\pi(x) . \frac{b(x)}{b(x)} $$

Let $\rho(x) = \frac{\pi(x)}{b(x)}$

$$ \mathop{{}\mathbb{E}_{\pi}} [X] = \sum_{x \in X} x.\rho(x).b(x) $$

$$ \mathop{{}\mathbb{E}_{\pi}} [X] = \mathop{{}\mathbb{E}_{b}} [X \rho(X)] $$

$$ \mathop{{}\mathbb{E}_{\pi}} [X] \approx \frac{1}{n} \sum_{i=1}^n x_i \rho(x_i), \qquad x_i \sim b $$
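
As a numerical sanity check of the sample average above, the sketch below draws samples from $b$ and reweights them by $\rho(x)$. The two distributions over the values 1, 2, 3 are made up for illustration, and the helper name `importance_sampling_mean` is hypothetical.

```python
import random

values = [1, 2, 3]
b = {1: 1 / 3, 2: 1 / 3, 3: 1 / 3}  # sampling distribution (assumed)
pi = {1: 0.1, 2: 0.2, 3: 0.7}       # distribution of interest (assumed)


def importance_sampling_mean(n, seed=0):
    """Estimate E_pi[X] using n samples drawn from b."""
    rng = random.Random(seed)
    samples = rng.choices(values, weights=[b[x] for x in values], k=n)
    # Each sample is weighted by rho(x) = pi(x) / b(x)
    return sum(x * pi[x] / b[x] for x in samples) / n


exact = sum(x * pi[x] for x in values)  # 0.1 + 0.4 + 2.1 = 2.6
estimate = importance_sampling_mean(100_000)
print(exact, estimate)
```

The estimate should approach the exact value as `n` grows, even though no sample was ever drawn from `pi`.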

So we can estimate an expectation under $\pi$ from samples generated by $b$ if we know what $\rho$ is. Applying this to the value function:

$$ v_{\pi}(s) \doteq  \mathop{{}\mathbb{E}_{\pi}} \left [ G_t | S_t = s \right ] $$

$$ v_{\pi}(s) =  \mathop{{}\mathbb{E}_{b}} \left [ \rho G_t | S_t = s \right ] $$

$\rho$ is given by:

$$ \rho = \frac{P(\text{trajectory under } \pi)}{P(\text{trajectory under } b)} $$

$$ \rho = \prod_{k=t}^{T-1} \frac{\pi(A_k|S_k).p(S_{k+1}|S_k,A_k)}{b(A_k|S_k).p(S_{k+1}|S_k,A_k)} $$

Since the environment dynamics are the same under both policies, the transition probabilities cancel:

$$ \rho = \prod_{k=t}^{T-1} \frac{\pi(A_k|S_k)}{b(A_k|S_k)} $$
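
The product of policy ratios can be computed directly from a recorded trajectory. The sketch below also averages $\rho G_t$ over episodes, which is one simple way to estimate $v_{\pi}(s)$ from behavior-policy data. The function names, the dictionary-based policy representation and the numbers are assumptions for illustration.

```python
def importance_ratio(trajectory, target, behavior):
    """Product of pi(A_k|S_k) / b(A_k|S_k) over a trajectory.

    trajectory: list of (state, action) pairs from time t to T-1.
    target, behavior: dicts mapping state -> list of action probabilities.
    """
    rho = 1.0
    for state, action in trajectory:
        rho *= target[state][action] / behavior[state][action]
    return rho


def off_policy_value_estimate(episodes, target, behavior):
    """Average of rho * G over episodes that all start in the same state.

    episodes: list of (trajectory, observed_return) pairs generated by b.
    """
    weighted = [importance_ratio(traj, target, behavior) * g for traj, g in episodes]
    return sum(weighted) / len(weighted)


# Hypothetical policies over two states with two actions each
target = {"s0": [0.0, 1.0], "s1": [0.9, 0.1]}
behavior = {"s0": [0.5, 0.5], "s1": [0.5, 0.5]}

trajectory = [("s0", 1), ("s1", 0)]
rho = importance_ratio(trajectory, target, behavior)  # (1.0/0.5) * (0.9/0.5)
```

A trajectory containing an action that $\pi$ never takes gets $\rho = 0$, so its return contributes nothing to the estimate. The coverage condition on $b$ is what keeps every denominator nonzero for actions that $\pi$ can take.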

