# Review Theory
**Markov Decision Processes**
* State *S*
* Action *A*
* Transitions $P(s' \mid s,a)$ (or $T(s,a,s')$)
* Rewards $R(s,a,s')$ (and discount $\gamma$)
* Start State $s_0$

Reinforcement Learning Type:
* Known
  * Current State
  * Available Actions
  * Experienced Rewards
 
* Unknown
  * Transition Model
  * Reward Structure
  
* Assumed
  * Markov Transitions
  * Fixed Reward For (s,a,s')
  
 Problem: Find Values for Fixed policy $\pi$ (policy evaluation) <br>
 Model-based Learning: Learn the model, solve for values <br>
 Model-free Learning: Solve for values directly (by sampling) <br>
 
**Monte Carlo Methods** <br>
 Monte Carlo methods are a large family of computational algorithms that rely on *random sampling*. These methods are mainly used for:
* Numerical integration
* Stochastic optimization
* Characterizing distributions

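Reasons for using Monte Carlo instead of Dynamic Programming:
* No need for a complete Markov Decision Process
* Often computationally cheaper, since values are estimated from sampled episodes rather than full sweeps over the state space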
* Can be used with stochastic simulations

**Monte Carlo Process:**
* To evaluate state $s$, find the first time-step $t$ at which $s$ is visited in an episode
* Increment counter $N(s)  \leftarrow N(s) + 1$
* Increment total return $S(s) \leftarrow S(s) + G_t$
* Value is estimated by mean return $V(s)=\frac{S(s)}{N(s)}$
* By the law of large numbers, $V(s) \rightarrow v_\pi(s)$ as $N(s) \rightarrow \infty$

**First-visit Monte Carlo policy evaluation** <br>
Initialize:
> $\pi \leftarrow$ policy to be evaluated <br>
> $V \leftarrow$ an arbitrary state-value function <br>
> $Returns(s) \leftarrow$ an empty list, for all $s \in S$

Repeat forever:
>(a) Generate an episode using $\pi$ <br>
>(b) For each state s appearing in the episode:
>>$R \leftarrow$ return following the first occurrence of s <br>
>>Append R to *Returns(s)* <br>
>>$V(s) \leftarrow$ average(Returns(s))

In model-free reinforcement learning, as opposed to model-based, we don't know the reward function or the transition function beforehand, so we have to learn from sampled experience.<br>
In first-visit Monte Carlo, the state-value function is estimated as the average of the returns following the agent's first visit to s in each of a set of episodes.
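
The first-visit procedure above can be sketched in Python. This is a minimal illustration, not a reference implementation: the random-walk environment `generate_episode`, the discount `GAMMA = 0.9`, and the function name `first_visit_mc` are all assumptions made for the example.

```python
import random
from collections import defaultdict

GAMMA = 0.9  # discount factor (assumed value for this sketch)


def generate_episode(rng):
    """Hypothetical toy environment: a random walk on states 0..4.

    States 0 and 4 are terminal, and stepping into state 4 pays reward 1.
    The fixed policy being evaluated moves left or right with equal
    probability. Returns a list of (state, reward) pairs, where reward is
    the reward received on leaving that state.
    """
    state, episode = 2, []
    while state not in (0, 4):
        next_state = state + rng.choice((-1, 1))
        reward = 1.0 if next_state == 4 else 0.0
        episode.append((state, reward))
        state = next_state
    return episode


def first_visit_mc(num_episodes, seed=0):
    rng = random.Random(seed)
    returns = defaultdict(list)  # Returns(s): sampled returns for each state
    V = {}
    for _ in range(num_episodes):
        episode = generate_episode(rng)
        G = 0.0
        first_visit_return = {}
        # Walk backwards so G accumulates the discounted return of each step.
        for state, reward in reversed(episode):
            G = reward + GAMMA * G
            # Overwriting while moving backwards leaves the return that
            # follows the earliest (first) visit to each state.
            first_visit_return[state] = G
        for state, ret in first_visit_return.items():
            returns[state].append(ret)
            V[state] = sum(returns[state]) / len(returns[state])
    return V


if __name__ == "__main__":
    print(first_visit_mc(5000))
```

States closer to the rewarding terminal state should end up with higher estimated values, which gives a quick sanity check on the output.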

### Exploration vs Exploitation
In reinforcement learning, an agent simultaneously attempts to acquire new knowledge (called "exploration") and to optimize its decisions based on existing knowledge (called "exploitation"). The "exploration vs. exploitation tradeoff" applies to systems that want to acquire new knowledge and maximize their reward at the same time. <br/>
**Multi-Armed Bandits (MAB):** 
Bandit problems embody in essential form a conflict evident in all human action: choosing actions which yield immediate reward vs. choosing actions (e.g. acquiring information or preparing the ground) whose benefit will come only later. <br>
MAB is best understood through this analogy: A gambler at a row of slot machines has to decide which machines to play, how many times to play each machine and in which order to play them. When played, each machine provides a reward from a distribution specific to that machine. The objective is to maximize the sum of rewards earned through a sequence of lever pulls.

  ![MAB](Img/MAB.png) <br>
  Let's get formal and introduce some notation: <br>
• Index the arms by $a$, and write the probability distribution over possible rewards $r$ for arm $a$ as $p_a(r)$. <br>
• We have to find the arm with the largest mean reward $\mu_a = E_a[r]$. <br>
• In practice, the distributions $p_a(r)$ are often non-stationary. <br>

So to come up with an optimal strategy to explore and exploit so as to reap maximum rewards, the model can follow 3 strategies: 
1. **Epsilon-Decreasing with Softmax** <br>
With this strategy we explore with probability epsilon and exploit with probability 1 - epsilon, and epsilon decreases over time. When we do explore, we don't just pick an option at random. Instead we estimate the outcome of each option and pick options with a probability that grows with their estimated outcome (this is the softmax part). In other words, we try to figure out what we want to do at a young age, and then stick with it throughout our lives. Throughout high school and college we explore a variety of subjects and are open to new experiences. The older we get, the more likely we are to settle on a path, and major life or career changes become less likely. In a sense, epsilon here models our risk aversion. As we become older we become more risk-averse and less likely to explore new options, like a major career change, even if they could yield high returns.
2. **Upper-confidence bound strategy** <br>
This strategy loosely corresponds to living a very optimistic life. In addition to estimating the outcome of each option, we also calculate the confidence of our estimate. This gives us an interval of possible outcomes. Now, here's our strategy: we always pick the option with the highest possible outcome, even if that outcome is very unlikely. The intuition behind this strategy is that options with high uncertainty usually lead to a lot of new knowledge. We don't know enough about the option to accurately estimate the return, and by pursuing that option we are bound to learn more and improve our future estimates. In simulated settings this algorithm does well when we have many options with very different variances. 
3. **Contextual-Epsilon-greedy strategy** <br>
This strategy is similar to epsilon-greedy, but we choose the value of epsilon based on how critical our situation is. When we are in a critical situation (large debt, need to provide for a sick family) we will always exploit instead of explore: we do what we know works well. If we are in a situation that is not critical, we are more likely to explore new things. This strategy makes intuitive sense, but I believe that it is not commonly followed. Even in non-critical situations we often choose to keep doing what we have always done due to our risk-averse nature.
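
As one concrete illustration of the upper-confidence-bound idea, here is a minimal sketch on a simulated bandit. It assumes Bernoulli (0/1) rewards and uses the common UCB1 bonus $\sqrt{2 \ln t / N(a)}$ as the "confidence" term. Both choices, and the function name `ucb1`, are assumptions for this example rather than something the notes above specify.

```python
import math
import random


def ucb1(true_means, steps, seed=0):
    """Run UCB1 on simulated Bernoulli arms.

    true_means: success probability of each arm (hidden from the agent).
    Returns (pull counts per arm, estimated mean reward per arm).
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms
    means = [0.0] * n_arms
    for t in range(1, steps + 1):
        if t <= n_arms:
            arm = t - 1  # pull every arm once so each has an estimate
        else:
            # Optimistic score: estimated mean plus an uncertainty bonus
            # that shrinks as the arm is pulled more often.
            arm = max(
                range(n_arms),
                key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # running average
    return counts, means


if __name__ == "__main__":
    counts, means = ucb1([0.2, 0.5, 0.8], steps=2000)
    print(counts, means)
```

Rarely pulled arms keep a large bonus, so they are revisited occasionally, while most pulls should concentrate on the arm with the highest mean.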

### Monte Carlo Reinforcement Learning Tutorial
**Q Learning** <br>
* $Q$ = Quality
* $Q(s,a)$ = long-term discounted reward we expect from taking action $a$ in state $s$
* $Q(s,a)=R(s,a)+\gamma V(s')$
* $V(s)=\max_a\left(R(s,a)+\gamma V(s')\right)=\max_a Q(s,a)$
* $\pi(s)=\arg\max_a Q(s,a)$
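
To make the relationship between $Q$, $V$, and $\pi$ concrete, here is a small sketch: the value of a state is the largest Q-value available in it, and the greedy policy picks the action that attains it. The Q-table contents and the helper names `state_values` and `greedy_policy` are hypothetical.

```python
# Hypothetical Q-table: Q[state][action] -> expected long-term discounted reward.
Q = {
    "s0": {"left": 0.1, "right": 0.5},
    "s1": {"left": 0.7, "right": 0.2},
}


def state_values(Q):
    """V(s) is the best Q-value available in state s."""
    return {s: max(actions.values()) for s, actions in Q.items()}


def greedy_policy(Q):
    """pi(s) is the action whose Q-value is largest in state s."""
    return {s: max(actions, key=actions.get) for s, actions in Q.items()}


if __name__ == "__main__":
    print(state_values(Q))
    print(greedy_policy(Q))
```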

**Policy**
* Policy is a simple lookup table: state $\rightarrow$ best action
* Start with a random policy
* Play the game, use experience to improve value estimates
* Better value estimates improve policy

**Returns (G)**
* Return: the reward from our immediate action, plus all discounted future rewards from applying the current policy
* Denoted by capital G
* $G_t=r_{t+1}+\gamma G_{t+1}$
* Work in reverse from the final state applying this formula
* Once several (s, G) pairs are collected, average the returns for each state to estimate its value

**Algorithm to Calculate Returns** <br>
1. Initialize G to 0
2. states_and_returns = []
3. Loop backwards through the list of states_and_rewards (s, r):
   * append (s, G) to states_and_returns
   * $G = r+\gamma G$
4. Reverse states_and_returns to the original order  
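
The steps above translate fairly directly into Python. This sketch assumes that each reward in `states_and_rewards` is the reward received on arriving at the paired state, which is why the return is recorded before the reward is folded in and why the final state gets a return of 0. The function name `calculate_returns` and the default `gamma` are illustrative choices.

```python
def calculate_returns(states_and_rewards, gamma=0.9):
    """Turn one episode of (state, reward) pairs into (state, return) pairs."""
    G = 0.0
    states_and_returns = []
    # Walk the episode backwards so each return builds on the one after it.
    for state, reward in reversed(states_and_rewards):
        # Record G first: the return of a state excludes the reward paired
        # with that state (assumed to be the reward for arriving there).
        states_and_returns.append((state, G))
        G = reward + gamma * G
    states_and_returns.reverse()  # restore the original time order
    return states_and_returns


if __name__ == "__main__":
    episode = [("s0", 0.0), ("s1", 0.0), ("s2", 1.0)]
    print(calculate_returns(episode))
```

In the usage example, the reward of 1 for reaching `s2` becomes the return of `s1`, and is discounted once more by `gamma` for `s0`.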

**Explore / Exploit Dilemma**
* We must strike a balance between explore / exploit
* We are going to use a strategy called "epsilon greedy"
* Epsilon is the probability that our agent will choose a random action instead of following policy

**Epsilon Greedy Algorithm**
1. generate a random number p, between 0 and 1
2. if $p < (1-\varepsilon)$ take the action dictated by policy
3. otherwise take a random action
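
A minimal sketch of these three steps in Python follows. The function name `epsilon_greedy`, its parameters, and the example action set are assumptions for illustration.

```python
import random


def epsilon_greedy(policy_action, all_actions, epsilon, rng=random):
    """Return the policy's action with probability 1 - epsilon,
    otherwise a uniformly random action from all_actions."""
    p = rng.random()  # random number in [0, 1)
    if p < 1 - epsilon:
        return policy_action
    return rng.choice(all_actions)


if __name__ == "__main__":
    actions = ["up", "down", "left", "right"]
    print(epsilon_greedy("up", actions, epsilon=0.1))
```

Note that the random branch can still return the policy's own action, so the policy action is chosen slightly more often than 1 - epsilon of the time.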

**First Visit Optimization**
* What happens if we visit the same state more than once?
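* Both first-visit and every-visit Monte Carlo converge to the same values as the number of episodes grows, so ignoring subsequent visits doesn't change the answer in the limit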
* All we need is the first visit
* We throw the rest of the data away

**Monte Carlo Q Learning Algorithm** <br>
![Monte Carlo Algorithm](Img/Monte_Carlo_Algorithm.png)
    

