# Markov Decision Process

MDPs are a classical formalization of sequential decision making, where actions influence not just *immediate rewards*, but also *subsequent situations*, or *states*, and through those influence future rewards.

A state is said to have the __Markov property__ if it captures all the information from past states that matters for the future, so that we can make decisions relying only on the current state.


## Agent and Environment

In MDPs the learner/decision maker is called the __Agent__.

Everything outside the agent in the world is the __Environment__.

Generally, anything that the *agent* cannot change arbitrarily is considered part of the *environment*. For example, a robot's body and even its actuators are not part of the agent; they belong to the *environment*. The signals sent to the actuators are controlled by the agent, but the actuators can behave noisily, which is why they are part of the *environment* and not the *agent*.

The *agent* and the *environment* interact continually. The *agent* selects actions, and the *environment* returns subsequent states and the reward for selecting an action. This process continues until termination.

The trajectory looks like: $S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \ldots$
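
The interaction loop above can be sketched in a few lines of Python. This is only an illustration: `CoinGuessEnv`, its `reset`/`step` methods and the three-step horizon are hypothetical names and choices made up for this sketch, not part of any library.

```python
import random


class CoinGuessEnv:
    """Hypothetical episodic environment, invented for this sketch."""

    def __init__(self, seed=0, horizon=3):
        self.rng = random.Random(seed)
        self.horizon = horizon

    def reset(self):
        self.t = 0
        return 0  # initial state S_0

    def step(self, action):
        self.t += 1
        next_state = self.rng.choice([0, 1])           # environment picks S_{t+1}
        reward = 1.0 if action == next_state else 0.0  # and the reward R_{t+1}
        done = self.t >= self.horizon                  # terminate after `horizon` steps
        return next_state, reward, done


def run_episode(env, policy):
    """Collect (S_t, A_t, R_{t+1}) triples until the episode terminates."""
    trajectory = []
    state, done = env.reset(), False
    while not done:
        action = policy(state)                       # the agent selects A_t
        next_state, reward, done = env.step(action)  # the environment responds
        trajectory.append((state, action, reward))
        state = next_state
    return trajectory


# The agent here simply repeats the current state as its guess.
trajectory = run_episode(CoinGuessEnv(), policy=lambda state: state)
```

Each tuple in `trajectory` corresponds to one $S_t, A_t, R_{t+1}$ step of the sequence shown above.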

## Finite MDPs

* An MDP is *finite* when the sets of states, actions and rewards are all finite.
* We can define the *dynamics* of a finite MDP as:
  $$ p(s',r|s,a) \doteq Pr\left \{ S_t=s', R_t=r | S_{t-1}=s, A_{t-1}=a \right \} $$
  
  It gives us the probability of landing in state $s'$ with reward $r$, given that the agent took action $a$ in state $s$ at the previous timestep.
  

* Because $p$ is a probability distribution over pairs $(s', r)$ for every state and action, the dynamics sum to 1:
  $$\sum_{s'\in S} \sum_{r\in R} p(s',r|s,a) = 1,\ \ \forall s\in S, a\in A(s) $$
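
One possible way to store the dynamics of a small finite MDP is a dictionary keyed by $(s, a)$. The states, actions and probabilities below are made up for illustration, and `is_valid_dynamics` is a hypothetical helper that checks the normalization property above.

```python
# Hypothetical two-state MDP, invented for illustration.
# dynamics[(s, a)] maps each outcome (s', r) to the probability p(s', r | s, a).
dynamics = {
    ("low", "wait"): {("low", 0.0): 1.0},
    ("low", "charge"): {("high", 0.0): 0.9, ("low", -1.0): 0.1},
    ("high", "wait"): {("high", 1.0): 0.8, ("low", 1.0): 0.2},
    ("high", "charge"): {("high", 0.0): 1.0},
}


def is_valid_dynamics(dynamics, tol=1e-9):
    """Check that p(., . | s, a) sums to 1 for every state-action pair."""
    return all(
        abs(sum(outcomes.values()) - 1.0) <= tol
        for outcomes in dynamics.values()
    )
```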
  
## Goals *vs* Rewards

A __reward__ is a number that the environment sends to the agent at each timestep.

The agent's __goal__, or purpose, is to maximize the *total reward* it receives.

It is important that the rewards are set up to encourage the desired behavior itself, and not to impart prior knowledge about how to achieve it.
\
If rewards are not set up properly, the agent might maximize reward without learning what we want it to.

## Episodes and Returns

Tasks that terminate are called *episodic tasks*. Each iteration of this task is called an *Episode*.
\
In contrast, there are *continuing tasks* that do not terminate.

In an episodic task that terminates at timestep $T$, the *return* $G_t$ at timestep $t$ can be written as:

$$ G_t \doteq R_{t+1} + R_{t+2} + R_{t+3} + \ldots + R_{T} $$

In continuing tasks, there is no termination, so adding up rewards as shown above can give an infinite return. We therefore introduce a __discount factor $\gamma$__ and write the return at timestep $t$ as:

$$ G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots $$

$$ G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} $$

The discount factor (or discount rate) $\gamma \in [0,1]$ determines the present value of future rewards: the reward $R_{t+k+1}$ is weighted by $\gamma^k$.
\
$\gamma = 0$ means the agent cares only about the immediate reward $R_{t+1}$.
\
$\gamma = 1$ means future rewards count as much as immediate ones. In continuing tasks this can make $G_t$ infinite, so a value slightly below 1, such as 0.99, is used instead.
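
As a small worked example, the discounted sum can be computed directly from a list of rewards. `discounted_return` is a hypothetical helper name, and the list is assumed to hold $R_{t+1}, R_{t+2}, \ldots$ in order.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k], where rewards[k] stands for R_{t+k+1}."""
    return sum(gamma ** k * reward for k, reward in enumerate(rewards))


# Three rewards of 1 with gamma = 0.5: 1 + 0.5 + 0.25
g = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

With a constant reward of 1 at every step, the infinite sum equals $1/(1-\gamma)$, which is finite for any $\gamma < 1$ (for example, 100 when $\gamma = 0.99$).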

## Policies

A __policy__ is a mapping from *states* to probabilities of selecting each possible *action*.

If an agent is following a policy $\pi$, then $\pi(a|s)$ is the probability that action $a$ is selected while in state $s$.

There are two kinds of policies:
* __Deterministic policy__: maps each state to a single action, written $\pi(s) = a$.
* __Stochastic policy__: maps each state to a probability distribution over actions, written $\pi(a|s)$.
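
A minimal sketch of the two kinds of policy as Python dictionaries follows. The state and action names are made up, and `sample_action` and `as_stochastic` are hypothetical helpers written for this example.

```python
import random

# Stochastic policy: state -> {action: probability}, i.e. pi(a | s).
stochastic_policy = {
    "low": {"wait": 0.3, "charge": 0.7},
    "high": {"wait": 1.0},
}

# Deterministic policy: state -> action, i.e. pi(s) = a.
deterministic_policy = {"low": "charge", "high": "wait"}


def sample_action(policy, state, rng):
    """Draw an action from the distribution pi(. | state)."""
    actions = list(policy[state])
    weights = [policy[state][action] for action in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


def as_stochastic(det_policy):
    """A deterministic policy is a stochastic one with probability 1 on one action."""
    return {state: {action: 1.0} for state, action in det_policy.items()}
```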


## Value functions

There are two kinds of *value functions*: 
* __State value functions $v_{\pi}(s)$ $-$__ estimate "how good" it is to be in a particular state. 
* __Action value functions $q_{\pi}(s,a)$ $-$__ estimate "how good" it is to choose an action while in a state.

In other words, both are expected returns:

$$ v_{\pi}(s) \doteq \mathop{{}\mathbb{E}_{\pi}} \left[ G_t | S_t = s \right ] $$

$$ q_{\pi}(s,a) \doteq \mathop{{}\mathbb{E}_{\pi}} \left[ G_t | S_t = s, A_t = a \right ] $$

\* the $\pi$ subscript in $v_{\pi}(s)$, $q_{\pi}(s,a)$ and $\mathop{{}\mathbb{E}_{\pi}}$ means "*while following the policy $\pi$*" or "*under policy $\pi$*".

\* In the *bandit problem* we estimated the value $q(a)$ of each action $a$.
\
$\ \ $ In MDPs, we estimate the value $q(s, a)$ of each action $a$ in each state $s$, or we estimate the value $v(s)$ of each state given optimal action selections.

## Bellman equations

The value functions can be written recursively. We start with the return $G_t$, which can be written recursively as:

$$ G_t = R_{t+1} + \gamma G_{t+1} $$
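
To spell out this step, assuming the discounted definition of $G_t$ from the section on returns, factor $\gamma$ out of every term after the first:

$$ G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots = R_{t+1} + \gamma \left ( R_{t+2} + \gamma R_{t+3} + \ldots \right ) = R_{t+1} + \gamma G_{t+1} $$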

So, $v_{\pi}(s)$ can be written as:
$$ v_{\pi}(s) = \mathop{{}\mathbb{E}_{\pi}} \bigl[ R_{t+1} + \gamma G_{t+1} | S_t = s \bigr ] $$

Expanding the expectation for all *actions*, and the resulting *rewards* and *subsequent states*, we get:

$$ v_{\pi}(s) = \sum_a \pi (a|s) \sum_{s'} \sum_r p(s',r | s,a) \bigl [ r + \gamma \mathop{{}\mathbb{E}_{\pi}} \left[ G_{t+1} | S_{t+1} = s' \right ] \bigr ] $$

$$ v_{\pi}(s) = \sum_a \pi (a|s) \sum_{s',r} p(s',r | s,a) \left [ r + \gamma v_{\pi}(s') \right ] $$

Similarly for $q_{\pi}(s,a)$ we have:
$$ q_{\pi}(s,a) \doteq \mathop{{}\mathbb{E}_{\pi}} \left[ G_t | S_t = s, A_t = a \right ] $$

$$ q_{\pi}(s,a) = \sum_{s'} \sum_r p(s',r | s,a) \left [ r + \gamma \mathop{{}\mathbb{E}_{\pi}} \left [ G_{t+1} | S_{t+1} = s' \right ] \right ] $$

$$ q_{\pi}(s,a) = \sum_{s',r} p(s',r | s,a) \left [ r + \gamma \sum_{a'} \pi (a'|s') q_{\pi}(s',a') \right ] $$
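
One common way to use the Bellman equation for $v_{\pi}$ is to apply it repeatedly as an update until the values stop changing. This technique, usually called iterative policy evaluation, is not covered in the text above. The sketch below assumes a made-up two-state MDP and a uniform random policy; `evaluate_policy` and the data layout are hypothetical choices for this example.

```python
# Hypothetical MDP: dynamics[(s, a)] is a list of (probability, next_state, reward).
dynamics = {
    ("A", "stay"): [(1.0, "A", 0.0)],
    ("A", "go"): [(0.8, "B", 1.0), (0.2, "A", 0.0)],
    ("B", "stay"): [(1.0, "B", 2.0)],
    ("B", "go"): [(1.0, "A", 0.0)],
}

# Uniform random policy: policy[s][a] = pi(a | s).
policy = {
    "A": {"stay": 0.5, "go": 0.5},
    "B": {"stay": 0.5, "go": 0.5},
}


def evaluate_policy(dynamics, policy, gamma=0.9, tol=1e-10):
    """Sweep the Bellman equation for v_pi until the largest change is below tol."""
    v = {s: 0.0 for s in policy}
    while True:
        delta = 0.0
        for s in policy:
            # sum_a pi(a|s) sum_{s',r} p(s',r|s,a) [r + gamma * v(s')]
            new_v = sum(
                pi_a * sum(p * (r + gamma * v[s2]) for p, s2, r in dynamics[(s, a)])
                for a, pi_a in policy[s].items()
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            return v


v_pi = evaluate_policy(dynamics, policy)
```

With $\gamma < 1$ the sweeps converge, and the returned values satisfy the Bellman equation up to the chosen tolerance.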

## Optimal Policies

__Optimal Policies__ are those that have the highest possible value function in all states.

* A policy is better than or equal to another policy if and only if its expected return is greater than or equal to that of the other policy in all states:
  $\pi \geq \pi' \text{ iff } v_{\pi}(s) \geq v_{\pi'}(s)\ \ \forall s \in S$
* There exists at least one optimal policy.
* If one policy gives the higher return in some states and another policy gives the higher return in the remaining states, then a new policy that follows the better of the two in each state is at least as good as both. Repeating this argument is one way to see why an optimal policy exists.
  
## Optimal value functions

__Optimal value functions__ are the value functions of optimal policies: they give the highest expected return achievable by any policy.

* All optimal policies share the same value functions:

  $$ v_*(s) \doteq \max_{\pi}\ v_{\pi}(s) $$

  $$ q_*(s,a) \doteq \max_{\pi}\ q_{\pi}(s,a) $$
  
* We can write $q_*$ in terms of $v_*$:

  $$ q_*(s,a) = \mathop{{}\mathbb{E}} \left[ R_{t+1} + \gamma v_*(S_{t+1}) | S_t = s, A_t = a \right ] $$


* We can also write $v_*$ in terms of $q_*$:

  $$ v_*(s) = \max_{a \in A(s)}\ q_*(s,a) $$
  

* We can obtain optimal policies from optimal value functions as follows:
  - From *state value* functions
  $$ \pi_*(s) = \underset{a}{\arg\max}\ \sum_{s',r}p(s',r|s,a) \left [ r + \gamma v_*(s') \right ] $$

  - From *action value* functions
  $$ \pi_*(s) = \underset{a}{\arg\max}\ q_*(s,a) $$
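
Both extraction rules can be sketched as follows. `greedy_from_q` and `greedy_from_v` are hypothetical helper names, the dynamics layout (a list of `(probability, next_state, reward)` per state-action pair) is an assumption of this example, and the numbers in `q_example` are placeholders rather than computed optimal values.

```python
def greedy_from_q(q):
    """q maps state -> {action: value}; pick argmax_a q(s, a) in every state."""
    return {s: max(action_values, key=action_values.get)
            for s, action_values in q.items()}


def greedy_from_v(v, dynamics, gamma):
    """One-step lookahead: argmax_a sum_{s',r} p(s',r|s,a) [r + gamma * v(s')]."""
    best = {}
    for (s, a), outcomes in dynamics.items():
        value = sum(p * (r + gamma * v[s2]) for p, s2, r in outcomes)
        if s not in best or value > best[s][1]:
            best[s] = (a, value)
    return {s: a for s, (a, _) in best.items()}


# Placeholder action values, chosen only to illustrate the argmax.
q_example = {"A": {"stay": 1.0, "go": 2.5}, "B": {"stay": 4.0, "go": 0.5}}
policy_from_q = greedy_from_q(q_example)
```

Working from $q_*$ needs no model of the environment, while working from $v_*$ needs the dynamics $p$ for the one-step lookahead.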
  
## Bellman Optimality Equations

We already know the Bellman equation for $v_{\pi}(s)$:
$$ v_{\pi}(s) = \sum_a \pi (a|s) \sum_{s',r} p(s',r | s,a) \left [ r + \gamma v_{\pi}(s') \right ] $$

For $\pi_*$,
$$ v_{\pi_*}(s) = \sum_a \pi_* (a|s) \sum_{s',r} p(s',r | s,a) \left [ r + \gamma v_{\pi_*}(s') \right ] $$

An optimal policy can put probability 1 on an action that maximizes the expected value and 0 on all others, so the weighted sum over actions becomes a maximum:

$$ v_{\pi_*}(s) = \max_{a} \sum_{s',r} p(s',r | s,a) \left [ r + \gamma v_{\pi_*}(s') \right ] $$

Similarly, for $q_{\pi_*}(s,a)$ we have:
$$ q_{\pi_*}(s,a) = \sum_{s',r} p(s',r | s,a) \left [ r + \gamma \max_{a'}\ q_{\pi_*}(s',a') \right ] $$
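
The Bellman optimality equation for $v_{\pi_*}$ can also be turned into an update rule. The sketch below uses value iteration, a standard technique that is not covered in the text above, on a made-up two-state MDP. `value_iteration` and the dynamics layout are hypothetical choices for this example.

```python
# Hypothetical MDP: dynamics[(s, a)] is a list of (probability, next_state, reward).
dynamics = {
    ("A", "stay"): [(1.0, "A", 0.0)],
    ("A", "go"): [(0.8, "B", 1.0), (0.2, "A", 0.0)],
    ("B", "stay"): [(1.0, "B", 2.0)],
    ("B", "go"): [(1.0, "A", 0.0)],
}


def value_iteration(dynamics, gamma=0.9, tol=1e-10):
    """Repeat v(s) <- max_a sum_{s',r} p(s',r|s,a) [r + gamma * v(s')] until stable."""
    states = sorted({s for s, _ in dynamics})
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Expected one-step value of each action available in s, then the max.
            best = max(
                sum(p * (r + gamma * v[s2]) for p, s2, r in outcomes)
                for (state, _), outcomes in dynamics.items()
                if state == s
            )
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < tol:
            return v


v_star = value_iteration(dynamics)
```

In this toy MDP, staying in `B` forever earns 2 per step, so $v_*(B) = 2 / (1 - 0.9) = 20$, which the iteration approaches.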