# Markov Property
> "The future is independent of the past given the present"

### Definition

A state ${S_t}$ is **Markov** if and only if 

$${{\rm P}\ [\ S_{t+1}\ |\ S_{t}\ ]\ =\ {\rm P}\ [\ S_{t+1}\ |\  S_{1},\ S_{2},\ \dots,\ S_{t}\ ] }$$

### State Transition Matrix

For a Markov state $s$ and successor state $s'$, the **state transition probability** is defined by ${P_{ss'} = {\rm P}\ [\ S_{t+1}\ =\ s'\ |\ S_{t}\ =\ s\ ] }$ 

The state transition matrix $P$ defines the transition probabilities from all states $s$ to all successor states $s'$, so each row of the matrix sums to 1.

$${P = \begin{bmatrix}p_{11} & ... & p_{1n}\\ ... & ... & ... \\ p_{n1} & ... & p_{nn}\end{bmatrix} }$$
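
As a quick sanity check, each row of $P$ is a probability distribution over successor states, so its entries must be non-negative and sum to 1. A minimal sketch with NumPy follows; the 3-state matrix and the helper name `is_transition_matrix` are illustrative assumptions, not part of the original notes.

```python
import numpy as np

# Hypothetical 3-state transition matrix: P[s, s2] = P[S_{t+1} = s2 | S_t = s]
P = np.array([
    [0.7, 0.2, 0.1],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],   # absorbing state: stays put with probability 1
])

def is_transition_matrix(P, tol=1e-12):
    """Return True if P is square, non-negative, and every row sums to 1."""
    P = np.asarray(P, dtype=float)
    return (
        P.ndim == 2
        and P.shape[0] == P.shape[1]
        and bool(np.all(P >= 0))
        and bool(np.allclose(P.sum(axis=1), 1.0, atol=tol))
    )

print(is_transition_matrix(P))
```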

--------



# Markov Process

A Markov Process (or Markov Chain) is a tuple  ${\big(S,\ P\big) }$

1. $S$ is a (finite) set of states
2. $P$ is a state transition probability matrix

#### Example

<img src="img/mp_example.png", style="width: 500px;">
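
A Markov process can be simulated by repeatedly drawing the next state from the row of $P$ that belongs to the current state. The sketch below assumes the student Markov chain with the same transition probabilities as the matrix built later in this notebook; the dictionary layout and the function name `sample_episode` are illustrative choices.

```python
import random

# Transition probabilities of the student Markov chain (same numbers as the
# transition matrix built later in this notebook).
transitions = {
    "Class_1":  {"Class_2": 0.5, "Facebook": 0.5},
    "Class_2":  {"Class_3": 0.8, "Sleep": 0.2},
    "Class_3":  {"Pub": 0.4, "Pass": 0.6},
    "Facebook": {"Class_1": 0.1, "Facebook": 0.9},
    "Pub":      {"Class_1": 0.2, "Class_2": 0.4, "Class_3": 0.4},
    "Pass":     {"Sleep": 1.0},
    "Sleep":    {"Sleep": 1.0},
}

def sample_episode(start, terminal="Sleep", rng=random, max_steps=10_000):
    """Sample states from `start` until `terminal` is reached (or max_steps)."""
    episode = [start]
    state = start
    for _ in range(max_steps):
        if state == terminal:
            break
        successors = list(transitions[state])
        weights = list(transitions[state].values())
        # Draw the next state according to the current state's row of P.
        state = rng.choices(successors, weights=weights, k=1)[0]
        episode.append(state)
    return episode

rng = random.Random(0)  # fixed seed so the run is repeatable
print(sample_episode("Class_1", rng=rng))
```

Because the next state is drawn using only the current state, the sampled sequence has the Markov property by construction.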

------

## Markov Reward Process

A Markov reward process (MRP) is a **Markov chain with values**. It is a tuple  ${\big(S,\ P,\ R,\ \gamma\big) }$

1. $S$ is a (finite) set of states
2. $P$ is a state transition probability matrix
3. $R$ is a reward function, ${R_s\ = {\rm E}\ [\ R_{t+1}\ |\ S_t\ =\ s \ ]}$
4. $\gamma$ is a discount factor, ${ \gamma \in [\ 0,\ 1\ ]}$

> - A reward signal defines the goal in a reinforcement learning problem. 

> - On each time step, the ***environment*** sends to the reinforcement learning agent a single number. 

> - A Reward is a ***scalar feedback signal***

> - Reward signals may be stochastic functions of the state of the environment and the actions taken


#### Example
<img src="img/mrp_example.png", style="width: 500px;">

### Return

The return $G_t$ is the **total discounted reward from time-step $t$**

$${G_t\ = \ R_{t+1}\ + \ \gamma R_{t+2}\ +\ ......\ =\  \sum_{k=0}^\infty \gamma^k R_{t+k+1}}$$

> The value of receiving reward $R$ after $k + 1$ time-steps is $\gamma^k R$
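
For a finite reward sequence the return can be computed directly from the definition. The sketch below is one possible way to do it; the function name `discounted_return`, the chosen episode, and $\gamma = 0.5$ are assumptions made for illustration, while the rewards are those of the student MRP.

```python
def discounted_return(rewards, gamma):
    """Compute G_t = sum_k gamma^k * R_{t+k+1} for a finite reward sequence."""
    g = 0.0
    # Work backwards so that G_t = R_{t+1} + gamma * G_{t+1} at every step.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Rewards collected along Class_1 -> Class_2 -> Class_3 -> Pass -> Sleep
rewards = [-2.0, -2.0, -2.0, 10.0]
print(discounted_return(rewards, gamma=0.5))  # -2 - 1 - 0.5 + 1.25 = -2.25
```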

### Value Function

The state value function $v(s)$ of an MRP is the expected return starting from state $s$

$${v(s)\ = {\rm E}\ [\ G_{t}\ |\ S_t\ =\ s \ ]}$$

#### Example

<img src="img/mrp_value_function_example.png", style="width: 500px;">

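Here $\gamma = 0.9$, and the successor values are $v\ (Pass) = 10$ and $v\ (Pub) \approx 1.9$:

$${
\begin{equation}
\begin{aligned}
v\ (C3) &= R_{C3} + \gamma\ \big(0.6 * v\ (Pass) + 0.4 * v\ (Pub)\big) \\
&= -2 + 0.9 * (0.6 * 10 + 0.4 * 1.9) \\
&\approx 4.1
\end{aligned}
\end{equation} 
}$$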

----

## Bellman Equation for MRPs

The value function can be decomposed into two parts:

1. the immediate reward $R_{t+1}$
2. the discounted value of the successor state $\gamma v(S_{t+1})$

$${
\begin{equation}
\begin{aligned}
v(s)&= {\rm E}\ [\ G_t\ |\ S_t=s\ ]\\
&= {\rm E}\ [\ R_{t+1}\ + \ \gamma R_{t+2}\ + \ \gamma^2 R_{t+3}\ + ... \ |\ S_t=s\ ] \\
&= {\rm E}\ [\ R_{t+1}\ + \ \gamma (R_{t+2}\ + \ \gamma R_{t+3}\ + ... )\ |\ S_t=s\ ] \\
&= {\rm E}\ [\ R_{t+1}\ + \ \gamma G_{t+1}\ |\ S_t=s\ ] \\
&= {\rm E}\ [\ R_{t+1}\ + \ \gamma v(S_{t+1})\ |\ S_t=s\ ] \\
\end{aligned}
\end{equation} 
}$$
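
The last step replaces $G_{t+1}$ by $v(S_{t+1})$ inside the expectation. A sketch of why this is allowed, using the law of total expectation over the successor state and then the Markov property:

$${
\begin{aligned}
{\rm E}\ [\ G_{t+1}\ |\ S_t=s\ ] &= \sum_{s' \in S} {\rm P}\ [\ S_{t+1}=s'\ |\ S_t=s\ ]\ {\rm E}\ [\ G_{t+1}\ |\ S_{t+1}=s',\ S_t=s\ ] \\
&= \sum_{s' \in S} P_{ss'}\ {\rm E}\ [\ G_{t+1}\ |\ S_{t+1}=s'\ ] \\
&= \sum_{s' \in S} P_{ss'}\ v(s') \\
&= {\rm E}\ [\ v(S_{t+1})\ |\ S_t=s\ ]
\end{aligned}
}$$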



### Backup Diagram

$${ v(s)\ = {\rm E}\ [\ R_{t+1}\ + \ \gamma v(S_{t+1})\ |\ S_t=s\ ]}$$

<img src="img/mrp_backup.png",  style="width: 400px;">

$${v(s)\ = \ R_s\ + \gamma \sum_{s'  \in S} P_{ss'} v(s')}$$

#### Example

<img src="img/mrp_bellman_example.png",  style="width: 500px;">


For $v\ (C3)$ with $\gamma \ =\ 1.0$, $v\ (Pass) = 10$ and $v\ (Pub) \approx 0.8$:


$${
\begin{equation}
\begin{aligned}
v\ (C3) &= R_{C3} + \gamma\ \big(0.6 * v\ (Pass) + 0.4 * v\ (Pub)\big) \\
&= -2 + 1.0 * (0.6 * 10 + 0.4 * 0.8) \\
&\approx 4.3
\end{aligned}
\end{equation} 
}$$

----

## Solving the Bellman Equation

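The Bellman equation is a linear equation. Written in matrix form, with $v$ and $R$ as column vectors holding one entry per state, it can be solved directly: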

$${
\begin{equation}
\begin{aligned}
v&=\ R\ +\ \gamma P v \\
(I\ -\ \gamma P)\ v&=\ R \\
v& =\  (I\ -\ \gamma P)^{-1} R
\end{aligned}
\end{equation}
}$$

1. The computational complexity is $O(n^3)$ for $n$ states, which becomes prohibitive as $n$ grows
2. **Direct solution is only possible for small MRPs**
3. There are many **iterative methods** for large MRPs, e.g.:
> * Dynamic programming
> * Monte-Carlo evaluation
> * Temporal-Difference learning
> * ......
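
To give a flavour of the iterative route, the sketch below repeatedly applies the Bellman equation $v \leftarrow R + \gamma P v$ until the values stop changing, which is the idea behind dynamic-programming evaluation. It assumes the student MRP used in this notebook with $\gamma = 0.9$; the function name `evaluate_mrp` and the tolerance are illustrative choices.

```python
import numpy as np

# Student MRP from this notebook; state order:
# Class_1, Class_2, Class_3, Facebook, Pub, Pass, Sleep
P = np.array([
    [0.0, 0.5, 0.0, 0.5, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.2],
    [0.0, 0.0, 0.0, 0.0, 0.4, 0.6, 0.0],
    [0.1, 0.0, 0.0, 0.9, 0.0, 0.0, 0.0],
    [0.2, 0.4, 0.4, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
])
R = np.array([-2.0, -2.0, -2.0, -1.0, 1.0, 10.0, 0.0])

def evaluate_mrp(P, R, gamma, tol=1e-10, max_iters=100_000):
    """Iterate v <- R + gamma * P v until the largest update is below tol."""
    v = np.zeros_like(R)
    for _ in range(max_iters):
        v_new = R + gamma * P @ v
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
    return v

v_iter = evaluate_mrp(P, R, gamma=0.9)
# Direct solution of (I - gamma * P) v = R, for comparison.
v_direct = np.linalg.solve(np.eye(len(R)) - 0.9 * P, R)
print(np.round(v_iter, 3))
print(np.allclose(v_iter, v_direct, atol=1e-8))
```

Each sweep costs one matrix-vector product, so for large or sparse $P$ this can be much cheaper than inverting $I - \gamma P$.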

-----

## Example of solving Bellman Equation directly

In [1]:
import numpy as np

In [74]:
# States and their corresponding rewards
states = ["Class_1", "Class_2", "Class_3", "Facebook", "Pub", "Pass", "Sleep"]
rewards = [-2.0, -2.0, -2.0, -1.0, 1.0 ,10.0, 0.0]

states_index = dict(zip(states, range(len(states))))
immdiate_reward = dict(zip(states, rewards))

In [68]:
states_index

{'Class_1': 0,
 'Class_2': 1,
 'Class_3': 2,
 'Facebook': 3,
 'Pass': 5,
 'Pub': 4,
 'Sleep': 6}

In [69]:
immdiate_reward

{'Class_1': -2.0,
 'Class_2': -2.0,
 'Class_3': -2.0,
 'Facebook': -1.0,
 'Pass': 10.0,
 'Pub': 1.0,
 'Sleep': 0.0}

In [70]:
# Transition matrix: P[i, j] is the probability of moving from state i to state j
P = np.asmatrix(np.zeros((7, 7)))
P[states_index["Class_1"], states_index["Class_2"]] = 0.5
P[states_index["Class_1"], states_index["Facebook"]] = 0.5
P[states_index["Class_2"], states_index["Class_3"]] = 0.8
P[states_index["Class_2"], states_index["Sleep"]] = 0.2
P[states_index["Class_3"], states_index["Pub"]] = 0.4
P[states_index["Class_3"], states_index["Pass"]] = 0.6
P[states_index["Facebook"], states_index["Class_1"]] = 0.1
P[states_index["Facebook"], states_index["Facebook"]] = 0.9
P[states_index["Pub"], states_index["Class_1"]] = 0.2
P[states_index["Pub"], states_index["Class_2"]] = 0.4
P[states_index["Pub"], states_index["Class_3"]] = 0.4
P[states_index["Pass"], states_index["Sleep"]] = 1.0
P[states_index["Sleep"], states_index["Sleep"]] = 1.0
P

matrix([[ 0. ,  0.5,  0. ,  0.5,  0. ,  0. ,  0. ],
        [ 0. ,  0. ,  0.8,  0. ,  0. ,  0. ,  0.2],
        [ 0. ,  0. ,  0. ,  0. ,  0.4,  0.6,  0. ],
        [ 0.1,  0. ,  0. ,  0.9,  0. ,  0. ,  0. ],
        [ 0.2,  0.4,  0.4,  0. ,  0. ,  0. ,  0. ],
        [ 0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  1. ],
        [ 0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  1. ]])

In [71]:
# Immediate reward vector (stored as a 1 x 7 row matrix)
R = np.asmatrix([
    immdiate_reward["Class_1"],
    immdiate_reward["Class_2"],
    immdiate_reward["Class_3"],
    immdiate_reward["Facebook"],
    immdiate_reward["Pub"],
    immdiate_reward["Pass"],
    immdiate_reward["Sleep"],
])
R

matrix([[ -2.,  -2.,  -2.,  -1.,   1.,  10.,   0.]])

In [72]:
# Solve for the value vector: v = (I - gamma * P)^{-1} R
# here, set gamma = 0.9
gamma = 0.9
V = (np.asmatrix(np.eye(7, 7)) - gamma * P).I * R.T

> Here, **gamma cannot be 1.0**: for the absorbing state **Sleep**, the Bellman equation reads $v_{sleep} = 0 + \gamma * v_{sleep}$. If $\gamma = 1$, this reduces to $v_{sleep} = v_{sleep}$, which holds for any value, so $I - \gamma P$ is singular and the matrix equation has no unique solution.
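
The singularity can be checked numerically, and one possible workaround for $\gamma = 1$ is sketched below: fix the value of the absorbing state at 0 and solve only for the remaining states. This is an illustration under that assumption, not the method used in the cells of this notebook; the names `rank_full`, `P_tt` and `v_transient` are hypothetical.

```python
import numpy as np

# Student MRP used in this example; state order:
# Class_1, Class_2, Class_3, Facebook, Pub, Pass, Sleep (absorbing)
P = np.array([
    [0.0, 0.5, 0.0, 0.5, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.2],
    [0.0, 0.0, 0.0, 0.0, 0.4, 0.6, 0.0],
    [0.1, 0.0, 0.0, 0.9, 0.0, 0.0, 0.0],
    [0.2, 0.4, 0.4, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
])
R = np.array([-2.0, -2.0, -2.0, -1.0, 1.0, 10.0, 0.0])

# With gamma = 1 the Sleep row of I - P is all zeros, so the matrix loses rank.
rank_full = np.linalg.matrix_rank(np.eye(7) - P)

# Drop Sleep (value fixed at 0) and solve for the six remaining states.
P_tt = P[:6, :6]
v_transient = np.linalg.solve(np.eye(6) - P_tt, R[:6])

print(rank_full)                  # a rank below 7 means no unique solution
print(np.round(v_transient, 2))
```

The value obtained for Class_3 agrees with the $\gamma = 1.0$ Bellman example earlier in the notes (about 4.3).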

In [73]:
V

matrix([[ -5.01272891],
        [  0.9426553 ],
        [  4.08702125],
        [ -7.63760843],
        [  1.90839235],
        [ 10.        ],
        [  0.        ]])

In [79]:
states_value = dict(zip(states, V.tolist()))
states_value

{'Class_1': [-5.012728910014522],
 'Class_2': [0.9426552976939075],
 'Class_3': [4.087021246797094],
 'Facebook': [-7.637608431059513],
 'Pass': [10.0],
 'Pub': [1.9083923522141462],
 'Sleep': [0.0]}

<img src="img/mrp_value_function_example.png", style="width: 500px;">

# Markov Decision Processes

A Markov decision process (MDP) is a **Markov reward process with decisions (actions)**. It is an environment in which all states are Markov.

It is a tuple  ${\big(S,\ A,\ P,\ R,\ \gamma\big) }$

1. $S$ is a finite set of states
2. $A$ is a finite set of actions
3. $P$ is a state transition probability matrix, and ${P_{ss'}^{a}\ = {\rm P} \big[\ S_{t+1}\ = \ s'\ |\ S_t\ =\ s,\ A_t\ =\ a \big]}$
4. $R$ is a reward function, ${R_{s}^{a}\ = {\rm E} \big[\ R_{t+1}\ |\ S_t\ =\ s,\ A_t\ =\ a \big]}$
5. $\gamma$ is a discount factor, ${ \gamma \in [\ 0,\ 1\ ]}$


<img src="img/mdp_concept.png", style="width: 600px;">

### Policy

A policy $\pi$ is a distribution over actions given states,

$${\pi(a\ |\ s)\ = {\rm P} \ [\ A_t\ = a\ |\ S_t\ = s\ ]}$$

1. MDP policies depend on the **current state** (not the history)
2. **A policy fully defines the behaviour** of an agent

> Given an **MDP** $M = {\big(S,\ A,\ P,\ R,\ \gamma\big) }$ and a **policy** $\pi$:

> 1. The **state sequence** $S_1,\ S_2,\ \dots$ is a **Markov process** $\langle S,\ P^\pi \rangle$

> 2. The **state and reward sequence**  $S_1,\ R_2,\ S_2,\ R_3,\ \dots$ is a **Markov reward process** $\langle S,\ P^\pi,\ R^\pi,\ \gamma \rangle$, where $${P_{ss'}^{\pi}\ =\  \sum_{a \in A} \pi (a|s) P_{ss'}^{a} }$$  and $${R_{s}^{\pi}\ =  \sum_{a \in A} \pi (a|s) R_s^a}$$
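
The two sums above are policy-weighted averages over actions, which makes them straightforward to compute with arrays. The sketch below uses a made-up MDP with 2 states and 2 actions; all numbers and the array layouts `P[a, s, s2]`, `R[s, a]` and `pi[s, a]` are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical MDP with 2 states and 2 actions (numbers are illustrative).
# P[a, s, s2] = P[S_{t+1} = s2 | S_t = s, A_t = a]
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # action 0
    [[0.5, 0.5], [0.0, 1.0]],   # action 1
])
# R[s, a] = E[R_{t+1} | S_t = s, A_t = a]
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
# pi[s, a] = probability of choosing action a in state s
pi = np.array([[0.5, 0.5],
               [1.0, 0.0]])

# Average the per-action dynamics and rewards under the policy.
P_pi = np.einsum("sa,ast->st", pi, P)   # P_pi[s, s2] = sum_a pi[s, a] * P[a, s, s2]
R_pi = (pi * R).sum(axis=1)             # R_pi[s]     = sum_a pi[s, a] * R[s, a]

print(P_pi)
print(R_pi)
```

The resulting pair `(P_pi, R_pi)` is an ordinary MRP, so the MRP tools from the previous sections apply to it unchanged.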

### Value Function

#### State-Value function

The state-value function $v_{\pi}(s)$ of an MDP is the **expected return starting from state $s$, and then following policy $\pi$**

$${v_{\pi}(s)\ = {\rm E}_{\pi} [\ G_t\ |\ S_t\ =\ s]}$$

#### Action-Value function

The action-value function $q_{\pi}(s,\ a)$ is the **expected return starting from state $s$,** ***taking action $a$***, **and then following policy $\pi$**

$${q_{\pi}(s,\ a)\ = {\rm E}_{\pi} [\ G_t\ |\ S_t\ =\ s,\ A_t\ =\ a]}$$

#### Example

<img src="img/mdp_state_value_example.png",  style="width: 500px;">

----

## Bellman Expectation Equation

### State-value function

$${v_{\pi}(s)\ = {\rm E_{\pi}} \ [\  R_{t+1}\ +\ \gamma v_{\pi}(S_{t+1})\ |\ S_{t}\ =\ s \ ]}$$

### Action-value function

$${q_{\pi}(s, a)\ = {\rm E_{\pi}} \ [\  R_{t+1}\ +\ \gamma q_{\pi}(S_{t+1},\ A_{t+1})\ |\ S_{t}\ =\ s,\ A_{t}\ =\ a \ ]}$$

### Backup diagram for $v(s)$

<img src="img/bellman_backup_v.png",  style="width: 400px;">

so

$${v_{\pi}(s)\ =  \sum_{a \in A} \pi (a\ |\ s)\ q_{\pi}(s,\ a) }$$

### Backup diagram for $q(s,\ a)$

<img src="img/bellman_backup_q.png",  style="width: 400px;">

so

$${q_{\pi}(s,\ a)\ = R_{s}^{a}\ +\ \gamma \sum_{s' \in S} P_{ss'}^{a}\ v_{\pi}(s')}$$

### Backup diagram for $v(s)$ again

<img src="img/bellman_backup_vq.png",  style="width: 400px;">

so

$${v_{\pi}(s)\ =  \sum_{a \in A} \pi (a\ |\ s)\ \big(R_{s}^{a}\ +\ \gamma \sum_{s' \in S} P_{ss'}^{a}\ v_{\pi}(s')\big) }$$

### Backup diagram for $q(s,\ a)$ again

<img src="img/bellman_backup_qv.png",  style="width: 400px;">

so

$${q_{\pi}(s,\ a)\ = R_{s}^{a}\ +\ \gamma \sum_{s' \in S} P_{ss'}^{a}\ \sum_{a' \in A} \pi (a'\ |\ s')\ q_{\pi}(s',\ a')}$$
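
The four backup equations can be checked numerically against each other. The sketch below evaluates a fixed policy on a made-up MDP with 2 states and 2 actions by solving the linear system of the induced MRP, then derives $q_{\pi}$ from $v_{\pi}$ and verifies the relations. All numbers and array layouts are illustrative assumptions.

```python
import numpy as np

# Hypothetical MDP with 2 states and 2 actions (numbers are illustrative).
# P[a, s, s2] = P[S_{t+1} = s2 | S_t = s, A_t = a]
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # action 0
    [[0.5, 0.5], [0.0, 1.0]],   # action 1
])
R = np.array([[1.0, 0.0],       # R[s, a]
              [0.0, 2.0]])
pi = np.array([[0.5, 0.5],      # pi[s, a]
               [1.0, 0.0]])
gamma = 0.9

n_states = R.shape[0]
P_pi = np.einsum("sa,ast->st", pi, P)   # policy-averaged transitions
R_pi = (pi * R).sum(axis=1)             # policy-averaged rewards

# v_pi solves the linear system (I - gamma * P_pi) v = R_pi.
v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

# q_pi(s, a) = R_s^a + gamma * sum_s2 P[a, s, s2] * v_pi(s2)
q = R + gamma * np.einsum("ast,t->sa", P, v)

# Consistency check: v_pi(s) = sum_a pi(a|s) * q_pi(s, a)
print(np.allclose(v, (pi * q).sum(axis=1)))
```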

-----

## Optimal Value Function

The optimal state-value function $v_{*}(s)$ is the maximum value function over all policies

$${v_{*}(s)\ =\ \max_{\pi} v_{\pi}(s) }$$


The optimal action-value function $q_{*}(s,\ a)$ is the maximum action-value function over all policies

$${q_{*}(s,\ a)\ =\ \max_{\pi} q_{\pi}(s,\ a) }$$

> The optimal value function specifies the best possible performance in the MDP

----

## Optimal Policy

### Theorem

For any Markov Decision Process:
1. There exists an optimal policy $\pi_{*}$ that is **better than or equal to all other policies**,  $\pi_{*}\  \geq \ \pi,\ \forall \pi$
2. All optimal policies achieve the **optimal value function**, $v_{\pi_{*}}(s)\ =\ v_{*}(s)$
3. All optimal policies achieve the **optimal action-value function**, $q_{\pi_{*}}(s,\ a)\ =\ q_{*}(s,\ a)$


### Finding an Optimal Policy

An optimal policy can be found by maximising over $q_{*}(s,\ a)$,

$${ \pi_{*}(a\ |\ s)\ =\begin{cases}1 & \text{if } a\ = \mathop{\arg\max}\limits_{a \in A}\ q_{*}(s,\ a)\\0 & \text{otherwise}\end{cases} }$$

> There is always **a deterministic optimal policy for any MDP**

> If we know $q_{*}(s,\ a)$, we immediately have the **optimal policy**
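
Reading the optimal policy off $q_{*}$ amounts to a row-wise argmax. A minimal sketch, assuming a made-up table `q_star[s, a]` for 3 states and 2 actions; the values and the function name `greedy_policy` are hypothetical.

```python
import numpy as np

# Hypothetical optimal action values, q_star[s, a], for 3 states and 2 actions.
q_star = np.array([[1.0, 2.5],
                   [0.3, 0.1],
                   [4.0, 4.0]])   # tie: argmax picks the first maximiser

def greedy_policy(q):
    """Return a one-hot policy pi[s, a] with probability 1 on argmax_a q[s, a]."""
    best = np.argmax(q, axis=1)
    pi = np.zeros_like(q)
    pi[np.arange(q.shape[0]), best] = 1.0
    return pi

print(greedy_policy(q_star))
```

When several actions tie, any of them is optimal; this sketch simply keeps the first one.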

## Bellman Optimality Equation

### Bellman Optimality Equation for $v_{*}(s)$

<img src="img/bellman_optimality_v_origin.png",  style="width: 400px;">

so

$${v_{*}(s)\ =\ \max \limits_{a} q_{*}(s,\ a)  }$$


### Bellman Optimality Equation for $q_{*}(s,\ a)$

<img src="img/bellman_optimality_q_origin.png",  style="width: 400px;">

so

$${q_{*}(s,\ a)\ =\ R_{s}^a\ + \gamma \sum_{s' \in S} P_{ss'}^a v_{*}(s')    }$$


### Bellman Optimality Equation for $v_{*}(s)$ again

<img src="img/bellman_optimality_v.png",  style="width: 400px;">

so

$${v_{*}(s)\ =\ \max \limits_{a} q_{*}(s,\ a)\ =  \max \limits_{a} \ \big(\ R_{s}^a\ + \gamma \sum_{s' \in S} P_{ss'}^a v_{*}(s')\ \big)}$$

### Bellman Optimality Equation for $q_{*}(s,\ a)$ again

<img src="img/bellman_optimality_q.png",  style="width: 400px;">

so

$${q_{*}(s,\ a)\ =\ R_{s}^a\ + \gamma \sum_{s' \in S} P_{ss'}^a \max \limits_{a'} q_{*}(s',\ a')    }$$

## Example

<img src="img/bellman_optimality_example.png",  style="width: 500px;">

----

## Solving the Bellman Optimality Equation

> The Bellman optimality equation is **non-linear** because of the $\max$, so in general it has no closed-form solution

Many iterative solution methods:
1. Value Iteration
2. Policy Iteration
3. Q-learning
4. Sarsa
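
As an illustration of the first method only, the sketch below runs value iteration: it applies the Bellman optimality equation for $v_{*}$ as an update rule until the values settle, then reads off a greedy policy. The MDP (2 states, 2 actions), its numbers, the array layouts and the function name `value_iteration` are all assumptions made for this example.

```python
import numpy as np

# Hypothetical MDP with 2 states and 2 actions (numbers are illustrative).
# P[a, s, s2] = P[S_{t+1} = s2 | S_t = s, A_t = a]
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # action 0
    [[0.5, 0.5], [0.0, 1.0]],   # action 1
])
R = np.array([[1.0, 0.0],       # R[s, a]
              [0.0, 2.0]])
gamma = 0.9

def value_iteration(P, R, gamma, tol=1e-10, max_iters=100_000):
    """Repeat v(s) <- max_a [R[s, a] + gamma * sum_s2 P[a, s, s2] * v(s2)]."""
    v = np.zeros(R.shape[0])
    for _ in range(max_iters):
        q = R + gamma * np.einsum("ast,t->sa", P, v)
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            break
        v = v_new
    # Recompute q from the final values so that v and q are consistent.
    q = R + gamma * np.einsum("ast,t->sa", P, v_new)
    return v_new, q

v_star, q_star = value_iteration(P, R, gamma)
print(np.round(v_star, 3))
print(np.argmax(q_star, axis=1))   # greedy (deterministic) action per state
```

Value iteration needs the model ($P$ and $R$); Q-learning and Sarsa instead estimate $q$ from sampled transitions.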

----