### 2.4 Value Function for MRP
The *state value function* $v(s)$ of an MRP is the expected return starting from state $s$. It gives the long-term value of state $s$.
$$v(s)=E[G_t \mid S_t=s]$$

### 2.5 Bellman Equation for MRP
The value function can be decomposed into 2 parts:
* Immediate reward $R_{t+1}$.
* Discounted value of successor state $\gamma v(S_{t+1})$.
$$\begin{aligned}
v(s) &= E[G_t \mid S_t=s] \\
&= E[R_{t+1}+ \gamma R_{t+2} + \gamma^2 R_{t+3}+ \dots \mid S_t=s] \\
&= E[R_{t+1}+ \gamma G_{t+1} \mid S_t=s] \\
&= E[R_{t+1}+ \gamma v(S_{t+1}) \mid S_t=s] \\
&= \mathcal R_{s} + \gamma \sum_{s^{'} \in \mathcal S} \mathcal P_{ss^{'}}v(s^{'})
\end{aligned}$$
Rewriting this in matrix form, with $v$ and $\mathcal R$ as column vectors over the states, we have:
$$v=\mathcal R + \gamma \mathcal P v$$
Because this equation is linear in $v$, it can be solved directly:
$$v=(I- \gamma \mathcal P)^{-1} \mathcal R$$
* The computational complexity is $O(n^3)$ for $n$ states.
* Direct solution is only practical for small MRPs.
* For large MRPs, iterative methods are used instead, such as dynamic programming and Monte-Carlo evaluation.
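
As a minimal sketch of both approaches, the snippet below solves the Bellman equation directly with NumPy and then by repeatedly applying the update $v \leftarrow \mathcal R + \gamma \mathcal P v$. The transition matrix, reward vector and $\gamma = 0.5$ are those of the policy-induced MRP printed in Section 3.7; the variable names, iteration cap and tolerance are illustrative assumptions.

```python
import numpy as np

# Transition matrix and reward vector of a 3-state MRP
# (values taken from the MRP printed in Section 3.7).
P = np.array([[0.12, 0.42, 0.46],
              [0.27, 0.54, 0.19],
              [0.00, 0.00, 1.00]])
R = np.array([4.4, 1.7, 0.0])
gamma = 0.5

# Direct solution: solve (I - gamma * P) v = R without forming the inverse.
v = np.linalg.solve(np.eye(len(R)) - gamma * P, R)

# Iterative alternative: apply v <- R + gamma * P v until it stops changing.
# The tolerance and iteration cap are arbitrary illustrative choices.
v_iter = np.zeros_like(R)
for _ in range(1000):
    v_next = R + gamma * P @ v_iter
    converged = np.max(np.abs(v_next - v_iter)) < 1e-12
    v_iter = v_next
    if converged:
        break

print('Direct solution:   ', v)
print('Iterative solution:', v_iter)
```

Using `np.linalg.solve` rather than an explicit inverse is a common numerical choice; both have the same cubic cost in the number of states.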

## 3. Markov Decision Process
A *Markov decision process* (MDP) is a Markov reward process with decisions. An MDP has 5 components:
* A finite set of states that satisfy the Markov property.
* A finite set of actions.
* A state transition probability matrix for each action (so $\mathcal P$ is now a 3-dimensional array):
$$\mathcal P_{ss^{'}}^a=P(S_{t+1}=s^{'} \mid S_t=s, A_t=a)$$
* A reward function:
$$\mathcal R_{s}^a=E[R_{t+1} \mid S_t=s, A_t=a]$$
* A discount factor.

### 3.1 Policies
A *policy* $\pi$ is a distribution over actions given states:
$$\pi(a \mid s)=P[A_t=a \mid S_t=s]$$
* A policy fully defines the behaviour of an agent.
* MDP policies depend on the current state (not the history).
* Given an MDP $\mathcal M = \langle \mathcal S, \mathcal A, \mathcal P, \mathcal R, \gamma \rangle$ and a policy $\pi$:
  * The state sequence $S_1, S_2, \dots$ is a Markov process $\langle \mathcal S, \mathcal P^\pi \rangle$.
  * The state and reward sequence $S_1, R_2, S_2, \dots$ is a Markov reward process $\langle \mathcal S, \mathcal P^\pi, \mathcal R^\pi, \gamma \rangle$, where
$$\mathcal P_{ss^{'}}^\pi=\sum_{a \in \mathcal A} \pi(a \mid s) \mathcal P_{ss^{'}}^a$$
$$\mathcal R_s^\pi=\sum_{a \in \mathcal A} \pi(a \mid s) \mathcal R_s^a$$
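
The averaging over actions can be sketched with NumPy as below. The numbers are the example MDP and policy from Section 3.7; the dense array layout (`P[s, a, s2]`, `R[s, a]`, `pi[s, a]`, with zeros for actions that are unavailable in a state) is an assumption made for this illustration and differs from the nested-dictionary representation used there.

```python
import numpy as np

# States 1, 2, 3 map to indices 0, 1, 2; actions 'a', 'b', 'c' map to 0, 1, 2.
# P[s, a, s2] = P(S_{t+1} = s2 | S_t = s, A_t = a); unavailable actions stay all-zero.
P = np.zeros((3, 3, 3))
P[0, 0] = [0.3, 0.6, 0.1]   # state 1, action 'a'
P[0, 1] = [0.0, 0.3, 0.7]   # state 1, action 'b'
P[0, 2] = [0.2, 0.4, 0.4]   # state 1, action 'c'
P[1, 0] = [0.3, 0.6, 0.1]   # state 2, action 'a'
P[1, 2] = [0.2, 0.4, 0.4]   # state 2, action 'c'
P[2, 1] = [0.0, 0.0, 1.0]   # state 3, action 'b'

# R[s, a] = expected immediate reward; entries for unavailable actions are unused.
R = np.array([[5.0, 4.0, -6.0],
              [5.0, 0.0, -6.0],
              [0.0, 0.0,  0.0]])

# pi[s, a] = pi(a | s); each row sums to 1 and puts zero mass on unavailable actions.
pi = np.array([[0.4, 0.6, 0.0],
               [0.7, 0.0, 0.3],
               [0.0, 1.0, 0.0]])

# P^pi[s, s2] = sum_a pi(a|s) P[s, a, s2]  and  R^pi[s] = sum_a pi(a|s) R[s, a]
P_pi = np.einsum('sa,sat->st', pi, P)
R_pi = np.einsum('sa,sa->s', pi, R)

print(P_pi)
print(R_pi)
```

Up to floating-point rounding, the result should agree with the MRP transition matrix and reward function printed in Section 3.7.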

### 3.2 Value Function for MDP
An MDP has two value functions, one defined over states and one over state-action pairs.  
The *state-value function* $v_\pi(s)$ of an MDP is the expected return starting from state $s$ and then following policy $\pi$:
$$v_\pi(s)=E_\pi[G_t \mid S_t=s]$$
The *action-value function* $q_\pi(s,a)$ of an MDP is the expected return starting from state $s$, taking action $a$ and then following policy $\pi$:
$$q_\pi(s,a)=E_\pi[G_t \mid S_t=s, A_t=a]$$

### 3.3 Bellman Equation for MDP
Both the state-value and the action-value function can again be decomposed into immediate reward plus discounted value of the successor state:
$$v_\pi(s)=E_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t=s]$$
$$q_\pi(s,a)=E_\pi[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t=s, A_t=a]$$
* Taking the expectation over the action chosen in state $s$ gives:
$$v_\pi(s)=\sum_{a \in \mathcal A} \pi(a \mid s) q_\pi(s,a)$$
* Taking the expectation over the successor state reached after action $a$ gives:
$$q_\pi(s,a)=\mathcal R_s^a + \gamma \sum_{s^{'} \in \mathcal S} \mathcal P_{ss^{'}}^a v_\pi(s^{'})$$
* Substituting each equation into the other gives recursive expressions for $v_\pi(s)$ and $q_\pi(s,a)$:
$$v_\pi(s)=\sum_{a \in \mathcal A} \pi(a \mid s) \Big (\mathcal R_s^a + \gamma \sum_{s^{'} \in \mathcal S} \mathcal P_{ss^{'}}^a v_\pi(s^{'}) \Big)$$
$$q_\pi(s,a)=\mathcal R_s^a + \gamma \sum_{s^{'} \in \mathcal S} \mathcal P_{ss^{'}}^a \sum_{a^{'} \in \mathcal A} \pi(a^{'} \mid s^{'}) q_\pi(s^{'},a^{'})$$
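
These recursive expressions can be turned into an update rule. The sketch below applies them repeatedly (iterative policy evaluation) on a small hypothetical MDP with 2 states and 2 actions, then compares the result with the direct solution of the policy-induced MRP. All numbers, array layouts and the tolerance are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])   # P[s, a, s2]
R = np.array([[1.0, 0.0],
              [2.0, -1.0]])                # R[s, a]
pi = np.array([[0.5, 0.5],
               [0.9, 0.1]])                # pi[s, a], rows sum to 1
gamma = 0.9

v = np.zeros(2)
for _ in range(10_000):
    # q(s, a) = R(s, a) + gamma * sum_s2 P(s, a, s2) v(s2)
    q = R + gamma * P @ v
    # v(s) = sum_a pi(a | s) q(s, a)
    v_new = np.sum(pi * q, axis=1)
    converged = np.max(np.abs(v_new - v)) < 1e-12
    v = v_new
    if converged:
        break

# Recompute q from the converged v so that both satisfy the equations together.
q = R + gamma * P @ v

# Cross-check against the closed-form solution of the induced MRP.
P_pi = np.einsum('sa,sat->st', pi, P)
R_pi = np.sum(pi * R, axis=1)
v_direct = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)

print('Iterative v_pi:', v)
print('Direct v_pi:   ', v_direct)
print('q_pi:\n', q)
```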

### 3.4 Optimal Value Function
The *optimal state-value function* $v_*(s)$ is the maximum state-value function over all policies:
$$v_*(s) = \max _\pi v_\pi(s)$$
The *optimal action-value function* $q_*(s,a)$ is the maximum action-value function over all policies:
$$q_*(s,a) = \max _\pi q_\pi(s,a)$$
* The optimal value function specifies the best possible performance in the MDP.
* An MDP is "solved" when we know the optimal value function.

### 3.5 Optimal policy
Define a partial ordering over policies: $\pi \geq \pi^{'}$ if $v_\pi(s) \geq v_{\pi^{'}}(s), \forall s$.  
Theorem: For any MDP,
* There exists an optimal policy $\pi_*$ that is better than or equal to all other policies, $\pi_* \geq \pi, \forall \pi$.
* All optimal policies achieve the optimal state-value function, $v_{\pi_*}(s)=v_*(s)$.
* All optimal policies achieve the optimal action-value function, $q_{\pi_*}(s,a)=q_*(s,a)$.  
  
An optimal policy can be found by maximising over $q_*(s,a)$,
$$\pi_*(a \mid s)=
\begin{cases}
1& \text{if $a=\mathop{\arg\max}_{a \in \mathcal A}q_*(s,a)$}\\
0& \text{otherwise}
\end{cases}
$$
* There is always a deterministic optimal policy for any MDP.
* If we know $q_*(s,a)$, we immediately have the optimal policy.
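
Reading a deterministic policy off $q_*$ amounts to a row-wise argmax. A minimal sketch, assuming a hypothetical table `q_star[s, a]` whose numbers are made up for illustration and are not derived from any MDP in these notes:

```python
import numpy as np

# Hypothetical optimal action values q_star[s, a] for 3 states and 2 actions.
q_star = np.array([[1.0, 2.5],
                   [0.3, 0.1],
                   [4.0, 4.0]])

# Greedy action per state; np.argmax breaks ties by taking the lowest index.
best = np.argmax(q_star, axis=1)

# One-hot policy: pi_star[s, a] = 1 for the greedy action and 0 otherwise.
pi_star = np.zeros_like(q_star)
pi_star[np.arange(len(q_star)), best] = 1.0

print(best)
print(pi_star)
```

In the third state both actions are equally good, so either choice gives an optimal policy; the tie-breaking rule here is just the NumPy default.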

### 3.6 Bellman Optimality Equation
* Applying the one-step Bellman equations to an optimal policy gives:
$$v_*(s)=\max_a q_*(s,a)$$
$$q_*(s,a)=\mathcal R_s^a + \gamma \sum_{s^{'} \in \mathcal S} \mathcal P_{ss^{'}}^a v_*(s^{'})$$
* Again, substituting each equation into the other gives recursive expressions for $v_*(s)$ and $q_*(s,a)$:
$$v_*(s)=\max_a \Big (\mathcal R_s^a + \gamma \sum_{s^{'} \in \mathcal S} \mathcal P_{ss^{'}}^a v_*(s^{'}) \Big)$$
$$q_*(s,a)=\mathcal R_s^a + \gamma \sum_{s^{'} \in \mathcal S} \mathcal P_{ss^{'}}^a \max_{a^{'}} q_*(s^{'},a^{'})$$
* The Bellman optimality equation is nonlinear because of the max operator.
* So there is no closed-form solution in general.
* It is solved with iterative methods instead, such as value iteration and policy iteration.
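
One such iterative method is value iteration, which applies the Bellman optimality equation for $v_*$ as an update until the values stop changing. The sketch below runs it on a small hypothetical MDP with 2 states and 2 actions; the numbers, array layout and stopping tolerance are illustrative assumptions.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])   # P[s, a, s2]
R = np.array([[1.0, 0.0],
              [2.0, -1.0]])                # R[s, a]
gamma = 0.9

v_star = np.zeros(2)
for _ in range(10_000):
    # q(s, a) = R(s, a) + gamma * sum_s2 P(s, a, s2) v(s2)
    q = R + gamma * P @ v_star
    # v(s) = max_a q(s, a)
    v_new = q.max(axis=1)
    converged = np.max(np.abs(v_new - v_star)) < 1e-12
    v_star = v_new
    if converged:
        break

# Optimal action values and a greedy (deterministic) policy derived from them.
q_star = R + gamma * P @ v_star
greedy = q_star.argmax(axis=1)

print('v*:', v_star)
print('q*:\n', q_star)
print('Greedy action per state:', greedy)
```

Because the update is a contraction for $\gamma < 1$, the loop converges from any starting point; the zero initialisation is only a convenience.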

### 3.7 Class Design of MDP

In [1]:
from mp import MP
from mrp import MRP
from policy import Policy
from typing import Mapping, List
from utils.generic_typevars import S, A
from utils.utils import sum_dicts 

class MDP(MP):
    def __init__(self, transitions: Mapping[S, Mapping[A, Mapping[S, float]]],
                 rewards: Mapping[S, Mapping[A, float]], gamma: float) -> None:
        # transitions[s][a][s1] = P(s1 | s, a); rewards[s][a] = expected reward R(s, a)
        self.transitions = transitions
        self.states = self.get_all_states()
        self.actions = self.get_all_actions()
        self.rewards = rewards
        self.gamma = gamma
        
    def get_all_actions(self) -> List[A]:
        # Union of the action keys available in each state
        return list(set().union(*self.transitions.values()))
    
    def get_mrp(self, pol: Policy) -> MRP:
        # P^pi(s, s1) = sum_a pi(a|s) * P(s1 | s, a)
        transitions = {s: sum_dicts([{s1: p * v2 for s1, v2 in v[a].items()}
                        for a, p in pol.data[s].items()])
                        for s, v in self.transitions.items()}
        # R^pi(s) = sum_a pi(a|s) * R(s, a)
        rewards = {s: sum(p * v[a] for a, p in pol.data[s].items())
                    for s, v in self.rewards.items()}
        return MRP(transitions, rewards, self.gamma)
        
    def get_state_value_func(self, pol: Policy) -> Mapping[S, float]:
        mrp = self.get_mrp(pol)
        value_func = mrp.get_value_func()
        return {mrp.states[i]: value_func[i] for i in range(len(mrp.states))}
    
    def get_action_value_func(self, pol: Policy) -> Mapping[S, Mapping[A, float]]:
        # q_pi(s, a) = R(s, a) + gamma * sum_s1 P(s1 | s, a) * v_pi(s1)
        value_func = self.get_state_value_func(pol)
        return {s:  {a: r + self.gamma * sum(p * value_func[s1]
                for s1, p in self.transitions[s][a].items())
                for a, r in v.items()}
                for s, v in self.rewards.items()}

In [2]:
transitions = {
    1: {
        'a': {1: 0.3, 2: 0.6, 3: 0.1},
        'b': {2: 0.3, 3: 0.7},
        'c': {1: 0.2, 2: 0.4, 3: 0.4}
    },
    2: {
        'a': {1: 0.3, 2: 0.6, 3: 0.1},
        'c': {1: 0.2, 2: 0.4, 3: 0.4}
    },
    3: {
        'b': {3: 1.0}
    }
}
rewards = {
    1: {'a': 5, 'b': 4, 'c': -6},
    2: {'a': 5, 'c': -6},
    3: {'b': 0}
}
policy_data = {
    1: {'a': 0.4, 'b': 0.6},
    2: {'a': 0.7, 'c': 0.3},
    3: {'b': 1.0}
}
mdp = MDP(transitions, rewards, 0.5)
pol = Policy(policy_data)
print('States:', mdp.states, '\n')
print('Actions', mdp.actions, '\n')
print('Transition Matrix of MDP:\n', mdp.transitions, '\n')
print('Reward Function of MDP:\n', mdp.rewards, '\n')
print('Policy:\n', pol, '\n')
mrp = mdp.get_mrp(pol)
print('Transition Matrix of MRP:\n', mrp.transition_matrix, '\n')
print('Reward Function of MRP:\n', mrp.reward_func, '\n')

States: [1, 2, 3] 

Actions ['a', 'b', 'c'] 

Transition Matrix of MDP:
 {1: {'a': {1: 0.3, 2: 0.6, 3: 0.1}, 'b': {2: 0.3, 3: 0.7}, 'c': {1: 0.2, 2: 0.4, 3: 0.4}}, 2: {'a': {1: 0.3, 2: 0.6, 3: 0.1}, 'c': {1: 0.2, 2: 0.4, 3: 0.4}}, 3: {'b': {3: 1.0}}} 

Reward Function of MDP:
 {1: {'a': 5, 'b': 4, 'c': -6}, 2: {'a': 5, 'c': -6}, 3: {'b': 0}} 

Policy:
 {1: {'a': 0.4, 'b': 0.6}, 2: {'a': 0.7, 'c': 0.3}, 3: {'b': 1.0}} 

Transition Matrix of MRP:
 [[0.12 0.42 0.46]
 [0.27 0.54 0.19]
 [0.   0.   1.  ]] 

Reward Function of MRP:
 [4.4, 1.7000000000000002, 0.0] 

