# Simple MDP Design
Grzegorz Malisz
Student Number: 4852370

In [None]:
import random

## Assignment 1

### MDP definition ver 1
- A set of states $S$: continuous or discrete
- A set of actions $A$: continuous or discrete
- A set of rewards $R$: continuous or discrete
- For each state $s \in S$ there are permitted actions $a \in A(s)$
- Transition probabilities: $p(s_{t+1}, r_{t+1}|s_t, a_t)$

With this version we can calculate the joint probability of moving to a certain state $s_{t+1}$ and acquiring reward $r_{t+1}$, given that we are in state $s_t$ and take action $a_t$.
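As a minimal sketch of what a ver 1 model stores, the joint distribution for one fixed $(s_t, a_t)$ pair can be written as a table keyed by $(s_{t+1}, r_{t+1})$. The numbers and the name `joint` are made up for illustration and are not part of the assignment data.

```python
# Hypothetical ver 1 dynamics for a single (s_t, a_t) pair:
# maps (next_state, reward) -> probability. Values are illustrative only.
joint = {
    ('s1', 0): 0.25,
    ('s1', 5): 0.25,   # same next state, different reward: stochastic reward
    ('s2', 10): 0.5,
}

# For a fixed (s_t, a_t) the joint probabilities must sum to 1.
total = sum(joint.values())

# Marginalise over rewards to recover p(s_{t+1} | s_t, a_t).
next_state_probs = {}
for (next_state, _reward), prob in joint.items():
    next_state_probs[next_state] = next_state_probs.get(next_state, 0.0) + prob

# Expected immediate reward for this (s_t, a_t) pair.
expected_reward = sum(prob * reward for (_state, reward), prob in joint.items())

print(total, next_state_probs, expected_reward)
```

Note that `'s1'` appears with two different rewards, which is something a single joint table can express directly.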

### MDP definition ver 2
- A set of states $S$: continuous or discrete
- A set of actions $A$: continuous or discrete
- A set of rewards $R$: continuous or discrete
- For each state $s \in S$ there are permitted actions $a \in A(s)$
- Transition probabilities: $p(s_{t+1}|s_t, a_t)$
- Deterministic reward function: $r_{t+1} = f(s_t, s_{t+1}, a_t)$

With this version we can calculate the probability of moving to a certain state $s_{t+1}$, given that we are in state $s_t$ and take action $a_t$. We can also calculate the reward $r_{t+1}$ obtained by performing $a_t$ in $s_t$ and finishing in $s_{t+1}$.

### Assignment 1.1: MDP ver1 vs ver2
The most notable difference between the two definitions is that ver 1 directly gives the joint probability of ending up in $s_{t+1}$ and receiving $r_{t+1}$, while ver 2 first gives the probability of ending up in $s_{t+1}$ and then uses a deterministic function to compute the reward $r_{t+1}$. As a consequence, ver 2 cannot represent a stochastic reward system: for every combination of $s_t$, $a_t$, $s_{t+1}$ there is exactly one reward $r_{t+1}$.
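The relationship between the two versions can be sketched in code: any ver 2 model can be rewritten in ver 1 form by attaching the deterministic reward to each next state. The helper name `to_joint` is an assumption made for this illustration; the numbers are the `'s1'`, `'a'` entries used in Assignment 1.2.

```python
def to_joint(next_state_probs, reward_by_next_state):
    """Turn ver 2 dynamics for one (s_t, a_t) pair into a ver 1 joint table.

    next_state_probs:      next_state -> p(s_{t+1} | s_t, a_t)
    reward_by_next_state:  next_state -> f(s_t, s_{t+1}, a_t)
    """
    joint = {}
    for next_state, prob in next_state_probs.items():
        # Each next state carries exactly one reward in ver 2.
        reward = reward_by_next_state.get(next_state, 0)
        joint[(next_state, reward)] = prob
    return joint

# Dynamics of state 's1', action 'a'.
p_s1_a = {'s1': 0.5, 's2': 0.5}
r_s1_a = {'s1': 5, 's2': 10}

joint_s1_a = to_joint(p_s1_a, r_s1_a)
print(joint_s1_a)
```

The reverse conversion is not always possible: a ver 1 table that lists two different rewards for the same next state has no deterministic $f$ that reproduces it.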

### Assignment 1.2: generic stochastic MDP
This Python class for a generic MDP implements the formal definition of MDP ver 2. The variables below it are added to illustrate the behaviour of the MDP. Permitted actions are handled through the inputs: `transition_probabilities` and `rewards` contain entries only for permitted actions, and the `lookup_transition_probability` and `lookup_reward` methods return a default value (probability 0.0 or reward 0) when asked about a forbidden action.

In [1]:
class MDP:
    def __init__(self, states, actions, transition_probabilities, rewards):
        self.states = states
        self.actions = actions
        self.transition_probabilities = transition_probabilities
        self.rewards = rewards

    def lookup_transition_probability(self, state: str, action: str, next_state: str):
        # Probability of reaching next_state; 0.0 if the action is not permitted in state.
        return self.transition_probabilities[state].get(action, {}).get(next_state, 0.0)

    def lookup_reward(self, state: str, action: str, next_state: str):
        # Reward for the transition; 0 if no reward is defined for it.
        return self.rewards[state].get(action, {}).get(next_state, 0)

states = ['s1', 's2', 's3']
actions = ['a', 'b', 'c']

transition_probabilities = {
    's1': {'a': {'s1': 0.5, 's2': 0.5}},
    's2': {'a': {'s1': 0.1, 's2': 0.4, 's3': 0.5},
           'b': {'s1': 0.6, 's2': 0.3, 's3': 0.1}},
    's3': {'b': {'s2': 1.0}}
}

rewards = {
    's1': {'a': {'s1': 5, 's2': 10}},
    's2': {'a': {'s1': -2, 's2': 1, 's3': 4},
           'b': {'s1': 3, 's2': 2, 's3': 8}},
    's3': {'b': {'s2': 100}}
}

mdp_generic = MDP(states, actions, transition_probabilities=transition_probabilities, rewards=rewards)
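To illustrate how such a model could be used, here is a minimal, self-contained sketch that samples one transition from ver 2 style dictionaries. The function `sample_step` is not part of the class above; it is a hypothetical helper, and the dictionaries repeat only the `'s2'` and `'s3'` rows of the example data so the block runs on its own.

```python
import random

def sample_step(transition_probabilities, rewards, state, action, rng=random):
    """Sample (next_state, reward) for taking `action` in `state`."""
    outcomes = transition_probabilities.get(state, {}).get(action)
    if not outcomes:
        # No entry means the action is not permitted in this state.
        raise ValueError(f"action {action!r} is not permitted in state {state!r}")
    next_states = list(outcomes)
    weights = list(outcomes.values())
    # Draw the next state according to p(s_{t+1} | s_t, a_t).
    next_state = rng.choices(next_states, weights=weights, k=1)[0]
    # The reward is a deterministic function of (s_t, a_t, s_{t+1}).
    reward = rewards.get(state, {}).get(action, {}).get(next_state, 0)
    return next_state, reward

# Subset of the example data, repeated here so the sketch is self-contained.
example_probs = {
    's2': {'a': {'s1': 0.1, 's2': 0.4, 's3': 0.5}},
    's3': {'b': {'s2': 1.0}},
}
example_rewards = {
    's2': {'a': {'s1': -2, 's2': 1, 's3': 4}},
    's3': {'b': {'s2': 100}},
}

rng = random.Random(0)  # seeded generator for reproducible sampling
print(sample_step(example_probs, example_rewards, 's3', 'b', rng))
print(sample_step(example_probs, example_rewards, 's2', 'a', rng))
```

Raising an error for a forbidden action is one possible design choice; the lookup methods above instead return a default value.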

### Assignment 1.3: MDP from image

The code below uses the states, actions, rewards, and probabilities from the image.

In [None]:
states = ['s0', 's1', 's2']
actions = ['a0', 'a1']

transition_probabilities = {
    's0': {'a0': {'s0': 0.5, 's2': 0.5},
           'a1': {'s2': 1}},
    's1': {'a0': {'s0': 0.7, 's1': 0.1, 's2': 0.2},
           'a1': {'s1': 0.95, 's2': 0.05}},
    's2': {'a0': {'s0': 0.4, 's2': 0.6},
           'a1': {'s0': 0.3, 's1': 0.3, 's2': 0.4}},
}

rewards = {
    's0': {},
    's1': {'a0': {'s0': 5}},
    's2': {
        'a1': {'s0': -1}},
}

mdp_from_image = MDP(states, actions, transition_probabilities=transition_probabilities, rewards=rewards)
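A quick sanity check for hand-transcribed data like this is that the outgoing probabilities of every permitted state-action pair sum to 1. The sketch below is self-contained and repeats the transition table from this cell; the helper name `unnormalised_pairs` is an assumption introduced for illustration.

```python
import math

def unnormalised_pairs(transition_probabilities):
    """Return the (state, action) pairs whose probabilities do not sum to 1."""
    bad = []
    for state, by_action in transition_probabilities.items():
        for action, outcomes in by_action.items():
            total = sum(outcomes.values())
            # isclose tolerates floating-point rounding in the sum.
            if not math.isclose(total, 1.0):
                bad.append((state, action))
    return bad

# Transition table transcribed from the image (same values as above).
image_probs = {
    's0': {'a0': {'s0': 0.5, 's2': 0.5},
           'a1': {'s2': 1}},
    's1': {'a0': {'s0': 0.7, 's1': 0.1, 's2': 0.2},
           'a1': {'s1': 0.95, 's2': 0.05}},
    's2': {'a0': {'s0': 0.4, 's2': 0.6},
           'a1': {'s0': 0.3, 's1': 0.3, 's2': 0.4}},
}

print(unnormalised_pairs(image_probs))
```

An empty list means every row is a valid probability distribution.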

## Assignment 2

### Assignment 2.1


## Assignment 3
### Assignment 3.1
$$
v(s_0) = \mathbb{E}
$$