<a href="https://colab.research.google.com/github/lfmartins/markov-decision-processes/blob/main/Markov_Decision_Processes_Numpy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!sudo apt-get update -y
!sudo apt-get install python3.11
!sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.11 2
!python --version

In [7]:
import numpy as np

# Introduction

We want to build a framework for MDPs. The basic ingredients are:

- A set of *states* $\mathscr{S}$. We assume that $\mathscr{S}$ is finite and let $N=|\mathscr{S}|$.
- A set of *actions* $\mathscr{A}$. We assume that $\mathscr{A}$ is finite.
- For each $s\in\mathscr{S}$, a set of *admissible actions for state $s$*, $\mathscr{A}_s\subset\mathscr{A}$.
- For each $s,s'\in\mathscr{S}$ and $a\in\mathscr{A}_s$, a number $p(s'\,|\,s,a)\in[0,1]$.
- For each $s\in\mathscr{S}$ and $a\in\mathscr{A}_s$, a number $r(s,a)$. We let $\mathscr{R}$ be the set of possible rewards, and we assume that this set is also finite.

We interpret $p(s'\,|\,s,a)$ as the probability that, if the agent is in state $s$ and action $a$ is chosen, the agent will next transition to state $s'$. We require that, for each $s$, $a$ we have:
$$
\sum_{s'\in\mathscr{S}}p(s'\,|\,s,a)=1
$$
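As a minimal sketch of this normalization requirement, the transition probabilities can be held in a NumPy array and checked directly. The array layout `p[s, a, s2]`, the 2-state, 2-action example, and the assumption that every action is admissible in every state are illustrative choices, not part of the framework built later in this notebook.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP; p[s, a, s2] stands for p(s2 | s, a).
p = np.array([
    [[0.9, 0.1],   # state 0, action 0
     [0.2, 0.8]],  # state 0, action 1
    [[0.5, 0.5],   # state 1, action 0
     [0.0, 1.0]],  # state 1, action 1
])

# For each pair (s, a), the probabilities over next states must sum to 1.
row_sums = p.sum(axis=2)
is_valid = bool(np.all(p >= 0) and np.allclose(row_sums, 1.0))
print(is_valid)
```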

The function $r(s,a)$ represents a *reward* received by the agent for visiting state $s$ and choosing action $a$. The definition extends easily to randomized rewards.

We will only consider infinite-horizon problems, so we only have to deal with stationary policies. A *randomized policy* is a specification, for every $s\in\mathscr{S}$ and $a\in\mathscr{A}_s$, of a number $\pi(a\,|\,s)$, interpreted as the probability that action $a$ is chosen when in state $s$. We require:
$$
0\le\pi(a\,|\,s)\le 1,\quad \sum_{a\in\mathscr{A}_s}\pi(a\,|\,s)=1
$$
If, for all $s\in\mathscr{S}$ there is an $a=a(s)\in\mathscr{A}_s$ such that $\pi(a\,|\,s)=1$, we say that $\pi$ is *deterministic*.
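The two conditions above, and the definition of a deterministic policy, can be sketched as simple array checks. The layout `pi[s, a]` and the helper names `is_policy` and `is_deterministic` are hypothetical, and the sketch assumes that every action is admissible in every state.

```python
import numpy as np

def is_policy(pi):
    """True if every row of pi is a probability distribution over actions."""
    pi = np.asarray(pi, dtype=float)
    in_range = np.all(pi >= 0) and np.all(pi <= 1)
    return bool(in_range and np.allclose(pi.sum(axis=1), 1.0))

def is_deterministic(pi):
    """True if each state puts probability 1 on a single action."""
    pi = np.asarray(pi, dtype=float)
    return is_policy(pi) and bool(np.all(np.isclose(pi.max(axis=1), 1.0)))

# pi[s, a] stands for pi(a | s); both policies are made-up examples.
pi_random = np.array([[0.5, 0.5], [0.1, 0.9]])
pi_det = np.array([[1.0, 0.0], [0.0, 1.0]])
```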

The *transition probability matrix* associated to a policy $\pi$ is an $N\times N$ matrix $Q$ defined by:
$$
Q_{ss'}=\sum_{a\in\mathscr{A}_s}\pi(a\,|\,s)p(s'\,|\,s,a)
$$
It is easy to see that this is a stochastic matrix. Thus, if $\pi_0$ is a probability distribution on $\mathscr{S}$, there is a Markov chain $\{(S_t,A_t,R_t)\}_{t\ge 0}\subset \mathscr{S}\times\mathscr{A}\times\mathscr{R}$ such that:

- $P_{\pi,\pi_0}\left[S_t=s'\,|\,S_{t-1}=s, A_{t-1}=a\right]=p(s'\,|\,s,a)$
- $P_{\pi,\pi_0}\left[A_t=a\,|\,S_t=s\right]=\pi(a\,|\,s)$ if $a\in\mathscr{A}_s$.
- $R_t=r(S_t,A_t)$
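The matrix $Q$ and the chain described by the three properties above can be sketched numerically. The arrays `p[s, a, s2]` and `pi[s, a]`, the example numbers, and the helper `simulate` are assumptions made for illustration, with every action admissible in every state.

```python
import numpy as np

# Hypothetical data: p[s, a, s2] = p(s2 | s, a), pi[s, a] = pi(a | s).
p = np.array([
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.0, 1.0]],
])
pi = np.array([[0.5, 0.5], [0.1, 0.9]])

# Q[s, s2] = sum over a of pi(a | s) * p(s2 | s, a).
Q = np.einsum('sa,sat->st', pi, p)

def simulate(p, pi, s0, steps, rng):
    """Sample (S_t, A_t) for t = 0, ..., steps - 1, starting from S_0 = s0."""
    n_states, n_actions, _ = p.shape
    s, path = s0, []
    for _ in range(steps):
        a = int(rng.choice(n_actions, p=pi[s]))   # A_t ~ pi(. | S_t)
        path.append((s, a))
        s = int(rng.choice(n_states, p=p[s, a]))  # S_{t+1} ~ p(. | S_t, A_t)
    return path

path = simulate(p, pi, s0=0, steps=5, rng=np.random.default_rng(0))
```

Each row of `Q` is a convex combination of probability vectors, which is why `Q` is stochastic.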

In general, the initial distribution will not be relevant in the formulas below, so we will simply write $P_\pi$ for the probability measure corresponding to the Markov chain. We denote the corresponding expected value by $E_\pi$.

Let $0<\gamma<1$. We define the *value function* of policy $\pi$ as:
$$
V_\pi(s)=E_\pi\left[\sum_{t=0}^\infty\gamma^tR_t\,|\,S_0=s\right]
$$

This value function is a solution to *Bellman's Equation*:
$$
V_\pi(s)=\sum_{a\in\mathscr{A}_s}\pi(a\,|\,s)\left[r(s,a)+\gamma\sum_{s'\in\mathscr{S}}p(s'\,|\,s,a)V_\pi(s')\right]
$$
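The infinite sum in the definition of $V_\pi$ converges because the rewards are bounded (the set $\mathscr{R}$ is finite) and $\gamma$ lies in the interval $(0,1)$.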
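In matrix form, Bellman's Equation for a fixed policy reads $V_\pi=r_\pi+\gamma QV_\pi$, where $r_\pi(s)=\sum_{a}\pi(a\,|\,s)r(s,a)$. Since $Q$ is stochastic and $\gamma<1$, the matrix $I-\gamma Q$ is invertible, so $V_\pi$ can be obtained from a linear solve. The following is a minimal sketch under assumed data: the arrays `p`, `r`, `pi` and the value of `gamma` are made up, and every action is taken to be admissible in every state.

```python
import numpy as np

gamma = 0.9  # hypothetical discount factor

# Hypothetical data: p[s, a, s2] = p(s2 | s, a), r[s, a] = r(s, a),
# pi[s, a] = pi(a | s).
p = np.array([
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.0, 1.0]],
])
r = np.array([[1.0, 0.0], [0.0, 2.0]])
pi = np.array([[0.5, 0.5], [0.1, 0.9]])

Q = np.einsum('sa,sat->st', pi, p)  # transition matrix under pi
r_pi = (pi * r).sum(axis=1)         # expected one-step reward under pi

# Solve (I - gamma * Q) V = r_pi for the value function of pi.
V_pi = np.linalg.solve(np.eye(len(Q)) - gamma * Q, r_pi)

# Largest violation of Bellman's Equation; should be at rounding level.
residual = np.max(np.abs(V_pi - (r_pi + gamma * Q @ V_pi)))
```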

The *optimal value function* is defined as:
$$
V(s)=\max\left\{V_\pi(s)\,|\,\text{$\pi$ is a randomized policy}\right\}
$$

Bellman's Equation for the optimal value function is:
$$
V(s)=\max_{a\in\mathscr{A}_s}\left\{r(s,a)+\gamma\sum_{s'\in\mathscr{S}}p(s'\,|\,s,a)V(s')\right\}
$$
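One standard way to solve this equation numerically is value iteration, which repeatedly applies the right-hand side as an update until the values stop changing. The sketch below is not the method of this notebook's `MDP` class; the function name `value_iteration`, the array layouts, the tolerance, and the example data are all assumptions, with every action admissible in every state.

```python
import numpy as np

def value_iteration(p, r, gamma, tol=1e-10, max_iter=10_000):
    """Iterate V <- max_a [ r(s, a) + gamma * sum_s2 p(s2 | s, a) V(s2) ]."""
    V = np.zeros(p.shape[0])
    for _ in range(max_iter):
        action_values = r + gamma * (p @ V)  # shape (states, actions)
        V_new = action_values.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    # Greedy deterministic policy with respect to the final values.
    greedy = (r + gamma * (p @ V)).argmax(axis=1)
    return V, greedy

# Hypothetical data: p[s, a, s2] = p(s2 | s, a), r[s, a] = r(s, a).
p = np.array([
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.0, 1.0]],
])
r = np.array([[1.0, 0.0], [0.0, 2.0]])

V, greedy = value_iteration(p, r, gamma=0.9)
```

In this example, action 1 in state 1 earns reward 2 forever, so $V(1)=2/(1-0.9)=20$, and the greedy policy also picks action 1 in state 0 to reach state 1 quickly.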

# Data Structures

We need to store three things:

- The transition probabilities $p(s'\,|\,s,a)$
- The rewards $r(s,a)$
- The policy

A Markov Decision Process is represented by an object of the class `MDP`. This class records all states and actions. Each state and each action has a unique ID, which is mapped to an integer according to the order in which it was created. The state and action IDs can be any hashable objects.

State IDs are stored in a list `__states`, and action IDs are stored in a list `__actions`. States and actions are added to these lists in the order they are created.

For storing the transition probabilities, we use a dictionary `__tp`. The keys in this dictionary are state IDs. For a state with ID `sid`, the dictionary entry `__tp[sid]` is itself a dictionary. The keys for this dictionary are action IDs. For each action, this dictionary points to an array containing the transition probabilities.

Notice that this representation wastes some space, since each array stores an entry for every state, including those with zero probability. In the future we may move to a sparser representation.
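The nested-dictionary layout described above can be sketched with plain Python objects before wrapping it in a class. The IDs `'sunny'`, `'rainy'`, `'walk'`, `'drive'`, the numbers, and the helper dictionary `state_index` are hypothetical and only illustrate the shape of the data.

```python
import numpy as np

states = ['sunny', 'rainy']   # state IDs, in creation order
actions = ['walk', 'drive']   # action IDs, in creation order

# Integer index of each state ID, following creation order.
state_index = {sid: i for i, sid in enumerate(states)}

# tp[sid][aid] is an array of transition probabilities over all states.
# Only admissible actions appear as keys: 'walk' is not admissible in 'rainy'.
tp = {
    'sunny': {'walk': np.array([0.9, 0.1]), 'drive': np.array([0.2, 0.8])},
    'rainy': {'drive': np.array([0.0, 1.0])},
}

# p('rainy' | 'sunny', 'drive')
prob = tp['sunny']['drive'][state_index['rainy']]
admissible_in_rainy = list(tp['rainy'])
```

With this layout, the keys of `tp[sid]` double as the admissible action set for state `sid`.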


In [None]:
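import typing


class MDP(object):
  """A finite Markov Decision Process with hashable state and action IDs."""

  def __init__(self, states=None):
    # Attributes must be set on self; a mutable default argument is avoided.
    self.__states = []   # state IDs, in creation order
    self.__actions = []  # action IDs, in creation order
    self.__tp = {}       # __tp[sid][aid] -> array of transition probabilities
    self.__reward = {}   # rewards r(s, a)
    for sid in (() if states is None else states):
      self.add_state(sid)

  def add_state(self, state):
    # State IDs must be unique within the MDP.
    if state in self.__tp:
      raise ValueError(f'state {state} already exists in this MDP')
    self.__states.append(state)
    self.__tp[state] = {}

  def add_action(self, state_id, action_):
    # The source cell ends at this signature; the body is not given.
    raise NotImplementedError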

In [1]:
import sys
sys.version

'3.8.10 (default, Nov 14 2022, 12:59:47) \n[GCC 9.4.0]'

In [8]:
!python --version

Python 3.11.2
