# <center>Markov decision processes</center>
### <center>Alfred Galichon (NYU & Sciences Po)</center>
## <center>'math+econ+code' masterclass series</center>
#### <center>With python code examples</center>
© 2018–2023 by Alfred Galichon. Past and present support from NSF grant DMS-1716489 and ERC grant CoG-866274 is acknowledged, as well as inputs from contributors listed [here](http://www.math-econ-code.org/team).

**If you reuse material from this masterclass, please cite as:**<br>
Alfred Galichon, 'math+econ+code' masterclass series. https://www.math-econ-code.org/

# Introduction

### Learning objectives

* Dynamic programming
* Stationarity
* LCP formulation


### References

* Bertsekas TBC.
* Weber (2016). *Optimization and Control*. Lecture notes.


# Setup

Consider a finite state space $\mathcal{X}$ and an action space $\mathcal{Y}$. The short-term reward associated with taking action $y$ in state $x$ is $\Phi _{xy}$. The decision variable $\mu _{xy}\geq 0$ is the number of units in state $x$ for which action $y$ is taken.

If action $y$ is taken for a unit in state $x$, then the probability of
transitioning to a state $x^{\prime }$ is $N_{x^{\prime },x}^{y}$. We stack these matrices into
$$
N=\sum_{y}N^{y}\otimes \left( e^{y}\right) ^{\top },
$$
where $e^{y}$ is the $y$-th basis vector of $\mathbb{R}^{\mathcal{Y}}$, so that $N_{x^{\prime },xy}=N_{x^{\prime },x}^{y}$.

Consider $M=\mathbf{I}_{\mathcal{X}}\otimes \mathbf{1}_{\mathcal{Y}}^{\top }$, so that $M_{x^{\prime },xy}=1\left\{ x=x^{\prime }\right\} $. 
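The stacking of $N$ and $M$ can be checked numerically. The snippet below is a minimal sketch, assuming that pairs $(x,y)$ are flattened with index `x * nby + y`; the sizes and the random transition matrices are hypothetical and serve only as an illustration.

```python
import numpy as np

nbx, nby = 3, 2                      # hypothetical sizes, for illustration only
rng = np.random.default_rng(0)

# One column-stochastic matrix per action: N_y[y][x', x] = Pr(x' | x, y)
N_y = [rng.dirichlet(np.ones(nbx), size=nbx).T for _ in range(nby)]

# N = sum_y N^y kron (e^y)^T, so that column x * nby + y holds the law of x' given (x, y)
N = sum(np.kron(N_y[y], np.eye(nby)[[y], :]) for y in range(nby))

# M = I_X kron 1_Y^T, so that M @ mu sums mu_xy over actions y
M = np.kron(np.eye(nbx), np.ones((1, nby)))

x, y = 2, 1
print(np.allclose(N[:, x * nby + y], N_y[y][:, x]))   # N_{x',xy} = N^y_{x',x}
print(np.allclose(N.sum(axis=0), 1.0))                # every column is a probability vector
print(np.allclose(M[:, x * nby + y], np.eye(nbx)[:, x]))
```

With this ordering, `M @ mu` returns the mass in each current state and `N @ mu` the mass in each next-period state.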

### Dynamic programming

If $(q_x)_x$ is the present-period distribution over states, and if $(p^\prime_x)_x$ is the next-period payoff associated with each state, we can compute the intertemporal value as
$$
W\left( q,p^{\prime }\right) =\max_{\mu \geq 0}\left\{ \mu ^{\top }\left( \Phi +\beta N^{\top }p^{\prime }\right) :M\mu =q\right\} .
$$

This induces a policy $\pi_{y|x} = \mu_{xy}/q_x$.

Let $p_{x}$ be the present-period payoff associated with state $x$, and let $q_{x}^{\prime }$ be the next-period number of units in state $x$. We have
$$
p \in \partial_q W(q,p^\prime),
$$
where $\partial_q$ denotes the superdifferential in $q$, and
$$
\beta q^\prime \in \partial_{p^\prime} W(q,p^\prime),
$$
where $\partial_{p^\prime}$ denotes the subdifferential in $p^\prime$. The factor $\beta$ appears because $p^\prime$ enters the objective through $\beta \mu^{\top} N^{\top} p^\prime$, and $q^\prime = N\mu$.
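Because the constraint $M\mu=q$ decouples across states, the maximization defining $W$ can be carried out state by state: $W(q,p^\prime)=\sum_x q_x \max_y \{\Phi_{xy}+\beta (N^\top p^\prime)_{xy}\}$. The sketch below illustrates this on a hypothetical two-state, two-action example (action 0 stays put, action 1 switches state). The function name `one_step_value` and all numbers are assumptions made for the illustration, not part of the original notebook.

```python
import numpy as np

def one_step_value(Phi_xy, N_xp_xy, beta, pp_x, q_x):
    """Return W(q, p'), the present values p, an optimal mu, and q' = N mu."""
    nbx = len(q_x)
    # Q[x, y] = Phi_xy + beta * sum_x' N[x', xy] * p'[x']
    Q_x_y = (Phi_xy + beta * N_xp_xy.T @ pp_x).reshape(nbx, -1)
    y_x = Q_x_y.argmax(axis=1)           # best action in each state
    p_x = Q_x_y.max(axis=1)              # an element of the superdifferential in q
    mu_x_y = np.zeros_like(Q_x_y)
    mu_x_y[np.arange(nbx), y_x] = q_x    # all the mass of state x goes to its best action
    mu_xy = mu_x_y.flatten()
    return q_x @ p_x, p_x, mu_xy, N_xp_xy @ mu_xy

# Hypothetical example; columns of N are ordered (x0,y0), (x0,y1), (x1,y0), (x1,y1)
Phi_xy = np.array([1.0, 0.0, 0.0, 2.0])
N_xp_xy = np.array([[1.0, 0.0, 0.0, 1.0],
                    [0.0, 1.0, 1.0, 0.0]])
W, p_x, mu_xy, qp_x = one_step_value(Phi_xy, N_xp_xy, beta=0.5,
                                     pp_x=np.array([0.0, 6.0]),
                                     q_x=np.array([0.5, 0.5]))
print(W, p_x, mu_xy, qp_x)
```

The induced policy is recovered as `mu_xy.reshape(nbx, -1) / q_x[:, None]` whenever all entries of `q_x` are positive.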

In [1]:
#!pip install gurobipy
import numpy as np
import scipy.sparse as sp
import gurobipy as grb

In [2]:
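class Mdp():
    def __init__(self,list_Phi_x,list_N_xp_x,beta, pp_x=None,q_x=None):
        # list_Phi_x[y] is the reward vector and list_N_xp_x[y] the transition matrix of action y
        self.nbx = len(list_Phi_x[0])
        self.nby = len(list_Phi_x)
        # pairs (x,y) are flattened with index xy = nby * x + y
        self.Phi_xy = np.block( [Phi_x.reshape((-1,1)) for Phi_x in list_Phi_x]).flatten()
        self.N_xp_xy = np.block( [N_xp_x.reshape((-1,1)) for N_xp_x in list_N_xp_x]  ).reshape(self.nbx,-1)
        # M = I_X kron 1_Y^T, so that M_xp_xy @ mu_xy sums mu_xy over actions (used by the solver)
        self.M_xp_xy = np.kron(np.eye(self.nbx), np.ones((1,self.nby)))
        self.beta = beta
        self.pp_x = pp_x
        self.q_x = q_x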


## Sampling an MDP

In [3]:
def Mdp_sampler_init(self,state=0):
    self.state = state
    self.payoff = 0.0
    self.period = 0

Mdp.sampler_init = Mdp_sampler_init

def Mdp_sampler_forward(self,y,seed=None,verbose=False):
    if seed is not None:
        np.random.seed(seed)
    xy = self.nby * self.state + y
    oldstate = self.state
    self.state = np.random.choice(list(range(self.nbx)),1, p=self.N_xp_xy[:,xy])[0]
    self.period += 1
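    # payoff is kept in current-period units; simulate() multiplies it by beta**period,
    # so the reward collected in period t ends up weighted by beta**(t+1)
    self.payoff = self.payoff/self.beta + self.Phi_xy[xy]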
    if verbose:
        print('transition',str(oldstate),'->',str(self.state))
        print('new state',self.state)
        print('new period',self.period)
        print('new payoff',self.payoff)
    return(self.state)
    
Mdp.sampler_forward = Mdp_sampler_forward

In [4]:
def build_simple_mdp(nbperiods,payoff1,payoff2,beta,verbose = True):
    nbz=nbperiods
    P1_zp_z = np.diag(np.ones(nbz-1),-1)
    P1_zp_z[-1,-1]=1
    P2_zp_z = np.zeros((nbz,nbz))
    P2_zp_z[0,:]=1
    phi1_z = np.zeros(nbz)
    phi1_z[-2] = payoff1
    phi2_z = np.zeros(nbz)
    phi2_z[0] = payoff2
    if verbose:
        print('P1_zp_z=\n',P1_zp_z)
        print('P2_zp_z=\n',P2_zp_z)
        print('phi1_z=\n',phi1_z)
        print('phi2_z=\n',phi2_z)
    return(Mdp([phi1_z,phi2_z],[P1_zp_z,P2_zp_z],beta))

    

In [5]:
the_mdp = build_simple_mdp(6,5,-4,0.9)

P1_zp_z=
 [[0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 1. 1.]]
P2_zp_z=
 [[1. 1. 1. 1. 1. 1.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]]
phi1_z=
 [0. 0. 0. 0. 5. 0.]
phi2_z=
 [-4.  0.  0.  0.  0.  0.]


In [6]:
the_mdp.sampler_init()

In [7]:
the_mdp.sampler_forward(0,verbose=True)

transition 0 -> 1
new state 1
new period 1
new payoff 0.0


1

In [8]:
the_mdp.sampler_forward(1,verbose=True)

transition 1 -> 0
new state 0
new period 2
new payoff 0.0


0

In [9]:
the_mdp.N_xp_xy.shape

(6, 12)

In [10]:
def Mdp_simulate(self,nbperiod,pi_x_y,seed = None,initial_state=0,verbose = 0):
    if seed is not None:
        np.random.seed(seed)
    self.sampler_init(initial_state)
    for t in range(nbperiod):
        y =  np.random.choice(list(range(self.nby)),1, p=pi_x_y[self.state,:])[0]
        if verbose>0:
            print(y)
        self.sampler_forward(y)
    return(self.payoff * (self.beta**self.period) )

Mdp.simulate=Mdp_simulate

## Value of a policy


Assume we know the initial distribution $q^{0}$ and the policy $\pi $. What is the total value?


If $\pi _{y|x}$ is the conditional distribution of actions given states, we have

$\mu _{xy}=\pi _{y|x}q_{x}$

$q_{x^{\prime }}^{\prime }=\sum_{x}\sum_{y}N_{x^{\prime }|xy}\pi _{y|x}q_{x}$

We can define $\tilde{N}_{x^{\prime }|x}=\sum_{y}N_{x^{\prime }|xy}\pi _{y|x}$

and we have $q^{t}=\tilde{N}^{t}q^{0}$.

With the reward of period $t$ discounted by $\beta ^{t+1}$ (the convention used in the code), the surplus is equal to

$\sum_{t=0}^{\infty }\sum_{x}q_{x}^{t}\beta ^{t+1}\tilde{\phi}_{x}=\sum_{t=0}^{\infty }\tilde{\phi}^{\top }\left( \beta ^{t+1}\tilde{N}^{t}\right) q^{0}$

where $\tilde{\phi}_{x}=\sum_{y}\pi _{y|x}\Phi _{xy}$,

thus it equals $\beta \tilde{\phi}^{\top }\sum_{t=0}^{\infty }\left( \beta \tilde{N}\right) ^{t}q^{0}$,

that is, $\beta \tilde{\phi}^{\top }\left( I-\beta \tilde{N}\right) ^{-1}q^{0}$.
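As a standalone sanity check of this closed form, the sketch below compares it with a truncated version of the discounted sum. The matrix `Ntilde`, the vector `phitilde` and the initial distribution `q0` are hypothetical values chosen for the illustration; `Ntilde` is column-stochastic, in line with the convention $\tilde{N}_{x^{\prime }|x}$.

```python
import numpy as np

beta = 0.9
Ntilde = np.array([[0.5, 1.0],
                   [0.5, 0.0]])      # Ntilde[x', x]: each column sums to one
phitilde = np.array([1.0, 0.0])      # expected one-period reward in each state
q0 = np.array([1.0, 0.0])            # all units start in state 0

# Closed form: beta * phitilde^T (I - beta * Ntilde)^{-1} q0
closed_form = beta * phitilde @ np.linalg.solve(np.eye(2) - beta * Ntilde, q0)

# Truncated series: sum_t beta^(t+1) * phitilde^T Ntilde^t q0
truncated, qt = 0.0, q0.copy()
for t in range(500):
    truncated += beta ** (t + 1) * (phitilde @ qt)
    qt = Ntilde @ qt                 # distribution over states one period later

print(closed_form, truncated)
```

The two numbers agree up to the truncation error, which is of order $\beta^{500}$ here.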



In [11]:
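def Mdp_policy_value(self,pi_x_y):
    # Ntilde[x',x] = sum_y N[x',xy] * pi[x,y] is the state transition matrix induced by the policy
    Ntilde_xp_x = (pi_x_y[None,:,:] * self.N_xp_xy.reshape((self.nbx,self.nbx,-1))).sum(axis=2)
    phitilde_x = (pi_x_y * self.Phi_xy.reshape( (self.nbx,-1))).sum(axis=1)
    # p = beta * (I - beta * Ntilde^T)^{-1} phitilde, so that the total value is p @ q0
    p_x = self.beta*np.linalg.solve((np.eye(self.nbx) - self.beta * Ntilde_xp_x.T),phitilde_x)
    return(p_x)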
    
Mdp.policy_value = Mdp_policy_value

In [12]:
pi_x_y=np.zeros((the_mdp.nbx,the_mdp.nby))
pi_x_y[0:-1,0]=1
pi_x_y[-1,1]=1

the_mdp.policy_value(pi_x_y)


array([6.3011275 , 7.00125278, 7.77916975, 8.64352195, 9.60391327,
       5.67101475])

In [20]:
np.array([the_mdp.simulate(1000,pi_x_y,initial_state=x) for x in range(the_mdp.nbx)])

array([6.3011275 , 7.00125278, 7.77916975, 8.64352195, 9.60391327,
       5.67101475])

# Solving for the optimal policy

In [14]:
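def Mdp_solve_lp(self):
    # Despite its name, this solves the LCP as a nonconvex program:
    # minimize mu'rho subject to the feasibility constraints, with mu >= 0 and rho >= 0
    m=grb.Model()
    m.Params.NonConvex = 2  # needed because the objective mu_xy @ rho_xy is bilinear
    mu_xy = m.addMVar(self.nbx*self.nby)
    rho_xy = m.addMVar(self.nbx*self.nby)
    p_x = m.addMVar(self.nbx, lb = - grb.GRB.INFINITY)
    qp_x = m.addMVar(self.nbx )
    m.setObjective(mu_xy @ rho_xy , grb.GRB.MINIMIZE)
    # rho = M'p - beta N'p' - Phi, with p' = p in the stationary case
    if self.pp_x is not None:
        m.addConstr(rho_xy+self.Phi_xy+self.beta*self.N_xp_xy.T @ self.pp_x - self.M_xp_xy.T @ p_x == 0 )
    else:
        m.addConstr(rho_xy+self.Phi_xy + (self.beta*self.N_xp_xy - self.M_xp_xy).T @ p_x == 0 )
    m.addConstr(self.N_xp_xy @ mu_xy == qp_x)
    # M mu = q, with q = q' normalized to sum to one in the stationary case
    if self.q_x is not None:
        m.addConstr(self.M_xp_xy @ mu_xy == self.q_x)
    else:
        m.addConstr(self.M_xp_xy @ mu_xy == qp_x)
        m.addConstr(qp_x.sum()==1)

    m.optimize()
    return(p_x.X,qp_x.X,mu_xy.X)

Mdp.solve_lp = Mdp_solve_lp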
    



We have the following complementarity conditions relating $p$, $p^\prime$, $q$ and $q^\prime$:
$$\left\{ 
\begin{array}{l}
0\leq M^{\top }p-\beta N^{\top }p^{\prime }-\Phi \perp \mu \geq 0 \\ 
M\mu =q \\ 
N\mu =q^{\prime }%
\end{array}%
\right. 
$$
which we can rewrite as
$$
\left\{ 
\begin{array}{l}
0\leq \rho \perp \mu \geq 0 \\ 
\rho =M^{\top }p-\beta N^{\top }p^{\prime }-\Phi  \\ 
M\mu =q \\ 
N\mu =q^{\prime }%
\end{array}%
\right. 
$$

Note that imposing $p=p^{\prime }$ and $q=q^{\prime }$ yields the stationary solution.
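The stationary conditions can also be explored without a solver. The sketch below uses value iteration (a technique swapped in here for illustration, not the Gurobi-based method of the notebook) on the six-state example built earlier, rebuilt inline so that the block is self-contained. It iterates $p\leftarrow \max_y\{\Phi +\beta N^{\top }p\}$, forms $\rho =M^{\top }p-\beta N^{\top }p-\Phi $, and then constructs a stationary $\mu $ supported on the actions where $\rho $ vanishes. The least-squares step for the stationary distribution is one possible choice among several.

```python
import numpy as np

# Six-state example: action 0 moves one state forward, action 1 resets to state 0
nbx, nby, beta = 6, 2, 0.9
P1 = np.diag(np.ones(nbx - 1), -1)
P1[-1, -1] = 1
P2 = np.zeros((nbx, nbx))
P2[0, :] = 1
Phi_x_y = np.zeros((nbx, nby))
Phi_x_y[-2, 0], Phi_x_y[0, 1] = 5, -4
Phi_xy = Phi_x_y.flatten()
N_xp_xy = np.stack([P1, P2], axis=2).reshape(nbx, -1)   # column index is nby * x + y
M_xp_xy = np.kron(np.eye(nbx), np.ones((1, nby)))

# Value iteration on p = max_y (Phi + beta * N^T p)
p_x = np.zeros(nbx)
for _ in range(10_000):
    p_new = (Phi_xy + beta * N_xp_xy.T @ p_x).reshape(nbx, nby).max(axis=1)
    gap = np.max(np.abs(p_new - p_x))
    p_x = p_new
    if gap < 1e-12:
        break

rho_xy = M_xp_xy.T @ p_x - beta * N_xp_xy.T @ p_x - Phi_xy
y_x = rho_xy.reshape(nbx, nby).argmin(axis=1)           # action with zero slack in each state

# Stationary distribution of the induced chain: Ntilde q = q and sum(q) = 1
Ntilde = N_xp_xy.reshape(nbx, nbx, nby)[:, np.arange(nbx), y_x]
A = np.vstack([Ntilde - np.eye(nbx), np.ones((1, nbx))])
b = np.append(np.zeros(nbx), 1.0)
q_x = np.linalg.lstsq(A, b, rcond=None)[0]

mu_x_y = np.zeros((nbx, nby))
mu_x_y[np.arange(nbx), y_x] = q_x
mu_xy = mu_x_y.flatten()

print("rho >= 0:", rho_xy.min() >= -1e-9)
print("mu . rho:", mu_xy @ rho_xy)
print("M mu = N mu:", np.allclose(M_xp_xy @ mu_xy, N_xp_xy @ mu_xy))
```

Note that $p$ here follows the timing of the complementarity conditions above, in which the current reward $\Phi $ is not discounted.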
