# Chapter 5 Markov Decision Process Reading Note

- discrete time, infinite horizon, dynamic program

1. Definitions and Properties:
    - Model specification
    - Examples: Renewal, optimal inventory, cake eating, optimal stopping
    - Optimality theory
    - Algorithms: VFI, HPI, HPI vs. NI, OPI
2. Detailed applications
    - Optimal Inventories
    - Optimal saving and labor income: MDP representation, Implementation, timing, output
    - Optimal Investment: description, MDP representation, Implementation
    
3. Modified Bellman equation:
    - Structural estimation: definition, illustration, expected value function, optimality vs EV
    - Gumbel Max Trick
    - Optimal savings with stochastic return on wealth
    - Q-Factors
    - Operator factorization: Refactoring the Bellman operator, refactorization and optimality, refactored OPI

## MDP model

Actions now affect the current reward, the next state, and the transition probabilities. Unlike optimal stopping, where the choice is binary, the set of feasible actions depends on the state and is not restricted to two options.

A controller interacts with a state process $(X_t)_{t\ge 0}$ by choosing an action path $(A_t)_{t\ge 0}$ to maximize expected discounted rewards,

$$
\mathbb{E}\sum_{t\ge 0}\beta^t r(X_t,A_t)
$$

taking the initial state $X_0$ as given.

**The controller is not clairvoyant: actions cannot depend on future states**.

We take 

- $X$: finite state space
- $A$: finite action space
- $\Gamma: X\mapsto \mathscr{P}(A)$: a nonempty correspondence from $X$ to the subsets of $A$, i.e., $\Gamma(x)\neq \emptyset\,\,\forall x\in X$

**We define the Markov Decision Process as a 4-Tuple $\mathcal{M}=(\Gamma, \beta, r, P)$**

1. **Feasible correspondence** $\Gamma: X\mapsto \mathscr{P}(A)$, which induces the set of **feasible state-action pairs**

$$
G:=\{(x,a)\in X\times A: x\in X, a\in \Gamma(x)\}
$$

2. **Discount factor**: $\beta\in(0,1)$
3. **Reward function**: $r\in\mathbb{R}^G$, $r:G\mapsto \mathbb{R}$
4. **Stochastic kernel**: $P:G\times X \mapsto \mathbb{R}_+$ satisfying:

$$
\sum_{x'\in X}P(x,a,x')=1,\,\,\forall (x,a)\in G
$$

**Bellman Equation** corresponding to $\mathcal{M}$:

$$
v(x) = \max_{a\in \Gamma(x)}\left\{r(x,a)+\beta\sum_{x'}v(x')P(x,a,x')\right\} \,\,\forall x\in X
$$



We can understand the Bellman equation as reducing an infinite-horizon problem to a two period problem involving the present and the future. 

Current actions influence:

1. Current reward
2. Expected discounted value from future states

In every case, **there is a tradeoff between maximizing current rewards and shifting probability mass towards states with high future rewards**.
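
The right-hand side of the Bellman equation above can be sketched in code as an operator on candidate value functions, which is then iterated to a fixed point. This is a minimal illustration using only the Python standard library; the function names, the dictionary layout for $\Gamma$, $r$, $P$, and the two-state example with its numbers are all assumptions made for this sketch, not part of the source.

```python
def bellman_operator(v, Gamma, r, P, beta):
    """Return Tv, where (Tv)(x) = max over a in Gamma[x] of
    r[(x, a)] + beta * sum_{x'} v[x'] * P[(x, a)][x']."""
    n = len(v)
    return [
        max(
            r[(x, a)] + beta * sum(P[(x, a)][y] * v[y] for y in range(n))
            for a in Gamma[x]
        )
        for x in range(n)
    ]


def value_function_iteration(Gamma, r, P, beta, tol=1e-10, max_iter=10_000):
    """Iterate v <- Tv from v = 0 until successive iterates are within tol."""
    v = [0.0] * len(Gamma)
    for _ in range(max_iter):
        v_new = bellman_operator(v, Gamma, r, P, beta)
        if max(abs(a - b) for a, b in zip(v, v_new)) < tol:
            return v_new
        v = v_new
    return v


# Hypothetical two-state example (all numbers are made up for illustration).
Gamma = {0: [0, 1], 1: [0]}                      # feasible actions per state
r = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 2.0}      # rewards on feasible pairs
P = {
    (0, 0): [1.0, 0.0],   # collect reward 1 and stay in state 0 for sure
    (0, 1): [0.5, 0.5],   # forgo reward today, reach state 1 with prob. 0.5
    (1, 0): [0.0, 1.0],   # state 1 is absorbing
}
beta = 0.9

v_star = value_function_iteration(Gamma, r, P, beta)
print(v_star)
```

With these made-up numbers, action 1 is optimal in state 0: the controller gives up the current reward in order to shift probability mass toward the high-reward state 1.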

# Optimality

## Policies and Lifetime values

Let $\mathcal{M} = (\Gamma, \beta,r,P)$ be an MDP.

**Feasible policy set**

$$
\Sigma:=\{\sigma\in A^X: \sigma(x)\in \Gamma(x)\,\,\forall x\in X\}
$$

**If we select a policy $\sigma\in\Sigma$, it is understood that we respond to the state $X_t$ with action $A_t =\sigma(X_t)$ for all $t$**.

Hence, under $\sigma$, the state evolves by drawing $X_{t+1}$ from the distribution

$$
P(X_t,\sigma(X_t),\cdot)
$$

We say that $(X_t)_{t\ge 0}$ is **$P_\sigma$-Markov**, where

$$
P_\sigma(x,x') = P(x,\sigma(x),x')
$$

(The left-hand side is the Markov matrix induced by $\sigma$; the right-hand side is the stochastic kernel $P$ evaluated at the action $\sigma(x)$.)

**Note $P_\sigma\in\mathscr{M}(\mathbb{R}^X)$. Hence, fixing a policy closes the loop in the state transition process and defines a Markov chain for the state.**

Under the policy $\sigma$, rewards at state $x$ are $r(x,\sigma(x))$.

We denote $r_\sigma(x) = r(x,\sigma(x))$ and $\mathbb{E}_x = \mathbb{E}[\cdot|X_0=x]$.

Then the **lifetime value of following $\sigma$ from initial state $x$ can be written as**

$$
v_\sigma(x) =\mathbb{E}_x \sum_{t\ge 0}\beta^t r_\sigma(X_t), \,\,\,\text{where $(X_t)$ is $P_\sigma$-Markov with $X_0=x$}
$$

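Since $\beta<1$, the spectral radius of $\beta P_\sigma$ is less than one, so $I-\beta P_\sigma$ is invertible and we have,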

**$\sigma$-value function**
$$
v_\sigma =\sum_{t\ge 0}\beta^t P_\sigma^t r_\sigma = (I-\beta P_\sigma)^{-1} r_\sigma
$$
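
To make the series representation of $v_\sigma$ concrete, the sketch below accumulates the partial sums $\sum_t \beta^t P_\sigma^t r_\sigma$ and then checks that the result satisfies the linear system $(I-\beta P_\sigma)v = r_\sigma$. The function name `policy_value` and the two-state chain are hypothetical, chosen only for illustration.

```python
def policy_value(P_sigma, r_sigma, beta, tol=1e-12, max_terms=100_000):
    """Approximate v_sigma = sum_t beta^t P_sigma^t r_sigma by partial sums,
    stopping once the next term is below tol in sup norm."""
    n = len(r_sigma)
    term = list(r_sigma)      # holds beta^t P_sigma^t r_sigma, starting at t = 0
    total = [0.0] * n
    for _ in range(max_terms):
        total = [total[x] + term[x] for x in range(n)]
        term = [
            beta * sum(P_sigma[x][y] * term[y] for y in range(n))
            for x in range(n)
        ]
        if max(abs(z) for z in term) < tol:
            break
    return total


# Hypothetical two-state chain induced by some policy sigma (made-up numbers).
P_sigma = [[0.5, 0.5],
           [0.0, 1.0]]
r_sigma = [0.0, 2.0]
beta = 0.9

v_sigma = policy_value(P_sigma, r_sigma, beta)

# Residual of the linear system (I - beta * P_sigma) v = r_sigma.
residual = [
    v_sigma[x] - beta * sum(P_sigma[x][y] * v_sigma[y] for y in range(2)) - r_sigma[x]
    for x in range(2)
]
print(v_sigma, residual)
```

For larger state spaces one would normally solve the linear system directly (for example with `numpy.linalg.solve`) rather than sum the series; the loop here is only meant to mirror the formula.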


## Examples of MDP

### Renewal Problem: Rust(1987), engine replacement problem for bus

In each period, the superintendent decides whether or not to replace the engine of a given bus. Replacement is costly but delaying risks unexpected failure. 

Consider an abstract version of Rust's problem with a **binary action** $A_t\in\{0,1\}$:

- when $A_t=1$, the state resets to some fixed **renewal state** $\bar x \in X$.

- when $A_t=0$, the state updates according to $Q\in \mathscr{M}(\mathbb{R}^X)$.

Given current state $x$ and action $a$, the current reward $r(x,a)$ is received. The discount factor is $\beta$.

**Bellman equation**

$$
v(x)= \max\{r(x,1)+ \beta v(\bar x), r(x,0)+\beta\sum_{x'}v(x')Q(x,x')\}
$$

**MDP representation**

To set the problem up as an MDP we set 

- $A=\{0,1\}$
- $\Gamma(x) = A$ for all $x\in X$

We define the stochastic kernel as

$$
P(x,a,x') = a\mathbb{1}\{x'=\bar x\}+(1-a)Q(x,x')\tag{$(x,a)\in G, x'\in X$}
$$

Inserting this $P$, we get the MDP form of Bellman equation:

$$
v(x)=\max_{a\in\{0,1\}}\left\{r(x,a)+ a\beta v(\bar x) + (1-a)\beta\sum_{x'\in X}v(x')Q(x,x')\right\}
$$
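
As a quick check on the kernel $P(x,a,x') = a\mathbb{1}\{x'=\bar x\}+(1-a)Q(x,x')$, the sketch below builds it from a Markov matrix $Q$ and a renewal state $\bar x$. The function name `renewal_kernel` and the three-state wear process are assumptions made for illustration.

```python
def renewal_kernel(Q, x_bar):
    """Build P[(x, a)][x'] = a * 1{x' = x_bar} + (1 - a) * Q[x][x']."""
    n = len(Q)
    return {
        (x, a): [a * (1.0 if y == x_bar else 0.0) + (1 - a) * Q[x][y]
                 for y in range(n)]
        for x in range(n)
        for a in (0, 1)
    }


# Hypothetical wear process on states {0, 1, 2}; state 0 is "new engine".
Q = [[0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0]]
P = renewal_kernel(Q, x_bar=0)

print(P[(2, 1)])   # replacing in the worst state: all mass on x_bar
print(P[(1, 0)])   # not replacing in state 1: row 1 of Q
```

Each row `P[(x, a)]` is a probability distribution over next states, so the requirement $\sum_{x'}P(x,a,x')=1$ holds by construction.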

### Optimal inventory management

Consider a firm whose manager maximizes shareholder value. To simplify, we ignore the exit option, so the value of the firm is the expected present value (EPV) of future profits.

Assume the firm sells only one product, with profit $\pi_t$ in period $t$, and let $r>0$ be the interest rate, so that the discount factor is $\beta = 1/(1+r)$.

The **value of the firm** is

$$
V_0 = \mathbb{E}\sum_{t\ge 0} \beta^t \pi_t
$$

The firm faces an **exogenous demand process**

$$
(D_t)\sim_{IID}\varphi\in\mathcal{D}(\mathbb{Z}_+)
$$

The **Inventory** $(X_t)_{t\ge 0}$ obeys the law of motion:

$$
X_{t+1} = f(X_t,A_t,D_{t+1}), \qquad f(x,a,d) := \max\{x-d, 0\}+a
$$

The **action $A_t$ is the number of units of stock ordered this period, which take one period to arrive.**

We assume that the firm cannot sell more stock than it has on hand and can store at most $K$ items. We normalize the price to 1.

**Profit function**

$$
\pi_{t} = X_t\wedge D_{t+1}-cA_t - \kappa\mathbb{1}\{A_t>0\}
$$

- $c$ is the per-unit cost of stock and $\kappa$ is the fixed cost of placing an order.
- $X_t\wedge D_{t+1}$ means the firm cannot sell more than it has on hand.

**MDP formulation**

Let $X$ be the **state space** of inventory at storage, i.e.,

$$
X = \{0,1,2,\cdots,K\}
$$

Then the **feasible correspondence** is,

$$
\Gamma(x) = \{0,1,2,\cdots,K-x\}
$$

**Reward function is the current expected profit**

$$
r(x,a) = \left(\sum_{d\ge 0} (x\wedge d)\varphi(d)\right)-ca-\kappa\mathbb{1}\{a>0\}
$$

**Stochastic kernel**

$$
P(x,a,x') = \mathbb{P}\{f(x,a,D)=x'\}, \tag{$D\sim \varphi$}
$$

**Bellman Equation**

$$
v(x)= \max_{a\in\Gamma(x)}\left\{r(x,a)+\beta\sum_{x'\in X} v(x')P(x,a,x')\right\}
$$

or

$$
v(x) =\max_{a\in\Gamma(x)}\left\{r(x,a)+\beta\sum_{d\ge 0} v(f(x,a,d))\varphi(d)\right\}
$$

**Bellman operator**

$$
(Tv)(x) = \max_{a\in \Gamma(x)}\left\{r(x,a)+\beta\sum_{d\ge 0}v(f(x,a,d))\varphi(d)\right\}
$$

The operator $T$ maps $\mathbb{R}^X$ to itself and is constructed so that its set of fixed points coincides with the set of solutions to the Bellman equation.
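
The Bellman operator for the inventory problem can be sketched as follows, with value function iteration used to approximate its fixed point. Everything specific here is an assumption for illustration: the geometric demand distribution $\varphi(d) = (1-p)^d p$ truncated at `d_max`, the parameter values, and the function names.

```python
def make_inventory_mdp(K=8, c=0.2, kappa=0.8, p=0.6, d_max=30):
    """Return (Gamma, r, f, phi) for the inventory model on X = {0, ..., K}."""
    # Truncated geometric demand pmf (an assumption); the omitted tail mass
    # (1 - p)^(d_max + 1) is negligible for these values.
    phi = [(1 - p) ** d * p for d in range(d_max + 1)]

    def f(x, a, d):
        return max(x - d, 0) + a            # law of motion for inventory

    Gamma = {x: range(K - x + 1) for x in range(K + 1)}
    expected_sales = [sum(min(x, d) * phi[d] for d in range(d_max + 1))
                      for x in range(K + 1)]
    r = {(x, a): expected_sales[x] - c * a - kappa * (a > 0)
         for x in Gamma for a in Gamma[x]}
    return Gamma, r, f, phi


def T(v, Gamma, r, f, phi, beta):
    """Apply the Bellman operator to v; also return a v-greedy policy."""
    Tv, sigma = [], []
    for x in Gamma:
        best_a, best_val = None, float("-inf")
        for a in Gamma[x]:
            val = r[(x, a)] + beta * sum(v[f(x, a, d)] * q
                                         for d, q in enumerate(phi))
            if val > best_val:
                best_a, best_val = a, val
        Tv.append(best_val)
        sigma.append(best_a)
    return Tv, sigma


def solve_inventory(beta=0.95, tol=1e-8, max_iter=10_000):
    """Value function iteration starting from v = 0."""
    Gamma, r, f, phi = make_inventory_mdp()
    v = [0.0] * len(Gamma)
    for _ in range(max_iter):
        v_new, sigma = T(v, Gamma, r, f, phi, beta)
        if max(abs(a - b) for a, b in zip(v, v_new)) < tol:
            break
        v = v_new
    # sigma is greedy with respect to the second-to-last iterate.
    return v_new, sigma, (Gamma, r, f, phi, beta)


v_star, sigma_star, model = solve_inventory()
print(sigma_star)   # order quantity at each inventory level 0, ..., K
```

Because $a \le K - x$ for every feasible action, `f(x, a, d)` never exceeds $K$, so the lookup `v[f(x, a, d)]` always stays inside the state space.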

### Cake Eating

Consider a household with a finite wealth endowment and no labor income. Wealth evolves as

$$
W_{t+1} = R(W_t-C_t)
$$

where $C_t$ is current consumption and $R$ is the gross interest rate. The household has utility function $u$ and aims to maximize lifetime utility

$$
\mathbb{E}\sum_{t\ge 0}\beta^t u(C_t), \qquad W_0=w
$$

by choosing a consumption path or, equivalently, by choosing next-period wealth $w'=R(w-c)$, which implies $c = w-w'/R$.

**Bellman equation**

$$
v(w) = \max_{0\le w'\le w} \{u(w-w'/R)+\beta v(w')\}
$$

Hence, the household uses the Bellman equation to trade off current utility against future value.

**MDP formulation**

An MDP is a 4-tuple $\mathcal{M}=(\Gamma,\beta, r, P)$.

We take the set of wealth levels $W$ as the state space and next-period wealth $w'$ as the action. We first formulate the feasible correspondence $\Gamma(w)$.

Since consumption must be nonnegative and cannot exceed current wealth, the feasible choices of $w'$ at current wealth $w$ are

$$
\Gamma(w) = \{0,1,\cdots, w\},\tag{$w\in W$}
$$

The discount factor is $\beta = 1/(1+r)$ with $r=R-1$, hence $\beta =1/R$.

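The reward function is the utility of the implied consumption,

$$
r(w,w') := u(w-w'/R)
$$

Since there is no uncertainty, the stochastic kernel is degenerate: next-period wealth equals the chosen $w'$ with probability one,

$$
P(w,w',w'') = \mathbb{1}\{w''=w'\}
$$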
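
Because next-period wealth equals the chosen $w'$, the continuation value in the cake eating Bellman equation is simply $v(w')$, with no expectation to compute. The sketch below runs value function iteration under assumptions made only for illustration: an integer wealth grid $\{0,\ldots,20\}$, $R = 1.1$, $\beta = 1/R$, and $u(c)=\sqrt{c}$. The function name `solve_cake_eating` is hypothetical.

```python
from math import sqrt


def solve_cake_eating(w_max=20, R=1.1, tol=1e-10, max_iter=10_000):
    """VFI for v(w) = max_{0 <= w' <= w} { u(w - w'/R) + beta * v(w') }
    on the integer grid W = {0, ..., w_max}, with u = sqrt and beta = 1/R."""
    beta = 1.0 / R
    u = sqrt                                   # assumed utility function
    v = [0.0] * (w_max + 1)
    for _ in range(max_iter):
        v_new, policy = [], []
        for w in range(w_max + 1):
            # Gamma(w) = {0, ..., w}; the next state equals the action w_next.
            values = [u(w - w_next / R) + beta * v[w_next]
                      for w_next in range(w + 1)]
            best = max(range(w + 1), key=values.__getitem__)
            v_new.append(values[best])
            policy.append(best)
        if max(abs(a - b) for a, b in zip(v, v_new)) < tol:
            break
        v = v_new
    return v_new, policy


v_star, w_next_star = solve_cake_eating()
print(w_next_star)   # chosen next-period wealth at each w in the grid
```

With zero wealth the only feasible choice is $w'=0$, so the value at $w=0$ is zero; the value function is nondecreasing in $w$ because $\Gamma(w)\subset\Gamma(w+1)$ and $u$ is increasing.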

### Optimal Stopping 

We can frame an optimal stopping problem as an MDP by adding a new state variable.

Consider a job search model with Markov wage, i.e., $(W_t)_{t\ge 0}$ is a $Q$-Markov on finite $W$.

To express the job search problem as MDP, we let

$$
X = \{0,1\}\times W
$$

be a state space whose typical element is $(e,w)$ with $e$ representing the employment status.

The action space is $A=\{0,1\}$, where $0$ denotes rejecting and $1$ denotes accepting the wage offer.