# Definitions and Properties

## Defining RDP

Consider a generic Bellman equation

$$
v(x) = \max_{a\in\Gamma(x)}B(x,a,v)
$$

where

- $x$ is the state
- $a$ is the action
- $\Gamma$ is the feasible correspondence
- $B$ is the aggregator function


**Recursive Decision Process**

We define a **Recursive Decision Process (RDP)** to be a triple $\mathcal{R}=(\Gamma, V, B)$ consisting of

- **Feasible correspondence** $\Gamma$, a nonempty correspondence from $X$ to $A$, which in turn defines

   - **Feasible state-action pair** $G=\{(x,a)\in X\times A: a\in \Gamma(x)\}$
   - **Feasible policy set** $\Sigma=\{\sigma \in A^X: \sigma(x)\in \Gamma(x)\}$

- **Value space** A subset $V\subset\mathbb{R}^X$
- **Value aggregator** $B: G\times V\to \mathbb{R}$, where $B(x,a,v)$ is the total lifetime reward corresponding to current state $x$, current action $a$ and value function $v$. The aggregator must satisfy
   - **Monotonicity**: $v,w\in V, v\le w\implies B(x,a,v)\le B(x,a,w)$ for all $(x,a)\in G$
   - **Consistency**: if $w(x) = B(x,\sigma(x),v)$ for all $x\in X$, for some $\sigma\in\Sigma$ and $v\in V$, then $w\in V$.
 
(The monotonicity condition says that if, relative to $v$, rewards are at least as high under $w$ in every future state, then the total reward one can extract under $w$ is at least as high.)

(The consistency condition ensures that as we consider values of different policies we remain within the value space $V$).

Every MDP can be treated as a special case of an RDP.
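As a minimal sketch of this special case, the snippet below writes the usual MDP aggregator $B(x,a,v) = r(x,a) + \beta\sum_{x'}P(x,a,x')v(x')$ in Python and spot-checks monotonicity. The reward array `r`, the kernel `P`, the discount factor `beta` and the choice $\Gamma(x)=A$ for every $x$ are illustrative assumptions, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, beta = 4, 3, 0.95             # states, actions, discount factor (hypothetical)
r = rng.uniform(size=(n, m))        # r[x, a]: current reward
P = rng.uniform(size=(n, m, n))
P /= P.sum(axis=2, keepdims=True)   # P[x, a, :] is a distribution over next states

def B(x, a, v):
    """MDP aggregator: current reward plus discounted expected continuation value."""
    return r[x, a] + beta * P[x, a] @ v

v = rng.uniform(size=n)
w = v + rng.uniform(size=n)         # w >= v pointwise

# Monotonicity: v <= w implies B(x, a, v) <= B(x, a, w) at every feasible pair
monotone = all(B(x, a, v) <= B(x, a, w) for x in range(n) for a in range(m))
print(monotone)
```

Monotonicity holds here because the weights `beta * P[x, a]` are nonnegative.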

## Lifetime Value

### Policy and Value

Let $\mathcal{R}=(\Gamma, V,B)$ be an RDP with state and action spaces $X$ and $A$, and let $\Sigma$ be the set of all feasible policies. For each $\sigma\in\Sigma$, we introduce the **policy operator** $T_\sigma$ as a self-map on $V$ defined by

$$
(T_\sigma v)(x) = B(x,\sigma(x),v)
$$

The consistency condition guarantees that $T_\sigma$ maps $V$ into itself, and monotonicity of $B$ makes $T_\sigma$ order-preserving.

If $T_\sigma$ has a unique fixed point in $V$, we denote this fixed point by $v_\sigma$ and call it the $\sigma$-value function.

We can interpret $v_\sigma$ as representing the lifetime value of following policy $\sigma$.
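The following sketch illustrates the policy operator and its fixed point, assuming the MDP aggregator $B(x,a,v) = r(x,a)+\beta\sum_{x'}P(x,a,x')v(x')$. The arrays `r` and `P`, the value of `beta` and the policy `sigma` are hypothetical. It computes $v_\sigma$ twice: by iterating $T_\sigma$ and by solving the linear system that characterizes the fixed point in this special case.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, beta = 4, 3, 0.9
r = rng.uniform(size=(n, m))
P = rng.uniform(size=(n, m, n))
P /= P.sum(axis=2, keepdims=True)

sigma = np.array([0, 2, 1, 0])      # a feasible policy: sigma[x] is the action at x
idx = np.arange(n)
r_sigma, P_sigma = r[idx, sigma], P[idx, sigma]   # reward vector and n x n matrix

def T_sigma(v):
    """Policy operator: (T_sigma v)(x) = B(x, sigma(x), v)."""
    return r_sigma + beta * P_sigma @ v

# Successive approximation from an arbitrary starting point
v = np.zeros(n)
for _ in range(500):
    v = T_sigma(v)

# Direct solution, available because T_sigma is affine in the MDP case
v_sigma = np.linalg.solve(np.eye(n) - beta * P_sigma, r_sigma)
print(np.max(np.abs(v - v_sigma)))
```

For a general RDP the linear solve is not available, so only the iterative route applies.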


**IN RDP**

The policy operator can be expressed as $(T_\sigma v)(x) = A_\sigma(x,(R_\sigma v)(x))$ for some aggregator $A_\sigma$ and certainty equivalent operator $R_\sigma$. 

Hence $T_\sigma$ is a Koopmans operator and lifetime value associated with policy $\sigma$ is the fixed point of this operator.

### Uniqueness and Stability

Let $\mathcal{R} = (\Gamma, V, B)$ be a given RDP with policy operators $\{T_\sigma\}$.

Given that our objective is to maximize lifetime value over the set of policies in $\Sigma$, we need to assume at the very least that lifetime value is well defined at each policy.

**Well-Posed**

We say that $\mathcal{R}$ is **well-posed** if for all $\sigma\in\Sigma$, $T_\sigma$ has a unique fixed point $v_\sigma\in V$.

**Globally stable**

Let $\mathcal{R}$ be an RDP with policy operators $\{T_\sigma\}_{\sigma\in\Sigma}$.

We say $\mathcal{R}$ is **globally stable** if for all $\sigma\in\Sigma$, $T_\sigma$ is globally stable on $V$.

**Every globally stable RDP is well-posed.**

**Global stability implies that for any choice of terminal condition, finite horizon valuations always converge to their infinite horizon counterparts.**

### Continuity

Let $\mathcal{R}=(\Gamma, V,B)$ be an RDP. 

We call $\mathcal{R}$ **continuous** if $B(x,a,v)$ is continuous in $v$ for all $(x,a)\in G$. 

In other words, $\mathcal{R}$ is continuous if for any $v\in V$, any $(x,a)\in G$ and any sequence $(v_k)_{k\ge 1}$ in $V$, we have,

$$
\lim_{k\to\infty} v_k = v \implies \lim_{k\to\infty}B(x,a,v_k) = B(x,a,v) 
$$

## Optimality

### Greedy policies

Given an RDP $\mathcal{R} = (\Gamma, V,B)$ and $v\in V$, a policy $\sigma\in\Sigma$ is called $v$-greedy if

$$
\sigma(x)\in\arg\max_{a\in\Gamma(x)}B(x,a,v)
$$

for all $x\in X$. 

**Since $\Gamma(x)$ is finite and nonempty at each $x\in X$, at least one such policy exists.**
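A $v$-greedy policy can be computed with a row-wise argmax. The sketch below assumes the MDP aggregator and encodes $\Gamma$ as a boolean mask `feasible`; the mask, `r`, `P` and `beta` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, beta = 4, 3, 0.9
r = rng.uniform(size=(n, m))
P = rng.uniform(size=(n, m, n))
P /= P.sum(axis=2, keepdims=True)

# feasible[x, a] is True when a is in Gamma(x); every row has at least one True
feasible = np.array([[True, True, False],
                     [True, False, True],
                     [False, True, True],
                     [True, True, True]])

def greedy(v):
    """Return a v-greedy policy as an array of action indices."""
    q = r + beta * P @ v                 # q[x, a] = B(x, a, v)
    q = np.where(feasible, q, -np.inf)   # infeasible actions can never be the argmax
    return q.argmax(axis=1)

v = rng.uniform(size=n)
sigma = greedy(v)
print(sigma)
```

Ties are broken in favor of the lowest action index, which is one of many valid selections from the argmax set.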

We say that $v\in V$ satisfies the **Bellman equation** if

$$
v(x)=\max_{a\in\Gamma(x)}B(x,a,v)
$$

for all $x\in X$. The **Bellman operator** corresponding to $\mathcal{R}$ is the map $T$ on $V$ defined by

$$
(Tv)(x) = \max_{a\in\Gamma(x)}B(x,a,v)
$$

Compare this with the **policy operator**

$$
(T_\sigma v)(x) = B(x,\sigma(x),v)
$$

We have 

1. $Tv = \bigvee_\sigma T_\sigma v$, that is, $T$ is the upper envelope of $\{T_\sigma\}$
2. $Tv=T_\sigma v$ iff $\sigma$ is $v$-greedy
3. $T$ is an order-preserving self-map on $V$
4. $(T^kv)(x) = \max_{a\in\Gamma(x)} B(x,a,T^{k-1}v)$
5. $(T_\sigma^kv)(x) = B(x,\sigma(x), T^{k-1}_\sigma v)$
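The first two properties can be checked numerically on a small example. This sketch assumes the MDP aggregator with $\Gamma(x)=A$ and hypothetical data, enumerates all $|A|^{|X|}$ policies, and compares $Tv$ with the pointwise maximum of $T_\sigma v$.

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
n, m, beta = 3, 2, 0.9
r = rng.uniform(size=(n, m))
P = rng.uniform(size=(n, m, n))
P /= P.sum(axis=2, keepdims=True)
idx = np.arange(n)

def T_sigma(sigma, v):
    sigma = np.asarray(sigma)
    return r[idx, sigma] + beta * P[idx, sigma] @ v

def T(v):
    return (r + beta * P @ v).max(axis=1)

def greedy(v):
    return (r + beta * P @ v).argmax(axis=1)

v = rng.uniform(size=n)
policies = list(itertools.product(range(m), repeat=n))       # all m**n policies
envelope = np.max([T_sigma(s, v) for s in policies], axis=0)  # pointwise max over sigma

print(np.allclose(T(v), envelope))                  # property 1
print(np.allclose(T(v), T_sigma(greedy(v), v)))     # property 2, "if" direction
```

Enumeration is only feasible for tiny problems; it is used here purely as a check.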

## Algorithms

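Optimistic policy iteration (OPI) is a more practical alternative to Howard policy iteration (HPI). HPI for an RDP is essentially the same as for an MDP, except that the closed form $v_\sigma = (I-\beta P)^{-1}r$ is replaced by $v_\sigma$ defined as the fixed point of $T_\sigma$. This fixed point can be approximated by successive approximation, which leads to OPI.

**OPI: the initial condition must be $v_0=v_\sigma$ for some $\sigma\in\Sigma$.** (Implicitly, this implies $v_0\le v^*$.)

OPI reduces to value function iteration (VFI) when the number of policy operator steps is $m=1$.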

### Howard operator

We call $H: V\to \{v_\sigma\}_{\sigma\in\Sigma}$ the **Howard operator** generated by $\mathcal{R}$, where

$$
Hv = v_\sigma
$$

where $\sigma$ is $v$-greedy. Iterating $H$ implements HPI. 

In particular, if we fix $\sigma\in\Sigma$ and set $v_k = H^k v_\sigma$, then $(v_k)_{k\ge 0}$ is the sequence of $\sigma$-value functions generated by HPI.
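A minimal sketch of HPI as iteration of $H$, assuming the MDP aggregator so that $v_\sigma$ can be obtained from a linear solve. All data and the helper names `greedy`, `policy_value` and `H` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, beta = 5, 3, 0.9
r = rng.uniform(size=(n, m))
P = rng.uniform(size=(n, m, n))
P /= P.sum(axis=2, keepdims=True)
idx = np.arange(n)

def greedy(v):
    return (r + beta * P @ v).argmax(axis=1)

def policy_value(sigma):
    """v_sigma, the fixed point of T_sigma (a linear solve in the MDP case)."""
    return np.linalg.solve(np.eye(n) - beta * P[idx, sigma], r[idx, sigma])

def H(v):
    """Howard operator: map v to the value of a v-greedy policy."""
    return policy_value(greedy(v))

v = policy_value(np.zeros(n, dtype=int))   # v_0 = v_sigma with sigma(x) = 0
for k in range(50):
    v_next = H(v)
    if np.allclose(v_next, v):             # H v = v: no further improvement
        break
    v = v_next

sigma_star = greedy(v)
print(k, sigma_star)
```

When the loop stops, `v` is a fixed point of $H$ and therefore satisfies the Bellman equation up to numerical tolerance.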




### OPI

The operator $W_m$, defined by $W_m v = T_\sigma^m v$ where $\sigma$ is $v$-greedy, is an approximation of $H$, since under global stability $T_\sigma^m v \to v_\sigma = Hv$ as $m\to\infty$.

Iterating with $W_m$ generates the value sequence in OPI. 

More specifically, we take $v_0 \in \{v_\sigma\}$ and generate

$$
(v_k, \sigma_k)_{k\ge 0}
$$

where $v_k = W^k_m v_0$ and $\sigma_k$ is a $v_k$-greedy policy.
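The sketch below implements OPI for the MDP special case, taking $W_m v = T_\sigma^m v$ with $\sigma$ a $v$-greedy policy. The data, the step count `m=10` and the iteration limits are hypothetical choices. It also checks that a single step with $m=1$ coincides with the Bellman operator.

```python
import numpy as np

rng = np.random.default_rng(5)
n, n_actions, beta = 5, 3, 0.9
r = rng.uniform(size=(n, n_actions))
P = rng.uniform(size=(n, n_actions, n))
P /= P.sum(axis=2, keepdims=True)
idx = np.arange(n)

def T(v):
    return (r + beta * P @ v).max(axis=1)

def greedy(v):
    return (r + beta * P @ v).argmax(axis=1)

def W(v, m):
    """W_m v = T_sigma^m v, where sigma is v-greedy."""
    sigma = greedy(v)
    r_s, P_s = r[idx, sigma], P[idx, sigma]
    for _ in range(m):
        v = r_s + beta * P_s @ v
    return v

# Initial condition v_0 = v_sigma for the policy sigma(x) = 0
sigma0 = np.zeros(n, dtype=int)
v0 = np.linalg.solve(np.eye(n) - beta * P[idx, sigma0], r[idx, sigma0])

v = v0
for k in range(300):
    v = W(v, m=10)
sigma_k = greedy(v)

# Reference solution from plain value function iteration
v_star = v0
for _ in range(2000):
    v_star = T(v_star)
print(np.max(np.abs(v - v_star)))
```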

# Optimality Theorem

Let $\mathcal{R}$ be a well-posed RDP with policy operators $\{T_\sigma\}$ and $\sigma$-value functions $\{v_\sigma\}$.

We define the **value function** as

$$
v^* = \bigvee_{\sigma\in\Sigma} v_\sigma
$$

and it satisfies

$$
v^*(x) = \max_{\sigma\in\Sigma}v_\sigma(x)
$$

A policy $\sigma$ is called **optimal** for $\mathcal{R}$ if $v_\sigma = v^*$, that is, if

$$
v_\sigma(x)\ge v_\tau (x)
$$

for all $\tau\in\Sigma$ and all $x\in X$.

We say that $\mathcal{R}$ satisfies **Bellman's principle of optimality** if

$$
\sigma\in\Sigma \text{ is optimal for  }\mathcal{R} \iff \sigma \text{ is $v^*$-greedy}
$$

## Theorem 8.1.1. Optimality Results

If $\mathcal{R}$ is globally stable, then

1. $v^*$ is the unique solution to the Bellman equation in $V$.

2. $\mathcal{R}$ satisfies Bellman's principle of optimality

3. $\mathcal{R}$ has at least one optimal policy

4. HPI returns an optimal policy in finitely many steps

5. The OPI sequence satisfies $v_k\to v^*$ as $k\to\infty$ and there exists a $K\in\mathbb{N}$ such that $\sigma_k$ is optimal for all $k\ge K$.
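Parts 1 to 3 can be illustrated by brute force on a tiny MDP, which is a globally stable RDP when $\beta<1$. The sketch computes $v^* = \bigvee_\sigma v_\sigma$ by enumerating every policy, then checks that $v^*$ solves the Bellman equation and that a $v^*$-greedy policy attains $v^*$. All data are hypothetical.

```python
import itertools
import numpy as np

rng = np.random.default_rng(6)
n, m, beta = 3, 2, 0.9
r = rng.uniform(size=(n, m))
P = rng.uniform(size=(n, m, n))
P /= P.sum(axis=2, keepdims=True)
idx = np.arange(n)

def policy_value(sigma):
    sigma = np.asarray(sigma)
    return np.linalg.solve(np.eye(n) - beta * P[idx, sigma], r[idx, sigma])

policies = list(itertools.product(range(m), repeat=n))
values = np.array([policy_value(s) for s in policies])   # one row per policy
v_star = values.max(axis=0)                               # pointwise max over sigma

q_star = r + beta * P @ v_star       # q_star[x, a] = B(x, a, v_star)
T_v_star = q_star.max(axis=1)
sigma_star = q_star.argmax(axis=1)   # a v_star-greedy policy

print(np.allclose(T_v_star, v_star))                  # Bellman equation
print(np.allclose(policy_value(sigma_star), v_star))  # greedy policy is optimal
```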

### Nonstationary policies

Let $\mathcal{R} = (\Gamma, V, B)$ be a globally stable RDP. Then $(T^k_\sigma v)(x)$ gives the finite horizon utility under policy $\sigma$ with initial state $x$ and terminal condition $v$.

Extending this idea, it is natural to understand $T_{\sigma_k} T_{\sigma_{k-1}}\cdots T_{\sigma_1} v$ as providing finite horizon utility values for the nonstationary policy sequence $(\sigma_k)_{k\in\mathbb{N}} \subset\Sigma$, given terminal condition $v\in V$.

For the same policy sequence we define its lifetime value via

$$
\bar v = \limsup_{k\to\infty} v_k 
$$

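where $v_k = T_{\sigma_k} T_{\sigma_{k-1}}\cdots T_{\sigma_1} v$, whenever the limsup is finite and independent of the terminal condition $v$.

Under this setting, we let $v\in V_\Sigma$. By Theorem 8.1.1, we have $T^kv\to v^*$ as $k\to\infty$. Moreover, since $T_\sigma w\le Tw$ for every $\sigma\in\Sigma$ and $w\in V$, and since $T$ is order-preserving, induction on $k$ gives $v_k\le T^kv$ for all $k$.

Hence, we have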

$$
\bar v = \limsup_{k\to\infty} v_k \le \limsup_{k\to\infty}T^k v = \lim_{k\to\infty} T^kv = v^*
$$
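The domination $v_k\le T^kv$ behind this inequality can be checked numerically. The sketch assumes the MDP aggregator and draws an arbitrary policy sequence at random; the data, the horizon `K` and the small tolerance for rounding error are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(7)
n, m, beta, K = 4, 3, 0.9, 30
r = rng.uniform(size=(n, m))
P = rng.uniform(size=(n, m, n))
P /= P.sum(axis=2, keepdims=True)
idx = np.arange(n)

def T(v):
    return (r + beta * P @ v).max(axis=1)

def T_sigma(sigma, v):
    return r[idx, sigma] + beta * P[idx, sigma] @ v

v = rng.uniform(size=n)                      # terminal condition
policy_seq = rng.integers(m, size=(K, n))    # sigma_1, ..., sigma_K drawn at random

v_k, Tk_v = v, v
dominated = True
for sigma in policy_seq:
    v_k = T_sigma(sigma, v_k)    # v_k = T_{sigma_k} v_{k-1}
    Tk_v = T(Tk_v)               # T^k v
    dominated = dominated and bool(np.all(v_k <= Tk_v + 1e-12))
print(dominated)
```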


## Bounded RDP

We call an RDP $\mathcal{R} = (\Gamma, V, B)$ **bounded** if $V$ is convex and, moreover, there exist functions $v_1, v_2\in V$ such that $v_1\le v_2$ and

$$
v_1(x)\le B(x,a,v_1) \quad\text{and}\quad B(x,a,v_2)\le v_2(x)
$$

for all $(x,a)\in G$.

We show that boundedness can be used to obtain optimality results for well-posed RDPs even without global stability.

**MDP is a bounded RDP.**

### Theorem 8.1.2. A well-posed and bounded RDP satisfies all of the optimality results above.



## Topologically conjugate RDPs

Let $\mathcal{R} = (\Gamma, V, B)$ and $\hat{\mathcal{R}} = (\Gamma, \hat V, \hat B)$ be two RDPs with identical state space $X$, action space $A$ and feasible correspondence $\Gamma$.

We consider settings where

$$
V = \mathbb{M}^X, \quad \hat V = \hat{\mathbb{M}}^X
$$

with $\mathbb{M}, \hat{\mathbb{M}}\subset\mathbb{R}$ and where, in addition, there exists a homeomorphism $\varphi:\mathbb{M}\to \hat{\mathbb{M}}$ such that

$$
B(x,a,v) = \varphi^{-1}[\hat B (x,a,\varphi\circ v)]
$$

for all $(x,a)\in G$ and $v\in V$. We call $\mathcal{R}$ and $\hat{\mathcal{R}}$ **topologically conjugate** under $\varphi$.

## Proposition 8.1.3. $\mathcal{R}$ is globally stable $\iff$ $\hat{\mathcal{R}}$ is globally stable. 
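As an illustration, the sketch below takes $\hat{\mathcal{R}}$ to be an MDP-style RDP with $\hat{\mathbb{M}}=\mathbb{R}$ and builds $\mathcal{R}$ on $\mathbb{M}=(0,\infty)$ through $\varphi=\ln$, so that $B(x,a,v)=\exp[r(x,a)+\beta\sum_{x'}P(x,a,x')\ln v(x')]$. The choice of $\varphi$ and all data are hypothetical. For a fixed policy, both policy operators converge and their fixed points are related by $\varphi$.

```python
import numpy as np

rng = np.random.default_rng(8)
n, beta = 4, 0.9
r_sigma = rng.uniform(size=n)            # rewards under a fixed policy sigma
P_sigma = rng.uniform(size=(n, n))
P_sigma /= P_sigma.sum(axis=1, keepdims=True)

phi, phi_inv = np.log, np.exp            # homeomorphism from (0, inf) onto R

def T_hat(v_hat):
    """Policy operator of R-hat: additive aggregator on V-hat = R^X."""
    return r_sigma + beta * P_sigma @ v_hat

def T(v):
    """Policy operator of R, defined through the conjugacy relation."""
    return phi_inv(T_hat(phi(v)))

v_hat = np.zeros(n)
v = np.full(n, 5.0)                      # any strictly positive starting point
for _ in range(500):
    v_hat, v = T_hat(v_hat), T(v)

print(np.max(np.abs(phi(v) - v_hat)))    # fixed points satisfy phi(v) = v_hat
```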

# Types of RDPs

## Contracting RDPs

Let $\mathcal{R} = (\Gamma, V, B)$ be an RDP. We call $\mathcal{R}$ **contracting** if there exists a $\beta<1$ such that,

$$
|B(x,a,v)-B(x,a,w)|\le \beta \|v-w\|_\infty
$$

for all $(x,a)\in G$ and $v,w\in V$.

### Proposition 8.2.1. In a contracting RDP, all policy operators and the Bellman operator are contractions with the same modulus $\beta$ under the supremum norm.

### Corollary 8.2.2. A contracting RDP with closed $V$ is globally stable.

### Proposition 8.2.3. Error bound for contracting RDPs

Fix $v\in V$ and let $v_k=T^kv$. If $\sigma$ is $v_k$-greedy, then

$$
\|v^*-v_\sigma\|_\infty\le \frac{2\beta}{1-\beta}\|v_k-v_{k-1}\|_\infty
$$

Since VFI terminates when $\|v_k-v_{k-1}\|_\infty$ falls below a given tolerance, this result directly provides a quantitative bound on the performance of the policy returned by VFI.
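The bound can be checked numerically for the MDP special case, which is contracting with modulus $\beta$. In the sketch, `v_star` is approximated by running VFI for many iterations; the data and the choice of five VFI steps are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(9)
n, m, beta = 5, 3, 0.9
r = rng.uniform(size=(n, m))
P = rng.uniform(size=(n, m, n))
P /= P.sum(axis=2, keepdims=True)
idx = np.arange(n)

def T(v):
    return (r + beta * P @ v).max(axis=1)

def greedy(v):
    return (r + beta * P @ v).argmax(axis=1)

# A few VFI steps: v_k = T^k v with v = 0, keeping the previous iterate
v_k = np.zeros(n)
for _ in range(5):
    v_prev, v_k = v_k, T(v_k)

sigma = greedy(v_k)                      # v_k-greedy policy
v_sigma = np.linalg.solve(np.eye(n) - beta * P[idx, sigma], r[idx, sigma])

v_star = v_k                             # approximate v* by long VFI
for _ in range(2000):
    v_star = T(v_star)

lhs = np.max(np.abs(v_star - v_sigma))
rhs = 2 * beta / (1 - beta) * np.max(np.abs(v_k - v_prev))
print(lhs, rhs)
```

The bound is typically loose, so the greedy policy is often much closer to optimal than `rhs` suggests.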

We say that an RDP satisfies **Blackwell's condition** if $v\in V$ implies $v+\lambda := v+\lambda\mathbb{1} \in V$ for every $\lambda\ge 0$ and there exists a $\beta\in [0,1)$ such that

$$
B(x,a,v+\lambda) \le B(x,a,v) + \beta\lambda
$$

for all $(x,a)\in G$, $v\in V$ and $\lambda\ge 0$.
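As a worked check, assume the MDP aggregator $B(x,a,v)=r(x,a)+\beta\sum_{x'}P(x,a,x')v(x')$ with $\beta\in[0,1)$ and $V=\mathbb{R}^X$ (an illustrative special case). Because $P(x,a,\cdot)$ sums to one, Blackwell's condition holds with equality:

$$
B(x,a,v+\lambda) = r(x,a)+\beta\sum_{x'}P(x,a,x')[v(x')+\lambda] = B(x,a,v)+\beta\lambda
$$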

## Eventually Contracting RDPs

Let $\mathcal{R} = (\Gamma, V, B)$ be an RDP with policy set $\Sigma$. We call $\mathcal{R}$ **eventually contracting** if there is a map $L:G\times X\to \mathbb{R}_+$ such that

$$
|B(x,a,v)-B(x,a,w)|\le \sum_{x'} |v(x')-w(x')|L(x,a,x')
$$

for all $(x,a)\in G$ and all $v,w\in V$ and, moreover,

$$
\sigma\in\Sigma \implies \rho(L_\sigma)<1
$$

where $L_\sigma(x,x') := L(x,\sigma(x),x')$ and $\rho$ denotes the spectral radius.
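A sketch of one way this condition can arise: state-dependent discounting with aggregator $B(x,a,v)=r(x,a)+\beta(x)\sum_{x'}P(x,a,x')v(x')$, for which $L(x,a,x')=\beta(x)P(x,a,x')$ satisfies the Lipschitz bound. The discount vector `beta_x` and the kernel `P` are hypothetical and chosen so that the discount factor exceeds one in state 0 while every spectral radius stays below one.

```python
import itertools
import numpy as np

rng = np.random.default_rng(10)
n, m = 3, 2
P = rng.uniform(size=(n, m, n))
P[0, :, 0] = 0.0                         # state 0 is always left after one period
P /= P.sum(axis=2, keepdims=True)
beta_x = np.array([1.05, 0.7, 0.7])      # discount factor exceeds 1 in state 0
idx = np.arange(n)

# L[x, a, x'] = beta(x) * P[x, a, x']
L = beta_x[:, None, None] * P

def spectral_radius(M):
    return np.max(np.abs(np.linalg.eigvals(M)))

radii, row_sums = [], []
for sigma in itertools.product(range(m), repeat=n):
    L_sigma = L[idx, np.asarray(sigma)]  # L_sigma[x, x'] = L[x, sigma[x], x']
    radii.append(spectral_radius(L_sigma))
    row_sums.append(L_sigma.sum(axis=1).max())

print(max(radii), max(row_sums))
```

Here the largest row sum exceeds one, so this choice of $L$ does not deliver a one-step supremum norm contraction, yet the spectral radius condition still holds for every policy.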

### Proposition 8.2.4. An RDP that is eventually contracting with closed $V$ is globally stable.

## Convex and Concave RDPs

Let $\mathcal{R} = (\Gamma, V, B)$ be an RDP with $V=[v_1,v_2]$ for some $v_1\le v_2$ in $\mathbb{R}^X$.

We call $\mathcal{R}$ **convex** if

- for all $(x,a)\in G$, $\lambda\in[0,1]$ and $v,w\in V$, we have,

$$
B(x,a,\lambda v + (1-\lambda)w) \le \lambda B(x,a,v) + (1-\lambda)B(x,a,w)
$$

**and**

- there exists a $\delta>0$ such that

$$
B(x,a,v_2) \le v_2(x)-\delta[v_2(x)-v_1(x)]
$$

for all $(x,a)\in G$.

We call $\mathcal{R}$ **concave** if

- for all $(x,a)\in G$, $\lambda\in[0,1]$, $v,w\in V$, we have,

$$
B(x,a,\lambda v+(1-\lambda)w)\ge \lambda B(x,a,v)+(1-\lambda) B(x,a,w)
$$

**and**

- there exists a $\delta>0$ such that

$$
B(x,a,v_1) \ge v_1(x) + \delta [v_2(x)-v_1(x)]
$$

for all $(x,a)\in G$.


### Proposition 8.2.5. Every convex or concave RDP is globally stable.

# Further Applications

## Risk-sensitive RDP