# Definitions and Properties

## Defining RDP

Consider a generic Bellman equation

$$
v(x) = \max_{a\in\Gamma(x)}B(x,a,v)
$$

where

- $x$ is the state
- $a$ is the action
- $\Gamma$ is the feasible correspondence
- $B$ is the aggregator function


**Recursive Decision Process**

We define a **Recursive Decision Process (RDP)** to be a triple $\mathcal{R}=(\Gamma, V, B)$ consisting of

- **Feasible correspondence** $\Gamma$, a nonempty correspondence from $X$ to $A$, which in turn defines

   - **Feasible state-action pairs** $G=\{(x,a)\in X\times A: a\in \Gamma(x)\}$
   - **Feasible policy set** $\Sigma=\{\sigma \in A^X: \sigma(x)\in \Gamma(x) \text{ for all } x\in X\}$

- **Value space** $V$, a subset of $\mathbb{R}^X$
- **Value aggregator** $B: G\times V\to \mathbb{R}$, where $B(x,a,v)$ is the total lifetime reward corresponding to current state $x$, current action $a$ and value function $v$. The aggregator satisfies
   - **Monotonicity**: $v,w\in V$ and $v\le w\implies B(x,a,v)\le B(x,a,w)$ for all $(x,a)\in G$
   - **Consistency**: $w(x) = B(x,\sigma(x),v)$ for all $x\in X$, for some $\sigma\in\Sigma$ and $v\in V\implies w\in V$.
 
(The monotonicity condition states that if, relative to $v$, rewards are at least as high under $w$ in every future state, then the total rewards one can extract under $w$ should be at least as high.)

(The consistency condition ensures that, as we consider values of different policies, we remain within the value space $V$.)

A Markov decision process (MDP) can be treated as a special case of an RDP.
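To make the MDP case concrete, here is a minimal sketch of the standard MDP aggregator $B(x,a,v) = r(x,a) + \beta\sum_{x'}v(x')P(x,a,x')$ on a hypothetical two-state, two-action model. The names `beta`, `r` and `P`, and all of the numbers, are assumptions made for illustration and do not come from these notes. Every action is taken to be feasible at every state.

```python
import numpy as np

# Hypothetical two-state, two-action MDP; all numbers are illustrative.
beta = 0.9                                  # discount factor
r = np.array([[1.0, 0.0],                   # r[x, a]: reward at state x, action a
              [0.0, 2.0]])
P = np.array([[[0.8, 0.2], [0.1, 0.9]],     # P[x, a, x']: transition probabilities
              [[0.5, 0.5], [0.3, 0.7]]])

def B(x, a, v):
    """MDP aggregator: current reward plus discounted expected continuation value."""
    return r[x, a] + beta * P[x, a] @ v

v = np.array([0.0, 1.0])
w = np.array([0.5, 1.5])                    # w >= v pointwise

# Monotonicity: v <= w should imply B(x, a, v) <= B(x, a, w) at every feasible pair.
monotone = all(B(x, a, v) <= B(x, a, w) for x in range(2) for a in range(2))
```

Monotonicity holds here because `P[x, a]` has nonnegative entries, so raising `v` pointwise cannot lower the expectation.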

## Lifetime Value

### Policy and Value

Let $\mathcal{R}=(\Gamma, V,B)$ be an RDP with state space $X$ and action space $A$, and let $\Sigma$ be the set of all feasible policies. For each $\sigma\in\Sigma$, we introduce the **policy operator** $T_\sigma$ as a self-map on $V$ defined by

$$
(T_\sigma v)(x) = B(x,\sigma(x),v)
$$

Consistency of $B$ ensures that $T_\sigma$ maps $V$ into itself, and monotonicity of $B$ ensures that $T_\sigma$ is order-preserving.

If $T_\sigma$ has a unique fixed point in $V$, we denote this fixed point by $v_\sigma$ and call it the $\sigma$-value function.

We can interpret $v_\sigma$ as representing the lifetime value of following policy $\sigma$.
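As a sketch of how $v_\sigma$ might be computed, consider again the MDP special case, where $T_\sigma v = r_\sigma + \beta P_\sigma v$ is affine and its fixed point solves the linear system $(I - \beta P_\sigma)v = r_\sigma$. The model below (`beta`, `r`, `P`) and the helper names `T_sigma` and `policy_value` are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical two-state, two-action MDP; all numbers are illustrative.
beta = 0.9
r = np.array([[1.0, 0.0], [0.0, 2.0]])      # r[x, a]
P = np.array([[[0.8, 0.2], [0.1, 0.9]],     # P[x, a, x']
              [[0.5, 0.5], [0.3, 0.7]]])
states = np.arange(2)

def T_sigma(v, sigma):
    """Policy operator: (T_sigma v)(x) = B(x, sigma(x), v) for the MDP aggregator."""
    return r[states, sigma] + beta * P[states, sigma] @ v

def policy_value(sigma):
    """Fixed point of T_sigma, obtained by solving (I - beta P_sigma) v = r_sigma."""
    r_sigma = r[states, sigma]              # rewards under the policy
    P_sigma = P[states, sigma]              # transition matrix under the policy
    return np.linalg.solve(np.eye(len(states)) - beta * P_sigma, r_sigma)

sigma = np.array([0, 1])                    # action 0 at state 0, action 1 at state 1
v_sigma = policy_value(sigma)

# The residual of the fixed-point equation should be numerically zero.
residual = np.max(np.abs(T_sigma(v_sigma, sigma) - v_sigma))
```

The linear solve is specific to the affine MDP case. For a general RDP the fixed point would typically have to be approximated some other way, for example by iterating $T_\sigma$.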


**IN RDP**

In many applications, the policy operator can be expressed as $(T_\sigma v)(x) = A_\sigma(x,(R_\sigma v)(x))$ for some aggregator $A_\sigma$ and certainty equivalent operator $R_\sigma$.

In that case, $T_\sigma$ is a Koopmans operator, and the lifetime value associated with policy $\sigma$ is the fixed point of this operator.
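For instance, in the MDP special case with reward function $r$, discount factor $\beta$ and transition kernel $P$ (symbols assumed here for illustration, since they are not defined in this section), one possible decomposition is

$$
A_\sigma(x, y) = r(x,\sigma(x)) + \beta y,
\qquad
(R_\sigma v)(x) = \sum_{x'\in X} v(x')P(x,\sigma(x),x')
$$

so that $(T_\sigma v)(x) = r(x,\sigma(x)) + \beta\sum_{x'\in X} v(x')P(x,\sigma(x),x')$.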

### Uniqueness and Stability

Let $\mathcal{R} = (\Gamma, V, B)$ be a given RDP with policy operators $\{T_\sigma\}$.

Given that our objective is to maximize lifetime value over the set of policies in $\Sigma$, we need to assume at the very least that lifetime value is well defined at each policy.

**Well-Posed**

We say that $\mathcal{R}$ is **well-posed** if for all $\sigma\in\Sigma$, $T_\sigma$ has a unique fixed point $v_\sigma\in V$.

**Globally stable**

Let $\mathcal{R}$ be an RDP with policy operators $\{T_\sigma\}_{\sigma\in\Sigma}$.

We say $\mathcal{R}$ is **globally stable** if for all $\sigma\in\Sigma$, $T_\sigma$ is globally stable on $V$.

**Every globally stable RDP is well-posed.**

**Global stability implies that for any choice of terminal condition, finite horizon valuations always converge to their infinite horizon counterparts.**
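The convergence claim can be illustrated numerically. The sketch below applies $T_\sigma$ repeatedly to two very different terminal conditions in a hypothetical discounted MDP and compares the results with the fixed point $v_\sigma$. The model, the policy and the helper name `finite_horizon_value` are assumptions made for this example.

```python
import numpy as np

# Hypothetical two-state, two-action MDP; all numbers are illustrative.
beta = 0.9
r = np.array([[1.0, 0.0], [0.0, 2.0]])      # r[x, a]
P = np.array([[[0.8, 0.2], [0.1, 0.9]],     # P[x, a, x']
              [[0.5, 0.5], [0.3, 0.7]]])
states = np.arange(2)
sigma = np.array([0, 1])                    # a fixed feasible policy

def T_sigma(v):
    """Policy operator for the fixed policy sigma."""
    return r[states, sigma] + beta * P[states, sigma] @ v

def finite_horizon_value(terminal, horizon):
    """Value of following sigma for `horizon` periods, then receiving `terminal`."""
    v = np.asarray(terminal, dtype=float)
    for _ in range(horizon):
        v = T_sigma(v)
    return v

v_a = finite_horizon_value([0.0, 0.0], 500)
v_b = finite_horizon_value([100.0, -50.0], 500)

# Infinite-horizon value, computed directly as the fixed point of T_sigma.
v_sigma = np.linalg.solve(np.eye(2) - beta * P[states, sigma], r[states, sigma])
```

Both `v_a` and `v_b` should agree with `v_sigma` up to numerical tolerance, since in this discounted example $T_\sigma$ is a contraction of modulus $\beta$ under the supremum norm.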

### Continuity

Let $\mathcal{R}=(\Gamma, V,B)$ be an RDP. 

We call $\mathcal{R}$ **continuous** if $B(x,a,v)$ is continuous in $v$ for all $(x,a)\in G$. 

In other words, $\mathcal{R}$ is continuous if, for any $v\in V$, any $(x,a)\in G$ and any sequence $(v_k)_{k\ge 1}$ in $V$, we have

$$
\lim_{k\to\infty} v_k = v \implies \lim_{k\to\infty}B(x,a,v_k) = B(x,a,v) 
$$

## Optimality

### Greedy policies

Given an RDP $\mathcal{R} = (\Gamma, V,B)$ and $v\in V$, a policy $\sigma\in\Sigma$ is called $v$-greedy if

$$
\sigma(x)\in\arg\max_{a\in\Gamma(x)}B(x,a,v)
$$

for all $x\in X$. 

**Since $\Gamma(x)$ is finite and nonempty at each $x\in X$, at least one such policy exists.**
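A minimal sketch of computing a $v$-greedy policy when states and actions are finite: tabulate $B(x,a,v)$ over the feasible pairs and take an argmax in each row. The MDP aggregator, the model numbers and the function name `greedy` are hypothetical, and every action is assumed feasible at every state.

```python
import numpy as np

# Hypothetical two-state, two-action MDP; all numbers are illustrative.
beta = 0.9
r = np.array([[1.0, 0.0], [0.0, 2.0]])      # r[x, a]
P = np.array([[[0.8, 0.2], [0.1, 0.9]],     # P[x, a, x']
              [[0.5, 0.5], [0.3, 0.7]]])

def greedy(v):
    """Return a v-greedy policy as an array mapping each state to an action."""
    q = r + beta * P @ v                    # q[x, a] = B(x, a, v)
    return np.argmax(q, axis=1)             # ties go to the lowest action index

sigma = greedy(np.zeros(2))                 # greedy policy for v = 0
```

With `v` equal to zero the continuation term vanishes, so the greedy policy simply picks the action with the highest current reward at each state.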

We say that $v\in V$ satisfies the **Bellman equation** if

$$
v(x)=\max_{a\in\Gamma(x)}B(x,a,v)
$$

for all $x\in X$. The **Bellman operator** corresponding to $\mathcal{R}$ is the map $T$ on $V$ defined by

$$
(Tv)(x) = \max_{a\in\Gamma(x)}B(x,a,v)
$$
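By construction, $v$ satisfies the Bellman equation exactly when $Tv = v$. The sketch below implements $T$ for a hypothetical discounted MDP and searches for such a fixed point by successive approximation. The model, the tolerance and the iteration cap are assumptions for illustration, and convergence of this iteration is a property of the discounted MDP example rather than of every RDP.

```python
import numpy as np

# Hypothetical two-state, two-action MDP; all numbers are illustrative.
beta = 0.9
r = np.array([[1.0, 0.0], [0.0, 2.0]])      # r[x, a]
P = np.array([[[0.8, 0.2], [0.1, 0.9]],     # P[x, a, x']
              [[0.5, 0.5], [0.3, 0.7]]])

def T(v):
    """Bellman operator: (Tv)(x) = max over a of B(x, a, v)."""
    return np.max(r + beta * P @ v, axis=1)

# Successive approximation from v = 0 (tolerance and cap are arbitrary choices).
v = np.zeros(2)
for _ in range(1000):
    v_new = T(v)
    converged = np.max(np.abs(v_new - v)) < 1e-12
    v = v_new
    if converged:
        break

# If the iteration converged, v approximately satisfies the Bellman equation.
bellman_residual = np.max(np.abs(T(v) - v))
```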