# Preliminaries

### Order stability

Let $V$ be a partially ordered set and let $T$ be a self-map on $V$ with **exactly one fixed point** $\bar v\in V$.

We say that

- $T$ is **upward stable** on $V$ if $v\in V$ and $v\precsim Tv$ imply $v\precsim \bar v$,

- $T$ is **downward stable** on $V$ if $v\in V$ and $Tv\precsim v$ imply $\bar v \precsim v$,

- $T$ is **order stable** on $V$ if $T$ is both upward and downward stable.
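As a quick numerical illustration (a sketch, not part of the original notes), take $V=\mathbb{R}$ with the usual order. The maps `contracting` and `expanding`, the helper `is_order_stable_on`, and the grid are hypothetical choices made for this example. Checking finitely many points can refute order stability but cannot prove it.

```python
def is_order_stable_on(T, v_bar, points):
    """Return False if some point violates upward or downward stability."""
    for v in points:
        if v <= T(v) and not (v <= v_bar):   # upward stability fails at v
            return False
        if T(v) <= v and not (v_bar <= v):   # downward stability fails at v
            return False
    return True

# Grid from -10 to 10 in steps of 0.25 (exactly representable floats).
grid = [k / 4 for k in range(-40, 41)]

def contracting(v):
    return 1.0 + 0.5 * v      # unique fixed point 2.0

def expanding(v):
    return 2.0 * v            # unique fixed point 0.0

stable = is_order_stable_on(contracting, 2.0, grid)
unstable = is_order_stable_on(expanding, 0.0, grid)
print(stable, unstable)
```

The expanding map fails the check because, for example, $v=1$ satisfies $v\le Tv$ but lies above the fixed point $0$.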


### Lemma 9.1.1. $T$ is order-preserving and globally stable $\implies$ $T$ is order stable

Let $X$ be finite, let $V$ be a subset of $\mathbb{R}^X$, and let $T$ be an order-preserving self-map on $V$.

If $T$ is globally stable on $V$, then $T$ is order stable on $V$.

### Order duals

Given a partially ordered set $V = (V,\precsim)$, let $V^\partial = (V, \precsim^\partial)$ be its **order dual**, so that, for $u,v\in V$, we have $u\precsim^\partial v$ if and only if $v\precsim u$.

(The order dual $V^\partial$ is just $V$ with the order reversed.)

### Lemma 9.1.2. $S$ is order-stable on $V$ $\iff$ $S$ is order-stable on $V^\partial$.

## Abstract Dynamic Program

An **abstract dynamic program (ADP)** is a pair $\mathcal{A} = (V, \{T_\sigma\}_{\sigma\in\Sigma})$ such that 

1. $V = (V,\precsim)$ is a partially ordered set,
2. $\{T_\sigma\}:= \{T_\sigma\}_{\sigma\in\Sigma}$ is a family of self-maps on $V$,
3. for all $v\in V$, the set $\{T_\sigma v\}_{\sigma\in\Sigma}$ has both a least and a greatest element.

- Elements of the index set $\Sigma$ are called **policies**.
- Elements of $\{T_\sigma\}$ are called **policy operators**.
- Given $v\in V$, a policy $\sigma\in\Sigma$ is called **$v$-greedy** if $T_\tau v\precsim T_\sigma v$ for all $\tau\in\Sigma$.

Existence of a greatest element in item 3 of the definition is equivalent to the statement that **each $v\in V$ has at least one $v$-greedy policy.**
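The definition of a $v$-greedy policy can be sketched on a toy example. Everything here is an assumption made for illustration: $V=\mathbb{R}$ with the usual order, a two-element policy set, and affine policy operators $T_\sigma v = r_\sigma + \beta_\sigma v$ stored in the hypothetical dictionary `params`.

```python
# Hypothetical toy ADP: V = R, Sigma = {"a", "b"}, T_sigma v = r + beta * v.
params = {"a": (1.0, 0.5), "b": (0.0, 0.75)}  # sigma -> (r, beta)

def T_sigma(sigma, v):
    """Policy operator for policy sigma."""
    r, beta = params[sigma]
    return r + beta * v

def greedy_policies(v):
    """All sigma such that T_tau v <= T_sigma v for every tau in Sigma."""
    best = max(T_sigma(s, v) for s in params)
    return [s for s in params if T_sigma(s, v) == best]

print(greedy_policies(0.0))  # T_a 0 = 1.0 beats T_b 0 = 0.0
print(greedy_policies(8.0))  # T_b 8 = 6.0 beats T_a 8 = 5.0
print(greedy_policies(4.0))  # both give 3.0, so both policies are greedy
```

Because $\Sigma$ is finite and $\mathbb{R}$ is totally ordered, the set $\{T_\sigma v\}$ always has a least and a greatest element here, so item 3 of the definition holds automatically.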

### Example of ADP that cannot be represented as RDP

Recall the $Q$-factor MDP Bellman operator, which takes the form

$$
(Sq)(x,a) = r(x,a)+\beta\sum_{x'}\max_{a'\in\Gamma(x')} q(x',a') P(x,a,x')
$$

with $q\in\mathbb{R}^G$ and $(x,a)\in G$.

The $Q$-factor policy operators $\{S_\sigma\}$ are given by

$$
(S_\sigma q)(x,a) = r(x,a) + \beta\sum_{x'}q(x', \sigma(x'))P(x,a,x')
$$

Each $S_\sigma$ is a self-map on $\mathbb{R}^G = (\mathbb{R}^G, \le)$.

If $q\in\mathbb{R}^G$ and $\sigma\in\Sigma$ is such that $\sigma(x)\in \arg\max_{a\in\Gamma(x)}q(x,a)$ for all $x\in X$, then $S_\sigma q\ge S_\tau q$ for all $\tau \in\Sigma$.

Hence $\sigma$ is a $q$-greedy policy. A symmetric argument with $\arg\min$ yields a least element of $\{S_\tau q\}_{\tau\in\Sigma}$, so $\mathcal{A}=(\mathbb{R}^G,\{S_\sigma\})$ is an ADP.
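The greedy-policy claim can be checked numerically on a small instance. The sketch below is illustrative only: the two-state, two-action model (`X`, `A`, `r`, `P`, `beta`), the assumption $\Gamma(x)=A$ for all $x$, and the test point `q` are all made up for this example.

```python
from itertools import product

X = [0, 1]
A = [0, 1]      # assumption: Gamma(x) = A for every state x
beta = 0.9
r = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 2.0}
# P[(x, a)][x_next] is the probability of moving from x to x_next under a.
P = {(0, 0): [0.5, 0.5], (0, 1): [0.0, 1.0],
     (1, 0): [1.0, 0.0], (1, 1): [0.5, 0.5]}

def S_sigma(sigma, q):
    """Q-factor policy operator for the policy sigma (a dict x -> a)."""
    return {(x, a): r[x, a] + beta * sum(q[xp, sigma[xp]] * P[x, a][xp] for xp in X)
            for x in X for a in A}

def S(q):
    """Q-factor Bellman operator."""
    return {(x, a): r[x, a] + beta * sum(max(q[xp, ap] for ap in A) * P[x, a][xp]
                                         for xp in X)
            for x in X for a in A}

def q_greedy(q):
    """Pick sigma(x) in argmax_a q(x, a) for each state x."""
    return {x: max(A, key=lambda a: q[x, a]) for x in X}

q = {(0, 0): 0.3, (0, 1): 1.2, (1, 0): -0.5, (1, 1): 0.1}
sigma = q_greedy(q)

# Enumerate all policies and check S_tau q <= S_sigma q pointwise.
policies = [dict(zip(X, choice)) for choice in product(A, repeat=len(X))]
dominates = all(S_sigma(tau, q)[g] <= S_sigma(sigma, q)[g]
                for tau in policies for g in q)
print(sigma, dominates, S(q) == S_sigma(sigma, q))
```

The last comparison confirms that, at this `q`, applying the greedy policy operator reproduces $Sq$.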

# Optimality

## Max-optimality

We call an ADP $\mathcal{A} = (V,\{T_\sigma\})$ **well-posed** if every policy operator $T_\sigma$ has a unique fixed point in $V$. 

Well-posedness is a minimum requirement for constructing an optimality theory around an ADP.

#### Operators

Let $\mathcal{A} = (V,\{T_\sigma\})$ be an ADP. We set

$$
Tv = \bigvee_\sigma T_\sigma v
$$

and call $T$ the **Bellman operator** generated by $\mathcal{A}$. 

$T$ is a well-defined self-map on $V$ by item 3 of the definition of an ADP (existence of a greedy policy).

An element $v\in V$ is said to satisfy the **Bellman equation** if it is a fixed point of $T$.





### Howard Operator

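Assume $\mathcal{A}$ is well-posed and let $v_\sigma$ denote the unique fixed point of $T_\sigma$. Define a map $H$ from $V$ to $\{v_\sigma\}$ via $Hv=v_\sigma$, where $\sigma$ is $v$-greedy.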

Iterating with $H$ generates the value sequence associated with Howard Policy Iteration.

We call $H$ the **Howard operator generated by the ADP**.
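A minimal sketch of the Howard operator and the resulting iteration, under stated assumptions: $V=\mathbb{R}$, two affine policy operators $T_\sigma v = r_\sigma + \beta_\sigma v$ with $0\le\beta_\sigma<1$ (so $v_\sigma = r_\sigma/(1-\beta_\sigma)$), and ties between greedy policies broken by taking the first maximizer. The names `params`, `greedy`, and `howard_policy_iteration` are hypothetical.

```python
# Hypothetical toy ADP on V = R: T_sigma v = r + beta * v.
params = {"a": (1.0, 0.5), "b": (0.0, 0.75)}  # sigma -> (r, beta)

def T_sigma(sigma, v):
    r, beta = params[sigma]
    return r + beta * v

def v_sigma(sigma):
    """Unique fixed point of T_sigma, valid because beta < 1."""
    r, beta = params[sigma]
    return r / (1 - beta)

def greedy(v):
    """A v-greedy policy; max returns the first maximizer on ties."""
    return max(params, key=lambda s: T_sigma(s, v))

def T(v):
    """Bellman operator: greatest element of {T_sigma v}."""
    return max(T_sigma(s, v) for s in params)

def H(v):
    """Howard operator: value of a v-greedy policy."""
    return v_sigma(greedy(v))

def howard_policy_iteration(v, max_iter=100):
    """Iterate H until the value stops changing."""
    for _ in range(max_iter):
        v_new = H(v)
        if v_new == v:
            return v
        v = v_new
    raise RuntimeError("no fixed point of H found within max_iter steps")

v_final = howard_policy_iteration(v_sigma("b"))  # start from the value of policy "b"
print(v_final, T(v_final) == v_final)
```

Starting from $v_b = 0$, the iteration moves to $v_a = 2$ and stops there, and the limit also satisfies the Bellman equation in this example.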

### Property of ADP

Let $\mathcal{A}: = (V, \{T_\sigma\}_{\sigma\in\Sigma})$ be an ADP.

We call $\mathcal{A}$

- **finite** if $\Sigma$ is a finite set.
- **order stable** if every policy operator $T_\sigma$ is order stable on $V$
- **max-stable** if $\mathcal{A}$ is order stable and $T$ has at least one fixed point in $V$.

We have **max-stable $\implies$ order-stable $\implies$ well-posed**.


### Proposition 9.2.1. Finite and order stable ADP is max-stable.

### Corollary 9.2.2. If RDP is globally stable, then the ADP generated by the RDP is max-stable.

### Proposition 9.2.3. For ADP generated by RDP, Well-posed iff Order stable on an order interval

## Max-Optimality Result

Let $\mathcal{A} = (V, \{T_\sigma\})$ be a well-posed ADP with $\sigma$-value functions $\{v_\sigma\}$.

We define the sets

$$
V_\Sigma = \{v_\sigma\}
$$

and 

$$
V_u = \{v\in V: v\precsim Tv\}
$$

If $V_\Sigma$ has a greatest element, we denote it by $v^*$ and call it the **value function** generated by the ADP.

A policy $\sigma\in\Sigma$ is called **optimal** for the ADP if $v_\sigma = v^*$.

We say that the ADP obeys **Bellman's principle of optimality** if

$$
\sigma\in\Sigma \text{ is optimal for the ADP $\iff$ $\sigma$ is $v^*$-greedy }
$$

### Lemma B.4.1. Properties of Order stable ADP

1. $v\in V_u \implies v\precsim Hv$

**Proof:**
Let $v\in V_u$ and let $\tau$ be $v$-greedy, so that $Tv = T_\tau v$ and hence $v\precsim T_\tau v$.

Since $T_\tau$ is order stable, and in particular upward stable, we have $v\precsim v_\tau = Hv$.


2. $Tv_\sigma = v_\sigma\implies v_\sigma =v^* $

3. $Hv=v \implies v=v^*$

4. Finite and Order stable ADP $\implies \exists v^*, Hv^*=v^*$, $\forall v, H^k v \to v^*$, with finite $k$.

5. $H^{k+1}v=H^k v\implies H^k v = v^*$.

### Theorem 9.2.4. Max-Optimality

If the ADP is **finite and order stable**, then

1. the set of $\sigma$-value functions $V_\Sigma$ has a greatest element $v^*$,

2. $v^*$ is the unique solution to the Bellman equation in $V$,

3. the ADP obeys Bellman's principle of optimality,

4. the ADP has at least one optimal policy,

5. HPI returns an exact optimal policy in finitely many steps.


If the ADP is **max-stable**, then claims 1-4 hold.
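Claims 1 to 3 can be checked by brute force on a small finite ADP. This is a sketch under assumptions, not a proof: $V=\mathbb{R}$, three hypothetical affine policy operators stored in `params`, and uniqueness of the Bellman solution tested only on a finite grid.

```python
# Hypothetical finite ADP on V = R: T_sigma v = r + beta * v with 0 <= beta < 1.
params = {"a": (1.0, 0.5), "b": (0.0, 0.75), "c": (0.5, 0.75)}  # sigma -> (r, beta)

def T_sigma(sigma, v):
    r, beta = params[sigma]
    return r + beta * v

def T(v):
    """Bellman operator: maximum of T_sigma v over all policies."""
    return max(T_sigma(s, v) for s in params)

# sigma-value functions: the unique fixed points r / (1 - beta).
v_sigma = {s: r / (1 - beta) for s, (r, beta) in params.items()}

# Claim 1: V_Sigma has a greatest element.
v_star = max(v_sigma.values())

# Claim 2, checked on a grid only: v_star is the sole fixed point of T found.
grid = [k / 4 for k in range(-40, 41)]
bellman_solutions = [v for v in grid if T(v) == v]

# Claim 3: sigma is optimal if and only if sigma is v_star-greedy.
optimal = {s for s in params if v_sigma[s] == v_star}
greedy_at_v_star = {s for s in params if T_sigma(s, v_star) == T(v_star)}

print(v_star, bellman_solutions, optimal == greedy_at_v_star)
```

In this example policies `"a"` and `"c"` share the value $2$, so both are optimal and both are $v^*$-greedy, while `"b"` is neither.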

### Lemma 9.2.6. VFI starting from $\sigma$-value functions converges to the value function

### Lemma 9.2.7. $W_m$ is an order-preserving self-map on $V_u$, and $v\in V_u\implies Tv\le W_m v\le T^m v$

### Lemma 9.2.8. $T^kv_\sigma \le W^k_m v_\sigma \le T^{km}v_\sigma$

### Lemma 9.2.9. $W^{k+1}_m v_\sigma = W^{k}_m v_\sigma\implies v^*= W^k_m v_\sigma$.
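These notes do not define $W_m$. The sketch below assumes the usual optimistic policy operator, $W_m v = T_\sigma^m v$ with $\sigma$ a $v$-greedy policy, and checks the sandwich inequality of Lemma 9.2.7 at a few points of $V_u$. The toy ADP on $V=\mathbb{R}$, the names `params`, `W`, `iterate`, and the test points are hypothetical.

```python
# Hypothetical toy ADP on V = R: T_sigma v = r + beta * v.
params = {"a": (1.0, 0.5), "b": (0.0, 0.25)}  # sigma -> (r, beta)

def T_sigma(sigma, v):
    r, beta = params[sigma]
    return r + beta * v

def T(v):
    """Bellman operator."""
    return max(T_sigma(s, v) for s in params)

def greedy(v):
    """A v-greedy policy (first maximizer on ties)."""
    return max(params, key=lambda s: T_sigma(s, v))

def iterate(f, v, n):
    """Apply f to v a total of n times."""
    for _ in range(n):
        v = f(v)
    return v

def W(v, m):
    """Assumed definition: apply T_sigma m times, where sigma is v-greedy."""
    sigma = greedy(v)
    return iterate(lambda u: T_sigma(sigma, u), v, m)

m = 3
V_u_points = [-16.0, -8.0, -4.0, -1.0, 0.0, 1.0, 2.0]
in_V_u = all(v <= T(v) for v in V_u_points)          # each point satisfies v <= T v
sandwich = all(T(v) <= W(v, m) <= iterate(T, v, m)   # T v <= W_m v <= T^m v
               for v in V_u_points)
print(in_V_u, sandwich)
print(T(-16.0), W(-16.0, m), iterate(T, -16.0, m))   # -4.0 -0.25 0.5
```

At $v=-16$ the greedy policy switches along the $T$-iterates but is held fixed by $W_m$, which is why both inequalities are strict there.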