Consider a measurable space $(\Omega, \mathcal{F})$. A map
$Q\colon\Omega \times \mathcal{F} \to [0, +\infty)$ is a transition
kernel if

* $Q(\cdot, A)$ is a measurable map for any $A \in \mathcal{F}$

* $Q(\omega, \cdot)$ is a finite measure on $(\Omega, \mathcal{F})$ 
  for any $\omega \in \Omega$


For any bounded (or nonnegative) measurable map $f$ we can define a measurable map

$$
    T f
    \colon x \mapsto \int f(\omega) \, Q(x, d\omega)
    \,. $$


For a measure $\lambda$ on $(\Omega, \mathcal{F})$ we can similarly
define a measure

$$
    T^* \lambda
    \colon A \mapsto \int Q(\omega, A) \lambda(d\omega)
    \,. $$


A dual pairing:

$$
\langle
    f, \lambda
\rangle = \int f(\omega) \lambda(d\omega)
    \,. $$

We can show that $
\langle
    T f, \lambda
\rangle = \langle
        f, T^* \lambda
    \rangle
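$ by a Fubini-type argument (the identity holds for indicators $f = 1_A$
by the definition of $T^* \lambda$, and extends to bounded or nonnegative
measurable $f$ by linearity and monotone convergence):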

$$
\int T f(x) \lambda(dx)
    = \int \int f(\omega) \, Q(x, d\omega) \lambda(dx)
    % = \iint f(\omega) \lambda(dx) Q(x, d\omega)
    = \int f(\omega) \int Q(x, d\omega) \lambda(dx)
    = \int f(\omega) (T^* \lambda)(d\omega)
    \,. $$
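
On a finite state space the kernel is just a nonnegative matrix, so the duality can be checked numerically. The sketch below works under that assumption only; the names `Q`, `f`, `lam`, `Tf` and `T_star_lam` are chosen here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

# finite kernel: Q[x, w] is the mass that Q(x, .) puts on state w
Q = rng.random((n, n))
f = rng.normal(size=n)   # a function on the states
lam = rng.random(n)      # a finite measure on the states

Tf = Q @ f               # (T f)(x) = sum_w f(w) Q(x, w)
T_star_lam = lam @ Q     # (T* lam)(w) = sum_x lam(x) Q(x, w)

lhs = Tf @ lam           # <T f, lam>
rhs = f @ T_star_lam     # <f, T* lam>
assert np.isclose(lhs, rhs)
```

In matrix terms $T$ acts on column vectors from the left and $T^*$ acts on row vectors from the right, so the identity reduces to associativity of the product $\lambda Q f$.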

In [1]:
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt

In [2]:
import torch
import torch.nn.functional as F

In [3]:
import gym

A helper that renders the environment inline:

In [4]:
from IPython.display import clear_output

from time import sleep

def display(env, fps=15):
    if fps > 0:
        clear_output(wait=True)
        env.render()
        sleep(1. / fps)

Environment simulation loop:

In [5]:
from dpd.tools.delayed import DelayedKeyboardInterrupt

def run(env, policy, fps=15):
    state, terminate = env.reset(), False

    display(env, fps)
    with DelayedKeyboardInterrupt("ignore") as stop:
        while not (terminate or stop):
    
            # take an action and get a response from the environment
            action = policy(env, state)
            state_prime, reward, terminate, info = env.step(action)

            # return the result
            yield state, action, reward, state_prime, terminate
            state = state_prime

            # render the environment
            display(env, fps)

        env.close()

<br>

## From Whiteson's lecture at MLSS 2019

In [59]:
from gym.envs.toy_text import FrozenLakeEnv, CliffWalkingEnv

env = FrozenLakeEnv(map_name="8x8", is_slippery=True)

<br>

MDP -- classical formal model of a sequential decision problem:

* fully-observable, stationary, and possibly stochastic environment

* discrete states $S$ and actions $A_s$ for each $s \in S$

* transition kernel $s\to z \colon z\sim q(z\mid s, a)$ on $S$

* reward distribution $q(r\mid s, a, s')$ when transitioning $s \to s'$ under $a$

* a planning horizon or a discount factor $\gamma \in (0, 1)$

Markov property
$p(z_{t+1}, r_{t+1}\mid s_t, a_t) = p(z_{t+1}, r_{t+1}\mid s_t, a_t, s_{:t}, a_{:t})$

* Reactive policies $a\sim \pi(a\mid s)$
* deterministic policies


**State-value** of a policy $
v^\pi(s) = \mathbb{E}_\pi \bigl(
    \sum_{k\geq 0} \gamma^k r_{t+k+1}
    \big\vert s_t = s
\bigr)
$ and **action-value** $
q^\pi(s, a) = \mathbb{E}_\pi \bigl(
    \sum_{k\geq 0} \gamma^k r_{t+k+1}
    \big\vert s_t = s, a_t = a
\bigr)
$
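
The quantity inside both expectations is the discounted return $\sum_{k\geq 0} \gamma^k r_{t+k+1}$ of a single trajectory. A minimal sketch of how it can be accumulated from a list of rewards; the helper `discounted_return` is hypothetical and not part of the notebook:

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k], accumulated backwards."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

# three steps with reward 1 and gamma = 0.5: 1 + 0.5 + 0.25
g = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

Averaging this quantity over many episodes started at $s$ (and following $\pi$) would give a Monte Carlo estimate of $v^\pi(s)$.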

Bellman fixed-point equation for $v^\pi$:
$$
v^\pi(s)
    = \mathbb{E}_{a\sim \pi(s)} \mathbb{E}_{s' \sim q(s'\mid s, a)}
        \bigl[ r(s, a, s') + \gamma v^\pi(s') \bigr]
    \,, $$

and for $q^\pi$

$$
q^\pi(s, a)
    = \mathbb{E}_{s' \sim q(s'\mid s, a)}
        \bigl[ r(s, a, s') + \gamma \mathbb{E}_{a'\sim \pi(s')} q^\pi(s', a') \bigr]
    = \mathbb{E}_{s' \sim q(s'\mid s, a)}
        \bigl[ r(s, a, s') + \gamma v^\pi(s') \bigr]
    \,. $$
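
For a finite MDP and $\gamma < 1$ the equation for $v^\pi$ is linear, $v = r_\pi + \gamma P_\pi v$, so it can also be solved directly. The sketch below uses a made-up two-state chain; `P_pi` and `r_pi` are illustrative names for the policy-averaged transition matrix and expected reward.

```python
import numpy as np

gamma = 0.9

# hypothetical two-state chain under a fixed policy:
# P_pi[s, s'] is the transition probability, r_pi[s] the expected reward
P_pi = np.array([[0.5, 0.5],
                 [0.0, 1.0]])
r_pi = np.array([1.0, 0.0])

# v = r_pi + gamma * P_pi v  <=>  (I - gamma * P_pi) v = r_pi
v = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# the solution satisfies the fixed-point equation
assert np.allclose(v, r_pi + gamma * P_pi @ v)
```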

Policies can be partially ordered by their value functions, and
all optimal policies share the same optimal state-value function
$v^*(\cdot) = \max_\pi v^\pi(\cdot)$.

The Bellman optimality conditions for $v^*$ and $q^*$ are

$$
v^*(s)
    = \max_{a\in A_s} \mathbb{E}_{s' \sim q(s'\mid s, a)}
        \bigl[ r(s, a, s') + \gamma v^*(s') \bigr]
    \,, $$

and

$$
q^*(s, a)
    = \mathbb{E}_{s' \sim q(s'\mid s, a)}
        \bigl[ r(s, a, s') + \gamma \max_{a'\in A_{s'}} q^*(s', a') \bigr]
    \,, $$

respectively. The optimal policy is greedy with respect to $q^*$:

$$
\pi^*(s)
    = \delta_{a^*_s}
    \,, \text{ for }
    a^*_s = \arg\max_{a\in A_s} q^*(s, a)
    \,. $$

<br>

### Random policy

In [60]:
def random(env, state=None):
    return env.action_space.sample()

episode = [*run(env, random, fps=10)]

  (Left)
SFFFFFFF
FFFFFFFF
FFF[41mH[0mFFFF
FFFFFHFF
FFFHFFFF
FHHFFFHF
FHFFHFHF
FFFHFFFG


<br>

## Policy Evaluation

Bellman operator for a policy $\pi\colon S \to \Delta_A$:
$$
T_\pi(v)
\colon s \mapsto \mathbb{E}_{a\sim \pi(a\mid s)}
    \mathbb{E}_{z\sim q(z\mid s, a)} \bigl[ r(s, a, z) + \gamma v(z) \bigr]
    \,. $$

In [61]:
def expected_state_reward(states, value, gamma=1.0):
    # states -- list of (prob, next_state, reward, done) outcomes of one action
    return sum(
        prob * (reward + gamma * value[state])
        for prob, state, reward, term in states
    )

In [62]:
def expected_action_reward(actions, kernel, value, gamma=1.0):
    # actions -- list of (prob, action) pairs, i.e. the policy at one state
    return sum(
        prob * expected_state_reward(kernel[action], value, gamma)
        for prob, action in actions
    )

The policy evaluation is performed via the fixed point iterations:

* repeat $v_{t+1} \leftarrow T_\pi(v_t)$ until convergence in $\|\cdot\|_\infty$


In [10]:
def evaluate_policy(env, policy, gamma=1.0, atol=1e-8):
    value, delta = {state: 0. for state in env.P}, float("+inf")
    while delta > atol:
        Tv = {
            state: expected_action_reward(policy[state], kernel, value, gamma)
            for state, kernel in env.P.items()
        }
        
        delta = max(abs(a - b) for a, b in zip(Tv.values(), value.values()))
        value = Tv

    return value

Let's evaluate a random exploration policy:

In [11]:
gamma = 1.

policy = {
    state: [(1. / env.nA, action) for action in kernel]
    for state, kernel in env.P.items()
}

value = evaluate_policy(env, policy, gamma)

The $q$-function implied by the converged value function $v_\infty$ is
$$
q_\infty(s, a)
    = \mathbb{E}_{s'\sim q(s'\mid s, a)}
        \bigl[ r(s, a, s') + \gamma v_\infty(s') \bigr]
    \,. $$

In [12]:
q_fun = {
    state: {
        action: expected_state_reward(states, value, gamma)
        for action, states in kernel.items()
    } for state, kernel in env.P.items()
}

In [27]:
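def greedy_q(value):
    # value -- dict of action -> q(s, a) at a fixed state; returns a list of
    #  (prob, action) pairs, uniform over the maximizers and zero elsewhere
    q_max = max(value.values())
    mask = [reward >= q_max for reward in value.values()]
    n_sum = sum(mask)

    return [(prob / n_sum, action)
            for action, prob in zip(value, mask)]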

In [28]:
policy = {
    state: greedy_q(value) for state, value in q_fun.items()
}

<br>

## Policy improvement

The fixed point $v^\pi$ is the true value function of $\pi$. The associated $q$-function can be used to reason about improvements to the policy $\pi$:
if at some $s\in S$ we have $q^\pi(s, a_s) > v^\pi(s)$ for some $a_s \in A_s$, then the new policy $\hat{\pi}$, which agrees with $\pi$ everywhere except at $s$, where $\hat{\pi}(s) = \delta_{a_s}$, is at least as good as $\pi$ in every state and strictly better at $s$ (w.r.t. the $v$-function).

Applying this to all states yields the **greedy** policy improvement:

$$
\pi_{t+1}(s) \in \arg\max_{a\in A_s} q^{\pi_t}(s, a)
    \,. $$

In [44]:
def policy_improvement(env, gamma=1.0, atol=1e-8):
    # start from the uniformly random policy
    policy = {
        state: [(1. / env.nA, action) for action in kernel]
        for state, kernel in env.P.items()
    }

    value, delta = evaluate_policy(env, policy, gamma), float("+inf")
    print(value)
    while delta > atol*1000:
        # q-function of the current policy
        q_fun = {
            state: {
                action: expected_state_reward(states, value, gamma)
                for action, states in kernel.items()
            } for state, kernel in env.P.items()
        }

        # greedy improvement step followed by re-evaluation
        policy = {
            state: greedy_q(q_state) for state, q_state in q_fun.items()
        }
        new = evaluate_policy(env, policy, gamma)

        # stop when the value function no longer changes
        delta = max(abs(a - b) for a, b in zip(new.values(), value.values()))
        value = new
        print(value)

    return value, policy

If $\pi_{t+1} = \pi_t$ then $v^{\pi_t} = v^{\pi_{t+1}} = v$, which
satisfies the Bellman optimality principle:
$T(v) = v$ for
$$
T(v)(s) = \max_{a\in A_s}
    \mathbb{E}_{s'\sim q(s'\mid s, a)} \bigl[ r(s, a, s') + \gamma v(s') \bigr]
    \,. $$
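
One way to test whether a candidate $v$ satisfies $T(v) = v$ is to measure the residual $\max_s |T(v)(s) - v(s)|$. The sketch below assumes the transition table uses the same layout as `env.P`; the helper `bellman_residual` and the two-state table `P` are made up for illustration.

```python
def bellman_residual(P, value, gamma):
    """Largest |T(v)(s) - v(s)| over all states, for P in the
    `env.P` layout: P[s][a] = [(prob, next_state, reward, done), ...]."""
    return max(
        abs(max(
            sum(p * (r + gamma * value[z]) for p, z, r, _ in outcomes)
            for outcomes in actions.values()
        ) - value[s])
        for s, actions in P.items()
    )

# hypothetical two-state MDP: from state 0, action 1 reaches the
# absorbing state 1 and pays 1, action 0 stays put and pays nothing
P = {
    0: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 1.0, True)]},
    1: {0: [(1.0, 1, 0.0, True)], 1: [(1.0, 1, 0.0, True)]},
}

optimal = {0: 1.0, 1: 0.0}
assert bellman_residual(P, optimal, gamma=0.9) == 0.0
```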

In [45]:
value, policy = policy_improvement(env)

{0: 0.0019034917913191038, 1: 0.0021696792436437335, 2: 0.0028150668643954074, 3: 0.004120481261236543, 4: 0.006547679453827359, 5: 0.009803216344269209, 6: 0.013447781895831665, 7: 0.01597009054274122, 8: 0.00163734211974823, 9: 0.001790514408962211, 10: 0.0021550714943247297, 11: 0.0029987255643118293, 12: 0.005719369174153175, 13: 0.009414218174120834, 14: 0.014570071824405437, 15: 0.018492433867286894, 16: 0.001218053951006093, 17: 0.0011999955713605007, 18: 0.0010160045684423939, 19: 0.0, 20: 0.003916875306314802, 21: 0.007564240159817048, 22: 0.016925881599881645, 23: 0.02493716961635377, 24: 0.0008168512084205238, 25: 0.0007754327286982362, 26: 0.0007089667154352155, 27: 0.0007732163271153411, 28: 0.002383902471090605, 29: 0.0, 30: 0.02063206470158456, 31: 0.03939321639661987, 32: 0.0004570863155807352, 33: 0.00037593357827042304, 34: 0.0002712238238873024, 35: 0.0, 36: 0.004845522515379675, 37: 0.011594196346800688, 38: 0.02620917131062662, 39: 0.07261042968220219, 40: 0.000178

In [43]:
policy

{0: [(0.25, 0), (0.25, 1), (0.25, 2), (0.25, 3)],
 1: [(0.25, 0), (0.25, 1), (0.25, 2), (0.25, 3)],
 2: [(0.25, 0), (0.25, 1), (0.25, 2), (0.25, 3)],
 3: [(0.25, 0), (0.25, 1), (0.25, 2), (0.25, 3)],
 4: [(0.25, 0), (0.25, 1), (0.25, 2), (0.25, 3)],
 5: [(0.25, 0), (0.25, 1), (0.25, 2), (0.25, 3)],
 6: [(0.25, 0), (0.25, 1), (0.25, 2), (0.25, 3)],
 7: [(0.25, 0), (0.25, 1), (0.25, 2), (0.25, 3)],
 8: [(0.25, 0), (0.25, 1), (0.25, 2), (0.25, 3)],
 9: [(0.25, 0), (0.25, 1), (0.25, 2), (0.25, 3)],
 10: [(0.25, 0), (0.25, 1), (0.25, 2), (0.25, 3)],
 11: [(0.3333333333333333, 0),
  (0.0, 1),
  (0.3333333333333333, 2),
  (0.3333333333333333, 3)],
 12: [(0.25, 0), (0.25, 1), (0.25, 2), (0.25, 3)],
 13: [(0.25, 0), (0.25, 1), (0.25, 2), (0.25, 3)],
 14: [(0.25, 0), (0.25, 1), (0.25, 2), (0.25, 3)],
 15: [(0.25, 0), (0.25, 1), (0.25, 2), (0.25, 3)],
 16: [(0.25, 0), (0.25, 1), (0.25, 2), (0.25, 3)],
 17: [(0.25, 0), (0.25, 1), (0.25, 2), (0.25, 3)],
 18: [(0.3333333333333333, 0),
  (0.333333333

In [None]:
def greedify(q):
    def _policy(env, state):
        actions, expected = zip(*q[state].items())
        return actions[np.argmax(expected)]

    return _policy

In [None]:
episode = [*run(env, greedify(q_fun))]

<br>

<br>

## (original) Bellman equation

An MDP:
* $q(z\mid s, a)$ the transition kernel

The (original) Bellman operator is
$$
T(v)
\colon s \mapsto \max_{a\in A_s} \mathbb{E}_{z\sim q(z\mid s, a)} \bigl[ r(s, a, z) + \gamma v(z) \bigr]
    \,. $$

It can be shown that $T$ is a contraction mapping w.r.t. $\|\cdot\|_\infty$ for $\gamma \in (0, 1)$,
and thus the fixed-point iteration converges to its unique fixed point $v^*$:
* repeat $v_{t+1} \leftarrow T(v_t)$ until convergence in $\|\cdot\|_\infty$
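
The contraction property $\|T v - T w\|_\infty \leq \gamma \|v - w\|_\infty$ can be checked numerically on a small random MDP. This is only a sanity check under assumed array shapes, not a proof; `P`, `R` and `T` below are illustrative names.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 6, 3, 0.9

# random finite MDP: P[s, a] is a distribution over next states,
# R[s, a] is the expected immediate reward
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=-1, keepdims=True)
R = rng.normal(size=(n_states, n_actions))

def T(v):
    # (T v)(s) = max_a [ R(s, a) + gamma * sum_z P(z | s, a) v(z) ]
    return (R + gamma * P @ v).max(axis=1)

v, w = rng.normal(size=n_states), rng.normal(size=n_states)
lhs = np.abs(T(v) - T(w)).max()
rhs = gamma * np.abs(v - w).max()
assert lhs <= rhs + 1e-12
```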

The optimal value function implies an optimal (greedy) policy:

$$
\pi(s) \in \arg\max_{a\in A_s}
    \mathbb{E}_{z\sim q(z\mid s, a)} \bigl[ r(s, a, z) + \gamma v^*(z) \bigr]
    \,. $$

In [None]:
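def expected_value(kernel, value, policy=None, gamma=1.0):
    # Bellman optimality backup at one state: the best expected one-step
    #  return over the actions listed in `policy` (probabilities are unused)
    return max(expected_state_reward(kernel[action], value, gamma)
               for action, prob in policy.items())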

In [None]:
def policy_evaluation(env, policy, gamma=1.0):
    value, delta = {state: 0. for state in env.P}, float("+inf")
    while delta > 1e-8:
        # compute the operator
        new = {state: expected_value(kernel, value, policy[state], gamma)
               for state, kernel in env.P.items()}
        
        delta = max(abs(a - b) for a, b in zip(value.values(), new.values()))
        value = new

    return value

Let's run the fixed-point iteration; the uniform policy below only supplies the admissible actions in each state:

In [None]:
policy = {
    state: {
        action: 1. / env.nA for action in kernel
    } for state, kernel in env.P.items()
}

value = policy_evaluation(env, policy, gamma=0.99)

In [None]:
states, values = zip(*value.items())
values = np.array(values).reshape(8, 8)

In [None]:
values.round(3)

In [None]:
plt.imshow(values, cmap=plt.cm.PuBu)

Compute the $q$-function implied by the converged value function:
$$
Q(s, a)
    = \mathbb{E}_{z\sim q(z\mid s, a)} \bigl[ r(s, a, z) + \gamma v^*(z) \bigr]
    \,. $$

In [42]:
# one-step lookahead with the same discount as in the iteration above
q_fun = {
    state: {
        a: expected_state_reward(k, value, 0.99) for a, k in kernel.items()
    } for state, kernel in env.P.items()
}

Thus the stationary $v$-function induces the following policy (consistent with it):
$$
\pi(s)
    = \delta_{a_s}
    \,, \text{ for } a_s = \arg \max_{a \in A_s} Q(s, a) 
    \,. $$

In [40]:
def get_greedy_policy(q):
    def _policy(env, state):
        actions, expected = zip(*q[state].items())
        return actions[np.argmax(expected)]

    return _policy

In [41]:
episode = [*run(env, get_greedy_policy(q_fun))]

  (Down)
SFFFFFFF
FFFFFFFF
FFFHFFFF
FFFFFHFF
FFFHFFFF
FHHFFFHF
FHFFHFHF
FFFHFFF[41mG[0m


<br>

## Policy improvement and Policy Iteration

In [None]:
assert False

In [None]:
from matplotlib.collections import LineCollection

In [None]:
s, *_ = zip(*episode)

segs = np.c_[np.unravel_index(s, (8, 8))]

In [None]:
assert False

<br>

In [None]:
# !pip install gym

In [None]:
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt

In [None]:
import torch
import torch.nn.functional as F

In [None]:
import gym

In [None]:
from gym.envs.classic_control import MountainCarEnv

In [None]:
from gym import spaces

class ModifiedMountainCarEnv(MountainCarEnv):
    def __init__(self, goal_velocity = 0):
        self.min_position = -2.5
        self.max_position = 2.5
        self.max_speed = 0.07
        self.goal_position = 2.0
        self.goal_velocity = goal_velocity

        self.force = 0.001
        self.gravity = 0.0025

        self.low = np.array([self.min_position, -self.max_speed])
        self.high = np.array([self.max_position, self.max_speed])

        self.viewer = None

        self.action_space = spaces.Discrete(3)
        self.observation_space = spaces.Box(self.low, self.high, dtype=np.float32)

        self.seed()

In [None]:
env = ModifiedMountainCarEnv()

In [None]:
def run(env, agent, fps=15):
    state, terminated = env.reset(), False
    history = []
    while not terminated:
        state, reward, terminated, info = env.step(agent(state))
        history.append((state, reward))

    return history

In [None]:
class BaseAgent(object):
    def __init__(self, env):
        self.env = env
    
    def reset(self):
        pass
    
    def update(self, state, action, reward, next_state, terminated=False):
        pass

    def __call__(self, state=None):
        return self.env.action_space.sample()

In [None]:
space = env.observation_space

In [None]:
from gym.spaces import Space

In [None]:
shape = (51, 71)
state = 0, 0

In [None]:
unit = (state - space.low) / (space.high - space.low)

In [None]:
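from gym.spaces import Box

class Discretizer(object):
    def __init__(self, space, shape):
        assert isinstance(space, Box) and len(space.shape) == 1
        assert space.is_bounded()

        # broadcast a scalar number of grid nodes to every dimension
        if not isinstance(shape, (list, tuple)):
            shape = space.shape[0] * [shape]

        assert len(shape) == space.shape[0]
        self.space, self.shape = space, tuple(shape)

    def to_ix(self, state, flatten=False):
        # rescale to the unit cube, then round to the nearest grid node;
        #  scaling by `shape - 1` keeps the indices within bounds
        space = self.space
        unit = (state - space.low) / (space.high - space.low)
        ix = (unit * (np.array(self.shape) - 1) + 0.5).astype(int)

        if flatten:
            return np.ravel_multi_index(tuple(ix), self.shape)
        return ix

    def from_ix(self, *index):
        # inverse of `to_ix`: grid node back to a point of the box
        space = self.space
        unit = np.array(index) / (np.array(self.shape) - 1)
        return unit * (space.high - space.low) + space.low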
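
The index computation can be tried out in isolation with plain numpy. The function `to_grid_index` below is a hypothetical standalone sketch of the same idea (rescale to the unit cube, round to the nearest node of a uniform grid); the bounds are those of the modified mountain car and the grid is the `(51, 71)` one used above.

```python
import numpy as np

def to_grid_index(state, low, high, shape):
    """Map a point of the box [low, high] to the nearest node of a
    uniform grid with `shape[i]` nodes along dimension i."""
    low, high = np.asarray(low, float), np.asarray(high, float)
    unit = (np.asarray(state, float) - low) / (high - low)
    return np.rint(unit * (np.asarray(shape) - 1)).astype(int)

# bounds of the modified mountain car, on a 51 x 71 grid
low, high = [-2.5, -0.07], [2.5, 0.07]
ix = to_grid_index([0.0, 0.0], low, high, shape=(51, 71))

# a single flat index, e.g. for addressing rows of a 2-d q-table
flat = np.ravel_multi_index(tuple(ix), (51, 71))
```

With an odd number of nodes per dimension the origin falls exactly on the central node, which is presumably why 51 and 71 were chosen.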

In [None]:
def rescale(obs, env):
    space = env.observation_space
    return (obs - space.low) / (space.high - space.low)

class TabularQLearner(BaseAgent):
    def __init__(self, env, n_states=51):
        super().__init__(env)
        self.n_states, self.n_actions = n_states, env.action_space.n

        self.reset()
    
    def reset(self):
        self.q_table = torch.zeros(self.n_states, self.n_states, self.n_actions)

    def __call__(self, state=None):
        # placeholder: the q-table is not consulted yet, act uniformly at random
        return self.env.action_space.sample()

In [63]:
value = np.zeros((env.nS, env.nA))

In [64]:
policy = np.ones((env.nS, env.nA))/env.nA

In [65]:
state = 0

In [66]:
env.P

{0: {0: [(0.3333333333333333, 0, 0.0, False),
   (0.3333333333333333, 0, 0.0, False),
   (0.3333333333333333, 8, 0.0, False)],
  1: [(0.3333333333333333, 0, 0.0, False),
   (0.3333333333333333, 8, 0.0, False),
   (0.3333333333333333, 1, 0.0, False)],
  2: [(0.3333333333333333, 8, 0.0, False),
   (0.3333333333333333, 1, 0.0, False),
   (0.3333333333333333, 0, 0.0, False)],
  3: [(0.3333333333333333, 1, 0.0, False),
   (0.3333333333333333, 0, 0.0, False),
   (0.3333333333333333, 0, 0.0, False)]},
 1: {0: [(0.3333333333333333, 1, 0.0, False),
   (0.3333333333333333, 0, 0.0, False),
   (0.3333333333333333, 9, 0.0, False)],
  1: [(0.3333333333333333, 0, 0.0, False),
   (0.3333333333333333, 9, 0.0, False),
   (0.3333333333333333, 2, 0.0, False)],
  2: [(0.3333333333333333, 9, 0.0, False),
   (0.3333333333333333, 2, 0.0, False),
   (0.3333333333333333, 1, 0.0, False)],
  3: [(0.3333333333333333, 2, 0.0, False),
   (0.3333333333333333, 1, 0.0, False),
   (0.3333333333333333, 0, 0.0, False)]},
