## CHEME 5660 Lab 8: Solution of the Linear Tiger Problem as a Markov Decision Process (MDP)

<img src="./figs/Fig-Linear-MDP-Schematic.png" style="margin:auto; width:50%"/>

__Fig 1__: Schematic of the Tiger problem modeled as an N-state, two-action (left, right) Markov decision process. A tiger hides behind the red door while freedom awaits behind the green door.

## Introduction
A Markov decision process is the tuple $\left(\mathcal{S}, \mathcal{A}, R_{a}\left(s, s^{\prime}\right), T_{a}\left(s,s^{\prime}\right), \gamma\right)$ where:

* The state space $\mathcal{S}$ is the set of all possible states $s$ that a system can exist in
* The action space $\mathcal{A}$ is the set of all possible actions $a$ that are available to the agent, where $\mathcal{A}_{s} \subseteq \mathcal{A}$ is the subset of the action space $\mathcal{A}$ that is accessible from state $s$.
* An expected immediate reward $R_{a}\left(s, s^{\prime}\right)$ is received after transitioning from state $s\rightarrow{s}^{\prime}$ due to action $a$. 
* The transition function $T_{a}\left(s,s^{\prime}\right) = P(s_{t+1} = s^{\prime}~|~s_{t}=s,a_{t} = a)$ denotes the probability that taking action $a$ in state $s$ at time $t$ results in state $s^{\prime}$ at time $t+1$.
* The quantity $\gamma$ is a _discount factor_; the discount factor is used to weigh the _future expected utility_.

Finally, a policy function $\pi$ is the (potentially probabilistic) mapping from states $s\in\mathcal{S}$ to actions $a\in\mathcal{A}$ used by the agent to solve the decision task. 

### Policy evaluation
A natural first question is how to find the best possible policy $\pi$ for our decision problem. To answer it, we need a way to estimate how good (or bad) a particular policy is; this procedure is called _policy evaluation_. Let's denote the expected utility gained by executing a policy $\pi$ from state $s$ as $U^{\pi}(s)$. Then an _optimal policy_ $\pi^{\star}$ is one that maximizes the expected utility:


$$\pi^{\star}\left(s\right) = \text{arg} \max_{\pi}~U^{\pi}(s)$$


for all $s\in\mathcal{S}$. We can compute the utility of a policy $\pi$ iteratively. If the agent makes a single move, the utility is the expected reward $R(s,\pi(s))$ for taking action $\pi(s)$ in state $s$:


$$U_{1}^{\pi}(s) = R(s,\pi(s))$$


However, if we let the agent make two, three, or $k$ moves, we get a _lookahead_ equation that relates the utility at iteration $k+1$ to the utility at iteration $k$:


$$U_{k+1}^{\pi}(s) = R(s,\pi(s)) + \gamma\sum_{s^{\prime}\in\mathcal{S}}T(s^{\prime} | s, \pi(s))U_{k}^{\pi}(s^{\prime})$$


As $k\rightarrow\infty$ the lookahead utility converges to a stationary value $U^{\pi}(s)$.
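
The lookahead recursion can be sketched in a few lines of Julia. This is a minimal illustration, not the lab's own `iterative_policy_evaluation` routine: the names `lookahead_sketch` and `policy_evaluation_sketch` are hypothetical, and the sketch assumes the rewards are stored as `R[s,a]` and the transition probabilities as `T[s,s_next,a]`.

```julia
# Hypothetical helpers (not part of the lab's code library).
# Assumed layout: R[s,a] is the reward, T[s,s_next,a] = P(s_next | s, a).

# right-hand side of the lookahead equation for state s and action a
function lookahead_sketch(R, T, γ, U, s, a)
    return R[s,a] + γ*sum(T[s,s_next,a]*U[s_next] for s_next in eachindex(U))
end

# apply the lookahead equation k_max times, starting from a utility of zero
function policy_evaluation_sketch(R, T, γ, policy, k_max)
    number_of_states = size(R,1)
    U = zeros(number_of_states)
    for _ in 1:k_max
        U = [lookahead_sketch(R, T, γ, U, s, policy(s)) for s in 1:number_of_states]
    end
    return U
end

# toy check: one action; state 1 pays 1.0 on every move, state 2 pays nothing
R_toy = reshape([1.0, 0.0], 2, 1);
T_toy = zeros(2,2,1); T_toy[1,1,1] = 1.0; T_toy[2,2,1] = 1.0;
U_toy = policy_evaluation_sketch(R_toy, T_toy, 0.5, s -> 1, 50)
# U_toy[1] approaches the geometric-series limit 1/(1 - 0.5) = 2
```

With `k_max = 1` the sketch returns $R(s,\pi(s))$, which matches the single-move utility $U_{1}^{\pi}(s)$ above.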

### Q-functions
Policy evaluation gives us a method to compute the utility $U^{\pi}(s)$ of a particular policy $\pi$. However, suppose we were given a utility and wanted to recover a policy from it. Given a utility $U(s)$, we can estimate a policy $\pi(s)$ using the $Q$-function (action-value function):

$$Q(s,a) = R(s,a) + \gamma\sum_{s^{\prime}\in\mathcal{S}}T(s^{\prime} | s, a)U(s^{\prime})$$

The $Q$-function is a $|\mathcal{S}|\times|\mathcal{A}|$ array, from which the utility is given by:


$$U(s) = \max_{a} Q(s,a)$$

and the policy $\pi(s)$ is:

$$\pi(s) = \text{arg}\max_{a}Q(s,a)$$
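
As a rough sketch, the three equations above translate into Julia as follows. The helper names (`Q_sketch`, `utility_sketch`, `greedy_policy_sketch`) are hypothetical and are not the `Q` and `π` functions from the lab's code library. The sketch assumes the array layout `R[s,a]` and `T[s,s_next,a]`, and it breaks ties toward the first action rather than assigning a special index to absorbing states.

```julia
# Hypothetical helpers; assumed layout R[s,a], T[s,s_next,a] = P(s_next | s, a).
function Q_sketch(R, T, γ, U)
    number_of_states, number_of_actions = size(R)
    Q = zeros(number_of_states, number_of_actions)
    for s in 1:number_of_states, a in 1:number_of_actions
        Q[s,a] = R[s,a] + γ*sum(T[s,s_next,a]*U[s_next] for s_next in 1:number_of_states)
    end
    return Q
end

# U(s) = max over actions, π(s) = argmax over actions (first index wins ties)
utility_sketch(Q) = [maximum(Q[s,:]) for s in 1:size(Q,1)]
greedy_policy_sketch(Q) = [argmax(Q[s,:]) for s in 1:size(Q,1)]

# toy usage: two states, two actions; action 2 in state 1 pays 1.0 and moves to state 2
R_q = [0.0 1.0; 0.0 0.0];
T_q = zeros(2,2,2);
T_q[1,1,1] = 1.0; T_q[1,2,2] = 1.0; T_q[2,2,1] = 1.0; T_q[2,2,2] = 1.0;
Q_toy = Q_sketch(R_q, T_q, 0.5, [0.0, 10.0])
policy_toy = greedy_policy_sketch(Q_toy)
```

In the toy example, state 1 prefers action 2 because $Q(1,a_{2}) = 1 + 0.5\times 10 = 6$ exceeds $Q(1,a_{1}) = 0$.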

### Problem
An agent is trapped in a long hallway with a door at each end (Fig. 1). Behind the red door is a tiger (and certain death), while behind the green door is freedom. If the agent opens the red door, it is eaten (and receives a large negative reward). However, if the agent opens the green door, it escapes and receives a positive reward.

For this problem, the MDP has the tuple components:
* The state space is $\mathcal{S} = \left\{1,2,\dots,N\right\}$, while the action set is $\mathcal{A} = \left\{a_{1},a_{2}\right\}$; action $a_{1}$ moves the agent one state to the left, and action $a_{2}$ moves the agent one state to the right.
* The agent receives a positive reward for entering state $N$ (it escapes). However, the agent is penalized for entering state 1 (it is eaten by the tiger). Finally, the agent is not charged for moving between adjacent locations.
* Let the probability of correctly executing an action $a_{j}\in\mathcal{A}$ be $\alpha$; with probability $1-\alpha$ the agent moves in the opposite direction.

Let's compute $U^{\pi}(s)$ for different choices for the policy function $\pi$.

## Lab 8 setup

In [1]:
import Pkg; Pkg.activate("."); Pkg.resolve(); Pkg.instantiate();

  Activating project at `~/Desktop/julia_work/CHEME-5660-Markets-Mayhem-Example-Notebooks/labs/lab-8-MDP-Tiger-Problem`
  No Changes to `~/Desktop/julia_work/CHEME-5660-Markets-Mayhem-Example-Notebooks/labs/lab-8-MDP-Tiger-Problem/Project.toml`
  No Changes to `~/Desktop/julia_work/CHEME-5660-Markets-Mayhem-Example-Notebooks/labs/lab-8-MDP-Tiger-Problem/Manifest.toml`


In [2]:
# load req'd packages -
using PrettyTables

In [3]:
include("CHEME-5660-Lab-8-CodeLib.jl");

In [4]:
# setup some global constants -
α = 0.95; # probability of moving in the direction we expect

#### Configure states and actions

In [5]:
# setup the states and actions -
start_index = 1;
stop_index = 10;
left_reward = -100.0;
right_reward = 1000.0;

states = range(start_index,stop=stop_index, step=1) |> collect;
actions = [1,2]; # a₁ = move left, a₂ = move right
γ = 0.95;

# setup flags -
should_run_T_check = false;

#### Configure the rewards array

In [6]:
# setup the rewards -
R = Array{Float64,2}(undef,length(states), length(actions));

# most of the rewards are zero -
fill!(R,0.0) # fill R w/zeros

# set the rewards for the ends -
R[start_index + 1, 1] = left_reward; # if in state 2, and we take action 1, we get eaten by a tiger. Bad.
R[stop_index - 1, 2] = right_reward; # if in state N - 1, and we take action 2, we live, get married, have kids (who grow up to be Surgeons). Good.

#### Configure the transition array $T_{a}(s,s^{\prime})$

In [7]:
# Setup the transitions
T = Array{Float64,3}(undef, length(states), length(states), length(actions));
fill!(T,0.0);

# We need to put values into the transition array (these are probabilities, so each row must sum to 1)
T[start_index, start_index, 1:length(actions)] .= 1.0; # if we are in state 1, we stay in state 1 ∀a ∈ 𝒜
T[stop_index, stop_index, 1:length(actions)] .= 1.0; # if we are in state N (the right end), we stay in state N ∀a ∈ 𝒜

# left actions -
for s ∈ 2:(stop_index - 1)
    T[s,s-1,1] = α;
    T[s,s+1,1] = (1-α);
end

# right actions -
for s ∈ 2:(stop_index - 1)
    T[s,s-1,2] = (1-α);
    T[s,s+1,2] = α; 
end

In [8]:
# should we run the T-check?
if (should_run_T_check == true)
    
    # row summation check -
    T_array_check_table = Array{Any,2}(undef, length(states), length(actions)+1)

    for s ∈ 1:length(states)
        T_array_check_table[s,1] = s;
    end

    for a ∈ 1:length(actions)

        # sum the action table -
        Z = sum(T[:,:,a], dims=2)

        for s ∈ 1:length(states)
            T_array_check_table[s,a+1] = Z[s]
        end
    end

    # header -
    T_check_header = (["State s", "a₁ (left)", "a₂ (right)"]);

    # display -
    pretty_table(T_array_check_table; header=T_check_header)
end

#### Build the MDP problem object and estimate the utility $U^{\pi}(s)$ 

In [9]:
mdp_problem = build(MDPProblem; 𝒮 = states, 𝒜 = actions, T = T, R = R, γ = γ);

In [17]:
# build an always-left and an always-right policy -
always_move_right(s) = 2;
always_move_left(s) = 1;

In [11]:
U = iterative_policy_evaluation(mdp_problem, always_move_left, 20*length(states));

In [16]:
# display utility vector -
utility_table_data_array = Array{Any,2}(undef, length(states), 2);

# main table loop -
for s ∈ 1:length(states)
    utility_table_data_array[s,1] = s
    utility_table_data_array[s,2] = U[s]
end

# table header -
utility_table_header = (["State s", "U(s)"])

# display -
pretty_table(utility_table_data_array; header=utility_table_header)

┌─────────┬──────────┐
│ State s │     U(s) │
├─────────┼──────────┤
│       1 │      0.0 │
│       2 │ -104.699 │
│       3 │ -98.9314 │
│       4 │ -93.4814 │
│       5 │ -88.3315 │
│       6 │  -83.465 │
│       7 │ -78.8592 │
│       8 │  -74.358 │
│       9 │ -67.1081 │
│      10 │      0.0 │
└─────────┴──────────┘
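
As a sanity check, the tabulated values can be substituted back into the lookahead equation: at convergence, each interior state should reproduce its own utility. The sketch below hardcodes the numbers printed above and assumes, as in the `R` array, that the reward $R(s,a)$ is collected whenever action $a$ is taken in state $s$. The names `U_table`, `α_check`, `γ_check`, `R_left`, and `lookahead_left` are illustrative and are not part of the lab's code library.

```julia
# Values copied from the utility table above (always-left policy).
U_table = [0.0, -104.699, -98.9314, -93.4814, -88.3315, -83.465, -78.8592, -74.358, -67.1081, 0.0];
α_check = 0.95; γ_check = 0.95;           # same values as α and γ in the notebook
R_left = zeros(10); R_left[2] = -100.0;   # reward for taking a₁ (left) in each state

# right-hand side of the lookahead equation for an interior state s under a₁:
# the agent moves left with probability α and slips right with probability 1 - α
lookahead_left(s) = R_left[s] + γ_check*(α_check*U_table[s-1] + (1 - α_check)*U_table[s+1]);

# largest mismatch between the table and its own lookahead over the interior states
max_residual = maximum(abs(lookahead_left(s) - U_table[s]) for s in 2:9)
```

The residual should be small, on the order of the rounding in the printed table, which suggests that `20*length(states)` iterations were enough for the utility to settle.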


#### Estimate the Q-function

In [12]:
# compute the Q array -
Q_array = Q(mdp_problem, U)

# compute the policy -
policy = π(Q_array);

In [13]:
# make a Q-table -
Q_table_data_array = Array{Any,2}(undef, length(states), length(actions)+3)

for s ∈ 1:length(states)
    
    Q_table_data_array[s,1] = s;
    
    direction = "left"
    policy_index = policy[s];
    if policy_index == 2
        direction = "right" 
    elseif policy_index == 0
        direction = "stop"
    end
    
    Q_table_data_array[s,2] = direction;
    Q_table_data_array[s,3] = policy_index;
    
    
    for a ∈ 1:length(actions)
        Q_table_data_array[s,a+3] = Q_array[s,a];
    end
end

# header -
Q_table_header = (["State", "Direction", "π(s)", "U(a₁) l", "U(a₂) r"])

# show -
pretty_table(Q_table_data_array; header = Q_table_header)

┌───────┬───────────┬──────┬──────────┬──────────┐
│ State │ Direction │ π(s) │  U(a₁) l │  U(a₂) r │
├───────┼───────────┼──────┼──────────┼──────────┤
│     1 │      stop │    0 │      0.0 │      0.0 │
│     2 │     right │    2 │ -104.699 │ -89.2856 │
│     3 │     right │    2 │ -98.9314 │ -89.3401 │
│     4 │     right │    2 │ -93.4814 │ -84.4184 │
│     5 │     right │    2 │ -88.3315 │ -79.7675 │
│     6 │     right │    2 │  -83.465 │ -75.3662 │
│     7 │     right │    2 │ -78.8592 │ -71.0727 │
│     8 │     right │    2 │  -74.358 │ -64.3109 │
│     9 │     right │    2 │ -67.1081 │  996.468 │
│    10 │      stop │    0 │      0.0 │      0.0 │
└───────┴───────────┴──────┴──────────┴──────────┘
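
Two entries of the table can be checked by hand from the $Q$-function definition, using $\alpha = \gamma = 0.95$, the utilities $U(s)$ computed for the always-left policy, and the assumption that $R(s,a)$ is collected whenever action $a$ is taken in state $s$. For state 9 and action $a_{2}$ (right):

$$Q(9,a_{2}) = R(9,a_{2}) + \gamma\left[(1-\alpha)U(8) + \alpha U(10)\right] = 1000 + 0.95\left[0.05\times(-74.358) + 0.95\times 0\right] \approx 996.468$$

and for state 2 and action $a_{2}$:

$$Q(2,a_{2}) = R(2,a_{2}) + \gamma\left[(1-\alpha)U(1) + \alpha U(3)\right] = 0 + 0.95\left[0.05\times 0 + 0.95\times(-98.9314)\right] \approx -89.2856$$

The $a_{1}$ column reproduces $U(s)$ because $U$ was computed for the always-left policy. In every interior state the $a_{2}$ entry is larger, so the greedy policy extracted from this $Q$ array moves right, which indicates that always moving left is not the best choice here.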


### Additional Resources
* [Chapter 7: Mykel J. Kochenderfer, Tim A. Wheeler, Kyle H. Wray "Algorithms for Decision Making", MIT Press 2022](https://algorithmsbook.com)

### Disclaimer and Risks
__This content is offered solely for training and  informational purposes__. No offer or solicitation to buy or sell securities or derivative products, or any investment or trading advice or strategy,  is made, given, or endorsed by the teaching team. 

__Trading involves risk__. Carefully review your financial situation before investing in securities, futures contracts, options, or commodity interests. Past performance, whether actual or indicated by historical tests of strategies, is no guarantee of future performance or success. Trading is generally inappropriate for someone with limited resources, limited investment or trading experience, or a low risk tolerance. Only risk capital that is not required for living expenses.

__You are fully responsible for any investment or trading decisions you make__. Such decisions should be based solely on your evaluation of your financial circumstances, investment or trading objectives, risk tolerance, and liquidity needs.