# 1 
The 4 Bellman Policy Equations

For a deterministic policy $\pi_D$, the MDP reduces to an MRP.


We choose action $a = \pi_D(s)$ with probability 1. This action depends only on the state $s$, so I drop it as an input to $Q$ and $V$, writing $R(s)$ for $R(s, \pi_D(s))$ and $P(s,s')$ for $P(s, \pi_D(s), s')$. We basically end up with the MRP equations:



$V^{\pi_D}(s) = Q^{\pi_D}(s)$

$Q^{\pi_D}(s) = R(s) + \gamma \sum_{s' \in N} P(s,s') V^{\pi_D}(s')$

$V^{\pi_D}(s) = R(s) + \gamma \sum_{s' \in N} P(s,s') V^{\pi_D}(s')$

$Q^{\pi_D}(s) = R(s) + \gamma \sum_{s' \in N} P(s,s') Q^{\pi_D}(s')$
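
As a quick numerical check of these equations, here is a minimal sketch that fixes $a = \pi_D(s)$ and iterates the $V$ equation until it settles. The two-state MDP, the names `P`, `R`, `pi_D`, and the value `gamma = 0.5` are all made up for illustration; none of them come from the problem.

```python
from typing import Dict

# Hypothetical two-state MDP, used only for illustration.
# P[s][a] maps each next state to its probability; R[s][a] is the expected reward.
P: Dict[str, Dict[str, Dict[str, float]]] = {
    "s0": {"stay": {"s0": 0.5, "s1": 0.5}, "go": {"s1": 1.0}},
    "s1": {"stay": {"s1": 1.0}, "go": {"s0": 1.0}},
}
R: Dict[str, Dict[str, float]] = {
    "s0": {"stay": 0.0, "go": 1.0},
    "s1": {"stay": 2.0, "go": 0.0},
}
pi_D: Dict[str, str] = {"s0": "go", "s1": "stay"}  # deterministic policy
gamma = 0.5  # assumed discount factor


def evaluate_deterministic_policy(P, R, pi, gamma, sweeps=200):
    """Iterate V(s) = R(s) + gamma * sum_s' P(s, s') V(s') with a = pi(s) fixed."""
    V = {s: 0.0 for s in P}
    for _ in range(sweeps):
        V = {
            s: R[s][pi[s]]
            + gamma * sum(p * V[s1] for s1, p in P[s][pi[s]].items())
            for s in P
        }
    return V


V = evaluate_deterministic_policy(P, R, pi_D, gamma)
# Q(s) is the same right-hand side evaluated at a = pi_D(s), so it matches V(s).
Q = {
    s: R[s][pi_D[s]] + gamma * sum(p * V[s1] for s1, p in P[s][pi_D[s]].items())
    for s in P
}
print(V, Q)
```

Once the action is pinned to $\pi_D(s)$, the dictionaries indexed by action collapse to per-state quantities, which is exactly the MRP structure.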

# 2

$V_*(s) = \max_{\pi \in \Pi} V^\pi(s) = \max_a Q_*(s,a)$

$R_s(a) = a(1-a) + (1-a)(1+a) = a - a^2 + 1 - a^2 = -2a^2 + a + 1$

$\frac{\partial R_s}{\partial a} = -4a + 1$, which equals $0$ when $a=0.25$. This is a maximum because the second derivative is $-4 < 0$.

$\max_a R_s(a) = R_s(0.25) = 1.125$


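Discount across all future time steps. We choose $a=0.25$ at every step, which gives an expected reward of 1.125 each time, so we need to discount 1.125 indefinitely into the future. Assuming a discount factor $\gamma = 0.5$ (so that $1 - \gamma = 0.50$), the geometric series gives:

Optimal Value Function $V^*(s) = \sum_{t=0}^{\infty} \gamma^t \cdot 1.125 = \frac{1.125}{1 - \gamma} = \frac{1.125}{0.50} = 2.25$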

Optimal Deterministic Policy: $\pi^*(s) = 0.25$
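
The calculus above can be sanity-checked numerically. This is a minimal sketch, assuming the action lies in $[0, 1]$ and that $\gamma = 0.5$ (consistent with the $0.50$ denominator); the helper name `expected_reward` and the grid resolution are my own choices.

```python
def expected_reward(a: float) -> float:
    """R_s(a) = a(1 - a) + (1 - a)(1 + a), as derived above."""
    return a * (1 - a) + (1 - a) * (1 + a)


gamma = 0.5  # assumption: discount factor consistent with the 0.50 denominator

# Grid search over a in [0, 1] (assumed action range) in steps of 0.001.
grid = [k / 1000 for k in range(1001)]
best_a = max(grid, key=expected_reward)
best_r = expected_reward(best_a)

# The same reward is earned at every step, so sum the geometric series.
v_star = best_r / (1 - gamma)
print(best_a, best_r, v_star)
```

The grid search is only a check on the closed-form answer; the derivative argument is what actually establishes the maximizer.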

# 3

State space = $\{0, 1, 2, \dots, n\}$

Action space = {Croak A, Croak B}

$P(i, A, i-1) = \frac{i}{n}$

$P(i, A, i+1) = \frac{n-i}{n}$

$P(i, B, i') = \frac{1}{n}$ for $i' \neq i$ in $\{0, 1, 2, \dots, n\}$

$R(0) = -100$

$R(n) = 10$
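
Before building the full MDP, it helps to check that these transition probabilities sum to 1 from every non-terminal lilypad. A minimal sketch using plain dictionaries and exact fractions; the helper names `croak_a` and `croak_b` are hypothetical and not part of the `rl` library.

```python
from fractions import Fraction
from typing import Dict


def croak_a(i: int, n: int) -> Dict[int, Fraction]:
    """Croak A from lilypad i: back with probability i/n, forward with (n-i)/n."""
    return {i - 1: Fraction(i, n), i + 1: Fraction(n - i, n)}


def croak_b(i: int, n: int) -> Dict[int, Fraction]:
    """Croak B from lilypad i: uniform over the n other lilypads."""
    return {j: Fraction(1, n) for j in range(n + 1) if j != i}


n = 6
totals_a = [sum(croak_a(i, n).values()) for i in range(1, n)]
totals_b = [sum(croak_b(i, n).values()) for i in range(1, n)]
print(totals_a, totals_b)
```

Using `Fraction` avoids floating-point noise, so the totals can be compared to 1 exactly.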



In [1]:
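from dataclasses import dataclass
from typing import Dict, Mapping, Tuple

from rl.distribution import Categorical
from rl.markov_decision_process import FiniteMarkovDecisionProcess
from rl.policy import FinitePolicy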

In [2]:
# Create Frog State (only attribute is an int -- the lilypad)

@dataclass(frozen=True)
class FrogState:
    position: int

Let croak A = 0, croak B = 1

In [3]:
from itertools import combinations, permutations
from rl.policy import FiniteDeterministicPolicy
from rl.distribution import Constant
# Create policies
n = 6

# Create a list with all possible permutations of croaks
# for n-2 non-terminal states where the frog makes an action.
all_possible_permutations_of_croaks = []
for num_A in range(0, n-1):
    list_with_croaks = [0 for i in range(num_A)]
    list_with_croaks += [1 for i in range(n-2 - num_A)]
    
    perm = list(permutations(list_with_croaks))
    all_possible_permutations_of_croaks.append(perm)


# List of all 2^(n-1) policies
policies = []
for permutation_of_croaks in all_possible_permutations_of_croaks:
    ### What's the right policy to use???
    ### What's the notation for a map??
    
    fdp : FinitePolicy[FrogState, int] = \
    FinitePolicy(
        {FrogState(i) : Constant(permutation_of_croaks[i])
         for i in range(1, n)}
    )
    
    policies.append(fdp)

In [4]:
# State to action, then from action to (next_state, reward)
FrogMapping = Mapping[
    FrogState,
    Mapping[int, Categorical[Tuple[FrogState, int]]]]

In [5]:
class FrogFMDP(FiniteMarkovDecisionProcess[FrogState,int]):
    def __init__(self, num_states : int):
        self.n = num_states
        super().__init__(self.get_action_transition_reward_map())

    def get_action_transition_reward_map(self) -> FrogMapping:
        d: Dict[FrogState, 
                Dict[int, Categorical[Tuple[FrogState, int]]]] = {}

        # Non-terminal lilypads are 1, ..., n-1 (use self.n, not the global n)
        for i in range(1, self.n):
            state: FrogState = FrogState(i)
            
            d1: Dict[int, Categorical[Tuple[FrogState, int]]] = {}
            
            # croak = 0
            # Mapping of next_state and rewards, and probabilities
            # of getting those next_state and reward
            sr_probs_dict0: Dict[Tuple[FrogState, int], float] =\
            {(FrogState(i-1), 0) : i / self.n, 
             (FrogState(i+1), 0): (self.n-i)/self.n,
            }
            if i == 1:
                # Get rid of previous key for 0 state
                sr_probs_dict0.pop((FrogState(0), 0))
                # Add terminal (bad) state
                sr_probs_dict0[(FrogState(0), -100)] = i/self.n
            # Separate `if`: for n = 2, lilypad 1 borders both terminal states
            if i == self.n-1:
                # Get rid of previous key for n state
                sr_probs_dict0.pop((FrogState(self.n), 0))
                # Add terminal (good) state
                sr_probs_dict0[(FrogState(self.n), 10)] = (self.n-i)/self.n
                
            d1[0] = Categorical(sr_probs_dict0)
            
            
            
            # croak = 1
            sr_probs_dict1: Dict[Tuple[FrogState, int], float] =\
            {(FrogState(i_next), 0) : 1 / self.n 
             for i_next in list(range(i)) + list(range(i+1,self.n+1))
            }
            
            # Get rid of previous key for 0 state
            sr_probs_dict1.pop((FrogState(0), 0))
            # Add terminal state
            sr_probs_dict1[(FrogState(0), -100)] = 1/self.n
           
            # Get rid of previous key for n state
            sr_probs_dict1.pop((FrogState(self.n), 0))
            # Add terminal (good) state
            sr_probs_dict1[(FrogState(self.n), 10)] = 1/self.n
                
            d1[1] = Categorical(sr_probs_dict1)    
                    
            d[state] = d1
        return d

In [6]:
n = 6
myFrogFMDP = FrogFMDP(n)

for mypol in policies:
    FMRP = myFrogFMDP.apply_finite_policy(policy=mypol)
    
    value_func = FMRP.get_value_function_vec(gamma=1)
    

Constant(value=(1, 1, 1, 1))
(1, 1, 1, 1)
{(NonTerminal(state=FrogState(position=2)), 0): 0.8333333333333334, (Terminal(state=FrogState(position=0)), -100): 0.16666666666666666}
{(NonTerminal(state=FrogState(position=2)), 0): 0.16666666666666666, (NonTerminal(state=FrogState(position=3)), 0): 0.16666666666666666, (NonTerminal(state=FrogState(position=4)), 0): 0.16666666666666666, (NonTerminal(state=FrogState(position=5)), 0): 0.16666666666666666, (Terminal(state=FrogState(position=0)), -100): 0.16666666666666666, (Terminal(state=FrogState(position=6)), 10): 0.16666666666666666}


KeyError: (1, 1, 1, 1)

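`policy.act(state)` is returning the wrong actions: it yields `Constant(value=(1, 1, 1, 1))`, a whole tuple of croaks, instead of a single croak (0 or 1). Each element of `all_possible_permutations_of_croaks` is a list of tuples, so `permutation_of_croaks[i]` is a tuple rather than an `int`, and looking that tuple up in the state's action map raises the `KeyError`.

`FinitePolicy.act()` should return a finite distribution over the individual actions 0 and 1.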
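
One way to get a single croak per lilypad is to enumerate the policies with `itertools.product` instead of `permutations`. This is a minimal sketch using plain dictionaries (lilypad to croak) so that it runs without the `rl` library. Wrapping each dictionary in a policy object, for example `FiniteDeterministicPolicy` keyed by `FrogState(i)`, is an assumption about that class's constructor and is not tested here.

```python
from itertools import product
from typing import Dict, List

n = 6
non_terminal = range(1, n)  # lilypads 1, ..., n-1 are where the frog acts

# One dict per policy: lilypad -> croak (0 = croak A, 1 = croak B).
# product((0, 1), repeat=n-1) yields each of the 2^(n-1) assignments exactly once.
policies: List[Dict[int, int]] = [
    dict(zip(non_terminal, croaks))
    for croaks in product((0, 1), repeat=n - 1)
]

print(len(policies))
print(policies[0], policies[-1])
```

Unlike `permutations`, `product` produces no duplicate assignments, and every value in each dictionary is a single `int` that exists as a key in the state's action map.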