Value-based methods are built on the Bellman equations:
- $V^*(s) = \underset{a}{\text{max}} \sum_{s'} P(s'|s,a)(R(s,a) + \gamma V^*(s'))$
- $Q^*(s,a) = \sum_{s'} P(s'|s,a)[R(s,a) + \gamma \underset{a'}{\text{max}} Q^*(s',a')]$
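
To make the two equations concrete, here is a minimal sketch that applies them repeatedly as an update rule (value iteration) until the values stop changing. The 2-state MDP (`P`, `R`) and the tolerance are made up for illustration and are not part of the notes above.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP; every number here is illustrative.
P = np.array([                      # P[s, a, s'] = transition probability
    [[1.0, 0.0], [0.2, 0.8]],
    [[0.0, 1.0], [0.0, 1.0]],
])
R = np.array([[0.0, -1.0], [1.0, 1.0]])   # R[s, a]
gamma = 0.9

V = np.zeros(2)
for _ in range(1000):
    # Q(s,a) = sum_s' P(s'|s,a) * (R(s,a) + gamma * V(s'));
    # R(s,a) can be pulled out of the sum because each row of P sums to 1.
    Q = R + gamma * (P @ V)
    V_new = Q.max(axis=1)           # V(s) = max_a Q(s,a)
    converged = np.max(np.abs(V_new - V)) < 1e-10
    V = V_new
    if converged:
        break

print("V*:", V)
print("Q*:", Q)
```

State 1 is absorbing and pays 1 per step, so its value should come out as $1/(1-\gamma) = 10$, which is a quick sanity check on the loop.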

For completeness, the deterministic versions, where $s'$ is the state that action $a$ leads to from $s$:
- $V^*(s) = \underset{a}{\text{max}}(R(s,a) + \gamma\ V^*(s'))$
- $Q^*(s,a) = R(s,a) + \gamma\ \underset{a'}{\text{max}}\ Q^*(s',a')$
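
In the deterministic case the sum over $s'$ disappears, so the $Q^*$ update becomes a table lookup. A small sketch, assuming a hypothetical 3-state chain where `next_state[s, a]` gives the single successor state (the table and rewards are invented for this example):

```python
import numpy as np

# Hypothetical deterministic 3-state chain; all numbers are illustrative.
# next_state[s, a] is the state reached from s under action a.
next_state = np.array([[0, 1], [0, 2], [2, 2]])
R = np.array([[0.0, 0.0], [0.0, 1.0], [0.0, 0.0]])   # R[s, a]
gamma = 0.5

Q = np.zeros((3, 2))
for _ in range(100):
    # Q(s,a) = R(s,a) + gamma * max_a' Q(s',a'), with s' fixed by (s, a)
    Q = R + gamma * Q[next_state].max(axis=-1)

V = Q.max(axis=1)   # V*(s) = max_a Q*(s,a)
print("Q*:", Q)
print("V*:", V)
```

The only reward sits on the transition from state 1 to state 2, so the values simply decay by a factor of $\gamma$ for each step away from it.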

The goal is to maximize the expected total reward over trajectories $\tau = (s_0, a_0, s_1, a_1, \dots)$ sampled under the policy with parameters $\theta$:
- $R = E_{\tau \sim p_\theta(\tau)}[\sum_t R(s_t,a_t)]$
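
One way to read this expectation: sample many trajectories, sum the rewards along each, and average. The sketch below does that for a hypothetical 2-state MDP with a fixed uniform policy `pi` standing in for $p_\theta$, over a finite horizon. All numbers are assumptions for illustration. Because the MDP is tiny, the exact expectation can also be computed by pushing the state distribution forward, which gives something to compare against.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-state, 2-action MDP; every number here is illustrative.
P = np.array([                      # P[s, a, s'] = transition probability
    [[1.0, 0.0], [0.2, 0.8]],
    [[0.0, 1.0], [0.0, 1.0]],
])
R = np.array([[0.0, -1.0], [1.0, 1.0]])   # R[s, a]
pi = np.array([[0.5, 0.5], [0.5, 0.5]])   # pi[s, a], stands in for p_theta
horizon, start_state = 10, 0

def sample_return():
    """Roll out one trajectory and sum its rewards."""
    s, total = start_state, 0.0
    for _ in range(horizon):
        a = rng.choice(2, p=pi[s])
        total += R[s, a]
        s = rng.choice(2, p=P[s, a])
    return total

estimate = np.mean([sample_return() for _ in range(2000)])

# Exact value for comparison: propagate the state distribution forward.
P_pi = np.einsum("sa,sat->st", pi, P)   # state-to-state transitions under pi
r_pi = (pi * R).sum(axis=1)             # expected one-step reward under pi
dist, exact = np.eye(2)[start_state], 0.0
for _ in range(horizon):
    exact += dist @ r_pi
    dist = dist @ P_pi

print(f"Monte Carlo estimate: {estimate:.3f}, exact: {exact:.3f}")
```

The sampled average should land close to the exact value, and the gap shrinks as the number of sampled trajectories grows.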

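If we know the transitions for all states (and there are only a small number of them), we can just do policy iteration, which alternates two steps until the policy stops changing:
- Policy evaluation: update the value function for the current policy using the known rewards and transitions
- Policy improvement: in each state, switch to the action that looks best under the current value function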

Since we know all the transitions and states, we can essentially just fill out a table of the values we know.

In [1]:

import numpy as np

# Define the MDP
num_states = 5
num_actions = 2
rewards = np.array([[-1, -1], [-1, -1], [-1, -1], [-1, -1], [0, 1]])
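# Compact transition table: step_probs[s, a] = [P(stay in s), P(move to s + 1)].
# This reading of the two columns is an assumption; the last state is treated
# as absorbing, so "moving on" from it leads back to itself.
step_probs = np.array([
    [[0.9, 0.1], [0.5, 0.5]],
    [[0.8, 0.2], [0.6, 0.4]],
    [[0.7, 0.3], [0.3, 0.7]],
    [[0.6, 0.4], [0.2, 0.8]],
    [[0.0, 1.0], [0.0, 1.0]]
])

# Expand to the full tensor transitions[s, a, s'] so that each row can be
# dotted with value_func, which has one entry per state.
transitions = np.zeros((num_states, num_actions, num_states))
for s in range(num_states):
    next_s = min(s + 1, num_states - 1)
    transitions[s, :, s] += step_probs[s, :, 0]
    transitions[s, :, next_s] += step_probs[s, :, 1]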

# Initialize policy and value function
policy = np.zeros(num_states, dtype=int)
value_func = np.zeros(num_states)
gamma = 0.9  # Discount factor

def policy_evaluation():
    """Sweep over the states until value_func settles for the current policy."""
    global value_func
    while True:
        delta = 0
        for s in range(num_states):
            v = value_func[s]
            action = policy[s]
            value_func[s] = rewards[s, action] + gamma * np.dot(transitions[s, action], value_func)
            delta = max(delta, abs(v - value_func[s]))
        if delta < 1e-5:
            break

def policy_improvement():
    """Make the policy greedy w.r.t. value_func; return True if nothing changed."""
    global policy
    policy_stable = True
    for s in range(num_states):
        old_action = policy[s]
        q_values = rewards[s] + gamma * np.dot(transitions[s], value_func)
        policy[s] = np.argmax(q_values)
        if old_action != policy[s]:
            policy_stable = False
    return policy_stable

def policy_iteration():
    """Alternate evaluation and improvement until the policy is stable."""
    while True:
        policy_evaluation()
        if policy_improvement():
            break

# Run policy iteration
policy_iteration()

print("Optimal Policy:", policy)
print("Optimal Value Function:", value_func)
