# Policies and Value Functions

## Overview
This notebook explores policies and value functions, the two central concepts for solving Markov decision processes (MDPs).

## Learning Objectives
1. Understand deterministic and stochastic policies
2. Learn state value functions V(s) and action value functions Q(s,a)
3. Implement policy evaluation
4. Visualize policies and value functions

## Part 1: Setup

In [None]:
import sys
# NOTE: machine-specific absolute path; change it to the location of your local checkout.
sys.path.insert(0, '/home/dingziming/PycharmProjects/AI-Practices/07-reinforcement-learning/马尔科夫决策过程')

import numpy as np
import matplotlib.pyplot as plt
from src.core import (
    State, Action, MarkovDecisionProcess,
    DeterministicPolicy, StochasticPolicy,
    StateValueFunction, ActionValueFunction
)

print("Imports successful!")

## Part 2: Deterministic Policies

A deterministic policy π: S → A maps each state s to exactly one action a = π(s).

**Advantages**:
- Simple representation: O(|S|) storage
- Sufficient for optimality: every finite discounted MDP has at least one optimal policy that is deterministic
- Easy to interpret and visualize
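
To make the O(|S|) storage claim concrete, here is a minimal stand-alone sketch of a deterministic policy as a lookup table. It does not use the `src.core` classes: states and actions are plain integers, and the names `policy_table`, `act`, and `action_probability` are hypothetical, chosen only for illustration.

```python
# Hypothetical stand-alone sketch: states and actions are plain integers,
# not the State/Action objects used in the notebook cells.
n_states = 4

# One action index per state, so storage grows as O(|S|).
policy_table = [0, 1, 0, 1]


def act(state: int) -> int:
    """Return the single action the policy prescribes in `state`."""
    return policy_table[state]


def action_probability(state: int, action: int) -> float:
    """A deterministic policy puts probability 1 on its chosen action."""
    return 1.0 if policy_table[state] == action else 0.0


for s in range(n_states):
    print(f"pi(s={s}) = a{act(s)}, P(a0|s) = {action_probability(s, 0)}")
```

Viewed this way, a deterministic policy is a special case of a stochastic one whose distribution in each state is concentrated on a single action.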

In [None]:
# Create simple MDP
mdp = MarkovDecisionProcess(discount_factor=0.99)

# Add states
states = [State(i) for i in range(4)]
for s in states:
    mdp.add_state(s)

# Add actions
actions = [Action(i, f"a{i}") for i in range(2)]
for a in actions:
    mdp.add_action(a)

mdp.set_initial_state(states[0])

print(f"Created MDP with {mdp.get_state_space_size()} states and {mdp.get_action_space_size()} actions")

In [None]:
# Create deterministic policy
policy = DeterministicPolicy(mdp)

# Set action for each state
policy.set_action(states[0], actions[0])
policy.set_action(states[1], actions[1])
policy.set_action(states[2], actions[0])
policy.set_action(states[3], actions[1])

print("Deterministic policy created")
print(f"Policy is complete: {policy.is_complete()}")

In [None]:
# Query policy
for state in states:
    action = policy.get_action(state)
    prob = policy.get_action_probability(state, action)
    print(f"π(s={state.state_id}) = a{action.action_id}, P(a|s) = {prob}")

In [None]:
# Convert policy to array
policy_array = policy.to_array()
print(f"Policy array shape: {policy_array.shape}")
print(f"Policy array: {policy_array}")

## Part 3: Stochastic Policies

A stochastic policy π(a|s) defines a probability distribution over actions for each state.

**Advantages**:
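- Enables exploration by assigning nonzero probability to more than one action
- Required by some algorithms, such as policy gradient methods, which optimize a parameterized distribution over actions
- Useful for theoretical analysis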

**Disadvantages**:
- Larger storage: O(|S| × |A|)
- More complex to interpret
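
The O(|S| × |A|) table and the sampling step can be sketched without the `src.core` classes. The example below is an illustration under assumptions: states and actions are integers, and `policy_probs`, `validate`, and `sample_action` are hypothetical names, not part of the notebook's library.

```python
import math
import random

# Hypothetical sketch: row s holds the distribution pi(.|s) over two actions.
policy_probs = [
    [0.5, 0.5],  # uniform over both actions
    [0.9, 0.1],  # mostly action 0
    [0.0, 1.0],  # a deterministic policy is a special case
]


def validate(table, tol=1e-9):
    """Raise ValueError unless every row is a valid probability distribution."""
    for s, row in enumerate(table):
        if any(p < 0.0 for p in row):
            raise ValueError(f"negative probability in state {s}")
        if not math.isclose(sum(row), 1.0, abs_tol=tol):
            raise ValueError(f"probabilities in state {s} sum to {sum(row)}")


def sample_action(table, state, rng):
    """Draw one action index according to pi(.|state)."""
    actions = list(range(len(table[state])))
    return rng.choices(actions, weights=table[state], k=1)[0]


validate(policy_probs)
rng = random.Random(0)  # seeded so repeated runs draw the same samples
print([sample_action(policy_probs, 1, rng) for _ in range(10)])
```

Sampling is what makes exploration possible: in state 1 the policy usually picks action 0 but still tries action 1 about one time in ten.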

In [None]:
# Create stochastic policy
stoch_policy = StochasticPolicy(mdp)

# Set uniform policy: π(a|s) = 1/|A| for all s,a
stoch_policy.set_uniform_policy()

print("Uniform stochastic policy created")
print(f"Policy is complete: {stoch_policy.is_complete()}")

In [None]:
# Query stochastic policy
state = states[0]
probs = stoch_policy.get_all_action_probabilities(state)
print(f"Action probabilities for state {state.state_id}:")
for action, prob in probs.items():
    print(f"  π(a={action.action_id}|s={state.state_id}) = {prob:.3f}")

In [None]:
# Validate probabilities
try:
    stoch_policy.validate_probabilities()
    print("Policy probabilities are valid (sum to 1)")
except ValueError as e:
    print(f"Validation error: {e}")

In [None]:
# Convert stochastic policy to array
stoch_array = stoch_policy.to_array()
print(f"Stochastic policy array shape: {stoch_array.shape}")
print("Policy array (first 2 states):")
print(stoch_array[:2])

## Part 4: State Value Functions

The state value function V^π(s) is the expected discounted return obtained by starting in state s and following policy π thereafter:

$$V^\pi(s) = E[\sum_{t=0}^{\infty} \gamma^t R_t | s_0=s, \pi]$$
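
The quantity inside the expectation is the discounted return of a single trajectory. A minimal sketch of that sum follows, assuming the rewards of one episode are already collected in a list; the helper name `discounted_return` is hypothetical.

```python
def discounted_return(rewards, gamma):
    """Compute sum_t gamma**t * rewards[t] for one finite reward sequence."""
    g = 0.0
    # Work backwards so each step applies G_t = R_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


print(discounted_return([1.0, 1.0, 1.0], 0.5))  # 1 + 0.5 + 0.25 = 1.75

# A long stream of constant reward 1 approaches the geometric limit 1 / (1 - gamma).
print(discounted_return([1.0] * 2000, 0.99))  # close to 100
```

V^π(s) is the average of this quantity over all trajectories that start in s and follow π.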

**Key Properties**:
- Storage: O(|S|)
- Sufficient for policy extraction only when the transition model is known (one-step lookahead)
- Computed via policy evaluation
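
Extracting a policy from V alone requires a one-step lookahead through the transition model. The sketch below illustrates this on a hypothetical two-state model; `toy_model`, `toy_values`, and `lookahead` are illustrative names, and the values were chosen by hand to be the optimal values of this toy problem for a discount of 0.9.

```python
# toy_model[s][a] -> list of (probability, next_state, reward); hypothetical data.
toy_model = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
toy_values = {0: 19.0, 1: 20.0}


def lookahead(s, a, values, gamma):
    """Expected reward plus discounted value of the next state for (s, a)."""
    return sum(p * (r + gamma * values[s2]) for p, s2, r in toy_model[s][a])


# Greedy policy: in each state pick the action with the largest lookahead value.
greedy_from_v = {
    s: max(toy_model[s], key=lambda a: lookahead(s, a, toy_values, 0.9))
    for s in toy_model
}
print(greedy_from_v)
```

Without `toy_model` the lookahead cannot be computed, which is why model-free methods work with Q(s,a) instead.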

In [None]:
# Create state value function
value_fn = StateValueFunction(mdp, initial_value=0.0)

print("State value function created")
print(f"Initial values: {value_fn.to_array()}")

In [None]:
# Set values
value_fn.set_value(states[0], 10.0)
value_fn.set_value(states[1], 5.0)
value_fn.set_value(states[2], 3.0)
value_fn.set_value(states[3], 1.0)

print("Values set")
print(f"Values: {value_fn.to_array()}")

In [None]:
# Query value statistics
print(f"Max value: {value_fn.get_max_value():.3f}")
print(f"Min value: {value_fn.get_min_value():.3f}")
print(f"Mean value: {value_fn.get_mean_value():.3f}")

In [None]:
# Update values
value_fn.update_value(states[0], -2.0)  # Decrease by 2
print(f"After update: {value_fn.to_array()}")

## Part 5: Action Value Functions (Q-functions)

The action value function Q^π(s,a) is the expected discounted return obtained by taking action a in state s and following policy π thereafter:

$$Q^\pi(s,a) = E[\sum_{t=0}^{\infty} \gamma^t R_t | s_0=s, a_0=a, \pi]$$

**Key Properties**:
- Storage: O(|S| × |A|)
- Enables direct policy extraction (no transition model needed)
- Foundation for Q-learning and other model-free algorithms
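
Direct policy extraction from Q amounts to an argmax over each row of the table. Here is a minimal sketch, assuming Q is stored as a list of rows indexed by integer state; `q_table` and `greedy_action` are hypothetical names, and ties are broken toward the lowest action index.

```python
# Hypothetical Q-table: q_table[s][a] for 4 states and 2 actions.
q_table = [
    [10.0, 5.0],
    [3.0, 8.0],
    [0.0, 0.0],
    [0.0, 0.0],
]


def greedy_action(q_row):
    """Index of the largest Q-value; max() returns the first maximum on ties."""
    return max(range(len(q_row)), key=lambda a: q_row[a])


greedy_policy = [greedy_action(row) for row in q_table]
greedy_values = [max(row) for row in q_table]

print(greedy_policy)  # [0, 1, 0, 0]
print(greedy_values)  # [10.0, 8.0, 0.0, 0.0]
```

No transition probabilities appear anywhere in this computation, in contrast to extraction from V.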

In [None]:
# Create action value function
q_fn = ActionValueFunction(mdp, initial_value=0.0)

print("Action value function created")
print(f"Q-function shape: {q_fn.to_array().shape}")

In [None]:
# Set Q-values
q_fn.set_value(states[0], actions[0], 10.0)
q_fn.set_value(states[0], actions[1], 5.0)
q_fn.set_value(states[1], actions[0], 3.0)
q_fn.set_value(states[1], actions[1], 8.0)

print("Q-values set")
print("Q-function array:")
print(q_fn.to_array())

In [None]:
# Query Q-values for a state
state = states[0]
q_values = q_fn.get_action_values(state)
print(f"Q-values for state {state.state_id}:")
for action, q_val in q_values.items():
    print(f"  Q(s={state.state_id}, a={action.action_id}) = {q_val:.3f}")

In [None]:
# Extract the greedy action for a single state from the Q-function
best_action = q_fn.get_best_action(states[0])
max_q = q_fn.get_max_action_value(states[0])

print(f"Best action for state 0: a{best_action.action_id}")
print(f"Max Q-value: {max_q:.3f}")

## Part 6: Relationship Between V and Q

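The state value and action value functions of a policy π determine each other. Here P(s'|s,a) is the transition probability and R(s,a,s') is the reward received for that transition: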

$$V^\pi(s) = \sum_a \pi(a|s) Q^\pi(s,a)$$

$$Q^\pi(s,a) = \sum_{s'} P(s'|s,a)[R(s,a,s') + \gamma V^\pi(s')]$$
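
Substituting the second equation into the first gives an update that can be iterated until V stops changing, which is iterative policy evaluation. The sketch below applies it to a hypothetical two-state MDP rather than the `src.core` classes; all names (`toy_transitions`, `toy_policy`, `q_from_v`, `evaluate_policy`) are assumptions made for illustration.

```python
# toy_transitions[s][a] -> list of (probability, next_state, reward).
# State 1 is absorbing with zero reward; in state 0, action 0 stays put and
# earns 1, while action 1 moves to state 1 and earns 0.
toy_transitions = {
    0: {0: [(1.0, 0, 1.0)], 1: [(1.0, 1, 0.0)]},
    1: {0: [(1.0, 1, 0.0)], 1: [(1.0, 1, 0.0)]},
}
# Uniform stochastic policy: toy_policy[s][a] = pi(a|s).
toy_policy = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.5, 1: 0.5}}


def q_from_v(v, s, a, gamma):
    """Q(s,a) = sum_s' P(s'|s,a) * (R(s,a,s') + gamma * V(s'))."""
    return sum(p * (r + gamma * v[s2]) for p, s2, r in toy_transitions[s][a])


def evaluate_policy(gamma, tol=1e-10, max_iters=10_000):
    """Iterate V(s) <- sum_a pi(a|s) * Q(s,a) until the largest change is below tol."""
    v = {s: 0.0 for s in toy_transitions}
    for _ in range(max_iters):
        new_v = {
            s: sum(toy_policy[s][a] * q_from_v(v, s, a, gamma)
                   for a in toy_transitions[s])
            for s in toy_transitions
        }
        delta = max(abs(new_v[s] - v[s]) for s in toy_transitions)
        v = new_v
        if delta < tol:
            break
    return v


v_pi = evaluate_policy(gamma=0.5)
q_pi = {(s, a): q_from_v(v_pi, s, a, 0.5)
        for s in toy_transitions for a in toy_transitions[s]}
print(v_pi)
print(q_pi)
```

For this toy problem the fixed point can be checked by hand: V(0) = 0.5 · (1 + 0.5 · V(0)) gives V(0) = 2/3, and therefore Q(0, a0) = 1 + 0.5 · 2/3 = 4/3.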

In [None]:
# Demonstrate relationship
state = states[0]

# Get Q-values
q_vals = q_fn.get_action_values(state)
print(f"Q-values for state {state.state_id}: {list(q_vals.values())}")

# Get policy probabilities
probs = stoch_policy.get_all_action_probabilities(state)
print(f"Policy probabilities: {list(probs.values())}")

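# Compute V(s) = Σ_a π(a|s) Q(s,a).
# The Q-values were set by hand above, so this illustrates the formula only;
# it is not the true value of the uniform policy in this MDP.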
v_computed = sum(probs[a] * q_vals[a] for a in actions)
print(f"\nComputed V(s) from Q-values: {v_computed:.3f}")

## Summary

Key concepts:
1. **Deterministic policies** map states to single actions
2. **Stochastic policies** define action probability distributions
3. **State value functions** V(s) estimate long-term rewards from states
4. **Action value functions** Q(s,a) estimate long-term rewards from state-action pairs
5. **V and Q are related** through the policy and transition model