# Predicting Rewards with the Action Value Function

https://rl-book.com/learn/mdp/action_value_function/

This functions is same as state-value function, but adds the action to the consideration. In this case, action is a second dimension of the value/reward prediction problem

$$
Q_{\pi}(s, a) = E_{\pi} [G|s, a] = E_{\pi} [\sum_{k=0}^{T} \gamma^kr_k|s, a]
$$

where G is the return, s is the state, a is the action choise, γ is the discount factor, and r is the reward.



## Simple grid environment

In [2]:
starting_position = 1 # The starting position
cliff_position = 0 # The cliff position
end_position = 5 # The terminating state position
reward_goal_state = 5 # Reward for reaching goal
reward_cliff = 0 # Reward for falling off cliff

def reward(current_position) -> int:
    if current_position <= cliff_position:
        return reward_cliff
    if current_position >= end_position:
        return reward_goal_state
    return 0

def is_terminating(current_position) -> bool:
    if current_position <= cliff_position:
        return True
    if current_position >= end_position:
        return True
    return False



## Agent

In [3]:
import numpy as np

def strategy() -> int:
    if np.random.random() >= 0.5:
        return 1 # right
    else:
        return -1 # left

## Experiment

In [13]:
import numpy as np
np.random.seed(42)


# Global buffers to perform averaging later
# Second dimension is the actions
value_sum = np.zeros((end_position + 1, 2))
n_hits = np.zeros((end_position + 1, 2))

# A helper function to map the actions to valid buffer indices
def action_value_mapping(x): return 0 if x == -1 else 1

n_iter = 10
for i in range(n_iter):
    position_history = [] # A log of positions in this episode
    current_position = starting_position # Reset
    current_action = strategy()
    while True:
        # Append position to log
        position_history.append((current_position, current_action))

        if is_terminating(current_position):
            break
        
        # Update current position according to strategy
        current_position += strategy()

    # Now the episode has finished, what was the reward?
    current_reward = reward(current_position)
    

    # Now add the reward to the buffers that allow you to calculate the average
    for pos, act in position_history:
        value_sum[pos, action_value_mapping(act)] += current_reward
        n_hits[pos, action_value_mapping(act)] += 1
    
    

    # Now calculate the average for this episode and print
    expect_return_0 = ', '.join(
        f'{q:.2f}' for q in value_sum[:, 0] / n_hits[:, 0])
    expect_return_1 = ', '.join(
        f'{q:.2f}' for q in value_sum[:, 1] / n_hits[:, 1])
    
    print("transitions:", position_history[0])
    print("[{}] Average reward: [{} ; {}]".format(
        i, expect_return_0, expect_return_1))

transitions: (1, -1)
[0] Average reward: [nan, 5.00, 5.00, 5.00, 5.00, 5.00 ; nan, nan, nan, nan, nan, nan]
transitions: (1, -1)
[1] Average reward: [0.00, 3.33, 5.00, 5.00, 5.00, 5.00 ; nan, nan, nan, nan, nan, nan]
transitions: (1, -1)
[2] Average reward: [0.00, 2.50, 5.00, 5.00, 5.00, 5.00 ; nan, nan, nan, nan, nan, nan]
transitions: (1, 1)
[3] Average reward: [0.00, 2.50, 5.00, 5.00, 5.00, 5.00 ; 0.00, 0.00, nan, nan, nan, nan]
transitions: (1, -1)
[4] Average reward: [0.00, 1.67, 3.75, 5.00, 5.00, 5.00 ; 0.00, 0.00, nan, nan, nan, nan]
transitions: (1, -1)
[5] Average reward: [0.00, 1.43, 3.75, 5.00, 5.00, 5.00 ; 0.00, 0.00, nan, nan, nan, nan]
transitions: (1, 1)
[6] Average reward: [0.00, 1.43, 3.75, 5.00, 5.00, 5.00 ; 0.00, 0.00, nan, nan, nan, nan]
transitions: (1, 1)
[7] Average reward: [0.00, 1.43, 3.75, 5.00, 5.00, 5.00 ; 0.00, 0.00, 0.00, nan, nan, nan]
transitions: (1, 1)
[8] Average reward: [0.00, 1.43, 3.75, 5.00, 5.00, 5.00 ; 0.00, 0.00, 0.00, 0.00, nan, nan]
transitio

  f'{q:.2f}' for q in value_sum[:, 0] / n_hits[:, 0])
  f'{q:.2f}' for q in value_sum[:, 1] / n_hits[:, 1])
