# Verify the Bellman Expectation Equation for the Q-Value Function

In the previous lesson, we discussed the Bellman Expectation Equation for the value function. <br><br><br>

<center>$\Large V_{\pi}(s(t)) = \mathbf{E}[r(t) + \gamma V_{\pi}(s(t+1)) | s(t)]$</center>

Actually, there's also a Bellman Expectation Equation for the **Q-value function** (aka action-value function), and it is as follows. <br><br><br>

<center>$\Large Q_{\pi}(s(t), a(t)) = \mathbf{E}[r(t) + \gamma Q_{\pi}(s(t+1), a(t+1)) | s(t), a(t)]$</center>

- Here the expectation $\mathbf{E}$ is computed over all possible $s(t+1)$ and $a(t+1)$, while following the policy $\pi$. 

# Task 1: Derive the Bellman Expectation Equation for the Q-value function.

- This task is not a coding task. You need to do this with pen and paper.
- Drawing diagrams (like I did in the lesson) may help.

## How to draw the diagram

1. Start with a solid circle representing $Q_{\pi}(s(t), a(t))$. This solid circle represents taking action $a(t)$ in state $s(t)$.
2. The environment will transition to new states at time $t+1$ as a result of the action $a(t)$. Draw hollow circles representing those states. Call one of them $s^{'}$. $s^{'}$ is a possible $s(t+1)$, and so are the other hollow circles.
3. Connect the solid circle to the hollow circles. These connecting lines represent the state transition. Mark the transitions with the rewards $r_{a}^{s(t)s^{'}}$ and the transition probabilities $P_{a}^{s(t)s^{'}}$. The reward $r_{a}^{s(t)s^{'}}$ is a possible $r(t)$, and so are the rewards marked on the other connecting lines.
4. Then from new states e.g. $s^{'}$, draw lines to new solid circles. These solid circles represent taking an action in the new states. Call one of the new solid circles $a^{'}$. $a^{'}$ is a possible $a(t+1)$, and so are the other new solid circles.
5. Mark the connecting lines between the hollow circles and the new solid circles with the probability of taking actions $\pi(a^{'} | s^{'})$.
6. Finally, try to express $Q_{\pi}(s(t), a(t))$ (Q-value at solid circle in Step 1) in terms of $Q_{\pi}(s(t+1), a(t+1))$ (Q-value at solid circles in Step 4).

Good luck!

# Task 2: Verify the Bellman Expectation Equation for the Q-value function in `CartPole-v0`.

To verify the Bellman Expectation Equation for the Q-Value function, we will first collect the Q-Values for many states using the following strategy. 

1. We will set up a wrapped environment `pole_right_init_cartpole_env` with the initial state `[0., 0.01, 0.15, 0.]` i.e. the pole tilted to the right at initialization. (code is supplied)
2. We will use the epsilon pole direction policy with epsilon 0.9: with probability 0.9 it takes a random action, and otherwise it pushes the cart in the direction the pole is leaning. (code is supplied)
3. Using that policy, we will run 100000 episodes in the wrapped environment and store the estimated Q-value of every state-action pair we visit using the `QValue` helper class, with $\gamma=0.8$. (code is supplied)

You don't have to write any code for the above. The code is supplied below; just run it. It will store the Q-values in `q_value_epsilon_pole_direction_policy`, which is an instance of the `QValue` helper class.

In [None]:
import random

import gym
import numpy as np


class InitMod(gym.Wrapper):
    """Wrapper class to change the initial state in CartPole-v0
    """
    def __init__(self, env, initial_state):
        super().__init__(env)
        self.initial_state = initial_state
        
    def reset(self):
        observation = self.env.reset()
        self.unwrapped.state = self.initial_state
        return self.unwrapped.state

    
# create the wrapped env where the pole is tilted to the right in the initial state
pole_right_init_cartpole_env = InitMod(env=gym.make("CartPole-v0"), initial_state=np.array([0, 0.01, 0.15, 0]))


def get_action_random(observation):
    """Sampling function for random policy
    """
    if random.random() < 0.5:
        return 0
    return 1
    
    
def get_action_epsilon_pole_direction_policy(observation):
    """Sampling function for the epsilon pole direction policy
    """
    if random.random() < 0.9:
        return get_action_random(observation)
    if observation[2] > 0:
        return 1
    return 0


class QValue:
    def __init__(self, gamma, visit_number=None, q_value_average=None):
        self.gamma = gamma
        if visit_number is None:
            self.visit_number = {}
        else:
            self.visit_number = visit_number
        if q_value_average is None:
            self.q_value_average = {}
        else:
            self.q_value_average = q_value_average
        
    def update(self, episode_history):
        # Walk the episode backwards so that the discounted return from each
        # step can be accumulated in a single pass.
        backward_reward_sum = 0
        for step in reversed(episode_history):
            backward_reward_sum = (self.gamma * backward_reward_sum) + step["reward"]
            key = (tuple(step["observation"]), step["action"])
            visit_number = self.visit_number.get(key, 0)
            if visit_number == 0:
                self.q_value_average[key] = backward_reward_sum
            else:
                # running mean over every visit to this (state, action) pair
                self.q_value_average[key] = ((visit_number * self.q_value_average[key]) + backward_reward_sum) / (visit_number + 1)
            self.visit_number[key] = visit_number + 1
            

# compute q-values of states by going through 100000 episodes
num_episodes = 100000
gamma = 0.8

q_value_epsilon_pole_direction_policy = QValue(gamma=gamma)
for num_episode in range(num_episodes):
    episode_history = []
    observation = pole_right_init_cartpole_env.reset()
    while True:
        action = get_action_epsilon_pole_direction_policy(observation)
        next_observation, reward, done, _ = pole_right_init_cartpole_env.step(action)
        episode_history.append({"observation": observation, "reward": reward, "action": action})
        observation = next_observation
        if done:
            break
    q_value_epsilon_pole_direction_policy.update(episode_history)
pole_right_init_cartpole_env.close()
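
The `update` method above combines two ideas: a backward pass that accumulates the discounted return for each step, and a running mean over visits. The sketch below isolates both on made-up numbers. `discounted_returns` and `running_mean` are hypothetical helper names that are not part of the notebook.

```python
def discounted_returns(rewards, gamma):
    """Return the discounted return G(t) for every step of one episode."""
    returns = []
    backward_sum = 0.0
    # Walk backwards: G(t) = r(t) + gamma * G(t+1)
    for reward in reversed(rewards):
        backward_sum = reward + gamma * backward_sum
        returns.append(backward_sum)
    returns.reverse()
    return returns


def running_mean(values):
    """Incrementally average values, the same way QValue.update does."""
    average = 0.0
    for count, value in enumerate(values):
        average = (count * average + value) / (count + 1)
    return average


# Three steps with reward 1 each (as in CartPole) and gamma = 0.8
episode_returns = discounted_returns([1, 1, 1], gamma=0.8)
print(episode_returns)

# The incremental update gives the same result as a plain mean
print(running_mean([2.0, 4.0, 9.0]))
```

The first list holds the returns from the first, second, and last step (roughly 2.44, 1.8, and 1.0), and the incremental average of the three made-up values equals their ordinary mean.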

##  $s(0)=[0., 0.01, 0.15, 0.]$ and let's assume $a(0)=0$. In a variable `q_s0_a0`, store the Q-value of this state-action pair by reading the data in `q_value_epsilon_pole_direction_policy`. Print `q_s0_a0`.

Your code goes in the cell below.
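
As a hint on the data layout: `q_value_average` is a plain dictionary keyed by `(observation tuple, action)`, as built in `QValue.update`. The sketch below shows the lookup pattern on a toy dictionary with made-up values. `toy_q_value_average` is a hypothetical name, and a plain list stands in for the observation (a NumPy array converts to a tuple the same way).

```python
# Toy stand-in for QValue.q_value_average, with made-up numbers
toy_q_value_average = {
    ((0.0, 0.01, 0.15, 0.0), 0): 3.2,
    ((0.0, 0.01, 0.15, 0.0), 1): 3.9,
}

state = [0.0, 0.01, 0.15, 0.0]
action = 0

# Lists and arrays are not hashable, so the observation is converted to a tuple first
toy_value = toy_q_value_average[(tuple(state), action)]
print(toy_value)
```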

In [None]:
# In a variable q_s0_a0, store the q-value of the initial state and action 0 and print it out

## In a variable `s1`, store the next state $s(1)$, if $s(0)=[0., 0.01, 0.15, 0.]$ and  $a(0)=0$.

Your code goes in the cell below.

In [None]:
# In the variable s1, store the next state if you take an action 0 in the initial state [0.,0.01,0.15,0.]

# The Bellman Expectation equation for the Q-value function for time $t=0$ is given below again for your reference <br><br><br>

<center>$\Large Q_{\pi}(s(0), a(0)) = \mathbf{E}[r(0) + \gamma Q_{\pi}(s(1), a(1)) | s(0), a(0)]$</center>

<br><br>
##  I have written an expression below that computes the right hand side of the equation, assuming $s(0)=[0., 0.01, 0.15, 0.]$ and $a(0)=0$. But the expression is not complete. Your job is to complete it.

1. The first (leftmost) `1` stands for $\mathbf{E}[r(0) | s(0), a(0)]$, since the reward in `CartPole-v0` is always `1`.
2. The second big expression (partially filled) stands for $\mathbf{E}[\gamma Q_{\pi}(s(1), a(1)) | s(0), a(0)]$.
3. The `0.8` in the second expression is the discount factor $\gamma$.
4. Remember that $\pi$ is the epsilon pole direction policy.
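
The two blanks are the policy probabilities $\pi(a(1) | s(1))$ that weight the two Q-values. As a hedged illustration of how such weights arise, the sketch below works through a policy of the same shape, but with a made-up epsilon of 0.5 and made-up Q-values. `action_probabilities` is a hypothetical helper, not part of the notebook.

```python
def action_probabilities(epsilon, preferred_action):
    """Action probabilities for a two-action policy that acts randomly
    with probability epsilon and otherwise takes preferred_action."""
    probabilities = {0: epsilon / 2, 1: epsilon / 2}
    probabilities[preferred_action] += 1 - epsilon
    return probabilities


# Made-up numbers, chosen only to illustrate the weighting
toy_probabilities = action_probabilities(epsilon=0.5, preferred_action=1)
toy_q = {0: 2.0, 1: 4.0}

# Expectation over the next action: sum of pi(a | s) * Q(s, a)
expected_q = sum(toy_probabilities[a] * toy_q[a] for a in (0, 1))
print(toy_probabilities)
print(expected_q)
```

The same reasoning applies to the actual policy once you substitute its epsilon and work out which action it prefers in $s(1)$.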

Fill in the blanks with numerical values to complete the expression

In [None]:
# Complete the expression that computes the right hand side of the equation. Fill in the blanks with numerical values only.
1 + 0.8 * (____ * q_value_epsilon_pole_direction_policy.q_value_average[tuple(s1), 0] + ____ * q_value_epsilon_pole_direction_policy.q_value_average[tuple(s1), 1])

If you did this right, you should get a value very close to `q_s0_a0`, since the right hand side must be equal to `q_s0_a0`, according to the Bellman Expectation Equation.
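
Because both sides are estimated from a finite number of episodes, they will generally not match to the last digit. One way to compare them, sketched here with made-up numbers and an arbitrarily chosen 1% tolerance, is `math.isclose`:

```python
import math

# Made-up stand-ins for q_s0_a0 and the completed right-hand side
toy_left_hand_side = 4.012
toy_right_hand_side = 4.025

# rel_tol=0.01 accepts a difference of up to 1% of the larger magnitude
values_agree = math.isclose(toy_left_hand_side, toy_right_hand_side, rel_tol=0.01)
print(values_agree)
```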

The Bellman Expectation Equation is a foundational equation in RL, and it's awesome that we can reproduce it using `CartPole-v0` and Python!