# Calculate average values of states over many episodes

In the lessons, we discussed how $v_{\pi}(s)$ and $Q_{\pi}(s, a)$ are expected values. Therefore, to estimate them accurately, we need to average over many samples of the value function or the Q-value function.
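
Written out, the averaging looks roughly like the following. The symbols $G^{(i)}(s)$ and $G^{(i)}(s, a)$ are notation assumed here (not necessarily the one used in the lessons) for the $i$-th sampled discounted return starting from state $s$, or from state $s$ with action $a$, and $N$ is the number of samples collected.

$$
v_{\pi}(s) = \mathbb{E}_{\pi}\left[ G \mid s \right] \approx \frac{1}{N} \sum_{i=1}^{N} G^{(i)}(s),
\qquad
Q_{\pi}(s, a) = \mathbb{E}_{\pi}\left[ G \mid s, a \right] \approx \frac{1}{N} \sum_{i=1}^{N} G^{(i)}(s, a)
$$

The more samples go into each average, the closer the estimate should get to the true expected value.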

In the last lesson, we did exactly that for the Q-value function and calculated the following Q-values for these state-action pairs.

| State (s) | Action (a) | Policy ($\pi$) | $Q_{\pi}(s, a)$ |
| --- | --- | --- | --- |
| `[0, 0.01, 0.15, 0]` | 1 | random | 8.15 |
| `[0, 0.01, 0.15, 0]` | 0 | random | 6.33 |

We used a helper class called `QValue` to do this calculation. The code for this class is given below for your reference.

In [None]:
class QValue:
    def __init__(self, gamma, visit_number=None, q_value_average=None):
        self.gamma = gamma
        if visit_number is None:
            self.visit_number = {}
        else:
            self.visit_number = visit_number
        if q_value_average is None:
            self.q_value_average = {}
        else:
            self.q_value_average = q_value_average
        
    def update(self, episode_history):
        backward_reward_sum = 0
        for step in reversed(episode_history):
            backward_reward_sum = (self.gamma * backward_reward_sum) + step["reward"]
            key = (tuple(step["observation"]), step["action"])
            try:
                visit_number = self.visit_number[key]
            except KeyError:
                visit_number = 0
            if visit_number == 0:
                self.q_value_average[key] = backward_reward_sum
            else:
                self.q_value_average[key] = (visit_number * self.q_value_average[key] + backward_reward_sum) / (visit_number + 1)
            self.visit_number[key] = visit_number + 1
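
The `update` method above combines two ideas: it walks the episode backwards to accumulate the discounted reward sum, and it folds each new sample into a running average. The sketch below isolates both steps on a toy reward list. The function names `discounted_returns` and `running_mean` are made up for this illustration and are not part of the course code.

```python
def discounted_returns(rewards, gamma):
    """Return the discounted reward sum from each step, computed back to front."""
    returns = []
    running = 0.0
    for reward in reversed(rewards):
        # Same recurrence as backward_reward_sum in QValue.update
        running = gamma * running + reward
        returns.append(running)
    returns.reverse()
    return returns


def running_mean(samples):
    """Fold samples into a mean one at a time, like the q_value_average update."""
    mean, count = 0.0, 0
    for sample in samples:
        mean = (count * mean + sample) / (count + 1)
        count += 1
    return mean


rewards = [1.0, 1.0, 1.0]  # toy episode with three steps
returns = discounted_returns(rewards, gamma=0.9)
print(returns)  # roughly [2.71, 1.9, 1.0]
print(running_mean(returns), sum(returns) / len(returns))  # the two means should agree
```

The running mean gives the same result as summing all samples and dividing at the end, but it only needs the previous average and the visit count, which is why `QValue` stores just those two dictionaries.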

## In this exercise, we want to do the same thing, but for the value function instead of the action-value function. We want to average over the values of states over many episodes.

We want to do this in the following scenario.

1. We want to use the `pole_right_init_cartpole_env` (i.e. with initial state `[0., 0.01, 0.15, 0.]`)
2. We want to follow the random policy.
3. We want to compute averages of the value samples obtained in 100000 episodes.
4. We want to use $\gamma=0.9$.

The following code should be able to do it. Read it carefully.

```
import random

import gym
import numpy as np


def get_action_random_policy(observation):
    if random.random() < 0.5:
        return 0
    return 1


class InitMod(gym.Wrapper):
    def __init__(self, env, initial_state):
        super().__init__(env)
        self.initial_state = initial_state
        
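    def reset(self):
        # Assumes the older gym API, where reset() returns only the observation
        # and step() returns four values.
        self.env.reset()
        # Overwrite the randomly sampled start state with the fixed initial state.
        self.unwrapped.state = self.initial_state
        return self.unwrapped.state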
        
        
pole_right_init_cartpole_env = InitMod(env=gym.make("CartPole-v0"), initial_state=np.array([0., 0.01, 0.15, 0.]))

value_info = Value(gamma=0.9)    # this class is not defined yet

num_episodes = 100000
for num_episode in range(num_episodes):
    episode_history = []
    observation = pole_right_init_cartpole_env.reset()
    while True:
        action = get_action_random_policy(observation)
        next_observation, reward, done, _ = pole_right_init_cartpole_env.step(action)
        episode_history.append({"observation": observation, "reward": reward})
        observation = next_observation
        if done:
            break
    value_info.update(episode_history)    # value_info is not defined yet
pole_right_init_cartpole_env.close()
```

But there is a problem: we don't have a `Value` class that can calculate the averages of the value samples. We have a `QValue` class for the Q-value function, but no equivalent class for the value function.

Your job is to implement such a `Value` class.

## Implement the `Value` class

I have provided a skeleton in the cell below. Complete its `update` method.

In [None]:
class Value:
    def __init__(self, gamma, visit_number=None, value_average=None):
        self.gamma = gamma
        if visit_number is None:
            self.visit_number = {}
        else:
            self.visit_number = visit_number
        if value_average is None:
            self.value_average = {}
        else:
            self.value_average = value_average
        
    def update(self, episode_history):
        # Implement this method so that the value averages of the states in episode_history are updated.
        # Look at the QValue class for hints.
        raise NotImplementedError("Replace this line with your implementation")

## After you have implemented the `Value` class, run the code below.

It should print the value of the state `[0., 0.01, 0.15, 0.]`.

If you implemented the `Value` class correctly, you should get a value close to 7.15.
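
As a rough cross-check, the value of a state is the policy-weighted average of its Q-values, $v_{\pi}(s) = \sum_a \pi(a \mid s)\, Q_{\pi}(s, a)$. The sketch below applies this to the two Q-value estimates from the table at the top of this notebook, assuming the random policy picks each action with probability 0.5 (as `get_action_random_policy` does). The variable names are made up for this illustration.

```python
# Q-value estimates for the initial state, copied from the table above.
q_estimates = {1: 8.15, 0: 6.33}
# The random policy chooses each of the two actions with probability 0.5.
action_probabilities = {0: 0.5, 1: 0.5}

# Policy-weighted average of the Q-values.
state_value_estimate = sum(
    action_probabilities[action] * q_estimates[action] for action in q_estimates
)
print(round(state_value_estimate, 2))
```

This comes out at about 7.24, in the same neighborhood as the figure quoted above. All of these numbers are sample-based estimates, so an exact match is not expected.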

In [None]:
import random

import gym
import numpy as np


def get_action_random_policy(observation):
    if random.random() < 0.5:
        return 0
    return 1


class InitMod(gym.Wrapper):
    def __init__(self, env, initial_state):
        super().__init__(env)
        self.initial_state = initial_state
        
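    def reset(self):
        # Assumes the older gym API, where reset() returns only the observation
        # and step() returns four values.
        self.env.reset()
        # Overwrite the randomly sampled start state with the fixed initial state.
        self.unwrapped.state = self.initial_state
        return self.unwrapped.state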
        
        
pole_right_init_cartpole_env = InitMod(env=gym.make("CartPole-v0"), initial_state=np.array([0., 0.01, 0.15, 0.]))

value_info = Value(gamma=0.9)

num_episodes = 100000
for num_episode in range(num_episodes):
    episode_history = []
    observation = pole_right_init_cartpole_env.reset()
    while True:
        action = get_action_random_policy(observation)
        next_observation, reward, done, _ = pole_right_init_cartpole_env.step(action)
        episode_history.append({"observation": observation, "reward": reward})
        observation = next_observation
        if done:
            break
    value_info.update(episode_history)
pole_right_init_cartpole_env.close()

state = (0., 0.01, 0.15, 0.)
print(f"The value of the state {state} is {value_info.value_average[state]}")

## Now check how many states the agent has seen in these 100000 episodes. Run the code below.

In [None]:
print(len(value_info.value_average))

## Do you see what a whopping number that is! The order of magnitude is a few hundred thousand!

Computing averages for a significant number of value samples (say 1000 samples for each state) of all these states is going to be unbelievably expensive computationally. And this is the *simplest* Reinforcement Learning environment `CartPole-v0`!
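
To see how far a run is from that kind of sample budget, one option is to summarize the visit counts that the `Value` object already tracks. The sketch below is only an illustration: `summarize_visits` and `min_samples` are hypothetical names, and the toy dictionary stands in for `value_info.visit_number` with made-up numbers.

```python
from collections import Counter


def summarize_visits(visit_number, min_samples=1000):
    """Summarize how many value samples each state has received."""
    counts = list(visit_number.values())
    return {
        "states": len(counts),
        "states_with_enough_samples": sum(1 for n in counts if n >= min_samples),
        "visit_histogram": Counter(counts),  # visit count -> number of states
    }


# Toy stand-in for value_info.visit_number (hypothetical states and counts).
toy_visit_number = {
    (0.0, 0.01, 0.15, 0.0): 5,
    (0.0002, 0.2, 0.15, -0.25): 3,
    (0.0002, -0.18, 0.15, 0.33): 2,
    (0.004, 0.39, 0.145, -0.5): 1,
}
summary = summarize_visits(toy_visit_number, min_samples=3)
print(summary["states"], summary["states_with_enough_samples"])
```

After the real run, calling `summarize_visits(value_info.visit_number)` would show how many of the visited states actually reached 1000 samples.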

This is called "state space explosion" :D

So what would happen for more complicated environments?

If that scares you, don't worry. There is a solution for this. And actually, you have already implemented the solution in one of the earlier assignments!

Can you guess which one?