# Plot the number of state-action pairs seen as a function of the number of episodes in exploration mode

# Attention: Please install `matplotlib` before doing this exercise.

- Installing it is as simple as running `pip install matplotlib` in your virtualenv.
- Once you have installed it, start reading from the cell below.

## A key takeaway of the previous lesson is that we are not exploring enough. We ran the random policy for `10000` episodes but we still didn't see all state-action pairs in the MDP. Therefore, we should be exploring more.

In this exercise, that's exactly what we are going to do. We are going to explore for `100000` episodes. That's `10` times longer than before. We will discover many more state-action pairs!


### Your job is to plot the number of state-action pairs that we see as a function of the number of episodes.

I have already supplied most of the helper functions and classes that you will need below. Run the cell to load them into memory.

In [None]:
import random

import numpy as np


def get_action_random(observation):
    """Sampling function for random policy
    """
    if random.random() < 0.5:
        return 0
    return 1


class QValue:
    """Helper for computing Q-value of state-action pairs. 
    It has an update() method that updates averages of Q-value samples with new episode data
    """
    def __init__(self, gamma, visit_number=None, q_value_average=None):
        self.gamma = gamma
        if visit_number is None:
            self.visit_number = {}
        else:
            self.visit_number = visit_number
        if q_value_average is None:
            self.q_value_average = {}
        else:
            self.q_value_average = q_value_average
        
    def update(self, episode_history):
        backward_reward_sum = 0
        for step in reversed(episode_history):
            backward_reward_sum = (self.gamma * backward_reward_sum) + step["reward"]
            key = (tuple(step["observation"]), step["action"])
            try:
                visit_number = self.visit_number[key]
            except KeyError:
                visit_number = 0
            if visit_number == 0:
                self.q_value_average[key] = backward_reward_sum
            else:
                self.q_value_average[key] = (visit_number * self.q_value_average[key] + 
                                             backward_reward_sum
                                             ) / (visit_number + 1)
            self.visit_number[key] = visit_number + 1
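
To make the two ideas inside `update()` concrete, here is a minimal standalone sketch of the backward discounted sum and the running average. The function names `discounted_returns` and `incremental_mean` and the toy rewards are made up for illustration; they are not part of the helpers above.

```python
def discounted_returns(rewards, gamma):
    """Return the discounted return from each step, computed backward
    the same way QValue.update() walks the episode in reverse."""
    returns = []
    running = 0.0
    for reward in reversed(rewards):
        running = gamma * running + reward
        returns.append(running)
    returns.reverse()  # put the returns back in chronological order
    return returns


def incremental_mean(old_mean, count, new_sample):
    """Update an average of `count` samples with one more sample."""
    return (count * old_mean + new_sample) / (count + 1)


# Toy episode with three steps, each giving a reward of 1 (made-up values)
returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
print(returns)  # [1.75, 1.5, 1.0]

# Averaging the samples 1.0 and 3.0 one at a time
mean_after_two = incremental_mean(old_mean=1.0, count=1, new_sample=3.0)
print(mean_after_two)  # 2.0
```

The running average avoids storing every sample: only the current mean and the visit count are kept per state-action pair.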

I have also set up the wrapped environment where we will be performing this experiment. Run the cell below to load this into memory.

In [None]:
import gym


class InitMod(gym.Wrapper):
    """Wrapper class to change initial state in CartPole-v0
    """
    def __init__(self, env, initial_state):
        super().__init__(env)
        self.initial_state = initial_state
        
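    def reset(self):
        # Reset the underlying env first (its returned observation is discarded),
        # then overwrite the internal state with our fixed initial state
        self.env.reset()
        self.unwrapped.state = self.initial_state
        return self.unwrapped.state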
    

# create the wrapped env    
wrapped_env = InitMod(env=gym.make("CartPole-v0"), initial_state=np.array([0, 0.01, 0.15, 0.]))

I have also written a loop where the agent will take random actions for `100000` episodes and calculate the Q values for all state-action pairs seen, using a `QValue` helper instance called `q_value_greedy_policy`.

Here's what you need to do.

1. I have set up two empty 1D `numpy` arrays before starting the loop. They are called `x_num_episodes` and `y_num_state_action_pairs`.
2. Anytime the agent completes a multiple of `1000` episodes, you should append the current episode number to `x_num_episodes`. You should also, at the same time, append the total number of state-action pairs seen so far to `y_num_state_action_pairs`.
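
If you are unsure how the bookkeeping might look, here is a rough sketch on toy data rather than the real environment. The dict `visit_number_demo`, the list `fake_keys`, and the interval `report_every` are hypothetical stand-ins; the sketch assumes that the number of unique pairs seen equals the number of keys in a dict shaped like `QValue.visit_number`. Note that `np.append` returns a new array instead of modifying in place, so the result has to be assigned back.

```python
import numpy as np

# Stand-in for QValue.visit_number: keys are (observation tuple, action)
visit_number_demo = {}

x_demo = np.array([])
y_demo = np.array([])

# One made-up state-action pair per "episode", just to drive the loop
fake_keys = [
    ((0, 0), 0), ((0, 0), 1), ((0, 1), 0),
    ((0, 0), 0), ((1, 1), 1), ((0, 1), 0),
]
report_every = 2  # the exercise uses 1000

for episode_index, key in enumerate(fake_keys):
    visit_number_demo[key] = visit_number_demo.get(key, 0) + 1
    completed = episode_index + 1  # episodes finished so far
    if completed % report_every == 0:
        # np.append returns a copy, so reassign the result
        x_demo = np.append(x_demo, completed)
        y_demo = np.append(y_demo, len(visit_number_demo))

print(x_demo)
print(y_demo)
```

Whether you count completed episodes from `0` or from `1` is a design choice; the sketch counts an episode as completed once its update has run.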

Ready? Your code goes below.

In [None]:
num_episodes = 100000
gamma = 0.95

q_value_greedy_policy = QValue(gamma=gamma)

x_num_episodes = np.array([])
y_num_state_action_pairs = np.array([])

for num_episode in range(num_episodes):
    episode_history = []
    observation = wrapped_env.reset()
    while True:
        action = get_action_random(observation)
        next_observation, reward, done, _ = wrapped_env.step(action)
        episode_history.append({"observation": observation, "reward": reward, "action": action})
        observation = next_observation
        if done:
            break
    q_value_greedy_policy.update(episode_history)
    # Your code goes here. If the episode number is a multiple of 1000, append to x_num_episodes and y_num_state_action_pairs
wrapped_env.close()

# Run the cell below to plot how the number of unique state-action pairs grows with the number of episodes in exploration mode

- This should work if you collected the data properly in the last cell.

In [None]:
%matplotlib notebook

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x_num_episodes, y_num_state_action_pairs)
ax.set(xlabel="number of episodes", ylabel="number of unique state-action pairs seen",
       title="Growth of state-action pairs in exploration mode"
       )
fig.show()

# You should see an almost linear growth in the number of state-action pairs. The max number of state-action pairs seen (the data for the `100000`th episode) should be several hundred thousand.

If you got the same results, congrats! Now you see how we discover so many new state-action pairs as we explore the environment more. We won't be able to solve `CartPole-v0` unless the agent sees all the various state-action pairs in the MDP and understands how good or bad they are. Thus, exploration plays a crucial role in solving an RL problem, along with policy improvement (exploiting).
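
If you would like to check the "almost linear" observation numerically rather than by eye, one possible approach is to fit a straight line and look at how well it explains the data. The sketch below uses made-up placeholder arrays `x_check` and `y_check`, not results from the experiment; swap in `x_num_episodes` and `y_num_state_action_pairs` to try it on your own data.

```python
import numpy as np

# Placeholder data for illustration only (not real experiment output)
x_check = np.array([1000.0, 2000.0, 3000.0, 4000.0, 5000.0])
y_check = np.array([4100.0, 8050.0, 11900.0, 16100.0, 19950.0])

# Least-squares fit of a degree-1 polynomial: y ~ slope * x + intercept
slope, intercept = np.polyfit(x_check, y_check, deg=1)
fitted = slope * x_check + intercept

# Coefficient of determination: values close to 1 suggest a near-linear trend
ss_res = np.sum((y_check - fitted) ** 2)
ss_tot = np.sum((y_check - y_check.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

print(f"new pairs per episode (slope): {slope:.3f}")
print(f"R^2 of the linear fit: {r_squared:.4f}")
```

The slope can be read as the average number of new state-action pairs discovered per episode over the fitted range.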