# Implement the sampling function for a new policy called "epsilon pole direction policy"

In the last lesson, we implemented the sampling functions for the **random policy** and the **"pole direction policy"** for the `CartPole-v0` environment. Here's how they looked.

In [1]:
import random

def get_action_random_policy(observation):
    if random.random() < 0.5:
        return 0
    return 1

In [2]:
def get_action_pole_direction_policy(observation):
    if observation[2] > 0:
        return 1
    return 0
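To make the pole direction policy concrete, here is a small check on two hand-made observations. It assumes the usual CartPole observation layout (cart position, cart velocity, pole angle, pole angular velocity), so that `observation[2]` is the pole angle. The names `leaning_right` and `leaning_left` are illustrative and not part of the lesson's code.

```python
def get_action_pole_direction_policy(observation):
    if observation[2] > 0:
        return 1
    return 0

# Hand-made observations (assumed layout):
# [cart position, cart velocity, pole angle, pole angular velocity]
leaning_right = [0.0, 0.0, 0.05, 0.0]   # positive pole angle
leaning_left = [0.0, 0.0, -0.05, 0.0]   # negative pole angle

action_right = get_action_pole_direction_policy(leaning_right)
action_left = get_action_pole_direction_policy(leaning_left)
print(action_right, action_left)  # 1 0
```

The policy looks only at the sign of the pole angle and ignores the other three entries of the observation.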

We also wrote a function that computes the average total rewards per episode for any policy. Such a function helps us compare different policies. That function `get_average_total_rewards_per_episode()` is given below.

**Study this function carefully.** We are going to need this function frequently in this course.

In [3]:
import gym

def get_average_total_rewards_per_episode(policy_sampling_function, num_episodes):
    env = gym.make("CartPole-v0")
    total_rewards = 0  # sum of rewards over all episodes
    for num_episode in range(num_episodes):
        observation = env.reset()
        while True:
            # Render only the first episode
            if num_episode == 0:
                env.render()
            action = policy_sampling_function(observation)
            observation, reward, done, _ = env.step(action)
            total_rewards += reward
            if done:
                break
    env.close()
    # Average of the per-episode totals
    return total_rewards / num_episodes
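To see what the returned number means without running Gym, the same loop can be sketched against a stand-in environment. Everything here is hypothetical and only for illustration: `FixedLengthEnv` is a made-up class whose episodes always last a fixed number of steps with a reward of 1.0 per step, and `average_total_rewards` is a variant of the function above that takes the environment as an argument.

```python
class FixedLengthEnv:
    """Hypothetical stand-in for a Gym environment: every episode lasts
    exactly `episode_length` steps and each step gives a reward of 1.0."""

    def __init__(self, episode_length):
        self.episode_length = episode_length
        self.steps_taken = 0

    def reset(self):
        self.steps_taken = 0
        return [0.0, 0.0, 0.0, 0.0]

    def step(self, action):
        self.steps_taken += 1
        done = self.steps_taken >= self.episode_length
        return [0.0, 0.0, 0.0, 0.0], 1.0, done, {}


def average_total_rewards(env, policy_sampling_function, num_episodes):
    total_rewards = 0
    for _ in range(num_episodes):
        observation = env.reset()
        done = False
        while not done:
            action = policy_sampling_function(observation)
            observation, reward, done, _ = env.step(action)
            total_rewards += reward
    return total_rewards / num_episodes


# 3 episodes of 5 steps each, reward 1.0 per step: 15.0 in total
result = average_total_rewards(FixedLengthEnv(5), lambda observation: 0, 3)
print(result)  # 5.0
```

Because each step is worth the same reward in this sketch, the average total rewards per episode is simply the average episode length.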

We then computed the average total rewards for the random policy and the "pole direction policy". 

In [4]:
# for random policy
get_average_total_rewards_per_episode(get_action_random_policy, 1000)

22.436

In [5]:
# for pole direction policy
get_average_total_rewards_per_episode(get_action_pole_direction_policy, 1000)

41.723

Clearly, the pole direction policy is better than the random policy at getting rewards.

| Policy | Average total rewards per episode |
| --- | --- |
| Random policy | ~ 20 |
| Pole direction policy | ~ 40 |

Now here's a question: what if these two policies married and had a baby? How well would the baby do?

In this exercise, you are going to implement the sampling function for a baby policy that sits *in-between* these two parent policies. This in-between policy is called the "epsilon pole direction policy".

## The epsilon pole direction policy

The epsilon pole direction policy is defined as follows.

- With probability $\epsilon$, the agent takes a random action (i.e. follows the random policy).
- With probability $1 - \epsilon$, the agent moves in the direction of the pole (i.e. follows the "pole direction policy").

Your job is to implement the sampling function for the epsilon pole direction policy, when $\epsilon = 0.9$.
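To see what this mixture does to the action probabilities, here is a short worked calculation. It assumes the random policy picks each action with probability $0.5$ (as in `get_action_random_policy()`) and considers an observation where the pole direction policy would choose action 1:

$$
P(\text{action} = 1) = \epsilon \cdot \tfrac{1}{2} + (1 - \epsilon) \cdot 1 = 1 - \tfrac{\epsilon}{2}
$$

With $\epsilon = 0.9$, this gives $P(\text{action} = 1) = 0.55$ and $P(\text{action} = 0) = 0.45$. In other words, the agent pushes in the direction of the pole only slightly more often than a coin flip would.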

Ready? Here we go!

In [None]:
# Implement the sampling function for the "epsilon pole direction policy", with epsilon = 0.9
def get_action_epsilon_pole_direction_policy(observation):
    # Your code goes here
    pass  # placeholder so the cell parses; replace with your implementation

Once you have implemented `get_action_epsilon_pole_direction_policy()`, it's time to find out the average total rewards per episode that this *in-between* policy gets.

In [None]:
# Fill in the blank with the correct sampling function to get the average total rewards per episode for the "epsilon pole direction policy"
get_average_total_rewards_per_episode(____, 1000)

## Based on what you found, rank the three policies in the following table by filling in the blanks (____). 

- The policy with the lowest average total rewards (worst policy) goes in the first row. 
- The policy with the highest average total rewards (best policy) goes in the last row.

| Policy | Average total rewards per episode |
| --- | --- |
| ____ | ____ |
| ____ | ____ |
| ____ | ____ |

If you did everything right, you will find that the *in-between* policy (epsilon pole direction policy) is also *in-between* when it comes to average total rewards per episode.

**Keep this in mind.** This fact is going to play an important role in a future lesson.