# Back to the goal in model-free RL

## Irrespective of the details of the MDP, find the *policy* that maximizes the discounted reward sum (the value) for every state in the MDP

# Policies we have seen

- Random
- Pole direction policy

In [3]:
import random


def get_action_random(observation):
    """Sampling function for random policy
    """
    if random.random() < 0.5:
        return 0
    return 1


def get_action_pole_direction_policy(observation):
    """Sampling function for the pole direction policy:
    push the cart toward the side the pole is leaning (observation[2] is the pole angle)
    """
    if observation[2] > 0:
        return 1
    return 0

# Intuitively, the pole direction policy seems better than a random policy. But is it really so?

- Need to have a precise definition for how to compare different policies (**ordering** of policies).

# Policy $\pi_{1}$ is *better* than $\pi_{2}$ if <br><br>
<center>$\Large v_{\pi_{1}}(s) \geq v_{\pi_{2}}(s), \; \forall s$</center>
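
Once value estimates exist for a common set of states, this ordering can be checked mechanically. The sketch below is only an illustration: the helper name `is_at_least_as_good` and the three toy value tables are hypothetical and do not come from the notebook.

```python
def is_at_least_as_good(values_a, values_b):
    """Return True if policy A's value is >= policy B's value at every state.

    Both arguments map a state (any hashable key) to an estimated value.
    Assumes both tables cover exactly the same states.
    """
    if values_a.keys() != values_b.keys():
        raise ValueError("both policies must be evaluated on the same states")
    return all(values_a[s] >= values_b[s] for s in values_a)


# Hypothetical toy value tables for three states
v_pi1 = {"s0": 3.0, "s1": 2.0, "s2": 1.5}
v_pi2 = {"s0": 2.5, "s1": 2.0, "s2": 1.0}
v_pi3 = {"s0": 4.0, "s1": 1.0, "s2": 1.0}

print(is_at_least_as_good(v_pi1, v_pi2))  # True
print(is_at_least_as_good(v_pi1, v_pi3))  # False
print(is_at_least_as_good(v_pi3, v_pi1))  # False
```

The last two calls show that the ordering is only partial: two policies can each win at some state, in which case neither is better than the other.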

# `CartPole-v0` has a huge number of states

- Several hundred thousand distinct states are visited in `10000` episodes, so the comparison below uses a handful of chosen initial states.
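
The notebook does not show how the visited states were counted. One possible way, sketched here as an assumption, is to round each observation and collect the results in a set. The helper `count_distinct_states`, the `decimals` parameter, and the sample observations are all hypothetical.

```python
def count_distinct_states(observations, decimals=2):
    """Count distinct observations after rounding each component.

    `observations` is any iterable of 4-element sequences. Rounding is an
    assumption made here so that nearly identical continuous states
    are treated as the same state.
    """
    seen = set()
    for obs in observations:
        seen.add(tuple(round(float(x), decimals) for x in obs))
    return len(seen)


# Hypothetical observations; the first two agree after rounding to 2 decimals
sample = [
    [0.0, 0.01, 0.15, 0.0],
    [0.0, 0.012, 0.151, 0.0],
    [0.0, 0.0, 0.0, 2.0],
]

print(count_distinct_states(sample, decimals=2))  # 2
print(count_distinct_states(sample, decimals=3))  # 3
```

The count depends on the rounding precision, so any such figure is only meaningful together with the discretization used.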

In [7]:
import numpy as np

default_state = np.array([0., 0., 0., 0.])
pole_right_state = np.array([0., 0.01, 0.15, 0.])
pole_moving_state = np.array([0., 0., 0., 2.])
cart_right_state = np.array([2.4, 0., 0., 0.])
cart_moving_state = np.array([0., 10., 0., 0.])

In [8]:
import gym


class InitMod(gym.Wrapper):
    """Wrapper class to change the initial state in CartPole-v0
    """
    def __init__(self, env, initial_state):
        super().__init__(env)
        self.initial_state = initial_state
        
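    def reset(self):
        # Let the wrapped env do its usual reset bookkeeping, then overwrite the state
        self.env.reset()
        # Copy so that callers can never mutate the stored initial state
        self.unwrapped.state = self.initial_state.copy()
        return self.unwrapped.state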

    
default_init_cartpole_env = InitMod(env=gym.make("CartPole-v0"), initial_state=default_state)
pole_right_init_cartpole_env = InitMod(env=gym.make("CartPole-v0"), initial_state=pole_right_state)
pole_moving_init_cartpole_env = InitMod(env=gym.make("CartPole-v0"), initial_state=pole_moving_state)
cart_right_init_cartpole_env = InitMod(env=gym.make("CartPole-v0"), initial_state=cart_right_state)
cart_moving_init_cartpole_env = InitMod(env=gym.make("CartPole-v0"), initial_state=cart_moving_state)

In [9]:
wrapped_envs = [default_init_cartpole_env, 
                pole_right_init_cartpole_env, 
                pole_moving_init_cartpole_env, 
                cart_right_init_cartpole_env,
                cart_moving_init_cartpole_env
                ]

# Strategy for comparison

- Set initial state using the wrapper.
- Compute the average discounted reward sum obtained in an episode starting from that initial state (the value of the initial state).
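
The quantity being averaged can be written out on its own. The sketch below works on plain reward lists instead of a live environment. The helper names `discounted_return` and `average_discounted_return` and the two example episodes are hypothetical.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over one episode's reward sequence."""
    return sum(reward * gamma ** t for t, reward in enumerate(rewards))


def average_discounted_return(episodes, gamma):
    """Monte Carlo estimate: mean discounted return over several episodes."""
    returns = [discounted_return(rewards, gamma) for rewards in episodes]
    return sum(returns) / len(returns)


# Hypothetical episodes of 10 and 20 steps; CartPole pays reward 1.0 per step
episodes = [[1.0] * 10, [1.0] * 20]
gamma = 0.95
estimate = average_discounted_return(episodes, gamma)

# With constant reward 1, an episode of T steps returns (1 - gamma**T) / (1 - gamma)
closed_form = ((1 - gamma ** 10) / (1 - gamma) + (1 - gamma ** 20) / (1 - gamma)) / 2
print(abs(estimate - closed_form) < 1e-9)  # True
```

Since the reward is 1 per step, the discounted sum can never exceed $1 / (1 - \gamma)$, which is 20 for $\gamma = 0.95$. That bound gives a scale for reading the estimated values.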

In [15]:
def get_value_for_initial_state(envs, policy_sampling_function, num_episodes, gamma):
    """Estimate and print the value of each wrapped env's initial state.

    The value is the discounted reward sum averaged over `num_episodes` episodes.
    """
    for env in envs:
        drs = 0  # discounted reward sum, accumulated over all episodes
        for num_episode in range(num_episodes):
            observation = env.reset()
            step_count = 0
            while True:
                # Render only the first episode
                if num_episode == 0:
                    env.render()
                action = policy_sampling_function(observation)
                observation, reward, done, _ = env.step(action)
                drs += reward * gamma ** step_count
                step_count += 1
                if done:
                    break
        env.close()
        print(f"Value for the initial state {env.initial_state} is {drs / num_episodes}")

# For the random policy

In [16]:
get_value_for_initial_state(wrapped_envs, get_action_random, 10000, 0.95)

Value for the initial state [0. 0. 0. 0.] is 12.790677021304347
Value for the initial state [0.   0.01 0.15 0.  ] is 9.977442194273785
Value for the initial state [0. 0. 0. 2.] is 5.243647256552028
Value for the initial state [2.4 0.  0.  0. ] is 5.751802573526081
Value for the initial state [ 0. 10.  0.  0.] is 9.348804032817483


# For the pole direction policy

In [17]:
get_value_for_initial_state(wrapped_envs, get_action_pole_direction_policy, 10000, 0.95)

Value for the initial state [0. 0. 0. 0.] is 18.868767452972946
Value for the initial state [0.   0.01 0.15 0.  ] is 10.734175396802499
Value for the initial state [0. 0. 0. 2.] is 5.298162187497666
Value for the initial state [2.4 0.  0.  0. ] is 6.7315913742164515
Value for the initial state [ 0. 10.  0.  0.] is 9.192798246744898


| State $s$ | $v_{\textrm{random}}(s)$ | $v_{\textrm{pole-direction}}(s)$ |
| --- | --- | --- |
| `[0., 0., 0., 0.]` | 12.79 | 18.87 |
| `[0., 0.01, 0.15, 0.]` | 9.98 | 10.73 |
| `[0., 0., 0., 2.]` | 5.24 | 5.30 |
| `[2.4, 0., 0., 0.]` | 5.75 | 6.73 |
| `[0., 10., 0., 0.]` | 9.35 | 9.19 |

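# The pole direction policy is not better than the naive random policy, and the random policy is not better than the pole direction policy either.

- The pole direction policy has the higher estimated value at four of the five initial states, but the random policy has the higher one at `[0., 10., 0., 0.]` (9.35 vs 9.19), so neither policy satisfies the ordering condition for all states.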