# Goal of model-free RL: irrespective of the details of the MDP, find (or learn) the policy that maximizes the total reward per episode
- Since an episode starts at the initial state and ends at a terminal state, the total reward in an episode depends on the *initial state*.
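
To make the dependence on the initial state concrete, here is a minimal sketch that runs without `gym`. `ToyChainEnv`, `episode_return` and `always_right` are hypothetical names introduced only for illustration. The toy environment is a deterministic walk along a chain with a reward of -1 per step, and it follows the same `reset()`/`step()` convention used in the rest of this notebook.

```python
class ToyChainEnv:
    """Hypothetical toy environment: walk right along a chain until the goal."""

    def __init__(self, start, goal=5):
        self.start = start
        self.goal = goal
        self.state = start

    def reset(self):
        self.state = self.start
        return self.state

    def step(self, action):
        # action 1 moves one position to the right, action 0 stays put;
        # every step costs a reward of -1
        self.state = min(self.state + action, self.goal)
        done = self.state == self.goal
        return self.state, -1.0, done, {}


def episode_return(env, policy, max_steps=100):
    """Sum of the rewards collected in one episode under `policy`."""
    observation = env.reset()
    total = 0.0
    for _ in range(max_steps):
        observation, reward, done, _ = env.step(policy(observation))
        total += reward
        if done:
            break
    return total


def always_right(observation):
    return 1


return_from_0 = episode_return(ToyChainEnv(start=0), always_right)
return_from_3 = episode_return(ToyChainEnv(start=3), always_right)
print(return_from_0, return_from_3)
```

The policy is identical in both runs; only the starting position differs, and so does the return.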

# `gym` makes a choice for the initialization of the `CartPole-v0` environment, but this is not the only choice

In [1]:
import gym

In [2]:
env = gym.make("CartPole-v0")
observation = env.reset()
print(observation)
env.render()

[-0.03272082  0.00775598  0.04386464  0.03129368]


True

In [3]:
env.close()

# If we made a different choice of initialization in `CartPole-v0`, how would that affect the total rewards?

# Can we at all change the default initialization?
- Yes! `gym` offers a way to modify details of any environment

# What is a `gym` environment?
- Must have the following attributes
    - `observation_space`
    - `action_space`
    - ...
- Must have certain methods
    - `reset()`
    - `step()`
    - `render()`
    - `close()`
    - ...
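
As a rough illustration of the interface above, the following sketch checks an object for the members named in the two lists. It covers only those names (the lists are not exhaustive), and `missing_env_members` and `IncompleteEnv` are hypothetical helpers, not part of `gym`.

```python
REQUIRED_ATTRIBUTES = ("observation_space", "action_space")
REQUIRED_METHODS = ("reset", "step", "render", "close")


def missing_env_members(obj):
    """Return the names from the two tuples above that `obj` lacks."""
    missing = [name for name in REQUIRED_ATTRIBUTES if not hasattr(obj, name)]
    missing += [name for name in REQUIRED_METHODS
                if not callable(getattr(obj, name, None))]
    return missing


class IncompleteEnv:
    """Hypothetical class that implements only part of the interface."""
    observation_space = None
    action_space = None

    def reset(self):
        return 0

    def step(self, action):
        return 0, 0.0, True, {}


print(missing_env_members(IncompleteEnv()))
```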

# In `gym`, the way to modify an existing environment is to create a *wrapped environment* from the existing environment

In [4]:
class SimplestWrappedEnv():    # SimplestWrappedEnv is the wrapper; SimplestWrappedEnv(env) is a wrapped environment
    # a wrapped environment must hold a reference to the existing environment
    def __init__(self, env):    # env is the existing environment
        self.env = env
        self.observation_space = self.env.observation_space
        self.action_space = self.env.action_space
        
    def reset(self):
        return self.env.reset()
    
    def step(self, action):
        return self.env.step(action)
    
    def render(self):
        return self.env.render()
    
    def close(self):
        return self.env.close()

# `gym.Wrapper`

In [5]:
class MyWrappedEnv(gym.Wrapper):    # When we inherit from gym.Wrapper, the wrapped environment exposes all attributes and methods of the existing environment automatically
    pass
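
One way to picture how an empty subclass can still expose everything is attribute delegation. The sketch below is an assumption-laden stand-in, not `gym`'s actual implementation: `ForwardingWrapper`, `ToyEnv` and `TripleReward` are hypothetical names, and the forwarding is done with Python's `__getattr__` hook.

```python
class ForwardingWrapper:
    """Hypothetical stand-in for the delegation idea behind a wrapper base class."""

    def __init__(self, env):
        self.env = env

    def __getattr__(self, name):
        # Called only when normal lookup fails, so anything defined on the
        # wrapper (or a subclass) takes priority over the inner environment.
        if name == "env":
            # Avoid infinite recursion if `env` has not been set yet.
            raise AttributeError(name)
        return getattr(self.env, name)


class ToyEnv:
    """Hypothetical inner environment."""
    action_space = "Discrete(2)"

    def step(self, action):
        return 0, 1.0, False, {}


class TripleReward(ForwardingWrapper):
    def step(self, action):
        observation, reward, done, info = self.env.step(action)
        return observation, 3 * reward, done, info


wrapped = TripleReward(ToyEnv())
print(wrapped.action_space)   # not defined on the wrapper, so forwarded to ToyEnv
print(wrapped.step(0)[1])     # defined on TripleReward, so the override is used
```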

In [7]:
cartpole_v0_env = gym.make("CartPole-v0")
my_wrapped_env = MyWrappedEnv(env=cartpole_v0_env)    # wrapped environment

In [8]:
my_wrapped_env.env == cartpole_v0_env

True

In [9]:
my_wrapped_env.observation_space

Box(-3.4028234663852886e+38, 3.4028234663852886e+38, (4,), float32)

In [10]:
my_wrapped_env.action_space

Discrete(2)

In [11]:
observation = my_wrapped_env.reset()
print(observation)
my_wrapped_env.render()

[ 0.01328838 -0.02252086  0.00251714  0.04043948]


True

In [12]:
observation, reward, done, _ = my_wrapped_env.step(0)

In [13]:
my_wrapped_env.close()

# A wrapper that modifies the reward function

In [14]:
class DoubleReward(gym.Wrapper):
    def step(self, action):
        observation, reward, done, info = self.env.step(action)
        doubled_reward = reward * 2
        return observation, doubled_reward, done, info

In [15]:
double_reward_cartpole_env = DoubleReward(env=cartpole_v0_env)
observation = double_reward_cartpole_env.reset()
observation, reward, done, _ = double_reward_cartpole_env.step(0)
print(reward)

2.0


In [16]:
double_reward_cartpole_env.close()

# A big advantage of wrappers is that they can be applied to any `gym` environment
- You can use the `DoubleReward` wrapper on both `CartPole-v0` and `MountainCar-v0`.
- The modification is written only once. If we instead overrode methods such as `step()` in each environment individually, we would have to repeat it for every environment.

In [17]:
double_reward_mountaincar_env = DoubleReward(env=gym.make("MountainCar-v0"))

# A more general version of a reward scaling wrapper

In [18]:
class RewardScaling(gym.Wrapper):    # Always inherit from gym.Wrapper
    def __init__(self, env, scaling_factor):
        super().__init__(env)     # super important to call super().__init__(env) if we are overriding the init function
        self.scaling_factor = scaling_factor
        
    def step(self, action):
        observation, reward, done, info = self.env.step(action)    # super important to call self.env.step()
        reward *= self.scaling_factor
        return observation, reward, done, info

In [19]:
double_reward_cartpole_env = RewardScaling(env=cartpole_v0_env, scaling_factor=2)

# A wrapped environment can be further wrapped

In [20]:
original_reward_cartpole_env = RewardScaling(env=double_reward_cartpole_env, scaling_factor=0.5)
observation = original_reward_cartpole_env.reset()
observation, reward, done, _ = original_reward_cartpole_env.step(1)
print(reward)

1.0


In [21]:
original_reward_cartpole_env.close()
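
Because every wrapped environment keeps a reference to the environment it wraps, a stack of wrappers can be inspected by following those references. A minimal sketch, assuming each layer stores the inner environment in an attribute called `env` and the base environment has no such attribute; `wrapper_chain` and the `Toy*` classes are hypothetical and stand in for real `gym` objects.

```python
def wrapper_chain(env):
    """Class names from the outermost wrapper down to the base environment."""
    names = []
    while True:
        names.append(type(env).__name__)
        if not hasattr(env, "env"):
            return names          # reached the base environment
        env = env.env             # step one layer inwards


class ToyBaseEnv:
    """Hypothetical base environment (no `env` attribute)."""


class ToyWrapper:
    """Hypothetical wrapper that only stores the inner environment."""

    def __init__(self, env):
        self.env = env


class ToyTimeLimit(ToyWrapper):
    pass


class ToyRewardScaling(ToyWrapper):
    pass


stacked = ToyRewardScaling(ToyRewardScaling(ToyTimeLimit(ToyBaseEnv())))
print(wrapper_chain(stacked))
```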

# `CartPole-v0` is itself a wrapped environment!
- The base environment is `gym.envs.classic_control.cartpole.CartPoleEnv` (this is what actually defines the dynamics).
    - It has everything in `CartPole-v0` except for the terminal state at 200 time steps.
- The `TimeLimit` wrapper (`gym.wrappers.time_limit.TimeLimit`) is applied to the base environment.
    - This introduces the terminal state at 200 time steps.

`CartPole-v0 = TimeLimit(env=CartPoleEnv(), max_episode_steps=200)`
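
The idea of a step limit can be sketched in a few lines. This is a simplified, hypothetical analogue and not the source of `gym.wrappers.time_limit.TimeLimit`; `StepLimit` and `NeverEndingEnv` are names made up for this example.

```python
class StepLimit:
    """Hypothetical, simplified analogue of a time-limit wrapper."""

    def __init__(self, env, max_episode_steps):
        self.env = env
        self.max_episode_steps = max_episode_steps
        self.elapsed_steps = 0

    def reset(self):
        self.elapsed_steps = 0    # a new episode starts with a fresh step budget
        return self.env.reset()

    def step(self, action):
        observation, reward, done, info = self.env.step(action)
        self.elapsed_steps += 1
        if self.elapsed_steps >= self.max_episode_steps:
            done = True           # force termination once the budget is used up
        return observation, reward, done, info


class NeverEndingEnv:
    """Hypothetical base environment that never terminates on its own."""

    def reset(self):
        return 0

    def step(self, action):
        return 0, 1.0, False, {}


env = StepLimit(NeverEndingEnv(), max_episode_steps=200)
env.reset()
total_reward, done = 0.0, False
while not done:
    _, reward, done, _ = env.step(0)
    total_reward += reward
print(total_reward)
```

With a reward of 1 per step and a limit of 200 steps, the episode return is capped at 200 even though the base environment never signals `done`.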

In [22]:
print(cartpole_v0_env)

<TimeLimit<CartPoleEnv<CartPole-v0>>>


In [23]:
print(original_reward_cartpole_env)

<RewardScaling<RewardScaling<TimeLimit<CartPoleEnv<CartPole-v0>>>>>


# Modifying the initialization of the `CartPole-v0` env

In [24]:
class InitMod(gym.Wrapper):
    def __init__(self, env, initial_state):
        super().__init__(env)
        self.initial_state = initial_state
        
    def reset(self):
        self.env.reset()    # still reset the inner environments, e.g. so that TimeLimit restarts its step count
        self.unwrapped.state = self.initial_state    # overwrite the state of the base environment
        return self.unwrapped.state

# Initialization choice: pole to the right 

1. Cart position = 0
2. Cart velocity = 0.01
3. Pole angle = 0.15
4. Pole tip velocity = 0
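
Before using this state, it is worth checking that it is not already terminal. The sketch below assumes the termination thresholds of the classic `CartPoleEnv` (cart position beyond 2.4, pole angle beyond 12 degrees, with the angle stored in radians); these constants are copied in by hand as an assumption, and `is_terminal` is a hypothetical helper, not a `gym` function.

```python
import math

# Assumed thresholds of the classic CartPoleEnv (not read from gym here)
X_THRESHOLD = 2.4
THETA_THRESHOLD_RADIANS = 12 * 2 * math.pi / 360    # 12 degrees in radians

# (cart position, cart velocity, pole angle, pole tip velocity)
initial_state = (0.0, 0.01, 0.15, 0.0)


def is_terminal(state):
    """True if the cart or the pole is outside the assumed limits."""
    x, _, theta, _ = state
    return abs(x) > X_THRESHOLD or abs(theta) > THETA_THRESHOLD_RADIANS


print(round(THETA_THRESHOLD_RADIANS, 4))
print(is_terminal(initial_state))
```

Under these assumptions an angle of 0.15 lies inside the limit, so the episode starts with the pole tilted to the right but still recoverable.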

In [25]:
import numpy as np
pole_right_init_cartpole_env = InitMod(env=cartpole_v0_env, initial_state=np.array([0, 0.01, 0.15, 0]))

# Verify that the wrapper does what it is supposed to do

In [26]:
observation = pole_right_init_cartpole_env.reset()
print(observation)
pole_right_init_cartpole_env.render()

[0.   0.01 0.15 0.  ]


True

In [None]:
pole_right_init_cartpole_env.close()