# OpenAI Gym - CartPole-v0

#### This notebook applies reinforcement learning techniques to the classic CartPole problem, in which an agent learns to control a cart moving along a one-dimensional track so that the pole attached to it stays upright.

In [2]:
# Import OpenAI Gym and test the environment.

import gym
import numpy as np

env = gym.make('CartPole-v0')
env.reset()

# for episode in range(50):
#     observation = env.reset()
#     total_ep_reward = 0
#     for tstep in range(100):
#         env.render()
#         # print(observation)
#         # decide action to take (here - choose at random.)
#         action = env.action_space.sample()
#         # Move agent right.
#         #action = 1
#         # take action, receive reward and new state.
#         observation, reward, done, info = env.step(action)
#         total_ep_reward += reward
#         if done:
#             print("Finished after {} timesteps. Average reward per timestep: {}".format(tstep + 1, total_ep_reward / (tstep + 1)))
#             break
    
# env.close()

array([-0.03156187, -0.04518858,  0.01798824,  0.04523498])

## Observations Of The Environment

The `step` function returns four values describing the environment's response:

observation:object - A problem-specific representation of the environment's state. For the CartPole problem this consists of 4 values: "Cart Position", "Cart Velocity", "Pole Angle", and "Pole Velocity At Tip".
For other problems this can be pixel data from a camera, a board state, etc.

reward:float - The reward received for the previous action.

done:boolean - A flag which is true if a terminal state is reached and it is time to reset the environment.

info:dict - Diagnostic information used for debugging, such as the raw probabilities of state transitions.

## Observations for CartPole-v0

Four observations of the environment:

Type: Box(4)

|Index|Observation|Min|Max|
|---|---|---|---|
|0|Cart Position|-4.8|4.8|
|1|Cart Velocity |-Inf|Inf|
|2|Pole Angle| -24 deg | 24 deg|
|3|Pole Velocity At Tip|-Inf|Inf|

## Actions for CartPole-v0

Two actions: pushing the cart left or right.

Type: Discrete(2)

|Index|Action|
|---|---|
|0|Push cart left|
|1|Push cart right|
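
As a small illustration of the two tables above, the observation and action indices can be given names. This is only a sketch: the class `CartPoleObservation` and the constants `PUSH_LEFT` and `PUSH_RIGHT` are hypothetical helpers, not part of gym, and the sample values are the observation printed earlier in this notebook.

```python
from typing import NamedTuple


class CartPoleObservation(NamedTuple):
    """Named view of the 4-element observation (hypothetical helper, not a gym type)."""
    cart_position: float         # index 0
    cart_velocity: float         # index 1
    pole_angle: float            # index 2
    pole_velocity_at_tip: float  # index 3


# Action indices from the Discrete(2) table above.
PUSH_LEFT = 0
PUSH_RIGHT = 1

# The sample observation shown earlier in this notebook.
obs = CartPoleObservation(-0.03156187, -0.04518858, 0.01798824, 0.04523498)

print(obs.pole_angle)            # same value as obs[2]
print(obs[2] == obs.pole_angle)  # True: fields and indices line up
```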

## TODO list:

#### Create Discrete state representation:

- Each of the 4 state observations is continuous, so we need to bin its possible values into a finite number of bins. The Q table is then a 5D table: 4 dimensions for the binned state variables and 1 for the action.
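
A minimal sketch of the binning step for a single observation, assuming equally spaced bin edges between hand-picked bounds. The helper names `make_edges` and `to_bin_index` are hypothetical and the bounds are illustrative. For increasing edges, `bisect_right` from the standard library follows the same indexing convention as `np.digitize` with its default `right=False`.

```python
from bisect import bisect_right


def make_edges(lo, hi, num_edges):
    """Return num_edges equally spaced bin edges from lo to hi (inclusive)."""
    step = (hi - lo) / (num_edges - 1)
    edges = [lo + i * step for i in range(num_edges - 1)]
    edges.append(hi)  # pin the last edge to hi to avoid rounding drift
    return edges


def to_bin_index(value, edges):
    """Count how many edges are <= value; 0 means below the lowest edge."""
    return bisect_right(edges, value)


# Illustrative bounds for the cart position.
edges = make_edges(-2.4, 2.4, 10)

print(to_bin_index(-3.0, edges))  # 0: below the lowest edge
print(to_bin_index(0.0, edges))   # 5: five edges lie at or below 0.0
print(to_bin_index(3.0, edges))   # 10: at or above the highest edge
```

Note that 10 edges produce 11 possible indices (0 through 10), because values outside the chosen bounds get their own indices at each end.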

#### Maintain/Plot Episode metrics

- For each episode we will record the total reward received.
- Every kth episode we will record the min, max, and average reward over the last k episodes.
- Once the episodes are done we will plot these rewards.
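
The bookkeeping described above could be sketched as follows. The function name `summarize_rewards` and the reward values are made up for illustration. The resulting rows could then be passed to any plotting library.

```python
def summarize_rewards(episode_rewards, k):
    """Aggregate per-episode totals into min/max/average over each block of k episodes."""
    summaries = []
    for start in range(0, len(episode_rewards) - k + 1, k):
        window = episode_rewards[start:start + k]
        summaries.append({
            "episode": start + k,  # number of episodes completed at this point
            "min": min(window),
            "max": max(window),
            "avg": sum(window) / len(window),
        })
    return summaries


# Made-up total rewards for six episodes.
rewards = [12.0, 30.0, 18.0, 25.0, 9.0, 41.0]

for row in summarize_rewards(rewards, k=3):
    print(row)
```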

#### Implement various learning algorithms:

- Q-learning.
- SARSA.
- Expected SARSA.
- Double Q-learning.
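
The four algorithms listed above differ mainly in the target value they bootstrap from. Below is a minimal sketch of the standard tabular targets, assuming an epsilon-greedy policy for Expected SARSA and two separate value tables for Double Q-learning. The function names and the `gamma`, `alpha`, and `epsilon` parameters are illustrative and do not come from this notebook. For a terminal transition the target would simply be the reward.

```python
def q_learning_target(reward, next_q, gamma):
    # Bootstrap from the best action value in the next state.
    return reward + gamma * max(next_q)


def sarsa_target(reward, next_q, next_action, gamma):
    # Bootstrap from the action the policy actually takes next.
    return reward + gamma * next_q[next_action]


def expected_sarsa_target(reward, next_q, gamma, epsilon):
    # Bootstrap from the expected value under an epsilon-greedy policy.
    n = len(next_q)
    greedy = max(range(n), key=lambda a: next_q[a])
    expected = 0.0
    for a in range(n):
        prob = epsilon / n + ((1.0 - epsilon) if a == greedy else 0.0)
        expected += prob * next_q[a]
    return reward + gamma * expected


def double_q_target(reward, next_q_a, next_q_b, gamma):
    # Choose the action with table A, evaluate it with table B.
    best = max(range(len(next_q_a)), key=lambda a: next_q_a[a])
    return reward + gamma * next_q_b[best]


def td_update(old_value, target, alpha):
    # Move the current estimate a fraction alpha toward the target.
    return old_value + alpha * (target - old_value)


next_q = [2.0, 4.0]  # made-up action values for the next state
print(q_learning_target(1.0, next_q, gamma=0.5))
print(sarsa_target(1.0, next_q, next_action=0, gamma=0.5))
print(expected_sarsa_target(1.0, next_q, gamma=0.5, epsilon=0.5))
print(double_q_target(1.0, next_q, [3.0, 1.0], gamma=0.5))
```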

In [6]:
"""
Because the cart and pole velocities are unbounded, we use a wrapper class
that extends gym's ObservationWrapper class to bin our observation space
into discrete values.
"""

class DiscreteObservationSpaceWrapper(gym.ObservationWrapper):

    def __init__(self, env, num_bins, low=None, high=None):
        super().__init__(env)
        assert isinstance(env.observation_space, gym.spaces.Box)
        # Explicit finite bounds are required because the velocity limits are infinite.
        if low is None or high is None:
            raise ValueError("low and high bounds must be provided")

        self.num_bins = num_bins
        # For each observation, use linspace to create num_bins equally spaced bin edges.
        self.value_bins = [np.linspace(lo, hi, num_bins) for lo, hi in zip(low, high)]
        # The nominal number of states, stored as a Discrete space on the attribute `n`
        # of the underlying Box space (observation_space itself remains a Box).
        self.observation_space.n = gym.spaces.Discrete(num_bins ** len(low))

    def observation(self, observation):
        # Determine the bin index for each state value.
        # np.digitize returns 0 below the first edge and num_bins at or above the last edge.
        return [np.digitize([state_value], bins)[0] for state_value, bins in zip(observation, self.value_bins)]
        

In [7]:
env = DiscreteObservationSpaceWrapper(
    env,
    num_bins=10,
    low=[-2.4, -2.0, -0.42, -3.5],
    high=[2.4, 2.0, 0.42, 3.5])

In [10]:
print(env.observation([2.4, 2, 0.42, 3.5]))
print(env.observation_space.n)
print(env.observation([-2.4, -2, -0.2, -3.5]))
print(env.observation_space.n)
print(env.action_space.n)

[10, 10, 10, 10]
Discrete(10000)
[1, 1, 3, 1]
Discrete(10000)
2
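
The outputs above show that a value on the upper bound maps to index `num_bins`, and a value below the lower bound would map to 0, so each dimension can produce `num_bins + 1` distinct indices rather than `num_bins`. A minimal sketch of one way to collapse the four indices into a single integer state, assuming the indices are first clipped into `0 .. num_bins - 1` (which merges the top overflow index into the last bin). The helper `flatten_state` is hypothetical and not part of the notebook.

```python
def flatten_state(bin_indices, num_bins):
    """Encode per-dimension bin indices as one integer, written in base num_bins."""
    state = 0
    for index in bin_indices:
        # Clip so that every digit lies in 0 .. num_bins - 1.
        clipped = min(max(index, 0), num_bins - 1)
        state = state * num_bins + clipped
    return state


print(flatten_state([10, 10, 10, 10], num_bins=10))  # 9999
print(flatten_state([1, 1, 3, 1], num_bins=10))      # 1131
print(flatten_state([0, 0, 0, 0], num_bins=10))      # 0
```

With this encoding every state falls in `0 .. num_bins ** 4 - 1`, which matches the size of `Discrete(10000)` for `num_bins=10`.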


In [46]:
# Create Q Table
Q = {}
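
A dictionary-based Q table could be keyed by the discretized state, as in the sketch below. It assumes the state is the list of four bin indices returned by the wrapper's `observation` method; a list is not hashable, so it is converted to a tuple first. The `defaultdict` initialization and the update value are illustrative choices, not the notebook's implementation.

```python
from collections import defaultdict

NUM_ACTIONS = 2  # CartPole-v0: push left (0) and push right (1)

# Each unseen state starts with a zero value for every action.
Q = defaultdict(lambda: [0.0] * NUM_ACTIONS)

state = tuple([1, 1, 3, 1])  # lists are unhashable, so use a tuple as the key
Q[state][1] += 0.5           # placeholder update to the "push right" value

# Greedy action: the index of the largest action value for this state.
greedy_action = max(range(NUM_ACTIONS), key=lambda a: Q[state][a])
print(greedy_action)  # 1
print(len(Q))         # 1: only visited states take up memory
```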

In [52]:
print(env.value_bins)


[array([-2.4 , -1.92, -1.44, -0.96, -0.48,  0.  ,  0.48,  0.96,  1.44,
        1.92,  2.4 ]), array([-2. , -1.6, -1.2, -0.8, -0.4,  0. ,  0.4,  0.8,  1.2,  1.6,  2. ]), array([-4.20000000e-01, -3.36000000e-01, -2.52000000e-01, -1.68000000e-01,
       -8.40000000e-02, -5.55111512e-17,  8.40000000e-02,  1.68000000e-01,
        2.52000000e-01,  3.36000000e-01,  4.20000000e-01]), array([-3.5, -2.8, -2.1, -1.4, -0.7,  0. ,  0.7,  1.4,  2.1,  2.8,  3.5])]


In [53]:
print(env.obs_space)

Discrete(10000)
