# Q-Learning: Continuous Example with CartPole

This notebook implements the Q-Learning algorithm for the [CartPole](https://gym.openai.com/envs/CartPole-v1/) game.
See `../ReinforcementLearning_Guide.md` for theory and intuition.

According to the OpenAI environment page of CartPole: "A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by applying a force to the left (0) or right (1) of the cart. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every timestep that the pole remains upright (i.e., does not fall). The episode ends when the pole is more than 12 degrees from vertical, or the cart moves more than 2.4 units from the center."

Note that I adapted the description above, since the version on the web page is outdated.

The up-to-date description is in the docstring of the environment, on its GitHub page.

Note the following:
- The center position is 0 and the range of possible cart positions is `[-2.4, 2.4]`, continuous
- The pole angle can vary in `[-12, 12] deg`
- The velocities (linear for the cart, angular for the pole) are unbounded
- An episode is done if
    1. The cart or the pole exceeds the limits above
    2. The step limit is reached (200 steps/actions for `CartPole-v0`, 500 for `CartPole-v1`)
- The environment counts as solved when a minimum average return is achieved over 100 consecutive episodes
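
The two failure limits can be checked directly on an observation. The following is a minimal sketch, assuming the observation order (cart position, cart velocity, pole angle, pole angular velocity) and that the pole angle is reported in radians, as in the environment source. The helper `pole_out_of_limits` and the constants are illustrative names, not part of gym.

```python
import math

X_LIMIT = 2.4                    # cart position limit, in track units
THETA_LIMIT = math.radians(12)   # pole angle limit: 12 degrees, in radians

def pole_out_of_limits(observation):
    """Hypothetical helper: True if the cart or the pole is outside the limits."""
    x, _, theta, _ = observation
    return abs(x) > X_LIMIT or abs(theta) > THETA_LIMIT

print(pole_out_of_limits([0.0, 0.0, 0.05, 0.0]))  # → False
print(pole_out_of_limits([0.0, 0.0, 0.25, 0.0]))  # → True
```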

We need to discretize the domains using bins.
The only difference in the Q-Learning process compared to a discrete environment (as in FrozenLake) is the mapping from the continuous domain to the discrete one.
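
As a rough illustration of that mapping, a single continuous value can be assigned to a bin with `np.linspace` and `np.digitize`. This is a sketch under assumptions: the choice of 3 bins and the helper name `to_bin` are illustrative, not taken from the notebook.

```python
import numpy as np

# Assumed example: 3 bins over the cart position range [-2.4, 2.4]
n_bins = 3
# Keep only the interior edges, so values beyond the range fall in the outer bins
edges = np.linspace(-2.4, 2.4, n_bins + 1)[1:-1]   # approximately [-0.8, 0.8]

def to_bin(value, edges):
    """Hypothetical helper: index of the bin that contains value."""
    return int(np.digitize(value, edges))

print([to_bin(x, edges) for x in (-2.0, 0.0, 1.5)])  # → [0, 1, 2]
```

Using only the interior edges means unbounded quantities, such as the velocities, still land in a valid bin: anything below the first edge maps to bin `0` and anything above the last edge maps to the last bin.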

Overview of sections:

1. Basic Setup of CartPole
2. Q Table Discretization

## 1. Basic Setup of CartPole

In [1]:
import time
import gym
import matplotlib.pyplot as plt


In [8]:
%matplotlib notebook

env = gym.make("CartPole-v1")
env.reset()

for step in range(100):
    env.render()
    action = env.action_space.sample()  # random action: 0 (left) or 1 (right)
    # Each step returns an observation with 4 values:
    # cart position, cart velocity, pole angle, pole angular velocity
    observation, reward, done, info = env.step(action)
    print(f'Observation: {observation}')
    time.sleep(0.02)
    if done:
        break
env.close()



Observation: [ 0.01955238 -0.22954053  0.02286752  0.26312542]
Observation: [ 0.01496157 -0.03475232  0.02813003 -0.02225803]
Observation: [ 0.01426652  0.15995516  0.02768487 -0.30593458]
Observation: [ 0.01746563  0.3546719   0.02156617 -0.5897594 ]
Observation: [ 0.02455907  0.1592547   0.00977099 -0.29036185]
Observation: [ 0.02774416 -0.0360052   0.00396375  0.00538665]
Observation: [ 0.02702405  0.15905967  0.00407148 -0.28604305]
Observation: [ 0.03020525  0.35412332 -0.00164938 -0.57743907]
Observation: [ 0.03728772  0.15902454 -0.01319816 -0.2852762 ]
Observation: [ 0.0404682  -0.03590672 -0.01890368  0.00321507]
Observation: [ 0.03975007 -0.23075254 -0.01883938  0.28987423]
Observation: [ 0.03513502 -0.03536708 -0.0130419  -0.0086904 ]
Observation: [ 0.03442768  0.15993945 -0.01321571 -0.3054595 ]
Observation: [ 0.03762647 -0.0349917  -0.0193249  -0.01697361]
Observation: [ 0.03692663 -0.22983125 -0.01966437  0.26955   ]
Observation: [ 0.03233001 -0.42466715 -0.01427337  0.55

In [7]:
# We can also play it manually
from pyglet.window import key  # key constants (LEFT, RIGHT) for the pyglet window used by the viewer

action = 0
def key_press(k, mod):
    '''
    This function gets the key press for gym
    '''
    global action
    if k == key.LEFT:
        action = 0
    if k == key.RIGHT:
        action = 1

env.reset()
rewards = 0
for _ in range(1000):
    env.render()
    env.viewer.window.on_key_press = key_press  # update the key press
    observation, reward, done, info = env.step(action)  # get the reward and the done flag
    rewards+=1
    if done:
        print(f"You got {rewards} points!")
        break
    time.sleep(0.5)  # reduce speed a little bit (edit as needed on your computer)
env.close()



You got 8 points!


## 2. Q Table Discretization

In the FrozenLake environment, each observation mapped directly to a unique state. Now we have 4 continuous observation variables that together define one state. Recall that the Q-Learning table has the shape `state x action`.

To address that issue, we discretize each observation variable into bins and build all possible discrete states from the combinations of those observation bins. We can represent that in two forms:
- A multidimensional array with one dimension per observation variable plus one for the action. Each observation dimension has as many entries as bins and the action dimension has as many entries as possible actions; each cell holds the Q value of one combination of observation bins and action. Thus, 4 observation variables lead to a 5-dimensional array. Example: `4` observations, each one with `3` bins, and additionally `2` actions: `np.zeros((3,3,3,3,2))`.
- A two-dimensional `state x action` table. Every combination of observation bins is one state, so the structure above can be flattened to `3^4 = 81` states and `2` actions. However, the multidimensional array is more convenient to index.
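
Both forms can be sketched in a few lines. This is an illustration under assumptions, not the notebook's final implementation: the value ranges in `RANGES` (especially the velocity ranges, which are unbounded in the environment) are guesses, and `discretize` is a hypothetical helper name.

```python
import numpy as np

N_BINS = 3
N_ACTIONS = 2

# Assumed value ranges: cart position, cart velocity, pole angle (rad), pole angular velocity
RANGES = [(-2.4, 2.4), (-3.0, 3.0), (-0.21, 0.21), (-3.0, 3.0)]
# Interior bin edges for each observation variable
EDGES = [np.linspace(lo, hi, N_BINS + 1)[1:-1] for lo, hi in RANGES]

# Form 1: one dimension per observation variable, plus one for the action
q_table = np.zeros((N_BINS,) * 4 + (N_ACTIONS,))

def discretize(observation):
    """Hypothetical helper: map 4 continuous values to a tuple of 4 bin indices."""
    return tuple(int(np.digitize(v, e)) for v, e in zip(observation, EDGES))

state = discretize([0.02, -0.23, 0.02, 0.26])
q_table[state + (1,)] = 0.5                    # set Q(state, action=1)
best_action = int(np.argmax(q_table[state]))   # greedy action for this state

# Form 2: the same state as a single index into a (3^4, 2) table
flat_state = int(np.ravel_multi_index(state, (N_BINS,) * 4))

print(state, best_action, flat_state)  # → (1, 1, 1, 1) 1 40
```

Indexing with the tuple `state` selects the row of Q values for all actions in that state, which is why the multidimensional form needs no extra bookkeeping.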

In [14]:
import numpy as np

In [15]:
np.zeros((3,3,3,3,2))

array([[[[[0., 0.],
          [0., 0.],
          [0., 0.]],

         [[0., 0.],
          [0., 0.],
          [0., 0.]],

         [[0., 0.],
          [0., 0.],
          [0., 0.]]],


        [[[0., 0.],
          [0., 0.],
          [0., 0.]],

         [[0., 0.],
          [0., 0.],
          [0., 0.]],

         [[0., 0.],
          [0., 0.],
          [0., 0.]]],


        [[[0., 0.],
          [0., 0.],
          [0., 0.]],

         [[0., 0.],
          [0., 0.],
          [0., 0.]],

         [[0., 0.],
          [0., 0.],
          [0., 0.]]]],



       [[[[0., 0.],
          [0., 0.],
          [0., 0.]],

         [[0., 0.],
          [0., 0.],
          [0., 0.]],

         [[0., 0.],
          [0., 0.],
          [0., 0.]]],


        [[[0., 0.],
          [0., 0.],
          [0., 0.]],

         [[0., 0.],
          [0., 0.],
          [0., 0.]],

         [[0., 0.],
          [0., 0.],
          [0., 0.]]],


        [[[0., 0.],
          [0., 0.],
          [0., 0.]