# Pole balancing

We consider the classic pole balancing (cart-pole) task due to Sutton & Barto (see also the lecture notes).

Each state is determined by four values:

| Num | Observation | Min | Max |
|-----|-------------|-----|-----|
| 0 | Cart Position | -4.8 | 4.8 |
| 1 | Cart Velocity | -Inf | Inf |
| 2 | Pole Angle | -0.418 rad (-24 deg) | 0.418 rad (24 deg) |
| 3 | Pole Angular Velocity | -Inf | Inf |

Only two actions are possible:

|Num   |Action|
|-------|--------|
|0|  Push cart to the left|
|1|  Push cart to the right|
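To keep code that indexes into the state readable, the four components can be given names. The following is only a small sketch: `CartPoleState` and `to_named_state` are hypothetical helper names introduced here for illustration and are not part of <code>gym</code>.

```python
from typing import NamedTuple, Sequence


class CartPoleState(NamedTuple):
    """Named view of the four observation values (hypothetical helper)."""
    cart_position: float
    cart_velocity: float
    pole_angle: float             # in radians
    pole_angular_velocity: float


def to_named_state(observation: Sequence[float]) -> CartPoleState:
    """Convert a length-4 observation (list or array) into a CartPoleState."""
    position, velocity, angle, angular_velocity = (float(v) for v in observation)
    return CartPoleState(position, velocity, angle, angular_velocity)


example = to_named_state([0.017, 0.016, -0.023, -0.007])
print(example.pole_angle)
```

With this, `example.pole_angle` can be used in place of `state[2]`.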


The pole balancing task is implemented in the package <code>gym</code>, which you need to install. The package provides a framework for reinforcement learning tasks, along with a number of fun ready-made tasks. Check it out here: http://gym.openai.com/

In [1]:
## install gym in colab via
# !pip install gym

import gym

We initialise the task with <code>gym.make</code>. Unfortunately, it's not easy to get information on the task: we only reach the documentation once we find out where exactly the task is defined. (<code>help(env)</code> does **not** work.)

In [2]:
env=gym.make("CartPole-v1")
help(gym.envs.classic_control.cartpole)

Help on module gym.envs.classic_control.cartpole in gym.envs.classic_control:

NAME
    gym.envs.classic_control.cartpole

DESCRIPTION
    Classic cart-pole system implemented by Rich Sutton et al.
    Copied from http://incompleteideas.net/sutton/book/code/pole.c
    permalink: https://perma.cc/C9ZM-652R

CLASSES
    gym.core.Env(builtins.object)
        CartPoleEnv
    
    class CartPoleEnv(gym.core.Env)
     |  Description:
     |      A pole is attached by an un-actuated joint to a cart, which moves along
     |      a frictionless track. The pendulum starts upright, and the goal is to
     |      prevent it from falling over by increasing and reducing the cart's
     |      velocity.
     |  
     |  Source:
     |      This environment corresponds to the version of the cart-pole problem
     |      described by Barto, Sutton, and Anderson
     |  
     |  Observation:
     |      Type: Box(4)
     |      Num     Observation               Min                     Max
     |      0
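Another way to get at some of this information is to query the environment object itself. The sketch below assumes that `env` exposes the attributes `unwrapped`, `observation_space` and `action_space`, as <code>gym</code> environments do; `describe_env` is a hypothetical helper name, not a <code>gym</code> function.

```python
def describe_env(env):
    """Collect basic facts about an environment (hypothetical helper)."""
    base = env.unwrapped  # the underlying task, without any wrappers
    doc = (type(base).__doc__ or "").strip()
    return {
        "class": type(base).__name__,
        "observation_space": repr(env.observation_space),
        "action_space": repr(env.action_space),
        "doc_first_line": doc.splitlines()[0] if doc else "",
    }

# Usage, assuming gym is installed and env was created as above:
# describe_env(env)
```

Along the same lines, calling `help` on `env.unwrapped` rather than on `env` may be worth trying.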

From the documentation we learn that there are two actions possible, and we introduce variables to better distinguish between the two.

In [3]:
# we define two variables so that we can better distinguish between the actions
push_left=0
push_right=1

Next, reset the environment and observe the starting state.

In [4]:
state=env.reset()
state

array([ 0.01698376,  0.01610257, -0.02270971, -0.00659515], dtype=float32)

Let's choose an action.

In [5]:
new_state, reward, done, info = env.step(push_left)
new_state,reward,done

(array([ 0.01730581, -0.17868645, -0.02284162,  0.27883697], dtype=float32),
 1.0,
 False)
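A note on versions: the calls above use the older <code>gym</code> interface, in which `reset` returns just the observation and `step` returns four values. From <code>gym</code> 0.26 onwards, `reset` returns a pair `(observation, info)` and `step` returns five values `(observation, reward, terminated, truncated, info)`. The following sketch shows one way to handle both; `unpack_reset` and `unpack_step` are hypothetical helper names, not part of <code>gym</code>.

```python
def unpack_reset(result):
    """Return only the observation from the result of env.reset()."""
    # Newer interface: (observation, info) with info being a dict.
    if isinstance(result, tuple) and len(result) == 2 and isinstance(result[1], dict):
        return result[0]
    # Older interface: the observation itself.
    return result


def unpack_step(result):
    """Return (observation, reward, done, info) for either step() interface."""
    if len(result) == 5:
        observation, reward, terminated, truncated, info = result
        # The episode is over if the task ended or a time limit was hit.
        return observation, reward, bool(terminated or truncated), info
    observation, reward, done, info = result
    return observation, reward, bool(done), info

# Usage, assuming env was created as above:
# state = unpack_reset(env.reset())
# new_state, reward, done, info = unpack_step(env.step(push_left))
```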

## Let's try out a simple strategy

We set up a simple strategy: whenever the pole leans to the left, push left; whenever it leans to the right, push right. We then test the strategy by playing 1000 episodes and computing the mean total reward. We set the maximum length of an episode to 300 steps: if you can balance the pole for that long, you can probably do so indefinitely.

In [6]:
def basic_policy(state):
    """Push towards the side the pole is leaning to."""
    angle = state[2]
    return push_left if angle < 0 else push_right

def play_single_episode(policy, env, max_episode_length=300):
    """Play one episode with the given policy and return the total reward."""
    total_reward = 0
    state = env.reset()
    for _ in range(max_episode_length):
        action = policy(state)
        state, reward, done, info = env.step(action)
        total_reward += reward
        if done:  # pole fell over, episode ended
            break
    return total_reward

def play_many_episodes(policy, env, repeats, max_episode_length=300):
    """Play several episodes and return the mean total reward."""
    total_rewards = []
    for _ in range(repeats):
        total_rewards.append(play_single_episode(policy, env, max_episode_length))
    return sum(total_rewards) / len(total_rewards)

play_many_episodes(basic_policy, env, 1000)

42.058
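The mean alone hides how much the episodes vary. As a sketch, the list of total rewards collected inside `play_many_episodes` could be summarised a little more fully; `summarise_returns` is a hypothetical helper name, and the numbers in the example are made up for illustration, not measured.

```python
import statistics


def summarise_returns(total_rewards):
    """Return mean, sample standard deviation, min and max of episode returns."""
    spread = statistics.stdev(total_rewards) if len(total_rewards) > 1 else 0.0
    return {
        "mean": statistics.fmean(total_rewards),
        "stdev": spread,
        "min": min(total_rewards),
        "max": max(total_rewards),
    }


# Made-up example values, only to show the shape of the result.
summary = summarise_returns([10.0, 20.0, 30.0])
print(summary)
```

To use it on real data, `play_many_episodes` would have to return the list `total_rewards` instead of only its mean.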

A mean total reward of about 42 could be better! Can you come up with a better policy? If you want to experiment, consult the [documentation](http://gym.openai.com/docs/), which explains, in particular, how to visualise episodes.
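One direction to explore is a policy that reacts to where the pole is heading, not only to where it currently is. The sketch below is just one possible starting point and not a known best answer: `anticipating_policy` and its parameter `velocity_weight` are hypothetical names, and the default weight of 1.0 is an arbitrary assumption to be tuned.

```python
push_left = 0
push_right = 1


def anticipating_policy(state, velocity_weight=1.0):
    """Push towards the side the pole is leaning to or rotating towards."""
    angle = state[2]
    angular_velocity = state[3]
    # Combine the current lean with the direction of rotation; the weight
    # is an assumption and sets how strongly the rotation counts.
    score = angle + velocity_weight * angular_velocity
    return push_left if score < 0 else push_right


# Pole leans slightly right but is rotating to the left: push left.
print(anticipating_policy([0.0, 0.0, 0.01, -0.5]))
```

Whether, and by how much, this changes the mean total reward can be checked with `play_many_episodes(anticipating_policy, env, 1000)`.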