# <center> CartPole, familiarizing with the environment
    
The exercise is based on an environment taken from the **Gym** library, a toolkit for developing and comparing reinforcement learning algorithms:
    
https://gym.openai.com/envs/CartPole-v1/
    
### Description

A pole is attached by an un-actuated joint to a cart, which moves along a 1-dim frictionless track. The pendulum starts upright, and the goal is to prevent it from falling over by increasing and reducing the cart's velocity.

This environment corresponds to the version of the cart-pole problem described by Barto, Sutton, and Anderson (example 3.4, 2018 version).

In [3]:
import numpy as np
import gym
env = gym.make('CartPole-v1')

The actual source of the class can be seen at https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py and its basic structure follows what we used in the past exercises.

In [4]:
class FakeCartPole():
    def __init__(self):
        """
        Fake Class similar to CartPole which does NOTHING.
        """
        # Definition of State Space
        
        # Definition of Action Space
        
        # Definition of "Physics" of the problem
        
    def reset(self):
        """
        Resets the env.
        """
        # Reset the environment to initial state
        pass
        
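    def step(self, A):
        """
        Evolves the environment given action A, i.e. a push of the cart to the left or to the right.
        """
        # ----
        # actual evolution (placeholder values here, since this fake class does nothing)
        # ----
        new_state, reward, done, info = None, 0.0, True, {}
        return new_state, reward, done, info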
        
        
    def render(self):
        """
        Does nothing.
        """
        pass
    
    # PLUS A COUPLE OF OTHERS...
    def seed(self, seed=None):
        """
        For random seed
        """
        # STUFF
        pass
        
    def close(self):
        """
        To close the environment
        """
        # STUFF
        pass
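
To make the `reset()` / `step(action)` contract concrete, here is a minimal sketch of a working toy environment with the same interface. `CountdownEnv` and its `horizon` parameter are hypothetical names introduced only for illustration; they are not part of Gym and have nothing to do with the cart-pole physics.

```python
class CountdownEnv:
    """Hypothetical toy environment: the state is the number of steps left."""

    def __init__(self, horizon=3):
        self.horizon = horizon      # episode length
        self.steps_left = horizon   # current state

    def reset(self):
        """Resets the env and returns the starting state."""
        self.steps_left = self.horizon
        return self.steps_left

    def step(self, action):
        """The action is ignored: the 'physics' is just a countdown."""
        self.steps_left -= 1
        reward = 1.0                      # one unit of reward per step
        done = self.steps_left == 0       # terminal when the countdown ends
        info = {}
        return self.steps_left, reward, done, info


# Usage: one full episode with the same loop structure used for CartPole
toy_env = CountdownEnv(horizon=3)
state = toy_env.reset()
total_reward, done = 0.0, False
while not done:
    state, reward, done, info = toy_env.step(0)
    total_reward += reward
print(total_reward)
```

Any object exposing these two methods with the same return values can be driven by the same episode loop.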

### State space
The state space is a Box(4) object, which is a 4-dimensional vector of floats.
Each dimension can be bounded, and represents the following observables:

In [5]:
print("State space: ", env.observation_space)
print()

low_bounds, high_bounds = (env.observation_space.low, env.observation_space.high)
print("1st element:\tPosition of the cart along the x-axis. Bounds: [%2.1f, %2.1f]" %(low_bounds[0], high_bounds[0]))
print("2nd element:\tCart velocity. Not bounded")
print("3rd element:\tPole angle. Bounds: [%2.1f, %2.1f]" %(np.rad2deg(low_bounds[2]), np.rad2deg(high_bounds[2])))
print("4th element:\tPole angular velocity. Not bounded")

State space:  Box(-3.4028234663852886e+38, 3.4028234663852886e+38, (4,), float32)

1st element:	Position of the cart along the x-axis. Bounds: [-4.8, 4.8]
2nd element:	Cart velocity. Not bounded
3rd element:	Pole angle. Bounds: [-24.0, 24.0]
4th element:	Pole angular velocity. Not bounded


### Action space
The action space is a discrete space of size two.
The two actions apply a fixed force to the cart, towards the left (action 0) or towards the right (action 1).

In [6]:
print("Action space: ", env.action_space)

Action space:  Discrete(2)


### Rewards
Reward is 1 for every step taken, including the termination step.

### Starting State
All observations are assigned a uniform random value in [-0.05, 0.05].

### Episode Termination
* The pole angle exceeds 12 degrees in either direction
* The cart position exceeds 2.4 in either direction (the center of the cart reaches the edge of the display)
* The episode length is greater than 500, or another value set by the variable *env.\_max\_episode\_steps*

### Solved Requirements
The problem is considered solved when the average reward is greater than or equal to 195.0 over 100 consecutive trials. This figure comes from the environment's docstring and matches the 200-step CartPole-v0; the threshold registered for CartPole-v1 is 475.0.
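
The termination conditions listed above can be written as a small check on the state vector. This is a sketch for illustration only, not the library's implementation; the constant and function names are hypothetical, and `n_steps` is assumed to count the steps already taken in the episode. Note that the bounds of the state space printed earlier (4.8 and 24 degrees) are twice these thresholds, so an episode ends well before the state reaches the edge of the Box.

```python
import math

X_THRESHOLD = 2.4                     # cart position limit
THETA_THRESHOLD = math.radians(12)    # pole angle limit, converted to radians
MAX_STEPS = 500                       # step limit for CartPole-v1


def episode_should_end(state, n_steps):
    """True if any of the termination conditions holds.

    state   : sequence (x, x_dot, theta, theta_dot)
    n_steps : number of steps already taken in the episode
    """
    x, _, theta, _ = state
    cart_out = abs(x) > X_THRESHOLD
    pole_fallen = abs(theta) > THETA_THRESHOLD
    time_over = n_steps >= MAX_STEPS
    return cart_out or pole_fallen or time_over


print(episode_should_end([0.0, 0.0, 0.0, 0.0], 10))   # upright and centered
print(episode_should_end([2.5, 0.0, 0.0, 0.0], 10))   # cart out of range
```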

### How the environment works
The key methods of the gym environments are reset() and step(action).

As the name suggests, the first one resets the environment. It also returns the starting state, following the rule written above (all 4 values are assigned at random within a given window).

In [7]:
starting_state = env.reset()

The step(action) method returns the next state as a consequence of the action (pushing the cart right or left).
When this method is called, the env engine computes the physics of the system.
Together with the new state, step(action) also returns the obtained reward and whether the episode has reached a terminal state (the fourth variable, info, will not be considered here).

In [8]:
# Action 0 corresponds to pushing the cart to the left
new_state, reward, done, info = env.step(0) 
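
The calls above follow the older Gym interface, where `reset()` returns only the state and `step()` returns four values. From Gym 0.26 on (and in Gymnasium), `reset()` returns `(state, info)` and `step()` returns five values, with `done` split into `terminated` and `truncated`. The helpers below are a hedged sketch of how one might support both conventions; `reset_compat`, `step_compat` and the two stub classes are hypothetical names used only to demonstrate the idea without requiring Gym.

```python
def reset_compat(env):
    """Return only the state, whichever reset() convention env follows."""
    out = env.reset()
    # Newer convention: a (state, info) pair, with info being a dict
    if isinstance(out, tuple) and len(out) == 2 and isinstance(out[1], dict):
        return out[0]
    return out


def step_compat(env, action):
    """Return (state, reward, done, info) for both 4- and 5-value step()."""
    out = env.step(action)
    if len(out) == 5:
        state, reward, terminated, truncated, info = out
        return state, reward, terminated or truncated, info
    return out


class OldStyleStub:
    """Hypothetical stand-in using the 4-value convention."""
    def reset(self):
        return [0.0, 0.0, 0.0, 0.0]
    def step(self, action):
        return [0.0, 0.0, 0.0, 0.0], 1.0, False, {}


class NewStyleStub:
    """Hypothetical stand-in using the 5-value convention."""
    def reset(self):
        return [0.0, 0.0, 0.0, 0.0], {}
    def step(self, action):
        return [0.0, 0.0, 0.0, 0.0], 1.0, False, True, {}


for stub in (OldStyleStub(), NewStyleStub()):
    print(reset_compat(stub), step_compat(stub, 0))
```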

If you now compare the two states, you can see that the cart velocity (second element of the vector) has decreased as a consequence of the push to the left:

In [9]:
print(starting_state)
print(new_state)

[-0.02344187 -0.03038084  0.02490161 -0.0284238 ]
[-0.02404948 -0.22585088  0.02433314  0.27201068]
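
The two printed vectors can also be compared numerically. The sketch below copies the values shown above and checks that the position and the angle advance by roughly the time step times the corresponding velocity, which is what a simple Euler update would give. The value `TAU = 0.02` seconds is an assumption taken from the environment's source code, not something shown in this notebook.

```python
import numpy as np

TAU = 0.02  # assumed integration time step, in seconds

# Values copied from the printed output above
starting_state = np.array([-0.02344187, -0.03038084, 0.02490161, -0.0284238])
new_state = np.array([-0.02404948, -0.22585088, 0.02433314, 0.27201068])

delta = new_state - starting_state

# The push to the left lowers the cart velocity
print("change in cart velocity:", delta[1])

# Position and angle move by (time step) x (velocity at the previous step)
print("position change:", delta[0], "vs tau * velocity:", TAU * starting_state[1])
print("angle change:", delta[2], "vs tau * angular velocity:", TAU * starting_state[3])
```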


## Trying a naive strategy
One can define a naive strategy (i.e. the action to take given the current state) based on the physical intuition of the problem.

Remember that the best performance is an episode cumulative reward of 500, because after 500 steps the episode is automatically terminated (see episode termination above).
A smaller reward means that the episode has ended before 500 steps, because either (1) the angle of the pole has become too large, or (2) the cart has moved outside the boundaries.

In [10]:
# Definition of possible policies. Method which returns an action given the state as argument

def my_bad_policy(state):
    """
    If the pole angle is less than 0 (bent towards left) I apply a force towards left.
    """
    if state[2] < 0:
        return 0
    else:
        return 1 
    
def random_policy(state):
    return np.random.randint(2)

In [11]:
# Main cycle for running the environment

def run(env, n_episodes, strategy, render=False):
    """
    Runs the environment for a given number of episodes, according to a given strategy.
    It returns the average reward over all the episodes.
    """
    average_reward = 0 # Cumulative reward averaged over all the episodes
    
    for _ in range(n_episodes): # Cycle over all the episodes
        
        state = env.reset() # Episode initialization
        ep_reward = 0
        
        while True: # Cycle over the steps
            
            if render:
                env.render() # This method can render the environment
                
            action = strategy(state) # Getting the action from the heuristic policy
            state, reward, done, info = env.step(action) # Environmental step
            ep_reward += reward
            
            if done: # Check if the state is terminal
                break
                
        average_reward += ep_reward / float(n_episodes)
        
    return average_reward

Printing the average reward of the heuristic policy and of the random policy over 500 episodes:

In [12]:
n_episodes = 500

print(run(env, n_episodes, my_bad_policy))
print(run(env, n_episodes, random_policy))

42.08999999999998
22.332000000000004


Can you come up with a better strategy? 
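
As a starting point, here is one candidate that also looks at how fast the pole is rotating, not only at which side it leans towards. It is a sketch based on intuition, with no claim about how well it performs; `combined_policy` and its `weight` parameter are hypothetical names introduced here. It has the same signature as the policies above, so it can be passed to the same run function.

```python
def combined_policy(state, weight=1.0):
    """
    Pushes towards the side the pole is falling to, judging from both
    the pole angle (state[2]) and its angular velocity (state[3]).
    weight sets how much the angular velocity counts relative to the angle.
    """
    score = state[2] + weight * state[3]
    if score > 0:
        return 1  # push right
    else:
        return 0  # push left


# The pole leans slightly right but is rotating quickly to the left
print(combined_policy([0.0, 0.0, 0.02, -0.5]))
```

With `weight=0` this reduces to a policy based on the pole angle alone, so the parameter can be tuned by hand to see how much the extra information helps.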

Reaching the best performance with just intuition is almost impossible. We can approach the problem with **reinforcement learning**!

### Rendering
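The gym environment also provides the possibility to visualize how the system evolves through the render method. Rendering needs a graphical display, so on a headless machine the call may fail, as in the output below.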

In [15]:
run(env, 100, my_bad_policy, render=True)

NameError: name 'base' is not defined