### Setting up the OpenAI Gym environment

## [**Cart Pole**](https://www.gymlibrary.dev/environments/classic_control/cart_pole/)

![CartPole-v1](https://miro.medium.com/max/1188/1*LVoKXR7aX7Y8npUSsYZdKg.png "CartPole-v1")

| | |
| :-: | :-: |
| Action Space | Discrete(2) |
| Observation Shape | (4,) |
| Observation High | \[4.8 inf 0.42 inf\] |
| Observation Low | \[-4.8 -inf -0.42 -inf\] |
| Import |`gym.make("CartPole-v1")` |

**Description**
This environment corresponds to the version of the cart-pole problem described by Barto, Sutton, and Anderson in [“Neuronlike Adaptive Elements That Can Solve Difficult Learning Control Problems”](https://ieeexplore.ieee.org/document/6313077). A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The pendulum is placed upright on the cart and the goal is to balance the pole by applying forces to the cart in the left and right directions.

**Action Space**

The action is an `ndarray` with shape `(1,)` which can take values `{0, 1}` indicating the direction of the fixed force the cart is pushed with.

| Num | Action |
| :-: | :-: |
| 0 | Push cart to the left |
| 1 | Push cart to the right |
    
**Note**: The velocity that is reduced or increased by the applied force is not fixed; it depends on the angle the pole is pointing. The center of gravity of the pole varies the amount of energy needed to move the cart underneath it.

**Observation Space**

The observation is an `ndarray` with shape `(4,)` with the values corresponding to the following positions and velocities:

| Num | Observation | Min | Max |
| :-: | :-: | :-: | :-: |
| 0 | Cart Position | -4.8 | 4.8 |
| 1 | Cart Velocity | -Inf | Inf |
| 2 | Pole Angle | ~-0.418 rad (-24º) | ~0.418 rad (24º) |
| 3 | Pole Angular Velocity | -Inf | Inf |
    
**Note**: While the ranges above denote the possible values for each element of the observation space, they do not reflect the values allowed in the state space of an unterminated episode. In particular:

- The cart x-position (index 0) can take values between `(-4.8, 4.8)`, but the episode terminates if the cart leaves the `(-2.4, 2.4)` range.

- The pole angle can be observed between `(-.418, .418)` radians (or **±24°**), but the episode terminates if the pole angle is not in the range `(-.2095, .2095)` (or **±12°**).
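The degree and radian figures quoted above can be cross-checked with a few lines of standard-library Python. This is only a small sketch: the constant names are illustrative and are not Gymnasium identifiers. The two bounds come out at roughly 0.419 rad and 0.209 rad.

```python
import math

# Illustrative constants (names are assumptions, not part of Gymnasium).
OBSERVED_ANGLE_LIMIT_DEG = 24   # bound of the observation space
TERMINATION_ANGLE_DEG = 12      # angle beyond which the episode terminates

# Convert both limits from degrees to radians.
observed_limit_rad = math.radians(OBSERVED_ANGLE_LIMIT_DEG)
termination_limit_rad = math.radians(TERMINATION_ANGLE_DEG)

print(f"observation bound: {observed_limit_rad:.3f} rad")
print(f"termination bound: {termination_limit_rad:.3f} rad")
```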

**Rewards**
    
Since the goal is to keep the pole upright for as long as possible, a reward of `+1` for every step taken, including the termination step, is allotted. The threshold for rewards is 475 for v1.

**Starting State**
    
All observations are assigned a uniformly random value in `(-0.05, 0.05)`.
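The environment performs this sampling internally when `reset()` is called. The sketch below only imitates the idea with the standard library; `sample_start_state` is a hypothetical helper and its default arguments are taken from the range stated above.

```python
import random

def sample_start_state(low=-0.05, high=0.05, size=4, seed=None):
    """Draw each of the four state components independently and uniformly."""
    rng = random.Random(seed)
    return [rng.uniform(low, high) for _ in range(size)]

# Cart position, cart velocity, pole angle, pole angular velocity.
state = sample_start_state(seed=0)
print(state)
```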

**Episode End**
    
The episode ends if any one of the following occurs:

1. Termination: Pole Angle is greater than ±12°

2. Termination: Cart Position is greater than ±2.4 (center of the cart reaches the edge of the display)

3. Truncation: Episode length is greater than 500 (200 for v0)
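The three conditions above can be sketched as a small classification function. This is an illustration rather than the environment's own code: `episode_status` and the constant names are hypothetical, and the sketch assumes the time limit triggers as soon as the step count reaches the limit.

```python
import math

X_LIMIT = 2.4                    # cart position limit
THETA_LIMIT = math.radians(12)   # pole angle limit, in radians
MAX_STEPS = 500                  # CartPole-v1 (200 for v0)

def episode_status(cart_position, pole_angle, steps_taken, max_steps=MAX_STEPS):
    """Return 'terminated', 'truncated' or 'running' for the given state."""
    # Termination: the cart left the track or the pole fell too far.
    if abs(cart_position) > X_LIMIT or abs(pole_angle) > THETA_LIMIT:
        return "terminated"
    # Truncation: the episode is cut off by the time limit.
    if steps_taken >= max_steps:
        return "truncated"
    return "running"

print(episode_status(0.0, 0.05, 10))
print(episode_status(2.5, 0.0, 10))
print(episode_status(0.0, 0.0, 500))
```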

In [3]:
import gymnasium as gym

In [9]:
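# Render 200 random actions
env = gym.make("CartPole-v1", render_mode="human")
env.reset()
for _ in range(200):
    # With render_mode="human" every step is drawn automatically
    _, _, terminated, truncated, _ = env.step(env.action_space.sample())
    if terminated or truncated:
        # Start a new episode instead of stepping a finished one
        env.reset()
env.close()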



In [5]:
EPISODES = 5000
DISCOUNT = 0.95
EPISODE_DISPLAY = 500
LEARNING_RATE = 0.25
EPSILON = 0.2
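`EPSILON` sets the exploration rate used during training: with probability `EPSILON` the agent takes a random action, otherwise it takes the action with the highest Q-value. A minimal standard-library sketch of that rule follows; `epsilon_greedy` is a hypothetical helper written for illustration and is not part of Gymnasium.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Greedy choice: index of the largest Q-value.
    return max(range(len(q_values)), key=lambda a: q_values[a])

q_values = [0.1, 0.7]   # made-up Q-values for the two CartPole actions

# With epsilon = 0 the choice is always greedy.
greedy = epsilon_greedy(q_values, epsilon=0.0)

# With epsilon = 0.2 most picks are still the greedy action.
rng = random.Random(0)
picks = [epsilon_greedy(q_values, 0.2, rng) for _ in range(1000)]
print(greedy, picks.count(0), picks.count(1))
```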

In [6]:
env = gym.make("CartPole-v1", render_mode="human")
for episode in range(20):
    # reset() returns (observation, info); keep only the observation
    observation, info = env.reset()
    for t in range(30):
        print(observation)
        action = env.action_space.sample()
        observation, reward, terminated, truncated, info = env.step(action)
        if terminated or truncated:
            print("episode finished after {} time steps".format(t+1))
            break
env.close()

[ 0.04048085 -0.00938806 -0.03770751 -0.02480632]
[ 0.04029309  0.18625379 -0.03820363 -0.32914388]
[ 0.04401816 -0.00830406 -0.04478651 -0.04874916]
[ 0.04385208 -0.20275615 -0.0457615   0.22947365]
[ 0.03979696 -0.00701115 -0.04117202 -0.07728565]
[ 0.03965674  0.1886761  -0.04271774 -0.38266894]
[ 0.04343026 -0.00581411 -0.05037111 -0.10375495]
[ 0.04331398  0.18999216 -0.05244621 -0.41189468]
[ 0.04711382  0.38581684 -0.06068411 -0.72063994]
[ 0.05483016  0.5817235  -0.07509691 -1.0317892 ]
[ 0.06646463  0.3876765  -0.09573269 -0.76359683]
[ 0.07421815  0.5839774  -0.11100462 -1.0848024 ]
[ 0.08589771  0.39048097 -0.13270067 -0.8289124 ]
[ 0.09370732  0.1973985  -0.14927892 -0.5807346 ]
[ 0.0976553   0.3942621  -0.16089362 -0.9164711 ]
[ 0.10554054  0.5911509  -0.17922303 -1.2550888 ]
[ 0.11736355  0.39871806 -0.20432481 -1.023473  ]
episode finished after 17 time steps
[ 0.02420584 -0.04866919  0.04471892 -0.04325758]
...

### Let’s develop a Q-learning and SARSA model to solve this problem

In [1]:
import pygame
import gymnasium as gym
import numpy as np
import math
import matplotlib.pyplot as plt
%matplotlib inline

pygame 2.5.2 (SDL 2.28.3, Python 3.10.13)
Hello from the pygame community. https://www.pygame.org/contribute.html


### Prepare OpenAI Gym Environment

In [7]:
def prepare_env():
    env = gym.make("CartPole-v1",render_mode="human")
    print("Env. Observation Space:",env.observation_space)
    print('Env. Observation Space - High: ',env.observation_space.high)
    print('Env. Observation Space - Low:', env.observation_space.low)
    
    print('Env. Action Space:', env.action_space)
    print('Env. Actions Space:', env.action_space.n)
    return env

In [8]:
prepare_env()

Env. Observation Space: Box([-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38], [4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38], (4,), float32)
Env. Observation Space - High:  [4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38]
Env. Observation Space - Low: [-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38]
Env. Action Space: Discrete(2)
Env. Actions Space: 2


<TimeLimit<OrderEnforcing<PassiveEnvChecker<CartPoleEnv<CartPole-v1>>>>>

### Prepare Reinforcement Learning Model Hyper-parameters

In [6]:
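def discretised_state(state, theta_minmax, theta_dot_minmax, theta_state_size, theta_dot_state_size):
    """Map the pole angle and angular velocity to a pair of Q-table bin indices."""
    # Only state[2] (pole angle) and state[3] (pole angular velocity) are used;
    # cart position and cart velocity are ignored.
    discrete_state = np.array([0,0])

    # Width of one angle bin over [-theta_minmax, theta_minmax]
    theta_window = (theta_minmax - (-theta_minmax)) / theta_state_size
    discrete_state[0] = (state[2] - (-theta_minmax)) // theta_window
    # Clamp out-of-range values to the first or last bin
    discrete_state[0] = min(theta_state_size - 1, max(0, discrete_state[0]))

    # Same binning for the angular velocity over [-theta_dot_minmax, theta_dot_minmax]
    theta_dot_window = (theta_dot_minmax - (-theta_dot_minmax)) / theta_dot_state_size
    discrete_state[1] = (state[3] - (-theta_dot_minmax)) // theta_dot_window
    discrete_state[1] = min(theta_dot_state_size - 1, max(0, discrete_state[1]))

    return tuple(discrete_state.astype(np.int32))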
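To make the binning arithmetic concrete, here is a scalar sketch of the same idea applied to a single value. `bin_index` is a hypothetical helper written for illustration; the 24° limit and the 50 bins mirror the values used for the pole angle in the training code.

```python
import math

def bin_index(value, limit, n_bins):
    """Map a value in [-limit, limit] to an integer bin in [0, n_bins - 1]."""
    window = (2 * limit) / n_bins            # width of one bin
    index = int((value + limit) // window)   # shift to [0, 2*limit], then bin
    return min(n_bins - 1, max(0, index))    # clamp out-of-range values

angle_limit = math.radians(24)   # observation bound for the pole angle
n_bins = 50

for angle in (-1.0, -0.2, 0.1, 1.0):
    print(angle, "->", bin_index(angle, angle_limit, n_bins))
```

Values beyond the limits fall into the first or last bin, so the Q-table index is always valid.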

### Q-Learning
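For reference, the tabular Q-learning update used in the training loop can be written as follows, where $\alpha$ corresponds to `LEARNING_RATE`, $\gamma$ to `DISCOUNT`, $s$ and $a$ are the current discretised state and action, $r$ is the reward, and $s'$ is the next discretised state:

$$
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
$$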

In [12]:
def train_cart_pole_qlearning(EPISODES, DISCOUNT, EPISODE_DISPLAY, LEARNING_RATE, EPSILON):
    
    #Prepare OpenGym CartPole Environment
    env = prepare_env()
    #Q-Table of size theta_state_size * theta_dot_state_size * env.action_space.n
    theta_minmax = env.observation_space.high[2]
    theta_dot_minmax = math.radians(50)
    theta_state_size = 50
    theta_dot_state_size = 50

    Q_TABLE = np.random.randn(theta_state_size,theta_dot_state_size,env.action_space.n)
    #For stats
    ep_rewards = []
    ep_rewards_table = {'ep':[],'avg':[],'min':[],'max':[]}
    
    for episode in range(EPISODES):
        episode_reward = 0
        done = False
        if episode % EPISODE_DISPLAY == 0:
            render_state = True
        else:
            render_state = False
        curr_discrete_state = discretised_state(env.reset()[0],
                                                theta_minmax,theta_dot_minmax,
                                                theta_state_size,theta_dot_state_size)

        while not done:
            #Epsilon-greedy action selection
            if np.random.random() > EPSILON:
                action = np.argmax(Q_TABLE[curr_discrete_state])
            else:
                action = np.random.randint(0,env.action_space.n)
            new_state, reward, terminated, truncated, _ = env.step(action)
            #The episode ends on termination (failure) or truncation (time limit)
            done = terminated or truncated
            new_discrete_state = discretised_state(new_state,theta_minmax,theta_dot_minmax,
                                                   theta_state_size,theta_dot_state_size)

            if render_state:
                env.render()

            current_q = Q_TABLE[curr_discrete_state[0],curr_discrete_state[1],action]
            if not terminated:
                #Bootstrap from the best action value in the next state
                max_future_q = np.max(Q_TABLE[new_discrete_state])
                new_q = current_q + LEARNING_RATE * (reward + DISCOUNT * max_future_q - current_q)
            else:
                #Terminal state: there is no future value to bootstrap from
                new_q = current_q + LEARNING_RATE * (reward - current_q)
            Q_TABLE[curr_discrete_state[0],curr_discrete_state[1],action] = new_q
            curr_discrete_state = new_discrete_state
            episode_reward += reward

        #Record the total reward once per episode
        ep_rewards.append(episode_reward)

        if not episode % EPISODE_DISPLAY:
            avg_reward = sum(ep_rewards[-EPISODE_DISPLAY:]) / len(ep_rewards[-EPISODE_DISPLAY:])
            ep_rewards_table['ep'].append(episode)
            ep_rewards_table['avg'].append(avg_reward)
            ep_rewards_table['min'].append(min(ep_rewards[-EPISODE_DISPLAY:]))
            ep_rewards_table['max'].append(max(ep_rewards[-EPISODE_DISPLAY:]))
            print(f"Episode:{episode} avg:{avg_reward} min:{min(ep_rewards[-EPISODE_DISPLAY:])} max:{max(ep_rewards[-EPISODE_DISPLAY:])}")

    env.close()
    
    #Plot Model evolution performance
    plt.plot(ep_rewards_table['ep'], ep_rewards_table['avg'], label = "avg")
    plt.plot(ep_rewards_table['ep'], ep_rewards_table['min'], label = "min")
    plt.plot(ep_rewards_table['ep'], ep_rewards_table['max'], label = "max")
    plt.legend(loc = 4) #bottom right
    plt.title('CartPole Q-Learning')
    plt.ylabel('Average reward/Episode')
    plt.xlabel('Episodes')
    plt.show()
    
    return ep_rewards_table

ep_rewards_table_qlearning = train_cart_pole_qlearning(EPISODES, DISCOUNT, EPISODE_DISPLAY, LEARNING_RATE, EPSILON)

Env. Observation Space: Box([-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38], [4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38], (4,), float32)
Env. Observation Space - High:  [4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38]
Env. Observation Space - Low: [-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38]
Env. Action Space: Discrete(2)
Env. Actions Space: 2



### SARSA
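SARSA differs from Q-learning only in the value it bootstraps from: Q-learning uses the best action value in the next state, while SARSA uses the value of the action the policy actually selects next. The toy sketch below contrasts the two targets on made-up numbers; the helper names and the example Q-values are assumptions chosen for illustration, while the discount and learning rate match the hyper-parameters set earlier.

```python
DISCOUNT = 0.95
LEARNING_RATE = 0.25

def q_learning_target(reward, next_q_values, discount=DISCOUNT):
    """Off-policy target: bootstrap from the best next action."""
    return reward + discount * max(next_q_values)

def sarsa_target(reward, next_q_values, next_action, discount=DISCOUNT):
    """On-policy target: bootstrap from the action actually chosen next."""
    return reward + discount * next_q_values[next_action]

def td_update(current_q, target, learning_rate=LEARNING_RATE):
    """Move the current estimate a fraction of the way towards the target."""
    return current_q + learning_rate * (target - current_q)

next_q_values = [2.0, 4.0]   # made-up Q-values of the next state
next_action = 0              # e.g. an exploratory (non-greedy) choice
current_q = 1.0

q_new = td_update(current_q, q_learning_target(1.0, next_q_values))
s_new = td_update(current_q, sarsa_target(1.0, next_q_values, next_action))
print(q_new, s_new)
```

When the next action happens to be the greedy one, both targets coincide; they differ only on exploratory steps.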

In [None]:
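def train_cart_pole_sarsa(EPISODES, DISCOUNT, EPISODE_DISPLAY, LEARNING_RATE, EPSILON):
    #Prepare OpenGym CartPole Environment
    env = prepare_env()

    #Q-Table of size theta_state_size * theta_dot_state_size * env.action_space.n
    theta_minmax = env.observation_space.high[2]
    theta_dot_minmax = math.radians(50)
    theta_state_size = 50
    theta_dot_state_size = 50
    Q_TABLE = np.random.randn(theta_state_size,theta_dot_state_size,env.action_space.n)

    #For stats
    ep_rewards = []
    ep_rewards_table = {'ep':[],'avg':[],'min':[],'max':[]}
    
    for episode in range(EPISODES):
        episode_reward = 0
        done = False
        curr_discrete_state = discretised_state(env.reset()[0],
                                                theta_minmax,theta_dot_minmax,
                                                theta_state_size,theta_dot_state_size)

        if episode % EPISODE_DISPLAY == 0:
            render_state = True
        else:
            render_state = False

        #Choose the first action with the epsilon-greedy policy
        if np.random.random() > EPSILON:
            action = np.argmax(Q_TABLE[curr_discrete_state])
        else:
            action = np.random.randint(0,env.action_space.n)

        while not done:
            new_state, reward, terminated, truncated, _ = env.step(action)
            #The episode ends on termination (failure) or truncation (time limit)
            done = terminated or truncated
            new_discrete_state = discretised_state(new_state,theta_minmax,theta_dot_minmax,
                                                   theta_state_size,theta_dot_state_size)

            #Choose the next action with the same policy (on-policy)
            if np.random.random() > EPSILON:
                new_action = np.argmax(Q_TABLE[new_discrete_state])
            else:
                new_action = np.random.randint(0,env.action_space.n)

            if render_state:
                env.render()

            current_q = Q_TABLE[curr_discrete_state[0],curr_discrete_state[1],action]
            if not terminated:
                #Bootstrap from the action that will actually be taken next
                future_q = Q_TABLE[new_discrete_state[0],new_discrete_state[1],new_action]
                new_q = current_q + LEARNING_RATE * (reward + DISCOUNT * future_q - current_q)
            else:
                #Terminal state: there is no future value to bootstrap from
                new_q = current_q + LEARNING_RATE * (reward - current_q)
            Q_TABLE[curr_discrete_state[0],curr_discrete_state[1],action] = new_q

            curr_discrete_state = new_discrete_state
            action = new_action
            episode_reward += reward
            
        ep_rewards.append(episode_reward)

        if not episode % EPISODE_DISPLAY:
            avg_reward = sum(ep_rewards[-EPISODE_DISPLAY:]) / len(ep_rewards[-EPISODE_DISPLAY:])
            ep_rewards_table['ep'].append(episode)
            ep_rewards_table['avg'].append(avg_reward)
            ep_rewards_table['min'].append(min(ep_rewards[-EPISODE_DISPLAY:]))
            ep_rewards_table['max'].append(max(ep_rewards[-EPISODE_DISPLAY:]))
            print(f"Episode:{episode} avg:{avg_reward} min:{min(ep_rewards[-EPISODE_DISPLAY:])} max:{max(ep_rewards[-EPISODE_DISPLAY:])}")

    env.close()

    #Plot Model evolution performance
    plt.plot(ep_rewards_table['ep'], ep_rewards_table['avg'], label = "avg")
    plt.plot(ep_rewards_table['ep'], ep_rewards_table['min'], label = "min")
    plt.plot(ep_rewards_table['ep'], ep_rewards_table['max'], label = "max")
    plt.legend(loc = 4) #bottom right
    plt.title('CartPole SARSA')
    plt.ylabel('Average reward/Episode')
    plt.xlabel('Episodes')
    plt.show()
    
    return ep_rewards_table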

### Results: Q-Learning vs SARSA

In [None]:
#Q-learning
ep_rewards_table_qlearning = train_cart_pole_qlearning(EPISODES, DISCOUNT, EPISODE_DISPLAY, LEARNING_RATE, EPSILON)

In [None]:
#SARSA
ep_rewards_table_sarsa = train_cart_pole_sarsa(EPISODES, DISCOUNT, EPISODE_DISPLAY, LEARNING_RATE, EPSILON)

In [None]:
#Comparison
plt.figure(figsize = (20, 10))
plt.plot(ep_rewards_table_qlearning['ep'], ep_rewards_table_qlearning['avg'], label = "qlearning_avg")
plt.plot(ep_rewards_table_sarsa['ep'], ep_rewards_table_sarsa['avg'], label = "sarsa_avg")
plt.plot(ep_rewards_table_qlearning['ep'], ep_rewards_table_qlearning['min'], label = "qlearning_min")
plt.plot(ep_rewards_table_sarsa['ep'], ep_rewards_table_sarsa['min'], label = "sarsa_min")
plt.plot(ep_rewards_table_qlearning['ep'], ep_rewards_table_qlearning['max'], label = "qlearning_max")
plt.plot(ep_rewards_table_sarsa['ep'], ep_rewards_table_sarsa['max'], label = "sarsa_max")
plt.legend(loc = 4) #bottom right
plt.title('CartPole Q-Learning vs SARSA')
plt.ylabel('Reward/Episode')
plt.xlabel('Episodes')
plt.show()
