In [1]:
import numpy as np
import matplotlib.pyplot as plt

# Cart Pole

The cart pole problem is a classic reinforcement learning task: balance a pole on a cart by pushing the cart either left or right at each time step.

**Dynamics** The observed state for the problem has four components:

1. The position of the cart (ranges from -4.8 to 4.8). The simulation terminates, however, as soon as the position leaves the range -2.4 to 2.4.

2. The velocity of the cart, a real value between -$\infty$ and $\infty$.

3. The angle of the pole on the cart, measured in radians (ranges between -.418 and .418, or $\pm 24^\circ$). The simulation terminates as soon as the angle leaves the range $\pm 12^\circ$ (about $\pm .209$ radians).

4. The angular velocity of the pole, a real value between -$\infty$ and $\infty$.

**Actions** At each time step, the agent chooses one of two actions: push the cart to the left (0) or push the cart to the right (1).

**Objective** The goal of the task is to keep the pole upright for as long as possible. A reward of $+1$ is given for each time step the pole stays upright. The simulation stops automatically after 500 time steps, so the maximum total reward for an episode is 500.
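
The termination conditions listed above can be written as a small check. This is only an illustrative sketch: the environment applies these limits internally, and the names `is_terminal`, `X_LIMIT`, and `THETA_LIMIT` are hypothetical helpers introduced here, not part of Gymnasium.

```python
import numpy as np

# Thresholds taken from the description above (assumed, not read from the environment).
X_LIMIT = 2.4                    # cart position limit
THETA_LIMIT = np.deg2rad(12.0)   # pole angle limit, about 0.209 rad

def is_terminal(obs):
    """Return True if the observation lies outside the allowed region."""
    x, _x_dot, theta, _theta_dot = obs
    return bool(abs(x) > X_LIMIT or abs(theta) > THETA_LIMIT)

upright = np.array([0.0, 0.0, 0.05, 0.0])    # small tilt, cart centered
fallen = np.array([0.0, 0.0, 0.25, 0.0])     # tilt beyond 12 degrees
off_track = np.array([2.5, 0.0, 0.0, 0.0])   # cart beyond the position limit
print(is_terminal(upright), is_terminal(fallen), is_terminal(off_track))
```

Only the position and the angle matter for termination; the two velocity components are unbounded and never end an episode on their own.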

## Demo
The following demo runs Cart Pole with randomly chosen actions:

In [3]:
## Install Gymnasium and the extra packages needed to record video
!pip install "gymnasium[classic_control]"
!pip install pyvirtualdisplay
!apt-get install -y xvfb

Collecting pyvirtualdisplay
  Downloading PyVirtualDisplay-3.0-py3-none-any.whl.metadata (943 bytes)
Downloading PyVirtualDisplay-3.0-py3-none-any.whl (15 kB)
Installing collected packages: pyvirtualdisplay
Successfully installed pyvirtualdisplay-3.0
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
xvfb is already the newest version (2:21.1.4-2ubuntu1.7~22.04.15).
0 upgraded, 0 newly installed, 0 to remove and 35 not upgraded.


In [4]:
import gymnasium as gym
import os

## To play video
import pyvirtualdisplay
from IPython import display
pyvirtualdisplay.Display(visible=0, size=(1400, 900)).start()

<pyvirtualdisplay.display.Display at 0x7aced931b650>

In [5]:
from gymnasium.wrappers import RecordVideo

env = gym.make("CartPole-v1", render_mode='rgb_array')
env = RecordVideo(env, video_folder="videos") # To record

observation, info = env.reset()

for _ in range(1000):
    action = env.action_space.sample()  # Random actions for demonstration
    observation, reward, terminated, truncated, info = env.step(action)

    if terminated or truncated:
        break

env.close()



In [6]:
import base64
from IPython.display import HTML

def show_video(video_path):
    """Embed an MP4 file in the notebook as a base64-encoded HTML5 video."""
    with open(video_path, "rb") as f:  # close the file once it has been read
        encoded_video = base64.b64encode(f.read()).decode()
    return HTML(data='''<video width="400" controls>
                <source src="data:video/mp4;base64,{0}" type="video/mp4">
</video>
'''.format(encoded_video))

video_path = "videos/rl-video-episode-0.mp4"
show_video(video_path)

## Tabular Q-Learning

As a first pass, we can try a standard $Q$-learning approach: discretize the state space to obtain a $Q$ table $Q(s, a)$, then update this table as in the previous example, by interacting with the simulated environment and observing the reward that follows each state-action pair.
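
For reference, one tabular $Q$-learning step moves $Q(s, a)$ toward the target $r + \gamma \max_{a'} Q(s', a')$ with learning rate $\alpha$. The sketch below shows that step on a toy table. The helper name `q_update` and the toy numbers are hypothetical, and dropping the bootstrap term on termination is a common convention assumed here, not necessarily what the training loop in this notebook does.

```python
import numpy as np

def q_update(q, s, a, r, s_next, terminated, lr=0.1, gamma=0.95):
    """Apply one tabular Q-learning update in place and return the new value."""
    # No future value is added once the episode has terminated (assumed convention).
    bootstrap = 0.0 if terminated else gamma * np.max(q[s_next])
    q[s + (a,)] = (1 - lr) * q[s + (a,)] + lr * (r + bootstrap)
    return q[s + (a,)]

# Toy table: 3 discrete states (indexed by 1-tuples) and 2 actions.
q = np.zeros((3, 2))
q[(1,)] = [2.0, 4.0]  # pretend state 1 already has learned values
new_value = q_update(q, s=(0,), a=1, r=1.0, s_next=(1,), terminated=False)
print(new_value)  # 0.1 * (1 + 0.95 * 4), approximately 0.48
```

States are passed as tuples so that `s + (a,)` indexes a single table entry, the same indexing pattern that works for a table with four discretized state dimensions.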

In [9]:
## Design Q Table
disc_dimensions = np.array([50, 50, 50, 50])  # number of bins per state component
q_table = np.zeros(shape=(*disc_dimensions, env.action_space.n))  # one value per (state bin, action)
q_table.shape

(50, 50, 50, 50, 2)

In [10]:
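def get_discrete_state(state):
    # Divide each component by its bin width and shift so that a value of 0 falls in bin 25
    discrete_state = state / np.array([0.1, 0.25, 0.01, 0.25]) + np.array([25, 25, 25, 25])
    # Clip out-of-range values into the first or last bin
    discrete_state = np.minimum(np.maximum(discrete_state, 0), disc_dimensions - 1)
    # Truncate to integer bin indices; a tuple can index q_table directly
    return tuple(discrete_state.astype(np.int32))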

In [11]:
observation, get_discrete_state(observation)

(array([ 0.19698527,  1.3898189 , -0.250507  , -2.2457354 ], dtype=float32),
 (np.int32(26), np.int32(30), np.int32(0), np.int32(16)))
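
To see which part of the state space the 50 bins actually resolve, the bin widths and the offset of 25 used in `get_discrete_state` can be turned into value ranges. This is a small sketch; the variable names are introduced here for illustration only.

```python
import numpy as np

bin_widths = np.array([0.1, 0.25, 0.01, 0.25])  # same widths as in get_discrete_state
offset = 25                                      # bin index assigned to a value of 0
n_bins = 50

# Values below `lower` are clipped into bin 0, values at or above `upper` into bin 49.
lower = (0 - offset) * bin_widths
upper = (n_bins - offset) * bin_widths
for name, lo, hi in zip(["x", "x_dot", "theta", "theta_dot"], lower, upper):
    print(f"{name}: [{lo:.2f}, {hi:.2f})")
```

The position bins span roughly $\pm 2.5$ and the angle bins roughly $\pm 0.25$ radians, so both cover the region in which the simulation keeps running. Velocities beyond $\pm 6.25$ share the outermost bins. In the observation shown above, the angle of about $-0.2505$ lies just below the lower edge, which is why it is clipped into bin 0.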

In [12]:

lr = 0.1  # learning rate
gamma = 0.95  # discount factor
episodes = 10000
total_reward = 0  # reward accumulated since the last report

epsilon = 1  # initial probability of taking a random action
epsilon_decay_value = 0.9999  # multiplicative decay applied after each episode


env = gym.make("CartPole-v1")
for episode in range(episodes + 1):  # go through the episodes
    discrete_state = get_discrete_state(env.reset()[0])  # discretized initial state of the new episode
    done = False
    episode_reward = 0  # reward starts at 0 for each episode

    while not done:

        if np.random.random() > epsilon:
            action = np.argmax(q_table[discrete_state])  # take the greedy action
        else:
            action = np.random.randint(0, env.action_space.n)  # take a random action

        # step the environment to get the new state, the reward, and the end-of-episode flags
        new_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated  # also stop when the 500-step limit is reached

        episode_reward += reward  # add the reward

        new_discrete_state = get_discrete_state(new_state)

        if not terminated:  # update the Q table; transitions into a terminal state are skipped
            q_table[discrete_state + (action,)] = (1 - lr) * q_table[discrete_state + (action,)] + \
                                                     lr * (reward + gamma * np.max(q_table[new_discrete_state]))

        discrete_state = new_discrete_state

    epsilon = epsilon * epsilon_decay_value  # explore a little less in the next episode

    total_reward += episode_reward  # running total since the last report

    if episode % 1000 == 0:  # every 1000 episodes, print the mean reward
        # at episode 0 only one episode has run, so dividing by 1000 understates the mean
        print("Episode: " + str(episode))
        mean_reward = total_reward / 1000
        print("Mean Reward: " + str(mean_reward))
        total_reward = 0

env.close()

Episode: 0
Mean Reward: 0.013
Episode: 1000
Mean Reward: 22.501
Episode: 2000
Mean Reward: 23.05
Episode: 3000
Mean Reward: 25.832
Episode: 4000
Mean Reward: 30.506
Episode: 5000
Mean Reward: 35.415
Episode: 6000
Mean Reward: 41.766
Episode: 7000
Mean Reward: 49.925
Episode: 8000
Mean Reward: 54.54
Episode: 9000
Mean Reward: 59.964
Episode: 10000
Mean Reward: 67.219
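
When reading these numbers, it helps to know how much exploration is still happening at the end of training. The exploration rate is multiplied by the decay value once per episode, so its final value follows directly from the settings above (the variable `final_epsilon` is introduced here for illustration):

```python
epsilon = 1.0
epsilon_decay_value = 0.9999
episodes = 10000

# One multiplicative decay per episode, for episodes 0 through 10000 inclusive.
final_epsilon = epsilon * epsilon_decay_value ** (episodes + 1)
print(round(final_epsilon, 3))
```

The result is about 0.37, so roughly one action in three is still random in the last training episodes. The mean rewards printed above therefore describe the exploring agent, and a purely greedy rollout of the learned table may score differently.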


In [13]:
env = gym.make("CartPole-v1", render_mode='rgb_array')
env = RecordVideo(env, video_folder="qtable_videos") # To record

state, info = env.reset()

for j in range(1000):
    discrete_state = get_discrete_state(state)
    action = np.argmax(q_table[discrete_state])
    state, reward, terminated, truncated, info = env.step(action)

    if terminated or truncated:
        break

env.close()

In [14]:
video_path = "qtable_videos/rl-video-episode-0.mp4"
show_video(video_path)
