# Assignment - Reinforcement Learning with CartPole

In this notebook, we'll explore reinforcement learning (RL) using the CartPole environment from the Gymnasium library. RL is a type of machine learning where an agent learns to make decisions by interacting with an environment and receiving rewards or penalties for its actions.

The CartPole environment is a classic control problem in which the agent controls a cart and must keep a pole balanced upright on it by pushing the cart left or right. The goal is to prevent the pole from falling while keeping the cart within the track boundaries.
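To make "falling" and "track boundaries" concrete, here is a minimal sketch of the failure check. The thresholds (12 degrees for the pole, 2.4 units for the cart) are assumptions taken from the Gymnasium documentation for CartPole-v1, not from this notebook, and `episode_failed` is a hypothetical helper rather than part of Gymnasium.

```python
import math

# Assumed CartPole-v1 limits (from the Gymnasium documentation, not from this notebook)
ANGLE_LIMIT_RAD = math.radians(12)  # pole may tilt at most 12 degrees from vertical
POSITION_LIMIT = 2.4                # cart center must stay within +/-2.4 of the track center


def episode_failed(cart_position: float, pole_angle: float) -> bool:
    """Return True if the pole has fallen or the cart has left the track (hypothetical helper)."""
    return abs(cart_position) > POSITION_LIMIT or abs(pole_angle) > ANGLE_LIMIT_RAD


print(episode_failed(0.04, 0.03))  # near the center and nearly upright
print(episode_failed(0.04, 0.25))  # pole tilted roughly 14 degrees
print(episode_failed(2.5, 0.0))    # cart past the edge of the track
```

Only the cart position and pole angle matter for this check; the two velocities can take any value.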

First, let's install and import the necessary libraries:

In [6]:
# this cell may take a few minutes to run depending on your system
#!pip install gymnasium
#!pip install stable_baselines3

In [None]:
import numpy
numpy.__version__

'1.26.4'

In [1]:
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.evaluation import evaluate_policy

We'll use the Gymnasium library for the CartPole environment and Stable-Baselines3 (imported as `stable_baselines3`) for the PPO algorithm.

Gymnasium provides a collection of reinforcement learning environments behind a common interface, so agents can interact with all of them in the same way. Stable-Baselines3 provides implementations of widely used reinforcement learning algorithms, along with tools for training and evaluating agents.

Now, let's create the CartPole environment with a human-rendered mode.

In [3]:
env = gym.make('CartPole-v1', render_mode='human')

This creates a visual representation of the CartPole environment, allowing us to observe the agent's actions and the environment's state.

Run the following code to watch a random agent in the CartPole environment for 3 episodes.

In [4]:
for i in range(3):
    obs = env.reset()
    done = False
    score = 0
    
    while not done:
        env.render()
        action = env.action_space.sample() # Select a random action from the list of possible actions
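        # step() returns five values; the episode ends on failure (terminated) or on the time limit (truncated)
        obs, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated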
        score += reward

env.close()

Observe how the cart and pole react to random actions. Notice that the pole falls quickly, and the cart moves randomly, unable to balance the pole.

After seeing the random agent's behavior, we'll create a new environment without the human-rendered mode for faster training:

In [5]:
env = gym.make('CartPole-v1')

Before we start training agents, let's explore the CartPole environment and understand its properties:

In [6]:
print("Environment Information:")
print(f"Observation Space: {env.observation_space}")
print(f"Action Space: {env.action_space}")

Environment Information:
Observation Space: Box([-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38], [4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38], (4,), float32)
Action Space: Discrete(2)
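
The position and angle bounds in the printed `Box` look arbitrary until they are converted. The sketch below assumes the documented CartPole-v1 termination limits of 2.4 for cart position and 12 degrees for pole angle (these limits do not appear in the output above). Under that assumption, each bound is twice the corresponding limit, and the two velocity bounds are simply the largest finite float32 value, meaning the velocities are effectively unbounded.

```python
import math

# Bounds copied from the printed observation space
position_bound = 4.8000002
angle_bound = 4.1887903e-01
velocity_bound = 3.4028235e+38

# Assumed termination limits from the CartPole-v1 documentation
position_limit = 2.4
angle_limit_deg = 12

print(position_bound / position_limit)              # ratio of bound to limit, about 2
print(math.degrees(angle_bound) / angle_limit_deg)  # about 2 as well

# Largest finite float32 value, computed from its bit layout
float32_max = (2 - 2**-23) * 2.0**127
print(math.isclose(velocity_bound, float32_max, rel_tol=1e-7))
```

An observation can therefore lie inside the `Box` and still end the episode.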


The observation space represents the information the agent receives about the environment's state. The action space represents the actions the agent can take.

The next cell resets the environment and prints the initial observation, which consists of four values: the cart position, cart velocity, pole angle, and pole angular velocity.

In [7]:
obs, info = env.reset()  # reset() returns the observation and an info dict
print("\nInitial Observation:")
print('Cart Position', obs[0])
print('Cart Velocity', obs[1])
print('Pole Angle', obs[2])
print('Pole Angular Velocity', obs[3])


Initial Observation:
Cart Position 0.040557783
Cart Velocity -0.033332333
Pole Angle 0.030406915
Pole Angular Velocity 0.012443069


The next cell prints the data type of the action space (integers) and the available actions: 0 pushes the cart to the left and 1 pushes it to the right.

In [8]:
print("Action Space:")
print("Action Data Type:", env.action_space.dtype)

for action in range(env.action_space.n):
    print(action)

Action Space:
Action Data Type: int64
0
1


Now, let's run a random agent for 10 episodes and record its performance:

In [9]:
env.close()
env = gym.make('CartPole-v1', render_mode='human')
episodes = 10

for episode in range(1, episodes + 1):
    obs, _ = env.reset()
    done = False
    score = 0

    while not done:
        env.render()
        action = env.action_space.sample() # Select a random action
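        # step() returns five values; the episode ends on failure (terminated) or on the time limit (truncated)
        obs, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated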
        score += reward
    
    print(f'Episode: {episode} Score: {score}')
        
env.close()

Episode: 1 Score: 17.0
Episode: 2 Score: 38.0
Episode: 3 Score: 14.0
Episode: 4 Score: 13.0
Episode: 5 Score: 14.0
Episode: 6 Score: 12.0
Episode: 7 Score: 40.0
Episode: 8 Score: 18.0
Episode: 9 Score: 10.0
Episode: 10 Score: 21.0


The cell above runs a random agent for 10 episodes and prints the score for each one. The agent earns a reward of 1 for every time step the episode continues, so the score equals the number of time steps the pole stayed balanced.
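The ten scores printed above can be condensed into a mean and a standard deviation, which gives a baseline to compare a trained agent against. This is a small sketch using only the standard library. The population standard deviation is used here as an assumption about how such summaries are usually reported.

```python
from statistics import mean, pstdev

# Scores copied from the random-agent run above
random_scores = [17.0, 38.0, 14.0, 13.0, 14.0, 12.0, 40.0, 18.0, 10.0, 21.0]

mean_score = mean(random_scores)
std_score = pstdev(random_scores)  # population standard deviation

print(f"Mean score: {mean_score:.1f}")
print(f"Standard deviation: {std_score:.1f}")
```

A random agent in this run lasts about 20 time steps on average, with a spread that is large relative to the mean.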

Now, we'll train a more intelligent agent using the Proximal Policy Optimization (PPO) algorithm. We start by creating a new environment and wrapping it with DummyVecEnv to make it compatible with the stable_baselines3 library (don't worry about the details of this for now).

In [10]:
env = gym.make('CartPole-v1')
env = DummyVecEnv([lambda: env])

Next we'll create a PPO model with an MlpPolicy and specify the environment. MLP stands for multilayer perceptron. An MLP is a type of feedforward artificial neural network consisting of multiple layers of interconnected nodes (perceptrons).

In the context of reinforcement learning, the MLP policy represents the agent's decision-making model. It takes the environment's state (the observation) as input and outputs a probability distribution over actions. The MLP architecture allows the agent to learn complex relationships between states and actions.
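To illustrate "observation in, action probabilities out", here is a deliberately tiny sketch: a single linear layer followed by a softmax, written in plain Python. It is not the network that `MlpPolicy` builds, which has hidden layers and learned weights. The weights below are made-up values chosen only for illustration, and `action_probabilities` is a hypothetical helper.

```python
import math
import random

# Made-up weights for illustration: one row per action, one column per observation component
WEIGHTS = [
    [0.0, 0.0, -5.0, -1.0],  # action 0: favored when pole angle and angular velocity are negative
    [0.0, 0.0, 5.0, 1.0],    # action 1: favored when pole angle and angular velocity are positive
]


def action_probabilities(obs):
    """Map an observation to one probability per action via a linear layer and a softmax."""
    logits = [sum(w * x for w, x in zip(row, obs)) for row in WEIGHTS]
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - max_logit) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Initial observation printed earlier in the notebook
obs = [0.040557783, -0.033332333, 0.030406915, 0.012443069]
probs = action_probabilities(obs)
print(probs)

# A stochastic policy samples an action according to these probabilities
action = random.choices([0, 1], weights=probs)[0]
print(action)
```

Training changes the weights so that actions leading to longer episodes become more probable.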

In [11]:
model = PPO('MlpPolicy', env, verbose=1)

Using cuda device


Now that we've set up the environment and model, we can start training. The next line trains the PPO model for 10,000 time steps, during which the agent learns to balance the pole by receiving rewards or penalties for its actions and adjusting the model accordingly.

In [12]:
model.learn(10000)

-----------------------------
| time/              |      |
|    fps             | 1011 |
|    iterations      | 1    |
|    time_elapsed    | 2    |
|    total_timesteps | 2048 |
-----------------------------
-----------------------------------------
| time/                   |             |
|    fps                  | 816         |
|    iterations           | 2           |
|    time_elapsed         | 5           |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.009234609 |
|    clip_fraction        | 0.0879      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.687      |
|    explained_variance   | -0.000863   |
|    learning_rate        | 0.0003      |
|    loss                 | 6.67        |
|    n_updates            | 10          |
|    policy_gradient_loss | -0.013      |
|    value_loss           | 57.4        |
-----------------------------------------

<stable_baselines3.ppo.ppo.PPO at 0x27a122406d0>
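The log above advances in blocks of 2,048 time steps per iteration. Assuming, as the log suggests, that PPO always completes a whole block before checking whether the requested total has been reached, a request for 10,000 time steps is rounded up to the next multiple of 2,048. The arithmetic can be sketched as:

```python
import math

requested_timesteps = 10_000
steps_per_iteration = 2_048  # block size read off the training log

# Round up to whole iterations, since a partial block is assumed never to be collected
iterations = math.ceil(requested_timesteps / steps_per_iteration)
actual_timesteps = iterations * steps_per_iteration

print(iterations)        # 5
print(actual_timesteps)  # 10240
```

When comparing training budgets, it is easiest to think in multiples of this block size.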

The next cell evaluates the trained model's performance over 10 episodes and prints the mean reward and its standard deviation.

In [None]:
results = evaluate_policy(model, env, n_eval_episodes=10)
print("Mean Reward:", results[0])
print("Standard Deviation:", results[1])

**Your Turn:** In the cell below, create a new PPO model and store it in the variable `model2`. Try out a different number of time steps during training and see how it impacts the mean reward.

In [None]:
# create the model
model2 = 

# train the model

# evaluate the model

Finally, let's visualize the trained agent's performance:

In [None]:
env.close()
env = gym.make('CartPole-v1', render_mode="human")
episodes = 3
max_duration = 200
for episode in range(1, episodes + 1):
    obs, _ = env.reset()
    done = False
    score = 0
    i = 0
    while not done and i < max_duration:
        i = i + 1
        env.render()
        action, _ = model.predict(obs)
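        # step() returns five values; the episode ends on failure (terminated) or on the time limit (truncated)
        obs, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated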
        score += reward
    print(f'Episode: {episode} Score: {score}')
env.close()

The cell above runs the trained agent for 3 episodes and prints the score for each one. Each episode is also capped at 200 time steps. You should observe that the trained agent balances the pole for longer than the random agent did.

**Your Turn**: Run the visualization loop using your `model2` agent to predict the next action.

In [None]:
env.close()
env = gym.make('CartPole-v1', render_mode="human")
episodes = 3
max_duration = 200
for episode in range(1, episodes + 1):
    obs, _ = env.reset()
    done = False
    score = 0
    i = 0
    while not done and i < max_duration:
        i = i + 1
        env.render()
        action, _ = # predict the next action using your model
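        # step() returns five values; the episode ends on failure (terminated) or on the time limit (truncated)
        obs, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated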
        score += reward
    print(f'Episode: {episode} Score: {score}')
env.close()