# SC3000/CZ3005 Assignment 1 - Balancing a Pole on a Cart

Submission Deadline: 11:59 PM, 5 April 2024

## Group Members and Contributions:

1. Bryan Toh Wee Sheng
2. Lim Shaojun
3. Keith Lim En Kai (U2220506C)

# 1. Problem description: (Taken from Assignment Document)
A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The pendulum is placed upright on the cart and the goal is to balance the pole by applying forces in the left and right direction on the cart. In this project, you will need to develop a Reinforcement Learning (RL) agent. The trained agent decides whether to push the cart to the left or right based on four parameters: the cart position, cart velocity, pole angle, and pole angular velocity.

![Animation of the cart pole environment](https://www.gymlibrary.dev/_images/cart_pole.gif)

## 1.1 Problem Instance
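You are given an instance of the cart pole environment implemented by the gym library. As with any good solution to a problem, we start with an understanding of the problem itself: its action space, observation space, reward, starting state, and episode end conditions.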

### Action Space:
The action is an ndarray with shape (1,) which can take values {0, 1}, indicating pushing the cart to the left or right, respectively. Note that the amount by which the applied force reduces or increases the velocity is not fixed; it depends on the angle the pole is pointing. The center of gravity of the pole varies the amount of energy needed to move the cart underneath it.

### Observation Space:
The observation is an ndarray with shape (4,) with the values corresponding to the following positions and velocities:

| Num | Observation | Min | Max |
|---|---|---|---|
| 1 | Cart Position | -4.8 | 4.8 |
| 2 | Cart Velocity | -inf | inf |
| 3 | Pole Angle | ~ -0.418 rad (-24°) | ~ 0.418 rad (24°) |
| 4 | Pole Angular Velocity | -inf | inf |

### Reward:
Since the goal is to keep the pole upright for as long as possible, a reward of +1 for every step taken, including the termination step, is allotted.

### Starting State:
All observations are assigned a uniformly random value in **(-0.05, 0.05)**
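
As a rough illustration of this initialization, the sketch below draws four values uniformly from that interval using only the standard library. The helper name `sample_start_state` is hypothetical; the environment performs the equivalent sampling internally when `reset()` is called.

```python
import random

def sample_start_state(low=-0.05, high=0.05, rng=None):
    """Hypothetical helper: return [cart position, cart velocity,
    pole angle, pole angular velocity], each drawn uniformly."""
    rng = rng or random.Random()
    return [rng.uniform(low, high) for _ in range(4)]

# A seeded generator makes the illustration repeatable.
state = sample_start_state(rng=random.Random(0))
print(state)
```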

### Episode End:
The episode ends if any one of the following occurs:
1. Termination: Pole angle is outside ±12° (±0.209 rad)
2. Termination: Cart position is outside ±2.4 (the center of the cart reaches the edge of the display)
3. Truncation: Episode length is greater than 500.
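
The three conditions above can be sketched as a small standalone check. The function name `episode_end` and the constants are hypothetical and only mirror the limits listed here; the sketch also assumes the episode is cut off once 500 steps have been taken. The environment applies these rules itself, so this is for illustration only.

```python
import math

ANGLE_LIMIT_RAD = math.radians(12)   # 12 degrees, about 0.209 rad
POSITION_LIMIT = 2.4
MAX_EPISODE_STEPS = 500              # assumption: cut-off once this many steps are taken

def episode_end(cart_position, pole_angle, steps_taken):
    """Hypothetical helper: return 'terminated', 'truncated', or None."""
    if abs(pole_angle) > ANGLE_LIMIT_RAD or abs(cart_position) > POSITION_LIMIT:
        return "terminated"
    if steps_taken >= MAX_EPISODE_STEPS:
        return "truncated"
    return None

print(episode_end(0.0, 0.0, 10))     # still running
print(episode_end(2.5, 0.0, 10))     # cart left the track limits
print(episode_end(0.0, 0.0, 500))    # step limit reached
```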

# 2. Requirements and Guidelines


## Installing Required Libraries and Dependencies:

In [97]:
!pip install stable_baselines3
!pip install shimmy
import gym
from gym import logger as gymlogger
from gym.wrappers import RecordVideo
gymlogger.set_level(40) #error only
import numpy as np
import random
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import math
import glob
import io
import base64
from IPython.display import HTML
from IPython import display as ipythondisplay
from collections import deque
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Nadam
import torch
import os # for creating directories
from torch import nn
from torch.optim import Adam
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.evaluation import evaluate_policy

def show_video():
    ipythondisplay.clear_output(wait=True)
    mp4list = glob.glob('video/*.mp4')
    if len(mp4list) > 0:
        mp4 = mp4list[0]
        with open(mp4, 'rb') as f:   # read-only binary mode is sufficient here
            video = f.read()
        encoded = base64.b64encode(video)
        ipythondisplay.display(HTML(data='''<video alt="test" autoplay
                     loop controls style="height: 400px;">
                     <source src="data:video/mp4;base64,{0}" type="video/mp4" />
                     </video>'''.format(encoded.decode('ascii'))))
    else:
        print("Could not find video")



### Loading Cartpole Environment: (Mini Tutorial taken from example code)

In [98]:
env = gym.make("CartPole-v1")

# Taking a look at the action space
print(env.action_space)

# Taking a look at the observation space
print(env.observation_space)

Discrete(2)
Box([-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38], [4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38], (4,), float32)


Based on the action_space, Discrete(2) means that there are 2 valid discrete actions, 0 and 1, where 0 represents pushing left and 1 represents pushing right.

Based on the observation_space, the first two arrays define the min and max values of the 4 observed values: cart position, cart velocity, pole angle, and pole angular velocity. The bounds of ±3.4028235e+38 are the largest float32 value and stand in for an unbounded range.
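
To keep the four entries straight when reading an observation, one option is to attach names to them. This is a minimal sketch using the standard library; `CartPoleObs` and `label_observation` are hypothetical names, and the numbers are example values of the kind a freshly reset environment returns.

```python
from collections import namedtuple

# Field order follows the observation table: position, velocity, angle, angular velocity.
CartPoleObs = namedtuple(
    "CartPoleObs",
    ["cart_position", "cart_velocity", "pole_angle", "pole_angular_velocity"],
)

def label_observation(values):
    """Hypothetical helper: wrap a length-4 observation in a named tuple."""
    return CartPoleObs(*(float(v) for v in values))

obs = label_observation([0.00424547, 0.02089479, -0.04617178, 0.03780938])
print(obs.pole_angle)   # a negative angle: the third entry of the observation
```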

In [99]:
# We call each round of the pole-balancing game an "episode". At the start of each episode, make sure the environment is reset, which chooses a random initial state,
# e.g., pole slightly tilted to the right. This initialization can be achieved by the code below, which returns the observation of the initial state.
observation = env.reset()

# Taking a look at the initial observations:
print("Initial Observations: ", observation)

Initial Observations:  [ 0.00424547  0.02089479 -0.04617178  0.03780938]


Next, we take a single step in the environment. Passing action 0 to env.step pushes the cart to the left and returns the new observation, the reward for this step, a flag indicating whether the episode has ended, and some additional information.

In [100]:
observation, reward, done, info = env.step(0)[:4]
print("New observations after choosing action 0:", observation)
print("Reward for this step:", reward)
print("Is this round done?", done)

New observations after choosing action 0: [ 0.00466336 -0.17353569 -0.0454156   0.31557462]
Reward for this step: 1.0
Is this round done? False


### Example of Game Run using Naive Strategy:
Now we can play a full round of the game using a naive strategy (always choosing action 0) and show the cumulative reward for the round. Note that the reward returned by env.step(*) is the reward for the current step only, so we have to accumulate it across steps. Clearly, the naive strategy performs poorly: in the run below it survives only 9 steps.

In [101]:
observation = env.reset()
cumulative_reward = 0
done = False
while not done:
    observation, reward, done, info = env.step(0)[:4]
    cumulative_reward += reward
print("Cumulative reward for this round:", cumulative_reward)

Cumulative reward for this round: 9.0


## 2.1 Tasks and Marking Criteria
Now that we have seen how the cartpole game works, we can start using Reinforcement Learning agents to tackle it.

### Task 1: Development of a Reinforcement Learning (RL) Agent. (30 Marks)
Demonstrate the correctness of the implementation by sampling a random state from the cart pole environment, inputting to the agent, and outputting a chosen action. Print the values of the state and chosen action in the Jupyter Notebook.

To tackle this problem, we chose Proximal Policy Optimization (PPO) and built an RL agent on the Stable-Baselines3 implementation of it.

### PPO
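
For orientation, PPO's central idea is a clipped surrogate objective that limits how far a single update can move the policy. The sketch below evaluates that objective for one sample, where `ratio` is the new-to-old action probability ratio and `advantage` is an advantage estimate. The function name is hypothetical and this is not the Stable-Baselines3 code; the default `clip_range=0.2` matches that library's documented default.

```python
def clipped_surrogate(ratio, advantage, clip_range=0.2):
    """Per-sample PPO clipped objective (illustrative sketch only)."""
    # Restrict the probability ratio to [1 - clip_range, 1 + clip_range].
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    # Take the more pessimistic of the unclipped and clipped terms.
    return min(ratio * advantage, clipped_ratio * advantage)

# A large ratio with a positive advantage is capped at the clipped term.
print(clipped_surrogate(1.5, 1.0))
# A small ratio with a negative advantage is also held at the clipped term.
print(clipped_surrogate(0.5, -1.0))
```

Because of the outer `min`, the objective never rewards pushing the ratio outside the clipping interval, which is what keeps the updates conservative.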


In [102]:
env_name = 'CartPole-v1'
# Wrap a single CartPole environment in a vectorized environment for Stable-Baselines3
env = DummyVecEnv([lambda: gym.make(env_name)])

# model = PPO('MlpPolicy', env, verbose=1, device="cuda") # run this if you have an Nvidia GPU installed
model = PPO('MlpPolicy', env, verbose=1, device="auto")   # otherwise run this instead

Using cpu device


In [104]:
# total_timesteps is the number of env.step(action) calls made during training
model.learn(total_timesteps=10000, progress_bar=True)

ValueError: too many values to unpack (expected 2)
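
One plausible reading of this error, which the truncated traceback does not confirm, is an API mismatch on `reset()`: the earlier cell shows `env.reset()` returning a bare 4-element observation, whereas newer Gym and Gymnasium style APIs return an `(observation, info)` pair. The sketch below reproduces that unpacking failure with a stand-in function and shows a small adapter. Both `legacy_reset` and `reset_with_info` are hypothetical names and not part of any library.

```python
def legacy_reset():
    """Stand-in for an older-style reset() that returns only the observation."""
    return [0.0, 0.0, 0.0, 0.0]

def reset_with_info(reset_fn):
    """Hypothetical adapter: always return an (observation, info) pair."""
    result = reset_fn()
    if isinstance(result, tuple) and len(result) == 2 and isinstance(result[1], dict):
        return result            # already in (observation, info) form
    return result, {}            # older form: add an empty info dict

try:
    obs, info = legacy_reset()   # four values unpacked into two names
except ValueError as err:
    message = str(err)
    print("ValueError:", message)

obs, info = reset_with_info(legacy_reset)
print(obs, info)
```

If this is indeed the cause, creating the environment with a library version whose `reset()` matches what the training library expects is likely the cleaner fix than an adapter.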

In [None]:
observation = env.reset()
action, _ = model.predict(observation)
print("Observation space is: ", observation)
print("Action taken is: ", action)

### Task 2: Demonstrate the effectiveness of the RL Agent (40 Marks)
Run for 100 episodes (reset the environment at the beginning of each episode) and plot the cumulative reward against all episodes in the Jupyter Notebook. Print the average reward over the 100 episodes. The average reward should be larger than **195**.
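
The pass criterion can be written as a small check over the list of per-episode scores. This is a minimal sketch; `summarize_scores` is a hypothetical helper and the scores in the usage lines are made-up placeholders, not results from the agent.

```python
def summarize_scores(scores, threshold=195.0):
    """Hypothetical helper: return (average score, whether it exceeds the threshold)."""
    average = sum(scores) / len(scores)
    return average, average > threshold

# Placeholder scores for illustration only.
average, passed = summarize_scores([200.0, 180.0, 500.0, 210.0])
print(average, passed)
```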

In [None]:
sum_episode_scores = []

for episode in range(1, 101):       ## 100 episodes in total
    score = 0.0                     ## cumulative reward for this episode
    obs = env.reset()               ## initial observation from the vectorized env
    done = False                    ## becomes True when the episode terminates or is truncated

    while not done:
        action, _ = model.predict(obs)
        obs, rewards, dones, infos = env.step(action)      ## apply action
        ## DummyVecEnv resets automatically when an episode ends, so the returned
        ## observation cannot be used to detect the end; rely on the done flag instead.
        score += float(rewards[0])
        done = bool(dones[0])

    print('Episode:', episode, ';   Score:', score)
    sum_episode_scores.append(score)


print("Average score is ", sum(sum_episode_scores) / len(sum_episode_scores))

env.close()

In [None]:
plt.plot(sum_episode_scores)
plt.title("Cumulative reward for each episode")
plt.ylabel("Cumulative reward")
plt.xlabel("Episode")
plt.show()

### Task 3: Render one episode played by the developed RL agent on the Jupyter Notebook (10 Marks)

In [None]:
env = RecordVideo(gym.make("CartPole-v1"), "./video")
observation = env.reset()
total = 0
steps = 0
done = False
while not done and steps < 500:     ## stop on episode end or after 500 steps
    env.render()
    action = model.predict(observation)[0]
    observation, reward, done, info = env.step(int(action))[:4]      ## apply action
    total += reward                 ## count every step, including the final one
    steps += 1

env.close()
show_video()

In [None]:
print("Episode reward is ", total)

### Task 4: Format the Jupyter Notebook by including step-by-step instructions and explanations, such that the notebook is easy to follow and run (20 Marks)
Include text explanations to demonstrate the originality of your implementation and your understanding of the code. For example, for each task, explain your approach and analyze the output; if you improve an existing approach, explain your improvements.

## 2.2 Output Format
All code is to be included in a single Jupyter notebook written in Python (i.e., .ipynb file).
1.	Include all code for Tasks 1-4. Note that the submission is invalid if it only contains the outputs and plots without the code that produces them.
2.	Run the notebook before submission to save the output in the notebook, so that by opening the ipynb file (without running it), one can see the outputs and plots for Tasks 1-3.
3.	Make sure the Jupyter notebook is runnable, i.e., by running each code block sequentially from top to bottom, one can get the results for Task 1-3. The TAs may run your notebook.
4.	Unless you are experienced with Jupyter, it is recommended to modify from the provided Jupyter notebook sample, rather than creating a new one.
5.	If the developed RL agent is a trainable neural network, submit a .zip file by zipping the trained model parameters (e.g., .pth for PyTorch) and the ipynb file. In this case, your notebook must include the training code and model loading code.
6.	Contribution: Please clearly state the contribution of each team member in the beginning of the Jupyter notebook if you have more than one member in your team.