# SC3000/CZ3005 Assignment 1 - Balancing a Pole on a Cart

Submission Deadline: 11:59 PM, 5 April 2024

## Group Members and Contributions:

1. Bryan Toh Wee Sheng
2. Lim Shaojun
3. Keith Lim En Kai (U2220506C)

# 1. Problem description: (Taken from Assignment Document)
A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The pendulum is placed upright on the cart and the goal is to balance the pole by applying forces in the left and right direction on the cart. In this project, you will need to develop a Reinforcement Learning (RL) agent. The trained agent decides whether to push the cart to the left or right based on four parameters: the cart position, the cart velocity, the pole angle, and the pole angular velocity.

![Alternative Text](https://www.gymlibrary.dev/_images/cart_pole.gif)

## 1.1 Problem Instance
You are given an instance of the cart pole environment implemented by the gym library. As with any good solution to a problem, we start with 

### Action Space:
The action is an `ndarray` with shape `(1,)` that takes a value in {0, 1}, indicating a push of the cart to the left or to the right respectively. Note that the change in velocity produced by the applied force is not fixed; it depends on the angle at which the pole is pointing. The center of gravity of the pole varies the amount of energy needed to move the cart underneath it.

### Observation Space:
The observation is an `ndarray` with shape `(4,)` whose values correspond to the following positions and velocities:

| Num | Observation | Min | Max |
|---|---|---|---|
| 1 | Cart Position | -4.8 | 4.8 |
| 2 | Cart Velocity | -inf | inf |
| 3 | Pole Angle | ~ -0.418 rad (-24°) | ~ 0.418 rad (24°) |
| 4 | Pole Angular Velocity | -inf | inf |
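
As a minimal sketch of how the four values in the table can be read by name, the snippet below wraps an observation in a named tuple. `CartPoleState` and `to_state` are hypothetical helpers introduced here for illustration; they are not part of gym, and the sample values are made up.

```python
from typing import NamedTuple, Sequence


class CartPoleState(NamedTuple):
    """Hypothetical named view of the 4-element observation (not part of gym)."""
    cart_position: float
    cart_velocity: float
    pole_angle: float             # radians
    pole_angular_velocity: float


def to_state(observation: Sequence[float]) -> CartPoleState:
    """Unpack an observation in the order given by the table above."""
    position, velocity, angle, angular_velocity = (float(x) for x in observation)
    return CartPoleState(position, velocity, angle, angular_velocity)


# Illustrative values only, not taken from a real episode
state = to_state([0.01, -0.02, 0.03, 0.04])
print(state.pole_angle)
```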

### Reward:
Since the goal is to keep the pole upright for as long as possible, a reward of +1 for every step taken, including the termination step, is allotted.

### Starting State:
All observations are assigned a uniformly random value in **(-0.05, 0.05)**.
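
The starting rule can be mimicked in a few lines of plain Python. This is only a sketch of the sampling described above, using the standard `random` module; `sample_start_state` is a hypothetical name, and the real environment draws its own start state inside `env.reset()`.

```python
import random


def sample_start_state(low=-0.05, high=0.05, size=4):
    """Draw each of the four observations independently from a uniform range."""
    return [random.uniform(low, high) for _ in range(size)]


start_state = sample_start_state()
print(start_state)
```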

### Episode End:
The episode ends if any one of the following occurs:
1. Termination: Pole angle exceeds ±12° (±0.209 rad)
2. Termination: Cart position exceeds ±2.4 (center of the cart reaches the edge of the display)
3. Truncation: Episode length is greater than 500.
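
The three conditions above can be written as a small check. The sketch below is illustrative only: `episode_status` and the constant names are hypothetical, the environment performs this check itself, and the step limit is assumed to apply once 500 steps have been taken.

```python
import math

# Thresholds quoted in the list above
ANGLE_LIMIT_RAD = math.radians(12)   # about 0.209 rad
POSITION_LIMIT = 2.4
MAX_EPISODE_STEPS = 500              # assumption: the episode is cut once 500 steps are taken


def episode_status(cart_position, pole_angle, steps_taken):
    """Return 'terminated', 'truncated', or 'running' for the given values."""
    if abs(pole_angle) > ANGLE_LIMIT_RAD or abs(cart_position) > POSITION_LIMIT:
        return "terminated"
    if steps_taken >= MAX_EPISODE_STEPS:
        return "truncated"
    return "running"


print(episode_status(cart_position=0.0, pole_angle=0.25, steps_taken=10))
```

Termination means the agent failed to keep the pole up or the cart on the track, while truncation means it survived to the step limit, so the two are worth telling apart when analysing results.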

# 2. Requirements and Guidelines
## 2.1 Tasks and Marking Criteria

### Task 1: Development of a Reinforcement Learning (RL) Agent. (30 Marks)
Demonstrate the correctness of the implementation by sampling a random state from the cart pole environment, inputting to the agent, and outputting a chosen action. Print the values of the state and chosen action in the Jupyter Notebook.

### Task 2: Demonstrate the effectiveness of the RL Agent (40 Marks)
Run for 100 episodes (reset the environment at the beginning of each episode) and plot the cumulative reward against all episodes in the Jupyter Notebook. Print the average reward over the 100 episodes. The average reward should be larger than **195**.

### Task 3: Render one episode played by the developed RL agent on the Jupyter Notebook (10 Marks)
Please refer to the sample code link for rendering code.

### Task 4: Format the Jupyter Notebook by including step-by-step instructions and explanations, such that the notebook is easy to follow and run (20 Marks)
Include text explanations to demonstrate the originality of your implementation and your understanding of the code. For example, for each task, explain your approach and analyze the output; if you improve an existing approach, explain your improvements.

## 2.2 Output Format
All code is to be included in a single Jupyter notebook written in Python (i.e., .ipynb file).
1. Include all code for Tasks 1-4. Note that the submission is invalid if it only contains the outputs and plots without the code to obtain them.
2. Run the notebook before submission to save the output in the notebook, i.e., by opening the ipynb file (without running it), one can see the outputs and plots for Tasks 1-3.
3. Make sure the Jupyter notebook is runnable, i.e., by running each code block sequentially from top to bottom, one can get the results for Tasks 1-3. The TAs may run your notebook.
4. Unless you are experienced with Jupyter, it is recommended to modify the provided Jupyter notebook sample rather than create a new one.
5. If the developed RL agent is a trainable neural network, submit a .zip file containing the trained model parameters (e.g., .pth for PyTorch) and the ipynb file. In this case, your notebook must include the training code and model loading code.
6. Contribution: Please clearly state the contribution of each team member at the beginning of the Jupyter notebook if you have more than one member in your team.

## Installing Required Dependencies and Libraries

In [None]:
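import os
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'  # hide GPUs so TensorFlow/Keras runs on the CPU
import base64
import glob
import io
import random
import gym
import numpy as np
import matplotlib.pyplot as plt
from collections import deque
from IPython import display as ipythondisplay
from IPython.display import HTML
from keras.models import Model, load_model
from keras.layers import Input, Dense
from keras.optimizers import Adam, RMSprop


def show_video():
  """Embed the first recorded .mp4 found in ./video in the notebook output."""
  mp4list = glob.glob('video/*.mp4')
  if len(mp4list) > 0:
    mp4 = mp4list[0]
    # Read the file in binary mode and close it afterwards
    with io.open(mp4, 'rb') as f:
      video = f.read()
    encoded = base64.b64encode(video)
    ipythondisplay.display(HTML(data='''<video alt="test" autoplay 
                loop controls style="height: 400px;">
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
             </video>'''.format(encoded.decode('ascii'))))
  else: 
    print("Could not find video")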


### Creating the CartPole Environment (gym) and Analysing the Action Space

In [None]:
env = gym.make("CartPole-v1")
print('The action space consists of', env.action_space.n, 'actions, left and right represented by 0 and 1 respectively.')

In [None]:
print(env.observation_space)

The observation space is printed above. The first two arrays give the minimum and maximum values of the four observed quantities: cart position, cart velocity, pole angle, and pole angular velocity.

In [None]:
observation = env.reset()
print("Initial observations:", observation)

In [None]:
observation, reward, done, info = env.step(0)[:4]
# print(env.step(0))
print("New observations after choosing action 0:", observation)
print("Reward for this step:", reward)
print("Is this round done?", done)

In [None]:
observation = env.reset()
cumulative_reward = 0
done = False
while not done:
    observation, reward, done, info = env.step(0)[:4]
    cumulative_reward += reward
print("Cumulative reward for this round:", cumulative_reward)
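
The cells above call `env.reset()` directly and slice `env.step(...)[:4]`. Gym releases from 0.26 onwards return `(observation, info)` from `reset()` and a five-element tuple `(observation, reward, terminated, truncated, info)` from `step()`. Under that API the `[:4]` slice binds `done` to `terminated` and `info` to `truncated`, so the loop never sees the 500-step truncation. Which gym version this notebook targets is not stated, so the following is a hedged sketch of version-agnostic wrappers; `reset_env` and `step_env` are hypothetical helper names, not gym functions.

```python
def reset_env(env):
    """Return only the observation from env.reset(), whichever API is in use."""
    result = env.reset()
    # Newer API: reset() returns (observation, info) where info is a dict
    if isinstance(result, tuple) and len(result) == 2 and isinstance(result[1], dict):
        return result[0]
    # Older API: reset() returns the observation directly
    return result


def step_env(env, action):
    """Return (observation, reward, done, info) for both step() signatures."""
    result = env.step(action)
    if len(result) == 5:
        # Newer API: merge terminated and truncated into a single done flag
        observation, reward, terminated, truncated, info = result
        return observation, reward, terminated or truncated, info
    # Older API: already (observation, reward, done, info)
    observation, reward, done, info = result
    return observation, reward, done, info
```

With these helpers, a loop would read `observation = reset_env(env)` followed by `observation, reward, done, info = step_env(env, action)`.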

## Task 1

## Reinforcement Learning Agent (Using _______)

In [None]:
def random_agent(observation):
    return random.randint(0, 1)

In [None]:
observation = env.reset()
action = random_agent(observation)
print("Observation:", observation)
print("Chosen action:", action)

### Task 1 Answer

In [None]:
observation = env.reset()
random_state = env.observation_space.sample()
chosen_action = random_agent(random_state)

# [0]: Cart Position, [1]: Cart Velocity, [2]: Pole Angle, [3]: Pole Angular Velocity
print("Random state: ", random_state)
print("Chosen action: ", chosen_action)

In [None]:
observation, reward, done, info = env.step(chosen_action)[:4]
print("New observations after choosing the chosen action:", observation)
print("Reward for this step:", reward)
print("Is this round done?", done)

Explanation: We sampled a random state from the observation space and passed it to the agent, which returned the action to take. Here the agent is `random_agent`, which ignores the state and picks 0 or 1 uniformly at random.
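
For contrast, an agent whose action actually depends on the state can be sketched as below. This is a hypothetical hand-written baseline, not the group's agent and not a reinforcement learning method: it pushes right (action 1) when the pole angle is positive and left (action 0) otherwise. It takes an observation and returns an action, the same interface as `random_agent`.

```python
def lean_following_agent(observation):
    """Hypothetical non-learning baseline: push toward the side the pole leans."""
    pole_angle = observation[2]  # index 2 is the pole angle in radians
    return 1 if pole_angle > 0 else 0


# Illustrative observations only
print(lean_following_agent([0.0, 0.0, 0.1, 0.0]))
print(lean_following_agent([0.0, 0.0, -0.1, 0.0]))
```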

## Task 2: Demonstrate the effectiveness of the RL agent

In [None]:
# Placeholder data: random integers in [150, 250), not rewards from an actual agent
episode_results = np.random.randint(150, 250, size=100)
plt.plot(episode_results)
plt.title('Cumulative reward for each episode')
plt.ylabel('Cumulative reward')
plt.xlabel('episode')
plt.show()

Printing out the average reward over the 100 episodes.

In [None]:
print("Average cumulative reward:", episode_results.mean())
print("Is my agent good enough?", episode_results.mean() > 195)

### Task 2 Answer

In [None]:
def run_episodes(agent, num_episodes):
    """Run the agent for num_episodes episodes and return each episode's total reward."""
    episode_rewards = []
    for episode in range(num_episodes):
        total_reward = 0
        observation = env.reset()  # reset the environment at the start of every episode
        done = False

        while not done:
            chosen_action = agent(observation)
            observation, reward, done, info = env.step(chosen_action)[:4]
            total_reward += reward

        print(f"Episode {episode + 1} total reward: {total_reward}")
        episode_rewards.append(total_reward)

    return episode_rewards

In [None]:
num_episodes = 100
episode_rewards = run_episodes(random_agent, num_episodes)

# Plot each episode's cumulative reward (its total reward) against the episode number
plt.plot(np.arange(1, num_episodes + 1), episode_rewards)
plt.xlabel("Episode")
plt.ylabel("Cumulative Reward")
plt.title("Cumulative Reward per Episode over 100 Episodes")
plt.show()

In [None]:
average_reward = np.mean(episode_rewards)
print(f"Average reward over {num_episodes} Episodes: {average_reward}")

In [None]:
if average_reward > 195:
    print("The average reward is > 195")
else:
    print("The average reward did not meet the required threshold")

## Task 3: Render one episode played by the agent

In [None]:
import base64
import glob
import io
from IPython import display as ipythondisplay
from IPython.display import HTML


# env = RecordVideo(gym.make("CartPole-v1"), "./video")
# observation = env.reset()
# while True:
#     env.render()
#     action = RL_agent(observation)
#     observation, reward, done, info = env.step(action) [:4]
#     if done: 
#       break
# env.close()
# show_video()

In [None]:
# import matplotlib.pyplot as plt
# %matplotlib inline
# from IPython import display
from gym.wrappers import RecordVideo
# from gym.wrappers import TimeLimit
# from gym.wrappers.monitoring import video_recorder

In [None]:
# env = TimeLimit(gym.make("CartPole-v1"), max_episode_steps=500)
# env = video_recorder.VideoRecorder(env, "./video")
# observation = env.reset()


In [None]:
# render_episode(RL_agent)

In [None]:
env = RecordVideo(gym.make("CartPole-v1"), "./video")
observation = env.reset()
while True:
    env.render()
    #your agent goes here
    action = random_agent(observation)
    observation, reward, done, info = env.step(action)[:4]
    if done:
        break
env.close()
show_video()

In [None]:
# display.stop()

In [None]:
import tensorflow as tf