# <span style="color:blue">Hopper Environment with PPO </span>

by Robin Wolf and Mathias Fuhrer (RKIM)
Project: Reinforcement Learning in module 'Roboterprogrammierung' by Prof. Hein

In [None]:
import gymnasium as gym
import numpy as np
import tensorflow as tf
import tensorflow_addons as tfa
import tensorflow_probability as tfp
import tensorboard
from keras.callbacks import TensorBoard
import os
import datetime
import pygame
import mujoco


physical_devices = tf.config.list_physical_devices('GPU') 
for device in physical_devices:
    tf.config.experimental.set_memory_growth(device, True)

This notebook documents our analysis and step-by-step improvement of the given PPO implementation on OpenAI's Hopper environment. It also includes a detailed comparison of implementation details and agent performance between Stable Baselines3 PPO and our PPO.

**Table of Contents:**
1) Description of the Hopper environment
2) Training an agent in the Hopper environment with our continuous PPO used in the MountainCarContinuous environment
3) Training an agent in the Hopper environment with Stable Baselines3
4) Analysis of differences (performance and implementation)
5) Attempts to improve our continuous PPO with findings from the analysis
6) Lessons Learned

---

### 1) Description of the Hopper-v4 environment

<img src="visu/hopper-env.jpg" alt="Hopper-v4 Environment" width="100"/>

**Observation Space (11 dimensional):**
- Height of the hopper [-Inf … Inf]
- Angle of all joints and the top [-Inf … Inf]
- Angular velocity of all joints and the top [-Inf … Inf]
- Velocity of the top in X and Z of the world [-Inf … Inf]

**Action Space (3 dimensional):** 
- torque applied to the top joint [-1 … 1]
- torque applied to the leg joint [-1 … 1]
- torque applied to the foot joint [-1 … 1]

**Episode End:**
- Termination if the hopper is unhealthy:
    - Hopper has fallen (height outside healthy_z_range)
    - Angle of the top is too large (outside healthy_angle_range)
    - Any other observation is out of range, e.g. the hopper leaves the environment (healthy_state_range)
- Truncation if episode step >= 1000
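
The termination and truncation rules above can be sketched as a small check. This is only an illustration: the range values are the Hopper-v4 defaults as documented for Gymnasium (assumed here, they can be overridden in `gym.make`), and `is_healthy` and `episode_end` are hypothetical helpers, not functions of the library.

```python
import math

# Assumed Hopper-v4 defaults (can be changed via gym.make keyword arguments)
HEALTHY_Z_RANGE = (0.7, math.inf)        # allowed height of the top
HEALTHY_ANGLE_RANGE = (-0.2, 0.2)        # allowed angle of the top in rad
HEALTHY_STATE_RANGE = (-100.0, 100.0)    # allowed range of all other state entries
MAX_EPISODE_STEPS = 1000                 # truncation limit


def is_healthy(z, angle, other_state):
    """Return True if height, angle and all remaining state entries are in range."""
    z_ok = HEALTHY_Z_RANGE[0] < z < HEALTHY_Z_RANGE[1]
    angle_ok = HEALTHY_ANGLE_RANGE[0] < angle < HEALTHY_ANGLE_RANGE[1]
    state_ok = all(HEALTHY_STATE_RANGE[0] < s < HEALTHY_STATE_RANGE[1] for s in other_state)
    return z_ok and angle_ok and state_ok


def episode_end(z, angle, other_state, step):
    """Return the pair (terminated, truncated) for the current step."""
    terminated = not is_healthy(z, angle, other_state)
    truncated = step >= MAX_EPISODE_STEPS
    return terminated, truncated


print(episode_end(1.25, 0.0, [0.0] * 9, step=10))   # upright hopper early in the episode
print(episode_end(0.5, 0.0, [0.0] * 9, step=10))    # hopper has fallen below the height limit
```

An untrained hopper typically violates the height or angle limit after only a few steps, which is why its episodes are so short.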

**Rewards = sum of:**
- Healthy reward (hopper is not terminated)
- Forward_reward: positive if the hopper hops to the right  
  forward_reward_weight * (x after action - x before action) / dt
- Ctrl_cost (subtracted): penalizes large actions  
  ctrl_cost_weight * sum(action²)
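
As a worked example, the reward of a single step can be computed as follows. The sketch takes the displacement as position after minus position before the action, so that moving to the right is rewarded, and subtracts the control cost. The default weights and `dt = 0.008` are assumptions taken from the Gymnasium defaults for Hopper-v4, and `hopper_reward` is a hypothetical helper, not a library function.

```python
def hopper_reward(x_before, x_after, action, dt=0.008, healthy=True,
                  forward_reward_weight=1.0, ctrl_cost_weight=1e-3,
                  healthy_reward=1.0):
    """Simplified per-step reward: healthy bonus + forward reward - control cost."""
    # positive if the hopper moved to the right during this step
    forward_reward = forward_reward_weight * (x_after - x_before) / dt
    # quadratic penalty on the applied torques
    ctrl_cost = ctrl_cost_weight * sum(a * a for a in action)
    alive_bonus = healthy_reward if healthy else 0.0
    return alive_bonus + forward_reward - ctrl_cost


# hopper moves 0.008 m to the right while applying full torque on one joint
reward = hopper_reward(x_before=0.0, x_after=0.008, action=[1.0, 0.0, 0.0])
print(reward)
```

With these numbers the healthy bonus and the forward reward each contribute 1.0, while the control cost removes 0.001.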



---

### 2) Training of an Agent with our continuous PPO

The code used here was originally written for discrete action spaces and was adapted to continuous action spaces on the MountainCar environments.

To understand which changes were needed to switch to a continuous action space, please refer to the provided MountainCar notebook.

To run this algorithm on the Hopper environment, we only had to adapt the train_agent method, because the hopper behaves differently during the rollouts (gathering experience from the environment).

**Differences in the train methods:**
- MountainCar and MountainCarContinuous terminate when the car reaches the goal at the top and truncate after a fixed number of timesteps.
- Hopper terminates if it is unhealthy (see description in 1) or truncates after 1000 timesteps.
    - Because an untrained hopper falls almost instantly after spawning in the environment, the rollouts are very short. If the training method gathers experience for a fixed number of rollouts, there is hardly any data to learn from.

--> Our solution was to let the agent gather experience for a fixed number of timesteps instead of a fixed number of rollouts.

For further coding details please refer to the train_agent.py module.
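
The idea of step-based experience gathering can be illustrated with a minimal sketch. It is not the code from train_agent.py: `ToyEnv` is a hypothetical stand-in that only mimics the Gymnasium `reset`/`step` interface and ends every episode after three steps, similar to an untrained hopper that falls immediately.

```python
import random


class ToyEnv:
    """Hypothetical stand-in environment: every episode terminates after 3 steps."""

    def reset(self):
        self.t = 0
        return 0.0, {}                        # observation, info

    def step(self, action):
        self.t += 1
        terminated = self.t >= 3              # "hopper has fallen"
        return float(self.t), 1.0, terminated, False, {}


def collect_steps(env, policy, n_steps):
    """Gather exactly n_steps transitions, resetting the env whenever an episode ends."""
    buffer = []
    episodes = 0
    obs, _ = env.reset()
    for _ in range(n_steps):
        action = policy(obs)
        next_obs, reward, terminated, truncated, _ = env.step(action)
        buffer.append((obs, action, reward, next_obs, terminated))
        if terminated or truncated:
            episodes += 1
            obs, _ = env.reset()              # start a new rollout, keep filling the buffer
        else:
            obs = next_obs
    return buffer, episodes


buffer, episodes = collect_steps(ToyEnv(), lambda obs: random.uniform(-1.0, 1.0), n_steps=2048)
print(len(buffer), "transitions from", episodes, "finished episodes")
```

The buffer size is now independent of the episode length: short episodes simply mean more resets per epoch instead of less training data.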

<u> Initializing and Training </u>

In [None]:
# create a directory (./model) to save all trained agents with a timestamp
jetzt = datetime.datetime.now()
datum_uhrzeit = jetzt.strftime("%Y%m%d_%H%M%S")
# os.path.join keeps the paths portable between Windows and Linux
savedir = os.path.join('model', f'YOUR_AGENT_NAME_HERE_{datum_uhrzeit}')
os.makedirs('model', exist_ok=True)
os.makedirs(savedir, exist_ok=True)

# create a directory to save logs from TensorBoard (visualization and analysis tool from TensorFlow)
log_dir1 = os.path.join(savedir, 'log')
os.makedirs(log_dir1, exist_ok=True)

# user feedback: confirm that the log directory exists and show its absolute path
# (the timestamp in the name prevents overwriting the log data of earlier runs)
if os.path.exists(log_dir1):
    print(f"The directory {log_dir1} exists.")
    absolute_path = os.path.abspath(log_dir1)
    print(absolute_path)
else:
    print(f"The directory {log_dir1} does not exist.")

In [None]:
# Parameters for the actor and critic networks --> defaults used in the MountainCarContinuous notebook
actor_learning_rate = 0.00025   # learning rate for the actor
critic_learning_rate = 0.001    # learning rate for the critic (should be larger than the actor's)
# If the actor changes faster than the critic, the estimated value no longer represents the
# current policy, because that estimate is based on past policies

# Parameters for the agent
gamma = 0.99                    # discount factor
epsilon = 0.2                   # clip range for the actor loss function

# Parameters for training (n_rollouts NOT USED)
epochs = 50                     # number of learning iterations
n_steps = 2048                  # number of environment steps per epoch (rollout-based collection fails on hopper)
batch_size = 16                 # number of samples per learning step
learn_steps = 10                # number of learning steps per epoch

In [None]:
# initialize the environment from gym
# Note: all reward weights of the hopper environment are left at their defaults!
env = gym.make('Hopper-v4', render_mode='rgb_array')

# user feedback about the bounds of the observation and action space
print('Observation LOW: ', env.observation_space.low)
print('Observation HIGH: ', env.observation_space.high)

print('Action LOW: ', env.action_space.low)
print('Action HIGH: ', env.action_space.high)

In [None]:
from lib.PPOAgentContinuous import PPOAgentContinuous as PPOAgent
agent = PPOAgent(env.action_space, env.observation_space, gamma, epsilon, actor_learning_rate, critic_learning_rate) 

In [None]:
# train the agent
from lib.train_agent import training_steps
training_steps(env, agent, log_dir1, epochs, n_steps, batch_size, learn_steps, render=False)

# this method provides continuous feedback about the training (e.g. steps, epoch, rollouts, actor loss, critic loss)
# to save runtime it's recommended to set the render flag to False while training

In [None]:
# save the agent in h5 format
filepath_actor = os.path.join(savedir, "actor.h5")
filepath_critic = os.path.join(savedir, "critic.h5")

# this method saves the trained weights of the actor and critic networks to the folder structure created above
agent.save_models(filepath_actor, filepath_critic)

<u> Reloading and Rendering </u>

In [None]:
# import the rendering function (self-written) for the gym framework
from lib.render_GUI import render_GUI_GYM

# set up the environment and load the trained agent from the directory
# render_mode='human' opens an external window that shows the agent acting in the environment
render_env = gym.make('Hopper-v4', render_mode='human')

# set up an agent into which the saved weights are loaded
render_agent = PPOAgent(render_env.action_space, render_env.observation_space)
render_agent._init_networks()

filepath_actor = os.path.join("model", "YOUR_AGENT_NAME_HERE", "actor.h5")
filepath_critic = os.path.join("model", "YOUR_AGENT_NAME_HERE", "critic.h5")

# load the saved weights into the initialized model
render_agent.load_models(filepath_actor, filepath_critic)

# call the render function
# Note: to close this window you have to interrupt this running cell; closing the window with the red x doesn't work
render_GUI_GYM(render_env, render_agent)

<u> Example for a trained agent </u>

Note: embedded videos are created with *PPO_renderung_video.ipynb*

Trained agent with given hyperparameters in the env:  
<video width="200" height="200" controls>
  <source src="visu/Video-standard.mp4" type="video/mp4">
  Your browser does not support the video tag.
</video>

Learning curve (x-axis = epochs/ y-axis = return):  
<img src="visu/Learning-curve-standard.jpg" alt="Learning Curve Standard" width="400"/>

Note: Training was interrupted after 10 epochs, because no improvement was expected given this incorrect behavior.

<u> Observations to mention </u>

- The standard implementation (from MountainCarContinuous) doesn't work at all.
- The policy seems to become very deterministic (head falls back, instant end of episode).
- The agent doesn't gather enough experience to learn jumping, or the task is too complex to learn.
- Changing hyperparameters and learning parameters like batch_size, learn_steps, ... --> no improvement

---

### 3) Training of an agent with stable baselines 3 PPO

<u> Initializing, Training and Saving </u>

In [None]:
# import stable baselines algorithms (installation of the package recommended)
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

In [None]:
# create a directory (./baselines_model) to save all trained agents with a timestamp
jetzt = datetime.datetime.now()
datum_uhrzeit = jetzt.strftime("%Y%m%d_%H%M%S")
savedir = os.path.join('baselines_model', f'YOUR_AGENT_NAME_HERE_{datum_uhrzeit}')
os.makedirs('baselines_model', exist_ok=True)
os.makedirs(savedir, exist_ok=True)

# create a directory to save logs from TensorBoard (visualization and analysis tool from TensorFlow)
log_dir = os.path.join(savedir, 'log')
os.makedirs(log_dir, exist_ok=True)   # fixed: the original created log_dir1 from the previous section

In [None]:
# set up the environment (in Stable Baselines3 a vectorized environment is preferred)
# --> the vectorized environment follows the SB3 API, which differs from the plain gym API
vec_env = make_vec_env('Hopper-v4')

# set up the agent in the baselines framework (policy, environment, optional: log directory)
model = PPO("MlpPolicy", vec_env, verbose=1, tensorboard_log=log_dir)
# with verbose=1 the model provides continuous feedback about the training process (doesn't affect runtime much)

# call the learn method (training ends when the agent has passed 1 million timesteps in total)
model.learn(total_timesteps=1000000)

# save the model (SB3 appends ".zip", so this creates data.zip inside savedir with all model information)
model.save(os.path.join(savedir, "data"))

<u> Reloading and Rendering </u>

In [None]:
# import the rendering function (self-written) for the SB3 framework / old gym framework
from lib.render_GUI import render_GUI_SB3

# define the file the model is loaded from
loaddir = os.path.join('baselines_model', 'YOUR_AGENT_NAME_HERE', 'data.zip')

# load the model with the SB3 function
render_model = PPO.load(loaddir)

# initialize an environment (must be the same framework as used while training)
render_env = make_vec_env('Hopper-v4')

# call the render function with the loaded SB3 model
# Note: to close this window you have to interrupt this running cell; closing the window with the red x doesn't work
render_GUI_SB3(render_env, render_model)

<u> Example for a trained agent </u>

Trained agent with default hyperparameters from the BaselinesZoo by DLR in the env:      
<video width="200" height="200" controls>
  <source src="visu/Video-SB3.mp4" type="video/mp4">
  Your browser does not support the video tag.
</video>

Learning curve (x-axis = total timesteps/ y-axis = return):  
<img src="visu/Learning-curve-SB3.jpg" alt="Learning Curve SB3" width="400"/>

<u> Observations to mention </u>

- stable training progress
- good performance in this complex environment
- head is always vertical
- hopper takes big jumps to the right, sometimes it leaves the environment

---

### 4) Analysis of the differences in performance and implementation

<u> comparison in performance </u>

- our basic implementation cannot solve the complex hopper environment, SB3 solves it without any problems
- our implementation needs much more runtime to process the same number of timesteps
- our implementation is much more sample-efficient (see MountainCar analysis)

<u> main differences in implementation </u>

- our implementation uses frozen targets, SB3 uses saved variables from the last rollout/minibatch update
- network architecture (both have an actor and a critic network)
    - our implementation: 1 hidden layer with 400 units, 1 with 300
    - SB3: 2 hidden layers with 64 units each
- different loss definition and optimization (both use the Adam optimizer)
    - our implementation: critic loss only related to the advantage estimation (by TD learning), actor loss related to the clipped surrogate objective (ratio*advantage)
    - SB3: both networks are optimized with one combined loss (policy loss (clipped surrogate objective), entropy loss (entropy of the current policy) and value loss (error of the value estimate))
- Policy estimation:
    - in our implementation mean and std of the Gaussian action distribution are outputs of the actor network
    - in SB3 only the mean is an output, the std is implemented as a separate trainable variable.
- SB3 uses standardized observations, we don't
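
To make the difference in the loss definition concrete, here is a minimal sketch of a combined PPO loss in the style of SB3: clipped surrogate policy loss plus weighted entropy and value terms. It is written in plain Python for readability and is not taken from the SB3 source. The function name `ppo_loss` is hypothetical, and the default coefficients (`clip_range=0.2`, `ent_coef=0.0`, `vf_coef=0.5`) are assumed to match the SB3 defaults.

```python
import math


def ppo_loss(logp_new, logp_old, advantages, values, returns, entropy,
             clip_range=0.2, ent_coef=0.0, vf_coef=0.5):
    """Combined loss = policy_loss + ent_coef * entropy_loss + vf_coef * value_loss."""
    n = len(advantages)

    # clipped surrogate objective, evaluated per sample
    surrogate = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)                         # pi_new(a|s) / pi_old(a|s)
        clipped = min(max(ratio, 1.0 - clip_range), 1.0 + clip_range)
        surrogate.append(min(ratio * adv, clipped * adv))         # pessimistic bound
    policy_loss = -sum(surrogate) / n                             # maximize objective = minimize its negative

    # mean squared error between value predictions and return targets
    value_loss = sum((r - v) ** 2 for r, v in zip(returns, values)) / n

    # negative mean entropy: minimizing it encourages exploration
    entropy_loss = -sum(entropy) / n

    return policy_loss + ent_coef * entropy_loss + vf_coef * value_loss


# unchanged policy (ratio = 1), advantages cancel, value error of 1 per sample
print(ppo_loss([0.0, 0.0], [0.0, 0.0], [1.0, -1.0], [0.0, 0.0], [1.0, 1.0], [0.5, 0.5]))
```

In contrast, an implementation with separate actor and critic losses applies two independent gradient steps, one per network.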

We expect that the lack of performance of our implementation is caused by its simplicity. For the MountainCarContinuous it seems to be enough, but it is too basic to deal with the high requirements of such a complex environment as the hopper.

In particular, we assume that the different loss definition, the absence of frozen networks and the standardisation of observations in SB3 are crucial for good agent performance in complex environments.

---

### 5) Improve our PPO Continuous implementation with features from SB3

<u> Key changes </u>

- Defined a trainable log standard deviation (instead of letting the network output the std). We expect this approach to provide more flexibility in shaping the exploration strategy (it should support more exploration)
- Standardized the observations before they are processed by the networks (networks usually handle inputs with zero mean and unit variance better)
- Changed the loss definition for the actor network (adapted to the SB3 approach)
- Implemented the parameter lambda in the critic loss definition (trade-off between bias and variance in the advantage calculation)
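
The role of lambda can be illustrated with generalized advantage estimation (GAE). The sketch below assumes that lambda is the GAE parameter (called `gae_lambda` in SB3) and is not copied from the modified agent. `gae_advantages` is a hypothetical helper name.

```python
def gae_advantages(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Compute GAE(lambda) advantages by iterating backwards over one batch of steps.

    rewards[t], values[t] and dones[t] belong to step t,
    last_value is the value estimate of the state after the final step.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0                  # no bootstrapping across episode ends
        delta = rewards[t] + gamma * next_value * not_done - values[t]   # one-step TD error
        gae = delta + gamma * lam * not_done * gae           # exponentially weighted sum of TD errors
        advantages[t] = gae
        next_value = values[t]
    return advantages


rewards, values, dones = [1.0, 1.0, 1.0], [0.5, 0.5, 0.5], [False, False, True]
print(gae_advantages(rewards, values, last_value=0.0, dones=dones, lam=0.0))   # pure TD errors
print(gae_advantages(rewards, values, last_value=0.0, dones=dones, lam=1.0))   # Monte Carlo style
```

With `lam=0` only the one-step TD error is used (low variance, more bias from the critic). With `lam=1` the full discounted return is used (less bias, higher variance).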

Combining the two networks under one loss definition and replacing the frozen networks were not implemented due to a lack of time and, in particular, computing power.

For more implementation details, please compare the XXXX_modified files with the standard ones.
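
Standardizing observations is usually done with running statistics, because the mean and variance of the observations are not known before training. The following sketch shows one common way to do this (Welford's online algorithm). It is an assumption about how such a step can look, not the code from the modified files, and `RunningStandardizer` is a hypothetical class name.

```python
import math


class RunningStandardizer:
    """Tracks the running mean and variance per observation dimension."""

    def __init__(self, size, eps=1e-8):
        self.count = 0
        self.mean = [0.0] * size
        self.m2 = [0.0] * size        # running sum of squared deviations from the mean
        self.eps = eps                # avoids division by zero for constant dimensions

    def update(self, obs):
        """Include one new observation in the statistics (Welford's algorithm)."""
        self.count += 1
        for i, x in enumerate(obs):
            delta = x - self.mean[i]
            self.mean[i] += delta / self.count
            self.m2[i] += delta * (x - self.mean[i])

    def __call__(self, obs):
        """Return the observation shifted to zero mean and scaled to unit variance."""
        if self.count < 2:
            return list(obs)          # not enough data yet, pass the observation through
        return [(x - m) / math.sqrt(m2 / self.count + self.eps)
                for x, m, m2 in zip(obs, self.mean, self.m2)]


standardizer = RunningStandardizer(size=2)
for raw_obs in ([0.0, 10.0], [2.0, 30.0], [4.0, 50.0]):
    standardizer.update(raw_obs)
print(standardizer([4.0, 50.0]))
```

The same statistics have to be stored with the agent and reused at render time, otherwise the networks receive inputs on a different scale than during training.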

<u> Initializing, Training and Saving </u>

In [None]:
# create a directory (./modified_model) to save all trained agents with a timestamp
jetzt = datetime.datetime.now()
datum_uhrzeit = jetzt.strftime("%Y%m%d_%H%M%S")
# os.path.join keeps the paths portable between Windows and Linux
savedir = os.path.join('modified_model', f'YOUR_AGENT_NAME_HERE_{datum_uhrzeit}')
os.makedirs('modified_model', exist_ok=True)
os.makedirs(savedir, exist_ok=True)

# create a directory to save logs from TensorBoard (visualization and analysis tool from TensorFlow)
log_dir1 = os.path.join(savedir, 'log')
os.makedirs(log_dir1, exist_ok=True)

# user feedback: confirm that the log directory exists and show its absolute path
# (the timestamp in the name prevents overwriting the log data of earlier runs)
if os.path.exists(log_dir1):
    print(f"The directory {log_dir1} exists.")
    absolute_path = os.path.abspath(log_dir1)
    print(absolute_path)
else:
    print(f"The directory {log_dir1} does not exist.")

In [None]:
# Parameters for the actor and critic networks --> defaults used in the MountainCarContinuous notebook
actor_learning_rate = 0.00025   # learning rate for the actor
critic_learning_rate = 0.001    # learning rate for the critic (should be larger than the actor's)
# If the actor changes faster than the critic, the estimated value no longer represents the
# current policy, because that estimate is based on past policies

# Parameters for the agent
gamma = 0.99                    # discount factor
epsilon = 0.2                   # clip range for the actor loss function

# Parameters for training (n_rollouts NOT USED)
epochs = 50                     # number of learning iterations
n_steps = 2048                  # number of environment steps per epoch (rollout-based collection fails on hopper)
batch_size = 16                 # number of samples per learning step
learn_steps = 10                # number of learning steps per epoch

In [None]:
# import the modified agent
from lib.PPOAgentContinuous_modified import PPOAgentContinuous as PPOAgent_mod
agent = PPOAgent_mod(env.action_space, env.observation_space, gamma, epsilon, actor_learning_rate, critic_learning_rate) 

# train the agent
from lib.train_agent import training_steps
training_steps(env, agent, log_dir1, epochs, n_steps, batch_size, learn_steps, render=False)

# this method provides continuous feedback about the training (e.g. steps, epoch, rollouts, actor loss, critic loss)
# to save runtime it's recommended to set the render flag to False while training

In [None]:
# save the agent in h5 format
filepath_actor = os.path.join(savedir, "actor.h5")
filepath_critic = os.path.join(savedir, "critic.h5")

# this method saves the trained weights of the actor and critic networks to the folder structure created above
agent.save_models(filepath_actor, filepath_critic)

<u> Reloading and Rendering </u>

In [None]:
# import the rendering function (self-written) for the gym framework
from lib.render_GUI import render_GUI_GYM

# set up the environment and load the trained agent from the directory
# render_mode='human' opens an external window that shows the agent acting in the environment
render_env = gym.make('Hopper-v4', render_mode='human')

# set up an agent into which the saved weights are loaded
render_agent_mod = PPOAgent_mod(render_env.action_space, render_env.observation_space)
render_agent_mod._init_networks()

filepath_actor = os.path.join("modified_model", "YOUR_AGENT_NAME_HERE", "actor.h5")
filepath_critic = os.path.join("modified_model", "YOUR_AGENT_NAME_HERE", "critic.h5")

# load the saved weights into the initialized model
render_agent_mod.load_models(filepath_actor, filepath_critic)

# call the render function
# Note: to close this window you have to interrupt this running cell; closing the window with the red x doesn't work
render_GUI_GYM(render_env, render_agent_mod)

<u> Example for a trained agent </u>

Trained agent with our modified PPO and the hyperparameters given above in the env:  
<video width="200" height="200" controls>
  <source src="visu/Video-modified.mp4" type="video/mp4">
  Your browser does not support the video tag.
</video>

Learning curve (x-axis = total timesteps/ y-axis = return):  
<img src="visu/Learning-curve-modified.jpg" alt="Learning Curve modified" width="400"/>

<u> Observations to mention </u>

- Better performance compared to the standard implementation, but still not on the level of SB3
- Implementing some of the SB3 approaches makes the training much more stable
- Hopper learns how to jump
- Hopper doesn't move to the right --> no improvement expected from more training steps, because the learning curve flattens

---

### 6) Lessons Learned (overall for MountainCar and Hopper)

- Converting a discrete algorithm to a continuous one is possible by changing the policy structure (actor network) and some sampling methods
- Performance depends on small implementation details, not only on the algorithm itself (see hopper)
- There are a lot of possibilities to implement the same algorithm -> huge variability
- Reward shaping helps to solve difficult environments (with destructive reward)
- NaN issue: for complex environments (e.g. hopper) some of the actor weights became 0
    - Units get "deactivated", the Adam optimizer fails and outputs a NaN value
    - "Solved" by using leaky_relu instead of relu as activation of the hidden layers (weight = 0 is much less likely)
- Avoid calculating a fraction, use the log difference instead -> better numerical stability
- Standardizing observations during experience gathering can be crucial if actions get out of bounds [-1 … 1] too often
- Tuning hyperparameters is only for fine-tuning -> using the defaults always turned out to be the best choice
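
The advice to use a log difference instead of a fraction can be demonstrated with a small numerical example. It assumes a one-dimensional Gaussian policy, and the helper names are hypothetical. For an action far in the tail of the distribution both probability densities underflow to zero, so the direct fraction is undefined, while the difference of the log probabilities is still well behaved.

```python
import math


def gaussian_log_prob(x, mean, std):
    """Log density of a normal distribution, computed without evaluating the density itself."""
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std) - 0.5 * math.log(2.0 * math.pi)


def ratio_from_log_probs(logp_new, logp_old):
    """Probability ratio pi_new / pi_old computed from the log probability difference."""
    return math.exp(logp_new - logp_old)


x, std = 40.0, 1.0                                   # action far away from both policy means
logp_new = gaussian_log_prob(x, mean=0.1, std=std)
logp_old = gaussian_log_prob(x, mean=0.0, std=std)

# naive approach: both densities underflow to 0.0 in float64, the fraction 0/0 is undefined
p_new, p_old = math.exp(logp_new), math.exp(logp_old)
naive_ratio = p_new / p_old if p_old > 0.0 else float("nan")

# stable approach: subtract the log probabilities first, exponentiate once
stable_ratio = ratio_from_log_probs(logp_new, logp_old)

print("naive: ", naive_ratio)
print("stable:", stable_ratio)
```

A NaN ratio like the naive one would propagate through the loss into the gradients and the network weights.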
 