# <span style="color:blue">MountainCar and MountainCarContinuous Environment with PPO </span>

by Robin Wolf and Mathias Fuhrer (RKIM)
Project: Reinforcement Learning in module 'Roboterprogrammierung' by Prof. Hein

This notebook contains our analysis and step-by-step improvement of the given PPO implementation on OpenAI's MountainCar and MountainCarContinuous environments. It also includes a detailed comparison between the Stable-Baselines3 PPO and our PPO with respect to implementation details and agent performance.

### Table of Contents:
1) Proximal Policy Optimization (PPO)
2) Difference between discrete and continuous action space implementation
3) MountainCar and MountainCarContinuous Environment
4) Reward shaping
5) Training with our implementation
6) Results
7) Comparison to Stable-Baselines3

---

### 1) Proximal Policy Optimization (PPO)


PPO is a policy gradient Actor-Critic algorithm. The policy model, the **actor** network, produces a stochastic policy: it maps the state to a probability distribution over the set of possible actions. The **critic** network approximates the value function, from which the advantage is calculated:

$$
A_\Phi (s_t, a_t) = q_\Phi (s_t,a_t) - v_\Phi (s_t) = R_t + \gamma v_{\Phi'} (s_{t+1}) - v_\Phi (s_t)
$$
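
As a minimal sketch of the advantage estimate above, the following plain-Python function evaluates $R_t + \gamma v_{\Phi'}(s_{t+1}) - v_\Phi(s_t)$ for a single transition. The function name `td_advantage` and the `done` flag (which drops the bootstrap term at a terminal state) are assumptions made for illustration and are not taken from the project code.

```python
def td_advantage(reward, value, next_value, gamma=0.99, done=False):
    """One-step TD advantage: R_t + gamma * v(s_{t+1}) - v(s_t)."""
    # Assumption: at a terminal state there is no successor value to bootstrap from.
    bootstrap = 0.0 if done else gamma * next_value
    return reward + bootstrap - value


# Illustrative numbers only: reward -1, critic values -20.0 (now) and -19.5 (next state).
print(td_advantage(reward=-1.0, value=-20.0, next_value=-19.5))
```

A positive result means the action turned out better than the critic expected for that state, a negative result means it turned out worse.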

The critic $v_\Phi$ is trained in the same manner as the DQN model and the critic of DDPG: with TD learning and a "frozen", periodically updated target critic network $v_{\Phi'}$. Instead of approximating a q-value, it approximates the state value.

To train the actor, PPO uses the ratio of two policies:
- a current policy $\pi_\Theta$, which is currently being learned
- a baseline policy $\pi_{\Theta'}$, an earlier version of the policy

$$
r^t (\Theta)=r_\Theta (s_t,a_t) = \frac{\pi_\Theta (a_t | s_t)}{\pi_{\Theta'} (a_t | s_t)}
$$

It is the ratio of the probability of selecting $a_t$ under $\pi_\Theta$ to the probability of selecting the same action under $\pi_{\Theta'}$.
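
In practice the ratio is often computed from log-probabilities, because subtracting logs is numerically more robust than dividing small probabilities. The sketch below shows this under that assumption; `probability_ratio` is a hypothetical helper, not a function of the project code.

```python
import math


def probability_ratio(log_prob_new, log_prob_old):
    """Ratio pi_new(a|s) / pi_old(a|s), computed from log-probabilities."""
    return math.exp(log_prob_new - log_prob_old)


# Illustrative values: the current policy selects the action with probability 0.5,
# the earlier policy selected it with probability 0.4.
ratio = probability_ratio(math.log(0.5), math.log(0.4))
print(ratio)
```

A ratio above 1 means the current policy makes the action more likely than the earlier policy did, a ratio below 1 means it makes the action less likely.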

When multiplied by the approximated advantage, calculated using the critic network, the ratio can be used as the objective function (maximized with stochastic gradient ascent, SGA). Its negative serves as the actor loss:

$$
loss_{actor} = - r_\Theta (s_t, a_t) A_\Phi (s_t, a_t)
$$

This works because when
- the advantage is positive, meaning that selecting the action would increase the value, the probability of selecting this action increases
- the advantage is negative, meaning that selecting the action would decrease the value, the probability of selecting this action decreases

Instead of using this directly as the loss function, the loss is extended in a pessimistic way. Clipping the ratio limits the policy optimization step size and thereby stabilizes training:

$$
loss_{actor} = - \min [r_\Theta (s_t, a_t) A_\Phi (s_t, a_t), clip(r_\Theta (s_t, a_t), 1-\epsilon, 1+\epsilon) A_\Phi (s_t, a_t)]
$$
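
The effect of the clipping can be sketched for a single sample as follows. The sketch treats the clipped term as an objective to be maximized and returns its negative as the loss, in line with the unclipped loss further above. The function name `clipped_actor_loss` is an assumption for illustration; the project's `get_actor_grads` method is not reproduced here.

```python
def clipped_actor_loss(ratio, advantage, epsilon=0.1):
    """Negative clipped surrogate objective for one (state, action) sample."""
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    # Pessimistic choice: take the smaller of the unclipped and the clipped term.
    surrogate = min(ratio * advantage, clipped_ratio * advantage)
    return -surrogate


# Positive advantage: a ratio above 1 + epsilon earns no additional objective.
print(clipped_actor_loss(ratio=1.5, advantage=2.0))
# Negative advantage: the unclipped (worse) term is kept, so the penalty is not softened.
print(clipped_actor_loss(ratio=1.5, advantage=-2.0))
```

With a positive advantage the objective stops growing once the ratio leaves the interval $[1-\epsilon, 1+\epsilon]$, so a single update has no incentive to move the policy far away from the earlier one.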

PPO uses 2 main models. The actor network learns the stochastic policy. It maps the state to a probability distribution over the set of possible actions. The critic network learns the value function. It maps the state to a scalar.


<img src="visu/PPO.png" alt="PPO-Shematic" width="1000"/>

---

### 2) Difference between discrete and continuous action space implementation
In the discrete case, the actor network has one output per possible action, giving the probability of that action in the given state.
In the MountainCar environment there are 3 possible actions: accelerate to the left, accelerate to the right, or don't accelerate at all.

<img src="visu/actionsDiscrete.png" alt="actionsDiscrete" width="500"/>

In a continuous action space, a probability distribution is needed for each degree of freedom. In the MountainCarContinuous environment, the acceleration to the left or right is the single degree of freedom. In the Hopper environment there are 3 degrees of freedom, the torques at the 3 joints (see the other notebook).
With a continuous action space, the actor network must therefore output the parameters of a probability distribution, a mean and a standard deviation, for each degree of freedom.

<img src="visu/actionsContinuous.png" alt="actionsContinuous" width="500"/>

The actor network as well as the `act` and `get_actor_grads` methods of the PPO agent must be adjusted accordingly.
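
The difference between the two cases can be sketched in plain Python. The helper names below are hypothetical and the code is not the project's `act` method; it only illustrates that the discrete case samples from a categorical distribution while the continuous case samples from a Gaussian defined by mean and standard deviation. Clipping the sampled action to [-1, 1] is an assumption based on the action range of MountainCarContinuous.

```python
import math
import random


def sample_discrete(probabilities, rng):
    """Sample an action index from a categorical distribution."""
    action = rng.choices(range(len(probabilities)), weights=probabilities, k=1)[0]
    return action, math.log(probabilities[action])


def gaussian_log_prob(x, mean, std):
    """Log density of a normal distribution N(mean, std^2) at x."""
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std) - 0.5 * math.log(2 * math.pi)


def sample_continuous(mean, std, rng, low=-1.0, high=1.0):
    """Sample one degree of freedom from a Gaussian and clip it to the action range."""
    raw = rng.gauss(mean, std)
    # The log-probability refers to the raw (unclipped) sample.
    return max(low, min(raw, high)), gaussian_log_prob(raw, mean, std)


rng = random.Random(0)
print(sample_discrete([0.2, 0.5, 0.3], rng))   # (action index, log-probability)
print(sample_continuous(0.0, 0.5, rng))        # (clipped action, log-probability)
```

In both cases the log-probability of the selected action is what later enters the policy ratio.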


---

### 3) MountainCar and MountainCarContinuous Environment
### 3.1) MountainCar
https://gymnasium.farama.org/environments/classic_control/mountain_car/  
 
An episode lasts at most 200 steps.

**Observation Space (2-dimensional):**
- position of the car along the x-axis [-1.2 … 0.6]
- velocity of the car [-0.07 … 0.07]

**Action Space (3 discrete actions):** 
- 0: Accelerate to the left
- 1: Don’t accelerate
- 2: Accelerate to the right

**Reward:** 
 - negative reward of -1 at each timestep
 
  
   
### 3.2) MountainCarContinuous
https://gymnasium.farama.org/environments/classic_control/mountain_car_continuous/ 
  
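An episode lasts at most 999 steps.
   
**Observation Space (2-dimensional):** 
- position of the car along the x-axis [-Inf … Inf] (as listed in the documentation table; the environment itself clips the position to [-1.2 … 0.6])
- velocity of the car [-Inf … Inf] (as listed in the documentation table; the environment itself clips the velocity to [-0.07 … 0.07])

**Action Space (1-dimensional):**
- force applied to the car [-1 … 1]

**Reward:** 
- negative reward of $-0.1 \cdot action^2$ at each timestep, positive reward of +100 added if the car reaches the goal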

<img src="visu/MtnCar.png" alt="MtnCar" width="400"/> 


**--> The goal is to reach the flag on the right hill**
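
The reward of the continuous environment described above can be written down as a small function. This is only a sketch of the stated formula; the name `continuous_step_reward` is hypothetical and the code is not taken from the Gymnasium sources.

```python
def continuous_step_reward(action, reached_goal):
    """Per-step reward of MountainCarContinuous as described above."""
    reward = -0.1 * action ** 2
    if reached_goal:
        reward += 100.0
    return reward


# Full throttle for a whole 999-step episode without ever reaching the goal:
worst_case_return = sum(continuous_step_reward(1.0, False) for _ in range(999))
print(worst_case_return)
# Doing nothing for the whole episode costs nothing:
print(sum(continuous_step_reward(0.0, False) for _ in range(999)))
```

As long as the goal is never reached, a return of 0 from doing nothing is the best the agent can observe, which is the reason for the reward shaping discussed in the next section.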

---

### 4) Reward shaping
In the discrete environment, the agent receives a reward of -1 at each time step. If the vehicle doesn't reach the goal, this adds up to a return of -200 per episode. Only if the vehicle happens to reach the goal, so that the episode ends early, is the return greater than -200. As long as the vehicle never reaches the goal, every episode yields the same return and the agent cannot learn anything useful.

In the continuous environment, the agent receives a negative reward at each time step, which grows with the magnitude of the action. If the vehicle reaches the goal, it additionally receives a positive reward of +100. If the vehicle never reaches the goal by chance, the agent tends to learn to do nothing, or as little as possible, in order to keep the negative reward small.

To solve this, we changed the reward strategy of the original environments. This is a common technique called reward shaping.


We tried two different approaches.

In the first attempt, the agent received a staggered positive reward as the vehicle made its way up the right hill bit by bit. The idea was that the agent would learn to drive up the right hill.

<img src="visu/rewardPosition.png" alt="MtnCar" width="400"/> 


In the second attempt, the agent received a positive reward when the vehicle was fast. The idea was that a fast vehicle would inevitably make it up the hill.
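
The two shaping ideas can be sketched as plain functions. The thresholds, the bonus of +1 per threshold and the scaling factor are made-up values for illustration; the actual wrappers `CustomMountainCarEnv_position` and `CustomMountainCarEnv_velocity` live in `lib.CustomMtnCarEnvironments` and may be implemented differently.

```python
def position_shaped_reward(base_reward, position, thresholds=(-0.2, 0.0, 0.2, 0.4)):
    """Staggered bonus: +1 for every (hypothetical) threshold passed on the right hill."""
    bonus = sum(1.0 for threshold in thresholds if position >= threshold)
    return base_reward + bonus


def velocity_shaped_reward(base_reward, velocity, scale=100.0):
    """Bonus proportional to the speed, regardless of the driving direction."""
    return base_reward + scale * abs(velocity)


# Car swinging back onto the left hill at speed 0.05 (position -0.9):
print(position_shaped_reward(-1.0, position=-0.9))   # no bonus on the left hill
print(velocity_shaped_reward(-1.0, velocity=-0.05))  # speed is rewarded in both directions
```

The example hints at the difference between the two approaches: a position-based bonus gives no credit for swinging back to the left, while a speed-based bonus does.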


---

### 5) Training

In [2]:
import gymnasium as gym
import numpy as np
import tensorflow as tf
import tensorflow_addons as tfa
from IPython.display import clear_output
import matplotlib.pyplot as plt
import tensorboard
from keras.callbacks import TensorBoard  # to visualize the training process
import os
import datetime
import pygame
import mujoco

physical_devices = tf.config.list_physical_devices('GPU') 
for device in physical_devices:
    tf.config.experimental.set_memory_growth(device, True)

<u> Parameters </u>

In [3]:
# Parameters for the actor and critic networks
actor_learning_rate = 0.00025   # learning rate for the actor
critic_learning_rate = 0.001    # learning rate for the critic

# Parameters for the agent
gamma = 0.99                    # discount factor
epsilon = 0.1                   # clip range for the actor loss function

# Parameters for training
epochs = 1                      # number of learning iterations (alternative value left in the original comment: 50)
n_rollouts = 5                  # number of episodes/rollouts to collect experience
batch_size = 8                  # number of samples per learning step
learn_steps = 16                # number of learning steps per epoch
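
To get a feeling for these training parameters, the following back-of-the-envelope calculation compares the amount of experience collected per epoch with the number of samples used for updates. It assumes that every learn step draws exactly one batch of `batch_size` samples and that an episode of the discrete environment lasts at most 200 steps; how `training_rollouts` actually samples is not shown in this notebook.

```python
# Values copied from the parameter cell.
n_rollouts = 5
batch_size = 8
learn_steps = 16

# Assumption: an episode of the discrete environment lasts at most 200 steps.
max_episode_steps = 200

max_collected_steps = n_rollouts * max_episode_steps
samples_per_epoch = batch_size * learn_steps

print(f"collected per epoch (at most): {max_collected_steps}")
print(f"used for updates per epoch:    {samples_per_epoch}")
```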

### 5.1) discrete

<u> Initializing Datalogging </u>

In [7]:
# refers to log data and model data -> below for model data
jetzt1 = datetime.datetime.now()
datum_uhrzeit1 = jetzt1.strftime("%Y%m%d_%H%M%S")
savedir1 = f'model\\CustomMountainCarEnv_velocity_discrete_{datum_uhrzeit1}'
os.makedirs('model', exist_ok=True)
os.makedirs(savedir1, exist_ok=True)
log_dir1 = f"{savedir1}\\log"
os.makedirs(log_dir1, exist_ok=True)

if os.path.exists(log_dir1):
    print(f"The directory {log_dir1} exists.")
    absolute_path1 = os.path.abspath(log_dir1)
    print(absolute_path1)
else:
    print(f"The directory {log_dir1} does not exist.")

The directory model\CustomMountainCarEnv_velocity_discrete_20240115_202429\log exists.
c:\Users\Mathias\Documents\StudiumMaster\Semester1\Roboterprogrammierung_Hein\Projektarbeit_PPO\04_Abgabe\01_Code\model\CustomMountainCarEnv_velocity_discrete_20240115_202429\log


<u> Initializing environment </u>

In [8]:
from lib.CustomMtnCarEnvironments import CustomMountainCarEnv_velocity
from lib.CustomMtnCarEnvironments import CustomMountainCarEnv_position

env1 = gym.make('MountainCar-v0', render_mode='rgb_array')

# Reward shaping
env1 = CustomMountainCarEnv_velocity(env1)

<u> Initializing PPO-Agent </u>

In [9]:
from lib.PPOAgentDiscrete import PPOAgentDiscrete
agent1 = PPOAgentDiscrete(env1.action_space, env1.observation_space, gamma, epsilon, actor_learning_rate, critic_learning_rate)

<u> Training PPO-Agent </u>

In [11]:
from lib.train_agent import training_rollouts as training
training(env1, agent1, log_dir1, epochs, n_rollouts, batch_size, learn_steps, render=False)

start training
collecting experience in rollouts finished, start learning phase
update online nets, learn step 0 of 16 finished --> actor loss 0.02296990156173706, critic loss 0.43175390362739563
update online nets, learn step 1 of 16 finished --> actor loss -0.014846965670585632, critic loss 0.0029164645820856094
update online nets, learn step 2 of 16 finished --> actor loss -0.006653338670730591, critic loss 0.00811038352549076
update online nets, learn step 3 of 16 finished --> actor loss -0.02178068459033966, critic loss 0.03970930725336075
update online nets, learn step 4 of 16 finished --> actor loss 0.030912980437278748, critic loss 0.8937200307846069
update online nets, learn step 5 of 16 finished --> actor loss 0.029167860746383667, critic loss 0.0022610872983932495
update online nets, learn step 6 of 16 finished --> actor loss -0.0076074860990047455, critic loss 0.0006917249993421137
update online nets, learn step 7 of 16 finished --> actor loss -0.030809350311756134, critic 

<u> Storing models </u>

In [12]:
# save the model to h5 format
filepath_actor1 = f"{savedir1}\\actor.h5"
filepath_critic1 = f"{savedir1}\\critic.h5"
agent1.save_models(filepath_actor1, filepath_critic1)

<u> Rendering with pygame </u>

In [13]:
from lib.render_GUI import render_GUI


# Set up the environment and load the trained agent from directory
render_env1 = gym.make('MountainCar-v0', render_mode='human')

render_agent1 = PPOAgentDiscrete(render_env1.action_space, render_env1.observation_space)
render_agent1._init_networks()

# filepath_actor1 = f"model\\ TODO \\actor.h5"
# filepath_critic1 = f"model\\ TODO \\critic.h5"

# load the model from h5 format
render_agent1.load_models(filepath_actor1, filepath_critic1)

#call the function
render_GUI(render_env1, render_agent1)

Model loaded sucessful
Episode 0 finished
Closed Rendering sucessful


### 5.2) continuous

<u> Initializing Datalogging </u>

In [14]:
# refers to log data and model data -> below for model data
jetzt2 = datetime.datetime.now()
datum_uhrzeit2 = jetzt2.strftime("%Y%m%d_%H%M%S")
savedir2 = f'model\\CustomMountainCarEnv_velocity_continuous_{datum_uhrzeit2}'
os.makedirs('model', exist_ok=True)
os.makedirs(savedir2, exist_ok=True)


#from Fctn_log_metrics import log_metrics
log_dir2 = f"{savedir2}\\log"
os.makedirs(log_dir2, exist_ok=True)

if os.path.exists(log_dir2):
    print(f"The directory {log_dir2} exists.")
    absolute_path2 = os.path.abspath(log_dir2)
    print(absolute_path2)
else:
    print(f"The directory {log_dir2} does not exist.")

The directory model\CustomMountainCarEnv_velocity_continuous_20240115_202950\log exists.
c:\Users\Mathias\Documents\StudiumMaster\Semester1\Roboterprogrammierung_Hein\Projektarbeit_PPO\04_Abgabe\01_Code\model\CustomMountainCarEnv_velocity_continuous_20240115_202950\log


<u> Initializing environment</u>

In [15]:
from lib.CustomMtnCarEnvironments import CustomMountainCarEnv_velocity
from lib.CustomMtnCarEnvironments import CustomMountainCarEnv_position

env2 = gym.make('MountainCarContinuous-v0', render_mode='rgb_array')

# Reward shaping
env2 = CustomMountainCarEnv_velocity(env2)

<u> Initializing PPO-Agent </u>

In [16]:
from lib.PPOAgentContinuous import PPOAgentContinuous
agent2 = PPOAgentContinuous(env2.action_space, env2.observation_space, gamma, epsilon, actor_learning_rate, critic_learning_rate)

<u> Training PPO-Agent </u>

In [17]:
from lib.train_agent import training_rollouts as training
training(env2, agent2, log_dir2, epochs, n_rollouts, batch_size, learn_steps, render=False)

start training
collecting experience in rollouts finished, start learning phase
update online nets, learn step 0 of 16 finished --> actor loss -0.00444340705871582, critic loss 0.15101514756679535
update online nets, learn step 1 of 16 finished --> actor loss -0.002040386199951172, critic loss 0.0020012622699141502
update online nets, learn step 2 of 16 finished --> actor loss -0.01733473315834999, critic loss 0.00512193376198411
update online nets, learn step 3 of 16 finished --> actor loss -0.003203696571290493, critic loss 0.032389018684625626
update online nets, learn step 4 of 16 finished --> actor loss 0.012715565040707588, critic loss 0.012662029825150967
update online nets, learn step 5 of 16 finished --> actor loss -0.006671170238405466, critic loss 0.00016426289221271873
update online nets, learn step 6 of 16 finished --> actor loss -0.005980385467410088, critic loss 0.11311068385839462
update online nets, learn step 7 of 16 finished --> actor loss -0.00513810571283102, criti

<u> Storing models </u>

In [18]:
# save the model to h5 format
filepath_actor2 = f"{savedir2}\\actor.h5"
filepath_critic2 = f"{savedir2}\\critic.h5"
agent2.save_models(filepath_actor2, filepath_critic2)

<u> Rendering with pygame </u>

In [19]:
from lib.render_GUI import render_GUI


# Set up the environment and load the trained agent from directory
render_env2 = gym.make('MountainCarContinuous-v0', render_mode='human')
render_agent2 = PPOAgentContinuous(render_env2.action_space, render_env2.observation_space)
render_agent2._init_networks()

# filepath_actor2 = f"model\\ TODO \\actor.h5"
# filepath_critic2 = f"model\\ TODO \\critic.h5"

# load the model from h5 format
render_agent2.load_models(filepath_actor2, filepath_critic2)

#call the function
render_GUI(render_env2, render_agent2)

Model reloaded sucessful
Closed Rendering sucessful


---

### 6) Results


**Position based reward** 
  
As you can see in the video, the position based approach was not successful. The vehicle tries to drive up the right hill without gaining the necessary momentum on the left hill.
If the vehicle were to gain momentum on the left, it would receive a worse reward in the meantime. This causes the agent to get stuck in a local maximum.
 
<video width="400" height="400" controls>
  <source src="visu/CustomMountainCarEnv_position_discrete_20240111_174107.mp4" type="video/mp4">
  Your browser does not support the video tag.
</video>

**Velocity based reward** 
 
The velocity based approach was significantly more successful, for both the discrete and the continuous environment.
In the diagrams you can see that the average return and the number of terminations converge well. Termination here means that the episode ended early because the vehicle reached the goal. At the end of training, the vehicle reaches the goal in each of the 5 episodes per epoch.
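
The two quantities plotted in the learning curves can be computed per epoch as sketched below. The function `epoch_metrics` and the numbers are made up for illustration; they are neither the project's logging code nor measured results.

```python
def epoch_metrics(episode_returns, terminated_flags):
    """Return (average return, number of terminated episodes) for one epoch."""
    average_return = sum(episode_returns) / len(episode_returns)
    terminations = sum(1 for flag in terminated_flags if flag)
    return average_return, terminations


# Made-up numbers for one epoch with 5 rollouts, for illustration only.
returns = [-200.0, -150.0, -120.0, -200.0, -130.0]
terminated = [False, True, True, False, True]
print(epoch_metrics(returns, terminated))
```

With 5 rollouts per epoch, a terminations value of 5 means that the vehicle reached the goal in every episode of that epoch.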
 
<video width="400" height="400" controls>
  <source src="visu/CustomMountainCarEnv_velocity_discrete_20240114_000941_200epochs.mp4" type="video/mp4">
  Your browser does not support the video tag.
</video>

Learning curve (x-axis = epochs/ y-axis = average return):  
<img src="visu/mtnCar_avgReturn.png" alt="mtnCar_avgReturn" width="400"/>

Learning curve (x-axis = epochs/ y-axis = terminations per epoch (here one epoch has 5 episodes)):   
<img src="visu/mtnCar_avgReturn.png" alt="mtnCar_avgReturn" width="400"/>

---

### 7) Comparison to Stable-Baselines3

<u> Initializing and Training an agent with SB3 </u>

In [None]:
# import stable baselines algorithms (installation of the package recommended)
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

In [None]:
# create a directory (./baselines_model) to save all trained agents with a timestamp
jetzt = datetime.datetime.now()
datum_uhrzeit = jetzt.strftime("%Y%m%d_%H%M%S")
savedir = f'baselines_model\\YOUR_AGENT_NAME_HERE_{datum_uhrzeit}'
os.makedirs('baselines_model', exist_ok=True)
os.makedirs(savedir, exist_ok=True)

# create a directory to save logs from TensorBoard (visualization and analysis tool from TensorFlow)
log_dir = f"{savedir}\\log"
os.makedirs(log_dir, exist_ok=True)

Attention: Run only one of the next two cells. The initialization differs between environments in the preferred SB3 framework and wrapped environments in the Gym framework.

In [None]:
# set up the environment (in Stable-Baselines3, using a vectorized environment is preferred)
# --> SB3 framework, not compatible with the Gym framework
vec_env = make_vec_env('MountainCarContinuous-v0')

In [None]:
# set up a wrapped environment (now the Gym framework must be used!)
inheritance_env = gym.make('MountainCarContinuous-v0', render_mode='rgb_array')

# Reward shaping (velocity reward as an example)
env = CustomMountainCarEnv_velocity(inheritance_env)

In [None]:
# set up the agent in the baselines framework (used policy, environment, optional: log)
# note: pass `env` instead of `vec_env` if the wrapped-environment cell was run
model = PPO("MlpPolicy", vec_env, verbose=1, tensorboard_log=log_dir)
# with verbose=1 the model reports continuously on the training process (this barely affects runtime)

# call the learn method (training ends once the agent has passed 1 million timesteps in total)
model.learn(total_timesteps=1000000)

# save the model (this will create a data.zip folder containing all model information)
model.save(savedir)

<u> Reloading and Rendering </u>

Choose one of the two rendering functions, depending on the environment the agent was trained in.

In [None]:
# import the rendering function (self-written) for the sb3 framework
from lib.render_GUI import render_GUI_SB3

# define the directory the model is loaded from
loaddir = r'baselines_model\YOUR_AGENT_NAME_HERE\data.zip'

# load the model with the SB3 function
render_agent = PPO.load(loaddir)

# initialize an environment (must be the same framework as used during training)
render_env = make_vec_env('MountainCarContinuous-v0')

# call the render function 
# Note: to close this window you have to interrupt this running cell, closing the window with the red x doesn't work
render_GUI_SB3(render_env, render_agent)

In [None]:
# import the rendering function (self-written) for the Gym framework (wrapped environments)
from lib.render_GUI import render_GUI_GYM

# define the directory the model is loaded from
loaddir = r'baselines_model\YOUR_AGENT_NAME_HERE\data.zip'

# load the model with the SB3 function
render_agent = PPO.load(loaddir)

# initialize an environment (must be the same framework as used during training)
render_env = gym.make('MountainCarContinuous-v0')

# call the render function 
# Note: to close this window you have to interrupt this running cell, closing the window with the red x doesn't work
render_GUI_GYM(render_env, render_agent)

<u> Comparison of trained agents in the standard environment </u>

Trained discrete agent with default hyperparameters from the BaselinesZoo by DLR in the standard env:      
<video width="200" height="200" controls>
  <source src="visu/SB3-dis-standard.mp4" type="video/mp4">
  Your browser does not support the video tag.
</video>

Learning curve (x-axis = total timesteps/ y-axis = return):  
<img src="visu/LC-discrete-standard.jpg" alt="Learning Curve SB3 discrete MntnCar" width="400"/>

- This approach works well once the goal at the top of the hill has been reached for the first time (steep slope around 500,000 timesteps)
- When the agent first hits the goal during exploration is more or less random
- In contrast to our implementation, the SB3 agent also solves the environment without reward shaping.


Trained continuous agent with default hyperparameters from the BaselinesZoo by DLR in the standard env:      
<video width="200" height="200" controls>
  <source src="visu/SB3-cont-standard.mp4" type="video/mp4"> 
  Your browser does not support the video tag.
</video>

Learning curve (x-axis = total timesteps/ y-axis = return):  
<img src="visu/LC-cont-standard.jpg" alt="Learning Curve SB3 continuous MntnCar" width="400"/>

- This approach doesn't solve the environment.
- The agent learns to do 'nothing' because large actions are penalized by a negative reward
- The agent never discovers the positive reward it would receive on termination
- The learning progress gets stuck at a local loss minimum / return maximum at 0



<u> Comparison of discrete agents in wrapped environments: </u>

Trained discrete agent in the wrapped environment (position reward shaping):      
<video width="200" height="200" controls>
  <source src="visu/SB3-dis-position.mp4" type="video/mp4">
  Your browser does not support the video tag.
</video>

Learning curve (x-axis = total timesteps/ y-axis = return):  
<img src="visu/LC-discrete-position.jpg" alt="Learning Curve SB3 MC_position" width="400"/>

- the agent doesn't terminate / the environment is not solved
- because there is only a positive reward on the right-hand hill, the agent doesn't learn to gain momentum by climbing the left-hand hill.
- Same dynamics/behavior as observed in our implementation.

Trained discrete agent in the wrapped environment (velocity reward shaping):     
<video width="200" height="200" controls>
  <source src="visu/SB3-dis-velocity.mp4" type="video/mp4">
  Your browser does not support the video tag.
</video>

Learning curve (x-axis = total timesteps/ y-axis = return):  
<img src="visu/LC-discrete-velocity.jpg" alt="Learning Curve SB3 MC velocity" width="400"/>

- the agent solves the wrapped environment with shaped reward well
- it terminates much earlier than in the standard environment
- the performance (max. return) is nearly equal, but learning is a more stable or deterministic process
- compared to our implementation, SB3 is much less sample efficient (it needs approx. 300,000 - 400,000 timesteps to solve the environment, compared to <10,000 for our implementation)

<u> Comparison of continuous agents in wrapped environments: </u>

Trained continuous agent in the wrapped environment (position reward shaping):      
<video width="200" height="200" controls>
  <source src="visu/SB3-cont-position.mp4" type="video/mp4">
  Your browser does not support the video tag.
</video>

Learning curve (x-axis = total timesteps/ y-axis = return):  
<img src="visu/LC-cont-position.jpg" alt="Learning Curve SB3 MCC_position" width="400"/>

- the environment is not solved
- Same dynamics/behavior as observed in our implementation and with discrete PPO
- actions taken by the continuous agent are smaller than the discrete ones, because they are penalized by a negative reward

Trained continuous agent in the wrapped environment (velocity reward shaping):     
<video width="200" height="200" controls>
  <source src="visu/SB3-cont-velocity.mp4" type="video/mp4">
  Your browser does not support the video tag.
</video>

Learning curve (x-axis = total timesteps/ y-axis = return):  
<img src="visu/LC-cont-velocity.jpg" alt="Learning Curve SB3 MCC_velocity" width="400"/>

- the agent solves the wrapped environment with shaped reward well
- it needs more attempts to build up enough momentum to reach the goal
- that's because of the balance between the reward term penalizing large actions and the shaped reward pushing the agent toward high velocities (they work against each other)
- compared to our implementation, this agent needs many more timesteps to learn to terminate (same as the discrete agent with SB3 -> an educated guess as to why is given in the Hopper notebook)