#  Reinforcement Learning (RL)

## **Session 3-1:** Introduction to Reinforcement Learning

This module introduces the basics of reinforcement learning. You will learn which components are required to set up the agent and the learning environment, and how these components interact. 

Reinforcement Learning is a (deep) learning method where: 
* the data is created by the interaction of an agent with the environment during the learning process.   
* the learning process is driven by the _reward_, a metric calculated from the environment 
    * the objective can be formulated directly
    * no prior knowledge of what good behaviour looks like is required
* __Challenge__: the collected data depends on the learning progress


### A (very) Brief History of RL

  
* __1983__: [Neuronlike Adaptive Elements That Can Solve Difficult Learning Control Problems](https://doi.org/10.1109/TSMC.1983.6313077) introduces the concept of _Actor-Critic_
* __1989__: Q-learning is introduced by Watkins
* __2013__: Deep Q-Networks (DQN) play Atari games and reach human-level performance in 2015
* __2015/2016__: AlphaGo beats human masters in Go, which was previously thought too complicated for AI.  
* __2017__: Proximal Policy Optimization (PPO) is published
* __2022__: Reinforcement Learning from Human Feedback (RLHF) is used in InstructGPT for fine-tuning language models, which is standard practice today. 


### Components: 
<img src="../figures/RLScheme.png" alt="fig" width="450" align="right" style="padding: 30px;" />


* __agent__:  
    * in (deep) RL the agent consists of one or more neural networks
    * The network parameters are adjusted in the learning process 
    * __policy__: the strategy/behaviour of the agent

<br> <!-- Add extra space between sections -->
* __environment__: refers to the system or context within which the agent operates  
    * simulation model, e.g. multibody dynamics model  
    * real system --> might be safety-critical  
    * The environment is advanced in _steps_  
        * In every environment _step_ the agent receives the current __reward__ and __state__ and returns the __action__   
        * A single step in the environment might correspond to multiple steps in the simulation model  



### Components: 
<img src="../figures/RLScheme.png" alt="fig" width="450" align="right" style="padding: 30px;" />



* __action__: how the agent interacts with the environment:  
    * could be a setpoint for PD-control, a force, torque, ...  
    * either discrete or continuous depending on agent's algorithm

<br> <!-- Add extra space between sections -->
* __state__ (observation): 
    * The data from the environment observed by the agent 
    * In a _real_ system this could be sensor data
    * action is chosen based on current state
    * An observation is a partial description of the state; the two terms are often used synonymously 



### Components: 
<img src="../figures/RLScheme.png" alt="fig" width="450" align="right" style="padding: 30px;" />


* __reward__:
    * Used to learn from the environment   
    * essential design choice
    * Policy is trained by the optimizer to maximize the expected cumulative reward
    * determines good/bad behaviour by giving high/low reward


<br> <!-- Add extra space between sections -->
* __episodes__:
    * training is organised in episodes, where one episode is a sequence of subsequent interactions of the agent with the environment
    * at the start of each episode the system is _reset_, typically to a randomly initialized state
    * the end of an episode is reached when it is: 
        * truncated: e.g. the time limit is reached or the agent made no progress for a longer time
        * terminated: the environment's state/observation left the permitted range, e.g. the pendulum fell over
        * some algorithms do not distinguish between terminated and truncated
    * Typically, the range of values is limited for each state
        * if the state is outside the permitted range, the episode is terminated
        * This helps to learn efficiently and avoid wasting resources on failed episodes
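
The difference between a _terminated_ and a _truncated_ episode can be illustrated with a small toy environment. The following is only a sketch under stated assumptions: the class `ToyPoleEnv`, its dynamics, the angle limit and the step limit are hypothetical and merely mimic the `reset`/`step` return convention used by Gymnasium.

```python
import random

class ToyPoleEnv:
    """Hypothetical 1-D toy environment mimicking the Gymnasium reset/step convention."""

    def __init__(self, angle_limit=0.2, max_steps=50, seed=None):
        self.angle_limit = angle_limit  # permitted range of the state
        self.max_steps = max_steps      # time limit, leads to truncation
        self.rng = random.Random(seed)
        self.angle = 0.0
        self.steps = 0

    def reset(self):
        # random initialization at the start of every episode
        self.angle = self.rng.uniform(-0.05, 0.05)
        self.steps = 0
        return self.angle, {}

    def step(self, action):
        # action in {0, 1}: push the angle down or up, plus a small disturbance
        self.angle += (0.02 if action == 1 else -0.02) + self.rng.uniform(-0.01, 0.01)
        self.steps += 1
        terminated = abs(self.angle) > self.angle_limit                # state left permitted range
        truncated = (not terminated) and self.steps >= self.max_steps  # time is over
        reward = 0.0 if terminated else 1.0
        return self.angle, reward, terminated, truncated, {}


def run_episode(env, policy):
    """Roll out one episode and return (cumulative reward, terminated, truncated)."""
    obs, info = env.reset()
    total = 0.0
    terminated = truncated = False
    while not (terminated or truncated):
        obs, reward, terminated, truncated, info = env.step(policy(obs))
        total += reward
    return total, terminated, truncated


env = ToyPoleEnv(seed=0)
# always pushing in one direction leaves the permitted range: terminated
print(run_episode(env, lambda obs: 1))
# pushing against the angle keeps it bounded: truncated by the time limit
print(run_episode(env, lambda obs: 0 if obs > 0 else 1))
```

In this sketch the failed episode stops early, so no further steps are spent on a state that cannot be recovered.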

<br> <!-- Add extra space between sections -->
* __value__:
    * how good is it to be in a state?
        * generally depends on state and policy
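
One common way to make "how good is it to be in a state" concrete is the discounted cumulative reward (return) collected from that state onwards. The sketch below assumes a discount factor `gamma` and a hypothetical helper name `discounted_return`; averaging such returns over many episodes that start in the same state and follow the same policy gives a simple Monte Carlo estimate of the value.

```python
def discounted_return(rewards, gamma=0.99):
    """Return r_0 + gamma*r_1 + gamma^2*r_2 + ... for one recorded episode."""
    g = 0.0
    for r in reversed(rewards):  # accumulate backwards: G_t = r_t + gamma * G_{t+1}
        g = r + gamma * g
    return g


def monte_carlo_value(episodes, gamma=0.99):
    """Average return over several episodes started from the same state."""
    returns = [discounted_return(rewards, gamma) for rewards in episodes]
    return sum(returns) / len(returns)


print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```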



### Example 1: _cartpole_ 


<img src="../figures/cartpole.png" alt="fig" width="450" align="right" style="padding: 30px;"/>

* inverted pendulum on a linear actuator
* environment: multibody model, either using:
    * redundant coordinates of bodies $[x_1, y_1, \varphi_1, x_2, y_2, \varphi_2]$, 1 prismatic, and 1 revolute joint
    * minimal coordinates $[q_1, q_2] = [x_\mathrm{cart}, \varphi]$
* state: $[x_\mathrm{cart}, \dot{x}_\mathrm{cart}, \varphi, \dot{\varphi}]$
    * state is obtained from MBD-model  
    * also redundant coordinates could be used  
* action: Force on cart $F_\mathrm{cart}$, continuous or discrete (depending on agent!)
* see [example training](./02_Exudyn_ExamplePendulum.ipynb) and [implementation environment](./environmentExudyn.py) for a custom environment using Exudyn.
* reward e.g. $r = 1 - \frac{|\varphi|}{\varphi_{max}}$, so that deviations in both directions are penalized
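
An angle-based reward of this kind could be written as a small function. This is a sketch, not the implementation from the linked environment: the function name `angle_reward`, the default limit `phi_max` and the clipping at zero are assumptions, and the absolute value is used so that positive and negative angles are treated alike.

```python
def angle_reward(phi, phi_max=0.2):
    """Reward in [0, 1]: 1 for the upright pole, 0 at (or beyond) the angle limit.

    phi and phi_max are angles in rad; phi_max = 0.2 is an assumed example value.
    """
    return max(0.0, 1.0 - abs(phi) / phi_max)


for phi in (0.0, 0.1, -0.1, 0.3):
    print(phi, angle_reward(phi))
```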




### Gym / Gymnasium Interface
<img src="../figures/cartpole.png" alt="fig" width="450" align="right" style="padding: 30px;"/>
  

Gymnasium, formerly developed as OpenAI Gym, is a library for RL environments; most RL libraries either support it or are compatible with it. The cartpole is a standard example, see also [here](https://gymnasium.farama.org/environments/classic_control/cart_pole/). 
By default: 
* The reward $r=1$ is given in every step as long as the pole does not fall over. 
* At reset, all states are initialized randomly from the interval $(-0.05, 0.05)$.
* The episode ends when $|\varphi| > 12°$, $|x_{cart}| > 2.4$ m, or the episode length reaches 500 steps.




In [1]:
import gymnasium as gym
env = gym.make("CartPole-v1") # create environment object

# when resetting with the seed we always get the same initialization
observation, info = env.reset(seed=42) 
print('observation at initialization \n[x, x_t, phi, phi_t] = ', observation) 

for _ in range(1): # do a single step
    action = env.action_space.sample() # a random action is sampled from {0, 1} (push cart to the left / right)
    observation, reward, terminated, truncated, info = env.step(action) # apply action (force) and call solver

    if terminated or truncated:
        observation, info = env.reset() # if done (truncated/terminated), the environment is reset 

print('action: ', action)
print('new observation: ', observation, '\n')
print('reward = {}, terminated={}, truncated={}, info={}'.format(reward, terminated, truncated, info))

env.close()

observation at initialization 
[x, x_t, phi, phi_t] =  [ 0.0273956  -0.00611216  0.03585979  0.0197368 ]
action:  0
new observation:  [ 0.02727336 -0.20172954  0.03625453  0.32351476] 

reward = 1.0, terminated=False, truncated=False, info={}


### Example 2: Bicycle

    
* environment:
    * multibody model state $\mathbf{s}_\mathrm{MBD}$ and path $\mathbf{s}_\mathrm{path}$
    * $\mathbf{s}_\mathrm{MBD} = [x_P, y_P, \Psi, \varphi, \delta, \theta_R, \theta_F, \dot{\varphi}, \dot{\delta}, \dot{\theta}_F]$
    *  $\mathbf{s}_\mathrm{path}$ contains lateral distance to path and preview points 
* action: setpoint for desired steering angle $\delta$ for underlying PD-control


<p float="left">
<img src="../figures/bike.png" alt="fig" width="300"/>
<img src="../figures/bike-preview.png" alt="fig" width="450"/>
</p>

## Challenges in RL: 
* exploration vs. exploitation:  
    * _exploration_: trying new or less-visited actions to discover potentially better outcomes.
    * Too much exploration might waste resources and converge slowly. 
    * _exploitation_: choosing the best known action based on the current knowledge.
    * Too much exploitation might lead to suboptimal behaviour (local optima).
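
A simple and widely used way to balance the two is $\epsilon$-greedy action selection with a decaying $\epsilon$. The following is a minimal sketch; the function names, the linear schedule and the values `eps_start`/`eps_end` are assumptions for illustration.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the best-known one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                      # exploration
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploitation


def linear_epsilon(step, total_steps, eps_start=1.0, eps_end=0.05):
    """Linearly decay epsilon from eps_start to eps_end over total_steps."""
    fraction = min(1.0, step / total_steps)
    return eps_start + fraction * (eps_end - eps_start)


q_values = [0.1, 0.9, 0.3]  # estimated value of each discrete action
for step in (0, 500, 1000):
    eps = linear_epsilon(step, total_steps=1000)
    print(step, round(eps, 3), epsilon_greedy(q_values, eps))
```

Early in training almost every action is random; towards the end the agent mostly follows its current estimate.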

<br> <!-- Add extra space between sections -->
* data efficiency:  
    * describes how many _interactions_ with the environment are needed for learning

<br> <!-- Add extra space between sections -->
* _Sensitivity_ to hyperparameters:
    * In contrast to supervised learning, there is a feedback loop: the policy determines which data the agent acquires, and this data in turn changes the policy
    * This increases the risk of instability or failure to learn

<br> <!-- Add extra space between sections -->
* rewards can be _sparse_, i.e. not provided in every timestep, but only when some goal state is reached

<br> <!-- Add extra space between sections -->
* _Glitching or breaking_ the environment: depending on the reward, the agent might exploit _problems_ in the environment
    * Example 1: If only the angle is penalized in the inverted pendulum, the agent may balance the pole while drifting with a constant translational velocity 
    * Example 2: Bicycle: if the reward depends only on the lateral distance to the closest point on the path, then after leaving the path, being oriented normal to the path leads to the best reward, which is not the intended path-following behaviour



### On-Policy

* On-policy algorithms directly optimize the policy, requiring data from the current policy to calculate updates.  
* Generally less sample efficient, but more stable. 



<br> <!-- Add extra space between sections -->

### Off-Policy: 

* Off-policy algorithms learn from data not generated by the current policy.
* Typically, data is saved into a buffer and reused by sampling from the buffer, increasing sample efficiency.
    * new data is still generated by the most recent agent
* The buffer contains tuples $(\mathbf{s}_t, \mathbf{a}_t, r_t, \mathbf{s}_{t+1}, done)$ to learn from.
* Typically, the buffer is (partly) filled before learning starts, so that learning begins with diverse, uncorrelated data
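
The buffer described above can be sketched with the Python standard library. The class name `ReplayBuffer`, its method names and the example capacity are assumptions; library implementations (e.g. in stable-baselines3) store the data in preallocated arrays instead.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of transitions (s_t, a_t, r_t, s_{t+1}, done)."""

    def __init__(self, capacity):
        self.data = deque(maxlen=capacity)  # oldest transitions are dropped first

    def add(self, state, action, reward, next_state, done):
        self.data.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        # uniform sampling breaks the temporal correlation of consecutive steps
        return rng.sample(list(self.data), batch_size)

    def __len__(self):
        return len(self.data)


buffer = ReplayBuffer(capacity=1000)
for t in range(10):  # fill the buffer (partly) before learning starts
    buffer.add(state=t, action=0, reward=1.0, next_state=t + 1, done=False)
print(len(buffer), buffer.sample(batch_size=4))
```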





### Deep-Q-Network (DQN): 




The expected discounted return the agent will receive can be expressed by the action-value function:  
$Q_\pi(s, a) = \mathbb{E}\left[r_t+\gamma r_{t+1}+\gamma^2 r_{t+2}+\ldots \mid s_t=s, a_t=a\right]$  
when
* starting in state $s$
* taking action $a$
* following policy $\pi$ afterwards

Here:
* _expected_: because the environment and policy might be stochastic
* _discounted_: because future rewards are weighted with the discount factor $\gamma \in [0, 1)$
* in a terminal state there are no future rewards

The greedy policy $\pi(s)= \underset{a \in A}{\arg \max } Q_\pi(s, a)$ chooses the action that maximizes the expected return.


<!--Bellman equation for optimal value function:  
$ Q^*\left(s_t, a_t\right)=\mathbb{E}\left[r\left(s_t, a_t\right)+\gamma \max _{a^{\prime}} Q^*\left(s_{t+1}, a^{\prime}\right)\right] $

__Q-Learning__: 
* learn the optimal Q-function $Q^*$, thus the best expected return from given states s and action a. 
* For a known $Q^*$ the optimal policy is $\pi^*(\mathbf{s}) = \arg \max  Q^*(\mathbf{s},a)$ 

-->

__Deep Q Networks (DQN)__: 
* Concept: learn the optimal Q-function $Q^*$, i.e. the best expected return for a given state $s$ and action $a$:  
   $ Q^*\left(s_t, a_t\right)=\mathbb{E}\left[r\left(s_t, a_t\right)+\gamma \max _{a^{\prime}} Q^*\left(s_{t+1}, a^{\prime}\right)\right] $
* The Q-function is approximated by a neural network $Q(s, a; \theta)$ with parameters $\theta$ (weights/biases)
* temporal difference (TD) loss $\mathcal{L}(\theta)=\left(Q(s, a ; \theta)-\left[r+\gamma \max _{a^{\prime}} Q\left(s^{\prime}, a^{\prime} ; \theta\right)\right]\right)^2$
    * Stability problems arise because both the selection of the next action and its evaluation through the Q-value depend on $\theta$
    * Solution: a separate target network with parameters $\theta^-$ predicts the target $y = r + \gamma \max _{a^{\prime}} Q\left(s^{\prime}, a^{\prime} ; \theta^-\right)$
    * The loss is now $\mathcal{L}(\theta)=\left(Q(s, a ; \theta)-y\right)^2$, which helps to stabilize training
    * The target network is updated from the Q-network's parameters at a lower frequency, either as _hard_ or _soft_ (Polyak) updates
* $\epsilon$-greedy: with probability $\epsilon$ a random action is taken, while with probability $1-\epsilon$ the best-known action is chosen
    * often $\epsilon$ is initialized close to 1 and decreased over the learning steps
    * a higher $\epsilon$ means more exploration, a lower $\epsilon$ more exploitation
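
The TD target and the two kinds of target network updates can be sketched in plain Python. This is an illustration under assumptions, not the stable-baselines3 implementation: network outputs are represented by plain lists of Q-values, parameters by lists of floats, and the names `td_target`, `td_loss`, `soft_update` and the value of `tau` are hypothetical. The target is assumed to include the immediate reward, as in the TD loss above, and to drop the bootstrap term in terminal states.

```python
def td_target(reward, next_q_target, done, gamma=0.99):
    """y = r + gamma * max_a' Q(s', a'; theta_target); no future reward if done."""
    if done:
        return reward
    return reward + gamma * max(next_q_target)


def td_loss(q_sa, y):
    """Squared TD error for one transition."""
    return (q_sa - y) ** 2


def soft_update(target_params, online_params, tau=0.005):
    """Polyak update: theta_target <- tau * theta + (1 - tau) * theta_target.

    tau = 1.0 corresponds to a hard update (copy of the online parameters).
    """
    return [tau * p + (1.0 - tau) * tp for tp, p in zip(target_params, online_params)]


# Q-values of the target network in the next state s' for the two cartpole actions
y = td_target(reward=1.0, next_q_target=[0.5, 2.0], done=False, gamma=0.9)
print(y, td_loss(q_sa=2.0, y=y))
print(soft_update(target_params=[0.0, 2.0], online_params=[1.0, 2.0], tau=0.1))
```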


Note:  
* DQN works only with a discrete action space
* off-policy
* See also the paper [Human-level control through deep reinforcement learning](https://doi.org/10.1038/nature14236) and [stable-baselines](https://stable-baselines3.readthedocs.io/en/master/modules/dqn.html) parameters.
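* _double_ Q-learning: in Q-learning, overestimation of action values might cause poor performance; double Q-learning reduces it by decoupling the selection of the next action from its evaluation
* Depending on the input, different network types are used, e.g. fully connected layers for low-dimensional states and CNNs for image inputs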




In [3]:
# example DQN code
import gymnasium as gym 
from torch import tensor
from stable_baselines3 import DQN

env = gym.make("CartPole-v1", render_mode="rgb_array") # inverted pendulum on cart

# note: here fully connected layers are used. When learning from images/pixel based representations, CNNs are commonly used.
# Also RNNs (Recurrent Neural Networks - see Introduction) can be applied. 
model = DQN.load("agent_dqn_cartpole_rewardsimple", env=env) # previously trained model
observation = tensor([1,0,0,0])

print("the model: policy: ", model.policy, '\n'*2)
# note: the input of q network for the cartpole has 4 dimensions --> state = [x, x_t, phi, phi_t]
# The output of the q network has 2 dimensions --> 2 discrete actions



Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
the model: policy:  DQNPolicy(
  (q_net): QNetwork(
    (features_extractor): FlattenExtractor(
      (flatten): Flatten(start_dim=1, end_dim=-1)
    )
    (q_net): Sequential(
      (0): Linear(in_features=4, out_features=128, bias=True)
      (1): ReLU()
      (2): Linear(in_features=128, out_features=128, bias=True)
      (3): ReLU()
      (4): Linear(in_features=128, out_features=2, bias=True)
    )
  )
  (q_net_target): QNetwork(
    (features_extractor): FlattenExtractor(
      (flatten): Flatten(start_dim=1, end_dim=-1)
    )
    (q_net): Sequential(
      (0): Linear(in_features=4, out_features=128, bias=True)
      (1): ReLU()
      (2): Linear(in_features=128, out_features=128, bias=True)
      (3): ReLU()
      (4): Linear(in_features=128, out_features=2, bias=True)
    )
  )
) 




### Soft Actor-Critic (SAC)

    
* Like DQN, SAC is __off-policy__ and learns __Q-functions__ (both Q-networks and target Q-networks)

* entropy regularization: the policy maximizes a trade-off between expected return and entropy $\mathcal{H}$ (exploration), weighted by the temperature $\alpha$
\begin{equation}
\pi^*=\arg \max _\pi \sum_t \mathbb{E}_{\left(s_t, a_t\right) \sim \pi}\left[r\left(s_t, a_t\right)+\alpha \cdot \mathcal{H}\left(\pi\left(\cdot \mid s_t\right)\right)\right]
\end{equation}

* Uses actor and critic networks:
    * __actor__: determines the action to take according to the policy function
    * __critic__: evaluates the action

* The stochastic actor outputs a mean and standard deviation from which the action is sampled, enabling continuous actions
* double _clipped_ Q-function: two Q-functions are learned and $\min(Q_{\theta_1}(\mathbf{s}, a), Q_{\theta_2}(\mathbf{s}, a))$ is used to avoid overestimation of the value.


* Only supports continuous actions by default. 
* See also the paper [Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor](https://doi.org/10.48550/arXiv.1801.01290), 2018. 
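
How the clipped double Q-function and the entropy term enter the critic update can be sketched for a single transition. The formula follows the commonly used formulation of the SAC critic target and is not taken from this notebook; the function name `sac_target` and the default values of `gamma` and `alpha` are assumptions.

```python
def sac_target(reward, done, q1_next, q2_next, log_prob_next, gamma=0.99, alpha=0.2):
    """Soft TD target for one transition.

    q1_next, q2_next: target critics evaluated at (s', a'), with a' sampled from the actor
    log_prob_next:    log pi(a' | s'); -alpha * log_prob_next is the entropy bonus
    """
    if done:
        return reward  # no future rewards in a terminal state
    soft_value = min(q1_next, q2_next) - alpha * log_prob_next
    return reward + gamma * soft_value


print(sac_target(reward=1.0, done=False, q1_next=2.0, q2_next=3.0,
                 log_prob_next=-1.0, gamma=0.5, alpha=0.5))  # 1 + 0.5 * (2 + 0.5) = 2.25
```

Taking the minimum of the two critics makes the target pessimistic, while a less likely action (lower log-probability) receives a larger entropy bonus.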



In [None]:
from stable_baselines3 import SAC
env_continous = gym.make("Pendulum-v1", render_mode="human") 
model_SAC = SAC("MlpPolicy", env_continous, verbose=1)

print('SAC actor: ', model_SAC.actor, '\n'*2)
print('*'*70, '\nSAC critic uses two Q-networks: \n', model_SAC.critic) # two Q-networks, where the min(Q1, Q2) is used for the update to reduce overestimation of Q-values
print('*'*70, '\nand two target Q-networks: \n', model_SAC.critic_target)


### Asynchronous Advantage Actor Critic (A3C)


1. Actor-Critic architecture
    * __actor__: provides the policy $\pi(a|s,\theta)$, choosing the action
    * __critic__: estimates the state's value $V(s,\theta_v)$
    * Separate loss functions $\mathcal{L}_{actor} = - \log \pi(a_t \mid s_t, \theta) \, A_t$ and $\mathcal{L}_{critic} = \left(R_t - V(s_t, \theta_v)\right)^2$
    * _Advantage_ $A_t = R_t - V(s_t, \theta_v)$: for $A_t > 0$ the action was better than expected, so its probability is increased
    * Optionally: entropy term for exploration

* A3C: multiple environments/threads run in parallel and update a shared model asynchronously
* A2C: synchronous version (also parallelized), which waits for all threads and then applies a single combined update
    * A2C is better suited for GPU implementation, A3C for CPU

* on-policy, no experience buffer. 
* supports discrete and continuous actions
* In practice actor/critic often share parts of the neural network layers.
* see the paper [Asynchronous Methods for Deep Reinforcement Learning](https://doi.org/10.48550/arXiv.1602.01783)
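
The returns, advantages and the two losses can be computed for a short rollout as follows. This is a plain-Python sketch with hypothetical function names and example numbers; in an actual implementation the advantage is treated as a constant when differentiating the actor loss, and the losses are averaged over a batch of parallel environments.

```python
import math

def n_step_returns(rewards, bootstrap_value, gamma=0.99):
    """R_t = r_t + gamma * R_{t+1}, starting from the critic's value of the last state."""
    returns, running = [], bootstrap_value
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]


def actor_critic_losses(log_probs, values, returns):
    """Mean actor and critic loss over a rollout, plus the advantages A_t = R_t - V(s_t)."""
    advantages = [R - v for R, v in zip(returns, values)]
    actor_loss = -sum(lp * a for lp, a in zip(log_probs, advantages)) / len(advantages)
    critic_loss = sum(a * a for a in advantages) / len(advantages)
    return actor_loss, critic_loss, advantages


rewards = [1.0, 1.0]                       # rewards of a two-step rollout
values = [1.0, 1.0]                        # critic estimates V(s_t)
log_probs = [math.log(0.5), math.log(0.5)] # log pi(a_t | s_t) of the taken actions
returns = n_step_returns(rewards, bootstrap_value=0.0, gamma=0.5)
print(returns, actor_critic_losses(log_probs, values, returns))
```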



In [None]:
from stable_baselines3 import A2C
model_A2C = A2C("MlpPolicy", env, verbose=1)
print('A2C structure: ', model_A2C.policy, '\n'*2, '*'*70)



### Proximal Policy Optimization (PPO): 
    
PPO is a policy gradient method that learns on-policy: no replay buffer is used, and collected data is discarded after it has been used for an update.  
The basic policy gradient objective is 
\begin{equation}
\mathcal{L}^{PG}(\theta) = \mathbb{E}_t\left[\log \pi_\theta (a_t \mid \mathbf{s}_t) \hat{A}_t \right]
\end{equation}
with the advantage estimate $\hat{A}_t = R_t - V(\mathbf{s}_t)$. 
* for positive advantages the gradient is positive, thus the action probability increases
* for negative advantages the gradient is negative and the action probability decreases 
* Trust Region Policy Optimization (TRPO) is the basis of PPO; it limits the policy change by a constraint, but this constraint adds training/implementation overhead

In Proximal Policy Optimization (PPO) the loss is 
\begin{equation}
\mathcal{L}^{\text{PPO}}(\theta) = \mathbb{E}_t \left[ 
\min\left( r_t(\theta) \hat{A}_t,\ 
\text{clip}\left(r_t(\theta),\ 1 - \epsilon,\ 1 + \epsilon\right) \hat{A}_t \right)
\right]
\end{equation}
Here $r_t(\theta) = \frac{\pi_\theta(a_t \mid \mathbf{s}_t)}{\pi_{\theta_\mathrm{old}}(a_t \mid \mathbf{s}_t)}$ is the probability ratio, which measures how much the policy changed, and the clipping hyperparameter $\epsilon$ controls the size of the trust region. 
* Clipping limits how far the policy is adapted in a single update step, as the advantage estimate might be noisy
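
The effect of the clipping can be checked numerically for a single sample. The sketch below evaluates the clipped objective for one action; the function name `ppo_clip_objective` and the default `clip_eps=0.2` are assumptions for illustration.

```python
import math

def ppo_clip_objective(log_prob_new, log_prob_old, advantage, clip_eps=0.2):
    """Clipped surrogate objective (to be maximized) for a single sample."""
    ratio = math.exp(log_prob_new - log_prob_old)              # r_t(theta)
    clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))  # clip(r_t, 1-eps, 1+eps)
    return min(ratio * advantage, clipped * advantage)


# ratio 1.5 with a positive advantage: the gain is capped at (1 + eps) * A
print(ppo_clip_objective(math.log(1.5), 0.0, advantage=1.0))
# ratio 0.5 with a negative advantage: the value is capped at (1 - eps) * A
print(ppo_clip_objective(math.log(0.5), 0.0, advantage=-1.0))
```

Once the ratio leaves the interval $[1-\epsilon, 1+\epsilon]$ in the direction favoured by the advantage, the objective becomes constant, so this sample no longer pushes the policy further away from the old one.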

In addition to the clipped loss, an entropy term is added to promote exploration. 


* on-policy, discrete and continuous. 
<!---
shared network parts
Example PPO: OpenAI Five (Dota)
https://github.com/henanmemeda/RL-Adventure-2/blob/master/3.ppo.ipynb
-->


In [None]:
from stable_baselines3 import PPO
model_PPO = PPO("MlpPolicy", env, verbose=1)
print('\n\nPPO structure: \n', model_PPO.policy, '\n'*2, '*'*70)

### Libraries


Widely used libraries for reinforcement learning are: 
* [gymnasium](https://gymnasium.farama.org/index.html): a collection of environments, previously developed as [gym](https://www.gymlibrary.dev/index.html).  
* [pytorch](https://pytorch.org/): widely used for training neural networks, also features an RL library (TorchRL). 
* [stable-baselines3](https://stable-baselines3.readthedocs.io/en/master/): A set of reliable implementations of RL algorithms using pytorch at the backend.  
* [tensorboard](https://www.tensorflow.org/tensorboard): visualization of the learning process. 
* [ray](https://www.ray.io/): scaling (RL) tasks on heterogeneous clusters


<!--* (Nvidia) [Isaac Lab](https://developer.nvidia.com/isaac/lab): framework for robotics learning built on Isaac Sim. -->


### Session Content: 

    
* [3-2_DQN_and_Cart_Pole](3-2_DQN_and_Cart_Pole.ipynb): Run a training of an RL agent on the _cartpole_ using DQN. 
* [3-3_Custom_Environment_Exudyn](3-3_Custom_Environment_Exudyn.ipynb): Recreation of the _cartpole_ example using a __custom__ gymnasium environment, __vectorized__ (parallel) environments for speedup, and the multibody code Exudyn. __Tensorboard__ is used for visualization while training.   
* [3-4_Application_Agents](3-4_Application_Agents.ipynb): Apply the agents (trained in scripts 3-2 and 3-3) to the _cartpole_ and see the influence of different rewards.  



### Let's see how we can apply these algorithms!