#  Reinforcement Learning (RL)

## **Session 3-4:** Application and visualization of agents

Several trained agents are loaded and applied to the environment. They differ mainly in the reward function used during training.

In [10]:
from stable_baselines3 import DQN, A2C, PPO
import gymnasium as gym
import matplotlib.pyplot as plt
import numpy as np
# visualisation in Jupyter
from matplotlib import animation
from IPython.display import HTML

env = gym.make("CartPole-v1", render_mode="rgb_array") 
# Discrete(2) actions, Box(4,) observations

agent1 = DQN.load("agent_dqn_cartpole_rewardsimple", env=env) # reward 1
agent2 = A2C.load("agent_A2C_cartPole_rewardupdated", env=env) # reward 2


Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


In [11]:
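def simulate_and_capture(myAgent, env, max_steps=1000):
    """Run one episode with the deterministic policy and collect the rendered frames."""
    obs, _ = env.reset()
    frames = []
    for _ in range(max_steps):
        action, _ = myAgent.predict(obs, deterministic=True)
        # Gymnasium reports `terminated` (failure state) and `truncated` (time limit) separately
        obs, reward, terminated, truncated, info = env.step(action)
        frames.append(env.render())
        if terminated or truncated:
            break
    return frames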

def create_animation(myAgent, env): 
    frames = simulate_and_capture(myAgent, env)
    fig = plt.figure(figsize=(4,4))
    plt.axis('off')
    im = plt.imshow(frames[0])

    def update(i):
        im.set_data(frames[i])
        return (im,)

    myAnimation = animation.FuncAnimation(
        fig, update, frames=len(frames), interval=50, blit=True
    )
    plt.close(fig)               
    return myAnimation

animations = [
    create_animation(agent1, env),
    create_animation(agent2, env),
]



### Reward 1
The first agent was trained with a constant reward for every time step in which the episode continues:
\begin{equation}
 r = 1.
\end{equation}
* In short episodes, this permits the simple strategy of balancing the pole while the cart moves sideways at a constant velocity.
* This policy still reaches a high total reward, although it is not the intended behavior.
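
To illustrate why a constant reward cannot tell these behaviors apart, the following minimal sketch compares two hypothetical trajectories of equal length. The cart positions are made up for illustration and are not produced by the real environment; the step duration `dt` and the drift velocity are assumptions.

```python
def episode_return_constant(num_steps):
    """Total reward of an episode under r = 1 per step: just the episode length."""
    return sum(1.0 for _ in range(num_steps))

# Hypothetical cart positions in metres (illustration only, not simulated).
dt = 0.02          # assumed duration of one time step in seconds
num_steps = 200    # assumed short episode
centered = [0.0 for _ in range(num_steps)]
drifting = [0.5 * dt * t for t in range(num_steps)]  # constant drift of 0.5 m/s

# Both trajectories receive the same total reward, since position never enters r.
print(episode_return_constant(len(centered)))
print(episode_return_constant(len(drifting)))
```

Because the position does not appear in the reward, the drifting trajectory scores exactly as well as the centered one for as long as the episode continues.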

In [12]:
HTML(animations[0].to_jshtml())

### Reward 2
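The second agent was trained with the reward
\begin{equation}
 r = 1 - 0.5 \frac{| x |}{x_{max}} - 0.5 \frac{| \varphi| }{\varphi_{max}},
\end{equation}
where $x$ is the cart position and $\varphi$ is the pole angle. The two penalty terms grow with the distance from the center and with the tilt of the pole, which encourages the agent to stay closer to the middle position.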
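
The reward above can be written as a small function. This is only a sketch: the function name `shaped_reward` is hypothetical, and the default limits are assumptions taken from the CartPole termination thresholds (2.4 for the position, 12 degrees or about 0.2095 rad for the angle); the values actually used in training are not stated here.

```python
def shaped_reward(x, phi, x_max=2.4, phi_max=0.2095):
    """Reward 2: r = 1 - 0.5*|x|/x_max - 0.5*|phi|/phi_max.

    x       cart position
    phi     pole angle in radians
    x_max   assumed position limit (CartPole termination threshold)
    phi_max assumed angle limit in radians (about 12 degrees)
    """
    return 1.0 - 0.5 * abs(x) / x_max - 0.5 * abs(phi) / phi_max

print(shaped_reward(0.0, 0.0))  # centered and upright -> 1.0
print(shaped_reward(1.2, 0.0))  # halfway to the assumed position limit -> 0.75
```

During training, such a function could for example be applied in the `step` method of a `gymnasium.Wrapper` subclass, which has access to the observation containing the position and angle.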

In [13]:
HTML(animations[1].to_jshtml())