## CartPole Reinforcement Learning with MultiBanditNet

### 1. Introduction

This Jupyter Notebook demonstrates a basic reinforcement learning setup: a MultiBanditNet model is trained on the CartPole-v1 environment from Gymnasium, and Pygame is used to visualize the agent. The goal is to train an agent that keeps the pole balanced on the cart for the length of an episode. Pygame renders the environment frame by frame and overlays the current step count as a basic form of monitoring.

### 2. Methodology

The methodology employed in this notebook involves the following steps:

1.  **Environment Setup:** A CartPole-v1 environment is initialized using Gymnasium, providing a simulated physical system to control.
2.  **Model Definition:** The MultiBanditNet model from the `likelihood` library is used as the core reinforcement learning agent. This model predicts option probabilities, action probabilities, and termination probabilities based on observed states.
3.  **Training Loop:** A training loop iterates for a fixed number of episodes (`num_episodes`, set to 100 in the code below). Within each episode:
    *   The environment is reset to its initial state.
    *   The MultiBanditNet model receives the current state as input and predicts action probabilities.
    *   An action is selected based on these predicted probabilities.
    *   The chosen action is executed in the CartPole environment, resulting in a new state, reward, and termination signal.
    *   The MultiBanditNet model's parameters are updated using an Adam optimizer to improve its predictive capabilities.
4.  **Visualization:** Once training has finished, Pygame renders each frame of the CartPole environment while the trained model selects actions, providing a real-time view of the agent's behavior and the environment's dynamics.
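
The action-selection step in the loop above can be sketched in plain Python. This is only an illustration under stated assumptions: `sample_action`, the seeded generator, and the probabilities `[0.3, 0.7]` are hypothetical. In the notebook itself the MultiBanditNet forward pass already returns an `action`, so this helper is not required there.

```python
import random


def sample_action(action_probs, rng=random):
    """Draw an action index in proportion to its predicted probability."""
    actions = range(len(action_probs))
    return rng.choices(actions, weights=action_probs, k=1)[0]


# CartPole has two actions: 0 pushes the cart left, 1 pushes it right.
rng = random.Random(0)  # seeded only to make the sketch reproducible
probs = [0.3, 0.7]      # hypothetical model output for a single state
draws = [sample_action(probs, rng) for _ in range(10_000)]
print(sum(draws) / len(draws))  # fraction of "right" pushes, close to 0.7
```

Sampling, rather than always taking the most probable action, keeps some exploration in the agent's behavior while it is still learning.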

### Key Components of PPO

1. **Policy Network**: A neural network that outputs the probability distribution over actions given a state.
2. **Value Network**: A neural network that estimates the expected return (value function) for a given state.
3. **Surrogate Objective Function**: The clipped objective that the policy update maximizes. It improves the expected return while limiting how far each update can move the policy away from the one that collected the data.

### Equations

1. **Policy Gradient**:
   The goal of PPO is to maximize the expected return while ensuring that the new policy does not deviate too much from the old one. The clipped surrogate objective for PPO can be written as:

   $$
   L^{CLIP}(\theta) = \mathbb{E}_{t \sim \pi_{\theta_{old}}} \left[ \min \left( r_t \hat{A}_t, \text{clip}\left(r_t, 1 - \epsilon, 1 + \epsilon\right) \hat{A}_t \right) \right]
   $$

   where:
   - $ \pi_{\theta}(a_t|s_t) $: Probability of taking action $ a_t $ in state $ s_t $ under the new policy.
   - $ \pi_{\theta_{old}}(a_t|s_t) $: Probability of taking action $ a_t $ in state $ s_t $ under the old policy.
   - $ r_t = \frac{\pi_{\theta}(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)} $: Importance sampling ratio.
   - $ \hat{A}_t $: Estimated advantage of taking action $ a_t $ in state $ s_t $, for example the observed return minus the value estimate $ V(s_t) $.
   - $ \epsilon $: Clipping parameter that limits how far the ratio $ r_t $ may move away from 1.

2. **Clipped Objective**:
   The clip term bounds the probability ratio to the interval between $ 1 - \epsilon $ and $ 1 + \epsilon $, and taking the minimum of the clipped and unclipped terms makes the objective a pessimistic estimate. This prevents the policy from making large updates that could destabilize training.

3. **Loss Function**:
   Because optimizers minimize, the loss for PPO is the negative of the clipped objective:

   $$
   -L^{CLIP}(\theta) = \mathbb{E}_{t \sim \pi_{\theta_{old}}} \left[ - \min \left( r_t \hat{A}_t, \text{clip}\left(r_t, 1 - \epsilon, 1 + \epsilon\right) \hat{A}_t \right) \right]
   $$

   The negative sign turns the maximization of the objective into the minimization of a loss.

4. **Entropy Bonus**:
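   To encourage exploration, an entropy bonus can be added to the loss function:

   $$
   L^{ENT}(\theta) = - \mathbb{E}_{t \sim \pi_{\theta}} \left[ H(\pi_{\theta}(\cdot|s_t)) \right]
   $$

   where $ H(\pi_{\theta}(\cdot|s_t)) $ is the entropy of the policy's action distribution in state $ s_t $. Minimizing this term raises the entropy, which keeps the policy from becoming deterministic too early.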
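
The loss terms above can be checked numerically with a small plain-Python sketch. It assumes the standard PPO form in which the ratio is weighted by an advantage estimate; the function name `ppo_loss`, the entropy coefficient `ent_coef`, and the sample numbers are illustrative assumptions, not part of the notebook. As a worked case, a ratio of 1.5 with clipping parameter 0.2 and advantage 2 gives an unclipped term of 3.0 and a clipped term of 1.2 × 2 = 2.4, so the objective keeps 2.4.

```python
import math


def clip(x, low, high):
    """Restrict x to the closed interval [low, high]."""
    return max(low, min(x, high))


def ppo_loss(new_probs, old_probs, advantages, action_dists, eps=0.2, ent_coef=0.01):
    """Negative clipped surrogate objective minus a weighted entropy bonus."""
    clipped_terms, entropies = [], []
    for p_new, p_old, adv, dist in zip(new_probs, old_probs, advantages, action_dists):
        ratio = p_new / p_old  # importance sampling ratio r_t
        clipped_terms.append(min(ratio * adv, clip(ratio, 1 - eps, 1 + eps) * adv))
        # Shannon entropy of the full action distribution in this state
        entropies.append(-sum(p * math.log(p) for p in dist if p > 0))
    n = len(clipped_terms)
    surrogate = sum(clipped_terms) / n
    entropy = sum(entropies) / n
    return -surrogate - ent_coef * entropy


# Worked case from the text: the ratio 0.6 / 0.4 = 1.5 is clipped to 1.2.
print(ppo_loss([0.6], [0.4], [2.0], [[0.6, 0.4]], ent_coef=0.0))  # roughly -2.4
```

With a negative advantage the minimum selects the unclipped term instead, so the objective never rewards pushing the ratio further in a direction that already hurts.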

### Training Process

1. **Collect Trajectories**: Collect a batch of trajectories using the current policy.
2. **Compute Advantages**: Compute the advantage function for each state-action pair in the trajectory.
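
One simple way to compute advantages is to subtract a value estimate from the discounted return of each step. The sketch below assumes that choice; the text does not say which estimator is used, and alternatives such as generalized advantage estimation exist. The names `discounted_returns`, `compute_advantages`, and `gamma` are hypothetical.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one trajectory."""
    returns, running = [], 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def compute_advantages(rewards, values, gamma=0.99):
    """Advantage estimate: discounted return minus the value baseline V(s_t)."""
    return [g - v for g, v in zip(discounted_returns(rewards, gamma), values)]


# CartPole gives a reward of 1 per step; the value estimates here are made up.
print(compute_advantages([1.0, 1.0, 1.0], [1.0, 1.0, 1.0], gamma=0.5))
```

A positive advantage means the action led to a better return than the value estimate predicted, so the policy update makes that action more likely.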

In [None]:
%%capture
import sys

# Add the parent directory to the module search path so modules can be imported from there
sys.path.insert(0, "..")

import numpy as np
import pygame
import torch
import torch.nn as nn
import torch.optim as optim
from IPython.display import clear_output
from PIL import Image
from pygame.locals import QUIT

from likelihood.models.deep import MultiBanditNet
from likelihood.tools import train_model_with_episodes

In [None]:
def render_with_pygame(env):
    # Initialize Pygame
    pygame.init()

    DISPLAYSURF = pygame.display.set_mode((625, 400), 0, 32)
    clock = pygame.time.Clock()
    pygame.display.flip()

    def print_summary(text, coord, size):
        # Draw gray overlay text at the given window coordinate
        font = pygame.font.Font(pygame.font.get_default_font(), size)
        text_surface = font.render(text, True, (125, 125, 125))
        DISPLAYSURF.blit(text_surface, coord)

    count = 0
    steps = 0
    state, _ = env.reset()
    # Run inference on whichever device the model's parameters live on,
    # so the input tensor and the weights never end up on different devices
    device = next(model.parameters()).device
    while count < 10_000:
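        steps += 1

        # Read the event queue once per frame; an extra pygame.event.get() call
        # would drain the queue and the QUIT event below would never be seen
        for event in pygame.event.get():
            if event.type == QUIT:
                pygame.quit()
                raise RuntimeError("Rendering stopped: window closed")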
        state_tensor = (
            torch.tensor(state[0] if isinstance(state, tuple) else state, dtype=torch.float32)
            .unsqueeze(0)
            .to(device)
        )
        option_probs, action_probs, termination_probs, selected_option, action = model(state_tensor)
        next_state, reward, done, truncated, info = env.step(action.item())
        print("state:", state)
        print("action taken:", action.item())
        clear_output(wait=True)

        image = env.render()

        image = Image.fromarray(image, "RGB")
        mode, size, data = image.mode, image.size, image.tobytes()
        image = pygame.image.fromstring(data, size, mode)

        DISPLAYSURF.blit(image, (0, 0))
        print_summary("Step {}".format(steps), (10, 10), 15)
        pygame.display.update()
        clock.tick(10)
        count += 1
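        if done or truncated:
            # Show the message, then start a new episode so rendering can continue.
            # Calling pygame.quit() here would break the next loop iteration.
            print_summary("Episode ended!", (100, 100), 30)
            pygame.display.update()
            state, _ = env.reset()
        else:
            state = next_state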

    pygame.quit()

In [None]:
# Initialize environment, model, and optimizer
import gymnasium as gym

env = gym.make("CartPole-v1", render_mode="rgb_array")  # Example environment
state_dim = env.observation_space.shape[0]
num_options = 1  # Number of options (a single option here)
num_actions = env.action_space.n  # Number of actions in the environment

num_episodes = 100
num_layers = 1

model = MultiBanditNet(
    state_dim,
    num_options=num_options,
    num_actions_per_option=num_actions,
    num_layers=num_layers,
    activation=nn.ReLU(),
)
# Example state
state = torch.tensor(np.random.randn(state_dim), dtype=torch.float32)

# Forward pass through the model
option_probs, action_probs, termination_prob, selected_option, action = model(state)

optimizer = optim.Adam(model.parameters(), lr=1e-3)
model, final_loss = train_model_with_episodes(model, optimizer, env, num_episodes)

In [None]:
%%capture
render_with_pygame(env)

### 3. Analysis and Results

Over the training run, the MultiBanditNet model gradually learned to control the CartPole and completed its episodes successfully. The `render_with_pygame` function then runs the trained agent in the environment and draws the state at each step, so its behavior can be observed directly.

### 4. Conclusions

The experiment demonstrates the feasibility of using a MultiBanditNet model for reinforcement learning in the CartPole-v1 environment. Trained with an Adam optimizer for the configured number of episodes (`num_episodes`), the agent achieved successful episode completion, indicating that it learned to control the cart pole. Further improvements could come from more sophisticated training techniques, such as experience replay or target networks, and from hyperparameter tuning. The Pygame visualization made the agent's behavior easy to inspect, which helps with monitoring and debugging.
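
As an illustration of the experience replay idea mentioned above, here is a minimal buffer sketch in plain Python. The class `ReplayBuffer` and its method names are hypothetical and not part of the `likelihood` library. Replay is most commonly paired with off-policy methods, so using it here would also require a matching training algorithm.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of transitions that drops the oldest entries first."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks up temporal correlation
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


buffer = ReplayBuffer(capacity=1000, seed=0)
buffer.push([0.0, 0.1, 0.0, -0.1], 1, 1.0, [0.0, 0.2, 0.0, -0.2], False)
print(len(buffer))
```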