## CartPole-v1 Environment with MultiBanditNet Model and OptionCriticEnv

### 1. Introduction

This Jupyter notebook records interactions with the Gymnasium `CartPole-v1` environment and uses them to train a `MultiBanditNet` model to choose actions that keep the pole balanced. Episodes are first generated with random actions and stored. The model is then trained on the stored rewards through an `OptionCriticEnv` wrapper, and the trained policy is rendered with Pygame.

### 2. Methodology

The notebook follows these key steps:

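1.  **Environment Setup:** A Gymnasium environment (`CartPole-v1`) is created. Its observation vector has 4 components (cart position, cart velocity, pole angle, pole angular velocity), and its action space has 2 discrete actions (push left, push right).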
2.  **Model Initialization:** A `MultiBanditNet` model is instantiated with specified parameters (state dimension, number of options, number of actions).
3.  **Episode Simulation:** The code simulates 100 episodes to build the training data. Each episode runs until the environment reports `done`, with an upper bound of 20,000 steps.
4.  **Action Selection:** Within each step, an action (0 or 1) is sampled uniformly at random from the action space.
5.  **Environment Interaction:** The selected action is applied to the environment, resulting in a new state, reward, and done flag.
6.  **Model Training:** The recorded episodes are wrapped in an `OptionCriticEnv`, and the `MultiBanditNet` model's parameters are updated from the recorded rewards using an Adam optimizer.
7.  **Visualization (Pygame):** The trained model is run in a fresh `CartPole-v1` environment, and Pygame renders each frame together with the current step count.
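
As a rough sketch of steps 3 to 5, the loop below records transitions in the same per-episode dictionary layout the notebook uses. To stay self-contained it replaces Gymnasium with a hypothetical `ToyEnv` class that only mimics the return shapes of `reset()` and `step()`; `ToyEnv`, `collect_episodes`, and their parameters are illustrative names, not part of the notebook or of Gymnasium.

```python
import random


class ToyEnv:
    """Hypothetical stand-in that mimics Gymnasium's reset()/step() return shapes."""

    def __init__(self, max_steps=5, seed=0):
        self.max_steps = max_steps
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return [0.0, 0.0, 0.0, 0.0], {}  # (observation, info)

    def step(self, action):
        self.t += 1
        obs = [self.rng.uniform(-1, 1) for _ in range(4)]
        terminated = self.t >= self.max_steps  # toy rule, not CartPole physics
        # (observation, reward, terminated, truncated, info)
        return obs, 1.0, terminated, False, {}


def collect_episodes(env, num_episodes, max_steps, num_actions=2, seed=0):
    """Record random-policy transitions, one dictionary of lists per episode."""
    rng = random.Random(seed)
    keys = ("state", "selected_option", "action", "next_state", "reward", "done")
    episodes = {}
    for i in range(num_episodes):
        state, _ = env.reset()
        episodes[i] = {key: [] for key in keys}
        for _ in range(max_steps):
            action = rng.randrange(num_actions)  # uniform random action
            next_state, reward, terminated, truncated, _ = env.step(action)
            if terminated:
                reward = -1.0  # same terminal penalty as the notebook
            values = (state, 0, action, next_state, reward, terminated)
            for key, value in zip(keys, values):
                episodes[i][key].append(value)
            state = next_state
            if terminated or truncated:
                break
    return episodes


episodes = collect_episodes(ToyEnv(max_steps=5), num_episodes=3, max_steps=100)
print(len(episodes), len(episodes[0]["reward"]), episodes[0]["reward"][-1])
```

Stopping on either `terminated` or `truncated` is a design choice of this sketch: it avoids stepping an environment after a time limit has ended the episode.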

In [None]:
%%capture
import sys

# Add the parent directory to the search path so modules can be imported from it
sys.path.insert(0, "..")

import numpy as np
import pandas as pd
import pygame
import torch
import torch.optim as optim
from IPython.display import clear_output
from PIL import Image
from pygame.locals import *

from likelihood.models.deep import MultiBanditNet
from likelihood.models.environments import OptionCriticEnv
from likelihood.tools import train_model_with_episodes

In [None]:
# Simulated data is generated in a tabular format
import gymnasium as gym

env = gym.make("CartPole-v1", render_mode="rgb_array")  # Environment
state_dim = env.observation_space.shape[0]
num_actions = env.action_space.n  # Number of actions in the simulated environment

episodes = {}
num_simulated_steps = 20_000
num_simulated_episodes = 100

for i in range(num_simulated_episodes):
    state, _ = env.reset()  # reset() returns (observation, info)
    reward, done, truncated, info = 0.0, False, False, {}
    episodes[i] = {
        "state": [],
        "selected_option": [],
        "action": [],
        "next_state": [],
        "reward": [],
        "done": [],
    }
    for j in range(num_simulated_steps):
        action = np.random.randint(num_actions)  # Uniformly random action
        next_state, reward, done, truncated, info = env.step(action)
        if done:
            reward = -1.0  # Penalize the step that terminates the episode
        # reset() returns (observation, info); keep only the observation
        state = state[0] if isinstance(state, tuple) else state
        episodes[i]["state"].append(state)
        episodes[i]["selected_option"].append(0)
        episodes[i]["action"].append(action)
        episodes[i]["next_state"].append(next_state)
        episodes[i]["reward"].append(reward)
        episodes[i]["done"].append(done)
        state = next_state
        # Stop on termination or on time-limit truncation
        if done or truncated:
            break

In [None]:
def render_with_pygame(env):
    # Initialize Pygame
    pygame.init()

    DISPLAYSURF = pygame.display.set_mode((625, 400), 0, 32)
    clock = pygame.time.Clock()
    pygame.display.flip()

    def print_summary(text, cood, size):
        font = pygame.font.Font(pygame.font.get_default_font(), size)
        text_surface = font.render(text, True, (125, 125, 125))
        DISPLAYSURF.blit(text_surface, cood)

    count = 0
    steps = 0
    done = False
    state, _ = env.reset()
    steps += 1
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    while count < 10_000:
        steps += 1

        # Handle window events; closing the window stops the loop
        for event in pygame.event.get():
            if event.type == QUIT:
                pygame.quit()
                raise Exception("training ended")
        state_tensor = (
            torch.tensor(state[0] if type(state) == tuple else state, dtype=torch.float32)
            .unsqueeze(0)
            .to(device)
        )
        option_probs, action_probs, termination_probs, selected_option, action = model(state_tensor)
        next_state, reward, done, truncated, info = env.step(action.item())
        print("state:", state)
        print("action taken:", action.item())
        clear_output(wait=True)

        image = env.render()

        image = Image.fromarray(image, "RGB")
        mode, size, data = image.mode, image.size, image.tobytes()
        image = pygame.image.fromstring(data, size, mode)

        DISPLAYSURF.blit(image, (0, 0))
        print_summary("Step {}".format(steps), (10, 10), 15)
        pygame.display.update()
        clock.tick(10)
        count += 1
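        if done or truncated:
            print_summary("Episode ended after {} steps!".format(steps), (100, 100), 30)
            pygame.display.update()
            # Start a new episode so rendering can continue
            next_state, _ = env.reset()
            steps = 0
            done = False
        state = next_state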

    pygame.quit()

In [None]:
env = OptionCriticEnv(episodes)
state_dim = env.observation_space.shape[0]
num_actions = env.action_space.n  # Number of actions in the environment
num_options = len(num_actions)

print("state_dim :", state_dim)

num_episodes = 100

model = MultiBanditNet(state_dim, num_options, num_actions)
# Example state
state = torch.tensor(np.random.randn(state_dim), dtype=torch.float32)

# Forward pass through the model
option_probs, action_probs, termination_prob, selected_option, action = model(state)

optimizer = optim.Adam(model.parameters(), lr=1e-3)
model, final_loss = train_model_with_episodes(model, optimizer, env, num_episodes, tolerance=2_000)

In [None]:
%%capture
env = gym.make("CartPole-v1", render_mode="rgb_array")
render_with_pygame(env)

### 3. Analysis and Results

The notebook records 100 episodes from Gymnasium's `CartPole-v1` environment under a uniformly random policy. For every step of every episode, the recorded data contain the state, selected option (always 0), action, next state, reward, and done flag. The `MultiBanditNet` model is then trained on these recorded episodes through `OptionCriticEnv` for 100 training episodes, with its parameters adjusted from the recorded rewards.
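
One way to inspect recorded data of this shape is to compute each episode's length and total reward. The helper below is a minimal sketch; `summarize_episodes` and the two hand-written sample episodes are illustrative and do not come from the notebook's actual runs.

```python
def summarize_episodes(episodes):
    """Return one row per episode with its step count and total reward."""
    rows = []
    for episode_id, data in sorted(episodes.items()):
        rows.append(
            {
                "episode": episode_id,
                "steps": len(data["reward"]),
                "total_reward": sum(data["reward"]),
                # True when the last recorded step ended the episode
                "terminated": bool(data["done"]) and bool(data["done"][-1]),
            }
        )
    return rows


# Hand-written sample in the notebook's layout (only the keys used here).
sample = {
    0: {"reward": [1.0, 1.0, -1.0], "done": [False, False, True]},
    1: {"reward": [1.0, -1.0], "done": [False, True]},
}

summary = summarize_episodes(sample)
for row in summary:
    print(row)

mean_steps = sum(row["steps"] for row in summary) / len(summary)
print("mean steps:", mean_steps)
```

Because each row is a plain dictionary, the list can be passed to `pd.DataFrame(summary)` for a tabular view, using the `pandas` import already present in the notebook.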

**Table 1: Reward recorded during data collection**

| Step type                        | Reward | Notes                                                      |
|----------------------------------|--------|------------------------------------------------------------|
| Non-terminal step                | 1.0    | Default `CartPole-v1` reward for each step taken.          |
| Terminal step (`done` is `True`) | -1.0   | Overwritten by the notebook when the episode terminates.   |

The only reward-related quantity the notebook sets explicitly is the terminal reward. When an episode terminates (`done` becomes `True`, which in CartPole means the pole tilted too far or the cart left the track), the reward for that step is overwritten with -1.0. This value is a penalty that marks failure; it is not a threshold or a measure of successful completion.
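
To make the effect of the terminal penalty concrete, the sketch below computes the discounted return (the sum of rewards, each weighted by a power of a discount factor) for a short episode with and without the -1.0 override. The discount factor `gamma = 0.99` and the function names are assumptions chosen for illustration; the notebook does not state which discount, if any, `train_model_with_episodes` applies.

```python
def terminal_penalty(reward, done, penalty=-1.0):
    """Replace the reward with a penalty on the terminating step."""
    return penalty if done else reward


def discounted_return(rewards, gamma=0.99):
    """Sum of gamma**t * r_t, accumulated from the last step backwards."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


raw_rewards = [1.0, 1.0, 1.0]  # default rewards for a 3-step episode
dones = [False, False, True]
shaped = [terminal_penalty(r, d) for r, d in zip(raw_rewards, dones)]

print(shaped)
print(round(discounted_return(raw_rewards), 4))
print(round(discounted_return(shaped), 4))
```

In this toy case the penalty lowers the return of a three-step episode from about 2.97 to about 1.01, so shorter episodes are penalized proportionally more than longer ones.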

### 4. Conclusions

The notebook demonstrates a basic framework for training a `MultiBanditNet` model on recorded `CartPole-v1` episodes. The training data come from random action selection, and no evaluation results are reported, so how well the trained model balances the pole remains to be measured, for example by the average episode length under the trained policy. The -1.0 terminal reward illustrates how reward signals and termination criteria shape what the model can learn. Further improvements could come from more sophisticated action selection strategies (e.g., epsilon-greedy exploration), from established learning algorithms (e.g., Q-learning or policy gradients), and from refining the reward function to better guide the model's learning process.
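
As one concrete example of a less naive action selection strategy, the sketch below implements epsilon-greedy selection: with probability `epsilon` it picks a random action, otherwise it picks the action with the highest estimated value. The `action_values` list stands in for whatever scores a model produces; it, `epsilon_greedy`, and the numbers used are hypothetical and not taken from `MultiBanditNet`.

```python
import random


def epsilon_greedy(action_values, epsilon, rng):
    """Explore with probability epsilon, otherwise exploit the best estimate."""
    if rng.random() < epsilon:
        return rng.randrange(len(action_values))
    return max(range(len(action_values)), key=lambda a: action_values[a])


rng = random.Random(0)
action_values = [0.2, 0.8]  # hypothetical per-action scores

# With epsilon = 0 the choice is always the highest-scoring action.
greedy_choices = [epsilon_greedy(action_values, 0.0, rng) for _ in range(5)]
print(greedy_choices)

# With epsilon = 0.1 most choices are greedy, with occasional exploration.
mixed_choices = [epsilon_greedy(action_values, 0.1, rng) for _ in range(1000)]
print(sum(mixed_choices) / len(mixed_choices))
```

Replacing the uniform `np.random.randint(num_actions)` call in the data collection loop with a rule of this kind would bias the recorded episodes toward actions the current model prefers while still exploring.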