### Review of RL and RLHF for NLP
---


Model-free reinforcement learning (model-free RL) learns a policy or an action-value function directly, without building an explicit model of the environment. It does so by optimizing policies and value functions from experience gathered by interacting with the environment.

### Basic concepts and notation

Consider an RL problem formulated as a Markov Decision Process (MDP), defined by a set of states $S$, a set of actions $A$, a reward function $R$, a transition function $P$, and a discount factor $\gamma \in [0, 1)$. The goal is to find a policy $\pi$ that maximizes the expected cumulative discounted reward.

#### Formal definitions

1. **State value**: $V^\pi(s)$ is the expected return when following policy $\pi$ from state $s$.

   $$
   V^\pi(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \mid s_0 = s, \pi\right]
   $$
   
   This is the expectation of the sum of future rewards, discounted by the factor $\gamma$, when following policy $\pi$ starting from state $s$.

2. **Action value**: $Q^\pi(s, a)$ is the expected return when taking action $a$ in state $s$ and following policy $\pi$ afterwards.
   
   $$
   Q^\pi(s, a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \mid s_0 = s, a_0 = a, \pi\right]
   $$
   The only difference from $V^\pi$ is that the first action is fixed to $a$ instead of being sampled from $\pi$, so $V^\pi(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[Q^\pi(s, a)\right]$.
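
The two definitions above are expectations of a discounted sum, which can be estimated by averaging sampled returns. The following is a minimal sketch of that idea; the function names and the toy reward sequences are illustrative assumptions, not part of the original notebook.

```python
def discounted_return(rewards, gamma=0.99):
    """Return sum_t gamma^t * r_t for one finite trajectory."""
    g = 0.0
    # Accumulate backwards so each step costs one multiply-add.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def monte_carlo_value(episodes, gamma=0.99):
    """Average the discounted returns of episodes that all start in the same state."""
    returns = [discounted_return(ep, gamma) for ep in episodes]
    return sum(returns) / len(returns)


# Two hypothetical reward sequences observed from the same start state.
episodes = [[1.0, 0.0, 2.0], [0.0, 1.0]]
print(monte_carlo_value(episodes, gamma=0.5))  # (1.5 + 0.5) / 2 = 1.0
```

With more sampled episodes, this average approaches $V^\pi(s)$; fixing the first action of every episode to $a$ gives an estimate of $Q^\pi(s, a)$ instead.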

---

## Policy optimization

Policy optimization iteratively improves the policy $\pi$ to maximize the expected cumulative reward.

### Policy Gradient

Policy gradient methods optimize a parameterized policy $\pi_\theta$ directly by following the gradient of the expected return.

#### Key equations

The objective to maximize is the expected return:

$$
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \right]
$$

The policy gradient is:

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{\infty} \nabla_\theta \log \pi_\theta(a_t \mid s_t) Q^\pi(s_t, a_t) \right]
$$

This expression follows from the log-derivative trick (the likelihood-ratio estimator) and is the statement of the policy gradient theorem: the gradient is an expectation of the score $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ weighted by the action value $Q^\pi(s_t, a_t)$. In practice $Q^\pi$ is unknown and is replaced by an estimate, such as the sampled discounted return used by REINFORCE.

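### A2C/A3C (Advantage Actor-Critic and Asynchronous Advantage Actor-Critic)

A2C and A3C are actor-critic algorithms: an actor learns the policy while a critic estimates its value function. A3C runs several workers that update shared parameters asynchronously; A2C is the synchronous variant.

#### Key equations

The per-step loss to minimize is: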

$$
L(\theta) = -\log \pi_\theta(a_t \mid s_t) \left( R_t - V^\pi(s_t) \right) + \frac{1}{2} \left( R_t - V^\pi(s_t) \right)^2
$$

where $R_t$ is the observed return and $V^\pi(s_t)$ is the critic's estimate of the state value. The first term is the actor loss: it raises the log-probability of actions whose return exceeds the critic's estimate (positive advantage $R_t - V^\pi(s_t)$) and lowers it otherwise, with the advantage treated as a constant. The second term is the critic loss, which fits the value estimate to the observed return.
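
To make the two terms concrete, here is a small numeric sketch of the loss for a single time step. The function name `a2c_step_loss` and the sample numbers are assumptions made for illustration only.

```python
import math


def a2c_step_loss(log_prob, ret, value, value_coef=0.5):
    """Actor and critic terms of the per-step A2C loss.

    log_prob: log pi(a_t | s_t) of the action taken
    ret:      observed return R_t
    value:    critic estimate V(s_t)
    """
    advantage = ret - value                 # treated as a constant in the actor term
    actor_loss = -log_prob * advantage      # pushes up actions with positive advantage
    critic_loss = value_coef * advantage ** 2
    return actor_loss, critic_loss


# Hypothetical step: the action had probability 0.5, the return was 2.0, the critic predicted 1.5.
actor, critic = a2c_step_loss(math.log(0.5), ret=2.0, value=1.5)
print(actor, critic)
```

A positive advantage gives a positive actor loss here, so gradient descent increases the probability of that action.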

### PPO (Proximal Policy Optimization)

PPO limits how far each update can move the policy, avoiding large changes that destabilize training.

#### Key equations

The clipped surrogate objective of PPO is:

$$
L^{\text{CLIP}}(\theta) = \mathbb{E}_t \left[ \min \left( r_t(\theta) \hat{A}_t, \text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}_t \right) \right]
$$

where $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$ is the probability ratio and $\hat{A}_t$ is the estimated advantage. The objective is maximized; clipping removes the incentive to push $r_t(\theta)$ outside $[1 - \epsilon, 1 + \epsilon]$, and taking the minimum makes the result a pessimistic bound on the unclipped surrogate.
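
The effect of the `min` and `clip` operations is easiest to see on single values. The sketch below evaluates the term inside the expectation for one sample; `ppo_clip_term` is a hypothetical helper name.

```python
def ppo_clip_term(ratio, advantage, epsilon=0.2):
    """Value of min(r * A, clip(r, 1 - eps, 1 + eps) * A) for a single sample."""
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)


# Good action (A > 0): the gain is capped once the ratio exceeds 1 + eps.
print(ppo_clip_term(1.5, 1.0))
# Bad action (A < 0): the penalty is not capped when the ratio grows.
print(ppo_clip_term(1.5, -1.0))
# Bad action whose probability was already reduced below 1 - eps: no further incentive.
print(ppo_clip_term(0.5, -1.0))
```

Only moves that would improve the objective are clipped; moves that make it worse keep their full penalty.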

### TRPO (Trust Region Policy Optimization)

TRPO optimizes the policy inside a trust region so that each update is expected to improve it.

#### Key equations

TRPO maximizes the surrogate objective:

$$
L^{\text{TRPO}}(\theta) = \mathbb{E}_t \left[ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)} \hat{A}_t \right]
$$

subject to:

$$
\mathbb{E}_t \left[ \text{KL} \left[ \pi_{\theta_{\text{old}}} (\cdot \mid s_t) \parallel \pi_\theta (\cdot \mid s_t) \right] \right] \leq \delta
$$

The objective is the importance-weighted surrogate for the expected return, and the constraint keeps the average KL divergence between the old and new policies below a threshold $\delta$.
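
For discrete action spaces the constraint can be checked directly from the action probabilities. This is a minimal sketch, assuming each policy is given as a list of per-state categorical distributions; the helper names are hypothetical.

```python
import math


def categorical_kl(p_old, p_new):
    """KL(p_old || p_new) for two categorical distributions over the same actions."""
    return sum(po * math.log(po / pn) for po, pn in zip(p_old, p_new) if po > 0)


def within_trust_region(old_dists, new_dists, delta=0.01):
    """Average the per-state KL and compare it with the threshold delta."""
    mean_kl = sum(categorical_kl(o, n) for o, n in zip(old_dists, new_dists)) / len(old_dists)
    return mean_kl <= delta, mean_kl


old = [[0.5, 0.5], [0.5, 0.5]]
small_step = [[0.52, 0.48], [0.5, 0.5]]
large_step = [[0.9, 0.1], [0.9, 0.1]]
print(within_trust_region(old, small_step))  # accepted
print(within_trust_region(old, large_step))  # rejected
```

TRPO enforces the constraint through its step size rather than by accepting or rejecting candidate policies after the fact, but the quantity being bounded is the same.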

### DPPO (Distributed Proximal Policy Optimization)

DPPO extends PPO to distributed settings, in which multiple workers collect experience and compute updates in parallel.

#### Key equations

DPPO uses the same objective as PPO, but the updates computed by the workers are aggregated and applied to shared policy parameters in a synchronized way. The main design concern is how updates are synchronized and aggregated so that training remains stable and efficient.

### TD3 (Twin Delayed Deep Deterministic Policy Gradient)

TD3 improves on DDPG (Deep Deterministic Policy Gradient) by reducing overestimation of the action value through two critics and delayed actor updates.

#### Key equations

Both critics are regressed toward the same target:

$$
y_t = r_t + \gamma \min_{i=1,2} Q_{\theta_i'}(s_{t+1}, \pi_{\phi'}(s_{t+1}))
$$

where $y_t$ is the TD target, $Q_{\theta_i'}$ are the two target critics, and $\pi_{\phi'}$ is the target policy (a slowly updated copy of the actor, not the current actor). TD3 also adds clipped Gaussian noise to the target action (target policy smoothing); the noise is omitted from the equation for brevity.

The actor is updated less often than the critics, using the deterministic policy gradient:

$$
\nabla_\phi J(\phi) = \mathbb{E}_t \left[ \nabla_a Q_{\theta_1}(s_t, a) \mid_{a = \pi_\phi(s_t)} \nabla_\phi \pi_\phi(s_t) \right]
$$

The actor follows the gradient of the first critic only. Delaying its update lets the critics' error shrink before the policy changes, which makes training more stable.

---


## Q-Learning

Q-Learning is one of the most fundamental algorithms in reinforcement learning. It learns an optimal policy for an MDP by estimating the expected return of each action in each state, known as the Q-value.

Recent work has extended Q-Learning with deep neural networks and advanced statistical approaches, improving its ability to solve complex, high-dimensional problems.

### Q-Learning

Q-Learning is an off-policy method that learns the action-value function $Q(s, a)$: the expected return of taking action $a$ in state $s$ and following the optimal policy from then on.

#### Key equations

The Q-Learning update is derived from the Bellman optimality equation:

$$
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left( r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right)
$$

where:
- $\alpha$ is the learning rate.
- $\gamma$ is the discount factor.
- $r_t$ is the reward obtained by taking action $a_t$ in state $s_t$.

The update moves the current estimate $Q(s_t, a_t)$ toward the TD target $r_t + \gamma \max_{a'} Q(s_{t+1}, a')$, which bootstraps from the best estimated action value in the next state $s_{t+1}$.
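
A single tabular update can be traced by hand. The sketch below assumes a dictionary-based Q-table keyed by `(state, action)`; the `done` flag, which drops the bootstrap term at terminal states, is an addition not shown in the equation.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99, done=False):
    """Apply one Q-Learning update in place and return the new Q(s, a)."""
    # Best estimated value of the next state; zero if the episode ended there.
    best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions)
    td_target = r + gamma * best_next
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
    return Q[(s, a)]


# Toy table: every entry starts at 0 except Q(s1, right) = 2.
Q = defaultdict(float)
Q[("s1", "right")] = 2.0
new_value = q_learning_update(Q, "s0", "right", r=1.0, s_next="s1",
                              actions=["left", "right"], alpha=0.5, gamma=0.9)
print(new_value)  # 0 + 0.5 * (1 + 0.9 * 2 - 0) = 1.4
```

The update is off-policy because the target uses the maximum over next actions, regardless of which action the behavior policy takes next.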

---

### Deep Q-Network (DQN)

DQN extends Q-Learning by approximating the Q-function with a deep neural network. This lets the algorithm handle continuous, high-dimensional state spaces (the action space remains discrete).

#### Key equations

The DQN loss is:

$$
L(\theta) = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim D} \left[ \left( y_t - Q(s_t, a_t; \theta) \right)^2 \right]
$$

where:
- $y_t = r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta^{-})$ is the TD target.
- $\theta$ are the parameters of the online Q-network.
- $\theta^{-}$ are the parameters of the target network, which are copied from $\theta$ periodically and held fixed in between.

The gradient of this loss with respect to $\theta$ is computed by backpropagation and used to update the Q-network; $y_t$ is treated as a constant.

---

### C51 (Categorical DQN)

C51 is an extension of DQN that models the full distribution of future returns instead of only its expected value. It uses a categorical representation of the return distribution.

#### Key equations

The return distribution is a discrete probability distribution over $N$ fixed atoms:

$$
Z_{\theta}(s, a) = \sum_{i=1}^{N} p_i(s, a; \theta) \, \delta_{z_i}
$$

where $z_i$ are the support values (atoms) and $p_i(s, a; \theta)$ are the associated probabilities. The loss is the Kullback-Leibler divergence between the projected target distribution and the prediction:

$$
L(\theta) = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim D} \left[ D_{\text{KL}}\left( \Phi \hat{\mathcal{T}} Z_{\theta^{-}}(s_t, a_t) \parallel Z_{\theta}(s_t, a_t) \right) \right]
$$

Here $\hat{\mathcal{T}} Z_{\theta^{-}}(s_t, a_t)$ places probability $p_i(s_{t+1}, a^*; \theta^{-})$ on the shifted atom $r_t + \gamma z_i$, with $a^*$ the greedy action under the target network, and $\Phi$ projects that distribution back onto the fixed support $\{z_i\}$.
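
The step that usually needs the most care in C51 is moving the target distribution back onto the fixed atoms. The following pure-Python sketch shows that projection for a single transition, assuming evenly spaced atoms; `project_distribution` is a hypothetical name and the inputs are toy values.

```python
import math


def project_distribution(probs, support, reward, gamma, done=False):
    """Project the distribution of r + gamma * z onto the fixed, evenly spaced support."""
    v_min, v_max = support[0], support[-1]
    n = len(support)
    delta_z = (v_max - v_min) / (n - 1)
    projected = [0.0] * n
    for p_j, z_j in zip(probs, support):
        # Shift and shrink the atom, then clamp it to the support range.
        tz = reward + (0.0 if done else gamma * z_j)
        tz = min(v_max, max(v_min, tz))
        b = (tz - v_min) / delta_z          # fractional index on the support
        lower, upper = int(math.floor(b)), int(math.ceil(b))
        if lower == upper:
            projected[lower] += p_j         # lands exactly on an atom
        else:
            # Split the mass between the two neighbours in proportion to proximity.
            projected[lower] += p_j * (upper - b)
            projected[upper] += p_j * (b - lower)
    return projected


# All mass on z = 1; a reward of 0.5 moves it to 1.5, halfway between atoms 1 and 2.
print(project_distribution([0.0, 1.0, 0.0], [0.0, 1.0, 2.0], reward=0.5, gamma=1.0))
```

The projected probabilities still sum to one, which is what allows the cross-entropy (KL) loss to be used as the training signal.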

---

### QR-DQN (Quantile Regression DQN)

QR-DQN extends DQN by using quantile regression to approximate the distribution of future returns. Instead of modelling only the mean return, it estimates $K$ quantiles of the return distribution.

#### Key equations

QR-DQN represents the return distribution as a uniform mixture of $K$ Dirac deltas placed at the estimated quantile values:

$$
Z_{\theta}(s, a) = \frac{1}{K} \sum_{i=1}^{K} \delta_{\hat{Q}_\theta^{\tau_i}(s, a)}
$$

where $\hat{Q}_\theta^{\tau_i}(s, a)$ is the estimate of the quantile at level $\tau_i = \frac{2i - 1}{2K}$. The loss compares every predicted quantile with every target quantile:

$$
L(\theta) = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim D} \left[ \frac{1}{K} \sum_{i=1}^{K} \sum_{j=1}^{K} \rho_{\tau_i} \left( r_t + \gamma \hat{Q}_{\theta^{-}}^{\tau_j}(s_{t+1}, a^*) - \hat{Q}_{\theta}^{\tau_i}(s_t, a_t) \right) \right]
$$

where $a^* = \arg\max_{a'} \frac{1}{K} \sum_{j=1}^{K} \hat{Q}_{\theta^{-}}^{\tau_j}(s_{t+1}, a')$ is the greedy action under the target network's mean estimate, and $\rho_{\tau}(u) = \left| \tau - \mathbb{1}\{u < 0\} \right| \mathcal{L}_\kappa(u)$ is the quantile Huber loss, with $\mathcal{L}_\kappa$ the Huber loss with threshold $\kappa$.
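
The quantile Huber loss weights over- and under-estimation asymmetrically, which is what makes each output converge to a different quantile. A minimal scalar sketch follows, assuming the quantile levels are the midpoints $(2i - 1) / (2K)$ and using hypothetical function names.

```python
def huber(u, kappa=1.0):
    """Huber loss: quadratic near zero, linear in the tails."""
    a = abs(u)
    return 0.5 * u * u if a <= kappa else kappa * (a - 0.5 * kappa)


def quantile_huber(u, tau, kappa=1.0):
    """Quantile Huber loss for TD error u = target - prediction at quantile level tau."""
    indicator = 1.0 if u < 0 else 0.0
    return abs(tau - indicator) * huber(u, kappa)


def quantile_midpoints(k):
    """Quantile levels tau_i = (2i - 1) / (2k) for i = 1..k."""
    return [(2 * i - 1) / (2 * k) for i in range(1, k + 1)]


print(quantile_midpoints(4))
# For a high quantile (tau = 0.9), predicting too low (u > 0) costs far more than predicting too high.
print(quantile_huber(0.5, 0.9), quantile_huber(-0.5, 0.9))
```

At `tau = 0.5` the two directions are weighted equally and the loss reduces to half the Huber loss, which targets the median.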

---

### HER (Hindsight Experience Replay)

HER is a technique that changes how experience is stored and reused in reinforcement learning. It lets agents learn from failures by replacing the goals of failed trajectories with goals that were actually achieved.

#### Key equations

The main idea is to replay each observed transition $(s, a, r, s')$ as if the agent had been pursuing a different goal $g'$. For every original transition, several relabeled transitions are created, each with a goal $g'$ that was achieved later in the same episode and a reward recomputed for that goal, $r' = R(s', g')$. The goal-conditioned Q-function is then updated on the relabeled data:

$$
Q(s, a, g') \leftarrow Q(s, a, g') + \alpha \left( r' + \gamma \max_{a'} Q(s', a', g') - Q(s, a, g') \right)
$$

where $g'$ is the substituted goal; the same update with the original goal $g$ and reward $r$ applies to the unmodified transition. This technique improves sample efficiency, especially in high-dimensional environments with sparse rewards.
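
The relabeling step can be sketched without any learning code. The example below assumes that states are hashable and double as achieved goals, and that the reward is sparse (0 on success, -1 otherwise); both are illustrative assumptions, as are the function names.

```python
import random


def sparse_reward(achieved, goal):
    """Assumed sparse reward: 0 when the goal is reached, -1 otherwise."""
    return 0.0 if achieved == goal else -1.0


def her_relabel(episode, k=4, rng=random):
    """Create k hindsight transitions per step using the 'future' strategy.

    episode: list of (state, action, next_state) tuples.
    Returns a list of (state, action, reward, next_state, goal) tuples.
    """
    relabeled = []
    for t, (s, a, s_next) in enumerate(episode):
        for _ in range(k):
            # Pick a state that was actually reached at this step or later.
            future = rng.randint(t, len(episode) - 1)
            new_goal = episode[future][2]
            relabeled.append((s, a, sparse_reward(s_next, new_goal), s_next, new_goal))
    return relabeled


episode = [("s0", "a0", "s1"), ("s1", "a1", "s2"), ("s2", "a2", "s3")]
for transition in her_relabel(episode, k=2, rng=random.Random(0)):
    print(transition)
```

Every relabeled goal was reached in the episode, so some transitions always carry the success reward even when the original goal was never achieved.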

---

This report gives a detailed overview of several advanced Q-learning methods, focusing on the key equations that drive training and policy improvement in complex, high-dimensional environments.

## Code

In [None]:
# Policy Gradient
import torch
import torch.nn as nn
import torch.optim as optim

class PolicyNetwork(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(PolicyNetwork, self).__init__()
        self.fc1 = nn.Linear(state_dim, 128)
        self.fc2 = nn.Linear(128, action_dim)
    
    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.softmax(self.fc2(x), dim=-1)
        return x

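def update_policy(policy_network, optimizer, rewards, log_probs, gamma=0.99):
    # Discounted return G_t for every step, accumulated backwards.
    discounted_rewards = []
    cumulative_reward = 0.0
    for reward in rewards[::-1]:
        cumulative_reward = reward + gamma * cumulative_reward
        discounted_rewards.insert(0, cumulative_reward)
    
    discounted_rewards = torch.tensor(discounted_rewards, dtype=torch.float32)
    # Normalize returns to reduce gradient variance (skipped for one-step episodes, where std is undefined).
    if len(discounted_rewards) > 1:
        discounted_rewards = (discounted_rewards - discounted_rewards.mean()) / (discounted_rewards.std() + 1e-9)
    
    # REINFORCE loss: -sum_t log pi(a_t | s_t) * G_t.
    policy_loss = []
    for log_prob, reward in zip(log_probs, discounted_rewards):
        policy_loss.append(-log_prob.reshape(()) * reward)
    
    optimizer.zero_grad()
    # torch.stack accepts the 0-dim log-probs that torch.cat would reject.
    policy_loss = torch.stack(policy_loss).sum()
    policy_loss.backward()
    optimizer.step()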


In [None]:
# A2C/A3C
import torch
import torch.nn as nn
import torch.multiprocessing as mp

class ActorCritic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(ActorCritic, self).__init__()
        self.fc1 = nn.Linear(state_dim, 128)
        self.actor = nn.Linear(128, action_dim)
        self.critic = nn.Linear(128, 1)
    
    def forward(self, x):
        x = torch.relu(self.fc1(x))
        return torch.softmax(self.actor(x), dim=-1), self.critic(x)

def worker(global_model, optimizer, global_ep, res_queue, env, state_dim, action_dim, max_episodes):
    # `env` is assumed to follow the classic Gym API: reset() -> obs, step() -> (obs, reward, done, info).
    local_model = ActorCritic(state_dim, action_dim)
    local_model.load_state_dict(global_model.state_dict())
    
    while global_ep.value < max_episodes:
        state = env.reset()
        done = False
        rewards, log_probs, entropies, values = [], [], [], []
        while not done:
            state = torch.tensor(state, dtype=torch.float32).unsqueeze(0)
            action_probs, value = local_model(state)
            action = torch.multinomial(action_probs, 1).item()
            
            next_state, reward, done, _ = env.step(action)
            log_prob = torch.log(action_probs.squeeze(0)[action])
            # Entropy of the full action distribution, used as an exploration bonus.
            entropy = -(action_probs * torch.log(action_probs.clamp_min(1e-8))).sum()
            
            rewards.append(reward)
            log_probs.append(log_prob)
            entropies.append(entropy)
            values.append(value)
            
            state = next_state
        
        # The rollout always runs to the end of the episode, so the bootstrap value is 0.
        R = torch.zeros(1, 1)
        values.append(R)  # V(s_T), needed by the one-step TD error below
        
        policy_loss, value_loss = 0, 0
        gae = torch.zeros(1, 1)
        for i in reversed(range(len(rewards))):
            R = 0.99 * R + rewards[i]
            advantage = R - values[i]
            value_loss += advantage.pow(2)
            
            # Generalized advantage estimate (gamma = 0.99, lambda = 0.95), detached from the graph.
            delta_t = rewards[i] + 0.99 * values[i + 1].detach() - values[i].detach()
            gae = gae * 0.95 * 0.99 + delta_t
            
            policy_loss = policy_loss - log_probs[i] * gae - 0.01 * entropies[i]
        
        optimizer.zero_grad()
        loss = (policy_loss + 0.5 * value_loss).sum()
        loss.backward()
        # Hand the local gradients to the shared model, then step the shared optimizer.
        for local_param, global_param in zip(local_model.parameters(), global_model.parameters()):
            global_param._grad = local_param.grad
        optimizer.step()
        
        local_model.load_state_dict(global_model.state_dict())
        with global_ep.get_lock():
            global_ep.value += 1
        res_queue.put(global_ep.value)


In [None]:
# PPO (Proximal Policy Optimization)
import torch
import torch.nn as nn
import torch.optim as optim
import gym

class PPO(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(PPO, self).__init__()
        self.fc1 = nn.Linear(state_dim, 128)
        self.actor = nn.Linear(128, action_dim)
        self.critic = nn.Linear(128, 1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        action_probs = torch.softmax(self.actor(x), dim=-1)
        state_values = self.critic(x)
        return action_probs, state_values

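# Collect one trajectory with the current policy.
# Assumes the classic Gym API: env.reset() returns the observation and env.step() returns 4 values.
def collect_trajectories(env, policy_network, max_steps=200):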
    state = env.reset()
    states, actions, log_probs, rewards = [], [], [], []
    for _ in range(max_steps):
        state_tensor = torch.tensor(state, dtype=torch.float32).unsqueeze(0)
        with torch.no_grad():
            action_probs, _ = policy_network(state_tensor)
        action = torch.multinomial(action_probs, 1).item()
        log_prob = torch.log(action_probs.squeeze(0)[action])
        
        next_state, reward, done, _ = env.step(action)
        states.append(state_tensor)
        actions.append(action)
        log_probs.append(log_prob)
        rewards.append(reward)
        
        state = next_state
        if done:
            break
    
    return states, actions, log_probs, rewards

# Compute returns and advantages with Generalized Advantage Estimation (GAE)
def compute_advantages(rewards, values, gamma=0.99, lam=0.95):
    advantages, returns = [], []
    gae = 0
    # Work on a copy and append V(s_T) = 0, which assumes the trajectory ended in a terminal state
    # (a trajectory cut off by max_steps should bootstrap from the critic instead).
    values = list(values) + [0]
    for i in reversed(range(len(rewards))):
        delta = rewards[i] + gamma * values[i + 1] - values[i]
        gae = delta + gamma * lam * gae
        advantages.insert(0, gae)
        returns.insert(0, gae + values[i])
    return returns, advantages


def update_policy(policy_network, old_policy_network, optimizer, states, actions, log_probs, returns, advantages, epsilon=0.2):
    new_log_probs = []
    state_values = []
    for state, action in zip(states, actions):
        action_probs, state_value = policy_network(state)
        new_log_prob = torch.log(action_probs.squeeze(0)[action])
        new_log_probs.append(new_log_prob)
        state_values.append(state_value)
    
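    new_log_probs = torch.stack(new_log_probs)
    # Flatten the critic outputs to shape (N,) so they align with `returns` instead of broadcasting to (N, N).
    state_values = torch.stack(state_values).view(-1)
    returns = torch.tensor(returns, dtype=torch.float32)
    advantages = torch.tensor(advantages, dtype=torch.float32)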

    ratio = torch.exp(new_log_probs - torch.stack(log_probs))
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantages
    policy_loss = -torch.min(surr1, surr2).mean()
    value_loss = (returns - state_values).pow(2).mean()
    loss = policy_loss + 0.5 * value_loss
    
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Usage example
def main():
    env = gym.make('CartPole-v1')
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n
    model = PPO(state_dim, action_dim)
    optimizer = optim.Adam(model.parameters(), lr=1e-3)

    for epoch in range(10):
        states, actions, old_log_probs, rewards = collect_trajectories(env, model)
        # Evaluate all visited states in one batch; the values are only targets here, so no graph is needed
        state_tensors = torch.cat(states)
        with torch.no_grad():
            _, state_values = model(state_tensors)
        state_values = state_values.view(-1).tolist()  # always a list, even for one-step trajectories
        returns, advantages = compute_advantages(rewards, state_values)
        update_policy(model, model, optimizer, states, actions, old_log_probs, returns, advantages)

if __name__ == "__main__":
    main()



In [None]:
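# TRPO (Trust Region Policy Optimization) - minimal structural sketch (no backtracking line search)
import torch
from torch.nn.utils import parameters_to_vector, vector_to_parameters


def conjugate_gradient(Ax, b, nsteps, residual_tol=1e-10):
    # Approximately solve A x = b using only matrix-vector products Ax(v).
    x = torch.zeros_like(b)
    r = b.clone()
    p = r.clone()
    r_dot_r = torch.dot(r, r)
    for i in range(nsteps):
        Ap = Ax(p)
        alpha = r_dot_r / torch.dot(p, Ap)
        x += alpha * p
        r -= alpha * Ap
        new_r_dot_r = torch.dot(r, r)
        beta = new_r_dot_r / r_dot_r
        p = r + beta * p
        r_dot_r = new_r_dot_r
        if r_dot_r < residual_tol:
            break
    return x

def flat_grad(output, params, **kwargs):
    # Gradient of a scalar as one flat vector (zeros for parameters not involved, e.g. the critic head).
    grads = torch.autograd.grad(output, params, allow_unused=True, **kwargs)
    return torch.cat([(g if g is not None else torch.zeros_like(p)).reshape(-1) for g, p in zip(grads, params)])

def trpo_step(policy_network, states, actions, advantages, max_kl=1e-2, damping=0.1):
    # states: (N, state_dim) float tensor, actions: (N,) int64 tensor, advantages: (N,) float tensor
    params = list(policy_network.parameters())
    with torch.no_grad():
        old_action_probs, _ = policy_network(states)  # snapshot of the policy before the update
    old_log_probs = torch.log(old_action_probs.gather(1, actions.unsqueeze(1))).squeeze(1)
    
    action_probs, _ = policy_network(states)
    log_probs = torch.log(action_probs.gather(1, actions.unsqueeze(1))).squeeze(1)
    # Surrogate objective (to maximize): importance ratio times advantage.
    surrogate = (torch.exp(log_probs - old_log_probs) * advantages).mean()
    grads = flat_grad(surrogate, params, retain_graph=True)
    
    # Mean KL(old || new) over all actions; its Hessian at the current parameters is the Fisher matrix.
    kl = (old_action_probs * (torch.log(old_action_probs) - torch.log(action_probs))).sum(dim=1).mean()
    kl_grad = flat_grad(kl, params, create_graph=True)
    
    def Hvp(v):
        # Hessian-vector product of the KL, with damping for numerical stability.
        return flat_grad((kl_grad * v).sum(), params, retain_graph=True) + damping * v
    
    step_direction = conjugate_gradient(Hvp, grads, 10)
    
    step_size = torch.sqrt(2 * max_kl / (torch.dot(step_direction, Hvp(step_direction)) + 1e-8))
    final_step = step_size * step_direction
    
    # Apply the natural-gradient step in place.
    with torch.no_grad():
        vector_to_parameters(parameters_to_vector(params) + final_step, params)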


In [None]:
# Distributed Proximal Policy Optimization (DPPO)
import torch
import torch.nn as nn
import torch.optim as optim
import torch.multiprocessing as mp
import gym
import numpy as np

class PPO(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(PPO, self).__init__()
        self.fc1 = nn.Linear(state_dim, 128)
        self.actor = nn.Linear(128, action_dim)
        self.critic = nn.Linear(128, 1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        return torch.softmax(self.actor(x), dim=-1), self.critic(x)

def collect_trajectories(policy_network, env, max_steps=200):
    state = env.reset()
    states, actions, log_probs, rewards = [], [], [], []
    for _ in range(max_steps):
        state_tensor = torch.tensor(state, dtype=torch.float32).unsqueeze(0)
        action_probs, _ = policy_network(state_tensor)
        action = torch.multinomial(action_probs, 1).item()
        
        next_state, reward, done, _ = env.step(action)
        log_prob = torch.log(action_probs.squeeze(0)[action])
        
        states.append(state_tensor)
        actions.append(action)
        log_probs.append(log_prob)
        rewards.append(reward)
        
        state = next_state
        if done:
            break

    return states, actions, log_probs, rewards

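def compute_advantages(rewards, values, gamma=0.99, lam=0.95):
    # Convert critic outputs to plain floats and append V(s_T) = 0 (terminal state assumed).
    values = [float(v) for v in values] + [0.0]
    returns, advantages = [], []
    R = 0.0
    A = 0.0
    for i in reversed(range(len(rewards))):
        R = rewards[i] + gamma * R
        # The one-step TD error uses the value of the *next* state.
        delta = rewards[i] + gamma * values[i + 1] - values[i]
        A = delta + gamma * lam * A
        returns.insert(0, R)
        advantages.insert(0, A)
    return returns, advantages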

def update_policy(policy_network, optimizer, states, actions, log_probs, returns, advantages, epsilon=0.2):
    new_log_probs, state_values = [], []
    for state, action in zip(states, actions):
        action_probs, state_value = policy_network(state)
        new_log_prob = torch.log(action_probs.squeeze(0)[action])
        new_log_probs.append(new_log_prob)
        state_values.append(state_value)

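    new_log_probs = torch.stack(new_log_probs)
    # Flatten the critic outputs to shape (N,) so they align with `returns` instead of broadcasting to (N, N).
    state_values = torch.stack(state_values).view(-1)
    returns = torch.tensor(returns, dtype=torch.float32)
    advantages = torch.tensor(advantages, dtype=torch.float32)

    # The old log-probs are constants; without detach() the gradient of the ratio would cancel to zero.
    ratio = torch.exp(new_log_probs - torch.stack(log_probs).detach())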
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantages
    policy_loss = -torch.min(surr1, surr2).mean()

    value_loss = (returns - state_values).pow(2).mean()
    loss = policy_loss + 0.5 * value_loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def dppo_worker(global_policy_params, global_ep, global_ep_lock, env_name, state_dim, action_dim, max_episodes=1000):
    env = gym.make(env_name)
    local_model = PPO(state_dim, action_dim)
    optimizer = optim.Adam(local_model.parameters(), lr=3e-4)

    while True:
        with global_policy_params['lock']:
            for local_param, global_param in zip(local_model.parameters(), global_policy_params['params']):
                param_data = np.array(global_param[:]).reshape(local_param.data.shape)
                local_param.data.copy_(torch.from_numpy(param_data))

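        states, actions, log_probs, rewards = collect_trajectories(local_model, env)
        with torch.no_grad():
            values = [local_model(state)[1].item() for state in states]
        # compute_advantages returns (returns, advantages), in that order.
        returns, advantages = compute_advantages(rewards, values)

        update_policy(local_model, optimizer, states, actions, log_probs, returns, advantages)

        # Publish the updated weights; each shared list is flat, so flatten before writing.
        # Overwriting shared weights is a simplification: DPPO proper aggregates gradients across workers.
        with global_policy_params['lock']:
            for global_param, local_param in zip(global_policy_params['params'], local_model.parameters()):
                global_param[:] = local_param.data.cpu().numpy().flatten().tolist()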

        with global_ep_lock:
            if global_ep.value >= max_episodes:
                break
            global_ep.value += 1


def main():
    env_name = "CartPole-v1"
    env = gym.make(env_name)
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n

    with mp.Manager() as manager:
        global_policy_params = manager.dict()
        global_policy_params['lock'] = manager.Lock()
        global_policy_params['params'] = [manager.list(param.data.cpu().numpy().flatten()) for param in PPO(state_dim, action_dim).parameters()]

        global_ep = manager.Value('i', 0)
        global_ep_lock = manager.Lock()

        num_workers = mp.cpu_count()
        workers = [mp.Process(target=dppo_worker, args=(global_policy_params, global_ep, global_ep_lock, env_name, state_dim, action_dim)) for _ in range(num_workers)]

        for worker in workers:
            worker.start()
        for worker in workers:
            worker.join()

if __name__ == "__main__":
    main()


In [None]:
#Twin Delayed Deep Deterministic Policy Gradient (TD3) 
import torch
import torch.nn as nn
import torch.optim as optim

# Actor network
class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, max_action):
        super(Actor, self).__init__()
        self.fc1 = nn.Linear(state_dim, 400)
        self.fc2 = nn.Linear(400, 300)
        self.fc3 = nn.Linear(300, action_dim)
        self.max_action = max_action
    
    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.max_action * torch.tanh(self.fc3(x))

# Critic network (two independent Q-functions)
class Critic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(Critic, self).__init__()
        # First Q-function network
        self.fc1 = nn.Linear(state_dim + action_dim, 400)
        self.fc2 = nn.Linear(400, 300)
        self.q1 = nn.Linear(300, 1)
        # Second Q-function network
        self.fc3 = nn.Linear(state_dim + action_dim, 400)
        self.fc4 = nn.Linear(400, 300)
        self.q2 = nn.Linear(300, 1)
    
    def forward(self, x, a):
        xu = torch.cat([x, a], 1)  # Concatenate state and action
        # First Q-function
        x1 = torch.relu(self.fc1(xu))
        x1 = torch.relu(self.fc2(x1))
        q1 = self.q1(x1)
        # Second Q-function
        x2 = torch.relu(self.fc3(xu))
        x2 = torch.relu(self.fc4(x2))
        q2 = self.q2(x2)
        return q1, q2

# TD3 update step (`reward` and `done` are expected with shape (batch, 1) to match the critic outputs)
def update_td3(actor, critic, target_actor, target_critic, replay_buffer, optimizer_actor, optimizer_critic, it, batch_size=100, gamma=0.99, tau=0.005, policy_noise=0.2, noise_clip=0.5, policy_freq=2):
    state, action, reward, next_state, done = replay_buffer.sample(batch_size)
    
    with torch.no_grad():
        # Target policy smoothing: clipped Gaussian noise on the target action
        noise = (torch.randn_like(action) * policy_noise).clamp(-noise_clip, noise_clip)
        next_action = (target_actor(next_state) + noise).clamp(-actor.max_action, actor.max_action)
        
        # TD target from the smaller of the two target critics (clipped double Q-learning)
        target_q1, target_q2 = target_critic(next_state, next_action)
        target_q = reward + (1 - done) * gamma * torch.min(target_q1, target_q2)
    
    # Current Q estimates
    current_q1, current_q2 = critic(state, action)
    critic_loss = torch.nn.functional.mse_loss(current_q1, target_q) + torch.nn.functional.mse_loss(current_q2, target_q)
    
    # Optimize the critics
    optimizer_critic.zero_grad()
    critic_loss.backward()
    optimizer_critic.step()
    
    # Delayed policy update
    if it % policy_freq == 0:
        # Actor loss: -Q1(s, pi(s)). Use the critic's first output; `critic.q1` is only the final linear layer.
        actor_loss = -critic(state, actor(state))[0].mean()
        optimizer_actor.zero_grad()
        actor_loss.backward()
        optimizer_actor.step()
        
        # Soft (Polyak) update of the target networks
        for param, target_param in zip(actor.parameters(), target_actor.parameters()):
            target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)
        for param, target_param in zip(critic.parameters(), target_critic.parameters()):
            target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)



In [None]:
#DQN (Deep Q-Network)
class DQN(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(DQN, self).__init__()
        self.fc1 = nn.Linear(state_dim, 128)
        self.fc2 = nn.Linear(128, 128)
        self.fc3 = nn.Linear(128, action_dim)
    
    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)

def update_dqn(policy_network, target_network, optimizer, replay_buffer, batch_size=64, gamma=0.99):
    state, action, reward, next_state, done = replay_buffer.sample(batch_size)
    
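    # Q(s_t, a_t) for the actions actually taken (`action` must be an int64 tensor)
    q_values = policy_network(state).gather(1, action.unsqueeze(1)).squeeze(1)
    # TD target from the frozen target network; no gradient flows through it
    with torch.no_grad():
        next_q_values = target_network(next_state).max(1)[0]
        expected_q_values = reward + gamma * next_q_values * (1 - done)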
    
    loss = torch.nn.functional.mse_loss(q_values, expected_q_values)
    
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()


In [None]:
#C51
class C51(nn.Module):
    def __init__(self, state_dim, action_dim, num_atoms, v_min, v_max):
        super(C51, self).__init__()
        self.fc1 = nn.Linear(state_dim, 128)
        self.fc2 = nn.Linear(128, 128)
        self.fc3 = nn.Linear(128, action_dim * num_atoms)
        self.action_dim = action_dim
        self.num_atoms = num_atoms
        self.v_min = v_min
        self.v_max = v_max
        self.delta_z = (v_max - v_min) / (num_atoms - 1)
        self.z = torch.linspace(v_min, v_max, num_atoms)
    
    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        # Use the sizes stored on the module rather than undefined globals
        x = self.fc3(x).view(-1, self.action_dim, self.num_atoms)
        return torch.nn.functional.softmax(x, dim=-1)

def update_c51(policy_network, target_network, optimizer, replay_buffer, batch_size=64, gamma=0.99):
    state, action, reward, next_state, done = replay_buffer.sample(batch_size)
    
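    z = policy_network.z
    num_atoms = policy_network.num_atoms
    with torch.no_grad():
        next_dist = target_network(next_state)  # (B, A, N)
        # Greedy next action by expected value sum_i z_i * p_i (summing probabilities alone always gives 1)
        next_action = (next_dist * z).sum(-1).max(1)[1]
        next_dist = next_dist[range(batch_size), next_action]  # (B, N)
        
        # Bellman-shifted support, one row per transition: (B, 1) combined with (N,) gives (B, N)
        t_z = reward.unsqueeze(1) + (1 - done).unsqueeze(1) * gamma * z
        t_z = t_z.clamp(policy_network.v_min, policy_network.v_max)
        b = ((t_z - policy_network.v_min) / policy_network.delta_z).clamp(0, num_atoms - 1)
        l = b.floor().long()
        u = b.ceil().long()
        # When b is an integer, l == u and both weights below would be zero; shift one index so no mass is lost
        l[(u > 0) & (l == u)] -= 1
        u[(l < num_atoms - 1) & (l == u)] += 1
        
        m = torch.zeros(batch_size, num_atoms)
        offset = (torch.arange(batch_size) * num_atoms).unsqueeze(1).expand(batch_size, num_atoms)
        
        m.view(-1).index_add_(0, (l + offset).view(-1), (next_dist * (u.float() - b)).view(-1))
        m.view(-1).index_add_(0, (u + offset).view(-1), (next_dist * (b - l.float())).view(-1))
    
    dist = policy_network(state)
    # Cross-entropy between the projected target m and the predicted distribution (clamped to avoid log(0))
    log_p = torch.log(dist[range(batch_size), action].clamp_min(1e-8))
    loss = -(m * log_p).sum(1).mean()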
    
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()


In [None]:
# QR-DQN (Quantile Regression DQN)
class QRDQN(nn.Module):
    def __init__(self, state_dim, action_dim, num_quantiles):
        super(QRDQN, self).__init__()
        self.fc1 = nn.Linear(state_dim, 128)
        self.fc2 = nn.Linear(128, 128)
        self.fc3 = nn.Linear(128, action_dim * num_quantiles)
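        self.action_dim = action_dim
        self.num_quantiles = num_quantiles
    
    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        # Use the sizes stored on the module rather than an undefined global
        return self.fc3(x).view(-1, self.action_dim, self.num_quantiles)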

def update_qrdqn(policy_network, target_network, optimizer, replay_buffer, batch_size=64, gamma=0.99):
    state, action, reward, next_state, done = replay_buffer.sample(batch_size)
    
    num_quantiles = policy_network.num_quantiles
    quantiles = policy_network(state).gather(1, action.view(-1, 1, 1).expand(batch_size, 1, num_quantiles)).squeeze(1)
    with torch.no_grad():
        next_quantiles = target_network(next_state)
        # Greedy next action by the mean of the target quantiles
        next_action = next_quantiles.mean(2).max(1)[1]
        next_quantiles = next_quantiles[range(batch_size), next_action]
        target_quantiles = reward.unsqueeze(1) + (1 - done).unsqueeze(1) * gamma * next_quantiles
    
    # Pairwise TD errors: diff[b, i, j] = target_j - prediction_i
    diff = target_quantiles.unsqueeze(1) - quantiles.unsqueeze(2)
    # Quantile midpoints tau_i = (2i + 1) / (2K) for i = 0..K-1
    tau = ((torch.arange(num_quantiles, dtype=torch.float32) + 0.5) / num_quantiles).view(1, -1, 1)
    # Quantile Huber loss with kappa = 1 (the original used a fixed weight of 0.5, i.e. a median-only L1 loss)
    huber = torch.where(diff.abs() <= 1.0, 0.5 * diff.pow(2), diff.abs() - 0.5)
    loss = (torch.abs(tau - (diff.detach() < 0).float()) * huber).mean(2).sum(1).mean()
    
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()


In [None]:
# HER (Hindsight Experience Replay)
class HERReplayBuffer:
    def __init__(self, capacity, k=4):
        self.capacity = capacity
        self.buffer = []
        self.k = k
    
    def add(self, episode):
        self.buffer.append(episode)
        if len(self.buffer) > self.capacity:
            self.buffer.pop(0)
    
    def sample(self, batch_size):
        sampled_episodes = random.sample(self.buffer, batch_size)
        return self._create_transitions(sampled_episodes)
    
    def _create_transitions(self, episodes):
        states, actions, rewards, next_states, dones = [], [], [], [], []
        for episode in episodes:
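            for i in range(len(episode['states'])):
                states.append(episode['states'][i])
                actions.append(episode['actions'][i])
                # Assumes each episode dict also stores 'rewards'; without this line the
                # rewards list ends up shorter than the other lists.
                rewards.append(episode['rewards'][i])
                next_states.append(episode['next_states'][i])
                dones.append(episode['dones'][i])
                
                if dones[-1]:
                    break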
                
                for _ in range(self.k):
                    future = random.randint(i, len(episode['states']) - 1)
                    goal = episode['states'][future]
                    
                    new_state = episode['states'][i]
                    new_action = episode['actions'][i]
                    new_next_state = episode['next_states'][i]
                    new_done = episode['dones'][i]
                    new_reward = self._compute_reward(new_next_state, goal)
                    
                    states.append(new_state)
                    actions.append(new_action)
                    next_states.append(new_next_state)
                    dones.append(new_done)
                    rewards.append(new_reward)
        
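        # Explicit dtypes keep the tensors compatible with float32 network outputs and with gather()
        return (torch.tensor(np.array(states), dtype=torch.float32),
                torch.tensor(actions, dtype=torch.long),
                torch.tensor(rewards, dtype=torch.float32),
                torch.tensor(np.array(next_states), dtype=torch.float32),
                torch.tensor(dones, dtype=torch.float32))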
    
    def _compute_reward(self, state, goal):
        return -np.linalg.norm(state - goal)

def update_dqn_with_her(policy_network, target_network, optimizer, her_replay_buffer, batch_size=64, gamma=0.99):
    state, action, reward, next_state, done = her_replay_buffer.sample(batch_size)
    
    q_values = policy_network(state).gather(1, action.unsqueeze(1)).squeeze(1)
    with torch.no_grad():  # the bootstrap target must not receive gradients
        next_q_values = target_network(next_state).max(1)[0]
    expected_q_values = reward + gamma * next_q_values * (1 - done)
    
    loss = torch.nn.functional.mse_loss(q_values, expected_q_values)
    
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()


### Exercises

1. Policy Gradient

Description: Implement a basic policy gradient algorithm for an agent that learns to play a simple game such as CartPole using OpenAI Gym. Integrate simulated human feedback to improve the agent's decisions.

Tasks:

* Define and train a neural network that represents the agent's policy.
* Collect training trajectories and compute the returns.
* Use the policy gradient method to update the network's weights.

2. A2C/A3C

Description: Modify the previous exercise to use the Actor-Critic algorithm, specifically A2C and A3C, to improve training stability and efficiency.

Tasks:

* Implement an Actor-Critic architecture with PyTorch.
* Add support for asynchronous updates in the A3C case.
* Compare the performance and convergence of A2C and A3C.

3. PPO (Proximal Policy Optimization)

Description: Implement PPO to train an agent that interacts with a more complex environment, such as BipedalWalker from Gym.

Tasks:

* Implement the clipped surrogate objective that characterizes PPO (the clipping is applied to the probability ratio, not to the advantages).
* Perform multiple policy updates on the same batch of data to improve data efficiency.
* Integrate normalization techniques (for example, advantage normalization) to stabilize learning.
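
As a starting point for the first and third tasks, the clipped objective and advantage normalization can be sketched as below. This is a minimal illustration rather than a full PPO implementation; the function names `clipped_surrogate_loss` and `normalize_advantages` are assumptions made for this sketch.

```python
import torch

def clipped_surrogate_loss(new_log_probs, old_log_probs, advantages, clip_epsilon=0.2):
    """PPO clipped surrogate, returned as a loss to minimize."""
    # Probability ratio pi_new(a|s) / pi_old(a|s), computed in log space
    ratios = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratios * advantages
    clipped = torch.clamp(ratios, 1 - clip_epsilon, 1 + clip_epsilon) * advantages
    # The element-wise minimum is a pessimistic bound on the policy improvement
    return -torch.min(unclipped, clipped).mean()

def normalize_advantages(advantages, eps=1e-8):
    """Zero-mean, unit-variance advantages (a common stabilization trick)."""
    return (advantages - advantages.mean()) / (advantages.std() + eps)
```

The old log-probabilities should be detached from the graph before the update epochs, so that only the new policy receives gradients.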

4. TRPO (Trust Region Policy Optimization) and DPPO

Description: Use TRPO and then extend the solution to DPPO to handle multiple workers in a distributed setting.

Tasks:

- Implement TRPO with a trust-region constraint so that each policy update stays within a bounded KL divergence from the previous policy.
- Scale the solution with DPPO, managing several workers that collect data and update a central policy.
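
The trust-region check in the first task can be sketched as a backtracking line search on the step size. This is a simplified illustration under stated assumptions: `mean_kl` handles categorical policies only, and `evaluate` is a hypothetical callback that returns the surrogate improvement and the KL divergence for a candidate step. Computing the search direction itself (natural gradient via conjugate gradient) is not shown.

```python
import torch

def mean_kl(old_probs, new_probs, eps=1e-8):
    """Mean KL(old || new) between two batches of categorical distributions."""
    log_ratio = torch.log(old_probs + eps) - torch.log(new_probs + eps)
    return (old_probs * log_ratio).sum(-1).mean()

def backtracking_step(evaluate, full_step, max_kl=0.01, shrink=0.5, max_backtracks=10):
    """Shrink the step until the KL constraint holds and the surrogate improves.

    evaluate(step) must return (surrogate_improvement, kl) for that step size.
    Returns the accepted step size, or None if every candidate was rejected.
    """
    step = full_step
    for _ in range(max_backtracks):
        improvement, kl = evaluate(step)
        if kl <= max_kl and improvement > 0:
            return step
        step = step * shrink
    return None
```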

5. TD3 (Twin Delayed Deep Deterministic Policy Gradient)

Description: Apply TD3 to a continuous control problem such as Pendulum.

Tasks:

- Implement twin critic networks for Q estimation and add noise to the actions for exploration.
- Use delayed updates: update the actor (and the target networks) less often than the critics to stabilize training.
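
The twin-critic target can be sketched as follows. This is a minimal illustration, assuming the target actor and target critics are passed in as callables; the name `td3_target` and the default hyperparameters are assumptions chosen for this sketch.

```python
import torch

def td3_target(reward, done, next_state, actor_target, critic1_target, critic2_target,
               gamma=0.99, policy_noise=0.2, noise_clip=0.5, max_action=1.0):
    """Bootstrap target using target policy smoothing and clipped double-Q."""
    with torch.no_grad():
        next_action = actor_target(next_state)
        # Target policy smoothing: clipped Gaussian noise on the target action
        noise = (torch.randn_like(next_action) * policy_noise).clamp(-noise_clip, noise_clip)
        next_action = (next_action + noise).clamp(-max_action, max_action)
        # Clipped double-Q: use the smaller of the two target critics
        q1 = critic1_target(next_state, next_action)
        q2 = critic2_target(next_state, next_action)
        return reward + gamma * (1 - done) * torch.min(q1, q2)
```

Both critics regress toward this same target. The delayed part of TD3 then amounts to updating the actor and the target networks only once every few critic updates.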

6. Q-Learning: DQN, C51, QR-DQN, and HER

Description: Implement different DQN variants and explore how HER can improve learning in environments with sparse goals.

Tasks:

- Implement a standard DQN for SpaceInvaders.
- Extend DQN to C51 and QR-DQN to learn value distributions.
- Apply HER in an environment such as FetchPickAndPlace to improve efficiency on tasks with sparse rewards.

In [None]:
## Answers

Exercise 1: Policy Gradient

Objective: Implement the Policy Gradient method using a transformer to predict the policy.

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import gym

class TransformerPolicy(nn.Module):
    def __init__(self, state_dim, action_dim, nhead, num_layers):
        super(TransformerPolicy, self).__init__()
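        # nn.Transformer expects both src and tgt; an encoder-only stack takes a single input.
        # dropout=0.0 keeps repeated evaluations of the same state deterministic.
        encoder_layer = nn.TransformerEncoderLayer(d_model=state_dim, nhead=nhead, dropout=0.0, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)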
        self.fc = nn.Linear(state_dim, action_dim)
    
    def forward(self, state):
        state = state.unsqueeze(1)  # Add sequence dimension
        transformer_out = self.transformer(state)
        policy = torch.softmax(self.fc(transformer_out.squeeze(1)), dim=-1)
        return policy

def policy_gradient(env, policy_network, optimizer, num_episodes):
    for episode in range(num_episodes):
        state = env.reset()
        rewards = []
        log_probs = []
        done = False
        
        while not done:
            state = torch.tensor(state, dtype=torch.float32).unsqueeze(0)
            action_probs = policy_network(state)
            action = torch.multinomial(action_probs, 1).item()
            log_prob = torch.log(action_probs.squeeze(0)[action])
            
            next_state, reward, done, _ = env.step(action)
            rewards.append(reward)
            log_probs.append(log_prob)
            
            state = next_state
        
        G = 0
        policy_loss = []
        for r, log_prob in zip(rewards[::-1], log_probs[::-1]):
            G = r + 0.99 * G
            policy_loss.append(-log_prob * G)
        
        optimizer.zero_grad()
        policy_loss = torch.stack(policy_loss).sum()  # stack: the per-step losses are 0-dim tensors
        policy_loss.backward()
        optimizer.step()

env = gym.make('CartPole-v1')
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n
nhead = 2
num_layers = 2

policy_network = TransformerPolicy(state_dim, action_dim, nhead, num_layers)
optimizer = optim.Adam(policy_network.parameters(), lr=1e-3)

policy_gradient(env, policy_network, optimizer, num_episodes=1000)


In [None]:
# Your answer

Exercise 2: A2C/A3C (Asynchronous Advantage Actor-Critic)

Objective: Implement A2C/A3C using a transformer to predict the policy and the value.

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.multiprocessing as mp
import gym

class TransformerActorCritic(nn.Module):
    def __init__(self, state_dim, action_dim, nhead, num_layers):
        super(TransformerActorCritic, self).__init__()
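        # nn.Transformer expects both src and tgt; an encoder-only stack takes a single input.
        # dropout=0.0 keeps repeated evaluations of the same state deterministic.
        encoder_layer = nn.TransformerEncoderLayer(d_model=state_dim, nhead=nhead, dropout=0.0, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)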
        self.actor = nn.Linear(state_dim, action_dim)
        self.critic = nn.Linear(state_dim, 1)
    
    def forward(self, state):
        state = state.unsqueeze(1)  # Add sequence dimension
        transformer_out = self.transformer(state)
        policy = torch.softmax(self.actor(transformer_out.squeeze(1)), dim=-1)
        value = self.critic(transformer_out.squeeze(1))
        return policy, value

def worker(worker_id, env_name, global_policy, optimizer, global_episode, gamma=0.99):
    env = gym.make(env_name)
    local_policy = TransformerActorCritic(state_dim, action_dim, nhead, num_layers)
    local_policy.load_state_dict(global_policy.state_dict())
    
    while global_episode.value < 1000:
        state = env.reset()
        done = False
        log_probs = []
        values = []
        rewards = []
        
        while not done:
            state = torch.tensor(state, dtype=torch.float32).unsqueeze(0)
            policy, value = local_policy(state)
            action = torch.multinomial(policy, 1).item()
            log_prob = torch.log(policy.squeeze(0)[action])
            
            next_state, reward, done, _ = env.step(action)
            rewards.append(reward)
            log_probs.append(log_prob)
            values.append(value)
            
            state = next_state
        
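        G = 0
        actor_loss = 0
        critic_loss = 0
        for r, log_prob, value in zip(rewards[::-1], log_probs[::-1], values[::-1]):
            G = r + gamma * G
            advantage = G - value.squeeze()  # keep the graph so the critic is actually trained
            actor_loss += -log_prob * advantage.detach()
            critic_loss += advantage ** 2
        
        optimizer.zero_grad()
        local_policy.zero_grad()
        loss = actor_loss + critic_loss
        loss.backward()
        # Push the local gradients to the shared model, step, then re-sync the local copy
        for local_param, global_param in zip(local_policy.parameters(), global_policy.parameters()):
            global_param.grad = local_param.grad
        optimizer.step()
        local_policy.load_state_dict(global_policy.state_dict())
        
        with global_episode.get_lock():
            global_episode.value += 1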

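env_name = 'CartPole-v1'
probe_env = gym.make(env_name)  # temporary environment, used only to read the dimensions
state_dim = probe_env.observation_space.shape[0]
action_dim = probe_env.action_space.n
probe_env.close()
nhead = 2
num_layers = 2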

global_policy = TransformerActorCritic(state_dim, action_dim, nhead, num_layers)
global_policy.share_memory()
optimizer = optim.Adam(global_policy.parameters(), lr=1e-3)

global_episode = mp.Value('i', 0)
num_workers = mp.cpu_count()
workers = [mp.Process(target=worker, args=(i, env_name, global_policy, optimizer, global_episode)) for i in range(num_workers)]

# Use a different loop variable so the worker() function is not shadowed
for w in workers:
    w.start()
for w in workers:
    w.join()


In [None]:
# Your answer

Exercise 3: PPO (Proximal Policy Optimization)

Objective: Implement PPO using a transformer to predict the policy and the value.

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import gym

class TransformerPPO(nn.Module):
    def __init__(self, state_dim, action_dim, nhead, num_layers):
        super(TransformerPPO, self).__init__()
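        # nn.Transformer expects both src and tgt; an encoder-only stack takes a single input.
        # dropout=0.0 keeps repeated evaluations of the same state deterministic.
        encoder_layer = nn.TransformerEncoderLayer(d_model=state_dim, nhead=nhead, dropout=0.0, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)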
        self.actor = nn.Linear(state_dim, action_dim)
        self.critic = nn.Linear(state_dim, 1)
    
    def forward(self, state):
        state = state.unsqueeze(1)  # Add sequence dimension
        transformer_out = self.transformer(state)
        policy = torch.softmax(self.actor(transformer_out.squeeze(1)), dim=-1)
        value = self.critic(transformer_out.squeeze(1))
        return policy, value

def ppo(env, policy_network, optimizer, num_episodes, clip_epsilon=0.2, gamma=0.99):
    for episode in range(num_episodes):
        state = env.reset()
        log_probs = []
        values = []
        rewards = []
        states = []
        actions = []
        dones = []
        done = False
        
        while not done:
            state = torch.tensor(state, dtype=torch.float32).unsqueeze(0)
            policy, value = policy_network(state)
            action = torch.multinomial(policy, 1).item()
            log_prob = torch.log(policy.squeeze(0)[action])
            
            next_state, reward, done, _ = env.step(action)
            rewards.append(reward)
            log_probs.append(log_prob)
            values.append(value)
            states.append(state)
            actions.append(action)
            dones.append(done)
            
            state = next_state
        
        G = 0
        returns = []
        for r in rewards[::-1]:
            G = r + gamma * G
            returns.insert(0, G)
        returns = torch.tensor(returns)
        
        states = torch.cat(states)
        actions = torch.tensor(actions)
        # Old log-probs and values are constants during the update, so detach them
        log_probs = torch.stack(log_probs).detach()
        values = torch.cat(values).detach()
        
        advantages = returns - values.squeeze(1)
        
        for _ in range(4):  # PPO multiple epochs
            # Re-evaluate the whole batch at once with the current parameters
            policy, value = policy_network(states)
            new_log_probs = torch.log(policy.gather(1, actions.unsqueeze(1)).squeeze(1))
            new_values = value.squeeze(1)
            
            ratios = torch.exp(new_log_probs - log_probs)
            surr1 = ratios * advantages
            surr2 = torch.clamp(ratios, 1 - clip_epsilon, 1 + clip_epsilon) * advantages
            actor_loss = -torch.min(surr1, surr2).mean()
            critic_loss = (returns - new_values).pow(2).mean()
            loss = actor_loss + 0.5 * critic_loss
            
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

env = gym.make('CartPole-v1')
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n
nhead = 2
num_layers = 2

policy_network = TransformerPPO(state_dim, action_dim, nhead, num_layers)
optimizer = optim.Adam(policy_network.parameters(), lr=1e-3)

ppo(env, policy_network, optimizer, num_episodes=1000)


In [None]:
## Your answer

Exercise 4: TRPO (Trust Region Policy Optimization)

Objective: Implement TRPO using a transformer to predict the policy and the value.

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import gym

class TransformerTRPO(nn.Module):
    def __init__(self, state_dim, action_dim, nhead, num_layers):
        super(TransformerTRPO, self).__init__()
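        # nn.Transformer expects both src and tgt; an encoder-only stack takes a single input.
        # dropout=0.0 keeps repeated evaluations of the same state deterministic.
        encoder_layer = nn.TransformerEncoderLayer(d_model=state_dim, nhead=nhead, dropout=0.0, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)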
        self.actor = nn.Linear(state_dim, action_dim)
        self.critic = nn.Linear(state_dim, 1)
    
    def forward(self, state):
        state = state.unsqueeze(1)  # Add sequence dimension
        transformer_out = self.transformer(state)
        policy = torch.softmax(self.actor(transformer_out.squeeze(1)), dim=-1)
        value = self.critic(transformer_out.squeeze(1))
        return policy, value

def trpo(env, policy_network, optimizer, num_episodes, max_kl=0.01, gamma=0.99):
    for episode in range(num_episodes):
        state = env.reset()
        log_probs = []
        values = []
        rewards = []
        states = []
        actions = []
        dones = []
        done = False
        
        while not done:
            state = torch.tensor(state, dtype=torch.float32).unsqueeze(0)
            policy, value = policy_network(state)
            action = torch.multinomial(policy, 1).item()
            log_prob = torch.log(policy.squeeze(0)[action])
            
            next_state, reward, done, _ = env.step(action)
            rewards.append(reward)
            log_probs.append(log_prob)
            values.append(value)
            states.append(state)
            actions.append(action)
            dones.append(done)
            
            state = next_state
        
        G = 0
        returns = []
        for r in rewards[::-1]:
            G = r + gamma * G
            returns.insert(0, G)
        returns = torch.tensor(returns)
        
        states = torch.cat(states)
        actions = torch.tensor(actions)
        # Old log-probs and values are constants during the update, so detach them
        log_probs = torch.stack(log_probs).detach()
        values = torch.cat(values).detach()
        
        advantages = returns - values.squeeze(1)
        
        def get_loss():
            # Re-evaluate the whole batch at once with the current parameters
            policy, value = policy_network(states)
            new_log_probs = torch.log(policy.gather(1, actions.unsqueeze(1)).squeeze(1))
            new_values = value.squeeze(1)
            
            ratios = torch.exp(new_log_probs - log_probs)
            surr1 = ratios * advantages
            actor_loss = -surr1.mean()
            critic_loss = (returns - new_values).pow(2).mean()
            loss = actor_loss + 0.5 * critic_loss
            return loss
        
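        # Note: this simplified version takes a plain Adam step on the unclipped surrogate
        # and never uses max_kl. Full TRPO would compute a natural-gradient direction
        # (conjugate gradient) and backtrack until the KL constraint holds.
        loss = get_loss()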
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

env = gym.make('CartPole-v1')
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n
nhead = 2
num_layers = 2

policy_network = TransformerTRPO(state_dim, action_dim, nhead, num_layers)
optimizer = optim.Adam(policy_network.parameters(), lr=1e-3)

trpo(env, policy_network, optimizer, num_episodes=1000)


In [None]:
# Your answer

Exercise 5: Using LLMs to Generate Feedback for RLHF

Objective: Implement an agent that uses an LLM to generate feedback and improve the policy.


In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import gym

class TransformerPolicy(nn.Module):
    def __init__(self, state_dim, action_dim, nhead, num_layers):
        super(TransformerPolicy, self).__init__()
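        # nn.Transformer expects both src and tgt; an encoder-only stack takes a single input.
        # dropout=0.0 keeps repeated evaluations of the same state deterministic.
        encoder_layer = nn.TransformerEncoderLayer(d_model=state_dim, nhead=nhead, dropout=0.0, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)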
        self.fc = nn.Linear(state_dim, action_dim)
    
    def forward(self, state):
        state = state.unsqueeze(1)  # Add sequence dimension
        transformer_out = self.transformer(state)
        policy = torch.softmax(self.fc(transformer_out.squeeze(1)), dim=-1)
        return policy

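def generate_feedback(llm, tokenizer, state, action):
    # Placeholder signal: GPT-2's language-modeling loss on a textual description of the
    # transition. It stands in for a reward model and carries no task knowledge.
    input_text = f"State: {state.tolist()}, Action: {action}"
    inputs = tokenizer(input_text, return_tensors="pt")
    with torch.no_grad():  # the LLM is only queried, never trained
        outputs = llm(**inputs, labels=inputs["input_ids"])
    loss = outputs.loss.item()
    feedback = torch.tensor([loss])
    return feedback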

def train_policy_with_llm_feedback(policy_network, env, llm, tokenizer, optimizer, num_episodes):
    for episode in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            state_tensor = torch.tensor(state, dtype=torch.float32).unsqueeze(0)
            action_probs = policy_network(state_tensor)
            action = torch.multinomial(action_probs, 1).item()
            
            next_state, reward, done, _ = env.step(action)
            
            feedback = generate_feedback(llm, tokenizer, state, action)
            
            optimizer.zero_grad()
            # REINFORCE-style update: weight the log-probability of the sampled action by the feedback
            log_prob = torch.log(action_probs[0, action])
            loss = -(feedback * log_prob).mean()
            loss.backward()
            optimizer.step()
            
            state = next_state

# Initialize the environment and the LLM
env = gym.make('CartPole-v1')
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n
nhead = 2
num_layers = 2

policy_network = TransformerPolicy(state_dim, action_dim, nhead, num_layers)
optimizer = optim.Adam(policy_network.parameters(), lr=1e-3)

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
llm = GPT2LMHeadModel.from_pretrained('gpt2')

# Train the policy with LLM feedback
train_policy_with_llm_feedback(policy_network, env, llm, tokenizer, optimizer, num_episodes=100)


In [None]:
## Your answer

Exercise 6: PPO (Proximal Policy Optimization) with LLM Feedback

Objective: Implement PPO using a transformer to predict the policy and the value, and an LLM to generate feedback.


In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import gym

class TransformerPPO(nn.Module):
    def __init__(self, state_dim, action_dim, nhead, num_layers):
        super(TransformerPPO, self).__init__()
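        # nn.Transformer expects both src and tgt; an encoder-only stack takes a single input.
        # dropout=0.0 keeps repeated evaluations of the same state deterministic.
        encoder_layer = nn.TransformerEncoderLayer(d_model=state_dim, nhead=nhead, dropout=0.0, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)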
        self.actor = nn.Linear(state_dim, action_dim)
        self.critic = nn.Linear(state_dim, 1)
    
    def forward(self, state):
        state = state.unsqueeze(1)  # Add sequence dimension
        transformer_out = self.transformer(state)
        policy = torch.softmax(self.actor(transformer_out.squeeze(1)), dim=-1)
        value = self.critic(transformer_out.squeeze(1))
        return policy, value

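def generate_feedback(llm, tokenizer, state, action):
    # Placeholder signal: GPT-2's language-modeling loss on a textual description of the
    # transition. It stands in for a reward model and carries no task knowledge.
    input_text = f"State: {state.tolist()}, Action: {action}"
    inputs = tokenizer(input_text, return_tensors="pt")
    with torch.no_grad():  # the LLM is only queried, never trained
        outputs = llm(**inputs, labels=inputs["input_ids"])
    loss = outputs.loss.item()
    feedback = torch.tensor([loss])
    return feedback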

def ppo_with_llm_feedback(env, policy_network, llm, tokenizer, optimizer, num_episodes, clip_epsilon=0.2, gamma=0.99):
    for episode in range(num_episodes):
        state = env.reset()
        log_probs = []
        values = []
        rewards = []
        states = []
        actions = []
        dones = []
        feedbacks = []
        done = False
        
        while not done:
            state_tensor = torch.tensor(state, dtype=torch.float32).unsqueeze(0)
            policy, value = policy_network(state_tensor)
            action = torch.multinomial(policy, 1).item()
            log_prob = torch.log(policy.squeeze(0)[action])
            
            next_state, reward, done, _ = env.step(action)
            feedback = generate_feedback(llm, tokenizer, state, action)
            
            rewards.append(reward)
            log_probs.append(log_prob)
            values.append(value)
            states.append(state_tensor)
            actions.append(action)
            dones.append(done)
            feedbacks.append(feedback)
            
            state = next_state
        
        G = 0
        returns = []
        for r in rewards[::-1]:
            G = r + gamma * G
            returns.insert(0, G)
        returns = torch.tensor(returns)
        
        states = torch.cat(states)
        actions = torch.tensor(actions)
        # Old log-probs and values are constants during the update, so detach them
        log_probs = torch.stack(log_probs).detach()
        values = torch.cat(values).detach()
        
        advantages = returns - values.squeeze(1)
        
        for _ in range(4):  # PPO multiple epochs
            # Re-evaluate the whole batch at once with the current parameters
            policy, value = policy_network(states)
            new_log_probs = torch.log(policy.gather(1, actions.unsqueeze(1)).squeeze(1))
            new_values = value.squeeze(1)
            
            ratios = torch.exp(new_log_probs - log_probs)
            surr1 = ratios * advantages
            surr2 = torch.clamp(ratios, 1 - clip_epsilon, 1 + clip_epsilon) * advantages
            actor_loss = -torch.min(surr1, surr2).mean()
            critic_loss = (returns - new_values).pow(2).mean()
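            # The LLM feedback values are constants with respect to the policy parameters,
            # so adding them to the loss would not change any gradient. The mean is kept
            # only as a diagnostic; to influence learning, the feedback would have to enter
            # the rewards (and therefore the returns and advantages) instead.
            feedback_mean = torch.mean(torch.stack(feedbacks))
            loss = actor_loss + 0.5 * critic_loss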
            
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

# Initialize the environment and the LLM
env = gym.make('CartPole-v1')
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n
nhead = 2
num_layers = 2

policy_network = TransformerPPO(state_dim, action_dim, nhead, num_layers)
optimizer = optim.Adam(policy_network.parameters(), lr=1e-3)

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
llm = GPT2LMHeadModel.from_pretrained('gpt2')

# Train PPO with LLM feedback
ppo_with_llm_feedback(env, policy_network, llm, tokenizer, optimizer, num_episodes=100)


In [None]:
## Your answer

Exercise 7: DQN (Deep Q-Network) with Transformers

Objective: Implement DQN using a transformer to predict the Q-values.

In [None]:
## Your answer

Exercise 8: Implement C51 using a transformer to predict the distribution over Q-values.

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import gym
import random
import numpy as np
from collections import deque

class TransformerC51(nn.Module):
    def __init__(self, state_dim, action_dim, nhead, num_layers, num_atoms, v_min, v_max):
        super(TransformerC51, self).__init__()
        # nn.Transformer expects both src and tgt; an encoder-only stack takes a single input
        encoder_layer = nn.TransformerEncoderLayer(d_model=state_dim, nhead=nhead, dropout=0.0, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.fc = nn.Linear(state_dim, action_dim * num_atoms)
        self.num_atoms = num_atoms
        self.v_min = v_min
        self.v_max = v_max
        self.delta_z = (v_max - v_min) / (num_atoms - 1)
        self.register_buffer('z', torch.linspace(v_min, v_max, num_atoms))  # support atoms
    
    def forward(self, state):
        state = state.unsqueeze(1)  # Add sequence dimension: (batch, 1, state_dim)
        transformer_out = self.transformer(state)
        logits = self.fc(transformer_out.squeeze(1))
        logits = logits.view(logits.size(0), -1, self.num_atoms)
        return torch.softmax(logits, dim=-1)  # per-action probabilities over the atoms

def projection_distribution(next_dist, rewards, dones, gamma, num_atoms, v_min, v_max, delta_z):
    # Project the shifted support r + gamma * z back onto the fixed atoms
    batch_size = rewards.size(0)
    z = torch.linspace(v_min, v_max, num_atoms)
    next_z = rewards.unsqueeze(1) + gamma * (1 - dones).unsqueeze(1) * z.unsqueeze(0)
    next_z = next_z.clamp(v_min, v_max)
    b = ((next_z - v_min) / delta_z).clamp(0, num_atoms - 1)
    l = b.floor().long()
    u = b.ceil().long()
    same = (l == u).float()  # b falls exactly on an atom: keep all of its mass there
    m = torch.zeros(batch_size, num_atoms)
    offset = (torch.arange(batch_size) * num_atoms).unsqueeze(1)
    m.view(-1).index_add_(0, (l + offset).view(-1), (next_dist * (u.float() - b + same)).view(-1))
    m.view(-1).index_add_(0, (u + offset).view(-1), (next_dist * (b - l.float())).view(-1))
    return m

def train_c51(env, policy_network, target_network, optimizer, replay_buffer, batch_size=64, gamma=0.99):
    if len(replay_buffer) < batch_size:
        return  # not enough transitions yet
    for _ in range(1000):  # Number of training iterations
        batch = random.sample(replay_buffer, batch_size)
        state, action, reward, next_state, done = zip(*batch)
        
        state = torch.tensor(np.array(state), dtype=torch.float32)
        action = torch.tensor(action, dtype=torch.long)
        reward = torch.tensor(reward, dtype=torch.float32)
        next_state = torch.tensor(np.array(next_state), dtype=torch.float32)
        done = torch.tensor(done, dtype=torch.float32)
        
        dist = policy_network(state)
        dist = dist[range(batch_size), action]  # (batch, num_atoms)
        
        with torch.no_grad():
            next_dist = target_network(next_state)
            # Greedy next action with respect to the expected value sum_i z_i * p_i
            next_action = (next_dist * target_network.z).sum(-1).argmax(1)
            next_dist = next_dist[range(batch_size), next_action]
            target_dist = projection_distribution(next_dist, reward, done, gamma, policy_network.num_atoms, policy_network.v_min, policy_network.v_max, policy_network.delta_z)
        
        # Cross-entropy between the projected target and the predicted distribution
        loss = -torch.sum(target_dist * torch.log(dist.clamp(min=1e-8)), dim=1).mean()
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Update target network once per call, so that it lags behind the policy network
    target_network.load_state_dict(policy_network.state_dict())

env = gym.make('CartPole-v1')
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n
nhead = 2
num_layers = 2
num_atoms = 51
v_min = -10
v_max = 10

policy_network = TransformerC51(state_dim, action_dim, nhead, num_layers, num_atoms, v_min, v_max)
target_network = TransformerC51(state_dim, action_dim, nhead, num_layers, num_atoms, v_min, v_max)
target_network.load_state_dict(policy_network.state_dict())
optimizer = optim.Adam(policy_network.parameters(), lr=1e-3)

replay_buffer = deque(maxlen=10000)

num_episodes = 500
batch_size = 64

for episode in range(num_episodes):
    state = env.reset()
    episode_reward = 0
    
    for t in range(200):
        state_tensor = torch.tensor(state, dtype=torch.float32).unsqueeze(0)
        with torch.no_grad():
            dist = policy_network(state_tensor)
        # Greedy action with respect to the expected return of each action
        action = (dist * policy_network.z).sum(-1).argmax().item()
        
        if random.random() < 0.1:  # Epsilon-greedy policy
            action = env.action_space.sample()
        
        next_state, reward, done, _ = env.step(action)
        replay_buffer.append((state, action, reward, next_state, done))
        
        state = next_state
        episode_reward += reward
        
        if done:
            break
    
    train_c51(env, policy_network, target_network, optimizer, replay_buffer, batch_size)



In [None]:
## Your answer

Model-based reinforcement learning (Model-Based RL) builds and uses an explicit model of the environment for planning and decision-making. This approach can be more sample-efficient than model-free methods (Model-Free RL), because the agent can also learn from experience simulated by the model.

### Learning the model

Learning a model of the environment means building an internal representation that captures its dynamics. The model can then predict state transitions and rewards, which supports planning and decision-making.

#### World Models

World Models is an approach in which the agent learns a compact representation of the environment and uses it to plan and act.

##### Key components

1. **VAE (Variational Autoencoder)**: Learns a latent representation of the environment's states.

   - **VAE loss function** (the negative ELBO, which is minimized):
     $$
     \mathcal{L}_{\text{VAE}} = -\mathbb{E}_{q_\phi(z|s)} \left[ \log p_\theta(s|z) \right] + D_{\text{KL}}(q_\phi(z|s) \| p(z))
     $$
     
     - **Reconstruction term**: $\mathbb{E}_{q_\phi(z|s)} \left[ \log p_\theta(s|z) \right]$ measures how well the VAE reconstructs the original state $s$ from the latent representation $z$. It enters the loss with a negative sign, so better reconstructions lower the loss.
     - **Regularization term**: $D_{\text{KL}}(q_\phi(z|s) \| p(z))$ is the KL divergence between the approximate posterior $q_\phi(z|s)$ and a prior $p(z)$ (usually a standard Gaussian).

2. **MDN-RNN (Mixture Density Network - Recurrent Neural Network)**: A recurrent model that predicts the next latent representation and reward given the current latent representation and action.

   - **Predictive distribution of the MDN-RNN**:
     $$
     p(z_{t+1}, r_t | z_t, a_t) = \text{MDN-RNN}(z_t, a_t)
     $$
     - The MDN-RNN outputs a mixture of Gaussian distributions over the possible next latent representations $z_{t+1}$ and rewards $r_t$. It is trained by minimizing the negative log-likelihood of the observed values under this mixture.

3. **Controller**: Uses the latent representation and the MDN-RNN's predictions to choose actions.
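
The two training objectives above can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions, not the original World Models code: the VAE uses a diagonal-Gaussian encoder, a standard-normal prior, and a unit-variance Gaussian decoder, and the mixture uses diagonal Gaussian components. The function names `vae_loss` and `mdn_nll` are made up for this sketch.

```python
import math
import torch

def vae_loss(recon_s, s, mu, log_var):
    """Negative ELBO, averaged over the batch."""
    batch = s.size(0)
    # Reconstruction: unit-variance Gaussian negative log-likelihood, up to a constant
    recon = 0.5 * ((recon_s - s) ** 2).sum() / batch
    # Closed-form KL(q(z|s) || N(0, I)) for a diagonal Gaussian posterior
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp()) / batch
    return recon + kl

def mdn_nll(pi_logits, mu, log_sigma, target):
    """Negative log-likelihood of target under a Gaussian mixture.

    pi_logits: (batch, K); mu, log_sigma: (batch, K, D); target: (batch, D).
    """
    target = target.unsqueeze(1)  # (batch, 1, D), broadcast against the K components
    log_prob = (-0.5 * ((target - mu) / log_sigma.exp()) ** 2
                - log_sigma - 0.5 * math.log(2 * math.pi))
    log_prob = log_prob.sum(-1)                      # (batch, K)
    log_mix = torch.log_softmax(pi_logits, dim=-1)   # log mixture weights
    # logsumexp keeps the mixture likelihood numerically stable
    return -torch.logsumexp(log_mix + log_prob, dim=-1).mean()
```

In this setup, `target` would hold the next latent $z_{t+1}$ (optionally concatenated with $r_t$), and the mixture parameters would come from the recurrent network's output at step $t$.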

#### I2A (Imagination-Augmented Agents)

I2A uses an imagination model to simulate possible futures and improve decision making.

##### Key components

1. **Imagination model**: Simulates future transitions of the environment.

   - **Transition model**:
     $$
     \hat{s}_{t+1}, \hat{r}_t = f(s_t, a_t)
     $$
     - Here $f(s_t, a_t)$ is a model that predicts the next state $\hat{s}_{t+1}$ and the reward $\hat{r}_t$ given the current state $s_t$ and the action $a_t$.

2. **Imagination Core**: Generates several imagined trajectories from the current state and the candidate actions.

   - **Imagined trajectories**:
     $$
     \{ (\hat{s}_{t+1}^i, \hat{r}_t^i) \}_{i=1}^n
     $$
     - This set of imagined trajectories lets the agent consider several possible futures and their rewards.

3. **Policy Network**: Combines the imagined trajectories to select an action.

   - **Action selection**:
     $$
     a_t = \pi(s_t, \{\hat{s}_{t+1}^i, \hat{r}_t^i \}_{i=1}^n)
     $$
     - The policy network $\pi$ takes the current state $s_t$ and the imagined trajectories as input and selects the action $a_t$.
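
The idea of imagining one trajectory per candidate action can be sketched in a few lines. This is a simplified illustration under stated assumptions: `toy_model` stands in for the learned model $f(s_t, a_t)$, `greedy_rollout_policy` is a hypothetical rollout policy, and the final arg-max replaces the learned policy network that I2A would use to aggregate the imagined trajectories.

```python
def toy_model(state, action):
    """Hypothetical learned model f(s, a): returns (next_state, reward)."""
    next_state = state + action          # move left (-1), stay (0) or right (+1)
    reward = -abs(next_state - 3)        # positions closer to 3 are better
    return next_state, reward

def greedy_rollout_policy(state):
    """Hypothetical rollout policy: step towards position 3."""
    return (3 > state) - (3 < state)

def imagine(state, first_action, rollout_policy, depth, gamma=0.9):
    """Roll the model forward `depth` steps and return the discounted imagined return."""
    total, discount = 0.0, 1.0
    action = first_action
    for _ in range(depth):
        state, reward = toy_model(state, action)
        total += discount * reward
        discount *= gamma
        action = rollout_policy(state)
    return total

def select_action(state, actions=(-1, 0, 1), depth=3):
    """Pick the action whose imagined trajectory has the highest return."""
    returns = {a: imagine(state, a, greedy_rollout_policy, depth) for a in actions}
    return max(returns, key=returns.get), returns

best_action, imagined_returns = select_action(state=0)
print(best_action, imagined_returns)
```

Starting from position 0, moving right yields the best imagined return, so that action is selected.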

#### MBMF (Model-Based Model-Free)

MBMF combines elements of model-based and model-free RL to get the benefits of both.

##### Key components

1. **Dynamics model**: Predicts transitions and rewards.

   - **Transition model**:
     $$
     \hat{s}_{t+1}, \hat{r}_t = f(s_t, a_t)
     $$

2. **Short-horizon planning**: Uses the dynamics model to plan over a short horizon and generate rollouts.

   - **Short-horizon rollouts**:
     $$
     \{ (\hat{s}_{t+1}^k, \hat{r}_t^k) \}_{k=1}^K
     $$
     - Several simulated sequences of states and rewards are generated over a short planning horizon.

3. **Model-free training**: Uses the generated rollouts to update a model-free policy or value function.

   - **Policy update**:
     $$
     \theta \leftarrow \theta - \alpha \nabla_\theta \mathcal{L}(\pi_\theta, \{ (\hat{s}_{t+1}^k, \hat{r}_t^k) \}_{k=1}^K)
     $$
     - The simulated rollouts are used to compute the gradients and update the policy parameters $\theta$.
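
One way to see how simulated rollouts can feed a model-free learner is a tabular sketch. The block below is an illustration, not the MBMF algorithm itself: it swaps the policy-gradient update for tabular Q-learning, `model` is a hypothetical dynamics model on a five-state chain, and all hyperparameters are arbitrary choices for the example.

```python
import random

def model(state, action):
    """Hypothetical learned dynamics on a 5-state chain; reward 1 on reaching state 4."""
    next_state = max(0, min(4, state + action))
    reward = 1.0 if next_state == 4 else 0.0
    return next_state, reward

def train_q_from_rollouts(num_rollouts=2000, horizon=3, alpha=0.5, gamma=0.9, seed=0):
    """Q-learning where every transition comes from short model rollouts."""
    rng = random.Random(seed)
    actions = (-1, 1)
    q = {(s, a): 0.0 for s in range(5) for a in actions}
    for _ in range(num_rollouts):
        state = rng.randrange(4)                  # start anywhere except the goal
        for _ in range(horizon):                  # short imagined rollout
            action = rng.choice(actions)
            next_state, reward = model(state, action)
            # Model-free Q-learning update applied to the simulated transition
            bootstrap = 0.0 if next_state == 4 else max(q[(next_state, b)] for b in actions)
            q[(state, action)] += alpha * (reward + gamma * bootstrap - q[(state, action)])
            if next_state == 4:
                break
            state = next_state
    return q

def greedy_policy(q):
    """Greedy action for every non-terminal state."""
    return {s: max((-1, 1), key=lambda a: q[(s, a)]) for s in range(4)}

q_values = train_q_from_rollouts()
greedy = greedy_policy(q_values)
print(greedy)
```

No real environment step is taken here; the learner only ever sees transitions produced by the model, which is the role the rollouts play in the update above.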

#### MBVE (Model-Based Value Expansion)

MBVE uses a model of the environment to expand the value estimate over a short imagined horizon $H$ and bootstraps with a value function beyond that horizon.

##### Key components

1. **Dynamics model**: Predicts transitions and rewards.

   - **Transition model**:
     $$
     \hat{s}_{t+1}, \hat{r}_t = f(s_t, a_t)
     $$

2. **Value expansion**: Uses the model to unroll the value estimate for $H$ steps.

   - **Value expansion**:
     $$
     V(s_t) = \hat{r}_t + \gamma \hat{r}_{t+1} + \gamma^2 \hat{r}_{t+2} + \dots + \gamma^{H-1} \hat{r}_{t+H-1} + \gamma^H V_\text{target}(\hat{s}_{t+H})
     $$
     - This equation combines the $H$ predicted rewards $\hat{r}_t, \dots, \hat{r}_{t+H-1}$ with the long-term value estimate $V_\text{target}$ at the last imagined state.

3. **Value training**: Updates the value function using the value expansion.

   - **Value function update**:
     $$
     \theta \leftarrow \theta - \alpha \nabla_\theta \mathcal{L}(V_\theta, V(s_t))
     $$
     - The expanded value is used as the regression target to compute the gradients and update the value function parameters $\theta$.
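
The value-expansion target is easy to check by hand. The sketch below computes it from a list of predicted rewards and a bootstrap value; `value_expansion_target` is a hypothetical helper name, and the numbers are made up for the example.

```python
def value_expansion_target(predicted_rewards, bootstrap_value, gamma):
    """H-step target: sum_k gamma^k * r_hat_{t+k} + gamma^H * V_target(s_hat_{t+H})."""
    target, discount = 0.0, 1.0
    for reward in predicted_rewards:      # r_hat_t, ..., r_hat_{t+H-1}
        target += discount * reward
        discount *= gamma
    # After H rewards, `discount` equals gamma^H
    return target + discount * bootstrap_value

# H = 3 imagined steps with gamma = 0.5:
# 1.0 + 0.5 * 0.0 + 0.25 * 2.0 + 0.125 * 10.0 = 2.75
target = value_expansion_target([1.0, 0.0, 2.0], bootstrap_value=10.0, gamma=0.5)
print(target)
```

With an empty reward list (horizon 0) the function returns the bootstrap value unchanged, which recovers the plain model-free target.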

### Using a given model

In some cases the agent is handed a known model of the environment (for example, the rules of a board game) and uses it directly for planning.

#### AlphaZero

AlphaZero combines Monte Carlo tree search (MCTS) with deep neural networks and reaches superhuman strength in board games such as chess, shogi and Go.

##### Key components

1. **Neural network**: Estimates the policy and the state value.

   - **Neural network**:
     $$
     \pi(a|s), v(s) = f_\theta(s)
     $$
     - The network with parameters $\theta$ outputs a probability $\pi(a|s)$ for each action and a value $v(s)$ for the state.

2. **MCTS search**: Runs simulations to explore possible future move sequences.

   - **Selection**: Picks the child with the highest upper confidence bound. AlphaZero uses the PUCT variant, which weights the exploration bonus by the network prior:
     $$
     \text{UCB}(s, a) = Q(s, a) + c \, \pi(a|s) \frac{\sqrt{N(s)}}{1 + N(s, a)}
     $$
     - Here $Q(s, a)$ is the mean value of action $a$ in state $s$, $\pi(a|s)$ is the prior given by the network, $N(s)$ is the total number of visits to node $s$, and $N(s, a)$ is the number of visits to the child node reached by action $a$.

   - **Expansion**: Adds a new node to the search tree.

   - **Evaluation**: Classic MCTS runs a rollout from the expanded node to a terminal state or for a fixed number of steps. AlphaZero instead evaluates the expanded node with the value head $v(s)$ of the network.

   - **Backup**: Propagates the value back up the tree, updating $Q(s, a)$ along the visited path.
     $$
     Q(s, a) \leftarrow Q(s, a) + \frac{1}{N(s, a)} \left( v - Q(s, a) \right)
     $$
     - Here $v$ is the value obtained at the expanded node.

3. **Policy update**: Uses the statistics of the MCTS simulations to build an improved policy.

   - **New policy**:
     $$
     \pi_\text{new}(a|s) \propto N(s, a)
     $$
     - The new policy is proportional to how often each action was visited during the MCTS search.

4. **Training**: Updates the network parameters using the positions from self-play and the outcomes of the games.

   - **Training loss**:
     $$
     \mathcal{L}(\pi_\theta, \pi_\text{new}, v_\theta, z) = (z - v_\theta)^2 - \pi_\text{new} \cdot \log \pi_\theta
     $$
     - Here $z$ is the game outcome (1, 0, -1) and $\pi_\text{new}$ is the improved policy from MCTS.

   - **Parameter update**:
     $$
     \theta \leftarrow \theta - \alpha \nabla_\theta \mathcal{L}(\pi_\theta, \pi_\text{new}, v_\theta, z)
     $$
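
The search statistics can be sketched for a single node. The block below is a minimal illustration, assuming a prior-weighted exploration bonus of the PUCT form used by AlphaZero; the `Node` class, the constant `c_puct=1.5` and the fixed `true_value` table (a stand-in for the network's leaf evaluation) are all hypothetical choices made for this example.

```python
import math

class Node:
    """Statistics for one state: per-action prior P, visit count N and mean value Q."""
    def __init__(self, priors):
        self.P = dict(priors)
        self.N = {a: 0 for a in priors}
        self.Q = {a: 0.0 for a in priors}

def select_action(node, c_puct=1.5):
    """Return the action maximising Q + c * P * sqrt(sum_b N_b) / (1 + N_a)."""
    total = sum(node.N.values())
    def score(a):
        return node.Q[a] + c_puct * node.P[a] * math.sqrt(total) / (1 + node.N[a])
    return max(node.P, key=score)

def backup(node, action, v):
    """Incremental-mean update of Q(s, a) with the leaf value v."""
    node.N[action] += 1
    node.Q[action] += (v - node.Q[action]) / node.N[action]

def visit_policy(node):
    """Improved policy proportional to the visit counts."""
    total = sum(node.N.values())
    return {a: n / total for a, n in node.N.items()}

# Toy search: action "b" has the higher value, action "a" the higher prior.
true_value = {"a": 0.1, "b": 0.6}
root = Node({"a": 0.7, "b": 0.3})
for _ in range(200):
    action = select_action(root)
    backup(root, action, true_value[action])   # stand-in for the network's leaf value
pi_new = visit_policy(root)
print(pi_new)
```

Although the prior favours "a", the visit counts shift toward "b" as its higher value is backed up, which is how the search improves on the raw network policy.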

---


**World models**

World Models combine a Variational Autoencoder (VAE), a Mixture Density Network with a Recurrent Neural Network (MDN-RNN), and a controller that chooses actions.

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions import Normal

class VAE(nn.Module):
    def __init__(self, state_dim, latent_dim):
        super(VAE, self).__init__()
        self.encoder = nn.Sequential(
            nn.Linear(state_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Linear(128, latent_dim * 2)
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 256),
            nn.ReLU(),
            nn.Linear(256, state_dim),
            nn.Sigmoid()
        )

    def encode(self, x):
        mu_logvar = self.encoder(x)
        mu, logvar = mu_logvar.chunk(2, dim=-1)
        return mu, logvar

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std

    def decode(self, z):
        return self.decoder(z)

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        return self.decode(z), mu, logvar

def vae_loss(recon_x, x, mu, logvar):
    BCE = nn.functional.binary_cross_entropy(recon_x, x, reduction='sum')
    KLD = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return BCE + KLD

class MDNRNN(nn.Module):
    def __init__(self, latent_dim, action_dim, hidden_dim, num_gaussians):
        super(MDNRNN, self).__init__()
        self.rnn = nn.LSTM(latent_dim + action_dim, hidden_dim, batch_first=True)
        self.fc_pi = nn.Linear(hidden_dim, num_gaussians)
        self.fc_mu = nn.Linear(hidden_dim, num_gaussians * latent_dim)
        self.fc_sigma = nn.Linear(hidden_dim, num_gaussians * latent_dim)

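    def forward(self, z, a, h):
        # z: (batch, latent_dim), a: (batch, action_dim); one time step per call
        x = torch.cat([z, a], dim=-1).unsqueeze(1)
        out, h = self.rnn(x, h)
        out = out.squeeze(1)
        pi = self.fc_pi(out)  # mixture logits, shape (batch, K)
        k = pi.size(-1)
        # Reshape to (batch, K, latent_dim): one Gaussian per mixture component
        mu = self.fc_mu(out).view(out.size(0), k, -1)
        # exp keeps the standard deviations strictly positive
        sigma = torch.exp(self.fc_sigma(out)).view(out.size(0), k, -1)
        return pi, mu, sigma, h

def mdn_loss(pi, mu, sigma, z):
    # pi: (batch, K) logits; mu, sigma: (batch, K, latent_dim); z: (batch, latent_dim) target
    m = Normal(mu, sigma)
    z = z.unsqueeze(1).expand_as(m.loc)
    # Log-density of each component, summed over the latent dimensions: (batch, K)
    log_prob = m.log_prob(z).sum(dim=-1)
    log_sum_exp = torch.logsumexp(log_prob + torch.log_softmax(pi, dim=-1), dim=-1)
    return -log_sum_exp.mean()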

class Controller(nn.Module):
    def __init__(self, latent_dim, action_dim):
        super(Controller, self).__init__()
        self.fc = nn.Sequential(
            nn.Linear(latent_dim, 256),
            nn.ReLU(),
            nn.Linear(256, action_dim),
            nn.Softmax(dim=-1)
        )

    def forward(self, z):
        return self.fc(z)

# Training loop for World Models
def train_world_model(vae, mdnrnn, controller, data, vae_optimizer, mdnrnn_optimizer, controller_optimizer):
    # Assumes `data` is a DataLoader yielding (states, actions, next_states), with
    # states scaled to [0, 1] (the VAE loss uses BCE) and actions as float vectors.
    for states, actions, next_states in data:
        # VAE training
        vae_optimizer.zero_grad()
        recon_x, mu, logvar = vae(states)
        loss_vae = vae_loss(recon_x, states, mu, logvar)
        loss_vae.backward()
        vae_optimizer.step()

        # MDN-RNN training: predict the *next* latent code; the VAE is frozen here
        mdnrnn_optimizer.zero_grad()
        with torch.no_grad():
            z = vae.encode(states)[0]
            z_next = vae.encode(next_states)[0]
        pi, mu, sigma, _ = mdnrnn(z, actions, None)
        loss_mdnrnn = mdn_loss(pi, mu, sigma, z_next)
        loss_mdnrnn.backward()
        mdnrnn_optimizer.step()

        # Controller training (usually requires a simulation loop)
        # controller_optimizer.zero_grad()
        # action_probs = controller(z)
        # loss_controller = ...
        # loss_controller.backward()
        # controller_optimizer.step()


I2A uses an imagination model to simulate possible futures and improve decision making.

In [None]:
class ImaginationCore(nn.Module):
    def __init__(self, model, policy, value):
        super(ImaginationCore, self).__init__()
        self.model = model
        self.policy = policy
        self.value = value

    def forward(self, state, action):
        next_state, reward = self.model(state, action)
        next_action = self.policy(next_state)
        next_value = self.value(next_state)
        return next_state, reward, next_action, next_value

# Policy and value models
class Policy(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(Policy, self).__init__()
        self.fc = nn.Sequential(
            nn.Linear(state_dim, 256),
            nn.ReLU(),
            nn.Linear(256, action_dim),
            nn.Softmax(dim=-1)
        )

    def forward(self, state):
        return self.fc(state)

class Value(nn.Module):
    def __init__(self, state_dim):
        super(Value, self).__init__()
        self.fc = nn.Sequential(
            nn.Linear(state_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1)
        )

    def forward(self, state):
        return self.fc(state)

# Transition model
class TransitionModel(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(TransitionModel, self).__init__()
        self.fc = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256),
            nn.ReLU(),
            nn.Linear(256, state_dim + 1)
        )

    def forward(self, state, action):
        x = torch.cat([state, action], dim=-1)
        out = self.fc(x)
        next_state = out[:, :-1]
        reward = out[:, -1]
        return next_state, reward

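# Training loop for I2A (simplified one-step surrogate, not the full I2A architecture)
def train_i2a(imagination_core, policy, value, data, policy_optimizer, value_optimizer, gamma=0.99):
    # Assumes `data` is a DataLoader yielding states (batch, state_dim) and
    # one-hot float actions (batch, action_dim). `gamma` is an added assumption.
    for states, actions in data:
        # Imagine one step ahead; the imagined targets carry no gradient
        with torch.no_grad():
            _, imagined_rewards, _, imagined_values = imagination_core(states, actions)
            targets = imagined_rewards + gamma * imagined_values.squeeze(-1)

        # Value training: regress V(s) onto the imagined one-step return
        value_optimizer.zero_grad()
        state_values = value(states).squeeze(-1)
        loss_value = nn.functional.mse_loss(state_values, targets)
        loss_value.backward()
        value_optimizer.step()

        # Policy training: advantage-weighted log-likelihood of the taken actions
        policy_optimizer.zero_grad()
        action_probs = policy(states)
        log_probs = torch.log((action_probs * actions).sum(dim=-1) + 1e-10)
        advantages = targets - state_values.detach()
        loss_policy = -(advantages * log_probs).mean()
        loss_policy.backward()
        policy_optimizer.step()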


AlphaZero combines Monte Carlo tree search (MCTS) with deep neural networks.

In [None]:
class AlphaZeroNetwork(nn.Module):
    def __init__(self, state_dim, action_dim, board_size=19):
        # state_dim: number of input planes (channels); board_size: side of the square board
        super(AlphaZeroNetwork, self).__init__()
        self.conv1 = nn.Conv2d(state_dim, 256, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(256, 256, kernel_size=3, padding=1)
        # Each head reduces the channels with a 1x1 convolution, then maps the flattened board
        self.policy_head = nn.Conv2d(256, 2, kernel_size=1)
        self.fc_policy = nn.Linear(2 * board_size * board_size, action_dim)
        self.value_head = nn.Conv2d(256, 1, kernel_size=1)
        self.fc_value = nn.Linear(board_size * board_size, 1)

    def forward(self, state):
        x = torch.relu(self.conv1(state))
        x = torch.relu(self.conv2(x))
        policy = torch.softmax(self.fc_policy(self.policy_head(x).flatten(1)), dim=-1)
        value = torch.tanh(self.fc_value(self.value_head(x).flatten(1)))
        return policy, value

def mcts_policy_value(network, state, num_simulations):
    # Pseudocode: `mcts_simulation` and `compute_policy_and_value_from_mcts`
    # are placeholders that are not defined in this notebook.
    for _ in range(num_simulations):
        mcts_simulation(network, state)
    return compute_policy_and_value_from_mcts()

# Training loop for AlphaZero
def train_alphazero(network, data, optimizer):
    # Assumes `data` is a DataLoader yielding states, MCTS policies
    # (probability vectors) and game outcomes of shape (batch, 1).
    for states, target_policies, target_values in data:
        optimizer.zero_grad()
        policies, values = network(states)
        # Cross-entropy against the MCTS distribution. The network already outputs
        # probabilities, so nn.CrossEntropyLoss (which expects logits) is not used.
        loss_policy = -(target_policies * torch.log(policies + 1e-10)).sum(dim=-1).mean()
        loss_value = nn.MSELoss()(values, target_values)
        loss = loss_policy + loss_value
        loss.backward()
        optimizer.step()


#### Exercise 1: World Models

Goal: Implement a World Model that learns a latent representation of the environment with a VAE, predicts state transitions and rewards with an MDN-RNN, and chooses actions with a controller.

In [None]:
## Your answer

#### Exercise 2: I2A (Imagination-Augmented Agents)

Goal: Implement I2A, which uses an imagination model to simulate possible futures and improve decision making.

In [None]:
## Your answer

#### Exercise 3: MBMF (Model-Based Model-Free)
Goal: Implement MBMF, which combines model-based and model-free RL.

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

class TransitionModel(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(TransitionModel, self).__init__()
        self.fc = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256),
            nn.ReLU(),
            nn.Linear(256, state_dim + 1)
        )

    def forward(self, state, action):
        x = torch.cat([state, action], dim=-1)
        out = self.fc(x)
        next_state = out[:, :-1]
        reward = out[:, -1]
        return next_state, reward

class Policy(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(Policy, self).__init__()
        self.fc = nn.Sequential(
            nn.Linear(state_dim, 256),
            nn.ReLU(),
            nn.Linear(256, action_dim),
            nn.Softmax(dim=-1)
        )

    def forward(self, state):
        return self.fc(state)

class Value(nn.Module):
    def __init__(self, state_dim):
        super(Value, self).__init__()
        self.fc = nn.Sequential(
            nn.Linear(state_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1)
        )

    def forward(self, state):
        return self.fc(state)

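def train_mbmf(transition_model, policy, value, data, policy_optimizer, value_optimizer, num_rollouts, gamma=0.99):
    # `actions` are expected as one-hot float vectors of shape (batch, action_dim).
    for states, actions in data:
        with torch.no_grad():
            # Imagined return of the logged action
            next_states, rewards = transition_model(states, actions)
            q_taken = rewards + gamma * value(next_states).squeeze(-1)
            # K one-step rollouts, each starting with an action sampled from the current policy
            probs = policy(states)
            rollout_returns = []
            for _ in range(num_rollouts):
                idx = torch.multinomial(probs, 1).squeeze(-1)
                sampled = nn.functional.one_hot(idx, probs.size(-1)).float()
                ns, r = transition_model(states, sampled)
                rollout_returns.append(r + gamma * value(ns).squeeze(-1))
            value_target = torch.stack(rollout_returns).mean(dim=0)

        # Model-free policy step: advantage-weighted log-likelihood of the logged actions
        policy_optimizer.zero_grad()
        log_probs = torch.log((policy(states) * actions).sum(dim=1) + 1e-10)
        loss_policy = -((q_taken - value_target) * log_probs).mean()
        loss_policy.backward()
        policy_optimizer.step()

        # Model-free value step: regress V(s) onto the average imagined return
        value_optimizer.zero_grad()
        loss_value = nn.MSELoss()(value(states).squeeze(-1), value_target)
        loss_value.backward()
        value_optimizer.step()

# Set up the models and dummy data
state_dim = 10
action_dim = 5

transition_model = TransitionModel(state_dim, action_dim)
policy = Policy(state_dim, action_dim)
value = Value(state_dim)

policy_optimizer = optim.Adam(policy.parameters(), lr=1e-3)
value_optimizer = optim.Adam(value.parameters(), lr=1e-3)

states = torch.randn(1000, state_dim)
# One-hot actions so they can be concatenated with the states
actions = nn.functional.one_hot(torch.randint(0, action_dim, (1000,)), action_dim).float()

data = DataLoader(TensorDataset(states, actions), batch_size=32, shuffle=True)

# Train MBMF (the transition model is left untrained in this toy setup)
train_mbmf(transition_model, policy, value, data, policy_optimizer, value_optimizer, num_rollouts=10)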


In [None]:
## Your answer

#### Exercise 4: MBVE (Model-Based Value Expansion)
Goal: Implement MBVE, which uses a model of the environment to expand the value estimate over a short horizon before bootstrapping with the value function.

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

class TransitionModel(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(TransitionModel, self).__init__()
        self.fc = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256),
            nn.ReLU(),
            nn.Linear(256, state_dim + 1)
        )

    def forward(self, state, action):
        x = torch.cat([state, action], dim=-1)
        out = self.fc(x)
        next_state = out[:, :-1]
        reward = out[:, -1]
        return next_state, reward

class Value(nn.Module):
    def __init__(self, state_dim):
        super(Value, self).__init__()
        self.fc = nn.Sequential(
            nn.Linear(state_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1)
        )

    def forward(self, state):
        return self.fc(state)

def train_mbve(transition_model, value, data, value_optimizer, gamma, horizon):
    # `actions` are expected as one-hot float vectors of shape (batch, action_dim).
    for states, actions in data:
        with torch.no_grad():
            # Unroll the model for `horizon` steps, accumulating discounted rewards.
            # Simplification: the logged action is repeated at every imagined step
            # because this exercise has no policy to query.
            s = states
            target = torch.zeros(states.size(0))
            discount = 1.0
            for _ in range(horizon):
                s, r = transition_model(s, actions)
                target = target + discount * r
                discount *= gamma
            # Bootstrap with the value of the last imagined state (gamma^H * V)
            target = target + discount * value(s).squeeze(-1)

        value_optimizer.zero_grad()
        state_values = value(states).squeeze(-1)
        loss_value = nn.MSELoss()(state_values, target)
        loss_value.backward()
        value_optimizer.step()

# Set up the models and dummy data
state_dim = 10
action_dim = 5

transition_model = TransitionModel(state_dim, action_dim)
value = Value(state_dim)

value_optimizer = optim.Adam(value.parameters(), lr=1e-3)

states = torch.randn(1000, state_dim)
# One-hot actions so they can be concatenated with the states
actions = nn.functional.one_hot(torch.randint(0, action_dim, (1000,)), action_dim).float()

data = DataLoader(TensorDataset(states, actions), batch_size=32, shuffle=True)

# Train MBVE
train_mbve(transition_model, value, data, value_optimizer, gamma=0.99, horizon=10)


In [None]:
## Your answer

#### Exercise 5: AlphaZero

Goal: Implement AlphaZero, which combines Monte Carlo tree search (MCTS) with deep neural networks.

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

class AlphaZeroNetwork(nn.Module):
    def __init__(self, state_dim, action_dim, board_size=19):
        # state_dim: number of input planes (channels); board_size: side of the square board
        super(AlphaZeroNetwork, self).__init__()
        self.conv1 = nn.Conv2d(state_dim, 256, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(256, 256, kernel_size=3, padding=1)
        # Each head reduces the channels with a 1x1 convolution, then maps the flattened board
        self.policy_head = nn.Conv2d(256, 2, kernel_size=1)
        self.fc_policy = nn.Linear(2 * board_size * board_size, action_dim)
        self.value_head = nn.Conv2d(256, 1, kernel_size=1)
        self.fc_value = nn.Linear(board_size * board_size, 1)

    def forward(self, state):
        x = torch.relu(self.conv1(state))
        x = torch.relu(self.conv2(x))
        policy = torch.softmax(self.fc_policy(self.policy_head(x).flatten(1)), dim=-1)
        value = torch.tanh(self.fc_value(self.value_head(x).flatten(1)))
        return policy, value

def mcts_policy_value(network, state, num_simulations):
    # MCTS would be implemented here (omitted for simplicity)
    pass

def train_alphazero(network, data, optimizer):
    for states, target_policies, target_values in data:
        optimizer.zero_grad()
        policies, values = network(states)
        # Cross-entropy against the target distribution. The network already outputs
        # probabilities, so nn.CrossEntropyLoss (which expects logits) is not used.
        loss_policy = -(target_policies * torch.log(policies + 1e-10)).sum(dim=-1).mean()
        loss_value = nn.MSELoss()(values, target_values)
        loss = loss_policy + loss_value
        loss.backward()
        optimizer.step()

# Set up the network and dummy data
state_dim = (3, 19, 19)
action_dim = 362

network = AlphaZeroNetwork(state_dim[0], action_dim, board_size=state_dim[1])
optimizer = optim.Adam(network.parameters(), lr=1e-3)

states = torch.randn(1000, *state_dim)
# Dummy MCTS policies (probability vectors) and game outcomes in {-1, 0, 1}
target_policies = torch.softmax(torch.randn(1000, action_dim), dim=-1)
target_values = torch.randint(-1, 2, (1000, 1)).float()

data = DataLoader(TensorDataset(states, target_policies, target_values), batch_size=32, shuffle=True)

# Train AlphaZero
train_alphazero(network, data, optimizer)


In [None]:
## Your answer

#### Exercise 6: World Models with transformers

Goal: Implement a World Model that uses a transformer for sequence prediction instead of an MDN-RNN.

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
import gym

# VAE that learns the latent representation
class VAE(nn.Module):
    def __init__(self, state_dim, latent_dim):
        super(VAE, self).__init__()
        self.encoder = nn.Sequential(
            nn.Linear(state_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Linear(128, latent_dim * 2)
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 256),
            nn.ReLU(),
            nn.Linear(256, state_dim),
            nn.Sigmoid()
        )

    def encode(self, x):
        mu_logvar = self.encoder(x)
        mu, logvar = mu_logvar.chunk(2, dim=-1)
        return mu, logvar

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std

    def decode(self, z):
        return self.decoder(z)

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        return self.decode(z), mu, logvar

def vae_loss(recon_x, x, mu, logvar):
    BCE = nn.functional.binary_cross_entropy(recon_x, x, reduction='sum')
    KLD = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return BCE + KLD

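# Transformer that predicts the transitions
class TransformerModel(nn.Module):
    def __init__(self, latent_dim, action_dim, nhead, num_layers, d_model=64):
        super(TransformerModel, self).__init__()
        # Project to d_model so the width is divisible by nhead (d_model=64 is an assumed default)
        self.input_proj = nn.Linear(latent_dim + action_dim, d_model)
        # Encoder-only stack: nn.Transformer would also require a target sequence
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.fc = nn.Linear(d_model, latent_dim + 1)

    def forward(self, z, a):
        # Each (z, a) pair is treated as a sequence of length 1: (batch, 1, d_model)
        x = self.input_proj(torch.cat([z, a], dim=-1)).unsqueeze(1)
        out = self.transformer(x)
        out = self.fc(out.squeeze(1))
        next_z = out[:, :-1]
        reward = out[:, -1]
        return next_z, reward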

# Controller that chooses actions
class Controller(nn.Module):
    def __init__(self, latent_dim, action_dim):
        super(Controller, self).__init__()
        self.fc = nn.Sequential(
            nn.Linear(latent_dim, 256),
            nn.ReLU(),
            nn.Linear(256, action_dim),
            nn.Softmax(dim=-1)
        )

    def forward(self, z):
        return self.fc(z)

# World Model training
def train_world_model(vae, transformer, controller, data, vae_optimizer, transformer_optimizer, controller_optimizer):
    for states, actions, next_states, rewards in data:
        # VAE training
        vae_optimizer.zero_grad()
        recon_x, mu, logvar = vae(states)
        loss_vae = vae_loss(recon_x, states, mu, logvar)
        loss_vae.backward()
        vae_optimizer.step()

        # Latent codes are treated as fixed inputs for the other two models
        with torch.no_grad():
            z = vae.encode(states)[0]
            z_next = vae.encode(next_states)[0]

        # Transformer training: predict the next latent code and the reward
        transformer_optimizer.zero_grad()
        pred_z, pred_r = transformer(z, actions)
        loss_transformer = nn.MSELoss()(pred_z, z_next) + nn.MSELoss()(pred_r, rewards)
        loss_transformer.backward()
        transformer_optimizer.step()

        # Controller training: behaviour cloning of the logged one-hot actions
        controller_optimizer.zero_grad()
        action_probs = controller(z)
        loss_controller = -torch.mean(torch.sum(actions * torch.log(action_probs + 1e-10), dim=1))
        loss_controller.backward()
        controller_optimizer.step()

# Set up the environment and the models
env = gym.make('CartPole-v1')
state_dim = env.observation_space.shape[0]
action_dim = int(env.action_space.n)
latent_dim = 32
nhead = 8
num_layers = 6

vae = VAE(state_dim, latent_dim)
transformer = TransformerModel(latent_dim, action_dim, nhead, num_layers)
controller = Controller(latent_dim, action_dim)

vae_optimizer = optim.Adam(vae.parameters(), lr=1e-3)
transformer_optimizer = optim.Adam(transformer.parameters(), lr=1e-3)
controller_optimizer = optim.Adam(controller.parameters(), lr=1e-3)

# Dummy training data. States lie in [0, 1] because the VAE uses a sigmoid decoder with BCE.
states = torch.rand(1000, state_dim)
next_states = torch.rand(1000, state_dim)
actions = nn.functional.one_hot(torch.randint(0, action_dim, (1000,)), action_dim).float()
rewards = torch.randn(1000)

data = DataLoader(TensorDataset(states, actions, next_states, rewards), batch_size=32, shuffle=True)

# Train the World Model
train_world_model(vae, transformer, controller, data, vae_optimizer, transformer_optimizer, controller_optimizer)


In [None]:
## Your answer

#### Exercise 7: RLHF (Reinforcement Learning from Human Feedback)
Goal: Implement an agent that uses human feedback to improve its policy.

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
import gym

class PolicyNetwork(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(PolicyNetwork, self).__init__()
        self.fc = nn.Sequential(
            nn.Linear(state_dim, 256),
            nn.ReLU(),
            nn.Linear(256, action_dim),
            nn.Softmax(dim=-1)
        )

    def forward(self, x):
        return self.fc(x)

def train_policy_with_human_feedback(policy_network, feedback_data, optimizer):
    for states, actions, feedback in feedback_data:
        optimizer.zero_grad()
        action_probs = policy_network(states)
        # Log-probability of the action that was rated, shape (batch, 1)
        log_probs = torch.log(action_probs.gather(1, actions.unsqueeze(1)) + 1e-10)
        # REINFORCE-style objective: the human feedback plays the role of the reward
        loss = -torch.mean(feedback * log_probs)
        loss.backward()
        optimizer.step()

# Set up the environment and dummy data
env = gym.make('CartPole-v1')
state_dim = env.observation_space.shape[0]
action_dim = int(env.action_space.n)

policy_network = PolicyNetwork(state_dim, action_dim)
optimizer = optim.Adam(policy_network.parameters(), lr=1e-3)

# Generate dummy training data
states = torch.randn(1000, state_dim)
actions = torch.randint(0, action_dim, (1000,))  # action indices
feedback = torch.randn(1000, 1)  # Fake feedback from human

feedback_data = DataLoader(TensorDataset(states, actions, feedback), batch_size=32, shuffle=True)

# Train the policy with human feedback
train_policy_with_human_feedback(policy_network, feedback_data, optimizer)


In [None]:
## Your answer

#### Exercise 8: AlphaZero with Transformers
Goal: Implement AlphaZero using a transformer to predict policies and values.

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

class TransformerAlphaZero(nn.Module):
    def __init__(self, state_dim, action_dim, nhead, num_layers):
        super(TransformerAlphaZero, self).__init__()
        # Encoder-only stack: nn.Transformer would also require a target sequence
        layer = nn.TransformerEncoderLayer(d_model=state_dim, nhead=nhead, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.policy_head = nn.Linear(state_dim, action_dim)
        self.value_head = nn.Linear(state_dim, 1)

    def forward(self, state):
        # Each state is treated as a sequence of length 1: (batch, 1, state_dim)
        x = self.transformer(state.unsqueeze(1)).squeeze(1)
        policy = torch.softmax(self.policy_head(x), dim=-1)
        value = torch.tanh(self.value_head(x))
        return policy, value

def train_alphazero_transformer(network, data, optimizer):
    for states, target_policies, target_values in data:
        optimizer.zero_grad()
        policies, values = network(states)
        # Cross-entropy against the target distribution. The network already outputs
        # probabilities, so nn.CrossEntropyLoss (which expects logits) is not used.
        loss_policy = -(target_policies * torch.log(policies + 1e-10)).sum(dim=-1).mean()
        loss_value = nn.MSELoss()(values, target_values)
        loss = loss_policy + loss_value
        loss.backward()
        optimizer.step()

# Set up the network and dummy data
state_dim = 128
action_dim = 10
nhead = 8
num_layers = 6

network = TransformerAlphaZero(state_dim, action_dim, nhead, num_layers)
optimizer = optim.Adam(network.parameters(), lr=1e-3)

states = torch.randn(1000, state_dim)
# Dummy MCTS policies (probability vectors) and game outcomes in {-1, 0, 1}
target_policies = torch.softmax(torch.randn(1000, action_dim), dim=-1)
target_values = torch.randint(-1, 2, (1000, 1)).float()

data = DataLoader(TensorDataset(states, target_policies, target_values), batch_size=32, shuffle=True)

# Train AlphaZero with transformers
train_alphazero_transformer(network, data, optimizer)


In [None]:
## Your answer