# 📋 **Student Information**

*Complete the required fields below with your personal and W&B account details.*

In [1]:
from kaggle_secrets import UserSecretsClient
secret_label = "WANDB_API_KEY"
secret_value = UserSecretsClient().get_secret(secret_label)
print(secret_value)

1940b11d1a8dd3b2fd752c6599025c5128ac0877


In [2]:
FIRST_NAME = "Mohammadjavad" # replace with your first name
LAST_NAME = "Ahmadpour" # replace with your last name
STUDENT_ID = 400104697 # replace with your student id
WANDB_ID = "mohamadahmadpour1383" # replace with your wandb username
PROJECT_NAME = f"{FIRST_NAME}-{LAST_NAME}-DQN-EXPLORE-HW"
print(f"Project name: {PROJECT_NAME}")

Project name: Mohammadjavad-Ahmadpour-DQN-EXPLORE-HW


In [3]:
print(f"Check my results at https://wandb.ai/{WANDB_ID}/{PROJECT_NAME}")

Check my results at https://wandb.ai/mohamadahmadpour1383/Mohammadjavad-Ahmadpour-DQN-EXPLORE-HW


In [4]:
# Set DEBUG to True if you are still implementing the code and debugging
# and don't want to make your wandb dashboard messy.
# set DEBUG to False if you are almost done with the implementation
# and want check performance and compare hyperparameters and models
DEBUG = False

# 📘 Guidelines

> ⚠️ **Please read this section carefully before proceeding.**

### 🔧 Install Dependencies

In [5]:
!apt install build-essential python3-dev
!git clone https://github.com/DeepRLCourse/Homework-10.git
%pip install swig
%pip install "Homework-10/BootstrapDQN"

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
build-essential is already the newest version (12.9ubuntu3).
python3-dev is already the newest version (3.10.6-1~22.04.1).
python3-dev set to manually installed.
0 upgraded, 0 newly installed, 0 to remove and 87 not upgraded.
Cloning into 'Homework-10'...
remote: Enumerating objects: 111, done.[K
remote: Counting objects: 100% (111/111), done.[K
remote: Compressing objects: 100% (67/67), done.[K
remote: Total 111 (delta 37), reused 108 (delta 34), pack-reused 0 (from 0)[K
Receiving objects: 100% (111/111), 1.14 MiB | 8.83 MiB/s, done.
Resolving deltas: 100% (37/37), done.
Collecting swig
  Downloading swig-4.3.1-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.metadata (3.5 kB)
Downloading swig-4.3.1-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.9 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.9/1.9 MB[0m [31m24.5 MB/s[0m eta [36m0:00:00[0m0

### 📊 Weights & Biases (W&B) Integration

Follow these steps to set up tracking with [Weights & Biases](https://wandb.ai/site/):

1. [Create a W&B account](https://wandb.ai/site/).
2. Set the `WANDB_ID` variable in the **Student Information** section to your W&B username.
3. Create a new project using the name defined in the `PROJECT_NAME` variable. Ensure the project visibility is set to **Public**.
4. [Retrieve your W&B API key](https://docs.wandb.ai/support/find_api_key/).
5. Set the `WANDB_API_KEY`:
   - As a **secret** if you're using **Google Colab** or **Kaggle**
   - As an **environment variable** if running locally

#### 💻 Platform-Specific Setup

##### Google Colab

#### Kaggle

To configure W&B API key in Kaggle:

- Go to: `Add-ons` → `Secrets` → `Add Secret`
- **Label:** `WANDB_API_KEY`  
- **Value:** `<your_api_key>`

> You only need to add the secret — no code changes are required.

#### Local

You can set the `WANDB_API_KEY` as an environment variable manually or,

store it in a `.env` file:
```bash
# secrets.env
WANDB_API_KEY=your_api_key
```
and then run the following cell:

In [6]:
from bootstrapdqn import get_machine
if get_machine() == "Local Machine":
    import dotenv
    dotenv.load_dotenv(".workspace/secrets.env") # give it the path to your secrets.env file

### 📤 Submission Requirements

In addition to submitting this notebook on **Quera**, you must:

- Have a W&B project matching the name in `PROJECT_NAME`, under the account defined by `WANDB_ID`
- Ensure the W&B link displayed in the **Student Information** section is valid
- Tag your final experiment run for **each algorithm** with `Final`:
  - Go to **Runs** (left sidebar) → **Tags** → Add the tag `Final`
  - A total of **four** runs should be tagged as `Final`

⚠️ **Important:** The `save_code` option must remain enabled. If a `Final` run does not include saved code, it will **not** be graded.

### 🧮 Grading Criteria

The score for each algorithm is provided in its respective section. This score is then multiplied by the environment score:
- `CartPole`: × 0.1
    - Minimum requirement: over 200 points across 5 consecutive evaluations
- `LunarLander`: × 0.7
    - Minimum requirement: over 200 points across 5 consecutive evaluations
- `MountainCar`: × 1.0
    - Minimum requirement: reach the goal state across 5 consecutive evaluations
- `FrozenLake`: × 1.2
    - Minimum requirement: reach the goal state across at least 5 evaluations of 15 consecutive evaluations
- `SeaQuest`: × 1.5
    -  Minimum requirement: over 2000 points across 5 consecutive evaluations

Total Score is 100. you can get up to 80 bonus score (180)

### 📝 Implementation Guide

- Implement the algorithms as subclasses of `BaseDQNAgent` provided in [`base_agent.py`](https://github.com/DeepRLCourse/Homework-10/blob/main/BootstrapDQN/src/bootstrapdqn/base_agent.py). You may add or override methods/properties as needed.
    - The `BaseDQNAgent` code will be automatically downloaded and imported. Ensure you review it carefully before implementing your algorithms.
- Code blocks or lines marked with `# DO NOT CHANGE` must remain unaltered in your final submission. You may modify them during development for debugging purposes, but revert them before submitting.
- If running locally, real-time W&B logging might face restrictions. Use W&B's offline mode for experiments and sync them later using the `wandb sync` command ([link](https://docs.wandb.ai/support/run_wandb_offline/)).
- Prioritize vector operations over loops for better performance. While algorithm descriptions might use loops for clarity, only the main training and rollout loops (implemented in `BaseDQNAgent`) should remain iterative. Failure to vectorize may significantly increase convergence time.
- For potentially more stable and faster training, you may consider using *Smooth L1 Loss* instead of Mean Squared Error. (Optional)
- Weight initialization significantly impacts performance. Orthogonal initialization is generally recommended in the RL community and might be worth trying.


### 💡 Tips & More

The following resource provides general advice for implementing and debugging RL algorithms (not required for this homework, but highly recommended):

- [Debugging RL, Without the Agonizing Pain](https://andyljones.com/posts/rl-debugging.html)

# 🧭 Exploration Techniques in DQN

## 🚀 Initialization


In [7]:
# DO NOT CHANGE THIS BLOCK
from bootstrapdqn import ReplayBuffer, BaseDQNAgent, get_machine, set_wandb_key_form_secrets, envs
import torch
from torch import nn
import wandb
import random
import gymnasium as gym
import ale_py

gym.register_envs(ale_py)

In [8]:
# DO NOT CHANGE THIS BLOCK
TA = True if WANDB_ID == "alireza9" else False
SAVE_CODE = False if TA else True

In [9]:
# DO NOT CHANGE THIS BLOCK
# IF YOU CHANGE ANYTHING ABOUT ENVIRONMENTS AND THEIR RUN CONFIGS, YOUR CODE WILL NOT BE GRADED
from pprint import pprint
ENVS = envs()
pprint(ENVS)

{'CartPole': {'env': {'env_config': {}, 'env_name': 'CartPole-v1', 'seed': 43},
              'run': {'max_episodes': 1000,
                      'max_steps': 50000,
                      'max_steps_per_episode': 100000,
                      'max_time': 720.0}},
 'FrozenLake': {'env': {'env_config': {'p': 0.87, 'size': 14},
                        'env_name': 'FrozenLake-v1',
                        'seed': 42},
                'run': {'max_episodes': 1000000,
                        'max_steps': 1000000,
                        'max_steps_per_episode': 100000,
                        'max_time': 14400}},
 'LunarLander': {'env': {'env_config': {},
                         'env_name': 'LunarLander-v3',
                         'seed': 43},
                 'run': {'max_episodes': 100000,
                         'max_steps': 200000,
                         'max_steps_per_episode': 100000,
                         'max_time': 7200}},
 'MountainCar': {'env': {'env_config': {},
         

In [10]:
if not DEBUG:
    set_wandb_key_form_secrets()

your machine is detected as Kaggle


## 💻 Algorithms Implementation

### Epsilon Greedy DQN

Consider the following implementation as a reference for implementing other algorithms.

You can also use it as a baseline for comparing the performance of subsequent algorithms.

In [11]:
class EpsGreedyDQNAgent(BaseDQNAgent):
    """
    Epsilon-greedy DQN agent.
    """

    def __init__(self, epsilon: float = 0.1, eps_decay: float = 0.999, eps_min: float = 0.01, **kwargs):
        super().__init__(**kwargs)
        self.epsilon = epsilon
        self.eps_decay = eps_decay
        self.eps_min = eps_min

    def _decay_eps(self):
        """
        Decay the epsilon value.
        """
        self.epsilon = max(self.epsilon * self.eps_decay, self.eps_min)

    def _create_replay_buffer(self, max_size=1000000):
        self.replay_buffer = ReplayBuffer(
            [
                ("state", (self.env.observation_space.shape[0],), torch.float32),
                ("action", (), torch.int64),
                ("reward", (), torch.float32),
                ("next_state", (self.env.observation_space.shape[0],), torch.float32),
                ("done", (), torch.float32),
            ],
            max_size=max_size,
            device=self.device,
        )

    def _create_network(self):
        self.q_network = nn.Sequential(
            nn.Linear(self.env.observation_space.shape[0], 256),
            nn.ReLU(),
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Linear(256, self.env.action_space.n),
        ).to(self.device)
        self.q_network.apply(
            lambda m: torch.nn.init.orthogonal_(m.weight, gain=torch.nn.init.calculate_gain("relu"))
            if isinstance(m, nn.Linear)
            else None
        )
        self.target_network = nn.Sequential(
            nn.Linear(self.env.observation_space.shape[0], 256),
            nn.ReLU(),
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Linear(256, self.env.action_space.n),
        ).to(self.device)

    def _compute_loss(self, batch):
        """
        Compute the loss for the DQN agent.
        """
        states = batch["state"]
        actions = batch["action"]
        rewards = batch["reward"]
        next_states = batch["next_state"]
        dones = batch["done"]

        q_values = self.q_network(states).gather(1, actions.unsqueeze(1)).squeeze()
        next_q_values = self.target_network(next_states).max(1)[0]
        expected_q_values = rewards + (1 - dones) * self.gamma * next_q_values

        loss = nn.SmoothL1Loss()(q_values, expected_q_values)
        return loss

    def _act_in_training(self, state):
        """
        Select an action during training.
        """
        self._decay_eps()
        if torch.rand(1).item() < self.epsilon:
            return self.env.action_space.sample()
        else:
            with torch.no_grad():
                q_values = self.q_network(torch.tensor(state, dtype=torch.float32).unsqueeze(0).to(self.device))
                return q_values.argmax().item()

    def _act_in_eval(self, state):
        """
        Select an action during evaluation.
        """
        with torch.no_grad():
            q_values = self.q_network(torch.tensor(state, dtype=torch.float32).unsqueeze(0).to(self.device))
            return q_values.argmax().item()

    def _wandb_train_step_dict(self):
        log_dict = super()._wandb_train_step_dict()
        log_dict["train_step/epsilon"] = self.epsilon
        return log_dict

    def _save_dict(self):
        save_dict = super()._save_dict()
        save_dict["epsilon"] = self.epsilon
        save_dict["eps_decay"] = self.eps_decay
        save_dict["eps_min"] = self.eps_min
        return save_dict


### Bootstrap DQN

> Paper: [Deep Exploration via Bootstrapped DQN](https://arxiv.org/abs/1602.04621)

**40 Points**

#### Details

In this algorithm, instead of using a single network, we maintain an ensemble of networks (or a single network with multiple heads). At the start of each training episode, we randomly select one of these networks (heads) and use it to choose actions for the entire episode. This strategy approximates Thompson Sampling for the K-armed Bandit problem, enabling deeper exploration by leveraging the diversity among the ensemble members.

#### Implementation

In [12]:
class MultiHeadQNet(nn.Module):
    """
    Multi-head Q-network for Bootstrap DQN.
    Shares the feature extractor and has k independent output heads.
    """
    def __init__(self, obs_dim: int, action_dim: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads

        # Shared feature extractor
        self.feature_extractor = nn.Sequential(
            nn.Linear(obs_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 256),
            nn.ReLU(),
        )

        # K independent output heads
        self.heads = nn.ModuleList([
            nn.Linear(256, action_dim) for _ in range(num_heads)
        ])

        self._initialize_weights()

    def _initialize_weights(self):
        """
        Initialize weights of the network.
        """
        for m in self.feature_extractor:
            if isinstance(m, nn.Linear):
                torch.nn.init.orthogonal_(m.weight, gain=torch.nn.init.calculate_gain("relu"))
                if m.bias is not None:
                    torch.nn.init.constant_(m.bias, 0)
        for head in self.heads:
            if isinstance(head, nn.Linear):
                torch.nn.init.orthogonal_(head.weight) # Default gain for linear
                if head.bias is not None:
                    torch.nn.init.constant_(head.bias, 0)

    def forward(self, x: torch.Tensor):
        """
        Forward pass through the network.
        Returns Q-values for all heads.
        Input x: (batch_size, obs_dim)
        Output: (batch_size, num_heads, action_dim)
        """
        features = self.feature_extractor(x)
        q_values_per_head = []
        for head in self.heads:
            q_values_per_head.append(head(features))
        # Stack the results: list of (batch_size, action_dim) -> (batch_size, num_heads, action_dim)
        return torch.stack(q_values_per_head, dim=1)


class BootstrapDQNAgent(EpsGreedyDQNAgent):
    """
    Bootstrap DQN agent.
    """

    def __init__(self, k: int = 10, bernoulli_p: float = 0.5, **kwargs):
        self.k = k
        super().__init__(**kwargs)
        self.bernoulli_p = bernoulli_p
        self.bernoulli_dist = torch.distributions.Bernoulli(probs=torch.tensor([self.bernoulli_p]).to(self.device))
        self.current_head = 0 # This will be sampled at the start of each episode

    def _create_network(self):
        self.q_network = MultiHeadQNet(
            self.env.observation_space.shape[0],
            self.env.action_space.n,
            self.k
        ).to(self.device)

        self.target_network = MultiHeadQNet(
            self.env.observation_space.shape[0],
            self.env.action_space.n,
            self.k
        ).to(self.device)
        self.target_network.load_state_dict(self.q_network.state_dict()) # Initialize target network

    def _create_replay_buffer(self, max_size=1000000):
        # Add 'mask' to the replay buffer schema
        self.replay_buffer = ReplayBuffer(
            [
                ("state", (self.env.observation_space.shape[0],), torch.float32),
                ("action", (), torch.int64),
                ("reward", (), torch.float32),
                ("next_state", (self.env.observation_space.shape[0],), torch.float32),
                ("done", (), torch.float32),
                ("mask", (self.k,), torch.float32), # Mask for each head
            ],
            max_size=max_size,
            device=self.device,
        )

    def _preprocess_add(self, state, action, reward, next_state, done):
        """
        Generates k Bernoulli masks and converts all transition data
        to torch tensors, moving them to the agent's device.
        """
        mask = self.bernoulli_dist.sample((self.k,)).squeeze(-1).to(self.device) # Shape (k,)

        state_tensor = torch.tensor(state, dtype=torch.float32, device=self.device)
        action_tensor = torch.tensor(action, dtype=torch.int64, device=self.device)
        reward_tensor = torch.tensor(reward, dtype=torch.float32, device=self.device)
        next_state_tensor = torch.tensor(next_state, dtype=torch.float32, device=self.device)
        # 'done' needs to be float32 for (1 - dones) multiplication in loss calculation
        done_tensor = torch.tensor(done, dtype=torch.float32, device=self.device)

        return dict(state=state_tensor, action=action_tensor, reward=reward_tensor,
                    next_state=next_state_tensor, done=done_tensor, mask=mask)


    def _compute_loss(self, batch):
        states = batch["state"]
        actions = batch["action"]
        rewards = batch["reward"]
        next_states = batch["next_state"]
        dones = batch["done"]
        masks = batch["mask"] # Shape (batch_size, k)

        # Q-values from the main Q-network (batch_size, k, action_dim)
        all_q_values = self.q_network(states)
        # Target Q-values from the target network (batch_size, k, action_dim)
        with torch.no_grad():
            all_target_q_values = self.target_network(next_states)
            max_next_q_values = all_target_q_values.max(dim=2)[0] # Max Q-value for each head (batch_size, k)
            expected_q_values_per_head = rewards.unsqueeze(1) + (1 - dones.unsqueeze(1)) * self.gamma * max_next_q_values # (batch_size, k)

        total_loss = 0.0
        # Iterate over each head to compute masked loss
        for i in range(self.k):
            # Select Q-values for the action taken for the current head i
            q_values_for_head_i = all_q_values[:, i, :].gather(1, actions.unsqueeze(1)).squeeze() # (batch_size,)
            
            # Select the corresponding expected Q-values for head i
            expected_q_values_for_head_i = expected_q_values_per_head[:, i] # (batch_size,)
            
            # Compute SmoothL1Loss for this head (reduction='none' to apply mask)
            loss_for_head_i = nn.SmoothL1Loss(reduction='none')(q_values_for_head_i, expected_q_values_for_head_i)
            
            # Apply the mask for head i and sum up the masked loss
            # Only transitions where mask[i] is 1 contribute to this head's loss
            masked_loss_for_head_i = loss_for_head_i * masks[:, i]
            
            # Add the mean of the masked loss for this head to total loss
            # We take mean of masked loss to avoid issues with varying number of active samples.
            # If all masks are 0, mean will be 0.
            if masked_loss_for_head_i.sum() > 0: # Avoid division by zero if all masks are zero for this head
                total_loss += masked_loss_for_head_i.sum() / masks[:, i].sum()
            
        # Average the total loss across all heads
        return total_loss / self.k

    def _episode(self):
        super()._episode()
        # Sample the current head to be used for action selection in this episode
        self.current_head = random.randrange(self.k)

    def _act_in_training(self, state):
        self._decay_eps()
        if torch.rand(1).item() < self.epsilon:
            return self.env.action_space.sample()
        else:
            with torch.no_grad():
                # Use the current head for action selection during training
                state_tensor = torch.tensor(state, dtype=torch.float32).unsqueeze(0).to(self.device)
                # Q-values for the current head (1, action_dim)
                q_values_current_head = self.q_network(state_tensor)[:, self.current_head, :]
                return q_values_current_head.argmax().item()

    def _act_in_eval(self, state):
        with torch.no_grad():
            state_tensor = torch.tensor(state, dtype=torch.float32).unsqueeze(0).to(self.device)
            # In evaluation, average Q-values across all heads for a more robust estimate
            all_q_values = self.q_network(state_tensor) # (1, k, action_dim)
            mean_q_values = all_q_values.mean(dim=1).squeeze(0) # (action_dim,)
            return mean_q_values.argmax().item()

    def _wandb_train_episode_dict(self):
        log_dict = super()._wandb_train_episode_dict()
        log_dict["train_episode/current_head"] = self.current_head
        return log_dict

    def _save_dict(self):
        save_dict = super()._save_dict()
        save_dict["k"] = self.k
        save_dict["bernoulli_p"] = self.bernoulli_p
        # No need to save self.bernoulli_dist or self.current_head as they are re-initialized/sampled
        return save_dict

### Bootstrap DQN with Randomized Prior Function

> Paper: [Randomized Prior Functions for Deep Reinforcement Learning](https://arxiv.org/abs/1806.03335)

**25 Points**

#### Details

This method is very similar to Bootstrap DQN, but introduces additional **non-trainable** networks (with multiple heads) called random priors. These priors are added to the Q-network outputs to encourage diversity among ensemble members, both across states and over time. During training, the Q-networks learn to compensate for the effect of these fixed random priors, which helps maintain exploration.

##### Notes
- Random prior networks are typically smaller (narrower and shallower) than the main Q-networks, so the Q-networks tend to distill their influence during training.
- There is a $\delta_\mathrm{RPF}$ coefficient to control the strength of the random priors, but for simplicity, you can set $\delta_\mathrm{RPF}=1$ and omit tuning this hyperparameter.

#### Implementation

In [13]:
class PriorMultiHeadQNet(MultiHeadQNet):
    """
    A shallower multi-head Q-network for the prior in RPF-Bootstrap DQN.
    This network's weights are fixed after initialization and are not trained.
    """
    def __init__(self, obs_dim: int, action_dim: int, num_heads: int):
        # We don't call super().__init__ directly for the feature_extractor
        # and heads, as we want to define our own shallower versions.
        # However, we inherit from MultiHeadQNet to reuse its structure
        # and _initialize_weights method.
        nn.Module.__init__(self) # Call the base nn.Module constructor

        self.num_heads = num_heads

        # Shared feature extractor - made shallower
        self.feature_extractor = nn.Sequential(
            nn.Linear(obs_dim, 128), # Smaller first layer
            nn.ReLU(),
            nn.Linear(128, 128),     # Smaller second layer
            nn.ReLU(),
        )

        # K independent output heads - mapping from the new feature dimension
        self.heads = nn.ModuleList([
            nn.Linear(128, action_dim) for _ in range(num_heads)
        ])

        self._initialize_weights() # Re-initialize weights for the new architecture

    def _initialize_weights(self):
        """
        Initialize weights specifically for the shallower prior network.
        Uses orthogonal initialization for linear layers.
        """
        for m in self.feature_extractor:
            if isinstance(m, nn.Linear):
                torch.nn.init.orthogonal_(m.weight, gain=torch.nn.init.calculate_gain("relu"))
                if m.bias is not None:
                    torch.nn.init.constant_(m.bias, 0)
        for head in self.heads:
            if isinstance(head, nn.Linear):
                torch.nn.init.orthogonal_(head.weight) # Default gain for linear
                if head.bias is not None:
                    torch.nn.init.constant_(head.bias, 0)

    # The forward method from MultiHeadQNet is directly applicable as the structure is similar.
    # def forward(self, x: torch.Tensor):
    #     features = self.feature_extractor(x)
    #     q_values_per_head = []
    #     for head in self.heads:
    #         q_values_per_head.append(head(features))
    #     return torch.stack(q_values_per_head, dim=1)


class RPFBootstrapDQNAgent(BootstrapDQNAgent):
    """
    Random Prior Functions (RPF) Bootstrap DQN agent.
    Incorporates a fixed, randomly initialized prior network.
    """

    def _create_network(self):
        """
        Creates the main Q-network, target network, and the prior network.
        The prior network is fixed and not trained.
        """
        super()._create_network() # Creates self.q_network and self.target_network (MultiHeadQNet)

        # Create the prior network
        self.prior_network = PriorMultiHeadQNet(
            self.env.observation_space.shape[0],
            self.env.action_space.n,
            self.k
        ).to(self.device)

        # Freeze the prior network's parameters; they should not be trained
        for param in self.prior_network.parameters():
            param.requires_grad = False
        self.prior_network.eval() # Set prior network to evaluation mode

    def _act_in_training(self, state):
        """
        Select an action during training using epsilon-greedy strategy.
        When exploiting, actions are chosen based on Q(s,a) + P(s,a) for the current head.
        """
        self._decay_eps() # Decay epsilon as per EpsGreedyDQNAgent

        if torch.rand(1).item() < self.epsilon:
            return self.env.action_space.sample() # Explore
        else:
            with torch.no_grad(): # No gradients needed for action selection
                state_tensor = torch.tensor(state, dtype=torch.float32).unsqueeze(0).to(self.device)

                # Get Q-values from the main Q-network for the current head
                q_values_current_head = self.q_network(state_tensor)[:, self.current_head, :] # (1, action_dim)

                # Get Q-values from the prior network for the current head
                prior_q_values_current_head = self.prior_network(state_tensor)[:, self.current_head, :] # (1, action_dim)

                # Combine Q-values: Q_effective(s,a) = Q(s,a) + P(s,a)
                combined_q_values = q_values_current_head + prior_q_values_current_head

                return combined_q_values.argmax().item() # Exploit

    def _act_in_eval(self, state):
        """
        Select an action during evaluation.
        Actions are chosen by averaging Q(s,a) + P(s,a) across all heads.
        """
        with torch.no_grad(): # No gradients needed for evaluation
            state_tensor = torch.tensor(state, dtype=torch.float32).unsqueeze(0).to(self.device)

            # Get Q-values from the main Q-network (1, k, action_dim)
            all_q_values = self.q_network(state_tensor)

            # Average combined Q-values across all heads for robust evaluation
            mean_combined_q_values = all_q_values.mean(dim=1).squeeze(0) # (action_dim,)

            return mean_combined_q_values.argmax().item()

    def _compute_loss(self, batch):
        """
        Compute the loss for the RPF-Bootstrap DQN agent.
        The Bellman target incorporates the prior network's values.
        L = SmoothL1Loss( (Q_main(s,a) + P(s,a)), (R + gamma * max_a' (Q_target(s',a') + P(s',a'))) )
        """
        states = batch["state"]
        actions = batch["action"]
        rewards = batch["reward"]
        next_states = batch["next_state"]
        dones = batch["done"]
        masks = batch["mask"] # Shape (batch_size, k)

        # 1. Compute current Q-values with prior: Q_main(s,a) + P(s,a)
        # Q-values from the main Q-network (batch_size, k, action_dim)
        
        q_values = self.q_network(states).gather(2, actions.view(-1, 1, 1).expand(-1, self.k, -1)).squeeze()
    
        
        with torch.no_grad(): # Target network and prior network are fixed for target computation
            
            
            all_target_q_values = self.target_network(next_states)
            
            
            all_prior_q_values_next = self.prior_network(next_states)

            
            target_q_values_per_head_with_prior = all_target_q_values + all_prior_q_values_next # (batch_size, k, action_dim)

            
            max_next_q_values_action = target_q_values_per_head_with_prior.argmax(dim=2, keepdim=True) # (batch_size, k)
            final_target_q_values = self.target_network(next_states).gather(2, max_next_q_values_action).squeeze()
            final_prior_q_values = self.prior_network(next_states).gather(2, max_next_q_values_action).squeeze()

            max_next_q_values_with_prior = final_target_q_values + final_prior_q_values
            
            expected_q_values_per_head = rewards.unsqueeze(1) + \
                                         (1 - dones.unsqueeze(1)) * self.gamma * max_next_q_values_with_prior # (batch_size, k)

        loss_per_head = nn.SmoothL1Loss(reduction='none')(q_values, expected_q_values_per_head)
        return (loss_per_head * masks).sum() / masks.sum()
        

### Uncertainty Estimation for Sample Efficient RPF Bootstrap DQN

> Paper: [Sample Efficient Deep Reinforcement Learning via Uncertainty Estimation](https://arxiv.org/abs/2201.01666)

**35 Points**

#### Details

This method does not explicitly address the exploration problem but focuses on improving sample efficiency.

By maintaining an ensemble of Q-networks, multiple Q-values can be computed for each state-action pair. This enables the estimation of uncertainty in the target values. Using this uncertainty, a weighted loss is calculated, where the weights are inversely proportional to the uncertainty. The more confident we are about a target, the higher its weight during the update.

#### Bonus

1. Use the minimum Effective Batch Size (EBS) as a hyperparameter instead of $\xi$, and numerically calculate $\xi$ during each training step based on the minimum EBS. (5 points)
2. Implement the complete IV-DQN algorithm as described in the appendix of the original paper. (15 points)

#### Implementation

In [14]:
import torch
from torch import nn
import torch.nn.functional as F

# Assuming PriorMultiHeadQNet and RPFBootstrapDQNAgent are defined.
# For this code to run independently, you'd need their definitions.
# Let's assume RPFBootstrapDQNAgent provides self.q_network, self.target_network,
# self.prior_network, self.k, self.gamma, self.delta_rpf (which we'll add to __init__
# if it's not already there in RPFBootstrapDQNAgent), and super().__init__() works.

class UEBootstrapDQNAgent(RPFBootstrapDQNAgent):
    """
    Uncertainty Estimation (UE) Bootstrap DQN agent, modified to act like Code 1.
    Weights the loss based on the estimated uncertainty of target values,
    with head-specific variance calculation and prior scaling.
    """

    def __init__(self, xi: float = 2.0, **kwargs):
        """
        Args:
            xi (float): Target minimum Effective Batch Size (1.0 to K).
        """

        super().__init__(**kwargs)
        self.xi = xi
        self.latest_min_ebs = float(self.k)

    def _compute_loss(self, batch: dict) -> torch.Tensor:
        states = batch["state"]
        actions = batch["action"]
        rewards = batch["reward"]
        next_states = batch["next_state"]
        dones = batch["done"]
        masks = batch["mask"] # Shape (batch_size, k)

        B = states.size(0)
        k = self.k
        gamma = self.gamma
        eps = 1e-6

        
        q_main_all = self.q_network(states) # (batch_size, k, action_dim)
        p_all = self.prior_network(states)   # (batch_size, k, action_dim)
        
        combined_current = q_main_all + p_all

        a_expanded = actions.unsqueeze(1).expand(-1, k).unsqueeze(2)
        q_taken = combined_current.gather(dim=2, index=a_expanded).squeeze(2) # (batch_size, k)

        with torch.no_grad():

            q_next_target = self.target_network(next_states)
            q_next_prior = self.prior_network(next_states)
            q_next = q_next_target + q_next_prior # (batch_size, k, action_dim)

            max_combined_next, _ = torch.max(q_next, dim=2)

            variances = torch.var(max_combined_next, dim=1)

            variances_detach = variances.detach()
            
            low, high = 0.0, 10.0
    
            for _ in range(30):
                mid = (low + high) / 2.0
                weights = 1.0 / (variances_detach + mid)
                curr_ebs = torch.sum(weights)**2 / torch.sum(weights**2)
                if curr_ebs <= self.xi:
                    high = mid
                else:
                    low = mid
            
            weights = 1.0 / (variances + (low + high) / 2.0)
            self.latest_min_ebs = (weights.sum() ** 2 / weights.pow(2).sum()).item()

            best_actions = q_next.argmax(dim=2, keepdim=True)  # (B, K, 1)

            q_target_selected = self.target_network(next_states).gather(2, best_actions).squeeze()
            q_prior_selected = self.prior_network(next_states).gather(2, best_actions).squeeze()
            q_combined = q_target_selected + q_prior_selected  # (B, K)

            expected_q = rewards.unsqueeze(1) + (1 - dones.unsqueeze(1)) * self.gamma * q_combined

        q_current = self.q_network(states).gather(2, actions.view(-1, 1, 1).expand(-1, self.k, -1)).squeeze()
        raw_loss = nn.SmoothL1Loss(reduction='none')(q_current, expected_q)  # (B, K)

        weighted_loss = raw_loss * weights.unsqueeze(1)  # (B, K)
        masked_loss = weighted_loss * masks

        return masked_loss.sum() / masks.sum()

    def _save_dict(self) -> dict:
        """
        Add 'xi'parameter to the saved dictionary.
        """
        save_dict = super()._save_dict()
        save_dict["xi"] = self.xi
        return save_dict

    def _wandb_train_step_dict(self) -> dict:
        """
        """
        log_dict = super()._wandb_train_step_dict()
        log_dict["train_step/latest_min_ebs"] = self.latest_min_ebs
        return log_dict

## ⚙️ Configs

Feel free to change hyperparameters

In [15]:
env = ["FrozenLake", "CartPole", "MountainCar", "SeaQuest", "LunarLander"][2]
print(f"{env} is selected.")

base_agent_config = {
    **ENVS[env]["env"],
    "default_batch_size": 128,
    "gamma": 0.995,
    "learning_rate": 3e-4,
    "replay_buffer_capacity":200_000,
    "tau": 5e-3,
    "device": "cuda" if torch.cuda.is_available() else "cpu",
    "gradient_norm_clip": 10.0,
    "start_training_after": 10000,
    "normalize_rewards": False,
    "scale_rewards": None
}

base_run_config = {
    **ENVS[env]["run"],
    "learn_every": 1,  # Apply learning every n steps of rollout
    "eval_every": 10_000,  # Evaluate model approximately every n steps
}

MountainCar is selected.


In [16]:

eps_greedy_config = {
    **base_agent_config,
    "eps_decay": 0.99996,
    "eps_min": 0.01,
    "epsilon": 1.0,
}

In [17]:
bootstrap_dqn_config = {
    **eps_greedy_config,
    "k": 10,
    "bernoulli_p": 0.5
}

In [18]:
rpf_bootstrap_dqn_config = {
    **bootstrap_dqn_config,
}

In [19]:
ue_bootstrap_dqn_config = {
    **rpf_bootstrap_dqn_config,
    "xi": 0.2
}

## 🔄 Training

The `try-except` block allows you to terminate the current algorithm's run directly from the W&B panel and proceed to the next algorithm without crashing the entire notebook. This can be particularly useful when using Kaggle's *Save Version* feature.

### Epsilon Greedy DQN

In [20]:
# # DON'T CHANGE THIS BLOCK
# wandb_config = {
#     "project": PROJECT_NAME,
#     "name": f"eps_greedy {env}",
#     "config": {**eps_greedy_config, **base_run_config, "machine": get_machine()},
#     "save_code": SAVE_CODE,
#     "tags": ["dqn", "eps_greedy"],
# }

# if DEBUG:
#     wandb_run = None
# else:
#     wandb_run = wandb.init(**wandb_config)

# eps_greedy_dqn_agent = EpsGreedyDQNAgent(wandb_run=wandb_run, **eps_greedy_config)

In [21]:
# # DON'T CHANGE THIS BLOCK
# try:
#     eps_greedy_dqn_agent.train(**base_run_config)
#     wandb_run.finish()
# except KeyboardInterrupt:
#     pass

### Bootstrap DQN

In [22]:
# # DON'T CHANGE THIS BLOCK
# wandb_config = {
#     "project": PROJECT_NAME,
#     "name": f"bootstrap {env}",
#     "save_code": SAVE_CODE,
#     "tags": ["dqn", "bootstrap"],
# }

# wandb_config["config"] = {} if TA else {**bootstrap_dqn_config, **base_run_config, "machine": get_machine()}
# DEBUG = False
# if DEBUG:
#     wandb_run = None
# else:
#     wandb_run = wandb.init(**wandb_config)

# bootstrap_dqn_agent = BootstrapDQNAgent(wandb_run=wandb_run, **bootstrap_dqn_config)

In [23]:
# # DON'T CHANGE THIS BLOCK
# try:
#     bootstrap_dqn_agent.train(**base_run_config)
#     wandb_run.finish()
# except KeyboardInterrupt:
#     pass

### Bootstrap DQN with Randomized Prior Function

In [24]:
# DON'T CHANGE THIS BLOCK
wandb_config = {
    "project": PROJECT_NAME,
    "name": f"randomized_prior {env}",
    "save_code": SAVE_CODE,
    "tags": ["dqn", "rpf_bootstrap"],
}

wandb_config["config"] = {} if TA else {**rpf_bootstrap_dqn_config, **base_run_config, "machine": get_machine()}

if DEBUG:
    wandb_run = None
else:
    wandb_run = wandb.init(**wandb_config)

rpf_bootstrap_dqn_agent = RPFBootstrapDQNAgent(wandb_run=wandb_run, **rpf_bootstrap_dqn_config)

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Currently logged in as: [33mmohamadahmadpour1383[0m to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


In [25]:
# DON'T CHANGE THIS BLOCK
try:
    rpf_bootstrap_dqn_agent.train(**base_run_config)
    wandb_run.finish()
except KeyboardInterrupt:
    pass

### Uncertainty Estimation for Sample Efficient RPF Bootstrap DQN

  torch.tensor(state, device=self.device),
  state_tensor = torch.tensor(state, dtype=torch.float32).unsqueeze(0).to(self.device)
error: XDG_RUNTIME_DIR not set in the environment.
  state_tensor = torch.tensor(state, dtype=torch.float32).unsqueeze(0).to(self.device)
  torch.tensor(state, device=self.device),
  state_tensor = torch.tensor(state, dtype=torch.float32).unsqueeze(0).to(self.device)
  state_tensor = torch.tensor(state, dtype=torch.float32).unsqueeze(0).to(self.device)
  torch.tensor(state, device=self.device),
  state_tensor = torch.tensor(state, dtype=torch.float32).unsqueeze(0).to(self.device)
  state_tensor = torch.tensor(state, dtype=torch.float32).unsqueeze(0).to(self.device)
  torch.tensor(state, device=self.device),
  state_tensor = torch.tensor(state, dtype=torch.float32).unsqueeze(0).to(self.device)
  state_tensor = torch.tensor(state, dtype=torch.float32).unsqueeze(0).to(self.device)
  torch.tensor(state, device=self.device),
  state_tensor = torch.tensor(state, d

Trained for 300000 steps.


0,1
eval_episode/episode_length,███▁▅▆██▇███▇██▄█▆██▆█▄▃██████
eval_episode/mean_reward,▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
eval_episode/sum_reward,▁▁▁█▄▃▁▁▂▁▁▁▂▁▁▅▁▃▁▁▃▁▅▆▁▁▁▁▁▁
train_episode/current_head,▇▃▃▄█▆▅▃▆▁▆▇▁▅▃▂▃▆▆▆█▁▄▅▃█▁▄▇█▁▂▅█▇█▇█▂▆
train_episode/episode_length,██████▅█▅▄▄██▅▇█▆██████▂▅██▁█▁█▇█▁█▂▂▂▂▂
train_episode/mean_loss,▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▃▄▄▄▄▄▄▅▅▅▅▅█
train_episode/mean_return,▁▁▁▁▁▅██▇▇▇▇▇▇▆▅▅▅▅▆▆▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇████
train_episode/mean_reward,▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
train_episode/sum_loss,▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▃▃▃▃▃▃▃▃▄▄▅▄▄▅▇▇▇▇█
train_episode/sum_reward,▁▁▁▄▁▆█▁▁▂▃▂▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▆▂▁▁▂▁▁▁▁▆▆▆▁

0,1
eval_episode/episode_length,200.0
eval_episode/mean_reward,-1.0
eval_episode/sum_reward,-200.0
train_episode/current_head,4.0
train_episode/episode_length,119.0
train_episode/mean_loss,2021.63389
train_episode/mean_return,-119.93597
train_episode/mean_reward,-1.0
train_episode/sum_loss,240574.43243
train_episode/sum_reward,-119.0


In [28]:

# DON'T CHANGE THIS BLOCK
wandb_config = {
    "project": PROJECT_NAME,
    "name": f"uncertainty_estimation {env}",
    "save_code": SAVE_CODE,
    "tags": ["dqn", "ue_bootstrap"],
}

wandb_config["config"] = {} if TA else {**ue_bootstrap_dqn_config, **base_run_config, "machine": get_machine()}

if DEBUG:
    wandb_run = None
else:
    wandb_run = wandb.init(**wandb_config)

ue_bootstrap_dqn_agent = UEBootstrapDQNAgent(wandb_run=wandb_run, **ue_bootstrap_dqn_config)

In [29]:
# DON'T CHANGE THIS BLOCK
ue_bootstrap_dqn_agent.train(**base_run_config)
wandb_run.finish()

  torch.tensor(state, device=self.device),
  state_tensor = torch.tensor(state, dtype=torch.float32).unsqueeze(0).to(self.device)
  state_tensor = torch.tensor(state, dtype=torch.float32).unsqueeze(0).to(self.device)
  torch.tensor(state, device=self.device),
  state_tensor = torch.tensor(state, dtype=torch.float32).unsqueeze(0).to(self.device)
  state_tensor = torch.tensor(state, dtype=torch.float32).unsqueeze(0).to(self.device)
  torch.tensor(state, device=self.device),
  state_tensor = torch.tensor(state, dtype=torch.float32).unsqueeze(0).to(self.device)
  state_tensor = torch.tensor(state, dtype=torch.float32).unsqueeze(0).to(self.device)
  torch.tensor(state, device=self.device),
  state_tensor = torch.tensor(state, dtype=torch.float32).unsqueeze(0).to(self.device)
  state_tensor = torch.tensor(state, dtype=torch.float32).unsqueeze(0).to(self.device)
  torch.tensor(state, device=self.device),
  state_tensor = torch.tensor(state, dtype=torch.float32).unsqueeze(0).to(self.device)
  

Trained for 300000 steps.


0,1
eval_episode/episode_length,███▂▆▂▃▄▁▅▆▆▁▁▁▂▂▅▅▅▅▆▅▁▆▇▆▁▅▆
eval_episode/mean_reward,▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
eval_episode/sum_reward,▁▁▁▇▃▇▆▅█▄▃▃███▇▇▄▄▄▄▃▄█▃▂▃█▄▃
train_episode/current_head,▅▅▄▆▅▂█▅▅▇▅▄▄▅▃▁▅▃▄▁▄▇█▂▇▃▅▇█▅▄▄▂▁▁▄▆▅█▁
train_episode/episode_length,█████▄▅▅▄▅▆▅▁▂▅▅▅▁▅▁▅▅▂▅▂▁▆▁▁▅▆▅▅▆▂▁▅▁▅▂
train_episode/mean_loss,▁▁▅▅▄▅█▇▄▅▃▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
train_episode/mean_return,▁▁▁▁▂▃▃▃▄▄▅▅▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇█████████████
train_episode/mean_reward,▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
train_episode/sum_loss,▁▂▆▇▇▆█▆▅▅▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
train_episode/sum_reward,▁▁▁▃▃▃▄▁▄▄▁▁▃█▄▄▆▃▄███▄▄▄▄▄▃▄▄▃▃▄▃▄█▄▄▇█

0,1
eval_episode/episode_length,162.0
eval_episode/mean_reward,-1.0
eval_episode/sum_reward,-162.0
train_episode/current_head,3.0
train_episode/episode_length,20.0
train_episode/mean_loss,2e-05
train_episode/mean_return,-100.13332
train_episode/mean_reward,-1.0
train_episode/sum_loss,0.00033
train_episode/sum_reward,-20.0
