## **Tailoring QR-DQN to Seaquest**


This report analyzes how QR-DQN (Quantile Regression DQN) can be tailored to the Atari game Seaquest. The goal is to evaluate how well the algorithm performs in this environment and how specific changes to the training setup influence its performance.

## 1. Seaquest

### 1.1 Seaquest (Atari 2600)

Seaquest is an Atari 2600 game in which a player controls a submarine. The main goal is to rescue divers and destroy enemies. The player must manage limited oxygen, which makes regular surfacing important.

The submarine can move in all directions and shoot torpedoes. The player has a limited number of lives. Oxygen continuously decreases while the submarine is underwater and is shown by an on-screen indicator. To refill oxygen, the player must surface.

If the oxygen level reaches zero, the player loses one life. When surfacing with at least one rescued diver, exactly one diver is safely delivered. Surfacing without carrying a diver results in the loss of one life. This creates a risk–reward trade-off between staying underwater longer and surfacing in time.

The scoring system rewards both rescuing divers and destroying enemies. Destroying an enemy grants 20 points. After rescuing six divers, each diver is worth 50 points, and an additional oxygen bonus is awarded. After every set of six rescued divers, the game difficulty increases: the points per diver increase by 50, up to a maximum of 1000 points, and the points for destroying enemies increase by 10, up to a maximum of 90 points.
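The point progression described above can be written out as a small sketch. It follows the rules as stated in this report and has not been checked against the game itself; the function names and the `completed_sets` argument are chosen here for illustration only.

```python
def points_per_enemy(completed_sets: int) -> int:
    """Points for one destroyed enemy after `completed_sets` full sets of six divers."""
    # Starts at 20, grows by 10 per completed set, capped at 90 (as stated in the text).
    return min(20 + 10 * completed_sets, 90)


def points_per_diver(completed_sets: int) -> int:
    """Points for one delivered diver after `completed_sets` full sets of six divers."""
    # Starts at 50, grows by 50 per completed set, capped at 1000 (as stated in the text).
    return min(50 + 50 * completed_sets, 1000)


for sets in (0, 1, 7, 19):
    print(sets, points_per_enemy(sets), points_per_diver(sets))
```

Under this reading, enemy points reach their cap after 7 completed sets, while diver points only reach theirs after 19.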

There are two main enemy types. Sharks actively follow the divers, and colliding with them causes the player to lose one life. Enemy submarines move horizontally, often near the surface, and fire torpedoes in their direction of movement. Being hit by a torpedo or colliding with an enemy submarine also results in the loss of one life.

### 1.2 Seaquest Environment (Gymnasium)

The Seaquest environment is part of the Atari environments provided by Gymnasium. In this report, the version  
`gymnasium.make("ALE/Seaquest-v4")` is used.

![Seaquest gameplay](seaques.gif)

#### Observation Space

By default, Seaquest uses an RGB image as observation.

- **Observation Space:** `Box(0, 255, (210, 160, 3), uint8)`

This represents the raw game screen with a resolution of 210 × 160 pixels and three color channels.  
Alternative observation types such as RAM `(128,)` and grayscale `(210, 160)` are also available and can be used for comparison.

#### Action Space

Seaquest uses a discrete action space:

- **Action Space:** `Discrete(18)`

Apart from NOOP, each action is a movement direction, firing a torpedo, or a combination of both.  

#### Table 1: Discrete Action Space of Seaquest

| Action ID | Meaning              |
|----------:|----------------------|
| 0         | NOOP                 |
| 1         | FIRE                 |
| 2         | UP                   |
| 3         | RIGHT                |
| 4         | LEFT                 |
| 5         | DOWN                 |
| 6         | UPRIGHT              |
| 7         | UPLEFT               |
| 8         | DOWNRIGHT            |
| 9         | DOWNLEFT             |
| 10        | UP + FIRE            |
| 11        | RIGHT + FIRE         |
| 12        | LEFT + FIRE          |
| 13        | DOWN + FIRE          |
| 14        | UPRIGHT + FIRE       |
| 15        | UPLEFT + FIRE        |
| 16        | DOWNRIGHT + FIRE     |
| 17        | DOWNLEFT + FIRE      |


#### Reward System

The reward signal is directly based on the in-game score. The standard Seaquest reward system is used without modification.  
Points are awarded for destroying enemies, rescuing divers, and surfacing with remaining oxygen, as described in Section 1.1.

## 2. Quantile Regression Deep Q-Network (QR-DQN)

In this section, the learning method used in this project is explained. First, the basic idea of Deep Q-Learning is introduced. After that, distributional reinforcement learning and finally QR-DQN are described.

### 2.1 Deep Q-Learning

Q-Learning is a value-based reinforcement learning method. It learns a function  
$Q(s, a)$, which tells how good it is to take action $a$ in state $s$.  
The goal is to choose actions that lead to high future rewards.

The optimal Q-function follows the Bellman equation:

$$
Q^*(s,a) = \mathbb{E}\left[r + \gamma \max_{a'} Q^*(s', a') \right]
$$


In Deep Q-Networks (DQN), the Q-function is approximated with a neural network.  
This is needed because for environments like Atari games the state space is too large to store all Q-values in a table.

The network is trained by minimizing the difference between predicted Q-values and target values:

$$
L(\theta) = \mathbb{E}\left[\left( y - Q_\theta(s,a) \right)^2 \right]
$$
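The target $y$ in this loss is not defined above. In the standard DQN formulation it is the one-step bootstrapped return, computed with a separate target network. The symbol $\theta^-$ for the target network parameters is introduced here only for this equation:

$$
y = r + \gamma \max_{a'} Q_{\theta^-}(s', a')
$$

For terminal transitions, the bootstrap term is dropped and $y = r$.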

To make training more stable, techniques like replay buffers, target networks and exploration strategies are used.

### 2.2 Distributional Reinforcement Learning

In standard DQN, the Q-value represents only the expected return.  
However, future rewards are random and can vary a lot.

The return can be written as:

$$
Z(s,a) = \sum_{t=0}^{\infty} \gamma^t r_t
$$

Distributional reinforcement learning tries to learn the full return distribution $Z(s,a)$ instead of only its expected value:

$$
Q(s,a) = \mathbb{E}[Z(s,a)]
$$

This gives the agent more information about possible outcomes and uncertainty.  
In practice, this often makes learning more stable.
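For reference, the return distribution satisfies a Bellman equation analogous to the one for $Q$ in Section 2.1. In the usual notation of distributional reinforcement learning, where $\overset{D}{=}$ means equality in distribution and $s'$, $a'$ are the random next state and next action, it can be written as:

$$
Z(s,a) \overset{D}{=} r + \gamma Z(s', a')
$$

Taking the expectation on both sides recovers the standard Bellman equation for $Q(s,a)$.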

### 2.3 Quantile Regression DQN (QR-DQN)

QR-DQN is an extension of DQN that models the return distribution using quantiles.  
Instead of predicting one Q-value per action, the network predicts multiple values.

For each state-action pair, the network outputs $N$ quantiles:

$$
Z_\theta(s,a) = \{ \theta_1(s,a), \dots, \theta_N(s,a) \}
$$

Each quantile corresponds to a probability level:

$$
\tau_i = \frac{i - 0.5}{N}, \quad i = 1, \dots, N
$$
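The following minimal sketch shows how the quantile levels are computed and how a Q-value and a greedy action can be derived from the predicted quantiles. Since every quantile carries the same weight $1/N$, the Q-value is simply their mean. The function names and the example numbers are illustrative and not taken from the training code.

```python
def quantile_midpoints(n: int) -> list[float]:
    """Return tau_i = (i - 0.5) / n for i = 1..n."""
    return [(i - 0.5) / n for i in range(1, n + 1)]


def q_value(quantiles: list[float]) -> float:
    """Expected return: the mean of the equally weighted quantile values."""
    return sum(quantiles) / len(quantiles)


def greedy_action(quantiles_per_action: list[list[float]]) -> int:
    """Pick the action whose quantile values have the highest mean."""
    q_values = [q_value(z) for z in quantiles_per_action]
    return max(range(len(q_values)), key=q_values.__getitem__)


print(quantile_midpoints(4))                        # [0.125, 0.375, 0.625, 0.875]
print(greedy_action([[0.0, 10.0], [4.0, 8.0]]))     # 1 (mean 6.0 beats mean 5.0)
```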

Training uses quantile regression: the network minimizes the quantile Huber loss between its predicted quantiles and the target quantiles.  
Compared to standard DQN, QR-DQN keeps the same general training setup but tends to learn somewhat more stably and can achieve better performance, especially in more complex environments.
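To make the loss function more concrete, here is a dependency-free sketch of the quantile Huber loss for a single state-action pair. It follows the standard QR-DQN formulation (sum over predicted quantiles, mean over target samples, threshold `kappa`). The implementation actually used for training may differ in how it reduces over quantiles and batches, so this is an illustration and not the training code.

```python
def huber(u: float, kappa: float = 1.0) -> float:
    """Huber loss: quadratic near zero, linear for |u| > kappa."""
    a = abs(u)
    return 0.5 * u * u if a <= kappa else kappa * (a - 0.5 * kappa)


def quantile_huber_loss(pred: list[float], target: list[float], kappa: float = 1.0) -> float:
    """Quantile Huber loss between predicted quantiles and target samples."""
    n = len(pred)
    total = 0.0
    for i, theta_i in enumerate(pred):
        tau_i = (i + 0.5) / n  # same as (i - 0.5) / N with 1-based i
        for t_j in target:
            u = t_j - theta_i
            # Asymmetric weight: errors are weighted by tau_i when the target lies
            # above the prediction (u >= 0) and by (1 - tau_i) when it lies below.
            weight = abs(tau_i - (1.0 if u < 0 else 0.0))
            total += weight * huber(u, kappa) / kappa
    return total / len(target)


print(quantile_huber_loss([1.0, 2.0], [1.0, 2.0]))  # 0.125
print(quantile_huber_loss([0.0, 3.0], [1.0, 2.0]))  # 0.5
```

Note that the loss is not zero even when the predicted quantiles equal the target samples. Quantile regression only guarantees that the loss is minimized at the correct quantiles, not that it vanishes there.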


## 3. Baseline: Hyperparameter & Training

This section describes the baseline setup used for training QR-DQN on Seaquest.  

### 3.1 Baseline Hyperparameters (rl-zoo Atari)

The baseline hyperparameters are taken from the Atari configurations provided by rl-zoo.  
Compared to many other reinforcement learning algorithms, relatively few hyperparameters are explicitly specified in this setup.



```yaml
atari:
  env_wrapper:
    - stable_baselines3.common.atari_wrappers.AtariWrapper
  frame_stack: 4
  policy: 'CnnPolicy'
  n_timesteps: !!float 1e7
  learning_starts: 50000
  exploration_fraction: 0.025  # explore 250k steps = 10M * 0.025
  # If True, you need to deactivate handle_timeout_termination
  # in the replay_buffer_kwargs
  optimize_memory_usage: False
```

Most remaining hyperparameters are assumed to be already well chosen for Atari environments. For this reason, the focus is less on fine-tuning hyperparameters, and only small adjustments are made.

```yaml
SeaquestNoFrameskip-v4:
  env_wrapper:
    - stable_baselines3.common.atari_wrappers.AtariWrapper
  frame_stack: 4
  policy: 'CnnPolicy'
  n_timesteps: 8400000
  learning_starts: 50000
  exploration_fraction: 0.05  # explore 420k steps = 8.4M * 0.05
  buffer_size: 200000
  optimize_memory_usage: False
```

First, the replay buffer size was reduced from 1,000,000 to 200,000.  
The default value of 1M seemed unnecessarily large, and 200k felt like a better compromise than the 100k used by other algorithms.

Second, the exploration fraction was increased from 0.025 to 0.05.  
This lets the agent explore for longer and helps to avoid converging too early to bad strategies.

Third, the total training time was reduced from 10 million to 8.4 million timesteps.  
After around 8 million steps, only little further improvement was observed, so the budget was set to 8.4 million to finish all training runs faster.
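The effect of these three changes can be checked with some simple arithmetic. The sketch below assumes a linear epsilon schedule over the first `exploration_fraction` of training, which is how SB3-style DQN variants schedule exploration. The start and end values of epsilon (`eps_start`, `eps_end`) are placeholder assumptions, since the report does not state them. The memory estimate only counts stacked 84 × 84 `uint8` frames. It is offered as one plausible reason for the smaller buffer, not as the stated motivation. With `optimize_memory_usage: False`, the next observation is stored as well, hence `copies=2`.

```python
def exploration_steps(n_timesteps: int, exploration_fraction: float) -> int:
    """Number of steps over which epsilon is annealed."""
    return round(n_timesteps * exploration_fraction)


def epsilon_at(step: int, n_timesteps: int, exploration_fraction: float,
               eps_start: float = 1.0, eps_end: float = 0.01) -> float:
    """Linearly annealed epsilon; eps_start and eps_end are assumed values."""
    horizon = exploration_steps(n_timesteps, exploration_fraction)
    if horizon <= 0:
        return eps_end
    progress = min(step / horizon, 1.0)
    return eps_start + progress * (eps_end - eps_start)


def observation_bytes(buffer_size: int, stack: int = 4, height: int = 84,
                      width: int = 84, copies: int = 2) -> int:
    """Rough size of the stored uint8 observations (one byte per pixel)."""
    return buffer_size * stack * height * width * copies


print(exploration_steps(10_000_000, 0.025))   # 250000 (rl-zoo default)
print(exploration_steps(8_400_000, 0.05))     # 420000 (adjusted setup)
print(observation_bytes(1_000_000) / 1e9)     # about 56.4 GB
print(observation_bytes(200_000) / 1e9)       # about 11.3 GB
```

Because the total number of timesteps was reduced at the same time, doubling the exploration fraction extends the exploration phase from 250k to 420k steps and not to 500k.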


The standard SB3 `AtariWrapper` is used. This wrapper applies common preprocessing steps such as:
- frame skipping (frame stacking is added separately through `frame_stack: 4`),
- conversion to grayscale images,
- resizing of observations to 84 × 84 pixels,
- reward clipping to {-1, 0, +1} based on the sign of the reward.

These steps simplify the input space and help to stabilize training.
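The reward clipping step has a consequence that is easy to overlook. In SB3's `ClipRewardEnv`, which `AtariWrapper` applies by default, the clipped reward is the sign of the original reward. The sketch below reimplements only this one step in plain Python for illustration. The function name is chosen here and is not part of SB3.

```python
def clip_reward(reward: float) -> int:
    """Map a raw game reward to -1, 0, or +1 by its sign."""
    return (reward > 0) - (reward < 0)


# Destroying an enemy (20 points) and delivering a diver late in the game
# (up to 1000 points) both produce the same training signal of +1.
print([clip_reward(r) for r in (20, 50, 1000, 0)])  # [1, 1, 1, 0]
```

During training, the agent therefore sees how often points are scored, not how many points each event is worth.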

### 3.2 Training and Results
The final baseline runs were done with 8.4 million timesteps and with a total of 20 seeds.

The training curves show a consistent improvement in performance over the timesteps, but also an increasing variance between seeds. While the average score increases steadily, the individual runs start to diverge more strongly.

![baseline_eval](Picture1.png)

This behavior becomes clearer when looking at the individual training curves. In many runs, the agent suddenly discovers a new strategy, leading to a rapid increase in score. However, the point at which this happens differs strongly between seeds. In some cases, this improvement happens early, while in others it occurs much later or not at all.

![baseline_indi](Picture2.png)

For example, some seeds never manage to move beyond a score of 1000 points, even after 8.4 million timesteps, while other seeds reach 4000 points or more. This explains the growing variance observed in the aggregated training curves.

Overall, the training curves suggest that learning in Seaquest happens in distinct stages. The agent often stabilizes around certain performance levels, such as around 900 points, then later around 2000 points, and finally around 4000 points or higher. 

These observations suggest that the results can be used to guide further improvements.  
Since good strategies appear suddenly and at very different times, the goal is to speed up learning, so that better strategies are discovered earlier and more consistently across seeds.

Looking at individual agents gives a better insight into these differences.

<table>
  <tr>
    <td><img src="Picture6.png" width="360"></td>
    <td><img src="Picture7.png" width="360"></td>
    <td><img src="Picture8.png" width="360"></td>
  </tr>
  <tr>
    <td><img src="Picture3.gif" width="360"></td>
    <td><img src="Picture4.gif" width="360"></td>
    <td><img src="Picture5.gif" width="360"></td>
  </tr>
</table>




The best-performing agent (seed 4) reaches a mean reward of over 5000 points.  
When observing its gameplay, it can be seen that this agent consistently collects divers and only surfaces to refill oxygen when it is carrying at least one diver. It also avoids enemies well and survives for long periods of time.

At the same time, this agent behaves quite conservatively. It mostly stays in the upper half of the screen and avoids risky dives to deeper divers. While this strategy is safe and stable, it also limits the maximum score that can be achieved.

The medium-performing agent shows a similar overall strategy, but executes it less effectively.  
It stays even closer to the surface, collects fewer divers, and therefore reaches lower scores. The basic idea of the strategy is present, but it is applied more cautiously and with worse timing.

The worst-performing agent behaves very differently.  
It repeatedly dives down one level, shoots continuously in one direction, and waits for enemies to spawn. After destroying them, it returns to the same position and repeats this behavior. Over time, it runs out of oxygen and dies. This agent has not learned that collecting divers and surfacing for oxygen are the most important elements of this game.

Overall, these examples show that performance differences mainly depend on how well the agent understands diver collection, oxygen management, and risk. Agents that learn these concepts early perform much better.

## 4. Action-Space Adjustment

This section describes an experiment where the action space of Seaquest was reduced to simplify learning.

### 4.1 Motivation

Compared to other Atari games, Seaquest has a larger action space with 18 actions.  
Many of these actions are combinations of movement and shooting, which creates redundancy and increases the complexity of exploration.

The idea was that a large action space makes it harder for the agent to explore efficiently, especially in the early phase of training. Therefore, reducing the action space could help the agent to learn faster.

### 4.2 Analysis of Action Usage

To decide which actions are important, the action usage of the best baseline agent was analyzed for a single episode.

<h4>Table 2: Action Usage</h4>

<table style="margin-left: 0;">
  <tr>
    <th align="left">Action ID</th>
    <th align="left">Meaning</th>
    <th align="left">Usage (%)</th>
  </tr>
  <tr><td>0</td><td>NOOP</td><td>8.5%</td></tr>
  <tr><td>1</td><td>FIRE</td><td>0.4%</td></tr>
  <tr><td>2</td><td>UP</td><td>13.5%</td></tr>
  <tr><td>3</td><td>RIGHT</td><td>16.1%</td></tr>
  <tr><td>4</td><td>LEFT</td><td>11.3%</td></tr>
  <tr><td>5</td><td>DOWN</td><td>10.4%</td></tr>
  <tr><td>6</td><td>UPRIGHT</td><td>0.8%</td></tr>
  <tr><td>7</td><td>UPLEFT</td><td>0.4%</td></tr>
  <tr><td>8</td><td>DOWNRIGHT</td><td>1.8%</td></tr>
  <tr><td>9</td><td>DOWNLEFT</td><td>1.5%</td></tr>
  <tr><td>10</td><td>UP + FIRE</td><td>0.8%</td></tr>
  <tr><td>11</td><td>RIGHT + FIRE</td><td>0.4%</td></tr>
  <tr><td>12</td><td>LEFT + FIRE</td><td>1.1%</td></tr>
  <tr><td>13</td><td>DOWN + FIRE</td><td>1.8%</td></tr>
  <tr><td>14</td><td>UPRIGHT + FIRE</td><td>6.7%</td></tr>
  <tr><td>15</td><td>UPLEFT + FIRE</td><td>9.3%</td></tr>
  <tr><td>16</td><td>DOWNRIGHT + FIRE</td><td>5.5%</td></tr>
  <tr><td>17</td><td>DOWNLEFT + FIRE</td><td>9.6%</td></tr>
</table>



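From this, it can be seen that the combined diagonal movement + fire actions (IDs 14 to 17) account for about 31% of all actions, so they are the main way the agent shoots while moving.  
Pure diagonal movement actions without firing (IDs 6 to 9) are used only rarely, at about 4.5% in total.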
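A usage table like Table 2 can be produced by logging the action chosen at every step of an episode and counting the results. The sketch below shows one way to do this. The helper name and the short example action list are hypothetical and only demonstrate the computation.

```python
from collections import Counter


def action_usage(actions: list[int], n_actions: int = 18) -> list[float]:
    """Return the share of each action ID in percent."""
    if not actions:
        raise ValueError("actions must contain at least one step")
    counts = Counter(actions)
    total = len(actions)
    # Counter returns 0 for action IDs that never occurred.
    return [100.0 * counts[a] / total for a in range(n_actions)]


# Hypothetical four-step episode: NOOP, RIGHT, RIGHT, DOWNLEFT + FIRE.
usage = action_usage([0, 3, 3, 17])
print(usage[0], usage[3], usage[17])  # 25.0 50.0 25.0
```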


### 4.3 Adjusted Action Space

Based on this observation, the decision was made to remove the pure movement actions and keep only NOOP, FIRE, and the eight combined movement + fire actions (IDs 10 to 17).

The idea is that combined actions already allow movement and shooting at the same time, which makes separate movement actions less necessary.

This reduces the action space from 18 actions to 10 actions, making exploration easier and more focused.



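```python
import gymnasium as gym


class SeaquestOnlyMoveWithFire(_ActionMapWrapper):
    """Restrict Seaquest to NOOP, FIRE, and the eight movement + fire actions."""

    # `_ActionMapWrapper` is a project helper that is not shown in this report.
    def __init__(self, env: gym.Env):
        # Position in the list is the reduced action ID; the value is the original action ID.
        action_map = [
            0,   # NOOP
            1,   # FIRE
            10,  # UPFIRE
            11,  # RIGHTFIRE
            12,  # LEFTFIRE
            13,  # DOWNFIRE
            14,  # UPRIGHTFIRE
            15,  # UPLEFTFIRE
            16,  # DOWNRIGHTFIRE
            17,  # DOWNLEFTFIRE
        ]
        super().__init__(env, action_map)
```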
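The `_ActionMapWrapper` base class used above is not shown in the report. The sketch below is a hypothetical, dependency-free illustration of the mapping logic such a helper would need. In a Gymnasium setup, this logic would typically live in the `action()` method of a `gymnasium.ActionWrapper` subclass, together with a `Discrete(len(action_map))` action space. The class name `ReducedActionMap` is invented for this example.

```python
class ReducedActionMap:
    """Translate reduced action IDs (0..n-1) into original environment action IDs."""

    def __init__(self, action_map: list[int]):
        if len(set(action_map)) != len(action_map):
            raise ValueError("action_map must not contain duplicates")
        self.action_map = tuple(action_map)

    @property
    def n(self) -> int:
        """Size of the reduced action space."""
        return len(self.action_map)

    def action(self, reduced_action: int) -> int:
        """Look up the original action ID for a reduced action ID."""
        if not 0 <= reduced_action < self.n:
            raise ValueError(f"reduced action {reduced_action} is out of range")
        return self.action_map[reduced_action]


# Same mapping as in SeaquestOnlyMoveWithFire: NOOP, FIRE, and IDs 10 to 17.
mapping = ReducedActionMap([0, 1, 10, 11, 12, 13, 14, 15, 16, 17])
print(mapping.n, mapping.action(2), mapping.action(9))  # 10 10 17
```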

### 4.4 Training and Observations

After reducing the action space, the agent learns faster.  
The training curve rises more quickly compared to the baseline. One possible reason is that the agent has fewer actions to explore and can focus on more useful behaviors earlier.

Overall, this adjustment was a success.

![Training curves: baseline vs reduced action space](action.png)

## 5. Reward Shaping

This section describes an experiment where the reward function of Seaquest was extended to give the agent more direct feedback.

### 5.1 Motivation

In the original Seaquest game, rewards are relatively sparse.  
Collecting divers is clearly a good action, but it is not rewarded directly.  
The agent only receives points later, when surfacing and delivering the divers.

This often leads to early deaths and inefficient behavior, especially during the early training phase.  
Without direct feedback, learning progress is slow.

### 5.2 Modified Reward System

To give the agent more guidance, additional rewards were added on top of the original game rewards.

To detect when a diver is collected, the RAM representation of the environment was analyzed.  
While playing the game manually, all RAM values were logged and compared to in-game events.  
Based on this, the following RAM indices were identified:

- **RAM[62]**: number of currently collected divers  
- **RAM[102]**: oxygen level (63 → 0)  
- **RAM[97]**: vertical (Y) position of the player

Using this information, the reward system was extended as follows:

- **Diver collection:** 
  Every time RAM[62] increases, the agent receives +1.5 reward as direct feedback.

- **Surfacing with empty oxygen:**  
  If points are scored while oxygen is empty (RAM[102] = 0), the agent receives an additional +1.0 reward.

- **Low oxygen behavior:**  
  If RAM[102] ≤ 6, every 4 steps it is checked whether the agent moves upward.  
  - No upward movement: −0.25 reward  
  - Reaching the surface: +1.0 reward
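The shaping rules above can be sketched as a pure function over two RAM snapshots. The RAM indices and reward values are taken from the list above. Everything else is an assumption made for illustration: the function signature, the surface position `SURFACE_Y`, and the convention that a smaller Y value means higher on the screen. The report also does not state where the shaping sits relative to the sign-based reward clipping of `AtariWrapper`. For values such as +1.5 to survive, the bonus would have to be added after clipping.

```python
DIVERS_IDX, OXYGEN_IDX, PLAYER_Y_IDX = 62, 102, 97  # RAM indices from the report

DIVER_BONUS = 1.5
EMPTY_OXYGEN_SCORE_BONUS = 1.0
LOW_OXYGEN_THRESHOLD = 6
NO_ASCENT_PENALTY = -0.25
SURFACE_BONUS = 1.0
CHECK_INTERVAL = 4
SURFACE_Y = 13  # ASSUMPTION: Y value at the surface; not given in the report


def shaping_bonus(prev_ram, ram, game_reward, step, y_at_last_check):
    """Extra reward to add on top of the game reward for one step."""
    bonus = 0.0
    # Diver collection: the carried-diver counter increased.
    if ram[DIVERS_IDX] > prev_ram[DIVERS_IDX]:
        bonus += DIVER_BONUS
    # Points scored while the oxygen level is empty.
    if game_reward > 0 and ram[OXYGEN_IDX] == 0:
        bonus += EMPTY_OXYGEN_SCORE_BONUS
    # Low oxygen: every CHECK_INTERVAL steps, check for upward movement.
    if ram[OXYGEN_IDX] <= LOW_OXYGEN_THRESHOLD and step % CHECK_INTERVAL == 0:
        y = ram[PLAYER_Y_IDX]
        if y <= SURFACE_Y:
            bonus += SURFACE_BONUS          # reached the surface
        elif y >= y_at_last_check:
            bonus += NO_ASCENT_PENALTY      # did not move upward (assumes smaller Y = higher)
    return bonus
```

In a full wrapper, `prev_ram` and `y_at_last_check` would be stored between steps and reset at the start of each episode.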

### 5.3 Training and Observations

With reward shaping, the agent learns much faster, especially early in training.  
The training curve rises considerably faster than that of the baseline.

This happens because the agent now receives direct feedback that collecting divers is a good event, instead of learning this only indirectly after surfacing.

However, towards the end of training the difference to the baseline becomes smaller.  
This can be explained by the fact that many baseline agents eventually also learn that collecting divers and surfacing for oxygen is the correct strategy.

In other words, reward shaping helps the agent learn earlier, but does not fundamentally change the behavior that is eventually discovered anyway.

Overall, reward shaping improves training speed and stability.

![Training curves: reward](reward1.png)

![Training curves: reward2](reward2.png)


When comparing the best agent trained with reward shaping to the best baseline agent, it can be seen that both follow a very similar overall strategy.


<div style="display: flex; gap: 40px; align-items: flex-start; margin-top: 20px;">

  <div style="text-align: center;">
    <img src="rewardbestagent.gif" width="420"><br>
    <b>Best Agent (Reward Shaping)</b>
  </div>

  <div style="text-align: center;">
    <img src="Picture3.gif" width="420"><br>
    <b>Best Agent (Baseline)</b>
  </div>

</div>



Both agents mainly operate in the upper half of the screen. They shoot enemies, avoid collisions, and try to survive as long as possible. Divers are collected, but only when they are located in the upper region of the map. In addition, both agents usually surface for oxygen only when they are carrying at least one diver and the oxygen level becomes low.

Despite this similarity, the reward shaping agent executes this strategy much better. It collects divers more reliably, manages oxygen better, and reaches significantly higher scores. This suggests that reward shaping does not introduce a fundamentally new strategy, but helps the agent to learn and apply an existing good strategy more consistently.

Interestingly, both agents also show the same unintended behavior when oxygen is almost empty and no diver is being carried. In this situation, they exploit a game mechanic by staying one or two pixels below the surface. At this position, the oxygen level does not decrease.
The agents then move slightly up and down while shooting enemies, which allows them to survive for a longer time without actually surfacing. This behavior is a form of bug exploitation rather than intended gameplay. Still, it is interesting that both agents independently discover it during training.

Overall, the comparison shows that reward shaping, in this case, improves how well and how early a good strategy is learned, but does not necessarily change the strategy that is reached late in training.

## 6. Observation Space Adjustment

This section describes an experiment where the observation space was modified to reduce noise and to explicitly provide important game information.

### 6.1 Motivation

When looking at the Seaquest game screen, it becomes clear that not all visual information is equally relevant for the agent. 

![SeaquestUI](Picture11.png)


Large parts of the UI, such as the score display and the background, are not relevant.

The idea was that by removing irrelevant regions of the screen, the agent would receive a cleaner input signal and might learn faster due to reduced noise.

![SeaquestUI](Picture12.png)

In addition, information that is important for gameplay could be given to the agent through other channels.  
The goal was to explicitly provide information to the agent without changing the CNN policy.

To achieve this, two additional channels were added to the observation:
- one channel filled with the current value of RAM[102] (oxygen level),
- one channel filled with the current value of RAM[62] (number of collected divers).

This results in an observation of size 84 × 84 × 3, where:
- channel 1 contains the grayscale game screen,
- channel 2 encodes the oxygen information,
- channel 3 encodes the diver information.

Each of these values is constant across the whole channel, which allows the CNN to process them in the same way as image data.
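A minimal NumPy sketch of this channel construction is shown below. It assumes the grayscale frame has already been resized to 84 × 84 and that the raw RAM values are written into the extra channels without rescaling. The report does not state whether the values were rescaled, and the function name is chosen here for illustration.

```python
import numpy as np


def add_ram_channels(gray_frame: np.ndarray, oxygen: int, divers: int) -> np.ndarray:
    """Stack a grayscale frame with constant oxygen and diver planes."""
    if gray_frame.shape != (84, 84):
        raise ValueError("expected an 84 x 84 grayscale frame")
    # Each plane has the same shape and dtype as the frame and holds one constant value.
    oxygen_plane = np.full_like(gray_frame, oxygen)
    diver_plane = np.full_like(gray_frame, divers)
    return np.stack([gray_frame, oxygen_plane, diver_plane], axis=-1)


frame = np.zeros((84, 84), dtype=np.uint8)   # placeholder for a preprocessed game frame
obs = add_ram_channels(frame, oxygen=63, divers=2)
print(obs.shape, obs.dtype)                  # (84, 84, 3) uint8
```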

![SeaquestUI](Picture13.png)
![SeaquestUI](Picture14.png)
![SeaquestUI](Picture15.png)

### 6.2 Results and Observations

![Observation space change eval](observ1.png)

Overall, the results of this experiment were not very strong.  
Only a very small improvement was observed at the beginning of training. Towards the later timesteps, performance was similar to or even slightly worse than the baseline.

In addition, training took significantly longer due to the increased input size and additional computation.

One possible explanation is that the baseline agent already learns to filter out visual noise early during training.  
Therefore, explicitly removing parts of the observation does not provide a large benefit.

Another reason could be that encoding RAM values as constant image channels is not an efficient way to represent this information. The CNN may not be well suited to interpret such global, non-spatial signals.

Overall, changing the observation space increased complexity without improving performance.

![Observation space change eval](observ2.png)


## 7. Summary and Conclusion

In conclusion, QR-DQN was adapted to the Atari game Seaquest using several different approaches.  
Overall, most changes compared to the baseline improved performance in some way.

Reward shaping mainly helped the agent to learn faster. Important behaviors, such as collecting divers and managing oxygen, were discovered much earlier. However, the final performance was often similar to the baseline.

The action-space adjustment was the most effective change. Reducing the number of actions clearly improved early learning and led to better and more stable results.

The observation-space adjustment did not have a strong impact. Removing visual noise and adding RAM information resulted only in small improvements and sometimes even worse performance. This suggests that the baseline setup already works quite well.

Overall, guiding the agent with better rewards and a simpler action space was more useful than changing the observation space.

## 8. Future Work

Reward shaping could be improved further, for example by increasing the reward per collected diver or by preventing bug exploitation.

Further improvements could also come from dedicated hyperparameter tuning or from training directly on the RAM observation space.
