<div id=top></div>

# Reinforcement Learning with Doom - Reward shaping and curriculum learning

Leandro Kieliger
contact@lkieliger.ch

---
## Description

In this notebook we are going to significantly improve the learning efficiency of the setup created in the previous part of this series. First, we will see how to modify rewards to incentivize behaviors helping reach the learning objective, a method called "reward shaping". In the second part, we will design an adaptive learning process that changes the difficulty of the training environment based on the performance of the agent. 


### [Part 1 - Reward Shaping](#part_1)
* [Action multipliers](#shaping_table)
* [Shaped environment wrapper](#shaped_env)

    
### [Part 2 - Curriculum Learning](#part_2)
* [ACS script](#acs_script)
* [Curriculum environment wrapper](#curriculum_env)
* [Final model](#final_model)
    
    
### [Bonus - Human vs AI, playing against a trained agent](#bonus)

### [Conclusion](#conclusion)

<div id=part_1></div>

# [^](#top) Part 1 - Reward Shaping


## Preparations

In [1]:
%load_ext autoreload
%autoreload 2

import cv2
import gym
import matplotlib.pyplot as plt
import numpy as np
import torch as th
import typing as t
import vizdoom

from stable_baselines3 import ppo
from stable_baselines3.common.callbacks import EvalCallback
from stable_baselines3.common import evaluation, policies
from torch import nn

from common import envs, plotting

In the previous notebook we saw that the learning process was very slow. Indeed, even after training for more than 2 million steps, our agent barely reached 2 frags per match on average. In comparison, the best bot manages to get around 13 frags. For reference, here is the average performance of six consecutive runs. The shaded area shows the standard error of the mean $\frac{\sigma}{\sqrt{n}}$ for $n=6$.
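
The width of such a shaded band can be computed from per-run results in a few lines. The sketch below uses made-up frag counts for six runs (placeholders, not the values behind the figure) and assumes that $\sigma$ is the sample standard deviation; `mean_and_sem` is a name introduced here for illustration.

```python
import math
import statistics

def mean_and_sem(values):
    """Return the mean and the standard error of the mean of `values`."""
    n = len(values)
    mean = statistics.fmean(values)
    # Sample standard deviation (n - 1 in the denominator); this is an assumption.
    sigma = statistics.stdev(values)
    return mean, sigma / math.sqrt(n)

# Hypothetical average frags per match for six independent runs.
runs = [1.5, 2.0, 2.5, 1.0, 2.0, 3.0]
mean, sem = mean_and_sem(runs)
print(f'mean = {mean:.2f}, standard error = {sem:.2f}')
```

The band in the plot would then span `mean - sem` to `mean + sem` at each point of the curve.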

![Comparison performance](./figures/comparison_shaping_1.png)

We also discussed one of the main reasons why the model had so much difficulty getting started: sparse rewards. That is, the agent has to execute many steps "just right" before it can observe some meaningful reward signal. Indeed, it must manage to move and aim at enemies while repeatedly shooting them in order to (possibly) get some rewards. Such sequences of actions rarely happen by chance. If rewards are rare, it will take a long time to reinforce good behaviors.

<div id=shaping_table></div>

## Action multipliers

To solve the issue of sparse rewards, we can give our agent small positive rewards for every action we believe is beneficial to the learning process. Here is the list of actions we would like to incentivize as well as the associated rewards:

| Action                     | Reward       |
| -------------------------- |--------------| 
| Frag                       |  1 per frag   | 
| Damaging enemies           |  0.01 per damage point | 
| Picking up ammunition      |  0.02 per unit |
| Using ammunition           | -0.01 per unit | 
| Picking up health          |  0.02 per health point |
| Losing health              | -0.01 per health point |
| Picking up armor           |  0.01 per armor point |
| Moved distance > 3 units   |  5e-4 per step |
| Moved distance < 3 units   | -2.5e-3 per step |

Note that players typically have 100 health points and that damage points correspond to the number of enemy health points that were removed. Also, players can typically move at around 16 units per tick. The distance reward is there to discourage "camping" behavior. The values are inspired by and adapted from a paper that uses this technique:
> Wu, Yuxin and Yuandong Tian. “Training Agent for First-Person Shooter Game with Actor-Critic Curriculum Learning.” ICLR (2017). [PDF](https://research.fb.com/wp-content/uploads/2017/04/paper_camera_ready_small-1.pdf)
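
To get a feel for the magnitudes involved, here is a small worked example that applies the table to a single, made-up step. It is only a sketch: the function names and the example numbers are hypothetical, and only the multipliers come from the table above.

```python
# Multipliers copied from the table above.
FRAG = 1.0
DAMAGE = 0.01
AMMO_GAIN, AMMO_LOSS = 0.02, -0.01
HEALTH_GAIN, HEALTH_LOSS = 0.02, -0.01
ARMOR_GAIN = 0.01
MOVE_REWARD, MOVE_PENALTY = 5e-4, -2.5e-3
MOVE_THRESHOLD = 3.0

def signed(delta, gain, loss):
    """Apply `gain` per unit gained and `loss` (a negative factor) per unit lost."""
    return gain * max(0, delta) + loss * max(0, -delta)

def shaped_reward(frags, damage, d_ammo, d_health, d_armor, distance):
    """Sum all reward components for one step."""
    return (
        FRAG * frags
        + DAMAGE * damage
        + signed(d_ammo, AMMO_GAIN, AMMO_LOSS)
        + signed(d_health, HEALTH_GAIN, HEALTH_LOSS)
        + ARMOR_GAIN * max(0, d_armor)
        + (MOVE_REWARD if distance > MOVE_THRESHOLD else MOVE_PENALTY)
    )

# Hypothetical step: the agent fires 2 bullets, deals 20 damage,
# loses 5 health points and moves 10 units.
r = shaped_reward(frags=0, damage=20, d_ammo=-2, d_health=-5, d_armor=0, distance=10.0)
print(f'{r:.4f}')
```

Even without a frag, a step in which the agent hits an enemy yields a clearly positive signal, which is the whole point of shaping.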

To modify the rewards we just need to keep track of a few variables and adapt the value before passing it on to the agent. First, we set the action rewards:

In [2]:
# Rewards
# 1 per kill
reward_factor_frag = 1.0
reward_factor_damage = 0.01

# Player can move at ~16.66 units per tick
reward_factor_distance = 5e-4
penalty_factor_distance = -2.5e-3
reward_threshold_distance = 3.0

# Pistol clips have 10 bullets
reward_factor_ammo_increment = 0.02
reward_factor_ammo_decrement = -0.01

# Player starts at 100 health
reward_factor_health_increment = 0.02
reward_factor_health_decrement = -0.01
reward_factor_armor_increment = 0.01

<div id=shaped_env></div>

## Shaped environment wrapper
Then, we define a game environment wrapper class, just like the one with bots from the previous part. It might seem lengthy at first, but most of the code simply computes reward components based on the multipliers defined above. The components are aggregated in the `shape_rewards` method, which is called on every `step`.
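
Most of the reward components follow the same pattern: read a game variable, compare it with the value stored at the previous step, and scale the difference. The following sketch isolates that pattern without any dependency on ViZDoom. `DeltaReward` is a hypothetical helper introduced for illustration; the wrapper below inlines the same logic in its `_compute_*_reward` methods.

```python
class DeltaReward:
    """Turns successive readings of a game variable into a reward."""

    def __init__(self, initial, gain_factor, loss_factor=0.0):
        self.last = initial
        self.gain_factor = gain_factor
        # Expected to be zero or negative: reward per unit lost.
        self.loss_factor = loss_factor

    def update(self, value):
        """Return the reward for the change since the previous reading."""
        delta = value - self.last
        self.last = value
        return self.gain_factor * max(0, delta) + self.loss_factor * max(0, -delta)

# Health starts at 100, with the multipliers from the shaping table.
health = DeltaReward(initial=100, gain_factor=0.02, loss_factor=-0.01)
print(health.update(80))   # the agent lost 20 health points
print(health.update(105))  # the agent picked up 25 health points
```

Keeping the previous reading inside the object is also why the stored values have to be refreshed whenever the player respawns or the episode is reset.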

In [3]:
import random
import string

from gym import spaces
from vizdoom.vizdoom import GameVariable

# List of game variables storing ammunition information. Used for keeping track of ammunition-related rewards.
AMMO_VARIABLES = [GameVariable.AMMO0, GameVariable.AMMO1, GameVariable.AMMO2, GameVariable.AMMO3, GameVariable.AMMO4,
                  GameVariable.AMMO5, GameVariable.AMMO6, GameVariable.AMMO7, GameVariable.AMMO8, GameVariable.AMMO9]

# List of game variables storing weapon information. Used for keeping track of ammunition-related rewards.
WEAPON_VARIABLES = [GameVariable.WEAPON0, GameVariable.WEAPON1, GameVariable.WEAPON2, GameVariable.WEAPON3,
                    GameVariable.WEAPON4,
                    GameVariable.WEAPON5, GameVariable.WEAPON6, GameVariable.WEAPON7, GameVariable.WEAPON8,
                    GameVariable.WEAPON9]

class DoomWithBotsShaped(envs.DoomWithBots):
    """An environment wrapper for a Doom deathmatch game with bots. 
    
    Rewards are shaped according to the multipliers defined in the notebook.
    """

    def __init__(self, game, frame_processor, frame_skip, n_bots, shaping):
        super().__init__(game, frame_processor, frame_skip, n_bots)

        # Give a random two-character name to the agent for identifying instances in parallel learning.
        self.name = ''.join(random.choices(string.ascii_uppercase + string.digits, k=2))
        self.shaping = shaping

        # Internal states
        self.last_health = 100
        self.last_x, self.last_y = self._get_player_pos()
        self.ammo_state = self._get_ammo_state()
        self.weapon_state = self._get_weapon_state()
        self.total_rew = self.last_damage_dealt = self.deaths = self.last_frags = self.last_armor = 0

        # Store individual reward contributions for logging purposes
        self.rewards_stats = {
            'frag': 0,
            'damage': 0,
            'ammo': 0,
            'health': 0,
            'armor': 0,
            'distance': 0,
        }
        
    def step(self, action, array=False):
        # Perform the action as usual
        state, reward, done, info = super().step(action)
        
        self._log_reward_stat('frag', reward)

        # Adjust the reward based on the shaping table
        if self.shaping:
            shaped_reward = reward + self.shape_rewards()
        else:
            shaped_reward = reward

        self.total_rew += shaped_reward

        return state, shaped_reward, done, info

    def reset(self):
        self._print_state()
        
        state = super().reset()

        self.last_health = 100
        self.last_x, self.last_y = self._get_player_pos()
        self.last_armor = self.last_frags = self.total_rew = self.deaths = 0

        # Damage count  is not cleared when starting a new episode: https://github.com/mwydmuch/ViZDoom/issues/399
        # self.last_damage_dealt = 0

        # Reset reward stats
        for k in self.rewards_stats.keys():
            self.rewards_stats[k] = 0
            
        return state

    def shape_rewards(self):
        reward_contributions = [
            self._compute_damage_reward(),
            self._compute_ammo_reward(),
            self._compute_health_reward(),
            self._compute_armor_reward(),
            self._compute_distance_reward(*self._get_player_pos()),
        ]

        return sum(reward_contributions)
    
    def _respawn_if_dead(self):
        if not self.game.is_episode_finished():
            # Check if player is dead
            if self.game.is_player_dead():
                self.deaths += 1
                self._reset_player()

    def _compute_distance_reward(self, x, y):
        """Computes a reward/penalty based on the distance travelled since last update."""
        dx = self.last_x - x
        dy = self.last_y - y

        distance = np.sqrt(dx ** 2 + dy ** 2)

        if distance - reward_threshold_distance > 0:
            reward = reward_factor_distance
        else:
            reward = penalty_factor_distance

        self.last_x = x
        self.last_y = y
        self._log_reward_stat('distance', reward)

        return reward

    def _compute_damage_reward(self):
        """Computes a reward based on total damage inflicted to enemies since last update."""
        damage_dealt = self.game.get_game_variable(GameVariable.DAMAGECOUNT)
        reward = reward_factor_damage * (damage_dealt - self.last_damage_dealt)

        self.last_damage_dealt = damage_dealt
        self._log_reward_stat('damage', reward)

        return reward

    def _compute_health_reward(self):
        """Computes a reward/penalty based on total health change since last update."""
        # When the player is dead, the health game variable can be -999900
        health = max(self.game.get_game_variable(GameVariable.HEALTH), 0)

        health_reward = reward_factor_health_increment * max(0, health - self.last_health)
        health_penalty = reward_factor_health_decrement * min(0, health - self.last_health)
        reward = health_reward - health_penalty

        self.last_health = health
        self._log_reward_stat('health', reward)

        return reward

    def _compute_armor_reward(self):
        """Computes a reward/penalty based on total armor change since last update."""
        armor = self.game.get_game_variable(GameVariable.ARMOR)
        reward = reward_factor_armor_increment * max(0, armor - self.last_armor)
        
        self.last_armor = armor
        self._log_reward_stat('armor', reward)

        return reward

    def _compute_ammo_reward(self):
        """Computes a reward/penalty based on total ammunition change since last update."""
        self.weapon_state = self._get_weapon_state()

        new_ammo_state = self._get_ammo_state()
        ammo_diffs = (new_ammo_state - self.ammo_state) * self.weapon_state
        ammo_reward = reward_factor_ammo_increment * max(0, np.sum(ammo_diffs))
        ammo_penalty = reward_factor_ammo_decrement * min(0, np.sum(ammo_diffs))
        reward = ammo_reward - ammo_penalty
        
        self.ammo_state = new_ammo_state
        self._log_reward_stat('ammo', reward)

        return reward

    def _get_player_pos(self):
        """Returns the player X- and Y- coordinates."""
        return self.game.get_game_variable(GameVariable.POSITION_X), self.game.get_game_variable(
            GameVariable.POSITION_Y)

    def _get_ammo_state(self):
        """Returns the total available ammunition per weapon slot."""
        ammo_state = np.zeros(10)

        for i in range(10):
            ammo_state[i] = self.game.get_game_variable(AMMO_VARIABLES[i])

        return ammo_state

    def _get_weapon_state(self):
        """Returns which weapon slots can be used. Available weapons are encoded as ones."""
        weapon_state = np.zeros(10)

        for i in range(10):
            weapon_state[i] = self.game.get_game_variable(WEAPON_VARIABLES[i])

        return weapon_state

    def _log_reward_stat(self, kind: str, reward: float):
        self.rewards_stats[kind] += reward

    def _reset_player(self):
        self.last_health = 100
        self.last_armor = 0
        self.game.respawn_player()
        self.last_x, self.last_y = self._get_player_pos()
        self.ammo_state = self._get_ammo_state()

    def _print_state(self):
        super()._print_state()
        print('\nREWARD BREAKDOWN')
        print('Agent {} frags: {}, deaths: {}, total reward: {:.2f}'.format(
            self.name,
            self.last_frags,
            self.deaths,
            self.total_rew
        ))
        for k, v in self.rewards_stats.items():
            print(f'- {k}: {v:+.1f}')
        print('***************************************\n\n')
            

We define some helper functions whose task is simply to create a VizDoom game instance and store it with our newly defined environment wrapper.

In [4]:
from stable_baselines3.common.vec_env import VecTransposeImage, DummyVecEnv

def game_instance(scenario):
    """Creates a Doom game instance."""
    game = vizdoom.DoomGame()
    game.load_config(f'scenarios/{scenario}.cfg')
    game.add_game_args(envs.DOOM_ENV_WITH_BOTS_ARGS)
    game.init()
    
    return game

def env_with_bots_shaped(scenario, **kwargs) -> envs.DoomEnv:
    """Wraps a Doom game instance in an environment with shaped rewards."""
    game = game_instance(scenario)
    return DoomWithBotsShaped(game, **kwargs)

def vec_env_with_bots_shaped(n_envs=1, **kwargs) -> VecTransposeImage:
    """Wraps a Doom game instance in a vectorized environment with shaped rewards."""
    return VecTransposeImage(DummyVecEnv([lambda: env_with_bots_shaped(**kwargs)] * n_envs))

We can now train on the map introduced in the second notebook. The code loading the model, registering the callbacks and starting the learning process has been moved to the `common` module for readability. This way, we can start the training with a single call to `solve_env` which will handle the aspects we have already covered previously.

In the part below we define:

* Environment parameters such as the amount of frame skipping and the number of bots (see notebooks 1 and 2).
* Agent parameters such as the learning rate, steps per rollout and our custom CNN (see notebook 2).

In [5]:
from common.models import CustomCNN

scenario = 'deathmatch_simple'

# Agent parameters.
agent_args = {
    'n_epochs': 3,
    'n_steps': 4096,
    'learning_rate': 1e-4,
    'batch_size': 32,
    'policy_kwargs': {'features_extractor_class': CustomCNN}
}

# Environment parameters.
env_args = {
    'scenario': scenario,
    'frame_skip': 4,
    'frame_processor': envs.default_frame_processor,
    'n_bots': 8,
    'shaping': True
}

# In the evaluation environment we measure frags only.
eval_env_args = dict(env_args)
eval_env_args['shaping'] = False

In [None]:
# Create environments with bots and shaping.
env = vec_env_with_bots_shaped(2, **env_args)
eval_env = vec_env_with_bots_shaped(1, **eval_env_args)

envs.solve_env(env, eval_env, scenario, agent_args)

Follow the learning process via Tensorboard. If you run it from the same directory as this notebook, the command is: `tensorboard --logdir=logs/tensorboard`.

You should see a noticeable improvement over our previous setup, with average rewards starting to rise much earlier in the learning process. The figure below shows the average reward curve over 6 consecutive trials. The difference is stunning! Within the same amount of time we were able to obtain a 3x improvement! In addition, we see that our agent is already stronger than the programmed bots!

But there is more: reward shaping is not the only way of improving the learning performance. In the next part of this notebook we will see how curriculum learning can further boost our setup.

![Comparison performance](./figures/comparison_shaping_2.png)

<div id=part_2></div>

# [^](#top) Part 2 - Curriculum Learning

The concept behind curriculum learning is to make the learning task easy at first and then gradually increase the difficulty as the agent progresses. To implement the idea in a deathmatch environment, we will alter the speed and health of bots based on the performance of the agent. The following table summarises the bot parameters as a function of the average reward obtained by the agent. The average reward is computed over the last 10 episodes.

| Average reward over last 10 episodes | Bot multiplier |
| :----------------------------------: |:--------------:| 
| <=5                                   |  0.1           |
| <=10                                  |  0.2           |
| <=15                                  |  0.4           |
| <=20                                  |  0.6           |
| <=25                                  |  0.8           |
|  >25                                  |  1.0           |

A multiplier of 0.1 means that bots will have 10% of their usual health and will move at only 10% of their normal speed. Once the average reward rises above 5, bots will have 20% of their health and speed, and so on.
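
The table can be read as a simple lookup. Here is a minimal sketch of that lookup, assuming the bands are inclusive on their upper bound as written in the table; `bot_multiplier` is a hypothetical helper and not part of the notebook's code.

```python
import bisect

# Inclusive upper bounds of the average-reward bands, and the matching multipliers.
REWARD_BOUNDS = [5, 10, 15, 20, 25]
BOT_MULTIPLIERS = [0.1, 0.2, 0.4, 0.6, 0.8, 1.0]

def bot_multiplier(average_reward):
    """Look up the bot health/speed multiplier for a given average reward."""
    # bisect_left keeps a reward equal to a bound in the lower band (<=).
    level = bisect.bisect_left(REWARD_BOUNDS, average_reward)
    return BOT_MULTIPLIERS[level]

for avg in (3.0, 5.0, 5.1, 25.0, 30.0):
    print(avg, bot_multiplier(avg))
```

Note that this lookup is memoryless, whereas the wrapper defined further down only ever raises the difficulty and never lowers it again.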

We can't directly influence the behaviour of bots with Python code. Instead, we need to use ACS scripts. Those scripts are stored in the `.wad` file alongside each map. To read and edit scripts, you can use [Slade](https://slade.mancubus.net/). Using ACS, it is quite easy to modify game variables. The following snippet is all we need for the task. For more information about ACS and what you can do with it, refer to [ZDoom's ACS documentation](https://zdoom.org/wiki/ACS).


<div id=acs_script></div>

### ACS Script:
---
```C
#include "zcommon.acs"

global int 0:reward;

int difficulty_level = 5;
int speed_levels[6] = {0.1, 0.2, 0.4, 0.6, 0.8, 1.0};
int health_levels[6] = {10, 20, 40, 60, 80, 100};

script 1 OPEN
{
  Log(s:"Level loaded");
}

script 2 ENTER
{
  set_actor_skill(ActivatorTID());
}

script 3 RESPAWN
{
  set_actor_skill(ActivatorTID());
}

script "change_difficulty" (int new_difficulty_level)
{
  Log(s:"Changing difficulty level to: ", d: new_difficulty_level);
  
  difficulty_level = new_difficulty_level;
}

function void set_actor_skill(int actor_id)
{
  if (ClassifyActor(actor_id) & ACTOR_BOT ) {
    Log(s:"Changing difficulty level for bot!", d: actor_id, d: difficulty_level);
    SetActorProperty(actor_id, APROP_Speed , speed_levels[difficulty_level]);
    SetActorProperty(actor_id, APROP_Health , health_levels[difficulty_level]);
  }
}
```
---
<div id=curriculum_env></div>

## Curriculum environment wrapper
To trigger a script defined in ACS from Python, we can use the `puke` and `pukename` console commands. The former runs a script identified by its number, while the latter runs a script identified by its name. (Named scripts are a later addition: ACS originally identified scripts using integers only.)

```Python
game.send_game_command('pukename <script name> <arguments>')
```

For more details, have a look at the ZDoom Wiki. Just like for reward shaping, we will subclass the environment wrapper to add the behaviour we need. Also, we need to make sure that the environment used for evaluating the agent's performance is using the normal difficulty (no curriculum applied). Otherwise we would have biased estimates of our agent's performance.
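
The promotion rule can be tried out in isolation before wiring it to the game. The following is a minimal sketch, assuming the thresholds from the table above; `CurriculumTracker` is a hypothetical class that is not part of the notebook's code. It records total episode rewards in a fixed-size window and raises the level once the window is full and its mean exceeds the current threshold.

```python
from collections import deque
from statistics import fmean

# One threshold per difficulty level, taken from the table above.
THRESHOLDS = [5, 10, 15, 20, 25, 25]

class CurriculumTracker:
    """Tracks episode rewards and decides when to raise the difficulty level."""

    def __init__(self, max_level=5, window=10):
        self.level = 0
        self.max_level = max_level
        self.window = window
        self.rewards = deque(maxlen=window)

    def end_episode(self, total_reward):
        """Record an episode and return True if the level was raised."""
        self.rewards.append(total_reward)
        window_full = len(self.rewards) == self.window
        above_threshold = fmean(self.rewards) > THRESHOLDS[self.level]
        if window_full and above_threshold and self.level < self.max_level:
            self.level += 1
            self.rewards.clear()  # start a fresh window at the new level
            return True
        return False

tracker = CurriculumTracker()
# Hypothetical reward stream: 25 episodes that each yield a total reward of 12.
for _ in range(25):
    tracker.end_episode(12.0)
print(tracker.level)
```

Clearing the window after each promotion forces the agent to prove itself over a full set of episodes at the new difficulty before it can be promoted again.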

In [7]:
from collections import deque

REWARD_THRESHOLDS = [5, 10, 15, 20, 25, 25]

class DoomWithBotsCurriculum(DoomWithBotsShaped):

    def __init__(self, game, frame_processor, frame_skip, n_bots, shaping, initial_level=0, max_level=5, rolling_mean_length=10):
        super().__init__(game, frame_processor, frame_skip, n_bots, shaping)
        
        # Initialize ACS script difficulty level
        game.send_game_command(f'pukename change_difficulty {initial_level}')
        
        # Internal state
        self.level = initial_level
        self.max_level = max_level
        self.rolling_mean_length = rolling_mean_length
        self.last_rewards = deque(maxlen=rolling_mean_length)

    def step(self, action, array=False):
        # Perform action step as usual
        state, reward, done, infos = super().step(action, array)

        # After an episode, check whether difficulty should be increased.
        if done:
            self.last_rewards.append(self.total_rew)
            run_mean = np.mean(self.last_rewards)
            print('Avg. last 10 runs of {}: {:.2f}. Current difficulty level: {}'.format(self.name, run_mean, self.level))
            if run_mean > REWARD_THRESHOLDS[self.level] and len(self.last_rewards) >= self.rolling_mean_length:
                self._change_difficulty()

        return state, reward, done, infos

    def reset(self):
        state = super().reset()
        self.game.send_game_command(f'pukename change_difficulty {self.level}')

        return state

    def _change_difficulty(self):
        """Adjusts the difficulty by setting the difficulty level in the ACS script."""
        if self.level < self.max_level:
            self.level += 1
            print(f'Changing difficulty for {self.name} to {self.level}')
            self.game.send_game_command(f'pukename change_difficulty {self.level}')
            self.last_rewards = deque(maxlen=self.rolling_mean_length)
        else:
            print(f'{self.name} already at max level!')

This is much shorter than the previous wrapper as we only need to check the rolling average reward and optionally send a command to the Doom game instance. Finally, we launch a training session using the wrapper for curriculum learning and wait for 3M steps.

In [None]:
def env_with_bots_curriculum(scenario, **kwargs) -> envs.DoomEnv:
    """Wraps a Doom game instance in an environment with shaped rewards and curriculum."""
    game = game_instance(scenario)
    return DoomWithBotsCurriculum(game, **kwargs)

def vec_env_with_bots_curriculum(n_envs=1, **kwargs) -> VecTransposeImage:
    """Wraps a Doom game instance in a vectorized environment with shaped rewards and curriculum."""
    return VecTransposeImage(DummyVecEnv([lambda: env_with_bots_curriculum(**kwargs)] * n_envs))

# Create environments with bots.
env = vec_env_with_bots_curriculum(2, **env_args)
eval_env = vec_env_with_bots_shaped(1, **eval_env_args) # Don't use adaptive curriculum for the evaluation env!

envs.solve_env(env, eval_env, scenario, agent_args)

The learning speed should now be even better than before! By combining reward shaping and curriculum learning we managed to get a 4x increase in performance for the same number of training steps! We are now fully equipped to efficiently train an agent that will outmatch even the best programmed bot beyond a shadow of a doubt. Let's go!

![Comparison performance](./figures/comparison_shaping_3.png)

<div id=final_model></div>

## Final model

To celebrate the training of the final model, I have created a fancier map that requires more effort from players to navigate and find enemies. The screenshot below shows an overhead view of the new map. You can find the corresponding `.wad` file in the GitHub repository.

![New map](./figures/map_2_scaled.png)

### Architecture

The final model is based on our custom CNN with a couple of changes:

* 512 neurons for the first flat fully connected layer.
* 256 neurons in a fully connected layer for the **value net**.
* 256 neurons in a fully connected layer for the **action net**.

The exact code can be found [here](https://github.com/lkiel/rl-doom/blob/develop/src/models/cnn.py). It is probably more complicated than it needs to be because it supports optionally adding different forms of normalization. Also, the exact number of neurons is not that important: in my experiments, I obtained very good results with fewer than half of the neurons in each layer.

Finally, I also used [frame stacking](https://stable-baselines3.readthedocs.io/en/master/guide/vec_envs.html#vecframestack) as it comes for free when you use the stable-baselines vectorized wrappers, but it does not significantly improve the performance of the agent. I have had similar results with frame stacking set to 1.
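
For readers unfamiliar with the idea, frame stacking simply means that the observation handed to the agent contains the last few frames instead of only the current one. The sketch below illustrates the principle with a plain `deque`. It is a generic illustration, not the stable-baselines3 implementation: the `FrameStack` class is hypothetical, and repeating the first frame on reset is an assumption made here to keep the observation shape constant.

```python
from collections import deque

class FrameStack:
    """Keeps the `n_stack` most recent frames as a single observation."""

    def __init__(self, n_stack):
        self.frames = deque(maxlen=n_stack)

    def reset(self, first_frame):
        # Repeat the first frame so that the observation always has n_stack entries.
        self.frames.clear()
        self.frames.extend([first_frame] * self.frames.maxlen)
        return list(self.frames)

    def step(self, frame):
        # Appending to a full deque drops the oldest frame.
        self.frames.append(frame)
        return list(self.frames)

# Strings stand in for image arrays to keep the example dependency-free.
stack = FrameStack(n_stack=4)
stack.reset('f0')
print(stack.step('f1'))
print(stack.step('f2'))
```

With `n_stack=1` the observation reduces to the current frame only, which is the "frame stacking set to 1" configuration mentioned above.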

### Training
To obtain the final model, I trained with reward shaping and curriculum learning as follows:

1. 10M training steps from scratch using a frame skip parameter of 4.
1. 10M training steps using the previous result and setting the frame skip parameter to 2.
1. 10M training steps using the previous result and setting the frame skip parameter to 1.
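
The three stages above can be written down as a small schedule. The sketch below is only an illustration: the `Stage` class and `game_tics` helper are hypothetical names, and the calculation assumes that each agent step advances the game by exactly `frame_skip` tics, ignoring episode boundaries.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    frame_skip: int   # game tics advanced per agent step
    agent_steps: int  # number of training steps in this stage

SCHEDULE = [
    Stage(frame_skip=4, agent_steps=10_000_000),
    Stage(frame_skip=2, agent_steps=10_000_000),
    Stage(frame_skip=1, agent_steps=10_000_000),
]

def game_tics(stage):
    """Approximate number of game tics simulated during a stage."""
    return stage.frame_skip * stage.agent_steps

for i, stage in enumerate(SCHEDULE, start=1):
    print(f'stage {i}: frame_skip={stage.frame_skip}, game tics={game_tics(stage):,}')
print(f'total agent steps: {sum(s.agent_steps for s in SCHEDULE):,}')
```

Under that assumption, the first stage covers four times as much game time as the last one for the same number of agent decisions, which is what makes a high frame skip attractive early on.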

At the beginning of the training process, the frame skip is set relatively high to speed up the learning. Then, it is progressively reduced to improve the aiming accuracy of the agent. The figure below shows the final "learning curve" for this setup. Notice the sharp jump in performance as soon as we allow the agent to skip fewer frames. Also note that it is more difficult for the agent to get frags on this map due to its more complex structure. Thus, we can't directly compare the performance after 3M steps to the plots shown above.

![Best model training](./figures/final_training_rewards.png)

The best model reaches 27 frags on average per game of 2:30 minutes. This is around one frag every 5.5 seconds! Here is an animation of the resulting agent in action, destroying the competition:

![Final agent](./figures/deathmatch_stack=4.gif)

<div id="bonus"></div>

# Bonus: play against your agent!

Want to see how your skills compare to the AI? There are two helper scripts in the `bin` folder:

* `demo_deathmatch.sh`
* `demo_multiplayer.sh`

The first one starts a game of deathmatch with 8 bots and a pretrained agent. Use it if you want a demonstration of what the agent can do. The second one will spawn two instances of Doom, one for a human player and one for the pretrained agent. Each player joins the same deathmatch game with 7 programmed bots. Good luck! 

Note: Because Jupyter notebooks are limited when it comes to running multiple processes in parallel, I could not include the demo directly in this notebook.

<div id="conclusion"></div>

# Conclusion

Over the course of this three-part series we managed to train a reinforcement learning agent to play Doom deathmatch games. We started with a basic setup able to solve very simple scenarios where only a limited number of actions were allowed and where the complexity of the learning task was significantly constrained. We worked our way towards more elaborate tasks by increasing the number of parameters in our model and closely monitoring the learning process to ensure a smooth progression. Finally, we saw how reward shaping and curriculum learning could boost the training to reach good results much quicker.


## Potential improvements
If you have played a game of deathmatch against a fully trained agent you will have noticed that it is actually quite hard to keep up in terms of score. However, when playing in a 1 versus 1 it remains quite easy to defeat it due to its overall lack of strategy. I've identified three aspects that can be improved:

### Memory
You might have noticed that the agent has no concept of memory. This means that enemies that are not visible on the screen are immediately forgotten by the agent. Also, the agent does not keep track of places that it has already visited. This is not a big issue when playing against 8 programmed bots as there is always an enemy close by. However, when playing against a single opponent this means that the agent will revisit the same location several times or simply ignore some area of the map it should have explored.

A potential improvement here would be to use a model that has a concept of memory, such as an LSTM neural network. This paper shows that such a model can be used to play Doom effectively.
> Lample, Guillaume, and Devendra Singh Chaplot. “Playing FPS Games with Deep Reinforcement Learning.” ArXiv:1609.05521 [Cs], Jan. 2018. arXiv.org, http://arxiv.org/abs/1609.05521.

### Difficulty
If you have played yourself, you might have noticed that the programmed bots are not the smartest of opponents. They will often get stuck against walls or randomly run across the map. This also means that the amount of strategy needed by our agent to get good rewards is not enormous. It might be interesting to see whether we can increase the performance of the agent by letting it play against versions of itself. Stronger opponents mean the agent will potentially learn more interesting strategies.

### Aggressiveness
The agent prefers attacking to picking up strategic items or protecting itself. It is hard to point to a single cause, but it might be possible to mitigate this behaviour by picking different weights for the reward shaping process or by defining new actions to be reinforced altogether.