<div id=top></div>

# Reinforcement Learning with Doom - Increasing complexity and monitoring the model

Leandro Kieliger
contact@lkieliger.ch

---
## Description

In this notebook, we are going to significantly improve the learning efficiency of the setup created in the previous part of this series. First, we will see how to modify rewards to incentivize behaviors that help the agent reach its original goal, a method called "reward shaping". In the second part, we will design an adaptive learning process that varies the difficulty of the training environment based on the agent's performance.


### [Part 1 - Reward Shaping](#part_1)
* [aaa](#aaa)
* [bbb](#bbb)

    
### [Part 2 - Curriculum Learning](#part_2)
* [ccc](#ccc)
* [ddd](#ddd)
    
    
### [Part 3 - Bonus visualisation](#part_3)
* [eee](#eee)

<div id=part_1></div>

# [^](#top) Part 1 - Reward Shaping


## Preparations

In [1]:
%load_ext autoreload
%autoreload 2

import cv2
import gym
import matplotlib.pyplot as plt
import numpy as np
import torch as th
import typing as t
import vizdoom
from stable_baselines3 import ppo
from stable_baselines3.common.callbacks import EvalCallback
from stable_baselines3.common import evaluation, policies
from torch import nn

from common import envs, plotting

In the previous notebook, we saw that the learning process was very slow. Indeed, even after training for more than 2 million steps, our agent barely reached 2 frags per match on average. In comparison, the best bot manages to get around 13 frags.

We also discussed one of the main reasons why the model had so much difficulty getting started: rewards are sparse. That is, the agent has to execute many steps "just right" before it can observe a meaningful reward signal. It must manage to move and aim at enemies while repeatedly shooting them in order to (possibly) get some reward. Such a sequence of actions rarely happens by chance, and if rewards are rare, it takes a long time to reinforce good behaviors.

To solve the issue of sparse rewards, we can give our agent small positive rewards for every action we believe will be beneficial to the learning process, and small penalties for actions we want to discourage. Here is the list of actions we would like to incentivize or discourage, together with the associated rewards:

| Action                     | Reward                  |
| -------------------------- | ----------------------- |
| Frag                       |  1 per frag             |
| Damaging enemies           |  0.01 per damage point  |
| Picking up ammunition      |  0.02 per unit          |
| Using ammunition           | -0.01 per unit          |
| Picking up health          |  0.02 per health point  |
| Losing health              | -0.01 per health point  |
| Picking up armor           |  0.01 per armor point   |
| Moved distance > 3 units   |  5e-5 per step          |
| Moved distance < 3 units   | -2.5e-3 per step        |

Note that players typically have 100 health points and that damage points correspond to the number of enemy health points that were removed. Also, players can typically move at around 16 units per tick. The distance reward is there to discourage "camping" behavior.
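
To make the table more concrete, here is a minimal sketch of how the shaped reward for a single step could be computed from the change in game variables between two consecutive steps. The `GameSnapshot` container, the `shaped_reward` function and the constant names are hypothetical and only serve as illustration; they are not taken from the `common` package or from ViZDoom. The sketch also assumes that a move of exactly 3 units counts as standing still and that armor loss is not penalized (the table does not list it), and it ignores special cases such as respawning.

```python
import math
from dataclasses import dataclass

# Reward coefficients copied from the table above.
REWARD_FRAG = 1.0
REWARD_DAMAGE = 0.01
REWARD_AMMO_PICKUP = 0.02
PENALTY_AMMO_USED = -0.01
REWARD_HEALTH_PICKUP = 0.02
PENALTY_HEALTH_LOST = -0.01
REWARD_ARMOR_PICKUP = 0.01
REWARD_MOVING = 5e-5
PENALTY_STANDING = -2.5e-3
MOVE_THRESHOLD = 3.0  # distance in map units per step


@dataclass
class GameSnapshot:
    """Hypothetical container for the game variables read at one step."""
    frags: int
    damage_dealt: float
    ammo: int
    health: float
    armor: float
    x: float
    y: float


def _delta_reward(delta: float, gain_coeff: float, loss_coeff: float) -> float:
    """Reward an increase with gain_coeff and a decrease with loss_coeff."""
    if delta > 0:
        return gain_coeff * delta
    return loss_coeff * -delta


def shaped_reward(prev: GameSnapshot, curr: GameSnapshot) -> float:
    """Sum the shaping terms of the table for one step (illustrative only)."""
    reward = 0.0
    reward += REWARD_FRAG * (curr.frags - prev.frags)
    reward += REWARD_DAMAGE * (curr.damage_dealt - prev.damage_dealt)
    reward += _delta_reward(curr.ammo - prev.ammo,
                            REWARD_AMMO_PICKUP, PENALTY_AMMO_USED)
    reward += _delta_reward(curr.health - prev.health,
                            REWARD_HEALTH_PICKUP, PENALTY_HEALTH_LOST)
    # Assumption: losing armor is neither rewarded nor penalized.
    reward += _delta_reward(curr.armor - prev.armor, REWARD_ARMOR_PICKUP, 0.0)
    # Distance term: small bonus for moving, larger penalty for standing still.
    distance = math.hypot(curr.x - prev.x, curr.y - prev.y)
    reward += REWARD_MOVING if distance > MOVE_THRESHOLD else PENALTY_STANDING
    return reward


# Example step: one frag, 20 damage dealt, 1 bullet used, 10 health lost,
# and a move of 10 units.
prev = GameSnapshot(frags=0, damage_dealt=0.0, ammo=50, health=100.0,
                    armor=0.0, x=0.0, y=0.0)
curr = GameSnapshot(frags=1, damage_dealt=20.0, ammo=49, health=90.0,
                    armor=0.0, x=10.0, y=0.0)
print(f"{shaped_reward(prev, curr):.5f}")  # 1 + 0.2 - 0.01 - 0.1 + 5e-5
```

Keeping every coefficient as a named constant makes it easy to see how small the shaping terms are compared to the reward for a frag, which remains the dominant signal.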