# Training an agent to play Super Mario using player-recorded data

In this exercise you will learn how to use player-generated data to train a neural network to play Super Mario. The results will be evaluated against the results from the Q_Learner exercise.

## 1. Generating data
First, you will have to generate some data for the neural network to train with.
You will have the most fun playing with a USB controller, but if you have none, you can set the following variable to `False` to use the keyboard:

In [2]:
USE_GAMEPAD = False

Gamepad Controls Xbox (PlayStation):<br>
A(X): Jump<br>
B(O): Run<br>
Dpad Right: Move Right<br>
Dpad Left: Move Left<br>
Dpad Down: Duck<br>

Keyboard Controls:<br>
S: Jump<br>
A: Run<br>
Arrow Key Right: Move Right<br>
Arrow Key Left: Move Left<br>
Arrow Key Down: Duck<br>
G: Deactivate Matrix Grid<br>

In [3]:
import os
import subprocess
from d3rlpy.dataset import MDPDataset
from gym_setup import Env
from gamepad_controller import GamepadController
from keyboard_controller import KeyboardController
import numpy as np

level = os.path.join("levels", "RoughTerrainLevel.lvl") # TODO: Use a very easy level for this exercise

# One dataset file per level; make sure the target directory exists before dumping
dataset_path = os.path.join("data", "datasets", os.path.basename(level) + ".h5")
os.makedirs(os.path.dirname(dataset_path), exist_ok=True)

try:
    # Start the game server. The arguments are passed as a list, so no shell is needed
    # (with shell=True, POSIX systems would run only 'java' and drop the other arguments).
    with subprocess.Popen(['java', '-jar', 'server.jar']) as server:
        env = Env(visible=True, port=8080, level=level, run_server=False).env
        if USE_GAMEPAD:
            controller = GamepadController(env)
        else:
            controller = KeyboardController(env)
        # Record one episode per iteration until the server connection is closed
        while True:
            observation = env.reset()
            done = False
            action = controller.read()

            observations = [observation]
            actions = [action]
            rewards = [0]  # No reward at first time step, because no action was taken yet
            terminals = [done]

            while not done:
                observation, reward, done, info = env.step(action)
                action = controller.read()

                observations.append(observation)
                actions.append(action)
                rewards.append(reward)
                terminals.append(done)

            # Append the episode to the existing dataset for this level, or create a new one
            if os.path.isfile(dataset_path):
                dataset = MDPDataset.load(dataset_path)
                dataset.append(np.asarray(observations), np.asarray(actions), np.asarray(rewards),
                               np.asarray(terminals))
            else:
                dataset = MDPDataset(np.asarray(observations), np.asarray(actions), np.asarray(rewards),
                                     np.asarray(terminals), discrete_action=True)
            dataset.dump(dataset_path)
            stats = dataset.compute_stats()
            mean = stats['return']['mean']
            std = stats['return']['std']
            print(f'mean: {mean}, std: {std}')
except ConnectionResetError:
    # The server closed the connection: recording is finished
    pass
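
The recording loop above stores four parallel lists per episode. As a minimal sketch of how they line up, the following example replaces the Mario server with a hypothetical `StubEnv` (not part of the exercise code) and the controller with a fixed policy. Index `t` of each list holds the observation at step `t`, the action chosen in it, the reward received for the previous action, and whether the episode has ended.

```python
class StubEnv:
    """Hypothetical stand-in for the Mario environment (illustration only)."""

    def __init__(self, length=3):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        # Action 1 moves one tile to the right, any other action stays put.
        if action == 1:
            self.position += 1
        reward = 1.0 if action == 1 else 0.0
        done = self.position >= self.length
        return self.position, reward, done, {}


def record_episode(env, policy):
    """Record one episode in the same layout as the recording cell."""
    observation = env.reset()
    done = False
    action = policy(observation)
    observations, actions, rewards, terminals = [observation], [action], [0.0], [done]
    while not done:
        observation, reward, done, _ = env.step(action)
        action = policy(observation)
        observations.append(observation)
        actions.append(action)
        rewards.append(reward)
        terminals.append(done)
    return observations, actions, rewards, terminals


obs, acts, rews, terms = record_episode(StubEnv(length=3), policy=lambda o: 1)
print(obs)    # [0, 1, 2, 3]
print(rews)   # [0.0, 1.0, 1.0, 1.0]
print(terms)  # [False, False, False, True]
```

All four lists have the same length, which is what `MDPDataset` expects when they are converted to arrays.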


## 2. Use the generated data to train a policy
Now that you have generated some data for the neural network to train with, let's begin with the training.
For the purpose of this exercise we will use the Offline RL Python library d3rlpy.

### 2.1 Choosing an algorithm
![DQN](https://raw.githubusercontent.com/koerners/marioai/master/exercise_offline_rl/data/jupyter/dqn.PNG)

#### Why did we choose DQN?
Our main goal was to show how the quality of the data influences the performance of an RL algorithm.
Creating measurement and training data for a reinforcement learning algorithm often consumes a lot of resources,
so in practice you often cannot afford to generate large amounts of training data.
In this exercise we therefore try to teach an offline RL algorithm to solve complex Mario levels with a limited amount
of data (1 hour of player-created data per level). As a baseline for comparison, we trained a Q-learner for 10 000 episodes
on the chosen levels. <br>
We chose DQN as our offline algorithm because it is built on a similar foundation to the Q-learner and is also
efficient. After recording player data and Q-learner data, we train one model on the player data and one on the
Q-learner data for each level and compare their results. <br>
We expect our data to perform better than the Q-learner data even though there is significantly less of it. As players,
we only have to master the mechanics of the game and can plan ahead from the first episode we play, while the Q-learner
needs to learn everything from scratch. Since we mainly evaluate complex levels, the Q-learner
will need a lot of time to learn to finish a level, or might not be able to finish it at all. In contrast,
we expect our recorded episodes to have a 100% win ratio on the levels, which gives the network better data to learn from.
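
The shared foundation of the Q-learner and DQN is the temporal-difference target `r + gamma * max_a' Q(s', a')`; DQN approximates `Q` with a neural network instead of a table. A minimal sketch of the tabular update follows. The state names, the action list, and the step size `alpha` are made up for illustration and are not taken from the Q_Learner exercise.

```python
from collections import defaultdict


def q_update(q_table, state, action, reward, next_state, done, actions, gamma=0.99, alpha=0.1):
    """One tabular Q-learning step: move Q(s, a) towards r + gamma * max_a' Q(s', a')."""
    best_next = 0.0 if done else max(q_table[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    q_table[(state, action)] += alpha * (target - q_table[(state, action)])
    return target


q = defaultdict(float)                 # all Q-values start at 0
actions = ["left", "right", "jump"]    # illustrative action set
q[("s1", "jump")] = 2.0                # pretend we already learned that jumping in s1 is good

target = q_update(q, "s0", "right", reward=1.0, next_state="s1", done=False, actions=actions)
print(target)               # 1.0 + 0.99 * 2.0
print(q[("s0", "right")])   # moved 10% of the way from 0 towards the target
```

A Q-learner has to visit each state-action pair itself to fill this table, whereas the offline approach applies the same kind of target to transitions that a player has already recorded.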

### 2.2 Set up the training
Let's set up some parameters before the training:

In [None]:
from data.datasets.getDatasets import getDataset
from gym_setup import Env
import d3rlpy
import pathlib
from d3rlpy.metrics.scorer import evaluate_on_environment
from sklearn.model_selection import train_test_split
import copy
import socket
from contextlib import closing

MODEL_DIR = pathlib.Path("data", "models")
if not MODEL_DIR.exists():
    MODEL_DIR.mkdir(parents=True)

# Environment settings
port = 8081
run_server = True
visible = False

# Training parameters
gamma = 0.99
learning_rate = 0.0003
target_update_interval = 3000
n_epochs = 1000
test_size = 0.1
batch_size = 2
n_frames = 1
n_steps = 40
use_gpu = True
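
Of these parameters, `target_update_interval` is probably the least self-explanatory: DQN computes its training target with a second, delayed copy of the network and refreshes that copy only every `target_update_interval` gradient steps. The sketch below illustrates the schedule with plain counters. The function name is hypothetical and this is not d3rlpy's internal code.

```python
def sync_steps(total_steps, target_update_interval):
    """Return the gradient steps at which the target network would be refreshed."""
    return [step for step in range(1, total_steps + 1) if step % target_update_interval == 0]


print(sync_steps(total_steps=10000, target_update_interval=3000))  # [3000, 6000, 9000]
```

A larger interval keeps the target stable for longer, while a smaller one lets it follow the online network more closely.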

### 2.3 Training time!
To start the training, run the next cell. If you want to follow the progress of your training, you can additionally open a new terminal and run ``pipenv run board`` to open TensorBoard.

In [None]:
# `level` is the level path defined in section 1; `port` comes from the settings above
env = Env(visible=visible, port=port, level=level, run_server=run_server).env

dataset = getDataset()

# hold out a fraction of the recorded episodes for evaluation
train_episodes, test_episodes = train_test_split(dataset, test_size=test_size)

dqn = d3rlpy.algos.DQN(learning_rate=learning_rate, gamma=gamma, use_gpu=use_gpu,
                       target_update_interval=target_update_interval, batch_size=batch_size)

# initialize the networks from the observation and action shapes in the dataset
dqn.build_with_dataset(dataset)
# set environment in scorer function
evaluate_scorer = evaluate_on_environment(env)
# score the untrained network on the environment as a baseline
initial_score = evaluate_scorer(dqn)
print(f'score before training: {initial_score}')
# pathlib handles both '/' and '\\' path separators when extracting the level file name
name = 'marioai_%s_%s_%s_%s_%s' % (pathlib.Path(level).name, gamma, learning_rate, target_update_interval, n_epochs)
currentMax = -100000
dqn_max = copy.deepcopy(dqn)

# train offline; keep the best-scoring Q-function and roll back after every worse epoch
for epoch, metrics in dqn.fitter(train_episodes, eval_episodes=test_episodes, tensorboard_dir='runs',
                                 experiment_name=name, n_epochs=n_epochs,
                                 scorers={'environment': evaluate_scorer}):
    if metrics.get("environment") > currentMax:
        currentMax = metrics.get("environment")
        dqn_max.copy_q_function_from(dqn)
    else:
        dqn.copy_q_function_from(dqn_max)

model_file = pathlib.Path(MODEL_DIR, name + ".pt")
dqn.save_model(model_file)
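
The training loop keeps a copy of the best-scoring Q-function and resets the live model to it whenever an epoch scores worse. The same pattern in isolation might look like the sketch below, where a dictionary stands in for the model weights and the per-epoch scores are invented for illustration.

```python
import copy


def train_with_rollback(epoch_results):
    """epoch_results: iterable of (weights, score) pairs, one per epoch."""
    best_score = float("-inf")
    best_weights = None
    history = []
    for weights, score in epoch_results:
        if score > best_score:
            # New best epoch: remember its weights
            best_score = score
            best_weights = copy.deepcopy(weights)
        # Otherwise the live model would be reset to best_weights at this point
        history.append(best_score)
    return best_weights, best_score, history


# Hypothetical weights and environment scores for four epochs
results = [({"w": 1}, -50.0), ({"w": 2}, 120.0), ({"w": 3}, 80.0), ({"w": 4}, 193.0)]
best_weights, best_score, history = train_with_rollback(results)
print(best_weights, best_score)  # {'w': 4} 193.0
print(history)                   # [-50.0, 120.0, 120.0, 193.0]
```

One consequence of this design is that the saved model is selected on the environment score of a single evaluation run per epoch, so a noisy evaluation can favour a lucky epoch.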

### 2.4 See what worked
Now let's see if the training did something:

In [None]:
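env = Env(visible=True, level=level, port=8082).env
dqn = d3rlpy.algos.DQN()
# the networks must be built before the saved weights can be loaded
dqn.build_with_dataset(getDataset())
dqn.load_model(model_file)  # path written by the training cell in 2.3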

while True:
    observation = env.reset()
    done = False
    total_reward = 0
    while not done:
        observation, reward, done, info = env.step(dqn.predict([observation])[0])
        total_reward += reward

    print(f'finished episode, total_reward: {total_reward}')
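
The loop above runs until it is interrupted and prints one total reward per episode. To get a single number for comparison, one option is to run a fixed number of episodes and average the totals. The sketch below assumes an environment with the same `reset`/`step` interface; `FixedLengthEnv` and the constant policy are hypothetical stand-ins so that the example runs without the Mario server.

```python
import statistics


class FixedLengthEnv:
    """Hypothetical stub environment: every episode lasts `length` steps."""

    def __init__(self, length=4):
        self.length = length
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 1 else -1.0
        done = self.t >= self.length
        return self.t, reward, done, {}


def evaluate(env, policy, n_episodes=5):
    """Run `n_episodes` full episodes and return (mean, population std) of the total rewards."""
    returns = []
    for _ in range(n_episodes):
        observation = env.reset()
        done = False
        total_reward = 0.0
        while not done:
            observation, reward, done, _ = env.step(policy(observation))
            total_reward += reward
        returns.append(total_reward)
    return statistics.mean(returns), statistics.pstdev(returns)


mean_return, std_return = evaluate(FixedLengthEnv(length=4), policy=lambda obs: 1)
print(f'mean: {mean_return}, std: {std_return}')
```

With the trained model, the policy could be `lambda obs: dqn.predict([obs])[0]`, as in the cell above.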

## 3. Offline RL vs Online RL 
Now we want to compare the online Q_Learner from exercise 1 with the results you can get from the offline RL approach.

### 3.0 Reproducibility
To compare as fairly as possible, we ran both the online Q_Learner and the offline Deep Q Network until their performance plateaued.

### 3.1 Easy level

| Agent | Reward | Video |
| -------- | -------- | -------- |
| Q_Learner  | 232     | ![Gif](https://raw.githubusercontent.com/koerners/marioai/master/exercise_offline_rl/data/jupyter/rough_terrain_q_learner_232.gif)    |
| Deep Q Network  | 280    | ![Gif](https://raw.githubusercontent.com/koerners/marioai/master/exercise_offline_rl/data/jupyter/rough_terrain_dqn_280.gif)    |


### 3.2 Medium level

| Agent | Reward | Video |
| -------- | -------- | -------- |
| Q_Learner  | 176     | ![Gif](https://raw.githubusercontent.com/koerners/marioai/master/exercise_offline_rl/data/jupyter/cliff_and_enemies_q_learner_176.gif)    |
| Deep Q Network  | 193    | ![Gif](https://raw.githubusercontent.com/koerners/marioai/master/exercise_offline_rl/data/jupyter/cliff_and_enemies_dqn_193.gif)    |

### 3.3 Hard level
| Agent | Reward | Video |
| -------- | -------- | -------- |
| Q_Learner  | -559     | ![Gif](https://raw.githubusercontent.com/koerners/marioai/master/exercise_offline_rl/data/jupyter/climb_q_learner_-559.gif)    |
| Deep Q Network  | -34    | ![Gif](https://raw.githubusercontent.com/koerners/marioai/master/exercise_offline_rl/data/jupyter/climb_dqn_-34.gif)    |

### 3.4 Summary

![Summary](https://raw.githubusercontent.com/koerners/marioai/master/exercise_offline_rl/data/jupyter/level-summary.png)


Both the data from our played episodes and the data from the Q-learner used the same reward settings. Our approach
to the rewards was to keep them as simple as possible. The goal was to complete complex levels and see how well a model performs with a small amount
of good data compared to a (comparatively) large amount of Q-learner data (where the Q-learner itself had to learn the level first).
We then trained the network on each dataset for 100 epochs, using
the same hyperparameters for both to keep the comparison fair. Our data performed much better. This is because
we reached the goal from the very first episode and tried different ways of doing so, giving the network the chance to learn
from different action sequences that lead to the goal. Meanwhile, the Q-learner had problems completing the levels within
10 000 episodes, since they were too complex to learn in such a short time without any reference data. This is visible
in all three graphs. At some point the Q-learner did complete a level, but it had not yet optimized its Q-table, which is why our
data results in higher rewards on every level. The graph for the hard level shows negative rewards for both the model trained on our
data and the model trained on the Q-learner data, since neither managed to complete the level. The level was too complex:
it requires a lot of climbing, and even though the data for that level was good as well, the network did not manage to
learn to climb over a high wall.
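
To put numbers on the comparison, the reward gap per level can be computed from the tables in sections 3.1 to 3.3. The snippet below only restates those table values; the dictionary layout is an illustrative choice.

```python
# Final rewards copied from the tables in sections 3.1 to 3.3
rewards = {
    "easy":   {"q_learner": 232,  "dqn": 280},
    "medium": {"q_learner": 176,  "dqn": 193},
    "hard":   {"q_learner": -559, "dqn": -34},
}

# Difference between the offline DQN and the online Q_Learner on each level
gaps = {level_name: r["dqn"] - r["q_learner"] for level_name, r in rewards.items()}
for level_name, gap in gaps.items():
    print(f"{level_name}: DQN is ahead by {gap}")
# easy: DQN is ahead by 48
# medium: DQN is ahead by 17
# hard: DQN is ahead by 525
```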

### 3.5 Performance over 100 levels

// TODO

## 4. Conclusion