# Training an agent to play Super Mario using player-recorded data

In this exercise you will learn how to use player generated data to train a neural network to play Super Mario. The results will be evaluated against the results from the Q_Learner exercise. 

## 1. Generating data
First, you will have to generate some data for the neural network to train with.


### 1.1 Player-generated data

![Mario](https://media1.giphy.com/media/aX0RqLt2ARSW4/giphy.gif?cid=ecf05e47fnkts3fqh25tj9v8noh9vnccwo4x0ey4zpdxc7ft&rid=giphy.gif&ct=g)


To achieve the best possible results, the training algorithm needs the best possible data. In this case that means player generated data.


You will have the most fun playing with a USB controller, but if you do not have one, set the following variable to ``False`` to use the keyboard instead:

In [None]:
# Don't forget to run me
USE_GAMEPAD = False

### Controls
|            	| Keyboard    	| Xbox       	| Playstation 	|
|------------	|-------------	|------------	|-------------	|
| Jump       	| S           	| A          	| X           	|
| Sprint     	| A           	| B          	| O           	|
| Move Right 	| Arrow Right 	| Dpad Right 	| Dpad Right  	|
| Move Left  	| Arrow Left  	| Dpad Left  	| Dpad Left   	|
| Duck       	| Arrow Down  	| Dpad Down  	| Dpad Down   	|

### Level
The level we will be using for this exercise will be a very easy one to minimize the training time.

However, if you would like to try different levels, we encourage you to do so by changing the ``level`` variable below to a different one from the ``exercise_offline_rl\levels`` folder.
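
To see which level files are available, a small helper along the following lines may be useful. This is only a sketch: `list_levels` is a hypothetical helper name, and it assumes that every level is stored as a `.lvl` file directly inside the `levels` folder, as `OneCliffLevel.lvl` is.

```python
import pathlib


def list_levels(folder="levels"):
    """Return the sorted names of all .lvl files found directly in `folder`."""
    # Assumption: levels are plain files with a .lvl extension.
    return sorted(path.name for path in pathlib.Path(folder).glob("*.lvl"))


# Prints an empty list if the folder does not exist or contains no levels.
print(list_levels())
```

Any of the printed names can then be used in place of `OneCliffLevel.lvl` when setting the ``level`` variable.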

In [None]:
# Set up the imports and global variables. Run this cell again if you encounter any import errors.
import os
from d3rlpy.dataset import MDPDataset
from gym_setup import Env
from Controller.gamepad_controller import GamepadController
from Controller.keyboard_controller import KeyboardController
import numpy as np
from data.datasets.getDatasets import getDataset
import d3rlpy
import pathlib
from d3rlpy.metrics.scorer import evaluate_on_environment
from sklearn.model_selection import train_test_split
import copy

level = os.path.join("levels", "OneCliffLevel.lvl") # <--- change here if you want to try a different level.

dataset_path = os.path.join(
    "data", "datasets", os.path.split(level)[1] + ".h5")

dataset_path_rand = os.path.join(
    "data", "datasets", os.path.split(level)[1] + ".random.h5")

MODEL_DIR = pathlib.Path("data", "models")
# Create the model directory if it does not exist yet.
MODEL_DIR.mkdir(parents=True, exist_ok=True)




In [None]:
# Let's play!
try:
    env_play = Env(visible=True, level=level).env
    if USE_GAMEPAD:
        controller = GamepadController(env_play)
    else:
        controller = KeyboardController(env_play)
    while True:
        observation = env_play.reset()
        done = False
        action = controller.read()

        observations = [observation]
        actions = [action]
        # No reward at first time step, because no action was taken yet
        rewards = [0]
        terminals = [done]

        while not done:
            observation, reward, done, info = env_play.step(action)
            action = controller.read()

            observations.append(observation)
            actions.append(action)
            rewards.append(reward)
            terminals.append(done)

        if os.path.isfile(dataset_path):
            dataset = MDPDataset.load(dataset_path)
            dataset.append(np.asarray(observations), np.asarray(actions), np.asarray(rewards),
                            np.asarray(terminals))
        else:
            dataset = MDPDataset(np.asarray(observations), np.asarray(actions), np.asarray(rewards),
                                    np.asarray(terminals), discrete_action=True)
        dataset.dump(dataset_path)
        stats = dataset.compute_stats()
        mean = stats['return']['mean']
        std = stats['return']['std']
        print(f'mean: {mean}, std: {std}')
except ConnectionResetError:
    print("Done")


### 1.2 Randomly generated data (optional)
To complement the player generated data, it is possible and encouraged to also generate some random data for the algorithm to train with.
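
The idea behind the random rollouts can be sketched without the Mario environment. The example below is an illustration only: `ToyEnv` is a made-up stand-in with a gym-style `reset`/`step` interface, and `collect_random_episodes` is a hypothetical helper, not part of this exercise's code. It records one entry per time step in four parallel lists, which is the layout that the dataset constructor in this exercise receives.

```python
import random


class ToyEnv:
    """Made-up 1-D walk: start at 0, reach position 3 to finish."""

    n_actions = 2  # 0 = step left, 1 = step right
    max_steps = 20

    def reset(self):
        self.position = 0
        self.steps = 0
        return self.position

    def step(self, action):
        self.position += 1 if action == 1 else -1
        self.steps += 1
        reached_goal = self.position >= 3
        # The episode ends at the goal or when the step budget is used up.
        done = reached_goal or self.steps >= self.max_steps
        reward = 10 if reached_goal else -1
        return self.position, reward, done, {}


def collect_random_episodes(env, n_episodes, rng):
    """Play `n_episodes` with uniformly random actions and record every step."""
    data = {"observations": [], "actions": [], "rewards": [], "terminals": []}
    for _ in range(n_episodes):
        env.reset()
        done = False
        while not done:
            action = rng.randrange(env.n_actions)
            observation, reward, done, _info = env.step(action)
            data["observations"].append(observation)
            data["actions"].append(action)
            data["rewards"].append(reward)
            data["terminals"].append(done)
    return data


data = collect_random_episodes(ToyEnv(), n_episodes=5, rng=random.Random(0))
print(len(data["observations"]), "steps,", sum(data["terminals"]), "episodes")
```

Because every episode ends with exactly one `True` entry in `terminals`, the number of recorded episodes can always be recovered from the flat lists.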

In [None]:
# Generate random data

EPISODES = 20

env_rand = Env(visible=False, level=level).env

observations = []
actions = []
rewards = []
terminals = []

for episode in range(EPISODES):
    observation = env_rand.reset()
    done = False

    while not done:
        action = env_rand.action_space.sample()
        observation, reward, done, info = env_rand.step(action)
        observations.append(observation)
        actions.append(action)
        rewards.append(reward)
        terminals.append(done)

    # Write the buffered transitions to disk after every 5th episode
    # and after the last one, so that no episode is lost.
    if (episode + 1) % 5 == 0 or episode == EPISODES - 1:
        if os.path.isfile(dataset_path_rand):
            dataset = MDPDataset.load(dataset_path_rand)
            dataset.append(np.asarray(observations), np.asarray(actions), np.asarray(rewards),
                           np.asarray(terminals))
        else:
            dataset = MDPDataset(np.asarray(observations), np.asarray(actions), np.asarray(rewards),
                                 np.asarray(terminals), discrete_action=True)
        dataset.dump(dataset_path_rand)

        # Clear the buffers so the same transitions are not appended again at the next save.
        observations.clear()
        actions.clear()
        rewards.clear()
        terminals.clear()

        stats = dataset.compute_stats()
        mean = stats['return']['mean']
        std = stats['return']['std']
        print(f'mean: {mean}, std: {std}')


## 2. Use the generated data to train a policy
Now that you have generated some data for the neural network to train with, let's begin with the training.
For this exercise we will use the offline RL Python library d3rlpy.

Our main goal is to show how much good data influences the performance of an RL algorithm.
Generating and measuring training data for a reinforcement learning algorithm takes a lot of resources,
so you usually cannot afford to create a large amount of it.
We trained our Deep Q-Network with one hour of gameplay data that we recorded ourselves for each map.
Even with such a small amount of data, our Mario AI learned to complete complex levels.
To show the weight of that "good" data, we compared it to a Q-learner that learned for 10 000 episodes and whose data
was used to train a second model of the network.

Note: We have pregenerated some training data for your convenience, which will be used in addition to your own data to train the model.
If you want to train with your own data only, go ahead and delete the data from ``exercise_offline_rl\data\datasets``.

### 2.1 Choosing an algorithm
![DQN](https://raw.githubusercontent.com/koerners/marioai/master/exercise_offline_rl/data/jupyter/dqn.PNG)

#### Why did we choose DQN?
Our main goal is to show how much good data influences the performance of an RL algorithm.
Generating and measuring training data for a reinforcement learning algorithm often consumes a lot of resources,
so you usually cannot afford to create a large amount of it.
In this exercise we therefore try to teach an offline RL algorithm to solve complex Mario levels with a limited amount of
data (1 hour of player-created data per level). To compare our results, we trained a Q-learner for 10 000 episodes
on the chosen levels. <br>
We chose DQN as our offline algorithm because it is built on similar groundwork to the Q-learner
and is also highly efficient. After recording player-based data and Q-learner data, we train a player
model and a Q-learner model for each map and compare their results. <br>
We expect our data to perform better than the Q-learner data even though there is significantly less of it. This is because we
as players only have to master the mechanics of the game and can plan ahead from the first episode we play, while the Q-learner
needs to learn everything from scratch. Since we mainly evaluate complex levels, the Q-learner
will need a lot of time to learn to finish a level, or might not be able to finish it at all. In comparison,
we expect our own data to have a 100% win ratio on the levels, which makes for better training data for the network.
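
The shared groundwork can be stated compactly. The formulas below are the standard textbook forms of the two methods, not something specific to this exercise. Here $\alpha$ denotes the learning rate, $\gamma$ the discount factor, $\theta$ the network weights and $\theta^-$ the weights of the periodically updated target network.

A tabular Q-learner updates one table entry per observed transition $(s, a, r, s')$:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

DQN replaces the table with a neural network $Q_\theta$ and minimizes a loss on the same temporal-difference error, written here with a squared error:

$$L(\theta) = \mathbb{E}_{(s, a, r, s')} \left[ \left( r + \gamma \max_{a'} Q_{\theta^-}(s', a') - Q_\theta(s, a) \right)^2 \right]$$

In both cases the term $r + \gamma \max_{a'} Q(s', a')$ is the target that the current estimate is moved towards.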

### 2.2 Set up the training
Let's set up some parameters before the training:

In [None]:
# Training parameters
gamma = 0.99
learning_rate = 0.0003
target_update_interval = 3000
n_epochs = 10 # <--- change here if you want to train more / less
test_size = 0.1
batch_size = 2
n_frames = 1
n_steps = 40
use_gpu = False
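
To get a feeling for `gamma`, the following sketch computes a discounted return and a one-step temporal-difference target by hand. `discounted_return` and `td_target` are hypothetical helper names introduced for illustration; they are not part of d3rlpy, which handles these computations internally.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward at step t is weighted by gamma ** t."""
    total = 0.0
    # Work backwards so each step only needs one multiplication.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def td_target(reward, next_q_values, done, gamma):
    """One-step target: r if the episode ended, else r + gamma * max Q(s', a')."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


print(discounted_return([1, 1, 1], gamma=0.5))  # 1.75
print(td_target(reward=1.0, next_q_values=[0.0, 2.0], done=False, gamma=0.5))  # 2.0
```

With `gamma = 0.99`, a reward that lies 100 steps in the future still carries a weight of about 0.37, so the agent takes fairly distant consequences into account.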

### 2.3 Training time!

To start the training, run the next cell. If you want to follow the progress of your training, you can additionally open a new terminal and run ``pipenv run board`` to open TensorBoard.

In [None]:
env_train = Env(visible=False, level=level, run_server=True).env

dataset = getDataset()

train_episodes, test_episodes = train_test_split(dataset, test_size=test_size)

dqn = d3rlpy.algos.DQN(learning_rate=learning_rate, gamma=gamma, use_gpu=use_gpu,
                        target_update_interval=target_update_interval, batch_size=batch_size)

# Initialize the networks from the shapes found in the dataset
dqn.build_with_dataset(dataset)
# Create a scorer that evaluates the algorithm in the environment
evaluate_scorer = evaluate_on_environment(env_train)
# Evaluate the untrained algorithm once as a baseline
rewards = evaluate_scorer(dqn)
# os.path.basename also handles the path separator used on Windows
name = 'marioai_%s_%s_%s_%s_%s' % (os.path.basename(level), gamma, learning_rate, target_update_interval, n_epochs)
model_file = pathlib.Path(MODEL_DIR, name + ".pt")
currentMax = -100000
dqn_max = copy.deepcopy(dqn)

# Train offline, one epoch at a time
for epoch, metrics in dqn.fitter(train_episodes, eval_episodes=test_episodes, tensorboard_dir='runs',
                                 experiment_name=name, n_epochs=n_epochs,
                                 scorers={'environment': evaluate_scorer}):
    if metrics.get("environment") > currentMax:
        # New best epoch: remember its score and its Q-function
        currentMax = metrics.get("environment")
        dqn_max.copy_q_function_from(dqn)
    else:
        # No improvement: roll back to the best Q-function seen so far
        dqn.copy_q_function_from(dqn_max)

    dqn.save_model(model_file)


### 2.4 See what worked
Now let's see if the training did something:

In [None]:
env_show = Env(visible=True, level=level).env
dqn = d3rlpy.algos.DQN()
dqn.build_with_dataset(getDataset())
dqn.load_model(model_file)

while True:
    observation = env_show.reset()
    done = False
    total_reward = 0
    while not done:
        observation, reward, done, info = env_show.step(
            dqn.predict([observation])[0])
        total_reward += reward

    print(f'finished episode, total_reward: {total_reward}')


## 3. Offline RL vs Online RL 
Now we want to compare the approach from exercise 1 where an online Q_Learner was used with the results one can get with the offline RL approach.

### 3.0 Reproducibility
To compare the two approaches as fairly as possible, we ran both the online Q_Learner and the offline Deep Q Network until their performance reached a plateau.
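
How a plateau is detected is a matter of choice. One simple possibility, shown purely as an illustration and not as the criterion used for the results here, is to compare the average score of the most recent evaluations with the average of the evaluations before them. The function name `has_plateaued` and its `window` and `tolerance` parameters are assumptions made for this sketch.

```python
def has_plateaued(scores, window=5, tolerance=1.0):
    """Return True if the mean of the last `window` scores improved by at
    most `tolerance` over the mean of the `window` scores before them."""
    if len(scores) < 2 * window:
        # Not enough evaluations yet to compare two full windows.
        return False
    previous = scores[-2 * window:-window]
    recent = scores[-window:]
    improvement = sum(recent) / window - sum(previous) / window
    return improvement <= tolerance


print(has_plateaued([0, 10, 20, 30, 40, 50, 60, 70, 80, 90]))  # still improving
print(has_plateaued([100] * 10))  # flat
```

A larger `window` makes the check less sensitive to noisy single evaluations, at the cost of training for longer before stopping.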

### 3.1 Easy level

|  | Reward | Video |
| -------- | -------- | -------- |
| Q_Learner  | 232     | ![Gif](https://raw.githubusercontent.com/koerners/marioai/master/exercise_offline_rl/data/jupyter/rough_terrain_q_learner_232.gif)    |
| Deep Q Network  | 280    | ![Gif](https://raw.githubusercontent.com/koerners/marioai/master/exercise_offline_rl/data/jupyter/rough_terrain_dqn_280.gif)    |

With sufficient training, neither model struggles with the easy level. However, the model fed with player-generated data shows better anticipation of jumps, which leads to a better overall result.

### 3.2 Medium level

|  | Reward | Video |
| -------- | -------- | -------- |
| Q_Learner  | 176     | ![Gif](https://raw.githubusercontent.com/koerners/marioai/master/exercise_offline_rl/data/jupyter/cliff_and_enemies_q_learner_176.gif)    |
| Deep Q Network  | 193    | ![Gif](https://raw.githubusercontent.com/koerners/marioai/master/exercise_offline_rl/data/jupyter/cliff_and_enemies_dqn_193.gif)    |

In the medium level, both models behave similarly to the easy level. While the Online Learner seems to have a better strategy to avoid enemies, the Offline Learner has the better jumping performance leading to a better overall score as it is quicker to finish the level.

### 3.3 Hard level
|  | Reward | Video |
| -------- | -------- | -------- |
| Q_Learner  | -559     | ![Gif](https://raw.githubusercontent.com/koerners/marioai/master/exercise_offline_rl/data/jupyter/climb_q_learner_-559.gif)    |
| Deep Q Network  | -34    | ![Gif](https://raw.githubusercontent.com/koerners/marioai/master/exercise_offline_rl/data/jupyter/climb_dqn_-34.gif)    |

Neither model manages to complete the really hard level we tested them on. It is, however, interesting to observe the different strategies they applied. The Q_Learner shows a very promising leap to the middle platform, but after failing to reach the final platform it seems to just give up and wait for the time to run out, which leads to a high time penalty and therefore an extremely low score. The offline-trained model also fails to complete the level, but it has developed a strategy to avoid the high time penalty by dying as soon as possible.

### 3.4 Summary

![Summary](https://raw.githubusercontent.com/koerners/marioai/master/exercise_offline_rl/data/jupyter/level-summary.png)


Both the data from our played episodes and the data from the Q-learner used the same reward settings. Our approach
to the rewards was that simple is best. The goal was to complete complex levels and to see how well the network performs with
a small amount of good data compared to a (comparatively) large amount of Q-learning data (where the Q-learner itself had to
learn the level first). Having a very good but small dataset, we trained the network with each dataset for 100 epochs. Both runs
again used the same hyperparameters to keep the comparison fair. Our data performed much better, though. This is because
we reached the goal from the very beginning and tried out different ways of doing so, which gave the network the chance to learn
from different action sequences that lead to the goal. Meanwhile, the Q-learner itself had problems completing the levels within
10 000 episodes, since they were too complex to learn in such a short time without any reference data. This is visible
in all 3 graphics. At some point the Q-learner completed the level but had not yet optimized its Q-table, which is why our
data yields better rewards in every level. The graph of the hard level shows negative rewards for both our data
model and the Q-learner data model, since neither managed to complete the level. The level was too complex for
the data because it required a lot of climbing, and even though the data for that level was good as well, the network did not
manage to learn to climb over a high wall.
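
The size of the gap can be read directly from the tables in sections 3.1 to 3.3. The short sketch below only restates those reported rewards and subtracts them; the dictionary layout and variable names are chosen for illustration.

```python
# Rewards as reported in the tables of sections 3.1 to 3.3.
reported_rewards = {
    "easy": {"Q_Learner": 232, "Deep Q Network": 280},
    "medium": {"Q_Learner": 176, "Deep Q Network": 193},
    "hard": {"Q_Learner": -559, "Deep Q Network": -34},
}

# Positive values mean the Deep Q Network scored higher.
differences = {
    level_name: scores["Deep Q Network"] - scores["Q_Learner"]
    for level_name, scores in reported_rewards.items()
}

for level_name, difference in differences.items():
    print(f"{level_name}: Deep Q Network ahead by {difference}")
```

The lead is 48 on the easy level, 17 on the medium level and 525 on the hard level, where the large gap comes from the time penalty rather than from completing the level.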

### 3.5 Performance over 100 levels

// TODO

## 4. Conclusion

// TODO

![Thanks](https://media4.giphy.com/media/1mssFONYwmBlJy1DAv/giphy.gif?cid=ecf05e47fq7b3e8nbn49rxb2hj1f8qy627umny603h7tsi8f&rid=giphy.gif&ct=g)