# Training an Agent to play Super Mario
In this exercise you will train a Q-learning agent on the ```gym-marioai``` domain.  
```gym_marioai``` provides a Python interface for interacting with the MarioAI engine. The engine itself is implemented in Java, and the ```.jar``` of the engine needs to be started separately.  

### Installation
Requirements: Java 8 runtime environment, python 3.?  
You will be provided with both the .jar and the gym-marioai python package.

In [None]:
# install the gym-environment
# navigate to the source folder, then run:
# pip install ./path/to/gym-marioai

In [None]:
!python --version

### Running the MarioAI server from a terminal:
Navigate to the folder containing ```marioai-server.jar```, then run the following:  
```java -jar ./marioai-server.jar```

### Running the MarioAI server from the notebook:
Alternatively, run the following cell. It launches the jar containing the Java engine as a subprocess.

In [1]:
from subprocess import Popen
server_process = Popen(
    ['java', '-jar', 'marioai-proto-interface/target/marioai-proto-interface-0.1-SNAPSHOT-jar-with-dependencies.jar'])


### Closing the MarioAI server:
The server also shuts down automatically when the render window is closed. To kill the server process manually, uncomment the following cell and run it. Do not do this yet though ;).

In [None]:
# server_process.kill()
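
`kill()` stops the process immediately. A slightly gentler variant, sketched here using only the standard `subprocess` API, asks the process to terminate first and falls back to `kill()` after a timeout. The helper name `stop_process` and the 5 second default are illustrative choices, not part of the exercise material.

```python
import subprocess


def stop_process(process, timeout=5.0):
    """Terminate a subprocess.Popen process, killing it if it does not exit in time."""
    if process.poll() is not None:
        # the process has already exited
        return process.returncode
    process.terminate()
    try:
        return process.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # the process ignored the terminate request
        process.kill()
        return process.wait()
```

With the cell above, this would be called as `stop_process(server_process)`.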

### python-client demo setup:
make sure the demo is running...

## Representation of the Q-table
You will experience some of the shortcomings of tabular reinforcement learning methods. With marioai, the observation space will be very large, resulting in
- longer training duration (no interpolation of the policy between similar observations, each state needs to be explored separately)
- large amount of memory required to store the Q-table, if implemented naively

However, we can assume that only a subset of the observation space will be visited.  

Task: Implement a representation of the Q-table that stores observations 'on-demand'.  
Optional: Think of a way to store and reuse the trained model.
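
One possible approach is sketched below, assuming observations are hashable (for example tuples or bytes). The class name `DictQTable` and the helpers `save_q_table` and `load_q_table` are hypothetical and not part of gym-marioai. A row of action values is created only when a state is first looked up, and the table contents are stored with `pickle`.

```python
import pickle


class DictQTable:
    """Hypothetical on-demand Q-table: one row of action values per visited state."""

    def __init__(self, n_actions):
        self.n_actions = n_actions
        self.rows = {}  # maps hashable state -> list of Q-values

    def __contains__(self, state):
        return state in self.rows

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, state):
        # create a zero-initialised row the first time a state is looked up
        if state not in self.rows:
            self.rows[state] = [0.0] * self.n_actions
        return self.rows[state]


def save_q_table(q, file_path):
    """Write the table contents to disk (assumes all states are picklable)."""
    with open(file_path, 'wb') as f:
        pickle.dump({'n_actions': q.n_actions, 'rows': q.rows}, f)


def load_q_table(file_path):
    """Rebuild a table from a file written by save_q_table."""
    with open(file_path, 'rb') as f:
        data = pickle.load(f)
    q = DictQTable(data['n_actions'])
    q.rows = data['rows']
    return q
```

A dictionary of lists keeps the sketch dependency-free. A NumPy-backed table trades that simplicity for faster `argmax` lookups.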


## Training a Q-learner

In [2]:
import numpy as np
import gym
import gym_marioai

In [3]:
class QTable:
    """
    data structure to store the Q function for hashable state representations
    """

    def __init__(self, n_actions, initial_capacity=100):
        self.capacity = initial_capacity
        self.num_states = 0
        self.state_index_map = {}
        self.table = np.zeros([initial_capacity, n_actions])

    def __contains__(self, state):
        """ 'in' operator """
        return state in self.state_index_map

    #def __len__(self):
    #    return self.num_states

    def __getitem__(self, state):
        """ access state directly using [] notation """
        if state not in self.state_index_map:
            self.init_state(state)

        return self.table[self.state_index_map[state]]

    def init_state(self, state):
        if self.num_states == self.capacity:
            # need to increase capacity
            self.table = np.concatenate(
                (self.table, np.zeros_like(self.table)))
            self.capacity *= 2
        self.state_index_map[state] = self.num_states
        self.num_states += 1


In [22]:
#####################################
#   Training Parameters
#####################################
n_episodes = 5000
alpha = 0.1
gamma = 0.99
lmbda = 0.75
epsilon_start = 0.5
epsilon_end = 0.001
epsilon_decay_length = n_episodes / 2
decay_step = (epsilon_start - epsilon_end) / epsilon_decay_length

SAVE_FREQ = 100

#####################################
#   Environment/Reward Settings
#####################################
level = 'earlyCliffLevel'
path = None

if level == 'cliffLevel':
    path = gym_marioai.levels.cliff_level
if level == 'oneCliffLevel':
    path = gym_marioai.levels.one_cliff_level
if level == 'earlyCliffLevel':
    path = gym_marioai.levels.early_cliff_level

trace = 3
rf_width = 20
rf_height = 10
prog = 1
timestep = -1
cliff = 1000
win = -100
dead = -10
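
The parameters above define `decay_step` for a linear epsilon schedule, while the training loop computes epsilon with an exponential schedule. The sketch below implements both so they can be compared for a given episode. The function names `exponential_epsilon` and `linear_epsilon` are illustrative and do not appear elsewhere in the notebook.

```python
def exponential_epsilon(episode, n_episodes, eps_start, eps_end):
    """Interpolate geometrically from eps_start (episode 0) to eps_end (last episode)."""
    return (eps_end / eps_start) ** (episode / n_episodes) * eps_start


def linear_epsilon(episode, decay_length, eps_start, eps_end):
    """Decrease by a constant step for decay_length episodes, then stay at eps_end."""
    step = (eps_start - eps_end) / decay_length
    return max(eps_end, eps_start - step * episode)
```

For example, `exponential_epsilon(e, n_episodes, epsilon_start, epsilon_end)` reproduces the value used in the training loop for episode `e`.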

In [23]:
def train():
    """
    training
    """
    log_path = f'{level}_{rf_width}x{rf_height}_trace{trace}_prog{prog}_cliff{cliff}_win{win}_dead{dead}-0'
    # logger = Logger(log_path)
    # collect some training statistics
    all_rewards = np.zeros([SAVE_FREQ])
    all_wins = np.zeros([SAVE_FREQ])
    all_steps = np.zeros([SAVE_FREQ])
    all_gap_jumps = np.zeros([SAVE_FREQ])

    ###################################
    #       environment setup
    ###################################
    reward_settings = gym_marioai.RewardSettings(
        progress=prog, timestep=timestep, cliff=cliff, win=win, dead=dead)
    env = gym.make('Marioai-v0', render=False,
                   level_path=path,
                   reward_settings=reward_settings,
                   compact_observation=True,
                   trace_length=trace,
                   rf_width=rf_width, rf_height=rf_height)

    try:
        ####################################
        #       Q-learner setup
        #####################################
        Q = QTable(env.n_actions, 128)
        etrace = {}

        ####################################
        #      Training Loop
        ####################################
        for e in range(n_episodes+1):
            done = False
            info = {}
            total_reward = 0
            steps = 0
            # eligibility traces must not carry over from the previous episode
            etrace = {}

            # exponential decay
            epsilon = (epsilon_end / epsilon_start) ** (e /
                                                        n_episodes) * epsilon_start

            state = env.reset()
            #state = tuple([s.tobytes() for s in state])
            # choose the initial action a from an epsilon-greedy policy derived from Q
            if np.random.rand() < epsilon:
                action = env.action_space.sample()
            else:
                action = int(np.argmax(Q[state]))  # greedy

            while not done:
                next_state, reward, done, info = env.step(action)
                #next_state = tuple([s.tobytes() for s in next_state])
                total_reward += reward

                # choose a' from a Policy derived from Q
                best_next_action = int(np.argmax(Q[next_state]))  # greedy
                if np.random.rand() < epsilon:
                    next_action = env.action_space.sample()
                else:
                    next_action = best_next_action

                # calculate the TD error
                td_error = reward + gamma * \
                    Q[next_state][best_next_action] - Q[state][action]

                # reset eligibility trace for (s,a) using replacing strategy
                etrace[(state, action)] = 1

                # perform Q update
                if best_next_action == next_action:
                    for (s, a), eligibility in etrace.items():
                        Q[s][a] += alpha * eligibility * td_error
                        etrace[(s, a)] *= gamma * lmbda
                else:
                    for (s, a), eligibility in etrace.items():
                        Q[s][a] += alpha * eligibility * td_error
                    etrace = {}

                steps += 1
                action = next_action
                state = next_state

            # episode finished
            # logger.append(total_reward, info['steps'], info['win'])

            all_rewards[e % SAVE_FREQ] = total_reward
            all_wins[e % SAVE_FREQ] = 1 if info['win'] else 0
            all_steps[e % SAVE_FREQ] = info['steps']
            all_gap_jumps[e % SAVE_FREQ] = info['cliff_jumps']

            if e % SAVE_FREQ == 0 and e > 0:
                # logger.save()
                # logger.save_model(Q)
                print(f'Episode: {e}', f'Eps: {epsilon:.3f}', f'Avg Reward: {all_rewards.mean():>4.2f}',
                    f'Avg steps: {all_steps.mean():>4.2f}', f'Win% : {all_wins.mean():3.2f}',
                    f'Cliff jumps: {all_gap_jumps.mean():.1f}', f'States seen: {Q.num_states}', end='\r')
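        return Q
    except KeyboardInterrupt:
        # interrupted by the user: return what has been learned so far
        return Q
    finally:
        # always close the connection to the server
        env.teardown()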

In [24]:
Q = train()
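
The update performed inside the training loop corresponds to Q-learning with replacing eligibility traces that are cut after an exploratory action, commonly called Watkins's Q(λ). Written out as read from the code in `train()`, with $\alpha$ = `alpha`, $\gamma$ = `gamma` and $\lambda$ = `lmbda`:

$$
\begin{aligned}
a^* &= \arg\max_{a'} Q(s_{t+1}, a') \\
\delta_t &= r_{t+1} + \gamma\, Q(s_{t+1}, a^*) - Q(s_t, a_t) \\
e(s_t, a_t) &\leftarrow 1 \\
Q(s, a) &\leftarrow Q(s, a) + \alpha\, \delta_t\, e(s, a) \quad \text{for all } (s, a) \text{ with a stored trace} \\
e(s, a) &\leftarrow
\begin{cases}
\gamma \lambda\, e(s, a) & \text{if } a_{t+1} = a^* \\
0 & \text{otherwise}
\end{cases}
\end{aligned}
$$

Here $a_{t+1}$ is the epsilon-greedy action that is actually executed next, so the traces are discarded whenever the agent explores.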



## Replay

In [25]:
reward_settings = gym_marioai.RewardSettings(progress=prog, timestep=timestep,
                                             cliff=cliff, win=win, dead=dead)
env = gym.make('Marioai-v0', render=True,
               level_path=path,
               reward_settings=reward_settings,
               compact_observation=True,
               trace_length=trace,
               rf_width=rf_width, rf_height=rf_height)

try:
    while True:
        done = False
        info = {}
        total_reward = 0
        steps = 0
        state = env.reset()

        while not done:
            action = int(np.argmax(Q[state]))  # greedy
            state, reward, done, info = env.step(action)
            total_reward += reward
            steps += 1

        print(f'finished episode. reward: {total_reward:4.2f}\t steps: {steps:4.2f}\t'
            f'win: {info["win"]}\t gap jumps: {info["cliff_jumps"]}')
except KeyboardInterrupt:
    env.teardown()

finished episode. reward: 10158.00	 steps: 842.00	win: False	 gap jumps: 11
closing socket connection...
socket connection closed successfully.
socket disconnected.


## Plotting the training results

In [None]:
# TODO

In [None]:
server_process.kill()