# Training an Agent to play Super Mario
In this exercise you will train a Q-learning agent on the ```gym-marioai``` domain.  
```gym_marioai``` provides a Python interface for interacting with the MarioAI engine. The engine itself is implemented in Java, and the engine's ```.jar``` needs to be started separately.  

### Installation
Requirements: Java 8 runtime environment, Python 3.?  
You will be provided with both the .jar and the gym-marioai python package.

In [None]:
# install the gym-environment
# navigate to the source folder, then run:
# pip install ./path/to/gym-marioai

In [1]:
import gym
import gym_marioai
import numpy as np
from random import random

### Running the MarioAI server:
Navigate to the folder containing ```marioai-server.jar```, then run the following:  
```java -jar ./marioai-server.jar```

### Python client demo setup:
Make sure the MarioAI server is running before executing the next cell; otherwise the client cannot connect.

In [3]:
# initialize the environment
env = gym.make('Marioai-v0', render=True, level_path=gym_marioai.levels.cliff_level)

# run random episodes
for episode in range(2):
    state = env.reset()
    done = False
    total_reward = 0

    while not done:
        action = env.action_space.sample()
        state, reward, done, info = env.step(action)
        total_reward += reward

    print(f'finished episode {episode}, total_reward: {total_reward}')

print('finished demo')


RuntimeError: socket connection broken

## Representation of the Q-table
This exercise exposes some shortcomings of tabular reinforcement learning methods. With marioai, the observation space is very large, which results in:
- longer training, because a table cannot generalize between similar observations and each state has to be explored separately
- a large memory footprint for the Q-table if it is naively allocated for the full observation space
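
To get a feeling for the numbers, here is a rough back-of-the-envelope sketch. It assumes, purely for illustration, that an observation is a 7x5 grid of cells, that each cell takes one of 4 values, and that there are 10 actions. None of these figures are taken from ```gym_marioai```, whose actual observation encoding may differ.

```python
# Hypothetical numbers, chosen only for illustration
grid_width, grid_height = 7, 5
values_per_cell = 4
n_actions = 10

n_cells = grid_width * grid_height
n_states = values_per_cell ** n_cells          # every combination of cell values
dense_table_bytes = n_states * n_actions * 8   # one float64 per (state, action)

print(f'{n_cells} cells -> {n_states:.2e} possible observations')
print(f'dense Q-table: {dense_table_bytes:.2e} bytes')
```

Even under these modest assumptions a dense table is far beyond what fits in memory.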

However, we can assume that only a subset of the observation space will be visited.  

Task: Implement a representation of the Q-table that stores observations 'on-demand'.  
Optional: Think of a way to store and reuse the trained model.
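
One possible direction, shown here as a minimal sketch rather than the reference solution: a dictionary that creates a zero-initialized row the first time a state is accessed, combined with ```pickle``` for storing and reloading. The helper names ```make_q_table```, ```save_q_table``` and ```load_q_table``` are hypothetical and not part of ```gym_marioai```.

```python
import os
import pickle
import tempfile
from collections import defaultdict


def make_q_table(n_actions):
    """ dict that creates a zero-initialized row the first time a state is accessed """
    return defaultdict(lambda: [0.0] * n_actions)


def save_q_table(q_table, path):
    # the lambda default factory cannot be pickled, so store a plain dict
    with open(path, 'wb') as f:
        pickle.dump(dict(q_table), f)


def load_q_table(path, n_actions):
    q_table = make_q_table(n_actions)
    with open(path, 'rb') as f:
        q_table.update(pickle.load(f))
    return q_table


q = make_q_table(n_actions=3)
q[b'state-a'][1] = 0.5      # the row for b'state-a' is created on first access

path = os.path.join(tempfile.mkdtemp(), 'q_table.pkl')
save_q_table(q, path)
restored = load_q_table(path, n_actions=3)
print(restored[b'state-a'])
```

Only unpickle files you created yourself, since ```pickle``` can execute arbitrary code while loading.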


In [4]:
class QTable:
    """
    data structure to store the Q function for hashable state representations 
    """
    def __init__(self, n_actions, initial_capacity=100):
        self.capacity = initial_capacity    # number of rows currently allocated
        self.num_states = 0                 # number of rows in use
        self.state_index_map = {}           # state -> row index in self.table
        self.table: np.ndarray = np.zeros((initial_capacity, n_actions))

    def __contains__(self, state):
        """ 'in' operator """
        return state in self.state_index_map

    def __len__(self):
        return self.num_states

    def __getitem__(self, state):
        """ access state directly using [] notation """
        return self.table[self.state_index_map[state]]

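    def init_state(self, state):
        """ reserve a zero-initialized row for a state that has not been seen before """
        if state in self.state_index_map:
            return

        if self.num_states == self.capacity:
            # table is full: double the capacity by appending zero rows
            self.table = np.concatenate((self.table, np.zeros_like(self.table)))
            self.capacity *= 2

        self.state_index_map[state] = self.num_states
        self.num_states += 1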


class QAgent:
    def __init__(self, env, alpha=0.1, gamma=0.99, 
            epsilon_start=0.5, epsilon_end=0.001,
            epsilon_decay_length=10000, # in episodes
            ):
        self.alpha = alpha
        self.gamma = gamma
        self.epsilon = epsilon_start
        self.epsilon_end = epsilon_end
        self.decay_step = (epsilon_start - epsilon_end) / epsilon_decay_length

        self.env = env
        self.Q = QTable(env.action_space.n)

    def select_action(self, state):
        """ epsilon-greedy action selection """
        if state not in self.Q:
            # unseen state: all Q-values are zero, so act randomly
            self.Q.init_state(state)
            return self.env.action_space.sample()

        if random() < self.epsilon:
            # explore
            return self.env.action_space.sample()

        # exploit: pick the action with the highest Q-value
        return int(np.argmax(self.Q[state]))

    def update_Q(self, state, action, reward, next_state):
        """ basic Q learning update 
            Q(s,a) <- Q(s,a) + alpha * [r + gamma * max(Q(s', .)) - Q(s,a)]
        """
        if next_state not in self.Q:
            self.Q.init_state(next_state)

        td_error = reward + self.gamma * np.max(self.Q[next_state]) \
                    - self.Q[state][action]
        self.Q[state][action] += self.alpha * td_error

    def decay_epsilon(self):
        """ linear decay, clamped so that epsilon never drops below epsilon_end """
        self.epsilon = max(self.epsilon_end, self.epsilon - self.decay_step)
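
To make the update rule concrete, here is a single Q-learning step worked through with made-up numbers. The reward and the value of max(Q(s', .)) are hypothetical and only serve to illustrate the arithmetic.

```python
alpha, gamma = 0.1, 0.99

q_sa = 0.0          # current estimate Q(s, a)
reward = 1.0        # hypothetical reward for taking a in s
max_q_next = 2.0    # hypothetical max over Q(s', .)

# r + gamma * max(Q(s', .)) - Q(s, a)
td_error = reward + gamma * max_q_next - q_sa
# Q(s, a) <- Q(s, a) + alpha * td_error
q_sa_new = q_sa + alpha * td_error

print(td_error, q_sa_new)
```

The TD error is about 2.98, so the estimate moves a tenth of the way towards the target and ends up at about 0.298.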



## Training a Q-learner

In [5]:
# training loop
def train(env, agent, n_episodes):
    """ run n_episodes of Q-learning and return the total reward of each episode """
    episode_rewards = []

    for e in range(n_episodes):
        done = False
        total_reward = 0

        # convert to bytes so it can be used as dictionary key
        # (np array is not hashable)
        state = env.reset().tobytes()

        while not done:
            action = agent.select_action(state)

            next_state, reward, done, info = env.step(action)
            next_state = next_state.tobytes()
            total_reward += reward

            agent.update_Q(state, action, reward, next_state)
            state = next_state

        # episode has finished
        episode_rewards.append(total_reward)
        agent.decay_epsilon()

    return episode_rewards


In [6]:
# training parameters
episodes = 10000
trials = 10
alpha = 0.2
gamma = 0.99

# initialize agent and environment
env = gym.make('Marioai-v0', render=True,
                level_path=gym_marioai.levels.cliff_level,
                rf_width=7, rf_height=5)
agent = QAgent(env, alpha=alpha, gamma=gamma, epsilon_decay_length=episodes / 2)

for trial in range(trials):
    print(f'starting training {trial}')
    train(env, agent, episodes)

starting training 0


RuntimeError: socket connection broken

## Plotting the training results

In [None]:
# TODO
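
A possible starting point for this cell, offered as a sketch only. It assumes a list ```episode_rewards``` holding the total reward of each training episode is available, and that ```matplotlib``` is installed. The helper names ```moving_average``` and ```plot_rewards``` are hypothetical. Smoothing with a moving average helps because single-episode rewards under an epsilon-greedy policy tend to be noisy.

```python
def moving_average(values, window):
    """ mean over a sliding window; returns len(values) - window + 1 points """
    if window < 1 or window > len(values):
        raise ValueError('window must be between 1 and len(values)')
    running = sum(values[:window])
    averages = [running / window]
    for i in range(window, len(values)):
        # slide the window by one: add the new value, drop the oldest
        running += values[i] - values[i - window]
        averages.append(running / window)
    return averages


def plot_rewards(episode_rewards, window=100):
    # imported here so that moving_average also works without matplotlib
    import matplotlib.pyplot as plt

    smoothed = moving_average(episode_rewards, window)
    # the first smoothed point belongs to episode index window - 1
    plt.plot(range(window - 1, len(episode_rewards)), smoothed)
    plt.xlabel('episode')
    plt.ylabel(f'total reward (mean over {window} episodes)')
    plt.show()


# made-up rewards, only to demonstrate the helper
demo_rewards = [0, 2, 4, 6, 8]
print(moving_average(demo_rewards, window=2))
```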