# Gekitai with Reinforcement Learning

## Introduction

- This notebook walks through the stages of implementing a custom [OpenAI gym](https://www.gymlibrary.ml) environment for the gekitai game.

- The gekitai rules are available [here](https://boardgamegeek.com/boardgame/295449/gekitai).

## Contributors

- [João Sousa](mailto:up201904739@edu.fc.up.pt)
- [Miguel Rodrigues](mailto:up201906042@edu.fe.up.pt)
- [Ricardo Ferreira](mailto:up201907835@edu.fe.up.pt)

In [None]:
import numpy as np
import gym
import gekitai

In [None]:
# Check that the custom environment follows the OpenAI gym specification
from gym.utils.env_checker import check_env

env = gym.make('gekitai-v0')
check_env(env)

## Testing the environment

Below is a small test to make sure the environment is up and running.
Nothing fancy happens in this snippet: the environment is stepped by taking random actions.

In [None]:
env = gym.make('gekitai-v0', render_mode='human')
observation = env.reset()

episodes = 5

for episode in range(episodes):
    done = False
    while not done:
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
    
        env.render(mode='human')
    
    print(info)
    observation = env.reset()
    
env.close()

## Learning

In this section we discuss and show how an agent can learn to play gekitai using our custom environment.

### Considerations

Since we are using [OpenAI gym](https://www.gymlibrary.ml/), we had to face a challenge regarding single-agent vs. multi-agent environments. Gym's interface is targeted at single-agent environments, so we had to adapt our 2-player board game, which is naturally a multi-agent environment, to a single-agent one.

To do so, our `step()` function executes a move for both the agent and its opponent. The way we generate a move for the opponent is simple: we choose a random action, with the catch that inner spaces (which are more valuable) have a higher probability of being chosen. We kept this simple because more complex algorithms could take too long to compute each step, which would translate into even longer training times for our RL models.
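The opponent's biased random choice can be sketched as follows. This is a minimal illustration, not the environment's actual code: the 6x6 board size, the encoding of empty cells as `0`, the 3:1 weight ratio, the reading of "inner" as "not on the border", and the names `inner_biased_weights` and `sample_opponent_move` are all assumptions made for the example.

```python
import random


def inner_biased_weights(size=6, inner_weight=3.0, outer_weight=1.0):
    """Map every (row, col) cell to a sampling weight.

    Border cells get `outer_weight`; all other cells get `inner_weight`.
    The sizes and weights are illustrative assumptions.
    """
    weights = {}
    for row in range(size):
        for col in range(size):
            on_border = row in (0, size - 1) or col in (0, size - 1)
            weights[(row, col)] = outer_weight if on_border else inner_weight
    return weights


def sample_opponent_move(board, rng, weights):
    """Pick a random empty cell, favouring cells with a larger weight.

    `board` is a list of lists in which 0 marks an empty cell
    (a hypothetical encoding). Returns None when the board is full.
    """
    empty = [
        (r, c)
        for r, row in enumerate(board)
        for c, value in enumerate(row)
        if value == 0
    ]
    if not empty:
        return None
    cell_weights = [weights[cell] for cell in empty]
    # random.choices normalises the weights internally
    return rng.choices(empty, weights=cell_weights, k=1)[0]


board = [[0] * 6 for _ in range(6)]
board[2][2] = 1  # one occupied cell
rng = random.Random(0)
weights = inner_biased_weights()
move = sample_opponent_move(board, rng, weights)
print(move)
```

Sampling like this costs a single pass over the board per step, which is in line with the goal of keeping each `step()` call cheap during training.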

Another relevant aspect is the choice of training algorithms. Not all RL algorithms work with our environment, because `action_space` is of type `Discrete()` and `observation_space` is of type `Box()`, which means the observations are continuous.

### Algorithms

Taking into account the considerations stated above, some of the algorithms compatible with our environment are:

- DQN
- PPO
- A2C

Implementations of these algorithms are provided by the Stable Baselines3 library, which offers a friendly, easy-to-use API that is handy for all sorts of RL tasks. The documentation for the library can be found [here](https://stable-baselines3.readthedocs.io/en/master/).

In [None]:
# Logs setup for visualization through TensorBoard
import os

logs_dir = 'logs'
os.makedirs(logs_dir, exist_ok=True)

### DQN (Deep Q-Network)

The DQN algorithm is based on Q-learning. The Q-table, which stores the Q-value of each `(state, action)` pair in Q-learning, is replaced by a neural network trained to estimate those same Q-values. In other words, DQN is to a continuous `observation_space` what Q-learning is to a discrete one.

Both DQN and Q-learning are off-policy: the policy being learned (the greedy policy with respect to the Q-value estimates) can differ from the behaviour policy the agent follows while collecting experience, for example an epsilon-greedy one.
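The tabular Q-learning update that DQN approximates can be sketched as below. This is an illustrative sketch rather than what Stable Baselines3 does internally; the function names, the state labels `"s0"` and `"s1"`, and the values of `alpha` and `gamma` are assumptions chosen for the example. Note that actions are selected by `epsilon_greedy`, while the update target always uses the maximum over the next state's actions, which is what makes the method off-policy.

```python
import random
from collections import defaultdict


def epsilon_greedy(q_table, state, n_actions, epsilon, rng):
    """Behaviour policy: explore with probability epsilon, else act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(n_actions)
    values = [q_table[(state, a)] for a in range(n_actions)]
    return values.index(max(values))


def q_learning_update(q_table, state, action, reward, next_state, done,
                      n_actions, alpha=0.5, gamma=0.9):
    """Apply one Q-learning update and return the new Q-value.

    The target uses the best next action, regardless of which action
    the behaviour policy will actually take next.
    """
    if done:
        best_next = 0.0
    else:
        best_next = max(q_table[(next_state, a)] for a in range(n_actions))
    target = reward + gamma * best_next
    q_table[(state, action)] += alpha * (target - q_table[(state, action)])
    return q_table[(state, action)]


q_table = defaultdict(float)  # unseen (state, action) pairs default to 0.0
q_table[("s1", 0)] = 1.0
q_table[("s1", 1)] = 2.0

updated = q_learning_update(
    q_table, "s0", 0, reward=1.0, next_state="s1", done=False, n_actions=2
)
greedy_action = epsilon_greedy(q_table, "s1", 2, epsilon=0.0, rng=random.Random(0))
```

In DQN the dictionary lookup is replaced by a forward pass of the network, and the same target is used as the regression label for the chosen action's output.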

In [None]:
from stable_baselines3 import DQN

dqn_models_dir = 'models/dqn'
os.makedirs(dqn_models_dir, exist_ok=True)

In [None]:
version = 0.1
env = gym.make('gekitai-v0', render_mode='rgb_array')
env.reset()

model = DQN('MlpPolicy', env, verbose=1, tensorboard_log=logs_dir)
# total_timesteps is expected to be an integer
model.learn(total_timesteps=200_000, reset_num_timesteps=False, tb_log_name=f'dqn_v{version}')
model.save(f'{dqn_models_dir}/gekitai_dqn_v{version}')

In [None]:
env = gym.make('gekitai-v0', render_mode='human')
observation = env.reset()

episodes = 5

for episode in range(episodes):
    done = False
    
    while not done:
        action, _states = model.predict(observation, deterministic=True)
        observation, reward, done, info = env.step(action)
        
        env.render(mode='human')
    
    print(info)
    observation = env.reset()

env.close()