# 01 Tests on Reinforcement Learning

## Intro
In this notebook I test which strategy is better for training an RL agent in a very limited environment such as the game of tic-tac-toe.


## RL algorithm
The implemented algorithm is [Q-Learning](https://en.wikipedia.org/wiki/Q-learning#Algorithm), which can find the optimal policy $\pi^*$ provided the agent explores enough of the state-action space.

$Q$ is a function that, given a state and an action, returns a number representing the *quality* of that action: the expected value of the total reward over all successive steps. The best move for a state is therefore the one that maximizes $Q$. In this case, the reward comes from winning the game.

The easiest way to implement $Q$ is as a matrix ($State \times Action$). If the transition probability matrix is not known, we have to sample from the environment by making the agent play. This makes it possible to calculate $Q$ iteratively.
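
As a minimal sketch of reading a move out of such a matrix, assume one row of $Q$ per state with one entry per board cell. The helper `best_action` and the example values are hypothetical and are not taken from the `QLearner` class used in this notebook.

```python
def best_action(q_row, legal_actions):
    """Return the legal action with the highest Q-value in this state's row."""
    return max(legal_actions, key=lambda a: q_row[a])


# Hypothetical Q-values for the 9 cells of one board state.
q_row = [0.0, 0.2, -1.0, 0.7, 0.1, 0.0, 0.0, 0.3, 0.0]

# Only free cells are legal; cell 3 has the highest value but is occupied here.
move = best_action(q_row, legal_actions=[0, 1, 4, 7])
```

Restricting the maximum to the legal actions matters in tic-tac-toe, because occupied cells can never be played regardless of their stored value.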

\begin{equation*}
Q^{new} (s_{t},a_{t}) \leftarrow (1-\alpha) \cdot \underbrace{Q(s_{t},a_{t})}_{\text{old value}} + \underbrace{\alpha}_{\text{learning rate}} \cdot  \overbrace{\bigg( \underbrace{r_{t}}_{\text{reward}} + \underbrace{\gamma}_{\text{discount factor}} \cdot \underbrace{\max_{a}Q(s_{t+1}, a)}_{\text{estimate of optimal future value}} \bigg) }^{\text{learned value}}
\end{equation*}

Where:
* $r_{t}$ is the reward observed for the current state $s_t$.
* $\alpha \in [0,1]$ is the learning rate, which weights the current experience against previous ones.
* $\gamma \in [0,1]$ is the discount factor, which weights future rewards against present ones.
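
The update rule above can be sketched in a few lines of Python. This is an illustration only, assuming a dictionary keyed by `(state, action)` instead of a dense matrix; the function `q_update` and its parameters are hypothetical names and do not describe the internals of `QLearner`.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, next_actions,
             alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the new value."""
    # A terminal next state has no legal actions, so its future value is 0.
    best_next = max((q[(next_state, a)] for a in next_actions), default=0.0)
    learned = reward + gamma * best_next
    q[(state, action)] = (1 - alpha) * q[(state, action)] + alpha * learned
    return q[(state, action)]


# Unseen (state, action) pairs start at 0.
q = defaultdict(float)

# A winning move: reward 1, and the game ends, so there are no next actions.
new_value = q_update(q, "s0", 4, 1.0, "s1", [], alpha=0.5)
```

With $\alpha = 0.5$ the old value (0) and the learned value (1) are averaged, so the entry moves halfway towards the reward.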

In [None]:
from players.minimax import Minimax
from players.qlearner import QLearner
from players.random import Random
from utils_train import train_player, test_players

---
## Experiments
I am going to train three RL agents with the Q-learning algorithm. To make the experiment fair, I will limit the resources each agent has by setting a fixed training time. This way, an agent with a better but more costly teacher (in CPU operations) will play fewer games than one with a worse but faster teacher.
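
A time budget of this kind can be sketched as a loop that keeps playing games until a deadline passes. The function `train_for` below is a hypothetical stand-in written for illustration; it is not the `train_player` helper imported from `utils_train`, whose implementation is not shown here.

```python
import time


def train_for(seconds, play_one_game):
    """Call play_one_game() until the time budget runs out; return games played."""
    deadline = time.monotonic() + seconds
    games = 0
    while time.monotonic() < deadline:
        play_one_game()
        games += 1
    return games


# A dummy "game" that does nothing, just to show the call shape.
games_played = train_for(0.05, lambda: None)
```

Because the deadline is only checked between games, a slow teacher yields fewer completed games in the same number of seconds, which is exactly the trade-off the experiment measures.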

In [None]:
TRAIN_SECONDS = 60  # Training time budget, in seconds

### 1 Random teacher
The random teacher chooses a random move each turn. It does not aim to win or lose, but it plays very fast.

In [None]:
p_rl_rand = QLearner(1)
p_rand = Random(2)

print("Training for {} seconds".format(TRAIN_SECONDS))
games_rand = train_player(
    p_train=p_rl_rand,
    p_opponent=p_rand,
    p_train_func=lambda board: p_rl_rand._train_1_game(0.1, lambda x: 1-(x/10), board, p_rand),
    seconds=TRAIN_SECONDS
)
print("Games played: {}".format(games_rand))

### 2 Minimax teacher
The minimax teacher aims to win. This agent is optimal in the sense that it can only win or draw a game. It is slow.

In [None]:
p_rl_minimax = QLearner(1)
p_minimax = Minimax(2)

print("Training for {} seconds".format(TRAIN_SECONDS))
games_minimax = train_player(
    p_train=p_rl_minimax,
    p_opponent=p_minimax,
    p_train_func=lambda board: p_rl_minimax._train_1_game(0.1, lambda x: 1-(x/10), board, p_minimax),
    seconds=TRAIN_SECONDS
)
print("Games played: {}".format(games_minimax))

### 3 RL teacher
What if I make an RL agent play against itself? Will it learn?

In [None]:
p_rl_rl = QLearner(1)

print("Training for {} seconds".format(TRAIN_SECONDS))

# Store the count under its own name so it does not overwrite games_minimax
games_rl = train_player(
    p_train=p_rl_rl,
    p_opponent=p_rl_rl,
    p_train_func=lambda board: p_rl_rl._autotrain_1_game(0.1, lambda x: 1-(x/10), board),
    seconds=TRAIN_SECONDS
)
print("Games played: {}".format(games_rl))

---
## Metrics
To measure the actual performance of each agent, I will make them play against several opponents.

In [None]:
TEST_N_GAMES = 100  # Games to test

### Optimal (minimax)
If the agent is good enough, no one will win against this opponent and every game will be a draw.

In [None]:
print("Testing RL-RAND VS MINIMAX ({} games)".format(TEST_N_GAMES))
print("As player 1")
print(test_players(p_rl_rand, p_minimax, TEST_N_GAMES))
print("As player 2")
print(test_players(p_minimax, p_rl_rand, TEST_N_GAMES))

In [None]:
print("Testing RL-MINIMAX VS MINIMAX ({} games)".format(TEST_N_GAMES))
print("As player 1")
print(test_players(p_rl_minimax, p_minimax, TEST_N_GAMES))
print("As player 2")
print(test_players(p_minimax, p_rl_minimax, TEST_N_GAMES))

In [None]:
print("Testing RL-RL VS MINIMAX ({} games)".format(TEST_N_GAMES))
print("As player 1")
print(test_players(p_rl_rl, p_minimax, TEST_N_GAMES))
print("As player 2")
print(test_players(p_minimax, p_rl_rl, TEST_N_GAMES))

### Against other RL agents
However, the agents may have learned non-optimal policies. Playing them against each other lets me choose which one is the best. There can only be one.

In [None]:
print("Testing RL-RAND VS RL-MINIMAX ({} games)".format(TEST_N_GAMES))
print("As player 1")
print(test_players(p_rl_rand, p_rl_minimax, TEST_N_GAMES))
print("As player 2")
print(test_players(p_rl_minimax, p_rl_rand, TEST_N_GAMES))

In [None]:
print("Testing RL-RAND VS RL-RL ({} games)".format(TEST_N_GAMES))
print("As player 1")
print(test_players(p_rl_rand, p_rl_rl, TEST_N_GAMES))
print("As player 2")
print(test_players(p_rl_rl, p_rl_rand, TEST_N_GAMES))

In [None]:
print("Testing RL-MINIMAX VS RL-RL ({} games)".format(TEST_N_GAMES))
print("As player 1")
print(test_players(p_rl_minimax, p_rl_rl, TEST_N_GAMES))
print("As player 2")
print(test_players(p_rl_rl, p_rl_minimax, TEST_N_GAMES))

## Out of scope
This notebook does not cover how to find the best hyperparameters. It is just a glimpse of how to explore the search space, and it aims to show that, in order to learn how to win, the agent must win during training.