# 01 Tests on Reinforcement Learning

## Intro
In this notebook I test which strategy works best for training an RL agent in a very limited environment such as the game of tic-tac-toe.


## RL algorithm
The implemented algorithm is [Q-Learning](https://en.wikipedia.org/wiki/Q-learning#Algorithm), which can find the optimal policy $\pi^*$.

$Q$ is a function that, given a state and an action, returns a number representing the *quality* of that action: the expected value of the total reward over all successive steps. The best move in a state is therefore the one that maximizes $Q$, which in this case means the one that leads to winning the game.

The easiest way to implement the function $Q$ is as a matrix ($State \times Action$). If the transition probability matrix is not known, we have to sample from the environment by making the agent play. This makes it possible to compute $Q$ iteratively:

\begin{equation*}
Q^{new} (s_{t},a_{t}) \leftarrow (1-\alpha) \cdot \underbrace{Q(s_{t},a_{t})}_{\text{old value}} + \underbrace{\alpha}_{\text{learning rate}} \cdot  \overbrace{\bigg( \underbrace{r_{t}}_{\text{reward}} + \underbrace{\gamma}_{\text{discount factor}} \cdot \underbrace{\max_{a}Q(s_{t+1}, a)}_{\text{estimate of optimal future value}} \bigg) }^{\text{learned value}}
\end{equation*}

Where:
* $r_{t}$ is the reward observed for the current state $s_t$.
* $\alpha \in [0,1]$ is the learning rate, which weights the newly learned value against the previous estimate.
* $\gamma \in [0,1]$ is the discount factor, which weights future rewards against present rewards.
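The update rule above can be sketched in a few lines of Python. This is a minimal illustration, not the code of the `QLearner` class used in this notebook: the function name `q_update`, the dictionary-based table, and the default values of `alpha` and `gamma` are all assumptions made for the example.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, next_actions,
             alpha=0.1, gamma=0.9):
    """Apply one Q-learning update in place and return the new value.

    `q` maps (state, action) pairs to values; unseen pairs default to 0.
    """
    # Estimate of the optimal future value; 0 when next_state is terminal
    # (no actions available).
    best_next = max((q[(next_state, a)] for a in next_actions), default=0.0)
    learned = reward + gamma * best_next
    q[(state, action)] = (1 - alpha) * q[(state, action)] + alpha * learned
    return q[(state, action)]


q = defaultdict(float)
# Hypothetical transition: playing cell 4 in state "s0" wins the game (reward 1).
first = q_update(q, "s0", 4, reward=1.0, next_state="win", next_actions=[])
second = q_update(q, "s0", 4, reward=1.0, next_state="win", next_actions=[])
print(first, second)
```

With $\alpha = 0.1$, the value moves from 0 to about 0.1 after the first update and to about 0.19 after the second, so repeated wins pull the estimate towards the reward.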

In [1]:
import itertools

from players.minimax import Minimax
from players.qlearner import QLearner
from players.random import Random
from utils import train_player_seconds, test_players, train_player_games

---
## Teachers & Learners

### Random teacher
The random teacher chooses a random move each turn. It does not aim to win or lose, but it plays very fast.

### Minimax teacher
The minimax teacher aims to win. This agent is optimal in the sense that it can only win or draw a game, but it is slow.
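To illustrate why a minimax player never loses, here is a small self-contained sketch of minimax on tic-tac-toe. It is not the `Minimax` class imported in this notebook: the board encoding (a tuple of 9 cells holding 0, 1 or 2), the function names, and the use of `lru_cache` are assumptions made for the example.

```python
from functools import lru_cache

# The eight winning lines of a 3x3 board, as cell indices.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]


def winner(board):
    """Return 1 or 2 if that player has three in a row, else 0."""
    for a, b, c in LINES:
        if board[a] != 0 and board[a] == board[b] == board[c]:
            return board[a]
    return 0


@lru_cache(maxsize=None)
def minimax_value(board, player):
    """Value of `board` for player 1 with `player` to move: +1 win, 0 draw, -1 loss."""
    w = winner(board)
    if w:
        return 1 if w == 1 else -1
    moves = [i for i, cell in enumerate(board) if cell == 0]
    if not moves:
        return 0  # Full board without a winner: draw.
    values = [minimax_value(board[:i] + (player,) + board[i + 1:], 3 - player)
              for i in moves]
    # Player 1 maximizes the value, player 2 minimizes it.
    return max(values) if player == 1 else min(values)


empty = (0,) * 9
print(minimax_value(empty, 1))  # 0: perfect play from both sides is a draw
```

The cache is an addition of this sketch. Without it, every position is searched again on each call, which is one plausible reason for a minimax teacher being slow.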

### RL teacher
What if I make an RL agent play against itself? Will it learn? In each game it learns to play both as player 1 and as player 2.

In [2]:
teachers = {
    'random': Random(2),
    'minimax': Minimax(2)
}

In [3]:
learners_seconds = {
    'random': {
        'agent': QLearner(1),
        'teacher': teachers['random'],
        'results': {},
        'train_func': lambda board,learner,teacher: learner._train_1_game(0.1, lambda x: 1-(x/10), board, teacher)
    },
    'minimax': {
        'agent': QLearner(1),
        'teacher': teachers['minimax'],
        'results': {},
        'train_func': lambda board,learner,teacher: learner._train_1_game(0.1, lambda x: 1-(x/10), board, teacher)
    },
    'rl': {
        'agent': QLearner(1),
        'teacher': None,  # Itself; assigned right after this dict is built
        'results': {},
        'train_func': lambda board, learner, teacher: learner._autotrain_1_game(0.1, lambda x: 1-(x/10), board)
    },
}

learners_seconds['rl']['teacher'] = learners_seconds['rl']['agent']

---
## Experiments (with time limitation)
I am going to train three RL agents with the Q-learning algorithm. To make the experiment fair, I limit the resources each agent has by fixing the training time. This way, an agent with a better but more costly teacher (in CPU operations) will play fewer games than one with a worse but faster teacher.
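A time-limited training loop can be sketched as follows. The real `train_player_seconds` lives in this notebook's `utils` module and its source is not shown here, so the function `train_for_seconds` and its `play_one_game` callback are hypothetical stand-ins that only illustrate the idea of counting games within a fixed time budget.

```python
import time


def train_for_seconds(play_one_game, seconds):
    """Call play_one_game() until the time budget is spent; return the game count."""
    deadline = time.monotonic() + seconds
    games = 0
    while time.monotonic() < deadline:
        play_one_game()
        games += 1
    return games


# A fast "teacher" fits many more games into the same budget than a slow one.
fast = train_for_seconds(lambda: None, 0.05)
slow = train_for_seconds(lambda: time.sleep(0.01), 0.05)
print(fast, slow)
```

`time.monotonic()` is used instead of `time.time()` so that system clock adjustments cannot shorten or lengthen the budget.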

In [4]:
TRAIN_SECONDS = 60  # Training time budget (seconds)

In [5]:
for rl_name, rl in learners_seconds.items():
    rl['seconds'] = TRAIN_SECONDS
    print("Training rl ({}) for {} seconds".format(rl_name, TRAIN_SECONDS))
    rl['games'] = train_player_seconds(
        learner=rl['agent'],
        teacher=rl['teacher'],
        train_func=rl['train_func'],
        seconds=TRAIN_SECONDS
    )
    print("\t {} games played\n".format(rl['games']))

Training rl (random) for 60 seconds
	 183370 games played

Training rl (rl) for 60 seconds
	 149260 games played

Training rl (minimax) for 60 seconds
	 278 games played



---
## Metrics
To measure the actual performance of each agent, I will make them play against several opponents.
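The test cells below read the result of `test_players` with the keys `1`, `2` and `-1`, which suggests a mapping from winner id to number of games, with `-1` standing for a draw. Under that assumption, the bookkeeping can be sketched like this (the helper `summarize` and the sample outcomes are hypothetical):

```python
from collections import Counter


def summarize(outcomes, me):
    """Turn a list of winner ids (1, 2, or -1 for a draw) into a summary for player `me`."""
    counts = Counter(outcomes)
    opponent = 2 if me == 1 else 1
    return {'wins': counts[me], 'loses': counts[opponent], 'draws': counts[-1]}


outcomes = [2, -1, -1, 2, -1]  # hypothetical winners of five games
as_p1 = summarize(outcomes, me=1)
as_p2 = summarize(outcomes, me=2)
print(as_p1)
print(as_p2)
```

The same list of outcomes reads as two losses for player 1 and two wins for player 2, which is why the cells below swap `results[1]` and `results[2]` when the agent plays second.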

In [6]:
TEST_N_GAMES = 100  # Games to test

### Optimal (minimax)
If the agent is good enough, nobody wins against this opponent and every game ends in a draw.

In [7]:
for rl_name, rl in learners_seconds.items():
    print("Testing rl ({}) VS minimax ({} games)".format(rl_name, TEST_N_GAMES))
    print("\tAs player 1")
    results = test_players(rl['agent'], teachers['minimax'], TEST_N_GAMES)
    rl['results']['vs_minimax (p1)'] = {
        'wins': results[1],
        'loses':  results[2],
        'draws':  results[-1]
    }
    print(rl['results']['vs_minimax (p1)'])
    print("\tAs player 2")
    results = test_players(teachers['minimax'], rl['agent'], TEST_N_GAMES)
    rl['results']['vs_minimax (p2)'] = {
        'wins': results[2],
        'loses':  results[1],
        'draws':  results[-1]
    }
    print(rl['results']['vs_minimax (p2)'])
    print('\n')

Testing rl (random) VS minimax (100 games)
	As player 1
{'loses': 21, 'draws': 79, 'wins': 0}
	As player 2
{'loses': 49, 'draws': 51, 'wins': 0}


Testing rl (rl) VS minimax (100 games)
	As player 1
{'loses': 0, 'draws': 100, 'wins': 0}
	As player 2
{'loses': 93, 'draws': 7, 'wins': 0}


Testing rl (minimax) VS minimax (100 games)
	As player 1
{'loses': 65, 'draws': 35, 'wins': 0}
	As player 2
{'loses': 96, 'draws': 4, 'wins': 0}




### Against other RL agent
However, an agent may have learned a non-optimal policy, so playing the agents against each other lets me choose which one is the best. There can only be one.
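The pairings for such a tournament can be generated with `itertools.combinations_with_replacement`, which yields every unordered pair, including each agent against itself. A short illustration with the three agent names used in this notebook:

```python
import itertools

names = ['random', 'minimax', 'rl']
pairings = list(itertools.combinations_with_replacement(names, 2))
print(pairings)
print(len(pairings))  # 6 pairings for 3 agents
```

Each pair appears only once, so `('minimax', 'random')` is not generated separately. To cover both seatings, every pairing has to be played twice with the roles of player 1 and player 2 swapped.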

In [8]:
for (rl1_name, rl1), (rl2_name, rl2) in itertools.combinations_with_replacement(learners_seconds.items(), 2):
    print("Testing rl ({}) VS rl({}) ({} games)".format(rl1_name, rl2_name, TEST_N_GAMES))
    print("\tAs player 1")
    results = test_players(rl1['agent'], rl2['agent'], TEST_N_GAMES)
    # Store the results under the first agent of the pairing (rl1),
    # not under the stale `rl` variable left over from an earlier loop.
    rl1['results']['vs_{} (p1)'.format(rl2_name)] = {
        'wins': results[1],
        'loses':  results[2],
        'draws':  results[-1]
    }
    print(rl1['results']['vs_{} (p1)'.format(rl2_name)])
    print("\tAs player 2")
    results = test_players(rl2['agent'], rl1['agent'], TEST_N_GAMES)
    rl1['results']['vs_{} (p2)'.format(rl2_name)] = {
        'wins': results[2],
        'loses':  results[1],
        'draws':  results[-1]
    }
    print(rl1['results']['vs_{} (p2)'.format(rl2_name)])
    print('\n')

Testing rl (random) VS rl(random) (100 games)
	As player 1
{'loses': 100, 'draws': 0, 'wins': 0}
	As player 2
{'loses': 0, 'draws': 0, 'wins': 100}


Testing rl (random) VS rl(rl) (100 games)
	As player 1
{'loses': 0, 'draws': 9, 'wins': 91}
	As player 2
{'loses': 0, 'draws': 49, 'wins': 51}


Testing rl (random) VS rl(minimax) (100 games)
	As player 1
{'loses': 6, 'draws': 9, 'wins': 85}
	As player 2
{'loses': 10, 'draws': 16, 'wins': 74}


Testing rl (rl) VS rl(rl) (100 games)
	As player 1
{'loses': 100, 'draws': 0, 'wins': 0}
	As player 2
{'loses': 0, 'draws': 0, 'wins': 100}


Testing rl (rl) VS rl(minimax) (100 games)
	As player 1
{'loses': 9, 'draws': 9, 'wins': 82}
	As player 2
{'loses': 63, 'draws': 12, 'wins': 25}


Testing rl (minimax) VS rl(minimax) (100 games)
	As player 1
{'loses': 100, 'draws': 0, 'wins': 0}
	As player 2
{'loses': 0, 'draws': 0, 'wins': 100}




## Out of scope
This notebook does not cover how to find the best hyperparameters. It is just a glimpse of how to explore the search space, and it shows that, in order to learn how to win, the agent must win during training.

---
---

## Experiments (games)

In [9]:
TRAIN_GAMES = 2000  # Number of games for training

In [10]:
learners_games = {
    'random': {
        'agent': QLearner(1),
        'teacher': teachers['random'],
        'results': {},
        'train_func': lambda board,learner,teacher: learner._train_1_game(0.1, lambda x: 1-(x/10), board, teacher)
    },
    'minimax': {
        'agent': QLearner(1),
        'teacher': teachers['minimax'],
        'results': {},
        'train_func': lambda board,learner,teacher: learner._train_1_game(0.1, lambda x: 1-(x/10), board, teacher)
    },
    'rl': {
        'agent': QLearner(1),
        'teacher': None,  # Itself; assigned right after this dict is built
        'results': {},
        'train_func': lambda board, learner, teacher: learner._autotrain_1_game(0.1, lambda x: 1-(x/10), board)
    },
}

learners_games['rl']['teacher'] = learners_games['rl']['agent']

In [11]:
for rl_name, rl in learners_games.items():
    rl['games'] = TRAIN_GAMES
    print("Training rl ({}) in {} games".format(rl_name, TRAIN_GAMES))
    # train_player_games appears to return the elapsed time in seconds,
    # so store it under 'seconds' instead of overwriting the game count.
    rl['seconds'] = train_player_games(
        learner=rl['agent'],
        teacher=rl['teacher'],
        train_func=rl['train_func'],
        games=TRAIN_GAMES
    )
    print("\t {} seconds elapsed\n".format(rl['seconds']))

Training rl (random) in 2000 games
	 1.3201005458831787 seconds elapsed

Training rl (rl) in 2000 games
	 1.52003812789917 seconds elapsed

Training rl (minimax) in 2000 games
	 774.6538817882538 seconds elapsed



---
## Metrics

In [12]:
TEST_N_GAMES = 100  # Games to test

### Against Minimax

In [17]:
for rl_name, rl in learners_games.items():
    print("Testing rl ({}) VS minimax ({} games)".format(rl_name, TEST_N_GAMES))
    print("\tAs player 1")
    results = test_players(rl['agent'], teachers['minimax'], TEST_N_GAMES)
    rl['results']['vs_minimax (p1)'] = {
        'wins': results[1],
        'loses':  results[2],
        'draws':  results[-1]
    }
    print(rl['results']['vs_minimax (p1)'])
    print("\tAs player 2")
    results = test_players(teachers['minimax'], rl['agent'], TEST_N_GAMES)
    rl['results']['vs_minimax (p2)'] = {
        'wins': results[2],
        'loses':  results[1],
        'draws':  results[-1]
    }
    print(rl['results']['vs_minimax (p2)'])
    print('\n')

Testing rl (random) VS minimax (100 games)
	As player 1
{'loses': 100, 'draws': 0, 'wins': 0}
	As player 2
{'loses': 81, 'draws': 19, 'wins': 0}


Testing rl (rl) VS minimax (100 games)
	As player 1
{'loses': 46, 'draws': 54, 'wins': 0}
	As player 2
{'loses': 100, 'draws': 0, 'wins': 0}


Testing rl (minimax) VS minimax (100 games)
	As player 1
{'loses': 0, 'draws': 100, 'wins': 0}
	As player 2
{'loses': 56, 'draws': 44, 'wins': 0}




### Against other RL agent

In [18]:
for (rl1_name, rl1), (rl2_name, rl2) in itertools.combinations_with_replacement(learners_games.items(), 2):
    print("Testing rl ({}) VS rl({}) ({} games)".format(rl1_name, rl2_name, TEST_N_GAMES))
    print("\tAs player 1")
    results = test_players(rl1['agent'], rl2['agent'], TEST_N_GAMES)
    # Store the results under the first agent of the pairing (rl1),
    # not under the stale `rl` variable left over from an earlier loop.
    rl1['results']['vs_{} (p1)'.format(rl2_name)] = {
        'wins': results[1],
        'loses':  results[2],
        'draws':  results[-1]
    }
    print(rl1['results']['vs_{} (p1)'.format(rl2_name)])
    print("\tAs player 2")
    results = test_players(rl2['agent'], rl1['agent'], TEST_N_GAMES)
    rl1['results']['vs_{} (p2)'.format(rl2_name)] = {
        'wins': results[2],
        'loses':  results[1],
        'draws':  results[-1]
    }
    print(rl1['results']['vs_{} (p2)'.format(rl2_name)])
    print('\n')

Testing rl (random) VS rl(random) (100 games)
	As player 1
{'loses': 100, 'draws': 0, 'wins': 0}
	As player 2
{'loses': 0, 'draws': 0, 'wins': 100}


Testing rl (random) VS rl(rl) (100 games)
	As player 1
{'loses': 0, 'draws': 46, 'wins': 54}
	As player 2
{'loses': 100, 'draws': 0, 'wins': 0}


Testing rl (random) VS rl(minimax) (100 games)
	As player 1
{'loses': 0, 'draws': 0, 'wins': 100}
	As player 2
{'loses': 0, 'draws': 100, 'wins': 0}


Testing rl (rl) VS rl(rl) (100 games)
	As player 1
{'loses': 100, 'draws': 0, 'wins': 0}
	As player 2
{'loses': 0, 'draws': 0, 'wins': 100}


Testing rl (rl) VS rl(minimax) (100 games)
	As player 1
{'loses': 0, 'draws': 15, 'wins': 85}
	As player 2
{'loses': 52, 'draws': 12, 'wins': 36}


Testing rl (minimax) VS rl(minimax) (100 games)
	As player 1
{'loses': 100, 'draws': 0, 'wins': 0}
	As player 2
{'loses': 0, 'draws': 0, 'wins': 100}


