# LAB10
Use reinforcement learning to devise a tic-tac-toe player.

### My solution
To accomplish the task, I used model-free Q-learning, implemented as follows:
1) Start the agent in state $s_0$, which corresponds to no moves played (the board is initially empty).
2) In each state, choose an action $a$ as follows:
    - a random action with probability $\varepsilon$ (exploration, useful to discover moves that the current Q-values underrate)
    - the action with the highest Q-value with probability $1-\varepsilon$ (exploitation)
3) Take action $a$ to move to the next state and update the Q-table as follows:
$$
    Q_{t+1}(s, a) = (1 - \alpha) \, Q_t(s, a) + \alpha \left( r + \gamma \max_{a'} Q_t(s', a') \right)
$$ 
->  Here $s'$ is the state reached after the random opponent's reply  
(not simply the state right after the trained player's action), and the maximum is taken over the actions $a'$ available in $s'$.
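
As a small worked example of the update rule, the sketch below applies it twice with the default values used later in the notebook ($\alpha = 0.7$, $\gamma = 0.9$). The function name `q_update` and the numbers fed into it are only illustrative and are not part of the notebook's code.

```python
def q_update(old_q, reward, max_next_q, alpha=0.7, gamma=0.9):
    # Blend the old estimate with the new target r + gamma * max_a' Q(s', a')
    return (1 - alpha) * old_q + alpha * (reward + gamma * max_next_q)

# A winning move seen for the first time: old estimate 0, reward 1, no next state
first = q_update(old_q=0.0, reward=1, max_next_q=0.0)

# The move before it: no immediate reward, but the next state is now worth `first`
second = q_update(old_q=0.0, reward=0, max_next_q=first)

print(f"{first:.3f} {second:.3f}")  # 0.700 0.441
```

The second call shows how the reward of a winning move propagates backwards, discounted by $\gamma$, to the moves that led to it.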

Note that in this game I set the reward $r$ equal to: 
- 1 if the trained player wins
- -1 if the opponent wins
- 0 if there is a draw

The RL player is trained first against a random player and then against itself,  
whereas the evaluation is done against a random player.  
Further explanations of the functions and parameters are given as comments in the code.

In [1]:
import random
from itertools import combinations
from collections import namedtuple
from tqdm.auto import tqdm
import numpy as np


In [2]:
State = namedtuple('State', ['x', 'o']) # cell indices (0-8) occupied by each of the two players
MAGIC = [2, 7, 6, 9, 5, 1, 4, 3, 8] # magic square: three cells form a line exactly when their values sum to 15
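
The magic-square trick can be checked directly. The sketch below lists the eight winning lines by hand (the name `LINES` is introduced here only for the check) and compares them with the triples of cells whose magic values sum to 15.

```python
from itertools import combinations

MAGIC = [2, 7, 6, 9, 5, 1, 4, 3, 8]  # same layout as in the notebook

# The eight winning lines written out by hand: rows, columns, diagonals
LINES = {
    (0, 1, 2), (3, 4, 5), (6, 7, 8),
    (0, 3, 6), (1, 4, 7), (2, 5, 8),
    (0, 4, 8), (2, 4, 6),
}

# Every triple of cells whose magic values sum to 15
magic_lines = {
    trio for trio in combinations(range(9), 3)
    if sum(MAGIC[i] for i in trio) == 15
}

print(magic_lines == LINES)  # True
```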

### Tic-Tac-Toe class 

In [3]:
class TicTacToe:
    def __init__(self):
        self.board = np.zeros(9) # 0 indicates an empty cell on the board
        self.current_player = 1 # current player assumes value 1 or -1

    def print_board(self):
        for i in range(9):
            if i in [2,5,8]:
                end_ = "\n"
            else:
                end_ = "|"

            if self.board[i] == 0:
                print(" ", end = end_)
            if self.board[i] == 1:
                print("X", end = end_)
            if self.board[i] == -1:
                print("O", end = end_)
        print("")

    def win(self, elements):
        # Check whether elements (a sequence of cell indices) contains a winning line
        magic_numbers = [MAGIC[i] for i in elements]
        return any(sum(c) == 15 for c in combinations(magic_numbers, 3))

    def state_value(self, pos: State):
        # State evaluation used as reward
        if self.win(pos.x): # trained player wins
            return 1
        elif self.win(pos.o): # opponent wins
            return -1
        else:
            return 0

    def available_actions(self):
        # return the available actions
        return [i for i, v in enumerate(self.board) if v == 0] # available actions are the empty positions 

    def make_move(self, action):
        # insert a new symbol (+1 or -1) in the board and pass the turn to the other player
        self.board[action] = self.current_player # put in position "action" a -1 or a +1 depending on the player
        self.current_player = -self.current_player # change current player

    def get_state(self):
        # return the state of the game (positions of the symbols on the game board)
        return State(tuple(sorted([i for i, v in enumerate(self.board) if v == 1])), # positions of the symbol +1
                     tuple(sorted([i for i, v in enumerate(self.board) if v == -1]))) # positions of the symbol -1

### Agent that exploits Q-Learning 

In [4]:
class Agent:
    def __init__(self, epsilon=0.1, alpha=0.7, gamma=0.9):
        self.q_table = {} # q-table defined as dictionary with key (state, action) and value q-value
        self.epsilon = epsilon # used to choose between random and best action
        self.alpha = alpha # it controls the weight given to new information when updating q-values
        self.gamma = gamma # discount factor

    def get_q_value(self, state, action):
        # return a q_value given a tuple (state, action)
        return self.q_table.get((state, action), 0)

    def choose_action(self, state, available_actions):
        if random.uniform(0, 1) < self.epsilon:
            return random.choice(available_actions) # random action with probability epsilon
        else:
            # lambda function to return the action linked to the best q-value (given the state)
            best_action = max(available_actions, key=lambda a: self.get_q_value(state, a))

            return best_action # best action with probability (1-epsilon)

    def update_q_value(self, states, actions, reward, available_actions):
        # states = [before the player's move, after the player's move, after the opponent's reply]
        # actions = [player's move, opponent's reply]
        if available_actions:
            # compute the new q_value on the next best action after the response of the opponent
            max_next_q_value = max(self.get_q_value(states[2], next_action) for next_action in available_actions)
        else:
            max_next_q_value = 0  # no more available actions

        old_q_value = self.get_q_value(states[0], actions[0])
        # model free Q-Learning update
        self.q_table[(states[0], actions[0])] = (1-self.alpha)*old_q_value + self.alpha*(reward + self.gamma*max_next_q_value)
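
To illustrate how the action selection behaves, here is a minimal standalone sketch of the same $\varepsilon$-greedy rule applied to a plain dictionary. The example state and Q-values are made up for illustration and do not come from a trained agent.

```python
import random

def choose_action(q_table, state, available_actions, epsilon):
    # With probability epsilon explore, otherwise pick the highest Q-value
    if random.uniform(0, 1) < epsilon:
        return random.choice(available_actions)
    # Unseen (state, action) pairs default to 0, as in Agent.get_q_value
    return max(available_actions, key=lambda a: q_table.get((state, a), 0))

state = ((0,), (2,))  # hypothetical position: X on cell 0, O on cell 2
q_table = {(state, 4): 0.5, (state, 8): -0.2}  # made-up Q-values

greedy = choose_action(q_table, state, [1, 3, 4, 8], epsilon=0.0)
untrained = choose_action({}, state, [1, 3, 4, 8], epsilon=0.0)
explore = choose_action(q_table, state, [1, 3, 4, 8], epsilon=1.0)
print(greedy, untrained)  # 4 1
```

Since `max` returns the first maximal element, an empty Q-table always yields the lowest-index free cell, so early in training the variety of moves comes only from the $\varepsilon$ branch.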

### Training the agent
The training is divided into two parts:   
- the first half is against a random player
- the second half is against itself

In [5]:
q_agent = Agent()

training_rounds = 100_000

for round_idx in tqdm(range(training_rounds)): 
    # One training game: random opponent in the first half of the rounds, self-play in the second
    tic_tac_toe = TicTacToe() # initiate each time to clear the board
    actions = [] # action of the player + opponent response action
    states = [] # state before player action + state after player action + state after opponent response action
    while tic_tac_toe.available_actions():
        state = tic_tac_toe.get_state()
        states.append(state)
        available_actions = tic_tac_toe.available_actions()

        if round_idx <= training_rounds/2:  
            # the trained player plays against a random player
            if tic_tac_toe.current_player == 1:
                action = q_agent.choose_action(state, available_actions)
            else:
                action = random.choice(available_actions)
        else:
            # the trained player plays against itself
            action = q_agent.choose_action(state, available_actions)

        tic_tac_toe.make_move(action)
        actions.append(action)
        reward = tic_tac_toe.state_value(tic_tac_toe.get_state()) # compute the reward

        # q table update
        if len(actions) == 2:
            states.append(tic_tac_toe.get_state())
            q_agent.update_q_value(states, actions, reward, tic_tac_toe.available_actions())
            actions = []
            states = []

100%|██████████| 100000/100000 [00:17<00:00, 5867.34it/s]


### Evaluation of the agent
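In this section I evaluate the trained agent (playing X and moving first) against a random player.
Each game is played until the board is full, and the final position is then scored with `state_value`, which checks for an X line before an O line.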

In [11]:
n_games = 100
wins = 0
draws = 0
losses = 0

for _ in range(n_games):
    tic_tac_toe = TicTacToe()
    while tic_tac_toe.available_actions():
        state = tic_tac_toe.get_state() 
        if tic_tac_toe.current_player == 1:
            # Q-learning agent turn
            available_actions = tic_tac_toe.available_actions()
            action = q_agent.choose_action(state, available_actions)
            tic_tac_toe.make_move(action) # insert a symbol (+1) in the board and change player
        else:
            # random player turn
            available_actions = tic_tac_toe.available_actions()
            action = random.choice(available_actions)
            tic_tac_toe.make_move(action) # insert a symbol (-1) in the board and change player
    
    # when there are no more available moves
    result = tic_tac_toe.state_value(tic_tac_toe.get_state())
    if result == 1:
        wins += 1
    elif result == -1:
        losses += 1
    else:
        draws += 1

print(f"Game stats against a random player over {n_games} games:")
print(f"Wins = {wins}, Draws = {draws}, Losses = {losses}")

Game stats against a random player over 100 games:
Wins = 100, Draws = 0, Losses = 0
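
Because the evaluation loop keeps playing until the board is full, the score is computed on the final board rather than at the first completed line. A variant that stops as soon as either player wins could look like the sketch below. It is a standalone illustration, not the notebook's code: `has_won`, `play_game` and the policy functions are hypothetical names, and both players are random or fixed placeholders rather than the trained agent.

```python
import random
from itertools import combinations

MAGIC = [2, 7, 6, 9, 5, 1, 4, 3, 8]  # same magic square as in the notebook

def has_won(cells):
    # True if any three of the given cells lie on one line
    return any(sum(MAGIC[i] for i in trio) == 15 for trio in combinations(cells, 3))

def play_game(policy_x, policy_o, rng):
    # Play one game and stop at the first win: returns 1 (X), -1 (O) or 0 (draw)
    cells = {1: [], -1: []}
    free = list(range(9))
    player = 1
    while free:
        policy = policy_x if player == 1 else policy_o
        move = policy(cells[1], cells[-1], free, rng)
        free.remove(move)
        cells[player].append(move)
        if has_won(cells[player]):
            return player  # the game ends here, no further moves are played
        player = -player
    return 0

def random_policy(x_cells, o_cells, free, rng):
    # Placeholder opponent: uniformly random legal move
    return rng.choice(free)

def first_free_policy(x_cells, o_cells, free, rng):
    # Placeholder deterministic player: always the lowest free cell
    return free[0]

rng = random.Random(0)
results = [play_game(random_policy, random_policy, rng) for _ in range(1000)]
print({outcome: results.count(outcome) for outcome in (1, 0, -1)})
```

To evaluate the trained agent with this loop, `policy_x` would wrap `q_agent.choose_action` and build the `State` from the two cell lists.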


### Representation of a game against a random player

In [18]:
tic_tac_toe = TicTacToe()

while tic_tac_toe.available_actions():
    state = tic_tac_toe.get_state() 
    if tic_tac_toe.current_player == 1:
        # Q-learning agent turn
        available_actions = tic_tac_toe.available_actions()
        action = q_agent.choose_action(state, available_actions)
    else:
        # random player turn
        available_actions = tic_tac_toe.available_actions()
        action = random.choice(available_actions)

    tic_tac_toe.make_move(action) # insert a symbol (+1 or -1) in the board and change player
    tic_tac_toe.print_board()
    
    if tic_tac_toe.state_value(tic_tac_toe.get_state()) in [-1, 1]:
        break

X| | 
 | | 
 | | 

X| |O
 | | 
 | | 

X| |O
 |X| 
 | | 

X| |O
 |X| 
 | |O

X| |O
X|X| 
 | |O

X| |O
X|X| 
 |O|O

X| |O
X|X|X
 |O|O

