### Importing Libraries

In [2]:
import pickle
import numpy as np
import random

### Tic Tac Toe Game implementation using Python

In [3]:
class TicTacToe:
    def __init__(self):
        self.board = [" " for _ in range(9)]
        self.current_winner = None

    def print_board(self):
        for row in [self.board[i * 3 : (i + 1) * 3] for i in range(3)]:
            print("| " + " | ".join(row) + " |")

    @staticmethod
    def print_board_nums():
        number_board = [[str(i) for i in range(j * 3, (j + 1) * 3)] for j in range(3)]
        for row in number_board:
            print("| " + " | ".join(row) + " |")

    def available_moves(self):
        return [i for i, spot in enumerate(self.board) if spot == " "]

    def empty_squares(self):
        return " " in self.board

    def num_empty_squares(self):
        return self.board.count(" ")

    def make_move(self, square, player):
        if self.board[square] == " ":
            self.board[square] = player
            if self.winner(player):
                self.current_winner = player
            return True
        return False

    def winner(self, player):
        board = [self.board[i * 3 : (i + 1) * 3] for i in range(3)]
        win_conditions = [
            [board[0][0], board[0][1], board[0][2]],
            [board[1][0], board[1][1], board[1][2]],
            [board[2][0], board[2][1], board[2][2]],
            [board[0][0], board[1][0], board[2][0]],
            [board[0][1], board[1][1], board[2][1]],
            [board[0][2], board[1][2], board[2][2]],
            [board[0][0], board[1][1], board[2][2]],
            [board[0][2], board[1][1], board[2][0]],
        ]
        return [player, player, player] in win_conditions
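
The eight win conditions in `winner` can also be written as index triples into the flat 9-element board, which avoids rebuilding the 3x3 view on every call. This is only a minimal sketch: `WIN_LINES` and `has_won` are hypothetical names, not part of the class above.

```python
# Index triples for the 3 rows, 3 columns and 2 diagonals of a flat 3x3 board.
WIN_LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
    (0, 4, 8), (2, 4, 6),              # diagonals
]

def has_won(board, player):
    """Return True if `player` occupies all three squares of any line."""
    return any(all(board[i] == player for i in line) for line in WIN_LINES)

# X holds the main diagonal (squares 0, 4 and 8).
board = list("X   X   X")
print(has_won(board, "X"), has_won(board, "O"))
```

The square numbers match the layout printed by `print_board_nums`.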

### Q-Learning Agent implementation

Q-learning is a model-free reinforcement learning algorithm that learns the quality (Q-value) of each action, telling an agent which action to take in which state. It does not require a model of the environment, and it can handle stochastic transitions and rewards without modification.

Formula for Q-Learning: $ Q(s,a) \leftarrow Q(s,a) + \alpha \left( R(s) + \gamma \max_{a'} Q(s',a') - Q(s,a) \right) $

Where:
- $ Q(s,a) $ is the Q-value for state $ s $ and action $ a $.
- $ \alpha $ is the learning rate.
- $ R(s) $ is the reward for state $ s $.
- $ \gamma $ is the discount factor.
- $ s' $ is the next state.
- $ a' $ is an action available in the next state $ s' $; the maximum is taken over all such actions.
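
To make the formula concrete, here is a small worked sketch of a single update. The function name `q_update` and the input numbers are assumptions chosen for illustration; only the default `alpha` and `gamma` mirror the defaults used by the agent in this notebook.

```python
def q_update(q_sa, reward, max_q_next, alpha=0.1, gamma=0.9):
    """Apply one Q-learning update to a single Q(s, a) value."""
    # Target: immediate reward plus the discounted best value of the next state.
    td_target = reward + gamma * max_q_next
    # Move the old estimate a fraction alpha of the way towards the target.
    return q_sa + alpha * (td_target - q_sa)

# A winning move from an unexplored state: 0 + 0.1 * (5 + 0.9 * 0 - 0)
first = q_update(q_sa=0.0, reward=5.0, max_q_next=0.0)

# A non-terminal move whose successor already looks good:
# 2 + 0.1 * (0 + 0.9 * 4 - 2), roughly 2.16
second = q_update(q_sa=2.0, reward=0.0, max_q_next=4.0)

print(first, second)
```

The second call shows how value flows backwards: the move earns no reward itself, yet its estimate rises because the state it leads to is valuable.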

Algorithm:
1. Choose an action $ a $ in the current world state $ s $ based on current Q-value estimates $ Q(s,a) $ or pick a random action with probability $ \epsilon $.
2. Take the action $ a $ and observe the outcome state $ s' $ and reward $ r $.
3. Update the Q-value of current state and previous states using the formula $ Q(s,a) \leftarrow Q(s,a) + \alpha \left( R(s) + \gamma \max_{a'} Q(s',a') - Q(s,a) \right) $.
4. Set the state to the new state and repeat the process.
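
The four steps above can be sketched on a toy problem that is much smaller than tic-tac-toe. Everything here is an assumption made for illustration: the five-cell corridor environment, the names `step` and `train_corridor`, and the hyperparameter values. It is not the agent used in this notebook.

```python
import random

N_STATES = 5        # corridor cells 0..4; reaching cell 4 ends the episode
ACTIONS = (0, 1)    # 0 = step left, 1 = step right

def step(state, action):
    """Hypothetical toy environment: reward 1.0 only on reaching the last cell."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def train_corridor(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = {s: [0.0, 0.0] for s in range(N_STATES)}
    for _episode in range(episodes):
        state = 0
        for _step in range(100):  # cap the episode length
            # Step 1: epsilon-greedy choice with random tie-breaking.
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                best = max(q[state])
                action = rng.choice([a for a in ACTIONS if q[state][a] == best])
            # Step 2: act and observe the next state and reward.
            next_state, reward, done = step(state, action)
            # Step 3: update towards reward + gamma * max over next-state actions.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            # Step 4: continue from the new state.
            state = next_state
            if done:
                break
    return q

q = train_corridor()
# Greedy action per non-terminal cell (1 means "step right").
policy = [max(ACTIONS, key=lambda a: q[s][a]) for s in range(N_STATES - 1)]
print(policy)
```

Because the only reward sits at the right end, the learned greedy policy should step right in every cell.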

In [4]:
class QLearningAgent:
    '''
    Q-Learning Agent
    
    Attributes:
        player (str): the player's symbol ('X' or 'O')
        alpha (float): the learning rate
        gamma (float): the discount factor
        epsilon (float): the exploration rate
        q_table (dict): the Q-Table
        path (str): the path to the Q-Table file

    Methods:
        load_q_table: Load the Q-Table from a file
        save_q_table: Save the Q-Table to a file
        get_state: Get the state of the board
        get_reward: Get the reward based on the game state
        choose_action: Choose the best action based on the Q-Table
        update_q_values: Update the Q-Values based on the reward
        make_action: Make an action based on the Q-Table
        train: Train the agent
    '''
    def __init__(self, player, alpha=0.1, gamma=0.9, epsilon=0.2, path="../../data/pickle.pkl"):
        self.player = player
        self.alpha = alpha
        self.gamma = gamma
        self.epsilon = epsilon

        self.path = path
        self.q_table = self.load_q_table()

    def load_q_table(self) -> dict:
        '''Load the Q-Table from a file'''
        try:
            with open(self.path, "rb") as f:
                return pickle.load(f)
        except FileNotFoundError:
            return {}

    def save_q_table(self) -> None:
        '''Save the Q-Table to a file'''
        with open(self.path, "wb") as f:
            pickle.dump(self.q_table, f)

    def get_state(self, board: list) -> str:
        '''Get the state of the board as a 9-character string (usable as a Q-Table key)'''
        return "".join(board)

    def get_reward(self, game: TicTacToe, num_moves: int) -> float:
        '''Get the reward based on the game state'''
        if game.current_winner == self.player:  # if the agent wins
            return 5
        elif game.current_winner is not None:   # if the opponent wins
            return -5
        elif not game.empty_squares():          # if it's a draw
            return 0.5
        elif num_moves > 2:                     # if the game is still going
            return -0.25
        else:
            return 0                            # if the game just started

    def choose_action(self, game: TicTacToe) -> int:
        '''Choose the best action based on the Q-Table or randomly with epsilon probability'''
        state = self.get_state(game.board)
        if np.random.rand() < self.epsilon or state not in self.q_table:
            move = random.choice(game.available_moves())
            return move
        else:
            q_values = [self.q_table[state][move] for move in game.available_moves()]
            max_q_value = max(q_values)
            best_moves = [move for move, q_value in zip(game.available_moves(), q_values) if q_value == max_q_value]
            move = random.choice(best_moves)
            return move

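    def update_q_values(self, history: list, reward: float) -> None:
        '''Update the Q-Values based on the reward and the history'''
        # Walk the episode backwards, passing a discounted reward to earlier moves.
        # Note: the bootstrap term is the max over the *same* state's row, not over
        # the next state s' as in the formula above, which lets Q-values grow well
        # beyond the raw rewards.
        for state, action in reversed(history):
            if state not in self.q_table:
                self.q_table[state] = np.zeros(9)
            self.q_table[state][action] = (1 - self.alpha) * self.q_table[state][
                action
            ] + self.alpha * (reward + self.gamma * np.max(self.q_table[state]))
            reward *= self.gamma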

    def make_action(self, game, history: list, player: str) -> None:
        '''Make an action based on the Q-Table and update the Q-Values based on the reward and the history'''
        state = self.get_state(game.board)

        if state not in self.q_table:
            self.q_table[state] = np.zeros(9)
    
        action = self.choose_action(game)
        game.make_move(action, player)
        history.append((state, action))
        reward = self.get_reward(game, len(history))
        self.update_q_values(history, reward)
        

    def train(self, episodes: int = 1000) -> None:
        '''Train the agent through self-play, then save the Q-Table'''
        for episode in range(episodes):
            game = TicTacToe()
            history_player = []
            history_opponent = []
            turn = 'X'

            # Stop as soon as someone wins or the board is full
            while game.empty_squares() and game.current_winner is None:
                if turn == self.player:
                    self.make_action(game, history_player, self.player)
                    turn = "O"
                else:
                    self.make_action(game, history_opponent, "O")
                    turn = self.player

            if episode % 10000 == 0:
                print(f"Episode {episode}")
        self.save_q_table()
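
The greedy branch of `choose_action` can be illustrated in isolation. The sketch below is a simplified stand-in that uses plain lists instead of NumPy arrays; `greedy_move` and the Q-values are hypothetical and serve only to show how a board becomes a Q-Table key and how occupied squares are excluded.

```python
import random

def greedy_move(board, q_row, rng=random):
    """Pick the empty square with the highest Q-value, breaking ties at random."""
    available = [i for i, spot in enumerate(board) if spot == " "]
    best = max(q_row[i] for i in available)
    return rng.choice([i for i in available if q_row[i] == best])

board = ["X", "O", " ", " ", "X", " ", " ", " ", "O"]
state = "".join(board)   # 9-character key, as produced by get_state

# Made-up Q-values: occupied squares (0, 1, 4, 8) hold large values that must be ignored.
q_table = {state: [9.0, 9.0, 0.1, 0.4, 9.0, 0.2, 0.4, 0.3, 9.0]}

move = greedy_move(board, q_table[state])
print(repr(state), move)  # move is 3 or 6, the two tied best empty squares
```

Restricting the maximum to empty squares matters because each row of the table stores nine values, including entries for squares that are already taken.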

### Training the Q-Learning Agent

In [122]:
agent = QLearningAgent(player='X')
agent.train(episodes=1000000)

Episode 0
Episode 10000
Episode 20000
Episode 30000
Episode 40000
Episode 50000
Episode 60000
Episode 70000
Episode 80000
Episode 90000
Episode 100000
Episode 110000
Episode 120000
Episode 130000
Episode 140000
Episode 150000
Episode 160000
Episode 170000
Episode 180000
Episode 190000
Episode 200000
Episode 210000
Episode 220000
Episode 230000
Episode 240000
Episode 250000
Episode 260000
Episode 270000
Episode 280000
Episode 290000
Episode 300000
Episode 310000
Episode 320000
Episode 330000
Episode 340000
Episode 350000
Episode 360000
Episode 370000
Episode 380000
Episode 390000
Episode 400000
Episode 410000
Episode 420000
Episode 430000
Episode 440000
Episode 450000
Episode 460000
Episode 470000
Episode 480000
Episode 490000
Episode 500000
Episode 510000
Episode 520000
Episode 530000
Episode 540000
Episode 550000
Episode 560000
Episode 570000
Episode 580000
Episode 590000
Episode 600000
Episode 610000
Episode 620000
Episode 630000
Episode 640000
Episode 650000
Episode 660000
Episode 6

### Play Tic Tac Toe Game with Q-Learning Agent vs Random Agent

In [211]:
agent = QLearningAgent(player='X', path="../../data/pickle.pkl")

def play_game(agent: QLearningAgent):
    game = TicTacToe()
    turn = 'X'
    while game.empty_squares():
        if turn == 'X':
            action = agent.choose_action(game)
            game.make_move(action, 'X')
            turn = 'O'
        else:
            move = random.choice(game.available_moves())
            print(f"Player O chooses position {move}")
            game.make_move(move, 'O')
            turn = 'X'
        game.print_board()
        print('\n')
        if game.current_winner:
            print(f"Player {'AI' if game.current_winner == 'X' else 'O'} wins!")
            return 'AI' if game.current_winner == 'X' else 'O'
    if not game.current_winner:
        print("It's a tie!")
        return 'Tie'

In [237]:
play_game(agent)

Q-Values: [20.725331279416537, 20.959500600410813, 20.313741283863347, 20.873837921869473, 20.866353564549293, 20.2446856553271, 20.577835441167327, 20.834094394100575, 21.187976110677614]
Max Q-Value: 21.187976110677614
AI choose 8 with Q-Value 21.187976110677614
|   |   |   |
|   |   |   |
|   |   | X |


Player O chooses position 3
|   |   |   |
| O |   |   |
|   |   | X |


AI choose 7 randomly
|   |   |   |
| O |   |   |
|   | X | X |


Player O chooses position 6
|   |   |   |
| O |   |   |
| O | X | X |


AI choose 5 randomly
|   |   |   |
| O |   | X |
| O | X | X |


Player O chooses position 1
|   | O |   |
| O |   | X |
| O | X | X |


Q-Values: [15.466663247414818, 8.737352668998899, 9.274737496223356]
Max Q-Value: 15.466663247414818
AI choose 0 with Q-Value 15.466663247414818
| X | O |   |
| O |   | X |
| O | X | X |


Player O chooses position 2
| X | O | O |
| O |   | X |
| O | X | X |


Q-Values: [48.631106471794524]
Max Q-Value: 48.631106471794524
AI choose 4 with Q-Va

'AI'

In [209]:
count = {'AI': 0, 'O': 0, 'Tie': 0}

for _ in range(100):
    winner = play_game(agent)
    count[winner] += 1

print(count)

{'AI': 95, 'O': 4, 'Tie': 1}
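
The tallies above can be turned into rates with a rough uncertainty estimate. This is a sketch under the assumption that the 100 games are independent; `summarize` is a hypothetical helper, and the standard error uses the usual binomial approximation.

```python
import math

def summarize(counts):
    """Return {outcome: (rate, standard_error)} for a dict of game tallies."""
    total = sum(counts.values())
    summary = {}
    for outcome, n in counts.items():
        rate = n / total
        # Binomial standard error of an observed proportion.
        summary[outcome] = (rate, math.sqrt(rate * (1 - rate) / total))
    return summary

# Tallies reported by the cell above.
summary = summarize({'AI': 95, 'O': 4, 'Tie': 1})
for outcome, (rate, se) in summary.items():
    print(f"{outcome}: {rate:.0%} (+/- {se:.1%})")
```

With only 100 games the win rate carries an uncertainty of about two percentage points, so small differences between runs are not meaningful.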


### Play Tic Tac Toe Game with Q-Learning Agent vs Minimax Agent

In [8]:
import sys

sys.path.append('..')

from search.minimax import minimax

agent = QLearningAgent(player='X', path='../../data/pickle.pkl')

def play_game_vs_minmax(agent: QLearningAgent):
    game = TicTacToe()
    turn = 'X'
    while game.empty_squares():
        if turn == 'X':
            action = agent.choose_action(game)
            game.make_move(action, 'X')
            turn = 'O'
        else:
            move_info = minimax(game, 'O')
            move = move_info['position']
            print(f"Minimax O chooses position {move}")
            game.make_move(move, 'O')
            turn = 'X'
        game.print_board()
        print('\n')
        if game.current_winner:
            print(f"Player {'Minimax' if game.current_winner == 'O' else 'AI'} wins!")
            return 'Minimax' if game.current_winner == 'O' else 'AI'
    if not game.current_winner:
        print("It's a tie!")
        return 'Tie'

In [182]:
play_game_vs_minmax(agent)

Q-Values: [20.725331279416537, 20.959500600410813, 20.313741283863347, 20.873837921869473, 20.866353564549293, 20.2446856553271, 20.577835441167327, 20.834094394100575, 21.187976110677614]
Max Q-Value: 21.187976110677614
AI choose 8 with Q-Value 21.187976110677614
|   |   |   |
|   |   |   |
|   |   | X |


All possible moves and their scores:
Move: 0, Score: -3
Move: 1, Score: -3
Move: 2, Score: -3
Move: 3, Score: -3
Move: 4, Score: 0
Move: 5, Score: -3
Move: 6, Score: -3
Move: 7, Score: -3

Minimax O chooses position 4
|   |   |   |
|   | O |   |
|   |   | X |


AI choose 2 randomly
|   |   | X |
|   | O |   |
|   |   | X |


All possible moves and their scores:
Move: 0, Score: -5
Move: 1, Score: -5
Move: 3, Score: -5
Move: 5, Score: 0
Move: 6, Score: -5
Move: 7, Score: -5

Minimax O chooses position 5
|   |   | X |
|   | O | O |
|   |   | X |


AI choose 6 randomly
|   |   | X |
|   | O | O |
| X |   | X |


All possible moves and their scores:
Move: 0, Score: -3
Move: 1, Score: -3


'O'

In [11]:
count = {'AI': 0, 'Minimax': 0, 'Tie': 0}

for _ in range(100):
    winner = play_game_vs_minmax(agent)
    count[winner] += 1

print(count)

{'AI': 0, 'Minimax': 94, 'Tie': 6}


Although Minimax clearly outperformed the Q-Learning agent here (94 wins and 6 ties in 100 games), this is largely because tic-tac-toe has so few possible positions that an exhaustive search can evaluate every line of play. In games with vastly larger state spaces, such as Go, exhaustive search is infeasible, and learning-based approaches like Q-Learning become the more attractive option.
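
To see how small the game really is, the sketch below enumerates every position and every complete game by brute force, assuming X always moves first and play stops at a win or a full board. The names `is_won` and `explore` are hypothetical and independent of the `TicTacToe` class and the imported `minimax`.

```python
WIN_LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),
    (0, 3, 6), (1, 4, 7), (2, 5, 8),
    (0, 4, 8), (2, 4, 6),
]

def is_won(board, player):
    """True if `player` holds a complete row, column or diagonal."""
    return any(all(board[i] == player for i in line) for line in WIN_LINES)

def explore(board, player, seen):
    """Depth-first walk over all games; returns the number of finished games.

    `board` is a 9-character string, `player` is the side to move, and every
    visited position is recorded in `seen`.
    """
    seen.add(board)
    previous = "O" if player == "X" else "X"
    # The game is over if the previous mover just won or the board is full.
    if is_won(board, previous) or " " not in board:
        return 1
    games = 0
    for i, spot in enumerate(board):
        if spot == " ":
            games += explore(board[:i] + player + board[i + 1:], previous, seen)
    return games

seen = set()
games = explore(" " * 9, "X", seen)
print(len(seen), games)  # 5478 255168
```

A few thousand positions fit comfortably in memory and can be searched completely, which is what lets an exhaustive method play perfectly here.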