# TicTacToe

## First Day

On the first day, we create a simple game for two players. 

## Second Day

On the second day, we add a few features before creating a bot: 

1) Clean up the code a bit. 
2) Add a status bar and a reset button to start a new game. 
3) Include a `check_win` function to determine the winner and end the game. 

```
[Status]
 ___________
| O |   |   |
 --- --- ---
|   | X |   |
 --- --- ---
| X | O |   |
 --- --- ---
|   Reset   |
 -----------
```

In [2]:
# Install ipywidgets and IPython if they are not found
import ipywidgets as widgets
from IPython.display import display


# The game class to construct the tic tac toe game
class TicTacToe: 
    def __init__(self): 
        self.board = [ [' ' for col in range(3)] for row in range(3)]
        self.player = 'X'
        self.winner = ' '
        
        # User Interface element
        # Status Bar
        self.status = widgets.Label('Ready')
        # Cell Buttons
        self.buttons = [ [widgets.Button(description='') for button in range(3)] for row in range(3)]
        # register the make_move function to each button's onclick event
        for i in range(3): 
            for j in range(3): 
                self.buttons[i][j].on_click(self.make_move(i, j))
        self.button_list = [button for row in self.buttons for button in row]
        # Reset Button
        self.reset_button = widgets.Button(description='New Game', layout=widgets.Layout(width='450px'))
        self.reset_button.on_click(self.reset())
        # Output
        self.output = widgets.Output()
    
    # set a bot to play this game
    def set_bot(self, bot): 
        self.bot = bot
    
    # Start a NEW game
    def reset(self): 
        def on_reset_clicked(_): 
            self.player = 'X'
            self.winner = ' '
            self.status.value = 'Ready'
            # clear the memory of 3x3 matrix
            for i in range(3): 
                for j in range(3): 
                    self.board[i][j] = ' '
            # clear the buttons on the grid
            for button in self.button_list: 
                button.description = ' '
        return on_reset_clicked          
    
    # Put either "X" or "O" on the ith row and jth column
    def make_move(self, i, j): 
        def on_button_clicked(_): 
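            # ignore clicks on occupied cells or after the game has ended
            if self.winner != ' ' or self.board[i][j] != ' ': 
                return
            # human move
            self.move(i,j)
            # bot move, only if a bot has been set and the game is still open
            if self.winner==' ' and getattr(self, 'bot', None) is not None: 
                self.bot.move()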
            
        return on_button_clicked
    
    # core function to make a move
    # for a human (click) or a bot
    def move(self, i, j):
        if self.winner==' ' and self.board[i][j] == ' ': 
            self.board[i][j] = self.player
            self.buttons[i][j].description = self.player

            # turn taking
            if self.player == 'X': 
                self.player = 'O'
            else: 
                self.player = 'X'
        self.status.value = 'In progress, ' + self.player + ' playing.'
        # check winner
        self.winner = self.check_win()
        if self.winner != ' ': 
            self.status.value = self.winner + ' won!'
    
    # Check if there is a winner
    # return the winner, 'X' or 'O'
    # OR, return ' ' if no winner
    def check_win(self): 
        # check diagonals
        if self.board[1][1]!=' ' and self.board[0][0]==self.board[1][1] and self.board[1][1]==self.board[2][2]:
            return self.board[0][0]
        if self.board[1][1]!=' ' and self.board[0][2]==self.board[1][1] and self.board[1][1]==self.board[2][0]:
            return self.board[1][1]
        
        # check rows
        for i in range(3): 
            if self.board[i][0]!=' ' and self.board[i][0]==self.board[i][1] and self.board[i][1]==self.board[i][2]:
                return self.board[i][0]
        
        # check columns
        for j in range(3): 
            if self.board[0][j]!=' ' and self.board[0][j]==self.board[1][j] and self.board[1][j]==self.board[2][j]:
                return self.board[0][j]
        
        # no winner found at this point
        return ' '
        
    
    def display(self): 
        self.grid = widgets.GridBox(self.button_list,layout=widgets.Layout( grid_template_columns="repeat(3, 150px)"))
        self.game_box = widgets.VBox([self.status, self.grid, self.reset_button])
        display(self.game_box, self.output)
        
game = TicTacToe()
game.display()


## Bot 1: Random Bot

The Tic-Tac-Toe game is now ready and can be played by two human players. 

Let's create our first bot so we can play against it: 

1. In this first attempt, we will create a **RANDOM** bot, just like an iRobot Roomba. 
2. It won't be very smart but will perform some (random) work, again, just like a Roomba. 
3. This is our **baseline** model. We will create more advanced (smarter) models down the road and compare them against this very first model. 

In [4]:
import random

class RandomBot:
    def __init__(self, game, player): 
        if player not in ['X', 'O']:
            raise ValueError("Player must be either X or O!")
        self.game = game
        self.player = player
    
    # a move function to pick a random empty cell
    def move(self): 
        avail_cells = [ (i,j) for i in range(3) for j in range(3) if self.game.board[i][j]==' ' ]
        # board is full (a draw): nothing to do
        if not avail_cells: 
            return
        cell = random.choice(avail_cells)
        self.game.move(cell[0], cell[1])

In [6]:
game = TicTacToe()
game.display()
bot = RandomBot(game, 'O')
game.set_bot(bot)


## Bot 2: Learn by Rewards (or Penalty)

### Bug Fix First

In the `check_win()` function: 
1. Check whether there are 3 SAME marks in a (horizontal/vertical/diagonal) row, e.g. 

```python
self.board[0][0]==self.board[1][1] and self.board[1][1]==self.board[2][2]
```

2. In addition, these should be NON-empty marks, i.e.: 

```python
self.board[0][0]!=' ' 
```
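
The two conditions above can be combined into one check over all eight lines. The following is only a sketch of that idea, not the notebook's `check_win()`: the names `LINES` and `find_winner` are made up for illustration, and the board is assumed to be the same 3x3 list of lists used by the game.

```python
# All eight winning lines as (row, col) triples: 3 rows, 3 columns, 2 diagonals.
LINES = (
    [[(i, 0), (i, 1), (i, 2)] for i in range(3)]
    + [[(0, j), (1, j), (2, j)] for j in range(3)]
    + [[(0, 0), (1, 1), (2, 2)], [(0, 2), (1, 1), (2, 0)]]
)

def find_winner(board):
    """Return 'X' or 'O' if a line holds three identical non-empty marks, else ' '."""
    for line in LINES:
        marks = {board[i][j] for i, j in line}
        # condition 1: a single distinct mark; condition 2: it is not the empty cell
        if len(marks) == 1 and ' ' not in marks:
            return marks.pop()
    return ' '

empty = [[' '] * 3 for _ in range(3)]
top_row = [['X', 'X', 'X'], [' ', 'O', ' '], ['O', ' ', ' ']]
print(repr(find_winner(empty)), repr(find_winner(top_row)))
```

Without the non-empty condition, the empty board would be reported as "won" by the blank mark, which is the bug described in this section.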

### Reinforcement Learning

The following video explains the idea of Reinforcement Learning (RL): 
https://youtu.be/QUPpKgXJd5M?si=D3yRIdaC9GWveXgH

Key ideas of RL applied to the Tic-Tac-Toe bot: 
1. Monitors every state of the game, i.e. the 'X', 'O', and ' ' marks on the 3x3 grid (at most `3^9 = 19683` configurations). 
2. Makes a move and, depending on the `exploration rate`, it will select: 
    * EITHER a random move to **explore** different situations
    * OR the **best move** based on past rewards
3. No reward or penalty if there is no immediate winner. 
4. In the end: 
    * IF the bot wins, the LAST move will receive a reward of `1`
    * IF the bot loses, the LAST move will receive a penalty of `-1`
5. This repeats with NEW games and the bot continues to learn. 
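
The explore-or-exploit choice in step 2 is often called an epsilon-greedy policy. Below is a minimal sketch of it, assuming cells are numbered 0 to 8 (row * 3 + column) and that past rewards are stored as one value per cell; the function name `choose_action` and the sample values are hypothetical.

```python
import random

def choose_action(q_values, available_actions, exploration_rate):
    """Pick a cell: random with probability exploration_rate, else the best known one."""
    if random.random() < exploration_rate:
        return random.choice(available_actions)                # explore
    return max(available_actions, key=lambda a: q_values[a])   # exploit

# Hypothetical learned values for one state; cell 8 looks best but is already taken.
q_values = [0.0, 0.2, 0.0, 0.0, 0.7, 0.0, 0.0, 0.0, 0.9]
available = [1, 4]

print(choose_action(q_values, available, exploration_rate=0.0))  # never explores
```

Restricting the choice to `available` matters: taking the best value over all nine cells could select a cell that is already occupied.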

Reading the above, one may wonder whether ONLY the LAST move is rewarded. The answer is NO. 
* All actions leading to the LAST move (for a win or loss) are rewarded or penalized, but the reward/penalty is discounted. 

For example, given the RLBot actions: 
```
Move1 => Move2 => Move3 (WIN)
```

Its rewards will be like: 

```
Move3 (1 point) =>  Move2 (0.9 point) => Move1 (0.9*0.9 point)
```
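
The discounting in this example can be written out directly. This is a small sketch, assuming a discount factor of 0.9 as in the example; `discounted_rewards` is a name chosen here for illustration.

```python
def discounted_rewards(num_moves, final_reward, discount_factor=0.9):
    """Credit for each move, first move first; the last move gets the full reward."""
    return [final_reward * discount_factor ** (num_moves - 1 - k)
            for k in range(num_moves)]

credits = discounted_rewards(3, 1.0)
for move_number, credit in enumerate(credits, start=1):
    print(f"Move{move_number}: {credit:.2f}")
```

Each step further from the final move multiplies the credit by the discount factor once more, so a loss (`final_reward=-1.0`) produces the same pattern with negative values.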

* We have to repeat the game many times to train the bot, so that rewards/penalties propagate back to earlier moves. 
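
Why repetition is needed can be seen in a toy version of the update rule. The sketch below is a simplification, not the training loop of this notebook: it assumes the same three winning moves are replayed every game, ignores the opponent, and keeps a single value per move in a list `q`.

```python
learning_rate, discount_factor = 0.5, 0.9
# One value per move in the winning sequence Move1 => Move2 => Move3.
q = [0.0, 0.0, 0.0]

def play_one_game(q):
    """Replay the same three moves once and apply the Q-learning update to each."""
    for k in range(3):
        reward = 1.0 if k == 2 else 0.0          # only the last move wins the game
        next_value = q[k + 1] if k < 2 else 0.0  # nothing follows the final move
        q[k] += learning_rate * (reward + discount_factor * next_value - q[k])

play_one_game(q)
print(q)   # after one game only the last move has gained value

for _ in range(200):
    play_one_game(q)
print([round(v, 2) for v in q])
```

After a single game the earlier moves are still at zero, because the reward has only reached the last move. With more games the values move toward `0.9*0.9`, `0.9`, and `1`, matching the discounted rewards in the example.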

In [10]:
import numpy as np

class RLBot:
    def __init__(self, symbol, learning_rate=0.5, discount_factor=0.9, exploration_rate=0.1):
        self.symbol = symbol
        self.q_table = {}
        self.set_params(learning_rate, discount_factor, exploration_rate)
    
    # parameters affecting the bot's learning behavior
    def set_params(self, learning_rate=0.5, discount_factor=0.9, exploration_rate=0.1): 
        self.learning_rate = learning_rate
        self.discount_factor = discount_factor
        self.exploration_rate = exploration_rate
        
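    # encode the board as a string so it can be used as a key in q_table
    # (np.array accepts both a list of lists and an existing array)
    def get_state(self, board):
        return str(np.array(board).reshape(-1))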
        
    def get_q_values(self, state):
        if state not in self.q_table:
            self.q_table[state] = np.zeros(9)
        return self.q_table[state]
    
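    def select_action(self, state, available_actions):
        # explore: pick a random available cell (flat index 0-8)
        if np.random.random() < self.exploration_rate:
            return int(np.random.choice(available_actions))
        # exploit: pick the available cell with the highest Q-value
        q_values = self.get_q_values(state)
        return max(available_actions, key=lambda a: q_values[a])
    
    def update_q_values(self, old_state, action, reward, new_state):
        old_q_values = self.get_q_values(old_state)
        new_q_values = self.get_q_values(new_state)
        old_q_values[action] = old_q_values[action] + self.learning_rate * (reward + self.discount_factor * np.max(new_q_values) - old_q_values[action])
        
    def make_move(self, board):
        # accept a list of lists (as in TicTacToe.board) or a NumPy array
        board = np.array(board)
        state = self.get_state(board)
        # available cells as flat indices: row * 3 + column
        available_actions = [int(i) * 3 + int(j) for i, j in zip(*np.where(board == ' '))]
        # the returned action is a flat index; divmod(action, 3) gives (row, column)
        action = self.select_action(state, available_actions)
        return action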

In [11]:
game.board

[[' ', 'O', 'O'], [' ', 'X', 'O'], ['X', 'X', 'X']]
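
The board printed above is a plain Python list of lists, which has no `reshape` method, so it has to be converted to a NumPy array before it can be flattened. The sketch below shows one way to do that and to derive a state key and the empty cells from it; the variable names are illustrative only.

```python
import numpy as np

board = [[' ', 'O', 'O'], [' ', 'X', 'O'], ['X', 'X', 'X']]  # the board shown above

flat = np.array(board).reshape(-1)   # 3x3 list of lists -> 1-D array of 9 marks
state = str(flat)                    # string key, usable in a dict such as q_table
empty_cells = [int(k) for k in np.where(flat == ' ')[0]]  # flat indices of empty cells

print(flat.shape)
print(state)
print(empty_cells)
```

A flat index `k` maps back to the grid with `divmod(k, 3)`, which gives `(row, column)`.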

In [None]:
np.array(game.board).reshape(-1)