## Bot 2: Learn by Rewards (or Penalty)

### 2.A. Code Reorganization

1. `game.py` for game-related modules/classes. 
2. `bots.py` for bots and machine-learning modules. 

### 2.B. Bug Fix

In the `check_win()` function: 
1. Check whether there are 3 SAME marks in a (horizontal/vertical/diagonal) row, e.g. 

```python
self.board[0][0]==self.board[1][1] and self.board[1][1]==self.board[2][2]
```

2. In addition, these should be NON-empty marks, i.e.: 

```python
self.board[0][0]!=' ' 
```
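The two conditions can be combined into one check per line of three cells. The following is a minimal sketch, not the actual `check_win()` from `game.py`: it is written as a standalone function that takes the board as an argument, and it assumes that "no winner" is reported as `' '` (which matches the `winner!=' '` test used later in the training loop).

```python
def check_win(board):
    """Return 'X' or 'O' if that mark fills a line, otherwise ' ' (assumed convention)."""
    lines = []
    lines.extend([[(r, c) for c in range(3)] for r in range(3)])  # 3 rows
    lines.extend([[(r, c) for r in range(3)] for c in range(3)])  # 3 columns
    lines.append([(i, i) for i in range(3)])                      # main diagonal
    lines.append([(i, 2 - i) for i in range(3)])                  # anti-diagonal

    for line in lines:
        a, b, c = (board[row][col] for row, col in line)
        # Condition 2 (non-empty) guards condition 1 (all three the same):
        # without it, three empty cells would count as a "win".
        if a != ' ' and a == b and b == c:
            return a
    return ' '
```

Calling it on an empty board returns `' '`, which is exactly the case the bug fix is meant to handle.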

In [29]:
%load_ext autoreload
%autoreload 2

from game import TicTacToe
from bots import RandomBot, RLBot

game = TicTacToe()
bot = RandomBot(game, 'O')
game.set_bot(bot)
game.display()

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [8]:
import numpy as np
np_board = np.array(game.board)
str(np_board.reshape(-1))

"['X' ' ' 'O' ' ' 'X' 'O' ' ' ' ' 'X']"

In [13]:
index_2d = list(zip(*np.where(np_board==' ')))
index_1d = [ np.ravel_multi_index(act, (3,3)) for act in index_2d ]
index_1d

[1, 3, 6, 7]
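The cells above turn the empty cells into flat indices `0..8`. To place a mark, a flat index has to be mapped back to a `(row, col)` pair; `np.unravel_index` is the inverse of `np.ravel_multi_index`. The sketch below uses the same board as the output above, and the helper name `to_row_col` is made up for illustration.

```python
import numpy as np

board = [['X', ' ', 'O'],
         [' ', 'X', 'O'],
         [' ', ' ', 'X']]
np_board = np.array(board)

# Flat indices of the empty cells, computed in one step.
empty_flat = np.flatnonzero(np_board.reshape(-1) == ' ')

def to_row_col(flat_index, shape=(3, 3)):
    """Convert a flat action index back to (row, col) board coordinates."""
    row, col = np.unravel_index(flat_index, shape)
    return int(row), int(col)

empty_cells = [to_row_col(i) for i in empty_flat]
```

Here `empty_flat` holds the same four indices as `index_1d`, and `empty_cells` holds the matching coordinate pairs.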

### 2.C. Reinforcement Learning

The following video explains the idea of Reinforcement Learning (RL): 
https://youtu.be/QUPpKgXJd5M?si=D3yRIdaC9GWveXgH

Key ideas of RL applied to Tic-Tac-Toe bot: 
1. Monitors every state of the game, i.e. the 'X', 'O', and ' ' marks on the 3x3 grid (at most `3^9 = 19683` configurations, since each of the 9 cells holds one of 3 marks). 
2. Makes a move and, depending on the `exploration rate`, selects: 
    * EITHER a random move to **explore** different situations
    * OR the **best move** based on past rewards
3. Gives no reward or penalty as long as there is no winner yet. 
4. In the end (the training loop below uses `10` and `-10` in place of `1` and `-1`): 
    * IF the bot wins, the LAST move receives a reward of `1`
    * IF the bot loses, the LAST move receives a penalty of `-1`
5. This repeats with NEW games and the bot continues to learn. 
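
Step 2 above is commonly called epsilon-greedy selection. The sketch below shows one way it could look; it is not the code in `bots.py`. The Q-table layout (a dict keyed by `(state, action)`), the function name `choose_action`, and the default value `0.0` for unseen moves are all assumptions made for illustration.

```python
import random

def choose_action(q_table, state, available_actions, exploration_rate, rng=random):
    """Pick a move: random with probability `exploration_rate`, else the best known."""
    if rng.random() < exploration_rate:
        # Explore: try any legal move, regardless of past rewards.
        return rng.choice(available_actions)

    # Exploit: look up the learned value of each legal move (0.0 if never seen).
    q_values = [q_table.get((state, action), 0.0) for action in available_actions]
    best = max(q_values)
    # Break ties randomly so the bot does not always favor the first cell.
    best_actions = [a for a, q in zip(available_actions, q_values) if q == best]
    return rng.choice(best_actions)
```

With `exploration_rate=0.0` the bot always exploits, and with `exploration_rate=1.0` it always explores; a value such as `0.5` mixes the two.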

Reading the above, one may wonder whether ONLY the LAST move is rewarded. The answer is NO. 
* (1) **All actions leading to** the LAST move (for a win or loss) will be rewarded or penalized but the reward/penalty will be **discounted**. 
```
        For example, given RLBot Actions: 
        Move1 => Move2 => Move3 (WIN)

        Its rewards will be like: 

        Move3 (1 point) =>  Move2 (0.9 point) => Move1 (0.9*0.9 point)
```

* (2) The game has to be **repeated** many times during training so that rewards/penalties propagate back to earlier moves. 
* (3) Another parameter, the `learning rate`, determines **how fast** the bot updates its estimate of each reward/penalty. 
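
Points (1) to (3) can be illustrated with the standard tabular Q-learning update. This is a sketch under stated assumptions, not `RLBot.update_q_values` itself: the function `update_q_value`, the `(state, action)` dict layout, and the toy states `s1`, `s2`, `s3` are hypothetical. The toy game replays the same three-move win, with a discount of `0.9` and a reward of `1` on the last move.

```python
from collections import defaultdict

def update_q_value(q_table, old_state, action, reward, new_state, new_actions,
                   learning_rate=0.1, discount=0.9):
    """Standard Q-learning update for one (state, action) pair."""
    # Best value reachable from the new state (0.0 if the game is over).
    best_next = max((q_table[(new_state, a)] for a in new_actions), default=0.0)
    target = reward + discount * best_next
    # Move the old estimate a fraction `learning_rate` of the way to the target.
    q_table[(old_state, action)] += learning_rate * (target - q_table[(old_state, action)])

def replay_winning_game(q_table, learning_rate, discount=0.9):
    """Replay Move1 => Move2 => Move3 (WIN); only the last move is rewarded."""
    steps = [("s1", "s2", 0.0), ("s2", "s3", 0.0), ("s3", "end", 1.0)]
    for old_state, new_state, reward in steps:
        new_actions = [] if new_state == "end" else ["move"]
        update_q_value(q_table, old_state, "move", reward, new_state,
                       new_actions, learning_rate, discount)

q_table = defaultdict(float)
for _ in range(3):  # three repeated games
    replay_winning_game(q_table, learning_rate=1.0)
```

After the first game only the last move has a value. Each further game pushes the discounted reward one move further back, so after three games the values are approximately 1, 0.9, and 0.9*0.9 for Move3, Move2, and Move1. A learning rate below `1.0` reaches the same values more slowly, which is why many training games are needed.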

In [30]:
game = TicTacToe()
bot = RLBot(game, 'O', 0.1, 0.9, 0.5)
game.set_bot(bot)
game.display()

In [31]:
import numpy as np

game = TicTacToe()
Simon = RLBot(game, 'X', 0.2, 0.9, 0.5)
Olive = RLBot(game, 'O', 0.2, 0.9, 0.5)

# keep track of the following
players = ['X', 'O']
bots = [Simon, Olive]
old_states = ['', '']
actions = [0, 0]
new_states = ['', '']

def train(): 
    in_session = True
    game.start_over()
    play_idx = 0
    while in_session: 
        old_state, action, new_state = bots[play_idx].move()
        old_states[play_idx] = old_state
        new_states[play_idx] = new_state
        actions[play_idx] = action
        
        # check win or end
        board = np.array(game.board)
        winner = game.check_win()
        if winner!=' ': 
            in_session = False
            for i in range(2): 
                if players[i]==winner: 
                    bots[i].update_q_values(old_states[i], actions[i], 10, new_states[i])
                else: 
                    bots[i].update_q_values(old_states[i], actions[i], -10, new_states[i])
        elif bots[play_idx].board_is_full(board): 
            in_session = False
        
        # alternate the turn between 0 and 1, or "X" and "O"
        play_idx = 1 if play_idx==0 else 0
        

In [None]:
import time 

game.display()
start_time = time.time()
for run in range(10000): 
    train()
end_time = time.time()
elapsed = (end_time - start_time)*1000
print(f"It takes {elapsed} milliseconds to train Simon and Olive.")

IOPub message rate exceeded.
The Jupyter server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--ServerApp.iopub_msg_rate_limit`.

Current values:
ServerApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
ServerApp.rate_limit_window=3.0 (secs)
