<h1 align = 'center'>Guessing Games</h1>
<h3 align = 'center'>machine learning, one step at a time</h3>
<h3 align = 'center'>Step 15. Final Project</h3>

The tic-tac-toe environment allows different types of players to play against each other. For example, here is a game between a __PrettyGoodPlayer__ (a procedural algorithm) and a __HumanPlayer__ (you):

In [None]:
from tictactoe import *

g = Game(PrettyGoodPlayer(), HumanPlayer())
g.play()

And here is a __RandomPlayer__ playing a __VeryGoodPlayer__ (run this a few times to see what happens):

In [None]:
from tictactoe import *

g = Game(RandomPlayer(), VeryGoodPlayer())
g.play()
g.replay()     # this will display the game, like an instant replay

And here is our nemesis __MinimaxPlayer__ playing a __VeryGoodPlayer__:

In [None]:
from tictactoe import *

g = Game(MinimaxPlayer(), VeryGoodPlayer())
g.play()
g.replay()     # this will display the game, like an instant replay

<hr>
***Final Project***

Implement a QPlayer that can play against the __MinimaxPlayer__, __RandomPlayer__, __VeryGoodPlayer__, or __HumanPlayer__. The QPlayer should learn from its mistakes, and improve as it goes along.
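
One common way for a player to "learn from its mistakes" is tabular Q-learning: keep a table of values indexed by state and action, and nudge each entry toward the observed reward plus the discounted best value of the next state. The sketch below is a minimal, standalone illustration of that update. The function name `q_update`, the learning rate `alpha`, the discount `gamma`, and the tiny 3-by-2 table are assumptions chosen for illustration; none of them come from the tictactoe module.

```python
import numpy as np

def q_update(q, state, action, reward, next_state, done, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the new entry."""
    # On a terminal transition there is no future value to bootstrap from.
    future = 0.0 if done else np.max(q[next_state])
    target = reward + gamma * future
    # Move the old estimate a fraction alpha of the way toward the target.
    q[state, action] += alpha * (target - q[state, action])
    return q[state, action]

# Toy table: 3 states, 2 actions (sizes chosen only for illustration).
q = np.zeros((3, 2))

# A losing, game-ending move from state 0 with action 1.
q_update(q, state=0, action=1, reward=-1.0, next_state=2, done=True)
print(q[0, 1])   # -0.1
```

With `alpha` below 1, a single bad outcome only partly lowers the estimate, so the table changes gradually over many games rather than overreacting to one result.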

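To interact with the tic-tac-toe environment, your player must implement two methods, __move()__ and __update()__. Together they play the role of a single __step()__ function: __move()__ returns the action the player chooses, and __update()__ receives the resulting state, reward, and done flag.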
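
The division of labor between __move()__ and __update()__ can be pictured with a toy driver loop. Everything in this sketch is a hypothetical stand-in: `RecordingPlayer`, `run_episode`, the transition rule, and the reward are invented for illustration and are not how the tictactoe `Game` class is implemented. Only the two method signatures mirror the ones used in this notebook.

```python
# Hypothetical stand-ins; these are NOT the tictactoe module's classes.
class RecordingPlayer:
    """Records the order in which the driver calls move() and update()."""

    def __init__(self):
        self.calls = []

    def move(self, game, state):
        self.calls.append(('move', state))
        return 0                                  # always pick action 0 in this toy

    def update(self, game, state, reward, done):
        self.calls.append(('update', state, reward, done))


def run_episode(player, n_turns=3):
    """Toy driver: one step() split into a move() call and an update() call."""
    state = 0
    for turn in range(n_turns):
        action = player.move(None, state)         # first half: choose an action
        state = state + 1 + action                # toy transition (assumption)
        done = turn == n_turns - 1
        reward = 1 if done else 0                 # toy reward (assumption)
        player.update(None, state, reward, done)  # second half: observe the result
    return player.calls


calls = run_episode(RecordingPlayer())
for c in calls:
    print(c)
```

Each `move` entry is followed by exactly one `update` entry, which is the pairing a learning player relies on: remember the state and action in `move()`, then score that choice in `update()`.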

A structural example of a player is below (__MyPlayer__), along with a framework for __QPlayer__.

In [3]:
from tictactoe import *

# Define a Python class to represent a new type of player.
# It extends the base class BasePlayer, which tracks wins, losses, and ties
# and records whether this player is the X-player or the O-player.
class MyPlayer(BasePlayer):

    # call the superclass __init__ function, which resets the
    # win/loss/tie counters
    def __init__(self):
        super().__init__()

    # move() is called when it is this player's turn;
    # it returns the action to play
    def move(self, game, state):
        print('move:', state)
        return game.sample(legal=True)

    # implement update() to update the player's data structures.
    # update() is called after a winning move, an illegal move,
    # or an opponent's move.
    def update(self, game, state, reward, done):
        print('update:', state, reward, done)

In [5]:
# run this a few times; note how and when update() is called.

from tictactoe import *
g = Game(MyPlayer(), MinimaxPlayer())
g.play()
print(g.x_player)   # this prints MyPlayer, which was playing 'X'
g.replay()

move: 9841
update: 9814 0 False
move: 9867
update: 9812 0 False
move: 9789
update: 9884 -1 True
MyPlayer w/l/t=0/1/0
[9868, 9867, 9870, 9789, 9798, 3237]

                  
 0     1     2    
                  

\ /               
 X     4     5    
/ \               

                  
 6     7     8    
                  



OOO               
O O    1     2    
OOO               

\ /               
 X     4     5    
/ \               

                  
 6     7     8    
                  



OOO   \ /         
O O    X     2    
OOO   / \         

\ /               
 X     4     5    
/ \               

                  
 6     7     8    
                  



OOO   \ /         
O O    X     2    
OOO   / \         

\ /   OOO         
 X    O O    5    
/ \   OOO         

                  
 6     7     8    
                  



OOO   \ /   \ /   
O O    X     X    
OOO   / \   / \   

\ /   OOO         
 X    O O    5    
/ \   OOO         

                  
 6    

In [1]:
import numpy as np
from tictactoe import *

###################################################################################
#                                                                                 #
# FINAL PROJECT -- Implement QPlayer                                              #
#                                                                                 #
###################################################################################

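class QPlayer(BasePlayer):

    # alpha, gamma, and epsilon are illustrative defaults, not tuned values
    def __init__(self, alpha=0.1, gamma=0.9, epsilon=0.01):
        super().__init__()
        self.alpha = alpha        # learning rate
        self.gamma = gamma        # discount factor
        self.epsilon = epsilon    # probability of a random (exploratory) move
        self.q = np.zeros((Game.state_space(), Game.action_space()))
        self.state = None         # state seen at the most recent move()
        self.action = None        # action chosen at the most recent move()

    # choose a move, referring only to the state
    def move(self, game, state):
        self.state = state
        if np.random.random() < self.epsilon:
            # explore: ask the game for a random legal move
            self.action = game.sample(legal=True)
        else:
            # exploit: best known action for this state
            self.action = int(np.argmax(self.q[state]))
        return self.action

    # update q-table, referring only to state & reward
    def update(self, game, state, reward, done):
        if self.state is None:
            return                # nothing to learn from before the first move
        # Q-learning target; assumes `state` is the state that follows the
        # last action. There is no future value once the game is over.
        future = 0.0 if done else np.max(self.q[state])
        target = reward + self.gamma * future
        old = self.q[self.state][self.action]
        self.q[self.state][self.action] = old + self.alpha * (target - old)
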

g = Game(QPlayer(), MinimaxPlayer())
for n in range(100):
    g.play()
    print(g.x_player)


Illegal move detected.
QPlayer w/l/t=0/0/1
Illegal move detected.
QPlayer w/l/t=0/0/2
Illegal move detected.
QPlayer w/l/t=0/0/3
Illegal move detected.
QPlayer w/l/t=0/0/4
QPlayer w/l/t=0/1/4
Illegal move detected.
QPlayer w/l/t=0/1/5
QPlayer w/l/t=0/2/5
Illegal move detected.
QPlayer w/l/t=0/2/6
Illegal move detected.
QPlayer w/l/t=0/2/7
Illegal move detected.
QPlayer w/l/t=0/2/8
Illegal move detected.
QPlayer w/l/t=0/2/9
Illegal move detected.
QPlayer w/l/t=0/2/10
Illegal move detected.
QPlayer w/l/t=0/2/11
Illegal move detected.
QPlayer w/l/t=0/2/12
Illegal move detected.
QPlayer w/l/t=0/2/13
Illegal move detected.
QPlayer w/l/t=0/2/14
Illegal move detected.
QPlayer w/l/t=0/2/15
Illegal move detected.
QPlayer w/l/t=0/2/16
Illegal move detected.
QPlayer w/l/t=0/2/17
Illegal move detected.
QPlayer w/l/t=0/2/18
QPlayer w/l/t=0/2/19
QPlayer w/l/t=0/2/20
QPlayer w/l/t=0/2/21
QPlayer w/l/t=0/2/22
QPlayer w/l/t=0/2/23
QPlayer w/l/t=0/2/24
QPlayer w/l/t=0/2/25
QPlayer w/l/t=0/2/26
QPlayer w
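
The many "Illegal move detected." lines above are consistent with how `np.argmax` treats ties: on a row of equal values it returns the first index, so an untrained table keeps proposing action 0 even after that square is taken. One possible remedy, sketched below, is to choose only among legal actions and break ties at random. The helper name `greedy_action` and its `legal_actions` argument are assumptions for illustration; how a player obtains the list of legal moves from the tictactoe `Game` is not shown in this notebook.

```python
import numpy as np

def greedy_action(q_row, legal_actions, rng):
    """Pick the highest-valued legal action, breaking ties at random."""
    legal = np.asarray(legal_actions)
    values = q_row[legal]                      # values of the legal actions only
    best = legal[values == values.max()]       # every legal action tied for best
    return int(rng.choice(best))

rng = np.random.default_rng(0)

# Untrained row: all nine actions look equally good.
q_row = np.zeros(9)
legal = [1, 2, 3, 5, 6, 7, 8]                  # squares 0 and 4 are already taken
action = greedy_action(q_row, legal, rng)
print(action in legal)                         # True

# Once one legal action has a higher value, it is always chosen.
q_row[5] = 1.0
print(greedy_action(q_row, legal, rng))        # 5
```

Masking illegal moves this way is a design choice: it saves the player from spending games learning the rules, at the cost of needing the legal-move list at decision time.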

In [2]:
g = Game(QPlayer(), HumanPlayer())
g.play()


---( 1 )--------------------------

\ /               
 X     1     2    
/ \               

                  
 3     4     5    
                  

                  
 6     7     8    
                  

x state = 9842, o state = 9840 available: [1, 2, 3, 4, 5, 6, 7, 8]
4
Illegal move detected.

---( 2 )--------------------------

\ /               
 X     1     2    
/ \               

      OOO         
 3    O O    5    
      OOO         

                  
 6     7     8    
                  

x state = 9761, o state = 9921 available: [1, 2, 3, 5, 6, 7, 8]
...TIE GAME...
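
The state numbers printed above (x state = 9842 and o state = 9840 after the first move, then 9761 and 9921 after the second) are consistent with a base-3 encoding of the board as seen by each player: square `i` contributes `digit * 3**i`, where the digit is 1 for an empty square, 2 for the player's own mark, and 0 for the opponent's mark. This is an inference from the printed values, not something confirmed by the tictactoe source; the function `encode` below is a hypothetical reconstruction for illustration.

```python
def encode(board, me):
    """Encode a 9-square board as one integer from the viewpoint of `me`.

    board: list of 9 entries, each 'X', 'O', or None (empty).
    me:    'X' or 'O', the player whose viewpoint is encoded.
    Assumed digits (inferred from the printed states): empty=1, own=2, opponent=0.
    """
    total = 0
    for i, cell in enumerate(board):
        if cell is None:
            digit = 1
        elif cell == me:
            digit = 2
        else:
            digit = 0
        total += digit * 3 ** i
    return total

board = [None] * 9
print(encode(board, 'X'))                      # 9841 (empty board)

board[0] = 'X'                                 # X takes square 0
print(encode(board, 'X'), encode(board, 'O'))  # 9842 9840

board[4] = 'O'                                 # O takes square 4
print(encode(board, 'X'), encode(board, 'O'))  # 9761 9921
```

If the encoding does work this way, the same position yields different numbers for X and for O, so a Q-table row learned while playing X describes the board from X's viewpoint only.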
