# Project 3: Reinforcement Learning

Peyton Lewis, Louie Wang, Anushka Iyer, Leyang Xu

This notebook is our second approach, following our DQN attempts. The Policy Gradients (PG) models won against a random opponent more often than the DQN models did, so we include the PG code as well. The last section of the notebook tests our submitted h5 file against a random opponent. That test relies on our valid_move() function, which is defined in this notebook.

## Methods Implemented: 
- Policy Gradients (softmax final layer)
- Self-play network training (a coin flip decides which player moves first in each game)
- Linear Annealing strategy for epsilon
- Memory buffer 
- Auto-win and auto-block moves, forced a small fraction of the time (probability epsilon/50)

In [1]:
# Necessary imports
import numpy as np
import pandas as pd
from IPython.display import clear_output
import tensorflow as tf
import time
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, Flatten, Dropout, Conv2D, Input

## Basic Functions

In [5]:
# Function that updates the game board once action is chosen

def update_board(board_temp,color,column):
    # this is a function that takes the current board status, a color, and a column and outputs the new board status
    # columns 0 - 6 are for putting a checker on the board: if column is full just return the current board...this should be forbidden by the player
    # columns 7 - 13 are for pulling a checker off the board: this does not check if removing the checker is allowed...
    # 
    # the color input should be either 'plus' or 'minus'
    
    board = board_temp.copy()
    ncol = board.shape[1]
    nrow = board.shape[0]
    if column < ncol: # drop a checker on the board
        row = -1
        # start by assuming you can't add to the column
        # loop through the rows checking if you can go to each row or not
        for check in range(nrow):
            if (board[check,column]!=0):
                break # if this row is occupied, you're done
            else: # otherwise, you can go on this row!
                row += 1

        if row >= 0: # if you can add to the column
            if color == 'plus': # check the color
                board[row,column] = 1
            else:
                board[row,column] = -1
        return board
    else:
        column -= ncol
        if column >= ncol:
            return board # can't play anything bigger than 13...
        board[1:,column] = board[:-1,column].copy()
        if board[0, column] != 0: # after the downward shift the top cell must be empty, so clear it if the column was full
            board[0, column] = 0
        return board
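
The action encoding used by update_board() (0-6 drops a checker, 7-13 pops one from the bottom of column `action - 7`) can be illustrated with a small helper. This is a minimal sketch; `decode_action` is a hypothetical name that does not appear elsewhere in the notebook.

```python
def decode_action(action, ncol=7):
    """Split a flat action index into (kind, column).

    Hypothetical helper for illustration only: indices 0..ncol-1 are drops,
    indices ncol..2*ncol-1 are pops from the same-numbered column.
    """
    if not 0 <= action < 2 * ncol:
        raise ValueError(f"action must be in [0, {2 * ncol - 1}], got {action}")
    kind = "drop" if action < ncol else "pop"
    return kind, action % ncol


# print the decoding of the boundary actions
for a in (0, 6, 7, 13):
    print(a, decode_action(a))
```

Keeping drops and pops in one flat index range is what lets the network use a single 14-way softmax output.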

In [7]:
# Function that checks for a winner on the board

def check_for_win(board):
    # this function checks to see if anyone has won on the given board
    nrow = board.shape[0]
    ncol = board.shape[1]
    winner = 'nobody'
    for row in range(nrow):
        for col in range(ncol):
            # check for vertical winners
            if row <= (nrow-4): # a vertical four needs 4 rows, so it cannot start below row nrow-4
                if np.sum(board[row:(row+4),col])==4:
                    winner = 'v-plus'
                    return winner
                elif np.sum(board[row:(row+4),col])==-4:
                    winner = 'v-minus'
                    return winner
            # check for horizontal winners
            if col <= (ncol-4):
                if np.sum(board[row,col:(col+4)])==4:
                    winner = 'h-plus'
                    return winner
                elif np.sum(board[row,col:(col+4)])==-4:
                    winner = 'h-minus'
                    return winner
            # check for top left to bottom right diagonal winners
            if (row <= (nrow-4)) and (col <= (ncol-4)):
                if np.sum(board[range(row,row+4),range(col,col+4)])==4:
                    winner = 'd-plus'
                    return winner
                elif np.sum(board[range(row,row+4),range(col,col+4)])==-4:
                    winner = 'd-minus'
                    return winner
            # check for top right to bottom left diagonal winners
            if (row <= (nrow-4)) and (col >= 3):
                if np.sum(board[range(row,row+4),range(col,col-4,-1)])==4:
                    winner = 'd-plus'
                    return winner
                elif np.sum(board[range(row,row+4),range(col,col-4,-1)])==-4:
                    winner = 'd-minus'
                    return winner
    return winner
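
The diagonal checks in check_for_win() rely on NumPy indexing with two paired ranges, which picks out one cell per (row, column) pair. Because cells only hold -1, 0, or +1, a four-cell window sums to 4 only when every cell is +1. The short sketch below illustrates this on a hand-built board; the board contents are made up for the example.

```python
import numpy as np

board = np.zeros((6, 7))
# place four plus (+1) checkers on a diagonal from (2, 0) down-right to (5, 3)
for k in range(4):
    board[2 + k, k] = 1

# paired ranges select (2,0), (3,1), (4,2), (5,3): the down-right diagonal
row, col = 2, 0
down_right = board[range(row, row + 4), range(col, col + 4)]
diag_sum = np.sum(down_right)

# a reversed column range selects (2,3), (3,2), (4,1), (5,0): the down-left diagonal
down_left = board[range(2, 6), range(3, -1, -1)]
anti_sum = np.sum(down_left)

print(diag_sum, anti_sum)
```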

In [4]:
# Function to display the Popout game board visually 

def display_board(board):
    # this function displays the board as ascii using X for +1 and O for -1
    clear_output()
    horizontal_line = '-'*(7*5+8)
    blank_line = '|'+' '*5
    blank_line *= 7
    blank_line += '|'
    print(horizontal_line)
    for row in range(6):
        print(blank_line)
        this_line = '|'
        for col in range(7):
            if board[row,col] == 0:
                this_line += ' '*5 + '|'
            elif board[row,col] == 1:
                this_line += '  X  |'
            else:
                this_line += '  O  |'
        print(this_line)
        print(blank_line)
        print(horizontal_line)

In [2]:
# Function that checks if a move for a player is valid, given the board

def valid_move(board, player):
    valid_moves = []

    # check where you can drop a checker
    col_empty = np.where(board[0,:] == 0)[0]
    for col in col_empty:
        valid_moves.append(col)

    # check where you can remove a checker
    last_row = board[-1,:]
    if player == 'plus':  # a player can only pop a column whose bottom checker is their own
        drop = np.where(last_row == 1)[0]
        drop += 7
    else :  # same case for minus player 
        drop = np.where(last_row == -1)[0]
        drop += 7

    for col in drop:
        valid_moves.append(col)

    return valid_moves    
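
As a worked example of the rule above, the sketch below restates it compactly and applies it to a small hand-built position. `legal_actions` is a hypothetical stand-in written for this illustration, not a replacement for valid_move().

```python
import numpy as np

def legal_actions(board, player):
    """Compact restatement of the drop/pop rule (illustrative only)."""
    piece = 1 if player == 'plus' else -1
    ncol = board.shape[1]
    drops = np.where(board[0, :] == 0)[0]             # top cell empty -> can drop
    pops = np.where(board[-1, :] == piece)[0] + ncol  # own checker on the bottom -> can pop
    return [int(a) for a in np.concatenate([drops, pops])]

board = np.zeros((6, 7))
board[:, 0] = [1, -1, 1, -1, 1, -1]  # column 0 is full, with a minus checker on the bottom
board[5, 3] = 1                      # plus checker on the bottom of column 3

plus_moves = legal_actions(board, 'plus')
minus_moves = legal_actions(board, 'minus')
print(plus_moves)
print(minus_moves)
```

Neither player can drop into the full column 0, but minus can still pop it (action 7), while plus can pop column 3 (action 10).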

In [6]:
# Function to see if auto win is possible -- will also be used in reverse to see if a block is possible

def auto_win(board, valid_moves, player) : 
    # play all valid moves and return winning move if possible
    auto_wins = []
    for move in valid_moves :
        board_temp = update_board(board, player, move)
        winner = check_for_win(board_temp)
        if player in winner :  # if there is an auto win move, add it and return it
            auto_wins.append(move)
    return auto_wins
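
One subtlety when auto_win() is reused to find blocks: the opponent's winning moves can include pops (actions 7-13), and the current player may not be allowed to play those same indices. A minimal sketch of restricting candidate blocks to the current player's legal moves follows; `legal_blocks` is a hypothetical helper, and treating the opponent's winning index as the blocking move is only guaranteed to make sense for drops.

```python
def legal_blocks(opponent_wins, my_valid_moves):
    """Keep only the opponent's winning moves that the current player may legally play."""
    allowed = set(my_valid_moves)
    return [m for m in opponent_wins if m in allowed]

# the opponent could win by dropping in column 3 or by popping column 3 (action 10);
# the current player can drop anywhere but may only pop column 1 (action 8)
blocks = legal_blocks([3, 10], [0, 1, 2, 3, 4, 5, 6, 8])
print(blocks)
```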

In [7]:
# Function to return discount rewards

def discount_rewards(rewards, delta=0.90):  # use of delta = 0.9
    discounted_rewards = np.zeros(len(rewards))
    running_add = 0
    for t in reversed(range(0, len(rewards))):
        running_add = running_add * delta + rewards[t]
        discounted_rewards[t] = running_add
    return discounted_rewards
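
To make the discounting concrete, the sketch below repeats the function so that it runs on its own and applies it to two short reward sequences of the kind play_game() produces: a win (final reward 1) and a loss (final reward -5). The sequence lengths are made up for the example.

```python
import numpy as np

def discount_rewards(rewards, delta=0.90):
    # same recurrence as above: G_t = r_t + delta * G_{t+1}
    discounted = np.zeros(len(rewards))
    running_add = 0.0
    for t in reversed(range(len(rewards))):
        running_add = running_add * delta + rewards[t]
        discounted[t] = running_add
    return discounted

winner_returns = discount_rewards([0, 0, 0, 1])   # four moves, then a win
loser_returns = discount_rewards([0, 0, -5])      # three moves, then a loss
print(winner_returns)
print(loser_returns)
```

Earlier moves receive a geometrically smaller share of the final reward, so the last move before a win or loss carries the most weight in training.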

In [8]:
# Function to create the model architecture

def create_model(height, width, channels):

    imp = Input(shape=(height,width, channels))
    mid = Conv2D(32,(2,2),strides=1,activation='relu')(imp)
    mid = Conv2D(64,(2,2),strides=1,activation='relu')(mid)
    mid = Conv2D(128,(2,2),strides=1,activation='relu')(mid)
    mid = Flatten()(mid)
    mid = Dense(256,activation='relu')(mid)
    out = Dense(14,activation='softmax')(mid)  # softmax final layer to output probabilities for the actions
    model = Model(imp,out) 
    
    # compile model with sparse categorical crossentropy loss function
    model.compile(loss='sparse_categorical_crossentropy',optimizer= tf.keras.optimizers.Adam(learning_rate=0.001))
    
    return model

In [None]:
# Function to play a single game

def play_game(epsilon):
    winner = 'nobody'
    board = np.zeros((6,7))
    player = 'plus'

    # toss coin to see who goes first
    if np.random.rand() < 0.5:
        player = 'plus'
    else:
        player = 'minus'

    # initialize game data to store 
    game_data = {'boards_plus':[],'actions_plus':[], 'rewards_plus':[], 'discount_rewards_plus':[], 'boards_minus':[], 'actions_minus':[], 'rewards_minus':[], 'discount_rewards_minus':[], 'winner': 'nobody', 'win_type': 'nobody'}

    # loop until there is a winner
    while winner == 'nobody':

        # get valid moves for player at that board
        valid_moves = valid_move(board, player)
        board_feed = board.reshape(1,6,7,1)

        # get plus player's action
        if player == 'plus':

            # with probability epsilon, play a random (legal) move
            if np.random.rand() < epsilon:
                action = np.random.choice(valid_moves)
            
            # otherwise sample a move from the model's output probabilities
            else:

                # get probabilities for each action from the plus model
                mod_probs_plus = player_plus(board_feed, training=False).numpy()[0]
                non_valid_moves = np.setdiff1d(np.arange(14), valid_moves)
                mod_probs_plus[non_valid_moves] = 0  # set non valid moves to 0

                if np.sum(mod_probs_plus) == 0:
                    action = np.random.choice(valid_moves)
                else:  
                    # renormalize so probabilities sum to 1
                    mod_probs_plus = mod_probs_plus/np.sum(mod_probs_plus)
                    action = np.random.choice(np.arange(14), p=mod_probs_plus)
            
            # play auto win and block a fraction of the time so it can learn these moves
            if np.random.rand() < epsilon/50:

                # check for auto winning moves
                winning_move = auto_win(board, valid_moves, player)
                blocks = [m for m in auto_win(board, valid_move(board, 'minus'), 'minus') if m in valid_moves]  # keep only blocks that are legal for plus

                if len(winning_move) > 0:  
                    action = np.random.choice(winning_move)   # play winning move if possible

                elif len(blocks) > 0:
                    action = np.random.choice(blocks)         # block if possible

            # add that board to the game data
            game_data['boards_plus'].append(board)

        # get minus move
        elif player == 'minus':
            
            # with probability epsilon, play a random (legal) move
            if np.random.rand() < epsilon:
                action = np.random.choice(valid_moves)
            
            # otherwise sample a move from the model's output probabilities
            else:
                
                # get probabilities for each action from the minus model
                mod_probs_minus = player_minus(board_feed, training=False).numpy()[0]
                non_valid_moves = np.setdiff1d(np.arange(14), valid_moves)
                mod_probs_minus[non_valid_moves] = 0  # set non valid moves to 0
                
                if np.sum(mod_probs_minus) == 0:
                    action = np.random.choice(valid_moves)
                else:
                    # renormalize so probabilities sum to 1
                    mod_probs_minus = mod_probs_minus/np.sum(mod_probs_minus)
                    action = np.random.choice(np.arange(14), p=mod_probs_minus)
            
            # play auto win and block a fraction of the time so it can learn these moves
            if np.random.rand() < epsilon/50:
                
                # check for auto winning moves
                winning_move = auto_win(board, valid_moves, player)
                blocks = [m for m in auto_win(board, valid_move(board, 'plus'), 'plus') if m in valid_moves]  # keep only blocks that are legal for minus

                if len(winning_move) > 0:  
                    action = np.random.choice(winning_move)   # play winning move if possible

                elif len(blocks) > 0:
                    action = np.random.choice(blocks)      # block if possible

            # add that board to the game data
            game_data['boards_minus'].append(board)
        
        # update board and check for win
        board = update_board(board, player, action)
        winner = check_for_win(board)
        
        # add winner and win type to game data
        if winner != 'nobody' and 'plus' in winner:
            game_data['winner'] = 'plus'
            game_data['win_type'] = winner
        elif winner != 'nobody' and 'minus' in winner:
            game_data['winner'] = 'minus'
            game_data['win_type'] = winner
        
        # switch player
        if player == 'plus':
            # store action taken and reward of 0 if no winner yet
            game_data['actions_plus'].append(action)
            game_data['rewards_plus'].append(0)

            # give reward to winner of 1, and loser a -5
            if 'plus' in winner :
                game_data['rewards_plus'][-1] = 1
                game_data['rewards_minus'][-1] = -5
            elif 'minus' in winner:
                game_data['rewards_plus'][-1] = -5
                game_data['rewards_minus'][-1] = 1

            player = 'minus'

        else:
            # store action taken and reward of 0 if no winner yet
            game_data['actions_minus'].append(action)
            game_data['rewards_minus'].append(0)

            # give reward to winner of 1, and loser a -5
            if 'plus' in winner :
                game_data['rewards_plus'][-1] = 1
                game_data['rewards_minus'][-1] = -5
            elif 'minus' in winner:
                game_data['rewards_plus'][-1] = -5
                game_data['rewards_minus'][-1] = 1


            player = 'plus'

    # add discounted rewards to game data
    game_data['discount_rewards_plus'] = discount_rewards(game_data['rewards_plus'])
    game_data['discount_rewards_minus'] = discount_rewards(game_data['rewards_minus'])

    return game_data
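
The move-selection step inside play_game() (zero out illegal actions, renormalize, then sample) can be isolated as below. This is a sketch under assumptions: `masked_policy` is a hypothetical helper, and a uniform probability vector stands in for the network output.

```python
import numpy as np

def masked_policy(probs, valid_moves):
    """Zero the probability of illegal actions and renormalize over the legal ones."""
    masked = np.zeros_like(probs, dtype=float)
    masked[valid_moves] = probs[valid_moves]
    total = masked.sum()
    if total == 0:
        # the model put no mass on any legal move: fall back to a uniform choice
        masked[valid_moves] = 1.0 / len(valid_moves)
    else:
        masked /= total
    return masked

rng = np.random.default_rng(0)
probs = np.full(14, 1 / 14)       # stand-in for the softmax output
valid = [0, 1, 2, 3, 4, 5, 6]     # empty board: only drops are legal
policy = masked_policy(probs, valid)
action = int(rng.choice(14, p=policy))
print(policy, action)
```

Sampling from the masked distribution, rather than taking the argmax, keeps some exploration in the policy even when epsilon is small.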

In [3]:
# Function to check how the plus model performs against a random (valid) player

def check_plus_model_wins (model, trials) : 

    # track win count
    plus = 0
    minus = 0

    # loop through trials
    for trial in range(trials):

        winner = 'nobody'
        board = np.zeros((6,7))
        player = 'plus'
        moves = 0

        # play game
        while winner == 'nobody':
            
            # get valid moves
            valid_moves = valid_move(board, player)
            board_feed = board.reshape(1,6,7,1)

            # get move from plus model
            if player == 'plus':
                
                # get probabilities for each action from the plus model
                mod_probs_plus = model(board_feed, training=False).numpy()[0]
                non_valid_moves = np.setdiff1d(np.arange(14), valid_moves)
                mod_probs_plus[non_valid_moves] = 0  # set non valid moves to 0

                if np.sum(mod_probs_plus) == 0:
                    action = np.random.choice(valid_moves)
                else:

                    # renormalize probabilities so they sum to 1
                    mod_probs_plus = mod_probs_plus/np.sum(mod_probs_plus)
                    action = np.random.choice(np.arange(14), p=mod_probs_plus)

            # get random move from minus player (random player here)
            elif player == 'minus':
                # choose uniformly at random among the legal moves
                valid_moves = valid_move(board, player)
                action = np.random.choice(valid_moves)

            # update board and check for win
            board = update_board(board,player,action)
            winner = check_for_win(board)
            moves += 1

            # switch player
            if player == 'plus':
                player = 'minus'
            else:
                player = 'plus'
        
        # update win count
        if 'plus' in winner:
            plus += 1
        elif 'minus' in winner:
            minus += 1

    # calculate win rates
    plus_wins = plus/trials
    rand_minus_wins = minus/trials

    return plus_wins, rand_minus_wins

In [None]:
# Function to check how the minus model performs against a random (valid) player

def check_minus_model_wins(model, trials) : 

    # track win count
    plus = 0
    minus = 0

    # loop through trials
    for trial in range(trials):

        winner = 'nobody'
        board = np.zeros((6,7))
        player = 'minus'
        moves = 0

        # play game
        while winner == 'nobody':
            
            # get valid moves
            valid_moves = valid_move(board, player)
            board_feed = board.reshape(1,6,7,1)

            # this is the random player
            if player == 'plus':

                # choose uniformly at random among the legal moves
                valid_moves = valid_move(board, player)
                action = np.random.choice(valid_moves)

            # get move from minus model
            elif player == 'minus':
                
                # get probabilities for each action from the model passed in (not the global player_minus)
                mod_probs_minus = model(board_feed, training=False).numpy()[0]
                non_valid_moves = np.setdiff1d(np.arange(14), valid_moves)
                mod_probs_minus[non_valid_moves] = 0  # set non valid moves to 0

                if np.sum(mod_probs_minus) == 0:
                    action = np.random.choice(valid_moves)
                else:
                    # renormalize probabilities so they sum to 1
                    mod_probs_minus = mod_probs_minus/np.sum(mod_probs_minus)
                    action = np.random.choice(np.arange(14), p=mod_probs_minus)

            # update board and check for win
            board = update_board(board,player,action)
            winner = check_for_win(board)
            moves += 1

            # switch player
            if player == 'plus':
                player = 'minus'
            else:
                player = 'plus'

        # update win count
        if 'plus' in winner:
            plus += 1
        elif 'minus' in winner:
            minus += 1

    # calculate win rates
    rand_plus_wins = plus/trials
    minus_wins = minus/trials

    return rand_plus_wins, minus_wins

## Begin Training Models

We used a different network architecture from the one in our DQN model. It has the same number of convolutional and dense layers, but the first convolutional layer uses a smaller filter (2x2 vs. 4x4). We found this architecture to work better for the PG models.

In [9]:
# initialize plus and minus models 
player_plus = create_model(6,7,1)
player_minus = create_model(6,7,1)

player_plus.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 6, 7, 1)]         0         
                                                                 
 conv2d (Conv2D)             (None, 5, 6, 32)          160       
                                                                 
 conv2d_1 (Conv2D)           (None, 4, 5, 64)          8256      
                                                                 
 conv2d_2 (Conv2D)           (None, 3, 4, 128)         32896     
                                                                 
 flatten (Flatten)           (None, 1536)              0         
                                                                 
 dense (Dense)               (None, 256)               393472    
                                                                 
 dense_1 (Dense)             (None, 14)                3598  
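
The output shapes and parameter counts in the summary can be reproduced by hand. The sketch below assumes what create_model() specifies: 2x2 kernels, stride 1, and the Keras default of no padding, so each convolution shrinks both spatial dimensions by one. The helper names are hypothetical.

```python
def conv_out(size, kernel=2, stride=1):
    # spatial size after an unpadded ("valid") convolution
    return (size - kernel) // stride + 1

def conv_params(kernel, in_ch, out_ch):
    # one kernel*kernel*in_ch weight block plus one bias per output filter
    return (kernel * kernel * in_ch + 1) * out_ch

def dense_params(n_in, n_out):
    # weight matrix plus one bias per output unit
    return (n_in + 1) * n_out

h, w, ch = 6, 7, 1
shapes, params = [], []
for filters in (32, 64, 128):
    params.append(conv_params(2, ch, filters))
    h, w, ch = conv_out(h), conv_out(w), filters
    shapes.append((h, w, ch))

flat = h * w * ch
params += [dense_params(flat, 256), dense_params(256, 14)]
print(shapes, flat, params)
```

Most of the weights sit in the first dense layer, which connects the 1536 flattened features to 256 units.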

In [10]:
# speeds up training 
player_plus.call = tf.function(player_plus.call, experimental_relax_shapes=True)
player_minus.call = tf.function(player_minus.call, experimental_relax_shapes=True)

## Set up Buffer and Train

To build our memory buffer, we keep a separate buffer for the plus player and for the minus player, each storing that player's boards, actions, and discounted rewards. Each buffer holds only the most recent 1,000 boards so that the models do not learn from very old moves and boards. After every game, once a buffer holds more than 500 warm-up boards, we sample 50 boards from it (boards with a positive discounted reward are three times as likely to be drawn) and take one gradient step on the sparse categorical cross-entropy between the action actually taken and the model's output probabilities, weighted by the discounted reward. We train until the plus player has seen 1,000,000 boards plus the 500 warm-up boards, which came to roughly 150,000 games. We also anneal epsilon linearly on top of the PG technique: with probability epsilon a move is chosen uniformly at random from the legal moves, so the models do not learn too quickly from a weak opponent early in training.
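
The per-sample loss used in the training loop is the cross-entropy of the action taken, multiplied by that move's discounted reward. The NumPy sketch below shows the arithmetic on made-up numbers; `pg_loss` is a hypothetical name, and the small numerical clipping that Keras applies to the probabilities is ignored.

```python
import numpy as np

def pg_loss(probs, actions, returns):
    """Reward-weighted cross-entropy per sample: -log p(a_t | s_t) * G_t."""
    probs = np.asarray(probs, dtype=float)
    picked = probs[np.arange(len(actions)), actions]  # probability of the action actually taken
    cross_entropy = -np.log(picked)
    return cross_entropy * np.asarray(returns, dtype=float)

# two toy samples over three actions (the real model outputs 14 probabilities)
probs = np.array([[0.5, 0.25, 0.25],
                  [0.1, 0.8, 0.1]])
per_sample = pg_loss(probs, actions=[0, 1], returns=[1.0, -5.0])
print(per_sample)
```

Minimizing a term with a positive return raises the probability of the action taken, while a term with a negative return is minimized by lowering that probability, which is how moves from lost games are discouraged.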

In [16]:
# parameters for the buffer
batch_size = 50
buffn = 1000
warmupboards = 500
max_boards = 1000000 + warmupboards
tot_boards = 0
len_buffer_plus = 0
len_buffer_minus = 0
buffer_plus = {'boards':[],'actions':[],'rewards':[]}
buffer_minus = {'boards':[],'actions':[],'rewards':[]}

# sparse categorical crossentropy loss object
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)

# linear annealing parameters 
anneal1 = 600000
anneal2 = 300000

ep0 = 1.0
ep1 = 0.5
ep2 = 0.05
epsilon = ep0
dep1 = (ep0-ep1)/anneal1
dep2 = (ep1-ep2)/anneal2
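
The two-stage schedule defined by these parameters can be written as a function of the number of boards counted after warm-up. This is an idealized sketch: `epsilon_after` is a hypothetical helper, and it ignores the small overshoot that comes from decrementing once per game rather than once per board.

```python
def epsilon_after(boards, ep0=1.0, ep1=0.5, ep2=0.05,
                  anneal1=600_000, anneal2=300_000):
    """Piecewise-linear epsilon: ep0 -> ep1 over anneal1 boards, then ep1 -> ep2 over anneal2."""
    if boards <= anneal1:
        return ep0 - (ep0 - ep1) * boards / anneal1
    if boards <= anneal1 + anneal2:
        return ep1 - (ep1 - ep2) * (boards - anneal1) / anneal2
    return ep2

for n in (0, 300_000, 600_000, 900_000, 1_000_000):
    print(n, epsilon_after(n))
```

Under this schedule epsilon stays at its floor of 0.05 for roughly the last 100,000 boards of training.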

In [17]:
# begin training loop
start = time.time()

# set up dataframe to store results 
save_output = pd.DataFrame(columns=['game','runtime','epsilon','plus_wins','rand_minus_wins','rand_plus_wins','minus_wins'])

# lists to store winner info 
winner = []
win_type = []
plus = [] 
minus = []
games = 0

# loop until upper limit of total boards is reached
while tot_boards < max_boards: 

    # play a game and increment game counter
    game_data = play_game(epsilon)
    games += 1

    # add winner and win type to lists 
    winner.append(game_data['winner'])
    win_type.append(game_data['win_type'])

    # add to lists storing win counts per player (1: win, 0: loss)
    if game_data['winner'] == 'plus':
        plus.append(1)
        minus.append(0)
    elif game_data['winner'] == 'minus':
        plus.append(0)
        minus.append(1)

    # extend the plus and minus buffers with the frames from the game played, actions, and discounted rewards
    buffer_plus['boards'].extend(game_data['boards_plus'])
    buffer_plus['actions'].extend(game_data['actions_plus'])
    buffer_plus['rewards'].extend(game_data['discount_rewards_plus'])

    buffer_minus['boards'].extend(game_data['boards_minus'])
    buffer_minus['actions'].extend(game_data['actions_minus'])
    buffer_minus['rewards'].extend(game_data['discount_rewards_minus'])

    # add to tot_boards (counts plus boards only) and to each buffer length
    tot_boards += len(game_data['boards_plus'])
    len_buffer_plus += len(game_data['boards_plus'])
    len_buffer_minus += len(game_data['boards_minus'])

    # remove excess buffer for both when exceed 1000 boards
    if len_buffer_plus > buffn:
        excess = len_buffer_plus - buffn
        buffer_plus['boards'] = buffer_plus['boards'][excess:].copy()
        buffer_plus['actions'] = buffer_plus['actions'][excess:].copy()
        buffer_plus['rewards'] = buffer_plus['rewards'][excess:].copy()
        len_buffer_plus = len(buffer_plus['actions'])
    if len_buffer_minus > buffn:
        excess = len_buffer_minus - buffn
        buffer_minus['boards'] = buffer_minus['boards'][excess:].copy()
        buffer_minus['actions'] = buffer_minus['actions'][excess:].copy()
        buffer_minus['rewards'] = buffer_minus['rewards'][excess:].copy()
        len_buffer_minus = len(buffer_minus['actions'])
    
    # once both buffers longer than warm up boards, begin two stage annealing 
    if len_buffer_plus > warmupboards and len_buffer_minus > warmupboards:

        # first stage of annealing
        if epsilon >= ep1:
            epsilon -= dep1 * len(game_data['boards_plus'])
        # second stage of annealing
        elif epsilon < ep1 and epsilon > ep2:
            epsilon -= dep2 * len(game_data['boards_plus'])

    # start training plus model once buffer is longer than warm up boards
    if len_buffer_plus > warmupboards :

        # oversample boards with positive rewards
        prob_plus = np.ones(len_buffer_plus)
        prob_plus[np.array(buffer_plus['rewards']) > 0] = 3
        prob_plus = prob_plus / np.sum(prob_plus)
        # select random sample from buffer
        which_choose_plus = np.random.choice(len_buffer_plus, batch_size, replace=False, p=prob_plus)

        # use random indices to extract boards, actions, and rewards from buffer 
        boards = np.array(buffer_plus['boards'])[which_choose_plus]
        actions = np.array(buffer_plus['actions'])[which_choose_plus]
        rewards = np.array(buffer_plus['rewards'])[which_choose_plus]

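        # optimizer (note: a fresh Adam instance is created on every pass, so its moment estimates restart from zero each step)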
        optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
        
        # train model
        with tf.GradientTape() as tape:
            # forward pass
            probs = player_plus(boards)
            loss_value = tf.losses.sparse_categorical_crossentropy(actions, probs) * tf.constant(rewards, dtype = tf.float32)
            # backward pass
            gradients = tape.gradient(loss_value, player_plus.trainable_variables)
            # update weights
            optimizer.apply_gradients(zip(gradients, player_plus.trainable_variables))

    # start training minus model once buffer is longer than warm up boards    
    if len_buffer_minus > warmupboards :

        # oversample boards with positive rewards
        prob_minus = np.ones(len_buffer_minus)
        prob_minus[np.array(buffer_minus['rewards']) > 0] = 3
        prob_minus = prob_minus / np.sum(prob_minus)
        # select random sample from buffer
        which_choose_minus = np.random.choice(len_buffer_minus, batch_size, replace=False, p=prob_minus)

        # get random sample from buffer
        boards = np.array(buffer_minus['boards'])[which_choose_minus]
        actions = np.array(buffer_minus['actions'])[which_choose_minus]
        rewards = np.array(buffer_minus['rewards'])[which_choose_minus]

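        # optimizer (note: a fresh Adam instance is created on every pass, so its moment estimates restart from zero each step)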
        optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)

        # train model
        with tf.GradientTape() as tape:
            # forward pass
            probs = player_minus(boards)
            loss_value = tf.losses.sparse_categorical_crossentropy(actions, probs) * tf.constant(rewards, dtype = tf.float32)
            # backward pass
            gradients = tape.gradient(loss_value, player_minus.trainable_variables)
            # update weights
            optimizer.apply_gradients(zip(gradients, player_minus.trainable_variables))

    # save models and output every 2000 games to track progress
    if games % 2000 == 0:
        player_plus.save('models/player_plus_{0}.h5'.format(games))
        player_minus.save('models/player_minus_{0}.h5'.format(games))
        end = time.time()
        # add to df
        plus_wins, rand_minus_wins = check_plus_model_wins(player_plus, 1000)
        rand_plus_wins, minus_wins = check_minus_model_wins(player_minus, 1000)
        save_output.loc[len(save_output)] = [games, end-start, epsilon, plus_wins, rand_minus_wins, rand_plus_wins, minus_wins]
        save_output.to_csv('save_output.csv', index=False)
        print('game: ', games, '\ttime: ', end-start, '\tlen_buffer_plus: ', len_buffer_plus, '\tlen_buffer_minus: ', len_buffer_minus, game_data['win_type'], '\tplus boards: ', len(game_data['boards_plus']), '\tminus boards: ', len(game_data['boards_minus']), '\tepsilon: ', epsilon)
        start = time.time()

 

game:  2000 	time:  462.4003789424896 	len_buffer_plus:  1000 	len_buffer_minus:  1000 v-minus 	plus boards:  17 	minus boards:  17 	epsilon:  0.9773691666666656
game:  4000 	time:  483.31683826446533 	len_buffer_plus:  1000 	len_buffer_minus:  1000 h-minus 	plus boards:  6 	minus boards:  6 	epsilon:  0.9547124999999972
game:  6000 	time:  482.01316571235657 	len_buffer_plus:  1000 	len_buffer_minus:  1000 h-minus 	plus boards:  3 	minus boards:  4 	epsilon:  0.9325899999999944
game:  8000 	time:  469.3741900920868 	len_buffer_plus:  1000 	len_buffer_minus:  1000 h-plus 	plus boards:  13 	minus boards:  14 	epsilon:  0.9111391666666544
game:  10000 	time:  457.8006341457367 	len_buffer_plus:  1000 	len_buffer_minus:  1000 d-plus 	plus boards:  18 	minus boards:  17 	epsilon:  0.88956666666665
game:  12000 	time:  453.2348470687866 	len_buffer_plus:  1000 	len_buffer_minus:  1000 h-plus 	plus boards:  14 	minus boards:  13 	epsilon:  0.8688008333333108
game:  14000 	time:  460.80549287
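
The oversampling step in the training loop gives boards with a positive discounted reward three times the sampling weight of the others. A minimal sketch on a toy buffer follows; `sampling_weights` is a hypothetical helper and the reward values are made up.

```python
import numpy as np

def sampling_weights(rewards, boost=3.0):
    """Weight `boost` for positive-reward entries, 1 for the rest, normalized to sum to 1."""
    weights = np.ones(len(rewards))
    weights[np.asarray(rewards) > 0] = boost
    return weights / weights.sum()

rng = np.random.default_rng(0)
rewards = np.array([0.9, -4.5, 1.0, -5.0])   # toy buffer of discounted rewards
p = sampling_weights(rewards)
batch = rng.choice(len(rewards), size=2, replace=False, p=p)
print(p, batch)
```

Because losses are penalized five times as heavily as wins are rewarded, drawing winning boards more often is one way to keep positive examples from being drowned out in each batch.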

In [18]:
# see split between plus and minus wins
winner = pd.Series(winner)
winner.value_counts(normalize=True)

plus     0.518664
minus    0.481336
dtype: float64

In [19]:
# see distribution of win types 
win_type = pd.Series(win_type)
win_type.value_counts(normalize=True)

v-plus     0.411214
v-minus    0.403921
h-plus     0.089175
h-minus    0.061154
d-plus     0.018275
d-minus    0.016261
dtype: float64

## Function to Test Model Against Random Opponent

To test our model against a random opponent, use the cell below and import the necessary packages at the top of this notebook. You will also need to run the following functions from the Basic Functions section at the top of this notebook:  

- **update_board()**: Updates game board after moves
- **check_for_win()**: Checks board for a winner
- **valid_move()**: Restricts both players from making invalid moves
- **check_plus_model_wins()**: Checks our model's win rate against a random opponent

By construction, valid_move() restricts both our plus model and the random opponent to legal moves.

In [15]:
# Necessary imports
import numpy as np
import pandas as pd
from IPython.display import clear_output
import tensorflow as tf
import time
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, Flatten, Dropout, Conv2D, Input

In [31]:
# read in h5 model submitted to Canvas
player_plus = tf.keras.models.load_model('pg_model.h5') 
player_plus.call = tf.function(player_plus.call, experimental_relax_shapes=True)

# calculates win rate of plus model against random for 1000 games
plus_wins, rand_minus_wins = check_plus_model_wins(player_plus, 1000)
print('Plus Win Rate:', plus_wins)
print('Random Win Rate:', rand_minus_wins)

Plus Win Rate: 0.991
Random Win Rate: 0.009
