## Simulator

This section simulates a single play of the game between a player and a dealer.

The player’s turn requires human input on every move (stick or hit), while the dealer’s play is automated: the dealer keeps hitting until its hand value is between 17 and 21, or until it goes bust (a value below 1 or above 21).

In [1]:
import random
def draw_card_sim():
    num = random.randint(1,10)
    col_num = random.randint(1,2)
    color = 'black' if col_num == 1 else 'red'
    return ([num, color])

def calc_value(hand):
    value = 0
    for card in hand:
        if card[1] == 'black':
            value += card[0]
        else:
            value -= card[0]
    return value

def print_state(hand, who):
    print('\033[1m' + who + "'s hand:" + '\033[0m')
    for card in hand:
        print(str(card[0]) + ' ' + card[1]) 
        
def decide_outcome(player, dealer):
    player = calc_value(player)
    dealer = calc_value(dealer)

    # A hand is bust when its value falls outside the 1..21 range
    player_bust = (player < 1) or (player > 21)
    dealer_bust = (dealer < 1) or (dealer > 21)

    if player_bust and dealer_bust:
        return 'DRAW'
    elif player_bust:
        return 'DEALER WON'
    elif dealer_bust:
        return 'PLAYER WON'
    elif player > dealer:
        return 'PLAYER WON'
    elif player == dealer:
        return 'DRAW'
    else:
        return 'DEALER WON'

In [35]:
def game():
    player = [[random.randint(1, 10), 'black']]
    dealer = [[random.randint(1, 10), 'black']]
    print('\033[1m' + '****** GAME INITIALIZED ******' + '\033[0m')
    
    print(str(player[0][0]), 'black', 'Player')
    print(str(dealer[0][0]), 'black', 'Dealer')
    
    ### Player
    print('\033[1m' + "****** PLAYER'S TURN ******" + '\033[0m')
    proceed = input('stick or hit? ')
    while proceed == 'hit':
        player.append(draw_card_sim())
        print_state(player, 'Player')
        print('Current value: {}'.format(str(calc_value(player))))

        # Bust: the hand value has left the 1..21 range
        if (calc_value(player) < 1) or (calc_value(player) > 21):
            print('\033[1m' + 'Bust\n' + '\033[0m')
            break
            
        proceed = input('stick or hit? ')

    print('\n' + '\033[1m' + "****** DEALER'S TURN ******" + '\033[0m')
    
    ### Dealer
    while True:
        dealer.append(draw_card_sim())
        print_state(dealer, 'Dealer')
        print('Current value: {}'.format(str(calc_value(dealer))))
        
        # Stick: the dealer stops once its hand is between 17 and 21
        if (calc_value(dealer) >= 17) and (calc_value(dealer) <= 21):
            print('\033[1m' + 'Dealer stops' + '\033[0m')
            break
        
        # Bust
        elif (calc_value(dealer) < 1) or (calc_value(dealer) > 21):
            print('\033[1m' + 'Dealer busts' + '\033[0m')
            break
            
        print('\n')   
    
    outcome = decide_outcome(player, dealer)
    return outcome
game()

****** GAME INITIALIZED ******
4 black Player
1 black Dealer
****** PLAYER'S TURN ******
stick or hit? hit
Player's hand:
4 black
1 black
Current value: 5
stick or hit? stick

****** DEALER'S TURN ******
Dealer's hand:
1 black
10 black
Current value: 11


Dealer's hand:
1 black
10 black
2 red
Current value: 9


Dealer's hand:
1 black
10 black
2 red
8 black
Current value: 17
Dealer stops


'DEALER WON'

In this example, the player took two turns, hitting on the first and sticking on the second, for a final hand value of 5. The dealer then hit three times before sticking on a final value of 17, which beat the player's 5 and won the game.
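
To check the rules without typing input, the same game can be sketched non-interactively. This is only a minimal sketch under stated assumptions: the player follows a fixed threshold policy (hit while below `player_threshold`), which is not part of the simulator above, and the names `draw_signed_card`, `is_bust`, `play_auto` and `tally` are hypothetical. Cards are stored as signed numbers (black positive, red negative) instead of `[value, color]` pairs.

```python
import random
from collections import Counter

def draw_signed_card(rng):
    """Draw a card worth 1-10; black adds to the hand, red subtracts (50/50)."""
    value = rng.randint(1, 10)
    return value if rng.random() < 0.5 else -value

def is_bust(total):
    """A hand is bust when its value falls outside 1..21."""
    return total < 1 or total > 21

def play_auto(player_threshold, rng):
    """Play one game; the player hits while below player_threshold (assumed policy)."""
    player = rng.randint(1, 10)  # both sides start with one black card
    dealer = rng.randint(1, 10)
    while not is_bust(player) and player < player_threshold:
        player += draw_signed_card(rng)
    # The dealer hits until the hand reaches 17..21 or goes bust, as in game()
    while not is_bust(dealer) and dealer < 17:
        dealer += draw_signed_card(rng)
    if is_bust(player) and is_bust(dealer):
        return 'DRAW'
    if is_bust(player):
        return 'DEALER WON'
    if is_bust(dealer):
        return 'PLAYER WON'
    if player == dealer:
        return 'DRAW'
    return 'PLAYER WON' if player > dealer else 'DEALER WON'

def tally(n_games, player_threshold, seed=0):
    """Count outcomes over n_games games with a seeded generator."""
    rng = random.Random(seed)
    return Counter(play_auto(player_threshold, rng) for _ in range(n_games))
```

Calling `tally(10000, player_threshold=15)` returns a `Counter` keyed by the three outcome strings, which gives a rough baseline to compare a learned policy against.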

## Q-Learning

I started by initializing a Q-table with one entry for every state-action pair (hand values 1 to 21, each with the actions hit and stick), all set to 0. The automated player selects its action with an epsilon-greedy approach. The functions getQ and updateQ are used to update the Q-table after every card draw and again at the end of the game, once the outcome (win/draw/lose) is decided.
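
For comparison, the textbook form of epsilon-greedy selection can be sketched as follows. This is an illustrative sketch, not the notebook's own function: `epsilon_greedy_action` and its parameters are hypothetical names, and it explores by picking a uniformly random action, whereas the `epsilon_greedy` function in this notebook explores by taking the lower-valued of the two actions.

```python
import random

def epsilon_greedy_action(q_table, state, actions, epsilon, rng=random):
    """With probability epsilon explore at random, otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.choice(actions)  # explore: uniform over all actions
    # Exploit: highest Q-value; ties resolve to the first action in `actions`
    return max(actions, key=lambda a: q_table[(state, a)])

# Example with the same (state, action) key layout as Q_table
q = {(11, 'hit'): 6.1, (11, 'stick'): 3.4}
print(epsilon_greedy_action(q, 11, ['hit', 'stick'], epsilon=0.0))  # hit
```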

The parameters I used are alpha = 0.1, gamma = 0.9 and epsilon = 0.1, and I ran Q-Learning for 500,000 episodes.
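
As a reference point for how alpha and gamma enter the update, here is a minimal sketch of one tabular Q-learning step. It rests on an assumption that is a common convention rather than something taken from this notebook: when the game has ended, `next_state` is passed as `None` and no future value is added. The function name `q_update` and its signature are hypothetical.

```python
def q_update(q_table, state, action, reward, next_state, actions,
             alpha=0.1, gamma=0.9):
    """Apply Q(s,a) += alpha * (reward + gamma * max_a' Q(s',a') - Q(s,a))."""
    if next_state is None:
        best_next = 0.0  # terminal step: nothing left to bootstrap from
    else:
        best_next = max(q_table[(next_state, a)] for a in actions)
    old = q_table[(state, action)]
    q_table[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return q_table[(state, action)]

# Example: one mid-game step and one terminal step
moves = ['hit', 'stick']
q = {(5, 'hit'): 0.0, (5, 'stick'): 0.0, (8, 'hit'): 1.0, (8, 'stick'): 0.5}
q_update(q, 5, 'hit', 0, 8, moves)       # moves Q(5, hit) towards 0.9 * 1.0
q_update(q, 5, 'stick', 1, None, moves)  # moves Q(5, stick) towards the reward 1
```

With alpha = 0.1, each call moves the stored value one tenth of the way towards its target, which is why many episodes are needed before the table settles.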

In [37]:
alpha = 0.1
gamma = 0.9
epsilon = 0.1
possible_moves = ['hit', 'stick']
num_episodes = 500000

# Initialize Q table
Q_table = {}
for i in range(1, 22):
    for move in possible_moves:
        Q_table[(i, move)] = 0

def draw_card():
    value = random.randint(1, 10)
    color = random.choice(['black', 'red'])
    return value, color

def epsilon_greedy(state):
    if random.random() < epsilon: # explore: pick the action with the lower Q-value
        action = getQ(state)[1][1]
    else:
        action = getQ(state)[0][1]
    return action
    
def getQ(state):
    # Collect [state, action, value] entries for this state, highest value first
    state_actions = [[k[0], k[1], v] for k, v in Q_table.items() if k[0] == state]
    state_actions.sort(key=lambda entry: entry[2], reverse=True)
    return state_actions # e.g. [[11, 'hit', 6.1], [11, 'stick', 3.4]]
    
def updateQ(old_state, action, new_state, reward):
    prev_q = Q_table[(old_state, action)]
    try:
        new_q = getQ(new_state)[0][2]
    except IndexError: # new_state is outside 1..21 (bust), so it has no Q-table entries
        new_q = 0
    Q_table[(old_state, action)] = prev_q + alpha*(reward + gamma*new_q - prev_q)

In [40]:
for ep in range(num_episodes):
    player = random.randint(1, 10)
    dealer = random.randint(1, 10)
    
    ## PLAYER
    action = 'hit'
    while action == 'hit': # repeat player's turn till stick
        current = player
        action = epsilon_greedy(current)
        if action == 'hit': # draw new card
            new_card = draw_card()
            if new_card[1] == 'black':
                player += new_card[0]
            else:
                player -= new_card[0]

        if player < 1 or player > 21: # bust
            break
        else:
            updateQ(current, action, player, 0)
    player_final = (current, action, player)
    
    # DEALER
    action = 'hit'
    while action == 'hit': # repeat dealer's turn till stick or >=17 & <=21
        current = dealer
        if current >= 17 and current <= 21: # stick
            action = 'stick'
        else:
            action = 'hit'
            new_card = draw_card()
            if new_card[1] == 'black':
                dealer += new_card[0]
            else:
                dealer -= new_card[0]

            if dealer < 1 or dealer > 21: # bust
                break
            else:
                updateQ(current, action, dealer, 0)
    dealer_final = (current, action, dealer)
    
    ## OUTCOME
    if (player_final[2] == dealer_final[2]) or ((player_final[2] < 1 or player_final[2] > 21) and (dealer_final[2] <1 or dealer_final[2] > 21)): # draw
        updateQ(player_final[0], player_final[1], player_final[2], 0)
        updateQ(dealer_final[0], dealer_final[1], dealer_final[2], 0)

    elif (player_final[2] > dealer_final[2] and player_final[2] < 22) or (dealer_final[2] > 21 and player_final[2] > 0 and player_final[2] <= 21): # player wins
        updateQ(player_final[0], player_final[1], player_final[2], 1)
        updateQ(dealer_final[0], dealer_final[1], dealer_final[2], -1)

    else:
        updateQ(player_final[0], player_final[1], player_final[2], -1) # dealer wins
        updateQ(dealer_final[0], dealer_final[1], dealer_final[2], 1)    

## Conclusion

This is the final Q-Table of expected rewards. Generally, the higher the value of your current hand, the higher the expected reward if you stick. There is a jump from hand value 16 to 17 because the dealer stops on hand values of 17 and above: hands in this range often end in a win, which raises the expected rewards for hand values of 17 and above.

For hitting, the expected reward is low for lower hand values, rises from around hand value 7, peaks at hand value 13 in the table below, and is lower again at hand values 14 and 15. This makes sense, as drawing a card when the value of your hand is already high risks going bust and incurring a negative reward.
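
One way to read the table is to reduce it to the action it prefers in each state. The helper below is a hypothetical sketch (`greedy_policy` is not part of the notebook); it assumes the same `(state, action)` key layout as Q_table and is shown on a few entries copied from the table printed below.

```python
def greedy_policy(q_table):
    """Map each state to the action with the highest Q-value."""
    best = {}
    for (state, action), value in q_table.items():
        if state not in best or value > best[state][1]:
            best[state] = (action, value)
    return {state: action for state, (action, _) in sorted(best.items())}

# A few entries copied from the printed Q-table
sample = {
    (7, 'hit'): 2.2177951501642355, (7, 'stick'): 2.2083238844541455,
    (8, 'hit'): 1.8350855781711937, (8, 'stick'): 2.3671604527419645,
    (13, 'hit'): 3.6323903191534295, (13, 'stick'): 3.0088081434193175,
}
print(greedy_policy(sample))  # {7: 'hit', 8: 'stick', 13: 'hit'}
```

Passing the full Q_table instead of `sample` gives the preferred action for every hand value from 1 to 21.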

In [41]:
Q_table

{(1, 'hit'): 1.0388177169119155,
 (1, 'stick'): 2.4162429708091175,
 (2, 'hit'): 0.3340981838313377,
 (2, 'stick'): 1.4624035740111612,
 (3, 'hit'): 0.9523541218403131,
 (3, 'stick'): 2.2443270643149753,
 (4, 'hit'): 0.9477959115332029,
 (4, 'stick'): 2.0725351108198393,
 (5, 'hit'): 1.2841424455033121,
 (5, 'stick'): 1.9274282206672004,
 (6, 'hit'): 0.8646566055698401,
 (6, 'stick'): 2.1128572713562064,
 (7, 'hit'): 2.2177951501642355,
 (7, 'stick'): 2.2083238844541455,
 (8, 'hit'): 1.8350855781711937,
 (8, 'stick'): 2.3671604527419645,
 (9, 'hit'): 3.122091434823913,
 (9, 'stick'): 3.0662396291892637,
 (10, 'hit'): 2.310421005096463,
 (10, 'stick'): 3.0708054787336154,
 (11, 'hit'): 3.117730382910863,
 (11, 'stick'): 3.3366063934840695,
 (12, 'hit'): 3.455185485329924,
 (12, 'stick'): 3.1796680475818797,
 (13, 'hit'): 3.6323903191534295,
 (13, 'stick'): 3.0088081434193175,
 (14, 'hit'): 2.6612111941220196,
 (14, 'stick'): 2.9115192243982153,
 (15, 'hit'): 2.7528501645678025,
 (15, 's