# Reinforcement Learning Agent
Here we develop an RL agent that learns to play blackjack. We use SARSA, an on-policy temporal-difference method closely related to Q-learning, to learn the action values.

In [11]:
from game import *
import random

 **Enumerate Valid States**  
 
The player state consists of three fields:
1. Integer representing the sum of player card values
2. Integer representing the value of dealer's first card
3. Boolean representing whether the player has a usable Ace.

First we write a function to get the current agent state. Then we enumerate all the valid states that the player can be in. In a valid state the player can continue playing, choosing between hit and stand. We also assign an integer index to each state for easy reference. 

In [12]:
def get_state(game):
    return (game.get_sum(game.get_player_hand()), game.get_hand_value(game.get_dealer_hand())[0], game.hasAce)

valid_states = [(x,y,z) for z in [True,False] for x in range(12,22) for y in range(2,12)]
valid_states[160]

def state_to_index(st):
    if st in valid_states:
        return valid_states.index(st)
    else:
        return -1

state_to_index((20, 2, True))
state_to_index((22, 2, True))

-1
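
The list-based lookup above scans `valid_states` on every call, and the training loop calls it once per step. A minimal sketch of a constant-time alternative follows, assuming the same enumeration order as above; `state_index` and `state_to_index_fast` are hypothetical names that do not appear in the original notebook.

```python
# Same enumeration as in the cell above: usable-ace flag, player sum, dealer card.
valid_states = [(x, y, z) for z in [True, False]
                for x in range(12, 22) for y in range(2, 12)]

# Hypothetical helper: precompute state -> index once for O(1) lookups.
state_index = {st: i for i, st in enumerate(valid_states)}

def state_to_index_fast(st):
    """Return the index of a valid state, or -1 for any other state."""
    return state_index.get(st, -1)

print(state_to_index_fast((20, 2, True)))   # same index as valid_states.index(...)
print(state_to_index_fast((22, 2, True)))   # -1, because 22 is a bust
```

Both functions should return the same indices, so the rest of the notebook would not need to change if this variant were swapped in.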

**Initialize Q-value Table** 

SARSA assigns a Q-value to each (state, action) pair, representing the future reward we expect to receive if we take the given action in the given state. The actions that a player can take in any valid state are hit and stand. We initialize the Q-value of every (state, action) pair to 0.

In [13]:
def zero_q_values():
    qvals = {"hit": 0.0, "stand": 0.0}
    return qvals

q_value_table = [zero_q_values() for x in valid_states ]
q_value_table[state_to_index((20, 2, True))]

{'hit': 0.0, 'stand': 0.0}

**Epsilon-Greedy Policy**  

All control problems need a policy to decide what action to take in each state. We use an epsilon-greedy policy, which takes a random action with probability epsilon (exploration) and the best action according to the current Q-values with probability 1-epsilon (exploitation). Ties between the two actions are broken at random.

In [29]:
def epsilon_greedy(epsilon, q_values):
    if random.random() < epsilon:
        return random.choice(list(q_values.keys()))
    else:
        if q_values["hit"] > q_values["stand"]:
            return "hit"
        elif q_values["hit"] < q_values["stand"]:
            return "stand"
        else:
            return random.choice(list(q_values.keys()))
        
for i in range(10):
    print(epsilon_greedy(0.1, zero_q_values()))

stand
stand
stand
stand
hit
hit
stand
stand
stand
hit
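
As a quick sanity check on the exploration rate, the sketch below counts how often the policy picks the worse action when the two Q-values differ. It restates `epsilon_greedy` in a compact form so the block runs on its own; the Q-values, seed, and trial count are arbitrary illustrative choices, not values from the notebook.

```python
import random

def epsilon_greedy(epsilon, q_values):
    # Explore with probability epsilon, otherwise act greedily (random tie-break).
    if random.random() < epsilon:
        return random.choice(list(q_values.keys()))
    best = max(q_values.values())
    return random.choice([a for a, q in q_values.items() if q == best])

random.seed(0)
q = {"hit": 1.0, "stand": 0.0}   # illustrative values: "hit" is clearly better
trials = 100_000
stand_fraction = sum(epsilon_greedy(0.1, q) == "stand" for _ in range(trials)) / trials
print(stand_fraction)  # should be close to epsilon / 2 = 0.05
```

The worse action is chosen only during exploration, and then only half the time, which is why the expected fraction is epsilon / 2 rather than epsilon.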


**SARSA**  

Finally we go on to learn the Q-value table. We use SARSA learning algorithm for the same. The intuition behind SARSA is very simple.

We first initialize q_value_table with zeros; terminal states have no entry and are treated as having value zero. Then every time we take action a in state s, we observe the next state s', choose the next action a' (according to the same policy), and calculate the expected future reward. The expected future reward is given by:   
immediate_reward + GAMMA * Q(s',a')  
GAMMA is the discount factor in the above equation. In this game the immediate reward is zero on every step except the last, where it is the game result.  

The Q-value of the current state-action pair (s,a) is a moving average of these expected future rewards. After each step we move Q(s,a) toward the expected future reward by a constant step size ALPHA.  
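
To make the update concrete, here is a minimal sketch of a single SARSA step as a standalone function, using the ALPHA and GAMMA values from the cell below. The function name `sarsa_update` and the example numbers are illustrative assumptions, including a reward of +1 for a win.

```python
ALPHA = 0.3
GAMMA = 0.9

def sarsa_update(q_sa, reward, q_next, alpha=ALPHA, gamma=GAMMA):
    """Move Q(s,a) a fraction alpha of the way toward reward + gamma * Q(s',a')."""
    target = reward + gamma * q_next
    return q_sa + alpha * (target - q_sa)

# Non-terminal step: no immediate reward, next pair currently valued at 0.5.
q1 = sarsa_update(q_sa=0.0, reward=0.0, q_next=0.5)   # 0.3 * (0.45 - 0) = 0.135
# Terminal step: the hand ends with a win (+1 assumed), so there is no next pair.
q2 = sarsa_update(q_sa=0.0, reward=1.0, q_next=0.0)   # 0.3 * (1 - 0) = 0.3
print(q1, q2)
```

With a constant ALPHA, older targets are weighted down geometrically, so the estimate tracks recent experience rather than converging to a plain mean.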

In [47]:
EPSILON = 0.2
ALPHA = 0.3
GAMMA = 0.9
#count_of_busts = 0
#count_of_wins = 0

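for episode in range(500000):
    game = BlackJack()
    game.start_game()
    # Note: this assumes start_game() leaves the player in a valid state;
    # an index of -1 here would silently address the last row of the table.
    state = state_to_index(get_state(game))
    action = epsilon_greedy(EPSILON, q_value_table[state])
    while True:
        if action == "hit":
            game.hit()
            # Returns -1 if the player went bust (state no longer valid).
            next_state = state_to_index(get_state(game))
        else:
            game.stand()
            next_state = -1  # standing always ends the episode
        if next_state == -1:
            # Terminal update: the target is just the final game result.
            q_value_table[state][action] += ALPHA * (game.result - q_value_table[state][action])
            break
        next_action = epsilon_greedy(EPSILON, q_value_table[next_state])
        # Non-terminal update: immediate reward is 0, so the target is GAMMA * Q(s', a').
        q_value_table[state][action] += ALPHA * (GAMMA * q_value_table[next_state][next_action] - q_value_table[state][action])
        state = next_state
        action = next_action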

**Policy Table**

Finally we derive the greedy policy from the learned Q-value table: pi(s) = argmax_a Q(s,a). Ties are resolved in favor of stand.

In [48]:
policy_table = {}
value_function = {}
for state in range(len(valid_states)):
    if q_value_table[state]["hit"] > q_value_table[state]["stand"]:
        policy_table[valid_states[state]] = "hit"
        value_function[valid_states[state]] = q_value_table[state]["hit"]
    else:
        policy_table[valid_states[state]] = "stand"
        value_function[valid_states[state]] = q_value_table[state]["stand"]
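
A policy dictionary of this shape can also be viewed as a grid, with one row per player sum and one column per dealer card, which tends to be easier to scan than a state-by-state listing. The sketch below is one possible way to render it; `policy_grid` is a hypothetical helper, and `toy_policy` is a made-up threshold rule used only so the block runs on its own, not the learned policy.

```python
def policy_grid(policy, usable_ace):
    """Render a policy dict as text: one row per player sum, one column per dealer card."""
    lines = ["    " + " ".join(f"{d:>2}" for d in range(2, 12))]
    for total in range(12, 22):
        # Abbreviate each action to its first letter: H for hit, S for stand.
        cells = " ".join(f"{policy[(total, d, usable_ace)][0].upper():>2}"
                         for d in range(2, 12))
        lines.append(f"{total:>2}: {cells}")
    return "\n".join(lines)

# Toy policy for illustration only (not the learned one): hit below 17, else stand.
toy_policy = {(t, d, ace): ("hit" if t < 17 else "stand")
              for ace in (True, False) for t in range(12, 22) for d in range(2, 12)}
grid = policy_grid(toy_policy, usable_ace=False)
print(grid)
```

Calling `policy_grid(policy_table, usable_ace=False)` on the table built above would show the learned policy in the same layout.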

In [50]:
states_without_ace = [(x,y,False) for x in range(12,22) for y in range(2,12)]
for state in states_without_ace:
    print(str(state) + " : " + str(policy_table[state]))

(12, 2, False) : hit
(12, 3, False) : hit
(12, 4, False) : hit
(12, 5, False) : hit
(12, 6, False) : hit
(12, 7, False) : hit
(12, 8, False) : hit
(12, 9, False) : hit
(12, 10, False) : hit
(12, 11, False) : hit
(13, 2, False) : hit
(13, 3, False) : hit
(13, 4, False) : hit
(13, 5, False) : hit
(13, 6, False) : hit
(13, 7, False) : hit
(13, 8, False) : hit
(13, 9, False) : hit
(13, 10, False) : hit
(13, 11, False) : hit
(14, 2, False) : hit
(14, 3, False) : stand
(14, 4, False) : hit
(14, 5, False) : hit
(14, 6, False) : hit
(14, 7, False) : hit
(14, 8, False) : stand
(14, 9, False) : hit
(14, 10, False) : hit
(14, 11, False) : hit
(15, 2, False) : stand
(15, 3, False) : stand
(15, 4, False) : hit
(15, 5, False) : hit
(15, 6, False) : hit
(15, 7, False) : hit
(15, 8, False) : hit
(15, 9, False) : hit
(15, 10, False) : hit
(15, 11, False) : stand
(16, 2, False) : hit
(16, 3, False) : hit
(16, 4, False) : stand
(16, 5, False) : stand
(16, 6, False) : stand
(16, 7, False) : stand
(16, 8, 