Let's see how to implement the MC control method with an epsilon-greedy policy for playing the blackjack game.

In [1]:
# Importing the necessary libraries:
import gym
import pandas as pd
from collections import defaultdict
import random

In [2]:
# Create a blackjack environment:
env = gym.make('Blackjack-v1')

In [3]:
# Initialize the dictionary for storing the Q values:
Q = defaultdict(float)

In [4]:
# Initialize the dictionary for storing 
# the total return of the state-action pair:
total_return = defaultdict(float)

In [5]:
# Initialize the dictionary for storing 
# the count of the number of times a state-action pair is visited:
N = defaultdict(int)

Define the epsilon-greedy policy

In [6]:
# We define a function called epsilon_greedy_policy which takes 
# the state and Q value as an input and returns the action to be 
# performed in the given state:

def epsilon_greedy_policy(state,Q):
    
    #set the epsilon value to 0.5
    epsilon = 0.5
    
    #sample a random value from the uniform distribution; if the sampled value is less than
    #epsilon, select a random action, else select the action with the maximum Q value
    #in the given state
    
    if random.uniform(0,1) < epsilon:
        return env.action_space.sample()
    else:
        return max(list(range(env.action_space.n)), key = lambda x: Q[(state,x)])
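
As a rough check of the rule above, the following sketch re-implements the selection without the gym environment. The names `select_action`, `n_actions`, `epsilon`, and `rng` are assumptions introduced for illustration and are not part of the notebook; the two Q values are taken from the results table shown later.

```python
import random
from collections import defaultdict

def select_action(state, Q, n_actions=2, epsilon=0.5, rng=random):
    """Epsilon-greedy choice over actions 0..n_actions-1 (illustrative helper)."""
    # Explore: with probability epsilon, pick any action uniformly at random.
    if rng.uniform(0, 1) < epsilon:
        return rng.randrange(n_actions)
    # Exploit: otherwise pick the action with the largest Q value.
    # max() returns the first maximal item, so ties go to the lowest action index.
    return max(range(n_actions), key=lambda a: Q[(state, a)])

# Toy Q table for a single blackjack state (0 = stick, 1 = hit).
Q_demo = defaultdict(float)
state_demo = (16, 2, False)
Q_demo[(state_demo, 0)] = -0.270386
Q_demo[(state_demo, 1)] = -0.515873

# Seeded generator so the frequency estimate is reproducible.
rng = random.Random(0)
picks = [select_action(state_demo, Q_demo, rng=rng) for _ in range(10000)]
greedy_share = picks.count(0) / len(picks)
print(greedy_share)
```

With two actions and epsilon set to 0.5, the greedy action should be chosen roughly 75% of the time: half of the draws exploit, and half of the exploratory draws happen to land on the greedy action as well.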

Generating an episode

In [7]:
# Set the number of time steps:
num_timesteps = 100

In [8]:
# Let's generate an episode using the epsilon-greedy policy. 
# We define a function called generate_episode which takes 
# the Q value as an input and returns the episode.

def generate_episode(Q):
    
    #initialize a list for storing the episode
    episode = []
    
    #initialize the state using the reset function
    state = env.reset()
    
    #then for each time step
    for t in range(num_timesteps):
        
        #select the action according to the epsilon-greedy policy
        action = epsilon_greedy_policy(state,Q)
        
        #perform the selected action and store the next state information
        next_state, reward, done, info = env.step(action)
        
        # store the state, action, reward in the episode list
        episode.append((state, action, reward))
        
        #if the next state is a final state
        if done:
            
            # then break the loop
            break

        # else update the current state to the next state
        state = next_state

    return episode
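
The function above assumes the older Gym API, where `env.reset()` returns only the state and `env.step()` returns four values. Newer Gym releases and Gymnasium instead return `(observation, info)` from `reset()` and five values from `step()`. A minimal sketch of two adapter helpers follows; `unpack_reset` and `unpack_step` are hypothetical names, not part of Gym or of this notebook.

```python
def unpack_reset(result):
    """Return only the observation from env.reset(), for either API style."""
    # Newer API: reset() returns (observation, info), where info is a dict.
    # A blackjack observation is a 3-tuple, so it cannot be mistaken for this.
    if isinstance(result, tuple) and len(result) == 2 and isinstance(result[1], dict):
        return result[0]
    # Older API: reset() returns the observation itself.
    return result

def unpack_step(result):
    """Return (next_state, reward, done, info) for either API style."""
    if len(result) == 5:
        # Newer API: separate terminated and truncated flags.
        next_state, reward, terminated, truncated, info = result
        return next_state, reward, terminated or truncated, info
    # Older API: already a 4-tuple.
    return tuple(result)

# Hand-written return values standing in for real environment calls.
print(unpack_reset(((14, 10, False), {})))
print(unpack_step(((21, 10, True), 1.0, True, False, {})))
```

Under this assumption, `generate_episode` could call `state = unpack_reset(env.reset())` and `next_state, reward, done, info = unpack_step(env.step(action))` and would run against either API style.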

Computing the optimal policy

In [9]:
# Set the number of iterations:
num_iterations = 50000

We start from an arbitrary policy (all Q values are initially zero) and improve it iteratively by computing the Q values. Since we extract the policy from the Q function, we don't have to define the policy explicitly. As the Q values improve, the policy also improves implicitly.

In [10]:
#for each iteration
for i in range(num_iterations):
    
    #so, here we pass our initialized Q function to generate an episode
    episode = generate_episode(Q)
    
    #get all the state-action pairs in the episode
    all_state_action_pairs = [(s, a) for (s,a,r) in episode]
    
    #store all the rewards obtained in the episode in the rewards list
    rewards = [r for (s,a,r) in episode]

    #for each step in the episode 
    for t, (state, action, reward) in enumerate(episode):

        #if the state-action pair is occurring for the first time in the episode
        if (state, action) not in all_state_action_pairs[0:t]:
            
            #compute the return R of the state-action pair as the sum of rewards
            R = sum(rewards[t:])
            
            #update total return of the state-action pair
            total_return[(state,action)] = total_return[(state,action)] + R
            
            #update the number of times the state-action pair is visited
            N[(state, action)] += 1

            #compute the Q value by just taking the average
            Q[(state,action)] = total_return[(state, action)] / N[(state, action)]
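
To make the first-visit update concrete, here is a small sketch that applies the same averaging rule to two hand-written episodes. The helper `first_visit_update` and the toy states `"s1"`, `"s2"`, `"s3"` are assumptions for illustration only; real blackjack states are tuples, and the rewards here are made up to keep the arithmetic easy to follow.

```python
from collections import defaultdict

def first_visit_update(episode, Q, total_return, N):
    """Apply one first-visit MC update (undiscounted) for a single episode."""
    rewards = [r for (_, _, r) in episode]
    seen = set()
    for t, (state, action, _) in enumerate(episode):
        # Skip later occurrences of a pair within the same episode.
        if (state, action) in seen:
            continue
        seen.add((state, action))
        # Return = sum of rewards from step t to the end of the episode.
        R = sum(rewards[t:])
        total_return[(state, action)] += R
        N[(state, action)] += 1
        # Q is the running average of the returns observed so far.
        Q[(state, action)] = total_return[(state, action)] / N[(state, action)]

Q_demo = defaultdict(float)
total_return_demo = defaultdict(float)
N_demo = defaultdict(int)

# ("s1", 1) occurs twice in episode_1; only its first occurrence is counted.
episode_1 = [("s1", 1, 0), ("s2", 1, 0), ("s1", 1, 0), ("s3", 0, 1)]
episode_2 = [("s1", 1, 0), ("s3", 0, -1)]

first_visit_update(episode_1, Q_demo, total_return_demo, N_demo)
first_visit_update(episode_2, Q_demo, total_return_demo, N_demo)
print(dict(Q_demo))
```

After both episodes, the pair `("s1", 1)` has seen returns of +1 and -1, so its Q value averages to 0, while `("s2", 1)` was visited once and keeps the value 1.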

Thus, on every iteration, the Q values are updated and the policy improves along with them. After all the iterations, we can look at the Q value of each state-action pair in a pandas data frame for more clarity.

In [12]:
# Let's convert the Q value dictionary to a pandas data frame:
df = pd.DataFrame(list(Q.items()), columns=['state_action pair','value'])

In [13]:
# Let's look at the first few rows of the data frame:
df.head(10)

      state_action pair      value
0   ((16, 8, False), 0)  -0.536122
1   ((20, 4, False), 0)   0.571848
2    ((18, 3, True), 0)  -0.058824
3    ((18, 3, True), 1)  -0.058824
4   ((16, 2, False), 0)  -0.270386
5   ((16, 2, False), 1)  -0.515873
6   ((10, 1, False), 0)  -0.777778
7   ((10, 1, False), 1)  -0.347826
8  ((18, 10, False), 0)  -0.230912
9  ((18, 10, False), 1)  -0.733918


As we can observe, we have the Q values for the state-action pairs that were visited during training.
Now we can extract the policy by selecting, in each state, the action that has the maximum Q value.
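
The extraction step can be sketched as follows. The helper name `extract_greedy_policy` and its `n_actions` parameter are assumptions for illustration; the Q values are copied from the table above, restricted to states where both actions were visited.

```python
from collections import defaultdict

def extract_greedy_policy(Q, n_actions=2):
    """Map each state seen in Q to its highest-valued action (illustrative helper)."""
    # Keys of Q are (state, action) pairs; collect the distinct states.
    states = {state for (state, _) in Q}
    # Q.get avoids inserting new keys into a defaultdict. Note that an
    # unvisited action is treated as 0.0 here, which can outrank a visited
    # action whose estimate is negative.
    return {
        s: max(range(n_actions), key=lambda a: Q.get((s, a), 0.0))
        for s in states
    }

Q_demo = defaultdict(float)
Q_demo[((16, 2, False), 0)] = -0.270386
Q_demo[((16, 2, False), 1)] = -0.515873
Q_demo[((10, 1, False), 0)] = -0.777778
Q_demo[((10, 1, False), 1)] = -0.347826
Q_demo[((18, 10, False), 0)] = -0.230912
Q_demo[((18, 10, False), 1)] = -0.733918

policy_demo = extract_greedy_policy(Q_demo)
for state, action in sorted(policy_demo.items()):
    print(state, "stick" if action == 0 else "hit")
```

For these three states the sketch picks stick (0) on a player sum of 16 or 18 and hit (1) on a player sum of 10, which follows directly from comparing the two values in each state.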

To learn more about how to select actions based on these Q values, see the section of the book on implementing on-policy control.