# **Monte Carlo Control (Draft)**
### 2022/05/02, A. J. Zerouali

 

## **1 - Introduction**

### a) Contents

This notebook is a debug file for my Monte Carlo implementations.


### b) Import cells

We'll continue working with GridWorld. I'll also include some helper functions written previously for DP.

In [1]:
#####################################################
##### IMPORTANT: ALWAYS EXECUTE THIS CELL FIRST #####
#####################################################
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Import the Windy GridWorld class and helper functions:

In [2]:
from Windy_GridWorld import GridWorld_Windy_small, windy_standard_grid, test_standard_grid

Import printing functions for value fn and policy, random policy generator, and value fn comparing function:

In [3]:
from RL_Fns_Windy_GridWorld import print_values, print_policy, gen_random_policy, compare_value_fns

Import episode generating function and Monte Carlo evaluation of policies:

In [4]:
from Monte_Carlo_Windy_GridWorld import generate_episode, MC_Policy_Eval

In [7]:
help(generate_episode)

Help on function generate_episode in module Monte_Carlo_Windy_GridWorld:

generate_episode(s_0, a_0, Pi, env, T_max)
    Generates a random episode of max length (T_max + 1), given an initial state-action.
    ARGUMENTS: Initial state and action; policy; environment; max episode length (in this order).
    OUTPUT: episode_rewards: Rewards list
            episode_states: State list
            episode_actions: Actions list
            T: Episode length minus one.
            
    NOTE: The function below is taylored to GridWorld. The correct way
         to implement it is to use methods of the environment class.
         I'm not using the methods of the Windy_GridWorld class because
         the instructor's implementation is a little too clunky and the
         entire design should be redone from scratch.



#### IMPORTANT:

Execute all cells before the one above.

## 2 - Monte Carlo control - Exploring starts

MC with exploring starts pseudocode

            Require: Discount factor gamma.
            Initialize: Random policy Pi, Q(s,a)=0 for all state-actions (s,a),
                        create returns dict. and lists returns[(s,a)]=[]
            Loop until convergence:
                Generate initial state-action (s_0, a_0) randomly.
                Generate episode sample of length T following Pi: (s_0, a_0, r_1, ... a_(T-1), r_T, s_T)
                G = 0                                              # Init. cumulative return of generated episode
                for t in {(T-1), (T-2), ..., 1, 0}:
                    G = gamma*G + r_(t+1)                          # Evaluate sample return
                    if (s_t, a_t) not in {(s_i,a_i)}_{i=0,...,t-1}:# If this is the first visit of (s_t,a_t)
                        Append G to returns[s_t,a_t]               # Store sample cum. return
                        Q[s_t, a_t] = average(returns[s_t, a_t])   # Update Q(s_t, a_t)
                        Pi[s_t] = argmax_a {Q[s_t, a]}             # Update Pi*(s_t)
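
The backward pass in the pseudocode above can be sketched in isolation. This is only a minimal illustration: `first_visit_returns` is a hypothetical helper (not part of the notebook's modules), and the toy states, actions and rewards are made up. It keeps the convention used here that `rewards[t+1]` is the reward received after taking `actions[t]` in `states[t]`.

```python
def first_visit_returns(states, actions, rewards, gamma):
    """Return {(s, a): G} for the first visit of each state-action pair.

    rewards[0] is a placeholder; rewards[t+1] follows (states[t], actions[t]).
    """
    T = len(states) - 1
    pairs = list(zip(states, actions))
    G = 0.0
    out = {}
    # Go backwards from t = T-1 to t = 0, accumulating the discounted return
    for t in range(T - 1, -1, -1):
        G = gamma * G + rewards[t + 1]
        # Keep G only if (s_t, a_t) does not occur earlier in the episode
        if pairs[t] not in pairs[:t]:
            out[pairs[t]] = G
    return out


# Toy episode (made-up data): ('A', 'x') is visited at t=0 and again at t=2
states = ['A', 'B', 'A', 'T']
actions = ['x', 'y', 'x', '']
rewards = [0, 1, 0, 2]
fv = first_visit_returns(states, actions, rewards, gamma=0.5)
print(fv)
```

The return computed at t=2 is discarded because `('A', 'x')` already appears at t=0, so only the return from the first visit is recorded for that pair.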


### a) Version 2

In version 2, I computed the value function and the optimal policy with 2 approaches:

- One is a modification of Version 1 (which had bugs).

- The second is a modification of Lazy Programmer's version of MC control with exploring starts.

Both give the same result, and both have the issue of the values $V(2,2)$ and $V(2,3)$ being different from those obtained by MC policy evaluation. More details below. I plotted the same graph as the instructor with the learning rate (1/no. samples), but the results were very erratic. I then tried to execute the code with 30000 samples, a random initial policy, and max length of episode of 51. This crashed Jupyter, and I had to erase the file because I corrupted it.

That execution crashed because I was printing from inside what was effectively an infinite loop. The random policy was cycling between 3 states ad infinitum, so every sample episode had maximal length, and since I was printing from the for loop over time steps, the output ended up exceeding the memory Jupyter allows.

### b) Version 3

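In version 3, I just fixed version 1. The discrepancy in $V(2,2)$ and $V(2,3)$ between the control function and the policy evaluation function looks like a sampling issue. Note however that the function below generates every episode with the fixed input policy `Pi` and never with the updated `Pi_star`, so `Q` estimates the action values of `Pi` rather than those of the improved policy.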

In [8]:
##############################################################
###  MONTE CARLO CONTROL WITH EXPLORING STARTS - VERSION 3 ###
##############################################################
## 2022/05/02, AJ Zerouali
# Monte Carlo with exploring starts (Lect. 64-65)
# We are not using convergence thresholds here.
# This function below generates a specified number of sample
# episodes and averages returns to find the optimal policy.
# Can choose first visit MC or all visits MC with Boolean.
# This function also calls generate_episode() of previous sec.

def MC_Ctrl_ExpStarts(Pi, env, gamma, N_samples, T_max, all_visits_MC):
    '''
     Monte Carlo control with exploring starts
     ARGUMENTS: - Pi, env, gamma: Policy, environment, discount factor.
                - N_samples: No. of samples for Monte Carlo expectation.
                - T_max: Max. episode length minus 1.
                - all_visits_MC: Boolean for all visits or first visit MC.
     OUTPUT:    - Pi_star: Optimal policy
                - V_star: Optimal value function
                - Q: State-action values from samples
                - Returns: Dict. of return samples (by init. state-action)
                - N_iter: No. of randomly generated episodes (<= N_samples)
     NOTE: This function calls generate_episode().
    '''
    
    # Environment attributes
    non_term_states = env.non_term_states
    term_states = env.term_states
    adm_actions = env.adm_actions
    Rwds = env.rewards
    
    
    # Init. Q, returns, Pi_star, and V_star dictionaries
    Pi_star = {}
    Q = {}
    Returns = {}
    V_star = {}
    for s in non_term_states:
        Pi_star[s] = {}
        Q[s]={}
        Returns[s] = {}
        V_star[s] = 0.0
        for a in adm_actions[s]:
            Q[s][a] = 0.0
            Returns[s][a]=[]
            
    for s in term_states:
        V_star[s] = 0.0
            
    
    # Init. counter
    N_iter = 0
    
    # Main MC loop
    while (N_iter<N_samples):
                
        ##########################
        ##### Step t=0 setup #####
        ##########################

        # Generate (s_0, a_0) randomly from non_term_states and adm_actions
        # np.random.choice() works only for 1-dim'l arrays
        s_0 = list(non_term_states)[np.random.randint(len(non_term_states))]
        a_0 = np.random.choice(adm_actions[s_0])

        ########################
        ##### Steps 1 to T #####
        ########################
        
        #print(f"Generating sample episode no. {N_iter+1}...\n") # Debug
        # Generate episode
        # Signature: episode_rewards, episode_states, episode_actions, T = generate_episode(s_0, a_0, Pi, env, T_max)
        episode_rewards, episode_states, episode_actions, T = generate_episode(s_0, a_0, Pi, env, T_max)
        # Step t of episode is (r_t, s_t, a_t) = (episode_rewards[t], episode_states[t], episode_actions[t])
        #print(f"Sample episode N_iter={N_iter} has T={T} steps after t=0.\n") # Debug
 
        #####################################################
        ### COMPUTE CUMULATIVE RETURNS AND OPTIMAL ACTION ###
        #####################################################

        # First visit only MC
        if not all_visits_MC:
            # State-action iterable
            episode_state_actions = list(zip(episode_states, episode_actions))
            
            # Init. storing variable
            G = 0.0
            
            # Loop over episode
            for t in range(T-1, -1, -1): # Loop goes backwards from (T-1) to 0
                
                # Sum the return
                G = gamma*G + episode_rewards[t+1]
                s_t = episode_states[t]
                a_t = episode_actions[t]
                
                if (s_t, a_t) not in episode_state_actions[:t]:
                    
                    # Update sample returns and Q-function
                    Returns[s_t][a_t].append(G)
                    Q[s_t][a_t] = np.average(Returns[s_t][a_t])
                    
                    
                    # Get a_star and update Pi_star
                    a_star = max(Q[s_t], key = Q[s_t].get)
                    Pi_star[s_t] = {a_star:1.0} 
                    
                    
        # END IF first visit MC
        
        # All visits MC
        elif all_visits_MC:
            
            # Init. storing variable
            G = 0.0
            
            for t in range(T-1, -1, -1): # Loop goes backwards from (T-1) to 0
                
                # Sum the returns
                G = gamma*G + episode_rewards[t+1]
                s_t = episode_states[t]
                a_t = episode_actions[t]
                
                # Update sample returns and Q-function
                Returns[s_t][a_t].append(G)
                Q[s_t][a_t] = np.average(Returns[s_t][a_t])
                
                
                # Get a_star and update Pi_star
                a_star = max(Q[s_t], key = Q[s_t].get)
                Pi_star[s_t] = {a_star:1.0} 
                
                              
        # Update N_iter
        N_iter += 1
        
        '''
        # DEBUG:
        print(f"Completed episode N_iter={N_iter} with R={G}.")
        print("--------------------------------------------------------------------")
        '''
       
    # END WHILE of MC loop
    
    # COMPUTE V
    for s in non_term_states:
        a = max(Q[s], key = Q[s].get)
        V_star[s]=Q[s][a]
    
    # Output
    return Pi_star, V_star, Q, Returns, N_iter

# END DEF MC_Ctrl_ExpStarts

#### Test with non-windy GridWorld


In [9]:
# Create environment
grid = test_standard_grid()

# Policy to be evaluated
Pi = {
    (2, 0): {'U': 1.0},
    (1, 0): {'U': 1.0},
    (0, 0): {'R': 1.0},
    (0, 1): {'R': 1.0},
    (0, 2): {'R': 1.0},
    (1, 2): {'U': 1.0},
    (2, 1): {'R': 1.0},
    (2, 2): {'U': 1.0},
    (2, 3): {'L': 1.0},
  }

Pi_lect = {
    (2, 0): {'U': 1.0},
    (1, 0): {'U': 1.0},
    (0, 0): {'R': 1.0},
    (0, 1): {'R': 1.0},
    (0, 2): {'R': 1.0},
    (1, 2): {'R': 1.0},
    (2, 1): {'R': 1.0},
    (2, 2): {'R': 1.0},
    (2, 3): {'U': 1.0},
  }

# Discount factor and convergence threshold
gamma = 0.9
#epsilon = 1e-3

# Max. episode length:
T_max = 50
N_samples = 1000

# Run MC control:
# SIGNATURE: Pi_star, V_star, Q, Returns, N_iter = MC_Ctrl_ExpStarts(Pi, env, gamma, N_samples, T_max, all_visits_MC)
#V = MC_Policy_Eval(Pi, grid, gamma, N_samples, T_max, all_visits_MC = False)
Pi_star, V_star, Q, Returns, N_iter = MC_Ctrl_ExpStarts(Pi_lect, grid, gamma, N_samples, T_max, all_visits_MC = False)

The policy obtained looks optimal (although that depends on the number of samples collected):

In [10]:
print_policy(Pi_star, grid)

##  POLICY  ##
------------------------
  R  |  R  |  R  |
------------------------
  U  |     |  U  |
------------------------
  U  |  L  |  L  |  L  |
------------------------


The value function is off though:

In [11]:
print_values(V_star, grid)

## VALUE FUNCTION ##
------------------------
 0.81| 0.90| 1.00| 0.00|
------------------------
 0.73| 0.00| 0.90| 0.00|
------------------------
 0.66| 0.59|-0.73|-0.81|
------------------------


If we evaluate the optimal policy with MC, we get a different value function:

In [12]:
V_eval = MC_Policy_Eval(Pi_star, grid, gamma, 50, T_max, all_visits_MC = False)

In [13]:
print_values(V_eval, grid)

## VALUE FUNCTION ##
------------------------
 0.81| 0.90| 1.00| 0.00|
------------------------
 0.73| 0.00| 0.90| 0.00|
------------------------
 0.66| 0.59| 0.53| 0.48|
------------------------
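
As a rough check on the two bottom-right entries, the discounted returns can be worked out by hand. The sketch below assumes (this is not confirmed by the notebook) that `test_standard_grid()` gives reward -1 on entering (1, 3) and 0 on every other step shown, and that after the initial action the episode follows the fixed input policy `Pi_lect`. `discounted_return` is a hypothetical helper.

```python
def discounted_return(rewards, gamma):
    """Discounted sum r_1 + gamma*r_2 + gamma^2*r_3 + ... of a reward sequence."""
    G = 0.0
    for r in reversed(rewards):
        G = gamma * G + r
    return G


gamma = 0.9

# Start (2,2) with a_0 = 'L', then follow Pi_lect:
# (2,2) -L-> (2,1) -R-> (2,2) -R-> (2,3) -U-> (1,3), assumed reward -1 at the end
G_22 = discounted_return([0, 0, 0, -1], gamma)

# Start (2,3) with a_0 = 'L', then follow Pi_lect:
# (2,3) -L-> (2,2) -R-> (2,3) -U-> (1,3), assumed reward -1 at the end
G_23 = discounted_return([0, 0, -1], gamma)

print(round(G_22, 2), round(G_23, 2))  # -0.73 -0.81
```

Under these assumptions the two returns round to the $V(2,2)$ and $V(2,3)$ entries of the control table, whereas the evaluation table is obtained by following `Pi_star` from those states.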


## 3 - Monte Carlo control - Epsilon greedy policy

Turning to MC without exploring starts. Main points:

* When performing control with a given policy, it is possible to miss certain state-action pairs that aren't visited by the policy.

* In principle, would need to fill the dictionary $Q_\pi(s,a)$ for all $(s,a)\in\mathcal{S}\times \mathcal{A}_s$.

* First new change is that the policy built is $\varepsilon$-soft. In particular, we will build an $\varepsilon$-greedy policy.

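* **Definition:** (Sutton-Barto, p.100) A policy is said to be soft if $\pi(a|s)>0$ for all $(s,a)\in\mathcal{S}\times\mathcal{A}_s$, and $\varepsilon$-soft if $\pi(a|s)\geq\varepsilon/|\mathcal{A}_s|$ for all such $(s,a)$. In on-policy control, the soft policy is gradually shifted closer to an optimal deterministic policy.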

* **Question:** Why is he resetting to one initial state $s_0$?

* The new part in the implementation is how $a^\ast$ is chosen in the policy update at time $t$. If $a^\ast=\arg\max_a Q(s_t,a)$, choose $\pi(a|s_t)=(1-\varepsilon)+(\varepsilon/|\mathcal{A}_{s_t}|)$ if $a=a^\ast$ and $\pi(a|s_t)=\varepsilon/|\mathcal{A}_{s_t}|$ otherwise.

* Lazy Programmer proposes a function that chooses an epsilon-greedy action.

            def epsilon_greedy(Q,s,eps):
                if random() < eps:
                    return random action
                else:
                    return argmax_a Q(s,a)

* Like exploring starts, the version without exploring starts is an on-policy algorithm: the $\varepsilon$-soft policy that generates the episodes is the same one that gets improved.

* Will write an $\varepsilon$-soft policy generator. For the execution however, will have to modify the policy to avoid infinite loops when testing MC control.

* Will also have to write an episode generator following a stochastic policy. print_policy() will be useless then.
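
The `epsilon_greedy` pseudocode in the list above can be made runnable along these lines. This is a sketch, not Lazy Programmer's actual code: it assumes `Q` is a nested dictionary `Q[s][a]` as elsewhere in this notebook, and the `rng` argument (a NumPy `Generator`) is an addition made here for reproducibility.

```python
import numpy as np


def epsilon_greedy(Q, s, eps, rng):
    """With probability eps return a uniformly random action, else the greedy one."""
    actions = list(Q[s].keys())
    if rng.random() < eps:
        # Explore: sample an index, then look up the action
        return actions[rng.integers(len(actions))]
    # Exploit: action with the largest estimated value
    return max(Q[s], key=Q[s].get)


# Made-up Q-values for a single state
Q_demo = {(2, 2): {'U': 0.5, 'L': 0.1, 'R': -0.9}}
rng = np.random.default_rng(0)
a_greedy = epsilon_greedy(Q_demo, (2, 2), 0.0, rng)   # eps = 0: always greedy
a_explore = epsilon_greedy(Q_demo, (2, 2), 1.0, rng)  # eps = 1: always random
```

Note that the random branch can also return the greedy action, which is why the greedy action ends up with total probability $(1-\varepsilon)+\varepsilon/|\mathcal{A}_s|$.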



The "epsilon-greedy" Monte Carlo pseudocode is as follows:

            Require: Discount factor gamma.
            Initialize: Pi = arbitrary eps-soft policy with Pi(a|s)>0 for all s and all a in adm_actions[s],
                        Q(s,a)=0 for all state-actions (s,a),
                        create returns dict. and lists returns[(s,a)]=[].
            Loop until convergence:
                Reset to initial state s_0.
                Generate episode sample of length T following Pi: (s_0, a_0, r_1, ... a_(T-1), r_T, s_T)
                G = 0                                              # Init. cumulative return of generated episode
                for t in {(T-1), (T-2), ..., 1, 0}:
                    G = gamma*G + r_(t+1)                          # Evaluate sample return
                    if (s_t, a_t) not in {(s_i,a_i)}_{i=0,...,t-1}:# If this is the first visit of (s_t,a_t)
                        Append G to returns[s_t,a_t]               # Store sample cum. return
                        Q[s_t, a_t] = average(returns[s_t, a_t])   # Update Q(s_t, a_t)
                        a* = argmax_a Q[s_t,a]
                        for a in adm_actions[s_t]:                 # Update Pi*[a|s_t], eps-greedy policy
                            if a==a*:
                                Pi[a|s_t] = 1-eps+(eps/len(adm_actions[s_t]))
                            else:
                                Pi[a|s_t] = eps/len(adm_actions[s_t])
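
The policy update at the end of the pseudocode above can be isolated as a small function. `eps_greedy_probs` is a hypothetical name, and the Q-values in the example are made up; the sketch assumes the keys of `Q_s` are exactly the admissible actions of the state.

```python
def eps_greedy_probs(Q_s, eps):
    """Epsilon-greedy action probabilities for one state, given its Q-values."""
    n = len(Q_s)
    a_star = max(Q_s, key=Q_s.get)  # greedy action
    return {a: (1 - eps + eps / n) if a == a_star else eps / n for a in Q_s}


# Example with three admissible actions and eps = 0.05
probs = eps_greedy_probs({'U': -0.9, 'L': -0.73, 'R': -0.9}, eps=0.05)
print(probs)
```

With three actions and `eps = 0.05` the greedy action gets about 0.967 and the other two about 0.017 each, and the probabilities sum to 1 up to floating-point rounding.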


#### Random eps-soft policy generator

In [5]:
################################################
##### RANDOM EPSILON-SOFT POLICY GENERATOR #####
################################################
## 2022/05/03, AJ Zerouali
# Recall: rand_ind = np.random.choice(a = len(next_states), p = next_states_probs)


def gen_random_epslnsoft_policy(eps, env):
    '''
      Generates a random epsilon-soft policy for Windy GridWorld.
      ARGUMENTS: - eps, the epsilon float;
                 - env, Windy_GridWorld_simple object (environment).
      OUTPUT: Pi, an epsilon-soft policy dictionary.
      Note: eps should be between 5% and 10%.
    '''
    non_term_states = env.non_term_states
    adm_actions = env.adm_actions
    Pi = {}
    
    for s in non_term_states:
        Pi[s]={}
        actions_list_s = list(adm_actions[s])
        a_rand = np.random.choice(actions_list_s)
        for a in actions_list_s:
            if a==a_rand:
                Pi[s][a] = 1-eps+(eps/len(actions_list_s))
            else:
                Pi[s][a] = eps/len(actions_list_s)
    
    return Pi

#### Random episode generator with stochastic policy

Will use np.random.choice. Given a 1-D array-like object and a list of corresponding probabilities, this function picks one element at random (when size is not specified). The main issue is that the array must be 1-D: a list of action strings works, but a list of state tuples is read as a 2-D array and rejected. The safe approach is therefore to sample an index from range(len(...)) and to specify the probabilities with p= list of probabilities.
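
The index-based sampling described above can be sketched as follows. The policy entry `Pi_s` and the state list are made-up examples, and using a seeded `np.random.default_rng` instead of the global `np.random` functions is a choice made here for reproducibility.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up stochastic policy at a single state
Pi_s = {'U': 0.5, 'L': 0.25, 'R': 0.25}
actions_list = list(Pi_s.keys())
probs_list = list(Pi_s.values())

# Sample an index according to the policy, then look up the action
rand_ind = rng.choice(len(actions_list), p=probs_list)
a = actions_list[rand_ind]

# Same trick for states stored as tuples: sample an index, not the tuple itself
states = [(0, 0), (1, 0), (2, 0)]
s = states[rng.integers(len(states))]
print(a, s)
```

Sampling an index keeps the returned objects as plain Python strings and tuples, so they can be used directly as dictionary keys.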

In [6]:
###############################################
### GENERATE EPISODE FROM STOCHASTIC POLICY ###
###############################################
# 2022/05/03, AJ Zerouali
# Format: Epsd = [(0, s_0, a_0), (r_1, s_1, a_1), ..., (r_T, s_T, '')]
# 

def generate_episode_stochastic(s_0, a_0, Pi, env, T_max):
    '''
     Generates a random episode of max length (T_max + 1), given an initial state-action.
     ARGUMENTS: Initial state and action; stochastic policy; environment; max episode length (in this order).
     OUTPUT: episode_rewards: Rewards list
             episode_states: State list
             episode_actions: Actions list
             T: Episode length minus one.
             
     NOTE: The function below is tailored to GridWorld. The correct way
          to implement it is to use methods of the environment class.
    '''
    
    # Environment attributes
    non_term_states = env.non_term_states
    term_states = env.term_states
    adm_actions = env.adm_actions
    Rwds = env.rewards
    
    # Step t=0
    s_new = s_0
    a_new = a_0
    r_new = 0
    
    # Init. episode lists (format: step_t = (r_t, s_t, a_t)) and store Step 0
    episode_rewards = [r_new]
    episode_states = [s_new]
    episode_actions = [a_new]
    
        
    # Init. episode length
    T = 0
    
    ##### 0<t
    while (s_new not in term_states) and (T<T_max):
        
        # Init. old step
        r_old = r_new
        s_old = s_new
        a_old = a_new
        
        # WARNING: Modify for other environments
        # Compute new state
        if a_old == 'U':
            s_new = (s_old[0]-1, s_old[1])
        elif a_old == 'D':
            s_new = (s_old[0]+1, s_old[1])
        elif a_old == 'L':
            s_new = (s_old[0], s_old[1]-1)
        elif a_old == 'R':
            s_new = (s_old[0], s_old[1]+1)
        
        # Compute new action by sampling from the stochastic policy
        if s_new in non_term_states:
            # Here we are assuming that Pi[s].keys() is the same as adm_actions[s]
            actions_list_s_new = list(Pi[s_new].keys())
            probs_list_s_new = list(Pi[s_new].values())
            rand_ind = np.random.choice(len(actions_list_s_new), p = probs_list_s_new)
            # Pick action from random index
            a_new = actions_list_s_new[rand_ind]
        elif s_new in term_states:
            a_new = ''
        
        # Compute new reward
        r_new = Rwds.get(s_new, 0)
        
        # Add step to episode
        episode_rewards.append(r_new)
        episode_states.append(s_new)
        episode_actions.append(a_new)
        
        # Update
        T += 1
        
    # END WHILE
    
    # Output line
    return episode_rewards, episode_states, episode_actions, T
    
# END DEF generate_episode_stochastic()

In [7]:
#######################################################
###  MONTE CARLO CONTROL WITH EPSILON-GREEDY POLICY ###
#######################################################
## 2022/05/02, AJ Zerouali
# Monte Carlo without exploring starts (Lect. 64-65), following
# an epsilon greedy scheme.
# We are not using convergence thresholds here.
# This function below generates a specified number of sample
# episodes and averages returns to find the optimal policy.
# Can choose first visit MC or all visits MC with Boolean.
# This function also calls generate_episode_stochastic() defined above.

def MC_Ctrl_EpsGreedy(Pi, eps, env, gamma, N_samples, T_max, all_visits_MC):
    '''
     Epsilon-greedy Monte Carlo control algorithm.
     
     ARGUMENTS: - Pi, env, gamma: Policy, environment, discount factor.
                - eps: Epsilon float for output policy.
                - N_samples: No. of samples for Monte Carlo expectation.
                - T_max: Max. episode length minus 1.
                - all_visits_MC: Boolean for all visits or first visit MC.
     OUTPUT:    - Pi_star: Optimal policy
                - V_star: Optimal value function
                - Q: State-action values from samples
                - Returns: Dict. of return samples (by init. state-action)
                - N_iter: No. of randomly generated episodes (<= N_samples)
     NOTE: This function calls generate_episode_stochastic().
           Should take an eps-soft policy as input.
    '''
    
    # Environment attributes
    non_term_states = env.non_term_states
    term_states = env.term_states
    adm_actions = env.adm_actions
    Rwds = env.rewards
    
    
    # Init. Q, returns, Pi_star, and V_star dictionaries
    Pi_star = {}
    Q = {}
    Returns = {}
    V_star = {}
    for s in non_term_states:
        Pi_star[s] = {}
        Q[s]={}
        Returns[s] = {}
        V_star[s] = 0.0
        for a in adm_actions[s]:
            Q[s][a] = 0.0
            Returns[s][a]=[]
            
    for s in term_states:
        V_star[s] = 0.0
            
    
    # Init. counter
    N_iter = 0
    
    # Main MC loop
    while (N_iter<N_samples):
                
        ##########################
        ##### Step t=0 setup #####
        ##########################

        # Generate (s_0, a_0) randomly from non_term_states and adm_actions
        # np.random.choice() works only for 1-dim'l arrays
        s_0 = list(non_term_states)[np.random.randint(len(non_term_states))]
        a_0 = np.random.choice(adm_actions[s_0])

        ########################
        ##### Steps 1 to T #####
        ########################
        
        #print(f"Generating sample episode no. {N_iter+1}...\n") # Debug
        # Generate episode
        # Signature: episode_rewards, episode_states, episode_actions, T = generate_episode(s_0, a_0, Pi, env, T_max)
        # Step t of episode is (r_t, s_t, a_t) = (episode_rewards[t], episode_states[t], episode_actions[t])
        episode_rewards, episode_states, episode_actions, T = generate_episode_stochastic(s_0, a_0, Pi, env, T_max)
        #print(f"Sample episode N_iter={N_iter} has T={T} steps after t=0.\n") # Debug
 
        #####################################################
        ### COMPUTE CUMULATIVE RETURNS AND OPTIMAL ACTION ###
        #####################################################

        # First visit only MC
        if not all_visits_MC:
            # State-action iterable
            episode_state_actions = list(zip(episode_states, episode_actions))
            
            # Init. storing variable
            G = 0.0
            
            # Loop over episode
            for t in range(T-1, -1, -1): # Loop goes backwards from (T-1) to 0
                
                # Sum the return
                G = gamma*G + episode_rewards[t+1]
                s_t = episode_states[t]
                a_t = episode_actions[t]
                
                if (s_t, a_t) not in episode_state_actions[:t]:
                    
                    # Update sample returns and Q-function
                    Returns[s_t][a_t].append(G)
                    Q[s_t][a_t] = np.average(Returns[s_t][a_t])
                    
                    # Get a_star and update Pi_star
                    a_star = max(Q[s_t], key = Q[s_t].get)
                    Pi_star[s_t][a_star] = 1-eps+eps/len(adm_actions[s_t])
                    for a in adm_actions[s_t]:
                        if a != a_star:
                            Pi_star[s_t][a] = eps/len(adm_actions[s_t])
                
                    
        # END IF first visit MC
        
        # All visits MC
        elif all_visits_MC:
            
            # Init. storing variable
            G = 0.0
            
            for t in range(T-1, -1, -1): # Loop goes backwards from (T-1) to 0
                
                # Sum the returns
                G = gamma*G + episode_rewards[t+1]
                s_t = episode_states[t]
                a_t = episode_actions[t]
                
                # Update sample returns and Q-function
                Returns[s_t][a_t].append(G)
                Q[s_t][a_t] = np.average(Returns[s_t][a_t])
                
                # Get a_star and update Pi_star
                a_star = max(Q[s_t], key = Q[s_t].get)
                Pi_star[s_t][a_star] = 1-eps+eps/len(adm_actions[s_t])
                for a in adm_actions[s_t]:
                    if a != a_star:
                        Pi_star[s_t][a] = eps/len(adm_actions[s_t])
                
                              
        # Update N_iter
        N_iter += 1
        
        '''
        # DEBUG:
        print(f"Completed episode N_iter={N_iter} with R={G}.")
        print("--------------------------------------------------------------------")
        '''
       
    # END WHILE of MC loop
    
    # COMPUTE V
    ## CHANGE IN EPSILON GREEDY??
    for s in non_term_states:
        a = max(Q[s], key = Q[s].get)
        V_star[s]=Q[s][a]
    
    # Output
    return Pi_star, V_star, Q, Returns, N_iter

# END DEF MC_Ctrl_EpsGreedy

#### Test with non-windy GridWorld


In [21]:
# Create environment
grid = test_standard_grid()

# Discount factor, epsilon-greedy probability
eps = 0.1
gamma = 0.9
#epsilon = 1e-3

# Policies
Pi_eps_ini = {
    (2, 0): {'U': (1-eps)+eps/len(grid.adm_actions[(2, 0)]), 'R':eps/len(grid.adm_actions[(2, 0)])},
    (1, 0): {'U': (1-eps)+eps/len(grid.adm_actions[(1, 0)]), 'D':eps/len(grid.adm_actions[(1, 0)])},
    (0, 0): {'R': (1-eps)+eps/len(grid.adm_actions[(0, 0)]), 'D':eps/len(grid.adm_actions[(0, 0)])},
    (0, 1): {'R': (1-eps)+eps/len(grid.adm_actions[(0, 1)]), 'L':eps/len(grid.adm_actions[(0, 1)])},
    (0, 2): {'R': (1-eps)+eps/len(grid.adm_actions[(0, 2)]), 'D':eps/len(grid.adm_actions[(0, 2)]),\
                                                             'L':eps/len(grid.adm_actions[(0, 2)])},
    (1, 2): {'U': (1-eps)+eps/len(grid.adm_actions[(1, 2)]), 'D':eps/len(grid.adm_actions[(1, 2)]),\
                                                             'R':eps/len(grid.adm_actions[(1, 2)])},
    (2, 1): {'R': (1-eps)+eps/len(grid.adm_actions[(2, 1)]), 'L':eps/len(grid.adm_actions[(2, 1)])},
    (2, 2): {'U': (1-eps)+eps/len(grid.adm_actions[(2, 2)]), 'L':eps/len(grid.adm_actions[(2, 2)]),\
                                                             'R':eps/len(grid.adm_actions[(2, 2)])},
    (2, 3): {'L': (1-eps)+eps/len(grid.adm_actions[(2, 3)]), 'U':eps/len(grid.adm_actions[(2, 3)])},
  }

Pi_unif = {}
for s in grid.non_term_states:
    Pi_unif[s]={}
    for a in grid.adm_actions[s]:
        Pi_unif[s][a] = 1/len(grid.adm_actions[s])

Pi_opt = {
    (2, 0): {'U': 1.0},
    (1, 0): {'U': 1.0},
    (0, 0): {'R': 1.0},
    (0, 1): {'R': 1.0},
    (0, 2): {'R': 1.0},
    (1, 2): {'U': 1.0},
    (2, 1): {'R': 1.0},
    (2, 2): {'U': 1.0},
    (2, 3): {'L': 1.0},
  }

Pi_lect = {
    (2, 0): {'U': 1.0},
    (1, 0): {'U': 1.0},
    (0, 0): {'R': 1.0},
    (0, 1): {'R': 1.0},
    (0, 2): {'R': 1.0},
    (1, 2): {'R': 1.0},
    (2, 1): {'R': 1.0},
    (2, 2): {'R': 1.0},
    (2, 3): {'U': 1.0},
  }



# Max. episode length:
T_max = 200
N_samples = 5000
eps_2 = 0.05

# Run epsilon-greedy MC control:
Pi_star, V_star, Q, Returns, N_iter = MC_Ctrl_EpsGreedy(Pi_unif, eps_2, grid, gamma, N_samples, T_max,\
                                                        all_visits_MC = False)

print(f"Finished running MC_Ctrl_EpsGreedy with N_iter={N_iter} samples.")

# MC policy evaluation needs a stochastic version too

Finished running MC_Ctrl_EpsGreedy with N_iter=5000 samples.


In [14]:
V_star

{(0, 1): 0.2653558492526721,
 (1, 2): 0.26344629585996776,
 (2, 1): -0.09710571099198803,
 (0, 0): 0.1538743819135216,
 (2, 0): -0.02086175870654055,
 (2, 3): -0.31735544576054125,
 (0, 2): 1.0,
 (2, 2): -0.1879232051953183,
 (1, 0): 0.05649572956105423,
 (0, 3): 0.0,
 (1, 3): 0.0}

In [24]:
print_values(V_star,grid)

## VALUE FUNCTION ##
------------------------
 0.12| 0.22| 1.00| 0.00|
------------------------
 0.04| 0.00| 0.23| 0.00|
------------------------
-0.02|-0.10|-0.19|-0.34|
------------------------


In [23]:
Pi_star

{(0, 1): {'R': 0.975, 'L': 0.025},
 (1, 2): {'U': 0.9666666666666667,
  'D': 0.016666666666666666,
  'R': 0.016666666666666666},
 (2, 1): {'R': 0.025, 'L': 0.975},
 (0, 0): {'R': 0.975, 'D': 0.025},
 (2, 0): {'R': 0.025, 'U': 0.975},
 (2, 3): {'U': 0.025, 'L': 0.975},
 (0, 2): {'R': 0.9666666666666667,
  'L': 0.016666666666666666,
  'D': 0.016666666666666666},
 (2, 2): {'R': 0.016666666666666666,
  'U': 0.016666666666666666,
  'L': 0.9666666666666667},
 (1, 0): {'U': 0.975, 'D': 0.025}}

In [22]:
Pi_unif

{(0, 1): {'L': 0.5, 'R': 0.5},
 (1, 2): {'U': 0.3333333333333333,
  'D': 0.3333333333333333,
  'R': 0.3333333333333333},
 (2, 1): {'L': 0.5, 'R': 0.5},
 (0, 0): {'D': 0.5, 'R': 0.5},
 (2, 0): {'U': 0.5, 'R': 0.5},
 (2, 3): {'U': 0.5, 'L': 0.5},
 (0, 2): {'L': 0.3333333333333333,
  'R': 0.3333333333333333,
  'D': 0.3333333333333333},
 (2, 2): {'U': 0.3333333333333333,
  'R': 0.3333333333333333,
  'L': 0.3333333333333333},
 (1, 0): {'D': 0.5, 'U': 0.5}}

#### MC stochastic policy evaluation

In [25]:
######################################
###  MONTE CARLO POLICY EVALUATION ###
######################################
## 2022/05/03, AJ Zerouali

def MC_Stochastic_Policy_Eval(Pi, env, gamma, N_samples, T_max, all_visits_MC):
    '''
    Function implementing Monte Carlo policy evaluation. Generates 
    a specified number of sample episodes and averages returns to 
    evaluate a given stochastic policy. 
    Can choose first visit MC or all visits MC with Boolean.
    
     ARGUMENTS: - Pi, env, gamma: Policy, environment, discount factor.
                - N_samples: No. of samples for Monte Carlo expectation.
                - T_max: Max. episode length minus 1.
                - all_visits_MC: Boolean for all visits or first visit MC.
     OUTPUT:    - V:=V_Pi, value function obtained.
     NOTE: This function calls generate_episode_stochastic().
    '''
    
    # Environment attributes
    non_term_states = env.non_term_states
    term_states = env.term_states
    adm_actions = env.adm_actions
    Rwds = env.rewards
    
    # Init output and returns
    V = {}
    Returns = {}
    for s in term_states:
        V[s] = 0.0
    for s in non_term_states:
        V[s] = 0.0
        Returns[s]=[]
    
    # Init. counter
    N_iter = 0
    
    # Main MC loop
    while N_iter < N_samples:

        ##########################
        ##### Step t=0 setup #####
        ##########################

        # Count no. of non_term_states
        N_non_term_states = len(non_term_states)

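        # Generate s_0 randomly from non_term_states, sample a_0 from the policy Pi[s_0]
        s_0 = list(non_term_states)[np.random.randint(N_non_term_states)]
        actions_list_s_0 = list(Pi[s_0].keys())
        probs_list_s_0 = list(Pi[s_0].values())
        a_0 = actions_list_s_0[np.random.choice(len(actions_list_s_0), p = probs_list_s_0)]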

        ########################
        ##### Steps 1 to T #####
        ########################

        # Generate episode
        # Signature: episode_rewards, episode_states, episode_actions, T = generate_episode(s_0, a_0, Pi, env, T_max)
        episode_rewards, episode_states, episode_actions, T = generate_episode_stochastic(s_0, a_0, Pi, env, T_max)
        # Step t of episode is (r_t, s_t, a_t) = (episode_rewards[t], episode_states[t], episode_actions[t])

        ##################################
        ### COMPUTE CUMULATIVE RETURNS ###
        ##################################

        # Init. storing variable
        G = 0.0
        
        # First visit only MC
        if not all_visits_MC:
            # Loop over episode
            for t in range(T-1, -1, -1): # Loop goes backwards from (T-1) to 0
                G = gamma*G + episode_rewards[t+1]
                s_t = episode_states[t]
                
                if s_t not in episode_states[:t]:
                    Returns[s_t].append(G)
                    V[s_t] = np.average(Returns[s_t])
                    
        # All visits MC
        else:
            for t in range(T-1, -1, -1): # Loop goes backwards from (T-1) to 0
                G = gamma*G + episode_rewards[t+1]
                s_t = episode_states[t]
                Returns[s_t].append(G)
                V[s_t] = np.average(Returns[s_t])
        
        # Update N_iter
        N_iter += 1

    # END WHILE of MC loop
    
    # Output
    return V

# END DEF MC_Stochastic_Policy_Eval
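
To make the first-visit/all-visits distinction concrete, here is a minimal sketch on a toy episode. The helper name `visit_returns` and the toy states are hypothetical (they are not part of the GridWorld modules); the indexing convention is the one used above, where `rewards[t+1]` is the reward collected after leaving `states[t]`.

```python
def visit_returns(states, rewards, gamma, first_visit=True):
    """Collect discounted returns per state from a single episode.

    Convention: rewards[t+1] is the reward received after leaving states[t],
    and states[-1] is the final state of the episode.
    """
    T = len(states) - 1
    G = 0.0
    returns = {}
    # Walk backwards from t = T-1 to t = 0, accumulating the return
    for t in range(T - 1, -1, -1):
        G = gamma * G + rewards[t + 1]
        s_t = states[t]
        if first_visit and s_t in states[:t]:
            continue  # s_t was already visited earlier in the episode
        returns.setdefault(s_t, []).append(G)
    return returns

# Toy episode: A -> B -> A -> END, with a single reward of 1 at the end
states = ["A", "B", "A", "END"]
rewards = [0, 0, 0, 1]

print(visit_returns(states, rewards, 0.5, first_visit=True))
# {'B': [0.5], 'A': [0.25]}
print(visit_returns(states, rewards, 0.5, first_visit=False))
# {'A': [1.0, 0.25], 'B': [0.5]}
```

In the first-visit case, state A only contributes the return from its earliest occurrence (t=0), whereas the all-visits case also records the return from t=2.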


In [27]:
V = MC_Stochastic_Policy_Eval(Pi_star, grid, gamma, N_samples, 200, all_visits_MC=False)

In [28]:
print_values(V_star, grid)
print_values(V, grid)

## VALUE FUNCTION ##
------------------------
 0.12| 0.22| 1.00| 0.00|
------------------------
 0.04| 0.00| 0.23| 0.00|
------------------------
-0.02|-0.10|-0.19|-0.34|
------------------------
## VALUE FUNCTION ##
------------------------
 0.80| 0.89| 0.99| 0.00|
------------------------
 0.72| 0.00| 0.89| 0.00|
------------------------
 0.60| 0.54| 0.44|-0.28|
------------------------


#### Comments on epsilon-greedy strategies:

* A greedy strategy is short-sighted, in the sense that it disregards how many samples have been collected and how much confidence we have in each estimated expected return: it uses the available samples and simply takes the maximizing action. In principle, this is only a heuristic. Writing $J$ for the estimated expected return of each action, the pseudocode for a greedy policy typically looks as follows:

            While True:
                Compute the expectations J[a] from collected samples
                a = argmax_b J[b]
                execute action a

* An epsilon-greedy strategy aims to keep exploring instead of only exploiting the collected samples. We fix a small probability $\varepsilon$ (typically around $5\%$ to $10\%$) of taking a uniformly random action instead of the argmax. In contrast with the previous pseudocode, an epsilon-greedy algorithm is as follows:

            While True:
                Compute the expectations J[a] from collected samples
                Generate p = random number in [0,1]
                if p < epsilon:
                    a = random action
                else:
                    a = argmax_b J[b]
                execute action a

* In order to stop the "exploration" at some point, one typically decreases epsilon as samples are collected. This is sometimes called "cooling" in the literature, and epsilon is implemented as a function that decays fast as t goes to infinity (or, as far as I understand, as the number of samples grows).
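
The epsilon-greedy pseudocode and the cooling idea can be sketched in Python as follows. This is only an illustration: the function names, the dictionary `J` of estimated returns, and the $1/n$ decay schedule are assumptions of mine, not something taken from the course or from the GridWorld modules.

```python
import numpy as np

def epsilon_greedy_action(J, epsilon, rng):
    """Pick a key of J: uniformly random w.p. epsilon, otherwise the argmax."""
    actions = list(J.keys())
    if rng.random() < epsilon:
        # Exploration: any action, possibly the greedy one
        return actions[rng.integers(len(actions))]
    # Exploitation: action with the largest estimated return
    return max(J, key=J.get)

def decayed_epsilon(n, eps_0=0.1):
    """Hypothetical 1/n cooling schedule; n is the 1-based sample count."""
    return eps_0 / n

rng = np.random.default_rng(0)
J = {"U": 0.2, "D": -0.1, "L": 0.0, "R": 0.7}

# With epsilon = 0.1 and 4 actions, the greedy action "R" should be picked
# with probability 1 - 0.1 + 0.1/4 = 0.925
picks = [epsilon_greedy_action(J, 0.1, rng) for _ in range(10000)]
frac_greedy = picks.count("R") / len(picks)
print(f"Fraction of greedy picks: {frac_greedy:.3f}")
```

Note that the random branch can also return the greedy action, so the greedy action is chosen slightly more often than $1-\varepsilon$ of the time.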

## Appendix A: Other functions

### a) MC Control with exploring starts - Initial version


In [8]:
##################################################
###  MONTE CARLO CONTROL WITH EXPLORING STARTS ###
##################################################
## 2022/04/27, AJ Zerouali
# Monte Carlo with exploring starts (Lect. 64-65)
# We are not using convergence thresholds here.
# The function below generates a specified number of sample
# episodes and averages returns to find the optimal policy.
# Can choose first visit MC or all visits MC with Boolean.
# This function also calls generate_episode() from Monte_Carlo_Windy_GridWorld.

def MC_Ctrl_ExpStarts(Pi, env, gamma, N_samples, T_max, all_visits_MC):
    # ARGUMENTS: - Pi, env, gamma: Policy, environment, discount factor.
    #            - N_samples: No. of samples for Monte Carlo expectation.
    #            - T_max: Max. episode length minus 1.
    #            - all_visits_MC: Boolean for all visits or first visit MC.
    # OUTPUT:    - Pi_star: Optimal policy
    #            - V_star: Optimal value function
    #            - Q: State-action value estimates
    #            - Returns: Sampled returns for each (state, action)
    # NOTE: This function calls generate_episode().
    
    # Environment attributes
    non_term_states = env.non_term_states
    term_states = env.term_states
    adm_actions = env.adm_actions
    Rwds = env.rewards
    
    
    # Init. Q, returns, Pi_star, and V_star dictionaries
    Pi_star = {}
    Q = {}
    Returns = {}
    V_star = {}
    for s in non_term_states:
        Pi_star[s] = {}
        Q[s]={}
        Returns[s] = {}
        V_star[s] = 0.0
        for a in adm_actions[s]:
            Q[s][a] = 0.0
            Returns[s][a]=[]
            
    for s in term_states:
        V_star[s] = 0.0
            
    
    # Init. counter
    N_iter = 0
    
    # Main MC loop
    while N_iter < N_samples:

        ##########################
        ##### Step t=0 setup #####
        ##########################

        # Generate (s_0, a_0) randomly from non_term_states and adm_actions
        # np.random.choice() works only for 1-dim'l arrays
        s_0 = list(non_term_states)[np.random.randint(len(non_term_states))]
        a_0 = np.random.choice(adm_actions[s_0])

        ########################
        ##### Steps 1 to T #####
        ########################

        # Generate episode
        # Signature: episode_rewards, episode_states, episode_actions, T = generate_episode(s_0, a_0, Pi, env, T_max)
        episode_rewards, episode_states, episode_actions, T = generate_episode(s_0, a_0, Pi, env, T_max)
        # Step t of episode is (r_t, s_t, a_t) = (episode_rewards[t], episode_states[t], episode_actions[t])

        #####################################################
        ### COMPUTE CUMULATIVE RETURNS AND OPTIMAL ACTION ###
        #####################################################

        # Init. storing variable
        G = 0.0
        
        # First visit only MC
        if not all_visits_MC:
            # State-action iterable
            episode_state_actions = list(zip(episode_states, episode_actions))
            
            # Loop over episode
            for t in range(T-1, -1, -1): # Loop goes backwards from (T-1) to 0
                G = gamma*G + episode_rewards[t+1]
                s_t = episode_states[t]
                a_t = episode_actions[t]
                
                if (s_t, a_t) not in episode_state_actions[:t]:
                    
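                    # Update sample returns and Q-function
                    # WARNING: THERE SEEMS TO BE A PROBLEM HERE. Likely cause: episodes are
                    # generated with the fixed input Pi, not the current greedy Pi_star,
                    # so Q estimates Q_Pi rather than the optimal Q.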
                    Returns[s_t][a_t].append(G)
                    Q[s_t][a_t] = np.average(Returns[s_t][a_t])
                    
                    # Get a_star and update Pi_star
                    a_star = max(Q[s_t], key = Q[s_t].get)
                    Pi_star[s_t] = {a_star:1.0} 
                    
                    # Update V_star
                    V_star[s_t] = Q[s_t][a_star]
                    
                    
        # All visits MC
        else:
            for t in range(T-1, -1, -1): # Loop goes backwards from (T-1) to 0
                G = gamma*G + episode_rewards[t+1]
                s_t = episode_states[t]
                a_t = episode_actions[t]
                
                # Update sample returns and Q-function
                Returns[s_t][a_t].append(G)
                Q[s_t][a_t] = np.average(Returns[s_t][a_t])
                
                # Get a_star and update Pi_star
                a_star = max(Q[s_t], key = Q[s_t].get)
                Pi_star[s_t] = {a_star:1.0} 
                
                # Update V_star
                V_star[s_t] = Q[s_t][a_star]
        
        # Update N_iter
        N_iter += 1

    # END WHILE of MC loop
    
    # Output
    return Pi_star, V_star, Q, Returns

# END DEF MC_Ctrl_ExpStarts
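
A side remark on the averaging step: calling `np.average(Returns[s_t][a_t])` after every append recomputes the mean over an ever-growing list. A possible alternative, which is not what the function above does, is a running (incremental) mean that only stores the current estimate and a visit count. The helper name `incremental_mean_update` is hypothetical.

```python
import numpy as np

def incremental_mean_update(q_old, n_old, G):
    """Return (new mean, new count) after adding sample G to n_old samples."""
    n_new = n_old + 1
    # Standard running-mean identity: mean_n = mean_{n-1} + (G - mean_{n-1}) / n
    q_new = q_old + (G - q_old) / n_new
    return q_new, n_new

# Check against np.average on a few made-up returns
samples = [1.0, 0.0, 0.81, -1.0, 0.9]
q, n = 0.0, 0
for G in samples:
    q, n = incremental_mean_update(q, n, G)

print(f"Running mean: {q:.4f}, np.average: {np.average(samples):.4f}")
```

With this update, `Q[s][a]` and a counter `N[s][a]` would be enough, and the `Returns` lists would only be needed for debugging.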


In [10]:
# Create environment
grid = test_standard_grid()

# Policy to be evaluated
Pi_opt = {
    (2, 0): {'U': 1.0},
    (1, 0): {'U': 1.0},
    (0, 0): {'R': 1.0},
    (0, 1): {'R': 1.0},
    (0, 2): {'R': 1.0},
    (1, 2): {'U': 1.0},
    (2, 1): {'R': 1.0},
    (2, 2): {'U': 1.0},
    (2, 3): {'L': 1.0},
  }

Pi_lect = {
    (2, 0): {'U': 1.0},
    (1, 0): {'U': 1.0},
    (0, 0): {'R': 1.0},
    (0, 1): {'R': 1.0},
    (0, 2): {'R': 1.0},
    (1, 2): {'R': 1.0},
    (2, 1): {'R': 1.0},
    (2, 2): {'R': 1.0},
    (2, 3): {'U': 1.0},
  }

Pi_rand = gen_random_policy(grid)

# Discount factor and convergence threshold
gamma = 0.9
#epsilon = 1e-3

# Max. episode length:
T_max = 50
N_samples = 10000

# Run MC control with exploring starts:
# SIGNATURE: Pi_star, V_star, Q, Returns = MC_Ctrl_ExpStarts(Pi, env, gamma, N_samples, T_max, all_visits_MC)
#V = MC_Policy_Eval(Pi, grid, gamma, N_samples, T_max, all_visits_MC = False)
Pi_star, V_star, Q, Returns = MC_Ctrl_ExpStarts(Pi_rand, grid, gamma, N_samples, T_max, all_visits_MC = False)

print("Printing the optimal policy obtained from MC_Ctrl_ExpStarts:")
print_policy(Pi_star, grid)

print("Printing the value fn obtained from MC_Ctrl_ExpStarts:")
print_values(V_star, grid)

V_eval = MC_Policy_Eval(Pi_star, grid, gamma, 50, T_max, all_visits_MC = False)
print("Printing the value fn obtained from MC_Policy_Eval(Pi_star,...):")
print_values(V_eval, grid)


Printing the optimal policy obtained from MC_Ctrl_ExpStarts:
##  POLICY  ##
------------------------
  R  |  R  |  R  |
------------------------
  U  |     |  U  |
------------------------
  U  |  L  |  L  |  L  |
------------------------
Printing the value fn obtained from MC_Ctrl_ExpStarts:
## VALUE FUNCTION ##
------------------------
 0.81| 0.90| 1.00| 0.00|
------------------------
 0.73| 0.00| 0.90| 0.00|
------------------------
 0.66|-0.66|-0.73|-0.81|
------------------------
Printing the value fn obtained from MC_Policy_Eval(Pi_star,...):
## VALUE FUNCTION ##
------------------------
 0.81| 0.90| 1.00| 0.00|
------------------------
 0.73| 0.00| 0.90| 0.00|
------------------------
 0.66| 0.59| 0.53| 0.48|
------------------------


In [14]:
Pi_eps = gen_random_epslnsoft_policy(0.1, grid)
Pi_eps

{(0, 1): {'L': 0.9500000000000001, 'R': 0.05},
 (1, 2): {'U': 0.03333333333333333,
  'D': 0.9333333333333333,
  'R': 0.03333333333333333},
 (2, 1): {'L': 0.9500000000000001, 'R': 0.05},
 (0, 0): {'D': 0.9500000000000001, 'R': 0.05},
 (2, 0): {'U': 0.9500000000000001, 'R': 0.05},
 (2, 3): {'U': 0.9500000000000001, 'L': 0.05},
 (0, 2): {'L': 0.03333333333333333,
  'R': 0.9333333333333333,
  'D': 0.03333333333333333},
 (2, 2): {'U': 0.03333333333333333,
  'R': 0.03333333333333333,
  'L': 0.9333333333333333},
 (1, 0): {'D': 0.9500000000000001, 'U': 0.05}}
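
The probabilities displayed above are consistent with the usual epsilon-soft construction for $\varepsilon=0.1$: every admissible action gets $\varepsilon/|A(s)|$, and one preferred action gets the remaining $1-\varepsilon$ on top. The sketch below reproduces those numbers for a single state; the helper name `eps_soft_distribution` is hypothetical, and the actual `gen_random_epslnsoft_policy` may be implemented differently.

```python
def eps_soft_distribution(actions, preferred, eps):
    """Give every action eps/|A|, and the preferred action the remaining mass."""
    base = eps / len(actions)
    probs = {a: base for a in actions}
    probs[preferred] += 1.0 - eps
    return probs

# Two admissible actions: expect about 0.95 / 0.05
two = eps_soft_distribution(["L", "R"], "L", 0.1)
# Three admissible actions: expect about 0.9333 / 0.0333 / 0.0333
three = eps_soft_distribution(["U", "D", "R"], "D", 0.1)

print(two)
print(three)
```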

In [21]:
myListIndices = [0,1,2,3,4,5]
myProbs = [0.25,0.25,0.25,0.05,0.15,0.05]
myList = ['A','B','C','D','E','F']

In [23]:
ind_rand = np.random.choice(myListIndices, p=myProbs)
print(f"Random index: ind_rand=np.random.choice(myListIndices, p=myProbs)={ind_rand}")
print(f"Random character: myList[{ind_rand}]={myList[ind_rand]}")

Random index: ind_rand=np.random.choice(myListIndices, p=myProbs)=1
Random character: myList[1]=B


In [25]:
import collections

In [28]:
samples = []
for i in range(10000):
    ind_rand = np.random.choice(myListIndices, p=myProbs)
    samples.append(myList[ind_rand])

collections.Counter(samples)

Counter({'B': 2452, 'C': 2453, 'A': 2615, 'E': 1467, 'F': 530, 'D': 483})

In [33]:
myDict = {'F':0.1, 'A':0.25, 'D':0.05, 'C':0.15, 'E':3.14 , 'B':0.35}

In [34]:
for i in range(len(list(myDict.keys()))):
    print(f"Here: i={i}")
    print(f"list(myDict.keys())[{i}]={list(myDict.keys())[i]}")
    print(f"myDict[{list(myDict.keys())[i]}]={myDict[list(myDict.keys())[i]]}")
    print(f"list(myDict.values())[{i}]={list(myDict.values())[i]}")
    print("_________________________")

Here: i=0
list(myDict.keys())[0]=F
myDict[F]=0.1
list(myDict.values())[0]=0.1
_________________________
Here: i=1
list(myDict.keys())[1]=A
myDict[A]=0.25
list(myDict.values())[1]=0.25
_________________________
Here: i=2
list(myDict.keys())[2]=D
myDict[D]=0.05
list(myDict.values())[2]=0.05
_________________________
Here: i=3
list(myDict.keys())[3]=C
myDict[C]=0.15
list(myDict.values())[3]=0.15
_________________________
Here: i=4
list(myDict.keys())[4]=E
myDict[E]=3.14
list(myDict.values())[4]=3.14
_________________________
Here: i=5
list(myDict.keys())[5]=B
myDict[B]=0.35
list(myDict.values())[5]=0.35
_________________________


In [36]:
list(myDict.keys())

['F', 'A', 'D', 'C', 'E', 'B']

In [37]:
list(myDict.values())

[0.1, 0.25, 0.05, 0.15, 3.14, 0.35]

In [5]:
grid = test_standard_grid()

In [7]:
Pi_rand = gen_random_policy(grid)
print_policy(Pi_rand, grid)

##  POLICY  ##
------------------------
  R  |  L  |  L  |
------------------------
  D  |     |  R  |
------------------------
  R  |  R  |  L  |  U  |
------------------------


In [56]:
env = grid
adm_actions = grid.adm_actions
s_0 = (2,0)
a_0 = np.random.choice(adm_actions[s_0])
T_max = 100

In [57]:
episode_rewards, episode_states, episode_actions, T = generate_episode_stochastic(s_0, a_0, Pi, env, T_max)

In [58]:
episode = list(zip(episode_rewards, episode_states, episode_actions))
for step in episode:
    print(step)

(0, (2, 0), 'R')
(0, (2, 1), 'L')
(0, (2, 0), 'U')
(0, (1, 0), 'D')
(0, (2, 0), 'U')
(0, (1, 0), 'D')
(0, (2, 0), 'U')
(0, (1, 0), 'D')
(0, (2, 0), 'U')
(0, (1, 0), 'D')
(0, (2, 0), 'R')
(0, (2, 1), 'R')
(0, (2, 2), 'L')
(0, (2, 1), 'L')
(0, (2, 0), 'U')
(0, (1, 0), 'D')
(0, (2, 0), 'U')
(0, (1, 0), 'D')
(0, (2, 0), 'U')
(0, (1, 0), 'D')
(0, (2, 0), 'U')
(0, (1, 0), 'D')
(0, (2, 0), 'U')
(0, (1, 0), 'U')
(0, (0, 0), 'D')
(0, (1, 0), 'D')
(0, (2, 0), 'U')
(0, (1, 0), 'D')
(0, (2, 0), 'U')
(0, (1, 0), 'D')
(0, (2, 0), 'U')
(0, (1, 0), 'D')
(0, (2, 0), 'U')
(0, (1, 0), 'D')
(0, (2, 0), 'U')
(0, (1, 0), 'D')
(0, (2, 0), 'U')
(0, (1, 0), 'D')
(0, (2, 0), 'U')
(0, (1, 0), 'D')
(0, (2, 0), 'U')
(0, (1, 0), 'D')
(0, (2, 0), 'U')
(0, (1, 0), 'D')
(0, (2, 0), 'U')
(0, (1, 0), 'D')
(0, (2, 0), 'U')
(0, (1, 0), 'D')
(0, (2, 0), 'U')
(0, (1, 0), 'D')
(0, (2, 0), 'U')
(0, (1, 0), 'D')
(0, (2, 0), 'U')
(0, (1, 0), 'U')
(0, (0, 0), 'D')
(0, (1, 0), 'D')
(0, (2, 0), 'U')
(0, (1, 0), 'D')
(0, (2, 0), 'U

In [38]:
Pi_eps

{(0, 1): {'L': 0.9500000000000001, 'R': 0.05},
 (1, 2): {'U': 0.03333333333333333,
  'D': 0.9333333333333333,
  'R': 0.03333333333333333},
 (2, 1): {'L': 0.9500000000000001, 'R': 0.05},
 (0, 0): {'D': 0.9500000000000001, 'R': 0.05},
 (2, 0): {'U': 0.9500000000000001, 'R': 0.05},
 (2, 3): {'U': 0.9500000000000001, 'L': 0.05},
 (0, 2): {'L': 0.03333333333333333,
  'R': 0.9333333333333333,
  'D': 0.03333333333333333},
 (2, 2): {'U': 0.03333333333333333,
  'R': 0.03333333333333333,
  'L': 0.9333333333333333},
 (1, 0): {'D': 0.9500000000000001, 'U': 0.05}}

In [39]:
Pi = Pi_eps
s_new = (1,2)

In [40]:
actions_list_s_new = list(Pi[s_new].keys())
probs_list_s_new = list(Pi[s_new].values())
rand_ind = np.random.choice(len(actions_list_s_new), p = probs_list_s_new)
# Pick action from random index
a_new = actions_list_s_new[rand_ind]

In [41]:
print(f"actions_list_s_new = {actions_list_s_new}")
print(f"probs_list_s_new = {probs_list_s_new}")
print(f"rand_ind = {rand_ind}")
print(f"a_new = {a_new}")

actions_list_s_new = ['U', 'D', 'R']
probs_list_s_new = [0.03333333333333333, 0.9333333333333333, 0.03333333333333333]
rand_ind = 1
a_new = D


In [44]:
s_new = (2,2)

samples = []

for i in range(10000):
    actions_list_s_new = list(Pi[s_new].keys())
    probs_list_s_new = list(Pi[s_new].values())
    rand_ind = np.random.choice(len(actions_list_s_new), p = probs_list_s_new)
    # Pick action from random index
    a_new = actions_list_s_new[rand_ind]
    
    samples.append(a_new)

collections.Counter(samples)

Counter({'L': 9351, 'R': 312, 'U': 337})
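
The sampling steps tested in the cells above can be wrapped in a small helper. This is a sketch only: the name `sample_action` and the one-state policy `Pi_demo` are made up for the example, and I use a NumPy `Generator` instead of the global `np.random` state so that the run is reproducible.

```python
import collections
import numpy as np

def sample_action(Pi, s, rng):
    """Draw one action from the distribution Pi[s] = {action: probability}."""
    actions = list(Pi[s].keys())
    probs = list(Pi[s].values())
    # Sample an index, since the action labels themselves are just dict keys
    return actions[rng.choice(len(actions), p=probs)]

rng = np.random.default_rng(0)

# One-state stand-in for Pi_eps at s = (2, 2)
Pi_demo = {(2, 2): {"U": 1 / 30, "R": 1 / 30, "L": 14 / 15}}

counts = collections.Counter(
    sample_action(Pi_demo, (2, 2), rng) for _ in range(10000)
)
print(counts)
```

As in the counter above, roughly 93% of the draws should be `'L'`, with the rest split between `'U'` and `'R'`.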