In [6]:
import numpy as np
import random

**1. Creating a function that returns a randomly ordered list of probabilities**

In [20]:
def gen_bandits():
    """
    Function that returns a list of randomly shuffled probabilities.
    This list contains the reward probability of each "slot machine" (bandit).
    """
    bandits = [0.1, 0.1, 0.1, 0.2, 0.6]
    random.shuffle(bandits)
    return bandits
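
`random.shuffle` draws from Python's `random` module, not from NumPy's generator, so a call to `np.random.seed(...)` elsewhere does not fix the order it produces. A minimal sketch of a seedable variant follows; the helper name `gen_bandits_seeded` and its `rng` parameter are hypothetical and not part of the original notebook.

```python
import numpy as np

def gen_bandits_seeded(rng):
    """Hypothetical variant of gen_bandits(): shuffles with the given NumPy Generator."""
    bandits = np.array([0.1, 0.1, 0.1, 0.2, 0.6])
    # permutation() returns a shuffled copy and leaves the input untouched
    return rng.permutation(bandits).tolist()

# Two generators created with the same seed give the same bandit order
a = gen_bandits_seeded(np.random.default_rng(42))
b = gen_bandits_seeded(np.random.default_rng(42))
print(a == b)
```

Passing the generator in explicitly keeps every source of randomness under one seed, which is one way to make runs comparable.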


**2. A function that creates the game itself**

In [24]:
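def multi_armed_bandit(num_games=1000, epsilon=0.1, verbose=False):
    """
    Play `num_games` rounds of an epsilon-greedy multi-armed bandit.

    Returns the bandit probabilities, the total reward, the estimated Q values,
    the number of times each bandit was selected, and the reward accumulated per bandit.
    """
    
    bandits = gen_bandits()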
    total_reward = 0
    acum_reward_bandit = np.zeros(len(bandits))  # total reward accumulated per bandit
    num_selected_bandit = np.zeros(len(bandits)) # number of times each bandit was selected
    q_bandits = np.zeros(len(bandits))           # estimated average reward Q(a) per bandit
    
    if verbose:
        print("Initial Bandits Distribution\n  {}".format(bandits))
    
    for game in range(0, num_games):
        
        # Save the previous Q(a) values before the update
        # Initially, the values are zero
        # But after each iteration, the new values are saved
        old_q_bandits = q_bandits.copy()
        
        # Select the "slot machine" to play
        # Draw a random number between 0 and 1 from a uniform distribution
        # With probability epsilon it falls below epsilon and the algorithm explores
        # Otherwise (the usual case when epsilon is small) the algorithm exploits
        if np.random.random() < epsilon:
            bandit = np.random.randint(len(bandits))  # Explore
        else:
            # Select a bandit with the highest estimated Q value
            # This creates a boolean array where True is in the position(s) of the max Q
            # np.flatnonzero returns the indices with True
            bandit = np.random.choice(np.flatnonzero(q_bandits == q_bandits.max()))  # Exploit
            
        # Get the reward
        # If a randomly drawn number is less than the reward probability, you get 1 point
        reward = 1 if (np.random.random() < bandits[bandit]) else 0
        
        # Add the reward to the total
        total_reward += reward
        
        # Update the values for the selected "slot machine"
        acum_reward_bandit[bandit] += reward  # Update the total reward for this bandit
        num_selected_bandit[bandit] += 1      # Update the number of times this bandit was selected
        q_bandits[bandit] = acum_reward_bandit[bandit] / num_selected_bandit[bandit]  # Update Q(a) as the running average
        
        if verbose:
            print("\nGAME {game}\n  Old Q_Bandits = {old_q_bandits}\n  Selected Bandit = {bandit}"
                  "\n  Reward = {reward}\n  Q_Bandits = {q_bandits}"
                  .format(game=game+1, old_q_bandits=old_q_bandits, bandit=bandit, 
                          reward=reward, q_bandits=q_bandits))
    
    return bandits, total_reward, q_bandits, num_selected_bandit, acum_reward_bandit
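
The function above updates Q(a) by dividing the accumulated reward by the number of selections. The same running average can be kept with a single incremental update, Q ← Q + (r − Q) / n, which needs no separate reward accumulator. A small sketch that checks the two forms agree; the simulated reward stream and the variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated 0/1 rewards from a single arm with reward probability 0.6
rewards = (rng.random(1000) < 0.6).astype(float)

q = 0.0
for n, r in enumerate(rewards, start=1):
    q += (r - q) / n  # incremental form of the running average

# The incremental estimate matches the plain mean of all rewards seen
print(np.isclose(q, rewards.mean()))
```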



**3. Trying different values of exploration**

In [26]:
# List of epsilon values to test
epsilons = [0.0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.1, 0.15, 0.2, 0.5, 1.0]

# To store the results
resultados = []

# Loop: for each epsilon in the list...
for eps in epsilons:
    
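    # Seed NumPy's generator so the arm choices and rewards are reproducible.
    # gen_bandits() shuffles with Python's `random` module, which this call does
    # not seed, so the order of the bandits still differs between runs.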
    np.random.seed(42)
    
    # Run the algorithm with the current epsilon
    bandits, total_reward, q_bandits, num_selected_bandit, acum_reward_bandit = multi_armed_bandit(
        num_games=1000, 
        epsilon=eps,
        verbose=False
    )
    
    # Save the results as a dictionary
    resultados.append({
        'epsilon': eps,
        'total_reward': total_reward,
        'imagem_mais_exibida': int(np.argmax(num_selected_bandit)),
        'imagem_mais_cliques': int(np.argmax(acum_reward_bandit)),
        'q_max': round(np.max(q_bandits), 4)
    })

# Display the results
for r in resultados:
    print(r)


{'epsilon': 0.0, 'total_reward': 629, 'imagem_mais_exibida': 4, 'imagem_mais_cliques': 4, 'q_max': np.float64(0.629)}
{'epsilon': 0.01, 'total_reward': 610, 'imagem_mais_exibida': 4, 'imagem_mais_cliques': 4, 'q_max': np.float64(0.6108)}
{'epsilon': 0.02, 'total_reward': 369, 'imagem_mais_exibida': 2, 'imagem_mais_cliques': 2, 'q_max': np.float64(0.5549)}
{'epsilon': 0.03, 'total_reward': 243, 'imagem_mais_exibida': 1, 'imagem_mais_cliques': 0, 'q_max': np.float64(0.5962)}
{'epsilon': 0.04, 'total_reward': 525, 'imagem_mais_exibida': 1, 'imagem_mais_cliques': 1, 'q_max': np.float64(0.6017)}
{'epsilon': 0.05, 'total_reward': 489, 'imagem_mais_exibida': 1, 'imagem_mais_cliques': 1, 'q_max': np.float64(0.572)}
{'epsilon': 0.1, 'total_reward': 587, 'imagem_mais_exibida': 3, 'imagem_mais_cliques': 3, 'q_max': np.float64(0.6251)}
{'epsilon': 0.15, 'total_reward': 584, 'imagem_mais_exibida': 4, 'imagem_mais_cliques': 4, 'q_max': np.float64(0.6342)}
{'epsilon': 0.2, 'total_reward': 540, 'image

Interpretation:

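1. If the algorithm does not explore (epsilon = 0), it locks onto the first image that pays out, because that image's Q value becomes the only one above zero. In this run that happened to be the best image (total reward 629), but it could just as easily have been a poor one, so this setting does not make sense in a real situation. Some degree of exploration is necessary.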

2. With 100% exploration (epsilon = 1), the algorithm never uses what it has learned: every image is chosen at random, so the best image is shown only about one time in five and the opportunity to show it more often is lost.

3. Low exploration values (epsilon = 0.01) may not be enough for the algorithm to evaluate which image is the best. Since 1% of 1,000 is 10, the algorithm makes only about 10 exploratory choices and has few opportunities to learn the best images.

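4. Values between 2% and 5% give the algorithm more opportunities to learn which images are the best. In this run, however, they did not yield better rewards (369, 243, 525 and 489, against 610 for epsilon = 0.01), which suggests that a single 1,000-game run per epsilon is too noisy to show the benefit.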

5. From about 20% exploration onward, the algorithm spends too many plays on random choices and loses the opportunity to focus on the best images.
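
The points above rest on one run per epsilon. As a rough cross-check, the sketch below compares two numbers for a few epsilon values: the expected total reward if the greedy choice were always the best image, (1 − epsilon) × 0.6 + epsilon × 0.22 per game, and the average total reward over several simulated runs. It is a self-contained re-implementation under stated assumptions, not the notebook's own code: the helper names `ideal_expected_reward`, `run_once` and `mean_reward` are hypothetical, and a single NumPy `Generator` drives both the shuffle and the game.

```python
import numpy as np

PROBS = np.array([0.1, 0.1, 0.1, 0.2, 0.6])  # reward probabilities used by gen_bandits()

def ideal_expected_reward(epsilon, num_games=1000):
    """Expected total reward if every greedy choice picked the best arm."""
    per_game = (1 - epsilon) * PROBS.max() + epsilon * PROBS.mean()
    return num_games * per_game

def run_once(epsilon, rng, num_games=1000):
    """One epsilon-greedy run; arm order and all draws come from the same rng."""
    probs = rng.permutation(PROBS)
    total = np.zeros(len(probs))  # accumulated reward per arm
    count = np.zeros(len(probs))  # number of plays per arm
    q = np.zeros(len(probs))      # estimated average reward per arm
    for _ in range(num_games):
        if rng.random() < epsilon:
            arm = rng.integers(len(probs))                  # explore
        else:
            arm = rng.choice(np.flatnonzero(q == q.max()))  # exploit, ties broken at random
        reward = 1 if rng.random() < probs[arm] else 0
        total[arm] += reward
        count[arm] += 1
        q[arm] = total[arm] / count[arm]
    return total.sum()

def mean_reward(epsilon, num_runs=20, seed=0):
    """Average total reward over several independent runs."""
    rng = np.random.default_rng(seed)
    return float(np.mean([run_once(epsilon, rng) for _ in range(num_runs)]))

for eps in [0.0, 0.1, 1.0]:
    print(eps, round(ideal_expected_reward(eps), 1), round(mean_reward(eps), 1))
```

The ideal value is an upper reference rather than a prediction: for epsilon = 0.2 it is 524 and for epsilon = 1 it is 220, while the simulated average also pays for the games spent before the best image is identified.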