# Multi Armed Bandits Epsilon Greedy

### References
* [Introduction](https://markelsanz14.medium.com/introduction-to-reinforcement-learning-part-1-multi-armed-bandit-problem-618e8cbf9d4b)
* [Good Article](https://changyaochen.github.io/multi-armed-bandit-mar-2020/)
* [Multi Armed Bandits 101](https://medium.com/opex-analytics/multi-armed-bandits-101-6f4ac62b6bd6)
* [Other methods](https://towardsdatascience.com/multi-armed-bandits-and-reinforcement-learning-dc9001dcb8da)
* [Thompson Sampling](https://towardsdatascience.com/multi-armed-bandits-thompson-sampling-algorithm-fea205cf31df)
* [Upper Confidence Bound](https://towardsdatascience.com/multi-armed-bandits-upper-confidence-bound-algorithms-with-python-code-a977728f0e2d)

In [1]:
import numpy as np

def pull_bandit_arm(bandits, bandit_number):
    """
    Even pulling the arm might be random
    """
    result = np.random.uniform()
    result <= bandits[bandit_number]
    return int(result)

def take_epsilon_greedy_action(epsilon, average_rewards_actions):
    result = np.random.uniform()
    if result < epsilon:
        # Explore
        return np.random.randint(0, len(average_rewards_actions)) 
    else:
        # Greedy action.
        return np.argmax(average_rewards_actions) 

### Bit of Hope
Let's pretend we keep track of the CTR of each of our products in a online system

In [2]:
# CTR of each strategy
market_strategies_ctr = [0.1, 0.3, 0.05, 0.55, 0.4]

### Agent Learning

In [3]:
num_iterations = 500
epsilon = 0.1

# Store info to know which one is the best action in each moment.
total_rewards = [0 for _ in range(len(market_strategies_ctr))]
total_attempts = [0 for _ in range(len(market_strategies_ctr))]
avg_rewards = [0.0 for _ in range(len(market_strategies_ctr))]

for iteration in range(num_iterations+1):
    action = take_epsilon_greedy_action(epsilon, avg_rewards)
    reward = pull_bandit_arm(market_strategies_ctr, action)
    # Store result.
    total_rewards[action] += reward
    total_attempts[action] += 1
    avg_rewards[action] = total_rewards[action] / float(total_attempts[action])

    if iteration % 100 == 0:
        print('Expected reward at step: {} is {}'.format(
            iteration,['{:.2f}'.format(elem) for elem in avg_rewards]))

Expected reward at step: 0 is ['0.00', '0.00', '0.00', '0.00', '0.00']
Expected reward at step: 100 is ['0.00', '0.00', '0.00', '0.00', '0.00']
Expected reward at step: 200 is ['0.00', '0.00', '0.00', '0.00', '0.00']
Expected reward at step: 300 is ['0.00', '0.00', '0.00', '0.00', '0.00']
Expected reward at step: 400 is ['0.00', '0.00', '0.00', '0.00', '0.00']
Expected reward at step: 500 is ['0.00', '0.00', '0.00', '0.00', '0.00']


### Plot Best Strategy

In [4]:
best_bandit = np.argmax(avg_rewards)
print('\nBest bandit is {} with an mean reward of {:.3f}'.format(
    best_bandit, avg_rewards[best_bandit]))
print('Total mean reward in the {} episodes has been {}'
      .format(num_iterations, sum(total_rewards)))


Best bandit is 0 with an mean reward of 0.000
Total mean reward in the 500 episodes has been 0
