# Multi-Armed Bandits

Multi-armed bandits (MABs) are a simple class of problems that captures the fundamental trade-off between exploration and exploitation in reinforcement learning. The scenario resembles a gambler facing several slot machines (“one-armed bandits”): each pull of a lever yields a stochastic reward. The gambler (the agent) wants to maximize total reward over time by choosing which machines to play, balancing trying less-explored machines (exploration) against sticking with the best one found so far (exploitation).

# Problem Setup
- You have K slot machines (arms). Each arm, when pulled, gives a reward drawn from a fixed distribution that is unknown to you. For simplicity, assume each arm gives a reward of 1 with some fixed probability and 0 otherwise; this is called a Bernoulli bandit.
- Your goal is to maximize the total reward over a series of pulls (trials).
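
To make the Bernoulli arm concrete, here is a minimal sketch of a single pull. The helper name `pull`, the example probability 0.7, and the seed are illustrative assumptions, not part of the setup above.

```python
import random

def pull(success_prob, rng):
    """Simulate one pull of a Bernoulli arm: 1 with probability success_prob, else 0."""
    return 1 if rng.random() < success_prob else 0

# Hypothetical arm with a 70% success rate; the seed is fixed only for repeatability.
rng = random.Random(0)
rewards = [pull(0.7, rng) for _ in range(10_000)]
empirical_rate = sum(rewards) / len(rewards)
print(f"empirical success rate: {empirical_rate:.3f}")
```

With many pulls the empirical rate should settle near the arm’s true probability. That is the quantity the agent has to estimate from far fewer samples.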

Key challenge: Exploration vs. Exploitation
- Exploration: Try different arms to gain information about their payoffs.
- Exploitation: Use the information gathered so far to pick the arm with the highest estimated reward rate.


We will implement and compare three strategies:
1. $\epsilon$-Greedy: with probability $\epsilon$, explore (pick a random arm); otherwise exploit (pick the arm with the best estimate so far).
2. Upper Confidence Bound (UCB): select the arm with the highest upper confidence bound on its reward, which favors arms with fewer trials so far and so explores where uncertainty is highest.
3. Thompson Sampling: a Bayesian approach that maintains a distribution over each arm’s success probability and samples from these distributions to decide, which naturally balances exploration and exploitation.
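
The three selection rules can be sketched roughly as follows. This is only an illustration under some assumptions: the function names are hypothetical, the UCB variant is UCB1 with bonus $\sqrt{2 \ln t / n}$, Thompson Sampling uses a uniform Beta(1, 1) prior, and ties go to the lowest-indexed arm. Each function takes per-arm pull counts and success counts and returns the index of the arm to pull.

```python
import math
import random

def select_epsilon_greedy(counts, successes, epsilon, rng):
    """With probability epsilon pick a random arm; otherwise pick the best empirical mean."""
    if rng.random() < epsilon:
        return rng.randrange(len(counts))
    # Arms never pulled get an estimate of 0.0 (an assumption of this sketch).
    means = [s / n if n > 0 else 0.0 for s, n in zip(successes, counts)]
    return max(range(len(counts)), key=lambda a: means[a])

def select_ucb1(counts, successes, t):
    """UCB1 (assumed variant): empirical mean plus sqrt(2 ln t / n), t = total pulls so far."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm  # pull every arm once before using the bound
    scores = [s / n + math.sqrt(2 * math.log(t) / n)
              for s, n in zip(successes, counts)]
    return max(range(len(counts)), key=lambda a: scores[a])

def select_thompson(counts, successes, rng):
    """Sample each arm's success probability from Beta(1 + successes, 1 + failures)."""
    samples = [rng.betavariate(1 + s, 1 + (n - s))
               for s, n in zip(successes, counts)]
    return max(range(len(counts)), key=lambda a: samples[a])

# Hypothetical statistics after 22 pulls: arm 2 has been tried only twice.
rng = random.Random(0)
counts = [10, 10, 2]
successes = [3, 5, 1]
greedy_choice = select_epsilon_greedy(counts, successes, epsilon=0.0, rng=rng)
ucb_choice = select_ucb1(counts, successes, t=sum(counts))
thompson_choice = select_thompson(counts, successes, rng)
print(greedy_choice, ucb_choice, thompson_choice)
```

In this example, pure exploitation and UCB1 can disagree: the rarely pulled arm gets a large uncertainty bonus even though its empirical mean is no better than the others.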

Let’s set up a bandit problem to test these. We will create a bandit with, say, 3 arms:
- Arm 0: win probability 0.3
- Arm 1: win probability 0.5
- Arm 2: win probability 0.7

The best arm is Arm 2 (70% chance of reward), but the agents don’t know that initially.

In [1]:
import random

# True probabilities for each arm (unknown to agent)
true_probs = [0.3, 0.5, 0.7]
K = len(true_probs)

We’ll simulate a sequence of 1000 pulls. At each step, the agent chooses an arm according to its strategy and receives a reward of 1 (with the arm’s true probability) or 0 otherwise. We will track:
- The cumulative reward over time.
- The number of pulls of each arm (to see exploration).
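
A minimal sketch of such a simulation loop is shown below. The names `run_simulation` and `random_policy` are hypothetical. The uniformly random policy is only a placeholder to exercise the loop and is not one of the three strategies. The arm probabilities are repeated so the block runs on its own.

```python
import random

true_probs = [0.3, 0.5, 0.7]  # same arms as defined earlier
K = len(true_probs)
n_pulls = 1000

def run_simulation(choose_arm, seed=0):
    """Run n_pulls steps; return the cumulative reward per step and the pulls per arm."""
    rng = random.Random(seed)
    counts = [0] * K          # number of pulls of each arm
    successes = [0] * K       # number of rewards of 1 from each arm
    cumulative_rewards = []   # running total after each pull
    total = 0
    for _ in range(n_pulls):
        arm = choose_arm(counts, successes, rng)
        reward = 1 if rng.random() < true_probs[arm] else 0
        counts[arm] += 1
        successes[arm] += reward
        total += reward
        cumulative_rewards.append(total)
    return cumulative_rewards, counts

def random_policy(counts, successes, rng):
    """Placeholder policy: ignore the statistics and pick an arm uniformly at random."""
    return rng.randrange(len(counts))

cumulative_rewards, pull_counts = run_simulation(random_policy)
print("total reward:", cumulative_rewards[-1])
print("pulls per arm:", pull_counts)
```

Any strategy with the same `(counts, successes, rng)` interface could be passed in place of `random_policy`, so the tracking code stays identical across strategies.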