## **Multi-Armed Bandits**

The multi-armed bandit problem is named after a gambler facing a row of slot machines, each with a different, unknown probability of winning. The gambler's challenge is to maximize winnings by deciding which machine to play, how many times to play it, and when to switch to another machine. The scenario captures the exploration-exploitation trade-off: exploring to find the machine with the highest reward versus exploiting what is already known to maximize winnings.

* Challenge &rarr; maximize winnings.

To create a simulated multi-armed bandit environment, we start by assuming there are multiple slot machines, each with its own probability of winning. These probabilities are unknown to the agent, which has to learn them during training.

### **Slot Machines**

* Reward from an arm is 0 or 1.
* Agent's goal &rarr; accumulate the maximum reward.
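
As a minimal sketch of a single pull, assuming each machine behaves as a Bernoulli arm: the reward is 1 with the arm's win probability and 0 otherwise. The helper name `pull` and the probabilities below are illustrative assumptions, not part of the original setup.

```python
import numpy as np

def pull(win_prob, rng):
    """Hypothetical helper: return 1 with probability win_prob, else 0."""
    return int(rng.random() < win_prob)

rng = np.random.default_rng(0)
example_probs = [0.2, 0.5, 0.8]  # illustrative values; the agent never sees these

# Averaging many pulls of one arm approaches that arm's win probability
pulls = [pull(example_probs[2], rng) for _ in range(10_000)]
average_reward = sum(pulls) / len(pulls)
print(average_reward)
```

The agent only ever observes the 0/1 outcomes, so it has to estimate each arm's probability from averages like this one.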

### **Solving the problem**
* Strategy: decayed epsilon-greedy.
* With probability epsilon &rarr; select a random machine.
* With probability 1 - epsilon &rarr; select the best machine so far.
* Epsilon decreases over time.
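
The selection rule in the list above can be sketched as a small function. This is an illustration under assumptions: the name `select_arm`, the explicit `rng` argument, and the example estimates are hypothetical and chosen only to make the snippet self-contained.

```python
import numpy as np

def select_arm(values, epsilon, rng):
    """Hypothetical helper: epsilon-greedy choice over estimated values."""
    if rng.random() < epsilon:
        return int(rng.integers(len(values)))  # explore: uniform random arm
    return int(np.argmax(values))              # exploit: best estimate so far

rng = np.random.default_rng(0)
estimates = np.array([0.1, 0.9, 0.3])  # illustrative estimates

# epsilon = 0 never explores, so the arm with the highest estimate is chosen
greedy_choice = select_arm(estimates, epsilon=0.0, rng=rng)
# epsilon = 1 always explores, so every arm shows up over repeated calls
random_choices = {select_arm(estimates, epsilon=1.0, rng=rng) for _ in range(200)}
print(greedy_choice, sorted(random_choices))
```

Values of epsilon between these two extremes mix the behaviours, which is why starting high and decaying shifts the agent from exploring to exploiting.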

```
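import numpy as np
import matplotlib.pyplot as plt

n_bandits = 4
true_bandit_probs = np.random.rand(n_bandits)  # hidden win probability of each bandit

n_iterations = 100000
epsilon = 1.0          # start fully exploratory
min_epsilon = 0.01     # exploration floor
epsilon_decay = 0.999  # multiplicative decay applied after every iteration
counts = np.zeros(n_bandits)  # how many times each bandit was played
values = np.zeros(n_bandits)  # estimated winning probability of each bandit
rewards = np.zeros(n_iterations)  # reward history
selected_arms = np.zeros(n_iterations, dtype=int)  # arm selection history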
```
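
With the settings above, epsilon shrinks by a factor of 0.999 per step until it reaches the floor of 0.01. The sketch below estimates how long that takes, assuming the decay is applied once per iteration. It suggests the agent is mostly exploiting after roughly 4,600 of the 100,000 iterations.

```python
import math

epsilon, min_epsilon, epsilon_decay = 1.0, 0.01, 0.999

# Closed form: smallest t with epsilon * epsilon_decay**t <= min_epsilon
steps_closed_form = math.ceil(math.log(min_epsilon / epsilon) / math.log(epsilon_decay))

# Same count obtained by simulating the per-step update
steps_simulated = 0
while epsilon > min_epsilon:
    epsilon = max(min_epsilon, epsilon * epsilon_decay)
    steps_simulated += 1

print(steps_closed_form, steps_simulated)
```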

**Interaction Loop**
```
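def epsilon_greedy():
    # Assumed minimal helper: explore with probability epsilon, else exploit
    if np.random.rand() < epsilon:
        return np.random.randint(n_bandits)
    return np.argmax(values)

for i in range(n_iterations):
    arm = epsilon_greedy()
    reward = np.random.rand() < true_bandit_probs[arm]  # Bernoulli reward (True/False)
    rewards[i] = reward
    selected_arms[i] = arm
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean of rewards
    epsilon = max(min_epsilon, epsilon * epsilon_decay)  # decay, but never below the floor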
```
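
The update `values[arm] += (reward - values[arm]) / counts[arm]` keeps a running average without storing past rewards. A small check, using an illustrative reward stream with an assumed win probability of 0.7, shows that it matches the ordinary sample mean:

```python
import numpy as np

rng = np.random.default_rng(0)
observed = (rng.random(1_000) < 0.7).astype(float)  # illustrative 0/1 rewards

estimate, count = 0.0, 0
for reward in observed:
    count += 1
    estimate += (reward - estimate) / count  # same form as the values[arm] update

print(estimate, observed.mean())
```

The two printed numbers agree up to floating-point error, so each entry of `values` is simply the observed win rate of its arm.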

**Analyzing selections**
```
# One-hot encode the arm chosen at each iteration
selection_percentage = np.zeros((n_iterations, n_bandits))
for i in range(n_iterations):
    selection_percentage[i, selected_arms[i]] = 1
# Running share of selections per arm, in percent (computed once, after the loop)
selection_percentage = 100 * np.cumsum(selection_percentage, axis=0) / np.arange(1, n_iterations + 1).reshape(-1, 1)

# Draw all arms on one figure, then label and show it once
for arm in range(n_bandits):
    plt.plot(selection_percentage[:, arm], label=f'Bandit #{arm+1}')
plt.xscale('log')
plt.title('Bandit Action Choices Over Time')
plt.xlabel('Iteration Number')
plt.ylabel('Percentage of Bandit Selection (%)')
plt.legend()
plt.show()

# Compare the plot against the true win probabilities
for i, prob in enumerate(true_bandit_probs, 1):
    print(f"Bandit #{i} -> {prob:.2f}")
```
* The agent learns to select the bandit with the highest winning probability most of the time.