### K-Armed Bandit Solution for a Recommendation System Using the Epsilon-Greedy Method

In [8]:
import numpy as np
import random

In [9]:
#simulating user feedback for each item (1 = positive feedback, 0 = none)
def simulate_user_feedback(item):
    true_reward_probs = [0.2, 0.5, 0.3, 0.6, 0.8, 0.4, 0.1, 0.7, 0.9, 0.2]  
    return 1 if random.random() < true_reward_probs[item] else 0

In [10]:
#epsilon-greedy bandit algorithm for recommendation system
class EpsilonGreedyBandit:
    def __init__(self, k, epsilon, total_rounds):
        self.k = k  # number of arms (items)
        self.epsilon = epsilon  # exploration rate
        self.total_rounds = total_rounds  # total number of recommendations (users)
        self.rewards = np.zeros(k)  # estimated rewards for each item
        self.counts = np.zeros(k)  # number of times each item has been selected
        self.total_reward = 0  # total reward across all recommendations
        self.explorations = 0  # count of exploration actions
        self.exploitations = 0  # count of exploitation actions

    #selecting an item using epsilon-greedy method
    def select_item(self):
        if random.random() < self.epsilon:
            # Exploration: randomly select any item
            self.explorations += 1
            return np.random.randint(0, self.k)
        else:
            # Exploitation: select the item with the highest estimated reward
            # (np.argmax breaks ties by lowest index, so item 0 is chosen while all estimates are 0)
            self.exploitations += 1
            return np.argmax(self.rewards)
        
    #updating estimated rewards for the selected item based on the reward received
    def update_reward(self, item, reward):
        self.counts[item] += 1
        #updating the average reward
        self.rewards[item] += (reward - self.rewards[item]) / self.counts[item]
        self.total_reward += reward

    #running the recommendation system for the defined number of rounds (users)
    def run(self):
        for _ in range(self.total_rounds):  # loop index unused; avoids shadowing built-in round()
            #selecting an item to recommend
            item = self.select_item()
            
            #simulating feedback from the user for the selected item
            reward = simulate_user_feedback(item)
            
            #updating the reward estimates
            self.update_reward(item, reward)
        
        print(f"Results with epsilon = {self.epsilon}:")
        print(f"Total Reward: {self.total_reward}")
        print(f"Explorations: {self.explorations}, Exploitations: {self.exploitations}")
        print(f"Final estimated rewards: {self.rewards}\n")
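
The update in `update_reward` is the incremental form of the sample mean: after the n-th reward for an item, the estimate moves towards that reward by a step of 1/n, so no reward history needs to be stored. The following minimal sketch checks this against a directly computed mean. The helper name `incremental_mean` and the randomly generated 0/1 rewards are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def incremental_mean(rewards):
    """Apply the running-average update one reward at a time."""
    estimate = 0.0
    for n, reward in enumerate(rewards, start=1):
        # Same rule as update_reward: estimate += (reward - estimate) / n
        estimate += (reward - estimate) / n
    return estimate

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility
sample_rewards = rng.integers(0, 2, size=500)  # hypothetical 0/1 feedback

# The two values should agree up to floating-point rounding.
print(incremental_mean(sample_rewards), sample_rewards.mean())
```

Because the step size is 1/n, every past reward carries equal weight, which suits this simulation where the true reward probabilities never change.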

In [11]:
K = 10  #number of items (arms)
total_rounds = 1000  #number of rounds (recommendations)

#using different values of epsilon to observe the trade-off between exploration & exploitation
epsilons = [0.3, 0.1, 0.01]

for epsilon in epsilons:
    bandit = EpsilonGreedyBandit(K, epsilon, total_rounds)
    bandit.run()

Results with epsilon = 0.3:
Total Reward: 759
Explorations: 299, Exploitations: 701
Final estimated rewards: [0.21875    0.46875    0.38297872 0.71428571 0.79245283 0.45454545
 0.11538462 0.67741935 0.89177489 0.15625   ]

Results with epsilon = 0.1:
Total Reward: 825
Explorations: 86, Exploitations: 914
Final estimated rewards: [0.18181818 0.61538462 0.125      0.54545455 0.84848485 0.58333333
 0.33333333 0.66037736 0.8933162  0.375     ]

Results with epsilon = 0.01:
Total Reward: 828
Explorations: 9, Exploitations: 991
Final estimated rewards: [0.27819549 0.5        0.         0.         0.         0.5
 0.         0.         0.91734575 0.        ]
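
To put these totals in context, it can help to compare them against two reference points derived from the simulator's `true_reward_probs`: the expected total if the best item were recommended in every round, and the long-run expected total of an epsilon-greedy policy that has already identified the best item but still spends a fraction ε of its rounds on uniformly random items. The sketch below is only an illustration under those assumptions. `steady_state_ceiling` is a hypothetical helper name, and the observed totals are copied from the single run printed above.

```python
# True reward probabilities, copied from simulate_user_feedback.
true_reward_probs = [0.2, 0.5, 0.3, 0.6, 0.8, 0.4, 0.1, 0.7, 0.9, 0.2]
total_rounds = 1000

# Totals observed in the single run printed above (epsilon -> total reward).
observed_totals = {0.3: 759, 0.1: 825, 0.01: 828}

best_prob = max(true_reward_probs)
mean_prob = sum(true_reward_probs) / len(true_reward_probs)

def steady_state_ceiling(epsilon):
    """Hypothetical helper: expected total reward if the greedy choice were
    always the best item, with a fraction epsilon of rounds spent on
    uniformly random items (which may also hit the best item)."""
    per_round = (1 - epsilon) * best_prob + epsilon * mean_prob
    return per_round * total_rounds

oracle_total = best_prob * total_rounds  # always recommending the best item
for epsilon, observed in observed_totals.items():
    ceiling = steady_state_ceiling(epsilon)
    print(f"epsilon={epsilon}: observed={observed}, "
          f"ceiling={ceiling:.1f}, gap to oracle={oracle_total - observed:.0f}")
```

Under these assumptions the ceiling shrinks as ε grows, which is the ongoing cost of exploration. A small ε has a higher ceiling but may need many rounds to find the best item and approach it.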



Observations:

1. High Exploration with ε = 0.3 - The system explores often (about 30% of the time; 299 of 1000 rounds in this run), so it recommends a wider variety of items. This leads to more diverse recommendations but the lowest total reward (759), because many exploratory picks land on items that perform poorly.
    
2. Moderate Exploration with ε = 0.1 - Exploration happens less frequently (about 10% of the time; 86 of 1000 rounds), so more rounds are spent exploiting the items currently rated highest. The total reward (825) is higher than with ε = 0.3.
    
3. Low Exploration with ε = 0.01 - The system rarely explores (9 of 1000 rounds) and almost always recommends the item with the highest estimated reward. The total reward is the highest in this run (828), though only marginally above ε = 0.1. Several items still have an estimated reward of 0, meaning they were either never tried or tried too few times to register a reward, so the system may miss better items because of its limited exploration.
    
   
Therefore, higher values of epsilon lead to more exploration and a more diverse set of recommendations, but may produce lower total rewards. Lower values of epsilon lead to more exploitation of the items currently estimated to be best, which maximises short-term rewards but risks missing long-term opportunities to discover highly rewarding items. A moderate epsilon value provides a good balance between exploration and exploitation, giving reasonably high rewards while keeping recommendations diverse.
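
The totals above come from a single run per epsilon, and the gap between ε = 0.1 and ε = 0.01 is small enough that it could be due to chance. One way to check the trend is to repeat each configuration over several seeds and compare the averages. The sketch below is a compact, self-contained re-implementation of the same epsilon-greedy loop, not the notebook's class. The names `run_once` and `average_total_reward`, the use of NumPy's `default_rng`, and the choice of 50 runs are assumptions made for illustration.

```python
import numpy as np

# True reward probabilities, copied from simulate_user_feedback.
TRUE_REWARD_PROBS = np.array([0.2, 0.5, 0.3, 0.6, 0.8, 0.4, 0.1, 0.7, 0.9, 0.2])

def run_once(epsilon, total_rounds, rng):
    """Run one epsilon-greedy simulation and return its total reward."""
    k = len(TRUE_REWARD_PROBS)
    estimates = np.zeros(k)  # running average reward per item
    counts = np.zeros(k)     # number of times each item was recommended
    total_reward = 0
    for _ in range(total_rounds):
        if rng.random() < epsilon:
            item = int(rng.integers(0, k))    # explore: uniform random item
        else:
            item = int(np.argmax(estimates))  # exploit: best current estimate
        reward = int(rng.random() < TRUE_REWARD_PROBS[item])
        counts[item] += 1
        estimates[item] += (reward - estimates[item]) / counts[item]
        total_reward += reward
    return total_reward

def average_total_reward(epsilon, total_rounds=1000, n_runs=50, base_seed=0):
    """Return (mean, standard deviation) of total reward over n_runs seeds."""
    totals = [
        run_once(epsilon, total_rounds, np.random.default_rng(base_seed + i))
        for i in range(n_runs)
    ]
    return float(np.mean(totals)), float(np.std(totals))

summary = {eps: average_total_reward(eps) for eps in [0.3, 0.1, 0.01]}
for eps, (mean_total, std_total) in summary.items():
    print(f"epsilon={eps}: mean total reward={mean_total:.1f} (std {std_total:.1f})")
```

A large standard deviation for a small epsilon would suggest that its outcome depends heavily on whether the best item happens to be discovered early, which a single run cannot reveal.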