<a href="https://colab.research.google.com/github/pstanisl/mlprague-2021/blob/main/01_introduction_to_bandits.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MLPrague 2021
## How to Make Data-Driven Decisions: The Case for Contextual Multi-armed Bandits
### Petr Stanislav & Michal Pleva

# Introduction

The Multi-Armed Bandit (MAB) problem is a special case of Reinforcement Learning (RL): an agent collects rewards in an environment by taking actions after observing the state of the environment. The main difference between general RL and MAB is that in MAB we assume the action taken by the agent does not influence the next state of the environment. Therefore, MAB agents do not model state transitions, credit rewards to past actions, or "plan ahead" to reach reward-rich states.

As in other RL domains, the goal of a MAB agent is to find a policy that collects as much reward as possible. It would be a mistake, however, to always exploit the action that currently promises the highest reward: without enough exploration, the agent may never discover better actions. This is the main problem to be solved in MAB, often called the exploration-exploitation dilemma.
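
The cost of never exploring can be illustrated with a minimal sketch. The click probabilities, the zero initial estimates, and the helper name `run_pure_greedy` are assumptions made for illustration only. The agent always picks the arm with the highest current estimate, and `np.argmax` resolves ties in favor of the lowest index, so the second arm is never tried even though it is the better one.

```python
import numpy as np


def run_pure_greedy(probs, steps, seed=0):
    """Purely greedy agent with zero initial estimates (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    values = np.zeros(len(probs))            # estimated reward per arm
    counts = np.zeros(len(probs), dtype=int)  # number of pulls per arm
    for _ in range(steps):
        # Always exploit: pick the arm with the highest current estimate.
        # Ties go to the lowest index, so arm 0 is chosen first.
        action = int(np.argmax(values))
        # Bernoulli reward: 1 with probability probs[action], else 0
        reward = float(rng.random() < probs[action])
        counts[action] += 1
        # Running average of the rewards observed for this arm
        values[action] += (reward - values[action]) / counts[action]
    return values, counts


# Assumed click probabilities: arm 1 is clearly better than arm 0
values, counts = run_pure_greedy([0.2, 0.8], steps=1000)
print(f"estimates={values}, counts={counts}")
```

Because the estimate of arm 0 can never drop below zero, arm 1 never becomes the argmax and its count stays at zero regardless of the random seed. Some form of exploration is needed to escape this trap.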

Bandit environments, policies, and agents for MAB can be found in subdirectories of [tf_agents/bandits](https://github.com/tensorflow/agents/tree/master/tf_agents/bandits).

## Example

In [2]:
from typing import Dict, List, Tuple

import matplotlib.pyplot as plt
import numpy as np

Example of the Greedy Multi-Armed Bandit

In [5]:
class BannerEnvironment(object):
  """Example banner environment: clicks on each banner follow a Bernoulli distribution with a fixed click-through rate (CTR)."""
  def __init__(self, params: List[float]):
    self._params = params
    self._observe()

  def reset(self):
    return self._observe()

  def _observe(self) -> np.ndarray:
    # Draw a fresh uniform random number in [0, 1)
    self._observation = np.random.rand(1)
    return self._observation

  def step(self, action: int) -> Tuple[int, float]:
    # Reward is 1 (click) with probability params[action], otherwise 0
    ret = 0 if self._observe()[0] > self._params[action] else 1
    return (ret, self._observation[0])

  def best_action(self) -> int:
    return int(np.argmax(self._params))


class GreedyPolicy(object):
  """Simple greedy policy"""

  def __init__(self, values):
    self._values = values

  def action(self) -> int:
    # Pick the arm with the highest estimated value (ties go to the lowest index)
    return int(np.argmax(self._values))


class GreedyAgent(object):
  """Greedy Agent with optimistic initialization."""

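  def __init__(self, n: int):
    self._n = n
    self.reset()

  def reset(self):
    self._counts = [0] * self._n
    # Optimistic initialization: every arm starts at the maximum possible reward
    self._values = [1.0] * self._n
    # Rebuild the policy so it always reads the current value estimates
    self.policy = GreedyPolicy(self._values)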

  def train(self, experience: Dict[str, int]):
    action = experience['action']
    reward = experience['reward']

    self._counts[action] += 1

    value = self._values[action]
    n = self._counts[action]

    # Incremental mean: the new estimate is the average of all rewards seen for this arm
    self._values[action] = ((n - 1) / n) * value + (1 / n) * reward
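
The update in `train` is an incremental form of the sample mean. The sketch below checks it against `np.mean` on an assumed reward sequence (the numbers are illustrative only). It also shows that the optimistic initial value of 1.0 is discarded on the first update, because the weight `(n - 1) / n` is zero when `n = 1`. The optimistic value therefore only affects arms that have not been tried yet.

```python
import numpy as np

rewards = [1, 0, 0, 1, 1]  # assumed reward sequence for a single arm

value, n = 1.0, 0  # optimistic initial estimate, no pulls yet
history = []
for r in rewards:
    n += 1
    # Same update rule as in GreedyAgent.train
    value = ((n - 1) / n) * value + (1 / n) * r
    history.append(value)

# The incremental estimate matches the plain average after every step
for k, estimate in enumerate(history, start=1):
    assert np.isclose(estimate, np.mean(rewards[:k]))

print(f"final estimate={value:.2f}, sample mean={np.mean(rewards):.2f}")
```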

In [16]:
environment = BannerEnvironment([0.25, 0.4, 0.67])
environment.reset()

agent = GreedyAgent(3)

for _ in range(100):
  action = agent.policy.action()
  reward, _ = environment.step(action)
  # Package the step as an experience dictionary
  experience = {'action': action, 'reward': reward}
  # Update the agent's value estimates from this experience
  agent.train(experience)

print(f"Agent's reward estimations={agent._values} and counts={agent._counts}")

Agent's reward estimations=[0.0, 0.5, 0.6082474226804119] and counts=[1, 2, 97]
