<a href="https://colab.research.google.com/github/pstanisl/mlprague-2021/blob/main/01_introduction_to_bandits.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MLPrague 2021
## How to Make Data-Driven Decisions: The Case for Contextual Multi-armed Bandits
### Petr Stanislav & Michal Pleva

# Introduction

Reinforcement learning (RL) is a general framework where agents learn to perform **actions** in an **environment** so as to maximize a **reward**. The two main components are the environment, which represents the problem to be solved, and the agent, which represents the learning algorithm.

The agent and environment continuously interact with each other. At each time step, the agent takes an action on the environment based on its policy $\pi(a_t|s_t)$, where $s_t$ is the current observation from the environment, and receives a reward $r_{t+1}$ and the next observation $s_{t+1}$ from the environment. The goal is to improve the policy so as to maximize the sum of rewards (return).
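
The interaction loop described above can be sketched in a few lines of plain Python. This is only an illustration: `LineWorld`, `random_policy`, and `run_episode` are hypothetical names invented for this sketch, not part of any library, and the random policy stands in for a learned $\pi(a_t|s_t)$.

```python
import random


class LineWorld:
    """Hypothetical toy environment: walk along positions 0..goal."""

    def __init__(self, goal=3):
        self.goal = goal
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position  # initial observation s_0

    def step(self, action):
        # action is -1 (left) or +1 (right); the position is clipped at 0.
        self.position = max(0, self.position + action)
        done = self.position == self.goal
        reward = 1.0 if done else 0.0  # r_{t+1}
        return self.position, reward, done  # s_{t+1}, r_{t+1}, end of episode


def random_policy(observation, rng):
    """Sample an action uniformly; a stand-in for pi(a_t | s_t)."""
    return rng.choice([-1, 1])


def run_episode(env, policy, rng, max_steps=100):
    """Run one episode and return the undiscounted sum of rewards."""
    observation = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(observation, rng)
        observation, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward


if __name__ == "__main__":
    print(run_episode(LineWorld(goal=3), random_policy, random.Random(0)))
```

Improving the policy means replacing `random_policy` with something that uses the observation and past rewards to pick better actions.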

# Examples

- *A master chess player* makes a move. The choice is informed both by
planning—anticipating possible replies and counterreplies—and by immediate, intuitive judgments of the desirability of particular positions
and moves.
- *An adaptive controller* adjusts parameters of a petroleum refinery’s operation in real time. The controller optimizes the yield/cost/quality
trade-off on the basis of specified marginal costs without sticking strictly
to the set points originally suggested by engineers.
- *A mobile robot* decides whether it should enter a new room in search of
more trash to collect or start trying to find its way back to its battery
recharging station. It makes its decision based on the current charge
level of its battery and how quickly and easily it has been able to find
the recharger in the past.
- *A stock trader* decides what to do with their shares. The state consists of how many shares of each stock the trader owns, the current price of each stock, and how much uninvested cash is available. There are three possible actions: sell, buy, or hold. The reward is the change in the value of the portfolio from one step to the next.

# The Cartpole Environment
The Cartpole environment is one of the best-known classic reinforcement learning problems (the "Hello, World!" of RL). A pole is attached to a cart, which can move along a frictionless track. The pole starts upright, and the goal is to prevent it from falling over by controlling the cart.

- The observation from the environment $s_t$ is a 4D vector representing the position and velocity of the cart, and the angle and angular velocity of the pole.
- The agent can control the system by taking one of two actions $a_t$: push the cart right (+1) or left (-1).
- A reward $r_{t+1}=1$ is provided for every time step that the pole remains upright. The episode ends when one of the following is true:
  - the pole tips over some angle limit
  - the cart moves outside of the world edges
  - 200 time steps pass.
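
As a rough sketch, the three termination conditions can be written as a single check. The numeric limits below are assumptions borrowed from the Gym `CartPole-v0` implementation (a 12-degree angle limit and a cart position limit of 2.4); the text above does not specify them. The helper name `episode_over` is hypothetical.

```python
import math

# Assumed limits, taken from Gym's CartPole-v0; not stated in the text above.
ANGLE_LIMIT_RAD = 12 * math.pi / 180  # pole angle limit, in radians
POSITION_LIMIT = 2.4                  # distance from the center to a world edge
MAX_STEPS = 200                       # episode length cap from the list above


def episode_over(observation, steps_taken):
    """Return True if any of the three termination conditions holds.

    `observation` is the 4D vector (cart position, cart velocity,
    pole angle, pole angular velocity).
    """
    cart_position, _cart_velocity, pole_angle, _pole_angular_velocity = observation
    return (
        abs(pole_angle) > ANGLE_LIMIT_RAD       # the pole tipped too far
        or abs(cart_position) > POSITION_LIMIT  # the cart left the world
        or steps_taken >= MAX_STEPS             # the time limit was reached
    )
```

Because the reward is 1 per time step, the undiscounted return of an episode equals the number of steps the pole stayed up, capped at 200.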

The goal of the agent is to learn a policy $\pi(a_t|s_t)$ that maximizes the discounted sum of rewards in an episode, $\sum_{t=0}^{T-1} \gamma^t r_{t+1}$, where $T$ is the number of time steps in the episode. Here $\gamma$ is a discount factor in $[0,1]$ that weights future rewards relative to immediate ones: with $\gamma < 1$, a reward received $t$ steps in the future counts only $\gamma^t$ as much as an immediate one, so the policy favors obtaining rewards quickly.
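
To make the effect of $\gamma$ concrete, here is a minimal sketch that computes the discounted sum for a list of rewards given in the order they were received. The function name `discounted_return` is an assumption made for this illustration.

```python
def discounted_return(rewards, gamma):
    """Compute sum_k gamma**k * rewards[k] for rewards in arrival order."""
    total = 0.0
    # Accumulate from the last reward backwards: G = r + gamma * G_next.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


if __name__ == "__main__":
    full_episode = [1.0] * 200  # the pole stayed up for all 200 steps
    print(discounted_return(full_episode, gamma=1.0))   # no discounting
    print(discounted_return(full_episode, gamma=0.99))  # later rewards count less
```

With $\gamma = 1$ the result is simply the episode length, while smaller values of $\gamma$ shrink the contribution of rewards that arrive late in the episode.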

# Multi-Armed Bandits

The Multi-Armed Bandit problem (MAB) is a special case of Reinforcement Learning: an agent collects rewards in an environment by taking some actions after observing some state of the environment. The main difference between general RL and MAB is that in MAB, we assume that the action taken by the agent does not influence the next state of the environment. Therefore, agents do not model state transitions, credit rewards to past actions, or "plan ahead" to get to reward-rich states.

As in other RL domains, the goal of a MAB agent is to find a policy that collects as much reward as possible. It would be a mistake, however, to always exploit the action that currently promises the highest reward: without enough exploration, the agent may never discover that another action is better. This trade-off, often called the exploration-exploitation dilemma, is the main problem to be solved in MAB.
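
One simple way to balance the two is the epsilon-greedy strategy: with probability $\epsilon$ pick a random action, otherwise pick the action with the highest estimated reward. The sketch below is a plain-Python illustration under stated assumptions, not the TF-Agents implementation: the arms pay a reward of 1 with fixed probabilities (`arm_probs`), and the function name and its parameters are hypothetical.

```python
import random


def epsilon_greedy_bandit(arm_probs, steps=5000, epsilon=0.1, seed=0):
    """Play a Bernoulli bandit with an epsilon-greedy policy.

    Returns the pull count per arm, the estimated mean reward per arm,
    and the total collected reward.
    """
    rng = random.Random(seed)
    n_arms = len(arm_probs)
    counts = [0] * n_arms
    values = [0.0] * n_arms  # running mean of observed rewards per arm
    total_reward = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore: pick a random arm
        else:
            # Exploit: pick the best estimate (ties go to the lowest index).
            arm = max(range(n_arms), key=lambda a: values[a])
        reward = 1.0 if rng.random() < arm_probs[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the running mean for the pulled arm.
        values[arm] += (reward - values[arm]) / counts[arm]
        total_reward += reward
    return counts, values, total_reward


if __name__ == "__main__":
    arm_probs = [0.2, 0.8]  # assumed true reward probabilities
    for eps in (0.0, 0.1):
        counts, values, total = epsilon_greedy_bandit(arm_probs, epsilon=eps)
        print(f"epsilon={eps}: pulls={counts}, total reward={total}")
```

With `epsilon=0.0` the agent never leaves the first arm, because the untried second arm keeps its initial estimate of zero. That is the failure mode of pure exploitation described above; a small amount of exploration lets the agent discover the better arm.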

Bandit environments, policies, and agents for MAB can be found in subdirectories of [tf_agents/bandits](https://github.com/tensorflow/agents/tree/master/tf_agents/bandits).