# Contextual Bandits with TF-Agents







## Learning Objectives

* 
*

A *contextual bandit (CB)* is a machine learning framework in which an *agent* selects actions (also called *arms*) in order to maximize its long-term reward. At each round, the agent receives some information about the current state (also called the *context*) and uses this information to select an action. As a consequence of this choice, it receives a *reward*.

On the one hand, a contextual bandit is one of the simplest instances of a *reinforcement learning problem*: a single state (or context) is provided to the agent, and the play or *episode* stops after the first action has been chosen and the reward received. This setting appears in a number of useful industry problems, one of the best known being ad placement on a website: the different ads to publish on a webpage are the different actions, the context is given by the user's features, and the reward is 1 if the user clicks on the published ad and 0 otherwise.

On the other hand, a contextual bandit is a natural generalization of a classification problem in supervised learning. Namely, consider a dataset of points $(x, y)$ where the $x$'s are the features and the $y$'s are the labels in $k$ possible classes. We can set up an associated contextual bandit problem as follows: at each time step, the CB agent is given the context $x$. From that information, it needs to select one of $k$ possible actions, which are the $k$ possible classes. If the agent chooses the correct class for feature $x$, then the reward is $1$, and $0$ otherwise. The general goal of maximizing the long-term cumulative reward for the CB agent is equivalent to that of minimizing the training loss in supervised learning. Contextual bandits are more general than classification though, since in many useful CB settings we actually know the reward only for the actions we have taken.
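
The construction above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the toy dataset, `bandit_round`, and `random_policy` are hypothetical names that do not come from TF-Agents.

```python
import random

# Hypothetical toy dataset: (features, label) pairs with k = 3 classes.
DATASET = [((0.1, 0.9), 0), ((0.8, 0.2), 1), ((0.5, 0.5), 2)]
NUM_ACTIONS = 3


def bandit_round(example, choose_action):
    """Plays one round: reveal the context, get an action, return the reward."""
    context, label = example
    action = choose_action(context)
    # Only the reward of the chosen action is revealed (bandit feedback);
    # the true label itself stays hidden from the agent.
    reward = 1.0 if action == label else 0.0
    return action, reward


rng = random.Random(0)


def random_policy(context):
    """A policy that ignores the context and guesses uniformly at random."""
    return rng.randrange(NUM_ACTIONS)


total_reward = sum(bandit_round(ex, random_policy)[1] for ex in DATASET)
accuracy = total_reward / len(DATASET)
```

With 0/1 rewards, the average reward of a policy is simply its classification accuracy on the rounds it played.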

In this lab, we will learn how to solve a contextual bandit problem derived from a classification dataset with *Q-learning* and the associated *neural epsilon-greedy strategy* using a powerful reinforcement learning library written in TensorFlow: [TensorFlow Agents](https://www.tensorflow.org/agents).

**Acknowledgement:** This lab is based on a tutorial originally written by Anant Nawalgaria and Alex Erfurt. We thank them for making their original material available to us.

## Setup

Let us install [TensorFlow Agents](https://www.tensorflow.org/agents) if it is not already installed and import the necessary libraries:

In [None]:
pip freeze | grep tf_agents || pip install -q tf_agents

In [None]:
import functools
import os
import time

import pandas as pd
import tensorflow as tf  # pylint: disable=g-explicit-tensorflow-version-import

# from tensorflow.python.framework.dtypes import int64
from tensorflow_io.bigquery import BigQueryClient
from tensorflow_probability import distributions as tfd
from tf_agents.bandits.agents.neural_epsilon_greedy_agent import (
    NeuralEpsilonGreedyAgent,
)
from tf_agents.bandits.environments import environment_utilities as env_util
from tf_agents.bandits.environments.classification_environment import (
    ClassificationBanditEnvironment,
)
from tf_agents.bandits.metrics import tf_metrics as tf_bandit_metrics
from tf_agents.drivers.dynamic_step_driver import DynamicStepDriver
from tf_agents.eval import metric_utils
from tf_agents.metrics import tf_metrics
from tf_agents.networks.q_network import QNetwork
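from tf_agents.replay_buffers.tf_uniform_replay_buffer import (
    TFUniformReplayBuffer,
)

# `ts` is needed at the end of the notebook to build a `ts.TimeStep`.
from tf_agents.trajectories import time_step as ts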

In [None]:
REGION = "us-central1"
PROJECT_ID = !(gcloud config get-value project)
PROJECT_ID = PROJECT_ID[0]

os.environ["PROJECT_ID"] = PROJECT_ID

## Loading the dataset into BigQuery

In this lab, we are going to use a classification dataset and turn it into a contextual bandit problem. 

Our dataset will be the [Covertype dataset](https://archive.ics.uci.edu/ml/datasets/covertype) from the UCI Machine Learning Repository, which associates various cartographic features of a given area with a label representing the type of forest covering that area.

The original features are as follows (the last column being the label):

In [None]:
pd.read_csv("../../tfx_pipelines/data/dataset.csv").head(2)

At each time step, our CB agent will be given a context $x$ representing an area's cartographic features (`Elevation`, `Aspect`, `Slope`, etc.). It will then have to choose one of 7 possible forest cover types, as defined by the last column (`Cover_type`), each represented by an integer from 0 to 6.

For convenience, we have preprocessed the categorical features `Wilderness_Area` and `Soil_Type` into their one-hot-encoded versions. The dataset we will use therefore has more columns (exactly 55) than the original covertype dataset. We will name the columns from `C0` to `C54`. The columns from `C0` to `C53` represent the features, while the last column `C54` represents the label.
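
As a reminder of what one-hot encoding does, here is a minimal sketch in plain Python. The helper `one_hot` and the category names are hypothetical and only serve the illustration; the preprocessing actually applied to `covertype.csv` is not shown in this lab.

```python
def one_hot(value, categories):
    """Encodes `value` as a 0/1 list with a single 1 at its category index."""
    return [1 if value == category else 0 for category in categories]


# Hypothetical category names, for illustration only.
WILDERNESS_AREAS = ["Rawah", "Neota", "Comanche", "Cache"]
encoded = one_hot("Comanche", WILDERNESS_AREAS)
```

Each categorical column is thus replaced by as many 0/1 columns as it has categories, which is why the preprocessed dataset is wider than the original one.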

The next cell defines our dataset column names and types and displays a few examples:

In [None]:
N_COLUMNS = 55
COLUMN_NAMES = [f"C{i}" for i in range(N_COLUMNS)]
COLUMN_TYPES = [tf.int64] * N_COLUMNS

covertype_df = pd.read_csv("../data/covertype.csv", names=COLUMN_NAMES)
covertype_df.head(2)

Let us look at how many examples we have at our disposal:

In [None]:
NUM_SAMPLES = len(covertype_df)
NUM_SAMPLES

Let us now load this dataset into `BigQuery` into the table named

```bash
PROJECT_ID.DATASET_ID.TABLE_ID
```

where `DATASET_ID` and `TABLE_ID` are defined in the next cell, along with other variables such as `DATASET_SCHEMA`:

In [None]:
DATASET_LOCATION = "US"
DATASET_SOURCE = "../data/covertype.csv"
DATASET_SCHEMA = ",".join([f"C{i}:INTEGER" for i in range(N_COLUMNS)])
DATASET_ID = "covertype_dataset_rl"
TABLE_ID = "covertypek_preproc"

os.environ["DATASET_LOCATION"] = DATASET_LOCATION
os.environ["DATASET_SOURCE"] = DATASET_SOURCE
os.environ["DATASET_SCHEMA"] = DATASET_SCHEMA
os.environ["DATASET_ID"] = DATASET_ID
os.environ["TABLE_ID"] = TABLE_ID

#### Exercise

In the cell below, run the [bq command-line tool](https://cloud.google.com/bigquery/docs/bq-command-line-tool) to create the dataset and populate the table from `DATASET_SOURCE` using the variables defined above:

In [None]:
%%bash

bq --location=$DATASET_LOCATION --project_id=$PROJECT_ID mk --dataset $DATASET_ID

bq --project_id=$PROJECT_ID --dataset_id=$DATASET_ID load \
--source_format=CSV \
--replace \
$TABLE_ID \
$DATASET_SOURCE \
$DATASET_SCHEMA

## Connecting to BigQuery

We will now create a `tf.data.Dataset` connected to the data table we created in our `BigQuery` instance, with which our TensorFlow Agents code will interact.

For that purpose, we will use [tensorflow_io](https://github.com/tensorflow/io/tree/v0.15.0/tensorflow_io/bigquery), which offers a `BigQueryClient` connector to stream data directly out of `BigQuery`:

```python
from tensorflow_io.bigquery import BigQueryClient
```

The first step is to create a `BigQuery` client and then a read session from it:

In [None]:
bq_client = BigQueryClient()

bq_session = bq_client.read_session(
    f"projects/{PROJECT_ID}",
    PROJECT_ID,
    TABLE_ID,
    DATASET_ID,
    COLUMN_NAMES,
    COLUMN_TYPES,
)

From our `bq_session` we can create a `tf.data.Dataset` using the `parallel_read_rows` method, which will read our BigQuery rows in parallel: 

In [None]:
tf_dataset = bq_session.parallel_read_rows(
    block_length=NUM_SAMPLES,
    num_parallel_calls=tf.data.experimental.AUTOTUNE,
)

At this point, each example is stored in our `tf_dataset` as an `OrderedDict` whose keys are the column names and whose values are the corresponding row values:

In [None]:
for example in tf_dataset.take(1):
    print(example)

#### Exercise

Configure the `tf_dataset` we instantiated so that
1. the examples are stored as pairs $(x, y)$ where $x$ is the feature vector with 54 components and $y$ is the label (**Hint:** Use `.map`)
1. it loops over the dataset indefinitely (**Hint:** Use `.repeat`)
1. it shuffles the dataset (Use `buffer_size=400000`)

In [None]:
LABEL_NAME = "C54"


def features_and_labels(features):
    label = features.pop(LABEL_NAME)
    return (
        tf.cast(tf.stack(tf.nest.flatten(features), axis=0), tf.float32),
        tf.cast(label - 1, tf.int32),
    )


tf_dataset = (
    tf_dataset.map(features_and_labels).repeat().shuffle(buffer_size=400000)
)

Verify that now the dataset has the correct form:

In [None]:
for example in tf_dataset.take(1):
    print(example)

## 3. Initializing and configuring the Environment

An environment in the TF-Agents Bandits library is a class that provides observations and reports rewards based on observations and actions.
In this section we instantiate the "covertype bandit environment".

In the TF-Agents Bandits library, there is an environment wrapper (named `ClassificationBanditEnvironment`) that can turn any multiclass labeled dataset into a bandit environment. The context (or observation) is the features in the dataset, the actions are the label classes, and the rewards are calculated based on some stochastic function of the actual and the guessed labels. This latter function is defined by a table of distributions. For our covertype example, this table is simply the deterministic identity matrix, which we define below after setting a few hyperparameters:

In [None]:
TIMESTAMP = time.strftime("%Y%m%d_%H%M%S")
ROOT_DIR = f"./contextual_bandit_checkpoints/{TIMESTAMP}"

BATCH_SIZE = 128
TRAINING_LOOPS = 10
STEPS_PER_LOOP = 2
AGENT_ALPHA = 10.0

EPSILON = 0.01
LAYERS = (300, 200, 100, 100, 50, 50)
LR = 0.002

AGENT_CHECKPOINT_NAME = "agent"
STEP_CHECKPOINT_NAME = "step"
CHECKPOINT_FILE_PREFIX = "ckpt"

In [None]:
# initialize the distribution
covertype_reward_distribution = tfd.Independent(
    tfd.Deterministic(tf.eye(7)), reinterpreted_batch_ndims=2
)

In [None]:
covertype_reward_distribution.sample()
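
To see how such a table turns a guess into a reward, here is a plain-Python sketch of the lookup. It assumes that rows are indexed by the true class and columns by the chosen action (with the identity table both conventions give the same result); `REWARD_TABLE` and `reward_for` are hypothetical names, not part of TF-Agents.

```python
NUM_CLASSES = 7

# Identity reward table: entry [label][action] is 1 only on the diagonal.
REWARD_TABLE = [
    [1.0 if action == label else 0.0 for action in range(NUM_CLASSES)]
    for label in range(NUM_CLASSES)
]


def reward_for(label, action):
    """Looks up the reward for guessing `action` when the true class is `label`."""
    return REWARD_TABLE[label][action]
```

A non-identity table would make it possible to reward some mistakes more than others, or to make the reward noisy by replacing each entry with a distribution.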

The `ClassificationBanditEnvironment` provides the interface that returns a reward for the action chosen given the features; `tf_dataset` provides the features and the labels.

In [None]:
# Initialize the classification bandit environment with the dataset and the reward distribution
environment = ClassificationBanditEnvironment(
    tf_dataset, covertype_reward_distribution, BATCH_SIZE
)

In [None]:
environment.reward_spec()

## 4. Initializing the Agent
Now that we have the environment initialized from the `tf.data.Dataset` loaded from BigQuery, we define and initialize our policy and the agent that will use that policy to make decisions given an observation. TF-Agents offers several bandit agents; here we use the following one:

   1. [NeuralEpsilonGreedyAgent](https://medium.com/analytics-vidhya/the-epsilon-greedy-algorithm-for-reinforcement-learning-5fe6f96dc870): The neural epsilon-greedy algorithm makes a value estimate for all the arms, then chooses the best arm with probability (1-epsilon) and a random arm with probability epsilon. This balances the exploration-exploitation tradeoff, and epsilon is usually set to a small value such as 10%. Example: here we have seven arms, one for each class. If we set epsilon to, say, 10%, then 90% of the time the agent will choose the arm with the highest value estimate (exploiting the one most likely to be the correct class), and 10% of the time it will choose a random arm among all 7 arms (thus exploring the other possibilities). Refer [here](https://www.tensorflow.org/agents/api_docs/python/tf_agents/bandits/agents/neural_epsilon_greedy_agent/NeuralEpsilonGreedyAgent) for more information on the TensorFlow Agents implementation.
   
   Each agent is initialized with a reward network, which is the function approximator (linear or nonlinear) that estimates the Q-values. The agent's policy uses this network and adds the exploration-exploitation component on top of it, and the agent then trains the network. In this example we will use a deep Q-network (`QNetwork`).
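
The selection rule described above can be sketched in plain Python as follows. This is a minimal illustration, not the TF-Agents implementation: `epsilon_greedy` and the value estimates are hypothetical, and in the real agent the estimates come from the reward network.

```python
import random


def epsilon_greedy(value_estimates, epsilon, rng):
    """Returns an arm index using the epsilon-greedy rule."""
    if rng.random() < epsilon:
        # Explore: pick uniformly among all arms (the best one included).
        return rng.randrange(len(value_estimates))
    # Exploit: pick the arm with the highest estimated value.
    return max(range(len(value_estimates)), key=value_estimates.__getitem__)


rng = random.Random(42)
# Hypothetical value estimates for the 7 arms; arm 2 looks best.
estimates = [0.1, 0.05, 0.7, 0.02, 0.03, 0.05, 0.05]
choices = [epsilon_greedy(estimates, 0.1, rng) for _ in range(10_000)]
greedy_fraction = choices.count(2) / len(choices)
```

Note that the best arm is chosen slightly more often than 90% of the time here, because the random draw can also land on it.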


In [None]:
network = QNetwork(
    input_tensor_spec=environment.time_step_spec().observation,
    action_spec=environment.action_spec(),
    fc_layer_params=LAYERS,
)

agent = NeuralEpsilonGreedyAgent(
    time_step_spec=environment.time_step_spec(),
    action_spec=environment.action_spec(),
    reward_network=network,
    optimizer=tf.compat.v1.train.AdamOptimizer(learning_rate=LR),
    epsilon=EPSILON,
)

## 5. Define and link the evaluation metrics


Just like you have metrics like accuracy/recall in supervised learning, in bandits we use the [regret](https://www.tensorflow.org/agents/tutorials/bandits_tutorial#regret_metric) metric per episode. To calculate the regret, we need to know what the highest possible expected reward is in every time step. For that, we define the `optimal_reward_fn`.

Another similar metric is the number of times a suboptimal action was chosen. That requires the definition of the `optimal_action_fn`.
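
As a rough illustration of what these two metrics measure, here is a plain-Python sketch on a hypothetical batch of four steps. The helpers `regret` and `suboptimal_pulls` are illustrative names only; the TF-Agents metrics are defined in the cells below.

```python
def regret(optimal_rewards, received_rewards):
    """Per-step regret: the optimal reward minus the reward actually received."""
    return [opt - got for opt, got in zip(optimal_rewards, received_rewards)]


def suboptimal_pulls(optimal_actions, chosen_actions):
    """Counts the steps in which the chosen arm was not the optimal one."""
    return sum(opt != act for opt, act in zip(optimal_actions, chosen_actions))


# Hypothetical batch of four steps. With the identity reward table,
# the optimal action is the true label and the optimal reward is always 1.
labels = [2, 0, 5, 1]
actions = [2, 3, 5, 6]
rewards = [1.0 if a == y else 0.0 for a, y in zip(actions, labels)]

step_regret = regret([1.0] * len(labels), rewards)
mean_regret = sum(step_regret) / len(step_regret)
num_suboptimal = suboptimal_pulls(labels, actions)
```

In this classification setting, the mean regret is one minus the accuracy of the agent on the batch.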

In [None]:
optimal_reward_fn = functools.partial(
    env_util.compute_optimal_reward_with_classification_environment,
    environment=environment,
)

optimal_action_fn = functools.partial(
    env_util.compute_optimal_action_with_classification_environment,
    environment=environment,
)

In [None]:
regret_metric = tf_bandit_metrics.RegretMetric(optimal_reward_fn)

suboptimal_arms_metric = tf_bandit_metrics.SuboptimalArmsMetric(
    optimal_action_fn
)

In [None]:
step_metric = tf_metrics.EnvironmentSteps()

metrics = [
    # equivalent to number of steps in bandits problem
    tf_metrics.NumberOfEpisodes(),
    # measures regret
    regret_metric,
    # number of times the suboptimal arms are pulled
    suboptimal_arms_metric,
    # the average return
    tf_metrics.AverageReturnMetric(batch_size=environment.batch_size),
]

## 6. Initialize & configure the Replay Buffer
Reinforcement learning algorithms use replay buffers to store trajectories of experience when executing a policy in an environment. During training, replay buffers are queried for a subset of the trajectories (either a sequential subset or a sample) to "replay" the agent's experience. Sampling from the replay buffer facilitates data reuse and breaks harmful correlations between sequential data in RL. In contextual bandits this is not strictly required, but it is still helpful.

The replay buffer exposes several functions which allow you to manipulate the replay buffer in several ways. Read more on them [here](https://www.tensorflow.org/agents/tutorials/5_replay_buffers_tutorial).

In this demo we will use the `TFUniformReplayBuffer`, which we need to initialize with the spec of the trajectories produced by the agent's policy, a chosen batch size (the number of trajectories to store), and the maximum length of a trajectory (the number of sequential time steps that are considered as one data point). So a batch of 3 with 2 time steps each results in a tensor of shape `(3,2)`. Since, unlike regular RL problems, contextual bandits have only one time step, we could keep `max_length = 1`. However, since this tutorial is also meant to prepare you for RL problems, let us set it to 2. Do not worry: any contextual bandit agent will internally split the time steps inside each data point, so that the effective batch shape ends up being `(6,1)`.
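
The reshaping from `(3,2)` to `(6,1)` can be pictured with nested Python lists. This is only an illustration of the shapes involved; `flatten_time` is a hypothetical helper, not a TF-Agents function.

```python
def flatten_time(batch):
    """Merges the batch and time axes: shape (B, T) becomes (B * T, 1)."""
    return [[step] for trajectory in batch for step in trajectory]


# Hypothetical buffer content: 3 trajectories of 2 time steps each.
batch = [["a0", "a1"], ["b0", "b1"], ["c0", "c1"]]
flat = flatten_time(batch)
```

Every time step becomes its own one-step data point, which matches the single-step nature of a contextual bandit.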

Create a TensorFlow-based `TFUniformReplayBuffer` and initialize it with appropriate values:

* `batch_size = 128`
* `max_length = 2` (2 time steps per item)

In [None]:
# solution
buf = TFUniformReplayBuffer(
    data_spec=agent.policy.trajectory_spec,
    batch_size=BATCH_SIZE,
    max_length=STEPS_PER_LOOP,
)

Now we have a replay buffer, but we also need something to fill it with. A common practice is to have the agent interact with the environment and collect experience from it without actually learning from it (i.e. forward passes only). You can either write this loop manually, as shown [here](https://www.tensorflow.org/agents/tutorials/6_reinforce_tutorial#training_the_agent), or use the `DynamicStepDriver`.
The data encountered by the driver at each step is saved in a `NamedTuple` called `Trajectory` and broadcast to a set of observers such as replay buffers and metrics.
This trajectory includes the observation from the environment, the action recommended by the policy, the reward obtained, the type of the current and the next step, etc.

In order for the driver to fill the replay buffer with data, as well as to compute ongoing metrics, it needs access to the `add_batch` method of the replay buffer and to the metrics. Have a look [here](https://www.tensorflow.org/agents/tutorials/5_replay_buffers_tutorial#data_collection) for more information and example code on how to initialize a step driver with observers.
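
The driver and observer mechanism can be sketched in plain Python as follows. Everything here is a simplified stand-in under stated assumptions: `Step` is a hypothetical, stripped-down version of `Trajectory`, and `run_driver` and `StepCounter` are illustrative names that do not exist in TF-Agents.

```python
from collections import namedtuple

# Hypothetical, stripped-down stand-in for the TF-Agents `Trajectory`.
Step = namedtuple("Step", ["observation", "action", "reward"])


class StepCounter:
    """Observer that counts the steps it has seen."""

    def __init__(self):
        self.count = 0

    def __call__(self, step):
        self.count += 1


def run_driver(contexts, labels, policy, observers):
    """Applies the policy to each context and broadcasts the result."""
    for context, label in zip(contexts, labels):
        action = policy(context)
        step = Step(context, action, 1.0 if action == label else 0.0)
        for observer in observers:
            observer(step)  # e.g. a buffer's add method or a metric


buffer = []
counter = StepCounter()
run_driver(
    contexts=[[0.2], [0.7], [0.9]],
    labels=[0, 1, 1],
    policy=lambda context: int(context[0] > 0.5),
    observers=[buffer.append, counter],
)
```

The design choice to note is that observers are plain callables, so a buffer's add method and a metric object can sit in the same list.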


In [None]:
# solution
# The driver sends every collected trajectory to these observers:
# the replay buffer, the step counter, and the metrics.
replay_observer = [buf.add_batch, step_metric] + metrics

driver = DynamicStepDriver(
    env=environment,
    policy=agent.collect_policy,
    num_steps=STEPS_PER_LOOP * environment.batch_size,
    observers=replay_observer,
)

Here we provide you with a helper function to save your agent, the metrics, and its lighter policy separately while training the model. We turn all of these into trackable objects and then use a checkpoint to save them as well as to warm-restart a previous training run. For more information on checkpoints and policy savers (which will be used in the training loop below), refer [here](https://www.tensorflow.org/agents/tutorials/10_checkpointer_policysaver_tutorial).

In [None]:
def restore_and_get_checkpoint_manager(root_dir, agent, metrics, step_metric):
    """Restores from `root_dir` and returns a function that writes checkpoints."""
    trackable_objects = {metric.name: metric for metric in metrics}
    trackable_objects[AGENT_CHECKPOINT_NAME] = agent
    trackable_objects[STEP_CHECKPOINT_NAME] = step_metric
    checkpoint = tf.train.Checkpoint(**trackable_objects)
    checkpoint_manager = tf.train.CheckpointManager(
        checkpoint=checkpoint, directory=root_dir, max_to_keep=5
    )
    latest = checkpoint_manager.latest_checkpoint

    if latest is not None:
        print(f"Restoring checkpoint from {latest}.")
        checkpoint.restore(latest)
        print(f"Successfully restored to step {step_metric.result()}.")
    else:
        print("Did not find a pre-existing checkpoint. Starting from scratch.")
    return checkpoint_manager

In [None]:
checkpoint_manager = restore_and_get_checkpoint_manager(
    ROOT_DIR, agent, metrics, step_metric
)
summary_writer = tf.summary.create_file_writer(ROOT_DIR)
summary_writer.set_as_default()

Now we have all the components ready to start training the model. Here is the training process:
1. We first use the `DynamicStepDriver` instance to collect experience (trajectories) from the environment and fill up the replay buffer.
2. We then extract all the stored experience from the replay buffer by specifying the batch size and `num_steps` we configured earlier. We extract it as a `tf.data.Dataset` instance.
3. We then iterate over the `tf.data.Dataset`; the first sample we draw actually contains all the data (`batch_size * num_time_steps`).
4. The agent then trains on the acquired experience.
5. The replay buffer is cleared to make space for new data.
6. We log the metrics and store them on disk.
7. We save the agent (via checkpoints) and, optionally, the policy.
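
Before running the real loop, the collect, train, and clear cycle listed above can be tried on a toy problem in plain Python. This sketch makes strong simplifying assumptions: contexts are integers whose correct arm is the context itself, and a table of running-mean estimates stands in for the reward network. None of the names come from TF-Agents.

```python
import random

NUM_ARMS = 3
rng = random.Random(0)

# Toy "reward network": a running-mean reward estimate per (context, arm).
values = [[0.0] * NUM_ARMS for _ in range(NUM_ARMS)]
counts = [[0] * NUM_ARMS for _ in range(NUM_ARMS)]


def act(context, epsilon=0.1):
    """Epsilon-greedy action from the current value estimates."""
    if rng.random() < epsilon:
        return rng.randrange(NUM_ARMS)
    return max(range(NUM_ARMS), key=values[context].__getitem__)


def train(experience):
    """Moves each visited estimate toward the observed reward."""
    for context, action, reward in experience:
        counts[context][action] += 1
        step = 1.0 / counts[context][action]
        values[context][action] += step * (reward - values[context][action])


buffer = []
history = []
for _ in range(50):  # training loops
    for _ in range(32):  # step 1: collect experience
        context = rng.randrange(NUM_ARMS)
        action = act(context)
        buffer.append((context, action, 1.0 if action == context else 0.0))
    experience = list(buffer)  # steps 2-3: read everything back
    train(experience)  # step 4: train on it
    buffer.clear()  # step 5: make room for new data
    # step 6: log the mean reward of this loop
    history.append(sum(r for _, _, r in experience) / len(experience))
```

The mean reward stored in `history` starts low, while the agent still guesses, and rises once exploration has found the rewarding arm for each context.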

In [None]:
# solution
import warnings

warnings.filterwarnings("ignore")
TRAINING_LOOPS = 150

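for _ in range(TRAINING_LOOPS):
    # 1. Collect experience and fill the replay buffer.
    driver.run()

    # 2. Read the whole buffer back as a dataset.
    dataset = buf.as_dataset(
        sample_batch_size=BATCH_SIZE,
        num_steps=STEPS_PER_LOOP,
        single_deterministic_pass=True,
    )

    # 3. The first element holds all the collected data.
    experience, unused_info = next(iter(dataset))

    # 4. Train the agent on the collected experience.
    train_loss = agent.train(experience).loss

    # 5. Clear the buffer to make room for new data.
    buf.clear()

    # 6. Log the metrics and write them as TensorBoard summaries.
    metric_utils.log_metrics(metrics)
    for metric in metrics:
        metric.tf_summaries(train_step=step_metric.result())

    # 7. Save a checkpoint of the agent, the metrics, and the step counter.
    checkpoint_manager.save()

    # saver.save(os.path.join(ROOT_DIR, "./", 'policy_%d' % step_metric.result()))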

Now that our model is trained, what if we want to determine which action to take given a new context? For that, we iterate over our dataset to get the next item and make a time step out of it by wrapping the result in a `ts.TimeStep`. It expects `step_type`, `reward`, `discount`, and `observation` as input. Since we are only performing prediction, you can fill in dummy values for the first three: only the observation (the context) is relevant. Read about how it works [here](https://www.tensorflow.org/agents/api_docs/python/tf_agents/trajectories/time_step/TimeStep), and perform the task below.

In [None]:
feature, label = next(iter(tf_dataset))

In [None]:
step = ts.TimeStep(
    tf.constant(ts.StepType.FIRST, dtype=tf.int32, shape=[1], name="step_type"),
    tf.constant(0.0, dtype=tf.float32, shape=[1], name="reward"),
    tf.constant(1.0, dtype=tf.float32, shape=[1], name="discount"),
    tf.constant(feature, dtype=tf.float32, shape=[1, 54], name="observation"),
)

agent.policy.action(step).action.numpy()

One final task: let us upload the TensorBoard logs to get an overview of the performance of our model. We will upload our logs to `tensorboard.dev`. To do so, copy the following command into a terminal and execute it from there, with `--logdir` pointing at the directory where your summaries were written (`ROOT_DIR` in this notebook). It will give you a link from which you need to copy and paste the authentication code. Once that is done, you will receive the URL of your model evaluation, hosted on a public [tensorboard.dev](https://tensorboard.dev/) instance.

In [None]:
!tensorboard dev upload --logdir /home/jupyter/tmp/quick_test/v7/ --name "(optional) My latest experiment" --description "(optional) Agent trained"