# Dataset Loader

The first part of the Open Bandit Pipeline (OBP) is the dataset loader. For the Open Bandit Dataset (OBD), the loader is `obp.dataset.OpenBanditDataset` ([docs](https://zr-obp.readthedocs.io/en/latest/_autosummary/obp.dataset.real.html#obp.dataset.real.OpenBanditDataset)).

Like many classes in OBP, the dataset modules are implemented as [dataclasses](https://docs.python.org/3.7/library/dataclasses.html).

The dataset module inherits from `obp.dataset.base.BaseRealBanditDataset` ([docs](https://zr-obp.readthedocs.io/en/latest/_autosummary/obp.dataset.base.html#module-obp.dataset.base)) and should implement three methods:
- `load_raw_data()`: Load an on-disk representation of the dataset into the module. Used during initialization.
- `pre_process()`: Perform any preprocessing needed to transform the raw data representation into a final representation.
- `obtain_batch_bandit_feedback()`: Return a dictionary containing (at least) keys: `["action","position","reward","pscore","context","n_rounds"]`

It is also helpful if the dataset module exposes a property `len_list`, which is how many items the bandit shows the user at a time. Often the answer is 1, though in the case of OBD it's 3.
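The interface described above can be sketched without OBP itself. The class below is a hypothetical toy: `ToyBanditDataset`, its synthetic logs, and the uniform random logging policy are all assumptions made for illustration and are not taken from OBP or from `MovieLensDataset`. It only shows how the three methods and `len_list` might fit together.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class ToyBanditDataset:
    """Hypothetical stand-in for a real bandit dataset class (not part of OBP)."""

    n_rounds: int = 6
    n_actions: int = 3
    dim_context: int = 2
    random_state: int = 0

    def __post_init__(self) -> None:
        # Mirror the load-then-preprocess flow used during initialization.
        self.load_raw_data()
        self.pre_process()

    @property
    def len_list(self) -> int:
        # One item is shown per round in this toy example.
        return 1

    def load_raw_data(self) -> None:
        # A real loader would read files from disk; here the logs are synthetic.
        rng = np.random.default_rng(self.random_state)
        self.context = rng.normal(size=(self.n_rounds, self.dim_context))
        self.action = rng.integers(self.n_actions, size=self.n_rounds)
        self.reward = rng.integers(2, size=self.n_rounds)

    def pre_process(self) -> None:
        # Assumed uniform random logging policy: propensity 1/|A| for every round.
        self.pscore = np.full(self.n_rounds, 1.0 / self.n_actions)
        self.position = np.zeros(self.n_rounds, dtype=int)

    def obtain_batch_bandit_feedback(self) -> dict:
        return dict(
            n_rounds=self.n_rounds,
            n_actions=self.n_actions,
            action=self.action,
            position=self.position,
            reward=self.reward,
            pscore=self.pscore,
            context=self.context,
        )


toy_feedback = ToyBanditDataset().obtain_batch_bandit_feedback()
```

Every array in the returned dictionary has one entry per logged round, which is what the estimators downstream rely on.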

In [3]:
import os

import obp
from obp_dataset import MovieLensDataset

In [4]:
dataset = MovieLensDataset(
    data_path=os.path.join(os.getcwd(), "data/"), 
    embedding_network_weights_path="model/pmf/emb_50_ratio_0.800000_bs_1000_e_258_wd_0.100000_lr_0.000100_trained_pmf.pt", 
    embedding_dim=50,
    users_num=943,
    items_num=1682
)

/data/CEIA/Rurax-Moblix/obp/data/ml-100k/ml-100k.zip

----- Finished data load
----- Preprocessing dataset
Finished preprocessing


In [5]:
bandit_feedback = dataset.obtain_batch_bandit_feedback()
print("feedback dict:")
for key, value in bandit_feedback.items():
    print(f"  {key}: {type(value)}")

feedback dict:
  n_rounds: <class 'int'>
  n_actions: <class 'int'>
  action: <class 'numpy.ndarray'>
  position: <class 'numpy.ndarray'>
  reward: <class 'numpy.ndarray'>
  pscore: <class 'numpy.ndarray'>
  context: <class 'numpy.ndarray'>
  action_context: <class 'numpy.ndarray'>


In [6]:
exp_rand_reward = round(bandit_feedback["reward"].mean(),4)
print(f"Expected reward for uniform random actions: {exp_rand_reward}")

Expected reward for uniform random actions: 0.5538


# Off-Policy Evaluation (OPE)

The next step is OPE, which estimates the performance of online bandit algorithms from the logged bandit feedback alone, without deploying them. The main estimator used here is the Replay Method (RM).
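To make the idea behind the Replay Method concrete, here is a minimal NumPy sketch. It keeps only the rounds where the evaluation policy would have chosen the same action that was logged and averages the rewards on those rounds. The function name `replay_estimate` and the toy arrays are illustrative assumptions; this is not OBP's `ReplayMethod` implementation.

```python
import numpy as np


def replay_estimate(logged_action, reward, eval_action):
    """Average the logged reward over rounds where both policies agree."""
    match = logged_action == eval_action
    if match.sum() == 0:
        # No overlapping rounds: nothing can be estimated from this log.
        return 0.0
    return float(reward[match].mean())


# Toy log of six rounds (invented values).
logged_action = np.array([0, 1, 2, 1, 0, 2])
reward = np.array([1, 0, 1, 1, 0, 0])
# Actions a hypothetical evaluation policy would have taken in the same rounds.
eval_action = np.array([0, 2, 2, 1, 1, 2])

rm_value = replay_estimate(logged_action, reward, eval_action)
print(rm_value)  # 0.75: four rounds match, three of them have reward 1
```

Because unmatched rounds are discarded, the estimate can be noisy when the evaluation policy rarely agrees with the log.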

In [7]:
import numpy as np
from sklearn.linear_model import LogisticRegression

import obp
from obp.policy import EpsilonGreedy, LinTS, LinUCB, Random
from obp.ope import (
    RegressionModel,
    OffPolicyEvaluation, 
    ReplayMethod,
    InverseProbabilityWeighting, 
    DirectMethod, 
    DoublyRobust
)

from simulator import run_bandit_simulation

SyntaxError: invalid syntax (simulator.py, line 105)

In [None]:
import pickle

with open("data/ml-100k/movies_groups.pkl", "rb") as pkl_file:
    movies_groups = pickle.load(pkl_file)


In [7]:
epsilon_greedy = EpsilonGreedy(
    n_actions=dataset.n_actions,
    epsilon=0.1,
)
epsilon_greedy_ = run_bandit_simulation(
    bandit_feedback=bandit_feedback,
    policy=epsilon_greedy,
    epochs=5,
    item_group=movies_groups,
    fairness_constraints=[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
)

100%|██████████| 100000/100000 [00:05<00:00, 16888.59it/s]
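The `run_bandit_simulation` helper used here is a custom function whose internals are not shown. As a rough illustration of what such a simulation typically returns, the sketch below replays a log with a context-free epsilon-greedy policy and records its choices as an `action_dist` array of shape `(n_rounds, n_actions, len_list)`. The function `simulate_epsilon_greedy`, the toy log, and the choice to update the policy only on matching rounds are assumptions; the fairness constraints and item groups of the real helper are not modeled.

```python
import numpy as np


def simulate_epsilon_greedy(action, reward, n_actions, epsilon=0.1, seed=0):
    """Replay a log with epsilon-greedy and return one-hot choices per round."""
    rng = np.random.default_rng(seed)
    n_rounds = len(action)
    counts = np.zeros(n_actions)
    reward_sums = np.zeros(n_actions)
    # action_dist[t, a, 0] is 1 if the policy chose action a in round t.
    action_dist = np.zeros((n_rounds, n_actions, 1))
    for t in range(n_rounds):
        if rng.random() < epsilon:
            chosen = int(rng.integers(n_actions))  # explore
        else:
            means = reward_sums / np.maximum(counts, 1)
            chosen = int(np.argmax(means))  # exploit the best empirical mean
        action_dist[t, chosen, 0] = 1.0
        # The reward is only observed when the choice matches the logged action.
        if chosen == action[t]:
            counts[chosen] += 1
            reward_sums[chosen] += reward[t]
    return action_dist


toy_action = np.array([0, 1, 2, 1, 0, 2, 1, 1])
toy_reward = np.array([1, 0, 1, 1, 0, 0, 1, 1])
toy_action_dist = simulate_epsilon_greedy(toy_action, toy_reward, n_actions=3)
```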


In [8]:
lin_ucb = LinUCB(
    dim=dataset.dim_context,
    n_actions=dataset.n_actions,
    epsilon=0.25
)
lin_ucb_ = run_bandit_simulation(
    bandit_feedback=bandit_feedback,
    policy=lin_ucb,
    epochs=5,
    item_group=movies_groups,
    fairness_constraints=[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
)

In [10]:
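# estimate the policy value of the online bandit algorithms using RM, DR, IPS, SNIPS, and DM
from obp.ope import SelfNormalizedInverseProbabilityWeighting  # not imported above

ope = OffPolicyEvaluation(
    bandit_feedback=bandit_feedback,
    ope_estimators=[
        ReplayMethod(),
        DoublyRobust(estimator_name="DR"),
        InverseProbabilityWeighting(estimator_name="IPS"),
        SelfNormalizedInverseProbabilityWeighting(estimator_name="SNIPS"),
        DirectMethod(estimator_name="DM"),
    ]
)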
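For intuition about the importance-weighting estimators, the following NumPy sketch computes IPS and its self-normalized variant (SNIPS) on a four-round toy log. The function names and numbers are illustrative assumptions, not OBP code: IPS reweights each logged reward by the ratio of the evaluation policy's probability to the logging propensity, and SNIPS divides by the sum of those weights instead of the number of rounds.

```python
import numpy as np


def ips_estimate(reward, pscore, eval_prob):
    """Inverse probability weighting: mean of w_t * r_t."""
    weights = eval_prob / pscore
    return float(np.mean(weights * reward))


def snips_estimate(reward, pscore, eval_prob):
    """Self-normalized IPS: weighted average of rewards with weights w_t."""
    weights = eval_prob / pscore
    return float(np.sum(weights * reward) / np.sum(weights))


# Toy log (invented values).
reward = np.array([1.0, 0.0, 1.0, 1.0])
pscore = np.array([0.5, 0.5, 0.25, 0.25])   # logging policy's P(a_t | x_t)
eval_prob = np.array([1.0, 0.5, 0.5, 0.0])  # evaluation policy's P(a_t | x_t)

ips_value = ips_estimate(reward, pscore, eval_prob)      # 1.0
snips_value = snips_estimate(reward, pscore, eval_prob)  # 0.8
```

The gap between the two values shows how normalizing by the weights pulls the estimate back into the range of the observed rewards.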

In [None]:
# obp.ope.RegressionModel
regression_model = RegressionModel(
    n_actions=dataset.n_actions, # number of actions; |A|
    len_list=dataset.len_list, # number of items in a recommendation list; K
    base_model=LogisticRegression(C=100, max_iter=100000), 
)

In [None]:
estimated_rewards = regression_model.fit_predict(
    context=bandit_feedback["context"],
    action=bandit_feedback["action"],
    reward=bandit_feedback["reward"],
    position=bandit_feedback["position"],
)
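The estimated rewards feed the model-based estimators. As a hedged sketch of how they are used, the code below computes the Direct Method (DM) and Doubly Robust (DR) values with plain NumPy. For simplicity it assumes a single position, so `q_hat` has shape `(n_rounds, n_actions)`; the function names and toy values are assumptions for illustration and not OBP's implementation.

```python
import numpy as np


def dm_estimate(q_hat, pi_e):
    """Direct Method: expected predicted reward under the evaluation policy."""
    return float(np.mean(np.sum(pi_e * q_hat, axis=1)))


def dr_estimate(q_hat, pi_e, action, reward, pscore):
    """Doubly Robust: DM baseline plus an importance-weighted residual."""
    rounds = np.arange(len(action))
    weights = pi_e[rounds, action] / pscore
    baseline = np.sum(pi_e * q_hat, axis=1)
    correction = weights * (reward - q_hat[rounds, action])
    return float(np.mean(baseline + correction))


# Toy inputs for three rounds and two actions (invented values).
q_hat = np.array([[0.2, 0.8], [0.5, 0.5], [0.9, 0.1]])  # predicted rewards
pi_e = np.array([[0.0, 1.0], [0.5, 0.5], [1.0, 0.0]])   # evaluation policy
action = np.array([1, 0, 0])                            # logged actions
reward = np.array([1.0, 0.0, 1.0])                      # logged rewards
pscore = np.array([0.5, 0.5, 0.5])                      # logging propensities

dm_value = dm_estimate(q_hat, pi_e)
dr_value = dr_estimate(q_hat, pi_e, action, reward, pscore)
```

DR stays close to DM when the regression model fits the logged rewards well, and leans on the importance-weighted correction when it does not.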

In [11]:
estimated_policy_value = ope.estimate_policy_values(
    action_dist=epsilon_greedy_, # \pi_e(a|x)
    estimated_rewards_by_reg_model=estimated_rewards, # \hat{q}
)
estimated_policy_value

    95.0% CI (lower)  95.0% CI (upper)      mean
rm          0.152148          0.425977  0.275859 
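The lower and upper columns in the table are confidence bounds around the estimate. One common way to obtain such bounds is a percentile bootstrap over per-round values, sketched below. Whether this matches exactly how the table above was produced is an assumption; `bootstrap_interval` and the toy values are hypothetical.

```python
import numpy as np


def bootstrap_interval(round_values, alpha=0.05, n_samples=1000, seed=0):
    """Percentile bootstrap interval for the mean of per-round values."""
    rng = np.random.default_rng(seed)
    n = len(round_values)
    # Resample rounds with replacement and record the mean of each resample.
    means = np.array([
        rng.choice(round_values, size=n, replace=True).mean()
        for _ in range(n_samples)
    ])
    return {
        "mean": float(means.mean()),
        "lower": float(np.percentile(means, 100 * alpha / 2)),
        "upper": float(np.percentile(means, 100 * (1 - alpha / 2))),
    }


# Toy per-round rewards on matched rounds (invented values).
toy_values = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0], dtype=float)
toy_interval = bootstrap_interval(toy_values)
```

With few matched rounds the interval is wide, which is consistent with the spread seen for `rm` above.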



In [None]:
estimated_policy_value = ope.estimate_policy_values(
    action_dist=lin_ucb_, # \pi_e(a|x)
    estimated_rewards_by_reg_model=estimated_rewards, # \hat{q}
)
estimated_policy_value