# Ranking Formulation 

### Setting 

Each user has a context $x \in \mathbb{R}^d$, and each candidate item has a context $a \in \mathbb{R}^d$. For each user $x$, there is a set of items $a_{1}, a_{2}, \ldots, a_{B}$ that can be recommended, and our goal is to choose a ranking policy that recommends the best subset of $b < B$ items. 

To accomplish this, we choose a ranker $w \in \{w_{1}, w_{2}, \ldots, w_{K}\}$ for each user $x$, where each $w_{i} \in \mathbb{R}^d$. Note that we assume a discrete set of candidate rankers to choose from, instead of continuously optimizing the ranking policy. Given a feature map $\phi: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^d$, a given $w$ selects a set of $b$ items by taking $A_{b, x} := \text{top} \; b \; \{w^\top \phi(x,a)\}$, i.e. the $b$ items with the highest scores $w^\top \phi(x,a)$. 

### Model 

To choose $w$, we first assume that the true model is a linear model with parameter $\theta \in \mathbb{R}^d$ such that $r(x,a; \theta) = \theta^\top \phi(x,a)$. In my initial model, I let $\phi(x,a) := x \odot a$, where $\odot$ is the element-wise product. This leads to the interpretation that each parameter $\theta$ or ranker $w_{i}$ supplies the coefficients of a weighted inner product $\langle x, a \rangle _{w_{i}} = \sum_{j=1}^{d} w_{i,j} \, x_{j} \, a_{j}$.
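As a minimal sketch of the scoring rule above, the snippet below uses plain NumPy (not the notebook's `aexgym` classes) to compute $w^\top (x \odot a)$ for every candidate item and keep the top $b$. The helper names `weighted_scores` and `top_b_items` are hypothetical and exist only for illustration; the context distribution mirrors the synthetic setup (mean 1, variance 0.5) but is otherwise an assumption.

```python
import numpy as np

def weighted_scores(x, items, w):
    """Score each item a as w^T (x * a), the weighted inner product <x, a>_w.

    x: (d,) user context; items: (B, d) item contexts; w: (d,) ranker weights.
    """
    return (items * x) @ w

def top_b_items(x, items, w, b):
    """Indices of the b highest-scoring items under ranker w, best first."""
    scores = weighted_scores(x, items, w)
    return np.argsort(-scores, kind="stable")[:b]

# Illustrative data only (hypothetical sizes and seed).
rng = np.random.default_rng(0)
d, B, b = 10, 10, 4
x = rng.normal(1.0, np.sqrt(0.5), size=d)
items = rng.normal(1.0, np.sqrt(0.5), size=(B, d))
w = rng.normal(size=d)

chosen = top_b_items(x, items, w, b)  # the set A_{b,x} for this ranker
```

Sorting the full score vector is fine at this scale; for large $B$ a partial selection such as `np.argpartition` would likely be cheaper.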

Assume that we are at the last stage $T$, where we have our estimate $\theta_{T}$. We would then choose a ranker as follows: 

- Take a context $x$. For each ranker $w_{i}$, calculate the set of actions $A_{b,x,i} = \text{top} \; b \; \{\langle x, a \rangle _{w_{i}}\}$ that would be taken if $w_{i}$ were chosen. 

- Given this set of actions $A_{b,x,i}$, calculate the fantasized reward under $\theta_{T}$ of choosing $A_{b,x,i}$: $r(x, w_{i}; \theta_{T}) = \sum_{a \in A_{b,x,i}} \langle x, a \rangle _{\theta_{T}}$  

- Finally, for each $x$, choose the ranker $w_{x} = \operatorname*{arg\,max}_{w_{i}} \; r(x, w_{i}; \theta_{T})$
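The three steps above can be sketched for a single user as follows. This is an illustration in plain NumPy under the assumption $\phi(x,a) = x \odot a$; the function name `choose_ranker` and the random data are hypothetical and are not part of `aexgym`.

```python
import numpy as np

def choose_ranker(x, items, rankers, theta, b):
    """Return (index of the best ranker, fantasized reward of every ranker).

    x: (d,) user context; items: (B, d); rankers: (K, d); theta: (d,) estimate.
    """
    feats = items * x                          # phi(x, a) = x * a, shape (B, d)
    ranker_scores = rankers @ feats.T          # (K, B): <x, a>_{w_i}
    # Step 1: A_{b,x,i}, the top-b item indices for each ranker.
    top_b = np.argsort(-ranker_scores, axis=1, kind="stable")[:, :b]
    # Step 2: fantasized reward of each set under theta.
    item_rewards = feats @ theta               # (B,): <x, a>_theta
    rewards = item_rewards[top_b].sum(axis=1)  # (K,)
    # Step 3: pick the ranker with the highest fantasized reward.
    return int(np.argmax(rewards)), rewards

# Illustrative data only (hypothetical sizes and seed).
rng = np.random.default_rng(0)
d, B, K, b = 10, 10, 5, 4
x = rng.normal(size=d)
items = rng.normal(size=(B, d))
theta = rng.normal(size=d)
rankers = rng.normal(size=(K, d))
rankers[2] = theta  # sanity check: a ranker equal to theta cannot be beaten

best, rewards = choose_ranker(x, items, rankers, theta, b)
```

Setting one ranker equal to $\theta$ is a useful sanity check: its top-$b$ set maximizes the summed reward over all size-$b$ subsets, so no other ranker can score higher.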

### Thompson Sampling 

We first calculate the sets $A_{b,x,i}$ that each ranker $w_{i}$ would choose. TS then draws a posterior sample $\hat{\theta}_{t} \sim N(\theta_{t}, \Sigma_{t})$, where $\theta_{t}$ and $\Sigma_{t}$ are the posterior mean and covariance at step $t$. Under $\hat{\theta}_{t}$, we calculate $r(x, w_{i}; \hat{\theta}_{t})$ for each ranker and choose the arg max. 
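A single-sample version of this procedure might look like the sketch below. It is written in plain NumPy with $\phi(x,a) = x \odot a$ and is not the `RankingTS` agent used later in the notebook (which takes `n_samples` and `toptwo` options whose behavior is not reproduced here); the name `thompson_choose_ranker` and the example posterior are assumptions made for illustration.

```python
import numpy as np

def thompson_choose_ranker(x, items, rankers, theta_mean, theta_cov, b, rng):
    """Pick a ranker for one user using a single posterior draw.

    Returns (chosen ranker index, sampled parameter theta_hat).
    """
    feats = items * x                                   # phi(x, a), shape (B, d)
    # Sets A_{b,x,i}: these depend only on the rankers, not on the sample.
    top_b = np.argsort(-(rankers @ feats.T), axis=1, kind="stable")[:, :b]
    # Posterior sample theta_hat ~ N(theta_mean, theta_cov).
    theta_hat = rng.multivariate_normal(theta_mean, theta_cov)
    # Fantasized reward of each ranker's set under the sample, then arg max.
    rewards = (feats @ theta_hat)[top_b].sum(axis=1)
    return int(np.argmax(rewards)), theta_hat

# Illustrative data only (hypothetical sizes, seed, and posterior).
rng = np.random.default_rng(0)
d, B, K, b = 10, 10, 5, 4
x = rng.normal(size=d)
items = rng.normal(size=(B, d))
rankers = rng.normal(size=(K, d))
theta_mean = np.zeros(d)
theta_cov = 0.1 * np.eye(d)

choice, theta_hat = thompson_choose_ranker(
    x, items, rankers, theta_mean, theta_cov, b, rng
)
```

Repeating the draw many times and counting how often each ranker wins gives an estimate of the per-ranker selection probabilities for that user.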

### Initial Observations with Synthetic Data 

RHO works relatively well under this ranking model. 

TS fails to outperform Uniform in nearly every setting I've tried. 

The gap between RHO and Uniform shrinks as the context dimension increases, which may be problematic because MIND embeddings are 100-dimensional and BERT embeddings are 768-dimensional. 

I don't know if a logistic map is necessary. 

In [1]:
%load_ext autoreload
%autoreload 2

import numpy as np
import os
import torch 
os.chdir("../..")

from aexgym.env import PersSyntheticEnv, RankingSyntheticEnv
from aexgym.model import PersonalizedLinearModel, PersonalizedRankingModel
from aexgym.agent import LinearTS, LinearUniform, LinearUCB, LinearRho, RankingUniform, RankingTS, RankingRho
from aexgym.objectives import contextual_best_arm, contextual_simple_regret
from scripts.setup_script import make_uniform_prior

In [2]:
n_days = 5
n_arms = 10
context_len = 10
n_steps = n_days 
batch_size = 100
s2 = 0.2 * torch.ones((n_days, 1))

n_items = 4
total_items = 10

if torch.cuda.is_available():
    device = 'cuda:1'
else:
    device = 'cpu'
print(device)


cuda:1


In [3]:
#personalization 

#initialize parameters
n_objs = 1
scaling = 1 / (batch_size*50)
pers_beta, pers_sigma = make_uniform_prior(context_len, scaling, n_objs=n_objs)
user_context_mu, user_context_var = torch.ones(context_len), 0.5*torch.eye(context_len)
item_context_mu, item_context_var = torch.ones(context_len), 0.5*torch.eye(context_len)


#initialize synthetic and agent model 
model = PersonalizedRankingModel(
    beta_0 = pers_beta, 
    sigma_0 = pers_sigma, 
    n_arms = n_arms, 
    s2 = s2,  
    n_objs=n_objs
)

#initialize synthetic environment
env = RankingSyntheticEnv(
    true_env = model,
    n_steps = n_steps,
    user_context_mu = user_context_mu, 
    user_context_var = user_context_var,
    item_context_mu = item_context_mu,
    item_context_var = item_context_var, 
    context_len = context_len, 
    batch_size = batch_size,
    n_arms = n_arms,
    n_items = n_items,
    total_items = total_items,
    device = device
)





In [4]:
#reset the environment and unpack the initial contexts
contexts, cur_step = env.reset()
state_contexts, action_contexts, eval_contexts = contexts 
user_contexts, item_contexts = state_contexts
n_items, ranking_contexts = action_contexts
print(user_contexts.shape, item_contexts.shape)
print(n_items, ranking_contexts)

torch.Size([100, 10]) torch.Size([100, 10, 10])
4 tensor([[ 1.2242,  0.4046, -0.5610,  0.8038,  0.1894, -0.3527,  1.1927,  1.2778,
          0.6996,  1.4482],
        [ 1.4256,  0.6969,  0.2864,  1.2346, -1.3946, -0.5059,  0.7187,  0.6895,
          1.4225,  0.5833],
        [ 1.0783,  0.0752,  0.9702,  2.1014,  1.8436,  1.1151,  0.0830,  0.3832,
          2.2211,  0.0084],
        [ 1.9807,  0.8198,  1.6593,  1.7063,  0.2288,  0.7508,  1.0968, -0.0670,
         -1.3840, -1.2977],
        [ 1.0812,  0.8957,  1.7140,  2.1685,  2.2265, -0.5535,  0.7209,  0.0237,
          2.0079, -0.2158],
        [ 0.0059,  0.8364,  1.2669,  0.8025,  0.4772, -0.3791, -0.8101, -0.6798,
          1.4521, -0.5376],
        [ 1.8977,  1.4114,  0.4934,  0.5317,  0.0398,  0.9299,  1.2554, -1.0875,
          2.5957,  1.4481],
        [-0.0471,  1.0671, -0.3254,  2.4931,  1.2227,  3.4607,  2.7104,  0.7407,
          3.1351,  1.6541],
        [ 1.0538,  1.0085, -2.5757, -0.0690,  1.8072,  0.7711,  0.8188, -0.

In [12]:
#initialize agent (uncomment exactly one line to switch agents)
#agent = RankingUniform(model, "Linear Uniform")
#agent = RankingTS(model, "Linear TS", toptwo=False, n_samples = 100)
#agent = RankingTS(model, "Linear TS", toptwo=True, n_samples = 100)
agent = RankingRho(model, "Linear Rho", lr=0.6, epochs = 10, weights = (1,0))

In [13]:
print_probs = False
torch.manual_seed(0)
objective = contextual_simple_regret()
objective.weights = (0, 1)
torch.set_printoptions(sci_mode=False)
regret_list = []
percent_arms_correct_list = []



for i in range(10000):
    torch.cuda.empty_cache()
    cumul_regret = 0
    env.reset()
    #print(env.mean_matrix)
    all_contexts, cur_step = env.reset()
    beta, sigma = agent.model.reset()
    #print(beta, sigma)
    beta, sigma = beta.to(device), sigma.to(device)
    while env.n_steps - cur_step > 0:

        #move to device 
        state_contexts, action_contexts, eval_contexts = all_contexts 
        state_contexts = tuple(contexts.to(device) for contexts in state_contexts)
        eval_contexts = tuple(contexts.to(device) for contexts in eval_contexts)
        action_contexts = (action_contexts[0], action_contexts[1].to(device))
        #train agent 

        if cur_step == 0:
            probs = torch.ones((batch_size, n_arms)).to(device) / n_arms
        else:
            agent.train_agent( 
                beta = beta, 
                sigma = sigma, 
                cur_step = cur_step, 
                n_steps = n_steps, 
                train_context_sampler = env.sample_train_contexts, 
                eval_contexts = eval_contexts,
                eval_action_contexts = action_contexts, 
                real_batch = batch_size, 
                print_losses=False, 
                objective=objective,
                repeats=10000
            )   

            #get probabilities
            probs = agent(
                beta = beta, 
                sigma = sigma, 
                contexts = state_contexts, 
                action_contexts = action_contexts, 
                objective = objective
            ) 
     
        #print probabilities 
        if print_probs:
            print(agent.name, env.n_steps - cur_step, probs)
        
        #get actions and move to new state
        actions = torch.distributions.Categorical(probs).sample()
        
        #move to next environment state 
        all_contexts, sampled_rewards, sampled_features, cur_step  = env.step(
            state_contexts = state_contexts, 
            action_contexts = action_contexts, 
            actions = actions
        )


        rewards = objective(
            agent_actions = actions,
            true_rewards = env.get_true_rewards(state_contexts, action_contexts)
        )

        cumul_regret += rewards['regret']
        
        #update model state 
        beta, sigma = agent.model.update_posterior(
            beta = beta, 
            sigma = sigma, 
            rewards = sampled_rewards, 
            features = sampled_features, 
            idx = cur_step-1
        )
    #get evaluation contexts and true rewards 
    eval_contexts = env.sample_eval_contexts(access=True)
    eval_contexts = tuple(contexts.to(device) for contexts in eval_contexts)
    true_eval_rewards = env.get_true_rewards(eval_contexts, action_contexts)
    fantasy_rewards = agent.fantasize(beta, eval_contexts, action_contexts).to(device)
    agent_actions = torch.argmax(fantasy_rewards.squeeze(), dim=1)
    #calculate results from objective
    #fantasy_rewards = torch.randn(fantasy_rewards.shape) 
    results_dict = objective(
        agent_actions = agent_actions, 
        true_rewards = true_eval_rewards.to(device)
    )

    results_dict['regret'] = 1 * cumul_regret / n_days + 0 * results_dict['regret']
    
    #append results 
    percent_arms_correct_list.append(results_dict['percent_arms_correct'])
    regret_list.append(results_dict['regret'])

    #print results 
    if i % 1 == 0:
        
        print(i, "Regret: ", np.mean(regret_list))
        print("Percent Arms Correct: ", np.mean(percent_arms_correct_list))

0 Regret:  0.03211196400225162
Percent Arms Correct:  0.47
1 Regret:  0.031562709342688317
Percent Arms Correct:  0.56
2 Regret:  0.03206083172311385
Percent Arms Correct:  0.57
3 Regret:  0.03032291629351675
Percent Arms Correct:  0.6025
4 Regret:  0.031569915302097795
Percent Arms Correct:  0.5820000000000001
5 Regret:  0.02999806455336511
Percent Arms Correct:  0.5816666666666667
6 Regret:  0.03098558742286903
Percent Arms Correct:  0.5828571428571429
7 Regret:  0.032211670966353266
Percent Arms Correct:  0.6075
8 Regret:  0.03265379703500205
Percent Arms Correct:  0.6033333333333334


KeyboardInterrupt: 
