# Replication Materials for the Torch-Choice Paper

> Author: Tianyu Du
> 
> Email: `tianyudu@stanford.edu`

This repository contains the replication materials for the paper "Torch-Choice: A Library for Choice Models in PyTorch". Because of space constraints, the paper omits some code and outputs; this repository contains the full version of the code mentioned in the paper.

In [1]:
import warnings
warnings.filterwarnings("ignore")

import random
from time import time
import numpy as np
import pandas as pd
import torch
import torch_choice
from torch_choice import run
from tqdm import tqdm
from torch_choice.data import ChoiceDataset, JointDataset, utils, load_mode_canada_dataset, load_house_cooling_dataset_v1
from torch_choice.model import ConditionalLogitModel, NestedLogitModel

In [2]:
# set the random seed to enforce reproducibility.
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.use_deterministic_algorithms(True)

In [3]:
torch_choice.__version__

'1.0.5'

# Data Structure

In [4]:
# You can download the data from our GitHub repository (URL below) and load it with the following code.
# https://raw.githubusercontent.com/gsbDBI/torch-choice/main/tutorials/public_datasets/car_choice.csv
car_choice = pd.read_csv("./torch_choice_paper_data/car_choice.csv")
car_choice.head()

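|   Unnamed: 0 |   record_id |   session_id |   consumer_id | car      |   purchase |   gender |    income |   speed |   discount |   price |
|-------------:|------------:|-------------:|--------------:|:---------|-----------:|---------:|----------:|--------:|-----------:|--------:|
|            0 |           1 |            1 |             1 | American |          1 |        1 | 46.699997 |      10 |       0.94 |      90 |
|            1 |           1 |            1 |             1 | Japanese |          0 |        1 | 46.699997 |       8 |       0.94 |     110 |
|            2 |           1 |            1 |             1 | European |          0 |        1 | 46.699997 |       7 |       0.94 |      50 |
|            3 |           1 |            1 |             1 | Korean   |          0 |        1 | 46.699997 |       8 |       0.94 |      10 |
|            4 |           2 |            2 |             2 | American |          1 |        1 | 26.1      |      10 |       0.95 |     100 |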


### Adding Observables, Method 1: Observables Derived from Columns of the Main Dataset

In [5]:
user_observable_columns=["gender", "income"]
from torch_choice.utils.easy_data_wrapper import EasyDatasetWrapper
data_wrapper_from_columns = EasyDatasetWrapper(
    main_data=car_choice,
    purchase_record_column='record_id',
    choice_column='purchase',
    item_name_column='car',
    user_index_column='consumer_id',
    session_index_column='session_id',
    user_observable_columns=['gender', 'income'],
    item_observable_columns=['speed'],
    session_observable_columns=['discount'],
    itemsession_observable_columns=['price'])

data_wrapper_from_columns.summary()
dataset = data_wrapper_from_columns.choice_dataset
# ChoiceDataset(label=[], item_index=[885], provided_num_items=[], user_index=[885], session_index=[885], item_availability=[885, 4], item_speed=[4, 1], user_gender=[885, 1], user_income=[885, 1], session_discount=[885, 1], itemsession_price=[885, 4, 1], device=cpu)

Creating choice dataset from stata format data-frames...
Note: choice sets of different sizes found in different purchase records: {'size 4': 'occurrence 505', 'size 3': 'occurrence 380'}
Finished Creating Choice Dataset.
* purchase record index range: [1 2 3] ... [883 884 885]
* Space of 4 items:
                   0         1         2       3
item name  American  European  Japanese  Korean
* Number of purchase records/cases: 885.
* Preview of main data frame:
      record_id  session_id  consumer_id       car  purchase  gender  \
0             1           1            1  American         1       1   
1             1           1            1  Japanese         0       1   
2             1           1            1  European         0       1   
3             1           1            1    Korean         0       1   
4             2           2            2  American         1       1   
...         ...         ...          ...       ...       ...     ...   
3155        884         884  

### Adding Observables, Method 2: Added as Separate DataFrames

In [6]:
# create dataframes for gender and income. The dataframe for user-specific observable needs to have the `consumer_id` column.
gender = car_choice.groupby('consumer_id')['gender'].first().reset_index()
income = car_choice.groupby('consumer_id')['income'].first().reset_index()
# alternatively, put gender and income in the same dataframe.
gender_and_income = car_choice.groupby('consumer_id')[['gender', 'income']].first().reset_index()
# speed as item observable, the dataframe requires a `car` column.
speed = car_choice.groupby('car')['speed'].first().reset_index()
# discount as session observable. the dataframe requires a `session_id` column.
discount = car_choice.groupby('session_id')['discount'].first().reset_index()
# create the price as itemsession observable, the dataframe requires both `car` and `session_id` columns.
price = car_choice[['car', 'session_id', 'price']]
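# insert NaN for every (session, item) pair in which the item was not available.
# keyword arguments are used because recent pandas versions no longer accept positional arguments in `pivot`.
price = price.pivot(index='car', columns='session_id', values='price').melt(ignore_index=False).reset_index()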
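
To see why the `pivot` followed by `melt` step matters, here is a minimal sketch on a toy frame (the values are made up for illustration and are not taken from `car_choice.csv`). Pivoting to a wide item-by-session grid and melting back yields one row for every (item, session) pair, with `NaN` wherever an item was not offered in a session.

```python
import pandas as pd

# Toy long-format prices: the Korean car is missing from session 2 (hypothetical data).
toy = pd.DataFrame({
    "car": ["American", "Korean", "American"],
    "session_id": [1, 1, 2],
    "price": [90.0, 10.0, 100.0],
})

# Wide grid: one row per car, one column per session; missing pairs become NaN.
wide = toy.pivot(index="car", columns="session_id", values="price")

# Back to long format, now with a row for every (car, session) pair.
full = wide.melt(ignore_index=False).reset_index()
print(full)
```

The toy frame starts with three rows and ends with four, one of which carries a missing price for the pair that never appeared in the input.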

In [7]:
data_wrapper_from_dataframes = EasyDatasetWrapper(
    main_data=car_choice,
    purchase_record_column='record_id',
    choice_column='purchase',
    item_name_column='car',
    user_index_column='consumer_id',
    session_index_column='session_id',
    user_observable_data={'gender': gender, 'income': income},
    # alternatively, supply gender and income as a single dataframe.
    # user_observable_data={'gender_and_income': gender_and_income},
    item_observable_data={'speed': speed},
    session_observable_data={'discount': discount},
    itemsession_observable_data={'price': price})

# the second method creates exactly the same ChoiceDataset as the previous method.
assert data_wrapper_from_dataframes.choice_dataset == data_wrapper_from_columns.choice_dataset

Creating choice dataset from stata format data-frames...
Note: choice sets of different sizes found in different purchase records: {'size 4': 'occurrence 505', 'size 3': 'occurrence 380'}
Finished Creating Choice Dataset.


In [8]:
data_wrapper_mixed = EasyDatasetWrapper(
    main_data=car_choice,
    purchase_record_column='record_id',
    choice_column='purchase',
    item_name_column='car',
    user_index_column='consumer_id',
    session_index_column='session_id',
    user_observable_data={'gender': gender, 'income': income},
    item_observable_data={'speed': speed},
    session_observable_data={'discount': discount},
    itemsession_observable_columns=['price'])

# these methods create exactly the same choice dataset.
assert data_wrapper_mixed.choice_dataset == data_wrapper_from_columns.choice_dataset == data_wrapper_from_dataframes.choice_dataset

Creating choice dataset from stata format data-frames...
Note: choice sets of different sizes found in different purchase records: {'size 4': 'occurrence 505', 'size 3': 'occurrence 380'}
Finished Creating Choice Dataset.


## Constructing a Choice Dataset, Method 2: Building from Tensors

In [9]:
N = 10_000
num_users = 10
num_items = 4
num_sessions = 500


user_obs = torch.randn(num_users, 128)
item_obs = torch.randn(num_items, 64)
useritem_obs = torch.randn(num_users, num_items, 32)
session_obs = torch.randn(num_sessions, 10)
itemsession_obs = torch.randn(num_sessions, num_items, 12)
usersession_obs = torch.randn(num_users, num_sessions, 10)
usersessionitem_obs = torch.randn(num_users, num_sessions, num_items, 8)

item_index = torch.LongTensor(np.random.choice(num_items, size=N))
user_index = torch.LongTensor(np.random.choice(num_users, size=N))
session_index = torch.LongTensor(np.random.choice(num_sessions, size=N))
item_availability = torch.ones(num_sessions, num_items).bool()

dataset = ChoiceDataset(
    # pre-specified keywords of __init__
    item_index=item_index,  # required.
    num_items=num_items,
    # optional:
    user_index=user_index,
    num_users=num_users,
    session_index=session_index,
    item_availability=item_availability,
    # additional keywords of __init__
    user_obs=user_obs,
    item_obs=item_obs,
    session_obs=session_obs,
    itemsession_obs=itemsession_obs,
    useritem_obs=useritem_obs,
    usersession_obs=usersession_obs,
    usersessionitem_obs=usersessionitem_obs)

In [10]:
print(dataset)

ChoiceDataset(num_items=4, num_users=10, num_sessions=500, label=[], item_index=[10000], user_index=[10000], session_index=[10000], item_availability=[500, 4], user_obs=[10, 128], item_obs=[4, 64], session_obs=[500, 10], itemsession_obs=[500, 4, 12], useritem_obs=[10, 4, 32], usersession_obs=[10, 500, 10], usersessionitem_obs=[10, 500, 4, 8], device=cpu)
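
The observable tensors in the cell above follow a naming pattern: the prefix before `_obs` indicates which indices the leading dimensions range over. The helper below is a hypothetical illustration (it is not part of torch-choice) that reproduces the shapes printed above; the prefix-to-dimension table is read off this example rather than taken from the library's documentation.

```python
# Leading dimensions implied by each prefix, read off the shapes printed above.
# `LEADING_DIMS` and `expected_shape` are hypothetical names used for illustration.
LEADING_DIMS = {
    "user": ("num_users",),
    "item": ("num_items",),
    "session": ("num_sessions",),
    "useritem": ("num_users", "num_items"),
    "itemsession": ("num_sessions", "num_items"),
    "usersession": ("num_users", "num_sessions"),
    "usersessionitem": ("num_users", "num_sessions", "num_items"),
}

def expected_shape(name, num_params, sizes):
    """Return the tensor shape expected for an observable called `name`."""
    prefix = name.split("_")[0]
    return tuple(sizes[d] for d in LEADING_DIMS[prefix]) + (num_params,)

sizes = {"num_users": 10, "num_items": 4, "num_sessions": 500}
print(expected_shape("itemsession_obs", 12, sizes))
print(expected_shape("usersessionitem_obs", 8, sizes))
```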


## Functionalities of the Choice Dataset

In [11]:
print(f'{dataset.num_users=:}')
# dataset.num_users=10
print(f'{dataset.num_items=:}')
# dataset.num_items=4
print(f'{dataset.num_sessions=:}')
# dataset.num_sessions=500
print(f'{len(dataset)=:}')
# len(dataset)=10000

dataset.num_users=10
dataset.num_items=4
dataset.num_sessions=500
len(dataset)=10000


In [12]:
# clone
print(dataset.item_index[:10])
# tensor([2, 3, 0, 2, 2, 3, 0, 0, 2, 1])
dataset_cloned = dataset.clone()
# modify the cloned dataset; the replacement has one entry per choice record.
dataset_cloned.item_index = 99 * torch.ones(len(dataset))
print(dataset_cloned.item_index[:10])
# the cloned dataset is changed.
# tensor([99., 99., 99., 99., 99., 99., 99., 99., 99., 99.])
print(dataset.item_index[:10])
# the original dataset does not change.
# tensor([2, 3, 0, 2, 2, 3, 0, 0, 2, 1])

tensor([2, 3, 0, 2, 2, 3, 0, 0, 2, 1])
tensor([99., 99., 99., 99., 99., 99., 99., 99., 99., 99.])
tensor([2, 3, 0, 2, 2, 3, 0, 0, 2, 1])


In [13]:
# move to device
print(f'{dataset.device=:}')
# dataset.device=cpu
print(f'{dataset.device=:}')
# dataset.device=cpu
print(f'{dataset.user_index.device=:}')
# dataset.user_index.device=cpu
print(f'{dataset.session_index.device=:}')
# dataset.session_index.device=cpu


if torch.cuda.is_available():
    # note: this branch only runs on a machine with a CUDA-capable GPU.
    dataset = dataset.to('cuda')

    print(f'{dataset.device=:}')
    # dataset.device=cuda:0
    print(f'{dataset.item_index.device=:}')
    # dataset.item_index.device=cuda:0
    print(f'{dataset.user_index.device=:}')
    # dataset.user_index.device=cuda:0
    print(f'{dataset.session_index.device=:}')
    # dataset.session_index.device=cuda:0

    dataset._check_device_consistency()

dataset.device=cpu
dataset.device=cpu
dataset.user_index.device=cpu
dataset.session_index.device=cpu
dataset.device=cuda:0
dataset.item_index.device=cuda:0
dataset.user_index.device=cuda:0
dataset.session_index.device=cuda:0


In [14]:
def print_dict_shape(d):
    for key, val in d.items():
        if torch.is_tensor(val):
            print(f'dict.{key}.shape={val.shape}')
print_dict_shape(dataset.x_dict)

dict.user_obs.shape=torch.Size([10000, 4, 128])
dict.item_obs.shape=torch.Size([10000, 4, 64])
dict.session_obs.shape=torch.Size([10000, 4, 10])
dict.itemsession_obs.shape=torch.Size([10000, 4, 12])
dict.useritem_obs.shape=torch.Size([10000, 4, 32])
dict.usersession_obs.shape=torch.Size([10000, 4, 10])
dict.usersessionitem_obs.shape=torch.Size([10000, 4, 8])
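
The shapes above suggest that `x_dict` expands every observable to `(len(dataset), num_items, num_params)`. The NumPy sketch below illustrates that idea under the assumption that user- and session-level observables are gathered with `user_index` and `session_index` and then repeated across items. It is not the library's implementation, and it uses much smaller sizes to stay light.

```python
import numpy as np

rng = np.random.default_rng(0)
N, num_users, num_items, num_sessions = 6, 3, 4, 5

user_obs = rng.normal(size=(num_users, 2))
item_obs = rng.normal(size=(num_items, 3))
itemsession_obs = rng.normal(size=(num_sessions, num_items, 2))
user_index = rng.integers(num_users, size=N)
session_index = rng.integers(num_sessions, size=N)

# User-level: gather one row per record, then repeat it across the item axis.
user_x = np.broadcast_to(user_obs[user_index][:, None, :], (N, num_items, 2))
# Item-level: identical for every record, so repeat it across the record axis.
item_x = np.broadcast_to(item_obs[None, :, :], (N, num_items, 3))
# (session, item)-level: gathering by session already yields one row per item.
itemsession_x = itemsession_obs[session_index]

print(user_x.shape, item_x.shape, itemsession_x.shape)
```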


In [15]:
# __getitem__ to get batch.
# pick 5 random choice records as the mini-batch.
dataset = dataset.to('cpu')
indices = torch.Tensor(np.random.choice(len(dataset), size=5, replace=False)).long()
print(indices)
# tensor([7119, 9650, 5466, 1073, 8419])
subset = dataset[indices]
print(dataset)
# ChoiceDataset(num_items=4, num_users=10, num_sessions=500, label=[], item_index=[10000], user_index=[10000], session_index=[10000], item_availability=[500, 4], user_obs=[10, 128], item_obs=[4, 64], session_obs=[500, 10], itemsession_obs=[500, 4, 12], useritem_obs=[10, 4, 32], usersession_obs=[10, 500, 10], usersessionitem_obs=[10, 500, 4, 8], device=cpu)
print(subset)
# ChoiceDataset(num_items=4, num_users=10, num_sessions=500, label=[], item_index=[5], user_index=[5], session_index=[5], item_availability=[500, 4], user_obs=[10, 128], item_obs=[4, 64], session_obs=[500, 10], itemsession_obs=[500, 4, 12], useritem_obs=[10, 4, 32], usersession_obs=[10, 500, 10], usersessionitem_obs=[10, 500, 4, 8], device=cpu)

tensor([7119, 9650, 5466, 1073, 8419])
ChoiceDataset(num_items=4, num_users=10, num_sessions=500, label=[], item_index=[10000], user_index=[10000], session_index=[10000], item_availability=[500, 4], user_obs=[10, 128], item_obs=[4, 64], session_obs=[500, 10], itemsession_obs=[500, 4, 12], useritem_obs=[10, 4, 32], usersession_obs=[10, 500, 10], usersessionitem_obs=[10, 500, 4, 8], device=cpu)
ChoiceDataset(num_items=4, num_users=10, num_sessions=500, label=[], item_index=[5], user_index=[5], session_index=[5], item_availability=[500, 4], user_obs=[10, 128], item_obs=[4, 64], session_obs=[500, 10], itemsession_obs=[500, 4, 12], useritem_obs=[10, 4, 32], usersession_obs=[10, 500, 10], usersessionitem_obs=[10, 500, 4, 8], device=cpu)


In [16]:
print(subset.item_index)
# tensor([2, 2, 1, 2, 3])
print(dataset.item_index[indices])
# tensor([2, 2, 1, 2, 3])

subset.item_index += 1  # modifying the batch does not change the original dataset.

print(subset.item_index)
# tensor([3, 3, 2, 3, 4])
print(dataset.item_index[indices])
# tensor([2, 2, 1, 2, 3])

tensor([2, 2, 1, 2, 3])
tensor([2, 2, 1, 2, 3])
tensor([3, 3, 2, 3, 4])
tensor([2, 2, 1, 2, 3])


In [17]:
print(subset.item_obs[0, 0])
# tensor(-0.2949)
print(dataset.item_obs[0, 0])
# tensor(-0.2949)

subset.item_obs += 1  # the subset holds its own copy of the observables.
print(subset.item_obs[0, 0])
# tensor(0.7051)
print(dataset.item_obs[0, 0])
# tensor(-0.2949)

tensor(-0.2949)
tensor(-0.2949)
tensor(0.7051)
tensor(-0.2949)


In [18]:
print(id(subset.item_index))
# 140339656298640
print(id(dataset.item_index[indices]))
# 140339656150528
# these two are different objects in memory.

139976280433184
139975728990704
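
Comparing `id` values only shows that the two Python objects differ; whether they share storage is the more informative question. The NumPy sketch below illustrates the general rule that indexing with an array of integers returns a copy while slicing returns a view. PyTorch follows the same convention for index tensors, but the example is only an analogy and does not inspect torch-choice internals.

```python
import numpy as np

item_index = np.array([2, 3, 0, 2, 2, 3, 0, 0, 2, 1])
indices = np.array([7, 1, 4])

picked = item_index[indices]   # integer-array indexing returns a copy
sliced = item_index[1:4]       # basic slicing returns a view

print(np.shares_memory(item_index, picked))
print(np.shares_memory(item_index, sliced))

picked += 1                    # modifies the copy only
print(item_index[indices])
```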


## Chaining Multiple Datasets with JointDataset

In [19]:
item_level_dataset = dataset.clone()
nest_level_dataset = dataset.clone()
joint_dataset = JointDataset(
    item=item_level_dataset,
    nest=nest_level_dataset)

print(joint_dataset)

JointDataset with 2 sub-datasets: (
	item: ChoiceDataset(num_items=4, num_users=10, num_sessions=500, label=[], item_index=[10000], user_index=[10000], session_index=[10000], item_availability=[500, 4], user_obs=[10, 128], item_obs=[4, 64], session_obs=[500, 10], itemsession_obs=[500, 4, 12], useritem_obs=[10, 4, 32], usersession_obs=[10, 500, 10], usersessionitem_obs=[10, 500, 4, 8], device=cpu)
	nest: ChoiceDataset(num_items=4, num_users=10, num_sessions=500, label=[], item_index=[10000], user_index=[10000], session_index=[10000], item_availability=[500, 4], user_obs=[10, 128], item_obs=[4, 64], session_obs=[500, 10], itemsession_obs=[500, 4, 12], useritem_obs=[10, 4, 32], usersession_obs=[10, 500, 10], usersessionitem_obs=[10, 500, 4, 8], device=cpu)
)
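
The printout suggests that a joint dataset holds several named sub-datasets of equal length. The toy class below sketches that idea under the assumption that the same indices are applied to every sub-dataset and the results are returned keyed by name. `ToyJointDataset` is a hypothetical stand-in built on plain lists, not the `JointDataset` class itself.

```python
class ToyJointDataset:
    """Index several equally long datasets with the same indices (illustrative only)."""

    def __init__(self, **datasets):
        lengths = {len(d) for d in datasets.values()}
        assert len(lengths) == 1, "all sub-datasets must have the same length"
        self.datasets = datasets

    def __len__(self):
        return len(next(iter(self.datasets.values())))

    def __getitem__(self, indices):
        # Apply the same indices to every sub-dataset and return them by name.
        return {name: [d[i] for i in indices] for name, d in self.datasets.items()}

joint = ToyJointDataset(item=list(range(100, 110)), nest=list(range(200, 210)))
batch = joint[[0, 3, 5]]
print(len(joint), batch)
```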


In [20]:
from torch.utils.data.sampler import BatchSampler, SequentialSampler, RandomSampler
shuffle = False  # for demonstration purpose.
batch_size = 32

# Create sampler.
sampler = BatchSampler(
    RandomSampler(dataset) if shuffle else SequentialSampler(dataset),
    batch_size=batch_size,
    drop_last=False)

dataloader = torch.utils.data.DataLoader(dataset,
                                         sampler=sampler,
                                         collate_fn=lambda x: x[0],
                                         pin_memory=(dataset.device == 'cpu'))

In [21]:
print(f'{item_obs.shape=:}')
# item_obs.shape=torch.Size([4, 64])
item_obs_all = item_obs.view(1, num_items, -1).expand(len(dataset), -1, -1)
item_obs_all = item_obs_all.to(dataset.device)
item_index_all = item_index.to(dataset.device)
print(f'{item_obs_all.shape=:}')
# item_obs_all.shape=torch.Size([10000, 4, 64])

item_obs.shape=torch.Size([4, 64])
item_obs_all.shape=torch.Size([10000, 4, 64])


In [22]:
for i, batch in enumerate(dataloader):
    first, last = i * batch_size, min(len(dataset), (i + 1) * batch_size)
    idx = torch.arange(first, last)
    assert torch.all(item_obs_all[idx, :, :] == batch.x_dict['item_obs'])
    assert torch.all(item_index_all[idx] == batch.item_index)

In [23]:
batch.x_dict['item_obs'].shape
# torch.Size([16, 4, 64])

torch.Size([16, 4, 64])
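
The final batch has 16 rows rather than 32 because 10,000 is not a multiple of the batch size and `drop_last=False` keeps the remainder. The pure-Python sketch below mimics sequential batching of indices to make the arithmetic explicit; `sequential_batches` is an illustrative helper, not a PyTorch function.

```python
def sequential_batches(n, batch_size, drop_last=False):
    """Yield consecutive index lists, like a sequential sampler wrapped in a batch sampler."""
    for start in range(0, n, batch_size):
        batch = list(range(start, min(start + batch_size, n)))
        if len(batch) < batch_size and drop_last:
            return
        yield batch

batches = list(sequential_batches(10_000, 32))
print(len(batches), len(batches[0]), len(batches[-1]))
```

Because the batch sampler is passed as `sampler`, each element the loader draws is already a list of indices, so `dataset[indices]` returns a whole batch. The loader then wraps that batch in a one-element list, which is why the cell above uses `collate_fn=lambda x: x[0]`.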

In [24]:
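print_dict_shape(dataset.x_dict)
# dict.user_obs.shape=torch.Size([10000, 4, 128])
# dict.item_obs.shape=torch.Size([10000, 4, 64])
# dict.session_obs.shape=torch.Size([10000, 4, 10])
# dict.itemsession_obs.shape=torch.Size([10000, 4, 12])
# dict.useritem_obs.shape=torch.Size([10000, 4, 32])
# dict.usersession_obs.shape=torch.Size([10000, 4, 10])
# dict.usersessionitem_obs.shape=torch.Size([10000, 4, 8])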

dict.user_obs.shape=torch.Size([10000, 4, 128])
dict.item_obs.shape=torch.Size([10000, 4, 64])
dict.session_obs.shape=torch.Size([10000, 4, 10])
dict.itemsession_obs.shape=torch.Size([10000, 4, 12])
dict.useritem_obs.shape=torch.Size([10000, 4, 32])
dict.usersession_obs.shape=torch.Size([10000, 4, 10])
dict.usersessionitem_obs.shape=torch.Size([10000, 4, 8])


In [25]:
dataset.__len__()
# 10000

10000

# Conditional Logit Model

In [26]:
dataset = load_mode_canada_dataset()

No `session_index` is provided, assume each choice instance is in its own session.


In [27]:
dataset

ChoiceDataset(num_items=4, num_users=1, num_sessions=2779, label=[], item_index=[2779], user_index=[], session_index=[2779], item_availability=[], itemsession_cost_freq_ovt=[2779, 4, 3], session_income=[2779, 1], itemsession_ivt=[2779, 4, 1], device=cpu)

In [28]:
model = ConditionalLogitModel(
    formula='(itemsession_cost_freq_ovt|constant) + (session_income|item) + (itemsession_ivt|item-full) + (intercept|item)',
    dataset=dataset,
    num_items=4)
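
Each `(observable|variation)` term in the formula names an observable and states how its coefficient varies. The sketch below uses a hypothetical `parse_formula` helper (not torch-choice's own parser) to show how the formula string corresponds to the `coef_variation_dict` used in the next cell.

```python
import re

def parse_formula(formula):
    """Map each '(observable|variation)' term to an {observable: variation} entry."""
    pattern = r"\(\s*([^|()\s]+)\s*\|\s*([^|()\s]+)\s*\)"
    return dict(re.findall(pattern, formula))

formula = ('(itemsession_cost_freq_ovt|constant) + (session_income|item) '
           '+ (itemsession_ivt|item-full) + (intercept|item)')
spec = parse_formula(formula)
print(spec)
```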

In [29]:
model = ConditionalLogitModel(
    coef_variation_dict={'itemsession_cost_freq_ovt': 'constant',
                         'session_income': 'item',
                         'itemsession_ivt': 'item-full',
                         'intercept': 'item'},
    num_param_dict={'itemsession_cost_freq_ovt': 3,
                    'session_income': 1,
                    'itemsession_ivt': 1,
                    'intercept': 1},
    num_items=4)
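
Together, the two dictionaries determine how many parameters are estimated. The counting rule below is inferred from the model summary printed after training in this section (3, 3, 4, and 3 trainable parameters, 13 in total). `count_trainable` is an illustrative helper, not a torch-choice function, and it covers only the three variation types used here.

```python
def count_trainable(variation, num_params, num_items):
    """Trainable-parameter count for one coefficient, matching this section's summary."""
    if variation == "constant":
        return num_params                    # one set shared by all items
    if variation == "item":
        return num_params * (num_items - 1)  # consistent with one item acting as the baseline
    if variation == "item-full":
        return num_params * num_items        # every item gets its own coefficient
    raise ValueError(f"variation not covered by this sketch: {variation}")

coef_variation_dict = {"itemsession_cost_freq_ovt": "constant",
                       "session_income": "item",
                       "itemsession_ivt": "item-full",
                       "intercept": "item"}
num_param_dict = {"itemsession_cost_freq_ovt": 3,
                  "session_income": 1,
                  "itemsession_ivt": 1,
                  "intercept": 1}

counts = {name: count_trainable(variation, num_param_dict[name], num_items=4)
          for name, variation in coef_variation_dict.items()}
print(counts, sum(counts.values()))
```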

In [30]:
model = ConditionalLogitModel(
    coef_variation_dict={'itemsession_cost_freq_ovt': 'constant',
                         'session_income': 'item',
                         'itemsession_ivt': 'item-full',
                         'intercept': 'item'},
    num_param_dict={'itemsession_cost_freq_ovt': 3,
                    'session_income': 1,
                    'itemsession_ivt': 1,
                    'intercept': 1},
    num_items=4,
    regularization="L1", regularization_weight=0.5)

In [31]:
from torch_choice import run
run(model, dataset, batch_size=-1, learning_rate=0.01, num_epochs=1000, model_optimizer="LBFGS")

GPU available: True (cuda), used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs

  | Name  | Type                  | Params
------------------------------------------------
0 | model | ConditionalLogitModel | 13    
------------------------------------------------
13        Trainable params
0         Non-trainable params
13        Total params
0.000     Total estimated model params size (MB)


ConditionalLogitModel(
  (coef_dict): ModuleDict(
    (itemsession_cost_freq_ovt[constant]): Coefficient(variation=constant, num_items=4, num_users=None, num_params=3, 3 trainable parameters in total, initialization=normal, device=cpu).
    (session_income[item]): Coefficient(variation=item, num_items=4, num_users=None, num_params=1, 3 trainable parameters in total, initialization=normal, device=cpu).
    (itemsession_ivt[item-full]): Coefficient(variation=item-full, num_items=4, num_users=None, num_params=1, 4 trainable parameters in total, initialization=normal, device=cpu).
    (intercept[item]): Coefficient(variation=item, num_items=4, num_users=None, num_params=1, 3 trainable parameters in total, initialization=normal, device=cpu).
  )
)
Conditional logistic discrete choice model, expects input features:

X[itemsession_cost_freq_ovt[constant]] with 3 parameters, with constant level variation.
X[session_income[item]] with 1 parameters, with item level variation.
X[itemsession_ivt[i

`Trainer.fit` stopped: `max_epochs=1000` reached.


Time taken for training: 19.963228940963745
Skip testing, no test dataset is provided.
Log-likelihood: [Training] -1874.63818359375, [Validation] N/A, [Test] N/A

| Coefficient                           |   Estimation |   Std. Err. |       z-value |    Pr(>|z|) | Significance   |
|:--------------------------------------|-------------:|------------:|--------------:|------------:|:---------------|
| itemsession_cost_freq_ovt[constant]_0 | -0.0372948   |  0.0070951  |  -5.25641     | 1.46897e-07 | ***            |
| itemsession_cost_freq_ovt[constant]_1 |  0.0934485   |  0.00509611 |  18.3372      | 0           | ***            |
| itemsession_cost_freq_ovt[constant]_2 | -0.042776    |  0.00322201 | -13.2762      | 0           | ***            |
| session_income[item]_0                | -0.0862387   |  0.018302   |  -4.71199     | 2.45312e-06 | ***            |
| session_income[item]_1                | -0.0269129   |  0.00384875 |  -6.99265     | 2.69762e-12 | ***            |
| session_i
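
The z-value and `Pr(>|z|)` columns appear consistent with a standard Wald test, in which each estimate is divided by its standard error and compared against a standard normal distribution. The sketch below recomputes both quantities for the first row under that assumption. It uses only the Python standard library and is not torch-choice's reporting code.

```python
import math

def wald_test(estimate, std_err):
    """Return the z-value and two-sided normal p-value for H0: coefficient = 0."""
    z = estimate / std_err
    p = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p

# First row of the table: itemsession_cost_freq_ovt[constant]_0
z, p = wald_test(-0.0372948, 0.0070951)
print(f"z = {z:.4f}, p = {p:.3e}")
```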

In [32]:
# You can visualize the training process using TensorBoard by (1) running the following command in terminal and (2) opening the browser and navigating to http://localhost:6006.
! tensorboard --logdir ./lightning_logs --port 6006

TensorFlow installation not found - running with reduced feature set.

NOTE: Using experimental fast data loading logic. To disable, pass
    "--load_fast=false" and report issues on GitHub. More details:
    https://github.com/tensorflow/tensorboard/issues/4784

Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.11.0 at http://localhost:6006/ (Press CTRL+C to quit)


# Nested Logit Model
The paper gives an overview of the nested logit model API without concrete examples. For a complete, executable example, please refer to the nested logit model tutorial.