# Tutorial: Data Management
**Author: Tianyu Du (tianyudu@stanford.edu)**

**Note**: please go through the introduction tutorial [here](https://gsbdbi.github.io/torch-choice/intro/) before proceeding.

This notebook aims to help users understand the functionality of the `ChoiceDataset` object.
The `ChoiceDataset` builds on the more general PyTorch dataset object and holds information on consumer choices. It offers easy, clean and efficient data management. The Jupyter-notebook version of this tutorial can be found [here](https://github.com/gsbDBI/torch-choice/blob/main/tutorials/data_management.ipynb).

This tutorial provides in-depth explanations of how the `torch-choice` library manages data. We also provide an easy-to-use data wrapper that converts long-format datasets to `ChoiceDataset` [here](https://gsbdbi.github.io/torch-choice/easy_data_management/); with it, you can harness the `torch-choice` library without going through this tutorial.

**Note**: since this package was initially proposed for modelling consumer choices, attribute names of `ChoiceDataset` are borrowed from the consumer choice literature.

**Note**: PyTorch uses the term **tensor** to denote high-dimensional matrices; we will use **tensor** and **matrix** interchangeably.

After walking through this tutorial, you should be able to initialize a `ChoiceDataset` object as follows and use it to manage data.
```python
dataset = ChoiceDataset(
    # pre-specified keywords of __init__
    item_index=item_index,  # required.
    # optional:
    user_index=user_index,
    session_index=session_index,
    item_availability=item_availability,
    # additional keywords of __init__
    user_obs=user_obs,
    item_obs=item_obs,
    session_obs=session_obs,
    price_obs=price_obs)
```

### Observables
Observables are tensors with specific shapes; we classify observables into four categories based on their variations.

#### Basic Usage
Optionally, the researcher can incorporate observables of, for example, users and items. Currently, the package supports the following types of observables, where $K_{...}$ denotes the number of observables of each type, and $U$, $I$ and $S$ denote the numbers of users, items and sessions.

1. `user_obs` $\in \mathbb{R}^{U\times K_{user}}$: user observables such as user age.
2. `item_obs` $\in \mathbb{R}^{I\times K_{item}}$: item observables such as item quality.
3. `session_obs` $\in \mathbb{R}^{S \times K_{session}}$: session observables such as whether the purchase was made on a weekday.
4. `price_obs` $\in \mathbb{R}^{S \times I \times K_{price}}$: price observables, which are values depending on **both** session and item, such as the price of an item.

The researcher should supply them as the appropriate keyword arguments while constructing the `ChoiceDataset` object.

#### (Optional) Advanced Usage: Additional Observables
In some cases, the researcher has multiple sets of user (or item, or session, or price) observables, say *user income* (a scalar variable) and *user market membership*. The *user income* is a matrix in $\mathbb{R}^{U\times 1}$. Further, suppose there are four types of market membership: no-membership, silver-membership, gold-membership, and diamond-membership. The *user market membership* is a binary matrix in $\{0, 1\}^{U\times 4}$ if we one-hot encode users' membership status.

In this case, the researcher can either
1. concatenate `user_income` and `user_market_membership` into a $\mathbb{R}^{U\times (1+4)}$ matrix and supply it as a single `user_obs` as follows:
```python
dataset = ChoiceDataset(..., user_obs=torch.cat([user_income, user_market_membership], dim=1), ...)
```
2. Or, supply these two sets of observables separately, namely a `user_income` $\in \mathbb{R}^{U \times 1}$ matrix and a `user_market_membership` $\in \mathbb{R}^{U \times 4}$ matrix, as follows:
```python
dataset = ChoiceDataset(..., user_income=user_income, user_market_membership=user_market_membership, ...)
```
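As a minimal sketch of how the two tensors in the options above might be built, the snippet below one-hot encodes a membership level and concatenates it with income. The toy values, the number of users, and the `membership_level` variable are made up for illustration; only `torch.cat` and `torch.nn.functional.one_hot` are standard PyTorch calls.

```python
import torch
import torch.nn.functional as F

# Hypothetical toy data for 6 users (values are made up for illustration).
user_income = torch.tensor([[35.0], [52.5], [18.0], [77.0], [41.0], [60.0]])  # shape (U, 1)

# Membership level per user: 0 = none, 1 = silver, 2 = gold, 3 = diamond.
membership_level = torch.tensor([0, 1, 2, 3, 1, 0])
user_market_membership = F.one_hot(membership_level, num_classes=4).float()  # shape (U, 4)

# Option 1: a single concatenated tensor of shape (U, 1 + 4).
user_obs = torch.cat([user_income, user_market_membership], dim=1)

print(user_income.shape, user_market_membership.shape, user_obs.shape)
```

With option 2, `user_income` and `user_market_membership` would simply be passed as two separate keyword arguments instead of being concatenated.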

Supplying two separate sets of observables is particularly useful when the researcher wants different kinds of coefficients for different kinds of observables.

For example, suppose the researcher wishes to model the utility for user $u$ to purchase item $i$ in session $s$ as follows:

$$
U_{usi} = \beta_{i} X^{(u)}_{user\ income} + \gamma X^{(u)}_{user\ market\ membership} + \varepsilon
$$

Please note that the $\beta_i$ coefficient has an $i$ subscript, which means it's item specific. The $\gamma$ coefficient has no subscript, which means it's the same for all items.

The coefficient for user income is item-specific so that it captures the nature of the product (i.e., a luxury or an essential good). Additionally, the utility representation includes a user market membership term because shoppers with active memberships tend to purchase more, and the coefficient of this term is constant across all items.
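To make the two levels of coefficient variation concrete, here is a rough sketch that evaluates the deterministic part of the utility above (everything except $\varepsilon$) with plain PyTorch broadcasting. The coefficient values are random placeholders and the toy sizes are assumptions; this is not how `torch-choice` models are specified, only an illustration of the formula.

```python
import torch

torch.manual_seed(0)
num_users, num_items = 5, 3  # hypothetical toy sizes

user_income = torch.rand(num_users, 1)  # shape (U, 1)
membership_level = torch.tensor([0, 1, 2, 3, 1])
user_market_membership = torch.nn.functional.one_hot(
    membership_level, num_classes=4).float()  # shape (U, 4)

beta = torch.randn(num_items)  # item-specific: one coefficient per item
gamma = torch.randn(4)         # constant: shared by all items

# Income term varies with both user and item: (U, 1) * (1, I) -> (U, I).
income_term = user_income * beta.view(1, num_items)
# Membership term varies with user only: (U, 4) @ (4,) -> (U,), broadcast over items.
membership_term = (user_market_membership @ gamma).view(num_users, 1)

utility = income_term + membership_term  # shape (U, I), noise term omitted
print(utility.shape)
```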

As we will cover later in the modelling section, we need to supply two user observable tensors in this case for the model to build coefficients with different levels of variation (i.e., item-specific coefficients versus constant coefficients). In this case, the researcher needs to supply two tensors `user_income` and `user_market_membership` as keyword arguments to the `ChoiceDataset` constructor.

Generally, the `ChoiceDataset` handles multiple user/item/session/price observables internally; the `ChoiceDataset` class identifies the variation of observables by their prefixes. For example, every keyword argument passed into `ChoiceDataset` with a name starting with `item_` (except for reserved names such as `item_index` and `item_availability`) will be treated as an item observable tensor.
Similarly, all keywords with names starting with `user_`, `session_` and `price_` (except for reserved names like `user_index` and `session_index` mentioned above) will be interpreted as user/session/price observable tensors.
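The prefix rule described above can be pictured with the small helper below. It is only an illustration of the naming convention as stated in this tutorial: `classify_keyword`, `RESERVED` and `PREFIXES` are hypothetical names, and the library's actual implementation may differ.

```python
# Hypothetical illustration of the prefix convention; not the library's code.
RESERVED = {'item_index', 'user_index', 'session_index', 'item_availability'}
PREFIXES = ('user_', 'item_', 'session_', 'price_')


def classify_keyword(name: str) -> str:
    """Return which kind of observable a keyword name would denote."""
    if name in RESERVED:
        return 'reserved'
    for prefix in PREFIXES:
        if name.startswith(prefix):
            return prefix.rstrip('_')  # e.g. 'user_' -> 'user'
    return 'unrecognized'


for keyword in ['user_income', 'user_market_membership', 'item_obs',
                'item_availability', 'price_obs', 'session_index']:
    print(f'{keyword}: {classify_keyword(keyword)}')
```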

In [2]:
```python
# import required dependencies.
import numpy as np
import pandas as pd
import torch
from torch_choice.data import ChoiceDataset, JointDataset
```

In [3]:
```python
# helper that prints the shape of every tensor stored in a dictionary.
def print_dict_shape(d):
    for key, val in d.items():
        if torch.is_tensor(val):
            print(f'dict.{key}.shape={val.shape}')
```

## Creating a `ChoiceDataset` Object

In [4]:
```python
# Feel free to modify it as you want.
num_users = 10
num_items = 4
num_sessions = 500

length_of_dataset = 10000
```

### Step 1: Generate some random purchase records and observables
We will be creating a randomly generated dataset with 10000 purchase records from 10 users, 4 items and 500 sessions.

We use the term **purchase record** rather than *observation* for each row of the dataset, because *observation* means something else in the Stata documentation and we don't want to confuse existing Stata users.

As mentioned in the introduction tutorial, one purchase record consists of *who* (i.e., user) bought *what* (i.e., item) *when* and *where* (i.e., session). 

The length of the dataset equals the number of purchase records in it.

The first step is to randomly generate the observables and the purchase records using the following code. For simplicity, we assume all items are available in all sessions.

In [5]:
```python
# create observables/features; the number of features of each type is arbitrarily chosen.
# generate 128 features for each user, e.g., race, gender.
user_obs = torch.randn(num_users, 128)
# generate 64 features for each item, e.g., quality.
item_obs = torch.randn(num_items, 64)
# generate 10 features for each session, e.g., weekday indicator.
session_obs = torch.randn(num_sessions, 10)
# generate 12 features for each session-item pair, e.g., the price of that item in that session.
price_obs = torch.randn(num_sessions, num_items, 12)
```

The cell above generates random observable tensors for users, items, sessions and prices; the size of the observables of each type (i.e., the last dimension in the shape) is arbitrarily chosen.

**Notes on Encodings** Since we will be using PyTorch to train our model, we represent item identities with *consecutive* integer values instead of the raw human-readable names of items (e.g., Dell 24-inch LCD monitor). Similarly, you would need to encode user indices and session indices as well.
Raw item names can be encoded easily with [sklearn.preprocessing.LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) (The [sklearn.preprocessing.OrdinalEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html) works as well).
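For a dependency-light illustration of what such an encoding produces, the sketch below uses `numpy.unique` with `return_inverse=True` instead of the scikit-learn encoders mentioned above; it likewise maps each distinct name to a consecutive integer. The item names are made up.

```python
import numpy as np

# Hypothetical raw item names, one per purchase record.
raw_item_names = ['Dell monitor', 'HP laptop', 'Dell monitor', 'Apple phone']

# `names` holds the sorted distinct names; `codes` holds, for every record,
# the position of its name in `names` (consecutive integers starting at 0).
names, codes = np.unique(raw_item_names, return_inverse=True)

print(names)  # distinct item names in sorted order
print(codes)  # integer code of each purchase record

# The mapping can be inverted to recover the human-readable names.
decoded = names[codes].tolist()
print(decoded)
```

The resulting integer array can then be converted to a tensor (for example with `torch.as_tensor(codes).long()`) before being passed as `item_index`.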

In [7]:
```python
item_index = torch.LongTensor(np.random.choice(num_items, size=length_of_dataset))
user_index = torch.LongTensor(np.random.choice(num_users, size=length_of_dataset))
session_index = torch.LongTensor(np.random.choice(num_sessions, size=length_of_dataset))

# assume all items are available in all sessions.
item_availability = torch.ones(num_sessions, num_items).bool()
```

### Step 2: Initialize the `ChoiceDataset`
You can construct a `ChoiceDataset` using the following code, which manages all information for you.

In [8]:
```python
dataset = ChoiceDataset(
    # pre-specified keywords of __init__
    item_index=item_index,  # required.
    # optional:
    user_index=user_index,
    session_index=session_index,
    item_availability=item_availability,
    # additional keywords of __init__
    user_obs=user_obs,
    item_obs=item_obs,
    session_obs=session_obs,
    price_obs=price_obs)
```

## What Can You Do with the `ChoiceDataset`?

### `print(dataset)` and `dataset.__str__`
The command `print(dataset)` will provide a quick overview of shapes of tensors included in the object as well as where the dataset is located (i.e., host memory or GPU memory).

In [9]:
```python
print(dataset)
```

```
ChoiceDataset(label=[], item_index=[10000], user_index=[10000], session_index=[10000], item_availability=[500, 4], user_obs=[10, 128], item_obs=[4, 64], session_obs=[500, 10], price_obs=[500, 4, 12], device=cpu)
```


### `dataset.summary()`
The `summary` method provides preliminary summarization of the dataset.

In [10]:
```python
print(pd.DataFrame(dataset.user_index).value_counts())
```

```
9    1027
7    1024
2    1007
3    1007
5    1005
6    1001
4     997
8     995
1     980
0     957
dtype: int64
```


In [11]:
```python
print(pd.DataFrame(dataset.item_index).value_counts())
```

```
0    2520
1    2519
3    2498
2    2463
dtype: int64
```


In [12]:
```python
dataset.summary()
```

```
ChoiceDataset with 500 sessions, 4 items, 10 users, 10000 purchase records (observations) .
The most frequent user is 9 with 1027 observations; the least frequent user is 0 with 957 observations; on average, there are 1000.00 observations per user.
5 most frequent users are: 9(1027 times), 7(1024 times), 2(1007 times), 3(1007 times), 5(1005 times).
5 least frequent users are: 0(957 times), 1(980 times), 8(995 times), 4(997 times), 6(1001 times).
The most frequent item is 0, it was chosen 2520 times; the least frequent item is 2 it was 2463 times; on average, each item was purchased 2500.00 times.
4 most frequent items are: 0(2520 times), 1(2519 times), 3(2498 times), 2(2463 times).
4 least frequent items are: 2(2463 times), 3(2498 times), 1(2519 times), 0(2520 times).
Attribute Summaries:
Observable Tensor 'user_obs' with shape torch.Size([10, 128])
             0          1          2          3          4          5    \
count  10.000000  10.000000  10.000000  10.000000  10.000000  1
```

### `dataset.num_{users, items, sessions}`
You can use the `num_{users, items, sessions}` attributes to obtain the number of users, items, and sessions; they are determined automatically from the `{user, item, session}_obs` tensors provided while initializing the dataset object.

**Note**: the `=` specifier in f-strings (written as `=:` below) requires Python 3.8 or higher; if you are using an earlier version of Python, remove `=:` and write the label by hand.
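As a small, library-independent illustration of the f-string specifier mentioned in the note, the sketch below prints the same value with and without it. The variable names are placeholders. The last two lines are included because, to our understanding, `=` alone displays the `repr()` of the value while `=:` applies ordinary formatting, which can matter for strings.

```python
num_users = 10
device_name = 'cpu'

# Python 3.8+: the expression text is echoed before the value.
print(f'{num_users=}')   # prints: num_users=10
print(f'{num_users=:}')  # prints: num_users=10

# Equivalent spelling that also works on older Python 3 versions.
print(f'num_users={num_users}')

# For strings the two forms may differ in whether quotes are shown.
print(f'{device_name=}')
print(f'{device_name=:}')
```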

In [13]:
```python
print(f'{dataset.num_users=:}')
print(f'{dataset.num_items=:}')
print(f'{dataset.num_sessions=:}')
print(f'{len(dataset)=:}')
```

```
dataset.num_users=10
dataset.num_items=4
dataset.num_sessions=500
len(dataset)=10000
```


### `dataset.clone()`
The `ChoiceDataset` offers a `clone` method that allows you to make a copy of the dataset; you can modify the cloned dataset arbitrarily without changing the original dataset.

In [14]:
```python
# clone
print(dataset.item_index[:10])
dataset_cloned = dataset.clone()
dataset_cloned.item_index = 99 * torch.ones(num_sessions)
print(dataset_cloned.item_index[:10])
print(dataset.item_index[:10])  # does not change the original dataset.
```

```
tensor([1, 2, 2, 3, 1, 2, 2, 1, 0, 2])
tensor([99., 99., 99., 99., 99., 99., 99., 99., 99., 99.])
tensor([1, 2, 2, 3, 1, 2, 2, 1, 0, 2])
```


### `dataset.to('cuda')` and `dataset._check_device_consistency()`
One key advantage of `torch_choice` and `bemb` is their compatibility with GPUs: you can easily move tensors in a `ChoiceDataset` object between host memory (i.e., CPU memory) and device memory (i.e., GPU memory) using the `dataset.to()` method.
Please note that the following code runs only if your machine has a compatible GPU and a GPU-compatible version of PyTorch installed.

Similarly, one can move data to host memory using `dataset.to('cpu')`.
The dataset also provides a `dataset._check_device_consistency()` method to check if all tensors are on the same device.
If we move only `item_index` to the CPU without moving the other tensors, the check raises an exception.

In [16]:
```python
# move to device
print(f'{dataset.device=:}')
print(f'{dataset.device=:}')
print(f'{dataset.user_index.device=:}')
print(f'{dataset.session_index.device=:}')

# dataset = dataset.to('cuda')  # for NVIDIA GPUs.
dataset = dataset.to('mps')  # for Apple-silicon GPUs.

print(f'{dataset.device=:}')
print(f'{dataset.item_index.device=:}')
print(f'{dataset.user_index.device=:}')
print(f'{dataset.session_index.device=:}')
```

```
dataset.device=mps:0
dataset.device=mps:0
dataset.user_index.device=mps:0
dataset.session_index.device=mps:0
dataset.device=mps:0
dataset.item_index.device=mps:0
dataset.user_index.device=mps:0
dataset.session_index.device=mps:0
```


In [17]:
```python
dataset._check_device_consistency()
```

In [18]:
```python
# NOTE: this cell raises an error; this is intentional.
dataset.item_index = dataset.item_index.to('cpu')
dataset._check_device_consistency()
```

```
Exception: ("Found tensors on different devices: {device(type='mps', index=0), device(type='cpu')}.", 'Use dataset.to() method to align devices.')
```
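The idea behind such a consistency check can be sketched without a GPU: collect the devices of all tensors and complain if there is more than one. The functions `find_devices` and `check_device_consistency` below are hypothetical stand-ins, not the library's implementation, and PyTorch's `meta` device is used only as a second device that exists on every machine.

```python
import torch


def find_devices(tensors: dict) -> set:
    """Collect the distinct devices of all tensors in a dictionary."""
    return {t.device for t in tensors.values() if torch.is_tensor(t)}


def check_device_consistency(tensors: dict) -> None:
    """Raise an error if the tensors live on more than one device."""
    devices = find_devices(tensors)
    if len(devices) > 1:
        raise RuntimeError(f'Found tensors on different devices: {devices}.')


tensors = {
    'item_index': torch.zeros(5, dtype=torch.long),
    'user_obs': torch.randn(3, 2),
}
check_device_consistency(tensors)  # passes: everything is on the CPU.

# Put one tensor on a different device to trigger the error.
tensors['user_obs'] = torch.empty(3, 2, device='meta')
try:
    check_device_consistency(tensors)
    raised = False
except RuntimeError as err:
    raised = True
    print(err)
```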

In [19]:
```python
# create dictionary inputs for model.forward()
# collapse to a dictionary object.
print_dict_shape(dataset.x_dict)
```

```
dict.user_obs.shape=torch.Size([10000, 4, 128])
dict.item_obs.shape=torch.Size([10000, 4, 64])
dict.session_obs.shape=torch.Size([10000, 4, 10])
dict.price_obs.shape=torch.Size([10000, 4, 12])
```
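The shapes printed above all follow the pattern (number of purchase records, number of items, number of features). A rough sketch of how each type of observable could be expanded to that layout with plain PyTorch indexing is shown below; it reproduces the shapes only and is not necessarily how `x_dict` is computed internally. All sizes are smaller toy values chosen for illustration.

```python
import torch

torch.manual_seed(0)
num_users, num_items, num_sessions, num_records = 10, 4, 50, 200  # toy sizes

user_obs = torch.randn(num_users, 8)
item_obs = torch.randn(num_items, 6)
session_obs = torch.randn(num_sessions, 5)
price_obs = torch.randn(num_sessions, num_items, 3)

user_index = torch.randint(0, num_users, (num_records,))
session_index = torch.randint(0, num_sessions, (num_records,))

# User features: look up each record's user, then repeat across items.
user_x = user_obs[user_index].view(num_records, 1, -1).expand(-1, num_items, -1)
# Item features: identical for every record, so repeat across records.
item_x = item_obs.view(1, num_items, -1).expand(num_records, -1, -1)
# Session features: look up each record's session, then repeat across items.
session_x = session_obs[session_index].view(num_records, 1, -1).expand(-1, num_items, -1)
# Price features already vary by session and item: one lookup is enough.
price_x = price_obs[session_index]

for name, x in [('user', user_x), ('item', item_x),
                ('session', session_x), ('price', price_x)]:
    print(name, tuple(x.shape))
```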


### Subset method
One can use `dataset[indices]` with `indices` as an integer-valued tensor or array to get the corresponding rows of the dataset.
The example code block below queries five randomly selected rows of the dataset object.
The `item_index`, `user_index`, `session_index` of the resulting subset will be different from the original dataset, but other tensors will be the same.

In [17]:
```python
# __getitem__ to get batch.
# pick 5 random purchase records as the mini-batch.
dataset = dataset.to('cpu')
indices = torch.Tensor(np.random.choice(len(dataset), size=5, replace=False)).long()
print(indices)
subset = dataset[indices]
print(dataset)
print(subset)
# print_dict_shape(subset.x_dict)

# assert torch.all(dataset.x_dict['price_obs'][indices, :, :] == subset.x_dict['price_obs'])
# assert torch.all(dataset.item_index[indices] == subset.item_index)
```

```
tensor([6790, 3567, 8804, 7207, 9253])
ChoiceDataset(label=[], item_index=[10000], user_index=[10000], session_index=[10000], item_availability=[500, 4], user_obs=[10, 128], item_obs=[4, 64], session_obs=[500, 10], price_obs=[500, 4, 12], device=cpu)
ChoiceDataset(label=[], item_index=[5], user_index=[5], session_index=[5], item_availability=[500, 4], user_obs=[10, 128], item_obs=[4, 64], session_obs=[500, 10], price_obs=[500, 4, 12], device=cpu)
```


The subset method internally creates a copy of the dataset so that any modification applied to the subset will **not** be reflected in the original dataset.
The researcher can freely modify the subset in place.
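For intuition about copies versus shared memory, the sketch below shows the behaviour of plain PyTorch tensors: indexing with an integer tensor returns a copy, whereas basic slicing returns a view that shares memory with the original. It illustrates tensor semantics only and makes no claim about how `ChoiceDataset` copies its tensors internally; all names are placeholders.

```python
import torch

full = torch.arange(10)
indices = torch.tensor([7, 2, 5])

# Indexing with an integer tensor creates a new tensor (a copy).
subset = full[indices]
subset += 1
print(subset)  # the copy was modified
print(full)    # the original is unchanged so far

# Basic slicing, by contrast, returns a view sharing memory with `full`.
view = full[:3]
view += 100
print(full)    # the first three entries of the original changed
```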

In [16]:
```python
print(subset.item_index)
print(dataset.item_index[indices])

subset.item_index += 1  # modifying the batch does not change the original dataset.

print(subset.item_index)
print(dataset.item_index[indices])
```

```
tensor([0, 1, 0, 0, 0])
tensor([0, 1, 0, 0, 0])
tensor([1, 2, 1, 1, 1])
tensor([0, 1, 0, 0, 0])
```


In [17]:
```python
print(subset.item_obs[0, 0])
print(dataset.item_obs[0, 0])
subset.item_obs += 1
print(subset.item_obs[0, 0])
print(dataset.item_obs[0, 0])
```

```
tensor(-1.5811)
tensor(-1.5811)
tensor(-0.5811)
tensor(-1.5811)
```


In [18]:
```python
print(id(subset.item_index))
print(id(dataset.item_index[indices]))
```

```
140339656298640
140339656150528
```


## Using PyTorch DataLoader for the Training Loop
The `ChoiceDataset` object natively supports batch samplers from PyTorch. For demonstration purposes, we turn off the shuffling option.

In [38]:
```python
from torch.utils.data.sampler import BatchSampler, SequentialSampler, RandomSampler
shuffle = False  # for demonstration purpose.
batch_size = 32

# Create sampler.
sampler = BatchSampler(
    RandomSampler(dataset) if shuffle else SequentialSampler(dataset),
    batch_size=batch_size,
    drop_last=False)

# The batch sampler is passed as `sampler`, so each "sample" is already a full batch;
# `collate_fn` unwraps it from the length-1 list the loader builds around it.
dataloader = torch.utils.data.DataLoader(dataset,
                                         sampler=sampler,
                                         num_workers=1,
                                         collate_fn=lambda x: x[0],
                                         pin_memory=(dataset.device == 'cpu'))
```
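To see what the batch sampler hands to the dataset, the sketch below runs `BatchSampler` over a toy range of ten record positions instead of a `ChoiceDataset`. Each element it yields is a list of indices, which is why a dataset that supports `dataset[indices]` can return a whole mini-batch in one call. The toy size and batch size are arbitrary.

```python
from torch.utils.data.sampler import BatchSampler, SequentialSampler

# Ten hypothetical purchase records, identified by their positions 0..9.
toy_records = range(10)

toy_sampler = BatchSampler(
    SequentialSampler(toy_records),
    batch_size=4,
    drop_last=False)  # keep the final, smaller batch.

batches = list(toy_sampler)
print(batches)           # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(len(toy_sampler))  # 3
```

Setting `drop_last=True` would discard the final batch of two indices.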

In [40]:
```python
print(f'{item_obs.shape=:}')
item_obs_all = item_obs.view(1, num_items, -1).expand(len(dataset), -1, -1)
item_obs_all = item_obs_all.to(dataset.device)
item_index_all = item_index.to(dataset.device)
print(f'{item_obs_all.shape=:}')
```

```
item_obs.shape=torch.Size([4, 64])
item_obs_all.shape=torch.Size([10000, 4, 64])
```


In [43]:
```python
for i, batch in enumerate(dataloader):
    first, last = i * batch_size, min(len(dataset), (i + 1) * batch_size)
    idx = torch.arange(first, last)
    assert torch.all(item_obs_all[idx, :, :] == batch.x_dict['item_obs'])
    assert torch.all(item_index_all[idx] == batch.item_index)
```

In [44]:
```python
batch.x_dict['item_obs'].shape
```

```
torch.Size([16, 4, 64])
```
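The first dimension of 16 above is the size of the last mini-batch: with `drop_last=False`, the records left over after the full batches form one smaller batch. The arithmetic can be checked with the numbers used in this tutorial:

```python
import math

length_of_dataset = 10000
batch_size = 32

# Number of full batches and the number of records left for the last batch.
num_full_batches, remainder = divmod(length_of_dataset, batch_size)
# Total number of batches when the smaller last batch is kept.
num_batches = math.ceil(length_of_dataset / batch_size)

print(num_full_batches, remainder, num_batches)  # 312 16 313
```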

In [45]:
```python
print_dict_shape(dataset.x_dict)
```

```
dict.user_obs.shape=torch.Size([10000, 4, 128])
dict.item_obs.shape=torch.Size([10000, 4, 64])
dict.session_obs.shape=torch.Size([10000, 4, 10])
dict.price_obs.shape=torch.Size([10000, 4, 12])
```


In [46]:
```python
dataset.__len__()
```

```
10000
```

## Chaining Multiple Datasets: `JointDataset` Examples

In [47]:
```python
dataset1 = dataset.clone()
dataset2 = dataset.clone()
joint_dataset = JointDataset(the_dataset=dataset1, another_dataset=dataset2)
```

In [48]:
```python
joint_dataset
```

```
JointDataset with 2 sub-datasets: (
	the_dataset: ChoiceDataset(label=[], item_index=[10000], user_index=[10000], session_index=[10000], item_availability=[500, 4], user_obs=[10, 128], item_obs=[4, 64], session_obs=[500, 10], price_obs=[500, 4, 12], device=cpu)
	another_dataset: ChoiceDataset(label=[], item_index=[10000], user_index=[10000], session_index=[10000], item_availability=[500, 4], user_obs=[10, 128], item_obs=[4, 64], session_obs=[500, 10], price_obs=[500, 4, 12], device=cpu)
)
```