# Easy Data Wrapper Tutorial
The data construction covered in the Data Management tutorial might be too complicated for users without prior experience in PyTorch.
This tutorial offers a helper class that wraps the dataset. All the user needs to know is:

(1) how to load data-frames into Python. Pandas provides one-line solutions for loading various types of data files, including CSV, TSV, Stata, and Excel.

(2) basic usage of pandas.

**Note**: this tutorial assumes the reader has already read the first part of the *Data Management tutorial* and is familiar with the *terminology* of `Torch-Choice`. For example, the reader should know what a session is in our framework and what a price observable is.
We strongly recommend that the reader go through the Data Management tutorial first to get a sense of what kind of data this package handles.

Author: Tianyu Du

Date: May. 20, 2022

Update: May. 25, 2022

In [1]:
__author__ = 'Tianyu Du'

In [2]:
import pandas as pd
from torch_choice.utils.easy_data_wrapper import EasyDatasetWrapper

## References and Background for Stata Users
This tutorial aims to show how to manage choice datasets using the `torch-choice` package. We follow the Stata documentation [here](https://www.stata.com/manuals/cm.pdf) so that users can transfer prior knowledge from other packages to ours.

*From Stata Documentation*: Choice models (CM) are models for data with outcomes that are choices. The choices are selected by a decision maker, such as a person or a business (i.e., the **user**), from a set of possible alternatives (i.e., the **items**). For instance, we could model choices made by consumers who select a breakfast cereal from several different brands. Or we could model choices made by businesses who chose whether to buy TV, radio, Internet, or newspaper advertising.

Models for choice data come in two varieties—models for discrete choices and models for rank-ordered alternatives. When each individual selects a single alternative, say, he or she purchases one box of cereal, the data are discrete choice data. When each individual ranks the choices, say, he or she orders cereals from most favorite to least favorite, the data are rank-ordered data. Stata has commands for fitting both discrete choice models and rank-ordered models.

Our `torch-choice` package handles the **discrete choice** models in the Stata document above.

## Data Layout
In the following parts, we demonstrate how to convert a dataset in Stata format to the data format expected by our package.

*Why do we want another `ChoiceDataset` object instead of just one data-frame?*: In earlier versions of Stata, only a single data-frame could be loaded in memory at a time, which can cause memory errors, especially when the dataset is large. For example, suppose you have a dataset of a million recorded decisions, each involving four items, and each item has a persistent *build quality* that stays the same across all observations. The Stata format would store a million copies of these variables, which is very inefficient.
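
To make the duplication concrete, here is a minimal sketch on a toy dataset. The sizes, the item labels, and the `build_quality` column are invented for illustration, and the sketch says nothing about how `torch-choice` stores data internally; it only counts how many values each layout has to keep.

```python
import numpy as np
import pandas as pd

# Hypothetical toy sizes: 1,000 purchase records with 4 items each.
num_records = 1_000
items = ['A', 'B', 'C', 'D']

# Hypothetical item-level variable that never changes across records.
build_quality = pd.Series([0.9, 0.7, 0.8, 0.6], index=items, name='build_quality')

# Long (Stata-style) layout: one row per (record, item) pair,
# so the item-level value is repeated in every record.
long_df = pd.DataFrame({
    'record': np.repeat(np.arange(num_records), len(items)),
    'item': np.tile(items, num_records),
})
long_df['build_quality'] = long_df['item'].map(build_quality)

values_in_long_layout = len(long_df)      # one value per row
values_stored_once = len(build_quality)   # one value per item
print(values_in_long_layout, values_stored_once)
```

With a million records instead of a thousand, the gap between the two counts grows proportionally, which is the inefficiency described above.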

We need to collect a couple of data-frames as the essential pieces to build our `ChoiceDataset`. Don't worry: as soon as you have the data-frames ready, the `EasyDatasetWrapper` helper class will take care of the rest.

We call a single statistical observation a **"purchase record"** and use this terminology throughout the tutorial. 

In [3]:
df = pd.read_stata('https://www.stata-press.com/data/r17/carchoice.dta')

We load the artificial dataset from the Stata website. Here we borrow the description of the dataset reported by the `describe` command in Stata.

```
Contains data from https://www.stata-press.com/data/r17/carchoice.dta
 Observations:         3,160                  Car choice data
    Variables:             6                  30 Jul 2020 14:58
---------------------------------------------------------------------------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
---------------------------------------------------------------------------------------------------------------------------------------------------
consumerid      int     %8.0g                 ID of individual consumer
car             byte    %9.0g      nation     Nationality of car
purchase        byte    %10.0g                Indicator of car purchased
gender          byte    %9.0g      gender     Gender: 0 = Female, 1 = Male
income          float   %9.0g                 Income (in $1,000)
dealers         byte    %9.0g                 No. of dealerships in community
---------------------------------------------------------------------------------------------------------------------------------------------------
Sorted by: consumerid  car

```

In this dataset, the first four rows with `consumerid == 1` correspond to the first **purchase record**. It means that the consumer with ID 1 was choosing among four types of cars (i.e., **items**) and chose the `American` car (since `purchase == 1` in the row for the `American` car).

Even though there were four types of cars, not all of them were available all the time. For example, for the **purchase record** by the consumer with ID 4, only American, Japanese, and European cars were available. Note that there is no row in the dataset with `consumerid == 4` and `car == 'Korean'`; a missing row indicates that the item was unavailable.
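
As a small illustration of reading availability off the missing rows, the sketch below builds a record-by-item availability table with `pandas.crosstab`. The toy values are invented for illustration; only the column names mirror the Stata dataset.

```python
import pandas as pd

# Toy long-format data (invented values): record 2 has no row for 'Korean'.
toy = pd.DataFrame({
    'consumerid': [1, 1, 1, 1, 2, 2, 2],
    'car': ['American', 'Japanese', 'European', 'Korean',
            'American', 'Japanese', 'European'],
    'purchase': [1, 0, 0, 0, 0, 1, 0],
})

# Count rows per (record, item) pair; a count of zero means the item
# was not offered in that purchase record.
availability = pd.crosstab(toy['consumerid'], toy['car']) > 0
print(availability)
```

Each row of `availability` is one purchase record and each column is one item, so a `False` entry marks an unavailable item.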

In [4]:
df.head(30)

Unnamed: 0,consumerid,car,purchase,gender,income,dealers
0,1,American,1,Male,46.699997,9
1,1,Japanese,0,Male,46.699997,11
2,1,European,0,Male,46.699997,5
3,1,Korean,0,Male,46.699997,1
4,2,American,1,Male,26.1,10
5,2,Japanese,0,Male,26.1,7
6,2,European,0,Male,26.1,2
7,2,Korean,0,Male,26.1,1
8,3,American,0,Male,32.700001,8
9,3,Japanese,1,Male,32.700001,6


## Main Dataset
The wrapper we built requires several data frames. Providing the correct information is all we need to do in this tutorial; the data wrapper will handle the construction of the `ChoiceDataset` for you.

**Note**: The dataset in this tutorial is a bit over-simplified. We only have one purchase record for each user in each session, so the `consumerid` column identifies the user, the session, and the purchase record all at once. Because the number of dealers for the same type of car differs across purchase records, we give each purchase record its own session instead of assigning all purchase records to the same session.
That is, a single user makes a single choice in each session.


The **main dataset** should contain the following columns:

1. `purchase_record_column`: a column that identifies the **purchase record** (also called **case** in Stata syntax). In this tutorial, the `consumerid` column is the identifier. For example, the first 4 rows of the dataset (see above) have `consumerid == 1`, which means we should look at the first 4 rows together; they constitute the first purchase record.
2. `item_name_column`: a column that identifies the **names of items**, which is `car` in the dataset above. This column provides information about availability as well. As mentioned above, there is no row with `car == Korean` in the fourth purchase record (`consumerid == 4`), so we know that the Korean car was not available at that time.
3. `choice_column`: a column that identifies the **choice** made by the consumer in each purchase record, which is the `purchase` column in our example. Exactly one row per purchase record (i.e., rows with the same value in `purchase_record_column`) should have 1, while the values are zeros for all other rows.
4. `user_index_column`: an *optional* column that identifies the **user** making the choice, which is also `consumerid` in our case.
5. `session_index_column`: an *optional* column that identifies the **session** of the choice, which is also `consumerid` in our case.
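
Before handing a main dataset to the wrapper, it can help to check the requirements listed above. The sketch below is one possible way to do so; `check_main_data` is a hypothetical helper written for this tutorial, not part of `torch-choice`, and the toy values are invented.

```python
import pandas as pd

def check_main_data(df, record_col, item_col, choice_col):
    """Hypothetical helper (not part of torch-choice): list problems found in a main data frame."""
    problems = []
    # The choice column should contain only 0 and 1.
    if not df[choice_col].isin([0, 1]).all():
        problems.append('choice column has values other than 0 and 1')
    # Exactly one row per purchase record should be marked as chosen.
    chosen_per_record = df.groupby(record_col)[choice_col].sum()
    num_bad = int((chosen_per_record != 1).sum())
    if num_bad > 0:
        problems.append(f'{num_bad} purchase record(s) without exactly one chosen item')
    # An item should appear at most once within a purchase record.
    if df.duplicated(subset=[record_col, item_col]).any():
        problems.append('an item appears more than once in a purchase record')
    return problems

# Toy data (invented): `good` is valid, `bad` marks two items as chosen in record 2.
good = pd.DataFrame({
    'purchase_record': [1, 1, 1, 2, 2, 2],
    'item_name': ['American', 'Japanese', 'European'] * 2,
    'purchase': [1, 0, 0, 0, 1, 0],
})
bad = good.assign(purchase=[1, 0, 0, 1, 1, 0])

print(check_main_data(good, 'purchase_record', 'item_name', 'purchase'))
print(check_main_data(bad, 'purchase_record', 'item_name', 'purchase'))
```

An empty list means none of the three checks failed.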

In [5]:
# choice_column, binary.
# .copy() gives an independent data frame, so adding columns below does not trigger SettingWithCopyWarning.
df_main = df[['purchase']].copy()
# purchase_record_column
df_main['purchase_record'] = df['consumerid'].copy()
# item_name_column
df_main['item_name'] = df['car'].copy()
# user_index_column
df_main['user_index'] = df['consumerid'].copy()
# session_index_column
df_main['session_index'] = df['consumerid'].copy()


We can preview the constructed `df_main` dataset. It provides full information on who (**user**) bought what (**item**), and when and where (**session**).

In [6]:
df_main

Unnamed: 0,purchase,purchase_record,item_name,user_index,session_index
0,1,1,American,1,1
1,0,1,Japanese,1,1
2,0,1,European,1,1
3,0,1,Korean,1,1
4,1,2,American,2,2
...,...,...,...,...,...
3155,1,884,Japanese,884,884
3156,0,884,European,884,884
3157,1,885,American,885,885
3158,0,885,Japanese,885,885


## Datasets of Observables
We now construct data frames for different observables.

**Note**: the **index** (and the name of the index) of these data frames matters a lot! You can use pandas' [`set_index`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.set_index.html) method to set the index of a data frame.

**Note**: the **name** of each index should be the same as the name of the column carrying that information in the main dataset. For example, the user column in `df_main` is called `user_index`, so the user observables in this tutorial are keyed by `user_index` as well.

### Suggested Procedure of Storing and Loading Data
1. Suppose the `SESSION_INDEX` column in `df_main` is the index of the session and the `ALTERNATIVES` column is the index of the car.
2. For user-specific observables, you should have a CSV file on disk with columns {`consumerid`, `var_1`, `var_2`, ...}.
3. Load the user-specific dataset with `user_obs = pd.read_csv(..., index_col='consumerid')`.
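
The loading step can be sketched as follows. `pandas.read_csv` selects the index column through its `index_col` argument. An in-memory string stands in for the CSV file on disk, and the values are invented for illustration. The renaming step assumes the user column in the main dataset is called `user_index`, as it is in this tutorial.

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk (values invented for illustration).
csv_text = """consumerid,gender,income
1,1,46.7
2,1,26.1
3,0,32.7
"""

# index_col tells pandas which column identifies each user.
user_obs = pd.read_csv(io.StringIO(csv_text), index_col='consumerid')

# Rename the index so that it matches the user column of the main data frame.
user_obs = user_obs.rename_axis('user_index')

# The same information with the identifier as a regular column,
# which is the layout used for the observables later in this tutorial.
user_obs_as_column = user_obs.reset_index()
print(user_obs_as_column)
```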


### Types of Observables
We strongly encourage the reader to review the *Data Management Tutorial* for more details on types of observables.

1. **User Observables**: user-specific observables (e.g., gender and income) should be indexed by user names: from 1 to 885 in this tutorial.
2. **Item Observables**: item-specific observables (not shown in this tutorial) should be indexed by item names: American, Japanese, European, and Korean in this tutorial.
3. **Session Observables**: session-specific observables (not shown in this tutorial) should be indexed by session names: from 1 to 885 in this tutorial.
4. **Price Observables**: session-and-item-specific observables (e.g., dealers) should be indexed by both session names and item names (i.e., multi-indexing): from (1, American) to (885, Korean) in this example.
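
One way to decide which type a variable belongs to is to check along which identifier it stays constant. The sketch below does this on invented toy data; `is_constant_within` is a hypothetical helper written for illustration, not a function of `torch-choice`.

```python
import pandas as pd

def is_constant_within(df, key, column):
    """Hypothetical helper: True if `column` takes a single value inside every group of `key`."""
    return bool((df.groupby(key)[column].nunique() <= 1).all())

# Toy long-format data (invented values) with two users, one session each.
toy = pd.DataFrame({
    'user_index':    [1, 1, 1, 2, 2, 2],
    'session_index': [1, 1, 1, 2, 2, 2],
    'item_name':     ['American', 'Japanese', 'European'] * 2,
    'income':        [46.7, 46.7, 46.7, 26.1, 26.1, 26.1],
    'dealers':       [9, 11, 5, 10, 7, 2],
})

# income never changes within a user, so it can be a user observable.
income_is_user_level = is_constant_within(toy, 'user_index', 'income')
# dealers changes across items within a session, so it is a price observable.
dealers_is_session_level = is_constant_within(toy, 'session_index', 'dealers')
print(income_is_user_level, dealers_is_session_level)
```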

**NOTE**: you don't need to understand the following lines of code in detail. They simply extract the gender of each user and encode it as a binary variable. Please look at the resulting data-frame to see what kind of observable data-frame is required.

In [7]:
# gender = pd.get_dummies(df.groupby('consumerid')['gender'].first().to_frame().reset_index().rename(columns={'consumerid': 'user_index'}))
gender = df.groupby('consumerid')['gender'].first().to_frame().reset_index().rename(columns={'consumerid': 'user_index'})
# convert to integers.
gender['gender'] = (gender['gender'] == 'Male').astype(int)

The user-observable data-frame contains a column of user IDs. This column should have exactly the same name as the column containing user IDs in the main dataset, which is `user_index`. Otherwise, the wrapper won't know which column corresponds to user IDs and which columns correspond to variables.
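
For readers who do want to follow the chained call, the sketch below splits it into separate steps on a small invented data frame. The column names mirror the tutorial; the values do not come from the Stata dataset.

```python
import pandas as pd

# Toy long-format data (invented values): gender is repeated on every row of a consumer.
toy = pd.DataFrame({
    'consumerid': [1, 1, 1, 2, 2],
    'gender': ['Male', 'Male', 'Male', 'Female', 'Female'],
})

# Step 1: keep one gender value per consumer.
per_user = toy.groupby('consumerid')['gender'].first()
# Step 2: turn the Series into a data frame with consumerid as a regular column.
gender_toy = per_user.to_frame().reset_index()
# Step 3: rename the ID column to match the user column of the main data frame.
gender_toy = gender_toy.rename(columns={'consumerid': 'user_index'})
# Step 4: encode the text labels as integers (1 for Male, 0 otherwise).
gender_toy['gender'] = (gender_toy['gender'] == 'Male').astype(int)
print(gender_toy)
```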

In [8]:
gender

Unnamed: 0,user_index,gender
0,1,1
1,2,1
2,3,1
3,4,0
4,5,1
...,...,...
880,881,1
881,882,1
882,883,0
883,884,1


In [9]:
income = df.groupby('consumerid')['income'].first().to_frame().reset_index().rename(columns={'consumerid': 'user_index'})

In [10]:
income

Unnamed: 0,user_index,income
0,1,46.699997
1,2,26.100000
2,3,32.700001
3,4,49.199997
4,5,24.299999
...,...,...
880,881,45.700001
881,882,69.800003
882,883,45.599998
883,884,20.900000


The price observable data-frame is indexed by two columns: session and item. The session index column should have exactly the same name as the session index column in `df_main` (i.e., `session_index` in this example), and the item index column should have exactly the same name as the item name column in `df_main` (i.e., `item_name` in this example).

In [11]:
dealers = df[['consumerid', 'car', 'dealers']].rename(columns={'consumerid': 'session_index', 'car': 'item_name'})

In [12]:
dealers

Unnamed: 0,session_index,item_name,dealers
0,1,American,9
1,1,Japanese,11
2,1,European,5
3,1,Korean,1
4,2,American,10
...,...,...,...
3155,884,Japanese,10
3156,884,European,4
3157,885,American,10
3158,885,Japanese,5


## Build Datasets using `EasyDatasetWrapper`
We first need to provide the main dataset to the wrapper, and then tell the wrapper a bit of information about the data. Recall that our `df_main` looks like the following. We now tell the `EasyDatasetWrapper` helper what each column in `df_main` means.

In [13]:
df_main.head()

Unnamed: 0,purchase,purchase_record,item_name,user_index,session_index
0,1,1,American,1,1
1,0,1,Japanese,1,1
2,0,1,European,1,1
3,0,1,Korean,1,1
4,1,2,American,2,2


In [14]:
data = EasyDatasetWrapper(main_data=df_main,
                          purchase_record_column='purchase_record',
                          choice_column='purchase',
                          item_name_column='item_name',
                          user_index_column='user_index',
                          session_index_column='session_index',
                          user_observable_data={'gender': gender, 'income': income},
                          price_observable_data={'dealer': dealers})

Creating choice dataset from stata format data-frames...
Note: choice sets of different sizes found in different purchase records: {'size 4': 'occurrence 505', 'size 3': 'occurrence 380'}
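
The note about choice sets of different sizes can be checked directly from a long-format main data frame by counting rows per purchase record. The sketch below uses invented toy data rather than `df_main`, and it is only one way to obtain such counts, not necessarily how the wrapper computes them.

```python
import pandas as pd

# Toy main data frame (invented values): record 1 offers 4 items, records 2 and 3 offer 3.
toy_main = pd.DataFrame({
    'purchase_record': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
    'item_name': ['American', 'Japanese', 'European', 'Korean',
                  'American', 'Japanese', 'European',
                  'American', 'Japanese', 'European'],
    'purchase': [1, 0, 0, 0, 0, 1, 0, 0, 0, 1],
})

# Size of the choice set in each purchase record.
sizes = toy_main.groupby('purchase_record').size()
# Number of purchase records for each choice set size.
size_counts = sizes.value_counts().to_dict()
print(size_counts)
```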


In [15]:
# Use summary to see what's inside the data wrapper.
data.summary()

* Space of 4 items:
                   0         1         2       3
item name  American  European  Japanese  Korean
* Number of purchase records/cases: 885.
* Preview of main data frame:
      purchase  purchase_record item_name  user_index  session_index
0            1                1  American           1              1
1            0                1  Japanese           1              1
2            0                1  European           1              1
3            0                1    Korean           1              1
4            1                2  American           2              2
...        ...              ...       ...         ...            ...
3155         1              884  Japanese         884            884
3156         0              884  European         884            884
3157         1              885  American         885            885
3158         0              885  Japanese         885            885
3159         0              885  European         885

Now let's compare what's inside the data structure and our raw data.

In [16]:
bought_raw = df[df['purchase'] == 1]['car'].values
bought_data = []
# integer item codes follow the item order printed by data.summary() above.
encoder = {0: 'American', 1: 'European', 2: 'Japanese', 3: 'Korean'}
for b in data.choice_dataset.item_index:
    bought_data.append(encoder[int(b)])

In [17]:
all(bought_raw == bought_data)

True
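
The hand-written `encoder` matches the item order printed by `data.summary()`, which happens to be alphabetical in this example. Assuming that ordering (it is an observation from the summary output, not a documented guarantee), the same mapping can be derived from the item names instead of being typed by hand. The item names below are a toy stand-in for the `car` column.

```python
import pandas as pd

# Toy stand-in for the column of item names.
item_names = pd.Series(['American', 'Japanese', 'European', 'Korean', 'American'])

# Assumption: integer codes follow the alphabetical order of the unique item names,
# as seen in the summary output of the wrapper.
derived_encoder = dict(enumerate(sorted(item_names.unique())))
# Reverse mapping from item name to integer code.
derived_decoder = {name: code for code, name in derived_encoder.items()}
print(derived_encoder)
```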

Then, let's compare the income and gender variable contained in the dataset. 

In [18]:
X = df.groupby('consumerid')['income'].first().values
Y = data.choice_dataset.user_income.numpy().squeeze()
all(X == Y)

True

In [19]:
X = df.groupby('consumerid')['gender'].first().values
X = (X == 'Male')  # binary encoding.
Y = data.choice_dataset.user_gender.numpy().squeeze()
all(X == Y)

True

Lastly, let's compare the `price_dealer` variable. Since it contains NaN values for unavailable cars, and NaN never compares equal to itself, we cannot use `all(X == Y)` to compare them.

In [20]:
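# rearrange columns to align them with the internal encoding scheme of the data wrapper.
# keyword arguments are used because recent pandas versions no longer accept positional arguments in pivot.
X = df.pivot(index='consumerid', columns='car', values='dealers')[['American', 'European', 'Japanese', 'Korean']]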

In [21]:
X

consumerid,American,European,Japanese,Korean
1,9.0,5.0,11.0,1.0
2,10.0,2.0,7.0,1.0
3,8.0,2.0,6.0,
4,5.0,3.0,4.0,
5,8.0,3.0,3.0,
...,...,...,...,...
881,8.0,2.0,10.0,
882,8.0,6.0,8.0,1.0
883,9.0,5.0,8.0,1.0
884,12.0,4.0,10.0,


In [22]:
Y = data.choice_dataset.price_dealer.squeeze(dim=-1)

In [23]:
Y

tensor([[ 9.,  5., 11.,  1.],
        [10.,  2.,  7.,  1.],
        [ 8.,  2.,  6., nan],
        ...,
        [ 9.,  5.,  8.,  1.],
        [12.,  4., 10., nan],
        [10.,  4.,  5., nan]])
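
Visual inspection suggests the two tables agree. A programmatic check has to treat NaN entries in the same position as equal, which `numpy.allclose` does when called with `equal_nan=True`. The sketch below shows this on invented toy data: a NumPy array stands in for the tensor, which in the tutorial could be converted with `.numpy()` as in the income comparison above.

```python
import numpy as np
import pandas as pd

# Toy long-format data (invented values): consumer 3 has no Korean car.
toy = pd.DataFrame({
    'consumerid': [1, 1, 1, 1, 3, 3, 3],
    'car': ['American', 'Japanese', 'European', 'Korean',
            'American', 'Japanese', 'European'],
    'dealers': [9, 11, 5, 1, 8, 6, 2],
})
X_toy = toy.pivot(index='consumerid', columns='car', values='dealers')
X_toy = X_toy[['American', 'European', 'Japanese', 'Korean']]

# Stand-in for the price observable stored in the dataset, already converted to NumPy.
Y_toy = np.array([[9., 5., 11., 1.],
                  [8., 2., 6., np.nan]])

# The naive comparison fails because NaN == NaN evaluates to False.
naive_equal = bool((X_toy.values == Y_toy).all())
# The NaN-aware comparison treats NaN entries in the same position as equal.
nan_aware_equal = bool(np.allclose(X_toy.values, Y_toy, equal_nan=True))
print(naive_equal, nan_aware_equal)
```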

This concludes our tutorial on building the dataset. For a more in-depth understanding of the data structure, please refer to the *Data Management Tutorial*.