# Tutorial: Optimization Algorithms
**Author: Tianyu Du (tianyudu@stanford.edu)**

**Update: May. 14, 2023**

Let's first import essential Python packages.

In [1]:
from time import time
import numpy as np
import pandas as pd
import torch
import torch.nn.functional as F

from torch_choice.data import ChoiceDataset, utils
from torch_choice.model import ConditionalLogitModel

from torch_choice.utils.run_helper import run

In [2]:
print(torch.__version__)
print(f"{torch.cuda.is_available()=:}")

2.0.1+cu118
torch.cuda.is_available()=True


This tutorial runs both with and without a graphics processing unit (GPU). However, our package is *much* faster with a GPU.

In [3]:
if torch.cuda.is_available():
    print(f'CUDA device used: {torch.cuda.get_device_name()}')
    device = 'cuda'
else:
    print('Running tutorial on CPU.')
    device = 'cpu'

CUDA device used: NVIDIA GeForce RTX 3090


## Load Dataset
We have included the `ModeCanada` dataset in our package, which is located at `./public_datasets/`.

The `ModeCanada` dataset contains individuals' choices of travel mode.

The raw dataset is in long format, in which the `case` variable identifies each choice.
Using the terminology from the data management tutorial, each choice is called a *purchasing record* (i.e., a consumer bought a ticket for a particular travel mode), and the total number of choices made is denoted as $B$.

For example, the first four rows below (with `case == 109`) correspond to the first choice; the `alt` column lists all alternatives/items available.

The `choice` column identifies which alternative/item was chosen. In the second row of the data snapshot below, we have `choice == 1` and `alt == 'air'` for `case == 109`. This indicates that the travel mode chosen in `case == 109` was `air`.

Now we convert the raw dataset into a format compatible with our model. For a detailed tutorial on the compatible formats, please refer to the data management tutorial.

We focus on cases in which four alternatives were available by filtering on `noalt == 4`.

In [4]:
df = pd.read_csv('./public_datasets/ModeCanada.csv')
df = df.query('noalt == 4').reset_index(drop=True)
df.sort_values(by='case', inplace=True)
df.head()

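|    | Unnamed: 0 | case | alt   | choice | dist | cost   | ivt | ovt | freq | income | urban | noalt |
|---:|-----------:|-----:|:------|-------:|-----:|-------:|----:|----:|-----:|-------:|------:|------:|
|  0 |        304 |  109 | train |      0 |  377 |  58.25 | 215 |  74 |    4 |     45 |     0 |     4 |
|  1 |        305 |  109 | air   |      1 |  377 | 142.8  |  56 |  85 |    9 |     45 |     0 |     4 |
|  2 |        306 |  109 | bus   |      0 |  377 |  27.52 | 301 |  63 |    8 |     45 |     0 |     4 |
|  3 |        307 |  109 | car   |      0 |  377 |  71.63 | 262 |   0 |    0 |     45 |     0 |     4 |
|  4 |        308 |  110 | train |      0 |  377 |  58.25 | 215 |  74 |    4 |     70 |     0 |     4 |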


Since there are 4 rows corresponding to each *purchasing record*, the length of the long-format data is $4 \times B$.
Please refer to the data management tutorial for notations.

In [5]:
df.shape

(11116, 12)
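
The relation between the number of rows and $B$ can be checked directly. The snippet below is a minimal sketch on a hypothetical `case` column (three choice records, invented for illustration) using NumPy instead of the tutorial's DataFrame; the last lines apply the same arithmetic to the row count reported above.

```python
import numpy as np

# Hypothetical long-format `case` column: 3 purchasing records, 4 alternatives each.
case = np.array([109, 109, 109, 109, 110, 110, 110, 110, 111, 111, 111, 111])

# Count how many rows belong to each case.
cases, rows_per_case = np.unique(case, return_counts=True)
B_toy = len(cases)

# Every record contributes exactly 4 rows, so the length is 4 * B.
assert (rows_per_case == 4).all()
assert len(case) == 4 * B_toy

# Same arithmetic on the shape reported by `df.shape`: 11116 rows imply B = 2779.
B = 11116 // 4
print(B)
```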

In [6]:
item_index = df[df['choice'] == 1].sort_values(by='case')['alt'].reset_index(drop=True)
print(item_index)

0       air
1       air
2       air
3       air
4       air
       ... 
2774    car
2775    car
2776    car
2777    car
2778    car
Name: alt, Length: 2779, dtype: object


In [7]:
item_names = ['air', 'bus', 'car', 'train']
num_items = 4
encoder = dict(zip(item_names, range(num_items)))
print(f"{encoder=:}")
item_index = item_index.map(lambda x: encoder[x])
item_index = torch.LongTensor(item_index)
print(f"{item_index=:}")

encoder={'air': 0, 'bus': 1, 'car': 2, 'train': 3}
item_index=tensor([0, 0, 0,  ..., 2, 2, 2])


In [8]:
price_cost_freq_ovt = utils.pivot3d(df, dim0='case', dim1='alt',
                                    values=['cost', 'freq', 'ovt'])
print(f'{price_cost_freq_ovt.shape=:}')

price_ivt = utils.pivot3d(df, dim0='case', dim1='alt', values='ivt')
print(f'{price_ivt.shape=:}')

price_cost_freq_ovt.shape=torch.Size([2779, 4, 3])
price_ivt.shape=torch.Size([2779, 4, 1])
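
To make the resulting shapes concrete, here is a minimal NumPy sketch of the kind of reshaping that `utils.pivot3d` performs. It is not the library's implementation: the helper name `pivot3d_sketch` is hypothetical, and it assumes alternatives are ordered alphabetically, consistent with the `encoder` above. The input values are the four rows with `case == 109` from the data snapshot.

```python
import numpy as np

def pivot3d_sketch(case, alt, features):
    """Arrange long-format columns into an array of shape
    (num_cases, num_alternatives, num_features)."""
    # np.unique sorts the labels and returns each row's position among them.
    cases, case_pos = np.unique(case, return_inverse=True)
    alts, alt_pos = np.unique(alt, return_inverse=True)
    out = np.full((len(cases), len(alts), len(features)), np.nan)
    for k, values in enumerate(features):
        out[case_pos, alt_pos, k] = values
    return out, cases, alts

# Rows with case == 109 from the snapshot above.
case = np.array([109, 109, 109, 109])
alt = np.array(['train', 'air', 'bus', 'car'])
cost = np.array([58.25, 142.8, 27.52, 71.63])
freq = np.array([4.0, 9.0, 8.0, 0.0])
ovt = np.array([74.0, 85.0, 63.0, 0.0])

tensor, cases, alts = pivot3d_sketch(case, alt, [cost, freq, ovt])
print(tensor.shape)   # one case, four alternatives, three features
print(alts)           # alternatives in alphabetical order
print(tensor[0, :, 0])  # cost of each alternative for case 109
```

With the full dataset, the first dimension would have length 2779 instead of 1, matching the shapes printed above.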


In [9]:
session_income = df.groupby('case')['income'].first()
session_income = torch.Tensor(session_income.values).view(-1, 1)
print(f'{session_income.shape=:}')

session_income.shape=torch.Size([2779, 1])


Lastly, we put all the tensors we created into a single `ChoiceDataset` object and move the dataset to the appropriate device.

In [10]:
dataset = ChoiceDataset(item_index=item_index,
                        price_cost_freq_ovt=price_cost_freq_ovt,
                        session_income=session_income,
                        price_ivt=price_ivt
                        ).to(device)

No `session_index` is provided, assume each choice instance is in its own session.


In [11]:
print(dataset)

ChoiceDataset(label=[], item_index=[2779], user_index=[], session_index=[2779], item_availability=[], price_cost_freq_ovt=[2779, 4, 3], session_income=[2779, 1], price_ivt=[2779, 4, 1], device=cuda:0)


In [22]:
"""
This is a template script for researchers to train the PyTorch-based model with minimal effort.
The researcher only needs to initialize the dataset and the model; this training template comes with default
hyper-parameters, including batch size and learning rate. The researcher should experiment with different
hyper-parameter values if the default setting doesn't converge well.

This is a modified version of the original run_helper.py script, adapted to work with PyTorch Lightning.
"""
import time
from copy import deepcopy
from typing import Optional, Union

from scipy.stats import norm

import pandas as pd
import torch
import torch.nn.functional as F
import pytorch_lightning as pl
from pytorch_lightning.callbacks.early_stopping import EarlyStopping

from torch_choice.data import ChoiceDataset
from torch_choice.data.utils import create_data_loader
from torch_choice.model.conditional_logit_model import ConditionalLogitModel
from torch_choice.model.nested_logit_model import NestedLogitModel
from torch_choice.utils.std import parameter_std


class LightningModelWrapper(pl.LightningModule):
    def __init__(self,
                 model: Union[ConditionalLogitModel, NestedLogitModel],
                 learning_rate: float,
                 optimizer: str):
        """
        The pytorch-lightning model wrapper for conditional and nested logit model.
        Ideally, end users don't need to interact with this class. This wrapper will be called by the run() function.
        """
        super().__init__()
        self.model = model
        self.learning_rate = learning_rate
        self.optimizer_class_string = optimizer

    def __str__(self) -> str:
        return str(self.model)

    def forward(self, *args, **kwargs):
        return self.model(*args, **kwargs)

    @torch.no_grad()
    def _get_performance_dict(self, batch):
        item_index = batch['item'].item_index if isinstance(self.model, NestedLogitModel) else batch.item_index
        ll = - self.model.negative_log_likelihood(batch, item_index).detach().item()
        return {'log_likelihood': ll}

    def training_step(self, batch, batch_idx):
        item_index = batch['item'].item_index if isinstance(self.model, NestedLogitModel) else batch.item_index
        loss = self.model.loss(batch, item_index)
        self.log('train_loss', loss, prog_bar=False, batch_size=len(batch))
        # skip computing log-likelihood for training steps to speed up training.
        # for key, val in self._get_performance_dict(batch).items():
            # self.log('test_' + key, val, prog_bar=True, batch_size=len(batch))
        return loss

    def validation_step(self, batch, batch_idx):
        for key, val in self._get_performance_dict(batch).items():
            self.log('val_' + key, val, prog_bar=False, batch_size=len(batch))

    def test_step(self, batch, batch_idx):
        for key, val in self._get_performance_dict(batch).items():
            self.log('test_' + key, val, prog_bar=False, batch_size=len(batch))

    def configure_optimizers(self):
        return getattr(torch.optim, self.optimizer_class_string)(self.parameters(), lr=self.learning_rate)


def section_print(input_text):
    """Helper function for printing"""
    print('=' * 20, input_text, '=' * 20)


def run(model: Union[ConditionalLogitModel, NestedLogitModel],
        dataset_train: ChoiceDataset,
        dataset_val: Optional[ChoiceDataset]=None,
        dataset_test: Optional[ChoiceDataset]=None,
        optimizer: str='Adam',
        batch_size: int=-1,
        learning_rate: float=0.01,
        num_epochs: int=10,
        num_workers: int=0,
        device: Optional[str]=None,
        **kwargs) -> Union[ConditionalLogitModel, NestedLogitModel]:
    """_summary_

    Args:
        model (Union[ConditionalLogitModel, NestedLogitModel]): the constructed model.
        dataset_train (ChoiceDataset): the dataset for training.
        dataset_val (ChoiceDataset): an optional dataset for validation.
        dataset_test (ChoiceDataset): an optional dataset for testing.
        batch_size (int, optional): batch size for model training. Defaults to -1.
        learning_rate (float, optional): learning rate for model training. Defaults to 0.01.
        num_epochs (int, optional): number of epochs for the training. Defaults to 10.
        num_workers (int, optional): number of parallel workers for data loading. Defaults to 0.
        device (Optional[str], optional): the device that trains the model, if None is specified, the function will
            use the current device of the provided model. Defaults to None.
        **kwargs: other keyword arguments for the pytorch lightning trainer, this is for users with experience in
            pytorch lightning and wish to customize the training process.

    Returns:
        Union[ConditionalLogitModel, NestedLogitModel]: the trained model.
    """
    # ==================================================================================================================
    # Setup the lightning wrapper.
    # ==================================================================================================================
    lightning_model = LightningModelWrapper(model, learning_rate=learning_rate, optimizer=optimizer)
    if device is None:
        # infer from the model device.
        device = model.device
    # the cloned model will be used for standard error calculation later.
    model_clone = deepcopy(model)
    section_print('model received')
    print(model)

    # ==================================================================================================================
    # Prepare the data.
    # ==================================================================================================================
    # present a summary of datasets received.
    section_print('data set received')
    print('[Train dataset]', dataset_train)
    print('[Validation dataset]', dataset_val)
    print('[Test dataset]', dataset_test)

    # create pytorch dataloader objects.
    train_dataloader = create_data_loader(dataset_train.to(device), batch_size=batch_size, shuffle=True, num_workers=num_workers)

    if dataset_val is not None:
        val_dataloader = create_data_loader(dataset_val.to(device), batch_size=batch_size, shuffle=False, num_workers=num_workers)
    else:
        val_dataloader = None

    if dataset_test is not None:
        test_dataloader = create_data_loader(dataset_test.to(device), batch_size=batch_size, shuffle=False, num_workers=num_workers)
    else:
        test_dataloader = None

    # ==================================================================================================================
    # Training the model.
    # ==================================================================================================================
    # if the validation dataset is provided, do early stopping on the validation log-likelihood.
    # the monitored key must match the name logged in `validation_step`, i.e., 'val_log_likelihood'.
    callbacks = [EarlyStopping(monitor="val_log_likelihood", mode="max", patience=10, min_delta=0.001)] if val_dataloader is not None else []

    trainer = pl.Trainer(accelerator='gpu' if ('cuda' in str(model.device)) else 'cpu',  # use GPU if the model is currently on the GPU.
                         devices="auto",
                         max_epochs=num_epochs,
                         # both intervals must be at least 1; `num_epochs // 100` is 0 when num_epochs < 100.
                         check_val_every_n_epoch=max(1, num_epochs // 100),
                         log_every_n_steps=max(1, num_epochs // 100),
                         callbacks=callbacks,
                         **kwargs)
    start_time = time.time()
    trainer.fit(lightning_model, train_dataloaders=train_dataloader, val_dataloaders=val_dataloader)
    print(f'Time taken for training: {time.time() - start_time}')
    if test_dataloader is not None:
        trainer.test(lightning_model, dataloaders=test_dataloader)
    else:
        print('Skip testing, no test dataset is provided.')

    # ====== get the standard error of the model ====== #
    # current methods of computing standard deviation will corrupt the model, load weights into another model for returning.
    state_dict = deepcopy(lightning_model.model.state_dict())
    model_clone.load_state_dict(state_dict)

    # get mean of estimation.
    mean_dict = dict()
    for k, v in lightning_model.model.named_parameters():
        mean_dict[k] = v.clone()

    # estimate the standard error of the model.
    dataset_for_std = dataset_train.clone()

    if isinstance(model, ConditionalLogitModel):
        def nll_loss(model):
            y_pred = model(dataset_for_std)
            return F.cross_entropy(y_pred, dataset_for_std.item_index, reduction='sum')
    elif isinstance(model, NestedLogitModel):
        def nll_loss(model):
            d = dataset_for_std[torch.arange(len(dataset_for_std))]
            return model.negative_log_likelihood(d, d['item'].item_index)
    std_dict = parameter_std(model_clone, nll_loss)

    print('=' * 20, 'model results', '=' * 20)
    report = list()
    for coef_name, std in std_dict.items():
        std = std.cpu().detach().numpy()
        mean = mean_dict[coef_name].cpu().detach().numpy()
        coef_name = coef_name.replace('coef_dict.', '').replace('.coef', '')
        for i in range(mean.size):
            report.append({'Coefficient': coef_name + f'_{i}',
                           'Estimation': float(mean[i]),
                           'Std. Err.': float(std[i])})
    report = pd.DataFrame(report).set_index('Coefficient')

    # Compute z-value
    report['z-value'] = report['Estimation'] / report['Std. Err.']

    # Compute p-value (two tails).
    report['Pr(>|z|)'] = (1 - norm.cdf(abs(report['z-value']))) * 2

    # Compute significance stars
    report['Significance'] = ''
    report.loc[report['Pr(>|z|)'] < 0.001, 'Significance'] = '***'
    report.loc[(report['Pr(>|z|)'] >= 0.001) & (report['Pr(>|z|)'] < 0.01), 'Significance'] = '**'
    report.loc[(report['Pr(>|z|)'] >= 0.01) & (report['Pr(>|z|)'] < 0.05), 'Significance'] = '*'
    
    # print(f'Training Epochs: {num_epochs}\n')
    # print(f'Learning Rate: {learning_rate}\n')
    # print(f'Batch Size: {batch_size if batch_size != -1 else len(dataset_list[0])} out of {len(dataset_list[0])} observations in total in test set\n')

    lightning_model.model.to(device)
    train_ll = - lightning_model.model.negative_log_likelihood(dataset_train, dataset_train.item_index).detach().item()

    if dataset_val is not None:
        val_ll = - lightning_model.model.negative_log_likelihood(dataset_val, dataset_val.item_index).detach().item()
    else:
        val_ll = 'N/A'

    if dataset_test is not None:
        test_ll = - lightning_model.model.negative_log_likelihood(dataset_test, dataset_test.item_index).detach().item()
    else:
        test_ll = 'N/A'
    print(f'Final Log-likelihood: [Training] {train_ll}, [Validation] {val_ll}, [Test] {test_ll}\n')
    print(report.to_markdown())
    print("Significance codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1")
    return model

In [37]:
# """
# This is a template script for researchers to train the PyTorch-based model with minimal effort.
# The researcher only needs to initialize the dataset and the model, this training template comes with default
# hyper-parameters including batch size and learning rate. The researcher should experiment with different levels
# of hyper-parameter if the default setting doesn't converge well.
# """
# import numpy as np
# import pandas as pd
# from copy import deepcopy
# import torch
# import torch.nn.functional as F
# import torch.optim
# from torch_choice.data import utils as data_utils
# from torch_choice.utils.std import parameter_std
# from torch_choice.model.conditional_logit_model import ConditionalLogitModel
# from torch_choice.model.nested_logit_model import NestedLogitModel


# def run(model, dataset, dataset_test=None, batch_size=-1, learning_rate=0.01, num_epochs=5000, report_frequency=None, compute_std=True, return_final_training_log_likelihood=False, model_optimizer='Adam'):
#     """All in one script for the model training and result presentation."""
#     if report_frequency is None:
#         report_frequency = (num_epochs // 10)

#     assert isinstance(model, ConditionalLogitModel) or isinstance(model, NestedLogitModel), \
#         f'A model of type {type(model)} is not supported by this runner.'
#     model = deepcopy(model)  # do not modify the model outside.
#     trained_model = deepcopy(model)  # create another copy for returning.
#     data_loader = data_utils.create_data_loader(dataset, batch_size=batch_size, shuffle=True)

#     optimizer = {'SGD': torch.optim.SGD,
#                  'Adagrad': torch.optim.Adagrad,
#                  'Adadelta': torch.optim.Adadelta,
#                  'Adam': torch.optim.Adam,
#                  'LBFGS': torch.optim.LBFGS}[model_optimizer](model.parameters(), lr=learning_rate)

#     # optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
#     # optimizer = torch.optim.Adagrad(model.parameters(), lr=learning_rate)
#     # optimizer = torch.optim.Adadelta(model.parameters(), lr=learning_rate)
#     # scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10000, gamma=0.1)
#     scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10000, gamma=0.7)
#     # scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5000, gamma=0.5)
#     print('=' * 20, 'received model', '=' * 20)
#     print(model)
#     print('=' * 20, 'received dataset', '=' * 20)
#     print(dataset)
#     print('=' * 20, 'training the model', '=' * 20)

#     total_loss_history = list()
#     tol = 0.001  # stop if the loss failed to improve tol proportion of average performance in the last k iterations.
#     k = 5
#     # fit the model.
#     for e in range(1, num_epochs + 1):
#         # track the log-likelihood to minimize.
#         ll, count, total_loss = 0.0, 0.0, 0.0
#         for batch in data_loader:
#             optimizer.zero_grad()
            
#             item_index = batch['item'].item_index if isinstance(model, NestedLogitModel) else batch.item_index
#             # the model.loss returns negative log-likelihood + regularization term.
#             loss = model.loss(batch, item_index)
#             total_loss -= loss

#             with torch.no_grad():
#                 if (e % report_frequency) == 0:
#                     # record log-likelihood.
#                     ll -= model.negative_log_likelihood(batch, item_index).detach().item() # * len(batch)
#                     count += len(batch)

#                     pred = model.forward(batch).argmax(dim=1)
#                     acc = (pred == item_index).float().mean().item()
#                     print('Accuracy: ', acc)

#             loss.backward()
            
#             if model_optimizer == "LBFGS":
#                 def closure():
#                     optimizer.zero_grad()
#                     loss = model.loss(batch, item_index)
#                     loss.backward()
#                     return loss
#                 optimizer.step(closure)
#             else:
#                 optimizer.step()
#         scheduler.step()

#         current_loss = float(total_loss.detach().item())
#         # if e > k:
#         if False:
#             past_avg = np.mean(total_loss_history[-k:])
#             improvement = (past_avg - current_loss) / past_avg
#             if improvement < tol:
#                 print(f'Early stopped at {e} epochs.')
#                 break
#         total_loss_history.append(current_loss)
#         # ll /= count
#         if (e % report_frequency) == 0:
#             print(f'Epoch {e}: Log-likelihood={ll}')

#     if dataset_test is not None:
#         test_ll = - model.negative_log_likelihood(dataset_test, dataset_test.item_index).detach().item()
#         print('Test set log-likelihood: ', test_ll)

#     # final training log-likelihood.
#     ll = - model.negative_log_likelihood(dataset, dataset.item_index).detach().item() # * len(batch)

#     if not compute_std:
#         if return_final_training_log_likelihood:
#             return model, ll
#         else:
#             return model
#     else:
#         # current methods of computing standard deviation will corrupt the model, load weights into another model for returning.
#         state_dict = deepcopy(model.state_dict())
#         trained_model.load_state_dict(state_dict)

#         # get mean of estimation.
#         mean_dict = dict()
#         for k, v in model.named_parameters():
#             mean_dict[k] = v.clone()

#         # estimate the standard error of the model.
#         if isinstance(model, ConditionalLogitModel):
#             def nll_loss(model):
#                 y_pred = model(dataset)
#                 return F.cross_entropy(y_pred, dataset.item_index, reduction='sum')
#         elif isinstance(model, NestedLogitModel):
#             def nll_loss(model):
#                 d = dataset[torch.arange(len(dataset))]
#                 return model.negative_log_likelihood(d, d['item'].item_index)

#         std_dict = parameter_std(model, nll_loss)

#         print('=' * 20, 'model results', '=' * 20)
#         report = list()
#         for coef_name, std in std_dict.items():
#             std = std.cpu().detach().numpy()
#             mean = mean_dict[coef_name].cpu().detach().numpy()
#             coef_name = coef_name.replace('coef_dict.', '').replace('.coef', '')
#             for i in range(mean.size):
#                 report.append({'Coefficient': coef_name + f'_{i}',
#                             'Estimation': float(mean[i]),
#                             'Std. Err.': float(std[i])})
#         report = pd.DataFrame(report).set_index('Coefficient')
#         print(f'Training Epochs: stopped at {e}, maximum allowed: {num_epochs}\n')
#         print(f'Learning Rate: {learning_rate}\n')
#         print(f'Batch Size: {batch_size if batch_size != -1 else len(dataset)} out of {len(dataset)} observations in total\n')
#         print(f'Final Log-likelihood: {ll}\n')
#         print('Coefficients:\n')
#         print(report.to_markdown())
#         if return_final_training_log_likelihood:
#             return trained_model, ll
#         else:
#             return trained_model

In [27]:
model = ConditionalLogitModel(
    formula='(price_cost_freq_ovt|constant) + (session_income|item) + (price_ivt|item-full) + (intercept|item)',
    dataset=dataset,
    num_items=4).to(device)
# run(model, dataset, num_epochs=500, learning_rate=0.01, batch_size=-1, model_optimizer="LBFGS")
run(model, dataset, num_epochs=500, learning_rate=0.01, batch_size=-1, optimizer="LBFGS", device=device)

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
You are using a CUDA device ('NVIDIA GeForce RTX 3090') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type                  | Params
------------------------------------------------
0 | model | ConditionalLogitModel | 13    
------------------------------------------------
13        Trainable params
0         Non-trainable params
13        Total params
0.000     Total estimated model params size (MB)


ConditionalLogitModel(
  (coef_dict): ModuleDict(
    (price_cost_freq_ovt[constant]): Coefficient(variation=constant, num_items=4, num_users=None, num_params=3, 3 trainable parameters in total, device=cuda:0).
    (session_income[item]): Coefficient(variation=item, num_items=4, num_users=None, num_params=1, 3 trainable parameters in total, device=cuda:0).
    (price_ivt[item-full]): Coefficient(variation=item-full, num_items=4, num_users=None, num_params=1, 4 trainable parameters in total, device=cuda:0).
    (intercept[item]): Coefficient(variation=item, num_items=4, num_users=None, num_params=1, 3 trainable parameters in total, device=cuda:0).
  )
)
Conditional logistic discrete choice model, expects input features:

X[price_cost_freq_ovt[constant]] with 3 parameters, with constant level variation.
X[session_income[item]] with 1 parameters, with item level variation.
X[price_ivt[item-full]] with 1 parameters, with item-full level variation.
X[intercept[item]] with 1 parameters, with

`Trainer.fit` stopped: `max_epochs=500` reached.


Time taken for training: 20.58278727531433
Skip testing, no test dataset is provided.
Final Log-likelihood: [Training] -1874.3427734375, [Validation] N/A, [Test] N/A

Coefficients:

| Coefficient                     |   Estimation |   Std. Err. |    z-value |    Pr(>|z|) | Significance   |
|:--------------------------------|-------------:|------------:|-----------:|------------:|:---------------|
| price_cost_freq_ovt[constant]_0 |  -0.0333405  |  0.00709546 |  -4.69885  | 2.61628e-06 | ***            |
| price_cost_freq_ovt[constant]_1 |   0.0925296  |  0.00509755 |  18.1518   | 0           | ***            |
| price_cost_freq_ovt[constant]_2 |  -0.0430024  |  0.00322471 | -13.3353   | 0           | ***            |
| session_income[item]_0          |  -0.0890684  |  0.0183464  |  -4.85482  | 1.20496e-06 | ***            |
| session_income[item]_1          |  -0.0279925  |  0.00387252 |  -7.22848  | 4.88498e-13 | ***            |
| session_income[item]_2          |  -0.0381456  |  0.0

### Parameter Estimation from `R`
The following is the R output from the `mlogit` implementation. The estimates, standard errors, and log-likelihood from our `torch_choice` implementation closely match the `mlogit` results.

The final log-likelihoods of the models estimated by the two packages are both approximately `-1874.3`.

The `run()` function calculates the standard errors using $\sqrt{\text{diag}(H^{-1})}$, where $H$ is the Hessian of the negative log-likelihood with respect to the model parameters.
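
The formula $\sqrt{\text{diag}(H^{-1})}$ can be illustrated on a much simpler model. The sketch below is not what `parameter_std` does internally: it swaps in a binary logit model, for which the Hessian of the negative log-likelihood has the closed form $X^\top \text{diag}(p(1-p)) X$, and uses NumPy only. The data, the coefficient vector `beta_hat`, and the function names are all hypothetical.

```python
import numpy as np

def nll_hessian(beta, X):
    """Analytic Hessian of the binary-logit negative log-likelihood at beta."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))   # predicted probabilities
    w = p * (1.0 - p)                     # per-observation weights
    return X.T @ (X * w[:, None])

def standard_errors(beta, X):
    """Standard errors as sqrt(diag(H^{-1})), with H the Hessian of the NLL."""
    H = nll_hessian(beta, X)
    return np.sqrt(np.diag(np.linalg.inv(H)))

# Hypothetical design matrix: an intercept and one covariate, 200 observations.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])

# Hypothetical coefficients standing in for the fitted estimates.
beta_hat = np.array([0.5, -1.0])

H = nll_hessian(beta_hat, X)
se = standard_errors(beta_hat, X)
print(se)
```

In practice the Hessian should be evaluated at the maximum-likelihood estimates, which is why `run()` computes it only after training has finished.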

The names of the coefficients differ slightly between the two packages; one can use the following conversion table to compare the estimates and standard errors they report.

| Coefficient Name in Python        | Coefficient Name in R |
|:----------------------------------|:----------------------|
| `price_cost_freq_ovt[constant]_0` | `cost`                |
| `price_cost_freq_ovt[constant]_1` | `freq`                |
| `price_cost_freq_ovt[constant]_2` | `ovt`                 |
| `session_income[item]_0`          | `income:bus`          |
| `session_income[item]_1`          | `income:car`          |
| `session_income[item]_2`          | `income:train`        |
| `price_ivt[item-full]_0`          | `ivt:air`             |
| `price_ivt[item-full]_1`          | `ivt:bus`             |
| `price_ivt[item-full]_2`          | `ivt:car`             |
| `price_ivt[item-full]_3`          | `ivt:train`           |
| `intercept[item]_0`               | `(Intercept):bus`     |
| `intercept[item]_1`               | `(Intercept):car`     |
| `intercept[item]_2`               | `(Intercept):train`   |
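
The naming correspondence can also be written as a small helper. This is a sketch under two assumptions read off the outputs in this tutorial: coefficients with `item` variation omit the reference item `air` (so index 0 is `bus`), while `item-full` coefficients keep all four items. The function `r_name` is hypothetical and not part of `torch_choice`.

```python
item_names = ['air', 'bus', 'car', 'train']  # same order as the tutorial's encoder

def r_name(python_name):
    """Translate e.g. 'session_income[item]_0' into the mlogit name 'income:bus'."""
    base, idx = python_name.rsplit('_', 1)
    idx = int(idx)
    if base == 'price_cost_freq_ovt[constant]':
        # Constant coefficients follow the order of the pivoted columns.
        return ['cost', 'freq', 'ovt'][idx]
    variable, variation = base[:-1].split('[')
    r_variable = {'session_income': 'income',
                  'price_ivt': 'ivt',
                  'intercept': '(Intercept)'}[variable]
    # 'item' variation skips the reference item (air); 'item-full' keeps every item.
    item = item_names[idx] if variation == 'item-full' else item_names[idx + 1]
    return f'{r_variable}:{item}'

for name in ['price_cost_freq_ovt[constant]_1', 'session_income[item]_0',
             'price_ivt[item-full]_0', 'intercept[item]_2']:
    print(name, '->', r_name(name))
```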

### R Output
```r
install.packages("mlogit")
library("mlogit")
data("ModeCanada", package = "mlogit")
MC <- dfidx(ModeCanada, subset = noalt == 4)
ml.MC1 <- mlogit(choice ~ cost + freq + ovt | income | ivt, MC, reflevel='air')

summary(ml.MC1)
```
```
Call:
mlogit(formula = choice ~ cost + freq + ovt | income | ivt, data = MC, 
    reflevel = "air", method = "nr")

Frequencies of alternatives:choice
      air     train       bus       car 
0.3738755 0.1666067 0.0035984 0.4559194 

nr method
9 iterations, 0h:0m:0s 
g'(-H)^-1g = 0.00014 
successive function values within tolerance limits 

Coefficients :
                    Estimate Std. Error  z-value  Pr(>|z|)    
(Intercept):train  3.2741952  0.6244152   5.2436 1.575e-07 ***
(Intercept):bus    0.6983381  1.2802466   0.5455 0.5854292    
(Intercept):car    1.8441129  0.7085089   2.6028 0.0092464 ** 
cost              -0.0333389  0.0070955  -4.6986 2.620e-06 ***
freq               0.0925297  0.0050976  18.1517 < 2.2e-16 ***
ovt               -0.0430036  0.0032247 -13.3356 < 2.2e-16 ***
income:train      -0.0381466  0.0040831  -9.3426 < 2.2e-16 ***
income:bus        -0.0890867  0.0183471  -4.8556 1.200e-06 ***
income:car        -0.0279930  0.0038726  -7.2286 4.881e-13 ***
ivt:air            0.0595097  0.0100727   5.9080 3.463e-09 ***
ivt:train         -0.0014504  0.0011875  -1.2214 0.2219430    
ivt:bus           -0.0067835  0.0044334  -1.5301 0.1259938    
ivt:car           -0.0064603  0.0018985  -3.4029 0.0006668 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Log-Likelihood: -1874.3
McFadden R^2:  0.35443 
Likelihood ratio test : chisq = 2058.1 (p.value = < 2.22e-16)
```
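
The `z-value` and `Pr(>|z|)` columns in both reports follow from the estimate and its standard error. The sketch below reproduces them for the `cost` row of the `mlogit` output using only the standard library. Note one deliberate difference from `run()`: instead of `(1 - norm.cdf(abs(z))) * 2`, it uses the mathematically equivalent `erfc(|z| / sqrt(2))`, which does not round to exactly zero for large $|z|$. The helper name `z_and_p` is hypothetical.

```python
import math

def z_and_p(estimate, std_err):
    """Return the z-value and the two-sided p-value under a standard normal."""
    z = estimate / std_err
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) without losing precision in the tail.
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p

# `cost` row of the mlogit output: estimate -0.0333389, standard error 0.0070955.
z_cost, p_cost = z_and_p(-0.0333389, 0.0070955)
print(f'cost: z = {z_cost:.4f}, p = {p_cost:.3e}')

# `freq` row: the p-value is tiny but still strictly positive with erfc.
z_freq, p_freq = z_and_p(0.0925297, 0.0050976)
print(f'freq: z = {z_freq:.4f}, p = {p_freq:.3e}')
```

The values for `cost` should agree with the reported `-4.6986` and `2.620e-06` up to rounding of the inputs.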