# Lab 4: Mode Choice

## Estimation Dataset

We are using the same 2017 survey trip records for estimation, but the data has been pre-processed for you to allow for model estimation. The file **'trip17_estimation.csv'** contains all of the observed trips, plus additional rows for each unchosen mode alternative. To estimate the value of one mode alternative relative to the others, we need data on those other alternatives, which a standard survey generally doesn't provide.

This dataset was generated by duplicating every observed (chosen) trip (total length x) once for each mode alternative (m): once for walk, once for bike, and so on. The result is a new dataset of length (x)(m). Existing PSRC scripts were used to add travel time, cost, and distance for these trip alternatives, using model outputs. Model outputs report time/cost/distance between all origins and destinations (often called 'skims'), so it's possible to provide measures for all trips, even those that are highly unlikely (like biking from Tacoma to Everett). Some modellers have used other means of attaching these "skim" values to alternatives, including Google APIs. API calls are capped, so there is usually a cost associated with that approach, and model skims are readily available, so it was easier to use the skim-based method. However, with more observed data in the world, there are plenty of alternatives becoming available!
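To make the duplication step concrete, here is a minimal sketch in plain Python. It is not the PSRC script: the trips, zone labels, and skim values are made up for illustration, and only the column names `custom_id`, `mode_id`, `travtime`, and `choice` mirror the ones used in this lab.

```python
from itertools import product

# Hypothetical observed trips: one record per trip, with the mode actually chosen.
observed_trips = [
    {"custom_id": 1, "origin": "A", "dest": "B", "chosen_mode": 3},
    {"custom_id": 2, "origin": "A", "dest": "C", "chosen_mode": 6},
]

# A subset of the lab's mode IDs, to keep the example small.
modes = [1, 2, 3, 6]

# Hypothetical skim table: travel time in minutes keyed by (origin, dest, mode).
skim_time = {
    ("A", "B", 1): 40.0, ("A", "B", 2): 15.0, ("A", "B", 3): 8.0, ("A", "B", 6): 20.0,
    ("A", "C", 1): 120.0, ("A", "C", 2): 45.0, ("A", "C", 3): 20.0, ("A", "C", 6): 35.0,
}

# Duplicate every trip once per mode, attach the skim value for that mode,
# and flag the row that matches the observed choice.
long_rows = []
for trip, mode in product(observed_trips, modes):
    long_rows.append({
        "custom_id": trip["custom_id"],
        "mode_id": mode,
        "travtime": skim_time[(trip["origin"], trip["dest"], mode)],
        "choice": int(mode == trip["chosen_mode"]),
    })

print(len(long_rows))  # 2 trips x 4 modes = 8 rows
```

Each `custom_id` ends up with one row per mode and exactly one row where `choice` is 1. Passing `long_rows` to `pd.DataFrame` would give a table in the same long layout as the estimation file.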

----

## Data Dictionary
### Modes:
- 1: Walk
- 2: Bike
- 3: SOV (single occupant vehicle)
- 4: HOV2 (2 people in a vehicle)
- 5: HOV3 (3 or more people in a vehicle)
- 6: Transit (including bus and train)
- 9: TNC (hired rideshare vehicles like Uber, Lyft)

### Sociodemographics and Land Use:
Same as past codebook unless otherwise noted
- choice: 1 if the record is the observed (chosen) alternative, 0 if it is an unchosen alternative
- hh_density_o: households/ft^2 in trip's origin TAZ
- emp_density_o: total employees/ft^2 in trip's origin TAZ
- college_density_o: college students/ft^2 in trip's origin TAZ
- gradeschool_density_o: gradeschool students/ft^2 in trip's origin TAZ
- dist_lbus: distance to transit in miles

## Pylogit

There are multiple libraries that can be used to estimate choice models in Python, including:
- [statsmodels](https://www.statsmodels.org/stable/index.html)
- [pylogit](https://github.com/timothyb0912/pylogit)
- [choicemodels (built partially off pylogit)](https://github.com/UDST/choicemodels)
- [biogeme](http://biogeme.epfl.ch/examples_swissmetro.html)

We will focus on pylogit because it is easier to use and more flexible than statsmodels, and more mature than choicemodels, which is still in development. Biogeme is an older tool that now has a pandas-based Python version, which is very nice, but the documentation is somewhat lacking. You are welcome to try multiple libraries, but I will only be able to provide direct support for pylogit.

For more information, see: https://github.com/timothyb0912/pylogit/tree/master/examples/notebooks. 
- You can open HTML notebooks from these links, e.g. https://github.com/timothyb0912/pylogit/blob/master/examples/notebooks/Main%20PyLogit%20Example.ipynb
- If interested in nested logit, see example: https://github.com/timothyb0912/pylogit/blob/master/examples/notebooks/Nested%20Logit%20Example--Python%20Biogeme%20benchmark--09NestedLogit.ipynb

In [8]:
import pandas as pd
import numpy as np

# Import the pylogit library
import pylogit as pl    # Importing as shortcut "pl", similar to pandas imported as "pd"
from collections import OrderedDict    # a requirement for pylogit specifications

In [9]:
# Load a list of HBW trips in estimation format

df_hbw = pd.read_csv(r'trip17_estimation_hbw.csv')

In [10]:
# How many unique households are in the trips dataset now?
df_hbw.hhid.nunique()

608

## MNL Estimation Specification

In [11]:
# We are using what is called an Ordered Dictionary
# Remember that a dictionary looks like this {'key': 'value'}
# An ordered dictionary is a special version of this type that keeps the keys in order

specification = OrderedDict()
names = OrderedDict()

# Define the alternative specific constants (ASCs), i.e., the intercepts
# Remember that one choice is the baseline (ASC=0)
# Leave the chosen baseline mode out of the list of intercept values below
# In this case, we are using SOV as the base mode - it's common to use the most likely alternative as the base
specification["intercept"] = [1, 2, 4, 5, 6, 9]    # these are the mode IDs, excluding 3 for SOV
names['intercept'] = ['Walk ASC','Bike ASC', 'HOV2 ASC','HOV3+ ASC','Transit ASC', 'TNC']    # Provide labels

# Create a coefficient for travel time
# Note that we are using only a single travel time coefficient across all modes

specification['travtime'] = [[1,2,3,4,5,6,9]]    # Note that this is a list inside a list [[]]
names['travtime'] = ['time all']

# Specify which columns are used in the estimation
custom_alt_id = "mode_id"    # Mode columns, must be integer based
obs_id_column = "custom_id"    # an ID that is unique to each choice situation (the chosen and unchosen alternatives for one trip share the same ID)
choice_column = "choice"    # 0/1 column indicating if that row was the chosen or unchosen alternative

# Call the module to create the choice model specification
model_1 = pl.create_choice_model(data=df_hbw,    # Note that here we are specifying the df_hbw dataset
                                alt_id_col=custom_alt_id,
                                obs_id_col=obs_id_column,
                                choice_col=choice_column,
                                specification=specification,    # using the specification defined above
                                model_type="MNL",
                                names=names)
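It may help to see what this specification implies for a single trip. In a multinomial logit, each mode's utility here is its ASC plus the shared travel time coefficient times that mode's travel time, and the choice probability is the exponentiated utility divided by the sum over all modes. The sketch below works through that arithmetic with made-up coefficient values and travel times (they are illustrative assumptions, not estimates from this lab), using a subset of the modes.

```python
import math

# Illustrative coefficient values only (assumed, not estimated from the lab data).
asc = {1: 0.5, 2: -1.5, 3: 0.0, 6: -0.4}    # SOV (3) is the base mode, so its ASC is 0
beta_time = -0.04                            # one travel time coefficient shared by all modes

# Hypothetical travel times in minutes for one trip, by mode ID.
travtime = {1: 40.0, 2: 15.0, 3: 8.0, 6: 20.0}

# Systematic utility of each mode: ASC + beta_time * travel time
utility = {m: asc[m] + beta_time * travtime[m] for m in travtime}

# MNL probability: exp(utility) normalized over all available modes
denom = sum(math.exp(v) for v in utility.values())
prob = {m: math.exp(v) / denom for m, v in utility.items()}

for m in sorted(prob):
    print(m, round(utility[m], 2), round(prob[m], 3))
```

The probabilities always sum to 1 across the modes, and only differences in utility matter, which is why one ASC has to be fixed at 0 for the base mode.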

In [12]:
specification

OrderedDict([('intercept', [1, 2, 4, 5, 6, 9]),
             ('travtime', [[1, 2, 3, 4, 5, 6, 9]])])
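The difference between a flat list and a list inside a list determines how many coefficients get estimated: each top-level element becomes one coefficient, so a bare mode ID gets its own coefficient and a nested list shares one coefficient across the listed modes. The helper below is a hypothetical convenience function (not part of pylogit) that counts the coefficients for list-style specifications like the one above.

```python
from collections import OrderedDict

def count_coefficients(spec):
    """Count coefficients implied by a list-style specification.

    Hypothetical helper, not a pylogit function. Each top-level element of a
    variable's list counts as one coefficient: an integer is an
    alternative-specific coefficient, and a nested list is a single
    coefficient shared by the alternatives inside it.
    """
    total = 0
    for variable, groups in spec.items():
        if not isinstance(groups, list):
            raise TypeError(f"{variable}: only list-style entries are handled here")
        total += len(groups)
    return total

# Same structure as the lab's specification
specification = OrderedDict()
specification["intercept"] = [1, 2, 4, 5, 6, 9]        # 6 ASCs (SOV is the base)
specification["travtime"] = [[1, 2, 3, 4, 5, 6, 9]]    # 1 shared coefficient

# Variant: a separate travel time coefficient for every mode
per_mode_time = OrderedDict(specification)
per_mode_time["travtime"] = [1, 2, 3, 4, 5, 6, 9]      # 7 coefficients

print(count_coefficients(specification))  # 7
print(count_coefficients(per_mode_time))  # 13
```

The count is the length of the array of starting values that the estimation step needs, so changing a `[[...]]` entry to `[...]` means the array of zeros has to grow to match.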

In [13]:
# The code above only generated the template to estimate the model. We still need to execute the actual
# maximum likelihood estimation process. 

# Run the estimation for the model "model_1" specified above

# Specify the initial values and method for the optimization.
# Note that the value in np.zeros() is the number of coefficients we expect to estimate, including ASCs
# For the basic setup above there are 6 ASCs and 1 travel time variable, for a total of 7
# If you get this number wrong, the error message will usually tell you the required value.

# Note that here we're using a method called "fit_mle" on the object "model_1", which we created above.
model_1.fit_mle(np.zeros(7))
print(np.round(model_1.summary, 3))    # Make things easier to read

# Look at the estimation results
print(model_1.get_statsmodels_summary())

Log-likelihood at zero: -9,595.2829
Initial Log-likelihood: -9,595.2829


  warn('Method %s does not use Hessian information (hess).' % method,


Estimation Time for Point Estimation: 0.50 seconds.
Final log-likelihood: -6,436.4076
             parameters  std_err  t_stats  p_values  robust_std_err  \
Walk ASC          0.705    0.074    9.493       0.0           0.127   
Bike ASC         -1.554    0.073  -21.240       0.0           0.084   
HOV2 ASC         -1.735    0.054  -31.944       0.0           0.054   
HOV3+ ASC        -3.433    0.119  -28.871       0.0           0.119   
Transit ASC      -0.355    0.036   -9.839       0.0           0.039   
TNC              -3.548    0.126  -28.204       0.0           0.126   
time all         -0.038    0.002  -23.033       0.0           0.004   

             robust_t_stats  robust_p_values  
Walk ASC              5.565              0.0  
Bike ASC            -18.402              0.0  
HOV2 ASC            -31.942              0.0  
HOV3+ ASC           -28.870              0.0  
Transit ASC          -9.093              0.0  
TNC                 -28.204              0.0  
time all        