# pyLogit Example
### Rigged to use some JFK survey data
The purpose of this notebook is to demonstrate the key functionalities of pyLogit:
<ol>
    <li> Converting data between 'wide' and 'long' formats. </li>
    <li> Estimating conditional logit models. </li>
</ol>

Note: the original model used to demonstrate this code had each individual responding to multiple choice situations. Thus the choice observations were not truly independent of one another (they are correlated across choices made by the same individual). However, for the purposes of that example, the effect of repeated observations on the typical i.i.d. assumptions was ignored.

<b>Note: two binomial random attributes were added to each survey response.</b>

In [32]:
import os
from collections import OrderedDict    # For recording the model specification 

import pandas as pd                    # For file input/output
import numpy as np                     # For vectorized math operations

import pylogit as pl                   # For MNL model estimation and
                                       # conversion from wide to long format

## Load and filter the raw JFK survey data

In [33]:
wide_jfk = pd.read_csv('joined_jfk_responses.csv')
wide_jfk['transit_cost'] = 7.75
wide_jfk['random_attribute1'] = np.random.binomial(1,.5,len(wide_jfk))
wide_jfk['random_attribute2'] = np.random.binomial(2,.5,len(wide_jfk))
wide_jfk['taxi_av'] = 1
wide_jfk['transit_av'] = 1

In [34]:
# Map each ModeGroup label to an integer code in one vectorized step.
# Labels missing from mode_dict become NaN and are dropped by the filter below.
mode_dict = {'other': 0, 'taxi': 1, 'transit': 2}
wide_jfk['ModeCode'] = wide_jfk['ModeGroup'].map(mode_dict)

In [35]:
include_criteria = (wide_jfk.ModeCode.isin([1, 2]))

# Note that the .copy() ensures that any later changes are made 
# to a copy of the data and not to the original data

wide_jfk = wide_jfk.loc[include_criteria].copy()

In [36]:
# simple_jfk = wide_jfk[['random_attribute1','random_attribute2','taxi_mean_duration','taxi_mean_total_w_inferred_tip','transit_duration','transit_cost','taxi_av','transit_av','ModeCode']].copy()
simple_jfk = wide_jfk[['taxi_mean_duration','taxi_mean_total_w_inferred_tip','transit_duration','taxi_av','transit_av','ModeCode']].copy()

In [37]:
# Look at the first 5 rows of the data
simple_jfk.head().T

| | 57 | 58 | 59 | 60 | 61 |
|---|---|---|---|---|---|
| taxi_mean_duration | 47.879023 | 47.879023 | 47.879023 | 47.879023 | 28.719048 |
| taxi_mean_total_w_inferred_tip | 66.695924 | 66.695924 | 66.695924 | 66.695924 | 46.052326 |
| transit_duration | 58.083333 | 58.083333 | 58.083333 | 58.083333 | 64.05 |
| taxi_av | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| transit_av | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| ModeCode | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


## Convert the JFK survey data to "Long Format"

pyLogit only estimates models using data that is in "long" format. 

Long format has 1 row per individual per available alternative, and wide format has 1 row per individual or observation. Long format is useful because it permits one to directly use matrix dot products to calculate the index, $V_{ij} = x_{ij} \beta$, for each individual $\left(i \right)$ for each alternative $\left(j \right)$. In applications where one creates one's own dataset, the dataset can usually be created in long format from the very beginning. However, in situations where a dataset is provided to you in wide format (as in the case of the JFK survey dataset), it will be necessary to convert the data from wide format to long format.
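
As a rough illustration of the two layouts, the sketch below builds a tiny long-format table by hand and computes the index with one dot product. The column names (`obs_id`, `taxi_time`, `transit_time`, `chosen`), the data values, and the coefficient are all made up for this illustration; it is not how pyLogit performs the conversion internally.

```python
import numpy as np
import pandas as pd

# Toy wide-format data: one row per observation (values are invented)
wide = pd.DataFrame({
    "obs_id": [1, 2],
    "taxi_time": [48.0, 29.0],
    "transit_time": [58.0, 64.0],
    "choice": [1, 2],            # 1 = taxi, 2 = transit
})

# Build long-format data by hand: one row per observation per alternative
alt_columns = {1: "taxi_time", 2: "transit_time"}
pieces = []
for alt_id, col in alt_columns.items():
    pieces.append(pd.DataFrame({
        "obs_id": wide["obs_id"],
        "alt_id": alt_id,
        "travel_time": wide[col],
        "chosen": (wide["choice"] == alt_id).astype(int),
    }))
long_df = (pd.concat(pieces)
             .sort_values(["obs_id", "alt_id"])
             .reset_index(drop=True))

# With long data, V_ij = x_ij * beta is a single matrix-vector product
beta = np.array([-0.05])         # hypothetical travel-time coefficient
long_df["V"] = long_df[["travel_time"]].to_numpy() @ beta
print(long_df)
```

Each observation contributes two rows to `long_df`, one per alternative, and exactly one of them has `chosen == 1`.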

To convert the raw JFK survey data to long format, we need to specify:
<ol>
    <li>the variables or columns that are specific to a given individual, regardless of what alternative is being considered (note: every row is being treated as a separate observation, even though each individual gave multiple responses in this stated preference survey)</li>
    <li>the variables that vary across some or all alternatives, for a given individual (e.g. travel time)</li>
    <li>the availability variables</li>
    <li>the <u>unique</u> observation id column. (Note this dataset has an observation id column, but for the purposes of this example we don't want to consider the repeated observations of each person as being related. We therefore want an identifying column that gives an id to every response of every individual instead of to every individual).</li>
    <li>the choice column</li>
</ol>
<br>The cells below will identify these various columns, give them names in the long-format data, and perform the necessary conversion. 



In [38]:
# Look at the columns of the JFK survey dataset
simple_jfk.columns

Index([u'taxi_mean_duration', u'taxi_mean_total_w_inferred_tip',
       u'transit_duration', u'taxi_av', u'transit_av', u'ModeCode'],
      dtype='object')

In [39]:
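# Create the list of individual specific variables.
# Here these are the first two columns: 'taxi_mean_duration' and
# 'taxi_mean_total_w_inferred_tip'.
ind_variables = simple_jfk.columns.tolist()[:2]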

# Specify the variables that vary across individuals and some or all alternatives
# The keys are the column names that will be used in the long format dataframe.
# The values are dictionaries whose key-value pairs are the alternative id and
# the column name of the corresponding column that encodes that variable for
# the given alternative. Examples below.
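alt_varying_variables = {u'travel_time': dict([(1, 'taxi_mean_duration'),
                                               (2, 'transit_duration')]),
                         # Only taxi has a cost column in simple_jfk, so transit's
                         # travel_cost shows up as 0 in the long-format data.
                         u'travel_cost': dict([(1, 'taxi_mean_total_w_inferred_tip')])
                         }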

# Specify the availability variables
# Note that the keys of the dictionary are the alternative id's.
# The values are the columns denoting the availability for the
# given mode in the dataset.
availability_variables = {1: 'taxi_av',
                          2: 'transit_av'}

##########
# Determine the columns for: alternative ids, the observation ids and the choice
##########
# The 'custom_alt_id' is the name of a column to be created in the long-format data
# It will identify the alternative associated with each row.
custom_alt_id = "mode_id"

# Create a custom id column that ignores the fact that this is a 
# panel/repeated-observations dataset. Note the +1 ensures the id's start at one.
obs_id_column = "custom_id"
simple_jfk[obs_id_column] = np.arange(simple_jfk.shape[0],
                                      dtype=int) + 1


# Create a variable recording the choice column
choice_column = "ModeCode"

In [40]:
# Perform the conversion to long-format
long_jfk = pl.convert_wide_to_long(simple_jfk,
                                   ind_variables,
                                   alt_varying_variables,
                                   availability_variables,
                                   obs_id_column,
                                   choice_column,
                                   new_alt_id_name=custom_alt_id)
# Look at the resulting long-format dataframe
long_jfk.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| custom_id | 1.0 | 1.0 | 2.0 | 2.0 | 3.0 |
| mode_id | 1.0 | 2.0 | 1.0 | 2.0 | 1.0 |
| ModeCode | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| taxi_mean_duration | 47.879023 | 47.879023 | 47.879023 | 47.879023 | 47.879023 |
| taxi_mean_total_w_inferred_tip | 66.695924 | 66.695924 | 66.695924 | 66.695924 | 66.695924 |
| travel_time | 47.879023 | 58.083333 | 47.879023 | 58.083333 | 47.879023 |
| travel_cost | 66.695924 | 0.0 | 66.695924 | 0.0 | 66.695924 |


## Perform desired variable creations and transformations
Before estimating a model, one needs to pre-compute all of the variables that one wants to use. This is different from the functionality of other packages such as mlogit or statsmodels that use formula strings to create new variables "on-the-fly." This is also somewhat different from Python Biogeme where new variables can be defined in the script but not actually created by the user before model estimation. pyLogit does not perform variable creation. It only estimates models using variables that already exist.

Below, we pre-compute the variables needed for this example's model:
<ol>
    <li> Travel time in hours instead of minutes. </li>
    <li> Travel cost divided by 100, for ease of numeric optimization. </li>
</ol>

In [41]:
##########
# Create scaled variables so the estimated coefficients are of similar magnitudes
##########
# Scale the travel time column by 60 to convert raw units (minutes) to hours
long_jfk["travel_time_hrs"] = long_jfk["travel_time"] / 60.0

# Scale the travel cost by 100 so estimated coefficients are of similar magnitude
long_jfk["travel_cost_hundreth"] = (long_jfk["travel_cost"] / 100.0)

## Create the model specification
The model specification being used in this example is the following:
$$
\begin{aligned}
V_{i, \textrm{Taxi}} &= \textrm{ASC Taxi} + \\
&\quad \beta _{ \textrm{tt} } \textrm{Travel Time} _{ \textrm{Taxi}} * \frac{1}{60} + \\
&\quad \beta _{ \textrm{tc} } \textrm{Travel Cost}_{\textrm{Taxi}} * 0.01 \\
\\
V_{i, \textrm{Transit}} &= 0
\end{aligned}
$$

Transit is the reference alternative: no variable in the specification below enters its utility, so its systematic utility is fixed at zero. Note that packages such as mlogit and statsmodels do not, by default, handle coefficients that vary over some alternatives but not all. pyLogit's specification dictionaries allow this, although this example does not need it.

In [42]:
# NOTE: - Specification and variable names must be ordered dictionaries.
#       - Keys should be variables within the long format dataframe.
#         The sole exception to this is the "intercept" key.
#       - For the specification dictionary, the values should be lists
#         of integers or lists of lists of integers. Within a list, 
#         or within the inner-most list, the integers should be the 
#         alternative ID's of the alternative whose utility specification 
#         the explanatory variable is entering. Lists of lists denote 
#         alternatives that will share a common coefficient for the variable
#         in question.

basic_specification = OrderedDict()
basic_names = OrderedDict()

basic_specification["intercept"] = [1]
basic_names["intercept"] = ['ASC taxi']

basic_specification["travel_time_hrs"] = [1]
basic_names["travel_time_hrs"] = ['Travel Time, units:hrs (taxi)']

basic_specification["travel_cost_hundreth"] = [1]
basic_names["travel_cost_hundreth"] = ['Travel Cost taxi']
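
To make the list versus list-of-lists convention described in the comments above concrete, here is a small sketch. The two specifications are hypothetical and are not estimated in this notebook, and `count_coefficients` is an illustrative helper written for this sketch, not a pyLogit function. It assumes, per the comments above, that each element of the outer list corresponds to one estimated coefficient.

```python
from collections import OrderedDict

def count_coefficients(specification):
    """Count coefficients implied by a specification dictionary.

    Assumes each element of a variable's outer list (an integer or an
    inner list of integers) maps to exactly one coefficient.
    """
    return sum(len(alts) for alts in specification.values())

# Hypothetical: a separate travel-time coefficient for alternatives 1 and 2
separate_spec = OrderedDict()
separate_spec["travel_time_hrs"] = [1, 2]

# Hypothetical: one travel-time coefficient shared by alternatives 1 and 2
shared_spec = OrderedDict()
shared_spec["travel_time_hrs"] = [[1, 2]]

print(count_coefficients(separate_spec), count_coefficients(shared_spec))
```

Under that assumption, the names dictionary would need one name per element of the outer list: two names for `separate_spec` and one for `shared_spec`.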

## Estimate the conditional logit model

In [43]:
# Estimate the multinomial logit model (MNL)
jfk_mnl = pl.create_choice_model(data=long_jfk,
                                 alt_id_col=custom_alt_id,
                                 obs_id_col=obs_id_column,
                                 choice_col=choice_column,
                                 specification=basic_specification,
                                 model_type="MNL",
                                 names=basic_names)

# Specify the initial values and method for the optimization.
jfk_mnl.fit_mle(np.zeros(3))

# Look at the estimation results
jfk_mnl.get_statsmodels_summary()

Log-likelihood at zero: -126.8459
Initial Log-likelihood: -126.8459
Estimation Time: 0.01 seconds.
Final log-likelihood: -106.3632


| | | | |
|---|---|---|---|
| Dep. Variable: | ModeCode | No. Observations: | 183.0 |
| Model: | Multinomial Logit Model | Df Residuals: | 180.0 |
| Method: | MLE | Df Model: | 3.0 |
| Date: | Fri, 13 May 2016 | Pseudo R-squ.: | 0.161 |
| Time: | 08:57:41 | Pseudo R-bar-squ.: | 0.138 |
| converged: | True | Log-Likelihood: | -106.363 |
| | | LL-Null: | -126.846 |

| | coef | std err | z | P>\|z\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| ASC taxi | 2.5019 | 0.989 | 2.529 | 0.011 | 0.563 4.441 |
| Travel Time, units:hrs (taxi) | 3.1544 | 1.787 | 1.765 | 0.078 | -0.348 6.656 |
| Travel Cost taxi | -5.6693 | 2.419 | -2.344 | 0.019 | -10.411 -0.928 |
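
The z-statistic, p-value, and confidence interval columns can be reproduced by hand from a coefficient and its standard error. The sketch below does this for `ASC taxi`, using the unrounded estimate and standard error that this notebook's `print_summaries()` output reports. It assumes the usual normal approximation with a two-sided test, which is an assumption about how the table was produced rather than something stated in the notebook.

```python
import math

coef = 2.501880      # 'ASC taxi' estimate (unrounded, from print_summaries)
std_err = 0.989322   # its standard error

# z-statistic: estimate divided by its standard error
z_stat = coef / std_err

# Two-sided p-value under a standard normal:
# 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2))
p_value = math.erfc(abs(z_stat) / math.sqrt(2.0))

# 95% interval: estimate +/- 1.959964 standard errors
z_crit = 1.959964    # 97.5th percentile of the standard normal
ci_low = coef - z_crit * std_err
ci_high = coef + z_crit * std_err

print(round(z_stat, 3), round(p_value, 3), round(ci_low, 3), round(ci_high, 3))
```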


## View results without using statsmodels summary table

You can view all of the results at once by calling print_summaries(), which prints the various summary dataframes.

In [44]:
# Look at all of the results at the same time
jfk_mnl.print_summaries()



Number of Parameters                                          3
Number of Observations                                      183
Null Log-Likelihood                                    -126.846
Fitted Log-Likelihood                                  -106.363
Rho-Squared                                            0.161477
Rho-Bar-Squared                                        0.137826
Estimation Message        Optimization terminated successfully.
dtype: object
                               parameters   std_err   t_stats  p_values  \
ASC taxi                         2.501880  0.989322  2.528883  0.011443   
Travel Time, units:hrs (taxi)    3.154375  1.786828  1.765350  0.077505   
Travel Cost taxi                -5.669328  2.419085 -2.343584  0.019099   

                               robust_std_err  robust_t_stats  robust_p_values  
ASC taxi                             0.944201        2.649731         0.008056  
Travel Time, units:hrs (taxi)        1.816686        1.736335         0.0

In [45]:
# Look at the general and goodness of fit statistics
jfk_mnl.fit_summary

Number of Parameters                                          3
Number of Observations                                      183
Null Log-Likelihood                                    -126.846
Fitted Log-Likelihood                                  -106.363
Rho-Squared                                            0.161477
Rho-Bar-Squared                                        0.137826
Estimation Message        Optimization terminated successfully.
dtype: object
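
The goodness-of-fit numbers above can be checked by hand. The sketch below assumes the standard definitions: the null log-likelihood sets all coefficients to zero, so each of the two available alternatives gets probability 1/2; rho-squared is one minus the ratio of fitted to null log-likelihood; and rho-bar-squared penalizes the fitted log-likelihood by the number of parameters. The input values are copied from the output above.

```python
import math

n_obs = 183            # Number of Observations
n_alts = 2             # taxi and transit, both available to everyone
n_params = 3           # Number of Parameters
fitted_ll = -106.3632  # Final log-likelihood

# With all coefficients at zero, every alternative is equally likely
null_ll = n_obs * math.log(1.0 / n_alts)

# McFadden's rho-squared and its parameter-adjusted version
rho_sq = 1.0 - fitted_ll / null_ll
rho_bar_sq = 1.0 - (fitted_ll - n_params) / null_ll

print(round(null_ll, 4), round(rho_sq, 6), round(rho_bar_sq, 6))
```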

In [46]:
# Look at the parameter estimation results, and round the results for easy viewing
np.round(jfk_mnl.summary, 3)

| | parameters | std_err | t_stats | p_values | robust_std_err | robust_t_stats | robust_p_values |
|---|---|---|---|---|---|---|---|
| ASC taxi | 2.502 | 0.989 | 2.529 | 0.011 | 0.944 | 2.65 | 0.008 |
| Travel Time, units:hrs (taxi) | 3.154 | 1.787 | 1.765 | 0.078 | 1.817 | 1.736 | 0.083 |
| Travel Cost taxi | -5.669 | 2.419 | -2.344 | 0.019 | 2.448 | -2.316 | 0.021 |
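
To see what the estimates imply, the sketch below plugs them into the logit formula for the first observation shown in the long-format data (taxi travel time 47.879023 minutes, taxi cost 66.695924). It assumes the transit utility is zero, since every variable in the specification enters only the taxi utility. This is a hand calculation for illustration, not a call into pyLogit's prediction machinery.

```python
import numpy as np

# Estimated coefficients: ASC taxi, travel time (hrs), travel cost (/100)
coefs = np.array([2.5019, 3.1544, -5.6693])

# Taxi design row for the first observation: intercept, hours, cost / 100
x_taxi = np.array([1.0, 47.879023 / 60.0, 66.695924 / 100.0])

# Systematic utilities: taxi from the dot product, transit fixed at zero
utilities = np.array([x_taxi @ coefs, 0.0])

# Logit probabilities (subtracting the max is for numerical stability only)
exp_v = np.exp(utilities - utilities.max())
probs = exp_v / exp_v.sum()

print("P(taxi) = %.3f, P(transit) = %.3f" % (probs[0], probs[1]))
```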
