# Trip Mode Choice Exercise

In [1]:
import numpy as np
import pandas as pd
import scipy.stats as ss
import larch
larch.__version__


'5.3.5'

Accompanying this notebook is a file with data on intercity business travel in Canada. The data set describes 4,324 trips and includes one record for each mode alternative that was available to each individual. Because some individuals do not have all four alternatives available, the total number of records is 15,520 rather than 4 × 4,324 = 17,296.

In [2]:
raw_data = pd.read_csv('canada_intercity_business.csv.gz')
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15520 entries, 0 to 15519
Data columns (total 21 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   CASENUM    15520 non-null  int64  
 1   ALTNUM     15520 non-null  int64  
 2   CHOICE     15520 non-null  int64  
 3   DIST       15520 non-null  int64  
 4   COST       15520 non-null  float64
 5   IVTT       15520 non-null  int64  
 6   OVTT       15520 non-null  int64  
 7   FREQ       15520 non-null  int64  
 8   INVFREQ    15520 non-null  float64
 9   FREQ2      15520 non-null  int64  
 10  INCOME     15520 non-null  int64  
 11  LNINCOME   15520 non-null  float64
 12  LCITY      15520 non-null  int64  
 13  GRPSIZE    15520 non-null  int64  
 14  GENDER     15520 non-null  int64  
 15  NIJ        15520 non-null  int64  
 16  OVTLNDIS   15520 non-null  float64
 17  TOTTIME    15520 non-null  int64  
 18  COSTINC    15520 non-null  float64
 19  COSTLNINC  15520 non-null  float64
 20  CNST  

**Data Definitions**

variable | description
:------ | :----------
CASENUM | Case number
ALTNUM | Alternative 1=train; 2=air; 3=bus; 4=car 
CHOICE | Indicates chosen alternative; 1 if chosen and 0 otherwise. 
DIST | (km) Travel distance 
COST | (dollars) Travel cost, measured in 1990 Canadian dollars.
IVTT | (minutes) In-vehicle travel time
OVTT | (minutes) Out-of-vehicle travel time. (Note 0 for automobile.)
FREQ | Frequency per day. (Note 0 for automobile.) 
INVFREQ | Inverse of frequency per day. (Note 0 for automobile.) 
FREQ2 | Frequency per day. (Note 999 for automobile.) 
INCOME | (thousands of dollars) Individual’s annual income, measured in thousands of 1990 Canadian dollars
LNINCOME | The natural log of annual income
LCITY | Indicates whether one or more trip ends is in the central business district (CBD). 0=no trip ends; 1=one trip end; 2=both trip ends
GRPSIZE | Group size, the number of people traveling together
GENDER | Gender 0=male; 1=female. 
NIJ | Number of alternatives in choice set. 
OVTLNDIS | (min/log(km)) Out of vehicle travel time divided by the natural log of distance.
TOTTIME | (min) Total travel time.
COSTINC | Cost divided by income
COSTLNINC | Cost divided by the log of income
CNST | (1) A constant

## 1. Explore the data, using relevant tools. 

Report up to 5 of your most interesting findings, with frequency tables, crosstabs, and/or graphs as appropriate. Each table or figure should be neatly formatted, fit on a single page, and be accompanied by a short blurb (one to three sentences will typically be sufficient) describing *why* it is interesting or relevant for this analysis.
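
As a starting point, the sketch below shows one way to build availability counts, chosen-mode shares, and a crosstab from data in this one-row-per-available-alternative layout. The `toy` frame and its values are made up for illustration (they are not taken from the Canadian file), and `MODE_NAMES` is a hypothetical helper based on the ALTNUM codes in the table above. The same calls should work on `raw_data`.

```python
import pandas as pd

# Labels taken from the ALTNUM definition; the dict name itself is an assumption.
MODE_NAMES = {1: "train", 2: "air", 3: "bus", 4: "car"}

# Made-up rows in the same layout as raw_data: one row per available alternative.
toy = pd.DataFrame({
    "CASENUM": [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4],
    "ALTNUM":  [1, 2, 4, 2, 4, 1, 2, 3, 4, 1, 4],
    "CHOICE":  [0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0],
    "LCITY":   [0, 0, 0, 2, 2, 1, 1, 1, 1, 2, 2],
})
toy["MODE"] = toy["ALTNUM"].map(MODE_NAMES)

# Sanity check: every case should have exactly one chosen alternative.
choices_per_case = toy.groupby("CASENUM")["CHOICE"].sum()

# How often is each mode available?
availability = toy["MODE"].value_counts()

# Keep only the chosen rows, giving one row per trip.
chosen = toy[toy["CHOICE"] == 1]

# Share of trips by chosen mode.
shares = chosen["MODE"].value_counts(normalize=True)

# Chosen mode by number of trip ends in the CBD, as row percentages.
by_cbd = pd.crosstab(chosen["LCITY"], chosen["MODE"], normalize="index")
print(by_cbd.round(2))
```

Filtering to `CHOICE == 1` before tabulating matters: tabulating the full frame would count availability, not choices.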

## 2. Estimate a basic reference model

Use Larch, or any similarly appropriate tool, to estimate multinomial logit (MNL) models of mode choice for this data. Estimate a constants only model as well as a basic reference model that includes only the alternative specific constants, travel time and travel cost. Interpret your results. 
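
For readers who want to see what the estimation does under the hood, here is a minimal sketch of MNL maximum likelihood written with NumPy and SciPy rather than Larch. It is not Larch's API and not a substitute for it. The data are simulated, and the function names (`mnl_loglike`, `fit`), the scaling of cost and time, and the "true" parameter values are all assumptions made for illustration. It fits a null model, a constants-only model, and a reference model with constants, cost, and time, then computes rho-squared against each baseline.

```python
import numpy as np
from scipy.optimize import minimize


def mnl_loglike(beta, X, case_idx, chosen):
    """MNL log-likelihood for data with one row per available alternative."""
    v = X @ beta                                   # systematic utility of each row
    n_cases = case_idx.max() + 1
    vmax = np.full(n_cases, -np.inf)
    np.maximum.at(vmax, case_idx, v)               # per-case maximum, for stability
    ev = np.exp(v - vmax[case_idx])
    denom = np.bincount(case_idx, weights=ev, minlength=n_cases)
    logp = v - vmax[case_idx] - np.log(denom[case_idx])
    return logp[chosen].sum()                      # sum over the chosen rows only


def fit(X, case_idx, chosen):
    """Maximize the log-likelihood; returns (estimates, final log-likelihood)."""
    res = minimize(lambda b: -mnl_loglike(b, X, case_idx, chosen),
                   np.zeros(X.shape[1]), method="BFGS")
    return res.x, -res.fun


# Simulated data (NOT the Canadian data): 300 cases, 4 alternatives each.
rng = np.random.default_rng(0)
n_cases, n_alts = 300, 4
case_idx = np.repeat(np.arange(n_cases), n_alts)
alt = np.tile(np.arange(n_alts), n_cases)
cost = rng.uniform(0.2, 2.0, size=alt.size)        # hundreds of dollars (assumed scale)
hours = rng.uniform(1.0, 10.0, size=alt.size)      # travel time in hours (assumed scale)

# Alternative specific constants for alternatives 1..3; alternative 0 is the reference.
asc = np.column_stack([(alt == j).astype(float) for j in (1, 2, 3)])
X_full = np.column_stack([asc, cost, hours])

# Assumed "true" parameters used only to simulate choices.
true_beta = np.array([0.5, -0.5, 1.0, -2.0, -0.3])
u = (X_full @ true_beta + rng.gumbel(size=alt.size)).reshape(n_cases, n_alts)
chosen = alt == np.repeat(u.argmax(axis=1), n_alts)

ll_null = mnl_loglike(np.zeros(5), X_full, case_idx, chosen)   # all parameters zero
_, ll_const = fit(asc, case_idx, chosen)                       # constants only
beta_hat, ll_ref = fit(X_full, case_idx, chosen)               # constants + cost + time

rho_sq_null = 1 - ll_ref / ll_null
rho_sq_const = 1 - ll_ref / ll_const
print(ll_null, ll_const, ll_ref, rho_sq_null, rho_sq_const)
```

Because `case_idx` groups rows by case, the same log-likelihood function handles cases with fewer than four available alternatives, which is the situation in the actual data.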



## 3. Explore the impact of frequency on mode choice

Starting from the baseline model, create two new models, adding the frequency of service using the variables `FREQ` and `FREQ2`, respectively. The only difference between these two variables is that the frequency of service for the Car alternative is expressed as 0 in `FREQ`, and as 999 in `FREQ2`. (Since your car has no scheduled service, frequency can be considered as 0; or, since it leaves when you are ready, it can be considered to have near infinite frequency.) Compare the results of these two models. What is different? What is the same? 

Explain the implication of this, particularly regarding the interpretation of the alternative specific constants, and their t-statistics.
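
Before estimating, it can be worth confirming that the two variables are coded as described. The check below is a small sketch on made-up rows (the `toy` frame is an assumption, not the real file), and the same expressions should apply to `raw_data`.

```python
import pandas as pd

# Made-up rows; ALTNUM 4 is car per the data definitions.
toy = pd.DataFrame({
    "ALTNUM": [1, 2, 3, 4, 1, 4],
    "FREQ":   [4, 12, 6, 0, 3, 0],
    "FREQ2":  [4, 12, 6, 999, 3, 999],
})

differs = toy["FREQ"] != toy["FREQ2"]
is_car = toy["ALTNUM"] == 4

# True when the two columns disagree on car rows and nowhere else.
only_car_differs = bool((differs == is_car).all())

# The distinct (FREQ, FREQ2) pairs that appear on car rows.
car_codes = toy.loc[is_car, ["FREQ", "FREQ2"]].drop_duplicates()
print(only_car_differs)
print(car_codes)
```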

## 4. Explore other models with alternative specific variables

Estimate at least two additional models including one or more alternative specific variables (i.e., those that vary with the decision maker, not the alternatives) in each; state your reasons for selecting these variables (your work in part 1 may be valuable here). 

Using all appropriate tools and techniques at your disposal, interpret and evaluate these models (and those from before) individually, relative to the other models, relative to the null parameters model and relative to the constants only model. 

Explicitly explain the meaning of the estimation results for the attributes of alternatives and the alternative specific variables included in each model. Decide which, if any, of the alternative specific variables should be included, select a preferred model and explain the reasons for your selection.
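
One of the standard tools for comparing nested models is the likelihood ratio test. The sketch below uses `scipy.stats` (imported as `ss` at the top of this notebook); the function name and the log-likelihood values are made up for illustration and are not results from this data set.

```python
import scipy.stats as ss


def likelihood_ratio_test(ll_restricted, ll_unrestricted, df):
    """Return the LR statistic and its chi-squared p-value.

    df is the number of parameters added by the unrestricted model.
    """
    stat = -2.0 * (ll_restricted - ll_unrestricted)
    p_value = ss.chi2.sf(stat, df)   # upper-tail probability
    return stat, p_value


# Hypothetical log-likelihoods: the richer model adds 2 parameters.
stat, p_value = likelihood_ratio_test(-100.0, -95.0, df=2)
print(stat, p_value)
```

The test applies only when the restricted model is a special case of the unrestricted one, for example the constants-only model against a model that adds cost and time.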



## 5. Explore other models with transformations of variables

Estimate at least two additional models using the best of the results above and additional ideas for enhancing the model specification using transformations of the variables. In each case, explain how and why you transform the variables that you use, and interpret and evaluate the new models relative to those already estimated. Some possibilities:

1. Replace COST by COSTINC.

2. Add frequency transformed in such a way as to indicate a diminishing effect of increased frequency.

3. Examine the impact of group size using a piecewise linear formulation, or a dummy variable for large vs. small group.
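
The second and third possibilities can be sketched with a few pandas operations. The column names `LOGFREQ`, `GRP_LO`, `GRP_HI`, and `BIGGRP`, the breakpoint of 2, and the `toy` values are all assumptions chosen for illustration; other transformations and breakpoints may suit the data better.

```python
import numpy as np
import pandas as pd

# Made-up rows; FREQ is 0 for car (ALTNUM 4) as in the data definitions.
toy = pd.DataFrame({
    "ALTNUM":  [1, 2, 3, 4],
    "FREQ":    [4, 12, 6, 0],
    "GRPSIZE": [1, 2, 3, 5],
})

# log(1 + FREQ) flattens as frequency grows and stays defined at FREQ = 0.
toy["LOGFREQ"] = np.log1p(toy["FREQ"])

# Piecewise linear group size with an assumed breakpoint at 2 people:
# GRP_LO carries the slope up to the breakpoint, GRP_HI the slope beyond it.
BREAK = 2
toy["GRP_LO"] = toy["GRPSIZE"].clip(upper=BREAK)
toy["GRP_HI"] = (toy["GRPSIZE"] - BREAK).clip(lower=0)

# Dummy alternative: 1 for groups larger than the breakpoint, else 0.
toy["BIGGRP"] = (toy["GRPSIZE"] > BREAK).astype(int)
print(toy)
```

The two piecewise terms always sum to `GRPSIZE`, so a model in which both receive the same coefficient reduces to the plain linear specification.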



## 6. Identify a preferred model.

Of all the models that you have estimated, identify an overall preferred model to use in travel forecasting, and explain why you selected it.

