# 300: Itinerary Choice Data

In [1]:
import larch, pandas, os, gzip
larch.__version__

'5.4.0'

The example itinerary choice described here is based on data derived from a ticketing database
provided by the Airlines Reporting Corporation. The data represent ten origin-destination
pairs for travel in U.S. continental markets in May of 2013. Itinerary characteristics
have been masked, e.g., carriers are labeled generically as "carrier X" and departure times
have been aggregated into categories. A fare is provided but is not completely accurate (a
random error has been added to each fare). These modifications were made to satisfy
nondisclosure agreements, so that the data can be published freely for teaching and
demonstration purposes. The data are generally representative of real itinerary choice data used
in practice, and the results obtained from them are intuitive from a behavioral
perspective, but they are not fully accurate and should not be used for behavioral studies.

In [2]:
from larch.data_warehouse import example_file

In [3]:
with gzip.open(example_file("arc"), 'rt') as previewfile:
    print(*(next(previewfile) for x in range(70)))

id_case,id_alt,choice,traveler,origin,destination,nb_cnxs,elapsed_time,fare_hy,fare_ly,equipment,carrier,timeperiod
 1,1,0,1,444,222,1,300,470.55658,0,1,3,7
 1,2,0,1,444,222,1,345,475.92258,0,2,3,5
 1,3,0,1,444,222,1,335,443.48166,0,1,3,2
 1,4,0,1,444,222,1,435,433.56735,0,2,3,2
 1,5,0,1,444,222,1,710,449.83664,0,2,3,2
 1,6,0,1,444,222,1,380,470.45175,0,1,3,5
 1,7,0,1,444,222,1,345,440.70526,0,2,3,6
 1,8,0,1,444,222,1,320,474.57831,0,2,3,4
 1,9,0,1,444,222,1,335,474.97363,0,2,3,3
 1,10,0,1,444,222,1,335,481.98392,0,1,3,3
 1,11,0,1,444,222,1,320,440.41138,0,1,3,7
 1,12,0,1,444,222,1,360,455.11444,0,2,3,1
 1,13,0,1,444,222,1,380,470.67239,0,1,3,4
 1,14,14,1,444,222,0,215,317.4277,0,2,3,1
 1,15,19,1,444,222,0,215,283.96292,0,2,3,1
 1,16,19,1,444,222,0,215,285.04138,0,2,3,2
 1,17,19,1,444,222,0,215,283.59644,0,2,3,2
 1,18,1,1,444,222,0,220,276.40555,0,2,3,3
 1,19,8,1,444,222,0,220,285.51282,0,2,3,3
 1,20,10,1,444,222,0,215,313.89828,0,2,3,3
 1,21,7,1,444,222,0,220,280.06757,0,2,3,4
 1,22,1

The first line of the file contains column headers. After that, each line represents
an alternative available to one or more decision makers. In our sample data, we see the first 67
lines of data share an ``id_case`` of 1, indicating that they are 67 different itineraries
available to the first decision maker type. An identifier of the alternatives is given by the
number in the column ``id_alt``, although this number is simply a sequential counter within each case
in the raw data, and conveys no other information about the itinerary or its attributes.
The observed choices of the decision maker(s) are indicated in the column ``choice``.
That column counts the number of travelers who face this choice set and chose the itinerary
described by this row in the file.
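As a minimal sketch of this layout, the snippet below parses five rows copied from the preview above (an excerpt of case 1, not the full file) and counts, per case, how many alternatives are listed and how many travelers are recorded in ``choice``. The variable names ``excerpt``, ``rows`` and ``summary`` are illustrative only.

```python
import io
import pandas as pd

# Five rows copied from the preview above (case 1 only); not the full file.
excerpt = io.StringIO(
    "id_case,id_alt,choice,traveler,origin,destination,nb_cnxs,elapsed_time,"
    "fare_hy,fare_ly,equipment,carrier,timeperiod\n"
    "1,1,0,1,444,222,1,300,470.55658,0,1,3,7\n"
    "1,2,0,1,444,222,1,345,475.92258,0,2,3,5\n"
    "1,14,14,1,444,222,0,215,317.4277,0,2,3,1\n"
    "1,15,19,1,444,222,0,215,283.96292,0,2,3,1\n"
    "1,18,1,1,444,222,0,220,276.40555,0,2,3,3\n"
)
rows = pd.read_csv(excerpt)

# One row per alternative; ``choice`` holds a count of travelers, not a 0/1 flag.
summary = rows.groupby("id_case").agg(
    n_alternatives=("id_alt", "count"),
    n_travelers=("choice", "sum"),
)
print(summary)
```

Because only an excerpt is parsed here, the counts describe those five rows and not the complete choice set for case 1.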

We can load this data easily using pandas.  We'll also set the index of the resulting DataFrame to
be the case and alt identifiers.


In [4]:
itin = pandas.read_csv(example_file("arc"), index_col=['id_case','id_alt'])
itin.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 6023 entries, (1, 1) to (105, 51)
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   choice        6023 non-null   int64  
 1   traveler      6023 non-null   int64  
 2   origin        6023 non-null   int64  
 3   destination   6023 non-null   int64  
 4   nb_cnxs       6023 non-null   int64  
 5   elapsed_time  6023 non-null   int64  
 6   fare_hy       6023 non-null   float64
 7   fare_ly       6023 non-null   float64
 8   equipment     6023 non-null   int64  
 9   carrier       6023 non-null   int64  
 10  timeperiod    6023 non-null   int64  
dtypes: float64(2), int64(9)
memory usage: 537.2 KB


In [5]:
itin.head()

id_case,id_alt,choice,traveler,origin,destination,nb_cnxs,elapsed_time,fare_hy,fare_ly,equipment,carrier,timeperiod
1,1,0,1,444,222,1,300,470.55658,0.0,1,3,7
1,2,0,1,444,222,1,345,475.92258,0.0,2,3,5
1,3,0,1,444,222,1,335,443.48166,0.0,1,3,2
1,4,0,1,444,222,1,435,433.56735,0.0,2,3,2
1,5,0,1,444,222,1,710,449.83664,0.0,2,3,2


In [6]:
d = larch.DataFrames(itin, ch='choice', crack=True, autoscale_weights=True)
d.info(1)

rescaled array of weights by a factor of 2239.980952380952


larch.DataFrames:  (not computation-ready)
  n_cases: 105
  n_alts: 127
  data_ce: 6023 rows
    - choice       (6023 non-null int64)
    - nb_cnxs      (6023 non-null int64)
    - elapsed_time (6023 non-null int64)
    - fare_hy      (6023 non-null float64)
    - fare_ly      (6023 non-null float64)
    - equipment    (6023 non-null int64)
    - carrier      (6023 non-null int64)
    - timeperiod   (6023 non-null int64)
  data_co:
    - traveler    (105 non-null int64)
    - origin      (105 non-null int64)
    - destination (105 non-null int64)
  data_av: <populated>
  data_ch: choice
  data_wt: computed_weight (/ 2239.980952380952)
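The output above reports a computed weight and a rescaling factor. One reading consistent with that output, stated here as an assumption rather than as a description of Larch's internals, is that each case is weighted by the total number of travelers counted in its ``choice`` column, and that the weights are then divided by their mean so they average to one. The sketch below illustrates that arithmetic in plain pandas on hypothetical toy counts; it does not call Larch.

```python
import pandas as pd

# Hypothetical toy data: two cases with traveler counts per alternative.
toy = pd.DataFrame(
    {
        "id_case": [1, 1, 1, 2, 2],
        "id_alt": [1, 2, 3, 1, 2],
        "choice": [0, 6, 2, 3, 1],
    }
).set_index(["id_case", "id_alt"])

# Raw case weight: total travelers observed within each case.
raw_weight = toy["choice"].groupby(level="id_case").sum()

# Assumed rescaling: divide by the mean so the weights average to 1.
scale = raw_weight.mean()
weight = raw_weight / scale

# Within-case choice shares, once the counts have been moved into the weight.
share = toy["choice"] / toy["choice"].groupby(level="id_case").transform("sum")

print(f"scale factor: {scale}")
print(weight)
print(share)
```

Under this reading, the rescaled weights sum to the number of cases, so heavily traveled cases count for more without inflating the effective sample size.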
