In [1]:
import numpy as np
import pandas as pd
import larch as larch
import transportation_tutorials as tt

# Data for Discrete Choice

## Fundamental Data Formats

When working with discrete choice models, we will need our prepared data to be 
in one of two basic formats: the case-only ("idco")
format or the case-alternative ("idca") format. These are sometimes referred to as
IDCase (each record contains all the information for mode choice over
alternatives for a single trip) and IDCase-IDAlt (each record contains all the
information for a single alternative available to each decision maker, so there is one
record for each alternative for each choice).

### idco Format

In the **idco** case-only format, each record provides all the relevant information
about an individual choice, including the variables related to the decision maker
or the choice itself, as well as alternative-related variables for all available
alternatives and a variable indicating which alternative was chosen.


In [2]:
data_co = pd.read_csv("example-data/tiny_idco.csv", index_col='caseid')
data_co

caseid,Income,CarTime,CarCost,BusTime,BusCost,WalkTime,Chosen
1,30000,30,150,40,100,20,Car
2,30000,25,125,35,100,0,Bus
3,40000,40,125,50,75,30,Walk
4,50000,15,225,20,150,10,Walk


### idca Format

In the **idca** case-alternative format, each record can include information on the variables
related to the decision maker or the choice itself, the attributes of that
particular alternative, and a choice variable that indicates whether the
alternative was or was not chosen.

In [3]:
data_ca = pd.read_csv("example-data/tiny_idca.csv")
data_ca

,caseid,altid,Income,Time,Cost,Chosen
0,1,Car,30000,30,150,1
1,1,Bus,30000,40,100,0
2,1,Walk,30000,20,0,0
3,2,Car,30000,25,125,0
4,2,Bus,30000,35,100,1
5,3,Car,40000,40,125,0
6,3,Bus,40000,50,75,0
7,3,Walk,40000,30,0,1
8,4,Car,50000,15,225,0
9,4,Bus,50000,20,150,0


The `idca` format actually has two technical variations, a sparse version and a 
dense version. The table shown above is a sparse version, where any alternative that 
is not available is simply missing from the data table.  Thus, for caseid 2 above, 
there are only 2 rows, not 3.  By dropping these rows, this data storage is potentially
more efficient than the dense version.  But in cases where the number of missing alternatives
is manageably small (less than half of all the data, certainly), it can be much more computationally
efficient to simply store and work with the dense array. 

> In *Larch*, these two sub-types of idca data carry distinct labels: 
> the dense version is labeled `idca` and the sparse version 
> is labeled `idce`. 
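
As a rough sketch of the difference between the two layouts, plain pandas can expand a sparse table into a dense one by reindexing against the full cross product of cases and alternatives. The small `data_ce` table below is a hypothetical excerpt typed in by hand, and this only illustrates the idea; it is not a description of how Larch stores or converts data internally.

```python
import pandas as pd

# Hypothetical sparse (idce-style) excerpt: case 2 has no Walk row.
data_ce = pd.DataFrame(
    {
        "caseid": [1, 1, 1, 2, 2],
        "altid": ["Car", "Bus", "Walk", "Car", "Bus"],
        "Time": [30, 40, 20, 25, 35],
        "Chosen": [1, 0, 0, 0, 1],
    }
).set_index(["caseid", "altid"])

# Every case crossed with every alternative.
full_index = pd.MultiIndex.from_product(
    [data_ce.index.unique(level="caseid"), ["Car", "Bus", "Walk"]],
    names=["caseid", "altid"],
)

# Dense layout: rows for unavailable alternatives appear, filled with NaN.
data_dense = data_ce.reindex(full_index)

# In the dense layout, availability has to be tracked explicitly.
avail = data_dense["Time"].notna()
```

The dense table has one row per case-alternative pair, at the cost of needing a separate availability indicator.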

### Data Conversion

Converting between `idca` and `idco` format data in Python can be quite easy if the alternative
id's are stored appropriately in a two-level MultiIndex. In that case, we can simply `stack` or `unstack` the DataFrame to change formats.  This is typically more readily available when switching from `idca` to `idco`
format, as the alternative id's typically appear in a column of the DataFrame that can be used for indexing.

In [4]:
data_ca.set_index(['caseid', 'altid']).unstack()

,Income,Income,Income,Time,Time,Time,Cost,Cost,Cost,Chosen,Chosen,Chosen
altid,Bus,Car,Walk,Bus,Car,Walk,Bus,Car,Walk,Bus,Car,Walk
caseid,,,,,,,,,,,,
1,30000.0,30000.0,30000.0,40.0,30.0,20.0,100.0,150.0,0.0,0.0,1.0,0.0
2,30000.0,30000.0,,35.0,25.0,,100.0,125.0,,1.0,0.0,
3,40000.0,40000.0,40000.0,50.0,40.0,30.0,75.0,125.0,0.0,0.0,0.0,1.0
4,50000.0,50000.0,50000.0,20.0,15.0,10.0,150.0,225.0,0.0,0.0,0.0,1.0


Getting our original `idco` data into `idca` format is not so clean, as there's no analogous
`set_columns` method in pandas, and even if there were, the alternative codes are not typically
neatly arranged in a row of data. We can force it to work, but it's not pretty.

In [5]:
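# Relabel each idco column as an (alternative, variable) pair, in column order.
# Income and Chosen are not alternative-specific, so they are parked under 'Car'.
forced_ca = data_co.T.set_index(
    pd.MultiIndex.from_tuples([
        ['Car', 'Income'],
        ['Car', 'Time'],
        ['Car', 'Cost'],
        ['Bus', 'Time'],
        ['Bus', 'Cost'],
        ['Walk', 'Time'],
        ['Car', 'Chosen'],
    ], names=('alt', 'var'))
).T.stack(0)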
forced_ca[['Chosen', 'Income']] = forced_ca[['Chosen', 'Income']].groupby("caseid").transform(
    lambda x: x.fillna(x.value_counts().index[0])
)
forced_ca['Chosen'] = (
    forced_ca['Chosen'] == forced_ca.index.get_level_values('alt')
).astype(float)
forced_ca

caseid,alt,Chosen,Cost,Income,Time
1,Bus,0.0,100.0,30000,40
1,Car,1.0,150.0,30000,30
1,Walk,0.0,,30000,20
2,Bus,1.0,100.0,30000,35
2,Car,0.0,125.0,30000,25
2,Walk,0.0,,30000,0
3,Bus,0.0,75.0,40000,50
3,Car,0.0,125.0,40000,40
3,Walk,1.0,,40000,30
4,Bus,0.0,150.0,50000,20


The good news is that it's not generally necessary to swap the data formats when working with Larch for discrete choice models, as Larch can natively handle both formats of data.

## Data Encoding

For the most part, data used in the utility functions of discrete choice models enters into the utility function as part of a linear-in-parameters function.  That is, we have some "data" that expresses an attribute of some part of the transportation system as a number, we multiply that by some numerical parameter that will be estimated, and we sum up the total over all the data-times-parameter operations.  This kind of structure is known as "linear algebra" and it's something computers can do super fast, as long as all the data and all the parameters are queued up in memory in the right formats.  So, typically it is optimal to pre-compute the "data" part of the process into one large contiguous array of floating point values, regardless of whether the values otherwise seem to be binary or integers. Most tools, such as Larch, will do much of this work for you, so you don't need to worry about it too much.

There are two notable exceptions to this guideline: 

- *choices*: the data that represents the observed choices, which are inherently categorical
- *availability*: data that represents the availability of each choice, which is inherently boolean
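
The linear-in-parameters computation described above can be sketched in a few lines of NumPy. The parameter values here are hypothetical numbers chosen only for illustration, not estimates from any model.

```python
import numpy as np

# Rows are alternatives for a single case (Car, Bus); columns are [Time, Cost].
x = np.array(
    [[30.0, 150.0],
     [40.0, 100.0]],
    dtype=np.float64,
)

# Hypothetical parameters for Time and Cost (illustrative values only).
beta = np.array([-0.1, -0.01])

# Data-times-parameter, summed over the columns: one utility per alternative.
utility = x @ beta
```

Because `x` is a single contiguous float64 array, the whole computation is one matrix-vector product.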

### Categorical Encoding

When we are looking at discrete choices, it is natural to employ a categorical data type for at least the "choice" data itself, if not for other columns as well.  Pandas can convert columns to categorical data simply by assigning the type "category".

In [6]:
choices = data_co['Chosen'].astype("category")
choices

caseid
1     Car
2     Bus
3    Walk
4    Walk
Name: Chosen, dtype: category
Categories (3, object): ['Bus', 'Car', 'Walk']

Once we have categorical data, if we like we can work with the underlying code values instead of the original raw data.

In [7]:
choices.cat.codes

caseid
1    1
2    0
3    2
4    2
dtype: int8

The `cat.categories` attribute contains the array of values matching each of the codes.

In [8]:
choices.cat.categories

Index(['Bus', 'Car', 'Walk'], dtype='object')

When using `astype("category")`, pandas picks the ordering of the categories for us (here, sorted alphabetically).  If we want
to control the apparent order (e.g. we already have codes defined elsewhere such that Car is 1, Bus is 2, and Walk is 3) then we can explicitly set the category value positions using `pd.CategoricalDtype` instead of `"category"`.
Note that the `cat.codes` numbers used internally by categoricals start with zero, as is standard in Python,
so if you want codes to start with 1 you need to include a dummy placeholder for zero.

In [9]:
choices1 = data_co['Chosen'].astype(pd.CategoricalDtype(['_','Car','Bus','Walk']))
choices1

caseid
1     Car
2     Bus
3    Walk
4    Walk
Name: Chosen, dtype: category
Categories (4, object): ['_', 'Car', 'Bus', 'Walk']

In [10]:
choices1.cat.codes

caseid
1    1
2    2
3    3
4    3
dtype: int8

To be clear, by asserting the *placement* ordering of alternatives like this, we are not simultaneously asserting that the alternatives are ordinal.  Put another way, we are forcing Car to be coded as 1 and Bus to be coded as 2, but we are not saying that Car is less than Bus.  When the categories genuinely are ordinal, pandas categoricals can express that too, by adding `ordered=True` to the CategoricalDtype. 

In [11]:
pd.CategoricalDtype(['NoCars','1Car','2Cars','3+Cars'], ordered=True)

CategoricalDtype(categories=['NoCars', '1Car', '2Cars', '3+Cars'], ordered=True)
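
As a small illustration of what `ordered=True` buys us, an ordered categorical supports comparisons, sorting, and `min`/`max` in category order. The `autos` series below is made-up example data.

```python
import pandas as pd

car_dtype = pd.CategoricalDtype(["NoCars", "1Car", "2Cars", "3+Cars"], ordered=True)

# Made-up household vehicle ownership values.
autos = pd.Series(["1Car", "NoCars", "3+Cars", "2Cars"], dtype=car_dtype)

# Comparisons follow the declared category order, not alphabetical order.
has_multiple = autos >= "2Cars"

# Sorting and min/max also respect the declared order.
fewest = autos.min()
in_order = autos.sort_values().tolist()
```

Without `ordered=True`, the comparison `autos >= "2Cars"` would raise a `TypeError`.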

### One Hot Encoding

One-hot encoding, also known as dummy variables, is the creation of a separate binary-valued column for every categorical value.  We can convert a categorical data column into a set of one-hot encoded columns using the `get_dummies` function.

In [12]:
pd.get_dummies(choices)

caseid,Bus,Car,Walk
1,0,1,0
2,1,0,0
3,0,0,1
4,0,0,1


It's not required to have first converted the data to a categorical data type.

In [13]:
pd.get_dummies(data_co['Chosen'])

caseid,Bus,Car,Walk
1,0,1,0
2,1,0,0
3,0,0,1
4,0,0,1
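
The dtype of the dummy columns depends on the pandas version (recent releases return boolean columns by default, while older ones return small integers). If floating point columns are wanted for later array computation, the `dtype` argument can request them directly. The `chosen` series below is retyped from the table above so that the example stands alone.

```python
import numpy as np
import pandas as pd

# Hand-typed copy of the 'Chosen' column from the tiny idco table.
chosen = pd.Series(
    ["Car", "Bus", "Walk", "Walk"],
    index=pd.Index([1, 2, 3, 4], name="caseid"),
    name="Chosen",
)

# Ask for float64 columns explicitly, rather than relying on the default dtype.
dummies = pd.get_dummies(chosen, dtype=np.float64)

# Each row should have exactly one chosen alternative.
row_totals = dummies.sum(axis=1)
```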


## Adding on to Data

If you've seen a number of household travel surveys, you know that they typically include several tables of data, for households, persons, trips, and sometimes other things like vehicles.  The tables reveal the actual (or at least, reported as actual) travel behaviors of people: where do they go, when, by what mode, et cetera.  But these tables don't always contain all the data we need to build or use a discrete choice model.  It's not enough to know the factual data (I drove to work, it took 35 minutes and cost \$3.50 in gas and tolls); we also need counter-factual data (I didn't take the train, which would have taken 40 minutes and cost \$2).  We can pull this counter-factual data out of model representations of all the various transportation options, but to do so we'll need to interface our survey data with our model data.

### Geocoding

One of the first things we might need to do is to geocode trip ends. Often this gets done relatively early in data processing, and traffic analysis zone (TAZ) ID's for origin and destination may already be attached.  But if that's not already done, we can do it in Python.

In [14]:
trips = pd.read_csv("example-data/tiny_points.csv", index_col='caseid')
trips

caseid,orig_x,orig_y,dest_x,dest_y
1,946907,925689,884665,494832
2,952193,945628,892053,478115
3,889200,491106,959298,919813
4,877274,476391,947965,925746


If we have a shapefile for the TAZ's, we can map those points to the locations and attach the TAZ codes directly to the DataFrame.  For this geographic task, we'll use the `geopandas` package, and that shapefile.

In [15]:
import geopandas as gpd
taz_shp = gpd.read_file(tt.data("SERPM8-TAZSHAPE"))
taz_shp.head()

,OBJECTID,TAZ_REG,TAZ_OLD05,TAZ_MPO,COUNTY,CENSUSTAZ,TAZ_BF,FIX,AREA,F_NETAREA,CBD,HM_ROOMS,Shape_Leng,Shape_Area,geometry
0,1,1122.0,1122,1122,1.0,,0,0,4442490.0,0.8153,0,0,10592.846522,4442490.0,"POLYGON ((936374.674 959539.568, 936373.444 95..."
1,2,17.0,17,17,1.0,,0,0,15689400.0,0.8571,0,0,17396.297932,15689380.0,"POLYGON ((942254.500 952920.937, 942255.812 95..."
2,3,1123.0,1123,1123,1.0,,0,0,17396100.0,0.8663,0,0,23585.421941,17396130.0,"POLYGON ((940953.561 952985.069, 940953.437 95..."
3,4,1120.0,1120,1120,1.0,,0,0,1303420.0,0.8536,0,0,7202.864864,1303422.0,"POLYGON ((953119.000 951985.375, 953045.807 95..."
4,5,1121.0,1121,1121,1.0,,0,0,31477500.0,0.8787,0,0,24940.959492,31477480.0,"POLYGON ((934328.283 951600.585, 934327.451 94..."


To complete the geocoding, first we need to assemble the longitude and latitude (or x and y) columns into an
array of points.

In [16]:
trips_orig_points_array = gpd.points_from_xy(
    trips.orig_x, trips.orig_y, crs=taz_shp.crs
)
trips_orig_points_array

<GeometryArray>
[<shapely.geometry.point.Point object at 0x7fa9202f03d0>,
 <shapely.geometry.point.Point object at 0x7fa9202f0400>,
 <shapely.geometry.point.Point object at 0x7fa9202f00a0>,
 <shapely.geometry.point.Point object at 0x7fa9202f04c0>]
Length: 4, dtype: geometry

Then, we wrap that array into a GeoDataFrame, attaching the same `index` as on our original `trips` table.

In [17]:
trips_orig_points = gpd.GeoDataFrame(
    geometry=trips_orig_points_array,
    index=trips.index,
)
trips_orig_points

caseid,geometry
1,POINT (946907.000 925689.000)
2,POINT (952193.000 945628.000)
3,POINT (889200.000 491106.000)
4,POINT (877274.000 476391.000)


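Finally, we run a spatial join (`sjoin`) between that GeoDataFrame and the TAZ shapes, joining using the 'within' operator.  We'll take the `TAZ_MPO` field from the result, and write that to the trips DataFrame as the origin TAZ.  (The `op` argument used here is the older spelling; GeoPandas 0.10 and later name this argument `predicate`, and `op` was removed in GeoPandas 1.0.)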

In [18]:
trips['otaz'] = gpd.sjoin(
    trips_orig_points, taz_shp, how='left', op='within',
).TAZ_MPO
trips

caseid,orig_x,orig_y,dest_x,dest_y,otaz
1,946907,925689,884665,494832,1129
2,952193,945628,892053,478115,1141
3,889200,491106,959298,919813,1132
4,877274,476391,947965,925746,1149


We can do the same for the destination TAZ, all in one fell swoop.

In [19]:
trips['dtaz'] = gpd.sjoin(
    gpd.GeoDataFrame(
        geometry=gpd.points_from_xy(
            trips.dest_x, trips.dest_y, crs=taz_shp.crs
        ),
        index=trips.index,
    ), 
    taz_shp, how='left', op='within',
).TAZ_MPO

trips

caseid,orig_x,orig_y,dest_x,dest_y,otaz,dtaz
1,946907,925689,884665,494832,1129,1122
2,952193,945628,892053,478115,1141,1136
3,889200,491106,959298,919813,1132,1144
4,877274,476391,947965,925746,1149,1129


### Attaching Skims

After attaching our trip end zone id's, we're ready to pull in the level-of-service (LOS) attributes for the various alternatives.  Many times, these LOS attributes come in the form of 
"skims".  Skims are typically a bunch of square arrays, showing the cost or time or other 
level of service attributes between origins and destinations.  They come in a variety of formats,
although it is increasingly common to see "openmatrix" (OMX) or `.omx` files used to exchange skim data
between different tools.  To read OMX files in Python, we can use the `openmatrix` package, or an optimized
OMX reader that's built into Larch.

In [20]:
skims = larch.OMX(tt.data('SERPM8-JUPITER-AMHSKIMS'))
skims

<larch.OMX> ⋯/SERPM8-JUPITER-AMHSKIMS.omx
 |  shape:(220, 220)
 |  data:
 |    AM_DAT_DIST      (float64)
 |    AM_DAT_FFTIME    (float64)
 |    AM_DAT_TIME      (float64)
 |    AM_DAT_TOLLCOST  (float64)
 |    AM_DAT_TOLLDIST  (float64)
 |    AM_GP_DIST       (float64)
 |    AM_GP_FFTIME     (float64)
 |    AM_GP_TIME       (float64)
 |    AM_S2NH_DIST     (float64)
 |    AM_S2NH_FFTIME   (float64)
 |    AM_S2NH_HOVDIST  (float64)
 |    AM_S2NH_TIME     (float64)
 |    AM_S2TH_DIST     (float64)
 |    AM_S2TH_FFTIME   (float64)
 |    AM_S2TH_HOVDIST  (float64)
 |    AM_S2TH_TIME     (float64)
 |    AM_S2TH_TOLLCOST (float64)
 |    AM_S2TH_TOLLDIST (float64)
 |  lookup:
 |    TAZ_ID (220 int64)

The OMX file can contain two different sets of data. The "data" includes all the various LOS attributes, in a bunch of square arrays. The "lookup" includes one-dimensional arrays (vectors) that can include both meaningful data (e.g. hourly parking costs by zone) as well as metadata (e.g. the TAZ ID's).  In the file we've opened, there are only those ID's.  We can get a little more technical detail about each data item by inspecting them individually in our Jupyter notebook.

In [21]:
skims.lookup['TAZ_ID']

/lookup/TAZ_ID (CArray(220,), shuffle, zlib(1)) ''
  atom := Int64Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := (8192,)

If we add a slice-all operation (`[:]`), we can load the actual data into memory as a numpy array.

In [22]:
skims.lookup['TAZ_ID'][:]

array([   1,    2,    3,    4,    5,    6,    7,    8,    9,   10,   11,
         12,   13,   14,   15,   16,   17,   18,   19,   20,   21,   22,
         23,   24,   25,   26,   27,   28,   29,   30,   31,   32,   33,
         34,   35,   36,   37,   38,   39,   40,   41,   42,   43,   44,
         45,   46,   47,   48,   49,   50,   51,   52,   53,   54,   55,
         56,   57,   58,   59,   60,   61,   62,   63,   64,   65,   66,
         67,   68,   69,   70,   71,   72,   73,   74,   75,   80,   81,
         82,   83,   84,   87,   88,   89,   90,   91,   92,   93,   95,
         98,   99,  100,  101,  102,  103,  792,  793,  794,  795,  796,
        797,  798,  799,  800,  825,  831,  832,  833,  834,  835,  836,
        837,  838,  839,  840,  841,  842,  843,  844,  845,  846,  847,
        848,  849,  850,  851,  852,  853,  885,  886,  887,  888,  889,
        890,  891,  892,  893, 1067, 1068, 1103, 1104, 1105, 1106, 1120,
       1121, 1122, 1123, 1124, 1125, 1126, 1127, 11

This OMX file only includes skims between 220 zones around Jupiter, and not the whole SERPM region, so the TAZ ID's are not a contiguous set of integers. Therefore, if we want to work with this data we have to map our known TAZ ID's to positions in this non-contiguous set.  Even if it were contiguous, we'd still need to create a mapping, as TAZ ID's typically start at 1 but array indexing in Python starts at zero.

In [23]:
taz_idx_map = {i: n for n, i in enumerate(skims.lookup['TAZ_ID'])}
trips['otaz_ix'] = trips['otaz'].map(taz_idx_map)
trips['dtaz_ix'] = trips['dtaz'].map(taz_idx_map)
trips

caseid,orig_x,orig_y,dest_x,dest_y,otaz,dtaz,otaz_ix,dtaz_ix
1,946907,925689,884665,494832,1129,1122,151,144
2,952193,945628,892053,478115,1141,1136,163,158
3,889200,491106,959298,919813,1132,1144,154,166
4,877274,476391,947965,925746,1149,1129,171,151
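
A dictionary-based `map` leaves `NaN` behind for any TAZ ID that is not in the lookup, which also converts the column to floating point. As an alternative sketch, `pandas.Index.get_indexer` does the same lookup but returns `-1` for unmatched values, which keeps the result as integers and makes missing zones easy to flag. The ID values below are a hypothetical short list, not the actual contents of the skim file.

```python
import numpy as np
import pandas as pd

# Hypothetical, non-contiguous set of zone IDs in skim order.
taz_ids = np.array([1, 2, 3, 80, 81, 1120, 1121])
taz_index = pd.Index(taz_ids)

# Hypothetical trip-end zones; 999 is deliberately not in the lookup.
trip_taz = pd.Series([81, 1120, 999])

# Zero-based positions within the skim arrays, or -1 where there is no match.
positions = taz_index.get_indexer(trip_taz.to_numpy())
unmatched = positions < 0
```

Checking `unmatched.any()` before indexing into the skims guards against silently reading the wrong cell, since `-1` is itself a valid (last-element) array index.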


The "data" within the OMX file can also be accessed as if it is a dictionary of numpy arrays.  Each dictionary value isn't actually an array, but it exposes a very similar interface, and can be used to read arrays directly into memory.

In [24]:
skims["AM_GP_TIME"][:]

array([[ 1.992, 12.103,  8.116, ..., 11.371, 10.829, 11.093],
       [12.214,  2.516,  7.268, ..., 13.318, 14.36 ,  9.679],
       [ 8.258,  7.298,  2.491, ..., 11.785, 11.244, 11.507],
       ...,
       [11.287, 13.159, 11.56 , ...,  1.566,  3.523,  5.554],
       [10.714, 14.16 , 10.986, ...,  3.531,  1.216,  6.555],
       [11.168,  9.58 , 11.441, ...,  5.686,  6.727,  0.985]])

When we are building a discrete choice model for travel behavior, it is pretty common to need to attach individual values from the skims onto a table of data.  For example, we have our table of trip observations, which contains a number of trips with the origin and destination zones coded.  We can pass the columns of origin and destination zone indexes as the two dimensions of indexing, and get back a same-size set of values extracted from that array.

In [25]:
skims["AM_GP_TIME"][trips.otaz_ix, trips.dtaz_ix]

array([15.411,  6.582,  6.038,  7.752])
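
The same paired-index lookup works on any NumPy array. This sketch uses a small made-up matrix to show that passing two equal-length index arrays returns one value per (origin, destination) pair, rather than a rectangular block.

```python
import numpy as np

# Made-up 4x4 "skim" where the value at [o, d] is 4*o + d.
skim = np.arange(16, dtype=np.float64).reshape(4, 4)

# Zero-based origin and destination positions for three trips.
o = np.array([0, 2, 3])
d = np.array([1, 1, 0])

# Paired indexing: element i of the result is skim[o[i], d[i]].
per_trip = skim[o, d]

# By contrast, np.ix_ selects the full 3x3 block of all o-by-d combinations.
block = skim[np.ix_(o, d)]
```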

We'll probably want to attach many values from a single set of skims, and the `OMX` reader in Larch
offers a convenience method to get all of them in a single command.

In [26]:
skims_for_trips = skims.get_rc_dataframe(trips.otaz_ix, trips.dtaz_ix)
skims_for_trips

caseid,AM_DAT_DIST,AM_DAT_FFTIME,AM_DAT_TIME,AM_DAT_TOLLCOST,AM_DAT_TOLLDIST,AM_GP_DIST,AM_GP_FFTIME,AM_GP_TIME,AM_S2NH_DIST,AM_S2NH_FFTIME,AM_S2NH_HOVDIST,AM_S2NH_TIME,AM_S2TH_DIST,AM_S2TH_FFTIME,AM_S2TH_HOVDIST,AM_S2TH_TIME,AM_S2TH_TOLLCOST,AM_S2TH_TOLLDIST
1,7.384,15.406,15.411,0.0,0.0,7.384,15.406,15.411,7.384,15.406,0.0,15.411,7.384,15.406,0.0,15.411,0.0,0.0
2,3.54,6.478,6.582,0.0,0.0,3.54,6.478,6.582,3.54,6.478,0.0,6.582,3.54,6.478,0.0,6.582,0.0,0.0
3,3.245,6.013,6.038,0.0,0.0,3.245,6.013,6.038,3.245,6.013,0.0,6.038,3.245,6.013,0.0,6.038,0.0,0.0
4,4.343,7.732,7.752,0.0,0.0,4.343,7.732,7.752,4.343,7.732,0.0,7.752,4.343,7.732,0.0,7.752,0.0,0.0


If we want, we can concatenate the result with our original trips DataFrame, and have one large table to work with.

In [27]:
pd.concat([trips, skims_for_trips], axis=1)

caseid,orig_x,orig_y,dest_x,dest_y,otaz,dtaz,otaz_ix,dtaz_ix,AM_DAT_DIST,AM_DAT_FFTIME,...,AM_S2NH_DIST,AM_S2NH_FFTIME,AM_S2NH_HOVDIST,AM_S2NH_TIME,AM_S2TH_DIST,AM_S2TH_FFTIME,AM_S2TH_HOVDIST,AM_S2TH_TIME,AM_S2TH_TOLLCOST,AM_S2TH_TOLLDIST
1,946907,925689,884665,494832,1129,1122,151,144,7.384,15.406,...,7.384,15.406,0.0,15.411,7.384,15.406,0.0,15.411,0.0,0.0
2,952193,945628,892053,478115,1141,1136,163,158,3.54,6.478,...,3.54,6.478,0.0,6.582,3.54,6.478,0.0,6.582,0.0,0.0
3,889200,491106,959298,919813,1132,1144,154,166,3.245,6.013,...,3.245,6.013,0.0,6.038,3.245,6.013,0.0,6.038,0.0,0.0
4,877274,476391,947965,925746,1149,1129,171,151,4.343,7.732,...,4.343,7.732,0.0,7.752,4.343,7.732,0.0,7.752,0.0,0.0


## Assembling Data for Estimation

We've got a lot of different parts of our data laid out so far.  All that's left is to assemble all this data together, to get ready for model estimation.  For estimation with Larch, there's a `DataFrames` object designed to help with exactly this.

In [28]:
from larch import DataFrames

As the basic starting point for our `DataFrames`, we can use either an `idco` or `idca` format table.
The simplest one is the `idco` format input, as there are very few requirements for this input type.
All we need is a DataFrame with an integer-type index (a default RangeIndex meets this requirement).

In [29]:
DataFrames(data_co).info(verbose=True)

larch.DataFrames:  (not computation-ready)
  n_cases: 4
  n_alts: 0
  data_ca: <not populated>
  data_co:
    - Income   (4 non-null int64)
    - CarTime  (4 non-null int64)
    - CarCost  (4 non-null int64)
    - BusTime  (4 non-null int64)
    - BusCost  (4 non-null int64)
    - WalkTime (4 non-null int64)
    - Chosen   (4 non-null object)


If we just create our DataFrames with an `idco` table and nothing else, as we do above, 
it will work fine, although the
resulting DataFrames object will be missing a bunch of relevant information, like how many alternatives
are represented, what their names are, which alternatives were available 
and which alternative was chosen in each case. 
We can add this information as extra arguments for the DataFrames
constructor.

In [30]:
dfs = DataFrames(
    data_co, 
    ch='Chosen',
    alt_codes=[1,2,3],
    alt_names=['Car','Bus','Walk'], 
    av={
        1: True,
        2: True,
        3: 'WalkTime > 0'
    }
)
dfs.info(verbose=True)

larch.DataFrames:  (not computation-ready)
  n_cases: 4
  n_alts: 3
  data_ca: <not populated>
  data_co:
    - Income   (4 non-null int64)
    - CarTime  (4 non-null int64)
    - CarCost  (4 non-null int64)
    - BusTime  (4 non-null int64)
    - BusCost  (4 non-null int64)
    - WalkTime (4 non-null int64)
    - Chosen   (4 non-null object)
  data_av: <populated>
  data_ch: Chosen


Larch also allows us to start with `idca` data instead.  However, there are 
some more particular requirements on how the `idca` data should be formatted
before passing it to the DataFrames constructor.  Instead of requiring an 
integer-type index, the `idca` data must have a two-level pandas MultiIndex,
and the first level needs to be an integer type, giving caseids. 
The second level gives alternative ids, and it can be a non-integer set of 
categorical values (or it can be integer code numbers).  If we ignore this, and just
put our single-level index DataFrame into the constructor, Larch will assume it's
actually `idco` data, as we see below where there appear to be 11 cases.

In [31]:
dfs = DataFrames(data_ca)
dfs.info(verbose=True)

larch.DataFrames:  (not computation-ready)
  n_cases: 11
  n_alts: 0
  data_ca: <not populated>
  data_co:
    - caseid (11 non-null int64)
    - altid  (11 non-null object)
    - Income (11 non-null int64)
    - Time   (11 non-null int64)
    - Cost   (11 non-null int64)
    - Chosen (11 non-null int64)


If we set our index correctly first, we'll get a better result, which shows 
11 rows in the data_ce dataframe, but only 4 cases total.

In [32]:
dfs = DataFrames(data_ca.set_index(['caseid','altid']))
dfs.info(verbose=True)

larch.DataFrames:  (not computation-ready)
  n_cases: 4
  n_alts: 3
  data_ce: 11 rows
    - Income (11 non-null int64)
    - Time   (11 non-null int64)
    - Cost   (11 non-null int64)
    - Chosen (11 non-null int64)
  data_co: <not populated>
  data_av: <populated>


Here, Larch has detected the alternative names given in the second level of the 
index, and makes them available later.

In [33]:
dfs.alternative_names()

['Bus', 'Car', 'Walk']

Also, you might note that Larch has automatically detected that the provided
dataframe is in sparse-format and stored it not as `data_ca` but instead as `data_ce`.
If we use actual dense-format inputs, we'll get a DataFrames object that uses the 
dense `data_ca`.

In [34]:
dfs = DataFrames(forced_ca)
dfs.info(verbose=True)

larch.DataFrames:  (not computation-ready)
  n_cases: 4
  n_alts: 3
  data_ca:
    - Chosen (12 non-null float64)
    - Cost   (8 non-null object)
    - Income (12 non-null int64)
    - Time   (12 non-null object)
  data_co: <not populated>


Although Larch can auto-detect the alternatives when loading data in `idca` format, 
it can't auto-detect which column contains the "choice" data. And when we provide 
dense-format inputs, Larch also can't infer availability without help. Just as we provided
some helpful hints to fill in the gaps for the `idco` input, we can do so for the
`idca` input as well.

In [35]:
dfs = DataFrames(forced_ca, ch='Chosen', av='Time>0')
dfs.info(verbose=True)

larch.DataFrames:  (not computation-ready)
  n_cases: 4
  n_alts: 3
  data_ca:
    - Chosen (12 non-null float64)
    - Cost   (8 non-null object)
    - Income (12 non-null int64)
    - Time   (12 non-null object)
  data_co: <not populated>
  data_av: Time>0
  data_ch: Chosen


When loading `idca` data, we can opt to let Larch automatically identify data columns that
do not vary within observations (and therefore can be represented more succinctly as `idco` data),
and "crack" the data into its constituent `idca` and `idco` parts. For efficiency reasons, only 
numeric columns are processed in this manner, so if we want it to work as expected we need
to ensure all our numeric data is actually stored in numeric dtypes.

In [36]:
dfs = DataFrames(forced_ca.astype(np.float64), ch='Chosen', av='Time>0', crack=True)
dfs.info(verbose=True)

larch.DataFrames:
  n_cases: 4
  n_alts: 3
  data_ca:
    - Chosen (12 non-null float64)
    - Cost   (8 non-null float64)
    - Time   (12 non-null float64)
  data_co:
    - Income (4 non-null float64)
  data_av: Time>0
  data_ch: Chosen


If you've noticed the "not computation-ready" indicators on the various 
outputs above, you might also have noticed that this last DataFrames is
the first that doesn't bear that mark. These flags appear when
one or more data columns are not presently stored
in double precision ("float64") format, and for this last DataFrames
we forced all the data into the correct format. It's not a big problem if
we don't do so, as Larch will automatically 
convert data into this format when you try to estimate a model.

## Data Subsets

Sometimes, you'll assemble a full DataFrames based on a "complete" dataset, but you'll 
only want to work on a discrete choice model for a subset of the data.  This can happen,
for example, if you build the DataFrames off the "trips" table from a household travel
survey, but you want to estimate a mode choice model only for home-based work trips.

When the query filter for the dataset you want can be expressed as a function of the 
`data_co` in a DataFrames, you can use the `selector_co` method to make a new DataFrames
that is the correct subset of all your data -- `idca`, choice, availability, and weighting
data are also sliced along with the `idco` data.

In [37]:
dfs1 = dfs.selector_co("Income > 35_000")
dfs1.info(verbose=True)

larch.DataFrames:
  n_cases: 2
  n_alts: 3
  data_ca:
    - Chosen (6 non-null float64)
    - Cost   (4 non-null float64)
    - Time   (6 non-null float64)
  data_co:
    - Income (2 non-null float64)
  data_av: <populated>
  data_ch: <populated>
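
For intuition, the same kind of coordinated subsetting can be sketched by hand in pandas: filter the case-level table, then keep only the case-alternative rows whose caseid survived. The tables below are made-up stand-ins, and this is an illustration of the idea rather than Larch's implementation of `selector_co`.

```python
import pandas as pd

# Made-up case-level (idco) table.
co = pd.DataFrame(
    {"Income": [30000.0, 30000.0, 40000.0, 50000.0]},
    index=pd.Index([1, 2, 3, 4], name="caseid"),
)

# Made-up dense case-alternative (idca) table, three alternatives per case.
idx = pd.MultiIndex.from_product(
    [[1, 2, 3, 4], ["Bus", "Car", "Walk"]], names=["caseid", "altid"]
)
ca = pd.DataFrame({"Time": range(12)}, index=idx)

# Select cases on an idco condition.
keep = co.index[co["Income"] > 35000]
co_subset = co.loc[keep]

# Slice the idca rows to the same set of cases.
ca_subset = ca[ca.index.get_level_values("caseid").isin(keep)]
```

The subset keeps 2 cases and 6 case-alternative rows, matching the counts in the `info` output above.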
