# 203: Exampville Destination Choice

Welcome to Exampville, the best simulated town in this here part of the internet!

Exampville is a demonstration provided with Larch that walks through some of the 
data and tools that a transportation planner might use when building a travel model. 

In [None]:
import larch, numpy, pandas, os
from larch import P, X
import larch.numba as lx

In [None]:
larch.__version__

In this example notebook, we will walk through the estimation of a tour 
destination choice model.  First, let's load the data files from
our example.

In [None]:
hh, pp, tour, skims, emp = larch.example(200, ['hh', 'pp', 'tour', 'skims', 'emp'])

For this destination choice model, we'll want to use the mode choice
logsums we calculated previously from the mode choice estimation,
but we'll use these values as fixed input data instead of a modeled value.  
We can load these logsums from the file in which they were saved. 
For this example, we can identify that file using the `larch.example` 
function, which will automatically rebuild the file if it doesn't exist.
In typical applications, a user would generally just give the filename 
as a string and ensure manually that the file exists before loading it.

In [None]:
logsums_file = larch.example(202, output_file='/tmp/logsums.pkl.gz')
logsums = pandas.read_pickle(logsums_file)
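
Outside of this example, where the filename is given directly as a string, it can help to check for the file before loading it. The following is a minimal sketch of that check on toy data; the helper `load_logsums` and the file name `demo_logsums.pkl.gz` are hypothetical names used only for illustration and are not part of Larch.

```python
import os
import tempfile

import pandas


def load_logsums(filename):
    """Load a pickled logsums table, failing early if the file is missing."""
    # Hypothetical helper (not part of Larch): check for the file explicitly
    # so that a missing file produces a clear error message.
    if not os.path.exists(filename):
        raise FileNotFoundError(f"logsums file not found: {filename}")
    return pandas.read_pickle(filename)


# Toy round trip: write a small table to a temporary file, then reload it.
demo_dir = tempfile.mkdtemp()
demo_file = os.path.join(demo_dir, "demo_logsums.pkl.gz")
demo_logsums = pandas.DataFrame([[-1.5, -2.0], [-0.5, -3.0]], columns=[1, 2])
demo_logsums.to_pickle(demo_file)  # compression is inferred from the extension
reloaded = load_logsums(demo_file)
```

The `.gz` extension lets pandas infer the compression both when writing and when reading, so no extra arguments are needed.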

## Preprocessing

We'll replicate the pre-processing used in the mode choice estimation,
to merge the household and person characteristics into the tours data,
add the index values for the home TAZs, filter to include only 
work tours, and merge with the level of service skims. (If this 
pre-processing were computationally expensive, it would probably be
better to save the results to disk and reload them as needed,
but for this model these commands run almost instantaneously.)

In [None]:
raw = tour.merge(hh, on='HHID').merge(pp, on=['HHID', 'PERSONID'])
raw["HOMETAZi"] = raw["HOMETAZ"] - 1
raw["DTAZi"] = raw["DTAZ"] - 1
raw = raw[raw.TOURPURP == 1]
raw.index.name = 'CASE_ID'
raw

The alternatives in the destination choice model are much more regular 
than in the mode choice model, as every observation will have a similar 
set of alternatives and the utility function for each of those 
alternatives will share a common functional form.  We'll leverage this 
by using `idca` format to make data management simpler.  

First, we'll assemble some individual variables that we'll want to use.
We can build an array to represent the distance to each destination based
on the `"AUTO_DIST"` matrix in the `skims` OMX file. 

In [None]:
distance = pandas.DataFrame(
    data=skims.AUTO_DIST[:][raw["HOMETAZi"], :],
    index=raw.index,
    columns=skims.TAZ_ID,
) 

This command pulls the relevant row, identified by the `"HOMETAZi"` column
in the `raw` data, into each row of a new DataFrame, which has a row for each
case and a column for each alternative. 

Note that the `[:]` 
inserted into the `data` argument is used to instruct the `pytables` module
to load the entire matrix into memory, and then `numpy` indexing is used to 
actually select out the rows needed.  This is a technical limitation of
the `pytables` module and could theoretically be a very computationally 
expensive step if the skims matrix is huge relative to the number of rows in
the `raw` DataFrame. However, in practice a single matrix from the skims file
is generally not that large compared to the number of observations, and this
step can be processed quite efficiently.
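
The row selection itself can be illustrated on a small in-memory array. This is a sketch on made-up numbers: `auto_dist` stands in for the fully loaded skim matrix and `home_taz_index` for the zero-based `"HOMETAZi"` column; neither name comes from the example data.

```python
import numpy as np

# Toy 4-zone distance matrix standing in for the loaded skim matrix
# (values are made up for illustration).
auto_dist = np.array([
    [0.5, 2.0, 3.5, 5.0],
    [2.0, 0.4, 1.5, 3.0],
    [3.5, 1.5, 0.6, 1.8],
    [5.0, 3.0, 1.8, 0.7],
])

# Zero-based home zone index for three cases, playing the role of HOMETAZi.
home_taz_index = np.array([2, 0, 2])

# Integer-array indexing copies one full matrix row per case, so the
# result has one row per case and one column per destination zone.
distance_rows = auto_dist[home_taz_index, :]
```

Cases that share a home zone simply receive identical rows, which is why the result can have many more rows than the matrix itself.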

The logsums we previously loaded are in the same format as the `distance` 
data, with a row for each case and a column for each alternative. To use the 
`idca` format, we'll reshape these data, so each is a single column 
(i.e., a `pandas.Series`), with a two-level `MultiIndex` giving case and 
alternative respectively, and then assemble these columns into a single 
DataFrame.  We can do the reshaping using the `stack` method, and we will
make sure each resulting Series has an appropriate name using `rename`, before 
we combine them together using `pandas.concat`:

In [None]:
ca = pandas.concat([
    distance.stack().rename("distance"),
    logsums.stack().rename("logsum"), 
], axis=1)
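
To see what this reshaping produces, here is the same pattern applied to two tiny tables. The case and zone labels and all values are made up; only the `stack`, `rename`, and `pandas.concat` calls mirror the cell above.

```python
import pandas

# Two cases and three zones, labeled so the stacked index levels get names.
cases = pandas.Index([10, 11], name="CASE_ID")
zones = pandas.Index([1, 2, 3], name="TAZ")

toy_distance = pandas.DataFrame(
    [[0.5, 2.0, 3.5], [2.0, 0.4, 1.5]], index=cases, columns=zones
)
toy_logsums = pandas.DataFrame(
    [[-1.0, -1.5, -2.0], [-1.2, -0.8, -1.1]], index=cases, columns=zones
)

# Each stack() call turns a wide case-by-zone table into a single Series
# indexed by (CASE_ID, TAZ); concat aligns the two Series on that index.
toy_ca = pandas.concat([
    toy_distance.stack().rename("distance"),
    toy_logsums.stack().rename("logsum"),
], axis=1)
```

The result has one row per case and alternative pair, with one column per variable.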


In [None]:
ca = lx.Dataset(dict(
    distance=distance.rename_axis(index='CASE_ID', columns="TAZ"),
    logsum=logsums.rename_axis(index='CASE_ID', columns="TAZ"), 
    alt_names=pandas.Series([f'TAZ{i}' for i in skims.TAZ_ID], index=skims.TAZ_ID).rename_axis(index='TAZ'),
))
ca

For our destination choice model, we'll also want to use employment data.
This data, as included in our example, has unique 
values only by alternative and not by caseid, so there are only
40 unique rows.
(This kind of structure is common for destination choice models.)

In [None]:
emp.info()

To make this work with the computational 
arrays required for Larch, we'll need to join this to the other 
`idca` data.  Doing so is simple, because the index of the `emp` DataFrame
is the same as the alternative id level of the `ca` data.  When the `idca`
data is held in a pandas DataFrame, the names of the levels on its MultiIndex
are available as `ca.index.names`, and knowing the name of the alternatives
level lets us join the employment data on that level by name.  With the
`Dataset` used here, the same link is declared as a relationship when the
data tree is built.
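
A minimal sketch of both steps on toy pandas data follows. The level names `CASE_ID` and `TAZ` and all of the values are assumptions chosen for illustration; the actual level names depend on how the `idca` data was assembled.

```python
import pandas

# Toy idca table: two cases by three zones, indexed by (CASE_ID, TAZ).
toy_index = pandas.MultiIndex.from_product(
    [[10, 11], [1, 2, 3]], names=["CASE_ID", "TAZ"]
)
toy_ca = pandas.DataFrame(
    {"distance": [0.5, 2.0, 3.5, 2.0, 0.4, 1.5]}, index=toy_index
)

# Toy employment table with one row per zone, indexed by zone.
toy_emp = pandas.DataFrame(
    {"RETAIL_EMP": [100, 0, 40], "NONRETAIL_EMP": [250, 80, 10]},
    index=pandas.Index([1, 2, 3], name="TAZ"),
)

# Inspect the level names, then join on the alternatives level by name.
level_names = list(toy_ca.index.names)
toy_joined = toy_ca.join(toy_emp, on="TAZ")
```

Because `on` may name an index level of the calling DataFrame, each zone's employment values are repeated for every case without any reshaping.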

Then we bundle the raw data along with this newly organized `idca` data
into a `DataTree` structure, which is used for estimation.
The `ca` dataset carries the alternative codes and names, and the
`relationships` argument declares how the component tables link together,
including the link from the alternatives dimension to the employment data.
This structure can be attached to a model as its `datatree`.

In [None]:
tree = lx.DataTree(
    base=ca.rename({'CASE_ID': '_caseid_', 'TAZ':'_altid_'}),
    tour=tour.query("TOURPURP == 1"),
    hh=hh.set_index("HHID"),
    person=pp.set_index('PERSONID'),
    emp=emp,
    relationships=(
        "base._altid_ @ emp.TAZ",
        "base._caseid_ -> tour.CASE_ID",
        "tour.HHID @ hh.HHID",
        "tour.PERSONID @ person.PERSONID",
    ),
)

tree

In [None]:
tree.root_dataset

In [None]:
# dfs = larch.DataFrames(
#     co=raw,
#     ca=ca,
#     alt_codes=skims.TAZ_ID, 
#     alt_names=[f'TAZ{i}' for i in skims.TAZ_ID],
#     ch_name='DTAZ',
#     av=1,
# )

In [None]:
# dfs.info(1)

## Model Definition

In [None]:
m = lx.Model(datatree=tree)
m.title = "Exampville Work Tour Destination Choice v1"

In [None]:
m.quantity_ca = (
        + P.EmpRetail_HighInc * X('RETAIL_EMP * (INCOME>50000)')
        + P.EmpNonRetail_HighInc * X('NONRETAIL_EMP') * X("INCOME>50000")
        + P.EmpRetail_LowInc * X('RETAIL_EMP') * X("INCOME<=50000")
        + P.EmpNonRetail_LowInc * X('NONRETAIL_EMP') * X("INCOME<=50000")
)

m.quantity_scale = P.Theta


In [None]:
m.utility_ca = (
    + P.logsum * X.logsum
    + P.distance * X.distance
)

In [None]:
m.lock_values(
    EmpRetail_HighInc=0,
    EmpRetail_LowInc=0,
)
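
To give a feel for how the pieces defined above might combine, here is a small numeric sketch. It assumes that each quantity term enters as exp(parameter) times the attribute, that the log of the summed quantity is scaled by `Theta` and added to the utility, and that choice probabilities are multinomial logit; that functional form is an assumption of this sketch, not something confirmed in this notebook. All inputs and parameter values are made up.

```python
import numpy as np

# Made-up inputs for one traveler choosing among three zones.
logsum = np.array([-1.0, -1.5, -2.0])
distance = np.array([0.5, 2.0, 3.5])
retail_emp = np.array([100.0, 20.0, 40.0])
nonretail_emp = np.array([250.0, 80.0, 10.0])


def destination_probabilities(g_retail, g_nonretail,
                              b_logsum=1.0, b_distance=-0.1, theta=0.8):
    """Logit probabilities with a log-size term (assumed functional form)."""
    # Assumed form: size is a sum of exp(parameter) * attribute.
    size = np.exp(g_retail) * retail_emp + np.exp(g_nonretail) * nonretail_emp
    utility = b_logsum * logsum + b_distance * distance + theta * np.log(size)
    # Subtract the maximum before exponentiating for numerical stability.
    expu = np.exp(utility - utility.max())
    return expu / expu.sum()


# Retail parameter held at zero as the reference level.
prob_reference = destination_probabilities(0.0, 0.5)
# Shifting both size parameters by the same constant rescales every zone's
# size by the same factor, which cancels out of the probabilities.
prob_shifted = destination_probabilities(1.0, 1.5)
```

Under this assumed form, only the differences between size parameters are identified, which is one way to read the decision to lock a reference parameter at zero within each income group.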

## Model Estimation

In [None]:
m.load_data()

In [None]:
m.loglike()

In [None]:
m.maximize_loglike()

In [None]:
m.calculate_parameter_covariance()

## Model Visualization

For destination choice and similar types of models, it can be useful to
review the observed and modeled choices, and the relative distribution of
these choices across different factors.  For example, we would probably want
to see the distribution of travel distance.  The `Model` object includes
a built-in method to create this kind of visualization.

In [None]:
m.distribution_on_idca_variable('distance')
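
Conceptually, a comparison like this can be built from two weighted histograms of the same `idca` variable: one weighted by the observed choices and one by the modeled probabilities. The sketch below shows that idea on made-up numbers; it is not taken from the Larch implementation, and the array names and bin edges are illustrative.

```python
import numpy as np

# Made-up data: 3 cases by 4 alternatives.
distance = np.array([
    [0.5, 2.0, 3.5, 5.0],
    [2.0, 0.4, 1.5, 3.0],
    [3.5, 1.5, 0.6, 1.8],
])
# Modeled probabilities, with each row summing to one.
prob = np.array([
    [0.50, 0.30, 0.15, 0.05],
    [0.20, 0.50, 0.20, 0.10],
    [0.10, 0.20, 0.40, 0.30],
])
# Position of the chosen alternative for each case.
chosen = np.array([1, 1, 2])

# One-hot weights marking the observed choice in each row.
choice_weights = np.zeros_like(prob)
choice_weights[np.arange(len(chosen)), chosen] = 1.0

# Histogram the same distances twice, changing only the weights.
edges = np.array([0.0, 2.0, 4.0, 6.0])
observed, _ = np.histogram(distance, bins=edges, weights=choice_weights)
modeled, _ = np.histogram(distance, bins=edges, weights=prob)

# Convert to shares so the two distributions are directly comparable.
observed_share = observed / observed.sum()
modeled_share = modeled / modeled.sum()
```

Both histograms total the number of cases before normalization, since each case contributes exactly one unit of weight either way.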

The `distribution_on_idca_variable` method has a variety of options,
for example to control the number and range of the histogram bins:

In [None]:
m.distribution_on_idca_variable('distance', bins=40, range=(0,10))

Alternatively, the histogram style can be swapped out for a smoothed kernel density
function:

In [None]:
m.distribution_on_idca_variable(
    'distance',
    style='kde',
)

Subsets of the observations can be pulled out, to observe the 
distribution conditional on other `idco` factors, like income.

In [None]:
m.distribution_on_idca_variable(
    'distance',
    xlabel="Distance (miles)",
    bins=26,
    subselector='INCOME<10000',
    range=(0,13),
    header='Destination Distance, Very Low Income (<$10k) Households',
)

## Save and Report Model

In [None]:
report = larch.Reporter(title=m.title)

In [None]:
report << '# Parameter Summary' << m.parameter_summary()

In [None]:
report << "# Estimation Statistics" << m.estimation_statistics()

In [None]:
report << "# Utility Functions" << m.utility_functions()

The figures shown above can also be inserted directly into reports.

In [None]:
figure = m.distribution_on_idca_variable(
    'distance', 
    xlabel="Distance (miles)",
    style='kde',
    header='Destination Distance',
)
report << "# Visualization"
report << figure

In [None]:
report.save(
    '/tmp/exampville_dest_choice.html',
    overwrite=True,
    metadata=m,
)