
Python Tips for Working with Soundcast Data


Soundcast inputs and outputs come in a variety of data formats. Many are conventional text or Emme files, while some are stored in formats like HDF5 that are most easily accessed with Python. This section describes how to efficiently interact with special file formats and with large text files that are generally too large for analysis in Excel. There are many ways to work with this data, but these examples show how PSRC works with it internally.

Working with Large Text-Based Files

There are many inputs and outputs that are simple text files, delimited by commas (CSV), whitespace, or tabs. Since Soundcast works at a highly disaggregate level of parcels and individual trips (over 15 million rows), these files can be quite large. They often require merging operations (e.g., adding household information to trip records, parcel data to households, etc.).

For these reasons, PSRC uses the Python library Pandas to work with almost all of the Soundcast data. Pandas allows users to quickly read and operate on text-based data much like one would in a sophisticated spreadsheet or in a database. Using Python for these operations has the added benefit of documenting procedures for repeatability and review. The following code snippets are provided to show examples of how the Pandas library is used to process some key input and output data.

To use Python for these procedures, we recommend using either an IDE or Jupyter notebooks.

Load the pandas library and apply the handle "pd"; any time pd appears in the code, it means we are using a component of the pandas library.

import pandas as pd

For this example we will load a set of standard outputs for households and persons. We will create a simple calculation of total workers per TAZ. This will involve loading 2 datasets, merging them together, filtering, and aggregating. Though this specific measure is rudimentary, the methods demonstrated here are the basic building blocks for most analyses that need to be performed on this data.

Note that the file path specified should be a full path (e.g., 'C://user/soundcast/outputs/daysim/_household.tsv') to the file location, or the IDE/script should be running from the Soundcast root directory (e.g., 'C://user/soundcast').
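If neither is convenient, one option (a sketch; the path shown is illustrative) is to set the working directory at the top of the script:

import os

# Change to the Soundcast root so relative paths resolve (illustrative path)
os.chdir('C:/user/soundcast')

With the working directory set (or full paths supplied), load the two tables: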

df_hh = pd.read_csv('outputs/daysim/_household.tsv', delim_whitespace=True)
df_person = pd.read_csv('outputs/daysim/_person.tsv', delim_whitespace=True)

Merge the household data to the person dataset (tables are called dataframes, or df, in Pandas vocabulary) to attach household TAZ information at the person level. This method merges all fields from df_hh based on the common field 'hhno' using a left join (so each person record gets its household's information). The result is reassigned to the variable df_person, so df_person will have all the household attributes.

df_person = df_person.merge(df_hh, on='hhno', how='left')

Note that if working from an interactive IDE or a Jupyter Notebook, you can check the results with:

df_person.head()    # to see the first 5 rows of the dataframe (the default)
df_person.columns    # to get a list of all columns
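One optional sanity check on the merge above (a short sketch): 'hhno' should uniquely identify households, so the left join should not change the person-table row count.

# Duplicate 'hhno' values in df_hh would multiply person rows in the join
assert df_hh['hhno'].is_unique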

We can filter by worker type to get only full-time workers. Recall that a data dictionary for Daysim outputs is available within the wiki. Since this is a filter of the full data, we can keep the original full df_person table and store the subset as a new variable, df_worker.

df_worker = df_person[df_person['pwtyp'] == 1]
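To verify the filter, or to see how person records are distributed across all worker types (value labels are in the data dictionary), value_counts is handy:

df_person['pwtyp'].value_counts()    # number of records per worker type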

Finally, to calculate the total number of workers per TAZ, use the groupby method:

worker_per_taz = df_worker.groupby('hhtaz').sum()[['hhexpfac']]

This aggregates at the household TAZ level (hhtaz), which we joined to the person table. The field 'hhtaz' inside the groupby method specifies the level of aggregation, and the .sum() method that follows computes a sum within each group. Other methods like count, mean, min, max, etc. can also be used here as needed. Finally, the list of values in brackets [['hhexpfac']] specifies which columns should be returned. In this case, we return the hhexpfac field, which is a weight always equal to 1 for Daysim records. Each of the tables has one of these expfac fields (trexpfac for trips, hhexpfac for households, etc.), which are useful for taking sums.
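A couple of equivalent and related sketches (the 'hhincome' column name is an assumption; check it against the data dictionary):

# Selecting the column before aggregating avoids summing every column first
worker_per_taz = df_worker.groupby('hhtaz')[['hhexpfac']].sum()

# Other aggregations work the same way, e.g. mean household income
# of full-time workers by TAZ ('hhincome' assumed per the data dictionary)
mean_income = df_worker.groupby('hhtaz')['hhincome'].mean()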

Loading HDF5 Data

The household and person input file is stored in HDF5 format (H5 for short). H5 is an efficient file storage system, akin to a database. To access h5 files, we use the h5py Python library as follows:

import h5py

hh_person = h5py.File('inputs/scenario/landuse/hh_and_persons.h5', 'r')    # open read-only

Inside this H5 file are two data tables. The data are stored like a dictionary, where the key is the table name and the value is the data itself. To view the table names:

hh_person.keys()

To access the data, you must pass the table name:

hh_person['Household']

For each table, data is stored by column, so each column is called separately. This is again stored in a dictionary format, but now the key is the column name and the value is the column data:

hh_person['Household'].keys()

To access a complete column as an array, pass the data table key, the column key, and a numpy indexer [:] to select all values.

hh_person['Household']['hhno'][:]

The following snippet uses this logic to extract each column of data into its appropriate table. The results are stored in two pandas dataframes (hh and person).

hh = pd.DataFrame()
person = pd.DataFrame()

# Copy each H5 column into the matching dataframe column
for col in hh_person['Household']:
    hh[col] = hh_person['Household'][col][:]
for col in hh_person['Person']:
    person[col] = hh_person['Person'][col][:]
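The reverse operation, writing modified tables back out, follows the same structure. A minimal sketch (the output file name is illustrative; string columns may need extra dtype handling in h5py):

# Write each dataframe column back out as an H5 dataset
with h5py.File('hh_and_persons_modified.h5', 'w') as out:
    for name, df in [('Household', hh), ('Person', person)]:
        group = out.create_group(name)
        for col in df.columns:
            group.create_dataset(col, data=df[col].values)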

Other Resources

Many examples of using pandas for data analysis can be found on the web, as well as within the Soundcast code itself. Many of the Soundcast Jupyter Notebooks can be viewed directly in the browser through GitHub; a set of these notebooks can be found in scripts/summarize/standard/notebooks.

Virtual Environments

Soundcast is designed to be run within a virtual environment: a self-contained set of Python and library versions that can be activated and deactivated as needed. In addition to the environment used by Soundcast, it may be useful to create other environments for different types of analysis. For instance, the environment used for Soundcast must use a specific, older version of Python 2, but as newer tools become available, users may want to use Python 3 separately. The following section describes how to create a new environment and install some specific libraries. The example centers on the geopandas library, which is very useful for automating spatial analyses often performed manually in ArcGIS, but which works much better with Python 3. We therefore recommend creating a separate virtual environment for it, as demonstrated below.

conda create --name myenv

This creates a new virtual environment with the name myenv. The environment is created, but not activated (in other words, we are not yet using it). To activate the virtual environment:

activate myenv

The command prompt should now display (myenv), indicating that you are using the myenv environment. Now that it's active, we will need to install some important libraries often used with Soundcast but not always available in a standard Anaconda install of Python. To add these libraries, enter the following commands (pressing Enter after each):

conda install geopandas
conda install h5py

To test whether these are available, enter python to start a Python interpreter. When Python is loaded, enter import geopandas. If no errors are reported, the library is now available; repeat the test for each added library.

Refer to the conda documentation for other actions and more information regarding virtual environments.
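As a quick illustration of why geopandas is useful here, a minimal sketch (the shapefile path and column names are assumptions, not Soundcast-specific files):

import geopandas as gpd
import pandas as pd

# Load TAZ boundaries and the worker totals computed earlier
# (both file names are illustrative)
taz = gpd.read_file('taz2010.shp')
workers = pd.read_csv('worker_per_taz.csv')

# Attach worker totals to TAZ geometry and draw a quick choropleth
taz = taz.merge(workers, left_on='taz', right_on='hhtaz', how='left')
taz.plot(column='hhexpfac', legend=True)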