### Data sources
- LEHD Origin-Destination Employment Statistics (LODES): The definition of variable codes, datasets, etc. can be found at the latest [LODES 7.3 Technical Documentation](https://lehd.ces.census.gov/data/lodes/LODES7/LODESTechDoc7.3.pdf). All LEHD Origin-Destination Employment Statistics (LODES) data are available, as described in the LODES documentation above. No changes have been made to the original CSV files. Data are available from 2002 to 2015. See the documentation above for caveats.
- Driving Times and Distances Dataset: Census tracts are 2010 vintage, and the columns are the origin tract, destination travel, travel time in minutes, and travel distance in miles. These data were calculated by the Data Science team at the Urban Institute. See [Github repo](https://github.com/UI-Research/spark-osrm).

In [6]:
import pyspark
from pyspark.sql.functions import *

Load in data and see what it looks like

In [7]:
drive = spark.read.parquet('s3://lsdm-emr-util/lsdm-data/travel-times/drive_times.parquet')
od = spark.read.parquet('s3://lsdm-emr-util/lsdm-data/lodes/od/od.parquet')
rac = spark.read.parquet('s3://lsdm-emr-util/lsdm-data/lodes/rac/rac.parquet')

AttributeError: 'module' object has no attribute 'read'

In [None]:
print((drive.count(), len(drive.columns)))
drive.take(2)

In [None]:
print((od.count(), len(od.columns)))
od.take(2)

In [None]:
print((rac.count(), len(rac.columns)))
rac.take(2)

Make census tract columns in origin-destination data

In [None]:
od = od.withColumn('h_tract', substring(od.h_geocode, 0, 11))\
        .withColumn('w_tract', substring(od.w_geocode, 0, 11))

Join the three dataframes

In [None]:
# left join od with rac based on h_geocode, giving us characteristics of residence areas
df = od.join(rac, od.h_geocode == rac.h_geocode, 'left_outer')

In [None]:
# left join join with driving, giving us travel times for each origin-destination pair
df = df.join(drive, [drive.from_tract == df.h_tract, drive.to_tract == df.w_tract])