### Data sources
- LEHD Origin-Destination Employment Statistics (LODES): Variable codes and dataset definitions are given in the [LODES 7.3 Technical Documentation](https://lehd.ces.census.gov/data/lodes/LODES7/LODESTechDoc7.3.pdf). All LODES datasets described there are available, and no changes have been made to the original CSV files. Data are available from 2002 to 2015. See the documentation for caveats.
- Driving Times and Distances Dataset: Census tracts are 2010 vintage. The columns are the origin tract (`from_tract`), destination tract (`to_tract`), travel distance in miles (`miles`), and travel time in minutes (`minutes`). These data were calculated by the Data Science team at the Urban Institute. See the [GitHub repo](https://github.com/UI-Research/spark-osrm).

In [13]:
import pyspark
# Note: the wildcard import shadows Python builtins such as sum, min, max, and round
from pyspark.sql.functions import *


Load the data and see what it looks like

In [8]:
drive = spark.read.parquet('s3://lsdm-emr-util/lsdm-data/travel-times/drive_times.parquet')
od = spark.read.parquet('s3://lsdm-emr-util/lsdm-data/lodes/od/od.parquet')
rac = spark.read.parquet('s3://lsdm-emr-util/lsdm-data/lodes/rac/rac.parquet')

In [26]:
print((drive.count(), len(drive.columns)))
drive.take(2)

(122004331, 4)


[Row(from_tract=u'36103146402', to_tract=u'42091207003', miles=141.2, minutes=184.1),
 Row(from_tract=u'36103146402', to_tract=u'42091209000', miles=162.7, minutes=203.4)]

In [27]:
print((od.count(), len(od.columns)))
od.take(2)

(1577789908, 17)


[Row(w_geocode=u'271630714002025', h_geocode=u'271630712082020', s000=1, sa01=0, sa02=1, sa03=0, se01=0, se02=1, se03=0, si01=1, si02=0, si03=0, createdate=u'20160219', year=2012, tract_id=u'27163071208', h_tract=u'27163071208', w_tract=u'27163071400'),
 Row(w_geocode=u'271630714002025', h_geocode=u'271630712083004', s000=1, sa01=0, sa02=1, sa03=0, se01=0, se02=1, se03=0, si01=1, si02=0, si03=0, createdate=u'20160219', year=2012, tract_id=u'27163071208', h_tract=u'27163071208', w_tract=u'27163071400')]

In [28]:
print((rac.count(), len(rac.columns)))
rac.take(2)

(80870134, 44)


[Row(h_geocode=u'260010001001000', c000=4, ca01=0, ca02=4, ca03=0, ce01=1, ce02=1, ce03=2, cns01=0, cns02=0, cns03=0, cns04=0, cns05=0, cns06=0, cns07=0, cns08=1, cns09=1, cns10=0, cns11=0, cns12=0, cns13=0, cns14=1, cns15=0, cns16=1, cns17=0, cns18=0, cns19=0, cns20=0, cr01=4, cr02=0, cr03=0, cr04=0, cr05=0, cr07=0, ct01=4, ct02=0, cd01=0, cd02=1, cd03=2, cd04=1, cs01=3, cs02=1, createdate=u'20170919', year=2015),
 Row(h_geocode=u'260010001001004', c000=2, ca01=0, ca02=2, ca03=0, ce01=1, ce02=1, ce03=0, cns01=0, cns02=0, cns03=0, cns04=0, cns05=0, cns06=0, cns07=0, cns08=0, cns09=0, cns10=1, cns11=0, cns12=0, cns13=0, cns14=0, cns15=0, cns16=0, cns17=0, cns18=1, cns19=0, cns20=0, cr01=2, cr02=0, cr03=0, cr04=0, cr05=0, cr07=0, ct01=2, ct02=0, cd01=1, cd02=1, cd03=0, cd04=0, cs01=1, cs02=1, createdate=u'20170919', year=2015)]

Create census tract columns in the LODES data

In [19]:
# Spark's substring() is 1-based; the first 11 characters of a block geocode are the tract ID
od = od.withColumn('h_tract', substring(od.h_geocode, 1, 11))\
        .withColumn('w_tract', substring(od.w_geocode, 1, 11))

In [30]:
rac = rac.withColumn('h_tract', substring(rac.h_geocode, 1, 11))
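
Taking the first 11 characters works because a 15-digit census block GEOID nests its parent geographies: 2 digits for the state, 3 for the county, 6 for the tract, and 4 for the block. The following is a minimal plain-Python sketch of that layout, using a geocode from the sample rows above. The helper name `split_block_geocode` is hypothetical and is not part of the notebook or of PySpark.

```python
def split_block_geocode(geocode: str) -> dict:
    """Split a 15-digit census block GEOID into its nested components.

    Hypothetical helper for illustration only.
    """
    if len(geocode) != 15 or not geocode.isdigit():
        raise ValueError(f"expected a 15-digit block GEOID, got {geocode!r}")
    return {
        "state": geocode[:2],         # 2-digit state FIPS code
        "county": geocode[2:5],       # 3-digit county code
        "tract": geocode[5:11],       # 6-digit tract code
        "block": geocode[11:],        # 4-digit block code
        "tract_geoid": geocode[:11],  # state + county + tract, as in h_tract / w_tract
    }


parts = split_block_geocode("271630714002025")
print(parts["tract_geoid"])  # 27163071400
```

The printed value matches the `w_tract` shown for that geocode in the `od` sample rows.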

Join the three dataframes

In [33]:
# join od with rac based on h_geocode, giving us characteristics of residence areas
df = od.join(rac, od.h_geocode == rac.h_geocode, 'inner')

In [None]:
df.take(1)
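
Both `od` and `rac` carry a `year` column. If each table holds several years, matching on `h_geocode` alone pairs every OD row with every year of RAC data for that block. The sketch below shows one possible way to match on year as well and then attach the driving times. It is an assumption about the intended result, not the notebook's method. The function names are hypothetical, and the home-to-work direction for the drive-time lookup is also an assumption.

```python
def join_od_rac(od, rac):
    """Attach residence-area characteristics to each OD row, matching on block and year."""
    # Assumption: RAC's createdate and h_tract are not needed downstream; dropping them
    # avoids duplicate column names (drop() ignores names that are not present).
    rac_slim = rac.drop('createdate', 'h_tract')
    # Passing column names rather than an expression keeps one copy of each join key.
    return od.join(rac_slim, on=['h_geocode', 'year'], how='inner')


def add_drive_times(df, drive):
    """Attach tract-to-tract driving distance and time to each home/work tract pair."""
    # Assumption: trips run from the home tract to the work tract.
    # A left join keeps OD rows whose tract pair has no entry in the drive-time table.
    return df.join(
        drive,
        on=[df.h_tract == drive.from_tract, df.w_tract == drive.to_tract],
        how='left',
    )
```

With the dataframes loaded above, this could be called as `df = add_drive_times(join_od_rac(od, rac), drive)`. On tables of this size it is worth checking a small sample first, for example `df.take(1)`.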