### Data sources
- LEHD Origin-Destination Employment Statistics (LODES): Variable codes and dataset definitions are described in the [LODES 7.3 Technical Documentation](https://lehd.ces.census.gov/data/lodes/LODES7/LODESTechDoc7.3.pdf). All LODES data described there are available, covering 2002 to 2015, and no changes have been made to the original CSV files. See the documentation for caveats.
- Driving Times and Distances Dataset: Census tracts are 2010 vintage. The columns are the origin tract, destination tract, travel distance in miles, and travel time in minutes. These data were calculated by the Data Science team at the Urban Institute. See the [GitHub repo](https://github.com/UI-Research/spark-osrm).

In [27]:
import warnings

import pyspark
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.window import Window
from pyspark.ml.linalg import Vectors
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator 

from IPython.core.interactiveshell import InteractiveShell
from pyspark.sql.types import StringType

warnings.filterwarnings(action='once')
InteractiveShell.ast_node_interactivity = "all"

In [2]:
spark = SparkSession.builder \
    .appName('pyspark-exploration') \
    .config('spark.driver.cores', '2') \
    .config('spark.executor.memory', '8gb') \
    .config('spark.executor.cores', '2') \
    .getOrCreate()     

In [3]:
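def debug(df):
    """
    Pretty-print the RDD lineage returned by toDebugString
    """
    lineage = df.rdd.toDebugString()
    if isinstance(lineage, bytes):
        # toDebugString can return bytes; decode before splitting
        lineage = lineage.decode('utf-8')
    for rddstring in lineage.split('\n'):
        print(rddstring.strip())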

### Load and prepare data

Load in data and see what it looks like

In [4]:
drive = spark.read.parquet('s3://lsdm-emr-util/lsdm-data/travel-times/drive_times.parquet')
od = spark.read.parquet('s3://lsdm-emr-util/lsdm-data/lodes/od/od.parquet')

In [5]:
print((drive.count(), len(drive.columns)))
drive.take(2)
drive.dtypes

(122004331, 4)


[Row(from_tract=u'36103146402', to_tract=u'42091207003', miles=141.2, minutes=184.1),
 Row(from_tract=u'36103146402', to_tract=u'42091209000', miles=162.7, minutes=203.4)]

[('from_tract', 'string'),
 ('to_tract', 'string'),
 ('miles', 'double'),
 ('minutes', 'double')]

In [6]:
print((od.count(), len(od.columns)))
od.take(2)
od.dtypes

(1577789908, 14)


[Row(w_geocode=u'271630714002025', h_geocode=u'271630712082020', s000=1, sa01=0, sa02=1, sa03=0, se01=0, se02=1, se03=0, si01=1, si02=0, si03=0, createdate=u'20160219', year=2012),
 Row(w_geocode=u'271630714002025', h_geocode=u'271630712083004', s000=1, sa01=0, sa02=1, sa03=0, se01=0, se02=1, se03=0, si01=1, si02=0, si03=0, createdate=u'20160219', year=2012)]

[('w_geocode', 'string'),
 ('h_geocode', 'string'),
 ('s000', 'int'),
 ('sa01', 'int'),
 ('sa02', 'int'),
 ('sa03', 'int'),
 ('se01', 'int'),
 ('se02', 'int'),
 ('se03', 'int'),
 ('si01', 'int'),
 ('si02', 'int'),
 ('si03', 'int'),
 ('createdate', 'string'),
 ('year', 'int')]

Make census tract and state columns in origin-destination data

In [5]:
# Spark's substring is 1-based: take the first 11 characters (the tract code)
od = od.withColumn('h_tract', substring(od.h_geocode, 1, 11))\
        .withColumn('w_tract', substring(od.w_geocode, 1, 11))

In [6]:
# First 2 characters are the state FIPS code
od = od.withColumn('h_state', substring(od.h_geocode, 1, 2))\
        .withColumn('w_state', substring(od.w_geocode, 1, 2))
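The slicing above relies on the layout of a 15-digit block geocode: state (2 digits), county (3), tract (6), block (4). A minimal plain-Python sketch of the same split follows; `split_geocode` is a hypothetical helper for illustration, not part of the notebook. Note that Python slices are 0-based, whereas Spark's `substring` positions are 1-based.

```python
# Hypothetical helper: split a 15-digit census block geocode into its parts.
def split_geocode(geocode):
    """Return the state, county, tract, and block codes for a block geocode."""
    if len(geocode) != 15 or not geocode.isdigit():
        raise ValueError("expected a 15-digit block geocode, got %r" % geocode)
    return {
        "state": geocode[0:2],    # 2-digit state FIPS code
        "county": geocode[0:5],   # state + 3-digit county code
        "tract": geocode[0:11],   # state + county + 6-digit tract code
        "block": geocode,         # full 15-digit block code
    }

# A home geocode that appears in the joined California output
parts = split_geocode("060014001001045")
print(parts["state"], parts["tract"])
```

The tract string produced here is the key that the join with the driving data uses.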

Make state column in drive data

In [7]:
# Spark's substring is 1-based: first 2 characters are the state FIPS code
drive = drive.withColumn('from_state', substring(drive.from_tract, 1, 2))\
            .withColumn('to_state', substring(drive.to_tract, 1, 2))

Make "total" and "pct" columns

In [8]:
sa_cols = ['sa01', 'sa02', 'sa03']
se_cols = ['se01', 'se02', 'se03']
si_cols = ['si01', 'si02', 'si03']

In [9]:
od = od.withColumn('sa_total', od.sa01+od.sa02+od.sa03)\
    .withColumn('se_total', od.se01+od.se02+od.se03)\
    .withColumn('si_total', od.si01+od.si02+od.si03)

In [10]:
for cat_ls, cat_name in [(sa_cols, 'sa_total'), (se_cols, 'se_total'), (si_cols, 'si_total')]:
    # Use a loop variable other than `col` so pyspark.sql.functions.col is not shadowed
    for c in cat_ls:
        new = c + '_pct'
        od = od.withColumn(new, od[c]/od[cat_name])
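To make the "pct" columns concrete, here is a plain-Python sketch of the same computation for a single record. `add_shares` is a hypothetical helper; `None` stands in for the null that Spark returns (in its default, non-ANSI mode) when a group total is zero.

```python
def add_shares(row, cols):
    """Return {col + '_pct': share of the group total} for one record."""
    total = sum(row[c] for c in cols)
    shares = {}
    for c in cols:
        # None plays the role of Spark's null when the group total is zero
        shares[c + "_pct"] = row[c] / total if total else None
    return shares

# Age counts from one origin/destination record
row = {"sa01": 1, "sa02": 0, "sa03": 0}
age_shares = add_shares(row, ["sa01", "sa02", "sa03"])
print(age_shares)

# A record with no jobs in the group yields missing shares rather than an error
empty_shares = add_shares({"sa01": 0, "sa02": 0, "sa03": 0}, ["sa01", "sa02", "sa03"])
print(empty_shares)
```

Each value is a fraction between 0 and 1, not a percentage, despite the `_pct` suffix.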

In [None]:
od.take(1)

Check out lineage and partitions of the two dataframes

In [12]:
debug(od)

(137) MapPartitionsRDD[7] at javaToPython at NativeMethodAccessorImpl.java:0 []
|   MapPartitionsRDD[6] at javaToPython at NativeMethodAccessorImpl.java:0 []
|   MapPartitionsRDD[5] at javaToPython at NativeMethodAccessorImpl.java:0 []
|   FileScanRDD[4] at javaToPython at NativeMethodAccessorImpl.java:0 []


In [15]:
debug(drive)

(33) MapPartitionsRDD[29] at javaToPython at NativeMethodAccessorImpl.java:0 []
|   MapPartitionsRDD[28] at javaToPython at NativeMethodAccessorImpl.java:0 []
|   MapPartitionsRDD[27] at javaToPython at NativeMethodAccessorImpl.java:0 []
|   FileScanRDD[26] at javaToPython at NativeMethodAccessorImpl.java:0 []


In [16]:
od.rdd.getNumPartitions()
drive.rdd.getNumPartitions()

137

33

### Join: origin-destination and driving dataframes  
Assumption: the travel time and distance between two census tracts apply to every pair of census blocks within those tracts.

Repartition od before joining



In [13]:
od = od.repartition(8)

Try working with just California data first, where origin and destination are state code 06  
Join od_ca and drive_ca

In [14]:
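# Both conditions go in one SQL expression; Python's `and` between two strings would keep only the second
od_ca = od.where("h_state = '06' AND w_state = '06'")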

In [15]:
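# Both conditions go in one SQL expression; Python's `and` between two strings would keep only the second
drive_ca = drive.where("from_state = '06' AND to_state = '06'")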
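A note on combining filter conditions: Python's `and` between two non-empty strings evaluates to the second string, so it cannot be used to build a compound predicate out of two SQL snippets. The sketch below shows the difference using plain strings only (no Spark session needed); the variable names are illustrative.

```python
h_filter = "h_state = '06'"
w_filter = "w_state = '06'"

# `and` between two non-empty strings evaluates to the second string,
# so the first condition would be silently dropped
python_and = h_filter and w_filter

# Joining the conditions inside one SQL expression keeps both
combined = " AND ".join([h_filter, w_filter])

print(python_and)
print(combined)
```

With column expressions rather than strings, the equivalent is to wrap each comparison in parentheses and combine them with `&`.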

In [16]:
# Reference the columns of the dataframes actually being joined
df_ca = od_ca.join(drive_ca, [drive_ca.from_tract == od_ca.h_tract, drive_ca.to_tract == od_ca.w_tract])

In [21]:
df_ca.limit(1).collect()

[Row(w_geocode=u'060014355002011', h_geocode=u'060014001001045', s000=1, sa01=1, sa02=0, sa03=0, se01=0, se02=1, se03=0, si01=0, si02=0, si03=1, createdate=u'20160228', year=2008, h_tract=u'06001400100', w_tract=u'06001435500', h_state=u'06', w_state=u'06', sa_total=1, se_total=1, si_total=1, sa01_pct=1.0, sa02_pct=0.0, sa03_pct=0.0, se01_pct=0.0, se02_pct=1.0, se03_pct=0.0, si01_pct=0.0, si02_pct=0.0, si03_pct=1.0, from_tract=u'06001400100', to_tract=u'06001435500', miles=16.9, minutes=24.2, from_state=u'06', to_state=u'06')]

Now join full od with driving, giving us travel times for each origin-destination pair.  

In [17]:
df = od.join(drive, [drive.from_tract == od.h_tract, drive.to_tract == od.w_tract])

The resulting dataframe is split across 200 partitions, the default value of `spark.sql.shuffle.partitions`, as the `getNumPartitions` method shows

In [18]:
df.rdd.getNumPartitions()

200

In [24]:
debug(df)

(200) MapPartitionsRDD[60] at javaToPython at NativeMethodAccessorImpl.java:0 []
|   MapPartitionsRDD[59] at javaToPython at NativeMethodAccessorImpl.java:0 []
|   MapPartitionsRDD[58] at javaToPython at NativeMethodAccessorImpl.java:0 []
|   ZippedPartitionsRDD2[57] at javaToPython at NativeMethodAccessorImpl.java:0 []
|   MapPartitionsRDD[51] at javaToPython at NativeMethodAccessorImpl.java:0 []
|   ShuffledRowRDD[50] at javaToPython at NativeMethodAccessorImpl.java:0 []
+-(8) MapPartitionsRDD[49] at javaToPython at NativeMethodAccessorImpl.java:0 []
|  ShuffledRowRDD[48] at javaToPython at NativeMethodAccessorImpl.java:0 []
+-(137) MapPartitionsRDD[47] at javaToPython at NativeMethodAccessorImpl.java:0 []
|   MapPartitionsRDD[46] at javaToPython at NativeMethodAccessorImpl.java:0 []
|   FileScanRDD[45] at javaToPython at NativeMethodAccessorImpl.java:0 []
|   MapPartitionsRDD[56] at javaToPython at NativeMethodAccessorImpl.java:0 []
|   ShuffledRowRDD[55] at javaToPython at NativeMe

### Grouped aggregation: proportion of goods-producing jobs per year

Again, start with California only  

The proportion of California jobs in Goods Producing industry sectors ("si01") fell fairly steadily, from about 19.9% in 2002 to about 15.1% in 2011, as the economy moved away from manufacturing and towards services

In [25]:
agg_ca = df_ca.groupBy("year").agg(sum("si01"), sum('si02'), sum('si03'))
results_ca = agg_ca.collect()

In [26]:
agg_ca_df = pd.DataFrame(results_ca)

In [27]:
agg_ca_df.columns = ['year', 'si01', 'si02', 'si03']
agg_ca_df['total_jobs'] = agg_ca_df.si01 + agg_ca_df.si02 + agg_ca_df.si03

In [28]:
agg_ca_df['pct_si01'] = agg_ca_df.si01 / agg_ca_df.total_jobs

In [29]:
agg_ca_df.sort_values('year')

| | year | si01 | si02 | si03 | total_jobs | pct_si01 |
|---|---|---|---|---|---|---|
| 13 | 2002 | 1061707 | 1033147 | 3245012 | 5339866 | 0.198827 |
| 0 | 2003 | 1019401 | 1024673 | 3263365 | 5307439 | 0.19207 |
| 6 | 2004 | 1028667 | 1022742 | 3306224 | 5357633 | 0.192 |
| 9 | 2005 | 1052749 | 1053289 | 3372850 | 5478888 | 0.192146 |
| 3 | 2006 | 1045599 | 1071275 | 3441228 | 5558102 | 0.188122 |
| 1 | 2007 | 1025099 | 1065219 | 3474935 | 5565253 | 0.184196 |
| 12 | 2008 | 967440 | 1052308 | 3553468 | 5573216 | 0.173587 |
| 8 | 2009 | 877083 | 977576 | 3534286 | 5388945 | 0.162756 |
| 10 | 2010 | 823689 | 978019 | 3610732 | 5412440 | 0.152184 |
| 11 | 2011 | 827560 | 1009758 | 3651098 | 5488416 | 0.150783 |
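As a quick check of the arithmetic behind `pct_si01`, the sketch below recomputes the share for the first and last years shown in the California table and the change between them in percentage points. The figures are copied from the table; the variable names are illustrative.

```python
# Rows copied from the California table: (year, si01, si02, si03)
rows = [
    (2002, 1061707, 1033147, 3245012),
    (2011, 827560, 1009758, 3651098),
]

shares = {}
for year, si01, si02, si03 in rows:
    total_jobs = si01 + si02 + si03
    shares[year] = si01 / total_jobs  # goods-producing share of all jobs

# Change between the two years, in percentage points
change_pts = (shares[2011] - shares[2002]) * 100
print(round(shares[2002], 6), round(shares[2011], 6), round(change_pts, 2))
```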


Now do it for the whole country. The pattern is similar: the percentage of Goods Producing jobs declines from about 19.6% in 2002 to about 15.4% in 2011.

In [30]:
agg_full = df.groupBy("year").agg(sum("si01"), sum('si02'), sum('si03'))
results_full = agg_full.collect()

In [31]:
agg_full_df = pd.DataFrame(results_full)

In [33]:
agg_full_df.columns = ['year', 'si01', 'si02', 'si03']
agg_full_df['total_jobs'] = agg_full_df.si01 + agg_full_df.si02 + agg_full_df.si03

In [34]:
agg_full_df['pct_si01'] = agg_full_df.si01 / agg_full_df.total_jobs

In [35]:
agg_full_df.sort_values('year')

| | year | si01 | si02 | si03 | total_jobs | pct_si01 |
|---|---|---|---|---|---|---|
| 13 | 2002 | 9049198 | 9483453 | 27533384 | 46066035 | 0.19644 |
| 0 | 2003 | 8865886 | 9600409 | 27942515 | 46408810 | 0.191039 |
| 6 | 2004 | 9121645 | 9926406 | 29220115 | 48268166 | 0.188978 |
| 9 | 2005 | 9300469 | 10086742 | 29813064 | 49200275 | 0.189033 |
| 3 | 2006 | 9447083 | 10167374 | 30462147 | 50076604 | 0.188653 |
| 1 | 2007 | 9357236 | 10167600 | 30749160 | 50273996 | 0.186125 |
| 12 | 2008 | 9023304 | 10151678 | 31044813 | 50219795 | 0.179676 |
| 8 | 2009 | 7960944 | 9624079 | 30834938 | 48419961 | 0.164415 |
| 10 | 2010 | 7515555 | 9468287 | 31314073 | 48297915 | 0.155608 |
| 11 | 2011 | 7827414 | 9978429 | 32872199 | 50678042 | 0.154454 |


### Window function: average total jobs over time

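The first attempt (the two unexecuted cells below, using `rowsBetween`) doesn't work; the `rangeBetween` version that follows them ran successfully.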

In [None]:
window = Window.partitionBy("h_geocode", "w_geocode")\
            .orderBy("year")\
            .rowsBetween(Window.currentRow -1, Window.currentRow + 1)
avg_jobs = avg(col('s000')).over(window)

In [None]:
df_ca.select('year', 'h_geocode', 'w_geocode', avg_jobs.alias("avg_jobs")).limit(10).show()

In [37]:
w = Window.partitionBy("h_geocode", "w_geocode")\
        .orderBy('year')\
        .rangeBetween(-1, 1)

df_ca = df_ca.withColumn('centered_avg_jobs', avg("s000").over(w))

In [38]:
df_ca.take(1)

[Row(w_geocode=u'060014040002005', h_geocode=u'060014001001007', s000=1, sa01=0, sa02=0, sa03=1, se01=0, se02=0, se03=1, si01=0, si02=0, si03=1, createdate=u'20160228', year=2005, h_tract=u'06001400100', w_tract=u'06001404000', h_state=u'06', w_state=u'06', sa_total=1, se_total=1, si_total=1, sa01_pct=0.0, sa02_pct=0.0, sa03_pct=1.0, se01_pct=0.0, se02_pct=0.0, se03_pct=1.0, si01_pct=0.0, si02_pct=0.0, si03_pct=1.0, from_tract=u'06001400100', to_tract=u'06001404000', miles=4.2, minutes=11.5, from_state=u'06', to_state=u'06', centered_avg_jobs=1.0)]
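The two window specifications in this section are not interchangeable: `rowsBetween(-1, 1)` takes the physically adjacent rows, while `rangeBetween(-1, 1)` over `orderBy('year')` takes rows whose year is within 1 of the current row's year. The plain-Python sketch below mimics both on made-up data for one home/work pair with missing years; `centered_avg` is a hypothetical helper, not a Spark API.

```python
def centered_avg(records, mode):
    """Average s000 over a centered window for each (year, s000) record.

    mode="range": include records whose year is within 1 of the current year.
    mode="rows":  include the previous, current, and next record by position.
    """
    records = sorted(records)  # order by year, like orderBy('year')
    out = []
    for i, (year, _) in enumerate(records):
        if mode == "range":
            window = [v for y, v in records if abs(y - year) <= 1]
        else:
            window = [v for _, v in records[max(i - 1, 0):i + 2]]
        out.append((year, sum(window) / len(window)))
    return out

# Made-up data for one home/work pair: no records for 2004 or 2005
pair = [(2002, 2), (2003, 4), (2006, 9)]
by_range = centered_avg(pair, "range")
by_rows = centered_avg(pair, "rows")
print(by_range)
print(by_rows)
```

When a pair has a record for every year the two modes agree; they differ only where years are missing, which is common for block-level pairs.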

### Reshape data from long to wide based on year column

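Doesn't work: the California pivot below was interrupted before it finished (`KeyboardInterrupt`), so the cells after it have no output.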

First try it for California

In [19]:
# na.drop returns a new DataFrame; the result is not assigned here, so df_ca itself is unchanged
df_ca.na.drop('all', subset=['h_geocode', 'w_geocode', 'year', 's000'])

DataFrame[w_geocode: string, h_geocode: string, s000: int, sa01: int, sa02: int, sa03: int, se01: int, se02: int, se03: int, si01: int, si02: int, si03: int, createdate: string, year: int, h_tract: string, w_tract: string, h_state: string, w_state: string, sa_total: int, se_total: int, si_total: int, sa01_pct: double, sa02_pct: double, sa03_pct: double, se01_pct: double, se02_pct: double, se03_pct: double, si01_pct: double, si02_pct: double, si03_pct: double, from_tract: string, to_tract: string, miles: double, minutes: double, from_state: string, to_state: string]

In [20]:
pivoted_ca = df_ca.groupby('h_geocode', 'w_geocode')\
    .pivot("year")\
    .sum("s000")

KeyboardInterrupt: 

pivoted_ca.take(5)

Now for the whole country

pivoted = df.groupby('h_geocode', 'w_geocode')\
    .pivot("year")\
    .sum("s000")

pivoted.take(1)
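For reference, the intended result of the long-to-wide reshape is one row per home/work pair with one value per year. A minimal plain-Python sketch of that shape on made-up records follows; `pivot_by_year` and the short placeholder geocodes are illustrative only.

```python
from collections import defaultdict

def pivot_by_year(records):
    """Turn (h_geocode, w_geocode, year, s000) records into one entry per pair."""
    wide = defaultdict(dict)
    for h, w, year, s000 in records:
        by_year = wide[(h, w)]
        # Sum s000 when a pair has more than one record for the same year
        by_year[year] = by_year.get(year, 0) + s000
    return dict(wide)

# Made-up long-format records with placeholder geocodes
long_records = [
    ("h1", "w1", 2002, 1),
    ("h1", "w1", 2003, 2),
    ("h1", "w1", 2003, 1),
    ("h2", "w1", 2002, 4),
]
wide = pivot_by_year(long_records)
print(wide)
```

In Spark, `pivot` also accepts an explicit list of values as its second argument, for example `.pivot("year", list(range(2002, 2016)))`. Without that list Spark first has to compute the distinct years itself, which adds a pass over the data; whether that explains the slow run here has not been verified.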

### Analysis: How do age and earnings affect distance traveled for work in California?

Regression equation: `dist = b0 + b1*sa02_pct + b2*sa03_pct + b3*se02_pct + b4*se03_pct + error`, where `dist` is the driving distance in miles  
sa02_pct = share of jobs for workers age 30 to 54  
sa03_pct = share of jobs for workers age 55 or older  
se02_pct = share of jobs with earnings of \$1,251/month to \$3,333/month  
se03_pct = share of jobs with earnings greater than \$3,333/month

Note: 
- The first category of each group (age and earnings) is left out of the equation to serve as the baseline
- Filtering data to 2015 so that each origin/destination pair only counts as one observation

This regression explores the relationship between job composition and travel distance for origin/destination pairs of census blocks. By regressing travel distance on these four variables, we can evaluate whether the age and earnings of jobs in an origin/destination pair have an effect on the travel distance.
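The reason for leaving out the first category of each group can be shown with a few numbers: within a group the shares always sum to 1, so the omitted share is fully determined by the other two, and including all three alongside an intercept would make the regressors perfectly collinear. The sketch below uses made-up age shares for three origin/destination pairs.

```python
# Made-up age shares (sa01_pct, sa02_pct, sa03_pct) for three O/D pairs
age_shares = [(1.0, 0.0, 0.0), (0.0, 0.5, 0.5), (0.25, 0.5, 0.25)]

# The three shares always sum to 1, the same value as the intercept column
row_sums = [sum(shares) for shares in age_shares]

# So the omitted baseline share can be recovered from the other two
recovered_sa01 = [1.0 - sa02 - sa03 for _, sa02, sa03 in age_shares]

print(row_sums)
print(recovered_sa01)
```

With the baseline dropped, each coefficient is read relative to it: `b1`, for example, compares pairs whose jobs are all held by workers age 30 to 54 against pairs whose jobs are all held by workers age 29 or younger.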

In [None]:
df_ca_15 = df_ca.where('year == 2015')

In [22]:
df_ca_reg = df_ca_15[['sa02_pct', 'sa03_pct', 'se02_pct', 'se03_pct', 'miles']]

In [23]:
df_ca_reg_vect = df_ca_reg.rdd.map(lambda x: [Vectors.dense(x[0:4]), x[-1]]).toDF(['features', 'label'])
df_ca_reg_vect.show(5)

+-----------------+-----+
|         features|label|
+-----------------+-----+
|[0.0,0.0,1.0,0.0]| 16.9|
|[0.0,0.0,1.0,0.0]| 16.9|
|[0.0,1.0,0.0,1.0]| 16.9|
|[1.0,0.0,1.0,0.0]| 15.5|
|[0.0,0.0,0.0,1.0]| 15.5|
+-----------------+-----+
only showing top 5 rows



In [24]:
lr = LinearRegression(featuresCol = 'features', labelCol = 'label')

In [25]:
lr_model = lr.fit(df_ca_reg_vect)

KeyboardInterrupt: 

In [None]:
# make predictions
predictions = lr_model.transform(df_ca_reg_vect)

In [None]:
# print the coefficients and intercept 
print("Coefficients: %s" % str(lr_model.coefficients))
print("Intercept: %s" % str(lr_model.intercept))

In [None]:
# evaluate results
trainingSummary = lr_model.summary
print("RMSE: {}".format(trainingSummary.rootMeanSquaredError))
print("r2: {}".format(trainingSummary.r2))
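Because the model fit above was interrupted, no metrics are reported here. For readers unfamiliar with the two summary statistics, the following plain-Python sketch shows how RMSE and r2 are conventionally computed from labels and predictions; the data are made up and `rmse_and_r2` is a hypothetical helper, not the Spark implementation.

```python
import math

def rmse_and_r2(labels, predictions):
    """Return (root mean squared error, coefficient of determination)."""
    n = len(labels)
    residual_ss = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    mean_y = sum(labels) / n
    total_ss = sum((y - mean_y) ** 2 for y in labels)
    rmse = math.sqrt(residual_ss / n)   # typical size of a prediction error
    r2 = 1.0 - residual_ss / total_ss   # share of variance explained
    return rmse, r2

# Made-up distances (miles) and predictions
labels = [1.0, 2.0, 3.0]
predictions = [1.0, 2.0, 4.0]
rmse, r2 = rmse_and_r2(labels, predictions)
print("RMSE: {}".format(rmse))
print("r2: {}".format(r2))
```

RMSE is in the same units as the label (miles here), while r2 is unitless.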