# Obtain and Clean LODES Data

> "The LEHD Origin-Destination Employment Statistics (LODES) datasets are released both as
part of the OnTheMap application and in raw form as a set of comma separated variable (CSV)
text files. This document describes the structure of those raw files and provides basic information
for users who want to perform analytical work on the data outside of the OnTheMap application." (U.S. Census, 2021)

U.S. Census Bureau. (2021). LEHD Origin-Destination Employment Statistics Data (2002-2018) [computer file]. Washington, DC: U.S. Census Bureau, Longitudinal-Employer Household Dynamics Program [distributor], accessed on {CURRENT DATE} at https://lehd.ces.census.gov/data/#lodes. LODES 7.5 [version]

The LODES data provide employment characteristics and origin-destination data
    
    1. Read in LODES data
    2. Select Work Area Characteristics in Study Area
    3. Select Origin-Destination data in Study Area

## Description of Program
- program:    LODES_1av1_CleanLODESdata
- task:       Obtain and read in LODES data
- Version:    2021-08-14 - comparing 2015 and 2018 Rowland Elementary joblists
-             2021-08-18 - comparing improved 2014 and 2015 Rowland Elementary joblists
- project:    Interdependent Networked Community Resilience Modeling Environment (IN-CORE) Subtask 5.2 - Social Institutions
- funding:	  NIST Financial Assistance Award Numbers: 70NANB15H044 and 70NANB20H008 
- author:     Nathanael Rosenheim

- Suggested Citation:
Rosenheim, N. (2021) "Obtain, Clean, and LODES Jobs Data".
Archived on Github and ICPSR.

In [1]:
%matplotlib inline

import numpy as np
import pandas as pd
import os # For saving output to path

In [2]:
# Display versions being used - important information for replication
import sys
print("Python Version     ", sys.version)
print("numpy version:     ", np.__version__)
print("pandas version:    ", pd.__version__)

Python Version      3.7.10 | packaged by conda-forge | (default, Feb 19 2021, 15:37:01) [MSC v.1916 64 bit (AMD64)]
numpy version:      1.21.1
pandas version:     1.3.1


In [3]:
# Store Program Name for output files to have the same name
programname = "LODES_3av1_ExploreLODESdata_2021-08-18"
# Make directory to save output
if not os.path.exists(programname):
    os.mkdir(programname)

# Setup access to IN-CORE
https://incore.ncsa.illinois.edu/

In [4]:
#from pyincore import IncoreClient, Dataset, FragilityService, MappingSet, DataService
#from pyincore_viz.geoutil import GeoUtil as viz

### IN-CORE addons
This program uses code that is being developed as potential add-ons to pyincore. These functions are in a folder called pyincore_addons, located in the same directory as this notebook.
The add-on functions are organized to mirror the folder structure of https://github.com/IN-CORE/pyincore

Each add on function attempts to follow the structure of existing pyincore functions and includes some help information.

In [5]:
# To reload submodules need to use this magic command to set autoreload on
%load_ext autoreload
%autoreload 2
# open, read, and execute python program with reusable commands
# function that loops through lodes data structure
import pyincoredata_addons.lodes_fullloop_20210815 as lodes
import pyincoredata_addons.lodes_mcmcsa_util_20210816 as mcmc

# since the geoutil is under construction it might need to be reloaded
from importlib import reload 
#lodes = reload(lodes) # with auto reload on this command is not needed

# Print list of add on functions
from inspect import getmembers, isfunction
print(getmembers(lodes,isfunction))
print(getmembers(mcmc,isfunction))

[('fix_char_vars', <function fix_char_vars at 0x000001B1A0288D38>), ('full_lodes_loop', <function full_lodes_loop at 0x000001B19FC09948>), ('get_homeblocklist', <function get_homeblocklist at 0x000001B1A0288948>), ('import_lodes', <function import_lodes at 0x000001B1A0288A68>), ('keep_nonzeros', <function keep_nonzeros at 0x000001B1A0288EE8>), ('new_jobtypes', <function new_jobtypes at 0x000001B1A0288CA8>), ('stack_jobset', <function stack_jobset at 0x000001B1A0288E58>)]
[('add_missingeducation', <function add_missingeducation at 0x000001B1A028A048>), ('add_random_number', <function add_random_number at 0x000001B1A028A948>), ('calculate_combined_fitness', <function calculate_combined_fitness at 0x000001B1A028E1F8>), ('calculate_total_fitness', <function calculate_total_fitness at 0x000001B1A028E288>), ('get_single_characteristic_fitness', <function get_single_characteristic_fitness at 0x000001B1A028E168>), ('markov_chain_monte_carlo_simanneal', <function markov_chain_monte_carlo_simann

## Read In Cleaned LODES data

In [6]:
# Program used to clean LODES data joblist
sourceprogram1 =  "LODES_1av1_CleanLODESdata_2021-08-18"
sourceprogram2 =  "LODES_1av1_CleanLODESdata_2021-08-16"
# Create dictionary to store one data frame per year
df = {}


filepath = sourceprogram1+"/"+sourceprogram1+"2014.csv"
df['2014'] = pd.read_csv(filepath)

filepath = sourceprogram2+"/"+sourceprogram2+"2015.csv"
df['2015'] = pd.read_csv(filepath)
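
Some `h_geocode` values in the output below have only 14 digits (for example the Colorado blocks, state FIPS 08) because `pd.read_csv` parses block IDs as integers and drops the leading zero. A minimal sketch of reading the geocodes as text instead, assuming the file stores the leading zero and that its first column is a saved index. The in-memory CSV and the name `sample` are stand-ins for illustration; the geocodes are reused from this notebook's output.

```python
import io
import pandas as pd

# Hypothetical stand-in for a cleaned joblist file; the first (unnamed) column
# is the index that DataFrame.to_csv writes by default.
csv_text = (
    ",w_geocode,h_geocode,year\n"
    "0,371559612002006,080140302001057,2014\n"
    "1,371559612002006,370179501003034,2014\n"
)

sample = pd.read_csv(
    io.StringIO(csv_text),
    index_col=0,                                  # reuse the saved index
    dtype={"w_geocode": str, "h_geocode": str},   # keep leading zeros
)
print(sample["h_geocode"].tolist())
```

If the file was written from integer geocodes, the zero is already gone and the `zfill(15)` step used later in this notebook is still needed.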

In [7]:
df['2015'][['w_geocode','h_geocode','jobidac','year']].head()

Unnamed: 0,w_geocode,h_geocode,jobidac,year
0,371559612002006,212231002003022,jidodJT07133jobidac011532,2015
1,371559612002006,370179503001059,jidodJT07213jobidac211512,2015
2,371559612002006,370190203042017,jidodJT07223jobidac311532,2015
3,371559612002006,370479302001019,jidodJT07333jobidac411532,2015
4,371559612002006,370479306003057,jidodJT07323jobidac211512,2015


In [8]:
df['2014'][['w_geocode','h_geocode','jobidac','year']].head()

Unnamed: 0,w_geocode,h_geocode,jobidac,year
0,371559612002006,80140302001057,jidodJT07313jobidac311521,2014
1,371559612002006,80590104024033,jidodJT07233jobidac311512,2014
2,371559612002006,370179501003034,jidodJT07333jobidac411511,2014
3,371559612002006,370179503004045,jidodJT07233jobidac211522,2014
4,371559612002006,370179503005006,jidodJT07223jobidac211512,2014


### Plan to compare different years
The different years should have similar characteristics by count, race, ethnicity, and sex. Other characteristics such as education, age, and earnings should also be similar, but these three might increase by 1 increment from one year to the next.
Industry and super sector should also be constant.

The h_geocodes should be compared by Census Tracts for most observations. Long distance commutes (greater than 30-40 miles) should be compared by PUMAs. Really long distance pairs (500+ miles between home and work) should be compared by SuperPUMAs. The PUMA and SuperPUMA designations would need to be merged in manually. I have code for the PUMA merge, but I have never seen the actual mapping between PUMAs and SuperPUMAs. Also, need to double check the citation for PUMA and SuperPUMA coarsening. For now I might just use counties and state level data for longer distance commutes.

#### compare based on jobidac only

In [9]:
years = ['2014','2015']
years[0]

'2014'

In [10]:
jobidac_list = {}
for year in years:
    jobidac_list[year] = pd.pivot_table(df[year], index = 'jobidac', values = 'h_geocode', aggfunc='count')
    jobidac_list[year].reset_index(inplace = True)
    jobidac_list[year] =  jobidac_list[year].rename(columns = {'h_geocode' : year})

compare_jobidac_list = pd.merge(left = jobidac_list[years[0]],
                                right = jobidac_list[years[1]],
                                on = 'jobidac',
                                how = 'outer')
compare_jobidac_list = compare_jobidac_list.fillna(value = 0)
compare_jobidac_list.loc[:,'match'] = 0
compare_jobidac_list.loc[(compare_jobidac_list[years[0]] <= compare_jobidac_list[years[1]]),'match'] = compare_jobidac_list[years[0]] 
compare_jobidac_list.loc[(compare_jobidac_list[years[1]] <= compare_jobidac_list[years[0]]),'match'] = compare_jobidac_list[years[1]] 
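
The two `.loc` assignments above set `match` to the smaller of the two yearly counts for each `jobidac`. A minimal sketch of the same element-wise minimum with `numpy.minimum`; the frame `toy` and its placeholder IDs are hypothetical, and the counts mirror a few rows of the comparison output in this notebook.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "jobidac": ["a", "b", "c", "d"],   # placeholder IDs, not real jobidac values
    "2014": [1.0, 2.0, 1.0, 2.0],
    "2015": [1.0, 0.0, 3.0, 1.0],
})

# A job type can only match as many times as it appears in both years.
toy["match"] = np.minimum(toy["2014"], toy["2015"])
print(toy)
```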


In [11]:
compare_jobidac_list.sort_values(by = 'jobidac')

Unnamed: 0,jobidac,2014,2015,match
0,jidodJT07113jobidac011532,1.0,1.0,1.0
1,jidodJT07123jobidac011511,1.0,1.0,1.0
2,jidodJT07123jobidac011512,1.0,1.0,1.0
3,jidodJT07123jobidac011522,2.0,0.0,0.0
4,jidodJT07123jobidac011532,1.0,1.0,1.0
...,...,...,...,...
82,jidodJT07333jobidac411531,0.0,1.0,0.0
83,jidodJT07333jobidac411532,0.0,1.0,0.0
84,jidodJT11113jobidac011532,0.0,1.0,0.0
58,jidodJT11213jobidac211521,1.0,0.0,0.0


In [12]:
compare_jobidac_list.loc[(compare_jobidac_list[years[0]] != 0) &
                         (compare_jobidac_list[years[1]] != 0)] 

Unnamed: 0,jobidac,2014,2015,match
0,jidodJT07113jobidac011532,1.0,1.0,1.0
1,jidodJT07123jobidac011511,1.0,1.0,1.0
2,jidodJT07123jobidac011512,1.0,1.0,1.0
4,jidodJT07123jobidac011532,1.0,1.0,1.0
10,jidodJT07223jobidac111512,1.0,1.0,1.0
11,jidodJT07223jobidac111522,1.0,1.0,1.0
12,jidodJT07223jobidac111532,1.0,3.0,1.0
16,jidodJT07223jobidac211532,1.0,3.0,1.0
18,jidodJT07223jobidac311512,1.0,1.0,1.0
20,jidodJT07223jobidac311522,2.0,1.0,1.0


In [13]:
total_jobyear1 = np.sum(compare_jobidac_list[years[0]])
total_jobyear2 = np.sum(compare_jobidac_list[years[1]])
match = np.sum(compare_jobidac_list['match'])
print('Total jobs in year1 ',total_jobyear1)
print('Total jobs in year2 ',total_jobyear2)
print('Match on all characteristics ', match)
print('Match on all characteristics ', int(match/total_jobyear1*100), '%')

Total jobs in year1  82.0
Total jobs in year2  74.0
Match on all characteristics  33.0
Match on all characteristics  40 %


### Match on Jobidac and Tract

In [14]:
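def compare_joblist_years(df,years, index_vars, value_var):
    """
    Count jobs per group in each year and report how many match across years.

    df: dictionary of joblist dataframes keyed by year
    years: list of keys in df to compare
    index_vars: column name or list of column names that define a group
    value_var: column counted within each group (for example 'h_geocode')

    Returns the total number of matched jobs and the comparison table.
    Note: match is computed pair by pair, so with more than two years
    it reflects only the last pair.
    """
    pivot_df = {}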
    
    for year in years:
        pivot_df[year] = pd.pivot_table(df[year], index = index_vars, values = value_var, aggfunc='count')
        pivot_df[year].reset_index(inplace = True)
        pivot_df[year] =  pivot_df[year].rename(columns = {value_var : year})

    # merge all years together
    compare_list = pivot_df[years[0]]
    for i in range(len(years) -1):
        compare_list = pd.merge(left = compare_list,
                                right = pivot_df[years[i+1]],
                                on = index_vars,
                                how = 'outer')
    compare_list = compare_list.fillna(value = 0)
    
    # How many jobs match
    compare_list.loc[:,'match'] = 0
    for i in range(len(years) -1):
        compare_list.loc[(compare_list[years[i]] <= compare_list[years[i+1]]),'match'] = compare_list[years[i]] 
        compare_list.loc[(compare_list[years[i+1]] <= compare_list[years[i]]),'match'] = compare_list[years[i+1]] 
    
    mintotaljobs = 0
    for year in years:
        totaljobs = np.sum(compare_list[year])
        print('Total jobs in',year,totaljobs)
        if  mintotaljobs == 0:
            mintotaljobs = totaljobs
        elif mintotaljobs > totaljobs:
            mintotaljobs = totaljobs

    match = np.sum(compare_list['match'])
    print('Match on ',index_vars, match)
    print('Match Percent = ', int(match/mintotaljobs*100), '%')
    
    return match, compare_list
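
With more than two years, the pairwise loop in `compare_joblist_years` overwrites `match` on each pass, so the result reflects only the last pair of years. A hedged sketch of one alternative, assuming the intended quantity is the number of jobs present in every year: count rows per group with `groupby(...).size()`, align the years with `pd.concat`, and take the row-wise minimum. The function name `count_matches` and the toy frames are hypothetical.

```python
import pandas as pd

def count_matches(frames, years, index_vars):
    """Return (total matched jobs, comparison table) across all years."""
    counts = [frames[year].groupby(index_vars).size().rename(year) for year in years]
    # Outer-align the per-year counts; a group absent in a year counts as zero.
    table = pd.concat(counts, axis=1).fillna(0)
    table["match"] = table[years].min(axis=1)
    return table["match"].sum(), table

# Toy joblists with made-up category labels
frames = {
    "2014": pd.DataFrame({"Sex": ["F", "F", "M"], "Race": ["A", "A", "B"]}),
    "2015": pd.DataFrame({"Sex": ["F", "M", "M"], "Race": ["A", "B", "C"]}),
}
total, table = count_matches(frames, ["2014", "2015"], ["Sex", "Race"])
print(total)
print(table)
```

For two years this gives the same count as the pairwise comparison.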

In [15]:
def add_tractid(df,geocodevar):
    
    df[geocodevar+'_str'] = df[geocodevar].apply(lambda x : str(int(x)).zfill(15))
    df[geocodevar+'_tractid'] = df[geocodevar+'_str'].str[0:11]
    
    return df

def add_countyid(df,geocodevar):
    
    df[geocodevar+'_str'] = df[geocodevar].apply(lambda x : str(int(x)).zfill(15))
    df[geocodevar+'_countyid'] = df[geocodevar+'_str'].str[0:5]
    
    return df
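
The slices in `add_tractid` and `add_countyid` follow from the layout of a 15-digit census block GEOID: 2 digits of state FIPS, 3 of county, 6 of tract, and 4 of block. A short worked example on one Colorado `h_geocode` from the 2014 output, which has lost its leading zero because it is stored as an integer. The helper name `split_block_geoid` is hypothetical.

```python
def split_block_geoid(geocode):
    """Split a census block GEOID into its state, county, and tract prefixes."""
    block = str(int(geocode)).zfill(15)  # restore a dropped leading zero
    return {
        "state": block[:2],
        "county": block[:5],
        "tract": block[:11],
        "block": block,
    }

parts = split_block_geoid(80140302001057)
print(parts)
```

Without `zfill(15)` the slices would shift by one character and the county would read `80140` instead of `08014`.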

In [16]:
match, comparelist = compare_joblist_years(df,years, 'jobidac', 'h_geocode')

Total jobs in 2014 82.0
Total jobs in 2015 74.0
Match on  jobidac 33.0
Match Percent =  44 %
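
This 44 % differs from the 40 % printed in In [13] only because of the denominator: In [13] divides the 33 matches by the 2014 total (82), while `compare_joblist_years` divides by the smaller yearly total (74). In both cases `int()` truncates rather than rounds. The arithmetic, using the totals printed above:

```python
match, total_2014, total_2015 = 33.0, 82.0, 74.0

# In [13]: share of the first year's jobs
pct_of_first_year = int(match / total_2014 * 100)

# compare_joblist_years: share of the smaller yearly total
pct_of_smaller_year = int(match / min(total_2014, total_2015) * 100)

print(pct_of_first_year, pct_of_smaller_year)
```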


In [17]:
for year in years:
    df[year] = add_tractid(df[year], 'h_geocode')
    df[year] = add_countyid(df[year], 'h_geocode')

In [18]:
match, comparelist = compare_joblist_years(df,years, index_vars = ['jobidac','h_geocode_tractid'], value_var =  'h_geocode')

Total jobs in 2014 82.0
Total jobs in 2015 74.0
Match on  ['jobidac', 'h_geocode_tractid'] 1.0
Match Percent =  1 %


In [19]:
match, comparelist = compare_joblist_years(df,years, index_vars = ['jobidac','h_geocode_countyid'], value_var =  'h_geocode')

Total jobs in 2014 82.0
Total jobs in 2015 74.0
Match on  ['jobidac', 'h_geocode_countyid'] 19.0
Match Percent =  25 %


In [20]:
match, comparelist = compare_joblist_years(df,years, index_vars = ['Race'], value_var =  'h_geocode')

Total jobs in 2014 82
Total jobs in 2015 74.0
Match on  ['Race'] 71.0
Match Percent =  95 %


In [21]:
match, comparelist = compare_joblist_years(df,years, index_vars = ['Race','h_geocode_tractid'], value_var =  'h_geocode')

Total jobs in 2014 82.0
Total jobs in 2015 74.0
Match on  ['Race', 'h_geocode_tractid'] 24.0
Match Percent =  32 %


In [22]:
match, comparelist = compare_joblist_years(df,years, index_vars = ['Race','h_geocode_countyid'], value_var =  'h_geocode')

Total jobs in 2014 82.0
Total jobs in 2015 74.0
Match on  ['Race', 'h_geocode_countyid'] 58.0
Match Percent =  78 %


In [23]:
match, comparelist = compare_joblist_years(df,years, index_vars = ['Sex'], value_var =  'h_geocode')

Total jobs in 2014 82
Total jobs in 2015 74
Match on  ['Sex'] 74
Match Percent =  100 %


In [24]:
match, comparelist = compare_joblist_years(df,years, index_vars = ['Sex','h_geocode_tractid'], value_var =  'h_geocode')

Total jobs in 2014 82.0
Total jobs in 2015 74.0
Match on  ['Sex', 'h_geocode_tractid'] 29.0
Match Percent =  39 %


In [25]:
match, comparelist = compare_joblist_years(df,years, index_vars = ['Sex','Race','h_geocode_tractid'], value_var =  'h_geocode')

Total jobs in 2014 82.0
Total jobs in 2015 74.0
Match on  ['Sex', 'Race', 'h_geocode_tractid'] 20.0
Match Percent =  27 %


In [26]:
match, comparelist = compare_joblist_years(df,years, index_vars = ['Sex','Race','Ethnicity','h_geocode_tractid'],
                                           value_var =  'h_geocode')

Total jobs in 2014 82.0
Total jobs in 2015 74.0
Match on  ['Sex', 'Race', 'Ethnicity', 'h_geocode_tractid'] 19.0
Match Percent =  25 %


In [27]:
match, comparelist = compare_joblist_years(df,years, index_vars = ['jobtype','Sex','Race','Ethnicity','h_geocode_tractid'],
                                           value_var =  'h_geocode')

Total jobs in 2014 82.0
Total jobs in 2015 74.0
Match on  ['jobtype', 'Sex', 'Race', 'Ethnicity', 'h_geocode_tractid'] 19.0
Match Percent =  25 %


In [28]:
match, comparelist = compare_joblist_years(df,years, index_vars = ['jobtype','IndustryCode','Sex','Race','Ethnicity','h_geocode_tractid'],
                                           value_var =  'h_geocode')

Total jobs in 2014 82.0
Total jobs in 2015 74.0
Match on  ['jobtype', 'IndustryCode', 'Sex', 'Race', 'Ethnicity', 'h_geocode_tractid'] 19.0
Match Percent =  25 %


In [29]:
from pyincoredata_addons.lodesdatautil_20210815 import add_latlon
from pyincoredata_addons.lodesdatautil_20210815 import add_distance

for year in years:
    df[year] = add_latlon(df[year])
    df[year] = add_distance(df[year], 'blklatdd_w', 'blklatdd_h', 'blklondd_w', 'blklondd_h' )

w_geocode nc
{'tabblk2010': 'tabblk2010_w', 'blklatdd': 'blklatdd_w', 'blklondd': 'blklondd_w'}
h_geocode co
{'tabblk2010': 'tabblk2010_h', 'blklatdd': 'blklatdd_h', 'blklondd': 'blklondd_h'}
h_geocode nc
{'tabblk2010': 'tabblk2010_h', 'blklatdd': 'blklatdd_h', 'blklondd': 'blklondd_h'}
h_geocode sc
{'tabblk2010': 'tabblk2010_h', 'blklatdd': 'blklatdd_h', 'blklondd': 'blklondd_h'}
w_geocode nc
{'tabblk2010': 'tabblk2010_w', 'blklatdd': 'blklatdd_w', 'blklondd': 'blklondd_w'}
h_geocode ky
{'tabblk2010': 'tabblk2010_h', 'blklatdd': 'blklatdd_h', 'blklondd': 'blklondd_h'}
h_geocode nc
{'tabblk2010': 'tabblk2010_h', 'blklatdd': 'blklatdd_h', 'blklondd': 'blklondd_h'}
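
The `add_distance` function lives in the add-on module and is not shown in this notebook. As a rough check on the units of `od_distance`, here is a sketch of a great-circle (haversine) distance, assuming a mean Earth radius of 6371 km. This is an assumption about the method, not the add-on's actual implementation, and `haversine_km` is a hypothetical name. The coordinates are the work block and two home blocks from the 2015 tables in this notebook.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in kilometers between points in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * np.arcsin(np.sqrt(a))

work = (34.622429, -78.995762)      # blklatdd_w, blklondd_w
home_nc = (34.644261, -78.71662)    # listed od_distance: 25.654264
home_ky = (38.51265, -85.271997)    # listed od_distance: 707.714296

d_nc = haversine_km(*work, *home_nc)
d_ky = haversine_km(*work, *home_ky)
print(round(d_nc, 1), round(d_ky, 1))
```

Both results land close to the listed `od_distance` values, which suggests the column is in kilometers rather than miles.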


In [30]:
df['2015'].head()

Unnamed: 0.1,Unnamed: 0,w_geocode,h_geocode,jobidod,jobidod_counter,jobidod_total_rand,jobidwacracod_counter,jobidac,jobidac_counter,jobtype,...,w_geocode_stabbr,tabblk2010_w,blklatdd_w,blklondd_w,h_geocode_stfips,h_geocode_stabbr,tabblk2010_h,blklatdd_h,blklondd_h,od_distance
0,1,371559612002006,370179503001059,jidodJT07213,1,0,0,jidodJT07213jobidac211512,1,JT07,...,nc,371559612002006,34.622429,-78.995762,37,nc,370179503001059,34.644261,-78.71662,25.654264
1,2,371559612002006,370190203042017,jidodJT07223,1,0,7,jidodJT07223jobidac311532,1,JT07,...,nc,371559612002006,34.622429,-78.995762,37,nc,370190203042017,33.953199,-78.086835,111.849522
2,3,371559612002006,370479302001019,jidodJT07333,1,0,3,jidodJT07333jobidac411532,1,JT07,...,nc,371559612002006,34.622429,-78.995762,37,nc,370479302001019,34.362963,-78.411764,60.802301
3,4,371559612002006,370479306003057,jidodJT07323,1,0,0,jidodJT07323jobidac211512,1,JT07,...,nc,371559612002006,34.622429,-78.995762,37,nc,370479306003057,34.284822,-78.899931,38.554777
4,5,371559612002006,370510030013000,jidodJT07233,1,0,37,jidodJT07233jobidac411512,1,JT07,...,nc,371559612002006,34.622429,-78.995762,37,nc,370510030013000,34.914973,-78.92702,33.12985


In [31]:
df['2015'].loc[df['2015']['h_geocode_stabbr'] != df['2015']['w_geocode_stabbr']]

Unnamed: 0.1,Unnamed: 0,w_geocode,h_geocode,jobidod,jobidod_counter,jobidod_total_rand,jobidwacracod_counter,jobidac,jobidac_counter,jobtype,...,w_geocode_stabbr,tabblk2010_w,blklatdd_w,blklondd_w,h_geocode_stfips,h_geocode_stabbr,tabblk2010_h,blklatdd_h,blklondd_h,od_distance
73,0,371559612002006,212231002003022,jidodJT07133,1,0,3,jidodJT07133jobidac011532,1,JT07,...,nc,371559612002006,34.622429,-78.995762,21,ky,212231002003022,38.51265,-85.271997,707.714296


In [32]:
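def add_coarse_geovar(df):
    """
    Add a coarsened home geography (h_geocode_coarse) to a joblist.

    LODES data coarsens the geocoding of the home block based on the distance
    between origin and destination pairs. This function approximates that rule:
    - by default use the home census tract
    - if distance is greater than the average use the county (in place of PUMA)
    - if distance is greater than 500 km use the state FIPS code (in place of SuperPUMA)
    """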
    
    col_list = [col for col in df]
    # Check if dataframe has tract and county variables for home geocodes
    if 'h_geocode_tractid' not in col_list:
        df = add_tractid(df, 'h_geocode')
    if 'h_geocode_countyid' not in col_list:
        df = add_countyid(df, 'h_geocode')
    
    # Check if dataframe has od_distance
    if 'od_distance' not in col_list:
        if 'blklatdd_w' not in col_list:
            df = add_latlon(df)
        
        df = add_distance(df, 'blklatdd_w', 'blklatdd_h', 'blklondd_w', 'blklondd_h' )
    
    mean_dist = df['od_distance'].mean()
    print("The average commute distance for data is ",mean_dist)
    # Add coarse_geovar
    # All jobs are coarsened to tractid for home origin
    df.loc[:,'h_geocode_coarse'] = df['h_geocode_tractid']
    
    # locate observations with longer than average commutes
    # Coarse Geovar should be the PUMA with population of 100,000
    df.loc[(df['od_distance'] > mean_dist), 'h_geocode_coarse'] = df['h_geocode_countyid']
    
    # Locate observations with commutes longer than 500 km
    df.loc[(df['od_distance'] > 500), 'h_geocode_coarse'] = df['h_geocode_stfips']
    
    return df  
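
`h_geocode_tractid` and `h_geocode_countyid` are strings, while `h_geocode_stfips` appears as an integer in the output, so `h_geocode_coarse` can end up holding mixed types. A hedged sketch of the same three-level rule written with `Series.mask`, keeping every key as a zero-padded string. The function name `coarse_home_key`, the `far` parameter, and the toy distances are assumptions for illustration; the geocodes are reused from the 2015 output.

```python
import pandas as pd

def coarse_home_key(df, far=500.0):
    """Pick the tract, county, or state prefix of the home block by distance."""
    block = df["h_geocode"].astype("int64").astype(str).str.zfill(15)
    dist = df["od_distance"]
    key = block.str[:11]                                # default: tract
    key = key.mask(dist > dist.mean(), block.str[:5])   # longer than average: county
    key = key.mask(dist > far, block.str[:2])           # very long: state
    return key

toy = pd.DataFrame({
    "h_geocode": [370179503001059, 370190203042017, 212231002003022, 370510030013000],
    "od_distance": [10.0, 300.0, 700.0, 20.0],  # made-up distances
})
toy["h_geocode_coarse"] = coarse_home_key(toy)
print(toy["h_geocode_coarse"].tolist())
```

As in `add_coarse_geovar`, the later rule overwrites the earlier one, so a commute above both thresholds ends up at the state level.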

In [33]:
for year in years:
    df[year] = add_coarse_geovar(df[year])

The average commute distance for data is  79.87614206426692
The average commute distance for data is  48.40164774264854


In [34]:
match, comparelist = compare_joblist_years(df,years, index_vars = ['jobtype','IndustryCode','Sex','Race','Ethnicity','h_geocode_coarse'],
                                           value_var =  'h_geocode')

Total jobs in 2014 82.0
Total jobs in 2015 74.0
Match on  ['jobtype', 'IndustryCode', 'Sex', 'Race', 'Ethnicity', 'h_geocode_coarse'] 19.0
Match Percent =  25 %


In [35]:
df['2015'].head()

Unnamed: 0.1,Unnamed: 0,w_geocode,h_geocode,jobidod,jobidod_counter,jobidod_total_rand,jobidwacracod_counter,jobidac,jobidac_counter,jobtype,...,tabblk2010_w,blklatdd_w,blklondd_w,h_geocode_stfips,h_geocode_stabbr,tabblk2010_h,blklatdd_h,blklondd_h,od_distance,h_geocode_coarse
0,1,371559612002006,370179503001059,jidodJT07213,1,0,0,jidodJT07213jobidac211512,1,JT07,...,371559612002006,34.622429,-78.995762,37,nc,370179503001059,34.644261,-78.71662,25.654264,37017950300
1,2,371559612002006,370190203042017,jidodJT07223,1,0,7,jidodJT07223jobidac311532,1,JT07,...,371559612002006,34.622429,-78.995762,37,nc,370190203042017,33.953199,-78.086835,111.849522,37019
2,3,371559612002006,370479302001019,jidodJT07333,1,0,3,jidodJT07333jobidac411532,1,JT07,...,371559612002006,34.622429,-78.995762,37,nc,370479302001019,34.362963,-78.411764,60.802301,37047
3,4,371559612002006,370479306003057,jidodJT07323,1,0,0,jidodJT07323jobidac211512,1,JT07,...,371559612002006,34.622429,-78.995762,37,nc,370479306003057,34.284822,-78.899931,38.554777,37047930600
4,5,371559612002006,370510030013000,jidodJT07233,1,0,37,jidodJT07233jobidac411512,1,JT07,...,371559612002006,34.622429,-78.995762,37,nc,370510030013000,34.914973,-78.92702,33.12985,37051003001


### Save final job list and mcmc results table as csv

In [36]:
for year in years:
    savefile = sys.path[0]+"/"+programname+"/"+programname+"_"+year+".csv"
    df[year].to_csv(savefile)
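
`to_csv` writes the index by default, which is most likely why the re-read tables above carry `Unnamed: 0` and `Unnamed: 0.1` columns. A minimal sketch of a save step that skips the index and builds the path with `os.path.join`. The folder name, file name, and toy frame are placeholders, and a temporary directory stands in for the real output folder.

```python
import os
import tempfile
import pandas as pd

toy = pd.DataFrame({"h_geocode": ["080140302001057"], "year": [2014]})

with tempfile.TemporaryDirectory() as workdir:
    outdir = os.path.join(workdir, "example_program")   # placeholder folder name
    os.makedirs(outdir, exist_ok=True)                  # no error if it already exists
    savefile = os.path.join(outdir, "example_program_2014.csv")
    toy.to_csv(savefile, index=False)                   # do not write the index column
    reread = pd.read_csv(savefile, dtype={"h_geocode": str})

print(reread.columns.tolist())
```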