# Apply Timeit to LODES MCMC SA Utilities

> "The LEHD Origin-Destination Employment Statistics (LODES) datasets are released both as
part of the OnTheMap application and in raw form as a set of comma separated variable (CSV)
text files. This document describes the structure of those raw files and provides basic information
for users who want to perform analytical work on the data outside of the OnTheMap application." (U.S. Census, 2021)

U.S. Census Bureau. (2021). LEHD Origin-Destination Employment Statistics Data (2002-2018) [computer file]. Washington, DC: U.S. Census Bureau, Longitudinal-Employer Household Dynamics Program [distributor], accessed on {CURRENT DATE} at https://lehd.ces.census.gov/data/#lodes. LODES 7.5 [version]

## Description of Program
- program:    LODES_1bv1_timeit_MCMCSA_util
- task:       Use timeit to improve code efficiency
- See github commits for description of program updates
- Current Version:    2021-09-16
- project:    Interdependent Networked Community Resilience Modeling Environment (IN-CORE) Subtask 5.2 - Social Institutions
- funding:	  NIST Financial Assistance Award Numbers: 70NANB15H044 and 70NANB20H008 
- author:     Nathanael Rosenheim

- Suggested Citation:
Rosenheim, N. (2021) "Obtain, Clean, and LODES Jobs Data".
Archived on Github and ICPSR.

In [1]:
%matplotlib inline

import numpy as np
import pandas as pd
import os # For saving output to path

In [2]:
# Display versions being used - important information for replication
import sys
print("Python Version     ", sys.version)
print("numpy version:     ", np.__version__)
print("pandas version:    ", pd.__version__)

Python Version      3.7.10 | packaged by conda-forge | (default, Feb 19 2021, 15:37:01) [MSC v.1916 64 bit (AMD64)]
numpy version:      1.21.1
pandas version:     1.3.1


In [3]:
# Store Program Name for output files to have the same name
programname = "LODES_1bv1_timeit_MCMCSA_util"
# Save Outputfolder - due to long folder name paths output saved to folder with shorter name
# files from this program will be saved with the program name - this helps to follow the overall workflow
outputfolder = "lodes_workflow_output"
# Make directory to save output
# exist_ok=True makes this a no-op when the folder is already there
os.makedirs(outputfolder, exist_ok=True)

### Setup notebook environment to access Cloned Github Package
This notebook uses packages that are in development. The packages are available at:

https://github.com/npr99/Labor_Market_Allocation

To replicate this notebook, clone the GitHub package to a folder that is a sibling of this notebook.

To access the sibling package you will need to append the parent directory ('..') to the system path list.

In [5]:
# to access new package that is in a sibling folder - the system path list needs to include the parent folder (..)
# append the path of the directory that includes the github repository.
sys.path.append("..\\github_com\\npr99\\Labor_Market_Allocation")

# use the .. relative path if running the notebook in the GitHub folder 
# Caution this will add files to github on the next commit and push 
# recommend running notebooks in a folder outside of the cloned github repository
#sys.path.append("..") 
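The path above uses Windows-style backslashes. A minimal sketch of a more portable way to add the cloned repository to the system path is shown below. The helper name `add_repo_to_path` is hypothetical and not part of the package; the folder layout is assumed to match the path used in the cell above.

```python
import sys
from pathlib import Path

def add_repo_to_path(repo_dir):
    """Append repo_dir to sys.path once; return True if the folder exists.

    Hypothetical helper for illustration only.
    """
    repo_path = Path(repo_dir).resolve()
    if not repo_path.is_dir():
        return False
    if str(repo_path) not in sys.path:
        sys.path.append(str(repo_path))
    return True

# Assumed location of the cloned repository, relative to this notebook
repo_found = add_repo_to_path(Path("..") / "github_com" / "npr99" / "Labor_Market_Allocation")
print("Repository folder found:", repo_found)
```

Building the path from parts with `pathlib` avoids hard-coding a separator, and the existence check gives an early warning when the clone is missing instead of a later import error.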

# Setup access to IN-CORE
https://incore.ncsa.illinois.edu/

In [5]:
#from pyincore import IncoreClient, Dataset, FragilityService, MappingSet, DataService
#from pyincore_viz.geoutil import GeoUtil as viz

### IN-CORE addons
This program uses code that is being developed as potential add-ons to pyincore. These functions are in a folder called pyincore_addons - this folder is located in the same directory as this notebook.
The add-on functions are organized to mirror the folder structure of https://github.com/IN-CORE/pyincore

Each add on function attempts to follow the structure of existing pyincore functions and includes some help information.

In [6]:
# To reload submodules need to use this magic command to set autoreload on
%load_ext autoreload
%autoreload 2
# open, read, and execute python program with reusable commands
# function that loops through lodes data structure
import pyincoredata_addons.SourceData.lehd_ces_census_gov.lodes_fullloop as lodes
import pyincoredata_addons.SourceData.lehd_ces_census_gov.lodes_mcmcsa_loops as mcmc

# since the geoutil is under construction it might need to be reloaded
#from importlib import reload 
#lodes = reload(lodes) # with auto reload on this command is not needed

# Print list of add on functions
from inspect import getmembers, isfunction
print(getmembers(lodes,isfunction))
print(getmembers(mcmc,isfunction))

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
[('add_jobids', <function add_jobids at 0x000001E5BECEE558>), ('block_to_joblist', <function block_to_joblist at 0x000001E5BECEEDC8>), ('check_out_of_state_rac', <function check_out_of_state_rac at 0x000001E5BECEBB88>), ('expand_df', <function expand_df at 0x000001E5BECEE4C8>), ('explorebyblock', <function explorebyblock at 0x000001E5BECEE1F8>), ('fix_char_vars', <function fix_char_vars at 0x000001E5BECEBF78>), ('get_homeblocklist', <function get_homeblocklist at 0x000001E5BECEB9D8>), ('import_lodes', <function import_lodes at 0x000001E5BECEBCA8>), ('keep_nonzeros', <function keep_nonzeros at 0x000001E5BECEE168>), ('new_jobtypes', <function new_jobtypes at 0x000001E5BECEBEE8>), ('obtain_lodes_county_loop', <function obtain_lodes_county_loop at 0x000001E5BE6BF678>), ('out_of_state_rac_blocks', <function out_of_state_rac_blocks at 0x000001E5BECEEA68>), ('remove_duplicate_block_error', <function remove

## Read in Joblist File 
The joblist file is the input to the MCMC SA process. This file is the output of the Clean LODES Data program (LODES_1av3_CleanLODESdata). Currently the MCMC SA process takes multiple hours to run for a single block.


In [8]:
source_program = "LODES_1av3_CleanLODESdata"
source_folder = "lodes_workflow_output"
work_block = 371559612002006
year = '2015'

input_file = outputfolder+"/"+"joblist"+"_"+str(work_block)+"_"+year+"_mcmcsainput.csv"
possible_block_joblist_df = pd.read_csv(input_file)
possible_block_joblist_df.head()

Unnamed: 0.1,Unnamed: 0,w_geocode,h_geocode,jobidwacracod_counter,jobidod,jobidac,jobidac_counter,jobtype,year,Age,...,SuperSector,Education,Ethnicity,IndustryCode,Race,Sex,SA,SE,SI,jobidod_counter
0,0,371559612002006,212231002003022,0,jidodJT07133,jidodJT07133jobidac011511,1,JT07,2015,1,...,3,0,1,15,1,1,1.0,1.0,1.0,1
1,1,371559612002006,212231002003022,1,jidodJT07133,jidodJT07133jobidac011512,1,JT07,2015,1,...,3,0,1,15,1,2,1.0,1.0,1.0,1
2,2,371559612002006,212231002003022,2,jidodJT07133,jidodJT07133jobidac011531,1,JT07,2015,1,...,3,0,1,15,3,1,1.0,1.0,1.0,1
3,3,371559612002006,212231002003022,3,jidodJT07133,jidodJT07133jobidac011532,1,JT07,2015,1,...,3,0,1,15,3,2,1.0,1.0,1.0,1
4,4,371559612002006,370179503001059,0,jidodJT07213,jidodJT07213jobidac211512,1,JT07,2015,2,...,3,2,1,15,1,2,1.0,1.0,1.0,1
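The geocode columns above are 15-digit block identifiers. A minimal sketch, using a small hypothetical CSV rather than the real joblist file, of reading geocodes as text so that identifiers from states with a single-digit FIPS code keep their leading zero:

```python
import io
import pandas as pd

# Hypothetical two-row extract; the second home block starts with state FIPS 01
toy_csv = io.StringIO(
    "w_geocode,h_geocode,jobtype\n"
    "371559612002006,212231002003022,JT07\n"
    "371559612002006,010010201001000,JT07\n"
)

# Read geocode columns as strings to preserve leading zeros
toy_df = pd.read_csv(toy_csv, dtype={"w_geocode": str, "h_geocode": str})
print(toy_df["h_geocode"].str.len().unique())
```

If the geocodes were read this way in the notebook, `work_block` would also need to be a string for comparisons to match. The notebook instead keeps integers and pads with `zfill(15)` where a string is needed.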


## The MCMC SA loop includes these steps

Note: this step might be better moved out of the MCMC SA loop.

In [9]:
from pyincoredata_addons.SourceData.lehd_ces_census_gov.lodes_datautil import add_probability_job_selected
from pyincoredata_addons.SourceData.lehd_ces_census_gov.lodes_mcmcsa_util import rand_select_jobs

seedk = 41411

possible_block_joblist_df = add_probability_job_selected(possible_block_joblist_df)
rand_select_jobs_df = rand_select_jobs(possible_block_joblist_df, seedk)
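The internals of `add_probability_job_selected` and `rand_select_jobs` are not shown in this notebook. As a rough illustration only, seeded random selection could look like the sketch below. The function `sketch_rand_select` and the column `prob_selected` are assumptions made for this example and are not the package's implementation; only the `select_job` column name comes from the output shown in this notebook.

```python
import numpy as np
import pandas as pd

def sketch_rand_select(joblist_df, seed, prob_col="prob_selected"):
    """Flag each row as selected (1) or not (0) using its own probability.

    Hypothetical stand-in for illustration only.
    """
    rng = np.random.default_rng(seed)
    # One uniform draw in [0, 1) per possible job
    draws = rng.random(len(joblist_df))
    out = joblist_df.copy()
    out["select_job"] = (draws < out[prob_col].to_numpy()).astype(int)
    return out

# Toy job list with assumed selection probabilities
toy_jobs = pd.DataFrame({"jobidac": ["a", "b", "c", "d"],
                         "prob_selected": [0.0, 0.25, 0.75, 1.0]})

first = sketch_rand_select(toy_jobs, seed=41411)
second = sketch_rand_select(toy_jobs, seed=41411)
# The same seed reproduces the same selection
print(first["select_job"].equals(second["select_job"]))
```

Passing the seed explicitly, as the notebook does with `seedk`, is what makes a given selection reproducible across runs.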

In [10]:
rand_select_jobs_df[['w_geocode','h_geocode','jobidod','jobidac']].astype(str).describe()

Unnamed: 0,w_geocode,h_geocode,jobidod,jobidac
count,498,498,498,498
unique,1,72,11,149
top,371559612002006,370510030013000,jidodJT07233,jidodJT07233jobidac311512
freq,498,48,196,11


In [11]:
pd.pivot_table(rand_select_jobs_df, values = 'w_geocode', index=['jobtype','Earnings'],
                           aggfunc='count',
                           margins = True, margins_name = 'Total',
                           columns=['select_job'])

Unnamed: 0_level_0,select_job,0,1,Total
jobtype,Earnings,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
JT07,1.0,13.0,12.0,25
JT07,2.0,183.0,26.0,209
JT07,3.0,228.0,34.0,262
JT11,1.0,,2.0,2
Total,,424.0,74.0,498


## Read in data to check MCMC SA Fitness
The MCMC SA process uses the total counts of jobs to check the fitness of the random job selection process.

In [15]:
import us
from pyincoredata_addons.SourceData.lehd_ces_census_gov.lodes_datautil import explorebyblock

# Compare random selection of jobs with the WAC jobs list by Earnings, Age, and SuperSector
block_str = str(int(work_block)).zfill(15)
stfips = block_str[0:2]
stabbr = str.lower(us.states.lookup(stfips).abbr)
countyfips = block_str[0:5]
    
all_segstems = {'Earnings' : 'SE',
        'Age' : 'SA',
        'SuperSector' : 'SI'}

county_jobcounts_df = {}
wac_joblist_df = {}
for segstems in all_segstems:
    segstem = all_segstems[segstems]
    input_file = outputfolder+"/"+stabbr+"_"+countyfips+"_"+"wac_"+year+"_"+segstem+".csv"
    county_jobcounts_df[segstems] = pd.read_csv(input_file)
    wac_joblist_df[segstems] = explorebyblock(county_jobcounts_df[segstems],'w_geocode',[work_block])

In [16]:
wac_joblist_df['Earnings'].head()

Unnamed: 0.1,Unnamed: 0,w_geocode,C000,CA01,CA02,CA03,CE01,CE02,CE03,CNS01,...,CD03,CD04,CS01,CS02,jobtype,jobcount,segpart,Earnings,seg_stem,year
1596,6602,371559612002006,34.0,2.0,22.0,10.0,0.0,0.0,34.0,0.0,...,13.0,8.0,9.0,25.0,JT07,34.0,SE03,3,SE,2015
1597,4414,371559612002006,26.0,3.0,18.0,5.0,0.0,26.0,0.0,0.0,...,7.0,3.0,5.0,21.0,JT07,26.0,SE02,2,SE,2015
1598,1639,371559612002006,12.0,1.0,3.0,8.0,12.0,0.0,0.0,0.0,...,2.0,2.0,1.0,11.0,JT07,12.0,SE01,1,SE,2015
1599,2288,371559612002006,2.0,1.0,1.0,0.0,2.0,0.0,0.0,0.0,...,0.0,1.0,0.0,2.0,JT11,2.0,SE01,1,SE,2015


In [17]:
pd.pivot_table(wac_joblist_df['Earnings'], values = 'C000', index=['jobtype','Earnings'],
                           aggfunc='sum',
                           margins = True, margins_name = 'Total',
                           )

Unnamed: 0_level_0,Unnamed: 1_level_0,C000
jobtype,Earnings,Unnamed: 2_level_1
JT07,1.0,12.0
JT07,2.0,26.0
JT07,3.0,34.0
JT11,1.0,2.0
Total,,74.0
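The fitness measure used by the MCMC SA code is not shown in this notebook. One simple way to compare the two pivot tables above is to sum the absolute differences between the selected job counts and the WAC target counts; this is a sketch under that assumption, and `count_mismatch` is a hypothetical name. The counts are copied from the tables above.

```python
import pandas as pd

# Target counts copied from the WAC pivot table (jobtype, Earnings)
target = pd.Series({("JT07", 1): 12, ("JT07", 2): 26,
                    ("JT07", 3): 34, ("JT11", 1): 2})

# Selected counts copied from the select_job == 1 column of the random selection
selected = pd.Series({("JT07", 1): 12, ("JT07", 2): 26,
                      ("JT07", 3): 34, ("JT11", 1): 2})

def count_mismatch(selected_counts, target_counts):
    """Sum of absolute differences; 0 means every cell matches its target."""
    sel, tgt = selected_counts.align(target_counts, fill_value=0)
    return int((sel - tgt).abs().sum())

print(count_mismatch(selected, target))
```

For this seed the selected counts by job type and earnings match the WAC totals in every cell, so the mismatch is zero on this dimension. The Age and SuperSector totals would need the same comparison.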


## Function to Optimize

The larger code block is split into smaller functions so that each can be optimized.
This code splits the block-level job counts into a dataframe that has only observations where the segment variable is not equal to zero, and splits the job list into rows where the segment category matches the segment variable name.

Example: 
The block level data has a variable called CE01 for job counts with earnings in category 1 ($1,250 per month or less). 

The possible joblist data has jobs with the earnings variable characteristic set to 1. These jobs have other characteristics that when randomly selected can be compared to the characteristic totals in the block level data.

### Option 1 - Pandas .loc to split

In [18]:
def split_joblist_opt1(wac_block_jobcounts_df, joblist_df, jobtype, segment, seg_var, i):

    # split the wac joblist by segment and jobtype
    # use the dataframe passed in as an argument rather than the global wac_joblist_df
    split_wac_joblist_df = wac_block_jobcounts_df.loc[
        (wac_block_jobcounts_df[seg_var] != 0) & 
        (wac_block_jobcounts_df['jobtype'] == jobtype)]

    
    # split the wac-rac-od joblist by segment and jobtype
    split_rand_select_jobs_df =  joblist_df.loc[
        (joblist_df[segment] == i) & 
        (joblist_df['jobtype'] == jobtype)]
    
    return split_wac_joblist_df, split_rand_select_jobs_df


In [19]:
from pyincoredata_addons.SourceData.lehd_ces_census_gov._lodes_data_structure import all_stems

segment = 'Earnings'
# What is the segment stem to use to split data
seg_stem = all_stems[segment]

i = 1
# look at specific totals by earnings and age - segment vars are CE01, CE02, CE03
seg_var = seg_stem+"0"+str(i)
    
split_wac_joblist_df, split_rand_select_jobs_df = split_joblist_opt1(wac_joblist_df['Earnings'],
                                                                     rand_select_jobs_df,
                                                                     'JT07',
                                                                     segment,
                                                                     seg_var,i)
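Building the variable name with `seg_stem+"0"+str(i)` works for single-digit categories such as CE01 to CE03. LODES WAC industry variables run up to CNS20, so a zero-padded format may be more general. A small sketch, where `make_seg_var` is a hypothetical helper:

```python
def make_seg_var(seg_stem, i):
    """Build a LODES-style variable name with a two-digit, zero-padded suffix."""
    return f"{seg_stem}{i:02d}"

print(make_seg_var("CE", 1))    # CE01
print(make_seg_var("CNS", 12))  # CNS12
```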

In [20]:
split_wac_joblist_df.head(1).T

Unnamed: 0,1598
Unnamed: 0,1639
w_geocode,371559612002006
C000,12.0
CA01,1.0
CA02,3.0
CA03,8.0
CE01,12.0
CE02,0.0
CE03,0.0
CNS01,0.0


In [21]:
%timeit -r 5 -n 30 split_joblist_opt1(wac_joblist_df['Earnings'],rand_select_jobs_df,'JT07',segment,seg_var,i)

888 µs ± 23.7 µs per loop (mean ± std. dev. of 5 runs, 30 loops each)
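The `%timeit` magic is only available in IPython and Jupyter. A minimal sketch of a comparable measurement with the standard library `timeit` module, using a small toy dataframe and a hypothetical function `split_loc` in place of the notebook data:

```python
import timeit
import numpy as np
import pandas as pd

# Toy stand-in for the block-level job counts
toy_df = pd.DataFrame({"CE01": [12.0, 0.0, 2.0, 0.0],
                       "jobtype": ["JT07", "JT07", "JT11", "JT07"]})

def split_loc(df):
    """Boolean-mask split with .loc, in the style of Option 1."""
    return df.loc[(df["CE01"] != 0) & (df["jobtype"] == "JT07")]

# repeat=5 and number=30 mirror the -r 5 -n 30 flags used above
totals = timeit.repeat(lambda: split_loc(toy_df), repeat=5, number=30)

# Each entry in totals is the time for 30 calls; convert to time per call
per_loop = np.array(totals) / 30
print(f"{per_loop.mean() * 1e6:.0f} µs ± {per_loop.std() * 1e6:.0f} µs per loop")
```

Timings from the toy data are not comparable to the notebook results; the sketch only shows how the run and loop counts map onto `timeit.repeat`.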


### Option 2 - numpy .where

In [22]:
# Drop unnamed columns
wac_joblist_df['Earnings'] = wac_joblist_df['Earnings'].loc[:, ~wac_joblist_df['Earnings'].columns.str.contains('^Unnamed')]
wac_joblist_df['Earnings'].head()

Unnamed: 0,w_geocode,C000,CA01,CA02,CA03,CE01,CE02,CE03,CNS01,CNS02,...,CD03,CD04,CS01,CS02,jobtype,jobcount,segpart,Earnings,seg_stem,year
1596,371559612002006,34.0,2.0,22.0,10.0,0.0,0.0,34.0,0.0,0.0,...,13.0,8.0,9.0,25.0,JT07,34.0,SE03,3,SE,2015
1597,371559612002006,26.0,3.0,18.0,5.0,0.0,26.0,0.0,0.0,0.0,...,7.0,3.0,5.0,21.0,JT07,26.0,SE02,2,SE,2015
1598,371559612002006,12.0,1.0,3.0,8.0,12.0,0.0,0.0,0.0,0.0,...,2.0,2.0,1.0,11.0,JT07,12.0,SE01,1,SE,2015
1599,371559612002006,2.0,1.0,1.0,0.0,2.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,2.0,JT11,2.0,SE01,1,SE,2015


In [23]:
def split_joblist_opt2(dataset, jobtype, segment, seg_var, i):
    
    # Save column names
    colnames = list(dataset.columns)
    
    # Identify columns to keep
    keep_columns = [seg_var,'jobtype']
    
    # Make a dictionary of column names with column positions
    col_positions = dict(zip(keep_columns, list(range(1, len(keep_columns)+1))))

    
    # Identify columns to drop to make numpy array
    drop_columns = [col for col in colnames if col not in keep_columns]
    
    # Convert dataset to numpy
    original_data = dataset.drop(columns = drop_columns).to_numpy()
    
    original_index = list(dataset.index)
    
    # Concatenate the index and the data
    original_array = np.c_[original_index,original_data]
    
    # Select rows 
    (rows,) = np.where((original_array[:,col_positions['jobtype']] == jobtype) & 
                        (original_array[:,col_positions[seg_var]] != 0))
    selected_data = original_array[rows,:]
    
    # Get index from selected rows
    selected_index = selected_data[:, 0]
    
    # Drop index from selected_data numpy array
    selected_data = selected_data[:, 1:]
    
    # Create a new dataset using the matrix
    result = pd.DataFrame(selected_data, columns=keep_columns, index=selected_index)
    
    # Reinsert dropped columns
    result = pd.merge(right = result,
                     left  = dataset.drop(columns= keep_columns),
                     right_index = True,
                     left_index=True)
    
    return result

In [24]:
split_joblist_opt2(wac_joblist_df['Earnings'],'JT07',segment,seg_var,i)

Unnamed: 0,w_geocode,C000,CA01,CA02,CA03,CE02,CE03,CNS01,CNS02,CNS03,...,CD04,CS01,CS02,jobcount,segpart,Earnings,seg_stem,year,CE01,jobtype
1598,371559612002006,12.0,1.0,3.0,8.0,0.0,0.0,0.0,0.0,0.0,...,2.0,1.0,11.0,12.0,SE01,1,SE,2015,12.0,JT07


In [25]:
%timeit -r 5 -n 30  split_joblist_opt2(wac_joblist_df['Earnings'],'JT07',segment,seg_var,i)

2.17 ms ± 59.9 µs per loop (mean ± std. dev. of 5 runs, 30 loops each)
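In the timings above, Option 2 (2.17 ms) is slower than Option 1 (888 µs). Option 2 converts to numpy, rebuilds a dataframe, and merges the dropped columns back in, which likely accounts for part of the difference. A possible alternative, not timed in this notebook, is to build the boolean mask from numpy arrays and apply it directly to the original dataframe. The function name `split_joblist_mask` is hypothetical, and the toy rows reuse values from the WAC output shown earlier.

```python
import pandas as pd

def split_joblist_mask(dataset, jobtype, seg_var):
    """Select rows with a numpy boolean mask, keeping index and column order."""
    mask = (dataset[seg_var].to_numpy() != 0) & \
           (dataset["jobtype"].to_numpy() == jobtype)
    return dataset[mask]

# Toy rows using the CE01 and jobtype values from the WAC table above
toy_wac = pd.DataFrame({"CE01": [0.0, 0.0, 12.0, 2.0],
                        "jobtype": ["JT07", "JT07", "JT07", "JT11"]},
                       index=[1596, 1597, 1598, 1599])

print(split_joblist_mask(toy_wac, "JT07", "CE01"))
```

Whether this is faster than Option 1 on the real data would need to be checked with the same `%timeit` settings.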
