Matching OPM from all possible sources

---
title: "OPM Plan B"
author: "Lars Vilhuber"
date: "2/22/2019"
output:
  html_document:
    keep_md: true
    number_sections: true
---

# Situation

For a variety of reasons, the DUA (Data Use Agreement) between OPM and the Census Bureau has not been finalized for a significant time. This puts at risk dependent data products (QWI, J2J, LODES). The present document outlines a fallback strategy which may allow such dependent products to continue to be produced, albeit with some quality reductions. Quality compromises are expected to be minimal in the short run, but would continue to increase over time. We address some workarounds at the end of this document.

# Naming convention

- OPM(Census) – OPM microdata acquired through the DUA (last year of data: 2015)
- OPM(FOIA[x]) – OPM microdata acquired through FOIA requests to OPM (x = Cornell1, Cornell2, Buzzfeed); time coverage varies
- OPM(PU) – OPM microdata publicly available at Fedscope.gov
- ECF(A) – ECF built with dataset A

# Availability of Data

`/home/ssgprojects/project0002/cdl77/opm-clean/`

## Locations at Cornell

```r
kable(opmlocs)
```

| x |
|---|
| /data/clean/opm-foia |
| /ssgprojects/project0002/cdl77/opm-clean/outputs/2016 |
| /ssgprojects/project0002/cdl77/opm-clean/outputs/buzzfeed |
| /data/clean/opm |

Sources:

- OPM: `/data/doc/opm/SRC.txt`
- Buzzfeed: `/data/doc/opm-foia/Buzzfeed-20170524-Were Sharing A Vast Trove Of Federal Payroll Records.pdf`
- Cornell-FOIA 2013: `/data/doc/opm-foia/20131126154301380.pdf`
- Cornell-FOIA 2016: `/data/doc/opm-foia/OPM letter FOIA response 201611.pdf`
- Fedscope: `/data/doc/opm/FS_Employment_Sep2011_Documentation.pdf`

# Variables

TODO: This still needs the data elements on the internal data.

The various data sources do not all have the same data elements (full list):

```r
overview <- read_excel("overview.xlsx") %>%
  select("Variable", "Buzzfeed", "Cornell-FOIA 2013", "Cornell FOIA 2016",
         "Fedscope-old", "Fedscope-new")
## New names:
## * `` -> `..9`
## * `` -> `..10`
kable(overview %>% slice(1:10)) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = F)
```
| Variable | Buzzfeed | Cornell-FOIA 2013 | Cornell FOIA 2016 | Fedscope-old | Fedscope-new |
|---|---|---|---|---|---|
| Employee Name | 1 | 0 | 0 | 0 | 0 |
| Pseudo ID | 1 | 2 | 3 | 0 | 0 |
| Agency/Subelement | 1 | 1 | 1 | 1 | 1 |
| Duty Station | 1 | 1 | 1 | 0 | 0 |
| Location (State/Country) | 0 | 0 | 0 | 1 | 1 |
| Age Level | 1 | 1 | 1 | 1 | 1 |
| Sex | 0 | 1 | 0 | 1 | 0 |
| Ethnicity | 0 | 0 | 0 | 1 | 1 |
| Race | 0 | 0 | 0 | 1 | 1 |
| GS-Equivalent Grade | 0 | 1 | 0 | 1 | 1 |

(with another 28 rows not shown)

In particular, 12 variables are common to all public datasets, but key variables are present in only one or two datasets:

| Variable | Buzzfeed | Cornell-FOIA 2013 | Cornell FOIA 2016 | Fedscope-old | Fedscope-new | common |
|---|---|---|---|---|---|---|
| Duty Station | 1 | 1 | 1 | 0 | 0 | 3 |
| Location (State/Country) | 0 | 0 | 0 | 1 | 1 | 2 |
| Sex | 0 | 1 | 0 | 1 | 0 | 2 |
| Ethnicity | 0 | 0 | 0 | 1 | 1 | 2 |
| Race | 0 | 0 | 0 | 1 | 1 | 2 |
| GS-Equivalent Grade | 0 | 1 | 0 | 1 | 1 | 3 |
| Length of Service | 0 | 1 | 1 | 1 | 1 | 4 |
| Average Salary | 0 | 0 | 0 | 1 | 1 | 2 |
| Average Length of Service | 0 | 0 | 0 | 1 | 1 | 2 |
| Employment | 0 | 0 | 0 | 1 | 1 | 2 |
| Supervisory Status | 1 | 0 | 0 | 1 | 1 | 3 |
| Work Status | 0 | 0 | 0 | 1 | 1 | 2 |
| CBSA | 0 | 1 | 1 | 0 | 0 | 2 |
| MSA-RETRO | 0 | 1 | 0 | 1 | 1 | 3 |
| STEM Occupation | 0 | 0 | 0 | 1 | 1 | 2 |
| File Date | 0 | 1 | 1 | 0 | 0 | 2 |
| Start Date | 0 | 1 | 1 | 0 | 0 | 2 |
| End Date | 0 | 1 | 1 | 0 | 0 | 2 |
| Accession | 1 | 0 | 0 | 1 | 1 | 3 |
| Effective Date of Accession | 1 | 0 | 0 | 1 | 1 | 3 |
| Separation | 1 | 0 | 0 | 1 | 1 | 3 |
| Effective Date of Separation | 1 | 0 | 0 | 1 | 1 | 3 |
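The `common` column above can be recomputed by summing the per-source indicator columns. A minimal base-R sketch, using a hypothetical two-row extract of the overview table (column names abbreviated for illustration):

```r
# Hypothetical two-row extract of the overview indicator table
overview <- data.frame(
  Variable     = c("Duty Station", "Location (State/Country)"),
  Buzzfeed     = c(1, 0),
  Cornell2013  = c(1, 0),
  Cornell2016  = c(1, 0),
  Fedscope_old = c(0, 1),
  Fedscope_new = c(0, 1)
)
# "common" = number of sources in which the variable is present
overview$common <- unname(rowSums(overview[, -1]))
overview$common  # [1] 3 2
```

In the original chunk this was done with the now-deprecated `funs()`; in current dplyr the same computation would use `list()` or `across()`.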

# Data for tabulation purposes

For tabulation purposes, a few key variables are missing from some of the public-use data, which means no single data source is adequate for LEHD purposes:

| Variable | Buzzfeed | Cornell-FOIA 2013 | Cornell FOIA 2016 | Fedscope-old | Fedscope-new | common |
|---|---|---|---|---|---|---|
| Employee Name | 1 | 0 | 0 | 0 | 0 | 1 |
| Duty Station | 1 | 1 | 1 | 0 | 0 | 3 |
| Age Level | 1 | 1 | 1 | 1 | 1 | 5 |
| Sex | 0 | 1 | 0 | 1 | 0 | 2 |
| Ethnicity | 0 | 0 | 0 | 1 | 1 | 2 |
| Race | 0 | 0 | 0 | 1 | 1 | 2 |

Note that the combination of Employee Name, Sex, and Age Level may be sufficient to acquire a PIK within the secure confines of the Census Bureau:

| Variable | Buzzfeed | Cornell-FOIA 2013 | Cornell FOIA 2016 | Fedscope-old | Fedscope-new | common |
|---|---|---|---|---|---|---|
| Employee Name | 1 | 0 | 0 | 0 | 0 | 1 |
| Age Level | 1 | 1 | 1 | 1 | 1 | 5 |
| Sex | 0 | 1 | 0 | 1 | 0 | 2 |

Alternatively, matching the consolidated public-use file to the confidential internal-use file by the same methods will pick up a `PIK` from historical files.

None of the files has a residential address, which is required for LODES processing; obtaining it requires the acquisition of a PIK.

# The Plan

## Step 1: Entity resolution for public data

Using the algorithms of Chen, Shrivastava, and Steorts (2018) (fasthash), resolve records to unique persons, using the common variables as distinguishers. This still requires some work, since fasthash estimates the number of unique entities but does not output the entities themselves. This step generates OPM(merged).
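As a point of reference for what "resolving to unique persons on common variables" means, here is a deliberately simplistic baseline, not the fasthash algorithm: exact-key deduplication on a few hypothetical common fields. The real procedure must handle noisy, inexact agreement across sources.

```r
# Illustrative baseline only -- NOT fasthash: exact-key entity resolution
# on hypothetical common variables.
recs <- data.frame(
  agency    = c("IRS", "IRS", "SSA"),
  age_level = c("35-39", "35-39", "50-54"),
  grade     = c("GS-12", "GS-12", "GS-09"),
  stringsAsFactors = FALSE
)
# Build a composite key from the distinguishing variables
key <- do.call(paste, c(recs, sep = "|"))
# Assign a provisional entity id per distinct key
recs$entity_id <- match(key, unique(key))
recs$entity_id  # [1] 1 1 2
```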

## Step 2: Repeat entity resolution using private data

We then repeat the process with the private data. This attaches a PIK to most records; standard LEHD imputation procedures will need to handle the remainder.

## Step 2a: Alternative match

Alternatively, the OPM(merged) file can be matched to OPM(Census) using classical two-file matchers. This does not have the best statistical properties, but may be a feasible workaround.
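The shape of such a two-file match can be sketched with an exact merge on shared fields; this stands in for a proper probabilistic record linker, and all file and variable names here are hypothetical:

```r
# Minimal sketch of a two-file match on shared fields (exact merge as a
# stand-in for a classical probabilistic matcher; hypothetical data).
pub  <- data.frame(agency = c("IRS", "SSA"),
                   age_level = c("35-39", "50-54"),
                   pub_id = 1:2, stringsAsFactors = FALSE)
conf <- data.frame(agency = c("IRS", "EPA"),
                   age_level = c("35-39", "40-44"),
                   PIK = c("P001", "P002"), stringsAsFactors = FALSE)
matched <- merge(pub, conf, by = c("agency", "age_level"), all.x = TRUE)
# Public records left without a PIK would fall to LEHD-style imputation
sum(is.na(matched$PIK))  # [1] 1
```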

## Step 2b: Matching to Numident

Note that one possibility is to include the Census Numident in the set of files that are matched against (using a subset of variables), leveraging the demographics available on the OPM(merged) file. However, the match will be less certain, given the paucity of common information.

# Scope

For a given end year t on OPM(Census), this will yield at least a t+1 file. OPM(Fedscope-new) is released every quarter; OPM(FOIA-new) can be generated yearly at some cost. As the link to OPM(Buzzfeed) and OPM(Census) gets farther away (t+k), the match quality will decrease and the number of non-matchable records will increase.

# Quality assessment

We would want to leverage the uncertainty in the linkage for tabulation purposes, providing a measure of that uncertainty to the tabulation system (imputed demographics are already carried forward in 10 implicates).
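One way such implicate-level uncertainty could be summarized is via multiple-imputation-style combining: a point estimate as the mean across implicates, with the between-implicate variance as the uncertainty measure. A sketch with made-up per-implicate tabulations (the numbers are purely illustrative, and this is not a description of the actual LEHD tabulation system):

```r
# Illustrative only: combine a tabulated statistic across 10 implicates,
# using hypothetical per-implicate cell counts.
est <- c(100, 102, 98, 101, 99, 103, 97, 100, 102, 98)
point   <- mean(est)  # combined point estimate
between <- var(est)   # between-implicate variance (linkage uncertainty)
point  # [1] 100
```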

![graph](RB%20033-2018-08-16%2017.12.36.jpg)
