krwilkin/census-occ-to-soc

census-occ-to-soc

Data on occupations (the tasks workers perform on the job) provide valuable information about the labor market. A primary source of information on individual occupations is Census household survey data, such as the Current Population Survey, the American Community Survey, and the Survey of Income and Program Participation. In addition, other agencies, such as the Bureau of Labor Statistics (BLS), report economy-wide aggregates of information by occupation (e.g., Occupational Employment and Wage Statistics). In some cases, researchers may wish to combine these data sets. However, this is complicated by the fact that Census uses occupation codes that differ from the Standard Occupational Classification (SOC).

Census occupation codes are based on SOC codes, but some aggregation is required. The 2018 version of the SOC lists 1,016 unique occupations compared to 507 used by Census. All SOC codes are accounted for in Census surveys, but some groups of SOC codes are aggregated into a single Census code. Even though Census provides an official crosswalk file linking its occupation codes to the SOC, the format of that crosswalk can make it hard to find a clean method for combining data from the two sources.

The files in this repository demonstrate how to use the Census-SOC crosswalk with an application to occupation-level data from the Department of Labor's Occupational Information Network, or O*NET. O*NET provides a rich set of occupation-specific data, in both numerical and text form, about the different tasks, skills, and abilities used in most SOC occupations. The goal of this exercise (for a related project) was to link Census occupation codes to O*NET task information. That project aimed to use natural language processing (NLP) tools, so we specifically retained text information about tasks performed on the job. I adopted a simple strategy of copying tasks from all related SOC occupations to each Census occupation; where multiple SOC codes match a single Census occupation, I simply joined the tasks and job descriptions into one large space-delimited composite string.
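The composite-string strategy described above can be sketched in a few lines. This is a minimal illustration, not the notebook's code: the function name `build_composites`, the dictionary layouts, and the sample codes and task text are all hypothetical.

```python
def build_composites(crosswalk, onet_tasks):
    """Join task text from every SOC code mapped to each Census code.

    crosswalk:  {census_code: [soc_code, ...]}   (assumed layout)
    onet_tasks: {soc_code: [task_text, ...]}     (assumed layout)
    """
    composites = {}
    for census_code, soc_codes in crosswalk.items():
        texts = []
        for soc in soc_codes:
            # SOC codes with no task data contribute nothing.
            texts.extend(onet_tasks.get(soc, []))
        # One space-delimited string per Census occupation.
        composites[census_code] = " ".join(texts)
    return composites


# Made-up placeholder codes and text, for illustration only.
sample_crosswalk = {"A": ["00-0001"], "B": ["00-0002", "00-0003"]}
sample_tasks = {
    "00-0001": ["Direct operations."],
    "00-0002": ["Prepare budgets."],
    "00-0003": ["Review reports.", "Advise staff."],
}
composites = build_composites(sample_crosswalk, sample_tasks)
```

A Census occupation that maps to a single SOC code simply inherits that code's text, while an aggregated one receives the text of all its SOC codes in crosswalk order.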

My strategy is to create data files in a common format that is easy to link (i.e., JSON). The Census crosswalk is distributed in Excel format. The Jupyter Notebook census_soc_xwalk.ipynb downloads the crosswalk from the Census Web site and saves it to soc_census_xwalk.json. O*NET data were retrieved from the O*NET API and stored in the file onet_occdata.json (NOTE: I seem to have lost the script that accessed the O*NET API).
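As a rough sketch of the Excel-to-JSON step, the snippet below groups crosswalk rows by Census code and writes them to a JSON file. It assumes the rows have already been read from the spreadsheet as (Census code, SOC code) pairs, for example with `pandas.read_excel`. The function name `rows_to_json` and the resulting JSON layout are assumptions; the actual structure of soc_census_xwalk.json may differ.

```python
import json


def rows_to_json(rows, path):
    """Group (census_code, soc_code) pairs by Census code and save as JSON."""
    xwalk = {}
    for census_code, soc_code in rows:
        # Keep codes as stripped strings so any leading zeros survive.
        key = str(census_code).strip()
        xwalk.setdefault(key, []).append(str(soc_code).strip())
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(xwalk, fh, indent=2)
    return xwalk
```

Reading the spreadsheet columns as strings rather than numbers is the safer choice here, since a code read as an integer cannot be told apart from the same code with leading zeros.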

Linkage between Census occupation codes and O*NET is done in the Jupyter Notebook onet_census_link.ipynb. It takes as inputs the JSON files referenced in the previous paragraph and produces the output file cenocc_onet.json. Regular expressions are the primary tool used to relate the two data sets. A detailed discussion is included.
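To illustrate how regular expressions can relate the two code systems, here is a minimal sketch. It assumes that aggregated SOC codes in the crosswalk are written with "X" standing for any digit, and that O*NET codes carry a two-digit suffix after the SOC code. The function names and sample codes are hypothetical, and the expressions in the notebook may differ.

```python
import re


def soc_pattern_to_regex(soc_pattern):
    """Turn a SOC code with 'X' wildcards into a compiled regex."""
    # Escape literal characters, then treat each "X" as one digit.
    escaped = re.escape(soc_pattern.strip())
    digits = escaped.replace("X", r"\d")
    # Allow an optional two-digit suffix such as ".00".
    return re.compile(rf"^{digits}(\.\d{{2}})?$")


def match_codes(soc_pattern, onet_codes):
    """Return the O*NET codes covered by one crosswalk SOC entry."""
    regex = soc_pattern_to_regex(soc_pattern)
    return [code for code in onet_codes if regex.match(code)]


# Illustrative codes only.
sample_onet = ["11-1011.00", "11-1011.03", "11-1021.00", "11-2011.00"]
exact = match_codes("11-1011", sample_onet)
aggregated = match_codes("11-10XX", sample_onet)
```

Anchoring the pattern at both ends keeps an exact code from matching a longer, unrelated one, while the optional suffix lets a single crosswalk entry pick up every O*NET variant of that SOC code.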

About

Creates a crosswalk between Census occupation codes and Standard Occupational Classification (SOC) codes (2018 vintage) with an application to O*NET.
