## Linking Census Occupation Codes with O\*NET SOC Data

This notebook links occupation descriptions and task descriptions from the Occupational Information Network (O\*NET) to Census occupation codes and writes the result to a `.json` file. O\*NET data are organized by the Standard Occupational Classification (SOC) system, while Census uses its own coding scheme based on the SOC. Many Census codes aggregate several SOC codes. This notebook uses the Census-SOC crosswalk to aggregate O\*NET data to each Census code.

O\*NET data include a detailed description of each occupation as well as descriptions of the specific tasks that job analysts have determined are required to function within each occupation. In addition, they include sample job titles that fit into each occupation. If we assume that respondents are likely to write in their occupational title or describe what they do (tasks) in their occupation, then these data represent a collection of common occupation-specific terms that could be used to train an NLP model capable of predicting the occupation code for a given write-in.

The output of this notebook is a `.json` file containing, for each Census occupation code, a list of matched SOC codes from O\*NET, a text string containing the descriptions of each matched O\*NET occupation, a text string containing the descriptions of all tasks associated with the Census code, and a text field comprising a comma-separated list of sample job titles. I chose to keep these fields separate so analysts could choose how or when to use them independently, if at all. The JSON file is easily accessible using a standard text editor, and it can be read directly into a `pandas` dataframe for use with `nltk` in Python (at the end of this notebook I demonstrate how to read JSON into `pandas`).

This notebook uses two inputs: (1) a Census-SOC crosswalk file obtained from the U.S. Census Bureau Web site and (2) a custom dictionary file taken from the O\*NET API. The latter file is created by `soc_work.py`.

In [1]:
# import modules
import json
import re
import pandas as pd

In [2]:
# load json files
with open( 'soc_census_xwalk.json' , 'r' ) as f:
    xwalk = json.load( f )

with open( 'onet_occdata.json' , 'r' ) as f:
    onet = json.load( f )

Each JSON file used here is essentially a list of dictionaries, and each dictionary is a set of key-value pairs. If the data were organized in a traditional rectangular data set, each dictionary (list element) would be a record, each key would be a column title, and each value would be the datum for that column.

The `xwalk` object follows this flat format. However, the O\*NET object varies slightly. The set of sample job titles (indicated by the key `sample_of_reported_job_titles`) is organized as a list. Further, `tasks` is itself a list of dictionary objects. In this regard, the `onet` object is nested rather than rectangular; it is still valid JSON, because JSON values may themselves be lists or dictionaries. This ability to store different types of information without an explicit schema is advantageous.
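
As a minimal sketch of that nesting, the snippet below builds one hypothetical record shaped like an element of `onet` (the values are made up for illustration and are not taken from the file), confirms that it survives a round trip through `json`, and flattens the nested fields into plain strings.

```python
import json

# hypothetical record shaped like one element of 'onet' (values are illustrative only)
record = {
    'code': '00-0000.00',
    'title': 'Example Occupation',
    'description': 'An example description.',
    'sample_of_reported_job_titles': ['Example Title A', 'Example Title B'],
    'tasks': [{'id': 1, 'name': 'Do the first task.'},
              {'id': 2, 'name': 'Do the second task.'}],
}

# nested lists and dictionaries survive a JSON round trip unchanged
roundtrip = json.loads(json.dumps([record]))

# flatten the nested fields into plain strings
task_text = ' '.join(t['name'] for t in roundtrip[0]['tasks'])
title_text = ', '.join(roundtrip[0]['sample_of_reported_job_titles'])

print(task_text)
print(title_text)
```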

For potentially complicated data structures, it is important to understand the hierarchical structure before it can be leveraged effectively. Here, we explore the keys associated with each JSON object and print off its first record.

In [3]:
# explore 'xwalk' object
print( 'xwalk keys:\n' , xwalk[ 0 ].keys() , '\n\n' )

print( 'first record of xwalk:\n' , xwalk[ 0 ] , '\n\n' )

print( 'number of census occupations:\n' , len( xwalk ) )

xwalk keys:
 dict_keys(['occ_title', 'cenocc', 'soc']) 


first record of xwalk:
 {'occ_title': 'Chief executives', 'cenocc': '0010', 'soc': '11-1011'} 


number of census occupations:
 507


In [4]:
# explore 'onet' object
print( 'onet keys:\n' , onet[ 0 ].keys() , '\n\n' )

print( 'number of onet occupations:\n' , len( onet ) , '\n' )

print( 'first record of onet:' )
onet[ 0 ]

onet keys:
 dict_keys(['code', 'title', 'description', 'sample_of_reported_job_titles', 'tasks']) 


number of onet occupations:
 1016 

first record of onet:


{'code': '13-2011.00',
 'title': 'Accountants and Auditors',
 'description': 'Examine, analyze, and interpret accounting records to prepare financial statements, give advice, or audit and evaluate statements prepared by others. Install or advise on systems of recording costs or other financial and budgetary data.',
 'sample_of_reported_job_titles': ['Accountant',
  'Accounting Officer',
  'Audit Partner',
  'Auditor',
  'Certified Public Accountant (CPA)',
  'Cost Accountant',
  'Financial Auditor',
  'General Accountant',
  'Internal Auditor',
  'Revenue Tax Specialist'],
 'tasks': [{'id': 21505,
   'name': 'Prepare detailed reports on audit findings.'},
  {'id': 21506,
   'name': 'Report to management about asset utilization and audit results, and recommend changes in operations and financial activities.'},
  {'id': 21507,
   'name': 'Collect and analyze data to detect deficient controls, duplicated effort, extravagance, fraud, or non-compliance with laws, regulations, and management

The above illustration reveals several important details. First, with 507 Census occupations and 1,016 O\*NET occupations, some aggregation will be necessary. Second, O\*NET occupation codes include eight numeric digits (two following the decimal) while those referenced in the Census codes comprise only six. It is also worth noting that some Census occupations may have multiple SOC codes assigned to them (they will be in a space-delimited list; see the full file for details).

The SOC is a hierarchical system in which the number of non-zero digits indicates nesting within a larger grouping. There are four levels of nesting: major, minor, broad, and detailed. The first two _non-zero_ digits indicate the "major" occupation group. For example, the 2018 SOC code 11-0000 refers to the "Management Occupations" major group. This major group nests all of its "minor" groups, which are indicated at the three-digit level (11-1000, 11-2000, 11-3000, and 11-9000). A broad group is characterized by a non-zero fifth digit and a zero sixth digit. For example, the minor group "Top Executives" (11-1000) comprises the broad groups 11-1010, 11-1020, and 11-1030. Finally, each broad group nests "detailed" occupations, which have a non-zero sixth digit. The broad group "Public Relations and Fundraising Managers" (SOC 11-2030) nests the detailed occupations "Public Relations Managers" (SOC 11-2032) and "Fundraising Managers" (SOC 11-2033). More information about the SOC is available from the [Bureau of Labor Statistics](https://www.bls.gov/soc). For our purposes, we use the 2018 SOC definitions, which correspond to the ACS 2018 and 2019 data used in our capstone project.
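
The trailing-zero rule can be sketched as a small classifier. The function name `soc_level` is hypothetical and not part of this notebook's pipeline; it applies the rule exactly as described above and does not handle any exceptions that may exist in the published classification.

```python
import re

def soc_level(soc_code):
    """Classify a six-digit SOC code (e.g. '11-2030') by its trailing zeros."""
    if not re.fullmatch(r'\d{2}-\d{4}', soc_code):
        raise ValueError('expected a code like 11-1011, got ' + repr(soc_code))
    digits = soc_code.replace('-', '')
    if digits[2:] == '0000':
        return 'major'      # e.g. 11-0000
    if digits[3:] == '000':
        return 'minor'      # e.g. 11-2000
    if digits[5] == '0':
        return 'broad'      # e.g. 11-2030
    return 'detailed'       # e.g. 11-2032

for code in ['11-0000', '11-2000', '11-2030', '11-2032']:
    print(code, soc_level(code))
```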

The eight-digit SOC codes used by O\*NET follow the same nesting structure and should pose no problems for matching to Census occupation codes. Most O\*NET occupation codes have .00 as their seventh and eighth digits. A code with a non-zero suffix (e.g., 11-1011.03) nests within its six-digit SOC code as described above, while a suffix of .00 is equivalent to the six-digit code itself.
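
To illustrate, here is a short sketch that strips the two-digit suffix and groups O\*NET codes under their six-digit SOC code. The helper name `onet_to_soc6` is an assumption made for this example; the codes are ones that appear in the outputs of this notebook.

```python
from collections import defaultdict

def onet_to_soc6(onet_code):
    """Drop the two-digit O*NET suffix: '11-1011.03' -> '11-1011'."""
    return onet_code.split('.')[0]

# codes taken from the examples in this notebook
onet_codes = ['11-1011.00', '11-1011.03', '11-1021.00']

# group each eight-digit code under its six-digit SOC code
by_soc = defaultdict(list)
for code in onet_codes:
    by_soc[onet_to_soc6(code)].append(code)

print(dict(by_soc))
```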

Understanding the SOC hierarchy is critical for correctly matching O\*NET data to Census codes. Many Census occupation codes will have a direct match to the SOC detailed occupations, such as "Food Service Managers" (Census code 0310 and SOC code 11-9051). Others refer to broad occupations and require aggregating information from detailed occupations in O\*NET. For example, Census occupation code 0335 refers to "Entertainment and Recreation Managers," which is the broad occupation group 11-9070 comprising detailed occupations 11-9071 and 11-9072.
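
The broad-group case can be sketched with a regular expression that swaps trailing zeros for digit wildcards. This is a simplified stand-in for illustration, not the notebook's own matching function; `broad_pattern` is a hypothetical name, and the candidate codes are listed only to exercise the pattern.

```python
import re

def broad_pattern(soc_code):
    """Turn a SOC code ending in zeros into a regex over eight-digit O*NET codes."""
    stem = soc_code.rstrip('0')             # '11-9070' -> '11-907'
    n_wild = len(soc_code) - len(stem)      # number of trailing zeros to wildcard
    return re.compile(re.escape(stem) + r'\d{%d}\.\d{2}' % n_wild)

pat = broad_pattern('11-9070')

# illustrative O*NET-style codes used only to exercise the pattern
candidates = ['11-9071.00', '11-9072.00', '11-9081.00', '11-9051.00']
matched = [c for c in candidates if pat.fullmatch(c)]
print(matched)
```

Only the two codes nested under 11-9070 are kept; the direct-match case (such as 11-9051) needs no wildcard at all.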

The next steps of this notebook use the SOC hierarchical structure to create Census-code-specific occupation "documents" by matching and, in some cases, aggregating information from O\*NET descriptions of occupations, tasks, and sample job titles. To do this, I iterate over the set of Census occupation codes in `xwalk` and use the associated SOC code to select matching elements from O\*NET. The information is stored in a list of dictionaries that can be exported to JSON format for later use. O\*NET descriptions, tasks, and job titles are stored in their own fields.

Note that this is a first attempt at creating such a corpus. We may later decide that omitting the sample job titles is a good idea. Further, we may decide that adding sample occupations from the DTMF is desirable.

In [5]:
'''
this section outlines the process for linking onet data to census occupation codes. the strategy is to
loop through each census code in 'xwalk'. for each census code, i use a list comprehension to get the 
set of matching soc codes in 'onet'. i use a regular expression pattern that is conditionally
assigned by the user-defined function 'socpat'. see the official census crosswalk file for the explicit 
occupation codes assigned to aggregate categories (i.e., those ending in one or more '0' or 'X'); this
accounts for the unique regex assignments made in 'socpat'. i define a class ('onetOutput') that
constructs composites of descriptions, tasks, and sample job titles for all soc codes that serve as
an input to the census code. finally, all results are stored in a json object (i.e., a list of
dictionaries) with a field for each of these components. i also preserve the set of onet soc occupation
codes that go into a given match for manual inspection.
'''
# define a class that takes a list of soc codes and returns descriptions, tasks, and job titles from 'onet'
# NOTE: this class is not general; it takes a list object in the same format as 'onet'
class onetOutput:
    def __init__( self , soclist , onetlist ):
        self.soclist = soclist
        self.onetlist = onetlist
    
    # get description(s) for each element of soclist and output to a single string
    def description( self ):
        desclist = [ x[ 'description' ] for x in self.onetlist if x[ 'code' ] in self.soclist ]
        if len( desclist ) == 1:
            return( desclist[ 0 ] )
        else:
            return( ' '.join( desclist ) )
    
    # get tasks (if available, otherwise return empty string)
    def tasks( self ):
        tasklist = [ x[ 'tasks' ] for x in self.onetlist if x[ 'code' ] in self.soclist ]
        if sum( [ len( x ) for x in tasklist ] ) == 0:
            return( '' )
        else:
            task_desc = []
            for i in tasklist:
                for j in i:
                    task_desc.append( j[ 'name' ] )
            return( ' '.join( task_desc ) )
    
    # get list of sample job titles as a comma-separated string
    def job_titles( self ):
        try:
            titles = [ x[ 'sample_of_reported_job_titles' ] for x in self.onetlist if x[ 'code' ] in self.soclist ]
            return( ', '.join( [ ', '.join( x ) for x in titles ] ) )
            
        # a record may lack the key (KeyError) or hold a non-list value (TypeError)
        except ( KeyError , TypeError ):
            return( '' )

In [6]:
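def socpat( soc_code ):
    if soc_code[ -1 ] == '0':
        # trailing zeros mark an aggregate group; replace them with digit wildcards
        # NOTE: use the 'soc_code' argument here rather than the global 'censoc'
        numz = re.findall( r'(?<=[1-9])0+$' , soc_code )
        return( r'{0}{1}\.\d{{2}}'.format( re.split( '{}$'.format( numz[ 0 ] ) , soc_code )[ 0 ] , 
                                           r'\d{' + str( len( numz[ 0 ] ) ) + '}' ) )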
    elif soc_code[ -1 ] == 'X':
        if soc_code == '13-20XX':
            return( '{}\d{{2}}(?<=54|99)\.\d{{2}}'.format( re.sub( '-' , '\-' , soc_code[ 0:-2 ] ) ) )
        elif soc_code == '15-124X':
            return( '{}[23]\.\d{{2}}'.format( re.sub( '-' , '\-' , soc_code[ 0:-1 ] ) ) )
        elif soc_code == '15-20XX':
            return( '{}\d{{2}}(?<=51|99)\.\d{{2}}'.format( re.sub( '-' , '\-' , soc_code[ 0:-2 ] ) ) )
        elif soc_code == '17-301X':
            return( '{}[239]\.\d{{2}}'.format( re.sub( '-' , '\-' , soc_code[ 0:-1 ] ) ) )
        elif soc_code == '17-302X':
            return( '{}\d(?<!3)\.\d{{2}}'.format( re.sub( '-' , '\-' , soc_code[ 0:-1 ] ) ) )
        elif soc_code == '19-204X':
            return( '{}[23]\.\d{{2}}'.format( re.sub( '-' , '\-' , soc_code[ 0:-1 ] ) ) )
        elif soc_code == '19-303X':
            return( '{}[29]\.\d{{2}}'.format( re.sub( '-' , '\-' , soc_code[ 0:-1 ] ) ) )
        elif soc_code == '19-40XX':
            return( '{}\d{{2}}(?<=40|51)\.\d{{2}}'.format( re.sub( '-' , '\-' , soc_code[ 0:-2 ] ) ) )
        elif soc_code == '21-109X':
            return( '{}[149]\.\d{{2}}'.format( re.sub( '-' , '\-' , soc_code[ 0:-1 ] ) ) )
        elif soc_code == '25-30XX':
            return( '{}\d{{2}}(?<=11|21|31|99)\.\d{{2}}'.format( re.sub( '-' , '\-' , soc_code[ 0:-2 ] ) ) )
        elif soc_code == '25-90XX':
            return( '{}\d{{2}}(?<=21|31|99)\.\d{{2}}'.format( re.sub( '-' , '\-' , soc_code[ 0:-2 ] ) ) )
        elif soc_code == '27-102X':
            return( '{}[79]\.\d{{2}}'.format( re.sub( '-' , '\-' , soc_code[ 0:-1 ] ) ) )
        elif soc_code == '29-12XX':
            return( '{}[12](?!4)\d\.\d{{2}}'.format( re.sub( '-' , '\-' , soc_code[ 0:-2 ] ) ) )
        elif soc_code == '29-203X':
            return( '{}[36]\.\d{{2}}'.format( re.sub( '-' , '\-' , soc_code[ 0:-1 ] ) ) )
        elif soc_code == '29-205X':
            return( '{}[17]\.\d{{2}}'.format( re.sub( '-' , '\-' , soc_code[ 0:-1 ] ) ) )
        elif soc_code == '31-113X':
            return( '{}[23]\.\d{{2}}'.format( re.sub( '-' , '\-' , soc_code[ 0:-1 ] ) ) )
        elif soc_code == '31-909X':
            return( '{}[39]\.\d{{2}}'.format( re.sub( '-' , '\-' , soc_code[ 0:-1 ] ) ) )
        elif soc_code == '33-909X':
            return( '{}[29]\.\d{{2}}'.format( re.sub( '-' , '\-' , soc_code[ 0:-1 ] ) ) )
        elif soc_code == '37-201X':
            return( '{}[19]\.\d{{2}}'.format( re.sub( '-' , '\-' , soc_code[ 0:-1 ] ) ) )
        elif soc_code == '37-301X':
            return( '{}[29]\.\d{{2}}'.format( re.sub( '-' , '\-' , soc_code[ 0:-1 ] ) ) )
        elif soc_code == '39-30XX':
            return( '{}\d{{2}}(?<=21|90)\.\d{{2}}'.format( re.sub( '-' , '\-' , soc_code[ 0:-2 ] ) ) )
        elif soc_code == '39-40XX':
            return( '{}\d{{2}}(?<=11|12|21)\.\d{{2}}'.format( re.sub( '-' , '\-' , soc_code[ 0:-2 ] ) ) )
        elif soc_code == '39-509X':
            return( '{}[13]\.\d{{2}}'.format( re.sub( '-' , '\-' , soc_code[ 0:-1 ] ) ) )
        elif soc_code == '47-50XX':
            return( '{}[589]\d(?<=1|9)\.\d{{2}}'.format( re.sub( '-' , '\-' , soc_code[ 0:-2 ] ) ) )
        elif soc_code == '49-209X':
            return( '{}[45]\.\d{{2}}'.format( re.sub( '-' , '\-' , soc_code[ 0:-1 ] ) ) )
        elif soc_code == '49-904X':
            return( '{}[15]\.\d{{2}}'.format( re.sub( '-' , '\-' , soc_code[ 0:-1 ] ) ) )
        elif soc_code == '49-909X':
            return( '{}[79]\.\d{{2}}'.format( re.sub( '-' , '\-' , soc_code[ 0:-1 ] ) ) )
        elif soc_code == '51-20XX':
            return( '{}[569]\d(?<=1|2|9)\.\d{{2}}'.format( re.sub( '-' , '\-' , soc_code[ 0:-2 ] ) ) )
        elif soc_code == '51-403X':
            return( '{}[245]\.\d{{2}}'.format( re.sub( '-' , '\-' , soc_code[ 0:-1 ] ) ) )
        elif soc_code == '51-4XXX':
            return( '{}\d{{3}}(?<=081|19[1-49])\.\d{{2}}'.format( re.sub( '-' , '\-' , soc_code[ 0:-3 ] ) ) )
        elif soc_code == '51-609X':
            return( '{}[129]\.\d{{2}}'.format( re.sub( '-' , '\-' , soc_code[ 0:-1 ] ) ) )
        # NOTE: '51-20XX' is already handled above, so this branch is never reached
        elif soc_code == '51-20XX':
            return( '{}\d{{2}}(?<=30|99)\.\d{{2}}'.format( re.sub( '-' , '\-' , soc_code[ 0:-2 ] ) ) )
        elif soc_code == '51-919X':
            return( '{}[23]\.\d{{2}}'.format( re.sub( '-' , '\-' , soc_code[ 0:-1 ] ) ) )
        elif soc_code == '51-91XX':
            return( '{}\d{{2}}(?<=41|99)\.\d{{2}}'.format( re.sub( '-' , '\-' , soc_code[ 0:-2 ] ) ) )
        elif soc_code == '53-40XX':
            return( '{}\d{{2}}(?<=22|41|99)\.\d{{2}}'.format( re.sub( '-' , '\-' , soc_code[ 0:-2 ] ) ) )
        elif soc_code == '53-60XX':
            return( '{}\d{{2}}(?<=11|41|99)\.\d{{2}}'.format( re.sub( '-' , '\-' , soc_code[ 0:-2 ] ) ) )
        elif soc_code == '53-70XX':
            return( '{}\d{{2}}(?<=11|31|41)\.\d{{2}}'.format( re.sub( '-' , '\-' , soc_code[ 0:-2 ] ) ) )
        elif soc_code == '53-71XX':
            return( '{}\d{{2}}(?<=21|99)\.\d{{2}}'.format( re.sub( '-' , '\-' , soc_code[ 0:-2 ] ) ) )
        else:
            raise ValueError( 'no corresponding SOC code(s) given for' , soc_code )
    else:
        return( re.sub( '-' , '\-' , soc_code ) )
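
The `X` branches above lean on a lookbehind assertion to restrict which digits may fill the wildcard positions. As a minimal sketch, the snippet below writes out the same pattern shape as the `'13-20XX'` branch by hand and compares it with an equivalent non-capturing alternation. The test codes are illustrative only. Note that Python's `re` module requires every alternative inside a lookbehind to have the same width.

```python
import re

# same shape as the '13-20XX' branch: two wildcard digits, restricted by a lookbehind
lookbehind = re.compile(r'13-20\d{2}(?<=54|99)\.\d{2}')

# equivalent pattern written as a plain alternation
alternation = re.compile(r'13-20(?:54|99)\.\d{2}')

# illustrative codes used only to exercise the patterns
codes = ['13-2054.00', '13-2099.01', '13-2011.00', '13-2051.00']

via_lookbehind = [c for c in codes if lookbehind.search(c)]
via_alternation = [c for c in codes if alternation.search(c)]

print(via_lookbehind)
print(via_lookbehind == via_alternation)
```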

In [7]:
# initialize list to store results
cenonet = []

# loop over census occupation codes and get needed onet content
for i in xwalk:
    # extract crosswalk between census code and soc
    cencode = i[ 'cenocc' ]
    censoc = i[ 'soc' ]
    
    # split censoc on whitespace to account for multiple SOC codes; build one regex per code
    socpats = [ socpat( s ) for s in censoc.split() ]
    
    # use list comprehension to get set of all onet codes matching any of the patterns
    socmatch = [ x[ 'code' ] for x in onet if any( re.search( p , x[ 'code' ] ) for p in socpats ) ]
    
    # build the helper once, get each component, store as dictionary, and append to list
    out = onetOutput( socmatch , onet )
    cendict = { 'cenocc' : str( cencode ) ,
                'occtitle' : i[ 'occ_title' ] ,
                'cen_soc' : censoc ,
                'onet_soc' : socmatch ,
                'description' : out.description() ,
                'tasks' : out.tasks() ,
                'job_titles' : out.job_titles() }
    
    cenonet.append( cendict )

In [8]:
cenonet[0:5]

[{'cenocc': '0010',
  'occtitle': 'Chief executives',
  'cen_soc': '11-1011',
  'onet_soc': ['11-1011.00', '11-1011.03'],
  'description': 'Determine and formulate policies and provide overall direction of companies or private and public sector organizations within guidelines set up by a board of directors or similar governing body. Plan, direct, or coordinate operational activities at the highest level of management with the help of subordinate executives and staff managers. Communicate and coordinate with management, shareholders, customers, and employees to address sustainability issues. Enact or oversee a corporate sustainability strategy.',
  'tasks': "Direct or coordinate an organization's financial or budget activities to fund operations, maximize investments, or increase efficiency. Appoint department heads or managers and assign or delegate responsibilities to them. Analyze operations to evaluate performance of a company or its staff in meeting objectives or to determine areas

In [9]:
len( cenonet )

507
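
Because the matched O\*NET codes are kept for manual inspection, one quick check is to list any Census codes that matched nothing. The helper below is a hypothetical sketch; the sample records only mimic the keys of `cenonet` and their values are made up. In the notebook itself it would be called as `unmatched( cenonet )`.

```python
def unmatched(records):
    """Return (census code, SOC string) pairs whose 'onet_soc' list is empty."""
    return [(r['cenocc'], r['cen_soc']) for r in records if not r['onet_soc']]

# hypothetical records with the same keys as 'cenonet' (values are illustrative only)
sample = [
    {'cenocc': '0010', 'cen_soc': '11-1011', 'onet_soc': ['11-1011.00', '11-1011.03']},
    {'cenocc': '9999', 'cen_soc': '00-0000', 'onet_soc': []},
]

print(unmatched(sample))
```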

In [10]:
# save dictionary to json object
with open( 'cenocc_onet.json' , 'w' ) as f:
    json.dump( cenonet , f , indent = 4 )

The file `cenocc_onet.json` contains a link between Census occupation codes and the O\*NET data described earlier in this notebook.

It is possible to read the JSON file directly using `pandas` and the `read_json()` function. I demonstrate that below. I also print the `shape` of the dataframe; note that the number of rows equals 507, which is the same as the number of Census occupation codes in the `xwalk` file. Finally, I print the first 10 records of the dataframe. Note that `read_json()` infers column types, so `cenocc` is displayed as an integer without its leading zeros (e.g., `0010` appears as `10`).

In [11]:
# load json object with pandas
df = pd.read_json( 'cenocc_onet.json' )
df.sort_values( by = 'cenocc' , inplace = True )
df.reset_index( drop = True , inplace = True )

In [12]:
# print the dimensions of the object
df.shape

(507, 7)

In [13]:
# print the first 10 records of the dataframe
df.head( 10 )

|   | cenocc | occtitle | cen_soc | onet_soc | description | tasks | job_titles |
|---|---|---|---|---|---|---|---|
| 0 | 10 | Chief executives | 11-1011 | [11-1011.00, 11-1011.03] | Determine and formulate policies and provide o... | Direct or coordinate an organization's financi... | Chief Diversity Officer (CDO), Chief Executive... |
| 1 | 20 | General and operations managers | 11-1021 | [11-1021.00] | Plan, direct, or coordinate the operations of ... | Review financial statements, sales or activity... | Business Manager, General Manager (GM), Operat... |
| 2 | 30 | Legislators | 11-1031 | [11-1031.00] | Develop, introduce, or enact laws and statutes... | Analyze and understand the local and national ... |  |
| 3 | 40 | Advertising and promotions managers | 11-2011 | [11-2011.00] | Plan, direct, or coordinate advertising polici... | Plan and prepare advertising and promotional m... | Account Executive, Advertising Manager (Ad Man... |
| 4 | 51 | Marketing managers | 11-2021 | [11-2021.00] | Plan, direct, or coordinate marketing policies... | Identify, develop, or evaluate marketing strat... | Account Supervisor, Brand Manager, Business De... |
| 5 | 52 | Sales managers | 11-2022 | [11-2022.00] | Plan, direct, or coordinate the actual distrib... | Direct and coordinate activities involving sal... | District Sales Manager, National Sales Manager... |
| 6 | 60 | Public relations and fundraising managers | 11-2030 | [11-2033.00, 11-2032.00] | Plan, direct, or coordinate activities to soli... | Assign, supervise, and review the activities o... | Account Supervisor, Annual Giving Director, De... |
| 7 | 101 | Administrative services managers | 11-3012 | [11-3012.00] | Plan, direct, or coordinate one or more admini... | Prepare and review operational reports and sch... | Administrative Coordinator, Administrative Dir... |
| 8 | 102 | Facilities managers | 11-3013 | [11-3013.00, 11-3013.01] | Plan, direct, or coordinate operations and fun... | Acquire, distribute and store supplies. Conduc... | Facilities Manager, Corporate Physical Securit... |
| 9 | 110 | Computer and information systems managers | 11-3021 | [11-3021.00] | Plan, direct, or coordinate activities in such... | Direct daily operations of department, analyzi... | Application Development Director, Computing Se... |
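
In the dataframe above, `cenocc` shows `10` where the JSON file stores `'0010'`. One way to recover the original form is to zero-pad the values, sketched below. The helper name `pad_cenocc` and the default width of four characters are assumptions based on the codes shown in this notebook.

```python
def pad_cenocc(value, width=4):
    """Render a Census occupation code as a zero-padded string: 10 -> '0010'."""
    return str(value).zfill(width)

# integers as displayed in the dataframe, plus one value that is already a string
padded = [pad_cenocc(v) for v in [10, 20, 101, '0335']]
print(padded)
```

Applied to the dataframe, this could look like `df[ 'cenocc' ] = df[ 'cenocc' ].map( pad_cenocc )`.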
