# Mappings between Job Titles and SOC Codes

Online supplementary material to "The Evolution of Work in the United States" by Enghin Atalay, Phai Phongthiengtham, Sebastian Sotelo and Daniel Tannenbaum.

* [Project data library](https://occupationdata.github.io) 

* [GitHub repository](https://github.com/phaiptt125/newspaper_project)

***

This IPython notebook demonstrates how we map job titles from newspaper text to SOC codes. 

* We use the continuous bag of words (CBOW) model previously constructed. See [here](https://github.com/phaiptt125/newspaper_project/blob/master/data_cleaning/CBOW.ipynb) for more detail. 
* See [here](http://ssc.wisc.edu/~eatalay/apst/apst_mapping.pdf) for more explanations.
* See project data library for full results.

<b> Due to copyright restrictions, we are not authorized to publish a large body of newspaper text. </b>
***

## List of auxiliary files (see project data library or GitHub repository)

* *"title_substitute.py"* : Python code that edits job titles.
* *"word_substitutes.csv"* : List of word-level substitutions for job titles.
* *"phrase_substitutes.csv"* : List of phrase-level substitutions for job titles.

Note: We look for the most common job titles and list manually-coded substitutions in *"word_substitutes.csv"* and *"phrase_substitutes.csv"*. 

In [1]:
import io
import os
import re
import sys
import platform
import collections
import shutil

import pandas as pd
import math
import multiprocessing
import os.path
import numpy as np
from gensim import corpora, models
from gensim.models import Word2Vec, keyedvectors 
from gensim.models.word2vec import LineSentence
from sklearn.metrics.pairwise import cosine_similarity

sys.path.append('./auxiliary files')

from title_substitute import *

## Edit job titles

We first lightly edit job titles to reduce the number of unique titles. We convert all titles to lowercase and remove all non-alphanumeric characters; combine titles that are very similar to one another (e.g., replacing "hostesses" with "host"); replace plural nouns with their singular form (e.g., replacing "nurses" with "nurse", "foremen" with "foreman"); and expand abbreviations (e.g., replacing "asst" with "assistant", and "customer service rep" with "customer service representative"). 

In [2]:
# import files for editing titles
word_substitutes = io.open('word_substitutes.csv','r',encoding='utf-8',errors='ignore').read()
word_substitutes = ''.join([w for w in word_substitutes if ord(w) < 127])
word_substitutes = [w for w in re.split('\n',word_substitutes) if not w=='']
 
phrase_substitutes = io.open('phrase_substitutes.csv','r',encoding='utf-8',errors='ignore').read()
phrase_substitutes = ''.join([w for w in phrase_substitutes if ord(w) < 127])
phrase_substitutes = [w for w in re.split('\n',phrase_substitutes) if not w=='']

In [3]:
# some illustrations (see "title_substitute.py")

list_job_titles = ['registered nurses',
                   'rn', 
                   'hostesses',
                   'foremen', 
                   'customer service rep']

for title in list_job_titles: 
    title_clean = substitute_titles(title,word_substitutes,phrase_substitutes)
    print('original title = ' + title)
    print('edited title = ' + title_clean)
    print('---')

original title = registered nurses
edited title = registered nurse
---
original title = rn
edited title = registered nurse
---
original title = hostesses
edited title = host
---
original title = foremen
edited title = foreman
---
original title = customer service rep
edited title = customer service representative
---
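
The editing steps described above can be sketched roughly as follows. This is not the project's `substitute_titles` function: the name `simple_clean_title`, the helper `simple_singular`, and the two tiny substitution tables are hypothetical stand-ins, and the plural rule is deliberately crude.

```python
import re

# Hypothetical, tiny substitution tables for illustration only; the project's
# real tables live in word_substitutes.csv and phrase_substitutes.csv.
WORD_SUBS = {"asst": "assistant", "hostesses": "host", "foremen": "foreman"}
PHRASE_SUBS = {"customer service rep": "customer service representative"}


def simple_singular(word):
    """Very rough plural-to-singular rule; an assumption, not the project's logic."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def simple_clean_title(title):
    """Lowercase, strip non-alphanumerics, then apply phrase and word rules."""
    title = re.sub(r"[^a-z0-9 ]", " ", title.lower())
    title = " ".join(title.split())           # collapse repeated spaces
    title = PHRASE_SUBS.get(title, title)     # whole-title phrase substitution
    words = [WORD_SUBS.get(w, simple_singular(w)) for w in title.split()]
    return " ".join(words)


for raw in ["Registered Nurses", "Asst. Manager", "customer service rep"]:
    print(raw, "->", simple_clean_title(raw))
```

A real implementation needs far larger tables and a more careful treatment of irregular plurals, which is why the manually coded CSV files exist.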


## Some technical issues

* The procedure of replacing plural nouns with their singular form works in general:

In [4]:
substitute_titles('galaxies',word_substitutes,phrase_substitutes)
# Note: We do not supply the mapping from 'galaxies' to 'galaxy'.

'galaxy'

* Replacing abbreviations, on the other hand, requires user-provided information: we manually list the most common substitutions. While we cannot possibly identify all abbreviations, the continuous bag of words (CBOW) model used later mitigates this, because common abbreviations tend to have vector representations similar to those of the words they abbreviate. 

## ONET reported job titles 

The ONET publishes, for each SOC code, a list of reported job titles in its "Sample of Reported Titles" and "Alternate Titles" files. The ONET data dictionary, see [here](https://www.onetcenter.org/dl_files/database/db_22_1_dictionary.pdf), describes these files as follows:

*"This file [Sample of Reported Titles] contains job titles frequently reported by incumbents and occupational experts on data collection surveys."* (page 52)

*"This file [Alternate Titles] contains alternate, or 'lay', occupational titles for the ONET-SOC classification system. The file was developed to improve keyword searches in several Department of Labor internet applications (i.e., Career InfoNet, ONET OnLine, and ONET Code Connector). The file contains occupational titles from existing occupational classification systems, as well as from other diverse sources."* (page 50)

## A mapping between ONET reported job titles and SOC codes

The ONET provides, for each job title in "Sample of Reported Titles" and "Alternate Titles", a corresponding SOC code. We then record these mappings directly. 

Some job titles, unfortunately, do not have a unique mapping to an SOC code. For example, "Office Administrator" is reported to be "43-9061.00", "43-6011.00" and "43-6014.00". For these titles, we rely on the ONET website search algorithm. First, we enter "Office Administrator" into the search query box, "Occupation Quick Search." See [here](https://github.com/phaiptt125/newspaper_project/blob/master/data_cleaning/auxiliary%20files/example_ONET_api.png) for a screenshot of this procedure. 

Then, we map "Office Administrator" to "43-9061.00", which is the closest match that the ONET website provides. Next, we apply the same title editing procedure that we applied to newspaper job titles, as described above. We record these mappings in "title2SOC.txt" as shown below.   

In [21]:
title2SOC_filename = 'title2SOC.txt'
names = ['title','original_title','soc']

# title: The edited title, to be matched with newspaper titles.
# original_title: The original titles from ONET website. 
# soc: Occupation code.
 
# import into pandas dataframe
title2SOC = pd.read_csv(title2SOC_filename, sep = '\t', names = names)

# print number of total mappings
print('Total mappings = ' + str(len(title2SOC)))
 
# print some examples
title2SOC.head()

Total mappings = 45207


| | title | original_title | soc |
|---|---|---|---|
| 0 | operation director | Operations Director | 11102100 |
| 1 | us commissioner | U.S. Commissioner | 11101100 |
| 2 | sale and marketing director | Sales and Marketing Director | 11202200 |
| 3 | market analysis director | Market Analysis Director | 11202100 |
| 4 | director of sale and marketing | Director of Sales and Marketing | 41101200 |


The subsequent sections of this IPython notebook explain how we use these mappings from ONET, in combination with the previously constructed continuous bag of words (CBOW) model, to assign an SOC code to each newspaper job title.

## Map ONET job titles to newspaper job titles (direct match)

We assign to each newspaper job title an ONET job title for which a corresponding SOC code is available. First, for each newspaper job title, we check whether there is a direct string match. Suppose we have "sale and marketing director" in the newspaper:

In [6]:
"sale and marketing director" in title2SOC['title'].values

True

In [7]:
title2SOC[title2SOC['title'] == "sale and marketing director"]

| | title | original_title | soc |
|---|---|---|---|
| 2 | sale and marketing director | Sales and Marketing Director | 11202200 |


* Since we have "sale and marketing director" in our list of ONET titles, we can proceed and assign the SOC code "11-2022.00". 
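
As a minimal sketch of the direct-match step, the lookup can also be expressed with a plain dictionary. The rows are copied from the `title2SOC.head()` output above; the function names `format_soc` and `direct_match` are hypothetical, and the sketch assumes each edited title maps to a single SOC code.

```python
# Rows copied from the title2SOC.head() output: (edited title, SOC code).
rows = [
    ("operation director", 11102100),
    ("us commissioner", 11101100),
    ("sale and marketing director", 11202200),
    ("market analysis director", 11202100),
    ("director of sale and marketing", 41101200),
]
title_to_soc = dict(rows)


def format_soc(code):
    """Render an 8-digit integer code such as 11202200 as '11-2022.00'."""
    digits = str(code).zfill(8)
    return digits[:2] + "-" + digits[2:6] + "." + digits[6:]


def direct_match(title):
    """Return the formatted SOC code for an exact title match, else None."""
    code = title_to_soc.get(title)
    return None if code is None else format_soc(code)


print(direct_match("sale and marketing director"))
print(direct_match("customer relation specialist"))  # not in this toy table
```

Titles for which `direct_match` returns `None` are the ones handed to the CBOW-based step.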

## Map ONET job titles to newspaper job titles (CBOW-based)

For newspaper job titles with no exact match in our list of ONET job titles, we rely on our previously constructed CBOW model to assign the "closest" ONET job title to each newspaper job title.   

In the actual implementation, we set our dimension of the CBOW model to be 300, as explained [here](https://github.com/phaiptt125/newspaper_project/blob/master/data_cleaning/CBOW.ipynb). For illustrative purposes, however, this IPython notebook provides examples using the CBOW model with the dimension of 5. The embedded code below illustrates how we construct this CBOW model:

***
    model = Word2Vec(LineSentence(open('ad_combined.txt')), 
                        size = 5, 
                        window = 5, 
                        min_count = 5, 
                        workers = multiprocessing.cpu_count())

    model.save('cbow_small.model')
***

In [8]:
model = Word2Vec.load('cbow_small.model')
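# 'cbow_small.model' has a dimension of 5.
# In the actual implementation, we use our previously constructed 'cbow.model', which has a dimension of 300.
# Note: in gensim 4.0 and later, word vectors are accessed as model.wv['customer'] rather than model['customer'].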

Our CBOW model provides a vector representation of each word in the corpus. For example:

In [9]:
model['customer']

array([-0.23945422, -0.33969662, -0.25194243,  0.86623007,  0.11592443], dtype=float32)

In [10]:
model['relation']

array([ 0.03195868, -0.56184751,  0.24374393,  0.58998656,  0.52517688], dtype=float32)

In [11]:
model['specialist']

array([-0.52168244, -0.50416076,  0.10234968,  0.33064061,  0.59487033], dtype=float32)

We compute the vector representation of "customer relation specialist" as the sum of the vector representations of "customer", "relation" and "specialist".

In [12]:
vector_title = model['customer'] + model['relation'] + model['specialist']
vector_title

array([-0.72917795, -1.40570486,  0.09415118,  1.78685713,  1.23597169], dtype=float32)

In the same way, we can compute a vector representation of:

1. All job titles from our newspaper data.
2. All job titles from our list of ONET titles.
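
The summation can be wrapped in a small helper. The sketch below uses a plain dictionary holding the three 5-dimensional vectors printed above instead of the gensim model; the function name `title_vector` is hypothetical, and skipping unknown words follows the rule that job title words missing from the CBOW model are ignored.

```python
import numpy as np

# 5-dimensional vectors copied from the notebook output for these three words.
word_vectors = {
    "customer": np.array(
        [-0.23945422, -0.33969662, -0.25194243, 0.86623007, 0.11592443],
        dtype=np.float32),
    "relation": np.array(
        [0.03195868, -0.56184751, 0.24374393, 0.58998656, 0.52517688],
        dtype=np.float32),
    "specialist": np.array(
        [-0.52168244, -0.50416076, 0.10234968, 0.33064061, 0.59487033],
        dtype=np.float32),
}


def title_vector(title, vectors, dim=5):
    """Sum the vectors of the words in `title`, skipping unknown words."""
    total = np.zeros(dim, dtype=np.float32)
    for word in title.split():
        if word in vectors:
            total += vectors[word]
    return total


print(title_vector("customer relation specialist", word_vectors))
```

Because addition is commutative, the helper returns the same vector for any ordering of the same words.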

Suppose we have "customer relation specialist" as a newspaper job title. We first check whether there is a direct match in our list of ONET titles: 

In [13]:
"customer relation specialist" in title2SOC['title'].values

False

Since there is no direct match, we assign a vector representation to this title and compute how similar it is to each of the ONET job titles. We use cosine similarity to measure how similar two vectors are. Cosine similarity ranges from -1 to 1, and a value closer to 1 means that the two vectors point in more similar directions. The results below show the cosine similarity scores for some ONET job titles:

In [14]:
vector_newspaper = model['customer'] + model['relation'] + model['specialist']

print('Computing cosine similarity of "customer relation specialist" to: ')
print('----------------')

# compute similarity to "executive secretary" 
vector_to_match  = model['executive'] + model['secretary']
cosine = cosine_similarity(vector_to_match.reshape(1,-1), vector_newspaper.reshape(1,-1))
print( '"executive secretary" = ' +  str(cosine))

# compute similarity to "mechanical engineer" 
vector_to_match  = model['mechanical'] + model['engineer']
cosine = cosine_similarity(vector_to_match.reshape(1,-1), vector_newspaper.reshape(1,-1))
print( '"mechanical engineer" = ' +  str(cosine))

# compute similarity to "customer service assistant" 
vector_to_match  = model['customer'] + model['service'] + model['assistant']
cosine = cosine_similarity(vector_to_match.reshape(1,-1), vector_newspaper.reshape(1,-1))
print( '"customer service assistant" = ' +  str(cosine))

# compute similarity to "client relation specialist" 
vector_to_match  = model['client'] + model['relation'] + model['specialist']
cosine = cosine_similarity(vector_to_match.reshape(1,-1), vector_newspaper.reshape(1,-1))
print( '"client relation specialist" = ' +  str(cosine))

Computing cosine similarity of "customer relation specialist" to: 
----------------
"executive secretary" = [[ 0.6176427]]
"mechanical engineer" = [[ 0.80217057]]
"customer service assistant" = [[ 0.96143997]]
"client relation specialist" = [[ 0.99550998]]


***
Therefore, using the CBOW model, we conclude that "customer relation specialist" is closer in meaning to "client relation specialist" than to "executive secretary", "mechanical engineer" or "customer service assistant." 

Even though we do not have "customer relation specialist" in our list of ONET job titles, our CBOW model suggests that this job title is extremely similar to "client relation specialist". There are two reasons why this should be the case. First, both job titles share the two words "relation" and "specialist". Second, our CBOW model suggests that "client" and "customer" are similar to each other:

In [15]:
cosine_similarity(model['client'].reshape(1,-1), model['customer'].reshape(1,-1))

array([[ 0.96610314]], dtype=float32)
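
For reference, the scores above are cosines of the angle between two vectors: the dot product divided by the product of the two vector lengths. The sketch below writes this out with NumPy; the function name `cosine` is hypothetical, and it is meant as an illustration of the definition rather than a replacement for scikit-learn's `cosine_similarity`.

```python
import numpy as np


def cosine(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


a = np.array([1.0, 2.0, 3.0])
print(cosine(a, a))                    # same direction: approximately 1
print(cosine(a, 10 * a))               # rescaling does not change the score
print(cosine(a, -a))                   # opposite direction: approximately -1
print(cosine([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors: 0
```

Since the measure ignores vector length, a title vector built by summing three words is directly comparable to one built from two.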

In the actual implementation, we compute cosine similarity scores against all 45207 ONET job titles, which cannot be performed in this IPython notebook. 

Nevertheless, it turns out that "client relation specialist" is indeed the closest ONET job title to "customer relation specialist." We therefore assign "customer relation specialist" the same SOC code as "client relation specialist."  

In [16]:
title2SOC[title2SOC['title'] == "client relation specialist"]

| | title | original_title | soc |
|---|---|---|---|
| 14392 | client relation specialist | Client Relations Specialist | 43405100 |
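
The search for the closest ONET title can be sketched as a single matrix operation. Everything in the block below is illustrative: the 3-dimensional word vectors are made up (they are not output of our CBOW model), the candidate list holds two titles rather than the full ONET list, and the function names are hypothetical.

```python
import numpy as np

# Made-up 3-dimensional word vectors, for illustration only.
toy_vectors = {
    "customer":   np.array([0.9, 0.1, 0.0]),
    "client":     np.array([0.8, 0.2, 0.1]),
    "relation":   np.array([0.1, 0.9, 0.0]),
    "specialist": np.array([0.0, 0.5, 0.5]),
    "mechanical": np.array([0.0, 0.0, 1.0]),
    "engineer":   np.array([0.1, 0.0, 0.9]),
}
candidate_titles = ["client relation specialist", "mechanical engineer"]


def title_vector(title):
    """Sum of the word vectors in `title`, ignoring unknown words."""
    known = (toy_vectors[w] for w in title.split() if w in toy_vectors)
    return sum(known, np.zeros(3))


def closest_title(title, candidates):
    """Return the candidate with the highest cosine similarity and its score."""
    query = title_vector(title)
    matrix = np.vstack([title_vector(c) for c in candidates])
    # Normalise to unit length so that a dot product equals cosine similarity.
    # Assumes every title contains at least one known word (non-zero vector).
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    query = query / np.linalg.norm(query)
    scores = matrix @ query
    best = int(np.argmax(scores))
    return candidates[best], float(scores[best])


print(closest_title("customer relation specialist", candidate_titles))
```

Stacking all candidate vectors into one matrix lets a single matrix-vector product score every candidate at once, which matters when the candidate list has tens of thousands of titles.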


## Some technical issues

* We ignore job title words that are not in our CBOW model. 
* Unlike in the LDA model, we do not stem words. As a result, the model treats different forms of a word as different words, e.g., "manage" and "management". However, our CBOW model generally assigns similar vector representations to such forms, for example: 

In [17]:
cosine_similarity(model['manage'].reshape(1,-1), model['management'].reshape(1,-1))

array([[ 0.92724895]], dtype=float32)

* Our CBOW model is invariant to the order of job title words, e.g., we consider "executive secretary" and "secretary executive" as the same title. 

In [18]:
model['executive'] + model['secretary']

array([-0.5665881 , -0.73142403,  0.72307652, -0.10102642,  1.02186275], dtype=float32)

In [19]:
model['secretary'] + model['executive']

array([-0.5665881 , -0.73142403,  0.72307652, -0.10102642,  1.02186275], dtype=float32)

* Common abbreviations tend to have meanings similar to the words they abbreviate. For instance, "rn" is a common abbreviation for "registered nurse", and our CBOW model assigns the two very similar vector representations:   

In [20]:
vector_title = model['registered'] + model['nurse']
cosine_similarity(model['rn'].reshape(1,-1), vector_title.reshape(1,-1))

array([[ 0.98632824]], dtype=float32)

* In rare circumstances, our CBOW model suggests more than one "closest" ONET title for a newspaper job title, i.e., the cosine similarity scores are exactly equal. This can happen when different ONET job titles, each mapped to a different SOC code, receive exactly the same vector representation. For example, ONET registers "wage and salary administrator" as "11-3111.00" (Compensation and Benefits Managers) and "salary and wage administrator" as "13-1141.00" (Compensation, Benefits, and Job Analysis Specialists). Because the two titles contain the same words in a different order, our CBOW model assigns them exactly the same vector representation. In these circumstances, we rely on the Bureau of Labor Statistics employment data, see [here](https://www.bls.gov/oes/current/oes_nat.htm), and choose the SOC code with higher employment.
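
The tie-breaking rule can be sketched as follows. The SOC labels and employment counts are placeholders chosen only to make the example run; they are not BLS figures, and the function name `break_tie` is hypothetical.

```python
def break_tie(tied_socs, employment):
    """Among SOC codes with equal similarity, pick the one with most employment.

    SOC codes missing from `employment` are treated as having zero employment.
    """
    return max(tied_socs, key=lambda soc: employment.get(soc, 0))


# Placeholder labels and counts, NOT actual SOC codes or BLS employment data.
employment = {"SOC_A": 1000, "SOC_B": 2500}

print(break_tie(["SOC_A", "SOC_B"], employment))
```

With real data, `employment` would be built from the BLS national employment table linked above.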

## Additional amendments

Finally, we make the following additional amendments (see [here](https://ssc.wisc.edu/~eatalay/apst/apst_mapping.pdf) for more detail):

1. We assign an SOC code of 999999 (“missing”) if certain words or phrases (“associate,” “career builder,” “liberal employee benefit,” “many employee benefit,” or “personnel”) appear anywhere in the job title, or for certain exact titles: “boys,” “boys boys,” “men boys girls,” “men boys girls women,” “men boys men,” “people,” “professional,” or “trainee.” These words and phrases appear commonly in our newspaper ads and do not refer to the SOC code which our CBOW model indicates. “Associate” commonly appears as part of the name of the firm placing the ad. “Personnel” commonly refers to the personnel department that the applicant should contact.

2. We also replace the SOC code for the job title “Assistant” from 399021 (the SOC code for “Personal Care Aides”) to 436014 (the SOC code for “Secretaries and Administrative Assistants”). “Assistant” is the fifth most common job title, and judging by the text within the job ads refers to a secretarial occupation rather than one for a personal care worker. While we are hesitant to modify our job title to SOC mapping in an ad hoc fashion for any job title, mis-specifying this mapping for such a common title would have a noticeably deleterious impact on our dataset.

3. In a final step, we amend the output of the CBOW model for a few ambiguously defined job titles. These final amendments have no discernible impact on aggregate trends in task content, on the role of within-occupation shifts in accounting for aggregate task changes, or on the role of shifts in the demand for tasks in accounting for increased earnings inequality. First, for job titles which include “server” and which do not also include a food-service-related word (banquet, bartender, cashier, cocktail, cook, dining, food, or restaurant), we substitute an SOC code beginning with 3530 with the SOC code for computer systems analysts (151121). Second, for job titles which contain the word “programmer” and do not include the words “cnc” or “machine,” we substitute SOC codes beginning with 5140 or 5141 with the SOC code for computer programmers (151131). Finally, for job titles which contain the word “assembler” and do not contain a word referring to manufacturing assembly work (words containing the strings electronic, electric, machin, mechanical, metal, and wire), we substitute SOC codes beginning with 5120 with the SOC code of computer programmers (151131). The amendments, which alter the SOC codes for approximately 0.2 percent of ads in our data set, are necessary for ongoing work in which we explore the role of new technologies in the labor market. Certain words refer both to a job title unrelated to new technologies and to new technologies themselves. Without the amendments, these job titles would be linked to SOCs that have no exposure to new technologies, and we would vastly overstate the rates at which food service staff or manufacturing production workers adopt new ICT software. On the other hand, since these 8 ads represent a small portion of the ads referring to computer programmer occupations, lumping the ambiguous job titles with the computer programmer SOC codes will only have a minor effect on the assessed technology adoption rates for computer programmers.
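
The three amendments above can be sketched as one post-processing function. This is an illustrative reading of the rules, not the project's code: the name `amend_soc` is hypothetical, SOC codes are assumed to be 6-digit integers, and "appears in the job title" is assumed to mean a whole-word or whole-phrase match (except for the assembly strings, which are matched inside words as the text states).

```python
MISSING = 999999
MISSING_PHRASES = ("associate", "career builder", "liberal employee benefit",
                   "many employee benefit", "personnel")
MISSING_EXACT = {"boys", "boys boys", "men boys girls", "men boys girls women",
                 "men boys men", "people", "professional", "trainee"}
FOOD_WORDS = {"banquet", "bartender", "cashier", "cocktail", "cook",
              "dining", "food", "restaurant"}
ASSEMBLY_STRINGS = ("electronic", "electric", "machin", "mechanical",
                    "metal", "wire")


def contains_phrase(title, phrase):
    """Whole-word / whole-phrase match (an assumption about the rule)."""
    return f" {phrase} " in f" {title} "


def amend_soc(title, soc):
    """Apply amendments 1-3 to the SOC code assigned to an edited title."""
    # Amendment 1: uninformative titles are coded as missing.
    if title in MISSING_EXACT or any(
            contains_phrase(title, p) for p in MISSING_PHRASES):
        return MISSING
    # Amendment 2: the bare title "assistant" is treated as secretarial.
    if title == "assistant" and soc == 399021:
        return 436014
    # Amendment 3: ambiguous technology-related titles.
    words = set(title.split())
    code = str(soc)
    if "server" in words and not words & FOOD_WORDS and code.startswith("3530"):
        return 151121
    if ("programmer" in words and not words & {"cnc", "machine"}
            and code.startswith(("5140", "5141"))):
        return 151131
    if ("assembler" in words and code.startswith("5120")
            and not any(s in w for w in words for s in ASSEMBLY_STRINGS)):
        return 151131
    return soc


# Input SOC codes below are example inputs chosen to trigger each rule.
print(amend_soc("sql server administrator", 353031))
print(amend_soc("banquet server", 353031))
print(amend_soc("cnc programmer", 514012))
```

Applying the rules after the CBOW step keeps the model output intact and makes each manual override easy to audit.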