# Mappings between Job Titles and SOC Codes

Online supplementary material to "The Evolving U.S. Occupational Structure" by Enghin Atalay, Phai Phongthiengtham, Sebastian Sotelo and Daniel Tannenbaum.

* [Most recent version of the paper](http://ssc.wisc.edu/~eatalay/skills.pdf)

* [Project data library](http://ssc.wisc.edu/~eatalay/occupation_data.html) 

* [GitHub repository](https://github.com/phaiptt125/newspaper_project)

***

This IPython notebook demonstrates how we map job titles from newspaper text to SOC codes.

* We use the continuous bag of words (CBOW) model previously constructed. See [here](https://github.com/phaiptt125/newspaper_project/blob/master/data_cleaning/CBOW.ipynb) for more detail. 
* See [here](http://ssc.wisc.edu/~eatalay/apst/apst_mapping.pdf) for more explanations.
* See project data library for full results.

<b> Due to copyright restrictions, we are not authorized to publish a large body of newspaper text. </b>
***

## List of auxiliary files (see project data library or GitHub repository)

* *"title_substitute.py"* : Python code that edits job titles.
* *"word_substitutes.csv"* : List of word-level substitutions for job titles.
* *"phrase_substitutes.csv"* : List of phrase-level substitutions for job titles.

Note: We manually look for the most common job titles and list manually-coded substitutions in *"word_substitutes.csv"* and *"phrase_substitutes.csv"*. 

In [1]:
import io  # used explicitly below via io.open
import os
import re
import sys
import platform
import collections
import shutil

import pandas as pd
import math
import multiprocessing
import os.path
import numpy as np
from gensim import corpora, models
from gensim.models import Word2Vec, keyedvectors 
from gensim.models.word2vec import LineSentence
from sklearn.metrics.pairwise import cosine_similarity

sys.path.append('./auxiliary files')

from title_substitute import *

## Edit job titles

We first lightly edit job titles to reduce the number of unique titles. We convert all titles to lowercase and remove all non-alphanumeric characters; combine titles that are very similar to one another (e.g., replacing "hostesses" with "host"); replace plural nouns with their singular form (e.g., replacing "nurses" with "nurse" and "foremen" with "foreman"); and expand abbreviations (e.g., replacing "asst" with "assistant" and "customer service rep" with "customer service representative"). 
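The editing steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the `substitute_titles` function from *"title_substitute.py"*: the function names `singularize` and `sketch_clean_title`, the crude suffix rules, and the two tiny substitution tables (built from the examples in this notebook) are all hypothetical stand-ins for the real code and CSV files.

```python
import re

# Hypothetical excerpts; the real lists live in word_substitutes.csv
# and phrase_substitutes.csv.
WORD_SUBS = {"asst": "assistant", "hostesses": "host", "rn": "registered nurse"}
PHRASE_SUBS = {"customer service rep": "customer service representative"}


def singularize(word):
    """Crude plural-to-singular rules (an assumption, not the authors' rules)."""
    if len(word) > 4 and word.endswith("ies"):
        return word[:-3] + "y"          # galaxies -> galaxy
    if len(word) > 3 and word.endswith("men"):
        return word[:-3] + "man"        # foremen -> foreman
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]                # nurses -> nurse
    return word


def sketch_clean_title(title):
    """Lowercase, strip non-alphanumerics, then apply the substitutions."""
    title = re.sub(r"[^a-z0-9 ]", "", title.lower())
    title = " ".join(title.split())     # collapse repeated spaces
    # Phrase-level substitutions on whole-word boundaries.
    for phrase, replacement in PHRASE_SUBS.items():
        title = re.sub(r"\b" + re.escape(phrase) + r"\b", replacement, title)
    # Word-level substitutions, then singularization.
    words = []
    for word in title.split():
        word = WORD_SUBS.get(word, word)
        words.extend(singularize(part) for part in word.split())
    return " ".join(words)


for raw in ["Registered Nurses", "rn", "hostesses", "foremen", "customer service rep"]:
    print(raw, "->", sketch_clean_title(raw))
```

Simple suffix rules like these will mishandle irregular words, which is one reason a hand-curated substitution list is still useful.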

In [2]:
# import files for editing titles
word_substitutes = io.open('word_substitutes.csv','r',encoding='utf-8',errors='ignore').read()
word_substitutes = ''.join([w for w in word_substitutes if ord(w) < 127])
word_substitutes = [w for w in re.split('\n',word_substitutes) if not w=='']
 
phrase_substitutes = io.open('phrase_substitutes.csv','r',encoding='utf-8',errors='ignore').read()
phrase_substitutes = ''.join([w for w in phrase_substitutes if ord(w) < 127])
phrase_substitutes = [w for w in re.split('\n',phrase_substitutes) if not w=='']

In [3]:
# some illustrations (see "title_substitute.py")

list_job_titles = ['registered nurses',
                   'rn', 
                   'hostesses',
                   'foremen', 
                   'customer service rep']

for title in list_job_titles: 
    title_clean = substitute_titles(title,word_substitutes,phrase_substitutes)
    print('original title = ' + title)
    print('edited title = ' + title_clean)
    print('---')

original title = registered nurses
edited title = registered nurse
---
original title = rn
edited title = registered nurse
---
original title = hostesses
edited title = host
---
original title = foremen
edited title = foreman
---
original title = customer service rep
edited title = customer service representative
---


## Some technical issues

* The procedure of replacing plural nouns with their singular form works in general:

In [4]:
substitute_titles('galaxies',word_substitutes,phrase_substitutes)
# Note: We do not supply the mapping from 'galaxies' to 'galaxy'.

'galaxy'

* Replacing abbreviations, on the other hand, requires user-provided information: we list the most common substitutions by hand. We cannot possibly identify every abbreviation, but the continuous bag of words (CBOW) model used later mitigates this, because common abbreviations tend to receive vector representations similar to those of the words they abbreviate. 

## ONET reported job titles 

The ONET publishes, for each SOC code, a list of reported job titles in "Sample of Reported Titles" and "Alternate Titles" sections. According to the ONET data dictionary [here](https://www.onetcenter.org/dl_files/database/db_22_1_dictionary.pdf), the "Sample of Reported Titles" file:

***
*"contains job titles frequently reported by incumbents and occupational experts on data collection surveys."* (page 52). 
***

Similarly, the "Alternate Titles" file:
***
*"contains alternate, or 'lay', occupational titles for the ONET-SOC classification system. The file was developed to improve keyword searches in several Department of Labor internet applications (i.e., Career InfoNet, ONET OnLine, and ONET Code Connector). The file contains
occupational titles from existing occupational classification systems, as well as from other diverse sources."* (page 50).
***

Unfortunately, some job titles do not map to a unique SOC code. For example, "Office Administrator" is reported under "43-9061.00", "43-6011.00" and "43-6014.00". For these titles, we rely on the ONET website search algorithm. First, we enter "Office Administrator" into the "Occupation Quick Search" query box. Then, we assign the closest match that the ONET website provides. The screenshot below demonstrates this procedure: 

<img src="example_website.png">

We therefore map "Office Administrator" to "43-9061.00". Next, we apply the same title editing procedure as for newspaper job titles:

In [5]:
title2SOC_filename = 'title2SOC.txt'
names = ['title','original_title','soc']

# title: The edited title, to be matched with newspaper titles.
# original_title: The original titles from ONET website. 
# soc: Occupation code.
 
# import into pandas dataframe
title2SOC = pd.read_csv(title2SOC_filename, sep = '\t', names = names)

# print number of total mappings
print('Total mappings = ' + str(len(title2SOC)))
 
# print some examples
title2SOC.head()

Total mappings = 45207


| | title | original_title | soc |
|---|---|---|---|
| 0 | operation director | Operations Director | 11102100 |
| 1 | us commissioner | U.S. Commissioner | 11101100 |
| 2 | sale and marketing director | Sales and Marketing Director | 11202200 |
| 3 | market analysis director | Market Analysis Director | 11202100 |
| 4 | director of sale and marketing | Director of Sales and Marketing | 41101200 |
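Titles that are reported under more than one SOC code, such as "Office Administrator" above, could be flagged automatically before the manual website lookup. The following is a minimal sketch, assuming the raw ONET titles are available as `(title, soc)` pairs; the function name `ambiguous_titles` and the small `records` list are hypothetical and only reuse examples quoted in this notebook.

```python
from collections import defaultdict

# Illustrative (title, SOC) pairs taken from the examples in this notebook.
records = [
    ("Office Administrator", "43-9061.00"),
    ("Office Administrator", "43-6011.00"),
    ("Office Administrator", "43-6014.00"),
    ("Operations Director", "11-1021.00"),
]


def ambiguous_titles(pairs):
    """Return the titles that are listed under more than one SOC code."""
    codes_by_title = defaultdict(set)
    for title, soc in pairs:
        codes_by_title[title].add(soc)
    return {
        title: sorted(codes)
        for title, codes in codes_by_title.items()
        if len(codes) > 1
    }


print(ambiguous_titles(records))
```

Only the titles returned by such a check would need the manual "Occupation Quick Search" step.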


The subsequent sections of this IPython notebook explain how we use these mappings, in combination with the previously constructed continuous bag of words (CBOW) model, to assign an SOC code to each newspaper job title. The final output of this exercise, a mapping between newspaper job titles and SOC codes, can be downloaded [here](http://ssc.wisc.edu/~eatalay/apst/soc_mapping_apst.csv).  

## Compute a vector representation of each job title

We apply the previously constructed CBOW model to represent each job title with a vector. In the actual implementation, we set the dimension of the CBOW model to 300, as explained [here](https://github.com/phaiptt125/newspaper_project/blob/master/data_cleaning/CBOW.ipynb). 

For illustrative purposes, however, this IPython notebook provides examples using a CBOW model of dimension 5. The code in this notebook follows the gensim 3.x API; in gensim 4, the `size` argument is named `vector_size` and word vectors are accessed through `model.wv[word]`. The embedded code below illustrates how we construct this smaller model:

***
    model = Word2Vec(LineSentence(open('ad_combined.txt')), size = 5, window = 5, min_count = 5, workers = multiprocessing.cpu_count())
    
    model.save('cbow_small.model')
***

In [6]:
model = Word2Vec.load('cbow_small.model')
# 'cbow_small.model' has dimension of 5.
# In the actual implementation, we use our previously constructed 'cbow.model', which has dimension of 300.  

* The model provides a vector representation of each word in the corpus. For example:

In [7]:
model['customer']

array([-0.23945422, -0.33969662, -0.25194243,  0.86623007,  0.11592443], dtype=float32)

In [8]:
model['service']

array([-0.30502519, -0.39435992, -0.19132054,  0.81630003,  0.22020572], dtype=float32)

In [9]:
model['trainee']

array([-0.46670517, -0.50954461,  0.44123417,  0.19443344,  0.53857094], dtype=float32)

* We compute a vector representation of "customer service trainee" by summing the vectors of its words: 

In [10]:
vector_title = model['customer'] + model['service'] + model['trainee']
vector_title

array([-1.01118457, -1.24360108, -0.00202879,  1.87696362,  0.87470108], dtype=float32)

In the same way, we can compute a vector representation of:
1. All job titles from our newspaper data.
2. All job titles from our list of ONET titles.
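The summation can be wrapped in a small helper. The sketch below is written in plain Python so that it runs without the trained model: the word vectors are copied from the 5-dimensional outputs shown above, while the function name `title_vector` and the dictionary `WORD_VECTORS` are hypothetical. Skipping unknown words mirrors the rule, stated later in this notebook, that words missing from the CBOW model are ignored.

```python
# Word vectors copied from the outputs of cells In [7] to In [9].
WORD_VECTORS = {
    "customer": [-0.23945422, -0.33969662, -0.25194243, 0.86623007, 0.11592443],
    "service": [-0.30502519, -0.39435992, -0.19132054, 0.81630003, 0.22020572],
    "trainee": [-0.46670517, -0.50954461, 0.44123417, 0.19443344, 0.53857094],
}


def title_vector(title, word_vectors):
    """Sum the vectors of the words in `title`, skipping unknown words.

    Returns None when no word of the title has a vector.
    """
    known = [word_vectors[word] for word in title.split() if word in word_vectors]
    if not known:
        return None
    # zip(*known) groups the k-th components of all word vectors together.
    return [sum(components) for components in zip(*known)]


vector_title = title_vector("customer service trainee", WORD_VECTORS)
print(vector_title)
```

Because the result is a plain sum, it does not depend on the order of the words in the title.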

## Map ONET job titles to newspaper job titles (direct match)

We assign an ONET job title, and hence its SOC code, to each newspaper job title. First, for each newspaper job title, we check whether there is a direct string match. Suppose we have "sale and marketing director" in the newspaper:

In [11]:
"sale and marketing director" in title2SOC['title'].values

True

In [12]:
title2SOC[title2SOC['title'] == "sale and marketing director"]

| | title | original_title | soc |
|---|---|---|---|
| 2 | sale and marketing director | Sales and Marketing Director | 11202200 |


* Since we have "sale and marketing director" in our list of ONET titles, we can assign it the SOC code "11-2022.00". 
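The direct-match step amounts to a dictionary lookup. The following is a minimal sketch, assuming the edited ONET titles have been loaded into a plain dictionary; `TITLE2SOC_SKETCH`, `direct_match` and `format_soc` are hypothetical names, and the three entries are copied from the table printed earlier. The helper `format_soc` only converts the 8-digit code stored in the file into the dashed form used in the text.

```python
# Edited title -> 8-digit SOC code, copied from the rows shown earlier.
TITLE2SOC_SKETCH = {
    "operation director": 11102100,
    "us commissioner": 11101100,
    "sale and marketing director": 11202200,
}


def format_soc(code):
    """Format an 8-digit code such as 11202200 as '11-2022.00'."""
    digits = str(code).zfill(8)
    return digits[:2] + "-" + digits[2:6] + "." + digits[6:]


def direct_match(title, table):
    """Return the formatted SOC code on an exact string match, else None."""
    code = table.get(title)
    return None if code is None else format_soc(code)


print(direct_match("sale and marketing director", TITLE2SOC_SKETCH))
print(direct_match("customer service trainee", TITLE2SOC_SKETCH))
```

A title that returns `None` here is one that needs the CBOW-based matching.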

## Map ONET job titles to newspaper job titles (CBOW-based)

Using the previously constructed CBOW model, we assign a vector to each newspaper job title that does not have a direct string match. We then compute cosine similarity scores and assign the "closest" ONET job title and its corresponding SOC code.

* Suppose we have "customer service trainee" as a newspaper job title. We first check whether there is a direct match in our list of ONET titles: 

In [13]:
"customer service trainee" in title2SOC['title'].values

False

* Since there is no direct match, we assign a vector representation to this title and compute how similar it is to each of the ONET job titles. We use cosine similarity as the measure of how similar two vectors are. Cosine similarity ranges from -1 to 1; the closer the value is to 1, the more closely the two vectors point in the same direction. The results below show the cosine similarity scores for some ONET job titles:

In [14]:
vector_newspaper = model['customer'] + model['service'] + model['trainee']

# ONET job titles to compare against
titles_to_match = ['executive secretary',
                   'mechanical engineer',
                   'client representative',
                   'customer service assistant']

for title in titles_to_match:
    # sum the word vectors of the ONET title
    vector_to_match = sum(model[word] for word in title.split())
    # compute similarity between the ONET title and the newspaper title
    cosine = cosine_similarity(vector_to_match.reshape(1,-1), vector_newspaper.reshape(1,-1))
    print('cosine similarity to "' + title + '" = ' + str(cosine))

cosine similarity to "executive secretary" = [[ 0.53498054]]
cosine similarity to "mechanical engineer" = [[ 0.73577219]]
cosine similarity to "client representative" = [[ 0.9032464]]
cosine similarity to "customer service assistant" = [[ 0.99343538]]


***
Therefore, using the CBOW model, we conclude that "customer service trainee" is closer in meaning to "customer service assistant" than to "executive secretary", "mechanical engineer" or "client representative". 
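For reference, the cosine similarity of two vectors $a$ and $b$ is $a \cdot b / (\lVert a \rVert \, \lVert b \rVert)$. The sketch below spells this out in plain Python rather than calling `cosine_similarity` from scikit-learn; the function name `cosine` is hypothetical, and the two vectors are copied from the outputs of cells In [10] and In [17] of this notebook. Up to rounding, it should reproduce the score reported above for "executive secretary".

```python
import math


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Summed title vectors copied from the notebook outputs.
customer_service_trainee = [-1.01118457, -1.24360108, -0.00202879, 1.87696362, 0.87470108]
executive_secretary = [-0.5665881, -0.73142403, 0.72307652, -0.10102642, 1.02186275]

score = cosine(customer_service_trainee, executive_secretary)
print(score)
```

Since both vectors are divided by their norms, the score does not depend on how many words were summed into each title vector, only on the direction of the sum.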

In the actual implementation, we compute the cosine similarity score against all 45,207 ONET job titles. This computation cannot be performed in this IPython notebook. It turns out that "customer service assistant" is indeed the closest ONET job title to "customer service trainee". We then assign "customer service trainee" the same SOC code as "customer service assistant".     

In [15]:
title2SOC[title2SOC['title'] == "customer service assistant"]

| | title | original_title | soc |
|---|---|---|---|
| 2333 | customer service assistant | Customer Service Assistant | 43202100 |
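Putting the two steps together, the full assignment can be sketched as "direct match first, otherwise nearest ONET title by cosine similarity". The example below is a toy version under loud assumptions: the 3-dimensional word vectors in `TOY_VECTORS` are invented for illustration and are not taken from the CBOW model, the function names are hypothetical, and only three ONET titles (with the codes shown in this notebook) are searched instead of all of them.

```python
import math

# Toy word vectors, INVENTED for illustration (not from the CBOW model).
TOY_VECTORS = {
    "customer": [0.9, 0.1, 0.0], "service": [0.8, 0.2, 0.1],
    "trainee": [0.1, 0.1, 0.9], "assistant": [0.2, 0.1, 0.8],
    "sale": [0.1, 0.9, 0.1], "marketing": [0.2, 0.8, 0.0],
    "director": [0.0, 0.6, 0.6], "operation": [0.3, 0.3, 0.5],
}

# Edited ONET title -> SOC code, copied from the tables in this notebook.
TITLE2SOC_TOY = {
    "customer service assistant": 43202100,
    "sale and marketing director": 11202200,
    "operation director": 11102100,
}


def title_vector(title):
    """Sum word vectors, ignoring words without a vector (e.g. 'and')."""
    known = [TOY_VECTORS[w] for w in title.split() if w in TOY_VECTORS]
    return [sum(c) for c in zip(*known)] if known else None


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def assign_soc(title):
    """Return (matched ONET title, SOC code), or None if nothing can be matched."""
    if title in TITLE2SOC_TOY:                      # step 1: direct string match
        return title, TITLE2SOC_TOY[title]
    vector = title_vector(title)
    if vector is None:
        return None
    best_title, best_score = None, -2.0             # cosine is never below -1
    for candidate in TITLE2SOC_TOY:                 # step 2: nearest ONET title
        candidate_vector = title_vector(candidate)
        if candidate_vector is None:
            continue
        score = cosine(vector, candidate_vector)
        if score > best_score:
            best_title, best_score = candidate, score
    if best_title is None:
        return None
    return best_title, TITLE2SOC_TOY[best_title]


print(assign_soc("customer service trainee"))
```

With tens of thousands of ONET titles, one would normally precompute the candidate vectors once and compare them in a single matrix operation rather than in a Python loop.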


## Some technical issues

* We ignore job title words that are not in our CBOW model. 
* Unlike the LDA model, we do not stem words. As a result, the model treats different forms of a word, e.g., "manage" and "management", as different words. However, our CBOW model generally assigns them similar vector representations, for example: 

In [16]:
cosine_similarity(model['manage'].reshape(1,-1), model['management'].reshape(1,-1))

array([[ 0.92724895]], dtype=float32)

* Our procedure is invariant to the order of job title words, e.g., we consider "executive secretary" and "secretary executive" as the same title. 

In [17]:
model['executive'] + model['secretary']

array([-0.5665881 , -0.73142403,  0.72307652, -0.10102642,  1.02186275], dtype=float32)

In [18]:
model['secretary'] + model['executive']

array([-0.5665881 , -0.73142403,  0.72307652, -0.10102642,  1.02186275], dtype=float32)

* Common abbreviations tend to have meanings similar to those of their original words. For instance, "rn" is a common abbreviation for "registered nurse"; as a result, our CBOW model assigns them very similar vector representations:   

In [19]:
vector_title = model['registered'] + model['nurse']
cosine_similarity(model['rn'].reshape(1,-1), vector_title.reshape(1,-1))

array([[ 0.98632824]], dtype=float32)