# Mappings between Job Titles and SOC Codes

Online supplementary material to "The Evolving U.S. Occupational Structure" by Enghin Atalay, Phai Phongthiengtham, Sebastian Sotelo and Daniel Tannenbaum.

* [Most recent version of the paper](http://ssc.wisc.edu/~eatalay/skills.pdf)

* [Project data library](http://ssc.wisc.edu/~eatalay/occupation_data.html) 

* [GitHub repository](https://github.com/phaiptt125/newspaper_project)

***

This IPython notebook demonstrates how we map job titles from newspaper text to SOC codes.

* We use the continuous bag of words (CBOW) model previously constructed. See [here](https://github.com/phaiptt125/newspaper_project/blob/master/data_cleaning/CBOW.ipynb) for more detail. 
* See [here](http://ssc.wisc.edu/~eatalay/apst/apst_mapping.pdf) for more explanations.
* See project data library for full results.

<b> Due to copyright restrictions, we are not authorized to publish a large body of newspaper text. </b>
***

## List of auxiliary files (see project data library or GitHub repository)

* *"title_substitute.py"* : Python code that edits job titles.
* *"word_substitutes.csv"* : List of word-level substitutions for job titles.
* *"phrase_substitutes.csv"* : List of phrase-level substitutions for job titles.

Note: We manually reviewed the most common job titles and recorded the substitutions in *"word_substitutes.csv"* and *"phrase_substitutes.csv"*. 

In [1]:
import io
import os
import re
import sys
import platform
import collections
import shutil

import pandas as pd
import math
import multiprocessing
import os.path
import numpy as np
from gensim import corpora, models
from gensim.models import Word2Vec, keyedvectors 
from gensim.models.word2vec import LineSentence
from sklearn.metrics.pairwise import cosine_similarity

sys.path.append('./auxiliary files')

from title_substitute import *

## Edit job titles

We first lightly edit job titles to reduce the number of unique titles. We convert all titles to lowercase and remove all non-alphanumeric characters; combine titles that are very similar to one another (e.g., replacing "hostesses" with "host"); replace plural nouns with their singular form (e.g., replacing "nurses" with "nurse" and "foremen" with "foreman"); and expand abbreviations (e.g., replacing "asst" with "assistant" and "customer service rep" with "customer service representative"). 
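
As a rough illustration of these editing steps, the sketch below applies them with plain Python. It is not the project's `title_substitute.py`: the function names `clean_title` and `naive_singular`, the two small substitution tables, and the pluralization rule are all assumptions made for this example.

```python
import re

# Hypothetical substitution tables. The project's real lists live in
# word_substitutes.csv and phrase_substitutes.csv.
WORD_SUBS = {
    "asst": "assistant",
    "hostesses": "host",
    "foremen": "foreman",
    "rn": "registered nurse",
}
PHRASE_SUBS = {"customer service rep": "customer service representative"}


def naive_singular(word):
    """Very rough plural-to-singular rule (illustrative only)."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def clean_title(title):
    # Lowercase, then turn every non-alphanumeric character into a space.
    title = re.sub(r"[^a-z0-9 ]", " ", title.lower())
    title = " ".join(title.split())
    # Phrase-level substitutions, matched on whole words only.
    for phrase, replacement in PHRASE_SUBS.items():
        title = re.sub(r"\b" + re.escape(phrase) + r"\b", replacement, title)
    # Word-level substitutions, falling back to the singularization rule.
    words = [WORD_SUBS.get(w, naive_singular(w)) for w in title.split()]
    return " ".join(words)
```

For example, `clean_title("Asst. Manager")` yields `"assistant manager"` under these assumed tables. A simple suffix rule like this one will mishandle some words, which is one reason a curated substitution list is useful.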

In [2]:
# import files for editing titles
word_substitutes = io.open('word_substitutes.csv','r',encoding='utf-8',errors='ignore').read()
word_substitutes = ''.join([w for w in word_substitutes if ord(w) < 127])
word_substitutes = [w for w in re.split('\n',word_substitutes) if not w=='']
 
phrase_substitutes = io.open('phrase_substitutes.csv','r',encoding='utf-8',errors='ignore').read()
phrase_substitutes = ''.join([w for w in phrase_substitutes if ord(w) < 127])
phrase_substitutes = [w for w in re.split('\n',phrase_substitutes) if not w=='']

In [3]:
# some illustrations (see "title_substitute.py")

list_job_titles = ['registered nurses',
                   'rn', 
                   'hostesses',
                   'foremen', 
                   'customer service rep']

for title in list_job_titles: 
    title_clean = substitute_titles(title,word_substitutes,phrase_substitutes)
    print('original title = ' + title)
    print('edited title = ' + title_clean)
    print('---')

original title = registered nurses
edited title = registered nurse
---
original title = rn
edited title = registered nurse
---
original title = hostesses
edited title = host
---
original title = foremen
edited title = foreman
---
original title = customer service rep
edited title = customer service representative
---


## Some technical issues

* The procedure of replacing plural nouns with their singular form works in general:

In [4]:
substitute_titles('galaxies',word_substitutes,phrase_substitutes)
# Note: We do not supply the mapping from 'galaxies' to 'galaxy'.

'galaxy'

* The procedure of replacing abbreviations, on the other hand, requires user-provided information: we list the most common substitutions by hand. We cannot identify every abbreviation this way, but the continuous bag of words (CBOW) model used later helps, because common abbreviations tend to receive vector representations similar to those of the words they abbreviate. 

## Map job titles to SOC codes

First, for the 1,000 most common job titles, we map the titles to their SOC codes using ONET-SOC AutoCoder (see [here](http://www.onetsocautocoder.com/plus/onetmatch)). These mappings are retrieved manually. This results in a mapping between job titles and SOC codes for 3.9 million newspaper job ads.

For the remaining ads, we apply a continuous bag of words (CBOW) model, in combination with online job vacancy postings, provided by Economic Modeling Specialists International (EMSI), containing a large correspondence between job titles and SOC codes. See [here](https://github.com/phaiptt125/online_job_posting/blob/master/data_cleaning/initial_cleaning.ipynb) for more information on how we pre-process online job vacancy postings.
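
The two-stage logic described above can be sketched as a lookup with a fallback. This is only a schematic under stated assumptions: `assign_soc`, `manual_map`, and `dummy_fallback` are hypothetical names, and the single title/code pair is borrowed from this notebook's sample output purely as a placeholder for the manually retrieved table.

```python
def assign_soc(title, manual_map, fallback):
    """Return the hand-retrieved SOC code if one exists, else defer to `fallback`."""
    if title in manual_map:
        return manual_map[title]
    return fallback(title)


# Hypothetical stand-in for the manually retrieved mappings.
manual_map = {"expeditor": "435061"}


def dummy_fallback(title):
    # Placeholder for the similarity-based matching; returns a marker string here.
    return "unmatched"
```

In the actual procedure, the fallback would be the CBOW-based match against online posting titles described in the rest of this notebook.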

We extract information on job titles and SOC codes from one month of online job vacancy postings, January 2012, which results in 332,829 unique mappings between job titles and SOC codes:  

In [5]:
title2SOC_filename = 'online_postings_title2SOC.txt'

# import into pandas dataframe
title2SOC = pd.read_csv(title2SOC_filename, sep = '\t', names = ['title','soc'])

# print number of total mappings
print('Total mappings = ' + str(len(title2SOC)))

# keep only the first 100 mappings for this illustration
title2SOC = title2SOC.head(100).copy()

# implement the same title editing procedure illustrated above
title2SOC['title'] = title2SOC['title'].apply(lambda x: substitute_titles(x,word_substitutes,phrase_substitutes))

# print some examples
title2SOC.head()

Total mappings = 332829


| | title | soc |
|---|---|---|
| 0 | expeditor | 435061 |
| 1 | coach project | 119199 |
| 2 | entry full level management provided training | 411012 |
| 3 | customer professional service | 434051 |
| 4 | coordinator patient service | 434051 |


## Compute a vector representation of each job title

Next, we apply the CBOW model previously constructed to represent each job title with a vector. In the actual implementation, we set the dimension of the CBOW model to 300, as explained [here](https://github.com/phaiptt125/newspaper_project/blob/master/data_cleaning/CBOW.ipynb). 

For illustrative purposes, however, this IPython notebook provides examples using a CBOW model with a dimension of 5. The embedded code below illustrates how we construct this CBOW model:

***
    model = Word2Vec(LineSentence(open('ad_combined.txt')), size = 5, window = 5, min_count = 5, workers = multiprocessing.cpu_count())
    
    model.save('cbow_small.model')
***

In [6]:
model = Word2Vec.load('cbow_small.model')
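# 'cbow_small.model' has a dimension of 5.
# In the actual implementation, we use our previously constructed 'cbow.model', which has a dimension of 300.
# Note: the model['word'] lookups below follow older gensim releases; gensim 4.x uses model.wv['word'].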

* The model provides a vector representation of each word in the corpus. For example:

In [7]:
model['customer']

array([-0.23945422, -0.33969662, -0.25194243,  0.86623007,  0.11592443], dtype=float32)

In [8]:
model['professional']

array([-0.37457037, -0.43614858,  0.05933725,  0.80807394,  0.11387233], dtype=float32)

In [9]:
model['service']

array([-0.30502519, -0.39435992, -0.19132054,  0.81630003,  0.22020572], dtype=float32)

* We compute a vector representation of "customer professional service" by summing the vectors of its words: 

In [10]:
vector_title = model['customer'] + model['professional'] + model['service']
vector_title

array([-0.91904974, -1.17020512, -0.38392574,  2.49060392,  0.45000249], dtype=float32)

In the same way, we can compute a vector representation of:
1. All job titles from our newspaper data.
2. All job titles from online job vacancy postings in January 2012.
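
The summation can be wrapped in a small helper. The sketch below is an assumption-laden stand-in rather than the project's code: `title_to_vector` and `toy_vectors` are hypothetical names, and a plain dictionary holding the three vectors printed above replaces the trained model. Words without a vector are skipped, in line with the note later in this notebook that words outside the CBOW model are ignored.

```python
import numpy as np

# Toy embedding table: the 5-dimensional vectors printed above.
toy_vectors = {
    "customer": np.array(
        [-0.23945422, -0.33969662, -0.25194243, 0.86623007, 0.11592443],
        dtype=np.float32),
    "professional": np.array(
        [-0.37457037, -0.43614858, 0.05933725, 0.80807394, 0.11387233],
        dtype=np.float32),
    "service": np.array(
        [-0.30502519, -0.39435992, -0.19132054, 0.81630003, 0.22020572],
        dtype=np.float32),
}


def title_to_vector(title, vectors, dim=5):
    """Sum the vectors of the title's words, skipping words with no vector."""
    total = np.zeros(dim, dtype=np.float32)
    for word in title.split():
        if word in vectors:
            total += vectors[word]
    return total


vector_title = title_to_vector("customer professional service", toy_vectors)
```

With these inputs, `vector_title` matches the sum computed in the cell above up to floating-point rounding.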

## Map newspaper job titles to online posting job titles

To each newspaper job title, we assign the "closest" online posting job title for which an SOC code is available. We measure closeness with cosine similarity. Cosine similarity ranges from -1 to 1, and a value closer to 1 means that the two vectors point in more similar directions. 
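
For reference, cosine similarity is the dot product of two vectors divided by the product of their norms. The sketch below computes it directly with NumPy; the helper name `cosine` is an assumption for this example, and the two input vectors are the "customer" and "service" vectors printed earlier in this notebook.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# 5-dimensional vectors for "customer" and "service", copied from the output above.
customer = [-0.23945422, -0.33969662, -0.25194243, 0.86623007, 0.11592443]
service = [-0.30502519, -0.39435992, -0.19132054, 0.81630003, 0.22020572]

sim_related = cosine(customer, service)                    # close to 1
sim_same = cosine(customer, customer)                      # 1 up to rounding
sim_opposite = cosine(customer, [-x for x in customer])    # -1 up to rounding
```

The cells below use `cosine_similarity` from scikit-learn, which computes the same quantity for 2-D inputs.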

* Suppose the newspaper job title is "client representative". Its vector is:  

In [11]:
vector_to_match = model['client'] + model['representative']
vector_to_match

array([-1.26642871, -0.56747955, -0.3570725 ,  1.08817148,  0.34984976], dtype=float32)

* The cosine similarity of "client representative" and "customer professional service" is:  

In [12]:
vector_title = model['customer'] + model['professional'] + model['service']
cosine_similarity(vector_to_match.reshape(1,-1), vector_title.reshape(1,-1))

array([[ 0.89043331]], dtype=float32)

* The cosine similarity of "client representative" and "mechanical engineer" is:

In [13]:
vector_title = model['mechanical'] + model['engineer']
cosine_similarity(vector_to_match.reshape(1,-1), vector_title.reshape(1,-1))

array([[ 0.40693504]], dtype=float32)

* The cosine similarity of "client representative" and "executive secretary" is:

In [14]:
vector_title = model['executive'] + model['secretary']
cosine_similarity(vector_to_match.reshape(1,-1), vector_title.reshape(1,-1))

array([[ 0.39240193]], dtype=float32)

Therefore, using the CBOW model, we conclude that "client representative" is closer in meaning to "customer professional service" than to either "mechanical engineer" or "executive secretary".   

In the actual implementation, we compute the cosine similarity of "client representative" to all 332,829 job titles from online job vacancy postings in January 2012. This computation, however, cannot be done in this IPython notebook. 
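
A small-scale sketch of that search is given below, under stated assumptions. The function name `nearest_title` is hypothetical, and the candidate matrix holds only two title vectors copied from this notebook's printed output, not the full set of online posting titles. Rows and query are normalized so that a single matrix-vector product returns all cosine similarities at once.

```python
import numpy as np


def nearest_title(query_vec, title_matrix, titles):
    """Return (title, similarity) for the row of title_matrix closest to query_vec."""
    # Normalize rows and query so that dot products equal cosine similarities.
    row_norms = np.linalg.norm(title_matrix, axis=1, keepdims=True)
    unit_rows = title_matrix / np.where(row_norms == 0, 1.0, row_norms)
    unit_query = query_vec / np.linalg.norm(query_vec)
    sims = unit_rows @ unit_query
    best = int(np.argmax(sims))
    return titles[best], float(sims[best])


# Candidate titles and their vectors, copied from this notebook's output.
titles = ["customer professional service", "executive secretary"]
title_matrix = np.array([
    [-0.91904974, -1.17020512, -0.38392574, 2.49060392, 0.45000249],
    [-0.5665881, -0.73142403, 0.72307652, -0.10102642, 1.02186275],
])

# Vector for "client representative", copied from the output above.
query = np.array([-1.26642871, -0.56747955, -0.3570725, 1.08817148, 0.34984976])

best_title, best_sim = nearest_title(query, title_matrix, titles)
```

Among these two candidates, `best_title` is "customer professional service", consistent with the pairwise similarities shown above. With hundreds of thousands of candidate rows the same product still works, though memory use and batching would need attention.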

## Assign SOC codes to newspaper job titles

Once we identify the closest job title from online vacancy postings, we assign its SOC code to the newspaper job title. Here, "client representative service" has the closest vector representation to "client representative", so "client representative" receives the same SOC code as "client representative service".   

In [15]:
vector_title = model['client'] + model['representative'] + model['service']
cosine_similarity(vector_to_match.reshape(1,-1), vector_title.reshape(1,-1))

array([[ 0.98711061]], dtype=float32)

In [16]:
title2SOC[title2SOC['title'] == "client representative service"]

| | title | soc |
|---|---|---|
| 12 | client representative service | 434051 |


* The example above, "customer professional service", also maps to the same SOC code.

In [17]:
title2SOC[title2SOC['title'] == "customer professional service"]

| | title | soc |
|---|---|---|
| 3 | customer professional service | 434051 |


## Some technical issues

* We ignore job title words that are not in our CBOW model. 
* Unlike the LDA model, we do not stem words. As a result, the model treats different forms of a word, e.g., "manage" and "management", as different words. However, our CBOW model generally assigns them similar vector representations, for example: 

In [18]:
cosine_similarity(model['manage'].reshape(1,-1), model['management'].reshape(1,-1))

array([[ 0.92724895]], dtype=float32)

* Our procedure is invariant to the order of job title words, e.g., we consider "executive secretary" and "secretary executive" as the same title. 

In [19]:
model['executive'] + model['secretary']

array([-0.5665881 , -0.73142403,  0.72307652, -0.10102642,  1.02186275], dtype=float32)

In [20]:
model['secretary'] + model['executive']

array([-0.5665881 , -0.73142403,  0.72307652, -0.10102642,  1.02186275], dtype=float32)

* Common abbreviations tend to have meanings similar to those of the words they abbreviate. For instance, "rn" is a common abbreviation for "registered nurse", and our CBOW model assigns the two very similar vector representations:   

In [21]:
vector_title = model['registered'] + model['nurse']
cosine_similarity(model['rn'].reshape(1,-1), vector_title.reshape(1,-1))

array([[ 0.98632824]], dtype=float32)