# The Continuous Bag of Words Model

Online supplementary material to "The Evolution of Work in the United States" by Enghin Atalay, Phai Phongthiengtham, Sebastian Sotelo and Daniel Tannenbaum.

* [Project data library](https://occupationdata.github.io) 

* [GitHub repository](https://github.com/phaiptt125/newspaper_project)

***

This IPython notebook demonstrates how we map occupational characteristics to words or phrases from newspaper text using the Continuous Bag of Words (CBOW) model. 

* See [here](http://ssc.wisc.edu/~eatalay/apst/apst_mapping.pdf) for more examples.
* See project data library for full results.

<b> Due to copyright restrictions, we are not authorized to publish a large body of newspaper text. </b>
***

## Import necessary modules

In [1]:
import os
import re
import sys
import platform
import collections
import shutil

import pandas
import math
import multiprocessing
import os.path
import numpy as np
from gensim import corpora, models
from gensim.models import Word2Vec, keyedvectors 
from gensim.models.word2vec import LineSentence
from sklearn.metrics.pairwise import cosine_similarity

In our implementation, the text corpus for the model consists of all of the text from job ads that appeared in our cleaned newspaper data, plus the raw text from job ads that were posted online in two months: January 2012 and January 2016.

## Prepare newspaper text data

For newspaper text data, we:

1. Retrieve document metadata, remove markup from the newspaper text, and perform an initial spell-check of the text (see [here](https://github.com/phaiptt125/newspaper_project/blob/master/data_cleaning/initial_cleaning.ipynb)). 
2. Exclude non-job ad pages (see [here](https://github.com/phaiptt125/newspaper_project/blob/master/data_cleaning/LDA.ipynb)).
3. Transform unstructured newspaper text into spreadsheet data (see [here](https://github.com/phaiptt125/newspaper_project/blob/master/data_cleaning/structured_data.ipynb)).
4. Delete all non-alphabetic characters, e.g., digits and punctuation.
5. Convert all characters to lowercase. 

The example below demonstrates how to perform steps 4 and 5 on a very short snippet from Display Ad page 226 of the January 14, 1979 Boston Globe. 

In [2]:
text = "manage its Primary Care Programs including 24-hour Emergency Room Primary Care program"

print('--- newspaper text ---')
print(text)
print('')
print('--- transformed text ---')
print(re.sub('[^a-z ]','',text.lower()))

--- newspaper text ---
manage its Primary Care Programs including 24-hour Emergency Room Primary Care program

--- transformed text ---
manage its primary care programs including hour emergency room primary care program
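
Steps 4 and 5 can be applied to a whole collection of ads in the same way. The sketch below is a minimal illustration, not the project's actual pipeline: the helper names `clean_ad_text` and `write_clean_corpus`, the output filename, and the second sample ad are hypothetical. It writes one cleaned ad per line, which is the whitespace-delimited, one-sentence-per-line layout that gensim's `LineSentence` reads.

```python
import os
import re
import tempfile


def clean_ad_text(text):
    """Lowercase the text and drop every character except a-z and spaces."""
    return re.sub('[^a-z ]', '', text.lower())


def write_clean_corpus(ads, path):
    """Write one cleaned ad per line (hypothetical helper for illustration)."""
    with open(path, 'w') as f:
        for ad in ads:
            # Newlines inside an ad become spaces so each ad stays on one line.
            f.write(clean_ad_text(ad.replace('\n', ' ')) + '\n')


ads = [
    "manage its Primary Care Programs including 24-hour Emergency Room Primary Care program",
    "Bkkpr wanted,\nexp. req'd",  # made-up sample, not from the newspaper data
]

corpus_path = os.path.join(tempfile.mkdtemp(), 'ads_clean.txt')
write_clean_corpus(ads, corpus_path)

# Read the file back to inspect the cleaned lines.
with open(corpus_path) as f:
    cleaned_lines = f.read().splitlines()

print(cleaned_lines)
```

Because the regular expression keeps spaces, token boundaries survive the cleaning and each line can be split on whitespace later.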


## Prepare online job posting text data

Economic Modeling Specialists International (EMSI) provided us with online postings data in a processed and relatively clean form: see [here](https://github.com/phaiptt125/online_job_posting/blob/master/data_cleaning/initial_cleaning.ipynb).

For the purpose of this project, we use online postings data to:
1. Enrich the sample of text usage when constructing the Continuous Bag of Words model.
2. Retrieve a mapping between job titles and ONET-SOC codes. 

## Construct CBOW model

In [3]:
# filename of the combined ads ~ 15 GB 
text_data_filename = 'ad_combined.txt'

# construct CBOW model
dim_model = 300
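# sg=0 (the gensim default) selects CBOW; `size` is the gensim < 4.0 keyword (renamed `vector_size` in 4.0)
model = Word2Vec(LineSentence(text_data_filename), 
                 size=dim_model, 
                 sg=0, 
                 window=5, 
                 min_count=5, 
                 workers=multiprocessing.cpu_count())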

model.init_sims(replace=True)

# define output filename for CBOW model
cbow_filename = 'cbow.model'

# save model into file
model.save(cbow_filename)
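
For intuition about what the call above trains on: with gensim's default `sg=0`, `Word2Vec` fits a CBOW model, which predicts each word from the words surrounding it within `window` positions. The sketch below only enumerates those (context, target) pairs for a toy sentence. `cbow_pairs` is a hypothetical helper, not part of gensim, and gensim's actual training includes details omitted here (for example, randomly shrinking the window).

```python
def cbow_pairs(tokens, window):
    """Return (context_words, target_word) pairs for one tokenized sentence."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]    # up to `window` words before
        right = tokens[i + 1:i + 1 + window]   # up to `window` words after
        context = left + right
        if context:  # a one-word sentence has no context to learn from
            pairs.append((context, target))
    return pairs


sentence = "manage its primary care programs".split()
pairs = cbow_pairs(sentence, window=2)
for context, target in pairs:
    print(context, '->', target)
```

Words that occur with similar contexts receive similar vectors, which is what the similarity queries in the next section rely on.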

## Compute similar words

In [4]:
# load model
model = Word2Vec.load(cbow_filename)
word_all = model.wv # keyed vectors of all words in the model; supports "word in word_all"

In [5]:
def find_similar_words(phrase,model,dim_model):
    # This function computes similar words given a word or phrase.
    # If the input is just one word, this function is the same as the gensim built-in function: model.wv.most_similar
    
    # phrase : word or phrase to look up. For a phrase with multiple words, add "_" in between.
    # model : constructed CBOW model
    # dim_model : dimension of the model, i.e., length of the vector of each word 
    
    tokens = [w for w in phrase.split('_') if w in model.wv] 
    # split input into tokens, ignoring words that are not in the model  
    
    vector_by_word = np.zeros((len(tokens),dim_model)) # initialize a matrix 
    
    for i, word in enumerate(tokens): # loop over each word
        vector_by_word[i,:] = model.wv[word] # record its vector representation
    
    vector_this_phrase = vector_by_word.sum(axis=0) 
    # sum over words to get a vector representation of the whole phrase
    
    most_similar_words = model.wv.similar_by_vector(vector_this_phrase, topn=100, restrict_vocab=None)
    # find 100 most similar words
    
    most_similar_words = [w for w in most_similar_words if w[0] != phrase]
    # take out the output word that is identical to the input word
    
    return most_similar_words

The cosine similarity score of any pair of words/phrases is defined as the cosine of the angle between the two vectors representing them. A higher cosine similarity score means the two words/phrases tend to appear in similar contexts.
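
As a small worked illustration of this definition, the sketch below computes cos(u, v) = u·v / (‖u‖‖v‖) for toy three-dimensional vectors. The vectors and the `cosine` helper are made up for illustration and are not taken from the trained model, whose vectors have `dim_model` = 300 entries. Following `find_similar_words`, a phrase is represented by the sum of its word vectors.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors, invented for this example.
toy_vectors = {
    'primary': [1.0, 0.0, 0.0],
    'care':    [0.0, 1.0, 0.0],
    'health':  [1.0, 1.0, 0.0],
    'welding': [0.0, 0.0, 1.0],
}

# A phrase vector is the element-wise sum of its word vectors.
phrase_vector = [a + b for a, b in zip(toy_vectors['primary'], toy_vectors['care'])]

sim_health = cosine(phrase_vector, toy_vectors['health'])    # same direction
sim_welding = cosine(phrase_vector, toy_vectors['welding'])  # orthogonal
print(sim_health, sim_welding)
```

Vectors pointing in the same direction score close to 1, and orthogonal vectors score 0, regardless of their lengths.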

The function *find_similar_words* above returns a list of similar words, ordered by cosine similarity score, together with their scores. For example, the ten most similar words to "creative" are: 

In [6]:
most_similar_words = find_similar_words('creative',model,dim_model)
most_similar_words[:10]

[('imaginative', 0.6997416615486145),
 ('versatile', 0.6824457049369812),
 ('creature', 0.591433584690094),
 ('innovative', 0.5758161544799805),
 ('resourceful', 0.5575118660926819),
 ('creallve', 0.5550633668899536),
 ('restive', 0.5526227951049805),
 ('dynamic', 0.5416233539581299),
 ('clever', 0.5349052548408508),
 ('pragmatic', 0.5299020409584045)]

Likewise, the ten most similar words to "bookkeeping" are:

In [7]:
most_similar_words = find_similar_words('bookkeeping',model,dim_model)
most_similar_words[:10]

[('bkkp', 0.6903467178344727),
 ('beekeeping', 0.6871334314346313),
 ('stenography', 0.672173023223877),
 ('bkkpng', 0.6181079745292664),
 ('bkkpg', 0.6175851821899414),
 ('bookkpg', 0.5925684571266174),
 ('dkkpg', 0.5809350609779358),
 ('bkkping', 0.5768048167228699),
 ('clerical', 0.5741672515869141),
 ('payroll', 0.5619226098060608)]

The strength of the Continuous Bag of Words (CBOW) model is twofold. First, the model provides context-based synonyms, which allow us to keep track of relevant words even if their usage changes over time. We provide one example in the main paper: 

*For instance, even though “creative” and “innovative” largely refer to the same occupational skill, it is possible that their relative usage among potential employers may differ within the sample period. This is indeed the case: Use of the word “innovative” has increased more quickly than “creative” over the sample period. To the extent that our ad hoc classification included only one of these two words, we would be mis-characterizing trends in the ONET skill of “Thinking Creatively.” The advantage of the continuous bag of words model is that it will identify that “creative” and “innovative” mean the same thing because they appear in similar contexts within job ads. Hence, even if employers start using “innovative” as opposed to “creative” part way through our sample, we will be able to consistently measure trends in “Thinking Creatively” throughout the entire period.*

The second advantage of the CBOW model is that it identifies common abbreviations and transcription errors. The word "bookkeeping", for instance, was often mistranscribed as "beekeeping" due to imperfections in the Optical Character Recognition (OCR) algorithm. Our CBOW model also reveals abbreviations that employers often used, such as "bkkp" and "bkkpng".
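
One way such output could be put to use is sketched below: keep the neighbours whose similarity exceeds a cutoff and count any of them as a mention of the same concept. This is an illustrative assumption rather than the procedure from the paper. The cutoff of 0.6, the helper `count_mentions`, and the sample ad are hypothetical; the neighbour list is copied from the "bookkeeping" output above.

```python
# Ten nearest neighbours of "bookkeeping", copied from the notebook output.
neighbours = [
    ('bkkp', 0.6903467178344727),
    ('beekeeping', 0.6871334314346313),
    ('stenography', 0.672173023223877),
    ('bkkpng', 0.6181079745292664),
    ('bkkpg', 0.6175851821899414),
    ('bookkpg', 0.5925684571266174),
    ('dkkpg', 0.5809350609779358),
    ('bkkping', 0.5768048167228699),
    ('clerical', 0.5741672515869141),
    ('payroll', 0.5619226098060608),
]

cutoff = 0.6  # hypothetical threshold, chosen only for this example
related_words = {'bookkeeping'} | {w for w, score in neighbours if score >= cutoff}


def count_mentions(cleaned_text, vocabulary):
    """Count tokens of a cleaned ad that belong to the vocabulary."""
    return sum(1 for token in cleaned_text.split() if token in vocabulary)


sample_ad = "exp bkkpg and payroll clerk beekeeping machine"  # made-up sample
n_mentions = count_mentions(sample_ad, related_words)
print(sorted(related_words), n_mentions)
```

A simple cutoff also admits genuinely different words such as "stenography" while excluding abbreviations such as "bookkpg", so any threshold of this kind would need to be chosen and checked with care.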