# The LDA Model to Identify Job Ads
Online supplementary material to "The Evolution of Work in the United States" by Enghin Atalay, Phai Phongthiengtham, Sebastian Sotelo and Daniel Tannenbaum. 

* [Most recent version of the paper](http://ssc.wisc.edu/~eatalay/skills.pdf)

* [Project data library](http://ssc.wisc.edu/~eatalay/occupation_data.html) 

* [GitHub repository](https://github.com/phaiptt125/newspaper_project)

***

This IPython notebook demonstrates the Latent Dirichlet Allocation (LDA) procedure used
to identify which advertisements in our newspaper text are job ads. The text input for this procedure is the output of the initial text cleaning step (see [here](https://github.com/phaiptt125/newspaper_project/blob/master/data_cleaning/initial_cleaning.ipynb)).

<b> Due to copyright restrictions, we are not authorized to publish a large body of newspaper text. </b>
***

## List of auxiliary files (see project data library or GitHub repository)

* *extract_LDA_result.py* : This Python code extracts results from the LDA estimation into machine-readable files.
* *compute_spelling.py* : This Python code computes the ratio of correctly spelled words and records all correctly spelled words.
***
## Import necessary modules

In [1]:
import os
import re
import sys
import pandas as pd

# settings for pandas module
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 50)

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.snowball import SnowballStemmer

stop_words = set(stopwords.words('english'))
stemmer = SnowballStemmer("english")

import gensim
from gensim import corpora, models

sys.path.append('./auxiliary files')

from compute_spelling import *
from extract_LDA_result import *

## Stopwords

According to "Natural Language Processing with Python" by Steven Bird, Ewan Klein, and Edward Loper, stopwords are usually "[words with] little lexical content, and their presence in a text fails to distinguish it from other texts." See [here](http://www.nltk.org/book/ch02.html) for more explanations. These words include:

In [2]:
from nltk.corpus import stopwords
print(list(stopwords.words('english')))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'no

* In our application, for example, knowing that a page contains any of these stopwords does not help us classify whether the page is related to job ads.
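
As a minimal sketch of what stopword removal does, the snippet below filters one sentence from the example page used later in this notebook. To run without downloading the NLTK corpus, it uses a small hand-picked subset of the list above; the name `toy_stop_words` is illustrative and is not part of our pipeline.

```python
# Toy subset of the NLTK English stopword list printed above (illustrative only).
toy_stop_words = {'we', 'are', 'who', 'that', 'and', 'to', 'as', 'us'}

sentence = 'We are seeking talented professionals who share that appreciation'

# Lower-case each word and keep it only if it is not a stopword.
content_words = [w for w in sentence.lower().split() if w not in toy_stop_words]
print(content_words)
# ['seeking', 'talented', 'professionals', 'share', 'appreciation']
```

The remaining words are the ones that carry information about the topic of the page.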

## Stemmers
According to "Natural Language Processing with Python" by Steven Bird, Ewan Klein, and Edward Loper, the purpose of stemmers is to "remove morphological affixes from words, leaving only the word stem." See [here](http://www.nltk.org/howto/stem.html) for more explanations.

In [3]:
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")
original_words = ['manager','managers','management','manage','manages','managing']
stemmed = [stemmer.stem(w) for w in original_words]

print('Original words: ')
print(original_words)
print('----------------------------------')
print('Stems: ')
print(stemmed)
print('----------------------------------')

Original words: 
['manager', 'managers', 'management', 'manage', 'manages', 'managing']
----------------------------------
Stems: 
['manag', 'manag', 'manag', 'manag', 'manag', 'manag']
----------------------------------


* In our application, the advantage of stemming is that it reduces the dimensionality of the text data while retaining sufficient information about what each page is roughly about. Indeed, a group of word stems is sufficient for us to classify which advertisements in our newspaper text are job ads.

## Dimensionality reduction

A Latent Dirichlet Allocation (LDA) procedure takes all words in each document into account. Estimating an LDA model on our full sample of newspaper pages and words would impose a huge computational burden. We therefore take the following steps before estimation:

1. Restrict attention to pages of advertisements which are sufficiently long, with at least 200 words. 
2. Remove stopwords, numerals, and words which are not contained in the English dictionary.
3. Stem words; that is, we remove word affixes so that words in different forms—singular nouns, plural nouns, verbs, adjectives, adverbs—are grouped as one.
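
The three steps above can be sketched as a single function. This is a simplified illustration, not the code in *compute_spelling.py*: the function name `preprocess_page`, its arguments, and the toy dictionary, stopword set and stemmer are all assumptions made for the example. In practice the `stem` argument could be the `stem` method of a Snowball stemmer.

```python
def preprocess_page(text, english_words, stop_words, stem, min_words=200):
    """Return the list of word stems for one page, or None if the page is too short.

    english_words, stop_words: sets of lower-case words (assumed inputs).
    stem: any callable that maps a word to its stem.
    """
    words = text.split()
    # Step 1: skip pages with fewer than min_words words.
    if len(words) < min_words:
        return None
    stems = []
    for w in words:
        w = w.lower()
        # Step 2: drop numerals, stopwords, and words not in the dictionary.
        if not w.isalpha() or w in stop_words or w not in english_words:
            continue
        # Step 3: reduce the word to its stem.
        stems.append(stem(w))
    return stems


# Toy inputs (hypothetical): a tiny dictionary and a crude stand-in stemmer.
toy_dictionary = {'we', 'are', 'seeking', 'engineers', 'engineer'}
toy_stop_words = {'we', 'are'}
toy_stem = lambda w: w[:-1] if w.endswith('s') else w

page = 'We are seeking 25 engineers xqz'
short_result = preprocess_page(page, toy_dictionary, toy_stop_words, toy_stem)
toy_result = preprocess_page(page, toy_dictionary, toy_stop_words, toy_stem, min_words=5)
print(short_result)  # None: the page has fewer than 200 words
print(toy_result)    # ['seeking', 'engineer']
```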

This section describes how we implement these three steps on one newspaper page: the 106th page of display ads in the August 12, 1979 edition of the Boston Globe (the page identifier is "Globe_displayad_19790812_106").

In [4]:
# import newspaper page
with open('Globe_displayad_19790812_106.txt') as f:
    content_ad = f.read()
print(content_ad)

Olin Skis are recognized as tops in the industry Our commitment to excellence speaks for itself We are proud of that our business Is built on it and we re determined to keep it that way
In only 9 years we ve achieved position of leadership in the ski Industry Maintaining that leadership will be as much the result of creative effective management as it will be of technical design At Olin Ski we appreciate that We are seeking talented professionals who share that appreciation and want to join us as we grow
PROJECT ENGINEER
You will be totally Involved in projects requiring analysis and development of manufacturing processes including machine design cost analysis machine fabrication and delivery and Implementation to the manufacturing operation
The position requires well organized self motivated decision maker with good communication skills You should have BSME or equivalent and 2-5 years experience In manufacturing engineering or process development
TECHNICIAN
This position in our develo

In [5]:
# count and record all correctly-spelled words
WordCount,CorrectSpellingWords = RecordCorrectSpelling(content_ad) # see compute_spelling.py  
print('Total word count = ' + str(WordCount)) 

Total word count = 1456


In [6]:
tokens = word_tokenize(CorrectSpellingWords) # transform string into list of words (tokens)

# remove stopwords and stem all tokens
selected_tokens = [stemmer.stem(w) for w in tokens if w not in stop_words]
print(','.join(selected_tokens)) # print all word stems

ski,recogn,top,industri,commit,excel,speak,proud,busi,built,determin,keep,year,achiev,posit,leadership,ski,industri,maintain,leadership,much,result,creativ,effect,manag,technic,design,ski,appreci,seek,talent,profession,share,appreci,want,join,total,involv,project,requir,analysi,develop,manufactur,process,includ,machin,design,cost,analysi,machin,fabric,deliveri,implement,manufactur,posit,requir,well,organ,self,motiv,decis,maker,good,communic,skill,equival,year,experi,manufactur,engin,process,posit,develop,depart,involv,materi,construct,laboratori,test,prototyp,design,work,close,develop,engin,design,fabric,prepar,test,futur,ski,model,race,qualifi,least,year,technic,school,abil,work,independ,previous,experi,ski,work,experi,technic,area,would,offer,day,work,week,excel,salari,benefit,packag,modern,facil,minut,south,seek,growth,opportun,leader,industri,send,resum,salari,histori,ski,compani,inc,smith,equal,opportun,system,corpor,fastest,grow,largest,manufactur,build,autom,energi,manag,system,

* The 'selected_tokens' of each page form the basis for LDA model estimation.
* In addition, we randomly sample 100,000 newspaper pages. The identifier of each sampled page is recorded in 'RandomSeed.txt'.
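
One way to draw such a sample reproducibly is sketched below. The helper names, the seed value and the toy identifiers are hypothetical, and the one-identifier-per-line layout is an assumption about 'RandomSeed.txt' that matches how the file is read later in this notebook.

```python
import random

def sample_identifiers(identifiers, sample_size, seed=0):
    """Draw a reproducible random sample of page identifiers without replacement."""
    rng = random.Random(seed)  # fixed seed so the same sample can be redrawn
    return rng.sample(list(identifiers), sample_size)

def write_identifiers(identifiers, filename):
    """Write one identifier per line."""
    with open(filename, 'w') as f:
        f.write('\n'.join(identifiers) + '\n')

# Toy example: sample 5 of 20 made-up page identifiers.
toy_ids = ['Globe_displayad_19600425_%d' % i for i in range(1, 21)]
toy_sample = sample_identifiers(toy_ids, sample_size=5)
print(toy_sample)
```

Calling `write_identifiers(toy_sample, 'RandomSeed_example.txt')` would then save the sample in that one-per-line format.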

## LDA Estimation 
This section describes the LDA estimation. First, all newspaper pages are combined into one input file, called "LDA_input.txt".

In [7]:
data = pd.read_csv('LDA_input.txt', sep='\t', names=['page_identifier','tokens'])
data.head(10) # print some examples

Unnamed: 0,page_identifier,tokens
0,Globe_displayad_19600425_1,auto buy yes ought buy greatest select best de...
1,Globe_displayad_19600425_10,friend top valu halibut special rich flavor be...
2,Globe_displayad_19600425_11,shop hour open night store open everi night th...
3,Globe_displayad_19600425_12,hord new largest store outstand tradit clock f...
4,Globe_displayad_19600425_14,compani get advertis dollar direct mail hard f...
5,Globe_displayad_19600425_16,clear jiffi ordinari plunger seat proper forc ...
6,Globe_displayad_19600425_17,end corn doctor send pain fli fast quick remov...
7,Globe_displayad_19600425_18,restaur jot rain alert speaker former train un...
8,Globe_displayad_19600425_19,one million home per year welcom wagon hostess...
9,Globe_displayad_19600425_21,cemeteri lot anoth garden plan ahead import ma...


In [8]:
# import the identifiers of the 100,000 sampled pages (a set makes membership tests fast)
with open('RandomSeed.txt') as f:
    random_seed = set(w for w in f.read().split('\n') if w != '')

In [9]:
# prepare inputs for LDA estimation
IdenifierList = list() # list all identifiers
LDAinputList_all = list() # all pages
LDAinputList_sample = list() # estimate using pages listed in the random seeds

for index, row in data.iterrows():

    IdenifierList.append(row['page_identifier'])
    LDAinputList_all.append(word_tokenize(row['tokens']))

    if row['page_identifier'] in random_seed:
        LDAinputList_sample.append(word_tokenize(row['tokens']))

In [10]:
# Turn tokenized documents into an id <-> term dictionary
dictionary = corpora.Dictionary(LDAinputList_all)
dictionary.filter_extremes(no_below = 5, no_above = 0.95, keep_n = 1000)

# The call above filters out tokens that appear in
# (1) fewer than no_below documents (absolute number), or
# (2) more than no_above documents (fraction of total corpus size,
# not absolute number).
# (3) After (1) and (2), it keeps only the keep_n most frequent
# tokens (or keeps all if None).

dictionary.compactify() # re-do the term dictionary after the filter

# map documents to dictionary
corpus_sample = [dictionary.doc2bow(t) for t in LDAinputList_sample]
corpus_all   = [dictionary.doc2bow(t) for t in LDAinputList_all]

In [11]:
# estimate LDA model 

num_topics = 5 # set number of topics

lda = gensim.models.ldamodel.LdaModel(corpus_sample, 
                                      num_topics = num_topics, 
                                      id2word = dictionary, 
                                      passes=50)

# apply the estimated LDA model to all pages to obtain each page's topic probabilities
doc_topic = lda[corpus_all]
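
For reference, the standard LDA generative model behind this estimation can be written as follows. The notation ($\theta_d$, $\beta_k$, $\alpha$, $\eta$) is the usual textbook one and is not defined elsewhere in this notebook.

```latex
\begin{aligned}
\beta_k &\sim \mathrm{Dirichlet}(\eta), & k &= 1,\dots,K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1,\dots,D \\
z_{d,n} \mid \theta_d &\sim \mathrm{Multinomial}(\theta_d) \\
w_{d,n} \mid z_{d,n}, \beta &\sim \mathrm{Multinomial}(\beta_{z_{d,n}})
\end{aligned}
```

Here $K$ corresponds to `num_topics`, $\beta_k$ is the distribution over word stems for topic $k$, $\theta_d$ is the distribution over topics for page $d$, $z_{d,n}$ is the topic assigned to the $n$-th word of page $d$, and $w_{d,n}$ is that word. Under this reading, `doc_topic` holds the estimated $\theta_d$ for every page.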

* Write the results to machine-readable files

In [12]:
TopicKeyword = lda.show_topics(num_topics = num_topics, 
                               num_words=50, 
                               log=False, 
                               formatted=False)

WordScoreList = GetWordScore(TopicKeyword) # see extract_LDA_result.py

# write down coefficients in the beta matrix (see project data page)
WordScoreFilename = 'WordScore_' + str(num_topics) + '_Topics.txt'
WordScoreFile = open(WordScoreFilename, 'w')
WordScoreFile.write('topic\tword\tscore' + '\n')
WordScoreFile.write('\n'.join(WordScoreList))
WordScoreFile.close()

# write down probability matrix by ad page 
ScoreTable = GetDocumentScore(doc_topic, num_topics) # see extract_LDA_result.py
DocumentScoreFilename = 'Document_' + str(num_topics) + '_Topics.txt'
DocumentScoreFile = open(DocumentScoreFilename, 'w') 
    
for i in range(0,len(ScoreTable)):
    DocumentScoreFile.write(IdenifierList[i]+'\t'+ScoreTable[i]+'\n')
    
DocumentScoreFile.close()
    
# write down most predictive words by topic    
ListWordByTopic = GetWordList(WordScoreList,num_topics) # see extract_LDA_result.py

WordFilename = 'Word_' + str(num_topics) + '_Topics.txt'
WordFile = open(WordFilename,'w')
for i in range(0, num_topics):
    WordFile.write(str(i) + '\t' + '\t'.join(ListWordByTopic[i]) + '\n')
WordFile.close()

## LDA Results
* Most predictive words by topic

In [13]:
word = pd.read_csv(WordFilename,sep='\t',names = ['topic'] + list(range(0,50)))
word

Unnamed: 0,topic,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49
0,0,reg,save,store,size,price,color,style,set,regular,charg,white,shop,open,qualiti,solid,cotton,select,design,mall,top,valu,use,special,new,tabl,marsh,includ,men,polyest,print,gold,one,choic,chair,item,fashion,nylon,choos,full,hour,blue,order,easi,look,assort,rug,famous,dress,long,great
1,1,car,price,new,auto,power,tire,stereo,air,door,stock,radio,ford,save,motor,low,model,speed,dealer,automat,brand,wheel,use,payment,list,control,system,month,window,servic,mile,cash,drive,truck,equip,wagon,warranti,buy,record,tape,factori,call,inc,brake,plus,leas,rear,financ,trade,instal,foreign
2,2,day,free,new,one,call,coupon,travel,per,offer,week,time,may,rate,includ,night,get,year,tax,hotel,special,price,save,make,plus,citi,money,pool,servic,good,avail,everi,globe,world,state,person,even,inform,today,name,two,box,first,way,see,reserv,send,best,like,bank,charg
3,3,street,open,inc,ave,mass,call,rte,rout,new,home,center,sat,main,sun,park,room,north,south,shop,exit,mall,hill,avail,servic,plaza,ski,hous,west,shore,locat,newton,live,free,road,bedroom,tel,estat,daili,area,cape,squar,best,show,wed,one,day,villag,restaur,inn,offic
4,4,system,experi,comput,opportun,year,manag,engin,program,requir,design,work,call,posit,develop,salari,product,includ,employ,profession,equal,compani,resum,busi,offic,help,benefit,new,pleas,send,servic,offer,excel,industri,oper,electron,high,must,school,need,respons,applic,process,person,career,open,control,degre,train,manufactur,account


* Out of these five topics, only topic number 4 contains words associated with job ads. As a result, we classify topic number 4 as the job ad topic.
* The number of topics, K, is chosen so that i) with K topics there is a single job-related topic, and ii) with K+1 topics there are multiple job-related topics.
* Next, we look at the probability that each ad page belongs to each topic:

In [14]:
names = ['ad_identifier','prob_0','prob_1','prob_2','prob_3','prob_4']
# 'prob_n' is the probability that a page belongs to topic n

document = pd.read_csv(DocumentScoreFilename, sep='\t', names = names)
document

Unnamed: 0,ad_identifier,prob_0,prob_1,prob_2,prob_3,prob_4
0,Globe_displayad_19600425_1,0.000,0.198,0.785,0.000,0.000
1,Globe_displayad_19600425_10,0.602,0.178,0.203,0.000,0.000
2,Globe_displayad_19600425_11,0.351,0.066,0.578,0.000,0.000
3,Globe_displayad_19600425_12,0.587,0.180,0.228,0.000,0.000
4,Globe_displayad_19600425_14,0.019,0.177,0.709,0.000,0.092
5,Globe_displayad_19600425_16,0.275,0.209,0.000,0.322,0.185
6,Globe_displayad_19600425_17,0.333,0.000,0.411,0.254,0.000
7,Globe_displayad_19600425_18,0.016,0.016,0.468,0.471,0.029
8,Globe_displayad_19600425_19,0.108,0.105,0.782,0.000,0.000
9,Globe_displayad_19600425_21,0.000,0.133,0.000,0.719,0.137


* The example page in the previous section, Globe_displayad_19790812_106, has a probability distribution of:  

In [15]:
document[document['ad_identifier'] == 'Globe_displayad_19790812_106']

Unnamed: 0,ad_identifier,prob_0,prob_1,prob_2,prob_3,prob_4
210971,Globe_displayad_19790812_106,0.0,0.0,0.0,0.028,0.971


* We classify a page as related to job ads if its estimated probability of belonging to the job ad topic is above 0.40; see [here](http://ssc.wisc.edu/~eatalay/apst/apst_lda.pdf) for a discussion of this cutoff.
* Our LDA model estimates that the Globe_displayad_19790812_106 page has a probability of 0.971 of belonging to topic number 4. We therefore classify this page as related to job ads.
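
The classification rule can be sketched as follows, using two rows copied from the tables above. The function and constant names are illustrative assumptions.

```python
JOB_AD_TOPIC = 4  # the topic identified above as job-related
CUTOFF = 0.40     # classification cutoff stated above

def is_job_ad_page(topic_probs, job_topic=JOB_AD_TOPIC, cutoff=CUTOFF):
    """Return True if the page's probability on the job ad topic exceeds the cutoff."""
    return topic_probs[job_topic] > cutoff

# Topic probabilities (prob_0, ..., prob_4) copied from the output above.
pages = {
    'Globe_displayad_19600425_14':  [0.019, 0.177, 0.709, 0.000, 0.092],
    'Globe_displayad_19790812_106': [0.0, 0.0, 0.0, 0.028, 0.971],
}

job_ad_flags = {page: is_job_ad_page(probs) for page, probs in pages.items()}
print(job_ad_flags)
# {'Globe_displayad_19600425_14': False, 'Globe_displayad_19790812_106': True}
```

On the full `document` table, the same rule would amount to a comparison such as `document['prob_4'] > 0.40`.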

At this point, we are able to identify newspaper pages that are related to job ads. The next step is to identify job titles, discern the boundaries between individual job ads, and finally transform the text into structured data (see [here](https://github.com/phaiptt125/newspaper_project/blob/master/data_cleaning/structured_data.ipynb)).