# DreamJobber

---

**Process**
1. Clean text
2. Bag of Words
3. LDA model (Latent Dirichlet allocation) 
4. Define Topics from LDA model
5. Tf-idf and K-means clustering for job titles (tentative)

---

**Import Necessary Libraries**

In [1]:
import pandas as pd
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
import numpy as np
import nltk
from functions import *
import pickle

In [2]:
#nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /Users/admin/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

---

**Load Data**

In [3]:
df = pd.read_csv('data/combined.csv')

In [4]:
df.head()

| | job_description | job_title |
|---|---|---|
| 0 | DOJ offers a range of opportunities for experi... | Attorney and Assistant United States Attorney |
| 1 | As an FBI Special Agent with a military or law... | Special Agent - Law Enforcement or Military Ve... |
| 2 | As an FBI Special Agent with expertise in educ... | Special Agent - Education/Teaching Background |
| 3 | As an FBI Special Agent with Accounting/Financ... | Special Agent - Accounting/Finance Background |
| 4 | As an FBI Special Agent, your STEM background ... | Special Agent - Science, Technology, Engineeri... |


In [5]:
#check for missing values
df.isna().sum()

job_description    5
job_title          0
dtype: int64

In [6]:
#looks like there are 5 rows that have no job description
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32000 entries, 0 to 31999
Data columns (total 2 columns):
job_description    31995 non-null object
job_title          32000 non-null object
dtypes: object(2)
memory usage: 500.1+ KB


In [7]:
#drop rows with no job descriptions
df = df[pd.notnull(df['job_description'])]

In [8]:
#sanity check, looks good
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 31995 entries, 0 to 31999
Data columns (total 2 columns):
job_description    31995 non-null object
job_title          31995 non-null object
dtypes: object(2)
memory usage: 749.9+ KB


---

## Text Cleaning

1. Tokenize
2. Remove words with fewer than 3 characters
3. Remove stop words
4. Normalize words (Lemmatize and Stem)
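
`preprocess` comes from the local `functions` module, which is not shown in this notebook. The four steps above could be sketched roughly as follows. This is a standard-library illustration under stated assumptions, not the project's actual code: `SAMPLE_STOPWORDS` is a tiny hypothetical stand-in for gensim's `STOPWORDS`, and `normalize` defaults to the identity where the notebook would lemmatize and stem.

```python
import re

# Hypothetical, tiny stop word list; the notebook imports gensim's much larger STOPWORDS.
SAMPLE_STOPWORDS = frozenset({"the", "and", "for", "this", "under", "may", "must", "within"})


def preprocess_sketch(text, stopwords=SAMPLE_STOPWORDS, min_len=3, normalize=lambda w: w):
    """Tokenize, drop short words and stop words, then normalize each token."""
    # 1. Tokenize: lowercase and keep runs of letters only
    tokens = re.findall(r"[a-z]+", text.lower())
    # 2. Remove words with fewer than min_len characters
    tokens = [t for t in tokens if len(t) >= min_len]
    # 3. Remove stop words
    tokens = [t for t in tokens if t not in stopwords]
    # 4. Normalize (identity by default; a lemmatizer plus stemmer would go here)
    return [normalize(t) for t in tokens]


sample = "Employee(s) must apply for EDRP within four (4) months of appointment."
print(preprocess_sketch(sample))
```

Passing something like `normalize=stemmer.stem` would plug a stemmer into step 4 without changing the rest of the pipeline.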

**Test the functions on one row of text**

In [9]:
stemmer = SnowballStemmer('english')

In [10]:
#label-based lookup: the row whose index label is 5000
text_sample = df.loc[5000, 'job_description']

print('original text: ')
words = text_sample.split(' ')
print(words)
print('\n\n tokenized and lemmatized text: ')
print(preprocess(text_sample))

original text: 
['Newly', 'appointed', 'employee(s)', 'or', 'employee(s)', 'converted', 'to', 'permanent', 'status,', 'selected', 'under', 'this', 'announcement,', 'may', 'be', 'eligible', 'to', 'apply', 'for', 'an', 'award', 'up', 'to', 'the', 'maximum', 'limitation', 'under', 'the', 'provisions', 'of', 'the', 'Education', 'Debt', 'Reduction', 'Program', '(EDRP).', 'Funding,', 'on', 'the', 'final', 'award', 'amount,', 'is', 'contingent', 'on', 'the', 'availability', 'of', 'EDRP', 'funds.', 'Employee(s)', 'must', 'apply', 'for', 'EDRP', 'within', 'four', '(4)', 'months', 'of', 'appointment', 'or', 'conversion.']


 tokenized and lemmatized text: 
['newli', 'appoint', 'convert', 'perman', 'status', 'select', 'announc', 'elig', 'appli', 'award', 'maximum', 'limit', 'provis', 'educ', 'debt', 'reduct', 'program', 'edrp', 'fund', 'final', 'award', 'conting', 'avail', 'edrp', 'fund', 'appli', 'edrp', 'month', 'appoint', 'convers']


**Apply functions to job_description**

In [11]:
#apply function and display first 5 rows
processed_text = df['job_description'].map(preprocess)
processed_text[:5]

0    [offer, rang, opportun, experi, attorney, work...
1    [special, agent, militari, enforc, background,...
2    [special, agent, expertis, educ, gift, relat, ...
3    [special, agent, account, financi, expertis, e...
4    [special, agent, stem, background, provid, ski...
Name: job_description, dtype: object

---

## Bag of Words

In [12]:
#I'll use bag of words to extract features from text for use in modeling

In [26]:
dictionary = gensim.corpora.Dictionary(processed_text)

In [27]:
#check the length before I filter out the extremes
len(dictionary)

56095

In [28]:
dictionary.filter_extremes(no_below=5, no_above=0.5, keep_n=100000)
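
The filtering call above prunes the vocabulary by document frequency. A minimal sketch of that idea, assuming the usual reading of the three parameters (`no_below` is a minimum number of documents, `no_above` is a maximum fraction of documents, `keep_n` caps the vocabulary size); `filter_extremes_sketch` is a hypothetical helper and not gensim's implementation:

```python
from collections import Counter


def filter_extremes_sketch(docs, no_below=5, no_above=0.5, keep_n=100000):
    """Return the set of tokens that survive document-frequency filtering."""
    # Document frequency: number of documents that contain each token at least once
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    kept = [t for t, df in doc_freq.items() if no_below <= df <= max_docs]
    # Of the survivors, keep only the keep_n most frequent tokens
    kept.sort(key=lambda t: (-doc_freq[t], t))
    return set(kept[:keep_n])


docs = [
    ["manag", "custom", "store"],
    ["manag", "patient", "nurs"],
    ["manag", "custom", "sale"],
    ["manag", "nurs", "care"],
]
# "manag" is in every document (too common); "store", "patient", "sale", "care" appear once (too rare)
print(sorted(filter_extremes_sketch(docs, no_below=2, no_above=0.5)))
```

Very common tokens carry little topic signal and very rare ones add noise, which is consistent with the vocabulary shrinking from 56095 to 11379 entries in this notebook.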

In [29]:
#check length after filtering out extremes
len(dictionary)

11379

In [30]:
#doc2bow: counts the number of occurrences of each distinct word,
#converts the word to its integer word id and returns the result as a sparse vector

bow2doc_corpus = [dictionary.doc2bow(text) for text in processed_text]
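
To make the sparse vector format concrete, here is a small standard-library sketch of the same conversion. `doc2bow_sketch` and the toy `token2id` mapping are hypothetical; in the notebook the mapping is built by gensim's `Dictionary`.

```python
from collections import Counter


def doc2bow_sketch(tokens, token2id):
    """Convert a token list to a sorted list of (token_id, count) pairs."""
    # Tokens missing from the vocabulary (for example, filtered-out words) are skipped
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


# Hypothetical toy vocabulary
token2id = {"award": 0, "edrp": 1, "fund": 2}
tokens = ["award", "edrp", "fund", "award", "edrp", "month", "edrp"]
print(doc2bow_sketch(tokens, token2id))
```

Only ids with a nonzero count are stored, which keeps each document small even with a vocabulary of thousands of words.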

In [31]:
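#let's take a look
#note: bow2doc_corpus is indexed by position, so after dropping the rows with missing descriptions,
#position 5000 is not necessarily the row labeled 5000 that was sampled in the text cleaning test
bow_doc_5000 = bow2doc_corpus[5000]

for word_id, count in bow_doc_5000:
    print("Word {} (\"{}\") appears {} time.".format(word_id, dictionary[word_id], count))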

Word 46 ("militari") appears 2 time.
Word 85 ("area") appears 4 time.
Word 97 ("follow") appears 1 time.
Word 163 ("local") appears 1 time.
Word 186 ("announc") appears 2 time.
Word 189 ("day") appears 1 time.
Word 197 ("open") appears 1 time.
Word 248 ("includ") appears 1 time.
Word 278 ("defin") appears 1 time.
Word 301 ("close") appears 1 time.
Word 308 ("consider") appears 2 time.
Word 309 ("date") appears 1 time.
Word 403 ("vacanc") appears 2 time.
Word 620 ("claim") appears 1 time.
Word 624 ("prefer") appears 1 time.
Word 634 ("member") appears 1 time.
Word 641 ("commut") appears 1 time.
Word 673 ("separ") appears 1 time.
Word 693 ("armi") appears 1 time.
Word 876 ("counti") appears 1 time.
Word 1154 ("move") appears 1 time.
Word 1620 ("citizen") appears 1 time.
Word 2099 ("involuntarili") appears 1 time.
Word 2104 ("spous") appears 1 time.
Word 2137 ("monterey") appears 2 time.
Word 2726 ("garrison") appears 1 time.
Word 3439 ("cruz") appears 1 time.
Word 3440 ("presidio") appea

---

## LDA model with Bag of Words

In [32]:
lda_model = gensim.models.LdaMulticore(bow2doc_corpus, 
                                       num_topics=6, 
                                       id2word=dictionary, 
                                       passes=10, 
                                       workers=4)


In [33]:
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

Topic: 0 
Words: 0.016*"custom" + 0.012*"product" + 0.010*"safeti" + 0.010*"food" + 0.010*"manag" + 0.010*"equip" + 0.009*"maintain" + 0.009*"time" + 0.008*"abil" + 0.008*"perform"
Topic: 1 
Words: 0.015*"manag" + 0.012*"develop" + 0.012*"sale" + 0.011*"project" + 0.011*"custom" + 0.010*"busi" + 0.010*"market" + 0.009*"product" + 0.009*"skill" + 0.009*"team"
Topic: 2 
Words: 0.034*"care" + 0.028*"patient" + 0.024*"nurs" + 0.015*"medic" + 0.015*"health" + 0.012*"provid" + 0.011*"clinic" + 0.009*"home" + 0.008*"assist" + 0.007*"resid"
Topic: 3 
Words: 0.036*"store" + 0.031*"manag" + 0.019*"assist" + 0.018*"custom" + 0.015*"abil" + 0.013*"perform" + 0.011*"associ" + 0.009*"duti" + 0.008*"product" + 0.008*"mainten"
Topic: 4 
Words: 0.019*"manag" + 0.012*"account" + 0.010*"skill" + 0.009*"includ" + 0.008*"offic" + 0.008*"abil" + 0.008*"process" + 0.007*"report" + 0.007*"year" + 0.007*"assist"
Topic: 5 
Words: 0.014*"benefit" + 0.011*"opportun" + 0.011*"compani" + 0.009*"train" + 0.009*"care

**Pickled LDA model results**

In [24]:
# pickle.dump(lda_model, open('lda_model.pkl', 'wb'))
#use a context manager so the file handle is closed after loading
with open('lda_model.pkl', 'rb') as f:
    pickled_lda = pickle.load(f)

In [25]:
for idx, topic in pickled_lda.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

Topic: 0 
Words: 0.031*"care" + 0.025*"patient" + 0.021*"nurs" + 0.016*"health" + 0.015*"medic" + 0.013*"provid" + 0.011*"clinic" + 0.008*"assist" + 0.008*"home" + 0.007*"manag"
Topic: 1 
Words: 0.016*"busi" + 0.016*"manag" + 0.013*"develop" + 0.012*"account" + 0.009*"team" + 0.008*"year" + 0.008*"client" + 0.008*"compani" + 0.008*"skill" + 0.008*"provid"
Topic: 2 
Words: 0.020*"custom" + 0.017*"sale" + 0.010*"manag" + 0.010*"time" + 0.010*"compani" + 0.009*"product" + 0.008*"opportun" + 0.007*"team" + 0.007*"train" + 0.007*"includ"
Topic: 3 
Words: 0.014*"locat" + 0.014*"applic" + 0.010*"test" + 0.009*"appli" + 0.009*"href" + 0.008*"area" + 0.008*"attr" + 0.008*"offic" + 0.007*"famili" + 0.007*"hire"
Topic: 4 
Words: 0.022*"manag" + 0.018*"store" + 0.016*"custom" + 0.014*"abil" + 0.013*"perform" + 0.013*"assist" + 0.011*"product" + 0.009*"ensur" + 0.008*"duti" + 0.008*"maintain"
Topic: 5 
Words: 0.019*"project" + 0.015*"manag" + 0.011*"skill" + 0.009*"applic" + 0.009*"engin" + 0.009*"
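
Topic numbers are not stable between training runs: the healthcare topic is Topic 2 in the freshly trained model and Topic 0 in the pickled one. One way to match topics across runs is to compare their top words. The sketch below is an illustration rather than part of the original analysis; `top_words` and `jaccard` are hypothetical helpers, and the two strings are copied from the printed output above.

```python
import re


def top_words(topic_string):
    """Extract the words from a string like '0.034*"care" + 0.028*"patient"'."""
    return re.findall(r'\*"([^"]+)"', topic_string)


def jaccard(words_a, words_b):
    """Share of words common to both lists, relative to all words in either."""
    a, b = set(words_a), set(words_b)
    return len(a & b) / len(a | b)


# Topic 2 from the freshly trained model
fresh_topic_2 = (
    '0.034*"care" + 0.028*"patient" + 0.024*"nurs" + 0.015*"medic" + 0.015*"health" + '
    '0.012*"provid" + 0.011*"clinic" + 0.009*"home" + 0.008*"assist" + 0.007*"resid"'
)
# Topic 0 from the pickled model
pickled_topic_0 = (
    '0.031*"care" + 0.025*"patient" + 0.021*"nurs" + 0.016*"health" + 0.015*"medic" + '
    '0.013*"provid" + 0.011*"clinic" + 0.008*"assist" + 0.008*"home" + 0.007*"manag"'
)

overlap = jaccard(top_words(fresh_topic_2), top_words(pickled_topic_0))
print(round(overlap, 3))
```

A score near 1 suggests the two topics describe the same theme even though their indices differ.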

In [None]:
#next steps
#add dicejobs data
#find optimal lda model parameters to get a good separation of topics
#figure out the topics
#use word probability
#figure out what to do with job titles: tfidf and k-means clustering in order to match lda model topics?

In [28]:
#!pip install pyLDAvis

In [21]:
#visualize the topics in order to better label 
%matplotlib inline
import pyLDAvis
import pyLDAvis.gensim
vis = pyLDAvis.gensim.prepare(topic_model=lda_model, corpus=bow2doc_corpus, dictionary=dictionary)
pyLDAvis.enable_notebook()
pyLDAvis.display(vis)



---

## LDA model with tf-idf

In [None]:
#from gensim import corpora, models
#tfidf = models.TfidfModel(bow2doc_corpus)
#corpus_tfidf = tfidf[bow2doc_corpus]
#from pprint import pprint
#for doc in corpus_tfidf:
#    pprint(doc)
#    break
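
For intuition about what the tf-idf step would feed into LDA, here is a standard-library sketch of the reweighting. It assumes the weighting I understand to be gensim's default (raw term count times log2(N / document frequency), followed by scaling each document to unit length); `tfidf_sketch` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def tfidf_sketch(bow_corpus):
    """Reweight [(token_id, count), ...] documents by tf-idf and scale each to unit length."""
    n_docs = len(bow_corpus)
    # Document frequency: in how many documents each token id appears
    doc_freq = Counter(token_id for doc in bow_corpus for token_id, _ in doc)
    weighted_corpus = []
    for doc in bow_corpus:
        weights = [(tid, count * math.log2(n_docs / doc_freq[tid])) for tid, count in doc]
        norm = math.sqrt(sum(w * w for _, w in weights))
        # Tokens present in every document get weight 0 and are dropped
        weighted_corpus.append([(tid, w / norm) for tid, w in weights if w > 0] if norm else [])
    return weighted_corpus


toy_corpus = [[(0, 2), (1, 1), (3, 1)], [(0, 1), (2, 3)], [(0, 1), (1, 2)]]
for doc in tfidf_sketch(toy_corpus):
    print(doc)
```

Token 0 appears in every toy document, so it drops out entirely, while rarer tokens gain weight relative to their raw counts.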

In [None]:
#lda_model_tfidf = gensim.models.LdaMulticore(corpus_tfidf,
#                                             num_topics=25,
#                                             id2word=dictionary,
#                                             passes=10,
#                                             workers=4)
#for idx, topic in lda_model_tfidf.print_topics(-1):
#    print('Topic: {} Word: {}'.format(idx, topic))