# DreamJobber

**Tech Edition**

---

**Process**
1. Clean text
2. Bag of Words
3. LDA model (Latent Dirichlet allocation)
4. Fine tune LDA model
5. Define Topics from LDA model
6. Create df of document probabilities
7. Classification model

---

**Import Necessary Libraries**

In [1]:
import pandas as pd
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
import numpy as np
import nltk
from functions import *
import pickle

#lda model evaluation with coherence
from gensim.models.coherencemodel import CoherenceModel
from gensim.models.ldamodel import LdaModel
from gensim.corpora.dictionary import Dictionary

In [2]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /Users/admin/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

---

**Load Data**

In [3]:
df_1 = pd.read_json('dice_jobs.json', lines=True)
df_1.head()

Unnamed: 0,job_description,job_title
0,,UI Lead/Architect
1,,Web Application Architect
2,,Senior DataStage Developer
3,,Hadoop Administrator
4,,UX Visual Designer


In [4]:
df_2 = pd.read_json('dice_jobs_more.json', lines=True)
df_2.head()

Unnamed: 0,job_description,job_title
0,[TEKsystems is seeking an IT Specialist for a ...,IT Specialist-Direct Placement
1,[Job Title: Java Developer\nJob Location : Phi...,Java Developer
2,[RESPONSIBILITIES:\nKforce has a client in the...,Senior Full Stack Developer
3,[Description:\nContributes to and supports a v...,Quallity Assurance Technician
4,[RESPONSIBILITIES:\nKforce has a client that i...,Project Manager


In [5]:
df = pd.concat([df_1, df_2], ignore_index=True, sort=True)

In [6]:
df.head()

Unnamed: 0,job_description,job_title
0,,UI Lead/Architect
1,,Web Application Architect
2,,Senior DataStage Developer
3,,Hadoop Administrator
4,,UX Visual Designer


In [7]:
#check for missing values
df.isna().sum()

job_description    1283
job_title           243
dtype: int64

In [8]:
#looks like there are rows that have no job description
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7075 entries, 0 to 7074
Data columns (total 2 columns):
job_description    5792 non-null object
job_title          6832 non-null object
dtypes: object(2)
memory usage: 110.6+ KB


In [9]:
#drop rows with no job descriptions
df = df.dropna()

In [10]:
#sanity check, looks good
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5792 entries, 9 to 7074
Data columns (total 2 columns):
job_description    5792 non-null object
job_title          5792 non-null object
dtypes: object(2)
memory usage: 135.8+ KB


In [11]:
df.head()

Unnamed: 0,job_description,job_title
9,[2+ years of experience developing Java / J2EE...,Java Developer (Sign-On BONUS!)
10,"[Passion for technology and learning, a natura...","iOS Developer - Mobile Rate: Open, Duration: 1..."
17,[We enjoy approved IT vendor status with sever...,"EUC Engineer, Rate-Open, Duration: 18 Months"
18,[We enjoy approved IT vendor status with sever...,"Software Developer - RPG, Rate-Open, Duration:..."
19,[SME in Linux Operating system with Strong Vir...,Sr. Linux Consultant with Weblogic exp


In [12]:
def remove_brackets(list1):
    return str(list1).replace('[','').replace(']','')

In [13]:
df['job_description'] = df['job_description'].map(remove_brackets)

In [14]:
df.head()

Unnamed: 0,job_description,job_title
9,'2+ years of experience developing Java / J2EE...,Java Developer (Sign-On BONUS!)
10,"'Passion for technology and learning, a natura...","iOS Developer - Mobile Rate: Open, Duration: 1..."
17,'We enjoy approved IT vendor status with sever...,"EUC Engineer, Rate-Open, Duration: 18 Months"
18,'We enjoy approved IT vendor status with sever...,"Software Developer - RPG, Rate-Open, Duration:..."
19,'SME in Linux Operating system with Strong Vir...,Sr. Linux Consultant with Weblogic exp
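
The leading `'` in the output above suggests that calling `str()` on a list keeps each element's quotes and turns newlines into literal `\n` sequences. A possible alternative, assuming each `job_description` value is a list of strings, is to join the elements directly. The name `join_description` is hypothetical and not part of the notebook:

```python
def join_description(parts):
    """Join a list of text fragments into one string.

    Hypothetical alternative to remove_brackets; assumes `parts` is a
    list (or tuple) of strings. Any other value is passed through str().
    """
    if isinstance(parts, (list, tuple)):
        return " ".join(str(part) for part in parts)
    return str(parts)


sample = ["We are hiring.", "What we are looking for:\nA developer."]
print(join_description(sample))
```

If this were used, it would replace `remove_brackets` in the `map` call; quotes and brackets that belong to the description text itself would then be preserved.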


---

## Text Cleaning

1. Tokenize
2. Remove words with fewer than 3 characters
3. Remove stop words
4. Normalize words (Lemmatize and Stem)
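
The `preprocess` helper used below is imported from `functions` and is not shown in this notebook. As a rough sketch of the four steps above, using only the standard library: the stop word list here is a tiny illustrative stand-in for gensim's `STOPWORDS`, and the `normalize` argument is a placeholder for whatever lemmatize-and-stem step the real helper applies. All names in this block are assumptions.

```python
import re

# Tiny illustrative stop word list (assumption); the notebook imports
# gensim's much larger STOPWORDS set.
SAMPLE_STOPWORDS = frozenset({"the", "and", "are", "for", "who", "can", "with"})


def preprocess_sketch(text, stopwords=SAMPLE_STOPWORDS, normalize=None, min_len=3):
    """Tokenize, drop short tokens and stop words, then normalize each token."""
    # 1. Tokenize: lowercase and keep alphabetic runs only.
    tokens = re.findall(r"[a-z]+", text.lower())
    result = []
    for token in tokens:
        # 2. Remove words with fewer than min_len characters.
        if len(token) < min_len:
            continue
        # 3. Remove stop words.
        if token in stopwords:
            continue
        # 4. Normalize (lemmatize/stem) if a normalizer was supplied.
        result.append(normalize(token) if normalize else token)
    return result


print(preprocess_sketch("We are looking for a person who can work on Java APIs"))
# ['looking', 'person', 'work', 'java', 'apis']
```

With the libraries imported above, `normalize` could plausibly be something like `lambda t: stemmer.stem(WordNetLemmatizer().lemmatize(t, pos='v'))`, but that composition is a guess at what `functions.preprocess` does.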

**Test the functions on one row of text**

In [15]:
stemmer = SnowballStemmer('english')

In [16]:
text_sample = df.loc[1300, 'job_description']

print('original text: ')
print(text_sample.split(' '))
print('\n\n tokenized and lemmatized text: ')
print(preprocess(text_sample))

original text: 
["'We", 'are', 'looking', 'for', 'a', 'person', 'who', 'can', 'work', 'directly', 'on', 'a', 'W2.', 'This', 'is', 'a', 'full', 'time', 'role', 'so', 'please', 'only', 'apply', 'if', 'you', 'do', 'not', 'require', "sponsorship.',", "'What", 'we', 'are', 'looking', 'for:\\nWe', 'are', 'a', 'successful', 'and', 'fast-growing', 'financial', 'services', 'company.', 'Due', 'to', 'company', 'growth,', 'we', 'are', 'looking', 'for', 'a', 'Web', 'Services', 'Developer', 'to', 'join', 'our', 'Web', 'Applications', 'Team', 'at', 'our', 'Fort', 'Mill', 'office.', 'Our', 'Web', 'Applications', 'Team', 'is', 'a', 'fast-paced', 'and', 'high-energy', 'team', 'that', 'delivers', 'MuleSoft', 'Anypoint', 'APIs', 'to', 'our', 'enterprise', 'consumers.\\n*', 'You', 'should', 'be', 'able', 'to', 'perform', 'analysis', 'and', 'demonstrate', 'ability', 'to', 'learn', 'programming', 'languages', 'and', 'techniques.\\n*', 'You', 'should', 'be', 'self-motivated;', 'detail', 'oriented', 'and', 'ha

**Apply functions to job_description**

In [17]:
#apply function and display first 5 rows
processed_text = df['job_description'].map(preprocess)
processed_text[:5]

9                         [year, develop, java, applic]
10    [passion, technolog, learn, natur, curios, lov...
17    [enjoy, approv, vendor, status, lead, compani,...
18    [enjoy, approv, vendor, status, lead, compani,...
19    [linux, oper, strong, virtual, knowledg, vmwar...
Name: job_description, dtype: object

---

## Bag of Words

In [18]:
#I'll use bag of words to extract features from text for use in modeling

In [19]:
dictionary = gensim.corpora.Dictionary(processed_text)

In [20]:
#check the length before I filter out the extremes
len(dictionary)

10566

In [21]:
dictionary.filter_extremes(no_below=10, no_above=0.5, keep_n=100000)

In [22]:
#check length after filtering out extremes
len(dictionary)

2960
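
To make the drop from 10566 to 2960 tokens easier to reason about, here is a plain-Python approximation of the filtering rule: keep a token only if it appears in at least `no_below` documents and in no more than a `no_above` fraction of them, then keep at most the `keep_n` most frequent survivors. This is a sketch of the documented behaviour, not gensim's implementation, and edge-case rounding may differ.

```python
from collections import Counter


def filter_extremes_sketch(docs, no_below=10, no_above=0.5, keep_n=100000):
    """Return the set of tokens that survive document-frequency filtering."""
    num_docs = len(docs)
    if num_docs == 0:
        return set()
    # Document frequency: in how many documents does each token appear?
    doc_freq = Counter(token for doc in docs for token in set(doc))
    kept = [
        token
        for token, freq in doc_freq.items()
        if freq >= no_below and freq / num_docs <= no_above
    ]
    # Keep only the keep_n most frequent survivors (ties broken alphabetically).
    kept.sort(key=lambda token: (-doc_freq[token], token))
    return set(kept[:keep_n])


docs = [["java", "developer"], ["java", "network"], ["java", "network"], ["java", "cisco"]]
print(filter_extremes_sketch(docs, no_below=2, no_above=0.5))
# {'network'}
```

In the toy corpus, `java` is dropped for being in every document, while `developer` and `cisco` are dropped for appearing only once.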

In [23]:
#doc2bow: counts the number of occurrences of each distinct word,
#converts the word to its integer word id and returns the result as a sparse vector

bow2doc_corpus = [dictionary.doc2bow(text) for text in processed_text]
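
As a small illustration of the sparse vector described in the comment above, the following standard-library sketch mimics the shape of the result: a sorted list of `(token_id, count)` pairs, with tokens missing from the dictionary ignored. The `token2id` mapping and function name are made up for the example.

```python
from collections import Counter


def doc2bow_sketch(tokens, token2id):
    """Convert a token list to a sorted list of (token_id, count) pairs."""
    # Tokens that are not in the dictionary (e.g. filtered out earlier) are skipped.
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


token2id = {"java": 0, "develop": 1, "applic": 2}
print(doc2bow_sketch(["java", "develop", "java", "year"], token2id))
# [(0, 2), (1, 1)]
```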

In [24]:
##let's take a look
#bow_doc_5000 = bow2doc_corpus[5000]

#for i in range(len(bow_doc_5000)):
 #   print("Word {} (\"{}\") appears {} time.".format(bow_doc_5000[i][0], 
  #                                                   dictionary[bow_doc_5000[i][0]], 
   #                                                  bow_doc_5000[i][1]))

---

## Find optimal number of topics

In [None]:
model_list, coherence_values = compute_coherence_values(dictionary=dictionary, corpus=bow2doc_corpus,
                                                        texts=processed_text, start=5, limit=40, step=5)

In [None]:
import matplotlib.pyplot as plt
start, limit, step = 5, 40, 5  #same values passed to compute_coherence_values
x = range(start, limit, step)
plt.plot(x, coherence_values)
plt.xlabel("Number of Topics")
plt.ylabel("Coherence score")
plt.show()
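
Besides reading the plot, the best-scoring topic count can be picked programmatically. This sketch assumes `coherence_values` holds one score per candidate in `range(start, limit, step)`, in the same order, which is what the plotting code implies. The helper name and the scores below are invented for illustration and are not results from this notebook.

```python
def best_num_topics(coherence_values, start=5, limit=40, step=5):
    """Return (num_topics, score) for the highest coherence score."""
    candidates = list(range(start, limit, step))
    if len(candidates) != len(coherence_values):
        raise ValueError("expected one coherence value per candidate topic count")
    return max(zip(candidates, coherence_values), key=lambda pair: pair[1])


# Made-up scores purely for illustration.
example_scores = [0.41, 0.47, 0.45, 0.44, 0.42, 0.40, 0.39]
print(best_num_topics(example_scores))
# (10, 0.47)
```

The highest coherence score is only a starting point; topics still need to be inspected by hand to check that they are interpretable.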

---

## LDA model with Bag of Words

In [25]:
lda_model = gensim.models.LdaMulticore(bow2doc_corpus, 
                                       num_topics=9, 
                                       id2word=dictionary, 
                                       passes=75, 
                                       workers=4,
                                      chunksize=100)


In [26]:
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

Topic: 0 
Words: 0.068*"test" + 0.030*"status" + 0.028*"autom" + 0.023*"engin" + 0.018*"ident" + 0.017*"nation" + 0.016*"orient" + 0.016*"qualifi" + 0.016*"employ" + 0.016*"regard"
Topic: 1 
Words: 0.085*"network" + 0.068*"secur" + 0.019*"cisco" + 0.016*"engin" + 0.013*"includ" + 0.013*"maintain" + 0.012*"system" + 0.012*"switch" + 0.012*"infrastructur" + 0.011*"inform"
Topic: 2 
Words: 0.031*"data" + 0.021*"busi" + 0.017*"process" + 0.013*"system" + 0.013*"report" + 0.013*"document" + 0.012*"test" + 0.012*"design" + 0.010*"implement" + 0.009*"function"
Topic: 3 
Words: 0.019*"issu" + 0.018*"technic" + 0.017*"problem" + 0.015*"user" + 0.015*"servic" + 0.015*"hardwar" + 0.014*"window" + 0.013*"system" + 0.012*"microsoft" + 0.012*"help"
Topic: 4 
Words: 0.033*"leader" + 0.031*"world" + 0.029*"servic" + 0.021*"power" + 0.020*"help" + 0.020*"peopl" + 0.019*"partner" + 0.019*"talent" + 0.018*"staff" + 0.018*"chang"
Topic: 5 
Words: 0.018*"code" + 0.018*"design" + 0.013*"solut" + 0.013*"fram

**Pickled LDA model results**

In [27]:
with open('lda_model.pkl', 'wb') as f:
    pickle.dump(lda_model, f)
with open('lda_model.pkl', 'rb') as f:
    pickled_lda = pickle.load(f)

In [28]:
for idx, topic in pickled_lda.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

Topic: 0 
Words: 0.068*"test" + 0.030*"status" + 0.028*"autom" + 0.023*"engin" + 0.018*"ident" + 0.017*"nation" + 0.016*"orient" + 0.016*"qualifi" + 0.016*"employ" + 0.016*"regard"
Topic: 1 
Words: 0.085*"network" + 0.068*"secur" + 0.019*"cisco" + 0.016*"engin" + 0.013*"includ" + 0.013*"maintain" + 0.012*"system" + 0.012*"switch" + 0.012*"infrastructur" + 0.011*"inform"
Topic: 2 
Words: 0.031*"data" + 0.021*"busi" + 0.017*"process" + 0.013*"system" + 0.013*"report" + 0.013*"document" + 0.012*"test" + 0.012*"design" + 0.010*"implement" + 0.009*"function"
Topic: 3 
Words: 0.019*"issu" + 0.018*"technic" + 0.017*"problem" + 0.015*"user" + 0.015*"servic" + 0.015*"hardwar" + 0.014*"window" + 0.013*"system" + 0.012*"microsoft" + 0.012*"help"
Topic: 4 
Words: 0.033*"leader" + 0.031*"world" + 0.029*"servic" + 0.021*"power" + 0.020*"help" + 0.020*"peopl" + 0.019*"partner" + 0.019*"talent" + 0.018*"staff" + 0.018*"chang"
Topic: 5 
Words: 0.018*"code" + 0.018*"design" + 0.013*"solut" + 0.013*"fram

---

**Coherence Score**

In [None]:
# Compute Coherence Score using c_v
coherence_model_lda = CoherenceModel(model=lda_model, texts=processed_text, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)

In [None]:
# Compute Coherence Score using UMass
coherence_model_lda = CoherenceModel(model=lda_model, texts=processed_text, dictionary=dictionary, coherence="u_mass")
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)

In [None]:
#!pip install pyLDAvis

In [None]:
##visualize the topics in order to better label 
#%matplotlib inline
#import pyLDAvis
#import pyLDAvis.gensim
#vis = pyLDAvis.gensim.prepare(topic_model=lda_model, corpus=bow2doc_corpus, dictionary=dictionary)
#pyLDAvis.enable_notebook()
#pyLDAvis.display(vis)

---

**Create df of topic scores for each job title**

In [None]:
topic_vecs = []
for bow in bow2doc_corpus:
    top_topics = lda_model.get_document_topics(bow, minimum_probability=0.0)
    #one probability per topic; the model has 9 topics, so a hard-coded range(10) would fail
    topic_vec = [top_topics[t][1] for t in range(lda_model.num_topics)]
    topic_vecs.append(topic_vec)
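
`get_document_topics` returns a list of `(topic_id, probability)` pairs. Reading probabilities by list position works only if every topic is present and in order. A slightly more defensive sketch places each probability by its topic id, so a missing topic simply stays at 0.0. The helper name is hypothetical.

```python
def dense_topic_vector(sparse_topics, num_topics):
    """Expand [(topic_id, probability), ...] into a fixed-length list."""
    vec = [0.0] * num_topics
    for topic_id, prob in sparse_topics:
        vec[topic_id] = float(prob)
    return vec


print(dense_topic_vector([(0, 0.7), (3, 0.3)], num_topics=5))
# [0.7, 0.0, 0.0, 0.3, 0.0]
```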

In [None]:
df_topic_vecs = pd.DataFrame(topic_vecs)
df_topic_vecs.head(10)

In [None]:
#name columns for df
#col_names=['']
#topics_df.columns = col_names
#topics_df.head()

In [None]:
#next step merge with original df of job titles and job descriptions
#pickle the merged df

---

In [None]:
#next-steps
#add dicejobs data
#find optimal lda model parameters to get a good separation for topics
#figure out the topics 

---

## LDA model with tf-idf

In [None]:
#from gensim import corpora, models
#tfidf = models.TfidfModel(bow2doc_corpus)
#corpus_tfidf = tfidf[bow2doc_corpus]
#from pprint import pprint
#for doc in corpus_tfidf:
#    pprint(doc)
#    break

In [None]:
#lda_model_tfidf = gensim.models.LdaMulticore(corpus_tfidf, 
 #                                            num_topics=25, 
  #                                           id2word=dictionary, 
   #                                          passes=10, 
    #                                         workers=4)
#for idx, topic in lda_model_tfidf.print_topics(-1):
#    print('Topic: {} Word: {}'.format(idx, topic))

---

## Nearest Neighbors

In [None]:
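#df keeps its pre-dropna labels (9, 10, 17, ...) but df_topic_vecs is numbered 0..n-1, so reset df's index to align rows
df_final = pd.merge(df.reset_index(drop=True), df_topic_vecs, left_index=True, right_index=True)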

In [None]:
df_final.head()

In [None]:
from sklearn.neighbors import NearestNeighbors

In [None]:
topics = df_final.drop(['job_description', 'job_title'], axis=1)
job = df_final['job_title']

In [None]:
nearest_neighbor = NearestNeighbors(n_neighbors=50, metric='cosine')
nearest_neighbor.fit(topics)
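
To make the `cosine` metric concrete, here is a brute-force, standard-library sketch of the lookup the fitted model is meant to support: rank job titles by the cosine distance between their topic vectors and a query vector. The topic vectors below are invented for illustration, and the function names are hypothetical. scikit-learn's `NearestNeighbors` does the same kind of ranking far more efficiently.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # convention chosen here for zero vectors (assumption)
    return 1.0 - dot / (norm_a * norm_b)


def nearest_titles(query_vec, topic_rows, titles, k=3):
    """Return the k titles whose topic vectors are closest to query_vec."""
    ranked = sorted(
        zip(titles, topic_rows),
        key=lambda pair: cosine_distance(query_vec, pair[1]),
    )
    return [title for title, _ in ranked[:k]]


titles = ["Java Developer", "Network Engineer", "Full Stack Developer"]
rows = [[0.9, 0.1, 0.0], [0.0, 0.1, 0.9], [0.8, 0.2, 0.0]]
print(nearest_titles([1.0, 0.0, 0.0], rows, titles, k=2))
# ['Java Developer', 'Full Stack Developer']
```

Cosine distance compares the direction of two topic mixtures rather than their magnitude, which suits probability vectors like these.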