# DreamJobber

**Tech Edition**

---

**Process**
1. Clean text
2. Bag of Words
3. LDA model (Latent Dirichlet allocation)
4. Fine tune LDA model
5. Define Topics from LDA model
6. Create df of document probabilities
7. Nearest Neighbors Model

---

**Import Necessary Libraries**

In [1]:
import pandas as pd
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
import numpy as np
import nltk
from dreamjobber_web.recommend import input_user_scores, make_recommendation
from dreamjobber_web.recommend import collect_score_and_recommend
from dreamjobber_web.recommend import collect_feedback, show_to_user
from functions import lemmatize_stem, preprocess, remove_brackets
from functions import remove_punctuation, remove_stop_words
from lda import show_topics_sentences
import pickle

#lda model evaluation with coherence
from gensim.models.coherencemodel import CoherenceModel
from gensim.models.ldamodel import LdaModel
from gensim.corpora.dictionary import Dictionary

#unsupervised learning model
from sklearn.neighbors import NearestNeighbors

In [2]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /Users/admin/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

---

**Load Data**

In [3]:
df_1 = pd.read_json('data/dice_jobs_1.json', lines=True)
df_2 = pd.read_json('data/dice_jobs_2.json', lines=True)
df_3 = pd.read_json('data/dice_jobs_3.json', lines=True)
df_4 = pd.read_json('data/dice_jobs_4.json', lines=True)
df_5 = pd.read_json('data/dice_jobs_5.json', lines=True)
df_6 = pd.read_json('data/dice_jobs_6.json', lines=True)

In [4]:
#concat into one df
df = pd.concat([df_1, df_2, df_3, df_4, df_5, df_6], 
               ignore_index=True, sort=True)

In [5]:
df.head()  

Unnamed: 0,job_description,job_title
0,,UI Lead/Architect
1,,Web Application Architect
2,,Senior DataStage Developer
3,,Hadoop Administrator
4,,UX Visual Designer


In [6]:
#check for missing values
df.isna().sum()

job_description    6524
job_title          5484
dtype: int64

In [7]:
#looks like there are rows that have no job description
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27016 entries, 0 to 27015
Data columns (total 2 columns):
job_description    20492 non-null object
job_title          21532 non-null object
dtypes: object(2)
memory usage: 422.2+ KB


In [8]:
#drop rows with no job descriptions
df = df.dropna()

In [9]:
#sanity check, looks good
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20492 entries, 9 to 27015
Data columns (total 2 columns):
job_description    20492 non-null object
job_title          20492 non-null object
dtypes: object(2)
memory usage: 480.3+ KB


In [10]:
df.head()

Unnamed: 0,job_description,job_title
9,[2+ years of experience developing Java / J2EE...,Java Developer (Sign-On BONUS!)
10,"[Passion for technology and learning, a natura...","iOS Developer - Mobile Rate: Open, Duration: 1..."
17,[We enjoy approved IT vendor status with sever...,"EUC Engineer, Rate-Open, Duration: 18 Months"
18,[We enjoy approved IT vendor status with sever...,"Software Developer - RPG, Rate-Open, Duration:..."
19,[SME in Linux Operating system with Strong Vir...,Sr. Linux Consultant with Weblogic exp


In [11]:
#need to remove brackets from job_description
df['job_description'] = df['job_description'].map(remove_brackets)
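
`remove_brackets` comes from the project's `functions` module, which is not shown in this notebook. The following is only a minimal sketch of what such a helper might do, assuming each description is a list of strings (or its string form) and that only the outer square brackets should be dropped. The name `remove_brackets_sketch` is hypothetical.

```python
def remove_brackets_sketch(value):
    """Hypothetical stand-in for functions.remove_brackets (assumption)."""
    text = str(value).strip()
    # Drop one leading "[" and one trailing "]" if they are present.
    if text.startswith('['):
        text = text[1:]
    if text.endswith(']'):
        text = text[:-1]
    return text


print(remove_brackets_sketch(['2+ years of java', 'strong sql']))
```

The quotes around each list element survive in this sketch, which matches the leading `'` visible in the cleaned descriptions further down.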

In [12]:
#remove '\\n' and replace with ','
df['job_description'] = df['job_description'].map(
                        lambda x: x.replace('\\n', ','))

In [13]:
#lowercase job_description text before applying stopwords
df['job_description'] = df['job_description'].map(lambda x: x.lower())

In [14]:
#lowercase job_title text before cleaning
df['job_title'] = df['job_title'].map(lambda x: x.lower())

In [15]:
#remove punctuation from job_title
df['job_title'] = df['job_title'].map(remove_punctuation)

In [16]:
#remove stop words from job_title
df['job_title'] = df['job_title'].map(remove_stop_words)
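
The two title helpers are also imported from `functions` and are not shown. As a rough sketch of the idea, assuming punctuation is replaced by spaces and that the stop list contains recruiter filler such as "rate", "open", and "duration" (the list below is hypothetical, as is dropping bare numbers):

```python
import string

# Hypothetical stop list; the real one in functions.py is not shown.
TITLE_STOP_WORDS = {'sign', 'on', 'bonus', 'rate', 'open',
                    'duration', 'months', 'sr'}

# Map every punctuation character to a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation,
                                ' ' * len(string.punctuation))


def remove_punctuation_sketch(title):
    # Replace punctuation with spaces, then collapse repeated whitespace.
    return ' '.join(title.translate(_PUNCT_TO_SPACE).split())


def remove_stop_words_sketch(title, stop_words=TITLE_STOP_WORDS):
    # Dropping bare numbers is an assumption made for this sketch.
    kept = [w for w in title.split()
            if w not in stop_words and not w.isdigit()]
    return ' '.join(kept)


title = 'euc engineer, rate-open, duration: 18 months'
print(remove_stop_words_sketch(remove_punctuation_sketch(title)))
```

A title made up entirely of stop words comes back as an empty string, which is why empty titles are checked for after this step.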

In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20492 entries, 9 to 27015
Data columns (total 2 columns):
job_description    20492 non-null object
job_title          20492 non-null object
dtypes: object(2)
memory usage: 480.3+ KB


In [18]:
#drop any duplicates
df = df.drop_duplicates()

In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 18704 entries, 9 to 27015
Data columns (total 2 columns):
job_description    18704 non-null object
job_title          18704 non-null object
dtypes: object(2)
memory usage: 438.4+ KB


In [20]:
#reset the index
df = df.reset_index(drop=True)

In [21]:
#check to see if there are any job titles that have been 
#removed via stopwords, I will want to remove these rows 
#because the job titles were not real job titles
df.loc[df['job_title']=='']

Unnamed: 0,job_description,job_title
1526,'',
2508,'robert half technology is looking for an expe...,
13230,'someone who has spent 8-10 years in insurance...,
13569,"'my client is looking for sap fico in houston,...",
14136,"'share your resume on click here to apply', 'j...",
14157,"'role : fico with s4 hana.,location : manitowo...",
15118,"'hello all,', '', '', '', '', '', '', '', '', ...",
16508,"'job description:', 'leidos is looking for a j...",


In [22]:
#keep only rows whose job title is not empty 
#(drops the 8 rows listed above without hardcoding their index labels)
df = df[df['job_title'] != '']

In [23]:
#sanity check
df.loc[df['job_title']=='']

Unnamed: 0,job_description,job_title


In [24]:
#reset index one last time
df = df.reset_index(drop=True)
df.head()

Unnamed: 0,job_description,job_title
0,'2+ years of experience developing java / j2ee...,java developer
1,"'passion for technology and learning, a natura...",ios developer mobile
2,'we enjoy approved it vendor status with sever...,euc engineer
3,'we enjoy approved it vendor status with sever...,software developer rpg
4,'sme in linux operating system with strong vir...,linux consultant with weblogic exp


---

## Text Cleaning

1. Tokenize
2. Remove words with fewer than 2 characters
3. Remove stop words
4. Normalize words (Lemmatize and Stem)
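
The four steps above can be sketched with the standard library alone. This is not the project's `preprocess` function (that lives in `functions` and is not shown); the stop list is a tiny illustrative one, and the normalization step is left as a pluggable `normalize` callable, both of which are assumptions for this sketch.

```python
import re

# Tiny illustrative stop list; gensim's STOPWORDS is far larger.
STOP_WORDS_SKETCH = frozenset(
    {'and', 'with', 'the', 'of', 'in', 'to', 'for'})


def preprocess_sketch(text, stop_words=STOP_WORDS_SKETCH,
                      min_len=2, normalize=None):
    """Hypothetical outline of the cleaning steps, not the real preprocess."""
    if normalize is None:
        normalize = lambda token: token                   # identity by default
    tokens = re.findall(r'[a-z]+', text.lower())          # 1. tokenize
    tokens = [t for t in tokens if len(t) >= min_len]     # 2. drop short tokens
    tokens = [t for t in tokens if t not in stop_words]   # 3. drop stop words
    return [normalize(t) for t in tokens]                 # 4. normalize


sample = "'demonstrates brand passion,champions and embraces change'"
print(preprocess_sketch(sample))
```

To get stemmed output like the cells below, `normalize` could presumably be something along the lines of `lambda t: stemmer.stem(WordNetLemmatizer().lemmatize(t, pos='v'))`, using the `SnowballStemmer` and `WordNetLemmatizer` imported at the top; whether the real helper does exactly that is an assumption.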

**Test the functions on one row of text**

In [25]:
stemmer = SnowballStemmer('english')

In [26]:
#pull the job description for row 13
text_sample = df.loc[13, 'job_description']

print('original text: ')
print(text_sample.split(' '))
print('\n\n tokenized and lemmatized text: ')
print(preprocess(text_sample))

original text: 
["'demonstrates", 'brand', 'passion,champions', 'and', 'embraces', 'change,makes', 'good', 'decisions,delivers', 'results,takes', 'action', 'with', 'integrity,communicates', "effectively'"]


 tokenized and lemmatized text: 
['demonstr', 'brand', 'passion', 'champion', 'embrac', 'chang', 'make', 'good', 'decis', 'deliv', 'result', 'take', 'action', 'integr', 'communic', 'effect']


**Apply cleaning functions to job_description**

In [27]:
#apply text cleaning function and display first 5 rows
processed_text = df['job_description'].map(preprocess)
processed_text[:5]

0                                 [develop, java, web]
1    [passion, technolog, learn, natur, curios, lov...
2    [enjoy, approv, vendor, status, lead, compani,...
3    [enjoy, approv, vendor, status, lead, compani,...
4    [sme, linux, oper, strong, virtual, knowledg, ...
Name: job_description, dtype: object

---

## Bag of Words

In [28]:
#I'll use bag of words to extract features from text for use in modeling

In [29]:
dictionary = gensim.corpora.Dictionary(processed_text)

In [30]:
#check the length before I filter out the extremes
len(dictionary)

24548

In [31]:
dictionary.filter_extremes(no_below=25, 
                           no_above=0.5, 
                           keep_n=100000)
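
With the settings above, `filter_extremes` keeps tokens that appear in at least 25 documents and in no more than 50% of all documents, then caps the vocabulary at the 100000 most frequent survivors. A small standard-library sketch of that rule on a toy corpus (the function name and the alphabetical tie-break are assumptions for illustration):

```python
from collections import Counter


def filter_extremes_sketch(docs, no_below=5, no_above=0.5, keep_n=100000):
    # Document frequency: number of documents that contain each token.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    kept = [t for t, n in doc_freq.items() if no_below <= n <= max_docs]
    # Keep the keep_n most frequent survivors (ties broken alphabetically).
    kept.sort(key=lambda t: (-doc_freq[t], t))
    return kept[:keep_n]


docs = [['java', 'web', 'team'], ['java', 'cloud', 'team'],
        ['linux', 'team'], ['java', 'team', 'rare']]
print(filter_extremes_sketch(docs, no_below=2, no_above=0.75))
```

In the toy corpus, `team` is dropped for being in every document and the single-document tokens are dropped for being too rare, which is the same pruning that takes the real dictionary from 24548 tokens down to 3753.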

In [32]:
#check length after filtering out extremes
len(dictionary)

3753

In [33]:
#doc2bow: counts the number of occurrences of each distinct word, 
#converts the word to its integer word id and returns the result 
#as a sparse vector of (word_id, count) pairs

bow2doc_corpus = [dictionary.doc2bow(text) for text in processed_text]
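
To make the sparse representation concrete, here is a standard-library sketch of the same idea. The id assignment below (first-seen order) is an assumption for illustration and will not match the ids gensim assigns.

```python
from collections import Counter


def build_token_ids(docs):
    # Assign each distinct token an integer id in first-seen order.
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def doc2bow_sketch(doc, token2id):
    # Count occurrences per token id; tokens outside the vocabulary are skipped.
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


docs = [['java', 'web', 'java'], ['cloud', 'web']]
token2id = build_token_ids(docs)
print(doc2bow_sketch(docs[0], token2id))
```

Tokens removed by `filter_extremes` are simply absent from the vocabulary, so they contribute nothing to a document's vector.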

---

## LDA model with Bag of Words

In [None]:
#train lda model, this takes a while so I pickled my desired 
#results in the cells below
lda_model = gensim.models.LdaMulticore(bow2doc_corpus, 
                                       num_topics=9, 
                                       id2word=dictionary, 
                                       passes=50, 
                                       workers=4,
                                       chunksize=500)


In [None]:
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

**Pickled LDA model results**

In [34]:
#pickle.dump(lda_model, open('dreamjobber_web/webapp/pickled_models/lda_model.pkl', 'wb'))
pickled_lda = pickle.load(open('dreamjobber_web/webapp/pickled_models/lda_model.pkl', 'rb'))

In [35]:
for idx, topic in pickled_lda.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

Topic: 0 
Words: 0.014*"client" + 0.011*"solut" + 0.010*"consult" + 0.009*"group" + 0.009*"help" + 0.008*"provid" + 0.008*"industri" + 0.008*"compani" + 0.007*"profession" + 0.007*"innov"
Topic: 1 
Words: 0.069*"secur" + 0.021*"inform" + 0.016*"system" + 0.012*"engin" + 0.011*"risk" + 0.011*"oper" + 0.011*"program" + 0.010*"network" + 0.010*"provid" + 0.010*"control"
Topic: 2 
Words: 0.036*"project" + 0.015*"process" + 0.014*"technic" + 0.014*"abil" + 0.014*"requir" + 0.011*"plan" + 0.011*"product" + 0.010*"function" + 0.010*"solut" + 0.009*"implement"
Topic: 3 
Words: 0.039*"test" + 0.029*"softwar" + 0.021*"web" + 0.016*"java" + 0.014*"code" + 0.013*"net" + 0.012*"framework" + 0.011*"engin" + 0.011*"javascript" + 0.011*"end"
Topic: 4 
Words: 0.032*"cloud" + 0.024*"engin" + 0.021*"architectur" + 0.016*"solut" + 0.016*"aw" + 0.014*"platform" + 0.012*"build" + 0.011*"integr" + 0.011*"deploy" + 0.011*"architect"
Topic: 5 
Words: 0.029*"network" + 0.018*"system" + 0.014*"technic" + 0.014*"

In [36]:
#manually name topics
col_names=['Analyst', 'Security', 'Leadership', 'Software/Web Dev', 
           'Cloud Computing', 'Computer Network', 'Database Admin', 
           'Computer Support', 'WebDev']

---

**LDA model evaluation**

In [37]:
# Compute Coherence Score using c_v
coherence_model_lda = CoherenceModel(model=pickled_lda, 
                                     texts=processed_text, 
                                     dictionary=dictionary, 
                                     coherence='c_v')

coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)


Coherence Score:  0.4762757589382995


In [38]:
# Compute Coherence Score using UMass
coherence_model_lda = CoherenceModel(model=pickled_lda, 
                                     texts=processed_text, 
                                     dictionary=dictionary, 
                                     coherence="u_mass")

coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)


Coherence Score:  -1.1423289375289365
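
The UMass score is usually negative because it averages log ratios of document co-occurrence counts; values closer to zero suggest that a topic's top words tend to appear in the same documents. Below is a simplified sketch for a single topic, following the commonly cited formulation with +1 smoothing and averaging over word pairs. gensim's implementation differs in details such as the smoothing constant, so the numbers will not match the score above.

```python
import math
from itertools import combinations


def umass_coherence_sketch(top_words, docs):
    """Simplified UMass-style coherence for one topic (illustrative only)."""
    doc_sets = [set(doc) for doc in docs]

    def doc_count(*words):
        # Number of documents containing all of the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    scores = []
    # top_words is assumed to be ordered from most to least probable.
    for earlier, later in combinations(top_words, 2):
        denom = doc_count(earlier)
        if denom == 0:
            continue
        scores.append(math.log((doc_count(earlier, later) + 1) / denom))
    return sum(scores) / len(scores) if scores else float('nan')


docs = [['java', 'web'], ['java', 'web', 'test'],
        ['java', 'cloud'], ['linux']]
print(umass_coherence_sketch(['java', 'web'], docs))
print(umass_coherence_sketch(['java', 'linux'], docs))
```

The second topic pairs words that never share a document, so it scores lower than the first.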


In [40]:
#visualize the topics in order to label them better 
#you may need to pip install pyLDAvis by uncommenting and 
#running the line below
#note: pyLDAvis 3.x renamed pyLDAvis.gensim to pyLDAvis.gensim_models
#!pip install pyLDAvis
%matplotlib inline
import pyLDAvis
import pyLDAvis.gensim
vis = pyLDAvis.gensim.prepare(topic_model=pickled_lda, 
                              corpus=bow2doc_corpus, 
                              dictionary=dictionary)
pyLDAvis.enable_notebook()
pyLDAvis.display(vis)



In [41]:
#show topics and descriptions
df_topic_sents_keywords = show_topics_sentences(ldamodel=pickled_lda, 
                                                corpus=bow2doc_corpus, 
                                                texts=df['job_description'])


df_dominant_topic = df_topic_sents_keywords.reset_index()
df_dominant_topic.columns = ['Document_No', 'Dominant_Topic', 
                             'Topic_Perc_Contrib', 'Keywords', 
                             'Text']

In [42]:
df_dominant_topic.head()

Unnamed: 0,Document_No,Dominant_Topic,Topic_Perc_Contrib,Keywords,Text
0,0,3.0,0.7037,"test, softwar, web, java, code, net, framework...",'2+ years of experience developing java / j2ee...
1,1,0.0,0.9259,"client, solut, consult, group, help, provid, i...","'passion for technology and learning, a natura..."
2,2,0.0,0.7992,"client, solut, consult, group, help, provid, i...",'we enjoy approved it vendor status with sever...
3,3,0.0,0.7992,"client, solut, consult, group, help, provid, i...",'we enjoy approved it vendor status with sever...
4,4,5.0,0.533,"network, system, technic, server, troubleshoot...",'sme in linux operating system with strong vir...


---

**Create df of topic scores for each job title**

In [43]:
#get probabilities of a document belonging to each topic and append to list
topic_vecs = []
for bow in bow2doc_corpus:
    top_topics = pickled_lda.get_document_topics(bow, 
                                                 minimum_probability=0.0)
    #one probability per topic (9 topics in this model)
    topic_vec = [top_topics[topic_id][1] for topic_id in range(9)]
    topic_vecs.append(topic_vec)
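
`get_document_topics` returns a list of `(topic_id, probability)` pairs, and the loop above relies on all nine topics being present and in order. A slightly more defensive sketch, keyed on the topic id rather than the list position (the helper name `dense_topic_vector` is hypothetical):

```python
def dense_topic_vector(topic_pairs, num_topics):
    # Start from zeros so any topic missing from the pairs stays at 0.0.
    vec = [0.0] * num_topics
    for topic_id, prob in topic_pairs:
        vec[topic_id] = float(prob)
    return vec


pairs = [(3, 0.70), (0, 0.04), (8, 0.26)]
print(dense_topic_vector(pairs, num_topics=9))
```

With this, a row would still line up with the topic columns even if some topic were omitted from the returned pairs.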

In [44]:
#create a dataframe for topic scores
df_topic_vecs = pd.DataFrame(topic_vecs)
df_topic_vecs.head(10)

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,0.037037,0.037038,0.037037,0.703682,0.03705,0.037038,0.037037,0.037037,0.037043
1,0.92589,0.009262,0.009264,0.009264,0.009264,0.009265,0.009263,0.009263,0.009266
2,0.79917,0.002366,0.002367,0.002366,0.002366,0.002366,0.002366,0.184267,0.002366
3,0.799178,0.002366,0.002367,0.002366,0.002366,0.002366,0.002366,0.18426,0.002366
4,0.00102,0.00102,0.00102,0.00102,0.411041,0.532986,0.049853,0.00102,0.00102
5,0.011124,0.011136,0.011131,0.01112,0.209851,0.011122,0.712269,0.011126,0.011121
6,0.002647,0.002647,0.248674,0.002648,0.596378,0.002647,0.139064,0.002647,0.002649
7,0.037075,0.037057,0.037067,0.037072,0.367976,0.37255,0.037086,0.03707,0.037047
8,0.799164,0.002366,0.002367,0.002366,0.002366,0.002366,0.002366,0.184274,0.002366
9,0.799176,0.002366,0.002367,0.002366,0.002366,0.002366,0.002366,0.184262,0.002366


In [45]:
#name columns for dataframe

col_names=['Analyst', 'Security', 'Leadership', 'Software/Web Dev', 
           'Cloud Computing', 'Computer Network', 'Database Admin', 
           'Computer Support', 'WebDev']

df_topic_vecs.columns = col_names
df_topic_vecs.head()

Unnamed: 0,Analyst,Security,Leadership,Software/Web Dev,Cloud Computing,Computer Network,Database Admin,Computer Support,WebDev
0,0.037037,0.037038,0.037037,0.703682,0.03705,0.037038,0.037037,0.037037,0.037043
1,0.92589,0.009262,0.009264,0.009264,0.009264,0.009265,0.009263,0.009263,0.009266
2,0.79917,0.002366,0.002367,0.002366,0.002366,0.002366,0.002366,0.184267,0.002366
3,0.799178,0.002366,0.002367,0.002366,0.002366,0.002366,0.002366,0.18426,0.002366
4,0.00102,0.00102,0.00102,0.00102,0.411041,0.532986,0.049853,0.00102,0.00102


---

# Nearest Neighbors

In [46]:
#next step merge df_topic_vecs with original df of job titles and job descriptions
#pickle the merged df

In [47]:
df_final = pd.merge(df, df_topic_vecs, 
                    left_index=True, 
                    right_index=True)

In [48]:
#pickle.dump(df_final, open('dreamjobber_web/webapp/pickled_models/df_final.pkl', 'wb'))
pickled_df_final = pickle.load(open('dreamjobber_web/webapp/pickled_models/df_final.pkl', 'rb'))
pickled_df_final.head()

Unnamed: 0,job_description,job_title,Analyst,Security,Leadership,Software/Web Dev,Cloud Computing,Computer Network,Database Admin,Computer Support,WebDev
0,'2+ years of experience developing java / j2ee...,java developer,0.037037,0.037038,0.037037,0.703682,0.03705,0.037038,0.037037,0.037037,0.037043
1,"'passion for technology and learning, a natura...",ios developer mobile,0.92589,0.009262,0.009264,0.009264,0.009264,0.009265,0.009263,0.009263,0.009266
2,'we enjoy approved it vendor status with sever...,euc engineer,0.799141,0.002366,0.002367,0.002366,0.002366,0.002366,0.002366,0.184296,0.002366
3,'we enjoy approved it vendor status with sever...,software developer rpg,0.799154,0.002366,0.002367,0.002366,0.002366,0.002366,0.002366,0.184283,0.002366
4,'sme in linux operating system with strong vir...,linux consultant with weblogic exp,0.00102,0.00102,0.00102,0.00102,0.411034,0.532981,0.049865,0.00102,0.00102


In [49]:
topics = pickled_df_final.drop(['job_description', 'job_title'], axis=1)
jobs = pickled_df_final['job_title']

In [50]:
#pickle jobs for use in webapp
#pickle.dump(jobs, open('dreamjobber_web/webapp/pickled_models/jobs.pkl', 'wb'))
jobs = pickle.load(open('dreamjobber_web/webapp/pickled_models/jobs.pkl', 'rb'))

In [51]:
nearest_neighbor = NearestNeighbors(n_neighbors=50)
nearest_neighbor.fit(topics)

NearestNeighbors(algorithm='auto', leaf_size=30, metric='minkowski',
         metric_params=None, n_jobs=None, n_neighbors=50, p=2, radius=1.0)
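
The repr above shows `metric='minkowski'` with `p=2`, which is plain Euclidean distance. As a sketch of what the lookup amounts to, here is a brute-force version on toy topic vectors using only the standard library (requires Python 3.8+ for `math.dist`; the function name and data are made up for illustration):

```python
import math


def k_nearest_sketch(query, rows, k=3):
    # Euclidean distance from the query to every row, smallest first.
    distances = [(math.dist(query, row), idx)
                 for idx, row in enumerate(rows)]
    distances.sort()
    return [idx for _, idx in distances[:k]]


topic_rows = [
    [0.70, 0.10, 0.20],   # job 0
    [0.10, 0.80, 0.10],   # job 1
    [0.60, 0.20, 0.20],   # job 2
]
print(k_nearest_sketch([0.68, 0.12, 0.20], topic_rows, k=2))
```

With the fitted scikit-learn model, the equivalent call is `nearest_neighbor.kneighbors(query_rows)`, which returns distances and row indices that can then be mapped back to `jobs`.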

In [52]:
#pickle nearest_neighbor model for use in webapp
#pickle.dump(nearest_neighbor, open('dreamjobber_web/webapp/pickled_models/nn_model.pkl', 'wb'))
nearest_neighbor = pickle.load(open('dreamjobber_web/webapp/pickled_models/nn_model.pkl', 'rb'))

---

**Make Recommendations**

In [53]:
show_to_user(nearest_neighbor, jobs)

Scale of 0-10.
    0 is Do NOT agree and 10 is agree
Agree or Disagree: I am/I like Analyst: 8
Agree or Disagree: I am/I like Security: 5
Agree or Disagree: I am/I like Leadership: 4
Agree or Disagree: I am/I like Software/App Dev: 5
Agree or Disagree: I am/I like Cloud Computing: 3
Agree or Disagree: I am/I like Computer Network: 3
Agree or Disagree: I am/I like Database Admin: 7
Agree or Disagree: I am/I like Computer Support: 0
Agree or Disagree: I am/I like WebDev: 5


['1. c++ quant developer elite global wealth management team', '2. data conversion analyst', '3. lead application developer', '4. data architect', '5. data modeler c++ mortgage backed securities', '6. business intelligence engineer', '7. data engineer', '8. level data engineer', '9. data scientist', '10. software engineer']
How did you like your recommendations? bad, okay, or goodokay
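
How `show_to_user` turns the 0-10 answers into a query vector is defined in `dreamjobber_web.recommend`, which is not shown here. One plausible approach, offered purely as an assumption, is to rescale the answers so they sum to 1, putting them on the same footing as the topic probabilities the model was fitted on. The helper name `scores_to_query` is hypothetical.

```python
def scores_to_query(scores):
    """Hypothetical rescaling of 0-10 answers into a probability-like vector."""
    total = sum(scores)
    if total == 0:
        # No preference expressed: fall back to a uniform vector.
        return [1.0 / len(scores)] * len(scores)
    return [s / total for s in scores]


answers = [8, 5, 4, 5, 3, 3, 7, 0, 5]   # the answers entered above
query = scores_to_query(answers)
print([round(q, 3) for q in query])
```

The resulting vector has one entry per topic column, in the same order as `col_names`, so it could be passed to the nearest neighbors lookup as a single-row query.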
