# DreamJobber

**Tech Edition**

---

**Process**
1. Clean text
2. Bag of Words
3. LDA model (Latent Dirichlet allocation)
4. Fine tune LDA model
5. Define Topics from LDA model
6. Create df of document probabilities
7. Classification model

---

**Import Necessary Libraries**

In [1]:
import pandas as pd
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
import numpy as np
import nltk
from dreamjobber_tech.recommend import *
from functions import *
import pickle

#LDA model evaluation with coherence
from gensim.models.coherencemodel import CoherenceModel
from gensim.models.ldamodel import LdaModel
from gensim.corpora.dictionary import Dictionary

from sklearn.neighbors import NearestNeighbors

In [None]:
nltk.download('wordnet')

---

**Load Data**

In [None]:
df_1 = pd.read_json('data/dice_jobs_1.json', lines=True)
df_2 = pd.read_json('data/dice_jobs_2.json', lines=True)
df_3 = pd.read_json('data/dice_jobs_3.json', lines=True)
df_4 = pd.read_json('data/dice_jobs_4.json', lines=True)
df_5 = pd.read_json('data/dice_jobs_5.json', lines=True)
df_6 = pd.read_json('data/dice_jobs_6.json', lines=True)

In [None]:
#concat into one df
df = pd.concat([df_1, df_2, df_3, df_4, df_5, df_6], ignore_index=True, sort=True)

In [None]:
df.head()

In [None]:
#check for missing values
df.isna().sum()

In [None]:
#looks like there are rows that have no job description
df.info()

In [None]:
#drop rows with no job descriptions
df = df.dropna()

In [None]:
#sanity check, looks good
df.info()

In [None]:
df.head()

In [None]:
#need to remove brackets from job_description
df['job_description'] = df['job_description'].map(remove_brackets)

In [None]:
#remove '\\n' and replace with ','
df['job_description'] = df['job_description'].map(lambda x: x.replace('\\n', ','))
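The `'\\n'` in the cell above is a two-character string (a backslash followed by `n`), not a newline character, which suggests the scraped descriptions contain escaped newlines. A small illustration on a made-up string (not taken from the dataset):

```python
# Made-up sample: the text holds a literal backslash + "n", not a real newline.
raw = 'java developer\\nremote'

print(len('\\n'))   # 2 characters: backslash and "n"
print(len('\n'))    # 1 character: an actual newline

cleaned = raw.replace('\\n', ',')
print(cleaned)      # java developer,remote

# Replacing a real newline would leave this sample unchanged.
unchanged = raw.replace('\n', ',')
print(unchanged == raw)
```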

In [None]:
#lowercase text before applying stopwords
df['job_description'] = df['job_description'].map(lambda x: x.lower())

In [None]:
#lowercase job_title text before cleaning
df['job_title'] = df['job_title'].map(lambda x: x.lower())

In [None]:
#Return first title from job_titles
df['job_title'] = df['job_title'].map(get_first_title)

In [None]:
#def remove_stop_words(text):
#    """Remove stopwords from job titles"""
#    jobtitle_stopwords = ['bonus', 'duration', 'month', 'open',
#                          'rate', 'sign-on', 'year']
#    #split the title into words and keep the ones that are not stopwords
#    result = [word for word in text.split() if word not in jobtitle_stopwords]
#    return ' '.join(result)


In [None]:
#remove stop words from job titles
#df['job_title'] = df['job_title'].map(remove_stop_words)

In [None]:
df = df.reset_index(drop=True)
df.head()

In [None]:
df.info()

In [None]:
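#drop duplicate rows and reset the index so it stays aligned with the topic vectors merged in later
df = df.drop_duplicates().reset_index(drop=True)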

In [None]:
df.info()

---

## Text Cleaning

1. Tokenize
2. Remove words with fewer than 3 characters
3. Remove stop words
4. Normalize words (Lemmatize and Stem)
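The `preprocess` function used below is imported from a helper module and is not shown in this notebook. The four steps above can be sketched roughly as follows. This is a dependency-free stand-in, not the actual helper: the stop-word list is a tiny illustrative subset (the notebook imports gensim's `STOPWORDS`), and `naive_stem` is a crude suffix stripper standing in for `WordNetLemmatizer` plus `SnowballStemmer`.

```python
import re

# Tiny illustrative stop-word list; the notebook imports gensim's larger STOPWORDS set.
STOP_WORDS = {"the", "and", "for", "with", "are", "our", "you", "will", "this", "that"}

def naive_stem(word):
    """Crude suffix stripper standing in for lemmatize + Snowball stem."""
    for suffix in ("ing", "ers", "er", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess_sketch(text):
    tokens = re.findall(r"[a-z]+", text.lower())         # 1. tokenize
    tokens = [t for t in tokens if len(t) >= 3]          # 2. drop words under 3 characters
    tokens = [t for t in tokens if t not in STOP_WORDS]  # 3. drop stop words
    return [naive_stem(t) for t in tokens]               # 4. normalize

print(preprocess_sketch("We are hiring Python developers for testing APIs"))
```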

**Test the functions on one row of text**

In [None]:
stemmer = SnowballStemmer('english')

In [None]:
text_sample = df.loc[53, 'job_description']

print('original text: ')
words = text_sample.split(' ')
print(words)
print('\n\n tokenized and lemmatized text: ')
print(preprocess(text_sample))

**Apply cleaning functions to job_description**

In [None]:
#apply function and display first 10 rows
processed_text = df['job_description'].map(preprocess)
processed_text[:10]

---

## Bag of Words

In [None]:
#I'll use bag of words to extract features from text for use in modeling

In [None]:
dictionary = gensim.corpora.Dictionary(processed_text)

In [None]:
#check the length before I filter out the extremes
len(dictionary)

In [None]:
dictionary.filter_extremes(no_below=25, no_above=0.5, keep_n=100000)

In [None]:
#check length after filtering out extremes
len(dictionary)
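`filter_extremes` prunes the dictionary by document frequency: a token is kept only if it appears in at least `no_below` documents and in no more than a `no_above` fraction of all documents, and then at most `keep_n` of the most frequent survivors are retained. A minimal pure-Python sketch of that rule, on made-up documents (the function name and the tie-breaking order are my own, not gensim's):

```python
from collections import Counter

def filter_extremes_sketch(docs, no_below, no_above, keep_n):
    """Return the tokens that survive document-frequency filtering."""
    # Document frequency: number of documents each token appears in.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    kept = [t for t, df in doc_freq.items() if no_below <= df <= max_docs]
    # Keep at most keep_n tokens, most frequent first (ties broken alphabetically).
    kept.sort(key=lambda t: (-doc_freq[t], t))
    return kept[:keep_n]

docs = [
    ["python", "sql", "team"],
    ["python", "java", "team"],
    ["python", "sql", "team"],
    ["python", "cloud", "aws"],
]
# "python" is in every document (too common); "java", "cloud", "aws" are too rare.
print(filter_extremes_sketch(docs, no_below=2, no_above=0.75, keep_n=10))
```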

In [None]:
#doc2bow: counts the number of occurrences of each distinct word,
#converts the word to its integer word id and returns the result as a sparse vector

bow2doc_corpus = [dictionary.doc2bow(text) for text in processed_text]
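To make the sparse bag-of-words format concrete, here is a small pure-Python sketch of what a `doc2bow`-style conversion produces: a sorted list of `(token_id, count)` pairs per document. `build_vocab` and `doc2bow_sketch` are illustrative names, and the id assignment order is an assumption; gensim's `Dictionary` assigns its own ids.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def doc2bow_sketch(vocab, doc):
    """Return sorted (token_id, count) pairs; tokens missing from vocab are skipped."""
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())

docs = [["python", "sql", "python"], ["java", "sql"]]
vocab = build_vocab(docs)
corpus = [doc2bow_sketch(vocab, d) for d in docs]
print(corpus)
```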

---

## Find optimal number of topics

In [None]:
#model_list, coherence_values = compute_coherence_values(dictionary=dictionary, corpus=bow2doc_corpus,
 #                                                       texts=processed_text, start=5, limit=40, step=5)

In [None]:
#import matplotlib.pyplot as plt
#limit=40; start=5; step=5;
#x = range(start, limit, step)
#plt.plot(x, coherence_values)
#plt.xlabel("Number of Topics")
#plt.ylabel("Coherence score")
#plt.show()
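The commented-out cells above sweep the number of topics and plot one coherence score per candidate. Assuming `compute_coherence_values` returns one score for each value in `range(start, limit, step)` (which is how the plot uses it), picking the highest-scoring candidate could look like the sketch below. The scores are made up for illustration and are not results from this notebook.

```python
def best_num_topics(coherence_values, start, limit, step):
    """Return (num_topics, score) for the highest coherence score."""
    candidates = list(range(start, limit, step))
    if len(candidates) != len(coherence_values):
        raise ValueError("expected one coherence value per candidate topic count")
    best = max(range(len(candidates)), key=lambda i: coherence_values[i])
    return candidates[best], coherence_values[best]

# Made-up scores for illustration only.
example_scores = [0.41, 0.47, 0.45, 0.44, 0.40, 0.39, 0.38]
print(best_num_topics(example_scores, start=5, limit=40, step=5))
```

The top score is only a starting point; topics still need to be readable enough to label by hand.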

---

## LDA model with Bag of Words

In [None]:
lda_model = gensim.models.LdaMulticore(bow2doc_corpus,
                                       num_topics=10,
                                       id2word=dictionary,
                                       passes=50,
                                       workers=4,
                                       chunksize=750)


In [None]:
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

**Pickled LDA model results**

In [None]:
#pickle.dump(lda_model, open('pickled_models/lda_model.pkl', 'wb'))
pickled_lda = pickle.load(open('pickled_models/lda_model.pkl', 'rb'))

In [None]:
for idx, topic in pickled_lda.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

---

**LDA model evaluation**

In [None]:
# Compute Coherence Score using c_v
coherence_model_lda = CoherenceModel(model=lda_model, texts=processed_text, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)

In [None]:
# Compute Coherence Score using UMass
coherence_model_lda = CoherenceModel(model=lda_model, texts=processed_text, dictionary=dictionary, coherence="u_mass")
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)

In [None]:
#!pip install pyLDAvis

In [None]:
#visualize the topics to make them easier to label
#note: pyLDAvis 3.x renamed the pyLDAvis.gensim module to pyLDAvis.gensim_models
%matplotlib inline
import pyLDAvis
import pyLDAvis.gensim
vis = pyLDAvis.gensim.prepare(topic_model=lda_model, corpus=bow2doc_corpus, dictionary=dictionary)
pyLDAvis.enable_notebook()
pyLDAvis.display(vis)


In [None]:
#show topics and descriptions
df_topic_sents_keywords = show_topics_sentences(ldamodel=pickled_lda, corpus=bow2doc_corpus, texts=df['job_description'])


df_dominant_topic = df_topic_sents_keywords.reset_index()
df_dominant_topic.columns = ['Document_No', 'Dominant_Topic', 'Topic_Perc_Contrib', 'Keywords', 'Text']


In [None]:
df_dominant_topic.head()

---

**Create df of topic scores for each job title**

In [None]:
topic_vecs = []
for i in range(len(bow2doc_corpus)):
    top_topics = pickled_lda.get_document_topics(bow2doc_corpus[i], minimum_probability=0.0)
    #one probability per topic; 9 has to match the number of topics in pickled_lda
    topic_vec = [top_topics[k][1] for k in range(9)]
    topic_vecs.append(topic_vec)
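`get_document_topics` returns sparse `(topic_id, probability)` pairs, and asking for `minimum_probability=0.0` is meant to return an entry for every topic, which is what makes positional indexing work. A slightly more defensive variant places each probability by its topic id, so a missing topic becomes `0.0` instead of shifting the columns. A minimal sketch on made-up values (`dense_topic_vector` is a hypothetical helper, not part of gensim):

```python
def dense_topic_vector(doc_topics, num_topics):
    """Turn sparse (topic_id, probability) pairs into a fixed-length list."""
    vec = [0.0] * num_topics
    for topic_id, prob in doc_topics:
        vec[topic_id] = prob
    return vec

# Made-up sparse output for a 4-topic model where topics 1 and 3 are absent.
sparse = [(0, 0.7), (2, 0.3)]
print(dense_topic_vector(sparse, num_topics=4))
```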

In [None]:
df_topic_vecs = pd.DataFrame(topic_vecs)
df_topic_vecs.head(10)

In [None]:
#name columns for df

col_names=['Computer Network', 'Web Dev', 'Security', 'Analyst', 
           'Leadership', 'Database Admin', 'Cloud Computing', 'Computer Support', 'Software/App Dev']


df_topic_vecs.columns = col_names
df_topic_vecs.head()

---

# Nearest Neighbors

In [None]:
#next step merge with original df of job titles and job descriptions
#pickle the merged df

In [None]:
df_final = pd.merge(df, df_topic_vecs, left_index=True, right_index=True)

In [None]:
#pickle.dump(df_final, open('pickled_models/df_final.pkl', 'wb'))
pickled_df_final = pickle.load(open('pickled_models/df_final.pkl', 'rb'))
pickled_df_final.head()

In [None]:
topics = pickled_df_final.drop(['job_description', 'job_title'], axis=1)
jobs = pickled_df_final['job_title']

In [2]:
#pickle.dump(jobs, open('pickled_models/jobs.pkl', 'wb'))
jobs = pickle.load(open('pickled_models/jobs.pkl', 'rb'))

In [None]:
nearest_neighbor = NearestNeighbors(n_neighbors=50)
nearest_neighbor.fit(topics)
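`NearestNeighbors` defaults to Euclidean distance, so a recommendation amounts to finding the job rows whose topic vectors sit closest to a vector built from the user's answers. How `show_to_user` turns the 0-10 ratings into a query vector is not shown in this notebook; the sketch below assumes the ratings are rescaled to sum to 1 so they resemble a topic distribution. The vectors, titles, and function names are made up for illustration.

```python
import math

def normalize(ratings):
    """Scale 0-10 ratings so they sum to 1, like a topic distribution."""
    total = sum(ratings)
    if total == 0:
        return [1.0 / len(ratings)] * len(ratings)
    return [r / total for r in ratings]

def nearest_titles(query, topic_rows, titles, k):
    """Return the k titles whose topic vectors are closest to query (Euclidean)."""
    dists = [(math.dist(query, row), title) for row, title in zip(topic_rows, titles)]
    return [title for _, title in sorted(dists)[:k]]

# Made-up 3-topic vectors and titles.
topic_rows = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8], [0.6, 0.3, 0.1]]
titles = ["network engineer", "web developer", "security analyst", "systems administrator"]

user = normalize([9, 2, 1])
print(nearest_titles(user, topic_rows, titles, k=2))
```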

In [3]:
#pickle.dump(nearest_neighbor, open('pickled_models/nn_model.pkl', 'wb'))
nearest_neighbor = pickle.load(open('pickled_models/nn_model.pkl', 'rb'))

---

**Make Recommendations**

In [4]:
show_to_user(nearest_neighbor, jobs)


Scale of 0-10.
    0 is Do NOT agree and 10 is agree
Agree or Disagree: I am/I like Computer Network: 4
Agree or Disagree: I am/I like Web Dev: 4
Agree or Disagree: I am/I like Security: 4
Agree or Disagree: I am/I like Analyst: 4
Agree or Disagree: I am/I like Leadership: 4
Agree or Disagree: I am/I like Database Admin: 4
Agree or Disagree: I am/I like Cloud Computing: 4
Agree or Disagree: I am/I like Computer Support: 4
Agree or Disagree: I am/I like Software/App Dev: 4


['1. lead application developer', '2. contract analyst', '3. desktop support', '4. mulesoft developer', '5. it project coordinator', '6. senior application security engineer', '7. data warehouse programmer', '8. service technician - paris', '9. project manager', '10. webmethods developer']
How did you like your recommendations? bad, okay, or goodokay


In [None]:
## next steps
#1. clean job titles
#2. have recommendations return a unique list of the top 10
#3. separate functions into modules (textcleaning.py, recommend.py)
#4. build a Flask app and link each recommendation to job postings
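For the unique top-10 step, one option is to walk the neighbor titles in distance order and keep the first occurrence of each. With 50 neighbors requested from the model, there should usually be enough candidates left after duplicates are dropped. A minimal sketch, with `top_unique` as a hypothetical helper and made-up titles:

```python
def top_unique(titles, n=10):
    """Return the first n distinct titles, preserving neighbor order."""
    seen = set()
    unique = []
    for title in titles:
        key = title.strip().lower()  # treat case/whitespace variants as the same title
        if key not in seen:
            seen.add(key)
            unique.append(title)
        if len(unique) == n:
            break
    return unique

# Made-up neighbor titles, closest first.
neighbors = ["project manager", "Project Manager ", "desktop support",
             "project manager", "web developer"]
print(top_unique(neighbors, n=3))
```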
