# vectorizing job offers

## aim
- cluster job offeres by similarity based on a dictionary of skills

## outline
- preprocess doc2vec with full job offers
- train model
- test similarity of job descriptions
- cluster offers (Kmeans, KNN)

## outcome
unicorns in a meadow

In [133]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import gensim
import joblib
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from pprint import pprint
import math
import multiprocessing
import gensim.models.doc2vec
import time
import json

%matplotlib inline

In [4]:
df = joblib.load('../../../raw_data/processed_data.joblib')

In [5]:
df.shape

(7859, 14)

In [6]:
df['tag_language'] = df['tag_language'].fillna(value='en')

In [7]:
df.head(3)

Unnamed: 0,job_title,job_text,company,location,job_info,query_text,source,job_link,tag_language,reviews,job_info_tokenized,job_text_tokenized,job_text_tokenized_titlecase,job_title_tokenized
0,(Junior) Data Engineer (f/m/x),Customlytics ist die führende App Marketing Be...,Customlytics GmbH,Berlin,(Junior) Data Engineer (f/m/x)\nCustomlytics G...,data science,scrape_json,,en,,"[junior, data, engineer, fmx, customlytics, gm...","[customlytics, ist, die, führende, app, market...","[Customlytics, ist, die, führende, App, Market...","[junior, data, engineer, fmx]"
1,,Responsibilities\n\nAs working student (m/f/x)...,Aroundhome,Berlin,Aroundhome6 Bewertungen - Berlin,data science,scrape_json,,en,,"[aroundhome, bewertungen, berlin]","[responsibilities, as, working, student, mfx, ...","[Responsibilities, As, working, student, mfx, ...",[]
2,,Aufgaben\nAls Werkstudent (m/w/d) IT arbeitest...,Aroundhome,Berlin,"Aroundhome6 Bewertungen - Berlin\nTeilzeit, Pr...",data science,scrape_json,,de,,"[aroundhome, bewertungen, berlin, teilzeit, pr...","[aufgaben, als, werkstudent, mwd, it, arbeites...","[Aufgaben, Als, Werkstudent, mwd, IT, arbeites...",[]


In [8]:
# select english jobs
df_eng = df.copy()
df_eng = df_eng[df_eng['tag_language'] == 'en']
df_eng.reset_index(inplace=True)
df_eng.drop(columns='index', inplace=True)

In [9]:
df_eng.head()

Unnamed: 0,job_title,job_text,company,location,job_info,query_text,source,job_link,tag_language,reviews,job_info_tokenized,job_text_tokenized,job_text_tokenized_titlecase,job_title_tokenized
0,(Junior) Data Engineer (f/m/x),Customlytics ist die führende App Marketing Be...,Customlytics GmbH,Berlin,(Junior) Data Engineer (f/m/x)\nCustomlytics G...,data science,scrape_json,,en,,"[junior, data, engineer, fmx, customlytics, gm...","[customlytics, ist, die, führende, app, market...","[Customlytics, ist, die, führende, App, Market...","[junior, data, engineer, fmx]"
1,,Responsibilities\n\nAs working student (m/f/x)...,Aroundhome,Berlin,Aroundhome6 Bewertungen - Berlin,data science,scrape_json,,en,,"[aroundhome, bewertungen, berlin]","[responsibilities, as, working, student, mfx, ...","[Responsibilities, As, working, student, mfx, ...",[]
2,Full Stack Developer (m/f/d),We’re Phiture: a leading mobile growth consult...,Phiture,BerlinKreuzberg,Full Stack Developer (m/f/d)\nPhiture - Berlin...,data science,scrape_json,,en,,"[full, stack, developer, mfd, phiture, berlink...","[were, phiture, a, leading, mobile, growth, co...","[Were, Phiture, a, leading, mobile, growth, co...","[full, stack, developer, mfd]"
3,,"We are 18,000+ employees strong, operating in ...",PRA Health Sciences,Berlin,PRA Health Sciences - Berlin,data science,scrape_json,,en,,"[pra, health, sciences, berlin]","[we, are, employees, strong, operating, in, mo...","[We, are, employees, strong, operating, in, mo...",[]
4,Head of Finance,Head of Finance (m/f/d)\nAt Home our mission i...,Home HT GmbH,Berlin,Head of Finance\nHome HT GmbH2 Bewertungen - B...,data science,scrape_json,,en,,"[head, of, finance, home, ht, gmbh, bewertunge...","[head, of, finance, mfd, at, home, our, missio...","[Head, of, Finance, mfd, At, Home, our, missio...","[head, of, finance]"


In [10]:
df_eng.shape

(7547, 14)

In [11]:
# join strings
def join_strings(text):
    return ' '.join(text)

In [12]:
# lemmatize
def lemmatize_words(word):
    lemmatizer = WordNetLemmatizer()
    lemmatized = lemmatizer.lemmatize(word)

    return lemmatized

In [13]:
# remove stopwords
def remove_stopwords(text):

    stop_words = set(stopwords.words('english'))
    word_tokens = word_tokenize(text) 
    text = [w for w in word_tokens if not w in stop_words] 
  
    return text

#['heute', 'weiter', 'zur', 'bewerbung', 'diesen', 'job', 'melden']

In [14]:
# process text
df_eng['clean'] = df_eng['job_text_tokenized'].apply(join_strings).apply(lemmatize_words)\
    .apply(remove_stopwords)

## model doc2vec 1

Conclusions :)
- ~ 700 offers - 100 epocs
    - model performs ok, but tends to cluster according to company
    - texts with very high similarity (> 0.90) are likely to be duplicated job adds
    - looks like the model first shows offers based on duplicates, then company, then position (probably because of semantics)


- 2500 offers - 150 epocs
    - still clusters by company
    - add more data? or try bigrams

In [110]:
# tag texts
texts = df_eng['clean']
texts_tagged = [TaggedDocument(text, tags=['tag_'+str(tag)]) for tag, text in enumerate(texts)]
texts_tagged[0]

TaggedDocument(words=['customlytics', 'ist', 'die', 'führende', 'app', 'marketing', 'beratungsagentur', 'aus', 'berlin', 'wir', 'bieten', 'consulting', 'und', 'handson', 'support', 'rund', 'um', 'app', 'marketing', 'strategie', 'produktmanagement', 'analytics', 'crm', 'unser', 'team', 'erarbeitet', 'mit', 'unternehmen', 'jeder', 'größe', 'konzepte', 'zur', 'erfolgreichen', 'vermarktung', 'von', 'mobilen', 'apps', 'dabei', 'decken', 'wir', 'nicht', 'nur', 'das', 'gesamte', 'spektrum', 'infrastruktureller', 'marketingthemen', 'ab', 'wir', 'konzipieren', 'planen', 'und', 'steuern', 'sowohl', 'das', 'ui', 'ux', 'design', 'von', 'mobilen', 'apps', 'als', 'auch', 'performance', 'marketing', 'kampagnen', 'für', 'alle', 'app', 'verticals', 'über', 'uns', 'unser', 'data', 'team', 'braucht', 'unterstützung', 'du', 'bist', 'motiviert', 'und', 'von', 'der', 'mobile', 'industry', 'begeistert', 'dann', 'suchen', 'wir', 'dich', 'um', 'die', 'data', 'warehouselösungen', 'für', 'unsere', 'kunden', 'aus

In [111]:
# reduced dataset
texts_tagged_small = texts_tagged[:3000]

In [27]:
data_to_train = texts_tagged_small # texts_tagged_small, texts_tagged

# build vocabulary with CBOW (dm=0)
cores = multiprocessing.cpu_count()
model_dbow = Doc2Vec(documents=data_to_train,
                     dm=0,
                     alpha=0.025,
                     vector_size=len(data_to_train), 
                     min_count=1,
                     workers=cores)

# train the model
model_dbow.train(data_to_train, total_examples=model_dbow.corpus_count, epochs=15)

In [28]:
model_dbow.save('../../../models/doc2vec_3000_15_epochs')
#joblib.dump(model_dbow, filename='../../../models/doc2vec_3000_20_epochs.joblib' )

In [None]:
model_dbow.corpus_count

### test the model by hand and with copy-pasted text

**test model with texts in the database**

In [112]:
# load model
model_loaded = Doc2Vec.load('../../../models/doc2vec_3000_20_epochs')

In [131]:
def similar_jobs(tokenized_job, offers):
    ''' input: tokenized job offers, number of offers 
        returns tags of top x most similar job offers and similarity probabilities
    '''

    # infer vector from text 
    infer_vector = model_loaded.infer_vector(tokenized_job)
    # find similar offers
    similar_documents = model_loaded.docvecs.most_similar([infer_vector], topn = offers)

    return similar_documents

In [129]:
def print_top_jobs(text, offers=5):
    
    """ input: index of text in dataframe and number of offers we want to see
        prints text of the offers
    """
    
    tags = similar_jobs(text, offers)
    tags = [list(i) for i in tags]
    
    print(f"{tags}\n")
    print(f"{df_eng['job_title'][text_index], df_eng['company'][text_index], df_eng['job_text'][text_index]} \
        \n-------------END------------\n ")
    
    for tag in tags: 
        num = int(tag[0].strip('tag_'))
        
        print(f"{df_eng['job_title'][num], df_eng['company'][num], df_eng['job_text'][num]} \
        \n-------------END------------\n ") 
    

In [44]:
# print_top_jobs(400, 10) # duplicates 3050; 2020; 400
# #print_top_jobs(5000) 

**test model with copy-pasted job**

In [48]:
## change case to lower
def to_lower(text):
    return text.lower()

## remove numbers from the corpus
def remove_number(text):
    text = ''.join(word for word in text if not word.isdigit())
    
    return text

## remove special puncutation from text
def remove_punctuation(text):
    for punctuation in string.punctuation:
        text = text.replace(punctuation, '')
    
    return text

In [122]:
offer = """
You are a Data Scientist and enjoy finding evidence-based solutions to complex scientific questions? You would like to have the opportunity to use your knowledge in a profitable way and to complement your expertise at the same time? You also want start-up atmosphere, a great team and a job in a meaningful industry? Great, then read on!

November is a premium end-of-life brand. With a team of over 70 people, we develop and market innovative retirement products that relieve families of the organizational and financial burdens of a funeral while they are still alive. In the event of death, we organize funerals throughout Germany with an experienced team of morticians, decorators, florists and speakers. Our online portal is now used by circa 700,000 people per month and we count thousands of very satisfied customers.

Your mission:

You help develop an evidence-based model for decision making for our potential customers.
You identify all relevant data sources and prepare data for the specific analysis questions.
You are always on the lookout for new analysis methods especially in the environment of Big Data and Machine Learning.
You perform complex data analyses along our customer journey and develop data-driven algorithms.
You validate and evaluate the collected data and statistical models as well as the resulting recommendations for action that are relevant in various business areas.
You work closely with Novembers management level, as well as with various external subject matter experts, such as university contacts in the field of psychology & statistics.
You abolish the watering can principle in marketing - Through your work it will be possible to address our customers in an individualized and personalized way and to predict what the customer will want in the future.
Your competencies:

Successfully completed studies in data science, physics, engineering, (business) computer science, (business) mathematics, economics, or psychology with a strong quantitative focus and/or statistics
Professional experience in the field of Data Science, as well as hands-on experience with agile projects in all phases up to operational use
Extensive knowledge in the application of statistical methods (e.g. R, Python) as well as in the use of databases (e.g. SQL) and BI/Visualization tools
Knowledge in the application of machine learning, optimization and data mining methods, deep learning and AI (e.g. Python, Keras Tensorflow/Theano)
Outstanding analytical skills, as well as a confident use of data preparation, visualization and analysis methods, especially in the field of machine learning
Your English skills are very good, German skills are advantageous
You've read this far and are still excited? Great, then here comes our offer for you:

Responsible task in an exciting industry that has hardly been digitized so far
Room for independent work
An open corporate culture as well as a passionate team and a lot of fun at work
Attractive salary package and benefits (cooperation with Urban Sports Club, team lunch, etc.)
Home office option and spacious office with foosball in the heart of Berlin
Regular team events, free drinks, snacks and much more
"""

In [123]:
token_offer = to_lower(offer)
token_offer = remove_number(offer)
token_offer = remove_punctuation(offer)
token_offer = lemmatize_words(offer)
token_offer = remove_stopwords(offer)
#offer
token_offer[:]

['You',
 'Data',
 'Scientist',
 'enjoy',
 'finding',
 'evidence-based',
 'solutions',
 'complex',
 'scientific',
 'questions',
 '?',
 'You',
 'would',
 'like',
 'opportunity',
 'use',
 'knowledge',
 'profitable',
 'way',
 'complement',
 'expertise',
 'time',
 '?',
 'You',
 'also',
 'want',
 'start-up',
 'atmosphere',
 ',',
 'great',
 'team',
 'job',
 'meaningful',
 'industry',
 '?',
 'Great',
 ',',
 'read',
 '!',
 'November',
 'premium',
 'end-of-life',
 'brand',
 '.',
 'With',
 'team',
 '70',
 'people',
 ',',
 'develop',
 'market',
 'innovative',
 'retirement',
 'products',
 'relieve',
 'families',
 'organizational',
 'financial',
 'burdens',
 'funeral',
 'still',
 'alive',
 '.',
 'In',
 'event',
 'death',
 ',',
 'organize',
 'funerals',
 'throughout',
 'Germany',
 'experienced',
 'team',
 'morticians',
 ',',
 'decorators',
 ',',
 'florists',
 'speakers',
 '.',
 'Our',
 'online',
 'portal',
 'used',
 'circa',
 '700,000',
 'people',
 'per',
 'month',
 'count',
 'thousands',
 'satisfied

In [124]:
infer_vector = model_loaded.infer_vector(token_offer)
infer_vector

array([ 0.21694319, -0.18454333,  0.0015175 , ..., -0.141679  ,
        0.18379204,  0.06664702], dtype=float32)

In [125]:
similar_documents = model_loaded.docvecs.most_similar([infer_vector], topn = 5)

In [126]:
tags = similar_documents
tags = [list(i) for i in tags]

print(f"{tags}\n")
# print(f"{df_eng['job_title'][text_index], df_eng['company'][text_index], df_eng['job_text'][text_index]} \
#     \n-------------END------------\n ")

for tag in tags: 
    num = int(tag[0].strip('tag_'))

    print(f"{df_eng['job_title'][num], df_eng['company'][num], df_eng['job_text'][num]} \
    \n-------------END------------\n ") 


[['tag_511', 0.9291380047798157], ['tag_689', 0.9210711717605591], ['tag_718', 0.5408948659896851], ['tag_559', 0.5356857776641846], ['tag_1', 0.5032567381858826]]

('Data Scientist (m/w/d)', 'November', "You are a Data Scientist and enjoy finding evidence-based solutions to complex scientific questions? You would like to have the opportunity to use your knowledge in a profitable way and to complement your expertise at the same time? You also want start-up atmosphere, a great team and a job in a meaningful industry? Great, then read on!\nNovember is a premium end-of-life brand. With a team of over 70 people, we develop and market innovative retirement products that relieve families of the organizational and financial burdens of a funeral while they are still alive. In the event of death, we organize funerals throughout Germany with an experienced team of morticians, decorators, florists and speakers. Our online portal is now used by circa 700,000 people per month and we count thousands

## improve the model 1
Steps:
- filter out all words not in dictionary 
- train model
- get output and see if it's better

In [134]:
# import dictionary

with open('../fydjob/data/dicts/skills_dict.json') as json_file:
    dictionary = json.load(json_file)

# collapse dictionary into list

#skills_list = 

In [144]:
skills_list = []

for value in dictionary.items():
    print(value[1])

business
knowledge
programming
soft_skills
tech_adjectives


In [None]:
# filter tokens for skills

In [65]:
# train model # train_model.train(alldocs, total_examples=len(alldocs), epochs=epochs, start_alpha=0.025, end_alpha=0.001)

In [None]:
# save model

In [None]:
# test model

## Improve the model - 2

- try bigrams instead ot unigrams

In [None]:
texts_small = df_eng['clean'][:2500]

In [None]:
texts_small.head()

In [None]:
from gensim.models import Phrases
from gensim.models.phrases import Phraser
from gensim import models

In [None]:
bigram = Phrases(texts_small, min_count=1, threshold=2, delimiter=b' ')

bigram_phraser = Phraser(bigram)


In [None]:
bigram_token = []
for sent in texts_small:
    bigram_token.append(bigram_phraser[sent])
    
bigram_token[:20]