# vectorizing job offers

## aim
- cluster job offers by similarity based on a dictionary of skills

## outline
- preprocess full job offers for doc2vec
- train model
- test similarity of job descriptions
- cluster offers (Kmeans, KNN)

## outcome
unicorns in a meadow

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import gensim
import joblib
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from pprint import pprint


%matplotlib inline

In [3]:
df = joblib.load('../../../raw_data/indeed/ip_2021-03-03.joblib')

In [4]:
df.shape

(954, 11)

In [5]:
df.head(3)

Unnamed: 0,job_title,job_text,company,location,job_info,query_text,tag_language,job_info_tokenized,job_text_tokenized,job_text_tokenized_titlecase,job_title_tokenized
0,Data Scientist / Matching Engineer (m/w/d),You are responsible for improvement of Taledo’...,Taledo,Berlin,Data Scientist / Matching Engineer (m/w/d)\nTa...,data engineer,en,"[data, scientist, matching, engineer, mwd, tal...","[you, are, responsible, for, improvement, of, ...","[You, are, responsible, for, improvement, of, ...","[data, scientist, matching, engineer, mwd]"
1,(Junior) Data Engineer (f/m/x),Über die Stelle\nUnser Data Team braucht Unter...,Customlytics GmbH,Berlin,(Junior) Data Engineer (f/m/x)\nCustomlytics G...,data engineer,de,"[junior, data, engineer, fmx, customlytics, gm...","[über, die, stelle, unser, data, team, braucht...","[Über, die, Stelle, Unser, Data, Team, braucht...","[junior, data, engineer, fmx]"
2,Junior Data Scientist (m/w/d) docmetric,Kennziffer:\nreq6004\nStandort:\nBerlin\nJob S...,CompuGroup Medical,Berlin,Junior Data Scientist (m/w/d) docmetric\nCompu...,data engineer,de,"[junior, data, scientist, mwd, docmetric, comp...","[kennziffer, req, standort, berlin, job, segme...","[Kennziffer, req, Standort, Berlin, Job, Segme...","[junior, data, scientist, mwd, docmetric]"


In [6]:
# select English jobs and renumber the rows from 0
df_eng = df[df['tag_language'] == 'en'].reset_index(drop=True)

In [7]:
df_eng.head()

Unnamed: 0,job_title,job_text,company,location,job_info,query_text,tag_language,job_info_tokenized,job_text_tokenized,job_text_tokenized_titlecase,job_title_tokenized
0,Data Scientist / Matching Engineer (m/w/d),You are responsible for improvement of Taledo’...,Taledo,Berlin,Data Scientist / Matching Engineer (m/w/d)\nTa...,data engineer,en,"[data, scientist, matching, engineer, mwd, tal...","[you, are, responsible, for, improvement, of, ...","[You, are, responsible, for, improvement, of, ...","[data, scientist, matching, engineer, mwd]"
1,Senior Software Engineer - Data Platform,We are looking for a Senior Software Engineer ...,Zalando SE,Berlin,Senior Software Engineer - Data Platform\nZala...,data engineer,en,"[senior, software, engineer, data, platform, z...","[we, are, looking, for, a, senior, software, e...","[We, are, looking, for, a, Senior, Software, E...","[senior, software, engineer, data, platform]"
2,Senior Software Engineer - Data Platform,We are looking for a Senior Software Engineer ...,Zalando,Berlin,Senior Software Engineer - Data Platform\nZala...,data engineer,en,"[senior, software, engineer, data, platform, z...","[we, are, looking, for, a, senior, software, e...","[We, are, looking, for, a, Senior, Software, E...","[senior, software, engineer, data, platform]"
3,Senior Data Engineer (m/w/t),"As a member of the Data Engineering Team, you ...",Quandoo GmbH,Berlin,Senior Data Engineer (m/w/t)\nQuandoo GmbH17 B...,data engineer,en,"[senior, data, engineer, mwt, quandoo, gmbh, b...","[as, a, member, of, the, data, engineering, te...","[As, a, member, of, the, Data, Engineering, Te...","[senior, data, engineer, mwt]"
4,Data Engineer (w/m/d),We are digitty.io – an international start-up ...,digitty.io,Berlin,Data Engineer (w/m/d)\ndigitty.io - Berlin,data engineer,en,"[data, engineer, wmd, digittyio, berlin]","[we, are, digittyio, an, international, startu...","[We, are, digittyio, an, international, startu...","[data, engineer, wmd]"


In [8]:
# join strings
def join_strings(text):
    return ' '.join(text)

In [9]:
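# lemmatize
# note: WordNetLemmatizer.lemmatize() expects a single word; called on a whole
# joined text (as in the processing step below) it in practice returns the text unchanged
def lemmatize_words(word):
    lemmatizer = WordNetLemmatizer()
    lemmatized = lemmatizer.lemmatize(word)

    return lemmatized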

In [10]:
# remove stopwords
def remove_stopwords(text):

    stop_words = set(stopwords.words('english'))
    word_tokens = word_tokenize(text)
    text = [w for w in word_tokens if w not in stop_words]
  
    return text

#['heute', 'weiter', 'zur', 'bewerbung', 'diesen', 'job', 'melden']

In [11]:
# process text
df_eng['clean'] = df_eng['job_text_tokenized'].apply(join_strings).apply(lemmatize_words)\
    .apply(remove_stopwords)
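
Since `job_text_tokenized` already holds a list of tokens, the same cleaning could also be done per token, without joining and re-tokenizing. The following is only a minimal sketch of that idea: `clean_tokens`, `EXAMPLE_STOP_WORDS` and `strip_plural` are hypothetical names, the stop-word set is a tiny illustrative stand-in for nltk's English list, and the toy lemmatizer only strips a trailing "s".

```python
from typing import Callable, Iterable, List, Set

# Tiny illustrative stop-word set (an assumption); the notebook itself uses
# nltk's stopwords.words('english').
EXAMPLE_STOP_WORDS: Set[str] = {"you", "are", "for", "of", "the", "a", "and"}


def clean_tokens(
    tokens: Iterable[str],
    stop_words: Set[str] = EXAMPLE_STOP_WORDS,
    lemmatize: Callable[[str], str] = lambda token: token,
) -> List[str]:
    """Drop stop words, then lemmatize each remaining token individually."""
    return [lemmatize(token) for token in tokens if token not in stop_words]


def strip_plural(token: str) -> str:
    """Hypothetical toy lemmatizer: removes one trailing 's'."""
    return token[:-1] if token.endswith("s") else token


tokens = ["you", "are", "responsible", "for", "improvement", "of", "models"]
print(clean_tokens(tokens))
print(clean_tokens(tokens, lemmatize=strip_plural))
```

With nltk available, `WordNetLemmatizer().lemmatize` could be passed as `lemmatize`, so that each token is lemmatized on its own.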

In [2]:
df_eng['clean']

NameError: name 'df_eng' is not defined

## model doc2vec

Conclusions :)
- model performs OK, but tends to group offers by company
- texts with very high similarity are duplicated job ads (maybe filter out similarity > 0.95?)
- the model seems to rank duplicates first, then offers from the same company, then offers for the same position (probably because of semantics)
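
The filtering idea from the second point could look roughly like the sketch below. It assumes results shaped like the `(tag, similarity)` pairs returned by `most_similar`; `drop_near_duplicates` is a hypothetical helper, the scores are made up for illustration, and 0.95 is the tentative threshold mentioned above, not a tuned value.

```python
from typing import List, Optional, Tuple

Result = Tuple[str, float]


def drop_near_duplicates(
    results: List[Result],
    threshold: float = 0.95,
    query_tag: Optional[str] = None,
) -> List[Result]:
    """Keep hits at or below the threshold and skip the query's own tag."""
    return [
        (tag, score)
        for tag, score in results
        if tag != query_tag and score <= threshold
    ]


# Made-up scores, for illustration only
example_results = [("tag_a", 0.99), ("tag_b", 0.96), ("tag_c", 0.80), ("tag_d", 0.31)]
filtered = drop_near_duplicates(example_results)
print(filtered)
```

Passing `query_tag` additionally removes the query document itself, which otherwise tends to come back as its own best match.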

In [13]:
# tag texts
texts = df_eng['clean']
texts_tagged = [TaggedDocument(text, tags=['tag_'+str(tag)]) for tag, text in enumerate(texts)]
texts_tagged[0]

TaggedDocument(words=['responsible', 'improvement', 'taledos', 'search', 'matching', 'engine', 'candidates', 'jobs', 'business', 'drivers', 'data', 'science', 'develop', 'compare', 'different', 'algorithmic', 'approaches', 'andor', 'ml', 'models', 'monitor', 'production', 'performance', 'measure', 'success', 'work', 'update', 'outdated', 'models', 'research', 'discuss', 'algorithmical', 'well', 'model', 'improvements', 'regularly', 'knowledgeable', 'developed', 'ai', 'community', 'propose', 'whats', 'applicable', 'taledo', 'expect', 'curious', 'nature', 'like', 'solve', 'challenging', 'problems', 'proficient', 'python', 'worked', 'relevant', 'libraries', 'know', 'use', 'data', 'handling', 'numpy', 'pandas', 'dask', 'psycopg', 'mldl', 'scikitlearn', 'xgboost', 'keras', 'pytorch', 'spacy', 'visualization', 'seaborn', 'matplotlib', 'experience', 'evaluating', 'different', 'approaches', 'choosing', 'appropriate', 'metric', 'worked', 'search', 'matching', 'nlp', 'using', 'various', 'approac

In [14]:
len(texts_tagged)

707

In [15]:
# instantiate the model with PV-DBOW (dm=0); passing documents= also builds the vocabulary and runs an initial training
model_dbow = Doc2Vec(documents=texts_tagged,
                     dm=0,
                     alpha=0.025,
                     vector_size=len(texts_tagged), 
                     min_count=1)

In [16]:
model_dbow.corpus_count

707

In [17]:
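# train the model
# note: each train() call below runs epochs=10 full passes over the corpus,
# so the loop makes 100 passes in total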
for epoch in range(10):
    if epoch % 2 == 0:
        print(f'training epoch {epoch}')
    model_dbow.train(texts_tagged, total_examples=model_dbow.corpus_count, epochs=10)

training epoch 0
training epoch 2
training epoch 4
training epoch 6
training epoch 8
training epoch 10
training epoch 12
training epoch 14
training epoch 16
training epoch 18


In [18]:
def find_similar_jobs(tokenized_job):
    ''' input: one tokenized job offer (list of tokens)
        returns the tags of the 5 most similar job offers with their cosine similarities
    '''
    
    # infer a vector for the tokenized text
    infer_vector = model_dbow.infer_vector(tokenized_job)
    # find the most similar training documents (gensim >= 4.0 renames docvecs to dv)
    similar_documents = model_dbow.docvecs.most_similar([infer_vector], topn=5)
    
    return similar_documents
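
The scores returned by `most_similar` are cosine similarities between document vectors, so they range from -1 to 1 and are not probabilities. As a rough illustration of what is being computed, here is a plain-Python sketch; `cosine_similarity` is a hypothetical helper, not the gensim implementation.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # same direction: 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal: 0.0
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction: -1.0
```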

In [19]:
input_text_index = 0
find_similar_jobs(texts[input_text_index])

[('tag_0', 0.9779205322265625),
 ('tag_192', 0.7734708786010742),
 ('tag_273', 0.302062064409256),
 ('tag_396', 0.2912944257259369),
 ('tag_277', 0.28880682587623596)]
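
Because the tags were built as `'tag_' + str(position)` from `enumerate(texts)`, each tag can be mapped back to a row of `df_eng`. A small sketch of that lookup, using the result shown above; `tag_to_index` is a hypothetical helper.

```python
from typing import List, Tuple


def tag_to_index(tag: str, prefix: str = "tag_") -> int:
    """Convert a document tag such as 'tag_192' back to its row position."""
    if not tag.startswith(prefix):
        raise ValueError(f"unexpected tag format: {tag!r}")
    return int(tag[len(prefix):])


# Result copied from the cell output above
similar: List[Tuple[str, float]] = [
    ("tag_0", 0.9779205322265625),
    ("tag_192", 0.7734708786010742),
    ("tag_273", 0.302062064409256),
    ("tag_396", 0.2912944257259369),
    ("tag_277", 0.28880682587623596),
]

indices = [tag_to_index(tag) for tag, _ in similar]
print(indices)  # [0, 192, 273, 396, 277]
```

In the notebook, `df_eng.loc[indices, ['job_title', 'company']]` would then list the matching offers in one step, since the index of `df_eng` was reset after filtering.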

In [20]:
num = 0
df_eng['job_title'][num], df_eng['company'][num], df_eng['job_text'][num]

('Data Scientist / Matching Engineer (m/w/d)',
 'Taledo',
 "You are responsible for improvement of Taledo’s Search / Matching engine between candidates and jobs and other business drivers through Data Science. You develop and compare different algorithmic approaches and/or ML models. You monitor production performance to measure the success of your work. You update outdated models. You research and discuss algorithmical as well as model improvements regularly. You are knowledgeable about what is being developed in the AI community and propose what’s applicable to Taledo.\n\nWhat we expect:\nYou are curious by nature. You like to solve challenging problems\nYou are proficient in python\nYou have worked with relevant libraries and know when to use which: data handling (numpy, pandas, dask, psycopg2), ML/DL (scikit-learn, xgboost, keras, pytorch, spacy) and visualization (seaborn, matplotlib)\nYou have experience evaluating different approaches and choosing an appropriate metric\nYou have

In [21]:
num = 192
df_eng['job_title'][num], df_eng['company'][num], df_eng['job_text'][num]

('',
 'Taledo',
 "Head of Data is responsible for driving Taledo’s business through data and taking full ownership for that. In this regard, you know how to manage yourself and the data scientist who will collaborate with you (e.g. 1on1s). By strategically thinking about the data department, you define an impactful roadmap, deliver it and measure the success of your work. You are operationally involved in retrieving insights from data, to guide our product and business development teams to success. You are conducting workshops with recruiting experts to improve the matching engine. You assist the data science modelling with insights and evaluation, ideally you can support the data scientist also operationally.\n\nWhat we expect:\nYou have a great success story in the field of Data\nYou are confident in joining a grown product with the aim to take full ownership\nYou like to generate insights from data and drive business with it (KPIs, cohors, segmentation, exploration). Experience with