In [209]:
from utils.load import load_data
from utils.transform import process_text, TfidfVectorizer
from utils.evaluate import cosine_similarity

In [243]:
data = load_data(file_name="potential-talents.xlsx", folder_name="data")
print(data.info(), "\n")
print(data.describe(), "\n")
data.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 104 entries, 0 to 103
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   id          104 non-null    int64  
 1   job_title   104 non-null    object 
 2   location    104 non-null    object 
 3   connection  104 non-null    object 
 4   fit         0 non-null      float64
dtypes: float64(1), int64(1), object(3)
memory usage: 4.2+ KB
None 

               id  fit
count  104.000000  0.0
mean    52.500000  NaN
std     30.166206  NaN
min      1.000000  NaN
25%     26.750000  NaN
50%     52.500000  NaN
75%     78.250000  NaN
max    104.000000  NaN 



Unnamed: 0,id,job_title,location,connection,fit
0,1,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,
1,2,Native English Teacher at EPIK (English Progra...,Kanada,500+,
2,3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,
3,4,People Development Coordinator at Ryan,"Denton, Texas",500+,
4,5,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+,


The id column is just a row identifier and is not relevant to a candidate's fitness for any role.
Although the job_title and location columns appear to be correlated, job_title seems to be the only column relevant for determining fitness for a particular role, based on the column values and what we know about the requirements.

Therefore, only the job_title column will be used in the ranking procedure. The other columns will still be returned in the result so that the user (i.e. the client) has the full information about each relevant candidate.
The fit column, which is currently empty (0 non-null values), will be filled with a fitness score for each candidate later.

1. Get a fitness score (i.e. a relevance score scaled to between 0 and 1) for each candidate based on the keywords ("Aspiring human resources", "seeking human resources").
- Only look at the job_title column and ignore the rest, because it is unclear whether the other columns are relevant to fitness for the human resources roles that the client is looking to fill.
- This can be treated as an Information Retrieval problem: the job titles are the documents, the keywords are the queries, and the relevance score between each document and the queries becomes the fitness score for that document.
- Transform each job title and keyword into a vector representation, compute a cosine similarity (or dot product, or Euclidean distance) between them, and use it as the relevance (or fitness) score.
- An inverted index could be tried for fast search and ranked retrieval.
- tf-idf or latent semantic indexing can be used for the vector representation.

2. Rank candidates according to fitness scores.

3. Build a pipeline on top of the above to take user feedback (i.e. 'star') and re-rank candidates based on it.
* Taking feedback and returning re-ranked candidates (or just re-ranking candidates) can be in the form of a method/function of a class.

4. Additionally, determine a cut-off point to filter out candidates who should not be on the list in the first place. We can take either 1) a document selection strategy, where each document (i.e. a job_title) is judged relevant or not based on absolute relevance, which is essentially a binary classification task (1 is relevant, 0 is not), or 2) a document ranking strategy, where documents are ranked by relative relevance and selected using a cut-off relevance score, which is a regression problem.
* This would also work for other roles without losing high potential candidates since the only change in the overall procedures/pipelines will be different job titles and keywords as input.
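
Steps 1, 2, and 4 above can be sketched end to end on a toy example. This is only a minimal sketch under stated assumptions, not the pipeline used in this notebook: it uses raw word counts instead of tf-idf, it averages the similarity over the keywords so the score stays between 0 and 1, and the names `bag_of_words`, `cosine`, `rank_candidates`, and the `cutoff` parameter are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercase a text and count its whitespace-separated tokens."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty text is similar to nothing
    return dot / (norm_a * norm_b)


def rank_candidates(titles, queries, cutoff=0.0):
    """Score each title against all queries, keep scores above cutoff, sort descending."""
    query_vectors = [bag_of_words(q) for q in queries]
    scored = []
    for title in titles:
        title_vector = bag_of_words(title)
        # Average over the queries so the final score stays between 0 and 1.
        score = sum(cosine(title_vector, q) for q in query_vectors) / len(query_vectors)
        if score > cutoff:
            scored.append((title, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


toy_titles = [
    "Aspiring Human Resources Professional",
    "Student at Chapman University",
    "Seeking Human Resources Opportunities",
]
toy_queries = ["Aspiring human resources", "seeking human resources"]
ranked = rank_candidates(toy_titles, toy_queries, cutoff=0.0)
for title, score in ranked:
    print(f"{score:.3f}  {title}")
```

With a cut-off of 0, the title that shares no word with either keyword is dropped and the two human resources titles remain. Swapping in different keywords is the only change needed to score candidates for another role.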

What is the disadvantage of GloVe embedding?
One of the main disadvantages of Word2Vec and GloVe embeddings is that they are unable to encode unknown or out-of-vocabulary words. To deal with this problem, Facebook proposed FastText, an extension of Word2Vec that uses the same Skip-gram and CBOW architectures but represents each word as a bag of character n-grams, so a vector can still be built for a word that was not seen during training.
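
To illustrate why a subword model can handle out-of-vocabulary words, here is a minimal sketch of character n-gram extraction with `<` and `>` as word-boundary markers. The function name `char_ngrams` is hypothetical and this is not gensim's implementation; the default lengths of 3 to 6 mirror FastText's usual defaults.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of a word, with '<' and '>' as boundary markers."""
    padded = f"<{word}>"
    ngrams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(padded) - n + 1):
            ngrams.append(padded[start:start + n])
    return ngrams


print(char_ngrams("where", n_min=3, n_max=3))
# → ['<wh', 'whe', 'her', 'ere', 're>']

# An unseen word still shares n-grams with words seen during training.
shared = set(char_ngrams("resourcing")) & set(char_ngrams("resources"))
print(sorted(shared)[:5])
```

Because an unseen word such as "resourcing" shares many n-grams with "resources", a vector for it can be assembled from n-gram vectors learned during training.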

In [211]:
keywords = ["Aspiring human resources", "seeking human resources"]

In [244]:
job_title_words = list(set(" ".join(data['job_title']).split()))
hr_words = [word for word in job_title_words if "HR" in word]
hr_words

['CHRO,', 'GPHR', 'SPHR', 'HR', 'HRIS']

In [247]:
from utils.transform import convert_terms
hr_terms_dict = {'CHRO,': 'Chief Human Resources Officer,',
                'GPHR': 'Global Professional in Human Resources',
                'SPHR': 'Senior Professional in Human Resources',
                'HR': 'Human Resources',
                'HRIS': 'Human Resources Information System',
                'People': 'Human'} # this is for titles like 'People Development Coordinator at Ryan'.

for i, job_title in enumerate(data['job_title']):
    converted = []
    for word in job_title.split():
        converted.append(convert_terms(word, hr_terms_dict))
    data.loc[i, 'job_title'] = " ".join(converted)

Similar conversions can be done for terms like "staff*", "employ*", but we will leave the decision to domain experts and only convert terms that specifically include "HR" as above.
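
The implementation of `convert_terms` lives in `utils.transform` and is not shown here. A minimal sketch of what such a helper might look like, assuming it does an exact dictionary lookup on each whitespace-separated token (the name `convert_terms_sketch` and the toy dictionary are hypothetical):

```python
def convert_terms_sketch(word, terms_dict):
    """Return the expansion of `word` if it is a key of `terms_dict`, else the word unchanged."""
    return terms_dict.get(word, word)


toy_terms = {
    "HR": "Human Resources",
    "SPHR": "Senior Professional in Human Resources",
}
title = "HR Manager and SPHR"
converted = " ".join(convert_terms_sketch(word, toy_terms) for word in title.split())
print(converted)
# → Human Resources Manager and Senior Professional in Human Resources
```

Under this assumption a token only matches if it is identical to a key, which is presumably why the dictionary above uses the key 'CHRO,' with its trailing comma.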

In [248]:
tfidf_args = {'strip_accents':'unicode',
              'lowercase':True,
              'stop_words':'english',
              'ngram_range':(1,3)}
tfidf_vectorizer = TfidfVectorizer(**tfidf_args)

job_title_processed = data['job_title'].apply(
    process_text,
    remove_stopwords=True,
    lemmitize=True,
    stem=True
)
keywords_processed = [process_text(keyword) for keyword in keywords]

data['fit_tfidf'] = cosine_similarity(tfidf_vectorizer.fit_transform(job_title_processed),
                                      tfidf_vectorizer.transform(keywords_processed)).sum(axis=1)

In [249]:
from tensorflow.keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(data['job_title'])
data['fit_keras_tokenizer'] = cosine_similarity(tokenizer.texts_to_matrix(data['job_title']),
                                                tokenizer.texts_to_matrix(keywords)).sum(axis=1)

In [250]:
import numpy as np
from gensim.test.utils import get_tmpfile
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

# glove file source: https://nlp.stanford.edu/projects/glove/
# Create temp file and save converted embedding into it
target_file = get_tmpfile('word2vec.6B.50d.txt')
glove2word2vec('data/glove/glove.6B.50d.txt', target_file)
glove_dimension = 50

# Load the converted embedding into memory
glove_vectors = KeyedVectors.load_word2vec_format(target_file)

# Transform keywords into glove vectors
glove_vectors_keywords = []
for keyword in keywords:
    keyword_vector = np.zeros(glove_dimension)
    for word in keyword.split():
        try:
            keyword_vector += glove_vectors[word.lower()]
        except KeyError:
            pass
    
    glove_vectors_keywords.append(keyword_vector)

# Transform job titles into glove vectors
glove_vectors_job_title = []
for job_title in data['job_title']:
    job_title_vector = np.zeros(glove_dimension)
    for word in job_title.split():
        try:
            job_title_vector += glove_vectors[word.lower()]
        except KeyError:
            pass

    glove_vectors_job_title.append(job_title_vector)

# Calculate cosine similarity score between job titles and keywords based on glove vectors and add it to the dataframe.
data['fit_glove'] = cosine_similarity(glove_vectors_job_title, glove_vectors_keywords).sum(axis=1)
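
The cell above sums word vectors rather than averaging them. For cosine similarity this makes no difference, because cosine is invariant to the length of the vectors. A minimal sketch with made-up 3-dimensional embeddings (the `toy_embeddings` values and helper names are hypothetical, and returning 0.0 for an all-zero vector is an assumption of this sketch, not necessarily what `utils.evaluate.cosine_similarity` does):

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def add_vectors(vectors, dim):
    """Element-wise sum of a list of vectors; a zero vector if the list is empty."""
    total = [0.0] * dim
    for vec in vectors:
        total = [t + x for t, x in zip(total, vec)]
    return total


toy_embeddings = {"human": [1.0, 0.0, 1.0], "resources": [0.0, 2.0, 1.0]}
query = [1.0, 1.0, 1.0]

words = ["human", "resources", "unknownword"]  # the last word is out of vocabulary
known = [toy_embeddings[w] for w in words if w in toy_embeddings]
summed = add_vectors(known, dim=3)
averaged = [x / len(known) for x in summed]

print(cosine(summed, query), cosine(averaged, query))

# A title made only of out-of-vocabulary words collapses to the zero vector.
all_unknown = add_vectors([], dim=3)
print(cosine(all_unknown, query))
```

The last case matters for this dataset: a job title whose words are all missing from the embedding vocabulary ends up as a zero vector, so it cannot be compared meaningfully with the keywords.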

In [251]:
from gensim.models import Word2Vec

word2vec = Word2Vec(sentences=data['job_title'].apply(lambda x: [word.lower() for word in x.split()]))
word2vec_dimension = word2vec.vector_size

# Transform keywords into word2vec vectors
word2vec_keywords = []
for keyword in keywords:
    keyword_vector = np.zeros(word2vec_dimension)
    for word in keyword.split():
        try:
            keyword_vector += word2vec.wv[word.lower()]
        except KeyError:
            pass
    
    word2vec_keywords.append(keyword_vector)

# Transform job titles into word2vec vectors
word2vec_job_title = []
for job_title in data['job_title']:
    job_title_vector = np.zeros(word2vec_dimension)
    for word in job_title.split():
        try:
            job_title_vector += word2vec.wv[word.lower()]
        except KeyError:
            pass

    word2vec_job_title.append(job_title_vector)

# Calculate cosine similarity score between job titles and keywords based on word2vec vectors and add it to the dataframe.
data['fit_word2vec'] = cosine_similarity(word2vec_job_title, word2vec_keywords).sum(axis=1)

In [252]:
from gensim.models.fasttext import FastText

fasttext = FastText(sentences=data['job_title'].apply(lambda x: [word.lower() for word in x.split()]))
fasttext_dimension = fasttext.vector_size

# Transform keywords into fasttext vectors
fasttext_keywords = []
for keyword in keywords:
    keyword_vector = np.zeros(fasttext_dimension)
    for word in keyword.split():
        try:
            keyword_vector += fasttext.wv[word.lower()]
        except KeyError:
            pass
    
    fasttext_keywords.append(keyword_vector)

# Transform job titles into fasttext vectors
fasttext_job_title = []
for job_title in data['job_title']:
    job_title_vector = np.zeros(fasttext_dimension)
    for word in job_title.split():
        try:
            job_title_vector += fasttext.wv[word.lower()]
        except KeyError:
            pass

    fasttext_job_title.append(job_title_vector)

# Calculate cosine similarity score between job titles and keywords based on fasttext vectors and add it to the dataframe.
data['fit_fasttext'] = cosine_similarity(fasttext_job_title, fasttext_keywords).sum(axis=1)

In [253]:
# Transform fit scores so that different fit scores will have the same range between 0 and 1.
# This is for easier comparisons among different fit scores.
from sklearn.preprocessing import MinMaxScaler

minmax_scaler = MinMaxScaler()
fit_columns = [col for col in data.columns if "fit_" in col]
data[fit_columns] = minmax_scaler.fit_transform(data[fit_columns])

data.describe()

Unnamed: 0,id,fit,fit_tfidf,fit_keras_tokenizer,fit_glove,fit_word2vec,fit_fasttext
count,104.0,0.0,104.0,104.0,104.0,104.0,104.0
mean,52.5,,0.328938,0.511697,0.639963,0.570685,0.516792
std,30.166206,,0.315271,0.34902,0.265058,0.311138,0.310942
min,1.0,,0.0,0.0,0.0,0.0,0.0
25%,26.75,,0.057644,0.1,0.473497,0.299533,0.183819
50%,52.5,,0.253802,0.591047,0.669484,0.674261,0.472658
75%,78.25,,0.497634,0.725639,0.87098,0.839401,0.768933
max,104.0,,1.0,1.0,1.0,1.0,1.0
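
For reference, min-max scaling maps each value x in a column to (x - min) / (max - min), which is why every fit_* column above has a minimum of 0 and a maximum of 1. A minimal pure-Python sketch of the same transformation for one column (the function name `min_max_scale` is hypothetical, and returning zeros for a constant column is an assumption of this sketch):

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly so the smallest is 0 and the largest is 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column carries no ranking information, so map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


raw_scores = [0.2, 0.5, 1.4]  # e.g. summed cosine similarities over two keywords
scaled = min_max_scale(raw_scores)
print(scaled)
```

Note that the scaling is relative to the candidates in the data: a scaled score of 1 only means "best among these candidates", not a perfect match.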


In [254]:
data['fit'] = data[fit_columns].sum(axis=1)

In [255]:
data[data['fit']==0].job_title.value_counts()

Series([], Name: job_title, dtype: int64)

In [256]:
data.sort_values('fit', ascending=False).head()

Unnamed: 0,id,job_title,location,connection,fit,fit_tfidf,fit_keras_tokenizer,fit_glove,fit_word2vec,fit_fasttext
29,30,Seeking Human Resources Opportunities,"Chicago, Illinois",390,4.882201,0.921024,1.0,1.0,0.988652,0.972524
27,28,Seeking Human Resources Opportunities,"Chicago, Illinois",390,4.882201,0.921024,1.0,1.0,0.988652,0.972524
98,99,Seeking Human Resources Position,"Las Vegas, Nevada Area",48,4.811019,0.913385,1.0,0.95578,0.988652,0.953203
16,17,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,4.773292,1.0,1.0,0.924198,0.849094,1.0
2,3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,4.773292,1.0,1.0,0.924198,0.849094,1.0


Looking at the job titles of the candidates with the highest fitness scores, they indeed look very relevant - in fact, each of them contains one of the exact keyword phrases, "Aspiring Human Resources" or "Seeking Human Resources".

In [257]:
data.sort_values('fit', ascending=True).head()

Unnamed: 0,id,job_title,location,connection,fit,fit_tfidf,fit_keras_tokenizer,fit_glove,fit_word2vec,fit_fasttext
34,35,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+,0.169533,0.0,0.0,0.11365,0.0,0.055883
47,48,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+,0.169533,0.0,0.0,0.11365,0.0,0.055883
4,5,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+,0.169533,0.0,0.0,0.11365,0.0,0.055883
22,23,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+,0.169533,0.0,0.0,0.11365,0.0,0.055883
84,85,RRP Brand Portfolio Executive at JTI (Japan To...,Greater Philadelphia Area,500+,0.246018,0.0,0.0,0.0,0.105107,0.140912


In [258]:
data[data['fit']==0].job_title.unique()

array([], dtype=object)

In [263]:
data.describe()

Unnamed: 0,id,fit,fit_tfidf,fit_keras_tokenizer,fit_glove,fit_word2vec,fit_fasttext,has_zero_scores
count,104.0,104.0,104.0,104.0,104.0,104.0,104.0,104.0
mean,52.5,2.568075,0.328938,0.511697,0.639963,0.570685,0.516792,0.25
std,30.166206,1.472122,0.315271,0.34902,0.265058,0.311138,0.310942,0.43511
min,1.0,0.169533,0.0,0.0,0.0,0.0,0.0,0.0
25%,26.75,1.372012,0.057644,0.1,0.473497,0.299533,0.183819,0.0
50%,52.5,2.619778,0.253802,0.591047,0.669484,0.674261,0.472658,0.0
75%,78.25,3.599888,0.497634,0.725639,0.87098,0.839401,0.768933,0.25
max,104.0,4.882201,1.0,1.0,1.0,1.0,1.0,1.0


No candidate has an overall fitness score of 0, even though the minimum of each individual fit column is 0. In other words, no candidate scores 0 under every method.

Add a filter column 'has_zero_scores' that flags candidates with at least one zero score among the fit_columns.

In [259]:
# Flag candidates (1) that have a score of exactly 0 in at least one of the fit columns.
data['has_zero_scores'] = (data[fit_columns] == 0).any(axis=1).astype(int)

In [260]:
data[data['has_zero_scores'] == 1].job_title.unique()

array(['Native English Teacher at EPIK (English Program in Korea)',
       'Advisory Board Member at Celal Bayar University',
       'Student at Chapman University',
       'Junior MES Engineer| Information Systems',
       'RRP Brand Portfolio Executive at JTI (Japan Tobacco International)',
       'Information Systems Specialist and Programmer with a love for data and organization.',
       'Bachelor of Science in Biology from Victoria University of Wellington',
       'Undergraduate Research Assistant at Styczynski Lab',
       'Lead Official at Western Illinois University',
       'Admissions Representative at Community medical center long beach',
       'Student at Westfield State University',
       'Student at Indiana University Kokomo - Business Management - Retail Manager at Delphi Hardware and Paint',
       'Student', 'Business Intelligence and Analytics at Travelers',
       'Always set them up for Success',
       'Director Of Administration at Excellence Logging'], dtype=object)

We can drop these candidates, as their job titles indeed seem irrelevant to our keywords, "Aspiring human resources" and "seeking human resources".

In [264]:
for fit_col in fit_columns:
    for i, job_title in enumerate(data.sort_values(fit_col).head().job_title.values):
        print(f"{fit_col} least fit job title {i+1}: {job_title}")
    print()

fit_tfidf least fit job title 1: Director Of Administration at Excellence Logging
fit_tfidf least fit job title 2: Native English Teacher at EPIK (English Program in Korea)
fit_tfidf least fit job title 3: Bachelor of Science in Biology from Victoria University of Wellington
fit_tfidf least fit job title 4: Student at Chapman University
fit_tfidf least fit job title 5: Advisory Board Member at Celal Bayar University

fit_keras_tokenizer least fit job title 1: Director Of Administration at Excellence Logging
fit_keras_tokenizer least fit job title 2: Native English Teacher at EPIK (English Program in Korea)
fit_keras_tokenizer least fit job title 3: Advisory Board Member at Celal Bayar University
fit_keras_tokenizer least fit job title 4: Native English Teacher at EPIK (English Program in Korea)
fit_keras_tokenizer least fit job title 5: Advisory Board Member at Celal Bayar University

fit_glove least fit job title 1: RRP Brand Portfolio Executive at JTI (Japan Tobacco International)
fi

Looking at the job titles with the worst fitness scores, each scoring method seems to perform fine - i.e. none of those 'worst' job titles seem relevant to our keywords.

# Add rank to each candidate based on fitness scores - set the ranks of all the candidates with a fitness score of 0 to 104 (i.e. data.shape[0]).
# train a ranking algorithm (learn to rank)

1. Instead of tf-idf, <b>try word2vec or GloVe</b>, which can capture nuanced meanings better.
2. If that is not working, one option is to replace "HR" with "Human resources" and "People" with "human" before the pre-processing steps.
3. Once I have a good proxy (i.e. a similarity score), rank the candidates by it. Sorting by score can use any sorting algorithm (e.g. bubble sort or quicksort), but a sorting algorithm does not learn anything. To learn how/why candidates were ranked that way, train a learning-to-rank model, feeding it the rankings rather than the scores.
4. Once I have the ranking model, I can feed feedback (i.e. manually starred candidates) into it to re-rank the candidates.

RankNet, LambdaRank, and LambdaMART
My first choice would probably be XGBoost, the extreme gradient boosting algorithm. The benefit here (apart from the fact that it often performs very well) is that you can choose the ranking objective easily by passing the objective parameter in your param dictionary. All three ranking objectives are LambdaMART-style gradient-boosted rankers and differ in what they optimize: 'objective: rank:pairwise' minimizes a pairwise (RankNet-style) loss, 'objective: rank:ndcg' optimizes NDCG, and 'objective: rank:map' optimizes MAP.
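
Since 'rank:ndcg' refers to Normalized Discounted Cumulative Gain, a minimal sketch of how NDCG is computed for one ranked list may help. This uses the linear-gain variant (a variant with gain 2^rel - 1 also exists); the function names and the toy relevance labels are hypothetical.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: relevance at rank r is divided by log2(r + 1)."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG of the given order divided by the DCG of the ideal (descending) order."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Relevance labels of candidates, listed in the order the model ranked them.
perfect_order = [3, 2, 1, 0]
reversed_order = [0, 1, 2, 3]
print(ndcg(perfect_order), ndcg(reversed_order))
```

A ranking that places the most relevant candidates first scores 1.0, and the score drops as relevant candidates are pushed further down the list.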


insertion sort, merge sort, and quicksort


Learning to rank (LTR) is the application of machine learning to the construction of ranking models for information retrieval systems. The training data consists of lists of items with a partial order induced by a numerical or ordinal score, or a binary judgment, for each item. The purpose of the model is to rank new, unseen lists in a way similar to the rankings in the training data.
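
To make the "partial order" concrete, here is a minimal sketch of how binary judgments (e.g. starred = 1, not starred = 0) could be turned into the pairwise preferences that a pairwise learning-to-rank model trains on. The function name `preference_pairs` and the candidate identifiers are hypothetical.

```python
from itertools import combinations


def preference_pairs(items_with_labels):
    """Return (preferred, other) pairs for every two items with different labels."""
    pairs = []
    for (item_a, label_a), (item_b, label_b) in combinations(items_with_labels, 2):
        if label_a > label_b:
            pairs.append((item_a, item_b))
        elif label_b > label_a:
            pairs.append((item_b, item_a))
        # Items with equal labels are left unordered: the order is only partial.
    return pairs


judged = [("candidate_17", 1), ("candidate_5", 0), ("candidate_30", 1)]
pairs = preference_pairs(judged)
print(pairs)
# → [('candidate_17', 'candidate_5'), ('candidate_30', 'candidate_5')]
```

The two starred candidates are each preferred over the unstarred one, but no preference is stated between them, which is exactly what makes the order partial.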
