### <b>Table of Contents</b>

0. Import functions

1. Load and explore data

2. Pre-process job titles

3. Get fitness scores

    3-1. TF-IDF
   
    3-2. TensorFlow Tokenizer

    3-3. GloVe
   
    3-4. Word2Vec
   
    3-5. FastText

4. Scale and evaluate fitness scores

5. Drop irrelevant candidates

6. Present initial list of top candidates

7. Prepare train and test data

8. Train ranking models
   
    8-1. XGBoost Ranker
   
    8-2. LGBM Ranker

9. Star ideal candidates

    9-1. Get ids of ideal candidates to star

    9-2. Create a copy of train and test data for re-ranking candidates based on stars

    9-3. Add a binary feature 'star' to training features

10. Re-rank all candidates

11. Re-train ranking models based on updated ranks and training features

12. Evaluate results

13. Save models for later use

14. Conclusion

### <b>0. Import functions</b>

In [1]:
from utils.load import load_data
from utils.split import split_data
from utils.transform import add_zero_score_col, update_ranks
from utils.process_text import get_relevant_terms, update_str_col, process_text, convert_words_to_vectors, get_word_vectors
from utils.predict import get_stats
from utils.evaluate import cosine_similarity, get_mean_stats
from utils.save import save_model

import pandas as pd
from pandas_profiling import ProfileReport
import random

from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_extraction.text import TfidfVectorizer

from tensorflow.keras.preprocessing.text import Tokenizer

from gensim.test.utils import get_tmpfile
from gensim.models import KeyedVectors, Word2Vec
from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.models.fasttext import FastText

from xgboost import XGBRanker
from lightgbm import LGBMRanker

import warnings
warnings.simplefilter("ignore", UserWarning)

### <b>1. Load and explore data</b>

In [2]:
data = load_data(file_name="potential-talents.xlsx", folder_name="data")
data.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 104 entries, 0 to 103
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   id          104 non-null    int64  
 1   job_title   104 non-null    object 
 2   location    104 non-null    object 
 3   connection  104 non-null    object 
 4   fit         0 non-null      float64
dtypes: float64(1), int64(1), object(3)
memory usage: 4.2+ KB
None 

               id  fit
count  104.000000  0.0
mean    52.500000  NaN
std     30.166206  NaN
min      1.000000  NaN
25%     26.750000  NaN
50%     52.500000  NaN
75%     78.250000  NaN
max    104.000000  NaN 



| | id | job_title | location | connection | fit |
|---|---|---|---|---|---|
| 0 | 1 | 2019 C.T. Bauer College of Business Graduate (... | Houston, Texas | 85 | NaN |
| 1 | 2 | Native English Teacher at EPIK (English Progra... | Kanada | 500+ | NaN |
| 2 | 3 | Aspiring Human Resources Professional | Raleigh-Durham, North Carolina Area | 44 | NaN |
| 3 | 4 | People Development Coordinator at Ryan | Denton, Texas | 500+ | NaN |
| 4 | 5 | Advisory Board Member at Celal Bayar University | İzmir, Türkiye | 500+ | NaN |


In [3]:
# ProfileReport(data)

The id column is just an index and carries no information about a candidate's fitness for a role.
Although the job_title and location columns are highly correlated, job_title appears to be the only column relevant to fitness, given the column values and what we know about the requirements.

Therefore, only the job_title column is used in the ranking procedures. The other columns are still returned in the result so that the user (i.e. the client) has the full information about each relevant candidate.
The fit column will be filled with a fitness score for each candidate in the next steps.

### <b>2. Pre-process job titles</b>

Expand human resources-related abbreviations so that job titles containing them receive meaningful fitness scores. Without this conversion, such titles might end up with a fitness score of 0 because most algorithms treat, for instance, "HR" and "Human Resources" as having nothing in common.

In [4]:
hr_words = get_relevant_terms(word_list=data['job_title'], term="HR")
hr_words

['SPHR', 'CHRO,', 'HRIS', 'GPHR', 'HR']

In [5]:
hr_terms_dict = {'CHRO,': 'Chief Human Resources Officer,',
                'GPHR': 'Global Professional in Human Resources',
                'SPHR': 'Senior Professional in Human Resources',
                'HR': 'Human Resources',
                'HRIS': 'Human Resources Information System',
                'People': 'Human'} # this is for job titles like 'People Development Coordinator'.

data = update_str_col(dataframe=data, column='job_title', mapping_dict=hr_terms_dict)

Similar conversions could be applied to terms such as "staff*" and "employ*", but that decision is left to domain experts. For now, only terms that specifically include "HR" are converted (plus "People", as noted in the mapping above).
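The abbreviation expansion described in this section can be sketched with a whole-word regular expression. This is only an illustration: `expand_terms` and `HR_TERMS` are hypothetical names, and the project's own `update_str_col` may work differently (for example, by plain string replacement).

```python
import re

# Hypothetical mapping that mirrors the abbreviations expanded in this section.
HR_TERMS = {
    "CHRO": "Chief Human Resources Officer",
    "GPHR": "Global Professional in Human Resources",
    "SPHR": "Senior Professional in Human Resources",
    "HRIS": "Human Resources Information System",
    "HR": "Human Resources",
}

# The \b anchors keep "HR" from matching inside longer tokens such as "HRIS".
_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, HR_TERMS)) + r")\b")


def expand_terms(title: str) -> str:
    """Replace each whole-word abbreviation with its expansion."""
    return _PATTERN.sub(lambda match: HR_TERMS[match.group(1)], title)


print(expand_terms("SVP, CHRO, Marketing | GPHR | SPHR"))
print(expand_terms("HRIS Analyst and HR Generalist"))
```

Matching on word boundaries is a design choice: it avoids rewriting the "HR" inside "SPHR" or "HRIS" a second time, which naive substring replacement could do depending on the order of the mapping.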

### <b>3. Get fitness scores</b>

Vectorize job titles and keywords using different word embedding techniques and calculate cosine similarity between the vectors. Higher cosine similarity would mean the job titles and keywords are more closely related.
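As a minimal sketch of the scoring idea, the snippet below computes cosine similarity in plain Python and sums it over both keyword vectors, as the cells in this section do with `.sum(axis=1)`. The toy count vectors and the helper names `cosine` and `fitness` are assumptions for illustration; they are not the project's `utils.evaluate.cosine_similarity`.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def fitness(title_vector, keyword_vectors):
    """Sum the similarity of one job title to every keyword phrase."""
    return sum(cosine(title_vector, kw) for kw in keyword_vectors)


# Toy count vectors over the vocabulary [aspiring, seeking, human, resources, student].
keywords = [[1, 0, 1, 1, 0],   # "aspiring human resources"
            [0, 1, 1, 1, 0]]   # "seeking human resources"
aspiring_hr = [1, 0, 1, 1, 0]  # "aspiring human resources"
student = [0, 0, 0, 0, 1]      # "student"

print(fitness(aspiring_hr, keywords))  # high: overlaps with both keywords
print(fitness(student, keywords))      # 0.0: no shared terms
```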

##### <b>3-1. TF-IDF</b>

TF-IDF (Term Frequency-Inverse Document Frequency) quantifies how relevant a word is to a document in a collection of documents or corpus.
* Word < Document < Corpus (Collection of Documents)

For instance, if a word appears many times in a document but also appears frequently across the other documents, it gets a low TF-IDF score: the word is so common everywhere that it says little about that particular document. An example is "the", which is not meaningful to any particular document.

Conversely, if a word appears many times in a document but rarely in other documents, the word is important to that particular document. In other words, the word and the document are considered highly related.

It is also often important to pre-process the text (e.g. removing stop words, lemmatizing) before vectorizing with TF-IDF in order to get better, or more useful, results.
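To make the weighting concrete, here is a small pure-Python sketch. It assumes the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, which is the default in scikit-learn's `TfidfVectorizer`, and it skips the L2 normalization that the vectorizer applies afterwards. The three toy documents and the helper name `tfidf_weights` are made up for illustration.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return per-document {term: tf * idf} weights and the idf table."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({term: tf[term] * idf[term] for term in tf})
    return weights, idf


docs = [
    "the aspiring human resources professional",
    "the student at chapman university",
    "the director of administration",
]
weights, idf = tfidf_weights(docs)

# "the" occurs in every document, so it gets the smallest possible idf.
print(idf["the"], idf["resources"])
```

A term that appears in every document receives the minimum weight, while a term unique to one document is weighted higher, which is the behavior described above.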

In [6]:
tfidf_args = {'strip_accents':'unicode',
              'lowercase':True,
              'stop_words':'english',
              'ngram_range':(1,3)}
tfidf_vectorizer = TfidfVectorizer(**tfidf_args)

job_title_processed_tfidf = data['job_title'].apply(
    process_text,
    remove_stopwords=True,
    lemmatize=True,
    stem=True
)

keywords = ["Aspiring human resources", "seeking human resources"]
keywords_processed_tfidf = [process_text(keyword) for keyword in keywords]

data['fit_tfidf'] = cosine_similarity(tfidf_vectorizer.fit_transform(job_title_processed_tfidf),
                                      tfidf_vectorizer.transform(keywords_processed_tfidf)).sum(axis=1)

For the other vectorizers used later, stop words are still removed since they add little meaning. However, the text is not lemmatized or stemmed, because such pre-processing can worsen results for algorithms that rely on a pre-defined vocabulary, dictionary, or corpus: they cannot provide a meaningful word embedding when a lemma or stem is not found in the collection of words they refer to.

In [7]:
job_title_processed = data['job_title'].apply(
    process_text,
    remove_stopwords=True,
    lemmatize=False,
    stem=False
)
keywords_processed = [process_text(
    keyword,
    remove_stopwords=True,
    lemmatize=False,
    stem=False
    ) for keyword in keywords]

##### <b>3-2. TensorFlow Tokenizer</b>

Tokenization is the process of breaking up a string into tokens. Commonly, these tokens are words, numbers, and/or punctuation.

TensorFlow's Tokenizer class allows users to vectorize a text corpus by turning each text into either a sequence of integers (each integer being the index of a token in a dictionary) or into a vector where the coefficient for each token could be binary, based on word count, based on TF-IDF, etc.

By default, all punctuation is removed, turning the texts into space-separated sequences of words (words may include the ' character). These sequences are then split into lists of tokens. They will then be indexed or vectorized.
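The indexing and vectorizing steps can be imitated in a few lines of plain Python. This is a sketch of the idea only, not the Keras implementation: it assumes whitespace tokenization, indices assigned by descending frequency starting at 1 (index 0 left unused), and the binary mode of `texts_to_matrix`. The helper names are hypothetical.

```python
from collections import Counter


def fit_vocabulary(texts):
    """Map each word to an index, most frequent first; index 0 stays unused."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_binary_matrix(texts, vocab):
    """One row per text; a cell is 1 if the word with that index occurs in the text."""
    width = len(vocab) + 1
    rows = []
    for text in texts:
        row = [0] * width
        for word in text.lower().split():
            if word in vocab:  # words outside the fitted vocabulary are ignored
                row[vocab[word]] = 1
        rows.append(row)
    return rows


titles = ["seeking human resources position", "aspiring human resources professional"]
vocab = fit_vocabulary(titles)
matrix = texts_to_binary_matrix(titles, vocab)
print(vocab)
print(matrix)
```

Because keywords are transformed with the vocabulary fitted on the job titles, any keyword term that never occurs in a job title contributes nothing to the similarity.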

In [8]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(job_title_processed) # fit_on_texts updates internal vocabulary based on a list of texts; similar to tf-idf.
data['fit_keras_tokenizer'] = cosine_similarity(tokenizer.texts_to_matrix(job_title_processed),
                                                tokenizer.texts_to_matrix(keywords_processed)).sum(axis=1)

Moving on, we are going to use Gensim.<br>
Gensim is an open-source library for unsupervised topic modeling, document indexing, retrieval by similarity, and other natural language processing functionalities. Available modules from the library include Word2Vec, glove2word2vec, and fasttext, which will be used for word embeddings (or vectorization) in this project.

##### <b>3-3. GloVe</b>

GloVe stands for Global Vectors for word representation. It is an unsupervised learning algorithm that generates word embeddings by aggregating global word co-occurrence matrices from a given corpus. That is, GloVe allows users to take a corpus of text and transform each word in the corpus into a position in a high-dimensional space, and this is achieved by mapping words (from the corpus) into a space (or dimensions) where the distance between words is related to semantic similarity.

* glove2word2vec allows users to convert GloVe vectors into the word2vec format. Both files are plain text and are almost identical, except that the word2vec file starts with a header line giving the number of vectors and their dimension.
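Word embeddings give one vector per word, so a job title needs to be reduced to a single vector before cosine similarity can be computed. A common approach is to average the word vectors. The sketch below assumes that approach; whether the project's `convert_words_to_vectors` does exactly this is an assumption, and the two-dimensional toy embeddings are made up.

```python
def average_vector(text, embeddings, dimension):
    """Average the vectors of the known words in a text; zeros if none are known."""
    vectors = [embeddings[word] for word in text.lower().split() if word in embeddings]
    if not vectors:
        return [0.0] * dimension
    return [sum(component) / len(vectors) for component in zip(*vectors)]


# Hypothetical 2-dimensional embeddings, for illustration only.
toy_embeddings = {"human": [1.0, 0.0], "resources": [0.0, 1.0]}

print(average_vector("Human Resources", toy_embeddings, 2))
print(average_vector("Quantum Basketweaving", toy_embeddings, 2))  # all words unknown
```

Returning a zero vector for fully out-of-vocabulary titles keeps the output shape fixed, and such a title then gets a cosine similarity of 0 against any keyword.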

In [9]:
# glove file source: https://nlp.stanford.edu/projects/glove/
word2vec_file = get_tmpfile('word2vec.6B.50d.txt') # Create a temp file
glove2word2vec('data/glove/glove.6B.50d.txt', word2vec_file) # Save glove2word2vec into the temp file
glove_vectors = KeyedVectors.load_word2vec_format(word2vec_file) # Load the converted vectors from the temp file
glove_dimension = 50

# Transform job titles and keywords into glove vectors
glove_vectors_job_title = convert_words_to_vectors(job_title_processed, glove_vectors, glove_dimension)
glove_vectors_keywords = convert_words_to_vectors(keywords_processed, glove_vectors, glove_dimension)
data['fit_glove'] = cosine_similarity(glove_vectors_job_title, glove_vectors_keywords).sum(axis=1)

##### <b>3-4. Word2Vec</b>

Similar to GloVe, Word2Vec is an unsupervised learning algorithm that creates a distributed representation of words into numerical vectors. Simply put, it converts text (i.e. a collection of words) into numerical vectors in a high-dimensional space. Those vectorized words can capture semantics and relationships among words.

In [10]:
word2vec = Word2Vec(sentences=job_title_processed.apply(lambda x: [word.lower() for word in x.split()]))
word2vec_dimension = word2vec.vector_size

# Transform job titles and keywords into word2vec vectors
word2vec_job_title = convert_words_to_vectors(job_title_processed, word2vec, word2vec_dimension)
word2vec_keywords = convert_words_to_vectors(keywords_processed, word2vec, word2vec_dimension)
data['fit_word2vec'] = cosine_similarity(word2vec_job_title, word2vec_keywords).sum(axis=1)

##### <b>3-5. FastText</b>

One of the main disadvantages of Word2Vec and GloVe embeddings is that they are unable to encode unknown or out-of-vocabulary words. To deal with this problem, FastText was created in 2015 by Facebook (now Meta). FastText takes the internal structure of words into account while learning word representations, which allows it to provide better embeddings for morphologically rich languages, for words that rarely occur, or even for made-up words. In other words, each FastText word vector is built from the vectors of the character substrings the word contains, which allows it to produce a vector for any word.
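The subword idea can be illustrated by listing the character n-grams of a word. The sketch below wraps the word in `<` and `>` boundary markers and uses n-gram lengths from 3 to 6, which we believe are the library defaults (stated here as an assumption). The function name `char_ngrams` is hypothetical; the real model additionally hashes these n-grams into buckets and sums their learned vectors.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """List the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    return [
        wrapped[start:start + n]
        for n in range(min_n, max_n + 1)
        for start in range(len(wrapped) - n + 1)
    ]


print(char_ngrams("where", min_n=3, max_n=3))

# Two related words share many n-grams, so an unseen word can borrow from a seen one.
shared = set(char_ngrams("recruiter")) & set(char_ngrams("recruiting"))
print(sorted(shared))
```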

In [11]:
fasttext = FastText(sentences=job_title_processed.apply(lambda x: [word.lower() for word in x.split()]))
fasttext_dimension = fasttext.vector_size

# Transform job titles and keywords into fasttext vectors
fasttext_job_title = convert_words_to_vectors(job_title_processed, fasttext, fasttext_dimension)
fasttext_keywords = convert_words_to_vectors(keywords_processed, fasttext, fasttext_dimension)
data['fit_fasttext'] = cosine_similarity(fasttext_job_title, fasttext_keywords).sum(axis=1)

### <b>4. Scale and evaluate fitness scores</b>

Transform fit scores so that different fit scores will have the same range between 0 and 1.<br>
This is for easier comparisons among different fit scores created by various algorithms.
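Min-max scaling maps each column to the range [0, 1] with `(x - min) / (max - min)`. A minimal plain-Python sketch of that formula, using made-up scores and a hypothetical helper name:

```python
def min_max_scale(values):
    """Rescale values linearly so the smallest becomes 0 and the largest becomes 1."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # A constant column has no spread; map everything to 0.
        return [0.0 for _ in values]
    return [(value - lowest) / (highest - lowest) for value in values]


raw_scores = [2.0, 4.0, 6.0]
print(min_max_scale(raw_scores))
```

Since every scaled column tops out at 1, summing k scaled columns gives a combined score between 0 and k.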

In [12]:
minmax_scaler = MinMaxScaler()
fit_columns = [col for col in data.columns if "fit_" in col]
data[fit_columns] = minmax_scaler.fit_transform(data[fit_columns])

data.describe()

| | id | fit | fit_tfidf | fit_keras_tokenizer | fit_glove | fit_word2vec | fit_fasttext |
|---|---|---|---|---|---|---|---|
| count | 104.0 | 0.0 | 104.0 | 104.0 | 104.0 | 104.0 | 104.0 |
| mean | 52.5 | NaN | 0.328938 | 0.541808 | 0.604935 | 0.597979 | 0.586781 |
| std | 30.166206 | NaN | 0.315271 | 0.359278 | 0.313345 | 0.307347 | 0.290476 |
| min | 1.0 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 26.75 | NaN | 0.057644 | 0.106066 | 0.286359 | 0.338444 | 0.4224 |
| 50% | 52.5 | NaN | 0.253802 | 0.642826 | 0.664575 | 0.721075 | 0.684321 |
| 75% | 78.25 | NaN | 0.497634 | 0.8 | 0.864163 | 0.815329 | 0.776623 |
| max | 104.0 | NaN | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


Now that each of the fit columns (there are 5 'fit_' columns in total) has a scale of 0-1, sum all fit scores so that the 'fit' column will have a score between 0 and 5 for each candidate.

In [13]:
data['fit'] = data[fit_columns].sum(axis=1)

No candidate has a fitness score of 0 as shown below.

In [14]:
data[data['fit']==0].job_title.value_counts()

Series([], Name: job_title, dtype: int64)

Looking at the job titles of the candidates who got the highest fitness scores, they indeed look very relevant to what we are looking for (i.e. "Aspiring human resources" and "seeking human resources").

In [15]:
data.sort_values('fit', ascending=False).head()

| | id | job_title | location | connection | fit | fit_tfidf | fit_keras_tokenizer | fit_glove | fit_word2vec | fit_fasttext |
|---|---|---|---|---|---|---|---|---|---|---|
| 27 | 28 | Seeking Human Resources Opportunities | Chicago, Illinois | 390 | 4.828453 | 0.921024 | 1.0 | 1.0 | 0.973559 | 0.933869 |
| 29 | 30 | Seeking Human Resources Opportunities | Chicago, Illinois | 390 | 4.828453 | 0.921024 | 1.0 | 1.0 | 0.973559 | 0.933869 |
| 98 | 99 | Seeking Human Resources Position | Las Vegas, Nevada Area | 48 | 4.762808 | 0.913385 | 1.0 | 0.953443 | 0.973559 | 0.922421 |
| 32 | 33 | Aspiring Human Resources Professional | Raleigh-Durham, North Carolina Area | 44 | 4.693111 | 1.0 | 1.0 | 0.920193 | 0.815329 | 0.957589 |
| 2 | 3 | Aspiring Human Resources Professional | Raleigh-Durham, North Carolina Area | 44 | 4.693111 | 1.0 | 1.0 | 0.920193 | 0.815329 | 0.957589 |


On the other hand, the job titles with the lowest fitness scores do seem irrelevant - they don't include any terms related to human resources.

In [16]:
data.sort_values('fit', ascending=True).head()

| | id | job_title | location | connection | fit | fit_tfidf | fit_keras_tokenizer | fit_glove | fit_word2vec | fit_fasttext |
|---|---|---|---|---|---|---|---|---|---|---|
| 34 | 35 | Advisory Board Member at Celal Bayar University | İzmir, Türkiye | 500+ | 0.235728 | 0.0 | 0.0 | 0.0 | 0.066091 | 0.169637 |
| 47 | 48 | Advisory Board Member at Celal Bayar University | İzmir, Türkiye | 500+ | 0.235728 | 0.0 | 0.0 | 0.0 | 0.066091 | 0.169637 |
| 4 | 5 | Advisory Board Member at Celal Bayar University | İzmir, Türkiye | 500+ | 0.235728 | 0.0 | 0.0 | 0.0 | 0.066091 | 0.169637 |
| 22 | 23 | Advisory Board Member at Celal Bayar University | İzmir, Türkiye | 500+ | 0.235728 | 0.0 | 0.0 | 0.0 | 0.066091 | 0.169637 |
| 86 | 87 | Bachelor of Science in Biology from Victoria U... | Baltimore, Maryland | 40 | 0.241118 | 0.0 | 0.0 | 0.05825 | 0.066091 | 0.116777 |


### <b>5. Drop irrelevant candidates</b>

As we have confirmed that our scoring methods (i.e. the fitness scores) work as expected without dramatic deviation, we can safely drop the most irrelevant candidates.

In [17]:
data.describe()

| | id | fit | fit_tfidf | fit_keras_tokenizer | fit_glove | fit_word2vec | fit_fasttext |
|---|---|---|---|---|---|---|---|
| count | 104.0 | 104.0 | 104.0 | 104.0 | 104.0 | 104.0 | 104.0 |
| mean | 52.5 | 2.66044 | 0.328938 | 0.541808 | 0.604935 | 0.597979 | 0.586781 |
| std | 30.166206 | 1.4741 | 0.315271 | 0.359278 | 0.313345 | 0.307347 | 0.290476 |
| min | 1.0 | 0.235728 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 26.75 | 1.59036 | 0.057644 | 0.106066 | 0.286359 | 0.338444 | 0.4224 |
| 50% | 52.5 | 2.952384 | 0.253802 | 0.642826 | 0.664575 | 0.721075 | 0.684321 |
| 75% | 78.25 | 3.540943 | 0.497634 | 0.8 | 0.864163 | 0.815329 | 0.776623 |
| max | 104.0 | 4.828453 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


Looking at the minimum fitness scores, no candidate has a 'fit' score of 0 (the minimum of the 'fit' column is 0.235728), although the minimum of each of the 'fit_' columns is 0.

To filter candidates with at least 1 zero fitness score, add a new column 'has_zero_scores' to the data.
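The flag itself is a simple "any column equals zero" check. The sketch below shows the rule on two rows taken from the tables above, using plain dictionaries instead of a DataFrame. The helper `has_zero_score` is hypothetical, and the project's `add_zero_score_col` may be implemented differently.

```python
FIT_COLUMNS = ["fit_tfidf", "fit_keras_tokenizer", "fit_glove", "fit_word2vec", "fit_fasttext"]


def has_zero_score(row, fit_columns):
    """Return 1 if any of the scaled fitness scores in the row is exactly 0, else 0."""
    return int(any(row[column] == 0 for column in fit_columns))


advisory_board = {"fit_tfidf": 0.0, "fit_keras_tokenizer": 0.0, "fit_glove": 0.0,
                  "fit_word2vec": 0.066091, "fit_fasttext": 0.169637}
seeking_hr = {"fit_tfidf": 0.921024, "fit_keras_tokenizer": 1.0, "fit_glove": 1.0,
              "fit_word2vec": 0.973559, "fit_fasttext": 0.933869}

print(has_zero_score(advisory_board, FIT_COLUMNS))
print(has_zero_score(seeking_hr, FIT_COLUMNS))
```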

In [18]:
data = add_zero_score_col(data, fit_columns)
data[data['has_zero_scores'] == 1].job_title.unique()

array(['Native English Teacher at EPIK (English Program in Korea)',
       'Advisory Board Member at Celal Bayar University',
       'Student at Chapman University',
       'Junior MES Engineer| Information Systems',
       'RRP Brand Portfolio Executive at JTI (Japan Tobacco International)',
       'Information Systems Specialist and Programmer with a love for data and organization.',
       'Bachelor of Science in Biology from Victoria University of Wellington',
       'Undergraduate Research Assistant at Styczynski Lab',
       'Lead Official at Western Illinois University',
       'Admissions Representative at Community medical center long beach',
       'Student at Westfield State University',
       'Student at Indiana University Kokomo - Business Management - Retail Manager at Delphi Hardware and Paint',
       'Student', 'Business Intelligence and Analytics at Travelers',
       'Always set them up for Success',
       'Director Of Administration at Excellence Logging'], dtype=

These job titles indeed seem irrelevant to our keywords, "Aspiring human resources" and "seeking human resources".

To confirm if we can drop candidates with these job titles, let's look at the candidates with the worst fitness scores, excluding candidates with at least 1 zero fitness score.

In [19]:
for fit_col in fit_columns:
    for i, job_title in enumerate(data[data['has_zero_scores'] == 0].sort_values(fit_col).head().job_title.values):
        print(f"{fit_col} least fit job title {i+1}: {job_title}")
    print()

fit_tfidf least fit job title 1: Human Development Coordinator at Ryan
fit_tfidf least fit job title 2: Human Development Coordinator at Ryan
fit_tfidf least fit job title 3: Human Development Coordinator at Ryan
fit_tfidf least fit job title 4: Human Development Coordinator at Ryan
fit_tfidf least fit job title 5: Human Development Coordinator at Ryan

fit_keras_tokenizer least fit job title 1: Seeking employment opportunities within Customer Service or Patient Care
fit_keras_tokenizer least fit job title 2: Human Development Coordinator at Ryan
fit_keras_tokenizer least fit job title 3: SVP, Chief Human Resources Officer, Marketing & Communications, CSR Officer | ENGIE | Houston | The Woodlands | Energy | Global Professional in Human Resources | Senior Professional in Human Resources
fit_keras_tokenizer least fit job title 4: Human Development Coordinator at Ryan
fit_keras_tokenizer least fit job title 5: Human Development Coordinator at Ryan

fit_glove least fit job title 1: 2019 C.

Although these are the 'worst' candidates, their job titles do include parts of the keywords, such as "Human" and "seeking", so they are not entirely irrelevant. Hence, we will discard only the candidates with at least 1 zero fitness score (i.e. has_zero_scores == 1).

This exclusion method can be built into the pipeline so that irrelevant candidates will be excluded from the list of potential candidates before being presented to hiring managers or HR.

In [20]:
data_filtered = data[ data['has_zero_scores'] != 1 ].drop('has_zero_scores', axis=1)

### <b>6. Present initial list of top candidates</b>

Here are the top 20 'best-fit' candidates and their job titles, after dropping candidates with at least 1 zero fitness score.

In [21]:
data_filtered = data_filtered.sort_values(['fit', 'id', 'connection'], ascending=[False, True, True]).reset_index(drop=True)
print(f"Unique job titles of top 20 candidates:\n{data_filtered.head(20).job_title.unique()}")
data_filtered.head(20)

Unique job titles of top 20 candidates:
['Seeking Human Resources Opportunities'
 'Seeking Human Resources Position'
 'Aspiring Human Resources Professional'
 'Aspiring Human Resources Manager, seeking internship in Human Resources.'
 'Aspiring Human Resources Specialist'
 'Seeking Human Resources Human Resources Information System and Generalist Positions']


| | id | job_title | location | connection | fit | fit_tfidf | fit_keras_tokenizer | fit_glove | fit_word2vec | fit_fasttext |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 28 | Seeking Human Resources Opportunities | Chicago, Illinois | 390 | 4.828453 | 0.921024 | 1.0 | 1.0 | 0.973559 | 0.933869 |
| 1 | 30 | Seeking Human Resources Opportunities | Chicago, Illinois | 390 | 4.828453 | 0.921024 | 1.0 | 1.0 | 0.973559 | 0.933869 |
| 2 | 99 | Seeking Human Resources Position | Las Vegas, Nevada Area | 48 | 4.762808 | 0.913385 | 1.0 | 0.953443 | 0.973559 | 0.922421 |
| 3 | 3 | Aspiring Human Resources Professional | Raleigh-Durham, North Carolina Area | 44 | 4.693111 | 1.0 | 1.0 | 0.920193 | 0.815329 | 0.957589 |
| 4 | 17 | Aspiring Human Resources Professional | Raleigh-Durham, North Carolina Area | 44 | 4.693111 | 1.0 | 1.0 | 0.920193 | 0.815329 | 0.957589 |
| 5 | 21 | Aspiring Human Resources Professional | Raleigh-Durham, North Carolina Area | 44 | 4.693111 | 1.0 | 1.0 | 0.920193 | 0.815329 | 0.957589 |
| 6 | 33 | Aspiring Human Resources Professional | Raleigh-Durham, North Carolina Area | 44 | 4.693111 | 1.0 | 1.0 | 0.920193 | 0.815329 | 0.957589 |
| 7 | 46 | Aspiring Human Resources Professional | Raleigh-Durham, North Carolina Area | 44 | 4.693111 | 1.0 | 1.0 | 0.920193 | 0.815329 | 0.957589 |
| 8 | 58 | Aspiring Human Resources Professional | Raleigh-Durham, North Carolina Area | 44 | 4.693111 | 1.0 | 1.0 | 0.920193 | 0.815329 | 0.957589 |
| 9 | 97 | Aspiring Human Resources Professional | Kokomo, Indiana Area | 71 | 4.693111 | 1.0 | 1.0 | 0.920193 | 0.815329 | 0.957589 |


### <b>7. Prepare train and test data</b>

Vectorize job titles using FastText since it is a more recent and robust algorithm than the others used in the earlier steps. The transformed word vectors will be used as training features.

Set the 'id' column as the index and the 'fit' column as the target. That is, the index of each row will be the id of each candidate and the fitness score of each candidate will be the target, which we want to predict based on (vectorized) job titles.

In [22]:
training_features = pd.DataFrame(
    get_word_vectors(data_filtered, 'job_title', vectorizer='fasttext',
                     to_process_text=True, remove_stopwords=True, lemmatize=False, stem=False)
)
data_selected = pd.concat([data_filtered, training_features], axis=1).set_index('id')
X = data_selected[training_features.columns]
y = data_selected['fit']

Split data into train and test sets before model training or further transformation.

In [23]:
test_size = 0.2
random_state = 1
X_train, X_test, y_train_fitness, y_test_fitness = split_data(
    X, y, test_size, random_state=random_state, oversampling=False)

Convert fitness scores into ranks for each of the y_train and y_test data sets separately so that ranks will start from 1 to the number of data points for each data set. For instance, if there are 10 instances in the y_train data set, the rank will range from 1 to 10, not from 2 to 20 with many missing ranks.

Also, change the name of the target column from 'fit' to 'rank' since we've got ranks.
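Dense ranking in descending order gives rank 1 to the highest score, lets ties share a rank, and leaves no gaps. A plain-Python sketch of that rule, with made-up scores and a hypothetical helper name:

```python
def dense_rank_descending(scores):
    """Rank 1 goes to the highest score; equal scores share a rank; no ranks are skipped."""
    distinct = sorted(set(scores), reverse=True)
    rank_of = {score: rank for rank, score in enumerate(distinct, start=1)}
    return [rank_of[score] for score in scores]


fitness_scores = [4.83, 4.69, 4.83, 3.10, 4.76]
print(dense_rank_descending(fitness_scores))

# Ranking a subset on its own restarts at 1, which is why train and test are ranked separately.
print(dense_rank_descending([4.69, 3.10]))
```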

In [24]:
y_train = y_train_fitness.rank(method='dense', ascending=False)
y_train.name = 'rank'

y_test = y_test_fitness.rank(method='dense', ascending=False)
y_test.name = 'rank'

### <b>8. Train ranking models</b>

Now it's time to train Learning To Rank (LTR) models. LTR models generally take training features and ranks as input and try to learn how a set of objects/instances is ranked based on those features. In our case, the training features are the job title vectors and the prediction target is the rank. A major difference between a traditional supervised machine learning model and an LTR model is that the former produces a prediction for one instance at a time, whereas the latter produces an ordering of the whole list.

We will use NDCG (normalized discounted cumulative gain) as the evaluation metric for the ranking models. NDCG is a measure of the effectiveness of a ranking system, taking into account the position of relevant items in the ranked list. It is based on the idea that items that are higher in the ranking should be given more credit than items that are lower in the ranking - it penalizes highly relevant items ranked lower, which should have appeared higher in the ranked list.
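As a worked illustration of the metric, the sketch below computes NDCG in plain Python. It assumes graded relevance labels where a larger value means more relevant, the exponential gain `2^rel - 1`, and a `log2` position discount, which is one common variant of the formula. The example lists are made up.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of relevance grades listed in ranked order."""
    return sum(
        (2 ** relevance - 1) / math.log2(position + 1)
        for position, relevance in enumerate(relevances, start=1)
    )


def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


print(ndcg([3, 2, 1, 0]))  # ideal ordering
print(ndcg([3, 2, 0, 1]))  # small mistake near the bottom
print(ndcg([0, 1, 2, 3]))  # most relevant item ranked last
```

Swapping two items near the bottom of the list costs much less than pushing the most relevant item to the end, which reflects the position discount described above.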

##### <b>8-1. XGBoost Ranker</b>

XGBoost (eXtreme Gradient Boosting) is a tree-based ensemble machine learning algorithm in which decision trees are grown sequentially, each new tree being fit to correct the errors of the trees before it, so that the overall prediction error is minimized. Thanks to its efficiency and versatility, it is widely used in the field of data science. We therefore start with a ranker based on XGBoost, the XGB Ranker.

In [25]:
xgb_params = {
    'objective': 'rank:pairwise',
    'booster': 'gbtree',
    'eval_metric': 'ndcg',
    'random_state': random_state
}

xgb_ranker = XGBRanker(**xgb_params)
xgb_ranker.fit(X_train, y_train,
               group=y_train.value_counts(),
               eval_set=[(X_test, y_test)],
               eval_group=[y_test.value_counts()]
               )

stats_df, xgb_train_result, xgb_test_result = get_stats(X_train, X_test, y_train, y_test,
                                                        xgb_ranker, target_column="rank", target="candidates")

Ground truth stats:
                      y_train  y_test
Mean (Top 5 rankers)   2.7368  3.3750
Mean                  14.5323  6.1250
Std                   10.0421  3.2838 

Train stats:
Mean rank of top 5 candidates based on predictions: 4.4211
Mean rank of all candidates based on predictions: 14.5
Std rank of all candidates based on predictions: 9.569
Mean absolute difference between each pair of rank and predicted rank: 2.9677
    rank  pred_rank  abs_diff
id                           
28   1.0        1.0       0.0
30   1.0        1.0       0.0
60   3.0        2.0       1.0
36   3.0        2.0       1.0
49   3.0        2.0       1.0 

Test stats:
Mean rank of top 5 candidates based on predictions: 3.375
Mean rank of all candidates based on predictions: 5.9375
Std rank of all candidates based on predictions: 3.8204
Mean absolute difference between each pair of rank and predicted rank: 2.5625
    rank  pred_rank  abs_diff
id                           
73   2.0        1.0       1.0
25 

The XGBRanker seems to perform reasonably well: the mean absolute difference between the real and predicted ranks is low, and the mean and standard deviation of the predicted ranks are quite close to those of the real ranks.

##### <b>8-2. LGBM Ranker</b>

LGBM (Light Gradient Boosting Machine) is becoming more popular as an alternative: it is often faster than XGBoost without compromising accuracy, and it provides more than 100 hyperparameters for users to tune. One of the main differences between the two is that LGBM grows trees leaf-wise whereas XGB grows trees level-wise, which typically results in smaller and faster models with LGBM.

Thus, let's try this algorithm out and see how it performs compared to the XGB ranker.

In [26]:
lgbm_ranker = LGBMRanker(
    boosting_type="dart",
    objective="lambdarank",
    metric= "ndcg",
    label_gain =[i for i in range(int(max(y_train.max(), y_test.max())) + 2)],
    random_state=random_state
    )

lgbm_ranker.fit(
    X=X_train,
    y=y_train,
    group=y_train.value_counts(),
    eval_set=[(X_test, y_test)],
    eval_group=[y_test.value_counts()],
    verbose=-1
    )

_, lgbm_train_result, lgbm_test_result = get_stats(X_train, X_test, y_train, y_test,
                                                   lgbm_ranker, target_column="rank", target="candidates")

Ground truth stats:
                      y_train  y_test
Mean (Top 5 rankers)   2.7368  3.3750
Mean                  14.5323  6.1250
Std                   10.0421  3.2838 

Train stats:
Mean rank of top 5 candidates based on predictions: 2.8947
Mean rank of all candidates based on predictions: 14.1452
Std rank of all candidates based on predictions: 10.2363
Mean absolute difference between each pair of rank and predicted rank: 2.5484
    rank  pred_rank  abs_diff
id                           
3    2.0        1.0       1.0
58   2.0        1.0       1.0
46   2.0        1.0       1.0
17   2.0        1.0       1.0
33   2.0        1.0       1.0 

Test stats:
Mean rank of top 5 candidates based on predictions: 4.125
Mean rank of all candidates based on predictions: 6.0625
Std rank of all candidates based on predictions: 3.5491
Mean absolute difference between each pair of rank and predicted rank: 2.5625
    rank  pred_rank  abs_diff
id                           
73   2.0        1.0       1.

The LGBM Ranker does run faster than the XGB Ranker, but its predictions are not consistently much better, although most of the time it still performs slightly better (e.g. a lower mean rank of the top 5 candidates on the training data).

Perhaps hyperparameter tuning could make a big difference, but for the time being we cannot conclude that the LGBM Ranker is better than the XGB Ranker.

### <b>9. Star ideal candidates</b>

Now that the base ranking models are built, we can proceed to starring ideal candidates and re-ranking all candidates based on the stars. Starring a candidate marks that candidate as ideal for the given role. The list of candidates is re-ranked each time a candidate or a list of candidates is starred.

##### <b>9-1. Get ids of ideal candidates</b>

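Since we cannot actually take input from HR for starring ideal candidates, randomly choose 5 candidates as ideal candidates. Note that random.choices samples with replacement, so it can return the same id more than once; random.sample would guarantee 5 distinct ids.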

In [27]:
random.seed(random_state)
input_ids = random.choices(y.index, k=5)
ideal_candidates = sorted(input_ids)

##### <b>9-2. Create a copy of train and test data for re-ranking candidates based on stars</b>

In [28]:
X_train_updated = X_train.copy()
X_test_updated = X_test.copy()
y_train_updated = y_train.copy()
y_test_updated = y_test.copy()

##### <b>9-3. Add a binary feature 'star' to training features</b>

If a candidate is 'starred', the feature value will be 1, otherwise 0. This feature will be a part of training features.

In [29]:
X_train_updated['star'] = [1 if idx in ideal_candidates else 0 for idx in X_train_updated.index]
X_test_updated['star'] = [1 if idx in ideal_candidates else 0 for idx in X_test_updated.index]

### <b>10. Re-rank all candidates</b>

Add 1 to the ranks of all candidates (e.g. rank 1 will become rank 2), and then change the rank of the ideal candidates to 1 in y_train_updated and y_test_updated.

In [30]:
y_train_updated, y_test_updated = update_ranks(y_train_updated, y_test_updated, ideal_candidates)

Rank of candidate 13 in y_test updated to 1.
Rank of candidate 15 in y_train updated to 1.
Rank of candidate 52 in y_test updated to 1.
Rank of candidate 62 in y_train updated to 1.
Rank of candidate 73 in y_test updated to 1.
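The re-ranking rule for this step (shift every rank down by one, then put starred candidates at rank 1) can be sketched as follows. The function `star_and_rerank` is a hypothetical stand-in working on a plain dictionary; the project's `update_ranks` operates on the train and test Series and may differ in detail. The ids and ranks below are made up.

```python
def star_and_rerank(ranks, starred_ids):
    """Add 1 to every rank, then assign rank 1 to each starred candidate."""
    starred = set(starred_ids)
    return {
        candidate_id: 1 if candidate_id in starred else rank + 1
        for candidate_id, rank in ranks.items()
    }


current_ranks = {28: 1, 30: 1, 99: 2, 73: 5}
print(star_and_rerank(current_ranks, starred_ids=[73]))
```

Shifting all ranks first guarantees that a starred candidate ends up strictly ahead of every unstarred one, including those that previously held rank 1.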


### <b>11. Re-train ranking models based on updated ranks and training features</b>

In [31]:
xgb_ranker.fit(X_train_updated, y_train_updated,
               group=y_train_updated.value_counts(),
               eval_set=[(X_test_updated, y_test_updated)],
               eval_group=[y_test_updated.value_counts()]
               )

stats_df_updated, xgb_train_result_updated, xgb_test_result_updated = get_stats(X_train_updated, X_test_updated,
                                                                                y_train_updated, y_test_updated,
                                                                                xgb_ranker,
                                                                                target_column="rank", target="candidates",
                                                                                updated=True)

(Updated) Ground truth stats:
                      y_train_updated  y_test_updated
Mean (Top 5 rankers)           3.2632          3.0000
Mean                          15.0161          6.2500
Std                           10.1343          3.9749 

(Updated) Train stats:
Mean rank of top 5 candidates based on predictions: 3.2105
Mean rank of all candidates based on predictions: 15.0806
Std rank of all candidates based on predictions: 10.4306
Mean absolute difference between each pair of rank and predicted rank: 3.1935
    rank  pred_rank  abs_diff
id                           
3    3.0        1.0       2.0
58   3.0        1.0       2.0
46   3.0        1.0       2.0
17   3.0        1.0       2.0
33   3.0        1.0       2.0 

(Updated) Test stats:
Mean rank of top 5 candidates based on predictions: 4.25
Mean rank of all candidates based on predictions: 6.9375
Std rank of all candidates based on predictions: 4.1868
Mean absolute difference between each pair of rank and predicted rank: 2.

In [32]:
lgbm_ranker.fit(
    X=X_train_updated,
    y=y_train_updated,
    group=y_train_updated.value_counts(),
    eval_set=[(X_test_updated, y_test_updated)],
    eval_group=[y_test_updated.value_counts()],
    verbose=-1
    )

_, lgbm_train_result_updated, lgbm_test_result_updated = get_stats(X_train_updated, X_test_updated,
                                                                   y_train_updated, y_test_updated,
                                                                   lgbm_ranker,
                                                                   target_column="rank", target="candidates",
                                                                   updated=True)

(Updated) Ground truth stats:
                      y_train_updated  y_test_updated
Mean (Top 5 rankers)           3.2632          3.0000
Mean                          15.0161          6.2500
Std                           10.1343          3.9749 

(Updated) Train stats:
Mean rank of top 5 candidates based on predictions: 3.4737
Mean rank of all candidates based on predictions: 13.9355
Std rank of all candidates based on predictions: 10.1834
Mean absolute difference between each pair of rank and predicted rank: 3.629
    rank  pred_rank  abs_diff
id                           
3    3.0        1.0       2.0
58   3.0        1.0       2.0
46   3.0        1.0       2.0
17   3.0        1.0       2.0
33   3.0        1.0       2.0 

(Updated) Test stats:
Mean rank of top 5 candidates based on predictions: 5.375
Mean rank of all candidates based on predictions: 6.625
Std rank of all candidates based on predictions: 3.2223
Mean absolute difference between each pair of rank and predicted rank: 2.6

### <b>12. Evaluate results</b>

Collate and rearrange statistics for easier evaluation of the results and model performance.

In [33]:
stats_df_concat = pd.concat([stats_df, stats_df_updated], axis=1)
stats_df_concat = stats_df_concat.iloc[:, [0, 2, 1, 3]] # re-arrange columns
stats_df_concat

                      y_train  y_train_updated  y_test  y_test_updated
Mean (Top 5 rankers)   2.7368           3.2632  3.3750          3.0000
Mean                  14.5323          15.0161  6.1250          6.2500
Std                   10.0421          10.1343  3.2838          3.9749


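As shown above, the ground-truth statistics changed only slightly after starring the ideal candidates (e.g. the mean rank is 14.5323 for y_train vs. 15.0161 for y_train_updated). In other words, starring shifts the ranks without substantially changing their overall distribution, so the ranking models should be able to handle the update of ranks based on stars.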

In [34]:
overall_mean_df = get_mean_stats(
    xgb_train_result, xgb_train_result_updated,
    xgb_test_result, xgb_test_result_updated,
    lgbm_train_result, lgbm_train_result_updated,
    lgbm_test_result, lgbm_test_result_updated
    )
print(f"Overall mean statistics:\n{overall_mean_df}")

Overall mean statistics:
           xgb_train  xgb_train_updated  xgb_test  xgb_test_updated  \
rank         14.5323            15.0161    6.1250            6.2500   
pred_rank    14.5000            15.0806    5.9375            6.9375   
abs_diff      2.9677             3.1935    2.5625            2.5625   

           lgbm_train  lgbm_train_updated  lgbm_test  lgbm_test_updated  
rank          14.5323             15.0161     6.1250              6.250  
pred_rank     14.1452             13.9355     6.0625              6.625  
abs_diff       2.5484              3.6290     2.5625              2.625  


In [35]:
top5_mean_df = get_mean_stats(
    xgb_train_result[xgb_train_result["rank"]<=5], xgb_train_result_updated[xgb_train_result_updated["rank"]<=5],
    xgb_test_result[xgb_test_result["rank"]<=5], xgb_test_result_updated[xgb_test_result_updated["rank"]<=5],
    lgbm_train_result[lgbm_train_result["rank"]<=5], lgbm_train_result_updated[lgbm_train_result_updated["rank"]<=5],
    lgbm_test_result[lgbm_test_result["rank"]<=5], lgbm_test_result_updated[lgbm_test_result_updated["rank"]<=5]
    )
print(f"Top 5 mean statistics:\n{top5_mean_df}")

Top 5 mean statistics:
           xgb_train  xgb_train_updated  xgb_test  xgb_test_updated  \
rank          2.7368             3.2632     3.375              3.00   
pred_rank     4.4211             3.2105     3.375              4.25   
abs_diff      2.2105             1.9474     2.250              2.75   

           lgbm_train  lgbm_train_updated  lgbm_test  lgbm_test_updated  
rank           2.7368              3.2632      3.375              3.000  
pred_rank      2.8947              3.4737      4.125              5.375  
abs_diff       1.4211              3.3684      2.000              2.375  


Looking at the overall and top 5 mean statistics, the mean absolute difference between real ranks and predicted ranks changes after re-training, and the size of the change may depend on the particular selection of ideal candidates. The largest change is in the LGBM train results (abs_diff goes from 2.5484 to 3.6290 overall, and from 1.4211 to 3.3684 for the top 5), while the test abs_diff of both models changes by at most 0.5. On the whole, both models appear to handle the re-ranking of candidates reasonably well: performance neither dropped nor improved dramatically after re-training on the starred candidates.
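
For reference, the `abs_diff` statistic discussed above can be reproduced from any frame that has `rank` and `pred_rank` columns. The sketch below uses a small hypothetical result frame (the values are made up and are not the notebook's results) and does not rely on `get_stats` or `get_mean_stats`.

```python
import pandas as pd

# Hypothetical result frame with the same column names as the tables above
result = pd.DataFrame(
    {"rank": [3.0, 1.0, 5.0, 8.0], "pred_rank": [1.0, 2.0, 5.0, 4.0]},
    index=pd.Index([3, 13, 58, 46], name="id"),
)
result["abs_diff"] = (result["rank"] - result["pred_rank"]).abs()

# Mean absolute difference over all candidates
overall_mae = result["abs_diff"].mean()
# Same statistic restricted to candidates whose true rank is 5 or better
top5_mae = result.loc[result["rank"] <= 5, "abs_diff"].mean()
print(overall_mae, top5_mae)
```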

### <b>13. Save models for later use</b>

In [36]:
save_model(xgb_ranker, "xgb_ranker_trained")
save_model(lgbm_ranker, "lgbm_ranker_trained")

Trained model saved: c:\Users\Admin\Documents\GitHub\Apziva\YKTXOBGWLuUXdzbs\xgb_ranker_trained.sav
Trained model saved: c:\Users\Admin\Documents\GitHub\Apziva\YKTXOBGWLuUXdzbs\lgbm_ranker_trained.sav


### <b>14. Conclusion</b>

Goal and success metrics

- Fitness scores were computed with several word embedding techniques and then used to train learning-to-rank (LTR) models.
- The model predictions were evaluated using the means and standard deviations of ranks, both for all candidates and for the top 5 only.
- The process of re-ranking (or starring) candidates was built into the machine learning pipeline, and the models were able to handle the updated ranks and produce reasonable predictions based on them.


Other objectives

- We are interested in a robust algorithm, tell us how your solution works and show us how your ranking gets better with each starring action.

The solution does not rely solely on a single ranking algorithm or word embedding technique.

The fit column is calculated from 5 different fitness scores. Each score is the cosine similarity between the vector of a candidate's job title and the vector of the keywords, so job titles that are closely related to the keywords (e.g. human resources) get higher cosine similarity scores.
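
As a reminder of what each fitness score measures, here is a minimal cosine similarity sketch. The function name `cosine_sim` is hypothetical (the notebook uses its own `cosine_similarity` from `utils.evaluate`, whose implementation is not shown), and returning 0 for an all-zero vector is an assumption made here for illustration.

```python
import numpy as np

def cosine_sim(u, v) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    if denom == 0:
        # Assumption: a title with no usable vector gets a score of 0
        return 0.0
    return float(np.dot(u, v) / denom)

print(cosine_sim([1, 2, 0], [2, 4, 0]))  # parallel vectors
print(cosine_sim([1, 0, 0], [0, 1, 0]))  # orthogonal vectors
```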

As for the rankers, XGBRanker and LGBMRanker can complement each other: using both models to select top candidates reduces the chance of missing high-potential candidates.
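
One possible way to combine the two rankers, sketched under the assumption that each model yields a predicted rank per candidate id (lower is better), is to take the union of their top-k lists. The helper `top_k_union` and the demo predictions are hypothetical and are not part of the notebook.

```python
import pandas as pd

def top_k_union(pred_a: pd.Series, pred_b: pd.Series, k: int = 5) -> list:
    """Candidate ids that appear in the top k of either ranker."""
    top_a = pred_a.nsmallest(k).index
    top_b = pred_b.nsmallest(k).index
    return sorted(top_a.union(top_b))

# Hypothetical predicted ranks indexed by candidate id
xgb_pred = pd.Series([1.0, 2.0, 3.0, 4.0], index=[3, 13, 58, 46])
lgbm_pred = pd.Series([2.0, 4.0, 1.0, 3.0], index=[3, 13, 58, 46])

shortlist = top_k_union(xgb_pred, lgbm_pred, k=2)
print(shortlist)
```

The union is deliberately inclusive: a candidate ranked highly by only one model still reaches the shortlist, at the cost of a slightly longer list for manual review.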

- How can we filter out candidates which in the first place should not be in this list?

Given how the fitness scores are calculated in this notebook, candidates with a fitness score of 0 in the fit column can be dropped. If no candidate has a fitness score of 0 in the fit column, we can instead drop candidates that have a score of 0 in at least one of the individual fit columns (e.g. fit_tfidf, fit_glove); this information is stored in the has_zero_scores column.

The job titles of the candidates dropped this way did look irrelevant.
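
The filtering rule can be sketched as follows. The column names match the ones mentioned above, but the data, the helper `drop_irrelevant`, and the way `has_zero_scores` is derived are assumptions for illustration (the notebook builds that column with `add_zero_score_col`, whose implementation is not shown).

```python
import pandas as pd

def drop_irrelevant(df: pd.DataFrame) -> pd.DataFrame:
    """Drop zero-fit candidates; fall back to the has_zero_scores flag."""
    if (df["fit"] == 0).any():
        return df[df["fit"] != 0]
    return df[~df["has_zero_scores"]]

# Hypothetical fitness scores indexed by candidate id
candidates = pd.DataFrame(
    {
        "fit": [0.82, 0.00, 0.35, 0.10],
        "fit_tfidf": [0.9, 0.0, 0.4, 0.0],
        "fit_glove": [0.7, 0.0, 0.3, 0.2],
    },
    index=pd.Index([1, 2, 3, 4], name="id"),
)
# Assumed definition: True if any individual fit column is exactly 0
fit_cols = ["fit_tfidf", "fit_glove"]
candidates["has_zero_scores"] = (candidates[fit_cols] == 0).any(axis=1)

kept = drop_irrelevant(candidates)                         # primary rule applies
kept_fallback = drop_irrelevant(candidates.drop(index=2))  # no zero in fit
print(kept.index.tolist(), kept_fallback.index.tolist())
```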

- Can we determine a cut-off point that would work for other roles without losing high potential candidates?

A fitness score of 0 can serve as the cut-off point. Candidates with a fitness score of 0 in the fit column are dropped; if there are none, candidates flagged in the has_zero_scores column (a score of 0 in at least one of the fit columns such as fit_tfidf, fit_glove) are dropped instead.

- Do you have any ideas that we should explore so that we can even automate this procedure to prevent human bias?

Most of the process can be automated, but I don't think we can (or should) automate the manual review and starring of ideal candidates. In my view it is essential to have human intervention at some point, and in this application that point is the manual review and starring of candidates. Without such supervision, the automated procedure might be efficient, but we can never guarantee that it is free of bias. Even with state-of-the-art technologies and algorithms, I don't believe that any machine or algorithm is entirely bias-free.

To conclude, we can automate most of the procedure, but I would leave the starring operation to be done manually by a human, which could actually help reduce bias or uncover subtle errors in the algorithms.

10-1. Learning To Rank (LTR) models - XGB Ranker, LGBM Ranker
results - the model performance, etc. mention goals and success metrics mentioned in the project description (README).

10-2. Word embeddings and vectorizations - TF-IDF, Tokenizer, GloVe, Word2Vec, FastText




10-5. Application
the solution can be used in ## situations, business cases, etc.
