# Project 3 
## Goal(s):

Predict how fit each candidate is for a role, based on their available information (variable `fit`).

## Success Metric(s):

Rank candidates based on a fitness score.

Re-rank candidates when a candidate is starred.

## Bonus(es):

We are interested in a robust algorithm. Tell us how your solution works and show us how your ranking gets better with each starring action.

How can we filter out candidates who should not be in this list in the first place?

Can we determine a cut-off point that would work for other roles without losing high potential candidates?

Do you have any ideas that we should explore so that we can even automate this procedure to prevent human bias?


In [221]:
# linear algebra
import numpy as np 

# data processing
import pandas as pd 

# data visualization
import seaborn as sns
%matplotlib inline
from matplotlib import pyplot as plt
from matplotlib import style

# Algorithms
from sklearn import linear_model
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
import xgboost as xgb
import lightgbm as lgb


#Cross validation
from sklearn.model_selection import KFold, cross_val_score

#Other
from sklearn import metrics
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

#class imbalance
from sklearn.metrics import f1_score
from sklearn.utils import resample

#NLP
import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.tokenize.treebank import TreebankWordDetokenizer
from transformers import AutoTokenizer, AutoModel
from sklearn.metrics.pairwise import cosine_similarity
import gensim.downloader as api
from gensim.models import KeyedVectors


[nltk_data] Downloading package punkt to /Users/natalie/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/natalie/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/natalie/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


# Loading data and initial exploration

In [222]:
og_data = pd.read_excel('potential-talents.xlsx')
data = og_data.copy()  # work on a copy so the original frame stays untouched
keywords = 'aspiring human resources'

In [223]:
data.head(3)

Unnamed: 0,id,job_title,location,connection,fit
0,1,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,
1,2,Native English Teacher at EPIK (English Progra...,Kanada,500+,
2,3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,


In [224]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 104 entries, 0 to 103
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   id          104 non-null    int64  
 1   job_title   104 non-null    object 
 2   location    104 non-null    object 
 3   connection  104 non-null    object 
 4   fit         0 non-null      float64
dtypes: float64(1), int64(1), object(3)
memory usage: 4.2+ KB


In [225]:
data.isnull().sum()

id              0
job_title       0
location        0
connection      0
fit           104
dtype: int64

The original data shape is (104, 5). The `fit` column is empty, so we will remove it.

In [226]:
data.drop('fit', axis=1, inplace=True)

In [227]:
data.columns

Index(['id', 'job_title', 'location', 'connection'], dtype='object')

Attributes:
id : unique identifier for candidate (numeric)

job_title : job title for candidate (text)

location : geographical location for candidate (text)

connection : number of connections candidate has, 500+ means over 500 (text)

In [228]:
data.describe()

Unnamed: 0,id
count,104.0
mean,52.5
std,30.166206
min,1.0
25%,26.75
50%,52.5
75%,78.25
max,104.0


# Add controls

I want to add 3 control rows with job titles unrelated to the keywords, so I can check that the similarity code ranks them low.

In [229]:
controls = pd.DataFrame({
    "id": [1111, 1112, 1113],
    "job_title": ['Machine learning', 'NA', 'artist'],
    "location": ['x', 'x', 'x'],
    "connection": [150, 0, 500]
}, index=[104, 105, 106])

# Append the control rows (DataFrame.append was removed in pandas 2.0, so use pd.concat)
data = pd.concat([data, controls])

In [230]:
data.tail()

Unnamed: 0,id,job_title,location,connection
102,103,Always set them up for Success,Greater Los Angeles Area,500+
103,104,Director Of Administration at Excellence Logging,"Katy, Texas",500+
104,1111,Machine learning,x,150
105,1112,,x,0
106,1113,artist,x,500


# Exploring data deeper

In [231]:
for col in data.columns:
    print(f'{data[col].nunique()}: unique value in {col}')

107: unique value in id
55: unique value in job_title
42: unique value in location
36: unique value in connection


There are fewer unique job titles than rows, so check whether there are duplicate rows.

In [232]:
df = data.drop(['id'], axis = 1)                    
print("Duplicates:", df.duplicated().sum())

Duplicates: 51


In [233]:
# look at those rows
df[df.duplicated(keep=False)]  

Unnamed: 0,job_title,location,connection
0,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85
1,Native English Teacher at EPIK (English Progra...,Kanada,500+
2,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44
3,People Development Coordinator at Ryan,"Denton, Texas",500+
4,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+
...,...,...,...
60,HR Senior Specialist,San Francisco Bay Area,500+
61,Seeking Human Resources HRIS and Generalist Po...,Greater Philadelphia Area,500+
62,Student at Chapman University,"Lake Forest, California",2
63,"SVP, CHRO, Marketing & Communications, CSR Off...","Houston, Texas Area",500+


In [234]:
#clean up duplicates
newdf = df.drop_duplicates()                                    
data = pd.concat([data['id'], newdf], axis = 1).dropna(axis = 0)   
data = data.reset_index(drop = True)
data.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 56 entries, 0 to 55
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          56 non-null     int64 
 1   job_title   56 non-null     object
 2   location    56 non-null     object
 3   connection  56 non-null     object
dtypes: int64(1), object(3)
memory usage: 1.9+ KB


The cleaned dataframe now has a shape of (56, 4).
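The de-duplication compares rows on every column except `id`. A minimal sketch of the same idea on plain Python records; the helper name `drop_duplicate_rows` and the sample rows are assumptions for illustration, not part of the notebook:

```python
def drop_duplicate_rows(rows, ignore=("id",)):
    """Keep the first occurrence of each row, comparing all fields except `ignore`."""
    seen = set()
    unique = []
    for row in rows:
        # the comparison key is every (field, value) pair apart from the ignored fields
        key = tuple(sorted((k, v) for k, v in row.items() if k not in ignore))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# hypothetical sample rows, shaped like the notebook's columns
rows = [
    {"id": 1, "job_title": "Aspiring Human Resources Professional",
     "location": "Raleigh-Durham, North Carolina Area", "connection": "44"},
    {"id": 2, "job_title": "Student at Chapman University",
     "location": "Lake Forest, California", "connection": "2"},
    {"id": 3, "job_title": "Aspiring Human Resources Professional",
     "location": "Raleigh-Durham, North Carolina Area", "connection": "44"},
]
unique_rows = drop_duplicate_rows(rows)
print([row["id"] for row in unique_rows])
```

In pandas, `data.drop_duplicates(subset=['job_title', 'location', 'connection'])` should give the same result in one step while keeping `id` attached to the surviving rows.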

In [235]:
data.job_title.value_counts()

Aspiring Human Resources Professional                                                                                    2
Human Resources professional for the world leader in GIS software                                                        1
Student at Humber College and Aspiring Human Resources Generalist                                                        1
Human Resources Generalist at ScottMadden, Inc.                                                                          1
Retired Army National Guard Recruiter, office manager,  seeking a position in Human Resources.                           1
Human Resources Professional                                                                                             1
Seeking Human Resources Opportunities                                                                                    1
Student                                                                                                                  1
RRP Brand Portfo

A lot of acronyms are used, so we will need to replace them later.

In [236]:
acronyms = {'GIS': 'geographic information system', 
            'HRIS': 'human resources information system', 'MES': 'manufacturing execution system',
            'SVP': 'senior vice president', 'CHRO':'chief human resources officer', 'CSR': 'corporate social responsibility',
           'GPHR':'global professional in human resources', 'SPHR': 'senior professional in human resources'
           }

In [237]:
print('location: ', data.location.value_counts())
print('connections: ', data.connection.value_counts())

location:  Houston, Texas Area                    4
x                                      3
Greater New York City Area             3
Raleigh-Durham, North Carolina Area    3
Amerika Birleşik Devletleri            2
Greater Philadelphia Area              2
Kanada                                 2
Greater Atlanta Area                   2
Austin, Texas Area                     2
San Francisco Bay Area                 1
Dallas/Fort Worth Area                 1
Greater Los Angeles Area               1
Baltimore, Maryland                    1
Monroe, Louisiana Area                 1
Jackson, Mississippi Area              1
Virginia Beach, Virginia               1
Las Vegas, Nevada Area                 1
New York, New York                     1
Cape Girardeau, Missouri               1
Los Angeles, California                1
Houston, Texas                         1
Bridgewater, Massachusetts             1
İzmir, Türkiye                         1
Greater Chicago Area                   1
Chatt

# Cleaning and processing data

Before starting the NLP analysis I will follow some best practices, including:
1. removing unwanted characters (punctuation and special characters) from the text
2. making the text lowercase, removing stop words, and lemmatizing
3. replacing acronyms and normalizing

<!-- Data Splitting: Split the data into training, validation, and testing sets to build and evaluate models. It is essential to ensure that the distribution of the data is maintained across the different sets.

Model Selection: Select an appropriate NLP algorithm for the task at hand, and compare the performance of different models to select the best one.

Model Evaluation: Evaluate the model's performance on the test set and use appropriate metrics to measure the model's accuracy, precision, recall, and F1 score. -->


In [238]:
# Step 1: remove unwanted punctuation and special characters from the text.
# raw string with the hyphen escaped, so it is matched literally rather than read as a range
punctuation = r"[\'!#)$%&(*+,\-./:;<=>?@\[\]^_`{|}~\n]"
data = data.replace({'job_title': {punctuation: " "}}, regex=True)
data = data.replace({'location': {punctuation: " "}}, regex=True)
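Inside a regex character class, an unescaped hyphen between two characters defines a range, so `+-.` matches every character from `+` to `.`, which includes the comma. A short demonstration with the standard `re` module (the variable names are illustrative):

```python
import re

# '+-.' inside a class is a range from '+' (0x2B) to '.' (0x2E): it matches + , - .
range_class = re.compile(r"[+-.]")
# escaping the hyphen turns the class into three literal characters: + - .
literal_class = re.compile(r"[+\-.]")

sample = "Houston, Texas"
print(repr(range_class.sub(" ", sample)))
print(repr(literal_class.sub(" ", sample)))
```

So a class containing `+-.` removes commas even when no comma is listed on its own.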

In [239]:
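# before lowercasing, replace every 'HR' with 'human resources' so the lowercase string 'hr' inside other words is not affected
# note: the pattern also matches inside longer acronyms (e.g. HRIS in row 8 below); r'\bHR\b' would restrict it to whole words
data = data.replace({'HR': 'human resources'}, regex=True)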
data.head(10)

Unnamed: 0,id,job_title,location,connection
0,1,2019 C T Bauer College of Business Graduate ...,Houston Texas,85
1,2,Native English Teacher at EPIK English Progra...,Kanada,500+
2,3,Aspiring Human Resources Professional,Raleigh Durham North Carolina Area,44
3,4,People Development Coordinator at Ryan,Denton Texas,500+
4,5,Advisory Board Member at Celal Bayar University,İzmir Türkiye,500+
5,6,Aspiring Human Resources Specialist,Greater New York City Area,1
6,7,Student at Humber College and Aspiring Human R...,Kanada,61
7,8,human resources Senior Specialist,San Francisco Bay Area,500+
8,10,Seeking Human Resources human resourcesIS and ...,Greater Philadelphia Area,500+
9,11,Student at Chapman University,Lake Forest California,2


In [240]:
# Step 2 making lowercase, removing stop words, stemming and lemmatization
data['job_title'] = data['job_title'].str.lower()
data['location'] = data['location'].str.lower()

In [241]:
data.head(10)

Unnamed: 0,id,job_title,location,connection
0,1,2019 c t bauer college of business graduate ...,houston texas,85
1,2,native english teacher at epik english progra...,kanada,500+
2,3,aspiring human resources professional,raleigh durham north carolina area,44
3,4,people development coordinator at ryan,denton texas,500+
4,5,advisory board member at celal bayar university,i̇zmir türkiye,500+
5,6,aspiring human resources specialist,greater new york city area,1
6,7,student at humber college and aspiring human r...,kanada,61
7,8,human resources senior specialist,san francisco bay area,500+
8,10,seeking human resources human resourcesis and ...,greater philadelphia area,500+
9,11,student at chapman university,lake forest california,2


In [242]:
# remove stop words and lemmatize each job title
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
detokenizer = TreebankWordDetokenizer()

def clean_title(title):
    tokens = [word for word in word_tokenize(title) if word not in stop_words]
    return detokenizer.detokenize([lemmatizer.lemmatize(word) for word in tokens])

# assign the whole column at once, which avoids chained indexing and its SettingWithCopyWarning
data['job_title'] = data['job_title'].apply(clean_title)


In [243]:
data.head(2)

Unnamed: 0,id,job_title,location,connection
0,1,2019 c bauer college business graduate magna c...,houston texas,85
1,2,native english teacher epik english program korea,kanada,500+


In [244]:
# Step 3: replace acronyms and normalize
# first, lowercase all keys/values of the previously defined acronyms dict
lc_acronyms = {k.lower(): v.lower() for k, v in acronyms.items()}
print(lc_acronyms)
data = data.replace(lc_acronyms, regex=True)

{'gis': 'geographic information system', 'hris': 'human resources information system', 'mes': 'manufacturing execution system', 'svp': 'senior vice president', 'chro': 'chief human resources officer', 'csr': 'corporate social responsibility', 'gphr': 'global professional in human resources', 'sphr': 'senior professional in human resources'}
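Because `replace(..., regex=True)` substitutes a pattern wherever it occurs, a short key such as `mes` or `hr` can also match inside longer words. A hedged sketch of whole-word expansion with the standard `re` module; `expand_acronyms` and the shortened dictionary are illustrative assumptions, not the notebook's code:

```python
import re

# a shortened, lower-cased copy of the acronym dictionary, plus 'hr'
ACRONYMS = {
    "hr": "human resources",
    "hris": "human resources information system",
    "mes": "manufacturing execution system",
    "chro": "chief human resources officer",
}

# one pattern that matches any key, but only as a whole word (\b is a word boundary)
ACRONYM_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, ACRONYMS)) + r")\b")


def expand_acronyms(text):
    """Replace each whole-word acronym with its expansion."""
    return ACRONYM_PATTERN.sub(lambda match: ACRONYMS[match.group(1)], text)


print(expand_acronyms("seeking hr hris generalist position"))
print(expand_acronyms("james chro"))
```

With the word boundaries in place, `mes` inside `james` is left alone, and `hris` is expanded as a unit instead of being split at `hr`.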


In [245]:
print('new titles:', data['job_title'])

new titles: 0     2019 c bauer college business graduate magna c...
1     native english teacher epik english program korea
2                  aspiring human resource professional
3                   people development coordinator ryan
4          advisory board member celal bayar university
5                    aspiring human resource specialist
6     student humber college aspiring human resource...
7                      human resource senior specialist
8     seeking human resource human resourcesis gener...
9                            student chapman university
10    senior vice president chuman resourceso market...
11    human resource coordinator intercontinental bu...
12    aspiring human resource management student see...
13                   seeking human resource opportunity
14    experienced retail manager aspiring human reso...
15      human resource staffing recruiting professional
16                  human resource specialist luxottica
17    director human resource north 

In [246]:
#we will normalize the connections to be between 0-1. We will count 500+ as 500

# Function to normalize scores to between 0-1
def normalize_score(score):
    # strip whitespace first, since raw values can carry a trailing space ('500+ ')
    if str(score).strip() == '500+':
        score = 500
    return float(score)/500

# Applying the function to the 'connection' column to create a new 'normalized_connections' column
data['normalized_connections'] = data['connection'].apply(normalize_score)


# Word embedding

In [247]:
# Calculate the similarity between each job title and the given keywords (stored in the variable 'keywords' at the top of the notebook; initially 'aspiring human resources')

# Download the word2vec model
model_name = "word2vec-google-news-300"
word_vectors = api.load(model_name)

# Tokenize and encode the job titles using word2vec
job_title_encodings = data["job_title"].apply(
    lambda title: np.mean([word_vectors[word] for word in title.split() if word in word_vectors], axis=0)
)

# Compute the cosine similarity between the encoded job titles and the keyword
keyword_encoding = np.mean([word_vectors[word] for word in keywords.split() if word in word_vectors], axis=0)
similarity_scores = np.vstack(job_title_encodings.apply(lambda encoding: cosine_similarity(encoding.reshape(1, -1), keyword_encoding.reshape(1, -1)))).ravel()

# Add the similarity scores as a new column in the DataFrame
data["similarity_score"] = similarity_scores
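The scoring step averages the word vectors of each title and compares the result with the averaged keyword vector by cosine similarity. A toy sketch of that computation in plain Python; the three-dimensional `toy_vectors` are invented for illustration and are not real word2vec values:

```python
import math

# invented 3-d vectors standing in for the 300-d word2vec embeddings
toy_vectors = {
    "aspiring": [0.9, 0.1, 0.0],
    "human": [0.1, 0.9, 0.1],
    "resources": [0.2, 0.8, 0.2],
    "artist": [0.0, 0.1, 0.9],
}


def mean_vector(text, vectors):
    """Average the vectors of the in-vocabulary words; None if no word is known."""
    known = [vectors[word] for word in text.split() if word in vectors]
    if not known:
        return None
    return [sum(values) / len(known) for values in zip(*known)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


keyword_vec = mean_vector("aspiring human resources", toy_vectors)
for title in ["aspiring human resources professional", "artist"]:
    print(title, round(cosine(mean_vector(title, toy_vectors), keyword_vec), 3))
```

Out-of-vocabulary words are simply skipped, so a title differing from the keywords only by an unknown word gets a similarity of 1. A title with no known word at all returns `None` in this sketch and needs explicit handling.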


In [248]:
data.head()

Unnamed: 0,id,job_title,location,connection,normalized_connections,similarity_score
0,1,2019 c bauer college business graduate magna c...,houston texas,85,0.17,0.561702
1,2,native english teacher epik english program korea,kanada,500+,1.0,0.218221
2,3,aspiring human resource professional,raleigh durham north carolina area,44,0.088,0.872865
3,4,people development coordinator ryan,denton texas,500+,1.0,0.287687
4,5,advisory board member celal bayar university,i̇zmir türkiye,500+,1.0,0.208822


In [249]:
# Repeat the similarity calculation between the job titles and 'keywords', this time with BERT embeddings instead of word2vec

# Load the BERT tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Tokenize and encode the job titles using BERT
job_title_encodings = data["job_title"].apply(
    lambda title: model(**tokenizer(title, return_tensors="pt")).pooler_output.detach().numpy()
)

# Compute the cosine similarity between the encoded job titles and the keyword
keyword_encoding = model(**tokenizer(keywords, return_tensors="pt")).pooler_output.detach().numpy()
similarity_scores = np.vstack(job_title_encodings.apply(lambda encoding: cosine_similarity(encoding, keyword_encoding))).ravel()

# Add the similarity scores as a new column in the DataFrame
data["similarity_score_BERT"] = similarity_scores



Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [250]:
data = data.sort_values(by='similarity_score', ascending=False)
data.head()

Unnamed: 0,id,job_title,location,connection,normalized_connections,similarity_score,similarity_score_BERT
2,3,aspiring human resource professional,raleigh durham north carolina area,44,0.088,0.872865,0.933594
45,97,aspiring human resource professional,kokomo indiana area,71,0.142,0.872865,0.933594
5,6,aspiring human resource specialist,greater new york city area,1,0.002,0.836409,0.964565
21,73,aspiring human resource manager seeking intern...,houston texas area,7,0.014,0.822812,0.922119
22,74,human resource professional,greater boston area,16,0.032,0.798362,0.958796


In [251]:
data.tail()

Unnamed: 0,id,job_title,location,connection,normalized_connections,similarity_score,similarity_score_BERT
4,5,advisory board member celal bayar university,i̇zmir türkiye,500+,1.0,0.208822,0.839048
51,103,always set success,greater los angeles area,500+,1.0,0.206629,0.98185
33,85,rrp brand portfolio executive jti japan tobacc...,greater philadelphia area,500+,1.0,0.188941,0.87501
55,1113,artist,x,500,1.0,0.124317,0.975621
54,1112,na,x,0,0.0,0.043954,0.943436


In [252]:
#reset dataframe index
find = data.reset_index()
#find where the control rows (ids above 1110) are now
find.loc[find['id'] > 1110]

Unnamed: 0,index,id,job_title,location,connection,normalized_connections,similarity_score,similarity_score_BERT
44,53,1111,machine learning,x,150,0.3,0.271537,0.908728
54,55,1113,artist,x,500,1.0,0.124317,0.975621
55,54,1112,na,x,0,0.0,0.043954,0.943436


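We can see that our controls show up quite low when sorting by word2vec similarity. The BERT scores, by contrast, are high for all rows shown, controls included (0.91 to 0.98), so they separate the candidates less clearly. The fitness score below uses the word2vec similarity.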

In [253]:
#now make a new column fitness score based on an equation taking similarity_score (word2vec) and normalized_connections

# calculate the new column based on the equation
data['fitness_score'] = data['similarity_score'] * 0.9 + data['normalized_connections'] * 0.1
data = data.sort_values(by='fitness_score', ascending=False)
data.tail(20)

Unnamed: 0,id,job_title,location,connection,normalized_connections,similarity_score,similarity_score_BERT,fitness_score
38,90,undergraduate research assistant styczynski lab,greater atlanta area,155,0.31,0.338278,0.807109,0.33545
34,86,information system specialist programmer love ...,gaithersburg maryland,4,0.008,0.358062,0.885482,0.323056
52,104,director administration excellence logging,katy texas,500+,1.0,0.240087,0.946331,0.316078
28,80,junior me engineer information system,myrtle beach south carolina area,52,0.104,0.335249,0.982931,0.312124
40,92,seeking employment opportunity within customer...,torrance california,64,0.128,0.324646,0.893608,0.304981
1,2,native english teacher epik english program korea,kanada,500+,1.0,0.218221,0.902306,0.296399
44,96,student indiana university kokomo business man...,lafayette indiana,19,0.038,0.322475,0.842853,0.294028
4,5,advisory board member celal bayar university,i̇zmir türkiye,500+,1.0,0.208822,0.839048,0.28794
51,103,always set success,greater los angeles area,500+,1.0,0.206629,0.98185,0.285966
50,102,business intelligence analytics traveler,greater new york city area,49,0.098,0.295893,0.85983,0.276104
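As a worked check of the weighting, the row for id 104 in the table above has a word2vec similarity of 0.240087 and normalized connections of 1.0, which gives 0.9 × 0.240087 + 0.1 × 1.0 ≈ 0.316078. The same arithmetic as a small function; the name `fitness_score` and its keyword arguments are illustrative assumptions:

```python
def fitness_score(similarity, connections, w_similarity=0.9, w_connections=0.1):
    """Weighted sum of title similarity and normalized connection count."""
    return w_similarity * similarity + w_connections * connections


# values taken from the row for id 104 in the table above
print(round(fitness_score(0.240087, 1.0), 6))
```

Keeping the weights as arguments makes it easy to test how strongly the connection count should influence the ranking.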


# Starring candidates

In [254]:
data.head(10)

Unnamed: 0,id,job_title,location,connection,normalized_connections,similarity_score,similarity_score_BERT,fitness_score
45,97,aspiring human resource professional,kokomo indiana area,71,0.142,0.872865,0.933594,0.799779
2,3,aspiring human resource professional,raleigh durham north carolina area,44,0.088,0.872865,0.933594,0.794379
8,10,seeking human resource human resourcesis gener...,greater philadelphia area,500+,1.0,0.747503,0.845393,0.772752
12,27,aspiring human resource management student see...,houston texas area,500+,1.0,0.737665,0.936784,0.763899
5,6,aspiring human resource specialist,greater new york city area,1,0.002,0.836409,0.964565,0.752968
49,101,human resource generalist loparex,raleigh durham north carolina area,500+,1.0,0.724036,0.98047,0.751632
26,78,human resource generalist schwan,amerika birleşik devletleri,500+,1.0,0.724036,0.97852,0.751632
21,73,aspiring human resource manager seeking intern...,houston texas area,7,0.014,0.822812,0.922119,0.741931
13,28,seeking human resource opportunity,chicago illinois,390,0.78,0.721559,0.975341,0.727403
22,74,human resource professional,greater boston area,16,0.032,0.798362,0.958796,0.721726


In [255]:
# add a new column named 'starred' and set all values to 0
data['starred'] = 0

# set the value of the id 3 in 'starred' to 1 as an example
data.loc[data['id'] == 3, 'starred'] = 1

#update the fitness_score based on starring
data['fitness_score'] = data['similarity_score'] * 0.9 + data['normalized_connections'] * 0.1 + data['starred']
data = data.sort_values(by='fitness_score', ascending=False)
# print the updated dataframe
data.head(10)


Unnamed: 0,id,job_title,location,connection,normalized_connections,similarity_score,similarity_score_BERT,fitness_score,starred
2,3,aspiring human resource professional,raleigh durham north carolina area,44,0.088,0.872865,0.933594,1.794379,1
45,97,aspiring human resource professional,kokomo indiana area,71,0.142,0.872865,0.933594,0.799779,0
8,10,seeking human resource human resourcesis gener...,greater philadelphia area,500+,1.0,0.747503,0.845393,0.772752,0
12,27,aspiring human resource management student see...,houston texas area,500+,1.0,0.737665,0.936784,0.763899,0
5,6,aspiring human resource specialist,greater new york city area,1,0.002,0.836409,0.964565,0.752968,0
49,101,human resource generalist loparex,raleigh durham north carolina area,500+,1.0,0.724036,0.98047,0.751632,0
26,78,human resource generalist schwan,amerika birleşik devletleri,500+,1.0,0.724036,0.97852,0.751632,0
21,73,aspiring human resource manager seeking intern...,houston texas area,7,0.014,0.822812,0.922119,0.741931,0
13,28,seeking human resource opportunity,chicago illinois,390,0.78,0.721559,0.975341,0.727403,0
22,74,human resource professional,greater boston area,16,0.032,0.798362,0.958796,0.721726,0


In [256]:
# star more candidates as further examples: ids 27, 6, 73
data.loc[data['id'].isin([27, 6, 73]), 'starred'] = 1

#update the fitness_score based on starring
data['fitness_score'] = data['similarity_score'] * 0.9 + data['normalized_connections'] * 0.1 + data['starred']
data = data.sort_values(by='fitness_score', ascending=False)
# print the updated dataframe
data.head(10)

Unnamed: 0,id,job_title,location,connection,normalized_connections,similarity_score,similarity_score_BERT,fitness_score,starred
2,3,aspiring human resource professional,raleigh durham north carolina area,44,0.088,0.872865,0.933594,1.794379,1
12,27,aspiring human resource management student see...,houston texas area,500+,1.0,0.737665,0.936784,1.763899,1
5,6,aspiring human resource specialist,greater new york city area,1,0.002,0.836409,0.964565,1.752968,1
21,73,aspiring human resource manager seeking intern...,houston texas area,7,0.014,0.822812,0.922119,1.741931,1
45,97,aspiring human resource professional,kokomo indiana area,71,0.142,0.872865,0.933594,0.799779,0
8,10,seeking human resource human resourcesis gener...,greater philadelphia area,500+,1.0,0.747503,0.845393,0.772752,0
49,101,human resource generalist loparex,raleigh durham north carolina area,500+,1.0,0.724036,0.98047,0.751632,0
26,78,human resource generalist schwan,amerika birleşik devletleri,500+,1.0,0.724036,0.97852,0.751632,0
13,28,seeking human resource opportunity,chicago illinois,390,0.78,0.721559,0.975341,0.727403,0
22,74,human resource professional,greater boston area,16,0.032,0.798362,0.958796,0.721726,0
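Because every base score here lies between 0 and 1, adding 1 for a star places each starred candidate above all unstarred ones while keeping the original order within each group. A minimal sketch of that re-ranking on plain tuples; `rerank` is a hypothetical helper and the scores are copied from the ranking before starring:

```python
def rerank(candidates, starred_ids, star_bonus=1.0):
    """Sort (id, base_score) pairs by base score plus a bonus for starred ids."""
    scored = [
        (cid, base + (star_bonus if cid in starred_ids else 0.0))
        for cid, base in candidates
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# (id, fitness_score before starring), copied from the ranking above
candidates = [(97, 0.799779), (3, 0.794379), (10, 0.772752), (27, 0.763899), (6, 0.752968)]

print([cid for cid, _ in rerank(candidates, starred_ids={3})])
print([cid for cid, _ in rerank(candidates, starred_ids={3, 27, 6})])
```

Note that this bonus only moves the starred candidates themselves. Unstarred candidates keep their relative order, which is what motivates learning a ranking model from the stars.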


# Models

In [257]:
def ranking_model(data):
    # features and labels: starred candidates are the positive examples
    feature_cols = ['normalized_connections', 'similarity_score']
    X = data[feature_cols]
    y = data.starred

    # split into training and test sets, re-drawing until both contain a starred item
    while True:
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
        print("Number of ranked items in training set:", y_train.sum())
        print("Number of ranked items in test set:", y_test.sum())
        if y_train.sum() > 0 and y_test.sum() > 0:
            break
    # hold out part of the training data for validation
    X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2)

    # each split is treated as a single query group
    query_train = [X_train.shape[0]]
    query_val = [X_val.shape[0]]

    # Train the LightGBM ranker
    gbm = lgb.LGBMRanker(
        objective="lambdarank",
        metric="ndcg",
        min_child_samples=2,
    )

    # callbacks take the place of the early_stopping_rounds argument, which LightGBM 4.0 removed from fit()
    gbm.fit(X_train, y_train, group=query_train,
            eval_set=[(X_val, y_val)], eval_group=[query_val],
            eval_at=[5, 10, 20],
            callbacks=[lgb.early_stopping(stopping_rounds=50), lgb.log_evaluation(period=1)])

    # score every candidate (training, validation and test rows alike)
    return gbm.predict(X)

In [258]:
predictions = ranking_model(data)

Number of ranked items in training set: 3
Number of ranked items in test set: 1
[1]	valid_0's ndcg@5: 0.877215	valid_0's ndcg@10: 0.877215	valid_0's ndcg@20: 0.877215
Training until validation scores don't improve for 50 rounds
[2]	valid_0's ndcg@5: 0.877215	valid_0's ndcg@10: 0.877215	valid_0's ndcg@20: 0.877215
[3]	valid_0's ndcg@5: 0.919721	valid_0's ndcg@10: 0.919721	valid_0's ndcg@20: 0.919721
[4]	valid_0's ndcg@5: 1	valid_0's ndcg@10: 1	valid_0's ndcg@20: 1
[5]	valid_0's ndcg@5: 1	valid_0's ndcg@10: 1	valid_0's ndcg@20: 1
[6]	valid_0's ndcg@5: 1	valid_0's ndcg@10: 1	valid_0's ndcg@20: 1
[7]	valid_0's ndcg@5: 1	valid_0's ndcg@10: 1	valid_0's ndcg@20: 1
[8]	valid_0's ndcg@5: 0.919721	valid_0's ndcg@10: 0.919721	valid_0's ndcg@20: 0.919721
[9]	valid_0's ndcg@5: 0.919721	valid_0's ndcg@10: 0.919721	valid_0's ndcg@20: 0.919721
[10]	valid_0's ndcg@5: 0.919721	valid_0's ndcg@10: 0.919721	valid_0's ndcg@20: 0.919721
[11]	valid_0's ndcg@5: 0.919721	valid_0's ndcg@10: 0.919721	valid_0's nd
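The `ndcg@k` values in the log measure how close the predicted order is to the ideal one, with starred candidates as the relevant items. A minimal sketch of the metric, assuming the common definition with gain 2^rel − 1 and a log2 position discount; `ndcg_at_k` is an illustrative helper, not LightGBM's internal implementation:

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k items, in ranked order."""
    return sum(
        (2 ** rel - 1) / math.log2(position + 2)
        for position, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances, k):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# starred labels listed in the order the model ranked the candidates
print(ndcg_at_k([1, 0, 0, 0], k=5))   # starred candidate ranked first
print(ndcg_at_k([0, 1, 0, 0], k=5))   # starred candidate ranked second
```

With only a handful of starred candidates in a small validation split, the metric can jump between a few discrete values from one boosting round to the next, so it is best read as a rough signal.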

In [202]:
from keras.models import Model
from keras.layers import Input, Dense, Dropout
from keras.optimizers import Adam
from keras.losses import binary_crossentropy
from keras.callbacks import EarlyStopping

# Define the RankNet model
def ranknet_model(num_features):
    input1 = Input(shape=(num_features,))
    input2 = Input(shape=(num_features,))

    shared_layer1 = Dense(64, activation='relu')
    shared_layer2 = Dense(32, activation='relu')

    h1 = shared_layer1(input1)
    h1 = Dropout(0.5)(h1)
    h2 = shared_layer2(h1)

    h3 = shared_layer1(input2)
    h3 = Dropout(0.5)(h3)
    h4 = shared_layer2(h3)

    s = Dense(1, activation='sigmoid', name='main_output')(h2-h4)

    model = Model(inputs=[input1, input2], outputs=s)

    return model

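# Build training pairs: RankNet learns from the difference between two candidates,
# so each pair combines a starred candidate with an unstarred one
def make_pairs(X, y):
    X = np.asarray(X, dtype='float32')
    y = np.asarray(y)
    starred, unstarred = X[y == 1], X[y == 0]
    n_pairs = len(starred) * len(unstarred)
    preferred = np.repeat(starred, len(unstarred), axis=0)
    other = np.tile(unstarred, (len(starred), 1))
    # include each pair in both orders: label 1 when the first input is the starred one
    first = np.concatenate([preferred, other])
    second = np.concatenate([other, preferred])
    labels = np.concatenate([np.ones(n_pairs), np.zeros(n_pairs)])
    return first, second, labels

# Define a function to train the RankNet model
def train_ranknet_model(X_train, y_train, X_val, y_val, num_epochs=100, batch_size=128, learning_rate=0.001):
    num_features = X_train.shape[1]
    model = ranknet_model(num_features)
    optimizer = Adam(learning_rate=learning_rate)
    model.compile(optimizer=optimizer, loss=binary_crossentropy)
    early_stopping = EarlyStopping(monitor='val_loss', patience=10, verbose=1, mode='auto')

    first_train, second_train, labels_train = make_pairs(X_train, y_train)
    first_val, second_val, labels_val = make_pairs(X_val, y_val)

    model.fit([first_train, second_train], labels_train,
              validation_data=([first_val, second_val], labels_val),
              epochs=num_epochs,
              batch_size=batch_size,
              callbacks=[early_stopping])

    return model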


def ranking_model2(data):
    #make x and y
    feature_cols = ['normalized_connections', 'similarity_score']
    X = data[feature_cols]
    y = data.starred 
    
    #split data into training and testing, check both have starred items
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    # Make sure at least one example of data with positive supervisory signal is in the training set
    print("Number of ranked items in training set:", y_train.sum())
    print("Number of ranked items in test set:",y_test.sum())
    while y_train.sum() ==0 or y_test.sum()== 0:
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
        # Make sure at least one example of data with positive supervisory signal is in the training set
        print("Number of ranked items in training set:", y_train.sum())
        print("Number of ranked items in test set:",y_test.sum())
    
    model = train_ranknet_model(X_train, y_train, X_test, y_test)

    return model
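RankNet is trained on pairs: the model should output a probability near 1 when the first candidate of a pair ought to rank above the second. A small sketch of the pairwise probability and loss in plain Python, assuming the standard formulation P = sigmoid(s_i − s_j) with binary cross-entropy; the function names are illustrative:

```python
import math


def pair_probability(score_i, score_j):
    """Probability that item i should rank above item j, given their scores."""
    return 1.0 / (1.0 + math.exp(-(score_i - score_j)))


def pair_loss(score_i, score_j, i_preferred):
    """Binary cross-entropy for one pair; i_preferred is 1 or 0."""
    p = pair_probability(score_i, score_j)
    return -(i_preferred * math.log(p) + (1 - i_preferred) * math.log(1 - p))


print(pair_probability(2.0, 0.5))      # i scored higher than j
print(pair_probability(1.0, 1.0))      # identical scores carry no preference
print(pair_loss(2.0, 0.5, i_preferred=1))
```

When both inputs of a pair are the same candidate, the score difference is always zero, so training pairs need to combine a starred candidate with an unstarred one.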

In [203]:
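model2 = ranking_model2(data)
# the model takes two feature matrices (one per candidate in a pair);
# here every candidate is compared against the first row of data
features = data[['normalized_connections', 'similarity_score']].values.astype('float32')
reference = np.repeat(features[:1], len(features), axis=0)
model2.predict([features, reference])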

Number of ranked items in training set: 8
Number of ranked items in test set: 2
Epoch 1/100
...
Epoch 100/100


# Updating fitness and ranking

In [259]:
#rank based on model output
df = data
df['ranking'] = predictions
df = df.sort_values(by="ranking", ascending=False)

In [260]:
df.head()

Unnamed: 0,id,job_title,location,connection,normalized_connections,similarity_score,similarity_score_BERT,fitness_score,starred,ranking
5,6,aspiring human resource specialist,greater new york city area,1,0.002,0.836409,0.964565,1.752968,1,0.63396
21,73,aspiring human resource manager seeking intern...,houston texas area,7,0.014,0.822812,0.922119,1.741931,1,0.63396
2,3,aspiring human resource professional,raleigh durham north carolina area,44,0.088,0.872865,0.933594,1.794379,1,0.268402
45,97,aspiring human resource professional,kokomo indiana area,71,0.142,0.872865,0.933594,0.799779,0,0.268402
20,72,business management major aspiring human resou...,monroe louisiana area,5,0.01,0.716674,0.931661,0.646006,0,-0.350651


In [261]:
#reset dataframe index
df = df.reset_index()
#find where the starred items are now
df.loc[df['starred'] == 1]

Unnamed: 0,index,id,job_title,location,connection,normalized_connections,similarity_score,similarity_score_BERT,fitness_score,starred,ranking
0,5,6,aspiring human resource specialist,greater new york city area,1,0.002,0.836409,0.964565,1.752968,1,0.63396
1,21,73,aspiring human resource manager seeking intern...,houston texas area,7,0.014,0.822812,0.922119,1.741931,1,0.63396
2,2,3,aspiring human resource professional,raleigh durham north carolina area,44,0.088,0.872865,0.933594,1.794379,1,0.268402
15,12,27,aspiring human resource management student see...,houston texas area,500+,1.0,0.737665,0.936784,1.763899,1,-0.651039


Three of the four starred candidates now hold the top three positions; the fourth (id 27) sits at position 15.
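
To put a number on where the starred candidates land, the positions from the output above can be summarised with a simple precision-at-k. This is a minimal sketch: the helper names are hypothetical, and the list length of 20 is an assumption made only for illustration.

```python
def starred_positions(starred_in_rank_order):
    """Zero-based positions of starred candidates in the ranked list."""
    return [pos for pos, flag in enumerate(starred_in_rank_order) if flag == 1]

def precision_at_k(starred_in_rank_order, k):
    """Share of the top-k positions held by starred candidates."""
    return sum(starred_in_rank_order[:k]) / k

# Illustrative flags: starred candidates at positions 0, 1, 2 and 15,
# as in the filtered output above; the list length of 20 is an assumption.
flags = [0] * 20
for pos in (0, 1, 2, 15):
    flags[pos] = 1

print(starred_positions(flags))  # [0, 1, 2, 15]
print(precision_at_k(flags, 4))  # 0.75
```

Tracking the same two numbers after each starring action gives a direct way to show whether the ranking is improving.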

# Trying with more starred candidates

In [262]:
data.head(20)

Unnamed: 0,id,job_title,location,connection,normalized_connections,similarity_score,similarity_score_BERT,fitness_score,starred,ranking
2,3,aspiring human resource professional,raleigh durham north carolina area,44,0.088,0.872865,0.933594,1.794379,1,0.268402
12,27,aspiring human resource management student see...,houston texas area,500+,1.0,0.737665,0.936784,1.763899,1,-0.651039
5,6,aspiring human resource specialist,greater new york city area,1,0.002,0.836409,0.964565,1.752968,1,0.63396
21,73,aspiring human resource manager seeking intern...,houston texas area,7,0.014,0.822812,0.922119,1.741931,1,0.63396
45,97,aspiring human resource professional,kokomo indiana area,71,0.142,0.872865,0.933594,0.799779,0,0.268402
8,10,seeking human resource human resourcesis gener...,greater philadelphia area,500+,1.0,0.747503,0.845393,0.772752,0,-0.651039
49,101,human resource generalist loparex,raleigh durham north carolina area,500+,1.0,0.724036,0.98047,0.751632,0,-0.651039
26,78,human resource generalist schwan,amerika birleşik devletleri,500+,1.0,0.724036,0.97852,0.751632,0,-0.651039
13,28,seeking human resource opportunity,chicago illinois,390,0.78,0.721559,0.975341,0.727403,0,-0.651039
22,74,human resource professional,greater boston area,16,0.032,0.798362,0.958796,0.721726,0,-0.651039


In [263]:
# Star six more candidates as additional positive examples
more_starred_ids = [101, 78, 74, 68, 8, 82]
data.loc[data['id'].isin(more_starred_ids), 'starred'] = 1

#update the fitness_score based on starring
data['fitness_score'] = data['similarity_score'] * 0.9 + data['normalized_connections'] * 0.1 + data['starred']
data = data.sort_values(by='fitness_score', ascending=False)
# print the updated dataframe
data.head(10)

Unnamed: 0,id,job_title,location,connection,normalized_connections,similarity_score,similarity_score_BERT,fitness_score,starred,ranking
2,3,aspiring human resource professional,raleigh durham north carolina area,44,0.088,0.872865,0.933594,1.794379,1,0.268402
12,27,aspiring human resource management student see...,houston texas area,500+,1.0,0.737665,0.936784,1.763899,1,-0.651039
5,6,aspiring human resource specialist,greater new york city area,1,0.002,0.836409,0.964565,1.752968,1,0.63396
49,101,human resource generalist loparex,raleigh durham north carolina area,500+,1.0,0.724036,0.98047,1.751632,1,-0.651039
26,78,human resource generalist schwan,amerika birleşik devletleri,500+,1.0,0.724036,0.97852,1.751632,1,-0.651039
21,73,aspiring human resource manager seeking intern...,houston texas area,7,0.014,0.822812,0.922119,1.741931,1,0.63396
22,74,human resource professional,greater boston area,16,0.032,0.798362,0.958796,1.721726,1,-0.651039
16,68,human resource specialist luxottica,greater new york city area,500+,1.0,0.674604,0.940522,1.707144,1,-0.651039
7,8,human resource senior specialist,san francisco bay area,500+,1.0,0.63809,0.961523,1.674281,1,-0.651039
30,82,aspiring human resource professional energetic...,austin texas area,174,0.348,0.7004,0.919451,1.66516,1,-0.651039
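
The starring update can be wrapped in a small function so that every new star triggers the same recomputation. The sketch below uses the weights from the cell above (0.9 for similarity, 0.1 for connections, plus 1 for a star); `update_fitness` is a hypothetical name, and the three rows are copied from the tables in this notebook.

```python
import pandas as pd

def update_fitness(df, starred_ids=(), w_sim=0.9, w_conn=0.1):
    """Star the given ids, recompute fitness_score and return the re-ranked frame."""
    out = df.copy()
    out.loc[out["id"].isin(list(starred_ids)), "starred"] = 1
    out["fitness_score"] = (
        w_sim * out["similarity_score"]
        + w_conn * out["normalized_connections"]
        + out["starred"]
    )
    return out.sort_values(by="fitness_score", ascending=False)

demo = pd.DataFrame({
    "id": [3, 97, 74],
    "normalized_connections": [0.088, 0.142, 0.032],
    "similarity_score": [0.872865, 0.872865, 0.798362],
    "starred": [1, 0, 0],
})

before = update_fitness(demo)
after = update_fitness(demo, starred_ids=[74])
print(before["id"].tolist())  # [3, 97, 74]
print(after["id"].tolist())   # [3, 74, 97]
```

Because a star adds a full point while the weighted features sum to at most 1, any starred candidate outranks every unstarred one under this formula.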


In [264]:
predictions = ranking_model(data)
df = data
df['ranking'] = predictions
df = df.sort_values(by='ranking', ascending=False)

Number of ranked items in training set: 6
Number of ranked items in test set: 4
[1]	valid_0's ndcg@5: 0.877215	valid_0's ndcg@10: 0.877215	valid_0's ndcg@20: 0.877215
Training until validation scores don't improve for 50 rounds
[2]	valid_0's ndcg@5: 0.624051	valid_0's ndcg@10: 0.624051	valid_0's ndcg@20: 0.624051
[3]	valid_0's ndcg@5: 0.624051	valid_0's ndcg@10: 0.624051	valid_0's ndcg@20: 0.624051
[4]	valid_0's ndcg@5: 0.624051	valid_0's ndcg@10: 0.624051	valid_0's ndcg@20: 0.624051
[5]	valid_0's ndcg@5: 0.624051	valid_0's ndcg@10: 0.624051	valid_0's ndcg@20: 0.624051
[6]	valid_0's ndcg@5: 0.624051	valid_0's ndcg@10: 0.624051	valid_0's ndcg@20: 0.624051
[7]	valid_0's ndcg@5: 0.624051	valid_0's ndcg@10: 0.624051	valid_0's ndcg@20: 0.624051
[8]	valid_0's ndcg@5: 0.624051	valid_0's ndcg@10: 0.624051	valid_0's ndcg@20: 0.624051
[9]	valid_0's ndcg@5: 0.624051	valid_0's ndcg@10: 0.624051	valid_0's ndcg@20: 0.624051
[10]	valid_0's ndcg@5: 0.624051	valid_0's ndcg@10: 0.624051	valid_0's ndcg@2

In [265]:
df.head(20)

Unnamed: 0,id,job_title,location,connection,normalized_connections,similarity_score,similarity_score_BERT,fitness_score,starred,ranking
2,3,aspiring human resource professional,raleigh durham north carolina area,44,0.088,0.872865,0.933594,1.794379,1,0.2
5,6,aspiring human resource specialist,greater new york city area,1,0.002,0.836409,0.964565,1.752968,1,0.2
21,73,aspiring human resource manager seeking intern...,houston texas area,7,0.014,0.822812,0.922119,1.741931,1,0.2
22,74,human resource professional,greater boston area,16,0.032,0.798362,0.958796,1.721726,1,0.2
49,101,human resource generalist loparex,raleigh durham north carolina area,500+,1.0,0.724036,0.98047,1.751632,1,0.2
26,78,human resource generalist schwan,amerika birleşik devletleri,500+,1.0,0.724036,0.97852,1.751632,1,0.2
16,68,human resource specialist luxottica,greater new york city area,500+,1.0,0.674604,0.940522,1.707144,1,0.2
15,67,human resource staffing recruiting professional,jackson mississippi area,500+,1.0,0.684689,0.89021,0.71622,0,0.2
45,97,aspiring human resource professional,kokomo indiana area,71,0.142,0.872865,0.933594,0.799779,0,0.175538
8,10,seeking human resource human resourcesis gener...,greater philadelphia area,500+,1.0,0.747503,0.845393,0.772752,0,0.175538


In [266]:
#reset dataframe index
df = df.reset_index()
#find where the starred items are now
df.loc[df['starred'] == 1]

Unnamed: 0,index,id,job_title,location,connection,normalized_connections,similarity_score,similarity_score_BERT,fitness_score,starred,ranking
0,2,3,aspiring human resource professional,raleigh durham north carolina area,44,0.088,0.872865,0.933594,1.794379,1,0.2
1,5,6,aspiring human resource specialist,greater new york city area,1,0.002,0.836409,0.964565,1.752968,1,0.2
2,21,73,aspiring human resource manager seeking intern...,houston texas area,7,0.014,0.822812,0.922119,1.741931,1,0.2
3,22,74,human resource professional,greater boston area,16,0.032,0.798362,0.958796,1.721726,1,0.2
4,49,101,human resource generalist loparex,raleigh durham north carolina area,500+,1.0,0.724036,0.98047,1.751632,1,0.2
5,26,78,human resource generalist schwan,amerika birleşik devletleri,500+,1.0,0.724036,0.97852,1.751632,1,0.2
6,16,68,human resource specialist luxottica,greater new york city area,500+,1.0,0.674604,0.940522,1.707144,1,0.2
11,12,27,aspiring human resource management student see...,houston texas area,500+,1.0,0.737665,0.936784,1.763899,1,0.175538
17,30,82,aspiring human resource professional energetic...,austin texas area,174,0.348,0.7004,0.919451,1.66516,1,-0.2
41,7,8,human resource senior specialist,san francisco bay area,500+,1.0,0.63809,0.961523,1.674281,1,-0.2


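Problem: the model produces only a few distinct ranking scores (0.2, 0.175538 and -0.2 in the output above), so many candidates are tied. Seven starred candidates hold positions 0 to 6, but the other three fall to positions 11, 17 and 41, so the starred candidates are still spread through the list.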
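
When many candidates share the same model score, their relative order depends only on the sort. One option is a secondary sort key. The sketch below breaks ties with `similarity_score`; both that choice and the helper name `rank_with_tiebreak` are assumptions, not something the notebook does.

```python
import pandas as pd

def rank_with_tiebreak(df, score_col="ranking", tiebreak_col="similarity_score"):
    """Sort by model score, then by a secondary column to order tied rows."""
    ranked = df.sort_values(by=[score_col, tiebreak_col], ascending=[False, False])
    return ranked.reset_index(drop=True)

# Three rows with values taken from the output above
demo = pd.DataFrame({
    "id": [10, 3, 97],
    "similarity_score": [0.747503, 0.872865, 0.872865],
    "ranking": [0.175538, 0.2, 0.175538],
})

ranked = rank_with_tiebreak(demo)
print(ranked["id"].tolist())  # [3, 97, 10]
```

This does not fix the underlying problem of a model that outputs too few distinct scores; it only makes the order within each tie deterministic and meaningful.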

## Result for ranking

I ranked candidates by a fitness score that combines cosine similarity (weight 0.9) and normalized connections (weight 0.1), and re-ranked whenever a candidate was manually starred by adding 1 to that candidate's score.
I then trained a LightGBM ranking model with `starred` as the target and re-ranked the candidates by the model's predicted scores.
With so small a positive class (n=4), the model did not train well, and the previously starred candidates did not all appear at the top of the new ranking.
Adding more starred candidates (n=10) and retraining gives better but still imperfect results: seven of the ten starred candidates rank at the top, while the other three fall to positions 11, 17 and 41.
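
The `ndcg@k` values in the LightGBM log can be reproduced by hand to compare rankings across starring rounds. The sketch below is a plain-Python NDCG with exponential gain, which I believe matches LightGBM's default label gain; treat that as an assumption. The flags encode the starred positions reported above (0 to 6, 11, 17 and 41), and unstarred rows after position 41 are omitted because they do not change the result.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k items (exponential gain)."""
    return sum(
        (2 ** rel - 1) / math.log2(pos + 2)
        for pos, rel in enumerate(relevances[:k])
    )

def ndcg_at_k(relevances, k):
    """DCG normalised by the DCG of the ideal (sorted) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Starred flags in rank order for the n=10 run
flags = [0] * 42
for pos in list(range(7)) + [11, 17, 41]:
    flags[pos] = 1

for k in (5, 10, 20):
    print(k, round(ndcg_at_k(flags, k), 4))
```

A perfect ranking scores 1.0 at every k, so the gap to 1.0 at k=10 and k=20 reflects the three starred candidates that fall outside the top seven.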