# Project 3 
## Goal(s):

Predict how fit a candidate is for the role based on the available information (variable `fit`).

## Success Metric(s):

Rank candidates based on a fitness score.

Re-rank candidates when a candidate is starred.

## Bonus(es):

We are interested in a robust algorithm. Tell us how your solution works and show us how your ranking gets better with each starring action.

How can we filter out candidates who should not be in this list in the first place?

Can we determine a cut-off point that would work for other roles without losing high potential candidates?

Do you have any ideas that we should explore so that we can even automate this procedure to prevent human bias?


In [60]:
# linear algebra
import numpy as np 

# data processing
import pandas as pd 

# data visualization
import seaborn as sns
%matplotlib inline
from matplotlib import pyplot as plt
from matplotlib import style

# Algorithms
from sklearn import linear_model
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
import xgboost as xgb
import lightgbm as lgb


#Cross validation
from sklearn.model_selection import KFold, cross_val_score

#Other
from sklearn import metrics
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

#class imbalance
from sklearn.metrics import f1_score
from sklearn.utils import resample

#embedding using BERT

from transformers import AutoTokenizer, AutoModel
from sklearn.metrics.pairwise import cosine_similarity


# Loading data

In [95]:
og_data = pd.read_excel('potential-talents.xlsx')
data = og_data.copy()  # work on a copy so the raw data stays untouched
keywords = 'aspiring human resources'

In [96]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 104 entries, 0 to 103
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   id          104 non-null    int64  
 1   job_title   104 non-null    object 
 2   location    104 non-null    object 
 3   connection  104 non-null    object 
 4   fit         0 non-null      float64
dtypes: float64(1), int64(1), object(3)
memory usage: 4.2+ KB


In [97]:
data.columns

Index(['id', 'job_title', 'location', 'connection', 'fit'], dtype='object')

In [98]:
data.describe()

Unnamed: 0,id,fit
count,104.0,0.0
mean,52.5,
std,30.166206,
min,1.0,
25%,26.75,
50%,52.5,
75%,78.25,
max,104.0,


In [99]:
data.head(55)

Unnamed: 0,id,job_title,location,connection,fit
0,1,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,
1,2,Native English Teacher at EPIK (English Progra...,Kanada,500+,
2,3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,
3,4,People Development Coordinator at Ryan,"Denton, Texas",500+,
4,5,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+,
5,6,Aspiring Human Resources Specialist,Greater New York City Area,1,
6,7,Student at Humber College and Aspiring Human R...,Kanada,61,
7,8,HR Senior Specialist,San Francisco Bay Area,500+,
8,9,Student at Humber College and Aspiring Human R...,Kanada,61,
9,10,Seeking Human Resources HRIS and Generalist Po...,Greater Philadelphia Area,500+,


Attributes:
id : unique identifier for candidate (numeric)

job_title : job title for candidate (text)

location : geographical location for candidate (text)

connection : number of connections the candidate has; 500+ means over 500 (text)

In [100]:
print('title: ', data.job_title.unique())
print('location: ', data.location.unique())
print('connections: ', data.connection.unique())

title:  ['2019 C.T. Bauer College of Business Graduate (Magna Cum Laude) and aspiring Human Resources professional'
 'Native English Teacher at EPIK (English Program in Korea)'
 'Aspiring Human Resources Professional'
 'People Development Coordinator at Ryan'
 'Advisory Board Member at Celal Bayar University'
 'Aspiring Human Resources Specialist'
 'Student at Humber College and Aspiring Human Resources Generalist'
 'HR Senior Specialist'
 'Seeking Human Resources HRIS and Generalist Positions'
 'Student at Chapman University'
 'SVP, CHRO, Marketing & Communications, CSR Officer | ENGIE | Houston | The Woodlands | Energy | GPHR | SPHR'
 'Human Resources Coordinator at InterContinental Buckhead Atlanta'
 'Aspiring Human Resources Management student seeking an internship'
 'Seeking Human Resources Opportunities'
 'Experienced Retail Manager and aspiring Human Resources Professional'
 'Human Resources, Staffing and Recruiting Professional'
 'Human Resources Specialist at Luxottica'
 'Dire

# Processing data

In [101]:
# First, normalize the connection counts to the range 0-1, counting '500+' as 500.

# Map a raw 'connection' value to a score between 0 and 1
def normalize_score(score):
    score = str(score).strip()  # the raw values may carry trailing whitespace, e.g. '500+ '
    if score == '500+':
        score = 500
    return float(score) / 500

# Apply the function to the 'connection' column to create 'normalized_connections'
data['normalized_connections'] = data['connection'].apply(normalize_score)


In [102]:
# Next, compute the similarity between each job title and the search phrase (stored in the variable 'keywords' near the top of the notebook; initially 'aspiring human resources')

# Load the BERT tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Tokenize and encode the job titles using BERT
job_title_encodings = data["job_title"].apply(
    lambda title: model(**tokenizer(title, return_tensors="pt")).pooler_output.detach().numpy()
)

# Compute the cosine similarity between the encoded job titles and the keyword
keyword_encoding = model(**tokenizer(keywords, return_tensors="pt")).pooler_output.detach().numpy()
similarity_scores = np.vstack(job_title_encodings.apply(lambda encoding: cosine_similarity(encoding, keyword_encoding))).ravel()

# Add the similarity scores as a new column in the DataFrame
data["similarity_score"] = similarity_scores


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
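
For reference, the score returned by `cosine_similarity` is the dot product of two embeddings divided by the product of their norms. Here is a minimal pure-Python sketch on made-up 3-dimensional vectors. Real `bert-base-uncased` embeddings have 768 dimensions, and the vectors and the helper name `cosine` are illustrative assumptions, not part of the notebook.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

keyword_vec = [1.0, 2.0, 0.0]       # toy stand-in for the keyword embedding
title_same_dir = [2.0, 4.0, 0.0]    # same direction, different length
title_orthogonal = [0.0, 0.0, 3.0]  # no shared direction

print(cosine(keyword_vec, title_same_dir))    # close to 1
print(cosine(keyword_vec, title_orthogonal))  # 0
```

The score depends only on the direction of the vectors, not their length. In the rows displayed after this cell, every title scores above 0.76, so the measure separates candidates only within a fairly narrow band.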


In [103]:
data.head()

Unnamed: 0,id,job_title,location,connection,fit,normalized_connections,similarity_score
0,1,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,,0.17,0.764336
1,2,Native English Teacher at EPIK (English Progra...,Kanada,500+,,1.0,0.89986
2,3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,,0.088,0.937843
3,4,People Development Coordinator at Ryan,"Denton, Texas",500+,,1.0,0.984712
4,5,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+,,1.0,0.824166


In [104]:
# Create a 'fitness_score' column as a weighted sum:
# 90% similarity_score + 10% normalized_connections
data['fitness_score'] = data['similarity_score'] * 0.9 + data['normalized_connections'] * 0.1
data = data.sort_values(by='fitness_score', ascending=False)
data.head(20)

Unnamed: 0,id,job_title,location,connection,fit,normalized_connections,similarity_score,fitness_score
46,47,People Development Coordinator at Ryan,"Denton, Texas",500+,,1.0,0.984712,0.986241
58,59,People Development Coordinator at Ryan,"Denton, Texas",500+,,1.0,0.984712,0.986241
3,4,People Development Coordinator at Ryan,"Denton, Texas",500+,,1.0,0.984712,0.986241
33,34,People Development Coordinator at Ryan,"Denton, Texas",500+,,1.0,0.984712,0.986241
21,22,People Development Coordinator at Ryan,"Denton, Texas",500+,,1.0,0.984712,0.986241
17,18,People Development Coordinator at Ryan,"Denton, Texas",500+,,1.0,0.984712,0.986241
60,61,HR Senior Specialist,San Francisco Bay Area,500+,,1.0,0.982888,0.9846
25,26,HR Senior Specialist,San Francisco Bay Area,500+,,1.0,0.982888,0.9846
37,38,HR Senior Specialist,San Francisco Bay Area,500+,,1.0,0.982888,0.9846
50,51,HR Senior Specialist,San Francisco Bay Area,500+,,1.0,0.982888,0.9846
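
The weighting can be checked by hand against the first row of the table above. In this small sketch, the function name `fitness` and its default weights restate the equation used in the cell, and the input values are copied from the outputs shown earlier.

```python
def fitness(similarity, normalized_connections, w_sim=0.9, w_conn=0.1):
    """Weighted sum used for the fitness score."""
    return w_sim * similarity + w_conn * normalized_connections

# Row 46: People Development Coordinator at Ryan, 500+ connections
score_ryan = fitness(0.984712, 1.0)
# Row 2: Aspiring Human Resources Professional, 44 connections
score_aspiring = fitness(0.937843, 44 / 500)

print(round(score_ryan, 6))      # matches the 0.986241 in the table
print(round(score_aspiring, 6))
```

With a weight of 0.1, the difference between 44 and 500+ connections moves the score by 0.0912, which is larger than the similarity gap between many of the titles.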


# Starring candidates

In [105]:
# add a new column named 'starred' and set all values to 0
data['starred'] = 0

# star the row with index label 7 (candidate id 8) as an example
data.loc[7, 'starred'] = 1

#update the fitness_score based on starring
data['fitness_score'] = data['similarity_score'] * 0.9 + data['normalized_connections'] * 0.1 + data['starred']
data = data.sort_values(by='fitness_score', ascending=False)
# print the updated dataframe
data.head(20)


Unnamed: 0,id,job_title,location,connection,fit,normalized_connections,similarity_score,fitness_score,starred
7,8,HR Senior Specialist,San Francisco Bay Area,500+,,1.0,0.982888,1.9846,1
46,47,People Development Coordinator at Ryan,"Denton, Texas",500+,,1.0,0.984712,0.986241,0
58,59,People Development Coordinator at Ryan,"Denton, Texas",500+,,1.0,0.984712,0.986241,0
3,4,People Development Coordinator at Ryan,"Denton, Texas",500+,,1.0,0.984712,0.986241,0
33,34,People Development Coordinator at Ryan,"Denton, Texas",500+,,1.0,0.984712,0.986241,0
21,22,People Development Coordinator at Ryan,"Denton, Texas",500+,,1.0,0.984712,0.986241,0
17,18,People Development Coordinator at Ryan,"Denton, Texas",500+,,1.0,0.984712,0.986241,0
60,61,HR Senior Specialist,San Francisco Bay Area,500+,,1.0,0.982888,0.9846,0
25,26,HR Senior Specialist,San Francisco Bay Area,500+,,1.0,0.982888,0.9846,0
37,38,HR Senior Specialist,San Francisco Bay Area,500+,,1.0,0.982888,0.9846,0
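
A +1 bonus is enough to pin starred candidates to the top because cosine similarity and the normalized connection count are both at most 1. The unstarred score therefore never exceeds 1, while any starred candidate with a positive base score lands above 1. Here is a minimal sketch using three rows from the table above; the tuple layout and the helper `rerank` are illustrative, not part of the notebook.

```python
def rerank(candidates):
    """Sort (id, base_score, starred) tuples by base_score + starred, descending."""
    return sorted(candidates, key=lambda c: c[1] + c[2], reverse=True)

candidates = [
    (47, 0.986241, 0),  # People Development Coordinator at Ryan
    (8, 0.984600, 1),   # HR Senior Specialist, starred
    (61, 0.984600, 0),  # HR Senior Specialist, not starred
]

for cand_id, base, starred in rerank(candidates):
    print(cand_id, round(base + starred, 6), starred)
```

The bonus only moves the starred row itself. Candidate 61 has the same title as candidate 8 but stays where it was, which is the motivation for learning from the stars in the Model section.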


In [106]:
# star the rows with index labels 27, 55 and 100 as further examples
data.loc[27, 'starred'] = 1
data.loc[55, 'starred'] = 1
data.loc[100, 'starred'] = 1

#update the fitness_score based on starring
data['fitness_score'] = data['similarity_score'] * 0.9 + data['normalized_connections'] * 0.1 + data['starred']
data = data.sort_values(by='fitness_score', ascending=False)
# print the updated dataframe
data.head()

Unnamed: 0,id,job_title,location,connection,fit,normalized_connections,similarity_score,fitness_score,starred
7,8,HR Senior Specialist,San Francisco Bay Area,500+,,1.0,0.982888,1.9846,1
27,28,Seeking Human Resources Opportunities,"Chicago, Illinois",390,,0.78,0.986173,1.965555,1
55,56,Human Resources Coordinator at InterContinenta...,"Atlanta, Georgia",500+,,1.0,0.944081,1.949673,1
100,101,Human Resources Generalist at Loparex,"Raleigh-Durham, North Carolina Area",500+,,1.0,0.940202,1.946181,1
58,59,People Development Coordinator at Ryan,"Denton, Texas",500+,,1.0,0.984712,0.986241,0


# Model

In [107]:
def ranking_model(data):
    # Features and target: the model learns to rank starred candidates first
    feature_cols = ['normalized_connections', 'similarity_score']
    X = data[feature_cols]
    y = data.starred

    # Split into training and test sets, re-drawing the split until both
    # contain at least one starred (positive) example
    while True:
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
        print("Number of ranked items in training set:", y_train.sum())
        print("Number of ranked items in test set:", y_test.sum())
        if y_train.sum() > 0 and y_test.sum() > 0:
            break
    # Hold out part of the training set for validation (used for early stopping).
    # Note: this second split is not checked for starred examples.
    X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2)

    # Each split is treated as a single query group containing all of its rows
    query_train = [X_train.shape[0]]
    query_val = [X_val.shape[0]]

    # Train the LightGBM ranker
    gbm = lgb.LGBMRanker(objective="lambdarank", metric="ndcg")
    # LightGBM >= 4.0 no longer accepts early_stopping_rounds in fit();
    # pass callbacks=[lgb.early_stopping(50)] there instead.
    gbm.fit(X_train, y_train, group=query_train,
            eval_set=[(X_val, y_val)], eval_group=[query_val],
            eval_at=[5, 10, 20], early_stopping_rounds=50)

    # Score every candidate (training, validation and test rows alike) and return the scores
    return gbm.predict(X)
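
The `ndcg@k` values that LightGBM logs during training can be reproduced by hand for binary labels. The sketch below is a plain-Python NDCG with gain `2**rel - 1` and a log2 position discount. The function names are illustrative. Returning 1.0 for a list with no positive labels is an assumption about the convention in use (as far as I know LightGBM reports 1 in that case), made here so that the function is defined for such lists.

```python
import math

def dcg_at_k(labels_in_ranked_order, k):
    """Discounted cumulative gain over the first k positions."""
    return sum(
        (2 ** rel - 1) / math.log2(pos + 2)
        for pos, rel in enumerate(labels_in_ranked_order[:k])
    )

def ndcg_at_k(labels_in_ranked_order, k):
    """DCG divided by the DCG of the ideal (label-sorted) ordering."""
    ideal = dcg_at_k(sorted(labels_in_ranked_order, reverse=True), k)
    if ideal == 0:
        return 1.0  # assumed convention when no positives are present
    return dcg_at_k(labels_in_ranked_order, k) / ideal

starred_first = [1, 0, 0, 0, 0, 0]  # the one starred row is ranked first
starred_last = [0, 0, 0, 0, 0, 1]   # the one starred row is ranked last
no_stars = [0, 0, 0, 0, 0, 0]       # no starred rows in this split

print(ndcg_at_k(starred_first, 5))
print(ndcg_at_k(starred_last, 5))
print(ndcg_at_k(no_stars, 5))
```

With only a handful of starred rows, a 20% validation split can easily contain none of them. Under the convention assumed above, the logged `ndcg@k` would then stay at 1 regardless of how well the model ranks.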

In [108]:
predictions = ranking_model(data)

Number of ranked items in training set: 3
Number of ranked items in test set: 1
[1]	valid_0's ndcg@5: 1	valid_0's ndcg@10: 1	valid_0's ndcg@20: 1
Training until validation scores don't improve for 50 rounds
[2]	valid_0's ndcg@5: 1	valid_0's ndcg@10: 1	valid_0's ndcg@20: 1
[3]	valid_0's ndcg@5: 1	valid_0's ndcg@10: 1	valid_0's ndcg@20: 1
[4]	valid_0's ndcg@5: 1	valid_0's ndcg@10: 1	valid_0's ndcg@20: 1
[5]	valid_0's ndcg@5: 1	valid_0's ndcg@10: 1	valid_0's ndcg@20: 1
[6]	valid_0's ndcg@5: 1	valid_0's ndcg@10: 1	valid_0's ndcg@20: 1
[7]	valid_0's ndcg@5: 1	valid_0's ndcg@10: 1	valid_0's ndcg@20: 1
[8]	valid_0's ndcg@5: 1	valid_0's ndcg@10: 1	valid_0's ndcg@20: 1
[9]	valid_0's ndcg@5: 1	valid_0's ndcg@10: 1	valid_0's ndcg@20: 1
[10]	valid_0's ndcg@5: 1	valid_0's ndcg@10: 1	valid_0's ndcg@20: 1
[11]	valid_0's ndcg@5: 1	valid_0's ndcg@10: 1	valid_0's ndcg@20: 1
[12]	valid_0's ndcg@5: 1	valid_0's ndcg@10: 1	valid_0's ndcg@20: 1
[13]	valid_0's ndcg@5: 1	valid_0's ndcg@10: 1	valid_0's ndcg@20:

# Updating fitness and ranking

In [109]:
#rank based on model output
df = data
df['ranking'] = predictions
df = df.sort_values(by="ranking", ascending=False)

In [110]:
df.head()

Unnamed: 0,id,job_title,location,connection,fit,normalized_connections,similarity_score,fitness_score,starred,ranking
7,8,HR Senior Specialist,San Francisco Bay Area,500+,,1.0,0.982888,1.9846,1,0.108387
23,24,Aspiring Human Resources Specialist,Greater New York City Area,1,,0.002,0.969603,0.872843,0,0.108387
77,78,Human Resources Generalist at Schwan's,Amerika Birleşik Devletleri,500+,,1.0,0.939072,0.945165,0,0.108387
98,99,Seeking Human Resources Position,"Las Vegas, Nevada Area",48,,0.096,0.986825,0.897743,0,0.108387
90,91,Lead Official at Western Illinois University,Greater Chicago Area,39,,0.078,0.96251,0.874059,0,0.108387


In [111]:
#reset dataframe index
df = df.reset_index()
#find where the starred items are now
df.loc[df['starred'] == 1]

Unnamed: 0,index,id,job_title,location,connection,fit,normalized_connections,similarity_score,fitness_score,starred,ranking
0,7,8,HR Senior Specialist,San Francisco Bay Area,500+,,1.0,0.982888,1.9846,1,0.108387
15,27,28,Seeking Human Resources Opportunities,"Chicago, Illinois",390,,0.78,0.986173,1.965555,1,0.108387
30,100,101,Human Resources Generalist at Loparex,"Raleigh-Durham, North Carolina Area",500+,,1.0,0.940202,1.946181,1,0.108387
31,55,56,Human Resources Coordinator at InterContinenta...,"Atlanta, Georgia",500+,,1.0,0.944081,1.949673,1,0.108387


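Problem: the starred rows are scattered through the newly ranked dataset (positions 0, 15, 30 and 31). Every row shown has the same ranking score (0.108387), so the model is not separating starred from unstarred candidates and the order among tied rows is arbitrary.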
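
One way to diagnose this is to count how many distinct scores the model produced: when most rows share a single value, their relative order after sorting carries no information. The small sketch below uses the `ranking` values of the eight distinct rows displayed in the two tables above. The helper name `tie_summary` is made up for illustration.

```python
from collections import Counter

def tie_summary(scores):
    """Return (number of distinct scores, size of the largest tie group)."""
    counts = Counter(scores)
    return len(counts), max(counts.values())

# 'ranking' values of the eight distinct rows shown in the two tables above
shown_scores = [0.108387] * 8

print(tie_summary(shown_scores))
```

On the full DataFrame, `df['ranking'].nunique()` gives the same first number.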

# Trying with more starred candidates

In [112]:
data.head(50)

Unnamed: 0,id,job_title,location,connection,fit,normalized_connections,similarity_score,fitness_score,starred,ranking
7,8,HR Senior Specialist,San Francisco Bay Area,500+,,1.0,0.982888,1.9846,1,0.108387
27,28,Seeking Human Resources Opportunities,"Chicago, Illinois",390,,0.78,0.986173,1.965555,1,0.108387
55,56,Human Resources Coordinator at InterContinenta...,"Atlanta, Georgia",500+,,1.0,0.944081,1.949673,1,0.108387
100,101,Human Resources Generalist at Loparex,"Raleigh-Durham, North Carolina Area",500+,,1.0,0.940202,1.946181,1,0.108387
58,59,People Development Coordinator at Ryan,"Denton, Texas",500+,,1.0,0.984712,0.986241,0,0.108387
3,4,People Development Coordinator at Ryan,"Denton, Texas",500+,,1.0,0.984712,0.986241,0,0.108387
33,34,People Development Coordinator at Ryan,"Denton, Texas",500+,,1.0,0.984712,0.986241,0,0.108387
21,22,People Development Coordinator at Ryan,"Denton, Texas",500+,,1.0,0.984712,0.986241,0,0.108387
17,18,People Development Coordinator at Ryan,"Denton, Texas",500+,,1.0,0.984712,0.986241,0,0.108387
46,47,People Development Coordinator at Ryan,"Denton, Texas",500+,,1.0,0.984712,0.986241,0,0.108387


In [113]:
# star some more rows as further examples
data.loc[60, 'starred'] = 1
data.loc[67, 'starred'] = 1
data.loc[12, 'starred'] = 1
data.loc[98, 'starred'] = 1
data.loc[80, 'starred'] = 1
data.loc[59, 'starred'] = 1
data.loc[28, 'starred'] = 1
data.loc[73, 'starred'] = 1
data.loc[57, 'starred'] = 1

#update the fitness_score based on starring
data['fitness_score'] = data['similarity_score'] * 0.9 + data['normalized_connections'] * 0.1 + data['starred']
data = data.sort_values(by='fitness_score', ascending=False)
# print the updated dataframe
data.head(10)

Unnamed: 0,id,job_title,location,connection,fit,normalized_connections,similarity_score,fitness_score,starred,ranking
7,8,HR Senior Specialist,San Francisco Bay Area,500+,,1.0,0.982888,1.9846,1,0.108387
60,61,HR Senior Specialist,San Francisco Bay Area,500+,,1.0,0.982888,1.9846,1,0.108387
27,28,Seeking Human Resources Opportunities,"Chicago, Illinois",390,,0.78,0.986173,1.965555,1,0.108387
67,68,Human Resources Specialist at Luxottica,Greater New York City Area,500+,,1.0,0.961368,1.965231,1,0.108387
55,56,Human Resources Coordinator at InterContinenta...,"Atlanta, Georgia",500+,,1.0,0.944081,1.949673,1,0.108387
12,13,Human Resources Coordinator at InterContinenta...,"Atlanta, Georgia",500+,,1.0,0.944081,1.949673,1,0.108387
100,101,Human Resources Generalist at Loparex,"Raleigh-Durham, North Carolina Area",500+,,1.0,0.940202,1.946181,1,0.108387
98,99,Seeking Human Resources Position,"Las Vegas, Nevada Area",48,,0.096,0.986825,1.897743,1,0.108387
80,81,Senior Human Resources Business Partner at Hei...,"Chattanooga, Tennessee Area",455,,0.91,0.873379,1.877041,1,-0.2
59,60,Aspiring Human Resources Specialist,Greater New York City Area,1,,0.002,0.969603,1.872843,1,0.108387


In [114]:
predictions = ranking_model(data)
df = data
df['ranking'] = predictions
df = df.sort_values(by='ranking', ascending=False)

Number of ranked items in training set: 11
Number of ranked items in test set: 2
[1]	valid_0's ndcg@5: 0.234639	valid_0's ndcg@10: 0.523947	valid_0's ndcg@20: 0.523947
Training until validation scores don't improve for 50 rounds
[2]	valid_0's ndcg@5: 0.234639	valid_0's ndcg@10: 0.523947	valid_0's ndcg@20: 0.523947
[3]	valid_0's ndcg@5: 0.234639	valid_0's ndcg@10: 0.523947	valid_0's ndcg@20: 0.523947
[4]	valid_0's ndcg@5: 0.234639	valid_0's ndcg@10: 0.523947	valid_0's ndcg@20: 0.523947
[5]	valid_0's ndcg@5: 0.296082	valid_0's ndcg@10: 0.58539	valid_0's ndcg@20: 0.58539
[6]	valid_0's ndcg@5: 0.296082	valid_0's ndcg@10: 0.58539	valid_0's ndcg@20: 0.58539
[7]	valid_0's ndcg@5: 0.296082	valid_0's ndcg@10: 0.58539	valid_0's ndcg@20: 0.58539
[8]	valid_0's ndcg@5: 0.296082	valid_0's ndcg@10: 0.58539	valid_0's ndcg@20: 0.58539
[9]	valid_0's ndcg@5: 0.296082	valid_0's ndcg@10: 0.58539	valid_0's ndcg@20: 0.58539
[10]	valid_0's ndcg@5: 0.296082	valid_0's ndcg@10: 0.58539	valid_0's ndcg@20: 0.58539

In [115]:
df.head(20)

Unnamed: 0,id,job_title,location,connection,fit,normalized_connections,similarity_score,fitness_score,starred,ranking
77,78,Human Resources Generalist at Schwan's,Amerika Birleşik Devletleri,500+,,1.0,0.939072,0.945165,0,0.180808
12,13,Human Resources Coordinator at InterContinenta...,"Atlanta, Georgia",500+,,1.0,0.944081,1.949673,1,0.180808
42,43,Human Resources Coordinator at InterContinenta...,"Atlanta, Georgia",500+,,1.0,0.944081,0.949673,0,0.180808
100,101,Human Resources Generalist at Loparex,"Raleigh-Durham, North Carolina Area",500+,,1.0,0.940202,1.946181,1,0.180808
64,65,Human Resources Coordinator at InterContinenta...,"Atlanta, Georgia",500+,,1.0,0.944081,0.949673,0,0.180808
55,56,Human Resources Coordinator at InterContinenta...,"Atlanta, Georgia",500+,,1.0,0.944081,1.949673,1,0.180808
102,103,Always set them up for Success,Greater Los Angeles Area,500+,,1.0,0.968243,0.971418,0,0.173611
67,68,Human Resources Specialist at Luxottica,Greater New York City Area,500+,,1.0,0.961368,1.965231,1,0.173611
33,34,People Development Coordinator at Ryan,"Denton, Texas",500+,,1.0,0.984712,0.986241,0,0.076226
37,38,HR Senior Specialist,San Francisco Bay Area,500+,,1.0,0.982888,0.9846,0,0.076226


In [116]:
#reset dataframe index
df = df.reset_index()
#find where the starred items are now
df.loc[df['starred'] == 1]

Unnamed: 0,index,id,job_title,location,connection,fit,normalized_connections,similarity_score,fitness_score,starred,ranking
1,12,13,Human Resources Coordinator at InterContinenta...,"Atlanta, Georgia",500+,,1.0,0.944081,1.949673,1,0.180808
3,100,101,Human Resources Generalist at Loparex,"Raleigh-Durham, North Carolina Area",500+,,1.0,0.940202,1.946181,1,0.180808
5,55,56,Human Resources Coordinator at InterContinenta...,"Atlanta, Georgia",500+,,1.0,0.944081,1.949673,1,0.180808
7,67,68,Human Resources Specialist at Luxottica,Greater New York City Area,500+,,1.0,0.961368,1.965231,1,0.173611
17,60,61,HR Senior Specialist,San Francisco Bay Area,500+,,1.0,0.982888,1.9846,1,0.076226
18,7,8,HR Senior Specialist,San Francisco Bay Area,500+,,1.0,0.982888,1.9846,1,0.076226
25,80,81,Senior Human Resources Business Partner at Hei...,"Chattanooga, Tennessee Area",455,,0.91,0.873379,1.877041,1,0.06962
27,27,28,Seeking Human Resources Opportunities,"Chicago, Illinois",390,,0.78,0.986173,1.965555,1,0.037103
38,73,74,Human Resources Professional,Greater Boston Area,16,,0.032,0.951872,1.859885,1,0.023573
39,59,60,Aspiring Human Resources Specialist,Greater New York City Area,1,,0.002,0.969603,1.872843,1,0.023573


## Result for ranking

I ranked candidates by a fitness score that combines the cosine similarity between job title and keywords (weight 0.9) with the normalized connection count (weight 0.1). When a candidate is manually starred, the list is re-ranked.
I then trained a LightGBM ranking model with `starred` as the target and re-ranked the candidates by its predicted scores.
However, the previously starred candidates do not all appear at the top of the new ranking, because the model cannot learn well from such a small positive class (n=4).
Adding more starred candidates (n=13 in total, per the training log) and retraining gives better, but still not great, results.
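
To put a number on "better but still not great", the positions of the starred rows in the last output can be turned into a precision-at-k figure. This sketch uses the ten positions displayed by the final `df.loc[df['starred'] == 1]` cell. The display appears to be cut off at ten rows, so the remaining starred candidates are assumed to sit below position 39, which does not affect the values of k used here. `precision_at_k` is an illustrative helper, not something the notebook defines.

```python
def precision_at_k(starred_positions, k):
    """Fraction of the top-k ranks (0-based positions < k) held by starred rows."""
    hits = sum(1 for pos in starred_positions if pos < k)
    return hits / k

# 0-based positions of starred candidates after the second model run
positions = [1, 3, 5, 7, 17, 18, 25, 27, 38, 39]

for k in (5, 10, 20):
    print(k, precision_at_k(positions, k))
```

For comparison, a ranking that simply applied the +1 star bonus would reach a precision of 1.0 at k = 5 and k = 10.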