# Project 3 
## Goal(s):

Predict how well a candidate fits the role based on their available information (the `fit` variable).

## Success Metric(s):

Rank candidates based on a fitness score.

Re-rank candidates when a candidate is starred.

## Bonus(es):

We are interested in a robust algorithm. Tell us how your solution works and show us how your ranking improves with each starring action.

How can we filter out candidates who should not be in this list in the first place?

Can we determine a cut-off point that would work for other roles without losing high potential candidates?

Do you have any ideas that we should explore so that we can even automate this procedure to prevent human bias?


In [20]:
# linear algebra
import numpy as np 

# data processing
import pandas as pd 

# data visualization
import seaborn as sns
%matplotlib inline
from matplotlib import pyplot as plt
from matplotlib import style

# Algorithms
from sklearn import linear_model
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
import xgboost as xgb
import lightgbm as lgb


#Cross validation
from sklearn.model_selection import KFold, cross_val_score

#Other
from sklearn import metrics
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

#class imbalance
from sklearn.metrics import f1_score
from sklearn.utils import resample

#embedding using BERT

from transformers import AutoTokenizer, AutoModel
from sklearn.metrics.pairwise import cosine_similarity


# Loading data

In [2]:
og_data = pd.read_excel('potential-talents.xlsx')
data = og_data.copy()  # work on a copy so the original DataFrame stays untouched
keywords = 'aspiring human resources'

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 104 entries, 0 to 103
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   id          104 non-null    int64  
 1   job_title   104 non-null    object 
 2   location    104 non-null    object 
 3   connection  104 non-null    object 
 4   fit         0 non-null      float64
dtypes: float64(1), int64(1), object(3)
memory usage: 4.2+ KB


In [4]:
data.columns

Index(['id', 'job_title', 'location', 'connection', 'fit'], dtype='object')

In [5]:
data.describe()

Unnamed: 0,id,fit
count,104.0,0.0
mean,52.5,
std,30.166206,
min,1.0,
25%,26.75,
50%,52.5,
75%,78.25,
max,104.0,


In [6]:
data.head(15)

Unnamed: 0,id,job_title,location,connection,fit
0,1,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,
1,2,Native English Teacher at EPIK (English Progra...,Kanada,500+,
2,3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,
3,4,People Development Coordinator at Ryan,"Denton, Texas",500+,
4,5,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+,
5,6,Aspiring Human Resources Specialist,Greater New York City Area,1,
6,7,Student at Humber College and Aspiring Human R...,Kanada,61,
7,8,HR Senior Specialist,San Francisco Bay Area,500+,
8,9,Student at Humber College and Aspiring Human R...,Kanada,61,
9,10,Seeking Human Resources HRIS and Generalist Po...,Greater Philadelphia Area,500+,


Attributes:
id : unique identifier for candidate (numeric)

job_title : job title for candidate (text)

location : geographical location for candidate (text)

connection : number of connections the candidate has; 500+ means more than 500 (text)

In [7]:
print('title: ', data.job_title.unique())
print('location: ', data.location.unique())
print('connections: ', data.connection.unique())

title:  ['2019 C.T. Bauer College of Business Graduate (Magna Cum Laude) and aspiring Human Resources professional'
 'Native English Teacher at EPIK (English Program in Korea)'
 'Aspiring Human Resources Professional'
 'People Development Coordinator at Ryan'
 'Advisory Board Member at Celal Bayar University'
 'Aspiring Human Resources Specialist'
 'Student at Humber College and Aspiring Human Resources Generalist'
 'HR Senior Specialist'
 'Seeking Human Resources HRIS and Generalist Positions'
 'Student at Chapman University'
 'SVP, CHRO, Marketing & Communications, CSR Officer | ENGIE | Houston | The Woodlands | Energy | GPHR | SPHR'
 'Human Resources Coordinator at InterContinental Buckhead Atlanta'
 'Aspiring Human Resources Management student seeking an internship'
 'Seeking Human Resources Opportunities'
 'Experienced Retail Manager and aspiring Human Resources Professional'
 'Human Resources, Staffing and Recruiting Professional'
 'Human Resources Specialist at Luxottica'
 'Dire

# Processing data

In [8]:
# First we normalize the connection counts to the range 0-1, counting '500+' as 500

# Function to normalize a raw connection value to between 0 and 1
def normalize_score(score):
    # Raw values are strings such as '85' or '500+ ' (note the trailing space), so strip whitespace first
    score = str(score).strip()
    if score == '500+':
        score = 500
    return float(score) / 500

# Apply the function to the 'connection' column to create a new 'normalized_connections' column
data['normalized_connections'] = data['connection'].apply(normalize_score)


In [9]:
# Next we calculate the similarity between each job title and the search keywords (stored in the variable 'keywords' near the top of the notebook; initially 'aspiring human resources')

# Load the BERT tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Tokenize and encode the job titles using BERT
job_title_encodings = data["job_title"].apply(
    lambda title: model(**tokenizer(title, return_tensors="pt")).pooler_output.detach().numpy()
)

# Compute the cosine similarity between the encoded job titles and the keyword
keyword_encoding = model(**tokenizer(keywords, return_tensors="pt")).pooler_output.detach().numpy()
similarity_scores = np.vstack(job_title_encodings.apply(lambda encoding: cosine_similarity(encoding, keyword_encoding))).ravel()

# Add the similarity scores as a new column in the DataFrame
data["similarity_score"] = similarity_scores


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
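For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures direction rather than magnitude. The pure-Python sketch below illustrates the formula on toy vectors; the helper name `cosine` is illustrative and is not the scikit-learn function used in the cell above.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (illustrative helper)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Vectors pointing in the same direction score (approximately) 1
same_direction = cosine([1.0, 2.0], [2.0, 4.0])
# Perpendicular vectors score 0
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])

print(same_direction, orthogonal)
```

In this notebook the scores in the rows shown fall in a fairly narrow band (roughly 0.76 to 0.99), so small differences in similarity carry most of the ranking signal.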


In [10]:
data.head()

Unnamed: 0,id,job_title,location,connection,fit,normalized_connections,similarity_score
0,1,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,,0.17,0.764336
1,2,Native English Teacher at EPIK (English Progra...,Kanada,500+,,1.0,0.89986
2,3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,,0.088,0.937843
3,4,People Development Coordinator at Ryan,"Denton, Texas",500+,,1.0,0.984712
4,5,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+,,1.0,0.824166


In [11]:
# Fitness score: weighted sum of similarity_score (weight 0.8) and normalized_connections (weight 0.2)
data['fitness_score'] = data['similarity_score'] * 0.8 + data['normalized_connections'] * 0.2
data = data.sort_values(by='fitness_score', ascending=False)
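As a quick sanity check of the weighting in this cell, the sketch below recomputes the score by hand for two rows from the earlier `data.head()` output. The helper `fitness_score` and its parameter names are illustrative only; the notebook computes the same expression directly on the DataFrame columns.

```python
def fitness_score(similarity, normalized_connections, w_sim=0.8, w_conn=0.2):
    """Weighted sum of the two features (helper name is illustrative)."""
    return w_sim * similarity + w_conn * normalized_connections

# Row 0: similarity 0.764336, 85 connections -> 85 / 500 = 0.17
row0 = fitness_score(0.764336, 0.17)
# Row 3: similarity 0.984712, '500+' connections -> 1.0
row3 = fitness_score(0.984712, 1.0)

print(round(row0, 6), round(row3, 6))
```

Because connections carry 20% of the weight, any candidate with 500+ connections gains 0.2 regardless of job title. This is why the 'Native English Teacher' rows (similarity 0.89986) reach a score of 0.919888.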


# Starring candidates

In [12]:
# add a new column named 'starred' and set all values to 0
data['starred'] = 0

# star the candidate with index label 6 (id 7) as an example; .loc selects by index label, not by position in the sorted frame
data.loc[6, 'starred'] = 1

#update the fitness_score based on starring
data['fitness_score'] = data['similarity_score'] * 0.8 + data['normalized_connections'] * 0.2 + data['starred']
data = data.sort_values(by='fitness_score', ascending=False)
# print the updated dataframe
data.head()


Unnamed: 0,id,job_title,location,connection,fit,normalized_connections,similarity_score,fitness_score,starred
6,7,Student at Humber College and Aspiring Human R...,Kanada,61,,0.122,0.860559,1.712847,1
58,59,People Development Coordinator at Ryan,"Denton, Texas",500+,,1.0,0.984712,0.98777,0
21,22,People Development Coordinator at Ryan,"Denton, Texas",500+,,1.0,0.984712,0.98777,0
17,18,People Development Coordinator at Ryan,"Denton, Texas",500+,,1.0,0.984712,0.98777,0
3,4,People Development Coordinator at Ryan,"Denton, Texas",500+,,1.0,0.984712,0.98777,0


In [13]:
# star the candidates with index labels 2, 3 and 4 (ids 3, 4 and 5) as further examples
data.loc[2, 'starred'] = 1
data.loc[3, 'starred'] = 1
data.loc[4, 'starred'] = 1

#update the fitness_score based on starring
data['fitness_score'] = data['similarity_score'] * 0.8 + data['normalized_connections'] * 0.2 + data['starred']
data = data.sort_values(by='fitness_score', ascending=False)
# print the updated dataframe
data.head()

Unnamed: 0,id,job_title,location,connection,fit,normalized_connections,similarity_score,fitness_score,starred
3,4,People Development Coordinator at Ryan,"Denton, Texas",500+,,1.0,0.984712,1.98777,1
4,5,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+,,1.0,0.824166,1.859333,1
2,3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,,0.088,0.937843,1.767874,1
6,7,Student at Humber College and Aspiring Human R...,Kanada,61,,0.122,0.860559,1.712847,1
46,47,People Development Coordinator at Ryan,"Denton, Texas",500+,,1.0,0.984712,0.98777,0


# Make X and Y

In [49]:
def ranking_model(data):
    # Features (X) and target (y): the model learns to predict the 'starred' flag
    feature_cols = ['normalized_connections', 'similarity_score']
    X = data[feature_cols]
    y = data.starred

    # Stratified 80/20 split so that both sets contain starred candidates
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)
    # Check that at least one starred (positive) example ended up in each set
    print("Number of ranked items in training set:", y_train.sum())
    print("Number of ranked items in test set:",y_test.sum())

    # Create the LightGBM dataset objects for training and validation
    lgb_train = lgb.Dataset(X_train, y_train)
    lgb_test = lgb.Dataset(X_test, y_test)

    # Hyperparameters: 'binary' objective, so this is a binary classifier
    # whose predicted probabilities are later used as ranking scores
    params = {
        'objective': 'binary',
        'metric': 'binary_logloss',
        'num_leaves': 31,
        'learning_rate': 0.05,
        'feature_fraction': 0.9,
    }

    # Train the LightGBM model.
    # Note: early_stopping_rounds and verbose_eval are accepted by LightGBM < 4.0;
    # from 4.0 on, pass callbacks=[lgb.early_stopping(10), lgb.log_evaluation(0)] instead.
    model = lgb.train(params,
                      lgb_train,
                      num_boost_round=100,
                      valid_sets=[lgb_train, lgb_test],
                      early_stopping_rounds=10,
                      verbose_eval=False)

    # Predict the probability of being starred for every row (training and test rows alike)
    return model.predict(X)

In [50]:
predictions = ranking_model(data)

Number of ranked items in training set: 3
Number of ranked items in test set: 1
[LightGBM] [Info] Number of positive: 3, number of negative: 80
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 35
[LightGBM] [Info] Number of data points in the train set: 83, number of used features: 2
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.036145 -> initscore=-3.283414
[LightGBM] [Info] Start training from score -3.283414


# Updating fitness and ranking

In [51]:
#rank based on model output
data['ranking'] = predictions
data = data.sort_values(by='ranking', ascending=False)
data.head()

Unnamed: 0,id,job_title,location,connection,fit,normalized_connections,similarity_score,fitness_score,starred,ranking
28,29,Aspiring Human Resources Management student se...,"Houston, Texas Area",500+,,1.0,0.853446,0.882757,0,0.051546
80,81,Senior Human Resources Business Partner at Hei...,"Chattanooga, Tennessee Area",455,,0.91,0.873379,0.880703,0,0.051546
19,20,Native English Teacher at EPIK (English Progra...,Kanada,500+,,1.0,0.89986,0.919888,0,0.051546
103,104,Director Of Administration at Excellence Logging,"Katy, Texas",500+,,1.0,0.865729,0.892584,0,0.051546
74,75,"Nortia Staffing is seeking Human Resources, Pa...","San Jose, California",500+,,1.0,0.820195,0.856156,0,0.051546


In [53]:
#reset dataframe index
df = data.reset_index()
#find where the starred items are now
df.loc[df['starred'] == 1]

Unnamed: 0,index,id,job_title,location,connection,fit,normalized_connections,similarity_score,fitness_score,starred,ranking
7,4,5,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+,,1.0,0.824166,1.859333,1,0.051546
23,3,4,People Development Coordinator at Ryan,"Denton, Texas",500+,,1.0,0.984712,1.98777,1,0.042667
48,2,3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,,0.088,0.937843,1.767874,1,0.041917
52,6,7,Student at Humber College and Aspiring Human R...,Kanada,61,,0.122,0.860559,1.712847,1,0.041917


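Problem: the starred rows are dispersed through the newly ranked dataset (positions 7, 23, 48 and 52), and many unstarred candidates receive the same score (0.051546) as the best-placed starred one.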
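One way to make this problem measurable is to track where the starred candidates land in the ordering. The sketch below is an assumption about how that could be done, not something this notebook computes: `starred_positions` and `precision_at_k` are hypothetical helpers, and the flags are a toy list that mirrors the four positions found in cell In [53].

```python
def starred_positions(starred_flags):
    """Return the 0-based positions of starred items in an already ranked list."""
    return [pos for pos, flag in enumerate(starred_flags) if flag == 1]

def precision_at_k(starred_flags, k):
    """Fraction of the top-k ranked items that are starred."""
    top_k = starred_flags[:k]
    return sum(top_k) / k

# Toy ranking of 104 candidates with stars at the positions found in cell In [53]
flags = [0] * 104
for pos in (7, 23, 48, 52):
    flags[pos] = 1

positions = starred_positions(flags)
mean_position = sum(positions) / len(positions)
p_at_10 = precision_at_k(flags, 10)

print(positions, mean_position, p_at_10)
```

With the notebook's DataFrame, the same helpers could be called on `df['starred'].tolist()` after sorting. Comparing the values before and after each starring action would show whether the ranking is actually improving.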

# Trying with more starred candidates

In [54]:
# star more candidates (9 in total, including the 4 starred earlier) as a further example
data.loc[2, 'starred'] = 1
data.loc[3, 'starred'] = 1
data.loc[4, 'starred'] = 1
data.loc[6, 'starred'] = 1
data.loc[8, 'starred'] = 1
data.loc[12, 'starred'] = 1
data.loc[15, 'starred'] = 1
data.loc[27, 'starred'] = 1
data.loc[41, 'starred'] = 1

#update the fitness_score based on starring
data['fitness_score'] = data['similarity_score'] * 0.8 + data['normalized_connections'] * 0.2 + data['starred']
data = data.sort_values(by='fitness_score', ascending=False)
# print the updated dataframe
data.head()

Unnamed: 0,id,job_title,location,connection,fit,normalized_connections,similarity_score,fitness_score,starred,ranking
3,4,People Development Coordinator at Ryan,"Denton, Texas",500+,,1.0,0.984712,1.98777,1,0.042667
12,13,Human Resources Coordinator at InterContinenta...,"Atlanta, Georgia",500+,,1.0,0.944081,1.955265,1,0.042667
27,28,Seeking Human Resources Opportunities,"Chicago, Illinois",390,,0.78,0.986173,1.944938,1,0.034636
15,16,Native English Teacher at EPIK (English Progra...,Kanada,500+,,1.0,0.89986,1.919888,1,0.051546
4,5,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+,,1.0,0.824166,1.859333,1,0.051546


In [55]:
predictions = ranking_model(data)
data['ranking'] = predictions
data = data.sort_values(by='ranking', ascending=False)
data.head()

Number of ranked items in training set: 7
Number of ranked items in test set: 2
[LightGBM] [Info] Number of positive: 7, number of negative: 76
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 32
[LightGBM] [Info] Number of data points in the train set: 83, number of used features: 2
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.084337 -> initscore=-2.384823
[LightGBM] [Info] Start training from score -2.384823


Unnamed: 0,id,job_title,location,connection,fit,normalized_connections,similarity_score,fitness_score,starred,ranking
88,89,Director Human Resources at EY,Greater Atlanta Area,349,,0.698,0.877289,0.841431,0,0.089798
32,33,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,,0.088,0.937843,0.767874,0,0.089798
1,2,Native English Teacher at EPIK (English Progra...,Kanada,500+,,1.0,0.89986,0.919888,0,0.089798
19,20,Native English Teacher at EPIK (English Progra...,Kanada,500+,,1.0,0.89986,0.919888,0,0.089798
103,104,Director Of Administration at Excellence Logging,"Katy, Texas",500+,,1.0,0.865729,0.892584,0,0.089798


In [56]:
#reset dataframe index
df = data.reset_index()
#find where the starred items are now
df.loc[df['starred'] == 1]

Unnamed: 0,index,id,job_title,location,connection,fit,normalized_connections,similarity_score,fitness_score,starred,ranking
24,15,16,Native English Teacher at EPIK (English Progra...,Kanada,500+,,1.0,0.89986,1.919888,1,0.089798
25,2,3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,,0.088,0.937843,1.767874,1,0.089798
26,6,7,Student at Humber College and Aspiring Human R...,Kanada,61,,0.122,0.860559,1.712847,1,0.089798
27,8,9,Student at Humber College and Aspiring Human R...,Kanada,61,,0.122,0.860559,1.712847,1,0.089798
49,27,28,Seeking Human Resources Opportunities,"Chicago, Illinois",390,,0.78,0.986173,1.944938,1,0.083967
51,12,13,Human Resources Coordinator at InterContinenta...,"Atlanta, Georgia",500+,,1.0,0.944081,1.955265,1,0.083967
52,3,4,People Development Coordinator at Ryan,"Denton, Texas",500+,,1.0,0.984712,1.98777,1,0.083967
84,4,5,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+,,1.0,0.824166,1.859333,1,0.081552
85,41,42,"SVP, CHRO, Marketing & Communications, CSR Off...","Houston, Texas Area",500+,,1.0,0.774014,1.819212,1,0.081552


## Result for ranking

I ranked candidates by a fitness score that combines the cosine similarity between job title and keywords (weight 0.8) with the normalized connection count (weight 0.2). When a candidate is manually starred, I add 1 to their score and re-rank, so starred candidates move to the top.
I then trained a LightGBM binary classifier on the two features with y='starred' and re-ranked the data by its predicted probabilities.
However, the previously starred candidates do not all appear at the top of the new ranking: with such a small positive class (n=4), the model barely separates starred from unstarred candidates, and many candidates receive identical scores.
Adding more starred candidates (n=9) and retraining does not fix this: the starred candidates still land between positions 24 and 85 of the 104 rows.
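One direction that might behave better with so few stars is to skip the classifier and instead shift the query vector toward the embeddings of the starred candidates (a Rocchio-style relevance feedback update). This is a different technique from the one used above, shown only as a sketch: the 2-D vectors are made-up stand-ins for BERT embeddings, and `rerank_with_stars`, `alpha` and `beta` are hypothetical names and values.

```python
import numpy as np

def cosine_to_query(embeddings, query):
    """Cosine similarity between each row of `embeddings` and `query`."""
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query)
    return (embeddings @ query) / norms

def rerank_with_stars(embeddings, query, starred_idx, alpha=1.0, beta=0.75):
    """Shift the query toward the mean starred embedding, then rank by similarity."""
    if len(starred_idx) > 0:
        query = alpha * query + beta * embeddings[starred_idx].mean(axis=0)
    scores = cosine_to_query(embeddings, query)
    # argsort is ascending, so reverse it to put the best match first
    return np.argsort(scores)[::-1], scores

# Made-up 2-D "embeddings" standing in for real job-title embeddings
embeddings = np.array([
    [1.0, 0.0],   # candidate 0: closest to the original query
    [0.8, 0.6],   # candidate 1: between the query and candidate 2
    [0.0, 1.0],   # candidate 2: the one that gets starred
])
query = np.array([1.0, 0.0])

order_before, _ = rerank_with_stars(embeddings, query, starred_idx=[])
order_after, _ = rerank_with_stars(embeddings, query, starred_idx=[2])

print(order_before, order_after)
```

In this toy case, starring candidate 2 lets candidate 1, which resembles the starred one, overtake candidate 0. The starred candidate itself could still be pinned to the top with the additive `starred` term already used in the fitness score.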