# Project 3 
## Goal(s):

Predict how fit each candidate is for the role based on their available information (target variable: `fit`).

## Success Metric(s):

Rank candidates based on a fitness score.

Re-rank candidates when a candidate is starred.

## Bonus(es):

We are interested in a robust algorithm. Tell us how your solution works and show us how your ranking improves with each starring action.

How can we filter out candidates who should not be in this list in the first place?

Can we determine a cut-off point that would work for other roles without losing high potential candidates?

Do you have any ideas that we should explore so that we can even automate this procedure to prevent human bias?


In [11]:
# linear algebra
import numpy as np 

# data processing
import pandas as pd 

# data visualization
import seaborn as sns
%matplotlib inline
from matplotlib import pyplot as plt
from matplotlib import style

# Algorithms
from sklearn import linear_model
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
import xgboost as xgb
import lightgbm as lgb


#Cross validation
from sklearn.model_selection import KFold, cross_val_score

#Other
from sklearn import metrics
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

#class imbalance
from sklearn.metrics import f1_score
from sklearn.utils import resample

#embedding using BERT

from transformers import AutoTokenizer, AutoModel
from sklearn.metrics.pairwise import cosine_similarity


# Loading data

In [12]:
og_data = pd.read_excel('potential-talents.xlsx')
data = og_data.copy()  # work on a copy so the raw data stays untouched
keywords = 'aspiring human resources'

In [13]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 104 entries, 0 to 103
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   id          104 non-null    int64  
 1   job_title   104 non-null    object 
 2   location    104 non-null    object 
 3   connection  104 non-null    object 
 4   fit         0 non-null      float64
dtypes: float64(1), int64(1), object(3)
memory usage: 4.2+ KB


In [14]:
data.columns

Index(['id', 'job_title', 'location', 'connection', 'fit'], dtype='object')

In [15]:
data.describe()

Unnamed: 0,id,fit
count,104.0,0.0
mean,52.5,
std,30.166206,
min,1.0,
25%,26.75,
50%,52.5,
75%,78.25,
max,104.0,


# Add controls

In [16]:
controls = pd.DataFrame({
    "id": [1111, 1112, 1113],
    "job_title": ['Machine learning', 'NA', 'artist'], "location":['x','x','x'], "connection": [150, 0, 500], "fit": [0, 0, 0]
}, index=[104, 105, 106])

# Append the control rows (DataFrame.append was removed in pandas 2.0, so use pd.concat)
data = pd.concat([data, controls])

In [17]:
data.tail()

Unnamed: 0,id,job_title,location,connection,fit
102,103,Always set them up for Success,Greater Los Angeles Area,500+,
103,104,Director Of Administration at Excellence Logging,"Katy, Texas",500+,
104,1111,Machine learning,x,150,0.0
105,1112,,x,0,0.0
106,1113,artist,x,500,0.0


In [19]:
data.head(25)

Unnamed: 0,id,job_title,location,connection,fit
0,1,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,
1,2,Native English Teacher at EPIK (English Progra...,Kanada,500+,
2,3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,
3,4,People Development Coordinator at Ryan,"Denton, Texas",500+,
4,5,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+,
5,6,Aspiring Human Resources Specialist,Greater New York City Area,1,
6,7,Student at Humber College and Aspiring Human R...,Kanada,61,
7,8,HR Senior Specialist,San Francisco Bay Area,500+,
8,9,Student at Humber College and Aspiring Human R...,Kanada,61,
9,10,Seeking Human Resources HRIS and Generalist Po...,Greater Philadelphia Area,500+,


Attributes:
id : unique identifier for candidate (numeric)

job_title : job title for candidate (text)

location : geographical location for candidate (text)

connection : number of connections the candidate has; 500+ means over 500 (text)

In [20]:
print('title: ', data.job_title.unique())
print('location: ', data.location.unique())
print('connections: ', data.connection.unique())

title:  ['2019 C.T. Bauer College of Business Graduate (Magna Cum Laude) and aspiring Human Resources professional'
 'Native English Teacher at EPIK (English Program in Korea)'
 'Aspiring Human Resources Professional'
 'People Development Coordinator at Ryan'
 'Advisory Board Member at Celal Bayar University'
 'Aspiring Human Resources Specialist'
 'Student at Humber College and Aspiring Human Resources Generalist'
 'HR Senior Specialist'
 'Seeking Human Resources HRIS and Generalist Positions'
 'Student at Chapman University'
 'SVP, CHRO, Marketing & Communications, CSR Officer | ENGIE | Houston | The Woodlands | Energy | GPHR | SPHR'
 'Human Resources Coordinator at InterContinental Buckhead Atlanta'
 'Aspiring Human Resources Management student seeking an internship'
 'Seeking Human Resources Opportunities'
 'Experienced Retail Manager and aspiring Human Resources Professional'
 'Human Resources, Staffing and Recruiting Professional'
 'Human Resources Specialist at Luxottica'
 'Dire

# Processing data

In [21]:
# First, normalize the connection counts to the range 0-1, counting '500+' as 500

# Function to normalize scores to between 0-1
def normalize_score(score):
    # The raw value may carry surrounding whitespace (e.g. '500+ '), so strip it before comparing
    if str(score).strip() == '500+':
        score = 500
    return float(score)/500

# Applying the function to the 'connection' column to create a new 'normalized_connections' column
data['normalized_connections'] = data['connection'].apply(normalize_score)
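
The same normalization can also be written without a row-wise `apply`. The following is a minimal sketch on a hypothetical sample of raw values (the `raw` series is made up for illustration and is not the notebook's data); it assumes the only non-numeric form is a count followed by `+`, possibly padded with spaces.

```python
import pandas as pd

# Hypothetical sample of raw 'connection' values, including the padded '500+ ' form.
raw = pd.Series([85, "500+ ", "44", "500+", 1])

# Strip whitespace and the trailing '+', convert to numbers, then scale to [0, 1].
cleaned = raw.astype(str).str.strip().str.rstrip("+")
normalized = pd.to_numeric(cleaned).clip(upper=500) / 500

print(normalized.tolist())
```

Clipping at 500 keeps the result inside [0, 1] even if a larger count slips in, which matches the decision to treat `500+` as 500.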


In [37]:
pip install gensim

Collecting gensim
  Downloading gensim-4.3.1-cp38-cp38-macosx_10_9_x86_64.whl (24.0 MB)
Collecting smart-open>=1.8.1
  Downloading smart_open-6.3.0-py3-none-any.whl (56 kB)
Collecting scipy>=1.7.0
  Downloading scipy-1.10.1-cp38-cp38-macosx_10_9_x86_64.whl (35.0 MB)
Installing collected packages: smart-open, scipy, gensim
  Attempting uninstall: scipy
    Found existing installation: scipy 1.6.2
    Uninstalling scipy-1.6.2:
      Successfully uninstalled scipy-1.6.2
Successfully installed gensim-4.3.1 scipy-1.10.1 smart-open-6.3.0
Note: you may need to restart the kernel to use updat

In [38]:
# Next, calculate the similarity between each job title and the search keywords (stored in the variable 'keywords' defined near the top of the notebook; initially 'aspiring human resources')

from gensim.models import Word2Vec

# Load the Word2Vec model and tokenizer
model = Word2Vec.load("path/to/word2vec/model")
tokenizer = lambda text: text.split()

# Tokenize and encode the job titles using Word2Vec
job_title_encodings = data["job_title"].apply(
    lambda title: np.mean([model.wv[word] for word in tokenizer(title)], axis=0)
)

# Compute the cosine similarity between the encoded job titles and the keyword
keyword_encoding = np.mean([model.wv[word] for word in tokenizer(keywords)], axis=0)
similarity_scores = np.vstack(job_title_encodings.apply(lambda encoding: cosine_similarity(encoding.reshape(1, -1), keyword_encoding.reshape(1, -1)))).ravel()

# Add the similarity scores as a new column in the DataFrame
data["similarity_score_2vec"] = similarity_scores

FileNotFoundError: [Errno 2] No such file or directory: 'path/to/word2vec/model'
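
The cell above fails because `path/to/word2vec/model` is a placeholder rather than a real model file. Separately, looking words up directly with `model.wv[word]` raises a `KeyError` for any word that is missing from the model's vocabulary. Below is a minimal sketch of the averaging and cosine-similarity step that skips unknown words. It uses a hypothetical `toy_vectors` dictionary in place of a trained model; the vectors and the helper names `embed` and `cosine` are made up for illustration.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors standing in for a trained Word2Vec model.
toy_vectors = {
    "aspiring": np.array([0.9, 0.1, 0.0]),
    "human": np.array([0.1, 0.9, 0.1]),
    "resources": np.array([0.0, 0.8, 0.2]),
    "artist": np.array([0.0, 0.1, 0.9]),
}

def embed(text, vectors, dim=3):
    """Average the vectors of known words; return a zero vector if none are known."""
    known = [vectors[w] for w in text.lower().split() if w in vectors]
    return np.mean(known, axis=0) if known else np.zeros(dim)

def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

keyword_vec = embed("aspiring human resources", toy_vectors)
for title in ["Aspiring Human Resources Professional", "artist", "NA"]:
    print(title, round(cosine(embed(title, toy_vectors), keyword_vec), 3))
```

With a real model, the membership test would be made against the model's vocabulary instead of a dictionary, and `dim` would be the model's vector size.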

In [None]:
# Next, calculate the similarity between each job title and the search keywords (stored in the variable 'keywords' defined near the top of the notebook; initially 'aspiring human resources')

# Load the BERT tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Tokenize and encode the job titles using BERT
job_title_encodings = data["job_title"].apply(
    lambda title: model(**tokenizer(title, return_tensors="pt")).pooler_output.detach().numpy()
)

# Compute the cosine similarity between the encoded job titles and the keyword
keyword_encoding = model(**tokenizer(keywords, return_tensors="pt")).pooler_output.detach().numpy()
similarity_scores = np.vstack(job_title_encodings.apply(lambda encoding: cosine_similarity(encoding, keyword_encoding))).ravel()

# Add the similarity scores as a new column in the DataFrame
data["similarity_score"] = similarity_scores



In [33]:
data = data.sort_values(by='similarity_score', ascending=False)
data.tail()

Unnamed: 0,id,job_title,location,connection,fit,normalized_connections,similarity_score
69,70,"Retired Army National Guard Recruiter, office ...","Virginia Beach, Virginia",82,,0.164,0.759416
85,86,Information Systems Specialist and Programmer ...,"Gaithersburg, Maryland",4,,0.008,0.757093
93,94,Seeking Human Resources Opportunities. Open t...,Amerika Birleşik Devletleri,415,,0.83,0.755037
65,66,Experienced Retail Manager and aspiring Human ...,"Austin, Texas Area",57,,0.114,0.752354
68,69,"Director of Human Resources North America, Gro...","Greater Grand Rapids, Michigan Area",500+,,1.0,0.742197


In [35]:
#reset dataframe index
find = data.reset_index()
#find where the control row (id 1111, 'Machine learning') now ranks
find.loc[find['id'] == 1111]

Unnamed: 0,index,id,job_title,location,connection,fit,normalized_connections,similarity_score
43,104,1111,Machine learning,x,150,0.0,0.3,0.908728


In [None]:
# Create a fitness_score column as a weighted sum:
# 90% similarity_score + 10% normalized_connections
data['fitness_score'] = data['similarity_score'] * 0.9 + data['normalized_connections'] * 0.1
data = data.sort_values(by='fitness_score', ascending=False)
data.tail(20)

# Starring candidates

In [None]:
# add a new column named 'starred' and set all values to 0
data['starred'] = 0

# star the row with index label 7 as an example (.loc is label-based, so this is not the 7th row of the sorted frame)
data.loc[7, 'starred'] = 1

#update the fitness_score based on starring
data['fitness_score'] = data['similarity_score'] * 0.9 + data['normalized_connections'] * 0.1 + data['starred']
data = data.sort_values(by='fitness_score', ascending=False)
# print the updated dataframe
data.head(20)
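
Adding the `starred` flag directly to the score acts as a hard override: if both inputs lie in [0, 1], an unstarred score cannot exceed 1, while a starred score is at least 1. A small sketch with made-up values shows the effect (the ids and numbers are hypothetical, not taken from the dataset, and `fitness` is an illustrative helper):

```python
import pandas as pd

# Hypothetical candidates; similarity and connections are assumed to lie in [0, 1].
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "similarity_score": [0.95, 0.80, 0.76, 0.90],
    "normalized_connections": [1.0, 0.1, 0.5, 0.3],
    "starred": [0, 1, 0, 0],
})

def fitness(df, w_sim=0.9, w_conn=0.1):
    """Weighted base score plus a +1 bonus for starred rows."""
    return (df["similarity_score"] * w_sim
            + df["normalized_connections"] * w_conn
            + df["starred"])

toy["fitness_score"] = fitness(toy)
ranked = toy.sort_values(by="fitness_score", ascending=False)
print(ranked[["id", "starred", "fitness_score"]])
```

The bonus only moves the starred rows themselves. It leaves the scores of similar but unstarred candidates unchanged, so it cannot generalize from a starring action on its own.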


In [None]:
# star the rows with index labels 27, 55 and 100 as further examples
data.loc[27, 'starred'] = 1
data.loc[55, 'starred'] = 1
data.loc[100, 'starred'] = 1

#update the fitness_score based on starring
data['fitness_score'] = data['similarity_score'] * 0.9 + data['normalized_connections'] * 0.1 + data['starred']
data = data.sort_values(by='fitness_score', ascending=False)
# print the updated dataframe
data.head()

# Model

In [None]:
def ranking_model(data):
    # Features and binary relevance label (1 = starred)
    feature_cols = ['normalized_connections', 'similarity_score']
    X = data[feature_cols]
    y = data.starred

    # Re-split until both the training and test sets contain at least one starred candidate
    while True:
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
        print("Number of starred items in training set:", y_train.sum())
        print("Number of starred items in test set:", y_test.sum())
        if y_train.sum() > 0 and y_test.sum() > 0:
            break
    # Note: this second split is not checked, so the validation set may contain no starred items
    X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2)

    # Treat each split as a single query group (one ranked list per split)
    query_train = [X_train.shape[0]]
    query_val = [X_val.shape[0]]
    query_test = [X_test.shape[0]]  # currently unused

    # Train the LightGBM ranker
    gbm = lgb.LGBMRanker(objective="lambdarank", metric="ndcg")

    # early_stopping_rounds was removed from fit() in LightGBM 4.0; use the callback instead
    gbm.fit(X_train, y_train, group=query_train,
            eval_set=[(X_val, y_val)], eval_group=[query_val],
            eval_at=[5, 10, 20],
            callbacks=[lgb.early_stopping(stopping_rounds=50)])

    # Score every candidate (training, validation and test rows alike) and return the scores
    return gbm.predict(X)
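
The retry loop in `ranking_model` re-splits until both sets contain a starred candidate. One alternative, sketched here on synthetic labels rather than the notebook's data, is to pass `stratify=y` to `train_test_split`, which keeps the class proportions roughly equal across the two splits. The array sizes and `random_state` below are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 20 candidates, 4 of them starred (label 1).
X = np.arange(40, dtype=float).reshape(20, 2)
y = np.array([1] * 4 + [0] * 16)

# stratify=y preserves the starred/unstarred ratio in both splits,
# so no retry loop is needed when there are enough starred rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

print("starred in train:", int(y_train.sum()))
print("starred in test:", int(y_test.sum()))
```

Stratification needs at least two rows per class, so with very few starred rows a further stratified split for the validation set may still fail.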

In [None]:
predictions = ranking_model(data)

# Updating fitness and ranking

In [None]:
#rank based on model output
df = data
df['ranking'] = predictions
df = df.sort_values(by="ranking", ascending=False)

In [None]:
df.head()

In [None]:
#reset dataframe index
df = df.reset_index()
#find where the starred items are now
df.loc[df['starred'] == 1]

Problem: the starred rows are dispersed throughout the newly ranked dataset instead of appearing at the top.
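
To put a number on how dispersed the starred rows are, one option is to report their positions in the ranked list together with precision at k. The sketch below uses a hypothetical ranked frame; `starred_positions` and `precision_at_k` are illustrative helpers, not part of the notebook.

```python
import pandas as pd

def starred_positions(ranked):
    """1-based positions of starred rows in an already sorted frame."""
    flags = ranked["starred"].to_numpy()
    return [i + 1 for i, flag in enumerate(flags) if flag == 1]

def precision_at_k(ranked, k):
    """Fraction of the top-k rows that are starred."""
    return float(ranked["starred"].head(k).mean())

# Hypothetical ranking output: rows are already sorted best-first.
ranked = pd.DataFrame({
    "id": [8, 3, 28, 6, 56, 101],
    "starred": [1, 0, 1, 0, 0, 1],
})

print("positions:", starred_positions(ranked))
print("precision@3:", round(precision_at_k(ranked, 3), 3))
```

Recording these values after each starring action would give a simple way to check whether the ranking improves over time.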

# Trying with more starred candidates

In [None]:
data.head(50)

In [None]:
# star some more rows as additional examples
data.loc[60, 'starred'] = 1
data.loc[67, 'starred'] = 1
data.loc[12, 'starred'] = 1
data.loc[98, 'starred'] = 1
data.loc[80, 'starred'] = 1
data.loc[59, 'starred'] = 1
data.loc[28, 'starred'] = 1
data.loc[73, 'starred'] = 1
data.loc[57, 'starred'] = 1

#update the fitness_score based on starring
data['fitness_score'] = data['similarity_score'] * 0.9 + data['normalized_connections'] * 0.1 + data['starred']
data = data.sort_values(by='fitness_score', ascending=False)
# print the updated dataframe
data.head(10)

In [None]:
predictions = ranking_model(data)
df = data
df['ranking'] = predictions
df = df.sort_values(by='ranking', ascending=False)

In [None]:
df.head(20)

In [None]:
#reset dataframe index
df = df.reset_index()
#find where the starred items are now
df.loc[df['starred'] == 1]

## Result for ranking

I ranked candidates by a fitness score that combines the cosine similarity between each job title and the keywords (weight 0.9) with the normalized connection count (weight 0.1). When a candidate is manually starred, 1 is added to their score and the list is re-ranked.
I then trained a LightGBM ranking model (`LGBMRanker`) with `starred` as the target and re-ranked the candidates by its predictions.
However, the previously starred candidates do not all appear at the top of the new ranking, because the model did not train well on such a small positive class (n=4).
Adding more starred candidates (n=13) and retraining gives better, but still not great, results.