 ## Project 3: Potential Talents

Background:

As a talent sourcing and management company, we are interested in finding talented individuals and sourcing these candidates to technology companies. Finding talented candidates is not easy, for several reasons. First, one needs to understand the role very well in order to fill it, which requires understanding the client’s needs and what they are looking for in a potential candidate. Second, one needs to understand what makes a candidate shine for the role we are searching for. Third, knowing where to find talented individuals is a challenge in itself.

The nature of our job requires a lot of human labor and is full of manual operations. To automate this process, we want to build a better approach that saves us time and helps us spot potential candidates who could fit the roles we are searching for. Beyond that, for a specific role we want to fill, we are interested in developing a machine-learning-powered pipeline that can spot talented individuals and rank them based on their fitness.

We are right now semi-automatically sourcing a few candidates, therefore the sourcing part is not a concern at this time but we expect to first determine best matching candidates based on how fit these candidates are for a given role. We generally make these searches based on some keywords such as “full-stack software engineer”, “engineering manager” or “aspiring human resources” based on the role we are trying to fill in. These keywords might change, and you can expect that specific keywords will be provided to you.

Assuming that we were able to list and rank fitting candidates, we then employ a review procedure in which each candidate is inspected manually to determine how good a fit they are. At the end of this manual review, we might choose not the first candidate in the list but, say, the 7th candidate. If that happens, we are interested in being able to re-rank the previous list based on this information. This supervisory signal is going to be supplied by starring the 7th candidate in the list. Starring a candidate sets this candidate as an ideal candidate for the given role. We then expect the list to be re-ranked each time a candidate is starred.

Data Description:

The data comes from our sourcing efforts. We removed any field that could directly reveal personal details and gave a unique identifier for each candidate.

Attributes:
id : unique identifier for candidate (numeric)

job_title : job title for candidate (text)

location : geographical location for candidate (text)

connections: number of connections candidate has, 500+ means over 500 (text)

Output (desired target):
fit - how fit the candidate is for the role? (numeric, probability between 0-1)

Keywords: “Aspiring human resources” or “seeking human resources”

Download Data:

https://docs.google.com/spreadsheets/d/117X6i53dKiO7w6kuA1g1TpdTlv1173h_dPlJt5cNNMU/edit?usp=sharing

Goal(s):

Predict how fit the candidate is based on their available information (variable fit)

Success Metric(s):

Rank candidates based on a fitness score.

Re-rank candidates when a candidate is starred.

Bonus(es):

We are interested in a robust algorithm, tell us how your solution works and show us how your ranking gets better with each starring action.

How can we filter out candidates which in the first place should not be in this list?

Can we determine a cut-off point that would work for other roles without losing high potential candidates?

Do you have any ideas that we should explore so that we can even automate this procedure to prevent human bias?

### 1. Load the necessary packages and the data

In [24]:
# Load and import necessary modules
import pandas as pd
import numpy as np
import os

In [21]:
# We use spaCy to tokenize and lemmatize our keywords for NLP analyses
import spacy
nlp = spacy.load("en_core_web_trf")

In [29]:
# Loading the raw data
talents_df =  pd.read_excel('potential-talents.xlsx')
print(talents_df.shape)
talents_df.head(10)

(104, 5)


Unnamed: 0,id,job_title,location,connection,fit
0,1,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,
1,2,Native English Teacher at EPIK (English Progra...,Kanada,500+,
2,3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,
3,4,People Development Coordinator at Ryan,"Denton, Texas",500+,
4,5,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+,
5,6,Aspiring Human Resources Specialist,Greater New York City Area,1,
6,7,Student at Humber College and Aspiring Human R...,Kanada,61,
7,8,HR Senior Specialist,San Francisco Bay Area,500+,
8,9,Student at Humber College and Aspiring Human R...,Kanada,61,
9,10,Seeking Human Resources HRIS and Generalist Po...,Greater Philadelphia Area,500+,


From the data frame, it is clear that there are a few duplicates. Let's remove them.

In [30]:
# Clean the duplicates in the data
talents_df_clean = talents_df.drop(columns= 'id').drop_duplicates().reset_index(drop= True)
print(talents_df_clean.shape)
talents_df_clean.head(10)

(53, 4)


Unnamed: 0,job_title,location,connection,fit
0,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,
1,Native English Teacher at EPIK (English Progra...,Kanada,500+,
2,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,
3,People Development Coordinator at Ryan,"Denton, Texas",500+,
4,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+,
5,Aspiring Human Resources Specialist,Greater New York City Area,1,
6,Student at Humber College and Aspiring Human R...,Kanada,61,
7,HR Senior Specialist,San Francisco Bay Area,500+,
8,Seeking Human Resources HRIS and Generalist Po...,Greater Philadelphia Area,500+,
9,Student at Chapman University,"Lake Forest, California",2,


So there are just 53 unique rows among the 104 entries.

### 2. Modeling

We can estimate the fit score by looking for matching or similar words between the search term and the candidate's job title extracted from their profile. We can additionally use location information and connections to improve the ranking. Here we implement this using spaCy to tokenize and lemmatize the words; we can then use the TF-IDF vectorizer to vectorize the words and use either Jaccard similarity or cosine similarity to measure the similarity between the search term and the candidate's job title.

Furthermore, we can hard-code the importance of the number of connections by encoding them as ordinal classes:

0-99: 0 (low); 100-199: 1 (medium); 200-499: 2 (high); 500+: 3 (very high)


As for location, we can extract the city, state and country information from the 'location' column, check whether it matches the location in the job search keywords, and assign a score accordingly.
If only the country matches, a score of 1 is assigned; if the state also matches, a score of 2 is assigned; and if the city matches as well, a score of 3 is assigned. If the country does not match, a score of 0 is assigned.
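The two hand-coded features described above can be sketched without any external service. This is only an illustration under stated assumptions: `bin_connections` and `tiered_location_score` are hypothetical helper names, and the location dictionaries are typed in by hand rather than returned by a geocoder.

```python
def bin_connections(raw):
    """Map a connection count such as '85' or '500+' to an ordinal class 0-3."""
    text = str(raw).strip()
    if text == "500+":
        return 3
    count = int(text)
    if count < 100:
        return 0
    if count < 200:
        return 1
    # Assumption: counts of 500 or more only ever appear as the string '500+'
    return 2


def tiered_location_score(candidate, target):
    """Score 0-3: one point each for country, state and city, checked in that order."""
    score = 0
    for level in ("country", "state", "city"):
        # Stop at the first level that is missing or differs
        if not candidate.get(level) or candidate.get(level) != target.get(level):
            break
        score += 1
    return score


# Hand-written stand-ins for geocoder results (hypothetical values)
target = {"country": "United States", "state": "California", "city": None}
san_jose = {"country": "United States", "state": "California", "city": "San Jose"}
houston = {"country": "United States", "state": "Texas", "city": "Houston"}
izmir = {"country": "Türkiye", "state": "İzmir", "city": "İzmir"}

print(bin_connections("500+ "), bin_connections(85), bin_connections("268"))  # 3 0 2
print([tiered_location_score(c, target) for c in (san_jose, houston, izmir)])  # [2, 1, 0]
```

Because the search location here names only a state, a candidate in San Jose can reach at most a score of 2; a score of 3 requires the search keywords to name a city.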

#### 2.0 Just looking at similar/same words between the "search term keywords" and the "job profile" and quantifying similarity through Jaccard similarity

In [31]:
from sklearn.base import BaseEstimator, TransformerMixin

class TextSimilarityTransformerJS(BaseEstimator, TransformerMixin):
    def __init__(self, search_words, spacy_nlp):
        """
        Custom transformer to calculate Jaccard similarity
        between the search key words and the job_titles.
        """
        self.search_words = search_words
        self.spacy_nlp = spacy_nlp
    
    def fit(self, X, y=None):
        return self

    
    def transform(self, X):
      
        # Calculate Jaccard similarity between the job search words and the job titles of the candidates.
        similarity_scores = X.apply(self.job_similarity_kw_js, search_words= self.search_words)
    
        return similarity_scores
    

    def spacy_tokenize_lemmatize(self, text):
        """
        Tokenizes and lemmatizes text using spaCy.
        """
        if isinstance(text, str) and text.strip():
            # We only want to use nouns and proper nouns as the filler words 
            # do not add much to improve the score
            return([w.lemma_.lower() for w in self.spacy_nlp(text) if w.pos_ in ['NOUN','PROPN']])
        else:
            return []
    
    # Defining Jaccard similarity
    def job_similarity_kw_js(self, job_title, search_words):

        """
        Calculates Jaccard similarity between 2 sentences/phrases
        """
        job_title_token = set(self.spacy_tokenize_lemmatize(job_title))
        search_words_token = set(self.spacy_tokenize_lemmatize(search_words))
        n_common_words = len(job_title_token.intersection(search_words_token))
        total_words = len(job_title_token.union(search_words_token))
        # Guard against division by zero when neither text yields any noun tokens
        return n_common_words/total_words if total_words else 0.0

    

# Sample search keywords that we use to find the fit
search_words = "Aspiring Human Resources"

# Initialize and apply the transformer
similarity_transformer = TextSimilarityTransformerJS(search_words =search_words,  spacy_nlp=nlp)
talents_df_clean['fit']  = similarity_transformer.fit_transform(talents_df_clean['job_title'])

#Displaying candidates in descending order of fit score
talents_df_clean.sort_values(by =['fit'], ascending=False, inplace = False).head(10)

Unnamed: 0,job_title,location,connection,fit
45,Aspiring Human Resources Professional,"Kokomo, Indiana Area",71,0.666667
2,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,0.666667
13,Seeking Human Resources Opportunities,"Chicago, Illinois",390,0.666667
5,Aspiring Human Resources Specialist,Greater New York City Area,1,0.666667
42,Seeking Human Resources Opportunities. Open t...,Amerika Birleşik Devletleri,415,0.4
6,Student at Humber College and Aspiring Human R...,Kanada,61,0.333333
20,Business Management Major and Aspiring Human R...,"Monroe, Louisiana Area",5,0.333333
47,Seeking Human Resources Position,"Las Vegas, Nevada Area",48,0.25
22,Human Resources Professional,Greater Boston Area,16,0.25
36,Human Resources Management Major,"Milpitas, California",18,0.2


The problem with this model is that it only looks for identical words in the job title and the keywords, so even though "HR" and "Human Resources" mean the same thing, the model treats them as different tokens. Another major issue is that the model ignores word order: for example, it treats "resources human" the same as "human resources". This issue can be addressed with the n-gram models discussed next. Finally, the model penalizes job titles that contain useful information but do not share words with the search terms; it gives a score of 1 only to job titles whose tokens are exactly the same as those of the search terms.
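The limitations described above can be reproduced with plain Python sets. In this minimal sketch the token lists are written by hand and merely stand in for the output of the spaCy lemmatizer (an assumption made to keep the example self-contained); `jaccard` is an illustrative helper, not the transformer used in the notebook.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two token collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Assumed noun lemmas of the search term "Aspiring Human Resources"
query = ["human", "resource"]

# Identical token sets score 1.0, regardless of word order
print(jaccard(query, ["resource", "human"]))
# One extra token in the title already lowers the score to 2/3
print(jaccard(query, ["human", "resource", "professional"]))
# "hr" shares no token with the query, so the score is 0.0
print(jaccard(query, ["hr", "senior", "specialist"]))
```

The second case corresponds to the 0.666667 reported for "Aspiring Human Resources Professional" in the table above, assuming the part-of-speech filter keeps only "human", "resource" and "professional".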

#### 2.1 Using N-gram models and TF-IDF vectorizers with cosine similarity

In [33]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from spacy.lang.en.stop_words import STOP_WORDS

    
class TextSimilarityTransformerCS(BaseEstimator, TransformerMixin):
    def __init__(self, search_words, spacy_nlp):
        """
        Custom transformer to calculate cosine similarity 
        between the search key words and the job_titles after TF-IDF vectorization.

        """
        self.search_words = search_words
        self.spacy_nlp = spacy_nlp
        stop_words_tokenized = set(self.spacy_tokenize_lemmatize(' '.join(sorted(STOP_WORDS))))
        self.vectorizer =  TfidfVectorizer(min_df =0.1, max_features=100, 
                                    stop_words= list(stop_words_tokenized), ngram_range = (1,4))
        
    
        
    def fit(self, X, y=None):
        # Fit the vectorizer on the search keywords and the job titles
        lemmatized_search_words = ' '.join(self.spacy_tokenize_lemmatize(self.search_words))
        
        # Fit the vectorizer on the input texts and the lemmatized reference text
        self.vectorizer.fit(X.tolist() + [lemmatized_search_words])
        return self

    
    def transform(self, X):
        # Transform the job titles to TF-IDF vectors
        tfidf_matrix = self.vectorizer.transform(X.tolist())
        
        # Transform the search key words to a TF-IDF vector
        ref_vector = self.vectorizer.transform([' '.join(self.spacy_tokenize_lemmatize(self.search_words))])
        
        # Calculate cosine similarity between the search keyword vector and the job title vectors
        similarity_scores = cosine_similarity(ref_vector, tfidf_matrix)
        
        return similarity_scores.reshape(-1)
    
    
    def spacy_tokenize_lemmatize(self, text):
        """
        Tokenizes and lemmatizes text using spaCy.
        """
        if isinstance(text, str) and text.strip():
            # Only using proper nouns and nouns
            return([w.lemma_.lower() for w in self.spacy_nlp(text) if w.pos_ in ['NOUN','PROPN']])
    
        else:
            return []
    
# Sample search keywords that we use to find the fit
search_words = "Aspiring Human Resources"

# Initialize and apply the transformer

similarity_transformer = TextSimilarityTransformerCS(search_words =search_words,  spacy_nlp=nlp)
talents_df_clean['fit']  = similarity_transformer.fit_transform(talents_df_clean['job_title'])

#Displaying candidates in descending order of fit score
talents_df_clean.sort_values(by =['fit'], ascending=False, inplace = False).head(10)

Unnamed: 0,job_title,location,connection,fit
25,Human Resources|\nConflict Management|\nPolici...,Dallas/Fort Worth Area,409,0.569669
36,Human Resources Management Major,"Milpitas, California",18,0.569669
17,"Director of Human Resources North America, Gro...","Greater Grand Rapids, Michigan Area",500+,0.569669
26,Human Resources Generalist at Schwan's,Amerika Birleşik Devletleri,500+,0.458555
11,Human Resources Coordinator at InterContinenta...,"Atlanta, Georgia",500+,0.458555
19,"Human Resources Generalist at ScottMadden, Inc.","Raleigh-Durham, North Carolina Area",500+,0.458555
16,Human Resources Specialist at Luxottica,Greater New York City Area,500+,0.458555
29,Senior Human Resources Business Partner at Hei...,"Chattanooga, Tennessee Area",455,0.458555
37,Director Human Resources at EY,Greater Atlanta Area,349,0.458555
49,Human Resources Generalist at Loparex,"Raleigh-Durham, North Carolina Area",500+,0.458555


As mentioned earlier, here we use an n-gram model with 1-, 2-, 3- and 4-gram tokens, so local word order is preserved within each n-gram. However, this model still relies on counting the tokens shared between the search words and the job titles, so synonyms and similar words are still not matched; the word2vec model discussed later addresses this. Secondly, the IDF term in the TF-IDF vectorizer down-weights tokens that are common across the documents, which in this case unfortunately penalizes the term "Human Resources" because it occurs in many job titles, and boosts rarer words that may not be directly linked to the job.
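The IDF effect described above can be made concrete with a few lines of arithmetic. The sketch below applies the smoothed IDF formula that scikit-learn uses by default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, to a handful of hand-picked, lower-cased titles modeled on the data; the corpus and the helper name `smooth_idf` are assumptions for illustration only.

```python
import math

# Toy corpus modeled on the job titles in the data (hypothetical selection)
titles = [
    "human resources generalist",
    "human resources coordinator",
    "director of human resources",
    "aspiring human resources professional",
    "student at chapman university",
]


def smooth_idf(term, docs):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1, where df counts documents containing the term."""
    df = sum(term in doc.split() for doc in docs)
    return math.log((1 + len(docs)) / (1 + df)) + 1


for term in ("human", "resources", "aspiring", "chapman"):
    print(f"{term:10s} idf = {smooth_idf(term, titles):.3f}")
```

Terms that appear in most titles ("human", "resources") receive a lower weight than terms that appear once ("aspiring", "chapman"), which is why a frequent but highly relevant phrase contributes less to the cosine similarity than one might expect.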

#### 2.2 Extended N-gram model. Now let us add information from the location as well as the connections to estimate fit

So far we have only used the information in the job titles to estimate the fit score. We can also use the location information and the connection scores to surface more relevant candidates by improving the fit scores.

In [34]:
# We will be using the OpenCage Geocode API
# Also extracting the word2vec model path
# Loading the API key from the environment variables

from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv('config/.env'))
GeocodingKey= os.getenv('GeocodingKey')
WORD2VEC_PATH= os.getenv('GoogleNewsPath')

In [35]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from opencage.geocoder import OpenCageGeocode
from sklearn.metrics.pairwise import cosine_similarity

# This custom transformer calculates the cosine similarity, after TF-IDF vectorization,
# between the job title and the search key terms
class CosineSimilarityTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, search_words, vectorizer =TfidfVectorizer()) :
        
        """
        Custom transformer to calculate cosine similarity 
        between the search key words and the job_titles after TF-IDF vectorization.
        
        search_words: str, the search keywords against which to compute similarity
        vectorizer: vectorizer used to convert words to vectors, defaults to a TF-IDF vectorizer

        """
        self.search_words= search_words
        self.vectorizer = vectorizer

    def fit(self, X, y=None):
        # Ensure that X is treated as a Series and convert to list if necessary
        if isinstance(X, pd.DataFrame):
            X = X.iloc[:, 0]
        # Fit the vectorizer on the search keywords  and the job titles
        self.vectorizer.fit(X.tolist() + [self.search_words])
        return self

    def transform(self, X):
        # Ensure that X is treated as a Series and convert to list if necessary
        if isinstance(X, pd.DataFrame):
            X = X.iloc[:, 0]
        # Transform the search keywords and the job titles into TF-IDF vectors
        X_tfidf = self.vectorizer.transform(X.tolist())
        sw_tfidf = self.vectorizer.transform([self.search_words])

        # Calculate the cosine similarity of job titles with the search keywords
        similarity_scores = cosine_similarity(X_tfidf, sw_tfidf)
        
        # Return similarity scores as a 1D array of scores
        return similarity_scores.reshape(-1, 1)
    
# Using the OpenCage Geocode API to get location information, and configuring it
geocoder = OpenCageGeocode(GeocodingKey)

# Finding and extracting location information from the keywords
def get_location_info(location_name):
   
    """
    Takes a location name string, queries the OpenCage Geocode API, and returns the city,
    state and country information of the top search result as a dict.

    Args:
        location_name : string containing location information

    Returns:
        dict: Dictionary containing the city, state and country information for the search string

    """

    try:
        # Querying the API for location info
        results = geocoder.geocode(location_name, no_annotations='1')
        keys = ["city", "state", "country"]
        location = dict.fromkeys(keys)
        if results:
            # Usually, the first result is the most relevant one
            top_result = results[0]['components']
            for key in keys:
                location[key] = top_result.get(key, None)
    except Exception as e:
        print(f"Error geocoding {location_name}: {e}")
        location = dict.fromkeys(["city", "state", "country"], None)
    return location


# Calculating the location scores, if same city, then the score is 3, if same state score is 2 and 
# if same country then the score is 1, if none match, then zero
def location_scores(location_names, search_words):
    """
    Uses the search keywords and the location column of a data frame to compute location match scores.
 
    Args:
        location_names : Data frame column (pandas Series) containing location information for job candidates
        search_words : str containing the search term

    Returns:
        np.ndarray: Column vector of shape (n, 1) with the location match scores (0-3) between the location column and the search string
    """
    loc_search_kw = [ent.text for ent in nlp(search_words).ents if ent.label_ == 'GPE']
    if loc_search_kw:
        # Join multiple location entities with a separator so that they are not fused into one word
        kw_loc_dict = get_location_info(', '.join(loc_search_kw))
        scores = []
        for location_name in location_names:
            location_dict = get_location_info(location_name)

            score = 0
            if loc_search_kw:
                
                if (location_dict['country'] and (location_dict['country'] == kw_loc_dict['country'])):
                    score += 1
                    if (location_dict['state'] and (location_dict['state'] == kw_loc_dict['state'])):
                        score += 1
                        if (location_dict['city'] and (location_dict['city'] == kw_loc_dict['city'])): 
                            score += 1
            scores.append(score)
        return np.array(scores).reshape(-1, 1)
    else:
        return np.zeros(len(location_names)).reshape(-1, 1)


# Encoding LinkedIn connections
def categorize_connections(num):

    '''
    Bins the number of LinkedIn connections into ordinal values.
    The raw values are text such as '85' or '500+ ' (possibly with stray whitespace).
    '''
    num = str(num).strip()
    if num == '500+':
        return 3
    elif 0 <= int(num) < 100:
        return 0
    elif 100 <= int(num) < 200:
        return 1
    elif 200 <= int(num) < 500:
        return 2
    else:
        return None



# Run the pipeline and get the fit score

search_words = "Human Resources in California" # location information in the search words can improve predictions

preprocessor = ColumnTransformer(
    transformers=[

        ('conn', FunctionTransformer(np.vectorize(categorize_connections),  validate=False), ['connection']),
        ('loc', FunctionTransformer(location_scores, validate=False, kw_args={'search_words': search_words}) ,  'location'),
        ('title', CosineSimilarityTransformer(search_words = search_words, 
                                              vectorizer = TfidfVectorizer(min_df=0.1, max_features=100, ngram_range=(1, 4))), ['job_title'])
    ] 
)

# Weights of each score. Let us assign importance weights of 0.2, 0.5 and 0.7 to the information 
# contained in the number of connections, location and title respectively.
# The connection class ranges from 0 to 3 and the location score has the same range, 
# while the cosine similarity of the (non-negative) TF-IDF vectors lies between 0 and 1

weights = np.array([0.2, 0.5, 0.7 ]) 
talents_df_clean['fit'] = (preprocessor.fit_transform(talents_df_clean)*weights).sum(axis=1)

#Displaying candidates in descending order of fit score
talents_df_clean.sort_values(by =['fit'], ascending=False, inplace = False).head(10)

Unnamed: 0,job_title,location,connection,fit
23,"Nortia Staffing is seeking Human Resources, Pa...","San Jose, California",500+,1.923369
7,HR Senior Specialist,San Francisco Bay Area,500+,1.6
26,Human Resources Generalist at Schwan's,Amerika Birleşik Devletleri,500+,1.469546
11,Human Resources Coordinator at InterContinenta...,"Atlanta, Georgia",500+,1.469546
36,Human Resources Management Major,"Milpitas, California",18,1.461281
32,Human Resources professional for the world lea...,"Highland, California",50,1.440512
31,HR Manager at Endemol Shine North America,"Los Angeles, California",268,1.4
13,Seeking Human Resources Opportunities,"Chicago, Illinois",390,1.223369
48,Aspiring Human Resources Manager | Graduating ...,"Cape Girardeau, Missouri",103,1.177091
42,Seeking Human Resources Opportunities. Open t...,Amerika Birleşik Devletleri,415,1.168958


By adding the location information, many of the top search results are now from California, which improves our fit score estimation. Similarly, many of the top results have a high number of connections.
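As a worked example of the weighted sum, the score of 1.6 for "HR Senior Specialist" can be reproduced by hand. The feature values below are inferred from the table rather than read from the pipeline (connection class 3, location score 2, title similarity 0), and the final rescaling to the 0-1 range is an optional step that is not part of the notebook; `weighted_fit` is a hypothetical helper name.

```python
WEIGHTS = {"connections": 0.2, "location": 0.5, "title": 0.7}
# Largest value each feature can take: ordinal scores go up to 3, TF-IDF cosine up to 1
MAX_VALUES = {"connections": 3, "location": 3, "title": 1.0}


def weighted_fit(features):
    """Weighted sum of the three feature scores, mirroring the weights used in the cell above."""
    return sum(WEIGHTS[name] * value for name, value in features.items())


# "HR Senior Specialist", San Francisco Bay Area, 500+ connections (values inferred from the table)
hr_specialist = {"connections": 3, "location": 2, "title": 0.0}
raw = weighted_fit(hr_specialist)
print(round(raw, 6))  # 1.6, the value shown in the table

# Optional: divide by the largest attainable score so that fit lies in [0, 1]
best = weighted_fit(MAX_VALUES)
print(round(raw / best, 3))
```

The example also shows that the location and connection features, which range up to 3, can outweigh the title similarity even though the title carries the largest nominal weight.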

#### 2.3 Using word embeddings from Word2Vec via the Gensim module and calculating the similarity score between pairwise word embeddings using cosine similarity

We can break the sentence into individual words and use the word embeddings from a word2vec model to capture similarity through cosine similarity. Simply averaging the word vectors of a sentence may dilute its meaning, so we instead calculate the pairwise cosine similarity between the word vectors from the job title and those from the search words, and take the mean of the values at or above the 80th percentile (the top 20%) of the pairwise cosine values.
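The aggregation step can be illustrated without loading a word2vec model. The sketch below uses hypothetical hand-made 2-D vectors in place of real embeddings and a small `percentile_linear` helper that mimics NumPy's default linear-interpolation percentile; all names and vectors are assumptions for illustration.

```python
import math
from itertools import product


def cosine(u, v):
    """Cosine similarity between two vectors given as tuples."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def percentile_linear(values, q):
    """q-th percentile with linear interpolation between neighbouring ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo = math.floor(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def top_pairwise_similarity(words_a, words_b, vectors, q=80):
    """Mean of the pairwise cosine similarities at or above the q-th percentile."""
    if not words_a or not words_b:
        return 0.0
    sims = [cosine(vectors[a], vectors[b]) for a, b in product(words_a, words_b)]
    threshold = percentile_linear(sims, q)
    top = [s for s in sims if s >= threshold]
    return sum(top) / len(top)


# Hypothetical 2-D "embeddings", chosen by hand for illustration only
toy_vectors = {
    "hr": (0.9, 0.1), "human": (1.0, 0.0), "resources": (0.8, 0.2),
    "manager": (0.3, 0.7), "california": (0.0, 1.0),
}
title = ["hr", "manager"]
query = ["human", "resources", "california"]

all_pairs = [cosine(toy_vectors[a], toy_vectors[b]) for a, b in product(title, query)]
plain_mean = sum(all_pairs) / len(all_pairs)
top_mean = top_pairwise_similarity(title, query, toy_vectors)
print(round(plain_mean, 3), round(top_mean, 3))
```

Weakly related pairs such as "hr" and "california" pull the plain mean down, whereas the top-percentile mean is driven by the best-matching word pairs.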

In [41]:
from sklearn.base import TransformerMixin, BaseEstimator
from gensim.models import KeyedVectors
from gensim.utils import simple_preprocess
import itertools
from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import ColumnTransformer

class CosineSimilarityTransformerW2V(TransformerMixin, BaseEstimator):
    def __init__(self, model_path, search_kw) :
        """
        Custom transformer to calculate mean pairwise cosine similarity 
        between the words in the search key words and the job_titles after calculating
        word embeddings

        search_kw: str, the search keywords against which to compute similarity
        model_path: str, path to the pre-trained word2vec model
        """
        self.model_path = model_path
        self.search_kw = search_kw
        self.model = None
        self.search_words = None

    def fit(self, X, y=None):
        # Load the word2vec model
        self.model = KeyedVectors.load_word2vec_format(self.model_path, binary=True)
        # Extract the words for the search keywords
        self.search_words= self.get_words(self.search_kw)
        return self

    def transform(self, X):
        # Compute cosine similarity between each sentence in job title (X) and the search keywords
        cosine_sim = [self.pairwise_cs(self.get_words(sentence), self.search_words) for _, sentence in X.items()]        
        return np.array(cosine_sim).reshape(-1, 1)
    

    def get_words(self, sentence):
        # Convert sentence to tokens, ignoring out-of-vocabulary words
        # key_to_index is a dict, so the membership test avoids rebuilding the vocabulary list for every word
        words = [word for word in simple_preprocess(sentence) if word in self.model.key_to_index]
        return words

    def cosine_similarity(self, vec1, vec2):
        # Compute cosine similarity between two vectors
        return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
    
    def pairwise_cs(self, sentence1, sentence2):
        if (sentence1 and sentence2): # Both must be non-empty; otherwise return 0
            cosine_vals = []
            for  word1, word2 in list(itertools.product(sentence1, sentence2)):
                cosine_vals.append(self.cosine_similarity(self.model[word1], self.model[word2]))
            cosine_vals = np.array(cosine_vals)
            # Take the mean of the cosine values at or above the 80th percentile (top 20%) as the similarity between the two sentences
            return np.mean(cosine_vals[cosine_vals >= np.percentile( cosine_vals, 80) ])
        else:
            return 0

# Now lets run the pipeline and get the fit information
search_words = "Human Resources in California" # location information in the search words can improve predictions
preprocessor = ColumnTransformer(
    transformers=[

        ('conn', FunctionTransformer(np.vectorize(categorize_connections),  validate=False), ['connection']),
        ('loc', FunctionTransformer(location_scores, validate=False, kw_args={'search_words': search_words}) ,  'location'),
        ('title', CosineSimilarityTransformerW2V(search_kw = search_words, model_path = WORD2VEC_PATH), 'job_title')
    ] 
)

# Weights of each score. Let us assign importance weights of 0.2, 0.5 and 0.7 to the information contained in the number of connections, location and title respectively
# The connection class ranges from 0 to 3 and the location score has the same range, but the cosine similarity is between -1 and 1
weights = np.array([0.2, 0.5, 0.7])
talents_df_clean['fit'] = (preprocessor.fit_transform(talents_df_clean)*weights).sum(axis=1)

#Displaying candidates in descending order of fit score
talents_df_clean.sort_values(by =['fit'], ascending=False, inplace = False).head(10)

Unnamed: 0,job_title,location,connection,fit
23,"Nortia Staffing is seeking Human Resources, Pa...","San Jose, California",500+,1.942121
7,HR Senior Specialist,San Francisco Bay Area,500+,1.717908
31,HR Manager at Endemol Shine North America,"Los Angeles, California",268,1.637331
11,Human Resources Coordinator at InterContinenta...,"Atlanta, Georgia",500+,1.621325
26,Human Resources Generalist at Schwan's,Amerika Birleşik Devletleri,500+,1.585514
36,Human Resources Management Major,"Milpitas, California",18,1.447768
32,Human Resources professional for the world lea...,"Highland, California",50,1.427759
3,People Development Coordinator at Ryan,"Denton, Texas",500+,1.364098
13,Seeking Human Resources Opportunities,"Chicago, Illinois",390,1.310571
42,Seeking Human Resources Opportunities. Open t...,Amerika Birleşik Devletleri,415,1.27623


As anticipated, because the word2vec model captures that "HR" and "Human Resources" mean the same thing, additional relevant results are now ranked higher. Job titles such as "People Development Coordinator" are also ranked higher because they are very relevant to HR; all the previous models failed to capture this because they relied on identical words rather than word similarity in the semantic space. However, even this model has a disadvantage: since we are dealing with word embeddings, we look at each word individually and do not capture the context in which the word appears. Sentence embeddings can better capture the information in a sentence.

#### 2.4 Using sentence level representation from the [CLS] token from BERT and calculating the similarity score using cosine similarity

In [37]:
from sklearn.base import TransformerMixin, BaseEstimator
from transformers import BertModel, BertTokenizer
from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import ColumnTransformer
import torch

class CosineSimilarityTransformerBERT(TransformerMixin, BaseEstimator):
    def __init__(self, search_kw, model_name='bert-base-uncased') :
        """
        Custom transformer to calculate cosine similarity 
        between the  key words and the job_titles segments using the BERT model

        search_kw: str, the search key words against which to compute similarity
        model_name: str, model name of the BERT model to use

        """
        self.model_name = model_name
        self.tokenizer = BertTokenizer.from_pretrained(model_name)
        self.model = BertModel.from_pretrained(model_name)
        self.search_kw = search_kw
        self.model.eval()  # Set model to evaluation mode
        self.kw_vec =  None

    def fit(self, X, y=None):
        # Calculating the sentence embedding of the search key words
        inputs = self.tokenizer(self.search_kw , return_tensors='pt', padding=True, truncation=True)
        with torch.no_grad():
            self.kw_vec = self.model(**inputs).last_hidden_state[:, 0, :].squeeze()  # CLS token representation
        return self

    def transform(self, X):

        # Transform each job_title in the DataFrame column to cosine similarity score with the search key words
        cos_sim_scores = []
        for  _, job_title in X.items():
            inputs = self.tokenizer(job_title, return_tensors='pt', padding=True, truncation=True)
            with torch.no_grad():
                job_title_vec = self.model(**inputs).last_hidden_state[:, 0, :].squeeze()
            cos_sim = torch.nn.functional.cosine_similarity(self.kw_vec, job_title_vec, dim=0)
            cos_sim_scores.append(cos_sim.item())
        return np.array(cos_sim_scores).reshape(-1, 1)
    

# Now lets run the pipeline and get the fit information
search_words = "Human Resources in California" # location information in the search words can improve predictions

preprocessor = ColumnTransformer(
    transformers=[

        ('conn', FunctionTransformer(np.vectorize(categorize_connections),  validate=False), ['connection']),
        ('loc', FunctionTransformer(location_scores, validate=False, kw_args={'search_words': search_words}) ,  'location'),
        ('title', CosineSimilarityTransformerBERT(search_kw = search_words, model_name='bert-base-uncased'), 'job_title')
    ] 
)

# Weights of each score. Let us assign importance weights of 0.2, 0.5 and 0.7 
# to the information contained in the number of connections, location and title respectively
# The connection class ranges from 0 to 3 and the location score has the same range, 
# but the cosine similarity is between -1 and 1

weights = np.array([0.2, 0.5, 0.7])
talents_df_clean['fit'] = (preprocessor.fit_transform(talents_df_clean)*weights).sum(axis=1)

#Displaying candidates in descending order of fit score
talents_df_clean.sort_values(by =['fit'], ascending=False, inplace = False).head(10)

| index | job_title | location | connection | fit |
|---|---|---|---|---|
| 7 | HR Senior Specialist | San Francisco Bay Area | 500+ | 2.203727 |
| 23 | Nortia Staffing is seeking Human Resources, Pa... | San Jose, California | 500+ | 2.179316 |
| 31 | HR Manager at Endemol Shine North America | Los Angeles, California | 268 | 1.946667 |
| 3 | People Development Coordinator at Ryan | Denton, Texas | 500+ | 1.766357 |
| 26 | Human Resources Generalist at Schwan's | Amerika Birleşik Devletleri | 500+ | 1.72001 |
| 11 | Human Resources Coordinator at InterContinenta... | Atlanta, Georgia | 500+ | 1.71505 |
| 52 | Director Of Administration at Excellence Logging | Katy, Texas | 500+ | 1.693956 |
| 9 | Student at Chapman University | Lake Forest, California | 2 | 1.631176 |
| 41 | Admissions Representative at Community medical... | Long Beach, California | 9 | 1.579031 |
| 32 | Human Resources professional for the world lea... | Highland, California | 50 | 1.575069 |


#### 2.5 Using SentenceTransformers from Sentence-BERT and calculating the similarity score using cosine similarity

Finally, we will use Sentence-BERT, which was trained to produce semantically meaningful sentence embeddings.

In [38]:
from sentence_transformers import SentenceTransformer, util
from sklearn.base import TransformerMixin, BaseEstimator
from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import ColumnTransformer

class CosineSimilarityTransformerSBERT(TransformerMixin, BaseEstimator):
    def __init__(self, search_kw, model_name='all-MiniLM-L6-v2'):
        """
        Custom transformer to calculate the cosine similarity
        between the search keywords and each job title using a Sentence-BERT model

        search_kw: str, the search keywords against which to compute similarity
        model_name: str, name of the Sentence-BERT model to use

        """
        # Initializing and loading the model
        self.model_name = model_name
        self.search_kw = search_kw
        self.model = SentenceTransformer(model_name)
        self.ref_vec = None
    
    def fit(self, X, y=None):
        
        # Precompute the search keywords embedding
        self.ref_vec = self.model.encode(self.search_kw, convert_to_tensor=True)
        return self
    
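    def transform(self, X):
        # Calculate cosine similarity of each job title to the search keywords
        # list(X) passes plain strings to encode() regardless of the Series index
        sentence_embeddings = self.model.encode(list(X), convert_to_tensor=True)
        cosine_similarities = util.pytorch_cos_sim(self.ref_vec, sentence_embeddings).flatten()
        # .cpu() is a no-op on CPU and keeps .numpy() working if the model runs on a GPU
        return cosine_similarities.cpu().numpy().reshape(-1, 1)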


# Now let's run the pipeline and get the fit information
search_words = "Human Resources in California" # location information in the search words can improve predictions

preprocessor = ColumnTransformer(
    transformers=[

        ('conn', FunctionTransformer(np.vectorize(categorize_connections),  validate=False), ['connection']),
        ('loc', FunctionTransformer(location_scores, validate=False, kw_args={'search_words': search_words}) ,  'location'),
        ('title', CosineSimilarityTransformerSBERT(search_kw = search_words, model_name='all-MiniLM-L6-v2'), 'job_title')
    ] 
)

# Weights of each score: importance weights of 0.2, 0.5 and 0.7 are assigned
# to the number of connections, the location and the title respectively.
# The connection and location scores both range from 0 to 3,
# but the cosine similarity lies between -1 and 1.
weights = np.array([0.2, 0.5, 0.7])
talents_df_clean['fit'] = (preprocessor.fit_transform(talents_df_clean)*weights).sum(axis=1)

#Displaying candidates in descending order of fit score
talents_df_clean.sort_values(by =['fit'], ascending=False, inplace = False).head(10)

| index | job_title | location | connection | fit |
|---|---|---|---|---|
| 7 | HR Senior Specialist | San Francisco Bay Area | 500+ | 1.907255 |
| 23 | Nortia Staffing is seeking Human Resources, Pa... | San Jose, California | 500+ | 1.87552 |
| 31 | HR Manager at Endemol Shine North America | Los Angeles, California | 268 | 1.690532 |
| 11 | Human Resources Coordinator at InterContinenta... | Atlanta, Georgia | 500+ | 1.542001 |
| 26 | Human Resources Generalist at Schwan's | Amerika Birleşik Devletleri | 500+ | 1.452684 |
| 36 | Human Resources Management Major | Milpitas, California | 18 | 1.409303 |
| 3 | People Development Coordinator at Ryan | Denton, Texas | 500+ | 1.389546 |
| 32 | Human Resources professional for the world lea... | Highland, California | 50 | 1.316694 |
| 13 | Seeking Human Resources Opportunities | Chicago, Illinois | 390 | 1.311758 |
| 52 | Director Of Administration at Excellence Logging | Katy, Texas | 500+ | 1.268445 |


As can be seen above, most of the top 10 results are relevant to the search keywords ("Human Resources in California"): all of them are based in the USA, and half of them are in California. Sentence-BERT therefore appears to be performing as expected.
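
The fit score used in the cells above is a weighted sum of the three per-candidate scores. A minimal sketch of that arithmetic follows, using made-up scores (the two candidates and the helper name `fit_score` are hypothetical, not taken from the dataset or the notebook). It also illustrates why the differing ranges noted in the code comments matter: a score ranging from 0 to 3 can outweigh a cosine similarity bounded by 1 even though its weight is smaller.

```python
# Weights for (connections, location, title), as in the cells above.
weights = [0.2, 0.5, 0.7]

def fit_score(scores, weights):
    """Weighted sum of the per-feature scores for one candidate."""
    return sum(s * w for s, w in zip(scores, weights))

# Hypothetical candidates: (connection bin 0-3, location score 0-3, cosine similarity).
strong_title = (0, 0, 0.9)     # excellent title match, no location match
strong_location = (3, 3, 0.1)  # weak title match, best location and connections

print(f"strong title:    {fit_score(strong_title, weights):.2f}")
print(f"strong location: {fit_score(strong_location, weights):.2f}")
```

In this toy case the candidate with the weak title match still ranks first, which is one motivation for rescaling the connection and location scores to the 0 to 1 range, as done in the next section.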

### 3. Updating the fit scores based on the starred candidates

If a profile is starred, we can replace the reference search vector with a weighted sum of the current reference vector and the starred candidate's job title vector. This increases the cosine similarity between the updated reference vector and candidates similar to the starred one.
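
The update described here is a convex combination of two embeddings. The sketch below uses toy 3-dimensional vectors to show that blending pulls the reference vector toward the starred candidate. The vector values and the helper names `cosine` and `blend` are illustrative assumptions, not part of the notebook; real `all-MiniLM-L6-v2` embeddings have 384 dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def blend(reference, starred, alpha):
    """Return (1 - alpha) * reference + alpha * starred, element-wise."""
    return [(1 - alpha) * r + alpha * s for r, s in zip(reference, starred)]

# Toy stand-ins for the embeddings (assumed values, for illustration only).
query_vec = [1.0, 0.0, 0.0]    # embedding of the search keywords
starred_vec = [0.6, 0.8, 0.0]  # embedding of the starred job title
alpha = 0.3                    # weight given to the starred candidate

updated_vec = blend(query_vec, starred_vec, alpha)

before = cosine(query_vec, starred_vec)
after = cosine(updated_vec, starred_vec)
print(f"similarity to starred title before: {before:.3f}, after: {after:.3f}")
```

A larger `alpha` moves the reference vector further toward the starred title, at the cost of drifting away from the original search keywords.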

In [6]:
# Defining the necessary transformations and functions

from sentence_transformers import util
from sklearn.base import TransformerMixin, BaseEstimator
from opencage.geocoder import OpenCageGeocode

class CosineSimilarityTransformerSBERT2(TransformerMixin, BaseEstimator):
    def __init__(self, reference_vec, model):
        """
        Custom transformer to calculate the cosine similarity
        between a reference vector and each job title using a Sentence-BERT model

        reference_vec: the reference vector against which to compute similarity for all job titles
        model: the Sentence-BERT model used to embed the job titles

        """

        self.model = model
        self.reference_vec  = reference_vec
    
    def fit(self, X, y=None):
        return self
    
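    def transform(self, X):
        # Calculate cosine similarity of each job title to the reference vector
        # list(X) passes plain strings to encode() regardless of the Series index
        sentence_embeddings = self.model.encode(list(X), convert_to_tensor=True)
        cosine_similarities = util.pytorch_cos_sim(self.reference_vec, sentence_embeddings).flatten()
        # .cpu() is a no-op on CPU and keeps .numpy() working if the model runs on a GPU
        return cosine_similarities.cpu().numpy().reshape(-1, 1)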
    

# Calculating the location scores based on locations_df: for each reference location,
# a candidate scores 3 for the same city, 2 for the same state, 1 for the same country
# and 0 if nothing matches. Scores are summed over all reference locations.

def location_scores2(location_names, locations_df):

    """
    Scores each candidate location against the reference locations.

    Args:
        location_names: DataFrame column (pandas Series) containing location information for job candidates
        locations_df: DataFrame containing the city, state and country of the search term
            and of all previously starred candidates

    Returns:
        np.ndarray: Column vector of shape (n, 1) with scores min-max scaled to [0, 1]
    """

    if locations_df.empty:
        return np.zeros(len(location_names)).reshape(-1, 1)

    scores = []
    for location_name in location_names:
        location_dict = get_location_info(location_name)
        score = 0
        if location_dict:
            for _, loc_row in locations_df.iterrows():
                if location_dict['country'] and location_dict['country'] == loc_row['country']:
                    score += 1
                    if location_dict['state'] and location_dict['state'] == loc_row['state']:
                        score += 1
                        if location_dict['city'] and location_dict['city'] == loc_row['city']:
                            score += 1
        scores.append(score)
    scores = np.array(scores, dtype=float)
    score_range = scores.max() - scores.min()
    if score_range == 0:
        # All candidates scored the same, so location carries no ranking information
        return np.zeros(len(scores)).reshape(-1, 1)
    # Scaling scores so that they are in a range between 0 and 1
    scaled_scores = (scores - scores.min()) / score_range
    return scaled_scores.reshape(-1, 1)

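# Encoding LinkedIn connections so that they are always between 0 and 1
def categorize_connections2(num, threshold: float = 0):
    '''
    Bins the number of LinkedIn connections into values between 0 and 1.

    With no threshold (threshold == 0), connections are binned into four ordinal levels;
    otherwise the result is 1 if the count reaches the threshold and 0 if not.
    '''
    if not num:
        return None
    # Tolerates surrounding whitespace, e.g. the '500+ ' form checked for originally
    is_max = str(num).strip() == '500+'
    if not threshold:
        if is_max:
            return 3/3
        elif 0 <= int(num) < 100:
            return 0/3
        elif 100 <= int(num) < 200:
            return 1/3
        elif 200 <= int(num) < 500:
            return 2/3
        else:
            return None
    else:
        # int(num) allows the count to be stored either as a string or as a number
        if is_max or int(num) >= threshold:
            return 1
        else:
            return 0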
    
# Using the OpenCage Geocoding API to get location information
geocoder = OpenCageGeocode(GeocodingKey)

# Finding and extracting location information from a location string
def get_location_info(location_name):

    """
    Queries the OpenCage Geocoding API with a location string and returns the city,
    state and country of the top search result as a dict

    Args:
        location_name: string containing location information

    Returns:
        dict: Dictionary containing the city, state and country information for the search string

    """

    try:
        results = geocoder.geocode(location_name, no_annotations='1')
        keys = ["city", "state", "country"]
        location = dict.fromkeys(keys)
        if results:
            # Usually, the first result is the most relevant one
            top_result = results[0]['components']
            for key in keys:
                location[key] = top_result.get(key, None)
    except Exception as e:
        print(f"Error geocoding {location_name}: {e}")
        location = dict.fromkeys(["city", "state", "country"], None)
    return location


In [39]:
# Now we will test out the updating pipeline that can adjust according to the starred candidates

from sentence_transformers import SentenceTransformer
from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import ColumnTransformer

alpha = 0.3 # This is the parameter which modifies the reference search vector by weighting the starred candidate
search_words = "Human Resources in California"
model = SentenceTransformer('all-MiniLM-L6-v2') # Sentence-BERT model

# Extracting information from the search words to get the initial set of predictions
# Deriving sentence embeddings for the search words
reference_vec = model.encode(search_words, convert_to_tensor=True)

# Extracting and saving any location information in the search keywords
locations_df = pd.DataFrame(columns = ['country','state', 'city'])
search_loc = [ent.text for ent in nlp(search_words).ents if ent.label_ == 'GPE']
if search_loc:
    search_loc_dict = get_location_info(' '.join(search_loc))
    # DataFrame.append was removed in pandas 2.0; pd.concat is the public replacement
    locations_df = pd.concat([locations_df, pd.DataFrame([search_loc_dict])], ignore_index=True)

# Thresholds for the categorizing the no of LinkedIn connections
thresholds = [0]

# Defining the transformer pipeline
preprocessor = ColumnTransformer(
    transformers=[

        ('conn', FunctionTransformer(np.vectorize(categorize_connections2), validate=False, 
                                     kw_args={'threshold': np.mean(np.array(thresholds))}), ['connection']),
        ('loc', FunctionTransformer(location_scores2, validate=False, kw_args={'locations_df': locations_df}) ,'location'),
        ('title', CosineSimilarityTransformerSBERT2(reference_vec = reference_vec, model = model), 'job_title')
    ] 
)

# Weights of each score: importance weights of 0.2, 0.5 and 0.7 are assigned
# to the number of connections, the location and the title respectively.
# The connection and location scores both range from 0 to 1,
# but the cosine similarity lies between -1 and 1.

weights = np.array([0.2, 0.5, 0.7])
talents_df_clean['fit'] = (preprocessor.fit_transform(talents_df_clean)*weights).sum(axis=1)

#Displaying candidates in descending order of fit score
talents_df_clean.sort_values(by =['fit'], ascending=False, inplace = False).head(10)


| index | job_title | location | connection | fit |
|---|---|---|---|---|
| 7 | HR Senior Specialist | San Francisco Bay Area | 500+ | 1.007255 |
| 23 | Nortia Staffing is seeking Human Resources, Pa... | San Jose, California | 500+ | 0.97552 |
| 31 | HR Manager at Endemol Shine North America | Los Angeles, California | 268 | 0.923865 |
| 36 | Human Resources Management Major | Milpitas, California | 18 | 0.909303 |
| 11 | Human Resources Coordinator at InterContinenta... | Atlanta, Georgia | 500+ | 0.892001 |
| 32 | Human Resources professional for the world lea... | Highland, California | 50 | 0.816694 |
| 26 | Human Resources Generalist at Schwan's | Amerika Birleşik Devletleri | 500+ | 0.802684 |
| 13 | Seeking Human Resources Opportunities | Chicago, Illinois | 390 | 0.795091 |
| 3 | People Development Coordinator at Ryan | Denton, Texas | 500+ | 0.739546 |
| 24 | Aspiring Human Resources Professional \| Passio... | New York, New York | 212 | 0.7297 |


In [40]:
sindx = 31 # Index of the starred candidate
# Extracting information from the job title to update the search vector

job_title_col = talents_df_clean.loc[sindx, 'job_title']
reference_vec = (1-alpha)*reference_vec + alpha*model.encode(job_title_col, convert_to_tensor=True)

# Extracting information for updating locations
location_col = talents_df_clean.loc[sindx, 'location']
# DataFrame.append was removed in pandas 2.0; pd.concat is the public replacement
locations_df = pd.concat([locations_df, pd.DataFrame([get_location_info(location_col)])], ignore_index=True)

# Extracting information for updating connections

connections_col = talents_df_clean.loc[sindx, 'connection']
if str(connections_col).strip() == '500+':
    thresholds.append(500)
else:
    thresholds.append(int(connections_col))



preprocessor = ColumnTransformer(
    transformers=[
        ('conn', FunctionTransformer(np.vectorize(categorize_connections2), validate=False,
                                     kw_args={'threshold': np.mean(np.array(thresholds))}), ['connection']),
        ('loc', FunctionTransformer(location_scores2, validate=False, kw_args={'locations_df': locations_df}), 'location'),
        ('title', CosineSimilarityTransformerSBERT2(reference_vec = reference_vec, model = model), 'job_title')
    ]
)

transformed = preprocessor.fit_transform(talents_df_clean)
weights = np.array([0.2, 0.5, 0.7])
talents_df_clean['fit'] = (transformed*weights).sum(axis=1)
talents_df_clean.sort_values(by =['fit'], ascending=False, inplace = False).head(10)

| index | job_title | location | connection | fit |
|---|---|---|---|---|
| 31 | HR Manager at Endemol Shine North America | Los Angeles, California | 268 | 1.175953 |
| 7 | HR Senior Specialist | San Francisco Bay Area | 500+ | 0.979793 |
| 23 | Nortia Staffing is seeking Human Resources, Pa... | San Jose, California | 500+ | 0.927448 |
| 11 | Human Resources Coordinator at InterContinenta... | Atlanta, Georgia | 500+ | 0.894896 |
| 36 | Human Resources Management Major | Milpitas, California | 18 | 0.862147 |
| 13 | Seeking Human Resources Opportunities | Chicago, Illinois | 390 | 0.845434 |
| 30 | Aspiring Human Resources Professional \| An ene... | Austin, Texas Area | 174 | 0.822542 |
| 26 | Human Resources Generalist at Schwan's | Amerika Birleşik Devletleri | 500+ | 0.789763 |
| 24 | Aspiring Human Resources Professional \| Passio... | New York, New York | 212 | 0.771326 |
| 32 | Human Resources professional for the world lea... | Highland, California | 50 | 0.752812 |


As expected, since index 31 was starred, its fit score has improved (from 0.923865 to 1.175953) and it now ranks as the candidate with the highest fit score. Hence the ranking updates based on the starred candidates.
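
Because each star reapplies the same blend to the reference vector, the original search keywords retain a weight of `(1 - alpha) ** k` after `k` starred candidates. The short sketch below works through that arithmetic, assuming the `alpha = 0.3` used above; the helper name `query_weight` and the star counts are hypothetical.

```python
alpha = 0.3  # blend weight used in the notebook

def query_weight(alpha, num_stars):
    """Weight of the original query embedding after num_stars blend updates."""
    return (1 - alpha) ** num_stars

for k in range(4):
    print(f"after {k} stars: query weight = {query_weight(alpha, k):.3f}")
```

With this value of `alpha`, the search keywords account for less than half of the reference vector after two stars, so a smaller `alpha` may be preferable if many candidates are expected to be starred.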