## Project 3: Potential Talents

Background:

As a talent sourcing and management company, we find talented individuals and place them with technology companies. Finding talented candidates is not easy, for several reasons. First, filling a role requires understanding it very well, which means understanding the client’s needs and what they are looking for in a potential candidate. Second, we need to understand what makes a candidate shine for the role we are trying to fill. Third, knowing where to find talented individuals is a challenge in itself.

Our work requires a lot of human labor and is full of manual operations. To automate this process, we want to build a better approach that saves us time and helps us spot potential candidates who could fit the roles we are trying to fill. Beyond that, for a specific role, we are interested in developing a machine-learning-powered pipeline that spots talented individuals and ranks them by fitness.

We currently source a few candidates semi-automatically, so sourcing is not a concern at this time. Instead, we first want to determine the best-matching candidates based on how well they fit a given role. We generally run these searches with keywords such as “full-stack software engineer”, “engineering manager” or “aspiring human resources”, depending on the role we are trying to fill. These keywords might change, and you can expect that specific keywords will be provided to you.

Once we can list and rank fitting candidates, we apply a review procedure: each candidate is inspected manually to determine how good a fit they are. At the end of this manual review, we might choose not the first candidate in the list but, say, the 7th. If that happens, we want to re-rank the previous list based on this information. This supervisory signal is supplied by starring the 7th candidate in the list. Starring a candidate marks that candidate as an ideal candidate for the given role. We then expect the list to be re-ranked each time a candidate is starred.
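
One way to read the re-ranking requirement is as a query update: every starred title is treated as an additional description of the ideal candidate, and the list is re-scored against the keywords together with all starred titles. The sketch below is only an illustration of that idea, not the method used later in this notebook. The helper name `rerank_with_stars`, the toy titles, and the choice to average the keyword vector with the starred vectors are all assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def rerank_with_stars(titles, keywords, starred_idx=()):
    """Hypothetical helper: rank titles against keywords plus starred titles.

    Returns (order, scores), where order lists candidate positions from
    best to worst fit.
    """
    vectorizer = TfidfVectorizer()
    # Last row of the matrix is the keyword query; the rest are candidates
    matrix = vectorizer.fit_transform(list(titles) + [keywords])
    n = len(titles)
    # The query is the mean of the keyword vector and every starred title vector
    query_rows = [n] + list(starred_idx)
    query = np.asarray(matrix[query_rows].mean(axis=0)).reshape(1, -1)
    scores = cosine_similarity(query, matrix[:n]).ravel()
    order = np.argsort(-scores, kind="stable")
    return order, scores


# Toy data (assumed, not taken from the dataset)
titles = [
    "aspiring human resources professional",
    "english teacher",
    "human resources generalist",
    "seeking human resources internship",
    "business intelligence analyst",
]
keywords = "aspiring human resources"

order_before, _ = rerank_with_stars(titles, keywords)
order_after, _ = rerank_with_stars(titles, keywords, starred_idx=[2])
print("before starring:", order_before.tolist())
print("after starring candidate 2:", order_after.tolist())
```

Averaging keeps the original keywords in play, so a single star shifts the ranking toward the starred profile without discarding the search intent. Weighting the starred titles more heavily would be a natural variation.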

Data Description:

The data comes from our sourcing efforts. We removed any field that could directly reveal personal details and gave a unique identifier for each candidate.

Attributes:
id : unique identifier for candidate (numeric)

job_title : job title for candidate (text)

location : geographical location for candidate (text)

connection : number of connections the candidate has; 500+ means over 500 (text)

Output (desired target):
fit : how well the candidate fits the role (numeric, probability between 0 and 1)

Keywords: “Aspiring human resources” or “seeking human resources”

Download Data:

https://docs.google.com/spreadsheets/d/117X6i53dKiO7w6kuA1g1TpdTlv1173h_dPlJt5cNNMU/edit?usp=sharing

Goal(s):

Predict how fit the candidate is based on their available information (variable fit)

Success Metric(s):

Rank candidates based on a fitness score.

Re-rank candidates when a candidate is starred.

Bonus(es):

We are interested in a robust algorithm. Tell us how your solution works and show us how your ranking improves with each starring action.

How can we filter out candidates who should not be in this list in the first place?

Can we determine a cut-off point that would work for other roles without losing high potential candidates?

Do you have any ideas we should explore to automate this procedure further and prevent human bias?

In [15]:
# Load and import necessary modules
import pandas as pd
import numpy as np
import pickle
import glob
import re
import requests
import json
import matplotlib.pyplot as plt
import seaborn as sns
import os

In [16]:
import spacy
nlp = spacy.load("en_core_web_trf")

In [21]:
# Loading the data
talents_df =  pd.read_excel(os.path.join('aFBAMWTp3ZsqgUXd','potential-talents.xlsx'))

In [74]:
talents_df.head()

Unnamed: 0,id,job_title,location,connection,fit
0,1,c.t. bauer college business graduate magna cum...,"Houston, Texas",85,
1,2,english teacher epik english program korea,Kanada,500+,
2,3,human professional,"Raleigh-Durham, North Carolina Area",44,
3,4,development coordinator ryan,"Denton, Texas",500+,
4,5,advisory board member celal bayar university,"İzmir, Türkiye",500+,


In [67]:
from spacy.lang.en.stop_words import STOP_WORDS

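def tokenize(text):
    """Return lowercased lemmas of the nouns and proper nouns in `text`."""
    # Guard against None, NaN and empty strings before calling spaCy
    if isinstance(text, str) and text.strip():
        return [w.lemma_.lower() for w in nlp(text) if w.pos_ in ('NOUN', 'PROPN')]
    return []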
 
stop_words_tokenized = set(tokenize(' '.join(sorted(STOP_WORDS))))


In [71]:
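def job_similarity_kw_js(job_title, key_words):
    """Jaccard similarity between the token sets of a job title and the keywords."""
    job_title_token = set(tokenize(job_title))
    key_words_token = set(tokenize(key_words))
    union = job_title_token | key_words_token
    # Avoid division by zero when neither text yields any noun tokens
    if not union:
        return 0.0
    return len(job_title_token & key_words_token) / len(union)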
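
The Jaccard score is the number of shared tokens divided by the number of distinct tokens across both texts. As a minimal sketch of the arithmetic, the example below swaps the spaCy tokenizer for plain whitespace splitting (an assumption made only to keep the example self-contained), and the function name `jaccard` is hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two token collections; 0.0 if both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Whitespace tokens stand in for the spaCy lemmas used in the notebook
title_tokens = "director human resources".split()
keyword_tokens = "aspiring human resources".split()

# 2 shared tokens (human, resources) out of 4 distinct tokens
print(jaccard(title_tokens, keyword_tokens))  # 0.5
```

Because the score depends only on set membership, repeated words and word order have no effect, which is one reason it can disagree with the TF-IDF cosine score.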

In [72]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
def job_similarity_kw_cs(job_title, key_words):
    job_title_tokenized = ' '.join(tokenize(job_title))
    key_words_tokenized = ' '.join(tokenize(key_words))
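    # Initialize the TF-IDF vectorizer (settings assumed; same as TextSimilarityTransformer)
    vectorizer = TfidfVectorizer(min_df=0.1, max_features=100,
                                 stop_words=list(stop_words_tokenized), ngram_range=(1, 4))
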
    # Fit and transform the texts
    tfidf_matrix = vectorizer.fit_transform([job_title_tokenized, key_words_tokenized])
    similarity = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])
    return(similarity[0][0])

In [73]:
job_title = "Director Human Resources  at EY"
key_words = "Aspiring Human Resources"
print(job_similarity_kw_cs(job_title, key_words))
print(job_similarity_kw_js(job_title, key_words))

0.10371551133313006
0.2
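
The cosine score is sensitive to the n-gram range when only two short texts are vectorized: every n-gram that occurs in one text but not the other adds weight to the vector norms without adding to the dot product. The sketch below illustrates this with `ngram_range=(1, 1)` versus `ngram_range=(1, 4)`, the range the transformer further down uses. The pre-tokenized strings and the helper name `tfidf_cosine` are assumptions; the notebook does not show the actual spaCy output for these inputs.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Assumed, already-lemmatized inputs (not the notebook's actual spaCy output)
title = "director human resources ey"
keywords = "human resource"


def tfidf_cosine(a, b, ngram_range):
    """Cosine similarity of two texts under a TF-IDF model fitted on just those two texts."""
    vec = TfidfVectorizer(ngram_range=ngram_range)
    m = vec.fit_transform([a, b])
    return float(cosine_similarity(m[0:1], m[1:2])[0, 0])


unigram_score = tfidf_cosine(title, keywords, (1, 1))
ngram_score = tfidf_cosine(title, keywords, (1, 4))
print(f"unigrams only: {unigram_score:.4f}")
print(f"1- to 4-grams: {ngram_score:.4f}")
```

With these inputs the only shared feature is the unigram `human`, so widening the range can only lower the score. Fitting the vectorizer on the full set of job titles, as the transformer class does, gives more stable IDF weights than fitting on a single pair.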


In [88]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from spacy.lang.en.stop_words import STOP_WORDS
def spacy_tokenize_lemmatize(text):
    """
    Tokenizes and lemmatizes text using spaCy.
    """
    # Guard against None, NaN and empty strings before calling spaCy
    if isinstance(text, str) and text.strip():
        return [w.lemma_.lower() for w in nlp(text) if w.pos_ in ('NOUN', 'PROPN')]
    return []
    
class TextSimilarityTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, search_words, spacy_nlp):
        """
        Custom transformer to calculate TF-IDF cosine similarity 
        between a reference text and a series of input texts.
        """
        self.search_words = search_words
        self.spacy_nlp = spacy_nlp
        stop_words_tokenized = set(spacy_tokenize_lemmatize(' '.join(sorted(STOP_WORDS))))
        self.vectorizer =  TfidfVectorizer(min_df =0.1, max_features=100, 
                                    stop_words= list(stop_words_tokenized), ngram_range = (1,4))
        

    def spacy_tokenize_lemmatize(self, text):
        """
        Tokenizes and lemmatizes text using spaCy.
        """
        # Use the pipeline passed to the constructor rather than the global `nlp`
        if isinstance(text, str) and text.strip():
            return [w.lemma_.lower() for w in self.spacy_nlp(text) if w.pos_ in ('NOUN', 'PROPN')]
        return []
    
        
    def fit(self, X, y=None):
        # Fit the vectorizer on the reference text and the input texts
        lemmatized_search_words = ' '.join(self.spacy_tokenize_lemmatize(self.search_words))
        
        # Fit the vectorizer on the input texts and the lemmatized reference text
        self.vectorizer.fit(X.tolist() + [lemmatized_search_words])
        return self

    
    def transform(self, X):
        # Transform the input texts to TF-IDF vectors
        tfidf_matrix = self.vectorizer.transform(X.tolist())
        
        # Transform the reference text to a TF-IDF vector
        ref_vector = self.vectorizer.transform([' '.join(self.spacy_tokenize_lemmatize(self.search_words))])
        
        # Calculate cosine similarity between the reference text and each input text
        similarity_scores = cosine_similarity(ref_vector, tfidf_matrix)
        
        return similarity_scores.reshape(-1)

In [89]:
# Search keywords for the target role
search_words = "Aspiring Human Resources"

# Initialize and apply the transformer
similarity_transformer = TextSimilarityTransformer(search_words =search_words,  spacy_nlp=nlp)
talents_df['fit']  = similarity_transformer.fit_transform(talents_df['job_title'])

In [90]:
talents_df

Unnamed: 0,id,job_title,location,connection,fit
0,1,c.t. bauer college business graduate magna cum...,"Houston, Texas",85,0.221268
1,2,english teacher epik english program korea,Kanada,500+,0.000000
2,3,human professional,"Raleigh-Durham, North Carolina Area",44,0.502047
3,4,development coordinator ryan,"Denton, Texas",500+,0.000000
4,5,advisory board member celal bayar university,"İzmir, Türkiye",500+,0.000000
...,...,...,...,...,...
99,100,human manager may entry level human position s...,"Cape Girardeau, Missouri",103,1.000000
100,101,human generalist loparex,"Raleigh-Durham, North Carolina Area",500+,0.457441
101,102,business intelligence,Greater New York City Area,49,0.000000
102,103,success,Greater Los Angeles Area,500+,0.000000
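
With the `fit` column populated, ranking reduces to sorting by score. The sketch below uses a small stand-in frame built from the `id` and `fit` values displayed above. The `rank` column and the `fit > 0` filter are assumptions, shown as one simple way to drop candidates whose titles share no term with the keywords.

```python
import pandas as pd

# Stand-in frame using id/fit values from the output above
toy = pd.DataFrame({
    "id": [1, 2, 3, 100, 101],
    "fit": [0.221268, 0.0, 0.502047, 1.0, 0.457441],
})

# Sort from best to worst fit; a stable sort keeps the original order among ties
ranked = toy.sort_values("fit", ascending=False, kind="stable").reset_index(drop=True)
ranked["rank"] = ranked.index + 1

# Hypothetical filter: keep only candidates with a non-zero score
shortlist = ranked[ranked["fit"] > 0]
print(shortlist)
```

A zero score here means no overlap with the keyword vocabulary at all, so it is a conservative cut-off. Any higher threshold would need to be validated against reviewed candidates.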
