## Project 3: Potential Talents

Background:

As a talent sourcing and management company, we are interested in finding talented individuals and sourcing them to technology companies. Finding talented candidates is not easy, for several reasons. First, one needs to understand the role very well in order to fill it, which requires understanding the client’s needs and what they are looking for in a potential candidate. Second, one needs to understand what makes a candidate shine for the role we are searching for. Third, knowing where to find talented individuals is a challenge of its own.

The nature of our job requires a lot of human labor and is full of manual operations. To automate this process, we want to build a better approach that saves us time and helps us spot potential candidates who could fit the roles we are searching for. Beyond that, for a specific role we want to fill, we are interested in developing a machine-learning-powered pipeline that can spot talented individuals and rank them based on their fitness.

We are currently sourcing a few candidates semi-automatically, so the sourcing part is not a concern at this time. Instead, we expect to first determine the best matching candidates based on how fit they are for a given role. We generally make these searches based on keywords such as “full-stack software engineer”, “engineering manager” or “aspiring human resources”, depending on the role we are trying to fill. These keywords might change, and you can expect that specific keywords will be provided to you.

Assuming that we are able to list and rank fitting candidates, we then employ a review procedure in which each candidate is inspected manually to determine how good a fit they are. At the end of this manual review, we might choose not the first candidate in the list but, say, the 7th. If that happens, we are interested in being able to re-rank the previous list based on this information. This supervisory signal is supplied by starring the 7th candidate in the list. Starring a candidate sets them as an ideal candidate for the given role. We then expect the list to be re-ranked each time a candidate is starred.

Data Description:

The data comes from our sourcing efforts. We removed any field that could directly reveal personal details and gave a unique identifier for each candidate.

Attributes:
id : unique identifier for candidate (numeric)

job_title : job title for candidate (text)

location : geographical location for candidate (text)

connections: number of connections candidate has, 500+ means over 500 (text)

Output (desired target):
fit - how fit the candidate is for the role? (numeric, probability between 0-1)

Keywords: “Aspiring human resources” or “seeking human resources”

Download Data:

https://docs.google.com/spreadsheets/d/117X6i53dKiO7w6kuA1g1TpdTlv1173h_dPlJt5cNNMU/edit?usp=sharing

Goal(s):

Predict how fit the candidate is based on their available information (variable fit)

Success Metric(s):

Rank candidates based on a fitness score.

Re-rank candidates when a candidate is starred.

Bonus(es):

We are interested in a robust algorithm, tell us how your solution works and show us how your ranking gets better with each starring action.

How can we filter out candidates which in the first place should not be in this list?

Can we determine a cut-off point that would work for other roles without losing high potential candidates?

Do you have any ideas that we should explore so that we can even automate this procedure to prevent human bias?

### 1. Load the necessary packages and the data

In [1]:
# Load and import necessary modules
import pandas as pd
import numpy as np
import pickle
import glob
import re
import requests
import json
import matplotlib.pyplot as plt
import seaborn as sns
import os

In [2]:
# We use spaCy to tokenize and lemmatize our keywords and job titles
import spacy
nlp = spacy.load("en_core_web_trf")

In [3]:
# Loading the data
talents_df =  pd.read_excel('potential-talents.xlsx')
print(talents_df.shape)
print(talents_df.head().to_markdown())

(104, 5)
|    |   id | job_title                                                                                                | location                            | connection   |   fit |
|---:|-----:|:---------------------------------------------------------------------------------------------------------|:------------------------------------|:-------------|------:|
|  0 |    1 | 2019 C.T. Bauer College of Business Graduate (Magna Cum Laude) and aspiring Human Resources professional | Houston, Texas                      | 85           |   nan |
|  1 |    2 | Native English Teacher at EPIK (English Program in Korea)                                                | Kanada                              | 500+         |   nan |
|  2 |    3 | Aspiring Human Resources Professional                                                                    | Raleigh-Durham, North Carolina Area | 44           |   nan |
|  3 |    4 | People Development Coordinator at Ryan                             

The data frame clearly contains a few duplicate rows, so let's remove them.

In [4]:
# Clean the duplicates in the data
talents_df_clean = talents_df.drop(columns= 'id').drop_duplicates().reset_index(drop= True)
print(talents_df_clean.shape)
print(talents_df_clean.head().to_markdown())

(53, 4)
|    | job_title                                                                                                | location                            | connection   |   fit |
|---:|:---------------------------------------------------------------------------------------------------------|:------------------------------------|:-------------|------:|
|  0 | 2019 C.T. Bauer College of Business Graduate (Magna Cum Laude) and aspiring Human Resources professional | Houston, Texas                      | 85           |   nan |
|  1 | Native English Teacher at EPIK (English Program in Korea)                                                | Kanada                              | 500+         |   nan |
|  2 | Aspiring Human Resources Professional                                                                    | Raleigh-Durham, North Carolina Area | 44           |   nan |
|  3 | People Development Coordinator at Ryan                                                                   | Den

### 2. Modeling
We plan to use 3 ways to build the model
1. Look for similar words between the search term and the candidate's job title extracted from their profile. We can additionally use the location and the number of connections to improve the ranking. Here we implement this by using spaCy to tokenize and lemmatize the words; we can then either compare the resulting token sets with Jaccard similarity, or vectorize the texts with a TF-IDF vectorizer and use cosine similarity to score the match between the search term and the candidate's job title.

Furthermore, we can hard-code the importance of the number of connections by encoding them as ordinal classes:

0-99: 'Low'; 100-199: 'Average'; 200-499: 'High'; 500+: 'Very-high'
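
The binning can be sketched with the standard library alone. The helper name `connection_class` and the `LABELS` list are assumptions made for illustration; the half-open bin edges mirror the ranges tested in `categorize_connections` further down.

```python
from bisect import bisect_right

# Assumed labels, ordered from fewest to most connections
LABELS = ["Low", "Average", "High", "Very-high"]
# Upper edges of the first three bins: [0, 100), [100, 200), [200, 500)
EDGES = [100, 200, 500]

def connection_class(raw):
    """Map a raw connection value such as '85' or '500+ ' to an ordinal class 0-3."""
    text = str(raw).strip()
    if text.endswith("+"):
        # '500+' means "more than 500", which is the top class
        return len(EDGES)
    # bisect_right counts how many bin edges are <= the value
    return bisect_right(EDGES, int(text))

for raw in ["85", "100", "390", "500+ "]:
    code = connection_class(raw)
    print(raw.strip(), "->", code, LABELS[code])
```

Stripping whitespace before the comparison keeps the mapping stable whether the spreadsheet stores `500+` with or without a trailing space.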


As for location, we can extract the city, state and country information from the 'location' column and match if it is the same location as the job search keyword and assign a score accordingly.
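
The location rule can be sketched as follows, assuming the geocoder has already turned each location string into a dictionary with `city`, `state` and `country` keys. The dictionaries below are hand-written stand-ins, not real geocoding results, and `location_score` is a hypothetical helper name.

```python
def location_score(candidate, target):
    """Return 0-3: country match = 1, country and state = 2, all three = 3."""
    score = 0
    for level in ("country", "state", "city"):
        value = candidate.get(level)
        # Stop at the first level that is missing or differs from the target
        if not value or value != target.get(level):
            break
        score += 1
    return score

# Hypothetical geocoding results for a search that mentions "California"
target = {"city": None, "state": "California", "country": "United States"}
candidates = {
    "San Jose, California": {"city": "San Jose", "state": "California", "country": "United States"},
    "Houston, Texas": {"city": "Houston", "state": "Texas", "country": "United States"},
    "Kanada": {"city": None, "state": None, "country": "Canada"},
}

for name, info in candidates.items():
    print(name, "->", location_score(info, target))
```

Because the levels are checked from coarse to fine, a matching city name in a different state or country never contributes to the score.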

#### 2.0 Just looking at similar words between the "search term keywords" and the "job profile" and quantifying similarity through Jaccard similarity

In [5]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer

class TextSimilarityTransformerJS(BaseEstimator, TransformerMixin):
    def __init__(self, search_words, spacy_nlp):
        """
        Custom transformer to calculate the Jaccard similarity
        between a reference text and a series of input texts.
        """
        self.search_words = search_words
        self.spacy_nlp = spacy_nlp

    def spacy_tokenize_lemmatize(self, text):
        """
        Tokenizes and lemmatizes text using spaCy, keeping only nouns and proper nouns.
        """
        # Guard against None, NaN and empty strings before calling spaCy
        if isinstance(text, str) and text:
            return [w.lemma_.lower() for w in self.spacy_nlp(text) if w.pos_ in ['NOUN', 'PROPN']]
        return []

    # Jaccard similarity: size of the intersection divided by size of the union of the lemma sets
    def job_similarity_kw_js(self, job_title, search_words):
        job_title_token = set(self.spacy_tokenize_lemmatize(job_title))
        search_words_token = set(self.spacy_tokenize_lemmatize(search_words))
        union = job_title_token.union(search_words_token)
        # Avoid a division by zero when neither text yields any noun lemma
        if not union:
            return 0.0
        return len(job_title_token.intersection(search_words_token)) / len(union)

    def fit(self, X, y=None):
        # Nothing to learn: the score depends only on the two token sets
        return self

    def transform(self, X):
        # Score every job title in the Series against the search keywords
        similarity_scores = X.apply(self.job_similarity_kw_js, search_words=self.search_words)
        return similarity_scores
    
# Sample DataFrame and reference sentence
search_words = "Aspiring Human Resources"

# Initialize and apply the transformer
similarity_transformer = TextSimilarityTransformerJS(search_words =search_words,  spacy_nlp=nlp)
talents_df_clean['fit']  = similarity_transformer.fit_transform(talents_df_clean['job_title'])

# Printing out the top 10 candidates based on the highest fit score
print(talents_df_clean.sort_values(by =['fit'], ascending=False, inplace = False).head(10).to_markdown())

|    | job_title                                                              | location                            |   connection |      fit |
|---:|:-----------------------------------------------------------------------|:------------------------------------|-------------:|---------:|
| 45 | Aspiring Human Resources Professional                                  | Kokomo, Indiana Area                |           71 | 0.666667 |
|  2 | Aspiring Human Resources Professional                                  | Raleigh-Durham, North Carolina Area |           44 | 0.666667 |
| 13 | Seeking Human Resources Opportunities                                  | Chicago, Illinois                   |          390 | 0.666667 |
|  5 | Aspiring Human Resources Specialist                                    | Greater New York City Area          |            1 | 0.666667 |
| 42 | Seeking Human  Resources Opportunities. Open to travel and relocation. | Amerika Birleşik Devletleri         |          415 | 0.4
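
The score of 0.666667 in the top rows is consistent with the following hand computation. It assumes the noun filter keeps `human` and `resource` from the search term and `human`, `resource` and `professional` from the title; the actual lemmas depend on the spaCy model, so these token sets are an assumption.

```python
def jaccard(a, b):
    """Jaccard similarity of two token collections: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    union = a | b
    # Two empty sets have no tokens to compare, so score them as 0
    return len(a & b) / len(union) if union else 0.0

# Assumed lemma sets after the NOUN/PROPN filter
search = {"human", "resource"}
title = {"human", "resource", "professional"}

# Two shared tokens out of three distinct tokens
print(round(jaccard(search, title), 6))  # 0.666667
```

Every extra noun in a title enlarges the union, which is why longer titles that mention the same keywords score lower.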

#### 2.1 Using N-gram models with a TF-IDF vectorizer and cosine similarity
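
To make the idea concrete before the scikit-learn implementation, here is a small from-scratch sketch of TF-IDF weighting followed by cosine similarity. It is a simplification: it uses whitespace tokens and unigrams only, the three titles are made up, and the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` is chosen to resemble scikit-learn's default rather than to reproduce the numbers in this notebook.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document using a smoothed IDF."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    return [{t: c * idf[t] for t, c in Counter(tokens).items()} for tokens in tokenized]

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Made-up job titles and a search query
titles = [
    "aspiring human resources professional",
    "native english teacher",
    "human resources generalist",
]
query = "aspiring human resources"

# Fit on the titles plus the query, as the transformer in this section does
*title_vecs, query_vec = tfidf_vectors(titles + [query])
scores = [cosine(query_vec, vec) for vec in title_vecs]
for title, score in zip(titles, scores):
    print(f"{score:.3f}  {title}")
```

A title that shares no term with the query gets a score of exactly 0, and a title that shares rarer terms is rewarded more than one that shares only common ones.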

In [6]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from spacy.lang.en.stop_words import STOP_WORDS

    
class TextSimilarityTransformerCS(BaseEstimator, TransformerMixin):
    def __init__(self, search_words, spacy_nlp):
        """
        Custom transformer to calculate TF-IDF cosine similarity 
        between a reference text and a series of input texts.
        """
        self.search_words = search_words
        self.spacy_nlp = spacy_nlp
        stop_words_tokenized = set(self.spacy_tokenize_lemmatize(' '.join(sorted(STOP_WORDS))))
        self.vectorizer =  TfidfVectorizer(min_df =0.1, max_features=100, 
                                    stop_words= list(stop_words_tokenized), ngram_range = (1,4))
        

    def spacy_tokenize_lemmatize(self, text):
        """
        Tokenizes and lemmatizes text using spaCy.
        """
        # Guard against None, NaN and empty strings before calling spaCy
        if isinstance(text, str) and text:
            return [w.lemma_.lower() for w in self.spacy_nlp(text) if w.pos_ in ['NOUN', 'PROPN']]
        else:
            return []
    
        
    def fit(self, X, y=None):
        # Fit the vectorizer on the reference text and the input texts
        lemmatized_search_words = ' '.join(self.spacy_tokenize_lemmatize(self.search_words))
        
        # Fit the vectorizer on the input texts and the lemmatized reference text
        self.vectorizer.fit(X.tolist() + [lemmatized_search_words])
        return self

    
    def transform(self, X):
        # Transform the input texts to TF-IDF vectors
        tfidf_matrix = self.vectorizer.transform(X.tolist())
        
        # Transform the reference text to a TF-IDF vector
        ref_vector = self.vectorizer.transform([' '.join(self.spacy_tokenize_lemmatize(self.search_words))])
        
        # Calculate cosine similarity between the reference text and each input text
        similarity_scores = cosine_similarity(ref_vector, tfidf_matrix)
        
        return similarity_scores.reshape(-1)
    
# Sample DataFrame and reference sentence
search_words = "Aspiring Human Resources"

# Initialize and apply the transformer
similarity_transformer = TextSimilarityTransformerCS(search_words =search_words,  spacy_nlp=nlp)
talents_df_clean['fit']  = similarity_transformer.fit_transform(talents_df_clean['job_title'])

# Printing out the top 10 candidates based on the highest fit score
print(talents_df_clean.sort_values(by =['fit'], ascending=False, inplace = False).head(10).to_markdown())

|    | job_title                                                        | location                            | connection   |      fit |
|---:|:-----------------------------------------------------------------|:------------------------------------|:-------------|---------:|
| 25 | Human Resources|                                                 | Dallas/Fort Worth Area              | 409          | 0.569669 |
|    | Conflict Management|                                             |                                     |              |          |
|    | Policies & Procedures|Talent Management|Benefits & Compensation  |                                     |              |          |
| 36 | Human Resources Management Major                                 | Milpitas, California                | 18           | 0.569669 |
| 17 | Director of Human Resources North America, Groupe Beneteau       | Greater Grand Rapids, Michigan Area | 500+         | 0.569669 |
| 26 | Human Resources Generalist 

#### 2.2 Extended N-gram model. Now let us add information from the location as well as the connections to estimate fit

In [7]:
import os
from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv('config/.env'))
GeocodingKey= os.getenv('GeocodingKey')

In [34]:
from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from opencage.geocoder import OpenCageGeocode
from sklearn.metrics.pairwise import cosine_similarity

# This custom transformer calculates the cosine similarity between the TF-IDF vectors of the job title and the search key terms
class CosineSimilarityTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, search_words, vectorizer =TfidfVectorizer()) :
        self.search_words= search_words
        self.vectorizer = vectorizer

    def fit(self, X, y=None):
        # Ensure that X is treated as a Series and convert to list if necessary
        if isinstance(X, pd.DataFrame):
            X = X.iloc[:, 0]
        # Fit the vectorizer on the reference sentence and the document corpus
        self.vectorizer.fit(X.tolist() + [self.search_words])
        return self

    def transform(self, X):
        # Ensure that X is treated as a Series and convert to list if necessary
        if isinstance(X, pd.DataFrame):
            X = X.iloc[:, 0]
        # Transform the documents and the reference sentence into TF-IDF vectors
        X_tfidf = self.vectorizer.transform(X.tolist())
        sw_tfidf = self.vectorizer.transform([self.search_words])

        # Calculate the cosine similarity of each document with the reference sentence
        similarity_scores = cosine_similarity(X_tfidf, sw_tfidf)
        
        # Return similarity scores as a 1D array of scores
        return similarity_scores.reshape(-1, 1)
    
# Using the OpenCage Geocoding API to get location information
geocoder = OpenCageGeocode(GeocodingKey)

# Finding and extracting location information from the keywords
def get_location_info(location_name):
    # Assuming `geocoder` is already imported and configured
    try:
        results = geocoder.geocode(location_name, no_annotations='1')
        keys = ["city", "state", "country"]
        location = dict.fromkeys(keys)
        if results:
            # Usually, the first result is the most relevant one
            top_result = results[0]['components']
            for key in keys:
                location[key] = top_result.get(key, None)
    except Exception as e:
        print(f"Error geocoding {location_name}: {e}")
        location = dict.fromkeys(["city", "state", "country"], None)
    return location


# Location score: 1 for the same country, 2 if the state also matches, 3 if the city also matches, 0 otherwise
def location_scores(location_names, search_words):

    # Place names (GPE entities) mentioned in the search keywords
    loc_search_kw = [ent.text for ent in nlp(search_words).ents if ent.label_ == 'GPE']
    if loc_search_kw:
        # All place names found are geocoded together as a single query
        kw_loc_dict = get_location_info(', '.join(loc_search_kw))
        scores = []
        for location_name in location_names:
            location_dict = get_location_info(location_name)

            score = 0
            # A state only counts if the country matches, and a city only if the state matches
            if (location_dict['country'] and (location_dict['country'] == kw_loc_dict['country'])):
                score += 1
                if (location_dict['state'] and (location_dict['state'] == kw_loc_dict['state'])):
                    score += 1
                    if (location_dict['city'] and (location_dict['city'] == kw_loc_dict['city'])):
                        score += 1
            scores.append(score)
        return np.array(scores).reshape(-1, 1)
    else:
        return np.zeros(len(location_names)).reshape(-1, 1)


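# Encoding LinkedIn connections as ordinal classes 0-3
def categorize_connections(num):
    # Strip whitespace so that both '500+' and '500+ ' are recognized
    num = str(num).strip()
    if num == '500+':
        return 3
    elif 0 <= int(num) < 100:
        return 0
    elif 100 <= int(num) < 200:
        return 1
    elif 200 <= int(num) < 500:
        return 2
    else:
        return None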



# Now lets run the pipeline and get the necessary info
search_words = "Human Resources in California"
preprocessor = ColumnTransformer(
    transformers=[

        ('conn', FunctionTransformer(np.vectorize(categorize_connections),  validate=False), ['connection']),
        ('loc', FunctionTransformer(location_scores, validate=False, kw_args={'search_words': search_words}) ,  'location'),
        ('title', CosineSimilarityTransformer(search_words = search_words, vectorizer = TfidfVectorizer(min_df=0.1, max_features=100, ngram_range=(1, 4))), ['job_title'])
    ] 
)

# Weights of each score. Let us assign an importance score of 0.2, 0.5 and 0.7 to the information contained in the number of connections, location and title respectively
# The connection class and the location score both range from 0 to 3, while the TF-IDF cosine similarity is between 0 and 1
weights = np.array([0.2, 0.5, 0.7 ])
talents_df_clean['fit'] = (preprocessor.fit_transform(talents_df_clean)*weights).sum(axis=1)

# Printing out the top 10 candidates based on the highest fit score
print(talents_df_clean.sort_values(by =['fit'], ascending=False, inplace = False).head(10).to_markdown())

|    | job_title                                                                                                             | location                    | connection   |     fit |
|---:|:----------------------------------------------------------------------------------------------------------------------|:----------------------------|:-------------|--------:|
| 23 | Nortia Staffing is seeking Human Resources, Payroll & Administrative Professionals!!  (408) 709-2621                  | San Jose, California        | 500+         | 1.92337 |
|  7 | HR Senior Specialist                                                                                                  | San Francisco Bay Area      | 500+         | 1.6     |
| 26 | Human Resources Generalist at Schwan's                                                                                | Amerika Birleşik Devletleri | 500+         | 1.46955 |
| 11 | Human Resources Coordinator at InterContinental Buckhead Atlanta                   
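
The brief asks for `fit` as a value between 0 and 1, while the weighted sum above can exceed 1 because the connection class and the location score each range from 0 to 3. One possible adjustment, offered as a sketch rather than as the method used in this notebook, is to rescale each component to [0, 1] and divide by the sum of the weights. The function name `combined_fit` is hypothetical, and the title similarity is assumed to lie in [0, 1], as it does for TF-IDF vectors.

```python
def combined_fit(conn_class, loc_score, title_sim, weights=(0.2, 0.5, 0.7)):
    """Weighted average of three component scores, each rescaled to [0, 1]."""
    # Connection class and location score both range from 0 to 3
    components = (conn_class / 3, loc_score / 3, title_sim)
    total = sum(weights)
    # Dividing by the weight total keeps the result inside [0, 1]
    return sum(w * c for w, c in zip(weights, components)) / total

# Hypothetical candidates: (connection class, location score, title similarity)
examples = {
    "best possible": (3, 3, 1.0),
    "well connected, same state, no title match": (3, 2, 0.0),
    "no signal at all": (0, 0, 0.0),
}
for label, parts in examples.items():
    print(f"{combined_fit(*parts):.3f}  {label}")
```

The ranking order is unchanged by dividing by the weight total; only the rescaling of the 0 to 3 components changes how much each one contributes relative to the title similarity.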

#### 2.3 Using word embeddings from Word2Vec and calculating the similarity score using cosine similarity

In [19]:
from sklearn.base import TransformerMixin, BaseEstimator
from gensim.models import KeyedVectors
from gensim.utils import simple_preprocess
import itertools
class CosineSimilarityTransformerW2V(TransformerMixin, BaseEstimator):
    def __init__(self, model_path, search_kw) :
        """
        search_kw: str, the search keywords against which to compute similarity
        model_path: str, path to the pre-trained word2vec model
        """
        self.model_path = model_path
        self.search_kw = search_kw
        self.model = None
        self.search_words = None

    def fit(self, X, y=None):
        # Load the word2vec model
        self.model = KeyedVectors.load_word2vec_format(self.model_path, binary=True)
        # Tokenize the search keywords, keeping only in-vocabulary words
        self.search_words= self.get_words(self.search_kw)
        return self

    def transform(self, X):
        # Compute cosine similarity between each sentence in X and the search keywords

        return np.array([self.pairwise_cs(self.get_words(sentence), self.search_words) for _, sentence in X.items()]).reshape(-1, 1)
    

    def get_words(self, sentence):
        # Convert sentence to tokens, ignoring out-of-vocabulary words
        # key_to_index is a dict, so each lookup avoids rebuilding and scanning a list of the whole vocabulary
        words = [word for word in simple_preprocess(sentence) if word in self.model.key_to_index]
        return words

    def cosine_similarity(self, vec1, vec2):
        # Compute cosine similarity between two vectors
        return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
    
    def pairwise_cs(self, sentence1, sentence2):
        if (sentence1 and sentence2): # So that both are not empty lists
            cosine_vals = []
            for  word1, word2 in list(itertools.product(sentence1, sentence2)):
                cosine_vals.append(self.cosine_similarity(self.model[word1], self.model[word2]))
            cosine_vals = np.array(cosine_vals)
            # Sentence similarity = mean of the word-pair cosine values at or above the 80th percentile (the top 20%)
            return np.mean(cosine_vals[cosine_vals >= np.percentile( cosine_vals, 80) ])
        else:
            return 0

# Now lets run the pipeline and get the necessary info
search_words = "Human Resources in California"
preprocessor = ColumnTransformer(
    transformers=[

        ('conn', FunctionTransformer(np.vectorize(categorize_connections),  validate=False), ['connection']),
        ('loc', FunctionTransformer(location_scores, validate=False, kw_args={'search_words': search_words}) ,  'location'),
        ('title', CosineSimilarityTransformerW2V(search_kw = search_words, model_path = '../data/GoogleNews-vectors-negative300.bin'), 'job_title')
    ] 
)

# Weights of each score. Let us assign an importance score of 0.2, 0.5 and 0.7 to the information contained in the number of connections, location and title respectively
# The connection class and the location score both range from 0 to 3, while the cosine similarity is between -1 and 1
weights = np.array([0.2, 0.5, 0.7])
talents_df_clean['fit'] = (preprocessor.fit_transform(talents_df_clean)*weights).sum(axis=1)

# Printing out the top 10 candidates based on the highest fit score
print(talents_df_clean.sort_values(by =['fit'], ascending=False, inplace = False).head(10).to_markdown())

|    | job_title                                                                                            | location                    | connection   |     fit |
|---:|:-----------------------------------------------------------------------------------------------------|:----------------------------|:-------------|--------:|
| 23 | Nortia Staffing is seeking Human Resources, Payroll & Administrative Professionals!!  (408) 709-2621 | San Jose, California        | 500+         | 1.94212 |
|  7 | HR Senior Specialist                                                                                 | San Francisco Bay Area      | 500+         | 1.71791 |
| 31 | HR Manager at Endemol Shine North America                                                            | Los Angeles, California     | 268          | 1.63733 |
| 11 | Human Resources Coordinator at InterContinental Buckhead Atlanta                                     | Atlanta, Georgia            | 500+         | 1.62132 |
| 26 | Hum
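
The sentence-level score used in `pairwise_cs` can be illustrated without loading the Word2Vec model. The sketch below uses made-up 2-D vectors in place of real embeddings, and the `percentile` helper is a hand-written linear-interpolation version intended to behave like the default of `np.percentile`; all names are assumptions for illustration.

```python
import math
from itertools import product

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def percentile(values, q):
    """Percentile with linear interpolation between neighbouring ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def top_pair_similarity(words1, words2, vectors, q=80):
    """Mean of the word-pair cosines at or above the q-th percentile."""
    if not words1 or not words2:
        return 0.0
    sims = [cosine(vectors[a], vectors[b]) for a, b in product(words1, words2)]
    cutoff = percentile(sims, q)
    top = [s for s in sims if s >= cutoff]
    return sum(top) / len(top)

# Toy 2-D "embeddings", invented for this example
vectors = {
    "hr": (1.0, 0.1),
    "human": (0.9, 0.2),
    "resources": (0.8, 0.3),
    "teacher": (0.1, 1.0),
}

title, search = ["hr", "teacher"], ["human", "resources"]
all_pairs = [cosine(vectors[a], vectors[b]) for a, b in product(title, search)]
print("mean of all pairs:", round(sum(all_pairs) / len(all_pairs), 3))
print("top-percentile mean:", round(top_pair_similarity(title, search, vectors), 3))
```

Keeping only the strongest pairs means a single closely related word can carry a title, whereas a plain mean would be pulled down by unrelated words such as `teacher`.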

#### 2.4 Using sentence level representation from the [CLS] token from BERT and calculating the similarity score using cosine similarity

In [23]:
from sklearn.base import TransformerMixin, BaseEstimator
from transformers import BertModel, BertTokenizer
import torch

class CosineSimilarityTransformerBERT(TransformerMixin, BaseEstimator):
    def __init__(self, search_kw, model_name='bert-base-uncased') :
        """
        search_kw: str, the search key words against which to compute similarity
        model_name: str, model name of the BERT model to use
        """
        self.model_name = model_name
        self.tokenizer = BertTokenizer.from_pretrained(model_name)
        self.model = BertModel.from_pretrained(model_name)
        self.search_kw = search_kw
        self.model.eval()  # Set model to evaluation mode
        self.ref_vec =  None

    def fit(self, X, y=None):
        inputs = self.tokenizer(self.search_kw , return_tensors='pt', padding=True, truncation=True)
        with torch.no_grad():
            self.ref_vec = self.model(**inputs).last_hidden_state[:, 0, :].squeeze()  # CLS token representation
        return self

    def transform(self, X):
        # Transform each sentence in the DataFrame column to cosine similarity score
        cos_sim_scores = []
        for  _, sentence in X.items():
            inputs = self.tokenizer(sentence, return_tensors='pt', padding=True, truncation=True)
            with torch.no_grad():
                sentence_vec = self.model(**inputs).last_hidden_state[:, 0, :].squeeze()
            cos_sim = torch.nn.functional.cosine_similarity(self.ref_vec, sentence_vec, dim=0)
            cos_sim_scores.append(cos_sim.item())
        return np.array(cos_sim_scores).reshape(-1, 1)
    

# Now lets run the pipeline and get the necessary info
search_words = "Human Resources in California"
preprocessor = ColumnTransformer(
    transformers=[

        ('conn', FunctionTransformer(np.vectorize(categorize_connections),  validate=False), ['connection']),
        ('loc', FunctionTransformer(location_scores, validate=False, kw_args={'search_words': search_words}) ,  'location'),
        ('title', CosineSimilarityTransformerBERT(search_kw = search_words, model_name='bert-base-uncased'), 'job_title')
    ] 
)

# Weights of each score. Let us assign an importance score of 0.2, 0.5 and 0.7 to the information contained in the number of connections, location and title respectively
# The connection class and the location score both range from 0 to 3, while the cosine similarity is between -1 and 1
weights = np.array([0.2, 0.5, 0.7])
talents_df_clean['fit'] = (preprocessor.fit_transform(talents_df_clean)*weights).sum(axis=1)

# Printing out the top 10 candidates based on the highest fit score
print(talents_df_clean.sort_values(by =['fit'], ascending=False, inplace = False).head(10).to_markdown())

|    | job_title                                                                                            | location                    | connection   |     fit |
|---:|:-----------------------------------------------------------------------------------------------------|:----------------------------|:-------------|--------:|
|  7 | HR Senior Specialist                                                                                 | San Francisco Bay Area      | 500+         | 2.20373 |
| 23 | Nortia Staffing is seeking Human Resources, Payroll & Administrative Professionals!!  (408) 709-2621 | San Jose, California        | 500+         | 2.17932 |
| 31 | HR Manager at Endemol Shine North America                                                            | Los Angeles, California     | 268          | 1.94667 |
|  3 | People Development Coordinator at Ryan                                                               | Denton, Texas               | 500+         | 1.76636 |
| 26 | Hum

#### 2.5 Using SentenceTransformers from Sentence-BERT and calculating the similarity score using cosine similarity

In [26]:
from sentence_transformers import SentenceTransformer, util
from sklearn.base import TransformerMixin, BaseEstimator

class CosineSimilarityTransformerSBERT(TransformerMixin, BaseEstimator):
    def __init__(self,  search_kw, model_name='all-MiniLM-L6-v2'):
        self.model_name = model_name
        self.search_kw = search_kw
        self.model = SentenceTransformer(model_name)
        self.ref_vec  = None
    
    def fit(self, X, y=None):
        
        # Precompute the search keywords embedding
        self.ref_vec = self.model.encode(self.search_kw, convert_to_tensor=True)
        return self
    
    def transform(self, X):
        # Calculate cosine similarity of each sentence to the reference
        # encode expects a list of strings, so convert the pandas Series first
        sentence_embeddings = self.model.encode(list(X), convert_to_tensor=True)
        cosine_similarities = util.pytorch_cos_sim(self.ref_vec, sentence_embeddings).flatten()
        # Move to CPU first so the conversion also works when the model runs on a GPU
        return cosine_similarities.cpu().numpy().reshape(-1, 1)

# Now let's run the pipeline and get the necessary info

search_words = "Human Resources in California"
preprocessor = ColumnTransformer(
    transformers=[

        ('conn', FunctionTransformer(np.vectorize(categorize_connections),  validate=False), ['connection']),
        ('loc', FunctionTransformer(location_scores, validate=False, kw_args={'search_words': search_words}) ,  'location'),
        ('title', CosineSimilarityTransformerSBERT(search_kw = search_words, model_name='all-MiniLM-L6-v2'), 'job_title')
    ] 
)

# Weights of each score. Let us assign an importance score of 0.2, 0.5 and 0.7 to the information contained in the number of connections, location and title respectively
# The connection class and the location score both range from 0 to 3, while the cosine similarity is between -1 and 1
weights = np.array([0.2, 0.5, 0.7])
talents_df_clean['fit'] = (preprocessor.fit_transform(talents_df_clean)*weights).sum(axis=1)

# Printing out the top 10 candidates based on the highest fit score
print(talents_df_clean.sort_values(by =['fit'], ascending=False, inplace = False).head(10).to_markdown())

|    | job_title                                                                                            | location                    | connection   |     fit |
|---:|:-----------------------------------------------------------------------------------------------------|:----------------------------|:-------------|--------:|
|  7 | HR Senior Specialist                                                                                 | San Francisco Bay Area      | 500+         | 1.90726 |
| 23 | Nortia Staffing is seeking Human Resources, Payroll & Administrative Professionals!!  (408) 709-2621 | San Jose, California        | 500+         | 1.87552 |
| 31 | HR Manager at Endemol Shine North America                                                            | Los Angeles, California     | 268          | 1.69053 |
| 11 | Human Resources Coordinator at InterContinental Buckhead Atlanta                                     | Atlanta, Georgia            | 500+         | 1.542   |
| 26 | Hum

### 3. Improving the model based on the chosen options

If a profile is starred, we can replace the reference search vector with a weighted sum of the keyword search vector and the starred candidate's vector. This increases the cosine similarity between the new search vector and candidates that resemble the starred one.
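
The weighted-sum update can be sketched with plain NumPy. The vectors below are toy stand-ins for SBERT embeddings, and `blend_reference`, `cosine_similarity` and the mixing weight `alpha` are hypothetical names that do not appear in the notebook. `alpha` is an assumed parameter controlling how strongly the starred profile pulls the reference vector.

```python
import numpy as np

def blend_reference(keyword_vec, starred_vec, alpha=0.5):
    """Weighted sum of the keyword vector and the starred candidate's vector."""
    keyword_vec = np.asarray(keyword_vec, dtype=float)
    starred_vec = np.asarray(starred_vec, dtype=float)
    return (1.0 - alpha) * keyword_vec + alpha * starred_vec

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional "embeddings" (assumed values, for illustration only)
keyword_vec = np.array([1.0, 0.0, 0.0])
starred_vec = np.array([0.0, 1.0, 0.0])
similar_candidate = np.array([0.1, 0.9, 0.0])

new_ref = blend_reference(keyword_vec, starred_vec, alpha=0.5)
before = cosine_similarity(keyword_vec, similar_candidate)
after = cosine_similarity(new_ref, similar_candidate)
print(f"similarity before starring: {before:.3f}, after: {after:.3f}")
```

A candidate close to the starred profile scores higher against the blended vector than against the keyword vector alone. With real embeddings, the blended vector could be passed as `ref_vec` to a transformer such as `CosineSimilarityTransformerSBERT2`.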

In [None]:
# Sample DataFrame and reference sentence
search_words = "Aspiring Human Resources"

# Initialize and apply the transformer
similarity_transformer = TextSimilarityTransformer(search_words =search_words,  spacy_nlp=nlp)
talents_df_clean['fit']  = similarity_transformer.fit_transform(talents_df_clean['job_title'])

In [32]:

class CosineSimilarityTransformerSBERT2(TransformerMixin, BaseEstimator):
    def __init__(self, ref_vec, model):
        self.model = model
        self.ref_vec  = ref_vec
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        # Calculate cosine similarity of each sentence to the reference
        
        sentence_embeddings = self.model.encode(X, convert_to_tensor=True)
        cosine_similarities = util.pytorch_cos_sim(self.ref_vec, sentence_embeddings).flatten()
        # Move the tensor to the CPU first so this also works when the model runs on a GPU
        return cosine_similarities.cpu().numpy().reshape(-1, 1)
    

# Calculating the location scores based on locations_df: for each reference location, a matching country adds 1, a matching state adds 1 more and a matching city adds 1 more (0 if nothing matches)
def location_scores2(location_names, locations_df):
    # Without any reference location there is nothing to compare against
    if locations_df.empty:
        return np.zeros(len(location_names)).reshape(-1, 1)
    scores = []
    for location_name in location_names:
        location_dict = get_location_info(location_name)
        score = 0
        if location_dict:
            for i, loc_row in locations_df.iterrows():
                if (location_dict['country'] and (location_dict['country'] == loc_row['country'])):
                    score += 1
                    if (location_dict['state'] and (location_dict['state'] == loc_row['state'])):
                        score += 1
                        if (location_dict['city'] and (location_dict['city'] == loc_row['city'])):
                            score += 1
        scores.append(score)
    scores = np.array(scores, dtype=float)
    score_range = scores.max() - scores.min()
    # If every candidate has the same score, min-max scaling would divide by zero
    if score_range == 0:
        return np.zeros(len(scores)).reshape(-1, 1)
    # Scaling scores so that they are in a range between 0 and 1
    return ((scores - scores.min()) / score_range).reshape(-1, 1)

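# Encoding LinkedIn connections so that they are always between 0 and 1
def categorize_connections(num, threshold = 0):
    if not num:
        return None
    # The raw value may be '500+' (possibly with trailing whitespace) or a number stored as text
    text = str(num).strip()
    count = 500 if text == '500+' else int(text)
    if not threshold:
        if count >= 500:
            return 3/3
        elif 0 <= count < 100:
            return 0/3
        elif 100 <= count < 200:
            return 1/3
        elif 200 <= count < 500:
            return 2/3
        else:
            return None
    else:
        # With a threshold, the score is binary: 1 at or above the threshold, 0 below it
        if count >= threshold:
            return 1
        else:
            return 0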
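
The location scoring rule used in `location_scores2` (one point each for a matching country, state and city, followed by min-max scaling) can be illustrated on plain dictionaries. This is a minimal sketch under stated assumptions: the dictionaries stand in for whatever `get_location_info` returns, and `hierarchical_score` and `min_max_scale` are hypothetical helper names, not part of the notebook. Returning zeros when all scores are equal is a choice made here to avoid dividing by zero.

```python
import numpy as np

def hierarchical_score(candidate, target):
    """Return 0-3: +1 for a matching country, then state, then city."""
    score = 0
    for level in ("country", "state", "city"):
        if candidate.get(level) and candidate.get(level) == target.get(level):
            score += 1
        else:
            break  # a deeper level only counts if the shallower one matched
    return score

def min_max_scale(scores):
    """Scale scores to [0, 1]; return zeros if all scores are identical."""
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    if span == 0:
        return np.zeros_like(scores)
    return (scores - scores.min()) / span

# Assumed example data, for illustration only
target = {"country": "United States", "state": "California", "city": "San Jose"}
candidates = [
    {"country": "United States", "state": "California", "city": "San Jose"},
    {"country": "United States", "state": "California", "city": "Los Angeles"},
    {"country": "United States", "state": "Georgia", "city": "Atlanta"},
    {"country": "Canada", "state": "Ontario", "city": "Toronto"},
]

raw = [hierarchical_score(c, target) for c in candidates]
scaled = min_max_scale(raw)
print(raw, scaled)
```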


In [None]:

search_words = "Human Resources in California"
locations_df = pd.DataFrame(columns = ['country','state', 'city'])
search_loc = [ent.text for ent in nlp(search_words).ents if ent.label_ == 'GPE']
if search_loc:
    search_loc_dict = get_location_info(' '.join(search_loc))
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    locations_df = pd.concat([locations_df, pd.DataFrame([search_loc_dict])], ignore_index=True)

thresholds = [0]
sindx = 10


# Extracting information from the job title to update the search vector

job_title_col = talents_df_clean.loc[sindx, 'job_title']


# Extracting information for updating locations
location_col = talents_df_clean.loc[sindx, 'location']
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
locations_df = pd.concat([locations_df, pd.DataFrame([get_location_info(location_col)])], ignore_index=True)

# Extracting information for updating connections

connections_col = talents_df_clean.loc[sindx, 'connection']
if connections_col=='500+':
    thresholds.append(500)
else:
    thresholds.append(int(connections_col))

threshold = np.mean(thresholds)
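
The connection threshold computed above is the mean of an initial value of 0 and the connection counts of the starred candidates. A small sketch of how that mean would evolve over successive stars follows. `parse_connections` and `updated_threshold` are hypothetical helpers introduced only for illustration, and treating `'500+'` as 500 mirrors the cell above.

```python
import numpy as np

def parse_connections(value):
    """Convert a connection entry such as '268' or '500+' to an integer."""
    text = str(value).strip()
    return 500 if text == "500+" else int(text)

def updated_threshold(thresholds, starred_connection):
    """Append the starred candidate's connection count and return the new mean."""
    thresholds = thresholds + [parse_connections(starred_connection)]
    return thresholds, float(np.mean(thresholds))

thresholds = [0]
thresholds, t1 = updated_threshold(thresholds, "500+")  # first starred candidate
thresholds, t2 = updated_threshold(thresholds, "268")   # second starred candidate
print(t1, t2)
```

Because the initial 0 stays in the list, the threshold is pulled downward relative to the starred candidates' own counts; whether that is desirable depends on how strict the connection filter should be.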



In [None]:

preprocessor_init = ColumnTransformer(
    transformers=[

        ('conn', FunctionTransformer(np.vectorize(categorize_connections),  validate=False), ['connection']),
        ('loc', FunctionTransformer(location_scores, validate=False, kw_args={'search_words': search_words}) ,  'location'),
        ('title', CosineSimilarityTransformerSBERTU(search_kw = search_words, model_name='all-MiniLM-L6-v2'), 'job_title')
    ] 
)

        self. search_kw =  search_kw
        self.model = SentenceTransformer(model_name)
transformed_df = transformers.init