## Potential Talent
#### Context
We are a talent sourcing and management company helping tech firms find top candidates. This is challenging because:

- We must deeply understand client needs.
- We must know what makes a candidate stand out.
- Finding the right talent is difficult.

Currently, our process is manual and labor-intensive. To streamline it, we want to build a machine learning pipeline that:

- Predicts candidate fitness for a given role.
- Ranks candidates accordingly.
- Adapts rankings when we “star” a candidate as ideal, reordering based on that feedback.

We source candidates via keyword searches (e.g., “full-stack software engineer”, “engineering manager”, “aspiring human resources”). After generating a list, we manually review and sometimes select candidates further down the ranking. Our goal is to automate re-ranking based on such supervisory signals.
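One possible way to turn a "starred" candidate into a re-ranking signal is to blend the query vector with the starred candidate's vector and score everyone against the blend (a Rocchio-style update). The sketch below illustrates that idea under assumptions and is not necessarily the approach this project ends up using: the 3-dimensional toy vectors, the `rerank` helper, and the `alpha` weight are all hypothetical.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rerank(query_vec, candidate_vecs, starred_id=None, alpha=0.5):
    """Return candidate ids sorted by fit, best first.

    alpha (hypothetical) controls how much weight the starred candidate gets.
    """
    target = query_vec
    if starred_id is not None:
        # Move the target towards the starred candidate
        target = (1 - alpha) * query_vec + alpha * candidate_vecs[starred_id]
    scores = {cid: cosine(target, vec) for cid, vec in candidate_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy vectors standing in for real title embeddings (made up for illustration)
query = np.array([1.0, 0.0, 0.0])
candidates = {
    "a": np.array([0.9, 0.1, 0.0]),
    "b": np.array([0.6, 0.0, 0.8]),
    "c": np.array([0.5, 0.9, 0.1]),
    "d": np.array([0.0, 0.1, 1.0]),
}

before = rerank(query, candidates)
after = rerank(query, candidates, starred_id="c", alpha=0.7)
print(before)
print(after)
```

With these toy vectors, candidate "c" starts third and moves to the top once it is starred, while candidates pointing in unrelated directions stay at the bottom.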

#### Data Description
Our dataset anonymizes candidates with unique IDs.

Attributes:

- id: candidate identifier (numeric)
- job_title: candidate’s job title (text)
- location: geographical location (text)
- screening_score: score assigned to the candidate during screening (numeric, 0-100)

Target:
- fit: candidate’s fitness for a role (numeric, probability 0–1)

Sample keywords:
- “aspiring human resources”
- “seeking human resources”

Goals
- Predict candidate fitness.
- Rank candidates by fitness score.
- Re-rank dynamically when candidates are starred.

Success Metrics
- Improved candidate ranking accuracy.
- Robustness of re-ranking after feedback.

Bonus
- Methods to filter out irrelevant candidates.
- Defining cut-off thresholds without losing strong fits.
- Reducing human bias through further automation.
- Exploring GenAI solutions using prompt engineering.

## Read in Data and Initial Exploratory Data Analysis

In [1]:
# Importing Standard Libraries
import pandas as pd
import numpy as np
import os

from sklearn.metrics.pairwise import linear_kernel
pd.options.display.max_columns = 60

# Set the option to display the full text in DataFrame columns
pd.set_option('display.max_colwidth', None)

In [2]:
# Reading in the data
path = os.getcwd()
# os.path.join builds the path with the correct separator on any operating system
df = pd.read_excel(os.path.join(path, 'Dataset for Potential Talents.xlsx')).set_index('id')
df.head()

id,title,location,screening_score
1,innovative and driven professional seeking a role in data analyticsdata science in the information technology industry.,United States,100
2,ms applied data science student usc research assistant usc former data analytics intern at dr reddys laboratories former data science intern quadratyx actively seeking full time roles in summer 2025,United States,100
3,computer science student seeking full-time software engineerdeveloper positions ai sql data visualization toolspython ssrs,United States,100
4,microsoft certified power bi data analyst mba business analytics unt business intelligence engineer data scientist data engineer business analytics predictive analytics statistical analysis ex-ericsson,United States,100
5,graduate research assistant at uab masters in data science student at uab ex jio,United States,100


In [3]:
df.rename(columns={"title":"job_title"}, inplace=True)
# df.rename(columns={"screening_score":"connection"}, inplace=True)
df.job_title.value_counts()

job_title
data analyst                                                                                                    19
data scientist                                                                                                  16
--                                                                                                              15
software engineer                                                                                                5
researcher                                                                                                       3
                                                                                                                ..
masters in applied statistics and supply chain analyst for aldi                                                  1
master of science in analytics at georgia institute of technology aspiring data scientist                        1
data engineer student at iit and upm                                  

In [4]:
df.job_title.value_counts()[:10]

job_title
data analyst                                      19
data scientist                                    16
--                                                15
software engineer                                  5
researcher                                         3
student at the university of memphis               2
Software Engineer                                  2
aspiring data analyst                              2
data scientist i                                   2
master of science in biostatistics columbia 25     2
Name: count, dtype: int64

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1285 entries, 1 to 1285
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   job_title        1281 non-null   object
 1   location         1285 non-null   object
 2   screening_score  1285 non-null   int64 
dtypes: int64(1), object(2)
memory usage: 40.2+ KB


**Observations**
- The most frequent job titles are data analyst and data scientist.
- Four of the 1,285 job titles are null, and others are placeholders such as "--" (15 rows).
- Many job titles are an aggregate of multiple titles.
- Some titles are counted separately only because they are capitalized differently (e.g., "software engineer" and "Software Engineer").
- We should drop the null values and replace empty job titles with 'blank' to standardize the missing values.
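The capitalization and placeholder issues noted above can be illustrated on a toy Series. This is a minimal sketch rather than a step the notebook runs: the sample titles are made up, and lowercasing at this stage is an assumption (the TF-IDF vectorizer and the later cleaning step lowercase the text anyway).

```python
import pandas as pd

# Made-up titles that mimic the issues seen in the value counts
titles = pd.Series(["Software Engineer", "software engineer", " data analyst", "--", None])

normalized = (
    titles
    .dropna()                      # drop null titles
    .str.strip()                   # remove leading/trailing whitespace
    .str.lower()                   # merge differently capitalized duplicates
    .replace({"--": "blank", "": "blank", ".": "blank"})  # standardize empty placeholders
)
counts = normalized.value_counts()
print(counts)
```

After normalization the two spellings of "software engineer" are counted together, and the "--" placeholder becomes 'blank'.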

In [6]:
# dropping null values and cleaning up data
df = df[df['job_title'].notna()].copy()
# standardize placeholder values ("--", " ", ".") as 'blank'
df = df.replace(["--", " ", "."], "blank")


In [7]:
# Checking if there are any empty rows 
df[df.job_title == ' ']

id,job_title,location,screening_score


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1281 entries, 1 to 1285
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   job_title        1281 non-null   object
 1   location         1281 non-null   object
 2   screening_score  1281 non-null   int64 
dtypes: int64(1), object(2)
memory usage: 40.0+ KB


The cleaned up dataset has 1,281 observations. There are no more null values.  We're ready to start ranking the candidates by similarity scores. 

## Process
We will compare various NLP and LLM-based methods, and finally GenAI methods, for robustness: TF-IDF, Word2Vec, GloVe, FastText, and BERT. TF-IDF produces sparse, count-based vectors, while the other four generate word embeddings, which are dense numerical vector representations of words. 

Each method learns these embeddings in a different way, leading to various strengths and weaknesses. The primary difference lies in how they capture a word's meaning, particularly in context. 

Finally, we will work with prompt engineering and a GenAI model. 

## 1. TF-IDF
 We will start with TF-IDF (Term Frequency-Inverse Document Frequency). This is a statistical method that weighs a word's importance based on how frequently it appears in a document (Term Frequency) and how rare it is across the entire collection of documents (Inverse Document Frequency).

We will compare how close the search term is to the job titles through cosine similarity and return a fit score from 0-1. 

**Limitation**

TF-IDF does not capture any meaning or semantic relationship between words. It only looks at word frequency, completely ignoring the surrounding context and word order.
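The limitation above can be seen on a toy example. This is a minimal sketch with made-up titles and a separate `toy_vectorizer` (a hypothetical name), so it does not touch the objects used in the rest of the notebook: a query that shares no words with a title scores 0, even when the meaning is related.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up titles for illustration only
titles = ["data analyst", "senior data analyst", "business intelligence specialist"]

toy_vectorizer = TfidfVectorizer()
title_vectors = toy_vectorizer.fit_transform(titles)

def toy_scores(query):
    """Cosine similarity between one query and every toy title."""
    return cosine_similarity(toy_vectorizer.transform([query]), title_vectors).flatten()

exact = toy_scores("data analyst")        # shares words with the first two titles
related = toy_scores("analytics expert")  # related meaning, but no shared words
print(exact.round(3))
print(related.round(3))
```

The exact match scores 1, the partial match scores between 0 and 1, and the semantically related query scores 0 against every title because none of its words are in the fitted vocabulary.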

### Prepping our Text for Modelling

In [9]:
# Importing Libraries
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Prep our Text for Modelling
vectorizer = TfidfVectorizer(stop_words='english', ngram_range = (1, 2))
docs_tfidf = vectorizer.fit_transform(df["job_title"])

In [10]:
# Defining a function for Calculating Cosine Similarity of vectorized tfidf queries with job titles
def get_tf_idf_query_similarity(vectorizer, docs_tfidf, query):
    """
    vectorizer: TfIdfVectorizer model
    docs_tfidf: tfidf vectors for all docs
    query: query doc

    return: cosine similarity between query and all docs
    """
    query_tfidf = vectorizer.transform([query])
    cos_sim = cosine_similarity(query_tfidf, docs_tfidf).flatten()
    
    return cos_sim

def get_all_similarity(query):
    
    # TFIDF similarity
    cos_sim = get_tf_idf_query_similarity(vectorizer, docs_tfidf, query) 
    df['tfidf_fit'] = cos_sim

    return df

# Function to return the top candidates per search term 
def top_candidates(n, by='tfidf_fit', ascending=False, min_con=0, location=None):

    # Sort by the chosen fit score first, with screening_score as the tie-breaker
    by = [by, 'screening_score']
    
    # If location is not provided, use all locations (no filtering)
    if location is None:
        location_filter = df.location.notnull()  # No location restriction
    else:
        location_filter = df.location == location

    # Create condition for columns in 'by' to be greater than 0
    score_filter = (df[by] > 0).all(axis=1)
    
    # Filter and sort
    
    df2 = df.loc[(df.screening_score >= min_con) & score_filter & location_filter]
    df2 = df2.sort_values(by=by, ascending=ascending).head(n).copy()
    
    if df2.empty:
        return "There are no suitable candidates"
    return df2

In [11]:
query = 'Data Analyst'
df = get_all_similarity(query)
top_candidates(n = 4)

id,job_title,location,screening_score,tfidf_fit
363,data analyst,United States,100,1.0
74,data analyst,United States,90,1.0
589,data analyst,United States,85,1.0
622,data analyst,United States,85,1.0


We can also return only candidates with a minimum screening score:

In [12]:
top_candidates(n = 4, min_con=90)

id,job_title,location,screening_score,tfidf_fit
363,data analyst,United States,100,1.0
74,data analyst,United States,90,1.0
570,business data analyst,United States,90,0.463726
25,data analyst and machine learning engineer,United States,95,0.389979


In [13]:
query = 'Sales Representative'
df = get_all_similarity(query)
top_candidates(n = 4)

id,job_title,location,screening_score,tfidf_fit
328,software engineer mathematics data science and ai course representative,United Kingdom,100,0.228731
740,senior technical support representative at n-able with expertise in cyber security and project management,United Kingdom,80,0.165914
496,data analyst sales operation analytics compensation 8 years alteryx excel sql and python proficient new york usa,United States,95,0.164136


**Observations:**

TF-IDF does fairly well at returning candidates when the search term and the job title are an exact or very close match. When there is no close match it struggles, as expected, because it has no semantic understanding: for 'Sales Representative', only three candidates passed the non-zero score filter, and none of them is a sales representative. Let's take a look at the next technique. 

## 2. Word2Vec Gensim
Word2Vec is a neural network-based method that learns word vectors (embeddings) by predicting surrounding words from a target word (Skip-gram) or predicting a target word from its context (CBOW).

It captures semantic and syntactic relationships, so words with similar meanings have similar vectors. This allows for vector arithmetic like king - man + woman ≈ queen.
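The vector arithmetic can be illustrated without loading the full model. The sketch below uses hand-made 2-dimensional vectors (an assumption purely for illustration; real Word2Vec vectors have 300 dimensions and are learned, not hand-set) and a hypothetical `nearest` helper.

```python
import numpy as np

# Hand-made vectors: axis 0 ~ "royalty", axis 1 ~ "gender" (illustrative only)
toy_vectors = {
    "king":  np.array([0.9,  0.9]),
    "queen": np.array([0.9, -0.9]),
    "man":   np.array([0.1,  0.9]),
    "woman": np.array([0.1, -0.9]),
    "apple": np.array([-0.8, 0.1]),
}

def nearest(target, exclude):
    """Word whose vector has the highest cosine similarity to target."""
    def cos(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    candidates = {w: cos(target, v) for w, v in toy_vectors.items() if w not in exclude}
    return max(candidates, key=candidates.get)

result = nearest(
    toy_vectors["king"] - toy_vectors["man"] + toy_vectors["woman"],
    exclude={"king", "man", "woman"},
)
print(result)
```

With a loaded gensim `KeyedVectors` model, the equivalent query would be `most_similar(positive=["king", "woman"], negative=["man"])`.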

It generates a single, static vector for each word regardless of its context, which can cause issues with words that have multiple meanings (polysemy), like "bank".

**Limitations**
Word2Vec treats every word as a single, atomic unit. If it encounters a word it has never seen before (an out-of-vocabulary or OOV word), it cannot create a meaningful vector for it.

We will use Google's pre-trained Word2Vec model (the Google News vectors, 300 dimensions per word) rather than training our own embeddings. 

### Prepping our Text for Modelling

In [14]:
import re
import nltk
from tensorflow import keras

# processing texts for modelling
# nltk.download('stopwords')  # run once if the stop-word list is not installed
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

def clean_text(text):
    # replace non-letters with spaces, lowercase, and drop English stop words
    words = [re.sub(r'[^a-zA-Z]',' ',w).lower() for w in text.split()]
    return " ".join(w for w in words if w not in stop_words)

df['job_title_cleaned'] = df.job_title.apply(clean_text)

In [15]:
# drop tfidf_fit column to preserve column order later
df.drop(columns="tfidf_fit", inplace=True)
df.head(2)

id,job_title,location,screening_score,job_title_cleaned
1,innovative and driven professional seeking a role in data analyticsdata science in the information technology industry.,United States,100,innovative driven professional seeking role data analyticsdata science information technology industry
2,ms applied data science student usc research assistant usc former data analytics intern at dr reddys laboratories former data science intern quadratyx actively seeking full time roles in summer 2025,United States,100,ms applied data science student usc research assistant usc former data analytics intern dr reddys laboratories former data science intern quadratyx actively seeking full time roles summer


In [16]:
# Patch for keras_preprocessing compatibility with NumPy 2.0
if not hasattr(np, "unicode_"):
    np.unicode_ = np.str_


In [17]:
# tokenizing our cleaned job title data and padding every document to make them of the same size
from tensorflow.keras.preprocessing.text import Tokenizer
# from keras.layers import TextVectorization
from keras_preprocessing.sequence import pad_sequences
tokenizer=Tokenizer()

tokenizer.fit_on_texts(df.job_title_cleaned)
tokenized_documents=tokenizer.texts_to_sequences(df.job_title_cleaned)
tokenized_paded_documents=pad_sequences(tokenized_documents,maxlen=64,padding='post')
vocab_size=len(tokenizer.word_index)+1

### Loading pre-trained embeddings

In [None]:
import gensim

# Loading Google's pre-trained embeddings, each word is represented as a 300 dimensional vector
os.chdir(r"C:\Users\Alex Chung\Documents\the_Lab\Apziva\Potential Talent")
# W2V_PATH="GoogleNews-vectors-negative300.bin.gz"
W2V_PATH="GoogleNews-vectors-negative300.bin"
path = os.getcwd()+'\\GoogleNews-vectors-negative300.bin\\'

# loading word2vec model
model_w2v = gensim.models.KeyedVectors.load_word2vec_format(path+W2V_PATH, binary=True)
model_w2v[0][:4]

array([ 0.00112915, -0.00089645,  0.00031853,  0.00153351], dtype=float32)

In [19]:
# creating embedding matrix, every row is a vector representation from the vocabulary indexed by the tokenizer index. 
embedding_matrix=np.zeros((vocab_size,300))
for word,i in tokenizer.word_index.items():
    if word in model_w2v:
        embedding_matrix[i]=model_w2v[word]
        
# creating document-word embeddings
document_word_embeddings=np.zeros((len(tokenized_paded_documents),64,300))
for i in range(len(tokenized_paded_documents)):
    for j in range(len(tokenized_paded_documents[0])):
        document_word_embeddings[i][j]=embedding_matrix[tokenized_paded_documents[i][j]]
document_word_embeddings.shape

(1281, 64, 300)

In [20]:
# Creating a function to process our query term
def processing(query):
    stop_words = stopwords.words('english')
    # Clean the query the same way as the job titles
    processed = " ".join(re.sub(r'[^a-zA-Z]',' ',w).lower()
                         for w in query.split()
                         if re.sub(r'[^a-zA-Z]',' ',w).lower() not in stop_words)

    # Use a separate tokenizer for the query so the tokenizer fitted on the job titles is not modified
    query_tokenizer = Tokenizer()
    query_tokenizer.fit_on_texts([processed])
    tokenized_query = query_tokenizer.texts_to_sequences([processed])
    tokenized_paded_query = pad_sequences(tokenized_query, maxlen=64, padding='post')
    vocab_size = len(query_tokenizer.word_index) + 1

    # Row i holds the Word2Vec vector of the word with index i (zeros for padding and OOV words)
    embedding_matrix = np.zeros((vocab_size, 300))
    for word, i in query_tokenizer.word_index.items():
        if word in model_w2v:
            embedding_matrix[i] = model_w2v[word]

    # creating document-word embeddings for the query, shape (1, 64, 300)
    query_document_word_embeddings = embedding_matrix[tokenized_paded_query]

    return query_document_word_embeddings

In [21]:
print(processing('hello world!!!!').shape)
print(processing('hello world!!!!')[0][:3][0][:10])

(1, 64, 300)
[-0.05419922  0.01708984 -0.00527954  0.33203125 -0.25       -0.01397705
 -0.15039062 -0.265625    0.01647949  0.3828125 ]


In [None]:
# Function for getting Word2Vec similarity score
def get_w2v_query_similarity(document_word_embeddings, query):
    """
    document_word_embeddings: padded word2vec embeddings for all docs, shape (n_docs, 64, 300)
    query: query doc

    return: cosine similarity between query and all docs

    """
    query_w2v = processing(query)
    
    nsamples, nx, ny = query_w2v.shape
    query_w2v_reshape = query_w2v.reshape((nsamples,nx*ny))

    nsamples, nx, ny = document_word_embeddings.shape
    document_word_embeddings_reshape = document_word_embeddings.reshape((nsamples,nx*ny))
    
    cos_sim_w2v = cosine_similarity(query_w2v_reshape, document_word_embeddings_reshape).flatten()
    
    return cos_sim_w2v

# Function to get similarity scores for both TF-IDF and Word2Vec
def get_all_similarity(query):
    
    # Word2Vec Similarity
    cos_sim_w2v = get_w2v_query_similarity(document_word_embeddings, query)
    df['w2v_fit'] = cos_sim_w2v

    # Original TFIDF similarity for comparison
    cos_sim = get_tf_idf_query_similarity(vectorizer, docs_tfidf, query) 
    df['tfidf_fit'] = cos_sim

    return df

In [24]:
query = 'seeking human resources'

df = get_all_similarity(query)

top_candidates(n = 4, by = 'w2v_fit', ascending = False, min_con = 0)

id,job_title,location,screening_score,job_title_cleaned,w2v_fit,tfidf_fit
882,pursuing data analytics sjsu,United States,70,pursuing data analytics sjsu,0.276351,0.0
384,seeking crime analyst roles,United States,100,seeking crime analyst roles,0.276286,0.104308
632,seeking full-time opportunities research assistant iub mscs indiana university bloomington,United States,85,seeking full time opportunities research assistant iub mscs indiana university bloomington,0.241914,0.065065
914,trying to understand the world through data and philosophy,United States,35,trying understand world data philosophy,0.204614,0.0


Each NLP technique produces similarity scores on its own scale, so comparing the raw scores across techniques is not very informative. Instead, we can compare the top-ranked results of each technique against one another. 
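Besides reading the top results side by side, the agreement between two techniques can be summarized with a single number. The sketch below computes the Jaccard overlap of two top-n lists; the `jaccard` helper and the candidate ids are hypothetical and only illustrate the idea.

```python
def jaccard(list_a, list_b):
    """Share of items two top-n lists have in common (0 = disjoint, 1 = identical sets)."""
    set_a, set_b = set(list_a), set(list_b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)

# Hypothetical top-4 candidate ids returned by two techniques for the same query
top_tfidf = [363, 74, 589, 622]
top_w2v = [363, 74, 570, 25]

overlap = jaccard(top_tfidf, top_w2v)
print(round(overlap, 3))
```

A low overlap signals that the two techniques disagree on who the best candidates are, which is a cue to inspect both lists manually.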

### Let's compare the results of each technique against one another. 

In [25]:
def compare_results(n, query):
    
    df_compare = pd.DataFrame()
    df = get_all_similarity(query)
    # Fit-score columns come after job_title, location, screening_score and job_title_cleaned
    cols = df.columns[4:].to_list()
    col_names = [x.split("_")[0] for x in cols]
    for tn, t in zip(col_names, cols):
        top = top_candidates(n = n, by = t)
        # top_candidates returns a string when no candidate qualifies; skip that technique
        if isinstance(top, str):
            continue
        titles = top['job_title_cleaned'].to_list()
        # Pad with zeros so that every column has exactly n rows
        df_compare[tn] = titles + [0] * (n - len(titles))
                
    return df_compare

In [None]:
n = 5
query = 'Senior Human Resources Business Partner at Heil Environmental'
compare_results(n, query)

Unnamed: 0,w2v,tfidf
0,senior data engineeranalytics,human centered software developer
1,senior data growth product manager cdp management product analytics mcit penn,senior data analyst
2,senior data analystbi developer,senior data engineer
3,senior data engineer,senior analyst
4,biologist data scientist teacher rpcv interested environmental science,environmental fluid mechanics phd candidate boston university


In [None]:
query = 'Staff Data Scientist'

df = get_all_similarity(query)

compare_results(n, query)

Unnamed: 0,w2v,tfidf
0,health data scientist,staff data scientist analytics engineer equifax
1,aspiring data scientist,data scientist
2,staff data scientist analytics engineer equifax,data scientist
3,medical data scientist engineer,data scientist
4,aiml engineerdata scientist,data scientist


**Observations:**

Word2Vec, like TF-IDF, does fairly well at returning candidates when the search term and the job title are a close or exact match. Like TF-IDF, it has a harder time when there is no close match. There is some semantic understanding, but not much. Let's take a look at the next technique. 
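One reason for the weak matches is worth spelling out: `get_w2v_query_similarity` flattens the padded (64, 300) matrices before taking the cosine similarity, so a query word only contributes when the title has a similar word in the same position. The sketch below shows the effect with one-hot stand-ins for word vectors and a padding length of 4 (both assumptions to keep the toy small), and contrasts it with mean pooling, which is a swapped-in alternative and not what the notebook does.

```python
import numpy as np

# One-hot stand-ins for word vectors (real ones would come from model_w2v)
vec = {
    "senior":  np.array([1.0, 0.0, 0.0]),
    "data":    np.array([0.0, 1.0, 0.0]),
    "analyst": np.array([0.0, 0.0, 1.0]),
}
MAXLEN = 4  # the notebook pads to 64; 4 keeps the toy readable

def padded(words):
    """Stack word vectors and pad with zero rows up to MAXLEN, shape (MAXLEN, 3)."""
    out = np.zeros((MAXLEN, 3))
    for j, w in enumerate(words):
        out[j] = vec[w]
    return out

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = padded(["data", "analyst"])
title = padded(["senior", "data", "analyst"])

# Flattening compares position 0 with position 0, position 1 with position 1, ...
flat_sim = cos(query.flatten(), title.flatten())

# Mean pooling over the real (non-padding) words ignores word position
mean_sim = cos(query[:2].mean(axis=0), title[:3].mean(axis=0))
print(round(flat_sim, 3), round(mean_sim, 3))
```

The title contains both query words, yet the flattened comparison scores 0 because the words are shifted by one position, while the mean-pooled comparison scores about 0.82.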

## 3. GloVe

GloVe (Global Vectors for Word Representation) is a word-embedding technique developed at Stanford that leverages global co-occurrence statistics from an entire text corpus. 

It counts how often words appear together across the corpus, builds a large word-context co-occurrence matrix, and then factorizes that matrix into low-dimensional word vectors. Word2Vec, by contrast, learns from one local "context window" at a time. Using global statistics is intended to help GloVe capture nuanced semantic and syntactic relationships, and it performs strongly on word-analogy tasks. 
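The co-occurrence counting step can be sketched on a tiny corpus. This is only the counting part under stated assumptions: the two sentences and the `WINDOW` size are made up, and the actual GloVe training goes further by fitting word vectors to the logarithm of these counts.

```python
from collections import Counter

# Tiny made-up corpus; GloVe itself was trained on billions of tokens
corpus = [
    "data scientist builds models",
    "data analyst builds dashboards",
]
WINDOW = 2  # hypothetical window size for this sketch

cooccurrence = Counter()
for sentence in corpus:
    tokens = sentence.split()
    for i, word in enumerate(tokens):
        # Count every neighbour up to WINDOW positions to the right, in both directions
        for j in range(i + 1, min(i + 1 + WINDOW, len(tokens))):
            cooccurrence[(word, tokens[j])] += 1
            cooccurrence[(tokens[j], word)] += 1

print(cooccurrence[("data", "builds")])
print(cooccurrence[("data", "models")])
```

Here "data" and "builds" co-occur in both sentences, while "data" and "models" are too far apart to be counted. These corpus-wide counts are the global statistics that GloVe builds on.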

### Examining GloVe's pre-trained vectors

In [32]:
# Navigating to directory where GloVe pre-trained vectors were downloaded
os.chdir(r"C:\Users\Alex Chung\Documents\the_Lab\Apziva\Potential Talent\glove")
path = 'glove.840B.300d.txt'

In [33]:
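# Read the file as UTF-8 explicitly so the result does not depend on the platform default encoding
with open(path, encoding="utf-8") as file:
    for i in range(5):
        line = file.readline()
        print(line[:20])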

, -0.082752 0.67204 
. 0.012001 0.20751 -
the 0.27204 -0.06203
and -0.18567 0.06600
to 0.31924 0.06316 -


In [34]:
df_glove = pd.read_csv(path, sep=" ", quoting=3, header=None, index_col=0)
df_glove.T.head()

Unnamed: 0,",",.,the,and,to,of,a,in,"""",:,is,for,I,),(,that,-,on,you,with,'s,it,The,are,by,at,be,this,as,from,...,trompettes,tylerdurden,unaturally,uniao,upstretched,usr/lib/oracle,v205,vakker,value-in-use,vampaneze,vinted,vocÃª,votesA,war/WEB-INF/lib,web.Our,what-might-have-been,wiid,windowsTransgender,woombie,wordsforyoungmen,work.Like,working.So,wried,wwent,xalisae,xtremecaffeine,yildirim,z/28,zipout,zulchzulu
1,-0.082752,0.012001,0.27204,-0.18567,0.31924,0.060216,0.043798,0.089187,-0.075242,0.008746,-0.084961,-0.17224,0.1941,-0.27142,-0.18024,0.09852,-0.20688,-0.070186,-0.11076,-0.099534,-0.06858,0.001363,-0.067679,-0.19859,-0.15552,-0.36769,-0.059177,-0.087595,-0.10648,0.01332,...,0.19232,0.66499,0.32269,0.20198,0.23488,0.51092,0.24627,0.33453,-0.26508,0.9066,0.75268,0.55804,-0.35556,1.1053,0.98946,0.56295,0.3851,-0.10235,0.65711,-0.3782,-0.23822,0.75465,0.54698,0.92179,0.33754,0.073032,0.22276,0.7344,0.21215,-0.07969
2,0.67204,0.20751,-0.06203,0.066008,0.06316,0.21799,0.024779,0.25792,0.57337,0.33214,0.502,0.18234,0.22603,0.047374,0.008411,0.25001,0.66724,0.15274,0.30786,0.028202,0.4647,0.35653,0.094515,-0.062818,-0.33723,0.59821,0.10653,0.35502,-0.016295,-0.051085,...,-1.029,0.15479,-0.41217,-0.50532,-0.94829,0.60875,-1.0254,-0.15606,-0.056282,-1.1523,-0.98967,-0.63074,-0.049174,-0.96066,-0.48815,-0.29378,-0.31523,-0.043862,-1.0671,-1.1546,-0.657,-0.29236,-0.50515,-0.34432,-0.13111,-1.0294,-0.29639,-0.33641,-0.99456,-0.22905
3,-0.14987,-0.12578,-0.1884,-0.25209,-0.27858,-0.04249,-0.20937,0.26282,-0.31908,-0.29175,0.002382,-0.27847,-0.43764,-0.17278,-0.30463,-0.27018,-0.14633,-0.33086,-0.5198,-0.23189,0.13214,-0.055497,-0.25173,-0.36614,-0.097191,0.13229,-0.21613,0.063868,-0.22755,-0.13207,...,-0.1669,-0.17786,0.044183,0.17818,0.41461,-0.19998,0.58306,0.62839,1.2318,-1.2483,-0.043626,-0.29618,-0.18134,0.15525,-0.69766,1.2792,-0.10369,0.19327,-0.27787,0.38742,-0.18234,-0.21121,0.56164,-0.50888,0.15595,-0.015436,0.69412,0.26918,1.1782,0.80366
4,-0.064983,-0.59325,0.023225,-0.11725,0.2612,-0.38618,0.49745,-0.029365,-0.18484,-0.15119,-0.16755,-0.084666,-0.11387,-0.029084,0.20997,-0.23186,0.4204,0.11609,0.035138,0.094477,0.18599,-0.16607,-0.24268,-0.41786,-0.21617,0.23506,-0.086178,0.29292,-0.18934,0.40386,...,-1.6097,0.020382,0.38208,0.45301,0.15354,0.58105,-0.10448,0.25511,-0.39186,-0.43616,-0.57828,0.17533,0.65321,0.54527,-0.90032,-0.070849,-0.11077,-0.22556,0.48507,1.2881,-0.27082,-0.10582,-0.29412,-0.000386,-1.1244,0.72615,0.19362,0.41843,2.0721,-0.78865
5,0.056491,0.12525,-0.018158,0.26513,0.079248,-0.15388,0.36019,0.47187,0.88867,-0.41842,0.30721,0.25442,-0.072725,-0.2191,0.085153,0.022378,0.19229,-0.17336,0.10368,0.12191,-0.037015,0.00314,-0.61093,0.20962,-0.30091,-0.046757,0.005223,-0.23635,0.14167,0.21135,...,-0.18375,-0.49401,0.356,-0.48223,0.51145,0.80905,-0.21452,-0.56103,-0.61902,0.13476,0.24781,-0.033645,-1.0413,0.84744,0.39204,-0.48752,0.17295,-0.14848,-0.51166,-0.80267,0.37388,-0.29527,-0.35497,-0.15145,0.10046,-0.99246,-0.31276,-0.189,-0.44271,-0.40567


GloVe has nearly 2,200,000 represented tokens (words, punctuation marks, and other characters). We will create a dictionary of these word-vector pairs and also create a representation for any out-of-vocabulary (OOV) word: the mean of all GloVe vectors. 
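Loading every vector into a DataFrame is memory-intensive. As an alternative, the file can be streamed line by line, keeping only the words that actually occur in our job titles. The sketch below is not what the notebook does: `load_glove_subset` is a hypothetical helper, and the in-memory `fake_file` with 3-dimensional vectors stands in for the real 300-dimensional file.

```python
import io
import numpy as np

def load_glove_subset(lines, vocab, dim=300):
    """Read GloVe-format lines and keep only the words in vocab."""
    vectors = {}
    for line in lines:
        # Split on single spaces from the right so a token containing whitespace stays in one piece
        parts = line.rstrip("\n").rsplit(" ", dim)
        if len(parts) != dim + 1:
            continue  # skip malformed lines
        word = parts[0]
        if word in vocab:
            vectors[word] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

# Tiny in-memory stand-in for the real file (3 dimensions instead of 300)
fake_file = io.StringIO("the 0.1 0.2 0.3\ndata 0.4 0.5 0.6\nanalyst 0.7 0.8 0.9\n")
subset = load_glove_subset(fake_file, vocab={"data", "analyst"}, dim=3)
print(sorted(subset))
```

With the real file, `lines` would be the open file handle and `vocab` the set of words in `job_title_cleaned`. One trade-off is that a mean vector for OOV words would then be computed over the subset instead of over all GloVe vectors.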

In [35]:
# Creating a dictionary of words and corresponding vectors
glove = { key: val.values for key, val in df_glove.T.items() }

In [36]:
glove['man'][:5]

array([-0.1731  ,  0.20663 ,  0.016543, -0.31026 ,  0.019719])

In [37]:
unknown_word = df_glove.mean().values

In [38]:
df_glove.head()

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,...,271,272,273,274,275,276,277,278,279,280,281,282,283,284,285,286,287,288,289,290,291,292,293,294,295,296,297,298,299,300
",",-0.082752,0.67204,-0.14987,-0.064983,0.056491,0.40228,0.002775,-0.3311,-0.30691,2.0817,0.031819,0.013643,0.30265,0.00713,-0.5819,-0.2774,-0.062254,1.1451,-0.24232,0.1235,-0.12243,0.33152,-0.006162,-0.30541,-0.13057,-0.054601,0.037083,-0.070552,0.5893,-0.30385,...,-0.4393,-0.26137,0.30088,-0.060772,-0.45312,-0.19076,-0.20288,0.27694,-0.060888,0.11944,0.62206,-0.19343,0.47849,-0.30113,0.059389,0.074901,0.061068,-0.4662,0.40054,-0.19099,-0.14331,0.018267,-0.18643,0.20709,-0.35598,0.05338,-0.050821,-0.1918,-0.37846,-0.06589
.,0.012001,0.20751,-0.12578,-0.59325,0.12525,0.15975,0.13748,-0.33157,-0.13694,1.7893,-0.47094,0.70434,0.26673,-0.089961,-0.18168,0.067226,0.053347,1.5595,-0.2541,0.038413,-0.01409,0.056774,0.023434,0.024042,0.31703,0.19025,-0.37505,0.035603,0.1181,0.012032,...,-0.26477,0.096566,0.062658,-0.30668,-0.43334,0.10006,0.21136,0.039459,-0.11077,0.24421,0.60942,-0.46646,0.086385,-0.39702,-0.23363,0.021307,-0.10778,-0.2281,0.50803,0.11567,0.16165,-0.066737,-0.29556,0.022612,-0.28135,0.0635,0.14019,0.13871,-0.36049,-0.035
the,0.27204,-0.06203,-0.1884,0.023225,-0.018158,0.006719,-0.13877,0.17708,0.17709,2.5882,-0.35179,-0.17312,0.43285,-0.10708,0.15006,-0.19982,-0.19093,1.1871,-0.16207,-0.23538,0.003664,-0.19156,-0.085662,0.039199,-0.066449,-0.04209,-0.19122,0.011679,-0.37138,0.21886,...,0.4823,-0.051759,-0.27285,-0.25893,0.16555,-0.1831,-0.06734,0.42457,0.010346,0.14237,0.25939,0.17123,-0.13821,-0.066846,0.015981,-0.30193,0.043579,-0.043102,0.35025,-0.19681,-0.4281,0.16899,0.22511,-0.28557,-0.1028,-0.018168,0.11407,0.13015,-0.18317,0.1323
and,-0.18567,0.066008,-0.25209,-0.11725,0.26513,0.064908,0.12291,-0.093979,0.024321,2.4926,-0.017916,-0.071218,-0.24782,-0.26237,-0.2246,-0.21961,-0.12927,1.0867,-0.66072,-0.031617,-0.057328,0.056903,-0.27939,-0.39825,0.14251,-0.085146,-0.14779,0.055067,-0.002869,-0.20917,...,0.019917,-0.28803,-0.010494,0.038412,-0.11718,-0.072462,0.16381,0.38488,-0.029783,0.23444,0.4532,0.14815,-0.027021,-0.073181,-0.1147,-0.005455,0.47796,0.090912,0.094489,-0.36882,-0.59396,-0.097729,0.20072,0.17055,-0.004736,-0.039709,0.32498,-0.023452,0.12302,0.3312
to,0.31924,0.06316,-0.27858,0.2612,0.079248,-0.21462,-0.10495,0.15495,-0.03353,2.4834,-0.50904,0.08749,0.21426,0.22151,-0.25234,-0.097544,-0.1927,1.3606,-0.11592,-0.10383,0.21929,0.11997,-0.11063,0.14212,-0.16643,0.21815,0.004209,-0.070012,-0.23532,-0.26518,...,0.62255,-0.072391,0.090129,0.15428,0.023163,-0.13028,0.061762,0.33803,-0.091581,0.21039,0.05108,0.19184,0.10444,0.2138,-0.35091,-0.23702,0.038399,-0.10031,0.18359,0.025178,-0.12977,0.3713,0.18888,-0.004274,-0.10645,-0.2581,-0.044629,0.082745,0.097801,0.25045


In [39]:
glove[word][:3]

array([-0.039599,  0.64594 , -0.32744 ])

In [40]:
# Creating a vectorized representation for each job title in our dataframe
job_titles = df.job_title_cleaned

doc_sent_vec = []
for sentence in job_titles:
    # Use the GloVe vector of each word, or the mean vector for OOV words
    word_vec = [glove[word] if word in glove else unknown_word for word in sentence.split()]
    word_vec_mean = sum(word_vec) / len(word_vec) # returning a mean for each job title
    doc_sent_vec.append(word_vec_mean) # returning a list for all job titles
    
doc_sent_vec[0].shape

# Creating a vectorized representation for each query
def q_sent_vec(query):
    # The query is used as typed (no lowercasing or stop-word removal)
    q_word_vec = [glove[word] if word in glove else unknown_word for word in query.split()]
    # Average the word vectors once, after all words have been collected
    q_word_vec_mean = sum(q_word_vec) / len(q_word_vec)
    # Return a one-element list so it can be passed directly to cosine_similarity
    return [q_word_vec_mean]

In [41]:
query = 'native english speaking'
print("Length: " + str(len(q_sent_vec(query))))
print("Shape: " + str(q_sent_vec(query)[0].shape))
print("First 5 values: " + str(q_sent_vec(query)[0][:5]))

Length: 1
Shape: (300,)
First 5 values: [-0.29654333  0.12640833 -0.49922333  0.22307667  0.4358    ]
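
The averaging step used for both job titles and queries can be looked at in isolation. The following is a minimal sketch on toy 3-dimensional vectors invented for illustration; `mean_embedding` is a hypothetical helper, and the zero-vector fallback and the empty-text guard are assumptions rather than the notebook's own behavior.

```python
import numpy as np

def mean_embedding(text, vectors, dim, fallback=None):
    """Average the word vectors of `text`; unknown words use `fallback`."""
    if fallback is None:
        fallback = np.zeros(dim)  # assumption: zero vector for unknown words
    word_vecs = [vectors.get(word, fallback) for word in text.split()]
    if not word_vecs:  # empty text: avoid dividing by zero
        return np.zeros(dim)
    return np.mean(word_vecs, axis=0)

# Toy "embeddings", invented purely for illustration
toy_vectors = {
    "human": np.array([1.0, 0.0, 0.0]),
    "resources": np.array([0.0, 1.0, 0.0]),
}

print(mean_embedding("human resources", toy_vectors, dim=3))  # [0.5 0.5 0. ]
print(mean_embedding("human unicorns", toy_vectors, dim=3))   # [0.5 0.  0. ]
```

Note that every unknown word pulls the mean toward the fallback vector, so titles with many out-of-vocabulary words end up with shorter, less informative embeddings.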


Like before, let's create functions to compute similarity scores for our search terms and compare the different NLP methods.

In [None]:
def get_glove_query_similarity(doc_sent_vec, query):
    """
    query_glove: processing the query
    doc_sent_vec: glove embedding for all docs
    query: query doc

    return: cosine similarity between query and all docs

    """
    query_glove = q_sent_vec(query)
    
    cos_sim_glove = cosine_similarity(query_glove, doc_sent_vec).flatten()
    
    return cos_sim_glove

def get_all_similarity(query):
    
    #GloVe similarity
    cos_sim_glove = get_glove_query_similarity(doc_sent_vec, query)
    df['glove_fit'] = cos_sim_glove

    # original TFIDF similarity and Word2Vec Similarity for comparison
    cos_sim = get_tf_idf_query_similarity(vectorizer, docs_tfidf, query) 
    df['tfidf_fit'] = cos_sim

    cos_sim_w2v = get_w2v_query_similarity(document_word_embeddings, query)
    df['w2v_fit'] = cos_sim_w2v

    return df

In [44]:
query = 'Aspiring human resources'
df = get_all_similarity(query)
top_candidates(n = 10, by = 'glove_fit', ascending = False, min_con = 70)

id,job_title,location,screening_score,job_title_cleaned,w2v_fit,tfidf_fit,glove_fit
375,data science student tackling real-world challenges with ai analytics actively seeking opportunities,United States,100,data science student tackling real world challenges ai analytics actively seeking opportunities,0.04275,0.0,0.665426
431,aspiring data science professional focused on data analysis machine learning and data visualization actively seeking opportunities,United States,95,aspiring data science professional focused data analysis machine learning data visualization actively seeking opportunities,0.222291,0.098981,0.648091
1,innovative and driven professional seeking a role in data analyticsdata science in the information technology industry.,United States,100,innovative driven professional seeking role data analyticsdata science information technology industry,0.065781,0.0,0.630105
880,passionate and driven bioinformatics graduate from san jose state university actively seeking full time opportunities for making a difference in computational biology research.,United States,70,passionate driven bioinformatics graduate san jose state university actively seeking full time opportunities making difference computational biology research,0.095373,0.0,0.626737
487,research assistant penn state seeking opportunities in the data field data analyst with experience at sritech software expertise in machine learning data evaluation passionate about transforming data into insights,United States,95,research assistant penn state seeking opportunities data field data analyst experience sritech software expertise machine learning data evaluation passionate transforming data insights,0.014574,0.0,0.61146
28,aspiring data scientist passion for data-driven decision making master of science in business analytics graduate - university of new hampshire,United States,95,aspiring data scientist passion data driven decision making master science business analytics graduate university new hampshire,0.199752,0.084835,0.611246
18,actively seeking full-time roles in bioinformatics masters in biomedical informatics and data science asu,United States,100,actively seeking full time roles bioinformatics masters biomedical informatics data science asu,0.021807,0.0,0.606219
458,data science student at uc berkeley emphasis in cognition passionate about analyzing data to develop well-informed and effective solutions.,United States,95,data science student uc berkeley emphasis cognition passionate analyzing data develop well informed effective solutions,0.038584,0.0,0.604401
171,aspiring product manager information systems graduate with a strong technical background passionate about driving innovation and solving complex problems,United States,80,aspiring product manager information systems graduate strong technical background passionate driving innovation solving complex problems,0.199559,0.072334,0.60293
456,artificial intelligence professional natural language processing deep learning reinforcement learning machine learning automation 3 years of professional experience.,United States,95,artificial intelligence professional natural language processing deep learning reinforcement learning machine learning automation years professional experience,0.055939,0.0,0.602467


In [45]:
n = 5
compare_results(n, query)

Unnamed: 0,w2v,tfidf,glove
0,aspiring data scientist,human centered software developer,data science student tackling real world challenges ai analytics actively seeking opportunities
1,aspiring data analyst,aspiring data analyst,aspiring data science professional focused data analysis machine learning data visualization actively seeking opportunities
2,aspiring data analyst,aspiring data analyst,passionate data scientist seeking exciting opportunities make impact
3,aspiring director automation data insights,aspiring data scientist,economics phd candidate penn state industrial organization dedicated solving real life challenges using innovative research
4,master science financial mathematics,aspiring director automation data insights,innovative driven professional seeking role data analyticsdata science information technology industry


## 4. FastText 
FastText is an NLP library developed by Facebook AI, known for its training speed and accuracy.  
FastText extends the Word2Vec model by incorporating subword information. Instead of treating each word as an indivisible unit, it represents a word as the sum of the vectors of its character n-grams.

Because vectors are built from subword components, FastText can handle rare words, misspellings, and out-of-vocabulary (OOV) words. This is especially useful for morphologically rich languages.
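
To make the subword idea concrete, here is a minimal sketch of how a word can be split into character n-grams with boundary markers. `char_ngrams` is a hypothetical helper written for illustration, and the 3-to-6 length range is an assumed default; the real library additionally hashes n-grams into buckets and looks up a learned vector for each one.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the wrapped word
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams

print(char_ngrams("where", min_n=3, max_n=3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

A misspelled or unseen word still shares most of its n-grams with known words, which is why a vector can be assembled for it.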

### Loading and examining FastText's pre-trained vectors

In [47]:
os.chdir(r"C:\Users\Alex Chung\Documents\the_Lab\Apziva\Potential Talent\fastText-0.9.2")

# !pip install fasttext-wheel
import fasttext

ft = fasttext.load_model('cc.en.300.bin')
ft.get_word_vector('hello')[:20]

array([ 0.15757619,  0.04378209, -0.00451272,  0.06659314,  0.07703468,
        0.00485855,  0.00819822,  0.00652403,  0.009259  ,  0.0353899 ,
       -0.02313953, -0.04918071, -0.08326425,  0.01560145,  0.25485662,
        0.03454237, -0.01074514, -0.07801886, -0.07080995,  0.07623856],
      dtype=float32)

In [48]:
ft.get_words()[:10]

[',', 'the', '.', 'and', 'to', 'of', 'a', '</s>', 'in', 'is']

In [49]:
# Creating a dictionary of fasttext word and vector representation
ft_words = ft.get_words()
ft_vectors = [ft.get_word_vector(word) for word in ft_words]
ft_dict = dict(zip(ft_words, ft_vectors))
ft_dict['hello'][:20]

array([ 0.15757619,  0.04378209, -0.00451272,  0.06659314,  0.07703468,
        0.00485855,  0.00819822,  0.00652403,  0.009259  ,  0.0353899 ,
       -0.02313953, -0.04918071, -0.08326425,  0.01560145,  0.25485662,
        0.03454237, -0.01074514, -0.07801886, -0.07080995,  0.07623856],
      dtype=float32)

In [50]:
df_ft = pd.DataFrame(ft_dict.items(), columns = ['ft_words', 'ft_vectors'])
df_ft.head(3)

Unnamed: 0,ft_words,ft_vectors
0,",","[0.12502378, -0.10790165, 0.02450176, -0.25286365, 0.1057171, -0.018444797, 0.117678985, -0.07007254, -0.040074684, -0.008026216, 0.07716709, -0.02257145, 0.089262165, -0.04868145, -0.08966993, -0.08349128, 0.019988708, 0.027310487, -0.01935611, 0.09643278, 0.08747688, 0.009819358, 0.045297798, 0.015498773, 0.14624609, 0.022521427, 0.04475486, 0.013749474, 0.057015173, 0.1764235, -0.1071837, -0.082620285, 0.017277328, 0.10895962, 0.020679405, -0.12712738, 0.2444892, 0.037465177, -0.020877417, -0.044460505, 0.053991955, 0.12817593, 0.043671336, 0.058789518, 0.09843587, 0.05393798, 0.00044774427, 0.12903026, 0.024213549, -0.012008867, -0.048041053, 0.03460624, -0.06643045, -0.032984406, -0.06247217, -0.070759535, -0.057862796, 0.17382768, 0.44483587, 0.037006963, -0.10010116, -0.0031810577, 0.035880014, -0.06850616, -0.036060803, 0.007000481, 0.13161308, -0.094532624, -0.06097764, 0.017754983, -0.07628012, -0.019208273, 0.0032959182, 0.005632444, 0.18779793, -0.0754082, -0.009459897, 0.04464071, -0.058813374, 0.024390636, -0.025075123, -0.049303107, 0.030831667, -0.035886865, -0.18844126, -0.09883648, 0.18867746, 0.04589819, -0.08158643, -0.15238018, -0.037457667, -0.06915909, 0.042720053, -0.047074586, -0.008642857, -0.21905208, -0.0064076814, 0.08774324, -0.007448593, -0.1400358, ...]"
1,the,"[-0.051744193, 0.073963955, -0.01305688, 0.044726558, -0.034320366, 0.021216884, 0.0069114864, -0.016327847, -0.018074857, -0.0019965237, -0.10204669, 0.005904886, 0.025654055, -0.002596621, -0.058556058, -0.037758686, 0.016311873, 0.01463237, -0.008759298, -0.017594784, -0.008547327, -0.007793376, -0.018278033, 0.008798243, 0.0013020262, -0.093829416, 0.013899146, 0.014892999, -0.039370976, -0.029441122, 0.009422931, -0.025228418, -0.010441078, -0.22131945, -0.022859765, -0.008935269, -0.03222265, 0.08217016, 0.002099978, 0.028173504, 0.007170668, -0.009125605, -0.035169393, -0.017804421, -0.07055402, 0.06302309, -0.009246307, -0.022327038, -0.005585512, 0.0514723, -0.03069112, 0.043648228, -0.010969555, -0.055454243, 0.008938285, -0.06726995, 0.010507602, 0.05740975, 0.009920523, -0.028267926, 0.047040958, 0.0052922955, 0.0030449405, 0.00071547925, 0.044293776, 0.006895274, -0.033405542, 0.009057372, -0.0075827073, 0.006601395, 0.09174107, 0.031111507, 0.05429111, 0.028172497, -0.019965246, -0.033377998, 0.0052875523, 0.03638041, 0.22493297, 0.09276069, -0.012265386, 0.008560304, -0.059897833, 0.06762706, 0.04024453, 0.0011667766, 0.046392195, -0.043697126, 0.005942209, 0.09172087, -0.04124823, -0.015125338, -0.023081664, 0.009499152, 0.05883145, 0.027860444, 0.06469925, -0.056754317, -0.012956021, 0.047435097, ...]"
2,.,"[0.03423236, -0.08014102, 0.116187684, -0.39683825, -0.014666078, -0.05333376, 0.0606309, -0.105187, 0.0004822225, -0.036015246, 0.025738074, 0.017741874, 0.028525142, 0.0036812234, -0.041895356, 0.23742425, 0.0073372344, -0.030286761, -0.05776126, -0.061607026, 0.0064677577, 0.0054974114, 0.061985064, -0.0035603195, -0.107664384, -0.10458943, 0.06542359, -0.00065885123, 0.023493404, 0.044855215, 0.0012925226, -0.049584012, -0.0029731453, 0.13319224, 0.031394668, -0.015184948, 0.07726878, -0.3238144, -0.008129742, 0.01077384, -0.0478446, 0.10366743, -0.089419544, 0.14941524, 0.5012751, -0.18421888, -0.025935497, 0.07800455, -0.029555596, 0.059735887, 0.04384649, -0.047654208, -0.03593738, -0.06039133, 0.037578516, -0.045454044, -0.13247262, -0.05950857, -0.09992922, -0.08243029, -0.09629086, -0.08551892, -0.024352599, 0.50798106, -0.027145516, -0.08863297, -0.015968971, -0.050326522, -0.029528841, -0.01774156, 0.38464957, 0.10462516, 0.16921097, -0.011959946, 0.046539865, -0.08007814, 0.012553597, 0.05216411, 0.10962657, 0.20337108, 0.0128176045, 0.0064291875, -0.06376205, 0.02083857, 0.12471656, 0.0043035937, 0.08625324, 0.113382444, 0.03137607, 0.087006256, 0.058067933, 0.013879853, 0.112878084, 0.0039297733, -0.19282798, -0.1918144, -0.22638488, 0.031872883, -0.010841944, -0.057225518, ...]"


In [51]:
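# Zero-vector fallback for words missing from ft_dict.
# ft.get_word_vector(word) can build a vector for any word from its character n-grams,
# so this fallback is only needed because we look words up in the ft_dict vocabulary.
oov_word = np.zeros((300,))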

In [52]:
# Create a FastText vectorized representation for each job title in our dataframe
job_titles = df.job_title_cleaned

doc_sent_vec_ft = []

for sentences in job_titles:
    word_vec_ft = []
    for word in sentences.split():
        if word in ft_dict:
            vectors = ft_dict[word]
            word_vec_ft.append(vectors)
        else:
            word_vec_ft.append(oov_word)
    word_vec_mean_ft = sum(word_vec_ft) / len(word_vec_ft) # returning a mean for each job title
    doc_sent_vec_ft.append(word_vec_mean_ft) # returning a list for all job titles

Let's create functions to compute similarity scores for our search terms and compare the different NLP methods.

In [53]:
# Create a FastText vectorized representation for each query
def q_sent_vec_ft(query):
    q_word_vec_ft = []

    for word in query.split():
        if word in ft_dict:
            vectors = ft_dict[word]
            q_word_vec_ft.append(vectors)
        else:
            q_word_vec_ft.append(oov_word)

    # Average once, after all word vectors have been collected
    q_word_vec_mean_ft = sum(q_word_vec_ft) / len(q_word_vec_ft)

    # Return a one-element list so the result can be passed directly to cosine_similarity
    return [q_word_vec_mean_ft]

def get_fasttext_query_similarity(doc_sent_vec_ft, query):
    """
    query_fasttext: processing the query
    doc_sent_vec_ft: FastText embedding for all docs
    query: query doc

    return: cosine similarity between query and all docs

    """
    query_fasttext = q_sent_vec_ft(query)
    
    cos_sim_fasttext = cosine_similarity(query_fasttext, doc_sent_vec_ft).flatten()
    
    return cos_sim_fasttext


def get_all_similarity(query):

    #Fasttext similarity
    cos_sim_fasttext = get_fasttext_query_similarity(doc_sent_vec_ft, query)
    df['fasttext_fit'] = cos_sim_fasttext

    # original TFIDF similarity and Word2Vec Similarity for comparison
    cos_sim = get_tf_idf_query_similarity(vectorizer, docs_tfidf, query) 
    df['tfidf_fit'] = cos_sim

    cos_sim_w2v = get_w2v_query_similarity(document_word_embeddings, query)
    df['w2v_fit'] = cos_sim_w2v

    cos_sim_glove = get_glove_query_similarity(doc_sent_vec, query)
    df['glove_fit'] = cos_sim_glove
    
    return df

In [54]:
query = 'Aspiring human resources'
df = get_all_similarity(query)
top_candidates(n = 10, by = 'fasttext_fit', ascending = False, min_con = 0)

id,job_title,location,screening_score,job_title_cleaned,w2v_fit,tfidf_fit,glove_fit,fasttext_fit
1001,human centered software developer,United States,35,human centered software developer,0.095687,0.380939,0.603374,0.678345
595,scientist ambitious living to learn everyday,United States,85,scientist ambitious living learn everyday,0.043255,0.0,0.582886,0.494642
535,computer science student with skills in python and machine learning,United States,90,computer science student skills python machine learning,0.100376,0.0,0.55296,0.484984
456,artificial intelligence professional natural language processing deep learning reinforcement learning machine learning automation 3 years of professional experience.,United States,95,artificial intelligence professional natural language processing deep learning reinforcement learning machine learning automation years professional experience,0.055939,0.0,0.602467,0.484207
710,natural language engineer at pryon,United States,80,natural language engineer pryon,0.09412,0.0,0.511562,0.478237
1081,master of science artificial intelligence,United States,30,master science artificial intelligence,0.159533,0.0,0.568447,0.466913
13,machine learning artificial intelligence and data science enthusiast,United States,100,machine learning artificial intelligence data science enthusiast,0.046439,0.0,0.591342,0.463759
18,actively seeking full-time roles in bioinformatics masters in biomedical informatics and data science asu,United States,100,actively seeking full time roles bioinformatics masters biomedical informatics data science asu,0.021807,0.0,0.606219,0.462713
426,master of science in analytics at georgia institute of technology aspiring data scientist,United States,95,master science analytics georgia institute technology aspiring data scientist,0.150712,0.117938,0.590222,0.4625
267,machine learning engineer genworth masters in computer sciences,United States,30,machine learning engineer genworth masters computer sciences,0.06561,0.0,0.47487,0.457715


In [55]:
n = 5
compare_results(n, query)

Unnamed: 0,w2v,tfidf,glove,fasttext
0,aspiring data scientist,human centered software developer,data science student tackling real world challenges ai analytics actively seeking opportunities,human centered software developer
1,aspiring data analyst,aspiring data analyst,aspiring data science professional focused data analysis machine learning data visualization actively seeking opportunities,scientist ambitious living learn everyday
2,aspiring data analyst,aspiring data analyst,passionate data scientist seeking exciting opportunities make impact,computer science student skills python machine learning
3,aspiring director automation data insights,aspiring data scientist,economics phd candidate penn state industrial organization dedicated solving real life challenges using innovative research,artificial intelligence professional natural language processing deep learning reinforcement learning machine learning automation years professional experience
4,master science financial mathematics,aspiring director automation data insights,innovative driven professional seeking role data analyticsdata science information technology industry,natural language engineer pryon


**Observations:**

FastText's results are broadly in line with the previous NLP methods, and they do fairly well. There is some semantic understanding beyond TF-IDF, but so far the difference is not large.
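
One way to put a number on how closely two methods agree is the Jaccard overlap of their top-n lists. This is a minimal sketch using the top-5 Word2Vec and TF-IDF titles copied from the comparison table above; `jaccard_overlap` is a hypothetical helper, not part of the notebook.

```python
def jaccard_overlap(list_a, list_b):
    """Share of distinct items two lists have in common (0 = disjoint, 1 = same set)."""
    set_a, set_b = set(list_a), set(list_b)
    if not set_a and not set_b:
        return 1.0  # two empty lists are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)

# Top-5 titles for the query 'Aspiring human resources'
w2v_top = [
    "aspiring data scientist",
    "aspiring data analyst",
    "aspiring data analyst",
    "aspiring director automation data insights",
    "master science financial mathematics",
]
tfidf_top = [
    "human centered software developer",
    "aspiring data analyst",
    "aspiring data analyst",
    "aspiring data scientist",
    "aspiring director automation data insights",
]

print(jaccard_overlap(w2v_top, tfidf_top))  # 0.6
```

The same function can be applied to any two columns returned by `compare_results`. It ignores rank order and duplicate titles, so it is only a coarse measure of agreement.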

None of the methods so far are large language models. Next we look at BERT, the first LLM in the group.

## 5. BERT

BERT (Bidirectional Encoder Representations from Transformers) is a large language model developed by Google that uses a Transformer encoder architecture. It revolutionized NLP by being the first deeply bidirectional, unsupervised language representation model. 

Features:
- Context-sensitive (bidirectional): Unlike the previous models, BERT processes text bidirectionally, considering the context from both the left and right of a word simultaneously. This allows it to generate different word embeddings for the same word based on its usage, effectively handling polysemy.
- Pre-training and fine-tuning: BERT is first pre-trained on a massive amount of unlabeled text and can then be fine-tuned on a smaller labeled dataset for a specific task.

Limitation:
BERT is an LLM, but because it uses an encoder-only architecture, it is less effective for text generation than decoder-based models such as GPT.

### Loading BERT's pre-trained model and creating functions to encode our texts


In [None]:
os.chdir("..")

from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

# Load the tokenizer and the model from HuggingFace Hub
# (all-MiniLM-L6-v2 is a compact BERT-style sentence-embedding model)
bert_tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
bert_model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')

In [60]:
# Mean Pooling - Take average of all tokens
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output.last_hidden_state #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

#Encode text
def encode(texts):
    # Tokenize sentences
    encoded_input = bert_tokenizer(texts, padding=True, truncation=True, return_tensors='pt')

    # Compute token embeddings
    with torch.no_grad():
        model_output = bert_model(**encoded_input, return_dict=True)

    # Perform pooling
    embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

    # Normalize embeddings
    embeddings = F.normalize(embeddings, p=2, dim=1)
    
    return embeddings

In [None]:
# get bert embedding for all docs
titles_list = df['job_title_cleaned'].to_list()
doc_emb = encode(titles_list)

Let's create functions to get similarity scores to our search terms and compare different NLP methods

In [62]:
def get_bert_query_similarity(doc_emb, query):
    """
    query_bert: processing the query
    doc_emb: bert embedding for all docs
    query: query doc

    return: cosine similarity between query and all docs

    """
    query_bert = encode(query)
    
    # Dot product between the query and all document embeddings;
    # encode() L2-normalizes its output, so this equals cosine similarity
    cos_sim_bert = torch.mm(query_bert, doc_emb.transpose(0, 1))[0].cpu().tolist()
    
    return cos_sim_bert

def get_all_similarity(query):
    
    #Bert similarity
    cos_sim_bert = get_bert_query_similarity(doc_emb, query)
    df['bert_fit'] = cos_sim_bert

    #Fasttext similarity
    cos_sim_fasttext = get_fasttext_query_similarity(doc_sent_vec_ft, query)
    df['fasttext_fit'] = cos_sim_fasttext

    # original TFIDF similarity and Word2Vec Similarity for comparison
    cos_sim = get_tf_idf_query_similarity(vectorizer, docs_tfidf, query) 
    df['tfidf_fit'] = cos_sim

    cos_sim_w2v = get_w2v_query_similarity(document_word_embeddings, query)
    df['w2v_fit'] = cos_sim_w2v

    cos_sim_glove = get_glove_query_similarity(doc_sent_vec, query)
    df['glove_fit'] = cos_sim_glove
    
    return df
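
`get_bert_query_similarity` uses a plain matrix product rather than `cosine_similarity`. This works because `encode` L2-normalizes every embedding, and the dot product of unit-length vectors is their cosine similarity. A small check on random stand-in vectors (the shapes and the `l2_normalize` helper are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
query_vec = rng.normal(size=(1, 8))  # stand-in for one query embedding
doc_vecs = rng.normal(size=(4, 8))   # stand-in for four document embeddings

def l2_normalize(x):
    """Scale each row of x to unit Euclidean length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Cosine similarity computed from its definition
cosine = (query_vec @ doc_vecs.T) / (
    np.linalg.norm(query_vec, axis=1, keepdims=True) * np.linalg.norm(doc_vecs, axis=1)
)

# Dot product of L2-normalized vectors
dot_of_normalized = l2_normalize(query_vec) @ l2_normalize(doc_vecs).T

print(np.allclose(cosine, dot_of_normalized))  # True
```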

In [63]:
query = 'seeking human resources'
df = get_all_similarity(query)
top_candidates(n = 5, by = 'bert_fit', ascending = False, min_con = 0)

id,job_title,location,screening_score,job_title_cleaned,w2v_fit,tfidf_fit,glove_fit,fasttext_fit,bert_fit
117,hr recruiter,United States,80,hr recruiter,0.031891,0.0,0.299885,0.179079,0.587304
632,seeking full-time opportunities research assistant iub mscs indiana university bloomington,United States,85,seeking full time opportunities research assistant iub mscs indiana university bloomington,0.241914,0.065065,0.686044,0.404845,0.483924
628,actively looking for a data analyst opportunity in usa.,United States,85,actively looking data analyst opportunity usa,0.112534,0.0,0.710262,0.397597,0.475045
755,actively seeking data analyst jobs,United States,80,actively seeking data analyst jobs,0.111464,0.119038,0.729096,0.498957,0.457845
1124,m.s. in statistics from the university of illinois in urbana-champaign. actively seeking employment as a statistician or data analyst.,United States,30,m s statistics university illinois urbana champaign actively seeking employment statistician data analyst,0.04091,0.059955,0.60911,0.27558,0.450902


In [64]:
compare_results(n, query)

Unnamed: 0,w2v,tfidf,glove,fasttext,bert
0,pursuing data analytics sjsu,human centered software developer,passionate data scientist seeking exciting opportunities make impact,human centered software developer,hr recruiter
1,seeking crime analyst roles,actively seeking data analyst jobs,data science student tackling real world challenges ai analytics actively seeking opportunities,experienced programmer seeking opportunities,seeking full time opportunities research assistant iub mscs indiana university bloomington
2,seeking full time opportunities research assistant iub mscs indiana university bloomington,experienced programmer seeking opportunities,passionate driven bioinformatics graduate san jose state university actively seeking full time opportunities making difference computational biology research,innovative driven professional seeking role data analyticsdata science information technology industry,actively looking data analyst opportunity usa
3,trying understand world data philosophy,seeking crime analyst roles,innovative driven professional seeking role data analyticsdata science information technology industry,passionate data scientist seeking exciting opportunities make impact,actively seeking data analyst jobs
4,seeking data science mle aiml roles masters cs uf ml research uf intelligent critical care center,mds rice seeking full time data scientistanalyst,aspiring data science professional focused data analysis machine learning data visualization actively seeking opportunities,artificial intelligence professional natural language processing deep learning reinforcement learning machine learning automation years professional experience,m s statistics university illinois urbana champaign actively seeking employment statistician data analyst



**Observation:**

BERT does better with a difficult search term. It correctly identified "hr recruiter" as the closest match to "seeking human resources". While the other NLP methods perform reasonably well with exact or close matches, BERT provides **better semantic understanding** than the previous methods.

Finally, let's take a look at a Generative AI method. 

## 6. Generative AI
Generative AI is a major shift from older NLP techniques such as TF-IDF, Word2Vec, GloVe, and even BERT. While the older methods focus on analyzing existing data, generative AI uses large language models (LLMs) designed to generate new content.

**Limitations:**

Large language models can suffer from inaccuracies (hallucinations), biases inherited from their training data, and high computational costs, especially for fine-tuning or training.

To reduce latency and computational cost, we first run the search term through the previous NLP methods and collect the top titles from each. These curated titles are then passed to the GenAI model in a prompt that asks it to rank them and explain its reasoning.

In [None]:
# Function for a curated list of job titles from the search term
def get_titles(n = 5, query=query):
    df = get_all_similarity(query)
    
    df_compare = compare_results(n, query)
    
    titles = []
    for col in df_compare.columns:
        titles.extend(df_compare[col].to_list())
    
    titles = list(set(titles))
    titles = [x for x in titles if x != 0]
    return titles
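
`get_titles` deduplicates with `list(set(titles))`, which does not preserve the order in which titles were collected, so the order of titles in the prompt can differ between runs. If a stable order matters, one option is an order-preserving dedupe. A minimal sketch, where `unique_in_order` is a hypothetical helper and not part of the notebook:

```python
def unique_in_order(items):
    """Drop duplicates while keeping the first occurrence of each item in place."""
    # dict keys are unique and keep insertion order
    return list(dict.fromkeys(items))

titles = ["hr recruiter", "associate manager", "hr recruiter", "program manager"]
print(unique_in_order(titles))
# ['hr recruiter', 'associate manager', 'program manager']
```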

### Importing pre-trained Mistral LLM model
We will be using the Mistral-7B-Instruct-v0.2.Q4_K_M.gguf model, an open-source LLM designed to generate new content and fine-tuned specifically to follow instructions. The ".gguf" format, used by llama.cpp, is optimized for running efficiently on consumer hardware; Q4_K_M denotes a 4-bit quantization of the weights.

Mistral 7B is built on a transformer architecture with Grouped-Query Attention (GQA) for faster inference. The original v0.1 release also used Sliding-Window Attention (SWA); v0.2 instead supports a 32k-token context window, which matches the n_ctx_train value in the load log below. The model learns patterns from its training data and generates text by repeatedly predicting the next token.

Because Mistral 7B is open-source, it can be deployed on a variety of hardware and is more accessible than many proprietary or closed models.

In [None]:
import math
import re  # used by generate_chunk_ranking to strip list numbering
from typing import List, Iterable, Tuple

from llama_cpp import Llama
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig

# Reading in the data
path = os.getcwd()
GGUF_PATH = r"C:\Users\Alex Chung\Documents\the_Lab\Apziva\Potential Talent\mistral-7b-instruct-v0.2.Q4_K_M.gguf"  

In [None]:

# Create a single global Llama instance (don’t recreate per call)
llm = Llama(
    model_path=GGUF_PATH,
    n_ctx=4096,           # context window; keep generous for long lists
    n_threads=os.cpu_count(),  # use all CPU threads
    n_gpu_layers=0,       # pure CPU
    verbose=False
)

# Light, strong CPU embedding model (no GPU needed)
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")


llama_context: n_ctx_per_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized


In [75]:
def generate_chunk_ranking(query: str, options_chunk: list, max_tokens: int = 512) -> list:
    """
    Reorders a chunk of job titles based on semantic similarity to a search term using llama-cpp-python,
    and also provides explanations for why each title was ranked that way.

    Args:
        query (str): Job search term
        options_chunk (list of str): List of job titles to rank
        max_tokens (int): Max tokens for the LLM output

    Returns:
        list of dict: Each dict has {"title": str, "explanation": str}
    """
    # --- Build prompt ---
    options_text = "\n".join(options_chunk)
    prompt = (
        "You are an expert career coach.\n"
        "I will give you a list of job titles and a search term.\n"
        "Your task: reorder ONLY the given titles from most to least semantically similar to the search term.\n"
        "For each title, provide a one-sentence explanation of why it fits that position.\n"
        "Do not invent, remove, or modify titles. Output exactly one title per line, followed by a colon and explanation.\n\n"
        f"Search term: {query}\n"
        f"Job titles:\n{options_text}\n\n"
        "Return the reordered list in this format:\n"
        "Title 1: explanation...\n"
        "Title 2: explanation...\n"
        "Title 3: explanation...\n"
    )

    # --- Generate completion ---
    out = llm(
        prompt,
        max_tokens=max_tokens,
        temperature=0.3,
        top_p=0.9,
        repeat_penalty=1.1,
        stop=["</s>"]
    )

    raw_text = out["choices"][0]["text"].strip()
    # print("\n=== RAW OUTPUT ===\n", raw_text, "\n==================\n")

    results = []

    # Normalize options for fuzzy matching
    allowed = {opt.lower().strip(): opt for opt in options_chunk}

    for ln in raw_text.splitlines():
        ln = ln.strip()
        if not ln:
            continue

        # Remove numbering like "1. " at start
        ln = re.sub(r'^\d+\.\s*', '', ln)

        # Try to split into "title: explanation"
        if ":" in ln:
            title_part, explanation = ln.split(":", 1)
            title_part = title_part.strip(" '\"*`()").strip()
            explanation = explanation.strip()
        else:
            title_part, explanation = ln.strip(" '\"*`()").strip(), ""

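        # Normalize and fuzzy match
        normalized = title_part.lower().strip()
        if not normalized:
            # An empty string is a substring of every key and would match the first title
            continue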

        # Try exact match first
        if normalized in allowed:
            results.append({
                "title": allowed[normalized],
                "explanation": explanation
            })
        else:
            # Try partial/fuzzy containment
            for key in allowed:
                if key in normalized or normalized in key:
                    results.append({
                        "title": allowed[key],
                        "explanation": explanation
                    })
                    break

    return results


In [77]:
n = 5
query = "human resources manager"
df = get_all_similarity(query)
df_compare = compare_results(n, query)
print(query)
df_compare

human resources manager


Unnamed: 0,w2v,tfidf,glove,fasttext,bert
0,performance marketing manager,human centered software developer,research assistant penn state seeking opportunities data field data analyst experience sritech software expertise machine learning data evaluation passionate transforming data insights,human centered software developer,hr recruiter
1,mathnasium center director,associate manager,senior technical support representative n able expertise cyber security project management,performance marketing manager,associate manager
2,natural language engineer pryon,program manager,actively looking full time opportunities graduate student business analytics artificial intelligence utd data enthusiast ex associate software engineer cgi,operations manager assistant program coordinator,operations manager assistant program coordinator
3,data analystsis administrator,performance marketing manager,driving business technology integration results driven management professional project operations manager business intelligence process optimization ai enabled decision making data analytics automation,data scientist data analyst civil engineer railways finance economics machine learning engineer artificial intelligence,human centered software developer
4,senior data engineer,operations manager assistant program coordinator,ms business analytics university tampa graduate research assistant president u t gold actively looking full time data analytics data science business intelligence business analyst positions,associate manager,engineering supervisor


In [78]:
options = get_titles(n, query)
options

['operations manager assistant program coordinator',
 'program manager',
 'senior data engineer',
 'engineering supervisor',
 'natural language engineer pryon',
 'ms business analytics university tampa    graduate research assistant president u t gold actively looking full time data analytics data science business intelligence business analyst positions',
 'research assistant penn state seeking opportunities data field data analyst experience sritech software expertise machine learning data evaluation passionate transforming data insights',
 'human centered software developer',
 'driving business technology integration results driven management professional project operations manager business intelligence process optimization ai enabled decision making data analytics automation',
 'hr recruiter',
 'associate manager',
 'mathnasium center director',
 'performance marketing manager',
 'actively looking full time opportunities graduate student business analytics artificial intelligence ut

In [79]:
chunk_results = []
for chunk in chunk_list(options, len(options)):
    ranked_chunk = generate_chunk_ranking(query, chunk)
    chunk_results.extend(ranked_chunk)
    
df_genai = pd.DataFrame(chunk_results)
df_genai

Unnamed: 0,title,explanation
0,hr recruiter,specifically responsible for recruiting new employees for a company.
1,associate manager,assists in managing day-to-day operations within an organization or department.
2,performance marketing manager,focuses on using digital marketing strategies to improve a company's performance and reach its target audience more effectively.
3,program manager,"oversees the planning, coordination, and implementation of projects from start to finish."
4,operations manager assistant program coordinator,assists in managing daily operations within an organization or department.
5,ms business analytics university tampa graduate research assistant president u t gold actively looking full time data analytics data science business intelligence business analyst positions,"assists researchers in their work by conducting experiments, collecting data, and analyzing results."
6,research assistant penn state seeking opportunities data field data analyst experience sritech software expertise machine learning data evaluation passionate transforming data insights,uses statistical methods and data analysis techniques to interpret and understand complex data.
7,ms business analytics university tampa graduate research assistant president u t gold actively looking full time data analytics data science business intelligence business analyst positions,"evaluates business data to identify trends, inefficiencies, and opportunities for improvement."
8,data scientist data analyst civil engineer railways finance economics machine learning engineer artificial intelligence,"applies scientific methods, processes, algorithms, and systems to extract knowledge from structured and unstructured data."
9,senior technical support representative n able expertise cyber security project management,provides advanced technical assistance and support to clients and internal teams.


GenAI correctly orders the job titles and gives a good explanation for each choice. Let's test it on different job search terms and see how it compares with the other NLP methods.
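
One way to put a number on that comparison, instead of only reading the tables side by side, is to measure how many of a method's top-n titles also appear in the GenAI list. The sketch below is a minimal illustration: `overlap_at_n` is a hypothetical helper that is not part of this notebook, and treating the GenAI list as the reference is an assumption, not a ground truth. The example titles are copied from the "human resources manager" tables above, using only the first five GenAI rows.

```python
def overlap_at_n(reference_titles, method_titles, n=5):
    """Fraction of a method's top-n titles that also appear in the reference list."""
    # De-duplicate while keeping rank order, then keep the first n titles.
    top = list(dict.fromkeys(method_titles))[:n]
    if not top:
        return 0.0
    reference = set(reference_titles)
    return sum(title in reference for title in top) / len(top)


# First five GenAI rows for "human resources manager" (assumed reference).
genai_top = [
    "hr recruiter",
    "associate manager",
    "performance marketing manager",
    "program manager",
    "operations manager assistant program coordinator",
]
# Top five titles from the bert and w2v columns of the comparison table.
bert_top = [
    "hr recruiter",
    "associate manager",
    "operations manager assistant program coordinator",
    "human centered software developer",
    "engineering supervisor",
]
w2v_top = [
    "performance marketing manager",
    "mathnasium center director",
    "natural language engineer pryon",
    "data analystsis administrator",
    "senior data engineer",
]

print(overlap_at_n(genai_top, bert_top))  # 0.6
print(overlap_at_n(genai_top, w2v_top))   # 0.2
```

A high overlap only says that a method agrees with GenAI, not that either ranking is correct, so this is best read alongside the explanations rather than in place of them.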

In [82]:
jobs = [
    "UX Designer", "Cybersecurity Analyst",
    "Cloud Architect", "Full Stack Developer", "Graphic Designer", "Technical Writer", "Sales Representative"
]

In [85]:
for query in jobs:
    # Candidate titles for this query, de-duplicated before prompting
    options = get_titles(n, query)
    options = list(set(options))

    # Rank the candidates with the LLM, chunk by chunk
    chunk_results = []
    for chunk in chunk_list(options, len(options)):
        ranked_chunk = generate_chunk_ranking(query, chunk)
        chunk_results.extend(ranked_chunk)

    df_genai = pd.DataFrame(chunk_results)
    df_compare = compare_results(n, query)

    # Show the per-method top-n table, then the GenAI ranking
    print(query)
    display(df_compare)
    display(df_genai.drop_duplicates(subset="title"))

UX Designer


Unnamed: 0,w2v,tfidf,glove,fasttext,bert
0,ai engineer,informatics student ux designer data driven problem solver,analytics dashboard designer report development,informatics student ux designer data driven problem solver,passionate design love work others develop create products professional create impact
1,software engineer,analytics dashboard designer report development,software developer,technical product owner agile methodologies devops platform engineering data healthcare,instructional technologist walkme developer years experience awarded walkme global top designers
2,software engineer,data analyst product designer bridging creativity analytics ex etsy rewriting code ucla ece ucla vmg alumna,software developer,devops engineer automation enthusiast gen ai advocate problem solver innovator,mechanical engineer
3,software engineer rbc vghc,0,instructional technologist walkme developer years experience awarded walkme global top designers,devops assistant guglielmo associates,software engineer
4,software engineer,0,software developer paycom,lecturer informatics analytics uncg,software engineer


Unnamed: 0,title,explanation
0,informatics student ux designer data driven problem solver,designs and improves user experience in digital products or services.
1,data analyst product designer bridging creativity analytics ex etsy rewriting code ucla ece ucla vmg alumna,bridges creativity with analytics to design effective data dashboards and reports.
2,instructional technologist walkme developer years experience awarded walkme global top designers,creates educational technology solutions for engaging learning experiences.
4,analytics dashboard designer report development,specializes in designing interactive data visualizations for clear insights.
6,software engineer,builds and maintains software systems and applications.
7,technical product owner agile methodologies devops platform engineering data healthcare,manages the development process of a product using Agile methodologies.
8,devops engineer automation enthusiast gen ai advocate problem solver innovator,automates processes between software development and IT operations.
9,ai engineer,"designs, develops, and implements artificial intelligence systems."
11,lecturer informatics analytics uncg,teaches courses in the field of informatics or computer science.
12,mechanical engineer,"designs, builds, and tests mechanical systems using principles of physics and mathematics."


Cybersecurity Analyst


Unnamed: 0,w2v,tfidf,glove,fasttext,bert
0,cybersecurity,cybersecurity,cybersecurity,analyst,cybersecurity
1,computer analyst,analyst,analyst,databusiness analyst,computer analyst
2,databusiness analyst,analyst,analyst,analyst capgemini,systems analyst
3,cybersecurity professional todyl casp mscsia wgu,bachelors business analytics information systems cybersecurity concentration,databusiness analyst,analyst,junior security analyst unlv information technology
4,systems analyst,cybersecurity professional todyl casp mscsia wgu,senior analyst,financial analyst,bachelors business analytics information systems cybersecurity concentration


Unnamed: 0,title,explanation
0,cybersecurity,"This term is the search term itself, and it's a job title as well. It refers to professionals who protect computer systems, networks, and data from digital attacks."
1,analyst,"A cybersecurity analyst is responsible for monitoring networks and systems for security breaches, analyzing potential threats, and implementing countermeasures. They are experts in information security and risk management."
2,junior security analyst unlv information technology,"Junior Security Analysts assist Cybersecurity Analysts in securing computer systems and networks. They may perform tasks such as monitoring logs, analyzing network traffic, and installing security software."
3,senior analyst,"A Senior Analyst is an experienced professional who provides guidance to less experienced team members. In the context of cybersecurity, a Senior Analyst might lead a team of Cybersecurity Analysts or consult with clients on security issues."
5,analyst capgemini,"An Analyst at Capgemini is responsible for collecting, processing, and analyzing data to help organizations make informed decisions. While not specifically focused on cybersecurity, their analytical skills can be applied to identifying and addressing security threats."
6,computer analyst,"A Computer Analyst designs, develops, tests, and maintains computer systems and applications. They may also provide support for network security and implement security measures to protect against unauthorized access or data breaches."
7,systems analyst,Systems Analysts study an organization's current computer systems and propose improvements to increase efficiency and effectiveness. Their knowledge of both business processes and technology can help them identify potential vulnerabilities and recommend solutions to address them.
8,bachelors business analytics information systems cybersecurity concentration,"This individual has a Bachelor's degree with a focus on Business Analytics and Information Systems, including Cybersecurity. They may work as analysts in"


Cloud Architect


Unnamed: 0,w2v,tfidf,glove,fasttext,bert
0,software engineer,ms student ai long island university certified pega system architect csa pega certified senior system architect cssa certified aws cloud practitioner,full stack developer specializing aiml cloud computing,former cloud support intern aws bachelor computer science,former cloud support intern aws bachelor computer science
1,software engineer,solution architect data scientist process engineer,analytics engineer,data engineer cloud devops advocate finance building scalable data solutions python aws agile business analyst business intelligence developer iab certified,ms student ai long island university certified pega system architect csa pega certified senior system architect cssa certified aws cloud practitioner
2,software engineer rbc vghc,certified aws solutions architect franciscan health,data engineer cloud devops advocate finance building scalable data solutions python aws agile business analyst business intelligence developer iab certified,full stack developer specializing aiml cloud computing,ds grad northeastern university aws cloud engineer looking internship opportunities
3,software engineer,contact center architect bank america,software developer,technical product owner agile methodologies devops platform engineering data healthcare,full stack developer specializing aiml cloud computing
4,software engineer,automation engineer specialist software developer process optimization rpa expert pos cloud solutions architect problem solver innovator,software developer,data engineering cloud migration aws gcp data warehousing snowflake,certified data engineer building scalable etl pipelines cloud big data specialist aws azure snowflake driving data driven decisions ai analytics


Unnamed: 0,title,explanation
0,ms student ai long island university certified pega system architect csa pega certified senior system architect cssa certified aws cloud practitioner,has a certification in implementing AWS cloud services.
1,ds grad northeastern university aws cloud engineer looking internship opportunities,"designs, deploys, and manages applications within the AWS environment."
2,automation engineer specialist software developer process optimization rpa expert pos cloud solutions architect problem solver innovator,"designs, builds, and implements cloud solutions for organizations."
3,software developer,develops software applications that run in a cloud environment.
6,certified data engineer building scalable etl pipelines cloud big data specialist aws azure snowflake driving data driven decisions ai analytics,"builds scalable data solutions using cloud technologies like AWS, Azure, and Snowflake."
7,contact center architect bank america,designs and builds contact center solutions that run on the cloud.
11,technical product owner agile methodologies devops platform engineering data healthcare,designs and builds the underlying infrastructure that supports cloud applications and services.
12,solution architect data scientist process engineer,"uses data analysis techniques to extract insights from large datasets, which is essential for designing effective cloud solutions."
14,analytics engineer,designs and builds analytics systems that process and analyze data in the cloud.


Full Stack Developer


Unnamed: 0,w2v,tfidf,glove,fasttext,bert
0,full stack developer,full stack developer,full stack developer,full stack developer,full stack developer
1,full stack developer specializing aiml cloud computing,full stack developer specializing aiml cloud computing,full stack software developer,full stack software developer,full stack software developer
2,full stack developer mechanical engineer encoding future,full stack software developer,software developer,full stack developer specializing aiml cloud computing,data scientist full stack software engineer
3,full stack software developer,full stack developer mechanical engineer encoding future,software developer,full stack developer mechanical engineer encoding future,software engineer ms computer science backend full stack development ex rakuten
4,full stack developer react spring boot firebase building scalable web apps,full stack developer m s computer science georgia institute technology b s computer science university akron,software developer paycom,data scientist full stack software engineer,full stack developer specializing aiml cloud computing


Unnamed: 0,title,explanation
0,full stack developer,"This title is an exact match for the search term. A full stack developer is a software engineer who has expertise in both front-end and back-end development, enabling them to build complete web applications from scratch."
2,data scientist full stack software engineer,"While not an exact match, this title is semantically close. Data scientists often work on developing and implementing machine learning models, which can involve both front-end (data visualization) and back-end (model training) components, making them similar to full stack developers."
3,software engineer ms computer science backend full stack development ex rakuten,"This title is semantically similar as it includes the term ""full stack development,"" indicating expertise in both front-end and back-end development. The ""MS Computer Science"" and ""Ex Rakuten"" parts are additional qualifications not directly related to full-stack development."
4,software developer,"A software developer is a broader term that can include full stack developers, but not all software developers have expertise in both front-end and back-end development."
5,full stack developer react spring boot firebase building scalable web apps,"This title is similar as it also refers to a developer with expertise in full-stack development, specifically using technologies like React, Spring Boot, and Firebase for building scalable web applications."
7,full stack software developer,"This title is identical to the search term with the addition of the term ""software"" which does not change the meaning."
8,full stack developer mechanical engineer encoding future,"While a mechanical engineer typically works on physical systems, this title includes the term ""full stack developer,"" indicating expertise in both front-end and back-end development. The ""encoding future"" part is likely an additional qualification or company name."


Graphic Designer


Unnamed: 0,w2v,tfidf,glove,fasttext,bert
0,software engineer,analytics dashboard designer report development,software developer,analytics dashboard designer report development,software engineer
1,software engineer,informatics student ux designer data driven problem solver,software developer,senior survey design engineer newtonx,software engineer
2,software engineer rbc vghc,data analyst product designer bridging creativity analytics ex etsy rewriting code ucla ece ucla vmg alumna,analytics dashboard designer report development,writer researcher developer,software engineer
3,software engineer,0,instructional technologist walkme developer years experience awarded walkme global top designers,engineer grid analytics systems engineering design,software engineer
4,software engineer,0,software developer scheme designers inc developed features aircraft configurators,software developer scheme designers inc developed features aircraft configurators,software engineer


Unnamed: 0,title,explanation
0,informatics student ux designer data driven problem solver,"UX Designers focus on enhancing user experience through the design of intuitive interfaces and visual elements, making them most semantically similar to Graphic Designers as they both deal with designing visual components for digital platforms."
1,data analyst product designer bridging creativity analytics ex etsy rewriting code ucla ece ucla vmg alumna,"Product Designers create the look, feel, and functionality of products, including digital ones, making them a good fit as they share many responsibilities with Graphic Designers."
2,instructional technologist walkme developer years experience awarded walkme global top designers,"Although not strictly related to graphic design, Instructional Technologists use visuals and multimedia to develop educational content, which is similar to the role of a Graphic Designer in creating engaging visuals for various applications."
3,analytics dashboard designer report development,"Analytics Dashboard Designers focus on creating visually appealing and functional data displays, which is an essential aspect of graphic design in presenting complex information in an easily digestible format."
4,software developer,"While primarily focused on coding, Software Developers at Scheme Designers Inc also create user interfaces and visual designs for their software products, making them somewhat similar to Graphic Designers."
5,senior survey design engineer newtonx,"Although not directly related to graphic design, this title is semantically close due to its emphasis on designing user interfaces for surveys, which involves creating visually appealing and functional forms, similar to what Graphic Designers do."
6,software engineer,"While primarily focused on coding, Software Engineers at RBC VGHC also collaborate with UX/UI designers to create visually appealing and functional software interfaces, making their work semantically related to that of a Graphic Designer."


Technical Writer


Unnamed: 0,w2v,tfidf,glove,fasttext,bert
0,researcher writer filmmaker,writer researcher developer,writer researcher developer,writer researcher developer,writer researcher developer
1,computational linguist,researcher writer filmmaker,researcher writer filmmaker,researcher writer filmmaker,technology consultant
2,computational linguist,technical project manager data scientist data analyst big data analytics senior geoscientist,software engineer,researcher,computer scientist
3,computational scientist,ml engineer ai consultant writer instructional associate reinforcement learning georgia tech,software engineer,researcher,researcher writer filmmaker
4,financial analyst,technical product owner agile methodologies devops platform engineering data healthcare,software engineer,researcher,mechanical engineer


Unnamed: 0,title,explanation
0,ml engineer ai consultant writer instructional associate reinforcement learning georgia tech,"The term ""Technical Writer"" is a type of writer, and the given list includes several other types of writers."
1,researcher,Technical Writers often conduct research to write accurate and informative content.
2,computer scientist,"A Computer Scientist may also work as a Technical Writer, especially when documenting complex code or algorithms."
3,technical project manager data scientist data analyst big data analytics senior geoscientist,"Data Scientists might need to create reports and documentation, making them similar to Technical Writers in some aspects."
10,technical product owner agile methodologies devops platform engineering data healthcare,Documenting processes and creating guides for DevOps practices falls under the umbrella of Technical Writing.
11,writer researcher developer,"This role combines writing, researching, and developing skills, which can be beneficial for creating technical content."


Sales Representative


Unnamed: 0,w2v,tfidf,glove,fasttext,bert
0,technology consultant,software engineer mathematics data science ai course representative,actively seeking data analyst jobs,insurance agent customer service analytical experience,customer support specialist ii
1,engineering supervisor,senior technical support representative n able expertise cyber security project management,insurance agent customer service analytical experience,performance marketing manager,analytics associate
2,program manager,data analyst sales operation analytics compensation years alteryx excel sql python proficient new york usa,looking entry level analyst jobs,business analyst inventory management solutions,consultation agent
3,software engineer,0,associate manager,business intelligence leader analyst,hr recruiter
4,software engineer,0,operations manager assistant program coordinator,crm analyst email lifecycle marketing data driven,associate manager


## Observations and Conclusion

Generative AI comes closest to human reasoning and is the most accurate at matching candidates to the recruiter's search terms, even with zero-shot prompting.

The limitations are that it is more computationally expensive, takes longer to run, and can hallucinate. One search term (Sales Representative) also returned no GenAI results.
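
A hedged sketch of one way to guard against both failure modes: keep only returned titles that exactly match a candidate, and fall back to a cheaper ranking when nothing usable comes back. `rank_with_fallback` and the stub rankers are hypothetical names introduced for illustration. The only assumption about the notebook's own code is that the LLM ranker is called as `ranker(query, candidates)` and returns a list of dicts with `title` and `explanation` keys, as `generate_chunk_ranking` does above. Exact matching is stricter than the substring matching used in that function.

```python
def rank_with_fallback(query, candidates, llm_rank, fallback_rank):
    """Return (rows, source): validated LLM rows, or fallback rows if none survive."""
    allowed = set(candidates)
    seen = set()
    rows = []
    for row in llm_rank(query, candidates):
        title = row.get("title")
        # Drop titles the model invented and any repeated titles.
        if title in allowed and title not in seen:
            seen.add(title)
            rows.append(row)
    if rows:
        return rows, "genai"
    # Nothing usable came back: use the cheaper ranking, with no explanation.
    fallback_rows = [
        {"title": title, "explanation": None}
        for title in fallback_rank(query, candidates)
    ]
    return fallback_rows, "fallback"


# Stub rankers standing in for the real LLM call and an embedding-based ranking.
def noisy_llm(query, candidates):
    return [
        {"title": "hr recruiter", "explanation": "recruits new employees"},
        {"title": "chief people officer", "explanation": "not in the candidate list"},
        {"title": "hr recruiter", "explanation": "duplicate row"},
    ]


def empty_llm(query, candidates):
    return []


def keep_order(query, candidates):
    return list(candidates)


candidates = ["hr recruiter", "associate manager"]
rows, source = rank_with_fallback("human resources manager", candidates, noisy_llm, keep_order)
empty_rows, empty_source = rank_with_fallback("sales representative", candidates, empty_llm, keep_order)
```

In the notebook, `generate_chunk_ranking` could be passed as `llm_rank`. The fallback would need a small wrapper around one of the similarity methods that returns titles in rank order, and that wrapper is not shown here.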

These limitations could be mitigated by adding a preprocessing step, for example narrowing the candidate list with one of the simpler NLP methods before passing it to the LLM.
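
As a minimal sketch of such a preprocessing step, the code below shortlists titles by lexical similarity to the query before any LLM call. It uses a small pure-Python TF-IDF and cosine similarity as a stand-in for the notebook's TF-IDF method, so it has no dependencies. `tfidf_vectors`, `cosine`, `prefilter_titles`, `top_k` and `min_score` are all hypothetical names and parameters chosen for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Build simple TF-IDF vectors (token -> weight dicts) from whitespace tokens."""
    tokenized = [text.lower().split() for text in texts]
    n_docs = len(tokenized)
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        # Smoothed inverse document frequency, so every weight stays positive.
        vectors.append({
            tok: count * (math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1)
            for tok, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(tok, 0.0) for tok, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def prefilter_titles(query, titles, top_k=10, min_score=0.0):
    """Keep at most top_k unique titles whose similarity to the query exceeds min_score."""
    unique_titles = list(dict.fromkeys(titles))  # de-duplicate, keep order
    vectors = tfidf_vectors([query] + unique_titles)
    query_vec, title_vecs = vectors[0], vectors[1:]
    scored = [(cosine(query_vec, vec), title) for vec, title in zip(title_vecs, unique_titles)]
    scored = [pair for pair in scored if pair[0] > min_score]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [title for _, title in scored[:top_k]]


titles = [
    "full stack developer",
    "software developer",
    "mechanical engineer",
    "full stack developer",
]
shortlist = prefilter_titles("Full Stack Developer", titles, top_k=2)
print(shortlist)  # ['full stack developer', 'software developer']
```

The shortlist would then be passed to `generate_chunk_ranking(query, shortlist)`. A purely lexical filter drops titles that share no words with the query (for example, "hr recruiter" for "human resources manager"), so an embedding-based score is probably a safer prefilter when synonyms matter.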

For future work, other LLMs and other generative-AI techniques could be tested for better speed and performance, but this approach has worked very well for this use case.