## Potential Talent

### **Context:**

As a **talent sourcing and management company**, we are interested in **finding talented individuals** and sourcing them to technology companies. **Finding talented candidates is not easy**, for **several reasons**. **First**, one needs to understand the role very well in order to fill it, which requires understanding the client’s needs and what they are looking for in a potential candidate. **Second**, one needs to understand what makes a candidate shine for the role we are searching for. **Third**, knowing where to find talented individuals is another challenge.

The nature of our job requires a lot of human labor and is full of **manual operations**. Towards **automating this process**, we want to build a better approach that could save us time and help us spot potential candidates who could fit the roles we are searching for. Beyond that, for a specific role we want to fill, we are interested in developing a machine-learning-powered pipeline that could spot talented individuals and rank them based on their fitness.

We are currently sourcing a few candidates semi-automatically, so the sourcing part is not a concern at this time. Instead, we first want to determine the best-matching candidates based on how fit they are for a given role. We generally make these searches based on keywords such as “full-stack software engineer”, “engineering manager” or “aspiring human resources”, depending on the role we are trying to fill. These keywords might change, and you can expect that specific keywords will be provided to you.

Assuming that we are able to list and rank fitting candidates, we then employ a review procedure, as each candidate needs to be reviewed manually to determine how good a fit they are. At the end of this manual review, we might choose not the first candidate in the list but, say, the 7th. If that happens, we are interested in being able to re-rank the previous list based on this information. This supervisory signal is supplied by starring the 7th candidate in the list. Starring a candidate sets that candidate as an ideal candidate for the given role. We then expect the list to be re-ranked each time a candidate is starred.
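
A minimal sketch of how starring could drive re-ranking, assuming each candidate and the search keywords are already represented as vectors. It blends the query vector with the mean of the starred candidates' vectors (a Rocchio-style relevance-feedback update) and re-scores everyone by cosine similarity. The function names, the `alpha` weight and the toy 2-D vectors are illustrative assumptions, not something specified in the brief.

```python
import numpy as np

def cosine_scores(query_vec, doc_vecs):
    """Cosine similarity between one query vector and each row of doc_vecs."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    return d @ q

def rerank_with_star(query_vec, doc_vecs, starred_idx, alpha=0.5):
    """Blend the query with the mean of the starred vectors, then re-score.

    alpha is a hypothetical weight: 0 ignores the stars, 1 ignores the query.
    """
    starred_mean = doc_vecs[starred_idx].mean(axis=0)
    updated_query = (1 - alpha) * query_vec + alpha * starred_mean
    scores = cosine_scores(updated_query, doc_vecs)
    return np.argsort(-scores), scores

# Toy 2-D embeddings, made up for illustration
doc_vecs = np.array([
    [1.0, 0.0],   # candidate 0
    [0.8, 0.6],   # candidate 1
    [0.0, 1.0],   # candidate 2
])
query_vec = np.array([1.0, 0.1])

before = np.argsort(-cosine_scores(query_vec, doc_vecs))
after, _ = rerank_with_star(query_vec, doc_vecs, starred_idx=[2], alpha=0.7)
print(before.tolist(), after.tolist())
```

With these toy vectors the order changes from `[0, 1, 2]` to `[2, 1, 0]` once candidate 2 is starred. Starring further candidates would simply add their indices to `starred_idx`.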

### Data Description:

The data comes from our sourcing efforts. We removed any field that could directly reveal personal details and gave a unique identifier for each candidate.

#### Attributes:
**id** : unique identifier for candidate (numeric)

**job_title** : job title for candidate (text)

**location** : geographical location for candidate (text)

**connections** : number of connections candidate has, 500+ means over 500 (text)

**Output (desired target)**:
fit - how fit the candidate is for the role? (numeric, probability between 0-1)

Keywords: “Aspiring human resources” or “seeking human resources”

#### Download Data:

https://docs.google.com/spreadsheets/d/117X6i53dKiO7w6kuA1g1TpdTlv1173h_dPlJt5cNNMU/edit?usp=sharing

#### Goal(s):

Predict how fit the candidate is based on their available information (variable fit)

#### Success Metric(s):

Rank candidates based on a fitness score.

Re-rank candidates when a candidate is starred.

#### Bonus(es):

We are interested in a robust algorithm. Tell us how your solution works and show us how your ranking gets better with each starring action.

How can we filter out candidates who should not be in this list in the first place?

Can we determine a cut-off point that would work for other roles without losing high potential candidates?

Do you have any ideas that we should explore so that we can even automate this procedure to prevent human bias?

In [1]:
# Importing Standard Libraries
import pandas as pd
import numpy as np
import os

from sklearn.metrics.pairwise import linear_kernel
pd.options.display.max_columns = 60

In [2]:
# Set the option to display the full text in DataFrame columns
pd.set_option('display.max_colwidth', None)

## Initial Exploratory Data Analysis

In [3]:
path = os.getcwd()
path


'c:\\Users\\Alex Chung\\Documents\\the_Lab\\Apziva\\Potential Talent'

In [4]:
df = pd.read_excel(path + '\\Dataset for Potential Talents.xlsx').set_index('id')
df.head()

id,title,location,screening_score
1,innovative and driven professional seeking a role in data analyticsdata science in the information technology industry.,United States,100
2,ms applied data science student usc research assistant usc former data analytics intern at dr reddys laboratories former data science intern quadratyx actively seeking full time roles in summer 2025,United States,100
3,computer science student seeking full-time software engineerdeveloper positions ai sql data visualization toolspython ssrs,United States,100
4,microsoft certified power bi data analyst mba business analytics unt business intelligence engineer data scientist data engineer business analytics predictive analytics statistical analysis ex-ericsson,United States,100
5,graduate research assistant at uab masters in data science student at uab ex jio,United States,100


In [5]:
df.rename(columns={"title":"job_title"}, inplace=True)
df.rename(columns={"screening_score":"connection"}, inplace=True)
df.job_title.value_counts()

job_title
data analyst                                                                                                    19
data scientist                                                                                                  16
--                                                                                                              15
software engineer                                                                                                5
researcher                                                                                                       3
                                                                                                                ..
masters in applied statistics and supply chain analyst for aldi                                                  1
master of science in analytics at georgia institute of technology aspiring data scientist                        1
data engineer student at iit and upm                                  

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1285 entries, 1 to 1285
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   job_title   1281 non-null   object
 1   location    1285 non-null   object
 2   connection  1285 non-null   int64 
dtypes: int64(1), object(2)
memory usage: 40.2+ KB


In [7]:
# dropping null values
df = (df[~df['job_title'].isna()])

In [8]:
df.replace("--", "blank", inplace=True)
df.replace(" ", "blank", inplace=True)
df.replace(".", "blank", inplace=True)

In [9]:
df[df.job_title == ' ']

id,job_title,location,connection


In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1281 entries, 1 to 1285
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   job_title   1281 non-null   object
 1   location    1281 non-null   object
 2   connection  1281 non-null   int64 
dtypes: int64(1), object(2)
memory usage: 40.0+ KB


In [11]:
# df.replace('500+ ','501', inplace=True)
# df['connection'] = pd.to_numeric(df['connection'])

# TF-IDF

Term Frequency-Inverse Document Frequency (Statistical Method)
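
To make the weighting concrete, here is a small from-scratch sketch of TF-IDF scoring on three made-up job titles. It uses the smoothed formula `idf(t) = ln((1 + n) / (1 + df(t))) + 1` followed by L2 normalisation, which is what scikit-learn's `TfidfVectorizer` applies by default. The helper names and toy titles are assumptions for illustration, and tokenisation is a plain whitespace split rather than the vectorizer's own.

```python
import math
from collections import Counter

def weigh(tokens, vocab, idf):
    """TF-IDF vector for one tokenised text, L2-normalised."""
    counts = Counter(tokens)
    vec = [counts[t] * idf[t] for t in vocab]   # words outside vocab are ignored
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec

def tfidf_vectors(docs):
    """Return (vocabulary, idf weights, TF-IDF vectors) for tokenised docs."""
    n_docs = len(docs)
    vocab = sorted({tok for doc in docs for tok in doc})
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    return vocab, idf, [weigh(doc, vocab, idf) for doc in docs]

def cosine(a, b):
    # Both vectors are already unit length, so the dot product is the cosine
    return sum(x * y for x, y in zip(a, b))

titles = ["data analyst", "senior data scientist", "human resources manager"]
vocab, idf, doc_vecs = tfidf_vectors([t.split() for t in titles])

query_vec = weigh("data analyst".split(), vocab, idf)
scores = [cosine(query_vec, d) for d in doc_vecs]
ranking = sorted(zip(titles, scores), key=lambda p: p[1], reverse=True)
print(ranking)
```

The rarer word "analyst" receives a higher IDF weight than "data", which appears in two titles, so the exact match ranks first and the title sharing only "data" gets a lower, non-zero score.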
### Prepping our Text for Modelling


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Prep our Text for Modelling
vectorizer = TfidfVectorizer(stop_words='english', ngram_range = (1, 2))
docs_tfidf = vectorizer.fit_transform(df["job_title"])

In [None]:
def get_tf_idf_query_similarity(vectorizer, docs_tfidf, query):
    """
    vectorizer: TfIdfVectorizer model
    docs_tfidf: tfidf vectors for all docs
    query: query doc

    return: cosine similarity between query and all docs
    """
    query_tfidf = vectorizer.transform([query])
    cos_sim = cosine_similarity(query_tfidf, docs_tfidf).flatten()
    
    return cos_sim

In [None]:
def top_candidates(n, by = 'tfidf_fit', ascending = False, min_con = 0, location = None):
    
    # keep candidates with enough connections and a positive similarity score
    mask = (df.connection >= min_con) & (df[by] > 0)
    
    # filter by location only when one is given
    if location is not None:
        mask &= (df.location == location)
    
    df2 = df.loc[mask].sort_values(by = by, ascending = ascending).head(n).copy()
    
    if df2.empty:
        return "There are no suitable candidates"
    
    return df2

In [None]:
query = 'Data Analyst'

cos_sim = get_tf_idf_query_similarity(vectorizer, docs_tfidf, query)

df['tfidf_fit'] = cos_sim

top_candidates(n = 4)

In [None]:
df_compare = pd.DataFrame()
df_compare['tfidf_fit'] = top_candidates(n = 10)['job_title']
df_compare

In [None]:
# Reranking by learning to rank

# Let's do this for Word2Vec, GloVe, Fasttext, BERT and finally GenAI

# Word2Vec Gensim
Word embedding

### Prepping our Text for Modelling

In [None]:
import re
import nltk
# nltk.download('stopwords')  # run once if the stopword corpus is missing

# processing texts for modelling
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

def clean_title(text):
    # replace non-letters with spaces, lowercase, and drop English stopwords
    words = (re.sub(r'[^a-zA-Z]',' ',w).lower() for w in text.split())
    return " ".join(w for w in words if w not in stop_words)

df['job_title_cleaned'] = df.job_title.apply(clean_title)

In [None]:
# drop tfidf_fit column to preserve column order later
df.drop(columns="tfidf_fit", inplace=True)
df.head(2)

In [None]:
from tensorflow import keras

# tokenize and pad every document to make them of the same size
from tensorflow.keras.preprocessing.text import Tokenizer
# from keras.layers import TextVectorization
from keras_preprocessing.sequence import pad_sequences
tokenizer=Tokenizer()

tokenizer.fit_on_texts(df.job_title_cleaned)
tokenized_documents=tokenizer.texts_to_sequences(df.job_title_cleaned)
tokenized_paded_documents=pad_sequences(tokenized_documents,maxlen=64,padding='post')
vocab_size=len(tokenizer.word_index)+1

In [None]:
# loading pre-trained embeddings, each word is represented as a 300 dimensional vector
import gensim

# Navigating to directory where pre-trained embeddings were downloaded
os.chdir(r"C:\Users\Alex Chung\Documents\the_Lab\Apziva\Potential Talent")
# W2V_PATH="GoogleNews-vectors-negative300.bin.gz"
W2V_PATH="GoogleNews-vectors-negative300.bin"

In [None]:
path = os.getcwd()+'\\GoogleNews-vectors-negative300.bin\\'

In [None]:
# loading word2vec model
model_w2v = gensim.models.KeyedVectors.load_word2vec_format(path+W2V_PATH, binary=True)
model_w2v[0][:4]

In [None]:
# creating embedding matrix, every row is a vector representation from the vocabulary indexed by the tokenizer index. 
embedding_matrix=np.zeros((vocab_size,300))
for word,i in tokenizer.word_index.items():
    if word in model_w2v:
        embedding_matrix[i]=model_w2v[word]
        
# creating document-word embeddings
document_word_embeddings=np.zeros((len(tokenized_paded_documents),64,300))
for i in range(len(tokenized_paded_documents)):
    for j in range(len(tokenized_paded_documents[0])):
        document_word_embeddings[i][j]=embedding_matrix[tokenized_paded_documents[i][j]]
document_word_embeddings.shape

In [None]:
# document_word_embeddings[0][0][:10]

In [None]:
# model_w2v['england'][:5]

In [None]:
def processing(query):
    # clean the query the same way as the job titles
    stop_words = stopwords.words('english')
    words = (re.sub(r'[^a-zA-Z]',' ',w).lower() for w in query.split())
    # keep at most 64 tokens (the last 64, matching pad_sequences' default truncation)
    tokens = " ".join(w for w in words if w not in stop_words).split()[-64:]

    # look the vectors up directly instead of refitting the shared tokenizer,
    # which would change its vocabulary on every query
    query_document_word_embeddings=np.zeros((1,64,300))
    for j,word in enumerate(tokens):
        if word in model_w2v:
            query_document_word_embeddings[0][j]=model_w2v[word]
    
    # unknown words and padding positions stay as zero vectors
    return query_document_word_embeddings

In [None]:
processing('hello world!!!!').shape

In [None]:
processing('hello world!!!!')[0][:3][0][:20]

In [None]:
def get_w2v_query_similarity(document_word_embeddings, query):
    """
    document_word_embeddings: word2vec embeddings for all docs, shape (n_docs, 64, 300)
    query: query doc

    return: cosine similarity between query and all docs
    """
    query_w2v = processing(query)
    
    nsamples, nx, ny = query_w2v.shape
    query_w2v_reshape = query_w2v.reshape((nsamples,nx*ny))

    nsamples, nx, ny = document_word_embeddings.shape
    document_word_embeddings_reshape = document_word_embeddings.reshape((nsamples,nx*ny))
    
    cos_sim_w2v = cosine_similarity(query_w2v_reshape, document_word_embeddings_reshape).flatten()
    
    return cos_sim_w2v
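
One property of this approach is worth seeing on a toy case: because each document is a padded grid of word vectors that is flattened before comparison, the cosine score depends on word positions as well as on the words themselves. The sketch below uses made-up 3-D vectors and a length of 4 in place of the notebook's 300 dimensions and 64 positions. `toy_vectors`, `embed_padded` and `cosine` are hypothetical helpers, and truncation is simplified to keeping the first positions.

```python
import numpy as np

MAXLEN, DIM = 4, 3  # tiny stand-ins for the notebook's 64 and 300

# Hypothetical 3-D "embeddings"; real ones would come from model_w2v
toy_vectors = {
    "data":    np.array([1.0, 0.0, 0.0]),
    "analyst": np.array([0.0, 1.0, 0.0]),
    "senior":  np.array([0.0, 0.0, 1.0]),
}

def embed_padded(text):
    """Place each word vector in its position slot; unused slots stay zero."""
    slots = np.zeros((MAXLEN, DIM))
    for j, word in enumerate(text.split()[:MAXLEN]):
        if word in toy_vectors:
            slots[j] = toy_vectors[word]
    return slots.reshape(-1)  # flatten to one long vector

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

query = embed_padded("data analyst")
same_position = cosine(query, embed_padded("data analyst"))
shifted = cosine(query, embed_padded("senior data analyst"))
print(same_position, shifted)
```

Here the identical title scores 1.0, while "senior data analyst" scores 0.0 because every shared word sits one slot later. Averaging the word vectors, as the GloVe and fastText sections do, removes this position dependence.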

In [None]:
def get_all_similarity(query):
    
    # Word2Vec Similarity
    cos_sim_w2v = get_w2v_query_similarity(document_word_embeddings, query)
    df['w2v_fit'] = cos_sim_w2v

    # Original TFIDF similarity for comparison
    cos_sim = get_tf_idf_query_similarity(vectorizer, docs_tfidf, query) 
    df['tfidf_fit'] = cos_sim

    return df

In [None]:
query = 'seeking human resources'

df = get_all_similarity(query)

top_candidates(n = 10, by = 'w2v_fit', ascending = False, min_con = 0)

In [None]:

def compare_results(n, query):
    
    # side-by-side table of the top-n job titles for each similarity method
    df_compare = pd.DataFrame()
    df = get_all_similarity(query)
    
    # assumes the first four columns are not similarity scores
    cols = df.columns[4:].to_list()
    for t in cols:
        top = top_candidates(n = n, by = t)  # computed once per method
        if isinstance(top, str):
            continue  # no suitable candidates for this method
        
        titles = top['job_title'].to_list()
        # pad with zeros so that every column has n rows
        df_compare[t.split("_")[0]] = titles + [0] * (n - len(titles))
                
    return df_compare

In [None]:
df.info()

In [None]:
n = 5
query = 'Senior Human Resources Business Partner at Heil Environmental'
compare_results(n, query)

In [None]:
query = 'Senior Human Resources Business Partner at Heil Environmental'

df = get_all_similarity(query)

top_candidates(n = 10, by = 'w2v_fit', ascending = False, min_con = 0)

In [None]:
top_candidates(n = 5, by = 'w2v_fit', ascending = False, min_con = 50)

In [None]:
top_candidates(n = 10, by = 'w2v_fit', ascending = False, min_con = 20, location = 'Greater New York City Area')

In [None]:
query = 'Staff Data Scientist'

df = get_all_similarity(query)

top_candidates(n = 10, by = 'w2v_fit', ascending = False, min_con = 0)

# GloVe

https://nlp.stanford.edu/projects/glove/

In [None]:
# Downloading GloVe pre-trained vectors
# !pip install wget
# import wget
# wget.download('https://nlp.stanford.edu/data/glove.840B.300d.zip')

In [None]:
# Extracting GloVe vector file
# import zipfile as zf
# files = zf.ZipFile("glove.840B.300d.zip", 'r')
# files.extractall('GloVe')
# files.close()

In [None]:
# Navigating to directory where GloVe pre-trained vectors were downloaded
os.chdir(r"C:\Users\Alex Chung\Documents\the_Lab\Apziva\Potential Talent\glove")
path = 'glove.840B.300d.txt'

In [None]:
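# explicit encoding, since the Windows default codec may fail on non-ASCII tokens
with open(path, encoding="utf-8") as file:
    for i in range(10):
        line = file.readline()
        print(line[:10])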

In [None]:
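# keep_default_na=False stops tokens such as "null" or "NA" from being read as missing values
df_glove = pd.read_csv(path, sep=" ", quoting=3, header=None, index_col=0, keep_default_na=False)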
df_glove.T

In [None]:
glove = { key: val.values for key, val in df_glove.T.items() }
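
As a possible alternative to going through a DataFrame, the GloVe text format (one token per line followed by its vector components, separated by spaces) can be parsed straight into a dictionary. This is only a sketch: `load_glove` is a hypothetical helper, and splitting from the right by `dim` is a precaution in case a token itself contains spaces.

```python
import io
import numpy as np

def load_glove(lines, dim):
    """Build a {word: vector} dict from lines of 'word v1 v2 ... v_dim'."""
    vectors = {}
    for line in lines:
        # Split from the right so only the last `dim` fields are treated as numbers
        parts = line.rstrip("\n").rsplit(" ", dim)
        if len(parts) != dim + 1:
            continue  # skip malformed or empty lines
        word, values = parts[0], parts[1:]
        vectors[word] = np.array([float(v) for v in values], dtype=np.float32)
    return vectors

# Tiny in-memory stand-in for the real file
sample = io.StringIO("man 0.1 0.2 0.3\nnull 0.4 0.5 0.6\n\n")
toy_glove = load_glove(sample, dim=3)
print(sorted(toy_glove))

# With the real file this would be, for example:
# with open(path, encoding="utf-8") as f:
#     glove = load_glove(f, dim=300)
```

Because every token is kept as a plain string, words such as "null" stay in the vocabulary instead of being interpreted as missing values.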

In [None]:
glove['man'][:5]

In [None]:
unknown_word = df_glove.mean().values

In [None]:
df_glove.head()

In [None]:
glove[word][:3]

In [None]:
# Creating a vectorized representation for each job title in our dataframe
job_titles = df.job_title_cleaned

doc_sent_vec = []

for sentences in job_titles:
    
    word_vec = []
    for word in sentences.split():
        if word in glove:
            word_vec.append(glove[word])
        else:
            # out-of-vocabulary words fall back to the mean GloVe vector
            word_vec.append(unknown_word)
    
    # titles with no usable words also fall back to the mean vector,
    # which avoids dividing by zero below
    if not word_vec:
        word_vec.append(unknown_word)
    
    word_vec_mean = sum(word_vec) / len(word_vec) # returning a mean for each job title
    doc_sent_vec.append(word_vec_mean) # returning a list for all job titles

In [None]:
doc_sent_vec[0].shape

In [None]:
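# Creating a vectorized representation for each query
def q_sent_vec(query):
    q_word_vec = []
    
    # lowercase the query so it matches the cleaned, lowercased job titles
    for word in query.lower().split():
        if word in glove:
            q_word_vec.append(glove[word])
        else:
            q_word_vec.append(unknown_word)
    
    # an empty query falls back to the mean GloVe vector
    if not q_word_vec:
        q_word_vec.append(unknown_word)
    
    # mean over all words, computed once after the loop
    q_word_vec_mean = sum(q_word_vec) / len(q_word_vec)
        
    # wrapped in a list so cosine_similarity receives a 2D input
    return [q_word_vec_mean]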

In [None]:
query = 'native english speaking'
len(q_sent_vec(query))

In [None]:
q_sent_vec(query)[0].shape

In [None]:
q_sent_vec(query)[0][:5]

In [None]:
query = 'student indiana university'
q_sent_vec(query)[0][:5]

In [None]:
def get_glove_query_similarity(doc_sent_vec, query):
    """
    doc_sent_vec: averaged GloVe vectors for all docs
    query: query doc

    return: cosine similarity between query and all docs
    """
    query_glove = q_sent_vec(query)
    
    cos_sim_glove = cosine_similarity(query_glove, doc_sent_vec).flatten()
    
    return cos_sim_glove
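
The averaging idea behind these functions can be checked on a toy vocabulary. In the sketch below each text becomes the mean of its word vectors, unseen words fall back to the mean of all known vectors, and texts are compared by cosine similarity. The 2-D vectors and helper names are made up for illustration; real GloVe vectors have 300 dimensions.

```python
import numpy as np

# Hypothetical 2-D vectors standing in for GloVe
toy_glove = {
    "human":     np.array([1.0, 0.0]),
    "resources": np.array([0.8, 0.2]),
    "data":      np.array([0.0, 1.0]),
    "analyst":   np.array([0.1, 0.9]),
}
# Fallback for unseen words: the mean of all known vectors
unknown = np.mean(list(toy_glove.values()), axis=0)

def sentence_vector(text):
    """Average the word vectors of a lowercased, whitespace-split text."""
    vecs = [toy_glove.get(w, unknown) for w in text.lower().split()]
    return np.mean(vecs, axis=0) if vecs else unknown

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

titles = ["human resources", "data analyst", "aspiring human resources"]
query = sentence_vector("Human Resources")
scores = {t: cosine(query, sentence_vector(t)) for t in titles}
print(scores)
```

The unseen word "aspiring" pulls its title slightly towards the vocabulary mean, so that title still scores well above "data analyst" but below the exact match.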

In [None]:
def get_all_similarity(query):
    
    #GloVe similarity
    cos_sim_glove = get_glove_query_similarity(doc_sent_vec, query)
    df['glove_fit'] = cos_sim_glove

    # original TFIDF similarity and Word2Vec Similarity for comparison
    cos_sim = get_tf_idf_query_similarity(vectorizer, docs_tfidf, query) 
    df['tfidf_fit'] = cos_sim

    cos_sim_w2v = get_w2v_query_similarity(document_word_embeddings, query)
    df['w2v_fit'] = cos_sim_w2v

    return df

In [None]:
query = 'Aspiring human resources'
df = get_all_similarity(query)
top_candidates(n = 10, by = 'glove_fit', ascending = False, min_con = 0)

In [None]:
query = 'seeking human resources'
df = get_all_similarity(query)
top_candidates(n = 10, by = 'glove_fit', ascending = False, min_con = 0)

In [None]:
query = 'senior data analyst'
df = get_all_similarity(query)
top_candidates(n = 10, by = 'glove_fit', ascending = False, min_con = 0)

In [None]:
n = 5
query = 'senior data analyst'
compare_results(n, query)

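# Fasttext
fastText is an open-source library from Facebook AI Research for learning word representations and classifying text. It is known for fast training and for building word vectors from character n-grams, which lets it handle out-of-vocabulary words.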

In [None]:
# import sys
# sys.path

# # !pip install wget
# !pip3.10 install --user wget

In [None]:
# # # Downloading the fastText source archive (v0.9.2), not the pre-trained vectors
# import wget
# wget.download('https://github.com/facebookresearch/fastText/archive/v0.9.2.zip')

In [None]:
# # # Extracting the fastText source archive
# import zipfile as zf
# files = zf.ZipFile("fastText-0.9.2.zip", 'r')
# files.extractall()
# files.close()

In [None]:
os.chdir(r"C:\Users\Alex Chung\Documents\the_Lab\Apziva\Potential Talent\fastText-0.9.2")

#### Issues and workarounds with installing fasttext:

https://stackoverflow.com/questions/44951456/pip-error-microsoft-visual-c-14-0-is-required

In [None]:
# !pip install --upgrade pip
# !pip install --upgrade wheel
# !pip install --upgrade setuptools
# !pip install Cython --install-option="--no-cython-compile"

In [None]:
# !pip install fasttext
# !pip install fasttext-wheel

In [None]:
import fasttext

In [None]:
# Downloading pretrained model trained on Common Crawl and Wikipedia
# import fasttext.util
# fasttext.util.download_model('en', if_exists='ignore')  # English Skip downloading if you've already downloaded


In [None]:
ft = fasttext.load_model('cc.en.300.bin')

In [None]:
ft.get_word_vector('hello')[:20]

In [None]:
ft.get_words()[:10]

In [None]:
# Creating a dictionary of fasttext words and their vector representations
ft_words = ft.get_words()
ft_vectors = [ft.get_word_vector(word) for word in ft_words]
ft_dict = dict(zip(ft_words, ft_vectors))

In [None]:
ft_dict['hello'][:20]

In [None]:
df_ft = pd.DataFrame(ft_dict.items(), columns = ['ft_words', 'ft_vectors'])

In [None]:
df_ft.head(3)

In [None]:
# May not need to do this for fasttext
oov_word = np.zeros((300,))
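
The reason a zero-vector fallback may be unnecessary is that fastText composes a word's vector from its character n-grams, so `ft.get_word_vector` can return a vector even for a word it never saw as a whole. The sketch below only illustrates how such n-grams are formed, with `<` and `>` as word boundary markers. `char_ngrams` is a hypothetical helper and the length range of 3 to 6 is an assumed default. The library itself additionally hashes the n-grams into buckets, which is not shown.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

trigrams = char_ngrams("where", n_min=3, n_max=3)
print(trigrams)  # ['<wh', 'whe', 'her', 'ere', 're>']

# A misspelled or unseen word still shares many n-grams with a known one
shared = set(char_ngrams("analyst")) & set(char_ngrams("analysts"))
print(len(shared) > 0)
```

Under this view, iterating over `ft.get_words()` only covers the stored vocabulary, whereas calling `ft.get_word_vector(word)` directly would also cover words outside it.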

In [None]:
# Creating a fasttext vectorized representation for each job title in our dataframe
job_titles = df.job_title_cleaned

doc_sent_vec_ft = []

for sentences in job_titles:
    word_vec_ft = []
    for word in sentences.split():
        if word in ft_dict:
            word_vec_ft.append(ft_dict[word])
        else:
            word_vec_ft.append(oov_word)
    
    # titles with no usable words fall back to the zero vector,
    # which avoids dividing by zero below
    if not word_vec_ft:
        word_vec_ft.append(oov_word)
    
    word_vec_mean_ft = sum(word_vec_ft) / len(word_vec_ft) # returning a mean for each job title
    doc_sent_vec_ft.append(word_vec_mean_ft) # returning a list for all job titles

In [None]:
# Creating a fasttext vectorized representation for each query
def q_sent_vec_ft(query):
    q_word_vec_ft = []
    
    # lowercase the query so it matches the cleaned, lowercased job titles
    for word in query.lower().split():
        if word in ft_dict:
            q_word_vec_ft.append(ft_dict[word])
        else:
            q_word_vec_ft.append(oov_word)
    
    # an empty query falls back to the zero vector
    if not q_word_vec_ft:
        q_word_vec_ft.append(oov_word)
    
    # mean over all words, computed once after the loop
    q_word_vec_mean_ft = sum(q_word_vec_ft) / len(q_word_vec_ft)
        
    # wrapped in a list so cosine_similarity receives a 2D input
    return [q_word_vec_mean_ft]

In [None]:
def get_fasttext_query_similarity(doc_sent_vec_ft, query):
    """
    doc_sent_vec_ft: averaged fastText vectors for all docs
    query: query doc

    return: cosine similarity between query and all docs
    """
    query_fasttext = q_sent_vec_ft(query)
    
    cos_sim_fasttext = cosine_similarity(query_fasttext, doc_sent_vec_ft).flatten()
    
    return cos_sim_fasttext

In [None]:


def get_all_similarity(query):

    #Fasttext similarity
    cos_sim_fasttext = get_fasttext_query_similarity(doc_sent_vec_ft, query)
    df['fasttext_fit'] = cos_sim_fasttext

    # original TFIDF similarity and Word2Vec Similarity for comparison
    cos_sim = get_tf_idf_query_similarity(vectorizer, docs_tfidf, query) 
    df['tfidf_fit'] = cos_sim

    cos_sim_w2v = get_w2v_query_similarity(document_word_embeddings, query)
    df['w2v_fit'] = cos_sim_w2v

    cos_sim_glove = get_glove_query_similarity(doc_sent_vec, query)
    df['glove_fit'] = cos_sim_glove
    
    return df

In [None]:
query = 'Aspiring human resources'
df = get_all_similarity(query)
top_candidates(n = 10, by = 'fasttext_fit', ascending = False, min_con = 0)

In [None]:
query = 'seeking human resources'
df = get_all_similarity(query)
top_candidates(n = 10, by = 'fasttext_fit', ascending = False, min_con = 0)

In [None]:
n = 5
query = 'senior data analyst'
compare_results(n, query)

In [None]:
os.chdir("..")

In [None]:
df.head(5)

# BERT

In [None]:
# First install
# !pip install transformers 
# !pip install transformers -U --use-feature 2020-resolver

In [None]:
# !pip install --upgrade pip

In [None]:
# !pip config set --user global.use-feature 2020-resolver

In [None]:
# !pip install torch torchvision torchaudio

In [None]:
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

# Load the tokenizer and the model from HuggingFace Hub
bert_tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
bert_model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')

In [None]:
# Mean Pooling - Take average of all tokens
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output.last_hidden_state #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


#Encode text
def encode(texts):
    # Tokenize sentences
    encoded_input = bert_tokenizer(texts, padding=True, truncation=True, return_tensors='pt')

    # Compute token embeddings
    with torch.no_grad():
        model_output = bert_model(**encoded_input, return_dict=True)

    # Perform pooling
    embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

    # Normalize embeddings
    embeddings = F.normalize(embeddings, p=2, dim=1)
    
    return embeddings
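
The role of the attention mask in `mean_pooling` is easiest to see on a tiny array. This NumPy sketch mirrors the same arithmetic (multiply by the mask, sum over tokens, divide by the number of real tokens) on made-up numbers. `masked_mean_pool` is a hypothetical stand-in rather than part of the `transformers` API.

```python
import numpy as np

def masked_mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sentence, ignoring positions where mask == 0."""
    mask = attention_mask[..., None].astype(float)     # (batch, tokens, 1)
    summed = (token_embeddings * mask).sum(axis=1)     # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)     # avoid division by zero
    return summed / counts

# One sentence, three token slots; the last slot is padding
tokens = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])

pooled = masked_mean_pool(tokens, mask)
print(pooled)  # [[2. 3.]]
```

Without the mask, the padding vector would dominate the average. With it, only the two real tokens contribute, which matters when titles of different lengths are padded into one batch.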

In [None]:
# get bert embedding for all docs
titles_list = df['job_title_cleaned'].to_list()

doc_emb = encode(titles_list)

In [None]:
titles_list[:3]

In [None]:
def get_bert_query_similarity(doc_emb, query):
    """
    query_bert: processing the query
    doc_emb: bert embedding for all docs
    query: query doc

    return: cosine similarity between query and all docs

    """
    query_bert = encode(query)
    
    #Compute dot score between query and all document embeddings
    cos_sim_bert = torch.mm(query_bert, doc_emb.transpose(0, 1))[0].cpu().tolist()
    
    return cos_sim_bert
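
The matrix product above counts as a cosine similarity only because `encode` L2-normalises every embedding: for unit-length vectors the dot product and the cosine coincide. A quick numerical check on random vectors, using NumPy in place of torch and a hypothetical `l2_normalize` helper:

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit Euclidean length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

rng = np.random.default_rng(0)
docs = rng.normal(size=(4, 8))    # four made-up document embeddings
query = rng.normal(size=(1, 8))   # one made-up query embedding

# Dot product of normalised vectors, as in the matrix product above
dot_scores = (l2_normalize(query) @ l2_normalize(docs).T)[0]

# Cosine similarity computed explicitly from the raw vectors
cosine_scores = np.array([
    float(query[0] @ d / (np.linalg.norm(query[0]) * np.linalg.norm(d)))
    for d in docs
])
print(np.allclose(dot_scores, cosine_scores))
```

If the normalisation step were removed from `encode`, the same product would instead favour embeddings with larger norms.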

In [None]:
def get_all_similarity(query):
    
    #Bert similarity
    cos_sim_bert = get_bert_query_similarity(doc_emb, query)
    df['bert_fit'] = cos_sim_bert

    #Fasttext similarity
    cos_sim_fasttext = get_fasttext_query_similarity(doc_sent_vec_ft, query)
    df['fasttext_fit'] = cos_sim_fasttext

    # original TFIDF similarity and Word2Vec Similarity for comparison
    cos_sim = get_tf_idf_query_similarity(vectorizer, docs_tfidf, query) 
    df['tfidf_fit'] = cos_sim

    cos_sim_w2v = get_w2v_query_similarity(document_word_embeddings, query)
    df['w2v_fit'] = cos_sim_w2v

    cos_sim_glove = get_glove_query_similarity(doc_sent_vec, query)
    df['glove_fit'] = cos_sim_glove
    
    return df

In [None]:
query = 'seeking human resources'
df = get_all_similarity(query)
top_candidates(n = 5, by = 'bert_fit', ascending = False, min_con = 0)

In [None]:
# Alternative queries to try; only the uncommented assignment is used
# query = 'Nurse'
# query = 'Doctor'
# query = 'Help Desk'
# query = 'Researcher'
# query = 'Soldier'
# query = 'Athlete'
# query = 'Cashier'
# query = 'Spy'
# query = 'Writer'
# query = 'Singer'
# query = 'Chef'
# query = 'Data Scientist'
# query = 'Architect'
# query = 'Accountant'
# query = 'Recruiter'
query = 'human resources'
df = get_all_similarity(query)
n = 3

df_compare = compare_results(n, query)
df_compare

In [None]:
popular_jobs = [
    "Software Engineer",
    "Data Scientist",
    "Product Manager",
    "UX Designer",
    "Project Manager",
    "Marketing Manager",
    "Business Analyst",
    "DevOps Engineer",
    "AI/ML Engineer",
    "Cybersecurity Analyst",
    "Cloud Architect",
    "Full Stack Developer",
    "Frontend Developer",
    "Backend Developer",
    "Database Administrator",
    "Financial Analyst",
    "HR Manager",
    "Graphic Designer",
    "Systems Administrator",
    "Technical Writer",
    "Mechanical Engineer",
    "Civil Engineer",
    "Electrical Engineer",
    "Accountant",
    "Sales Representative"
]


In [None]:
n = 5
for job in popular_jobs:
#     df = get_all_similarity(job)
#     df_compare = compare_results(n, job)
    print(job)
#     display(df_compare)

In [None]:
# Word2Vec: same thing, but with the average of pre-trained word embeddings
# Try to see who I'm connected with
# skill review survey - schedule interview - motivated

Process:
1. Sentence transformer:
    https://sbert.net/
    https://www.geeksforgeeks.org/sentence-similarity-using-bert-transformer/


2. Gen AI
https://stackoverflow.com/questions/75673222/semantic-searching-using-google-flan-t5

3. Utilizing LLM via prompting
GPT (Generative Pre-trained Transformer) - closed-box model accessed through the OpenAI API
- Instead, focus on taking advantage of open-source LLMs such as the Llama 3 model from Meta
- Mistral, Llama 2, Grok maybe?
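
For the prompting route, most of the work is assembling a prompt from the job titles and parsing the model's reply. The sketch below covers only those two steps and does not call any model. `build_ranking_prompt`, `parse_ranking`, the prompt wording and the sample reply are assumptions for illustration, and the reply format of a real LLM would need checking.

```python
def build_ranking_prompt(role, titles, top_k=3):
    """Assemble a plain-text prompt asking an LLM to rank job titles for a role."""
    numbered = "\n".join(f"{i}. {t}" for i, t in enumerate(titles, start=1))
    return (
        f"You are helping a recruiter fill the role: {role}.\n"
        f"Candidates (by job title):\n{numbered}\n"
        f"Return the numbers of the {top_k} best-fitting candidates, "
        "best first, as a comma-separated list."
    )

def parse_ranking(reply, n_candidates):
    """Extract valid, de-duplicated candidate numbers from the model's reply."""
    picked = []
    for piece in reply.replace("\n", ",").split(","):
        piece = piece.strip().rstrip(".")
        if not piece.isdigit():
            continue  # ignore anything that is not a plain number
        number = int(piece)
        if 1 <= number <= n_candidates and number not in picked:
            picked.append(number)
    return picked

titles = ["aspiring human resources professional", "data analyst", "hr coordinator"]
prompt = build_ranking_prompt("human resources", titles, top_k=2)
print(prompt)

# Hypothetical reply text; a real one would come from the chosen model
ranking = parse_ranking("1, 3", n_candidates=len(titles))
print(ranking)
```

Validating the reply against the candidate count guards against numbers the model invents, and the resulting order can be compared with the embedding-based rankings above.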

Bert

# Gen AI

In [15]:
nameList = ['driven professional'] 

strings = ['innovative and driven professional seeking a role in data analyticsdata science in the information technology industry.',
 'ms applied data science student usc research assistant usc former data analytics intern at dr reddys laboratories former data science intern quadratyx actively seeking full time roles in summer 2025',
 'computer science student seeking full-time software engineerdeveloper positions ai sql data visualization toolspython ssrs',
 'microsoft certified power bi data analyst mba business analytics unt business intelligence engineer data scientist data engineer business analytics predictive analytics statistical analysis ex-ericsson',
 'graduate research assistant at uab masters in data science student at uab ex jio']

# any(name in title.lower() for title in strings for name in nameList)
# any([name in nameList if name in title.split() for title in strings])

matching_strings = [
    title for title in strings
    if any(name in title.lower() for name in nameList)
]

print(matching_strings)

['innovative and driven professional seeking a role in data analyticsdata science in the information technology industry.']
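
The filter above uses plain substring matching, so a keyword also matches when it sits inside a longer word. A minimal sketch of a whole-word variant follows; the helper name `match_titles` and the sample titles are assumptions made for illustration and are not part of the notebook.

```python
import re

def match_titles(titles, keywords):
    """Return titles that contain any keyword as a whole word or phrase (case-insensitive)."""
    patterns = [
        re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
        for keyword in keywords
    ]
    return [title for title in titles if any(p.search(title) for p in patterns)]

# Illustrative input only
sample_titles = [
    "innovative and driven professional seeking a role in data analytics",
    "student at kennesaw state university",
]
matched = match_titles(sample_titles, ["driven professional"])
print(matched)
```

Whether word boundaries are wanted depends on the data: titles here often have words fused together (for example `analyticsdata`), where substring matching may actually be the more forgiving choice.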


In [16]:
df.job_title.to_list()

['innovative and driven professional seeking a role in data analyticsdata science in the information technology industry.',
 'ms applied data science student usc research assistant usc former data analytics intern at dr reddys laboratories former data science intern quadratyx actively seeking full time roles in summer 2025',
 'computer science student seeking full-time software engineerdeveloper positions ai sql data visualization toolspython ssrs',
 'microsoft certified power bi data analyst mba business analytics unt business intelligence engineer data scientist data engineer business analytics predictive analytics statistical analysis ex-ericsson',
 'graduate research assistant at uab masters in data science student at uab ex jio',
 'student at kennesaw state university',
 'data analyst business analyst python snowflake sql machine learning power bi tableau equipped with analytics driven by insights and passionate about impactful solutions.',
 'graduate research aide student at ariz

In [None]:
# %pip install -U datasets==2.17.0

# %pip install --upgrade pip
# %pip install --disable-pip-version-check \
#     torch==1.13.1 \
#     torchdata==0.5.1 --quiet

# %pip install \
#     transformers==4.27.2 --quiet

In [None]:
# !pip install --upgrade transformers

In [None]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig



In [None]:
# model_name='google/flan-t5-base'

# model_name = 'google/gemma-7b-it'

# model_name = 'google/gemma-3-1b-it'

# gen_ai_tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
# gen_model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

In [17]:
n = 5
query = job

def get_titles(n = 5, query=query):
    df = get_all_similarity(query)
    
    df_compare = compare_results(n, query)
    # print(job)
    # display(df_compare)
    
    titles = []
    
    for col in df_compare.columns:
        titles.extend(df_compare[col].to_list())
    
    # add additional titles
    titles.extend(df.sample(80)['job_title'].to_list())
    titles = list(set(titles))
    titles = [x for x in titles if x != 0]
    return titles



In [None]:
# def build_prompt(query, options):
#     options_text = "\n".join(f"- {opt}" for opt in options)
#     prompt = """Which applicant best fits the job query? 
#         QUERY: {query}
#         OPTIONS:
#         {options}""".format(query=query, options=options_text)
#     return prompt

In [18]:
titles = df['job_title'].to_list()
titles = list(set(titles))
titles

['assistant professor of electrical and computer engineering kennesaw state university',
 'Computer Engineer',
 'Your Partner in IT Integrity and Information Security',
 'blank',
 'data science graduate student academic services assistant',
 'data analyst at western frontier trading llc transforming data into actionable insights sql python machine learning data visualization psychology graduate ba',
 'data analystsis administrator',
 'data scientist results-driven ai professional expert in python machine learning',
 'analytics engineer',
 'uchicago mcam 2024 uiuc math concentrate on data optimization seeking 2025 full time in data science data analyst and quantitive finance agentic ai enthusiast',
 'analyst ii ops at constellation data science',
 'surgical neurophysiologist the mount sinai hospital data analysis neuroscience',
 'seeking opportunities data analyst machine learning python sql azure power bi cloud ai solutions data-driven decision making',
 'senior optimization engineer m

In [None]:
def build_prompt(query, options):
    options_text = "\n".join(f"- {opt}" for opt in options)
    prompt = """I'm going to provide a list of candidates' job titles, as well as the search term.
I want you to rank the job titles based on their semantic similarity to the search term. Return the ranked list without changing the job
titles, but only the order of them. Here is the search term: {query}, and here is the list of job titles: {options}""".format(query=query, options=options_text)
    return prompt


In [None]:
n = 5
query = job

options = get_titles(n, query)

query = popular_jobs[5]

NameError: name 'get_all_similarity' is not defined

In [None]:
# from transformers import AutoTokenizer, AutoModelForCausalLM
# import torch

# tokenizer = AutoTokenizer.from_pretrained("gpt2")
# model = AutoModelForCausalLM.from_pretrained("gpt2")

# model="google/gemma-3-1b-it"

# def build_prompt(query, options):
#     options_text = "\n".join(f"- {opt}" for opt in options)
#     return f"""
# I'm going to provide a list of candidates' job titles, as well as the search term.
# I want you to rank the job titles based on their semantic similarity to the search term. 
# Return the ranked list without changing the job titles, but only the order of them.

# Search term: {query}
# Job titles:
# {options_text}
# """

# # Function to generate from a chunk
# def generate_from_chunk(query, options_chunk, max_tokens=400):
#     prompt = build_prompt(query, options_chunk)
#     inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=1024)
#     outputs = model.generate(**inputs, max_new_tokens=max_tokens)
#     return tokenizer.decode(outputs[0], skip_special_tokens=True)

# # Chunking
# def chunk_list(lst, chunk_size):
#     for i in range(0, len(lst), chunk_size):
#         yield lst[i:i + chunk_size]

# # Run in chunks
# all_chunks = []
# for chunk in chunk_list(options, 50):  # Adjust chunk size based on model limit
#     result = generate_from_chunk(query, chunk)
#     all_chunks.append(result)

# Optional: Re-rank merged top results with embeddings


In [None]:
# import re

# def extract_ranked_job_titles(text):
#     lines = text.strip().split("\n")
#     titles = []
#     start_ranking = False

#     for line in lines:
#         line = line.strip()

#         # Skip empty or clearly non-ranking sections
#         if not line:
#             continue
#         if "reasoning:" in line.lower():
#             break  # Stop parsing after list ends
#         if "most to least similar:" in line.lower():
#             start_ranking = True
#             continue
#         if not start_ranking:
#             continue

#         # Remove leading numbers, bullets, markdown
#         clean = re.sub(r"^\s*[\-*]?\s*\d+\.\s*", "", line)      # remove "1. " or "- 1."
#         clean = re.sub(r"\*\*", "", clean)                      # remove bold markers
#         clean = re.sub(r"\s*\(.*?\)", "", clean)                # remove things like "(Most Similar)"
#         clean = clean.strip()

#         if clean and len(clean.split()) >= 2:
#             titles.append(clean)

#     return titles


In [19]:
import re

def extract_ranked_job_titles(data):
    """
    Extract job titles from a numbered list in raw text starting after the second line containing 'here's the',
    removing all asterisks, explanations, and splitting multi-title lines correctly.

    Args:
        data (str): Raw text with lines separated by newlines, containing job titles and other text.

    Returns:
        list: Ordered list of unique job titles (lowercase).
    """
    lines = data.strip().split('\n')

    # Find all lines that contain "here's the"
    matching_indices = [i for i, line in enumerate(lines) if re.search(r"here's the", line, re.IGNORECASE)]

    # Use the second occurrence if available, otherwise fallback to the first
    if len(matching_indices) >= 2:
        start_index = matching_indices[1]
    elif matching_indices:
        start_index = matching_indices[0]
    else:
        return []

    # Common non-title terms to exclude (lowercase for consistency)
    exclude_terms = {'python', 'sql', 'r', 'tableau', 'spark', 'aws', 'gcp'}

    job_titles = []
    seen = set()

    for line in lines[start_index:]:
        # Match lines that start with a number followed by a period
        if re.match(r'^\d+\.\s*', line.strip()):
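            # Remove number, asterisks, and trailing explanations.
            # Caveat: this cuts at the first hyphen, en dash, or "(", so "full-time" becomes "full".
            cleaned = re.sub(r'^\d+\.\s*|\*+|\s*[-–(].*$', '', line).strip()
            # Split multiple titles in one line.
            # Caveat: this also splits a single title that contains ",", "/", " and ", or " or ".
            split_titles = re.split(r',|/| and | or ', cleaned)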
            for title in split_titles:
                title = title.strip().lower()
                if title and title not in exclude_terms and title not in seen:
                    seen.add(title)
                    job_titles.append(title)

    return job_titles


# Example usage
data = """Search term: Marketing Manager
Job titles:
- software developer scheme designers inc developed features for aircraft configurators
- graduate research and teaching assistant at michigan state university
- data science data engineer machine learning data scientist project 990 ms in data science iub seeking full-time roles may 25
- senior big data engineer actively seeking for new opportunities data engineer big data sql aws hadoop azure pyspark kafka yarn hdfs scala etl
- staff data scientist and analytics engineer equifax
- phd in mol bio bioinformatic data science i ai ml l aws saa linux python and r docker kubernetes devops cicd pipelinegit-github terraform l ansible
- encargado de marketing y ventas en sego eventos
- data analyst at kingston brass

model
Okay, here's the ranked list of job titles, ordered from most to least semantically similar to the search term "Marketing Manager," based on their relevance:
1.  **data science data engineer machine learning data scientist project 990 ms in data science iub seeking full-time roles may 25**  
2.  **senior big data engineer actively seeking for new opportunities data engineer big data sql aws azure hdfs yarn kafka** 
3.  **software developer scheme designers inc developed features for aircraft configurators** 
4.  **graduate research and teaching assistant at michigan state university** 
5.  **enchargado de marketing y ventas en sego eventos** 
6.  **data analyst at kingston brass** 
7.  **data science data engineer machine learning data scientist** 
8. **staff data scientist and analytics engineer equifax** 


Search term: AI/ML Engineer
Job titles:
- computer science graduate from the university of south africa
- data science machine learning artificial intelligence nlp
- biomedical engineer thermo fisher scientific
- data scientist ai startup data science uc berkeley
- data science graduate passionate about ai ml analytics
- rashmika nattam graduate teaching assistant uic python sql java c html web dev enthusiast
- data analyst data scientist healthcare mit applied data science big sql energy advanced data analytics certified data darling

model
Okay, here’s the ranked list of job titles, ordered from most to least semantically similar to the search term "AI/ML Engineer," based on relevance:

1.  **data science machine learning artificial intelligence nlp** - This is the most directly related, encompassing the core focus of the search term.
2.  **data scientist ai startup data science uc berkeley** -  Highlights experience in AI/ML, often in a startup or research setting.
3.  **biomedical engineer thermo fisher scientific** -  Specifically focuses on a field heavily reliant on AI/ML and data analysis.
4.  **data science graduate passionate about ai ml analytics** -  Clearly indicates an interest in AI/ML and data science.
5.  **data analyst data scientist healthcare mit applied data science big sql** -  Combines data analysis with AI/ML, a common role.
6.  **data science graduate from the university of south africa** -  A strong indicator of an AI/ML background, particularly with a focus on data science.
7.  **computer science graduate from the university of south africa** -  A broader entry point, but still relevant if the candidate has a strong foundation in computer science and AI.
8.  **rashmika nattam graduate teaching assistant uic python sql java c html web dev enthusiast** -  Focuses on a technical role, but with a strong emphasis on programming and related skills.
9.  **data analyst data scientist healthcare mit applied data science big sql** - Similar to the previous one, but with a slightly different emphasis.
10. **data analyst data scientist healthcare mit applied data science big sql energy advanced data analytics certified data darling** -  This is a more general role, but still relevant if the candidate has experience in data analysis and potentially some AI/ML knowledge.

**Reasoning for the ranking:**"""


job_titles = extract_ranked_job_titles(data)
print(job_titles)

['data science data engineer machine learning data scientist project 990 ms in data science iub seeking full', 'senior big data engineer actively seeking for new opportunities data engineer big data sql aws azure hdfs yarn kafka', 'software developer scheme designers inc developed features for aircraft configurators', 'graduate research', 'teaching assistant at michigan state university', 'enchargado de marketing y ventas en sego eventos', 'data analyst at kingston brass', 'data science data engineer machine learning data scientist', 'staff data scientist', 'analytics engineer equifax', 'data science machine learning artificial intelligence nlp', 'data scientist ai startup data science uc berkeley', 'biomedical engineer thermo fisher scientific', 'data science graduate passionate about ai ml analytics', 'data analyst data scientist healthcare mit applied data science big sql', 'data science graduate from the university of south africa', 'computer science graduate from the university of
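
The printed list shows two side effects of the regex-based parser: titles are cut at hyphens (`... seeking full`) and split on `and` (`graduate research`, `teaching assistant at michigan state university`). One possible alternative, sketched below, is to avoid cleaning the model text at all and instead map each numbered line back to the closest original option. The function name `align_to_options`, the `cutoff` value, and the sample data are assumptions for illustration; `difflib.get_close_matches` is from the standard library.

```python
import re
from difflib import get_close_matches

def align_to_options(response, options, cutoff=0.6):
    """Map each numbered line of a model response to the closest original option."""
    lowered = {opt.lower(): opt for opt in options}
    ranked = []
    for line in response.splitlines():
        numbered = re.match(r"^\s*\d+\.\s*(.+)$", line)
        if not numbered:
            continue
        # Prefer the text inside the first **...** pair, otherwise use the whole line.
        bold = re.search(r"\*\*(.+?)\*\*", numbered.group(1))
        candidate = (bold.group(1) if bold else numbered.group(1)).strip().lower()
        close = get_close_matches(candidate, list(lowered), n=1, cutoff=cutoff)
        # Lines with no sufficiently similar option (e.g. invented titles) are dropped.
        if close and lowered[close[0]] not in ranked:
            ranked.append(lowered[close[0]])
    return ranked

# Illustrative input only
options = [
    "graduate research and teaching assistant at michigan state university",
    "encargado de marketing y ventas en sego eventos",
    "data analyst at kingston brass",
]
response = """Okay, here's the ranked list:
1.  **enchargado de marketing y ventas en sego eventos**
2.  **graduate research and teaching assistant at michigan state university** - explanation
3.  **financial analyst**
"""
ranked = align_to_options(response, options)
print(ranked)
```

Because the returned strings come from `options`, misspellings introduced by the model (`enchargado`) and titles it invents do not leak into the result. The cutoff is a judgment call and would need tuning on real outputs.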

In [20]:
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer
import torch

# Use the correct Gemma model from Hugging Face Hub
model_id = "google/gemma-3-1b-it"




In [21]:

# max_token = gemma_tokenizer.model_max_length
max_token = 512 
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load model and tokenizer (run in float16 if on GPU)
ai_tokenizer = AutoTokenizer.from_pretrained(model_id)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32
).to(device)

# streamer = TextStreamer(gemma_tokenizer, skip_prompt=True, skip_special_tokens=True)

In [22]:

# Prompt construction function
def build_prompt(query, options):
    options_text = "\n".join(f"- {opt}" for opt in options)
    return f"""<bos><start_of_turn>user
I will give you a list of job titles and a search term. 
Rank them in the list by how semantically similar they are to the search term.
Do not change the job titles I give you — just reorder them in a ranked list from most to least similar.

Search term: Marketing Manager
Job titles:
- software developer scheme designers inc developed features for aircraft configurators
- graduate research and teaching assistant at michigan state university
- data science data engineer machine learning data scientist project 990 ms in data science iub seeking full-time roles may 25
- senior big data engineer actively seeking for new opportunities data engineer big data sql aws hadoop azure pyspark kafka yarn hdfs scala etl
- staff data scientist and analytics engineer equifax
- phd in mol bio bioinformatic data science i ai ml l aws saa linux python and r docker kubernetes devops cicd pipelinegit-github terraform l ansible
- encargado de marketing y ventas en sego eventos
- data analyst at kingston brass

model
Okay, here's the ranked list of job titles, ordered from most to least semantically similar to the search term "Marketing Manager," based on their relevance:
1.  **data science data engineer machine learning data scientist project 990 ms in data science iub seeking full-time roles may 25**  
2.  **senior big data engineer actively seeking for new opportunities data engineer big data sql aws azure hdfs yarn kafka** 
3.  **software developer scheme designers inc developed features for aircraft configurators** 
4.  **graduate research and teaching assistant at michigan state university** 
5.  **enchargado de marketing y ventas en sego eventos** 
6.  **data analyst at kingston brass** 
7.  **data science data engineer machine learning data scientist** 
8. **staff data scientist and analytics engineer equifax** 


Search term: {query}
Job titles:
{options_text}
<end_of_turn>
<start_of_turn>model
"""
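
In the prompt above, the one-shot example lives inside a single user turn, with `model` written as plain text. A hedged alternative is to express the example as its own user and assistant turns in a chat-style message list. The sketch below only builds that list; `build_messages` and its `example` parameter are hypothetical names. Such a list could then be rendered with the tokenizer's `apply_chat_template(messages, tokenize=False, add_generation_prompt=True)`, which is expected to add the turn markers itself, so hand-written `<bos>` and `<start_of_turn>` tokens would not be needed.

```python
def build_messages(query, options, example=None):
    """Build a chat-style message list for a ranking request.

    `example` is an optional (query, options, ranked) triple used as a one-shot turn.
    """
    instruction = (
        "I will give you a list of job titles and a search term.\n"
        "Rank them by how semantically similar they are to the search term.\n"
        "Do not change the job titles I give you; just reorder them from most to least similar."
    )

    def user_turn(q, opts):
        listing = "\n".join(f"- {opt}" for opt in opts)
        return f"{instruction}\n\nSearch term: {q}\nJob titles:\n{listing}"

    messages = []
    if example is not None:
        ex_query, ex_options, ex_ranked = example
        numbered = "\n".join(f"{i}. {title}" for i, title in enumerate(ex_ranked, start=1))
        messages.append({"role": "user", "content": user_turn(ex_query, ex_options)})
        messages.append({"role": "assistant", "content": numbered})
    messages.append({"role": "user", "content": user_turn(query, options)})
    return messages

# Illustrative input only
messages = build_messages(
    "Financial Analyst",
    ["data analyst at kingston brass", "analytics engineer"],
    example=("Marketing Manager", ["x", "y"], ["y", "x"]),
)
for message in messages:
    print(message["role"], "->", message["content"][:40])
```

Keeping the example in a separate assistant turn also shows the model the exact output format that the parser expects.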


In [23]:
# Prompt construction function
def build_prompt_NS(query, options):
    options_text = "\n".join(f"- {opt}" for opt in options)
    return f"""<bos><start_of_turn>user
I will give you a list of job titles and a search term. 
Rank them by how semantically similar they are to the search term.
Do not change the job titles I give you — just reorder them into a new list from most to least similar.

Search term: {query}
Job titles:
{options_text}
<end_of_turn>
<start_of_turn>model
"""

# Chunking logic
def chunk_list(lst, chunk_size):
    for i in range(0, len(lst), chunk_size):
        yield lst[i:i + chunk_size]

# Generate for a chunk
def generate_chunk_ranking(query, options_chunk, max_tokens=max_token, type = 'NS'):
    if type == 'NS':
        prompt = build_prompt_NS(query, options_chunk)
    else:
        prompt = build_prompt(query, options_chunk)
        
    # Reload tokenizer and model on every call so that a reassigned global model_id takes effect
    # (slow for large models; runs in float16 if on GPU)
    ai_tokenizer = AutoTokenizer.from_pretrained(model_id)

    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32
    ).to(device)
        
    inputs = ai_tokenizer(prompt, return_tensors="pt", return_attention_mask=True).to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_tokens,
        do_sample=False,  # greedy decoding
        temperature=0.1,  # has no effect while do_sample=False (transformers warns about it)
        pad_token_id=ai_tokenizer.eos_token_id,
        # streamer=streamer
    )
    # print()  # Ensure newline after streamed output
    # return gemma_tokenizer.decode(outputs[0], skip_special_tokens=True)

    response = ai_tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(response)
    return extract_ranked_job_titles(response)
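
`generate_chunk_ranking` reloads the tokenizer and model for every chunk. A small cache keyed on the model id would load each model once and still pick up a changed `model_id`. The sketch below shows only the caching pattern with a stand-in loader; `make_cached_loader` and `fake_load` are hypothetical, and in the notebook the wrapped function would be the one calling `AutoTokenizer.from_pretrained` and `AutoModelForCausalLM.from_pretrained`.

```python
from functools import lru_cache

def make_cached_loader(load_fn, maxsize=1):
    """Wrap `load_fn(model_id)` so a model_id is loaded only once while it stays cached."""
    @lru_cache(maxsize=maxsize)
    def cached(model_id):
        return load_fn(model_id)
    return cached

# Stand-in loader for illustration; it records how often it is called.
load_calls = []

def fake_load(model_id):
    load_calls.append(model_id)
    return {"model_id": model_id}

get_model = make_cached_loader(fake_load)
for _ in range(3):
    bundle = get_model("google/gemma-3-1b-it")
print(len(load_calls))
```

With `maxsize=1`, switching to another model id evicts the previous entry, which matters when the models are too large to keep two in memory.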

- Try more models: Qwen, Llama 2, Llama 3
- Tune the temperature
- Increase the max token length
- Chunk size: 10

### Gemma No Shot

In [36]:
titles = df['job_title'].to_list()
titles = list(set(titles))
options = titles[:30]
query = popular_jobs[15]
query

'Financial Analyst'

In [37]:
# Use the correct Gemma model from Hugging Face Hub
model_id = "google/gemma-3-1b-it"

chunk_results = []
for chunk in chunk_list(options, 10):
    ranked_chunk = generate_chunk_ranking(query, chunk)
    print("✅ Ranked chunk output:", ranked_chunk)
    chunk_results.extend(ranked_chunk)  # collect all reordered titles

The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


user
I will give you a list of job titles and a search term. 
Rank them by how semantically similar they are to the search term.
Do not change the job titles I give you — just reorder them into a new list from most to least similar.

Search term: Financial Analyst
Job titles:
- assistant professor of electrical and computer engineering kennesaw state university
- Computer Engineer
- Your Partner in IT Integrity and Information Security
- blank
- data science graduate student academic services assistant
- data analyst at western frontier trading llc transforming data into actionable insights sql python machine learning data visualization psychology graduate ba
- data analystsis administrator
- data scientist results-driven ai professional expert in python machine learning
- analytics engineer
- uchicago mcam 2024 uiuc math concentrate on data optimization seeking 2025 full time in data science data analyst and quantitive finance agentic ai enthusiast

model
Okay, here's the ranked list 

The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


user
I will give you a list of job titles and a search term. 
Rank them by how semantically similar they are to the search term.
Do not change the job titles I give you — just reorder them into a new list from most to least similar.

Search term: Financial Analyst
Job titles:
- analyst ii ops at constellation data science
- surgical neurophysiologist the mount sinai hospital data analysis neuroscience
- seeking opportunities data analyst machine learning python sql azure power bi cloud ai solutions data-driven decision making
- senior optimization engineer marathon petroleum supply chain linear programming refining value chain optimization
- statistics phd candidate at the university of rochester with experience in clinical trials genomics and statistical machine learning. seeking statistician roles.
- Data Scientist Python SQL Machine Learning Excel Data Cleaning
- research intern data science ai exploring the power of data technology iot electrical engineer
- ms data science at unive

The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


user
I will give you a list of job titles and a search term. 
Rank them by how semantically similar they are to the search term.
Do not change the job titles I give you — just reorder them into a new list from most to least similar.

Search term: Financial Analyst
Job titles:
- information systems graduate student at university of memphis
- looking for spring 2025 internships and full time roles graduate student at university of north texas data analyst tata consultancy services
- institutional review specialist bioinformatician data scientist data analytics
- data analytics python javascript sql
- data scientist data analyst machine learning engineer
- student at university of southern california
- seeking entry-level data related job industrial and operations engineering university of michigan
- machine learningdata scienceai researchcomputer vision machine perceptionnlp llms genai
- MIT-certified Data Analyst SQL Power BI Python Business Intelligence Predictive Analytics Data-Driven

In [40]:
ranked_chunk

['data analyst in cvs health',
 'data analyst',
 'data scientist',
 'financial analyst',
 'machine learning engineer',
 'information systems graduate student at university of memphis',
 'student at university of southern california',
 'institutional review specialist bioinformatician data scientist data analytics',
 'machine learning data science ai research computer vision machine perception nlp llms genai']

In [39]:
chunk_results

['data analystsis administrator',
 'data analyst',
 'data scientist',
 'assistant professor of electrical',
 'computer engineering kennesaw state university',
 'computer engineer',
 'your partner in it integrity',
 'information security',
 'blank',
 'data analyst at western frontier trading llc',
 'psychology graduate ba',
 'data scientist',
 'data analyst',
 'statistical analyst',
 'business intelligence analyst',
 'senior optimization engineer',
 'data',
 'analyst ii ops at constellation data science',
 'surgical neurophysiologist',
 'statistics phd candidate',
 'data cleaning',
 'research intern data science ai',
 'ms data science',
 'data analyst in cvs health',
 'data analyst',
 'data scientist',
 'financial analyst',
 'machine learning engineer',
 'information systems graduate student at university of memphis',
 'student at university of southern california',
 'institutional review specialist bioinformatician data scientist data analytics',
 'machine learning data science ai rese
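
`chunk_results` is a plain concatenation of per-chunk rankings, so positions are not comparable across chunks and duplicates appear. One way to get a single ordering, sketched here under assumptions, is to keep the top few titles of each chunk and rank the survivors together in a final pass. `rank_in_rounds`, `keep`, and the toy `overlap_rank` scorer are illustrative names; in the notebook, `rank_fn` would be a wrapper around the model call.

```python
def chunk_list(lst, chunk_size):
    for i in range(0, len(lst), chunk_size):
        yield lst[i:i + chunk_size]

def rank_in_rounds(query, options, rank_fn, chunk_size=10, keep=3):
    """Rank each chunk, keep the top `keep` of each, then rank the survivors together."""
    survivors = []
    for chunk in chunk_list(options, chunk_size):
        # Ignore anything the ranker returns that was not in the chunk.
        ranked = [title for title in rank_fn(query, chunk) if title in chunk]
        for title in ranked[:keep]:
            if title not in survivors:
                survivors.append(title)
    return rank_fn(query, survivors)

def overlap_rank(query, titles):
    """Toy stand-in for the model: sort by number of words shared with the query."""
    query_words = set(query.lower().split())
    return sorted(titles, key=lambda t: -len(query_words & set(t.lower().split())))

# Illustrative input only
titles = [
    "financial analyst", "data analyst", "civil engineer",
    "senior financial analyst intern", "student", "analyst",
]
result = rank_in_rounds("Financial Analyst", titles, overlap_rank, chunk_size=3, keep=2)
print(result)
```

If many chunks survive, the final pass may again exceed the context window; repeating the round on the survivors is one option.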

### One Shot

In [41]:
model_id = "google/gemma-3-1b-it"

chunk_results = []
for chunk in chunk_list(options, 10):
    ranked_chunk = generate_chunk_ranking(query, chunk, type="OneShot")
    print("✅ Ranked chunk output:", ranked_chunk)
    chunk_results.extend(ranked_chunk)  # collect all reordered titles

The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


user
I will give you a list of job titles and a search term. 
Rank them in the list by how semantically similar they are to the search term.
Do not change the job titles I give you — just reorder them in a ranked list from most to least similar.

Search term: Marketing Manager
Job titles:
- software developer scheme designers inc developed features for aircraft configurators
- graduate research and teaching assistant at michigan state university
- data science data engineer machine learning data scientist project 990 ms in data science iub seeking full-time roles may 25
- senior big data engineer actively seeking for new opportunities data engineer big data sql aws hadoop azure pyspark kafka yarn hdfs scala etl
- staff data scientist and analytics engineer equifax
- phd in mol bio bioinformatic data science i ai ml l aws saa linux python and r docker kubernetes devops cicd pipelinegit-github terraform l ansible
- encargado de marketing y ventas en sego eventos
- data analyst at kingsto

The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


user
I will give you a list of job titles and a search term. 
Rank them in the list by how semantically similar they are to the search term.
Do not change the job titles I give you — just reorder them in a ranked list from most to least similar.

Search term: Marketing Manager
Job titles:
- software developer scheme designers inc developed features for aircraft configurators
- graduate research and teaching assistant at michigan state university
- data science data engineer machine learning data scientist project 990 ms in data science iub seeking full-time roles may 25
- senior big data engineer actively seeking for new opportunities data engineer big data sql aws hadoop azure pyspark kafka yarn hdfs scala etl
- staff data scientist and analytics engineer equifax
- phd in mol bio bioinformatic data science i ai ml l aws saa linux python and r docker kubernetes devops cicd pipelinegit-github terraform l ansible
- encargado de marketing y ventas en sego eventos
- data analyst at kingsto

The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


user
I will give you a list of job titles and a search term. 
Rank them in the list by how semantically similar they are to the search term.
Do not change the job titles I give you — just reorder them in a ranked list from most to least similar.

Search term: Marketing Manager
Job titles:
- software developer scheme designers inc developed features for aircraft configurators
- graduate research and teaching assistant at michigan state university
- data science data engineer machine learning data scientist project 990 ms in data science iub seeking full-time roles may 25
- senior big data engineer actively seeking for new opportunities data engineer big data sql aws hadoop azure pyspark kafka yarn hdfs scala etl
- staff data scientist and analytics engineer equifax
- phd in mol bio bioinformatic data science i ai ml l aws saa linux python and r docker kubernetes devops cicd pipelinegit-github terraform l ansible
- encargado de marketing y ventas en sego eventos
- data analyst at kingsto

In [42]:
chunk_results

['data science data engineer machine learning data scientist project 990 ms in data science iub seeking full',
 'senior big data engineer actively seeking for new opportunities data engineer big data sql aws azure hdfs yarn kafka',
 'software developer scheme designers inc developed features for aircraft configurators',
 'graduate research',
 'teaching assistant at michigan state university',
 'enchargado de marketing y ventas en sego eventos',
 'data analyst at kingston brass',
 'data science data engineer machine learning data scientist',
 'staff data scientist',
 'analytics engineer equifax',
 'data science graduate student academic services assistant',
 'data analyst at western frontier trading llc transforming data into actionable insights sql python machine learning data visualization psychology graduate ba',
 'assistant professor of electrical',
 'computer engineering kennesaw state university',
 'computer engineer',
 'data analystsis administrator',
 'data scientist results',
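
The first entries of this list are the titles from the "Marketing Manager" example in the prompt. That is expected with the current code: the decoded text contains the prompt as well as the generation, and the parser starts at the first "here's the" it finds. A minimal sketch of one fix is to parse only the text after the last `model` line; `response_only` is a hypothetical helper. A common alternative is to slice the generated token ids at the prompt length before decoding.

```python
def response_only(decoded):
    """Return the text after the last line that is exactly 'model'."""
    lines = decoded.split("\n")
    for i in range(len(lines) - 1, -1, -1):
        if lines[i].strip() == "model":
            return "\n".join(lines[i + 1:])
    # No marker found: return the text unchanged.
    return decoded

# Illustrative decoded text: a one-shot example followed by the real generation
decoded = (
    "user\nSearch term: Marketing Manager\n\nmodel\n"
    "Okay, here's the ranked list:\n1. **example title**\n\n"
    "Search term: Financial Analyst\nJob titles:\n- a\n\nmodel\n"
    "Okay, here's the ranked list:\n1. **a**"
)
answer = response_only(decoded)
print(answer)
```

This relies on the model not emitting a bare `model` line inside its own answer, which is an assumption worth checking on real outputs.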
 

In [None]:
xxx

NameError: name 'xxx' is not defined

In [None]:
# Switch to Llama 3 70B Instruct (NousResearch copy) on the Hugging Face Hub
model_id = "NousResearch/Meta-Llama-3-70B-Instruct"

chunk_results = []
for chunk in chunk_list(options, 7):
    ranked_chunk = generate_chunk_ranking(query, chunk)
    print("✅ Ranked chunk output:", ranked_chunk)
    chunk_results.extend(ranked_chunk)  # collect all reordered titles

Fetching 30 files:   0%|          | 0/30 [00:00<?, ?it/s]

### Qwen

In [None]:
# !pip install hf_xet
!pip install -U huggingface_hub[hf_xet]

Collecting huggingface_hub[hf_xet]
  Using cached huggingface_hub-0.33.4-py3-none-any.whl.metadata (14 kB)
Using cached huggingface_hub-0.33.4-py3-none-any.whl (515 kB)
Installing collected packages: huggingface_hub
  Attempting uninstall: huggingface_hub
    Found existing installation: huggingface-hub 0.32.4
    Uninstalling huggingface-hub-0.32.4:
      Successfully uninstalled huggingface-hub-0.32.4
Successfully installed huggingface_hub-0.33.4


In [None]:

model_id = "Qwen/Qwen2.5-72B-Instruct"

chunk_results = []
for chunk in chunk_list(options, 10):
    ranked_chunk = generate_chunk_ranking(query, chunk, type="OneShot")
    print("✅ Ranked chunk output:", ranked_chunk)
    chunk_results.extend(ranked_chunk)  # collect all reordered titles

Fetching 37 files:   0%|          | 0/37 [00:00<?, ?it/s]

In [None]:
ranked_chunk

In [None]:
chunk_results

### Llama 3

**google/gemma-2-27b-it**, **NousResearch/Meta-Llama-3-70B-Instruct**, **Mistral-Large-Instruct-2411**

In [None]:
lines

In [None]:
seen

In [None]:
chunk_results

In [None]:
import re

def extract_ranked_job_titles(text):
    lines = text.strip().split("\n")
    titles = []
    start_ranking = False

    for line in lines:
        line = line.strip()

        # Skip empty or clearly non-ranking sections
        if not line:
            continue
        if "reasoning:" in line.lower():
            break  # Stop parsing after list ends
        if "most to least similar:" in line.lower():
            start_ranking = True
            continue
        if not start_ranking:
            continue

        # Remove leading numbers, bullets, markdown
        clean = re.sub(r"^\s*[\-*]?\s*\d+\.\s*", "", line)      # remove "1. " or "- 1."
        clean = re.sub(r"\*\*", "", clean)                      # remove bold markers
        clean = re.sub(r"\s*\(.*?\)", "", clean)                # remove things like "(Most Similar)"
        clean = clean.strip()

        if clean and len(clean.split()) >= 2:
            titles.append(clean)

    return titles


In [None]:
sample_output = """
Job titles:
- Junior Software Engineer
- Senior Data Scientist
- Marketing Analyst
- Lead Machine Learning Engineer
- AI Researcher
- Product Manager

model
Here's the ranking of the job titles by semantic similarity to the search term "senior data scientist," from most to least similar:

1.  **Senior Data Scientist** (Most Similar)
2.  Lead Machine Learning Engineer
3.  AI Researcher
4.  Product Manager
5.  Marketing Analyst
6.  Junior Software Engineer 

**Reasoning:**
...
"""

print(extract_ranked_job_titles(sample_output))


In [None]:
chunk_results = []

for chunk in chunk_list(options, 50):
    ranked_chunk = generate_chunk_ranking(query, chunk)
    chunk_results.extend(ranked_chunk)

print("🎯 FINAL chunk_results:", chunk_results)


In [None]:
chunk_results = []  # reset so results from an earlier run are not duplicated

for i, chunk in enumerate(chunk_list(options, 50)):
    print(f"\n🔍 Processing chunk {i+1}")
    ranked_chunk = generate_chunk_ranking(query, chunk)
    print("✅ Ranked chunk output:", ranked_chunk)
    chunk_results.extend(ranked_chunk)

print("\n🎯 FINAL chunk_results:", chunk_results)

In [None]:
data = ['user',
 'I will give you a list of job titles and a search term.',
 'Rank the job titles by how semantically similar they are to the search term.',
 'Do not change the titles — just reorder them in a ranked list from most to least similar.',
 'Search term: senior data scientist',
 'Job titles:',
 'Junior Software Engineer',
 'Senior Data Scientist',
 'Marketing Analyst',
 'Lead Machine Learning Engineer',
 'AI Researcher',
 'Product Manager',
 'model',
 'Here\'s the ranking of the job titles by semantic similarity to the search term "senior data scientist," from most to least similar:',
 '**Senior Data Scientist** (Most Similar)',
 'Lead Machine Learning Engineer',
 'AI Researcher',
 'Product Manager',
 'Marketing Analyst',
 'Junior Software Engineer',
 '**Reasoning:**',
 '*   **Senior Data Scientist** is the closest in terms of the core responsibilities and skill set – a senior role focused on data science.',
 '*   **Lead Machine Learning Engineer** directly mirrors the role of a senior data scientist, focusing on leading a team and complex machine learning projects.',
 '*   **AI Researcher** is closely related, often involving research and development of new AI techniques, which overlaps with data science.',
 "*   **Product Manager** is a broader role, but often involves significant data analysis and understanding of user behavior, which can be a component of a data scientist's work.",
 "*   **Marketing Analyst** is a data-driven role, but it's less focused on the technical complexities of statistical modeling and more on interpreting data for marketing purposes.",
 '*   **Junior Software Engineer** is the least related, representing a foundational role']

# extract_ranked_job_titles expects a single string, so join the list of lines first
extract_ranked_job_titles("\n".join(data))

In [None]:
import re
test = "**Senior Data Scientist** (Most Similar)"
# Matching pattern (unchanged, works as shown)
pattern = r'^\*{0,2}[A-Za-z\s]+\*{0,2}(?:\s*\(Most Similar\))?$'
print(re.match(pattern, test))  # Should match
# New cleaning pattern
cleaned = re.sub(r'^\*+|\*+\s*\(Most Similar\)$|\*+$', '', test).strip()
print(cleaned)  # Should print: Senior Data Scientist

In [None]:
# Start at the first ranked entry, which the model tags with "(Most Similar)"
start_index = next(i for i, item in enumerate(data) if re.search(r'\(Most Similar\)', item))
# Extract job titles from the ranked list section
job_titles = []
seen = set()

# A title is letters and spaces, optionally wrapped in Markdown ** and
# optionally followed by "(Most Similar)" after the closing **
title_pattern = r'^\*{0,2}[A-Za-z\s]+\*{0,2}(?:\s*\(Most Similar\))?$'

for item in data[start_index:]:
    if re.match(title_pattern, item.strip()):
        # Remove Markdown ** and "(Most Similar)" to get the clean title
        title = re.sub(r'\*+|\s*\(Most Similar\)', '', item).strip()
        if title not in seen:
            seen.add(title)
            job_titles.append(title)

In [None]:
job_titles

In [None]:
chunk_results

In [None]:
# Find the start index by matching any title containing "(Most Similar)"
start_index = next(i for i, item in enumerate(chunk_results) if re.search(r'\(Most Similar\)', item))

# Extract job titles from the ranked list section
job_titles = []
seen = set()

# Letters and spaces, optionally wrapped in Markdown ** and optionally
# followed by "(Most Similar)"
title_pattern = r'^\*{0,2}[A-Za-z\s]+\*{0,2}(?:\s*\(Most Similar\))?$'

for item in chunk_results[start_index:]:
    if re.match(title_pattern, item.strip()):
        # Keep only the title part, removing Markdown ** and "(Most Similar)" if present
        title = re.sub(r'\*+|\s*\(Most Similar\)', '', item).strip()
        if title not in seen:
            seen.add(title)
            job_titles.append(title)

print(job_titles)

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "google/flan-t5-large"

GenAI_tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
n = 5
query = "data scientist"
options = titles
# options = get_titles(n, query)

# prompt = """Which song fits the query.
# QUERY: I'm feeling so sad rn 
# OPTIONS 
# -happy song
# -some sad song 
# -a very happy song"""

prompt = build_prompt(query, options)

input_ids = GenAI_tokenizer(prompt, return_tensors="pt").input_ids  
outputs = model.generate(input_ids)
print(GenAI_tokenizer.decode(outputs[0], skip_special_tokens=True))

Use a Gemma model from Hugging Face. Also try Llama 3, Qwen3, Mistral ... a smol (small) model ... a DeepSeek model

In [None]:
# prompt = """Which song fits the query.
# QUERY: I'm feeling so sad rn 
# OPTIONS 
# -happy song
# -some sad song 
# -a very happy song"""

# # prompt = build_prompt(query, options)

# input_ids = GenAI_tokenizer(prompt, return_tensors="pt").input_ids  
# outputs = model.generate(input_ids)
# print(GenAI_tokenizer.decode(outputs[0], skip_special_tokens=True))

In [None]:
options

In [None]:
prompt

In [None]:
# decoded_output

# Scratch

In [None]:
titles = df['job_title'].to_list()
titles = list(set(titles))
titles

In [None]:
len(titles)

In [None]:
query

In [None]:
query = popular_jobs[5]
best_options = []
batch_size = 50
for i in range(0, len(titles), batch_size):
    batch = titles[i:i + batch_size]
    prompt = build_prompt(query, batch)
    input_ids = GenAI_tokenizer(prompt, return_tensors="pt").input_ids  
    outputs = model.generate(input_ids)
    response = GenAI_tokenizer.decode(outputs[0], skip_special_tokens=True)
    best_options.append(response)
print(query)

print(best_options)

In [None]:

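# Second round: re-rank the first-round winners.
# Copy the list so that appending to best_options does not also grow the input.
titles = list(best_options)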
batch_size = 50
for i in range(0, len(titles), batch_size):
    batch = titles[i:i + batch_size]
    prompt = build_prompt(query, batch)
    input_ids = GenAI_tokenizer(prompt, return_tensors="pt").input_ids  
    outputs = model.generate(input_ids)
    response = GenAI_tokenizer.decode(outputs[0], skip_special_tokens=True)
    best_options.append(response)
print(query)

print(best_options)

In [None]:
len(best_options)

In [None]:
query = popular_jobs[5]
best_options = []
batch_size = 50
for i in range(0, len(titles), batch_size):
    batch = titles[i:i + batch_size]
    prompt = build_prompt(query, batch)
    input_ids = GenAI_tokenizer(prompt, return_tensors="pt").input_ids  
    outputs = model.generate(input_ids)
    response = GenAI_tokenizer.decode(outputs[0], skip_special_tokens=True)
    best_options.append(response)
print(query)
best_options

In [None]:
# titles = ['Experienced Software Engineer AI Automation Specialist Python Cloud Enthusiast',
#  'data scientistgeoscientist',
#  'Data Analyst Experience with Python SQL Tableau Power BI Excel and Alteryx LLB Data Scientist',
#  'Business Analyst Data Science Competitive Programmer Web Developer',
#  'Data Scientist',
#  'data analyst python sql machine learning power bi tableau sql server',
#  'msc data science graduate full stack developer python sql java react',
#  'data analyst with 2 years of experience in extracting analyzing and visualizing data to empower organizations with',
#  'ms in ds at university of new haven grad 2023 actively seeking for',
#  'looking for a opportunity in field of Data Analytic Data Science NLP Power Bi SQL Machine Learning',
#  'software engineer',
#  '3rd year data science student at arizona state university. passionate about turning data into',
#  'actively seeking full-time opportunities in data science python sql r spark machine',
#  'experienced data analyst actively seeking opportunities to leverage data for business success sql python tableau',
#  'actively looking for internships and full-time opportunities 2025',
#  'Data Analyst Machine Learning Scientist',
#  'ms in cs k-state specializing in ai m',
#  'aspiring data analyst sql python tableau power bi skilled in excel data-driven',
#  'data engineer ai machine learning enthusiast sql python big data cloud technologies ',
#  'Digital Marketing Strategist Growth Performance Marketing SEO PPC Social Media Expert',
#  'data scientist ms in data science',
#  'data analyst turning complex data into actionable insights passionate about solving business challenges with data-driven solutions',
#  'actively seeking data analyst jobs',
#  'msc physics interested in responsible ai data',
#  'mlops engineer vosyn ai cicd docker aws']
# query = popular_jobs[5]
# best_options = []
# batch_size = 50
# for i in range(0, len(titles), batch_size):
#     batch = titles[i:i + batch_size]
#     prompt = build_prompt(query, batch)
#     input_ids = GenAI_tokenizer(prompt, return_tensors="pt").input_ids  
#     outputs = model.generate(input_ids)
#     response = GenAI_tokenizer.decode(outputs[0], skip_special_tokens=True)
#     best_options.append(response)
# print(query)
# best_options

In [None]:
print(query)
# best_options

In [None]:
batch

In [None]:
def generate_best_option(model, tokenizer, query, titles, batch_size=50):
    """Ask the model for the best-matching title in each batch of `titles`."""
    best_options = []
    for i in range(0, len(titles), batch_size):
        # Slice the `titles` argument, not the global `options`
        batch = titles[i:i + batch_size]
        prompt = build_prompt(query, batch)
        # Truncate long prompts and cap the length of the generated answer
        input_ids = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=1024).input_ids
        outputs = model.generate(input_ids, max_new_tokens=20)
        response = tokenizer.decode(outputs[0], skip_special_tokens=True)
        best_options.append(response)
    return best_options

# def generate_best_option(model, tokenizer, query, titles, batch_size=50):
#     best_options = []
#     for i in range(0, len(titles), batch_size):
#         batch = titles[i:i + batch_size]
#         prompt = build_prompt(query, batch)
#         input_ids = tokenizer(prompt, return_tensors="pt").input_ids  
#         outputs = model.generate(input_ids)
#         response = tokenizer.decode(outputs[0], skip_special_tokens=True)
#         best_options.append(response)
#     return best_options

In [None]:
options

In [None]:
query = popular_jobs[3]
options = generate_best_option(model, GenAI_tokenizer, query, titles, batch_size=50)
options

In [None]:
query

In [None]:
xxxx

- Llama 4: one of the variants supports a large context window
- Lack of compute and GPU; running on CPU is 8-10x slower
- FLAN-T5 is outdated
- Build a prompt from scratch in a different way
- Prompt: "I'm going to give you a list of candidates' job titles as well as a search term. I want you to rank the job titles based on their semantic similarity to the search term. I want you to return the ranked list without changing the job titles, only their order. Here is the search term {} and here is the list of job titles {}"
- Give clearer instructions
- Use a Gemma model from Hugging Face; also use Llama 3 and Qwen3
- Evaluate qualitative performance
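
The prompt idea in the notes above can be sketched as a small helper. This is only an illustration: `build_ranking_prompt` is a hypothetical name (it is not the notebook's `build_prompt`), and the exact wording and the bulleted layout of the titles are assumptions.

```python
def build_ranking_prompt(search_term, job_titles):
    """Format a ranking instruction, the search term, and the titles into one prompt."""
    # One title per line so the model can copy each title back unchanged
    title_lines = "\n".join(f"- {title}" for title in job_titles)
    return (
        "I will give you a list of job titles and a search term.\n"
        "Rank the job titles by how semantically similar they are to the search term.\n"
        "Do not change the titles; only reorder them from most to least similar.\n"
        f"Search term: {search_term}\n"
        "Job titles:\n"
        f"{title_lines}"
    )


example_prompt = build_ranking_prompt(
    "senior data scientist",
    ["Junior Software Engineer", "Senior Data Scientist", "AI Researcher"],
)
print(example_prompt)
```

Keeping the titles on separate lines also makes the model's answer easier to parse line by line afterwards.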

In [None]:

for query in popular_jobs:
    options = generate_best_option(model, GenAI_tokenizer, query, titles, batch_size=50)
    
    prompt = build_prompt(query, options)

    input_ids = GenAI_tokenizer(prompt, return_tensors="pt").input_ids  
    outputs = model.generate(input_ids)
    decoded_output = GenAI_tokenizer.decode(outputs[0], skip_special_tokens=True)
    
    print(f"Query: {query}")
    
    df = get_all_similarity(query)
    df_compare = compare_results(n, query)
    display(df_compare)
    print(f"AI Generated Output: {decoded_output}\n")
    print("")
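
The loop above works in two stages: pick one best title per batch, then ask the model again over those winners. A minimal sketch of that control flow, without any model, is shown here. `two_stage_select` and `overlap_pick` are hypothetical names, and `overlap_pick` is only a word-overlap stand-in for the LLM call, not the notebook's method.

```python
def two_stage_select(query, titles, pick_best, batch_size=50):
    """Pick one winner per batch, then pick the overall winner among the winners."""
    winners = [
        pick_best(query, titles[i:i + batch_size])
        for i in range(0, len(titles), batch_size)
    ]
    return pick_best(query, winners), winners


def overlap_pick(query, batch):
    """Stand-in for the model: choose the title sharing the most words with the query."""
    query_words = set(query.lower().split())
    return max(batch, key=lambda title: len(query_words & set(title.lower().split())))


sample_titles = [
    "Data Scientist",
    "Software Engineer",
    "Senior Data Scientist",
    "Product Manager",
]
best, winners = two_stage_select(
    "senior data scientist", sample_titles, overlap_pick, batch_size=2
)
print(winners)
print(best)
```

Passing the picker as an argument keeps the batching logic separate from whichever model produces the per-batch answer.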

# End Scratch

In [None]:
xxx

In [None]:
# n = 5

# for query in popular_jobs:

#     options = get_titles(n, query)
#     prompt = build_prompt(query, options)

#     input_ids = GenAI_tokenizer(prompt, return_tensors="pt").input_ids  
#     outputs = model.generate(input_ids)
#     decoded_output = GenAI_tokenizer.decode(outputs[0], skip_special_tokens=True)
    
#     print(f"Query: {query}")
#     print(f"AI Generated Output: {decoded_output}\n")
    # print(" ")

In [None]:
for query in popular_jobs:
    options = get_titles(n, query)
    prompt = build_prompt(query, options)

    input_ids = GenAI_tokenizer(prompt, return_tensors="pt").input_ids  
    outputs = model.generate(input_ids)
    decoded_output = GenAI_tokenizer.decode(outputs[0], skip_special_tokens=True)
    
    print(f"Query: {query}")
    
    df = get_all_similarity(query)
    df_compare = compare_results(n, query)
    # print(job)
    display(df_compare)
    print(f"AI Generated Output: {decoded_output}\n")
    print("")

In [None]:
xxxx

In [None]:
outputs

In [None]:
prompt = """Sort all songs based on fit to query in descending order.
QUERY: I'm feeling so upbeat rn 
OPTIONS: 
-crazy song
-some song 
-blues
-a very happy song"""

input_ids = tokenizer(prompt, return_tensors="pt").input_ids  
outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

### Scratch

In [None]:
sentence = "What time is it, Tom?"

sentence_encoded = gen_ai_tokenizer(sentence, return_tensors='pt')

sentence_decoded = gen_ai_tokenizer.decode(
        sentence_encoded["input_ids"][0], 
        skip_special_tokens=True
    )

print('ENCODED SENTENCE:')
print(sentence_encoded["input_ids"][0])
print('\nDECODED SENTENCE:')
print(sentence_decoded)

In [None]:
sentence_encoded["input_ids"][0]

In [None]:
# Create a vectorized representation for each job title in our dataframe
# (note: this averages token IDs, not embeddings)
job_titles = df.job_title_cleaned

doc_sent_gen_ai = []

for sentence in job_titles:
    sentence_encoded = gen_ai_tokenizer(sentence, return_tensors='pt')

    sentence_encoded_mean = sum(sentence_encoded["input_ids"][0]) / len(sentence_encoded["input_ids"][0]) # returning a mean for each job title
    
    # print(sentence)
    # print(sentence_encoded_mean)
    
    doc_sent_gen_ai.append(sentence_encoded_mean.item()) # returning a list for all job titles
    
# word_vec_mean = sum(word_vec_mean) / len(word_vec_mean) # This was indented but just fixed this round - if it breaks, this should be indented again
# doc_sent_vec.append(doc_sent_vec)
    
# return doc_sent_vec

In [None]:
sentence_encoded

In [None]:
sentence_decoded = gen_ai_tokenizer.decode(
        sentence_encoded["input_ids"][0], 
        skip_special_tokens=True
    )

sentence_decoded

In [None]:
sentence_encoded

In [None]:
query = "What time is it, Tom?"

q_sentence_encoded = gen_ai_tokenizer(query, return_tensors='pt')
q_sentence_encoded_mean = sum(q_sentence_encoded["input_ids"][0]) / len(q_sentence_encoded["input_ids"][0])

In [None]:
q_sentence_encoded_mean.item()

### Unscratch

In [None]:
import torch
import torch.nn.functional as F

# Mean pooling: average all token embeddings, ignoring padding positions
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output.last_hidden_state  # token-level embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Encode text into one normalized embedding per input
def gen_encode(texts):
    # Tokenize sentences
    encoded_input = gen_ai_tokenizer(texts, padding=True,
        truncation=True, return_tensors='pt')

    # Compute token embeddings with a forward pass; generate() returns token IDs,
    # not hidden states, so its output cannot be pooled.
    # Assumption: gen_model's forward output exposes last_hidden_state. For an
    # encoder-decoder model such as T5, run gen_model.get_encoder() here instead.
    with torch.no_grad():
        model_output = gen_model(**encoded_input, return_dict=True)

    # Perform pooling
    embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

    # Normalize embeddings
    embeddings = F.normalize(embeddings, p=2, dim=1)

    return embeddings

In [None]:
gen_encode(texts[0])

In [None]:
encoded_input.tokens

In [None]:
texts = df['job_title_cleaned'].to_list()

encoded_input = gen_ai_tokenizer(texts, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    # model_output = gen_model.generate(**encoded_input, return_dict=True)
    model_output = gen_model.generate(**encoded_input)

In [None]:
model_output

In [None]:
model_output[0]

In [None]:
encoded_input.items

In [None]:
encoded_input['input_ids'][0]

In [None]:
with torch.no_grad():
        model_output = gen_model(**encoded_input)

In [None]:
# get gen embedding for all docs
titles_list = df['job_title_cleaned'].to_list()
titles_list

In [None]:
gen_doc_emb = gen_encode(titles_list)

In [None]:
query

In [None]:


encoded_input = gen_ai_tokenizer(query, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    model_output = gen_model(**encoded_input, return_dict=True)
    # model_output = gen_model(**encoded_input)
    
mean_pooling(model_output, encoded_input['attention_mask'])

In [None]:
def get_gen_ai_query_similarity(gen_doc_emb, query):
    """
    gen_doc_emb: gen AI embeddings for all docs
    query: query text

    return: cosine similarity between query and all docs

    """
    # Embed the query
    query_gen = gen_encode(query)
    
    # Dot product between the query and all document embeddings;
    # this equals cosine similarity when the embeddings are L2-normalized
    cos_sim_gen = torch.mm(query_gen, gen_doc_emb.transpose(0, 1))[0].cpu().tolist()
    
    return cos_sim_gen
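
The function above takes a plain dot product and calls it cosine similarity. That holds only if both sides are L2-normalized, which is what the `F.normalize` step in `gen_encode` is meant to ensure. A small self-contained check with made-up vectors (the helper names here are illustrative, not part of the notebook):

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine(a, b):
    """Cosine similarity computed directly from the unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


query_vec = [1.0, 2.0, 2.0]  # made-up example values
doc_vec = [2.0, 0.0, 1.0]

direct = cosine(query_vec, doc_vec)
via_dot = dot(l2_normalize(query_vec), l2_normalize(doc_vec))
print(direct, via_dot)
```

If the embeddings were not normalized, the dot product would also reflect vector length, so longer titles could score higher for reasons unrelated to meaning.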

In [None]:
query = 'seeking human resources'
cos_sim_gen = get_gen_ai_query_similarity(gen_doc_emb, query)
cos_sim_gen

In [None]:
def get_all_similarity(query):
    
    #Bert similarity
    cos_sim_bert = get_bert_query_similarity(doc_emb, query)
    df['bert_fit'] = cos_sim_bert

    #Fasttext similarity
    cos_sim_fasttext = get_fasttext_query_similarity(doc_sent_vec_ft, query)
    df['fasttext_fit'] = cos_sim_fasttext

    # original TFIDF similarity and Word2Vec Similarity for comparison
    cos_sim = get_tf_idf_query_similarity(vectorizer, docs_tfidf, query) 
    df['tfidf_fit'] = cos_sim

    cos_sim_w2v = get_w2v_query_similarity(document_word_embeddings, query)
    df['w2v_fit'] = cos_sim_w2v

    cos_sim_glove = get_glove_query_similarity(doc_sent_vec, query)
    df['glove_fit'] = cos_sim_glove
    
    return df

In [None]:
query = 'seeking human resources'
df = get_all_similarity(query)
top_candidates(n = 10, by = 'bert_fit', ascending = False, min_con = 0)

In [None]:
# Create a fastText vectorized representation for each query
def q_sent_vec_ft(query):
    q_sent_vec_ft = []
    q_word_vec_ft = []
    
    for word in query.split():
        if word in ft_dict:
            vectors = ft_dict[word]
            q_word_vec_ft.append(vectors)
        else:
            q_word_vec_ft.append(oov_word)
    q_word_vec_mean_ft = sum(q_word_vec_ft) / len(q_word_vec_ft) # This was indented but just fixed this round - if it breaks, this should be indented again
    q_sent_vec_ft.append(q_word_vec_mean_ft)
        
    return q_sent_vec_ft

In [None]:
doc_sent_vec

In [None]:
example_indices = [40, 200]

dash_line = '-'.join('' for x in range(100))

for i, index in enumerate(example_indices):
    print(dash_line)
    print('Example ', i + 1)
    print(dash_line)
    print('INPUT DIALOGUE:')
    print(dataset['test'][index]['dialogue'])
    print(dash_line)
    print('BASELINE HUMAN SUMMARY:')
    print(dataset['test'][index]['summary'])
    print(dash_line)
    print()