## Potential Talent

### **Context:**

As a **talent sourcing and management company**, we are interested in **finding talented individuals** for sourcing these candidates to technology companies. **Finding talented candidates is not easy**, for **several reasons**. The **first** reason is one needs to understand what the role is very well to fill in that spot, this requires understanding the client’s needs and what they are looking for in a potential candidate. The **second** reason is one needs to understand what makes a candidate shine for the role we are in search for. **Third**, where to find talented individuals is another challenge.

The nature of our job requires a lot of human labor and is full of **manual operations**. Towards **automating this process** we want to build a better approach that could save us time and finally help us spot potential candidates that could fit the roles we are in search for. Moreover, going beyond that for a specific role we want to fill in we are interested in developing a machine learning powered pipeline that could spot talented individuals, and rank them based on their fitness.

We are right now semi-automatically sourcing a few candidates, therefore the sourcing part is not a concern at this time but we expect to first determine best matching candidates based on how fit these candidates are for a given role. We generally make these searches based on some keywords such as “full-stack software engineer”, “engineering manager” or “aspiring human resources” based on the role we are trying to fill in. These keywords might change, and you can expect that specific keywords will be provided to you.

Assuming that we were able to list and rank fitting candidates, we then employ a review procedure, as each candidate needs to be reviewed and then determined how good a fit they are through manual inspection. This procedure is done manually and at the end of this manual review, we might choose not the first fitting candidate in the list but maybe the 7th candidate in the list. If that happens, we are interested in being able to re-rank the previous list based on this information. This supervisory signal is going to be supplied by starring the 7th candidate in the list. Starring one candidate actually sets this candidate as an ideal candidate for the given role. Then, we expect the list to be re-ranked each time a candidate is starred.

### Data Description:

The data comes from our sourcing efforts. We removed any field that could directly reveal personal details and gave a unique identifier for each candidate.

#### Attributes:
**id** : unique identifier for candidate (numeric)

**job_title** : job title for candidate (text)

**location** : geographical location for candidate (text)

**connections** : number of connections candidate has, 500+ means over 500 (text)

**Output (desired target)**:
fit - how fit the candidate is for the role? (numeric, probability between 0-1)

Keywords: “Aspiring human resources” or “seeking human resources”

#### Download Data:

https://docs.google.com/spreadsheets/d/117X6i53dKiO7w6kuA1g1TpdTlv1173h_dPlJt5cNNMU/edit?usp=sharing

#### Goal(s):

Predict how fit the candidate is based on their available information (variable fit)

Success Metric(s):

Rank candidates based on a fitness score.

Re-rank candidates when a candidate is starred.

#### Bonus(es):

We are interested in a robust algorithm, tell us how your solution works and show us how your ranking gets better with each starring action.

How can we filter out candidates which in the first place should not be in this list?

Can we determine a cut-off point that would work for other roles without losing high potential candidates?

Do you have any ideas that we should explore so that we can even automate this procedure to prevent human bias?

In [None]:
# !pip install -U scikit-learn
# !pip install scikit-learn



In [77]:
# Importing Standard Libraries
import pandas as pd
import numpy as np
import os

from sklearn.metrics.pairwise import linear_kernel
pd.options.display.max_columns = 60

In [78]:
# Set the option to display the full text in DataFrame columns
pd.set_option('display.max_colwidth', None)

## Initial Exploratory Data Analysis

In [79]:
path = os.getcwd()

In [80]:
df = pd.read_csv(path + '\\potential-talents - Aspiring human resources - seeking human resources.csv').set_index('id')
df.head()

Unnamed: 0_level_0,job_title,location,connection,fit
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,2019 C.T. Bauer College of Business Graduate (Magna Cum Laude) and aspiring Human Resources professional,"Houston, Texas",85,
2,Native English Teacher at EPIK (English Program in Korea),Kanada,500+,
3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,
4,People Development Coordinator at Ryan,"Denton, Texas",500+,
5,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+,


In [81]:
df.replace('500+ ','501', inplace=True)
df['connection'] = pd.to_numeric(df['connection'])

In [82]:
df.job_title.value_counts()

job_title
2019 C.T. Bauer College of Business Graduate (Magna Cum Laude) and aspiring Human Resources professional                 7
Aspiring Human Resources Professional                                                                                    7
Student at Humber College and Aspiring Human Resources Generalist                                                        7
People Development Coordinator at Ryan                                                                                   6
Native English Teacher at EPIK (English Program in Korea)                                                                5
Aspiring Human Resources Specialist                                                                                      5
HR Senior Specialist                                                                                                     5
Student at Chapman University                                                                                            4
SVP, C

In [83]:
df[df.job_title=='HR Senior Specialist']

Unnamed: 0_level_0,job_title,location,connection,fit
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
8,HR Senior Specialist,San Francisco Bay Area,501,
26,HR Senior Specialist,San Francisco Bay Area,501,
38,HR Senior Specialist,San Francisco Bay Area,501,
51,HR Senior Specialist,San Francisco Bay Area,501,
61,HR Senior Specialist,San Francisco Bay Area,501,


In [84]:
# dropping duplicates
df = df.drop_duplicates()

In [85]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 53 entries, 1 to 104
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   job_title   53 non-null     object 
 1   location    53 non-null     object 
 2   connection  53 non-null     int64  
 3   fit         0 non-null      float64
dtypes: float64(1), int64(1), object(2)
memory usage: 2.1+ KB


# TF-IDF

Term Frequency-Inverse Document Frequency (Statistical Method)
### Prepping our Text for Modelling


In [86]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Prep our Text for Modelling
vectorizer = TfidfVectorizer(stop_words='english', ngram_range = (1, 2))
docs_tfidf = vectorizer.fit_transform(df["job_title"])

In [87]:
def get_tf_idf_query_similarity(vectorizer, docs_tfidf, query):
    """
    vectorizer: TfIdfVectorizer model
    docs_tfidf: tfidf vectors for all docs
    query: query doc

    return: cosine similarity between query and all docs
    """
    query_tfidf = vectorizer.transform([query])
    cos_sim = cosine_similarity(query_tfidf, docs_tfidf).flatten()
    
    return cos_sim

In [88]:
def top_candidates(n, by = 'fit', ascending = False, min_con = 0, location = df.location):
    
    df2 = df.loc[(df.connection >= min_con) & 
                 (df[by] > 0) & 
                 (df.location == location)].sort_values(by = by, ascending = ascending).head(n).copy()
    
    if df2.empty:
        return "There are no suitable candidates"
    
    else:
        return df2

In [89]:
query = 'Aspiring human resources'

cos_sim = get_tf_idf_query_similarity(vectorizer, docs_tfidf, query)

df['fit'] = cos_sim

top_candidates(n = 10, by = 'fit', ascending = False, min_con = 0)

Unnamed: 0_level_0,job_title,location,connection,fit
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
97,Aspiring Human Resources Professional,"Kokomo, Indiana Area",71,0.735855
3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,0.735855
6,Aspiring Human Resources Specialist,Greater New York City Area,1,0.632697
73,"Aspiring Human Resources Manager, seeking internship in Human Resources.","Houston, Texas Area",7,0.50888
72,Business Management Major and Aspiring Human Resources Manager,"Monroe, Louisiana Area",5,0.38759
27,Aspiring Human Resources Management student seeking an internship,"Houston, Texas Area",501,0.374733
66,Experienced Retail Manager and aspiring Human Resources Professional,"Austin, Texas Area",57,0.373847
7,Student at Humber College and Aspiring Human Resources Generalist,Kanada,61,0.358949
74,Human Resources Professional,Greater Boston Area,16,0.340769
79,Liberal Arts Major. Aspiring Human Resources Analyst.,"Baton Rouge, Louisiana Area",7,0.336485


In [90]:
top_candidates(n = 10)

Unnamed: 0_level_0,job_title,location,connection,fit
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
97,Aspiring Human Resources Professional,"Kokomo, Indiana Area",71,0.735855
3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,0.735855
6,Aspiring Human Resources Specialist,Greater New York City Area,1,0.632697
73,"Aspiring Human Resources Manager, seeking internship in Human Resources.","Houston, Texas Area",7,0.50888
72,Business Management Major and Aspiring Human Resources Manager,"Monroe, Louisiana Area",5,0.38759
27,Aspiring Human Resources Management student seeking an internship,"Houston, Texas Area",501,0.374733
66,Experienced Retail Manager and aspiring Human Resources Professional,"Austin, Texas Area",57,0.373847
7,Student at Humber College and Aspiring Human Resources Generalist,Kanada,61,0.358949
74,Human Resources Professional,Greater Boston Area,16,0.340769
79,Liberal Arts Major. Aspiring Human Resources Analyst.,"Baton Rouge, Louisiana Area",7,0.336485


In [91]:
top_candidates(n = 10, min_con = 90)

Unnamed: 0_level_0,job_title,location,connection,fit
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
27,Aspiring Human Resources Management student seeking an internship,"Houston, Texas Area",501,0.374733
82,Aspiring Human Resources Professional | An energetic and Team-Focused Leader,"Austin, Texas Area",174,0.31642
100,Aspiring Human Resources Manager | Graduating May 2020 | Seeking an Entry-Level Human Resources Position in St. Louis,"Cape Girardeau, Missouri",103,0.308829
76,Aspiring Human Resources Professional | Passionate about helping to create an inclusive and engaging work environment,"New York, New York",212,0.246772
28,Seeking Human Resources Opportunities,"Chicago, Illinois",390,0.220668
71,"Human Resources Generalist at ScottMadden, Inc.","Raleigh-Durham, North Carolina Area",501,0.196509
78,Human Resources Generalist at Schwan's,Amerika Birleşik Devletleri,501,0.196509
101,Human Resources Generalist at Loparex,"Raleigh-Durham, North Carolina Area",501,0.196509
68,Human Resources Specialist at Luxottica,Greater New York City Area,501,0.189503
89,Director Human Resources at EY,Greater Atlanta Area,349,0.187433


In [92]:
top_candidates(n = 50, location = 'Austin, Texas Area')

Unnamed: 0_level_0,job_title,location,connection,fit
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
66,Experienced Retail Manager and aspiring Human Resources Professional,"Austin, Texas Area",57,0.373847
82,Aspiring Human Resources Professional | An energetic and Team-Focused Leader,"Austin, Texas Area",174,0.31642


In [93]:
top_candidates(n = 50, location = 'Greater New York City Area')

Unnamed: 0_level_0,job_title,location,connection,fit
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
6,Aspiring Human Resources Specialist,Greater New York City Area,1,0.632697
68,Human Resources Specialist at Luxottica,Greater New York City Area,501,0.189503


In [94]:
query = 'Data Analyst'

cos_sim = get_tf_idf_query_similarity(vectorizer, docs_tfidf, query)

df['fit'] = cos_sim

top_candidates(n = 10)

Unnamed: 0_level_0,job_title,location,connection,fit
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
79,Liberal Arts Major. Aspiring Human Resources Analyst.,"Baton Rouge, Louisiana Area",7,0.242764
86,Information Systems Specialist and Programmer with a love for data and organization.,"Gaithersburg, Maryland",4,0.203453


In [95]:
top_candidates(n = 10, min_con = 90)

'There are no suitable candidates'

In [96]:
query = 'seeking human resources'

cos_sim = get_tf_idf_query_similarity(vectorizer, docs_tfidf, query = query)

df['fit'] = cos_sim

top_candidates(n = 10)

Unnamed: 0_level_0,job_title,location,connection,fit
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
99,Seeking Human Resources Position,"Las Vegas, Nevada Area",48,0.675682
28,Seeking Human Resources Opportunities,"Chicago, Illinois",390,0.675682
10,Seeking Human Resources HRIS and Generalist Positions,Greater Philadelphia Area,501,0.432761
94,Seeking Human Resources Opportunities. Open to travel and relocation.,Amerika Birleşik Devletleri,415,0.38129
73,"Aspiring Human Resources Manager, seeking internship in Human Resources.","Houston, Texas Area",7,0.362648
74,Human Resources Professional,Greater Boston Area,16,0.295223
75,"Nortia Staffing is seeking Human Resources, Payroll & Administrative Professionals!! (408) 709-2621","San Jose, California",501,0.273577
27,Aspiring Human Resources Management student seeking an internship,"Houston, Texas Area",501,0.245337
97,Aspiring Human Resources Professional,"Kokomo, Indiana Area",71,0.240319
3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,0.240319


In [33]:
# Reranking by learning to rank

# Word2Vec Gensim
Word embedding

In [None]:
# !pip install nltk
# !pip install keras
# !pip install -U gensim
# !pip install tensorflow


### Prepping our Text for Modelling

In [97]:
df.drop(columns="fit", inplace=True)

In [98]:
import re
import nltk
nltk.download('stopwords')

# processing texts for modelling
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
df['job_title_cleaned'] = df.job_title.apply(lambda x: " ".join(re.sub(r'[^a-zA-Z]',' ',w).lower() 
                                                                                for w in x.split() 
                                                                                if re.sub(r'[^a-zA-Z]',' ',w).lower() 
                                                                                not in stop_words) ) #nltk.download('stopwords')

[nltk_data] Downloading package stopwords to C:\Users\Alex
[nltk_data]     Chung\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
# !pip install keras-preprocessing

In [99]:
from tensorflow import keras

# tokenize and pad every document to make them of the same size
from tensorflow.keras.preprocessing.text import Tokenizer
# from keras.layers import TextVectorization
from keras_preprocessing.sequence import pad_sequences
tokenizer=Tokenizer()

tokenizer.fit_on_texts(df.job_title_cleaned)
tokenized_documents=tokenizer.texts_to_sequences(df.job_title_cleaned)
tokenized_paded_documents=pad_sequences(tokenized_documents,maxlen=64,padding='post')
vocab_size=len(tokenizer.word_index)+1

In [100]:
# loading pre-trained embeddings, each word is represented as a 300 dimensional vector
import gensim

# Navigating to directory where pre-trained embeddings were downloaded
os.chdir(r"C:\Users\Alex Chung\Documents\the_Lab\Apziva\Potential Talent")
W2V_PATH="GoogleNews-vectors-negative300.bin.gz"

In [101]:
model_w2v = gensim.models.KeyedVectors.load_word2vec_format(W2V_PATH, binary=True)
model_w2v[0][:10]

array([ 1.1291504e-03, -8.9645386e-04,  3.1852722e-04,  1.5335083e-03,
        1.1062622e-03, -1.4038086e-03, -3.0517578e-05, -4.1961670e-04,
       -5.7601929e-04,  1.0757446e-03], dtype=float32)

In [102]:
# creating embedding matrix, every row is a vector representation from the vocabulary indexed by the tokenizer index. 
embedding_matrix=np.zeros((vocab_size,300))
for word,i in tokenizer.word_index.items():
    if word in model_w2v:
        embedding_matrix[i]=model_w2v[word]
        
# creating document-word embeddings
document_word_embeddings=np.zeros((len(tokenized_paded_documents),64,300))
for i in range(len(tokenized_paded_documents)):
    for j in range(len(tokenized_paded_documents[0])):
        document_word_embeddings[i][j]=embedding_matrix[tokenized_paded_documents[i][j]]
document_word_embeddings.shape

(53, 64, 300)

In [103]:
document_word_embeddings[0][0]

array([-2.08007812e-01,  3.41796875e-02,  2.57568359e-02,  1.79687500e-01,
       -1.81640625e-01, -3.41796875e-02, -1.40625000e-01, -1.63085938e-01,
       -8.59375000e-02, -1.52343750e-01, -9.57031250e-02, -1.34765625e-01,
       -1.92382812e-01,  2.43164062e-01, -1.91406250e-01,  4.93164062e-02,
        2.60009766e-02,  3.28125000e-01, -7.37304688e-02,  5.05371094e-02,
       -1.52343750e-01, -1.57226562e-01, -1.44958496e-04, -2.51953125e-01,
       -4.22363281e-02, -1.72119141e-02, -4.84375000e-01,  2.07031250e-01,
       -1.40625000e-01, -1.35498047e-02, -1.78222656e-02,  5.95092773e-03,
       -3.10058594e-02, -2.75390625e-01, -2.65625000e-01,  9.52148438e-02,
       -4.55078125e-01,  1.13281250e-01, -1.33789062e-01,  1.18652344e-01,
       -5.37109375e-02,  8.10546875e-02,  7.32421875e-02,  6.39648438e-02,
       -9.47265625e-02,  4.39453125e-02,  1.46484375e-01, -8.59375000e-02,
       -1.58203125e-01,  1.63085938e-01, -1.32812500e-01,  2.50000000e-01,
       -5.61523438e-02,  

In [104]:
model_w2v['england'][:5]

array([-0.3671875 , -0.03491211,  0.11083984,  0.40039062,  0.18261719],
      dtype=float32)

In [105]:
def processing(query):
    df3 = pd.DataFrame([query], columns=['query'])
    stop_words = stopwords.words('english')
    df3['processed'] = df3['query'].apply(lambda x: " ".join(re.sub(r'[^a-zA-Z]',' ',w).lower() 
                                                                                for w in x.split() 
                                                                                if re.sub(r'[^a-zA-Z]',' ',w).lower() 
                                                                                not in stop_words) )
    
    tokenizer.fit_on_texts(df3.processed)
    tokenized_documents=tokenizer.texts_to_sequences(df3.processed)
    tokenized_paded_documents=pad_sequences(tokenized_documents,maxlen=64,padding='post')
    vocab_size=len(tokenizer.word_index)+1
    
    embedding_matrix=np.zeros((vocab_size,300))
    for word,i in tokenizer.word_index.items():
        if word in model_w2v:
            embedding_matrix[i]=model_w2v[word]

    # creating document-word embeddings
    query_document_word_embeddings=np.zeros((len(tokenized_paded_documents),64,300))
    for i in range(len(tokenized_paded_documents)):
        for j in range(len(tokenized_paded_documents[0])):
            query_document_word_embeddings[i][j]=embedding_matrix[tokenized_paded_documents[i][j]]
#     document_word_embeddings.shape
    
    return query_document_word_embeddings

In [106]:
processing('hello world!!!!').shape

(1, 64, 300)

In [107]:
processing('hello world!!!!')[0][:3][0][:20]

array([-0.05419922,  0.01708984, -0.00527954,  0.33203125, -0.25      ,
       -0.01397705, -0.15039062, -0.265625  ,  0.01647949,  0.3828125 ,
       -0.03295898, -0.09716797, -0.16308594, -0.04443359,  0.00946045,
        0.18457031,  0.03637695,  0.16601562,  0.36328125, -0.25585938])

In [108]:
def get_w2v_query_similarity(document_word_embeddings, query):
    """
    query_w2v: processing the query
    model_w2v: word2vec embedding for all docs
    query: query doc

    return: cosine similarity between query and all docs

    """
    query_w2v = processing(query)
    
    nsamples, nx, ny = query_w2v.shape
    query_w2v_reshape = query_w2v.reshape((nsamples,nx*ny))

    nsamples, nx, ny = document_word_embeddings.shape
    document_word_embeddings_reshape = document_word_embeddings.reshape((nsamples,nx*ny))
    
    cos_sim_w2v = cosine_similarity(query_w2v_reshape, document_word_embeddings_reshape).flatten()
    
    return cos_sim_w2v

In [109]:
def get_all_similarity(query):
    
    # Word2Vec Similarity
    cos_sim_w2v = get_w2v_query_similarity(document_word_embeddings, query)
    df['w2v_fit'] = cos_sim_w2v

    # Original TFIDF similarity for comparison
    cos_sim = get_tf_idf_query_similarity(vectorizer, docs_tfidf, query) 
    df['tfidf_fit'] = cos_sim

    return df

In [110]:
query = 'seeking human resources'

df = get_all_similarity(query)

top_candidates(n = 10, by = 'w2v_fit', ascending = False, min_con = 0)

Unnamed: 0_level_0,job_title,location,connection,job_title_cleaned,w2v_fit,tfidf_fit
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
99,Seeking Human Resources Position,"Las Vegas, Nevada Area",48,seeking human resources position,0.886226,0.675682
28,Seeking Human Resources Opportunities,"Chicago, Illinois",390,seeking human resources opportunities,0.839381,0.675682
10,Seeking Human Resources HRIS and Generalist Positions,Greater Philadelphia Area,501,seeking human resources hris generalist positions,0.703341,0.432761
3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,aspiring human resources professional,0.663209,0.240319
97,Aspiring Human Resources Professional,"Kokomo, Indiana Area",71,aspiring human resources professional,0.663209,0.240319
6,Aspiring Human Resources Specialist,Greater New York City Area,1,aspiring human resources specialist,0.645122,0.206629
94,Seeking Human Resources Opportunities. Open to travel and relocation.,Amerika Birleşik Devletleri,415,seeking human resources opportunities open travel relocation,0.639099,0.38129
89,Director Human Resources at EY,Greater Atlanta Area,349,director human resources ey,0.571728,0.162381
82,Aspiring Human Resources Professional | An energetic and Team-Focused Leader,"Austin, Texas Area",174,aspiring human resources professional energetic team focused leader,0.473859,0.103338
81,Senior Human Resources Business Partner at Heil Environmental,"Chattanooga, Tennessee Area",455,senior human resources business partner heil environmental,0.470671,0.102581


In [111]:
query = 'Senior Human Resources Business Partner at Heil Environmental'

df = get_all_similarity(query)

top_candidates(n = 10, by = 'w2v_fit', ascending = False, min_con = 0)

Unnamed: 0_level_0,job_title,location,connection,job_title_cleaned,w2v_fit,tfidf_fit
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
81,Senior Human Resources Business Partner at Heil Environmental,"Chattanooga, Tennessee Area",455,senior human resources business partner heil environmental,1.0,1.0
3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,aspiring human resources professional,0.466839,0.087131
97,Aspiring Human Resources Professional,"Kokomo, Indiana Area",71,aspiring human resources professional,0.466839,0.087131
28,Seeking Human Resources Opportunities,"Chicago, Illinois",390,seeking human resources opportunities,0.433847,0.069312
99,Seeking Human Resources Position,"Las Vegas, Nevada Area",48,seeking human resources position,0.431693,0.069312
89,Director Human Resources at EY,Greater Atlanta Area,349,director human resources ey,0.427987,0.058873
6,Aspiring Human Resources Specialist,Greater New York City Area,1,aspiring human resources specialist,0.427519,0.074916
69,"Director of Human Resources North America, Groupe Beneteau","Greater Grand Rapids, Michigan Area",501,director human resources north america groupe beneteau,0.37333,0.037961
94,Seeking Human Resources Opportunities. Open to travel and relocation.,Amerika Birleşik Devletleri,415,seeking human resources opportunities open travel relocation,0.368118,0.039113
10,Seeking Human Resources HRIS and Generalist Positions,Greater Philadelphia Area,501,seeking human resources hris generalist positions,0.364045,0.044393


In [113]:
top_candidates(n = 5, by = 'w2v_fit', ascending = False, min_con = 50)

Unnamed: 0_level_0,job_title,location,connection,job_title_cleaned,w2v_fit,tfidf_fit
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
81,Senior Human Resources Business Partner at Heil Environmental,"Chattanooga, Tennessee Area",455,senior human resources business partner heil environmental,1.0,1.0
97,Aspiring Human Resources Professional,"Kokomo, Indiana Area",71,aspiring human resources professional,0.466839,0.087131
28,Seeking Human Resources Opportunities,"Chicago, Illinois",390,seeking human resources opportunities,0.433847,0.069312
89,Director Human Resources at EY,Greater Atlanta Area,349,director human resources ey,0.427987,0.058873
69,"Director of Human Resources North America, Groupe Beneteau","Greater Grand Rapids, Michigan Area",501,director human resources north america groupe beneteau,0.37333,0.037961


In [114]:
top_candidates(n = 10, by = 'w2v_fit', ascending = False, min_con = 20, location = 'Greater New York City Area')

Unnamed: 0_level_0,job_title,location,connection,job_title_cleaned,w2v_fit,tfidf_fit
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
102,Business Intelligence and Analytics at Travelers,Greater New York City Area,49,business intelligence analytics travelers,0.127512,0.069344
68,Human Resources Specialist at Luxottica,Greater New York City Area,501,human resources specialist luxottica,0.117098,0.059524


In [115]:
query = 'Staff Data Scientist'

df = get_all_similarity(query)

top_candidates(n = 10, by = 'w2v_fit', ascending = False, min_con = 0)

Unnamed: 0_level_0,job_title,location,connection,job_title_cleaned,w2v_fit,tfidf_fit
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
68,Human Resources Specialist at Luxottica,Greater New York City Area,501,human resources specialist luxottica,0.237845,0.0
90,Undergraduate Research Assistant at Styczynski Lab,Greater Atlanta Area,155,undergraduate research assistant styczynski lab,0.219707,0.0
80,Junior MES Engineer| Information Systems,"Myrtle Beach, South Carolina Area",52,junior mes engineer information systems,0.211864,0.0
13,Human Resources Coordinator at InterContinental Buckhead Atlanta,"Atlanta, Georgia",501,human resources coordinator intercontinental buckhead atlanta,0.179386,0.0
4,People Development Coordinator at Ryan,"Denton, Texas",501,people development coordinator ryan,0.172276,0.0
66,Experienced Retail Manager and aspiring Human Resources Professional,"Austin, Texas Area",57,experienced retail manager aspiring human resources professional,0.171317,0.0
86,Information Systems Specialist and Programmer with a love for data and organization.,"Gaithersburg, Maryland",4,information systems specialist programmer love data organization,0.17124,0.287726
8,HR Senior Specialist,San Francisco Bay Area,501,hr senior specialist,0.167594,0.0
87,Bachelor of Science in Biology from Victoria University of Wellington,"Baltimore, Maryland",40,bachelor science biology victoria university wellington,0.147978,0.0
102,Business Intelligence and Analytics at Travelers,Greater New York City Area,49,business intelligence analytics travelers,0.141885,0.0


# GloVe - 

https://nlp.stanford.edu/projects/glove/

In [38]:
# Downloading GloVe pre-trained vectors
# !pip install wget
# import wget
# wget.download('https://nlp.stanford.edu/data/glove.840B.300d.zip')

In [39]:
# Extracting GloVe vector file
# import zipfile as zf
# files = zf.ZipFile("glove.840B.300d.zip", 'r')
# files.extractall('GloVe')
# files.close()

In [116]:
# Navigating to directory where GloVe pre-trained vectors were downloaded
os.chdir(r"C:\Users\Alex Chung\Documents\the_Lab\Apziva\Potential Talent\glove")
path = 'glove.840B.300d.txt'

In [117]:
with open(path) as file:
  for i in range(10):
    line = file.readline()
    print(line[:100])

, -0.082752 0.67204 -0.14987 -0.064983 0.056491 0.40228 0.0027747 -0.3311 -0.30691 2.0817 0.031819 0
. 0.012001 0.20751 -0.12578 -0.59325 0.12525 0.15975 0.13748 -0.33157 -0.13694 1.7893 -0.47094 0.704
the 0.27204 -0.06203 -0.1884 0.023225 -0.018158 0.0067192 -0.13877 0.17708 0.17709 2.5882 -0.35179 -
and -0.18567 0.066008 -0.25209 -0.11725 0.26513 0.064908 0.12291 -0.093979 0.024321 2.4926 -0.017916
to 0.31924 0.06316 -0.27858 0.2612 0.079248 -0.21462 -0.10495 0.15495 -0.03353 2.4834 -0.50904 0.087
of 0.060216 0.21799 -0.04249 -0.38618 -0.15388 0.034635 0.22243 0.21718 0.0068483 2.4375 -0.27418 0.
a 0.043798 0.024779 -0.20937 0.49745 0.36019 -0.37503 -0.052078 -0.60555 0.036744 2.2085 -0.23389 -0
in 0.089187 0.25792 0.26282 -0.029365 0.47187 -0.10389 -0.10013 0.08123 0.20883 2.5726 -0.67854 0.03
" -0.075242 0.57337 -0.31908 -0.18484 0.88867 -0.27381 0.077588 0.13905 -0.47746 1.4442 -0.56159 0.0
: 0.008746 0.33214 -0.29175 -0.15119 -0.41842 -0.23931 -0.23458 -0.055618 -0.09896 0.75175 

In [118]:
df_glove = pd.read_csv(path, sep=" ", quoting=3, header=None, index_col=0)
df_glove.T

Unnamed: 0,",",.,the,and,to,of,a,in,"""",:,is,for,I,),(,that,-,on,you,with,'s,it,The,are,by,at,be,this,as,from,...,trompettes,tylerdurden,unaturally,uniao,upstretched,usr/lib/oracle,v205,vakker,value-in-use,vampaneze,vinted,vocÃª,votesA,war/WEB-INF/lib,web.Our,what-might-have-been,wiid,windowsTransgender,woombie,wordsforyoungmen,work.Like,working.So,wried,wwent,xalisae,xtremecaffeine,yildirim,z/28,zipout,zulchzulu
1,-0.082752,0.012001,0.272040,-0.185670,0.319240,0.060216,0.043798,0.089187,-0.075242,0.008746,-0.084961,-0.172240,0.194100,-0.271420,-0.180240,0.098520,-0.20688,-0.070186,-0.110760,-0.099534,-0.068580,0.001363,-0.067679,-0.198590,-0.155520,-0.367690,-0.059177,-0.087595,-0.106480,0.013320,...,0.192320,0.664990,0.322690,0.201980,0.234880,0.510920,0.246270,0.33453,-0.265080,0.90660,0.752680,0.558040,-0.355560,1.10530,0.989460,0.562950,0.385100,-0.102350,0.65711,-0.378200,-0.23822,0.754650,0.54698,0.921790,0.337540,0.073032,0.222760,0.73440,0.21215,-0.079690
2,0.672040,0.207510,-0.062030,0.066008,0.063160,0.217990,0.024779,0.257920,0.573370,0.332140,0.502000,0.182340,0.226030,0.047374,0.008411,0.250010,0.66724,0.152740,0.307860,0.028202,0.464700,0.356530,0.094515,-0.062818,-0.337230,0.598210,0.106530,0.355020,-0.016295,-0.051085,...,-1.029000,0.154790,-0.412170,-0.505320,-0.948290,0.608750,-1.025400,-0.15606,-0.056282,-1.15230,-0.989670,-0.630740,-0.049174,-0.96066,-0.488150,-0.293780,-0.315230,-0.043862,-1.06710,-1.154600,-0.65700,-0.292360,-0.50515,-0.344320,-0.131110,-1.029400,-0.296390,-0.33641,-0.99456,-0.229050
3,-0.149870,-0.125780,-0.188400,-0.252090,-0.278580,-0.042490,-0.209370,0.262820,-0.319080,-0.291750,0.002382,-0.278470,-0.437640,-0.172780,-0.304630,-0.270180,-0.14633,-0.330860,-0.519800,-0.231890,0.132140,-0.055497,-0.251730,-0.366140,-0.097191,0.132290,-0.216130,0.063868,-0.227550,-0.132070,...,-0.166900,-0.177860,0.044183,0.178180,0.414610,-0.199980,0.583060,0.62839,1.231800,-1.24830,-0.043626,-0.296180,-0.181340,0.15525,-0.697660,1.279200,-0.103690,0.193270,-0.27787,0.387420,-0.18234,-0.211210,0.56164,-0.508880,0.155950,-0.015436,0.694120,0.26918,1.17820,0.803660
4,-0.064983,-0.593250,0.023225,-0.117250,0.261200,-0.386180,0.497450,-0.029365,-0.184840,-0.151190,-0.167550,-0.084666,-0.113870,-0.029084,0.209970,-0.231860,0.42040,0.116090,0.035138,0.094477,0.185990,-0.166070,-0.242680,-0.417860,-0.216170,0.235060,-0.086178,0.292920,-0.189340,0.403860,...,-1.609700,0.020382,0.382080,0.453010,0.153540,0.581050,-0.104480,0.25511,-0.391860,-0.43616,-0.578280,0.175330,0.653210,0.54527,-0.900320,-0.070849,-0.110770,-0.225560,0.48507,1.288100,-0.27082,-0.105820,-0.29412,-0.000386,-1.124400,0.726150,0.193620,0.41843,2.07210,-0.788650
5,0.056491,0.125250,-0.018158,0.265130,0.079248,-0.153880,0.360190,0.471870,0.888670,-0.418420,0.307210,0.254420,-0.072725,-0.219100,0.085153,0.022378,0.19229,-0.173360,0.103680,0.121910,-0.037015,0.003140,-0.610930,0.209620,-0.300910,-0.046757,0.005223,-0.236350,0.141670,0.211350,...,-0.183750,-0.494010,0.356000,-0.482230,0.511450,0.809050,-0.214520,-0.56103,-0.619020,0.13476,0.247810,-0.033645,-1.041300,0.84744,0.392040,-0.487520,0.172950,-0.148480,-0.51166,-0.802670,0.37388,-0.295270,-0.35497,-0.151450,0.100460,-0.992460,-0.312760,-0.18900,-0.44271,-0.405670
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
296,0.053380,0.063500,-0.018168,-0.039709,-0.258100,0.329200,0.080421,0.193680,-0.212800,0.700590,0.275500,-0.122380,-0.036109,0.066062,0.539500,-0.196140,0.46287,-0.074133,-0.309460,0.062584,-0.267630,-0.308240,0.045725,0.528140,-0.058269,0.288220,0.212630,-0.273250,0.160700,0.021495,...,0.072681,-0.510110,0.963790,0.968080,0.444420,0.106410,0.171370,-0.41248,-0.048871,-0.29475,-0.014543,-0.056681,-0.582340,1.34180,-1.764400,-0.080746,0.092665,-0.223050,-0.43677,0.503820,0.28653,-0.484500,-0.21834,-0.406440,-0.094080,0.127380,-0.413980,-0.48432,-0.41382,-0.205220
297,-0.050821,0.140190,0.114070,0.324980,-0.044629,-0.175970,-0.061246,-0.325460,-0.226150,-0.213710,-0.067180,-0.081083,0.112210,-0.241770,-0.235960,-0.270970,0.29255,0.398360,-0.218780,0.041636,0.042565,-0.262490,0.294420,-0.112190,-0.208670,-0.159410,-0.171000,-0.331120,-0.061876,-0.177390,...,0.083195,0.205560,0.222970,0.286810,0.809500,0.224640,-0.034606,0.26303,-0.038934,0.93432,-0.115660,0.342740,0.191160,0.33063,1.222400,-0.571630,-0.087504,0.161510,-0.43438,0.067534,-0.83764,-0.343440,0.44104,0.119900,-0.361960,0.129570,0.240180,-1.00550,-0.21139,0.268780
298,-0.191800,0.138710,0.130150,-0.023452,0.082745,0.117090,-0.300990,0.144210,0.328000,-0.286770,-0.215110,-0.126680,0.091957,-0.319120,-0.385520,-0.062639,-0.23318,-0.137350,-0.059105,-0.103020,-0.094786,-0.112370,-0.184880,-0.584510,-0.051586,0.272280,-0.427550,0.034460,-0.313430,0.063989,...,0.006273,0.760580,0.296960,-0.038455,0.036418,0.053789,0.079950,0.14481,0.585170,0.17064,0.015888,0.321060,0.202980,-0.39937,0.774630,0.428840,0.822160,0.119740,0.28584,0.852120,0.35327,0.935510,0.84572,0.380180,0.645910,0.422640,0.093864,0.63718,0.93427,-0.083561
299,-0.378460,-0.360490,-0.183170,0.123020,0.097801,-0.166920,-0.145840,-0.169000,-0.109340,-0.226630,-0.263040,-0.438560,0.386320,0.235390,0.243240,0.244240,0.68688,0.157500,0.476040,-0.039964,0.288780,0.078259,-0.035434,0.278790,0.062448,-0.048367,0.024956,-0.150270,0.087424,-0.530060,...,-0.266590,-0.117910,0.329110,0.109320,0.099993,0.099725,-0.223080,0.37887,-0.008621,0.56711,1.142700,0.864960,0.058047,0.10063,0.098948,-0.078753,-0.591150,0.290980,-0.18967,-0.973490,0.13764,-0.099674,-0.78417,0.345800,-0.080984,-0.113330,-0.165450,-0.13914,-0.93286,0.485320


In [119]:
glove = { key: val.values for key, val in df_glove.T.items() }

In [120]:
glove['man'][:20]

array([-1.7310e-01,  2.0663e-01,  1.6543e-02, -3.1026e-01,  1.9719e-02,
        2.7791e-01,  1.2283e-01, -2.6328e-01,  1.2522e-01,  3.1894e+00,
       -1.6291e-01, -8.8759e-02,  3.3067e-03, -2.9483e-03, -3.4398e-01,
        1.2779e-01, -9.4536e-02,  4.3467e-01,  4.9742e-01,  2.5068e-01])

In [121]:
unknown_word = df_glove.mean().values
unknown_word[:20]

array([ 0.22418612, -0.28881808,  0.13854355,  0.00365397, -0.12870769,
        0.1024395 ,  0.06162703,  0.07317769, -0.06135387, -1.34764119,
        0.42038748, -0.0635958 , -0.09683355,  0.18086288,  0.23704431,
        0.01412683,  0.1700973 , -1.14917018,  0.31498588,  0.06622261])

In [122]:
df_glove.head()

Unnamed: 0_level_0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,...,271,272,273,274,275,276,277,278,279,280,281,282,283,284,285,286,287,288,289,290,291,292,293,294,295,296,297,298,299,300
0,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1,Unnamed: 51_level_1,Unnamed: 52_level_1,Unnamed: 53_level_1,Unnamed: 54_level_1,Unnamed: 55_level_1,Unnamed: 56_level_1,Unnamed: 57_level_1,Unnamed: 58_level_1,Unnamed: 59_level_1,Unnamed: 60_level_1,Unnamed: 61_level_1
",",-0.082752,0.67204,-0.14987,-0.064983,0.056491,0.40228,0.002775,-0.3311,-0.30691,2.0817,0.031819,0.013643,0.30265,0.00713,-0.5819,-0.2774,-0.062254,1.1451,-0.24232,0.1235,-0.12243,0.33152,-0.006162,-0.30541,-0.13057,-0.054601,0.037083,-0.070552,0.5893,-0.30385,...,-0.4393,-0.26137,0.30088,-0.060772,-0.45312,-0.19076,-0.20288,0.27694,-0.060888,0.11944,0.62206,-0.19343,0.47849,-0.30113,0.059389,0.074901,0.061068,-0.4662,0.40054,-0.19099,-0.14331,0.018267,-0.18643,0.20709,-0.35598,0.05338,-0.050821,-0.1918,-0.37846,-0.06589
.,0.012001,0.20751,-0.12578,-0.59325,0.12525,0.15975,0.13748,-0.33157,-0.13694,1.7893,-0.47094,0.70434,0.26673,-0.089961,-0.18168,0.067226,0.053347,1.5595,-0.2541,0.038413,-0.01409,0.056774,0.023434,0.024042,0.31703,0.19025,-0.37505,0.035603,0.1181,0.012032,...,-0.26477,0.096566,0.062658,-0.30668,-0.43334,0.10006,0.21136,0.039459,-0.11077,0.24421,0.60942,-0.46646,0.086385,-0.39702,-0.23363,0.021307,-0.10778,-0.2281,0.50803,0.11567,0.16165,-0.066737,-0.29556,0.022612,-0.28135,0.0635,0.14019,0.13871,-0.36049,-0.035
the,0.27204,-0.06203,-0.1884,0.023225,-0.018158,0.006719,-0.13877,0.17708,0.17709,2.5882,-0.35179,-0.17312,0.43285,-0.10708,0.15006,-0.19982,-0.19093,1.1871,-0.16207,-0.23538,0.003664,-0.19156,-0.085662,0.039199,-0.066449,-0.04209,-0.19122,0.011679,-0.37138,0.21886,...,0.4823,-0.051759,-0.27285,-0.25893,0.16555,-0.1831,-0.06734,0.42457,0.010346,0.14237,0.25939,0.17123,-0.13821,-0.066846,0.015981,-0.30193,0.043579,-0.043102,0.35025,-0.19681,-0.4281,0.16899,0.22511,-0.28557,-0.1028,-0.018168,0.11407,0.13015,-0.18317,0.1323
and,-0.18567,0.066008,-0.25209,-0.11725,0.26513,0.064908,0.12291,-0.093979,0.024321,2.4926,-0.017916,-0.071218,-0.24782,-0.26237,-0.2246,-0.21961,-0.12927,1.0867,-0.66072,-0.031617,-0.057328,0.056903,-0.27939,-0.39825,0.14251,-0.085146,-0.14779,0.055067,-0.002869,-0.20917,...,0.019917,-0.28803,-0.010494,0.038412,-0.11718,-0.072462,0.16381,0.38488,-0.029783,0.23444,0.4532,0.14815,-0.027021,-0.073181,-0.1147,-0.005455,0.47796,0.090912,0.094489,-0.36882,-0.59396,-0.097729,0.20072,0.17055,-0.004736,-0.039709,0.32498,-0.023452,0.12302,0.3312
to,0.31924,0.06316,-0.27858,0.2612,0.079248,-0.21462,-0.10495,0.15495,-0.03353,2.4834,-0.50904,0.08749,0.21426,0.22151,-0.25234,-0.097544,-0.1927,1.3606,-0.11592,-0.10383,0.21929,0.11997,-0.11063,0.14212,-0.16643,0.21815,0.004209,-0.070012,-0.23532,-0.26518,...,0.62255,-0.072391,0.090129,0.15428,0.023163,-0.13028,0.061762,0.33803,-0.091581,0.21039,0.05108,0.19184,0.10444,0.2138,-0.35091,-0.23702,0.038399,-0.10031,0.18359,0.025178,-0.12977,0.3713,0.18888,-0.004274,-0.10645,-0.2581,-0.044629,0.082745,0.097801,0.25045


In [123]:
# Creating a vectorize representation for each job title in our dataframe
job_titles = df.job_title_cleaned

doc_sent_vec = []

for sentences in job_titles:
    word_vec = []
    for word in sentences.split():
        if word in glove:
            vectors = glove[word]
            word_vec.append(vectors)
        else:
            word_vec.append(unknown_word)
    word_vec_mean = sum(word_vec) / len(word_vec) # returning a mean for each job title
    doc_sent_vec.append(word_vec_mean) # returning a list for all job titles

In [124]:
doc_sent_vec[0].shape

(300,)

In [126]:
# Creating a vectorize representation for each query
def q_sent_vec(query):
    q_sent_vec = []
    q_word_vec = []
    
    for word in query.split():
        if word in glove:
            vectors = glove[word]
            q_word_vec.append(vectors)
        else:
            q_word_vec.append(unknown_word)
        q_word_vec_mean = sum(q_word_vec) / len(q_word_vec)
    q_sent_vec.append(q_word_vec_mean)
        
    return q_sent_vec

In [127]:
query = 'native english speaking'
len(q_sent_vec(query))

1

In [128]:
q_sent_vec(query)[0].shape

(300,)

In [129]:
q_sent_vec(query)[0][:5]

array([-0.29654333,  0.12640833, -0.49922333,  0.22307667,  0.4358    ])

In [130]:
query = 'student indiana university'
q_sent_vec(query)[0][:5]

array([-0.10656   ,  0.06428367,  0.10134093, -0.19890667,  0.51552   ])

In [132]:
def get_glove_query_similarity(doc_sent_vec, query):
    """
    query_glove: processing the query
    doc_sent_vec: glove embedding for all docs
    query: query doc

    return: cosine similarity between query and all docs

    """
    query_glove = q_sent_vec(query)
    
    cos_sim_glove = cosine_similarity(query_glove, doc_sent_vec).flatten()
    
    return cos_sim_glove

In [133]:
def get_all_similarity(query):
    
    #GloVe similarity
    cos_sim_glove = get_glove_query_similarity(doc_sent_vec, query)
    df['glove_fit'] = cos_sim_glove

    # original TFIDF similarity and Word2Vec Similarity for comparison
    cos_sim = get_tf_idf_query_similarity(vectorizer, docs_tfidf, query) 
    df['tfidf_fit'] = cos_sim

    cos_sim_w2v = get_w2v_query_similarity(document_word_embeddings, query)
    df['w2v_fit'] = cos_sim_w2v

    return df

In [134]:
query = 'Aspiring human resources'
df = get_all_similarity(query)
top_candidates(n = 10, by = 'glove_fit', ascending = False, min_con = 0)

Unnamed: 0_level_0,job_title,location,connection,job_title_cleaned,w2v_fit,tfidf_fit,glove_fit
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
97,Aspiring Human Resources Professional,"Kokomo, Indiana Area",71,aspiring human resources professional,0.898174,0.735855,0.851023
3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,aspiring human resources professional,0.898174,0.735855,0.851023
6,Aspiring Human Resources Specialist,Greater New York City Area,1,aspiring human resources specialist,0.873679,0.632697,0.848638
73,"Aspiring Human Resources Manager, seeking internship in Human Resources.","Houston, Texas Area",7,aspiring human resources manager seeking internship human resources,0.584569,0.50888,0.84536
74,Human Resources Professional,Greater Boston Area,16,human resources professional,0.13422,0.340769,0.836803
28,Seeking Human Resources Opportunities,"Chicago, Illinois",390,seeking human resources opportunities,0.619797,0.220668,0.825179
101,Human Resources Generalist at Loparex,"Raleigh-Durham, North Carolina Area",501,human resources generalist loparex,0.20252,0.196509,0.799749
68,Human Resources Specialist at Luxottica,Greater New York City Area,501,human resources specialist luxottica,0.151158,0.189503,0.790386
99,Seeking Human Resources Position,"Las Vegas, Nevada Area",48,seeking human resources position,0.654387,0.220668,0.77637
27,Aspiring Human Resources Management student seeking an internship,"Houston, Texas Area",501,aspiring human resources management student seeking internship,0.628601,0.374733,0.773825


In [None]:
query = 'seeking human resources'
df = get_all_similarity(query)
top_candidates(n = 10, by = 'glove_fit', ascending = False, min_con = 0)

Unnamed: 0_level_0,job_title,location,connection,job_title_cleaned,w2v_fit,tfidf_fit,glove_fit
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
28,Seeking Human Resources Opportunities,"Chicago, Illinois",390,seeking human resources opportunities,0.839381,0.675682,0.970024
99,Seeking Human Resources Position,"Las Vegas, Nevada Area",48,seeking human resources position,0.886226,0.675682,0.953714
73,"Aspiring Human Resources Manager, seeking internship in Human Resources.","Houston, Texas Area",7,aspiring human resources manager seeking internship human resources,0.431644,0.362648,0.935586
74,Human Resources Professional,Greater Boston Area,16,human resources professional,0.133104,0.295223,0.903558
94,Seeking Human Resources Opportunities. Open to travel and relocation.,Amerika Birleşik Devletleri,415,seeking human resources opportunities open travel relocation,0.639099,0.38129,0.885495
6,Aspiring Human Resources Specialist,Greater New York City Area,1,aspiring human resources specialist,0.645122,0.206629,0.874185
100,Aspiring Human Resources Manager | Graduating May 2020 | Seeking an Entry-Level Human Resources Position in St. Louis,"Cape Girardeau, Missouri",103,aspiring human resources manager graduating may seeking entry level human resources position st louis,0.343832,0.220083,0.870053
3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,aspiring human resources professional,0.663209,0.240319,0.864091
97,Aspiring Human Resources Professional,"Kokomo, Indiana Area",71,aspiring human resources professional,0.663209,0.240319,0.864091
88,Human Resources Management Major,"Milpitas, California",18,human resources management major,0.170611,0.177288,0.859179


In [136]:
query = 'senior data analyst'
df = get_all_similarity(query)
top_candidates(n = 10, by = 'glove_fit', ascending = False, min_con = 0)

Unnamed: 0_level_0,job_title,location,connection,job_title_cleaned,w2v_fit,tfidf_fit,glove_fit
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
8,HR Senior Specialist,San Francisco Bay Area,501,hr senior specialist,0.116745,0.23664,0.687605
86,Information Systems Specialist and Programmer with a love for data and organization.,"Gaithersburg, Maryland",4,information systems specialist programmer love data organization,0.170309,0.171334,0.676482
70,"Retired Army National Guard Recruiter, office manager, seeking a position in Human Resources.","Virginia Beach, Virginia",82,retired army national guard recruiter office manager seeking position human resources,0.058292,0.0,0.636123
79,Liberal Arts Major. Aspiring Human Resources Analyst.,"Baton Rouge, Louisiana Area",7,liberal arts major aspiring human resources analyst,0.037353,0.204439,0.625898
81,Senior Human Resources Business Partner at Heil Environmental,"Chattanooga, Tennessee Area",455,senior human resources business partner heil environmental,0.192444,0.156516,0.623344
80,Junior MES Engineer| Information Systems,"Myrtle Beach, South Carolina Area",52,junior mes engineer information systems,0.241222,0.0,0.620929
72,Business Management Major and Aspiring Human Resources Manager,"Monroe, Louisiana Area",5,business management major aspiring human resources manager,0.079708,0.0,0.616812
102,Business Intelligence and Analytics at Travelers,Greater New York City Area,49,business intelligence analytics travelers,0.164699,0.0,0.589284
66,Experienced Retail Manager and aspiring Human Resources Professional,"Austin, Texas Area",57,experienced retail manager aspiring human resources professional,0.18274,0.0,0.583173
84,Human Resources professional for the world leader in GIS software,"Highland, California",50,human resources professional world leader gis software,0.089697,0.0,0.564574


# Fasttext 
FastText is a library developed by Facebook for NLP - known for its training speed and accuracy.  

In [58]:
# import sys
# sys.path

# # !pip install wget
# !pip3.10 install --user wget

In [59]:
# # # Downloading fastText pre-trained vectors
# import wget
# wget.download('https://github.com/facebookresearch/fastText/archive/v0.9.2.zip')

In [60]:
# # # Extracting fastText vector file
# import zipfile as zf
# files = zf.ZipFile("fastText-0.9.2.zip", 'r')
# files.extractall()
# files.close()

In [59]:
os.chdir(r"C:\Users\Alex Chung\Documents\the_Lab\Apziva\Potential Talent\fastText-0.9.2")

#### Issues and workarounds with installing fasttext:

https://stackoverflow.com/questions/44951456/pip-error-microsoft-visual-c-14-0-is-required

In [None]:
# !pip install --upgrade pip
# !pip install --upgrade wheel
# !pip install --upgrade setuptools
# !pip install Cython --install-option="--no-cython-compile"

An error occurred during configuration: option use-feature: invalid choice: '2020-resolver' (choose from 'fast-deps', 'truststore', 'no-binary-enable-wheel-cache')


In [None]:
# !pip install fasttext
# !pip install fasttext-wheel

Collecting fasttext
  Downloading fasttext-0.9.3.tar.gz (73 kB)
     ---------------------------------------- 0.0/73.4 kB ? eta -:--:--
     ----- ---------------------------------- 10.2/73.4 kB ? eta -:--:--
     --------------- ---------------------- 30.7/73.4 kB 330.3 kB/s eta 0:00:01
     -------------------------------------- 73.4/73.4 kB 448.2 kB/s eta 0:00:00
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Collecting pybind11>=2.2 (from fasttext)
  Using cached pybind11-2.13.6-py3-none-any.whl.metadata (9.5 kB)
Using cached pybind11-2.13.6-py3-none-any.whl (243 kB)
Building wheels for collected packages: fasttext
  Building wheel for fasttext (pyproject.toml): started
  Building wheel for fasttext (pyp

  error: subprocess-exited-with-error
  
  × Building wheel for fasttext (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [105 lines of output]
      !!
      
              ********************************************************************************
              Usage of dash-separated 'description-file' will not be supported in future
              versions. Please use the underscore name 'description_file' instead.
      
              By 2025-Mar-03, you need to update your project and remove deprecated calls
              or your builds will no longer be supported.
      
              See https://setuptools.pypa.io/en/latest/userguide/declarative_config.html for details.
              ********************************************************************************
      
      !!
        opt = self.warn_dash_deprecation(opt, section)
      running bdist_wheel
      running build
      running build_py
      creating build\lib.win-amd64-cpython-312\fasttext
 

Collecting fasttext-wheel
  Downloading fasttext_wheel-0.9.2-cp312-cp312-win_amd64.whl.metadata (16 kB)
Collecting pybind11>=2.2 (from fasttext-wheel)
  Using cached pybind11-2.13.6-py3-none-any.whl.metadata (9.5 kB)
Downloading fasttext_wheel-0.9.2-cp312-cp312-win_amd64.whl (234 kB)
   ---------------------------------------- 0.0/234.7 kB ? eta -:--:--
   --- ----------------------------------- 20.5/234.7 kB 640.0 kB/s eta 0:00:01
   ----- --------------------------------- 30.7/234.7 kB 640.0 kB/s eta 0:00:01
   --------------------------- ------------ 163.8/234.7 kB 1.4 MB/s eta 0:00:01
   ---------------------------------------- 234.7/234.7 kB 1.6 MB/s eta 0:00:00
Using cached pybind11-2.13.6-py3-none-any.whl (243 kB)
Installing collected packages: pybind11, fasttext-wheel
Successfully installed fasttext-wheel-0.9.2 pybind11-2.13.6


In [60]:
import fasttext as fasttext

In [65]:
# Downloading pretrained model trained on Common Crawl and Wikipedia
# import fasttext.util
# fasttext.util.download_model('en', if_exists='ignore')  # English Skip downloading if you've already downloaded


In [61]:
ft = fasttext.load_model('cc.en.300.bin')

In [62]:
ft.get_word_vector('hello')[:20]

array([ 0.15757619,  0.04378209, -0.00451272,  0.06659314,  0.07703468,
        0.00485855,  0.00819822,  0.00652403,  0.009259  ,  0.0353899 ,
       -0.02313953, -0.04918071, -0.08326425,  0.01560145,  0.25485662,
        0.03454237, -0.01074514, -0.07801886, -0.07080995,  0.07623856],
      dtype=float32)

In [63]:
ft.get_words()[:10]

[',', 'the', '.', 'and', 'to', 'of', 'a', '</s>', 'in', 'is']

In [64]:
# Creating a dictionary of fasttext word and vector representaiton
ft_words = ft.get_words()
ft_vectors = [ft.get_word_vector(word) for word in ft_words]
ft_dict = dict(zip(ft_words, ft_vectors))

In [65]:
ft_dict['hello'][:20]

array([ 0.15757619,  0.04378209, -0.00451272,  0.06659314,  0.07703468,
        0.00485855,  0.00819822,  0.00652403,  0.009259  ,  0.0353899 ,
       -0.02313953, -0.04918071, -0.08326425,  0.01560145,  0.25485662,
        0.03454237, -0.01074514, -0.07801886, -0.07080995,  0.07623856],
      dtype=float32)

In [66]:
df_ft = pd.DataFrame(ft_dict.items(), columns = ['ft_words', 'ft_vectors'])

In [67]:
df_ft.head(10)

Unnamed: 0,ft_words,ft_vectors
0,",","[0.12502378, -0.10790165, 0.02450176, -0.25286365, 0.1057171, -0.018444797, 0.117678985, -0.07007254, -0.040074684, -0.008026216, 0.07716709, -0.02257145, 0.089262165, -0.04868145, -0.08966993, -0.08349128, 0.019988708, 0.027310487, -0.01935611, 0.09643278, 0.08747688, 0.009819358, 0.045297798, 0.015498773, 0.14624609, 0.022521427, 0.04475486, 0.013749474, 0.057015173, 0.1764235, -0.1071837, -0.082620285, 0.017277328, 0.10895962, 0.020679405, -0.12712738, 0.2444892, 0.037465177, -0.020877417, -0.044460505, 0.053991955, 0.12817593, 0.043671336, 0.058789518, 0.09843587, 0.05393798, 0.00044774427, 0.12903026, 0.024213549, -0.012008867, -0.048041053, 0.03460624, -0.06643045, -0.032984406, -0.06247217, -0.070759535, -0.057862796, 0.17382768, 0.44483587, 0.037006963, -0.10010116, -0.0031810577, 0.035880014, -0.06850616, -0.036060803, 0.007000481, 0.13161308, -0.094532624, -0.06097764, 0.017754983, -0.07628012, -0.019208273, 0.0032959182, 0.005632444, 0.18779793, -0.0754082, -0.009459897, 0.04464071, -0.058813374, 0.024390636, -0.025075123, -0.049303107, 0.030831667, -0.035886865, -0.18844126, -0.09883648, 0.18867746, 0.04589819, -0.08158643, -0.15238018, -0.037457667, -0.06915909, 0.042720053, -0.047074586, -0.008642857, -0.21905208, -0.0064076814, 0.08774324, -0.007448593, -0.1400358, ...]"
1,the,"[-0.051744193, 0.073963955, -0.01305688, 0.044726558, -0.034320366, 0.021216884, 0.0069114864, -0.016327847, -0.018074857, -0.0019965237, -0.10204669, 0.005904886, 0.025654055, -0.002596621, -0.058556058, -0.037758686, 0.016311873, 0.01463237, -0.008759298, -0.017594784, -0.008547327, -0.007793376, -0.018278033, 0.008798243, 0.0013020262, -0.093829416, 0.013899146, 0.014892999, -0.039370976, -0.029441122, 0.009422931, -0.025228418, -0.010441078, -0.22131945, -0.022859765, -0.008935269, -0.03222265, 0.08217016, 0.002099978, 0.028173504, 0.007170668, -0.009125605, -0.035169393, -0.017804421, -0.07055402, 0.06302309, -0.009246307, -0.022327038, -0.005585512, 0.0514723, -0.03069112, 0.043648228, -0.010969555, -0.055454243, 0.008938285, -0.06726995, 0.010507602, 0.05740975, 0.009920523, -0.028267926, 0.047040958, 0.0052922955, 0.0030449405, 0.00071547925, 0.044293776, 0.006895274, -0.033405542, 0.009057372, -0.0075827073, 0.006601395, 0.09174107, 0.031111507, 0.05429111, 0.028172497, -0.019965246, -0.033377998, 0.0052875523, 0.03638041, 0.22493297, 0.09276069, -0.012265386, 0.008560304, -0.059897833, 0.06762706, 0.04024453, 0.0011667766, 0.046392195, -0.043697126, 0.005942209, 0.09172087, -0.04124823, -0.015125338, -0.023081664, 0.009499152, 0.05883145, 0.027860444, 0.06469925, -0.056754317, -0.012956021, 0.047435097, ...]"
2,.,"[0.03423236, -0.08014102, 0.116187684, -0.39683825, -0.014666078, -0.05333376, 0.0606309, -0.105187, 0.0004822225, -0.036015246, 0.025738074, 0.017741874, 0.028525142, 0.0036812234, -0.041895356, 0.23742425, 0.0073372344, -0.030286761, -0.05776126, -0.061607026, 0.0064677577, 0.0054974114, 0.061985064, -0.0035603195, -0.107664384, -0.10458943, 0.06542359, -0.00065885123, 0.023493404, 0.044855215, 0.0012925226, -0.049584012, -0.0029731453, 0.13319224, 0.031394668, -0.015184948, 0.07726878, -0.3238144, -0.008129742, 0.01077384, -0.0478446, 0.10366743, -0.089419544, 0.14941524, 0.5012751, -0.18421888, -0.025935497, 0.07800455, -0.029555596, 0.059735887, 0.04384649, -0.047654208, -0.03593738, -0.06039133, 0.037578516, -0.045454044, -0.13247262, -0.05950857, -0.09992922, -0.08243029, -0.09629086, -0.08551892, -0.024352599, 0.50798106, -0.027145516, -0.08863297, -0.015968971, -0.050326522, -0.029528841, -0.01774156, 0.38464957, 0.10462516, 0.16921097, -0.011959946, 0.046539865, -0.08007814, 0.012553597, 0.05216411, 0.10962657, 0.20337108, 0.0128176045, 0.0064291875, -0.06376205, 0.02083857, 0.12471656, 0.0043035937, 0.08625324, 0.113382444, 0.03137607, 0.087006256, 0.058067933, 0.013879853, 0.112878084, 0.0039297733, -0.19282798, -0.1918144, -0.22638488, 0.031872883, -0.010841944, -0.057225518, ...]"
3,and,"[0.008239111, -0.089902766, 0.026525287, -0.0085538775, -0.060939744, 0.0067593013, 0.06523226, 0.010621363, -0.047525924, -0.0076114833, -0.0022587562, 0.00090840796, -0.0066886954, -0.02258287, -0.0066159326, -0.07248071, 0.020306896, 0.021449976, -0.050742615, 0.039191604, 0.053320736, -0.0045275465, -0.00186902, 0.05571583, 0.014650077, -0.05705756, 0.0004939493, -0.0070915874, 0.014855932, -0.05283078, 0.028485052, -0.049625643, 0.022202913, 0.058646828, 0.02132575, 0.007913686, 0.018733248, 0.09189167, -0.027784545, 0.023947353, 0.042574838, -0.009941715, -0.022348596, 0.08558617, -0.04476485, -0.014187764, -0.00012688711, -0.10023913, -0.012434232, 0.059548877, 0.023598243, 0.015628256, 0.019426357, 0.008582446, 0.015153295, 0.04160884, 0.019806052, -0.0074611306, -0.030770263, -0.021053193, 0.027433489, -0.041715737, -0.03561859, -0.021703135, 0.013787903, 0.02723198, 0.029117769, 0.016238129, 0.022331439, 0.03120071, -0.0075290566, 0.010258479, 0.029436834, 0.012595084, 0.07288315, -0.012444434, -0.044359386, 0.03307009, -0.09758015, -0.025886375, 0.005345704, 0.017554294, -0.019499544, 0.030036539, -0.031897828, 0.028107848, 0.05480569, -0.015252747, -0.04369261, -0.05411026, 0.011225717, 0.021609865, -0.036047105, -0.014485395, -0.0036204532, 0.019571058, 0.012176432, -0.057097465, -0.027010625, 0.06381462, ...]"
4,to,"[0.0046811374, 0.02812425, -0.029631453, -0.010813727, -0.062001865, -0.053247184, -0.09798437, 0.09703769, -0.086365804, -0.04250967, 0.027205678, -0.037576407, -0.042823818, 0.03736327, -0.0040072934, -0.035351653, 0.006441958, 0.01815276, -0.05837233, 0.66449475, 0.06439174, 0.07065982, -0.040183958, -0.10739311, -0.22830972, -0.07923524, 0.0025080463, -0.009183379, -0.20778863, 0.1934049, -0.030089566, -0.12588945, 0.044410635, 0.07125176, 0.043271232, -0.041530486, -0.6235891, 0.36537746, -0.012258322, -0.04234533, 0.043535333, 0.050220348, -0.13021699, 0.06494241, 0.26694646, 0.120960064, 0.03457541, 0.18504952, 0.05205511, 0.028337307, 0.07763192, -0.10175669, -0.056725707, -0.072766125, -0.025379542, -0.07050234, -0.12926963, 0.15151018, -0.11727919, -0.13767786, -0.020377263, -0.02169984, 0.0070206984, -0.5927937, -0.05946322, -0.0665735, 0.031815324, -0.08782044, -0.016749077, -0.046038333, 0.06996748, 0.021253472, 0.098476626, -0.020462183, 0.008500855, -0.09389196, 0.07737605, 0.11884761, -0.10668497, -0.20179836, 0.028771268, -0.039884202, -0.02189152, -0.013390127, 0.4109497, -0.023387067, 0.10170012, 0.2881374, -0.04492867, -0.33682323, -0.03149225, -0.08089485, 0.093989424, -0.05177318, -0.04751726, -0.5166862, -0.1999483, 0.022620726, -0.081143945, -0.043707617, ...]"
5,of,"[-7.303824e-05, -0.18774074, -0.07105116, -0.46324894, 0.00019782504, 0.0115067465, -0.058770157, 0.057423938, -0.02752339, -0.0035551887, -0.00979289, -0.041395325, 0.093476504, 0.054072887, -0.1959487, -0.29094914, -0.0632413, 0.046580408, -0.016015815, 0.030754661, -0.067918874, 0.09268616, -0.013569073, -0.014274055, -0.24648216, -0.122125186, -0.030555435, -0.01954064, -0.02387623, 0.19385405, 0.085706815, -0.023139978, 0.114564165, -0.16688682, -0.05639682, -0.06965482, 0.4156065, 0.09697553, -0.038180716, -0.053976264, -0.032692987, 0.103521354, 0.14758176, -0.028609693, -0.09773134, -0.58612776, -0.034827933, -0.15963723, 0.03680496, 0.034980565, -0.12743777, -0.076639794, -0.21874027, -0.10197091, 0.029214144, -0.0610992, -0.12475587, -0.12257973, -0.15607978, 0.0025683397, -0.05673503, -0.012803017, 0.019915858, -0.20575635, 0.017204322, -0.001666827, 0.08162742, 0.012283689, 0.037113506, 0.035213906, 0.6317011, 0.100954436, 0.084194995, 0.033921327, 0.017377023, -0.15896183, -0.023561172, 0.006464903, 0.43838742, -0.13262495, 0.0837738, -0.021634273, -0.0125222765, -0.0029243426, 0.15919153, 0.02480376, 0.308703, -0.734057, -0.16039647, 0.18091154, -0.07471148, 0.00041486407, 0.033960316, -0.06286861, 0.25210813, -0.13146792, -0.10915509, -0.08675993, -0.019711522, 0.013091255, ...]"
6,a,"[0.08764305, -0.49590126, -0.04985499, -0.093654394, -0.047178503, -0.021117728, 0.26236942, 0.0268894, -0.089959934, -0.036261387, -0.030623315, -0.014235822, 0.03367929, 0.100069314, -0.15231942, 0.6262041, -0.0114860255, -0.0062563913, 0.001060482, 0.012539644, -0.0591685, 0.07877674, -0.014958241, -0.09538608, 0.0027257048, 0.0076487586, -0.012263658, -0.003749517, 0.002154208, 0.2065641, 0.058354504, 0.01857947, 0.0607996, -0.2229555, 0.03791136, -0.043574236, 0.029297724, -0.37319386, 0.17104687, -0.0031074833, -0.11759481, 0.088898785, -0.010741252, 0.028883707, 0.11350054, 0.31754473, -0.03604033, 0.117753275, -0.003170231, -0.028610826, 0.08265117, 0.029414259, -0.13009402, -0.06169178, -0.018803172, 0.100972004, -0.038234353, -0.24120487, -0.22182922, -0.06455899, -0.041422192, -0.054245185, 0.03124349, -0.4290254, -0.11428877, -0.12948188, -0.041371375, -0.060233433, 0.07984591, -0.057600193, -0.020673078, 0.06542265, -0.0057717185, -0.09014553, -0.07131091, -0.096630655, 0.10818069, 0.070128806, 0.4568352, 0.026307512, -0.010435385, -0.020696575, -0.0028784582, 0.018490905, 0.4613926, -0.03574436, 0.18480995, 0.060354125, 0.098244615, -0.16591291, -0.049602885, -0.1048729, 0.055849317, -0.06042133, -0.27243835, 0.2925759, -0.19915986, 0.00474691, -0.0612738, -0.019686209, ...]"
7,</s>,"[0.073061086, -0.24302974, -0.035331346, -0.36307716, 0.038005635, 0.057967912, -0.19240326, 0.028341608, -0.073688455, 0.012652616, 0.09165293, -0.023344276, 0.08577412, -0.013607377, 0.0568931, 0.1741828, -0.08139733, -0.068106644, -0.033839006, -0.18333961, 0.01745886, 0.08023569, 0.11897105, -0.076179765, -0.32418066, -0.03948656, -0.079627715, 0.019259285, -0.03521772, 0.7771416, -0.025468618, -0.042181388, 0.060864676, 0.079217575, -0.10564223, -0.017011112, 0.21335451, -0.3750907, -0.007139957, 0.069730245, 0.0305717, 0.157805, -0.15869948, 0.004916782, 0.5307537, -0.14632113, 0.038294494, 0.27310213, 0.15618576, -0.07141493, -0.1342371, -0.0813873, 0.061510954, 0.021182325, -0.0446935, -0.09638717, -0.22248471, -0.0074498137, 0.10770239, 0.011026342, 0.016908865, 0.08392427, 0.03494793, 0.51371026, -0.0066248407, 0.10205543, 0.16887316, -0.094636954, 0.033916768, -0.03419787, 0.411129, -0.031160763, 0.060164817, -0.050208922, 0.13036817, 0.0027418889, 0.05457663, -0.13271675, 0.28996035, 0.11260987, 0.07594252, -0.037185814, 0.22133024, -0.04586053, 0.28527716, -0.02156251, -0.089473225, 0.1296509, -0.09233219, -0.01047978, -0.016067324, -0.17818375, -0.2344257, -0.008850978, -0.26237988, -0.052541, -0.3413344, 0.1510452, 0.04292434, -0.14043133, ...]"
8,in,"[-0.014047037, -0.25217462, 0.07150193, -0.024555584, -0.063723065, -0.036379293, 0.16186544, 0.026761735, -0.04750579, 0.01918093, 0.010213073, 0.024324609, 0.069100544, -0.013087587, 0.010005959, 0.03840606, 0.004898765, -0.064563274, 0.0061554275, -0.26439512, -0.04362177, 0.048197616, 0.09647768, 0.030350363, -0.041944303, -0.07252897, -0.03092399, 0.035789367, 0.018318348, 0.103446476, 0.11454638, -0.06875847, -0.005294898, -0.28194532, -0.22962308, -0.091969244, 0.003245044, 0.7827216, 0.09718003, -0.03441883, -0.081329264, 0.10418365, -0.009676969, -0.040444985, -0.14603288, -0.085874245, -0.057135433, 0.1671402, -0.03207896, -0.020384531, -0.03160913, 0.062324155, -0.16381112, -0.03750087, -0.051192444, -0.27564037, -0.20670308, -0.20152973, 0.08775812, -0.035025813, -0.038485676, -0.0003907204, 0.04736526, -0.12999825, 0.0051986678, -0.04836414, 0.08402147, -0.13813986, -0.064184144, 0.014048185, -0.104400575, -0.045596775, 0.08768672, -0.022863748, -0.041942406, -0.10827366, 0.07675704, -0.005818991, 0.06831849, -0.38194704, -0.048864186, -0.013682356, 0.004204192, 0.027389837, 0.0017619773, -0.07640405, 0.018783942, -0.0126136495, -0.02659796, -0.0022307348, -0.03972538, -0.032568187, 0.10041849, 0.00078873034, -0.37095523, 0.014291819, -0.50567734, -0.056323647, -0.032365527, -0.015043827, ...]"
9,is,"[-0.09776052, -0.20827363, -0.10372388, -0.016016617, -0.24025242, -0.04489053, 0.0029898915, 0.09899958, -0.040384315, -0.050430093, -0.09243243, 0.04039243, 0.08296915, -0.025647575, -0.17884456, 0.30788016, 0.028456949, -0.080813415, -0.04330902, -0.04204109, -0.01586664, 0.11471716, -0.055063657, -0.03386086, 0.16513433, -0.020072173, 0.015030911, 0.06260364, -0.2056474, 0.30481568, -0.09576765, 0.006035844, -0.065149985, 0.0363793, -0.015768338, -0.018179018, 0.20863846, -0.44426122, 0.19094028, -0.108303264, -0.14508459, -0.020933276, -0.05557899, -0.056745093, 0.22588195, -0.054717097, -0.083924346, 0.16085467, 0.093937494, 0.027538713, -0.033322804, -0.29504985, -0.08213945, -0.093954645, 0.004275933, -0.12296378, -0.06015374, -0.25978413, -0.08173254, 0.05399314, 0.011633151, -0.110731475, -0.048569657, -0.29798222, -0.02626277, 0.016136264, -0.058513246, -0.0037710408, 0.037993368, 0.036786538, 0.1951053, -0.09374667, -0.0006316002, 0.013760613, -0.034666326, -0.10093439, 0.16862835, 0.052446134, -0.37724254, 0.0031395552, 0.0809714, -0.014820747, -0.0809641, 0.07720839, -0.2655702, -0.12712148, 0.29013678, -0.13840415, 0.11516189, -0.021465525, -0.071355276, -0.11682801, 0.19231452, -0.11954482, -0.079179466, 0.5574007, -0.54753125, 0.0761819, -0.0829211, -0.05117509, ...]"


In [68]:
# May not need to do this for fasttext
oov_word = np.zeros((300,))

In [69]:
# Creating a fasttext vectorize representation for each job title in our dataframe
job_titles = df.job_title_cleaned

doc_sent_vec_ft = []

for sentences in job_titles:
    word_vec_ft = []
    for word in sentences.split():
        if word in ft_dict:
            vectors = ft_dict[word]
            word_vec_ft.append(vectors)
        else:
            word_vec_ft.append(oov_word)
    word_vec_mean_ft = sum(word_vec_ft) / len(word_vec_ft) # returning a mean for each job title
    doc_sent_vec_ft.append(word_vec_mean_ft) # returning a list for all job titles

In [70]:
# Creating a fasttext vectorize representation for each query
def q_sent_vec_ft(query):
    q_sent_vec_ft = []
    q_word_vec_ft = []
    
    for word in query.split():
        if word in ft_dict:
            vectors = ft_dict[word]
            q_word_vec_ft.append(vectors)
        else:
            q_word_vec_ft.append(oov_word)
    q_word_vec_mean_ft = sum(q_word_vec_ft) / len(q_word_vec_ft) # This was indented but just fixed this round - if it breaks, this should be indented again
    q_sent_vec_ft.append(q_word_vec_mean_ft)
        
    return q_sent_vec_ft

In [71]:
def get_fasttext_query_similarity(doc_sent_vec_ft, query):
    """
    query_fasttext: processing the query
    doc_sent_vec: glove embedding for all docs
    query: query doc

    return: cosine similarity between query and all docs

    """
    query_fasttext = q_sent_vec_ft(query)
    
    cos_sim_fasttext = cosine_similarity(query_fasttext, doc_sent_vec_ft).flatten()
    
    return cos_sim_fasttext

In [72]:


def get_all_similarity(query):

    #Fasttext similarity
    cos_sim_fasttext = get_fasttext_query_similarity(doc_sent_vec_ft, query)
    df['fasttext_fit'] = cos_sim_fasttext

    # original TFIDF similarity and Word2Vec Similarity for comparison
    cos_sim = get_tf_idf_query_similarity(vectorizer, docs_tfidf, query) 
    df['tfidf_fit'] = cos_sim

    cos_sim_w2v = get_w2v_query_similarity(document_word_embeddings, query)
    df['w2v_fit'] = cos_sim_w2v

    cos_sim_glove = get_glove_query_similarity(doc_sent_vec, query)
    df['glove_fit'] = cos_sim_glove
    
    return df

In [73]:
query = 'Aspiring human resources'
df = get_all_similarity(query)
top_candidates(n = 10, by = 'fasttext_fit', ascending = False, min_con = 0)

Unnamed: 0_level_0,job_title,location,connection,fit,job_title_cleaned,w2v_fit,tfidf_fit,glove_fit,fasttext_fit
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
97,Aspiring Human Resources Professional,"Kokomo, Indiana Area",71,0.240319,aspiring human resources professional,0.898174,0.735855,0.851023,0.905892
3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,0.240319,aspiring human resources professional,0.898174,0.735855,0.851023,0.905892
6,Aspiring Human Resources Specialist,Greater New York City Area,1,0.206629,aspiring human resources specialist,0.873679,0.632697,0.848638,0.888034
74,Human Resources Professional,Greater Boston Area,16,0.295223,human resources professional,0.13422,0.340769,0.836803,0.877046
73,"Aspiring Human Resources Manager, seeking internship in Human Resources.","Houston, Texas Area",7,0.362648,aspiring human resources manager seeking internship human resources,0.584569,0.50888,0.84536,0.860426
101,Human Resources Generalist at Loparex,"Raleigh-Durham, North Carolina Area",501,0.170244,human resources generalist loparex,0.20252,0.196509,0.799749,0.841223
68,Human Resources Specialist at Luxottica,Greater New York City Area,501,0.164174,human resources specialist luxottica,0.151158,0.189503,0.790386,0.834961
28,Seeking Human Resources Opportunities,"Chicago, Illinois",390,0.675682,seeking human resources opportunities,0.619797,0.220668,0.825179,0.815429
99,Seeking Human Resources Position,"Las Vegas, Nevada Area",48,0.675682,seeking human resources position,0.654387,0.220668,0.77637,0.78393
27,Aspiring Human Resources Management student seeking an internship,"Houston, Texas Area",501,0.245337,aspiring human resources management student seeking internship,0.628601,0.374733,0.773825,0.776401


In [74]:
query = 'seeking human resources'
df = get_all_similarity(query)
top_candidates(n = 10, by = 'fasttext_fit', ascending = False, min_con = 0)

Unnamed: 0_level_0,job_title,location,connection,fit,job_title_cleaned,w2v_fit,tfidf_fit,glove_fit,fasttext_fit
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
28,Seeking Human Resources Opportunities,"Chicago, Illinois",390,0.675682,seeking human resources opportunities,0.839381,0.675682,0.970024,0.981158
99,Seeking Human Resources Position,"Las Vegas, Nevada Area",48,0.675682,seeking human resources position,0.886226,0.675682,0.953714,0.95687
73,"Aspiring Human Resources Manager, seeking internship in Human Resources.","Houston, Texas Area",7,0.362648,aspiring human resources manager seeking internship human resources,0.431644,0.362648,0.935586,0.924971
74,Human Resources Professional,Greater Boston Area,16,0.295223,human resources professional,0.133104,0.295223,0.903558,0.90522
68,Human Resources Specialist at Luxottica,Greater New York City Area,501,0.164174,human resources specialist luxottica,0.150797,0.164174,0.852014,0.893244
101,Human Resources Generalist at Loparex,"Raleigh-Durham, North Carolina Area",501,0.170244,human resources generalist loparex,0.204287,0.170244,0.805987,0.876087
6,Aspiring Human Resources Specialist,Greater New York City Area,1,0.206629,aspiring human resources specialist,0.645122,0.206629,0.874185,0.871857
3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,0.240319,aspiring human resources professional,0.663209,0.240319,0.864091,0.865438
97,Aspiring Human Resources Professional,"Kokomo, Indiana Area",71,0.240319,aspiring human resources professional,0.663209,0.240319,0.864091,0.865438
27,Aspiring Human Resources Management student seeking an internship,"Houston, Texas Area",501,0.245337,aspiring human resources management student seeking internship,0.464157,0.245337,0.856657,0.830511


# BERT - 

In [None]:
# First install
# !pip install transformers 
# !pip install transformers -U --use-feature 2020-resolver

: 

In [None]:
# !pip install --upgrade pip

An error occurred during configuration: option use-feature: invalid choice: '2020-resolver' (choose from 'fast-deps', 'truststore', 'no-binary-enable-wheel-cache')


In [None]:
# !pip config set --user global.use-feature 2020-resolver

Writing to C:\Users\Alex Chung\AppData\Roaming\pip\pip.ini


In [None]:
# !pip install torch torchvision torchaudio

Collecting torch
  Downloading torch-2.6.0-cp312-cp312-win_amd64.whl.metadata (28 kB)
Collecting torchvision
  Downloading torchvision-0.21.0-cp312-cp312-win_amd64.whl.metadata (6.3 kB)
Collecting torchaudio
  Downloading torchaudio-2.6.0-cp312-cp312-win_amd64.whl.metadata (6.7 kB)
Collecting networkx (from torch)
  Using cached networkx-3.4.2-py3-none-any.whl.metadata (6.3 kB)
Collecting jinja2 (from torch)
  Using cached jinja2-3.1.5-py3-none-any.whl.metadata (2.6 kB)
Collecting sympy==1.13.1 (from torch)
  Downloading sympy-1.13.1-py3-none-any.whl.metadata (12 kB)
Collecting mpmath<1.4,>=1.1.0 (from sympy==1.13.1->torch)
  Using cached mpmath-1.3.0-py3-none-any.whl.metadata (8.6 kB)
Collecting pillow!=8.3.*,>=5.3.0 (from torchvision)
  Downloading pillow-11.1.0-cp312-cp312-win_amd64.whl.metadata (9.3 kB)
Downloading torch-2.6.0-cp312-cp312-win_amd64.whl (204.1 MB)
   ---------------------------------------- 0.0/204.1 MB ? eta -:--:--
   ---------------------------------------- 0.1/2

In [128]:
print(torch.__version__)

2.6.0+cpu


In [75]:
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

# Load the tokenizer and the model from HuggingFace Hub
bert_tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
bert_model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')

  from .autonotebook import tqdm as notebook_tqdm
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


In [77]:
# Mean Pooling - Take average of all tokens
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output.last_hidden_state #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


#Encode text
def encode(texts):
    # Tokenize sentences
    encoded_input = bert_tokenizer(texts, padding=True, truncation=True, return_tensors='pt')

    # Compute token embeddings
    with torch.no_grad():
        model_output = bert_model(**encoded_input, return_dict=True)

    # Perform pooling
    embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

    # Normalize embeddings
    embeddings = F.normalize(embeddings, p=2, dim=1)
    
    return embeddings

In [78]:
# get bert embedding for all docs
titles_list = df['job_title_cleaned'].to_list()

doc_emb = encode(titles_list)

In [79]:
def get_bert_query_similarity(doc_emb, query):
    """
    query_bert: processing the query
    doc_emb: bert embedding for all docs
    query: query doc

    return: cosine similarity between query and all docs

    """
    query_bert = encode(query)
    
    #Compute dot score between query and all document embeddings
    cos_sim_bert = torch.mm(query_bert, doc_emb.transpose(0, 1))[0].cpu().tolist()
    
    return cos_sim_bert

In [80]:
def get_all_similarity(query):
    
    #Bert similarity
    cos_sim_bert = get_bert_query_similarity(doc_emb, query)
    df['bert_fit'] = cos_sim_bert

    #Fasttext similarity
    cos_sim_fasttext = get_fasttext_query_similarity(doc_sent_vec_ft, query)
    df['fasttext_fit'] = cos_sim_fasttext

    # original TFIDF similarity and Word2Vec Similarity for comparison
    cos_sim = get_tf_idf_query_similarity(vectorizer, docs_tfidf, query) 
    df['tfidf_fit'] = cos_sim

    cos_sim_w2v = get_w2v_query_similarity(document_word_embeddings, query)
    df['w2v_fit'] = cos_sim_w2v

    cos_sim_glove = get_glove_query_similarity(doc_sent_vec, query)
    df['glove_fit'] = cos_sim_glove
    
    return df

In [81]:
query = 'seeking human resources'
df = get_all_similarity(query)
top_candidates(n = 10, by = 'bert_fit', ascending = False, min_con = 0)

Unnamed: 0_level_0,job_title,location,connection,fit,job_title_cleaned,w2v_fit,tfidf_fit,glove_fit,fasttext_fit,bert_fit
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
99,Seeking Human Resources Position,"Las Vegas, Nevada Area",48,0.675682,seeking human resources position,0.886226,0.675682,0.953714,0.95687,0.904125
28,Seeking Human Resources Opportunities,"Chicago, Illinois",390,0.675682,seeking human resources opportunities,0.839381,0.675682,0.970024,0.981158,0.899172
10,Seeking Human Resources HRIS and Generalist Positions,Greater Philadelphia Area,501,0.432761,seeking human resources hris generalist positions,0.703341,0.432761,0.817692,0.693091,0.798526
6,Aspiring Human Resources Specialist,Greater New York City Area,1,0.206629,aspiring human resources specialist,0.645122,0.206629,0.874185,0.871857,0.780673
3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,0.240319,aspiring human resources professional,0.663209,0.240319,0.864091,0.865438,0.7727
97,Aspiring Human Resources Professional,"Kokomo, Indiana Area",71,0.240319,aspiring human resources professional,0.663209,0.240319,0.864091,0.865438,0.7727
74,Human Resources Professional,Greater Boston Area,16,0.295223,human resources professional,0.133104,0.295223,0.903558,0.90522,0.727105
67,"Human Resources, Staffing and Recruiting Professional","Jackson, Mississippi Area",501,0.135583,human resources staffing recruiting professional,0.179637,0.135583,0.838173,0.827532,0.696157
73,"Aspiring Human Resources Manager, seeking internship in Human Resources.","Houston, Texas Area",7,0.362648,aspiring human resources manager seeking internship human resources,0.431644,0.362648,0.935586,0.924971,0.693516
100,Aspiring Human Resources Manager | Graduating May 2020 | Seeking an Entry-Level Human Resources Position in St. Louis,"Cape Girardeau, Missouri",103,0.220083,aspiring human resources manager graduating may seeking entry level human resources position st louis,0.343832,0.220083,0.870053,0.573788,0.669404


In [87]:
# WordtoVec  Same thing but with pretrained word embedding average of word
# Try to see who I'm connected with 
skill review surrvey - schedule interview - motivated 

SyntaxError: invalid syntax (<ipython-input-87-598d8a39b810>, line 3)

Process:
1. Sentence transformer:
    https://sbert.net/
    https://www.geeksforgeeks.org/sentence-similarity-using-bert-transformer/


2. Gen AI
https://stackoverflow.com/questions/75673222/semantic-searching-using-google-flan-t5

3. Utilizing LLM via prompting
GPT general purpose transformer - closed boxed model through an Open AI API
- Focus on instead, take advantage of open source LLM such as LLama 3 model from Meta
- Mistral, Llama 2, Grok maybe?

Bert

# Gen AI

In [None]:
# %pip install -U datasets==2.17.0

# %pip install --upgrade pip
# %pip install --disable-pip-version-check \
#     torch==1.13.1 \
#     torchdata==0.5.1 --quiet

# %pip install \
#     transformers==4.27.2 --quiet

In [82]:
from transformers import AutoModelForSeq2SeqLM
from transformers import AutoTokenizer
from transformers import GenerationConfig

In [83]:
model_name='google/flan-t5-base'

gen_ai_tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
gen_model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

In [None]:
Scratch

In [84]:
sentence = "What time is it, Tom?"

sentence_encoded = gen_ai_tokenizer(sentence, return_tensors='pt')

sentence_decoded = gen_ai_tokenizer.decode(
        sentence_encoded["input_ids"][0], 
        skip_special_tokens=True
    )

print('ENCODED SENTENCE:')
print(sentence_encoded["input_ids"][0])
print('\nDECODED SENTENCE:')
print(sentence_decoded)

ENCODED SENTENCE:
tensor([ 363,   97,   19,   34,    6, 3059,   58,    1])

DECODED SENTENCE:
What time is it, Tom?


In [85]:
sentence_encoded["input_ids"][0]

tensor([ 363,   97,   19,   34,    6, 3059,   58,    1])

In [102]:
sentence

'director administration excellence logging'

In [86]:
# Creating a vectorize representation for each job title in our dataframe
job_titles = df.job_title_cleaned

doc_sent_gen_ai = []

for sentence in job_titles:
    sentence_encoded = gen_ai_tokenizer(sentence, return_tensors='pt')

    sentence_encoded_mean = sum(sentence_encoded["input_ids"][0]) / len(sentence_encoded["input_ids"][0]) # returning a mean for each job title
    
    print(sentence)
    print(sentence_encoded_mean)
    
    doc_sent_gen_ai.append(sentence_encoded_mean.item()) # returning a list for all job titles
    
# word_vec_mean = sum(word_vec_mean) / len(word_vec_mean) # This was indented but just fixed this round - if it breaks, this should be indented again
# doc_sent_vec.append(doc_sent_vec)
    
# return doc_sent_vec

     c t  bauer college business graduate  magna cum laude  aspiring human resources professional
tensor(2549.7827)
native english teacher epik  english program korea 
tensor(5184.2310)
aspiring human resources professional
tensor(4716.3335)
people development coordinator ryan
tensor(2711.7144)
advisory board member celal bayar university
tensor(4359.5557)
aspiring human resources specialist
tensor(5296.6665)
student humber college aspiring human resources generalist
tensor(3160.3333)
hr senior specialist
tensor(1234.5000)
seeking human resources hris generalist positions
tensor(1138.)
student chapman university
tensor(4093.6001)
svp  chro  marketing   communications  csr officer   engie   houston   woodlands   energy   gphr   sphr
tensor(2024.7576)
human resources coordinator intercontinental buckhead atlanta
tensor(5590.4614)
aspiring human resources management student seeking internship
tensor(5203.)
seeking human resources opportunities
tensor(1593.)
experienced retail manager aspi

In [87]:
query = "What time is it, Tom?"

q_sentence_encoded = gen_ai_tokenizer(query, return_tensors='pt')
q_sentence_encoded_mean = sum(q_sentence_encoded["input_ids"][0]) / len(q_sentence_encoded["input_ids"][0])

In [88]:
q_sentence_encoded_mean.item()

454.625

In [None]:
Unscratch

In [None]:
# Mean Pooling - Take average of all tokens
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output.last_hidden_state #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


#Encode text
def gen_encode(texts):
    # Tokenize sentences
    encoded_input = gen_ai_tokenizer(texts, padding=True, \
        truncation=True, return_tensors='pt')

    # Compute token embeddings
    with torch.no_grad():
        generated_output = gen_model.generate(**encoded_input, return_dict=True)

    # Perform pooling
    embeddings = mean_pooling(_output, encoded_input['attention_mask'])

    # Normalize embeddings
    embeddings = F.normalize(embeddings, p=2, dim=1)
    
    return generated_output

In [113]:
encoded_input.tokens

<bound method BatchEncoding.tokens of {'input_ids': tensor([[    3,    75,     3,  ...,     0,     0,     0],
        [ 4262, 22269,  3145,  ...,     0,     0,     0],
        [    3, 25149,   936,  ...,     0,     0,     0],
        ...,
        [  268,  6123,  9952,  ...,     0,     0,     0],
        [  373,   356,  1269,  ...,     0,     0,     0],
        [ 2090,  3602,  8978,  ...,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])}>

In [115]:
texts = df['job_title_cleaned'].to_list()

encoded_input = gen_ai_tokenizer(texts, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    # model_output = gen_model.generate(**encoded_input, return_dict=True)
    model_output = gen_model.generate(**encoded_input)

In [116]:
model_output

tensor([[    0,     3,    75,  ...,     0,     0,     0],
        [    0, 19067,     1,  ...,     0,     0,     0],
        [    0,     3,  9406,  ...,     0,     0,     0],
        ...,
        [    0,   936,     1,  ...,     0,     0,     0],
        [    0,     3,     9,  ...,     0,     0,     0],
        [    0,  4210,     3,  ...,     3,    51,    52]])

In [117]:
model_output[0]

tensor([    0,     3,    75,    17,     3,  2635,    49,     3, 12513,    15,
            7,     3, 25149,   936,  1438,   771,     1,     0,     0,     0,
            0])

In [None]:
encoded_input

{'input_ids': tensor([[    3,    75,     3,  ...,     0,     0,     0],
        [ 4262, 22269,  3145,  ...,     0,     0,     0],
        [    3, 25149,   936,  ...,     0,     0,     0],
        ...,
        [  268,  6123,  9952,  ...,     0,     0,     0],
        [  373,   356,  1269,  ...,     0,     0,     0],
        [ 2090,  3602,  8978,  ...,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])}

In [118]:
with torch.no_grad():
        model_output = gen_model(**encoded_input, return_dict=True)

ValueError: You have to specify either decoder_input_ids or decoder_inputs_embeds

In [94]:
# get gen embedding for all docs
titles_list = df['job_title_cleaned'].to_list()

gen_doc_emb = gen_encode(titles_list)

TypeError: T5ForConditionalGeneration(
  (shared): Embedding(32128, 768)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 768)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=768, out_features=768, bias=False)
              (k): Linear(in_features=768, out_features=768, bias=False)
              (v): Linear(in_features=768, out_features=768, bias=False)
              (o): Linear(in_features=768, out_features=768, bias=False)
              (relative_attention_bias): Embedding(32, 12)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseGatedActDense(
              (wi_0): Linear(in_features=768, out_features=2048, bias=False)
              (wi_1): Linear(in_features=768, out_features=2048, bias=False)
              (wo): Linear(in_features=2048, out_features=768, bias=False)
              (dropout): Dropout(p=0.1, inplace=False)
              (act): NewGELUActivation()
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
      (1-11): 11 x T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=768, out_features=768, bias=False)
              (k): Linear(in_features=768, out_features=768, bias=False)
              (v): Linear(in_features=768, out_features=768, bias=False)
              (o): Linear(in_features=768, out_features=768, bias=False)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseGatedActDense(
              (wi_0): Linear(in_features=768, out_features=2048, bias=False)
              (wi_1): Linear(in_features=768, out_features=2048, bias=False)
              (wo): Linear(in_features=2048, out_features=768, bias=False)
              (dropout): Dropout(p=0.1, inplace=False)
              (act): NewGELUActivation()
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
    (final_layer_norm): T5LayerNorm()
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (decoder): T5Stack(
    (embed_tokens): Embedding(32128, 768)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=768, out_features=768, bias=False)
              (k): Linear(in_features=768, out_features=768, bias=False)
              (v): Linear(in_features=768, out_features=768, bias=False)
              (o): Linear(in_features=768, out_features=768, bias=False)
              (relative_attention_bias): Embedding(32, 12)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerCrossAttention(
            (EncDecAttention): T5Attention(
              (q): Linear(in_features=768, out_features=768, bias=False)
              (k): Linear(in_features=768, out_features=768, bias=False)
              (v): Linear(in_features=768, out_features=768, bias=False)
              (o): Linear(in_features=768, out_features=768, bias=False)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (2): T5LayerFF(
            (DenseReluDense): T5DenseGatedActDense(
              (wi_0): Linear(in_features=768, out_features=2048, bias=False)
              (wi_1): Linear(in_features=768, out_features=2048, bias=False)
              (wo): Linear(in_features=2048, out_features=768, bias=False)
              (dropout): Dropout(p=0.1, inplace=False)
              (act): NewGELUActivation()
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
      (1-11): 11 x T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=768, out_features=768, bias=False)
              (k): Linear(in_features=768, out_features=768, bias=False)
              (v): Linear(in_features=768, out_features=768, bias=False)
              (o): Linear(in_features=768, out_features=768, bias=False)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerCrossAttention(
            (EncDecAttention): T5Attention(
              (q): Linear(in_features=768, out_features=768, bias=False)
              (k): Linear(in_features=768, out_features=768, bias=False)
              (v): Linear(in_features=768, out_features=768, bias=False)
              (o): Linear(in_features=768, out_features=768, bias=False)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (2): T5LayerFF(
            (DenseReluDense): T5DenseGatedActDense(
              (wi_0): Linear(in_features=768, out_features=2048, bias=False)
              (wi_1): Linear(in_features=768, out_features=2048, bias=False)
              (wo): Linear(in_features=2048, out_features=768, bias=False)
              (dropout): Dropout(p=0.1, inplace=False)
              (act): NewGELUActivation()
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
    (final_layer_norm): T5LayerNorm()
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (lm_head): Linear(in_features=768, out_features=32128, bias=False)
) got multiple values for keyword argument 'return_dict'

In [119]:
gen_doc_emb

NameError: name 'gen_doc_emb' is not defined

In [124]:
query

'What time is it, Tom?'

In [None]:


encoded_input = gen_ai_tokenizer(query, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    model_output = gen_model(**encoded_input, return_dict=True)
    # model_output = gen_model(**encoded_input)
    
mean_pooling(model_output, encoded_input['attention_mask'])

ValueError: You have to specify either decoder_input_ids or decoder_inputs_embeds

In [None]:
def get_gen_ai_query_similarity(gen_doc_emb, query):
    """
    query_gen: processing the query
    gen_doc_emb: bert embedding for all docs
    query: query doc

    return: cosine similarity between query and all docs

    """
    query_gen = gen_encode(query)
    
    #Compute dot score between query and all document embeddings
    cos_sim_gen = torch.mm(query_gen, gen_doc_emb.transpose(0, 1))[0].cpu().tolist()
    
    return cos_sim_gen

In [None]:
query = 'seeking human resources'
cos_sim_gen = get_gen_ai_query_similarity(gen_doc_emb, query)
cos_sim_gen

ValueError: You have to specify either decoder_input_ids or decoder_inputs_embeds

In [None]:
def get_all_similarity(query):
    
    #Bert similarity
    cos_sim_bert = get_bert_query_similarity(doc_emb, query)
    df['bert_fit'] = cos_sim_bert

    #Fasttext similarity
    cos_sim_fasttext = get_fasttext_query_similarity(doc_sent_vec_ft, query)
    df['fasttext_fit'] = cos_sim_fasttext

    # original TFIDF similarity and Word2Vec Similarity for comparison
    cos_sim = get_tf_idf_query_similarity(vectorizer, docs_tfidf, query) 
    df['tfidf_fit'] = cos_sim

    cos_sim_w2v = get_w2v_query_similarity(document_word_embeddings, query)
    df['w2v_fit'] = cos_sim_w2v

    cos_sim_glove = get_glove_query_similarity(doc_sent_vec, query)
    df['glove_fit'] = cos_sim_glove
    
    return df

In [None]:
query = 'seeking human resources'
df = get_all_similarity(query)
top_candidates(n = 10, by = 'bert_fit', ascending = False, min_con = 0)

Unnamed: 0_level_0,job_title,location,connection,fit,job_title_cleaned,w2v_fit,tfidf_fit,glove_fit,fasttext_fit,bert_fit
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
99,Seeking Human Resources Position,"Las Vegas, Nevada Area",48,0.675682,seeking human resources position,0.886226,0.675682,0.953714,0.95687,0.904125
28,Seeking Human Resources Opportunities,"Chicago, Illinois",390,0.675682,seeking human resources opportunities,0.839381,0.675682,0.970024,0.981158,0.899172
10,Seeking Human Resources HRIS and Generalist Po...,Greater Philadelphia Area,501,0.432761,seeking human resources hris generalist positions,0.703341,0.432761,0.817692,0.693091,0.798526
6,Aspiring Human Resources Specialist,Greater New York City Area,1,0.206629,aspiring human resources specialist,0.645122,0.206629,0.874185,0.871857,0.780673
3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,0.240319,aspiring human resources professional,0.663209,0.240319,0.864091,0.865438,0.7727
97,Aspiring Human Resources Professional,"Kokomo, Indiana Area",71,0.240319,aspiring human resources professional,0.663209,0.240319,0.864091,0.865438,0.7727
74,Human Resources Professional,Greater Boston Area,16,0.295223,human resources professional,0.133104,0.295223,0.903558,0.90522,0.727105
67,"Human Resources, Staffing and Recruiting Profe...","Jackson, Mississippi Area",501,0.135583,human resources staffing recruiting professional,0.179637,0.135583,0.838173,0.827532,0.696157
73,"Aspiring Human Resources Manager, seeking inte...","Houston, Texas Area",7,0.362648,aspiring human resources manager seeking inte...,0.431644,0.362648,0.935586,0.924971,0.693516
100,Aspiring Human Resources Manager | Graduating ...,"Cape Girardeau, Missouri",103,0.220083,aspiring human resources manager graduating ...,0.343832,0.220083,0.870053,0.573788,0.669403


In [None]:
# Creating a fasttext vectorize representation for each query
def q_sent_vec_ft(query):
    q_sent_vec_ft = []
    q_word_vec_ft = []
    
    for word in query.split():
        if word in ft_dict:
            vectors = ft_dict[word]
            q_word_vec_ft.append(vectors)
        else:
            q_word_vec_ft.append(oov_word)
    q_word_vec_mean_ft = sum(q_word_vec_ft) / len(q_word_vec_ft) # This was indented but just fixed this round - if it breaks, this should be indented again
    q_sent_vec_ft.append(q_word_vec_mean_ft)
        
    return q_sent_vec_ft

In [None]:
doc_sent_vec

[array([ 7.34825000e-02, -5.35725000e-02,  3.20142500e-01, -1.47167500e-01,
        -7.46900000e-02,  1.64942075e-01, -4.36302500e-02, -2.70900000e-01,
        -2.01575000e-03,  2.53940000e+00, -2.37806250e-01, -1.38730000e-01,
        -1.82355000e-01,  4.30290000e-02,  1.12399500e-01, -2.00297500e-01,
         2.71710000e-01,  8.48010000e-01,  7.55265000e-02,  1.29407250e-01,
         6.19197500e-02, -3.99665000e-02, -9.88310000e-02, -1.15147500e-01,
        -6.96875000e-02,  8.05385000e-02,  7.30035000e-02,  1.39980000e-01,
         1.56244250e-01,  5.41490000e-02,  6.71452500e-02, -1.97808250e-01,
         1.47617500e-01,  1.05322750e-01, -1.60950000e-02, -2.61552500e-02,
        -4.53952500e-01, -8.71215000e-02, -3.96845000e-02,  1.49897500e-01,
         3.24370500e-01,  2.84050000e-02, -7.98422500e-02,  4.22825000e-03,
         8.92950000e-02, -2.11195000e-01,  5.29450000e-03,  2.19537250e-01,
         1.34656750e-01, -3.64854000e-02,  3.77537500e-01,  8.47375000e-02,
         2.3

In [None]:
example_indices = [40, 200]

dash_line = '-'.join('' for x in range(100))

for i, index in enumerate(example_indices):
    print(dash_line)
    print('Example ', i + 1)
    print(dash_line)
    print('INPUT DIALOGUE:')
    print(dataset['test'][index]['dialogue'])
    print(dash_line)
    print('BASELINE HUMAN SUMMARY:')
    print(dataset['test'][index]['summary'])
    print(dash_line)
    print()