## Potential Talent

### **Context:**

As a **talent sourcing and management company**, we are interested in **finding talented individuals** and sourcing them to technology companies. **Finding talented candidates is not easy**, for **several reasons**. **First**, one needs to understand the role very well in order to fill it, which requires understanding the client’s needs and what they are looking for in a potential candidate. **Second**, one needs to understand what makes a candidate shine for the role we are searching for. **Third**, knowing where to find talented individuals is a challenge in itself.

The nature of our job requires a lot of human labor and is full of **manual operations**. To **automate this process**, we want to build a better approach that saves us time and helps us spot potential candidates who could fit the roles we are searching for. Beyond that, for a specific role we want to fill, we are interested in developing a machine-learning-powered pipeline that can spot talented individuals and rank them by fitness.

We currently source a few candidates semi-automatically, so the sourcing part is not a concern at this time. Instead, we first want to determine the best-matching candidates based on how fit they are for a given role. We generally run these searches using keywords such as “full-stack software engineer”, “engineering manager” or “aspiring human resources”, depending on the role we are trying to fill. These keywords might change, and you can expect that specific keywords will be provided to you.

Assuming that we are able to list and rank fitting candidates, we then employ a review procedure: each candidate is inspected manually to determine how good a fit they are. At the end of this manual review, we might choose not the first candidate in the list but, say, the 7th. If that happens, we want to re-rank the previous list based on this information. This supervisory signal is supplied by starring the 7th candidate in the list. Starring a candidate sets that candidate as an ideal candidate for the given role. We then expect the list to be re-ranked each time a candidate is starred.
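
One possible way to realize the re-ranking on starring is a Rocchio-style update: shift the query vector toward the TF-IDF vectors of the starred candidates and score the list again. The sketch below is only an illustration under that assumption. The toy titles, the function name `rerank_with_stars`, and the weights `alpha` and `beta` are hypothetical and not part of the task description.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy job titles standing in for the real candidate list (illustrative only)
titles = [
    "Aspiring Human Resources Professional",
    "Seeking Human Resources Opportunities",
    "Human Resources Generalist at Loparex",
    "Native English Teacher at EPIK",
]

vectorizer = TfidfVectorizer(stop_words="english")
docs = vectorizer.fit_transform(titles)


def rerank_with_stars(query, starred_positions, alpha=1.0, beta=1.0):
    """Score every title against the query shifted toward the starred titles.

    alpha weights the original query and beta the mean of the starred
    title vectors; both are hypothetical tuning knobs.
    """
    query_vec = vectorizer.transform([query]).toarray()[0]
    if starred_positions:
        starred_mean = docs[starred_positions].toarray().mean(axis=0)
        query_vec = alpha * query_vec + beta * starred_mean
    scores = cosine_similarity(query_vec.reshape(1, -1), docs).flatten()
    order = np.argsort(-scores, kind="stable")  # best score first
    return order, scores


# Ranking before any feedback, then after starring the title at position 2
before_order, before_scores = rerank_with_stars("aspiring human resources", [])
after_order, after_scores = rerank_with_stars("aspiring human resources", [2])
```

Because the starred title's own vector is added to the query, its score can only go up, and titles that share its vocabulary move up with it. Each further starring action adds another vector to the mean, so the query drifts toward what reviewers actually pick.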

### Data Description:

The data comes from our sourcing efforts. We removed any field that could directly reveal personal details and gave a unique identifier for each candidate.

#### Attributes:
**id** : unique identifier for candidate (numeric)

**job_title** : job title for candidate (text)

**location** : geographical location for candidate (text)

**connections** : number of connections candidate has, 500+ means over 500 (text)

**Output (desired target)**:
fit - how fit the candidate is for the role? (numeric, probability between 0-1)

Keywords: “Aspiring human resources” or “seeking human resources”

#### Download Data:

https://docs.google.com/spreadsheets/d/117X6i53dKiO7w6kuA1g1TpdTlv1173h_dPlJt5cNNMU/edit?usp=sharing

#### Goal(s):

Predict how fit the candidate is based on their available information (variable fit)

#### Success Metric(s):

Rank candidates based on a fitness score.

Re-rank candidates when a candidate is starred.

#### Bonus(es):

We are interested in a robust algorithm. Tell us how your solution works and show us how your ranking gets better with each starring action.

How can we filter out candidates who should not be in this list in the first place?

Can we determine a cut-off point that would work for other roles without losing high-potential candidates?

Do you have any ideas that we should explore so that we can automate this procedure further and prevent human bias?

In [1]:
# !pip install -U scikit-learn

In [1]:
# Importing Standard Libraries
import pandas as pd
import numpy as np
import os

from sklearn.metrics.pairwise import linear_kernel
pd.options.display.max_columns = 30

## Initial Exploratory Data Analysis

In [2]:
df = pd.read_csv('potential-talents - Aspiring human resources - seeking human resources.csv').set_index('id')
df.head()

id,job_title,location,connection,fit
1,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,
2,Native English Teacher at EPIK (English Progra...,Kanada,500+,
3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,
4,People Development Coordinator at Ryan,"Denton, Texas",500+,
5,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+,


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 104 entries, 1 to 104
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   job_title   104 non-null    object 
 1   location    104 non-null    object 
 2   connection  104 non-null    object 
 3   fit         0 non-null      float64
dtypes: float64(1), object(3)
memory usage: 4.1+ KB


In [4]:
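# "500+" means more than 500 connections; store it as 501 so the column can be numeric.
# Stripping whitespace first makes the match independent of trailing spaces in the raw data.
df['connection'] = pd.to_numeric(df['connection'].astype(str).str.strip().replace('500+', '501'))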

In [5]:
df.job_title.value_counts()

2019 C.T. Bauer College of Business Graduate (Magna Cum Laude) and aspiring Human Resources professional                 7
Aspiring Human Resources Professional                                                                                    7
Student at Humber College and Aspiring Human Resources Generalist                                                        7
People Development Coordinator at Ryan                                                                                   6
Native English Teacher at EPIK (English Program in Korea)                                                                5
Aspiring Human Resources Specialist                                                                                      5
HR Senior Specialist                                                                                                     5
Student at Chapman University                                                                                            4
SVP, CHRO, Marke

In [6]:
df.job_title.iloc[0]

'2019 C.T. Bauer College of Business Graduate (Magna Cum Laude) and aspiring Human Resources professional'

In [7]:
df = df.drop_duplicates()

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 53 entries, 1 to 104
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   job_title   53 non-null     object 
 1   location    53 non-null     object 
 2   connection  53 non-null     int64  
 3   fit         0 non-null      float64
dtypes: float64(1), int64(1), object(2)
memory usage: 2.1+ KB


# TF-IDF

### Prepping our Text for Modelling

In [12]:
!pip uninstall scikit-learn

^C


In [13]:
!pip install scikit-learn



In [9]:
import sklearn

In [28]:
!pip install --user "numpy<1.23.0"



In [10]:
import numpy as np

In [11]:
sklearn.__version__

'1.0.2'

In [12]:
np.__version__


'1.21.5'

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Prep our Text for Modelling
vectorizer = TfidfVectorizer(stop_words='english', ngram_range = (1, 2))
docs_tfidf = vectorizer.fit_transform(df["job_title"])

In [14]:
def get_tf_idf_query_similarity(vectorizer, docs_tfidf, query):
    """
    vectorizer: TfIdfVectorizer model
    docs_tfidf: tfidf vectors for all docs
    query: query doc

    return: cosine similarity between query and all docs
    """
    query_tfidf = vectorizer.transform([query])
    cos_sim = cosine_similarity(query_tfidf, docs_tfidf).flatten()
    
    return cos_sim
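
To make the scoring concrete, here is a small self-contained sketch on made-up titles, using the same vectorizer settings as above. The names `toy_titles`, `toy_vectorizer`, and `toy_scores` are illustrative only and not used elsewhere in the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Three made-up titles; only the first two share words with the query
toy_titles = [
    "Aspiring Human Resources Professional",
    "Aspiring Human Resources Professional with a passion for recruiting and onboarding",
    "Native English Teacher",
]

# Same settings as the notebook's vectorizer: English stop words, unigrams and bigrams
toy_vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
toy_docs = toy_vectorizer.fit_transform(toy_titles)

# The query is projected into the vocabulary learned from the titles
toy_query = toy_vectorizer.transform(["Aspiring human resources"])
toy_scores = cosine_similarity(toy_query, toy_docs).flatten()

for title, score in zip(toy_titles, toy_scores):
    print(f"{score:.3f}  {title}")
```

The third title shares no term with the query and scores exactly 0. The second title contains every query term but scores lower than the first, because its TF-IDF vector is normalized over more terms. This is why short titles that closely echo the keywords tend to lead the rankings.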

In [15]:
query = 'Aspiring human resources'

cos_sim = get_tf_idf_query_similarity(vectorizer, docs_tfidf, query = query)

df['fit'] = cos_sim

In [16]:
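def top_candidates(n, by = 'fit', ascending = False, min_con = 0, location = None):
    """Return the top n rows of df sorted by `by`, keeping only candidates with at
    least min_con connections and, if location is given, that exact location."""
    mask = df.connection >= min_con
    if location is not None:
        mask &= df.location == location

    return df.loc[mask].sort_values(by = by, ascending = ascending).head(n).copy()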

In [17]:
top_candidates(n = 10, by = 'fit', ascending = False, min_con = 0)

id,job_title,location,connection,fit
3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,0.735855
97,Aspiring Human Resources Professional,"Kokomo, Indiana Area",71,0.735855
6,Aspiring Human Resources Specialist,Greater New York City Area,1,0.632697
73,"Aspiring Human Resources Manager, seeking inte...","Houston, Texas Area",7,0.50888
72,Business Management Major and Aspiring Human R...,"Monroe, Louisiana Area",5,0.38759
27,Aspiring Human Resources Management student se...,"Houston, Texas Area",501,0.374733
66,Experienced Retail Manager and aspiring Human ...,"Austin, Texas Area",57,0.373847
7,Student at Humber College and Aspiring Human R...,Kanada,61,0.358949
74,Human Resources Professional,Greater Boston Area,16,0.340769
79,Liberal Arts Major. Aspiring Human Resources A...,"Baton Rouge, Louisiana Area",7,0.336485


In [18]:
top_candidates(n = 10)

id,job_title,location,connection,fit
3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,0.735855
97,Aspiring Human Resources Professional,"Kokomo, Indiana Area",71,0.735855
6,Aspiring Human Resources Specialist,Greater New York City Area,1,0.632697
73,"Aspiring Human Resources Manager, seeking inte...","Houston, Texas Area",7,0.50888
72,Business Management Major and Aspiring Human R...,"Monroe, Louisiana Area",5,0.38759
27,Aspiring Human Resources Management student se...,"Houston, Texas Area",501,0.374733
66,Experienced Retail Manager and aspiring Human ...,"Austin, Texas Area",57,0.373847
7,Student at Humber College and Aspiring Human R...,Kanada,61,0.358949
74,Human Resources Professional,Greater Boston Area,16,0.340769
79,Liberal Arts Major. Aspiring Human Resources A...,"Baton Rouge, Louisiana Area",7,0.336485


In [19]:
top_candidates(n = 10, by = 'fit', min_con = 90)

id,job_title,location,connection,fit
27,Aspiring Human Resources Management student se...,"Houston, Texas Area",501,0.374733
82,Aspiring Human Resources Professional | An ene...,"Austin, Texas Area",174,0.31642
100,Aspiring Human Resources Manager | Graduating ...,"Cape Girardeau, Missouri",103,0.308829
76,Aspiring Human Resources Professional | Passio...,"New York, New York",212,0.246772
28,Seeking Human Resources Opportunities,"Chicago, Illinois",390,0.220668
101,Human Resources Generalist at Loparex,"Raleigh-Durham, North Carolina Area",501,0.196509
78,Human Resources Generalist at Schwan's,Amerika Birleşik Devletleri,501,0.196509
71,"Human Resources Generalist at ScottMadden, Inc.","Raleigh-Durham, North Carolina Area",501,0.196509
68,Human Resources Specialist at Luxottica,Greater New York City Area,501,0.189503
89,Director Human Resources at EY,Greater Atlanta Area,349,0.187433


In [20]:
top_candidates(n = 50, by = 'fit', location = 'Austin, Texas Area')

id,job_title,location,connection,fit
66,Experienced Retail Manager and aspiring Human ...,"Austin, Texas Area",57,0.373847
82,Aspiring Human Resources Professional | An ene...,"Austin, Texas Area",174,0.31642


In [21]:
top_candidates(n = 50, by = 'fit', location = 'Greater New York City Area')

id,job_title,location,connection,fit
6,Aspiring Human Resources Specialist,Greater New York City Area,1,0.632697
68,Human Resources Specialist at Luxottica,Greater New York City Area,501,0.189503
102,Business Intelligence and Analytics at Travelers,Greater New York City Area,49,0.0


In [22]:
query = 'seeking human resources'

cos_sim = get_tf_idf_query_similarity(vectorizer, docs_tfidf, query = query)

df['fit'] = cos_sim

In [23]:
top_candidates(n = 10, by = 'fit')

id,job_title,location,connection,fit
99,Seeking Human Resources Position,"Las Vegas, Nevada Area",48,0.675682
28,Seeking Human Resources Opportunities,"Chicago, Illinois",390,0.675682
10,Seeking Human Resources HRIS and Generalist Po...,Greater Philadelphia Area,501,0.432761
94,Seeking Human Resources Opportunities. Open t...,Amerika Birleşik Devletleri,415,0.38129
73,"Aspiring Human Resources Manager, seeking inte...","Houston, Texas Area",7,0.362648
74,Human Resources Professional,Greater Boston Area,16,0.295223
75,"Nortia Staffing is seeking Human Resources, Pa...","San Jose, California",501,0.273577
27,Aspiring Human Resources Management student se...,"Houston, Texas Area",501,0.245337
3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,0.240319
97,Aspiring Human Resources Professional,"Kokomo, Indiana Area",71,0.240319
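
The task lists the keywords as “Aspiring human resources” *or* “seeking human resources”, while the cells above overwrite `fit` with one query at a time. One way to combine them, shown here only as a sketch on made-up titles, is to score each title against every phrase and keep the best score. The variable names are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sample_titles = [
    "Aspiring Human Resources Professional",
    "Seeking Human Resources Position",
    "Business Intelligence and Analytics at Travelers",
]
queries = ["Aspiring human resources", "seeking human resources"]

sample_vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
sample_docs = sample_vectorizer.fit_transform(sample_titles)

# Rows correspond to queries, columns to titles
per_query_scores = cosine_similarity(sample_vectorizer.transform(queries), sample_docs)

# A candidate's combined fit is its best score over all keyword phrases
combined_fit = per_query_scores.max(axis=0)
print(combined_fit)
```

Taking the maximum treats the phrases as alternatives, so a candidate who matches either one strongly ranks high. Averaging the rows instead would favor titles that match both.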


# Word2Vec

In [61]:
!pip install --upgrade keras
!pip install --upgrade tensorflow

Collecting keras
  Using cached keras-2.15.0-py3-none-any.whl (1.7 MB)
Installing collected packages: keras
  Attempting uninstall: keras
    Found existing installation: keras 2.8.0
    Uninstalling keras-2.8.0:
      Successfully uninstalled keras-2.8.0
Successfully installed keras-2.15.0


ERROR: After October 2020 you may experience errors when installing or updating packages. This is because pip will change the way that it resolves dependency conflicts.

We recommend you use --use-feature=2020-resolver to test your packages with the new resolver before it becomes the default.

tensorflow 2.8.0 requires keras<2.9,>=2.8.0rc0, but you'll have keras 2.15.0 which is incompatible.


Collecting tensorflow
  Using cached tensorflow-2.13.1-cp38-cp38-win_amd64.whl (1.9 kB)


ERROR: Could not find a version that satisfies the requirement tensorflow-intel==2.13.1; platform_system == "Windows" (from tensorflow) (from versions: 0.0.1, 2.10.0.dev20220728, 2.10.0rc0, 2.10.0rc1, 2.10.0rc2, 2.10.0rc3, 2.10.0, 2.10.1, 2.11.0rc0, 2.11.0rc1, 2.11.0rc2, 2.11.0, 2.11.1, 2.12.0rc0, 2.12.0rc1, 2.12.0, 2.12.1, 2.13.0rc0, 2.13.0rc1, 2.13.0rc2, 2.13.0)
ERROR: No matching distribution found for tensorflow-intel==2.13.1; platform_system == "Windows" (from tensorflow)


In [51]:
!pip install tensorflow

Collecting keras<2.9,>=2.8.0rc0
  Using cached keras-2.8.0-py2.py3-none-any.whl (1.4 MB)
Installing collected packages: keras
  Attempting uninstall: keras
    Found existing installation: keras 2.6.0
    Uninstalling keras-2.6.0:
      Successfully uninstalled keras-2.6.0
Successfully installed keras-2.8.0


In [62]:
tf.__version__

'2.5.1'

In [53]:
from tensorflow.experimental import dtensor

print('TensorFlow version:', tf.__version__)

ImportError: cannot import name 'dtensor' from 'tensorflow.experimental' (c:\Users\Alex Chung\anaconda3\envs\ml\lib\site-packages\tensorflow\_api\v2\experimental\__init__.py)

In [54]:
keras.__version__

NameError: name 'keras' is not defined

In [63]:
# !pip install nltk
# !pip install keras
# !pip install -U gensim
# !pip install tensorflow
from tensorflow import keras
keras.__version__


'2.5.0'

In [56]:
import tensorflow as tf

In [57]:
!python -V

Python 3.8.5


### Prepping our Text for Modelling

In [65]:
import re
import nltk
nltk.download('stopwords')

# processing texts for modelling
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

def clean_text(text):
    """Replace non-letters with spaces, lowercase each token, and drop English stop words."""
    tokens = (re.sub(r'[^a-zA-Z]', ' ', w).lower() for w in text.split())
    return " ".join(t for t in tokens if t not in stop_words)

df['job_title_cleaned'] = df.job_title.apply(clean_text)

[nltk_data] Downloading package stopwords to C:\Users\Alex
[nltk_data]     Chung\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [66]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 53 entries, 1 to 104
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   job_title          53 non-null     object 
 1   location           53 non-null     object 
 2   connection         53 non-null     int64  
 3   fit                53 non-null     float64
 4   job_title_cleaned  53 non-null     object 
dtypes: float64(1), int64(1), object(3)
memory usage: 2.5+ KB


In [59]:
from tensorflow import keras

In [67]:
# tokenize and pad every document to make them of the same size
# Import from tensorflow.keras so the Keras code matches the installed TensorFlow;
# importing from the standalone keras package raised
# "ImportError: cannot import name 'dtensor'" in this environment.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
tokenizer = Tokenizer()

tokenizer.fit_on_texts(df.job_title_cleaned)
tokenized_documents = tokenizer.texts_to_sequences(df.job_title_cleaned)
tokenized_paded_documents = pad_sequences(tokenized_documents, maxlen=64, padding='post')
vocab_size = len(tokenizer.word_index) + 1

In [None]:
# loading pre-trained embeddings, each word is represented as a 300 dimensional vector
import gensim

# Navigating to directory where pre-trained embeddings were downloaded
os.chdir(r"C:\Users\achung\OneDrive - Biological Dynamics, Inc\LX Temp\Apziva\Potential Talent")
W2V_PATH="GoogleNews-vectors-negative300.bin.gz"

In [None]:
model_w2v = gensim.models.KeyedVectors.load_word2vec_format(W2V_PATH, binary=True)
model_w2v[0][:10]

array([ 1.1291504e-03, -8.9645386e-04,  3.1852722e-04,  1.5335083e-03,
        1.1062622e-03, -1.4038086e-03, -3.0517578e-05, -4.1961670e-04,
       -5.7601929e-04,  1.0757446e-03], dtype=float32)

In [None]:
# creating embedding matrix, every row is a vector representation from the vocabulary indexed by the tokenizer index. 
embedding_matrix=np.zeros((vocab_size,300))
for word,i in tokenizer.word_index.items():
    if word in model_w2v:
        embedding_matrix[i]=model_w2v[word]
        
# creating document-word embeddings
document_word_embeddings=np.zeros((len(tokenized_paded_documents),64,300))
for i in range(len(tokenized_paded_documents)):
    for j in range(len(tokenized_paded_documents[0])):
        document_word_embeddings[i][j]=embedding_matrix[tokenized_paded_documents[i][j]]
document_word_embeddings.shape

(53, 64, 300)
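
The nested loops above can also be written as a single NumPy indexing operation. The following sketch uses small made-up sizes and random vectors (all names prefixed with `toy_` are illustrative) to check that both versions produce the same tensor.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical small sizes: 6-word vocabulary (index 0 is padding), 4-dim vectors
toy_embedding_matrix = rng.normal(size=(6, 4))
toy_embedding_matrix[0] = 0.0  # padding row stays all zeros

# Two padded "documents" of length 5, holding word indices
toy_padded = np.array([[1, 2, 3, 0, 0],
                       [4, 5, 0, 0, 0]])

# Nested-loop version, mirroring the cell above
looped = np.zeros((2, 5, 4))
for i in range(toy_padded.shape[0]):
    for j in range(toy_padded.shape[1]):
        looped[i][j] = toy_embedding_matrix[toy_padded[i][j]]

# Vectorized version: index the matrix with the whole array of indices
vectorized = toy_embedding_matrix[toy_padded]

print(vectorized.shape)                    # (2, 5, 4)
print(np.array_equal(looped, vectorized))  # True
```

Applied to the notebook's variables, the equivalent expression would be `embedding_matrix[tokenized_paded_documents]`.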

In [None]:
document_word_embeddings[0][0]

array([-2.08007812e-01,  3.41796875e-02,  2.57568359e-02,  1.79687500e-01,
       -1.81640625e-01, -3.41796875e-02, -1.40625000e-01, -1.63085938e-01,
       -8.59375000e-02, -1.52343750e-01, -9.57031250e-02, -1.34765625e-01,
       -1.92382812e-01,  2.43164062e-01, -1.91406250e-01,  4.93164062e-02,
        2.60009766e-02,  3.28125000e-01, -7.37304688e-02,  5.05371094e-02,
       -1.52343750e-01, -1.57226562e-01, -1.44958496e-04, -2.51953125e-01,
       -4.22363281e-02, -1.72119141e-02, -4.84375000e-01,  2.07031250e-01,
       -1.40625000e-01, -1.35498047e-02, -1.78222656e-02,  5.95092773e-03,
       -3.10058594e-02, -2.75390625e-01, -2.65625000e-01,  9.52148438e-02,
       -4.55078125e-01,  1.13281250e-01, -1.33789062e-01,  1.18652344e-01,
       -5.37109375e-02,  8.10546875e-02,  7.32421875e-02,  6.39648438e-02,
       -9.47265625e-02,  4.39453125e-02,  1.46484375e-01, -8.59375000e-02,
       -1.58203125e-01,  1.63085938e-01, -1.32812500e-01,  2.50000000e-01,
       -5.61523438e-02,  

In [None]:
# cosine_similarity = np.dot(model_w2v['spain'], model_w2v['england'])/(np.linalg.norm(model_w2v['spain'])* 
#                                                                       np.linalg.norm(model_w2v['england']))
# cosine_similarity
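
The commented-out expression above is the usual cosine similarity: the dot product of two vectors divided by the product of their norms. As a quick check, this sketch computes it by hand on two made-up vectors (standing in for word embeddings) and compares the result with scikit-learn's `cosine_similarity`.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Made-up 3-dimensional vectors standing in for two word embeddings
u = np.array([1.0, 2.0, 2.0])
v = np.array([2.0, 0.0, 1.0])

# Dot product divided by the product of the Euclidean norms
manual = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# scikit-learn expects 2-D inputs of shape (n_samples, n_features)
library = cosine_similarity(u.reshape(1, -1), v.reshape(1, -1))[0, 0]

print(manual, library)
```

Here the dot product is 4 and the norms are 3 and √5, so both values equal 4 / (3√5).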

In [None]:
model_w2v['england'][:5]

array([-0.3671875 , -0.03491211,  0.11083984,  0.40039062,  0.18261719],
      dtype=float32)

In [None]:
def processing(query):
    """Clean a query string and return its padded word embeddings, shape (1, 64, 300)."""
    # Same cleaning as for the job titles: replace non-letters with spaces, lowercase, drop stop words
    stop_words = stopwords.words('english')
    tokens = (re.sub(r'[^a-zA-Z]', ' ', w).lower() for w in query.split())
    processed = " ".join(t for t in tokens if t not in stop_words)

    # Use a query-local tokenizer so the tokenizer fitted on the job titles is not modified
    query_tokenizer = Tokenizer()
    query_tokenizer.fit_on_texts([processed])
    tokenized_query = query_tokenizer.texts_to_sequences([processed])
    padded_query = pad_sequences(tokenized_query, maxlen=64, padding='post')

    # Row i holds the word2vec vector of the word with index i;
    # row 0 (padding) and words missing from word2vec stay all zeros
    query_embedding_matrix = np.zeros((len(query_tokenizer.word_index) + 1, 300))
    for word, i in query_tokenizer.word_index.items():
        if word in model_w2v:
            query_embedding_matrix[i] = model_w2v[word]

    # Look up every token position at once
    query_document_word_embeddings = query_embedding_matrix[padded_query]

    return query_document_word_embeddings

In [None]:
processing('hello world!!!!').shape

(1, 64, 300)

In [None]:
processing('hello world!!!!')[0][:3][0][:20]

array([-0.05419922,  0.01708984, -0.00527954,  0.33203125, -0.25      ,
       -0.01397705, -0.15039062, -0.265625  ,  0.01647949,  0.3828125 ,
       -0.03295898, -0.09716797, -0.16308594, -0.04443359,  0.00946045,
        0.18457031,  0.03637695,  0.16601562,  0.36328125, -0.25585938])

In [None]:
def get_w2v_query_similarity(document_word_embeddings, query):
    """
    document_word_embeddings: padded word2vec embeddings for all docs, shape (n_docs, 64, 300)
    query: query doc (raw text)

    return: cosine similarity between query and all docs
    """
    query_w2v = processing(query)

    # Flatten each (64, 300) embedding into one long vector so cosine_similarity can compare them
    query_w2v_reshape = query_w2v.reshape((query_w2v.shape[0], -1))
    document_word_embeddings_reshape = document_word_embeddings.reshape((document_word_embeddings.shape[0], -1))

    cos_sim_w2v = cosine_similarity(query_w2v_reshape, document_word_embeddings_reshape).flatten()

    return cos_sim_w2v
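
Because the function above flattens the padded embeddings, the cosine similarity compares words position by position: the first query word against the first title word, and so on. An alternative worth exploring, sketched here with made-up 3-dimensional vectors instead of the real word2vec model, is to average the word vectors of each text before comparing. `mean_pooled`, `cosine`, and `toy_vectors` are hypothetical names, and this is not what the notebook computes.

```python
import numpy as np

# Made-up 3-dimensional "embeddings"; a real run would look words up in model_w2v
toy_vectors = {
    "aspiring": np.array([1.0, 0.0, 0.0]),
    "human": np.array([0.0, 1.0, 0.0]),
    "resources": np.array([0.0, 0.0, 1.0]),
}


def mean_pooled(text, vectors, dim=3):
    """Average the vectors of the known words in text; all zeros if none are known."""
    known = [vectors[w] for w in text.lower().split() if w in vectors]
    return np.mean(known, axis=0) if known else np.zeros(dim)


def cosine(a, b):
    """Cosine similarity that returns 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0


query_vec = mean_pooled("aspiring human resources", toy_vectors)
same_words_reordered = mean_pooled("human resources aspiring", toy_vectors)
partial_match = mean_pooled("human resources", toy_vectors)

print(cosine(query_vec, same_words_reordered))  # 1.0 up to floating point: word order no longer matters
print(cosine(query_vec, partial_match))
```

With mean pooling, a title that contains the query words in a different order still gets a perfect score, while the flattened comparison would penalize it. The trade-off is that averaging discards word order entirely.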

In [None]:
query = 'Aspiring human resources'

# Word2Vec Similarity
cos_sim_w2v = get_w2v_query_similarity(document_word_embeddings, query = query)
df['w2v_fit'] = cos_sim_w2v

# Original TFIDF similarity for comparison
cos_sim = get_tf_idf_query_similarity(vectorizer, docs_tfidf, query = query) 
df['tfidf_fit'] = cos_sim
# Dropping the original fit column
# df.drop('fit', axis=1, inplace=True)

In [None]:
top_candidates(n = 10, by = 'w2v_fit', ascending = False, min_con = 0)

id,job_title,location,connection,fit,job_title_cleaned,w2v_fit,tfidf_fit
97,Aspiring Human Resources Professional,"Kokomo, Indiana Area",71,0.240319,aspiring human resources professional,0.898174,0.735855
3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,0.240319,aspiring human resources professional,0.898174,0.735855
6,Aspiring Human Resources Specialist,Greater New York City Area,1,0.206629,aspiring human resources specialist,0.873679,0.632697
99,Seeking Human Resources Position,"Las Vegas, Nevada Area",48,0.675682,seeking human resources position,0.654387,0.220668
82,Aspiring Human Resources Professional | An ene...,"Austin, Texas Area",174,0.103338,aspiring human resources professional energe...,0.641739,0.31642
27,Aspiring Human Resources Management student se...,"Houston, Texas Area",501,0.245337,aspiring human resources management student se...,0.628601,0.374733
28,Seeking Human Resources Opportunities,"Chicago, Illinois",390,0.675682,seeking human resources opportunities,0.619797,0.220668
73,"Aspiring Human Resources Manager, seeking inte...","Houston, Texas Area",7,0.362648,aspiring human resources manager seeking inte...,0.584569,0.50888
76,Aspiring Human Resources Professional | Passio...,"New York, New York",212,0.080592,aspiring human resources professional passio...,0.551164,0.246772
10,Seeking Human Resources HRIS and Generalist Po...,Greater Philadelphia Area,501,0.432761,seeking human resources hris generalist positions,0.519345,0.141333


In [None]:
query = 'seeking human resources'

# Word2Vec Similarity
cos_sim_w2v = get_w2v_query_similarity(document_word_embeddings, query = query)
df['w2v_fit'] = cos_sim_w2v

# Original TFIDF similarity for comparison
cos_sim = get_tf_idf_query_similarity(vectorizer, docs_tfidf, query = query) 
df['tfidf_fit'] = cos_sim

In [None]:
top_candidates(n = 10, by = 'w2v_fit', ascending = False, min_con = 0)

id,job_title,location,connection,fit,job_title_cleaned,w2v_fit,tfidf_fit
99,Seeking Human Resources Position,"Las Vegas, Nevada Area",48,0.675682,seeking human resources position,0.886226,0.675682
28,Seeking Human Resources Opportunities,"Chicago, Illinois",390,0.675682,seeking human resources opportunities,0.839381,0.675682
10,Seeking Human Resources HRIS and Generalist Po...,Greater Philadelphia Area,501,0.432761,seeking human resources hris generalist positions,0.703341,0.432761
97,Aspiring Human Resources Professional,"Kokomo, Indiana Area",71,0.240319,aspiring human resources professional,0.663209,0.240319
3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,0.240319,aspiring human resources professional,0.663209,0.240319
6,Aspiring Human Resources Specialist,Greater New York City Area,1,0.206629,aspiring human resources specialist,0.645122,0.206629
94,Seeking Human Resources Opportunities. Open t...,Amerika Birleşik Devletleri,415,0.38129,seeking human resources opportunities open tr...,0.639099,0.38129
89,Director Human Resources at EY,Greater Atlanta Area,349,0.162381,director human resources ey,0.571728,0.162381
82,Aspiring Human Resources Professional | An ene...,"Austin, Texas Area",174,0.103338,aspiring human resources professional energe...,0.473859,0.103338
81,Senior Human Resources Business Partner at Hei...,"Chattanooga, Tennessee Area",455,0.102581,senior human resources business partner heil e...,0.470671,0.102581


In [None]:
query = 'business intelligence specialist'

# Word2Vec Similarity
cos_sim_w2v = get_w2v_query_similarity(document_word_embeddings, query = query)
df['w2v_fit'] = cos_sim_w2v

# Original TFIDF similarity for comparison
cos_sim = get_tf_idf_query_similarity(vectorizer, docs_tfidf, query = query) 
df['tfidf_fit'] = cos_sim

In [None]:
top_candidates(n = 10, by = 'w2v_fit', ascending = False, min_con = 0)

id,job_title,location,connection,fit,job_title_cleaned,w2v_fit,tfidf_fit
102,Business Intelligence and Analytics at Travelers,Greater New York City Area,49,0.0,business intelligence analytics travelers,0.552532,0.56006
68,Human Resources Specialist at Luxottica,Greater New York City Area,501,0.164174,human resources specialist luxottica,0.44738,0.178214
8,HR Senior Specialist,San Francisco Bay Area,501,0.0,hr senior specialist,0.348536,0.168359
86,Information Systems Specialist and Programmer ...,"Gaithersburg, Maryland",4,0.0,information systems specialist programmer love...,0.274835,0.099972
101,Human Resources Generalist at Loparex,"Raleigh-Durham, North Carolina Area",501,0.170244,human resources generalist loparex,0.251939,0.0
78,Human Resources Generalist at Schwan's,Amerika Birleşik Devletleri,501,0.170244,human resources generalist schwan s,0.231181,0.0
13,Human Resources Coordinator at InterContinenta...,"Atlanta, Georgia",501,0.111899,human resources coordinator intercontinental b...,0.2158,0.0
72,Business Management Major and Aspiring Human R...,"Monroe, Louisiana Area",5,0.126581,business management major aspiring human resou...,0.214225,0.12298
4,People Development Coordinator at Ryan,"Denton, Texas",501,0.0,people development coordinator ryan,0.205907,0.0
71,"Human Resources Generalist at ScottMadden, Inc.","Raleigh-Durham, North Carolina Area",501,0.170244,human resources generalist scottmadden inc,0.20292,0.0


In [None]:
top_candidates(n = 10, by = 'w2v_fit', ascending = False, min_con = 20, location = 'Greater New York City Area')

id,job_title,location,connection,fit,job_title_cleaned,w2v_fit,tfidf_fit
102,Business Intelligence and Analytics at Travelers,Greater New York City Area,49,0.0,business intelligence analytics travelers,0.552532,0.56006
68,Human Resources Specialist at Luxottica,Greater New York City Area,501,0.164174,human resources specialist luxottica,0.44738,0.178214


# GloVe

https://nlp.stanford.edu/projects/glove/

In [None]:
# Downloading GloVe pre-trained vectors
# !pip install wget
# import wget
# wget.download('https://nlp.stanford.edu/data/glove.840B.300d.zip')


'glove.840B.300d.zip'

In [None]:
# Extracting GloVe vector file
# import zipfile as zf
# files = zf.ZipFile("glove.840B.300d.zip", 'r')
# files.extractall('GloVe')
# files.close()

In [None]:
# Navigating to directory where GloVe pre-trained vectors were downloaded
os.chdir(r"C:\Users\achung\OneDrive - Biological Dynamics, Inc\LX Temp\Apziva\Potential Talent\GloVe")
path = 'glove.840B.300d.txt'

In [None]:
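# Peek at the first 10 lines; the encoding is set explicitly because the platform default may not be UTF-8
with open(path, encoding='utf-8') as file:
    for i in range(10):
        line = file.readline()
        print(line[:100])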

, -0.082752 0.67204 -0.14987 -0.064983 0.056491 0.40228 0.0027747 -0.3311 -0.30691 2.0817 0.031819 0
. 0.012001 0.20751 -0.12578 -0.59325 0.12525 0.15975 0.13748 -0.33157 -0.13694 1.7893 -0.47094 0.704
the 0.27204 -0.06203 -0.1884 0.023225 -0.018158 0.0067192 -0.13877 0.17708 0.17709 2.5882 -0.35179 -
and -0.18567 0.066008 -0.25209 -0.11725 0.26513 0.064908 0.12291 -0.093979 0.024321 2.4926 -0.017916
to 0.31924 0.06316 -0.27858 0.2612 0.079248 -0.21462 -0.10495 0.15495 -0.03353 2.4834 -0.50904 0.087
of 0.060216 0.21799 -0.04249 -0.38618 -0.15388 0.034635 0.22243 0.21718 0.0068483 2.4375 -0.27418 0.
a 0.043798 0.024779 -0.20937 0.49745 0.36019 -0.37503 -0.052078 -0.60555 0.036744 2.2085 -0.23389 -0
in 0.089187 0.25792 0.26282 -0.029365 0.47187 -0.10389 -0.10013 0.08123 0.20883 2.5726 -0.67854 0.03
" -0.075242 0.57337 -0.31908 -0.18484 0.88867 -0.27381 0.077588 0.13905 -0.47746 1.4442 -0.56159 0.0
: 0.008746 0.33214 -0.29175 -0.15119 -0.41842 -0.23931 -0.23458 -0.055618 -0.09896 0.75175 
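
The preview shows the file layout: one token per line, followed by its vector components separated by single spaces. A lightweight alternative to loading the whole file into a DataFrame is to parse it line by line and keep only the tokens that are needed. The sketch below assumes that layout. `load_glove` and its `wanted` and `dim` parameters are hypothetical, and the two sample lines are made up.

```python
import io
import numpy as np


def load_glove(lines, dim=300, wanted=None):
    """Parse GloVe text lines into a {token: vector} dict.

    lines: any iterable of strings, e.g. an open file handle.
    dim: number of vector components per line.
    wanted: optional set of tokens to keep, which limits memory use.
    """
    vectors = {}
    for line in lines:
        # Split from the right so the last `dim` fields are the numbers,
        # whatever characters the token itself contains
        parts = line.rstrip("\n").rsplit(" ", dim)
        token, values = parts[0], parts[1:]
        if wanted is not None and token not in wanted:
            continue
        vectors[token] = np.asarray(values, dtype=np.float32)
    return vectors


# Two made-up 4-dimensional lines in the same layout as the file preview
sample = io.StringIO("human 0.1 0.2 0.3 0.4\nresources -0.5 0.0 0.5 1.0\n")
glove_subset = load_glove(sample, dim=4, wanted={"human"})
print(glove_subset)
```

For the real file, the same function could be called on a handle opened with `open(path, encoding='utf-8')` and, for example, `wanted=set(tokenizer.word_index)`, so that only the vocabulary of the job titles is kept in memory.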

In [None]:
df_glove = pd.read_csv(path, sep=" ", quoting=3, header=None, index_col=0)
df_glove.T

Unnamed: 0,",",.,the,and,to,of,a,in,"""",:,is,for,I,),(,...,what-might-have-been,wiid,windowsTransgender,woombie,wordsforyoungmen,work.Like,working.So,wried,wwent,xalisae,xtremecaffeine,yildirim,z/28,zipout,zulchzulu
1,-0.082752,0.012001,0.272040,-0.185670,0.319240,0.060216,0.043798,0.089187,-0.075242,0.008746,-0.084961,-0.172240,0.194100,-0.271420,-0.180240,...,0.562950,0.385100,-0.102350,0.65711,-0.378200,-0.23822,0.754650,0.54698,0.921790,0.337540,0.073032,0.222760,0.73440,0.21215,-0.079690
2,0.672040,0.207510,-0.062030,0.066008,0.063160,0.217990,0.024779,0.257920,0.573370,0.332140,0.502000,0.182340,0.226030,0.047374,0.008411,...,-0.293780,-0.315230,-0.043862,-1.06710,-1.154600,-0.65700,-0.292360,-0.50515,-0.344320,-0.131110,-1.029400,-0.296390,-0.33641,-0.99456,-0.229050
3,-0.149870,-0.125780,-0.188400,-0.252090,-0.278580,-0.042490,-0.209370,0.262820,-0.319080,-0.291750,0.002382,-0.278470,-0.437640,-0.172780,-0.304630,...,1.279200,-0.103690,0.193270,-0.27787,0.387420,-0.18234,-0.211210,0.56164,-0.508880,0.155950,-0.015436,0.694120,0.26918,1.17820,0.803660
4,-0.064983,-0.593250,0.023225,-0.117250,0.261200,-0.386180,0.497450,-0.029365,-0.184840,-0.151190,-0.167550,-0.084666,-0.113870,-0.029084,0.209970,...,-0.070849,-0.110770,-0.225560,0.48507,1.288100,-0.27082,-0.105820,-0.29412,-0.000386,-1.124400,0.726150,0.193620,0.41843,2.07210,-0.788650
5,0.056491,0.125250,-0.018158,0.265130,0.079248,-0.153880,0.360190,0.471870,0.888670,-0.418420,0.307210,0.254420,-0.072725,-0.219100,0.085153,...,-0.487520,0.172950,-0.148480,-0.51166,-0.802670,0.37388,-0.295270,-0.35497,-0.151450,0.100460,-0.992460,-0.312760,-0.18900,-0.44271,-0.405670
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
296,0.053380,0.063500,-0.018168,-0.039709,-0.258100,0.329200,0.080421,0.193680,-0.212800,0.700590,0.275500,-0.122380,-0.036109,0.066062,0.539500,...,-0.080746,0.092665,-0.223050,-0.43677,0.503820,0.28653,-0.484500,-0.21834,-0.406440,-0.094080,0.127380,-0.413980,-0.48432,-0.41382,-0.205220
297,-0.050821,0.140190,0.114070,0.324980,-0.044629,-0.175970,-0.061246,-0.325460,-0.226150,-0.213710,-0.067180,-0.081083,0.112210,-0.241770,-0.235960,...,-0.571630,-0.087504,0.161510,-0.43438,0.067534,-0.83764,-0.343440,0.44104,0.119900,-0.361960,0.129570,0.240180,-1.00550,-0.21139,0.268780
298,-0.191800,0.138710,0.130150,-0.023452,0.082745,0.117090,-0.300990,0.144210,0.328000,-0.286770,-0.215110,-0.126680,0.091957,-0.319120,-0.385520,...,0.428840,0.822160,0.119740,0.28584,0.852120,0.35327,0.935510,0.84572,0.380180,0.645910,0.422640,0.093864,0.63718,0.93427,-0.083561
299,-0.378460,-0.360490,-0.183170,0.123020,0.097801,-0.166920,-0.145840,-0.169000,-0.109340,-0.226630,-0.263040,-0.438560,0.386320,0.235390,0.243240,...,-0.078753,-0.591150,0.290980,-0.18967,-0.973490,0.13764,-0.099674,-0.78417,0.345800,-0.080984,-0.113330,-0.165450,-0.13914,-0.93286,0.485320


In [None]:
glove = { key: val.values for key, val in df_glove.T.items() }
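Loading the whole file into a DataFrame keeps every vector in memory, although only the words that occur in the job titles and queries are ever looked up. A minimal sketch of a streaming alternative follows. `load_glove_subset` is a hypothetical helper, not part of the original notebook, and the tiny 3-dimensional file is a stand-in so the sketch can run without the real download.

```python
import os
import tempfile

import numpy as np


def load_glove_subset(path, vocab, dim=300):
    """Stream a GloVe-style text file and keep only the words found in `vocab`."""
    vectors = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.rstrip("\n").split(" ")
            # The last `dim` fields are the vector; whatever precedes them is
            # the token (joined back together in case it contains spaces).
            word = " ".join(parts[:-dim])
            if word in vocab:
                vectors[word] = np.asarray(parts[-dim:], dtype=np.float64)
    return vectors


# Toy 3-dimensional file standing in for glove.840B.300d.txt (assumption).
demo_path = os.path.join(tempfile.mkdtemp(), "toy_glove.txt")
with open(demo_path, "w", encoding="utf-8") as handle:
    handle.write("human 0.1 0.2 0.3\nresources 0.4 0.5 0.6\nbanana 0.7 0.8 0.9\n")

toy_glove = load_glove_subset(demo_path, vocab={"human", "resources"}, dim=3)
print(sorted(toy_glove))
```

With the real file, `vocab` could be the set of words appearing in `df.job_title_cleaned` plus the query terms, and `dim` would stay at 300.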

In [None]:
glove['man'][:20]

array([-1.7310e-01,  2.0663e-01,  1.6543e-02, -3.1026e-01,  1.9719e-02,
        2.7791e-01,  1.2283e-01, -2.6328e-01,  1.2522e-01,  3.1894e+00,
       -1.6291e-01, -8.8759e-02,  3.3067e-03, -2.9483e-03, -3.4398e-01,
        1.2779e-01, -9.4536e-02,  4.3467e-01,  4.9742e-01,  2.5068e-01])

In [None]:
unknown_word = df_glove.mean().values
unknown_word[:20]

array([ 0.22418612, -0.28881808,  0.13854355,  0.00365397, -0.12870769,
        0.1024395 ,  0.06162703,  0.07317769, -0.06135387, -1.34764119,
        0.42038748, -0.0635958 , -0.09683355,  0.18086288,  0.23704431,
        0.01412683,  0.1700973 , -1.14917018,  0.31498588,  0.06622261])

In [None]:
df_glove.head()

Unnamed: 0_level_0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,...,286,287,288,289,290,291,292,293,294,295,296,297,298,299,300
0,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1
",",-0.082752,0.67204,-0.14987,-0.064983,0.056491,0.40228,0.002775,-0.3311,-0.30691,2.0817,0.031819,0.013643,0.30265,0.00713,-0.5819,...,0.074901,0.061068,-0.4662,0.40054,-0.19099,-0.14331,0.018267,-0.18643,0.20709,-0.35598,0.05338,-0.050821,-0.1918,-0.37846,-0.06589
.,0.012001,0.20751,-0.12578,-0.59325,0.12525,0.15975,0.13748,-0.33157,-0.13694,1.7893,-0.47094,0.70434,0.26673,-0.089961,-0.18168,...,0.021307,-0.10778,-0.2281,0.50803,0.11567,0.16165,-0.066737,-0.29556,0.022612,-0.28135,0.0635,0.14019,0.13871,-0.36049,-0.035
the,0.27204,-0.06203,-0.1884,0.023225,-0.018158,0.006719,-0.13877,0.17708,0.17709,2.5882,-0.35179,-0.17312,0.43285,-0.10708,0.15006,...,-0.30193,0.043579,-0.043102,0.35025,-0.19681,-0.4281,0.16899,0.22511,-0.28557,-0.1028,-0.018168,0.11407,0.13015,-0.18317,0.1323
and,-0.18567,0.066008,-0.25209,-0.11725,0.26513,0.064908,0.12291,-0.093979,0.024321,2.4926,-0.017916,-0.071218,-0.24782,-0.26237,-0.2246,...,-0.005455,0.47796,0.090912,0.094489,-0.36882,-0.59396,-0.097729,0.20072,0.17055,-0.004736,-0.039709,0.32498,-0.023452,0.12302,0.3312
to,0.31924,0.06316,-0.27858,0.2612,0.079248,-0.21462,-0.10495,0.15495,-0.03353,2.4834,-0.50904,0.08749,0.21426,0.22151,-0.25234,...,-0.23702,0.038399,-0.10031,0.18359,0.025178,-0.12977,0.3713,0.18888,-0.004274,-0.10645,-0.2581,-0.044629,0.082745,0.097801,0.25045


In [None]:
# Creating a vectorized representation for each job title in our dataframe
job_titles = df.job_title_cleaned

doc_sent_vec = []

for sentences in job_titles:
    word_vec = []
    for word in sentences.split():
        if word in glove:
            vectors = glove[word]
            word_vec.append(vectors)
        else:
            word_vec.append(unknown_word)
    word_vec_mean = sum(word_vec) / len(word_vec) # returning a mean for each job title
    doc_sent_vec.append(word_vec_mean) # returning a list for all job titles
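The loop above averages word vectors per job title and falls back to the mean vector for unknown words. The same idea can be wrapped in one reusable function, sketched below under the assumption that titles are split on whitespace. `mean_embedding` and the 2-dimensional toy vectors are illustrative names and values, not part of the notebook.

```python
import numpy as np


def mean_embedding(text, vectors, fallback):
    """Average the vectors of the whitespace-separated words in `text`."""
    words = text.split()
    if not words:
        # No tokens at all: return the fallback rather than dividing by zero.
        return fallback
    return np.mean([vectors.get(word, fallback) for word in words], axis=0)


# Toy 2-dimensional vectors standing in for the real GloVe dictionary.
toy_vectors = {"human": np.array([1.0, 0.0]), "resources": np.array([0.0, 1.0])}
toy_unknown = np.array([0.5, 0.5])

known_mean = mean_embedding("human resources", toy_vectors, toy_unknown)
mixed_mean = mean_embedding("human loparex", toy_vectors, toy_unknown)
print(known_mean, mixed_mean)
```

Because the function takes the dictionary and the fallback as arguments, the same helper could serve both the job titles and the queries.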

In [None]:
len(doc_sent_vec)

53

In [None]:
doc_sent_vec[0].shape

(300,)

In [None]:
# Creating a vectorized representation for each query
def q_sent_vec(query):
    q_word_vec = []

    for word in query.split():
        if word in glove:
            q_word_vec.append(glove[word])
        else:
            q_word_vec.append(unknown_word)
    # Mean of the word vectors, computed once after the loop
    q_word_vec_mean = sum(q_word_vec) / len(q_word_vec)

    # Returned as a one-element list so it can be passed directly to cosine_similarity
    return [q_word_vec_mean]

In [None]:
query = 'native english speaking'
len(q_sent_vec(query))

1

In [None]:
q_sent_vec(query)[0].shape

(300,)

In [None]:
q_sent_vec(query)[0][:5]

array([-0.29654333,  0.12640833, -0.49922333,  0.22307667,  0.4358    ])

In [None]:
query = 'student indiana university'
q_sent_vec(query)[0][:5]

array([-0.10656   ,  0.06428367,  0.10134093, -0.19890667,  0.51552   ])

In [None]:
def get_glove_query_similarity(doc_sent_vec, query):
    """
    doc_sent_vec: GloVe embeddings for all docs
    query: query string

    return: cosine similarity between the query and all docs

    """
    query_glove = q_sent_vec(query)
    
    cos_sim_glove = cosine_similarity(query_glove, doc_sent_vec).flatten()
    
    return cos_sim_glove
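For reference, the cosine similarity between a query vector q and a document vector d is their dot product divided by the product of their lengths. The NumPy sketch below spells this out on made-up 2-dimensional vectors. `cosine_similarities` is an illustrative helper rather than the scikit-learn function used above, and scoring zero vectors as 0 is a choice made for this sketch.

```python
import numpy as np


def cosine_similarities(query_vec, doc_matrix):
    """Cosine similarity between one query vector and each row of doc_matrix."""
    query_vec = np.asarray(query_vec, dtype=float)
    doc_matrix = np.asarray(doc_matrix, dtype=float)
    dots = doc_matrix @ query_vec
    norms = np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(query_vec)
    # Rows with zero length cannot be normalised; they are scored as 0 here.
    return np.divide(dots, norms, out=np.zeros_like(dots), where=norms > 0)


toy_query = [1.0, 0.0]
toy_docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
toy_sims = cosine_similarities(toy_query, toy_docs)
print(toy_sims)
```

A document pointing in the same direction as the query scores 1, an orthogonal one scores 0, and the vector length itself does not matter.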

In [None]:
query = 'Aspiring human resources'

#GloVe similarity
cos_sim_glove = get_glove_query_similarity(doc_sent_vec, query = query)
df['glove_fit'] = cos_sim_glove

# original TFIDF similarity and Word2Vec Similarity for comparison
cos_sim = get_tf_idf_query_similarity(vectorizer, docs_tfidf, query = query) 
df['tfidf_fit'] = cos_sim

cos_sim_w2v = get_w2v_query_similarity(document_word_embeddings, query = query)
df['w2v_fit'] = cos_sim_w2v

In [None]:
top_candidates(n = 10, by = 'glove_fit', ascending = False, min_con = 0)

Unnamed: 0_level_0,job_title,location,connection,fit,job_title_cleaned,w2v_fit,tfidf_fit,glove_fit
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
97,Aspiring Human Resources Professional,"Kokomo, Indiana Area",71,0.240319,aspiring human resources professional,0.898174,0.735855,0.851023
3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,0.240319,aspiring human resources professional,0.898174,0.735855,0.851023
6,Aspiring Human Resources Specialist,Greater New York City Area,1,0.206629,aspiring human resources specialist,0.873679,0.632697,0.848638
73,"Aspiring Human Resources Manager, seeking inte...","Houston, Texas Area",7,0.362648,aspiring human resources manager seeking inte...,0.584569,0.50888,0.84536
74,Human Resources Professional,Greater Boston Area,16,0.295223,human resources professional,0.13422,0.340769,0.836803
28,Seeking Human Resources Opportunities,"Chicago, Illinois",390,0.675682,seeking human resources opportunities,0.619797,0.220668,0.825179
101,Human Resources Generalist at Loparex,"Raleigh-Durham, North Carolina Area",501,0.170244,human resources generalist loparex,0.20252,0.196509,0.799749
68,Human Resources Specialist at Luxottica,Greater New York City Area,501,0.164174,human resources specialist luxottica,0.151158,0.189503,0.790386
99,Seeking Human Resources Position,"Las Vegas, Nevada Area",48,0.675682,seeking human resources position,0.654387,0.220668,0.77637
27,Aspiring Human Resources Management student se...,"Houston, Texas Area",501,0.245337,aspiring human resources management student se...,0.628601,0.374733,0.773825


In [None]:
query = 'seeking human resources'

#GloVe similarity
cos_sim_glove = get_glove_query_similarity(doc_sent_vec, query = query)
df['glove_fit'] = cos_sim_glove

# original TFIDF similarity and Word2Vec Similarity for comparison
cos_sim = get_tf_idf_query_similarity(vectorizer, docs_tfidf, query = query) 
df['tfidf_fit'] = cos_sim

cos_sim_w2v = get_w2v_query_similarity(document_word_embeddings, query = query)
df['w2v_fit'] = cos_sim_w2v

In [None]:
top_candidates(n = 10, by = 'glove_fit', ascending = False, min_con = 0)

Unnamed: 0_level_0,job_title,location,connection,fit,job_title_cleaned,w2v_fit,tfidf_fit,glove_fit
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
28,Seeking Human Resources Opportunities,"Chicago, Illinois",390,0.675682,seeking human resources opportunities,0.839381,0.675682,0.970024
99,Seeking Human Resources Position,"Las Vegas, Nevada Area",48,0.675682,seeking human resources position,0.886226,0.675682,0.953714
73,"Aspiring Human Resources Manager, seeking inte...","Houston, Texas Area",7,0.362648,aspiring human resources manager seeking inte...,0.431644,0.362648,0.935586
74,Human Resources Professional,Greater Boston Area,16,0.295223,human resources professional,0.133104,0.295223,0.903558
94,Seeking Human Resources Opportunities. Open t...,Amerika Birleşik Devletleri,415,0.38129,seeking human resources opportunities open tr...,0.639099,0.38129,0.885495
6,Aspiring Human Resources Specialist,Greater New York City Area,1,0.206629,aspiring human resources specialist,0.645122,0.206629,0.874185
100,Aspiring Human Resources Manager | Graduating ...,"Cape Girardeau, Missouri",103,0.220083,aspiring human resources manager graduating ...,0.343832,0.220083,0.870053
3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,0.240319,aspiring human resources professional,0.663209,0.240319,0.864091
97,Aspiring Human Resources Professional,"Kokomo, Indiana Area",71,0.240319,aspiring human resources professional,0.663209,0.240319,0.864091
88,Human Resources Management Major,"Milpitas, California",18,0.177288,human resources management major,0.170611,0.177288,0.859179


# FastText
FastText is a library developed by Facebook AI Research for learning word embeddings and classifying text, known for its training speed and accuracy.  
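One feature that sets fastText apart from Word2Vec and GloVe is that it also learns vectors for character n-grams, so a word it has never seen can still get a vector from its pieces. The sketch below is a simplified illustration of how a word breaks into n-grams with `<` and `>` as boundary markers. `char_ngrams` is a hypothetical helper, and the real library additionally hashes the n-grams into buckets.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word` wrapped in boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the wrapped word.
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


trigrams = char_ngrams("where", min_n=3, max_n=3)
print(trigrams)
```

Two words that share many n-grams, such as "specialist" and "specialists", end up with related vectors even if only one of them appeared in the training data.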

In [None]:
import sys

sys.path

['c:\\Users\\achung\\OneDrive - Biological Dynamics, Inc\\LX Temp\\Apziva\\Potential Talent',
 'c:\\Users\\achung\\Miniconda3\\envs\\al_env\\python310.zip',
 'c:\\Users\\achung\\Miniconda3\\envs\\al_env\\DLLs',
 'c:\\Users\\achung\\Miniconda3\\envs\\al_env\\lib',
 'c:\\Users\\achung\\Miniconda3\\envs\\al_env',
 '',
 'c:\\Users\\achung\\Miniconda3\\envs\\al_env\\lib\\site-packages',
 'c:\\Users\\achung\\Miniconda3\\envs\\al_env\\lib\\site-packages\\win32',
 'c:\\Users\\achung\\Miniconda3\\envs\\al_env\\lib\\site-packages\\win32\\lib',
 'c:\\Users\\achung\\Miniconda3\\envs\\al_env\\lib\\site-packages\\Pythonwin']

In [None]:
# !pip install wget
!pip3.10 install --user wget



In [None]:
# Downloading the fastText source archive (v0.9.2); this is the library code, not the pre-trained vectors
import wget
wget.download('https://github.com/facebookresearch/fastText/archive/v0.9.2.zip')

'fastText-0.9.2.zip'

In [None]:
# Extracting the fastText source archive
import zipfile as zf
files = zf.ZipFile("fastText-0.9.2.zip", 'r')
files.extractall()
files.close()

In [None]:
os.chdir(r"C:\Users\achung\OneDrive - Biological Dynamics, Inc\LX Temp\Apziva\Potential Talent\fastText-0.9.2")

#### Issues and workarounds with installing fasttext:

https://stackoverflow.com/questions/44951456/pip-error-microsoft-visual-c-14-0-is-required

In [None]:
# !pip install --upgrade pip
# !pip install --upgrade wheel
# !pip install --upgrade setuptools
# !pip install Cython --install-option="--no-cython-compile"  # fails: this pip version rejects --install-option (see the error in the output below)

Collecting wheel
  Downloading wheel-0.43.0-py3-none-any.whl.metadata (2.2 kB)
Downloading wheel-0.43.0-py3-none-any.whl (65 kB)
   ---------------------------------------- 65.8/65.8 kB 3.5 MB/s eta 0:00:00
Installing collected packages: wheel
  Attempting uninstall: wheel
    Found existing installation: wheel 0.37.1
    Uninstalling wheel-0.37.1:
      Successfully uninstalled wheel-0.37.1
Successfully installed wheel-0.43.0
Collecting setuptools
  Downloading setuptools-70.0.0-py3-none-any.whl.metadata (5.9 kB)
Downloading setuptools-70.0.0-py3-none-any.whl (863 kB)
   --------------------------------------- 863.4/863.4 kB 10.9 MB/s eta 0:00:00
Installing collected packages: setuptools
  Attempting uninstall: setuptools
    Found existing installation: setuptools 61.2.0
    Uninstalling setuptools-61.2.0:
      Successfully uninstalled setuptools-61.2.0
Successfully installed setuptools-70.0.0



Usage:   
  pip install [options] <requirement specifier> [package-index-options] ...
  pip install [options] -r <requirements file> [package-index-options] ...
  pip install [options] [-e] <vcs project url> ...
  pip install [options] [-e] <local project path> ...
  pip install [options] <archive url/path> ...

no such option: --install-option


In [None]:
# !pip install fasttext
# !pip install fasttext-wheel



In [None]:
import fasttext

In [None]:
# Downloading pretrained model trained on Common Crawl and Wikipedia
# import fasttext.util
# fasttext.util.download_model('en', if_exists='ignore')  # English Skip downloading if you've already downloaded


In [None]:
ft = fasttext.load_model('cc.en.300.bin')

MemoryError: bad allocation

In [None]:
ft.get_word_vector('hello')[:20]

AttributeError: module 'fasttext' has no attribute 'get_word_vector'

In [None]:
ft.get_words()[:10]

[',', 'the', '.', 'and', 'to', 'of', 'a', '</s>', 'in', 'is']

In [None]:
# Creating a dictionary of fastText words and their vector representations
ft_words = ft.get_words()
ft_vectors = [ft.get_word_vector(word) for word in ft_words]
ft_dict = dict(zip(ft_words, ft_vectors))

In [None]:
ft_dict['hello'][:20]

array([ 0.15757619,  0.04378209, -0.00451272,  0.06659314,  0.07703468,
        0.00485855,  0.00819822,  0.00652403,  0.009259  ,  0.0353899 ,
       -0.02313953, -0.04918071, -0.08326425,  0.01560145,  0.25485662,
        0.03454237, -0.01074514, -0.07801886, -0.07080995,  0.07623856],
      dtype=float32)

In [None]:
df_ft = pd.DataFrame(ft_dict.items(), columns = ['ft_words', 'ft_vectors'])

In [None]:
df_ft.head(10)

Unnamed: 0,ft_words,ft_vectors
0,",","[0.12502378, -0.10790165, 0.02450176, -0.25286..."
1,the,"[-0.051744193, 0.073963955, -0.01305688, 0.044..."
2,.,"[0.03423236, -0.08014102, 0.116187684, -0.3968..."
3,and,"[0.008239111, -0.089902766, 0.026525287, -0.00..."
4,to,"[0.0046811374, 0.02812425, -0.029631453, -0.01..."
5,of,"[-7.303824e-05, -0.18774074, -0.07105116, -0.4..."
6,a,"[0.08764305, -0.49590126, -0.04985499, -0.0936..."
7,</s>,"[0.073061086, -0.24302974, -0.035331346, -0.36..."
8,in,"[-0.014047037, -0.25217462, 0.07150193, -0.024..."
9,is,"[-0.09776052, -0.20827363, -0.10372388, -0.016..."


In [None]:
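# Zero-vector fallback for words missing from ft_dict. fastText itself can build a vector
# for an out-of-vocabulary word from its subwords (ft.get_word_vector), so this may not be needed.
oov_word = np.zeros((300,))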

In [None]:
# Creating a fastText vectorized representation for each job title in our dataframe
job_titles = df.job_title_cleaned

doc_sent_vec_ft = []

for sentences in job_titles:
    word_vec_ft = []
    for word in sentences.split():
        if word in ft_dict:
            vectors = ft_dict[word]
            word_vec_ft.append(vectors)
        else:
            word_vec_ft.append(oov_word)
    word_vec_mean_ft = sum(word_vec_ft) / len(word_vec_ft) # returning a mean for each job title
    doc_sent_vec_ft.append(word_vec_mean_ft) # returning a list for all job titles

In [None]:
# Creating a fastText vectorized representation for each query
def q_sent_vec_ft(query):
    q_word_vec_ft = []

    for word in query.split():
        if word in ft_dict:
            q_word_vec_ft.append(ft_dict[word])
        else:
            q_word_vec_ft.append(oov_word)
    # Mean of the word vectors, computed once after the loop
    q_word_vec_mean_ft = sum(q_word_vec_ft) / len(q_word_vec_ft)

    # Returned as a one-element list so it can be passed directly to cosine_similarity
    return [q_word_vec_mean_ft]

In [None]:
def get_fasttext_query_similarity(doc_sent_vec_ft, query):
    """
    doc_sent_vec_ft: fastText embeddings for all docs
    query: query string

    return: cosine similarity between the query and all docs

    """
    query_fasttext = q_sent_vec_ft(query)
    
    cos_sim_fasttext = cosine_similarity(query_fasttext, doc_sent_vec_ft).flatten()
    
    return cos_sim_fasttext

In [None]:
query_fasttext = 0

In [None]:
query = 'Aspiring human resources'

#Fasttext similarity
cos_sim_fasttext = get_fasttext_query_similarity(doc_sent_vec_ft, query = query)
df['fasttext_fit'] = cos_sim_fasttext

# original TFIDF similarity and Word2Vec Similarity for comparison
cos_sim = get_tf_idf_query_similarity(vectorizer, docs_tfidf, query = query) 
df['tfidf_fit'] = cos_sim

cos_sim_w2v = get_w2v_query_similarity(document_word_embeddings, query = query)
df['w2v_fit'] = cos_sim_w2v

cos_sim_glove = get_glove_query_similarity(doc_sent_vec, query = query)
df['glove_fit'] = cos_sim_glove

In [None]:
top_candidates(n = 10, by = 'fasttext_fit', ascending = False, min_con = 0)

Unnamed: 0_level_0,job_title,location,connection,job_title_cleaned,w2v_fit,tfidf_fit,glove_fit,fasttext_fit
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
97,Aspiring Human Resources Professional,"Kokomo, Indiana Area",71,aspiring human resources professional,0.898174,0.735855,0.851023,0.905892
3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,aspiring human resources professional,0.898174,0.735855,0.851023,0.905892
6,Aspiring Human Resources Specialist,Greater New York City Area,1,aspiring human resources specialist,0.873679,0.632697,0.848638,0.888034
74,Human Resources Professional,Greater Boston Area,16,human resources professional,0.13422,0.340769,0.836803,0.877046
73,"Aspiring Human Resources Manager, seeking inte...","Houston, Texas Area",7,aspiring human resources manager seeking inte...,0.584569,0.50888,0.84536,0.860426
101,Human Resources Generalist at Loparex,"Raleigh-Durham, North Carolina Area",501,human resources generalist loparex,0.20252,0.196509,0.799749,0.841223
68,Human Resources Specialist at Luxottica,Greater New York City Area,501,human resources specialist luxottica,0.151158,0.189503,0.790386,0.834961
28,Seeking Human Resources Opportunities,"Chicago, Illinois",390,seeking human resources opportunities,0.619797,0.220668,0.825179,0.815429
99,Seeking Human Resources Position,"Las Vegas, Nevada Area",48,seeking human resources position,0.654387,0.220668,0.77637,0.78393
27,Aspiring Human Resources Management student se...,"Houston, Texas Area",501,aspiring human resources management student se...,0.628601,0.374733,0.773825,0.776401


In [None]:
query = 'seeking human resources'

#Fasttext similarity
cos_sim_fasttext = get_fasttext_query_similarity(doc_sent_vec_ft, query = query)
df['fasttext_fit'] = cos_sim_fasttext

# original TFIDF similarity and Word2Vec Similarity for comparison
cos_sim = get_tf_idf_query_similarity(vectorizer, docs_tfidf, query = query) 
df['tfidf_fit'] = cos_sim

cos_sim_w2v = get_w2v_query_similarity(document_word_embeddings, query = query)
df['w2v_fit'] = cos_sim_w2v

cos_sim_glove = get_glove_query_similarity(doc_sent_vec, query = query)
df['glove_fit'] = cos_sim_glove

In [None]:
top_candidates(n = 10, by = 'fasttext_fit', ascending = False, min_con = 0)

Unnamed: 0_level_0,job_title,location,connection,job_title_cleaned,w2v_fit,tfidf_fit,glove_fit,fasttext_fit
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
28,Seeking Human Resources Opportunities,"Chicago, Illinois",390,seeking human resources opportunities,0.839381,0.675682,0.970024,0.981158
99,Seeking Human Resources Position,"Las Vegas, Nevada Area",48,seeking human resources position,0.886226,0.675682,0.953714,0.95687
73,"Aspiring Human Resources Manager, seeking inte...","Houston, Texas Area",7,aspiring human resources manager seeking inte...,0.431644,0.362648,0.935586,0.924971
74,Human Resources Professional,Greater Boston Area,16,human resources professional,0.133104,0.295223,0.903558,0.90522
68,Human Resources Specialist at Luxottica,Greater New York City Area,501,human resources specialist luxottica,0.150797,0.164174,0.852014,0.893244
101,Human Resources Generalist at Loparex,"Raleigh-Durham, North Carolina Area",501,human resources generalist loparex,0.204287,0.170244,0.805987,0.876087
6,Aspiring Human Resources Specialist,Greater New York City Area,1,aspiring human resources specialist,0.645122,0.206629,0.874185,0.871857
3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,aspiring human resources professional,0.663209,0.240319,0.864091,0.865438
97,Aspiring Human Resources Professional,"Kokomo, Indiana Area",71,aspiring human resources professional,0.663209,0.240319,0.864091,0.865438
27,Aspiring Human Resources Management student se...,"Houston, Texas Area",501,aspiring human resources management student se...,0.464157,0.245337,0.856657,0.830511


In [None]:
print(similarities)

tensor([[0.4273],
        [0.2248],
        [0.7727],
        [0.3772],
        [0.2317],
        [0.7807],
        [0.6295],
        [0.4721],
        [0.8080],
        [0.2799],
        [0.3458],
        [0.5949],
        [0.6228],
        [0.8992],
        [0.5516],
        [0.7287],
        [0.6107],
        [0.5813],
        [0.6308],
        [0.6361],
        [0.5291],
        [0.6247],
        [0.7271],
        [0.4753],
        [0.6622],
        [0.5668],
        [0.6536],
        [0.5160],
        [0.3302],
        [0.4746],
        [0.6206],
        [0.4798],
        [0.5269],
        [0.1083],
        [0.3251],
        [0.1982],
        [0.6019],
        [0.5574],
        [0.3122],
        [0.2498],
        [0.4267],
        [0.2027],
        [0.6791],
        [0.2658],
        [0.2725],
        [0.7727],
        [0.1248],
        [0.9041],
        [0.6760],
        [0.6292],
        [0.1461],
        [0.1456],
        [0.2503]])


In [None]:
 
# printing values of sorted tensor 
print("Sorted values:\n", values) 
  
# printing indices of sorted value 
print("Indices:\n", indices)

Sorted values:
 tensor([[0.4273],
        [0.2248],
        [0.7727],
        [0.3772],
        [0.2317],
        [0.7807],
        [0.6295],
        [0.4721],
        [0.8080],
        [0.2799],
        [0.3458],
        [0.5949],
        [0.6228],
        [0.8992],
        [0.5516],
        [0.7287],
        [0.6107],
        [0.5813],
        [0.6308],
        [0.6361],
        [0.5291],
        [0.6247],
        [0.7271],
        [0.4753],
        [0.6622],
        [0.5668],
        [0.6536],
        [0.5160],
        [0.3302],
        [0.4746],
        [0.6206],
        [0.4798],
        [0.5269],
        [0.1083],
        [0.3251],
        [0.1982],
        [0.6019],
        [0.5574],
        [0.3122],
        [0.2498],
        [0.4267],
        [0.2027],
        [0.6791],
        [0.2658],
        [0.2725],
        [0.7727],
        [0.1248],
        [0.9041],
        [0.6760],
        [0.6292],
        [0.1461],
        [0.1456],
        [0.2503]])
Indices:
 tensor([[0],
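The "Sorted values" printed above are in the same order as the unsorted similarities, and the first index is 0. One possible cause (an assumption, since the sorting call itself is not shown) is sorting a column-shaped array of shape (53, 1) along its last axis: each row then holds a single element, so nothing moves. The NumPy sketch below reproduces the effect on made-up scores and shows that flattening first gives the intended ranking.

```python
import numpy as np

# Made-up scores shaped like a column, mimicking an (n, 1) similarity output.
scores = np.array([[0.43], [0.22], [0.77], [0.38]])

# Sorting along the last axis sorts each one-element row: nothing changes.
unchanged = np.sort(scores, axis=-1)

# Flatten to one dimension, then order from highest to lowest score.
flat = scores.ravel()
order = np.argsort(-flat)
ranked = flat[order]

print(unchanged.ravel())
print(order, ranked)
```

In PyTorch, the analogous fix would be to flatten the tensor before sorting, for example `torch.sort(similarities.flatten(), descending=True)`.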
      

# BERT

In [None]:
# First install
# # !pip install -U sentence-transformers




In [None]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")



In [None]:
titles_list = df['job_title_cleaned'].to_list()
embeddings = model.encode(titles_list)

In [None]:
def get_bert_query_similarity(embeddings, query):
    """
    embeddings: bert embedding for all docs
    query: query doc

    return: cosine similarity between query and all docs

    """
    query_embedding = model.encode(query)

    similarities = model.similarity(embeddings, query_embedding)
    
    cos_sim_bert = [x.item() for x in similarities]
    
    return cos_sim_bert

In [None]:
query = 'seeking human resources'
cos_sim_bert = get_bert_query_similarity(embeddings, query = query)
df['bert_fit'] = cos_sim_bert

top_candidates(n = 10, by = 'bert_fit', ascending = False, min_con = 0)

In [None]:
# Input candidates, query term, location, etc
query = 'seeking human resources'

def cos_sim_various(query):
    """Score every job title against `query` with each method and store the scores in df."""

    # fastText similarity
    df['fasttext_fit'] = get_fasttext_query_similarity(doc_sent_vec_ft, query = query)

    # TF-IDF similarity
    df['tfidf_fit'] = get_tf_idf_query_similarity(vectorizer, docs_tfidf, query = query)

    # Word2Vec similarity
    df['w2v_fit'] = get_w2v_query_similarity(document_word_embeddings, query = query)

    # GloVe similarity
    df['glove_fit'] = get_glove_query_similarity(doc_sent_vec, query = query)

    # BERT (sentence-transformers) similarity
    df['bert_fit'] = get_bert_query_similarity(embeddings, query = query)

    return df


In [None]:
# Word2Vec: same approach, but averaging pretrained word embeddings over the words
# Try to see who I'm connected with
Skill review survey - schedule interview - motivated

Process:
1. Sentence transformer:
    https://sbert.net/
    https://www.geeksforgeeks.org/sentence-similarity-using-bert-transformer/

2. Gen AI
    https://stackoverflow.com/questions/75673222/semantic-searching-using-google-flan-t5

3. Utilizing LLM via prompting
    GPT (Generative Pre-trained Transformer) - closed-box model accessed through the OpenAI API
    - Instead, focus on open-source LLMs such as the Llama 3 model from Meta
    - Mistral, Llama 2, Grok maybe?
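For the prompting route in step 3, the candidate titles and the role keywords would have to be turned into a text prompt before any model is called. The sketch below only builds such a prompt and does not call a model. `build_ranking_prompt` and the wording of the instructions are assumptions made for illustration, not a tested prompt.

```python
def build_ranking_prompt(query, titles):
    """Assemble a plain-text prompt asking a model to rank job titles for a query."""
    numbered = "\n".join(f"{i}. {title}" for i, title in enumerate(titles, start=1))
    return (
        f"Role we are hiring for: {query}\n"
        "Candidate job titles:\n"
        f"{numbered}\n"
        "Rank the candidates from best to worst fit and answer with the numbers only."
    )


example_titles = [
    "Aspiring Human Resources Professional",
    "Seeking Human Resources Opportunities",
    "Human Resources Generalist at Loparex",
]
prompt = build_ranking_prompt("aspiring human resources", example_titles)
print(prompt)
```

The resulting string could then be passed to whichever open-source model is chosen, and the returned ordering compared with the embedding-based rankings.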

# Gen AI

In [None]:
%pip install -U datasets==2.17.0

%pip install --upgrade pip
%pip install --disable-pip-version-check \
    torch==1.13.1 \
    torchdata==0.5.1 --quiet

%pip install \
    transformers==4.27.2 --quiet


In [None]:
from transformers import AutoModelForSeq2SeqLM
from transformers import AutoTokenizer
from transformers import GenerationConfig

In [None]:
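model_name='google/flan-t5-base'

# Note: this rebinds `model`, which previously held the SentenceTransformer used by
# get_bert_query_similarity; re-run that cell before calling the BERT helper again.
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)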




In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)


In [None]:
sentence = "What time is it, Tom?"

sentence_encoded = tokenizer(sentence, return_tensors='pt')

sentence_decoded = tokenizer.decode(
        sentence_encoded["input_ids"][0], 
        skip_special_tokens=True
    )

print('ENCODED SENTENCE:')
print(sentence_encoded["input_ids"][0])
print('\nDECODED SENTENCE:')
print(sentence_decoded)

ENCODED SENTENCE:
tensor([ 363,   97,   19,   34,    6, 3059,   58,    1])

DECODED SENTENCE:
What time is it, Tom?


In [None]:
sentence_encoded["input_ids"][0]

tensor([ 363,   97,   19,   34,    6, 3059,   58,    1])

In [None]:
# Tokenized input


In [None]:
import torch
# torch.mean needs a floating point tensor; calling it on the integer (Long) input_ids
# raises "RuntimeError: mean(): could not infer output dtype", so cast to float first.
# Note that the mean of token IDs is only a number, not a semantic embedding of the sentence.
torch.mean(sentence_encoded["input_ids"][0].float())

In [None]:
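# Creating a vectorized representation for each job title in our dataframe.
# One possible completion (an assumption): mean-pool the FLAN-T5 encoder's hidden states.
import torch

job_titles = df.job_title_cleaned

doc_sent_vec_t5 = []  # separate name so the GloVe vectors in doc_sent_vec are not overwritten
encoder = model.get_encoder()

for title in job_titles:
    title_encoded = tokenizer(title, return_tensors='pt')
    with torch.no_grad():
        token_states = encoder(**title_encoded).last_hidden_state  # shape (1, n_tokens, hidden_size)
    title_vec = token_states.mean(dim=1).squeeze(0).numpy()  # returning a mean for each job title
    doc_sent_vec_t5.append(title_vec)  # returning a list for all job titles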

In [None]:
example_indices = [40, 200]

dash_line = '-'.join('' for x in range(100))

for i, index in enumerate(example_indices):
    print(dash_line)
    print('Example ', i + 1)
    print(dash_line)
    print('INPUT DIALOGUE:')
    print(dataset['test'][index]['dialogue'])
    print(dash_line)
    print('BASELINE HUMAN SUMMARY:')
    print(dataset['test'][index]['summary'])
    print(dash_line)
    print()