Below are the steps to follow in this project (first read up on the
theory: word embeddings, word2vec, doc2vec, BERT):
1. Keep only 2 columns: job_title and connection.
2. Create doc2vec embeddings for "Aspiring human resources" (let's call
it w1) and for each job_title.
3. Calculate the cosine similarity between w1 and the embedding of
each job_title.
4. Your data should now have the columns job_title, connection,
cosine_similarity.
5. Make the "connection" column numeric by changing "500+" to "500"
and leaving the other values unchanged.
6. Scale the connection column to the range 0-1 (for example with
MinMaxScaler).
7. Create another column (ranking) as a weighted sum of
cosine_similarity and scaled_connection. Give the higher weight to
similarity, for example:
ranking = 0.8 * cosine_similarity + 0.2 * scaled_connection
8. Sort the dataframe by the ranking column in descending order. The
top n rows are the most relevant candidates.
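
Steps 5-8 can be sketched as follows. This is only an illustration on a few made-up rows: the similarity values are invented, the names demo, ranked and top_n are hypothetical, and the min-max scaling is written out by hand (it should give the same result as MinMaxScaler with its default 0-1 range).

```python
import pandas as pd

# Toy rows standing in for the real data; the similarity values are made up.
demo = pd.DataFrame({
    "job_title": ["Aspiring Human Resources Professional",
                  "Student",
                  "HR Senior Specialist"],
    "connection": ["44", "500+", "120"],
    "cosine_similarity": [0.90, 0.10, 0.60],
})

# Step 5: turn "500+" into 500 and cast the column to int.
demo["connection"] = (demo["connection"].astype(str)
                      .str.replace("+", "", regex=False)
                      .astype(int))

# Step 6: min-max scale the connection counts to the range 0-1.
conn = demo["connection"]
demo["scaled_connection"] = (conn - conn.min()) / (conn.max() - conn.min())

# Step 7: weighted sum, with the higher weight on similarity.
demo["ranking"] = (0.8 * demo["cosine_similarity"]
                   + 0.2 * demo["scaled_connection"])

# Step 8: sort in descending order and keep the top n rows (n = 2 here).
ranked = demo.sort_values("ranking", ascending=False).reset_index(drop=True)
top_n = ranked.head(2)
print(top_n[["job_title", "ranking"]])
```

With these weights a large connection count alone cannot outrank a clearly more similar title, which is the intent of giving similarity the 0.8 share.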

Now repeat steps 2-8, this time using Sentence-BERT embeddings for
"Aspiring human resources" and for each job_title.

In [None]:
import pandas as pd
import numpy as np
from numpy.linalg import norm
import nltk
import warnings
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize
from sklearn.preprocessing import MinMaxScaler
warnings.filterwarnings(action='ignore')

In [8]:
df = pd.read_csv("potential-talents - Aspiring human resources - seeking human resources.csv")
df.head()

Unnamed: 0,id,job_title,location,connection,fit
0,1,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,
1,2,Native English Teacher at EPIK (English Progra...,Kanada,500+,
2,3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,
3,4,People Development Coordinator at Ryan,"Denton, Texas",500+,
4,5,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+,


In [14]:
df_nodup = df.drop_duplicates(subset=["job_title", "location", "connection"], keep='first').copy()
df_nodup.shape

(53, 5)

In [24]:
# Define the list of documents.
phrases = df_nodup['job_title'].tolist()
# Preprocess the documents and create TaggedDocuments.
tagged_data = [TaggedDocument(words=word_tokenize(doc.lower()), tags=[str(i)])
               for i, doc in enumerate(phrases)]

# train the Doc2vec model
model = Doc2Vec(vector_size=20,
                min_count=2, epochs=50)
model.build_vocab(tagged_data)
model.train(tagged_data,
            total_examples=model.corpus_count,
            epochs=model.epochs)

# get the document vectors
document_vectors = [model.infer_vector(
    word_tokenize(doc.lower())) for doc in phrases]
w1 = model.infer_vector(word_tokenize("Aspiring human resources".lower()))
#  print the document vectors
for i, doc in enumerate(phrases):
    print("Title", i+1, ":", doc)
    print("Vector:", document_vectors[i])
    print()

Title 1 : 2019 C.T. Bauer College of Business Graduate (Magna Cum Laude) and aspiring Human Resources professional
Vector: [ 0.00381987 -0.0454778   0.02028244  0.04151611  0.01053877  0.07933739
  0.05562778  0.08126044 -0.16619493 -0.01258378  0.00154463 -0.02654381
 -0.07882547 -0.10154993  0.11701605  0.11332016  0.08977819 -0.04232595
 -0.10046097 -0.12430368]

Title 2 : Native English Teacher at EPIK (English Program in Korea)
Vector: [-0.01632667 -0.01770469 -0.00136892  0.02422138  0.00019099  0.03657374
  0.05528605  0.09197691 -0.12931189 -0.01575536  0.00408067 -0.00784068
 -0.05370707 -0.04361983  0.09944172  0.11840008  0.07613727 -0.00361084
 -0.06568988 -0.10386073]

Title 3 : Aspiring Human Resources Professional
Vector: [-0.00092919  0.01492468  0.01494225  0.000406   -0.01550548  0.00863823
  0.00042895  0.00535629 -0.04352801 -0.02272538 -0.00511711 -0.00630873
 -0.0198491  -0.0108943   0.03064361  0.0188887   0.00905088  0.01554202
 -0.02414591 -0.02031969]

Title 4

In [29]:
# Cosine similarity between w1 and each document vector
# (sklearn.metrics.pairwise.cosine_similarity would also work)
cosine_scores = []
for vec in document_vectors:
    cosine = np.dot(w1, vec) / (norm(w1) * norm(vec))
    cosine_scores.append(cosine)
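
The loop above can also be written as a single vectorized NumPy expression. The sketch below is one possible way to do it: cosine_to_all is a hypothetical helper name, and the toy vectors are made up so the expected values are easy to check by hand.

```python
import numpy as np

def cosine_to_all(query, matrix):
    """Cosine similarity between one query vector and every row of matrix."""
    query = np.asarray(query, dtype=float)
    matrix = np.asarray(matrix, dtype=float)
    # Row-wise norms times the query norm give one denominator per row.
    denom = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    return matrix @ query / denom

# Toy vectors: same direction, orthogonal, and opposite to the query.
w1_demo = np.array([1.0, 0.0])
docs_demo = np.array([[2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]])
scores_demo = cosine_to_all(w1_demo, docs_demo)
print(scores_demo)
```

On the real data this would be called as cosine_to_all(w1, document_vectors), assuming no vector has zero norm.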


In [32]:
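df_nodup["cosine_similarity"] = cosine_scores
# Copy the selection so the assignment below does not write to a slice of df_nodup.
df_keep = df_nodup[["job_title", "connection", "cosine_similarity"]].copy()
# Convert "500+" to 500 and cast the column to int.
df_keep["connection"] = df_keep["connection"].astype(str).str.replace("+", "", regex=False).astype(int)
df_keep.sample(10)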

Unnamed: 0,job_title,connection,cosine_similarity
67,Human Resources Specialist at Luxottica,500,0.059059
96,Aspiring Human Resources Professional,71,0.119435
78,Liberal Arts Major. Aspiring Human Resources A...,7,0.160538
100,Human Resources Generalist at Loparex,500,0.14905
93,Seeking Human Resources Opportunities. Open t...,415,0.054984
103,Director Of Administration at Excellence Logging,500,0.095891
66,"Human Resources, Staffing and Recruiting Profe...",500,0.298919
97,Student,4,0.141857
68,"Director of Human Resources North America, Gro...",500,0.146825
76,Human Resources|\nConflict Management|\nPolici...,409,0.146213


In [30]:
cosine_scores

[0.2111324,
 0.18700036,
 0.09355607,
 0.25342938,
 0.13225658,
 0.06949363,
 0.19027,
 0.1581087,
 0.15427653,
 0.1751178,
 0.10447391,
 0.017474493,
 0.17887844,
 0.20288602,
 0.286415,
 0.29891902,
 0.05905937,
 0.1468253,
 0.1275447,
 0.1225456,
 0.20892514,
 0.121346176,
 -0.13812737,
 0.18991542,
 0.29375875,
 0.14621304,
 0.10623543,
 0.16053829,
 0.25926775,
 0.03627459,
 0.124095745,
 0.13598607,
 0.14259462,
 0.21053061,
 0.21826716,
 0.24042818,
 0.26619425,
 0.04003927,
 -0.053166624,
 0.2758083,
 0.114032924,
 -0.18002363,
 0.054984078,
 0.33328524,
 0.21526062,
 0.11943461,
 0.14185712,
 0.09697039,
 0.2488652,
 0.14904968,
 -0.0049572224,
 0.064944215,
 0.095891364]

In [None]:

# Tokenize each job title and lowercase every token.
data = [[token.lower() for token in word_tokenize(phrase)]
        for phrase in phrases]

In [20]:
data

[['2019',
  'c.t',
  '.',
  'bauer',
  'college',
  'of',
  'business',
  'graduate',
  '(',
  'magna',
  'cum',
  'laude',
  ')',
  'and',
  'aspiring',
  'human',
  'resources',
  'professional'],
 ['native',
  'english',
  'teacher',
  'at',
  'epik',
  '(',
  'english',
  'program',
  'in',
  'korea',
  ')'],
 ['aspiring', 'human', 'resources', 'professional'],
 ['people', 'development', 'coordinator', 'at', 'ryan'],
 ['advisory', 'board', 'member', 'at', 'celal', 'bayar', 'university'],
 ['aspiring', 'human', 'resources', 'specialist'],
 ['student',
  'at',
  'humber',
  'college',
  'and',
  'aspiring',
  'human',
  'resources',
  'generalist'],
 ['hr', 'senior', 'specialist'],
 ['seeking', 'human', 'resources', 'hris', 'and', 'generalist', 'positions'],
 ['student', 'at', 'chapman', 'university'],
 ['svp',
  ',',
  'chro',
  ',',
  'marketing',
  '&',
  'communications',
  ',',
  'csr',
  'officer',
  '|',
  'engie',
  '|',
  'houston',
  '|',
  'the',
  'woodlands',
  '|',
  'e