# Potential Talents

## 1. Exploratory Data Analysis

### 1.1. Data Import

In [1]:
import pandas as pd

data = pd.read_csv('potential-talents.csv')

### 1.2. Data Description

In [2]:
# Join all job titles into a single space-separated string
corpus = " ".join(data['job_title'])

In [3]:
import nltk

tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')

tokens = tokenizer.tokenize(corpus)

In [4]:
freq = nltk.FreqDist(tokens)
freq

FreqDist({'Human': 63, 'Resources': 63, 'at': 46, 'and': 28, 'Aspiring': 27, 'College': 14, 'Student': 14, 'Generalist': 14, 'Professional': 12, 'University': 12, ...})

The most frequent tokens in the job titles are "Human" and "Resources", each appearing 63 times.
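
`FreqDist` counts tokens exactly as they appear, so differently cased forms of the same word are tallied separately. A minimal sketch of a case-insensitive count using only the standard library follows; the three titles are made-up sample data rather than rows from the dataset, and `token_counts` is an illustrative helper name.

```python
import re
from collections import Counter

# Hypothetical sample titles, used only for illustration
titles = [
    "Aspiring Human Resources Professional",
    "Human resources generalist",
    "Student at a University",
]

def token_counts(texts):
    """Count lowercase word tokens across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"\w+", text.lower()))
    return counts

counts = token_counts(titles)
print(counts.most_common(2))
```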

## 2. Candidate Ranking

### 2.1. Ranking using Tokenization

In this section, candidates are ranked by the similarity of their job title to the target job title, "Aspiring human resources".  
To do this, each title is tokenized into a set of lowercase words with stopwords removed, and the similarity between the two word sets is measured with cosine similarity.  

In [5]:
X = "Aspiring human resources"
X_token = tokenizer.tokenize(X.lower())

# English stopwords (requires a one-time nltk.download('stopwords'))
sw = set(nltk.corpus.stopwords.words('english'))

# Keywords of the target title, with stopwords removed
X_set = {w for w in X_token if w not in sw}

results = []
for row in range(len(data)):
    tokens = tokenizer.tokenize(data['job_title'][row].lower())

    # Keywords of the candidate title, with stopwords removed
    Y_set = {w for w in tokens if w not in sw}

    # Build binary vectors over the union of both keyword sets
    rvector = X_set.union(Y_set)
    l1 = [1 if w in X_set else 0 for w in rvector]
    l2 = [1 if w in Y_set else 0 for w in rvector]

    # Cosine formula: dot product divided by the product of the vector norms
    c = sum(a * b for a, b in zip(l1, l2))
    denom = (sum(l1) * sum(l2)) ** 0.5
    cosine = c / denom if denom else 0.0  # guard against titles with no keywords

    results.append({"id": data['id'][row], "job_title": data['job_title'][row], "score": cosine})

results.sort(key=lambda k: k["score"], reverse=True)
results

[{'id': 3,
  'job_title': 'Aspiring Human Resources Professional',
  'score': 0.8660254037844387},
 {'id': 6,
  'job_title': 'Aspiring Human Resources Specialist',
  'score': 0.8660254037844387},
 {'id': 17,
  'job_title': 'Aspiring Human Resources Professional',
  'score': 0.8660254037844387},
 {'id': 21,
  'job_title': 'Aspiring Human Resources Professional',
  'score': 0.8660254037844387},
 {'id': 24,
  'job_title': 'Aspiring Human Resources Specialist',
  'score': 0.8660254037844387},
 {'id': 33,
  'job_title': 'Aspiring Human Resources Professional',
  'score': 0.8660254037844387},
 {'id': 36,
  'job_title': 'Aspiring Human Resources Specialist',
  'score': 0.8660254037844387},
 {'id': 46,
  'job_title': 'Aspiring Human Resources Professional',
  'score': 0.8660254037844387},
 {'id': 49,
  'job_title': 'Aspiring Human Resources Specialist',
  'score': 0.8660254037844387},
 {'id': 58,
  'job_title': 'Aspiring Human Resources Professional',
  'score': 0.8660254037844387},
 {'id': 60

The results show that candidates whose job titles contain the words of "Aspiring human resources" are ranked highest.
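
Because both vectors are binary, the cosine formula reduces to the number of shared keywords divided by the square root of the product of the two keyword counts. The sketch below is a simplified restatement of that calculation, not the notebook's own code; the function name `set_cosine` is illustrative, and the keyword sets are written out by hand.

```python
def set_cosine(x_set, y_set):
    """Cosine similarity of two keyword sets treated as binary vectors."""
    if not x_set or not y_set:
        return 0.0  # no keywords on one side: define the similarity as 0
    shared = len(x_set & y_set)
    return shared / (len(x_set) * len(y_set)) ** 0.5

target = {"aspiring", "human", "resources"}
candidate = {"aspiring", "human", "resources", "professional"}

# 3 shared keywords / sqrt(3 * 4)
score = set_cosine(target, candidate)
print(round(score, 4))  # 0.866
```

This reproduces the 0.866 score of the top-ranked titles, which contain all three target keywords plus one extra word.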

### 2.2. Ranking using Vectorization

This section ranks candidates by representing each job title as the average of its word vectors and comparing it with the target title using cosine similarity. The similarity between the candidate's location and the target location is calculated in the same way. The final similarity is the weighted average of the two scores.  
  
Xt = 0.9Xa + 0.1Xb  
Where,  
Xa = similarity score of job title  
Xb = similarity score of the location  
Xt = resultant similarity
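
The following sketch walks through the same calculation on toy data, using only the standard library. The two-dimensional embeddings, the helper names, and the example titles and locations are all hypothetical; a real run would use pre-trained word vectors instead.

```python
import math

# Hypothetical 2-D word embeddings, for illustration only
EMB = {
    "human": (1.0, 0.0),
    "resources": (0.8, 0.6),
    "engineer": (0.0, 1.0),
    "york": (0.6, 0.8),
    "boston": (0.8, 0.6),
}

def mean_vector(text):
    """Average the vectors of all known words; None if no word is known."""
    vecs = [EMB[w] for w in text.lower().split() if w in EMB]
    if not vecs:
        return None
    return tuple(sum(dim) / len(vecs) for dim in zip(*vecs))

def cosine(a, b):
    """Cosine similarity; 0.0 if either vector is missing or zero."""
    if a is None or b is None:
        return 0.0
    norm = math.hypot(*a) * math.hypot(*b)
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0

def combined_score(job, loc, target_job, target_loc, w_job=0.9):
    """Xt = w_job * Xa + (1 - w_job) * Xb."""
    xa = cosine(mean_vector(target_job), mean_vector(job))
    xb = cosine(mean_vector(target_loc), mean_vector(loc))
    return w_job * xa + (1 - w_job) * xb

same = combined_score("Human Resources", "York", "human resources", "york")
other = combined_score("Engineer", "Boston", "human resources", "york")
print(same > other)
```

With a job weight of 0.9, the location can only break ties or nudge the order among candidates whose titles are already close to the target.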


In [6]:
import numpy as np


class DocSim:
    def __init__(self, w2v_model, stopwords=None):
        self.w2v_model = w2v_model
        self.stopwords = stopwords if stopwords is not None else []

    def vectorize(self, doc: str) -> np.ndarray:
        """
        Represent a document as the mean of its word vectors.
        :param doc: text to vectorize
        :return: mean word vector, or a zero vector if no word is in the vocabulary
        """
        doc = doc.lower()
        words = [w for w in doc.split(" ") if w not in self.stopwords]
        word_vecs = []
        for word in words:
            try:
                vec = self.w2v_model[word]
                word_vecs.append(vec)
            except KeyError:
                # Ignore, if the word doesn't exist in the vocabulary
                pass

        if not word_vecs:
            # Avoid taking the mean of an empty list, which yields NaN and a warning
            return np.zeros(self.w2v_model.vector_size)

        # Assuming that document vector is the mean of all the word vectors
        # PS: There are other & better ways to do it.
        vector = np.mean(word_vecs, axis=0)
        return vector

    def _cosine_sim(self, vecA, vecB):
        """Find the cosine similarity between two vectors (0 if either is all zeros)."""
        norm = np.linalg.norm(vecA) * np.linalg.norm(vecB)
        if norm == 0 or np.isnan(norm):
            return 0
        return np.dot(vecA, vecB) / norm

In [7]:
from gensim.models.keyedvectors import KeyedVectors

# Using the pre-trained word2vec model trained on the Google News corpus
# (about 100 billion words; vectors for 3 million words and phrases).
# The model can be downloaded here: https://bit.ly/w2vgdrive (~1.4GB)
# Feel free to use your own model.
googlenews_model_path = './pt_data/GoogleNews-vectors-negative300.bin.gz'
stopwords_path = "./pt_data/stopwords_en.txt"

model = KeyedVectors.load_word2vec_format(googlenews_model_path, binary=True)
with open(stopwords_path, 'r') as fh:
    stopwords = fh.read().split(",")
ds = DocSim(model,stopwords=stopwords)

In [8]:
ideal_job = "Aspiring human resources"
ideal_loc = "New York"

# The target vectors do not change, so compute them once outside the loop
job_source_vec = ds.vectorize(ideal_job)
loc_source_vec = ds.vectorize(ideal_loc)

results = []
for row in range(len(data)):
    cand_job = str(data['job_title'][row])
    cand_loc = str(data['location'][row])

    job_target_vec = ds.vectorize(cand_job)
    job_sim_score = ds._cosine_sim(job_source_vec, job_target_vec)

    loc_target_vec = ds.vectorize(cand_loc)
    loc_sim_score = ds._cosine_sim(loc_source_vec, loc_target_vec)

    # Weighted average: 90% job title similarity, 10% location similarity
    sim_score = 0.9 * job_sim_score + 0.1 * loc_sim_score

    results.append({"id": data['id'][row], "job_title": data['job_title'][row], "location": data['location'][row], "score": sim_score})

# Sort once, after all candidates have been scored
results.sort(key=lambda k: k["score"], reverse=True)
results


[{'id': 6,
  'job_title': 'Aspiring Human Resources Specialist',
  'location': 'Greater New York City Area',
  'score': 0.9030113697052002},
 {'id': 24,
  'job_title': 'Aspiring Human Resources Specialist',
  'location': 'Greater New York City Area',
  'score': 0.9030113697052002},
 {'id': 36,
  'job_title': 'Aspiring Human Resources Specialist',
  'location': 'Greater New York City Area',
  'score': 0.9030113697052002},
 {'id': 49,
  'job_title': 'Aspiring Human Resources Specialist',
  'location': 'Greater New York City Area',
  'score': 0.9030113697052002},
 {'id': 60,
  'job_title': 'Aspiring Human Resources Specialist',
  'location': 'Greater New York City Area',
  'score': 0.9030113697052002},
 {'id': 97,
  'job_title': 'Aspiring Human Resources Professional',
  'location': 'Kokomo, Indiana Area',
  'score': 0.8979275226593019},
 {'id': 3,
  'job_title': 'Aspiring Human Resources Professional',
  'location': 'Raleigh-Durham, North Carolina Area',
  'score': 0.8797976911067963},
 

Candidates whose job titles are similar to the target job title are ranked higher. Among candidates with similar job titles, those whose location is more similar to the target location are ranked higher.

## 3. Similarity Reranking

When a candidate is starred, that candidate's job title and location become the new target job title and target location. The similarity of every candidate is then recalculated against these targets, and the list is reranked.
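
As a rough, self-contained illustration of this step, the sketch below reranks a tiny hand-written candidate list after one candidate is starred. The candidate records, the `rerank` helper, and the simple word-overlap scorer are assumptions made for the example; in practice the word-vector similarity from section 2.2 would take the scorer's place.

```python
def overlap(a, b):
    """Toy similarity: shared words / sqrt(product of word counts)."""
    x, y = set(a.lower().split()), set(b.lower().split())
    if not x or not y:
        return 0.0
    return len(x & y) / (len(x) * len(y)) ** 0.5

def rerank(candidates, starred_id, w_job=0.9):
    """Score every candidate against the starred one and sort descending."""
    starred = next(c for c in candidates if c["id"] == starred_id)
    scored = []
    for c in candidates:
        score = (w_job * overlap(starred["job_title"], c["job_title"])
                 + (1 - w_job) * overlap(starred["location"], c["location"]))
        scored.append({**c, "score": score})
    return sorted(scored, key=lambda c: c["score"], reverse=True)

# Hypothetical candidates, for illustration only
candidates = [
    {"id": 1, "job_title": "Software Engineer", "location": "Austin Texas"},
    {"id": 2, "job_title": "Human Resources Specialist", "location": "New York"},
    {"id": 3, "job_title": "Human Resources Generalist", "location": "Austin Texas"},
]

ranked = rerank(candidates, starred_id=2)
print([c["id"] for c in ranked])
```

The starred candidate always comes out on top, since its title and location are identical to the new targets.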

In [9]:
starred_id = int(input("Enter id of starred candidate: "))

Enter id of starred candidate: 3


In [10]:
# Look up the starred candidate by id rather than by row position
starred = data.loc[data['id'] == starred_id].iloc[0]
ideal_job = str(starred['job_title'])
ideal_loc = str(starred['location'])

# The target vectors do not change, so compute them once outside the loop
job_source_vec = ds.vectorize(ideal_job)
loc_source_vec = ds.vectorize(ideal_loc)

results = []
for row in range(len(data)):
    cand_job = str(data['job_title'][row])
    cand_loc = str(data['location'][row])

    job_target_vec = ds.vectorize(cand_job)
    job_sim_score = ds._cosine_sim(job_source_vec, job_target_vec)

    loc_target_vec = ds.vectorize(cand_loc)
    loc_sim_score = ds._cosine_sim(loc_source_vec, loc_target_vec)

    sim_score = 0.9 * job_sim_score + 0.1 * loc_sim_score
    results.append({"id": data['id'][row], "job_title": data['job_title'][row], "location": data['location'][row], "score": sim_score})

# Sort once, after all candidates have been scored
results.sort(key=lambda k: k["score"], reverse=True)
results

[{'id': 3,
  'job_title': 'Aspiring Human Resources Professional',
  'location': 'Raleigh-Durham, North Carolina Area',
  'score': 1.0000000953674317},
 {'id': 17,
  'job_title': 'Aspiring Human Resources Professional',
  'location': 'Raleigh-Durham, North Carolina Area',
  'score': 1.0000000953674317},
 {'id': 21,
  'job_title': 'Aspiring Human Resources Professional',
  'location': 'Raleigh-Durham, North Carolina Area',
  'score': 1.0000000953674317},
 {'id': 33,
  'job_title': 'Aspiring Human Resources Professional',
  'location': 'Raleigh-Durham, North Carolina Area',
  'score': 1.0000000953674317},
 {'id': 46,
  'job_title': 'Aspiring Human Resources Professional',
  'location': 'Raleigh-Durham, North Carolina Area',
  'score': 1.0000000953674317},
 {'id': 58,
  'job_title': 'Aspiring Human Resources Professional',
  'location': 'Raleigh-Durham, North Carolina Area',
  'score': 1.0000000953674317},
 {'id': 97,
  'job_title': 'Aspiring Human Resources Professional',
  'location': '