#### Goals and Objectives

- Create a ranking algorithm that determines the best-fitting candidates based on their job title, location, and number of connections.
- The algorithm should produce a ranked list that can be manually reviewed and reranked based on human input.
- Process the job title text, since there is no standard format for job titles.
- Pipeline: tokenization, splitting, stopword removal, lemmatization/stemming -> bag of words/TF-IDF -> word embeddings.
- Use cosine similarity to score the fit.
- The model has to predict the fit, and the predicted fit is then used in the next iteration of the model.
- Small sample size: only ~100 data points, with some duplicates.
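
As a rough illustration of the bag-of-words/TF-IDF plus cosine similarity idea in the list above, here is a minimal standard-library sketch. The helper names (`tokenize`, `tfidf_vectors`, `cosine_sparse`, `rank_titles`) are hypothetical, the whitespace tokenizer is only a stand-in for fuller preprocessing, and the smoothed IDF formula is one common variant chosen as an assumption, not something the notebook prescribes.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for fuller preprocessing)."""
    return text.lower().split()

def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document using smoothed TF-IDF."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        term_counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in term_counts.items()
        })
    return vectors

def cosine_sparse(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_titles(query, titles):
    """Rank titles by TF-IDF cosine similarity to the query, best first."""
    vectors = tfidf_vectors(titles + [query])
    query_vec = vectors[-1]
    scored = [(cosine_sparse(query_vec, vec), title)
              for vec, title in zip(vectors, titles)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

titles = [
    "Aspiring Human Resources Professional",
    "Business Intelligence and Analytics at Travelers",
    "Human Resources Generalist at Loparex",
]
ranking = rank_titles("human resources", titles)
for score, title in ranking:
    print(f"{score:.3f}  {title}")
```

A title that shares no term with the query scores exactly 0 here, which is the main weakness of bag-of-words on short, non-standard job titles and one reason to try embeddings next.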

Ways to filter:
- Candidates with few connections (<10) should not be on the list.

Questions:

Not sure what kind of model to use: predictive, generative, or RNNs. I don't think I have experience using predicted target variables as training data.
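
One possible way to feed human input back into the ranking without training a model is a Rocchio-style relevance feedback step: move the query vector toward the embedding of a candidate the reviewer starred, then recompute the fit. This is only a sketch of one option, not an answer to the question above. The function `rerank`, the blending weight `alpha`, and the toy 2-d vectors are all hypothetical stand-ins for real sentence embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rerank(query_vec, candidate_vecs, starred_id=None, alpha=0.5):
    """Rank candidate ids by cosine fit.

    If a candidate is starred, first blend the query with that candidate's
    vector: new_query = (1 - alpha) * query + alpha * starred.
    Returns the ranked ids and the query vector that was actually used.
    """
    if starred_id is not None:
        starred = candidate_vecs[starred_id]
        query_vec = [(1 - alpha) * q + alpha * s
                     for q, s in zip(query_vec, starred)]
    fits = {cid: cosine(query_vec, vec) for cid, vec in candidate_vecs.items()}
    return sorted(fits, key=fits.get, reverse=True), query_vec

# Toy 2-d vectors standing in for sentence embeddings (hypothetical values).
candidates = {1: [1.0, 0.0], 2: [0.6, 0.8], 3: [0.0, 1.0]}
query = [1.0, 0.1]

first_pass, _ = rerank(query, candidates)
second_pass, updated_query = rerank(query, candidates, starred_id=3)
print(first_pass)   # ranking before feedback
print(second_pass)  # ranking after starring candidate 3
```

Starring a candidate pulls similar candidates up the list, and the updated query can be carried into the next iteration.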

Word Embeddings:
https://projector.tensorflow.org/

SentenceTransformers:
https://www.sbert.net/

#### Sentence Transformers

In [1]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [2]:
from sentence_transformers import SentenceTransformer, util
import numpy as np
model = SentenceTransformer('all-MiniLM-L6-v2')

In [3]:
potential_talents = pd.read_excel("../data/potential-talents.xlsx",index_col='id')

In [4]:
job_titles = potential_talents['job_title'].tolist()

In [5]:
# Encode every job title into a sentence embedding
embeddings = model.encode(job_titles)

In [6]:
# Suppose we are searching for "Human Resources Manager": compare the query to each job title and store the cosine similarity in the fit column
query = "Human Resources Manager"
query_vec = model.encode([query])[0]

In [7]:
def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
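
For reference, the `cosine` function above computes the standard cosine similarity. Written out, with a small worked example on two hypothetical 2-d vectors:

```latex
\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert},
\qquad
u = (1, 0),\; v = (1, 1):\quad
\cos(u, v) = \frac{1 \cdot 1 + 0 \cdot 1}{1 \cdot \sqrt{2}} = \frac{1}{\sqrt{2}} \approx 0.707
```

The value lies in $[-1, 1]$; a higher value means the two embeddings point in more similar directions, regardless of their lengths.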

In [8]:
# Reuse the embeddings computed above (same order as job_titles) and assign the
# whole column at once; this avoids chained indexing and the SettingWithCopyWarning.
potential_talents["fit"] = [cosine(query_vec, emb) for emb in embeddings]


In [9]:
potential_talents.sort_values('fit', ascending=False)

id,job_title,location,connection,fit
88,Human Resources Management Major,"Milpitas, California",18,0.882888
74,Human Resources Professional,Greater Boston Area,16,0.835964
60,Aspiring Human Resources Specialist,Greater New York City Area,1,0.795265
6,Aspiring Human Resources Specialist,Greater New York City Area,1,0.795265
49,Aspiring Human Resources Specialist,Greater New York City Area,1,0.795265
...,...,...,...,...
85,RRP Brand Portfolio Executive at JTI (Japan To...,Greater Philadelphia Area,500+,0.167967
87,Bachelor of Science in Biology from Victoria U...,"Baltimore, Maryland",40,0.146463
102,Business Intelligence and Analytics at Travelers,Greater New York City Area,49,0.131559
93,Admissions Representative at Community medical...,"Long Beach, California",9,0.119092


In [52]:
def connection(connect):
    # The raw data stores large networks as a string such as "500+"; treat any string as 500.
    if isinstance(connect, str):
        return 500

    return connect


def sen_embed(query, df, connects=0):
    # Normalize the connection counts. Note: this updates the caller's DataFrame in place.
    df['connection'] = df['connection'].apply(connection)
    # Keep candidates above the threshold; copy so new columns are not written to a slice.
    df = df[df['connection'] > connects].copy()

    # Encode all queries and all job titles once instead of once per row.
    query_vecs = model.encode(query)
    title_vecs = model.encode(df['job_title'].tolist())

    # For each query, compute the cosine similarity to every job title.
    fit_names = []
    for i, query_vec in enumerate(query_vecs):
        print(query[i])
        df[f'fit_{i}'] = [cosine(query_vec, title_vec) for title_vec in title_vecs]
        fit_names.append(f'fit_{i}')

    # Average the per-query fits into a single fit column.
    df['fit'] = df[fit_names].mean(axis=1)
    df = df.drop(fit_names, axis=1)

    return df.sort_values('fit', ascending=False)

In [53]:
sen_embed(["Human Resources", "Manager"], potential_talents).head(10)

Human Resources
Manager


id,job_title,location,connection,fit
74,Human Resources Professional,Greater Boston Area,16,0.621488
88,Human Resources Management Major,"Milpitas, California",18,0.605289
99,Seeking Human Resources Position,"Las Vegas, Nevada Area",48,0.547611
67,"Human Resources, Staffing and Recruiting Profe...","Jackson, Mississippi Area",500,0.546901
36,Aspiring Human Resources Specialist,Greater New York City Area,1,0.531672
6,Aspiring Human Resources Specialist,Greater New York City Area,1,0.531672
24,Aspiring Human Resources Specialist,Greater New York City Area,1,0.531672
60,Aspiring Human Resources Specialist,Greater New York City Area,1,0.531672
49,Aspiring Human Resources Specialist,Greater New York City Area,1,0.531672
33,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,0.518655


In [None]:
potential_talents

id,job_title,location,connection,fit
1,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,0.482766
2,Native English Teacher at EPIK (English Progra...,Kanada,500,0.236672
3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,0.758993
4,People Development Coordinator at Ryan,"Denton, Texas",500,0.464186
5,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500,0.218308
...,...,...,...,...
100,Aspiring Human Resources Manager | Graduating ...,"Cape Girardeau, Missouri",103,0.701203
101,Human Resources Generalist at Loparex,"Raleigh-Durham, North Carolina Area",500,0.588534
102,Business Intelligence and Analytics at Travelers,Greater New York City Area,49,0.131559
103,Always set them up for Success,Greater Los Angeles Area,500,0.097286


#### Text Preprocessing

Pipeline: tokenize -> remove stopwords -> stem/lemmatize

In [None]:
potential_talents.head()

id,job_title,location,connection,fit
1,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,
2,Native English Teacher at EPIK (English Progra...,Kanada,500+,
3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,
4,People Development Coordinator at Ryan,"Denton, Texas",500+,
5,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+,


In [None]:
# Python functions for text preprocessing.
# The NLTK 'stopwords' and 'wordnet' corpora must be downloaded first (nltk.download).
from nltk.corpus import stopwords
import string
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

def lower(txt):
    '''
    Lowercases each token in the input list of tokens.
    '''
    return [word.lower() for word in txt]

def remove_punc(txt):
    '''
    Removes punctuation from each token and drops tokens that become empty
    (e.g. a standalone "|").
    '''
    table = str.maketrans('', '', string.punctuation)
    txt = [w.translate(table) for w in txt]
    return [w for w in txt if w]

def remove_stopwords(txt):
    '''
    Removes NLTK English stopwords from the input list of tokens.
    '''
    # A set makes the membership test O(1) per token.
    stop_words = set(stopwords.words('english'))
    return [w for w in txt if w not in stop_words]
    
def my_stemmer(txt):
    '''
    Stems each token in the input list using NLTK's Porter stemmer.
    '''
    porter = PorterStemmer()
    return [porter.stem(word) for word in txt]

def my_lemma(txt):
    '''
    Lemmatizes each token in the input list using NLTK's WordNet lemmatizer.
    '''
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(word) for word in txt]

def preprocess(txt):
    '''
    Applies the text preprocessing steps to a raw string and returns a list of tokens.
    '''
    txt = txt.split()
    txt = remove_punc(txt)
    txt = lower(txt)
    txt = remove_stopwords(txt)
    # txt = my_stemmer(txt)
    txt = my_lemma(txt)
    return txt

In [None]:
job_titles_processed = potential_talents['job_title'].apply(preprocess)

In [None]:
# word embedding, Word2Vec/Doc2Vec
from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

#### Bonuses

Ways to filter out:

Some accounts may not have enough connections or may not be in the right location. We could filter out accounts with <10 connections, or filter out people who are not in the country we are searching in.
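
A minimal sketch of both filters, using plain dictionaries in place of DataFrame rows. The names `parse_connections` and `filter_candidates` and the `location_keyword` parameter are hypothetical. Matching a keyword against the free-text location is only a crude stand-in for a real country lookup, since many locations in the data (e.g. "Greater New York City Area") do not name a country.

```python
def parse_connections(value):
    """Convert a raw connection value such as 85 or "500+" to an int."""
    if isinstance(value, str):
        return int(value.strip().rstrip("+"))
    return int(value)

def filter_candidates(rows, min_connections=10, location_keyword=None):
    """Keep rows with enough connections and, optionally, a matching location."""
    kept = []
    for row in rows:
        if parse_connections(row["connection"]) < min_connections:
            continue  # too few connections
        if location_keyword and location_keyword.lower() not in row["location"].lower():
            continue  # location text does not mention the keyword
        kept.append(row)
    return kept

# A few rows copied from the data above.
rows = [
    {"id": 3, "location": "Raleigh-Durham, North Carolina Area", "connection": 44},
    {"id": 6, "location": "Greater New York City Area", "connection": 1},
    {"id": 5, "location": "İzmir, Türkiye", "connection": "500+"},
]

enough = filter_candidates(rows)
local = filter_candidates(rows, location_keyword="North Carolina")
print([r["id"] for r in enough])
print([r["id"] for r in local])
```

Filtering before computing the fit also keeps low-connection duplicates out of the top of the ranked list.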