In [1]:
import numpy as np
import pandas as pd
import nltk
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF
from IPython.display import display  # used below to render the ranking tables

## Data description: ##
The data comes from a talent sourcing and management company.  They are looking to automate searching and ranking potential candidates based on results from keyword searches of a database.  After an initial ranking, the goal is to star a particular candidate or set of candidates and use those entries to re-rank the candidates accordingly.

The provided data from search results are as follows:

id = unique identifier for candidate (numeric)
job_title = job title for candidate (text)
location = geographical location for candidate (text)
connection = number of connections the candidate has; 500+ means over 500 (text)
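
Because the connection count is stored as text, it has to be converted to a number before it can feed into a score. The following is a minimal sketch of that conversion; the helper name `parse_connections` and the fallback to `0` for empty values are assumptions for illustration, not part of the original notebook.

```python
import re

def parse_connections(raw):
    """Convert a connection string such as '500+ ' or '44' to an int.

    Hypothetical helper: every non-digit character is stripped, so '500+'
    becomes 500. Returning 0 for an empty result is an assumption.
    """
    digits = re.sub(r"\D", "", str(raw))
    return int(digits) if digits else 0

print(parse_connections("500+ "))
print(parse_connections("44"))
```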

The goal is to develop an empirical fit calculation that ranks the candidates and is recalculated whenever a candidate is starred.

### Empirical fit calculation ###
An empirical fit calculation was employed which weights each candidate by how well their `job_title` and `location` match the specified search term.  The search term is broken into all possible runs of consecutive words, and each run is scored by how frequently it appears in a candidate's `job_title` and `location`, relative to how frequently it appears across all candidates in the database.

For example, say the search string was `aspiring human resources`.  The following substrings would be searched:

`aspiring`, `human`, `resources`, 
`aspiring human`, `human resources`, 
`aspiring human resources`
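
The enumeration of substrings can be sketched as follows. The function name `contiguous_substrings` is hypothetical; it only illustrates the sliding-window idea and is not the notebook's own code.

```python
def contiguous_substrings(search_term):
    """Return every run of consecutive words in search_term, shortest first."""
    words = search_term.split()
    substrings = []
    for size in range(1, len(words) + 1):            # length of the word run
        for start in range(len(words) - size + 1):   # sliding window start
            substrings.append(" ".join(words[start:start + size]))
    return substrings

print(contiguous_substrings("aspiring human resources"))
# ['aspiring', 'human', 'resources', 'aspiring human', 'human resources', 'aspiring human resources']
```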

For each one of those substrings, each candidate was scored on whether that substring appears in their `job_title` or `location`.  Two things go into the score: how exactly the candidate matches the substring, and how common the substring is across all candidates.  This used a Natural Language Processing technique known as Term Frequency–Inverse Document Frequency (TF-IDF).
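
To make the TF-IDF idea concrete, here is a small pure-Python sketch for single-word terms. The weighting shown (raw count times a smoothed inverse document frequency, with each document vector L2-normalized) is an assumption chosen for illustration; the notebook itself relies on scikit-learn's `TfidfVectorizer`, whose exact settings may differ. The function name `tfidf_scores` and the toy titles are hypothetical.

```python
import math

def tfidf_scores(docs, term):
    """Score each document for a single-word term.

    Assumed weighting: count * (ln((1 + n) / (1 + df)) + 1),
    followed by L2 normalization of each document's weights.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    vocab = {w for doc in tokenized for w in doc}
    # Document frequency: in how many documents does each word occur?
    df = {w: sum(1 for doc in tokenized if w in doc) for w in vocab}
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocab}

    scores = []
    for doc in tokenized:
        weights = {w: doc.count(w) * idf[w] for w in set(doc)}
        norm = math.sqrt(sum(v * v for v in weights.values()))
        scores.append(weights.get(term, 0.0) / norm if norm else 0.0)
    return scores

titles = [
    "aspiring human resources professional",
    "aspiring human resources professional energetic team leader",
    "people development coordinator",
]
scores = tfidf_scores(titles, "aspiring")
print(scores)
```

In this toy example the shorter title scores higher than the longer one containing the same term, because the extra words dilute the normalized weight; the title without the term scores zero.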

The scores for each candidate across all substrings are then summed and normalized.  As a result, the more exactly a candidate matches the search string, without additional descriptors in their `job_title` or `location` fields, the higher the fit score they receive.
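
The summing and normalizing step can be sketched as below. The connection boost, `score * (1 + connections / 500)`, follows the fit formula in the notebook's `genInitialRankings` method; the function name `combine_fits`, the dictionary layout, and the sample numbers are assumptions made up for illustration.

```python
def combine_fits(per_substring_scores, connections):
    """Sum per-substring scores into one normalized fit per candidate.

    per_substring_scores: list of {candidate_id: tfidf_score}, one per substring
    connections:          {candidate_id: number_of_connections}
    """
    total = {cid: 0.0 for cid in connections}
    for scores in per_substring_scores:
        # Boost each score slightly by the candidate's connection count.
        boosted = {cid: scores.get(cid, 0.0) * (1 + connections[cid] / 500.0)
                   for cid in connections}
        top = max(boosted.values())
        if top == 0:
            continue  # substring matched nobody; it contributes nothing
        for cid in total:
            total[cid] += boosted[cid] / top  # normalize per substring
    top = max(total.values())
    # Final normalization so that the best candidate has a fit of 1.0.
    return {cid: (value / top if top else 0.0) for cid, value in total.items()}

# Made-up scores for two substrings and three candidates.
per_substring = [
    {1: 0.5, 2: 0.33, 3: 0.0},
    {1: 0.5, 2: 0.33, 3: 0.0},
]
fit = combine_fits(per_substring, connections={1: 44, 2: 174, 3: 500})
print(fit)
```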

### Starring and re-ranking ###
After the initial rankings are generated, a specific `id` can be selected for starring.  This operation can be repeated multiple times, if it's desired to star multiple candidates.

Once a candidate is starred, their `job_title` and `location` are used as the new search term (or as one of multiple search terms, each time an additional candidate is starred) and the empirical fit process is repeated.  This has the effect of increasing the fit score for all candidates who closely match the `job_title` and/or `location` of the starred candidate or candidates.  Upon each new starring, more search terms are added, and any candidate who best matches the combination of those search terms will rise in the rankings.
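
A minimal sketch of the re-ranking step, assuming one fit table has already been computed per starred search term. The function name `rerank_with_stars` and the fit values are made up for illustration; the tie-breaking order (fit, then starred, then id, all descending) mirrors the sort used in the notebook's `starEntry` method.

```python
def rerank_with_stars(fits_per_term, starred_ids):
    """Combine fit tables from several search terms and pin starred ids on top.

    fits_per_term: list of {candidate_id: fit}, one per starred search term
    starred_ids:   set of candidate ids that have been starred
    Returns a list of (candidate_id, fit, is_starred), best first.
    """
    total = {}
    for fits in fits_per_term:
        for cid, fit in fits.items():
            total[cid] = total.get(cid, 0.0) + fit
    top = max(total.values())
    for cid in starred_ids:
        total[cid] = top  # starred candidates receive the maximum fit
    rows = [(cid, fit / top if top else 0.0, cid in starred_ids)
            for cid, fit in total.items()]
    return sorted(rows, key=lambda r: (r[1], r[2], r[0]), reverse=True)

# Made-up fit tables for two starred candidates (ids 97 and 58).
fits_for_97 = {97: 1.0, 82: 0.42, 58: 0.9}
fits_for_58 = {97: 0.9, 82: 0.3, 58: 1.0, 46: 0.95}
ranked = rerank_with_stars([fits_for_97, fits_for_58], starred_ids={97, 58})
for row in ranked:
    print(row)
```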

#### Example ####
An example is carried out below.  The initial search term `aspiring human resources` is used to generate initial rankings, in which many candidates with that exact string in their `job_title` rise to the top.

The `id=97` candidate is then starred, which changes the search term to match their `job_title` and `location` and it can be seen that now candidates who match `Aspiring Human Resources Professional` are more likely to appear higher in the list.

The `id=58` candidate is then starred, which reinforces the `Aspiring Human Resources Professional` title.  Most of the top candidates now have that title in their `job_title`, and some candidates in the `Raleigh-Durham, North Carolina Area` have risen up the list, matching `id=58`'s location.  Note that the top two candidates are `id=97` and `id=58`, as a result of them being starred.

Finally, `id=18` candidate is starred, with a `job_title` of `People Development Coordinator at Ryan`, rearranging the candidates to reflect a mix of `Aspiring Human Resources Professional` and `People Development Coordinator at Ryan` at the top of the rankings.

In [110]:
class PotentialTalents:
    def __init__(self, filepath):
        self.rawData = pd.read_csv(filepath)
        self.processedData = self.rawData.copy()
        self.stop_words = set(stopwords.words('english'))
        self.addDefaultStopWords()
        self.rankedData = None
        self.searchTerms = []
        self.textProcessing()

    def addStopWord(self, word):
        self.stop_words.add(word)
        
    def addDefaultStopWords(self):
        self.addStopWord('area')
        self.addStopWord('greater')
        
    def textProcessing(self):
        # Copy and merge in the location data
        self.processedData['mergeColumns'] = self.processedData['job_title'].astype(str) + " " + \
                                             self.processedData['location'].astype(str)

        # Tokenize merged column
        self.processedData['tokenized'] = self.processedData['mergeColumns'].apply(word_tokenize)

        # Move to lowercase, remove punctuation
        self.processedData['lower'] = self.processedData['tokenized'].apply(lambda x: [word.lower() for word in x])
        self.processedData['no_punc'] = self.processedData['lower'].apply(lambda x: [word for word in x if word not in string.punctuation])

        # Remove stop words
        self.processedData['stopwords_removed'] = self.processedData['no_punc'].apply(lambda x: [word for word in x if word not in self.stop_words])
        self.processedData['jobTitleString'] = [' '.join(map(str,l)) for l in self.processedData['stopwords_removed']]
        self.processedData['jobTitleString'] = self.processedData['jobTitleString'].str.replace('-', ' ')
    
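        # Convert connections (e.g. '500+' -> 500); regex=True is needed because
        # pandas >= 2.0 treats the pattern as a literal string by default
        self.processedData['connectionsInt'] = self.processedData['connection'].str.replace(r'\D', '', regex=True).astype(int)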

    def getTFIDF(self, data=None, searchTermSize=1, minDF=3):
        vectorizer = TfidfVectorizer(max_df=0.9, min_df=minDF, max_features=1000, use_idf=True, ngram_range=(searchTermSize, searchTermSize))
        tfidf = vectorizer.fit_transform(data)
        # get_feature_names() was removed in scikit-learn 1.2; use the _out variant
        tfidfNames = vectorizer.get_feature_names_out()
        tfidfDataFrame = pd.DataFrame(tfidf.toarray(), columns=list(tfidfNames))
        return tfidfDataFrame
    
    def genInitialRankings(self, searchTerm='Business', passThrough=False):
        data = self.processedData.copy()
        self.rankedData = data[['id', 'job_title', 'location', 'connection']].copy()
        self.rankedData['fit'] = 0.0
        self.rankedData['starred'] = 0.0
        
        ## Calculate TFIDF for search term
        # The TF-IDF vocabulary is lower-cased, so lower-case the query as well
        searchTerms = searchTerm.lower().split()
        for i in range(len(searchTerms)):
            for j in range(len(searchTerms)-i):
                curSearchTerm = ' '.join(searchTerms[j:j+i+1])
                for mdf in [3,2,1]:
                    try:
                        tfidf = self.getTFIDF(data=data['jobTitleString'], searchTermSize=i+1, minDF=mdf)
                    except ValueError:  # vectorizer found no terms for this min_df
                        continue
                    tmpRankedData = pd.concat([data[['id','connectionsInt']], tfidf], axis=1)
                    tmpRankedData['fit'] = 0.0
                    try:
                        ### Calculate fit (TFIDF result + small delta for number of connections)
                        tmpRankedData['fit'] = tmpRankedData[curSearchTerm] + \
                                     tmpRankedData[curSearchTerm]*tmpRankedData['connectionsInt']/500.0
                        tmpRankedData['fit'] = tmpRankedData['fit']/tmpRankedData['fit'].max()
                        break
                    except KeyError:  # substring is not in the TF-IDF vocabulary
                        if (mdf == 1):
                            print ('crash',i,j,mdf,curSearchTerm)
                        continue
                self.rankedData = pd.merge(self.rankedData, tmpRankedData[['id','fit']], on='id', how='left')
                self.rankedData['fit'] = self.rankedData['fit_x'] + self.rankedData['fit_y']
                self.rankedData = self.rankedData.drop(['fit_x', 'fit_y'], axis=1)
        
        # Extract only relevant columns and sort by fit
        self.rankedData['fit'] = self.rankedData['fit']/self.rankedData['fit'].max()
        self.rankedData = self.rankedData.sort_values(by='fit', ascending=False)
        self.rankedData = self.rankedData.reset_index(drop=True)
        if (not passThrough):
            display (self.rankedData.set_index('id')[0:15])
        
    def starEntry(self, starID=None):
        if (starID is not None):
            newSearchTerm = self.processedData.loc[self.processedData.id == starID].jobTitleString.values[0]
            self.searchTerms.append(newSearchTerm)
            
            baseRankings = [self.rankedData.copy()]
            baseRankings[0]['fit'] = 0.0
            for searchTerm in self.searchTerms[:]:
                self.genInitialRankings(searchTerm, passThrough=True)
                baseRankings.append(self.rankedData[['id', 'fit']].copy())

            for baseFits in baseRankings[1:]:
                baseRankings[0] = pd.merge(baseRankings[0], baseFits, on='id', how='left')
                baseRankings[0]['fit'] = baseRankings[0].fit_x + baseRankings[0].fit_y
                baseRankings[0] = baseRankings[0].drop(['fit_x', 'fit_y'], axis=1)
            
            starIndex = baseRankings[0].loc[baseRankings[0].id == starID].index.values[0]
            baseRankings[0].loc[starIndex, 'starred'] = 1.0
            self.rankedData = baseRankings[0].copy()
            
            # Adjust rankings so starred entries are at top
            for starIndex in self.rankedData.loc[self.rankedData.starred == 1.0].index.values:
                self.rankedData.loc[starIndex, 'fit'] = self.rankedData['fit'].max()
            self.rankedData['fit'] = self.rankedData['fit']/self.rankedData['fit'].max()
            self.rankedData = self.rankedData.sort_values(by=['fit','starred', 'id'], ascending=False)
            self.rankedData = self.rankedData.reset_index(drop=True)
        display(self.rankedData.set_index('id')[0:15])
        
    def getRankings(self):
        return self.rankedData.set_index('id').copy()

## USAGE INSTRUCTIONS
##### 1) Process the data --> `foo = PotentialTalents('<filepath>')`
##### 2) Generate initial rankings --> `foo.genInitialRankings(searchTerm='<desiredSearchTerm>')`
##### 3) Star entries one at a time --> `foo.starEntry(starID=<numeric id of entry to star>)`
##### Note: Only the top 15 ranked entries are shown; to explore the rest, use `bar = foo.getRankings()` to extract the rankings DataFrame

In [119]:
potentialEmployees = PotentialTalents('potential-talents - Aspiring human resources - seeking human resources.csv')

potentialEmployees.genInitialRankings(searchTerm='aspiring human resources')



id,job_title,location,connection,starred,fit
82,Aspiring Human Resources Professional | An ene...,"Austin, Texas Area",174,0.0,1.0
97,Aspiring Human Resources Professional,"Kokomo, Indiana Area",71,0.0,0.919808
79,Liberal Arts Major. Aspiring Human Resources A...,"Baton Rouge, Louisiana Area",7,0.0,0.908024
29,Aspiring Human Resources Management student se...,"Houston, Texas Area",500+,0.0,0.850376
27,Aspiring Human Resources Management student se...,"Houston, Texas Area",500+,0.0,0.850376
67,"Human Resources, Staffing and Recruiting Profe...","Jackson, Mississippi Area",500+,0.0,0.75001
66,Experienced Retail Manager and aspiring Human ...,"Austin, Texas Area",57,0.0,0.741823
100,Aspiring Human Resources Manager | Graduating ...,"Cape Girardeau, Missouri",103,0.0,0.737929
76,Aspiring Human Resources Professional | Passio...,"New York, New York",212,0.0,0.596411
78,Human Resources Generalist at Schwan's,Amerika Birleşik Devletleri,500+,0.0,0.569798


In [120]:
potentialEmployees.starEntry(starID=97)

id,job_title,location,connection,starred,fit
97,Aspiring Human Resources Professional,"Kokomo, Indiana Area",71,1.0,1.0
82,Aspiring Human Resources Professional | An ene...,"Austin, Texas Area",174,0.0,0.419775
66,Experienced Retail Manager and aspiring Human ...,"Austin, Texas Area",57,0.0,0.322909
76,Aspiring Human Resources Professional | Passio...,"New York, New York",212,0.0,0.290667
67,"Human Resources, Staffing and Recruiting Profe...","Jackson, Mississippi Area",500+,0.0,0.227416
74,Human Resources Professional,Greater Boston Area,16,0.0,0.207372
79,Liberal Arts Major. Aspiring Human Resources A...,"Baton Rouge, Louisiana Area",7,0.0,0.206496
84,Human Resources professional for the world lea...,"Highland, California",50,0.0,0.194854
29,Aspiring Human Resources Management student se...,"Houston, Texas Area",500+,0.0,0.193386
27,Aspiring Human Resources Management student se...,"Houston, Texas Area",500+,0.0,0.193386


In [121]:
potentialEmployees.starEntry(starID=58)

id,job_title,location,connection,starred,fit
97,Aspiring Human Resources Professional,"Kokomo, Indiana Area",71,1.0,1.0
58,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,1.0,1.0
46,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,0.0,0.907237
33,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,0.0,0.907237
21,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,0.0,0.907237
17,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,0.0,0.907237
3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,0.0,0.907237
82,Aspiring Human Resources Professional | An ene...,"Austin, Texas Area",174,0.0,0.573363
66,Experienced Retail Manager and aspiring Human ...,"Austin, Texas Area",57,0.0,0.441055
101,Human Resources Generalist at Loparex,"Raleigh-Durham, North Carolina Area",500+,0.0,0.41066


In [122]:
potentialEmployees.starEntry(starID=18)

id,job_title,location,connection,starred,fit
97,Aspiring Human Resources Professional,"Kokomo, Indiana Area",71,1.0,1.0
58,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,1.0,1.0
18,People Development Coordinator at Ryan,"Denton, Texas",500+,1.0,1.0
46,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,0.0,0.907237
33,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,0.0,0.907237
21,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,0.0,0.907237
17,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,0.0,0.907237
3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,0.0,0.907237
59,People Development Coordinator at Ryan,"Denton, Texas",500+,0.0,0.779316
47,People Development Coordinator at Ryan,"Denton, Texas",500+,0.0,0.779316
