# Automated CV <-> Job Description Ranker 

**Task** :  This unsupervised machine learning algorithm finds the top fit for a certain job description using Topic Modeling and Cosine Similarity.

**Dataset** : around 145,000 job descriptions

### Content :

### 1. Modules
### 2. Preprocessing
### 3. TF-IDF VS BOW 
### 4. Topic Modeling
### 5. Cosine Similarity
### 6. Test
### 7. Possible Updates

## Modules 

In [1]:
import pandas as pd
import numpy as np
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from nltk.stem.wordnet import WordNetLemmatizer
from nltk import corpus
import string
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import metrics
import gensim
from gensim import corpora, models
from pprint import pprint
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from pprint import pprint

## Preprocessing


#### Balancing out the dataset

#### Removing all unecessery characters and numbers from our text and removed stop words, stemmed and lemmantized our corpus for better accuracy.

In [118]:
jd2 = pd.read_csv('all_jobs.csv')

In [120]:
#choosing 300000random job descriptions
import random
random.seed(333)
indices = jd2.index.values.tolist()

random_30000 = random.sample(indices, 20000)

random_30000[:5]

[18180, 11495, 11571, 7069, 13251]

In [121]:
#loading the random 10000 of second dataset
jd2_train = jd2.loc[random_10000, :]
jd2_train = jd2_train.reset_index(drop = True)
jd2_train = jd2_train['Job Description']

In [122]:
jd3 = pd.read_csv('dice_com-job_us_sample.csv').dropna()

In [123]:
jd3_train = jd3['jobdescription']

#### Concatinating

In [124]:
jd2_train = jd2_train.append(jd3_train)

In [125]:
jd2_train = jd2_train.dropna()

In [126]:
dataset = jd2_train

In [128]:
#Remove all urls
def remove_urls(s):
    s = re.sub('[^\s]*.com[^\s]*', "", s)
    s = re.sub('[^\s]*www.[^\s]*', "", s)
    s = re.sub('[^\s]*.co.uk[^\s]*', "", s)
    return s

In [129]:
dataset = dataset.map(remove_urls)

In [130]:
# Remove the star_words
def remove_star_words(s):
    return re.sub('[^\s]*[\*]+[^\s]*', "", s)

In [131]:
dataset = dataset.map(remove_star_words)

In [132]:
#Removing all numbers
def remove_nums(s):
    return re.sub('[^\s]*[0-9]+[^\s]*', "", s)

In [133]:
dataset = dataset.map(remove_nums)

In [134]:
# Remove the punctuations
from string import punctuation

def remove_punctuation(s):
    global punctuation
    for p in punctuation:
        s = s.replace(p, '')
    return s

In [135]:
dataset = dataset.map(remove_punctuation)

In [136]:
# Convert to lower case
dataset =dataset.map(lambda x: x.lower())

In [137]:
for item in dataset.head(20):
    print(item)
    print('ENDDDDD/n')
    print(type(item))

title

engineer quality control inspector  kuwaitthe prepositioning and marine corps logistics services pmcls program is based in jacksonville florida we provide maintenance and logistics services to the us marine corps usmc and us navy we are seeking a engineer quality control inspector for our kuwait location position responsibilities inspects maintenance activities such as handling storing servicing and repairing of usmc engineering equipment assigned to the marine expeditionary unit augmentation program · responsible for conducting quality control inspections on all usmc engineer equipment · responsible for collecting and analyzing data to make decisions that improve maintenance quality performance and customer satisfaction · analyze and display data to allow decision making based on maintenance history and quality performance data · monitors the activities of all personnel engaged in the input receipt and dissemination of gcssmc and related reports · use and interpret usmc mpr das

#### Only accepting English words

In [138]:
#only accepting english words
words = set(nltk.corpus.words.words())
english_dataset = []
def english_words(text):
    text_clean = " ".join(w for w in nltk.wordpunct_tokenize(text) \
         if w.lower() in words or not w.isalpha())
    return text_clean

In [139]:
#only accepting english words
clean_dataset = [english_words(description) for description in dataset]

#### removing stop words

In [141]:
#converting our data list to a series
series = pd.Series((i for i in clean_dataset)) 

In [142]:
#removing stop words
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

In [143]:
#removing stop words
data = series.apply(lambda x: ' '.join([word for word in x.split() if word not in (stop_words)]))

#### Lemmatize

In [144]:
# Lemmatize with POS Tag
from nltk.corpus import wordnet

def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)

lemmatizer = WordNetLemmatizer()
def lemmatize_text(text):
    return [lemmatizer.lemmatize(word,get_wordnet_pos(word)) for word in tokenizer.tokenize(text)]

In [145]:
tokenizer = nltk.tokenize.WhitespaceTokenizer()

In [147]:
dataset_clean = [lemmatize_text(item) for item in data]

In [157]:
dataset_clean[5986]

['financial',
 'analyst',
 'phoenix',
 'banner',
 'health',
 'please',
 'enable',
 'continue',
 'please',
 'enable',
 'browser',
 'experience',
 'site',
 'ability',
 'apply',
 'job',
 'page',
 'candidate',
 'log',
 'back',
 'financial',
 'analyst',
 'job',
 'number',
 'facility',
 'corporate',
 'office',
 'department',
 'security',
 'address',
 'street',
 'north',
 'central',
 'ave',
 'address',
 'location',
 'work',
 'schedule',
 'day',
 'position',
 'type',
 'post',
 'category',
 'information',
 'technology',
 'nonclinical',
 'health',
 'care',
 'constantly',
 'banner',
 'health',
 'front',
 'change',
 'lead',
 'health',
 'care',
 'make',
 'experience',
 'best',
 'want',
 'change',
 'care',
 '–',
 'people',
 'choose',
 'take',
 'challenge',
 'health',
 'care',
 'well',
 'like',
 'something',
 'want',
 'part',
 'want',
 'hear',
 'business',
 'analyst',
 'organization',
 'execute',
 'management',
 'financial',
 'performance',
 'conduct',
 'deliver',
 'financial',
 'return',
 'investmen

In [174]:
#converting our data list to a series
series_clean = pd.Series(i for i in dataset_clean)

In [172]:
series_clean_str = [str(item) for item in series_clean]

#### Stemming

In [77]:
clean_data = series_clean.apply(lambda x:' '.join([stemmer.stem(str(word)) for word in x]))

## TF-IDF

Comparing TF-IDF vs BOW method.

In [80]:
vectorizer = TfidfVectorizer()

In [179]:
tfIdfVectorizer=TfidfVectorizer(use_idf=True)
tfIdf = tfIdfVectorizer.fit_transform(series_clean_str)
df = pd.DataFrame(tfIdf[0].T.todense(), index=tfIdfVectorizer.get_feature_names(), columns=["TF-IDF"])
df = df.sort_values('TF-IDF', ascending=False)
print (df.head(10))

               TF-IDF
maintenance  0.360153
marine       0.326817
inspector    0.262237
engineer     0.216660
quality      0.205927
equipment    0.190849
corp         0.181680
readiness    0.174221
display      0.167905
recovery     0.160400


## Topic Modeling

Since our dataset is large, we've used Latent Diricht Allocation for Dimension Reduction.

Before training our model, we need to preprocess our input in order to get the best performance.

In [189]:
#creating a dictionary with number of occurrence
dictionary = gensim.corpora.Dictionary(series_clean)
count = 0
for k, v in dictionary.iteritems():
    print(k, v)
    count += 1
    if count > 10:
        break

0 acceptable
1 achieve
2 allow
3 along
4 among
5 analyze
6 assign
7 augmentation
8 base
9 class
10 close


In [190]:
#filtering our dictionary
dictionary.filter_extremes(no_below=10, no_above=0.5, keep_n=500000)

In [193]:
#trying out BOW method
bow_corpus = [dictionary.doc2bow(doc) for doc in series_clean]

[(1, 1),
 (5, 1),
 (8, 1),
 (16, 2),
 (17, 1),
 (18, 1),
 (31, 2),
 (38, 1),
 (46, 1),
 (49, 1),
 (52, 1),
 (54, 2),
 (60, 1),
 (62, 1),
 (65, 2),
 (71, 1),
 (78, 1),
 (81, 1),
 (86, 1),
 (97, 1),
 (101, 2),
 (102, 1),
 (103, 1),
 (114, 1),
 (120, 2),
 (121, 1),
 (126, 1),
 (127, 1),
 (130, 3),
 (132, 1),
 (134, 1),
 (135, 2),
 (138, 1),
 (141, 3),
 (151, 2),
 (157, 1),
 (163, 2),
 (169, 1),
 (174, 1),
 (178, 1),
 (185, 1),
 (190, 1),
 (205, 1),
 (229, 1),
 (238, 3),
 (240, 1),
 (243, 1),
 (246, 5),
 (256, 2),
 (281, 1),
 (282, 1),
 (288, 1),
 (291, 1),
 (334, 1),
 (340, 1),
 (347, 1),
 (354, 1),
 (356, 2),
 (358, 6),
 (361, 1),
 (381, 1),
 (390, 4),
 (391, 25),
 (396, 2),
 (410, 7),
 (417, 2),
 (428, 1),
 (431, 1),
 (451, 1),
 (455, 1),
 (456, 1),
 (492, 1),
 (509, 1),
 (516, 1),
 (519, 3),
 (531, 1),
 (534, 2),
 (538, 1),
 (546, 2),
 (548, 1),
 (570, 2),
 (574, 1),
 (609, 5),
 (610, 2),
 (613, 1),
 (616, 1),
 (620, 1),
 (636, 1),
 (637, 1),
 (649, 1),
 (655, 1),
 (656, 1),
 (665, 1),

In [194]:
#trying out tfidf
tfidf = models.TfidfModel(bow_corpus)
corpus_tfidf = tfidf[bow_corpus]
for doc in corpus_tfidf:
    pprint(doc)
    break

[(0, 0.06826236158097014),
 (1, 0.03712634345322998),
 (2, 0.057239571519254885),
 (3, 0.04346292678071762),
 (4, 0.052509443150834396),
 (5, 0.04128664015110409),
 (6, 0.05381386680255456),
 (7, 0.09710355509425028),
 (8, 0.03851779394637199),
 (9, 0.04568430857023668),
 (10, 0.04898439901857154),
 (11, 0.044395261174912964),
 (12, 0.05447986249082435),
 (13, 0.03395920881618943),
 (14, 0.10160160542957915),
 (15, 0.14899163356971953),
 (16, 0.03924776337523569),
 (17, 0.03946637979916749),
 (18, 0.03877345378047814),
 (19, 0.13515551220058386),
 (20, 0.08419529780523445),
 (21, 0.028609887898771526),
 (22, 0.08170166343394092),
 (23, 0.1507964224704855),
 (24, 0.02246435610289257),
 (25, 0.017807422846473252),
 (26, 0.14152658320422706),
 (27, 0.03395920881618943),
 (28, 0.13304909846409),
 (29, 0.03797592678912748),
 (30, 0.049883061061237396),
 (31, 0.05830731765694331),
 (32, 0.023847145599947973),
 (33, 0.028975920966831745),
 (34, 0.04979296914237805),
 (35, 0.23066321346095153)

In [220]:
#initializing our LDA model
lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics=5, id2word=dictionary, passes=15, workers=4)

### BOW Vectorized LDA Model

In [221]:
#seeing our Topics
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

Topic: 0 
Words: 0.013*"position" + 0.011*"health" + 0.009*"may" + 0.009*"must" + 0.007*"research" + 0.007*"education" + 0.007*"related" + 0.006*"please" + 0.006*"time" + 0.006*"prefer"
Topic: 1 
Words: 0.017*"financial" + 0.008*"process" + 0.008*"status" + 0.007*"’" + 0.007*"risk" + 0.006*"finance" + 0.006*"project" + 0.006*"accounting" + 0.006*"related" + 0.005*"ensure"
Topic: 2 
Words: 0.019*"test" + 0.019*"project" + 0.018*"design" + 0.017*"technical" + 0.012*"·" + 0.011*"understand" + 0.009*"technology" + 0.009*"system" + 0.009*"process" + 0.009*"application"
Topic: 3 
Words: 0.099*"•" + 0.028*"security" + 0.016*"network" + 0.012*"technical" + 0.010*"system" + 0.009*"must" + 0.008*"engineering" + 0.008*"service" + 0.007*"infrastructure" + 0.006*"level"
Topic: 4 
Words: 0.010*"’" + 0.009*"analytics" + 0.008*"learn" + 0.008*"help" + 0.007*"u" + 0.007*"new" + 0.007*"opportunity" + 0.006*"people" + 0.006*"make" + 0.006*"product"


In [206]:
#Using bow_corpus to use TFIDF 
tfidf = models.TfidfModel(bow_corpus)
corpus_tfidf = tfidf[bow_corpus]
from pprint import pprint
for doc in corpus_tfidf:
    pprint(doc)
    break

[(0, 0.06826236158097014),
 (1, 0.03712634345322998),
 (2, 0.057239571519254885),
 (3, 0.04346292678071762),
 (4, 0.052509443150834396),
 (5, 0.04128664015110409),
 (6, 0.05381386680255456),
 (7, 0.09710355509425028),
 (8, 0.03851779394637199),
 (9, 0.04568430857023668),
 (10, 0.04898439901857154),
 (11, 0.044395261174912964),
 (12, 0.05447986249082435),
 (13, 0.03395920881618943),
 (14, 0.10160160542957915),
 (15, 0.14899163356971953),
 (16, 0.03924776337523569),
 (17, 0.03946637979916749),
 (18, 0.03877345378047814),
 (19, 0.13515551220058386),
 (20, 0.08419529780523445),
 (21, 0.028609887898771526),
 (22, 0.08170166343394092),
 (23, 0.1507964224704855),
 (24, 0.02246435610289257),
 (25, 0.017807422846473252),
 (26, 0.14152658320422706),
 (27, 0.03395920881618943),
 (28, 0.13304909846409),
 (29, 0.03797592678912748),
 (30, 0.049883061061237396),
 (31, 0.05830731765694331),
 (32, 0.023847145599947973),
 (33, 0.028975920966831745),
 (34, 0.04979296914237805),
 (35, 0.23066321346095153)

### TFIDF Vectorized LDA Model

In [264]:
#Training our model with TFIDF vectors
lda_model_tfidf = gensim.models.LdaMulticore(corpus_tfidf, num_topics=5, id2word=dictionary, passes=15, workers=4)
for idx, topic in lda_model_tfidf.print_topics(-1):
    print('Topic: {} Word: {}'.format(idx, topic))

Topic: 0 Word: 0.023*"•" + 0.006*"project" + 0.005*"test" + 0.004*"system" + 0.004*"sap" + 0.004*"functional" + 0.003*"user" + 0.003*"technical" + 0.003*"client" + 0.003*"process"
Topic: 1 Word: 0.014*"web" + 0.012*"developer" + 0.008*"c" + 0.007*"code" + 0.007*"test" + 0.007*"server" + 0.006*"design" + 0.006*"application" + 0.006*"oracle" + 0.006*"net"
Topic: 2 Word: 0.026*"·" + 0.014*"security" + 0.011*"network" + 0.007*"cisco" + 0.006*"hardware" + 0.005*"•" + 0.005*"clearance" + 0.004*"government" + 0.004*"infrastructure" + 0.004*"system"
Topic: 3 Word: 0.011*"spark" + 0.010*"cloud" + 0.009*"big" + 0.009*"python" + 0.008*"tech" + 0.008*"locate" + 0.008*"engineer" + 0.008*"statistical" + 0.008*"hive" + 0.007*"azure"
Topic: 4 Word: 0.003*"’" + 0.003*"analytics" + 0.003*"health" + 0.003*"marketing" + 0.002*"status" + 0.002*"learn" + 0.002*"financial" + 0.002*"research" + 0.002*"people" + 0.002*"product"


#### BOW method seems to work better

### Preprocess for unseen documents :

In [265]:
def lemmatize_stemming(text):
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))
def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result

In [282]:
def bow_vector(document):
    bow_vector = dictionary.doc2bow(preprocess(document))
    return bow_vector

In [269]:
CV1 = 'Structured cross border Funds, worked with consultants on preparation and review of Fund structures and documentation/ setting up of entities/ obtaining tax exemptions (13X/13R).'

In [271]:
CV2 = 'Reporting to Managing Partner, the FC role is responsible for financial control and all accounting activities as well as tax reporting of the Master-Feeder Fund and its 10  SPVs'

In [273]:
CV3 = 'Machine Learning Engineer, worked with nlp and computer vison'

In [295]:
bow_vector1 = dictionary.doc2bow(preprocess(CV1))
bow_vector2 = dictionary.doc2bow(preprocess(CV2))
bow_vector3 = dictionary.doc2bow(preprocess(CV3))

### Results

In [None]:
#preprocessing and vectorizing the unseen document i norder to get results
bow_vector = dictionary.doc2bow(preprocess(CV2))
for index, score in sorted(lda_model[bow_vector], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 5)))

In [None]:
for idx, topic in lda_model.print_topics(-1)

## Cosine Similarity 

For comparing each CV to a certain job description, we're using Cosine Similarity.

In [290]:
from sklearn.metrics.pairwise import cosine_similarity

## Testing

In [297]:
job_description = 'Tetra Tech is seeking mid to senior level Silverlight Developer to support a highly visible federal government contract, particularly focused around supporting the mission of transportation. The candidate is responsible for working with business requirements and develop application using Silverlight. The candidate should be a self-starter, motivated, able to manage multiple priorities and tasks in a dynamic environment with a good understanding of software development standards and best practices.  Primary job duties and responsibilities may include, but are not limited to the following: This position will design and develop solutions using Silverlight to meet requirements provided by the customer.This position could feasibly work with XAML, JavaScript, Web Services, python scripting and RDBMS technologies such as SQL Server during a given release cycle.Work closely with project managers, partner companies, and customers to analyze complex problems, define requirements and implement solutions.The successful candidate will have:Requirements:7+ years of IT experience with at least 3 years in Silverlight developmentExcellent analytical and problem-solving skillsFamiliarity with basic User Interface and User Experience principlesUnderstanding of Web Services and Open-Standard Data SpecificationsMicrosoft Development Stack (C#, XAML, ASP.Net, JavaScript)Experience with ESRI’s Web API (Silverlight)Database programming experience with SQL ServerExperience working in a dynamic, Agile development environment Ability to prioritize, plan and estimate interdependent tasksAbility to operate independently as needed to implement assignmentsAbility to effectively communicate design concepts and architecture to customers and stakeholdersDesired Skills/Experience:Experience developing and integrating with Geospatial Map and Web ServicesExperience developing and integrating with Geospatial Processing TasksExperience developing with Python within ESRI’s ArcGIS Technology StackDeveloping WCF Services (SOAP, REST)Experience with ArcGIS Server data management operationsExperience developing on non-ESRI geospatial platforms (Google, Bing Maps, open source tools)Experience using FME Tetra Tech is an Equal Opportunity Employer, and we value workplace diversity. We invite resumes from all interested parties and consider applicants for all positions without regard to race, color, religion, sex, national origin, age, marital status, sexual preference, personal appearance, family responsibility, the presence of a non-job-related medical condition or physical disability, matriculation, political affiliation, veteran status, or any other legally protected status. Tetra Tech is a VEVRAA federal contractor and we request priority referral of veterans for available positions.'

In [298]:
vetorized_job_description = dictionary.doc2bow(preprocess(job_description))

In [None]:
cosine_similarity_list = sorted[(cosine_similarity(CV1,job_description)),(cosine_similarity(CV2,job_description),(cosine_similarity(CV3,job_description)]

In [307]:
print('{} is the best fit for this job description with {} similarity'.format(CV,cosine_similarity_list[0]))

CV3 is the best fit for this job description with 63.8 similarity


## Conclusion

### This unsupervised algorithem can catagorize CVs and Job Descriptions using Topic Modeling, and detect the most related CV to a job description, as well as finding the best fit amongst a group of CVs for a certain job description using cosine similarity.

## Future Possible Updates

<ul>
    <li> As you've observed, we have 5 categories of job description which mostly consist of health technician, developement,business and finance related jobs. for further update, we can add more categories such as law, art, data science, etc. </li>
    <li> Pipeline feature where you can compare multiple CVs together.</li>
    <li> We can add a skill extraction feature which detects specific skills in a CV</li>
    <li> We can add a PDF reader feature, which would have the ability to compare PDF CVs as wel. </li>
    <li> Ranking CVs based on seniority and education level. </li>
 