In [13]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
import numpy as np

## Step 1: Getting jobs saved previously

In [56]:
df = pd.read_pickle("step1_df.pk")

In [57]:
print(df)

                                                 title  \
0                   Data Engineer - Columbus, GA 31909   
1                         Data Analyst - St. Louis, MO   
2                          Data Scientist - Newark, CA   
3                 Scientific Programmer - Berkeley, CA   
4    PwC Labs - Jr. Data Scientist - Machine Learni...   
..                                                 ...   
776  Software Product Manager, Framework and Applic...   
777  Natural Language Processing and Machine Learni...   
778           Data Scientist - San Francisco, CA 94103   
779                    Data Scientist - Glen Mills, PA   
780              Data Analyst (Part-Time) - Austin, TX   

                                                  body  \
0    Data Engineer - Columbus, GA 31909\nCelebratin...   
1    Data Analyst - St. Louis, MO\nDuties\nSummary\...   
2    Data Scientist - Newark, CA\nData Scientist\n\...   
3    Scientific Programmer - Berkeley, CA\nCaribou ...   
4    PwC Labs

## Step 2: Getting my resumé data and adding to the main dataframe

In [101]:
resume = {'title': ['Data Scientist'], 
'body': ['Skills\nPython, Pandas, machine learning, natural language processing\nExperience\nManning / Data Analyst\nOct 2019 PRESENT,  REMOTE\nAnalyzed\nand visualized vast amounts of data using Pandas, Python, and\nMatplotlib.\nEducation\nBerkeley / B.S. Mathematics\nAugust 2015 - May 2019,\nBERKELEY, CA\nGraduated summa cum laude.'], 
'bullets': [('Skills Python, Pandas, machine learning, natural language processing', 'Experience\nManning / Data Analyst\nOct 2019 PRESENT,  REMOTE\nAnalyzed and visualized\nvast amounts of data using Pandas, Python, and Matplotlib.', 'Education\nBerkeley / B.S. Mathematics\nAugust 2015 - May 2019,  BERKELEY, CA\nGraduated summa cum\nlaude.')]}

In [65]:
resume

{'title': ['Data Scientist'],
 'body': ['Skills\nPython, Pandas, machine learning, natural language processing\nㅡExperience\nManning / Data Analyst\nOct 2019 - PRESENT,  REMOTE\nAnalyzed\nand visualized vast amounts of data using Pandas, Python, and\nMatplotlib.\nEducation\nBerkeley / B.S. Mathematics\nAugust 2015 - May 2019,\nBERKELEY, CA\nGraduated summa cum laude.'],
 'bullets': [('Skills Python, Pandas, machine learning, natural language processing',
   'Experience\nManning / Data Analyst\nOct 2019 - PRESENT,  REMOTE\nAnalyzed and visualized\nvast amounts of data using Pandas, Python, and Matplotlib.',
   'Education\nBerkeley / B.S. Mathematics\nAugust 2015 - May 2019,  BERKELEY, CA\nGraduated summa cum\nlaude.')]}

In [102]:
dfresume = pd.DataFrame(resume)

Creating the dataframe of jobs with our resume in the position 0

In [103]:
dfjobs = pd.concat([dfresume, df])

In [104]:
print(dfjobs)

                                                 title  \
0                                       Data Scientist   
0                   Data Engineer - Columbus, GA 31909   
1                         Data Analyst - St. Louis, MO   
2                          Data Scientist - Newark, CA   
3                 Scientific Programmer - Berkeley, CA   
..                                                 ...   
776  Software Product Manager, Framework and Applic...   
777  Natural Language Processing and Machine Learni...   
778           Data Scientist - San Francisco, CA 94103   
779                    Data Scientist - Glen Mills, PA   
780              Data Analyst (Part-Time) - Austin, TX   

                                                  body  \
0    Skills\nPython, Pandas, machine learning, natu...   
0    Data Engineer - Columbus, GA 31909\nCelebratin...   
1    Data Analyst - St. Louis, MO\nDuties\nSummary\...   
2    Data Scientist - Newark, CA\nData Scientist\n\...   
3    Scientif

## 3. Creating Vectorizer
Vectorizer will be used to see the relevance of the resume in the job postings.

In [105]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english')

We are using the body to compare, this could be more managed

In [106]:
jobstfidf_matrix = tfidf_vectorizer.fit_transform(dfjobs['body'])

In [107]:
jobstfidf_np_matrix = jobstfidf_matrix.toarray()

We get the first row of the jobs because this is our resume

In [108]:
resume_vector = jobstfidf_np_matrix[0]

In [109]:
resume_nonzero_vector = np.flatnonzero(resume_vector)

In [121]:
print(resume_nonzero_vector)

[  227   232  1110  1134  1146  1515  1741  2083  3282  3373  4157  4695
  5534  6884  6915  6971  7264  7350  7438  7446  7986  8319  8652  9272
  9386  9660 10099 11017 11603 12629 12697 12829]


In [111]:
words = tfidf_vectorizer.get_feature_names()

In [112]:
resume_vector[resume_nonzero_vector]

array([0.19650473, 0.25770535, 0.13378734, 0.10821811, 0.22040107,
       0.22040107, 0.35598419, 0.06631507, 0.26876062, 0.07720463,
       0.09060629, 0.03955799, 0.26876062, 0.07718334, 0.26876062,
       0.0492957 , 0.0513137 , 0.26876062, 0.07213008, 0.11972752,
       0.09681251, 0.18848937, 0.18634829, 0.11268949, 0.08116062,
       0.0940112 , 0.15612792, 0.04627146, 0.26876062, 0.05569674,
       0.14153369, 0.26876062])

In [118]:
unique_words = [words[index] for index in resume_nonzero_vector]

Getting the words of the resume to be able to check how good are our comparinson

## 4. Calculating similarities

In [123]:
cosine_similarities = jobstfidf_np_matrix @ jobstfidf_np_matrix[0]
print(cosine_similarities)

[1.         0.03754444 0.03317312 0.04612628 0.04575569 0.03908868
 0.03383963 0.03383494 0.03900583 0.02513445 0.0371635  0.03603262
 0.01446425 0.07756219 0.05841718 0.02719081 0.03725852 0.04665421
 0.03042812 0.06522641 0.04189969 0.06516568 0.0230356  0.03585474
 0.03087603 0.03518727 0.03575997 0.06201172 0.06477555 0.05001014
 0.02189626 0.05065384 0.01795454 0.04013124 0.01867754 0.0385937
 0.04225348 0.01959301 0.02721979 0.03934944 0.05639278 0.02487399
 0.04991433 0.03390521 0.03416162 0.05596792 0.17874587 0.03112181
 0.03548009 0.02942349 0.01293063 0.02070731 0.0571567  0.01929568
 0.03046805 0.04425537 0.05810811 0.0388945  0.050312   0.04561416
 0.02882914 0.03768109 0.01611922 0.0341089  0.02145204 0.04117732
 0.18844803 0.04700643 0.05249362 0.02849865 0.01841594 0.03044352
 0.04889931 0.05076678 0.02097056 0.02790177 0.01281449 0.03490324
 0.03886317 0.03259585 0.04275095 0.02122137 0.02332675 0.091632
 0.05173754 0.03523042 0.01926647 0.02941115 0.07022835 0.0208742

In [129]:
most_similar_index = np.argsort(cosine_similarities)[-2]
similarity = cosine_similarities[most_similar_index]

most_similar_post = dfjobs['body'][most_similar_index]
print(f"The following post has a cosine similarity of {similarity:.2f} "
       "with jobs[0]:\n")
print(most_similar_post)

The following post has a cosine similarity of 0.22 with jobs[0]:

Applied Data Ethics Fellow - San Francisco, CA 94117
Applied Data Ethics Fellow
University of San Francisco

R0002362
Downtown Campus

Job Title:
Applied Data Ethics Fellow

Job Summary:
The Data Institute at the University of San Francisco is seeking applicants for an Applied Data Ethics Fellow. This role will be part of the Center for Applied Data Ethics (CADE), housed within the Data Institute at USF. The appointment will be for a minimum of 1 year, with the opportunity to extend up to 1 additional year, dependent upon funding.

This position is open to those with a wide range of backgrounds. We welcome applicants from any discipline (including, but not limited to computer science, statistics, law, social sciences, history, media studies, political science, public policy, business, etc.). We are looking for people who have shown deep interest and expertise in areas related to data ethics, including disinformation, sur

This is the most interesting thing, we can change the similarity variable to manage
the number of posts that we want to manage and save in the new datafram

In [147]:
similarity = 0.05 # greater than 10% of similarity to get the post
most_similar_post = [ind for ind, x in enumerate(cosine_similarities) if cosine_similarities[ind] > similarity]

In [148]:
dfjobs.iloc[most_similar_post] # these are the most similar post

Unnamed: 0,title,body,bullets
0,Data Scientist,"Skills\nPython, Pandas, machine learning, natu...","(Skills Python, Pandas, machine learning, natu..."
12,"Data Scientist, Natural Language Processing (N...","Data Scientist, Natural Language Processing (N...","(MSc or PhD in Statistics, Physics, Engineerin..."
13,"Data Scientist - San Diego, CA","Data Scientist - San Diego, CA\nJob Title: Dat...",(Work with stakeholders throughout the organiz...
18,"Data Scientist - Fort Lauderdale, FL","Data Scientist - Fort Lauderdale, FL\nOverview...","(Perform exploratory data analysis, generate a..."
20,"Data Scientist III - Pasadena, CA 91101","Data Scientist III - Pasadena, CA 91101\nMust ...","(Must be a Green Card Holder, Offer contingent..."
...,...,...,...
767,Data Scientist Intern - Pricing Strategy and A...,Data Scientist Intern - Pricing Strategy and A...,(Apply statistical methods to analyze the effe...
769,Data & Tableau Reporting Analyst - Santa Clara...,Data & Tableau Reporting Analyst - Santa Clara...,(4-6 years of recent experience in a business ...
771,"Senior Data Scientist, Education - Redwood Cit...","Senior Data Scientist, Education - Redwood Cit...","(Leverage data to understand product, identify..."
772,"Data Scientist (PhD) - Intern - Spring, TX","Data Scientist (PhD) - Intern - Spring, TX\nJo...","(Use data visualization, statistical analysis,..."


In [150]:
dfjobs.iloc[most_similar_post].to_pickle('step2_df.pk')