<a href="https://colab.research.google.com/github/just1nt1me/nlp_job_matching/blob/master/exploration_colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# clone git repo (for google colab only)
! git clone https://github.com/just1nt1me/nlp_job_matching.git

Cloning into 'nlp_job_matching'...
remote: Enumerating objects: 16, done.
remote: Counting objects: 100% (16/16), done.
remote: Compressing objects: 100% (13/13), done.
remote: Total 16 (delta 5), reused 12 (delta 1), pack-reused 0
Unpacking objects: 100% (16/16), 162.57 KiB | 4.39 MiB/s, done.


# Import Libraries

In [3]:
! pip install pypdf

Collecting pypdf
  Downloading pypdf-3.12.0-py3-none-any.whl (254 kB)
Installing collected packages: pypdf
Successfully installed pypdf-3.12.0


In [20]:
# calculations
import numpy as np
import pandas as pd

# visuals
import matplotlib.pyplot as plt

# parsing
from pypdf import PdfReader

# tokenizing
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
import string
from nltk.stem.porter import PorterStemmer

# vectorizing
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models import Word2Vec

# similarity score
from sklearn.metrics.pairwise import cosine_similarity

# utilities
import re
import warnings
warnings.filterwarnings('ignore')

# classification, evaluation and plotting helpers
from sklearn.naive_bayes import MultinomialNB
from sklearn.multiclass import OneVsRestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
from sklearn.metrics import accuracy_score
from pandas.plotting import scatter_matrix

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


# Dataset

In [10]:
df = pd.read_csv('nlp_job_matching/job_data.csv')[['Job Description']]
df = df.sample(1000, ignore_index=True, random_state=22)
df.shape

(1000, 1)

In [11]:
df.head()

Unnamed: 0,Job Description
0,"<div id=""jobDescriptionText"" class=""jobsearch-..."
1,"<div id=""jobDescriptionText"" class=""jobsearch-..."
2,"<div id=""jobDescriptionText"" class=""jobsearch-..."
3,"<div id=""jobDescriptionText"" class=""jobsearch-..."
4,"<div id=""jobDescriptionText"" class=""jobsearch-..."


# Parse Job Listings

In [12]:
# function to clean raw text (used for both job listings and the resume)
def clean_text(resumeText):
    resumeText = re.sub(r'<[^>]+>', '', resumeText)     # remove html tags
    resumeText = re.sub(r'http\S+\s*', ' ', resumeText)  # remove URLs
    # remove RT and cc; the pattern has no word boundaries, so it also strips
    # "cc" inside words ("account" ends up as "ount", "success" as "su ess")
    resumeText = re.sub(r'RT|cc', ' ', resumeText)
    resumeText = re.sub(r'#\S+', '', resumeText)  # remove hashtags
    resumeText = re.sub(r'@\S+', '  ', resumeText)  # remove mentions
    resumeText = re.sub('[%s]' % re.escape(r"""!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~"""), ' ', resumeText)  # remove punctuation
    resumeText = re.sub(r'[^\x00-\x7f]', ' ', resumeText)  # replace non-ASCII characters
    resumeText = re.sub(r'\s+', ' ', resumeText)  # collapse extra whitespace
    # split words fused at a non-uppercase-to-uppercase boundary ("LaudeContact")
    resumeText = re.sub(r'(\w)(?<![A-Z])([A-Z])(?![A-Z])', r'\1 \2', resumeText)
    resumeText = resumeText.lower()  # lowercase everything
    words = resumeText.split(' ')
    words = [word for word in words if len(word) > 1]  # drop single-character tokens
    resumeText = ' '.join(words)
    return resumeText
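
The `RT|cc` substitution in `clean_text` matches those letters anywhere in the text, which is why tokens such as `ount` and `su ess` appear in the cleaned output. As a minimal sketch (not the notebook's method), the hypothetical helper `strip_rt_cc_whole_words` below restricts the match to standalone tokens with `\b` word boundaries. Both function names are illustrative assumptions.

```python
import re

def strip_rt_cc_anywhere(text):
    # Same pattern as in clean_text: matches "RT"/"cc" even inside words
    return re.sub(r'RT|cc', ' ', text)

def strip_rt_cc_whole_words(text):
    # Hypothetical variant: \b limits the match to standalone "RT"/"cc" tokens
    return re.sub(r'\b(?:RT|cc)\b', ' ', text)

sample = "account executive cc success"
print(strip_rt_cc_anywhere(sample).split())
print(strip_rt_cc_whole_words(sample).split())
```

Swapping this pattern into `clean_text` would change every downstream token and score, so the recorded outputs in this notebook would need to be regenerated.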

In [13]:
# clean job description and add each to dataframe
df['Clean Job Description'] = df['Job Description'].apply(lambda x: clean_text(x))
df['Clean Job Description'][0]

'ount executive job number 50955730 description spread your wings we are the duck we inspire and are inspired listen and respond empower our people give back to our community and most importantly celebrate every su ess along the way we do it all the aflac way aflac fortune 500 company is an industry leader in voluntary insurance products that pay cash directly to policyholders and one of america best known brands aflac has been recognized by fortune magazine as one of the 100 best companies to work for in america for 20 consecutive years one of the best workplaces for millennials in 2015 the inaugural year of the award and one of america most admired companies for 18 years our business is about being there for people in need so ask yourself are you the duck if so there home and flourishing career for you at aflac the company aflac the location atlanta columbus ga the division communicorp the opportunity ount executive principal duties amp responsibilities provides ongoing sales and ser

# Parse Resume

In [15]:
# convert resume PDF into string (only the first page is read)
reader = PdfReader("nlp_job_matching/resume.pdf")
number_of_pages = len(reader.pages)
page = reader.pages[0]
pdf_text = page.extract_text()
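
The cell above reads only `reader.pages[0]`. If a resume spans several pages, the text of every page can be joined instead. The helper name `extract_all_text` is an assumption for illustration; it relies only on the `pages` attribute and the `extract_text()` method already used above.

```python
def extract_all_text(reader):
    """Join the extracted text of every page of a pypdf-style reader."""
    # `or ""` guards against a page that yields no text (defensive assumption)
    return "\n".join(page.extract_text() or "" for page in reader.pages)
```

With the objects from this notebook, this would be called as `pdf_text = extract_all_text(reader)`.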

In [16]:
pdf_text

'Justin Carville\nData Scientist\nAbout Me\nProfessional Experience"Every day is a new adventure." This philosophy brought\nme to Japan in 2017, where I have since leveraged my\nlove of languages to make a living. My experience\nworking in marketing and business operations got me\nexcited about coding and data, so I changed gears and\nam now on a mission to become fluent in this new field.\nTechnical Skills\nPython\nScikit-Learn\nMachine\nLearning\nNLP\nDeep Learning\nLanguages\nEnglish : native\nJapanese : business\nSpanish : conversational\nEducation\nLe Wagon - Tokyo (2023)\n#1 ranked bootcamp worldwide\n9-week intensive data science\nbootcamp\nKICL - Kyoto (2017-2019)\nJapanese language school\nPassed JLPT N2\nUniversity of Rhode Island (2010-2013)\nBachelor of Arts in Spanish, Journalism\n Graduated Magna Cum LaudeContact Info\njccarville@gmail.comTokyo, Japan\nwww.linkedin.com/in/jccarville/\nhttps://github.com/just1nt1me\nLink Academy (2019-2023)\nFreelance Writer, Editor, Trans

In [17]:
# clean resume text
resume = clean_text(pdf_text)
resume

'justin carville data scientist about me professional experience every day is new adventure this philosophy brought me to japan in 2017 where have since leveraged my love of languages to make living my experience working in marketing and business operations got me excited about coding and data so changed gears and am now on mission to become fluent in this new field technical skills python scikit learn machine learning nlp deep learning languages english native japanese business spanish conversational education le wagon tokyo 2023 ranked bootcamp worldwide week intensive data science bootcamp kicl kyoto 2017 2019 japanese language school passed jlpt n2 university of rhode island 2010 2013 bachelor of arts in spanish journalism graduated magna cum laude contact info arville japan www linkedin com in arville link academy 2019 2023 freelance writer editor translator 2018 2023 vipkid esl teacher 2017 2019 we love osaka link to articles sns video content creation for you tube instagram kpi 

# Tokenizing Texts

In [21]:
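# function to tokenize text: drop stopwords and punctuation, then stem each word
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))  # build the set once for fast lookups

def tokenize(df, column):
    def stem_text(text):
        return ' '.join(stemmer.stem(word) for word in text.split()
                        if word not in stop_words and word not in string.punctuation)
    # assign the whole column at once instead of chained per-cell assignment
    df[column] = df[column].apply(stem_text)
    return df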

In [22]:
tokenize(df, 'Clean Job Description')

Unnamed: 0,Job Description,Clean Job Description
0,"<div id=""jobDescriptionText"" class=""jobsearch-...",ount execut job number 50955730 descript sprea...
1,"<div id=""jobDescriptionText"" class=""jobsearch-...",primari purpos work part product manag team en...
2,"<div id=""jobDescriptionText"" class=""jobsearch-...",descript hire enterpris sale develop repres re...
3,"<div id=""jobDescriptionText"" class=""jobsearch-...",vice presid ad sale market nbc olymp respons w...
4,"<div id=""jobDescriptionText"" class=""jobsearch-...",gener summari senior compens analyst key partn...
...,...,...
995,"<div id=""jobDescriptionText"" class=""jobsearch-...",want help peopl feel better want work top rate...
996,"<div id=""jobDescriptionText"" class=""jobsearch-...",job open id 00315276 logist manag open job tit...
997,"<div id=""jobDescriptionText"" class=""jobsearch-...",morningstar busi develop team seek highli moti...
998,"<div id=""jobDescriptionText"" class=""jobsearch-...",schult compani seek task forc director sale jo...


In [23]:
#tokenize resume
res = resume
res = res.split()
res = [stemmer.stem(word) for word in res if word not in stopwords.words('english') and word not in string.punctuation]
tokenized_resume = ' '.join(res)
df['Resume'] = tokenized_resume

In [24]:
df.head()

Unnamed: 0,Job Description,Clean Job Description,Resume
0,"<div id=""jobDescriptionText"" class=""jobsearch-...",ount execut job number 50955730 descript sprea...,justin carvil data scientist profession experi...
1,"<div id=""jobDescriptionText"" class=""jobsearch-...",primari purpos work part product manag team en...,justin carvil data scientist profession experi...
2,"<div id=""jobDescriptionText"" class=""jobsearch-...",descript hire enterpris sale develop repres re...,justin carvil data scientist profession experi...
3,"<div id=""jobDescriptionText"" class=""jobsearch-...",vice presid ad sale market nbc olymp respons w...,justin carvil data scientist profession experi...
4,"<div id=""jobDescriptionText"" class=""jobsearch-...",gener summari senior compens analyst key partn...,justin carvil data scientist profession experi...


# Feature Engineering

## Number of Words in Job Description / Resume

In [25]:
df['JD_num_words'] = df['Clean Job Description'].apply(lambda x: len(x.split(' ')))
df['Resume_num_words'] = df['Resume'].apply(lambda x: len(x.split(' ')))
df.head()

Unnamed: 0,Job Description,Clean Job Description,Resume,JD_num_words,Resume_num_words
0,"<div id=""jobDescriptionText"" class=""jobsearch-...",ount execut job number 50955730 descript sprea...,justin carvil data scientist profession experi...,458,235
1,"<div id=""jobDescriptionText"" class=""jobsearch-...",primari purpos work part product manag team en...,justin carvil data scientist profession experi...,590,235
2,"<div id=""jobDescriptionText"" class=""jobsearch-...",descript hire enterpris sale develop repres re...,justin carvil data scientist profession experi...,498,235
3,"<div id=""jobDescriptionText"" class=""jobsearch-...",vice presid ad sale market nbc olymp respons w...,justin carvil data scientist profession experi...,430,235
4,"<div id=""jobDescriptionText"" class=""jobsearch-...",gener summari senior compens analyst key partn...,justin carvil data scientist profession experi...,351,235


## Number of Words in Common

In [26]:
def normalized_words_common(row):
    jd = set(map(lambda word: word.lower().strip(),row['Clean Job Description'].split(' ')))
    rez = set(map(lambda word: word.lower().strip(),row['Resume'].split(' ')))
    return 1.0 * len(jd & rez)
df['word_common'] = df.apply(normalized_words_common,axis = 1)
df.head()

Unnamed: 0,Job Description,Clean Job Description,Resume,JD_num_words,Resume_num_words,word_common
0,"<div id=""jobDescriptionText"" class=""jobsearch-...",ount execut job number 50955730 descript sprea...,justin carvil data scientist profession experi...,458,235,19.0
1,"<div id=""jobDescriptionText"" class=""jobsearch-...",primari purpos work part product manag team en...,justin carvil data scientist profession experi...,590,235,25.0
2,"<div id=""jobDescriptionText"" class=""jobsearch-...",descript hire enterpris sale develop repres re...,justin carvil data scientist profession experi...,498,235,33.0
3,"<div id=""jobDescriptionText"" class=""jobsearch-...",vice presid ad sale market nbc olymp respons w...,justin carvil data scientist profession experi...,430,235,14.0
4,"<div id=""jobDescriptionText"" class=""jobsearch-...",gener summari senior compens analyst key partn...,justin carvil data scientist profession experi...,351,235,19.0


## Number of Words in Total

In [27]:
def normalized_words_total(row):
    jd = set(map(lambda word: word.lower().strip(),row['Clean Job Description'].split(' ')))
    rez = set(map(lambda word: word.lower().strip(),row['Resume'].split(' ')))
    return 1.0 * (len(jd) + len(rez))
df['word_total'] = df.apply(normalized_words_total,axis = 1)
df.head()

Unnamed: 0,Job Description,Clean Job Description,Resume,JD_num_words,Resume_num_words,word_common,word_total
0,"<div id=""jobDescriptionText"" class=""jobsearch-...",ount execut job number 50955730 descript sprea...,justin carvil data scientist profession experi...,458,235,19.0,446.0
1,"<div id=""jobDescriptionText"" class=""jobsearch-...",primari purpos work part product manag team en...,justin carvil data scientist profession experi...,590,235,25.0,537.0
2,"<div id=""jobDescriptionText"" class=""jobsearch-...",descript hire enterpris sale develop repres re...,justin carvil data scientist profession experi...,498,235,33.0,469.0
3,"<div id=""jobDescriptionText"" class=""jobsearch-...",vice presid ad sale market nbc olymp respons w...,justin carvil data scientist profession experi...,430,235,14.0,437.0
4,"<div id=""jobDescriptionText"" class=""jobsearch-...",gener summari senior compens analyst key partn...,justin carvil data scientist profession experi...,351,235,19.0,388.0


## Percentage of Shared Words

In [28]:
def normalized_words_share(row):
    jd = set(map(lambda word: word.lower().strip(),row['Clean Job Description'].split(' ')))
    rez = set(map(lambda word: word.lower().strip(),row['Resume'].split(' ')))
    return 1.0 * len(jd & rez)/(len(jd) + len(rez))
df['word_share'] = df.apply(normalized_words_share,axis = 1)
df.head()

Unnamed: 0,Job Description,Clean Job Description,Resume,JD_num_words,Resume_num_words,word_common,word_total,word_share
0,"<div id=""jobDescriptionText"" class=""jobsearch-...",ount execut job number 50955730 descript sprea...,justin carvil data scientist profession experi...,458,235,19.0,446.0,0.042601
1,"<div id=""jobDescriptionText"" class=""jobsearch-...",primari purpos work part product manag team en...,justin carvil data scientist profession experi...,590,235,25.0,537.0,0.046555
2,"<div id=""jobDescriptionText"" class=""jobsearch-...",descript hire enterpris sale develop repres re...,justin carvil data scientist profession experi...,498,235,33.0,469.0,0.070362
3,"<div id=""jobDescriptionText"" class=""jobsearch-...",vice presid ad sale market nbc olymp respons w...,justin carvil data scientist profession experi...,430,235,14.0,437.0,0.032037
4,"<div id=""jobDescriptionText"" class=""jobsearch-...",gener summari senior compens analyst key partn...,justin carvil data scientist profession experi...,351,235,19.0,388.0,0.048969
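
The three overlap features above can be computed in one pass. Note that `word_share` divides by the sum of the two unique-word counts, so shared words are counted twice in the denominator and the value cannot exceed 0.5. The sketch below reproduces those features and, as a clearly swapped-in alternative, adds the Jaccard index (intersection over union), which reaches 1.0 for identical vocabularies. The function name `word_overlap_features` and the use of `split()` rather than `split(' ')` are assumptions.

```python
def word_overlap_features(jd_text, resume_text):
    """Return overlap features for two whitespace-tokenized texts."""
    jd = set(jd_text.lower().split())
    rez = set(resume_text.lower().split())
    common = len(jd & rez)        # unique words present in both texts
    total = len(jd) + len(rez)    # sum of unique-word counts (as in word_total)
    union = len(jd | rez)         # unique words present in either text
    return {
        'word_common': common,
        'word_total': total,
        'word_share': common / total if total else 0.0,
        'jaccard': common / union if union else 0.0,  # not used in the notebook
    }

features = word_overlap_features("python data sale", "python data scientist nlp")
print(features)
```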


## Fuzzy Wuzzy

# Word2Vec - Embeddings

In [None]:
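# Train Word2Vec model
# Note: Word2Vec expects an iterable of token lists. Each row here is a single
# string, so it is iterated character by character and the learned vocabulary
# consists of characters rather than words.
embedding_dim = 100
model = Word2Vec(sentences=df['Clean Job Description'], vector_size=embedding_dim, min_count=1)

In [None]:
# Look up an embedding for each element of the texts
# (iterating over a string yields characters, matching the vocabulary above)
embeddings1 = [model.wv[word] for word in df['Clean Job Description'][0] if word in model.wv]
embeddings2 = [model.wv[word] for word in df['Resume'][0] if word in model.wv]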

# Calculate document vectors by taking the mean of word embeddings
doc_vector1 = np.mean(embeddings1, axis=0)
doc_vector2 = np.mean(embeddings2, axis=0)

In [None]:
doc_vector1.shape, doc_vector2.shape

((100,), (100,))
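
Iterating over a Python string yields characters, so `for word in text` walks through letters unless the text is split first. The sketch below shows the difference and a word-level version of the mean-pooling step. It uses a toy dictionary, `toy_vectors`, in place of a trained model, and the helper name `document_vector` is hypothetical.

```python
def document_vector(text, vectors):
    """Average the vectors of the whitespace-separated tokens found in `vectors`."""
    found = [vectors[token] for token in text.split() if token in vectors]
    if not found:
        return None  # no known tokens, so no document vector
    dim = len(found[0])
    return [sum(vec[i] for vec in found) / len(found) for i in range(dim)]

# Toy 2-dimensional "embeddings" standing in for model.wv (assumption)
toy_vectors = {'data': [1.0, 0.0], 'scientist': [0.0, 1.0]}

doc = "data scientist python"
print(list(doc)[:4])                       # characters
print(doc.split())                         # word tokens
print(document_vector(doc, toy_vectors))   # mean of the two known word vectors
```

For a word-level Word2Vec model, the training input would likewise be a list of token lists, for example `df['Clean Job Description'].str.split()`.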

## Cosine Similarity Score

In [None]:
similarity_score = cosine_similarity([doc_vector1], [doc_vector2])[0, 0]

In [None]:
similarity_score

0.9311392
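
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures direction only and ignores magnitude. A minimal standard-library sketch of the same quantity that `cosine_similarity` returns for a single pair follows; the function name `cosine_sim` and the zero-vector convention are assumptions.

```python
import math

def cosine_sim(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: a zero vector is treated as dissimilar
    return dot / (norm_u * norm_v)

print(cosine_sim([1, 0], [0, 1]))   # orthogonal vectors
print(cosine_sim([1, 2], [2, 4]))   # same direction, different length
```

Because averaged vectors of long texts tend to point in similar directions, uniformly high scores such as the one above may say little about how well a job matches the resume.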

In [None]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Create an empty list to store similarity scores
similarity_scores = []

# Loop through the dataframe
for index, row in df.iterrows():
    # Generate word embeddings for each word/token in the texts
    embeddings1 = [model.wv[word] for word in row['Clean Job Description'] if word in model.wv]
    embeddings2 = [model.wv[word] for word in row['Resume'] if word in model.wv]

    # Calculate document vectors by taking the mean of word embeddings
    doc_vector1 = np.mean(embeddings1, axis=0)
    doc_vector2 = np.mean(embeddings2, axis=0)

    # Calculate similarity score using cosine similarity
    similarity_score = cosine_similarity([doc_vector1], [doc_vector2])[0, 0]

    # Append the similarity score to the list
    similarity_scores.append(similarity_score)

# Add the similarity scores as a new column in the dataframe
df['Similarity Score'] = similarity_scores

In [None]:
df.head()

Unnamed: 0,Job Description,Clean Job Description,Resume,JD_num_words,Resume_num_words,word_common,word_total,word_share,Similarity Score
0,"<div id=""jobDescriptionText"" class=""jobsearch-...",ount execut job number 50955730 descript sprea...,justin carvil data scientist profession experi...,458,235,19.0,446.0,0.042601,0.931139
1,"<div id=""jobDescriptionText"" class=""jobsearch-...",primari purpos work part product manag team en...,justin carvil data scientist profession experi...,590,235,25.0,537.0,0.046555,0.929776
2,"<div id=""jobDescriptionText"" class=""jobsearch-...",descript hire enterpris sale develop repres re...,justin carvil data scientist profession experi...,498,235,33.0,469.0,0.070362,0.950274
3,"<div id=""jobDescriptionText"" class=""jobsearch-...",vice presid ad sale market nbc olymp respons w...,justin carvil data scientist profession experi...,430,235,14.0,437.0,0.032037,0.931017
4,"<div id=""jobDescriptionText"" class=""jobsearch-...",gener summari senior compens analyst key partn...,justin carvil data scientist profession experi...,351,235,19.0,388.0,0.048969,0.924335


# Universal Sentence Encoder - Sentence Embeddings

## Prepare subset of data

In [30]:
small_df = df.sample(20, ignore_index=True, random_state=22)[['Clean Job Description', 'Resume']]
small_df.head(3)

Unnamed: 0,Clean Job Description,Resume
0,gener sale produc sale gain provid custom serv...,justin carvil data scientist profession experi...
1,design review mechan system includ motion devi...,justin carvil data scientist profession experi...
2,gener sale produc sale gain provid custom serv...,justin carvil data scientist profession experi...


## Import Dependencies / Load USE Model

In [38]:
from absl import logging

import tensorflow as tf

import tensorflow_hub as hub
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import re
import seaborn as sns

module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
model = hub.load(module_url)
print ("module %s loaded" % module_url)

module https://tfhub.dev/google/universal-sentence-encoder/4 loaded


In [39]:
def embed(texts):
    return model(texts)

## Make Embeddings

In [64]:
# embed each job description; assigning the full column avoids chained .iloc writes
small_df['Job Embedding'] = [embed([text]).numpy()[0] for text in small_df['Clean Job Description']]

In [65]:
small_df['Job Embedding']

0     [-0.06052517, -0.063466616, 0.04240349, 0.0628...
1     [-0.053626165, -0.055753887, 0.05094826, 0.055...
2     [-0.06052517, -0.063466616, 0.04240349, 0.0628...
3     [0.045459274, -0.045461148, -0.011110358, -0.0...
4     [-0.040605675, -0.051936015, -0.022556204, 0.0...
5     [-0.0052114776, -0.054965567, -0.015351024, 0....
6     [-0.026673786, -0.052869562, -0.05167172, 0.05...
7     [-0.06401517, -0.06491341, -0.0037403295, 0.05...
8     [-0.04891549, -0.049729, -0.026901186, 0.04840...
9     [-0.052986484, -0.05735398, -0.038213767, 0.05...
10    [0.04387696, -0.045553118, -0.039802235, 0.017...
11    [-0.04618722, -0.047041614, -0.046949744, 0.04...
12    [-0.048059396, -0.053109597, -0.052007586, 0.0...
13    [0.020050647, -0.04629718, 0.013052736, 0.0010...
14    [-0.046116848, -0.04678524, 0.029229747, 0.030...
15    [-0.056009214, -0.058064878, -0.043980166, 0.0...
16    [-0.030130783, -0.05805259, -0.024351109, 0.05...
17    [-0.065015, -0.06602504, -0.023950208, 0.0

In [68]:
# the column must be created by assignment; writing to a missing column via .iloc fails
small_df['Job Embed Size'] = small_df['Job Embedding'].apply(len)

In [69]:
small_df.head()

Unnamed: 0,Clean Job Description,Resume,Job Embedding,Job Embed Size
0,gener sale produc sale gain provid custom serv...,justin carvil data scientist profession experi...,"[-0.06052517, -0.063466616, 0.04240349, 0.0628...",512
1,design review mechan system includ motion devi...,justin carvil data scientist profession experi...,"[-0.053626165, -0.055753887, 0.05094826, 0.055...",512
2,gener sale produc sale gain provid custom serv...,justin carvil data scientist profession experi...,"[-0.06052517, -0.063466616, 0.04240349, 0.0628...",512
3,erp master data manag specialist consult readi...,justin carvil data scientist profession experi...,"[0.045459274, -0.045461148, -0.011110358, -0.0...",512
4,descript posit summari servic administr provid...,justin carvil data scientist profession experi...,"[-0.040605675, -0.051936015, -0.022556204, 0.0...",512


In [75]:
# embed the resume text of each row and record the embedding length
small_df['Resume Embedding'] = [embed([text]).numpy()[0] for text in small_df['Resume']]
small_df['Resume Embed Size'] = small_df['Resume Embedding'].apply(len)
small_df.head()

Unnamed: 0,Clean Job Description,Resume,Job Embedding,Job Embed Size,Resume Embedding,Resume Embed Size
0,gener sale produc sale gain provid custom serv...,justin carvil data scientist profession experi...,"[-0.06052517, -0.063466616, 0.04240349, 0.0628...",512,"[-0.049488615, -0.052642465, 0.008287607, 0.03...",512
1,design review mechan system includ motion devi...,justin carvil data scientist profession experi...,"[-0.053626165, -0.055753887, 0.05094826, 0.055...",512,"[-0.049488615, -0.052642465, 0.008287607, 0.03...",512
2,gener sale produc sale gain provid custom serv...,justin carvil data scientist profession experi...,"[-0.06052517, -0.063466616, 0.04240349, 0.0628...",512,"[-0.049488615, -0.052642465, 0.008287607, 0.03...",512
3,erp master data manag specialist consult readi...,justin carvil data scientist profession experi...,"[0.045459274, -0.045461148, -0.011110358, -0.0...",512,"[-0.049488615, -0.052642465, 0.008287607, 0.03...",512
4,descript posit summari servic administr provid...,justin carvil data scientist profession experi...,"[-0.040605675, -0.051936015, -0.022556204, 0.0...",512,"[-0.049488615, -0.052642465, 0.008287607, 0.03...",512


## Get Similarity Score

In [82]:
from scipy import spatial
cos_sim = 1 - spatial.distance.cosine(small_df['Job Embedding'][0], small_df['Resume Embedding'][0])
cos_sim

0.4067569077014923

In [84]:
# Create an empty list to store similarity scores
similarity_scores = []

for index, row in small_df.iterrows():
    # cosine similarity = 1 - cosine distance
    cos_sim = 1 - spatial.distance.cosine(row['Job Embedding'], row['Resume Embedding'])

    # Append the similarity score to the list
    similarity_scores.append(cos_sim)

# Add the similarity scores as a new column in the dataframe
small_df['Similarity Score'] = similarity_scores

In [86]:
small_df.sort_values(by='Similarity Score', ascending=False)

Unnamed: 0,Clean Job Description,Resume,Job Embedding,Job Embed Size,Resume Embedding,Resume Embed Size,similarity_score,Similarity Score
5,look dynam creativ senior technic instructor j...,justin carvil data scientist profession experi...,"[-0.0052114776, -0.054965567, -0.015351024, 0....",512,"[-0.049488615, -0.052642465, 0.008287607, 0.03...",512,,0.692963
4,descript posit summari servic administr provid...,justin carvil data scientist profession experi...,"[-0.040605675, -0.051936015, -0.022556204, 0.0...",512,"[-0.049488615, -0.052642465, 0.008287607, 0.03...",512,,0.561638
6,entri level softwar engin python awsjob summar...,justin carvil data scientist profession experi...,"[-0.026673786, -0.052869562, -0.05167172, 0.05...",512,"[-0.049488615, -0.052642465, 0.008287607, 0.03...",512,,0.532946
9,overview join sale team tpx role sale team ope...,justin carvil data scientist profession experi...,"[-0.052986484, -0.05735398, -0.038213767, 0.05...",512,"[-0.049488615, -0.052642465, 0.008287607, 0.03...",512,,0.512376
8,posit summari candid requir oper field servic ...,justin carvil data scientist profession experi...,"[-0.04891549, -0.049729, -0.026901186, 0.04840...",512,"[-0.049488615, -0.052642465, 0.008287607, 0.03...",512,,0.496537
16,erp applic consult aktion associ look skill er...,justin carvil data scientist profession experi...,"[-0.030130783, -0.05805259, -0.024351109, 0.05...",512,"[-0.049488615, -0.052642465, 0.008287607, 0.03...",512,,0.442134
12,wayfair profession help hundr thousand busi ma...,justin carvil data scientist profession experi...,"[-0.048059396, -0.053109597, -0.052007586, 0.0...",512,"[-0.049488615, -0.052642465, 0.008287607, 0.03...",512,,0.440933
19,deloitt servic lp includ intern support area m...,justin carvil data scientist profession experi...,"[0.011296743, -0.058612075, -0.009373135, 0.05...",512,"[-0.049488615, -0.052642465, 0.008287607, 0.03...",512,,0.440038
11,posit entri level posit j1 student welcom appl...,justin carvil data scientist profession experi...,"[-0.04618722, -0.047041614, -0.046949744, 0.04...",512,"[-0.049488615, -0.052642465, 0.008287607, 0.03...",512,,0.425539
15,1200 support technician posit snapshot midway ...,justin carvil data scientist profession experi...,"[-0.056009214, -0.058064878, -0.043980166, 0.0...",512,"[-0.049488615, -0.052642465, 0.008287607, 0.03...",512,,0.408954
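
Sorting by similarity, as in the cell above, amounts to ranking the job rows and reading off the best matches. A small sketch of that ranking step on a plain list of scores follows; the helper name `top_matches` and the toy scores are assumptions.

```python
def top_matches(scores, k=3):
    """Return (row_index, score) pairs for the k highest scores, best first."""
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

toy_scores = [0.41, 0.69, 0.53, 0.56]
print(top_matches(toy_scores, k=2))
```

On the dataframe itself, `small_df.nlargest(3, 'Similarity Score')` would give the same kind of top-k view.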


In [87]:
small_df['Clean Job Description'][5]

'look dynam creativ senior technic instructor join team help us deliv world class train custom partner nation worldwid role perform instruct deliveri approv cours materi teach multipl cours work within team also abl work independ also abl teach class virtual well splunk partner custom contract facil addit demonstr abil lead major department project effect work cross function member depart view leader mentor team member deliv multipl technic advanc cours respons teach advanc technic class regularli particip cross function project lead major project within depart handl multipl project task minim supervis consist meet exce goal project achiev commun effect verbal written manag team member depart anticip resolv common problem balanc long short term goal priorit activ activ particip curriculum plan session new product train review train curriculum provid feedback instruct design adapt well chang job project requir recommend implement solut project issu aris requir extens experi softwar tech