<h1>Part 2: Pre-Processing of Job Description Text</h1>

<h3>Import Packages</h3>

In [1]:
import csv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import regex as re

import os
import warnings
warnings.filterwarnings('ignore')

from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# If you haven't already done so, execute:
#import nltk
#nltk.download('punkt')

<h2>1. Load data from sqlite database</h2>

In [2]:
from sqlalchemy import create_engine, MetaData, Table

# Query SQL db to get analyst job descriptions first

# Create connection to db
engine = create_engine("sqlite:///joblist.sqlite")
print(engine.table_names())

# Load in data table
metadata = MetaData()
data = Table('data', metadata, autoload=True, autoload_with=engine)

print(data.columns.keys())
print(repr(metadata.tables['data']))

['data']
['jobtitle', 'company', 'location', 'salary', 'jobdescription', 'label']
Table('data', MetaData(bind=None), Column('jobtitle', VARCHAR(length=100), table=<data>), Column('company', VARCHAR(length=100), table=<data>), Column('location', VARCHAR(length=25), table=<data>), Column('salary', INTEGER(), table=<data>), Column('jobdescription', TEXT(), table=<data>), Column('label', INTEGER(), table=<data>), schema=None)


In [3]:
from sqlalchemy import select

# Build query
stmt = select([data.columns.jobdescription, data.columns.label])
stmt = stmt.where(data.columns.label == 0) # label is an INTEGER column; 0 = analysts

# Create connection to engine
connection = engine.connect()

# Execute query
results = connection.execute(stmt).fetchall()
print(results[0].keys())

['jobdescription', 'label']


In [4]:
# Count the job descriptions in the whole table (all labels, not only analysts)

from sqlalchemy import func

stmt_count = select([func.count(data.columns.jobdescription)])
results_count = connection.execute(stmt_count).scalar()
print(results_count)

625


In [5]:
# Create dataframe from SQLAlchemy ResultSet
df_data = pd.DataFrame(results)

# Give columns proper heading
df_data.columns = results[0].keys()
print(df_data.head())

                                      jobdescription  label
0  Position Title:Pricing Analyst Position Type: ...      0
1  Title: Senior Data Analyst - Telephony Manager...      0
2  We are looking for a talented Fuel Cell Data E...      0
3  CAREER OPPORTUNITY SENIOR METER DATA ANALYST L...      0
4  The Data Engineer reports directly to the Dire...      0


In [6]:
df_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 450 entries, 0 to 449
Data columns (total 2 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   jobdescription  450 non-null    object
 1   label           450 non-null    int64 
dtypes: int64(1), object(1)
memory usage: 7.2+ KB


In [7]:
df_data['jobdescription'] = df_data['jobdescription'].astype('string')
df_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 450 entries, 0 to 449
Data columns (total 2 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   jobdescription  450 non-null    string
 1   label           450 non-null    int64 
dtypes: int64(1), string(1)
memory usage: 7.2 KB


<h2>Text Analysis of Job Description</h2>

<h3>Pre-processing Pipeline</h3>

For each job description, we will tokenize, stem, remove stop words, and lowercase each word.

The following example walks through these steps on a single job description.

In [8]:
# Select an example job description from df_data

text = df_data['jobdescription'][7]
#text

In [9]:
# Tokenize example text
# Maybe add len(token) to dataframe and plot

text_token = word_tokenize(str(text))
print(len(text_token))

471


In [10]:
print(text_token[-40:])

['(', 'Preferred', ')', 'Work', 'remotely', ':', 'Temporarily', 'due', 'to', 'COVID-19COVID-19', 'precaution', '(', 's', ')', ':', 'Remote', 'interview', 'processPersonal', 'protective', 'equipment', 'provided', 'or', 'requiredPlastic', 'shield', 'at', 'work', 'stationsSocial', 'distancing', 'guidelines', 'in', 'placeVirtual', 'meetingsSanitizing', ',', 'disinfecting', ',', 'or', 'cleaning', 'procedures', 'in', 'place']


In [11]:
# Stem tokens
# PorterStemmer.stem() takes a single word as input, so stem each token in turn

stemmer = PorterStemmer()
text_stemmed = [stemmer.stem(w) for w in text_token]
print(text_stemmed[-40:])

['(', 'prefer', ')', 'work', 'remot', ':', 'temporarili', 'due', 'to', 'covid-19covid-19', 'precaut', '(', 's', ')', ':', 'remot', 'interview', 'processperson', 'protect', 'equip', 'provid', 'or', 'requiredplast', 'shield', 'at', 'work', 'stationssoci', 'distanc', 'guidelin', 'in', 'placevirtu', 'meetingssanit', ',', 'disinfect', ',', 'or', 'clean', 'procedur', 'in', 'place']


<h2>Text Pre-processing Pipeline</h2>

All the job postings currently available will be considered the "test" set and will not be split. Any predictions will be made on new postings scraped at a later date.

Here, we will try two different vectorizers: CountVectorizer and TfidfVectorizer.

The pre-processing pipeline will include the following steps:

- Tokenization
- Stopword removal
- Lower casing
- Stemming
- Apply transformation
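
The steps listed above can be sketched for a single document as follows. This is a minimal illustration rather than the notebook's final pipeline: the `preprocess` helper and the sample sentence are hypothetical, tokenization uses a simple alphabetic regular expression instead of `word_tokenize`, and stop words come from scikit-learn's built-in English list.

```python
import re

from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

stemmer = PorterStemmer()

def preprocess(text):
    """Tokenize, lowercase, drop stop words, and stem one document (hypothetical helper)."""
    # Tokenization: keep runs of letters only
    tokens = re.findall(r"[a-zA-Z]+", text)
    # Lower casing
    tokens = [t.lower() for t in tokens]
    # Stopword removal using scikit-learn's built-in English list
    tokens = [t for t in tokens if t not in ENGLISH_STOP_WORDS]
    # Stemming, one token at a time
    return [stemmer.stem(t) for t in tokens]

# Made-up sample sentence, not taken from the dataset
sample = "The analyst will gather and interpret data to improve processes."
tokens = preprocess(sample)
print(tokens)
```

The resulting token list can then be joined back into a string and passed to a vectorizer for the final transformation step.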

<h2>Count Vectorizer</h2>

The CountVectorizer class encodes each document as a vector of integer counts, one count per term in the vocabulary.

We will use CountVectorizer to identify skills that are commonly found in job descriptions. These skills should have a higher count across the corpus than (i) rarely used or company-specific skills and (ii) unimportant words.

Here, we will tokenize alphabetic words only, use the built-in English stop word list, build both unigrams and bigrams (ngram_range=(1,2)), and keep only terms that appear in at least 4 documents (min_df=4).
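
To make the effect of `min_df` and `ngram_range` concrete, here is a small sketch on a made-up three-document corpus (the documents and the `cv_demo` name are illustrative only). With `min_df=2`, any unigram or bigram that occurs in fewer than two documents is dropped from the vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration
docs = [
    "SQL and Python experience required",
    "Python experience preferred",
    "Excel experience required",
]

cv_demo = CountVectorizer(
    token_pattern=r"[a-zA-Z]+",  # alphabetic tokens only
    stop_words="english",        # built-in English stop word list
    lowercase=True,
    min_df=2,                    # keep terms found in at least 2 documents
    ngram_range=(1, 2),          # unigrams and bigrams
)
X_demo = cv_demo.fit_transform(docs)

# vocabulary_ maps each kept term to its column index
print(sorted(cv_demo.vocabulary_))
print(X_demo.toarray())
```

Terms such as "sql" and "excel" occur in only one document each, so they do not survive the `min_df=2` filter, while the bigram "python experience" occurs in two documents and is kept.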

In [21]:
cv = CountVectorizer(analyzer='word', token_pattern = r"[a-zA-Z]+", stop_words = 'english', lowercase = True, min_df=4, ngram_range=(1,2))
count_vector = cv.fit_transform(df_data['jobdescription'])

print(cv.get_feature_names()[:10])

['ab', 'abilities', 'abilities competencies', 'abilities excellent', 'abilities provide', 'ability', 'ability adapt', 'ability analyze', 'ability apply', 'ability build']


In [22]:
# Create dataframe out of CV sparse array
counts = pd.DataFrame(count_vector.toarray(), columns = cv.get_feature_names())
print(counts.head())

   ab  abilities  abilities competencies  abilities excellent  \
0   0          0                       0                    0   
1   0          1                       0                    0   
2   0          0                       0                    0   
3   0          0                       0                    0   
4   0          0                       0                    0   

   abilities provide  ability  ability adapt  ability analyze  ability apply  \
0                  0        8              0                0              2   
1                  0        2              0                0              0   
2                  0        0              0                0              0   
3                  0        1              0                0              0   
4                  0        2              0                0              0   

   ability build  ...  yoga  yonge  yonge street  york  young  young people  \
0              0  ...     0      0             0 

In [23]:
# calculate sparsity
sparsity = 1.0 - np.count_nonzero(counts) / counts.size
print(sparsity)

0.967948920699841
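
Since the count matrix is mostly zeros, corpus-wide term totals can be computed on the sparse matrix directly, without converting it to a dense DataFrame first. The sketch below ranks terms by total count on a made-up corpus; the documents and variable names are hypothetical, and the same idea would apply to `count_vector` and `cv` above.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration
docs = ["python sql python", "excel python", "sql tableau"]

cv_demo = CountVectorizer()
X = cv_demo.fit_transform(docs)

# Sum each column of the sparse matrix to get the total count per term
totals = np.asarray(X.sum(axis=0)).ravel()

# vocabulary_ maps term -> column index; order the terms by that index
terms = sorted(cv_demo.vocabulary_, key=cv_demo.vocabulary_.get)

top_terms = pd.Series(totals, index=terms).sort_values(ascending=False)
print(top_terms.head(10))
```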


In [18]:
#print(cv.stop_words_)

{'bellcanada com', 've steadily', 'evaluation management', 'cooperative environment', 'embracing new', 'application m', 'love ecommerce', 'previously undetermined', 'cross location', 'ambiguity bring', 'certification assetperson', 'consumer promotions', 'innovation participate', 'approvals uat', 'operating unit', 'inventory received', 'server access', 'reporting category', 'consumable formats', 'maps analyzing', 'phase lead', 'systems plan', 'effectively collaborating', 'recommendations assists', 'technologiesexperience detailed', 'aid development', 'amazing north', 'situations proficient', 'designing trade', 'delivering exceptional', 'resonates owners', 'calculations dynamic', 'capital secure', 'strive positively', 'explanation interpretation', 'fabricators', 'world tour', 'integrated environment', 'role profile', 'office g', 'undergrad work', 'discretion appropriate', 'triage support', 'documents crd', 'o lead', 'monthly rate', 'role evolve', 'includes media', 'independently larger',

<h2>TFIDF Vectorizer</h2>

TfidfVectorizer converts the job description text into a matrix of TF-IDF features. The TfidfVectorizer class from the sklearn.feature_extraction.text module performs multiple steps, including tokenization, stopword removal, and lower casing.
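
As a small check of what the vectorizer learns, the sketch below recomputes the IDF weights by hand on a made-up corpus. With scikit-learn's default settings (`smooth_idf=True`), the weight of a term t is idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing t. The corpus and variable names are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for illustration
docs = [
    "data analyst sql",
    "data analyst excel",
    "data engineer python",
    "data scientist python",
]

tv = TfidfVectorizer()
X = tv.fit_transform(docs)

n_docs = len(docs)
# Document frequency: number of documents with a nonzero entry for each term
doc_freq = np.asarray((X > 0).sum(axis=0)).ravel()

# Smoothed IDF, as used by TfidfVectorizer when smooth_idf=True (the default)
manual_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

print(dict(zip(sorted(tv.vocabulary_, key=tv.vocabulary_.get), manual_idf.round(4))))
print(np.allclose(manual_idf, tv.idf_))
```

A term that appears in every document (here "data") gets the minimum weight of 1, and the weight grows as the term appears in fewer documents.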

<h4>1. Text Normalization: Stemming Words</h4>

In [35]:
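porter = PorterStemmer()
# .apply() is a Series method, so call it on the column rather than on a single string,
# and pass the method itself (porter.stem) instead of the result of calling it.
# Note: stem() treats each description as one long word, so this mainly lowercases the text;
# tokenize first to stem word by word.
df_data['jobdescription'] = df_data['jobdescription'].apply(porter.stem)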

In [18]:
df_data['jobdescription'][7]

'canadian orthodontic partners has an exciting new opportunity, looking for a talented new analyst to join our standards team! reporting to the national manager of cop standards, the business analyst will gather and interpret data to develop actionable steps to improve processes, optimizing operational performance and acquisition due diligence and transitions.what we offer: competitive annual salary + quarterly bonuscomprehensive benefits package including : medical, dental, vision, and orthodontic coverage for employees and their families.paid vacation time.educational reimbursement program.real career growth opportunities.key responsibilities: lead the collection of acquisition due diligence data.collect all required initial dd information from selling doctor based on standard question listensure information is complete and prepared for decisions making by the team, highlighting risks or issues to expedite the decision making.complete a first pass of the dd analysis.as the project co

<h4>2. Initialize Stopword Removal & Lowercasing</h4>

In [12]:
# Initialize TFIDF Vectorizer
tvec = TfidfVectorizer(analyzer = 'word', token_pattern = r"[a-zA-Z]+", stop_words = 'english', lowercase= True)

<h4>3. Apply TF-IDF</h4>

In [13]:
# returns matrix of tf-idf features
tvec_token = tvec.fit_transform(df_data['jobdescription'])

In [16]:
tvec.get_feature_names()

['00',
 '000',
 '000026rp',
 '000ft',
 '000m3',
 '0090',
 '00am',
 '00pm',
 '01',
 '012',
 '013job',
 '0159',
 '01expected',
 '01job',
 '02',
 '0272',
 '02job',
 '03',
 '03job',
 '04',
 '04job',
 '05',
 '055',
 '0553',
 '05job',
 '06',
 '0661',
 '078',
 '08',
 '0815',
 '082',
 '083',
 '08expected',
 '09',
 '0g7',
 '10',
 '100',
 '1000',
 '100ft',
 '100mm',
 '100s',
 '10112323contract',
 '104',
 '1045',
 '1075',
 '1090',
 '10b',
 '10mb',
 '11',
 '110',
 '112',
 '113886',
 '114',
 '1145',
 '115',
 '116',
 '11685job',
 '119',
 '12',
 '120',
 '1200',
 '1236',
 '126',
 '128',
 '12th',
 '13',
 '130',
 '1300',
 '13021',
 '137',
 '13th',
 '14',
 '14001',
 '144',
 '15',
 '150',
 '1500',
 '153',
 '154',
 '155',
 '15611job',
 '1563',
 '15724',
 '15725contract',
 '15775contract',
 '15job',
 '16',
 '160367',
 '161182',
 '161615',
 '1654',
 '16607',
 '16764',
 '16job',
 '17',
 '170',
 '172',
 '173',
 '176',
 '178',
 '18',
 '180',
 '1830s',
 '1846',
 '1871',
 '1881',
 '189',
 '19',
 '1914',
 '1918',


In [None]:
tvec_token.shape

In [23]:
#tvec_token.shape without token_pattern

(450, 10256)

In [None]:
tvec.idf_

In [34]:
# Observe the IDF weights learned by the vectorizer (one weight per term)

weights = dict(zip(tvec.get_feature_names(), tvec.idf_))
tfidf = pd.DataFrame.from_dict(weights, orient='index')
tfidf.columns = ['tfidf']

# Lowest weights (terms found in the most documents)
low_tfidf = tfidf.sort_values(by=['tfidf'], ascending=True).head(10)
print(low_tfidf)

               tfidf
data        1.008909
experience  1.064095
work        1.097752
business    1.105114
skills      1.115015
team        1.137658
management  1.242170
analysis    1.259265
years       1.259265
analyst     1.327642


In [35]:
# Highest weights (terms found in the fewest documents)
high_tfidf = tfidf.sort_values(by=['tfidf'], ascending=False).head(10)
print(high_tfidf)

                       tfidf
managementassist     6.41832
dell                 6.41832
deletion             6.41832
impeccably           6.41832
delicious            6.41832
softwarework         6.41832
softwarestrong       6.41832
deliverablescreate   6.41832
deliverablespresent  6.41832
softwares            6.41832


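Based on these results, ranking unigrams by these weights is not an effective way to find features relevant to data analyst jobs. The values shown are the vectorizer's IDF weights: the lowest belong to generic words that appear in nearly every posting (data, experience, work), while the highest, 6.41832, is the value the default smoothed IDF assigns to any term that occurs in only one of the 450 documents.

Some features are also two words fused together without a space (for example, softwarework and deliverablescreate), which points to missing whitespace in the job description text.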
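
One possible way to reduce fused words is sketched below, under the assumption that the fusion happens where one line ended and the next began with a capital letter, as in the tokens 'processPersonal' and 'requiredPlastic' seen earlier. The `split_fused_words` helper is hypothetical and is only a heuristic: it must run on the raw text before lowercasing, and it would also split legitimate mixed-case names such as "JavaScript".

```python
import re

def split_fused_words(text):
    """Insert a space wherever a lowercase letter is directly followed by an uppercase one."""
    return re.sub(r"([a-z])([A-Z])", r"\1 \2", text)

# Fragment taken from the tokenized example earlier in this notebook
raw = "Remote interview processPersonal protective equipment provided or requiredPlastic shield"
cleaned = split_fused_words(raw)
print(cleaned)
```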

<h3>Text Pre-processing Pipeline Function</h3>

See DSDJ Feature Engineering pt 2 for a train & test version of the Tfidf vectorizer function.

In [None]:
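# Define a function that cleans and performs a TFIDF transformation to text data
tfidf = TfidfVectorizer(stop_words='english', lowercase=True)
porter = PorterStemmer()

def stem_text(doc):
    # porter.stem() expects a single word, so tokenize and stem each token separately
    return ' '.join(porter.stem(w) for w in word_tokenize(str(doc)))

def tfidf_pipeline(txt):
    txt = txt.apply(stem_text)   # Apply stemming
    x = tfidf.fit_transform(txt) # Apply vectorizer, stopword removal, & lowercasing
    return x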

In [None]:
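# jobText is not defined in this notebook; df_data['jobdescription'] is assumed to be the intended input
jobtext_TFIDF = tfidf_pipeline(df_data['jobdescription'])

In [None]:
original = df_data.shape
preprocessed = jobtext_TFIDF.shape

print("Original raw data df shape: " + str(original))
print("Preprocessed data shape: " + str(preprocessed))

In [None]:
# The result is a SciPy sparse matrix, which has no .head(); view the first rows as a dense array
jobtext_TFIDF[:5].toarray()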

In [None]:
# word frequency : how many words per post.
# Text cleaning : lower casing, extra white-space removal, lemmatization

# Determine most common words that occur in the job descriptions. 
# Predetermine a list of expected lookup terms for dictionary of skills

# BOW - Create a list of dictionaries containing word counts for each job posting

# Table with skill, count, percentage

# Wordcloud
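
The planned table of skill, count, and percentage could be sketched roughly as follows. Everything here is illustrative: the `skills` lookup list and the sample postings are made up, and matching on whole words with a regular expression is only one possible design choice.

```python
import re

import pandas as pd

# Hypothetical lookup list of expected skills
skills = ["sql", "python", "excel", "tableau"]

# Made-up postings standing in for the job descriptions
docs = pd.Series([
    "Strong SQL and Excel skills",
    "Python and SQL required",
    "Excellent communication skills",
    "Tableau dashboards, SQL queries",
])

lowered = docs.str.lower()
rows = []
for skill in skills:
    # \b word boundaries prevent "excel" from matching inside "excellent"
    pattern = rf"\b{re.escape(skill)}\b"
    n = int(lowered.str.contains(pattern, regex=True).sum())
    rows.append({"skill": skill, "count": n, "percentage": round(100 * n / len(docs), 1)})

skill_table = (
    pd.DataFrame(rows)
    .sort_values("count", ascending=False)
    .reset_index(drop=True)
)
print(skill_table)
```

Here "count" is the number of postings that mention the skill at least once, so the percentage reads as the share of postings asking for it.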

In [None]:
# Word2Vec - words used in similar contexts end up close together in the vector space

# Topic modelling - where skills is considered a topic

# NER with BERT