<h1>Part 2: Pre-Processing of Job Description Text</h1>

<h3>Import Packages</h3>

In [1]:
#import csv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
#import regex as re

import os
import warnings
warnings.filterwarnings('ignore')

from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, ENGLISH_STOP_WORDS

# If you haven't already done so, execute:
#import nltk
#nltk.download('punkt')

<h2>1. Load data from SQLite database</h2>

In [2]:
from sqlalchemy import create_engine, MetaData, Table

# Query SQL db to get analyst job descriptions first

# Create connection to db
engine = create_engine("sqlite:///joblist.sqlite")
print(engine.table_names())

# Load in data table
metadata = MetaData()
data = Table('data', metadata, autoload=True, autoload_with=engine)

print(data.columns.keys())
print(repr(metadata.tables['data']))

['data']
['jobtitle', 'company', 'location', 'salary', 'jobdescription', 'label']
Table('data', MetaData(bind=None), Column('jobtitle', VARCHAR(length=100), table=<data>), Column('company', VARCHAR(length=100), table=<data>), Column('location', VARCHAR(length=25), table=<data>), Column('salary', INTEGER(), table=<data>), Column('jobdescription', TEXT(), table=<data>), Column('label', INTEGER(), table=<data>), schema=None)


In [3]:
from sqlalchemy import select

# Build query
stmt = select([data.columns.jobdescription, data.columns.label])
stmt = stmt.where(data.columns.label == 0) # 0 = analysts (label is an INTEGER column)

# Create connection to engine
connection = engine.connect()

# Execute query
results = connection.execute(stmt).fetchall()
print(results[0].keys())

['jobdescription', 'label']


In [4]:
# Count all rows in the table (every label, not only the analyst rows selected above)

from sqlalchemy import func

stmt_count = select([func.count(data.columns.jobdescription)])
results_count = connection.execute(stmt_count).scalar()
print(results_count)

700


In [5]:
# Create dataframe from SQLAlchemy ResultSet
df_data = pd.DataFrame(results)

# Give columns proper heading
df_data.columns = results[0].keys()
print(df_data.head())

                                      jobdescription  label
0  Position Title:Pricing Analyst Position Type: ...      0
1  Title: Senior Data Analyst - Telephony Manager...      0
2  We are looking for a talented Fuel Cell Data E...      0
3  CAREER OPPORTUNITY SENIOR METER DATA ANALYST L...      0
4  The Data Engineer reports directly to the Dire...      0
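
As an alternative to building the DataFrame from the ResultSet by hand, pandas can run the query itself. The sketch below is not the method used in this notebook. It assumes an in-memory SQLite database with a hypothetical `data` table and invented rows so that it runs on its own; in practice the connection would point at joblist.sqlite. The per-label count at the end shows how the table can hold more rows than the filtered query returns.

```python
import sqlite3
import pandas as pd

# Hypothetical stand-in for joblist.sqlite (in-memory, invented rows)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (jobdescription TEXT, label INTEGER)")
conn.executemany(
    "INSERT INTO data VALUES (?, ?)",
    [
        ("Pricing analyst role", 0),
        ("Senior data analyst role", 0),
        ("Data scientist role", 1),
    ],
)
conn.commit()

# Select only analyst postings (label 0); column names come from the query
df_analysts = pd.read_sql_query(
    "SELECT jobdescription, label FROM data WHERE label = ?", conn, params=(0,)
)

# Count rows per label to see how the table splits
label_counts = pd.read_sql_query(
    "SELECT label, COUNT(*) AS n FROM data GROUP BY label ORDER BY label", conn
)

print(df_analysts.shape)
print(label_counts)
```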


In [6]:
df_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 450 entries, 0 to 449
Data columns (total 2 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   jobdescription  450 non-null    object
 1   label           450 non-null    int64 
dtypes: int64(1), object(1)
memory usage: 7.2+ KB


In [7]:
df_data['jobdescription'] = df_data['jobdescription'].astype('string')
df_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 450 entries, 0 to 449
Data columns (total 2 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   jobdescription  450 non-null    string
 1   label           450 non-null    int64 
dtypes: int64(1), string(1)
memory usage: 7.2 KB


<h2>Text Analysis of Job Description</h2>

<h3>Pre-processing Pipeline</h3>

For each job description, we will tokenize, stem, remove stop words, and lowercase each word.

The following is an example using a single job description.

In [8]:
# Select an example job description from df_data

text = df_data['jobdescription'][7]

In [9]:
# Tokenize example text
# Maybe add len(token) to dataframe and plot

text_token = word_tokenize(str(text))
print(len(text_token))

471


In [10]:
print(text_token[-40:])

['(', 'Preferred', ')', 'Work', 'remotely', ':', 'Temporarily', 'due', 'to', 'COVID-19COVID-19', 'precaution', '(', 's', ')', ':', 'Remote', 'interview', 'processPersonal', 'protective', 'equipment', 'provided', 'or', 'requiredPlastic', 'shield', 'at', 'work', 'stationsSocial', 'distancing', 'guidelines', 'in', 'placeVirtual', 'meetingsSanitizing', ',', 'disinfecting', ',', 'or', 'cleaning', 'procedures', 'in', 'place']


In [11]:
# Stem tokens
# PorterStemmer.stem() takes one word at a time (and lowercases it), so apply it to each token

stemmer = PorterStemmer()
text_stemmed = [stemmer.stem(w) for w in text_token]
print(text_stemmed[-40:])

['(', 'prefer', ')', 'work', 'remot', ':', 'temporarili', 'due', 'to', 'covid-19covid-19', 'precaut', '(', 's', ')', ':', 'remot', 'interview', 'processperson', 'protect', 'equip', 'provid', 'or', 'requiredplast', 'shield', 'at', 'work', 'stationssoci', 'distanc', 'guidelin', 'in', 'placevirtu', 'meetingssanit', ',', 'disinfect', ',', 'or', 'clean', 'procedur', 'in', 'place']
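
The four steps named above (tokenize, lowercase, remove stop words, stem) can be combined into one helper. This is a minimal sketch rather than the notebook's pipeline: `preprocess` is a hypothetical name, and it assumes a simple regex tokenizer in place of `word_tokenize` so that it runs without the punkt download. The regex keeps alphabetic runs only, so punctuation tokens such as '(' and ':' are dropped.

```python
import re
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

stemmer = PorterStemmer()

def preprocess(text):
    """Hypothetical helper: lowercase, tokenize, drop stop words, stem."""
    # Assumption: alphabetic runs only, instead of nltk's word_tokenize
    tokens = re.findall(r"[a-z]+", text.lower())
    # Remove stop words before stemming so the stop list matches whole words
    tokens = [t for t in tokens if t not in ENGLISH_STOP_WORDS]
    return [stemmer.stem(t) for t in tokens]

print(preprocess("The analysts are working remotely"))
```

Swapping `re.findall` for `word_tokenize` would keep tokens such as 'COVID-19' intact, at the cost of also keeping punctuation.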


<h2>Text Pre-processing Pipeline</h2>

All the job postings currently available will be considered the "test" set and will not be split. Any predictions will be made on new postings scraped at a later date.

CountVectorizer and TF-IDF are two methods that can be used to vectorize text.

CountVectorizer returns an encoded vector for each document, with an integer count for each word in the vocabulary.

We will use CountVectorizer to identify skills that appear frequently in job descriptions. Such skills have a higher count across the corpus than (i) less frequently used or company-specific skills and (ii) unimportant words.
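
To make the counting concrete, here is a small sketch on a three-document toy corpus. The documents are invented for illustration, and the built-in 'english' stop list is assumed instead of the custom list used later.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Invented toy corpus standing in for job descriptions
docs = [
    "SQL and Python experience required",
    "Python and Excel skills",
    "Excel reporting experience",
]

cvec = CountVectorizer(stop_words="english", lowercase=True)
counts = cvec.fit_transform(docs)  # sparse matrix: one row per document, one column per term

# Sum each column to get the count of every term across the corpus
totals = np.asarray(counts.sum(axis=0)).ravel()
term_totals = {term: int(totals[idx]) for term, idx in cvec.vocabulary_.items()}

# Most frequent terms first, ties broken alphabetically
for term, n in sorted(term_totals.items(), key=lambda kv: (-kv[1], kv[0])):
    print(term, n)
```

Terms shared by several postings ("python", "excel", "experience") rise to the top, which is the behaviour the skill-identification idea relies on.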


The pre-processing pipeline will include the following steps:

- Tokenization
- Stopword removal
- Lower casing
- Apply transformation

<h2>TFIDF Vectorizer</h2>

TfidfVectorizer converts the job description text into a matrix of TF-IDF features. This class, from the sklearn.feature_extraction.text module, performs multiple steps including tokenization, stop word removal, and lowercasing.
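
One way to see those built-in steps is to ask the vectorizer for its analyzer, the callable it applies to each document before any weighting. The sketch below assumes the built-in 'english' stop list and an invented sentence.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer(analyzer="word", stop_words="english", lowercase=True)

# build_analyzer() returns the preprocessing + tokenizing + stop word filtering
# callable; it works on an unfitted vectorizer
analyze = vec.build_analyzer()

tokens = analyze("Strong SQL and Python skills")
print(tokens)
```

Here "and" is removed by the stop list and the remaining words are lowercased. "strong" survives, which is why it is added to the custom stop word list in the next cell.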

<h4>1. Initialize Stopword Removal & Lowercasing</h4>

In [12]:
# Add words to stopwords list
# https://stackoverflow.com/questions/24386489/adding-words-to-scikit-learns-countvectorizers-stop-list
custom_stopwords = ['bachelor', 'degree', 'work', 'equal', 'opportunity', 'employer', 'objectives', 'ontario', 'canada', 'disability', 'strong', 'including', 'ensure', 'understanding', 'related']

# Initialize TFIDF Vectorizer
tvec = TfidfVectorizer(analyzer = 'word',
                       stop_words = ENGLISH_STOP_WORDS.union(custom_stopwords), 
                       lowercase= True, 
                       min_df=4)

<h4>2. Apply TF-IDF</h4>

In [13]:
# returns matrix of tf-idf features
tvec_token = tvec.fit_transform(df_data['jobdescription'])

In [14]:
tvec_token.shape

(450, 3908)

In [16]:
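# Observe IDF weights (tvec.idf_ holds one inverse document frequency per term,
# so the 'tfidf' column below contains IDF values, not per-document TF-IDF scores)
# Note: scikit-learn >= 1.2 removed get_feature_names(); use get_feature_names_out() there

weights = dict(zip(tvec.get_feature_names(), tvec.idf_))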
tfidf = pd.DataFrame.from_dict(weights, orient='index')
tfidf.columns = ['tfidf']

# Lowest IDF scores: terms that appear in the most job descriptions
low_tfidf = tfidf.sort_values(by=['tfidf'], ascending=True).head(25)
print(low_tfidf)

                  tfidf
data           1.008909
experience     1.061734
business       1.105114
skills         1.115015
team           1.137658
management     1.242170
analysis       1.256395
years          1.259265
analyst        1.327642
support        1.330724
process        1.333815
knowledge      1.355725
ability        1.374895
working        1.391156
requirements   1.431295
communication  1.434714
tools          1.458978
development    1.469560
solutions      1.491066
job            1.509349
role           1.516756
analytical     1.516756
provide        1.524219
environment    1.524219
information    1.539313


In [17]:
# Highest IDF scores: terms that appear in the fewest job descriptions
high_tfidf = tfidf.sort_values(by=['tfidf'], ascending=False).head(10)
print(high_tfidf)

               tfidf
œuvre       5.502029
colleague   5.502029
golden      5.502029
sites       5.502029
catalog     5.502029
glassdoor   5.502029
gl          5.502029
unwavering  5.502029
cats        5.502029
cc          5.502029
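
The values in both tables come from `idf_`, so they can be checked by hand. The sketch below assumes scikit-learn's default `smooth_idf=True`, under which idf(t) = ln((1 + n) / (1 + df(t))) + 1 for n documents and document frequency df(t). The function name `smoothed_idf` is hypothetical, and the document frequency of 446 is inferred from the printed score rather than read from the data.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """IDF as computed by TfidfVectorizer when smooth_idf=True (the default)."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 450  # rows in df_data

# min_df=4 means the rarest surviving term appears in exactly 4 documents
idf_rarest = smoothed_idf(n_docs, 4)

# A term found in 446 of 450 documents (inferred) reproduces the score printed for 'data'
idf_common = smoothed_idf(n_docs, 446)

print(round(idf_rarest, 6), round(idf_common, 6))
```

This explains why every term in the highest-score table ties at 5.502029: each of them sits exactly at the `min_df=4` floor, so the list shows rare terms, not necessarily informative ones.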


<h3>Bi-grams and Tri-grams</h3>

The TF-IDF vectorizer has a parameter, ngram_range, which allows for the extraction of sequences of words that occur together, such as "machine learning" or "data science". Two-word sequences are referred to as bigrams and three-word sequences as trigrams; both can be useful in identifying key concepts. Setting ngram_range=(2,3) extracts both.
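
A short sketch of what `ngram_range=(2,3)` produces for one invented sentence, assuming the built-in 'english' stop list. Stop words are removed before the n-grams are formed, so words that were not adjacent in the original text can end up in the same n-gram.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer(analyzer="word", stop_words="english",
                      lowercase=True, ngram_range=(2, 3))
analyze = vec.build_analyzer()

# "5" is dropped by the default token pattern (two or more characters);
# "of" and "in" are dropped as stop words
ngrams = analyze("5 years of experience in data analysis")
print(ngrams)
```

This is why features such as "years experience" and "experience data" show up in the output below even though those exact phrases are unlikely to appear verbatim in a posting.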

In [18]:
# Add words to stopwords list
custom_stopwords = ['bachelor', 'degree', 'ability', 'work', 'equal', 'opportunity', 'employer', 'objectives', 'ontario', 'canada', 'disability']

# Initialize TFIDF Vectorizer
tvec2 = TfidfVectorizer(analyzer = 'word', 
                       stop_words = ENGLISH_STOP_WORDS.union(custom_stopwords), 
                       lowercase= True, 
                       ngram_range=(2,3), 
                       min_df=4)

# returns matrix of tf-idf features
tvec2_token = tvec2.fit_transform(df_data['jobdescription'])

In [19]:
weights2 = dict(zip(tvec2.get_feature_names(), tvec2.idf_))
tfidf2 = pd.DataFrame.from_dict(weights2, orient='index')
tfidf2.columns = ['tfidf']

# Lowest IDF scores: n-grams that appear in the most job descriptions
low_tfidf2 = tfidf2.sort_values(by=['tfidf'], ascending=True).head(25)
print(low_tfidf2)

                           tfidf
communication skills    1.891112
data analysis           2.011601
years experience        2.081029
problem solving         2.100832
business intelligence   2.366535
computer science        2.429336
internal external       2.438639
team members            2.457507
data analytics          2.486495
experience working      2.536756
project management      2.547119
experience data         2.589679
ad hoc                  2.589679
decision making         2.600608
business requirements   2.600608
related field           2.611658
minimum years           2.634131
problem solving skills  2.668816
analytical skills       2.668816
solving skills          2.668816
written communication   2.668816
business analyst        2.680651
job description         2.680651
selection process       2.704748
data management         2.729441


In [20]:
# Highest IDF scores: n-grams that appear in the fewest job descriptions
high_tfidf2 = tfidf2.sort_values(by=['tfidf'], ascending=False).head(10)
print(high_tfidf2)

                             tfidf
00 hourexperience         5.502029
growing business          5.502029
gym membership family     5.502029
studios wattpad           5.502029
studios wattpad books     5.502029
guidance support          5.502029
submitted updated         5.502029
growth initiatives        5.502029
growth employee wellness  5.502029
growth employee           5.502029


<h3>Text Pre-processing Pipeline Function</h3>

Putting it all together: a single vectorizer that extracts unigrams, bigrams, and trigrams.

In [21]:
custom_stopwords = ['bachelor', 'degree', 'work', 'equal', 'opportunity', 'employer', 'objectives', 'ontario', 'canada', 'disability', 'strong', 'including', 'ensure', 'understanding', 'related']

# Initialize TFIDF Vectorizer
tvec3 = TfidfVectorizer(analyzer = 'word',  
                       stop_words = ENGLISH_STOP_WORDS.union(custom_stopwords), 
                       lowercase= True, 
                       ngram_range=(1,3),
                       min_df=4)

def tfidf3_pipeline(txt):
    # Fit tvec3 on txt and return the TF-IDF matrix
    # (tokenization, stop word removal, and lowercasing happen inside the vectorizer)
    return tvec3.fit_transform(txt)

In [22]:
jobtext_TFIDF3 = tfidf3_pipeline(df_data['jobdescription'])

In [23]:
original = df_data.shape
preprocessed = jobtext_TFIDF3.shape

print("Original raw data df shape: " + str(original))
print("Preprocessed data shape: " + str(preprocessed))

Original raw data df shape: (450, 2)
Preprocessed data shape: (450, 17190)
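
Because `tfidf3_pipeline` calls `fit_transform`, it relearns the vocabulary every time it runs. For postings scraped later, one option is to fit once on the existing postings and then call `transform` on the new ones, so both matrices share the same columns. The sketch below uses invented postings and a fresh vectorizer; it is an assumption about how the later step could work, not something this notebook does.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented stand-ins for the existing and the newly scraped postings
existing_postings = [
    "data analyst with sql experience",
    "business analyst with excel experience",
    "data engineer with python experience",
]
new_postings = ["analyst with tableau experience"]

vec = TfidfVectorizer(stop_words="english", lowercase=True)

# Learn vocabulary and IDF weights from the existing postings only
X_existing = vec.fit_transform(existing_postings)

# Reuse that vocabulary for new text; unseen words ("tableau") are ignored
X_new = vec.transform(new_postings)

print(X_existing.shape, X_new.shape)
```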


In [25]:
weights4 = dict(zip(tvec3.get_feature_names(), tvec3.idf_))
tfidf3 = pd.DataFrame.from_dict(weights4, orient='index')
tfidf3.columns = ['tfidf']

# Lowest IDF scores: terms that appear in the most job descriptions
low_tfidf3 = tfidf3.sort_values(by=['tfidf'], ascending=True).head(25)
print(low_tfidf3)

                  tfidf
data           1.008909
experience     1.061734
business       1.105114
skills         1.115015
team           1.137658
management     1.242170
analysis       1.256395
years          1.259265
analyst        1.327642
support        1.330724
process        1.333815
knowledge      1.355725
ability        1.374895
working        1.391156
requirements   1.431295
communication  1.434714
tools          1.458978
development    1.469560
solutions      1.491066
job            1.509349
analytical     1.516756
role           1.516756
environment    1.524219
provide        1.524219
information    1.539313
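
The table above ranks terms by IDF alone. To rank them by their actual TF-IDF weight, one option is to average each column of the matrix returned by `fit_transform`. The sketch below does this on an invented three-document corpus with the built-in 'english' stop list; `ranking` is a hypothetical name.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented toy corpus
docs = [
    "data analysis sql",
    "data visualization",
    "data reporting excel",
]

vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(docs)

# Mean TF-IDF weight of each term across all documents
mean_scores = np.asarray(X.mean(axis=0)).ravel()

# vocabulary_ maps term -> column index; sort terms into column order
terms = sorted(vec.vocabulary_, key=vec.vocabulary_.get)

ranking = pd.Series(mean_scores, index=terms).sort_values(ascending=False)
print(ranking)
```

In this toy corpus "data" has the lowest IDF yet the highest mean TF-IDF, because it contributes to every document. The two rankings answer different questions, so it helps to be explicit about which one a table shows.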
