<h1>Part 2: Pre-Processing of Job Description Text</h1>

In [1]:
import csv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import regex as re

import os
import warnings
warnings.filterwarnings('ignore')

from wordcloud import WordCloud, STOPWORDS

In [2]:
# Load csv (read_csv already returns a DataFrame, so no extra wrapping is needed)
with open('results.csv') as f:
    raw_data = pd.read_csv(f, encoding='latin-1')

raw_data.head()

Unnamed: 0,Job Title,Company,Location,Salary,Job Description
0,Pricing Analyst,Day & Ross Inc.,"Mississauga, ON",,Position Title:Pricing Analyst\nPosition Type:...
1,Senior Data Analyst- Telephony Manager,NCRi Inc.,"Mississauga, ON",,Title: Senior Data Analyst - Telephony Manager...
2,Fuel Cell Data Engineer / Analyst,Cummins Inc.,"Mississauga, ON",,We are looking for a talented Fuel Cell Data E...
3,Senior Meter Data Analyst,Rodan Energy Solutions Inc.,"Mississauga, ON",,CAREER OPPORTUNITY\nSENIOR METER DATA ANALYST\...
4,"Data Engineer, Business Intelligence & Analytics",Herjavec Group,"Toronto, ON",,The Data Engineer reports directly to the Dire...


In [3]:
raw_data.describe()

Unnamed: 0,Job Title,Company,Location,Salary,Job Description
count,450,450,450,62,450
unique,314,267,16,47,384
top,Data Analyst,BMO Financial Group,"Toronto, ON","$60,000 - $70,000 a year",Our student and new graduate programs offer a ...
freq,19,23,333,4,5


In [4]:
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 450 entries, 0 to 449
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Job Title        450 non-null    object
 1   Company          450 non-null    object
 2   Location         450 non-null    object
 3   Salary           62 non-null     object
 4   Job Description  450 non-null    object
dtypes: object(5)
memory usage: 17.7+ KB


<h2>Text Analysis of Job Description</h2>

<h3>Pre-processing</h3>

All job postings currently available are treated as a single "test" set and are not split into separate subsets. Any predictions will be made on new postings scraped at a later date.

The pre-processing pipeline will include the following steps:
- Word count
- Tokenization
- Lowercasing
- Stopword removal
- Stemming
- Transformation using TF-IDF
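
As a rough illustration of what each listed step does, the sketch below runs them on two made-up sentences. The sample sentences, the regex tokenizer, and the helper name `preprocess` are assumptions for illustration only; the notebook cells that follow use NLTK's `word_tokenize` and the real job descriptions.

```python
import re
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

# Made-up sample postings (not taken from results.csv)
docs = [
    "Analyzing pricing data and building reports.",
    "The analyst builds dashboards and reports on data quality.",
]

stemmer = PorterStemmer()

def preprocess(text):
    """Tokenize, lowercase, drop stopwords, and stem one document."""
    tokens = re.findall(r"[A-Za-z]+", text)        # simple tokenization (assumption)
    tokens = [t.lower() for t in tokens]           # lowercasing
    tokens = [t for t in tokens if t not in ENGLISH_STOP_WORDS]  # stopword removal
    return [stemmer.stem(t) for t in tokens]       # stemming, one word at a time

word_counts = [len(d.split()) for d in docs]       # word count per document
cleaned = [" ".join(preprocess(d)) for d in docs]

# TF-IDF transformation of the cleaned text
tfidf_matrix = TfidfVectorizer().fit_transform(cleaned)

print(word_counts)
print(cleaned)
print(tfidf_matrix.shape)
```

Each row of `tfidf_matrix` corresponds to one document and each column to one stemmed term.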

In [5]:
# The text we will be analysing is the job description
# Load libraries

from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

textdf = raw_data[['Job Description']].copy()  # copy, so later column assignments do not act on a slice of raw_data
textdf.head()

Unnamed: 0,Job Description
0,Position Title:Pricing Analyst\nPosition Type:...
1,Title: Senior Data Analyst - Telephony Manager...
2,We are looking for a talented Fuel Cell Data E...
3,CAREER OPPORTUNITY\nSENIOR METER DATA ANALYST\...
4,The Data Engineer reports directly to the Dire...


In [6]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/jennifer/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [7]:
textdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 450 entries, 0 to 449
Data columns (total 1 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Job Description  450 non-null    object
dtypes: object(1)
memory usage: 3.6+ KB


In [40]:
text1 = textdf['Job Description'][1]
text1

'0      position title:pricing analyst\\nposition type:...\n1      title: senior data analyst - telephony manager...\n2      we are looking for a talented fuel cell data e...\n3      career opportunity\\nsenior meter data analyst\\...\n4      the data engineer reports directly to the dire...\n                             ...                        \n445    job description:\\n\\nanalyst health data public...\n446    company profile\\nhome capital group inc., toge...\n447    on behalf of our pharmaceutical manufacturing ...\n448    address:\\n60 yonge street\\njob family group:\\n...\n449    business analyst – risk & treasury\\ntoronto – ...\nname: job description, length: 450, dtype: object'

In [None]:
textdf['Job Description'] = textdf['Job Description'].astype('str')

In [33]:
text1_token = word_tokenize(str(text1))
print(text1_token)

['0', 'position', 'title', ':', 'pricing', 'analyst\\nposition', 'type', ':', '...', '1', 'title', ':', 'senior', 'data', 'analyst', '-', 'telephony', 'manager', '...', '2', 'we', 'are', 'looking', 'for', 'a', 'talented', 'fuel', 'cell', 'data', 'e', '...', '3', 'career', 'opportunity\\nsenior', 'meter', 'data', 'analyst\\', '...', '4', 'the', 'data', 'engineer', 'reports', 'directly', 'to', 'the', 'dire', '...', '...', '445', 'job', 'description', ':', '\\n\\nanalyst', 'health', 'data', 'public', '...', '446', 'company', 'profile\\nhome', 'capital', 'group', 'inc.', ',', 'toge', '...', '447', 'on', 'behalf', 'of', 'our', 'pharmaceutical', 'manufacturing', '...', '448', 'address', ':', '\\n60', 'yonge', 'street\\njob', 'family', 'group', ':', '\\n', '...', '449', 'business', 'analyst', '–', 'risk', '&', 'treasury\\ntoronto', '–', '...', 'name', ':', 'job', 'description', ',', 'length', ':', '450', ',', 'dtype', ':', 'object']


Verify that there is text in each job description by counting and graphing the number of words.

In [34]:
print(len(text1_token))

107


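Working from the smaller df with just the job description gave inaccurate results: the text is cut off. The string that was tokenized is the truncated printed summary of the whole column (note the row indices, the `...` markers, and `dtype: object` among the tokens) rather than one complete description. Take the description directly from the raw data df instead.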
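
The cut-off text can be reproduced in isolation. The following sketch uses a made-up Series (an assumption, standing in for the job description column) to contrast converting a whole column with `str()` against reading a single row.

```python
import pandas as pd

# Made-up stand-in for the 'Job Description' column: 100 long strings
descriptions = pd.Series(
    ["word " * 200 + f"end{i}" for i in range(100)],
    name="Job Description",
)

as_repr = str(descriptions)     # printed summary of the whole Series
one_row = descriptions[1]       # full text of a single description

print(len(as_repr.split()))     # word count of the printed summary only
print(len(one_row.split()))     # word count of the complete row
print("..." in as_repr)         # the summary elides long values and middle rows
```

With pandas' default display options, the printed summary shortens long strings and skips middle rows, so it is only suitable for inspection and not as input to a tokenizer.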

In [41]:
text2 = raw_data['Job Description'][1]
text2

'Title: Senior Data Analyst - Telephony Manager\nReporting Manager: VP of Technology\n\nLocation: Mississauga\nJob Type: Full time, Permanent\n\n Job Summary\n\n\nThe primary responsibility of the Telephony Manager (Senior Data Analyst) is to maintain and operate and to operate a high-volume dialer system and its subsystems in a contact center environment effectively and efficiently. Experience in managing Telephony Applications (IVR, Quality Management, Workforce Management, Reporting). You will also collaborate with business management to determine overall strategy.\n\n Responsibilities:\n\nResearch, plan, install, configure, troubleshoot, maintain, and upgrade telephony systems.\nAnalyze and evaluate present or proposed telephony procedures to drive contact center best practices and continually improve performance of the systems and dialer strategies.\nTroubleshoot and resolve hardware, software, and connectivity problems, including user access and component configuration\nRecord an

In [42]:
text2_token = word_tokenize(str(text2))
print(len(text2_token))

378
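
To count words for every posting instead of one at a time, and to graph the result, something along these lines could be used. The `sample` frame is made-up data standing in for `raw_data`, and a whitespace split is assumed as a rough word count (it will differ somewhat from `word_tokenize`, which also separates punctuation).

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up stand-in for raw_data; the notebook would use raw_data directly
sample = pd.DataFrame({
    "Job Description": [
        "Analyze pricing data and prepare weekly reports.",
        "Maintain the dialer system.",
        "",                                   # an empty posting we want to catch
    ]
})

# Whitespace-based word count for every posting
sample["word_count"] = sample["Job Description"].astype(str).str.split().str.len()

# Postings with no text at all
empty_rows = sample.index[sample["word_count"] == 0].tolist()
print(sample["word_count"].tolist())
print(empty_rows)

# Histogram of description lengths (renders inline in a notebook)
fig, ax = plt.subplots()
ax.hist(sample["word_count"], bins=10)
ax.set_xlabel("Words per job description")
ax.set_ylabel("Number of postings")
```

Any index listed in `empty_rows` marks a posting whose description contains no words.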


<h4>1. Tokenize Words</h4>

In [43]:
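# Tokenize each description separately; str() on the whole column would only yield its truncated printout
tokens = raw_data['Job Description'].astype(str).apply(word_tokenize)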

<h4>2. Stemming Words</h4>

In [44]:
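porter = PorterStemmer()
# porter.stem() expects a single word, so stem each token and rejoin the text
textdf['Job Description'] = raw_data['Job Description'].apply(
    lambda doc: ' '.join(porter.stem(tok) for tok in word_tokenize(str(doc))))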

<h4>3. Stopword Removal & Lowercasing</h4>

In [45]:
# Initialize TFIDF Vectorizer
cv = TfidfVectorizer(stop_words='english', lowercase=True)
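
To see what these two options do before any weighting is applied, the vectorizer's analyzer can be called on its own. This is a small sketch on a made-up sentence; the variable names are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english', lowercase=True)

# build_analyzer() returns the function the vectorizer applies to each document:
# lowercasing, tokenization, then stopword filtering
analyze = vectorizer.build_analyzer()

sample = "The Analyst WILL maintain Telephony systems"   # made-up sentence
terms = analyze(sample)
print(terms)
```

The returned list contains only lowercased terms, with words on scikit-learn's built-in English stop list removed.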

<h4>4. Apply TF-IDF</h4>

In [46]:
textdf_CV = cv.fit_transform(textdf['Job Description'])
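
For reference, the default `TfidfVectorizer` weighting can be reproduced by hand. The sketch below, on a made-up mini corpus, assumes the default settings (`smooth_idf=True`, `norm='l2'`, raw counts as term frequency), for which scikit-learn documents the inverse document frequency as ln((1 + n) / (1 + df)) + 1.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up mini corpus
corpus = ["data analyst sql", "data engineer python sql", "pricing analyst"]

# Raw term counts (term frequency)
counts = CountVectorizer().fit_transform(corpus).toarray()
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)               # documents containing each term

# Smoothed inverse document frequency, as documented for smooth_idf=True
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight counts by idf, then scale every row to unit Euclidean length
manual = counts * idf
manual = manual / np.linalg.norm(manual, axis=1, keepdims=True)

from_sklearn = TfidfVectorizer().fit_transform(corpus).toarray()
print(np.allclose(manual, from_sklearn))
```

Terms that occur in fewer documents receive a larger idf, so they contribute more to a row than terms shared by most postings.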

<h3>Text Pre-processing Pipeline Function</h3>

See DSDJ Feature Engineering pt 2 for a train & test version of the Tfidf vectorizer function.
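
That notebook is not reproduced here. A common pattern for a train and test version, assumed here rather than taken from it, is to fit the vectorizer on the training text only and reuse the fitted vocabulary on the test text. The helper name `tfidf_train_test` and the sample postings are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_train_test(train_txt, test_txt):
    """Hypothetical helper: fit TF-IDF on train text, reuse it on test text."""
    vec = TfidfVectorizer(stop_words='english', lowercase=True)
    x_train = vec.fit_transform(train_txt)   # learns vocabulary and idf weights
    x_test = vec.transform(test_txt)         # reuses them; unseen words are ignored
    return x_train, x_test, vec

# Made-up postings standing in for current and later-scraped data
train = ["data analyst sql reporting", "data engineer python pipelines"]
test = ["senior analyst python tableau"]

x_train, x_test, vec = tfidf_train_test(train, test)
print(x_train.shape, x_test.shape)
```

Both matrices share the same columns, which is what allows a model fitted on one to be applied to the other.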

In [47]:
# A function that cleans and performs a TFIDF transformation to text data
tfidf = TfidfVectorizer(stop_words='english', lowercase=True)

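def tfidf_pipeline(txt):
    # Caveat: porter.stem() expects a single word, so applied to a whole description
    # it effectively lowercases the text and stems only its ending, not every word
    txt = txt.apply(porter.stem)
    x = tfidf.fit_transform(txt)  # TF-IDF with stopword removal and lowercasing
    return x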

In [48]:
text_TFIDF = tfidf_pipeline(raw_data['Job Description'])

In [49]:
original = raw_data.shape
preprocessed = text_TFIDF.shape

print("Original raw data df shape: " + str(original))
print("Preprocessed data shape: " + str(preprocessed))

Original raw data df shape: (450, 5)
Preprocessed data shape: (450, 10256)
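
NLTK's `PorterStemmer.stem` is designed for one word at a time, so a variant of the pipeline that stems token by token may be worth comparing. The sketch below is one possible version under stated assumptions: the function names, the regex word split, and the sample postings are all illustrative. On the real data the number of columns would differ from the shape printed above.

```python
import re
import pandas as pd
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

porter = PorterStemmer()

def stem_document(text):
    """Stem every word of one document and join the stems back into a string."""
    words = re.findall(r"\w+", str(text))   # regex split is an assumption; word_tokenize also works
    return " ".join(porter.stem(w) for w in words)

def tfidf_pipeline_by_token(txt):
    """Hypothetical variant of tfidf_pipeline that stems token by token."""
    vectorizer = TfidfVectorizer(stop_words='english', lowercase=True)
    stemmed = txt.apply(stem_document)
    return vectorizer.fit_transform(stemmed), vectorizer

# Made-up postings; with the real data this would be raw_data['Job Description']
sample = pd.Series(["Reporting and analyzing reports", "Analysts report weekly"])
matrix, vectorizer = tfidf_pipeline_by_token(sample)
print(sorted(vectorizer.vocabulary_))
```

Because stemming runs before the vectorizer's stopword filter in this variant, a stopword whose stem no longer matches the stop list may remain in the vocabulary.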


In [None]:
# Word frequency: how many words per post.
# Text cleaning: lowercasing, extra whitespace removal, lemmatization

# Determine the most common words that occur in the job descriptions.

# Predetermine a list of expected lookup terms for dictionary of skills
# BOW - Create a list of dictionaries containing word counts for each job posting
# Table with skill, count, percentage
# Wordcloud
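
One way the planned skill table could look is sketched below. The list of lookup terms and the postings are made up for illustration, and the count is taken as the number of postings that mention a skill at least once, which is an assumption about what the table should report.

```python
import re
from collections import Counter
import pandas as pd

# Hypothetical lookup terms; the real list would be chosen by hand
skills = ["sql", "python", "excel", "tableau"]

# Made-up postings; with the real data this would be raw_data['Job Description']
postings = [
    "Strong SQL and Excel skills. SQL experience required.",
    "Python, SQL and Tableau dashboards.",
    "Excel reporting for the finance team.",
]

# Bag of words: one dictionary of word counts per posting
bags = [Counter(re.findall(r"[a-z+#]+", p.lower())) for p in postings]

# Number of postings mentioning each skill at least once
rows = []
for skill in skills:
    n = sum(1 for bag in bags if bag[skill] > 0)
    rows.append({"skill": skill, "count": n, "percentage": 100 * n / len(postings)})

skill_table = pd.DataFrame(rows).sort_values("count", ascending=False, ignore_index=True)
print(skill_table)
```

Single-word lookups like these miss multi-word skills such as "machine learning"; those would need phrase matching or n-grams.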

In [None]:
# Word2Vec - words used in similar contexts get vectors that are close together

# Topic modelling - where each skill is considered a topic

# NER with BERT