<h1>Part 2: Pre-Processing of Job Description Text</h1>

In [1]:
import csv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import regex as re

import os
import warnings
warnings.filterwarnings('ignore')

from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

# If you haven't already done so, run:
#import nltk
#nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/jennifer/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [2]:
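# Load the scraped postings, keeping only the title and description columns.
# Passing the path (rather than an already-open file handle) lets pandas
# apply the requested encoding.
jobText = pd.read_csv('results.csv',
                      header=0,
                      usecols=["Job Title", "Job Description"],
                      encoding='latin-1')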

jobText.head()

Unnamed: 0,Job Title,Job Description
0,Pricing Analyst,Position Title:Pricing Analyst\nPosition Type:...
1,Senior Data Analyst- Telephony Manager,Title: Senior Data Analyst - Telephony Manager...
2,Fuel Cell Data Engineer / Analyst,We are looking for a talented Fuel Cell Data E...
3,Senior Meter Data Analyst,CAREER OPPORTUNITY\nSENIOR METER DATA ANALYST\...
4,"Data Engineer, Business Intelligence & Analytics",The Data Engineer reports directly to the Dire...


In [3]:
jobText.describe()

Unnamed: 0,Job Title,Job Description
count,450,450
unique,314,384
top,Data Analyst,At the Janssen Pharmaceutical Companies of Joh...
freq,19,5


In [4]:
jobText.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 450 entries, 0 to 449
Data columns (total 2 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Job Title        450 non-null    object
 1   Job Description  450 non-null    object
dtypes: object(2)
memory usage: 7.2+ KB


In [5]:
jobText['Job Description'] = jobText['Job Description'].astype('str')

In [7]:
jobText.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 450 entries, 0 to 449
Data columns (total 2 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Job Title        450 non-null    object
 1   Job Description  450 non-null    object
dtypes: object(2)
memory usage: 7.2+ KB


Add a column with the character length of each description.

In [8]:
jobText['length'] = jobText['Job Description'].str.len()
jobText.head()

Unnamed: 0,Job Title,Job Description,length
0,Pricing Analyst,Position Title:Pricing Analyst\nPosition Type:...,3404
1,Senior Data Analyst- Telephony Manager,Title: Senior Data Analyst - Telephony Manager...,2471
2,Fuel Cell Data Engineer / Analyst,We are looking for a talented Fuel Cell Data E...,2152
3,Senior Meter Data Analyst,CAREER OPPORTUNITY\nSENIOR METER DATA ANALYST\...,3825
4,"Data Engineer, Business Intelligence & Analytics",The Data Engineer reports directly to the Dire...,3819


<h2>Text Analysis of Job Description</h2>

<h3>Pre-processing</h3>

All currently available job postings will be used as a single dataset and will not be split into training and test sets. Any predictions will be made on new postings scraped at a later date.

The pre-processing pipeline will include the following steps:
- Word count
- Tokenization
- Lower casing
- Stopword removal
- Stemming
- Transformation using TF-IDF
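
As a rough illustration of the first four steps, the sketch below runs them on a single sentence using only the standard library. The regex tokenizer and the tiny `STOPWORDS` set are simplifying assumptions standing in for `word_tokenize` and scikit-learn's English stop-word list; the helper name `basic_clean` is hypothetical.

```python
import re

# Hypothetical, deliberately tiny stop-word list (the notebook relies on
# scikit-learn's built-in 'english' list instead).
STOPWORDS = {"a", "an", "and", "the", "to", "of", "in", "is", "with"}

def basic_clean(text):
    """Tokenize, lowercase, and drop stop words from one description."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)               # crude tokenization
    lowered = [tok.lower() for tok in tokens]                # lower casing
    kept = [tok for tok in lowered if tok not in STOPWORDS]  # stop-word removal
    return tokens, kept

sample = "Analyze and evaluate the performance of the dialer systems."
tokens, kept = basic_clean(sample)
print(len(tokens))  # word count before cleaning
print(kept)         # tokens that survive cleaning
```

Returning both the raw tokens and the cleaned tokens keeps the word count independent of the stop-word list chosen.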

In [9]:
text2 = jobText['Job Description'][1]
text2

'Title: Senior Data Analyst - Telephony Manager\nReporting Manager: VP of Technology\n\nLocation: Mississauga\nJob Type: Full time, Permanent\n\n Job Summary\n\n\nThe primary responsibility of the Telephony Manager (Senior Data Analyst) is to maintain and operate and to operate a high-volume dialer system and its subsystems in a contact center environment effectively and efficiently. Experience in managing Telephony Applications (IVR, Quality Management, Workforce Management, Reporting). You will also collaborate with business management to determine overall strategy.\n\n Responsibilities:\n\nResearch, plan, install, configure, troubleshoot, maintain, and upgrade telephony systems.\nAnalyze and evaluate present or proposed telephony procedures to drive contact center best practices and continually improve performance of the systems and dialer strategies.\nTroubleshoot and resolve hardware, software, and connectivity problems, including user access and component configuration\nRecord an

In [10]:
text2_token = word_tokenize(str(text2))
print(len(text2_token))

378


<h4>1. Tokenize Words</h4>

In [12]:
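# Tokenize each description separately; calling str() on the whole Series
# would tokenize its truncated printed representation, not the full text
tokens = jobText['Job Description'].apply(word_tokenize)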

<h4>2. Stemming Words</h4>

In [13]:
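porter = PorterStemmer()
# Caution: PorterStemmer.stem() expects a single word. Applied to a whole
# description it treats the text as one long word, so the text is lowercased
# but only its trailing suffix can be stemmed.
jobText['Job Description'] = jobText['Job Description'].apply(porter.stem)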
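
`PorterStemmer.stem` is designed for one word at a time, so a per-token variant may be closer to what this step is meant to do. The sketch below is one possible approach, not the notebook's method: `stem_text` is a hypothetical helper, and it splits on whitespace rather than using `word_tokenize` so that it does not depend on the `punkt` data.

```python
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

def stem_text(text):
    """Stem every whitespace-separated token and rejoin into one string."""
    return " ".join(porter.stem(token) for token in text.split())

print(stem_text("Running systems"))
# Could then be applied per row, for example:
# jobText['Job Description'].apply(stem_text)
```

Switching to token-level stemming would change the vocabulary that the vectorizer learns, so any matrix shapes recorded with the original cell would no longer apply.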

<h4>3. Stopword Removal & Lowercasing</h4>

In [14]:
# Initialize the TF-IDF vectorizer; it handles stop-word removal and lowercasing
cv = TfidfVectorizer(stop_words='english', lowercase=True)

<h4>4. Apply TF-IDF</h4>

In [15]:
jobText_CV = cv.fit_transform(jobText['Job Description'])
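
To make the transformation less of a black box, here is a small pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings (`smooth_idf=True`, `norm='l2'`): raw term counts are multiplied by idf = ln((1 + n) / (1 + df)) + 1, and each row is then scaled to unit length. The three toy documents and the helper name `tfidf_rows` are made up for illustration, and tokenization is reduced to a whitespace split.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) with L2-normalized TF-IDF weights."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed idf, matching scikit-learn's default formula
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])  # L2 normalization per document
    return vocab, rows

docs = ["data analyst sql", "data engineer python", "data analyst python sql"]
vocab, rows = tfidf_rows(docs)
for term, weight in zip(vocab, rows[1]):
    print(f"{term:10s}{weight:.3f}")
```

A term that appears in every document, such as `data` here, receives the minimum idf of 1, so rarer terms like `engineer` end up with larger weights in the documents that contain them.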

<h3>Text Pre-processing Pipeline Function</h3>

See DSDJ Feature Engineering pt 2 for a train-and-test version of the TF-IDF vectorizer function.

In [16]:
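# Clean the text and apply a TF-IDF transformation
tfidf = TfidfVectorizer(stop_words='english', lowercase=True)

def tfidf_pipeline(txt):
    """Stem a Series of descriptions and return its TF-IDF sparse matrix."""
    # Note: porter.stem() works on single words, so on a full description
    # it mainly lowercases the text rather than stemming every token
    txt = txt.apply(porter.stem)
    # Vectorize; stop-word removal and lowercasing happen inside the vectorizer
    x = tfidf.fit_transform(txt)
    return x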

In [17]:
jobtext_TFIDF = tfidf_pipeline(jobText['Job Description'])
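
Because `tfidf_pipeline` calls `fit_transform`, running it again on postings scraped later would learn a new vocabulary, and the resulting columns would not line up with the matrix above. One common pattern, sketched here under the assumption that new postings arrive as a list or Series of strings, is to fit once and reuse the fitted vectorizer with `transform`. The variable names and sample strings are illustrative only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

existing_posts = [
    "Maintain and upgrade telephony systems",
    "Build data pipelines for business intelligence",
]
new_posts = ["Upgrade data systems and reporting tools"]

vectorizer = TfidfVectorizer(stop_words="english", lowercase=True)

# Learn vocabulary and idf weights from the postings available now
X_existing = vectorizer.fit_transform(existing_posts)

# Reuse the same vocabulary for later postings; unseen words are ignored
X_new = vectorizer.transform(new_posts)

print(X_existing.shape, X_new.shape)  # both share the same number of columns
```

Keeping the fitted vectorizer around (for example by pickling it) is what lets later postings be compared against the existing ones in the same feature space.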

In [20]:
original = jobText.shape
preprocessed = jobtext_TFIDF.shape

print("Original raw data df shape: " + str(original))
print("Preprocessed data shape: " + str(preprocessed))

Original raw data df shape: (450, 3)
Preprocessed data shape: (450, 10255)


In [None]:
# Word frequency: how many words per post
# Text cleaning: lower casing, extra white-space removal, lemmatization

# Determine most common words that occur in the job descriptions. 

# Predetermine a list of expected lookup terms for dictionary of skills
# BOW - Create a list of dictionaries containing word counts for each job posting
# Table with skill, count, percentage
# Wordcloud
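
One way the skill lookup and the skill/count/percentage table could be approached is sketched below. The `SKILLS` list and the sample postings are placeholders, not taken from the scraped data, and matching is a simple whole-word, case-insensitive search.

```python
import re
from collections import Counter

# Placeholder lookup terms; the real list would be chosen by hand
SKILLS = ["python", "sql", "excel", "tableau"]

def skill_table(posts, skills=SKILLS):
    """Count postings mentioning each skill; return (skill, count, percent) rows."""
    counts = Counter()
    for post in posts:
        for skill in skills:
            # \b keeps 'sql' from matching inside longer words
            if re.search(rf"\b{re.escape(skill)}\b", post, flags=re.IGNORECASE):
                counts[skill] += 1
    total = len(posts)
    return [(s, counts[s], 100 * counts[s] / total) for s in skills]

posts = [
    "Strong SQL and Python skills required.",
    "Advanced Excel; SQL is an asset.",
    "Experience with Tableau dashboards.",
    "Python scripting for data pipelines.",
]
for skill, count, pct in skill_table(posts):
    print(f"{skill:10s}{count:3d}{pct:7.1f}%")
```

Each posting is counted at most once per skill, so the percentage reads as the share of postings that mention the skill rather than the number of mentions.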

In [None]:
# Word2Vec - words used in similar contexts end up close together in the vector space

# Topic modelling - where skills is considered a topic

# NER with BERT