<h1>Part 2 : Pre-Processing of Job Description Text</h1>

<h3> Import Packages</h3>

In [1]:
import csv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import regex as re

import os
import warnings
warnings.filterwarnings('ignore')

from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

# If you haven't already done so, execute:
#import nltk
#nltk.download('punkt')

In [2]:
from sqlalchemy import create_engine, MetaData, Table

# Query SQL db to get analyst job descriptions first

# Create connection to db
engine = create_engine("sqlite:///joblist.sqlite")
print(engine.table_names())

# Load in data table
metadata = MetaData()

data = Table('data', metadata, autoload=True, autoload_with=engine)

print(data.columns.keys())
print(repr(metadata.tables['data']))

['data']
['jobtitle', 'company', 'location', 'salary', 'jobdescription', 'label']
Table('data', MetaData(bind=None), Column('jobtitle', VARCHAR(length=100), table=<data>), Column('company', VARCHAR(length=100), table=<data>), Column('location', VARCHAR(length=25), table=<data>), Column('salary', INTEGER(), table=<data>), Column('jobdescription', TEXT(), table=<data>), Column('label', INTEGER(), table=<data>), schema=None)


In [3]:
from sqlalchemy import select

# Build query
stmt = select([data.columns.jobdescription, data.columns.label])
stmt = stmt.where(data.columns.label == '0') # 0 = analysts

# Create connection to engine
connection = engine.connect()

# Execute query
results = connection.execute(stmt).fetchall()
print(results[0].keys())

['jobdescription', 'label']


In [4]:
# Create dataframe from SQLAlchemy ResultSet
df_data = pd.DataFrame(results)

# Give columns proper heading
df_data.columns = results[0].keys()
print(df_data.head())

                                      jobdescription  label
0  Position Title:Pricing Analyst\nPosition Type:...      0
1  Title: Senior Data Analyst - Telephony Manager...      0
2  We are looking for a talented Fuel Cell Data E...      0
3  CAREER OPPORTUNITY\nSENIOR METER DATA ANALYST\...      0
4  The Data Engineer reports directly to the Dire...      0


In [5]:
df_data.describe()

Unnamed: 0,label
count,900.0
mean,0.0
std,0.0
min,0.0
25%,0.0
50%,0.0
75%,0.0
max,0.0


In [6]:
df_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 900 entries, 0 to 899
Data columns (total 2 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   jobdescription  900 non-null    object
 1   label           900 non-null    int64 
dtypes: int64(1), object(1)
memory usage: 14.2+ KB


In [18]:
df_data['jobdescription'] = df_data['jobdescription'].astype('string')

In [19]:
df_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 450 entries, 0 to 449
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   jobdescription  450 non-null    string
 1   label           450 non-null    int64 
 2   length          450 non-null    int64 
dtypes: int64(2), string(1)
memory usage: 10.7 KB


Add a column for the length of each description

In [13]:
#df_data['length'] = df_data['jobdescription'].str.len()
#df_data.head()

Unnamed: 0,jobdescription,label,length
0,Position Title:Pricing Analyst\nPosition Type:...,0,3404
1,Title: Senior Data Analyst - Telephony Manager...,0,2471
2,We are looking for a talented Fuel Cell Data E...,0,2152
3,CAREER OPPORTUNITY\nSENIOR METER DATA ANALYST\...,0,3825
4,The Data Engineer reports directly to the Dire...,0,3819


<h2>Text Analysis of Job Description</h2>

<h3>Pre-processing</h3>

All the job postings currently available will be considered the "test" set and will not be split. Any predictions will be made on new postings scraped at a later date.

The pre-processing pipeline will include the following steps:
- Word count
- Tokenization
- Lower casing
- Stopword removal
- Stemming
- transformation using TF-IDF

In [20]:
# Select job description from df_data

text = df_data['jobdescription'][3]
text

'CAREER OPPORTUNITY\nSENIOR METER DATA ANALYST\nLocation: Mississauga, ON\nIf you are an ambitious, curious, hard-working, and seasoned energy professional, this is an exciting new opportunity for you to join the Rodan Energy team – an organization that has been recognized as one of Canada’s Top Small & Medium Employers for three years in a row.\nRodan Energy is a top North American energy services company, quickly growing to serve our clients across North America. We are a leader in distributed energy resource asset optimization, demand response, energy management and intelligence systems, and power systems engineering and metering services. Our innovative solutions are designed for large commercial and industrial energy users as well as power distribution and generation companies. We focus on minimizing energy costs while maximizing efficiencies and corporate sustainability for our clients. Our goal is a sustainable energy future.\nWe have an opportunity for a Senior Meter Data Analy

Note that here in the jobdescription, some of the new-line (\n) 

In [17]:
# Tokenize text
# Maybe add len(token) to dataframe and plot

text_token = word_tokenize(str(text))
print(len(text_token))

609


<h4>1. Tokenize Words</h4>

In [12]:
tokens = word_tokenize(str(df_data['Job Description']))

<h4>2. Stemming Words</h4>

In [13]:
porter = PorterStemmer()
df_data['Job Description'] = df_data['Job Description'].apply(porter.stem)

<h4>3. Stopword Removal & Lowercasing</h4>

In [14]:
# Initialize TFIDF Vectorizer
cv = TfidfVectorizer(stop_words = 'english', lowercase= True)

<h4>4. Apply TF-IDF</h4>

In [15]:
jobText_CV = cv.fit_transform(jobText['Job Description'])

In [None]:
# Observe TFIDF Weights


<h3>Text Pre-processing Pipeline Function</h3>

See DSDJ Feature Engineering pt 2 for a train & test version of the Tfidf vectorizer function.

In [16]:
# Define a function that cleans and performs a TFIDF transformation to text data
tfidf = TfidfVectorizer(stop_words='english', lowercase=True)

def tfidf_pipeline(txt):
    txt = txt.apply(porter.stem) # Apply Stemming
    x = tfidf.fit_transform(txt) # Apply Vectorizer, Stopword Removal, & Lowercasing
    return x 

In [17]:
jobtext_TFIDF = tfidf_pipeline(jobText['Job Description'])

In [20]:
original = jobText.shape
preprocessed = jobtext_TFIDF.shape

print("Original raw data df shape: " + str(original))
print("Preprocessed data shape: " + str(preprocessed))

Original raw data df shape: (450, 3)
Preprocessed data shape: (450, 10255)


In [None]:
jobText_TFIDF.head()

In [None]:
# word frequency : how many words per post.
# Text cleaning : lower casing, extra white-space removal, lemmatization

# Determine most common words that occur in the job descriptions. 
# Predetermine a list of expected lookup terms for dictionary of skills

# BOW - Create a list of dictionaries containing word counts for each job posting

# Table with skill, count, percentage

# Wordcloud

In [None]:
#Words2Vec - similar words are closer together in a sentence

# Topic modelling - where skills is considered a topic

# NER with BERT