# Train Trigram Phrase Model
*David Norrish, October 2019*

We will use Gensim to detect common bigrams and trigrams in the corpus of job descriptions. The trigram model will be used in subsequent text-processing steps so that concepts like "statistical modelling" can be reliably distinguished from "data modelling".

In [1]:
from pathlib import Path

import pandas as pd
from pandas import DataFrame

import spacy

In [2]:
DATA_PATH = Path("../data")
RAW_PATH = DATA_PATH / "raw"
# Discover all CSVs with job descriptions
JOB_PATHS = list(RAW_PATH.glob("*jobs.csv"))

In [3]:
JOB_PATHS

[PosixPath('../data/raw/data_engineer_jobs.csv'),
 PosixPath('../data/raw/control_jobs.csv'),
 PosixPath('../data/raw/data_scientist_jobs.csv')]

Read in an example jobs CSV to see the structure

In [4]:
print(f"Read in {JOB_PATHS[0]}:")
jobs_df = pd.read_csv(JOB_PATHS[0])
jobs_df.head()

Read in ../data/raw/data_engineer_jobs.csv:


Unnamed: 0,title,institution,date,text,required skills,applicants skills
0,Data Pipeline Engineer,Data Processors,30/09/2019,Data Processors is a data centric research and technology services company. We offer a unique do...,,
1,Data Engineer,Aurecon,02/10/2019,"At Aurecon we see the future through a very different lens. Do you?\n\nInnovation, eminence and ...",,
2,"Senior Big Data Engineer SQL,Python experience, exp in AWS or Azure a must!",Counterpoint Group,02/10/2019,"Our Client seeks people who are truly passionate about development within the Big Data space, wh...",,
3,"Senior Data Engineer, Melbourne",EY,23/09/2019,Variety of work and career development opportunities\n Choose a career that connects you ...,,
4,Data Analyst or ETL Developer,Private Advertiser,03/10/2019,"We are looking for a Data Analyst or Senior Data Analyst, or ETL Developer, Senior ETL, ETL Tech...",,


Load in the spaCy model that will be used to parse job descriptions.

In [5]:
%%time
nlp = spacy.load("en_core_web_md")

CPU times: user 19.6 s, sys: 496 ms, total: 20.1 s
Wall time: 20.1 s


## 1. Prepare text for phrase modelling
Phrase modelling only needs to consider adjacent words within a given sentence, not whole documents or consecutive sentences.

As such, for all position descriptions, normalise the text (lemmatise, lower-case and drop punctuation) and save it to a single file, with each sentence or bullet point on a separate line.

In [6]:
# Define a couple of generator functions to handle the cleaning
def line_generator(df: DataFrame, col: str):
    """Generator for stripped lines of text from a DataFrame column"""
    for text in df[col]:
        # Empty cells are read in by pandas as NaN (a float), so skip
        # anything that is not a string instead of dropping into a debugger
        if not isinstance(text, str):
            continue
        for line in text.split('\n'):
            yield line.strip()

def parse_lines(df: DataFrame, col: str):
    """
    spaCy-parse text in a DataFrame column line by line
    
    spaCy won't detect newlines as sentence boundaries, so must do this explicitly first
    """
    for parsed_text in nlp.pipe(line_generator(df, col)):
        for sent in parsed_text.sents:
            yield ' '.join([token.lemma_.lower() for token in sent if not (token.is_punct or token.is_space)])
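
To see the line-by-line normalisation without loading spaCy, here is a minimal sketch. It is a simplified stand-in, not the pipeline above: it does no lemmatisation or sentence splitting, and the helper names `normalise_line` and `normalise_text` and the example text are made up for illustration.

```python
import string


def normalise_line(line: str) -> str:
    """Lower-case a line and strip punctuation from the edges of each token."""
    tokens = line.lower().split()
    stripped = [tok.strip(string.punctuation) for tok in tokens]
    # Tokens that were pure punctuation (e.g. a bullet "-") are now empty
    return " ".join(tok for tok in stripped if tok)


def normalise_text(text: str):
    """Yield one normalised line per non-empty input line."""
    for line in text.split("\n"):
        cleaned = normalise_line(line.strip())
        if cleaned:
            yield cleaned


# Made-up job description snippet with a heading and two bullet points
example = "Key skills:\n  - SQL, Python\n\n  - Data pipelines!"
normalised = list(normalise_text(example))
print(normalised)
```

Each bullet point ends up on its own line, so a phrase model trained on this output never sees the last word of one bullet as adjacent to the first word of the next.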

In [7]:
CLEANED_PATH = DATA_PATH / "cleaned"
LINES_PATH = CLEANED_PATH / "normalised_jobs.txt"

In [8]:
%%time
if not LINES_PATH.exists():
    with open(str(LINES_PATH), 'w') as fhand:
        for path in JOB_PATHS:
            df = pd.read_csv(path)
            for line in parse_lines(df, "text"):
                if not (line == "" or line.isspace()):
                    fhand.write(line + '\n')

CPU times: user 12.8 s, sys: 328 ms, total: 13.1 s
Wall time: 13.1 s


## 2. Train Bigram Model
To train a trigram model using Gensim, we must first train a bigram model, "bigramize" the text (by joining bigram phrases with "\_"), then repeat the process on the bigramized text.
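
The two-pass idea can be sketched in plain Python. This is a toy under stated assumptions, not Gensim's algorithm: it keeps any adjacent pair seen at least `min_count` times and joins pairs greedily from left to right, whereas Gensim's `Phrases` also applies a score threshold. The helpers `learn_pairs` and `join_pairs` and the toy sentences are hypothetical.

```python
from collections import Counter


def learn_pairs(sentences, min_count=2):
    """Return the adjacent token pairs seen at least `min_count` times."""
    counts = Counter()
    for tokens in sentences:
        counts.update(zip(tokens, tokens[1:]))
    return {pair for pair, n in counts.items() if n >= min_count}


def join_pairs(tokens, pairs):
    """Greedily join known pairs, left to right, with an underscore."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in pairs:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


# Made-up, already normalised sentences
unigrams = [
    "strong write communication skill".split(),
    "excellent write communication skill".split(),
    "good communication skill".split(),
]

# Pass 1: learn pairs on single words, then join them
bigram_pairs = learn_pairs(unigrams)
first_pass = [join_pairs(s, bigram_pairs) for s in unigrams]

# Pass 2: learn pairs on the joined text, so a "pair" can now span three words
trigram_pairs = learn_pairs(first_pass)
second_pass = [join_pairs(s, trigram_pairs) for s in first_pass]

for sent in second_pass:
    print(" ".join(sent))
```

The second pass treats `write_communication` as a single token, which is how a model that only ever looks at pairs can end up producing three-word phrases.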

In [9]:
# The Phrases class trains a phrase model. The Phraser class is a wrapper that
# cuts memory consumption of a phrase model by discarding state not needed for bigram detection
from gensim.models.phrases import Phrases, Phraser
# Load in an iterator for lines of text saved to a file
from gensim.models.word2vec import LineSentence

In [10]:
MODELS_PATH = DATA_PATH / "models"
BIGRAM_MODEL_PATH = MODELS_PATH / "bigram_model.model"

In [11]:
# Create an iterator of the normalised texts
unigram_sentences = LineSentence(str(LINES_PATH))

In [12]:
def train_phrase_model(text_iterator, output_path: Path):
    """
    Train and save a phrase model. `text_iterator` should yield lists of tokens,
    e.g. Gensim's LineSentence class

    If a model has already been trained, load from disk.
    """
    if not output_path.exists():
        phrase_model = Phrases(text_iterator)
        phrase_model.save(str(output_path))    
        print("Phrase model saved to", output_path)
    else:
        phrase_model = Phrases.load(str(output_path))
        print("Loaded pre-trained phrase model from", output_path)
    return Phraser(phrase_model)
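
`Phrases(text_iterator)` above relies on the library defaults. Gensim's documentation describes the default phrase score as `(count(a b) - min_count) * N / (count(a) * count(b))`, where `N` is the vocabulary size, and a pair is joined when its score exceeds `threshold` (defaults of `min_count=5` and `threshold=10.0` are assumed here). The helper `default_score` below is a hypothetical re-implementation for intuition only, not Gensim's code, and all counts are made up.

```python
def default_score(count_a, count_b, count_ab, vocab_size, min_count=5):
    """Hypothetical helper mirroring the default scoring formula in Gensim's docs."""
    return (count_ab - min_count) * vocab_size / (count_a * count_b)


THRESHOLD = 10.0  # assumed Gensim default

# Made-up counts: two rare words that almost always appear together
rare_pair = default_score(count_a=10, count_b=20, count_ab=9, vocab_size=1000)

# Made-up counts: two very frequent words that are adjacent equally often
common_pair = default_score(count_a=500, count_b=400, count_ab=9, vocab_size=1000)

print(rare_pair, rare_pair > THRESHOLD)
print(common_pair, common_pair > THRESHOLD)
```

Dividing by the individual word counts means a pair only scores highly when the words occur together more often than their overall frequency would suggest. On a small corpus it may be worth passing explicit `min_count` and `threshold` values to `Phrases` rather than relying on the defaults.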

In [13]:
bigram_model = train_phrase_model(unigram_sentences, BIGRAM_MODEL_PATH)

Loaded pre-trained phrase model from ../data/models/bigram_model.model


Bigramize the unigram texts and save to file.

In [14]:
BIGRAM_LINES_PATH = CLEANED_PATH / 'bigram_lines.txt'

In [15]:
%%time
def apply_phrase_model(line_iterator, phrase_model, output_path: Path):
    """Apply a phrase model to text and save to file"""    
    if not output_path.exists():
        with open(str(output_path), 'w') as fhand:
            for sent in line_iterator:
                phrase_line = ' '.join(phrase_model[sent])
                fhand.write(phrase_line + '\n')
        print("Saved phrased lines to", output_path)
    else:
        print(output_path, "already exists")

apply_phrase_model(unigram_sentences, bigram_model, BIGRAM_LINES_PATH)

Saved phrased lines to ../data/cleaned/bigram_lines.txt
CPU times: user 425 ms, sys: 7.9 ms, total: 433 ms
Wall time: 438 ms


Inspect the resulting bigramized text.

In [16]:
bigram_sentences = LineSentence(str(BIGRAM_LINES_PATH))

In [17]:
from itertools import islice
import re

# Look for some examples of bigrams
def get_example_phrases(sentence_iterator, n_grams=2, num_examples=5):
    """
    Takes an iterator of sentences as lists of tokens and
    prints some examples that include phrases
    """
    regex_pattern = "_"
    for i in range(n_grams - 2):
        regex_pattern += "[a-z]*_"

    examples_found = 0
    for tok_list in sentence_iterator:
        sent = ' '.join(tok_list)
        if re.search(regex_pattern, sent):
            examples_found += 1
            print(f"{examples_found}. {sent}\n")
        if examples_found >= num_examples:
            break

get_example_phrases(bigram_sentences, n_grams=2)

1. -pron- offer a unique domain to work within -pron- be a world_class provider of financial statistical_modelling couple with unbeatable employee condition and benefit

2. -pron- be seek direct i.e. not via a recruiter application from talented developer who possess the desire and willingness to solve_problem of importance to business_outcome work well within -pron- team and have a dedicated and professional outlook on software_development process

3. in_addition to be an excellent software developer -pron- will_also specialise have a desire to learn the follow area

4. data_pipeline development

5. a bachelor 's degree_in software_engineering computer_science or information system
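
The search pattern in `get_example_phrases` grows by one `[a-z]*_` segment per extra word, so with `n_grams=3` it only matches tokens containing at least two underscores. A small standalone check, where the helper name `build_pattern` is ours and introduced purely for illustration:

```python
import re


def build_pattern(n_grams: int) -> str:
    """Rebuild the regex used in get_example_phrases for a given phrase length."""
    pattern = "_"
    for _ in range(n_grams - 2):
        pattern += "[a-z]*_"
    return pattern


print(build_pattern(2))
print(build_pattern(3))

# A line containing only a two-word phrase matches the bigram pattern
# but not the trigram pattern
bigram_line = "data_pipeline development"
print(bool(re.search(build_pattern(2), bigram_line)))
print(bool(re.search(build_pattern(3), bigram_line)))

# A three-word phrase matches both
trigram_line = "verbal and write_communication_skill"
print(bool(re.search(build_pattern(3), trigram_line)))
```

Because `[a-z]*` only allows lower-case letters between the underscores, a three-word phrase whose middle word contains a digit or a full stop would be missed by the trigram pattern.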



## 3. Train the Trigram Model

In [18]:
TRIGRAM_MODEL_PATH = MODELS_PATH / 'trigram_model.model'

In [19]:
trigram_model = train_phrase_model(bigram_sentences, TRIGRAM_MODEL_PATH)

Loaded pre-trained phrase model from ../data/models/trigram_model.model


In [20]:
TRIGRAM_LINES_PATH = CLEANED_PATH / "trigram_lines.txt"

apply_phrase_model(bigram_sentences, trigram_model, TRIGRAM_LINES_PATH)

Saved phrased lines to ../data/cleaned/trigram_lines.txt


Check some examples

In [21]:
trigram_sentences = LineSentence(str(TRIGRAM_LINES_PATH))

In [22]:
get_example_phrases(trigram_sentences, n_grams=3)

1. demonstrate outstanding verbal and write_communication_skill include the ability_to communicate effectively with technical and non_technical colleague

2. high tertiary_qualification e.g. masters_or_phd

3. -pron- will not consider visa sponsorship for_this_position and -pron- will not consider candidate who be locate overseas unless -pron- be return to australia within the next calendar month

4. the application_form will include_these_question

5. which of the follow_statement_best_describe -pron- right to work in_australia

