In [157]:
import pickle
import json
from collections import Counter
from itertools import groupby
from itertools import chain

import spacy
from gensim.models import Phrases
from gensim.models.phrases import Phraser
from gensim.corpora import Dictionary

# A: Loading the JSON Data

In [2]:
data = []
with open('../data/corpus.txt') as file:
    for line in file:
        data.append(json.loads(line))

In [3]:
print('number of documents in the JSON file corpus.txt: ', len(data))

number of documents in the JSON file corpus.txt:  10000


## A.1: Extracting Relevant Text

Let's look at a single document from the JSON file to understand its structure:

In [4]:
(data[2])

{'author': {'string': 'Chicago Tribune'},
 'crawlName': {'string': 'chicago_tribue_business'},
 'date': 1507161600000,
 'html': '',
 'humanLanguage': {'string': 'en'},
 'pageUrl': 'http://www.chicagotribune.com/news/local/breaking/ct-gunman-reserved-two-rooms-at-blackstone-photos-20171005-photogallery.html',
 'siteName': {'string': 'chicagotribune.com'},
 'tags': [{'count': 1,
   'label': 'The Blackstone Hotel',
   'rdfTypes': [],
   'score': 0.38,
   'uri': 'http://dbpedia.org/page/The_Blackstone_Hotel'},
  {'count': 1,
   'label': 'Chicago Police Department',
   'rdfTypes': [],
   'score': 0.37,
   'uri': 'http://dbpedia.org/page/Chicago_Police_Department'},
  {'count': 1,
   'label': 'music festival',
   'rdfTypes': [],
   'score': 0.24,
   'uri': 'http://dbpedia.org/page/Music_festival'},
  {'count': 1,
   'label': 'mass shooting',
   'rdfTypes': [],
   'score': 0.19,
   'uri': 'http://dbpedia.org/page/Mass_shooting'},
  {'count': 1,
   'label': 'Lollapalooza',
   'rdfTypes': [],
 

It appears from the sample document above that each article's title and main body are stored in the document's `title` and `text` attributes as strings respectively.

**Sanity checks to confirm whether the JSON object `data` maintains the structure seen above:**

In [5]:
article_lengths = [len(doc['text']) for doc in data]   # list of the lengths of each article
article_lengths_count = Counter(article_lengths)       # dict mapping article length to frequency of occurrence
print('5 most frequent counts of article lengths (no. of characters in the article):')
print(article_lengths_count.most_common(5))

5 most frequent counts of article lengths (no. of characters in the article):
[(1, 7793), (0, 43), (1068, 4), (2714, 4), (482, 4)]


There are `7793` documents with article length `1` and `43` with length `0`.  This seems unusual and must be investigated:

In [6]:
print('A document having length of text == 0: format -> (title, text, URL)')
(next(iter(((doc['title'] , doc['text'], doc['pageUrl']) for doc in data if len(doc['text']) == 0))))

A document having length of text == 0: format -> (title, text, URL)


("Craig Robinson and Adam Scott buddy up in Fox's supernatural comedy 'Ghosted'",
 '',
 'http://www.chicagotribune.com/entertainment/tv/la-et-st-ghosted-review-20170930-story.html')

The sample document above has its article text missing.  The URL associated with the article shows that the full article text is available to read.  The same observation applies to a few other similar documents with zero article lengths that were explored.  It might be possible to acquire the full article text for such articles later.  Since article title constitutes a useful signal, these articles will be retained for topic modeling.  

Let's check if any documents have missing article text as well as title; if any are found, they must be deleted:

In [7]:
print('No. of documents missing both title as well as text: ', end='')
print(len([doc for doc in data if len(doc['text']) == 0 and len(doc['title']) == 0]))

No. of documents missing both title as well as text: 0


Next, let's look at a sample document with article length `1`:

In [8]:
print('A document having length of text == 1: format -> (title, text, URL)')
next(iter(((doc['title'] , doc['text'], doc['pageUrl']) for doc in data if len(doc['text']) == 1)))

A document having length of text == 1: format -> (title, text, URL)


({'string': 'Losses for banks and smaller companies take US stocks lower | The Sacramento Bee'},
 {'string': "U.S. stock indexes are slipping back from record highs Tuesday as banks and small companies fall. Travel booking sites Priceline and TripAdvisor are taking steep losses following their third-quarter reports and retailers are falling too. Companies that pay big dividends, including utilities, are making gains. Oil prices are down slightly after they jumped to two-year highs a day ago.\nKEEPING SCORE: The Standard & Poor's 500 index lost 4 points, or 0.2 percent, to 2,586 as of 3 p.m. Eastern time. The Dow Jones industrial average slipped 26 points, or 0.1 percent, to 23,521. The Nasdaq composite fell 26 points, or 0.4 percent, to 6,759. Smaller companies were on track for their worst loss since early August. The Russell 2000 index tumbled 18 points, or 1.2 percent, to 1,479 as Wall Street continued to watch for signs of progress by House Republicans on their proposed tax cuts. I

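The structure of the JSON document above is different from the first sample document: here, `title` and `text` are dictionaries that wrap the actual string under a `'string'` key, rather than plain strings.  This must be taken into account while extracting article titles and text from the JSON object `data`.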
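
The article length of `1` reported earlier is consistent with this structure: calling `len()` on a dictionary counts its keys, not the characters of the string it wraps.  A minimal illustration, using made-up records rather than documents from the corpus:

```python
# Hypothetical records mimicking the two layouts seen in the corpus.
plain_doc = {'title': 'A title', 'text': 'Some article text.'}
wrapped_doc = {'title': {'string': 'A title'}, 'text': {'string': 'Some article text.'}}

plain_len = len(plain_doc['text'])                   # number of characters in the string
wrapped_len = len(wrapped_doc['text'])               # number of keys in the dict
unwrapped_len = len(wrapped_doc['text']['string'])   # characters of the wrapped string

print(plain_len, wrapped_len, unwrapped_len)
```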

**The fields of interest in the JSON documents are: `text` and `title`.**  

Note:  
The `tags` field is an array of entities that Diffbot extracts from the article through text analysis (reference: https://www.diffbot.com/dev/docs/article/).  Each entity has a label (its name) and a relevance score.  However, a cursory exploration of these tags for a few documents revealed that some of the entities identified by Diffbot are not relevant to the article.  Hence, the `tags` field was not used in this work.

In [9]:
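def get_title_and_text(doc):
    """Returns a tuple of strings representing the article's title and text from the JSON document doc."""
    title = doc['title']
    text = doc['text']
    # some documents wrap each value in a dict of the form {'string': ...};
    # unwrap title and text independently so that either layout is handled
    if isinstance(title, dict):
        title = title['string']
    if isinstance(text, dict):
        text = text['string']
    return title, text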

In [10]:
articles = []

# track each document's index in the JSON array.
# the document index can be used to map any article to all
# attributes available in the original JSON object 'data'.
for ind, doc in enumerate(data):
    title, text = get_title_and_text(doc)
    articles.append([ind, title, text])

**Sanity check:**

In [11]:
print('No. of documents with article length == 1 after text extraction: ', end='')
print(len([art for art in articles if len(art[2]) == 1]))

No. of documents with article length == 1 after text extraction: 0


In [12]:
# Save the list 'articles' containing relevant information in the format:
# [[document index in the original JSON array, article title, article text]]
with open('../data/articles.pkl', 'wb') as file:
    pickle.dump(articles, file)

In [12]:
with open('../data/articles.pkl', 'rb') as file:
    articles = pickle.load(file)

# B: Tokenization

In the next section, phrase detection is performed using the `Gensim` library.  To work with `Gensim`, each article's text must be transformed into a list of sentences, with each sentence being a list of word tokens.  This process is performed using the `SpaCy` NLP library in this section.  

**Storing parts-of-speech:**  
The final part of the phrase detection process in the next section involves filtering the identified bigrams and trigrams based on whether or not they match specific part-of-speech templates/patterns.  For this purpose, along with storing word tokens, their corresponding parts-of-speech are also stored in this section.

**Lemmatization:**  
Lemmatization converts a word to its base form, using knowledge of the part of speech of that word.  This helps in normalizing the text, and hence the lemmatized forms of the word tokens have been stored.

In [10]:
# load SpaCy's default NLP model for English
# ('en' is a shortcut link available in SpaCy 2.x; SpaCy 3 needs a full model name)
nlp = spacy.load('en')

## B.1: Tokenization of the Articles' Text

In [14]:
# text_index_all yields (text, index) tuples as required by .pipe() later:
# the article text yielded has the title joined to it
# index is the document index in the original JSON array
text_index_all = (('. '.join([title, text]), index) for index, title, text in articles)

# generator yielding parsed SpaCy Doc objects, which are sequences of tokens:
docs = nlp.pipe(text_index_all, as_tuples=True, batch_size=1000)

# unigram_sents_pos will store lists of lemmatized tokens and their parts-of-speech (pos) for each sentence
# format: [[document index, lemmatized tokens list, tokens' pos tag list], ...]
unigram_sents_pos = []

for parsed_text, index in docs:
    for sent in parsed_text.sents:
        # lemmatize tokens & save corresponding pos tags after filtering out whitespace and punctuation
        tokenized_sent = [(token.lemma_, token.pos_) for token in sent if not (token.is_space or token.is_punct)]
        if len(tokenized_sent) != 0:
            # separate out lemmatized tokens and pos tags
            tokens, pos = list(zip(*tokenized_sent))
            unigram_sents_pos.append([index, tokens, pos])

In [18]:
with open('../data/unigram_sents_pos.pkl', 'wb') as file:
    pickle.dump(unigram_sents_pos, file)

In [2]:
with open('../data/unigram_sents_pos.pkl', 'rb') as file:
    unigram_sents_pos = pickle.load(file)

`unigram_sents_pos` is a list of lists, with each constituent list representing a **sentence** in an article.

It might be helpful to see the structure of `unigram_sents_pos` to understand the rest of the code.  Let's look at the representation of the article with index # `2` after tokenization:

In [6]:
[art for art in articles if art[0] == 2]   # 3rd article in the JSON array (at index 2)

[[2,
  'Gunman reserved two rooms at Blackstone',
  "Chicago police are investigating whether Stephen Paddock, the gunman in the Las Vegas mass shooting, booked rooms at the Blackstone Hotel during this summer's Lollapalooza music festival, held across the street in Grant Park."]]

In [7]:
# the 3rd article's representation after tokenization 
[sent for sent in unigram_sents_pos if sent[0] == 2]

[[2,
  ('gunman', 'reserve', 'two', 'room', 'at', 'blackstone'),
  ('PROPN', 'VERB', 'NUM', 'NOUN', 'ADP', 'PROPN')],
 [2,
  ('chicago',
   'police',
   'be',
   'investigate',
   'whether',
   'stephen',
   'paddock',
   'the',
   'gunman',
   'in',
   'the',
   'las',
   'vegas',
   'mass',
   'shooting',
   'book',
   'room',
   'at',
   'the',
   'blackstone',
   'hotel',
   'during',
   'this',
   'summer',
   "'s",
   'lollapalooza',
   'music',
   'festival',
   'hold',
   'across',
   'the',
   'street',
   'in',
   'grant',
   'park'),
  ('PROPN',
   'NOUN',
   'VERB',
   'VERB',
   'ADP',
   'PROPN',
   'PROPN',
   'DET',
   'NOUN',
   'ADP',
   'DET',
   'PROPN',
   'PROPN',
   'NOUN',
   'NOUN',
   'VERB',
   'NOUN',
   'ADP',
   'DET',
   'PROPN',
   'PROPN',
   'ADP',
   'DET',
   'NOUN',
   'PART',
   'PROPN',
   'NOUN',
   'NOUN',
   'VERB',
   'ADP',
   'DET',
   'NOUN',
   'ADP',
   'PROPN',
   'PROPN')]]

## B.2: Phrase Detection

Naturally occurring bigram and trigram phrases (e.g. 'Bank of America') are identified in this section using the `Gensim` module.  The method for phrase detection used in this section relies on the calculation of the Normalized Pointwise Mutual Information (NPMI) score [ref: G. Bouma, “Normalized (pointwise) mutual information in collocation extraction,” Proceedings of GSCL, pp. 31–40, 2009.].  The higher the NPMI score for a set of two word tokens, the greater the likelihood of those words being part of a phrase.  All pairs of words with NPMI greater than a specified threshold are treated as phrases.
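
To make the score concrete, here is a minimal pure-Python sketch of NPMI as defined by Bouma: the pointwise mutual information of a word pair divided by the negative log of the pair's joint probability, which bounds the score between -1 and 1.  The function name, the toy counts, and the estimation of probabilities as simple count ratios are assumptions of this sketch; it is not `Gensim`'s implementation.

```python
from math import log

def npmi(count_a, count_b, count_ab, total_count):
    """NPMI of a word pair computed from raw counts.

    Probabilities are estimated as count / total_count, which is a
    simplifying assumption of this sketch.
    """
    p_a = count_a / total_count
    p_b = count_b / total_count
    p_ab = count_ab / total_count
    pmi = log(p_ab / (p_a * p_b))
    return pmi / -log(p_ab)

# Hypothetical counts: two words that only ever occur together ...
always_together = npmi(count_a=50, count_b=50, count_ab=50, total_count=10000)
# ... and two words that co-occur exactly as often as chance predicts.
independent = npmi(count_a=100, count_b=100, count_ab=1, total_count=10000)

print(always_together, independent)
```

In this toy setting the first pair scores close to 1 and the second close to 0, so a threshold between the two would keep only the first pair as a phrase.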

**Tokens in phrases are joined using `__`**:  
Double underscores are used in the code below (via the `delimiter` argument supplied to `Phrases`) to join tokens that are parts of phrases.  The more commonly seen glue character `_` (single underscore) is not used here because the text contains several instances of `_`, e.g. as a part of Twitter handles, hashtags, etc.  Using a double underscore, which does not appear anywhere in the tokens from the original text, avoids a mix-up between paired words and unpaired words that contain an underscore.
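
The ambiguity that the double underscore avoids can be illustrated with a few made-up tokens (the handle below is hypothetical, not taken from the corpus):

```python
# Hypothetical tokens: one detected phrase and one token that already contains '_'.
tokens = ['new__york', '@some_handle', 'market']

# With '__' as the glue, a substring test finds phrases only.
phrases_double = [tok for tok in tokens if '__' in tok]
# With '_' as the glue, the handle would be indistinguishable from a phrase.
phrases_single = [tok for tok in tokens if '_' in tok]

print(phrases_double)   # ['new__york']
print(phrases_single)   # ['new__york', '@some_handle']
```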

In [4]:
# gensim's phrase detector expects an iterable of sentences, 
# with each sentence being a list of word tokens
unigram_sentences = [tokens for index, tokens, pos in unigram_sents_pos]

# common_terms (stop words) passed to gensim's phrase detector are 
# ignored when determining the frequency count based NPMI score of the 
# phrases that they are a part of. Thus, their presence between two 
# words won’t hinder detection of phrases like “Bank of America”.
common_terms = ['the', 'and', 'or', 'of', 'in', 'at', 'on']

# Train a first-order phrase detector
bigram_model = Phrases(unigram_sentences, threshold=0.6, scoring='npmi', common_terms=common_terms, delimiter=b'__')
bigram_phraser = Phraser(bigram_model)   # object to apply bigram model to tokens
bigram_sentences = bigram_phraser[unigram_sentences]

# Train a second-order phrase detector
trigram_model = Phrases(bigram_sentences, threshold=0.5, scoring='npmi', common_terms=common_terms, delimiter=b'__')
trigram_phraser = Phraser(trigram_model)  # object to apply trigram model to tokens
trigram_sentences = trigram_phraser[bigram_sentences]
# convert to list (from gensim's TransformedCorpus type) for processing downstream
trigram_sentences = list(trigram_sentences)

In [5]:
paired_words = (word for sent in trigram_sentences for word in sent if '__' in word)
paired_words_frq = Counter(paired_words)
paired_words_frq.most_common(20)

[('do__not', 9379),
 ('more__than', 3601),
 ('such__as', 2070),
 ('as__well', 2020),
 ('last__year', 1830),
 ('per__cent', 1355),
 ('new__york', 1244),
 ('united__states', 1156),
 ('year__ago', 888),
 ('last__week', 857),
 ('sign__up', 776),
 ('talk__about', 764),
 ('white__house', 717),
 ('more__information', 683),
 ('high__school', 655),
 ('social__medium', 603),
 ('long__term', 590),
 ('earlier__this', 561),
 ('north__korea', 554),
 ('less__than', 550)]

## B.3: Filter Phrases Based on Part-of-Speech Templates

In [6]:
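def filter_pairs(tokens_paired, tokens_original, pos_original):
    """Filter the phrases in tokens_paired (modified in place) against part-of-speech templates.

    Phrases that fail the templates are removed, and their constituent tokens
    from tokens_original are appended individually to the end of tokens_paired.
    """
    skip = 0        # to help track current word index while filtering
    to_remove = []  # indices of word tokens to be removed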
    
    for i in range(len(tokens_paired)):
        word = tokens_paired[i]
        if '__' in word:   # indicates phrase, i.e. paired words
            num_paired = word.count('__') + 1   # number of words in phrase
            if num_paired > 3:     # Case 1: > 3 words paired -> ignore pairing
                skip = handle_failed_pairing(i, skip, num_paired, tokens_original, tokens_paired, to_remove)
                continue
            elif num_paired == 2:  # Case 2: bigrams: noun/adj, noun
                # part-of-speech tags of 1st and 2nd words
                pos_1, pos_2 = pos_original[i + skip: i + skip + 2]
                if pos_1 not in ('NOUN', 'PROPN', 'ADJ') or pos_2 not in ('NOUN', 'PROPN'):
                    skip = handle_failed_pairing(i, skip, num_paired, tokens_original, tokens_paired, to_remove)
                    continue
            elif num_paired == 3:  # Case 3: trigrams: noun/adj, all types, noun/adj
                pos_1, pos_2, pos_3 = pos_original[i + skip: i + skip + 3]
                if not all(pos in ['NOUN', 'PROPN', 'ADJ'] for pos in [pos_1, pos_3]):
                    skip = handle_failed_pairing(i, skip, num_paired, tokens_original, tokens_paired, to_remove)
                    continue
            skip += num_paired - 1
            
    # remove rejected pairs of words that have been split and added back individually
    if len(to_remove) > 0:
        for j in sorted(to_remove, reverse=True):
            del tokens_paired[j]


def handle_failed_pairing(i, skip, num_paired, tokens_original, tokens_paired, to_remove):
    # split up paired words failing our format requirements and update skip
    to_remove.append(i)
    tokens_paired.extend(tokens_original[i + skip: i + skip + num_paired])
    skip += num_paired - 1
    return skip

In [7]:
unigram_pos = [pos for index, tokens, pos in unigram_sents_pos]
# perform in-place filtering of phrases in trigram_sentences:
# split up paired words in the corpus which are
# (1) phrases with more than 3 words joined
# (2) bigrams not in the format: noun/adj, noun
# (3) trigrams not in the format: noun/adj, all types, noun/adj
for i in range(len(trigram_sentences)):
    filter_pairs(tokens_paired=trigram_sentences[i], 
                 tokens_original=unigram_sentences[i], 
                 pos_original=unigram_pos[i])
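
The part-of-speech templates applied above can also be read as a standalone predicate.  The helper below is a hypothetical illustration of the acceptance rule only; it is not used by the pipeline and does not perform the in-place splitting done by `filter_pairs`.

```python
NOUN_LIKE = {'NOUN', 'PROPN'}
NOUN_OR_ADJ = NOUN_LIKE | {'ADJ'}

def matches_template(pos_tags):
    """Return True if the POS tags of a candidate phrase fit the templates
    (hypothetical helper written for illustration)."""
    if len(pos_tags) == 2:    # bigram: noun/adj followed by noun
        return pos_tags[0] in NOUN_OR_ADJ and pos_tags[1] in NOUN_LIKE
    if len(pos_tags) == 3:    # trigram: noun/adj, any tag, noun/adj
        return pos_tags[0] in NOUN_OR_ADJ and pos_tags[2] in NOUN_OR_ADJ
    return False              # anything else, e.g. more than three words, is rejected

# Hypothetical tag sequences:
print(matches_template(('ADJ', 'NOUN')))             # True
print(matches_template(('VERB', 'ADV')))             # False
print(matches_template(('PROPN', 'ADP', 'PROPN')))   # True
```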

Let's look at the updated phrases:

In [8]:
paired_words = (word for sent in trigram_sentences for word in sent if '__' in word)
paired_words_frq = Counter(paired_words)
paired_words_frq.most_common(20)

[('last__year', 1830),
 ('new__york', 1244),
 ('united__states', 1156),
 ('last__week', 856),
 ('per__cent', 802),
 ('white__house', 717),
 ('more__information', 683),
 ('high__school', 655),
 ('social__medium', 603),
 ('long__term', 590),
 ('north__korea', 554),
 ('last__month', 535),
 ('president__donald__trump', 525),
 ('los__angeles', 482),
 ('local__story', 475),
 ('fourth__quarter', 450),
 ('free__30__day', 425),
 ('real__estate', 416),
 ('third__quarter', 415),
 ('south__korea', 374)]

## B.4: Final Clean-up: Stop Word Removal

In [11]:
def is_clean(word):
    """Returns a Boolean indicator of whether the input word can be considered to be 'clean'.
    To be considered clean, a word must:
    1. Not be '-PRON-', which is an artificial lemma representing pronouns, and
    2. Not be an English stop word.
    """
    # reject the pronoun placeholder and SpaCy's default English stop words
    return word != '-PRON-' and word not in nlp.Defaults.stop_words

trigram_sentences = [[word for word in sent if is_clean(word)] for sent in trigram_sentences]

In [12]:
with open('../data/trigram_sentences.pkl', 'wb') as file:
    pickle.dump(trigram_sentences, file)

In [140]:
# group the list of sentences in trigram_sentences by article index: 
# tokenized_articles format: 
# [[list of sentences (tokens) in article 1], ..., [list of sentences (tokens) in article 10000]]
article_indices = (index for index, tokens, pos in unigram_sents_pos)
article_ind_sent = zip(article_indices, trigram_sentences)

tokenized_articles = []
for index, group in groupby(article_ind_sent, key=lambda x: x[0]):
    # '_' avoids shadowing the loop variable 'index' inside the generator
    sentences_in_group = (sent for _, sent in group)
    tokenized_articles.append(list(sentences_in_group))
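
Note that `itertools.groupby` only groups consecutive items that share a key, so the grouping above relies on the sentences being stored in article order.  A small sketch with made-up data:

```python
from itertools import groupby

# Hypothetical (article index, sentence tokens) pairs, already ordered by article.
indexed_sents = [(0, ['title', 'zero']), (0, ['body', 'zero']), (1, ['title', 'one'])]

grouped = [[sent for _, sent in group]
           for _, group in groupby(indexed_sents, key=lambda pair: pair[0])]

print(grouped)
# [[['title', 'zero'], ['body', 'zero']], [['title', 'one']]]
```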

## B.5: Boosting the Articles' Titles

The titles of news articles generally capture the crux of their content.  Hence, it is reasonable to boost the importance of the word tokens in the articles' titles to help tease out the topic discussed in the articles' main text.

In this section, the word tokens in each article's title have been 'boosted' by repeating them several times.  The number of times the title tokens are repeated is parameterized by the number of tokens in the article's main text.

The first sentence of each article in `tokenized_articles` is its title, because in the first step of the tokenization section (`B.1`), each article's title was prefixed to its text.  Note that a small number of articles (< 5%) have multiple sentences in their title.  These articles have not been handled separately: only the tokens in the first sentence of their title have been boosted.

In [141]:
def get_title_multiplier(article):
    """Return the number of times the input article's title 
    tokens must be repeated such that the title becomes about 
    half as long as the article's text.
    """
    title = article[0]
    # flatten list of sentence tokens in text:
    text = list(chain.from_iterable(article[1:]))
    if len(title) != 0:
        # if the text has zero tokens, num_repeat = 1
        num_repeat = max(len(text) // len(title) // 2, 1)
    else:
        # the title may have zero tokens if they got filtered 
        # out in the pre-processing steps up to this point
        num_repeat = 1
    return num_repeat

In [142]:
# modify tokenized_articles by boosting the number of times title tokens appear
for i in range(len(tokenized_articles)):
    # no. of times the title tokens must be repeated:
    num_repeat = get_title_multiplier(tokenized_articles[i])
    tokenized_articles[i][0] *= num_repeat
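
As a worked example of the multiplier arithmetic, consider a hypothetical article (the token counts are made up):

```python
# Hypothetical article: a 4-token title followed by 40 tokens of body text.
title_len, text_len = 4, 40

num_repeat = max(text_len // title_len // 2, 1)   # 40 // 4 // 2 = 5
boosted_title_len = title_len * num_repeat        # 4 * 5 = 20, half of the 40 body tokens

# A short body never shrinks the title: the multiplier is floored at 1.
short_repeat = max(3 // title_len // 2, 1)        # 3 // 4 // 2 = 0, floored to 1

print(num_repeat, boosted_title_len, short_repeat)
```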

In [160]:
# finally, flatten the sub-lists within tokenized_articles
# format after flattening: [[all tokens in article 1], ..., [all tokens in article 10000]]
tokenized_articles = [list(chain.from_iterable(article_sents)) for article_sents in tokenized_articles]

In [165]:
with open('../data/tokenized_articles.pkl', 'wb') as file:
    pickle.dump(tokenized_articles, file)

# C: Construct a Vocabulary

In [173]:
# learn the vocabulary of the corpus by iterating over all articles
vocab_dictionary = Dictionary(tokenized_articles)
print(vocab_dictionary)

Dictionary(184929 unique tokens: ['1,000', '10', '1950', '2005', '2016']...)


The vocabulary contains a large number of tokens.  However, tokens that are very frequent or very rare are not useful and can be removed.  In the code below, tokens that appear in fewer than `20` articles are removed, along with tokens that appear in more than `40%` of the articles.

In [174]:
vocab_dictionary.filter_extremes(no_below=20, no_above=0.4)
# remove gaps in word id sequence caused by token removal
vocab_dictionary.compactify()
print(vocab_dictionary)

Dictionary(8639 unique tokens: ['1,000', '10', '1950', '2005', '2016']...)


The vocabulary size has now dropped down to `8639` unique tokens.
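
The filtering rule is based on document frequency, i.e. the number of articles in which a token appears.  Below is a simplified pure-Python sketch of that rule on a made-up toy corpus; the thresholds are toy values, and the sketch ignores the cap on vocabulary size that `filter_extremes` also applies through its `keep_n` argument.

```python
from collections import Counter

# Hypothetical toy corpus of 5 tokenized articles.
toy_articles = [['market', 'stock', 'bank'],
                ['market', 'bank', 'loan'],
                ['market', 'stock'],
                ['market', 'weather'],
                ['market', 'stock', 'rain']]

# Document frequency: number of articles each token appears in.
doc_freq = Counter(token for article in toy_articles for token in set(article))

no_below, no_above = 2, 0.8   # toy thresholds, not the values used in this notebook
max_docs = no_above * len(toy_articles)
kept = sorted(tok for tok, df in doc_freq.items() if no_below <= df <= max_docs)

print(kept)   # ['bank', 'stock']
```

Here `market` is dropped for appearing in every article, while `loan`, `weather` and `rain` are dropped for appearing in only one.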

In [176]:
vocab_dictionary.save('../data/vocab_dictionary.pkl')