In [124]:
import pickle
import json
from collections import Counter

import pandas as pd

import contractions

# A) Loading the JSON Data

In [2]:
data = []
with open('../data/corpus.txt') as file:
    for line in file:
        data.append(json.loads(line))

In [3]:
print('number of documents in the JSON file corpus.txt: ', len(data))

number of documents in the JSON file corpus.txt:  10000


## A.1) Extracting Relevant Text

Let's look at a single document from the JSON file to understand its structure:

In [6]:
(data[2])

{'author': {'string': 'Chicago Tribune'},
 'crawlName': {'string': 'chicago_tribue_business'},
 'date': 1507161600000,
 'html': '',
 'humanLanguage': {'string': 'en'},
 'pageUrl': 'http://www.chicagotribune.com/news/local/breaking/ct-gunman-reserved-two-rooms-at-blackstone-photos-20171005-photogallery.html',
 'siteName': {'string': 'chicagotribune.com'},
 'tags': [{'count': 1,
   'label': 'The Blackstone Hotel',
   'rdfTypes': [],
   'score': 0.38,
   'uri': 'http://dbpedia.org/page/The_Blackstone_Hotel'},
  {'count': 1,
   'label': 'Chicago Police Department',
   'rdfTypes': [],
   'score': 0.37,
   'uri': 'http://dbpedia.org/page/Chicago_Police_Department'},
  {'count': 1,
   'label': 'music festival',
   'rdfTypes': [],
   'score': 0.24,
   'uri': 'http://dbpedia.org/page/Music_festival'},
  {'count': 1,
   'label': 'mass shooting',
   'rdfTypes': [],
   'score': 0.19,
   'uri': 'http://dbpedia.org/page/Mass_shooting'},
  {'count': 1,
   'label': 'Lollapalooza',
   'rdfTypes': [],
 

Judging from the sample document above (its output is truncated here), each article's title and main body are stored as strings in the document's `title` and `text` attributes, respectively.

**Sanity checks to confirm whether every document in `data` follows the structure seen above:**

In [76]:
article_lengths = [len(doc['text']) for doc in data]   # list of the lengths of each article
article_lengths_count = Counter(article_lengths)       # dict mapping article length to frequency of occurrence
print('5 most frequent counts of article lengths (no. of characters in the article):')
print(article_lengths_count.most_common(5))

5 most frequent counts of article lengths (no. of characters in the article):
[(1, 7793), (0, 43), (1068, 4), (2714, 4), (482, 4)]


There are `7793` documents with article length `1` and `43` with length `0`.  This seems unusual and must be investigated:

In [100]:
print('A document having length of text == 0: format -> (title, text, URL)')
(next(iter(((doc['title'] , doc['text'], doc['pageUrl']) for doc in data if len(doc['text']) == 0))))

A document having length of text == 0: format -> (title, text, URL)


("Craig Robinson and Adam Scott buddy up in Fox's supernatural comedy 'Ghosted'",
 '',
 'http://www.chicagotribune.com/entertainment/tv/la-et-st-ghosted-review-20170930-story.html')

The sample document above is missing its article text, even though the URL associated with the article shows that the full text is available to read.  The same holds for the few other zero-length documents that were explored.  It might be possible to acquire the full article text for such documents later.  Since the article title is itself a useful signal, these articles will be retained for topic modeling.
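To make a later re-fetch easier, the documents with empty text could be collected together with their URLs. The following is a minimal sketch, not part of the original analysis: the helper name `urls_missing_text` and the sample records are hypothetical, and it assumes `text` is a plain string for the documents of interest.

```python
def urls_missing_text(docs):
    """Return (index, pageUrl) pairs for documents whose 'text' is an empty string."""
    return [
        (i, doc.get('pageUrl'))
        for i, doc in enumerate(docs)
        if doc.get('text') == ''
    ]

# Hypothetical records standing in for the list `data`
sample_docs = [
    {'title': 'Has text', 'text': 'Some body text.', 'pageUrl': 'http://example.com/a'},
    {'title': 'No text', 'text': '', 'pageUrl': 'http://example.com/b'},
]
missing = urls_missing_text(sample_docs)
print(missing)
```

The document index is kept alongside each URL so that any text recovered later can be written back to the right entry.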

Let's check if any documents have missing article text as well as title; if any are found, they must be deleted:

In [98]:
print('No. of documents missing both title as well as text: ', end='')
print(len([doc for doc in data if len(doc['text']) == 0 and len(doc['title']) == 0]))

No. of documents missing both title as well as text: 0


Next, let's look at a sample document with article length `1`:

In [80]:
print('A document having length of text == 1: format -> (title, text, URL)')
next(iter(((doc['title'] , doc['text'], doc['pageUrl']) for doc in data if len(doc['text']) == 1)))

A document having length of text == 1: format -> (title, text, URL)


({'string': 'Losses for banks and smaller companies take US stocks lower | The Sacramento Bee'},
 {'string': "U.S. stock indexes are slipping back from record highs Tuesday as banks and small companies fall. Travel booking sites Priceline and TripAdvisor are taking steep losses following their third-quarter reports and retailers are falling too. Companies that pay big dividends, including utilities, are making gains. Oil prices are down slightly after they jumped to two-year highs a day ago.\nKEEPING SCORE: The Standard & Poor's 500 index lost 4 points, or 0.2 percent, to 2,586 as of 3 p.m. Eastern time. The Dow Jones industrial average slipped 26 points, or 0.1 percent, to 23,521. The Nasdaq composite fell 26 points, or 0.4 percent, to 6,759. Smaller companies were on track for their worst loss since early August. The Russell 2000 index tumbled 18 points, or 1.2 percent, to 1,479 as Wall Street continued to watch for signs of progress by House Republicans on their proposed tax cuts. I

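The structure of the JSON document above is different from the first sample document: here `title` and `text` are not plain strings but dictionaries that wrap the string under the single key `string`.  This explains the article length of `1` reported earlier, since `len()` of such a dictionary returns its number of keys rather than the number of characters in the article.  This must be taken into account while extracting article titles and text from `data`.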
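One way to see how many documents use each layout is to count the Python type of each field of interest. The sketch below is illustrative only: the helper `summarize_field_types` and the sample records are hypothetical, and the `'missing'` label is an assumption for documents that lack a field altogether.

```python
from collections import Counter

def summarize_field_types(docs, fields=('title', 'text')):
    """Count the Python type name of each field across all documents.

    Documents that lack a field are counted under the label 'missing'.
    """
    summary = {}
    for field in fields:
        summary[field] = Counter(
            type(doc[field]).__name__ if field in doc else 'missing'
            for doc in docs
        )
    return summary

# Hypothetical records standing in for the list `data`
sample_docs = [
    {'title': 'A plain title', 'text': 'Plain body text.'},
    {'title': {'string': 'A wrapped title'}, 'text': {'string': 'Wrapped body.'}},
    {'title': 'Title only'},
]
field_summary = summarize_field_types(sample_docs)
print(field_summary)
```

Run on `data`, a count of `dict` entries for `text` would be expected to line up with the number of documents that reported an article length of `1`.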

**The fields of interest in the JSON documents are: `text` and `title`.**  

Note:  
The `tags` field is an array of entities that Diffbot extracts from the article through text analysis (reference: https://www.diffbot.com/dev/docs/article/).  Each entity has a label (its name) and a relevance score.  However, a cursory exploration of these tags for a few documents revealed that some of the entities identified by Diffbot are not relevant to the article.  Hence, the `tags` field was not used in this work.
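For readers who want to repeat that cursory exploration, a document's tags can be listed by relevance score. This is only a sketch: the helper `tags_by_score` and its `min_score` parameter are hypothetical, and the sample reuses labels and scores from the document shown earlier.

```python
def tags_by_score(doc, min_score=0.0):
    """Return (label, score) pairs for a document's tags, highest score first."""
    tags = doc.get('tags', [])
    pairs = [(tag['label'], tag['score']) for tag in tags if tag['score'] >= min_score]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

# Subset of the tags from the sample document, in shuffled order
sample_doc = {
    'tags': [
        {'count': 1, 'label': 'music festival', 'score': 0.24},
        {'count': 1, 'label': 'The Blackstone Hotel', 'score': 0.38},
        {'count': 1, 'label': 'mass shooting', 'score': 0.19},
    ]
}
ranked_tags = tags_by_score(sample_doc, min_score=0.2)
print(ranked_tags)
```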

In [113]:
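def get_title_and_text(doc):
    """Returns a tuple of strings representing the article's title and text from the JSON document doc."""
    title = doc['title']
    text = doc['text']
    # Some documents wrap each field as {'string': ...}; unwrap each field
    # independently rather than assuming both share the same layout.
    if isinstance(title, dict):
        title = title['string']
    if isinstance(text, dict):
        text = text['string']
    return title, text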

In [114]:
articles = []

# Track each document's index in the JSON array.
# The index can be used to map any article back to all
# attributes available in the original list 'data'.
for ind, doc in enumerate(data):
    title, text = get_title_and_text(doc)
    articles.append([ind, title, text])

**Sanity check:**

In [123]:
print('No. of documents with article length == 1 after text extraction: ', end='')
print(len([doc for doc in articles if len(doc[2]) == 1]))

No. of documents with article length == 1 after text extraction: 0


In [126]:
# Save the list 'articles' containing relevant information in the format:
# [[document index in the original JSON array, article title, article text]]
with open('../data/articles.pkl', 'wb') as file:
    pickle.dump(articles, file)
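A quick round trip can confirm that the saved list reloads unchanged. The sketch below uses a temporary directory and a small hypothetical sample rather than the real `../data/articles.pkl`, so it can be run without touching the actual data.

```python
import os
import pickle
import tempfile

# Hypothetical sample in the same format as 'articles':
# [[document index, article title, article text]]
sample_articles = [
    [0, 'First title', 'First body.'],
    [1, 'Second title', ''],
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'articles.pkl')
    # Write the list, then read it back from the same file
    with open(path, 'wb') as file:
        pickle.dump(sample_articles, file)
    with open(path, 'rb') as file:
        reloaded = pickle.load(file)

print(reloaded == sample_articles)
```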

# Tokenization

**Article Title**  
The article ---------

