# Topic Modeling of American News Articles Using LDA

Created by Patrick Steeves for Independent Study with Professor Kanungo <br>
George Washington University, 12/23/2017

### Setting up

Import notebook Module.ipynb, which contains functions to import and clean the data, add features, and more

In [7]:
%run Module.ipynb

In [9]:
%matplotlib inline

### <br> <br> Data import and cleaning

The data we have is collected from Kaggle at https://www.kaggle.com/snapcrack/all-the-news <br>
It contains 143,000 articles and is available for download as 3 files: articles1.csv, articles2.csv, articles3.csv <br><br> The data was split into smaller subsets and is downloaded here from GitHub

Cleaning consists of removing all non-alphabetical characters, tokenizing and lemmatizing the words, and filtering out stop words

Import news articles from CSV

In [8]:
data = importData()

Importing and unzipping file 0...
Importing and unzipping file 1...
Importing and unzipping file 2...
Importing and unzipping file 3...
Importing and unzipping file 4...
Importing and unzipping file 5...
Importing and unzipping file 6...
Importing and unzipping file 7...
Computing word counts


(array([ 14605.,  29648.,  27947.,  23224.,  16881.,  10264.,   6836.,
          4108.,   2560.,   1608.]),
 array([   0,  200,  400,  600,  800, 1000, 1200, 1400, 1600, 1800, 2000]),
 <a list of 10 Patch objects>)

This project only recommends articles of similar, average length, so articles shorter than 200 words or longer than 700 words are excluded

In [None]:
plt.hist(data.word_count, bins=list(range(0, 2001, 200)))

In [8]:
data = data.loc[data['word_count'] > 199,:]
data = data.loc[data['word_count'] < 701,:].reset_index(drop=True)

Starting cleaning...
Starting tokenizing...
Starting lemmatizing and filtering...
Took 11619 seconds to clean texts


In [None]:
clean_articles = cleanData(data,'content')
del(data)
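
The `cleanData` function lives in the module notebook and is not shown here. As a rough illustration of the four cleaning steps listed earlier, here is a minimal sketch that uses only the standard library. The names `STOP_WORDS`, `clean_text` and `tokenize_and_filter` are hypothetical, the stop-word list is a tiny stand-in, and the lemmatizer defaults to the identity function, so this is not the module's actual implementation.

```python
import re

# Tiny illustrative stop-word list (an assumption); a real pipeline would
# use a much fuller list.
STOP_WORDS = {"a", "an", "and", "is", "in", "of", "on", "s", "the", "to"}


def clean_text(text):
    """Lowercase the text and replace every non-alphabetical run with a space."""
    letters_only = re.sub(r"[^a-z]+", " ", text.lower())
    return letters_only.strip()


def tokenize_and_filter(cleaned, lemmatize=lambda word: word):
    """Split on whitespace, lemmatize each token, and drop stop words.

    `lemmatize` defaults to the identity function; a real lemmatizer
    can be passed in instead.
    """
    tokens = (lemmatize(word) for word in cleaned.split())
    return [token for token in tokens if token not in STOP_WORDS]


cleaned = clean_text("SEOUL, South Korea: North Korea's leader")
tokens = tokenize_and_filter(cleaned)
print(cleaned)
print(tokens)
```

Keeping the lemmatizer as a parameter means the same function can be tried with or without a linguistic resource installed.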

Let's take a look at our data

In [9]:
clean_articles.iloc[:3,:]

Unnamed: 0,title,publication,content,word_count,cleaned_content,tokens
0,Kim Jong-un Says North Korea Is Preparing to T...,New York Times,"SEOUL, South Korea — North Korea’s leader, ...",694,seoul south korea north korea s leader kim sai...,"[seoul, south, korea, north, korea, leader, ki..."
1,Taiwan’s President Accuses China of Renewed In...,New York Times,BEIJING — President Tsai of Taiwan sharpl...,571,beijing president tsai of taiwan sharply criti...,"[beijing, president, tsai, taiwan, sharply, cr..."
2,How to form healthy habits in your 20s - The N...,New York Times,This article is part of a series aimed at help...,665,this article is part of a series aimed at help...,"[article, part, series, aimed, helping, naviga..."


Add bigrams to articles

In [11]:
bigrams, complete_data = addBigrams(clean_articles, 'tokens')
del(clean_articles)

The above code recognizes pairs of consecutive words that appear together in at least 250 articles of our dataset. Each such pair is returned as a single token joined by an underscore, as illustrated below

In [44]:
print(bigrams['new','york','north','korea', 'not','a','bigram'])

['new_york', 'north_korea', 'new', 'haven']
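
To make the idea concrete, here is a simplified sketch of bigram detection based on document frequency. It is an assumption about the general approach, not the code behind `addBigrams`: the helper names `find_frequent_bigrams` and `merge_bigrams` are hypothetical, and the toy threshold of 2 documents stands in for the 250 used above.

```python
from collections import Counter


def find_frequent_bigrams(docs, min_docs):
    """Return the adjacent word pairs found in at least `min_docs` documents."""
    doc_freq = Counter()
    for tokens in docs:
        # A set ensures each pair is counted at most once per document
        doc_freq.update(set(zip(tokens, tokens[1:])))
    return {pair for pair, count in doc_freq.items() if count >= min_docs}


def merge_bigrams(tokens, frequent):
    """Greedily join frequent pairs, left to right, with an underscore."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in frequent:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2  # skip both words of the merged pair
        else:
            merged.append(tokens[i])
            i += 1
    return merged


docs = [
    ["new", "york", "mayor"],
    ["north", "korea", "missile"],
    ["new", "york", "north", "korea"],
    ["new", "haven"],
]
frequent = find_frequent_bigrams(docs, min_docs=2)
merged = merge_bigrams(["new", "york", "north", "korea", "new", "haven"], frequent)
print(merged)
```

Pairs below the threshold, such as `new haven` in this toy corpus, are left as two separate tokens.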


Create bag of words representation for articles

In [14]:
training_data, dictionary = createBOW(complete_data, 'tokens')
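
A bag-of-words representation maps each article to a list of `(token_id, count)` pairs. The sketch below shows the idea with plain Python; `build_vocabulary` and `to_bow` are hypothetical names and this is not necessarily how `createBOW` is written in the module notebook.

```python
from collections import Counter


def build_vocabulary(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for tokens in docs:
        for token in tokens:
            token2id.setdefault(token, len(token2id))
    return token2id


def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, skipping tokens outside the vocabulary."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


docs = [["beijing", "city", "official", "city"], ["city", "travel"]]
token2id = build_vocabulary(docs)
bows = [to_bow(tokens, token2id) for tokens in docs]
print(token2id)
print(bows)
```

Word order is discarded at this step, which matches what LDA expects: only the counts per article matter.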

Our data is now ready to train the LDA model

In [15]:
training_data.iloc[10:13,:]

Unnamed: 0,title,publication,content,word_count,cleaned_content,tokens,bow
10,Chinese City Official Shoots 2 Others and Kill...,New York Times,BEIJING — A city official in southwest Chin...,358,beijing a city official in southwest china unl...,"[beijing, city, official, southwest, china, un...","[(11, 1), (16, 1), (54, 2), (73, 1), (79, 4), ..."
11,Ivanka Trump’s New Washington Home Once Belong...,New York Times,"WASHINGTON — Ivanka Trump, who is weighing ...",683,washington ivanka trump who is weighing a prom...,"[washington, ivanka, trump, weighing, prominen...","[(51, 1), (55, 3), (60, 1), (69, 1), (71, 1), ..."
12,How We Put Together Our 52 Places to Go List -...,New York Times,"For the 12th straight year, the Travel section...",567,for the th straight year the travel section pr...,"[th, straight, year, travel, section, present,...","[(38, 2), (60, 1), (84, 1), (108, 1), (161, 1)..."


### <br><br> LDA Model

Check that our dictionary is filled

In [45]:
dictionary[0]

'associate'

In [21]:
print("Our training data has {} distinct words".format(len(dictionary)))
print("Our corpus has {} documents".format(len(training_data)))

Our training data has 8096 distinct words
Our corpus has 69934 documents


Set parameters for LDA model training

In [None]:
id2word = dictionary.id2token   # id -> token mapping; gensim fills it lazily, hence the dictionary[0] lookup above

In [22]:
num_topics = 20    # This number of topics led to the most interpretable topics
chunksize = 9000
passes = 15
iterations = 400

Time to train model!

In [23]:
import time
from gensim.models import LdaModel

start = time.time()
model = LdaModel(corpus=training_data['bow'], num_topics=num_topics, id2word=id2word,
                 chunksize=chunksize, iterations=iterations, passes=passes)
print("Took {} seconds to train model".format(time.time() - start))

Took 5416.5155465602875 seconds to train model


Print the resulting topics. The model assumes that each article is a mixture of 20 topics, and that each topic is a probability distribution over words. The word distributions printed below are easy to interpret: the first topic, for example, covers the judiciary, the second covers foreign conflicts, and so on

In [24]:
for i in range(num_topics):
    print(model.print_topic(i,topn=30)+'\n')

0.022*"state" + 0.020*"court" + 0.018*"law" + 0.013*"federal" + 0.009*"new" + 0.009*"case" + 0.008*"judge" + 0.008*"right" + 0.008*"government" + 0.008*"order" + 0.007*"justice" + 0.007*"would" + 0.007*"rule" + 0.007*"department" + 0.006*"decision" + 0.006*"legal" + 0.005*"public" + 0.005*"ban" + 0.005*"lawsuit" + 0.005*"executive" + 0.004*"attorney" + 0.004*"supreme" + 0.004*"statement" + 0.004*"supreme_court" + 0.004*"district" + 0.004*"appeal" + 0.004*"use" + 0.004*"governor" + 0.004*"administration" + 0.004*"ruling"

0.014*"state" + 0.010*"military" + 0.009*"attack" + 0.008*"force" + 0.008*"syria" + 0.007*"islamic" + 0.007*"north" + 0.007*"war" + 0.007*"korea" + 0.006*"united" + 0.006*"government" + 0.006*"group" + 0.006*"iran" + 0.006*"country" + 0.006*"syrian" + 0.005*"islamic_state" + 0.005*"official" + 0.005*"nuclear" + 0.005*"security" + 0.005*"isi" + 0.005*"u" + 0.005*"russia" + 0.005*"north_korea" + 0.005*"united_state" + 0.004*"iraq" + 0.004*"missile" + 0.004*"south" + 0.00

For each of our articles, compute its probability distribution over topics

In [25]:
training_data['topics'] = training_data.bow.apply(lambda x: model.get_document_topics(x, minimum_probability = 1e-8))
# get_document_topics returns tuples but we only want to keep the probs
training_data['topics'] = [np.array([prob[1] for prob in row]) for row in training_data.topics]
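
The list comprehension above keeps the probabilities in the order they are returned, which lines up with topic ids only when every topic is present. The very small `minimum_probability` makes that likely, but gensim omits any topic that falls below the threshold. As a hedged alternative, the sketch below fills a fixed-length vector by topic id; `to_dense` is a hypothetical helper, not part of the module notebook.

```python
def to_dense(topic_probs, num_topics):
    """Turn sparse (topic_id, probability) pairs into a fixed-length list."""
    dense = [0.0] * num_topics
    for topic_id, prob in topic_probs:
        dense[topic_id] = prob  # topics missing from the input stay at 0.0
    return dense


sparse = [(0, 0.7), (2, 0.3)]
dense = to_dense(sparse, num_topics=4)
print(dense)
```

With this layout, position `i` of every vector always refers to topic `i`, so vectors from different articles can be compared element by element.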

In [27]:
training_data.iloc[18:20,:]

Unnamed: 0,title,publication,content,word_count,cleaned_content,tokens,bow,topics
18,2 Credit-Reporting Agencies Must Pay $23 Milli...,New York Times,The nation’s consumer watchdog agency on Tuesd...,340,the nation s consumer watchdog agency on tuesd...,"[nation, consumer, watchdog, agency, tuesday, ...","[(47, 1), (324, 1), (490, 1), (502, 1), (585, ...","[0.269334122173, 0.000340136059492, 0.06643992..."
19,Chase Sapphire Reserve Card’s Huge Bonus Will ...,New York Times,When a Wall Street banking institution starts ...,365,when a wall street banking institution starts ...,"[wall, street, banking, institution, start, th...","[(51, 2), (108, 1), (361, 1), (433, 1), (565, ...","[0.000270270275914, 0.000270270272897, 0.00027..."


### <br><br>Recommending further articles to readers

The code below returns the three articles that our model considers most similar to a given article. It uses functionality defined in the module notebook
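
Since `similarArticle` is defined in the module notebook and not shown here, the following is only a sketch of one way such a lookup could work. It assumes that similarity is measured with the Hellinger distance between topic distributions. That metric is an assumption made for illustration, and `hellinger` and `most_similar` are hypothetical names.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions (0 = identical, 1 = disjoint)."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)) / 2)


def most_similar(query_index, topic_vectors, k=3):
    """Return the indices of the k articles closest to the query, excluding the query itself."""
    query = topic_vectors[query_index]
    distances = [
        (hellinger(query, vec), idx)
        for idx, vec in enumerate(topic_vectors)
        if idx != query_index
    ]
    return [idx for _, idx in sorted(distances)[:k]]


# Toy topic distributions over 3 topics for 5 articles
topic_vectors = [
    [0.90, 0.05, 0.05],
    [0.80, 0.10, 0.10],
    [0.05, 0.90, 0.05],
    [0.70, 0.20, 0.10],
    [0.10, 0.10, 0.80],
]
nearest = most_similar(0, topic_vectors, k=3)
print(nearest)
```

This brute-force scan compares the query against every article, which is simple but linear in the corpus size. Other distances between distributions, such as the Jensen-Shannon divergence, could be swapped in without changing the structure.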

Let's try an example. Pick a random text, say the 1112th one in our corpus

In [42]:
article1 = np.random.randint(len(training_data))   # uniform random row index in [0, len(training_data))
training_data.iloc[article1,2]

'The veteran television personality Jane Pauley will replace Charles Osgood as the anchor of the highly rated CBS show “Sunday Morning. ” Mr. Osgood, who is retiring, announced the news on his last show on Sunday. Ms. Pauley’s first day in the role will be Oct. 9, and she will become only the third anchor of the show, which started in 1979. For Ms. Pauley, 65, a return to the anchor role for a morning television show represents an unexpected   comeback. And by selecting her instead of a younger    CBS is clearly trying to ease the transition from Mr. Osgood, 83, whose folksy delivery has been a mainstay on the show for more than two decades. In a statement, the president of CBS News, David Rhodes, said, “Charles Osgood is a television news legend  —   and so is Jane Pauley. ” Ms. Pauley first catapulted to fame at age 25 when she replaced Barbara Walters as an anchor of the “Today” show 40 years ago. She remained with “Today” through the late 1980s until the notoriously messy handoff i

In [43]:
similar1, similar2, similar3 = similarArticle(article1)
print(training_data.iloc[similar1,2][:2000])
print('\n')
print(training_data.iloc[similar2,2][:2000])
print('\n')
print(training_data.iloc[similar3,2][:2000])

Arnold Schwarzenegger’s debut as the new boss of NBC’s New Celebrity Apprentice drew soft ratings Monday night without its longtime host and current   Donald Trump. [The season premiere of Celebrity Apprentice drew an average of 4. 9 million viewers and a 1. 3 rating in the     demo, according to the Hollywood Reporter. The outlet notes those numbers are down a significant 35 percent from last year’s Donald   premiere of the show.  NBC did heavy promotion for the new Apprentice, with TV spots highlighting a     Schwarzenegger poised to take over the boss position played by Trump for 14 seasons. The network also teased Schwarzenegger’s replacement line for Trump’s “You’re fired” Schwarzenegger’s updated catchphrase, revealed on Monday night’s episode, is “You’re terminated. ” The show also drew some buzz when it was reported that Trump would retain his executive producer credit on the show (though he said he would not have any involvement in it whatsoever). Neither factor appeared to he

<br>Nice! The algorithm returns articles about pop culture that reference Trump. Let's try another one..

In [34]:
article2 = np.random.randint(len(training_data))   # uniform random row index in [0, len(training_data))
training_data.iloc[article2,2]

'The Wall Street Journal’s Potomac Watch columnist Kimberley Strassel details the “devastating case against a Clinton presidency” that can be made by reviewing the WikiLeaks documents combined with what is already known in the public record. Strassel notes that although “the nation now has proof of pretty much everything [Hillary Clinton] has been accused of,” the media “has almost uniformly ignored the flurry of bombshells, preferring to devote its front pages” to the story that Donald Trump “made lewd remarks a decade ago and now stands accused of groping women. ”[ From Strassel in the Wall Street Journal:  The Obama administration —  the federal government, supported by tax dollars —  was working as an extension of the Clinton campaign. The State Department coordinated with her staff in responding to the email scandal, and the Justice Department kept her team informed about developments in the court case. Worse, Mrs. Clinton’s State Department, as documents obtained under the Freedo

In [39]:
similar1, similar2, similar3 = similarArticle(article2)
print(training_data.iloc[similar1,2][:2000])
print('\n')
print(training_data.iloc[similar2,2][:2000])
print('\n')
print(training_data.iloc[similar3,2][:2000])

Democratic presidential nominee Hillary Clinton told Fox News’ Chris Wallace that she is “really proud” of her family’s charity the Clinton Foundation  —   after spending the past week hiding its existence from viewers of the Democratic National Convention. [The glaring contradiction prompted a video comparison, which you can watch above.  Wallace, in an exclusive   interview, asked Clinton about allegations first made in Clinton Cash, the book and now   novel, that Clinton used the    charity for international “  ” deals while she served as Secretary of State. Author and Breitbart News Senior    Peter Schweizer says that Clinton Foundation donors were on the receiving end of corrupt deals approved by Hillary’s State Department  —   including the sale of U. S. uranium to Russia and a rare, lucrative mining permit in Haiti. Clinton retorted that she is “really proud of the Clinton Foundation,” yet not a single speaker at the DNC last week  —   not even Bill or Chelsea Clinton  —   menti

Once the LDA model is trained, the recommendation step is quick: it takes no more than a few seconds to run