# News headline topic analysis with LDA
The purpose of this analysis is to find dominant topics across news headlines (25 per day, over 1 year). These topics will later be correlated with daily stock market loss/gain to understand if certain topics influence the stock market.

This work borrows heavily from Susan Li's ["Topic Modeling and LDA in Python"](https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24) article.

In [1]:
# import dependencies
import pandas as pd

import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *

import numpy as np
np.random.seed(22)

[nltk_data] Downloading package wordnet to /Users/stacy/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [2]:
# Import the data
data = pd.read_csv("../Data/RedditNews.csv")

In [3]:
data.head()

# 73,608 records from 2016-2018

Unnamed: 0,Date,News
0,2016-07-01,A 117-year-old woman in Mexico City finally re...
1,2016-07-01,IMF chief backs Athens as permanent Olympic host
2,2016-07-01,"The president of France says if Brexit won, so..."
3,2016-07-01,British Man Who Must Give Police 24 Hours' Not...
4,2016-07-01,100+ Nobel laureates urge Greenpeace to stop o...


In [4]:
# Getting just the headlines for our corpus
headlines = data[['News']]
headlines.head()

Unnamed: 0,News
0,A 117-year-old woman in Mexico City finally re...
1,IMF chief backs Athens as permanent Olympic host
2,"The president of France says if Brexit won, so..."
3,British Man Who Must Give Police 24 Hours' Not...
4,100+ Nobel laureates urge Greenpeace to stop o...


## Data pre-processing
### Lemmitize

In [5]:
# Lemmatize the words keeping the context (stemming is "dumb" so we won't)
# However if we have a much larger corpus, we might consider stemming (as it is faster)
def lemmatize(text):
    return WordNetLemmatizer().lemmatize(text, pos='v') # pos='v' means it peforms stemming with context

### Remove stopwords and words shorter than 3 chars

In [6]:
# Remove stopwords and words shorter than 3 characters, then lemmatize
def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize(token))
    return result

### Check outputs

In [7]:
sample = headlines['News'][2]

print('original document: ')
words = []
for word in sample.split(' '):
    words.append(word)
print(words)
print('\n\n tokenized and lemmatized document: ')
print(preprocess(sample))

original document: 
['The', 'president', 'of', 'France', 'says', 'if', 'Brexit', 'won,', 'so', 'can', 'Donald', 'Trump']


 tokenized and lemmatized document: 
['president', 'france', 'say', 'brexit', 'donald', 'trump']


### Preprocess the headlines and save the results

In [8]:
cleaned_headlines = headlines['News'].map(preprocess)
cleaned_headlines[:5] # Check the results

0    [year, woman, mexico, city, finally, receive, ...
1      [chief, back, athens, permanent, olympic, host]
2      [president, france, say, brexit, donald, trump]
3    [british, police, hours, notice, threaten, hun...
4    [nobel, laureates, urge, greenpeace, stop, opp...
Name: News, dtype: object

## Count the word occurences using Bag of Words

In [9]:
# corpora.Dictionary implements the concept of a Dictionary – a mapping between words and their integer ids.
# https://radimrehurek.com/gensim/corpora/dictionary.html
dictionary = gensim.corpora.Dictionary(cleaned_headlines)

count = 0
for k, v in dictionary.iteritems():
    print(k, v)
    count += 1
    if count > 10:
        break

0 alvarez
1 bear
2 birth
3 certificate
4 city
5 die
6 finally
7 hours
8 later
9 lira
10 mexico


In [10]:
# Filter out irrelevant words
'''
    less than 15 documents (absolute number) or
    more than 0.5 documents (fraction of total corpus size, not absolute number).
    after the above two steps, keep only the first 100000 most frequent tokens.
'''

dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)

In [11]:
'''
For each document we create a dictionary reporting how many
words and how many times those words appear. Save this to ‘bow_corpus’, then check our selected document earlier.
'''

bow_corpus = [dictionary.doc2bow(doc) for doc in cleaned_headlines]
bow_corpus[2]

[(21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1)]

In [12]:
bow_doc_2 = bow_corpus[2]

for i in range(len(bow_doc_2)):
    print("Word {} (\"{}\") appears {} time.".format(bow_doc_2[i][0], 
                                            dictionary[bow_doc_2[i][0]], 
                                            bow_doc_2[i][1]))

Word 21 ("brexit") appears 1 time.
Word 22 ("donald") appears 1 time.
Word 23 ("france") appears 1 time.
Word 24 ("president") appears 1 time.
Word 25 ("say") appears 1 time.
Word 26 ("trump") appears 1 time.


## TD-IDF
Create tf-idf model object using models.TfidfModel on ‘bow_corpus’ and save it to ‘tfidf’, then apply transformation to the entire corpus and call it ‘corpus_tfidf’. Finally we preview TF-IDF scores for our first document.

In [13]:
from gensim import corpora, models

tfidf = models.TfidfModel(bow_corpus)
corpus_tfidf = tfidf[bow_corpus]

from pprint import pprint

for doc in corpus_tfidf:
    pprint(doc)
    break

[(0, 0.2600564898594514),
 (1, 0.28994526204184407),
 (2, 0.3744826130954538),
 (3, 0.2051174112907887),
 (4, 0.21074667376840217),
 (5, 0.2818126498910203),
 (6, 0.24752380792284462),
 (7, 0.2842753904922502),
 (8, 0.2101120460560261),
 (9, 0.32010680937189695),
 (10, 0.253807102500421),
 (11, 0.3026294517497535),
 (12, 0.20803861949214233),
 (13, 0.16333461952636777),
 (14, 0.16583903979070258)]


## Running LDA using Bag of Words

Train our lda model using gensim.models.LdaMulticore and save it to ‘lda_model’.

In [34]:
lda_model_bow = gensim.models.LdaMulticore(bow_corpus, num_topics=10, id2word=dictionary, passes=2, workers=2)

In [35]:
# For each topic, we will explore the words occuring in that topic and its relative weight.

for idx, topic in lda_model_bow.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

Topic: 0 
Words: 0.019*"court" + 0.012*"rule" + 0.011*"government" + 0.007*"earthquake" + 0.006*"judge" + 0.006*"case" + 0.006*"crimes" + 0.005*"company" + 0.005*"british" + 0.005*"canadian"
Topic: 1 
Words: 0.011*"world" + 0.011*"saudi" + 0.011*"israeli" + 0.009*"bank" + 0.009*"russian" + 0.007*"crash" + 0.006*"women" + 0.006*"plane" + 0.006*"arabia" + 0.006*"land"
Topic: 2 
Words: 0.029*"right" + 0.029*"gaza" + 0.023*"israel" + 0.016*"human" + 0.010*"internet" + 0.009*"palestinian" + 0.008*"say" + 0.007*"palestinians" + 0.006*"hamas" + 0.006*"minister"
Topic: 3 
Words: 0.029*"kill" + 0.021*"bomb" + 0.012*"water" + 0.012*"people" + 0.012*"japan" + 0.008*"civilians" + 0.008*"china" + 0.007*"dead" + 0.006*"world" + 0.006*"japanese"
Topic: 4 
Words: 0.026*"china" + 0.020*"north" + 0.019*"korea" + 0.017*"nuclear" + 0.012*"south" + 0.011*"world" + 0.009*"power" + 0.008*"bank" + 0.008*"billion" + 0.008*"russia"
Topic: 5 
Words: 0.026*"kill" + 0.020*"pakistan" + 0.019*"attack" + 0.017*"afgha

## Running LDA using TF-IDF

In [41]:
lda_model_tfidf = gensim.models.LdaMulticore(corpus_tfidf, num_topics=10, id2word=dictionary, passes=2, workers=4)

for idx, topic in lda_model_tfidf.print_topics(-1):
    print('Topic: {} Word: {}'.format(idx, topic))

Topic: 0 Word: 0.008*"kill" + 0.005*"nuclear" + 0.005*"iran" + 0.005*"iraq" + 0.005*"bomb" + 0.004*"attack" + 0.004*"gaddafi" + 0.004*"say" + 0.003*"syrian" + 0.003*"israel"
Topic: 1 Word: 0.015*"korea" + 0.013*"north" + 0.007*"south" + 0.006*"nuclear" + 0.006*"iran" + 0.005*"missile" + 0.005*"china" + 0.005*"russia" + 0.004*"korean" + 0.004*"israel"
Topic: 2 Word: 0.005*"china" + 0.005*"japan" + 0.004*"internet" + 0.004*"world" + 0.004*"pope" + 0.004*"water" + 0.004*"mumbai" + 0.003*"people" + 0.003*"tsunami" + 0.003*"church"
Topic: 3 Word: 0.008*"drug" + 0.006*"georgia" + 0.005*"russia" + 0.005*"world" + 0.004*"bank" + 0.004*"say" + 0.004*"mexico" + 0.004*"israel" + 0.003*"billion" + 0.003*"global"
Topic: 4 Word: 0.006*"assange" + 0.005*"wikileaks" + 0.004*"world" + 0.004*"julian" + 0.004*"earthquake" + 0.003*"boycott" + 0.003*"government" + 0.003*"china" + 0.003*"police" + 0.003*"olympics"
Topic: 5 Word: 0.005*"bush" + 0.005*"say" + 0.005*"right" + 0.004*"ahmadinejad" + 0.004*"chave

In [42]:
# Performance evaluation by classifying sample document using LDA Bag of Words model

# We will check where our test document would be classified.

cleaned_headlines[2]

['president', 'france', 'say', 'brexit', 'donald', 'trump']

In [43]:
for index, score in sorted(lda_model_bow[bow_corpus[2]], key=lambda tup: -1*tup[1]):
    topic = index
    score = score
    print(index)
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model.print_topic(index, 10)))
    print(topic, score)

0

Score: 0.4510440230369568	 
Topic: 0.045*"israel" + 0.020*"israeli" + 0.018*"china" + 0.018*"state" + 0.015*"right" + 0.014*"say" + 0.014*"human" + 0.011*"palestinian" + 0.010*"unite" + 0.010*"iran"
0 0.45104402
8

Score: 0.4335630238056183	 
Topic: 0.009*"internet" + 0.008*"government" + 0.007*"right" + 0.006*"say" + 0.006*"house" + 0.006*"muslim" + 0.006*"people" + 0.006*"israelis" + 0.006*"demand" + 0.005*"church"
8 0.43356302
1

Score: 0.014427188783884048	 
Topic: 0.018*"bank" + 0.013*"government" + 0.010*"protest" + 0.009*"court" + 0.008*"chinese" + 0.008*"china" + 0.008*"million" + 0.008*"police" + 0.006*"people" + 0.006*"billion"
1 0.014427189
4

Score: 0.01442678365856409	 
Topic: 0.036*"gaza" + 0.020*"israel" + 0.013*"ship" + 0.013*"hamas" + 0.009*"israeli" + 0.007*"children" + 0.006*"say" + 0.006*"georgia" + 0.006*"sell" + 0.006*"report"
4 0.014426784
2

Score: 0.014426068402826786	 
Topic: 0.034*"kill" + 0.021*"attack" + 0.016*"bomb" + 0.014*"force" + 0.010*"army" + 0.01

In [44]:
# Performance evaluation by classifying sample document using LDA TF-IDF model.

for index, score in sorted(lda_model_tfidf[bow_corpus[2]], key=lambda tup: -1*tup[1]):
#     print("\nScore: {}\t \nTopic: {}".format(score, lda_model_tfidf.print_topic(index, 10)))
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model_tfidf.print_topic(index)))


Score: 0.8713290691375732	 
Topic: 0.007*"israel" + 0.007*"police" + 0.006*"israeli" + 0.006*"gaza" + 0.005*"kill" + 0.005*"palestinian" + 0.005*"protest" + 0.004*"protesters" + 0.004*"palestinians" + 0.004*"iran"

Score: 0.014297603629529476	 
Topic: 0.015*"korea" + 0.013*"north" + 0.007*"south" + 0.006*"nuclear" + 0.006*"iran" + 0.005*"missile" + 0.005*"china" + 0.005*"russia" + 0.004*"korean" + 0.004*"israel"

Score: 0.014297451823949814	 
Topic: 0.011*"gaza" + 0.009*"israel" + 0.008*"kill" + 0.006*"israeli" + 0.005*"attack" + 0.005*"troop" + 0.004*"strike" + 0.004*"say" + 0.004*"pakistan" + 0.004*"hamas"

Score: 0.01429729163646698	 
Topic: 0.008*"kill" + 0.005*"nuclear" + 0.005*"iran" + 0.005*"iraq" + 0.005*"bomb" + 0.004*"attack" + 0.004*"gaddafi" + 0.004*"say" + 0.003*"syrian" + 0.003*"israel"

Score: 0.014296866953372955	 
Topic: 0.005*"bush" + 0.005*"say" + 0.005*"right" + 0.004*"ahmadinejad" + 0.004*"chavez" + 0.004*"human" + 0.004*"china" + 0.003*"president" + 0.003*"death"

In [45]:
# Testing model on unseen document

unseen_document = 'How a Pentagon deal became an identity crisis for Google'
bow_vector = dictionary.doc2bow(preprocess(unseen_document))

# lda_model[bow_vector] # Gets topic probabilities for unseen document 
# # (7, 0.66596633) is the highest score

# for topic in lda_model[bow_vector]:
#     print(topic)

# lda_model.print_topic(index, 7)

for index, score in sorted(lda_model_bow[bow_vector], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 5)))
    
for index, score in sorted(lda_model_tfidf[bow_vector], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 5)))
    


Score: 0.8499320149421692	 Topic: 0.009*"internet" + 0.008*"government" + 0.007*"right" + 0.006*"say" + 0.006*"house"
Score: 0.016680601984262466	 Topic: 0.045*"israel" + 0.020*"israeli" + 0.018*"china" + 0.018*"state" + 0.015*"right"
Score: 0.016679517924785614	 Topic: 0.036*"gaza" + 0.020*"israel" + 0.013*"ship" + 0.013*"hamas" + 0.009*"israeli"
Score: 0.016676833853125572	 Topic: 0.037*"iran" + 0.018*"nuclear" + 0.018*"russia" + 0.014*"iraq" + 0.014*"world"
Score: 0.016672994941473007	 Topic: 0.026*"police" + 0.013*"president" + 0.013*"russian" + 0.009*"russia" + 0.009*"protest"
Score: 0.016672952100634575	 Topic: 0.034*"kill" + 0.021*"attack" + 0.016*"bomb" + 0.014*"force" + 0.010*"army"
Score: 0.016671812161803246	 Topic: 0.023*"north" + 0.022*"korea" + 0.021*"kill" + 0.019*"south" + 0.015*"pakistan"
Score: 0.016671642661094666	 Topic: 0.018*"bank" + 0.013*"government" + 0.010*"protest" + 0.009*"court" + 0.008*"chinese"
Score: 0.01667155511677265	 Topic: 0.017*"world" + 0.010*"ind

## Export the model
We'll use the LDA-Bag of words model, which appears to be more accurate (based on our test documents).

In [46]:
import joblib
joblib.dump(lda_model_bow, "lda_bow.gz")

['lda_bow.gz']

## Append topics and scores to cleaned dataset

In [91]:
# This function returns the best-scoring topic and score assocated with a headline
def topic_getter(headline):
    if headline == np.nan:
        return [np.nan,np.nan]
    else:
        processed_headline = dictionary.doc2bow(preprocess(headline))
        topic_scores = sorted(lda_model_bow[processed_headline], key=lambda tup: -1*tup[1])
        topic = topic_scores[0][0]
        score = topic_scores[0][1]
        return [topic, score]

topic_getter(unseen_document)
# (8, 0.8499303)

[8, 0.849932]

In [97]:
# Load dataset with combined data
all_data = pd.read_csv('../Data/Combined_News_DJIA.csv')
all_data.head()

Unnamed: 0,Date,Label,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,...,Top16,Top17,Top18,Top19,Top20,Top21,Top22,Top23,Top24,Top25
0,2008-08-08,0,"b""Georgia 'downs two Russian warplanes' as cou...",b'BREAKING: Musharraf to be impeached.',b'Russia Today: Columns of troops roll into So...,b'Russian tanks are moving towards the capital...,"b""Afghan children raped with 'impunity,' U.N. ...",b'150 Russian tanks have entered South Ossetia...,"b""Breaking: Georgia invades South Ossetia, Rus...","b""The 'enemy combatent' trials are nothing but...",...,b'Georgia Invades South Ossetia - if Russia ge...,b'Al-Qaeda Faces Islamist Backlash',"b'Condoleezza Rice: ""The US would not act to p...",b'This is a busy day: The European Union has ...,"b""Georgia will withdraw 1,000 soldiers from Ir...",b'Why the Pentagon Thinks Attacking Iran is a ...,b'Caucasus in crisis: Georgia invades South Os...,b'Indian shoe manufactory - And again in a se...,b'Visitors Suffering from Mental Illnesses Ban...,"b""No Help for Mexico's Kidnapping Surge"""
1,2008-08-11,1,b'Why wont America and Nato help us? If they w...,b'Bush puts foot down on Georgian conflict',"b""Jewish Georgian minister: Thanks to Israeli ...",b'Georgian army flees in disarray as Russians ...,"b""Olympic opening ceremony fireworks 'faked'""",b'What were the Mossad with fraudulent New Zea...,b'Russia angered by Israeli military sale to G...,b'An American citizen living in S.Ossetia blam...,...,b'Israel and the US behind the Georgian aggres...,"b'""Do not believe TV, neither Russian nor Geor...",b'Riots are still going on in Montreal (Canada...,b'China to overtake US as largest manufacturer',b'War in South Ossetia [PICS]',b'Israeli Physicians Group Condemns State Tort...,b' Russia has just beaten the United States ov...,b'Perhaps *the* question about the Georgia - R...,b'Russia is so much better at war',"b""So this is what it's come to: trading sex fo..."
2,2008-08-12,0,b'Remember that adorable 9-year-old who sang a...,"b""Russia 'ends Georgia operation'""","b'""If we had no sexual harassment we would hav...","b""Al-Qa'eda is losing support in Iraq because ...",b'Ceasefire in Georgia: Putin Outmaneuvers the...,b'Why Microsoft and Intel tried to kill the XO...,b'Stratfor: The Russo-Georgian War and the Bal...,"b""I'm Trying to Get a Sense of This Whole Geor...",...,b'U.S. troops still in Georgia (did you know t...,b'Why Russias response to Georgia was right',"b'Gorbachev accuses U.S. of making a ""serious ...","b'Russia, Georgia, and NATO: Cold War Two'",b'Remember that adorable 62-year-old who led y...,b'War in Georgia: The Israeli connection',b'All signs point to the US encouraging Georgi...,b'Christopher King argues that the US and NATO...,b'America: The New Mexico?',"b""BBC NEWS | Asia-Pacific | Extinction 'by man..."
3,2008-08-13,0,b' U.S. refuses Israel weapons to attack Iran:...,"b""When the president ordered to attack Tskhinv...",b' Israel clears troops who killed Reuters cam...,b'Britain\'s policy of being tough on drugs is...,b'Body of 14 year old found in trunk; Latest (...,b'China has moved 10 *million* quake survivors...,"b""Bush announces Operation Get All Up In Russi...",b'Russian forces sink Georgian ships ',...,b'Elephants extinct by 2020?',b'US humanitarian missions soon in Georgia - i...,"b""Georgia's DDOS came from US sources""","b'Russian convoy heads into Georgia, violating...",b'Israeli defence minister: US against strike ...,b'Gorbachev: We Had No Choice',b'Witness: Russian forces head towards Tbilisi...,b' Quarter of Russians blame U.S. for conflict...,b'Georgian president says US military will ta...,b'2006: Nobel laureate Aleksander Solzhenitsyn...
4,2008-08-14,1,b'All the experts admit that we should legalis...,b'War in South Osetia - 89 pictures made by a ...,b'Swedish wrestler Ara Abrahamian throws away ...,b'Russia exaggerated the death toll in South O...,b'Missile That Killed 9 Inside Pakistan May Ha...,"b""Rushdie Condemns Random House's Refusal to P...",b'Poland and US agree to missle defense deal. ...,"b'Will the Russians conquer Tblisi? Bet on it,...",...,b'Bank analyst forecast Georgian crisis 2 days...,"b""Georgia confict could set back Russia's US r...",b'War in the Caucasus is as much the product o...,"b'""Non-media"" photos of South Ossetia/Georgia ...",b'Georgian TV reporter shot by Russian sniper ...,b'Saudi Arabia: Mother moves to block child ma...,b'Taliban wages war on humanitarian aid workers',"b'Russia: World ""can forget about"" Georgia\'s...",b'Darfur rebels accuse Sudan of mounting major...,b'Philippines : Peace Advocate say Muslims nee...


In [98]:
all_data.dropna(how='any', inplace=True)

In [99]:
for x in range(1,26): # for each column
    col = "Top" + str(x)
    new_col = col + "_topic_score"
    
    col_data = []
    
    # for each row in column
    # get the headline, feed it into the lda_model_bow to get the topic and score
    for index, row in all_data.iterrows():
        headline_topic = row[col]
        try: 
            col_data.append(topic_getter(headline_topic))
        except:
            print(headline_topic)
            continue
    # pass the topic and score into a new column in df
    all_data[new_col] = col_data

In [106]:
all_data.head()

Unnamed: 0,Date,Label,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,...,Top16_topic_score,Top17_topic_score,Top18_topic_score,Top19_topic_score,Top20_topic_score,Top21_topic_score,Top22_topic_score,Top23_topic_score,Top24_topic_score,Top25_topic_score
0,2008-08-08,0,"b""Georgia 'downs two Russian warplanes' as cou...",b'BREAKING: Musharraf to be impeached.',b'Russia Today: Columns of troops roll into So...,b'Russian tanks are moving towards the capital...,"b""Afghan children raped with 'impunity,' U.N. ...",b'150 Russian tanks have entered South Ossetia...,"b""Breaking: Georgia invades South Ossetia, Rus...","b""The 'enemy combatent' trials are nothing but...",...,"[7, 0.54839784]","[5, 0.49373335]","[8, 0.7479067]","[4, 0.5546607]","[5, 0.6731857]","[8, 0.8999648]","[4, 0.42172644]","[3, 0.84993804]","[2, 0.51924974]","[6, 0.52812487]"
1,2008-08-11,1,b'Why wont America and Nato help us? If they w...,b'Bush puts foot down on Georgian conflict',"b""Jewish Georgian minister: Thanks to Israeli ...",b'Georgian army flees in disarray as Russians ...,"b""Olympic opening ceremony fireworks 'faked'""",b'What were the Mossad with fraudulent New Zea...,b'Russia angered by Israeli military sale to G...,b'An American citizen living in S.Ossetia blam...,...,"[8, 0.5204197]","[1, 0.81994945]","[9, 0.6280556]","[1, 0.34079245]","[5, 0.33000782]","[8, 0.84992933]","[8, 0.3987812]","[7, 0.8199619]","[7, 0.45951775]","[4, 0.7749679]"
2,2008-08-12,0,b'Remember that adorable 9-year-old who sang a...,"b""Russia 'ends Georgia operation'""","b'""If we had no sexual harassment we would hav...","b""Al-Qa'eda is losing support in Iraq because ...",b'Ceasefire in Georgia: Putin Outmaneuvers the...,b'Why Microsoft and Intel tried to kill the XO...,b'Stratfor: The Russo-Georgian War and the Bal...,"b""I'm Trying to Get a Sense of This Whole Geor...",...,"[7, 0.6842205]","[7, 0.49105403]","[7, 0.3169475]","[7, 0.81998307]","[4, 0.6680585]","[7, 0.29200497]","[8, 0.3772373]","[1, 0.33793223]","[6, 0.37801707]","[3, 0.70385313]"
3,2008-08-13,0,b' U.S. refuses Israel weapons to attack Iran:...,"b""When the president ordered to attack Tskhinv...",b' Israel clears troops who killed Reuters cam...,b'Britain\'s policy of being tough on drugs is...,b'Body of 14 year old found in trunk; Latest (...,b'China has moved 10 *million* quake survivors...,"b""Bush announces Operation Get All Up In Russi...",b'Russian forces sink Georgian ships ',...,"[3, 0.36663938]","[7, 0.60835826]","[0, 0.5265882]","[7, 0.36622435]","[8, 0.8499715]","[7, 0.3666452]","[5, 0.6786247]","[7, 0.8497919]","[8, 0.3896271]","[7, 0.40924603]"
4,2008-08-14,1,b'All the experts admit that we should legalis...,b'War in South Osetia - 89 pictures made by a ...,b'Swedish wrestler Ara Abrahamian throws away ...,b'Russia exaggerated the death toll in South O...,b'Missile That Killed 9 Inside Pakistan May Ha...,"b""Rushdie Condemns Random House's Refusal to P...",b'Poland and US agree to missle defense deal. ...,"b'Will the Russians conquer Tblisi? Bet on it,...",...,"[6, 0.50975776]","[7, 0.4164804]","[3, 0.60738546]","[7, 0.5090185]","[1, 0.34263086]","[9, 0.45443296]","[5, 0.8199477]","[7, 0.48615962]","[5, 0.47404864]","[2, 0.53765625]"


In [108]:
# Save updated data
all_data.to_csv("../Data/Combined_News_DJIA_topics.csv")