# News headline topic analysis with Non-negative Matrix Factorization (NMF) 

The purpose of this analysis is to find dominant topics across news headlines (25 per day, over 1 year). These topics will later be correlated with daily stock market loss/gain information to understand how certain topics may influence the stock market.

## What is NMF?

Non-negative Matrix Factorization is a mathematical technique that when applied to documents can take data with many features (e.g. 1000s of topics) and convert them into a smaller set of topics. It is similar to LDA in that it is a way to discover higher-level topics out of individual words present in any set of document (in our case, news headlines). You can use NMF to get a sense of the overall themes in a set of documents.

NMF is an unsupervised machine learning model that works by taking your matrix _A_ of documents x words (e.g. 50 documents and 5000 words) and returning topics (_W_) and weights/coefficients (_H_) for the topics.

As Rob Salgado explains in his [excellent article on NMF](https://towardsdatascience.com/topic-modeling-articles-with-nmf-8c6b2a227a45), "NMF will modify the initial values of W and H so that the product approaches A until either the approximation error converges or the max iterations are reached." 

Like LDA, the "topics" that NMF finds aren't specific words (e.g., "This headline is about 'war'") but instead conceptually similar groups of words that together make up a theme (e.g., "This headline is similar to the words 'war', 'crisis', 'iran'...").

Once you've created your NMF model, you can feed in a document and the model will score the overall relevancy of your document against the main _x_ topics found in your overall corpus. In other words, it will tell you which of the main topics found in the overall corpus are also found in your document, and to what extent.

## Why NMF?

NMF is regarded to be superior to LDA in terms of efficiency and accuracy, though it does not appear to be as popular. 

## How does NMF perform compared to LDA?

More here later.

## Credit

Parts of this work borrow from Rob Salgado's [excellent NMF tutorial](https://towardsdatascience.com/topic-modeling-articles-with-nmf-8c6b2a227a45), as well as Ravish Chawla's [NMF tutorial](https://medium.com/ml2vec/topic-modeling-is-an-unsupervised-learning-approach-to-clustering-documents-to-discover-topics-fdfbf30e27df).

In [98]:
import pandas as pd
import numpy as np
import scipy as sp

import nltk 
from nltk.corpus import stopwords
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *

import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.preprocessing import normalize
from sklearn.model_selection import train_test_split

np.random.seed(22)

[nltk_data] Downloading package wordnet to /Users/stacy/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [99]:
# Import the data
data = pd.read_csv("../Data/RedditNews.csv", error_bad_lines=False)

In [100]:
# Getting just the headlines for our corpus
headlines = data[['News']]
headlines.head()

Unnamed: 0,News
0,A 117-year-old woman in Mexico City finally re...
1,IMF chief backs Athens as permanent Olympic host
2,"The president of France says if Brexit won, so..."
3,British Man Who Must Give Police 24 Hours' Not...
4,100+ Nobel laureates urge Greenpeace to stop o...


## Data preprocessing
### Lemmitize

In [101]:
def lemmatize(text):
    return WordNetLemmatizer().lemmatize(text, pos='v') # pos='v' means it peforms stemming with context

In [102]:
# Remove stopwords and words shorter than 3 characters, then lemmatize
def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize(token))
    return result

In [103]:
sample = headlines['News'][2]

print('original document: ')
words = []
for word in sample.split(' '):
    words.append(word)
print(words)
print('\n\n tokenized and lemmatized document: ')
print(preprocess(sample))

original document: 
['The', 'president', 'of', 'France', 'says', 'if', 'Brexit', 'won,', 'so', 'can', 'Donald', 'Trump']


 tokenized and lemmatized document: 
['president', 'france', 'say', 'brexit', 'donald', 'trump']


In [104]:
headlines['cleaned_headlines']  = headlines['News'].map(preprocess)
headlines['cleaned_headlines'][:5] # Check the results

train, test = train_test_split(headlines['cleaned_headlines'], test_size=0.1)

In [105]:
test

41624    [portugal, raid, pension, fund, meet, deficit,...
5620     [fearless, father, throw, suicide, bomber, sav...
65161    [reddit, spend, morning, write, brief, history...
72850              [vote, force, isps, disconnect, pirate]
59075    [chvez, order, jet, intercept, military, plane...
                               ...                        
68615                                [england, households]
70154              [iran, hold, american, student, prison]
55963    [turkey, position, missiles, repulse, israeli,...
23059    [china, moon, rover, activate, science, tool, ...
64850         [spanish, intelligence, agents, expel, cuba]
Name: cleaned_headlines, Length: 7361, dtype: object

## Vectorize and transform the text using TF-IDF
Sentence here about why we use TF-IDF in this case and not BoW.

In [106]:
# Vectorize the headlines using TF-IDF to create features to train our model on
texts = train
dictionary = gensim.corpora.Dictionary(texts)

'''
Filter out irrelevant words:
Keep tokens that appear in at least 15 documents
Keep only the 100,000 most frequent tokens
'''
dictionary.filter_extremes(no_below=15, keep_n=100000)

tfidf_vectorizer = TfidfVectorizer(
    min_df=3, # ignore words that appear in less than 3 articles
    max_df=0.85, # ignore words that appear in more than 85% of articles
    max_features=5000, # limit the number of important words to 5,000
    ngram_range=(1, 2), # look for both words and two-word phrases
    preprocessor=' '.join # join the tokenized words instead of creating a list
)

tfidf = tfidf_vectorizer.fit_transform(texts)

## Create and fit the NMF model on our headlines

In [138]:
# Create NMF model and fit it
# We are using ten topic groupings here so it is directly comparable to the LDA model's performance

'''
"Nonnegative Double Singular Value Decomposition (NNDSVD) is a new method designed to enhance the 
initialization stage of the nonnegative matrix factorization.

"NNDSVD is well suited to initialize NMF algorithms with sparse factors."" - http://nimfa.biolab.si/nimfa.methods.seeding.nndsvd.html
'''
model = NMF(n_components=10, init='nndsvd')
model.fit(tfidf)

NMF(alpha=0.0, beta_loss='frobenius', init='nndsvd', l1_ratio=0.0, max_iter=200,
    n_components=10, random_state=None, shuffle=False, solver='cd', tol=0.0001,
    verbose=0)

## Check our NMF topics

In [139]:
# Print out the topics to visually inspect them

def get_nmf_topics(model, n_top_words):
    
    #the word ids obtained need to be reverse-mapped to the words so we can print the topic names.
    feat_names = tfidf_vectorizer.get_feature_names()
    
    word_dict = {};
    for i in range(10):
        
        #for each topic, obtain the largest values, and add the words they map to into the dictionary.
        words_ids = model.components_[i].argsort()[:10 - 1:-1]
        words = [feat_names[key] for key in words_ids]
        word_dict['Topic #' + '{:2d}'.format(i+1)] = words
    
    return pd.DataFrame(word_dict)

get_nmf_topics(model, 10)

Unnamed: 0,Topic # 1,Topic # 2,Topic # 3,Topic # 4,Topic # 5,Topic # 6,Topic # 7,Topic # 8,Topic # 9,Topic #10
0,police,gaza,isis,korea,ukraine,wikileaks,say,snowden,libya,georgia
1,iran,israel,islamic state,north,russia,assange,china,edward,egypt,russia
2,kill,israeli,islamic,north korea,russian,julian,world,edward snowden,protest,ossetia
3,people,hamas,ebola,south,crimea,julian assange,climate,spy,protesters,georgian
4,afghanistan,palestinians,state,korean,ukrainian,cable,change,surveillance,egyptian,south ossetia
...,...,...,...,...,...,...,...,...,...,...
4985,gadhafi,factor,libyans,jail years,latest,mark,onion,methane,humans,sample
4986,gaddafi,shouldn,libyan rebel,receive,later,marine,depth,mexican police,surface,samsung
4987,fukushima plant,factory,libyan,elites,lash,marijuana,omar,research,human shield,historic
4988,fukushima nuclear,shots,libya,record number,larger,marathon,describe,resistant,surveillance,hire


So far, these topics seem to be a lot more coherent than what the LDA model produced. There's not only better in-topic coherence (i.e. the words relate to each other well) but also distinctions between topics (i.e. there's less keyword overlap from topic to topic).

## Testing the model on new data

In [115]:
# https://stackabuse.com/python-for-nlp-topic-modeling/

df = pd.DataFrame()

# See how well our model performed by using the test data
tfidf_test = tfidf_vectorizer.transform(test)
X_test = model.transform(tfidf_test)

# Get the top predicted topic
predicted_topics = [np.argsort(each)[::-1][0] + 1 for each in X_test]    

# Add to the df
df['test'] = test
df.reset_index(drop=True, inplace=True)
df['pred_topic_num'] = predicted_topics

df.head()

Unnamed: 0,test,pred_topic_num
0,"[portugal, raid, pension, fund, meet, deficit,...",1
1,"[fearless, father, throw, suicide, bomber, sav...",4
2,"[reddit, spend, morning, write, brief, history...",1
3,"[vote, force, isps, disconnect, pirate]",1
4,"[chvez, order, jet, intercept, military, plane...",5


## Find a single topic per day

Sorting the headlines into topics isn't working so well against the test data. Maybe more topics are better? This is where a **coherence score** would come in: it is a score that tells you how "coherent" (closely related) the words within a topic are, and you can use it to [automatically select the best number of topics](https://towardsdatascience.com/topic-modeling-articles-with-nmf-8c6b2a227a45) to train your model on. That is currently beyond the scope of this project, but will be the next area for exploration.

Since we know that we want to find one topic per day to feed into other models, we'll now take our trained model and set it loose on the combined headlines from each day.

In [151]:
single_topic = pd.read_csv("../Data/Combined_News_DJIA_single_topic.csv")
single_topic.head()

Unnamed: 0,Date,Label,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,...,Top19,Top20,Top21,Top22,Top23,Top24,Top25,combined_headlines,daily_topic,daily_words
0,2008-08-08,0,"b""Georgia 'downs two Russian warplanes' as cou...",b'BREAKING: Musharraf to be impeached.',b'Russia Today: Columns of troops roll into So...,b'Russian tanks are moving towards the capital...,"b""Afghan children raped with 'impunity,' U.N. ...",b'150 Russian tanks have entered South Ossetia...,"b""Breaking: Georgia invades South Ossetia, Rus...","b""The 'enemy combatent' trials are nothing but...",...,b'This is a busy day: The European Union has ...,"b""Georgia will withdraw 1,000 soldiers from Ir...",b'Why the Pentagon Thinks Attacking Iran is a ...,b'Caucasus in crisis: Georgia invades South Os...,b'Indian shoe manufactory - And again in a se...,b'Visitors Suffering from Mental Illnesses Ban...,"b""No Help for Mexico's Kidnapping Surge""",Georgia 'downs two Russian warplanes' as coun...,1,"0.006*""russian"" + 0.006*""russia"" + 0.004*""forc..."
1,2008-08-11,1,b'Why wont America and Nato help us? If they w...,b'Bush puts foot down on Georgian conflict',"b""Jewish Georgian minister: Thanks to Israeli ...",b'Georgian army flees in disarray as Russians ...,"b""Olympic opening ceremony fireworks 'faked'""",b'What were the Mossad with fraudulent New Zea...,b'Russia angered by Israeli military sale to G...,b'An American citizen living in S.Ossetia blam...,...,b'China to overtake US as largest manufacturer',b'War in South Ossetia [PICS]',b'Israeli Physicians Group Condemns State Tort...,b' Russia has just beaten the United States ov...,b'Perhaps *the* question about the Georgia - R...,b'Russia is so much better at war',"b""So this is what it's come to: trading sex fo...",Why wont America and Nato help us? If they wo...,1,"0.006*""russian"" + 0.006*""russia"" + 0.004*""forc..."
2,2008-08-12,0,b'Remember that adorable 9-year-old who sang a...,"b""Russia 'ends Georgia operation'""","b'""If we had no sexual harassment we would hav...","b""Al-Qa'eda is losing support in Iraq because ...",b'Ceasefire in Georgia: Putin Outmaneuvers the...,b'Why Microsoft and Intel tried to kill the XO...,b'Stratfor: The Russo-Georgian War and the Bal...,"b""I'm Trying to Get a Sense of This Whole Geor...",...,"b'Russia, Georgia, and NATO: Cold War Two'",b'Remember that adorable 62-year-old who led y...,b'War in Georgia: The Israeli connection',b'All signs point to the US encouraging Georgi...,b'Christopher King argues that the US and NATO...,b'America: The New Mexico?',"b""BBC NEWS | Asia-Pacific | Extinction 'by man...",Remember that adorable 9-year-old who sang at...,1,"0.006*""russian"" + 0.006*""russia"" + 0.004*""forc..."
3,2008-08-13,0,b' U.S. refuses Israel weapons to attack Iran:...,"b""When the president ordered to attack Tskhinv...",b' Israel clears troops who killed Reuters cam...,b'Britain\'s policy of being tough on drugs is...,b'Body of 14 year old found in trunk; Latest (...,b'China has moved 10 *million* quake survivors...,"b""Bush announces Operation Get All Up In Russi...",b'Russian forces sink Georgian ships ',...,"b'Russian convoy heads into Georgia, violating...",b'Israeli defence minister: US against strike ...,b'Gorbachev: We Had No Choice',b'Witness: Russian forces head towards Tbilisi...,b' Quarter of Russians blame U.S. for conflict...,b'Georgian president says US military will ta...,b'2006: Nobel laureate Aleksander Solzhenitsyn...,U.S. refuses Israel weapons to attack Iran: ...,1,"0.006*""russian"" + 0.006*""russia"" + 0.004*""forc..."
4,2008-08-14,1,b'All the experts admit that we should legalis...,b'War in South Osetia - 89 pictures made by a ...,b'Swedish wrestler Ara Abrahamian throws away ...,b'Russia exaggerated the death toll in South O...,b'Missile That Killed 9 Inside Pakistan May Ha...,"b""Rushdie Condemns Random House's Refusal to P...",b'Poland and US agree to missle defense deal. ...,"b'Will the Russians conquer Tblisi? Bet on it,...",...,"b'""Non-media"" photos of South Ossetia/Georgia ...",b'Georgian TV reporter shot by Russian sniper ...,b'Saudi Arabia: Mother moves to block child ma...,b'Taliban wages war on humanitarian aid workers',"b'Russia: World ""can forget about"" Georgia\'s...",b'Darfur rebels accuse Sudan of mounting major...,b'Philippines : Peace Advocate say Muslims nee...,All the experts admit that we should legalise...,1,"0.006*""russian"" + 0.006*""russia"" + 0.004*""forc..."


In [152]:
# Clean and tokenize headlines 
single_topic['tokenized'] = single_topic['combined_headlines'].map(preprocess)

In [159]:
# Vectorize the headlines using TF-IDF to create features to train our model on
dictionary = gensim.corpora.Dictionary(single_topic['tokenized'])

'''
Filter out irrelevant words:
Keep tokens that appear in at least 15 documents
Keep only the 100,000 most frequent tokens
'''
dictionary.filter_extremes(no_below=15, keep_n=100000)

tfidf_vectorizer = TfidfVectorizer(
    min_df=3, # ignore words that appear in less than 3 articles
    max_df=0.85, # ignore words that appear in more than 85% of articles
    max_features=5000, # limit the number of important words to 5,000
    ngram_range=(1, 2), # look for both words and two-word phrases
    preprocessor=' '.join # join the tokenized words instead of creating a list
)

# Helped by https://stackabuse.com/python-for-nlp-topic-modeling/
doctermmatrix = tfidf_vectorizer.fit_transform(single_topic['tokenized'])

topic_values = model.transform(doctermmatrix)
single_topic['nmf_topic'] = topic_values.argmax(axis=1)
single_topic.tail()

In [173]:
# Create human readable topic word lists to append to df
nmf_topics = []

for i,topic in enumerate(model.components_):
    nmf_topics.append([tfidf_vectorizer.get_feature_names()[i] for i in topic.argsort()[-10:]])
    
def get_topic_list(topic):
    return nmf_topics[topic]
    
single_topic['nmf_topic_readable'] = single_topic['nmf_topic'].apply(get_topic_list)

single_topic.rename(columns={
    'daily_topic': 'lda_topic',
    'daily_words': 'lda_topic_readable'
}, inplace=True)

single_topic.drop(labels=['tokenized'], axis=1, inplace=True)

In [175]:
# Save updated data
single_topic.to_csv("../Data/Combined_News_DJIA_single_topic.csv", index=False)