# News headline topic analysis with Non-negative Matrix Factorization (NMF) 

The purpose of this analysis is to find dominant topics across news headlines (25 per day, over 1 year). These topics will later be correlated with daily stock market loss/gain information to understand how certain topics may influence the stock market.

## What is NMF?

Non-negative Matrix Factorization is a mathematical technique that, when applied to documents, can take data with many features (e.g. 1000s of words) and reduce them to a smaller set of topics. It is similar to LDA in that it is a way to discover higher-level topics from the individual words present in a set of documents (in our case, news headlines). You can use NMF to get a sense of the overall themes in a set of documents.

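NMF is an unsupervised machine learning model that works by taking your non-negative matrix _A_ of documents x words (e.g. 50 documents and 5000 words) and factorizing it into two smaller non-negative matrices. In scikit-learn's convention, _W_ (documents x topics) holds the weight of each topic in each document, and _H_ (topics x words) holds the weight of each word in each topic.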

As Rob Salgado explains in his [excellent article on NMF](https://towardsdatascience.com/topic-modeling-articles-with-nmf-8c6b2a227a45), "NMF will modify the initial values of W and H so that the product approaches A until either the approximation error converges or the max iterations are reached." 
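The iterative idea in the quote above can be sketched in a few lines of NumPy. This is a minimal illustration using the classic multiplicative update rules (Lee and Seung), not the coordinate-descent solver that scikit-learn's `NMF` uses by default. The function name `nmf_multiplicative`, the toy matrix, and the stopping rule are assumptions made for this example.

```python
import numpy as np

def nmf_multiplicative(A, n_topics, max_iter=200, tol=1e-4, seed=0):
    """Factor a non-negative matrix A (docs x words) into
    W (docs x topics) and H (topics x words) so that W @ H approximates A."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = A.shape
    # Random non-negative starting values for both factors
    W = rng.random((n_docs, n_topics))
    H = rng.random((n_topics, n_words))
    eps = 1e-10  # guards against division by zero
    errors = []
    for _ in range(max_iter):
        # Multiplicative updates for the Frobenius-norm objective;
        # they keep every entry of W and H non-negative
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
        errors.append(np.linalg.norm(A - W @ H))
        # Stop early once the error has nearly stopped improving
        if len(errors) > 1 and errors[-2] - errors[-1] < tol * errors[0]:
            break
    return W, H, errors

# Toy matrix (hypothetical data): 6 "documents" x 8 "words" of non-negative counts
A = np.random.default_rng(1).integers(0, 5, size=(6, 8)).astype(float)
W, H, errors = nmf_multiplicative(A, n_topics=3)

print("W shape:", W.shape, "H shape:", H.shape)
print("error after first iteration:", errors[0])
print("error at stop:", errors[-1])
```

The loop ends either when the approximation error stops improving or when `max_iter` is reached, which mirrors the two stopping conditions described in the quote.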

Like LDA, the "topics" that NMF finds aren't specific words (e.g., "This headline is about 'war'") but instead conceptually similar groups of words that together make up a theme (e.g., "This headline is similar to the words 'war', 'crisis', 'iran'...").

Once you've created your NMF model, you can feed in a document and the model will score the overall relevancy of your document against the main _x_ topics found in your overall corpus. In other words, it will tell you which of the main topics found in the overall corpus are also found in your document, and to what extent.
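As a rough illustration of scoring a new document against the learned topics, the sketch below fits a two-topic model on a tiny made-up corpus and then transforms an unseen headline. The corpus, the `toy_` variable names, and the choice of two topics are all assumptions for the example and are unrelated to the real headline data.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus with two loose themes
toy_corpus = [
    "iran nuclear deal sanctions",
    "iran nuclear weapons talks",
    "world bank global economy",
    "global economy bank crisis",
]

toy_vectorizer = TfidfVectorizer()
toy_A = toy_vectorizer.fit_transform(toy_corpus)

toy_nmf = NMF(n_components=2, init="nndsvd", max_iter=500)
toy_W = toy_nmf.fit_transform(toy_A)  # documents x topics

# Score an unseen document using the already-fitted vectorizer and model
new_doc = ["nuclear sanctions on iran"]
scores = toy_nmf.transform(toy_vectorizer.transform(new_doc))

# One non-negative weight per topic; the largest marks the dominant topic
top_topic = scores.argmax(axis=1)[0]
print(scores)
print("dominant topic index:", top_topic)
```

Each row of `scores` contains one weight per topic, so the same call works for a batch of new documents.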

## Why NMF?

NMF is often regarded as faster than LDA and as producing more coherent topics, particularly on short texts, though it does not appear to be as popular.

## How does NMF perform compared to LDA?

More here later.

## Credit

Parts of this work borrow from Rob Salgado's [excellent NMF tutorial](https://towardsdatascience.com/topic-modeling-articles-with-nmf-8c6b2a227a45), as well as Ravish Chawla's [NMF tutorial](https://medium.com/ml2vec/topic-modeling-is-an-unsupervised-learning-approach-to-clustering-documents-to-discover-topics-fdfbf30e27df).

In [13]:
import pandas as pd
import numpy as np
import scipy as sp

import nltk 
from nltk.corpus import stopwords
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *

import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.preprocessing import normalize

np.random.seed(22)

[nltk_data] Downloading package wordnet to /Users/stacy/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [2]:
# Import the data
data = pd.read_csv("../Data/RedditNews.csv", on_bad_lines="skip")  # skip malformed rows (pandas >= 1.3; older versions used error_bad_lines=False)

In [3]:
# Getting just the headlines for our corpus
headlines = data[['News']].copy()  # copy so that adding a column later does not write to a view of data
del data
headlines.head()

Unnamed: 0,News
0,A 117-year-old woman in Mexico City finally re...
1,IMF chief backs Athens as permanent Olympic host
2,"The president of France says if Brexit won, so..."
3,British Man Who Must Give Police 24 Hours' Not...
4,100+ Nobel laureates urge Greenpeace to stop o...


## Data preprocessing
### Lemmatize

In [4]:
def lemmatize(text):
    return WordNetLemmatizer().lemmatize(text, pos='v') # pos='v' treats the token as a verb, so inflected forms map to the base form (e.g. 'says' -> 'say')

In [5]:
# Remove stopwords and words of 3 characters or fewer, then lemmatize
def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize(token))
    return result

In [6]:
sample = headlines['News'][2]

print('original document: ')
words = []
for word in sample.split(' '):
    words.append(word)
print(words)
print('\n\n tokenized and lemmatized document: ')
print(preprocess(sample))

original document: 
['The', 'president', 'of', 'France', 'says', 'if', 'Brexit', 'won,', 'so', 'can', 'Donald', 'Trump']


 tokenized and lemmatized document: 
['president', 'france', 'say', 'brexit', 'donald', 'trump']


In [19]:
headlines['cleaned_headlines']  = headlines['News'].map(preprocess)
headlines['cleaned_headlines'][:5] # Check the results

0    [year, woman, mexico, city, finally, receive, ...
1      [chief, back, athens, permanent, olympic, host]
2      [president, france, say, brexit, donald, trump]
3    [british, police, hours, notice, threaten, hun...
4    [nobel, laureates, urge, greenpeace, stop, opp...
Name: cleaned_headlines, dtype: object

## Vectorize and transform the text using TF-IDF
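We use TF-IDF rather than raw bag-of-words (BoW) counts because TF-IDF down-weights words that appear in many headlines, so that terms distinctive to a subset of headlines carry more weight in the factorization.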
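To illustrate the difference between bag-of-words counts and TF-IDF weights, here is a small sketch on three made-up headlines in which the word "say" appears in every document. The toy headlines and the `toy_`/`bow_` variable names are assumptions for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical headlines: "say" occurs in all of them, the other words in one each
toy_headlines = ["police say protest", "korea say launch", "bank say crisis"]

bow_vectorizer = CountVectorizer().fit(toy_headlines)
toy_tfidf_vectorizer = TfidfVectorizer().fit(toy_headlines)

counts = bow_vectorizer.transform(toy_headlines).toarray()
weights = toy_tfidf_vectorizer.transform(toy_headlines).toarray()

# Column indices of the two words in each vocabulary
bow_say, bow_korea = bow_vectorizer.vocabulary_["say"], bow_vectorizer.vocabulary_["korea"]
tf_say, tf_korea = toy_tfidf_vectorizer.vocabulary_["say"], toy_tfidf_vectorizer.vocabulary_["korea"]

# In the second headline both words occur once, so BoW treats them equally...
print("BoW counts:", counts[1, bow_say], counts[1, bow_korea])
# ...while TF-IDF gives the ubiquitous word a smaller weight than the rarer one
print("TF-IDF weights:", weights[1, tf_say], weights[1, tf_korea])
```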

In [20]:
# Vectorize the headlines using TF-IDF to create features to train our model on
texts = headlines['cleaned_headlines']
dictionary = gensim.corpora.Dictionary(texts)

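'''
Filter out irrelevant words:
Keep tokens that appear in at least 15 documents
Keep only the 100,000 most frequent tokens
(by default, filter_extremes also drops tokens that appear in more than 50% of documents, via no_above=0.5)
Note: this gensim dictionary is not passed to the TfidfVectorizer below,
which applies its own min_df / max_df / max_features filtering.
'''
dictionary.filter_extremes(no_below=15, keep_n=100000)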

tfidf_vectorizer = TfidfVectorizer(
    min_df=3, # ignore terms that appear in fewer than 3 headlines
    max_df=0.85, # ignore terms that appear in more than 85% of headlines
    max_features=5000, # keep only the 5,000 most frequent terms
    ngram_range=(1, 2), # look for both words and two-word phrases
    preprocessor=' '.join # join the tokenized words instead of creating a list
)

tfidf = tfidf_vectorizer.fit_transform(texts)

In [26]:
# Add a step here creating train-test splits (or should it come before we do TF-IDF?)
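
One common answer to the question in the comment above is to split first and fit the vectorizer on the training portion only, so that the vocabulary and IDF weights are not influenced by held-out headlines. The sketch below shows that ordering on a few made-up token lists. The toy data, the `test_size` value, and the variable names are assumptions, not the notebook's actual split.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical pre-tokenized headlines, in the same list-of-tokens form as cleaned_headlines
toy_texts = [
    ["police", "arrest", "protest"],
    ["north", "korea", "launch"],
    ["world", "bank", "global"],
    ["iran", "nuclear", "deal"],
    ["police", "protest", "government"],
    ["korea", "nuclear", "launch"],
]

# 1. Split before any fitting
train_texts, test_texts = train_test_split(toy_texts, test_size=0.33, random_state=22)

# 2. Fit TF-IDF on the training headlines only
split_vectorizer = TfidfVectorizer(preprocessor=' '.join)
tfidf_train = split_vectorizer.fit_transform(train_texts)

# 3. Reuse the fitted vocabulary and IDF weights for the held-out headlines
tfidf_test = split_vectorizer.transform(test_texts)

print(tfidf_train.shape, tfidf_test.shape)
```

Words that occur only in the held-out headlines are ignored at transform time, which is the intended behavior when simulating unseen data.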

## Create and fit the NMF model on our headlines

In [21]:
# Create NMF model and fit it
# We are using ten topic groupings here so it is directly comparable to the LDA model's performance

'''
Nonnegative Double Singular Value Decomposition (NNDSVD) [Boutsidis2007] 
is a new method designed to enhance the initialization stage of the nonnegative matrix factorization.

NNDSVD is well suited to initialize NMF algorithms with sparse factors. - http://nimfa.biolab.si/nimfa.methods.seeding.nndsvd.html
'''
model = NMF(n_components=10, init='nndsvd')
model.fit(tfidf)

NMF(alpha=0.0, beta_loss='frobenius', init='nndsvd', l1_ratio=0.0, max_iter=200,
    n_components=10, random_state=None, shuffle=False, solver='cd', tol=0.0001,
    verbose=0)

## Check our NMF topics

In [23]:
# Print out the topics to visually inspect them

def get_nmf_topics(model, n_top_words):

    # The word ids obtained need to be reverse-mapped to the words so we can print the topic names.
    # get_feature_names_out() requires scikit-learn >= 1.0; older versions used get_feature_names().
    feat_names = tfidf_vectorizer.get_feature_names_out()

    word_dict = {}
    # One row of model.components_ per topic
    for i in range(model.components_.shape[0]):

        # For each topic, take the indices of the n_top_words largest weights (descending)
        # and add the words they map to into the dictionary.
        words_ids = model.components_[i].argsort()[:-n_top_words - 1:-1]
        words = [feat_names[key] for key in words_ids]
        word_dict['Topic #' + '{:2d}'.format(i+1)] = words

    return pd.DataFrame(word_dict)

get_nmf_topics(model, 10)

Unnamed: 0,Topic # 1,Topic # 2,Topic # 3,Topic # 4,Topic # 5,Topic # 6,Topic # 7,Topic # 8,Topic # 9,Topic #10
0,police,korea,israel,kill,russia,china,iran,world,say,saudi
1,protest,north,gaza,attack,ukraine,chinese,nuclear,bank,minister,arabia
2,government,north korea,israeli,pakistan,russian,japan,attack,news,right,saudi arabia
3,year,south,palestinian,bomb,putin,india,iranian,largest,prime,women
4,arrest,south korea,hamas,people,syria,south china,iran nuclear,world bank,prime minister,yemen
5,years,korean,palestinians,strike,military,build,sanction,global,human,right
6,people,north korean,bank,soldier,nato,beijing,weapons,countries,human right,behead
7,state,nuclear,west bank,isis,georgia,south,plant,world news,president,king
8,drug,jong,west,syria,warn,china say,power,change,state,human
9,force,launch,rocket,civilians,troop,billion,deal,biggest,official,human right


So far, these topics seem to be a lot more coherent than what the LDA model produced. There's not only better in-topic coherence (i.e. the words relate to each other well) but also clearer distinctions between topics (i.e. there's less keyword overlap from topic to topic, although a few terms such as "human right" still appear in more than one topic).
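
One rough way to put a number on keyword overlap is the Jaccard similarity between the top-word lists of two topics (shared words divided by total distinct words). The sketch below applies it to four of the topics from the table above. The choice of Jaccard similarity and the helper name `jaccard` are assumptions made for illustration, not part of the original analysis.

```python
from itertools import combinations

def jaccard(a, b):
    """Share of distinct words that two word lists have in common."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Top-10 words copied from the topic table above (topics 1, 2, 9 and 10)
topics = {
    1: ["police", "protest", "government", "year", "arrest",
        "years", "people", "state", "drug", "force"],
    2: ["korea", "north", "north korea", "south", "south korea",
        "korean", "north korean", "nuclear", "jong", "launch"],
    9: ["say", "minister", "right", "prime", "prime minister",
        "human", "human right", "president", "state", "official"],
    10: ["saudi", "arabia", "saudi arabia", "women", "yemen",
         "right", "behead", "king", "human", "human right"],
}

# Print the overlap for every pair of the selected topics
for i, j in combinations(topics, 2):
    shared = sorted(set(topics[i]) & set(topics[j]))
    print(f"Topic {i} vs Topic {j}: {jaccard(topics[i], topics[j]):.3f} {shared}")
```

A value of 0 means the two topics share no top words, and values closer to 1 mean heavier overlap. The same function could be run on the LDA model's top words for a like-for-like comparison.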

In [None]:
# Transform the new data with the fitted vectorizer and NMF model
# tfidf_new = tfidf_vectorizer.transform(new_texts)
# X_new = model.transform(tfidf_new)

# # Get the top predicted topic
# predicted_topics = [np.argsort(each)[::-1][0] for each in X_new]

# # Add to the df
# df_new['pred_topic_num'] = predicted_topics

# df_new