# News headline topic analysis with Non-negative Matrix Factorization (NMF) 

The purpose of this analysis is to find dominant topics across news headlines (25 per day, over 1 year). These topics will later be correlated with daily stock market loss/gain information to understand how certain topics may influence the stock market.

## What is NMF?

Like Latent Dirichlet allocation (LDA), NMF is a way to discover higher-level topics from the individual words present in a set of documents (in our case, news headlines). You can use either method to get a sense of the overall themes in a set of documents.

LDA is an unsupervised machine learning model that works by analyzing two things: a distribution of topics in a document, and a distribution of words in a topic. The "topics" it finds aren't specific words (e.g., "This headline is about 'war'") but instead conceptually similar groups of words that together make up a theme (e.g., "This headline is similar to the words 'war', 'crisis', 'iran'...").

Once you've created your LDA model, you can feed in a document and the model will score the overall relevancy of your document against the main x topics found in your overall corpus. In other words, it will tell you which of the main topics found in the overall corpus are also found in your document, and to what extent.
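Because this notebook's title and imports point to NMF, it may help to see the same "score a document against topics" idea in NMF terms. The block below is a minimal sketch, not the method used later in this notebook: it hand-rolls NMF with Lee-Seung multiplicative updates in NumPy, whereas the notebook imports `sklearn.decomposition.NMF`. The vocabulary, the count matrix `V`, and the helper names `nmf` and `score_document` are all invented for illustration.

```python
import numpy as np

# Toy document-term count matrix: rows are headlines, columns are words.
# The vocabulary and counts are invented for illustration only.
vocab = ["war", "crisis", "iran", "market", "stock", "bank"]
V = np.array([
    [2, 1, 1, 0, 0, 0],
    [1, 2, 0, 0, 0, 0],
    [0, 0, 0, 2, 1, 1],
    [0, 0, 0, 1, 2, 1],
], dtype=float)


def nmf(V, n_topics, n_iter=200, seed=22, eps=1e-9):
    """Factor V (docs x words) into W (docs x topics) and H (topics x words)."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], n_topics))
    H = rng.random((n_topics, V.shape[1]))
    for _ in range(n_iter):
        # Multiplicative updates keep every entry of W and H non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


def score_document(v, H, n_iter=200, eps=1e-9):
    """Score one new count vector (1 x words) against the fixed topics H."""
    w = np.full((1, H.shape[0]), 0.5)
    for _ in range(n_iter):
        w *= (v @ H.T) / (w @ H @ H.T + eps)
    return w


W, H = nmf(V, n_topics=2)

# Each row of H is a topic: a weighting over the vocabulary.
for k, topic in enumerate(H):
    top_words = [vocab[i] for i in np.argsort(topic)[::-1][:3]]
    print(f"topic {k}: {top_words}")

# Score an unseen headline containing "war", "crisis" and "iran" once each.
new_doc = np.array([[1, 1, 1, 0, 0, 0]], dtype=float)
print(score_document(new_doc, H))
```

Each row of `W` (and the vector returned by `score_document`) plays the role of the per-document topic scores described above, and each row of `H` plays the role of a topic's word weights. Unlike LDA, these are non-negative weights rather than probabilities, so they do not sum to 1 unless you normalize them.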

## Why NMF?

LDA is a fairly popular topic modelling choice among NLP professionals and is relatively straightforward to implement. Using LDA, we were able to find 10 dominant themes in eight years' worth of news headlines within about an hour, a task that would take a human days.

## How does NMF perform compared to LDA?

More here later.

## Credit

Parts of this work borrow from Ravish Chawla's [NMF tutorial](https://medium.com/ml2vec/topic-modeling-is-an-unsupervised-learning-approach-to-clustering-documents-to-discover-topics-fdfbf30e27df).

In [6]:
import pandas as pd
import numpy as np
import scipy as sp
# import sklearn
# import sys

from nltk.corpus import stopwords
import nltk 
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *

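from gensim.models import ldamodel
import gensim.corpora
import gensim.utils                  # provides simple_preprocess
import gensim.parsing.preprocessing  # provides STOPWORDS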

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.decomposition import NMF
from sklearn.preprocessing import normalize
# import pickle

np.random.seed(22)

[nltk_data] Downloading package wordnet to /Users/stacy/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [2]:
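# Import the data, skipping malformed rows.
# (on_bad_lines replaces error_bad_lines, which was removed in pandas 2.0.)
data = pd.read_csv("../Data/RedditNews.csv", on_bad_lines="skip")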

In [4]:
# Getting just the headlines for our corpus
headlines = data[['News']]
del data
headlines.head()

Unnamed: 0,News
0,A 117-year-old woman in Mexico City finally re...
1,IMF chief backs Athens as permanent Olympic host
2,"The president of France says if Brexit won, so..."
3,British Man Who Must Give Police 24 Hours' Not...
4,100+ Nobel laureates urge Greenpeace to stop o...


## Data preprocessing
### Lemmatize

In [9]:
def lemmatize(text):
    # pos='v' tells WordNet to treat the token as a verb (e.g. 'says' -> 'say')
    return WordNetLemmatizer().lemmatize(text, pos='v')

In [10]:
# Remove stopwords and words of 3 characters or fewer, then lemmatize
def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize(token))
    return result
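
To make the length cutoff concrete, here is a rough stand-alone sketch of the same filtering idea. It assumes a simple regex tokenizer in place of `gensim.utils.simple_preprocess` and a tiny hypothetical stopword set (`TOY_STOPWORDS`) in place of gensim's `STOPWORDS`, and it skips lemmatization, so its output is not identical to what `preprocess` returns.

```python
import re

# Hypothetical stand-in for gensim's STOPWORDS; the real set is much larger.
TOY_STOPWORDS = {"this", "that", "with", "from", "have"}


def filter_tokens(text, min_len=4):
    """Lowercase, split on non-letters, then drop stopwords and short tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    # len(t) >= 4 is the same test as len(t) > 3 in preprocess().
    return [t for t in tokens if t not in TOY_STOPWORDS and len(t) >= min_len]


print(filter_tokens("The president of France says if Brexit won, so can Donald Trump"))
# ['president', 'france', 'says', 'brexit', 'donald', 'trump']
```

Note that three-letter words such as "won" are dropped by the length test alone, which is why the cutoff matters as much as the stopword list for short headlines.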

In [11]:
sample = headlines['News'][2]

print('original document: ')
print(sample.split(' '))
print('\n\n tokenized and lemmatized document: ')
print(preprocess(sample))

original document: 
['The', 'president', 'of', 'France', 'says', 'if', 'Brexit', 'won,', 'so', 'can', 'Donald', 'Trump']


 tokenized and lemmatized document: 
['president', 'france', 'say', 'brexit', 'donald', 'trump']


In [12]:
cleaned_headlines = headlines['News'].map(preprocess)
cleaned_headlines[:5] # Check the results

0    [year, woman, mexico, city, finally, receive, ...
1      [chief, back, athens, permanent, olympic, host]
2      [president, france, say, brexit, donald, trump]
3    [british, police, hours, notice, threaten, hun...
4    [nobel, laureates, urge, greenpeace, stop, opp...
Name: News, dtype: object