# Testing recent news headline topic analysis with an existing Non-negative Matrix Factorization (NMF) model

Here, we test our existing NMF topic model against a more recent batch of news headlines pulled from CNN, covering April 20 to May 15, 2020.

Our assumption is that the world has changed so drastically since 2016 that our existing topic model will perform poorly and a new topic model will generate entirely different categories.

## Credit

Parts of this work borrow from Rob Salgado's [excellent NMF tutorial](https://towardsdatascience.com/topic-modeling-articles-with-nmf-8c6b2a227a45), as well as Ravish Chawla's [NMF tutorial](https://medium.com/ml2vec/topic-modeling-is-an-unsupervised-learning-approach-to-clustering-documents-to-discover-topics-fdfbf30e27df).

In [34]:
import pandas as pd
import numpy as np
import scipy as sp

import nltk 
from nltk.corpus import stopwords
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *

import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.preprocessing import normalize
from sklearn.model_selection import train_test_split

np.random.seed(22)

# Load the pre-trained model
import joblib
nmf = joblib.load('nmf_improved.gz')

[nltk_data] Downloading package wordnet to /Users/stacy/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [35]:
# Import the data
data = pd.read_csv("../Data/Combined_News_NASDAQ.csv")
data.head()

Unnamed: 0,Date,Label,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,...,Top21,Top22,Top23,Top24,Top25,combined_headlines,daily_topic_existing_lda,daily_words_existing_lda,daily_topic_new_lda,daily_words_new_lda
0,4/20/20,0,"The US and China Want a Divorce, but Neither C...",The OnePlus 8 series is much cheaper in India ...,Facebook launches COVID-19 data maps for the U...,A guide to the NFL Draft: How to watch and wha...,"How to Do Less, With Journalist Celeste Headlee","What It Will Take to Reopen Everything, Accord...",Coronavirus: Reporter challenges Trump over Cu...,Deepwater Horizon: a decade of disaster,...,Severe weather hits Southern US,Jane Goodall says COVID-19 arose from our disr...,"Zoom, Skype, Microsoft Teams: Why you should c...",Facebook's interactive COVID-19 map displays s...,"All the things COVID-19 will change forever, a...",The US and China Want a Divorce but Neither C...,1,"0.006*""russian"" + 0.005*""russia"" + 0.004*""mill...",8,"0.015*""launch"" + 0.012*""want"" + 0.012*""change""..."
1,4/21/20,0,Viruses Make Us Question What It Means to Be A...,Can You Be Happy in Quarantine?,Dr. Gupta takes virus test,How to End a Video Call Politely,The US Cities With the Worst Air Pollution All...,Marketing data platform Adverity raises $30M S...,Trump claims he will temporarily suspend immig...,What a century of fighting the flu tells us ab...,...,Telecom's Latest Dumb Claim: The Internet Only...,US hydroxychloroquine study shows no benefit t...,Razer's updated Blade Stealth gets a faster di...,The bootstrapper creates value,The bootstrapper creates value,Viruses Make Us Question What It Means to Be A...,1,"0.006*""russian"" + 0.005*""russia"" + 0.004*""mill...",0,"0.011*""world"" + 0.011*""right"" + 0.011*""time"" +..."
2,4/22/20,1,Herd Immunity Won't Save Us,Sometimes the Best Cocktail Is a Glass of Wine,Chinese EV startup Byton furloughs hundreds in...,George Steinmetz's Bird's-Eye View of the Earth,The Deepwater Horizon Disaster Fueled a Gulf S...,Viral texts about coronavirus lockdown were am...,Facebook and Instagram test location labels to...,Is your family's chewing and slurping driving ...,...,Netflix subscribers soar like crazy as coronav...,The Latest: WHO chief hopes US will reconsider...,US failed to block UN virus vaccine resolution,TSA employees have contracted COVID-19 at 58 a...,Will There Be A Meat Shortage Because Of The C...,Herd Immunity Won t Save Us Sometimes the Best...,6,"0.005*""israeli"" + 0.005*""iran"" + 0.004*""gaza"" ...",6,"0.019*""chinese"" + 0.016*""best"" + 0.014*""report..."
3,4/23/20,0,‘Pokémon Journeys’ will be a Netflix exclusive...,How Argentina’s Strict Covid-19 Lockdown Saved...,Gogoro’s first e-bike is coming to the US next...,US announces millions in aid for resource-rich...,Throw us your best 60-second pitch on May 13 a...,See Fauci's testing assessment that Trump disa...,A captivating 'Mandalorian' docuseries trailer...,Don't have a desk chair? All you need is a sea...,...,These 14 charts from Goldman Sachs show how mu...,Ventilator companies are opening up critical r...,Original Content podcast: ‘Too Hot to Handle’ ...,'We can’t afford to wait': coronavirus could s...,Here’s how you can ‘reset’ your sleep cycle du...,Pokémon Journeys’ will be a Netflix exclusive...,1,"0.006*""russian"" + 0.005*""russia"" + 0.004*""mill...",1,"0.018*""bank"" + 0.018*""digital"" + 0.017*""come"" ..."
4,4/24/20,1,Lockdown Has Taken Us From Internet Time to Gr...,Refresh Your Book Stash at a Little Free Library,Stop Travel Memories From Appearing in Your So...,"If we let the US Postal Service die, we’ll be ...",30 years of Hubble telescope images,Where to Travel in an RV Right Now,Don't Inject Bleach (Sigh),First ever 'road map' of the moon released,...,Podcast: What the heck is a 'Planetary Compute...,Dyson won't build ventilators for the UK after...,US durable goods orders slump the most since 2...,COVID-19 forced Airbnb to rethink its product ...,Coronavirus: Trump suggests injecting disinfec...,Lockdown Has Taken Us From Internet Time to Gr...,3,"0.006*""russia"" + 0.004*""iran"" + 0.004*""militar...",0,"0.011*""world"" + 0.011*""right"" + 0.011*""time"" +..."


## Data preprocessing
### Lemmatize

In [36]:
def lemmatize(text):
    return WordNetLemmatizer().lemmatize(text, pos='v') # pos='v' treats each token as a verb, so inflected forms reduce to the base verb (e.g. "running" -> "run")

In [37]:
# Remove stopwords and words of 3 characters or fewer, then lemmatize
def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize(token))
    return result
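
To make the filtering rule concrete, here is a minimal, dependency-free sketch of the same idea. The stopword set `TOY_STOPWORDS` and the function `toy_preprocess` are hypothetical stand-ins (the notebook itself relies on gensim's `STOPWORDS` and `simple_preprocess`), and lemmatization is left out.

```python
import re

# Hypothetical, tiny stopword list; gensim's STOPWORDS is far larger
TOY_STOPWORDS = {"the", "and", "with", "from", "that", "this"}

def toy_preprocess(text):
    """Lowercase, split on non-letters, then drop stopwords and tokens of 3 characters or fewer."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in TOY_STOPWORDS and len(t) > 3]

print(toy_preprocess("The US and China want a divorce from trade"))
```

Note that short tokens such as `us` are dropped by the length rule even though they are not stopwords, which may matter for headlines about the United States.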

In [38]:
cleaned_all_headlines = data['combined_headlines'].map(preprocess)

In [39]:
# Vectorize the new headlines using TF-IDF to create features for the existing model
texts = cleaned_all_headlines
dictionary = gensim.corpora.Dictionary(texts)

'''
Filter out irrelevant words:
Keep tokens that appear in at least 2 documents
Keep only the 1,000 most frequent tokens
'''

dictionary.filter_extremes(no_below=2, keep_n=1000)

tfidf_vectorizer = TfidfVectorizer(
    min_df=3, # ignore words that appear in fewer than 3 articles
    max_df=0.85, # ignore words that appear in more than 85% of articles
    max_features=5000, # limit the number of important words to 5,000
    ngram_range=(1, 2), # look for both words and two-word phrases
    preprocessor=' '.join # join the tokenized words instead of creating a list
)

tfidf = tfidf_vectorizer.fit_transform(texts)

In [40]:
topic_values = nmf.transform(tfidf)
data['daily_topic_existing_nmf'] = topic_values.argmax(axis=1)
data.tail()

ValueError: Array with wrong shape passed to NMF (input H). Expected (10, 229), but got (10, 5000) 
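
The error above appears to come from fitting a brand-new `TfidfVectorizer` on the 2020 headlines: the new vocabulary has 229 features, while the loaded model's components have 5,000 columns. One way around this, sketched below under the assumption that the original fitted vectorizer can be saved next to the model, is to persist both objects and call `transform` (not `fit_transform`) on new text. The toy corpora and file names here are hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for the original training headlines and the new 2020 headlines
old_docs = [
    "russia georgia troops border conflict",
    "russia ukraine crimea troops",
    "israel gaza hamas rocket strike",
    "israel gaza ceasefire talks",
]
new_docs = ["coronavirus lockdown russia", "gaza coronavirus cases rise"]

# Fit the vectorizer and the model together on the original corpus
vectorizer = TfidfVectorizer()
old_tfidf = vectorizer.fit_transform(old_docs)
nmf_model = NMF(n_components=2, init="nndsvd", max_iter=500).fit(old_tfidf)

# Persist both objects so they can be reloaded as a pair (hypothetical file names)
out_dir = tempfile.mkdtemp()
vectorizer_path = os.path.join(out_dir, "tfidf_vectorizer.gz")
model_path = os.path.join(out_dir, "nmf_model.gz")
joblib.dump(vectorizer, vectorizer_path)
joblib.dump(nmf_model, model_path)

# Later: reload both and only transform the new headlines
loaded_vectorizer = joblib.load(vectorizer_path)
loaded_model = joblib.load(model_path)
new_tfidf = loaded_vectorizer.transform(new_docs)  # same columns as the training matrix
new_topics = loaded_model.transform(new_tfidf).argmax(axis=1)

print(new_tfidf.shape, loaded_model.components_.shape, new_topics)
```

Words the old vocabulary has never seen (here, `coronavirus`) are silently ignored by `transform`, which is itself a useful signal of how far the news has drifted from the original training data.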

In [105]:
test

41624    [portugal, raid, pension, fund, meet, deficit,...
5620     [fearless, father, throw, suicide, bomber, sav...
65161    [reddit, spend, morning, write, brief, history...
72850              [vote, force, isps, disconnect, pirate]
59075    [chvez, order, jet, intercept, military, plane...
                               ...                        
68615                                [england, households]
70154              [iran, hold, american, student, prison]
55963    [turkey, position, missiles, repulse, israeli,...
23059    [china, moon, rover, activate, science, tool, ...
64850         [spanish, intelligence, agents, expel, cuba]
Name: cleaned_headlines, Length: 7361, dtype: object

## Vectorize and transform the text using TF-IDF
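We use TF-IDF rather than a plain bag-of-words (BoW) count because TF-IDF down-weights terms that appear in many headlines, so common but uninformative words contribute less to the topics than they would with raw counts.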
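
As a small illustration of the difference between TF-IDF and raw bag-of-words counts, the sketch below vectorizes three toy headlines both ways. The corpus is made up for the example; only the scikit-learn classes already imported in this notebook are used.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy headlines: "say" appears in every document, the other words in only one
toy_docs = ["say russia troops", "say china trade", "say gaza strike"]

count_vec = CountVectorizer()
counts = count_vec.fit_transform(toy_docs)

tfidf_vec = TfidfVectorizer()
weights = tfidf_vec.fit_transform(toy_docs)

say_col = tfidf_vec.vocabulary_["say"]
russia_col = tfidf_vec.vocabulary_["russia"]

# Raw counts treat both words in the first headline identically
print(counts[0, count_vec.vocabulary_["say"]], counts[0, count_vec.vocabulary_["russia"]])

# TF-IDF gives the ubiquitous word a lower weight than the distinctive one
print(weights[0, say_col], weights[0, russia_col])
```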

In [106]:
# Vectorize the headlines using TF-IDF to create features to train our model on
texts = train
dictionary = gensim.corpora.Dictionary(texts)

'''
Filter out irrelevant words:
Keep tokens that appear in at least 15 documents
Keep only the 100,000 most frequent tokens
'''
dictionary.filter_extremes(no_below=15, keep_n=100000)

tfidf_vectorizer = TfidfVectorizer(
    min_df=3, # ignore words that appear in fewer than 3 articles
    max_df=0.85, # ignore words that appear in more than 85% of articles
    max_features=5000, # limit the number of important words to 5,000
    ngram_range=(1, 2), # look for both words and two-word phrases
    preprocessor=' '.join # join the tokenized words instead of creating a list
)

tfidf = tfidf_vectorizer.fit_transform(texts)

## Create and fit the NMF model on our headlines

In [138]:
# Create NMF model and fit it
# We are using ten topic groupings here so it is directly comparable to the LDA model's performance

'''
"Nonnegative Double Singular Value Decomposition (NNDSVD) is a new method designed to enhance the 
initialization stage of the nonnegative matrix factorization."

"NNDSVD is well suited to initialize NMF algorithms with sparse factors." - http://nimfa.biolab.si/nimfa.methods.seeding.nndsvd.html
'''
model = NMF(n_components=10, init='nndsvd')
model.fit(tfidf)

NMF(alpha=0.0, beta_loss='frobenius', init='nndsvd', l1_ratio=0.0, max_iter=200,
    n_components=10, random_state=None, shuffle=False, solver='cd', tol=0.0001,
    verbose=0)
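
For readers new to NMF, the fit above approximates the non-negative document-term matrix V as a product W H, where W holds per-document topic weights and H (exposed as `components_`) holds per-topic word weights. The sketch below shows those shapes on a small random matrix; the data and the choice of three components are arbitrary assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.RandomState(0)
# Toy non-negative "document-term" matrix: 6 documents, 8 terms
V = rng.rand(6, 8)

toy_nmf = NMF(n_components=3, init="nndsvd", max_iter=500)
W = toy_nmf.fit_transform(V)  # document-topic weights, shape (6, 3)
H = toy_nmf.components_       # topic-term weights, shape (3, 8)

print(W.shape, H.shape)
# Frobenius norm of the residual V - WH (smaller means a closer approximation)
print(np.linalg.norm(V - W @ H))
```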

## Check our NMF topics

In [139]:
# Print out the topics to visually inspect them

def get_nmf_topics(model, n_top_words):

    # The word ids obtained need to be reverse-mapped to the words so we can print the topic names.
    feat_names = tfidf_vectorizer.get_feature_names()

    word_dict = {}
    for i, topic in enumerate(model.components_):

        # For each topic, take the n_top_words largest weights (argsort is ascending,
        # so slice backwards from the end) and add the words they map to into the dictionary.
        words_ids = topic.argsort()[:-n_top_words - 1:-1]
        words = [feat_names[key] for key in words_ids]
        word_dict['Topic #' + '{:2d}'.format(i+1)] = words

    return pd.DataFrame(word_dict)

get_nmf_topics(model, 10)

Unnamed: 0,Topic # 1,Topic # 2,Topic # 3,Topic # 4,Topic # 5,Topic # 6,Topic # 7,Topic # 8,Topic # 9,Topic #10
0,police,gaza,isis,korea,ukraine,wikileaks,say,snowden,libya,georgia
1,iran,israel,islamic state,north,russia,assange,china,edward,egypt,russia
2,kill,israeli,islamic,north korea,russian,julian,world,edward snowden,protest,ossetia
3,people,hamas,ebola,south,crimea,julian assange,climate,spy,protesters,georgian
4,afghanistan,palestinians,state,korean,ukrainian,cable,change,surveillance,egyptian,south ossetia
...,...,...,...,...,...,...,...,...,...,...


So far, these topics seem to be a lot more coherent than what the LDA model produced. There's not only better in-topic coherence (i.e. the words relate to each other well) but also distinctions between topics (i.e. there's less keyword overlap from topic to topic).
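
One rough way to put a number on keyword overlap between topics is the Jaccard similarity of their top-word lists. The sketch below applies it to the top five words of three topics from the table above; the helper `jaccard` is a hypothetical name, and this is only an illustration, not a measure the notebook actually computed.

```python
from itertools import combinations

# Top five words for three of the topics shown in the table above
top_words = {
    "Topic 5": ["ukraine", "russia", "russian", "crimea", "ukrainian"],
    "Topic 8": ["snowden", "edward", "edward snowden", "spy", "surveillance"],
    "Topic 10": ["georgia", "russia", "ossetia", "georgian", "south ossetia"],
}

def jaccard(a, b):
    """Share of distinct words that two lists have in common (0 = disjoint, 1 = identical)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

overlaps = {
    (t1, t2): jaccard(top_words[t1], top_words[t2])
    for t1, t2 in combinations(top_words, 2)
}
for pair, score in overlaps.items():
    print(pair, round(score, 3))
```

Here only Topic 5 and Topic 10 share a word (`russia`), which fits the intuition that both concern Russian military conflicts while remaining distinct topics.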

## Testing the model on new data

In [115]:
# https://stackabuse.com/python-for-nlp-topic-modeling/

df = pd.DataFrame()

# See how well our model performed by using the test data
tfidf_test = tfidf_vectorizer.transform(test)
X_test = model.transform(tfidf_test)

# Get the top predicted topic (1-indexed, to match the 'Topic # n' labels above)
predicted_topics = [np.argsort(each)[::-1][0] + 1 for each in X_test]    

# Add to the df
df['test'] = test
df.reset_index(drop=True, inplace=True)
df['pred_topic_num'] = predicted_topics

df.head()

Unnamed: 0,test,pred_topic_num
0,"[portugal, raid, pension, fund, meet, deficit,...",1
1,"[fearless, father, throw, suicide, bomber, sav...",4
2,"[reddit, spend, morning, write, brief, history...",1
3,"[vote, force, isps, disconnect, pirate]",1
4,"[chvez, order, jet, intercept, military, plane...",5
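
The top-topic selection above sorts every row and keeps the last index. As a sketch on made-up weights, the same result can be obtained with `argmax`, which is the approach the later cells in this notebook use (without the `+ 1` shift).

```python
import numpy as np

# Toy document-topic weights: 3 documents, 4 topics
toy_weights = np.array([
    [0.10, 0.70, 0.05, 0.15],
    [0.60, 0.10, 0.20, 0.10],
    [0.00, 0.05, 0.15, 0.80],
])

# Approach used above: sort each row, reverse, take the first index, shift to 1-indexed
via_argsort = [np.argsort(row)[::-1][0] + 1 for row in toy_weights]

# Vectorized equivalent when no row has tied maxima
via_argmax = toy_weights.argmax(axis=1) + 1

print(via_argsort, via_argmax)
```

When a row has tied maxima the two approaches can pick different indices, so the equivalence holds only for rows with a unique largest weight.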


## Find a single topic per day

Sorting the headlines into topics isn't working so well against the test data. Maybe more topics are better? This is where a **coherence score** would come in: it is a score that tells you how "coherent" (closely related) the words within a topic are, and you can use it to [automatically select the best number of topics](https://towardsdatascience.com/topic-modeling-articles-with-nmf-8c6b2a227a45) to train your model on. That is currently beyond the scope of this project, but will be the next area for exploration.
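
To give a feel for what a coherence score measures, here is a minimal sketch of a UMass-style score, which rewards topic words that co-occur in the same documents. This is one common variant and not necessarily the measure used in the linked tutorial; the function name `umass_coherence` and the toy documents are assumptions for illustration.

```python
import math

def umass_coherence(topic_words, documents):
    """UMass-style coherence: sum of log((D(w_m, w_l) + 1) / D(w_l)) over ordered word pairs.

    topic_words should be ordered from most to least important, and every
    word must occur in at least one document (otherwise D(w_l) would be zero).
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all of the given words
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    score = 0.0
    for m in range(1, len(topic_words)):
        for l in range(m):
            co_occurrences = doc_freq(topic_words[m], topic_words[l])
            score += math.log((co_occurrences + 1) / doc_freq(topic_words[l]))
    return score

toy_docs = [
    ["russia", "ukraine", "crimea"],
    ["russia", "ukraine", "troops"],
    ["gaza", "israel", "hamas"],
    ["gaza", "israel", "strike"],
]

coherent = umass_coherence(["russia", "ukraine"], toy_docs)
mixed = umass_coherence(["russia", "gaza"], toy_docs)
print(coherent, mixed)
```

A topic whose words appear together scores higher than one that mixes unrelated words, so averaging this score over all topics for several values of `n_components` is one way the number of topics could be chosen.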

Since we know that we want to find one topic per day to feed into other models, we'll now take our trained model and set it loose on the combined headlines from each day.

In [151]:
single_topic = pd.read_csv("../Data/Combined_News_DJIA_single_topic.csv")
single_topic.head()

Unnamed: 0,Date,Label,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,...,Top19,Top20,Top21,Top22,Top23,Top24,Top25,combined_headlines,daily_topic,daily_words
0,2008-08-08,0,"b""Georgia 'downs two Russian warplanes' as cou...",b'BREAKING: Musharraf to be impeached.',b'Russia Today: Columns of troops roll into So...,b'Russian tanks are moving towards the capital...,"b""Afghan children raped with 'impunity,' U.N. ...",b'150 Russian tanks have entered South Ossetia...,"b""Breaking: Georgia invades South Ossetia, Rus...","b""The 'enemy combatent' trials are nothing but...",...,b'This is a busy day: The European Union has ...,"b""Georgia will withdraw 1,000 soldiers from Ir...",b'Why the Pentagon Thinks Attacking Iran is a ...,b'Caucasus in crisis: Georgia invades South Os...,b'Indian shoe manufactory - And again in a se...,b'Visitors Suffering from Mental Illnesses Ban...,"b""No Help for Mexico's Kidnapping Surge""",Georgia 'downs two Russian warplanes' as coun...,1,"0.006*""russian"" + 0.006*""russia"" + 0.004*""forc..."
1,2008-08-11,1,b'Why wont America and Nato help us? If they w...,b'Bush puts foot down on Georgian conflict',"b""Jewish Georgian minister: Thanks to Israeli ...",b'Georgian army flees in disarray as Russians ...,"b""Olympic opening ceremony fireworks 'faked'""",b'What were the Mossad with fraudulent New Zea...,b'Russia angered by Israeli military sale to G...,b'An American citizen living in S.Ossetia blam...,...,b'China to overtake US as largest manufacturer',b'War in South Ossetia [PICS]',b'Israeli Physicians Group Condemns State Tort...,b' Russia has just beaten the United States ov...,b'Perhaps *the* question about the Georgia - R...,b'Russia is so much better at war',"b""So this is what it's come to: trading sex fo...",Why wont America and Nato help us? If they wo...,1,"0.006*""russian"" + 0.006*""russia"" + 0.004*""forc..."
2,2008-08-12,0,b'Remember that adorable 9-year-old who sang a...,"b""Russia 'ends Georgia operation'""","b'""If we had no sexual harassment we would hav...","b""Al-Qa'eda is losing support in Iraq because ...",b'Ceasefire in Georgia: Putin Outmaneuvers the...,b'Why Microsoft and Intel tried to kill the XO...,b'Stratfor: The Russo-Georgian War and the Bal...,"b""I'm Trying to Get a Sense of This Whole Geor...",...,"b'Russia, Georgia, and NATO: Cold War Two'",b'Remember that adorable 62-year-old who led y...,b'War in Georgia: The Israeli connection',b'All signs point to the US encouraging Georgi...,b'Christopher King argues that the US and NATO...,b'America: The New Mexico?',"b""BBC NEWS | Asia-Pacific | Extinction 'by man...",Remember that adorable 9-year-old who sang at...,1,"0.006*""russian"" + 0.006*""russia"" + 0.004*""forc..."
3,2008-08-13,0,b' U.S. refuses Israel weapons to attack Iran:...,"b""When the president ordered to attack Tskhinv...",b' Israel clears troops who killed Reuters cam...,b'Britain\'s policy of being tough on drugs is...,b'Body of 14 year old found in trunk; Latest (...,b'China has moved 10 *million* quake survivors...,"b""Bush announces Operation Get All Up In Russi...",b'Russian forces sink Georgian ships ',...,"b'Russian convoy heads into Georgia, violating...",b'Israeli defence minister: US against strike ...,b'Gorbachev: We Had No Choice',b'Witness: Russian forces head towards Tbilisi...,b' Quarter of Russians blame U.S. for conflict...,b'Georgian president says US military will ta...,b'2006: Nobel laureate Aleksander Solzhenitsyn...,U.S. refuses Israel weapons to attack Iran: ...,1,"0.006*""russian"" + 0.006*""russia"" + 0.004*""forc..."
4,2008-08-14,1,b'All the experts admit that we should legalis...,b'War in South Osetia - 89 pictures made by a ...,b'Swedish wrestler Ara Abrahamian throws away ...,b'Russia exaggerated the death toll in South O...,b'Missile That Killed 9 Inside Pakistan May Ha...,"b""Rushdie Condemns Random House's Refusal to P...",b'Poland and US agree to missle defense deal. ...,"b'Will the Russians conquer Tblisi? Bet on it,...",...,"b'""Non-media"" photos of South Ossetia/Georgia ...",b'Georgian TV reporter shot by Russian sniper ...,b'Saudi Arabia: Mother moves to block child ma...,b'Taliban wages war on humanitarian aid workers',"b'Russia: World ""can forget about"" Georgia\'s...",b'Darfur rebels accuse Sudan of mounting major...,b'Philippines : Peace Advocate say Muslims nee...,All the experts admit that we should legalise...,1,"0.006*""russian"" + 0.006*""russia"" + 0.004*""forc..."


In [152]:
# Clean and tokenize headlines 
single_topic['tokenized'] = single_topic['combined_headlines'].map(preprocess)

In [159]:
# Vectorize the daily headlines with the TF-IDF vectorizer that was fitted on the training
# headlines. Re-fitting a new vectorizer here would build a different vocabulary, so the
# columns would no longer line up with the features the NMF model was trained on.
# Helped by https://stackabuse.com/python-for-nlp-topic-modeling/
doctermmatrix = tfidf_vectorizer.transform(single_topic['tokenized'])

topic_values = model.transform(doctermmatrix)
single_topic['nmf_topic'] = topic_values.argmax(axis=1)
single_topic.tail()

In [173]:
# Create human readable topic word lists to append to df
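nmf_topics = []
feature_names = tfidf_vectorizer.get_feature_names()

for topic in model.components_:
    # argsort is ascending, so the last 10 indices are the 10 highest-weighted words
    # (listed from lowest to highest weight)
    nmf_topics.append([feature_names[idx] for idx in topic.argsort()[-10:]])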
    
def get_topic_list(topic):
    return nmf_topics[topic]
    
single_topic['nmf_topic_readable'] = single_topic['nmf_topic'].apply(get_topic_list)

single_topic.rename(columns={
    'daily_topic': 'lda_topic',
    'daily_words': 'lda_topic_readable'
}, inplace=True)

single_topic.drop(labels=['tokenized'], axis=1, inplace=True)

In [175]:
# Save updated data
single_topic.to_csv("../Data/Combined_News_DJIA_single_topic.csv", index=False)