# Non-Negative Matrix Factorization (NMF) for Topic Modeling

Non-Negative Matrix Factorization (NMF) is a method used to extract topics from a collection of documents.
It works by factorizing the document-term matrix (where rows represent documents and columns represent terms) into two smaller non-negative matrices: one holding each document's weight on each topic, the other holding each topic's weight on each term.
The main steps are:
1. Construct a vector space model for the documents, resulting in a document-term matrix A.
2. Apply TF-IDF term weight normalization to matrix A.
3. Normalize the TF-IDF vectors to unit length.
4. Initialise the factors using NNDSVD on A.
5. Apply Projected Gradient NMF to A.
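
The steps above can be sketched end to end on a toy corpus. This is a minimal illustration, not the notebook's own pipeline: the four sentences in `docs` are made up, `init="nndsvd"` is set explicitly to mirror step 4, and current scikit-learn releases fit `NMF` with a coordinate descent solver by default, which stands in here for the projected-gradient step.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, used only for illustration.
docs = [
    "the senate passed the health care bill",
    "health insurance coverage and medicaid",
    "voters backed the candidate in the primary election",
    "the election campaign courted primary voters",
]

# Steps 1-3: build the document-term matrix with TF-IDF weights;
# norm="l2" (the default) scales every document vector to unit length.
vectorizer = TfidfVectorizer(norm="l2")
A = vectorizer.fit_transform(docs)

# Steps 4-5: NNDSVD initialisation, then the NMF fit itself.
model = NMF(n_components=2, init="nndsvd", random_state=0, max_iter=500)
W = model.fit_transform(A)   # shape: (n_documents, n_topics)
H = model.components_        # shape: (n_topics, n_terms)

print(W.shape, H.shape)
# Frobenius norm of the reconstruction error A - WH
print(np.linalg.norm(A.toarray() - W @ H))
```

`W` plays the role of `topic_results` later in this notebook, and `H` corresponds to `nmf_model.components_`.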

##### Import Libraries

In [1]:
import numpy as np
import pandas as pd

##### Read dataset

In [4]:
npr = pd.read_csv(r"C:\Users\Owner\Desktop\UPDATED_NLP_COURSE\UPDATED_NLP_COURSE\05-Topic-Modeling\npr.csv")
npr.head()

Unnamed: 0,Article
0,"In the Washington of 2016, even when the polic..."
1,Donald Trump has used Twitter — his prefe...
2,Donald Trump is unabashedly praising Russian...
3,"Updated at 2:50 p. m. ET, Russian President Vl..."
4,"From photography, illustration and video, to d..."


In [6]:
npr.describe().transpose()

Unnamed: 0,count,unique,top,freq
Article,11992,11991,"Washington state has released an estimated 3, ...",2


##### Preprocessing our dataset 

In [12]:
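from sklearn.feature_extraction.text import TfidfVectorizer
# max_df=0.95 drops terms found in more than 95% of articles;
# min_df=2 drops terms found in fewer than 2 articles
tfidf = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')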

In [14]:
dtm = tfidf.fit_transform(npr['Article'])
dtm 

<11992x54777 sparse matrix of type '<class 'numpy.float64'>'
	with 3033388 stored elements in Compressed Sparse Row format>
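
Step 3 (unit-length vectors) needs no separate call here, because `TfidfVectorizer` applies L2 normalization by default (`norm='l2'`). The check below is a small sketch on a made-up three-document corpus; the same computation should work on `dtm`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical three-document corpus, for illustration only.
docs = [
    "police said the court ruled",
    "court reports on police",
    "music and love songs",
]

X = TfidfVectorizer().fit_transform(docs)

# Euclidean norm of each row of the sparse matrix.
squared_sums = np.asarray(X.multiply(X).sum(axis=1)).ravel()
row_norms = np.sqrt(squared_sums)
print(row_norms)
```

Each printed value should equal 1 up to floating-point error.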

In [16]:
from sklearn.decomposition import NMF
# n_components sets the number of topics to extract
nmf_model = NMF(n_components=7, random_state=42)
nmf_model.fit(dtm)

Grab the top 15 words of each topic

In [23]:
feature_names = tfidf.get_feature_names_out()  # look up the vocabulary once
for index, topic in enumerate(nmf_model.components_):
    print(f"The top 15 words for topic # {index}")
    # argsort is ascending, so the last 15 indices carry the largest weights
    print([feature_names[i] for i in topic.argsort()[-15:]])
    print('\n')

The top 15 words for topic # 0
['new', 'research', 'like', 'patients', 'health', 'disease', 'percent', 'women', 'virus', 'study', 'water', 'food', 'people', 'zika', 'says']


The top 15 words for topic # 1
['gop', 'pence', 'presidential', 'russia', 'administration', 'election', 'republican', 'obama', 'white', 'house', 'donald', 'campaign', 'said', 'president', 'trump']


The top 15 words for topic # 2
['senate', 'house', 'people', 'act', 'law', 'tax', 'plan', 'republicans', 'affordable', 'obamacare', 'coverage', 'medicaid', 'insurance', 'care', 'health']


The top 15 words for topic # 3
['officers', 'syria', 'security', 'department', 'law', 'isis', 'russia', 'government', 'state', 'attack', 'president', 'reports', 'court', 'said', 'police']


The top 15 words for topic # 4
['primary', 'cruz', 'election', 'democrats', 'percent', 'party', 'delegates', 'vote', 'state', 'democratic', 'hillary', 'campaign', 'voters', 'sanders', 'clinton']


The top 15 words for topic # 5
['love', 've', 'don', 'al

In [25]:
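# Each row of topic_results holds one article's weight on each of the 7 topics
topic_results = nmf_model.transform(dtm)
# Assign each article to the topic with the largest weight
npr['Topic'] = topic_results.argmax(axis=1)
npr.head()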

Unnamed: 0,Article,Topic
0,"In the Washington of 2016, even when the polic...",1
1,Donald Trump has used Twitter — his prefe...,1
2,Donald Trump is unabashedly praising Russian...,1
3,"Updated at 2:50 p. m. ET, Russian President Vl...",3
4,"From photography, illustration and video, to d...",6
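
The assignment above keeps only the single strongest topic per article. The sketch below uses a made-up weight matrix `W` (standing in for `topic_results`) to show what `argmax(axis=1)` returns. As an optional extra that is not part of the original notebook, it also computes how much of each row's total weight the winning topic holds, since a narrow win may suggest a mixed-topic article.

```python
import numpy as np

# Hypothetical document-topic weights: 3 documents, 4 topics.
W = np.array([
    [0.02, 0.30, 0.00, 0.05],
    [0.00, 0.01, 0.12, 0.11],
    [0.07, 0.00, 0.00, 0.00],
])

# Index of the largest weight in each row.
dominant = W.argmax(axis=1)
print(dominant)  # [1 2 0]

# Share of each row's total weight held by the dominant topic.
share = W.max(axis=1) / W.sum(axis=1)
print(share.round(2))
```

The second document is almost evenly split between topics 2 and 3, which the hard assignment alone would hide.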


In [27]:
mytopic_dict = {0: 'health', 1: 'elections', 2: 'legislation', 3: 'politics', 4: 'elections', 5: 'music', 6: 'education'}
npr['Topic Label'] = npr['Topic'].map(mytopic_dict)
npr.head()

Unnamed: 0,Article,Topic,Topic Label
0,"In the Washington of 2016, even when the polic...",1,elections
1,Donald Trump has used Twitter — his prefe...,1,elections
2,Donald Trump is unabashedly praising Russian...,1,elections
3,"Updated at 2:50 p. m. ET, Russian President Vl...",3,politics
4,"From photography, illustration and video, to d...",6,education
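
The labelling step relies on `Series.map` with a dictionary. The sketch below uses a hypothetical four-row frame and a deliberately incomplete dictionary to show two behaviours worth knowing: topic ids missing from the dictionary become `NaN`, and topics that share a label (as topics 1 and 4 do above with 'elections') are counted together.

```python
import pandas as pd

# Hypothetical topic ids and labels, for illustration only.
df = pd.DataFrame({"Topic": [0, 1, 1, 3]})
labels = {0: "health", 1: "elections"}  # topic 3 deliberately left out

# Ids without a dictionary entry map to NaN rather than raising an error.
df["Topic Label"] = df["Topic"].map(labels)
print(df)

# value_counts skips NaN by default, so the unlabeled row is not counted.
print(df["Topic Label"].value_counts())
```

Running `npr['Topic Label'].value_counts()` in the notebook would give the same kind of summary for the real articles.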
