Non-negative matrix factorization (NMF) is an unsupervised algorithm that simultaneously performs dimensionality reduction and clustering. We can use it in conjunction with TF-IDF to model topics across documents.

Given a non-negative matrix A, we want to find a k-dimensional approximation in terms of non-negative factors W and H, so that A ≈ WH. Each basis vector can be interpreted as a cluster. The memberships of objects in these clusters are encoded by H.

Input: Non-negative data matrix (A), number of basis vectors or number of topics (k), and initial values for the factors W and H, which start off as random matrices.
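
To make the roles of these inputs concrete, here is a minimal NumPy sketch that starts from random non-negative W and H and refines them with multiplicative updates (the Lee and Seung rule for the Frobenius objective). This is a swapped-in technique chosen for brevity, not the projected gradient method named in the steps that follow, and the toy matrix, the seed, and the names `eps` and `n_iter` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy non-negative data matrix A (6 x 5) and number of basis vectors k (assumed values)
A = rng.random((6, 5))
k = 2

# Random non-negative starting factors: W is 6 x k, H is k x 5
W = rng.random((A.shape[0], k))
H = rng.random((k, A.shape[1]))

eps = 1e-10   # guards against division by zero
n_iter = 200  # fixed iteration budget for this sketch

initial_error = np.linalg.norm(A - W @ H)

for _ in range(n_iter):
    # Multiplicative updates keep every entry of W and H non-negative
    H *= (W.T @ A) / (W.T @ W @ H + eps)
    W *= (A @ H.T) / (W @ H @ H.T + eps)

final_error = np.linalg.norm(A - W @ H)
print(f"Frobenius error: {initial_error:.4f} -> {final_error:.4f}")
```

The factors never need clipping because each update multiplies a non-negative entry by a non-negative ratio.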

Steps to be followed:
1. First, we construct a vector space model for the documents (after preprocessing such as stop-word filtering), resulting in a document-term matrix A.
2. Then we apply TF-IDF term weight normalization to A.
3. After this, we normalize the TF-IDF vectors to unit length.
4. Then we initialize the factors using NNDSVD (Non-negative Double Singular Value Decomposition) on the matrix A.
5. Finally, we apply projected gradient non-negative matrix factorization to A.
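
The steps above can be sketched end to end with scikit-learn. The corpus below is a hypothetical stand-in for real documents, and two details are assumptions worth flagging: `TfidfVectorizer` already normalizes rows to unit length through its default `norm='l2'`, and scikit-learn's `NMF` offers coordinate descent (`'cd'`) and multiplicative update (`'mu'`) solvers, so coordinate descent stands in here for projected gradient.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the real documents
docs = [
    "the senate passed the health insurance bill",
    "insurance coverage and medicaid under the health care act",
    "voters backed the campaign in the primary election",
    "the election campaign courted democratic voters",
    "health care coverage for patients",
    "primary voters and the presidential campaign",
]

# Steps 1-3: stop-word filtering, TF-IDF weighting, and unit-length rows
# (norm='l2' is the TfidfVectorizer default; it is spelled out for clarity)
vectorizer = TfidfVectorizer(stop_words='english', norm='l2')
A = vectorizer.fit_transform(docs)
row_norms = np.sqrt(np.asarray(A.multiply(A).sum(axis=1))).ravel()

# Step 4: NNDSVD initialization; step 5: factorization (coordinate descent here)
model = NMF(n_components=2, init='nndsvd', solver='cd',
            random_state=0, max_iter=500)
W = model.fit_transform(A)   # one row of topic weights per document
H = model.components_        # one row of term weights per topic

print("row norms:", row_norms.round(3))
print("W shape:", W.shape, "H shape:", H.shape)
```

Note that scikit-learn treats rows of the input as documents, so the document memberships come back from `fit_transform` and the per-topic term weights live in `components_`.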

Basis Vectors: The topics (clusters) in the data.

Coefficient Matrix: The membership weights for documents relative to each topic (cluster).

Just like LDA, we need to select the number of expected topics beforehand (the value of k). Also as with LDA, we have to interpret the topics ourselves, based on the coefficient values of the words in each topic. Unlike LDA, however, these coefficient values are not probabilities, so they cannot be interpreted the way LDA's outputs were.
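
Because k has to be fixed in advance, one rough way to compare candidates is to fit the model for several values of k and look at how the reconstruction error changes. The sketch below does this on a synthetic non-negative matrix; the data, the candidate range, and the idea of reading the error curve by eye are assumptions for illustration, not a procedure taken from this notebook.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)

# Synthetic non-negative data built from 3 hidden components (illustrative only)
A = rng.random((40, 3)) @ rng.random((3, 25))

candidate_ks = [1, 2, 3, 4, 5]
errors = []
for k in candidate_ks:
    model = NMF(n_components=k, init='nndsvd', random_state=0, max_iter=1000)
    model.fit(A)
    # reconstruction_err_ is the Frobenius norm of A - WH for the fitted model
    errors.append(model.reconstruction_err_)

for k, err in zip(candidate_ks, errors):
    print(f"k={k}: reconstruction error {err:.4f}")
```

The error generally shrinks as k grows, so it cannot choose k on its own; whether the topics are interpretable still has to be judged by reading their top words.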

![NNMF](https://i.imgur.com/nDA7RPS.jpg)

![Objective Function](https://i.imgur.com/KsCiOaY.jpg)

![Expectation-Maximization Optimization](https://i.imgur.com/lPa5mZu.jpg)

In [1]:
import pandas as pd
npr = pd.read_csv('npr.csv')

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [3]:
tfidf = TfidfVectorizer(max_df=0.95,min_df=2,stop_words='english')

In [4]:
dtm = tfidf.fit_transform(npr['Article'])

In [6]:
dtm 

<11992x54777 sparse matrix of type '<class 'numpy.float64'>'
	with 3033388 stored elements in Compressed Sparse Row format>
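
The summary above shows why the matrix is kept in a sparse format. As a quick back-of-the-envelope check, using only the three numbers printed in that output (the variable names are illustrative):

```python
# Figures copied from the sparse matrix summary above
n_docs, n_terms = 11992, 54777
stored = 3033388

total_cells = n_docs * n_terms
density = stored / total_cells        # fraction of cells that are non-zero
avg_terms_per_doc = stored / n_docs   # average distinct terms kept per article

print(f"total cells: {total_cells}")
print(f"density: {density:.4%}")
print(f"average non-zero terms per document: {avg_terms_per_doc:.1f}")
```

Fewer than half a percent of the cells are non-zero, so a dense array of the same shape would mostly store zeros.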

In [7]:
from sklearn.decomposition import NMF

In [8]:
nmf_model = NMF(n_components=7,random_state=42)

In [9]:
nmf_model.fit(dtm)

NMF(alpha=0.0, beta_loss='frobenius', init=None, l1_ratio=0.0, max_iter=200,
    n_components=7, random_state=42, shuffle=False, solver='cd', tol=0.0001,
    verbose=0)

In [12]:
tfidf.get_feature_names()[2300]

'albala'

In [14]:
feature_names = tfidf.get_feature_names()  # get_feature_names_out() on scikit-learn >= 1.0
for index, topic in enumerate(nmf_model.components_):
    print(f"The top 15 words for TOPIC #{index}")
    # argsort is ascending, so the last 15 indices are the highest-weighted terms
    print([feature_names[i] for i in topic.argsort()[-15:]])

The top 15 words for TOPIC #0
['new', 'research', 'like', 'patients', 'health', 'disease', 'percent', 'women', 'virus', 'study', 'water', 'food', 'people', 'zika', 'says']
The top 15 words for TOPIC #1
['gop', 'pence', 'presidential', 'russia', 'administration', 'election', 'republican', 'obama', 'white', 'house', 'donald', 'campaign', 'said', 'president', 'trump']
The top 15 words for TOPIC #2
['senate', 'house', 'people', 'act', 'law', 'tax', 'plan', 'republicans', 'affordable', 'obamacare', 'coverage', 'medicaid', 'insurance', 'care', 'health']
The top 15 words for TOPIC #3
['officers', 'syria', 'security', 'department', 'law', 'isis', 'russia', 'government', 'state', 'attack', 'president', 'reports', 'court', 'said', 'police']
The top 15 words for TOPIC #4
['primary', 'cruz', 'election', 'democrats', 'percent', 'party', 'delegates', 'vote', 'state', 'democratic', 'hillary', 'campaign', 'voters', 'sanders', 'clinton']
The top 15 words for TOPIC #5
['love', 've', 'don', 'album', 'way

In [15]:
topic_results = nmf_model.transform(dtm)

In [16]:
topic_results[0]

array([0.        , 0.12075603, 0.00140297, 0.05919954, 0.01518909,
       0.        , 0.        ])
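
The row above illustrates why these weights should not be read as probabilities: they do not sum to 1. A small sketch, assuming we only want relative shares for comparison (the rescaling is an illustrative convenience, not something NMF produces):

```python
import numpy as np

# Topic weights for the first article, copied from the output above
weights = np.array([0.0, 0.12075603, 0.00140297, 0.05919954,
                    0.01518909, 0.0, 0.0])

total = weights.sum()        # not 1, so these are not probabilities
shares = weights / total     # relative shares that do sum to 1
dominant_topic = int(weights.argmax())

print(f"sum of raw weights: {total:.4f}")
print("relative shares:", shares.round(3))
print("dominant topic:", dominant_topic)
```

Rescaling does not change which topic has the largest weight, so taking the `argmax` of the raw weights is enough to assign a topic.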

In [19]:
topic_results.argmax()

22021
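
Called without `axis`, `argmax` returns an index into the flattened array, which is why the result is a single number rather than one topic per document. The sketch below converts that flat index back into a (document, topic) pair, assuming `topic_results` has shape (11992, 7): one row per document in `dtm` and one column per component.

```python
import numpy as np

flat_index = 22021     # value returned by topic_results.argmax() above
shape = (11992, 7)     # assumed shape of topic_results: documents x topics

# Convert the position in the flattened array back to (row, column)
row, col = np.unravel_index(flat_index, shape)
print(int(row), int(col))
```

The single largest weight in the whole matrix therefore belongs to document 3145 and topic 6. Passing `axis=1` instead gives the strongest topic for every document.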

In [20]:
topic_results.argmax(axis=1)

array([1, 1, 1, ..., 0, 4, 3], dtype=int64)

In [21]:
npr['Topic']=topic_results.argmax(axis=1)

In [25]:
mytopic_dict = {0:'health',1:'election',2:'legis',3:'politics',4:'election',5:'music',6:'education'}
npr['Topic Label'] = npr['Topic'].map(mytopic_dict)
npr.head()

| | Article | Topic | Topic Label |
|---|---|---|---|
| 0 | In the Washington of 2016, even when the polic... | 1 | election |
| 1 | Donald Trump has used Twitter — his prefe... | 1 | election |
| 2 | Donald Trump is unabashedly praising Russian... | 1 | election |
| 3 | Updated at 2:50 p. m. ET, Russian President Vl... | 3 | politics |
| 4 | From photography, illustration and video, to d... | 6 | education |
