<a href="https://colab.research.google.com/github/plaban1981/NLP-with-Python/blob/master/Non_negative_matrix_factorization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Non Negative Matrix Factorization

* It is an unsupervised algorithm that simultaneously performs dimensionality reduction and clustering.

* It can be used in conjunction with TF-IDF to model topics across documents.

## Steps :-

* 1. Construct a Vector Space Model (the document-term matrix A) for the documents after filtering stopwords

* 2. Apply TF-IDF term weight normalization to A

* 3. Normalize TF-IDF vectors to unit-length

* 4. Initialize factors using NNDSVD on A (Non Negative Double Singular Value Decomposition)

* 5. Apply Projected Gradient NMF to A
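
The steps above can be sketched on a toy corpus. This is a minimal illustration, not the notebook's actual pipeline: the four sentences in `docs` are made up, and because the `NMF` output later in this notebook reports `solver='cd'` (coordinate descent), the sketch uses that solver rather than projected gradient.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical), standing in for the real documents.
docs = [
    "the virus spread and doctors studied the disease",
    "patients with the disease received new health care",
    "voters backed the campaign before the election",
    "the election campaign courted undecided voters",
]

# Steps 1-3: bag-of-words with stop-word filtering, TF-IDF weighting,
# and L2 (unit-length) normalization of every document vector.
vectorizer = TfidfVectorizer(stop_words="english", norm="l2")
A = vectorizer.fit_transform(docs)

# Step 4: NNDSVD initialization. Step 5: factorize A into W and H.
model = NMF(n_components=2, init="nndsvd", max_iter=500)
W = model.fit_transform(A)   # documents x topics
H = model.components_        # topics x terms

print(W.shape, H.shape)
```

`norm="l2"` is already the default of `TfidfVectorizer`; it is spelled out here only to make step 3 visible.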

Basis Vectors :- The topics (clusters) in the data.

Coefficient Matrix :- The membership weights for the documents relative to each cluster or topic 
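
To make the two factors concrete, here is a small hand-built example. The numbers are invented for illustration: `H` holds the basis vectors (one row per topic, one column per term) and `W` holds the coefficients (one row per document, one column per topic), so their product plays the role of the document-term matrix.

```python
import numpy as np

# Hypothetical coefficient matrix W: 3 documents x 2 topics.
W = np.array([
    [1.0, 0.0],   # document 0 belongs only to topic 0
    [0.0, 2.0],   # document 1 belongs only to topic 1
    [0.5, 0.5],   # document 2 mixes both topics equally
])

# Hypothetical basis vectors H: 2 topics x 4 terms.
H = np.array([
    [0.8, 0.6, 0.0, 0.0],   # topic 0 puts weight on terms 0 and 1
    [0.0, 0.0, 0.3, 0.9],   # topic 1 puts weight on terms 2 and 3
])

# The document-term matrix is the product of the two non-negative factors.
A = W @ H
print(A.shape)   # (3, 4)
print(A[2])      # document 2 is an equal blend of both basis vectors
```

NMF works in the opposite direction: given only `A`, it searches for non-negative `W` and `H` whose product approximates it.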

In [0]:
import pandas as pd
import numpy as np


In [5]:
data = pd.read_csv('/content/npr.csv')
data.shape

(11992, 1)

In [6]:
data.head()

Unnamed: 0,Article
0,"In the Washington of 2016, even when the polic..."
1,Donald Trump has used Twitter — his prefe...
2,Donald Trump is unabashedly praising Russian...
3,"Updated at 2:50 p. m. ET, Russian President Vl..."
4,"From photography, illustration and video, to d..."


##Preprocessing

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(max_df=0.95,min_df=2,stop_words='english')

In [0]:
article_vector = tfidf.fit_transform(data['Article'])

In [10]:
article_vector

<11992x54777 sparse matrix of type '<class 'numpy.float64'>'
	with 3033388 stored elements in Compressed Sparse Row format>
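
The `max_df=0.95` and `min_df=2` arguments decide which terms make it into the 54777-column vocabulary. The sketch below shows their effect on a made-up four-document corpus; the fruit names are placeholders, and stop-word filtering is left out to isolate the two thresholds.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus: "apple" is in every document, "banana" in three,
# "cherry" in two, "date" and "mango" in one each.
docs = [
    "apple banana cherry",
    "apple banana date",
    "apple cherry",
    "apple banana mango",
]

# max_df=0.95 drops terms found in more than 95% of documents ("apple").
# min_df=2 drops terms found in fewer than 2 documents ("date", "mango").
vectorizer = TfidfVectorizer(max_df=0.95, min_df=2)
matrix = vectorizer.fit_transform(docs)

kept_terms = sorted(vectorizer.vocabulary_)
print(kept_terms)     # ['banana', 'cherry']
print(matrix.shape)   # (4, 2)
```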

## Non Negative Matrix Factorization

In [11]:
from sklearn.decomposition import NMF
nmf = NMF(n_components=7,random_state=7)
nmf.fit(article_vector)

NMF(alpha=0.0, beta_loss='frobenius', init=None, l1_ratio=0.0, max_iter=200,
    n_components=7, random_state=7, shuffle=False, solver='cd', tol=0.0001,
    verbose=0)

##Displaying Topics

In [15]:
len(nmf.components_)

7

In [12]:
len(tfidf.get_feature_names())

54777

In [14]:
import random
feature_names = tfidf.get_feature_names()
for i in range(10):
  # randrange excludes the upper bound, so the index is always valid
  rand_word_id = random.randrange(len(feature_names))
  print(feature_names[rand_word_id])



disproven
timberlake
exome
bounding
bow
loosens
omara
diocese
dermatology
disagreeing


In [0]:
single_topic = nmf.components_[0]

In [17]:
single_topic

array([0.00000000e+00, 2.49961336e-01, 0.00000000e+00, ...,
       1.70320960e-03, 2.37554282e-04, 0.00000000e+00])

In [18]:
single_topic.argsort()[-15:]

array([33390, 41067, 28659, 35959, 22673, 14441, 36310, 53989, 52615,
       47218, 53152, 19307, 36283, 54692, 42993])

In [19]:
top_15_word_indices = single_topic.argsort()[-15:]
for index in top_15_word_indices:
  print(tfidf.get_feature_names()[index])

new
research
like
patients
health
disease
percent
women
virus
study
water
food
people
zika
says
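
Note that `argsort()` sorts in ascending order, which is why the strongest word of the topic ("says") is printed last. A small sketch with made-up weights shows the slice used above and how to reverse it when a strongest-first listing is preferred:

```python
import numpy as np

# Hypothetical term weights for one topic.
weights = np.array([0.10, 0.70, 0.00, 0.40, 0.25])

# argsort returns indices ordered from smallest to largest weight,
# so the last three entries are the three largest weights.
top3_ascending = weights.argsort()[-3:]

# Reversing the slice lists the strongest term first.
top3_descending = top3_ascending[::-1]

print(top3_ascending)    # [4 3 1]
print(top3_descending)   # [1 3 4]
```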


In [21]:
for index,topic in enumerate(nmf.components_):
  print(f"Top 15 words for Topic #{index}")
  print([tfidf.get_feature_names()[i] for i in topic.argsort()[-15:]])

Top 15 words for Topic #0
['new', 'research', 'like', 'patients', 'health', 'disease', 'percent', 'women', 'virus', 'study', 'water', 'food', 'people', 'zika', 'says']
Top 15 words for Topic #1
['gop', 'pence', 'presidential', 'russia', 'administration', 'election', 'republican', 'obama', 'white', 'house', 'donald', 'campaign', 'said', 'president', 'trump']
Top 15 words for Topic #2
['senate', 'house', 'people', 'act', 'law', 'tax', 'plan', 'republicans', 'affordable', 'obamacare', 'coverage', 'medicaid', 'insurance', 'care', 'health']
Top 15 words for Topic #3
['officers', 'syria', 'security', 'department', 'law', 'isis', 'russia', 'government', 'state', 'attack', 'president', 'reports', 'court', 'said', 'police']
Top 15 words for Topic #4
['primary', 'cruz', 'election', 'democrats', 'percent', 'party', 'delegates', 'vote', 'state', 'democratic', 'hillary', 'campaign', 'voters', 'sanders', 'clinton']
Top 15 words for Topic #5
['love', 've', 'don', 'album', 'way', 'time', 'song', 'life
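
The loop above calls `tfidf.get_feature_names()` once per word. In recent scikit-learn releases that method has been replaced by `get_feature_names_out()`, so a version-independent option is to fetch the vocabulary once and pass it to a small helper. The helper name `top_terms` and the toy inputs below are hypothetical, not part of the original notebook.

```python
import numpy as np

def top_terms(components, feature_names, n_top=15):
    """Return the n_top highest-weighted terms of each topic, strongest first."""
    names = np.asarray(feature_names)
    return [names[row.argsort()[::-1][:n_top]].tolist() for row in components]

# Toy stand-ins for nmf.components_ and the TF-IDF vocabulary.
components = np.array([
    [0.1, 0.9, 0.0, 0.5],
    [0.7, 0.0, 0.2, 0.3],
])
feature_names = ["care", "health", "trump", "virus"]

for index, terms in enumerate(top_terms(components, feature_names, n_top=2)):
    print(f"Topic #{index}: {terms}")
```

With the fitted objects from this notebook, the call would look like `top_terms(nmf.components_, tfidf.get_feature_names_out())` on a recent scikit-learn, or with `tfidf.get_feature_names()` on the older version used here.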

##Create a dictionary for Topic Mapping

In [0]:
topic_label = {0:'Health',1:'Politics',2:'Health Insurance',3:'Security',4:'Elections',5:'Music',6:'Education'}

##Attaching Discovered Topic Labels to Original Articles

In [0]:
topic_results = nmf.transform(article_vector)

In [24]:
topic_results.shape

(11992, 7)

In [25]:
topic_results[0].round(2)

array([0.  , 0.12, 0.  , 0.06, 0.02, 0.  , 0.  ])

In [26]:
topic_results[0].argmax()

1

In [0]:
data['Topic'] = topic_results.argmax(axis=1)
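
As a sketch of what `argmax(axis=1)` does here: each row of `topic_results` holds one document's weights over the topics, and `axis=1` picks the column with the largest weight in every row. The matrix below is invented, and the companion `max(axis=1)` call is an optional extra showing how strong each assignment is.

```python
import numpy as np

# Hypothetical document-topic weights: 3 documents x 4 topics.
topic_results = np.array([
    [0.00, 0.12, 0.00, 0.06],
    [0.30, 0.05, 0.01, 0.00],
    [0.02, 0.02, 0.40, 0.10],
])

# Index of the dominant topic for every document (one value per row).
dominant_topic = topic_results.argmax(axis=1)

# Weight of that dominant topic; low values flag weak assignments.
dominant_weight = topic_results.max(axis=1)

print(dominant_topic)    # [1 0 2]
print(dominant_weight)   # [0.12 0.3  0.4 ]
```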

In [28]:
data.head()

Unnamed: 0,Article,Topic
0,"In the Washington of 2016, even when the polic...",1
1,Donald Trump has used Twitter — his prefe...,1
2,Donald Trump is unabashedly praising Russian...,1
3,"Updated at 2:50 p. m. ET, Russian President Vl...",3
4,"From photography, illustration and video, to d...",6


In [33]:
data['Topic_Label'] = data['Topic'].map(topic_label)
data.head()

Unnamed: 0,Article,Topic,Topic_Label
0,"In the Washington of 2016, even when the polic...",1,Politics
1,Donald Trump has used Twitter — his prefe...,1,Politics
2,Donald Trump is unabashedly praising Russian...,1,Politics
3,"Updated at 2:50 p. m. ET, Russian President Vl...",3,Security
4,"From photography, illustration and video, to d...",6,Education


In [34]:
data.tail()

Unnamed: 0,Article,Topic,Topic_Label
11987,The number of law enforcement officers shot an...,3,Security
11988,"Trump is busy these days with victory tours,...",1,Politics
11989,It’s always interesting for the Goats and Soda...,0,Health
11990,The election of Donald Trump was a surprise to...,4,Elections
11991,Voters in the English city of Sunderland did s...,3,Security
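
A quick way to sanity-check the labelling is to count documents per label. The sketch below uses a small made-up frame instead of the NPR data. It also shows that `map` leaves `NaN` for any topic ID missing from the dictionary, which matters when a model has more topics than the dictionary has labels.

```python
import pandas as pd

# Partial label dictionary (hypothetical): topic 5 has no label on purpose.
topic_label = {0: "Health", 1: "Politics", 3: "Security"}

# Toy frame standing in for the articles with their assigned topic IDs.
df = pd.DataFrame({"Topic": [1, 1, 3, 0, 1, 5]})

# map() looks up each ID; IDs missing from the dictionary become NaN.
df["Topic_Label"] = df["Topic"].map(topic_label)

# Number of documents per label (NaN rows are excluded by default).
counts = df["Topic_Label"].value_counts()
unlabeled = int(df["Topic_Label"].isna().sum())

print(counts)
print(unlabeled)   # 1
```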


##Topic Modeling Project Assessment

In [35]:
quora = pd.read_csv('/content/quora_questions.csv')
quora.head()

Unnamed: 0,Question
0,What is the step by step guide to invest in sh...
1,What is the story of Kohinoor (Koh-i-Noor) Dia...
2,How can I increase the speed of my internet co...
3,Why am I mentally very lonely? How can I solve...
4,"Which one dissolve in water quikly sugar, salt..."


##Preprocessing

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(max_df=0.95,min_df=2,stop_words='english')

In [0]:
tfidf_vector = tfidf.fit_transform(quora['Question'])

In [40]:
tfidf_vector

<404289x38669 sparse matrix of type '<class 'numpy.float64'>'
	with 2002912 stored elements in Compressed Sparse Row format>

##Non Negative Matrix Factorization

In [0]:
from sklearn.decomposition import NMF
nmf = NMF(n_components=20,random_state=42)

In [42]:
nmf.fit(tfidf_vector)

NMF(alpha=0.0, beta_loss='frobenius', init=None, l1_ratio=0.0, max_iter=200,
    n_components=20, random_state=42, shuffle=False, solver='cd', tol=0.0001,
    verbose=0)

In [43]:
for index,topic in enumerate(nmf.components_):
  print(f"Top 15 words associated with topic #{index}")
  print([tfidf.get_feature_names()[i] for i in topic.argsort()[-15:]])

Top 15 words associated with topic #0
['thing', 'read', 'place', 'visit', 'places', 'phone', 'buy', 'laptop', 'movie', 'ways', '2016', 'books', 'book', 'movies', 'best']
Top 15 words associated with topic #1
['majors', 'recruit', 'sex', 'looking', 'differ', 'use', 'exist', 'really', 'compare', 'cost', 'long', 'feel', 'work', 'mean', 'does']
Top 15 words associated with topic #2
['add', 'answered', 'needing', 'post', 'easily', 'improvement', 'delete', 'asked', 'google', 'answers', 'answer', 'ask', 'question', 'questions', 'quora']
Top 15 words associated with topic #3
['using', 'website', 'investment', 'friends', 'black', 'internet', 'free', 'home', 'easy', 'youtube', 'ways', 'earn', 'online', 'make', 'money']
Top 15 words associated with topic #4
['balance', 'earth', 'day', 'death', 'changed', 'live', 'want', 'change', 'moment', 'real', 'important', 'thing', 'meaning', 'purpose', 'life']
Top 15 words associated with topic #5
['reservation', 'engineering', 'minister', 'president', 'comp

##Attaching Discovered Topics to Original Questions

In [0]:
topic_results = nmf.transform(tfidf_vector)

In [0]:
quora['Topic'] = topic_results.argmax(axis=1)

In [47]:
quora.head()

Unnamed: 0,Question,Topic
0,What is the step by step guide to invest in sh...,5
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,16
2,How can I increase the speed of my internet co...,17
3,Why am I mentally very lonely? How can I solve...,11
4,"Which one dissolve in water quikly sugar, salt...",14


In [48]:
quora.tail()

Unnamed: 0,Question,Topic
404284,How many keywords are there in the Racket prog...,6
404285,Do you believe there is life after death?,4
404286,What is one coin?,11
404287,What is the approx annual cost of living while...,11
404288,What is like to have sex with cousin?,9
