## NMF - Non-negative Matrix Factorization

- Dimension reduction technique
- NMF models are interpretable (unlike PCA)
- All sample features must be non-negative (>= 0)
- NMF expresses documents as combinations of topics (or 'themes'); the NMF components are the topics
- NMF expresses images as combinations of patterns; the NMF components are parts of images
- Sample reconstruction = NMF feature values * NMF components (a matrix product, hence "matrix factorization"); the reconstruction is approximate
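The reconstruction in the last bullet can be sketched on toy data. This is a minimal illustration, not part of the article workflow below: the random matrix `X` and the `init`, `random_state`, and `max_iter` settings are assumptions chosen only to keep the example small and repeatable.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy non-negative data: 6 samples, 4 features (illustrative values only)
rng = np.random.default_rng(0)
X = rng.random((6, 4))

# init, random_state and max_iter are set here only for repeatability
toy_model = NMF(n_components=2, init='nndsvda', random_state=0, max_iter=1000)
W = toy_model.fit_transform(X)   # NMF feature values, shape (6, 2)
H = toy_model.components_        # NMF components, shape (2, 4)

# Approximate reconstruction of the samples as a matrix product
X_approx = W @ H

print(W.shape, H.shape, X_approx.shape)
print(np.abs(X - X_approx).max())  # largest absolute reconstruction error
```

Both factors are non-negative, which is why each sample can be read as an additive combination of components.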

In [3]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.preprocessing import normalize

## Wikipedia articles Dataset

In [4]:
articles = pd.read_json('../datasets/wikipedia/News-article-wikipedia.json', lines=True)
articles.head()

| | _unit_id | article | newdescp |
|---|---|---|---|
| 0 | 691201838 | Gaza aid ship to dock in Egypt after Israel pr... | A ship with supplies for Gaza will dock at el... |
| 1 | 691201839 | Mel Gibson | Often acts and directs stories involving an i... |
| 2 | 691201840 | Talent Agency WME drops Mel Gibson | Cast member Mel Gibson (R) and Oksana Grigori... |
| 3 | 691201841 | Suicide bomber killed in Tehran-Fars | (Adds details) TEHRAN, June 20 (Reuters) - A... |
| 4 | 691201842 | Iran's 10% ballot boxes to be recounted | Tehran - Iran's Guardian Council is ready to ... |


In [5]:
titles = articles['article'].values
documents = articles['newdescp'].values

In [6]:
tfidf = TfidfVectorizer(stop_words='english')
tfidf_features = tfidf.fit_transform(documents)
# Vocabulary terms, one per TF-IDF column (get_feature_names_out() needs scikit-learn >= 1.0)
words = tfidf.get_feature_names_out()
tfidf_features.shape

(3000, 31430)

## NMF applied to Wikipedia articles

In [7]:
# Create an NMF instance
model = NMF(n_components=6)

In [8]:
# Fit the model to articles
model.fit(tfidf_features)

NMF(alpha=0.0, beta=1, eta=0.1, init=None, l1_ratio=0.0, max_iter=200,
  n_components=6, nls_max_iter=2000, random_state=None, shuffle=False,
  solver='cd', sparseness=None, tol=0.0001, verbose=0)

In [9]:
# Transform the articles
nmf_features = model.transform(tfidf_features)
nmf_features.shape

(3000, 6)

In [10]:
# Create a pandas DataFrame of NMF features, indexed by article title
df = pd.DataFrame(nmf_features, index=titles)
df.head()

| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| Gaza aid ship to dock in Egypt after Israel pressure | 0.0 | 0.189971 | 0.0 | 0.004573 | 0.017002 | 0.0 |
| Mel Gibson | 0.015155 | 0.0 | 0.003397 | 0.0 | 0.037371 | 0.0 |
| Talent Agency WME drops Mel Gibson | 0.016871 | 0.002543 | 0.005586 | 0.001217 | 0.018434 | 0.004754 |
| Suicide bomber killed in Tehran-Fars | 0.014304 | 0.0 | 0.054671 | 0.0 | 0.011165 | 0.028697 |
| Iran's 10% ballot boxes to be recounted | 0.060345 | 0.001549 | 0.0 | 0.0 | 0.0 | 0.045902 |


In [11]:
# Print the row for 'At least 20 dead in Syria clashes'
df.loc['At least 20 dead in Syria clashes',:]

0    0.003685
1    0.000000
2    0.000000
3    0.273719
4    0.000000
5    0.000000
Name: At least 20 dead in Syria clashes, dtype: float64

In [12]:
# Print the row for '17 reported dead in Syria clashes'
df.loc['17 reported dead in Syria clashes',:]

0    0.016144
1    0.000000
2    0.000000
3    0.247033
4    0.000000
5    0.000000
Name: 17 reported dead in Syria clashes, dtype: float64

## NMF learns topics of documents

In [13]:
# Create a DataFrame of NMF components: one row per topic, one column per word
components_df = pd.DataFrame(model.components_, columns=words)
components_df.shape

(6, 31430)

In [14]:
# Select row 3 (the component with the largest feature value for both Syria articles above)
component = components_df.iloc[3,:]

In [15]:
# Print the five words with the largest weights in this component
component.nlargest()

syrian      0.796416
syria       0.724202
al          0.496426
assad       0.479043
damascus    0.353419
Name: 3, dtype: float64

## Which articles are similar to 'Mel Gibson'?

In [20]:
# Normalize each row of NMF features to unit L2 length
norm_features = normalize(nmf_features)

In [21]:
# Create a DataFrame of normalized features, indexed by article title
df = pd.DataFrame(norm_features, index=titles)

In [22]:
# Select the row corresponding to 'Mel Gibson'
article = df.loc['Mel Gibson']

In [23]:
# Compute the dot products (equal to cosine similarities, since rows have unit length)
similarities = df.dot(article)

In [24]:
# Display those with the largest cosine similarity
print(similarities.nlargest())

Mel Gibson                                                                             1.000000
Super Bowl lights out: Power outage halts San Francisco 49ers-Baltimore Ravens game    0.999955
Super Bowl updates: Packers hang on for 31-25 win                                      0.999792
Nick Clegg visits Pakistan flood relief camp in Sukkur                                 0.998188
Search of collapsed Bangladesh building ends with 1,127 found dead                     0.998114
dtype: float64
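The normalize-then-dot approach used above gives the same numbers as computing cosine similarity directly. A small check on made-up feature rows (the array `features` is an assumption for illustration, not data from the notebook):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize

# Made-up non-negative feature rows (three samples, two features)
features = np.array([[1.0, 2.0],
                     [2.0, 4.0],
                     [3.0, 0.5]])

# L2-normalize each row, then take dot products with the first row
unit_rows = normalize(features)
dot_sims = unit_rows.dot(unit_rows[0])

# Same quantity computed directly as cosine similarity
cos_sims = cosine_similarity(features, features[:1]).ravel()

print(np.allclose(dot_sims, cos_sims))  # True
```

Because cosine similarity ignores vector length, the second row (a scaled copy of the first) scores 1.0. The same effect lets articles of different lengths match when their topic proportions agree.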
