# Topic extraction with Non-negative Matrix Factorization

This is a proof-of-concept application of Non-negative Matrix
Factorization (NMF) to the term-frequency matrix of a corpus of documents,
with the goal of extracting an additive model of the corpus's topic structure.
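
For reference, the additive model can be written as below. This is a sketch that assumes the common Frobenius-norm formulation of NMF; the symbols $V$ (documents by terms), $W$ (documents by topics), $H$ (topics by terms) and $k$ (number of topics) are notation introduced here, not taken from the notebook.

$$
V \approx W H, \qquad
V \in \mathbb{R}_{\ge 0}^{n \times m},\;
W \in \mathbb{R}_{\ge 0}^{n \times k},\;
H \in \mathbb{R}_{\ge 0}^{k \times m}
$$

$$
\min_{W \ge 0,\; H \ge 0} \; \tfrac{1}{2}\,\lVert V - W H \rVert_F^2
$$

Because every entry of $W$ and $H$ is non-negative, each document is approximated by adding topic contributions together, never by subtracting them.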

## Load the 20 newsgroups dataset

In [14]:
from sklearn import datasets
dataset = datasets.fetch_20newsgroups(shuffle=True, random_state=1)
print(dataset.target_names[dataset.target[0]])

In [15]:
print(dataset.data[0])

## Restrict the dimensions of the problem

Use only a subset of the documents and of the vocabulary to keep computation times short.

In [16]:
n_samples = 1000
n_features = 1000

## Vectorize to compute word frequencies for each document

Keep only the `n_features` most frequent terms, ignore terms that appear in more than 95% of the documents (they behave like stop words), and apply TF-IDF weighting to the resulting counts.

In [17]:
from sklearn.feature_extraction import text

vectorizer = text.CountVectorizer(max_df=0.95, max_features=n_features)
counts = vectorizer.fit_transform(dataset.data[:n_samples])

tfidf = text.TfidfTransformer().fit_transform(counts)
tfidf
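
To make the effect of `max_df` and `max_features` concrete, here is a minimal sketch on a made-up four-document corpus (the `docs` list and the value `max_features=3` are illustrative assumptions, not part of the notebook). The word "the" occurs in every document, so it exceeds the 95% document-frequency threshold and is dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy corpus: "the" occurs in every document.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
    "the dog barked at the bird",
]

# max_df=0.95 drops terms present in more than 95% of the documents ("the");
# max_features=3 then keeps the 3 most frequent remaining terms.
vectorizer = CountVectorizer(max_df=0.95, max_features=3)
counts = vectorizer.fit_transform(docs)
vocab = sorted(vectorizer.vocabulary_)

# Re-weight the raw counts with TF-IDF; the result is still a sparse matrix.
tfidf = TfidfTransformer().fit_transform(counts)

print(vocab)        # ['bird', 'cat', 'dog']
print(tfidf.shape)  # (4, 3)
```

The same two-step pipeline can also be collapsed into a single `TfidfVectorizer`; the two-step form is kept here because it mirrors the cell above.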

Convert from a `scipy.sparse.csr_matrix` representation to a dense `numpy` array for inspection. TF-IDF weights are non-negative by construction, so no values need to be removed before applying NMF.

In [18]:
tfidf.toarray()
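
As a small illustration of this conversion, the sketch below uses a hand-written sparse matrix (a hypothetical stand-in for the TF-IDF matrix) and checks the non-negativity that NMF relies on.

```python
import numpy as np
from scipy import sparse

# Hypothetical small sparse matrix standing in for the TF-IDF matrix.
sparse_matrix = sparse.csr_matrix(np.array([[0.0, 0.5, 0.0],
                                            [0.8, 0.0, 0.6]]))

# toarray() returns a new dense numpy.ndarray; the sparse matrix is unchanged.
dense = sparse_matrix.toarray()

# NMF requires every entry to be >= 0.
is_non_negative = bool((dense >= 0).all())

print(type(dense).__name__, dense.shape, is_non_negative)  # ndarray (2, 3) True
```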

## Extract some topics with Non-negative Matrix Factorization

In [19]:
from sklearn import decomposition
n_topics = 5

nmf = decomposition.NMF(n_components=n_topics).fit(tfidf)

In [20]:
print(nmf)

In [21]:
print(nmf.components_)
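
The shapes of the two factors are easier to see on a tiny example. The sketch below factorizes a random non-negative matrix; the matrix `V`, the choice of `init="nndsvda"` and the fixed `random_state` are assumptions made for reproducibility and are not settings used in the notebook.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.RandomState(0)
V = rng.rand(6, 4)  # hypothetical non-negative "documents x terms" matrix

model = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
W = model.fit_transform(V)  # document-topic weights, shape (6, 2)
H = model.components_       # topic-term weights, shape (2, 4)

# Frobenius norm of the residual between V and its rank-2 approximation.
reconstruction_error = np.linalg.norm(V - W @ H)

print(W.shape, H.shape)
print(reconstruction_error)
```

Each row of `H` plays the role of one topic (a non-negative weighting over terms), and each row of `W` says how strongly each topic contributes to a document.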

## Display the most important words for each extracted topic

Reuse the vocabulary of the vectorizer to map matrix column indices back to words.

In [22]:
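n_top_words = 12
# Map column indices back to terms; the fitted vocabulary is stored in `vocabulary_`.
inverse_vocabulary = {index: term for term, index in vectorizer.vocabulary_.items()}

for topic_idx, topic in enumerate(nmf.components_):
    # argsort() is ascending, so slice from the end to get the largest weights first.
    top_indices = topic.argsort()[:-(n_top_words + 1):-1]
    print("Topic #%d:" % topic_idx, " ".join(inverse_vocabulary[i] for i in top_indices))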
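
The reversed slice on `argsort()` is compact but easy to misread, so here is a minimal sketch of it on a single made-up topic vector (the `feature_names` and `topic` arrays are hypothetical).

```python
import numpy as np

feature_names = np.array(["apple", "banana", "cherry", "date", "elder"])  # hypothetical terms
topic = np.array([0.1, 0.7, 0.0, 0.9, 0.3])  # hypothetical topic-term weights
n_top_words = 3

# argsort() returns indices in ascending order of weight: [2, 0, 4, 1, 3].
# The slice [:-(n + 1):-1] walks backwards from the end and stops after n items,
# which yields the indices of the n largest weights, largest first.
top_indices = topic.argsort()[:-(n_top_words + 1):-1]
top_words = feature_names[top_indices].tolist()

print(top_words)  # ['date', 'banana', 'elder']
```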