<font color="#483D8B">
<h1  align="left">     4710 Lab 6</h1>
<div align="left">
<font size=3><b>
<br>Topic Extraction
<br>Kevin Yang
<br>March 7, 2019
<br></b></font></div>


---------------

## Overview

This lab focuses on topic extraction and topic modeling with non-negative matrix factorization (NMF) and latent Dirichlet allocation (LDA). It provides practice in extracting topics from a corpus with NMF and LDA models. Non-negative matrix factorization is applied with the Frobenius norm and the Kullback-Leibler divergence.

This is an example of applying `sklearn.decomposition.NMF` and
`sklearn.decomposition.LatentDirichletAllocation` to a corpus
of documents to extract additive models of the topic structure of the
corpus. The output is a list of topics, each represented as a list of
terms (weights are not shown).

Non-negative Matrix Factorization is applied with two different objective
functions: the Frobenius norm, and the generalized Kullback-Leibler divergence.
The latter is equivalent to Probabilistic Latent Semantic Indexing.
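
For reference, the two objective functions can be written out using their standard definitions. The symbols are notation introduced here, not taken from the original example: $X$ is the non-negative document-term matrix, $W$ holds the document-topic weights, and $H$ holds the topic-term weights. The regularization terms controlled by `alpha` and `l1_ratio` are left out of this sketch.

$$
d_{\mathrm{Fro}}(X, WH) = \frac{1}{2}\,\lVert X - WH \rVert_F^2 = \frac{1}{2}\sum_{i,j}\bigl(X_{ij} - (WH)_{ij}\bigr)^2
$$

$$
d_{\mathrm{KL}}(X, WH) = \sum_{i,j}\left( X_{ij}\,\log\frac{X_{ij}}{(WH)_{ij}} - X_{ij} + (WH)_{ij} \right)
$$

Both are minimized subject to $W \ge 0$ and $H \ge 0$.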

The default parameters (n_samples / n_features / n_components) should make
the example runnable in a couple of tens of seconds. You can try to
increase the dimensions of the problem, but be aware that the time
complexity is polynomial in NMF. In LDA, the time complexity is
proportional to (n_samples * iterations).


References:
1. https://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html
2. https://www.quora.com/What-is-the-difference-between-TfidfVectorizer-and-CountVectorizer-1
3. https://towardsdatascience.com/lda-topic-modeling-an-explanation-e184c90aadcd
4. https://www.tdktech.com/tech-talks/topic-modeling-explained-lda-to-bayesian-inference

In [1]:
%matplotlib inline

In [2]:
# Author: Olivier Grisel <olivier.grisel@ensta.org>
#         Lars Buitinck
#         Chyi-Kwei Yau <chyikwei.yau@gmail.com>
# License: BSD 3 clause

from __future__ import print_function
from time import time
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.datasets import fetch_20newsgroups

## Data

In [3]:
n_samples = 2000
n_features = 1000
n_components = 10
n_top_words = 20

def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
    print()

print("Loading dataset...")
t0 = time()
dataset = fetch_20newsgroups(shuffle=True, random_state=1, remove=('headers', 'footers', 'quotes'))
data_samples = dataset.data[:n_samples]
print("done in %0.3fs." % (time() - t0))

Loading dataset...
done in 1.676s.


The cell above sets the problem sizes ('n_samples', 'n_features', 'n_components', 'n_top_words'), defines 'print_top_words', and loads the 20 newsgroups dataset. 'print_top_words' prints, for each topic, the 'n_top_words' terms with the highest weights in that topic. Headers, footers, and quoted replies are removed when the dataset is loaded, and only the first 'n_samples' posts are kept. Filtering of uninformative terms happens later, during vectorization: common English stop words, terms that occur in only one document, and terms that occur in more than 95% of the documents are dropped. The cell downloads the dataset if it is not already cached and prints the time taken to fetch it.
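
The slice `[:-n_top_words - 1:-1]` inside `print_top_words` is compact but hard to read. The sketch below reproduces it in plain Python, with `sorted` standing in for NumPy's `argsort`. The vocabulary and weights are made up for illustration and do not come from the dataset.

```python
# Hypothetical vocabulary and topic weights, for illustration only.
feature_names = ["car", "god", "windows", "engine", "file"]
topic = [0.9, 0.0, 0.1, 0.7, 0.2]
n_top_words = 3

# Indices ordered by ascending weight; plays the role of topic.argsort().
ascending = sorted(range(len(topic)), key=lambda i: topic[i])

# Walk backwards from the last index and stop after n_top_words items.
top_indices = ascending[:-n_top_words - 1:-1]
top_terms = [feature_names[i] for i in top_indices]

print(" ".join(top_terms))  # car engine file
```

The negative step reverses the ascending order, and the stop index `-n_top_words - 1` cuts the walk off after the `n_top_words` largest weights.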

## Exploratory Data Analysis

In [4]:
# Use tf-idf features for NMF.
print("Extracting tf-idf features for NMF...")
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2,
                                   max_features=n_features,
                                   stop_words='english')
t0 = time()
tfidf = tfidf_vectorizer.fit_transform(data_samples)
print("done in %0.3fs." % (time() - t0))

Extracting tf-idf features for NMF...
done in 0.511s.


'TfidfVectorizer' converts a collection of documents to a matrix of TF-IDF features, turning the raw text into feature vectors that the NMF model can take as input. The cell also prints the time taken. 'stop_words' removes common English words that carry little meaning. TF-IDF weights each term by how often it occurs in a document, scaled down by how many documents in the corpus contain it, so terms that are frequent in one document but rare overall get the highest weights.
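
To make the weighting concrete, here is a minimal hand-rolled sketch of TF-IDF on a tiny, made-up corpus. It assumes the smoothed idf formula `ln((1 + n) / (1 + df)) + 1` and L2 row normalization, which scikit-learn documents as the `TfidfVectorizer` defaults. It skips tokenization and stop-word removal, and it is not the library's implementation.

```python
import math
from collections import Counter

# Hypothetical three-document corpus, already tokenized.
docs = [
    ["car", "engine", "car"],
    ["engine", "oil"],
    ["god", "faith", "god"],
]

n_docs = len(docs)
# Document frequency: the number of documents each term occurs in.
df = Counter(term for doc in docs for term in set(doc))


def tfidf_row(doc):
    """Return L2-normalized tf-idf weights for one tokenized document."""
    counts = Counter(doc)
    # Raw count times smoothed idf: ln((1 + n) / (1 + df)) + 1.
    raw = {
        term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(weight * weight for weight in raw.values()))
    return {term: weight / norm for term, weight in raw.items()}


rows = [tfidf_row(doc) for doc in docs]
for row in rows:
    print({term: round(weight, 3) for term, weight in sorted(row.items())})
```

In the second document, "oil" ends up with a higher weight than "engine" even though both occur once, because "engine" also appears in another document.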

In [5]:
# Use tf (raw term count) features for LDA.
print("Extracting tf features for LDA...")
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=n_features,
                                stop_words='english')
t0 = time()
tf = tf_vectorizer.fit_transform(data_samples)
print("done in %0.3fs." % (time() - t0))
print()

Extracting tf features for LDA...
done in 0.404s.



This cell extracts raw term counts for Latent Dirichlet Allocation (LDA). 'CountVectorizer' tokenizes each document and converts it to a vector of word counts. Stop words are excluded from the tokens. The time taken to fit the vectorizer to 'data_samples' and transform them is printed.
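
Both vectorizers are built with `max_df=0.95`, `min_df=2`, and `max_features=n_features`. The sketch below shows roughly how those three settings prune a vocabulary. `select_vocabulary` is a hypothetical helper written for this illustration, the corpus is made up, and the alphabetical tie-breaking is an assumption rather than scikit-learn's behavior.

```python
from collections import Counter


def select_vocabulary(docs, max_df=0.95, min_df=2, max_features=None):
    """Filter terms by document frequency, then keep the most frequent ones."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    term_freq = Counter(term for doc in docs for term in doc)

    # Keep terms found in at least min_df documents and
    # in at most a max_df fraction of all documents.
    kept = [t for t in doc_freq if min_df <= doc_freq[t] <= max_df * n_docs]

    # Rank by total count across the corpus (ties broken alphabetically).
    kept.sort(key=lambda t: (-term_freq[t], t))
    if max_features is not None:
        kept = kept[:max_features]
    return sorted(kept)


# Hypothetical tokenized corpus.
docs = [
    ["the", "car", "engine"],
    ["the", "car", "oil"],
    ["the", "god", "faith"],
    ["the", "god", "engine", "rare"],
]

print(select_vocabulary(docs))
print(select_vocabulary(docs, max_features=2))
```

Here "the" is dropped because it appears in every document, while "oil", "faith", and "rare" are dropped because each appears in only one.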

## Model

In [6]:
# Fit the NMF model
# 1 Default NMF model (Frobenius norm)

print("Fitting the NMF model (Frobenius norm) with tf-idf features, "
      "n_samples=%d and n_features=%d..."
      % (n_samples, n_features))
t0 = time()
nmf = NMF(n_components=n_components, random_state=1,
          alpha=.1, l1_ratio=.5).fit(tfidf)
print("done in %0.3fs." % (time() - t0))

print("\nTopics in NMF model (Frobenius norm):")
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
print_top_words(nmf, tfidf_feature_names, n_top_words)

Fitting the NMF model (Frobenius norm) with tf-idf features, n_samples=2000 and n_features=1000...
done in 0.444s.

Topics in NMF model (Frobenius norm):
Topic #0: just people don think like know time good make way really say right ve want did ll new use years
Topic #1: windows use dos using window program os drivers application help software pc running ms screen files version card code work
Topic #2: god jesus bible faith christian christ christians does heaven sin believe lord life church mary atheism belief human love religion
Topic #3: thanks know does mail advance hi info interested email anybody looking card help like appreciated information send list video need
Topic #4: car cars tires miles 00 new engine insurance price condition oil power speed good 000 brake year models used bought
Topic #5: edu soon com send university internet mit ftp mail cc pub article information hope program mac email home contact blood
Topic #6: file problem files format win sound ftp pub read save sit

The NMF model with the Frobenius norm objective is fitted on the term frequency-inverse document frequency (tf-idf) features created earlier. Feature names from the tf-idf vectorizer are stored in 'tfidf_feature_names'. Stop words (low-value filler words) do not appear in the topics because the vectorizer removed them before the model was fitted. Each topic consists of terms that tend to occur together. Topic 7, for example, consists of terms relating to sports such as "game", "team", "win", "play", "goal", "division", etc., and Topic 2 involves religion. The NMF model builds each topic from the terms with the highest weights for that component. The time to fit the model is printed.
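
A note for anyone re-running this cell on a recent scikit-learn: `get_feature_names` was deprecated in version 1.0 in favor of `get_feature_names_out` and later removed. The sketch below is one way to stay compatible with both. `feature_names` is a hypothetical helper, not part of the original lab, and the two stub classes exist only so the example runs without scikit-learn installed.

```python
def feature_names(vectorizer):
    """Return the vocabulary as a list, using whichever accessor exists."""
    if hasattr(vectorizer, "get_feature_names_out"):
        return list(vectorizer.get_feature_names_out())
    return list(vectorizer.get_feature_names())


# Stand-in objects that mimic the old and new vectorizer interfaces.
class OldStyleVectorizer:
    def get_feature_names(self):
        return ["car", "engine"]


class NewStyleVectorizer:
    def get_feature_names_out(self):
        return ["car", "engine"]


print(feature_names(OldStyleVectorizer()))
print(feature_names(NewStyleVectorizer()))
```

With a real vectorizer, the call would be `feature_names(tfidf_vectorizer)` in place of `tfidf_vectorizer.get_feature_names()`.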

In [7]:
# Fit the NMF model
# 2

print("Fitting the NMF model (generalized Kullback-Leibler divergence) with "
      "tf-idf features, n_samples=%d and n_features=%d..."
      % (n_samples, n_features))
t0 = time()
nmf = NMF(n_components=n_components, random_state=1,
          beta_loss='kullback-leibler', solver='mu', max_iter=1000, alpha=.1,
          l1_ratio=.5).fit(tfidf)
print("done in %0.3fs." % (time() - t0))
print("\nTopics in NMF model (generalized Kullback-Leibler divergence):")
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
print_top_words(nmf, tfidf_feature_names, n_top_words)

Fitting the NMF model (generalized Kullback-Leibler divergence) with tf-idf features, n_samples=2000 and n_features=1000...
done in 1.635s.

Topics in NMF model (generalized Kullback-Leibler divergence):
Topic #0: people just like time don say really know way things make think right said did want ve probably work years
Topic #1: windows thanks using help need hi work know use looking mail software does used pc video available running info advance
Topic #2: god does true read know say believe subject says religion mean question point jesus people book christian mind understand matter
Topic #3: thanks know like interested mail just want new send edu list does bike thing email reply post wondering hear heard
Topic #4: time new 10 year sale old offer 20 16 15 great 30 weeks good test model condition 11 14 power
Topic #5: use number com government new university data states information talk phone right including security provide control following long used research
Topic #6: edu try file so

Another NMF model is fitted above with new parameters: 'beta_loss='kullback-leibler'', 'solver='mu'', and 'max_iter=1000'. The solver runs for at most 1000 iterations and stops earlier if it converges. The Kullback-Leibler divergence measures how much one distribution differs from another; here it measures the mismatch between the tf-idf matrix and its factorization. With the Kullback-Leibler divergence, we get more specific terms relating to each topic.
'mu' is the multiplicative update solver. This run takes longer than the previous Frobenius norm run (1.635 s versus 0.444 s). The terms listed for each topic differ from the Frobenius norm output. The topics are not as easy to dissect as those from the Frobenius norm.
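
To show what a multiplicative update does, here is a small pure-Python sketch of the classic update rules for the generalized Kullback-Leibler objective. It is a toy illustration on a made-up 4x4 matrix, without regularization or a convergence check, and it is not scikit-learn's implementation. The function names and the input matrix are assumptions made for this example.

```python
import math
import random

EPS = 1e-9  # guards against division by zero and log(0)


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def kl_divergence(v, wh):
    """Generalized KL divergence between V and its approximation WH."""
    total = 0.0
    for v_row, wh_row in zip(v, wh):
        for x, y in zip(v_row, wh_row):
            if x > 0:
                total += x * math.log(x / (y + EPS))
            total += y - x
    return total


def ratio(v, wh):
    """Element-wise V / (WH)."""
    return [[x / (y + EPS) for x, y in zip(v_row, wh_row)]
            for v_row, wh_row in zip(v, wh)]


def nmf_kl(v, n_components, n_iter=200, seed=0):
    rng = random.Random(seed)
    w = [[rng.random() + 0.1 for _ in range(n_components)] for _ in v]
    h = [[rng.random() + 0.1 for _ in v[0]] for _ in range(n_components)]
    history = [kl_divergence(v, matmul(w, h))]
    for _ in range(n_iter):
        # H <- H * (W^T (V / WH)) / (column sums of W)
        wt = transpose(w)
        numer = matmul(wt, ratio(v, matmul(w, h)))
        h = [[h_kj * n_kj / (sum(w_col) + EPS) for h_kj, n_kj in zip(h_row, n_row)]
             for h_row, n_row, w_col in zip(h, numer, wt)]
        # W <- W * ((V / WH) H^T) / (row sums of H)
        numer = matmul(ratio(v, matmul(w, h)), transpose(h))
        h_sums = [sum(h_row) for h_row in h]
        w = [[w_ik * n_ik / (s + EPS) for w_ik, n_ik, s in zip(w_row, n_row, h_sums)]
             for w_row, n_row in zip(w, numer)]
        history.append(kl_divergence(v, matmul(w, h)))
    return w, h, history


# Hypothetical document-term matrix with two obvious blocks of terms.
v = [[3, 2, 0, 0], [2, 3, 0, 0], [0, 0, 4, 1], [0, 0, 1, 4]]
w, h, history = nmf_kl(v, n_components=2)
print(round(history[0], 3), round(history[-1], 3))
```

Because every update multiplies by a non-negative ratio, `W` and `H` stay non-negative without any explicit clipping, which is the main appeal of this solver.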

In [8]:
# 3

print("Fitting LDA models with tf features, "
      "n_samples=%d and n_features=%d..."
      % (n_samples, n_features))
lda = LatentDirichletAllocation(n_components=n_components, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)
t0 = time()
lda.fit(tf)
print("done in %0.3fs." % (time() - t0))
print("\nTopics in LDA model:")
tf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(lda, tf_feature_names, n_top_words)


Fitting LDA models with tf features, n_samples=2000 and n_features=1000...
done in 3.252s.

Topics in LDA model:
Topic #0: edu com mail send graphics ftp pub available contact university list faq ca information cs 1993 program sun uk mit
Topic #1: don like just know think ve way use right good going make sure ll point got need really time doesn
Topic #2: christian think atheism faith pittsburgh new bible radio games alt lot just religion like book read play time subject believe
Topic #3: drive disk windows thanks use card drives hard version pc software file using scsi help does new dos controller 16
Topic #4: hiv health aids disease april medical care research 1993 light information study national service test led 10 page new drug
Topic #5: god people does just good don jesus say israel way life know true fact time law want believe make think
Topic #6: 55 10 11 18 15 team game 19 period play 23 12 13 flyers 20 25 22 17 24 16
Topic #7: car year just cars new engine like bike good oil i

The cell above fits the LDA (Latent Dirichlet Allocation) model on the term-count matrix built from 'n_samples' documents and 'n_features' terms. LDA is a generative probabilistic model. With 'learning_method='online'', the model uses the online variational Bayes method and updates its components incrementally from mini-batches of documents. From the LDA output we can interpret each topic: Topic 4 relates to health care research and studies, Topic 6 to scores and statistics of a sports game, and so on. The LDA model takes longer to fit (3.252 s) than either NMF model.
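
"Generative" means LDA assumes a recipe by which each document was produced. The sketch below samples one document from that recipe using only the standard library. It illustrates the assumed generative story, not the variational inference that `LatentDirichletAllocation` runs to recover the topics. The vocabulary, topic-word probabilities, and Dirichlet parameters are all made up.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word, vocab, doc_alpha, n_words, rng):
    """Sample a topic mixture for the document, then one topic per word."""
    theta = sample_dirichlet(doc_alpha, rng)
    words = []
    for _ in range(n_words):
        # Pick a topic according to the document's mixture.
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        # Pick a word according to that topic's word distribution.
        words.append(rng.choices(vocab, weights=topic_word[topic])[0])
    return theta, words


rng = random.Random(0)
vocab = ["car", "engine", "god", "faith"]
# Hypothetical topics: one about cars, one about religion.
topic_word = [
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
]
theta, words = generate_document(topic_word, vocab, [0.5, 0.5], 8, rng)
print([round(t, 3) for t in theta], words)
```

Fitting LDA runs this story in reverse: given only the words, it estimates the topic-word distributions (the rows printed by `print_top_words`) and each document's topic mixture.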

## Conclusion

The NMF and LDA models allow for topic extraction/modeling from a corpus. I've learned that topic extraction involves many parts (number of samples, features, components, etc.) and requires a strong understanding of Term Frequency-Inverse Document Frequency (TF-IDF) and CountVectorizer. I've learned that tf-idf assigns each term a weight that increases proportionally to its count in a document but is offset by the number of documents in the corpus that contain the term. The CountVectorizer simply counts how often each term occurs. Both vectorizers use a stop-word list to remove low-value terms that are not specific to any topic, or are "filler words" such as "the", "a", "to", etc. The Frobenius norm model outputs high-value terms that are popular within the topic. For example, a topic relating to sports will output sport-related terms such as "game", "team", "win", "play", "season", etc., but it doesn't necessarily give us the specific topic at hand (What is this sports topic pertaining to? What is the story?).

The Kullback-Leibler divergence NMF model gives us more specificity regarding the topic compared to the Frobenius norm NMF model. Unlike the Frobenius norm model, which outputs popular terms of the topic, the Kullback-Leibler model provides a more accurate read of each topic that captures the main idea of the text. For Topic 7, for example, the Frobenius norm model outputs terms that are popular in hockey ('goal', 'nhl', 'flyers'), while the Kullback-Leibler model gives a stronger explanation of Topic 7 that points to its overall "headline" in the corpus. LDA is a topic extraction model that clusters various terms together within a topic. It treats each document as a mixture of topics, so its output doesn't necessarily give us a single main topic.

By comparison, the LDA model is the slowest to fit and output topics (3.252 s). The fastest topic extraction model is the Frobenius norm NMF (0.444 s).