# Whirlwind guide to Topic Modeling

##### Resources referenced for further study.
> http://nbviewer.jupyter.org/github/skipgram/modern-nlp-in-python/blob/master/executable/Modern_NLP_in_Python.ipynb
> http://nicschrading.com/project/Intro-to-NLP-with-spaCy/
> http://mccormickml.com/2016/03/25/lsa-for-text-classification-tutorial/
> http://www.quuxlabs.com/blog/2010/09/matrix-factorization-a-simple-tutorial-and-implementation-in-python/
> http://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html#sphx-glr-auto-examples-applications-plot-topics-extraction-with-nmf-lda-py
> http://www.socher.org/index.php/Main/ImprovingWordRepresentationsViaGlobalContextAndMultipleWordPrototypes

# LDA

In [1]:
from time import time
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.datasets import fetch_20newsgroups

In [2]:
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
    print()
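
The slice `[:-n_top_words - 1:-1]` in `print_top_words` can look cryptic: `argsort()` returns indices in ascending order of weight, and the slice walks that result backwards from the end, keeping the `n_top_words` largest. A minimal pure-Python sketch of the same idea follows; `weights`, `order` and `top2` are made-up names for illustration, and `sorted(range(...))` stands in for NumPy's `argsort`.

```python
# Hypothetical topic weights for a four-word vocabulary.
weights = [0.1, 0.7, 0.2, 0.9]

# Indices sorted by ascending weight, mimicking numpy's argsort().
order = sorted(range(len(weights)), key=weights.__getitem__)

# Walk backwards from the end and keep the 2 largest: same slice shape
# as topic.argsort()[:-n_top_words - 1:-1] with n_top_words = 2.
n_top_words = 2
top2 = order[:-n_top_words - 1:-1]
print(top2)
```

The result lists the index of the heaviest word first, which is why each printed topic starts with its strongest term.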

In [3]:
n_samples = 2000
n_features = 1000
n_components = 10
n_top_words = 20

In [4]:
print("Loading dataset...")
t0 = time()
dataset = fetch_20newsgroups(shuffle=True, random_state=1,
                             remove=('headers', 'footers', 'quotes'))
data_samples = dataset.data[:n_samples]
print("done in %0.3fs." % (time() - t0))

Loading dataset...
done in 2.282s.


In [5]:
# Use tf (raw term count) features for LDA.
print("Extracting tf features for LDA...")
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=n_features,
                                stop_words='english')
t0 = time()
tf = tf_vectorizer.fit_transform(data_samples)
print("done in %0.3fs." % (time() - t0))
print()

Extracting tf features for LDA...
done in 0.487s.



In [6]:
print("Fitting LDA models with tf features, "
      "n_samples=%d and n_features=%d..."
      % (n_samples, n_features))
lda = LatentDirichletAllocation(n_components=n_components, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)
t0 = time()
lda.fit(tf)
print("done in %0.3fs." % (time() - t0))

Fitting LDA models with tf features, n_samples=2000 and n_features=1000...
done in 4.378s.


In [7]:
print("\nTopics in LDA model:")
# get_feature_names_out() replaces get_feature_names() from older scikit-learn releases.
tf_feature_names = tf_vectorizer.get_feature_names_out()
print_top_words(lda, tf_feature_names, n_top_words)


Topics in LDA model:
Topic #0: edu com mail send graphics ftp pub available contact university list faq ca information cs 1993 program sun uk mit
Topic #1: don like just know think ve way use right good going make sure ll point got need really time doesn
Topic #2: christian think atheism faith pittsburgh new bible radio games alt lot just religion like book read play time subject believe
Topic #3: drive disk windows thanks use card drives hard version pc software file using scsi help does new dos controller 16
Topic #4: hiv health aids disease april medical care research 1993 light information study national service test led 10 page new drug
Topic #5: god people does just good don jesus say israel way life know true fact time law want believe make think
Topic #6: 55 10 11 18 15 team game 19 period play 23 12 13 flyers 20 25 22 17 24 16
Topic #7: car year just cars new engine like bike good oil insurance better tires 000 thing speed model brake driving performance
Topic #8: people said

# LSA/LSI

LSA (also called LSI) is the most straightforward of these methods: a truncated SVD reduces the dimensionality of the TF-IDF matrix, and each retained component acts as a latent topic.
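
As a sketch of the standard formulation (the symbols $X$, $U_k$, $\Sigma_k$ and $V_k$ are notation assumed here, not taken from the notebook), a rank-$k$ truncated SVD approximates the documents-by-terms TF-IDF matrix $X$ as:

$$X \approx U_k \Sigma_k V_k^\top$$

Each document is then represented by its row of $U_k \Sigma_k = X V_k$, a dense $k$-dimensional vector, and each row of $V_k^\top$ weights the vocabulary terms for one latent component.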

In [8]:
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

In [9]:
vectorizer = TfidfVectorizer(max_df=0.5, max_features=10000,
                             min_df=2, stop_words='english',
                             use_idf=True)

In [10]:
X_train_tfidf = vectorizer.fit_transform(data_samples)
X_train_tfidf.shape

(2000, 10000)

In [11]:
# Build a pipeline: truncated SVD down to 100 components, then L2-normalize each row.
svd = TruncatedSVD(100)
lsa = make_pipeline(svd, Normalizer(copy=False))

In [12]:
%%time
# Fit the SVD + Normalizer pipeline and project the TF-IDF matrix into the LSA space.
X_train_lsa = lsa.fit_transform(X_train_tfidf)

CPU times: user 771 ms, sys: 112 ms, total: 883 ms
Wall time: 1 s


In [13]:
X_train_lsa.shape

(2000, 100)

##### Inspecting what the SVD learned.

In [14]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
import string
from collections import Counter
import itertools
punct = string.punctuation+' '

In [15]:
def remove_punct(word):
    return ''.join([i for i in word if i not in punct])

In [16]:
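# Inspecting the SVD components. argsort() is ascending, so this picks the 15
# documents with the lowest scores on component 2 (an SVD component's sign is arbitrary).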
top_docs = X_train_lsa[:, 2].argsort()[0:15]
check_hold = list(itertools.chain(*[data_samples[i].lower().split() for i in top_docs]))
check_hold = [remove_punct(i) for i in check_hold]
check_hold = [i for i in check_hold if i not in ENGLISH_STOP_WORDS]
term_count = dict(Counter(check_hold))

In [17]:
sorted(list(zip(term_count.values(), term_count.keys())))[::-1]

[(83, 'god'),
 (57, ''),
 (27, 'does'),
 (19, 'just'),
 (19, 'faith'),
 (17, 'people'),
 (16, 'good'),
 (16, 'gods'),
 (15, 'believe'),
 (14, 'windows'),
 (13, 'say'),
 (12, 'exist'),
 (11, 'time'),
 (11, 'dont'),
 (11, 'did'),
 (10, 'read'),
 (10, 'makes'),
 (10, 'like'),
 (9, 'know'),
 (9, 'bible'),
 (8, 'way'),
 (8, 'suppose'),
 (8, 'human'),
 (8, 'dos'),
 (8, 'doesnt'),
 (8, 'brothers'),
 (7, 'think'),
 (7, 'sin'),
 (7, 'said'),
 (7, 'man'),
 (7, 'life'),
 (7, 'judge'),
 (7, 'im'),
 (7, 'existence'),
 (7, 'drive'),
 (7, '2'),
 (6, 'worship'),
 (6, 'work'),
 (6, 'use'),
 (6, 'nature'),
 (6, 'make'),
 (6, 'let'),
 (6, 'judaism'),
 (6, 'jews'),
 (6, 'jesus'),
 (6, 'heaven'),
 (6, 'fisher'),
 (6, 'different'),
 (6, 'came'),
 (6, 'actions'),
 (6, '1'),
 (5, 'works'),
 (5, 'wood'),
 (5, 'using'),
 (5, 'thought'),
 (5, 'thing'),
 (5, 'statement'),
 (5, 'says'),
 (5, 'right'),
 (5, 'reading'),
 (5, 'purposes'),
 (5, 'problems'),
 (5, 'point'),
 (5, 'mean'),
 (5, 'lord'),
 (5, 'looking'),
 

# Non-Negative Matrix Factorization

In [18]:
# Use tf-idf features for NMF.
print("Extracting tf-idf features for NMF...")
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2,
                                   max_features=n_features,
                                   stop_words='english')
t0 = time()
tfidf = tfidf_vectorizer.fit_transform(data_samples)
print("done in %0.3fs." % (time() - t0))

Extracting tf-idf features for NMF...
done in 0.535s.


In [19]:
# Fit the NMF model
print("Fitting the NMF model (generalized Kullback-Leibler divergence) with "
      "tf-idf features, n_samples=%d and n_features=%d..."
      % (n_samples, n_features))
t0 = time()
# Note: newer scikit-learn releases replace `alpha` with `alpha_W` and `alpha_H`.
nmf = NMF(n_components=n_components, random_state=1,
          beta_loss='kullback-leibler', solver='mu', max_iter=1000, alpha=.1,
          l1_ratio=.5).fit(tfidf)
print("done in %0.3fs." % (time() - t0))

Fitting the NMF model (generalized Kullback-Leibler divergence) with tf-idf features, n_samples=2000 and n_features=1000...
done in 2.102s.
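
For reference, the generalized Kullback-Leibler divergence named in the cell above is usually written as follows for a non-negative matrix $X$ and its factorization $WH$. The symbols are notation assumed here: $W$ is documents by topics, and $H$ is topics by terms, which corresponds to `nmf.components_`.

$$D_{KL}(X \,\|\, WH) = \sum_{i,j} \left( X_{ij} \log \frac{X_{ij}}{(WH)_{ij}} - X_{ij} + (WH)_{ij} \right)$$

The fitted objective also includes the regularization terms controlled by `alpha` and `l1_ratio`, which are omitted from this sketch.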


In [20]:
print("\nTopics in NMF model (generalized Kullback-Leibler divergence):")
# get_feature_names_out() replaces get_feature_names() from older scikit-learn releases.
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()
print_top_words(nmf, tfidf_feature_names, n_top_words)


Topics in NMF model (generalized Kullback-Leibler divergence):
Topic #0: people just like time don say really know way things make think right said did want ve probably work years
Topic #1: windows thanks using help need hi work know use looking mail software does used pc video available running info advance
Topic #2: god does true read know say believe subject says religion mean question point jesus people book christian mind understand matter
Topic #3: thanks know like interested mail just want new send edu list does bike thing email reply post wondering hear heard
Topic #4: time new 10 year sale old offer 20 16 15 great 30 weeks good test model condition 11 14 power
Topic #5: use number com government new university data states information talk phone right including security provide control following long used research
Topic #6: edu try file soon remember problem com program hope mike space article wrong library short include win little couldn sun
Topic #7: year world team game pla

# Chinese Restaurant Process

In [21]:
import random

In [None]:
# Play with different concentrations
for concentration in [0.0, 0.5, 1.0]:

    # First customer always sits at the first table
    # To do otherwise would be insanity
    tables = [1]

    # n=1 is the first customer, so seating starts with the second
    for n in range(2, 1000):

        # Gen random number 0~1
        rand = random.random()

        p_total = 0
        existing_table = False

        for index, count in enumerate(tables):

            # n - 1 customers are already seated, so an existing table is
            # chosen with probability count / (n - 1 + concentration)
            prob = count / (n - 1 + concentration)

            p_total += prob
            if rand < p_total:
                tables[index] += 1
                existing_table = True
                break

        # New table!! (probability concentration / (n - 1 + concentration))
        if not existing_table:
            tables.append(1)

    # Table index, share of customers in percent, raw count
    for index, count in enumerate(tables):
        print(index, "X", 100 * count / sum(tables), count)
    print("----")
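
The simulation above can also be wrapped in a reusable, seedable function, which makes runs reproducible and easier to compare across concentrations. This is a minimal sketch under the standard CRP seating rule: a customer joins an existing table with probability proportional to its size, or opens a new one with probability proportional to the concentration. The names `crp_table_sizes`, `n_customers` and `seed` are hypothetical and not part of the notebook.

```python
import random


def crp_table_sizes(n_customers, concentration, seed=None):
    """Simulate a Chinese Restaurant Process and return the table sizes."""
    rng = random.Random(seed)  # private generator so runs are reproducible
    tables = []
    for n_seated in range(n_customers):
        # Unnormalized mass: one unit per seated customer, plus the
        # concentration for opening a new table.
        r = rng.random() * (n_seated + concentration)
        cumulative = 0
        for index, count in enumerate(tables):
            cumulative += count
            if r < cumulative:
                tables[index] += 1
                break
        else:
            # No existing table was picked (always true for the first customer).
            tables.append(1)
    return tables


for concentration in [0.0, 0.5, 1.0]:
    sizes = crp_table_sizes(1000, concentration, seed=0)
    print(concentration, len(sizes), sizes)
```

With a concentration of 0 nobody ever opens a second table, so everyone shares one. Larger concentrations open tables more often, with the expected number of tables growing roughly logarithmically in the number of customers.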