## Practical Assignment on Topic Mining

#### Míriam Sánchez Alcón

#### The Brown corpus is easily available in the NLTK toolkit. The texts are divided into the following categories.

• ’adventure’,
• ’belles_lettres’,
• ’editorial’,
• ’fiction’,
• ’government’,
• ’hobbies’,
• ’humor’,
• ’learned’,
• ’lore’,
• ’mystery’,
• ’news’,
• ’religion’,
• ’reviews’,
• ’romance’,
• ’science_fiction’

#### Run different topic extraction methods (ICA, pLSA, NMF). All these methods are available in open-source Python libraries: some may take the documents as strings, while others may require you to build the term-frequency matrix. Tweak the pre-processing and hyperparameters to improve the results. Compare the results from different methods, both quantitatively and qualitatively. Compare also with the categories above. Do you find some correlation? I.e. are some topics more predominant among some categories?

For this assignment we will use two different libraries: Gensim and Scikit-Learn.

## Preparing the dataset

In [54]:
from nltk.corpus import brown

data = []
 
for fileid in brown.fileids():
    document = ' '.join(brown.words(fileid))
    data.append(document)
 
NO_DOCUMENTS = len(data)
print(NO_DOCUMENTS)
print(data[0])
print("=" * 20)
print(data[1])

500
Austin , Texas -- Committee approval of Gov. Price Daniel's `` abandoned property '' act seemed certain Thursday despite the adamant protests of Texas bankers . Daniel personally led the fight for the measure , which he had watered down considerably since its rejection by two previous Legislatures , in a public hearing before the House Committee on Revenue and Taxation . Under committee rules , it went automatically to a subcommittee for one week . But questions with which committee members taunted bankers appearing as witnesses left little doubt that they will recommend passage of it . Daniel termed `` extremely conservative '' his estimate that it would produce 17 million dollars to help erase an anticipated deficit of 63 million dollars at the end of the current fiscal year next Aug. 31 . He told the committee the measure would merely provide means of enforcing the escheat law which has been on the books `` since Texas was a republic '' . It permits the state to take over bank a

## 1. Gensim implementation

Since the Gensim version used here doesn’t ship an NMF implementation (newer releases provide one as gensim.models.nmf.Nmf), we only implement the LDA and LSI (Latent Semantic Indexing, also known as Latent Semantic Analysis) models.

In [41]:
import re
from gensim import models, corpora
from nltk import word_tokenize
from nltk.corpus import stopwords
 
NUM_TOPICS = 10
STOPWORDS = stopwords.words('english')
 
def clean_text(text):
    tokenized_text = word_tokenize(text.lower())
    # Raw string: avoids the invalid-escape warning that '\-' triggers in a normal string
    cleaned_text = [t for t in tokenized_text if t not in STOPWORDS and re.match(r'[a-zA-Z\-][a-zA-Z\-]{2,}', t)]
    return cleaned_text
 
# For gensim we need to tokenize the data and filter out stopwords
tokenized_data = []
for text in data:
    tokenized_data.append(clean_text(text))
 
 
# Build a Dictionary - association word to numeric id
dictionary = corpora.Dictionary(tokenized_data)
 
# Transform the collection of texts to a numerical form
corpus = [dictionary.doc2bow(text) for text in tokenized_data]
 
# Have a look at what the 6th document (index 5) looks like: [(word_id, count), ...]
print(corpus[5])
# [(12, 3), (14, 1), (21, 1), (25, 5), (30, 2), (31, 5), (33, 1), (42, 1), (43, 2),  ...
 
# Build the LDA model
lda_model = models.LdaModel(corpus=corpus, num_topics=NUM_TOPICS, id2word=dictionary)
 
# Build the LSI model
lsi_model = models.LsiModel(corpus=corpus, num_topics=NUM_TOPICS, id2word=dictionary)



[(5, 1), (6, 3), (10, 5), (16, 4), (25, 2), (28, 1), (31, 1), (34, 1), (35, 1), (39, 1), (42, 1), (43, 2), (52, 2), (55, 1), (57, 1), (59, 1), (65, 2), (70, 1), (74, 1), (76, 1), (86, 1), (88, 10), (89, 8), (90, 4), (93, 1), (95, 2), (97, 2), (109, 3), (110, 7), (131, 2), (132, 9), (135, 1), (148, 6), (149, 3), (171, 1), (172, 1), (178, 2), (179, 5), (184, 1), (186, 1), (196, 1), (201, 1), (219, 1), (220, 7), (221, 2), (227, 3), (235, 1), (236, 1), (246, 1), (248, 1), (249, 3), (251, 1), (253, 2), (257, 1), (260, 5), (275, 1), (288, 1), (292, 2), (301, 5), (308, 2), (311, 1), (317, 1), (318, 6), (319, 2), (324, 2), (330, 1), (335, 1), (338, 5), (339, 1), (340, 2), (341, 1), (342, 8), (351, 2), (352, 1), (356, 2), (357, 3), (359, 2), (364, 3), (366, 3), (368, 3), (370, 3), (374, 11), (377, 3), (380, 3), (381, 1), (384, 1), (386, 1), (391, 1), (392, 4), (404, 1), (409, 7), (412, 1), (427, 4), (429, 1), (430, 1), (434, 1), (449, 7), (457, 3), (459, 2), (462, 1), (465, 1), (467, 3), (480, 
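To make the `(word_id, count)` format above concrete, here is a minimal standard-library sketch of the same idea. The helpers `build_vocab` and `to_bow` are hypothetical names used only for illustration; they are not Gensim functions, and Gensim's `Dictionary` may assign ids in a different order.

```python
from collections import Counter

def build_vocab(tokenized_docs):
    """Assign each distinct token an integer id, in order of first appearance."""
    vocab = {}
    for doc in tokenized_docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(tokens, vocab):
    """Return sorted (token_id, count) pairs, skipping out-of-vocabulary tokens."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return sorted(counts.items())

# Toy documents (assumed data, not taken from the Brown corpus)
toy_docs = [["tax", "state", "tax"], ["game", "ball", "game", "state"]]
toy_vocab = build_vocab(toy_docs)
toy_bow = [to_bow(doc, toy_vocab) for doc in toy_docs]
print(toy_bow)
```

Each document becomes a sparse list of pairs, so words that do not occur in a document take no space at all.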

Now we will display the topics the two models have inferred:

In [42]:
print("LDA Model:")
print("-" * 10)
 
for idx in range(NUM_TOPICS):
    # Print the 10 most representative words of each topic
    print("Topic #%s:" % idx, lda_model.print_topic(idx, 10))
 
#print("=" * 20)
print('\n')

 
print("LSI Model:")
print("-" * 10)
 
for idx in range(NUM_TOPICS):
    # Print the 10 most representative words of each topic
    print("Topic #%s:" % idx, lsi_model.print_topic(idx, 10))
 
print("=" * 20)

LDA Model:
----------
Topic #0: 0.005*"one" + 0.005*"would" + 0.003*"said" + 0.003*"first" + 0.003*"time" + 0.002*"two" + 0.002*"new" + 0.002*"could" + 0.002*"made" + 0.002*"like"
Topic #1: 0.007*"one" + 0.006*"would" + 0.005*"said" + 0.004*"time" + 0.003*"could" + 0.003*"new" + 0.003*"like" + 0.003*"two" + 0.003*"man" + 0.002*"even"
Topic #2: 0.004*"one" + 0.004*"would" + 0.003*"may" + 0.003*"two" + 0.003*"man" + 0.002*"first" + 0.002*"time" + 0.002*"also" + 0.002*"many" + 0.002*"said"
Topic #3: 0.006*"one" + 0.005*"said" + 0.005*"would" + 0.004*"could" + 0.003*"time" + 0.002*"new" + 0.002*"little" + 0.002*"even" + 0.002*"state" + 0.002*"made"
Topic #4: 0.005*"one" + 0.004*"would" + 0.004*"could" + 0.003*"new" + 0.003*"said" + 0.003*"man" + 0.002*"may" + 0.002*"first" + 0.002*"time" + 0.002*"many"
Topic #5: 0.007*"one" + 0.005*"would" + 0.004*"new" + 0.004*"said" + 0.003*"two" + 0.003*"could" + 0.003*"time" + 0.003*"man" + 0.002*"even" + 0.002*"first"
Topic #6: 0.005*"one" + 0.004*"wo

As we can observe above, we have to look at the tokens chosen for each inferred topic and "guess" the topic name (sports, politics, etc.). Some topics are clearer than others, because they contain key tokens (government, church, school) rather than generic ones (shall, said). Nouns make the topic easier to infer.
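Generic tokens such as "one", "would" and "said" occur in almost every document, so pruning by document frequency is one pre-processing tweak worth trying. Gensim offers this through `Dictionary.filter_extremes(no_below=..., no_above=...)`; the sketch below only illustrates the idea with the standard library on assumed toy data, and `filter_by_doc_frequency` is a hypothetical helper name.

```python
from collections import Counter

def filter_by_doc_frequency(tokenized_docs, min_docs=2, max_fraction=0.5):
    """Keep tokens found in at least min_docs documents and in at most
    max_fraction of all documents."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))  # count each token once per document
    keep = {t for t, df in doc_freq.items()
            if df >= min_docs and df / n_docs <= max_fraction}
    return [[t for t in doc if t in keep] for doc in tokenized_docs]

# Toy corpus: "one" and "would" are everywhere, "said" is too rare
toy = [
    ["one", "would", "church", "god"],
    ["one", "would", "tax", "state"],
    ["one", "said", "church", "god"],
    ["one", "would", "tax", "state"],
]
filtered = filter_by_doc_frequency(toy, min_docs=2, max_fraction=0.5)
print(filtered)
```

After filtering, only the domain-specific tokens remain, which is the kind of vocabulary that tends to produce more distinguishable topics.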

Now we will use the models we just created to transform unseen documents into their topic distributions:

In [43]:
text = "I don't believe in God"
bow = dictionary.doc2bow(clean_text(text))

# The result is a list of (topic_id, weight) tuples
print("LSI model:")
print("----------")
print(lsi_model[bow])
# [(0, 0.091615426138426506), (1, -0.0085557463300508351), (2, 0.016744863677828108), (3, 0.040508186718598529), (4, 0.014201267714185898), (5, -0.012208538275305329), (6, 0.031254053085582149), (7, 0.017529584659403553), (8, 0.056957633371540077), (9, 0.025989149894888153)]
print('\n')
print("LDA model:")
print("----------")
print(lda_model[bow])
# [(0, 0.020005183), (1, 0.020005869), (2, 0.02000626), (3, 0.020005472), (4, 0.020009108), (5, 0.020005926), (6, 0.81994385), (7, 0.020006068), (8, 0.020006327), (9, 0.020005994)]
 

LSI model:
----------
[(0, 0.05138134885986674), (1, -0.010440317487344815), (2, 0.043701345773114034), (3, -0.13726770114628398), (4, -0.07847092271711568), (5, 0.12684235291421667), (6, -0.012135562784484483), (7, 0.0043321729036308616), (8, -0.028899572987512532), (9, 0.08303993859850312)]


LDA model:
----------
[(0, 0.033351634), (1, 0.033350583), (2, 0.03335156), (3, 0.699829), (4, 0.03335213), (5, 0.03335164), (6, 0.03335309), (7, 0.03335314), (8, 0.03335307), (9, 0.033354126)]
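The two outputs above are different kinds of vectors: the LDA weights form a probability distribution over topics, while the LSI values are signed coordinates in a latent space. A quick check on the numbers printed above (copied verbatim from this run) makes the difference visible:

```python
# Values copied from the output of the cell above
lda_vec = [(0, 0.033351634), (1, 0.033350583), (2, 0.03335156), (3, 0.699829),
           (4, 0.03335213), (5, 0.03335164), (6, 0.03335309), (7, 0.03335314),
           (8, 0.03335307), (9, 0.033354126)]
lsi_vec = [(0, 0.05138134885986674), (1, -0.010440317487344815),
           (2, 0.043701345773114034), (3, -0.13726770114628398),
           (4, -0.07847092271711568), (5, 0.12684235291421667),
           (6, -0.012135562784484483), (7, 0.0043321729036308616),
           (8, -0.028899572987512532), (9, 0.08303993859850312)]

# LDA: non-negative weights that sum to (approximately) 1
lda_total = sum(weight for _, weight in lda_vec)
dominant_topic, dominant_weight = max(lda_vec, key=lambda pair: pair[1])

# LSI: coordinates may be negative, so they are not probabilities
lsi_has_negative = any(weight < 0 for _, weight in lsi_vec)

print(round(lda_total, 4), dominant_topic, lsi_has_negative)
```

This is why LDA weights can be read as "share of the document explained by a topic", whereas LSI coordinates are only meaningful relative to each other.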


Gensim also offers a simple way of performing similarity queries using topic models:

In [44]:
from gensim import similarities
 
lda_index = similarities.MatrixSimilarity(lda_model[corpus])
 
# Let's perform some queries
# (named `sims` so the imported `similarities` module is not shadowed)
sims = lda_index[lda_model[bow]]
# Sort the similarities in descending order
sims = sorted(enumerate(sims), key=lambda item: -item[1])
 
# Top most similar documents:
print("Top most similar documents:")
print("---------------------------")
print(sims[:10])
#print("=" * 117)
print('\n')

# [(104, 0.87591344), (178, 0.86124849), (31, 0.8604598), (77, 0.84932965), (85, 0.84843522), (135, 0.84421808), (215, 0.84184396), (353, 0.84038532), (254, 0.83498049), (13, 0.82832891)]
 
# Let's see what's the most similar document
document_id, similarity = sims[0]
print("Most similar document:")
print("---------------------------")
print(data[document_id][:2000])

Top most similar documents:
---------------------------
[(323, 0.99320024), (265, 0.9926302), (93, 0.9920774), (3, 0.9920453), (165, 0.9920445), (120, 0.9919147), (158, 0.99155647), (211, 0.99153775), (65, 0.99152607), (64, 0.9912466)]


Most similar document:
---------------------------
Analysis Analysis means the evaluation of subparts , the comparative ratings of parts , the comprehension of the meaning of isolated elements . Analysis in roleplaying is usually done for the purpose of understanding strong and weak points of an individual or as a process to eliminate weak parts and strengthen good parts . Impersonal purposes Up to this point stress has been placed on roleplaying in terms of individuals . Roleplaying can be done for quite a different purpose : to evaluate procedures , regardless of individuals . For example : a sales presentation can be analyzed and evaluated through roleplaying . Examples Let us now put some flesh on the theoretical bones we have assembled by giving i

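With a query containing the word 'God', we would expect the most similar document to be a religious one. In the run shown above, however, the top match is a text about roleplaying analysis, and the ten best matches all score above 0.99. This is consistent with the LDA topics printed earlier, which share almost the same top words and therefore barely discriminate between documents.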
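Gensim's `MatrixSimilarity` ranks documents by the cosine similarity between topic vectors. The sketch below reproduces that ranking step with the standard library on assumed toy vectors; `cosine` and `rank_by_cosine` are hypothetical helper names, not Gensim API.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_by_cosine(query, doc_vectors, top_n=3):
    """Return (document index, similarity) pairs, most similar first."""
    scored = [(i, cosine(query, vec)) for i, vec in enumerate(doc_vectors)]
    return sorted(scored, key=lambda pair: -pair[1])[:top_n]

# Toy topic distributions over 3 topics (assumed data)
toy_doc_topics = [
    [0.9, 0.05, 0.05],
    [0.1, 0.8, 0.1],
    [0.7, 0.2, 0.1],
]
toy_query = [0.8, 0.1, 0.1]
ranking = rank_by_cosine(toy_query, toy_doc_topics)
print(ranking)
```

Note that when most documents load on the same topic, their cosine similarities to any query with that topic are all close to 1, so the ranking carries little information.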

## 2. Scikit-Learn implementation

Now we will go through the same process with sklearn. This library offers an NMF implementation as well. The algorithms are more bare-bones than what we’ve seen with gensim but, on the plus side, they implement the fit/transform interface we’re used to:

In [45]:
from sklearn.decomposition import NMF, LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
 
NUM_TOPICS = 10
 
vectorizer = CountVectorizer(min_df=5, max_df=0.9, 
                             stop_words='english', lowercase=True, 
                             token_pattern=r'[a-zA-Z\-][a-zA-Z\-]{2,}')  # raw string avoids the '\-' escape warning
data_vectorized = vectorizer.fit_transform(data)
 
# Build a Latent Dirichlet Allocation Model
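# (n_components replaces the older n_topics argument, which recent scikit-learn releases no longer accept)
lda_model = LatentDirichletAllocation(n_components=NUM_TOPICS, max_iter=10, learning_method='online')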
lda_Z = lda_model.fit_transform(data_vectorized)
print("LDA:NO_DOCUMENTS, NO_TOPICS")
print(lda_Z.shape)  # (NO_DOCUMENTS, NO_TOPICS)
print('\n')
 
# Build a Non-Negative Matrix Factorization Model
nmf_model = NMF(n_components=NUM_TOPICS)
nmf_Z = nmf_model.fit_transform(data_vectorized)
print("NMF:NO_DOCUMENTS, NO_TOPICS")
print(nmf_Z.shape)  # (NO_DOCUMENTS, NO_TOPICS)
print('\n')

 
# Build a Latent Semantic Indexing Model
lsi_model = TruncatedSVD(n_components=NUM_TOPICS)
lsi_Z = lsi_model.fit_transform(data_vectorized)
print("LSI:NO_DOCUMENTS, NO_TOPICS")
print(lsi_Z.shape)  # (NO_DOCUMENTS, NO_TOPICS)
print('\n')
 
 
# Let's see what the first document in the corpus looks like in the different topic spaces
print("LDA:")
print(lda_Z[0])
print("NMF:")
print(nmf_Z[0])
print("LSI:")
print(lsi_Z[0])



LDA:NO_DOCUMENTS, NO_TOPICS
(500, 10)


NMF:NO_DOCUMENTS, NO_TOPICS
(500, 10)


LSI:NO_DOCUMENTS, NO_TOPICS
(500, 10)


LDA:
[8.52279351e-01 1.16540390e-01 3.04409978e-02 1.05613037e-04
 1.05607941e-04 1.05605999e-04 1.05623267e-04 1.05597359e-04
 1.05596677e-04 1.05616759e-04]
NMF:
[0.         0.         2.11170691 0.07693485 0.         0.542902
 1.06897384 0.         0.         0.24524583]
LSI:
[ 23.30684227   1.59535259  21.81220042  -0.04981546   0.81265353
  11.36123712   4.7554325   -1.23417897   0.60253665 -14.05571044]


In order to inspect the inferred topics we need to implement a print function ourselves:

In [46]:
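def print_topics(model, vectorizer, top_n=10):
    # Look up the vocabulary once (get_feature_names_out() replaces the
    # older get_feature_names() in recent scikit-learn releases)
    feature_names = vectorizer.get_feature_names_out()
    for idx, topic in enumerate(model.components_):
        print("Topic %d:" % (idx))
        # argsort is ascending, so walk it backwards to get the top_n weights
        print([(feature_names[i], topic[i])
                        for i in topic.argsort()[:-top_n - 1:-1]])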
 
print("LDA Model:")
print('\n')

print_topics(lda_model, vectorizer)
print('\n')
 
print("NMF Model:")
print('\n')

print_topics(nmf_model, vectorizer)
print('\n')
 
print("LSI Model:")
print('\n')

print_topics(lsi_model, vectorizer)
print('\n')


LDA Model:


Topic 0:
[('state', 681.6683187197638), ('new', 678.2979404004526), ('states', 495.18118782997135), ('united', 383.6802647204728), ('government', 372.399487103471), ('years', 349.28442176751366), ('general', 346.01971328966005), ('year', 332.5298246593678), ('world', 332.32380998611956), ('public', 316.49184640641704)]
Topic 1:
[('said', 1703.383497780041), ('like', 1245.6028602829658), ('time', 1197.2708424868795), ('man', 1111.690873138322), ('did', 903.7456567041945), ('just', 795.617409324864), ('little', 743.6144892844557), ('way', 734.5544300057948), ('new', 679.3087292463248), ('know', 649.1202850905528)]
Topic 2:
[('god', 180.14418271938146), ('said', 143.29296870406253), ('john', 120.2759985680541), ('man', 96.89008321226092), ('mike', 92.9255198210274), ('new', 83.8210560113306), ('ball', 76.14530331297229), ('christ', 72.42839964551344), ('phil', 72.40540333170506), ('game', 70.27843289337984)]
Topic 3:
[('music', 88.01046813402108), ('feed', 85.4561377465113), 

[('form', 0.29488347989611224), ('dictionary', 0.27138603393610317), ('information', 0.2657526282064664), ('year', 0.20824648877086638), ('text', 0.20537331487760774), ('cell', 0.1714310637468196), ('forms', 0.16948747114876506), ('tax', 0.15835854193182533), ('new', 0.153866240454304), ('fiscal', 0.14762695602832895)]
Topic 9:
[('fiscal', 0.26909706261480437), ('year', 0.24906536245617833), ('tax', 0.1919994914567153), ('school', 0.18149031757594886), ('states', 0.12814731161986873), ('time', 0.12255129451655453), ('like', 0.11831102158793812), ('years', 0.0896163342981135), ('children', 0.08708290558021348), ('child', 0.08267413762856106)]




We can observe the token distribution for each inferred topic and create associations, as in the Gensim implementation from before. Here it is generally easier to map the presented tokens to a general topic, since there are many domain-specific nouns (house, home, government, court) that are straightforward to classify. From these observations it seems that the topics government, news and religion are more predominant, or just easier to spot than the others.
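To check the link between topics and Brown categories more systematically, one option is to average the document-topic weights per category. The sketch below shows the computation with the standard library on assumed toy data; `mean_topic_weights` is a hypothetical helper. On the real corpus, the labels could come from something like `brown.categories(fileid)[0]` for each file id and the rows from `lda_Z` or `nmf_Z`.

```python
from collections import defaultdict

def mean_topic_weights(labels, doc_topic):
    """Average the topic weights of all documents sharing the same label."""
    sums = {}
    counts = defaultdict(int)
    for label, row in zip(labels, doc_topic):
        if label not in sums:
            sums[label] = [0.0] * len(row)
        sums[label] = [s + w for s, w in zip(sums[label], row)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}

# Toy data: 4 documents, 2 topics (assumed values, not real model output)
toy_labels = ["news", "news", "religion", "religion"]
toy_doc_topic = [
    [0.8, 0.2],
    [0.6, 0.4],
    [0.1, 0.9],
    [0.3, 0.7],
]
means = mean_topic_weights(toy_labels, toy_doc_topic)

# Index of the strongest topic for each category
dominant = {label: max(range(len(row)), key=row.__getitem__)
            for label, row in means.items()}
print(means)
print(dominant)
```

A category whose mean vector is concentrated on one topic is a sign that the topic is predominant in that category; a flat mean vector suggests no clear correlation.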

Transforming an unseen document goes like this:

In [47]:
text = "The economy is working better than ever"
x = nmf_model.transform(vectorizer.transform([text]))[0]
print(x)

[0.00290024 0.         0.         0.         0.         0.00438791
 0.         0.         0.         0.00464881]


Here’s how to implement the similarity functionality we’ve seen in the gensim section:

In [48]:
from sklearn.metrics.pairwise import euclidean_distances
 
def most_similar(x, Z, top_n=5):
    dists = euclidean_distances(x.reshape(1, -1), Z)
    pairs = enumerate(dists[0])
    most_similar = sorted(pairs, key=lambda item: item[1])[:top_n]
    return most_similar
 
similarities = most_similar(x, nmf_Z)
document_id, similarity = similarities[0]
print("Our sample sentence:")
print("--------------------")

print(text)
print('\n')

print('Most similar text:')
print("--------------------")

print(data[document_id][:1000])

Our sample sentence:
--------------------
The economy is working better than ever


Most similar text:
--------------------
Livery stable -- J. Vernon , prop. '' . Coaching had declined considerably by 1905 , but the sign was still there , near the old Wells Fargo building in San Francisco , creaking in the fog as it had for thirty years . John Vernon had had all the patronage he cared for -- he had prospered , but he could not retire from horsedom . Coaching was in his blood . He had two interests in life : the pleasures of the table and driving . Twice a week he drove his tallyho over the Santa Cruz road , upland and through the redwood forest , with orchards below him at one hand , and glimpses of the Pacific at the other . The journey back he made along the coast road , traveling hell-for-leather , every lantern of the tallyho ablaze . The southward route was the classic run in California , and the most fashionable . His patronage on this stretch was made up largely of San Francisc

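The most similar text mentions 'Wells Fargo', a well-known American bank, which could relate it to the token 'economy'. The link is weak, though: the topic vector of our sentence is almost all zeros, so the smallest Euclidean distance may simply point to the document with the smallest topic weights.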
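A short query produces a topic vector close to the origin, and Euclidean distance then favours whichever document also has small weights, regardless of its topic. The toy sketch below (assumed data, hypothetical helper names) contrasts this with cosine similarity, which compares direction only; scikit-learn provides `cosine_similarity` in `sklearn.metrics.pairwise` for the real matrices.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine(u, v):
    """Cosine similarity (0.0 if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy document-topic weights over 3 topics (assumed data)
toy_Z = [
    [0.05, 0.04, 0.0],  # weak loadings on every topic
    [0.0, 0.0, 2.0],    # strongly about topic 2
    [1.5, 0.0, 0.1],    # strongly about topic 0
]
# Tiny weights, similar in scale to the NMF output for the sample sentence
query = [0.003, 0.0, 0.004]

nearest_euclid = min(range(len(toy_Z)), key=lambda i: euclidean(query, toy_Z[i]))
nearest_cosine = max(range(len(toy_Z)), key=lambda i: cosine(query, toy_Z[i]))
print(nearest_euclid, nearest_cosine)
```

Here the Euclidean match is the weakly loaded document, while the cosine match is the one that shares the query's strongest topic.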

Now for a cool tool. We can use SVD with 2 components (topics) to display words and documents in 2D. We will display documents first, since that is a bit more straightforward. We run the following lines to initialize Bokeh:

In [49]:
import pandas as pd
from bokeh.io import push_notebook, show, output_notebook
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource, LabelSet
output_notebook()

Plotting documents in 2D:

In [50]:
svd = TruncatedSVD(n_components=2)
documents_2d = svd.fit_transform(data_vectorized)
 
df = pd.DataFrame(columns=['x', 'y', 'document'])
df['x'], df['y'], df['document'] = documents_2d[:,0], documents_2d[:,1], range(len(data))
 
source = ColumnDataSource(ColumnDataSource.from_df(df))
labels = LabelSet(x="x", y="y", text="document", y_offset=8,
                  text_font_size="8pt", text_color="#555555",
                  source=source, text_align='center')
 
plot = figure(plot_width=600, plot_height=600)
plot.circle("x", "y", size=12, source=source, line_color="black", fill_alpha=0.8)
plot.add_layout(labels)
show(plot, notebook_handle=True)

We can now check whether documents that lie closer together on the plot are indeed more similar. To display words in 2D we just need to transpose the vectorized data: words_2d = svd.fit_transform(data_vectorized.T).

In [51]:
svd = TruncatedSVD(n_components=2)
words_2d = svd.fit_transform(data_vectorized.T)
 
df = pd.DataFrame(columns=['x', 'y', 'word'])
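# get_feature_names_out() replaces the older get_feature_names() in recent scikit-learn releases
df['x'], df['y'], df['word'] = words_2d[:,0], words_2d[:,1], vectorizer.get_feature_names_out()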
 
source = ColumnDataSource(ColumnDataSource.from_df(df))
labels = LabelSet(x="x", y="y", text="word", y_offset=8,
                  text_font_size="8pt", text_color="#555555",
                  source=source, text_align='center')
 
plot = figure(plot_width=600, plot_height=600)
plot.circle("x", "y", size=12, source=source, line_color="black", fill_alpha=0.8)
plot.add_layout(labels)
show(plot, notebook_handle=True)

To get a really meaningful word representation we would need a significantly larger corpus.

LDA is one of the most popular methods for topic modeling in real-world applications. That is because it tends to give good results, can be trained online (the model is updated with new data instead of being retrained from scratch) and can be run on multiple cores. Let’s repeat the process we did in the previous sections with sklearn and LatentDirichletAllocation:

In [52]:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
 
NUM_TOPICS = 10
 
vectorizer = CountVectorizer(min_df=5, max_df=0.9, 
                             stop_words='english', lowercase=True, 
                             token_pattern=r'[a-zA-Z\-][a-zA-Z\-]{2,}')  # raw string avoids the '\-' escape warning
data_vectorized = vectorizer.fit_transform(data)
 
# Build a Latent Dirichlet Allocation Model
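# (n_components replaces the older n_topics argument, which recent scikit-learn releases no longer accept)
lda_model = LatentDirichletAllocation(n_components=NUM_TOPICS, max_iter=10, learning_method='online')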
lda_Z = lda_model.fit_transform(data_vectorized)
 
text = "The economy is working better than ever"
x = lda_model.transform(vectorizer.transform([text]))[0]
print(x, x.sum())



[0.02500003 0.02500009 0.02501613 0.7749461  0.02500526 0.02500002
 0.02500018 0.02501239 0.02501077 0.02500903] 1.0


The factors corresponding to each component (topic) add up to 1. That’s not a coincidence: LDA treats each document as generated by a mixture of topics, and its purpose is to compute how much of the document was generated by each topic. In this example, about 77% of the document is attributed to the fourth topic (index 3).

LDA is an iterative algorithm and has two main steps:

1. In the initialization stage, each word is assigned to a random topic.
2. Iteratively, the algorithm goes through each word and reassigns the word to a topic, taking into consideration:
   - the probability of the word belonging to a topic
   - the probability of the document being generated by a topic
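The two steps described above correspond to collapsed Gibbs sampling, a common textbook inference scheme for LDA. The sketch below is a minimal standard-library illustration on assumed toy data, with hypothetical names and hyperparameters; it is not what scikit-learn runs, since its `LatentDirichletAllocation` uses variational Bayes instead.

```python
import random

def gibbs_lda(docs, n_topics, vocab_size, n_iter=50, alpha=0.1, beta=0.01, seed=0):
    """docs: lists of integer word ids. Returns (theta, topic_word counts)."""
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]                 # words per topic, per document
    topic_word = [[0] * vocab_size for _ in range(n_topics)]   # word counts per topic
    topic_total = [0] * n_topics                               # total words per topic
    assignments = []

    # Step 1: assign every word occurrence to a random topic
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    # Step 2: repeatedly reassign each word occurrence
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                # Remove the current assignment from the counts
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # weight(t) = P(word | topic t) * P(topic t | document), up to a constant
                weights = [
                    (topic_word[t][w] + beta) / (topic_total[t] + vocab_size * beta)
                    * (doc_topic[d][t] + alpha)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed per-document topic proportions; each row sums to 1
    theta = [[(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
             for doc, counts in zip(docs, doc_topic)]
    return theta, topic_word

# Toy corpus over a 4-word vocabulary: ids 0-1 and ids 2-3 tend to co-occur
toy_docs = [[0, 1, 0, 1], [0, 0, 1], [2, 3, 2], [3, 3, 2, 2]]
theta, topic_word = gibbs_lda(toy_docs, n_topics=2, vocab_size=4)
for row in theta:
    print([round(p, 3) for p in row])
```

Each row of `theta` is a topic mixture for one document, which is the same kind of output as the vector printed by `lda_model.transform` above.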

Because LDA yields these per-topic word probabilities and per-document topic proportions, its results are easy to visualize. We’re going to use a specialized tool called pyLDAvis:

In [53]:
# Note: newer pyLDAvis releases (3.4+) ship this module as pyLDAvis.lda_model
import pyLDAvis.sklearn
 
pyLDAvis.enable_notebook()
panel = pyLDAvis.sklearn.prepare(lda_model, data_vectorized, vectorizer, mds='tsne')
panel



As we can see, the topics are shown on the left while words are on the right. By clicking on a topic bubble on the left side, we can see the words associated with that topic on the right side.

The main things we should consider are:

1. Larger topics are more frequent in the corpus.
2. Topics closer together are more similar, topics further apart are less similar.
3. When we select a topic, we can see the most representative words for the selected topic. This measure can be a combination of how frequent or how discriminant the word is. You can adjust the weight of each property using the slider.
4. Hovering over a word will adjust the topic sizes according to how representative the word is for the topic.
5. LDA can be used for automatic tagging: we can go over each topic (pyLDAvis helps a lot) and attach a label to it. In the example above, topic 2 is mainly about politics and government, while topic 4 is about sports. Unfortunately, not all topics are as clearly defined as these two. Results can be improved by experimenting with different NUM_TOPICS values. Our corpus is also not very large, with only 500 documents; a larger corpus would likely yield more clearly defined topics.
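For the quantitative comparison between methods and between NUM_TOPICS values, one commonly used score is UMass topic coherence, which rewards topics whose top words co-occur in the same documents. The sketch below is a standard-library illustration on assumed toy data, and `umass_coherence` is a hypothetical helper; Gensim's `CoherenceModel` (with `coherence='u_mass'`) is the library route on the real corpus.

```python
import math

def umass_coherence(top_words, tokenized_docs):
    """Sum of log((D(w_i, w_j) + 1) / D(w_j)) over word pairs with j ranked before i,
    where D counts the documents containing all of the given words."""
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_count(*words):
        return sum(1 for s in doc_sets if all(w in s for w in words))

    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            denom = doc_count(top_words[j])
            if denom == 0:
                continue  # skip words that never occur in the reference documents
            score += math.log((doc_count(top_words[i], top_words[j]) + 1) / denom)
    return score

# Toy reference corpus (assumed data)
toy_corpus = [
    ["tax", "state", "fiscal"],
    ["tax", "state"],
    ["god", "church"],
    ["god", "church", "christ"],
]
coherent_score = umass_coherence(["tax", "state"], toy_corpus)
mixed_score = umass_coherence(["tax", "god"], toy_corpus)
print(coherent_score, mixed_score)
```

Higher (less negative) values indicate more coherent topics, so averaging this score over all topics of a model gives a single number to compare LDA, LSI and NMF, or different topic counts.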