# Topic Modeling and Applications in Text Classification and Text Clustering

Copyright by Pham Quang Nhat Minh (FPT Technology Research Institute (FTRI) - FPT University)

## Introduction

In this tutorial, we introduce how to use LDA libraries to estimate topic models on data and how to use topic distributions as representations of documents in text classification and text clustering.

## Topic Modeling Overview

**Acknowledgement**: Content of this section is credited by [Dr. Le Hong Phuong](http://mim.hus.vnu.edu.vn/phuonglh/).

A topic model is a type of statistical model for discovering the abstract topics that occur in a collection of documents. Each topic can be viewed as a distribution of words (a unigram language model). Intuitively, given that a
document is about a particular topic, one would expect particular words to appear in the document more or less frequently: ```dog``` and ```bone``` will appear more often in documents about ```dogs```, ```cat``` and ```meow``` will appear in documents about cats, and the and is will appear equally in both. A document typically concerns multiple topics in different proportions; thus, in a document that is 10% about cats and 90% about dogs, there would probably be about 9 times more dog words than cat words. A topic model captures this intuition in a mathematical framework, which allows examining a set of documents and discovering, based on the statistics of the words in each, what the topics might be and what each document’s balance of topics is.

Topic modelling is widely used in text mining and has many applications. In the lecture, you have learned some topic models, including latent semantic indexing (LSI), probabilistic latent semantic analysis (PLSA) and latent
Dirichlet allocation (LDA).

## Topic Modeling Toolkits

There are many implementations of LDA for topic modeling. In the tutorial, we try ```lda-c``` of David Blei and ```gensim``` library.

For a longer list of LDA implementation see the article on Wikipedia: [https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation).

## Get Your Hands Dirty

### Dataset

In this section, we used the sample data that contains 2246 documents from the Associated Press on the homepage of Proff David Blei. Download the data from [http://www.cs.princeton.edu/~blei/lda-c/ap.tgz](http://www.cs.princeton.edu/~blei/lda-c/ap.tgz). On \*nix environment, you may use the tool wget. After downloading, you need to uncompress the data into the current directory.

```
wget http://www.cs.princeton.edu/~blei/lda-c/ap.tgz
tar xvfz ap.tgz
```

The data file ```ap/ap.dat``` was converted into LDA format from original documents. The data format is as follows.

```
[M] [term_1]:[count] [term_2]:[count] ...  [term_N]:[count]
```

where [M] is the number of unique terms in the document, and the [count] associated with each term is how many times that term appeared in the document. Note that [term_1] is an integer which indexes the term; it is not a string.

For more information, you may want to read the instruction in [```readme.txt```](http://www.cs.princeton.edu/~blei/lda-c/readme.txt) within the lda-c-dist directory.

### LDA-C

lda-c is a C implementation of variational EM for latent Dirichlet allocation (LDA), a topic model for text or other discrete data. LDA is fully described in [Blei et al. (2003)](http://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf).

You need to download the source code, uncompress the tgz file, and compile the tool before using it.

```
wget http://www.cs.princeton.edu/~blei/lda-c/lda-c-dist.tgz
tar xvfz lda-c-dist.tgz
cd lda-c-dist
make
```

### Topic Estimation

The syntax for the estimating topics as follows.

```
lda est [alpha] [k] [settings] [data] [random/seeded/*] [directory]
```

For the data in the question, we can perform as follows.

```
./lda-c-dist/lda est 0.001 100 lda-c-dist/settings.txt ap/ap.dat random models/model-001
```

LDA model will be saved in the directory ```models/model-001```.

Will can show a top 20 words for each topic by using the script ```topics.py```.

```
python2.7 lda-c-dist/topics.py models/model-001/final.beta ap/vocab.txt 20
```

### Gensim

We can use Gensim - a Python library for Topic Modeling. You can install Gensim by using conda or pip. See [https://radimrehurek.com/gensim/install.html](https://radimrehurek.com/gensim/install.html)

```
conda install gensim
```

After install gensim, we can start using the library for topic modeling.

#### Reading data in LDA-C format

We use class BleiCorpus in the module ```gensim.corpora.bleicorpus``` to load the data in the file ```ap.dat``` and vocabulary file ```vocab.txt```.

In [1]:
import gensim

corpus = gensim.corpora.BleiCorpus('./ap/ap.dat', './ap/vocab.txt')

Now we can use the corpus in a loop like so:

```
for document in corpus:
    # Some process on the document
```

If we just want to look at the content of the first document, we can do as follows.

In [2]:
print( next(iter(corpus)) )

[(0, 1.0), (6144, 1.0), (3586, 2.0), (3, 1.0), (4, 1.0), (1541, 1.0), (8, 1.0), (10, 1.0), (3927, 1.0), (12, 7.0), (4621, 1.0), (527, 1.0), (9232, 1.0), (1112, 2.0), (20, 1.0), (2587, 1.0), (6172, 1.0), (10269, 2.0), (37, 1.0), (42, 1.0), (3117, 1.0), (1582, 1.0), (1585, 3.0), (435, 1.0), (9268, 3.0), (571, 2.0), (60, 1.0), (61, 1.0), (63, 2.0), (64, 2.0), (5185, 1.0), (11, 1.0), (4683, 1.0), (590, 2.0), (1103, 2.0), (592, 1.0), (5718, 1.0), (1623, 2.0), (1624, 4.0), (89, 2.0), (6234, 1.0), (8802, 1.0), (1638, 1.0), (103, 1.0), (600, 1.0), (9404, 1.0), (106, 1.0), (3691, 1.0), (720, 1.0), (2672, 1.0), (113, 1.0), (2165, 1.0), (5751, 1.0), (123, 3.0), (1148, 1.0), (128, 2.0), (1670, 2.0), (4231, 1.0), (1167, 1.0), (144, 1.0), (147, 1.0), (149, 7.0), (3735, 2.0), (5272, 2.0), (1732, 1.0), (673, 2.0), (5282, 1.0), (27, 1.0), (1700, 1.0), (9893, 2.0), (166, 1.0), (167, 1.0), (173, 1.0), (174, 1.0), (2224, 1.0), (2248, 1.0), (372, 2.0), (186, 1.0), (4284, 3.0), (3450, 2.0), (117, 2.0), (203

Each pair printed in the above output is the pair of termID:termCount. ```termID``` is the word index in the vocabulary and ```termCount``` is the number of times term ```termID``` occurs in the document.

We can identify words that correspond to term indexes. Here we just print 10 words in the document.

In [3]:
doc = next(iter(corpus))
for word_id, freq in doc[0:10]:
    print(corpus.id2word[word_id], freq)

i 1.0
maurice 1.0
adult 2.0
people 1.0
year 1.0
h 1.0
last 1.0
years 1.0
resolved 1.0
police 7.0


#### Estimating topic models from the data

Now we will train a LDA model on the data we have and save the model to the disk.

In [4]:
model = gensim.models.LdaModel(corpus, id2word=corpus.id2word, 
                               alpha='auto', num_topics=100)
model.save('ap.lda')

According to the tutorial [Using Gensim for LDA](http://christop.club/2014/05/06/using-gensim-for-lda/), the meaning of parameters we used in the above function is as follows. (We just copy & paste here)

1. id2word: Although you can build a model from just a corpus, I’ve gone ahead and let the LdaModel know about the corpus.id2word. It just makes some of the things I’ll show you next nicer.
2. alpha: This particular LDA implementation uses something that can automatically update the alpha value for us. This determines how ‘smooth’ the model is, which makes no damned sense if you aren’t working in the area (it doesn’t make much sense to me). Here’s what alpha does: as it gets smaller, each document is going to be more specific, i.e., likely to only made up of a few topics. As it gets bigger, a document can begin to appear in multiple topics, which is what we want. It’s not good to have a large alpha either, because then all our topics will start intermingling and making out and that’s gross.
3. num_topics: The num_topics parameter just determines how many topics we want the model to give us. I’ve used 100 here since we are only looking at a corpus of titles.

Now let's look at some random topics.

In [6]:
model.show_topics(num_topics=5, num_words=10, log=True)

[(79,
  '0.087*cdy + 0.078*clr + 0.027*rn + 0.012*eduardo + 0.011*uprisings + 0.010*havana + 0.008*menem + 0.005*new + 0.005*two + 0.004*ali'),
 (87,
  '0.013*yeutter + 0.010*fresh + 0.008*outlets + 0.008*deteriorating + 0.008*forestry + 0.008*iran + 0.006*states + 0.006*perez + 0.005*trade + 0.004*united'),
 (32,
  '0.008*year + 0.006*i + 0.005*million + 0.005*two + 0.005*last + 0.005*percent + 0.004*first + 0.004*new + 0.004*people + 0.004*years'),
 (34,
  '0.012*i + 0.012*united + 0.010*states + 0.008*year + 0.004*people + 0.004*going + 0.004*years + 0.004*first + 0.004*congress + 0.004*new'),
 (82,
  '0.009*billion + 0.008*trade + 0.007*united + 0.007*states + 0.005*last + 0.005*canadian + 0.005*percent + 0.005*imports + 0.004*japan + 0.004*world')]

We get top 5 topics and words with probabilities that words belong the topic.

Now let's do querying the trained topic model with a *fake* query.

In [7]:
query = "police gun boy minister government"
query = query.split()
query

['police', 'gun', 'boy', 'minister', 'government']

We generate a dictionary that is a map from word id to word.

In [8]:
id2word = gensim.corpora.Dictionary()
id2word.merge_with(corpus.id2word)

<gensim.models.VocabTransform at 0x10ef2fcf8>

And then we converty the query to the list of tuples (word, frequency).

In [9]:
query = id2word.doc2bow(query)
query

[(9, 1), (12, 1), (122, 1), (1585, 1), (1624, 1)]

Now we infer the distribution over topics for the document.

In [10]:
import random
random.seed(1000)
a = list(sorted(model[query], key=lambda x: x[1]))
print(a)

[(42, 0.40112170293508975), (75, 0.42665021287633781)]


Let’s check out what words are most associated with those some topics.

In [11]:
model.print_topic(a[0][0])

'0.014*government + 0.013*party + 0.009*police + 0.007*mexican + 0.007*allied + 0.006*threats + 0.005*mexico + 0.005*national + 0.005*two + 0.005*minister'

In [12]:
model.print_topic(a[0][0])

'0.014*government + 0.013*party + 0.009*police + 0.007*mexican + 0.007*allied + 0.006*threats + 0.005*mexico + 0.005*national + 0.005*two + 0.005*minister'

### R Implementation of LDA

There is also an implementation of LDA for R language. Please see the package implemented by Jonathan Chang on [lda: Collapsed Gibbs Sampling Methods for Topic Models](https://cran.r-project.org/web/packages/lda/index.html).

## Applications of Topic Modeling in Text Classification and Text Clustering

In this section, we will learn how we can use topic distributions of documents as new representations for the text classification task and compare with the **BOW** representation.

We will use [the 20 Newsgroups data set](http://qwone.com/~jason/20Newsgroups/) for this tutorial. The data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.

### Basic Idea

The basic idea is to train a topic model on the training portion of the data. We will get the distribution over words for documents in the training data. We use the new representations to train a classification model on the training data. In testing, we used the generated topic model to infer topic distributions, and used the trained classifier to predict the labels for documents in test data.

### Data Set

We will use the python library [scikit-learn](http://scikit-learn.org/stable/about.html) to fetch 20newsgroups data. Since it takes time to train LDA model from all documents in the data, we just use a subset of categories.

In [13]:
from sklearn.datasets import fetch_20newsgroups

categories = [
        'alt.atheism',
        'talk.religion.misc',
        'comp.graphics',
        'sci.space',
    ]

data_train = fetch_20newsgroups(subset='train', data_home = './',
                                categories = categories,
                                shuffle=True, random_state=42,
                                remove=('headers', 'footers', 'quotes'))

data_test = fetch_20newsgroups(subset='test', data_home = './',
                               categories = categories,
                               shuffle=True, random_state=42,
                               remove=('headers', 'footers', 'quotes'))

We now list all newsgroups in the data.

In [14]:
# order of labels in `target_names` can be different from `categories`
target_names = data_train.target_names
print(target_names)

['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']


And you may to know some basic statistics about data, such as the number of examples.

In [15]:
print('Number of training examples: %d' % len(data_train.data))
print('Number of test examples: %d' % len(data_test.data))

Number of training examples: 2034
Number of test examples: 1353


### Text Classification with BOW features

In this section, we will perform text classification on 20newsgroups data by using Bag-of-word (BOW) features. The section is based on [the tutorial](http://scikit-learn.org/stable/auto_examples/text/document_classification_20newsgroups.html) on scikit-learn.

We use TF-IDF weighting scheme to represent documents in the data set. We use [Support Vector Machine (SVM)](https://en.wikipedia.org/wiki/Support_vector_machine) as the classification algorithm.

In [16]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

data_train = fetch_20newsgroups(subset='train', data_home = './',
                                categories = categories,
                                shuffle=True, random_state=42,
                                remove=('headers', 'footers', 'quotes'))

data_test = fetch_20newsgroups(subset='test', data_home = './',
                               categories = categories,
                               shuffle=True, random_state=42,
                               remove=('headers', 'footers', 'quotes'))
# split a training set and a test set
y_train, y_test = data_train.target, data_test.target
print("Extracting features from the training data using a sparse vectorizer")

# sublinear_tf : boolean, default=False
# Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                             stop_words='english')
X_train = vectorizer.fit_transform(data_train.data)
print("n_samples: %d, n_features: %d" % X_train.shape)
print()

print("Extracting features from the test data using the same vectorizer")
X_test = vectorizer.transform(data_test.data)
print("n_samples: %d, n_features: %d" % X_test.shape)
print()

Extracting features from the training data using a sparse vectorizer
n_samples: 2034, n_features: 26576

Extracting features from the test data using the same vectorizer
n_samples: 1353, n_features: 26576



Now, we will train a classification model using SVM with linear kernel.

In [17]:
from sklearn.svm import LinearSVC
from sklearn import metrics

print(target_names)
clf = LinearSVC(loss='squared_hinge', penalty='l2', dual=False, tol=1e-3)

clf.fit(X_train, y_train)
pred = clf.predict(X_test)

score = metrics.accuracy_score(y_test, pred)
print("accuracy:   %0.3f" % score)

print("classification report:")
print(metrics.classification_report(y_test, pred,
                                    target_names=target_names))

['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']
accuracy:   0.780
classification report:
                    precision    recall  f1-score   support

       alt.atheism       0.69      0.62      0.66       319
     comp.graphics       0.89      0.91      0.90       389
         sci.space       0.78      0.90      0.84       394
talk.religion.misc       0.68      0.60      0.64       251

       avg / total       0.77      0.78      0.78      1353



### Preprocessing

We now estimate topic models from the training portion of the data set. We will ```gensim``` to do the task.

We follow the tutorial [Latent Dirichlet Allocation (LDA) with Python](https://rstudio-pubs-static.s3.amazonaws.com/79360_850b2a69980c4488b1db95987a24867a.html), and go through step by step to create a LDA model from the training data.

We will do following steps in preprocessing:

- Tokenization
- Removing stop words
- Stemming

#### Tokenization

In [18]:
from nltk.tokenize import word_tokenize
docs_train = [word_tokenize(doc) for doc in data_train.data]

#### Removing stop words

Now, we remove stop words in the training documents.

In [19]:
from nltk.corpus import stopwords
import re
import string
english_stops = set(stopwords.words('english'))
punc_set = set( [ c for c in string.punctuation ] )

def stopwords_and_punc_filter(doc):
    return [word.lower() for word in doc 
            if (word.lower() not in english_stops) and
            (word not in punc_set) and 
            (re.search(r'^[0-9a-zA-Z_]+$', word))]

We use the function ```stopwords_and_punc_filter``` to remove stop words in documents.

In [20]:
docs_train_no_stopwords = [ stopwords_and_punc_filter(doc) 
                            for doc in docs_train ]
print(docs_train_no_stopwords[0])

['hi', 'noticed', 'save', 'model', 'mapping', 'planes', 'positioned', 'carefully', 'file', 'reload', 'restarting', '3ds', 'given', 'default', 'position', 'orientation', 'save', 'file', 'preserved', 'anyone', 'know', 'information', 'stored', 'file', 'nothing', 'explicitly', 'said', 'manual', 'saving', 'texture', 'rules', 'file', 'like', 'able', 'read', 'texture', 'rule', 'information', 'anyone', 'format', 'file', 'file', 'format', 'available', 'somewhere', 'rych']


#### Stemming

We define a function for stemming a document.

In [21]:
import nltk
from nltk.stem.porter import PorterStemmer
wnl = nltk.WordNetLemmatizer()
porter = nltk.PorterStemmer()
def stemming_doc(doc):
    return [porter.stem(t) for t in doc]

def lemmatization(doc):
    return [wnl.lemmatize(t) for t in doc]

We use the above function for stemming step.

In [22]:
# train_texts = [stemming_doc(doc) for doc in docs_train_no_stopwords]
# train_texts = docs_train_no_stopwords
train_texts = [lemmatization(doc) for doc in docs_train_no_stopwords]
print(train_texts[0])

['hi', 'noticed', 'save', 'model', 'mapping', 'plane', 'positioned', 'carefully', 'file', 'reload', 'restarting', '3d', 'given', 'default', 'position', 'orientation', 'save', 'file', 'preserved', 'anyone', 'know', 'information', 'stored', 'file', 'nothing', 'explicitly', 'said', 'manual', 'saving', 'texture', 'rule', 'file', 'like', 'able', 'read', 'texture', 'rule', 'information', 'anyone', 'format', 'file', 'file', 'format', 'available', 'somewhere', 'rych']


###  Estimating Topic Models

We will create a dictionary from the collection of training documents.

The ```Dictionary()``` function traverses texts, assigning a unique integer id to each unique token while also collecting word counts and relevant statistics. To see each token’s unique integer id, try ```print(dictionary.token2id)```.

In [23]:
from gensim import corpora, models
dictionary = corpora.Dictionary(train_texts)
# print(dictionary.token2id)

Next, our dictionary must be converted into a ```bag-of-words```:

In [24]:
corpus = [dictionary.doc2bow(text) for text in train_texts]
print(corpus[0])

[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 2), (10, 1), (11, 1), (12, 2), (13, 1), (14, 1), (15, 1), (16, 2), (17, 2), (18, 2), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 6), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 2), (33, 1), (34, 1)]


After having term-document matrix, we are now ready to build LDA model. In order to speed up estimating LDA models, we use the module ```LdaMulticore```.

In [25]:
model = gensim.models.ldamulticore.LdaMulticore(corpus, 
                                                id2word=dictionary, 
                                                num_topics=500)

#### Examining the results

Our LDA model is now stored as ldamodel. We can review our topics with the ```print_topic``` and ```print_topics``` methods.

In [26]:
print(model.print_topics(num_topics=2, num_words=10))

[(498, '0.006*new + 0.006*washington + 0.006*street + 0.005*space + 0.005*time + 0.005*york + 0.005*source + 0.004*also + 0.004*dc + 0.004*ny'), (132, '0.015*order + 0.009*reuss + 0.007*image + 0.007*u + 0.006*texas + 0.006*post + 0.005*member + 0.005*omniscient + 0.005*use + 0.005*whole')]


### Using Topic Distributions of Documents for Classification

Now we will use distribution over topics for the classification task. We need to use LDA model to infer topic distributions of documents in training and test data.

Let's infer topic distributions for training documents. For instance, we get the topic distribution for document 0 in the corpus by using following function.

In [27]:
lda_doc = model.__getitem__(corpus[0], 0.00001)
# Print some topics with probabilities
lda_doc[0:10]

[(0, 4.2553191489361501e-05),
 (1, 4.2553191489361501e-05),
 (2, 4.2553191489361501e-05),
 (3, 4.2553191489361501e-05),
 (4, 4.2553191489361501e-05),
 (5, 4.2553191489361501e-05),
 (6, 4.2553191489361501e-05),
 (7, 4.2553191489361501e-05),
 (8, 4.2553191489361501e-05),
 (9, 4.2553191489361501e-05)]

We now get topic distributions for all training documents, and convert them to feature vectors. These feature vectors will be used as input of Machine Learning algorithms.

In [28]:
from sklearn.feature_extraction import DictVectorizer

lda_docs_train = [model.__getitem__(doc, 0.00001) for doc in corpus]
vectorizer = DictVectorizer(sparse=False)
X_lda_train = vectorizer.fit_transform([dict(doc) for doc in lda_docs_train])

We now need to infer topic distributions for all documents in test data. We walk through similar steps as for training documents.

In [29]:
from nltk.tokenize import word_tokenize
docs_test = [word_tokenize(doc) for doc in data_test.data]
docs_test_no_stopwords = [ stopwords_and_punc_filter(doc) 
                            for doc in docs_test ]
test_texts  = [lemmatization(doc) for doc in docs_test_no_stopwords]
test_corpus = [dictionary.doc2bow(text) for text in test_texts]

We infer topics distributions for test documents.

In [30]:
lda_docs_test = [model.__getitem__(doc, 0.00001) for doc in test_corpus]
lda_docs_test[0][0:10]
X_lda_test = vectorizer.fit_transform([dict(doc) for doc in lda_docs_test])

We fit the data with new representations of documents.

In [31]:
clf = LinearSVC(loss='squared_hinge', penalty='l2', dual=False, tol=1e-3)
clf.fit(X_lda_train, y_train)

pred = clf.predict(X_lda_test)

score = metrics.accuracy_score(y_test, pred)
print("accuracy:   %0.3f" % score)

print("classification report:")
print(metrics.classification_report(y_test, pred,
                                    target_names=target_names))

accuracy:   0.607
classification report:
                    precision    recall  f1-score   support

       alt.atheism       0.58      0.45      0.51       319
     comp.graphics       0.65      0.77      0.70       389
         sci.space       0.60      0.75      0.66       394
talk.religion.misc       0.56      0.32      0.41       251

       avg / total       0.60      0.61      0.59      1353



We see that the accuracy is much lower than the using BOW representations. In order to improve the result, you may want to try:

- Tunning the parameters for Topic Models (i.e., the number of topics, alpha, etc)
- Applying feature selection
- Combine BOW representations with topical distribution-based features

**Three possible above methods are left as practice exercises for you.**

### Text Clustering with BOW Representations

We know apply k-means algorithm for text clustering on the 20newsgroups data set. In this section, we use BOW representations of documents for text clustering.

In [32]:
import numpy as np
from time import time
from sklearn.cluster import KMeans, MiniBatchKMeans

labels = data_train.target
true_k = np.unique(labels).shape[0]
km = KMeans(n_clusters=true_k, init='k-means++', random_state=42, verbose=False)
t0 = time()
km.fit(X_train)
print("done in %0.3fs" % (time() - t0))
print()

done in 10.239s



We now evaluate the clustering result.

In [33]:
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels, km.labels_))
print("Completeness: %0.3f" % metrics.completeness_score(labels, km.labels_))
print("V-measure: %0.3f" % metrics.v_measure_score(labels, km.labels_))
print("Adjusted Rand-Index: %.3f"
      % metrics.adjusted_rand_score(labels, km.labels_))
print("Silhouette Coefficient: %0.3f"
      % metrics.silhouette_score(X_train, km.labels_, sample_size=1000))

print()

Homogeneity: 0.278
Completeness: 0.365
V-measure: 0.316
Adjusted Rand-Index: 0.234
Silhouette Coefficient: 0.007



### Text Clustering with Topic Distributions

We use topic distributions of documents as new representations of documents.

In [34]:
km_lda = KMeans(n_clusters=true_k, init='k-means++', random_state=42, verbose=False)
t0 = time()
km.fit(X_lda_train)
print("done in %0.3fs" % (time() - t0))
print()

print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels, km.labels_))
print("Completeness: %0.3f" % metrics.completeness_score(labels, km.labels_))
print("V-measure: %0.3f" % metrics.v_measure_score(labels, km.labels_))
print("Adjusted Rand-Index: %.3f"
      % metrics.adjusted_rand_score(labels, km.labels_))
print("Silhouette Coefficient: %0.3f"
      % metrics.silhouette_score(X_train, km.labels_, sample_size=1000))

print()

done in 0.212s

Homogeneity: 0.003
Completeness: 0.063
V-measure: 0.006
Adjusted Rand-Index: 0.001
Silhouette Coefficient: 0.006



We can see that comparing with BOW representations, results of using topic distribution are much worse. **Please think how to improve these results**.

One possible improvement is to normalizing data before clustering. **We leave this as an exercise for readers**.



## References

- [Using Gensim for LDA](http://christop.club/2014/05/06/using-gensim-for-lda/)
- [Latent Dirichlet Allocation (LDA) with Python](https://rstudio-pubs-static.s3.amazonaws.com/79360_850b2a69980c4488b1db95987a24867a.html)
- [models.ldamodel – Latent Dirichlet Allocation](https://radimrehurek.com/gensim/models/ldamodel.html)
- [Classification of text documents using sparse features](http://scikit-learn.org/stable/auto_examples/text/document_classification_20newsgroups.html)
- [The 20 newsgroups text datas](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html)
- [Clustering on scikit-learn](http://scikit-learn.org/stable/modules/clustering.html)
- [Clustering text documents using k-means](http://scikit-learn.org/stable/auto_examples/text/document_clustering.html)