In [1]:
import logging
from gensim.models import EnsembleLda, LdaMulticore
from gensim.corpora import OpinosisCorpus
import os

Enable the ensemble logger to show what it is currently doing.

In [2]:
elda_logger = logging.getLogger(EnsembleLda.__module__)
elda_logger.setLevel(logging.INFO)
elda_logger.addHandler(logging.StreamHandler())

# Experiments on the Opinosis Dataset

Opinosis is a very small corpus: it contains 289 product reviews for 51 products, which makes it hard to extract topics from it. Source: https://github.com/kavgan/opinosis

## Preparing the corpus

First, download the Opinosis dataset. On Linux, for example, this can be done as follows:

In [None]:
!mkdir -p ~/opinosis
!wget -P ~/opinosis https://github.com/kavgan/opinosis/raw/master/OpinosisDataset1.0_0.zip
!unzip ~/opinosis/OpinosisDataset1.0_0.zip -d ~/opinosis
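On systems without `wget` and `unzip`, the same steps can be approximated with the Python standard library. This is a minimal sketch rather than part of the package: `extract_archive` and `download_opinosis` are hypothetical helper names, and the URL is the one from the `wget` command above.

```python
import os
import urllib.request
import zipfile

# URL taken from the wget command above.
OPINOSIS_URL = "https://github.com/kavgan/opinosis/raw/master/OpinosisDataset1.0_0.zip"


def extract_archive(zip_path, target_dir):
    """Extract zip_path into target_dir and return the archive's member names."""
    os.makedirs(target_dir, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
        return archive.namelist()


def download_opinosis(target_dir, url=OPINOSIS_URL):
    """Hypothetical helper: download the archive into target_dir, then extract it there."""
    os.makedirs(target_dir, exist_ok=True)
    zip_path = os.path.join(target_dir, os.path.basename(url))
    urllib.request.urlretrieve(url, zip_path)
    return extract_archive(zip_path, target_dir)
```

Calling `download_opinosis(os.path.expanduser('~/opinosis'))` should leave the same folder layout as the shell commands, assuming the URL is still reachable.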

In [3]:
path = os.path.expanduser('~/opinosis/')

The corpus and the id2word mapping can be created with the OpinosisCorpus class provided in the package.
It preprocesses the data using the PorterStemmer and stopwords from the nltk package.

The constructor takes the path to the folder into which the zip file was extracted. That folder contains a 'summaries-gold' subfolder.

In [4]:
opinosis = OpinosisCorpus(path)

data source:
title:		Opinosis: a graph-based approach to abstractive summarization of highly redundant opinions
authors:	Ganesan, Kavita and Zhai, ChengXiang and Han, Jiawei
booktitle:	Proceedings of the 23rd International Conference on Computational Linguistics
pages:		340-348
year:		2010
organization:	Association for Computational Linguistics
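To make the terms "corpus" and "id2word mapping" concrete, here is a simplified, standard-library-only sketch of how a bag-of-words corpus can be built from raw text. It is not the OpinosisCorpus implementation: the stopword list is a tiny illustrative assumption, stemming is left out, and `tokenize` and `build_corpus` are hypothetical names.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "is", "and", "a", "was", "very"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stopwords."""
    return [tok for tok in re.findall(r"[a-z]+", text.lower()) if tok not in STOPWORDS]


def build_corpus(documents):
    """Return (corpus, id2word): bag-of-words vectors and the id -> token mapping."""
    token2id = {}
    corpus = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        for token in counts:
            # Assign the next free integer id to tokens seen for the first time.
            token2id.setdefault(token, len(token2id))
        corpus.append(sorted((token2id[t], n) for t, n in counts.items()))
    id2word = {idx: token for token, idx in token2id.items()}
    return corpus, id2word


docs = ["The screen is bright and clear.", "The staff was very friendly, very helpful staff."]
corpus, id2word = build_corpus(docs)
print(corpus)   # [[(0, 1), (1, 1), (2, 1)], [(3, 2), (4, 1), (5, 1)]]
print(id2word)  # {0: 'screen', 1: 'bright', 2: 'clear', 3: 'staff', 4: 'friendly', 5: 'helpful'}
```

Each document becomes a list of `(token_id, count)` pairs, and `id2word` translates the ids back into tokens when topics are printed.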


## Training

**parameters**

**topic_model_kind**: 'ldamulticore' is highly recommended for EnsembleLda. **ensemble_workers** and **distance_workers** reduce the time needed to train the models, as does the **masking_method** 'rank'. ldamulticore cannot fully utilize all cores on this small corpus, so **ensemble_workers** can be set to 3, which gave 95 to 100% CPU usage on my i5 3470.

Since the corpus is so small, a high value for **num_models** is needed to extract stable topics. The Opinosis corpus contains 51 categories, but some of them are quite similar. For example, there are 3 categories about the batteries of portable products, and there are multiple categories about cars. So I chose 20 for **num_topics**, which is smaller than the number of categories.

The default for **min_samples** would be 64, half the number of models. Since that setting returns no topics, or at most 2, I set it to 32.

In [5]:
elda = EnsembleLda(corpus=opinosis.corpus, id2word=opinosis.id2word, num_models=128, num_topics=20,
                   passes=20, iterations=100, ensemble_workers=3, distance_workers=4,
                   topic_model_kind='ldamulticore', masking_method='rank', min_samples=32)

Spawned worker to generate 42 topic models...
Spawned worker to generate 43 topic models...
Generating 42 topic models...
Generating 43 topic models...
Spawned worker to generate 43 topic models...
Generating 43 topic models...
Spawned worker to generate 640 rows of the asymmetric similarity matrix...
Spawned worker to generate 640 rows of the asymmetric similarity matrix...
Spawned worker to generate 640 rows of the asymmetric similarity matrix...
Spawned worker to generate 640 rows of the asymmetric similarity matrix...
Fitting the clustering model
Generating stable topics
Generating classic gensim model representation based on results from the ensemble
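The numbers in the log follow from the parameters. The sketch below reproduces them with plain arithmetic. `split_evenly` is a hypothetical helper, not gensim's own code, and the assumption that the similarity matrix has one row per topic of every model is inferred from the log (4 workers with 640 rows each).

```python
def split_evenly(total, workers):
    """Split total items into per-worker counts that differ by at most one."""
    base, extra = divmod(total, workers)
    # The last `extra` workers each take one additional item.
    return [base + (1 if i >= workers - extra else 0) for i in range(workers)]


num_models, num_topics = 128, 20
ensemble_workers, distance_workers = 3, 4

models_per_worker = split_evenly(num_models, ensemble_workers)
total_topics = num_models * num_topics  # assumed: one matrix row per topic
rows_per_worker = split_evenly(total_topics, distance_workers)

print(models_per_worker)  # [42, 43, 43]
print(total_topics)       # 2560
print(rows_per_worker)    # [640, 640, 640, 640]
print(num_models // 2)    # 64, the default min_samples described above
```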


In [6]:
# pretty print, note that the words are stemmed so they appear chopped off
for t in elda.print_topics(num_words=7):
    print('-', t[1].replace('*',' ').replace('"','').replace(' +',','), '\n')

- 0.145 free, 0.043 park, 0.033 coffe, 0.030 wine, 0.027 even, 0.025 morn, 0.022 internet 

- 0.161 screen, 0.062 bright, 0.043 clear, 0.027 easi, 0.021 read, 0.019 touch, 0.011 size 

- 0.127 staff, 0.075 friendli, 0.074 help, 0.070 servic, 0.016 quick, 0.012 profession, 0.012 good 

- 0.123 seat, 0.067 comfort, 0.050 uncomfort, 0.039 front, 0.037 back, 0.036 firm, 0.032 drive 

- 0.113 room, 0.104 clean, 0.041 small, 0.037 bathroom, 0.036 comfort, 0.023 size, 0.016 well 
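If the topic terms are needed as data rather than as printed text, the topic strings can be parsed instead of reformatted. This is a small sketch that assumes the raw format `weight*"word" + weight*"word"`, as implied by the `replace` calls above. `parse_topic` is a hypothetical helper, and the example string reuses values from the first topic printed above.

```python
import re

# Matches one weight*"word" term in a topic string.
TERM_PATTERN = re.compile(r'([0-9.]+)\*"([^"]+)"')


def parse_topic(topic_string):
    """Turn '0.145*"free" + 0.043*"park"' into [('free', 0.145), ('park', 0.043)]."""
    return [(word, float(weight)) for weight, word in TERM_PATTERN.findall(topic_string)]


example = '0.145*"free" + 0.043*"park" + 0.033*"coffe"'
terms = parse_topic(example)
print(terms)  # [('free', 0.145), ('park', 0.043), ('coffe', 0.033)]
```

The resulting `(word, weight)` pairs can be sorted, filtered, or compared across topics more easily than the formatted strings.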

