## Topics and Transformations

In [1]:
from pprint import pprint
from collections import defaultdict
from gensim import corpora, models

The aim is to transform documents from one vector representation into another. This process serves two goals:

1. To bring out hidden structure in the corpus, discover relationships between words and use them to describe the documents in a new and (hopefully) more semantic way.

2. To make the document representation more compact. This improves both efficiency (the new representation consumes fewer resources) and efficacy (marginal data trends are ignored, which reduces noise).

### Creating the Corpus

In [2]:
documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]

In [4]:
# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist] for document in documents]

# remove words that appear only once
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [[token for token in text if frequency[token] > 1] for text in texts]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
pprint(corpus)

[[(0, 1), (1, 1), (2, 1)],
 [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
 [(2, 1), (5, 1), (7, 1), (8, 1)],
 [(1, 1), (5, 2), (8, 1)],
 [(3, 1), (6, 1), (7, 1)],
 [(9, 1)],
 [(9, 1), (10, 1)],
 [(9, 1), (10, 1), (11, 1)],
 [(4, 1), (10, 1), (11, 1)]]
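To make the (id, count) pairs above less opaque, here is a minimal plain-Python sketch of the bag-of-words step. It is not gensim's implementation: the helper `to_bow` and the rule of numbering unseen tokens in alphabetical order within each document are assumptions, chosen because they reproduce the ids printed above.

```python
from collections import Counter, defaultdict

documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]

# Same preprocessing as in the cell above: drop stop words and singletons.
stoplist = set("for a of the and to in".split())
texts = [[w for w in d.lower().split() if w not in stoplist] for d in documents]
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1
texts = [[t for t in text if frequency[t] > 1] for text in texts]

token2id = {}  # grows as new tokens are encountered


def to_bow(tokens):
    """Return a sorted list of (feature_id, count) pairs for one document."""
    counts = Counter(tokens)
    # Assumption: unseen tokens get the next free ids in alphabetical order.
    for token in sorted(counts):
        if token not in token2id:
            token2id[token] = len(token2id)
    return sorted((token2id[t], n) for t, n in counts.items())


toy_corpus = [to_bow(text) for text in texts]
for bow in toy_corpus:
    print(bow)
```

Each pair reads as "feature id, number of occurrences in this document"; for instance, the pair with count 2 in the fourth document corresponds to "system", which occurs twice there.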


### Creating a transformation
The transformations are standard Python objects, typically initialized by means of a training corpus:

In [6]:
tfidf = models.TfidfModel(corpus)

# Transform a single bag-of-words vector
doc_bow = [(0, 1), (1, 1)]
print(tfidf[doc_bow])

[(0, 0.7071067811865476), (1, 0.7071067811865476)]
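The two equal weights can be reproduced by hand. The sketch below assumes the default weighting is raw term frequency times log2(N / df), followed by scaling to unit Euclidean length; `tfidf_by_hand` is a hypothetical helper written for illustration, not part of gensim.

```python
import math

# Bag-of-words corpus copied from the output printed earlier.
bow_corpus = [
    [(0, 1), (1, 1), (2, 1)],
    [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
    [(2, 1), (5, 1), (7, 1), (8, 1)],
    [(1, 1), (5, 2), (8, 1)],
    [(3, 1), (6, 1), (7, 1)],
    [(9, 1)],
    [(9, 1), (10, 1)],
    [(9, 1), (10, 1), (11, 1)],
    [(4, 1), (10, 1), (11, 1)],
]
num_docs = len(bow_corpus)

# Document frequency: the number of documents each feature id occurs in.
doc_freq = {}
for doc in bow_corpus:
    for feature_id, _count in doc:
        doc_freq[feature_id] = doc_freq.get(feature_id, 0) + 1


def tfidf_by_hand(bow):
    """Weight counts by log2(N / df), then scale the vector to unit length."""
    weighted = [
        (fid, count * math.log2(num_docs / doc_freq[fid])) for fid, count in bow
    ]
    norm = math.sqrt(sum(w * w for _, w in weighted))
    return [(fid, w / norm) for fid, w in weighted]


pair = tfidf_by_hand([(0, 1), (1, 1)])
fourth_doc = tfidf_by_hand(bow_corpus[3])
print(pair)
print(fourth_doc)
```

Features 0 and 1 each occur in two of the nine documents, so they receive the same weight and, after normalization, both end up at 1/sqrt(2). Up to floating-point rounding, the results should agree with the values gensim prints.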


Transformations always convert between two specific vector spaces. The same vector space (= the same set of feature ids) must be used for training as well as for subsequent vector transformations. Failure to use the same input feature space, such as applying a different string preprocessing, using different feature ids, or using bag-of-words input vectors where Tf-Idf vectors are expected, will result in a feature mismatch during transformation calls and consequently in garbage output, runtime exceptions, or both.
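A small plain-Python illustration of how easily such a mismatch arises. The `token2id` mapping below is a hypothetical three-word slice of a training vocabulary built from lowercased tokens, and `to_bow` is an illustrative helper rather than a gensim function.

```python
from collections import Counter

# Hypothetical slice of the training-time vocabulary (lowercased tokens).
token2id = {"computer": 0, "human": 1, "interface": 2}


def to_bow(tokens):
    """Count known tokens only; unknown tokens are silently dropped."""
    counts = Counter(t for t in tokens if t in token2id)
    return sorted((token2id[t], n) for t, n in counts.items())


query = "Human computer interaction"

same_preprocessing = to_bow(query.lower().split())
different_preprocessing = to_bow(query.split())  # lowercasing was forgotten

print(same_preprocessing)
print(different_preprocessing)
```

With the forgotten lowercasing, "Human" no longer matches the vocabulary and disappears from the vector without any error, which is the "garbage output" case described above.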

Applying a transformation to a whole corpus:

In [9]:
corpus_tfidf = tfidf[corpus]
print(corpus_tfidf)  # a lazy wrapper object, not a list of vectors

for doc in corpus_tfidf:
    print(doc)

<gensim.interfaces.TransformedCorpus object at 0x0000024889D8C848>
[(0, 0.5773502691896257), (1, 0.5773502691896257), (2, 0.5773502691896257)]
[(0, 0.44424552527467476), (3, 0.44424552527467476), (4, 0.44424552527467476), (5, 0.3244870206138555), (6, 0.44424552527467476), (7, 0.3244870206138555)]
[(2, 0.5710059809418182), (5, 0.4170757362022777), (7, 0.4170757362022777), (8, 0.5710059809418182)]
[(1, 0.49182558987264147), (5, 0.7184811607083769), (8, 0.49182558987264147)]
[(3, 0.6282580468670046), (6, 0.6282580468670046), (7, 0.45889394536615247)]
[(9, 1.0)]
[(9, 0.7071067811865475), (10, 0.7071067811865475)]
[(9, 0.5080429008916749), (10, 0.5080429008916749), (11, 0.695546419520037)]
[(4, 0.6282580468670046), (10, 0.45889394536615247), (11, 0.6282580468670046)]


Calling model[corpus] only creates a wrapper around the old corpus document stream; the actual conversions are done on the fly, during document iteration. We cannot convert the entire corpus at the time of calling corpus_transformed = model[corpus], because that would mean storing the result in main memory, and that contradicts *gensim’s objective of memory independence*. If you will be iterating over corpus_transformed multiple times, and the transformation is costly, serialize the resulting corpus to disk first and continue using that.
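The lazy behaviour can be mimicked in a few lines of plain Python. This is only a sketch of the idea: `LazyTransformed` and `double_counts` are hypothetical stand-ins, not gensim classes, and an in-memory list plays the role of the corpus serialized to disk.

```python
class LazyTransformed:
    """Apply `transform` to each document only while iterating."""

    def __init__(self, transform, corpus):
        self.transform = transform
        self.corpus = corpus

    def __iter__(self):
        for doc in self.corpus:
            yield self.transform(doc)


stats = {"calls": 0}


def double_counts(doc):
    """Stand-in for a costly model: doubles counts and tallies invocations."""
    stats["calls"] += 1
    return [(feature_id, 2 * count) for feature_id, count in doc]


toy_corpus = [[(0, 1), (1, 1)], [(1, 2)]]

wrapped = LazyTransformed(double_counts, toy_corpus)
calls_after_wrapping = stats["calls"]  # wrapping alone transforms nothing

first_pass = list(wrapped)
second_pass = list(wrapped)  # the work is repeated on every pass
calls_after_two_passes = stats["calls"]

materialized = list(wrapped)  # stand-in for serializing the result once
calls_before_reuse = stats["calls"]
for _ in range(2):
    for doc in materialized:  # re-reading stored results costs no transform calls
        pass
calls_after_reuse = stats["calls"]

print(calls_after_wrapping, calls_after_two_passes, calls_after_reuse)
```

In gensim itself, one way to perform the "materialize once" step on disk is corpora.MmCorpus.serialize, after which the stored corpus can be iterated repeatedly without re-running the transformation.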

Transformations can also be applied one on top of another, in a sort of chain:

In [10]:
lsi_model = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)  # initialize an LSI transformation
corpus_lsi = lsi_model[corpus_tfidf]  # create a double wrapper over the original corpus: bow->tfidf->fold-in-lsi
lsi_model.print_topics(2)

[(0,
  '-0.703*"trees" + -0.538*"graph" + -0.402*"minors" + -0.187*"survey" + -0.061*"system" + -0.060*"time" + -0.060*"response" + -0.058*"user" + -0.049*"computer" + -0.035*"interface"'),
 (1,
  '0.460*"system" + 0.373*"user" + 0.332*"eps" + 0.328*"interface" + 0.320*"response" + 0.320*"time" + 0.293*"computer" + 0.280*"human" + 0.171*"survey" + -0.161*"trees"')]

Here we transformed our Tf-Idf corpus via Latent Semantic Indexing into a latent 2-D space (2-D because we set num_topics=2). According to LSI, “trees”, “graph” and “minors” are all related words (and contribute the most to the direction of the first topic), while the second topic concerns itself with practically all the other words.

In [11]:
# Both bow->tf-idf and tf-idf->lsi transformations are actually executed on the fly
for doc, as_text in zip(corpus_lsi, documents):
    print(doc, as_text)

[(0, -0.06600783396090143), (1, 0.5200703306361851)] Human machine interface for lab abc computer applications
[(0, -0.19667592859142266), (1, 0.760956316770006)] A survey of user opinion of computer system response time
[(0, -0.08992639972446158), (1, 0.7241860626752513)] The EPS user interface management system
[(0, -0.07585847652177913), (1, 0.6320551586003431)] System and human system engineering testing of EPS
[(0, -0.10150299184979941), (1, 0.5737308483002965)] Relation of user perceived response time to error measurement
[(0, -0.7032108939378314), (1, -0.1611518021402553)] The generation of random binary unordered trees
[(0, -0.8774787673119836), (1, -0.16758906864659084)] The intersection graph of paths in trees
[(0, -0.9098624686818582), (1, -0.1408655362871868)] Graph minors IV Widths of trees and well quasi ordering
[(0, -0.6165825350569283), (1, 0.05392907566389603)] Graph minors A survey


As expected, the first five documents are more strongly related to the second topic, while the remaining four documents are more strongly related to the first topic.
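The split between the two groups of documents can be read off the coordinates mechanically. The sketch below copies the printed LSI coordinates (rounded to three decimals) and picks, for each document, the topic with the largest absolute weight; comparing magnitudes is a choice made here because the sign of an LSI coordinate is arbitrary.

```python
# LSI coordinates (topic 0, topic 1) copied from the output above, rounded.
lsi_coords = [
    (-0.066, 0.520),
    (-0.197, 0.761),
    (-0.090, 0.724),
    (-0.076, 0.632),
    (-0.102, 0.574),
    (-0.703, -0.161),
    (-0.877, -0.168),
    (-0.910, -0.141),
    (-0.617, 0.054),
]

# For each document, the index of the topic whose weight is largest in magnitude.
dominant_topic = [
    max(range(len(doc)), key=lambda topic: abs(doc[topic])) for doc in lsi_coords
]
print(dominant_topic)
```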

### Available transformations

Gensim implements several popular Vector Space Model algorithms:

Term Frequency * Inverse Document Frequency (Tf-Idf)<br>Tf-Idf expects a bag-of-words (integer values) training corpus during initialization. During transformation, it takes a vector and returns another vector of the same dimensionality, except that features which were rare in the training corpus will have their value increased. It therefore converts integer-valued vectors into real-valued ones, while leaving the number of dimensions intact. It can also optionally normalize the resulting vectors to (Euclidean) unit length.

In [12]:
model = models.TfidfModel(corpus, normalize=True)

Latent Semantic Indexing<br>LSI (or sometimes LSA) transforms documents from either bag-of-words or (preferably) Tf-Idf-weighted space into a latent space of a lower dimensionality. For the toy corpus above we used only 2 latent dimensions, but on real corpora, a target dimensionality of 200–500 is recommended as a “golden standard”.

In [14]:
model = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=300)
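As a rough sketch of the idea behind LSI, in standard truncated-SVD notation (the symbols are introduced here and do not come from the text above): let $X$ be the term-document matrix whose columns are the Tf-Idf document vectors, and let $k$ be the value of num_topics. Then

$$
X \approx U_k \Sigma_k V_k^{\top}, \qquad \hat{d} = U_k^{\top} d
$$

where $U_k$ holds the $k$ leading left singular vectors (the topics), $\Sigma_k$ the $k$ largest singular values, and $\hat{d}$ is the $k$-dimensional representation of a document vector $d$. Some formulations additionally rescale $\hat{d}$ by $\Sigma_k^{-1}$; whether a given implementation does so should be checked against its documentation.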

Random Projections<br>RP aim to reduce vector space dimensionality. This is a very efficient (both memory- and CPU-friendly) approach to approximating Tf-Idf distances between documents, by throwing in a little randomness. Recommended target dimensionality is again in the hundreds or thousands, depending on your dataset.

In [15]:
model = models.RpModel(corpus_tfidf, num_topics=500)
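To illustrate why a random projection can approximate distances, here is a generic plain-Python sketch. It uses a random sign matrix scaled by 1/sqrt(k), which is one common construction and not necessarily what RpModel does internally; the two sparse vectors are synthetic, and all helper names are hypothetical.

```python
import math
import random

NUM_FEATURES = 1000  # size of the original (synthetic) feature space
NUM_DIMS = 200       # target dimensionality

rng = random.Random(0)


def random_sparse_vector(nonzeros=20):
    """Synthetic sparse vector as a list of (feature_id, weight) pairs."""
    ids = sorted(rng.sample(range(NUM_FEATURES), nonzeros))
    return [(fid, rng.random()) for fid in ids]


def make_projection():
    """NUM_DIMS x NUM_FEATURES matrix of random signs scaled by 1/sqrt(k)."""
    scale = 1.0 / math.sqrt(NUM_DIMS)
    return [
        [rng.choice((-scale, scale)) for _ in range(NUM_FEATURES)]
        for _ in range(NUM_DIMS)
    ]


def project(matrix, sparse_vec):
    """Multiply the projection matrix by a sparse vector."""
    return [sum(row[fid] * w for fid, w in sparse_vec) for row in matrix]


def densify(sparse_vec):
    dense = [0.0] * NUM_FEATURES
    for fid, w in sparse_vec:
        dense[fid] = w
    return dense


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


doc_a = random_sparse_vector()
doc_b = random_sparse_vector()
matrix = make_projection()

original_distance = euclidean(densify(doc_a), densify(doc_b))
projected_distance = euclidean(project(matrix, doc_a), project(matrix, doc_b))
print(original_distance, projected_distance)
```

The two printed distances should be close but not identical; the approximation generally tightens as the target dimensionality grows.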

Latent Dirichlet Allocation<br>LDA is yet another transformation from bag-of-words counts into a topic space of lower dimensionality. LDA is a probabilistic extension of LSA (also called multinomial PCA), so LDA’s topics can be interpreted as probability distributions over words. These distributions are, just like with LSA, inferred automatically from a training corpus. Documents are in turn interpreted as a (soft) mixture of these topics (again, just like with LSA).

In [16]:
model = models.LdaModel(corpus, id2word=dictionary, num_topics=100)
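The "soft mixture" reading can be written down compactly. In the usual LDA notation (the symbols are introduced here for illustration), with $K$ topics, $\theta_{d,k}$ the weight of topic $k$ in document $d$, and $\phi_{k,w}$ the probability of word $w$ under topic $k$:

$$
p(w \mid d) = \sum_{k=1}^{K} \theta_{d,k}\,\phi_{k,w},
\qquad \sum_{k=1}^{K} \theta_{d,k} = 1,
\qquad \sum_{w} \phi_{k,w} = 1
$$

Each topic is thus a probability distribution over words, and each document a probability distribution over topics.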

Hierarchical Dirichlet Process<br>HDP is a non-parametric Bayesian method (note the missing number of requested topics):

In [17]:
model = models.HdpModel(corpus, id2word=dictionary)