# Jonathan Halverson
# Tuesday, February 7, 2017
# Topic modeling with Gensim

Gensim is a free Python library designed to automatically extract semantic topics from documents, as efficiently and painlessly as possible.

Gensim is designed to process raw, unstructured digital texts (“plain text”). The algorithms in gensim, such as Latent Semantic Analysis, Latent Dirichlet Allocation, and Random Projections, discover the semantic structure of documents by examining statistical co-occurrence patterns of the words within a corpus of training documents. These algorithms are unsupervised, which means no human input is necessary; you only need a corpus of plain text documents.

Once these statistical patterns are found, any plain text document can be succinctly expressed in the new, semantic representation and queried for topical similarity against other documents.

The basic idea is to come up with a long list of questions that have numerical answers (for example, how many times a given word appears in the document). Each document in the corpus is then represented by a list of (question number, answer) pairs. Pairs whose answer is zero are left out, which makes the vector sparse. A model then transforms these vectors into a new set of vectors. One example is TF-IDF, which produces a vector of weights.
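As a rough illustration of this representation, the sketch below builds sparse (id, count) vectors using only the standard library. The toy documents, the `word_to_id` mapping, and the helper `to_sparse_vector` are all hypothetical names introduced here. They are not part of gensim and are not the corpus used in the cells that follow.

```python
from collections import Counter

# Hypothetical toy documents, chosen only for illustration.
documents = [
    "human machine interface",
    "graph of trees",
    "graph minors trees graph",
]

# Give every distinct word an integer id (the "question number"),
# in order of first appearance.
word_to_id = {}
for doc in documents:
    for word in doc.split():
        word_to_id.setdefault(word, len(word_to_id))


def to_sparse_vector(text):
    """Return sorted (word id, count) pairs; words with a zero count never appear."""
    counts = Counter(word_to_id[w] for w in text.split() if w in word_to_id)
    return sorted((word_id, float(n)) for word_id, n in counts.items())


sparse_corpus = [to_sparse_vector(doc) for doc in documents]
for vector in sparse_corpus:
    print(vector)
```

Each printed list has the same shape as the entries of `corpus` below: only the words that actually occur in a document are stored.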

In [1]:
>>> import logging
>>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [2]:
>>> from gensim import corpora, models, similarities
>>>
>>> corpus = [[(0, 1.0), (1, 1.0), (2, 1.0)],
>>>           [(2, 1.0), (3, 1.0), (4, 1.0), (5, 1.0), (6, 1.0), (8, 1.0)],
>>>           [(1, 1.0), (3, 1.0), (4, 1.0), (7, 1.0)],
>>>           [(0, 1.0), (4, 2.0), (7, 1.0)],
>>>           [(3, 1.0), (5, 1.0), (6, 1.0)],
>>>           [(9, 1.0)],
>>>           [(9, 1.0), (10, 1.0)],
>>>           [(9, 1.0), (10, 1.0), (11, 1.0)],
>>>           [(8, 1.0), (10, 1.0), (11, 1.0)]]

2017-02-07 16:17:26,818 : INFO : 'pattern' package not found; tag filters are not available for English


Next, let’s initialize a transformation:

In [3]:
>>> tfidf = models.TfidfModel(corpus)

2017-02-07 16:17:26,830 : INFO : collecting document frequencies
2017-02-07 16:17:26,831 : INFO : PROGRESS: processing document #0
2017-02-07 16:17:26,832 : INFO : calculating IDF weights for 9 documents and 11 features (28 matrix non-zeros)


A transformation is used to convert documents from one vector representation into another:

In [4]:
>>> vec = [(0, 1), (4, 1)]
>>> print(tfidf[vec])

[(0, 0.8075244024440723), (4, 0.5898341626740045)]
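To see where these two weights come from, the sketch below recomputes them by hand. It assumes the weighting that `TfidfModel` appears to apply by default: the raw term count multiplied by log2(N / document frequency), followed by scaling to unit length. That assumption is consistent with the numbers printed above, but the helper names `document_frequencies` and `tfidf_vector` are hypothetical and this is not gensim's own code.

```python
import math

# Same nine documents as the corpus defined earlier in this notebook.
corpus = [[(0, 1.0), (1, 1.0), (2, 1.0)],
          [(2, 1.0), (3, 1.0), (4, 1.0), (5, 1.0), (6, 1.0), (8, 1.0)],
          [(1, 1.0), (3, 1.0), (4, 1.0), (7, 1.0)],
          [(0, 1.0), (4, 2.0), (7, 1.0)],
          [(3, 1.0), (5, 1.0), (6, 1.0)],
          [(9, 1.0)],
          [(9, 1.0), (10, 1.0)],
          [(9, 1.0), (10, 1.0), (11, 1.0)],
          [(8, 1.0), (10, 1.0), (11, 1.0)]]


def document_frequencies(docs):
    """Count in how many documents each feature id occurs."""
    df = {}
    for doc in docs:
        for term_id, _ in doc:
            df[term_id] = df.get(term_id, 0) + 1
    return df


def tfidf_vector(doc, df, num_docs):
    """Weight each count by log2(num_docs / df), then scale to unit length."""
    weighted = [(term_id, tf * math.log2(num_docs / df[term_id]))
                for term_id, tf in doc if term_id in df]
    norm = math.sqrt(sum(w * w for _, w in weighted))
    return [(term_id, w / norm) for term_id, w in weighted] if norm else []


df = document_frequencies(corpus)
manual = tfidf_vector([(0, 1), (4, 1)], df, len(corpus))
print(manual)
```

Feature 0 occurs in 2 of the 9 documents and feature 4 in 3 of them, so feature 0 is the rarer of the two and receives the larger weight.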


To transform the whole corpus via TF-IDF and index it in preparation for similarity queries:

In [5]:
>>> index = similarities.SparseMatrixSimilarity(tfidf[corpus], num_features=12)

2017-02-07 16:17:26,845 : INFO : creating sparse index
2017-02-07 16:17:26,846 : INFO : creating sparse matrix from corpus
2017-02-07 16:17:26,847 : INFO : PROGRESS: at document #0
2017-02-07 16:17:26,849 : INFO : created <9x12 sparse matrix of type '<type 'numpy.float32'>'
	with 28 stored elements in Compressed Sparse Row format>


In [6]:
>>> sims = index[tfidf[vec]]
>>> print(list(enumerate(sims)))

[(0, 0.4662244), (1, 0.19139354), (2, 0.24600551), (3, 0.82094586), (4, 0.0), (5, 0.0), (6, 0.0), (7, 0.0), (8, 0.0)]


How should this output be read? Document number zero (the first document) has a similarity score of 0.466 (46.6%), the second document has a similarity score of 19.1%, and so on.

Thus, according to the TF-IDF document representation and the cosine similarity measure, the document most similar to our query vec is document no. 3, with a similarity score of 82.1%. Note that in the TF-IDF representation, any document that shares no features with vec at all (documents no. 4–8) gets a similarity score of 0.0.
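The similarity scores can also be checked without the index. The sketch below is a plain-Python approximation that assumes the same weighting as the TF-IDF step (count times log2(N / document frequency), scaled to unit length) and takes the cosine as the dot product of two unit-length vectors. The helpers `tfidf_unit` and `cosine` are hypothetical names, and this is not how `SparseMatrixSimilarity` is implemented internally.

```python
import math

# Same nine documents as the corpus defined earlier in this notebook.
corpus = [[(0, 1.0), (1, 1.0), (2, 1.0)],
          [(2, 1.0), (3, 1.0), (4, 1.0), (5, 1.0), (6, 1.0), (8, 1.0)],
          [(1, 1.0), (3, 1.0), (4, 1.0), (7, 1.0)],
          [(0, 1.0), (4, 2.0), (7, 1.0)],
          [(3, 1.0), (5, 1.0), (6, 1.0)],
          [(9, 1.0)],
          [(9, 1.0), (10, 1.0)],
          [(9, 1.0), (10, 1.0), (11, 1.0)],
          [(8, 1.0), (10, 1.0), (11, 1.0)]]

num_docs = len(corpus)

# Document frequency: in how many documents each feature id occurs.
df = {}
for doc in corpus:
    for term_id, _ in doc:
        df[term_id] = df.get(term_id, 0) + 1


def tfidf_unit(doc):
    """Return {feature id: weight}, weighted by log2 IDF and scaled to unit length."""
    weights = {t: tf * math.log2(num_docs / df[t]) for t, tf in doc}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}


def cosine(a, b):
    """Dot product over shared feature ids; equals the cosine for unit vectors."""
    return sum(w * b[t] for t, w in a.items() if t in b)


query = tfidf_unit([(0, 1), (4, 1)])
sims = [cosine(query, tfidf_unit(doc)) for doc in corpus]

# Rank documents from most to least similar to the query.
ranked = sorted(enumerate(sims), key=lambda pair: pair[1], reverse=True)
print(ranked[:3])
```

Sorting the scores places document 3 first at roughly 0.821. Documents 4 to 8 share no feature ids with the query, so their dot products are zero.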