# Tutorial 1: Corpora and Vector Spaces
See this *gensim* tutorial on the web [here](https://radimrehurek.com/gensim/tut1.html).

Don’t forget to set:

In [1]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

if you want to see logging events.

## From Strings to Vectors

This time, let’s start from documents represented as strings:

In [2]:
from gensim import corpora

2016-11-14 12:11:34,855 : INFO : 'pattern' package found; tag filters are available for English


In [3]:
documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",              
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

This is a tiny corpus of nine documents, each consisting of only a single sentence.

First, let’s tokenize the documents, remove common words (using a toy stoplist) as well as words that only appear once in the corpus:

In [4]:
# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

# remove words that appear only once
from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [[token for token in text if frequency[token] > 1] for text in texts]

from pprint import pprint  # pretty-printer
pprint(texts)

[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]


Your way of processing the documents will likely vary; here, I only split on whitespace to tokenize, followed by lowercasing each word. In fact, I use this particular (simplistic and inefficient) setup to mimic the experiment done in [Deerwester et al.’s original LSA article](http://www.cs.bham.ac.uk/~pxt/IDA/lsa_ind.pdf) (Table 2).

The ways to process documents are so varied and application- and language-dependent that I decided to not constrain them by any interface. Instead, a document is represented by the features extracted from it, not by its “surface” string form: how you get to the features is up to you. Below I describe one common, general-purpose approach (called bag-of-words), but keep in mind that different application domains call for different features, and, as always, it’s [garbage in, garbage out](https://en.wikipedia.org/wiki/Garbage_in,_garbage_out)...

To convert documents to vectors, we’ll use a document representation called [bag-of-words](https://en.wikipedia.org/wiki/Bag-of-words_model). In this representation, each document is represented by one vector where each vector element represents a question-answer pair, in the style of:

"How many times does the word *system* appear in the document? Once"

It is advantageous to represent the questions only by their (integer) ids. The mapping between the questions and ids is called a dictionary:

In [5]:
dictionary = corpora.Dictionary(texts)
dictionary.save('/tmp/deerwester.dict')  # store the dictionary, for future reference
print(dictionary)

2016-11-14 12:11:35,120 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2016-11-14 12:11:35,122 : INFO : built Dictionary(12 unique tokens: [u'minors', u'graph', u'system', u'trees', u'eps']...) from 9 documents (total 29 corpus positions)
2016-11-14 12:11:35,123 : INFO : saving Dictionary object under /tmp/deerwester.dict, separately None


Dictionary(12 unique tokens: [u'minors', u'graph', u'system', u'trees', u'eps']...)


Here we assigned a unique integer id to all words appearing in the corpus with the [gensim.corpora.dictionary.Dictionary](https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary) class. This sweeps across the texts, collecting word counts and relevant statistics. In the end, we see there are twelve distinct words in the processed corpus, which means each document will be represented by twelve numbers (i.e., by a 12-D vector). To see the mapping between words and their ids:

In [6]:
print(dictionary.token2id)

{u'minors': 11, u'graph': 10, u'system': 6, u'trees': 9, u'eps': 8, u'computer': 1, u'survey': 5, u'user': 7, u'human': 2, u'time': 4, u'interface': 0, u'response': 3}


To actually convert tokenized documents to vectors:

In [7]:
new_doc = "Human computer interaction"
new_vec = dictionary.doc2bow(new_doc.lower().split())
print(new_vec)  # the word "interaction" does not appear in the dictionary and is ignored

[(1, 1), (2, 1)]


The function `doc2bow()` simply counts the number of occurrences of each distinct word, converts the word to its integer word id and returns the result as a sparse vector. The sparse vector `[(1, 1), (2, 1)]` therefore reads: in the document *“Human computer interaction”*, the words computer (id 1) and human (id 2) appear once; the other ten dictionary words appear (implicitly) zero times.
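To make the counting step concrete, here is a minimal pure-Python sketch of what a `doc2bow()`-style function does. It is not gensim's implementation: `doc2bow_sketch` is a hypothetical helper, and the `token2id` mapping is simply copied from the `dictionary.token2id` output shown earlier.

```python
from collections import Counter

# Word-to-id mapping copied from the dictionary.token2id output above.
token2id = {
    'interface': 0, 'computer': 1, 'human': 2, 'response': 3,
    'time': 4, 'survey': 5, 'system': 6, 'user': 7,
    'eps': 8, 'trees': 9, 'graph': 10, 'minors': 11,
}

def doc2bow_sketch(tokens, token2id):
    """Count the known tokens and return a sorted list of (id, count) pairs."""
    # Tokens missing from the mapping (e.g. "interaction") are silently skipped.
    counts = Counter(token2id[token] for token in tokens if token in token2id)
    return sorted(counts.items())

new_vec = doc2bow_sketch("human computer interaction".split(), token2id)
print(new_vec)

# A word that occurs twice ("system") gets a count of 2.
fourth_doc = doc2bow_sketch(
    "system and human system engineering testing of eps".split(), token2id)
print(fourth_doc)
```

Only non-zero counts are stored, which is why the representation is called sparse.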

In [8]:
corpus = [dictionary.doc2bow(text) for text in texts]
corpora.MmCorpus.serialize('/tmp/deerwester.mm', corpus)  # store to disk, for later use
for c in corpus:
    print(c)

2016-11-14 12:11:35,492 : INFO : storing corpus in Matrix Market format to /tmp/deerwester.mm
2016-11-14 12:11:35,496 : INFO : saving sparse matrix to /tmp/deerwester.mm
2016-11-14 12:11:35,497 : INFO : PROGRESS: saving document #0
2016-11-14 12:11:35,501 : INFO : saved 9x12 matrix, density=25.926% (28/108)
2016-11-14 12:11:35,505 : INFO : saving MmCorpus index to /tmp/deerwester.mm.index


[(0, 1), (1, 1), (2, 1)]
[(1, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)]
[(0, 1), (6, 1), (7, 1), (8, 1)]
[(2, 1), (6, 2), (8, 1)]
[(3, 1), (4, 1), (7, 1)]
[(9, 1)]
[(9, 1), (10, 1)]
[(9, 1), (10, 1), (11, 1)]
[(5, 1), (10, 1), (11, 1)]


By now it should be clear that the vector feature with `id=10` stands for the question “How many times does the word graph appear in the document?” and that the answer is “zero” for the first six documents and “one” for the remaining three. As a matter of fact, we have arrived at exactly the same corpus of vectors as in the [Quick Example](https://radimrehurek.com/gensim/tutorial.html#first-example).
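The implicit zeros become visible once a sparse vector is expanded into its full 12-D form. The sketch below uses a hypothetical helper, `sparse_to_dense`, written only for illustration; it is not part of gensim.

```python
def sparse_to_dense(bow, num_terms):
    """Expand a list of (id, count) pairs into a list with one slot per feature id."""
    dense = [0] * num_terms
    for term_id, count in bow:
        dense[term_id] = count
    return dense

# Seventh document of the corpus printed above:
# "The intersection graph of paths in trees"
doc = [(9, 1), (10, 1)]
dense = sparse_to_dense(doc, num_terms=12)
print(dense)      # position 10 holds the answer for the word "graph"
print(dense[10])
```

For large vocabularies almost every slot is zero, which is why the sparse form is the one passed around.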

## Corpus Streaming – One Document at a Time

Note that *corpus* above resides fully in memory, as a plain Python list. In this simple example, it doesn’t matter much, but just to make things clear, let’s assume there are millions of documents in the corpus. Storing all of them in RAM won’t do. Instead, let’s assume the documents are stored in a file on disk, one document per line. Gensim only requires that a corpus must be able to return one document vector at a time:

In [9]:
class MyCorpus(object):
    def __iter__(self):
        # the file is closed automatically once iteration finishes
        with open('datasets/mycorpus.txt') as corpus_file:
            for line in corpus_file:
                # assume there's one document per line, tokens separated by whitespace
                yield dictionary.doc2bow(line.lower().split())

The assumption that each document occupies one line in a single file is not important; you can mold the `__iter__` function to fit your input format, whatever it is. Walking directories, parsing XML, accessing network... Just parse your input to retrieve a clean list of tokens in each document, then convert the tokens via a dictionary to their ids and yield the resulting sparse vector inside `__iter__`.
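As one illustration of molding `__iter__` to a different input layout, the sketch below streams one document per `*.txt` file found while walking a directory tree. Everything in it is an assumption made for the example: the class name `DirectoryCorpus`, the plain `token2id` dict standing in for a gensim dictionary, and the temporary files created for the demo.

```python
import os
import shutil
import tempfile
from collections import Counter

class DirectoryCorpus(object):
    """Stream one bag-of-words vector per *.txt file found under `root`."""

    def __init__(self, root, token2id):
        self.root = root
        self.token2id = token2id  # with gensim, use dictionary.doc2bow() instead

    def _to_bow(self, tokens):
        counts = Counter(self.token2id[t] for t in tokens if t in self.token2id)
        return sorted(counts.items())

    def __iter__(self):
        for dirpath, dirnames, filenames in os.walk(self.root):
            dirnames.sort()  # visit subdirectories in a stable order
            for name in sorted(filenames):
                if not name.endswith('.txt'):
                    continue
                # only one file's text is held in memory at a time
                with open(os.path.join(dirpath, name)) as handle:
                    yield self._to_bow(handle.read().lower().split())

# Demo on a throwaway directory holding two documents and one unrelated file.
root = tempfile.mkdtemp()
samples = {'a.txt': "Human computer interaction",
           'b.txt': "Graph minors A survey",
           'notes.md': "not a document"}
for name, text in samples.items():
    with open(os.path.join(root, name), 'w') as handle:
        handle.write(text)

vectors = list(DirectoryCorpus(root, {'computer': 1, 'human': 2, 'survey': 5}))
print(vectors)
shutil.rmtree(root)  # clean up the demo files
```

The class can be iterated any number of times, and each pass re-reads the files from disk instead of caching them.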

In [10]:
corpus_memory_friendly = MyCorpus() # doesn't load the corpus into memory!
print(corpus_memory_friendly)

<__main__.MyCorpus object at 0x7f0b90e16a50>


The corpus is now an object. We didn’t define any way to print it, so `print` just outputs the address of the object in memory. Not very useful. To see the constituent vectors, let’s iterate over the corpus and print each document vector (one at a time):

In [11]:
for vector in corpus_memory_friendly:  # load one vector into memory at a time
    print(vector)



Although the output is the same as for the plain Python list, the corpus is now much more memory friendly, because at most one vector resides in RAM at a time. Your corpus can now be as large as you want.

Similarly, to construct the dictionary without loading all texts into memory:

In [12]:
# collect statistics about all tokens, reading one line (document) at a time
with open('datasets/mycorpus.txt') as corpus_file:
    dictionary = corpora.Dictionary(line.lower().split() for line in corpus_file)

# find the ids of stop words and of words that appear in only one document
stop_ids = [dictionary.token2id[stopword] for stopword in stoplist
            if stopword in dictionary.token2id]
once_ids = [tokenid for tokenid, docfreq in dictionary.dfs.items() if docfreq == 1]

# remove stop words and words that appear only once
dictionary.filter_tokens(stop_ids + once_ids)

# remove gaps in id sequence after words that were removed
dictionary.compactify()
print(dictionary)



And that is all there is to it! At least as far as the bag-of-words representation is concerned. Of course, what we do with such a corpus is another question; it is not at all clear how counting the frequency of distinct words could be useful. As it turns out, it isn’t, and we will need to apply a transformation on this simple representation first, before we can use it to compute any meaningful document vs. document similarities. Transformations are covered in the [next tutorial](https://radimrehurek.com/gensim/tut2.html), but before that, let’s briefly turn our attention to *corpus persistency*.

## Corpus Formats

There exist several file formats for serializing a Vector Space corpus (~sequence of vectors) to disk. *Gensim* implements them via the *streaming corpus interface* mentioned earlier: documents are read from (resp. stored to) disk in a lazy fashion, one document at a time, without the whole corpus being read into main memory at once.

One of the more notable file formats is the [Matrix Market format](http://math.nist.gov/MatrixMarket/formats.html). To save a corpus in the Matrix Market format:

In [13]:
# create a toy corpus of 2 documents, as a plain Python list
corpus = [[(1, 0.5)], []]  # make one document empty, for the heck of it

corpora.MmCorpus.serialize('/tmp/corpus.mm', corpus)

2016-11-14 12:11:36,093 : INFO : storing corpus in Matrix Market format to /tmp/corpus.mm
2016-11-14 12:11:36,097 : INFO : saving sparse matrix to /tmp/corpus.mm
2016-11-14 12:11:36,098 : INFO : PROGRESS: saving document #0
2016-11-14 12:11:36,099 : INFO : saved 2x2 matrix, density=25.000% (1/4)
2016-11-14 12:11:36,100 : INFO : saving MmCorpus index to /tmp/corpus.mm.index
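To show roughly what ends up in such a file, here is a minimal sketch that renders a small corpus in the Matrix Market coordinate layout: a header line, a `rows columns non-zeros` line, then one 1-based `row column value` triple per non-zero entry. The helper name `to_matrix_market` is hypothetical, documents are treated as rows and terms as columns (consistent with the "2x2 matrix" reported in the log above), and the exact bytes gensim writes may differ.

```python
def to_matrix_market(corpus):
    """Render a list of sparse (id, value) documents as Matrix Market text."""
    corpus = list(corpus)  # this sketch materializes the corpus for simplicity
    num_docs = len(corpus)
    num_terms = 1 + max((term_id for doc in corpus for term_id, _ in doc), default=-1)
    num_nnz = sum(len(doc) for doc in corpus)

    lines = ['%%MatrixMarket matrix coordinate real general',
             '%d %d %d' % (num_docs, num_terms, num_nnz)]
    for doc_no, doc in enumerate(corpus):
        for term_id, value in sorted(doc):
            # Matrix Market indices are 1-based: row = document, column = term
            lines.append('%d %d %s' % (doc_no + 1, term_id + 1, value))
    return '\n'.join(lines) + '\n'

mm_text = to_matrix_market([[(1, 0.5)], []])
print(mm_text)
```

The empty second document contributes no triples at all; it is only visible through the document count on the size line.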


Other formats include [Joachims’ SVMlight format](http://svmlight.joachims.org/), [Blei’s LDA-C format](http://www.cs.princeton.edu/~blei/lda-c/) and [GibbsLDA++ format](http://gibbslda.sourceforge.net/).

In [14]:
corpora.SvmLightCorpus.serialize('/tmp/corpus.svmlight', corpus)
corpora.BleiCorpus.serialize('/tmp/corpus.lda-c', corpus)
corpora.LowCorpus.serialize('/tmp/corpus.low', corpus)

2016-11-14 12:11:36,239 : INFO : converting corpus to SVMlight format: /tmp/corpus.svmlight
2016-11-14 12:11:36,241 : INFO : saving SvmLightCorpus index to /tmp/corpus.svmlight.index
2016-11-14 12:11:36,244 : INFO : no word id mapping provided; initializing from corpus
2016-11-14 12:11:36,245 : INFO : storing corpus in Blei's LDA-C format into /tmp/corpus.lda-c
2016-11-14 12:11:36,247 : INFO : saving vocabulary of 2 words to /tmp/corpus.lda-c.vocab
2016-11-14 12:11:36,248 : INFO : saving BleiCorpus index to /tmp/corpus.lda-c.index
2016-11-14 12:11:36,251 : INFO : no word id mapping provided; initializing from corpus
2016-11-14 12:11:36,253 : INFO : storing corpus in List-Of-Words format into /tmp/corpus.low
2016-11-14 12:11:36,256 : INFO : saving LowCorpus index to /tmp/corpus.low.index


Conversely, to load a corpus iterator from a Matrix Market file:

In [15]:
corpus = corpora.MmCorpus('/tmp/corpus.mm')

2016-11-14 12:11:36,382 : INFO : loaded corpus index from /tmp/corpus.mm.index
2016-11-14 12:11:36,384 : INFO : initializing corpus reader from /tmp/corpus.mm
2016-11-14 12:11:36,386 : INFO : accepted corpus with 2 documents, 2 features, 1 non-zero entries


Corpus objects are streams, so typically you won’t be able to print them directly:

In [16]:
print(corpus)

MmCorpus(2 documents, 2 features, 1 non-zero entries)


Instead, to view the contents of a corpus:

In [17]:
# one way of printing a corpus: load it entirely into memory
print(list(corpus))  # calling list() will convert any sequence to a plain Python list

[[(1, 0.5)], []]


or

In [18]:
# another way of doing it: print one document at a time, making use of the streaming interface
for doc in corpus:
    print(doc)

[(1, 0.5)]
[]


The second way is obviously more memory-friendly, but for testing and development purposes, nothing beats the simplicity of calling `list(corpus)`.

To save the same Matrix Market document stream in Blei’s LDA-C format,

In [19]:
corpora.BleiCorpus.serialize('/tmp/corpus.lda-c', corpus)

2016-11-14 12:11:36,801 : INFO : no word id mapping provided; initializing from corpus
2016-11-14 12:11:36,804 : INFO : storing corpus in Blei's LDA-C format into /tmp/corpus.lda-c
2016-11-14 12:11:36,807 : INFO : saving vocabulary of 2 words to /tmp/corpus.lda-c.vocab
2016-11-14 12:11:36,809 : INFO : saving BleiCorpus index to /tmp/corpus.lda-c.index


In this way, *gensim* can also be used as a memory-efficient **I/O format conversion tool**: just load a document stream using one format and immediately save it in another format. Adding new formats is dead easy; check out the [code for the SVMlight corpus](https://github.com/piskvorky/gensim/blob/develop/gensim/corpora/svmlightcorpus.py) for an example.
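At its core, a line-oriented format boils down to a pair of functions: one that turns a sparse document into a line of text and one that parses it back. The sketch below does this for the SVMlight layout (`<target> <feature>:<value> ...`, with feature numbers starting at 1). Both function names are hypothetical and this is not the linked gensim code; the parser only handles lines of the kind the writer produces.

```python
def bow_to_svmlight_line(bow, label=0):
    """Format one sparse document as '<label> <feature>:<value> ...'."""
    # SVMlight feature numbers start at 1, so shift the 0-based ids by one
    pairs = ' '.join('%d:%s' % (term_id + 1, value) for term_id, value in sorted(bow))
    return ('%s %s' % (label, pairs)).rstrip()

def svmlight_line_to_bow(line):
    """Parse a line produced by bow_to_svmlight_line back into (id, value) pairs."""
    line = line.split('#', 1)[0]  # drop a trailing comment, if any
    fields = line.split()
    bow = []
    for field in fields[1:]:      # fields[0] is the label
        feature, value = field.split(':')
        bow.append((int(feature) - 1, float(value)))
    return bow

line = bow_to_svmlight_line([(1, 0.5), (4, 2.0)])
print(line)
round_trip = svmlight_line_to_bow(line)
print(round_trip)
```

Because each document maps to exactly one line, both directions can run as a stream, one document at a time.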

## Compatibility with NumPy and SciPy

Gensim also contains [efficient utility functions](http://radimrehurek.com/gensim/matutils.html) to help convert to and from `numpy` matrices:

In [20]:
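import gensim
import numpy as np
numpy_matrix = np.random.randint(10, size=[5, 2])  # random example: 5 terms (rows), 2 documents (columns)
corpus = gensim.matutils.Dense2Corpus(numpy_matrix)
# num_terms should match the number of features in the corpus, here the 5 matrix rows
numpy_matrix_dense = gensim.matutils.corpus2dense(corpus, num_terms=numpy_matrix.shape[0])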

and from/to `scipy.sparse` matrices:

In [21]:
import scipy.sparse
scipy_sparse_matrix = scipy.sparse.random(5,2)
corpus = gensim.matutils.Sparse2Corpus(scipy_sparse_matrix)
scipy_csc_matrix = gensim.matutils.corpus2csc(corpus)
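The orientation of these matrices is easy to get wrong, so here is a small pure-Python sketch of the round trip, assuming the convention that rows are terms and columns are documents. The helpers `corpus_to_dense` and `dense_to_corpus` are hypothetical stand-ins that use nested lists instead of NumPy arrays.

```python
def corpus_to_dense(corpus, num_terms):
    """Return a num_terms x num_docs matrix (a list of rows); documents are columns."""
    docs = list(corpus)
    matrix = [[0.0] * len(docs) for _ in range(num_terms)]
    for doc_no, doc in enumerate(docs):
        for term_id, value in doc:
            matrix[term_id][doc_no] = value
    return matrix

def dense_to_corpus(matrix):
    """Yield one sparse document per matrix column, skipping zero entries."""
    num_docs = len(matrix[0]) if matrix else 0
    for doc_no in range(num_docs):
        yield [(term_id, row[doc_no]) for term_id, row in enumerate(matrix)
               if row[doc_no] != 0]

toy_corpus = [[(1, 0.5)], []]          # the two-document toy corpus from earlier
toy_matrix = corpus_to_dense(toy_corpus, num_terms=2)
print(toy_matrix)
restored = list(dense_to_corpus(toy_matrix))
print(restored)
```

The number of terms has to be passed in explicitly because a streamed corpus does not reveal its vocabulary size until it has been read in full.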

For a complete reference (Want to prune the dictionary to a smaller size? Optimize converting between corpora and NumPy/SciPy arrays?), see the [API documentation](https://radimrehurek.com/gensim/apiref.html). Or continue to the next tutorial on Topics and Transformations ([notebook](https://github.com/piskvorky/gensim/tree/develop/docs/notebooks/Topics_and_Transformations.ipynb) or [website](https://radimrehurek.com/gensim/tut2.html)).

# Topics and Transformations

Don't forget to set

In [22]:
import logging
import os.path

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

if you want to see logging events.

## Transformation interface

In the previous tutorial on [Corpora and Vector Spaces](https://radimrehurek.com/gensim/tut1.html), we created a corpus of documents represented as a stream of vectors. To continue, let’s fire up gensim and use that corpus:

In [23]:
from gensim import corpora, models, similarities
try:
    dictionary = corpora.Dictionary.load('/tmp/deerwester.dict')
    corpus = corpora.MmCorpus('/tmp/deerwester.mm')
    print("Used files generated from first tutorial")
except IOError:  # the files saved by the first tutorial are missing
    raise ValueError("SKIP: Run cells from the strings to vectors tutorial")

2016-11-14 12:11:37,290 : INFO : loading Dictionary object from /tmp/deerwester.dict
2016-11-14 12:11:37,292 : INFO : loaded corpus index from /tmp/deerwester.mm.index
2016-11-14 12:11:37,293 : INFO : initializing corpus reader from /tmp/deerwester.mm
2016-11-14 12:11:37,294 : INFO : accepted corpus with 9 documents, 12 features, 28 non-zero entries


Used files generated from first tutorial


In [24]:
print(dictionary[0])
print(dictionary[1])
print(dictionary[2])

interface
computer
human


In this tutorial, I will show how to transform documents from one vector representation into another. This process serves two goals:

1. To bring out hidden structure in the corpus, discover relationships between words and use them to describe the documents in a new and (hopefully) more semantic way.
1. To make the document representation more compact. This improves both efficiency (the new representation consumes fewer resources) and efficacy (marginal data trends are ignored, which reduces noise).

### Creating a transformation

The transformations are standard Python objects, typically initialized by means of a training corpus:

In [25]:
tfidf = models.TfidfModel(corpus) # step 1 -- initialize a model

2016-11-14 12:11:37,535 : INFO : collecting document frequencies
2016-11-14 12:11:37,539 : INFO : PROGRESS: processing document #0
2016-11-14 12:11:37,543 : INFO : calculating IDF weights for 9 documents and 11 features (28 matrix non-zeros)


We used our old corpus from tutorial 1 to initialize (train) the transformation model. Different transformations may require different initialization parameters; in case of TfIdf, the “training” consists simply of going through the supplied corpus once and computing document frequencies of all its features. Training other models, such as Latent Semantic Analysis or Latent Dirichlet Allocation, is much more involved and, consequently, takes much more time.
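To make the "training" step concrete, here is a minimal pure-Python sketch of TfIdf under one common weighting: term count times `log2(N / document frequency)`, followed by scaling each document to unit Euclidean length. Treat that formula as an assumption of this sketch rather than a statement about gensim's internals, although it does agree with the TfIdf values printed in this tutorial. The names `train_idf` and `apply_tfidf` are hypothetical.

```python
import math

# Bag-of-words corpus from tutorial 1 (the nine vectors printed there).
bow_corpus = [
    [(0, 1), (1, 1), (2, 1)],
    [(1, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
    [(0, 1), (6, 1), (7, 1), (8, 1)],
    [(2, 1), (6, 2), (8, 1)],
    [(3, 1), (4, 1), (7, 1)],
    [(9, 1)],
    [(9, 1), (10, 1)],
    [(9, 1), (10, 1), (11, 1)],
    [(5, 1), (10, 1), (11, 1)],
]

def train_idf(corpus):
    """One pass over the corpus: count document frequencies, then derive IDF weights."""
    num_docs = 0
    doc_freq = {}
    for doc in corpus:
        num_docs += 1
        for term_id, _ in doc:
            doc_freq[term_id] = doc_freq.get(term_id, 0) + 1
    return {term_id: math.log2(num_docs / df) for term_id, df in doc_freq.items()}

def apply_tfidf(bow, idf):
    """Weight each count by its IDF, then scale the vector to unit length."""
    weights = [(term_id, count * idf[term_id]) for term_id, count in bow if term_id in idf]
    norm = math.sqrt(sum(w * w for _, w in weights))
    return [(term_id, w / norm) for term_id, w in weights] if norm else []

idf = train_idf(bow_corpus)
two_word_doc = apply_tfidf([(0, 1), (1, 1)], idf)
fourth_doc = apply_tfidf(bow_corpus[3], idf)
print(two_word_doc)
print(fourth_doc)
```

In the fourth document, feature 6 ("system") occurs twice but is also the most widespread of its three words, so its final weight reflects both effects.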

> <B>Note</B>:
> Transformations always convert between two specific vector spaces. The same vector space (= the same set of feature ids) must be used for training as well as for subsequent vector transformations. Failure to use the same input feature space, such as applying a different string preprocessing, using different feature ids, or using bag-of-words input vectors where TfIdf vectors are expected, will result in feature mismatch during transformation calls and consequently in garbage output and/or runtime exceptions.

In [26]:
doc_bow = [(0, 1), (1, 1)]
print(tfidf[doc_bow]) # step 2 -- use the model to transform vectors

[(0, 0.7071067811865476), (1, 0.7071067811865476)]


Or to apply a transformation to a whole corpus:

In [27]:
corpus_tfidf = tfidf[corpus]
for doc in corpus_tfidf:
    print(doc)

[(0, 0.5773502691896257), (1, 0.5773502691896257), (2, 0.5773502691896257)]
[(1, 0.44424552527467476), (3, 0.44424552527467476), (4, 0.44424552527467476), (5, 0.44424552527467476), (6, 0.3244870206138555), (7, 0.3244870206138555)]
[(0, 0.5710059809418182), (6, 0.4170757362022777), (7, 0.4170757362022777), (8, 0.5710059809418182)]
[(2, 0.49182558987264147), (6, 0.7184811607083769), (8, 0.49182558987264147)]
[(3, 0.6282580468670046), (4, 0.6282580468670046), (7, 0.45889394536615247)]
[(9, 1.0)]
[(9, 0.7071067811865475), (10, 0.7071067811865475)]
[(9, 0.5080429008916749), (10, 0.5080429008916749), (11, 0.695546419520037)]
[(5, 0.6282580468670046), (10, 0.45889394536615247), (11, 0.6282580468670046)]


In this particular case, we are transforming the same corpus that we used for training, but this is only incidental. Once the transformation model has been initialized, it can be used on any vectors (provided they come from the same vector space, of course), even if they were not used in the training corpus at all. This is achieved by a process called folding-in for LSA, by topic inference for LDA etc.

> <b>Note:</b> 
> Calling `model[corpus]` only creates a wrapper around the old corpus document stream – actual conversions are done on-the-fly, during document iteration. We cannot convert the entire corpus at the time of calling `corpus_transformed = model[corpus]`, because that would mean storing the result in main memory, and that contradicts gensim’s objective of memory-independence. If you will be iterating over the transformed `corpus_transformed` multiple times, and the transformation is costly, serialize the resulting corpus to disk first and continue using that.
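The wrapper idea can be illustrated without gensim. In the sketch below, `LazyTransformedCorpus` and `double_counts` are hypothetical names; the point is only that wrapping does no work, and that every pass over the wrapper repeats the transformation.

```python
class LazyTransformedCorpus(object):
    """Apply `transform` to each document only when the corpus is iterated."""

    def __init__(self, transform, corpus):
        self.transform = transform
        self.corpus = corpus

    def __iter__(self):
        for doc in self.corpus:
            yield self.transform(doc)

calls = []

def double_counts(doc):
    calls.append(doc)  # record every call so we can see when work happens
    return [(term_id, 2 * count) for term_id, count in doc]

wrapped = LazyTransformedCorpus(double_counts, [[(0, 1)], [(1, 3)]])
calls_after_wrapping = len(calls)      # wrapping alone transforms nothing

first_pass = list(wrapped)
calls_after_first_pass = len(calls)    # one call per document

second_pass = list(wrapped)
calls_after_second_pass = len(calls)   # the work is repeated on every pass

print(calls_after_wrapping, calls_after_first_pass, calls_after_second_pass)
print(first_pass)
```

The repeated work on the second pass is the cost that serializing the transformed corpus to disk avoids.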

Transformations can also be serialized, one on top of another, in a sort of chain:

In [28]:
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2) # initialize an LSI transformation
corpus_lsi = lsi[corpus_tfidf] # create a double wrapper over the original corpus: bow->tfidf->fold-in-lsi

2016-11-14 12:11:37,824 : INFO : using serial LSI version on this node
2016-11-14 12:11:37,828 : INFO : updating model with new documents
2016-11-14 12:11:37,835 : INFO : preparing a new chunk of documents
2016-11-14 12:11:37,837 : INFO : using 100 extra samples and 2 power iterations
2016-11-14 12:11:37,839 : INFO : 1st phase: constructing (12, 102) action matrix
2016-11-14 12:11:37,841 : INFO : orthonormalizing (12, 102) action matrix
2016-11-14 12:11:37,846 : INFO : 2nd phase: running dense svd on (12, 9) matrix
2016-11-14 12:11:37,848 : INFO : computing the final decomposition
2016-11-14 12:11:37,850 : INFO : keeping 2 factors (discarding 47.565% of energy spectrum)
2016-11-14 12:11:37,851 : INFO : processed documents up to #9
2016-11-14 12:11:37,853 : INFO : topic #0(1.594): 0.703*"trees" + 0.538*"graph" + 0.402*"minors" + 0.187*"survey" + 0.061*"system" + 0.060*"time" + 0.060*"response" + 0.058*"user" + 0.049*"computer" + 0.035*"interface"
2016-11-14 12:11:37,855 : INFO : topic #

Here we transformed our Tf-Idf corpus via [Latent Semantic Indexing](http://en.wikipedia.org/wiki/Latent_semantic_indexing) into a latent 2-D space (2-D because we set num_topics=2). Now you’re probably wondering: what do these two latent dimensions stand for? Let’s inspect with models.LsiModel.print_topics():

In [29]:
lsi.print_topics(2)

2016-11-14 12:11:37,947 : INFO : topic #0(1.594): 0.703*"trees" + 0.538*"graph" + 0.402*"minors" + 0.187*"survey" + 0.061*"system" + 0.060*"time" + 0.060*"response" + 0.058*"user" + 0.049*"computer" + 0.035*"interface"
2016-11-14 12:11:37,948 : INFO : topic #1(1.476): 0.460*"system" + 0.373*"user" + 0.332*"eps" + 0.328*"interface" + 0.320*"response" + 0.320*"time" + 0.293*"computer" + 0.280*"human" + 0.171*"survey" + -0.161*"trees"


[(0,
  u'0.703*"trees" + 0.538*"graph" + 0.402*"minors" + 0.187*"survey" + 0.061*"system" + 0.060*"time" + 0.060*"response" + 0.058*"user" + 0.049*"computer" + 0.035*"interface"'),
 (1,
  u'0.460*"system" + 0.373*"user" + 0.332*"eps" + 0.328*"interface" + 0.320*"response" + 0.320*"time" + 0.293*"computer" + 0.280*"human" + 0.171*"survey" + -0.161*"trees"')]

(the topics are printed to log – see the note at the top of this page about activating logging)

It appears that according to LSI, “trees”, “graph” and “minors” are all related words (and contribute the most to the direction of the first topic), while the second topic practically concerns itself with all the other words. As expected, the first five documents are more strongly related to the second topic while the remaining four documents to the first topic:

In [30]:
for doc in corpus_lsi: # both bow->tfidf and tfidf->lsi transformations are actually executed here, on the fly
    print(doc)

[(0, 0.066007833960905815), (1, 0.52007033063618513)]
[(0, 0.19667592859142663), (1, 0.76095631677000419)]
[(0, 0.089926399724466255), (1, 0.72418606267525099)]
[(0, 0.075858476521783291), (1, 0.63205515860034278)]
[(0, 0.10150299184980208), (1, 0.57373084830029497)]
[(0, 0.70321089393783121), (1, -0.16115180214025937)]
[(0, 0.87747876731198315), (1, -0.16758906864659578)]
[(0, 0.90986246868185761), (1, -0.14086553628719187)]
[(0, 0.61658253505692806), (1, 0.053929075663892642)]
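One way to read these numbers: each coordinate is the dot product of the document's Tf-Idf vector with a topic vector. The sketch below checks this on the sixth document, whose only word is "trees" (Tf-Idf weight 1.0). It assumes a plain, unscaled projection and uses the topic weights exactly as printed by `lsi.print_topics(2)` above, which are rounded to three decimals and limited to ten words per topic, so the result is approximate. The helper `fold_in` is hypothetical.

```python
# Topic weights copied from the lsi.print_topics(2) output above
# (rounded, and only the ten strongest words per topic are available).
topics = [
    {'trees': 0.703, 'graph': 0.538, 'minors': 0.402, 'survey': 0.187,
     'system': 0.061, 'time': 0.060, 'response': 0.060, 'user': 0.058,
     'computer': 0.049, 'interface': 0.035},
    {'system': 0.460, 'user': 0.373, 'eps': 0.332, 'interface': 0.328,
     'response': 0.320, 'time': 0.320, 'computer': 0.293, 'human': 0.280,
     'survey': 0.171, 'trees': -0.161},
]

def fold_in(tfidf_doc, topics):
    """Project a {word: weight} document onto each topic via a dot product."""
    return [(topic_no, sum(weight * topic.get(word, 0.0)
                           for word, weight in tfidf_doc.items()))
            for topic_no, topic in enumerate(topics)]

# Sixth document: "trees" is its only remaining word, with Tf-Idf weight 1.0
sixth_doc_lsi = fold_in({'trees': 1.0}, topics)
print(sixth_doc_lsi)
```

Up to rounding, the two values agree with the sixth line of the output above, `[(0, 0.703...), (1, -0.161...)]`.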


Model persistency is achieved with the `save()` and `load()` functions:

In [31]:
lsi.save('/tmp/model.lsi') # same for tfidf, lda, ...
lsi = models.LsiModel.load('/tmp/model.lsi')

2016-11-14 12:11:38,134 : INFO : saving Projection object under /tmp/model.lsi.projection, separately None
2016-11-14 12:11:38,136 : INFO : saving LsiModel object under /tmp/model.lsi, separately None
2016-11-14 12:11:38,137 : INFO : not storing attribute projection
2016-11-14 12:11:38,138 : INFO : not storing attribute dispatcher
2016-11-14 12:11:38,139 : INFO : loading LsiModel object from /tmp/model.lsi
2016-11-14 12:11:38,140 : INFO : loading id2word recursively from /tmp/model.lsi.id2word.* with mmap=None
2016-11-14 12:11:38,141 : INFO : setting ignored attribute projection to None
2016-11-14 12:11:38,143 : INFO : setting ignored attribute dispatcher to None
2016-11-14 12:11:38,144 : INFO : loading LsiModel object from /tmp/model.lsi.projection


The next question might be: just how exactly similar are those documents to each other? Is there a way to formalize the similarity, so that for a given input document, we can order some other set of documents according to their similarity? Similarity queries are covered in the [next tutorial](https://radimrehurek.com/gensim/tut3.html).

## Available transformations

Gensim implements several popular Vector Space Model algorithms:

* [Term Frequency * Inverse Document Frequency](http://en.wikipedia.org/wiki/Tf–idf), Tf-Idf expects a bag-of-words (integer values) training corpus during initialization. During transformation, it will take a vector and return another vector of the same dimensionality, except that features which were rare in the training corpus will have their value increased. It therefore converts integer-valued vectors into real-valued ones, while leaving the number of dimensions intact. It can also optionally normalize the resulting vectors to (Euclidean) unit length.

In [32]:
model = models.TfidfModel(corpus, normalize=True)

2016-11-14 12:11:38,290 : INFO : collecting document frequencies
2016-11-14 12:11:38,293 : INFO : PROGRESS: processing document #0
2016-11-14 12:11:38,296 : INFO : calculating IDF weights for 9 documents and 11 features (28 matrix non-zeros)


* [Latent Semantic Indexing, LSI (or sometimes LSA)](http://en.wikipedia.org/wiki/Latent_semantic_indexing) transforms documents from either bag-of-words or (preferably) TfIdf-weighted space into a latent space of a lower dimensionality. For the toy corpus above we used only 2 latent dimensions, but on real corpora, target dimensionality of 200–500 is recommended as a “golden standard” [1].

In [33]:
model = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=300)

2016-11-14 12:11:38,394 : INFO : using serial LSI version on this node
2016-11-14 12:11:38,396 : INFO : updating model with new documents
2016-11-14 12:11:38,398 : INFO : preparing a new chunk of documents
2016-11-14 12:11:38,400 : INFO : using 100 extra samples and 2 power iterations
2016-11-14 12:11:38,402 : INFO : 1st phase: constructing (12, 400) action matrix
2016-11-14 12:11:38,404 : INFO : orthonormalizing (12, 400) action matrix
2016-11-14 12:11:38,410 : INFO : 2nd phase: running dense svd on (12, 9) matrix
2016-11-14 12:11:38,412 : INFO : computing the final decomposition
2016-11-14 12:11:38,413 : INFO : keeping 9 factors (discarding 0.000% of energy spectrum)
2016-11-14 12:11:38,415 : INFO : processed documents up to #9
2016-11-14 12:11:38,416 : INFO : topic #0(1.594): 0.703*"trees" + 0.538*"graph" + 0.402*"minors" + 0.187*"survey" + 0.061*"system" + 0.060*"time" + 0.060*"response" + 0.058*"user" + 0.049*"computer" + 0.035*"interface"
2016-11-14 12:11:38,417 : INFO : topic #1

    LSI training is unique in that we can continue “training” at any point, simply by providing more training documents. This is done by incremental updates to the underlying model, in a process called online training. Because of this feature, the input document stream may even be infinite – just keep feeding LSI new documents as they arrive, while using the computed transformation model as read-only in the meanwhile!

> <b>Example</b> 
> 
> model.add_documents(another_tfidf_corpus) # now LSI has been trained on tfidf_corpus + another_tfidf_corpus
> lsi_vec = model[tfidf_vec] # convert some new document into the LSI space, without affecting the model

> model.add_documents(more_documents) # tfidf_corpus + another_tfidf_corpus + more_documents
> lsi_vec = model[tfidf_vec]


    See the [gensim.models.lsimodel](https://radimrehurek.com/gensim/models/lsimodel.html#module-gensim.models.lsimodel) documentation for details on how to make LSI gradually “forget” old observations in infinite streams. If you want to get dirty, there are also parameters you can tweak that affect speed vs. memory footprint vs. numerical precision of the LSI algorithm.

    gensim uses a novel online incremental streamed distributed training algorithm (quite a mouthful!), which I published in [5]. gensim also executes a stochastic multi-pass algorithm from Halko et al. [4] internally, to accelerate the in-core part of the computations. See also [Experiments on the English Wikipedia](https://radimrehurek.com/gensim/wiki.html) for further speed-ups by distributing the computation across a cluster of computers.

* [Random Projections](http://www.cis.hut.fi/ella/publications/randproj_kdd.pdf), RP aim to reduce vector space dimensionality. This is a very efficient (both memory- and CPU-friendly) approach to approximating TfIdf distances between documents, by throwing in a little randomness. Recommended target dimensionality is again in the hundreds/thousands, depending on your dataset.

In [34]:
model = models.RpModel(corpus_tfidf, num_topics=500)

2016-11-14 12:11:38,479 : INFO : no word id mapping provided; initializing from corpus, assuming identity
2016-11-14 12:11:38,481 : INFO : constructing (500, 12) random matrix
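The "little randomness" can be sketched in a few lines. The example below uses one common construction, a matrix of random +1/-1 entries scaled by `1/sqrt(k)`; whether `RpModel` builds exactly this matrix is not stated in this tutorial, so treat it as an assumption. The helper names are hypothetical. Mirroring the toy cell above, it maps 12 features to 500 dimensions, which only makes sense as a demonstration; on real data the target is much smaller than the vocabulary.

```python
import math
import random

def make_projection(num_terms, num_topics, seed=0):
    """Build a num_topics x num_terms matrix of random +1/-1 entries."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [[rng.choice((-1.0, 1.0)) for _ in range(num_terms)]
            for _ in range(num_topics)]

def project(sparse_vec, projection):
    """Map a sparse (id, weight) vector to a dense vector of len(projection) values."""
    scale = 1.0 / math.sqrt(len(projection))
    return [scale * sum(row[term_id] * weight for term_id, weight in sparse_vec)
            for row in projection]

projection = make_projection(num_terms=12, num_topics=500)
doc = [(9, 0.7071067811865475), (10, 0.7071067811865475)]  # a unit-length Tf-Idf vector
reduced = project(doc, projection)
squared_length = sum(x * x for x in reduced)
print(len(reduced), round(squared_length, 3))  # squared length stays near 1.0
```

Because lengths (and hence distances) are approximately preserved, similarities computed in the projected space approximate those in the original TfIdf space.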


* [Latent Dirichlet Allocation, LDA](http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) is yet another transformation from bag-of-words counts into a topic space of lower dimensionality. LDA is a probabilistic extension of LSA (also called multinomial PCA), so LDA’s topics can be interpreted as probability distributions over words. These distributions are, just like with LSA, inferred automatically from a training corpus. Documents are in turn interpreted as a (soft) mixture of these topics (again, just like with LSA).

In [35]:
model = models.LdaModel(corpus, id2word=dictionary, num_topics=100)

2016-11-14 12:11:38,568 : INFO : using symmetric alpha at 0.01
2016-11-14 12:11:38,569 : INFO : using symmetric eta at 0.01
2016-11-14 12:11:38,570 : INFO : using serial LDA version on this node
2016-11-14 12:11:38,595 : INFO : running online LDA training, 100 topics, 1 passes over the supplied corpus of 9 documents, updating model once every 9 documents, evaluating perplexity every 9 documents, iterating 50x with a convergence threshold of 0.001000
2016-11-14 12:11:38,654 : INFO : -116.013 per-word bound, 83803297152943784775780404037681152.0 perplexity estimate based on a held-out corpus of 9 documents with 29 words
2016-11-14 12:11:38,655 : INFO : PROGRESS: pass 0, at document #9/9
2016-11-14 12:11:38,682 : INFO : topic #45 (0.010): 0.083*user + 0.083*survey + 0.083*graph + 0.083*trees + 0.083*eps + 0.083*interface + 0.083*system + 0.083*human + 0.083*response + 0.083*computer
2016-11-14 12:11:38,683 : INFO : topic #40 (0.010): 0.083*user + 0.083*survey + 0.083*graph + 0.083*trees +

    gensim uses a fast implementation of online LDA parameter estimation based on [2], modified to run in distributed mode on a cluster of computers.

* [Hierarchical Dirichlet Process, HDP](http://jmlr.csail.mit.edu/proceedings/papers/v15/wang11a/wang11a.pdf) is a non-parametric Bayesian method (note the missing number of requested topics):

In [36]:
model = models.HdpModel(corpus, id2word=dictionary)

2016-11-14 12:11:38,833 : INFO : topic 0: 0.181*user + 0.144*minors + 0.127*graph + 0.120*trees + 0.108*survey + 0.097*human + 0.085*computer + 0.069*response + 0.060*time + 0.004*system + 0.003*eps + 0.002*interface
2016-11-14 12:11:38,834 : INFO : topic 1: 0.237*time + 0.136*human + 0.135*minors + 0.130*eps + 0.105*response + 0.102*computer + 0.053*trees + 0.036*survey + 0.030*graph + 0.026*system + 0.007*user + 0.003*interface
2016-11-14 12:11:38,835 : INFO : topic 2: 0.249*response + 0.150*eps + 0.120*system + 0.100*graph + 0.075*time + 0.063*survey + 0.060*computer + 0.059*minors + 0.038*user + 0.031*interface + 0.029*trees + 0.024*human
2016-11-14 12:11:38,835 : INFO : topic 3: 0.187*human + 0.176*graph + 0.170*time + 0.144*user + 0.107*response + 0.089*system + 0.048*minors + 0.026*eps + 0.019*computer + 0.018*interface + 0.012*trees + 0.003*survey
2016-11-14 12:11:38,836 : INFO : topic 4: 0.160*trees + 0.159*graph + 0.138*computer + 0.132*survey + 0.121*human + 0.088*system + 0

    gensim uses a fast, online implementation based on [3]. The HDP model is a new addition to gensim, and still rough around its academic edges – use with care.

Adding new VSM transformations (such as different weighting schemes) is rather trivial; see the API reference or directly the Python code for more info and examples.

It is worth repeating that these are all unique, incremental implementations, which do not require the whole training corpus to be present in main memory all at once. With memory taken care of, I am now improving Distributed Computing, to improve CPU efficiency, too. If you feel you could contribute (by testing, providing use-cases or code), please let me know.

Continue on to the next tutorial on Similarity Queries.


# Similarity Queries

Don't forget to set:

In [37]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

## Similarity Interface

In the previous tutorials on [Corpora and Vector Spaces](https://radimrehurek.com/gensim/tut1.html) and [Topics and Transformations](https://radimrehurek.com/gensim/tut2.html), we covered what it means to create a corpus in the Vector Space Model and how to transform it between different vector spaces. A common reason for such a charade is that we want to determine **similarity between pairs of documents**, or the **similarity between a specific document** and a set of other documents (such as a user query vs. indexed documents).

To show how this can be done in gensim, let us consider the same corpus as in the previous examples (it originally comes from the seminal 1990 article [“Indexing by Latent Semantic Analysis”](http://www.cs.bham.ac.uk/~pxt/IDA/lsa_ind.pdf) by Deerwester et al.):

In [38]:
from gensim import corpora, models, similarities

try:
    dictionary = corpora.Dictionary.load('/tmp/deerwester.dict')
    corpus = corpora.MmCorpus('/tmp/deerwester.mm') # comes from the first tutorial, "From strings to vectors"
except Exception:  # avoid a bare except, which would also swallow KeyboardInterrupt
    raise ValueError("SKIP: Run cells from the strings to vectors tutorial")
print(corpus)

2016-11-14 12:11:38,991 : INFO : loading Dictionary object from /tmp/deerwester.dict
2016-11-14 12:11:38,993 : INFO : loaded corpus index from /tmp/deerwester.mm.index
2016-11-14 12:11:38,995 : INFO : initializing corpus reader from /tmp/deerwester.mm
2016-11-14 12:11:38,996 : INFO : accepted corpus with 9 documents, 12 features, 28 non-zero entries


MmCorpus(9 documents, 12 features, 28 non-zero entries)


To follow Deerwester’s example, we first use this tiny corpus to define a 2-dimensional LSI space:

In [39]:
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)

2016-11-14 12:11:39,132 : INFO : using serial LSI version on this node
2016-11-14 12:11:39,134 : INFO : updating model with new documents
2016-11-14 12:11:39,135 : INFO : preparing a new chunk of documents
2016-11-14 12:11:39,137 : INFO : using 100 extra samples and 2 power iterations
2016-11-14 12:11:39,139 : INFO : 1st phase: constructing (12, 102) action matrix
2016-11-14 12:11:39,143 : INFO : orthonormalizing (12, 102) action matrix
2016-11-14 12:11:39,146 : INFO : 2nd phase: running dense svd on (12, 9) matrix
2016-11-14 12:11:39,148 : INFO : computing the final decomposition
2016-11-14 12:11:39,149 : INFO : keeping 2 factors (discarding 43.156% of energy spectrum)
2016-11-14 12:11:39,152 : INFO : processed documents up to #9
2016-11-14 12:11:39,155 : INFO : topic #0(3.341): 0.644*"system" + 0.404*"user" + 0.301*"eps" + 0.265*"response" + 0.265*"time" + 0.240*"computer" + 0.221*"human" + 0.206*"survey" + 0.198*"interface" + 0.036*"graph"
2016-11-14 12:11:39,157 : INFO : topic #1(2

Now suppose a user typed in the query *“Human computer interaction”*. We would like to sort our nine corpus documents in decreasing order of relevance to this query. Unlike modern search engines, here we only concentrate on a single aspect of possible similarities—on apparent semantic relatedness of their texts (words). No hyperlinks, no random-walk static ranks, just a semantic extension over the boolean keyword match:

In [40]:
doc = "Human computer interaction"
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lsi = lsi[vec_bow] # convert the query to LSI space
print(vec_lsi)

[(0, 0.46182100453271557), (1, 0.070027665279000353)]
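
The bag-of-words step above silently drops query words that the dictionary has never seen, which is why “interaction” contributes nothing to the query vector. The sketch below imitates that behaviour with the standard library only. The `token2id` mapping and the `to_bow` helper are hypothetical stand-ins; the real ids come from the saved `Dictionary`.

```python
from collections import Counter

# Hypothetical token -> id mapping; the real ids live in the saved Dictionary.
token2id = {"computer": 0, "human": 1, "interface": 2}


def to_bow(tokens, token2id):
    """Count known tokens and return sorted (token_id, count) pairs."""
    counts = Counter(token2id[token] for token in tokens if token in token2id)
    return sorted(counts.items())


query_bow = to_bow("Human computer interaction".lower().split(), token2id)
print(query_bow)  # "interaction" is not in the mapping, so it is dropped
```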


In addition, we will be considering [cosine](http://en.wikipedia.org/wiki/Cosine_similarity) similarity to determine the similarity of two vectors. Cosine similarity is a standard measure in Vector Space Modeling, but wherever the vectors represent probability distributions, [different similarity measures](http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence#Symmetrised_divergence) may be more appropriate.
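
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The following standard-library sketch shows the computation on dense vectors. The function name is illustrative, and returning 0.0 for a zero-length vector is an assumption of this sketch, not something the tutorial specifies.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: treat an all-zero vector as similar to nothing
    return dot / (norm_u * norm_v)


query = [0.46182100453271557, 0.070027665279000353]  # the LSI query vector above
same_direction = cosine_similarity(query, [2 * x for x in query])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
print(same_direction, orthogonal, opposite)
```

Because the measure depends only on direction, scaling a vector does not change its similarity to any other vector.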

### Initializing query structures

To prepare for similarity queries, we first need to index all documents against which later queries will be compared. In our case, these are the same nine documents used for training LSI, converted to the 2-D LSI space. That is only incidental, though: we might just as well be indexing a different corpus altogether.

In [41]:
index = similarities.MatrixSimilarity(lsi[corpus]) # transform corpus to LSI space and index it

2016-11-14 12:11:39,346 : INFO : creating matrix with 9 documents and 2 features


> <B>Warning</B>:
> The class `similarities.MatrixSimilarity` is only appropriate when the whole set of vectors fits into memory. For example, a corpus of one million documents would require 2GB of RAM in a 256-dimensional LSI space, when used with this class.
> Without 2GB of free RAM, you would need to use the `similarities.Similarity` class. This class operates in fixed memory, by splitting the index across multiple files on disk, called shards. It uses `similarities.MatrixSimilarity` and `similarities.SparseMatrixSimilarity` internally, so it is still fast, although slightly more complex.

Index persistence is handled via the standard `save()` and `load()` functions:

In [42]:
index.save('/tmp/deerwester.index')
index = similarities.MatrixSimilarity.load('/tmp/deerwester.index')

2016-11-14 12:11:39,434 : INFO : saving MatrixSimilarity object under /tmp/deerwester.index, separately None
2016-11-14 12:11:39,436 : INFO : loading MatrixSimilarity object from /tmp/deerwester.index


This is true for all similarity indexing classes (`similarities.Similarity`, `similarities.MatrixSimilarity` and `similarities.SparseMatrixSimilarity`). In what follows, `index` can be an object of any of these classes. When in doubt, use `similarities.Similarity`, as it is the most scalable version, and it also supports adding more documents to the index later.

### Performing queries

To obtain similarities of our query document against the nine indexed documents:

In [43]:
sims = index[vec_lsi] # perform a similarity query against the corpus
print(list(enumerate(sims))) # print (document_number, document_similarity) 2-tuples

[(0, 0.99809301), (1, 0.93748635), (2, 0.99844527), (3, 0.9865886), (4, 0.90755945), (5, -0.12416792), (6, -0.10639259), (7, -0.098794639), (8, 0.050041765)]


The cosine measure returns similarities in the range [-1, 1] (the greater, the more similar), so the first document has a score of 0.99809301, the second a score of 0.93748635, and so on.

With some standard Python magic we sort these similarities into descending order, and obtain the final answer to the query *“Human computer interaction”*:

```
sims = sorted(enumerate(sims), key=lambda item: -item[1])
print(sims) # print sorted (document number, similarity score) 2-tuples

[(2, 0.99844527), # The EPS user interface management system
(0, 0.99809301), # Human machine interface for lab abc computer applications
(3, 0.9865886), # System and human system engineering testing of EPS
(1, 0.93748635), # A survey of user opinion of computer system response time
(4, 0.90755945), # Relation of user perceived response time to error measurement
(8, 0.050041765), # Graph minors A survey
(7, -0.098794639), # Graph minors IV Widths of trees and well quasi ordering
(6, -0.10639259), # The intersection graph of paths in trees
(5, -0.12416792)] # The generation of random binary unordered trees
```

(I added the original documents in their “string form” to the output comments, to improve clarity.)
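
Rather than annotating the output by hand, the ranking can be joined with the original strings programmatically. This is a minimal sketch that assumes the `documents` list from the first tutorial and hard-codes the similarity scores printed by the query above, so that it runs on its own.

```python
documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

# Scores copied from the query output above (normally: sims = index[vec_lsi]).
sims = [0.99809301, 0.93748635, 0.99844527, 0.9865886, 0.90755945,
        -0.12416792, -0.10639259, -0.098794639, 0.050041765]

# Sort (document_number, score) pairs by score, highest first.
ranked = sorted(enumerate(sims), key=lambda item: item[1], reverse=True)
for doc_id, score in ranked:
    print(f"{score:+.3f}  [{doc_id}] {documents[doc_id]}")
```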

The thing to note here is that documents no. 2 ("`The EPS user interface management system`") and 4 ("`Relation of user perceived response time to error measurement`") would never be returned by a standard boolean fulltext search, because they do not share any common words with "`Human computer interaction`". However, after applying LSI, we can observe that both of them received quite high similarity scores (no. 2 is actually the most similar!), which corresponds better to our intuition of them sharing a “computer-human” related topic with the query. In fact, this semantic generalization is the reason why we apply transformations and do topic modelling in the first place.
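
The claim about boolean search is easy to check with plain keyword overlap. The sketch below uses whitespace tokenization and the toy stoplist from the first tutorial. It stands in for a real fulltext engine and is only meant to show which documents share at least one word with the query.

```python
documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

stoplist = set("for a of the and to in".split())
query_terms = set("Human computer interaction".lower().split()) - stoplist

# A document "matches" if it shares at least one non-stopword with the query.
overlaps = [query_terms & set(doc.lower().split()) for doc in documents]
matches = [doc_id for doc_id, shared in enumerate(overlaps) if shared]
print(matches)  # document numbers a keyword search would return
```

Documents 2 and 4 are absent from `matches`, even though LSI ranks both of them among the five most similar documents.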

## Where next?

Congratulations, you have finished the tutorials – now you know how gensim works :-) To delve into more details, you can browse through the [API documentation](https://radimrehurek.com/gensim/apiref.html), see the [Wikipedia experiments](https://radimrehurek.com/gensim/wiki.html) or perhaps check out [distributed computing](https://radimrehurek.com/gensim/distributed.html) in gensim.

Gensim is a fairly mature package that has been used successfully by many individuals and companies, both for rapid prototyping and in production. That doesn’t mean it’s perfect though:

* there are parts that could be implemented more efficiently (in C, for example), or make better use of parallelism (multiple machines or cores)
* new algorithms are published all the time; help gensim keep up by [discussing them](http://groups.google.com/group/gensim) and [contributing code](https://github.com/piskvorky/gensim/wiki/Developer-page)
* your **feedback is most welcome** and appreciated (and it’s not just the code!): [idea contributions](https://github.com/piskvorky/gensim/wiki/Ideas-&-Features-proposals), [bug reports](https://github.com/piskvorky/gensim/issues), or [user stories and general questions](http://groups.google.com/group/gensim/topics).

Gensim has no ambition to become an all-encompassing framework, across all NLP (or even Machine Learning) subfields. Its mission is to help NLP practitioners try out popular topic modelling algorithms on large datasets easily, and to facilitate prototyping of new algorithms for researchers.