In [1]:
import logging
logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s', level=logging.INFO)

### From Strings to Vectors
Start from documents represented as strings:

In [2]:
from gensim import corpora, models, similarities
documents = ["Human machine interface for lab abc computer applications",
              "A survey of user opinion of computer system response time",
              "The EPS user interface management system",
              "System and human system engineering testing of EPS",
              "Relation of user perceived response time to error measurement",
              "The generation of random binary unordered trees",
              "The intersection graph of paths in trees",
              "Graph minors IV Widths of trees and well quasi ordering",
              "Graph minors A survey"]

This is a tiny corpus of nine documents, each consisting of a single sentence.

First, tokenize the documents, then remove common words as well as words that appear only once in the corpus.

In [3]:
# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist] for document in documents]
# remove words that appear only once across the whole corpus
from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1
texts = [[token for token in text if frequency[token] > 1] for text in texts]
from pprint import pprint # pretty-printer
pprint(texts)

[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]


Use bag-of-words to represent the documents.

In [4]:
dictionary = corpora.Dictionary(texts)
import os
# create the working directory on first run, then switch into it
if not os.path.isdir('tmp'):
    os.mkdir('tmp')
os.chdir("tmp")
dictionary.save('deerwester.dict') # store the dictionary for future reference
print(dictionary)

Dictionary(12 unique tokens: [u'minors', u'graph', u'system', u'trees', u'eps']...)


Check the mapping from each of the 12 unique tokens to its integer id ({word: id}):

In [5]:
print(dictionary.token2id)

{u'minors': 11, u'graph': 10, u'system': 6, u'trees': 9, u'eps': 8, u'computer': 1, u'survey': 5, u'user': 7, u'human': 2, u'time': 4, u'interface': 0, u'response': 3}


To actually convert tokenized documents to vectors:

In [6]:
new_doc = "Human computer interaction"
new_vec = dictionary.doc2bow(new_doc.lower().split())
print(new_vec) # the word "interaction" does not appear in the dictionary and is ignored

[(1, 1), (2, 1)]


The function _doc2bow()_ simply counts the number of occurrences of each distinct word, converts the word to its integer word id and returns the result as a sparse vector. The sparse vector [(1, 1), (2, 1)] therefore reads: in the document "Human computer interaction", the words computer (id 1) and human (id 2) each appear once; the other ten dictionary words appear (implicitly) zero times.
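
To make the counting step concrete, here is a minimal pure-Python sketch of the same idea. It is not gensim's implementation: `doc2bow_sketch` is a hypothetical helper, and the small `token2id` dict is a hand-copied subset of the mapping printed above.

```python
from collections import Counter

# Hypothetical stand-in for dictionary.token2id (ids copied from the output above).
token2id = {'interface': 0, 'computer': 1, 'human': 2}

def doc2bow_sketch(tokens, token2id):
    """Count known tokens and return sorted (token_id, count) pairs."""
    # Tokens missing from the mapping (e.g. "interaction") are silently skipped.
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

print(doc2bow_sketch("human computer interaction".split(), token2id))
# → [(1, 1), (2, 1)]
```

Only ids with a non-zero count are stored, which is what makes the representation sparse.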

In [7]:
corpus = [dictionary.doc2bow(text) for text in texts]
corpora.MmCorpus.serialize('deerwester.mm', corpus) # store to disk for later use
pprint(corpus)

[[(0, 1), (1, 1), (2, 1)],
 [(1, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
 [(0, 1), (6, 1), (7, 1), (8, 1)],
 [(2, 1), (6, 2), (8, 1)],
 [(3, 1), (4, 1), (7, 1)],
 [(9, 1)],
 [(9, 1), (10, 1)],
 [(9, 1), (10, 1), (11, 1)],
 [(5, 1), (10, 1), (11, 1)]]


### Corpus Streaming - One Document at a Time
Assume the documents are stored in a file on disk, one document per line. Gensim only requires that a corpus be able to return one document vector at a time:

In [8]:
os.chdir("../")
class MyCorpus(object):
    def __iter__(self):
        # reopen the file on every iteration so the corpus can be traversed repeatedly
        with open('mycorpus.txt') as infile:
            for line in infile:
                # assume there's one document per line, tokens separated by whitespace
                # `dictionary` is the corpora.Dictionary built earlier
                yield dictionary.doc2bow(line.lower().split())

In [9]:
corpus_memory_friendly = MyCorpus() # doesn't load the corpus into memory!
print(corpus_memory_friendly)

<__main__.MyCorpus object at 0x10400add0>


The corpus is now an object, and printing it only shows the object's address in memory. To see the constituent vectors, iterate over the corpus:

In [10]:
for vector in corpus_memory_friendly: # load one vector into memory at a time
    print(vector)

IOError: [Errno 2] No such file or directory: 'mycorpus.txt'
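
The `IOError` above only indicates that `mycorpus.txt` is not present in the working directory. As a rough, self-contained illustration of the same streaming pattern, the sketch below writes its own two-line sample file first. The class name `StreamingCorpus`, the sample text and the plain `token2id` dict (standing in for the gensim dictionary) are all assumptions made for this example.

```python
import os
import tempfile

# Hypothetical token ids; in the notebook these come from dictionary.token2id.
token2id = {'human': 0, 'computer': 1, 'graph': 2}

class StreamingCorpus(object):
    """Yield one bag-of-words vector per line without loading the whole file."""
    def __init__(self, path, token2id):
        self.path = path
        self.token2id = token2id

    def __iter__(self):
        with open(self.path) as handle:
            for line in handle:
                counts = {}
                for token in line.lower().split():
                    if token in self.token2id:
                        token_id = self.token2id[token]
                        counts[token_id] = counts.get(token_id, 0) + 1
                yield sorted(counts.items())

# Write a two-line sample file so the example does not depend on mycorpus.txt.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write("Human computer interaction\nGraph of graph minors\n")
    sample_path = tmp.name

vectors = list(StreamingCorpus(sample_path, token2id))
os.remove(sample_path)
print(vectors)
# → [[(0, 1), (1, 1)], [(2, 2)]]
```

Because `__iter__` reopens the file each time, the corpus can be iterated more than once, and only one line is held in memory at any moment.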

Similarly, to construct the dictionary without loading all texts into memory:

In [None]:
# collect statistics about all tokens
dictionary = corpora.Dictionary(line.lower().split() for line in open('mycorpus.txt'))
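# ids of stop words (skipping any that never occur in the corpus)
stop_ids = [dictionary.token2id[stopword] for stopword in stoplist if stopword in dictionary.token2id]
# ids of words that occur in only one document; .items() works on both Python 2 and 3
once_ids = [tokenid for tokenid, docfreq in dictionary.dfs.items() if docfreq == 1]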
dictionary.filter_tokens(stop_ids + once_ids) # remove stop words and words that appear only once
dictionary.compactify() # remove gaps in id sequence after words that were removed
print(dictionary)
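
The filtering and renumbering steps can be sketched without gensim as follows. This is an illustration under stated assumptions, not the library's algorithm: `build_filtered_vocab` and the three sample documents are hypothetical, and the renumbering simply preserves first-seen order.

```python
from collections import Counter

def build_filtered_vocab(token_stream, stoplist):
    """Assign ids, drop stop words and tokens seen in only one document, then renumber."""
    token2id = {}
    doc_freq = Counter()
    for tokens in token_stream:          # one document at a time
        for token in tokens:
            token2id.setdefault(token, len(token2id))
        doc_freq.update(set(tokens))     # count each token once per document
    kept = [t for t in sorted(token2id, key=token2id.get)
            if t not in stoplist and doc_freq[t] > 1]
    # Renumber so ids are contiguous again (the role compactify() plays above).
    return {token: new_id for new_id, token in enumerate(kept)}

docs = ["the graph of trees", "graph minors", "the trees"]
vocab = build_filtered_vocab((d.lower().split() for d in docs), {"the", "of"})
print(vocab)
# → {'graph': 0, 'trees': 1}
```

The documents are passed as a generator, so, as in the cell above, the full text never needs to sit in memory at once.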

### Corpus Formats
Serializing a vector space corpus to disk.

To save a corpus in the Matrix Market format:

In [None]:
from gensim import corpora
# create a toy corpus of 2 documents, as a plain Python list
corpus = [[(1,0.5)], []] # make one document empty, for the heck of it
os.chdir("tmp")
corpora.MmCorpus.serialize('corpus.mm', corpus)
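
For orientation, the Matrix Market coordinate format is a plain-text listing of the non-zero entries with 1-based indices. The sketch below renders the toy corpus in that layout; `to_matrix_market_lines` is a hypothetical helper, not gensim's writer, and it assumes that documents map to rows and term ids to columns.

```python
def to_matrix_market_lines(corpus, num_terms):
    """Render a bag-of-words corpus as Matrix Market coordinate-format lines."""
    entries = []
    for doc_no, doc in enumerate(corpus, start=1):   # 1-based row = document
        for term_id, weight in doc:
            # 1-based column = term id + 1
            entries.append("%d %d %s" % (doc_no, term_id + 1, weight))
    header = "%%MatrixMarket matrix coordinate real general"
    size = "%d %d %d" % (len(corpus), num_terms, len(entries))  # rows, cols, non-zeros
    return [header, size] + entries

for line in to_matrix_market_lines([[(1, 0.5)], []], num_terms=2):
    print(line)
# → %%MatrixMarket matrix coordinate real general
# → 2 2 1
# → 1 2 0.5
```

The empty second document contributes no entry lines; it is recorded only through the row count on the size line.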

Other formats include Joachims' SVMlight format, Blei's LDA-C format and the GibbsLDA++ format:

In [None]:
corpora.SvmLightCorpus.serialize('corpus.svmlight', corpus)
corpora.BleiCorpus.serialize('corpus.lda-c', corpus)
corpora.LowCorpus.serialize('corpus.low', corpus)

Conversely, to load a corpus iterator:

In [None]:
corpus = corpora.MmCorpus('corpus.mm')

Corpus objects are streams, so printing one directly typically does not show its contents:

In [None]:
print(corpus)

Instead, to view the contents of the corpus:

In [None]:
# one way of printing a corpus: load it entirely into memory
print(list(corpus)) # calling list() will convert any sequence to a plain Python list

In [None]:
# another way is to print one document at a time, making use of the streaming interface
for doc in corpus:
    print(doc)

### Compatibility with NumPy and SciPy

Converting to/from NumPy matrices:

In [None]:
import gensim
numpy_matrix = gensim.matutils.corpus2dense(corpus, num_terms=12) # num_terms=number_of_corpus_features
corpus = gensim.matutils.Dense2Corpus(numpy_matrix)
print(numpy_matrix)
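
The dense layout has one row per term and one column per document. A minimal sketch of the round trip, using plain nested lists in place of a NumPy array so that it runs without any dependencies (`corpus_to_dense` and `dense_to_corpus` are hypothetical names, not gensim functions):

```python
def corpus_to_dense(corpus, num_terms):
    """Return a num_terms x num_docs matrix as nested lists (rows are terms)."""
    docs = list(corpus)
    dense = [[0.0] * len(docs) for _ in range(num_terms)]
    for doc_no, doc in enumerate(docs):
        for term_id, weight in doc:
            dense[term_id][doc_no] = float(weight)
    return dense

def dense_to_corpus(dense):
    """Invert corpus_to_dense: read each column back as a sparse document."""
    num_docs = len(dense[0]) if dense else 0
    return [[(term_id, row[doc_no]) for term_id, row in enumerate(dense) if row[doc_no] != 0]
            for doc_no in range(num_docs)]

toy = [[(1, 0.5)], []]
matrix = corpus_to_dense(toy, num_terms=3)
print(matrix)                   # → [[0.0, 0.0], [0.5, 0.0], [0.0, 0.0]]
print(dense_to_corpus(matrix))  # → [[(1, 0.5)], []]
```

The number of terms has to be supplied up front because a streamed corpus does not reveal its largest term id until it has been read in full.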

Converting to/from scipy.sparse matrices:

In [None]:
scipy_csc_matrix = gensim.matutils.corpus2csc(corpus)
corpus = gensim.matutils.Sparse2Corpus(scipy_csc_matrix)
print(scipy_csc_matrix)
os.chdir("../") # return to the root level folder
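
A compressed sparse column (CSC) matrix stores only the non-zero values plus two index arrays. With one column per document, those arrays can be filled in a single streaming pass, as the sketch below suggests. `corpus_to_csc_arrays` is a hypothetical helper written for illustration and makes no claim about how gensim or SciPy build the matrix internally.

```python
def corpus_to_csc_arrays(corpus):
    """Build the three CSC arrays (data, indices, indptr), one column per document."""
    data, indices, indptr = [], [], [0]
    for doc in corpus:
        for term_id, weight in sorted(doc):
            indices.append(term_id)   # row index = term id
            data.append(weight)       # non-zero value
        indptr.append(len(data))      # where the next column starts
    return data, indices, indptr

data, indices, indptr = corpus_to_csc_arrays([[(1, 0.5)], [], [(0, 2), (3, 1)]])
print(data, indices, indptr)
# → [0.5, 2, 1] [1, 0, 3] [0, 1, 1, 3]
```

An empty document shows up as two equal consecutive entries in `indptr`, so it costs no storage in `data` or `indices`.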