In [1]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

from gensim import corpora

import csv
from six import iteritems

2016-11-11 19:29:06,352 : INFO : 'pattern' package found; tag filters are available for English


Set up the dictionary by reading from the corpus in a memory-efficient way

In [2]:
# choose some simple stop word list
stoplist = set('for a of the and to in'.split())

# make the dictionary, a collection of statistics about all tokens in the corpus
dictionary = corpora.Dictionary(line.lower().split() for line in open('./datasets/corpus.csv'))

2016-11-11 19:29:06,458 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2016-11-11 19:29:07,560 : INFO : built Dictionary(38074 unique tokens: [u'considered,', u'considered.', u'homomorphism', u'(mcda),', u'(qoe)']...) from 2851 documents (total 449011 corpus positions)


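The [tutorial](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/Corpora_and_Vector_Spaces.ipynb) includes this filtering step, which removes stop words and words that appear in only one document. I assume it helps the bag-of-words representation by shrinking the vocabulary and dropping tokens that say little about what a document is about.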

In [3]:
# find stop words and words that appear in only one document
# (dictionary.dfs maps token id -> number of documents containing that token)
stop_ids = [dictionary.token2id[stopword] for stopword in stoplist
            if stopword in dictionary.token2id]
once_ids = [tokenid for tokenid, docfreq in iteritems(dictionary.dfs) if docfreq == 1]

# remove stop words and words that appear in only one document
dictionary.filter_tokens(stop_ids + once_ids)

# remove gaps in id sequence after words that were removed
dictionary.compactify()
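
To make the filter concrete: `dictionary.dfs` holds document frequencies, so `docfreq == 1` selects tokens that occur in exactly one document, even if they are repeated inside it. The sketch below reproduces that logic in plain Python on a made-up three-document corpus. The names `docs`, `doc_freq` and `keep` are illustrative assumptions and not part of gensim.

```python
from collections import Counter

# Hypothetical toy corpus: one document per string, tokens separated by whitespace.
docs = [
    "the entropy of the system",
    "entropy and partial results",
    "rare rare results for the system",
]
stoplist = set("for a of the and to in".split())

# Document frequency: count each token at most once per document.
doc_freq = Counter()
for doc in docs:
    doc_freq.update(set(doc.lower().split()))

# Keep tokens that are not stop words and occur in more than one document.
keep = sorted(tok for tok, df in doc_freq.items()
              if df > 1 and tok not in stoplist)
print(keep)
```

Note that "rare" occurs twice but only in one document, so its document frequency is 1 and it is dropped along with "partial".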

Define a class that efficiently represents the bag-of-words

In [8]:
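# memory-friendly bag-of-words: streams the corpus one line at a time
class BOW(object):
    def __iter__(self):
        # re-open the file on every pass; the with-block closes it afterwards
        with open('./datasets/corpus.csv') as f:
            for line in f:
                # assume there's one document per line, tokens separated by whitespace
                yield dictionary.doc2bow(line.lower().split())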

Now we can create the bag-of-words corpus and iterate over it, one document vector at a time

In [11]:
arxiv_bow = BOW() # doesn't load the corpus into memory!

In [None]:
for vector in arxiv_bow:  # load one vector into memory at a time
    print(vector)
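
One reason to wrap the stream in a class with `__iter__`, rather than a bare generator, is that the object can be iterated more than once, which matters if a later step needs several passes over the corpus. A minimal sketch, assuming a temporary file in place of the real corpus; `LineStream` and `line_generator` are hypothetical stand-ins and skip the dictionary lookup:

```python
import os
import tempfile

class LineStream(object):
    """Hypothetical stand-in for BOW: yields one tokenized line at a time."""
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Each call re-opens the file, so every pass starts from the top.
        with open(self.path) as f:
            for line in f:
                yield line.lower().split()

def line_generator(path):
    """The same loop written as a plain generator function."""
    with open(path) as f:
        for line in f:
            yield line.lower().split()

# Write a tiny two-document corpus to a temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("All partial results\nEntropy results\n")
    path = tmp.name

stream = LineStream(path)
first_pass = list(stream)
second_pass = list(stream)   # works again: __iter__ re-opens the file

gen = line_generator(path)
gen_first = list(gen)
gen_second = list(gen)       # empty: a generator is exhausted after one pass

os.remove(path)
```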

Represent an unseen document as a bag-of-words using this dictionary to define the vector space

In [12]:
# Token-to-feature-id map: given a token, returns the integer feature id of that token.
token2id_map = dictionary.token2id

The function doc2bow() counts the number of occurrences of each distinct word, converts the word to its integer word id and returns the result as a sparse vector of (word id, count) pairs. In the output below, the document “all partial results results illustrated entropy” yields five pairs, one per distinct word. The pair (2564, 2) is the only one with a count of 2, so id 2564 must be “results”, the only word that occurs twice. Words that are not in the dictionary are ignored.

In [13]:
new_doc = "all partial results results illustrated entropy"
new_vec = dictionary.doc2bow(new_doc.lower().split())
print(new_vec)

[(2560, 1), (2564, 2), (2566, 1), (3446, 1), (12277, 1)]
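
What doc2bow() computes can be sketched in a few lines of plain Python. This is an illustration of the idea, not gensim's implementation: `toy_token2id` and `toy_doc2bow` are hypothetical names, and the ids are made up rather than taken from the dictionary above.

```python
from collections import Counter

# Hypothetical vocabulary; the ids are made up for illustration.
toy_token2id = {"all": 0, "partial": 1, "results": 2, "entropy": 3}

def toy_doc2bow(tokens, token2id):
    """Count known tokens and return sorted (token_id, count) pairs."""
    counts = Counter(token2id[tok] for tok in tokens if tok in token2id)
    return sorted(counts.items())

doc = "all partial results results illustrated entropy"
toy_vec = toy_doc2bow(doc.lower().split(), toy_token2id)
print(toy_vec)  # [(0, 1), (1, 1), (2, 2), (3, 1)]
```

Here "illustrated" is missing from the toy vocabulary, so it is silently skipped, and "results" gets a count of 2.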
