# Getting Started with gensim

https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/gensim%20Quick%20Start.ipynb

In [1]:
raw_corpus = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",              
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

In [8]:
# Create a set of frequent words
stoplist = set('for a of the and to in'.split(' '))
stoplist

{'a', 'and', 'for', 'in', 'of', 'the', 'to'}

In [9]:
# Lowercase each document, split it by white space and filter out stopwords
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in raw_corpus]
texts

[['human', 'machine', 'interface', 'lab', 'abc', 'computer', 'applications'],
 ['survey', 'user', 'opinion', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'management', 'system'],
 ['system', 'human', 'system', 'engineering', 'testing', 'eps'],
 ['relation', 'user', 'perceived', 'response', 'time', 'error', 'measurement'],
 ['generation', 'random', 'binary', 'unordered', 'trees'],
 ['intersection', 'graph', 'paths', 'trees'],
 ['graph', 'minors', 'iv', 'widths', 'trees', 'well', 'quasi', 'ordering'],
 ['graph', 'minors', 'survey']]

In [15]:
# Count word frequencies
from collections import defaultdict
frequency = defaultdict(int)
frequency

defaultdict(int, {})

In [11]:
for text in texts:
    for token in text:
        frequency[token] += 1
        print(frequency[token])

1
1
1
1
1
1
1
1
1
1
2
1
1
1
1
2
2
1
2
3
2
4
1
1
2
1
3
1
2
2
1
1
1
1
1
1
1
1
1
1
2
2
1
1
1
3
1
1
1
3
2
2
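
The same tallies can be built in a single pass with `collections.Counter`. The sketch below is an alternative, not what the notebook ran. To stay self-contained it uses a two-document stand-in for `texts`; the names `sample_texts`, `sample_frequency` and `repeated` are illustrative.

```python
from collections import Counter
from itertools import chain

# Two tokenized documents copied from `texts` above, kept short for illustration.
sample_texts = [
    ['eps', 'user', 'interface', 'management', 'system'],
    ['system', 'human', 'system', 'engineering', 'testing', 'eps'],
]

# Flatten the nested lists and count every token in one pass.
sample_frequency = Counter(chain.from_iterable(sample_texts))

# Keep only the tokens that occur more than once across the sample.
repeated = sorted(token for token, count in sample_frequency.items() if count > 1)
print(repeated)  # ['eps', 'system']
```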


In [6]:
# Only keep words that appear more than once
processed_corpus = [[token for token in text if frequency[token] > 1] for text in texts]
processed_corpus

[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]

In [17]:
from gensim import corpora

dictionary = corpora.Dictionary(processed_corpus)
print(dictionary)



Dictionary(12 unique tokens: ['human', 'minors', 'graph', 'response', 'interface']...)


In [18]:
print(dictionary.token2id)

{'human': 0, 'minors': 11, 'graph': 10, 'response': 4, 'interface': 1, 'trees': 9, 'computer': 2, 'system': 3, 'survey': 5, 'time': 7, 'user': 6, 'eps': 8}


In [20]:
new_doc = "Human computer interaction"
new_vec = dictionary.doc2bow(new_doc.lower().split())
new_vec

[(0, 1), (2, 1)]

Note that "interaction" did not occur in the original corpus, so it was not included in the vectorization. The vector also contains entries only for words that actually appear in the document. Because any given document contains only a few of the many words in the dictionary, words that do not appear are left out of the vector and treated as implicit zeros, which saves space.
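
To make the implicit zeros concrete, here is a minimal pure-Python sketch of the sparse representation. `sketch_doc2bow` and `to_dense` are hypothetical helpers written for illustration, not gensim functions, and the mapping is copied from the `token2id` output above.

```python
from collections import Counter

# token -> id mapping, copied from the `dictionary.token2id` output above.
token2id = {'human': 0, 'interface': 1, 'computer': 2, 'system': 3,
            'response': 4, 'survey': 5, 'user': 6, 'time': 7,
            'eps': 8, 'trees': 9, 'graph': 10, 'minors': 11}

def sketch_doc2bow(tokens, token2id):
    """Hypothetical stand-in for doc2bow: count known tokens, skip unknown ones."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

def to_dense(sparse_vec, num_terms):
    """Expand (id, count) pairs into a full-length list, filling in the implicit zeros."""
    dense = [0] * num_terms
    for term_id, count in sparse_vec:
        dense[term_id] = count
    return dense

sparse = sketch_doc2bow("human computer interaction".split(), token2id)
print(sparse)                           # [(0, 1), (2, 1)]
print(to_dense(sparse, len(token2id)))  # [1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

The dense form stores twelve numbers to convey two counts, and the gap grows with vocabulary size.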

In [22]:
bow_corpus = [dictionary.doc2bow(text) for text in processed_corpus]
bow_corpus

[[(0, 1), (1, 1), (2, 1)],
 [(2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
 [(1, 1), (3, 1), (6, 1), (8, 1)],
 [(0, 1), (3, 2), (8, 1)],
 [(4, 1), (6, 1), (7, 1)],
 [(9, 1)],
 [(9, 1), (10, 1)],
 [(9, 1), (10, 1), (11, 1)],
 [(5, 1), (10, 1), (11, 1)]]

Now that we have vectorized our corpus, we can begin to transform it using models. We use "model" as an abstract term for a transformation from one document representation to another. In gensim, documents are represented as vectors, so a model can be thought of as a transformation between two vector spaces. The details of this transformation are learned from the training corpus.
One simple example of a model is tf-idf. The tf-idf model transforms vectors from the bag-of-words representation to a vector space where the frequency counts are weighted according to the relative rarity of each word in the corpus.

In [25]:
from gensim import models
# train the model
tfidf = models.TfidfModel(bow_corpus)
# transform the "system minors" string
tfidf[dictionary.doc2bow("system minors".lower().split())]

[(3, 0.5898341626740045), (11, 0.8075244024440723)]
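
The two weights above can be reproduced by hand. The sketch below assumes the default weighting is the raw term frequency multiplied by log2(N / df), followed by L2 normalisation of the resulting vector. The document frequencies are read off `bow_corpus`, and the variable names are illustrative.

```python
import math

# Document frequencies read off `bow_corpus` above:
# "system" (id 3) appears in 3 of the 9 documents, "minors" (id 11) in 2.
num_docs = 9
doc_freq = {3: 3, 11: 2}

# The query "system minors" as a bag of words: each term occurs once.
query_bow = [(3, 1), (11, 1)]

# Assumed weighting: term frequency times log2(N / df), then L2 normalisation.
raw = [(term_id, tf * math.log2(num_docs / doc_freq[term_id]))
       for term_id, tf in query_bow]
norm = math.sqrt(sum(weight ** 2 for _, weight in raw))
weights = [(term_id, weight / norm) for term_id, weight in raw]

print(weights)  # approximately [(3, 0.5898), (11, 0.8075)]
```

"minors" receives the larger weight because it occurs in fewer documents than "system".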

# Word2vec Tutorial

https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/word2vec.ipynb

In [38]:
# import modules & set up logging
import gensim, logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
sentences = [['first', 'sentence'], ['second', 'sentence']]
# train word2vec on the two sentences
model = gensim.models.Word2Vec(sentences, min_count=1)

2017-01-05 21:15:45,981 : INFO : collecting all words and their counts
2017-01-05 21:15:45,981 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-01-05 21:15:45,981 : INFO : collected 3 word types from a corpus of 4 raw words and 2 sentences
2017-01-05 21:15:45,981 : INFO : Loading a fresh vocabulary
2017-01-05 21:15:45,981 : INFO : min_count=1 retains 3 unique words (100% of original 3, drops 0)
2017-01-05 21:15:45,997 : INFO : min_count=1 leaves 4 word corpus (100% of original 4, drops 0)
2017-01-05 21:15:45,998 : INFO : deleting the raw counts dictionary of 3 items
2017-01-05 21:15:45,998 : INFO : sample=0.001 downsamples 3 most-common words
2017-01-05 21:15:45,999 : INFO : downsampling leaves estimated 0 word corpus (5.7% of prior 4)
2017-01-05 21:15:46,000 : INFO : estimated required memory for 3 words and 100 dimensions: 3900 bytes
2017-01-05 21:15:46,000 : INFO : resetting layer weights
2017-01-05 21:15:46,001 : INFO : training model with 3 workers o

In [40]:
# create some toy data to use with the following example
import smart_open, os

if not os.path.exists('./data/'):
    os.makedirs('./data/')

filenames = ['./data/f1.txt', './data/f2.txt']

for i,fname in enumerate(filenames):
    with smart_open.smart_open(fname, 'w') as fout:
        for line in sentences[i]:
            fout.write(line + '\n')

In [41]:
class MySentences(object):
    def __init__(self, dirname):
        self.dirname = dirname
 
    def __iter__(self):
        for fname in os.listdir(self.dirname):
            # open each file in turn and close it once it has been read
            with open(os.path.join(self.dirname, fname)) as fin:
                for line in fin:
                    yield line.split()

In [42]:
sentences = MySentences('./data/') # a memory-friendly iterator
print(list(sentences))

[['first'], ['sentence'], ['second'], ['sentence']]
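
Every "sentence" here is a single word because the toy-data cell wrote each token on its own line. As a sketch of the alternative layout, the example below writes one sentence per line so that the iterator returns the original sentences. The class name `LineSentences` and the temporary directory are assumptions made for illustration.

```python
import os
import tempfile

toy_sentences = [['first', 'sentence'], ['second', 'sentence']]

class LineSentences:
    """Hypothetical variant of MySentences: one whitespace-tokenized sentence per line."""
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        # Sort the file names so the iteration order is deterministic.
        for fname in sorted(os.listdir(self.dirname)):
            with open(os.path.join(self.dirname, fname), encoding='utf-8') as fin:
                for line in fin:
                    yield line.split()

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write each sentence as a single space-separated line in its own file.
    for i, tokens in enumerate(toy_sentences, start=1):
        path = os.path.join(tmp_dir, 'f{}.txt'.format(i))
        with open(path, 'w', encoding='utf-8') as fout:
            fout.write(' '.join(tokens) + '\n')
    streamed = list(LineSentences(tmp_dir))

print(streamed)  # [['first', 'sentence'], ['second', 'sentence']]
```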


In [43]:
# generate the Word2Vec model
model = gensim.models.Word2Vec(sentences, min_count=1)

2017-01-05 21:16:41,367 : INFO : collecting all words and their counts
2017-01-05 21:16:41,369 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-01-05 21:16:41,372 : INFO : collected 3 word types from a corpus of 4 raw words and 4 sentences
2017-01-05 21:16:41,374 : INFO : Loading a fresh vocabulary
2017-01-05 21:16:41,375 : INFO : min_count=1 retains 3 unique words (100% of original 3, drops 0)
2017-01-05 21:16:41,375 : INFO : min_count=1 leaves 4 word corpus (100% of original 4, drops 0)
2017-01-05 21:16:41,376 : INFO : deleting the raw counts dictionary of 3 items
2017-01-05 21:16:41,377 : INFO : sample=0.001 downsamples 3 most-common words
2017-01-05 21:16:41,378 : INFO : downsampling leaves estimated 0 word corpus (5.7% of prior 4)
2017-01-05 21:16:41,379 : INFO : estimated required memory for 3 words and 100 dimensions: 3900 bytes
2017-01-05 21:16:41,380 : INFO : resetting layer weights
2017-01-05 21:16:41,381 : INFO : training model with 3 workers o

In [44]:
print(model)
print(model.wv.vocab)

Word2Vec(vocab=3, size=100, alpha=0.025)
{'first': <gensim.models.word2vec.Vocab object at 0x0000000007EB9438>, 'second': <gensim.models.word2vec.Vocab object at 0x0000000007ECDEB8>, 'sentence': <gensim.models.word2vec.Vocab object at 0x0000000007ECDC18>}


In [45]:
new_model = gensim.models.Word2Vec(min_count=1)  # an empty model, no training
new_model.build_vocab(sentences)                 # can be a non-repeatable, 1-pass generator     
new_model.train(sentences)                       # can be a non-repeatable, 1-pass generator

2017-01-05 21:21:09,828 : INFO : collecting all words and their counts
2017-01-05 21:21:09,828 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-01-05 21:21:09,828 : INFO : collected 3 word types from a corpus of 4 raw words and 4 sentences
2017-01-05 21:21:09,828 : INFO : Loading a fresh vocabulary
2017-01-05 21:21:09,828 : INFO : min_count=1 retains 3 unique words (100% of original 3, drops 0)
2017-01-05 21:21:09,845 : INFO : min_count=1 leaves 4 word corpus (100% of original 4, drops 0)
2017-01-05 21:21:09,845 : INFO : deleting the raw counts dictionary of 3 items
2017-01-05 21:21:09,846 : INFO : sample=0.001 downsamples 3 most-common words
2017-01-05 21:21:09,847 : INFO : downsampling leaves estimated 0 word corpus (5.7% of prior 4)
2017-01-05 21:21:09,847 : INFO : estimated required memory for 3 words and 100 dimensions: 3900 bytes
2017-01-05 21:21:09,848 : INFO : resetting layer weights
2017-01-05 21:21:09,849 : INFO : training model with 3 workers o

0

In [52]:
print(new_model)
print(new_model.wv.vocab)

Word2Vec(vocab=3, size=100, alpha=0.025)


In [53]:
# Set file names for train and test data
test_data_dir = os.path.join(gensim.__path__[0], 'test', 'test_data') + os.sep
lee_train_file = test_data_dir + 'lee_background.cor'

In [54]:
class MyText(object):
    def __iter__(self):
        # open the file on each pass and close it once it has been read
        with open(lee_train_file) as fin:
            for line in fin:
                # assume there's one document per line, tokens separated by whitespace
                yield line.lower().split()

sentences = MyText()

print(sentences)

<__main__.MyText object at 0x0000000008150B38>


In [55]:
# min_count defaults to 5; here words seen fewer than 10 times are dropped
model = gensim.models.Word2Vec(sentences, min_count=10)

2017-01-05 21:31:24,952 : INFO : collecting all words and their counts
2017-01-05 21:31:24,952 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-01-05 21:31:24,968 : INFO : collected 10186 word types from a corpus of 59890 raw words and 300 sentences
2017-01-05 21:31:24,968 : INFO : Loading a fresh vocabulary
2017-01-05 21:31:24,968 : INFO : min_count=10 retains 806 unique words (7% of original 10186, drops 9380)
2017-01-05 21:31:24,968 : INFO : min_count=10 leaves 40964 word corpus (68% of original 59890, drops 18926)
2017-01-05 21:31:24,968 : INFO : deleting the raw counts dictionary of 10186 items
2017-01-05 21:31:24,968 : INFO : sample=0.001 downsamples 54 most-common words
2017-01-05 21:31:24,983 : INFO : downsampling leaves estimated 26224 word corpus (64.0% of prior 40964)
2017-01-05 21:31:24,983 : INFO : estimated required memory for 806 words and 100 dimensions: 1047800 bytes
2017-01-05 21:31:24,985 : INFO : resetting layer weights
2017-01-05 21:3

In [56]:
# size defaults to 100; here the vectors are given 200 dimensions
model = gensim.models.Word2Vec(sentences, size=200)

2017-01-05 21:31:27,708 : INFO : collecting all words and their counts
2017-01-05 21:31:27,710 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-01-05 21:31:27,742 : INFO : collected 10186 word types from a corpus of 59890 raw words and 300 sentences
2017-01-05 21:31:27,742 : INFO : Loading a fresh vocabulary
2017-01-05 21:31:27,753 : INFO : min_count=5 retains 1723 unique words (16% of original 10186, drops 8463)
2017-01-05 21:31:27,754 : INFO : min_count=5 leaves 46858 word corpus (78% of original 59890, drops 13032)
2017-01-05 21:31:27,760 : INFO : deleting the raw counts dictionary of 10186 items
2017-01-05 21:31:27,761 : INFO : sample=0.001 downsamples 49 most-common words
2017-01-05 21:31:27,762 : INFO : downsampling leaves estimated 32849 word corpus (70.1% of prior 46858)
2017-01-05 21:31:27,763 : INFO : estimated required memory for 1723 words and 200 dimensions: 3618300 bytes
2017-01-05 21:31:27,769 : INFO : resetting layer weights
2017-01-05 21:

In [57]:
model.accuracy('./data/questions-words.txt')

ValueError: missing section header before line #0 in ./data/questions-words.txt

In [58]:
model.evaluate_word_pairs(test_data_dir +'wordsim353.tsv')

2017-01-05 21:35:56,062 : INFO : Pearson correlation coefficient against C:\Users\Student\Anaconda3\lib\site-packages\gensim\test\test_data\wordsim353.tsv: 0.0978
2017-01-05 21:35:56,063 : INFO : Spearman rank-order correlation coefficient against C:\Users\Student\Anaconda3\lib\site-packages\gensim\test\test_data\wordsim353.tsv: 0.0968
2017-01-05 21:35:56,064 : INFO : Pairs with unknown words ratio: 85.6%


((0.097812589405989414, 0.49470504695220818),
 SpearmanrResult(correlation=0.09684353455958257, pvalue=0.49901062434954946),
 85.55240793201133)

In [59]:
from tempfile import mkstemp

fs, temp_path = mkstemp(suffix="gensim_temp")  # creates a temp file; returns (file descriptor, path)

model.save(temp_path)  # save the model

2017-01-05 21:36:30,322 : INFO : saving Word2Vec object under C:\Users\Student\AppData\Local\Temp\tmpx2xtxf9hgensim_temp, separately None
2017-01-05 21:36:30,322 : INFO : not storing attribute cum_table
2017-01-05 21:36:30,322 : INFO : not storing attribute syn0norm
2017-01-05 21:36:30,374 : INFO : saved C:\Users\Student\AppData\Local\Temp\tmpx2xtxf9hgensim_temp


In [60]:
new_model = gensim.models.Word2Vec.load(temp_path) 

2017-01-05 21:36:52,266 : INFO : loading Word2Vec object from C:\Users\Student\AppData\Local\Temp\tmpx2xtxf9hgensim_temp
2017-01-05 21:36:52,320 : INFO : loading wv recursively from C:\Users\Student\AppData\Local\Temp\tmpx2xtxf9hgensim_temp.wv.* with mmap=None
2017-01-05 21:36:52,321 : INFO : setting ignored attribute cum_table to None
2017-01-05 21:36:52,322 : INFO : setting ignored attribute syn0norm to None
2017-01-05 21:36:52,322 : INFO : loaded C:\Users\Student\AppData\Local\Temp\tmpx2xtxf9hgensim_temp


In [61]:
model = gensim.models.Word2Vec.load(temp_path)
more_sentences = [['Advanced', 'users', 'can', 'load', 'a', 'model', 'and', 'continue', 
                  'training', 'it', 'with', 'more', 'sentences']]
model.build_vocab(more_sentences, update=True)
model.train(more_sentences)

# cleaning up temp
os.close(fs)
os.remove(temp_path)

2017-01-05 21:37:12,312 : INFO : loading Word2Vec object from C:\Users\Student\AppData\Local\Temp\tmpx2xtxf9hgensim_temp
2017-01-05 21:37:12,362 : INFO : loading wv recursively from C:\Users\Student\AppData\Local\Temp\tmpx2xtxf9hgensim_temp.wv.* with mmap=None
2017-01-05 21:37:12,364 : INFO : setting ignored attribute cum_table to None
2017-01-05 21:37:12,364 : INFO : setting ignored attribute syn0norm to None
2017-01-05 21:37:12,365 : INFO : loaded C:\Users\Student\AppData\Local\Temp\tmpx2xtxf9hgensim_temp
2017-01-05 21:37:12,369 : INFO : collecting all words and their counts
2017-01-05 21:37:12,370 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-01-05 21:37:12,371 : INFO : collected 13 word types from a corpus of 13 raw words and 1 sentences
2017-01-05 21:37:12,372 : INFO : Updating model with new vocabulary
2017-01-05 21:37:12,372 : INFO : New added 0 unique words (0% of original 13)
                        and increased the count of 0 pre-existing wo

In [62]:
model.most_similar(positive=['human', 'crime'], negative=['party'], topn=1)

2017-01-05 21:37:27,678 : INFO : precomputing L2-norms of word weight vectors


[('australians', 0.9973864555358887)]

In [63]:
model.doesnt_match("input is lunch he sentence cat".split())

'sentence'

In [64]:
print(model.similarity('human', 'party'))
print(model.similarity('tree', 'murder'))

0.999586090025
0.997775995999
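
The scores printed by `model.similarity` are cosine similarities between the two word vectors. As a minimal sketch of that computation, the example below uses made-up three-dimensional vectors; `cosine_similarity` is a hypothetical helper, not a gensim call.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors for illustration only; they are not taken from the model.
vec_a = [1.0, 2.0, 3.0]
vec_b = [2.0, 4.0, 6.0]   # same direction as vec_a, twice the length
vec_c = [3.0, 0.0, -1.0]  # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b))  # close to 1: same direction
print(cosine_similarity(vec_a, vec_c))  # 0: orthogonal
```

Cosine similarity ignores vector length, so scores near 1 for unrelated word pairs, as above, indicate that the vectors all point in nearly the same direction.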


In [65]:
model['tree']  # raw NumPy vector of a word

array([-0.01127922, -0.03585431, -0.0125974 , -0.04147318,  0.00118686,
       -0.01606489, -0.00240359, -0.01758233, -0.03929804, -0.00437588,
       -0.02706588, -0.04878673,  0.00651546, -0.00814761, -0.01902641,
       -0.00802842, -0.02677576, -0.00783781, -0.00248508, -0.03722271,
       -0.0401651 ,  0.02337405,  0.026868  , -0.00215164, -0.02530951,
       -0.04220852, -0.00314844, -0.02795812,  0.04118952,  0.02010571,
       -0.06484023, -0.03345084, -0.06004576,  0.02802508,  0.03755744,
       -0.00150028,  0.02468759,  0.05103324,  0.00318254, -0.01412363,
       -0.02588135, -0.04607787, -0.00969513, -0.01094116, -0.02884848,
        0.0235644 , -0.01119853, -0.02832495,  0.01450322,  0.05574993,
        0.02936582, -0.00019662, -0.00673731, -0.01590701, -0.01878649,
       -0.02997692, -0.02445791, -0.04906164, -0.00614071, -0.0465661 ,
       -0.009026  , -0.01483737, -0.00262535, -0.02153099, -0.04723297,
        0.00348318, -0.00915821,  0.0257897 ,  0.05443203, -0.02

https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/Word2Vec_FastText_Comparison.ipynb