# Getting Started with gensim

https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/gensim%20Quick%20Start.ipynb

In [1]:
raw_corpus = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",              
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

In [2]:
# Create a set of frequent words
stoplist = set('for a of the and to in'.split(' '))
stoplist

{'a', 'and', 'for', 'in', 'of', 'the', 'to'}

In [3]:
# Lowercase each document, split it by white space and filter out stopwords
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in raw_corpus]
texts

[['human', 'machine', 'interface', 'lab', 'abc', 'computer', 'applications'],
 ['survey', 'user', 'opinion', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'management', 'system'],
 ['system', 'human', 'system', 'engineering', 'testing', 'eps'],
 ['relation', 'user', 'perceived', 'response', 'time', 'error', 'measurement'],
 ['generation', 'random', 'binary', 'unordered', 'trees'],
 ['intersection', 'graph', 'paths', 'trees'],
 ['graph', 'minors', 'iv', 'widths', 'trees', 'well', 'quasi', 'ordering'],
 ['graph', 'minors', 'survey']]

In [4]:
# Count word frequencies
from collections import defaultdict
frequency = defaultdict(int)
frequency

defaultdict(int, {})

In [5]:
for text in texts:
    for token in text:
        frequency[token] += 1
        print(frequency[token])

1
1
1
1
1
1
1
1
1
1
2
1
1
1
1
2
2
1
2
3
2
4
1
1
2
1
3
1
2
2
1
1
1
1
1
1
1
1
1
1
2
2
1
1
1
3
1
1
1
3
2
2


In [6]:
# Only keep words that appear more than once
processed_corpus = [[token for token in text if frequency[token] > 1] for text in texts]
processed_corpus

[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]

In [7]:
from gensim import corpora

dictionary = corpora.Dictionary(processed_corpus)
print(dictionary)



Dictionary(12 unique tokens: ['time', 'survey', 'graph', 'system', 'human']...)


In [8]:
print(dictionary.token2id)

{'time': 4, 'survey': 6, 'graph': 10, 'system': 7, 'human': 1, 'minors': 11, 'trees': 9, 'interface': 0, 'response': 5, 'eps': 8, 'user': 3, 'computer': 2}


In [9]:
new_doc = "Human computer interaction"
new_vec = dictionary.doc2bow(new_doc.lower().split())
new_vec

[(1, 1), (2, 1)]

Note that "interaction" did not occur in the original corpus, so it was not included in the vectorization. Also note that this vector only contains entries for words that actually appeared in the document. Because any given document contains only a few of the many words in the dictionary, words that do not appear in the vectorization are implicitly zero, which saves space.
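
The sparse representation described above can be sketched without gensim. This is only an illustration: sketch_doc2bow and to_dense are hypothetical helpers, not gensim functions, and token2id is copied from the dictionary printed in cell In [8].

```python
from collections import Counter

# token2id copied from the dictionary printed above
token2id = {'time': 4, 'survey': 6, 'graph': 10, 'system': 7, 'human': 1,
            'minors': 11, 'trees': 9, 'interface': 0, 'response': 5,
            'eps': 8, 'user': 3, 'computer': 2}


def sketch_doc2bow(tokens, token2id):
    """Hypothetical helper: count known tokens and silently drop unknown ones."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


def to_dense(bow, size):
    """Hypothetical helper: expand (id, count) pairs into a full-length vector."""
    dense = [0] * size
    for token_id, count in bow:
        dense[token_id] = count
    return dense


new_vec = sketch_doc2bow("Human computer interaction".lower().split(), token2id)
dense_vec = to_dense(new_vec, len(token2id))

print(new_vec)    # [(1, 1), (2, 1)]
print(dense_vec)  # [0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

The dense form stores twelve numbers to say what the sparse form says in two pairs; with a realistic dictionary the gap is far larger.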

In [10]:
bow_corpus = [dictionary.doc2bow(text) for text in processed_corpus]
bow_corpus

[[(0, 1), (1, 1), (2, 1)],
 [(2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
 [(0, 1), (3, 1), (7, 1), (8, 1)],
 [(1, 1), (7, 2), (8, 1)],
 [(3, 1), (4, 1), (5, 1)],
 [(9, 1)],
 [(9, 1), (10, 1)],
 [(9, 1), (10, 1), (11, 1)],
 [(6, 1), (10, 1), (11, 1)]]

Now that we have vectorized our corpus, we can begin to transform it using models. We use "model" as an abstract term for a transformation from one document representation to another. In gensim, documents are represented as vectors, so a model can be thought of as a transformation between two vector spaces. The details of this transformation are learned from the training corpus.
One simple example of a model is tf-idf. The tf-idf model transforms vectors from the bag-of-words representation to a vector space where the frequency counts are weighted according to the relative rarity of each word in the corpus.

In [11]:
from gensim import models
# train the model
tfidf = models.TfidfModel(bow_corpus)
# transform the "system minors" string
tfidf[dictionary.doc2bow("system minors".lower().split())]

[(7, 0.5898341626740045), (11, 0.8075244024440723)]
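
The two weights above can be reproduced by hand. The sketch below assumes the weighting is raw term frequency times log2(N / document frequency), followed by L2 normalization; that assumption matches the numbers printed above, but sketch_tfidf is a hypothetical helper, not part of gensim.

```python
import math

# bow_corpus copied from the output of cell In [10]
bow_corpus = [
    [(0, 1), (1, 1), (2, 1)],
    [(2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
    [(0, 1), (3, 1), (7, 1), (8, 1)],
    [(1, 1), (7, 2), (8, 1)],
    [(3, 1), (4, 1), (5, 1)],
    [(9, 1)],
    [(9, 1), (10, 1)],
    [(9, 1), (10, 1), (11, 1)],
    [(6, 1), (10, 1), (11, 1)],
]


def sketch_tfidf(bow, corpus):
    """Hypothetical helper: tf * log2(N / df), then scale to unit length."""
    n_docs = len(corpus)
    # document frequency: in how many documents does each token id occur?
    df = {}
    for doc in corpus:
        for token_id, _ in doc:
            df[token_id] = df.get(token_id, 0) + 1
    weighted = [(tid, tf * math.log2(n_docs / df[tid]))
                for tid, tf in bow if tid in df]
    norm = math.sqrt(sum(w * w for _, w in weighted))
    if norm == 0:
        return []
    return [(tid, w / norm) for tid, w in weighted]


query = [(7, 1), (11, 1)]  # bag-of-words for "system minors"
weights = sketch_tfidf(query, bow_corpus)
print(weights)
```

"system" occurs in 3 of the 9 documents and "minors" in 2, so "minors" is rarer and ends up with the larger weight.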

---

# Word2vec Tutorial

https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/word2vec.ipynb

## Preparing the Input

In [1]:
# import modules & set up logging
import gensim, logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
sentences = [['first', 'sentence'], ['second', 'sentence']]
# train word2vec on the two sentences
model = gensim.models.Word2Vec(sentences, min_count=1)

2017-01-07 10:54:41,235 : INFO : collecting all words and their counts
2017-01-07 10:54:41,235 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-01-07 10:54:41,236 : INFO : collected 3 word types from a corpus of 4 raw words and 2 sentences
2017-01-07 10:54:41,236 : INFO : Loading a fresh vocabulary
2017-01-07 10:54:41,237 : INFO : min_count=1 retains 3 unique words (100% of original 3, drops 0)
2017-01-07 10:54:41,238 : INFO : min_count=1 leaves 4 word corpus (100% of original 4, drops 0)
2017-01-07 10:54:41,239 : INFO : deleting the raw counts dictionary of 3 items
2017-01-07 10:54:41,239 : INFO : sample=0.001 downsamples 3 most-common words
2017-01-07 10:54:41,240 : INFO : downsampling leaves estimated 0 word corpus (5.7% of prior 4)
2017-01-07 10:54:41,240 : INFO : estimated required memory for 3 words and 100 dimensions: 3900 bytes
2017-01-07 10:54:41,242 : INFO : resetting layer weights
2017-01-07 10:54:41,242 : INFO : training model with 3 workers o

In [2]:
# create some toy data to use with the following example
import smart_open, os

if not os.path.exists('./data/'):
    os.makedirs('./data/')

filenames = ['./data/f1.txt', './data/f2.txt']

for i,fname in enumerate(filenames):
    with smart_open.smart_open(fname, 'w') as fout:
        for line in sentences[i]:
            fout.write(line + '\n')

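smart_open is a Python library for efficiently streaming very large files (it supports compressed and uncompressed files, in the cloud or local: S3, HDFS, gzip, bz2...)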

- enumerate(iterable, start=0): Return an enumerate object. iterable must be a sequence, an iterator, or some other object which supports iteration. The __next__() method of the iterator returned by enumerate() returns a tuple containing a count (from start which defaults to 0) and the values obtained from iterating over iterable.

seasons = ['Spring', 'Summer', 'Fall', 'Winter']

list(enumerate(seasons))

[(0, 'Spring'), (1, 'Summer'), (2, 'Fall'), (3, 'Winter')]

list(enumerate(seasons, start=1))

[(1, 'Spring'), (2, 'Summer'), (3, 'Fall'), (4, 'Winter')]


- Explanation of with ... as:
http://blog.kissdata.com/2014/05/23/python-with.html
It replaces the explicit close() that would otherwise go in the finally clause of a try/except block.
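
A minimal sketch of that equivalence, using only the standard library. The file name demo.txt and the temporary directory are placeholders chosen for this illustration.

```python
import os
import tempfile

# placeholder path in a scratch directory, only for this illustration
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

# Manual version: close() has to be written out in a finally clause
fout = open(path, 'w')
try:
    fout.write('first\n')
finally:
    fout.close()

# with version: the file is closed automatically when the block exits,
# even if an exception is raised inside it
with open(path, 'a') as fout2:
    fout2.write('second\n')

closed_after_with = fout2.closed

with open(path) as fin:
    lines = fin.read().split()

print(closed_after_with)  # True
print(lines)              # ['first', 'second']
```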

In [3]:
class MySentences(object):
    def __init__(self, dirname):
        self.dirname = dirname
 
    def __iter__(self):
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname)):
                yield line.split()
    

For example, the following program produces a\b\c on Windows and a/b/c on Linux:

import os
print(os.path.join("a", "b", "c"))

Explanation of yield: http://www.ibm.com/developerworks/cn/opensource/os-cn-python-yield/
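
MySentences puts yield inside __iter__ rather than using a bare generator function. The difference matters whenever the data is read more than once (the Word2Vec constructor first scans the vocabulary and then trains). A small sketch of the difference, with hypothetical names one_pass and Restartable:

```python
def one_pass(lines):
    """A plain generator: it can be consumed exactly once."""
    for line in lines:
        yield line.split()


class Restartable(object):
    """An iterable: every call to __iter__ starts a fresh generator."""

    def __init__(self, lines):
        self.lines = lines

    def __iter__(self):
        for line in self.lines:
            yield line.split()


data = ["first sentence", "second sentence"]

gen = one_pass(data)
first_pass = list(gen)
second_pass = list(gen)   # the generator is already exhausted

it = Restartable(data)
pass_a = list(it)
pass_b = list(it)         # a new generator is created, so the data is back

print(first_pass)   # [['first', 'sentence'], ['second', 'sentence']]
print(second_pass)  # []
print(pass_a == pass_b)  # True
```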

In [4]:
sentences = MySentences('./data/') # a memory-friendly iterator
print(sentences)
print(list(sentences))

<__main__.MySentences object at 0x000002648F5C97F0>
[['first'], ['sentence'], ['second'], ['sentence']]
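
Each token shows up as its own sentence in the output above because cell In [2] writes one word per line, while MySentences treats every line as a sentence. A sketch of a variant that writes one sentence per line follows; the scratch directory from tempfile and the helper read_sentences are assumptions made for this illustration, not part of the original notebook.

```python
import os
import tempfile

toy_sentences = [['first', 'sentence'], ['second', 'sentence']]
data_dir = tempfile.mkdtemp()  # scratch directory standing in for ./data/

# write each sentence on a single line, tokens separated by spaces
for i, sentence in enumerate(toy_sentences):
    fname = os.path.join(data_dir, 'f%d.txt' % (i + 1))
    with open(fname, 'w') as fout:
        fout.write(' '.join(sentence) + '\n')


def read_sentences(dirname):
    """Same idea as MySentences.__iter__: one list of tokens per line."""
    for fname in sorted(os.listdir(dirname)):
        with open(os.path.join(dirname, fname)) as fin:
            for line in fin:
                yield line.split()


restored = list(read_sentences(data_dir))
print(restored)  # [['first', 'sentence'], ['second', 'sentence']]
```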


In [5]:
# generate the Word2Vec model
model = gensim.models.Word2Vec(sentences, min_count=1)

2017-01-07 10:56:10,956 : INFO : collecting all words and their counts
2017-01-07 10:56:10,957 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-01-07 10:56:10,958 : INFO : collected 3 word types from a corpus of 4 raw words and 4 sentences
2017-01-07 10:56:10,959 : INFO : Loading a fresh vocabulary
2017-01-07 10:56:10,959 : INFO : min_count=1 retains 3 unique words (100% of original 3, drops 0)
2017-01-07 10:56:10,960 : INFO : min_count=1 leaves 4 word corpus (100% of original 4, drops 0)
2017-01-07 10:56:10,960 : INFO : deleting the raw counts dictionary of 3 items
2017-01-07 10:56:10,961 : INFO : sample=0.001 downsamples 3 most-common words
2017-01-07 10:56:10,962 : INFO : downsampling leaves estimated 0 word corpus (5.7% of prior 4)
2017-01-07 10:56:10,962 : INFO : estimated required memory for 3 words and 100 dimensions: 3900 bytes
2017-01-07 10:56:10,963 : INFO : resetting layer weights
2017-01-07 10:56:10,964 : INFO : training model with 3 workers o

In [6]:
print(model)
print(model.wv.vocab)

Word2Vec(vocab=3, size=100, alpha=0.025)
{'first': <gensim.models.word2vec.Vocab object at 0x000002648F5C97B8>, 'second': <gensim.models.word2vec.Vocab object at 0x000002648F5C9898>, 'sentence': <gensim.models.word2vec.Vocab object at 0x000002648F5C96A0>}


class gensim.models.word2vec.Word2Vec(sentences=None, size=100, alpha=0.025, window=5, min_count=5, max_vocab_size=None, sample=0.001, seed=1, workers=3, min_alpha=0.0001, sg=0, hs=0, negative=5, cbow_mean=1, hashfxn=<built-in function hash>, iter=5, null_word=0, trim_rule=None, sorted_vocab=1, batch_words=10000)

sg defines the training algorithm. By default (sg=0), CBOW is used. Otherwise (sg=1), skip-gram is employed.

size is the dimensionality of the feature vectors.

window is the maximum distance between the current and predicted word within a sentence.

alpha is the initial learning rate (will linearly drop to min_alpha as training progresses).

seed = for the random number generator. Initial vectors for each word are seeded with a hash of the concatenation of word + str(seed). Note that for a fully deterministically-reproducible run, you must also limit the model to a single worker thread, to eliminate ordering jitter from OS thread scheduling. (In Python 3, reproducibility between interpreter launches also requires use of the PYTHONHASHSEED environment variable to control hash randomization.)

min_count = ignore all words with total frequency lower than this.

max_vocab_size = limit RAM during vocabulary building; if there are more unique words than this, then prune the infrequent ones. Every 10 million word types need about 1GB of RAM. Set to None for no limit (default).

sample = threshold for configuring which higher-frequency words are randomly downsampled; default is 1e-3, useful range is (0, 1e-5).

workers = use this many worker threads to train the model (=faster training with multicore machines).

hs = if 1, hierarchical softmax will be used for model training. If set to 0 (default), and negative is non-zero, negative sampling will be used.

negative = if > 0, negative sampling will be used, the int for negative specifies how many “noise words” should be drawn (usually between 5-20). Default is 5. If set to 0, no negative sampling is used.

cbow_mean = if 0, use the sum of the context word vectors. If 1 (default), use the mean. Only applies when cbow is used.

hashfxn = hash function to use to randomly initialize weights, for increased training reproducibility. Default is Python’s rudimentary built in hash function.

iter = number of iterations (epochs) over the corpus. Default is 5.

trim_rule = vocabulary trimming rule, specifies whether certain words should remain in the vocabulary, be trimmed away, or handled using the default (discard if word count < min_count). Can be None (min_count will be used), or a callable that accepts parameters (word, count, min_count) and returns either utils.RULE_DISCARD, utils.RULE_KEEP or utils.RULE_DEFAULT. Note: The rule, if given, is only used to prune vocabulary during build_vocab() and is not stored as part of the model.

sorted_vocab = if 1 (default), sort the vocabulary by descending frequency before assigning word indexes.

batch_words = target size (in words) for batches of examples passed to worker threads (and thus cython routines). Default is 10000. (Larger batches will be passed if individual texts are longer than 10000 words, but the standard cython code truncates to that maximum.)
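
To see what the sample parameter does, here is a rough sketch of the downsampling estimate for the toy corpus. The keep-probability formula is an assumption about gensim's internals rather than documented API; it is used here because it reproduces the "5.7% of prior 4" figure in the toy training logs. keep_probability is a hypothetical helper name.

```python
import math


def keep_probability(count, total_words, sample=0.001):
    """Assumed formula: probability of keeping one occurrence of a word."""
    threshold_count = sample * total_words
    prob = (math.sqrt(count / threshold_count) + 1) * (threshold_count / count)
    return min(prob, 1.0)  # rare words are never downsampled


# word counts of the toy corpus: 4 raw words, 3 word types
counts = {'first': 1, 'second': 1, 'sentence': 2}
total = sum(counts.values())

# expected number of word occurrences that survive downsampling
estimated = sum(c * keep_probability(c, total) for c in counts.values())
ratio = estimated / total

print('%.1f%% of prior %d' % (100 * ratio, total))  # 5.7% of prior 4
```

On a corpus this small every word counts as "high-frequency", which is why almost everything is thrown away; on a realistic corpus only the most common words are affected.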

In [7]:
new_model = gensim.models.Word2Vec(min_count=1)  # an empty model, no training
new_model.build_vocab(sentences)                 # can be a non-repeatable, 1-pass generator     
new_model.train(sentences)                       # can be a non-repeatable, 1-pass generator

2017-01-07 10:56:27,908 : INFO : collecting all words and their counts
2017-01-07 10:56:27,909 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-01-07 10:56:27,910 : INFO : collected 3 word types from a corpus of 4 raw words and 4 sentences
2017-01-07 10:56:27,911 : INFO : Loading a fresh vocabulary
2017-01-07 10:56:27,911 : INFO : min_count=1 retains 3 unique words (100% of original 3, drops 0)
2017-01-07 10:56:27,912 : INFO : min_count=1 leaves 4 word corpus (100% of original 4, drops 0)
2017-01-07 10:56:27,912 : INFO : deleting the raw counts dictionary of 3 items
2017-01-07 10:56:27,913 : INFO : sample=0.001 downsamples 3 most-common words
2017-01-07 10:56:27,914 : INFO : downsampling leaves estimated 0 word corpus (5.7% of prior 4)
2017-01-07 10:56:27,914 : INFO : estimated required memory for 3 words and 100 dimensions: 3900 bytes
2017-01-07 10:56:27,915 : INFO : resetting layer weights
2017-01-07 10:56:27,916 : INFO : training model with 3 workers o

0

build_vocab(sentences, keep_raw_vocab=False, trim_rule=None, progress_per=10000, update=False)

Build vocabulary from a sequence of sentences (can be a once-only generator stream). 

Each sentence must be a list of unicode strings.



In [8]:
print(new_model)
print(model.wv.vocab)

Word2Vec(vocab=3, size=100, alpha=0.025)
{'first': <gensim.models.word2vec.Vocab object at 0x000002648F5C97B8>, 'second': <gensim.models.word2vec.Vocab object at 0x000002648F5C9898>, 'sentence': <gensim.models.word2vec.Vocab object at 0x000002648F5C96A0>}


---

## More data would be nice

In [11]:
# Set file names for train and test data
test_data_dir = '{}'.format(os.sep).join([gensim.__path__[0], 'test', 'test_data']) + os.sep
print(os.sep)
print(gensim.__path__[0])

lee_train_file = test_data_dir + 'lee_background.cor'

\
C:\Users\lisrba\Anaconda3\lib\site-packages\gensim


1. os.sep is the pathname separator of the current operating system ('/' on POSIX systems, '\' on Windows).

2. import os
print(os.sep)
The output will be '\' or '/', depending on the operating system you are using.

3. Using os.sep instead of a hard-coded separator makes your code more portable.
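
The same directory string can also be built with os.path.join, which inserts os.sep itself. A small sketch; the base path is a placeholder standing in for gensim.__path__[0].

```python
import os

# placeholder standing in for gensim.__path__[0]
base = os.path.join('site-packages', 'gensim')

# joining by hand with os.sep, as in cell In [11]
manual = os.sep.join([base, 'test', 'test_data']) + os.sep

# os.path.join does the same; the empty last component adds the trailing separator
joined = os.path.join(base, 'test', 'test_data', '')

print(manual == joined)  # True
```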

In [12]:
class MyText(object):
    def __iter__(self):
        for line in open(lee_train_file):
            # assume there's one document per line, tokens separated by whitespace
            yield line.lower().split()

sentences = MyText()

print(sentences)

<__main__.MyText object at 0x000002648F5C9630>


## TRAINING

In [13]:
# default value of min_count=5
model = gensim.models.Word2Vec(sentences, min_count=10)

2017-01-07 10:59:57,388 : INFO : collecting all words and their counts
2017-01-07 10:59:57,390 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-01-07 10:59:57,414 : INFO : collected 10186 word types from a corpus of 59890 raw words and 300 sentences
2017-01-07 10:59:57,415 : INFO : Loading a fresh vocabulary
2017-01-07 10:59:57,422 : INFO : min_count=10 retains 806 unique words (7% of original 10186, drops 9380)
2017-01-07 10:59:57,427 : INFO : min_count=10 leaves 40964 word corpus (68% of original 59890, drops 18926)
2017-01-07 10:59:57,434 : INFO : deleting the raw counts dictionary of 10186 items
2017-01-07 10:59:57,435 : INFO : sample=0.001 downsamples 54 most-common words
2017-01-07 10:59:57,436 : INFO : downsampling leaves estimated 26224 word corpus (64.0% of prior 40964)
2017-01-07 10:59:57,436 : INFO : estimated required memory for 806 words and 100 dimensions: 1047800 bytes
2017-01-07 10:59:57,439 : INFO : resetting layer weights
2017-01-07 10:5

In [14]:
# default value of size=100
model = gensim.models.Word2Vec(sentences, size=200)

2017-01-07 10:59:59,686 : INFO : collecting all words and their counts
2017-01-07 10:59:59,687 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-01-07 10:59:59,709 : INFO : collected 10186 word types from a corpus of 59890 raw words and 300 sentences
2017-01-07 10:59:59,710 : INFO : Loading a fresh vocabulary
2017-01-07 10:59:59,719 : INFO : min_count=5 retains 1723 unique words (16% of original 10186, drops 8463)
2017-01-07 10:59:59,721 : INFO : min_count=5 leaves 46858 word corpus (78% of original 59890, drops 13032)
2017-01-07 10:59:59,729 : INFO : deleting the raw counts dictionary of 10186 items
2017-01-07 10:59:59,730 : INFO : sample=0.001 downsamples 49 most-common words
2017-01-07 10:59:59,731 : INFO : downsampling leaves estimated 32849 word corpus (70.1% of prior 46858)
2017-01-07 10:59:59,732 : INFO : estimated required memory for 1723 words and 200 dimensions: 3618300 bytes
2017-01-07 10:59:59,739 : INFO : resetting layer weights
2017-01-07 10:
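
The "estimated required memory" figures in the training logs can be reproduced with simple arithmetic. The constants below (about 500 bytes of bookkeeping per vocabulary word, plus two float32 matrices of vocab_size x vector_size) are inferred from the logged numbers and should be read as an assumption about gensim's estimate, not as documented behaviour. estimate_memory is a hypothetical helper.

```python
def estimate_memory(vocab_size, vector_size):
    """Rough estimate in bytes, assuming negative sampling (hs=0, negative>0)."""
    vocab_bytes = vocab_size * 500                  # assumed per-word overhead
    input_bytes = vocab_size * vector_size * 4      # float32 word vectors
    output_bytes = vocab_size * vector_size * 4     # float32 output-layer weights
    return vocab_bytes + input_bytes + output_bytes


print(estimate_memory(3, 100))     # 3900     (toy corpus)
print(estimate_memory(806, 100))   # 1047800  (min_count=10)
print(estimate_memory(1723, 200))  # 3618300  (size=200)
```

Doubling size roughly doubles the matrix part of the estimate, while raising min_count shrinks every term because fewer words are kept.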

## EVALUATING

In [56]:
model.accuracy(test_data_dir +'questions-words.txt')

2017-01-06 20:42:49,043 : INFO : family: 0.0% (0/2)
2017-01-06 20:42:49,064 : INFO : gram3-comparative: 0.0% (0/12)
2017-01-06 20:42:49,086 : INFO : gram4-superlative: 0.0% (0/12)
2017-01-06 20:42:49,103 : INFO : gram5-present-participle: 0.0% (0/20)
2017-01-06 20:42:49,114 : INFO : gram6-nationality-adjective: 0.0% (0/20)
2017-01-06 20:42:49,126 : INFO : gram7-past-tense: 0.0% (0/20)
2017-01-06 20:42:49,143 : INFO : gram8-plural: 0.0% (0/12)
2017-01-06 20:42:49,151 : INFO : total: 0.0% (0/98)


[{'correct': [], 'incorrect': [], 'section': 'capital-common-countries'},
 {'correct': [], 'incorrect': [], 'section': 'capital-world'},
 {'correct': [], 'incorrect': [], 'section': 'currency'},
 {'correct': [], 'incorrect': [], 'section': 'city-in-state'},
 {'correct': [],
  'incorrect': [('HE', 'SHE', 'HIS', 'HER'), ('HIS', 'HER', 'HE', 'SHE')],
  'section': 'family'},
 {'correct': [], 'incorrect': [], 'section': 'gram1-adjective-to-adverb'},
 {'correct': [], 'incorrect': [], 'section': 'gram2-opposite'},
 {'correct': [],
  'incorrect': [('GOOD', 'BETTER', 'GREAT', 'GREATER'),
   ('GOOD', 'BETTER', 'LONG', 'LONGER'),
   ('GOOD', 'BETTER', 'LOW', 'LOWER'),
   ('GREAT', 'GREATER', 'LONG', 'LONGER'),
   ('GREAT', 'GREATER', 'LOW', 'LOWER'),
   ('GREAT', 'GREATER', 'GOOD', 'BETTER'),
   ('LONG', 'LONGER', 'LOW', 'LOWER'),
   ('LONG', 'LONGER', 'GOOD', 'BETTER'),
   ('LONG', 'LONGER', 'GREAT', 'GREATER'),
   ('LOW', 'LOWER', 'GOOD', 'BETTER'),
   ('LOW', 'LOWER', 'GREAT', 'GREATER'),
   (

In [41]:
model.evaluate_word_pairs(test_data_dir +'wordsim353.tsv')

2017-01-06 19:43:59,264 : INFO : Pearson correlation coefficient against C:\Users\Student\Anaconda3\lib\site-packages\gensim\test\test_data\wordsim353.tsv: 0.1265
2017-01-06 19:43:59,265 : INFO : Spearman rank-order correlation coefficient against C:\Users\Student\Anaconda3\lib\site-packages\gensim\test\test_data\wordsim353.tsv: 0.1470
2017-01-06 19:43:59,266 : INFO : Pairs with unknown words ratio: 85.6%


((0.12648249887163787, 0.376463016725192),
 SpearmanrResult(correlation=0.14698495338762813, pvalue=0.3033560185328007),
 85.55240793201133)

## Storing and loading models

In [43]:
from tempfile import mkstemp
fs, temp_path = mkstemp("gensim_temp")  # creates a temp file
model.save(temp_path)  # save the model

2017-01-06 20:08:56,721 : INFO : saving Word2Vec object under C:\Users\Student\AppData\Local\Temp\tmpe5ivks3bgensim_temp, separately None
2017-01-06 20:08:56,721 : INFO : not storing attribute cum_table
2017-01-06 20:08:56,721 : INFO : not storing attribute syn0norm
2017-01-06 20:08:56,775 : INFO : saved C:\Users\Student\AppData\Local\Temp\tmpe5ivks3bgensim_temp


tempfile.mkstemp([suffix=''[, prefix='tmp'[, dir=None[, text=False]]]])

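mkstemp creates a temporary file, and that is all it does. Calling tempfile.mkstemp returns a two-element tuple: the first element is an OS-level handle (file descriptor) to the open file, and the second element is the path of the temporary file. The suffix and prefix parameters set the suffix and prefix of the temporary file's name; dir specifies the directory in which the file is created, and if it is not given, the location is taken from the TMPDIR, TEMP, or TMP environment variables; text specifies whether the file is opened in text mode and defaults to False, meaning binary mode.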
    
http://www.cnblogs.com/captain_jack/archive/2011/01/19/1939555.html
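
A minimal sketch of the full mkstemp life cycle, using only the standard library. The suffix matches the one used in cell In [43]; the text written to the file is a placeholder.

```python
import os
from tempfile import mkstemp

# mkstemp returns (file descriptor, path); the file already exists and is open
fd, temp_path = mkstemp(suffix="gensim_temp")

try:
    # wrap the raw descriptor in a file object; leaving the with block closes it
    with os.fdopen(fd, 'w') as fout:
        fout.write('placeholder')
    existed = os.path.exists(temp_path)
    has_suffix = temp_path.endswith('gensim_temp')
finally:
    # mkstemp never deletes the file; cleaning up is the caller's job
    os.remove(temp_path)

gone = not os.path.exists(temp_path)
print(existed, has_suffix, gone)  # True True True
```

This is why the notebook later calls os.close(fs) and os.remove(temp_path): the descriptor and the file both stay around until they are released explicitly.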

In [44]:
new_model = gensim.models.Word2Vec.load(temp_path) 

2017-01-06 20:09:08,989 : INFO : loading Word2Vec object from C:\Users\Student\AppData\Local\Temp\tmpe5ivks3bgensim_temp
2017-01-06 20:09:09,038 : INFO : loading wv recursively from C:\Users\Student\AppData\Local\Temp\tmpe5ivks3bgensim_temp.wv.* with mmap=None
2017-01-06 20:09:09,038 : INFO : setting ignored attribute cum_table to None
2017-01-06 20:09:09,039 : INFO : setting ignored attribute syn0norm to None
2017-01-06 20:09:09,040 : INFO : loaded C:\Users\Student\AppData\Local\Temp\tmpe5ivks3bgensim_temp


## Online training / Resuming training

In [45]:
model = gensim.models.Word2Vec.load(temp_path)
more_sentences = [['Advanced', 'users', 'can', 'load', 'a', 'model', 'and', 'continue', 
                  'training', 'it', 'with', 'more', 'sentences']]
model.build_vocab(more_sentences, update=True)
model.train(more_sentences)

# cleaning up temp
os.close(fs)
os.remove(temp_path)

2017-01-06 20:15:22,227 : INFO : loading Word2Vec object from C:\Users\Student\AppData\Local\Temp\tmpe5ivks3bgensim_temp
2017-01-06 20:15:22,273 : INFO : loading wv recursively from C:\Users\Student\AppData\Local\Temp\tmpe5ivks3bgensim_temp.wv.* with mmap=None
2017-01-06 20:15:22,274 : INFO : setting ignored attribute cum_table to None
2017-01-06 20:15:22,274 : INFO : setting ignored attribute syn0norm to None
2017-01-06 20:15:22,274 : INFO : loaded C:\Users\Student\AppData\Local\Temp\tmpe5ivks3bgensim_temp
2017-01-06 20:15:22,278 : INFO : collecting all words and their counts
2017-01-06 20:15:22,279 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-01-06 20:15:22,280 : INFO : collected 13 word types from a corpus of 13 raw words and 1 sentences
2017-01-06 20:15:22,281 : INFO : Updating model with new vocabulary
2017-01-06 20:15:22,282 : INFO : New added 0 unique words (0% of original 13)
                        and increased the count of 0 pre-existing wo

## Using the model

In [52]:
model.most_similar(positive=['human', 'crime'], negative=['party'], topn=10)

[('area', 0.9950318932533264),
 ('bombings', 0.9949900507926941),
 ("that's", 0.994957447052002),
 ('pacific', 0.9949444532394409),
 ('boxing', 0.9949276447296143),
 ('12', 0.9949259757995605),
 ('bush', 0.9949249625205994),
 ('me', 0.9949047565460205),
 ('action', 0.9949044585227966),
 ('militant', 0.9949022531509399)]

most_similar(positive=[], negative=[], topn=10, restrict_vocab=None, indexer=None)

Find the top-N most similar words. Positive words contribute positively towards the similarity, negative words negatively.

This method computes cosine similarity between a simple mean of the projection weight vectors of the given words and the vectors for each word in the model. The method corresponds to the word-analogy and distance scripts in the original word2vec implementation.

If topn is False, most_similar returns the vector of similarity scores.

restrict_vocab is an optional integer which limits the range of vectors which are searched for most-similar values. For example, restrict_vocab=10000 would only check the first 10000 word vectors in the vocabulary order. (This may be meaningful if you’ve sorted the vocabulary by descending frequency.)
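
The computation described above (cosine similarity against the mean of the query vectors) can be sketched in plain Python. The two-dimensional toy vectors and the helper names are hypothetical; this is an illustration of the idea, not gensim's implementation.

```python
import math


def unit(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sketch_most_similar(vectors, positive, negative=(), topn=10):
    """Hypothetical helper: rank words by cosine similarity to the query mean."""
    weighted = [(w, 1.0) for w in positive] + [(w, -1.0) for w in negative]
    dim = len(next(iter(vectors.values())))
    mean = [0.0] * dim
    # positive words add their unit vector, negative words subtract it
    for word, weight in weighted:
        for i, x in enumerate(unit(vectors[word])):
            mean[i] += weight * x / len(weighted)
    mean = unit(mean)
    query_words = set(positive) | set(negative)
    scores = [(w, dot(mean, unit(v)))
              for w, v in vectors.items() if w not in query_words]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# made-up 2-D vectors, only for illustration
toy_vectors = {
    'cat': [1.0, 0.1],
    'dog': [0.9, 0.2],
    'car': [0.1, 1.0],
    'bus': [0.2, 0.9],
}

ranked = sketch_most_similar(toy_vectors, positive=['cat'], topn=2)
print([word for word, _ in ranked])  # ['dog', 'bus']
```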

In [54]:
model.doesnt_match("input is lunch he sentence cat".split())

model.doesnt_match("cut appear have tennis".split())

'tennis'

doesnt_match(words)
Which word from the given list doesn’t go with the others?

Example:
trained_model.doesnt_match("breakfast cereal dinner lunch".split())
'cereal'
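
One way to think about doesnt_match is "the word whose vector is least similar to the average of the group". The sketch below follows that idea with made-up two-dimensional vectors; sketch_doesnt_match is a hypothetical helper and the details of gensim's own computation may differ.

```python
import math


def unit(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def sketch_doesnt_match(vectors, words):
    """Hypothetical helper: return the word farthest from the group mean."""
    normed = {w: unit(vectors[w]) for w in words}
    # component-wise mean of the unit vectors, rescaled to unit length
    mean = unit([sum(col) / len(words) for col in zip(*normed.values())])

    def similarity_to_mean(word):
        return sum(a * b for a, b in zip(normed[word], mean))

    return min(words, key=similarity_to_mean)


# made-up 2-D vectors, only for illustration
toy_vectors = {
    'cat': [1.0, 0.1],
    'dog': [0.9, 0.2],
    'car': [0.1, 1.0],
}

odd_one = sketch_doesnt_match(toy_vectors, ['cat', 'dog', 'car'])
print(odd_one)  # car
```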

In [51]:
print(model.similarity('human', 'party'))
print(model.similarity('tree', 'murder'))
print(model.similarity('royal', 'commission'))

0.999567357806
0.997924129753
0.999888551396


The results here don't look good because the training corpus is very small. To get meaningful results one needs to train on 500k+ words.

If you need the raw output vectors in your application, you can access these either on a word-by-word basis:

In [65]:
model['tree']  # raw NumPy vector of a word

array([-0.01127922, -0.03585431, -0.0125974 , -0.04147318,  0.00118686,
       -0.01606489, -0.00240359, -0.01758233, -0.03929804, -0.00437588,
       -0.02706588, -0.04878673,  0.00651546, -0.00814761, -0.01902641,
       -0.00802842, -0.02677576, -0.00783781, -0.00248508, -0.03722271,
       -0.0401651 ,  0.02337405,  0.026868  , -0.00215164, -0.02530951,
       -0.04220852, -0.00314844, -0.02795812,  0.04118952,  0.02010571,
       -0.06484023, -0.03345084, -0.06004576,  0.02802508,  0.03755744,
       -0.00150028,  0.02468759,  0.05103324,  0.00318254, -0.01412363,
       -0.02588135, -0.04607787, -0.00969513, -0.01094116, -0.02884848,
        0.0235644 , -0.01119853, -0.02832495,  0.01450322,  0.05574993,
        0.02936582, -0.00019662, -0.00673731, -0.01590701, -0.01878649,
       -0.02997692, -0.02445791, -0.04906164, -0.00614071, -0.0465661 ,
       -0.009026  , -0.01483737, -0.00262535, -0.02153099, -0.04723297,
        0.00348318, -0.00915821,  0.0257897 ,  0.05443203, -0.02

In [55]:
len(model['tree'])

200

…or en masse as a 2D NumPy matrix from model.syn0.

commit and push:
https://github.com/isetbio/jupyter-hub-oauth-isetbio/wiki/Pushing-and-Pulling-Notebooks