# gensim – Topic Modelling in Python

https://radimrehurek.com/gensim/tutorial.html

https://github.com/RaRe-Technologies/gensim

In [1]:
# !pip install -U gensim
!pip freeze | grep gensim

gensim==3.7.2


# Training Word2Vec Vectors
https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html#sphx-glr-auto-examples-tutorials-run-word2vec-py

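In case you missed the buzz, word2vec is a widely known member of the "new wave" of machine learning algorithms based on neural networks, commonly referred to as "deep learning" (although word2vec itself is rather shallow). Using large amounts of unannotated plain text, word2vec learns relationships between words automatically. The output is a set of vectors, one per word, with remarkable linear relationships that allow us to do things like: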

    vec(“king”) - vec(“man”) + vec(“woman”) =~ vec(“queen”)

    vec(“Montreal Canadiens”) - vec(“Montreal”) + vec(“Toronto”) =~ vec(“Toronto Maple Leafs”)
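
The vector arithmetic above can be illustrated with a minimal, standard-library sketch. The 2-D vectors are hand-made, hypothetical values (not learned by any model), and `cosine` and `analogy` are illustrative helpers rather than gensim functions.

```python
import math

# Hand-made 2-D toy vectors (hypothetical values, not learned by any model).
# Axis 0 loosely encodes "royalty", axis 1 loosely encodes "gender".
toy_vectors = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.1, 1.0],
    "woman": [0.1, -1.0],
    "apple": [-0.5, 0.2],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(positive, negative, vectors):
    """Return the word closest to sum(positive) - sum(negative), excluding the inputs."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


# vec("king") - vec("man") + vec("woman")
result = analogy(positive=["king", "woman"], negative=["man"], vectors=toy_vectors)
print(result)
```

With real embeddings the match is only approximate, which is why the notation uses `=~` rather than `=`.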

Word2vec is very useful in automatic text tagging, recommender systems and machine translation.

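This tutorial:

     Introduces Word2Vec as an improvement over traditional bag-of-words

     Shows off a demo of Word2Vec using a pre-trained model

     Demonstrates training a new model from your own data

     Demonstrates loading and saving models

     Introduces several training parameters and demonstrates their effect

     Discusses memory requirements

     Visualizes Word2Vec embeddings by applying dimensionality reduction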

## Introducing: the Word2Vec Model
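Word2Vec is a more recent model that embeds words in a lower-dimensional vector space using a shallow neural network. The result is a set of word vectors in which vectors close together in the vector space have similar meanings based on context, while vectors far apart have differing meanings. For example, strong and powerful would be close together, and strong and Paris would be relatively far apart.

There are two versions of this model, and the Word2Vec class implements both: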


    Skip-grams (SG)

    Continuous-bag-of-words (CBOW)
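
The difference between the two versions shows up in how training examples are formed: skip-gram predicts each context word from the center word, while CBOW predicts the center word from its surrounding context. The sketch below only builds the training examples and assumes a fixed window; it ignores details of real word2vec training such as subsampling and randomly shrunk windows. The helper names are hypothetical. In gensim the choice between the two is made with the `sg` argument of `Word2Vec` (`sg=1` for skip-gram, otherwise CBOW).

```python
def skipgram_pairs(tokens, window):
    """Skip-gram: one (center, context) pair per context word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window):
    """CBOW: one (context words, center) example per position."""
    examples = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        if context:
            examples.append((context, center))
    return examples


sentence = ["the", "cat", "sat"]
sg_pairs = skipgram_pairs(sentence, window=1)
cbow = cbow_examples(sentence, window=1)
print(sg_pairs)
print(cbow)
```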

## Word2Vec Demo

In [3]:
# import gensim.downloader as api
# wv = api.load('word2vec-google-news-300')

## Training Your Own Model
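First, you need some data to train a model on. For the following examples we will use the Lee Corpus (which you already have if you have installed gensim). This corpus is small enough to fit entirely in memory, but we will implement a memory-friendly iterator that reads it line by line, to demonstrate how you would handle a larger corpus.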

In [24]:
from gensim.test.utils import datapath
from gensim import utils

class MyCorpus(object):
    """An iterator that yields sentences (lists of str)."""

    def __iter__(self):
        corpus_path = datapath('lee_background.cor')
        # assume there's one document per line, tokens separated by whitespace
        with open(corpus_path) as corpus_file:
            for line in corpus_file:
                yield utils.simple_preprocess(line)

In [13]:
utils.simple_preprocess('Hundreds of people have been forced to vacate their homes')

['hundreds',
 'of',
 'people',
 'have',
 'been',
 'forced',
 'to',
 'vacate',
 'their',
 'homes']

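If we wanted to do any custom preprocessing, e.g. decode a non-standard encoding, lowercase, remove numbers, extract named entities... all of this can be done inside the MyCorpus iterator, and word2vec doesn't need to know. All that is required is that the input yields one sentence (a list of utf8 words) after another.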
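
As a rough illustration of custom preprocessing inside the iterator, here is a standard-library sketch. `LineCorpus` and the throwaway demo file are hypothetical, and the regex tokenizer is a stand-in for `utils.simple_preprocess`, not a reimplementation of it. The file is opened inside `__iter__` so the corpus can be iterated more than once, which matters because training typically makes several passes over the data (a plain generator would be exhausted after the first pass).

```python
import os
import re
import tempfile


class LineCorpus:
    """Restartable iterable: yields one list of tokens per non-empty line."""

    def __init__(self, path, encoding="utf-8"):
        self.path = path
        self.encoding = encoding

    def __iter__(self):
        # Opening the file here (not in __init__) makes every pass start afresh.
        with open(self.path, encoding=self.encoding) as handle:
            for line in handle:
                # Custom preprocessing: lowercase, keep only alphabetic runs
                # (this drops digits and punctuation).
                tokens = re.findall(r"[a-z]+", line.lower())
                if tokens:
                    yield tokens


# Tiny throwaway file standing in for a real corpus.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("Hundreds of people fled in 2019\n\n3 Dogs barked\n")
    demo_path = tmp.name

corpus = LineCorpus(demo_path)
first_pass = list(corpus)
second_pass = list(corpus)  # works again because __iter__ reopens the file
os.remove(demo_path)
print(first_pass)
```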

Let's go ahead and train a model on our corpus. Don't worry about the training parameters for now; we will revisit them later.

In [5]:
import gensim.models

sentences = MyCorpus()
model = gensim.models.Word2Vec(sentences=sentences)

In [34]:
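# With too little text, every word occurs fewer than min_count (default 5) times,
# the vocabulary ends up empty, and training fails with:
# RuntimeError: you must first build vocabulary before training the model
# Repeating a sentence lifts its words above the threshold.
# Separate names are used so the model trained on the Lee Corpus is not overwritten.
toy_sentences = [
    ["cat", "say", "meow"],
    ["cat", "say", "meow"],
    ["cat", "say", "meow"],
    ["cat", "say", "meow"],
    ["cat", "say", "meow"],
    ["cat", "say", "meow"],
    ["dog", "say", "woof"]]
toy_model = gensim.models.Word2Vec(sentences=toy_sentences)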

Once we have a model, we can use it in the same way as in the demo above.

The main part of the model is model.wv, where "wv" stands for "word vectors".

In [8]:
# A common operation is to retrieve the model's vocabulary
for i, word in enumerate(model.wv.vocab):
    if i == 10:
        break
    print(word)

hundreds
of
people
have
been
forced
to
their
homes
in


In [9]:
# We can easily obtain vectors for terms the model is familiar with:
vec_king = model.wv['king']
vec_king

array([ 0.04187758, -0.06847143,  0.02055344,  0.00092365,  0.01805769,
        0.00374368, -0.01640475,  0.00353284, -0.06052319, -0.04967679,
        0.05961818, -0.05473563, -0.0438375 , -0.00711683,  0.02270704,
       -0.02680787,  0.00437799,  0.05661393,  0.01918527,  0.00388391,
       -0.0399648 ,  0.09748787,  0.03946414,  0.02308097, -0.00141993,
        0.00331337,  0.02956577,  0.0243604 , -0.01386652,  0.00333372,
        0.00197411,  0.04006763,  0.02115147, -0.03394582,  0.04414264,
       -0.009897  ,  0.03377845, -0.01650199, -0.0059514 ,  0.03305526,
       -0.00229126,  0.04339327, -0.01369251,  0.00771568,  0.03685286,
        0.06171643,  0.01159823,  0.02902963, -0.07301527,  0.00844696,
       -0.03733594, -0.04698925, -0.01567307,  0.03573236,  0.02323769,
        0.00749972,  0.0036171 ,  0.00112709,  0.01664387,  0.03213069,
        0.01336515, -0.04343333, -0.03845324, -0.0252916 , -0.06374004,
       -0.00650541,  0.05516641, -0.04275392,  0.03187048, -0.04

In [11]:
try:
    vec_cameroon = model.wv['cameroon']
except KeyError:
    print("The word 'cameroon' does not appear in this model")

The word 'cameroon' does not appear in this model


In [20]:
pairs = [
#     ('car', 'minivan'),   # a minivan is a kind of car
#     ('car', 'bicycle'),   # still a wheeled vehicle
#     ('car', 'airplane'),  # ok, no wheels, but still a vehicle
    ('king', 'man'),    # ... and so on
    ('king', 'woman'),
    ('king', 'car'),
]
for w1, w2 in pairs:
    print('%r\t%r\t%.2f' % (w1, w2, model.wv.similarity(w1, w2)))

'king'	'man'	1.00
'king'	'woman'	0.99
'king'	'car'	1.00


In [22]:
# Print the 5 words most similar to the combination of "car" and "man"
print(model.wv.most_similar(positive=['car', 'man'], topn=5))

[('an', 0.9997594952583313), ('from', 0.9997572898864746), ('those', 0.9997518658638), ('people', 0.9997501373291016), ('six', 0.9997469186782837)]


In [23]:
print(model.wv.doesnt_match(['fire', 'water', 'land', 'sea', 'air', 'car']))


land
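
One common way to pick the odd word out is to unit-normalize the vectors, average them, and return the word whose vector has the lowest cosine similarity to that mean. The sketch below follows that idea with hand-made, hypothetical 2-D vectors; `odd_one_out` is an illustrative helper and is only assumed to be close in spirit to `doesnt_match`, not a copy of gensim's implementation.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def odd_one_out(words, vectors):
    """Return the word least similar to the mean of the unit-normalized vectors."""
    normed = {w: unit(vectors[w]) for w in words}
    dim = len(next(iter(normed.values())))
    mean = unit([sum(normed[w][k] for w in words) / len(words) for k in range(dim)])
    # For unit vectors the dot product equals the cosine similarity.
    return min(words, key=lambda w: sum(a * b for a, b in zip(normed[w], mean)))


# Hypothetical toy vectors: three "element" words point one way, "car" another.
toy_vectors = {
    "fire":  [0.90, 0.10],
    "water": [0.80, 0.20],
    "air":   [0.85, 0.15],
    "car":   [0.10, 0.90],
}
odd = odd_one_out(["fire", "water", "air", "car"], toy_vectors)
print(odd)
```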



## Storing and loading models

In [None]:
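import tempfile
with tempfile.NamedTemporaryFile(prefix='gensim-model-', delete=False) as tmp:
    temporary_filepath = tmp.name
model.save(temporary_filepath)
new_model = gensim.models.Word2Vec.load(temporary_filepath)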

## Training Parameters

### min_count
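min_count is for pruning the internal dictionary. Words that appear only once or twice in a billion-word corpus are probably uninteresting typos and garbage. In addition, there is not enough data to train anything meaningful for those words, so it is best to ignore them:

default value of min_count=5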
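
The pruning rule can be sketched with the standard library alone. `prune_vocabulary` is a hypothetical helper that illustrates the threshold; it is not gensim's vocabulary code.

```python
from collections import Counter


def prune_vocabulary(sentences, min_count=5):
    """Keep only tokens whose total frequency is at least min_count."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {token: n for token, n in counts.items() if n >= min_count}


sentences = [["cat", "say", "meow"]] * 6 + [["dog", "say", "woof"]]
kept = prune_vocabulary(sentences, min_count=5)
print(kept)  # "dog" and "woof" occur only once and are dropped

# On a very small corpus nothing survives the default threshold.
tiny = prune_vocabulary([["a", "b"], ["a", "c"]], min_count=5)
print(tiny)
```

An empty result like `tiny` corresponds to the situation in which this notebook hits `RuntimeError: you must first build vocabulary before training the model`.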
### size
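size is the number of dimensions (N) of the N-dimensional space that gensim Word2Vec maps the words onto.

Larger values require more training data, but can lead to better (more accurate) models. Reasonable values are in the tens to hundreds.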

default value of size=100
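### workers
workers, the last of the major parameters (the full list is in the gensim Word2Vec documentation), is for training parallelization, to speed up training:

default value of workers=3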

In [7]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [5]:
from gensim.models import Word2Vec, FastText

In [None]:
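class GensimWrapper:
    """Thin convenience wrapper for training and loading gensim Word2Vec models."""

    @classmethod
    def trainWord2vec(cls, sentences=None, size=100, alpha=0.025, window=5, min_count=5,
                      max_vocab_size=None, sample=1e-3, seed=1, workers=3):
        """
        Train a Word2Vec model.
        :param sentences:
            The sentences iterable can be simply a list of lists of tokens, but for larger corpora,
            consider an iterable that streams the sentences directly from disk/network.
        :param size: Dimensionality of the word vectors.
        :param alpha: Initial learning rate.
        :param window: Window length, i.e. the
            maximum distance between the current and predicted word within a sentence.
        :param min_count: Ignores all words with total frequency lower than this.
        :param max_vocab_size: Limits RAM during vocabulary building; None means no limit.
        :param sample: Threshold for downsampling higher-frequency words.
        :param seed: Seed for the random number generator.
        :param workers:
            Use these many worker threads to train the model (=faster training with multicore machines)
        :return: The trained Word2Vec model.
        """
        # Pass everything by keyword: in gensim 3.7.2 the second positional parameter
        # of Word2Vec is corpus_file, so positional arguments would be misassigned.
        return Word2Vec(sentences=sentences, size=size, alpha=alpha, window=window,
                        min_count=min_count, max_vocab_size=max_vocab_size,
                        sample=sample, seed=seed, workers=workers)

    @classmethod
    def loadWord2vecModel(cls, path):
        """
        Load a previously saved Word2Vec model.
        :param path: Path of the file written by model.save().
        :return: The loaded Word2Vec model.
        """
        return Word2Vec.load(path)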

In [4]:
# Every token occurs fewer than min_count (default 5) times, so the vocabulary is empty
sentences = [["你", "是", "谁"], ["我", "是", "中国人"]]
model = Word2Vec(sentences)

RuntimeError: you must first build vocabulary before training the model

In [6]:
# Same cause with FastText: no token reaches min_count (default 5)
sentences = [["你", "是", "谁"], ["我", "是", "中国人"]]
model = FastText(sentences)

RuntimeError: you must first build vocabulary before training the model

In [None]:
# Look vectors up through model.wv; indexing the model directly is deprecated
model.wv['你']

# TF-IDF with Gensim
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/gensim Quick Start.ipynb

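The raw corpus is tokenized, stripped of stopwords, and mapped to dictionary ids.
Each document can then be converted to a vector
and used as input to a model.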

In [3]:
from gensim.models import TfidfModel

In [None]:
from collections import defaultdict
from gensim import corpora, models

raw_corpus = ["Human machine interface for lab abc computer applications",
              "A survey of user opinion of computer system response time",
              "The EPS user interface management system",
              "System and human system engineering testing of EPS",
              "Relation of user perceived response time to error measurement",
              "The generation of random binary unordered trees",
              "The intersection graph of paths in trees",
              "Graph minors IV Widths of trees and well quasi ordering",
              "Graph minors A survey"]
# Create a set of frequent words
stoplist = set('for a of the and to in'.split(' '))
# Lowercase each document, split it by white space and filter out stopwords
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in raw_corpus]
# Count word frequencies
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

# Only keep words that appear more than once
processed_corpus = [[token for token in text if frequency[token] > 1] for text in texts]
print(processed_corpus)
dictionary = corpora.Dictionary(processed_corpus)
print(dictionary)  # Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...)
print(dictionary.token2id)
"""
{'computer': 0, 'human': 1, 'interface': 2, 'response': 3, 'survey': 4, 'system': 5, 'time': 6, 'user': 7, 'eps': 8, 'trees': 9, 'graph': 10, 'minors': 11}
"""
new_doc = "Human computer interaction"
new_vec = dictionary.doc2bow(new_doc.lower().split())
print(new_vec)  # [(0, 1), (1, 1)]
"""
The first entry in each tuple corresponds to the ID of the token in the dictionary, the second corresponds to the count of this token.
Note that "interaction" did not occur in the original corpus and so it was not included in the vectorization. 
"""
bow_corpus = [dictionary.doc2bow(text) for text in processed_corpus]
print(bow_corpus)
# train the model
tfidf = models.TfidfModel(bow_corpus)
# transform the "system minors" string
print(tfidf[dictionary.doc2bow("system minors".lower().split())])  # [(5, 0.5898341626740045), (11, 0.8075244024440723)]
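
To see where the two weights for "system minors" come from, the computation can be reproduced by hand with the standard library. This is a sketch that assumes gensim's default weighting scheme (raw term count, idf = log2(N / document frequency), followed by L2 normalization); the `tfidf` helper is hypothetical and keys the result by token instead of by dictionary id.

```python
import math
from collections import Counter

raw_corpus = ["Human machine interface for lab abc computer applications",
              "A survey of user opinion of computer system response time",
              "The EPS user interface management system",
              "System and human system engineering testing of EPS",
              "Relation of user perceived response time to error measurement",
              "The generation of random binary unordered trees",
              "The intersection graph of paths in trees",
              "Graph minors IV Widths of trees and well quasi ordering",
              "Graph minors A survey"]

# Same preprocessing as in the cell above: lowercase, split, drop stopwords,
# then keep only tokens that occur more than once in the whole corpus.
stoplist = set("for a of the and to in".split())
texts = [[w for w in doc.lower().split() if w not in stoplist] for doc in raw_corpus]
frequency = Counter(token for text in texts for token in text)
docs = [[token for token in text if frequency[token] > 1] for text in texts]

num_docs = len(docs)
# Document frequency: in how many documents each token appears at least once.
doc_freq = Counter(token for doc in docs for token in set(doc))


def tfidf(tokens):
    """TF-IDF weights for known tokens, L2-normalized (assumed gensim defaults)."""
    tf = Counter(t for t in tokens if t in doc_freq)
    raw = {t: n * math.log2(num_docs / doc_freq[t]) for t, n in tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()} if norm else {}


weights = tfidf("system minors".split())
print(weights)
```

"minors" gets the larger weight because it appears in 2 of the 9 documents, while "system" appears in 3.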


# similarities.docsim – Document similarity queries
https://radimrehurek.com/gensim/similarities/docsim.html