In [3]:
# default_exp engineering.nbdev

%reload_ext autoreload
%autoreload 2

# glove
GloVe: Global Vectors for Word Representation

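GloVe is an unsupervised learning algorithm for obtaining vector representations of words. It is trained on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations show interesting linear substructures of the word vector space.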


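GloVe's training objective is to learn word vectors whose dot product equals the logarithm of the words' co-occurrence probability. Because the logarithm of a ratio equals the difference of logarithms (log(a/b) = log(a) - log(b)), this objective associates (the logarithm of) ratios of co-occurrence probabilities with vector differences in the word vector space. Because these ratios can encode some form of meaning, that information is encoded in the vector differences as well. As a result, the learned word vectors perform very well on word analogy tasks, such as those examined in the word2vec package.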
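
The step from "dot product equals log probability" to "vector difference encodes a ratio" can be written out as a short sketch. The notation is an assumption borrowed from the loss function used later in this notebook: $v_i$ is the vector of word $i$, $P_{ik}$ is the probability that word $k$ appears in the context of word $i$, and bias terms are ignored.

$$v_i^Tv_k \approx \log P_{ik},\qquad v_j^Tv_k \approx \log P_{jk}\quad\Longrightarrow\quad (v_i-v_j)^Tv_k \approx \log P_{ik}-\log P_{jk}=\log\frac{P_{ik}}{P_{jk}}$$

So the difference $v_i - v_j$, projected onto any third word $k$, approximates the log of the ratio of co-occurrence probabilities.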

Official site: https://nlp.stanford.edu/projects/glove/

https://github.com/stanfordnlp/GloVe

## How it works
https://blog.csdn.net/u014665013/article/details/79642083

https://www.cnblogs.com/jfdwd/p/11086914.html

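* First, build a word co-occurrence matrix from the corpus.
* Then, learn word vectors from the co-occurrence matrix with the GloVe model.

**Start -> count the co-occurrence matrix -> train word vectors -> end**

### Counting the co-occurrence matrix
Co-occurrence simply means that one word appears near another. "Near" is defined by a sliding window: fix a radius (the distance from the center word to the edge of the window) and count which words appear within that radius of the center word. An example makes this concrete.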

Suppose the corpus consists of just the following line:

    i love you but you love him i am sad

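With a radius of 2, a window of up to 5 words is centered on each word of the corpus in turn.

Take window 5 as an example (counting windows from 0, so it is centered on the second `love` and contains `but you love him i`): here `love` co-occurs once each with `but`, `you`, `him`, and `i`. Counting this way over all windows gives the co-occurrence count for any pair of words (the relation is generally symmetric), and these counts form the co-occurrence matrix X. In general, X is a symmetric matrix.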
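
The counting procedure can be sketched in a few lines of plain Python. This is a minimal illustration, not the official GloVe `cooccur` tool: the function name `cooccurrence_counts` is made up here, and every co-occurrence is counted as 1 regardless of distance within the window.

```python
from collections import Counter


def cooccurrence_counts(tokens, radius=2):
    """Count how often each (center, context) pair shares a window."""
    counts = Counter()
    for i, center in enumerate(tokens):
        lo = max(0, i - radius)
        hi = min(len(tokens), i + radius + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                counts[(center, tokens[j])] += 1
    return counts


tokens = "i love you but you love him i am sad".split()
X = cooccurrence_counts(tokens, radius=2)

# Window 5 (0-based) is centered on the second "love".
print(tokens[3:8])            # ['but', 'you', 'love', 'him', 'i']
print(X[("love", "him")])     # 1
print(X[("love", "you")], X[("you", "love")])  # 2 2
```

Because the window is symmetric around the center word, `X[(a, b)]` always equals `X[(b, a)]`, which is why the matrix is symmetric.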

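### Training word vectors with the GloVe model

First, the model's loss function looks like this:
$$J=\sum_{i,j=1}^{N}f(X_{ij})\left(v_i^Tv_j+b_i+b_j-\log X_{ij}\right)^2$$

v_i and v_j are the word vectors of words i and j, b_i and b_j are bias (constant) terms, f is a weighting function, and N is the vocabulary size.

X_ij is the number of times word i and word j appear together in one window over the whole corpus.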
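
To make the loss concrete, here is a minimal sketch that evaluates J for given vectors and biases. The notebook does not say which weighting function f is used; the sketch assumes the one from the GloVe paper, f(x) = (x / x_max)^alpha for x < x_max and 1 otherwise, with x_max = 100 and alpha = 0.75. The function names and the toy data are made up for illustration.

```python
import math


def weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f (assumed: the form used in the GloVe paper)."""
    return (x / x_max) ** alpha if x < x_max else 1.0


def glove_loss(X, V, b):
    """Evaluate J for co-occurrence counts X, word vectors V and biases b.

    X maps (i, j) index pairs to counts; pairs with a zero count are skipped
    because log(0) is undefined.
    """
    total = 0.0
    for (i, j), x in X.items():
        if x <= 0:
            continue
        dot = sum(p * q for p, q in zip(V[i], V[j]))
        err = dot + b[i] + b[j] - math.log(x)
        total += weight(x) * err ** 2
    return total


# Toy data: two words that co-occur 4 times, with orthogonal vectors.
X = {(0, 1): 4.0, (1, 0): 4.0}
V = [[1.0, 0.0], [0.0, 1.0]]

loss_zero_bias = glove_loss(X, V, b=[0.0, 0.0])
loss_fitted_bias = glove_loss(X, V, b=[math.log(4.0) / 2] * 2)
print(loss_zero_bias, loss_fitted_bias)
```

With the biases chosen so that b_i + b_j = log(X_ij), every squared error vanishes and the loss drops to zero, which is the state training tries to approach.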

## Using the official GloVe implementation

The demo.sh script downloads a small corpus, consisting of the first 100M characters of Wikipedia. It collects unigram counts, constructs and shuffles cooccurrence data, and trains a simple version of the GloVe model. 

It also runs a word analogy evaluation script in Python to verify word vector quality. More details about training on your own corpus can be found in demo.sh or src/README.md.

Initial parameters

    CORPUS=text8
    VOCAB_FILE=vocab.txt
    COOCCURRENCE_FILE=cooccurrence.bin
    COOCCURRENCE_SHUF_FILE=cooccurrence.shuf.bin
    BUILDDIR=build
    SAVE_FILE=vectors
    VERBOSE=2
    MEMORY=4.0
    VOCAB_MIN_COUNT=5
    VECTOR_SIZE=50
    MAX_ITER=15
    WINDOW_SIZE=15
    
Running the script generates five new files

    cooccurrence.bin 
    cooccurrence.shuf.bin
    vectors.bin
    vectors.txt  # word vectors
    vocab.txt  # vocabulary


In [10]:
!ls /Users/luoyonggui/Documents/temp/GloVe/ 

LICENSE               cooccurrence.shuf.bin text8
Makefile              demo.sh               text8.zip
README.md             eval                  vectors.bin
build                 randomization.test.sh vectors.txt
cooccurrence.bin      src                   vocab.txt


In [1]:
!cd /Users/luoyonggui/Documents/temp/GloVe/ && make

mkdir -p build
gcc -c src/vocab_count.c -o build/vocab_count.o -lm -pthread -O3 -march=native -funroll-loops -Wall -Wextra -Wpedantic
gcc -c src/cooccur.c -o build/cooccur.o -lm -pthread -O3 -march=native -funroll-loops -Wall -Wextra -Wpedantic
gcc -c src/shuffle.c -o build/shuffle.o -lm -pthread -O3 -march=native -funroll-loops -Wall -Wextra -Wpedantic
gcc -c src/glove.c -o build/glove.o -lm -pthread -O3 -march=native -funroll-loops -Wall -Wextra -Wpedantic
gcc -c src/common.c -o build/common.o -lm -pthread -O3 -march=native -funroll-loops -Wall -Wextra -Wpedantic
gcc build/vocab_count.o build/common.o -o build/vocab_count -lm -pthread -O3 -march=native -funroll-loops -Wall -Wextra -Wpedantic
gcc build/cooccur.o build/common.o -o build/cooccur -lm -pthread -O3 -march=native -funroll-loops -Wall -Wextra -Wpedantic
gcc build/shuffle.o build/common.o -o build/shuffle -lm -pthread -O3 -march=native -funroll-loops -Wall -Wextra -Wpedantic
gcc build/glove.o build/common.o -o build/glove -lm

In [6]:
!cd /Users/luoyonggui/Documents/temp/GloVe/ && ./demo.sh

mkdir -p build

$ build/vocab_count -min-count 5 -verbose 2 < text8 > vocab.txt
BUILDING VOCABULARY
Processed 0 tokens. 100000 tokens. 200000 tokens. ... 4400000 tokens. 45000

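## Python implementations of GloVe

PyPI has many GloVe packages, but a lot of them have pitfalls, so I will just describe the one I got working myself: mittens.

Installation is fairly simple; `pip install mittens` generally works without problems. If you want to look at the source code, it is here:

In general, GloVe is split into two main modules: computing the co-occurrence matrix and training the GloVe model.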

In [22]:
!conda install glove_python

Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.

PackagesNotFoundError: The following packages are not available from current channels:

  - glove_python

Current channels:

  - https://repo.anaconda.com/pkgs/main/osx-64
  - https://repo.anaconda.com/pkgs/main/noarch
  - https://repo.anaconda.com/pkgs/r/osx-64
  - https://repo.anaconda.com/pkgs/r/noarch

To search for alternate channels that may provide the conda package you're
looking for, navigate to

    https://anaconda.org

and use the search bar at the top of the page.




In [20]:
!pip install glove_python -U

Collecting glove_python
  Using cached https://files.pythonhosted.org/packages/3e/79/7e7e548dd9dcb741935d031117f4bed133276c2a047aadad42f1552d1771/glove_python-0.1.0.tar.gz
Building wheels for collected packages: glove-python
  Building wheel for glove-python (setup.py) ... error
  ERROR: Complete output from command /Users/luoyonggui/anaconda3/bin/python -u -c 'import setuptools, tokenize;__file__='"'"'/private/var/folders/7j/kgtjln3x2dj2g2v57d5vncyw0000gp/T/pip-install-krt5v7vc/glove-python/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /private/var/folders/7j/kgtjln3x2dj2g2v57d5vncyw0000gp/T/pip-wheel-7a5f_xz3 --python-tag cp37:
  ERROR: running bdist_wheel
  running build
  running build_py
  creating build
  creating build/lib.macosx-10.9-x86_64-3.7
  creating build/lib.macosx-10.9-x86_64-3.7/glove
  copying glove/__init_

## Loading GloVe-trained word vectors with gensim
https://radimrehurek.com/gensim/scripts/glove2word2vec.html

The GloVe word vector format looks like this:

    word1 0.123 0.134 0.532 0.152
    word2 0.934 0.412 0.532 0.159
    word3 0.334 0.241 0.324 0.188
    ...
    word9 0.334 0.241 0.324 0.188

The word2vec word vector format:

    9 4   # this line holds the number of vectors and their dimensionality
    word1 0.123 0.134 0.532 0.152
    word2 0.934 0.412 0.532 0.159
    word3 0.334 0.241 0.324 0.188
    ...
    word9 0.334 0.241 0.324 0.188
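
The two formats differ only in that header line, so the conversion can be sketched by hand. This is an illustration of the idea, not gensim's implementation; the helper name `glove_to_word2vec_lines` is made up, and it assumes every line is a word followed by whitespace-separated numbers.

```python
def glove_to_word2vec_lines(glove_lines):
    """Prepend the '<vector count> <dimension>' header that word2vec expects."""
    rows = [line.rstrip("\n") for line in glove_lines if line.strip()]
    if not rows:
        return ["0 0"]
    # The first field is the word, the remaining fields are the vector.
    dim = len(rows[0].split()) - 1
    return ["{} {}".format(len(rows), dim)] + rows


glove_lines = [
    "word1 0.123 0.134 0.532 0.152\n",
    "word2 0.934 0.412 0.532 0.159\n",
    "word3 0.334 0.241 0.324 0.188\n",
]
w2v_lines = glove_to_word2vec_lines(glove_lines)
print("\n".join(w2v_lines))
```

For the three toy vectors above the header is `3 4`: three vectors of dimension four.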

gensim includes a module that converts GloVe-format word vectors into word2vec format. It is used as follows:

In [14]:
from gensim.test.utils import datapath, get_tmpfile
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

# input file (GloVe format)
glove_file = datapath('/Users/luoyonggui/Documents/temp/GloVe/vectors.txt')
# output file (word2vec format)
tmp_file = get_tmpfile("test_word2vec.txt")

# call glove2word2vec script
# default way (through CLI): python -m gensim.scripts.glove2word2vec --input <glove_file> --output <w2v_file>

# run the conversion
glove2word2vec(glove_file, tmp_file)

# load the converted file
model = KeyedVectors.load_word2vec_format(tmp_file)



In [16]:
type(model)

gensim.models.keyedvectors.Word2VecKeyedVectors

In [17]:
model['the'], model['the'].shape

(array([ 0.346661, -0.657651, -0.323899,  1.404478, -0.681525,  0.332714,
        -0.493434,  0.950834,  0.102952,  0.386339,  0.347187, -0.912068,
        -0.137993, -1.246887,  0.062835, -0.035913, -0.365426, -0.219618,
        -0.971765,  1.701459,  0.939117,  0.349692, -0.533264,  1.316008,
        -0.708478,  0.394686,  0.751405,  0.498659,  1.257347, -2.340782,
        -0.162739,  0.501467,  1.076938, -0.577817,  0.409253,  0.1018  ,
         0.634708, -0.831036,  0.519494,  0.491227, -0.013857,  0.245493,
        -0.521764, -1.030283, -0.486644, -0.087301,  1.220634,  1.356461,
         1.723411,  1.3197  ], dtype=float32), (50,))

# word2vec
https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html#sphx-glr-auto-examples-tutorials-run-word2vec-py

## bag-of-words model
This model transforms each document to a fixed-length vector of integers. For example, given the sentences:

    John likes to watch movies. Mary likes movies too.

    John also likes to watch football games. Mary hates football.

The model outputs the vectors:

    [1, 2, 1, 1, 2, 1, 1, 0, 0, 0, 0]

    [1, 1, 1, 1, 0, 1, 0, 1, 2, 1, 1]
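
The two vectors above can be reproduced with a few lines of plain Python. This is a minimal sketch: the tokenizer simply drops periods and splits on whitespace, and the vocabulary is ordered by first appearance, which is an assumption chosen because it matches the vectors shown.

```python
from collections import Counter

docs = [
    "John likes to watch movies. Mary likes movies too.",
    "John also likes to watch football games. Mary hates football.",
]


def tokenize(text):
    """Very crude tokenizer: drop periods, split on whitespace."""
    return text.replace(".", " ").split()


# Vocabulary in order of first appearance across the documents.
vocab = []
for doc in docs:
    for token in tokenize(doc):
        if token not in vocab:
            vocab.append(token)


def bow_vector(text):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocab]


vectors = [bow_vector(doc) for doc in docs]
print(vocab)
print(vectors[0])  # [1, 2, 1, 1, 2, 1, 1, 0, 0, 0, 0]
print(vectors[1])  # [1, 1, 1, 1, 0, 1, 0, 1, 2, 1, 1]
```

Each position in a vector is the count of one vocabulary word, so word order inside a document has no effect on the result.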

Bag-of-words models are surprisingly effective, but have several weaknesses.

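First, they lose all information about word order: "John likes Mary" and "Mary likes John" correspond to identical vectors. There is a solution: bag of n-grams models consider word phrases of length n to represent documents as fixed-length vectors, which captures local word order, but they suffer from data sparsity and high dimensionality.

Second, the model does not attempt to learn the meaning of the underlying words, so the distance between vectors does not always reflect the difference in meaning.

The Word2Vec model addresses this second problem.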

## the Word2Vec Model
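Word2Vec is a more recent model that embeds words in a lower-dimensional vector space using a shallow neural network. The result is a set of word vectors in which vectors close together in the vector space have similar meanings based on context, and word vectors distant from each other have differing meanings.

There are two versions of this model, and the Word2Vec class implements both: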

![](img/sk01.png)
### Skip-grams (SG)
The Word2Vec skip-gram model takes in pairs (word1, word2) generated by moving a window across text data and trains a 1-hidden-layer neural network on a synthetic task: given an input word, predict a probability distribution over the words near it. A virtual one-hot encoding of words goes through a 'projection layer' to the hidden layer; these projection weights are later interpreted as the word embeddings. So if the hidden layer has 300 neurons, this network gives us 300-dimensional word embeddings.
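
The (word1, word2) pairs mentioned above can be generated with a simple sliding window. This sketch shows only the pair generation, under the assumption of a fixed window; real implementations such as gensim also apply tricks like subsampling that are left out here. The function name `skipgram_pairs` is made up.

```python
def skipgram_pairs(tokens, window=2):
    """Return (input word, nearby word) training pairs for skip-gram."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


pairs = skipgram_pairs("the quick brown fox".split(), window=1)
print(pairs)
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'),
#  ('brown', 'quick'), ('brown', 'fox'), ('fox', 'brown')]
```

Each pair becomes one training example: the first word is the network input, the second is the word it should assign high probability to.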

### Continuous-bag-of-words (CBOW)
Continuous-bag-of-words Word2vec is very similar to the skip-gram model. It is also a 1-hidden-layer neural network. The synthetic training task now uses the average of multiple input context words, rather than a single word as in skip-gram, to predict the center word. Again, the projection weights that turn one-hot words into averageable vectors, of the same width as the hidden layer, are interpreted as the word embeddings.
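
The CBOW input can be sketched the same way: collect the context words around each center word, then average their vectors. The two-dimensional vectors below are made-up toy values, and the helper names are hypothetical; only the averaging step is shown, not the training itself.

```python
def cbow_examples(tokens, window=2):
    """Return (context words, center word) training examples for CBOW."""
    examples = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))
    return examples


def average(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Made-up 2-dimensional projection weights for two words.
toy_vectors = {"the": [1.0, 0.0], "brown": [0.0, 1.0]}

examples = cbow_examples("the quick brown".split(), window=1)
context, center = examples[1]
hidden = average([toy_vectors[word] for word in context])
print(context, center, hidden)  # ['the', 'brown'] quick [0.5, 0.5]
```

The averaged vector plays the role of the hidden layer, from which the network predicts the center word.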

## Python implementation

In [23]:
# !pip install -U gensim
!pip freeze | grep gensim

gensim==3.7.2


### Default parameters

In [None]:
gensim.models.Word2Vec(
    sentences=None,
    corpus_file=None,
    size=100,
    alpha=0.025,
    window=5,
    min_count=5,
    max_vocab_size=None,
    sample=0.001,
    seed=1,
    workers=3,
    min_alpha=0.0001,
    sg=0,
    hs=0,
    negative=5,
    ns_exponent=0.75,
    cbow_mean=1,
    hashfxn=<built-in function hash>,
    iter=5,
    null_word=0,
    trim_rule=None,
    sorted_vocab=1,
    batch_words=10000,
    compute_loss=False,
    callbacks=(),
    max_final_vocab=None,
)

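#### min_count
`min_count` is for pruning the internal dictionary. Words that appear only once or twice in a billion-word corpus are probably uninteresting typos and garbage. In addition, there is not enough data to train these words in any meaningful way, so it is best to ignore them.

The default value is min_count=5.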
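
The effect of this pruning can be sketched with a plain frequency count. This only illustrates the idea; it is not gensim's vocabulary-building code, and `trim_vocab` and the toy sentences are made up.

```python
from collections import Counter


def trim_vocab(sentences, min_count=5):
    """Keep only the words that occur at least min_count times."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}


sentences = [["the", "model", "trains"]] * 5 + [["the", "modle", "trains"]]
kept = trim_vocab(sentences, min_count=5)
print(kept)  # {'the': 6, 'model': 5, 'trains': 6}
```

The one-off typo `modle` falls below the threshold and is dropped, while the frequent words survive.
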
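#### size
`size` is the number of dimensions (N) of the N-dimensional space that gensim Word2Vec maps the words onto.

Larger values require more training data, but can lead to better (more accurate) models. Reasonable values are in the tens to hundreds.

The default value is size=100.
#### workers
`workers`, the last of the major parameters (full list here), is for training parallelization, to speed up training.

The default value is workers=3.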

### Training Your Own Model
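This corpus is small enough to fit entirely in memory, but we will implement a memory-friendly iterator that reads it line by line, to demonstrate how to handle a larger corpus.

If we want to do any custom preprocessing, e.g. decode a non-standard encoding, lowercase, remove numbers, extract named entities, all of this can be done inside the MyCorpus iterator, and word2vec does not need to know. All that is required is that the input yields one sentence (a list of UTF-8 words) after another.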

In [24]:
from gensim import utils
from gensim.test.utils import datapath

class MyCorpus(object):
    """An iterator that yields sentences (lists of str)."""

    def __iter__(self):
        corpus_path = datapath('lee_background.cor')
        # the context manager closes the file once iteration finishes
        with open(corpus_path) as corpus_file:
            for line in corpus_file:
                # assume there's one document per line, tokens separated by whitespace
                yield utils.simple_preprocess(line)

In [39]:
utils.simple_preprocess('Hundreds of people have been forced to vacate their homes')

['hundreds',
 'of',
 'people',
 'have',
 'been',
 'forced',
 'to',
 'vacate',
 'their',
 'homes']

In [27]:
from gensim.models import Word2Vec

sentences = MyCorpus()
model = Word2Vec(sentences=sentences)

In [42]:
# RuntimeError: you must first build vocabulary before training the model
# Cause: the training text is too small
# sentences = [
#     ["cat", "say", "meow"], 
#     ["cat", "say", "meow"], 
#     ["cat", "say", "meow"], 
#     ["cat", "say", "meow"], 
#     ["cat", "say", "meow"], 
#     ["cat", "say", "meow"], 
#     ["dog", "say", "woof"]]
# model = gensim.models.Word2Vec(sentences=sentences)

### Getting word vectors

In [28]:
vec_king = model.wv['king']
vec_king

array([-0.01486553,  0.01823401, -0.02772865,  0.00213301,  0.0372683 ,
        0.01642814,  0.04816883, -0.05448446,  0.01253152,  0.03404462,
        0.01756212, -0.04840738, -0.03839137, -0.01304129, -0.03804687,
        0.05305994, -0.00860955, -0.00836853,  0.00134603, -0.03344008,
       -0.03039226, -0.03281286, -0.00623441, -0.02607584,  0.00408613,
        0.04711006,  0.03532708, -0.0055987 , -0.00130932,  0.02229508,
       -0.04697058, -0.02922403,  0.0131788 , -0.01622746,  0.05389683,
       -0.02905632,  0.03431457,  0.01364054,  0.00634684, -0.01930137,
       -0.01475715, -0.04367147, -0.00581169, -0.03634231,  0.01892559,
       -0.01157352,  0.02849645,  0.00408916,  0.03510711, -0.0240911 ,
       -0.03448397, -0.00420031, -0.06505472,  0.00562562, -0.03190516,
       -0.01666111, -0.04189085,  0.07402997,  0.00904749, -0.03249473,
        0.00972437,  0.03047183, -0.00820827, -0.02223486, -0.01516857,
       -0.00403475, -0.02458509,  0.02097399, -0.00187658,  0.01

In [38]:
model['king']  # deprecated direct indexing (removed in gensim 4.0); use model.wv['king'] instead


array([-0.01486553,  0.01823401, -0.02772865,  0.00213301,  0.0372683 ,
        0.01642814,  0.04816883, -0.05448446,  0.01253152,  0.03404462,
        0.01756212, -0.04840738, -0.03839137, -0.01304129, -0.03804687,
        0.05305994, -0.00860955, -0.00836853,  0.00134603, -0.03344008,
       -0.03039226, -0.03281286, -0.00623441, -0.02607584,  0.00408613,
        0.04711006,  0.03532708, -0.0055987 , -0.00130932,  0.02229508,
       -0.04697058, -0.02922403,  0.0131788 , -0.01622746,  0.05389683,
       -0.02905632,  0.03431457,  0.01364054,  0.00634684, -0.01930137,
       -0.01475715, -0.04367147, -0.00581169, -0.03634231,  0.01892559,
       -0.01157352,  0.02849645,  0.00408916,  0.03510711, -0.0240911 ,
       -0.03448397, -0.00420031, -0.06505472,  0.00562562, -0.03190516,
       -0.01666111, -0.04189085,  0.07402997,  0.00904749, -0.03249473,
        0.00972437,  0.03047183, -0.00820827, -0.02223486, -0.01516857,
       -0.00403475, -0.02458509,  0.02097399, -0.00187658,  0.01

#### OOV

In [34]:
try:
    vec_cameroon = model.wv['cameroon']
except KeyError:
    print("The word 'cameroon' does not appear in this model")

The word 'cameroon' does not appear in this model


### Getting the top-k most similar words

In [35]:
model.wv.most_similar('king', topn=2)

[('away', 0.9970558881759644), ('sea', 0.9966898560523987)]
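
The scores returned by `most_similar` are cosine similarities. The ranking can be sketched in plain Python as follows; the two-dimensional vectors are made-up toy values, not the vectors trained above, and the helper names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=2):
    """Rank all other words by cosine similarity to the query word."""
    query = vectors[word]
    scored = [(w, cosine(query, v)) for w, v in vectors.items() if w != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up toy vectors for illustration only.
toy_vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, 0.9],
    "sea": [1.0, -1.0],
    "away": [-1.0, -1.0],
}
result = most_similar("king", toy_vectors, topn=2)
print(result)
```

In this toy space `queen` points in almost the same direction as `king` and ranks first, `sea` is orthogonal with a score of 0, and `away` points the opposite way and is cut off by `topn=2`.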

In [36]:
# Print the 5 words most similar to the combination of "car" and "man"
print(model.wv.most_similar(positive=['car', 'man'], topn=5))

[('have', 0.9996969699859619), ('five', 0.9996781349182129), ('israeli', 0.9996719360351562), ('just', 0.9996703863143921), ('other', 0.999667763710022)]


In [37]:
model.wv.doesnt_match(['fire', 'water', 'land', 'sea', 'air', 'car'])



'land'

### Getting the vocabulary

In [29]:
for i, word in enumerate(model.wv.vocab):
    if i == 10:
        break
    print(word)

hundreds
of
people
have
been
forced
to
their
homes
in


### Storing and loading models

In [30]:
pwd

'/Users/luoyonggui/PycharmProjects/nbdevlib'

In [31]:
model.save('/Users/luoyonggui/Downloads/tt.model')



In [32]:
new_model = Word2Vec.load('/Users/luoyonggui/Downloads/tt.model')



In [33]:
new_model.wv['king']

array([-0.01486553,  0.01823401, -0.02772865,  0.00213301,  0.0372683 ,
        0.01642814,  0.04816883, -0.05448446,  0.01253152,  0.03404462,
        0.01756212, -0.04840738, -0.03839137, -0.01304129, -0.03804687,
        0.05305994, -0.00860955, -0.00836853,  0.00134603, -0.03344008,
       -0.03039226, -0.03281286, -0.00623441, -0.02607584,  0.00408613,
        0.04711006,  0.03532708, -0.0055987 , -0.00130932,  0.02229508,
       -0.04697058, -0.02922403,  0.0131788 , -0.01622746,  0.05389683,
       -0.02905632,  0.03431457,  0.01364054,  0.00634684, -0.01930137,
       -0.01475715, -0.04367147, -0.00581169, -0.03634231,  0.01892559,
       -0.01157352,  0.02849645,  0.00408916,  0.03510711, -0.0240911 ,
       -0.03448397, -0.00420031, -0.06505472,  0.00562562, -0.03190516,
       -0.01666111, -0.04189085,  0.07402997,  0.00904749, -0.03249473,
        0.00972437,  0.03047183, -0.00820827, -0.02223486, -0.01516857,
       -0.00403475, -0.02458509,  0.02097399, -0.00187658,  0.01

# nb_export

In [4]:
from nbdev.export import *
notebook2script()

Converted 00_core.ipynb.
Converted engineering_nbdev.ipynb.
Converted index.ipynb.


In [7]:
!nbdev_build_docs

No notebooks were modified
converting /Users/luoyonggui/PycharmProjects/nbdevlib/index.ipynb to README.md
