# Phrase Embedding

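Implementing phrase embeddings is essentially the same as training word2vec.

The main steps are:

* Extract phrases, then insert them at suitable positions in the corpus
* Train a word2vec network on the phrase-augmented corpus

Note:
> The vocabulary size of the embedding layer must include the phrases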
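
To make the note about vocabulary size concrete, here is a minimal sketch that builds a token-to-id mapping from a corpus that already contains inserted phrases. The helper name `build_vocab`, the `<unk>` token and the `min_count` parameter are assumptions for illustration, not part of the original notebook.

```python
from collections import Counter


def build_vocab(lines, min_count=1, unk_token='<unk>'):
    """Map every token (words and inserted phrases alike) to an integer id."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    vocab = {unk_token: 0}  # id 0 is reserved for unknown tokens (an assumption)
    for token, count in counts.most_common():
        if count >= min_count:
            vocab[token] = len(vocab)
    return vocab


# One line of a phrase-augmented corpus: '1.严格执行' is an inserted phrase.
augmented = ['1.严格 1.严格执行 执行 条款']
vocab = build_vocab(augmented)
vocab_size = len(vocab)  # this is the value to pass to the embedding layer
```

Because the phrases are ordinary tokens in the augmented corpus, counting tokens after insertion is enough to include them in `vocab_size`.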


Phrases can be extracted with PMI (Pointwise Mutual Information). For details, see: [word2vec-for-phrases-learning-embeddings-for-more-than-one-word](https://towardsdatascience.com/word2vec-for-phrases-learning-embeddings-for-more-than-one-word-727b6cf723cf)
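
As a rough illustration of PMI-based extraction, the sketch below scores adjacent word pairs with PMI(x, y) = log(p(x, y) / (p(x) p(y))), using plain relative frequencies. The function name `score_bigrams_by_pmi`, the `min_count` cutoff and the toy corpus are assumptions; the linked article may use a different scoring variant.

```python
import math
from collections import Counter


def score_bigrams_by_pmi(sentences, min_count=2):
    """Return {(w1, w2): pmi} for adjacent pairs seen at least min_count times."""
    unigram = Counter()
    bigram = Counter()
    for words in sentences:
        unigram.update(words)
        bigram.update(zip(words, words[1:]))

    total_unigrams = sum(unigram.values())
    total_bigrams = sum(bigram.values())
    scores = {}
    for (w1, w2), count in bigram.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_xy = count / total_bigrams
        p_x = unigram[w1] / total_unigrams
        p_y = unigram[w2] / total_unigrams
        scores[(w1, w2)] = math.log(p_xy / (p_x * p_y))
    return scores


# Toy, already-tokenized corpus (hypothetical data).
corpus = [
    ['严格', '执行', '条款'],
    ['严格', '执行', '制度'],
    ['主导', '项目', '开发'],
    ['支持', '项目', '开发'],
]
scores = score_bigrams_by_pmi(corpus, min_count=2)
for (w1, w2), pmi in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(w1 + w2, round(pmi, 3))
```

Pairs whose score exceeds a chosen threshold can then be written one per line, concatenated without a separator, to the phrase file.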

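## Inserting phrases

Once the phrases have been extracted with the method above, assume they are stored in a file named **phrase.txt**. The next step is to insert these phrases into the corpus.

The procedure is:

* Tokenize each sentence with jieba
* Pick a window and concatenate the words inside it pairwise. If a concatenation appears in phrase.txt, insert that phrase into the window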
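
The tokenization step is not shown in the cells that follow, which assume the corpus is already space-separated. A minimal sketch of that step might look like this. The helper `segment_lines` and its `tokenize` parameter are assumptions; when no tokenizer is passed it falls back to `jieba.lcut`.

```python
def segment_lines(lines, tokenize=None):
    """Yield each non-empty line as space-separated tokens."""
    if tokenize is None:
        # Third-party dependency, imported lazily so another tokenizer can be injected.
        import jieba
        tokenize = jieba.lcut
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Drop whitespace-only tokens so the output has single-space separators.
        tokens = [t for t in tokenize(line) if t.strip()]
        yield ' '.join(tokens)


# Demo with a character-level stand-in tokenizer so the sketch runs without jieba.
demo = list(segment_lines(['严格执行', '', '条款'], tokenize=list))
print(demo)
```

Calling `segment_lines(lines)` without the `tokenize` argument uses jieba, and the resulting lines can be written to the input file consumed by the processing code.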



In [11]:
!pip install jieba



In [9]:

import os


def build_phrase_set(phrase_file):
    phrases = set()
    if not os.path.exists(phrase_file):
        print('file does not exist: %s' % phrase_file)
        return phrases
    with open(phrase_file, mode='rt', encoding='utf8', buffering=8192) as fin:
        for line in fin:
            line = line.strip('\n')
            if not line:
                continue
            phrases.add(line)
    return phrases
        

In [13]:
phrase_file = '/opt/algo_nfs/kdd_luozhouyang/tmp/phrase_v1.txt'
phrase = build_phrase_set(phrase_file)
print(sorted(list(phrase))[100:110])

['1.5年', '1.6亿', '1.8亿', '1.responsible', '1.严格', '1.严格执行', '1.主导', '1.主持', '1.主要', '1.了解']


In [20]:
import os

class SequenceProcessor:
    
    def __init__(self, phrase_file, window_size=5):
        self.phrases = build_phrase_set(phrase_file)
        self.window_size = window_size
        
    def process(self, input_file, output_file):
        if len(self.phrases) == 0:
            print('phrase set is empty.')
            return
        with open(output_file, mode='wt', encoding='utf8', buffering=8192) as fout, \
            open(input_file, mode='rt', encoding='utf8', buffering=8192) as fin:
            for line in fin:
                line = line.strip('\n')
                if not line:
                    continue
                res = self.process_line(line)
                fout.write(res + '\n')
                
    def process_line(self, line):
        words = line.split(' ')
        new_words = []
        # Walk the sentence in non-overlapping windows of window_size words.
        for left in range(0, len(words), self.window_size):
            selected = words[left:left + self.window_size]
            new_words.append(self.process_window(selected))
        return ' '.join(new_words)
                
    def process_window(self, words):
        res = []
        concated = []
        # Concatenate every ordered pair in the window and keep the known phrases.
        for i in range(len(words) - 1):
            for j in range(i + 1, len(words)):
                c = words[i] + words[j]
                if c in self.phrases:
                    concated.append(c)
        
        # Interleave the phrases with the words. Inserting them in random order would also work.
        long_list = words if len(words) >= len(concated) else concated
        short_list = words if len(words) < len(concated) else concated
        
        i = 0
        while i < len(short_list):
            res.append(long_list[i])
            res.append(short_list[i])
            i += 1
        while i < len(long_list):
            res.append(long_list[i])
            i += 1
        return ' '.join(res)
                    

In [21]:
p = SequenceProcessor(phrase_file)
words = ['1.严格', '执行', '条款', '1.', '主导', '支持']
print(p.process_window(words))

1.严格 1.严格执行 执行 1.主导 条款 1. 主导 支持


In [22]:
line = '1.严格 执行 条款 1. 主导 支持'
print(p.process_line(line))

1.严格 1.严格执行 执行 1.主导 条款 1. 主导 支持


## word2vec

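Once the corpus is ready, the word2vec steps are:
* Build the data input pipeline
* Build the network and train the model


Note:

> It is recommended to train word2vec with an existing library such as gensim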


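### Building the input pipeline

Use the tf.data API to build the data input pipeline. Negative sampling is worth considering.

If you build the training data with Keras's built-in `skipgrams` function, see [A word2vec keras tutorial](https://adventuresinmachinelearning.com/word2vec-keras-tutorial/); that function supports negative sampling.

If you plan to build the input pipeline with the tf.data API and also need negative sampling, you can proceed as follows:
* Prepare the training data in skip-gram fashion, with the file format 'center context label'
* Write a program that performs negative sampling over that training file and produces a new file of negative samples in the same format
* Build the input pipeline from both files with the tf.data API. Be sure to shuffle, and set buffer_size large enough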
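
The first two steps above could be sketched in plain Python as follows. The helper names, the window size and the uniform sampling are assumptions made for brevity; word2vec itself draws negatives from a smoothed unigram distribution (counts raised to the 3/4 power), which this sketch does not do.

```python
import random


def skipgram_pairs(word_ids, window_size=2):
    """Return (center, context) pairs within window_size of each position."""
    pairs = []
    for i, center in enumerate(word_ids):
        start = max(0, i - window_size)
        end = min(len(word_ids), i + window_size + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, word_ids[j]))
    return pairs


def negative_samples(positive_pairs, vocab_size, num_negative=2, seed=0):
    """For each positive pair, draw up to num_negative random non-context ids."""
    rng = random.Random(seed)
    positives = set(positive_pairs)
    negatives = []
    for center, _ in positive_pairs:
        drawn = 0
        attempts = 0
        while drawn < num_negative and attempts < 100:  # cap retries on tiny vocabularies
            attempts += 1
            candidate = rng.randrange(vocab_size)
            if candidate != center and (center, candidate) not in positives:
                negatives.append((center, candidate))
                drawn += 1
    return negatives


def to_lines(pairs, label):
    """Format pairs as 'center context label' lines."""
    return ['%d %d %d' % (center, context, label) for center, context in pairs]


sentence_ids = [1, 2, 3, 4]  # hypothetical word ids for one sentence
pos = skipgram_pairs(sentence_ids, window_size=1)
neg = negative_samples(pos, vocab_size=50, num_negative=2)
pos_lines = to_lines(pos, 1)
neg_lines = to_lines(neg, 0)
```

Writing `pos_lines` and `neg_lines` to two files gives the two inputs described in the last step. Because every line in each file carries the same label, a large shuffle buffer matters when the files are read one after the other.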



### Building the network


In [35]:
import tensorflow as tf


class Word2Vec(tf.keras.Model):
    
    def __init__(self, vocab_size, embedding_size):
        super(Word2Vec, self).__init__(name='word2vec')
        self.vocab_size = vocab_size
        self.embedding_size = embedding_size
        
        self.embedding = tf.keras.layers.Embedding(self.vocab_size, self.embedding_size, name='embedding')
        self.reshape = tf.keras.layers.Reshape((self.embedding_size, 1), name='reshape')
        self.cosine = tf.keras.layers.Dot(axes=1, normalize=True, name='cosine')
        self.flatten = tf.keras.layers.Reshape((1,), name='flatten')
        self.out = tf.keras.layers.Dense(1, activation='sigmoid', name='out')
        
    def call(self, inputs, training=True, mask=None):
        target, context = inputs
        # Look up the embeddings first, then reshape (batch, 1, dim) -> (batch, dim, 1).
        target_embedding = self.reshape(self.embedding(target))
        context_embedding = self.reshape(self.embedding(context))
        # Cosine similarity between the two embeddings, not between the raw ids.
        cos = self.flatten(self.cosine([target_embedding, context_embedding]))
        out = self.out(cos)
        res = {
            'out': out,
            'cos': cos
        }
        return res
    

def build_word2vec_model(vocab_size, embedding_size):
    target_input = tf.keras.layers.Input(shape=(1,), name='target_input')
    context_input = tf.keras.layers.Input(shape=(1,), name='context_input')
    
    embedding = tf.keras.layers.Embedding(vocab_size, embedding_size, name='embedding')
    target = embedding(target_input)
    context = embedding(context_input)
    
    reshape = tf.keras.layers.Reshape((embedding_size, 1))
    target = reshape(target)
    context = reshape(context)
    
    cosine = tf.keras.layers.Dot(axes=1, normalize=True, name='cos')([target, context])
    cosine = tf.keras.layers.Reshape((1,))(cosine)
    out = tf.keras.layers.Dense(1, activation='sigmoid', name='out')(cosine)
    
    model = tf.keras.Model(inputs=[target_input, context_input], outputs=[cosine, out])
    return model

In [26]:
w2v = Word2Vec(100, 128)
w2v.compile(loss={'output_1': 'binary_crossentropy'}, optimizer='sgd')
w2v.summary()

ValueError: This model has not yet been built. Build the model first by calling `build()` or calling `fit()` with some data, or specify an `input_shape` argument in the first layer(s) for automatic build.

In [40]:
w2v_model = build_word2vec_model(100, 128)
w2v_model.summary()
w2v_model.compile(loss={'out': 'binary_crossentropy'}, optimizer='sgd')

W0507 16:54:31.956479 140213465438016 training_utils.py:1152] Output reshape_12 missing from loss dictionary. We assume this was done on purpose. The fit and evaluate APIs will not be expecting any data to be passed to reshape_12.


Model: "model_7"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
target_input (InputLayer)       [(None, 1)]          0                                            
__________________________________________________________________________________________________
context_input (InputLayer)      [(None, 1)]          0                                            
__________________________________________________________________________________________________
embedding (Embedding)           (None, 1, 128)       12800       target_input[0][0]               
                                                                 context_input[0][0]              
__________________________________________________________________________________________________
reshape_11 (Reshape)            (None, 128, 1)       0           embedding[0][0]            

### Training word2vec with gensim


In [42]:
!pip install -i https://pypi.tuna.tsinghua.edu.cn/simple -q gensim

In [59]:
import logging

import gensim
from gensim.models import Word2Vec

logging.basicConfig(filename="/opt/algo_nfs/kdd_luozhouyang/tmp/w2v.log.20190507", level=logging.INFO)

# What if there are multiple files? See MultiFileSentence below
sentence = gensim.models.word2vec.LineSentence('/opt/algo_nfs/kdd_luozhouyang/tmp/part-01023')

# For the other parameters and how to set them, see the documentation
# https://www.pydoc.io/pypi/gensim-3.2.0/autoapi/models/word2vec/index.html#models.word2vec.Word2Vec
model = Word2Vec(sentence, window=5, negative=5)

model.save('/opt/algo_nfs/kdd_luozhouyang/models/gensim/w2v.20190507.model')
model.wv.save_word2vec_format('/opt/algo_nfs/kdd_luozhouyang/models/gensim/w2v.20190507.vec')
model.wv['开发']


# with open('w2v.vec', mode='rt', encoding='utf8', buffering=8192) as fin:
#     count = 0
#     for line in fin:
#         print(line)
#         count += 1
#         if count == 10:
#             break
            



array([ 0.34674248,  0.35570675, -0.1457441 ,  0.00879948,  0.55500114,
        0.38890362,  0.37437907, -0.27349955, -0.33025077,  0.3912286 ,
        0.1590332 , -0.1091135 ,  0.10804082, -0.41607952,  0.11582852,
        0.02014425, -0.3567299 ,  0.13947669,  0.7742888 , -0.41307807,
       -0.36171225, -0.13599542, -0.04251283, -0.19304901, -0.26372176,
       -0.3122111 , -0.31131655,  0.04328417,  0.6876864 ,  0.86348176,
        0.44242084, -0.4835791 , -0.27528867,  0.34724486,  0.26832047,
        0.17167363,  0.00260511, -0.69957006,  0.24595639,  0.7650556 ,
       -0.74492747, -0.41633615,  0.19623893,  0.18933427,  0.42557034,
       -0.07534644, -0.27418914,  0.28443357, -0.07384911, -0.2399095 ,
        0.08272726, -0.28894734, -0.56103647, -0.01218327,  0.46845567,
        0.43867397,  0.02944496, -0.02013955, -0.43548632, -0.4004143 ,
       -0.15792081,  0.2640149 , -0.00460577,  0.44588038, -0.20286655,
       -0.16475451,  0.10544273,  0.47026154, -0.38691202, -0.32

In [60]:

class MultiFileSentence:
    
    def __init__(self, files):
        self.files = files
        
    def __iter__(self):
        for f in self.files:
            with open(f, mode='rt', encoding='utf8', buffering=8192) as fin:
                for line in fin:
                    line = line.strip('\n')
                    if not line:
                        continue
                    words = line.split(' ')
                    yield words
                    

In [61]:
from gensim.models import Word2Vec

files = ['/opt/algo_nfs/kdd_luozhouyang/tmp/part-01023',
        '/opt/algo_nfs/kdd_luozhouyang/tmp/part-01023',]
multi_file_sentence = MultiFileSentence(files)

model = Word2Vec(multi_file_sentence, window=5, negative=5)
model.save('/tmp/w2v.model')
model.wv.save_word2vec_format('/tmp/w2v.vec')




In [62]:
m = Word2Vec.load('/tmp/w2v.model')
m.wv['开发']



array([-0.23979124, -0.35029036, -0.48673213,  0.11066395,  0.03688607,
        0.4783164 ,  0.10762557, -0.55959433, -0.18692617, -0.16329728,
        0.04612793, -0.18539032,  0.02333255, -0.50036645, -0.6414493 ,
        0.36254427, -0.16179731,  0.01201913,  0.3043091 , -0.0179352 ,
       -0.06606959, -0.28152755, -0.04192889,  0.03955165, -0.15794286,
       -0.35448447,  0.08880788,  0.00767798,  0.19260862,  0.91442215,
        0.48151648,  0.04959353, -0.42420253,  0.5900174 ,  0.3191452 ,
        0.51084846, -0.05812714, -0.28591454,  0.15560418,  0.8640183 ,
       -0.92507184, -0.6447743 ,  0.66764784, -0.0323838 ,  0.62571084,
        0.10482378, -0.08711988,  0.16586128, -0.19405377,  0.16375169,
       -0.28106654, -0.41007572, -0.11500295, -0.08608052, -0.19349962,
        0.1785441 , -0.12186533,  0.08123282, -0.19011433, -0.36758485,
       -0.02843239,  0.18410258,  0.29441646, -0.06380797,  0.3583782 ,
       -0.49278077,  0.46070313,  0.2910697 , -0.40502158, -0.33
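
The `.vec` file written by `save_word2vec_format` is plain text: a header line with the vocabulary size and vector dimension, followed by one line per token with its vector components. The sketch below reads that format without gensim and compares vectors by cosine similarity. The helper names and the toy file contents are assumptions; the real file would be the `w2v.vec` saved earlier.

```python
import math
import os
import tempfile


def load_vec(path):
    """Parse the word2vec text format: a 'count dim' header, then 'word v1 ... vN'."""
    vectors = {}
    with open(path, mode='rt', encoding='utf8') as fin:
        _count, dim = (int(x) for x in fin.readline().split())
        for line in fin:
            parts = line.split()
            if len(parts) != dim + 1:
                continue  # skip malformed lines
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Tiny hand-written file standing in for the real w2v.vec (made-up numbers).
path = os.path.join(tempfile.mkdtemp(), 'toy.vec')
with open(path, mode='wt', encoding='utf8') as fout:
    fout.write('3 2\n开发 1.0 0.0\n研发 0.9 0.1\n财务 0.0 1.0\n')

vectors = load_vec(path)
print(cosine(vectors['开发'], vectors['研发']) > cosine(vectors['开发'], vectors['财务']))
```

Inserted phrases such as `1.严格执行` appear in the file as ordinary tokens, so they can be looked up the same way as single words.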