<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc" style="margin-top: 1em;"><ul class="toc-item"><li><span><a href="#gensim相关示例" data-toc-modified-id="gensim相关示例-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>gensim examples</a></span><ul class="toc-item"><li><span><a href="#建立-dict-及-corpora" data-toc-modified-id="建立-dict-及-corpora-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Building the dictionary and corpus</a></span></li><li><span><a href="#dictionary-接口测试" data-toc-modified-id="dictionary-接口测试-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Testing the Dictionary API</a></span></li><li><span><a href="#使用-corpora--查找邻近的向量" data-toc-modified-id="使用-corpora--查找邻近的向量-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Using the corpus to find nearby vectors</a></span></li></ul></li><li><span><a href="#尝试使用训练好的-word2vector" data-toc-modified-id="尝试使用训练好的-word2vector-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Trying a pretrained word2vec model</a></span></li></ul></div>

## gensim examples

### Building the dictionary and corpus

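These cells follow [gensim usage and examples](http://blog.csdn.net/u014595019/article/details/52218249) to get familiar with how documents are represented as vectors. The article also covers a distributed computing example, which is not needed here and is skipped.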

In [60]:
from gensim import corpora
from collections import defaultdict
documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]
# Remove stop words
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist] for document in documents]

# Remove words that occur only once in the whole collection
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1
texts = [[token for token in text if frequency[token] > 1] for text in texts]
texts

# Build the dictionary
dictionary = corpora.Dictionary(texts)

# Save the dictionary
path_dict = '/tmp/deerwester.dict'
dictionary.save(path_dict)

# Convert each document to a bag-of-words vector
corpus = [dictionary.doc2bow(text) for text in texts]

# Save the vectorized corpus
path_corpora = '/tmp/deerwester.mm'
corpora.MmCorpus.serialize(path_corpora, corpus)

[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]
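
For orientation, `doc2bow` turns a token list into `(token_id, count)` pairs. The following is a minimal stdlib-only sketch of that idea, not gensim's implementation. The helper names `build_token2id` and `to_bow` are hypothetical, and IDs are assigned in first-seen order here, which is an assumption and may differ from the IDs gensim assigns.

```python
from collections import Counter

def build_token2id(texts):
    """Assign an integer ID to every distinct token, in first-seen order."""
    token2id = {}
    for text in texts:
        for token in text:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(text, token2id):
    """Return sorted (token_id, count) pairs; unknown tokens are ignored."""
    counts = Counter(token2id[t] for t in text if t in token2id)
    return sorted(counts.items())

# Two of the filtered documents shown above.
sample_texts = [["human", "interface", "computer"],
                ["system", "human", "system", "eps"]]
sample_token2id = build_token2id(sample_texts)
sample_bow = to_bow(sample_texts[1], sample_token2id)
print(sample_bow)
```

A token that occurs twice in a document, such as `system` in the second sample, gets a count of 2 in that document's vector.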

### Testing the Dictionary API

Print the dictionary's token-to-ID mapping.

In [16]:
dictionary.token2id

{'computer': 2,
 'eps': 8,
 'graph': 10,
 'human': 0,
 'interface': 1,
 'minors': 11,
 'response': 6,
 'survey': 3,
 'system': 5,
 'time': 7,
 'trees': 9,
 'user': 4}

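Print the frequency of each token. `dfs` holds document frequencies: it maps a token ID to the number of documents that contain the token, not to its total number of occurrences. That is why `system` (ID 5) is counted 3 times even though it occurs 4 times: two of those occurrences are in the same document.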

In [23]:
dictionary.dfs

{0: 2, 1: 2, 2: 2, 3: 2, 4: 3, 5: 3, 6: 2, 7: 2, 8: 2, 9: 3, 10: 3, 11: 2}
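
The difference between the two counts can be checked with the standard library alone. This is a small sketch that recomputes both from the filtered `texts` printed earlier; the variable names `demo_texts`, `doc_freq`, and `term_freq` are made up for the illustration.

```python
from collections import Counter

# The filtered documents, copied from the output of the first cell.
demo_texts = [['human', 'interface', 'computer'],
              ['survey', 'user', 'computer', 'system', 'response', 'time'],
              ['eps', 'user', 'interface', 'system'],
              ['system', 'human', 'system', 'eps'],
              ['user', 'response', 'time'],
              ['trees'],
              ['graph', 'trees'],
              ['graph', 'minors', 'trees'],
              ['graph', 'minors', 'survey']]

doc_freq = Counter()   # number of documents that contain the token
term_freq = Counter()  # total number of occurrences across all documents
for text in demo_texts:
    term_freq.update(text)
    doc_freq.update(set(text))  # set() counts each token once per document

print(term_freq['system'], doc_freq['system'])
```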

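`filter_n_most_frequent(N)` removes the N tokens with the highest document frequency. When several tokens share the same frequency, the ones with the smallest IDs appear to be removed first: here four tokens have frequency 3, and `user` (ID 4) is the one that is dropped.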

In [24]:
dictionary.filter_n_most_frequent(1)
dictionary.token2id

{'computer': 2,
 'eps': 7,
 'graph': 9,
 'human': 0,
 'interface': 1,
 'minors': 10,
 'response': 5,
 'survey': 3,
 'system': 4,
 'time': 6,
 'trees': 8}

filter_extremes(no_below=5, no_above=0.5, keep_n=3)
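* Removes tokens that appear in fewer than `no_below` documents (an absolute count)
* Removes tokens that appear in more than `no_above` of the documents (a fraction of the corpus size, not a count)
* From the tokens that survive the two filters above, keeps only the `keep_n` most frequent ones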
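
The three steps above can be sketched in plain Python. This is an illustration of the described behavior under stated assumptions, not gensim's code: the function name `filter_extremes_sketch` is hypothetical, ties are broken by token ID, and the fraction is compared without rounding, all of which may differ from the library.

```python
def filter_extremes_sketch(dfs, num_docs, no_below=5, no_above=0.5, keep_n=None):
    """Return the token IDs that survive the three filtering steps."""
    max_docs = no_above * num_docs  # no_above is a fraction of the corpus size
    kept = [tid for tid, df in dfs.items() if no_below <= df <= max_docs]
    # Most frequent first; ties broken by ID so the result is deterministic.
    kept.sort(key=lambda tid: (-dfs[tid], tid))
    if keep_n is not None:
        kept = kept[:keep_n]
    return sorted(kept)

# Document frequencies copied from the dictionary.dfs output above.
demo_dfs = {0: 2, 1: 2, 2: 2, 3: 2, 4: 3, 5: 3, 6: 2, 7: 2, 8: 2, 9: 3, 10: 3, 11: 2}
survivors = filter_extremes_sketch(demo_dfs, num_docs=9, no_below=3, no_above=0.5, keep_n=3)
print(survivors)
```

With the default `no_below=5`, no token in this nine-document corpus would survive, since the highest document frequency is 3.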

filter_tokens(bad_ids=None, good_ids=None)
* Removes the tokens whose IDs are listed in `bad_ids`
* Keeps only the tokens whose IDs are listed in `good_ids` and removes all others

In [26]:
dictionary.filter_tokens(bad_ids=[0, 1])
dictionary.token2id

{'eps': 3,
 'graph': 5,
 'minors': 6,
 'response': 1,
 'system': 0,
 'time': 2,
 'trees': 4}

In [27]:
dictionary.filter_tokens(good_ids=[1, 3, 4, 6])
dictionary.token2id

{'eps': 1, 'minors': 3, 'response': 0, 'trees': 2}

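`compactify` removes any gaps in the dictionary's ID sequence. The output below is identical to the previous one, which suggests the IDs were already contiguous after `filter_tokens`.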

In [28]:
dictionary.compactify()
dictionary.token2id

{'eps': 1, 'minors': 3, 'response': 0, 'trees': 2}
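
To see what closing ID gaps means, here is a minimal sketch that renumbers a mapping with gaps. The function name `compactify_sketch` is hypothetical, and preserving the relative order of the old IDs is an assumption about the desired behavior rather than a statement about gensim's internals.

```python
def compactify_sketch(token2id):
    """Renumber IDs to 0..n-1, preserving the relative order of the old IDs."""
    ordered = sorted(token2id.items(), key=lambda item: item[1])
    return {token: new_id for new_id, (token, _old_id) in enumerate(ordered)}

# A mapping with gaps: IDs 0, 2 and 5 are unused.
gappy = {'eps': 3, 'minors': 6, 'response': 1, 'trees': 4}
compact = compactify_sketch(gappy)
print(compact)
```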

### Using the corpus to find nearby vectors

In [32]:
import os
from gensim import corpora, models, similarities
from pprint import pprint
from matplotlib import pyplot as plt
import logging

In [38]:
def PrintDictionary(dictionary):
    token2id = dictionary.token2id
    dfs = dictionary.dfs
    token_info = {}
    for word in token2id:
        token_info[word] = dict(word=word, id=token2id[word], freq=dfs[token2id[word]])
    token_items = token_info.values()
    token_items = sorted(token_items, key=lambda x: x['id'])
    pprint(token_items)

In [40]:
def Show2dCorpora(corpus):
    nodes = list(corpus)
    ax0 = [x[0][1] for x in nodes]
    ax1 = [x[1][1] for x in nodes]
    plt.plot(ax0, ax1, 'o')
    plt.show()

In [43]:
if os.path.exists(path_dict) and os.path.exists(path_corpora):
    dictionary = corpora.Dictionary.load(path_dict)
    corpus = corpora.MmCorpus(path_corpora)
    print('success.')
else:
    print('failed.')

success.


In [44]:
PrintDictionary(dictionary)

[{'freq': 2, 'id': 0, 'word': 'human'},
 {'freq': 2, 'id': 1, 'word': 'interface'},
 {'freq': 2, 'id': 2, 'word': 'computer'},
 {'freq': 2, 'id': 3, 'word': 'survey'},
 {'freq': 3, 'id': 4, 'word': 'user'},
 {'freq': 3, 'id': 5, 'word': 'system'},
 {'freq': 2, 'id': 6, 'word': 'response'},
 {'freq': 2, 'id': 7, 'word': 'time'},
 {'freq': 2, 'id': 8, 'word': 'eps'},
 {'freq': 3, 'id': 9, 'word': 'trees'},
 {'freq': 3, 'id': 10, 'word': 'graph'},
 {'freq': 2, 'id': 11, 'word': 'minors'}]


In [53]:
tfidf_model = models.TfidfModel(corpus)
doc_bow = [(0, 1), (1, 1), (4, 3)]
doc_tfidf = tfidf_model[doc_bow]
doc_tfidf

[(0, 0.3834358077080814), (1, 0.3834358077080814), (4, 0.840210665687185)]
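
The weights above are consistent with multiplying each count by log2(N / df) and then scaling the vector to unit length, which, as far as I understand, matches the default settings of `TfidfModel`. The sketch below recomputes them with the standard library; the function name `tfidf_sketch` is hypothetical, and the document frequencies and N = 9 are taken from the dictionary output earlier in this notebook.

```python
import math

def tfidf_sketch(doc_bow, dfs, num_docs):
    """Weight each count by log2(num_docs / df), then L2-normalize the vector."""
    weighted = [(tid, count * math.log2(num_docs / dfs[tid])) for tid, count in doc_bow]
    norm = math.sqrt(sum(w * w for _, w in weighted))
    if norm == 0:
        return weighted  # every token occurs in every document; nothing to scale
    return [(tid, w / norm) for tid, w in weighted]

dfs_demo = {0: 2, 1: 2, 4: 3}  # document frequencies of 'human', 'interface', 'user'
weights = tfidf_sketch([(0, 1), (1, 1), (4, 3)], dfs_demo, num_docs=9)
print(weights)
```

Token 4 gets the largest weight because its count of 3 outweighs its slightly lower inverse document frequency.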

In [54]:
corpus_tfidf = tfidf_model[corpus]
list(corpus_tfidf)
list(corpus)

[[(0, 0.5773502691896257), (1, 0.5773502691896257), (2, 0.5773502691896257)],
 [(2, 0.44424552527467476),
  (3, 0.44424552527467476),
  (4, 0.3244870206138555),
  (5, 0.3244870206138555),
  (6, 0.44424552527467476),
  (7, 0.44424552527467476)],
 [(1, 0.5710059809418182),
  (4, 0.4170757362022777),
  (5, 0.4170757362022777),
  (8, 0.5710059809418182)],
 [(0, 0.49182558987264147), (5, 0.7184811607083769), (8, 0.49182558987264147)],
 [(4, 0.45889394536615247), (6, 0.6282580468670046), (7, 0.6282580468670046)],
 [(9, 1.0)],
 [(9, 0.7071067811865475), (10, 0.7071067811865475)],
 [(9, 0.5080429008916749), (10, 0.5080429008916749), (11, 0.695546419520037)],
 [(3, 0.6282580468670046),
  (10, 0.45889394536615247),
  (11, 0.6282580468670046)]]

[[(0, 1.0), (1, 1.0), (2, 1.0)],
 [(2, 1.0), (3, 1.0), (4, 1.0), (5, 1.0), (6, 1.0), (7, 1.0)],
 [(1, 1.0), (4, 1.0), (5, 1.0), (8, 1.0)],
 [(0, 1.0), (5, 2.0), (8, 1.0)],
 [(4, 1.0), (6, 1.0), (7, 1.0)],
 [(9, 1.0)],
 [(9, 1.0), (10, 1.0)],
 [(9, 1.0), (10, 1.0), (11, 1.0)],
 [(3, 1.0), (10, 1.0), (11, 1.0)]]

In [58]:
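# Note: doc2bow is case-sensitive and the dictionary holds lowercase tokens,
# so "Human" is not matched here; only "computer" (ID 2) is found.
test_text = "Human computer interaction".split()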
test_bow = dictionary.doc2bow(test_text)
test_bow

[(2, 1)]

In [59]:
lsi_model = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)
corpus_lsi = lsi_model[corpus_tfidf]
nodes = list(corpus_lsi)
# pprint(nodes)
lsi_model.print_topics(2) # print the weighted terms that make up each topic

[(0,
  '0.703*"trees" + 0.538*"graph" + 0.402*"minors" + 0.187*"survey" + 0.061*"system" + 0.060*"response" + 0.060*"time" + 0.058*"user" + 0.049*"computer" + 0.035*"interface"'),
 (1,
  '-0.460*"system" + -0.373*"user" + -0.332*"eps" + -0.328*"interface" + -0.320*"response" + -0.320*"time" + -0.293*"computer" + -0.280*"human" + -0.171*"survey" + 0.161*"trees"')]

## Trying a pretrained word2vec model

In [26]:
from gensim.models import word2vec
from gensim.models import KeyedVectors
import re

# Load the pretrained word2vec model
model = word2vec.Word2Vec.load("/aiml/data/DevMind/06.SRC/01.TestCaseBot/02.TestAWBot/01.NMT_Train/nmt/sentence_embeding/med64_5.model.bin")
model.wv.similar_by_vector(model.wv['mc'])

[('mc', 1.0),
 ('statichost', 0.43512803316116333),
 ('sense', 0.40406978130340576),
 ('snmpagentinform', 0.3807608187198639),
 ('cumulative', 0.37484827637672424),
 ('optchecksum', 0.3724084496498108),
 ('rmlf', 0.3662354052066803),
 ('adjstrictcheck', 0.3609578609466553),
 ('snmpblacklistuserblock', 0.3530350923538208),
 ('snmpsysinformation', 0.3521064221858978)]

In [2]:
# Split every AW interface name into words
RUBY_DESCRIPTION_DELIMITER = ' '
def method_to_des(mth):
    mth = re.sub("[^A-Za-z0-9?=]", " ", mth)
    if mth.islower():
        return mth
    mth_arr = []
    for word in mth.split(" "):
        if word.islower():
            mth_arr.append(word)
            continue
        
        tmp_word = ""
        compare_len = len(word)-1
        for i,alp in enumerate(word):
            tmp_word += alp
            if i == compare_len:
                mth_arr.append(tmp_word)
                break
            before_alp_lower = word[i].islower()
            after_alp_upper = word[i+1].isupper()
            before_alp_digit = word[i].isdigit()
            if (i>0 and before_alp_digit and word[i-1].isupper() != after_alp_upper) \
                or (before_alp_lower and after_alp_upper):
                mth_arr.append(tmp_word)
                tmp_word = ""
    mth_result = RUBY_DESCRIPTION_DELIMITER.join(mth_arr)
    return mth_result
method_to_des("Lr>lr.set_startup_file")

'Lr lr set startup file'
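
For comparison, the punctuation and lower-to-upper splitting can be sketched with two regular expressions. This is a simplified alternative, not a drop-in replacement: the name `split_name_sketch` is hypothetical, it omits the digit-boundary rule of `method_to_des`, and it collapses repeated spaces.

```python
import re

def split_name_sketch(name):
    """Split an interface name on punctuation and lower-to-upper case changes."""
    # Replace every character outside letters, digits, '?' and '=' with a space.
    spaced = re.sub(r"[^A-Za-z0-9?=]", " ", name)
    # Insert a space wherever a lowercase letter is directly followed by an uppercase one.
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", spaced)
    # Collapse runs of whitespace into single spaces.
    return " ".join(spaced.split())

print(split_name_sketch("Lr>lr.set_startup_file"))
print(split_name_sketch("IsisOverload>overload.set_on_startup"))
```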

In [4]:
# Convert each word of the tokenized AW name to its embedding and take the plain (unweighted) mean; call the result N
def get_aw_embedding(aw_sp_list):
    sum_emb = None
    len_emb = 0
    # print(aw_sp_list)
    for aw_sp in aw_sp_list.split():
        if aw_sp not in model.wv.vocab:
            continue
        len_emb += 1
        #print(model[aw_sp][0])
        #print(aw_sp)
        if 1 == len_emb:
            sum_emb = model.wv.word_vec(aw_sp).copy()
        else:
            sum_emb += model.wv.word_vec(aw_sp)
    # print(len_emb)
    if 0 == len_emb:
        # print(aw)
        return []
    sum_emb /= len_emb
    return sum_emb
get_aw_embedding(method_to_des('Lr>lr.set_startup_file_jiexiao'))

array([-0.91744393,  1.0888631 , -2.625137  ,  1.4460866 , -0.2799368 ,
        4.258118  , -0.8719462 ,  1.2196819 , -3.8599148 , -1.908354  ,
       -1.8601754 , -4.560252  , -2.888332  , -3.26212   ,  0.9287324 ,
        3.3939183 , -0.19960232,  0.06626473, -2.1651917 ,  6.310197  ,
        1.3903465 , -0.40090623, -0.41107208, -1.4541266 , -1.4475217 ,
        3.2508206 ,  4.477243  ,  1.0548089 ,  1.7993143 , -0.80789775,
       -2.031032  ,  1.4163105 ,  1.2064116 , -3.240237  , -7.076661  ,
        1.0470177 , -1.9057057 ,  4.655415  , -5.4842176 ,  2.9899838 ,
        1.7770965 ,  2.7014995 , -9.048382  , -0.96679115,  6.3571777 ,
        0.6662647 , -6.4235864 , -2.997609  ,  2.9769065 , -2.1526349 ,
        7.2513504 , -1.4889537 , -1.2638527 , -3.690178  ,  3.5595753 ,
        7.3578444 ,  0.14995785, -1.7461218 , -0.49070913, -5.381696  ,
       -3.1713872 ,  5.235554  ,  4.2035666 , -0.02459496], dtype=float32)
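
The averaging step can be illustrated without a trained model. The sketch below uses a made-up two-dimensional vocabulary; `mean_embedding_sketch` and `toy_vectors` are hypothetical names, and returning `None` for a name with no known words is a design choice of this sketch (the function above returns an empty list instead).

```python
def mean_embedding_sketch(words, vectors):
    """Average the vectors of the known words; return None if none are known."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

toy_vectors = {"set": [1.0, 2.0], "startup": [3.0, 4.0]}  # made-up 2-d vectors
mean_vec = mean_embedding_sketch("set startup unknownword".split(), toy_vectors)
print(mean_vec)
```

Out-of-vocabulary words are skipped and do not count toward the divisor, so they do not pull the mean toward zero.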

In [14]:
# Load the list of all AW interfaces
PATH_AW_LIST = '/aiml/data/DevMind/06.SRC/01.TestCaseBot/02.TestAWBot/01.NMT_Train/nmt/sentence_embeding/vocab.script'
with open(PATH_AW_LIST) as f:
    aw_list = f.readlines()
    aw_list = aw_list[3:]

# Build the embedding of each AW interface: the key is the function name, the value is N
aw_embedding_list = []
for aw in aw_list:
    aw_embedding = [aw]
    emb = get_aw_embedding(method_to_des(aw))
    if len(emb):
        aw_embedding.extend(emb)
        aw_embedding = [str(x).strip() for x in aw_embedding]
        aw_embedding_list.append(aw_embedding)
PATH_AW_EMBEDDING = "/aiml/data/DevMind/06.SRC/01.TestCaseBot/02.TestAWBot/01.NMT_Train/nmt/sentence_embeding/aw_embedding.model.vec"
with open(PATH_AW_EMBEDDING, 'w') as f:
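    # Header: entry count and vector dimension. Each row holds the name plus the vector,
    # so the dimension is the row length minus 1 (emb may be empty after the loop).
    _ = f.write("%d %d\n" % (len(aw_embedding_list), len(aw_embedding_list[0]) - 1))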
    for x in aw_embedding_list:
        _ = f.write(" ".join(x) + "\n")
print("Write %d to %s" % (len(aw_embedding_list), PATH_AW_EMBEDDING))

# Test: recompute one embedding by hand and compare it with the stored row
test = (model.wv.word_vec('Lr') + model.wv.word_vec('lr') + model.wv.word_vec('set') + model.wv.word_vec('startup') + model.wv.word_vec('file') ) / 5
test = [str(x).strip() for x in test]
test == [str(x) for x in aw_embedding_list[4]][1:]

Write 27817 to /aiml/data/DevMind/06.SRC/01.TestCaseBot/02.TestAWBot/01.NMT_Train/nmt/sentence_embeding/aw_embedding.model.vec


True
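
The file written above follows the word2vec text layout: a header line with the entry count and the vector dimension, then one line per entry with the name followed by its values. Here is a small round-trip sketch using an in-memory buffer; `write_vec_text` and `read_vec_text` are hypothetical helpers, and the vectors are made up.

```python
import io

def write_vec_text(entries, stream):
    """Write (name, vector) pairs in word2vec text format."""
    dim = len(entries[0][1])
    stream.write("%d %d\n" % (len(entries), dim))  # header: count and dimension
    for name, vector in entries:
        stream.write(name + " " + " ".join(repr(float(x)) for x in vector) + "\n")

def read_vec_text(stream):
    """Parse the same format back into a {name: vector} dict."""
    count, dim = (int(x) for x in stream.readline().split())
    result = {}
    for _ in range(count):
        parts = stream.readline().split()
        vector = [float(x) for x in parts[1:]]
        if len(vector) != dim:
            raise ValueError("row for %r does not match the header dimension" % parts[0])
        result[parts[0]] = vector
    return result

buffer = io.StringIO()
write_vec_text([("Startup.set_startup", [0.5, -1.25]),
                ("Lr>lr.set_startup_file", [2.0, 3.0])], buffer)
buffer.seek(0)
round_trip = read_vec_text(buffer)
print(round_trip)
```

Because fields are separated by spaces, this layout only works when names contain no whitespace, which holds for the interface names shown in this notebook.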

In [29]:
# Load the AW embeddings; a query is the mean embedding of the predicted words, used to find the nearest AW interfaces
model_aw = KeyedVectors.load_word2vec_format(PATH_AW_EMBEDDING, binary=False)

In [35]:
model_aw.similar_by_vector(get_aw_embedding('  set startup '))

[('Startup.set_startup', 0.9543663263320923),
 ('Startup.unset_startup', 0.9046234488487244),
 ('Cfm.set_startup_backupconf', 0.869148313999176),
 ('StartPer>startper.set_startup', 0.8677996397018433),
 ('Cfm.set_startup_systemsoftware', 0.8657248020172119),
 ('Lr>lr.set_startup_file', 0.8655503392219543),
 ('Cfm.unset_startup_backupconf', 0.8394883871078491),
 ('IsisOverload>overload.set_on_startup', 0.8092464804649353),
 ('StartPer>startper.power_down', 0.753767728805542),
 ('Startup.set_save', 0.7346445918083191)]
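
The scores in this ranking are cosine similarities between the query vector and each stored AW vector. The lookup can be sketched in plain Python as follows; `nearest_sketch` and the toy vectors are hypothetical, and a real index over 27817 entries would use vectorized operations rather than this loop.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def nearest_sketch(query, named_vectors, topn=3):
    """Return the topn (name, similarity) pairs, most similar first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in named_vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# Made-up 2-d vectors standing in for the AW embeddings.
toy_aw = {"a.set_startup": [1.0, 0.0],
          "a.unset_startup": [1.0, 1.0],
          "b.power_down": [0.0, 1.0]}
ranking = nearest_sketch([2.0, 0.0], toy_aw, topn=2)
print(ranking)
```

Cosine similarity ignores vector length, so the query `[2.0, 0.0]` matches `[1.0, 0.0]` perfectly even though the magnitudes differ.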