# Load pre-trained Word2Vec embedding

Download the pre-trained word vectors, which Facebook AI Research trained on Wikipedia using fastText, from [pre-trained word2vec](https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md). The vectors are distributed in the word2vec text format.

Unzip the archive to obtain wiki.zh.vec (text format, 821 MB).

In [1]:
import jieba
import numpy as np

class MeanEmbeddingVectorizer(object):
    def __init__(self, word2vec):
        self.word2vec = word2vec
        # if a text is empty we should return a vector of zeros
        # with the same dimensionality as all the other vectors
        self.dim = 300

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.array([
            np.mean([self.word2vec[w] for w in words if w in self.word2vec]
                    or [np.zeros(self.dim)], axis=0)
            for words in X
        ])
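
To make the averaging and the zero-vector fallback concrete, here is a minimal sketch on a toy 3-dimensional vocabulary. The words, the vectors, and the helper name mean_embedding are made up for illustration; the logic is meant to mirror what transform() does for a single tokenized sentence.

```python
import numpy as np

# Hypothetical toy vocabulary with 3-dimensional vectors (illustration only).
toy_w2v = {
    "a": np.array([1.0, 0.0, 2.0]),
    "b": np.array([3.0, 4.0, 0.0]),
}

def mean_embedding(words, word2vec, dim):
    """Average the vectors of in-vocabulary words; return zeros if none are known."""
    known = [word2vec[w] for w in words if w in word2vec]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# "unknown" is not in the vocabulary, so only "a" and "b" are averaged.
both = mean_embedding(["a", "b", "unknown"], toy_w2v, dim=3)
# No known words at all: the result falls back to a zero vector.
empty = mean_embedding(["unknown"], toy_w2v, dim=3)
print(both)
print(empty)
```

Out-of-vocabulary words do not pull the mean toward zero; they are simply left out of the average.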

Load wiki.zh.vec into a dict, w2v, that maps each word to its embedding vector.

In [10]:
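# load wiki.zh 300d word2vec embedding provided by Facebook
with open('../wiki.zh/wiki.zh.vec', "r", encoding='utf-8') as lines:
    next(lines)  # skip the header line: "<vocabulary size> <dimension>"
    # split each line once: the first field is the word, the rest is the vector
    rows = (line.split() for line in lines)
    w2v = {row[0]: np.asarray(row[1:], dtype='float32') for row in rows}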
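
The word2vec text format begins with a header line giving the vocabulary size and the vector dimension, followed by one line per word holding the word and its values. The sketch below parses that layout from a made-up two-word sample rather than the real file; load_vec and the sample values are assumptions for illustration, not part of fastText or of this notebook.

```python
import io
import numpy as np

def load_vec(lines):
    """Parse word2vec text format from an iterator of lines (hypothetical helper)."""
    # Header line: "<vocabulary size> <dimension>"
    _, dim = (int(x) for x in next(lines).split())
    vectors = {}
    for line in lines:
        fields = line.split()
        if len(fields) != dim + 1:
            continue  # skip malformed lines instead of failing
        vectors[fields[0]] = np.asarray(fields[1:], dtype="float32")
    return vectors, dim

# Made-up sample in the same layout as a .vec file (values are not real embeddings).
sample = io.StringIO("2 3\n美国 0.1 0.2 0.3\n北京 0.4 0.5 0.6\n")
toy_vectors, toy_dim = load_vec(sample)
print(toy_dim, sorted(toy_vectors))
```

Reading the dimension from the header avoids hard-coding it, and the length check guards against lines whose token contains unexpected whitespace.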

Show one word's 300d w2v embedding.

In [10]:
print(w2v[u'美国'])

[ -3.84820014e-01   6.58559978e-01  -2.58219987e-01  -5.38880005e-02
   7.19030023e-01  -2.08000004e-01  -4.21669990e-01   5.07480018e-02
   2.64129996e-01   3.55589986e-02   7.82469988e-01   2.35489994e-01
   1.04100001e+00  -6.62980020e-01   7.08630010e-02  -9.66240019e-02
   7.42129982e-01   6.64950013e-02  -1.05680001e+00  -5.42530000e-01
  -4.26470011e-01   9.47340012e-01   5.06449997e-01  -2.00859994e-01
   2.40410000e-01  -3.85729998e-01   8.07449996e-01   3.22939992e-01
   1.82650000e-01   1.60060003e-01   6.90800011e-01  -5.70349991e-01
  -7.97230005e-01  -6.56369984e-01  -1.06620002e+00   7.20200002e-01
   8.32350016e-01  -5.28789997e-01   2.31859997e-01   1.08270001e+00
   6.82020009e-01   8.13030005e-01   6.05300009e-01  -9.35899973e-01
  -3.28700006e-01  -7.18860030e-02  -7.02260017e-01  -4.53399986e-01
  -1.26190007e-01  -7.45549977e-01   1.35800004e+00  -2.86300004e-01
  -5.73430002e-01  -6.36030018e-01   3.93869996e-01  -9.20090020e-01
  -1.12530005e+00  -9.96249974e-01

# Compute the mean Word2Vec vector of the words in each sentence

Create a list of Chinese sentences (sents_str), then use jieba to segment each sentence into words, producing sents_tokenized.

Use the transform() method of sent2vec to convert each tokenized sentence into its mean embedding vector.

In [11]:
sent2vec = MeanEmbeddingVectorizer(w2v)

# test sentences
sents_str = [u'我来到北京清华大学',
         u'小明硕士毕业于中国科学院',
         u'小明后在日本京都大学深造']

# calling jieba word segmenter to prepare sent2vec input
sents_tokenized = [list(jieba.cut(sent, cut_all=False)) for sent in sents_str]
vecs = sent2vec.transform(sents_tokenized)
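
Because transform() skips tokens that are missing from the embedding dict, it can help to check how much of each sentence actually contributes to the mean. A minimal sketch, assuming a made-up vocabulary in place of the keys of w2v; coverage is a hypothetical helper, and the toy vocabulary says nothing about which words wiki.zh.vec really contains.

```python
def coverage(tokens, vocabulary):
    """Return (fraction of tokens found in the vocabulary, list of missing tokens)."""
    missing = [t for t in tokens if t not in vocabulary]
    found = len(tokens) - len(missing)
    ratio = found / len(tokens) if tokens else 0.0
    return ratio, missing

# Stand-in for the keys of w2v (illustration only).
toy_vocab = {"我", "来到", "北京"}
tokens = ["我", "来到", "北京", "清华大学"]
ratio, missing = coverage(tokens, toy_vocab)
print(ratio, missing)
```

With the real data, the same check could be run as coverage(words, w2v) for each entry of sents_tokenized.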

The output below shows the w2v_sentmean features corresponding to the three sentences. Only the first 5 elements of each vector are shown.

In [13]:
for jieba_out, vec in zip(sents_tokenized, vecs):
    print(jieba_out)
    print(vec[0:5])

['我', '来到', '北京', '清华大学']
[-0.1078045   0.62866789 -0.37951651  0.22102775  0.84329498]
['小明', '硕士', '毕业', '于', '中国科学院']
[-0.29119724  0.73300999 -0.62602252 -0.013232    0.74677247]
['小明', '后', '在', '日本京都大学', '深造']
[-0.07588175  1.02109993 -0.57511997  0.2206755   0.79821748]
