# Chapter 9: Vector Space Methods (I)

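[enwiki-20150112-400-r10-105752.txt.bz2](data/enwiki-20150112-400-r10-105752.txt.bz2) is a bzip2-compressed text file of 105,752 articles, sampled at random at a rate of 1/10 from the English Wikipedia articles of roughly 400 words or more as of January 12, 2015. Using this text as a corpus, we want to learn vectors (distributed representations) that represent the meanings of words. In the first half of Chapter 9, we apply principal component analysis to a word-context co-occurrence matrix built from the corpus, and implement the process of learning word vectors as a series of separate steps. In the second half of Chapter 9, we use the learned word vectors (300 dimensions) to compute word similarities and analogies.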

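Note that a straightforward implementation of problem 83 requires a large amount (about 7 GB) of main memory.
If you run out of memory, either make the processing more economical or use the 1/100-sampled corpus [enwiki-20150112-400-r100-10576.txt.bz2](data/enwiki-20150112-400-r100-10576.txt.bz2).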


In [1]:
#!wget -nc http://www.cl.ecei.tohoku.ac.jp/nlp100/data/enwiki-20150112-400-r10-105752.txt.bz2 -P ./data/
#!wget -nc http://www.cl.ecei.tohoku.ac.jp/nlp100/data/enwiki-20150112-400-r100-10576.txt.bz2 -P ./data/

In [2]:
#!bzip2 -d ./data/enwiki-20150112-400-r10-105752.txt.bz2
#!bzip2 -d ./data/enwiki-20150112-400-r100-10576.txt.bz2

In [3]:
!head -3 ./data/enwiki-20150112-400-r100-10576.txt

Anarchism

Anarchism is a political philosophy that advocates stateless societies often defined as self-governed voluntary institutions, but that several authors have defined as more specific institutions based on non-hierarchical free associations. Anarchism holds the state to be undesirable, unnecessary, or harmful. While anti-statism is central, anarchism entails opposing authority or hierarchical organisation in the conduct of human relations, including, but not limited to, the state system.


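## 80. Formatting the corpus

The simplest way to convert a sentence into a sequence of words is to split it on whitespace.
With this approach, however, symbols such as sentence-final periods and parentheses end up attached to the words.
Therefore, after splitting each line of the corpus into a list of tokens on whitespace, apply the following processing to each token to remove the symbols from the words.

+ Remove the following characters when they appear at the beginning or end of a token: `.,!?;:()[]'"`
+ Discard tokens that become empty strings

After applying the processing above, join the tokens with spaces and save them to a file.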
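As a small worked example of the stripping rule above (the sample line is made up for illustration and the helper name `strip_tokens` is hypothetical), note that `str.strip` with a character set only removes leading and trailing characters, so periods and hyphens inside a token survive:

```python
# Characters to strip from both ends of each token (taken from the task statement).
STRIP_CHARS = '.,!?;:()[]\'"'

def strip_tokens(text):
    """Split on whitespace, strip edge punctuation, and drop tokens that become empty."""
    stripped = (token.strip(STRIP_CHARS) for token in text.split())
    return [token for token in stripped if token]

# Hypothetical sample line, not taken from the corpus.
sample = 'The "U.S." (self-governed), ... really?'
print(' '.join(strip_tokens(sample)))
```

Here `"U.S."` loses its quotes and its final period but keeps the inner one, which is why that word appears as `U.S` after this step.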

In [4]:
def tokenize(text):
    return [token.strip('.,!?;:()[]\'"') for token in text.split() if token.strip('.,!?;:()[]\'"')]

In [5]:
with open('./work/q80.txt', 'w') as fw:
    with open('./data/enwiki-20150112-400-r100-10576.txt', 'r') as fr:
        for line in fr:
            if line.rstrip():
                token_list = tokenize(line.rstrip())
                fw.write(' '.join(token_list) + '\n')

In [6]:
!head -5 ./work/q80.txt

Anarchism
Anarchism is a political philosophy that advocates stateless societies often defined as self-governed voluntary institutions but that several authors have defined as more specific institutions based on non-hierarchical free associations Anarchism holds the state to be undesirable unnecessary or harmful While anti-statism is central anarchism entails opposing authority or hierarchical organisation in the conduct of human relations including but not limited to the state system
As a subtle and anti-dogmatic philosophy anarchism draws on many currents of thought and strategy Anarchism does not offer a fixed body of doctrine from a single particular world view instead fluxing and flowing as a philosophy There are many types and traditions of anarchism not all of which are mutually exclusive Anarchist schools of thought can differ fundamentally supporting anything from extreme individualism to complete collectivism Strains of anarchism have often been divided into the categories 

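## 81. Handling multiword country names

In English, a sequence of several words can carry meaning as a unit. For example, the United States of America is written "United States" and Britain is written "United Kingdom", but the individual words "United", "States", and "Kingdom" are ambiguous as to the concept or entity they refer to. We would therefore like to recognize compound expressions in the corpus and treat each one as a single word, so that the meaning of the compound can be estimated. Recognizing compounds accurately is very difficult, however, so here we only recognize country names that consist of multiple words.

Obtain a list of country names from the internet yourself, and for the multiword country names that appear in the corpus from 80, replace the spaces with underscores. For example, "United States" should become "United_States", and "Isle of Man" should become "Isle_of_Man".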
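One possible way to do this replacement is sketched below, under some assumptions: the country names are first normalized with the same edge-punctuation stripping used in 80 (so that a name such as "Holy See (Vatican City State)" can match the cleaned corpus), and matches are required to start and end at whitespace boundaries. The helper names `normalize`, `build_pattern`, and `join_countries` and the short name list are hypothetical, chosen only for illustration.

```python
import re

STRIP_CHARS = '.,!?;:()[]\'"'

def normalize(name):
    """Tokenize a country name the same way the corpus was tokenized in 80."""
    tokens = (token.strip(STRIP_CHARS) for token in name.split())
    return [token for token in tokens if token]

def build_pattern(names):
    """Compile one regex matching any multiword name, names with more tokens first."""
    multi = [tokens for tokens in map(normalize, names) if len(tokens) > 1]
    multi.sort(key=len, reverse=True)
    alternatives = [' '.join(re.escape(token) for token in tokens) for tokens in multi]
    # (?<!\S) and (?!\S) require whitespace or a string edge around the match.
    return re.compile(r'(?<!\S)(?:' + '|'.join(alternatives) + r')(?!\S)')

def join_countries(text, pattern):
    """Replace the spaces inside every matched country name with underscores."""
    return pattern.sub(lambda m: m.group(0).replace(' ', '_'), text)

# Hypothetical sample of a country list.
names = ['United States', 'Isle of Man', 'Japan', 'Holy See (Vatican City State)']
pattern = build_pattern(names)
print(join_countries('from the United States to the Isle of Man', pattern))
```

Trying names with more tokens first keeps a shorter name from claiming part of a longer one, and the boundary checks avoid rewriting text such as "Reunited States".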

In [7]:
import pandas as pd

countries = pd.read_csv('https://raw.githubusercontent.com/datasets/country-list/master/data.csv')
countries.head()

Unnamed: 0,Name,Code
0,Afghanistan,AF
1,Åland Islands,AX
2,Albania,AL
3,Algeria,DZ
4,American Samoa,AS


In [8]:
countries['n_tokens'] = countries.Name.apply(tokenize).apply(len)   # add a column holding the number of tokens in each name
countries.head()

Unnamed: 0,Name,Code,n_tokens
0,Afghanistan,AF,1
1,Åland Islands,AX,2
2,Albania,AL,1
3,Algeria,DZ,1
4,American Samoa,AS,2


### pandas.DataFrame.query
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.query.html
### pandas.DataFrame.sort_values
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html

In [9]:
# Does the content of Name need to be tokenized as well??
countries.query('n_tokens > 1', inplace=True)   # modify the original DataFrame in place (inplace=True)
countries.sort_values('n_tokens', ascending=False, inplace=True)   # descending order
countries.head()

Unnamed: 0,Name,Code,n_tokens
185,"Saint Helena, Ascension and Tristan da Cunha",SH,7
206,South Georgia and the South Sandwich Islands,GS,7
51,"Congo, the Democratic Republic of the",CD,6
131,"Macedonia, the Former Yugoslav Republic of",MK,6
97,Holy See (Vatican City State),VA,5


In [10]:
# countries.Name

In [11]:
multiword_countries = [(name, name.replace(' ', '_')) for name in countries.Name]

multi_list = []

def replace_multiword_countries(text, multiword_countries):
    for old, new in multiword_countries:
        if old in text:
            text = text.replace(old, new)
            multi_list.append(new)   # for verification
    return text

In [12]:
with open('./work/q81.txt', 'w') as fw:
    with open('./work/q80.txt', 'r') as fr:
        for line in fr:
            fw.write(replace_multiword_countries(line.rstrip(), multiword_countries) + '\n')

In [13]:
!head -5 ./work/q81.txt

Anarchism
Anarchism is a political philosophy that advocates stateless societies often defined as self-governed voluntary institutions but that several authors have defined as more specific institutions based on non-hierarchical free associations Anarchism holds the state to be undesirable unnecessary or harmful While anti-statism is central anarchism entails opposing authority or hierarchical organisation in the conduct of human relations including but not limited to the state system
As a subtle and anti-dogmatic philosophy anarchism draws on many currents of thought and strategy Anarchism does not offer a fixed body of doctrine from a single particular world view instead fluxing and flowing as a philosophy There are many types and traditions of anarchism not all of which are mutually exclusive Anarchist schools of thought can differ fundamentally supporting anything from extreme individualism to complete collectivism Strains of anarchism have often been divided into the categories 

In [14]:
print(multi_list[:20])

['United_States', 'United_States', 'United_States', 'United_States', 'United_Kingdom', 'United_States', 'United_States', 'United_States', 'United_States', 'United_States', 'United_States', 'United_States', 'Hong_Kong', 'United_States', 'United_States', 'United_States', 'Equatorial_Guinea', 'United_States', 'United_States', 'United_States']


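## 82. Extracting contexts

For every word $t$ that appears in the corpus created in 81, write out all pairs of the word $t$ and a context word $c$ in tab-separated format. Context words are defined as follows.

+ The $d$ words before and after a word $t$ are extracted as context words $c$ (the word $t$ itself is not included among the context words).
+ Each time a word $t$ is selected, the context width $d$ is chosen at random from $\{1, 2, 3, 4, 5\}$.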

In [15]:
def get_context_tokens(tokens, target_i, window_size):
    # Return the tokens to the left and right of the target word (the target itself is excluded)
    return tokens[max(0, target_i - window_size) : target_i] + tokens[target_i + 1 : min(len(tokens), target_i + window_size + 1)]

In [16]:
import numpy as np

def generate_word_context_pairs(tokens):
    window_sizes = np.random.randint(1, 6, len(tokens))
    # print(window_sizes)
    
    for target_i, (token, window_size) in enumerate(zip(tokens, window_sizes)):
        for context in get_context_tokens(tokens, target_i, window_size):
            yield token, context

In [17]:
test_tokens = ['This', 'is', 'a', 'test', 'sentence']

In [18]:
results = [result for result in generate_word_context_pairs(test_tokens)]

In [19]:
results

[('This', 'is'),
 ('This', 'a'),
 ('This', 'test'),
 ('This', 'sentence'),
 ('is', 'This'),
 ('is', 'a'),
 ('is', 'test'),
 ('is', 'sentence'),
 ('a', 'This'),
 ('a', 'is'),
 ('a', 'test'),
 ('a', 'sentence'),
 ('test', 'This'),
 ('test', 'is'),
 ('test', 'a'),
 ('test', 'sentence'),
 ('sentence', 'is'),
 ('sentence', 'a'),
 ('sentence', 'test')]

In [20]:
with open('./work/q82.txt', 'w') as fw:
    with open('./work/q81.txt', 'r') as fr:
        for line in fr:
            tokens = line.strip().split()
            for word, context in generate_word_context_pairs(tokens):
                fw.write(word + '\t' + context + '\n')

In [21]:
!head ./work/q82.txt

Anarchism	is
Anarchism	a
Anarchism	political
is	Anarchism
is	a
is	political
is	philosophy
is	that
a	Anarchism
a	is


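## 83. Counting word and context frequencies

Using the output of 82, compute the following occurrence distributions and constant.

+ $f(t,c)$: the number of co-occurrences of word $t$ and context word $c$
+ $f(t,*)$: the number of occurrences of word $t$
+ $f(*,c)$: the number of occurrences of context word $c$
+ $N$: the total number of occurrences of word and context word pairs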
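To make the four quantities concrete, here is a minimal sketch on a handful of made-up pairs (the toy data is an assumption, standing in for lines of the pair file from 82). It also checks that a marginal count is simply the joint count summed over the other variable:

```python
from collections import Counter

# Toy (word, context) pairs; hypothetical data for illustration only.
pairs = [('the', 'of'), ('of', 'the'), ('the', 'of'), ('the', 'cat')]

f_tc = Counter(pairs)                 # f(t, c): co-occurrence counts
f_t = Counter(t for t, _ in pairs)    # f(t, *): word counts
f_c = Counter(c for _, c in pairs)    # f(*, c): context word counts
N = len(pairs)                        # N: total number of pairs

# f(t, *) equals the sum of f(t, c) over all context words c.
marginal_the = sum(count for (t, _), count in f_tc.items() if t == 'the')

print(f_tc[('the', 'of')], f_t['the'], f_c['of'], N)
```

Because the marginals can be recovered from `f_tc` alone, keeping only the joint counter is one way to trade time for memory if the three counters do not fit together.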

### collections.Counter
https://docs.python.jp/3/library/collections.html#collections.Counter

In [26]:
from collections import Counter

f_tc = Counter()
f_t = Counter()
f_c = Counter()
N = 0

with open('./work/q82.txt', 'r') as f:
    for line in f:
        word, context = line.strip().split()
        f_tc[(word, context)] += 1
        f_t[word] += 1
        f_c[context] += 1
        N += 1

In [27]:
print('rank\ttc\tt\tc')
for i, (tc, t, c) in enumerate(zip(f_tc.most_common(10), f_t.most_common(10), f_c.most_common(10))):
    print(str(i+1) + '\t' + str(tc) + '\t' + str(t) + '\t' + str(c))
    
print('\nN = ' + str(N))

rank	tc	t	c
1	(('the', 'of'), 282854)	('the', 4427791)	('the', 4423959)
2	(('of', 'the'), 282547)	('of', 2396751)	('of', 2397073)
3	(('the', 'the'), 157343)	('and', 2090338)	('and', 2093119)
4	(('the', 'in'), 129229)	('in', 1656493)	('in', 1656118)
5	(('in', 'the'), 129066)	('to', 1548035)	('to', 1550016)
6	(('the', 'and'), 116222)	('a', 1381584)	('a', 1381063)
7	(('and', 'the'), 115909)	('was', 801279)	('was', 802717)
8	(('the', 'to'), 107256)	('is', 613460)	('is', 612348)
9	(('to', 'the'), 106865)	('The', 607701)	('The', 609595)
10	(('of', 'and'), 57000)	('for', 581112)	('for', 580138)

N = 68068416


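## 84. Building the word-context matrix

Using the output of 83, build the word-context matrix $X$, where each element $X_{tc}$ is defined as follows.

+ If $f(t, c) \geq 10$, then $X_{tc} = {\rm PPMI}(t, c) = \max\{\log \frac{N \times f(t,c)}{f(t,*) \times f(*,c)}, 0\}$
+ If $f(t, c) < 10$, then $X_{tc} = 0$

Here, ${\rm PPMI}(t, c)$ is a statistic called Positive Pointwise Mutual Information. Note that the numbers of rows and columns of $X$ are on the order of millions, so it is not feasible to hold every element of the matrix in main memory. Fortunately, most elements of $X$ are $0$, so it is enough to write out only the nonzero elements.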
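As a worked instance of the definition, the sketch below plugs in counts taken from the top rows of the frequency table printed in 83 (for the 1/100 corpus). The function name `ppmi` and its `min_count` parameter are hypothetical; only the formula comes from the task statement.

```python
import math

def ppmi(f_tc, f_t, f_c, n_pairs, min_count=10):
    """PPMI as defined above; pairs seen fewer than min_count times map to 0."""
    if f_tc < min_count:
        return 0.0
    return max(math.log(n_pairs * f_tc / (f_t * f_c)), 0.0)

N = 68068416  # total number of pairs reported in 83

# ('the', 'of'): co-occurs more often than chance, so the value is positive.
the_of = ppmi(282854, 4427791, 2397073, N)

# ('the', 'the'): co-occurs less often than chance, so the negative PMI is clipped to 0.
the_the = ppmi(157343, 4427791, 4423959, N)

print(the_of, the_the)
```

The first value comes out a little under 0.6, while the second is exactly 0 because the argument of the logarithm is below 1.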

In [28]:
word2idx_t = {w:i for i, w in enumerate(f_t)}
word2idx_c = {w:i for i, w in enumerate(f_c)}
idx2word_t = {i:w for i, w in enumerate(f_t)}
idx2word_c = {i:w for i, w in enumerate(f_c)}

In [29]:
import math

def calc_ppmi(t, c, f_tc, f_t, f_c, N):
    pmi = math.log((N * f_tc[(t, c)]) / (f_t[t] * f_c[c]))
    return max(pmi, 0)

### Sparse matrices
http://hamukazu.com/2014/09/26/scipy-sparse-basics/

In [30]:
from scipy.sparse import lil_matrix, csr_matrix

In [31]:
X = lil_matrix((len(f_t), len(f_c)))

In [32]:
for ((t,c), count) in f_tc.most_common():
    if count < 10:
        break
    X[word2idx_t[t], word2idx_c[c]] = calc_ppmi(t, c, f_tc, f_t, f_c, N)

In [33]:
X.todense()

matrix([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.58043723, 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.58429949, 0.        , 0.30772033, ..., 0.        , 0.        ,
         0.        ],
        ...,
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ]])

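## 85. Dimensionality reduction with principal component analysis

Apply principal component analysis to the word-context matrix obtained in 84, and compress the word meaning vectors to 300 dimensions.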
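The cell that follows uses scikit-learn's `TruncatedSVD`, which factorizes the matrix without centering it first, so it is a close relative of PCA rather than PCA in the strict sense; that makes it practical for a large sparse matrix. The idea can be sketched with plain NumPy on a tiny dense matrix (the toy matrix and `k = 2` are assumptions for illustration):

```python
import numpy as np

# Toy stand-in for the PPMI matrix (rows: words, columns: context words).
X = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 0.0],
              [1.0, 0.0, 2.0],
              [0.0, 1.0, 0.0]])
k = 2  # target dimensionality (300 in the task)

# Thin SVD: X = U @ diag(S) @ Vt, singular values in descending order.
U, S, Vt = np.linalg.svd(X, full_matrices=False)

# Keep the top-k components: one k-dimensional vector per row (word) of X.
reduced = U[:, :k] * S[:k]

# Projecting X onto the top-k right singular vectors gives the same vectors.
projected = X @ Vt[:k].T

print(reduced.shape, np.allclose(reduced, projected))
```

Rows 0 and 2 of the toy matrix are identical, so they receive identical reduced vectors, which is the property the later similarity computations rely on.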


In [34]:
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=300, random_state=0)
wordvec = svd.fit_transform(X)

## 86. Displaying a word vector

Load the word meaning vectors obtained in 85 and display the vector for "United States". Note that "United States" is represented internally as "United_States".


In [35]:
import pandas as pd

wv_df = pd.DataFrame(wordvec)
wv_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,290,291,292,293,294,295,296,297,298,299
0,2.460927e-13,-6.844896e-13,-1.182520e-12,-3.472276e-13,8.756273e-13,8.690699e-13,1.824721e-13,-9.096301e-13,1.643200e-13,-1.545890e-12,...,-1.137739e-12,-5.987609e-13,2.840928e-14,-1.965313e-13,1.025863e-12,-4.667851e-13,-1.229447e-12,2.574192e-13,-8.312946e-13,-1.768165e-12
1,1.088355e+01,-9.813627e+00,1.412993e+01,-7.595616e-01,4.544634e-02,1.021325e+01,-2.488177e+00,1.021237e+00,-3.544547e+00,6.645135e-01,...,5.091738e-03,-3.172501e-01,-2.078844e-01,-2.128880e-01,-1.008824e-01,-2.463419e-01,2.622151e-01,5.174051e-01,-3.943457e-01,-2.004722e-01
2,1.334150e+01,-1.096851e+01,2.153908e+01,3.415285e+00,-8.220613e+00,1.769258e+01,8.417137e+00,-1.018583e+01,-7.523641e+00,-3.942515e+00,...,3.715150e-02,-1.100771e-01,1.033480e-01,-9.295873e-02,3.594049e-01,-1.844290e-01,-9.515895e-02,5.863425e-02,2.720474e-02,-2.912318e-01
3,2.856420e+00,-2.281151e+00,3.900335e-01,3.368598e-01,-2.639881e+00,2.189248e-01,-3.317760e-01,9.176886e-01,-1.173995e-01,2.904791e+00,...,-1.800503e-01,4.488974e-01,-5.665919e-01,4.528470e-01,-2.130982e-01,-1.112813e+00,2.076987e-01,3.166897e-01,-6.742635e-01,-3.867269e-01
4,8.538492e-01,-5.540700e-01,-7.458228e-01,3.369258e-01,-2.159818e-01,5.570205e-01,2.815038e-01,-1.013967e-01,2.548310e-02,2.182103e-01,...,-5.344505e-01,1.780250e-01,-1.667240e-01,1.955697e-02,-3.973619e-01,1.790459e-01,-3.196737e-01,-7.108826e-02,5.149204e-02,3.487818e-02
5,1.127609e+01,-1.025844e+01,1.338673e+01,-8.131937e+00,8.583241e+00,4.066046e+00,-2.058713e+00,-1.586903e+00,-1.563394e+00,1.484563e+00,...,-1.784628e-01,2.238841e-01,-1.991190e-01,-4.389363e-01,3.823723e-01,-5.892128e-01,2.178568e-01,1.942964e-01,-9.838242e-02,2.182350e-01
6,2.086978e-01,-2.028752e-01,-4.313913e-01,2.812408e-02,-1.642490e-03,2.729850e-01,1.497318e-02,-1.030563e-01,1.279059e-01,2.728767e-02,...,-1.039096e-02,-1.490367e-03,-1.201295e-02,-4.899424e-02,6.666673e-03,1.810397e-02,8.778931e-03,1.405546e-03,1.596749e-02,-4.604817e-02
7,-3.333818e-17,-3.567670e-16,-2.500388e-15,6.885675e-15,-5.292789e-15,1.628544e-14,-8.734249e-14,-7.410360e-14,-5.317170e-14,1.089309e-13,...,-3.011642e-13,-3.243437e-13,8.210578e-13,-3.621599e-13,-2.332446e-12,4.322316e-13,-7.747500e-13,7.450634e-13,1.131106e-12,7.511588e-13
8,5.638180e-01,-5.266131e-01,-7.637317e-01,-1.409223e-01,-4.064791e-01,2.098459e-01,-3.323059e-01,4.308274e-01,1.598985e-01,3.881220e-01,...,2.965615e-03,-1.390317e-03,-1.785724e-02,-8.670345e-02,9.350514e-02,-1.286208e-01,-5.041828e-02,-4.219508e-02,-1.070577e-02,1.435861e-02
9,3.217519e+00,-3.010133e+00,1.993065e-01,-2.148144e+00,-6.894042e-02,1.273306e-01,-1.740555e+00,1.574082e+00,1.277484e-01,1.935056e+00,...,2.767808e-01,5.538476e-02,-4.262907e-02,-6.416395e-01,-9.275217e-02,2.808856e-01,-6.441908e-01,5.029069e-01,1.482116e-01,4.040402e-01


### pandas.DataFrame.loc
> loc selects rows and columns by label.

http://ailaby.com/lox_iloc_ix/

In [36]:
vec = wv_df.loc
vec[word2idx_t['United_States']]

0      3.624157
1     -0.798077
2      0.182374
3      6.580568
4      2.412249
5     -1.198536
6     -2.659294
7      0.414337
8      0.009888
9      0.475007
10    -0.246590
11     0.982270
12    -3.152961
13    -2.099592
14     2.657687
15     0.983020
16    -4.355435
17     0.597528
18     0.781898
19     0.202509
20    -0.199215
21     0.777514
22    -0.203664
23    -1.526082
24    -0.636947
25     0.767524
26     0.439618
27     3.670450
28     2.752429
29     3.531579
         ...   
270   -0.151609
271    1.025357
272   -0.345274
273    0.864976
274   -0.444978
275   -0.112823
276   -0.007311
277    0.465248
278   -0.063938
279   -0.787044
280    0.501671
281   -0.647129
282   -0.707338
283    0.520256
284   -0.324528
285    0.319336
286   -0.310496
287   -0.007833
288   -0.962127
289   -0.252792
290    0.897604
291    1.473350
292    0.210196
293    0.400210
294    0.431741
295    0.717707
296   -0.299395
297    0.430315
298   -0.207797
299   -0.711379
Name: 276, Length: 300, 

## 87. Word similarity

Load the word meaning vectors obtained in 85 and compute the cosine similarity between "United States" and "U.S.". Note that "U.S." is represented internally as "U.S".
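For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. A dependency-free sketch follows; the function name `cosine_sim` and the choice to return NaN for a zero vector are assumptions made for this illustration, not the behavior of any particular library.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity of two equal-length sequences of numbers."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # The similarity is undefined when either vector has zero length.
        return float('nan')
    return dot / (norm_u * norm_v)

print(cosine_sim([1.0, 2.0], [2.0, 4.0]))   # parallel vectors
print(cosine_sim([1.0, 0.0], [0.0, 1.0]))   # orthogonal vectors
print(cosine_sim([0.0, 0.0], [1.0, 1.0]))   # zero vector: undefined
```

The zero-vector case matters here because words whose row of the word-context matrix is entirely zero can end up with vectors that are zero or numerically close to it.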

### scipy.spatial.distance.cosine
https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cosine.html#scipy.spatial.distance.cosine

In [37]:
from scipy.spatial.distance import cosine

1 - cosine(vec[word2idx_t['United_States']], vec[word2idx_t['U.S']])

0.8193708976794124

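## 88. The 10 most similar words

Load the word meaning vectors obtained in 85 and output the 10 words with the highest cosine similarity to "England", together with their similarities.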
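Selecting the top n entries from an ascending sort order is easy to get wrong by one or two positions, so here is a small sketch of the slice arithmetic on a plain list (the scores are made up; list slicing follows the same rules as slicing a NumPy `argsort` result):

```python
# Hypothetical similarity scores, indexed by word id.
scores = [0.1, 0.9, 0.4, 0.7, 0.3, 0.8]
n = 3

# Indices sorted by ascending score, analogous to numpy's argsort.
order = sorted(range(len(scores)), key=scores.__getitem__)

top_n = order[::-1][:n]              # reverse, then take the first n
same_top_n = order[: -n - 1 : -1]    # equivalent single slice
too_short = order[: -n + 1 : -1]     # stops two positions early: n - 2 items

print(top_n, same_top_n, too_short)
```

Reversing first and then truncating is the easier form to read, and it also behaves sensibly when n exceeds the number of items.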

### sklearn.metrics.pairwise.cosine_similarity
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html

In [38]:
from sklearn.metrics.pairwise import cosine_similarity

def most_similar(wv_df, target_vec, n=10):
    # .values.reshape is used because calling .reshape directly produces a warning
    cossims = cosine_similarity(target_vec.values.reshape(1, 300), wv_df)[0]
    print(len(cossims))   # for verification
    # NOTE: this slice yields n-2 indices (8 for n=10); [: -n-1 : -1] would yield n
    top_cossim_indices = cossims.argsort()[: -n+1 : -1]
    return [(cossims[i], idx2word_t[i]) for i in top_cossim_indices]

In [39]:
most_similar(wv_df, vec[word2idx_t['England']])

383349


[(0.9999999999999998, 'England'),
 (0.6513227113516088, 'Scotland'),
 (0.6278695768618816, 'Wales'),
 (0.6053167458955393, 'Australia'),
 (0.5980394678744362, 'Spain'),
 (0.5934639570648795, 'Italy'),
 (0.5842730965710573, 'France'),
 (0.555217107522618, 'Germany')]

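## 89. Analogy by additive compositionality

Load the word meaning vectors obtained in 85, compute vec("Spain") - vec("Madrid") + vec("Athens"), and output the 10 words most similar to that vector, together with their similarities.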
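The arithmetic can be illustrated on hand-made two-dimensional vectors. Everything below is a toy assumption: the vectors, the helper names, and in particular the choice to leave the three query words out of the ranking, which is a common convention but not something the task statement asks for.

```python
import math

# Hypothetical 2-D vectors: first axis ~ "is a country", second axis ~ "which nation".
vectors = {
    'Spain': (1.0, 1.0), 'Madrid': (0.0, 1.0),
    'Greece': (1.0, 2.0), 'Athens': (0.0, 2.0),
    'France': (1.0, 3.0),
}

def cosine_sim(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def analogy(vectors, a, b, c, n=10):
    """Rank words by similarity to vec(a) - vec(b) + vec(c), skipping a, b and c."""
    target = tuple(x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c]))
    candidates = [(word, cosine_sim(target, vec))
                  for word, vec in vectors.items() if word not in (a, b, c)]
    return sorted(candidates, key=lambda item: item[1], reverse=True)[:n]

result = analogy(vectors, 'Spain', 'Madrid', 'Athens')
print(result[0][0])
```

Without the exclusion step, one of the query words often ranks first simply because the target vector stays close to it.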


In [40]:
most_similar(wv_df, vec[word2idx_t['Spain']] - vec[word2idx_t['Madrid']] + vec[word2idx_t['Athens']])

383349


[(0.9605210639279498, 'Spain'),
 (0.8466364315714688, 'France'),
 (0.8416231375482882, 'Sweden'),
 (0.8266170538087363, 'Austria'),
 (0.8055504913141227, 'Italy'),
 (0.782976764942752, 'Belgium'),
 (0.7828521287975335, 'Portugal'),
 (0.7755049649526187, 'Netherlands')]

### For Chapter 10, q092

In [53]:
f_t['Fabbri']

25

In [54]:
f_t['Fabbr']

0

In [48]:
from sklearn.metrics.pairwise import cosine_similarity

def most_similar(wv_df, target_vec, n =10):
    
    cossims = cosine_similarity(target_vec.values.reshape(1, 300), wv_df)[0]   # calling .reshape instead of .values.reshape produces a warning
    # print(len(cossims))   # for verification
    top_cossim_indices = cossims.argsort()[: -n+1 : -1]
    return [(cossims[i], idx2word_t[i]) for i in top_cossim_indices]

In [49]:
with open('../chapter10/work/q092by85', 'w') as fw:
    with open('../chapter10/work/q091.txt', 'r') as fr:
        for line in fr:
            words = line.strip().split()
            if f_t[words[0]] > 0 and f_t[words[1]] > 0 and f_t[words[2]] > 0:   # only when all three words are in the vocabulary
                most_similar_word = most_similar(wv_df, vec[word2idx_t[words[1]]] - vec[word2idx_t[words[0]]] + vec[word2idx_t[words[2]]])[0]
                fw.write(line.strip() + ' ' + most_similar_word[1] + ' ' + str(most_similar_word[0]) + '\n')

### For Chapter 10, q094

In [51]:
from scipy.spatial.distance import cosine

with open('../chapter10/work/q094by85', 'w') as fw:
    with open('../chapter10/data/wordsim353/combined.tab', 'r') as fr:
        for i, line in enumerate(fr):
            if i == 0:
                continue
            words = line.strip().split()
            if f_t[words[0]] > 0 and f_t[words[1]] > 0:   # only when both words are in the vocabulary
                fw.write(line.strip() + '\t' + str(1 - cosine(vec[word2idx_t[words[0]]], vec[word2idx_t[words[1]]])) + '\n')

  dist = 1.0 - uv / np.sqrt(uu * vv)
