# NLP 100 Exercise 2020 (Rev 2)



## Chapter 7: Word Vectors

### 60. Loading and Displaying Word Vectors

In [3]:
# Pretrained GoogleNews word2vec vectors:
# https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing

# References:
# https://stmind.hatenablog.com/entry/2017/06/18/230106
# https://blog.amedama.jp/entry/gensim-fasttext-pre-trained-word-vectors

import gensim

# Load the binary word2vec file as KeyedVectors (the file must be in the working directory)
googlenews_w2v = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

In [10]:
# Print the vector for "United_States" and its dimensionality
print(googlenews_w2v["United_States"])
print(len(googlenews_w2v["United_States"]))


[-3.61328125e-02 -4.83398438e-02  2.35351562e-01  1.74804688e-01
 -1.46484375e-01 -7.42187500e-02 -1.01562500e-01 -7.71484375e-02
  1.09375000e-01 -5.71289062e-02 -1.48437500e-01 -6.00585938e-02
  1.74804688e-01 -7.71484375e-02  2.58789062e-02 -7.66601562e-02
 -3.80859375e-02  1.35742188e-01  3.75976562e-02 -4.19921875e-02
 -3.56445312e-02  5.34667969e-02  3.68118286e-04 -1.66992188e-01
 -1.17187500e-01  1.41601562e-01 -1.69921875e-01 -6.49414062e-02
 -1.66992188e-01  1.00585938e-01  1.15722656e-01 -2.18750000e-01
 -9.86328125e-02 -2.56347656e-02  1.23046875e-01 -3.54003906e-02
 -1.58203125e-01 -1.60156250e-01  2.94189453e-02  8.15429688e-02
  6.88476562e-02  1.87500000e-01  6.49414062e-02  1.15234375e-01
 -2.27050781e-02  3.32031250e-01 -3.27148438e-02  1.77734375e-01
 -2.08007812e-01  4.54101562e-02 -1.23901367e-02  1.19628906e-01
  7.44628906e-03 -9.03320312e-03  1.14257812e-01  1.69921875e-01
 -2.38281250e-01 -2.79541016e-02 -1.21093750e-01  2.47802734e-02
  7.71484375e-02 -2.81982


### 61. Word Similarity

In [11]:
# https://qiita.com/Qiitaman/items/fa393d93ce8e61a857b1

import numpy as np

def cos_sim(v1, v2):
    """Cosine similarity: dot product divided by the product of the L2 norms."""
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

X = googlenews_w2v["United_States"]
Y = googlenews_w2v["U.S."]

print(cos_sim(X, Y))


0.7310775



### 62. Top 10 Most Similar Words

In [13]:
# https://qiita.com/DancingEnginee1/items/b10c8ef7893d99aa53be

# most_similar returns the 10 nearest words by cosine similarity by default (topn=10)
print(googlenews_w2v.most_similar('United_States'))

[('Unites_States', 0.7877248525619507), ('Untied_States', 0.7541370987892151), ('United_Sates', 0.7400725483894348), ('U.S.', 0.7310774922370911), ('theUnited_States', 0.6404394507408142), ('America', 0.6178409457206726), ('UnitedStates', 0.6167312264442444), ('Europe', 0.6132988929748535), ('countries', 0.6044804453849792), ('Canada', 0.6019068956375122)]
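
To make the ranking step concrete, here is a minimal brute-force sketch of a top-k cosine search. The vocabulary, the 3-dimensional vectors, and the helper name `top_k_similar` are hypothetical and only illustrate the idea; this is not gensim's implementation.

```python
import numpy as np

# Hypothetical toy vocabulary and 3-dimensional vectors, for illustration only.
vocab = ["king", "queen", "apple", "banana"]
vectors = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.3, 0.0],
    [0.0, 0.2, 0.9],
    [0.1, 0.1, 0.8],
])

def top_k_similar(word, k=2):
    """Return the k words with the highest cosine similarity to `word`."""
    # Normalize every row to unit length so a dot product equals the cosine.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit[vocab.index(word)]
    # Sort by descending similarity and drop the query word itself.
    order = [i for i in np.argsort(-sims) if vocab[i] != word]
    return [(vocab[i], float(sims[i])) for i in order[:k]]

print(top_k_similar("king"))
```

With these toy vectors, "queen" ranks first for the query "king" because the two vectors point in nearly the same direction.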


### 63. Analogy via Additive Compositionality

In [14]:
# https://www.pc-koubou.jp/magazine/9905

# vec("Spain") - vec("Madrid") + vec("Athens")
googlenews_w2v.most_similar(positive=["Spain", "Athens"], negative=["Madrid"], topn=10)

[('Greece', 0.6898480653762817),
 ('Aristeidis_Grigoriadis', 0.5606847405433655),
 ('Ioannis_Drymonakos', 0.5552908778190613),
 ('Greeks', 0.545068621635437),
 ('Ioannis_Christou', 0.5400862693786621),
 ('Hrysopiyi_Devetzi', 0.5248445272445679),
 ('Heraklio', 0.5207759737968445),
 ('Athens_Greece', 0.516880989074707),
 ('Lithuania', 0.5166866183280945),
 ('Iraklion', 0.5146791338920593)]
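
The arithmetic behind the `positive`/`negative` query can be sketched with plain NumPy. The toy vectors and the helper names `unit` and `analogy` below are hypothetical. The sketch assumes, as gensim appears to do, that each input vector is normalized to unit length before being added or subtracted and that the input words are excluded from the results.

```python
import numpy as np

# Hypothetical 3-dimensional toy vectors, chosen so that
# Spain - Madrid + Athens points exactly toward Greece.
toy_vectors = {
    "Madrid": np.array([1.0, 0.0, 1.0]),
    "Spain":  np.array([0.0, 1.0, 1.0]),
    "Athens": np.array([1.0, 0.0, -1.0]),
    "Greece": np.array([0.0, 1.0, -1.0]),
    "Paris":  np.array([1.0, 0.0, 0.2]),
    "France": np.array([0.0, 1.0, 0.2]),
}

def unit(v):
    """Scale a vector to unit L2 norm."""
    return v / np.linalg.norm(v)

def analogy(positive, negative, topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    query = np.zeros(3)
    for w in positive:
        query += unit(toy_vectors[w])
    for w in negative:
        query -= unit(toy_vectors[w])
    query = unit(query)
    # Exclude the query words themselves from the candidates.
    exclude = set(positive) | set(negative)
    scored = [(w, float(unit(v) @ query))
              for w, v in toy_vectors.items() if w not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

print(analogy(positive=["Spain", "Athens"], negative=["Madrid"]))
```

In this toy setup "Greece" ranks first. With real embeddings the match is only approximate, which is why the scores in the notebook output stay well below 1.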


### 64. Experiments on the Analogy Data

In [19]:
import pandas as pd

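# skiprows=1 skips only the first section header; later header lines such as
# ": capital-world" are still read as rows (with NaN in the last two columns)
df = pd.read_csv('questions-words.txt', header=None, sep=' ', skiprows=1)
df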

            0       1        2            3
0      Athens  Greece  Baghdad         Iraq
1      Athens  Greece  Bangkok     Thailand
2      Athens  Greece  Beijing        China
3      Athens  Greece   Berlin      Germany
4      Athens  Greece     Bern  Switzerland
...       ...     ...      ...          ...
19552   write  writes     talk        talks
19553   write  writes    think       thinks
19554   write  writes   vanish     vanishes
19555   write  writes     walk        walks


In [28]:
for vec1, vec2, vec3 in zip(df[0], df[1], df[2]):
    if vec1 == ":":  # skip section header rows such as ": capital-world", which otherwise raise KeyError
        continue
    print("{} - {} + {} = {}".format(vec2, vec1, vec3, googlenews_w2v.most_similar(positive=[vec2, vec3], negative=[vec1], topn=1)))

Greece - Athens + Baghdad = [('Iraqi', 0.635187029838562)]
Greece - Athens + Bangkok = [('Thailand', 0.7137669920921326)]
Greece - Athens + Beijing = [('China', 0.7235777378082275)]
Greece - Athens + Berlin = [('Germany', 0.6734623312950134)]
Greece - Athens + Bern = [('Switzerland', 0.4919748306274414)]
Greece - Athens + Cairo = [('Egypt', 0.7527808547019958)]
Greece - Athens + Canberra = [('Australia', 0.5837324857711792)]
Greece - Athens + Hanoi = [('Viet_Nam', 0.6276342272758484)]
Greece - Athens + Havana = [('Cuba', 0.6460990905761719)]
Greece - Athens + Helsinki = [('Finland', 0.68999844789505)]
Greece - Athens + Islamabad = [('Pakistan', 0.7233324646949768)]
Greece - Athens + Kabul = [('Afghan', 0.6160916090011597)]
Greece - Athens + London = [('Britain', 0.5646187663078308)]
Greece - Athens + Madrid = [('Spain', 0.7036612629890442)]
Greece - Athens + Moscow = [('Russia', 0.7382973432540894)]
Greece - Athens + Oslo = [('Norway', 0.6470744013786316)]
Greece - Athens + Ottawa = [(
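
The analogy file mixes section header lines of the form `: capital-world` with the four-word analogy rows. When the file is split on spaces, a header line turns into a row whose "words" are not in the vocabulary, so it has to be filtered out before querying the model. Below is a minimal sketch of a reader that skips the headers and keeps the category name. The function name `read_analogies` and the sample text are hypothetical; the sample only imitates the file layout.

```python
import io

# A few lines in the layout of questions-words.txt; this is a small
# hand-written sample, not the full file.
sample = """: capital-common-countries
Athens Greece Baghdad Iraq
Athens Greece Bangkok Thailand
: capital-world
Abuja Nigeria Accra Ghana
"""

def read_analogies(lines):
    """Yield (category, w1, w2, w3, w4) tuples, skipping section headers."""
    category = None
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        if fields[0] == ":":
            category = fields[1]  # remember the current section name
            continue
        yield (category, *fields)

rows = list(read_analogies(io.StringIO(sample)))
for row in rows:
    print(row)
```

For the real data, the same function could be fed an open file handle, for example `open('questions-words.txt')`, and each yielded row passed to `most_similar` as in the loop above.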