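# Assignment: Switch between different GloVe models and observe their behavior with the gensim package

# [Assignment Goal]
- Load different versions of the GloVe model and observe how they differ

# [Assignment Focus]
- Observe how the results differ across GloVe's pre-trained word vectors

# Step 1
- Go to the official GloVe site (https://github.com/stanfordnlp/GloVe) and download one of the four pre-trained models
- After extracting the archive, pick a word-vector file and copy it to where this notebook runs (the cells below read it from a "glove/" subdirectory)
- Edit "input_file" and "output_file" in the model-setup cell to match the word-vector file you chose, then run the remaining cells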

In [1]:
# Load gensim's KeyedVectors container and the GloVe-to-word2vec conversion script
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

# Suppress warning messages
import warnings
warnings.filterwarnings("ignore")



In [2]:
# Set the GloVe input path and the word2vec-format output path, then convert
input_file = 'glove/glove.6B.50d.txt'
output_file = 'gensim_glove.6B.50d.txt'
glove2word2vec(input_file, output_file)

(400000, 50)
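
The tuple printed above is the vocabulary size and the vector dimensionality. A GloVe text file holds one word per line followed by its values, while the word2vec text format expects an extra first line carrying those two numbers. The sketch below illustrates that idea in plain Python. `add_word2vec_header` is a hypothetical helper written for illustration, not gensim's implementation, and it assumes a space-separated UTF-8 file.

```python
def add_word2vec_header(glove_path, out_path):
    """Copy a GloVe text file to out_path, prefixed with a 'vocab_size dims' line.

    Hypothetical helper for illustration; not part of gensim.
    """
    vocab_size = 0
    dims = 0
    # First pass: count the lines and read the dimensionality from the first one.
    with open(glove_path, encoding="utf-8") as src:
        for line in src:
            if vocab_size == 0:
                # Each line is: word value_1 value_2 ... value_n
                dims = len(line.rstrip().split(" ")) - 1
            vocab_size += 1
    # Second pass: write the header, then copy every vector line unchanged.
    with open(glove_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        dst.write(f"{vocab_size} {dims}\n")
        for line in src:
            dst.write(line)
    return vocab_size, dims
```

gensim 4.x documents a `no_header=True` argument to `KeyedVectors.load_word2vec_format` that reads the headerless GloVe file directly; check your installed version before relying on it.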

In [3]:
# Load the converted model
model = KeyedVectors.load_word2vec_format(output_file, binary=False)

# Step 2
- Observe the effect of changing the pre-trained word vectors

In [4]:
# Show the words most similar to 'woman'
model.most_similar(['woman'])

[('girl', 0.906528115272522),
 ('man', 0.8860336542129517),
 ('mother', 0.876370370388031),
 ('her', 0.8613135814666748),
 ('boy', 0.859611988067627),
 ('she', 0.8430695533752441),
 ('herself', 0.8224567770957947),
 ('child', 0.8108214735984802),
 ('wife', 0.8037394285202026),
 ('old', 0.7982394695281982)]
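
The scores in this list are cosine similarities between the query vector and every other word vector, sorted from highest to lowest. Here is a minimal sketch of that ranking in plain Python. The 3-dimensional vectors are made up for illustration and `nearest` is a hypothetical helper, not a gensim function.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(word, vectors, topn=3):
    """Rank every other word by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scored = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word  # the query word itself is excluded
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up toy vectors (assumption: NOT real GloVe values).
toy_vectors = {
    "woman": [0.9, 0.8, 0.1],
    "girl": [0.8, 0.9, 0.2],
    "man": [0.9, 0.2, 0.1],
    "car": [0.1, 0.1, 0.9],
}

print(nearest("woman", toy_vectors))
```

With these toy values the ranking comes out as girl, man, car, the same shape of result as the real model returns.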

In [5]:
# Show the most similar words for an analogy query (positive words are added, negative words are subtracted)
model.most_similar(positive=['woman', 'king'], negative=['man'], topn=5)

[('queen', 0.8523604273796082),
 ('throne', 0.7664334177970886),
 ('prince', 0.759214460849762),
 ('daughter', 0.7473882436752319),
 ('elizabeth', 0.7460219860076904)]
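
The query above is the classic "king - man + woman" analogy: positive words pull the target vector toward them, negative words push it away, and the input words are left out of the results. The sketch below shows the arithmetic on made-up 3-dimensional vectors whose axes loosely stand for royalty, male, and female. The helper `analogy` is hypothetical and only approximates what gensim does internally.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def analogy(positive, negative, vectors, topn=1):
    """Add unit vectors of positive words, subtract negative ones, rank by cosine."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + x for t, x in zip(target, unit(vectors[word]))]
    for word in negative:
        target = [t - x for t, x in zip(target, unit(vectors[word]))]
    target = unit(target)
    exclude = set(positive) | set(negative)  # input words never appear in results
    scored = [
        (word, sum(t * x for t, x in zip(target, unit(vec))))
        for word, vec in vectors.items()
        if word not in exclude
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up toy vectors (assumption: NOT real GloVe values).
toy_vectors = {
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "throne": [1.0, 0.0, 0.0],
    "child": [0.0, 0.5, 0.5],
}

print(analogy(["woman", "king"], ["man"], toy_vectors, topn=3))
```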

In [7]:
# Pick the word that does not belong with the others
model.doesnt_match("breakfast cereal dinner lunch".split())

'cereal'
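
One common way to find the odd word out is to average the unit-length vectors of all the words and return the word whose vector points least in the direction of that average. The sketch below follows that approach on made-up 2-dimensional vectors. `odd_one_out` is a hypothetical helper, and the claim that it mirrors gensim's `doesnt_match` is an assumption, not something stated in this notebook.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(words, vectors):
    """Return the word least similar to the mean direction of all the words."""
    units = [unit(vectors[w]) for w in words]
    dims = len(units[0])
    # Mean of the unit vectors, rescaled to unit length.
    mean = unit([sum(u[i] for u in units) / len(units) for i in range(dims)])
    # Cosine similarity of each word to the mean direction.
    scores = [sum(m * x for m, x in zip(mean, u)) for u in units]
    return words[scores.index(min(scores))]

# Made-up toy vectors (assumption: NOT real GloVe values).
toy_vectors = {
    "breakfast": [0.9, 0.1],
    "cereal": [0.3, 0.9],
    "dinner": [0.85, 0.15],
    "lunch": [0.8, 0.2],
}

print(odd_one_out("breakfast cereal dinner lunch".split(), toy_vectors))
```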

In [9]:
# Show the similarity between two words
model.similarity('woman', 'man')

0.8860338

In [10]:
# Show the word vector for a word
model['computer']

array([ 0.079084, -0.81504 ,  1.7901  ,  0.91653 ,  0.10797 , -0.55628 ,
       -0.84427 , -1.4951  ,  0.13418 ,  0.63627 ,  0.35146 ,  0.25813 ,
       -0.55029 ,  0.51056 ,  0.37409 ,  0.12092 , -1.6166  ,  0.83653 ,
        0.14202 , -0.52348 ,  0.73453 ,  0.12207 , -0.49079 ,  0.32533 ,
        0.45306 , -1.585   , -0.63848 , -1.0053  ,  0.10454 , -0.42984 ,
        3.181   , -0.62187 ,  0.16819 , -1.0139  ,  0.064058,  0.57844 ,
       -0.4556  ,  0.73783 ,  0.37203 , -0.57722 ,  0.66441 ,  0.055129,
        0.037891,  1.3275  ,  0.30991 ,  0.50697 ,  1.2357  ,  0.1274  ,
       -0.11434 ,  0.20709 ], dtype=float32)

In [11]:
# Convert the 300-dimensional vectors for comparison
input_file2 = 'glove/glove.6B.300d.txt'
output_file2 = 'gensim_glove.6B.300d.txt'
glove2word2vec(input_file2, output_file2)

(400000, 300)

In [12]:
def glove_monitor(output_file):
    """Load a converted GloVe file and print the same queries as the cells above."""
    model = KeyedVectors.load_word2vec_format(output_file, binary=False)
    print('Most similar words:\n', model.most_similar(['woman']))
    print('Most similar words (analogy query):\n', model.most_similar(positive=['woman', 'king'], negative=['man'], topn=5))
    print('Word that does not belong:\n', model.doesnt_match("breakfast cereal dinner lunch".split()))
    print('Similarity between words:\n', model.similarity('woman', 'man'))
    print('Word vector:\n', model['computer'])

In [14]:
glove_monitor(output_file2)

Most similar words:
 [('girl', 0.7296419143676758), ('man', 0.6998663544654846), ('mother', 0.689943790435791), ('she', 0.6433226466178894), ('her', 0.6327142715454102), ('female', 0.6251604557037354), ('herself', 0.6215280890464783), ('person', 0.6170896887779236), ('women', 0.6047608852386475), ('wife', 0.5986992716789246)]
Most similar words (analogy query):
 [('queen', 0.6713276505470276), ('princess', 0.5432624220848083), ('throne', 0.5386104583740234), ('monarch', 0.5347574949264526), ('daughter', 0.498025119304657)]
Word that does not belong:
 cereal
Similarity between words:
 0.69986635
Word vector:
 [-2.7628e-01  1.3999e-01  9.8519e-02 -6.4019e-01  3.1988e-02  1.0066e-01
 -1.8673e-01 -3.7129e-01  5.9740e-01 -2.0405e+00  2.2368e-01 -2.6314e-02
  7.2408e-01 -4.3829e-01  4.8886e-01 -3.5486e-03 -1.0006e-01 -3.0587e-01
 -1.5621e-01 -6.8136e-02  2.1104e-01  2.9287e-01 -8.8861e-02 -2.0462e-01
 -5.7602e-01  3.4526e-01  4.1390e-01  1.7917e-01  2.5143e-01 -2.2678e-01
 -1.0103e-01  1.4576e-01  2.0127e-01  3.1810e-01 -7.8907e-01 -2.2194e-01
 -2.483
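
One simple way to put a number on how much the two models agree is to compare their neighbor lists for the same query. The sketch below uses the two 'woman' neighbor lists printed in this notebook and computes the shared words and the Jaccard overlap. `neighbor_overlap` is a hypothetical helper, not a gensim function.

```python
def neighbor_overlap(neighbors_a, neighbors_b):
    """Return the shared words and the Jaccard overlap of two neighbor lists."""
    set_a, set_b = set(neighbors_a), set(neighbors_b)
    shared = set_a & set_b
    jaccard = len(shared) / len(set_a | set_b)
    return sorted(shared), jaccard

# Neighbors of 'woman' copied from the outputs in this notebook.
neighbors_50d = ["girl", "man", "mother", "her", "boy",
                 "she", "herself", "child", "wife", "old"]
neighbors_300d = ["girl", "man", "mother", "she", "her",
                  "female", "herself", "person", "women", "wife"]

shared, jaccard = neighbor_overlap(neighbors_50d, neighbors_300d)
print(shared)
print(round(jaccard, 3))
```

Seven of the ten neighbors appear in both lists. The 50-dimensional model adds 'boy', 'child', and 'old', while the 300-dimensional model adds 'female', 'person', and 'women'.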

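With more dimensions, the vectors tend to capture word relationships better: for the analogy query, the 300-dimensional model returns 'princess' and 'monarch' among its top five, where the 50-dimensional model returned 'prince' and 'elizabeth'. The cost is a larger file and longer conversion and loading times (and longer training, if you train vectors yourself), so choosing a dimensionality means weighing accuracy against time.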
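
To make the cost side of this trade-off concrete, the sketch below estimates the size of the raw vector matrix from the numbers printed earlier: 400,000 words, stored as float32 (4 bytes per value). `embedding_megabytes` is a hypothetical helper. It counts only the matrix itself, so the vocabulary and any library overhead come on top.

```python
def embedding_megabytes(vocab_size, dims, bytes_per_value=4):
    """Size of a vocab_size x dims matrix in megabytes (1 MB = 1,000,000 bytes)."""
    return vocab_size * dims * bytes_per_value / 1_000_000

# 400,000 words as reported by the conversion step; float32 takes 4 bytes.
for dims in (50, 300):
    print(dims, "dimensions:", embedding_megabytes(400_000, dims), "MB")
```

The 300-dimensional matrix comes to 480 MB against 80 MB for the 50-dimensional one, six times as much memory for the same vocabulary.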