# 4.5 GloVe

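Implementing GloVe from scratch is a bit of a chore, so instead we train with the implementation released by the authors and then inspect the resulting vector representations.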

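GloVe can be trained with the authors' reference implementation, which is written in C and is run here as four command-line tools.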

The main options are as follows.
* min-count : minimum token frequency
* window-size : length of the context window to consider
* vector-size : number of embedding dimensions

### 1) vocab_count
This tool requires an input corpus that already consists of whitespace-separated tokens, so run something like the Stanford Tokenizer on raw text first. From the corpus, it constructs unigram counts and optionally thresholds the resulting vocabulary based on total vocabulary size or minimum frequency count.

`models/glove/build/vocab_count -min-count 5 -verbose 2 < data/tokenized/corpus_mecab.txt > data/word-embeddings/glove/glove.vocab`
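To make the counting step concrete, here is a minimal Python sketch of what a unigram counter with a minimum-frequency threshold does. It is not the actual `vocab_count` code: the function name `count_vocab`, the toy corpus, and the alphabetical tie-breaking are assumptions made for illustration.

```python
from collections import Counter

def count_vocab(lines, min_count=5):
    """Count whitespace-separated tokens and keep those seen at least min_count times."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    kept = [(word, count) for word, count in counts.items() if count >= min_count]
    # Most frequent first; ties are broken alphabetically (an assumption of this sketch).
    kept.sort(key=lambda wc: (-wc[1], wc[0]))
    return kept

# Hypothetical toy corpus, already tokenized.
toy_corpus = ["a b a c", "a b d"]
vocab = count_vocab(toy_corpus, min_count=2)
for word, count in vocab:
    print(word, count)
```

Each printed line is a word followed by its frequency, which is the same shape as the lines of `glove.vocab`.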

### 2) cooccur
Constructs word-word cooccurrence statistics from a corpus. The user should supply a vocabulary file, as produced by `vocab_count`, and may specify a variety of parameters, as described by running `./build/cooccur`.

`models/glove/build/cooccur -memory 10.0 -vocab-file data/word-embeddings/glove/glove.vocab -verbose 2 -window-size 15 < data/tokenized/corpus_mecab.txt > data/word-embeddings/glove/glove.cooc`
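The idea behind the co-occurrence step can be sketched in a few lines of Python. This is a simplified illustration, not the `cooccur` implementation: it assumes a symmetric window and a 1/distance weight for each pair, it keeps everything in memory, and it does not model how the real tool treats out-of-vocabulary tokens or line boundaries. The function name `cooccurrence` and the toy input are hypothetical.

```python
from collections import defaultdict

def cooccurrence(tokens, vocab, window_size=15):
    """Accumulate symmetric, distance-weighted co-occurrence counts."""
    counts = defaultdict(float)
    for i, word in enumerate(tokens):
        if word not in vocab:
            continue
        # Look back at most window_size tokens from position i.
        for j in range(max(0, i - window_size), i):
            context = tokens[j]
            if context not in vocab:
                continue
            weight = 1.0 / (i - j)  # nearer context words contribute more
            counts[(word, context)] += weight
            counts[(context, word)] += weight  # symmetric window
    return dict(counts)

# Hypothetical toy sequence.
tokens = "a b c a".split()
cooc = cooccurrence(tokens, vocab={"a", "b", "c"}, window_size=2)
```

A larger `window_size` (15 in the command above) lets more distant words contribute, each with a smaller weight.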

### 3) shuffle
Shuffles the binary file of cooccurrence statistics produced by `cooccur`. For large files, the file is automatically split into chunks, each of which is shuffled and stored on disk before being merged and shuffled together. The user may specify a number of parameters, as described by running `./build/shuffle`.

`models/glove/build/shuffle -memory 10.0 -verbose 2 < data/word-embeddings/glove/glove.cooc > data/word-embeddings/glove/glove.shuf`
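The chunk-then-merge idea described above can be illustrated with a small in-memory sketch. The real tool works on binary records and temporary files on disk; the function `chunked_shuffle`, its `chunk_size` and `seed` parameters, and the merge strategy below are assumptions chosen only to show the general shape.

```python
import random

def chunked_shuffle(records, chunk_size, seed=0):
    """Shuffle records chunk by chunk, then merge the chunks in random order."""
    rng = random.Random(seed)
    chunks = []
    for start in range(0, len(records), chunk_size):
        chunk = list(records[start:start + chunk_size])
        rng.shuffle(chunk)  # shuffle within one chunk (would be one temp file on disk)
        chunks.append(chunk)
    merged = []
    # Repeatedly take one record from a randomly chosen non-empty chunk.
    while chunks:
        k = rng.randrange(len(chunks))
        merged.append(chunks[k].pop())
        if not chunks[k]:
            chunks.pop(k)
    return merged

shuffled = chunked_shuffle(list(range(10)), chunk_size=4)
```

Only one chunk has to fit in memory at a time during the first pass, which is presumably why the tool takes a `-memory` argument.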

### 4) glove
Train the GloVe model on the specified cooccurrence data, which typically will be the output of the `shuffle` tool. The user should supply a vocabulary file, as given by vocab_count, and may specify a number of other parameters, which are described by running `./build/glove`.

`models/glove/build/glove -save-file data/word-embeddings/glove/glove.vecs -threads 4 -input-file data/word-embeddings/glove/glove.shuf -x-max 10 -iter 15 -vector-size 100 -binary 2 -vocab-file data/word-embeddings/glove/glove.vocab -verbose 2`
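As a rough illustration of what `-x-max` controls, the sketch below writes out the GloVe weighting function and the weighted squared error for a single word pair. The exponent `alpha=0.75` is the value commonly used with GloVe and is an assumption here, since the command above does not set it; the function names are hypothetical and this is not the training code itself.

```python
import math

def glove_weight(x, x_max=10.0, alpha=0.75):
    """Weight f(x): grows as (x / x_max) ** alpha and is capped at 1 for x >= x_max."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def pair_loss(w_i, w_j, b_i, b_j, x_ij, x_max=10.0, alpha=0.75):
    """Weighted squared error f(x_ij) * (w_i . w_j + b_i + b_j - log x_ij) ** 2."""
    dot = sum(a * b for a, b in zip(w_i, w_j))
    diff = dot + b_i + b_j - math.log(x_ij)
    return glove_weight(x_ij, x_max, alpha) * diff * diff

print(glove_weight(1.0), glove_weight(10.0), glove_weight(500.0))
```

With `-x-max 10`, every pair that co-occurs 10 times or more gets the full weight of 1, while rarer pairs are down-weighted.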

### Let's load the GloVe data trained this way.

In [1]:
%matplotlib inline

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
import matplotlib.pyplot as plt
import time
import pandas as pd

dev = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
dev

device(type='cuda')

In [2]:
vec_path = '../data/word-embeddings/glove/glove.vecs.txt'
# Only needed for the optional vocab-file check in the next few cells.
word2id_path = '../data/word-embeddings/glove/glove.vocab'

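### I originally planned to load the vocab file and build the dictionary as below, but it turned out to be unnecessary.

So this part can be skipped. The reason is that the numbers stored in word2id do not match the row indices of the vector matrix: each line of `glove.vocab`, as written by `vocab_count`, is a word followed by its frequency count, not an ID.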

In [74]:
word2id = dict()
with open(word2id_path, 'r', encoding='utf-8') as f:
    for line in f:
        # Each line is "<word> <frequency>"; the number is a count, not a row index.
        word, count = line.split()
        word2id[word] = int(count)

In [38]:
id2word = []
for word, _ in word2id.items():
    id2word.append(word)

In [61]:
len(word2id), len(id2word)

(358042, 358042)

In [63]:
word2id['한국']

138609

In [65]:
mat_df2.iloc[138609]

1     -0.354456
2      0.058935
3      0.115488
4      0.076369
5      0.624948
         ...   
96     0.206904
97    -0.097000
98    -0.085081
99    -0.179100
100   -0.396120
Name: 박세훈, Length: 100, dtype: float32

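As shown above, the values in word2id do not line up with the rows of mat_df2: row 138609 belongs to 박세훈, not 한국. The number 138609 is the frequency of 한국 in the vocab file, not its row index.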
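If a mapping from the vocab file is still wanted, the line number rather than the second column is the candidate for the row index. The sketch below shows the difference on made-up lines. It assumes the vector file lists words in the same order as `glove.vocab` (possibly with an extra row at the end), which should be checked against the actual files; the names `word2count` and `word2row` and the sample counts are hypothetical.

```python
# Hypothetical "<word> <count>" lines in the format written by vocab_count.
vocab_lines = ["의 120", "한국 45", "서울 30"]

word2count = {}
word2row = {}
for row, line in enumerate(vocab_lines):
    word, count = line.split()
    word2count[word] = int(count)  # second column: frequency
    word2row[word] = row           # line number: candidate row index

print(word2count["한국"], word2row["한국"])
```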

### Let's load and process the vectors produced by GloVe.

In [76]:
# Each line of the text file is "<word> <v1> ... <v100>", separated by spaces.
with open(vec_path, 'r', encoding='utf-8') as f:
    mat = [line.split() for line in f]

In [77]:
mat_df = pd.DataFrame(mat)

In [78]:
ind = mat_df[0]

In [79]:
mat_df2 = mat_df.iloc[:,1:].astype(np.float32).copy()

In [80]:
mat_df2.index = ind

In [81]:
mat_df2.head()

| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 | 100 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | -0.358644 | 0.245709 | 0.216245 | -0.190436 | -0.756802 | -0.563763 | -0.328874 | 0.01778 | -0.797071 | -0.240773 | ... | -0.482112 | 0.669735 | -0.020597 | -0.428709 | -0.446869 | -0.444151 | 1.327251 | 0.145156 | -0.057706 | 0.518008 |
| . | -0.291055 | 0.236648 | -0.276704 | -0.208636 | -0.269282 | 0.119725 | -0.307835 | -0.857031 | -0.106891 | -0.828509 | ... | 0.616154 | -0.06734 | 0.387378 | -0.022754 | -1.10887 | -0.222616 | 1.88907 | 0.173319 | -0.077478 | 1.213262 |
| 0 | -0.091244 | 0.077856 | 0.729488 | 0.075314 | -0.957595 | -0.64704 | -0.490441 | 0.066009 | -0.465373 | -0.746489 | ... | -0.511708 | 0.235092 | -0.045828 | -0.399763 | -0.744628 | -0.772276 | 1.39247 | -0.256552 | -0.144649 | 0.454234 |
| 2 | -0.163885 | -0.170753 | 0.440817 | -0.027542 | -0.97129 | -0.643265 | -0.368717 | 0.307056 | -0.738059 | -0.345742 | ... | -0.548439 | 0.551585 | 0.184272 | -0.720818 | -0.860157 | -0.475679 | 1.411633 | -0.299546 | -0.047707 | 0.105494 |
| 의 | 0.002961 | 0.369871 | 0.323835 | -0.40271 | 0.380283 | -0.841707 | -0.03086 | 0.863469 | -0.992069 | -0.827821 | ... | 0.581238 | 0.168463 | 0.432114 | -1.164212 | -0.625309 | -0.036693 | 1.908723 | 0.173384 | 0.200731 | 0.15215 |


In [83]:
mat_df2.values

array([[-0.358644,  0.245709,  0.216245, ...,  0.145156, -0.057706,
         0.518008],
       [-0.291055,  0.236648, -0.276704, ...,  0.173319, -0.077478,
         1.213262],
       [-0.091244,  0.077856,  0.729488, ..., -0.256552, -0.144649,
         0.454234],
       ...,
       [ 0.10549 , -0.031464, -0.103652, ..., -0.043713, -0.007134,
        -0.082403],
       [-0.174097, -0.066783,  0.121074, ...,  0.035401,  0.001533,
        -0.125724],
       [-0.011894, -0.004993,  0.022187, ...,  0.040329, -0.075547,
        -0.08533 ]], dtype=float32)
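The loading steps above can also be wrapped in one small function that parses the numbers while reading, so the string matrix never has to be built. This is only a sketch: `load_glove_text` is a hypothetical helper, it uses plain Python lists instead of pandas to stay dependency-free, and the sample lines are made up.

```python
def load_glove_text(lines):
    """Parse "<word> <v1> ... <vN>" lines into a word list and a list of float vectors."""
    words, vectors = [], []
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        words.append(parts[0])
        vectors.append([float(x) for x in parts[1:]])
    # All vectors should share one dimensionality; fail loudly otherwise.
    dims = {len(v) for v in vectors}
    if len(dims) > 1:
        raise ValueError(f"inconsistent vector sizes: {sorted(dims)}")
    return words, vectors

# Hypothetical sample lines; with a real file, pass the open file object instead.
sample = ["한국 0.1 -0.2 0.3\n", "서울 0.4 0.5 -0.6\n"]
words, vectors = load_glove_text(sample)
```

The position of each word in `words` is then its row index, which is the mapping the next cells build from the DataFrame.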

### Rebuild word2id from the DataFrame created this way.

In [84]:
word2id = dict()
id2word = []
for i, word in enumerate(ind):
    word2id[word] = i
    id2word.append(word)

In [85]:
word2id

{'1': 0,
 '.': 1,
 '0': 2,
 '2': 3,
 '의': 4,
 ',': 5,
 '이': 6,
 '는': 7,
 '다': 8,
 ')': 9,
 '(': 10,
 '9': 11,
 '년': 12,
 '에': 13,
 '을': 14,
 '하': 15,
 '3': 16,
 '5': 17,
 '4': 18,
 '8': 19,
 '은': 20,
 '6': 21,
 '7': 22,
 ':': 23,
 '를': 24,
 '월': 25,
 '분류': 26,
 '일': 27,
 '고': 28,
 '-': 29,
 '가': 30,
 '있': 31,
 '에서': 32,
 '으로': 33,
 '로': 34,
 '한': 35,
 '되': 36,
 '었': 37,
 '과': 38,
 '들': 39,
 '와': 40,
 '도': 41,
 '했': 42,
 '적': 43,
 '인': 44,
 '였': 45,
 '그': 46,
 '어': 47,
 '기': 48,
 '《': 49,
 '제': 50,
 '것': 51,
 '*': 52,
 '~': 53,
 '게': 54,
 '지': 55,
 '"': 56,
 '》': 57,
 '여': 58,
 '한다': 59,
 '수': 60,
 '역': 61,
 '된': 62,
 '등': 63,
 '/': 64,
 '며': 65,
 '대': 66,
 '·': 67,
 '회': 68,
 '선수': 69,
 '영화': 70,
 '대한민국': 71,
 '할': 72,
 '던': 73,
 '해': 74,
 '아': 75,
 '만': 76,
 '%': 77,
 '명': 78,
 '않': 79,
 '자': 80,
 '시': 81,
 '에게': 82,
 '중': 83,
 '주': 84,
 '까지': 85,
 '미국': 86,
 '았': 87,
 '나': 88,
 '번': 89,
 '면': 90,
 '지만': 91,
 '일본': 92,
 '없': 93,
 '사람': 94,
 '받': 95,
 '성': 96,
 '위': 97,
 '때': 98,
 '축구': 99,
 ...}

In [86]:
id2word

['1',
 '.',
 '0',
 '2',
 '의',
 ',',
 '이',
 '는',
 '다',
 ')',
 '(',
 '9',
 '년',
 '에',
 '을',
 '하',
 '3',
 '5',
 '4',
 '8',
 '은',
 '6',
 '7',
 ':',
 '를',
 '월',
 '분류',
 '일',
 '고',
 '-',
 '가',
 '있',
 '에서',
 '으로',
 '로',
 '한',
 '되',
 '었',
 '과',
 '들',
 '와',
 '도',
 '했',
 '적',
 '인',
 '였',
 '그',
 '어',
 '기',
 '《',
 '제',
 '것',
 '*',
 '~',
 '게',
 '지',
 '"',
 '》',
 '여',
 '한다',
 '수',
 '역',
 '된',
 '등',
 '/',
 '며',
 '대',
 '·',
 '회',
 '선수',
 '영화',
 '대한민국',
 '할',
 '던',
 '해',
 '아',
 '만',
 '%',
 '명',
 '않',
 '자',
 '시',
 '에게',
 '중',
 '주',
 '까지',
 '미국',
 '았',
 '나',
 '번',
 '면',
 '지만',
 '일본',
 '없',
 '사람',
 '받',
 '성',
 '위',
 '때',
 '축구',
 '전',
 '으며',
 '#',
 '된다',
 '세',
 '개',
 '부터',
 '후',
 '화',
 '라고',
 '사용',
 '호',
 '같',
 '말',
 '라는',
 '팀',
 '학교',
 '이후',
 '및',
 '?',
 '는데',
 '두',
 '면서',
 '선',
 '한국',
 '차',
 '라',
 '대학교',
 '서울',
 '–',
 '보',
 '대한',
 '권',
 '상',
 '경기',
 '|',
 '지역',
 '군',
 '리그',
 '다고',
 '!',
 '때문',
 '국가',
 '이름',
 '시작',
 '더',
 '함께',
 '세계',
 '다른',
 '내',
 '경우',
 '현재',
 '살',
 '위해',
 '많',
 '방송',
 'KBS',
 'of',
 ...]

### Now let's measure similarity.

In [105]:
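def similar_words(mat_df, word, k=10):
    """Return the k nearest words to `word` by cosine similarity (index: similarity, value: word)."""
    cos = nn.CosineSimilarity(dim=1, eps=1e-6)

    word_id = word2id[word]
    word_vec = torch.tensor(mat_df.loc[word].values).view(1, -1)
    word_mat = torch.tensor(mat_df.values)
    cos_mat = cos(word_vec, word_mat)
    # Request k+1 neighbours because the query word itself is normally the top hit.
    sim, indices = torch.topk(cos_mat, k + 1)

    word_list, sim_list = [], []
    for s, i in zip(sim.tolist(), indices.tolist()):
        if i != word_id:  # drop the query word, keeping each score paired with its word
            word_list.append(id2word[i])
            sim_list.append(s)
    return pd.Series(word_list[:k], index=np.array(sim_list[:k], dtype=np.float32))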

In [106]:
similar_words(mat_df2, '한국')

0.766142    대한민국
0.728997      국내
0.698197      협회
0.691419      문화
0.678622      세계
0.676159      대표
0.671660     연구소
0.660957      국제
0.660433      중국
0.657550     연구원
dtype: object

In [107]:
similar_words(mat_df2, '대통령')

0.745762    이명박
0.742417    노무현
0.718058     정부
0.717507    박근혜
0.715104    김대중
0.701533    부통령
0.699795    문재인
0.698852     총리
0.696139    박정희
0.684194     선거
dtype: object

In [111]:
similar_words(mat_df2, '서울')

0.768897      특별시
0.752911       부산
0.713688       대구
0.708653    서울특별시
0.693020     대한민국
0.676501      대학교
0.674008       인천
0.665950       한양
0.654892       광주
0.654595      서울시
dtype: object

In [128]:
similar_words(mat_df2, '컴퓨터')

0.831022       응용
0.779801      시스템
0.768034     하드웨어
0.765600    소프트웨어
0.729581       PC
0.715987       가상
0.707996       기기
0.704437      그래픽
0.703101       기술
0.693981       장치
dtype: object

In [131]:
similar_words(mat_df2, '공부')

0.798458    가르쳤
0.793902    가르치
0.775261     배웠
0.722765     유학
0.699725     수학
0.699584     수업
0.691074     입학
0.671551     학문
0.662154    열심히
0.654888     진학
dtype: object
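Since `similar_words` recomputes norms for the whole matrix on every call, one common variation is to normalize each row once and then rank by plain dot products, which equal cosine similarity for unit vectors. The sketch below shows the idea on toy data in pure Python; the names `normalize` and `most_similar` and the toy words and vectors are hypothetical, and with the real matrix the same thing would be done with tensor operations.

```python
import math

def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def most_similar(words, unit_vectors, query, k=3):
    """Rank words by dot product with the query's unit vector."""
    q = unit_vectors[words.index(query)]
    scored = [
        (sum(a * b for a, b in zip(q, v)), w)
        for w, v in zip(words, unit_vectors)
        if w != query
    ]
    scored.sort(reverse=True)
    return scored[:k]

# Toy data: "b" points the same way as "a", "c" is orthogonal to it.
words = ["a", "b", "c"]
vectors = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]]
unit = [normalize(v) for v in vectors]  # done once, reused for every query
result = most_similar(words, unit, "a", k=2)
```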

In [125]:
def analogy(mat_df, word1, word2, word3, k=10):
    """Rank words by cosine similarity to vec(word1) - vec(word2) + vec(word3)."""
    cos = nn.CosineSimilarity(dim=1, eps=1e-6)

    query_ids = {word2id[word1], word2id[word2], word2id[word3]}
    word_vec1 = torch.tensor(mat_df.loc[word1].values).view(1, -1)
    word_vec2 = torch.tensor(mat_df.loc[word2].values).view(1, -1)
    word_vec3 = torch.tensor(mat_df.loc[word3].values).view(1, -1)
    word_mat = torch.tensor(mat_df.values)

    cos_mat = cos(word_vec1 - word_vec2 + word_vec3, word_mat)
    # Request k+3 hits so that k remain after dropping the three query words.
    sim, indices = torch.topk(cos_mat, k + 3)

    word_list, sim_list = [], []
    for s, i in zip(sim.tolist(), indices.tolist()):
        if i in query_ids:
            continue
        word_list.append(id2word[i])
        sim_list.append(s)
        if len(word_list) == k:
            break
    return pd.Series(word_list, index=sim_list)


In [126]:
analogy(mat_df2,'한국','서울','파리')

0.687797    프랑스
0.601987     유럽
0.564660     인도
0.551131     세계
0.539856    러시아
0.533136    벨기에
0.528558    스위스
0.527541     영국
0.520582    그리고
0.519037    캐나다
dtype: object

In [127]:
analogy(mat_df2,'왕','남자','여자')

0.663666     王
0.642142    공주
0.642031    군주
0.632903    태조
0.628493    왕자
0.623341    대왕
0.619844     세
0.610831    아들
0.610641    나라
0.588807    국왕
dtype: object
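The vector arithmetic behind `analogy` is easier to see on hand-made vectors where the answer is known in advance. In the toy example below, the three dimensions are chosen by hand to mean "belongs to A", "belongs to B", and "is a country"; the words, vectors, and the helper `analogy_toy` are all hypothetical and have nothing to do with the trained model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def analogy_toy(vectors, a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b and c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

toy = {
    "country_A": [1.0, 0.0, 1.0],
    "capital_A": [1.0, 0.0, 0.0],
    "country_B": [0.0, 1.0, 1.0],
    "capital_B": [0.0, 1.0, 0.0],
    "other":     [1.0, 0.0, -1.0],
}
answer = analogy_toy(toy, "country_A", "capital_A", "capital_B")
print(answer)
```

Here country_A - capital_A isolates the "is a country" direction, and adding capital_B moves it to B, which mirrors the 한국 - 서울 + 파리 query above.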

Well, these results don't look too bad.