# 05. Word2Vec in Python (Using Gensim)

* 싸이그래머 (Psygrammer) / About Python
* 김무성

# Contents
* Word2Vec
* A simple Word2Vec example with Gensim
* A Korean Word2Vec example with Gensim

# Word2Vec
* [1] Brain's Pick: How to identify similarity between words - https://brunch.co.kr/@kakao-it/189
* [5] A summary of word2vec theory - https://shuuki4.wordpress.com/2016/01/27/word2vec-%EA%B4%80%EB%A0%A8-%EC%9D%B4%EB%A1%A0-%EC%A0%95%EB%A6%AC/
* [6] Vector Representations of Words - https://www.tensorflow.org/tutorials/word2vec

<img src="05_figures/cat1.png" width=600 />
<img src="05_figures/cat2.png" width=600 />
<img src="05_figures/w2v.png" width=600 />
<img src="05_figures/we.png" width=600 />
<img src="05_figures/tsne.png" width=600 />

--------------------------

# A simple Word2Vec example with Gensim

* [2] models.word2vec – Deep learning with word2vec - https://radimrehurek.com/gensim/models/word2vec.html

### English example

In [1]:
import gensim

In [6]:
# Data
documents = [["my", "name", "is", "jamie"], ["jamie", "is", "cute"]]

In [14]:
# Initialize the model
# e.g. Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)
# Note: in Gensim 4.0+ the `size` parameter is named `vector_size`.
model = gensim.models.Word2Vec(min_count=1)
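The `window` parameter mentioned in the comment above controls how many neighbouring tokens count as context for each word. The following is a minimal sketch of that idea in plain Python, not Gensim's actual implementation; the helper name `context_pairs` is hypothetical and the fixed window is a simplification.

```python
def context_pairs(sentence, window=2):
    """Return (center, context) pairs for every token in one sentence."""
    pairs = []
    for i, center in enumerate(sentence):
        # Clamp the window to the sentence boundaries.
        start = max(0, i - window)
        end = min(len(sentence), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, sentence[j]))
    return pairs


# With window=1, each word is paired only with its direct neighbours.
pairs = context_pairs(["my", "name", "is", "jamie"], window=1)
print(pairs)
```

A larger window pairs each word with more distant neighbours, which tends to capture broader topical context at the cost of more training pairs.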

In [15]:
# Build the model's vocabulary
model.build_vocab(documents)
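To see why `min_count=1` matters for a toy corpus like this one, here is a rough sketch of frequency-based vocabulary filtering. It only illustrates the idea and is not how Gensim builds its vocabulary internally; `count_vocab` is a hypothetical helper. Gensim's default `min_count` is 5, and no word in these two sentences occurs that often.

```python
from collections import Counter


def count_vocab(documents, min_count=5):
    """Count tokens and keep only those seen at least min_count times."""
    counts = Counter(token for doc in documents for token in doc)
    return {word: n for word, n in counts.items() if n >= min_count}


documents = [["my", "name", "is", "jamie"], ["jamie", "is", "cute"]]

vocab_default = count_vocab(documents)            # threshold of 5: nothing survives
vocab_all = count_vocab(documents, min_count=1)   # threshold of 1: every word is kept

print(vocab_default)
print(vocab_all)
```

On a real corpus a higher threshold is usually preferable, since very rare words do not get enough training signal to produce useful vectors.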

In [23]:
# Train
model.train(documents, total_examples=len(documents), epochs=10)

4

In [18]:
# Save and reload the trained model. Useful when you want to reuse this model later.
fname = 'mymodel.wv'
model.save(fname)
my_model = gensim.models.Word2Vec.load(fname)  

In [19]:
%ls mymodel.wv

mymodel.wv


In [20]:
# Get the vector for a word
model.wv['name']  # numpy vector of a word

array([  1.09521230e-03,   4.64050146e-03,  -4.45094937e-03,
         3.06846108e-03,  -3.19714658e-03,  -3.16221220e-03,
         4.56604082e-03,  -4.27358784e-03,  -1.06759043e-03,
         4.42925887e-03,   4.23925649e-03,  -1.17838802e-03,
         2.55819061e-03,  -4.12957836e-03,   1.79187104e-04,
        -2.00552749e-03,   2.03928980e-03,  -3.58911417e-03,
         1.47786108e-03,   3.37595446e-03,   4.60004807e-03,
        -4.12984838e-04,  -3.15382407e-04,   2.99513136e-04,
         4.55651293e-03,  -3.67720379e-03,  -3.22309928e-03,
        -1.63103791e-03,  -1.51263608e-03,   2.21117213e-03,
        -1.60928350e-03,  -2.75872787e-03,  -5.97351936e-05,
         2.08250666e-03,   3.10056238e-03,   2.09029671e-03,
        -3.98327690e-03,  -1.07583578e-03,  -4.15374897e-03,
        -7.72640575e-04,   3.89523245e-03,  -2.88798497e-03,
         4.52040322e-03,  -4.10884293e-03,  -8.05260439e-04,
         2.14457465e-03,  -1.51244493e-03,   1.13508233e-03,
        -2.00112816e-03,
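Word vectors like the one above are usually compared with cosine similarity; Gensim exposes this as `model.wv.similarity(w1, w2)`. The sketch below shows the underlying calculation in plain Python with made-up two-dimensional vectors (an assumption for illustration, not values from the model). With a corpus of only two sentences, the learned vectors likely stay close to their random initial values, so similarities computed from them should not be read as meaningful.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Illustrative vectors only; real word vectors have many more dimensions.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])

print(same_direction, orthogonal, opposite)
```

Values near 1 mean the vectors point the same way, values near 0 mean they are unrelated, and values near -1 mean they point in opposite directions.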

### Korean example

##### Let's build the data from the following documents.
* https://gasazip.com/view.html?no=614736
* https://gasazip.com/1224697
* https://gasazip.com/view.html?no=599082
* https://gasazip.com/view.html?no=645465
* http://gasazip.com/view.html?no=643505
* https://gasazip.com/view.html?no=615362

In [21]:
# -- coding / the lines below are an example. For now, skip preprocessing and try it directly.

# Data
# ---- from here --------------------------------
d1 = ["매일", "울리는", "벨벨벨"]
d2 = ["Sign을", "보내", "signal 보내"]

documents = []
documents.append(d1)
documents.append(d2)
# ---- up to here: replace this part with real data for the exercise ----
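One possible way to fill in the part marked above, without any preprocessing, is to paste the lyrics as a single string and split them into one token list per line. This is only a sketch: `lines_to_documents` is a hypothetical helper, `raw_text` is a short placeholder, and whitespace splitting is a stand-in for proper Korean tokenization.

```python
def lines_to_documents(text):
    """Split raw text into one whitespace-tokenized document per non-empty line."""
    documents = []
    for line in text.splitlines():
        tokens = line.split()
        if tokens:  # skip blank lines between verses
            documents.append(tokens)
    return documents


# Placeholder text; replace it with the lyrics collected from the pages above.
raw_text = """매일 울리는 벨벨벨

Sign을 보내 signal 보내"""

documents = lines_to_documents(raw_text)
print(documents)
```

The resulting `documents` list has the same shape as the hand-written one, a list of token lists, so it can be passed to `build_vocab` and `train` unchanged.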

In [22]:

# Initialize the model
# e.g. Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)
# Note: in Gensim 4.0+ the `size` parameter is named `vector_size`.
model = gensim.models.Word2Vec(min_count=1)

# Build the model's vocabulary
model.build_vocab(documents)

# Train
model.train(documents, total_examples=len(documents), epochs=10)

# Get the vector for a word
model.wv['벨벨벨']  # numpy vector of a word

array([  2.60377699e-03,  -8.73178069e-05,   2.76896008e-03,
        -3.26095382e-03,  -4.06877045e-03,  -2.96207750e-03,
        -3.83626367e-03,  -2.28576991e-03,   3.08630127e-03,
         2.46401899e-03,  -7.96728826e-04,   1.25383027e-03,
        -2.51484755e-03,   2.83268280e-03,  -1.31098100e-03,
         4.04255185e-03,   3.82540189e-03,   1.11923798e-03,
        -1.57980411e-03,   4.41629533e-03,   1.16598303e-03,
        -3.93112568e-04,   4.75129718e-03,   3.59420595e-03,
        -3.84916854e-03,  -3.39429174e-03,  -4.03628452e-03,
         2.30653421e-03,  -1.29201170e-03,   2.36161286e-03,
        -2.82154826e-04,   9.37287987e-04,  -4.60557523e-04,
        -3.73785058e-03,   2.92462180e-03,   1.24247000e-03,
         1.35458470e-03,  -1.04766956e-03,  -1.75770419e-03,
        -1.70662580e-03,  -1.99063448e-03,   1.05575236e-04,
        -2.21528369e-03,   1.67688332e-03,  -3.50895134e-04,
         4.29488393e-03,   3.98305897e-03,   1.46421313e-04,
         1.12749869e-03,

# A Korean Word2Vec example with Gensim

* [3] Korean Word2Vec - http://blog.theeluwin.kr/post/146591096133/%ED%95%9C%EA%B5%AD%EC%96%B4-word2vec


##### Let's build the data from the following documents.
* https://gasazip.com/view.html?no=614736
* https://gasazip.com/1224697
* https://gasazip.com/view.html?no=599082
* https://gasazip.com/view.html?no=645465
* http://gasazip.com/view.html?no=643505
* https://gasazip.com/view.html?no=615362    

# Exercise
Work through it yourself, following reference [3]. This time, include morphological analysis as well.

In [None]:
# -- coding

# References
* [1] Brain's Pick: How to identify similarity between words - https://brunch.co.kr/@kakao-it/189
* [2] models.word2vec – Deep learning with word2vec - https://radimrehurek.com/gensim/models/word2vec.html
* [3] Korean Word2Vec - http://blog.theeluwin.kr/post/146591096133/%ED%95%9C%EA%B5%AD%EC%96%B4-word2vec
* [4] Classifying sentences with Word2Vec - https://ratsgo.github.io/natural%20language%20processing/2017/03/08/word2vec/
* [5] A summary of word2vec theory - https://shuuki4.wordpress.com/2016/01/27/word2vec-%EA%B4%80%EB%A0%A8-%EC%9D%B4%EB%A1%A0-%EC%A0%95%EB%A6%AC/
* [6] Vector Representations of Words - https://www.tensorflow.org/tutorials/word2vec