# 05. Word2Vec in Python (Using Gensim)

* Psygrammer / About Python
* 김무성

# Contents
* Word2Vec
* A simple Word2Vec example with Gensim
* A Korean Word2Vec example with Gensim
* Visualizing Korean Word2Vec with Gensim and plotly (t-SNE)
* Visualizing Korean Word2Vec with Gensim and TensorBoard (t-SNE)
* Word2Vec online update with Gensim (updating an existing model with a new corpus)

# Word2Vec
* [1] Brain's Pick: How to measure similarity between words - https://brunch.co.kr/@kakao-it/189
* [5] A summary of word2vec theory - https://shuuki4.wordpress.com/2016/01/27/word2vec-%EA%B4%80%EB%A0%A8-%EC%9D%B4%EB%A1%A0-%EC%A0%95%EB%A6%AC/
* [6] Vector Representations of Words - https://www.tensorflow.org/tutorials/word2vec

<img src="05_figures/cat1.png" width=600 />
<img src="05_figures/cat2.png" width=600 />
<img src="05_figures/w2v.png" width=600 />
<img src="05_figures/we.png" width=600 />
<img src="05_figures/tsne.png" width=600 />

--------------------------

# A simple Word2Vec example with Gensim

* [2] models.word2vec – Deep learning with word2vec - https://radimrehurek.com/gensim/models/word2vec.html

### English example

In [1]:
import gensim

In [2]:
# Data
documents = [["my", "name", "is", "jamie"], ["jamie", "is", "cute"]]

In [3]:
# Initialize the model
model = gensim.models.Word2Vec(min_count=1) # e.g. Word2Vec(sentences, size=100, window=5, min_count=5, workers=4); gensim 4.0+ renames size to vector_size

In [4]:
# Build the vocabulary
model.build_vocab(documents)

In [5]:
# Train
model.train(documents, total_examples=len(documents), epochs=10)

8
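The example passes `min_count=1` because the toy corpus is tiny: with gensim's default of `min_count=5`, every word here would be dropped from the vocabulary. The following is a rough pure-Python sketch of that frequency filter only. The function name `build_vocab_sketch` is hypothetical, and this is not gensim's implementation, which does more than count words.

```python
from collections import Counter

def build_vocab_sketch(sentences, min_count=5):
    """Count tokens and keep those seen at least `min_count` times."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {token: n for token, n in counts.items() if n >= min_count}

documents = [["my", "name", "is", "jamie"], ["jamie", "is", "cute"]]

vocab_all = build_vocab_sketch(documents, min_count=1)       # keeps every token
vocab_frequent = build_vocab_sketch(documents, min_count=2)  # keeps repeated tokens only
print(sorted(vocab_all))
print(sorted(vocab_frequent))
```

With `min_count=2` only "is" and "jamie" survive, since they are the only tokens that appear in both sentences.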

In [6]:
# Save and load the trained model - use this later when you want to reuse the model.
fname = 'mymodel.wv'
model.save(fname)
my_model = gensim.models.Word2Vec.load(fname)  

In [7]:
%ls mymodel.wv

mymodel.wv


In [8]:
# Get the vector for a word
model.wv['name']  # numpy vector of a word

array([  9.82454745e-04,  -5.16234315e-04,  -2.97081278e-04,
         1.30274275e-03,   1.11550943e-03,   3.74730938e-04,
        -1.71563565e-03,  -4.92801145e-03,  -2.90550711e-03,
        -4.04318329e-03,  -1.53156300e-03,   2.63447128e-03,
         1.72043452e-03,  -1.50345045e-03,  -2.63330236e-04,
        -3.71542643e-03,  -3.73801659e-03,   2.68575759e-03,
        -1.50888134e-03,  -1.75534142e-03,  -2.72211432e-03,
        -2.30234233e-03,   3.14465375e-03,  -4.14433103e-04,
        -1.01461529e-03,  -2.33443780e-03,   4.97011701e-04,
         3.18882242e-03,  -1.59351365e-03,  -4.50814050e-03,
         3.67813767e-03,   3.12245265e-03,   4.66409000e-03,
        -4.90983250e-03,   2.67941342e-03,   2.99838092e-03,
         3.72898625e-03,  -1.49317016e-03,   1.09126943e-03,
        -2.07198458e-03,   3.19081172e-03,   3.66241974e-03,
         5.49830438e-04,   9.23817861e-05,   3.36892856e-03,
         3.07345181e-04,  -4.84694773e-03,  -1.71916778e-04,
         4.95858490e-03,
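A word vector like the one above is mostly useful in comparison with other vectors, typically through cosine similarity. Below is a minimal sketch of that computation in plain Python. The vectors are made-up toy values, not model output, and `cosine_similarity` is a name chosen here for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional vectors, made up for illustration (not model output).
vec_a = [1.0, 2.0, 3.0]
vec_b = [2.0, 4.0, 6.0]   # same direction as vec_a
vec_c = [-3.0, 0.0, 1.0]  # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b))  # close to 1: same direction
print(cosine_similarity(vec_a, vec_c))  # close to 0: unrelated directions
```

With a trained model, `model.wv.similarity('name', 'jamie')` returns this kind of score for two words in the vocabulary.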

### Korean example

##### Build the data from the following documents.
* https://gasazip.com/view.html?no=614736
* https://gasazip.com/1224697
* https://gasazip.com/view.html?no=599082
* https://gasazip.com/view.html?no=645465
* http://gasazip.com/view.html?no=643505
* https://gasazip.com/view.html?no=615362

In [9]:
# -- Coding / the lines below are an example - skip preprocessing for now and try it directly.

# Data
# ---- from here ---------------------------------
d1 = ["매일", "울리는", "벨벨벨"]
d2 = ["Sign을", "보내", "signal 보내"]

documents = []
documents.append(d1)
documents.append(d2)
# ---- up to here: replace this part with real data for the exercise ----
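One way to fill in the data section without any preprocessing is to split raw text on line breaks and whitespace. The sketch below assumes the text has already been copied into a string. The helper name `lines_to_documents` is hypothetical, and the placeholder text only reuses the tokens from the cell above.

```python
def lines_to_documents(raw_text):
    """Split raw text into one token list per non-empty line."""
    documents = []
    for line in raw_text.splitlines():
        tokens = line.split()  # whitespace tokenization only, no morphological analysis
        if tokens:             # skip blank lines
            documents.append(tokens)
    return documents

# Placeholder text built from the tokens used in the cell above.
raw_text = """매일 울리는 벨벨벨

Sign을 보내 signal 보내
"""

documents = lines_to_documents(raw_text)
print(documents)
```

Note that whitespace splitting turns "signal 보내" into two tokens, unlike the hand-written `d2` above, which keeps it as one.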

In [10]:

# Initialize the model
model = gensim.models.Word2Vec(min_count=1) # e.g. Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)

# Build the vocabulary
model.build_vocab(documents)

# Train
model.train(documents, total_examples=len(documents), epochs=10)

# Get the vector for a word
model.wv['벨벨벨']  # numpy vector of a word

array([ -4.24047466e-03,  -8.78395222e-04,  -2.10404373e-03,
         3.52675840e-03,   3.94527335e-03,   3.14284395e-03,
         3.99605231e-03,   3.64570064e-03,  -2.09902506e-03,
         1.19320431e-03,  -4.23430884e-03,   2.13081669e-03,
         2.67161266e-03,  -4.19553835e-03,  -4.54576965e-03,
         9.87586682e-04,  -2.96802027e-03,  -2.11419654e-04,
         2.68908497e-03,   1.22806896e-03,  -4.80026612e-03,
        -1.90806238e-03,  -1.88224181e-03,   3.23392195e-03,
         2.35580001e-03,  -2.88254954e-03,  -2.66816420e-03,
         4.11981763e-03,  -1.98943005e-03,   2.58773472e-03,
        -3.33732739e-03,  -3.34762549e-03,   3.44596081e-03,
         3.39682703e-03,  -1.38955409e-04,   2.89256824e-03,
        -3.37066990e-03,  -2.63776327e-03,   4.45120968e-03,
         2.50051916e-03,  -2.43246602e-03,  -1.47697655e-03,
         6.64008530e-06,   4.92979493e-03,  -3.65482690e-03,
        -4.74229315e-03,  -3.21268174e-03,  -3.97545286e-03,
         2.90158787e-03,

# A Korean Word2Vec example with Gensim

* [3] Korean Word2Vec - http://blog.theeluwin.kr/post/146591096133/%ED%95%9C%EA%B5%AD%EC%96%B4-word2vec


##### Build the data from the following documents.
* https://gasazip.com/view.html?no=614736
* https://gasazip.com/1224697
* https://gasazip.com/view.html?no=599082
* https://gasazip.com/view.html?no=645465
* http://gasazip.com/view.html?no=643505
* https://gasazip.com/view.html?no=615362    

### Exercise
Work through reference [3] yourself, this time including morphological analysis.

In [None]:
# -- coding
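As a possible starting point for the exercise, the sketch below shows only the filtering step that follows morphological analysis. It assumes the tagger returns `(word, tag)` pairs, which is the shape konlpy's `pos()` methods produce. The pairs here are hand-written stand-ins rather than real tagger output, and `keep_nouns` is a hypothetical helper name.

```python
def keep_nouns(tagged_tokens):
    """Keep the surface form of every token tagged 'Noun'."""
    return [word for word, tag in tagged_tokens if tag == "Noun"]

# Hand-written (word, tag) pairs standing in for tagger output.
tagged = [("매일", "Noun"), ("울리는", "Verb"), ("벨", "Noun")]

sentence = keep_nouns(tagged)
print(sentence)
```

Applying this to every document gives a list of token lists, the same input shape that `build_vocab` and `train` received in the earlier cells.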

# Visualizing Korean Word2Vec with Gensim and plotly (t-SNE)
* Building a visualization function with plotly
* Visualizing the simple model's results
* Visualizing a model trained on konlpy's built-in National Assembly data

#### References
* [7] Word2Vec Tutorial - https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/word2vec.ipynb
* [8] Similarity analysis using Korean word2vec - http://www.neuromancer.kr/t/word2vec/487

## Building a visualization function with plotly

In [15]:
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

#logging.getLogger().setLevel(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

from sklearn.decomposition import IncrementalPCA    # initial reduction
from sklearn.manifold import TSNE                   # final reduction
import numpy as np                                  # array handling

from plotly.offline import init_notebook_mode, iplot, plot
import plotly.graph_objs as go

def reduce_dimensions(model, plot_in_notebook = True):

    num_dimensions = 2  # final num dimensions (2D, 3D, etc)

    vectors = []        # positions in vector space
    labels = []         # keep track of words to label our data again later
    for word in model.wv.vocab:      # gensim 3.x API; gensim 4.0+ uses model.wv.index_to_key
        vectors.append(model[word])  # deprecated lookup (see the warning in the output); model.wv[word] avoids it
        labels.append(word)

    # convert both lists into numpy arrays for reduction
    vectors = np.asarray(vectors)
    labels = np.asarray(labels)

    # reduce using t-SNE
    logging.info('starting tSNE dimensionality reduction. This may take some time.')
    tsne = TSNE(n_components=num_dimensions, random_state=0)
    vectors = tsne.fit_transform(vectors)

    x_vals = [v[0] for v in vectors]
    y_vals = [v[1] for v in vectors]
        
    # Create a trace
    trace = go.Scatter(
        x=x_vals,
        y=y_vals,
        mode='text',
        text=labels
        )
    
    data = [trace]
    
    logging.info('All done. Plotting.')
    
    if plot_in_notebook:
        init_notebook_mode(connected=True)
        iplot(data, filename='word-embedding-plot')
    else:
        plot(data, filename='word-embedding-plot.html')
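The function above reads the vocabulary through `model.wv.vocab`, which exists in gensim 3.x. In gensim 4.0 and later the word list is exposed as `index_to_key` instead. A small helper along these lines could let the same loop run on either version. The name `vocabulary_words` is hypothetical, and the demo uses stand-in objects rather than real models.

```python
from types import SimpleNamespace

def vocabulary_words(keyed_vectors):
    """Return the vocabulary as a list of words for gensim 3.x or 4.x."""
    if hasattr(keyed_vectors, "index_to_key"):  # gensim 4.x: list of words
        return list(keyed_vectors.index_to_key)
    return list(keyed_vectors.vocab)            # gensim 3.x: dict keyed by word

# Stand-in objects that mimic only the attribute each gensim version exposes.
old_style = SimpleNamespace(vocab={"name": None, "jamie": None})
new_style = SimpleNamespace(index_to_key=["name", "jamie"])

print(vocabulary_words(old_style))
print(vocabulary_words(new_style))
```

Inside `reduce_dimensions`, the loop would then iterate over `vocabulary_words(model.wv)` and read each vector with `model.wv[word]`.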

## Visualizing the simple model's results

In [16]:
reduce_dimensions(model)


Call to deprecated `__getitem__` (Method will be removed in 4.0.0, use self.wv.__getitem__() instead).

2018-02-09 07:05:25,415 : INFO : starting tSNE dimensionality reduction. This may take some time.
2018-02-09 07:05:25,596 : INFO : All done. Plotting.


## Visualizing a model trained on konlpy's built-in National Assembly data

In [17]:
# Load from kobill
from konlpy.corpus import kobill
docs_ko = [kobill.open(i).read() for i in kobill.fileids()]

In [41]:
# Tokenize
from konlpy.tag import Twitter; t = Twitter()  # newer konlpy releases name this class Okt
#pos = lambda d: ['/'.join(p) for p in t.pos(d)]
pos = lambda d: [p[0] for p in t.pos(d) if p[1] == 'Noun'] # nouns only
texts_ko = [pos(doc) for doc in docs_ko]

In [61]:
# Initialize the model
model = gensim.models.Word2Vec(min_count=1) # e.g. Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)

# Build the vocabulary
model.build_vocab(texts_ko)

# Train
model.train(texts_ko, total_examples=len(texts_ko), epochs=10)

2018-02-09 07:55:02,483 : INFO : collecting all words and their counts
2018-02-09 07:55:02,488 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2018-02-09 07:55:02,492 : INFO : collected 1065 word types from a corpus of 8378 raw words and 10 sentences
2018-02-09 07:55:02,493 : INFO : Loading a fresh vocabulary
2018-02-09 07:55:02,498 : INFO : min_count=1 retains 1065 unique words (100% of original 1065, drops 0)
2018-02-09 07:55:02,499 : INFO : min_count=1 leaves 8378 word corpus (100% of original 8378, drops 0)
2018-02-09 07:55:02,508 : INFO : deleting the raw counts dictionary of 1065 items
2018-02-09 07:55:02,509 : INFO : sample=0.001 downsamples 84 most-common words
2018-02-09 07:55:02,511 : INFO : downsampling leaves estimated 6464 word corpus (77.2% of prior 8378)
2018-02-09 07:55:02,512 : INFO : estimated required memory for 1065 words and 100 dimensions: 1384500 bytes
2018-02-09 07:55:02,522 : INFO : resetting layer weights
2018-02-09 07:55:02,556 : IN

64719

In [62]:
reduce_dimensions(model)


Call to deprecated `__getitem__` (Method will be removed in 4.0.0, use self.wv.__getitem__() instead).

2018-02-09 07:55:04,470 : INFO : starting tSNE dimensionality reduction. This may take some time.
2018-02-09 07:55:10,977 : INFO : All done. Plotting.


# Visualizing Korean Word2Vec with Gensim and TensorBoard (t-SNE)

#### References
* [9] TensorBoard Visualizations - https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/Tensorboard_visualizations.ipynb

In [None]:
# -- to be updated
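Until this section is filled in, here is a minimal sketch of one preparatory step: TensorBoard's Embedding Projector can load a tab-separated file of vectors together with a metadata file of labels. The helper name `write_projector_files`, the file names, and the toy data are all assumptions made for illustration. Reference [9] may take a different route.

```python
import os
import tempfile

def write_projector_files(words, vectors, vectors_path, metadata_path):
    """Write one tab-separated vector per line and one label per line."""
    with open(vectors_path, "w", encoding="utf-8") as vec_file:
        for vector in vectors:
            vec_file.write("\t".join(str(value) for value in vector) + "\n")
    # A single-column metadata file carries no header row.
    with open(metadata_path, "w", encoding="utf-8") as meta_file:
        for word in words:
            meta_file.write(word + "\n")

# Toy labels and 3-dimensional vectors, made up for illustration.
words = ["벨벨벨", "매일"]
vectors = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.5]]

out_dir = tempfile.mkdtemp()
vectors_path = os.path.join(out_dir, "vectors.tsv")
metadata_path = os.path.join(out_dir, "metadata.tsv")
write_projector_files(words, vectors, vectors_path, metadata_path)
print(vectors_path, metadata_path)
```

For a trained model, `words` could be the vocabulary and `vectors` the matching `model.wv[word]` rows, kept in the same order so that each label lines up with its vector.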

# Word2Vec online update with Gensim (updating an existing model with a new corpus)

#### References
* [10] Online word2vec tutorial - https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/online_w2v_tutorial.ipynb

In [None]:
# -- to be updated
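While this section is pending, the call sequence for an online update can be sketched as follows. It assumes `model` is an already trained gensim `Word2Vec` instance and `new_sentences` is a list of token lists. `build_vocab(..., update=True)` adds the new words to the existing vocabulary, and `train` then continues training on the new sentences. The wrapper name `update_with_new_corpus` is hypothetical, and reference [10] remains the fuller walkthrough.

```python
def update_with_new_corpus(model, new_sentences, epochs=10):
    """Extend a trained model's vocabulary, then keep training on new data.

    `model` is assumed to be a trained gensim Word2Vec instance and
    `new_sentences` a list of token lists (it must support len()).
    """
    # Add words from the new corpus to the existing vocabulary.
    model.build_vocab(new_sentences, update=True)
    # Continue training on the new sentences only.
    model.train(new_sentences, total_examples=len(new_sentences), epochs=epochs)
    return model

# Usage, assuming `model` and `texts_ko` from the cells above:
#   update_with_new_corpus(model, texts_ko, epochs=10)
```

Words that appear only in the new corpus are still subject to the model's `min_count` threshold, so rare new words may not enter the vocabulary.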

# References
* [1] Brain's Pick: How to measure similarity between words - https://brunch.co.kr/@kakao-it/189
* [2] models.word2vec – Deep learning with word2vec - https://radimrehurek.com/gensim/models/word2vec.html
* [3] Korean Word2Vec - http://blog.theeluwin.kr/post/146591096133/%ED%95%9C%EA%B5%AD%EC%96%B4-word2vec
* [4] Classifying sentences with Word2Vec - https://ratsgo.github.io/natural%20language%20processing/2017/03/08/word2vec/
* [5] A summary of word2vec theory - https://shuuki4.wordpress.com/2016/01/27/word2vec-%EA%B4%80%EB%A0%A8-%EC%9D%B4%EB%A1%A0-%EC%A0%95%EB%A6%AC/
* [6] Vector Representations of Words - https://www.tensorflow.org/tutorials/word2vec
* [7] Word2Vec Tutorial - https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/word2vec.ipynb
* [8] Similarity analysis using Korean word2vec - http://www.neuromancer.kr/t/word2vec/487
* [9] TensorBoard Visualizations - https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/Tensorboard_visualizations.ipynb
* [10] Online word2vec tutorial - https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/online_w2v_tutorial.ipynb