# 01.Text Vectorization (Draft)

# 차례
* Frequency Vectors
* One-Hot Encoding
* Term Frequency-Inverse Document Frequency
* Distributed Representation

-----------------------

<img src="https://nbviewer.jupyter.org/github/psygrammer/psyml/blob/master/nlp_ml/ch01/figures/cap1.3.png" width=600 />
<img src="https://nbviewer.jupyter.org/github/psygrammer/psyml/blob/master/nlp_ml/ch01/figures/cap1.4.png" width=600 />

* 출처 - Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning - https://www.amazon.com/Applied-Text-Analysis-Python-Language-Aware/dp/1491963042/

<img src="figures/main_nlp_01.png" width=600 />
<img src="figures/main_nlp_02.png" width=600 />
<img src="figures/main_nlp_03.png" width=600 />
<img src="figures/main_nlp_04.png" width=600 />
<img src="figures/main_nlp_05.png" width=600 />

* 출처 - Natural Language Processing - https://www.coursera.org/learn/language-processing - Main approaches in NLP

<img src="https://nbviewer.jupyter.org/github/psygrammer/psyml/blob/master/nlp_ml/ch04/figures/cap01.png" width=600 />

* 출처 - Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning - https://www.amazon.com/Applied-Text-Analysis-Python-Language-Aware/dp/1491963042/

<img src="https://slideplayer.com/slide/5270778/17/images/10/Distributed+Word+Representation.jpg" width=600 />

* 출처 - https://slideplayer.com/slide/5270778/

---------------------

# Frequency Vectors

<img src="https://nbviewer.jupyter.org/github/psygrammer/psyml/blob/master/nlp_ml/ch04/figures/cap02.png" width=600 />

* 출처 - Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning - https://www.amazon.com/Applied-Text-Analysis-Python-Language-Aware/dp/1491963042/

In [1]:
import gensim
import numpy as np

from pprint import pprint

gensim을 이용하려면 문서 목록인 documents를 corpus 클래스로 바꿔야 한다.

In [2]:
# 가상의 문서 4개
documents_en = [
    "a b c a",
    "c b c",
    "b b a",
    "a c c",
    "c b a",
]

In [3]:
# 단어(토큰) 단위로 분할
def tokenize(x) :
    for token in x.split() :
        yield token
        
corpus_en = [list(tokenize(doc)) for doc in  documents_en]

pprint(corpus_en)

[['a', 'b', 'c', 'a'],
 ['c', 'b', 'c'],
 ['b', 'b', 'a'],
 ['a', 'c', 'c'],
 ['c', 'b', 'a']]


In [4]:
# Dictionary 객체. 구체적으로는 단어에 id 할당 (그 외에도 여러가지 기능은 튜토리얼 참조)
dictionary_en = gensim.corpora.Dictionary(corpus_en)

In [5]:
dictionary_en

<gensim.corpora.dictionary.Dictionary at 0x1a173c9048>

In [6]:
dictionary_en.token2id

{'a': 0, 'b': 1, 'c': 2}

In [7]:
vectors = [dictionary_en.doc2bow(doc) for doc in corpus_en]    

In [8]:
pprint(vectors)

[[(0, 2), (1, 1), (2, 1)],
 [(1, 1), (2, 2)],
 [(0, 1), (1, 2)],
 [(0, 1), (2, 2)],
 [(0, 1), (1, 1), (2, 1)]]


----------------

# One-Hot Encoding

<img src="https://nbviewer.jupyter.org/github/psygrammer/psyml/blob/master/nlp_ml/ch04/figures/cap03.png" width=600 />

* 출처 - Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning - https://www.amazon.com/Applied-Text-Analysis-Python-Language-Aware/dp/1491963042/

In [9]:
corpus_en

[['a', 'b', 'c', 'a'],
 ['c', 'b', 'c'],
 ['b', 'b', 'a'],
 ['a', 'c', 'c'],
 ['c', 'b', 'a']]

In [12]:
[dictionary_en.doc2bow(doc) for doc in corpus_en]

[[(0, 2), (1, 1), (2, 1)],
 [(1, 1), (2, 2)],
 [(0, 1), (1, 2)],
 [(0, 1), (2, 2)],
 [(0, 1), (1, 1), (2, 1)]]

In [13]:
[[token for token in dictionary_en.doc2bow(doc)] 
                         for doc in corpus_en]

[[(0, 2), (1, 1), (2, 1)],
 [(1, 1), (2, 2)],
 [(0, 1), (1, 2)],
 [(0, 1), (2, 2)],
 [(0, 1), (1, 1), (2, 1)]]

In [14]:
[[(token[0], token[1]) for token in dictionary_en.doc2bow(doc)] 
                         for doc in corpus_en]

[[(0, 2), (1, 1), (2, 1)],
 [(1, 1), (2, 2)],
 [(0, 1), (1, 2)],
 [(0, 1), (2, 2)],
 [(0, 1), (1, 1), (2, 1)]]

In [15]:
vectors = [[(token[0], 1) for token in dictionary_en.doc2bow(doc)] 
                         for doc in corpus_en]

In [17]:
vectors  = np.array([
        [(token[0], 1) for token in dictionary_en.doc2bow(doc)]
        for doc in corpus_en
    ])

In [18]:
vectors

array([list([(0, 1), (1, 1), (2, 1)]), list([(1, 1), (2, 1)]),
       list([(0, 1), (1, 1)]), list([(0, 1), (2, 1)]),
       list([(0, 1), (1, 1), (2, 1)])], dtype=object)

-------------------

# Term Frequency-Inverse Document Frequency

<img src="https://nbviewer.jupyter.org/github/psygrammer/psyml/blob/master/nlp_ml/ch04/figures/cap04.png" width=600 />

* 출처 - Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning - https://www.amazon.com/Applied-Text-Analysis-Python-Language-Aware/dp/1491963042/

In [19]:
documents_en

['a b c a', 'c b c', 'b b a', 'a c c', 'c b a']

In [20]:
corpus_en

[['a', 'b', 'c', 'a'],
 ['c', 'b', 'c'],
 ['b', 'b', 'a'],
 ['a', 'c', 'c'],
 ['c', 'b', 'a']]

In [21]:
# Dictionary 생성
dictionary_en = gensim.corpora.Dictionary(corpus_en)

In [22]:
# Tfidf Model 생성
tfidf_en = gensim.models.TfidfModel(dictionary=dictionary_en, normalize=True)
tfidf_en

<gensim.models.tfidfmodel.TfidfModel at 0x1a173e7550>

In [24]:
vectors = [tfidf_en[dictionary_en.doc2bow(vector)] for vector in corpus_en]
vectors

[[(0, 0.816496580927726), (1, 0.408248290463863), (2, 0.408248290463863)],
 [(1, 0.447213595499958), (2, 0.894427190999916)],
 [(0, 0.447213595499958), (1, 0.894427190999916)],
 [(0, 0.447213595499958), (2, 0.894427190999916)],
 [(0, 0.5773502691896257), (1, 0.5773502691896257), (2, 0.5773502691896257)]]

In [25]:
# Dictionary와 tfidf model을 저장해 놓으면 나중에 다시 로드해서 쓸 수 있어 편하다. 
dictionary_en.save_as_text('test_en.txt')
tfidf_en.save('tfidf_en.pkl')

In [26]:
%ls test_en.txt

test_en.txt


In [27]:
%ls tfidf_en.pkl

tfidf_en.pkl


--------------------------

#### 한국어 예제 실습

* 출처 - 나만의 웹 크롤러 만들기 with Requests/BeautifulSoup - https://beomi.github.io/2017/01/20/HowToMakeWebCrawler/
* 출처 - https://www.slideshare.net/kimhyunjoonglovit/pycon2017-koreannlp
* 출처 - https://github.com/lovit/soynlp/blob/master/tutorials/nounextractor-v2_usage.ipynb
* 출처 - https://wikidocs.net/24603

##### 데이터 수집 - 간단한 웹 스크래핑

In [28]:
import requests
from bs4 import BeautifulSoup

In [29]:
#url = "https://gasazip.com/view.html?no=614736"
#url = "https://gasazip.com/view.html?no=636135"
url = "http://gasazip.com/view.html?no=2276458"

In [30]:
# HTTP GET Request
req = requests.get(url)
# HTML 소스 가져오기
html = req.text
# BeautifulSoup으로 html소스를 python객체로 변환하기
# 첫 인자는 html소스코드, 두 번째 인자는 어떤 parser를 이용할지 명시.
# 여기서는 Python 내장 html.parser를 이용했다.
soup = BeautifulSoup(html, 'html.parser')

In [31]:
lyrics = []
for txt in soup.find_all('div', attrs={'class': 'col-md-8'}) :
    lines = txt.get_text().split('\n')
    for line in lines :
        lyrics.append(line.strip())

In [32]:
lyrics

['작은 것들을 위한 시 (Boy With Luv)Feat.Halsey',
 '',
 '모든 게 궁금해',
 'How’s your day',
 'Oh tell me',
 '뭐가 널 행복하게 하는지',
 'Oh text me',
 'Your every picture',
 '내 머리맡에 두고 싶어',
 'oh bae',
 'Come be my teacher',
 '네 모든 걸 다 가르쳐줘',
 'Your 1 your 2',
 'Listen my my baby 나는',
 '저 하늘을 높이 날고 있어',
 '그때 니가 내게 줬던 두 날개로',
 '이제 여긴 너무 높아',
 '난 내 눈에 널 맞추고 싶어',
 'Yeah you makin’ me a boy with luv',
 'Oh my my my oh my my my',
 "I've waited all my life",
 '네 전부를 함께하고 싶어',
 'Oh my my my oh my my my',
 'Looking for something right',
 '이제 조금은 나 알겠어',
 'I want something stronger',
 'Than a moment',
 'than a moment love',
 'I have waited longer',
 'For a boy with',
 'For a boy with luv',
 '널 알게 된 이후 ya',
 '내 삶은 온통 너 ya',
 '사소한 게 사소하지 않게',
 '만들어버린 너라는 별',
 '하나부터 열까지 모든 게 특별하지',
 '너의 관심사 걸음걸이 말투와',
 '사소한 작은 습관들까지',
 '다 말하지 너무 작던',
 '내가 영웅이 된 거라고',
 'Oh nah',
 '난 말하지 운명 따윈',
 '처음부터 내 게 아니었다고',
 'Oh nah',
 '세계의 평화',
 'No way',
 '거대한 질서',
 'No way',
 '그저 널 지킬 거야 난',
 'Boy with luv',
 'Listen my my baby 나는',
 '저 하늘을 높이 날고

In [44]:
# 하나로는 좀 부족한 듯. 가사를 여럿 모아보자.

import collections
 
BTS = collections.OrderedDict()
BTS['No More Dream'] = 638198
BTS['I NEED U'] = 576129
BTS['쩔어'] = 628809
BTS['봄날'] = 644668
BTS['DNA'] = 1235594
BTS['피 땀 눈물'] = 630845
BTS['불타오르네'] = 615362
BTS['IDOL'] = 2262224
BTS['작은 것들을 위한 시'] = 2276458

In [35]:
lyrics_ids = BTS.values()
url_root = "http://gasazip.com/view.html?no="

urls = [url_root + str(lid) for lid in lyrics_ids]

pprint(urls)

['http://gasazip.com/view.html?no=638198',
 'http://gasazip.com/view.html?no=576129',
 'http://gasazip.com/view.html?no=628809',
 'http://gasazip.com/view.html?no=644668',
 'http://gasazip.com/view.html?no=1235594',
 'http://gasazip.com/view.html?no=630845',
 'http://gasazip.com/view.html?no=615362',
 'http://gasazip.com/view.html?no=2262224',
 'http://gasazip.com/view.html?no=2276458']


In [36]:
docs = []
for url in urls :
    req = requests.get(url)
    html = req.text
    soup = BeautifulSoup(html, 'html.parser')
    
    doc = []
    for txt in soup.find_all('div', attrs={'class': 'col-md-8'}) :
        doc.append(txt.get_text())
    
    docs.append(",".join(doc))

In [37]:
len(docs)

9

In [41]:
print(docs[:3])

['No More Dream,\n얌마 네 꿈은 뭐니\n얌마 네 꿈은 뭐니\n얌마 네 꿈은 뭐니\n네 꿈은 겨우 그거니\n\nI wanna big house, big cars & big rings\nBut 사실은 I dun have any big dreams\n하하 난 참 편하게 살어\n꿈 따위 안 꿔도 아무도 뭐라 안 하잖어\n전부 다다다 똑같이 나처럼 생각하고 있어\n새까까까맣게 까먹은 꿈 많던 어린 시절\n대학은 걱정 마 멀리라도 갈 거니까\n알았어 엄마 지금 독서실 간다니까\n\n네가 꿈꿔 온 네 모습이 뭐야\n지금 네 거울 속엔 누가 보여, I gotta say\n너의 길을 가라고\n단 하루를 살아도\n뭐라도 하라고\n나약함은 담아 둬\n\n왜 말 못하고 있어? 공부는 하기 싫다면서\n학교 때려 치기는 겁나지? 이거 봐 등교할 준비하네 벌써\n철 좀 들어 제발 좀, 너 입만 살아가지고 인마 유리 멘탈 boy\n(Stop!) 자신에게 물어봐 언제 네가 열심히 노력했냐고\n\n얌마 네 꿈은 뭐니\n얌마 네 꿈은 뭐니\n얌마 네 꿈은 뭐니\n네 꿈은 겨우 그거니\n\n거짓말이야 you such a liar\nSee me see me ya 넌 위선자여\n왜 자꾸 딴 길을 가래 야 너나 잘해\n제발 강요하진 말아 줘\n(La la la la la) 네 꿈이 뭐니 네 꿈이 뭐니 뭐니\n(La la la la la) 고작 이거니 고작 이거니 거니\n\n지겨운 same day, 반복되는 매일에\n어른들과 부모님은 틀에 박힌 꿈을 주입해\n장래희망 넘버원... 공무원?\n강요된 꿈은 아냐, 9회말 구원투수\n시간 낭비인 야자에 돌직구를 날려\n지옥 같은 사회에 반항해, 꿈을 특별 사면\n자신에게 물어봐 네 꿈의 profile\n억압만 받던 인생 네 삶의 주어가 되어 봐\n\n네가 꿈꿔 온 네 모습이 뭐여\n지금 네 거울 속엔 누가 보여, I gotta say\n너의 길을 가라고\n단 하루를 살아도\n뭐라도 하라고\n나약함은 담아 둬\n\n얌마 네 꿈은 뭐니\n얌마 네 

In [42]:
documents_ko = docs

##### 데이터 전처리

In [43]:
# 현재 문서는 좀 번잡하다.형태소 분석을 이용해서 정리하고 싶다.
doc = documents_ko[2]
doc

"쩔어,\n어서 와 방탄은 처음이지?\n\nAyo ladies & gentleman\n준비가 됐다면 부를게 yeah!\n딴 녀석들과는 다르게\n내 스타일로 내 내 내 내 스타일로 에오!\n\n밤새 일했지 everyday\n니가 클럽에서 놀 때 yeah\n자 놀라지 말고 들어 매일\nI got a feel, I got a feel\n난 좀 쩔어!\n\n아 쩔어 쩔어 쩔어 우리 연습실 땀내\n봐 쩌렁 쩌렁 쩌렁한 내 춤이 답해\n모두 비실이 찌질이 찡찡이 띨띨이들\n나랑은 상관이 없어 cuz 난 희망이 쩔어 haha\n\nOk 우린 머리부터 발끝까지 전부 다 쩌 쩔어\n하루의 절반을 작업에 쩌 쩔어\n작업실에 쩔어 살어 청춘은 썩어가도\n덕분에 모로 가도 달리는 성공가도\n소녀들아 더 크게 소리질러 쩌 쩌렁\n\n밤새 일했지 everyday\n니가 클럽에서 놀 때 yeah\n딴 녀석들과는 다르게\nI don't wanna say yes\nI don't wanna say yes\n\n소리쳐봐 all right\n몸이 타버리도록 all night (all night)\nCause we got fire (fire!)\nHigher (higher!)\nI gotta make it, I gotta make it\n쩔어!\n\n거부는 거부해\n난 원래 너무해\n모두 다 따라 해\n쩔어\n\n거부는 거부해\n전부 나의 노예\n모두 다 따라 해\n쩔어\n\n3포세대? 5포세대?\n그럼 난 육포가 좋으니까 6포세대\n언론과 어른들은 의지가 없다며 우릴 싹 주식처럼 매도해\n왜 해보기도 전에 죽여 걔넨 enemy enemy enemy\n왜 벌써부터 고개를 숙여 받아 energy energy energy\n절대 마 포기 you know you not lonely\n너와 내 새벽은 낮보다 예뻐\nSo can I get a little bit of hope? (yeah)\n잠든 청춘을 깨워 go\n\n밤새 일했지 everyday\n니가 클럽에서 놀 때 yeah\n딴 녀석들과는 다르게\n

In [45]:
# 간단한 전처리를 해보자 
from soynlp.noun import LRNounExtractor_v2

noun_extractor = LRNounExtractor_v2(verbose=True)

[Noun Extractor] use default predictors
[Noun Extractor] num features: pos=1260, neg=1173, common=12


In [46]:
# 학습기반 단어(명사) 추출기
nouns = noun_extractor.train_extract(documents_ko)

[Noun Extractor] counting eojeols
[EojeolCounter] n eojeol = 1076 from 9 sents. mem=0.117 Gb                    
[Noun Extractor] complete eojeol counter -> lr graph
[Noun Extractor] has been trained. #eojeols=2727, mem=0.118 Gb
[Noun Extractor] batch prediction was completed for 286 words
[Noun Extractor] checked compounds. discovered 0 compounds
[Noun Extractor] postprocessing detaching_features : 80 -> 80
[Noun Extractor] postprocessing ignore_features : 80 -> 77
[Noun Extractor] postprocessing ignore_NJ : 77 -> 77
[Noun Extractor] 77 nouns (0 compounds) with min frequency=1
[Noun Extractor] flushing was done. mem=0.119 Gb                    
[Noun Extractor] 19.80 % eojeols are covered


In [47]:
len(nouns)

77

In [48]:
nouns

{'DNA': NounScore(frequency=5, score=0.8333333333333334),
 '거짓말': NounScore(frequency=5, score=1.0),
 '어른들': NounScore(frequency=2, score=1.0),
 '스타일': NounScore(frequency=4, score=1.0),
 '우리': NounScore(frequency=6, score=1.0),
 '운명': NounScore(frequency=8, score=1.0),
 '하늘': NounScore(frequency=4, score=1.0),
 '시간': NounScore(frequency=4, score=1.0),
 '청춘': NounScore(frequency=2, score=1.0),
 '겨울': NounScore(frequency=6, score=1.0),
 '벌써': NounScore(frequency=2, score=1.0),
 '처음': NounScore(frequency=1, score=0.5),
 '머리': NounScore(frequency=1, score=0.5),
 '우주': NounScore(frequency=3, score=1.0),
 '준비': NounScore(frequency=2, score=1.0),
 '모두': NounScore(frequency=6, score=1.0),
 '봄날': NounScore(frequency=3, score=0.75),
 '매일': NounScore(frequency=4, score=1.0),
 '사랑': NounScore(frequency=7, score=1.0),
 '눈물': NounScore(frequency=11, score=1.0),
 '새벽': NounScore(frequency=2, score=1.0),
 '함께': NounScore(frequency=3, score=0.5),
 '행복': NounScore(frequency=2, score=1.0),
 '아무': NounSc

In [49]:
# 토크나이저 
from soynlp.tokenizer import MaxScoreTokenizer

In [50]:
# 단어별 스코어 
scores = {w:s.score for w, s in nouns.items()}
scores

{'DNA': 0.8333333333333334,
 '거짓말': 1.0,
 '어른들': 1.0,
 '스타일': 1.0,
 '우리': 1.0,
 '운명': 1.0,
 '하늘': 1.0,
 '시간': 1.0,
 '청춘': 1.0,
 '겨울': 1.0,
 '벌써': 1.0,
 '처음': 0.5,
 '머리': 0.5,
 '우주': 1.0,
 '준비': 1.0,
 '모두': 1.0,
 '봄날': 0.75,
 '매일': 1.0,
 '사랑': 1.0,
 '눈물': 1.0,
 '새벽': 1.0,
 '함께': 0.5,
 '행복': 1.0,
 '아무': 1.0,
 '있어': 0.8333333333333334,
 '하루': 1.0,
 '멀리': 1.0,
 '거부': 1.0,
 '날개': 1.0,
 '상처': 1.0,
 '너무': 1.0,
 '달콤': 1.0,
 '이상': 1.0,
 '사소': 1.0,
 '쩌렁': 1.0,
 '특별': 1.0,
 '전부': 1.0,
 '자꾸': 1.0,
 '조금': 0.7142857142857143,
 '작업': 0.5,
 '춤': 1.0,
 '꿈': 1.0,
 '길': 1.0,
 '법': 1.0,
 '눈': 1.0,
 '숨': 1.0,
 '밤': 1.0,
 '손': 0.6666666666666666,
 '끝': 1.0,
 '말': 0.5,
 '보': 1.0,
 '내': 1.0,
 '네': 1.0,
 '수': 1.0,
 '많': 0.9512195121951219,
 '몸': 1.0,
 '높': 0.8,
 '독': 0.5,
 '힘': 0.5,
 '깊': 0.5,
 '너': 1.0,
 '못': 1.0,
 '꿔': 1.0,
 '뭣': 1.0,
 '것': 1.0,
 '둘': 1.0,
 '취': 1.0,
 '더': 1.0,
 '원': 1.0,
 '잘': 0.625,
 '욕': 1.0,
 '꺼': 0.3333333333333333,
 '생': 0.4,
 '때': 0.7142857142857143,
 '모': 1.0,
 '삶': 1.0,
 '열': 0.5}

In [51]:
# 토크나이저 생성
tokenizer = MaxScoreTokenizer(scores=scores)

In [57]:
# 토크나이저 적용.
for doc in documents_ko[:2] :
    print("\n---------------- 원문 --------------\n")
    print(doc)
    print("\n---------------- 처리 후 --------------\n")
    tokens = tokenizer.tokenize(doc)
    print(tokens)


---------------- 원문 --------------

No More Dream,
얌마 네 꿈은 뭐니
얌마 네 꿈은 뭐니
얌마 네 꿈은 뭐니
네 꿈은 겨우 그거니

I wanna big house, big cars & big rings
But 사실은 I dun have any big dreams
하하 난 참 편하게 살어
꿈 따위 안 꿔도 아무도 뭐라 안 하잖어
전부 다다다 똑같이 나처럼 생각하고 있어
새까까까맣게 까먹은 꿈 많던 어린 시절
대학은 걱정 마 멀리라도 갈 거니까
알았어 엄마 지금 독서실 간다니까

네가 꿈꿔 온 네 모습이 뭐야
지금 네 거울 속엔 누가 보여, I gotta say
너의 길을 가라고
단 하루를 살아도
뭐라도 하라고
나약함은 담아 둬

왜 말 못하고 있어? 공부는 하기 싫다면서
학교 때려 치기는 겁나지? 이거 봐 등교할 준비하네 벌써
철 좀 들어 제발 좀, 너 입만 살아가지고 인마 유리 멘탈 boy
(Stop!) 자신에게 물어봐 언제 네가 열심히 노력했냐고

얌마 네 꿈은 뭐니
얌마 네 꿈은 뭐니
얌마 네 꿈은 뭐니
네 꿈은 겨우 그거니

거짓말이야 you such a liar
See me see me ya 넌 위선자여
왜 자꾸 딴 길을 가래 야 너나 잘해
제발 강요하진 말아 줘
(La la la la la) 네 꿈이 뭐니 네 꿈이 뭐니 뭐니
(La la la la la) 고작 이거니 고작 이거니 거니

지겨운 same day, 반복되는 매일에
어른들과 부모님은 틀에 박힌 꿈을 주입해
장래희망 넘버원... 공무원?
강요된 꿈은 아냐, 9회말 구원투수
시간 낭비인 야자에 돌직구를 날려
지옥 같은 사회에 반항해, 꿈을 특별 사면
자신에게 물어봐 네 꿈의 profile
억압만 받던 인생 네 삶의 주어가 되어 봐

네가 꿈꿔 온 네 모습이 뭐여
지금 네 거울 속엔 누가 보여, I gotta say
너의 길을 가라고
단 하루를 살아도
뭐라도 하라고
나약함은 담아 둬

얌마 네 꿈은 뭐니
얌마 네 꿈은 뭐니
얌마 네 꿈은 뭐니
네 꿈은

In [58]:
# 그래도 뭔가 부족해보인다. 명사 위주로만 뽑자
# 명사 추출기를 통해 해당 단어가 명사인지 아닌지를 알 수 있다.
print(nouns.get('안녕'))

None


In [56]:
# 적용
for doc in documents_ko[:2] :
    print("\n---------------- 원문 --------------\n")
    print(doc)
    print("\n---------------- 처리 후 --------------\n")
    tokens = tokenizer.tokenize(doc)
    print([token for token in tokens if nouns.get(token)]) # 토큰 중에 명사인 것만 반영


---------------- 원문 --------------

No More Dream,
얌마 네 꿈은 뭐니
얌마 네 꿈은 뭐니
얌마 네 꿈은 뭐니
네 꿈은 겨우 그거니

I wanna big house, big cars & big rings
But 사실은 I dun have any big dreams
하하 난 참 편하게 살어
꿈 따위 안 꿔도 아무도 뭐라 안 하잖어
전부 다다다 똑같이 나처럼 생각하고 있어
새까까까맣게 까먹은 꿈 많던 어린 시절
대학은 걱정 마 멀리라도 갈 거니까
알았어 엄마 지금 독서실 간다니까

네가 꿈꿔 온 네 모습이 뭐야
지금 네 거울 속엔 누가 보여, I gotta say
너의 길을 가라고
단 하루를 살아도
뭐라도 하라고
나약함은 담아 둬

왜 말 못하고 있어? 공부는 하기 싫다면서
학교 때려 치기는 겁나지? 이거 봐 등교할 준비하네 벌써
철 좀 들어 제발 좀, 너 입만 살아가지고 인마 유리 멘탈 boy
(Stop!) 자신에게 물어봐 언제 네가 열심히 노력했냐고

얌마 네 꿈은 뭐니
얌마 네 꿈은 뭐니
얌마 네 꿈은 뭐니
네 꿈은 겨우 그거니

거짓말이야 you such a liar
See me see me ya 넌 위선자여
왜 자꾸 딴 길을 가래 야 너나 잘해
제발 강요하진 말아 줘
(La la la la la) 네 꿈이 뭐니 네 꿈이 뭐니 뭐니
(La la la la la) 고작 이거니 고작 이거니 거니

지겨운 same day, 반복되는 매일에
어른들과 부모님은 틀에 박힌 꿈을 주입해
장래희망 넘버원... 공무원?
강요된 꿈은 아냐, 9회말 구원투수
시간 낭비인 야자에 돌직구를 날려
지옥 같은 사회에 반항해, 꿈을 특별 사면
자신에게 물어봐 네 꿈의 profile
억압만 받던 인생 네 삶의 주어가 되어 봐

네가 꿈꿔 온 네 모습이 뭐여
지금 네 거울 속엔 누가 보여, I gotta say
너의 길을 가라고
단 하루를 살아도
뭐라도 하라고
나약함은 담아 둬

얌마 네 꿈은 뭐니
얌마 네 꿈은 뭐니
얌마 네 꿈은 뭐니
네 꿈은

In [63]:
# 단어(토큰) 단위로 분할하는 함수 - 그 중에 명사만 반환한다.
def tokenize(x, tokenizer) :
    for token in tokenizer.tokenize(x) :
        if nouns.get(token) :  # 명사만 선택한다.
            yield token
        
        
# 코퍼스를 만들자
corpus_ko = [list(tokenize(doc, tokenizer)) for doc in documents_ko]

pprint(corpus_ko[:2])

[['네',
  '네',
  '네',
  '네',
  '꿈',
  '아무',
  '전부',
  '있어',
  '꿈',
  '멀리',
  '네',
  '네',
  '하루',
  '말',
  '있어',
  '준비',
  '벌써',
  '너',
  '네',
  '네',
  '네',
  '네',
  '거짓말',
  '자꾸',
  '네',
  '네',
  '매일',
  '어른들',
  '시간',
  '특별',
  '네',
  '네',
  '네',
  '네',
  '하루',
  '네',
  '네',
  '네',
  '네',
  '거짓말',
  '자꾸',
  '네',
  '네',
  '꿔',
  '너',
  '거짓말',
  '자꾸',
  '네',
  '네'],
 ['너',
  '너',
  '뭣',
  '전부',
  '사랑',
  '사랑',
  '자꾸',
  '너무',
  '자꾸',
  '내',
  '내',
  '내',
  '아무',
  '말',
  '하늘',
  '하늘',
  '하늘',
  '내',
  '눈물',
  '더',
  '잘',
  '사랑',
  '자꾸',
  '너무',
  '사랑',
  '사랑',
  '수',
  '사랑',
  '자꾸',
  '너무']]


##### 벡터 생성

In [64]:
# Dictionary 생성
dictionary_ko = gensim.corpora.Dictionary(corpus_ko)

In [65]:
# Tfidf Model 생성
tfidf_ko = gensim.models.TfidfModel(dictionary=dictionary_ko, normalize=True)

In [66]:
vectors = [tfidf_ko[dictionary_ko.doc2bow(vector)] for vector in corpus_ko]
vectors[:2]

[[(0, 0.12107436527752928),
  (1, 0.11791395471330296),
  (2, 0.05895697735665148),
  (3, 0.05895697735665148),
  (4, 0.9685949222202342),
  (5, 0.04035812175917643),
  (6, 0.04035812175917643),
  (7, 0.04035812175917643),
  (8, 0.04035812175917643),
  (9, 0.02947848867832574),
  (10, 0.04035812175917643),
  (11, 0.04035812175917643),
  (12, 0.03154354402424213),
  (13, 0.08843546603497723),
  (14, 0.010879633080850683),
  (15, 0.04035812175917643),
  (16, 0.04035812175917643),
  (17, 0.05895697735665148)],
 [(3, 0.1810555648125473),
  (5, 0.12393889336758655),
  (10, 0.12393889336758655),
  (13, 0.3621111296250946),
  (14, 0.0334111109613129),
  (18, 0.1336444438452516),
  (19, 0.14530428209412652),
  (20, 0.12393889336758655),
  (21, 0.048434760698042166),
  (22, 0.1810555648125473),
  (23, 0.7436333602055194),
  (24, 0.09052778240627365),
  (25, 0.12393889336758655),
  (26, 0.3718166801027597)]]

##### 벡터 비교

In [67]:
# tf-idf 기반 벡터 유사도 
from gensim import similarities

In [68]:
# 거리 구하는 함수
def distance(a, b, dic) :
    index = similarities.MatrixSimilarity([a],num_features=len(dic))
    sim = index[b]
    return sim[0]*100

In [69]:
# 해당 벡터가 어떤 곡에 대한 것인지를 출력하기 위해 제목 리스트만 따로 뽑음
titles = list(BTS.keys())

In [70]:
# 두 벡터 비교 

a = 0
print("A: ")
print(titles[a])
print(corpus_ko[a])
print(vectors[a])

print("-------------------")

b = 3
print("B: ")
print(titles[b])
print(corpus_ko[b])
print(vectors[b])

sim = distance(vectors[a], vectors[b], dictionary_ko)

print("===================")
print(round(sim,2),'% similar')

A: 
No More Dream
['네', '네', '네', '네', '꿈', '아무', '전부', '있어', '꿈', '멀리', '네', '네', '하루', '말', '있어', '준비', '벌써', '너', '네', '네', '네', '네', '거짓말', '자꾸', '네', '네', '매일', '어른들', '시간', '특별', '네', '네', '네', '네', '하루', '네', '네', '네', '네', '거짓말', '자꾸', '네', '네', '꿔', '너', '거짓말', '자꾸', '네', '네']
[(0, 0.12107436527752928), (1, 0.11791395471330296), (2, 0.05895697735665148), (3, 0.05895697735665148), (4, 0.9685949222202342), (5, 0.04035812175917643), (6, 0.04035812175917643), (7, 0.04035812175917643), (8, 0.04035812175917643), (9, 0.02947848867832574), (10, 0.04035812175917643), (11, 0.04035812175917643), (12, 0.03154354402424213), (13, 0.08843546603497723), (14, 0.010879633080850683), (15, 0.04035812175917643), (16, 0.04035812175917643), (17, 0.05895697735665148)]
-------------------
B: 
봄날
['봄날', '더', '있어', '너무', '시간', '우리', '우리', '겨울', '겨울', '시간', '손', '겨울', '봄날', '조금', '더', '수', '조금', '더', '겨울', '봄날', '더', '시간', '우리', '모두', '하루', '조금', '더', '겨울', '조금', '더', '겨울', '봄날', '더']
[(9, 0.179778421536

In [71]:
a = 3
print("A: ")
print(titles[a])
print(corpus_ko[a])
print(vectors[a])

print("-------------------")

b = 3
print("B: ")
print(titles[b])
print(corpus_ko[b])
print(vectors[b])

sim = distance(vectors[a], vectors[b], dictionary_ko)

print("===================")
print(round(sim,2),'% similar')

A: 
봄날
['봄날', '더', '있어', '너무', '시간', '우리', '우리', '겨울', '겨울', '시간', '손', '겨울', '봄날', '조금', '더', '수', '조금', '더', '겨울', '봄날', '더', '시간', '우리', '모두', '하루', '조금', '더', '겨울', '조금', '더', '겨울', '봄날', '더']
[(9, 0.17977842153624138), (12, 0.0320620719751412), (17, 0.05992614051208046), (19, 0.0320620719751412), (21, 0.2244345038259884), (24, 0.05992614051208046), (30, 0.08204309595838546), (33, 0.17977842153624138), (38, 0.7191136861449655), (39, 0.47940912409664366), (40, 0.11985228102416091), (41, 0.32817238383354186)]
-------------------
B: 
봄날
['봄날', '더', '있어', '너무', '시간', '우리', '우리', '겨울', '겨울', '시간', '손', '겨울', '봄날', '조금', '더', '수', '조금', '더', '겨울', '봄날', '더', '시간', '우리', '모두', '하루', '조금', '더', '겨울', '조금', '더', '겨울', '봄날', '더']
[(9, 0.17977842153624138), (12, 0.0320620719751412), (17, 0.05992614051208046), (19, 0.0320620719751412), (21, 0.2244345038259884), (24, 0.05992614051208046), (30, 0.08204309595838546), (33, 0.17977842153624138), (38, 0.7191136861449655), (39, 0.47940912409664366), 

In [72]:
a = 4
print("A: ")
print(titles[a])
print(corpus_ko[a])
print(vectors[a])

print("-------------------")

b = 8
print("B: ")
print(titles[b])
print(corpus_ko[b])
print(vectors[b])

sim = distance(vectors[a], vectors[b], dictionary_ko)

print("===================")
print(round(sim,2),'% similar')

A: 
DNA
['DNA', '내', 'DNA', '우리', '우주', '운명', '내', '내', '운명', '우주', '함께', '운명', 'DNA', '더', 'DNA', '우리', '자꾸', '이상', '사랑', '내', '운명', '우주', '함께', '운명', 'DNA', '운명', '우리', '함께', '운명', 'DNA']
[(13, 0.05726670028364734), (18, 0.08454174074334084), (21, 0.030639201032861517), (23, 0.07840213546948255), (33, 0.17180010085094202), (42, 0.6872004034037681), (43, 0.34360020170188404), (44, 0.5488149482863779), (45, 0.07840213546948255), (46, 0.23520640640844767)]
-------------------
B: 
작은 것들을 위한 시
['행복', '내', '머리', '네', '하늘', '있어', '날개', '너무', '내', '네', '전부', '함께', '조금', '내', '너', '사소', '사소', '특별', '사소', '너무', '운명', '처음', '내', '하늘', '있어', '날개', '너무', '내', '네', '전부', '함께', '조금', '상처', '상처', '때', '날개', '네', '전부', '함께', '조금']
[(3, 0.0788862248590622), (4, 0.4320031332304548), (12, 0.08441243830039596), (14, 0.08734367534565451), (16, 0.1080007833076137), (18, 0.14557279224275751), (19, 0.12661865745059395), (26, 0.2160015666152274), (28, 0.1080007833076137), (29, 0.1080007833076137), (36, 0.1080

-------------------

# Distributed Representation

* Document Embedding
    - Doc2Vec
* Word Embedding
    - Word2Vec
    - Glove
    - FastText
* Sentence Embeding

<img src="https://nbviewer.jupyter.org/github/psygrammer/psyml/blob/master/nlp_ml/ch04/figures/cap05.png" width=600 />

* 출처 - Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning - https://www.amazon.com/Applied-Text-Analysis-Python-Language-Aware/dp/1491963042/

-----------------------------------

## Document Embedding
* Doc2Vec

### Doc2Vec

In [73]:
from gensim.models.doc2vec import TaggedDocument, Doc2Vec

In [76]:
# 코퍼스 확인
print(corpus_ko[:2])

[['네', '네', '네', '네', '꿈', '아무', '전부', '있어', '꿈', '멀리', '네', '네', '하루', '말', '있어', '준비', '벌써', '너', '네', '네', '네', '네', '거짓말', '자꾸', '네', '네', '매일', '어른들', '시간', '특별', '네', '네', '네', '네', '하루', '네', '네', '네', '네', '거짓말', '자꾸', '네', '네', '꿔', '너', '거짓말', '자꾸', '네', '네'], ['너', '너', '뭣', '전부', '사랑', '사랑', '자꾸', '너무', '자꾸', '내', '내', '내', '아무', '말', '하늘', '하늘', '하늘', '내', '눈물', '더', '잘', '사랑', '자꾸', '너무', '사랑', '사랑', '수', '사랑', '자꾸', '너무']]


In [81]:
docs   = [ 
    TaggedDocument(words, [titles[idx]])
        for idx, words in enumerate(corpus_ko)
]

In [80]:
print(docs[:2])

[TaggedDocument(words=['네', '네', '네', '네', '꿈', '아무', '전부', '있어', '꿈', '멀리', '네', '네', '하루', '말', '있어', '준비', '벌써', '너', '네', '네', '네', '네', '거짓말', '자꾸', '네', '네', '매일', '어른들', '시간', '특별', '네', '네', '네', '네', '하루', '네', '네', '네', '네', '거짓말', '자꾸', '네', '네', '꿔', '너', '거짓말', '자꾸', '네', '네'], tags=['No More Dream']), TaggedDocument(words=['너', '너', '뭣', '전부', '사랑', '사랑', '자꾸', '너무', '자꾸', '내', '내', '내', '아무', '말', '하늘', '하늘', '하늘', '내', '눈물', '더', '잘', '사랑', '자꾸', '너무', '사랑', '사랑', '수', '사랑', '자꾸', '너무'], tags=['I NEED U'])]


In [82]:
d2v = Doc2Vec(docs, min_count=0)

In [83]:
vectors = d2v.docvecs

In [84]:
for i in range(3) :
    print(vectors[i])

[-7.8998436e-04  7.2224741e-04  7.5279403e-04  3.3110820e-03
 -1.6554599e-04 -5.9665216e-04 -2.2180914e-03 -5.0128563e-03
  1.1621257e-03  2.0022118e-03 -4.8898552e-03  7.3794735e-04
  4.1524386e-03 -4.3140757e-03 -4.0426860e-03  3.7782415e-04
  2.1112207e-03  2.6266535e-03  2.0632350e-03  4.5367228e-03
 -1.2671099e-03  4.9808430e-03 -3.1151571e-03 -3.5587098e-03
  1.4732762e-03 -4.1772700e-03 -1.4950533e-03 -2.4870571e-04
 -3.3947878e-04  2.2091547e-03  1.7591557e-03  3.6670095e-03
  1.7021375e-03 -4.1979896e-03  2.9724082e-03 -4.3629901e-03
 -4.1979076e-03 -1.4266249e-03 -1.8863283e-03 -3.3406063e-03
  4.7751768e-03 -4.0315883e-03 -1.0805968e-03  4.5295265e-03
 -4.2952015e-03  4.7686393e-03  1.2336487e-03 -2.6753121e-03
  1.5505255e-03 -8.0554985e-04  2.1500492e-03 -4.8299232e-03
 -1.9262416e-03  1.4880367e-03 -3.5560040e-03  7.2937971e-04
 -4.7101676e-03 -1.6871268e-03 -4.3555470e-03  1.9187623e-03
  3.0014194e-03  3.5826860e-03  1.5113975e-03  1.1298138e-04
  1.9675798e-03  5.04869

In [85]:
# doc2vec기반 벡터 유사도 
def distance(a_doctag, b_doctag, vectors) :
    sim = vectors.similarity(a_doctag, b_doctag)
    return sim*100

In [87]:
a = 0
a_doctag = vectors.index2entity[a]
print("A: ")
print(a_doctag)
print(corpus_ko[a])
print(vectors[a])

print("---------------------")

b = 3
b_doctag = vectors.index2entity[b]
print("B: ")
print(b_doctag)
print(corpus_ko[b])
print(vectors[b])

print("=====================")

sim = distance(a_doctag, b_doctag, vectors)

print(round(sim,2),'% similar')

A: 
No More Dream
['네', '네', '네', '네', '꿈', '아무', '전부', '있어', '꿈', '멀리', '네', '네', '하루', '말', '있어', '준비', '벌써', '너', '네', '네', '네', '네', '거짓말', '자꾸', '네', '네', '매일', '어른들', '시간', '특별', '네', '네', '네', '네', '하루', '네', '네', '네', '네', '거짓말', '자꾸', '네', '네', '꿔', '너', '거짓말', '자꾸', '네', '네']
[-7.8998436e-04  7.2224741e-04  7.5279403e-04  3.3110820e-03
 -1.6554599e-04 -5.9665216e-04 -2.2180914e-03 -5.0128563e-03
  1.1621257e-03  2.0022118e-03 -4.8898552e-03  7.3794735e-04
  4.1524386e-03 -4.3140757e-03 -4.0426860e-03  3.7782415e-04
  2.1112207e-03  2.6266535e-03  2.0632350e-03  4.5367228e-03
 -1.2671099e-03  4.9808430e-03 -3.1151571e-03 -3.5587098e-03
  1.4732762e-03 -4.1772700e-03 -1.4950533e-03 -2.4870571e-04
 -3.3947878e-04  2.2091547e-03  1.7591557e-03  3.6670095e-03
  1.7021375e-03 -4.1979896e-03  2.9724082e-03 -4.3629901e-03
 -4.1979076e-03 -1.4266249e-03 -1.8863283e-03 -3.3406063e-03
  4.7751768e-03 -4.0315883e-03 -1.0805968e-03  4.5295265e-03
 -4.2952015e-03  4.7686393e-03  1.2336487e

In [88]:
a = 3
a_doctag = vectors.index2entity[a]
print("A: ")
print(a_doctag)
print(corpus_ko[a])
print(vectors[a])

print("---------------------")

b = 3
print("B: ")
b_doctag = vectors.index2entity[b]
print(b_doctag)
print(corpus_ko[b])
print(vectors[b])

print("=====================")

sim = distance(a_doctag, b_doctag, vectors)

print(round(sim,2),'% similar')

A: 
봄날
['봄날', '더', '있어', '너무', '시간', '우리', '우리', '겨울', '겨울', '시간', '손', '겨울', '봄날', '조금', '더', '수', '조금', '더', '겨울', '봄날', '더', '시간', '우리', '모두', '하루', '조금', '더', '겨울', '조금', '더', '겨울', '봄날', '더']
[-3.7882584e-03 -1.7288388e-03  2.0746761e-03 -2.5114072e-03
 -3.4897325e-03  1.8923049e-03  1.1728042e-03  3.6649787e-04
 -1.5953288e-03  5.3563365e-04 -2.2353516e-03  2.4905426e-03
  2.5519875e-03 -3.2564029e-03 -4.8096594e-03 -2.9567881e-03
  3.7387132e-03  1.1639760e-03 -4.1232174e-03  4.5650471e-03
 -2.7486365e-03 -1.7876422e-03  1.6164050e-03  2.8504920e-03
  4.7876230e-03 -4.5712106e-03 -2.5195009e-03  2.4036004e-04
  2.8574816e-03 -9.9775160e-04  1.5266036e-03  9.4945979e-05
 -3.2246055e-03 -4.4487547e-03  5.8567477e-04 -3.6685131e-03
  3.0866382e-03 -3.7679628e-03  6.8011490e-04 -6.7939912e-04
 -1.2760557e-03 -8.3443732e-04  3.8418849e-03 -1.5382431e-03
 -4.6656071e-03 -2.3612375e-03 -9.1492292e-04  5.9326412e-04
 -4.7833077e-03 -2.6112140e-04 -4.5117880e-03  3.8066083e-03
 -4.427230

In [89]:
a = 4
a_doctag = vectors.index2entity[a]
print("A: ")
print(a_doctag)
print(corpus_ko[a])
print(vectors[a])

print("---------------------")

b = 8
print("B: ")
print(b_doctag)
print(corpus_ko[b])
print(vectors[b])

print("=====================")

sim = distance(a_doctag, b_doctag, vectors)

print(round(sim,2),'% similar')

A: 
DNA
['DNA', '내', 'DNA', '우리', '우주', '운명', '내', '내', '운명', '우주', '함께', '운명', 'DNA', '더', 'DNA', '우리', '자꾸', '이상', '사랑', '내', '운명', '우주', '함께', '운명', 'DNA', '운명', '우리', '함께', '운명', 'DNA']
[ 2.1743893e-03  1.9190835e-03 -1.6003055e-04 -1.6911449e-03
  1.5877279e-03  1.2325135e-03 -1.7678938e-05 -3.3111908e-03
  1.2617743e-04  3.6203719e-03 -4.0152287e-03 -2.3789858e-03
  4.2678546e-03 -1.4107731e-03  4.6450345e-04 -4.9895192e-05
  4.5410113e-04 -3.5198582e-03 -3.0533033e-03  4.9649063e-03
 -4.1650683e-03  5.0260895e-03  3.6281052e-03  3.5725636e-05
 -1.0537121e-03 -1.9334556e-04  5.2469433e-04  4.6473122e-03
 -4.3411455e-05 -3.3767954e-03 -4.1622366e-03 -2.9033867e-03
  1.1157270e-03 -2.2784849e-03 -2.0726982e-03  4.4518290e-03
  4.3244445e-04 -1.1764695e-03 -4.4763046e-03 -2.3522745e-04
  8.7763081e-05 -4.3056728e-03 -2.1030386e-03 -3.4990835e-03
 -1.3246523e-03 -8.3456485e-04 -6.9528533e-04 -3.6976128e-03
  4.4456453e-04  1.1081741e-03  3.9693872e-03 -4.7257491e-03
  1.6538034e-03  

#### plotly로 시각화 함수 만들기

In [90]:

import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

from sklearn.decomposition import IncrementalPCA    # inital reduction
from sklearn.manifold import TSNE                   # final reduction
import numpy as np                                  # array handling

from plotly.offline import init_notebook_mode, iplot, plot
import plotly.graph_objs as go

def visualize_embeddings(vectors, labels, plot_in_notebook = True):

    num_dimensions = 2  # final num dimensions (2D, 3D, etc)

    # convert both lists into numpy vectors for reduction
    vectors = np.asarray(vectors)
    labels = np.asarray(labels)
    
    # reduce using t-SNE
    logging.info('starting tSNE dimensionality reduction. This may take some time.')
    tsne = TSNE(n_components=num_dimensions, random_state=0)
    vectors = tsne.fit_transform(vectors)

    x_vals = [v[0] for v in vectors]
    y_vals = [v[1] for v in vectors]
        
    # Create a trace
    trace = go.Scatter(
        x=x_vals,
        y=y_vals,
        mode='text',
        text=labels
        )
    
    data = [trace]
    
    logging.info('All done. Plotting.')
    
    if plot_in_notebook:
        init_notebook_mode(connected=True)
        iplot(data, filename='word-embedding-plot')
    else:
        plot(data, filename='word-embedding-plot.html')

In [91]:
labels = titles
vectors = [d2v.docvecs[title] for title in labels]

print(labels[0])
print(vectors[0])

No More Dream
[-7.8998436e-04  7.2224741e-04  7.5279403e-04  3.3110820e-03
 -1.6554599e-04 -5.9665216e-04 -2.2180914e-03 -5.0128563e-03
  1.1621257e-03  2.0022118e-03 -4.8898552e-03  7.3794735e-04
  4.1524386e-03 -4.3140757e-03 -4.0426860e-03  3.7782415e-04
  2.1112207e-03  2.6266535e-03  2.0632350e-03  4.5367228e-03
 -1.2671099e-03  4.9808430e-03 -3.1151571e-03 -3.5587098e-03
  1.4732762e-03 -4.1772700e-03 -1.4950533e-03 -2.4870571e-04
 -3.3947878e-04  2.2091547e-03  1.7591557e-03  3.6670095e-03
  1.7021375e-03 -4.1979896e-03  2.9724082e-03 -4.3629901e-03
 -4.1979076e-03 -1.4266249e-03 -1.8863283e-03 -3.3406063e-03
  4.7751768e-03 -4.0315883e-03 -1.0805968e-03  4.5295265e-03
 -4.2952015e-03  4.7686393e-03  1.2336487e-03 -2.6753121e-03
  1.5505255e-03 -8.0554985e-04  2.1500492e-03 -4.8299232e-03
 -1.9262416e-03  1.4880367e-03 -3.5560040e-03  7.2937971e-04
 -4.7101676e-03 -1.6871268e-03 -4.3555470e-03  1.9187623e-03
  3.0014194e-03  3.5826860e-03  1.5113975e-03  1.1298138e-04
  1.967579

In [92]:
visualize_embeddings(vectors, labels)

2019-04-23 21:09:09,324 : INFO : starting tSNE dimensionality reduction. This may take some time.
2019-04-23 21:09:09,629 : INFO : All done. Plotting.


---------------------

## Word Embedding
* Word2Vec
* Glove
* FastText

### Word2Vec

* 출처 - models.word2vec – Deep learning with word2vec - https://radimrehurek.com/gensim/models/word2vec.html

#### 영어 예시

In [93]:
# Data
corpus_en = [["my", "name", "is", "jamie"], ["jamie", "is", "cute"]]

pprint(corpus_en)

[['my', 'name', 'is', 'jamie'], ['jamie', 'is', 'cute']]


In [103]:
# 모델 초기화
w2v_en = gensim.models.Word2Vec(min_count=0) # 예) Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)

In [104]:
# 모델 사전 만들기
w2v_en.build_vocab(corpus_en)

2019-04-23 21:10:32,693 : INFO : collecting all words and their counts
2019-04-23 21:10:32,694 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2019-04-23 21:10:32,695 : INFO : collected 5 word types from a corpus of 7 raw words and 2 sentences
2019-04-23 21:10:32,696 : INFO : Loading a fresh vocabulary
2019-04-23 21:10:32,697 : INFO : min_count=0 retains 5 unique words (100% of original 5, drops 0)
2019-04-23 21:10:32,698 : INFO : min_count=0 leaves 7 word corpus (100% of original 7, drops 0)
2019-04-23 21:10:32,698 : INFO : deleting the raw counts dictionary of 5 items
2019-04-23 21:10:32,699 : INFO : sample=0.001 downsamples 5 most-common words
2019-04-23 21:10:32,700 : INFO : downsampling leaves estimated 0 word corpus (7.5% of prior 7)
2019-04-23 21:10:32,701 : INFO : estimated required memory for 5 words and 100 dimensions: 6500 bytes
2019-04-23 21:10:32,702 : INFO : resetting layer weights


In [105]:
# 학습
w2v_en.train(corpus_en, total_examples=len(corpus_en), epochs=10)

2019-04-23 21:10:33,685 : INFO : training model with 3 workers on 5 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
2019-04-23 21:10:33,688 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-04-23 21:10:33,689 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-04-23 21:10:33,690 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-04-23 21:10:33,691 : INFO : EPOCH - 1 : training on 7 raw words (0 effective words) took 0.0s, 0 effective words/s
2019-04-23 21:10:33,693 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-04-23 21:10:33,694 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-04-23 21:10:33,695 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-04-23 21:10:33,696 : INFO : EPOCH - 2 : training on 7 raw words (0 effective words) took 0.0s, 0 effective words/s
2019-04-23 21:10:33,698 : INFO : worker thread finished; awaiting fi

(2, 70)

In [106]:
# 생성된 모델 저장 및 불러오기 - 이것은 나중에 이 모델을 다시 활용하려할 때 써보기. 
fname = 'model_w2v_en.wv'
w2v_en.save(fname)

2019-04-23 21:10:34,413 : INFO : saving Word2Vec object under model_w2v_en.wv, separately None
2019-04-23 21:10:34,414 : INFO : not storing attribute vectors_norm
2019-04-23 21:10:34,415 : INFO : not storing attribute cum_table
2019-04-23 21:10:34,417 : INFO : saved model_w2v_en.wv


In [107]:
%ls model_w2v_en.wv

model_w2v_en.wv


In [108]:
# 저장된 모델 로드해서 사용하기
my_w2v_en = gensim.models.Word2Vec.load(fname)

2019-04-23 21:10:36,085 : INFO : loading Word2Vec object from model_w2v_en.wv
2019-04-23 21:10:36,087 : INFO : loading wv recursively from model_w2v_en.wv.wv.* with mmap=None
2019-04-23 21:10:36,088 : INFO : setting ignored attribute vectors_norm to None
2019-04-23 21:10:36,089 : INFO : loading vocabulary recursively from model_w2v_en.wv.vocabulary.* with mmap=None
2019-04-23 21:10:36,090 : INFO : loading trainables recursively from model_w2v_en.wv.trainables.* with mmap=None
2019-04-23 21:10:36,091 : INFO : setting ignored attribute cum_table to None
2019-04-23 21:10:36,092 : INFO : loaded model_w2v_en.wv


In [109]:
# 단어의 벡터값 얻기
vectors_en = my_w2v_en.wv

In [110]:
vectors_en['name']

array([-3.5927573e-03,  4.4859448e-03, -2.2817224e-03, -3.9224941e-03,
        3.6792208e-03, -1.4695083e-03,  1.9569823e-03,  1.1353698e-03,
        2.6722791e-04, -2.1971969e-03,  3.7626526e-03,  3.7397007e-03,
       -3.0215716e-03,  1.5770639e-03, -4.6257027e-03, -1.6846821e-03,
        3.1876427e-03, -1.5796910e-03, -3.0271267e-03,  3.6565969e-03,
        4.8914663e-03,  2.0221453e-03,  3.9439914e-03, -3.9060025e-03,
       -4.7563524e-03, -4.7203777e-03, -6.6261768e-05, -2.8719442e-04,
       -2.1669192e-03,  3.8900014e-03, -1.1963459e-03, -4.8108720e-03,
        3.4040292e-03, -3.9582141e-03, -2.3571749e-03,  3.9369878e-03,
       -1.6443490e-03,  7.4489706e-04, -1.5425759e-03,  3.1522007e-03,
       -4.5602308e-03,  2.5334940e-03, -1.0126929e-03, -2.4590590e-03,
        2.9935597e-03,  4.0025646e-03, -1.7761880e-03,  1.7690011e-03,
        3.3243871e-04, -8.7835989e-04,  2.8865228e-03, -2.0090507e-03,
        1.0992971e-03, -3.8388525e-03,  3.7692061e-03, -4.5965267e-03,
      

In [111]:
# 시각화
labels = [word for word in vectors_en.vocab]
vectors = [vectors_en[word] for word in labels]

visualize_embeddings(vectors, labels)

2019-04-23 21:10:39,313 : INFO : starting tSNE dimensionality reduction. This may take some time.
2019-04-23 21:10:39,426 : INFO : All done. Plotting.


--------------------------------

#### 한국어 예시

In [468]:
# 문장 단위로 입력 코퍼스를 바꿔야 한다.

def tokenize(x, tokenizer) :
    for token in tokenizer.tokenize(x) :
        if nouns.get(token) :
            yield token
            
sentences_ko = []
for doc in documents_ko :
    sentences = [sent for sent in doc.split('\n') if len(sent)]
    tokens = [list(tokenize(sent, tokenizer)) for sent in sentence]
    sentences_ko.extend([token for token in tokens if len(token)])

In [469]:
print(sentences_ko[:2])

[['네'], ['네']]


In [470]:
# 모델 초기화
w2v_ko = gensim.models.Word2Vec(min_count=0) # 예) Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)
# 모델 사전 만들기
w2v_ko.build_vocab(sentences_ko)
# 학습
w2v_ko.train(corpus_ko, total_examples=len(sentences_ko), epochs=10)
# 모델 저장
fname = 'model_w2v_ko.wv'
w2v_ko.save(fname)

2019-04-24 03:04:01,909 : INFO : collecting all words and their counts
2019-04-24 03:04:01,910 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2019-04-24 03:04:01,911 : INFO : collected 18 word types from a corpus of 441 raw words and 378 sentences
2019-04-24 03:04:01,911 : INFO : Loading a fresh vocabulary
2019-04-24 03:04:01,913 : INFO : min_count=0 retains 18 unique words (100% of original 18, drops 0)
2019-04-24 03:04:01,914 : INFO : min_count=0 leaves 441 word corpus (100% of original 441, drops 0)
2019-04-24 03:04:01,914 : INFO : deleting the raw counts dictionary of 18 items
2019-04-24 03:04:01,915 : INFO : sample=0.001 downsamples 18 most-common words
2019-04-24 03:04:01,916 : INFO : downsampling leaves estimated 57 word corpus (13.1% of prior 441)
2019-04-24 03:04:01,917 : INFO : estimated required memory for 18 words and 100 dimensions: 23400 bytes
2019-04-24 03:04:01,918 : INFO : resetting layer weights
2019-04-24 03:04:01,920 : INFO : training mod

In [471]:
# 단어의 벡터값 얻기
vectors_ko = w2v_ko.wv

In [472]:
vectors_ko['꿈']

array([ 4.3770596e-03, -2.3819695e-03, -4.7503787e-04, -1.4193344e-03,
        4.7977436e-03, -9.8665361e-04, -1.8427230e-03,  7.7493320e-04,
       -1.3317795e-03,  1.6155151e-03,  2.3584550e-03,  6.5936468e-04,
        2.6537247e-03, -5.7585782e-04, -1.9154645e-03,  3.3450872e-03,
        1.3377591e-03, -2.3150735e-04,  3.3459403e-03, -3.8644946e-03,
       -8.4396110e-05, -3.2811644e-03, -4.4795186e-03, -4.7085499e-03,
        3.6030859e-05,  2.2660999e-03, -3.5197530e-03,  5.9939484e-04,
       -1.1683941e-03, -2.3626836e-03,  3.0485927e-03,  3.2824073e-03,
        9.8227989e-05,  3.8114837e-03, -2.5666174e-03, -5.2756368e-04,
       -3.2863314e-03,  1.2689834e-03,  2.3157473e-03, -2.1149539e-03,
        1.1043837e-03, -4.0839459e-03,  4.4506108e-03, -1.4522129e-03,
        3.2256318e-03,  3.8084595e-03,  1.5904799e-03,  3.8842952e-03,
        3.2019124e-03, -2.1111062e-03,  2.9661607e-03, -2.3745240e-03,
        4.9216649e-03, -1.3289509e-03,  3.7913534e-03, -4.6393052e-03,
      

In [473]:
# 시각화
labels = [word for word in vectors_ko.vocab]
vectors = [vectors_ko[word] for word in labels]

visualize_embeddings(vectors, labels)

2019-04-24 03:04:06,038 : INFO : starting tSNE dimensionality reduction. This may take some time.
2019-04-24 03:04:06,159 : INFO : All done. Plotting.


In [474]:
w2v_ko.wv.vocab.keys()

dict_keys(['네', '꿈', '아무', '전부', '있어', '멀리', '하루', '말', '준비', '벌써', '너', '거짓말', '자꾸', '매일', '어른들', '시간', '특별', '꿔'])

In [475]:
w2v_ko.wv.most_similar(['꿈'])

2019-04-24 03:04:14,044 : INFO : precomputing L2-norms of word weight vectors


[('자꾸', 0.16839897632598877),
 ('벌써', 0.13236922025680542),
 ('멀리', 0.06959383189678192),
 ('아무', 0.05615692958235741),
 ('하루', 0.05258084461092949),
 ('말', 0.01975037157535553),
 ('꿔', -0.01000899076461792),
 ('네', -0.018199510872364044),
 ('거짓말', -0.025152131915092468),
 ('시간', -0.04419991001486778)]

In [476]:
w2v_ko.wv.most_similar(['하루'])

[('있어', 0.21480423212051392),
 ('전부', 0.10614712536334991),
 ('꿔', 0.10569868236780167),
 ('꿈', 0.05258084461092949),
 ('어른들', 0.02084885537624359),
 ('자꾸', 0.01874619349837303),
 ('특별', 0.00026061758399009705),
 ('벌써', -0.0004631727933883667),
 ('준비', -0.01703988015651703),
 ('매일', -0.02649909257888794)]

In [477]:
w2v_ko.wv.most_similar(['어른들'])

[('매일', 0.18942180275917053),
 ('네', 0.17278073728084564),
 ('특별', 0.11666610091924667),
 ('멀리', 0.04391506314277649),
 ('하루', 0.020848851650953293),
 ('너', -0.011678421869874),
 ('준비', -0.01792408525943756),
 ('시간', -0.036207232624292374),
 ('벌써', -0.04509039223194122),
 ('자꾸', -0.04902574419975281)]

### Glove

### FastText

#### 영어 예시

In [478]:
documents_en = [["my", "name", "is", "jamie"], ["jamie", "is", "cute"]]

In [479]:
# 워드투벡은 학습시 없었던 단어에 대해서는 계산해주지 못한다.
w2v_en.most_similar("jamia")


Call to deprecated `most_similar` (Method will be removed in 4.0.0, use self.wv.most_similar() instead).



KeyError: "word 'jamia' not in vocabulary"

In [480]:
sentences_en = documents_en

In [481]:
pprint(sentences_en[:2])

[['my', 'name', 'is', 'jamie'], ['jamie', 'is', 'cute']]


In [482]:
from gensim.models import FastText

ft_en = FastText(size=4, window=3, min_count=1)

In [483]:
ft_en.build_vocab(sentences_en)

2019-04-24 03:04:28,023 : INFO : collecting all words and their counts
2019-04-24 03:04:28,024 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2019-04-24 03:04:28,024 : INFO : collected 5 word types from a corpus of 7 raw words and 2 sentences
2019-04-24 03:04:28,025 : INFO : Loading a fresh vocabulary
2019-04-24 03:04:28,027 : INFO : min_count=1 retains 5 unique words (100% of original 5, drops 0)
2019-04-24 03:04:28,027 : INFO : min_count=1 leaves 7 word corpus (100% of original 7, drops 0)
2019-04-24 03:04:28,028 : INFO : deleting the raw counts dictionary of 5 items
2019-04-24 03:04:28,029 : INFO : sample=0.001 downsamples 5 most-common words
2019-04-24 03:04:28,030 : INFO : downsampling leaves estimated 0 word corpus (7.5% of prior 7)
2019-04-24 03:04:28,032 : INFO : estimated required memory for 5 words, 40 buckets and 4 dimensions: 3860 bytes
2019-04-24 03:04:28,033 : INFO : resetting layer weights
2019-04-24 03:04:28,050 : INFO : Total number of ngram

In [484]:
ft_en.train(sentences=sentences_en, total_examples=len(sentences_en), epochs=10)  # train

2019-04-24 03:04:28,897 : INFO : training model with 3 workers on 5 vocabulary and 4 features, using sg=0 hs=0 sample=0.001 negative=5 window=3
2019-04-24 03:04:28,904 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-04-24 03:04:28,905 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-04-24 03:04:28,906 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-04-24 03:04:28,906 : INFO : EPOCH - 1 : training on 7 raw words (0 effective words) took 0.0s, 0 effective words/s
2019-04-24 03:04:28,909 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-04-24 03:04:28,910 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-04-24 03:04:28,911 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-04-24 03:04:28,912 : INFO : EPOCH - 2 : training on 7 raw words (0 effective words) took 0.0s, 0 effective words/s
2019-04-24 03:04:28,914 : INFO : worker thread finished; awaiting fini

In [485]:
# FastText는 학습시 없었던 단어에 대해서도 계산해준다.
ft_en.wv.most_similar("jamia")

2019-04-24 03:04:29,758 : INFO : precomputing L2-norms of word weight vectors
2019-04-24 03:04:29,759 : INFO : precomputing L2-norms of ngram weight vectors


[('my', 0.8944276571273804),
 ('jamie', 0.7871583104133606),
 ('name', 0.3816155791282654),
 ('is', 0.3451642394065857),
 ('cute', 0.30520403385162354)]

In [486]:
labels = [word for word in ft_en.wv.vocab]
vectors = [ft_en.wv[word] for word in labels]

visualize_embeddings(vectors, labels)

2019-04-24 03:04:30,631 : INFO : starting tSNE dimensionality reduction. This may take some time.
2019-04-24 03:04:30,755 : INFO : All done. Plotting.


#### 한국어 예시

In [487]:
print(w2v_ko.wv.vocab)

{'네': <gensim.models.keyedvectors.Vocab object at 0x1a1d4fe2b0>, '꿈': <gensim.models.keyedvectors.Vocab object at 0x1a1d4fe1d0>, '아무': <gensim.models.keyedvectors.Vocab object at 0x1a21d6bac8>, '전부': <gensim.models.keyedvectors.Vocab object at 0x1a21d6b9b0>, '있어': <gensim.models.keyedvectors.Vocab object at 0x1a21d57198>, '멀리': <gensim.models.keyedvectors.Vocab object at 0x1a21d57390>, '하루': <gensim.models.keyedvectors.Vocab object at 0x1a21d57358>, '말': <gensim.models.keyedvectors.Vocab object at 0x1a21d573c8>, '준비': <gensim.models.keyedvectors.Vocab object at 0x1a21d572e8>, '벌써': <gensim.models.keyedvectors.Vocab object at 0x1a21d57278>, '너': <gensim.models.keyedvectors.Vocab object at 0x1a21d57320>, '거짓말': <gensim.models.keyedvectors.Vocab object at 0x1a21d57400>, '자꾸': <gensim.models.keyedvectors.Vocab object at 0x1a21d57438>, '매일': <gensim.models.keyedvectors.Vocab object at 0x1a21d57470>, '어른들': <gensim.models.keyedvectors.Vocab object at 0x1a21d574a8>, '시간': <gensim.models.keyed

In [489]:
w2v_ko.wv.vocab.get("특별한")

In [490]:
# 워드투벡은 학습시 없었던 단어에 대해서는 계산해주지 못한다.
w2v_ko.most_similar("특별한")


Call to deprecated `most_similar` (Method will be removed in 4.0.0, use self.wv.most_similar() instead).



KeyError: "word '특별한' not in vocabulary"

In [546]:
# 문장 단위로 입력 코퍼스를 바꿔야 한다.

def tokenize(x, tokenizer) :
    for token in tokenizer.tokenize(x) :
        if nouns.get(token) :
            yield token
            
sentences_ko = []
for doc in documents_ko :
    sentences = [sent for sent in doc.split('\n') if len(sent)]
    tokens = [list(tokenize(sent, tokenizer)) for sent in sentence]
    sentences_ko.extend([token for token in tokens if len(token)])

In [547]:
print(sentences_ko[:2])

[['네'], ['네']]


In [548]:
from gensim.models import FastText

ft_ko = FastText(size=4, window=3, min_count=1)
ft_ko.build_vocab(sentences_ko)
ft_ko.train(sentences=sentences_ko, total_examples=len(sentences_ko), epochs=10)  # train

2019-04-24 03:25:42,599 : INFO : collecting all words and their counts
2019-04-24 03:25:42,601 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2019-04-24 03:25:42,602 : INFO : collected 18 word types from a corpus of 441 raw words and 378 sentences
2019-04-24 03:25:42,602 : INFO : Loading a fresh vocabulary
2019-04-24 03:25:42,603 : INFO : min_count=1 retains 18 unique words (100% of original 18, drops 0)
2019-04-24 03:25:42,604 : INFO : min_count=1 leaves 441 word corpus (100% of original 441, drops 0)
2019-04-24 03:25:42,604 : INFO : deleting the raw counts dictionary of 18 items
2019-04-24 03:25:42,605 : INFO : sample=0.001 downsamples 18 most-common words
2019-04-24 03:25:42,606 : INFO : downsampling leaves estimated 57 word corpus (13.1% of prior 441)
2019-04-24 03:25:42,608 : INFO : estimated required memory for 18 words, 50 buckets and 4 dimensions: 11640 bytes
2019-04-24 03:25:42,609 : INFO : resetting layer weights
2019-04-24 03:25:42,620 : INFO : To

In [549]:
# FastText는 학습시 없었던 단어에 대해서도 계산해준다.
ft_ko.wv.most_similar("특별한")

2019-04-24 03:25:43,572 : INFO : precomputing L2-norms of word weight vectors
2019-04-24 03:25:43,573 : INFO : precomputing L2-norms of ngram weight vectors


[('있어', 0.9348163604736328),
 ('멀리', 0.9237874746322632),
 ('너', 0.9161602854728699),
 ('시간', 0.8657609224319458),
 ('아무', 0.848251461982727),
 ('꿈', 0.8391091227531433),
 ('말', 0.6695265769958496),
 ('어른들', 0.5211405754089355),
 ('거짓말', 0.4977935552597046),
 ('네', 0.40327689051628113)]

In [550]:
# 한국어는 한 단어가 자음모음 복합이라서, 원하던 효과가 나오지 않는다.  
pprint(ft_ko.wv.vocab)

{'거짓말': <gensim.models.keyedvectors.Vocab object at 0x1a2208be10>,
 '꿈': <gensim.models.keyedvectors.Vocab object at 0x1a25a6aeb8>,
 '꿔': <gensim.models.keyedvectors.Vocab object at 0x1a2208be80>,
 '너': <gensim.models.keyedvectors.Vocab object at 0x1a2208bac8>,
 '네': <gensim.models.keyedvectors.Vocab object at 0x1a25786518>,
 '말': <gensim.models.keyedvectors.Vocab object at 0x1a2208bef0>,
 '매일': <gensim.models.keyedvectors.Vocab object at 0x1a2208bb38>,
 '멀리': <gensim.models.keyedvectors.Vocab object at 0x1a25a6ab70>,
 '벌써': <gensim.models.keyedvectors.Vocab object at 0x1a2208bfd0>,
 '시간': <gensim.models.keyedvectors.Vocab object at 0x1a2208beb8>,
 '아무': <gensim.models.keyedvectors.Vocab object at 0x1a25a6ae10>,
 '어른들': <gensim.models.keyedvectors.Vocab object at 0x1a2208b320>,
 '있어': <gensim.models.keyedvectors.Vocab object at 0x1a25a6af60>,
 '자꾸': <gensim.models.keyedvectors.Vocab object at 0x1a2208bc50>,
 '전부': <gensim.models.keyedvectors.Vocab object at 0x1a25a6af28>,
 '준비': <gensi

In [551]:
labels = [word for word in ft_ko.wv.vocab]
vectors = [ft_ko.wv[word] for word in labels]

visualize_embeddings(vectors, labels)

2019-04-24 03:25:45,313 : INFO : starting tSNE dimensionality reduction. This may take some time.
2019-04-24 03:25:45,458 : INFO : All done. Plotting.


In [552]:
# 초성/중성/종성으로 분해하는 코드
from soynlp.hangle import decompose, compose

def encode(s):
    def process(c):
        if c == ' ':
            return c
        jamo = decompose(c)
        # 'a' or 모음 or 자음
        if (jamo is None) or (jamo[0] == ' ') or (jamo[1] == ' '):
            return ' '
        base = jamo[0]+jamo[1]
        if jamo[2] == ' ':
            return base + '-'
        return base + jamo[2]

    s = ''.join(process(c) for c in s)
    return s.strip() 

def decode(s):
    def process(t):
        assert len(t) % 3 == 0
        t_ = t.replace('-', ' ')
        chars = [tuple(t_[3*i:3*(i+1)]) for i in range(len(t_)//3)]
        recovered = [compose(*char) for char in chars]
        recovered = ''.join(recovered)
        return recovered

    return ' '.join(process(t) for t in s.split())

In [553]:
s = '가나다랄  a2ㅗㅛㅠ ㅋㅋㅋ 하핫'
print(encode(s))
print(decode(encode(s)))

ㄱㅏ-ㄴㅏ-ㄷㅏ-ㄹㅏㄹ            ㅎㅏ-ㅎㅏㅅ
가나다랄 하핫


In [555]:
# 문장 단위로 입력 코퍼스를 바꿔야 한다.
# 초/중/종성을 분리해야 한다.

def tokenize(x, tokenizer) :
    for token in tokenizer.tokenize(x) :
        if nouns.get(token) :
            token_decom = encode(token)
            if len(token_decom) :
                yield token_decom
            
sentences_ko = []
for doc in documents_ko :
    sentences = [sent for sent in doc.split('\n') if len(sent)]
    tokens = [list(tokenize(sent, tokenizer)) for sent in sentence]
    tokens_decom = [token for token in tokens if len(token)]
    sentences_ko.extend(tokens_decom)

In [556]:
print(sentences_ko[:2])

[['ㄴㅔ-'], ['ㄴㅔ-']]


In [557]:
from gensim.models import FastText

ft_ko = FastText(size=4, window=3, min_count=1)
ft_ko.build_vocab(sentences_ko)
ft_ko.train(sentences=sentences_ko, total_examples=len(sentences_ko), epochs=10)  # train

2019-04-24 03:26:02,482 : INFO : collecting all words and their counts
2019-04-24 03:26:02,483 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2019-04-24 03:26:02,484 : INFO : collected 18 word types from a corpus of 441 raw words and 378 sentences
2019-04-24 03:26:02,484 : INFO : Loading a fresh vocabulary
2019-04-24 03:26:02,485 : INFO : min_count=1 retains 18 unique words (100% of original 18, drops 0)
2019-04-24 03:26:02,486 : INFO : min_count=1 leaves 441 word corpus (100% of original 441, drops 0)
2019-04-24 03:26:02,486 : INFO : deleting the raw counts dictionary of 18 items
2019-04-24 03:26:02,487 : INFO : sample=0.001 downsamples 18 most-common words
2019-04-24 03:26:02,488 : INFO : downsampling leaves estimated 57 word corpus (13.1% of prior 441)
2019-04-24 03:26:02,491 : INFO : estimated required memory for 18 words, 278 buckets and 4 dimensions: 17192 bytes
2019-04-24 03:26:02,494 : INFO : resetting layer weights
2019-04-24 03:26:02,505 : INFO : T

In [558]:
# 초/중/종성을 분리된 단어들이 벡터화 됨.
pprint(ft_ko.wv.vocab)

{'ㄱㅓ-ㅈㅣㅅㅁㅏㄹ': <gensim.models.keyedvectors.Vocab object at 0x1a2208be10>,
 'ㄲㅜㅁ': <gensim.models.keyedvectors.Vocab object at 0x1a25a5f3c8>,
 'ㄲㅝ-': <gensim.models.keyedvectors.Vocab object at 0x1a2288bb38>,
 'ㄴㅓ-': <gensim.models.keyedvectors.Vocab object at 0x1a2208bac8>,
 'ㄴㅔ-': <gensim.models.keyedvectors.Vocab object at 0x1a25a5f400>,
 'ㅁㅏㄹ': <gensim.models.keyedvectors.Vocab object at 0x1a2208bb38>,
 'ㅁㅐ-ㅇㅣㄹ': <gensim.models.keyedvectors.Vocab object at 0x1a2288b9e8>,
 'ㅁㅓㄹㄹㅣ-': <gensim.models.keyedvectors.Vocab object at 0x1a2208bbe0>,
 'ㅂㅓㄹㅆㅓ-': <gensim.models.keyedvectors.Vocab object at 0x1a2208bcf8>,
 'ㅅㅣ-ㄱㅏㄴ': <gensim.models.keyedvectors.Vocab object at 0x1a2288b668>,
 'ㅇㅏ-ㅁㅜ-': <gensim.models.keyedvectors.Vocab object at 0x1a25786518>,
 'ㅇㅓ-ㄹㅡㄴㄷㅡㄹ': <gensim.models.keyedvectors.Vocab object at 0x1a2288bd68>,
 'ㅇㅣㅆㅇㅓ-': <gensim.models.keyedvectors.Vocab object at 0x1a2208be48>,
 'ㅈㅏ-ㄲㅜ-': <gensim.models.keyedvectors.Vocab object at 0x1a2288b748>,
 'ㅈㅓㄴㅂㅜ-': <gensim.models.key

In [559]:
# FastText는 학습시 없었던 단어에 대해서도 계산해준다.
rst = ft_ko.wv.most_similar(encode("특별히"))
for word, score in rst : 
    print((decode(word),score))

2019-04-24 03:26:09,046 : INFO : precomputing L2-norms of word weight vectors
2019-04-24 03:26:09,047 : INFO : precomputing L2-norms of ngram weight vectors


('특별', 0.9261425733566284)
('멀리', 0.7501364350318909)
('어른들', 0.4801279902458191)
('거짓말', 0.3854236304759979)
('매일', 0.1696784645318985)
('준비', 0.14507009088993073)
('꿔', 0.027564167976379395)
('전부', 0.009496230632066727)
('시간', -0.0006869323551654816)
('아무', -0.07786288857460022)


In [560]:
# FastText는 학습시 없었던 단어에 대해서도 계산해준다.
rst = ft_ko.wv.most_similar(encode("꿈을"))
for word, score in rst : 
    print((decode(word),score))

('너', 0.7622981667518616)
('꿈', 0.7212961912155151)
('하루', 0.6398354172706604)
('네', 0.5679073929786682)
('말', 0.4501268267631531)
('벌써', 0.41831713914871216)
('있어', 0.3568355441093445)
('자꾸', 0.15550968050956726)
('전부', 0.02571043372154236)
('아무', -0.02309088408946991)


In [561]:
labels = [word for word in ft_ko.wv.vocab]
vectors = [ft_ko.wv[word] for word in labels]

d_labels = [decode(word) for word in labels]
visualize_embeddings(vectors, d_labels)

2019-04-24 03:26:27,610 : INFO : starting tSNE dimensionality reduction. This may take some time.
2019-04-24 03:26:27,752 : INFO : All done. Plotting.


In [562]:
# Word2Vec도 다시 그려서 위의 그림과 비교해보자
labels = [word for word in w2v_ko.wv.vocab]
vectors = [w2v_ko.wv[word] for word in labels]

visualize_embeddings(vectors, labels)

2019-04-24 03:26:39,815 : INFO : starting tSNE dimensionality reduction. This may take some time.
2019-04-24 03:26:39,958 : INFO : All done. Plotting.


----------------------------

## Sentence Embeding

--------------------------

# 참고자료 
* Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning - https://www.amazon.com/Applied-Text-Analysis-Python-Language-Aware/dp/1491963042/
* Natural Language Processing - https://www.coursera.org/learn/language-processing / Main approaches in NLP
* http://git.ajou.ac.kr/open-source-2018-spring/python_Korean_NLP/blob/master/README.md
* https://github.com/lovit/soynlp
* https://github.com/lovit/textmining-tutorial
* https://github.com/lovit/textmining-tutorial/blob/master/topics/topic4_embedding/word_document_embedding.pdf
* https://github.com/lovit/python_ml4nlp
* https://github.com/lovit/python_ml4nlp/blob/master/day5_embedding_and_visualizing/day_5_4_fasttext_gensim.ipynb