# 01.Text Vectorization (Draft)

# Contents
* Frequency Vectors
* One-Hot Encoding
* Term Frequency-Inverse Document Frequency
* Distributed Representation

-----------------------

<img src="https://nbviewer.jupyter.org/github/psygrammer/psyml/blob/master/nlp_ml/ch01/figures/cap1.3.png" width=600 />
<img src="https://nbviewer.jupyter.org/github/psygrammer/psyml/blob/master/nlp_ml/ch01/figures/cap1.4.png" width=600 />

* Source - Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning - https://www.amazon.com/Applied-Text-Analysis-Python-Language-Aware/dp/1491963042/

<img src="figures/main_nlp_01.png" width=600 />
<img src="figures/main_nlp_02.png" width=600 />
<img src="figures/main_nlp_03.png" width=600 />
<img src="figures/main_nlp_04.png" width=600 />
<img src="figures/main_nlp_05.png" width=600 />

* Source - Natural Language Processing - https://www.coursera.org/learn/language-processing - Main approaches in NLP

<img src="https://nbviewer.jupyter.org/github/psygrammer/psyml/blob/master/nlp_ml/ch04/figures/cap01.png" width=600 />

* Source - Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning - https://www.amazon.com/Applied-Text-Analysis-Python-Language-Aware/dp/1491963042/

<img src="https://slideplayer.com/slide/5270778/17/images/10/Distributed+Word+Representation.jpg" width=600 />

* Source - https://slideplayer.com/slide/5270778/

---------------------

# Frequency Vectors

<img src="https://nbviewer.jupyter.org/github/psygrammer/psyml/blob/master/nlp_ml/ch04/figures/cap02.png" width=600 />

* Source - Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning - https://www.amazon.com/Applied-Text-Analysis-Python-Language-Aware/dp/1491963042/

In [17]:
import gensim
import numpy as np

from pprint import pprint

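To use gensim, the document list `documents` has to be converted into a corpus: each document is tokenized, then mapped to bag-of-words vectors through a `Dictionary`.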

In [31]:
# Five toy documents
documents_en = [
    "a b c a",
    "c b c",
    "b b a",
    "a c c",
    "c b a",
]

In [32]:
# Split each document into words (tokens)
def tokenize(x) :
    for token in x.split() :
        yield token
        
corpus_en = [list(tokenize(doc)) for doc in  documents_en]

pprint(corpus_en)

[['a', 'b', 'c', 'a'],
 ['c', 'b', 'c'],
 ['b', 'b', 'a'],
 ['a', 'c', 'c'],
 ['c', 'b', 'a']]


In [33]:
# Dictionary object: assigns an integer id to each word (see the gensim tutorial for its other features)
dictionary_en = gensim.corpora.Dictionary(corpus_en)

In [34]:
dictionary_en

<gensim.corpora.dictionary.Dictionary at 0x1a20f0a630>

In [35]:
dictionary_en.token2id

{'a': 0, 'b': 1, 'c': 2}

In [36]:
vectors = [dictionary_en.doc2bow(doc) for doc in corpus_en]    

In [37]:
pprint(vectors)

[[(0, 2), (1, 1), (2, 1)],
 [(1, 1), (2, 2)],
 [(0, 1), (1, 2)],
 [(0, 1), (2, 2)],
 [(0, 1), (1, 1), (2, 1)]]
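
The sparse `(token_id, count)` pairs above can be expanded into fixed-length frequency vectors. The following is a minimal sketch in plain Python, not gensim's implementation. The helper names `build_vocab` and `to_dense_counts` are hypothetical, and ids are assigned in sorted token order, which is an assumption that happens to agree with the `token2id` mapping shown above for this toy corpus.

```python
from collections import Counter

# Toy corpus from the cells above, already tokenized.
corpus_en = [
    ["a", "b", "c", "a"],
    ["c", "b", "c"],
    ["b", "b", "a"],
    ["a", "c", "c"],
    ["c", "b", "a"],
]

def build_vocab(corpus):
    """Map each distinct token to an integer id (sorted order is an assumption)."""
    tokens = sorted({token for doc in corpus for token in doc})
    return {token: idx for idx, token in enumerate(tokens)}

def to_dense_counts(doc, vocab):
    """Return a list of length len(vocab) holding the count of each token."""
    counts = Counter(doc)
    vector = [0] * len(vocab)
    for token, count in counts.items():
        vector[vocab[token]] = count
    return vector

vocab = build_vocab(corpus_en)
dense_vectors = [to_dense_counts(doc, vocab) for doc in corpus_en]

print(vocab)
for vector in dense_vectors:
    print(vector)
```

Each row has one slot per vocabulary entry, so a token that is missing from a document shows up as an explicit 0 instead of being omitted as in the sparse form.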


----------------

# One-Hot Encoding

<img src="https://nbviewer.jupyter.org/github/psygrammer/psyml/blob/master/nlp_ml/ch04/figures/cap03.png" width=600 />

* Source - Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning - https://www.amazon.com/Applied-Text-Analysis-Python-Language-Aware/dp/1491963042/

In [38]:
corpus_en

[['a', 'b', 'c', 'a'],
 ['c', 'b', 'c'],
 ['b', 'b', 'a'],
 ['a', 'c', 'c'],
 ['c', 'b', 'a']]

In [12]:
[dictionary_en.doc2bow(doc) for doc in corpus_en]

[[(0, 2), (1, 1), (2, 1)],
 [(1, 1), (2, 2)],
 [(0, 1), (1, 2)],
 [(0, 1), (2, 2)],
 [(0, 1), (1, 1), (2, 1)]]

In [13]:
[[token for token in dictionary_en.doc2bow(doc)] 
                         for doc in corpus_en]

[[(0, 2), (1, 1), (2, 1)],
 [(1, 1), (2, 2)],
 [(0, 1), (1, 2)],
 [(0, 1), (2, 2)],
 [(0, 1), (1, 1), (2, 1)]]

In [14]:
[[(token[0], token[1]) for token in dictionary_en.doc2bow(doc)] 
                         for doc in corpus_en]

[[(0, 2), (1, 1), (2, 1)],
 [(1, 1), (2, 2)],
 [(0, 1), (1, 2)],
 [(0, 1), (2, 2)],
 [(0, 1), (1, 1), (2, 1)]]

In [15]:
# One-hot (binary) encoding: replace every count with 1
vectors = [[(token[0], 1) for token in dictionary_en.doc2bow(doc)] 
                         for doc in corpus_en]

In [18]:
# Rows have different lengths, so NumPy needs an explicit object dtype
vectors  = np.array([
        [(token[0], 1) for token in dictionary_en.doc2bow(doc)]
        for doc in corpus_en
    ], dtype=object)

In [19]:
vectors

array([list([(0, 1), (1, 1), (2, 1)]), list([(1, 1), (2, 1)]),
       list([(0, 1), (1, 1)]), list([(0, 1), (2, 1)]),
       list([(0, 1), (1, 1), (2, 1)])], dtype=object)
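
The array above still stores sparse `(token_id, 1)` pairs. As a hedged illustration of what the encoding means, the sketch below builds dense binary vectors in plain Python, with one slot per vocabulary token set to 1 if the token occurs in the document and 0 otherwise. The helper name `one_hot` and the sorted vocabulary order are assumptions made for this example.

```python
# Toy corpus from the cells above, already tokenized.
corpus_en = [
    ["a", "b", "c", "a"],
    ["c", "b", "c"],
    ["b", "b", "a"],
    ["a", "c", "c"],
    ["c", "b", "a"],
]

# Vocabulary in sorted order (an assumption; any fixed order works).
vocab = sorted({token for doc in corpus_en for token in doc})

def one_hot(doc, vocab):
    """Binary presence vector: 1 if the token occurs in doc, else 0."""
    present = set(doc)
    return [1 if token in present else 0 for token in vocab]

one_hot_vectors = [one_hot(doc, vocab) for doc in corpus_en]

for doc, vector in zip(corpus_en, one_hot_vectors):
    print(doc, "->", vector)
```

Compared with frequency vectors, this encoding discards how often a token occurs, so `"a b c a"` and `"c b a"` map to the same vector.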

-------------------

# Term Frequency-Inverse Document Frequency

<img src="https://nbviewer.jupyter.org/github/psygrammer/psyml/blob/master/nlp_ml/ch04/figures/cap04.png" width=600 />

* Source - Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning - https://www.amazon.com/Applied-Text-Analysis-Python-Language-Aware/dp/1491963042/

In [20]:
documents_en

['a b c a', 'c b c', 'b b a', 'a c c', 'c b a']

In [39]:
corpus_en

[['a', 'b', 'c', 'a'],
 ['c', 'b', 'c'],
 ['b', 'b', 'a'],
 ['a', 'c', 'c'],
 ['c', 'b', 'a']]

In [40]:
# Create the Dictionary
dictionary_en = gensim.corpora.Dictionary(corpus_en)

In [41]:
# Create the TF-IDF model
tfidf_en = gensim.models.TfidfModel(dictionary=dictionary_en, normalize=True)
tfidf_en

<gensim.models.tfidfmodel.TfidfModel at 0x1a20f18da0>

In [42]:
vectors = [tfidf_en[dictionary_en.doc2bow(doc)] for doc in corpus_en]
vectors

[[(0, 0.816496580927726), (1, 0.408248290463863), (2, 0.408248290463863)],
 [(1, 0.447213595499958), (2, 0.894427190999916)],
 [(0, 0.447213595499958), (1, 0.894427190999916)],
 [(0, 0.447213595499958), (2, 0.894427190999916)],
 [(0, 0.5773502691896257), (1, 0.5773502691896257), (2, 0.5773502691896257)]]
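
To make the numbers above less opaque, here is a minimal hand-rolled sketch of the weighting. It assumes gensim's default scheme is raw term count times `log2(N / df)`, followed by L2 normalization. That is an assumption about the library defaults rather than something stated in this notebook, although it reproduces the values above for this toy corpus. The names `doc_freq` and `tfidf_vector` are hypothetical.

```python
import math
from collections import Counter

# Toy corpus from the cells above, already tokenized.
corpus_en = [
    ["a", "b", "c", "a"],
    ["c", "b", "c"],
    ["b", "b", "a"],
    ["a", "c", "c"],
    ["c", "b", "a"],
]

# Token ids in sorted order (assumed to match dictionary_en.token2id).
vocab = {t: i for i, t in enumerate(sorted({t for d in corpus_en for t in d}))}
n_docs = len(corpus_en)

# Document frequency: in how many documents each token appears.
doc_freq = Counter(token for doc in corpus_en for token in set(doc))

def tfidf_vector(doc):
    """Sparse, L2-normalized TF-IDF vector as sorted (token_id, weight) pairs."""
    counts = Counter(doc)
    weights = {
        vocab[token]: count * math.log2(n_docs / doc_freq[token])
        for token, count in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0.0:  # every token occurs in every document
        return []
    return sorted((idx, w / norm) for idx, w in weights.items() if w > 0.0)

manual_vectors = [tfidf_vector(doc) for doc in corpus_en]
for vector in manual_vectors:
    print(vector)
```

In this toy corpus every token appears in four of the five documents, so all three idf values are identical and cancel during normalization. The result is therefore just the L2-normalized count vector, which is why the first document comes out as roughly (0.816, 0.408, 0.408).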

In [43]:
# Saving the Dictionary and the TF-IDF model lets you load and reuse them later.
dictionary_en.save_as_text('test_en.txt')
tfidf_en.save('tfidf_en.pkl')

In [44]:
%ls

01_text_features.ipynb  model_w2v_ko.wv         test_en.txt
figures/                mymodel.wv              tfidf.pkl
model_w2v_en.wv         test.txt                tfidf_en.pkl


#### Hands-on example with Korean text

* Source - Building your own web crawler with Requests/BeautifulSoup - https://beomi.github.io/2017/01/20/HowToMakeWebCrawler/
* Source - https://www.slideshare.net/kimhyunjoonglovit/pycon2017-koreannlp
* Source - https://github.com/lovit/soynlp/blob/master/tutorials/nounextractor-v2_usage.ipynb
* Source - https://wikidocs.net/24603

In [45]:
import requests
from bs4 import BeautifulSoup

In [46]:
#url = "https://gasazip.com/view.html?no=614736"
#url = "https://gasazip.com/view.html?no=636135"
url = "http://gasazip.com/view.html?no=2276458"

In [47]:
# HTTP GET request
req = requests.get(url)
# Get the HTML source
html = req.text
# Parse the HTML source into a Python object with BeautifulSoup.
# The first argument is the HTML source; the second names the parser to use.
# Here we use Python's built-in html.parser.
soup = BeautifulSoup(html, 'html.parser')

In [48]:
lyrics = []
for txt in soup.find_all('div', attrs={'class': 'col-md-8'}) :
    lines = txt.get_text().split('\n')
    for line in lines :
        lyrics.append(line.strip())

In [49]:
lyrics

['작은 것들을 위한 시 (Boy With Luv)Feat.Halsey',
 '',
 '모든 게 궁금해',
 'How’s your day',
 'Oh tell me',
 '뭐가 널 행복하게 하는지',
 'Oh text me',
 'Your every picture',
 '내 머리맡에 두고 싶어',
 'oh bae',
 'Come be my teacher',
 '네 모든 걸 다 가르쳐줘',
 'Your 1 your 2',
 'Listen my my baby 나는',
 '저 하늘을 높이 날고 있어',
 '그때 니가 내게 줬던 두 날개로',
 '이제 여긴 너무 높아',
 '난 내 눈에 널 맞추고 싶어',
 'Yeah you makin’ me a boy with luv',
 'Oh my my my oh my my my',
 "I've waited all my life",
 '네 전부를 함께하고 싶어',
 'Oh my my my oh my my my',
 'Looking for something right',
 '이제 조금은 나 알겠어',
 'I want something stronger',
 'Than a moment',
 'than a moment love',
 'I have waited longer',
 'For a boy with',
 'For a boy with luv',
 '널 알게 된 이후 ya',
 '내 삶은 온통 너 ya',
 '사소한 게 사소하지 않게',
 '만들어버린 너라는 별',
 '하나부터 열까지 모든 게 특별하지',
 '너의 관심사 걸음걸이 말투와',
 '사소한 작은 습관들까지',
 '다 말하지 너무 작던',
 '내가 영웅이 된 거라고',
 'Oh nah',
 '난 말하지 운명 따윈',
 '처음부터 내 게 아니었다고',
 'Oh nah',
 '세계의 평화',
 'No way',
 '거대한 질서',
 'No way',
 '그저 널 지킬 거야 난',
 'Boy with luv',
 'Listen my my baby 나는',
 '저 하늘을 높이 날고

In [156]:
import collections
 
BTS = collections.OrderedDict()
BTS['No More Dream'] = 638198
BTS['I NEED U'] = 576129
BTS['쩔어'] = 628809
BTS['봄날'] = 644668
BTS['DNA'] = 1235594
BTS['피 땀 눈물'] = 630845
BTS['불타오르네'] = 615362
BTS['IDOL'] = 2262224
BTS['작은 것들을 위한 시'] = 2276458

In [159]:
lyrics_ids = BTS.values()
url_root = "http://gasazip.com/view.html?no="

urls = [url_root + str(lid) for lid in lyrics_ids]

print(urls)

['http://gasazip.com/view.html?no=638198', 'http://gasazip.com/view.html?no=576129', 'http://gasazip.com/view.html?no=628809', 'http://gasazip.com/view.html?no=644668', 'http://gasazip.com/view.html?no=1235594', 'http://gasazip.com/view.html?no=630845', 'http://gasazip.com/view.html?no=615362', 'http://gasazip.com/view.html?no=2262224', 'http://gasazip.com/view.html?no=2276458']


In [160]:
docs = []
for url in urls :
    req = requests.get(url)
    html = req.text
    soup = BeautifulSoup(html, 'html.parser')
    
    doc = []
    for txt in soup.find_all('div', attrs={'class': 'col-md-8'}) :
        doc.append(txt.get_text())
    
    docs.append(",".join(doc))

In [161]:
len(docs)

9

In [162]:
docs[:3]

['No More Dream,\n얌마 네 꿈은 뭐니\n얌마 네 꿈은 뭐니\n얌마 네 꿈은 뭐니\n네 꿈은 겨우 그거니\n\nI wanna big house, big cars & big rings\nBut 사실은 I dun have any big dreams\n하하 난 참 편하게 살어\n꿈 따위 안 꿔도 아무도 뭐라 안 하잖어\n전부 다다다 똑같이 나처럼 생각하고 있어\n새까까까맣게 까먹은 꿈 많던 어린 시절\n대학은 걱정 마 멀리라도 갈 거니까\n알았어 엄마 지금 독서실 간다니까\n\n네가 꿈꿔 온 네 모습이 뭐야\n지금 네 거울 속엔 누가 보여, I gotta say\n너의 길을 가라고\n단 하루를 살아도\n뭐라도 하라고\n나약함은 담아 둬\n\n왜 말 못하고 있어? 공부는 하기 싫다면서\n학교 때려 치기는 겁나지? 이거 봐 등교할 준비하네 벌써\n철 좀 들어 제발 좀, 너 입만 살아가지고 인마 유리 멘탈 boy\n(Stop!) 자신에게 물어봐 언제 네가 열심히 노력했냐고\n\n얌마 네 꿈은 뭐니\n얌마 네 꿈은 뭐니\n얌마 네 꿈은 뭐니\n네 꿈은 겨우 그거니\n\n거짓말이야 you such a liar\nSee me see me ya 넌 위선자여\n왜 자꾸 딴 길을 가래 야 너나 잘해\n제발 강요하진 말아 줘\n(La la la la la) 네 꿈이 뭐니 네 꿈이 뭐니 뭐니\n(La la la la la) 고작 이거니 고작 이거니 거니\n\n지겨운 same day, 반복되는 매일에\n어른들과 부모님은 틀에 박힌 꿈을 주입해\n장래희망 넘버원... 공무원?\n강요된 꿈은 아냐, 9회말 구원투수\n시간 낭비인 야자에 돌직구를 날려\n지옥 같은 사회에 반항해, 꿈을 특별 사면\n자신에게 물어봐 네 꿈의 profile\n억압만 받던 인생 네 삶의 주어가 되어 봐\n\n네가 꿈꿔 온 네 모습이 뭐여\n지금 네 거울 속엔 누가 보여, I gotta say\n너의 길을 가라고\n단 하루를 살아도\n뭐라도 하라고\n나약함은 담아 둬\n\n얌마 네 꿈은 뭐니\n얌마 네 

In [163]:
documents_ko = docs

In [165]:
# We want to clean up the documents using morphological analysis.
doc = documents_ko[2]
doc

"쩔어,\n어서 와 방탄은 처음이지?\n\nAyo ladies & gentleman\n준비가 됐다면 부를게 yeah!\n딴 녀석들과는 다르게\n내 스타일로 내 내 내 내 스타일로 에오!\n\n밤새 일했지 everyday\n니가 클럽에서 놀 때 yeah\n자 놀라지 말고 들어 매일\nI got a feel, I got a feel\n난 좀 쩔어!\n\n아 쩔어 쩔어 쩔어 우리 연습실 땀내\n봐 쩌렁 쩌렁 쩌렁한 내 춤이 답해\n모두 비실이 찌질이 찡찡이 띨띨이들\n나랑은 상관이 없어 cuz 난 희망이 쩔어 haha\n\nOk 우린 머리부터 발끝까지 전부 다 쩌 쩔어\n하루의 절반을 작업에 쩌 쩔어\n작업실에 쩔어 살어 청춘은 썩어가도\n덕분에 모로 가도 달리는 성공가도\n소녀들아 더 크게 소리질러 쩌 쩌렁\n\n밤새 일했지 everyday\n니가 클럽에서 놀 때 yeah\n딴 녀석들과는 다르게\nI don't wanna say yes\nI don't wanna say yes\n\n소리쳐봐 all right\n몸이 타버리도록 all night (all night)\nCause we got fire (fire!)\nHigher (higher!)\nI gotta make it, I gotta make it\n쩔어!\n\n거부는 거부해\n난 원래 너무해\n모두 다 따라 해\n쩔어\n\n거부는 거부해\n전부 나의 노예\n모두 다 따라 해\n쩔어\n\n3포세대? 5포세대?\n그럼 난 육포가 좋으니까 6포세대\n언론과 어른들은 의지가 없다며 우릴 싹 주식처럼 매도해\n왜 해보기도 전에 죽여 걔넨 enemy enemy enemy\n왜 벌써부터 고개를 숙여 받아 energy energy energy\n절대 마 포기 you know you not lonely\n너와 내 새벽은 낮보다 예뻐\nSo can I get a little bit of hope? (yeah)\n잠든 청춘을 깨워 go\n\n밤새 일했지 everyday\n니가 클럽에서 놀 때 yeah\n딴 녀석들과는 다르게\n

In [166]:
from soynlp.noun import LRNounExtractor_v2

noun_extractor = LRNounExtractor_v2(verbose=True)

[Noun Extractor] use default predictors
[Noun Extractor] num features: pos=1260, neg=1173, common=12


In [167]:
# Learning-based word extractor
nouns = noun_extractor.train_extract(documents_ko)

[Noun Extractor] counting eojeols
[EojeolCounter] n eojeol = 1076 from 9 sents. mem=0.181 Gb                    
[Noun Extractor] complete eojeol counter -> lr graph
[Noun Extractor] has been trained. #eojeols=2727, mem=0.181 Gb
[Noun Extractor] batch prediction was completed for 286 words
[Noun Extractor] checked compounds. discovered 0 compounds
[Noun Extractor] postprocessing detaching_features : 80 -> 80
[Noun Extractor] postprocessing ignore_features : 80 -> 77
[Noun Extractor] postprocessing ignore_NJ : 77 -> 77
[Noun Extractor] 77 nouns (0 compounds) with min frequency=1
[Noun Extractor] flushing was done. mem=0.181 Gb                    
[Noun Extractor] 19.80 % eojeols are covered


In [168]:
len(nouns)

77

In [169]:
nouns

{'스타일': NounScore(frequency=4, score=1.0),
 '거짓말': NounScore(frequency=5, score=1.0),
 'DNA': NounScore(frequency=5, score=0.8333333333333334),
 '어른들': NounScore(frequency=2, score=1.0),
 '사랑': NounScore(frequency=7, score=1.0),
 '함께': NounScore(frequency=3, score=0.5),
 '하루': NounScore(frequency=4, score=1.0),
 '우주': NounScore(frequency=3, score=1.0),
 '운명': NounScore(frequency=8, score=1.0),
 '특별': NounScore(frequency=2, score=1.0),
 '사소': NounScore(frequency=3, score=1.0),
 '날개': NounScore(frequency=4, score=1.0),
 '거부': NounScore(frequency=8, score=1.0),
 '너무': NounScore(frequency=12, score=1.0),
 '달콤': NounScore(frequency=3, score=1.0),
 '이상': NounScore(frequency=2, score=1.0),
 '행복': NounScore(frequency=2, score=1.0),
 '전부': NounScore(frequency=10, score=1.0),
 '상처': NounScore(frequency=2, score=1.0),
 '준비': NounScore(frequency=2, score=1.0),
 '봄날': NounScore(frequency=3, score=0.75),
 '매일': NounScore(frequency=4, score=1.0),
 '하늘': NounScore(frequency=4, score=1.0),
 '눈물': NounS

In [170]:
from soynlp.tokenizer import MaxScoreTokenizer

In [171]:
scores = {w:s.score for w, s in nouns.items()}
scores

{'스타일': 1.0,
 '거짓말': 1.0,
 'DNA': 0.8333333333333334,
 '어른들': 1.0,
 '사랑': 1.0,
 '함께': 0.5,
 '하루': 1.0,
 '우주': 1.0,
 '운명': 1.0,
 '특별': 1.0,
 '사소': 1.0,
 '날개': 1.0,
 '거부': 1.0,
 '너무': 1.0,
 '달콤': 1.0,
 '이상': 1.0,
 '행복': 1.0,
 '전부': 1.0,
 '상처': 1.0,
 '준비': 1.0,
 '봄날': 0.75,
 '매일': 1.0,
 '하늘': 1.0,
 '눈물': 1.0,
 '새벽': 1.0,
 '겨울': 1.0,
 '시간': 1.0,
 '청춘': 1.0,
 '우리': 1.0,
 '모두': 1.0,
 '벌써': 1.0,
 '처음': 0.5,
 '머리': 0.5,
 '멀리': 1.0,
 '쩌렁': 1.0,
 '작업': 0.5,
 '자꾸': 1.0,
 '조금': 0.7142857142857143,
 '아무': 1.0,
 '있어': 0.8333333333333334,
 '꺼': 0.3333333333333333,
 '너': 1.0,
 '말': 0.5,
 '눈': 1.0,
 '것': 1.0,
 '못': 1.0,
 '꿈': 1.0,
 '삶': 1.0,
 '모': 1.0,
 '욕': 1.0,
 '원': 1.0,
 '잘': 0.625,
 '더': 1.0,
 '보': 1.0,
 '많': 0.9512195121951219,
 '몸': 1.0,
 '높': 0.8,
 '독': 0.5,
 '춤': 1.0,
 '힘': 0.5,
 '끝': 1.0,
 '숨': 1.0,
 '깊': 0.5,
 '길': 1.0,
 '법': 1.0,
 '밤': 1.0,
 '손': 0.6666666666666666,
 '내': 1.0,
 '네': 1.0,
 '수': 1.0,
 '둘': 1.0,
 '때': 0.7142857142857143,
 '열': 0.5,
 '취': 1.0,
 '꿔': 1.0,
 '뭣': 1.0,
 '생': 0.4}

In [172]:
tokenizer = MaxScoreTokenizer(scores=scores)

In [176]:
for doc in documents_ko[:2] :
    print(doc)
    tokens = tokenizer.tokenize(doc)
    print(tokens)

No More Dream,
얌마 네 꿈은 뭐니
얌마 네 꿈은 뭐니
얌마 네 꿈은 뭐니
네 꿈은 겨우 그거니

I wanna big house, big cars & big rings
But 사실은 I dun have any big dreams
하하 난 참 편하게 살어
꿈 따위 안 꿔도 아무도 뭐라 안 하잖어
전부 다다다 똑같이 나처럼 생각하고 있어
새까까까맣게 까먹은 꿈 많던 어린 시절
대학은 걱정 마 멀리라도 갈 거니까
알았어 엄마 지금 독서실 간다니까

네가 꿈꿔 온 네 모습이 뭐야
지금 네 거울 속엔 누가 보여, I gotta say
너의 길을 가라고
단 하루를 살아도
뭐라도 하라고
나약함은 담아 둬

왜 말 못하고 있어? 공부는 하기 싫다면서
학교 때려 치기는 겁나지? 이거 봐 등교할 준비하네 벌써
철 좀 들어 제발 좀, 너 입만 살아가지고 인마 유리 멘탈 boy
(Stop!) 자신에게 물어봐 언제 네가 열심히 노력했냐고

얌마 네 꿈은 뭐니
얌마 네 꿈은 뭐니
얌마 네 꿈은 뭐니
네 꿈은 겨우 그거니

거짓말이야 you such a liar
See me see me ya 넌 위선자여
왜 자꾸 딴 길을 가래 야 너나 잘해
제발 강요하진 말아 줘
(La la la la la) 네 꿈이 뭐니 네 꿈이 뭐니 뭐니
(La la la la la) 고작 이거니 고작 이거니 거니

지겨운 same day, 반복되는 매일에
어른들과 부모님은 틀에 박힌 꿈을 주입해
장래희망 넘버원... 공무원?
강요된 꿈은 아냐, 9회말 구원투수
시간 낭비인 야자에 돌직구를 날려
지옥 같은 사회에 반항해, 꿈을 특별 사면
자신에게 물어봐 네 꿈의 profile
억압만 받던 인생 네 삶의 주어가 되어 봐

네가 꿈꿔 온 네 모습이 뭐여
지금 네 거울 속엔 누가 보여, I gotta say
너의 길을 가라고
단 하루를 살아도
뭐라도 하라고
나약함은 담아 둬

얌마 네 꿈은 뭐니
얌마 네 꿈은 뭐니
얌마 네 꿈은 뭐니
네 꿈은 겨우 그거니

거짓말이야 you such a liar
See me

In [278]:
for doc in documents_ko[:2] :
    print(doc)
    tokens = tokenizer.tokenize(doc)
    print([token for token in tokens if nouns.get(token)])

No More Dream,
얌마 네 꿈은 뭐니
얌마 네 꿈은 뭐니
얌마 네 꿈은 뭐니
네 꿈은 겨우 그거니

I wanna big house, big cars & big rings
But 사실은 I dun have any big dreams
하하 난 참 편하게 살어
꿈 따위 안 꿔도 아무도 뭐라 안 하잖어
전부 다다다 똑같이 나처럼 생각하고 있어
새까까까맣게 까먹은 꿈 많던 어린 시절
대학은 걱정 마 멀리라도 갈 거니까
알았어 엄마 지금 독서실 간다니까

네가 꿈꿔 온 네 모습이 뭐야
지금 네 거울 속엔 누가 보여, I gotta say
너의 길을 가라고
단 하루를 살아도
뭐라도 하라고
나약함은 담아 둬

왜 말 못하고 있어? 공부는 하기 싫다면서
학교 때려 치기는 겁나지? 이거 봐 등교할 준비하네 벌써
철 좀 들어 제발 좀, 너 입만 살아가지고 인마 유리 멘탈 boy
(Stop!) 자신에게 물어봐 언제 네가 열심히 노력했냐고

얌마 네 꿈은 뭐니
얌마 네 꿈은 뭐니
얌마 네 꿈은 뭐니
네 꿈은 겨우 그거니

거짓말이야 you such a liar
See me see me ya 넌 위선자여
왜 자꾸 딴 길을 가래 야 너나 잘해
제발 강요하진 말아 줘
(La la la la la) 네 꿈이 뭐니 네 꿈이 뭐니 뭐니
(La la la la la) 고작 이거니 고작 이거니 거니

지겨운 same day, 반복되는 매일에
어른들과 부모님은 틀에 박힌 꿈을 주입해
장래희망 넘버원... 공무원?
강요된 꿈은 아냐, 9회말 구원투수
시간 낭비인 야자에 돌직구를 날려
지옥 같은 사회에 반항해, 꿈을 특별 사면
자신에게 물어봐 네 꿈의 profile
억압만 받던 인생 네 삶의 주어가 되어 봐

네가 꿈꿔 온 네 모습이 뭐여
지금 네 거울 속엔 누가 보여, I gotta say
너의 길을 가라고
단 하루를 살아도
뭐라도 하라고
나약함은 담아 둬

얌마 네 꿈은 뭐니
얌마 네 꿈은 뭐니
얌마 네 꿈은 뭐니
네 꿈은 겨우 그거니

거짓말이야 you such a liar
See me

In [276]:
print(nouns.get('안녕'))

None


In [283]:
# Function that splits a document into words (tokens)
def tokenize(x, tokenizer) :
    for token in tokenizer.tokenize(x) :
        if nouns.get(token) :  # keep only the extracted nouns
            yield token


# Build the corpus
corpus_ko = [list(tokenize(doc, tokenizer)) for doc in documents_ko]

print(corpus_ko[:2])

[['네', '네', '네', '네', '꿈', '아무', '전부', '있어', '꿈', '멀리', '네', '네', '하루', '말', '있어', '준비', '벌써', '너', '네', '네', '네', '네', '거짓말', '자꾸', '네', '네', '매일', '어른들', '시간', '특별', '네', '네', '네', '네', '하루', '네', '네', '네', '네', '거짓말', '자꾸', '네', '네', '꿔', '너', '거짓말', '자꾸', '네', '네'], ['너', '너', '뭣', '전부', '사랑', '사랑', '자꾸', '너무', '자꾸', '내', '내', '내', '아무', '말', '하늘', '하늘', '하늘', '내', '눈물', '더', '잘', '사랑', '자꾸', '너무', '사랑', '사랑', '수', '사랑', '자꾸', '너무']]


In [284]:
# Create the Dictionary
dictionary_ko = gensim.corpora.Dictionary(corpus_ko)

2019-04-23 15:54:30,749 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-04-23 15:54:30,750 : INFO : built Dictionary(56 unique tokens: ['거짓말', '꿈', '꿔', '너', '네']...) from 9 documents (total 290 corpus positions)


In [285]:
# Create the TF-IDF model
tfidf_ko = gensim.models.TfidfModel(dictionary=dictionary_ko, normalize=True)

In [286]:
vectors = [tfidf_ko[dictionary_ko.doc2bow(vector)] for vector in corpus_ko]
vectors[:2]

[[(0, 0.12107436527752928),
  (1, 0.11791395471330296),
  (2, 0.05895697735665148),
  (3, 0.05895697735665148),
  (4, 0.9685949222202342),
  (5, 0.04035812175917643),
  (6, 0.04035812175917643),
  (7, 0.04035812175917643),
  (8, 0.04035812175917643),
  (9, 0.02947848867832574),
  (10, 0.04035812175917643),
  (11, 0.04035812175917643),
  (12, 0.03154354402424213),
  (13, 0.08843546603497723),
  (14, 0.010879633080850683),
  (15, 0.04035812175917643),
  (16, 0.04035812175917643),
  (17, 0.05895697735665148)],
 [(3, 0.1810555648125473),
  (5, 0.12393889336758655),
  (10, 0.12393889336758655),
  (13, 0.3621111296250946),
  (14, 0.0334111109613129),
  (18, 0.1336444438452516),
  (19, 0.14530428209412652),
  (20, 0.12393889336758655),
  (21, 0.048434760698042166),
  (22, 0.1810555648125473),
  (23, 0.7436333602055194),
  (24, 0.09052778240627365),
  (25, 0.12393889336758655),
  (26, 0.3718166801027597)]]

In [287]:
# Vector similarity based on TF-IDF
from gensim import similarities

In [288]:
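def distance(a, b, dic) :
    # Despite the name, this returns a similarity, not a distance:
    # the similarity of b to a as a percentage (higher means more alike).
    index = similarities.MatrixSimilarity([a],num_features=len(dic))
    sim = index[b]
    return sim[0]*100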
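
As far as gensim's documentation describes it, `MatrixSimilarity` scores a query against the indexed vectors by cosine similarity. The sketch below computes the same quantity directly on sparse `(token_id, weight)` pairs in plain Python. The function name `cosine_similarity_sparse` is hypothetical, and the two input vectors are copied from the English toy TF-IDF output earlier in this notebook.

```python
import math

def cosine_similarity_sparse(a, b):
    """Cosine similarity between two sparse vectors of (id, weight) pairs."""
    weights_a = dict(a)
    weights_b = dict(b)
    # Only ids present in both vectors contribute to the dot product.
    dot = sum(w * weights_b.get(idx, 0.0) for idx, w in weights_a.items())
    norm_a = math.sqrt(sum(w * w for w in weights_a.values()))
    norm_b = math.sqrt(sum(w * w for w in weights_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_a * norm_b)

# TF-IDF vectors of "a b c a" and "c b c" from the English toy corpus.
doc_a = [(0, 0.816496580927726), (1, 0.408248290463863), (2, 0.408248290463863)]
doc_b = [(1, 0.447213595499958), (2, 0.894427190999916)]

similarity = cosine_similarity_sparse(doc_a, doc_b)
self_similarity = cosine_similarity_sparse(doc_a, doc_a)
print(round(similarity * 100, 2), "% similar")
print(round(self_similarity * 100, 2), "% similar")
```

Because the TF-IDF model was built with `normalize=True`, the vectors already have unit length, so the cosine similarity reduces to the dot product over shared token ids.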

In [289]:
titles = list(BTS.keys())

In [290]:
a = 0
print("A: ")
print(titles[a])
print(corpus_ko[a])
print(vectors[a])

print("-------------------")

b = 3
print("B: ")
print(titles[b])
print(corpus_ko[b])
print(vectors[b])

sim = distance(vectors[a], vectors[b], dictionary_ko)

print("===================")
print(round(sim,2),'% similar')

2019-04-23 15:54:43,296 : INFO : creating matrix with 1 documents and 56 features


A: 
No More Dream
['네', '네', '네', '네', '꿈', '아무', '전부', '있어', '꿈', '멀리', '네', '네', '하루', '말', '있어', '준비', '벌써', '너', '네', '네', '네', '네', '거짓말', '자꾸', '네', '네', '매일', '어른들', '시간', '특별', '네', '네', '네', '네', '하루', '네', '네', '네', '네', '거짓말', '자꾸', '네', '네', '꿔', '너', '거짓말', '자꾸', '네', '네']
[(0, 0.12107436527752928), (1, 0.11791395471330296), (2, 0.05895697735665148), (3, 0.05895697735665148), (4, 0.9685949222202342), (5, 0.04035812175917643), (6, 0.04035812175917643), (7, 0.04035812175917643), (8, 0.04035812175917643), (9, 0.02947848867832574), (10, 0.04035812175917643), (11, 0.04035812175917643), (12, 0.03154354402424213), (13, 0.08843546603497723), (14, 0.010879633080850683), (15, 0.04035812175917643), (16, 0.04035812175917643), (17, 0.05895697735665148)]
-------------------
B: 
봄날
['봄날', '더', '있어', '너무', '시간', '우리', '우리', '겨울', '겨울', '시간', '손', '겨울', '봄날', '조금', '더', '수', '조금', '더', '겨울', '봄날', '더', '시간', '우리', '모두', '하루', '조금', '더', '겨울', '조금', '더', '겨울', '봄날', '더']
[(9, 0.179778421536

In [291]:
a = 3
print("A: ")
print(titles[a])
print(corpus_ko[a])
print(vectors[a])

print("-------------------")

b = 3
print("B: ")
print(titles[b])
print(corpus_ko[b])
print(vectors[b])

sim = distance(vectors[a], vectors[b], dictionary_ko)

print("===================")
print(round(sim,2),'% similar')

2019-04-23 15:54:58,832 : INFO : creating matrix with 1 documents and 56 features


A: 
봄날
['봄날', '더', '있어', '너무', '시간', '우리', '우리', '겨울', '겨울', '시간', '손', '겨울', '봄날', '조금', '더', '수', '조금', '더', '겨울', '봄날', '더', '시간', '우리', '모두', '하루', '조금', '더', '겨울', '조금', '더', '겨울', '봄날', '더']
[(9, 0.17977842153624138), (12, 0.0320620719751412), (17, 0.05992614051208046), (19, 0.0320620719751412), (21, 0.2244345038259884), (24, 0.05992614051208046), (30, 0.08204309595838546), (33, 0.17977842153624138), (38, 0.7191136861449655), (39, 0.47940912409664366), (40, 0.11985228102416091), (41, 0.32817238383354186)]
-------------------
B: 
봄날
['봄날', '더', '있어', '너무', '시간', '우리', '우리', '겨울', '겨울', '시간', '손', '겨울', '봄날', '조금', '더', '수', '조금', '더', '겨울', '봄날', '더', '시간', '우리', '모두', '하루', '조금', '더', '겨울', '조금', '더', '겨울', '봄날', '더']
[(9, 0.17977842153624138), (12, 0.0320620719751412), (17, 0.05992614051208046), (19, 0.0320620719751412), (21, 0.2244345038259884), (24, 0.05992614051208046), (30, 0.08204309595838546), (33, 0.17977842153624138), (38, 0.7191136861449655), (39, 0.47940912409664366), 

In [292]:
a = 4
print("A: ")
print(titles[a])
print(corpus_ko[a])
print(vectors[a])

print("-------------------")

b = 8
print("B: ")
print(titles[b])
print(corpus_ko[b])
print(vectors[b])

sim = distance(vectors[a], vectors[b], dictionary_ko)

print("===================")
print(round(sim,2),'% similar')

2019-04-23 15:55:02,520 : INFO : creating matrix with 1 documents and 56 features


A: 
DNA
['DNA', '내', 'DNA', '우리', '우주', '운명', '내', '내', '운명', '우주', '함께', '운명', 'DNA', '더', 'DNA', '우리', '자꾸', '이상', '사랑', '내', '운명', '우주', '함께', '운명', 'DNA', '운명', '우리', '함께', '운명', 'DNA']
[(13, 0.05726670028364734), (18, 0.08454174074334084), (21, 0.030639201032861517), (23, 0.07840213546948255), (33, 0.17180010085094202), (42, 0.6872004034037681), (43, 0.34360020170188404), (44, 0.5488149482863779), (45, 0.07840213546948255), (46, 0.23520640640844767)]
-------------------
B: 
작은 것들을 위한 시
['행복', '내', '머리', '네', '하늘', '있어', '날개', '너무', '내', '네', '전부', '함께', '조금', '내', '너', '사소', '사소', '특별', '사소', '너무', '운명', '처음', '내', '하늘', '있어', '날개', '너무', '내', '네', '전부', '함께', '조금', '상처', '상처', '때', '날개', '네', '전부', '함께', '조금']
[(3, 0.0788862248590622), (4, 0.4320031332304548), (12, 0.08441243830039596), (14, 0.08734367534565451), (16, 0.1080007833076137), (18, 0.14557279224275751), (19, 0.12661865745059395), (26, 0.2160015666152274), (28, 0.1080007833076137), (29, 0.1080007833076137), (36, 0.1080

-------------------

# Distributed Representation

* Document Embedding
    - Doc2Vec
* Word Embedding
    - Word2Vec
    - GloVe
    - FastText
* Sentence Embedding

<img src="https://nbviewer.jupyter.org/github/psygrammer/psyml/blob/master/nlp_ml/ch04/figures/cap05.png" width=600 />

* Source - Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning - https://www.amazon.com/Applied-Text-Analysis-Python-Language-Aware/dp/1491963042/

-----------------------------------

## Document Embedding
* Doc2Vec

### Doc2Vec

In [293]:
from gensim.models.doc2vec import TaggedDocument, Doc2Vec

In [294]:
print(corpus_ko[:2])

[['네', '네', '네', '네', '꿈', '아무', '전부', '있어', '꿈', '멀리', '네', '네', '하루', '말', '있어', '준비', '벌써', '너', '네', '네', '네', '네', '거짓말', '자꾸', '네', '네', '매일', '어른들', '시간', '특별', '네', '네', '네', '네', '하루', '네', '네', '네', '네', '거짓말', '자꾸', '네', '네', '꿔', '너', '거짓말', '자꾸', '네', '네'], ['너', '너', '뭣', '전부', '사랑', '사랑', '자꾸', '너무', '자꾸', '내', '내', '내', '아무', '말', '하늘', '하늘', '하늘', '내', '눈물', '더', '잘', '사랑', '자꾸', '너무', '사랑', '사랑', '수', '사랑', '자꾸', '너무']]


In [295]:
docs   = [ 
    TaggedDocument(words, [titles[idx]])
        for idx, words in enumerate(corpus_ko)
]

In [296]:
docs[:2]

[TaggedDocument(words=['네', '네', '네', '네', '꿈', '아무', '전부', '있어', '꿈', '멀리', '네', '네', '하루', '말', '있어', '준비', '벌써', '너', '네', '네', '네', '네', '거짓말', '자꾸', '네', '네', '매일', '어른들', '시간', '특별', '네', '네', '네', '네', '하루', '네', '네', '네', '네', '거짓말', '자꾸', '네', '네', '꿔', '너', '거짓말', '자꾸', '네', '네'], tags=['No More Dream']),
 TaggedDocument(words=['너', '너', '뭣', '전부', '사랑', '사랑', '자꾸', '너무', '자꾸', '내', '내', '내', '아무', '말', '하늘', '하늘', '하늘', '내', '눈물', '더', '잘', '사랑', '자꾸', '너무', '사랑', '사랑', '수', '사랑', '자꾸', '너무'], tags=['I NEED U'])]

In [297]:
d2v = Doc2Vec(docs, min_count=0)

2019-04-23 15:55:49,949 : INFO : collecting all words and their counts
2019-04-23 15:55:49,951 : INFO : PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
2019-04-23 15:55:49,951 : INFO : collected 56 word types and 9 unique tags from a corpus of 9 examples and 290 words
2019-04-23 15:55:49,952 : INFO : Loading a fresh vocabulary
2019-04-23 15:55:49,953 : INFO : min_count=0 retains 56 unique words (100% of original 56, drops 0)
2019-04-23 15:55:49,954 : INFO : min_count=0 leaves 290 word corpus (100% of original 290, drops 0)
2019-04-23 15:55:49,955 : INFO : deleting the raw counts dictionary of 56 items
2019-04-23 15:55:49,955 : INFO : sample=0.001 downsamples 56 most-common words
2019-04-23 15:55:49,957 : INFO : downsampling leaves estimated 78 word corpus (27.0% of prior 290)
2019-04-23 15:55:49,958 : INFO : estimated required memory for 56 words and 100 dimensions: 78200 bytes
2019-04-23 15:55:49,959 : INFO : resetting layer weights
2019-04-23 15:55:49,962 : INF

In [298]:
# gensim 3.x API; gensim 4 exposes the same document vectors as d2v.dv
vectors = d2v.docvecs

In [299]:
for i in range(3) :
    print(vectors[i])

[-6.1519467e-04  3.6071020e-03  3.8809460e-03  4.6946714e-03
  2.9825661e-03 -2.0385412e-03  3.0491012e-03 -1.3842918e-03
  4.4256831e-03  1.7111637e-03  8.2507607e-04  2.8566094e-03
 -2.0712824e-03 -3.5699098e-03  1.1796689e-04 -3.9170324e-03
 -2.6237357e-03  9.4670407e-04  8.4810663e-04 -9.9855277e-04
  4.7059203e-03 -5.0347755e-03 -3.8452967e-04 -3.4272957e-03
 -2.5331886e-03  3.5015543e-03 -4.8382082e-03 -3.8819930e-03
  3.9205715e-05 -3.9196974e-03  1.1671892e-03  2.2203308e-03
 -2.9040407e-03 -4.5154365e-03  8.5221307e-04  2.1265133e-03
  4.1276924e-03 -1.6352560e-03 -4.6726898e-03 -4.2092111e-03
  2.6181755e-03  5.4962689e-04  3.0222598e-03  1.3455185e-03
  4.4579674e-03 -1.9941297e-03  7.6458702e-04  4.7507389e-03
  2.3123790e-03  2.7946397e-03  3.5345498e-03 -3.2431672e-03
 -3.5833416e-03 -2.2028058e-03 -1.2297489e-03 -4.8115267e-03
  4.4754525e-03  9.3083136e-04 -5.4658402e-04 -2.5224797e-03
  3.8910999e-03 -1.7981066e-03 -4.1203932e-03 -5.4971850e-04
  7.2735769e-04 -1.01825

In [300]:
# Vector similarity based on doc2vec
def distance(a_doctag, b_doctag, vectors) :
    sim = vectors.similarity(a_doctag, b_doctag)
    return sim*100
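
The `similarity` call above is understood to return the cosine similarity between two document vectors. As a hedged illustration on dense vectors, the sketch below ranks every pair of documents by that measure. The three-dimensional vectors and the tags `doc_a`, `doc_b`, `doc_c` are made up for this example; the training log above reports 100 dimensions for the real Doc2Vec vectors.

```python
import math
from itertools import combinations

def cosine_similarity(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)

# Hypothetical document vectors keyed by tag (illustration only).
toy_docvecs = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.8, 0.6, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}

# Score every unordered pair and sort from most to least similar.
pairs = sorted(
    (
        (cosine_similarity(toy_docvecs[a], toy_docvecs[b]), a, b)
        for a, b in combinations(toy_docvecs, 2)
    ),
    reverse=True,
)

for score, a, b in pairs:
    print(f"{a} vs {b}: {score * 100:.2f} % similar")
```

With only nine short documents, the trained vectors stay close to their small random initial values, so rankings obtained this way on the lyrics corpus should be read as a demonstration of the mechanics rather than a meaningful result.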

In [301]:
a = 0
a_doctag = vectors.index2entity[a]
print("A: ")
print(a_doctag)
print(corpus_ko[a])
print(vectors[a])

print("---------------------")

b = 3
b_doctag = vectors.index2entity[b]
print("B: ")
print(b_doctag)
print(corpus_ko[b])
print(vectors[b])

print("=====================")

sim = distance(a_doctag, b_doctag, vectors)

print(round(sim,2),'% similar')

A: 
No More Dream
['네', '네', '네', '네', '꿈', '아무', '전부', '있어', '꿈', '멀리', '네', '네', '하루', '말', '있어', '준비', '벌써', '너', '네', '네', '네', '네', '거짓말', '자꾸', '네', '네', '매일', '어른들', '시간', '특별', '네', '네', '네', '네', '하루', '네', '네', '네', '네', '거짓말', '자꾸', '네', '네', '꿔', '너', '거짓말', '자꾸', '네', '네']
[-6.1519467e-04  3.6071020e-03  3.8809460e-03  4.6946714e-03
  2.9825661e-03 -2.0385412e-03  3.0491012e-03 -1.3842918e-03
  4.4256831e-03  1.7111637e-03  8.2507607e-04  2.8566094e-03
 -2.0712824e-03 -3.5699098e-03  1.1796689e-04 -3.9170324e-03
 -2.6237357e-03  9.4670407e-04  8.4810663e-04 -9.9855277e-04
  4.7059203e-03 -5.0347755e-03 -3.8452967e-04 -3.4272957e-03
 -2.5331886e-03  3.5015543e-03 -4.8382082e-03 -3.8819930e-03
  3.9205715e-05 -3.9196974e-03  1.1671892e-03  2.2203308e-03
 -2.9040407e-03 -4.5154365e-03  8.5221307e-04  2.1265133e-03
  4.1276924e-03 -1.6352560e-03 -4.6726898e-03 -4.2092111e-03
  2.6181755e-03  5.4962689e-04  3.0222598e-03  1.3455185e-03
  4.4579674e-03 -1.9941297e-03  7.6458702e

In [302]:
a = 3
a_doctag = vectors.index2entity[a]
print("A: ")
print(a_doctag)
print(corpus_ko[a])
print(vectors[a])

print("---------------------")

b = 3
b_doctag = vectors.index2entity[b]
print("B: ")
print(b_doctag)
print(corpus_ko[b])
print(vectors[b])

print("=====================")

sim = distance(a_doctag, b_doctag, vectors)

print(round(sim,2),'% similar')

A: 
봄날
['봄날', '더', '있어', '너무', '시간', '우리', '우리', '겨울', '겨울', '시간', '손', '겨울', '봄날', '조금', '더', '수', '조금', '더', '겨울', '봄날', '더', '시간', '우리', '모두', '하루', '조금', '더', '겨울', '조금', '더', '겨울', '봄날', '더']
[-2.7601644e-03  1.3943871e-04  2.6803692e-03 -2.0198654e-03
 -1.7366105e-03 -4.9571082e-04  4.1714371e-03 -1.3520259e-03
  4.1089756e-03  4.9122646e-03  3.7869300e-05  4.2981221e-03
  2.2153403e-03 -1.4963220e-03  2.9957250e-03 -4.2272946e-03
 -3.7375118e-03  3.9186627e-03 -3.2491740e-03  3.2713660e-03
  8.3561777e-04  9.1597793e-04  2.9779766e-03 -3.2917124e-03
  4.5024282e-03  3.9476319e-03 -3.3793929e-03  3.5513504e-03
 -4.4827466e-03 -3.5801858e-03 -3.4950839e-03 -1.8302792e-03
 -4.9370732e-03  4.8388201e-03  2.7085065e-03 -2.6980066e-03
  9.7624131e-04  1.2016608e-03 -4.6868650e-03  2.3331409e-03
 -8.3741517e-04 -6.1276322e-04 -2.9011194e-03  8.4080943e-04
  2.2523403e-03  3.5871812e-03 -3.2537663e-03  4.2647179e-03
 -4.5948159e-03 -2.6578264e-04  2.3791564e-03  4.8016678e-03
  4.179859

In [303]:
a = 4
a_doctag = vectors.index2entity[a]
print("A: ")
print(a_doctag)
print(corpus_ko[a])
print(vectors[a])

print("---------------------")

b = 8
b_doctag = vectors.index2entity[b]
print("B: ")
print(b_doctag)
print(corpus_ko[b])
print(vectors[b])

print("=====================")

sim = distance(a_doctag, b_doctag, vectors)

print(round(sim,2),'% similar')

A: 
DNA
['DNA', '내', 'DNA', '우리', '우주', '운명', '내', '내', '운명', '우주', '함께', '운명', 'DNA', '더', 'DNA', '우리', '자꾸', '이상', '사랑', '내', '운명', '우주', '함께', '운명', 'DNA', '운명', '우리', '함께', '운명', 'DNA']
[-6.42233994e-04  4.74497303e-03  7.27963692e-04 -1.72769919e-03
 -3.92033020e-03  2.77536991e-03  4.00252175e-03  3.78438411e-03
 -2.42577144e-03  1.12691254e-03  2.35645822e-03  3.51692713e-03
 -1.94363017e-03  3.60522652e-03  4.24801518e-04  1.40756357e-03
 -3.32491845e-03 -2.47707358e-03  1.09643978e-03  4.54185298e-03
 -2.60659005e-03  4.33713384e-03  2.87028553e-04 -3.62039427e-03
  2.71040923e-03  1.83596136e-03 -4.12064925e-04  1.13342816e-04
 -1.65617699e-03 -3.26014910e-04  3.04114656e-03  2.81146873e-04
 -1.21135039e-04  2.18284247e-03  2.54173577e-03 -1.76452973e-03
  1.08698057e-03  4.77150455e-03  3.62216285e-03 -1.33855315e-03
  3.82080721e-03  4.54136170e-05  2.40317802e-03 -1.57573365e-03
  1.92285725e-03 -3.63529287e-03  4.21741104e-04 -6.97662937e-04
 -5.54188795e-04  3.32073029e-

#### Building a visualization function with plotly

In [304]:

import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

from sklearn.decomposition import IncrementalPCA    # initial reduction
from sklearn.manifold import TSNE                   # final reduction
import numpy as np                                  # array handling

from plotly.offline import init_notebook_mode, iplot, plot
import plotly.graph_objs as go

def visualize_embeddings(vectors, labels, plot_in_notebook = True):

    num_dimensions = 2  # final num dimensions (2D, 3D, etc)

    # convert both lists into numpy vectors for reduction
    vectors = np.asarray(vectors)
    labels = np.asarray(labels)
    
    # reduce using t-SNE
    logging.info('starting tSNE dimensionality reduction. This may take some time.')
    tsne = TSNE(n_components=num_dimensions, random_state=0)
    vectors = tsne.fit_transform(vectors)

    x_vals = [v[0] for v in vectors]
    y_vals = [v[1] for v in vectors]
        
    # Create a trace
    trace = go.Scatter(
        x=x_vals,
        y=y_vals,
        mode='text',
        text=labels
        )
    
    data = [trace]
    
    logging.info('All done. Plotting.')
    
    if plot_in_notebook:
        init_notebook_mode(connected=True)
        iplot(data, filename='word-embedding-plot')
    else:
        plot(data, filename='word-embedding-plot.html')
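
The plots in this notebook use only a handful of points (a few document titles or words), while `TSNE` defaults to `perplexity=30`. As far as I know, newer scikit-learn releases reject a perplexity that is not smaller than the number of samples, so the sketch below picks a value that stays valid for tiny inputs. `choose_perplexity` is a hypothetical helper; it is not part of scikit-learn or of the original notebook.

```python
def choose_perplexity(n_samples, preferred=30.0):
    """Return a t-SNE perplexity that is strictly smaller than n_samples.

    Hypothetical helper: `preferred` mirrors scikit-learn's default of 30.
    """
    if n_samples < 2:
        raise ValueError("t-SNE needs at least two points")
    # Cap the preferred value at n_samples - 1 so it stays below n_samples.
    return float(min(preferred, n_samples - 1))


# Five words (as in the English Word2Vec example) vs. a larger vocabulary.
print(choose_perplexity(5))
print(choose_perplexity(1000))
```

Inside `visualize_embeddings` this could be passed as `TSNE(n_components=num_dimensions, perplexity=choose_perplexity(len(vectors)), random_state=0)`.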

In [305]:
labels = titles
vectors = [d2v.docvecs[title] for title in labels]

print(labels[0])
print(vectors[0])

No More Dream
[-6.1519467e-04  3.6071020e-03  3.8809460e-03  4.6946714e-03
  2.9825661e-03 -2.0385412e-03  3.0491012e-03 -1.3842918e-03
  4.4256831e-03  1.7111637e-03  8.2507607e-04  2.8566094e-03
 -2.0712824e-03 -3.5699098e-03  1.1796689e-04 -3.9170324e-03
 -2.6237357e-03  9.4670407e-04  8.4810663e-04 -9.9855277e-04
  4.7059203e-03 -5.0347755e-03 -3.8452967e-04 -3.4272957e-03
 -2.5331886e-03  3.5015543e-03 -4.8382082e-03 -3.8819930e-03
  3.9205715e-05 -3.9196974e-03  1.1671892e-03  2.2203308e-03
 -2.9040407e-03 -4.5154365e-03  8.5221307e-04  2.1265133e-03
  4.1276924e-03 -1.6352560e-03 -4.6726898e-03 -4.2092111e-03
  2.6181755e-03  5.4962689e-04  3.0222598e-03  1.3455185e-03
  4.4579674e-03 -1.9941297e-03  7.6458702e-04  4.7507389e-03
  2.3123790e-03  2.7946397e-03  3.5345498e-03 -3.2431672e-03
 -3.5833416e-03 -2.2028058e-03 -1.2297489e-03 -4.8115267e-03
  4.4754525e-03  9.3083136e-04 -5.4658402e-04 -2.5224797e-03
  3.8910999e-03 -1.7981066e-03 -4.1203932e-03 -5.4971850e-04
  7.273576

In [306]:
visualize_embeddings(vectors, labels)

2019-04-23 15:56:50,045 : INFO : starting tSNE dimensionality reduction. This may take some time.
2019-04-23 15:56:50,181 : INFO : All done. Plotting.


---------------------

## Word Embedding
* Word2Vec
* GloVe
* FastText

### Word2Vec

* Source - models.word2vec – Deep learning with word2vec - https://radimrehurek.com/gensim/models/word2vec.html

#### English example

In [307]:
# Data
corpus_en = [["my", "name", "is", "jamie"], ["jamie", "is", "cute"]]

pprint(corpus_en)

[['my', 'name', 'is', 'jamie'], ['jamie', 'is', 'cute']]


In [308]:
# Initialize the model
w2v_en = gensim.models.Word2Vec(min_count=1) # e.g. Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)

In [309]:
# Build the model vocabulary
w2v_en.build_vocab(corpus_en)

2019-04-23 15:57:06,866 : INFO : collecting all words and their counts
2019-04-23 15:57:06,868 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2019-04-23 15:57:06,869 : INFO : collected 5 word types from a corpus of 7 raw words and 2 sentences
2019-04-23 15:57:06,869 : INFO : Loading a fresh vocabulary
2019-04-23 15:57:06,871 : INFO : min_count=1 retains 5 unique words (100% of original 5, drops 0)
2019-04-23 15:57:06,871 : INFO : min_count=1 leaves 7 word corpus (100% of original 7, drops 0)
2019-04-23 15:57:06,872 : INFO : deleting the raw counts dictionary of 5 items
2019-04-23 15:57:06,873 : INFO : sample=0.001 downsamples 5 most-common words
2019-04-23 15:57:06,874 : INFO : downsampling leaves estimated 0 word corpus (7.5% of prior 7)
2019-04-23 15:57:06,875 : INFO : estimated required memory for 5 words and 100 dimensions: 6500 bytes
2019-04-23 15:57:06,875 : INFO : resetting layer weights
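
The log line `downsampling leaves estimated 0 word corpus (7.5% of prior 7)` comes from Word2Vec's frequent-word subsampling (`sample=0.001`). The sketch below uses a keep-probability formula that reproduces the logged 7.5% for this corpus; I believe it mirrors what gensim computes, but treat the exact formula and the helper name `keep_probabilities` as assumptions.

```python
import math
from collections import Counter


def keep_probabilities(corpus, sample=1e-3):
    """Per-word probability of surviving subsampling (assumed formula)."""
    counts = Counter(word for sentence in corpus for word in sentence)
    total = sum(counts.values())
    threshold = sample * total
    probs = {}
    for word, count in counts.items():
        p = (math.sqrt(count / threshold) + 1) * (threshold / count)
        probs[word] = min(p, 1.0)  # rare words in large corpora are always kept
    return counts, probs


corpus_en = [["my", "name", "is", "jamie"], ["jamie", "is", "cute"]]
counts, probs = keep_probabilities(corpus_en)
total = sum(counts.values())
# Expected number of tokens that survive one pass over the corpus.
expected_kept = sum(counts[w] * probs[w] for w in counts)
print(f"{expected_kept:.2f} of {total} tokens kept "
      f"({100 * expected_kept / total:.1f}%)")
```

With this few tokens surviving, training barely updates the vectors, so similarities on such a toy corpus mostly reflect the random initialization. Passing `sample=0` to `Word2Vec` should disable subsampling for experiments this small.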


In [310]:
# Train
w2v_en.train(corpus_en, total_examples=len(corpus_en), epochs=10)

2019-04-23 15:57:07,522 : INFO : training model with 3 workers on 5 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
2019-04-23 15:57:07,525 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-04-23 15:57:07,526 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-04-23 15:57:07,526 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-04-23 15:57:07,528 : INFO : EPOCH - 1 : training on 7 raw words (0 effective words) took 0.0s, 0 effective words/s
2019-04-23 15:57:07,530 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-04-23 15:57:07,531 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-04-23 15:57:07,532 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-04-23 15:57:07,533 : INFO : EPOCH - 2 : training on 7 raw words (0 effective words) took 0.0s, 0 effective words/s
2019-04-23 15:57:07,536 : INFO : worker thread finished; awaiting fi

(2, 70)

In [311]:
# Save the trained model so that it can be reloaded and reused later.
fname = 'model_w2v_en.wv'
w2v_en.save(fname)

2019-04-23 15:57:08,226 : INFO : saving Word2Vec object under model_w2v_en.wv, separately None
2019-04-23 15:57:08,227 : INFO : not storing attribute vectors_norm
2019-04-23 15:57:08,229 : INFO : not storing attribute cum_table
2019-04-23 15:57:08,235 : INFO : saved model_w2v_en.wv


In [312]:
%ls model_w2v_en.wv

model_w2v_en.wv


In [313]:
# Load the saved model and use it
my_w2v_en = gensim.models.Word2Vec.load(fname)

2019-04-23 15:57:09,745 : INFO : loading Word2Vec object from model_w2v_en.wv
2019-04-23 15:57:09,747 : INFO : loading wv recursively from model_w2v_en.wv.wv.* with mmap=None
2019-04-23 15:57:09,748 : INFO : setting ignored attribute vectors_norm to None
2019-04-23 15:57:09,749 : INFO : loading vocabulary recursively from model_w2v_en.wv.vocabulary.* with mmap=None
2019-04-23 15:57:09,750 : INFO : loading trainables recursively from model_w2v_en.wv.trainables.* with mmap=None
2019-04-23 15:57:09,750 : INFO : setting ignored attribute cum_table to None
2019-04-23 15:57:09,751 : INFO : loaded model_w2v_en.wv


In [314]:
# Get the word vectors
vectors_en = my_w2v_en.wv

In [315]:
vectors_en['name']

array([-2.4506054e-03, -7.2212395e-05, -2.5793440e-03,  4.5915958e-03,
       -2.1344704e-04, -2.0485444e-03,  2.0943519e-03, -4.3851822e-03,
       -2.6934405e-03, -6.1185437e-04,  4.5549552e-04,  3.1820151e-03,
        4.5488086e-03, -1.1867728e-03, -1.3333425e-03, -1.7533402e-03,
        9.8656095e-04,  4.2431941e-03,  2.9202390e-03, -2.6887951e-03,
       -3.0517210e-03,  3.5098076e-04, -3.3224120e-03, -4.4679195e-03,
        4.7339909e-03,  3.8174842e-03, -3.8166842e-04, -2.9919059e-03,
       -5.7659869e-04,  4.2707864e-03,  3.4398325e-03, -3.2739765e-03,
       -4.8487340e-03,  4.2158011e-03,  3.7358804e-03,  4.2841828e-04,
       -4.7639189e-03,  4.2284880e-04,  2.6165601e-03,  1.9673620e-04,
        2.5048950e-03,  1.5577474e-03,  1.9009429e-03,  2.8468533e-03,
       -1.6710690e-03, -3.6142489e-03, -2.9028608e-03,  1.3266195e-03,
       -1.6994998e-04, -2.5732673e-03,  4.9008811e-03, -3.6815112e-03,
        3.3471847e-04,  4.3144398e-03, -1.5480816e-03, -4.4504181e-03,
      

In [316]:
# English example
labels = [word for word in vectors_en.vocab]
vectors = [vectors_en[word] for word in labels]

visualize_embeddings(vectors, labels)

2019-04-23 15:57:12,244 : INFO : starting tSNE dimensionality reduction. This may take some time.
2019-04-23 15:57:12,330 : INFO : All done. Plotting.
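
The cells here use the gensim 3.x API (`size=`, `wv.vocab`). In gensim 4 the vocabulary mapping was renamed to `key_to_index` (and `size` to `vector_size`), so listing the words needs a small shim if the notebook is rerun on a newer install. A minimal sketch, where `vocab_words` is a hypothetical helper and the `SimpleNamespace` objects only stand in for real `KeyedVectors`:

```python
from types import SimpleNamespace


def vocab_words(keyed_vectors):
    """List the vocabulary of a KeyedVectors-like object on gensim 3.x or 4.x."""
    # gensim 4.x exposes a word -> index dict named key_to_index.
    if hasattr(keyed_vectors, "key_to_index"):
        return list(keyed_vectors.key_to_index)
    # gensim 3.x exposes a word -> Vocab dict named vocab.
    return list(keyed_vectors.vocab)


# Stand-ins for the two API generations (not real gensim objects).
old_style = SimpleNamespace(vocab={"my": None, "name": None})
new_style = SimpleNamespace(key_to_index={"my": 0, "name": 1})

print(vocab_words(old_style))
print(vocab_words(new_style))
```

With it, the label line would read `labels = vocab_words(vectors_en)`.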


#### Korean example

In [317]:
print(corpus_ko[:2])

[['네', '네', '네', '네', '꿈', '아무', '전부', '있어', '꿈', '멀리', '네', '네', '하루', '말', '있어', '준비', '벌써', '너', '네', '네', '네', '네', '거짓말', '자꾸', '네', '네', '매일', '어른들', '시간', '특별', '네', '네', '네', '네', '하루', '네', '네', '네', '네', '거짓말', '자꾸', '네', '네', '꿔', '너', '거짓말', '자꾸', '네', '네'], ['너', '너', '뭣', '전부', '사랑', '사랑', '자꾸', '너무', '자꾸', '내', '내', '내', '아무', '말', '하늘', '하늘', '하늘', '내', '눈물', '더', '잘', '사랑', '자꾸', '너무', '사랑', '사랑', '수', '사랑', '자꾸', '너무']]


In [318]:
# Initialize the model
w2v_ko = gensim.models.Word2Vec(min_count=0) # e.g. Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)
# Build the model vocabulary
w2v_ko.build_vocab(corpus_ko)
# Train
w2v_ko.train(corpus_ko, total_examples=len(corpus_ko), epochs=10)
# Save the model
fname = 'model_w2v_ko.wv'
w2v_ko.save(fname)

2019-04-23 15:59:18,380 : INFO : collecting all words and their counts
2019-04-23 15:59:18,381 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2019-04-23 15:59:18,382 : INFO : collected 56 word types from a corpus of 290 raw words and 9 sentences
2019-04-23 15:59:18,382 : INFO : Loading a fresh vocabulary
2019-04-23 15:59:18,383 : INFO : min_count=0 retains 56 unique words (100% of original 56, drops 0)
2019-04-23 15:59:18,385 : INFO : min_count=0 leaves 290 word corpus (100% of original 290, drops 0)
2019-04-23 15:59:18,386 : INFO : deleting the raw counts dictionary of 56 items
2019-04-23 15:59:18,387 : INFO : sample=0.001 downsamples 56 most-common words
2019-04-23 15:59:18,389 : INFO : downsampling leaves estimated 78 word corpus (27.0% of prior 290)
2019-04-23 15:59:18,390 : INFO : estimated required memory for 56 words and 100 dimensions: 72800 bytes
2019-04-23 15:59:18,391 : INFO : resetting layer weights
2019-04-23 15:59:18,393 : INFO : training model

In [319]:
# Get the word vectors
vectors_ko = w2v_ko.wv

In [322]:
vectors_ko['꿈']

array([-4.1817571e-03, -1.9619069e-03, -6.5305247e-04,  3.9571412e-03,
       -3.7814099e-03,  3.0307663e-03, -4.2649838e-03, -1.6334316e-03,
       -1.2990105e-03,  1.7397442e-03,  4.7864942e-03, -4.4085360e-03,
       -2.3842664e-03,  4.4826619e-04,  1.3650140e-03, -4.6924055e-03,
       -1.5269438e-03, -2.3491395e-04,  2.1357939e-03,  3.6097455e-03,
        5.0640032e-03,  4.4606412e-03, -2.6770909e-03, -4.0351227e-03,
       -4.7652451e-03,  4.7990144e-03,  5.0683768e-04,  4.3097758e-03,
        9.1948477e-04,  3.2819756e-03,  2.0079310e-03,  4.4531766e-03,
        4.1890419e-03,  4.3220230e-04, -4.6784263e-03, -4.0817219e-03,
        3.4473810e-04,  4.9517304e-03,  1.3107036e-03, -1.8481718e-03,
       -2.7574315e-03,  3.4291109e-03,  3.0130681e-03,  4.3133940e-03,
        2.1797868e-03, -1.4413848e-03,  2.8493502e-03, -4.6739304e-03,
        4.5851031e-03, -4.4769822e-03, -3.4144388e-03,  9.4040454e-04,
        2.1450822e-03,  3.2715972e-03, -3.4318136e-03, -2.5561536e-03,
      

In [321]:
# Korean example
labels = [word for word in vectors_ko.vocab]
vectors = [vectors_ko[word] for word in labels]

visualize_embeddings(vectors, labels)

2019-04-23 15:59:32,369 : INFO : starting tSNE dimensionality reduction. This may take some time.
2019-04-23 15:59:32,632 : INFO : All done. Plotting.


In [336]:
w2v_ko.wv.most_similar(positive=['청춘', '이상'], topn=3)

[('우리', 0.18357814848423004),
 ('쩌렁', 0.17851543426513672),
 ('자꾸', 0.17661643028259277)]

In [337]:
w2v_ko.wv.most_similar(positive=['청춘'], negative=['이상'], topn=3)

[('너', 0.24085015058517456),
 ('매일', 0.22824066877365112),
 ('아무', 0.2158772349357605)]

In [338]:
w2v_ko.wv.most_similar(positive=['청춘', '사랑'], topn=3)

[('자꾸', 0.2426394373178482),
 ('거부', 0.16644002497196198),
 ('달콤', 0.1578788459300995)]

In [340]:
w2v_ko.wv.most_similar(positive=['청춘'], negative=['사랑'], topn=3)

[('쩌렁', 0.24117586016654968),
 ('아무', 0.17572994530200958),
 ('준비', 0.15204693377017975)]

In [344]:
w2v_ko.wv.most_similar(positive=['청춘', '사랑'], negative=['이상'], topn=3)

[('매일', 0.2140905261039734),
 ('너', 0.20670655369758606),
 ('사소', 0.16080278158187866)]

In [339]:
w2v_ko.wv.most_similar(positive=['청춘', '꿈'], topn=3)

[('너', 0.2511928081512451),
 ('쩌렁', 0.20078524947166443),
 ('매일', 0.19382727146148682)]

In [346]:
w2v_ko.wv.most_similar(positive=['청춘', '꿈'], negative=['사랑'], topn=3)

[('쩌렁', 0.2457757592201233),
 ('준비', 0.1919165849685669),
 ('너', 0.18878716230392456)]
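
The `positive`/`negative` queries above can be read as vector arithmetic: as I understand it, `most_similar` averages the unit-normalized positive vectors, subtracts the negative ones, and ranks the remaining words by cosine similarity to the result. The sketch below re-implements that idea in plain Python on made-up 2-D vectors; `most_similar_sketch` and the toy words are hypothetical, not gensim's code.

```python
import math


def unit(vector):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def most_similar_sketch(vectors, positive=(), negative=(), topn=3):
    """Rank words by cosine similarity to mean(+positive, -negative)."""
    signed = [(w, 1.0) for w in positive] + [(w, -1.0) for w in negative]
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word, sign in signed:
        for i, x in enumerate(unit(vectors[word])):
            query[i] += sign * x / len(signed)
    query = unit(query)
    exclude = set(positive) | set(negative)  # query words are never returned
    scores = [
        (word, sum(q * x for q, x in zip(query, unit(vec))))
        for word, vec in vectors.items()
        if word not in exclude
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up vectors purely for illustration.
toy = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0], "d": [-1.0, 0.0]}
print(most_similar_sketch(toy, positive=["a", "b"], topn=1))
print(most_similar_sketch(toy, positive=["c"], negative=["a"], topn=1))
```

All the scores reported above are small (roughly 0.1 to 0.25), which is what near-random vectors look like; on a corpus of 290 tokens the rankings should not be over-interpreted.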

In [323]:
w2v_ko.wv.most_similar(['청춘'])

2019-04-23 16:00:13,315 : INFO : precomputing L2-norms of word weight vectors


[('너', 0.2012811005115509),
 ('쩌렁', 0.19223423302173615),
 ('아무', 0.18791252374649048),
 ('모두', 0.18108433485031128),
 ('거부', 0.1399013102054596),
 ('있어', 0.1257329136133194),
 ('자꾸', 0.11723458021879196),
 ('매일', 0.11033092439174652),
 ('사랑', 0.10599315166473389),
 ('하늘', 0.10538944602012634)]

In [324]:
w2v_ko.wv.most_similar(['거짓말'])

[('꿔', 0.25926023721694946),
 ('멀리', 0.2127598077058792),
 ('거부', 0.21093915402889252),
 ('쩌렁', 0.1840948909521103),
 ('작업', 0.17772698402404785),
 ('매일', 0.17621026933193207),
 ('네', 0.1756512075662613),
 ('사소', 0.171067014336586),
 ('있어', 0.15519334375858307),
 ('하늘', 0.14969755709171295)]

In [325]:
w2v_ko.wv.most_similar(['눈물'])

[('내', 0.27980220317840576),
 ('함께', 0.22473911941051483),
 ('전부', 0.22231543064117432),
 ('준비', 0.20952899754047394),
 ('하늘', 0.2063641995191574),
 ('말', 0.20554494857788086),
 ('어른들', 0.19966566562652588),
 ('행복', 0.17111042141914368),
 ('거부', 0.17105409502983093),
 ('네', 0.16873550415039062)]

In [326]:
w2v_ko.wv.most_similar(['행복'])

[('꿈', 0.2412845492362976),
 ('쩌렁', 0.22397257387638092),
 ('내', 0.21039380133152008),
 ('잘', 0.19469007849693298),
 ('사소', 0.1904958337545395),
 ('네', 0.18744784593582153),
 ('매일', 0.18147645890712738),
 ('눈물', 0.17111042141914368),
 ('꿔', 0.15997441112995148),
 ('뭣', 0.13203249871730804)]

### GloVe

### FastText

#### English example

In [347]:
documents_en = [["my", "name", "is", "jamie"], ["jamie", "is", "cute"]]

In [348]:
# Word2Vec cannot compute anything for a word it never saw during training.
w2v_en.wv.most_similar("jamia")


2019-04-23 16:12:08,882 : INFO : precomputing L2-norms of word weight vectors


KeyError: "word 'jamia' not in vocabulary"

In [349]:
from gensim.models import FastText

ft_en = FastText(size=4, window=3, min_count=1)

In [350]:
ft_en.build_vocab(documents_en)

2019-04-23 16:12:18,851 : INFO : collecting all words and their counts
2019-04-23 16:12:18,852 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2019-04-23 16:12:18,853 : INFO : collected 5 word types from a corpus of 7 raw words and 2 sentences
2019-04-23 16:12:18,854 : INFO : Loading a fresh vocabulary
2019-04-23 16:12:18,855 : INFO : min_count=1 retains 5 unique words (100% of original 5, drops 0)
2019-04-23 16:12:18,856 : INFO : min_count=1 leaves 7 word corpus (100% of original 7, drops 0)
2019-04-23 16:12:18,856 : INFO : deleting the raw counts dictionary of 5 items
2019-04-23 16:12:18,857 : INFO : sample=0.001 downsamples 5 most-common words
2019-04-23 16:12:18,858 : INFO : downsampling leaves estimated 0 word corpus (7.5% of prior 7)
2019-04-23 16:12:18,859 : INFO : estimated required memory for 5 words, 40 buckets and 4 dimensions: 3860 bytes
2019-04-23 16:12:18,861 : INFO : resetting layer weights
2019-04-23 16:12:18,878 : INFO : Total number of ngram

In [351]:
ft_en.train(sentences=documents_en, total_examples=len(documents_en), epochs=10)  # train

2019-04-23 16:12:22,920 : INFO : training model with 3 workers on 5 vocabulary and 4 features, using sg=0 hs=0 sample=0.001 negative=5 window=3
2019-04-23 16:12:22,923 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-04-23 16:12:22,924 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-04-23 16:12:22,924 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-04-23 16:12:22,925 : INFO : EPOCH - 1 : training on 7 raw words (0 effective words) took 0.0s, 0 effective words/s
2019-04-23 16:12:22,928 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-04-23 16:12:22,929 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-04-23 16:12:22,930 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-04-23 16:12:22,930 : INFO : EPOCH - 2 : training on 7 raw words (0 effective words) took 0.0s, 0 effective words/s
2019-04-23 16:12:22,933 : INFO : worker thread finished; awaiting fini

In [352]:
# FastText can still score a word it never saw during training, by composing it from character n-grams.
ft_en.wv.most_similar("jamia")

2019-04-23 16:12:28,032 : INFO : precomputing L2-norms of word weight vectors
2019-04-23 16:12:28,033 : INFO : precomputing L2-norms of ngram weight vectors


[('my', 0.8944276571273804),
 ('jamie', 0.7871583104133606),
 ('name', 0.3816155791282654),
 ('is', 0.3451642394065857),
 ('cute', 0.30520403385162354)]
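
FastText can answer for the unseen word "jamia" because it represents words through character n-grams, and "jamia" shares several of them with the trained word "jamie". A minimal sketch of the n-gram extraction, assuming the usual FastText convention of wrapping the word in `<` and `>` and using lengths 3 to 6 (`char_ngrams` is a hypothetical helper, not gensim's function):

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    return {
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    }


# N-grams that the unseen word has in common with a trained word.
shared = char_ngrams("jamia") & char_ngrams("jamie")
print(sorted(shared))
```

Note that with `size=4` and almost no effective training, the ranking above ("my" ahead of "jamie") is not meaningful; the point is only that a result is returned at all.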

In [354]:
labels = [word for word in ft_en.wv.vocab]
vectors = [ft_en.wv[word] for word in labels]

visualize_embeddings(vectors, labels)

2019-04-23 16:17:28,681 : INFO : starting tSNE dimensionality reduction. This may take some time.
2019-04-23 16:17:28,819 : INFO : All done. Plotting.


#### Korean example

In [355]:
# Word2Vec cannot compute anything for a word it never saw during training.
w2v_ko.wv.most_similar("사람")




KeyError: "word '사람' not in vocabulary"

In [365]:
from gensim.models import FastText

ft_ko = FastText(size=4, window=3, min_count=1)
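# Note: documents_ko is a different input from the corpus_ko used for Word2Vec above.
# The log below counts 531 word types in 10160 raw words, versus 56 types in 290 words for corpus_ko.
ft_ko.build_vocab(documents_ko)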
ft_ko.train(sentences=documents_ko, total_examples=len(documents_ko), epochs=10)  # train

2019-04-23 16:29:44,576 : INFO : collecting all words and their counts
2019-04-23 16:29:44,578 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2019-04-23 16:29:44,580 : INFO : collected 531 word types from a corpus of 10160 raw words and 9 sentences
2019-04-23 16:29:44,581 : INFO : Loading a fresh vocabulary
2019-04-23 16:29:44,583 : INFO : min_count=1 retains 531 unique words (100% of original 531, drops 0)
2019-04-23 16:29:44,584 : INFO : min_count=1 leaves 10160 word corpus (100% of original 10160, drops 0)
2019-04-23 16:29:44,587 : INFO : deleting the raw counts dictionary of 531 items
2019-04-23 16:29:44,588 : INFO : sample=0.001 downsamples 64 most-common words
2019-04-23 16:29:44,589 : INFO : downsampling leaves estimated 4984 word corpus (49.1% of prior 10160)
2019-04-23 16:29:44,601 : INFO : estimated required memory for 531 words, 530 buckets and 4 dimensions: 320708 bytes
2019-04-23 16:29:44,602 : INFO : resetting layer weights
2019-04-23 16:29:44,

In [366]:
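# FastText can usually score a word it never saw during training,
# but this lookup fails: none of the character n-grams of "사람" exist in the model.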
ft_ko.wv.most_similar("사람")

2019-04-23 16:29:45,901 : INFO : precomputing L2-norms of word weight vectors
2019-04-23 16:29:45,902 : INFO : precomputing L2-norms of ngram weight vectors


KeyError: 'all ngrams for word 사람 absent from model'
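
This `KeyError` shows the limit of the n-gram fallback: a two-syllable word such as "사람" yields only three n-grams (`<사람`, `사람>`, `<사람>`), and if none of them was seen in training there is nothing to build a vector from. The sketch below imitates that behavior by averaging known n-gram vectors; `oov_vector` and the toy n-gram table are hypothetical, and newer gensim versions may hash n-grams into buckets instead of raising.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    return {
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    }


def oov_vector(word, ngram_vectors, min_n=3, max_n=6):
    """Average the vectors of the word's n-grams that the model knows."""
    found = [
        ngram_vectors[gram]
        for gram in char_ngrams(word, min_n, max_n)
        if gram in ngram_vectors
    ]
    if not found:
        raise KeyError(f"all ngrams for word {word} absent from model")
    dim = len(found[0])
    return [sum(vec[i] for vec in found) / len(found) for i in range(dim)]


# Made-up n-gram vectors: only n-grams of "사랑" are known.
toy_ngram_vectors = {"<사랑": [1.0, 0.0], "사랑>": [0.0, 1.0]}

print(oov_vector("사랑", toy_ngram_vectors))
try:
    oov_vector("사람", toy_ngram_vectors)
except KeyError as err:
    print("lookup failed:", err)
```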

In [367]:
labels = [word for word in ft_ko.wv.vocab]
vectors = [ft_ko.wv[word] for word in labels]

visualize_embeddings(vectors, labels)

2019-04-23 16:29:49,584 : INFO : starting tSNE dimensionality reduction. This may take some time.
2019-04-23 16:29:52,169 : INFO : All done. Plotting.


----------------------------

## Sentence Embedding

--------------------------

# References
* Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning - https://www.amazon.com/Applied-Text-Analysis-Python-Language-Aware/dp/1491963042/
* Natural Language Processing - https://www.coursera.org/learn/language-processing / Main approaches in NLP
* http://git.ajou.ac.kr/open-source-2018-spring/python_Korean_NLP/blob/master/README.md
* https://github.com/lovit/soynlp
* https://github.com/lovit/textmining-tutorial
* https://github.com/lovit/textmining-tutorial/blob/master/topics/topic4_embedding/word_document_embedding.pdf
* https://github.com/lovit/python_ml4nlp
* https://github.com/lovit/python_ml4nlp/blob/master/day5_embedding_and_visualizing/day_5_4_fasttext_gensim.ipynb