# 01.Text Vectorization (Draft)

# Contents
* Frequency Vectors
* One-Hot Encoding
* Term Frequency-Inverse Document Frequency
* Distributed Representation

-----------------------

<img src="https://nbviewer.jupyter.org/github/psygrammer/psyml/blob/master/nlp_ml/ch01/figures/cap1.3.png" width=600 />
<img src="https://nbviewer.jupyter.org/github/psygrammer/psyml/blob/master/nlp_ml/ch01/figures/cap1.4.png" width=600 />

* Source - Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning - https://www.amazon.com/Applied-Text-Analysis-Python-Language-Aware/dp/1491963042/

<img src="figures/main_nlp_01.png" width=600 />
<img src="figures/main_nlp_02.png" width=600 />
<img src="figures/main_nlp_03.png" width=600 />
<img src="figures/main_nlp_04.png" width=600 />
<img src="figures/main_nlp_05.png" width=600 />

* Source - Natural Language Processing - https://www.coursera.org/learn/language-processing - Main approaches in NLP

<img src="https://nbviewer.jupyter.org/github/psygrammer/psyml/blob/master/nlp_ml/ch04/figures/cap01.png" width=600 />

* Source - Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning - https://www.amazon.com/Applied-Text-Analysis-Python-Language-Aware/dp/1491963042/

<img src="https://slideplayer.com/slide/5270778/17/images/10/Distributed+Word+Representation.jpg" width=600 />

* Source - https://slideplayer.com/slide/5270778/

---------------------

# Frequency Vectors

<img src="https://nbviewer.jupyter.org/github/psygrammer/psyml/blob/master/nlp_ml/ch04/figures/cap02.png" width=600 />

* Source - Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning - https://www.amazon.com/Applied-Text-Analysis-Python-Language-Aware/dp/1491963042/

In [45]:
import gensim
from pprint import pprint

To use gensim, the document list documents must first be converted into a corpus (here, a list of token lists, one per document).

In [3]:
# Five toy documents
documents = [
    "a b c a",
    "c b c",
    "b b a",
    "a c c",
    "c b a",
]

In [13]:
# Split each document into words (tokens)
def tokenize(x) :
    for token in x.split() :
        yield token
        
corpus = [list(tokenize(doc)) for doc in  documents]

pprint(corpus)

[['a', 'b', 'c', 'a'],
 ['c', 'b', 'c'],
 ['b', 'b', 'a'],
 ['a', 'c', 'c'],
 ['c', 'b', 'a']]


In [14]:
# Dictionary object: assigns an integer id to each word (see the gensim tutorial for its other features)
id2word = gensim.corpora.Dictionary(corpus)

In [15]:
id2word

<gensim.corpora.dictionary.Dictionary at 0x1a174cd198>

In [16]:
id2word.token2id

{'a': 0, 'b': 1, 'c': 2}

In [17]:
vectors = [id2word.doc2bow(doc) for doc in corpus]    

In [18]:
pprint(vectors)

[[(0, 2), (1, 1), (2, 1)],
 [(1, 1), (2, 2)],
 [(0, 1), (1, 2)],
 [(0, 1), (2, 2)],
 [(0, 1), (1, 1), (2, 1)]]
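
To make the (token id, count) pairs above concrete, here is a minimal pure-Python sketch that reproduces the same bag-of-words counts without gensim and also expands them into dense count vectors. The helper names `build_token2id`, `doc2bow` and `to_dense` are hypothetical stand-ins, not gensim functions, and assigning ids in sorted order is an assumption that happens to match this toy corpus; gensim may assign ids differently on other data.

```python
from collections import Counter

def build_token2id(corpus):
    """Assign an integer id to every distinct token (sorted order is an assumption)."""
    vocab = sorted({token for doc in corpus for token in doc})
    return {token: idx for idx, token in enumerate(vocab)}

def doc2bow(doc, token2id):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())

def to_dense(bow, num_terms):
    """Expand sparse (id, count) pairs into a fixed-length count vector."""
    dense = [0] * num_terms
    for token_id, count in bow:
        dense[token_id] = count
    return dense

corpus = [doc.split() for doc in ["a b c a", "c b c", "b b a", "a c c", "c b a"]]
token2id = build_token2id(corpus)
bows = [doc2bow(doc, token2id) for doc in corpus]
dense = [to_dense(bow, len(token2id)) for bow in bows]

print(bows[0])   # sparse form of "a b c a"
print(dense[0])  # dense form of the same document
```

The sparse form only stores ids that actually occur, which is why the second document has two pairs rather than three.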


----------------

# One-Hot Encoding

<img src="https://nbviewer.jupyter.org/github/psygrammer/psyml/blob/master/nlp_ml/ch04/figures/cap03.png" width=600 />

* Source - Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning - https://www.amazon.com/Applied-Text-Analysis-Python-Language-Aware/dp/1491963042/

In [20]:
corpus

[['a', 'b', 'c', 'a'],
 ['c', 'b', 'c'],
 ['b', 'b', 'a'],
 ['a', 'c', 'c'],
 ['c', 'b', 'a']]

In [23]:
[id2word.doc2bow(doc) for doc in corpus]

[[(0, 2), (1, 1), (2, 1)],
 [(1, 1), (2, 2)],
 [(0, 1), (1, 2)],
 [(0, 1), (2, 2)],
 [(0, 1), (1, 1), (2, 1)]]

In [26]:
[[token for token in id2word.doc2bow(doc)] 
                         for doc in corpus]

[[(0, 2), (1, 1), (2, 1)],
 [(1, 1), (2, 2)],
 [(0, 1), (1, 2)],
 [(0, 1), (2, 2)],
 [(0, 1), (1, 1), (2, 1)]]

In [30]:
[[(token[0], token[1]) for token in id2word.doc2bow(doc)] 
                         for doc in corpus]

[[(0, 2), (1, 1), (2, 1)],
 [(1, 1), (2, 2)],
 [(0, 1), (1, 2)],
 [(0, 1), (2, 2)],
 [(0, 1), (1, 1), (2, 1)]]

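In [32]:
# NumPy variant of the next cell. The rows have different lengths, so dtype=object
# is needed to store the ragged list as an array of lists.
import numpy as np

vectors  = np.array([
        [(token[0], 1) for token in id2word.doc2bow(doc)]
        for doc in corpus
    ], dtype=object)

In [34]:
# Replace every count with 1 so that only the presence of a token is recorded
vectors = [[(token[0], 1) for token in id2word.doc2bow(doc)] 
                         for doc in corpus]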

In [35]:
vectors

[[(0, 1), (1, 1), (2, 1)],
 [(1, 1), (2, 1)],
 [(0, 1), (1, 1)],
 [(0, 1), (2, 1)],
 [(0, 1), (1, 1), (2, 1)]]
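
The cells above keep the sparse (id, 1) form. As a rough sketch of what the corresponding dense vectors look like, the snippet below marks each vocabulary position with 1 if the token occurs in the document and 0 otherwise. The function name `one_hot_document` is hypothetical, and the id mapping is copied by hand from the id2word.token2id output shown earlier.

```python
def one_hot_document(doc, token2id):
    """Binary presence vector: 1 if the token occurs in the document, else 0."""
    vector = [0] * len(token2id)
    for token in doc:
        vector[token2id[token]] = 1
    return vector

# Same ids as id2word.token2id in the frequency-vector section
token2id = {"a": 0, "b": 1, "c": 2}
corpus = [doc.split() for doc in ["a b c a", "c b c", "b b a", "a c c", "c b a"]]

one_hot = [one_hot_document(doc, token2id) for doc in corpus]
for doc, vector in zip(corpus, one_hot):
    print(doc, vector)
```

Unlike the frequency vectors, repeated tokens no longer change the result: "a b c a" and "c b a" map to the same vector.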

-------------------

# Term Frequency-Inverse Document Frequency

<img src="https://nbviewer.jupyter.org/github/psygrammer/psyml/blob/master/nlp_ml/ch04/figures/cap04.png" width=600 />

* Source - Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning - https://www.amazon.com/Applied-Text-Analysis-Python-Language-Aware/dp/1491963042/

In [38]:
documents

['a b c a', 'c b c', 'b b a', 'a c c', 'c b a']

In [39]:
corpus

[['a', 'b', 'c', 'a'],
 ['c', 'b', 'c'],
 ['b', 'b', 'a'],
 ['a', 'c', 'c'],
 ['c', 'b', 'a']]

In [41]:
# Create the Dictionary
id2word = gensim.corpora.Dictionary(corpus)

In [43]:
# Create the TF-IDF model
tfidf = gensim.models.TfidfModel(dictionary=id2word, normalize=True)
tfidf

<gensim.models.tfidfmodel.TfidfModel at 0x1a174d4940>

In [44]:
vectors = [tfidf[id2word.doc2bow(vector)] for vector in corpus]
vectors

[[(0, 0.816496580927726), (1, 0.408248290463863), (2, 0.408248290463863)],
 [(1, 0.447213595499958), (2, 0.894427190999916)],
 [(0, 0.447213595499958), (1, 0.894427190999916)],
 [(0, 0.447213595499958), (2, 0.894427190999916)],
 [(0, 0.5773502691896257), (1, 0.5773502691896257), (2, 0.5773502691896257)]]
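
The numbers above can be checked by hand. The sketch below assumes what I understand to be gensim's default weighting: raw term count times log2(N / df), followed by L2 normalization of each document vector. The function name `tfidf_vectors` is hypothetical. In this toy corpus every token appears in 4 of the 5 documents, so all idf values are equal and the normalized weights depend only on the term counts.

```python
import math
from collections import Counter

def tfidf_vectors(corpus, token2id):
    """TF-IDF with tf = raw count, idf = log2(N / df), then L2 normalization (assumed defaults)."""
    n_docs = len(corpus)
    # Document frequency: in how many documents each token occurs
    df = Counter(token for doc in corpus for token in set(doc))
    idf = {token: math.log2(n_docs / df[token]) for token in df}

    vectors = []
    for doc in corpus:
        tf = Counter(doc)
        weights = {token2id[t]: tf[t] * idf[t] for t in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm == 0:
            vectors.append([])  # empty document, or every weight is zero
            continue
        vectors.append(sorted((i, w / norm) for i, w in weights.items()))
    return vectors

token2id = {"a": 0, "b": 1, "c": 2}
corpus = [doc.split() for doc in ["a b c a", "c b c", "b b a", "a c c", "c b a"]]
vectors = tfidf_vectors(corpus, token2id)
for vector in vectors:
    print(vector)
```

For the first document the counts are (2, 1, 1), so the normalized weights are 2/sqrt(6) and 1/sqrt(6), which agrees with the first row of the gensim output.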

In [46]:
# Saving the Dictionary and the tfidf model lets you load and reuse them later.
id2word.save_as_text('test.txt')
tfidf.save('tfidf.pkl')

In [47]:
%ls

01_text_features.ipynb  test.txt
figures/                tfidf.pkl


#### Korean example walkthrough

* Source - 나만의 웹 크롤러 만들기 with Requests/BeautifulSoup - https://beomi.github.io/2017/01/20/HowToMakeWebCrawler/
* Source - https://www.slideshare.net/kimhyunjoonglovit/pycon2017-koreannlp
* Source - https://github.com/lovit/soynlp/blob/master/tutorials/nounextractor-v2_usage.ipynb
* Source - https://wikidocs.net/24603

In [1]:
import requests
from bs4 import BeautifulSoup

In [2]:
url = "https://gasazip.com/view.html?no=614736"
#url = "https://gasazip.com/view.html?no=636135"

In [3]:
# HTTP GET Request
req = requests.get(url)
# Get the HTML source
html = req.text
# Convert the HTML source into a Python object with BeautifulSoup.
# The first argument is the HTML source; the second names the parser to use.
# Here we use Python's built-in html.parser.
soup = BeautifulSoup(html, 'html.parser')

In [4]:
lyrics = []
for txt in soup.find_all('div', attrs={'class': 'col-md-8'}) :
    lines = txt.get_text().split('\n')
    for line in lines :
        lyrics.append(line.strip())
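
The loop above also keeps blank lines as empty strings, and each of those later turns into an empty document. If that is not wanted, one option is to drop them while splitting. The sketch below uses a hypothetical helper `clean_lines` on a small hand-written sample rather than the scraped page. Note that the rest of this notebook indexes into the unfiltered list, so applying the filter would shift those indices.

```python
def clean_lines(text):
    """Split raw text into stripped lines, dropping lines that are empty after stripping."""
    return [line.strip() for line in text.split("\n") if line.strip()]

# Hand-written sample standing in for the text returned by get_text()
sample_text = "CHEER UP\n\n  매일 울리는 벨벨벨  \n이젠 나를 배려 해줘\n"
cleaned = clean_lines(sample_text)
print(cleaned)
```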

In [28]:
lyrics

['CHEER UP',
 '',
 '매일 울리는 벨벨벨',
 '이젠 나를 배려 해줘',
 '배터리 낭비하긴 싫어',
 '자꾸만 봐 자꾸 자꾸만 와',
 '전화가 펑 터질 것만 같아',
 '몰라 몰라 숨도 못 쉰대',
 '나 때문에 힘들어',
 '쿵 심장이 떨어진대 왜',
 '걔 말은 나 너무 예쁘대',
 '자랑 하는건 아니구',
 '아 아까는 못 받아서 미안해',
 '친구를 만나느라 shy shy shy',
 '만나긴 좀 그렇구 미안해',
 '좀 있다 연락할게 later',
 '조르지마 얼마 가지 않아',
 '부르게 해줄게 Baby',
 '아직은 좀 일러 내 맘 같긴 일러',
 '하지만 더 보여줄래',
 'CHEER UP BABY CHEER UP BABY',
 '좀 더 힘을 내',
 '여자가 쉽게 맘을 주면 안돼',
 '그래야 니가 날 더 좋아하게 될걸',
 '태연하게 연기할래 아무렇지 않게',
 '내가 널 좋아하는 맘 모르게',
 'just get it together',
 'and then baby CHEER UP',
 '안절부절 목소리가',
 '여기까지 들려',
 '땀에 젖은 전화기가',
 '여기서도 보여',
 '바로 바로 대답하는 것도',
 '매력 없어',
 '메시지만 읽고',
 '확인 안 하는 건 기본',
 '어어어 너무 심했나 boy',
 '이러다가 지칠까 봐',
 '걱정되긴 하고',
 '어어어 안 그러면 내가 더',
 '빠질 것만 같어 빠질 것만 같어',
 '아 답장을 못해줘서 미안해',
 '친구를 만나느라 shy shy shy',
 '만나긴 좀 그렇구 미안해',
 '좀 있다 연락할게 later',
 '조르지마 어디 가지 않아',
 '되어줄게 너의 Baby',
 '너무 빨린 싫어 성의를 더 보여',
 '내가 널 기다려줄게',
 'CHEER UP BABY CHEER UP BABY',
 '좀 더 힘을 내',
 '여자가 쉽게 맘을 주면 안돼',
 '그래야 니가 날 더 좋아하게 될걸',
 '태연하게 연기할래 아무렇지 않게',
 '내가 널 좋아하는 맘 모르

In [29]:
# We would like to tidy up each line using morphological analysis.
sent = lyrics[2]
sent

'매일 울리는 벨벨벨'

In [30]:
from soynlp.noun import LRNounExtractor_v2

noun_extractor = LRNounExtractor_v2(verbose=True)

[Noun Extractor] use default predictors
[Noun Extractor] num features: pos=1260, neg=1173, common=12


In [31]:
# Training-based word (noun) extractor
nouns = noun_extractor.train_extract(lyrics)

[Noun Extractor] counting eojeols
[EojeolCounter] n eojeol = 162 from 79 sents. mem=0.113 Gb                    
[Noun Extractor] complete eojeol counter -> lr graph
[Noun Extractor] has been trained. #eojeols=331, mem=0.113 Gb
[Noun Extractor] batch prediction was completed for 48 words
[Noun Extractor] checked compounds. discovered 0 compounds
[Noun Extractor] postprocessing detaching_features : 11 -> 11
[Noun Extractor] postprocessing ignore_features : 11 -> 10
[Noun Extractor] postprocessing ignore_NJ : 10 -> 10
[Noun Extractor] 10 nouns (0 compounds) with min frequency=1
[Noun Extractor] flushing was done. mem=0.113 Gb                    
[Noun Extractor] 13.60 % eojeols are covered


In [32]:
len(nouns)

10

In [33]:
nouns

{'여자': NounScore(frequency=3, score=1.0),
 '여기': NounScore(frequency=1, score=1.0),
 '자꾸': NounScore(frequency=3, score=1.0),
 '좋아': NounScore(frequency=4, score=1.0),
 '내': NounScore(frequency=9, score=1.0),
 '안': NounScore(frequency=5, score=0.8333333333333334),
 '것': NounScore(frequency=4, score=1.0),
 '힘': NounScore(frequency=3, score=0.75),
 '맘': NounScore(frequency=7, score=1.0),
 '심': NounScore(frequency=1, score=0.5)}

In [34]:
from soynlp.tokenizer import MaxScoreTokenizer

In [35]:
scores = {w:s.score for w, s in nouns.items()}
scores

{'여자': 1.0,
 '여기': 1.0,
 '자꾸': 1.0,
 '좋아': 1.0,
 '내': 1.0,
 '안': 0.8333333333333334,
 '것': 1.0,
 '힘': 0.75,
 '맘': 1.0,
 '심': 0.5}

In [36]:
tokenizer = MaxScoreTokenizer(scores=scores)

In [39]:
for sent in lyrics[:10] :
    print(sent)
    tokens = tokenizer.tokenize(sent)
    print(tokens)

CHEER UP
['CHEER', 'UP']

[]
매일 울리는 벨벨벨
['매일', '울리는', '벨벨벨']
이젠 나를 배려 해줘
['이젠', '나를', '배려', '해줘']
배터리 낭비하긴 싫어
['배터리', '낭비하긴', '싫어']
자꾸만 봐 자꾸 자꾸만 와
['자꾸', '만', '봐', '자꾸', '자꾸', '만', '와']
전화가 펑 터질 것만 같아
['전화가', '펑', '터질', '것만', '같아']
몰라 몰라 숨도 못 쉰대
['몰라', '몰라', '숨도', '못', '쉰대']
나 때문에 힘들어
['나', '때문에', '힘들어']
쿵 심장이 떨어진대 왜
['쿵', '심장이', '떨어진대', '왜']


In [46]:
# Function that splits text into words (tokens)
def tokenize(x, tokenizer) :
    for token in tokenizer.tokenize(x) :
        yield token
        
# Build the corpus
corpus = [list(tokenize(sent, tokenizer)) for sent in lyrics]

pprint(corpus)

[['CHEER', 'UP'],
 [],
 ['매일', '울리는', '벨벨벨'],
 ['이젠', '나를', '배려', '해줘'],
 ['배터리', '낭비하긴', '싫어'],
 ['자꾸', '만', '봐', '자꾸', '자꾸', '만', '와'],
 ['전화가', '펑', '터질', '것만', '같아'],
 ['몰라', '몰라', '숨도', '못', '쉰대'],
 ['나', '때문에', '힘들어'],
 ['쿵', '심장이', '떨어진대', '왜'],
 ['걔', '말은', '나', '너무', '예쁘대'],
 ['자랑', '하는건', '아니구'],
 ['아', '아까는', '못', '받아서', '미안해'],
 ['친구를', '만나느라', 'shy', 'shy', 'shy'],
 ['만나긴', '좀', '그렇구', '미안해'],
 ['좀', '있다', '연락할게', 'later'],
 ['조르지마', '얼마', '가지', '않아'],
 ['부르게', '해줄게', 'Baby'],
 ['아직은', '좀', '일러', '내', '맘', '같긴', '일러'],
 ['하지만', '더', '보여줄래'],
 ['CHEER', 'UP', 'BABY', 'CHEER', 'UP', 'BABY'],
 ['좀', '더', '힘을', '내'],
 ['여자', '가', '쉽게', '맘을', '주면', '안돼'],
 ['그래야', '니가', '날', '더', '좋아', '하게', '될걸'],
 ['태연하게', '연기할래', '아무렇지', '않게'],
 ['내가', '널', '좋아', '하는', '맘', '모르게'],
 ['just', 'get', 'it', 'together'],
 ['and', 'then', 'baby', 'CHEER', 'UP'],
 ['안절부절', '목소리가'],
 ['여기', '까지', '들려'],
 ['땀에', '젖은', '전화기가'],
 ['여기', '서도', '보여'],
 ['바로', '바로', '대답하는', '것도'],
 ['매력', '없어'],
 ['메시지만'

In [56]:
# Create the Dictionary
dictionary = gensim.corpora.Dictionary(corpus)

In [140]:
# Create the TF-IDF model
tfidf = gensim.models.TfidfModel(dictionary=dictionary, normalize=True)

In [141]:
vectors = [tfidf[dictionary.doc2bow(vector)] for vector in corpus]
vectors[:10]

[[(0, 0.7071067811865475), (1, 0.7071067811865475)],
 [],
 [(2, 0.5773502691896257), (3, 0.5773502691896257), (4, 0.5773502691896257)],
 [(5, 0.5), (6, 0.5), (7, 0.5), (8, 0.5)],
 [(9, 0.6076927837595194), (10, 0.6076927837595194), (11, 0.5112914639745239)],
 [(12, 0.5241360043138908),
  (13, 0.1961761235128819),
  (14, 0.2620680021569454),
  (15, 0.7862040064708362)],
 [(16, 0.4608786708968913),
  (17, 0.3877672018740884),
  (18, 0.4608786708968913),
  (19, 0.4608786708968913),
  (20, 0.4608786708968913)],
 [(21, 0.7722125641071458),
  (22, 0.3248563278629508),
  (23, 0.3861062820535729),
  (24, 0.3861062820535729)],
 [(25, 0.5112914639745239),
  (26, 0.6076927837595194),
  (27, 0.6076927837595194)],
 [(28, 0.5), (29, 0.5), (30, 0.5), (31, 0.5)]]

In [142]:
# Vector similarity based on tf-idf
from gensim import similarities

In [143]:
# Note: despite its name, this returns a similarity score scaled to a percentage, not a distance.
def distance(a, b, dic) :
    index = similarities.MatrixSimilarity([a],num_features=len(dic))
    sim = index[b]
    return sim[0]*100

In [144]:
a = 0
print("A: ", corpus[a])
print(vectors[a])

b = 3
print("B: ", corpus[b])
print(vectors[b])

sim = distance(vectors[a], vectors[b], dictionary)

print(round(sim,2),'% similar')

A:  ['CHEER', 'UP']
[(0, 0.7071067811865475), (1, 0.7071067811865475)]
B:  ['이젠', '나를', '배려', '해줘']
[(5, 0.5), (6, 0.5), (7, 0.5), (8, 0.5)]
0.0 % similar


In [145]:
a = 3
print("A: ", corpus[a])
print(vectors[a])

b = 3
print("B: ", corpus[b])
print(vectors[b])

sim = distance(vectors[a], vectors[b], dictionary)

print(round(sim,2),'% similar')

A:  ['이젠', '나를', '배려', '해줘']
[(5, 0.5), (6, 0.5), (7, 0.5), (8, 0.5)]
B:  ['이젠', '나를', '배려', '해줘']
[(5, 0.5), (6, 0.5), (7, 0.5), (8, 0.5)]
100.0 % similar


In [150]:
a = 5
print("A: ", corpus[a])
print(vectors[a])

b = 25
print("B: ", corpus[b])
print(vectors[b])

sim = distance(vectors[a], vectors[b], dictionary)

print(round(sim,2),'% similar')

A:  ['자꾸', '만', '봐', '자꾸', '자꾸', '만', '와']
[(12, 0.5241360043138908), (13, 0.1961761235128819), (14, 0.2620680021569454), (15, 0.7862040064708362)]
B:  ['내가', '널', '좋아', '하는', '맘', '모르게']
[(61, 0.418187890302657), (79, 0.3397391794282149), (85, 0.3869069216131805), (86, 0.418187890302657), (87, 0.4585160729809741), (88, 0.418187890302657)]
0.0 % similar
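
The 0.0 % results above follow directly from how cosine similarity treats sparse vectors: if two documents share no token ids, their dot product is zero. The sketch below is a plain-Python illustration, on the assumption that the MatrixSimilarity index used above computes cosine similarity; `sparse_cosine` is a hypothetical helper, and the two vectors are copied from the output above.

```python
import math

def sparse_cosine(a, b):
    """Cosine similarity between two sparse vectors given as (id, weight) pairs."""
    weights_a = dict(a)
    weights_b = dict(b)
    # Only ids present in both vectors contribute to the dot product
    dot = sum(w * weights_b[i] for i, w in weights_a.items() if i in weights_b)
    norm_a = math.sqrt(sum(w * w for w in weights_a.values()))
    norm_b = math.sqrt(sum(w * w for w in weights_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_a * norm_b)

doc_a = [(0, 0.7071067811865475), (1, 0.7071067811865475)]  # ['CHEER', 'UP']
doc_b = [(5, 0.5), (6, 0.5), (7, 0.5), (8, 0.5)]            # ['이젠', '나를', '배려', '해줘']

print(round(sparse_cosine(doc_a, doc_b) * 100, 2), '% similar')
print(round(sparse_cosine(doc_b, doc_b) * 100, 2), '% similar')
```

This is also the main limitation of tf-idf on short lyric lines: two lines with related meaning but no shared tokens always score zero, which motivates the distributed representations in the next section.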


-------------------

# Distributed Representation

* Document Embedding
    - Doc2Vec
* Word Embedding
    - Word2Vec
    - GloVe
    - FastText
* Sentence Embedding

<img src="https://nbviewer.jupyter.org/github/psygrammer/psyml/blob/master/nlp_ml/ch04/figures/cap05.png" width=600 />

* Source - Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning - https://www.amazon.com/Applied-Text-Analysis-Python-Language-Aware/dp/1491963042/

-----------------------------------

## Document Embedding
* Doc2Vec

### Doc2Vec

In [151]:
from gensim.models.doc2vec import TaggedDocument, Doc2Vec

In [152]:
corpus

[['CHEER', 'UP'],
 [],
 ['매일', '울리는', '벨벨벨'],
 ['이젠', '나를', '배려', '해줘'],
 ['배터리', '낭비하긴', '싫어'],
 ['자꾸', '만', '봐', '자꾸', '자꾸', '만', '와'],
 ['전화가', '펑', '터질', '것만', '같아'],
 ['몰라', '몰라', '숨도', '못', '쉰대'],
 ['나', '때문에', '힘들어'],
 ['쿵', '심장이', '떨어진대', '왜'],
 ['걔', '말은', '나', '너무', '예쁘대'],
 ['자랑', '하는건', '아니구'],
 ['아', '아까는', '못', '받아서', '미안해'],
 ['친구를', '만나느라', 'shy', 'shy', 'shy'],
 ['만나긴', '좀', '그렇구', '미안해'],
 ['좀', '있다', '연락할게', 'later'],
 ['조르지마', '얼마', '가지', '않아'],
 ['부르게', '해줄게', 'Baby'],
 ['아직은', '좀', '일러', '내', '맘', '같긴', '일러'],
 ['하지만', '더', '보여줄래'],
 ['CHEER', 'UP', 'BABY', 'CHEER', 'UP', 'BABY'],
 ['좀', '더', '힘을', '내'],
 ['여자', '가', '쉽게', '맘을', '주면', '안돼'],
 ['그래야', '니가', '날', '더', '좋아', '하게', '될걸'],
 ['태연하게', '연기할래', '아무렇지', '않게'],
 ['내가', '널', '좋아', '하는', '맘', '모르게'],
 ['just', 'get', 'it', 'together'],
 ['and', 'then', 'baby', 'CHEER', 'UP'],
 ['안절부절', '목소리가'],
 ['여기', '까지', '들려'],
 ['땀에', '젖은', '전화기가'],
 ['여기', '서도', '보여'],
 ['바로', '바로', '대답하는', '것도'],
 ['매력', '없어'],
 ['메시지만'

In [153]:
docs   = [ 
    TaggedDocument(words, ['d{}'.format(idx)])
        for idx, words in enumerate(corpus)
]

In [154]:
docs

[TaggedDocument(words=['CHEER', 'UP'], tags=['d0']),
 TaggedDocument(words=[], tags=['d1']),
 TaggedDocument(words=['매일', '울리는', '벨벨벨'], tags=['d2']),
 TaggedDocument(words=['이젠', '나를', '배려', '해줘'], tags=['d3']),
 TaggedDocument(words=['배터리', '낭비하긴', '싫어'], tags=['d4']),
 TaggedDocument(words=['자꾸', '만', '봐', '자꾸', '자꾸', '만', '와'], tags=['d5']),
 TaggedDocument(words=['전화가', '펑', '터질', '것만', '같아'], tags=['d6']),
 TaggedDocument(words=['몰라', '몰라', '숨도', '못', '쉰대'], tags=['d7']),
 TaggedDocument(words=['나', '때문에', '힘들어'], tags=['d8']),
 TaggedDocument(words=['쿵', '심장이', '떨어진대', '왜'], tags=['d9']),
 TaggedDocument(words=['걔', '말은', '나', '너무', '예쁘대'], tags=['d10']),
 TaggedDocument(words=['자랑', '하는건', '아니구'], tags=['d11']),
 TaggedDocument(words=['아', '아까는', '못', '받아서', '미안해'], tags=['d12']),
 TaggedDocument(words=['친구를', '만나느라', 'shy', 'shy', 'shy'], tags=['d13']),
 TaggedDocument(words=['만나긴', '좀', '그렇구', '미안해'], tags=['d14']),
 TaggedDocument(words=['좀', '있다', '연락할게', 'later'], tags=['d

In [155]:
model = Doc2Vec(docs, min_count=0)

In [156]:
vectors = model.docvecs

In [157]:
for i in range(3) :
    print(vectors[i])

[-1.3605299e-03 -4.9551218e-03  6.4320973e-04 -3.1982341e-03
  4.4268142e-03  3.4082234e-03 -5.5610308e-05 -3.3584442e-03
 -3.4386558e-03  3.9370586e-03 -3.2208264e-03 -2.0945859e-03
 -5.5775727e-04  4.6909028e-03 -8.3822437e-04 -2.6412073e-03
 -4.9597463e-03  3.0055246e-03  4.8844487e-04 -9.9924020e-04
 -4.4104201e-03  4.5410311e-03 -1.4134985e-05  2.0951689e-03
  1.6765220e-03 -2.1535123e-03 -4.9690870e-03 -2.7008341e-03
 -4.3723034e-03 -2.1838325e-03 -4.7371862e-03  2.5483817e-03
 -4.9351221e-03 -1.3243871e-04 -7.3936710e-04 -1.4161090e-04
 -4.9034525e-03 -2.5488301e-03  4.6267419e-04  4.4306791e-03
  1.7403271e-03  4.8091557e-04  4.6775737e-03  2.4558259e-03
 -6.6190143e-04  3.7965947e-03  1.4651486e-03 -2.3283160e-03
  4.6461667e-03 -2.4500517e-03  3.2302451e-03  2.2956124e-03
 -3.2030870e-03 -1.7275230e-03 -3.1310250e-03 -3.4666355e-03
 -2.8409101e-03 -3.6558085e-03 -7.7481044e-04 -9.1449899e-04
 -7.2874362e-04 -4.0982035e-03  9.3519141e-04  4.9596098e-03
 -4.0754857e-03 -2.05284

In [163]:
# Vector similarity based on doc2vec
def distance(a_doctag, b_doctag, vectors) :
    sim = vectors.similarity(a_doctag, b_doctag)
    return sim*100

In [165]:
a = 0
a_doctag = vectors.index2entity[a]
print("A: ", corpus[a], a_doctag)
print(vectors[a])

b = 3
b_doctag = vectors.index2entity[b]
print("B: ", corpus[b], b_doctag)
print(vectors[b])

sim = distance(a_doctag, b_doctag, vectors)

print(round(sim,2),'% similar')

A:  ['CHEER', 'UP'] d0
[-1.3605299e-03 -4.9551218e-03  6.4320973e-04 -3.1982341e-03
  4.4268142e-03  3.4082234e-03 -5.5610308e-05 -3.3584442e-03
 -3.4386558e-03  3.9370586e-03 -3.2208264e-03 -2.0945859e-03
 -5.5775727e-04  4.6909028e-03 -8.3822437e-04 -2.6412073e-03
 -4.9597463e-03  3.0055246e-03  4.8844487e-04 -9.9924020e-04
 -4.4104201e-03  4.5410311e-03 -1.4134985e-05  2.0951689e-03
  1.6765220e-03 -2.1535123e-03 -4.9690870e-03 -2.7008341e-03
 -4.3723034e-03 -2.1838325e-03 -4.7371862e-03  2.5483817e-03
 -4.9351221e-03 -1.3243871e-04 -7.3936710e-04 -1.4161090e-04
 -4.9034525e-03 -2.5488301e-03  4.6267419e-04  4.4306791e-03
  1.7403271e-03  4.8091557e-04  4.6775737e-03  2.4558259e-03
 -6.6190143e-04  3.7965947e-03  1.4651486e-03 -2.3283160e-03
  4.6461667e-03 -2.4500517e-03  3.2302451e-03  2.2956124e-03
 -3.2030870e-03 -1.7275230e-03 -3.1310250e-03 -3.4666355e-03
 -2.8409101e-03 -3.6558085e-03 -7.7481044e-04 -9.1449899e-04
 -7.2874362e-04 -4.0982035e-03  9.3519141e-04  4.9596098e-03
 

In [166]:
a = 3
a_doctag = vectors.index2entity[a]
print("A: ", corpus[a], a_doctag)
print(vectors[a])

b = 3
b_doctag = vectors.index2entity[b]
print("B: ", corpus[b], b_doctag)
print(vectors[b])

sim = distance(a_doctag, b_doctag, vectors)

print(round(sim,2),'% similar')

A:  ['이젠', '나를', '배려', '해줘'] d3
[ 3.2491328e-03  7.1090215e-04 -2.7161281e-05 -2.3146234e-03
 -1.4069683e-03 -4.8270924e-03 -1.9606224e-03  3.9852923e-03
 -1.4496733e-03  8.5326307e-04  1.2165764e-03 -3.3238190e-03
  3.1587514e-03  9.1244234e-04 -2.4789849e-03  4.5363382e-03
  4.2812927e-03  3.3755128e-03 -4.9798000e-03 -7.8511442e-04
  2.3399035e-03 -2.9274377e-03  1.4618956e-03  2.4535176e-03
  1.6874820e-03 -8.7828335e-04 -7.6996192e-04 -4.6177073e-03
  3.4920970e-04  1.3129411e-03 -5.1112944e-04 -2.3980963e-03
 -8.1497239e-04 -1.0307080e-03 -2.1936453e-03  6.8878224e-05
  1.8026853e-03 -4.6162084e-03 -3.5915363e-03 -2.7602266e-03
 -3.1288078e-03  1.2767917e-03 -4.0043755e-03 -2.1624153e-03
  6.8267173e-04 -1.6978015e-03 -2.1708200e-03  3.9177979e-04
  3.6089690e-03  1.3884509e-03  1.9688446e-03  4.3411446e-03
 -3.3373185e-03  2.3367659e-03 -3.7528847e-03 -3.3241382e-03
  4.0214657e-04  4.1026366e-03 -4.2420798e-03 -3.1949137e-03
  3.7420453e-03  3.9839162e-03  6.6488312e-04 -1.0751

In [167]:
a = 5
a_doctag = vectors.index2entity[a]
print("A: ", corpus[a], a_doctag)
print(vectors[a])

b = 25
b_doctag = vectors.index2entity[b]
print("B: ", corpus[b], b_doctag)
print(vectors[b])

sim = distance(a_doctag, b_doctag, vectors)

print(round(sim,2),'% similar')

A:  ['자꾸', '만', '봐', '자꾸', '자꾸', '만', '와'] d5
[ 4.9744649e-03  1.3063629e-03  4.3983497e-03 -2.6128520e-03
 -2.6264319e-03  1.5924338e-03  5.0056553e-03  1.8458734e-03
  9.0326869e-04 -1.6680191e-04  1.7456752e-03  4.1497886e-04
  4.5265998e-03  1.2190437e-03 -3.0204274e-03 -1.6526962e-03
 -2.2982252e-03 -2.5420394e-03  2.0805870e-03  3.3668738e-03
  3.3744788e-03  5.7537219e-04  6.5630517e-04  1.4987971e-03
 -4.4436618e-03 -2.1766955e-03 -3.1012702e-03 -2.8978421e-03
  6.7165575e-04  1.3199238e-03  1.8868636e-03  4.0666522e-03
  9.1044401e-04 -2.4786401e-03 -2.9423998e-05 -1.5440270e-03
  4.5353626e-03  4.4620582e-03 -9.8900324e-05  1.8128047e-03
  2.9826069e-03 -4.6059079e-03 -4.2957631e-03 -4.3336581e-03
 -1.8758045e-03  2.9721744e-03 -1.9718010e-03 -4.8088743e-03
  4.7251517e-03  1.4109866e-03 -2.3168100e-03  3.5558599e-03
  1.0009286e-03  4.4028573e-03 -4.8340266e-03 -3.4305216e-03
 -1.5632069e-03  4.3344107e-03 -4.0552816e-03 -2.3120465e-03
 -3.0983952e-03  2.5032803e-03 -4.18901

---------------------

## Word Embedding
* Word2Vec
* GloVe
* FastText

### Word2Vec

* Source - models.word2vec – Deep learning with word2vec - https://radimrehurek.com/gensim/models/word2vec.html

#### English example

In [223]:
# Data
documents_en = [["my", "name", "is", "jamie"], ["jamie", "is", "cute"]]

pprint(documents_en)

[['my', 'name', 'is', 'jamie'], ['jamie', 'is', 'cute']]


In [224]:
# Initialize the model
w2v_en = gensim.models.Word2Vec(min_count=1) # e.g. Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)

In [225]:
# Build the model vocabulary
w2v_en.build_vocab(documents_en)

In [226]:
# Train
w2v_en.train(documents_en, total_examples=len(documents_en), epochs=10)

(2, 70)
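
As far as I can tell, the tuple returned by train is (effective words trained, raw words seen): 7 tokens over 10 epochs gives the 70, and most of them are dropped by subsampling on a corpus this small. To show what Word2Vec learns from, the sketch below lists the (center, context) pairs inside a sliding window. It is a simplified illustration only: `context_pairs` is a hypothetical helper, and gensim's real training loop additionally applies subsampling, randomly reduced windows and negative sampling.

```python
def context_pairs(sentences, window=2):
    """Yield (center, context) pairs for every token and its neighbours within `window`."""
    for sentence in sentences:
        for i, center in enumerate(sentence):
            start = max(0, i - window)
            end = min(len(sentence), i + window + 1)
            for j in range(start, end):
                if j != i:
                    yield center, sentence[j]

documents_en = [["my", "name", "is", "jamie"], ["jamie", "is", "cute"]]
pairs = list(context_pairs(documents_en, window=1))
for pair in pairs:
    print(pair)
```

With only ten such pairs there is very little signal, so the vectors printed in this section stay close to their random initialization.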

In [227]:
# Save and load the trained model - useful when you want to reuse the model later.
fname = 'model_w2v_en.wv'
w2v_en.save(fname)

In [228]:
%ls model_w2v_en.wv

model_w2v_en.wv


In [229]:
my_w2v_en = gensim.models.Word2Vec.load(fname)

In [230]:
# Get the vector for a word
vectors_en = my_w2v_en.wv

In [231]:
vectors_en['name']

array([ 1.8463916e-03, -2.4474319e-03, -1.3874241e-03,  4.8990846e-03,
        2.7666870e-03, -2.5682717e-03, -1.0529022e-03,  1.5984683e-03,
        3.5162780e-03, -4.3272837e-03, -7.5240212e-04,  3.5976199e-03,
        4.1191247e-03, -4.1965623e-03, -4.7344430e-03,  3.0092369e-03,
        4.6412502e-03,  4.8348550e-03, -1.1275733e-03,  8.3893735e-04,
        2.7623824e-03,  3.0995489e-03,  4.0115002e-03, -2.3493078e-03,
       -2.7641570e-03, -3.0821294e-03,  2.0122917e-03,  4.8594424e-03,
        1.1204177e-03,  1.1255337e-03, -1.6868635e-03,  1.0017377e-03,
       -5.1385420e-04, -1.0345953e-03,  8.1620106e-05, -3.5002567e-03,
       -2.5435812e-03,  2.7102412e-04,  2.2499840e-04,  4.0757204e-03,
        1.3699288e-03, -3.1554259e-03,  4.6244864e-03, -2.2115754e-03,
        3.0307646e-03,  4.8729051e-03,  2.5988133e-03,  4.5161145e-03,
        4.4576917e-03, -9.9189021e-04, -3.4095140e-03, -1.6049502e-03,
        7.8812923e-04, -1.2764394e-03,  9.1111817e-04,  1.2262160e-03,
      

#### Korean example

In [217]:
documents_ko = [list(tokenize(sent, tokenizer)) for sent in lyrics]

pprint(documents_ko)

[['CHEER', 'UP'],
 [],
 ['매일', '울리는', '벨벨벨'],
 ['이젠', '나를', '배려', '해줘'],
 ['배터리', '낭비하긴', '싫어'],
 ['자꾸', '만', '봐', '자꾸', '자꾸', '만', '와'],
 ['전화가', '펑', '터질', '것만', '같아'],
 ['몰라', '몰라', '숨도', '못', '쉰대'],
 ['나', '때문에', '힘들어'],
 ['쿵', '심장이', '떨어진대', '왜'],
 ['걔', '말은', '나', '너무', '예쁘대'],
 ['자랑', '하는건', '아니구'],
 ['아', '아까는', '못', '받아서', '미안해'],
 ['친구를', '만나느라', 'shy', 'shy', 'shy'],
 ['만나긴', '좀', '그렇구', '미안해'],
 ['좀', '있다', '연락할게', 'later'],
 ['조르지마', '얼마', '가지', '않아'],
 ['부르게', '해줄게', 'Baby'],
 ['아직은', '좀', '일러', '내', '맘', '같긴', '일러'],
 ['하지만', '더', '보여줄래'],
 ['CHEER', 'UP', 'BABY', 'CHEER', 'UP', 'BABY'],
 ['좀', '더', '힘을', '내'],
 ['여자', '가', '쉽게', '맘을', '주면', '안돼'],
 ['그래야', '니가', '날', '더', '좋아', '하게', '될걸'],
 ['태연하게', '연기할래', '아무렇지', '않게'],
 ['내가', '널', '좋아', '하는', '맘', '모르게'],
 ['just', 'get', 'it', 'together'],
 ['and', 'then', 'baby', 'CHEER', 'UP'],
 ['안절부절', '목소리가'],
 ['여기', '까지', '들려'],
 ['땀에', '젖은', '전화기가'],
 ['여기', '서도', '보여'],
 ['바로', '바로', '대답하는', '것도'],
 ['매력', '없어'],
 ['메시지만'

In [218]:
# Initialize the model
w2v_ko = gensim.models.Word2Vec(min_count=1) # e.g. Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)
# Build the model vocabulary
w2v_ko.build_vocab(documents_ko)
# Train
w2v_ko.train(documents_ko, total_examples=len(documents_ko), epochs=10)
# Save the model
fname = 'model_w2v_ko.wv'
w2v_ko.save(fname)

In [221]:
# Get the vector for a word
vectors_ko = w2v_ko.wv

In [222]:
vectors_ko['조르지마']

array([ 0.00299507,  0.00389504, -0.00402283, -0.00496571, -0.00286623,
        0.0009864 , -0.0042139 ,  0.00134176,  0.00015939, -0.00359018,
        0.00363198,  0.00025816,  0.00398525,  0.00473751,  0.00380818,
       -0.00395916, -0.00234931,  0.00471063, -0.00311525, -0.00092859,
       -0.00479069,  0.00348141,  0.00422892,  0.00425649, -0.0029851 ,
        0.00317291,  0.00463596,  0.00278836,  0.00231364,  0.00217661,
       -0.00463204,  0.00366445,  0.00257002, -0.00154999, -0.00412736,
        0.00445884,  0.00200433, -0.00264274, -0.00037009, -0.00267405,
        0.00313294,  0.00278946, -0.00180913,  0.00306309,  0.00143279,
        0.0030039 ,  0.0014327 , -0.00355948, -0.00220458, -0.0001424 ,
       -0.00205512,  0.00276355,  0.00023416, -0.00326569, -0.00431944,
       -0.00316572, -0.00451421, -0.00380698, -0.00302657, -0.00318542,
       -0.00463803, -0.0014342 ,  0.00180713,  0.00289438,  0.00352388,
        0.00198569,  0.00469513,  0.00028651,  0.00256516,  0.00

### Building a visualization function with plotly

In [235]:
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

from sklearn.decomposition import IncrementalPCA    # initial reduction (imported but not used below)
from sklearn.manifold import TSNE                   # final reduction
import numpy as np                                  # array handling

from plotly.offline import init_notebook_mode, iplot, plot
import plotly.graph_objs as go

def reduce_dimensions(model, plot_in_notebook = True):

    num_dimensions = 2  # final num dimensions (2D, 3D, etc)

    vectors = []        # positions in vector space
    labels = []         # keep track of words to label our data again later
    for word in model.wv.vocab:
        vectors.append(model[word])
        labels.append(word)


    # convert both lists into numpy arrays for reduction
    vectors = np.asarray(vectors)
    labels = np.asarray(labels)
    
    # reduce using t-SNE
    logging.info('starting tSNE dimensionality reduction. This may take some time.')
    tsne = TSNE(n_components=num_dimensions, random_state=0)
    vectors = tsne.fit_transform(vectors)

    x_vals = [v[0] for v in vectors]
    y_vals = [v[1] for v in vectors]
        
    # Create a trace
    trace = go.Scatter(
        x=x_vals,
        y=y_vals,
        mode='text',
        text=labels
        )
    
    data = [trace]
    
    logging.info('All done. Plotting.')
    
    if plot_in_notebook:
        init_notebook_mode(connected=True)
        iplot(data, filename='word-embedding-plot')
    else:
        plot(data, filename='word-embedding-plot.html')

In [236]:
# English example
reduce_dimensions(w2v_en)


Call to deprecated `__getitem__` (Method will be removed in 4.0.0, use self.wv.__getitem__() instead).

2019-04-23 08:09:58,422 : INFO : starting tSNE dimensionality reduction. This may take some time.
2019-04-23 08:09:58,757 : INFO : All done. Plotting.


In [237]:
# Korean example
reduce_dimensions(w2v_ko)


Call to deprecated `__getitem__` (Method will be removed in 4.0.0, use self.wv.__getitem__() instead).

2019-04-23 08:10:58,746 : INFO : starting tSNE dimensionality reduction. This may take some time.
2019-04-23 08:10:59,560 : INFO : All done. Plotting.


### GloVe

### FastText

In [247]:
documents_en = [["my", "name", "is", "jamie"], ["jamie", "is", "cute"]]

In [248]:
# Word2Vec cannot compute anything for words that were not seen during training.
w2v_en.most_similar("jamia")


Call to deprecated `most_similar` (Method will be removed in 4.0.0, use self.wv.most_similar() instead).

2019-04-23 09:58:10,864 : INFO : precomputing L2-norms of word weight vectors


KeyError: "word 'jamia' not in vocabulary"

In [244]:
from gensim.models import FastText

ft_en = FastText(size=4, window=3, min_count=1)

In [245]:
ft_en.build_vocab(documents_en)

2019-04-23 09:53:50,084 : INFO : collecting all words and their counts
2019-04-23 09:53:50,085 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2019-04-23 09:53:50,086 : INFO : collected 5 word types from a corpus of 7 raw words and 2 sentences
2019-04-23 09:53:50,087 : INFO : Loading a fresh vocabulary
2019-04-23 09:53:50,087 : INFO : min_count=1 retains 5 unique words (100% of original 5, drops 0)
2019-04-23 09:53:50,088 : INFO : min_count=1 leaves 7 word corpus (100% of original 7, drops 0)
2019-04-23 09:53:50,089 : INFO : deleting the raw counts dictionary of 5 items
2019-04-23 09:53:50,090 : INFO : sample=0.001 downsamples 5 most-common words
2019-04-23 09:53:50,090 : INFO : downsampling leaves estimated 0 word corpus (7.5% of prior 7)
2019-04-23 09:53:50,092 : INFO : estimated required memory for 5 words, 40 buckets and 4 dimensions: 3860 bytes
2019-04-23 09:53:50,093 : INFO : resetting layer weights
2019-04-23 09:53:50,106 : INFO : Total number of ngram

In [246]:
ft_en.train(sentences=documents_en, total_examples=len(documents_en), epochs=10)  # train

2019-04-23 09:54:31,424 : INFO : training model with 3 workers on 5 vocabulary and 4 features, using sg=0 hs=0 sample=0.001 negative=5 window=3
2019-04-23 09:54:31,426 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-04-23 09:54:31,427 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-04-23 09:54:31,427 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-04-23 09:54:31,428 : INFO : EPOCH - 1 : training on 7 raw words (0 effective words) took 0.0s, 0 effective words/s
2019-04-23 09:54:31,431 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-04-23 09:54:31,432 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-04-23 09:54:31,433 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-04-23 09:54:31,433 : INFO : EPOCH - 2 : training on 7 raw words (0 effective words) took 0.0s, 0 effective words/s
2019-04-23 09:54:31,436 : INFO : worker thread finished; awaiting fini

In [254]:
# FastText can compute results even for words that were not seen during training.
ft_en.wv.most_similar("jamia")

[('my', 0.8944276571273804),
 ('jamie', 0.7871583104133606),
 ('name', 0.3816155791282654),
 ('is', 0.3451642394065857),
 ('cute', 0.30520403385162354)]
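
FastText can handle the unseen word "jamia" because it represents a word through its character n-grams, and many of those n-grams also occur in the training word "jamie". The sketch below is a simplified illustration of that idea: `char_ngrams` is a hypothetical helper, the 3 to 6 character range is assumed to mirror the usual FastText defaults, and the hashing of n-grams into buckets that the real implementation performs is left out.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the set of character n-grams of `word` wrapped in '<' and '>' markers."""
    wrapped = "<" + word + ">"
    ngrams = set()
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.add(wrapped[start:start + n])
    return ngrams

seen = char_ngrams("jamie")    # word in the training data
unseen = char_ngrams("jamia")  # word not in the training data
shared = seen & unseen

print(sorted(shared))
print(len(shared), "n-grams shared out of", len(unseen))
```

The vector for an unseen word is built from the vectors of its n-grams, so the shared prefixes give "jamia" a usable representation. With a model this small (4 dimensions, a 7-word corpus), the ranking in the output above should not be read as meaningful.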

----------------------------

## Sentence Embedding

--------------------------

# References
* Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning - https://www.amazon.com/Applied-Text-Analysis-Python-Language-Aware/dp/1491963042/
* Natural Language Processing - https://www.coursera.org/learn/language-processing / Main approaches in NLP
* http://git.ajou.ac.kr/open-source-2018-spring/python_Korean_NLP/blob/master/README.md
* https://github.com/lovit/soynlp