# 01. Text Vectorization

# Contents
* Frequency Vectors
* One-Hot Encoding
* Term Frequency-Inverse Document Frequency
* Distributed Representation

-----------------------

<img src="https://nbviewer.jupyter.org/github/psygrammer/psyml/blob/master/nlp_ml/ch01/figures/cap1.3.png" width=600 />
<img src="https://nbviewer.jupyter.org/github/psygrammer/psyml/blob/master/nlp_ml/ch01/figures/cap1.4.png" width=600 />

* Source - Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning - https://www.amazon.com/Applied-Text-Analysis-Python-Language-Aware/dp/1491963042/

<img src="figures/main_nlp_01.png" width=600 />
<img src="figures/main_nlp_02.png" width=600 />
<img src="figures/main_nlp_03.png" width=600 />
<img src="figures/main_nlp_04.png" width=600 />
<img src="figures/main_nlp_05.png" width=600 />

* Source - Natural Language Processing - https://www.coursera.org/learn/language-processing - Main approaches in NLP

<img src="https://nbviewer.jupyter.org/github/psygrammer/psyml/blob/master/nlp_ml/ch04/figures/cap01.png" width=600 />

* Source - Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning - https://www.amazon.com/Applied-Text-Analysis-Python-Language-Aware/dp/1491963042/

<img src="https://slideplayer.com/slide/5270778/17/images/10/Distributed+Word+Representation.jpg" width=600 />

* Source - https://slideplayer.com/slide/5270778/

---------------------

# Frequency Vectors

<img src="https://nbviewer.jupyter.org/github/psygrammer/psyml/blob/master/nlp_ml/ch04/figures/cap02.png" width=600 />

* Source - Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning - https://www.amazon.com/Applied-Text-Analysis-Python-Language-Aware/dp/1491963042/

In [6]:
import gensim
from pprint import pprint

To use gensim, the document list documents must first be converted into a corpus, which here is a list of token lists.

In [3]:
# Five toy documents
documents = [
    "a b c a",
    "c b c",
    "b b a",
    "a c c",
    "c b a",
]

In [13]:
# Split each document into word (token) units
def tokenize(text):
    for token in text.split():
        yield token

corpus = [list(tokenize(doc)) for doc in documents]

pprint(corpus)

[['a', 'b', 'c', 'a'],
 ['c', 'b', 'c'],
 ['b', 'b', 'a'],
 ['a', 'c', 'c'],
 ['c', 'b', 'a']]


In [14]:
# Dictionary object: assigns an integer id to each token (see the gensim tutorial for its other features)
id2word = gensim.corpora.Dictionary(corpus)

In [15]:
id2word

<gensim.corpora.dictionary.Dictionary at 0x1a174cd198>

In [16]:
id2word.token2id

{'a': 0, 'b': 1, 'c': 2}

In [17]:
# Convert each tokenized document into sparse (token_id, count) pairs
vectors = [id2word.doc2bow(doc) for doc in corpus]

In [18]:
pprint(vectors)

[[(0, 2), (1, 1), (2, 1)],
 [(1, 1), (2, 2)],
 [(0, 1), (1, 2)],
 [(0, 1), (2, 2)],
 [(0, 1), (1, 1), (2, 1)]]
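
To make the (token_id, count) pairs above less opaque, here is a minimal sketch that builds the same frequency vectors with only the standard library and then expands them into fixed-length count vectors. It is not gensim's implementation: the helper names to_bow and to_dense are hypothetical, and assigning ids from the sorted vocabulary is an assumption that happens to reproduce the token2id mapping shown for this toy corpus.

```python
from collections import Counter

documents = ["a b c a", "c b c", "b b a", "a c c", "c b a"]
corpus = [doc.split() for doc in documents]

# Assumption: ids follow the sorted vocabulary, which matches the
# token2id mapping {'a': 0, 'b': 1, 'c': 2} for this toy corpus.
vocabulary = sorted({token for doc in corpus for token in doc})
token2id = {token: idx for idx, token in enumerate(vocabulary)}


def to_bow(tokens):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(token2id[token] for token in tokens)
    return sorted(counts.items())


def to_dense(bow, size):
    """Expand sparse (token_id, count) pairs into a fixed-length count vector."""
    dense = [0] * size
    for token_id, count in bow:
        dense[token_id] = count
    return dense


bow_vectors = [to_bow(doc) for doc in corpus]
dense_vectors = [to_dense(bow, len(vocabulary)) for bow in bow_vectors]

print(bow_vectors[0])    # [(0, 2), (1, 1), (2, 1)]
print(dense_vectors[0])  # [2, 1, 1]
```

The sparse form omits tokens that do not occur in a document, which is why the second document has only two pairs; the dense form writes those missing entries as 0.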


----------------

# One-Hot Encoding

<img src="https://nbviewer.jupyter.org/github/psygrammer/psyml/blob/master/nlp_ml/ch04/figures/cap03.png" width=600 />

* Source - Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning - https://www.amazon.com/Applied-Text-Analysis-Python-Language-Aware/dp/1491963042/

In [20]:
corpus

[['a', 'b', 'c', 'a'],
 ['c', 'b', 'c'],
 ['b', 'b', 'a'],
 ['a', 'c', 'c'],
 ['c', 'b', 'a']]

In [23]:
[id2word.doc2bow(doc) for doc in corpus]

[[(0, 2), (1, 1), (2, 1)],
 [(1, 1), (2, 2)],
 [(0, 1), (1, 2)],
 [(0, 1), (2, 2)],
 [(0, 1), (1, 1), (2, 1)]]

In [26]:
[[token for token in id2word.doc2bow(doc)] 
                         for doc in corpus]

[[(0, 2), (1, 1), (2, 1)],
 [(1, 1), (2, 2)],
 [(0, 1), (1, 2)],
 [(0, 1), (2, 2)],
 [(0, 1), (1, 1), (2, 1)]]

In [30]:
[[(token[0], token[1]) for token in id2word.doc2bow(doc)] 
                         for doc in corpus]

[[(0, 2), (1, 1), (2, 1)],
 [(1, 1), (2, 2)],
 [(0, 1), (1, 2)],
 [(0, 1), (2, 2)],
 [(0, 1), (1, 1), (2, 1)]]

In [34]:
# One-hot encoding: keep each token id and replace its count with 1
vectors = [[(token[0], 1) for token in id2word.doc2bow(doc)]
           for doc in corpus]

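In [32]:
import numpy as np

# Optional NumPy copy; rows differ in length, so dtype=object is required
vectors_np = np.array(
    [[(token[0], 1) for token in id2word.doc2bow(doc)] for doc in corpus],
    dtype=object,
)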

In [35]:
vectors

[[(0, 1), (1, 1), (2, 1)],
 [(1, 1), (2, 1)],
 [(0, 1), (1, 1)],
 [(0, 1), (2, 1)],
 [(0, 1), (1, 1), (2, 1)]]
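
The pairs above only record which token ids are present. As a rough illustration of the same idea in dense form, the sketch below marks each vocabulary term with 1 if it occurs in the document and 0 otherwise. The helper one_hot is a hypothetical name, and the sorted vocabulary (a, b, c) is an assumption chosen to match the ids used above.

```python
documents = ["a b c a", "c b c", "b b a", "a c c", "c b a"]
corpus = [doc.split() for doc in documents]

# Assumption: a sorted vocabulary fixes the column order (a, b, c).
vocabulary = sorted({token for doc in corpus for token in doc})


def one_hot(tokens, terms):
    """Return 1 for each vocabulary term present in the document, else 0."""
    present = set(tokens)
    return [1 if term in present else 0 for term in terms]


one_hot_vectors = [one_hot(doc, vocabulary) for doc in corpus]

for doc, vector in zip(documents, one_hot_vectors):
    print(doc, "->", vector)
```

Repeated tokens no longer matter in this encoding: "a b c a" and "c b a" map to the same vector.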

-------------------

# Term Frequency-Inverse Document Frequency

<img src="https://nbviewer.jupyter.org/github/psygrammer/psyml/blob/master/nlp_ml/ch04/figures/cap04.png" width=600 />

* Source - Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning - https://www.amazon.com/Applied-Text-Analysis-Python-Language-Aware/dp/1491963042/

In [38]:
documents

['a b c a', 'c b c', 'b b a', 'a c c', 'c b a']

In [39]:
corpus

[['a', 'b', 'c', 'a'],
 ['c', 'b', 'c'],
 ['b', 'b', 'a'],
 ['a', 'c', 'c'],
 ['c', 'b', 'a']]

In [41]:
# Create the Dictionary
id2word = gensim.corpora.Dictionary(corpus)

In [43]:
# Create the TF-IDF model
tfidf = gensim.models.TfidfModel(dictionary=id2word, normalize=True)
tfidf

<gensim.models.tfidfmodel.TfidfModel at 0x1a174d4940>

In [44]:
vectors = [tfidf[id2word.doc2bow(doc)] for doc in corpus]
vectors

[[(0, 0.816496580927726), (1, 0.408248290463863), (2, 0.408248290463863)],
 [(1, 0.447213595499958), (2, 0.894427190999916)],
 [(0, 0.447213595499958), (1, 0.894427190999916)],
 [(0, 0.447213595499958), (2, 0.894427190999916)],
 [(0, 0.5773502691896257), (1, 0.5773502691896257), (2, 0.5773502691896257)]]
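
To see where these weights come from, the following sketch recomputes them by hand. It assumes raw counts as term frequency, idf = log2(N / df), and L2 normalization, which is my understanding of gensim's defaults rather than something stated in this notebook; the helper tfidf_vector is a hypothetical name. In this toy corpus every token appears in 4 of the 5 documents, so all three idf values are equal and the normalized weights depend only on the counts.

```python
import math
from collections import Counter

documents = ["a b c a", "c b c", "b b a", "a c c", "c b a"]
corpus = [doc.split() for doc in documents]

vocabulary = sorted({token for doc in corpus for token in doc})
token2id = {token: idx for idx, token in enumerate(vocabulary)}
num_docs = len(corpus)

# Document frequency: number of documents that contain each token.
doc_freq = Counter(token for doc in corpus for token in set(doc))

# Assumption: idf = log2(N / df), with raw counts as term frequency.
idf = {token: math.log2(num_docs / df) for token, df in doc_freq.items()}


def tfidf_vector(tokens):
    """Return L2-normalized (token_id, weight) pairs for one document."""
    counts = Counter(tokens)
    weights = {token2id[t]: tf * idf[t] for t, tf in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return []
    return sorted((i, w / norm) for i, w in weights.items() if w > 0)


tfidf_vectors = [tfidf_vector(doc) for doc in corpus]
print(tfidf_vectors[0])
```

For the first document the counts are (2, 1, 1), so after normalization the weights are 2/sqrt(6) and 1/sqrt(6), which agrees with the 0.816 and 0.408 values shown above.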

In [46]:
# Saving the Dictionary and the TF-IDF model makes it easy to load and reuse them later.
id2word.save_as_text('test.txt')
tfidf.save('tfidf.pkl')

In [47]:
%ls

01_text_features.ipynb  test.txt
figures/                tfidf.pkl


-------------------

# Distributed Representation

<img src="https://nbviewer.jupyter.org/github/psygrammer/psyml/blob/master/nlp_ml/ch04/figures/cap05.png" width=600 />

* Source - Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning - https://www.amazon.com/Applied-Text-Analysis-Python-Language-Aware/dp/1491963042/

---------------------

# References
* Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning - https://www.amazon.com/Applied-Text-Analysis-Python-Language-Aware/dp/1491963042/
* Natural Language Processing - https://www.coursera.org/learn/language-processing / Main approaches in NLP
* http://git.ajou.ac.kr/open-source-2018-spring/python_Korean_NLP/blob/master/README.md
* https://github.com/lovit/soynlp