# Vectorization

### 1. Count-based vectorization
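- Bag of Words (BoW): represents each document as a vector of word counts, ignoring word order
- TF-IDF: gives a low weight to words that appear in many documents and a high weight to words that appear often in only a few documents
    - Formulas

    $$
    \text{TF}(t, d) = \frac{\text{number of times term } t \text{ appears in document } d}{\text{total number of terms in document } d}
    $$

    $$
    \text{IDF}(t, D) = \log \left( \frac{|D|}{|\{d \in D : t \in d\}|} \right)
    $$

    $$
    \text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D)
    $$

    - \( |D| \): total number of documents
    - \( \{d \in D : t \in d\} \): the set of documents that contain term \( t \)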
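
The formulas above can be sketched in a few lines of plain Python, where the TF-IDF weight of a term is taken as the product of its TF and IDF. This is a minimal illustration rather than how any particular library computes the weights: the toy corpus and the function names `tf`, `idf`, and `tf_idf` are hypothetical, and the natural logarithm is assumed as the log base.

```python
import math

# Hypothetical toy corpus: each document is already split into tokens.
corpus = [
    ["cat", "sat", "mat"],
    ["cat", "cat", "dog"],
    ["dog", "barked"],
]

def tf(term, doc):
    """Occurrences of `term` in `doc` divided by the document length."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """log(|D| / number of documents containing `term`), natural log assumed."""
    containing = sum(1 for doc in docs if term in doc)
    # Raises ZeroDivisionError if no document contains the term.
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    """Product of the two quantities above."""
    return tf(term, doc) * idf(term, docs)

print(tf("cat", corpus[1]))                  # 2 of the 3 tokens are "cat"
print(idf("cat", corpus))                    # "cat" appears in 2 of 3 documents
print(tf_idf("barked", corpus[2], corpus))   # a term found in one document only
```

A term that occurs in every document gets an IDF of `log(1) = 0`, so its TF-IDF weight vanishes no matter how often it appears.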



### 2. Word-embedding-based vectorization

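- Learns dense real-valued vectors that capture meaning by predicting the relationship between a word and its surrounding words (context)
    - e.g. `king - man + woman ≈ queen`
- Types
    - Word2Vec : an embedding method that represents word meaning in a vector space; words with similar meanings are trained to lie close together
        - How is "similar meaning" defined? The words are used in similar contexts
        - Variants
            - CBOW (Continuous Bag of Words) : predicts the center word from the surrounding words (tends to favor frequent words)
            - Skip-gram : predicts the surrounding words from the center word (tends to favor rare words)
        - In practice, a simple neural network (1 hidden layer)
            - Input layer : one-hot vector (V dimensions, where V is the vocabulary size)
            - Hidden layer : embedding matrix W (V × N) -> **maps a word to an N-dimensional embedding**
            - Output layer : softmax probability distribution -> prediction
    - GloVe
    - FastText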
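
The network structure described for Word2Vec can be sketched with NumPy. This is a forward pass only, under stated assumptions: the sizes `V` and `N`, the random weights, and the second matrix `W_out` (hidden-to-output weights, which the notes do not name) are illustrative, and no training happens here.

```python
import numpy as np

rng = np.random.default_rng(0)
V, N = 6, 3                      # assumed toy sizes: vocabulary size, embedding dimension

W = rng.normal(size=(V, N))      # embedding matrix (input -> hidden)
W_out = rng.normal(size=(N, V))  # assumed output weights (hidden -> output)

word_index = 2
one_hot = np.zeros(V)
one_hot[word_index] = 1.0

# Multiplying a one-hot vector by W just selects one row of W,
# which is why the hidden layer acts as an embedding lookup.
hidden = one_hot @ W
assert np.allclose(hidden, W[word_index])

# Softmax over the vocabulary gives the predicted distribution.
scores = hidden @ W_out
exp_scores = np.exp(scores - scores.max())   # subtract the max for numerical stability
probs = exp_scores / exp_scores.sum()

print(hidden.shape, probs.shape, probs.sum())
```

In Skip-gram the input is the center word and `probs` scores candidate context words; in CBOW the hidden vector is instead built by averaging the rows of `W` for the context words.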

In [71]:
import pandas as pd
from IPython.display import display

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "나는 콜드플레이 노래를 좋아한다",
    "디즈니 영화를 보고있다",
    "좋은 날씨에 종종 테니스를 친다",
    "좋은 날씨에 잔다",
    "영화가 좋은 나는 영화를 본다",
    "콜드플레이는 영화를 찍은 적이 있나?",
    "날씨가 좋아야 밖에 나간다",
    "테니스 만화는 테니스를 내용이다 그것이 테니스니까"
]

In [72]:
from kiwipiepy import Kiwi

kiwi = Kiwi()  # Korean Intelligent Word Identifier
docs_kiwi = []

for d in docs:
    result = kiwi.analyze(d)
    tokens = [token.form for token in result[0][0]]  # token.tag holds the part-of-speech tag
    
    docs_kiwi.append(tokens)

docs_kiwi

[['나', '는', '콜드플레이', '노래', '를', '좋아하', 'ᆫ다'],
 ['디즈니', '영화', '를', '보', '고', '있', '다'],
 ['좋', '은', '날씨', '에', '종종', '테니스', '를', '치', 'ᆫ다'],
 ['좋', '은', '날씨', '에', '자', 'ᆫ다'],
 ['영화', '가', '좋', '은', '나', '는', '영화', '를', '보', 'ᆫ다'],
 ['콜드플레이', '는', '영화', '를', '찍', '은', '적', '이', '있', '나', '?'],
 ['날씨', '가', '좋', '어야', '밖', '에', '나가', 'ᆫ다'],
 ['테니스', '만화', '는', '테니스', '를', '내용', '이', '다', '그것', '이', '테니스', '이', '니까']]

In [73]:
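# 1. BoW
# Note: CountVectorizer's default token_pattern keeps only tokens of two or more
# characters, so single-character morphemes such as '나', '는', '를' are dropped.
bow_vectorizer = CountVectorizer()
bow = bow_vectorizer.fit_transform([" ".join(tokens) for tokens in docs_kiwi])

print("BoW feature names:", bow_vectorizer.get_feature_names_out())  # features are sorted lexicographically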
print("Bow vector:\n", bow.toarray())

BoW feature names: ['ᆫ다' '그것' '나가' '날씨' '내용' '노래' '니까' '디즈니' '만화' '어야' '영화' '종종' '좋아하'
 '콜드플레이' '테니스']
Bow vector:
 [[1 0 0 0 0 1 0 0 0 0 0 0 1 1 0]
 [0 0 0 0 0 0 0 1 0 0 1 0 0 0 0]
 [1 0 0 1 0 0 0 0 0 0 0 1 0 0 1]
 [1 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0 0 0 2 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 1 0 0 1 0]
 [1 0 1 1 0 0 0 0 0 1 0 0 0 0 0]
 [0 1 0 0 1 0 1 0 1 0 0 0 0 0 3]]
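
Many morphemes from `docs_kiwi`, such as '나', '를', and '좋', are missing from the feature list above. This follows from `CountVectorizer`'s default `token_pattern`, `(?u)\b\w\w+\b`, which keeps only tokens of two or more word characters. The sketch below applies that same regular expression with the standard library to the second document; the helper name `count_tokens` is hypothetical.

```python
import re
from collections import Counter

# Same regular expression as CountVectorizer's default token_pattern.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def count_tokens(text):
    """Count tokens of two or more word characters (hypothetical helper)."""
    return Counter(TOKEN_PATTERN.findall(text))

# Second document after Kiwi tokenization, joined with spaces as in the cell above.
doc = "디즈니 영화 를 보 고 있 다"
print(count_tokens(doc))   # only the multi-character tokens remain
```

If single-character morphemes should be kept, one option is to pass a looser pattern, for example `CountVectorizer(token_pattern=r"(?u)\b\w+\b")`.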


In [74]:
# 2. TF-IDF
tfidf_vectorizer = TfidfVectorizer()
tfidf = tfidf_vectorizer.fit_transform([" ".join(tokens) for tokens in docs_kiwi])

print("TF_IDF feature names:", tfidf_vectorizer.get_feature_names_out())
print("TF-IDF vector:\n")
display(pd.DataFrame(tfidf.toarray(), columns=tfidf_vectorizer.get_feature_names_out()))

TF_IDF feature names: ['ᆫ다' '그것' '나가' '날씨' '내용' '노래' '니까' '디즈니' '만화' '어야' '영화' '종종' '좋아하'
 '콜드플레이' '테니스']
TF-IDF vector:



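| | ᆫ다 | 그것 | 나가 | 날씨 | 내용 | 노래 | 니까 | 디즈니 | 만화 | 어야 | 영화 | 종종 | 좋아하 | 콜드플레이 | 테니스 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.323114 | 0.0 | 0.0 | 0.0 | 0.0 | 0.575683 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.575683 | 0.482467 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.810306 | 0.0 | 0.0 | 0.586007 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.352144 | 0.0 | 0.0 | 0.453735 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.627406 | 0.0 | 0.0 | 0.525815 |
| 3 | 0.613115 | 0.0 | 0.0 | 0.789994 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.361767 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.932268 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.653308 | 0.0 | 0.0 | 0.757092 | 0.0 |
| 6 | 0.333168 | 0.0 | 0.593597 | 0.429285 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.593597 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7 | 0.0 | 0.311266 | 0.0 | 0.0 | 0.311266 | 0.0 | 0.311266 | 0.0 | 0.311266 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.782595 |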
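
These values differ from the textbook TF and IDF formulas because, with its default settings, `TfidfVectorizer` uses raw counts as TF, a smoothed IDF of \( \ln\frac{1 + |D|}{1 + \text{df}(t)} + 1 \), and then L2-normalizes each row. The sketch below recomputes row 1 by hand under those assumptions; the counts and document frequencies are read off the BoW matrix, and the variable names are illustrative.

```python
import math

n_docs = 8
# Row 1 ("디즈니 영화 ..."): raw counts taken from the BoW matrix.
counts = {"디즈니": 1, "영화": 1}
# Number of documents with a nonzero count for each term (from the BoW matrix).
doc_freq = {"디즈니": 1, "영화": 3}

def smooth_idf(df, n):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n) / (1 + df)) + 1

raw = {t: c * smooth_idf(doc_freq[t], n_docs) for t, c in counts.items()}
norm = math.sqrt(sum(v * v for v in raw.values()))   # L2 norm of the row
weights = {t: v / norm for t, v in raw.items()}

print({t: round(w, 6) for t, w in weights.items()})
```

Up to rounding, the two weights should agree with the 디즈니 and 영화 entries of row 1 in the table (0.810306 and 0.586007).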


In [13]:
# 3. Word2Vec

In [None]:
import urllib.request
import pandas as pd

from gensim.models import Word2Vec

In [None]:
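import os

os.makedirs("data", exist_ok=True)  # urlretrieve does not create missing directories
urllib.request.urlretrieve("https://raw.githubusercontent.com/e9t/nsmc/master/ratings.txt", filename="data/ratings.txt")
train_data = pd.read_table('data/ratings.txt')
train_data  # label: positive (1), negative (0)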