# Vectorization

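### 1. Count-based vectorization
- Bag of Words (BoW): represents each document as a vector of how many times each word occurs in it
- TF-IDF: assigns a low weight to words that are common across all documents and a high weight to words that occur frequently only in particular documents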
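    - Formulas

    $$
    \text{TF}(t, d) = \frac{\text{number of times term } t \text{ appears in document } d}{\text{total number of terms in document } d}
    $$

    $$
    \text{IDF}(t, D) = \log \left( \frac{|D|}{|\{d \in D : t \in d\}|} \right)
    $$

    $$
    \text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D)
    $$

    - \( |D| \): total number of documents
    - \( \{d \in D : t \in d\} \): the set of documents that contain term \( t \)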
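The definitions above can be sketched directly in plain Python. This is a minimal illustration, not a library implementation: the toy corpus, the whitespace tokenization, the helper names (`bag_of_words`, `tf`, `idf`, `tf_idf`), and the choice of the natural logarithm are all assumptions made for the example.

```python
import math
from collections import Counter

# Hypothetical toy corpus, tokenized by whitespace
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat".split(),
    "a cat and a dog".split(),
]

def bag_of_words(tokens):
    """Count how many times each token occurs in one document."""
    return Counter(tokens)

def tf(term, doc):
    """Occurrences of `term` in `doc` divided by the document length."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """log(|D| / number of documents containing `term`), natural log assumed.

    Undefined (division by zero) for a term that appears in no document.
    """
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    """Product of the term frequency and the inverse document frequency."""
    return tf(term, doc) * idf(term, docs)

print(bag_of_words(corpus[0]))
print("the:", tf_idf("the", corpus[0], corpus))
print("mat:", tf_idf("mat", corpus[0], corpus))
```

In the first document, "the" occurs twice and "mat" only once, yet "mat" receives the higher TF-IDF score because it appears in just one of the three documents.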



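### 2. Word-embedding-based vectorization
- A higher-level representation: dense, learned vectors that capture word meaning, typically with far fewer dimensions than the vocabulary-sized vectors of count-based methods
- Examples
    - Word2Vec
        - CBOW
        - Skip-gram
    - GloVe
    - FastText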
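The two Word2Vec variants differ in what they predict: CBOW predicts a target word from its surrounding context words, while Skip-gram predicts each context word from the center word. The sketch below only generates the training pairs each variant would see. It does not train a model, and the function names and the `window` parameter are hypothetical names chosen for illustration.

```python
def cbow_pairs(tokens, window=1):
    """Return (context_words, target_word) pairs: context predicts target."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            pairs.append((context, target))
    return pairs

def skipgram_pairs(tokens, window=1):
    """Return (center_word, context_word) pairs: center predicts each neighbor."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = ["good", "weather", "today"]
print(cbow_pairs(tokens))
print(skipgram_pairs(tokens))
```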

In [3]:
import pandas as pd
from IPython.display import display

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

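docs = [
    "나는 콜드플레이 노래를 좋아한다",  # "I like Coldplay songs"
    "디즈니 영화를 보고있다",  # "I am watching a Disney movie"
    "좋은 날씨에 종종 테니스를 친다",  # "I often play tennis in good weather"
    "좋은 날씨에 잔다",  # "I sleep in good weather"
    "영화가 좋은 나는 영화를 본다",  # "I, who like movies, watch a movie"
]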

In [4]:
# TODO: Korean preprocessing; the konlpy version issue needs to be resolved first
# from konlpy.tag import Okt
# okt = Okt()

In [5]:
# 1. BoW
bow_vectorizer = CountVectorizer()
bow = bow_vectorizer.fit_transform(docs)

print("BoW feature names:", bow_vectorizer.get_feature_names_out())  # features are sorted in lexicographic order
print("BoW vector:\n", bow.toarray())

BoW feature names: ['나는' '날씨에' '노래를' '디즈니' '보고있다' '본다' '영화가' '영화를' '잔다' '종종' '좋아한다' '좋은' '친다'
 '콜드플레이' '테니스를']
BoW vector:
 [[1 0 1 0 0 0 0 0 0 0 1 0 0 1 0]
 [0 0 0 1 1 0 0 1 0 0 0 0 0 0 0]
 [0 1 0 0 0 0 0 0 0 1 0 1 1 0 1]
 [0 1 0 0 0 0 0 0 1 0 0 1 0 0 0]
 [1 0 0 0 0 1 1 1 0 0 0 1 0 0 0]]


In [6]:
# 2. TF-IDF
tfidf_vectorizer = TfidfVectorizer()
tfidf = tfidf_vectorizer.fit_transform(docs)

print("TF-IDF feature names:", tfidf_vectorizer.get_feature_names_out())
print("TF-IDF vector:\n")
# Take the column names from the TF-IDF vectorizer itself rather than from the BoW one
display(pd.DataFrame(tfidf.toarray(), columns=tfidf_vectorizer.get_feature_names_out()))

TF-IDF feature names: ['나는' '날씨에' '노래를' '디즈니' '보고있다' '본다' '영화가' '영화를' '잔다' '종종' '좋아한다' '좋은' '친다'
 '콜드플레이' '테니스를']
TF-IDF vector:



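| | 나는 | 날씨에 | 노래를 | 디즈니 | 보고있다 | 본다 | 영화가 | 영화를 | 잔다 | 종종 | 좋아한다 | 좋은 | 친다 | 콜드플레이 | 테니스를 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.422242 | 0.0 | 0.523358 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.523358 | 0.0 | 0.0 | 0.523358 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.614189 | 0.614189 | 0.0 | 0.0 | 0.495524 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.398475 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.493899 | 0.0 | 0.33077 | 0.493899 | 0.0 | 0.493899 |
| 3 | 0.0 | 0.556816 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.690159 | 0.0 | 0.0 | 0.462208 | 0.0 | 0.0 | 0.0 |
| 4 | 0.416607 | 0.0 | 0.0 | 0.0 | 0.0 | 0.516374 | 0.516374 | 0.416607 | 0.0 | 0.0 | 0.0 | 0.345822 | 0.0 | 0.0 | 0.0 |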
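
The values in the output above do not come from the plain TF and IDF formulas. With its default settings (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`), `TfidfVectorizer` uses the raw term count as TF, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and then scales each document vector to unit L2 length. The sketch below recomputes the first row by hand under those assumed defaults. The document frequencies are read off the BoW matrix, and the variable and function names are illustrative only.

```python
import math

n_docs = 5  # number of documents in `docs`

# Document frequency of each term in document 0 (column sums of the binary BoW matrix)
doc_freq = {"나는": 2, "노래를": 1, "좋아한다": 1, "콜드플레이": 1}
# Raw counts of those terms in document 0
counts = {"나는": 1, "노래를": 1, "좋아한다": 1, "콜드플레이": 1}

def smoothed_idf(df_t, n):
    """Smoothed IDF, assuming scikit-learn's default smooth_idf=True."""
    return math.log((1 + n) / (1 + df_t)) + 1

# Unnormalized weights: raw count times smoothed IDF
raw = {t: counts[t] * smoothed_idf(doc_freq[t], n_docs) for t in counts}

# L2-normalize the document vector
l2_norm = math.sqrt(sum(v * v for v in raw.values()))
weights = {t: v / l2_norm for t, v in raw.items()}

for term, w in weights.items():
    print(term, round(w, 6))
```

The result should agree with row 0 of the output: "나는" occurs in two documents and therefore gets a lower weight than the three terms that occur only in this document.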
