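Document-Term Matrix
   - A document-term matrix records, as a matrix, how often each word occurs in each document
   - Each row is the bag-of-words (BoW) vector of one document, so the matrix gathers the BoW of every document into a single table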

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["Think like a man of action and act like man od thought.",
          "Try not to become a man of success but rather try to become a man of value.",
          "Give me liberty, or give me death."]
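# Count term occurrences, dropping scikit-learn's built-in English stop words
vector = CountVectorizer(stop_words='english')
bow = vector.fit_transform(corpus)

print(bow.toarray())       # dense document-term matrix (one row per document)
print(vector.vocabulary_)  # mapping from term to column index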


[[1 1 0 0 2 2 1 0 1 1 0 0]
 [0 0 0 0 0 2 0 1 0 0 2 1]
 [0 0 1 1 0 0 0 0 0 0 0 0]]
{'think': 8, 'like': 4, 'man': 5, 'action': 1, 'act': 0, 'od': 6, 'thought': 9, 'try': 10, 'success': 7, 'value': 11, 'liberty': 3, 'death': 2}


In [3]:
import pandas as pd

# Order the terms by their column index so they line up with the matrix columns
columns = sorted(vector.vocabulary_, key=vector.vocabulary_.get)
    
df = pd.DataFrame(bow.toarray(), columns=columns)
df

   act  action  death  liberty  like  man  od  success  think  thought  try  value
0    1       1      0        0     2    2   1        0      1        1    0      0
1    0       0      0        0     0    2   0        1      0        0    2      1
2    0       0      1        1     0    0   0        0      0        0    0      0
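
To make explicit what the counting step involves, here is a minimal pure-Python sketch that rebuilds the same matrix without scikit-learn. It rests on two assumptions: tokens are lowercase runs of two or more word characters (this mirrors CountVectorizer's default token pattern), and STOP_WORDS is a small hand-picked set covering only the stop words that occur in this corpus, not scikit-learn's full English list. The name build_dtm is introduced here purely for illustration.

```python
import re
from collections import Counter

corpus = ["Think like a man of action and act like man od thought.",
          "Try not to become a man of success but rather try to become a man of value.",
          "Give me liberty, or give me death."]

# Assumption: hand-picked subset, only the stop words present in this corpus
STOP_WORDS = {"of", "and", "not", "to", "become", "but", "rather", "give", "me", "or"}


def build_dtm(docs):
    """Return (sorted vocabulary, count matrix) for a list of documents."""
    # Lowercase, keep tokens of 2+ word characters, then drop stop words
    tokenized = [
        [t for t in re.findall(r"\b\w\w+\b", doc.lower()) if t not in STOP_WORDS]
        for doc in docs
    ]
    # Alphabetical vocabulary: position in this list is the column index
    vocab = sorted({t for tokens in tokenized for t in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing terms count as 0
        rows.append([counts[term] for term in vocab])
    return vocab, rows


vocab, dtm = build_dtm(corpus)
print(vocab)
for row in dtm:
    print(row)
```

Each row is the BoW vector of one document, and stacking the rows gives the document-term matrix. Single-character tokens such as "a" never reach the stop-word filter because the token pattern already discards them.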


TF-IDF (Term Frequency-Inverse Document Frequency) Analysis

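   - TF-IDF assumes that a keyword is not simply a word with a high overall count, but a word that occurs heavily in one particular document and therefore captures that document's topic
   - When a word appears often in one document and rarely in the others, it is treated as a keyword of that document
   - The TF-IDF weight is the product of term frequency and inverse document frequency: tf-idf(t, d) = tf(t, d) × idf(t)

   - Term Frequency (TF): how many times a given term occurs in a given document
   - Inverse Document Frequency (IDF): a weight that grows as fewer documents in the corpus contain the term
   - TF-IDF
      - To compute TF-IDF conveniently, use scikit-learn's TfidfVectorizer
      - Fitting it on the same corpus as above counts the terms and converts the counts to TF-IDF weights in one step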

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

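# Learn the vocabulary and the IDF weights from the corpus
tfidf = TfidfVectorizer(stop_words='english').fit(corpus)

print(tfidf.transform(corpus).toarray())  # TF-IDF matrix, one row per document
print(tfidf.vocabulary_)                  # mapping from term to column index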

[[0.29730323 0.29730323 0.         0.         0.59460647 0.45221354
  0.29730323 0.         0.29730323 0.29730323 0.         0.        ]
 [0.         0.         0.         0.         0.         0.52753275
  0.         0.34682109 0.         0.         0.69364217 0.34682109]
 [0.         0.         0.70710678 0.70710678 0.         0.
  0.         0.         0.         0.         0.         0.        ]]
{'think': 8, 'like': 4, 'man': 5, 'action': 1, 'act': 0, 'od': 6, 'thought': 9, 'try': 10, 'success': 7, 'value': 11, 'liberty': 3, 'death': 2}


   - Convert the result to a DataFrame to inspect it more conveniently

In [7]:
# Order the terms by their column index so they line up with the matrix columns
columns = sorted(tfidf.vocabulary_, key=tfidf.vocabulary_.get)
    
df = pd.DataFrame(tfidf.transform(corpus).toarray(), columns=columns)
df

# In the output below, TF-IDF rates "like" and "man" as the key terms of row 0

        act    action     death   liberty      like       man        od   success     think   thought       try     value
0  0.297303  0.297303  0.000000  0.000000  0.594606  0.452214  0.297303  0.000000  0.297303  0.297303  0.000000  0.000000
1  0.000000  0.000000  0.000000  0.000000  0.000000  0.527533  0.000000  0.346821  0.000000  0.000000  0.693642  0.346821
2  0.000000  0.000000  0.707107  0.707107  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000
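
To see where these weights come from, the sketch below recomputes them from the raw counts without scikit-learn. It assumes TfidfVectorizer's default settings (smooth_idf=True, norm='l2', sublinear_tf=False), under which idf(t) = ln((1 + n) / (1 + df(t))) + 1 and each row is then scaled to unit Euclidean length. The count matrix is copied from the document-term matrix printed earlier, and the name tfidf_rows is introduced here purely for illustration.

```python
import math

# Document-term counts, columns in alphabetical term order:
# act, action, death, liberty, like, man, od, success, think, thought, try, value
counts = [
    [1, 1, 0, 0, 2, 2, 1, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 2, 0, 1, 0, 0, 2, 1],
    [0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
]


def tfidf_rows(counts):
    """TF-IDF with smoothed IDF and L2 row normalization (assumed defaults)."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: number of documents that contain each term
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    # Smoothed inverse document frequency
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    result = []
    for row in counts:
        weighted = [tf * w for tf, w in zip(row, idf)]
        norm = math.sqrt(sum(x * x for x in weighted))
        # Scale the row to unit length; leave an all-zero row as zeros
        result.append([x / norm if norm else 0.0 for x in weighted])
    return result


weights = tfidf_rows(counts)
for row in weights:
    print([round(x, 6) for x in row])
```

In row 0, "like" and "man" both occur twice, yet "like" ends up with the larger weight: "man" also appears in the second document, so its document frequency is 2 and its IDF is smaller.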
