### TF-IDF: A Measure of Word Importance

---

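- TF (Term Frequency): how often a term appears within a single document
- IDF (Inverse Document Frequency): the inverse of the document frequency, where document frequency is how many documents in the **corpus** contain the term
  - This lowers the weight of common words that appear throughout the corpus (e.g., "the", "is").
- TF-IDF: TF × IDF (it grows with TF and with IDF, so it shrinks as the document frequency rises)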
<br></br>
Let $t$ be a term and $d$ a document, where $d$ belongs to the corpus $D$. We then define the following.
<br></br>
- Boolean frequency: $tf(t,d) = 1$ if $t$ occurs in $d$, and $0$ otherwise

$f(t,d)$: the total number of times $t$ occurs in $d$
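
For reference, the definitions can be written out as follows. This takes the raw count as the term frequency and puts $+1$ in the IDF denominator, which is what the code in this notebook implements; $N = |D|$ is the number of documents. Other texts smooth IDF differently, so treat this as one variant rather than the only definition.

$$
\begin{aligned}
tf(t,d) &= f(t,d) \\
df(t,D) &= |\{\, d \in D : t \in d \,\}| \\
idf(t,D) &= \log\frac{N}{1 + df(t,D)} \\
tfidf(t,d,D) &= tf(t,d) \cdot idf(t,D)
\end{aligned}
$$

With $N = 4$, a term found in one document gets $idf = \log(4/2) \approx 0.693$, and a term found in two documents gets $idf = \log(4/3) \approx 0.288$.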

In [188]:
docs = [
'그는 오늘 아침에 운동을 했다.',
'나는 주말에 가족과 시간을 보냈다.',
'그녀는 이번 주말에 친구를 만나기로 했다.',
'나는 어제 친구와 저녁을 먹었다.'
]

In [189]:
# The for clauses run left to right (outer loop first); the expression that produces each element always comes first
[word for doc in docs for word in doc.split()]

['그는',
 '오늘',
 '아침에',
 '운동을',
 '했다.',
 '나는',
 '주말에',
 '가족과',
 '시간을',
 '보냈다.',
 '그녀는',
 '이번',
 '주말에',
 '친구를',
 '만나기로',
 '했다.',
 '나는',
 '어제',
 '친구와',
 '저녁을',
 '먹었다.']
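
The comprehension above is equivalent to two nested `for` loops written in the same left-to-right order. A minimal sketch, using a hypothetical two-sentence input (`sample`, `flat`, and `looped` are names chosen for this illustration):

```python
# Hypothetical input, chosen only for illustration.
sample = ['a b', 'c d e']

# Comprehension form: the outer loop (doc) is written first, the inner loop (word) second.
flat = [word for doc in sample for word in doc.split()]

# Equivalent explicit loops, nested in the same order.
looped = []
for doc in sample:
    for word in doc.split():
        looped.append(word)

print(flat == looped)
```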

In [190]:
vocab = list(set([word for doc in docs for word in doc.split()]))
vocab

['오늘',
 '했다.',
 '아침에',
 '시간을',
 '주말에',
 '친구와',
 '그는',
 '만나기로',
 '먹었다.',
 '나는',
 '운동을',
 '친구를',
 '저녁을',
 '보냈다.',
 '그녀는',
 '이번',
 '어제',
 '가족과']

In [191]:
vocab.sort()
vocab

['가족과',
 '그녀는',
 '그는',
 '나는',
 '만나기로',
 '먹었다.',
 '보냈다.',
 '시간을',
 '아침에',
 '어제',
 '오늘',
 '운동을',
 '이번',
 '저녁을',
 '주말에',
 '친구를',
 '친구와',
 '했다.']
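
Because a `set` of strings has no guaranteed iteration order, the unsorted vocabulary can come out differently between interpreter runs. Sorting fixes the column order. The two steps can be combined in a single expression, as in this sketch; `word_to_col` is a hypothetical helper mapping added for illustration and is not part of the notebook:

```python
docs = [
    '그는 오늘 아침에 운동을 했다.',
    '나는 주말에 가족과 시간을 보냈다.',
    '그녀는 이번 주말에 친구를 만나기로 했다.',
    '나는 어제 친구와 저녁을 먹었다.',
]

# Deduplicate with a set comprehension, then sort for a reproducible order.
vocab = sorted({word for doc in docs for word in doc.split()})

# Hypothetical lookup from each word to its column index.
word_to_col = {word: col for col, word in enumerate(vocab)}

print(len(vocab), word_to_col['나는'])
```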

In [192]:
N = len(docs)

In [193]:
dtm = []

for doc in docs:
    tokens = doc.split()
    # Count whole tokens; str.count on the raw string would also match substrings
    row = [tokens.count(word) for word in vocab]
    dtm.append(row)

In [194]:
dtm

[[0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1],
 [1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
 [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1],
 [0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0]]
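
An alternative sketch of the same document-term matrix using `collections.Counter`. Counting tokens once per document avoids rescanning the text for every vocabulary word, and it matches whole tokens only, whereas `str.count` on the raw sentence would also count a string such as `'친구'` inside `'친구를'`. This is an illustrative variant, not the method used in the cell above.

```python
from collections import Counter

docs = [
    '그는 오늘 아침에 운동을 했다.',
    '나는 주말에 가족과 시간을 보냈다.',
    '그녀는 이번 주말에 친구를 만나기로 했다.',
    '나는 어제 친구와 저녁을 먹었다.',
]
vocab = sorted({word for doc in docs for word in doc.split()})

dtm = []
for doc in docs:
    counts = Counter(doc.split())                 # token -> occurrences in this document
    dtm.append([counts[word] for word in vocab])  # a Counter returns 0 for missing tokens

for row in dtm:
    print(row)
```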

In [195]:
import pandas as pd

dtm = pd.DataFrame(dtm, columns=vocab, index=['1번문서', '2번문서', '3번문서', '4번문서'])

In [196]:
dtm

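| | 가족과 | 그녀는 | 그는 | 나는 | 만나기로 | 먹었다. | 보냈다. | 시간을 | 아침에 | 어제 | 오늘 | 운동을 | 이번 | 저녁을 | 주말에 | 친구를 | 친구와 | 했다. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1번문서 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2번문서 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3번문서 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 |
| 4번문서 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |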


In [197]:
import math

def tf(t, d):
    # Count how many times term t occurs among the whitespace tokens of document d
    return d.split().count(t)

def df(t, D):
    # Count the documents in corpus D that contain t at least once (presence only)
    return sum(1 for d in D if t in d.split())

def idf(t, D):
    # Log of the inverse document frequency; the +1 keeps the denominator nonzero when df is 0
    return math.log(len(D) / (df(t, D) + 1))

def tf_idf(t, d, D):
    # TF-IDF is the product of tf and idf
    return tf(t, d) * idf(t, D)
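
A quick check of the IDF definition on two terms from the corpus, plus one edge case worth knowing about: with `+1` in the denominator, a term that appears in every document gets a negative weight, since log(N / (N + 1)) < 0. The sketch below redefines `df` and `idf` on whitespace tokens so it runs on its own; the `everywhere` corpus is hypothetical and exists only to trigger the edge case.

```python
import math

docs = [
    '그는 오늘 아침에 운동을 했다.',
    '나는 주말에 가족과 시간을 보냈다.',
    '그녀는 이번 주말에 친구를 만나기로 했다.',
    '나는 어제 친구와 저녁을 먹었다.',
]

def df(term, corpus):
    # Number of documents that contain the term as a whole token
    return sum(1 for doc in corpus if term in doc.split())

def idf(term, corpus):
    # Same variant as above: log(N / (df + 1))
    return math.log(len(corpus) / (df(term, corpus) + 1))

idf_rare = idf('그는', docs)    # df = 1, so log(4 / 2)
idf_common = idf('나는', docs)  # df = 2, so log(4 / 3)

# Hypothetical corpus in which one term occurs in every document.
everywhere = ['나는 a', '나는 b', '나는 c', '나는 d']
idf_everywhere = idf('나는', everywhere)  # log(4 / 5), which is negative

print(round(idf_rare, 6), round(idf_common, 6), idf_everywhere < 0)
```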


In [198]:
result = []

for idx in range(N):
    result.append([])
    doc = docs[idx]

    for v_idx in range(len(vocab)):
        token = vocab[v_idx]
        result[-1].append(tf(token, doc))

_tf = pd.DataFrame(result, columns=vocab)

In [199]:
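_tf  # A leading underscore conventionally marks a name as internal (constants use UPPER_CASE); here it also avoids shadowing the tf() function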

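| | 가족과 | 그녀는 | 그는 | 나는 | 만나기로 | 먹었다. | 보냈다. | 시간을 | 아침에 | 어제 | 오늘 | 운동을 | 이번 | 저녁을 | 주말에 | 친구를 | 친구와 | 했다. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 |
| 3 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |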


In [200]:
result = []
for v_idx in range(len(vocab)):
    token = vocab[v_idx]
    result.append(idf(token, docs))

In [201]:
_idf = pd.DataFrame(result, index=vocab, columns=['IDF'])

In [202]:
_idf

| | IDF |
| --- | --- |
| 가족과 | 0.693147 |
| 그녀는 | 0.693147 |
| 그는 | 0.693147 |
| 나는 | 0.287682 |
| 만나기로 | 0.693147 |
| 먹었다. | 0.693147 |
| 보냈다. | 0.693147 |
| 시간을 | 0.693147 |
| 아침에 | 0.693147 |
| 어제 | 0.693147 |


In [203]:
result = []
for idx in range(N):
    result.append([])
    d = docs[idx]

    for v_idx in range(len(vocab)):
        t = vocab[v_idx]
        result[-1].append(tf_idf(t,d,docs))
        

In [204]:
_tf_idf = pd.DataFrame(result, columns = vocab)

In [205]:
_tf_idf

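| | 가족과 | 그녀는 | 그는 | 나는 | 만나기로 | 먹었다. | 보냈다. | 시간을 | 아침에 | 어제 | 오늘 | 운동을 | 이번 | 저녁을 | 주말에 | 친구를 | 친구와 | 했다. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.0 | 0.0 | 0.693147 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.693147 | 0.0 | 0.693147 | 0.693147 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.287682 |
| 1 | 0.693147 | 0.0 | 0.0 | 0.287682 | 0.0 | 0.0 | 0.693147 | 0.693147 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.287682 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.693147 | 0.0 | 0.0 | 0.693147 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.693147 | 0.0 | 0.287682 | 0.693147 | 0.0 | 0.287682 |
| 3 | 0.0 | 0.0 | 0.0 | 0.287682 | 0.0 | 0.693147 | 0.0 | 0.0 | 0.0 | 0.693147 | 0.0 | 0.0 | 0.0 | 0.693147 | 0.0 | 0.0 | 0.693147 | 0.0 |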


In [206]:
from sklearn.feature_extraction.text import CountVectorizer

In [207]:
vector = CountVectorizer()

In [208]:
vector.fit_transform(docs).toarray()

array([[0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1],
       [1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
       [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1],
       [0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0]],
      dtype=int64)

In [209]:
vector.vocabulary_

{'그는': 2,
 '오늘': 10,
 '아침에': 8,
 '운동을': 11,
 '했다': 17,
 '나는': 3,
 '주말에': 14,
 '가족과': 0,
 '시간을': 7,
 '보냈다': 6,
 '그녀는': 1,
 '이번': 12,
 '친구를': 15,
 '만나기로': 4,
 '어제': 9,
 '친구와': 16,
 '저녁을': 13,
 '먹었다': 5}
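
Note that the keys here are `'했다'` and `'먹었다'`, without the trailing period that the hand-built vocabulary kept. As far as I know, `CountVectorizer` by default lowercases the text and extracts tokens with the regular expression `(?u)\b\w\w+\b`, which drops punctuation and any token shorter than two characters. The sketch below imitates that behavior with the standard `re` module; `tokenize` is a hypothetical helper, not a scikit-learn function.

```python
import re

# Assumed to match CountVectorizer's default token_pattern.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    # Lowercase first, then keep runs of two or more word characters.
    return TOKEN_PATTERN.findall(text.lower())

tokens = tokenize('그는 오늘 아침에 운동을 했다.')
short = tokenize('I am a cat.')  # single-character tokens are dropped

print(tokens)
print(short)
```

A whitespace split would instead keep `'했다.'` with its period, which is why the two vocabularies differ in those entries.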

In [210]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [211]:
tfidf= TfidfVectorizer()

In [212]:
arr = tfidf.fit_transform(docs).toarray()

In [213]:
tfidf.vocabulary_

{'그는': 2,
 '오늘': 10,
 '아침에': 8,
 '운동을': 11,
 '했다': 17,
 '나는': 3,
 '주말에': 14,
 '가족과': 0,
 '시간을': 7,
 '보냈다': 6,
 '그녀는': 1,
 '이번': 12,
 '친구를': 15,
 '만나기로': 4,
 '어제': 9,
 '친구와': 16,
 '저녁을': 13,
 '먹었다': 5}

In [214]:
col = tfidf.get_feature_names_out()

In [215]:
pd.DataFrame(arr, columns=col)

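| | 가족과 | 그녀는 | 그는 | 나는 | 만나기로 | 먹었다 | 보냈다 | 시간을 | 아침에 | 어제 | 오늘 | 운동을 | 이번 | 저녁을 | 주말에 | 친구를 | 친구와 | 했다 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.0 | 0.0 | 0.465162 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.465162 | 0.0 | 0.465162 | 0.465162 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.366739 |
| 1 | 0.485461 | 0.0 | 0.0 | 0.382743 | 0.0 | 0.0 | 0.485461 | 0.485461 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.382743 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.436719 | 0.0 | 0.0 | 0.436719 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.436719 | 0.0 | 0.344315 | 0.436719 | 0.0 | 0.344315 |
| 3 | 0.0 | 0.0 | 0.0 | 0.366739 | 0.0 | 0.465162 | 0.0 | 0.0 | 0.0 | 0.465162 | 0.0 | 0.0 | 0.0 | 0.465162 | 0.0 | 0.0 | 0.465162 | 0.0 |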
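
These values differ from the hand-computed table because `TfidfVectorizer` uses a different formula by default. To my understanding, it applies a smoothed IDF, ln((1 + N) / (1 + df)) + 1, and then scales each document row to unit L2 norm. The sketch below reproduces row 0 with the standard library only, under that assumption; `smooth_idf` and `df_by_term` are hypothetical names, and the document frequencies are read off the count matrix above.

```python
import math

N = 4  # number of documents

# Terms of document 0 and how many documents contain each one.
df_by_term = {'그는': 1, '오늘': 1, '아침에': 1, '운동을': 1, '했다': 2}

def smooth_idf(df, n_docs):
    # Assumed scikit-learn default: add one to numerator and denominator, then add 1
    return math.log((1 + n_docs) / (1 + df)) + 1

# Every term occurs once in document 0, so tf = 1.
raw = {term: 1 * smooth_idf(df, N) for term, df in df_by_term.items()}

# Scale the row to unit Euclidean (L2) length.
norm = math.sqrt(sum(weight * weight for weight in raw.values()))
row0 = {term: weight / norm for term, weight in raw.items()}

for term, weight in row0.items():
    print(term, round(weight, 6))
```

The normalization step is why the same term can carry different weights in different rows of the scikit-learn output, even when its count and IDF are the same.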
