# TF-IDF
---
* TF (Term Frequency): how often a term appears within a single document.
* DF (Document Frequency): how many documents in the **corpus** contain the term. Each document counts at most once: only the presence of the term matters, not how many times it occurs.
* IDF (Inverse Document Frequency): the inverse of the document frequency, usually taken on a log scale.
* TF-IDF: TF × IDF
<br><br>
Let $t$ be a term and $d$ a document, where $d$ belongs to the corpus $D$. We then define the following.
<br><br>
* Boolean frequency $tf(t,d)$: 1 if $t$ appears in $d$, otherwise 0
* $f(t,d)$: the raw count of $t$ in $d$
<br><br>
<img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/9116cd515075990e05a5489020384c714408d63f">
<br><br>
<img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/cc5cc57e5b68902a0bfaf42f04e53458503601c4">
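
To make the two notions of term frequency concrete, here is a minimal sketch. The helper names `boolean_tf` and `raw_tf` are illustrative and not part of the notebook, and splitting on whitespace is an assumption about how terms are delimited.

```python
def boolean_tf(term, doc):
    """Return 1 if term occurs among the whitespace tokens of doc, else 0."""
    return int(term in doc.split())


def raw_tf(term, doc):
    """Return how many whitespace tokens of doc are equal to term."""
    return doc.split().count(term)


doc = '아침 사과 금 사과'
print(boolean_tf('사과', doc), raw_tf('사과', doc))  # 1 2
print(boolean_tf('딸기', doc), raw_tf('딸기', doc))  # 0 0
```

The Boolean variant discards repetition, so it is the natural building block for document frequency, while the raw count is what the TF part of TF-IDF typically uses.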

In [35]:
import pandas as pd

In [7]:
docs = [
    '마트 딸기 진짜 비싸다',
    '나는 오늘 아침 사과 먹겠다',
    '아침 사과 금 사과',
    '나는 오늘 저녁 딸기 먹겠다'
]

In [19]:
vocab = list(set([word for doc in docs for word in doc.split()]))

In [21]:
vocab.sort()

In [23]:
vocab

['금', '나는', '딸기', '마트', '먹겠다', '비싸다', '사과', '아침', '오늘', '저녁', '진짜']

In [27]:
N = len(docs)

In [31]:
dtm = []

for doc in docs:
    # Count whole whitespace-separated tokens rather than substrings
    tokens = doc.split()
    dtm.append([tokens.count(word) for word in vocab])

In [41]:
dtm = pd.DataFrame(dtm, columns=vocab, index=['1번문서','2번문서','3번문서','4번문서'])

In [43]:
dtm

Unnamed: 0,금,나는,딸기,마트,먹겠다,비싸다,사과,아침,오늘,저녁,진짜
1번문서,0,0,1,1,0,1,0,0,0,0,1
2번문서,0,1,0,0,1,0,1,1,1,0,0
3번문서,1,0,0,0,0,0,2,1,0,0,0
4번문서,0,1,1,0,1,0,0,0,1,1,0
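
An equivalent way to build the same document-term matrix is to count each document's tokens once with `collections.Counter`. This is a sketch under the assumption that terms are whitespace-separated; the names `counts` and `dtm_rows` are illustrative. It also shows why token counting is safer than `str.count` on the raw string, which counts substring matches.

```python
from collections import Counter

docs = [
    '마트 딸기 진짜 비싸다',
    '나는 오늘 아침 사과 먹겠다',
    '아침 사과 금 사과',
    '나는 오늘 저녁 딸기 먹겠다',
]
vocab = sorted({word for doc in docs for word in doc.split()})

# One Counter per document; words absent from a document count as 0
counts = [Counter(doc.split()) for doc in docs]
dtm_rows = [[c[word] for word in vocab] for c in counts]

for row in dtm_rows:
    print(row)

# Substring counting can disagree with token counting
print('지금 간다'.count('금'))          # 1: '금' is found inside '지금'
print('지금 간다'.split().count('금'))  # 0: no token equals '금'
```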


In [46]:
import math

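def tf(t, d):
    # Raw count of term t among the whitespace-separated tokens of document d
    return d.split().count(t)

def df(t, D):
    # Number of documents in corpus D that contain t.
    # Only presence is checked, so each document adds at most 1.
    return sum(t in d.split() for d in D)

def idf(t, D):
    # Log-scaled inverse document frequency: ln(N / (df + 1)).
    # The +1 avoids division by zero when no document contains t.
    N = len(D)
    return math.log(N / (df(t, D) + 1))

def tf_idf(t, d, D):
    # TF-IDF weight: tf * idf
    return tf(t, d) * idf(t, D)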
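
The `+ 1` in the denominator of `idf` has a side effect worth seeing: a term that appears in all but one document gets an IDF of exactly 0, and a term that appears in every document gets a negative IDF. The sketch below evaluates the same formula for every possible document frequency in a 4-document corpus; `idf_from_counts` is an illustrative helper, not part of the notebook.

```python
import math


def idf_from_counts(df_count, n_docs):
    """Same formula as idf() above, taking the document frequency directly."""
    return math.log(n_docs / (df_count + 1))


n_docs = 4
for df_count in range(n_docs + 1):
    print(df_count, round(idf_from_counts(df_count, n_docs), 6))
```

Other smoothing choices exist; for instance, adding 1 to the result keeps the weight positive for terms that occur everywhere.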

In [48]:
# One row per document, one column per vocabulary term
result = [[tf(token, doc) for token in vocab] for doc in docs]

In [60]:
_tf = pd.DataFrame(result, columns=vocab)

In [62]:
_tf

Unnamed: 0,금,나는,딸기,마트,먹겠다,비싸다,사과,아침,오늘,저녁,진짜
0,0,0,1,1,0,1,0,0,0,0,1
1,0,1,0,0,1,0,1,1,1,0,0
2,1,0,0,0,0,0,2,1,0,0,0
3,0,1,1,0,1,0,0,0,1,1,0


In [64]:
# IDF of every vocabulary term over the whole corpus
result = [idf(token, docs) for token in vocab]

In [70]:
_idf = pd.DataFrame(result, index=vocab, columns=['IDF'])

In [72]:
_idf

Unnamed: 0,IDF
금,0.693147
나는,0.287682
딸기,0.287682
마트,0.693147
먹겠다,0.287682
비싸다,0.693147
사과,0.287682
아침,0.287682
오늘,0.287682
저녁,0.693147
진짜,0.693147


In [74]:
# TF-IDF weight of every vocabulary term in every document
result = [[tf_idf(t, d, docs) for t in vocab] for d in docs]

In [86]:
_tf_idf = pd.DataFrame(result, columns=vocab)

In [88]:
_tf_idf

Unnamed: 0,금,나는,딸기,마트,먹겠다,비싸다,사과,아침,오늘,저녁,진짜
0,0.0,0.0,0.287682,0.693147,0.0,0.693147,0.0,0.0,0.0,0.0,0.693147
1,0.0,0.287682,0.0,0.0,0.287682,0.0,0.287682,0.287682,0.287682,0.0,0.0
2,0.693147,0.0,0.0,0.0,0.0,0.0,0.575364,0.287682,0.0,0.0,0.0
3,0.0,0.287682,0.287682,0.0,0.287682,0.0,0.0,0.0,0.287682,0.693147,0.0
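
As a check on the table, the entry for '사과' in row 2 (the third document, '아침 사과 금 사과') can be worked out by hand. The term occurs twice in that document and appears in 2 of the $N = 4$ documents, so with the `idf` definition used here:

$$
\mathrm{idf}(\text{사과}, D) = \ln\frac{N}{\mathrm{df} + 1} = \ln\frac{4}{2 + 1} \approx 0.287682,
\qquad
\text{tf-idf} = 2 \times 0.287682 \approx 0.575364
$$

This matches the value shown in the 사과 column of row 2.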


In [90]:
from sklearn.feature_extraction.text import CountVectorizer

In [92]:
vector = CountVectorizer()

In [96]:
vector.fit_transform(docs).toarray()

array([[0, 1, 1, 0, 1, 0, 0, 0, 0, 1],
       [1, 0, 0, 1, 0, 1, 1, 1, 0, 0],
       [0, 0, 0, 0, 0, 2, 1, 0, 0, 0],
       [1, 1, 0, 1, 0, 0, 0, 1, 1, 0]], dtype=int64)

In [98]:
vector.vocabulary_

{'마트': 2,
 '딸기': 1,
 '진짜': 9,
 '비싸다': 4,
 '나는': 0,
 '오늘': 7,
 '아침': 6,
 '사과': 5,
 '먹겠다': 3,
 '저녁': 8}
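
The vocabulary above has 10 entries, one fewer than the 11-word `vocab` built by hand: '금' is missing. `CountVectorizer`'s default `token_pattern` is `(?u)\b\w\w+\b`, which only matches tokens of two or more word characters. The sketch below applies that pattern directly with `re` to show the effect; the constant names are illustrative.

```python
import re

DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"  # CountVectorizer's default token_pattern
SINGLE_CHAR_PATTERN = r"(?u)\b\w+\b"      # alternative that keeps 1-character tokens

doc = '아침 사과 금 사과'
default_tokens = re.findall(DEFAULT_TOKEN_PATTERN, doc)
all_tokens = re.findall(SINGLE_CHAR_PATTERN, doc)

print(default_tokens)  # ['아침', '사과', '사과']
print(all_tokens)      # ['아침', '사과', '금', '사과']
```

Passing `token_pattern=r"(?u)\b\w+\b"` to the vectorizer should keep single-character words such as '금', which matters for Korean, where many meaningful nouns are one syllable long.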

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
arr = tfidf.fit_transform(docs).toarray()
col = tfidf.get_feature_names_out()

In [118]:
pd.DataFrame(arr, columns=col)

Unnamed: 0,나는,딸기,마트,먹겠다,비싸다,사과,아침,오늘,저녁,진짜
0,0.0,0.414289,0.525473,0.0,0.525473,0.0,0.0,0.0,0.0,0.525473
1,0.447214,0.0,0.0,0.447214,0.0,0.447214,0.447214,0.447214,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.894427,0.447214,0.0,0.0,0.0
3,0.422247,0.422247,0.0,0.422247,0.0,0.0,0.0,0.422247,0.535566,0.0
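
These numbers differ from the hand-computed table because `TfidfVectorizer`, with its default settings, uses a smoothed IDF, $\ln\frac{1 + N}{1 + \mathrm{df}} + 1$, and then scales each row to unit L2 norm. The following is a plain-Python sketch of that computation, not the library's implementation; names such as `smooth_idf` and `rows` are illustrative.

```python
import math
import re
from collections import Counter

docs = [
    '마트 딸기 진짜 비싸다',
    '나는 오늘 아침 사과 먹겠다',
    '아침 사과 금 사과',
    '나는 오늘 저녁 딸기 먹겠다',
]

# Tokenize with the default token pattern (tokens of 2+ word characters)
tokenized = [re.findall(r"(?u)\b\w\w+\b", doc) for doc in docs]
vocab = sorted({tok for toks in tokenized for tok in toks})
n_docs = len(docs)

# Document frequency: number of documents containing each term
doc_freq = Counter(tok for toks in tokenized for tok in set(toks))


def smooth_idf(term):
    """Smoothed IDF: ln((1 + N) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1


rows = []
for toks in tokenized:
    counts = Counter(toks)
    weights = [counts[term] * smooth_idf(term) for term in vocab]
    norm = math.sqrt(sum(w * w for w in weights))  # L2 norm of the row
    rows.append([w / norm for w in weights])

for row in rows:
    print([round(w, 6) for w in row])
```

Because of the normalization, a document whose terms all share the same IDF (row 1) ends up with identical weights of $1/\sqrt{5} \approx 0.447214$.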


In [120]:
f = open('./강경애-원고료_이백원-신가정.txt', 'r', encoding='utf8')

In [122]:
txt = f.readlines()
f.close()  # release the file handle once the lines have been read

In [130]:
txt2 = ' '.join(txt).split('.\n')

In [136]:
txt2 = [t + '.' for t in txt2]

In [138]:
txt2[5]

'  K야, 너도 짐작하는지 모르겠다마는! 나는 어려서부터 순조롭지 못한 가정\n 에서 자랐고 또 커서까지라도 순경에 처하지 못한 나는 그나마 쥐꼬리만큼\n 배운 이 지식까지라도 우리 형부의 덕이었니라. 그러니 어려서부터 명일빔\n 한 벌 색들여 못 입어 봤으며 먹는 것이란 언제나 조밥이었구나. 그러고 학\n 교에 다니면서도 맘대로 학용품을 어디 써보았겠니. 학기초마다 책을 못 사\n 서 울고 울다가는 겨우 남의 낡은 책을 얻어 가졌으며 종이와 붓이 없어 나\n 의 조고만 가슴은 그 몇 번이나 달막거리었는지 모른다.'
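
One thing to watch with this kind of splitting: if the text ends with `'.\n'`, `split('.\n')` yields a trailing empty piece, and appending `'.'` to every piece turns it into a lone `'.'`. The sketch below filters empty pieces before restoring the period. The sample `lines` are hypothetical stand-ins for the output of `readlines()`.

```python
# Hypothetical lines, shaped like the output of f.readlines()
lines = ['하늘이 맑다.\n', '바람이 분다.\n', '\n', '비가 온다.\n']

text = ''.join(lines)

# Split at sentence-ending periods, drop empty pieces, restore the period
sentences = [piece.strip() + '.' for piece in text.split('.\n') if piece.strip()]

print(sentences)  # ['하늘이 맑다.', '바람이 분다.', '비가 온다.']
```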

In [140]:
from konlpy.tag import Okt

In [142]:
okt = Okt()

In [148]:
target = [okt.nouns(t) for t in txt2[5:25]]

In [152]:
vec = TfidfVectorizer()

In [172]:
target2 = [' '.join(t) for t in target]

In [174]:
target_matrix = vec.fit_transform(target2)

In [182]:
df_val = target_matrix.toarray()

In [184]:
df_col = vec.get_feature_names_out()

In [186]:
# Note: this rebinds the name df, which until now referred to the df() function defined earlier
df = pd.DataFrame(df_val, columns=df_col)

In [202]:
for c, v in enumerate(df.iloc[3]):
    if v != 0:  # print only the non-zero weights; the other 133 entries are 0.0
        print(c, v)

65 0.7424277384490324
68 0.6699261550211748


In [198]:
df.iloc[3, 65], df.iloc[3, 68]

(0.7424277384490324, 0.6699261550211748)

In [200]:
df.columns[65], df.columns[68]

('눈사람', '다가')

In [204]:
txt2[8]

' 나는 벌을 서면서도 눈사람의 그 입과 그 눈이 우스워서 킥 하고 웃다가 또\n 울다가 하였다.'
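
Row 3 of `df` corresponds to `txt2[8]` because `target` starts at `txt2[5]`. Rather than scanning the row by index and then looking up column names, the non-zero terms can be paired with their weights directly. The sketch below uses a short hypothetical `feature_names` list and `row` in place of `vec.get_feature_names_out()` and `df.iloc[3]`; only the two non-zero weights are taken from the output above.

```python
# Hypothetical stand-ins for vec.get_feature_names_out() and df.iloc[3]
feature_names = ['가슴', '눈사람', '다가', '학교']
row = [0.0, 0.7424277384490324, 0.6699261550211748, 0.0]

# Pair each non-zero weight with its term, highest weight first
nonzero_terms = sorted(
    ((name, weight) for name, weight in zip(feature_names, row) if weight > 0),
    key=lambda pair: pair[1],
    reverse=True,
)

print(nonzero_terms)
```

Single-character nouns in the sentence, if the tagger extracts any, would not show up here, since the vectorizer's default token pattern drops tokens shorter than two characters.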