## TF-IDF (Term Frequency-Inverse Document Frequency)
===> A method that uses term frequency and inverse document frequency to give each word in the DTM (document-term matrix) a weight reflecting its importance


- The importance of each word in the DTM can be computed.
- Documents can be compared using more information than the raw DTM provides.
- However, TF-IDF does not always perform better than a plain DTM.

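### How it is computed
- TF-IDF = TF x IDF
- tf(d,t): the number of times term t appears in document d
- df(t): the number of documents in which term t appears
- idf(t): a value inversely related to df(t); it depends only on the term, not on a particular document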
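
As a concrete form of the definitions above, the hand-written implementation in this notebook takes the natural log of N divided by df(t) + 1, where N is the total number of documents. Other TF-IDF variants use different smoothing, so this should be read as the notebook's choice rather than the only definition:

$$
\mathrm{tfidf}(d,t) = \mathrm{tf}(d,t) \times \mathrm{idf}(t),
\qquad
\mathrm{idf}(t) = \ln\frac{N}{1 + \mathrm{df}(t)}
$$

For example, with N = 4 and df(t) = 2, idf(t) = ln(4/3) ≈ 0.2877.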

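===> TF-IDF treats a word that appears frequently across all documents as unimportant, and a word that appears frequently only in particular documents as important. <br>
<b>In short, a low TF-IDF value means low importance, and a high value means high importance.</b>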
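
To make the intuition concrete, here is a minimal sketch that compares the IDF of a word found in every document with that of a word found in only one. The toy corpus and the names `corpus`, `n_docs`, `idf_common`, and `idf_rare` are assumptions made for illustration; the formula is the `log(N / (df + 1))` variant used later in this notebook.

```python
from math import log

# Hypothetical toy corpus, already split into tokens.
corpus = [
    ['the', 'cat', 'sat'],
    ['the', 'dog', 'ran'],
    ['the', 'cat', 'ran'],
    ['the', 'bird', 'sang'],
]
n_docs = len(corpus)

def idf(term):
    # Number of documents that contain the term.
    df = sum(term in doc for doc in corpus)
    return log(n_docs / (df + 1))

idf_common = idf('the')   # appears in all 4 documents
idf_rare = idf('bird')    # appears in 1 document

print(idf_common, idf_rare)
```

With this smoothing, a word that occurs in every document ends up with a slightly negative weight, while the rare word gets log(2).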


In [1]:
import pandas as pd
from math import log

In [2]:
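docs = [
  '먹고 싶은 사과',           # "the apple I want to eat"
  '먹고 싶은 바나나',         # "the banana I want to eat"
  '길고 노란 바나나 바나나',  # "long yellow banana banana"
  '저는 과일이 좋아요'        # "I like fruit"
]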

In [17]:
vocab = list(set(w for doc in docs for w in doc.split()))

In [18]:
vocab.sort()

In [19]:
vocab

['과일이', '길고', '노란', '먹고', '바나나', '사과', '싶은', '저는', '좋아요']

In [34]:
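N = len(docs)  # total number of documents

def tf(t, d):
    # Term frequency: occurrences of token t in document d.
    # Count whole tokens; str.count on the raw string would also match substrings.
    return d.split().count(t)

def idf(t):
    # Document frequency: number of documents that contain token t.
    df = sum(t in doc.split() for doc in docs)
    # The +1 in the denominator avoids division by zero for unseen terms.
    return log(N / (df + 1))

def tfidf(t, d):
    return tf(t, d) * idf(t)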
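
Counting on the raw string and counting on tokens can give different answers, because `str.count` and `in` on strings match substrings. A small sketch of the difference, using a made-up English sentence (`doc` here is illustrative and not part of the corpus above):

```python
doc = 'the theme of the thesis'

# Substring matching also finds 'the' inside 'theme' and 'thesis'.
substring_count = doc.count('the')

# Token matching only counts whole whitespace-separated words.
token_count = doc.split().count('the')

print(substring_count, token_count)
```

The Korean corpus in this notebook happens to give the same result either way, but token-based counting is the safer default.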

In [27]:
# Build the document-term matrix: one row per document, one column per word.
result = [[tf(t, d) for t in vocab] for d in docs]

dtm = pd.DataFrame(result, columns=vocab)

In [28]:
dtm

Unnamed: 0,과일이,길고,노란,먹고,바나나,사과,싶은,저는,좋아요
0,0,0,0,1,0,1,1,0,0
1,0,0,0,1,1,0,1,0,0
2,0,1,1,0,2,0,0,0,0
3,1,0,0,0,0,0,0,1,1


In [29]:
# Collect the IDF of every word in the vocabulary.
result = [idf(t) for t in vocab]

# Use a new name so the DataFrame does not shadow the idf() function.
idf_df = pd.DataFrame(result, index=vocab, columns=['idf'])

In [30]:
idf_df

Unnamed: 0,idf
과일이,0.693147
길고,0.693147
노란,0.693147
먹고,0.287682
바나나,0.287682
사과,0.693147
싶은,0.287682
저는,0.693147
좋아요,0.693147


In [35]:
# TF-IDF matrix: tfidf(t, d) for every document/word pair.
result = [[tfidf(t, d) for t in vocab] for d in docs]

In [36]:
# Use a new name so the DataFrame does not shadow the tfidf() function.
tfidf_df = pd.DataFrame(result, columns=vocab)
tfidf_df

Unnamed: 0,과일이,길고,노란,먹고,바나나,사과,싶은,저는,좋아요
0,0.0,0.0,0.0,0.287682,0.0,0.693147,0.287682,0.0,0.0
1,0.0,0.0,0.0,0.287682,0.287682,0.0,0.287682,0.0,0.0
2,0.0,0.693147,0.693147,0.0,0.575364,0.0,0.0,0.0,0.0
3,0.693147,0.0,0.0,0.0,0.0,0.0,0.0,0.693147,0.693147
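
The same matrix can also be computed without explicit loops over documents and words. The following is a sketch, not the notebook's original method; it assumes `numpy` is available alongside `pandas`, and the names `n_docs`, `df`, and `tfidf_matrix` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

docs = [
    '먹고 싶은 사과',
    '먹고 싶은 바나나',
    '길고 노란 바나나 바나나',
    '저는 과일이 좋아요',
]
vocab = sorted({w for doc in docs for w in doc.split()})

# Document-term matrix: one row per document, one column per word.
dtm = pd.DataFrame(
    [[doc.split().count(t) for t in vocab] for doc in docs],
    columns=vocab,
)

n_docs = len(docs)
df = (dtm > 0).sum(axis=0)        # number of documents containing each word
idf = np.log(n_docs / (df + 1))   # same log(N / (df + 1)) form as the idf() function
tfidf_matrix = dtm * idf          # the Series is aligned to the columns by label

print(tfidf_matrix.round(6))
```

Multiplying a DataFrame by a Series aligns the Series index with the column labels, so each column is scaled by its own IDF.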


## DTM & TF-IDF (sklearn)

In [37]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

#### Make DTM

In [38]:
corpus = [
    'you know I want your love',
    'I like you',
    'what should I do ',    
]

In [39]:
cv = CountVectorizer()  

In [42]:
print(cv.fit_transform(corpus).toarray())
print(cv.vocabulary_)

[[0 1 0 1 0 1 0 1 1]
 [0 0 1 0 0 0 0 1 0]
 [1 0 0 0 1 0 1 0 0]]
{'you': 7, 'know': 1, 'want': 5, 'your': 8, 'love': 3, 'like': 2, 'what': 6, 'should': 4, 'do': 0}
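
Note that `I` is missing from the vocabulary. A rough pure-Python sketch of why, assuming `CountVectorizer`'s default settings (lowercasing, and a token pattern that keeps only tokens of two or more word characters); `tokenize`, `tokens`, and `vocabulary` are illustrative names, not scikit-learn API:

```python
import re

corpus = [
    'you know I want your love',
    'I like you',
    'what should I do ',
]

# Assumed to mirror the default behaviour: lowercase the text, then keep
# tokens made of two or more word characters.
token_pattern = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    return token_pattern.findall(text.lower())

# Columns are assigned in alphabetical order of the tokens.
tokens = sorted({tok for doc in corpus for tok in tokenize(doc)})
vocabulary = {tok: idx for idx, tok in enumerate(tokens)}
dtm = [[tokenize(doc).count(tok) for tok in tokens] for doc in corpus]

print(vocabulary)
print(dtm)
```

The single-character token `i` never matches the pattern, so it gets no column, and each remaining token maps to its alphabetical position.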


In [45]:
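# vocabulary_ is a dict in insertion order, not column order;
# get_feature_names_out() (scikit-learn 1.0+) returns the tokens in column order.
pd.DataFrame(cv.fit_transform(corpus).toarray(), columns=cv.get_feature_names_out())

Unnamed: 0,do,know,like,love,should,want,what,you,your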
0,0,1,0,1,0,1,0,1,1
1,0,0,1,0,0,0,0,1,0
2,1,0,0,0,1,0,1,0,0
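
Because `vocabulary_` maps each token to its column index, column labels have to follow index order rather than the dict's own iteration order. A minimal sketch using the dict printed earlier, copied here as a literal (`insertion_order` and `column_order` are illustrative names):

```python
# Token -> column index, in the order the tokens were first seen.
vocabulary = {'you': 7, 'know': 1, 'want': 5, 'your': 8, 'love': 3,
              'like': 2, 'what': 6, 'should': 4, 'do': 0}

insertion_order = list(vocabulary)                      # order of first appearance
column_order = sorted(vocabulary, key=vocabulary.get)   # order of the matrix columns

print(insertion_order)
print(column_order)
```

Passing the dict itself as `columns` would label the columns in insertion order, which pairs the counts with the wrong words.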


#### Make TF-IDF

In [43]:
tf = TfidfVectorizer().fit(corpus)
print(tf.transform(corpus).toarray())
print(tf.vocabulary_)

[[0.         0.46735098 0.         0.46735098 0.         0.46735098
  0.         0.35543247 0.46735098]
 [0.         0.         0.79596054 0.         0.         0.
  0.         0.60534851 0.        ]
 [0.57735027 0.         0.         0.         0.57735027 0.
  0.57735027 0.         0.        ]]
{'you': 7, 'know': 1, 'want': 5, 'your': 8, 'love': 3, 'like': 2, 'what': 6, 'should': 4, 'do': 0}


In [44]:
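# vocabulary_ is a dict in insertion order, not column order;
# get_feature_names_out() (scikit-learn 1.0+) returns the tokens in column order.
pd.DataFrame(tf.transform(corpus).toarray(), columns=tf.get_feature_names_out())

Unnamed: 0,do,know,like,love,should,want,what,you,your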
0,0.0,0.467351,0.0,0.467351,0.0,0.467351,0.0,0.355432,0.467351
1,0.0,0.0,0.795961,0.0,0.0,0.0,0.0,0.605349,0.0
2,0.57735,0.0,0.0,0.0,0.57735,0.0,0.57735,0.0,0.0
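
These values are on a different scale from the hand-computed TF-IDF earlier in the notebook. With its default settings, `TfidfVectorizer` uses a smoothed IDF, ln((1 + N) / (1 + df)) + 1, and then L2-normalizes each row. The sketch below reproduces that in plain Python under those assumed defaults; the pre-tokenized `docs` (with `I` already dropped) and the helper names are illustrative.

```python
from math import log, sqrt

# Tokens per document after lowercasing and dropping one-character tokens.
docs = [
    ['you', 'know', 'want', 'your', 'love'],
    ['like', 'you'],
    ['what', 'should', 'do'],
]
vocab = sorted({t for d in docs for t in d})
n_docs = len(docs)

def smooth_idf(term):
    # Smoothed IDF: acts as if one extra document contained every term.
    df = sum(term in d for d in docs)
    return log((1 + n_docs) / (1 + df)) + 1

def tfidf_row(doc):
    raw = [doc.count(t) * smooth_idf(t) for t in vocab]
    norm = sqrt(sum(x * x for x in raw))   # L2 norm of the row
    return [x / norm for x in raw]

rows = [tfidf_row(d) for d in docs]
for row in rows:
    print([round(x, 6) for x in row])
```

Because of the `+ 1` term, a word that appears in every document still keeps a positive weight, unlike the `log(N / (df + 1))` form.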
