## Reference:
https://radimrehurek.com/gensim/models/fasttext.html

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer

X_train = ['This is the first document, we can put another document here.', 'That is the second document.','No comment']
X_test = ['This is the third document.']

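# max_df = 0.5: ignore terms that appear in more than 50% of the documents
# min_df = 5: ignore terms that appear in fewer than 5 documents; min_df = 0.1: ignore terms that appear in fewer than 10% of the documents
vectorizer = TfidfVectorizer()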

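# Use X_train to build the vocabulary and collect the statistics needed for IDF (per-term document frequencies and the number of documents)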
vectorizer.fit(X_train)

# Compute the TF-IDF vectors
tfidf_train = vectorizer.transform(X_train)
tfidf_test = vectorizer.transform(X_test)

print(tfidf_test.toarray())
print('\n')
print('Term-to-index mapping: \n{}'.format(vectorizer.vocabulary_))

[[0.         0.         0.         0.45985353 0.         0.
  0.45985353 0.         0.         0.         0.         0.45985353
  0.60465213 0.        ]]


Term-to-index mapping: 
{'this': 12, 'is': 6, 'the': 11, 'first': 4, 'document': 3, 'we': 13, 'can': 1, 'put': 8, 'another': 0, 'here': 5, 'that': 10, 'second': 9, 'no': 7, 'comment': 2}
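
To make the `max_df` / `min_df` comments above concrete, here is a minimal sketch on the same three training sentences. The variable names `drop_common` and `keep_shared` are made up for illustration, and `min_df=2` is chosen only because this toy corpus is too small for `min_df=5`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

X_train = ['This is the first document, we can put another document here.',
           'That is the second document.',
           'No comment']

# max_df=0.5: drop terms found in more than 50% of the documents
# (with 3 documents, that means terms found in 2 or more documents)
drop_common = TfidfVectorizer(max_df=0.5).fit(X_train)

# min_df=2: keep only terms found in at least 2 documents
keep_shared = TfidfVectorizer(min_df=2).fit(X_train)

print(sorted(drop_common.vocabulary_))  # 'document', 'is', 'the' are gone
print(sorted(keep_shared.vocabulary_))  # ['document', 'is', 'the']
```

A float threshold is read as a proportion of documents, while an integer threshold is read as an absolute document count.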

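Because the vocabulary is built from `X_train` only, a word that first appears at transform time (such as "third" in `X_test`) gets no column of its own and is ignored. The sketch below illustrates this under the default settings; the name `unseen_row` is an assumption for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

X_train = ['This is the first document, we can put another document here.',
           'That is the second document.',
           'No comment']

vectorizer = TfidfVectorizer().fit(X_train)

# 'third' never occurred in X_train, so it is not in the vocabulary
print('third' in vectorizer.vocabulary_)  # False

# A document made only of unseen words maps to an all-zero row
unseen_row = vectorizer.transform(['third']).toarray()
print(unseen_row.shape)  # (1, 14)
print(unseen_row.sum())  # 0.0
```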

---

## We can also break `TfidfVectorizer` down into two steps:
1. `CountVectorizer`
2. `TfidfTransformer`
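
As a quick sanity check of this decomposition, the sketch below compares the one-step and two-step results on a small corpus. It assumes default parameters for all three classes; the names `one_step` and `two_step` are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

# One step: raw text -> TF-IDF matrix
one_step = TfidfVectorizer().fit_transform(corpus)

# Two steps: raw text -> term counts -> TF-IDF matrix
counts = CountVectorizer().fit_transform(corpus)
two_step = TfidfTransformer().fit_transform(counts)

print(np.allclose(one_step.toarray(), two_step.toarray()))  # True
```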

In [7]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

In [8]:
# Corpus
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

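## 1. `CountVectorizer`
`fit_transform` converts the words in the corpus into a term-frequency (count) matrix.
* `get_feature_names_out()` (`get_feature_names()` in scikit-learn versions before 1.0) returns every term in the corpus
* `vocabulary_` maps every term in the corpus to its column index
* `toarray()` shows the resulting count matrix as a dense array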

In [9]:
# CountVectorizer tokenizes the text and counts how often each token occurs
vectorizer = CountVectorizer()
count = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out().tolist())  # get_feature_names() was removed in scikit-learn 1.2
print(vectorizer.vocabulary_)
print(count.toarray())

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
{'this': 8, 'is': 3, 'the': 6, 'first': 2, 'document': 1, 'second': 5, 'and': 0, 'third': 7, 'one': 4}
[[0 1 1 1 0 0 1 0 1]
 [0 1 0 1 0 2 1 0 1]
 [1 0 0 0 1 0 1 1 0]
 [0 1 1 1 0 0 1 0 1]]


## 2. `TfidfTransformer` 
Computes the TF-IDF weight of each term from the count matrix produced by `CountVectorizer`.

In [10]:
transformer = TfidfTransformer()
tfidf_matrix = transformer.fit_transform(count)

print(tfidf_matrix.toarray())

[[0.         0.43877674 0.54197657 0.43877674 0.         0.
  0.35872874 0.         0.43877674]
 [0.         0.27230147 0.         0.27230147 0.         0.85322574
  0.22262429 0.         0.27230147]
 [0.55280532 0.         0.         0.         0.55280532 0.
  0.28847675 0.55280532 0.        ]
 [0.         0.43877674 0.54197657 0.43877674 0.         0.
  0.35872874 0.         0.43877674]]
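
To see where these numbers come from, the sketch below recomputes the weights by hand and compares them with `TfidfTransformer`. It assumes the default settings (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`), under which scikit-learn uses idf(t) = ln((1 + n) / (1 + df(t))) + 1 and then L2-normalizes each row. The names `manual` and `reference` are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

count = CountVectorizer().fit_transform(corpus)
tf = count.toarray()                      # raw term counts per document

n_docs = tf.shape[0]
df = (tf > 0).sum(axis=0)                 # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + df)) + 1 # smoothed IDF (assumes smooth_idf=True)

weighted = tf * idf
# L2-normalize each document row (assumes norm='l2')
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

reference = TfidfTransformer().fit_transform(count).toarray()
print(np.allclose(manual, reference))     # True
```

For instance, "the" occurs in all four documents, so its IDF is ln(5/5) + 1 = 1, the lowest possible value, which is why its weight is the smallest nonzero entry in every row.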
