# What is it?

When vectorizing text, a common approach is to represent each document as a vector of word counts. An alternative that is often more useful is to weight each word by its **tf-idf (term frequency-inverse document frequency)** score instead of its raw count. The score is defined as follows:

```
tf(t, d) = number of times term t appears in document d
idf(t, D) = log (total number of documents in corpus D / number of documents in D in which term t appears)
tf-idf(t, d, D) = tf(t, d) * idf(t, D)
```

For example, given the following corpus:

```
Document 1: The quick brown fox
Document 2: The brown brown dog
Document 3: The fox ate the dog
```

“Brown” appears twice in Document 2 and occurs in two of the three documents (Documents 1 and 2), so its tf-idf score within Document 2 is:

```
tf(brown, Document 2) = 2
idf(brown, corpus) = log (3 / 2)
tf-idf(brown, Document 2, corpus) = 2 * log (3 / 2)
```
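
The worked example can be checked with a few lines of plain Python. This is a minimal sketch rather than a reference implementation: the helper names `tf`, `idf`, and `tf_idf` are made up for illustration, the tokenization (lowercasing and splitting on whitespace) is an assumption, and the natural logarithm is assumed because the definition above does not fix a base.

```python
import math

# The three-document example corpus, lowercased and split on whitespace
# (this simple tokenization is an assumption of the sketch).
corpus = [
    "The quick brown fox",
    "The brown brown dog",
    "The fox ate the dog",
]
docs = [text.lower().split() for text in corpus]


def tf(term, doc):
    """Number of times `term` appears in the tokenized document `doc`."""
    return doc.count(term)


def idf(term, docs):
    """Log of (total documents / documents containing `term`)."""
    containing = sum(1 for doc in docs if term in doc)
    # Assumes the term occurs in at least one document;
    # otherwise this would divide by zero.
    return math.log(len(docs) / containing)


def tf_idf(term, doc, docs):
    """Product of the term frequency and the inverse document frequency."""
    return tf(term, doc) * idf(term, docs)


score = tf_idf("brown", docs[1], docs)
print(round(score, 4))  # 2 * ln(3 / 2), roughly 0.8109
```

With a different logarithm base the absolute value changes, but the relative ordering of terms stays the same.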

# Why is it important?

Weighting terms only by the number of times they appear in a document is often suboptimal. For example, words like “the” can appear frequently within a document but carry little meaning. Tf-idf scores mitigate this problem by also taking into account how many documents in the corpus contain the term: a term that occurs in most documents receives a low idf and is down-weighted, while a term concentrated in a few documents keeps a high weight.
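
To make the effect concrete, the sketch below compares raw counts with tf-idf weights for every term in Document 3 of the example corpus. It assumes lowercased, whitespace-split tokens and the natural logarithm; the variable names are illustrative only.

```python
import math
from collections import Counter

# Example corpus, already lowercased and tokenized (an assumption of this sketch).
docs = [
    "the quick brown fox".split(),
    "the brown brown dog".split(),
    "the fox ate the dog".split(),
]

# Document frequency: number of documents that contain each term.
doc_freq = Counter(term for doc in docs for term in set(doc))

target = docs[2]  # "the fox ate the dog"
counts = Counter(target)

# tf-idf weight of each term in the target document.
weights = {
    term: count * math.log(len(docs) / doc_freq[term])
    for term, count in counts.items()
}

for term in counts:
    print(f"{term:>4}  count={counts[term]}  tf-idf={weights[term]:.3f}")
```

Although “the” has the highest raw count in Document 3, it appears in all three documents, so its idf is log (3 / 3) = 0 and its tf-idf weight drops to zero. “ate”, which appears only in this document, receives the largest weight.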


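```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Note: by default, TfidfVectorizer smooths the idf term and L2-normalizes
# each row, so the printed values differ from the plain formula above.
vectorizer = TfidfVectorizer()

corpus = [
    'the brown fox jumped over the brown dog',
    'the quick brown fox',
    'the brown brown dog',
    'the fox ate the dog'
]

# Learn the vocabulary and idf weights, then build the document-term matrix.
X = vectorizer.fit_transform(corpus)

# get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() is its replacement.
print(vectorizer.get_feature_names_out())
print(X.toarray())
```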