# TfidfVectorizer
In a large text corpus, some words are very frequent (e.g. “the”, “a”, “is” in English) and therefore carry very little meaningful information about the actual contents of a document. If we fed the raw counts directly to a classifier, those very frequent terms would overshadow the frequencies of rarer yet more interesting terms.
In order to re-weight the count features into floating point values suitable for use by a classifier, it is very common to apply the tf–idf transform.
Tf means term-frequency while tf–idf means term-frequency times inverse document-frequency:
<img src="images/tfidf-general-formula.png" style="width:450px;height:300px;">
where <i>idf(t)</i> is defined as:
<img src="images/idf-default.png" style="width:450px;height:300px;">
when <i>smooth_idf=True</i> (the default).
If <i>smooth_idf=False</i>, the smoothing is dropped and <i>idf(t)</i> becomes:
<img src="images/tfidf-False.png" style="width:450px;height:300px;"> 
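The two variants can be sketched in a few lines of plain Python. This assumes the formulas documented by scikit-learn, namely ln((1 + n) / (1 + df(t))) + 1 with smoothing and ln(n / df(t)) + 1 without, where n is the number of documents and df(t) the number of documents containing the term t. The helper name `idf` is illustrative and not part of the library.

```python
import math

def idf(df, n_docs, smooth=True):
    """Inverse document frequency of a term found in `df` of `n_docs` documents.

    Hypothetical helper; the formulas are assumed to match scikit-learn's.
    """
    if smooth:
        # Smoothing acts as if one extra document contained every term,
        # which also avoids a division by zero when df == 0.
        return math.log((1 + n_docs) / (1 + df)) + 1
    return math.log(n_docs / df) + 1

# A term present in 1 of 2 documents, then a term present in both.
print(idf(1, 2), idf(2, 2))
print(idf(1, 2, smooth=False), idf(2, 2, smooth=False))
```

In both variants a term that occurs in every document gets an idf of exactly 1, so it is down-weighted relative to rarer terms but never dropped.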
Finally, each resulting document vector is normalized by its Euclidean norm:
<img src="images/euclidian-norm.png" style="width:450px;height:300px;">
## First Example
Let's look at the <b>fit</b> function:

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer

text = ["The fox is on the garden",
        "the cat is green"]
vectorizer = TfidfVectorizer()
vectorizer.fit(text)
print(vectorizer.vocabulary_)
print(vectorizer.idf_)

{u'on': 5, u'green': 3, u'garden': 2, u'is': 4, u'fox': 1, u'the': 6, u'cat': 0}
[1.40546511 1.40546511 1.40546511 1.40546511 1.         1.40546511
 1.        ]


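The second array holds the idf of each word, indexed by the word's position in the vocabulary. “is” and “the” appear in both documents, so they get the lowest value, 1.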
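To make the correspondence explicit, the sketch below pairs each word with its idf. It hard-codes the two outputs printed above (rounded as displayed) rather than calling scikit-learn, and the variable names are only illustrative.

```python
# Values copied from the printed output of the fit example.
vocabulary = {'on': 5, 'green': 3, 'garden': 2, 'is': 4,
              'fox': 1, 'the': 6, 'cat': 0}
idf = [1.40546511, 1.40546511, 1.40546511, 1.40546511,
       1.0, 1.40546511, 1.0]

# vocabulary maps word -> column index, and idf is ordered by column index.
idf_by_word = {word: idf[index] for word, index in vocabulary.items()}

for word in sorted(idf_by_word):
    print(word, idf_by_word[word])
```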

In [15]:
vector = vectorizer.transform(text)
vector.toarray()

array([[0.        , 0.42519636, 0.42519636, 0.        , 0.30253071,
        0.42519636, 0.60506143],
       [0.57615236, 0.        , 0.        , 0.57615236, 0.40993715,
        0.        , 0.40993715]])
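As a sanity check, the rows of this matrix can be reproduced by hand. This is a minimal sketch, assuming the default settings (smoothed idf, Euclidean normalization) and a simple lowercase whitespace split in place of the vectorizer's real tokenizer. All names are illustrative.

```python
import math
from collections import Counter

docs = ["The fox is on the garden", "the cat is green"]
# Simplified tokenization (assumption): lowercase and split on whitespace.
tokenized = [doc.lower().split() for doc in docs]
vocab = sorted(set(word for doc in tokenized for word in doc))

n_docs = len(docs)
# Document frequency: number of documents containing each word.
df = {w: sum(w in doc for doc in tokenized) for w in vocab}
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

def tfidf_row(tokens):
    counts = Counter(tokens)
    raw = [counts[w] * idf[w] for w in vocab]   # tf * idf
    norm = math.sqrt(sum(x * x for x in raw))   # Euclidean norm
    return [x / norm for x in raw]

first_row = tfidf_row(tokenized[0])
print([round(x, 8) for x in first_row])
```

The word “the” occurs twice in the first document, which is why its weight (last column) is twice that of “is” even though both have the same idf.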