# TfidfVectorizer
In a large text corpus, some words are very frequent (e.g. “the”, “a”, “is” in English) and therefore carry very little meaningful information about the actual contents of a document. If we fed the raw count data directly to a classifier, those very frequent terms would overshadow the frequencies of rarer yet more interesting terms.
In order to re-weight the count features into floating point values suitable for use by a classifier, it is very common to use the tf–idf transform.
Tf means term-frequency while tf–idf means term-frequency times inverse document-frequency:
<img src="images/tfidf-general-formula.png" style="width:331px;height:39px;">
where <i>idf(t)</i> is:
<img src="images/idf-default.png" style="width:247px;height:65px;">
when <i>smooth_idf=True</i> (the default).
If <i>smooth_idf=False</i>:
<img src="images/idf-False.png" style="width:247px;height:65px;"> 
Finally, each document vector is normalized by its Euclidean norm:
<img src="images/euclidian-norm.png" style="width:376px;height:66px;">
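As a minimal sketch, assuming the formulas pictured above are the ones scikit-learn documents for <i>smooth_idf=True</i> and <i>smooth_idf=False</i>, the two idf variants and the final normalization can be written in plain Python. The helper names <i>idf</i> and <i>l2_normalize</i> are illustrative and are not part of scikit-learn.

```python
import math

def idf(n_docs, doc_freq, smooth=True):
    """Inverse document frequency of a term found in doc_freq of n_docs documents."""
    if smooth:
        # smooth_idf=True: add one to numerator and denominator, as if one
        # extra document containing every term had been seen
        return math.log((1 + n_docs) / (1 + doc_freq)) + 1
    # smooth_idf=False: plain ratio of document counts
    return math.log(n_docs / doc_freq) + 1

def l2_normalize(weights):
    """Divide a vector by its Euclidean norm (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights] if norm > 0 else list(weights)

# A term present in 1 of 3 documents versus a term present in all 3
print(idf(3, 1), idf(3, 3))
print(idf(3, 1, smooth=False), idf(3, 3, smooth=False))
print(l2_normalize([3.0, 4.0]))
```

In both variants a term that occurs in every document gets an idf of exactly 1, so it is down-weighted relative to rarer terms but never removed.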
## First Example
Let's look at the <b>fit</b> method:

In [41]:
from sklearn.feature_extraction.text import TfidfVectorizer

text = ["The cat is on the and the cat",
        "the",
        "the cat"]
vectorizer = TfidfVectorizer()
vectorizer.fit(text)
print(vectorizer.vocabulary_)
print(vectorizer.idf_)

{u'and': 0, u'on': 3, u'the': 4, u'is': 2, u'cat': 1}
[1.69314718 1.28768207 1.69314718 1.69314718 1.        ]


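The second array holds the idf of each word, in the order given by the vocabulary indices. “the” appears in all three documents, so it has the lowest idf (1.0), while words found in a single document have the highest.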
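The idf array can be reproduced by counting, for each word, how many documents contain it. The sketch below does this without scikit-learn. The <i>tokenize</i> helper is an assumption meant to mirror the vectorizer's default behaviour (lowercasing, words of two or more characters), and the smoothed idf formula is assumed to be the one used when <i>smooth_idf=True</i>.

```python
import math
import re

text = ["The cat is on the and the cat",
        "the",
        "the cat"]

def tokenize(doc):
    # Assumed tokenization: lowercase, keep words of two or more characters
    return re.findall(r"\b\w\w+\b", doc.lower())

# Alphabetical order reproduces the indices seen in vocabulary_
vocabulary = sorted({tok for doc in text for tok in tokenize(doc)})

# Document frequency: number of documents containing the term at least once
doc_freq = {term: sum(term in tokenize(doc) for doc in text)
            for term in vocabulary}

n_docs = len(text)
idf = [math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
       for term in vocabulary]

for term, value in zip(vocabulary, idf):
    print(f"{term:>4}  df={doc_freq[term]}  idf={value:.8f}")
```

Note that repeated occurrences inside one document do not matter here: “cat” appears twice in the first document but its document frequency is 2 because only two documents contain it.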

In [42]:
vector = vectorizer.transform(text)
vector.toarray()

array([[0.34394851, 0.52316341, 0.34394851, 0.34394851, 0.60942458],
       [0.        , 0.        , 0.        , 0.        , 1.        ],
       [0.        , 0.78980693, 0.        , 0.        , 0.61335554]])

Each value represents the weight of a word in a document. It depends on how many documents of the corpus contain the word and on how many times the word appears in that document.
The higher the value, the more weight the word carries in that document. The idf factor is applied only if <i>use_idf</i> is set to <b>True</b>, which is the default.
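To make the weighting concrete, the first row of the matrix above can be recomputed by hand: multiply each word count by its idf, then divide the row by its Euclidean norm. This is a sketch only. The idf values are copied from the printed <i>idf_</i> output (rounded to 8 decimals), so the result agrees with the array to about six decimal places rather than exactly.

```python
import math
from collections import Counter

# idf values copied from the vectorizer.idf_ output, keyed by term
idf = {"and": 1.69314718, "cat": 1.28768207, "is": 1.69314718,
       "on": 1.69314718, "the": 1.0}
terms = sorted(idf)  # column order follows the vocabulary indices

document = "The cat is on the and the cat"
# Splitting on whitespace is enough for this sentence (assumption: no
# punctuation and no one-letter words to filter out)
counts = Counter(document.lower().split())

# Unnormalized weight: term count times idf
raw = [counts[t] * idf[t] for t in terms]

# Scale the row to unit Euclidean length
norm = math.sqrt(sum(w * w for w in raw))
row = [w / norm for w in raw]

for term, weight in zip(terms, row):
    print(f"{term:>4}  count={counts[term]}  weight={weight:.6f}")
```

“the” ends up with the largest weight in this row despite having the lowest idf, because it occurs three times in the document. In the third document, where “the” and “cat” each occur once, “cat” gets the larger weight.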