# Recipe 3-6. Converting Text to Features Using TF-IDF
**Term frequency (TF):** Term frequency is the ratio of the number of times a word appears in a sentence to the length of that sentence. TF captures the importance of a word regardless of the length of the document. For example, a word that occurs 3 times in a 10-word sentence is not the same as a word that occurs 3 times in a 100-word sentence. The word should get more importance in the first scenario, and that is what TF does.
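
As a minimal sketch of the ratio described above, the snippet below computes TF for one word in a short and a long sentence. The helper name `term_frequency`, the made-up sentences, and the naive whitespace tokenization are assumptions for illustration only.

```python
def term_frequency(word, sentence):
    """Return count(word) / number of tokens, using naive whitespace splitting."""
    tokens = sentence.lower().split()
    if not tokens:
        return 0.0
    return tokens.count(word.lower()) / len(tokens)

# Hypothetical sentences: "data" appears 3 times in each.
short_sentence = "data " * 3 + "filler " * 7    # 10 tokens in total
long_sentence = "data " * 3 + "filler " * 97    # 100 tokens in total

tf_short = term_frequency("data", short_sentence)  # 3 / 10
tf_long = term_frequency("data", long_sentence)    # 3 / 100
print(tf_short, tf_long)
```

The same raw count of 3 gives a ten times larger TF in the shorter sentence.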

**Inverse Document Frequency (IDF):** The IDF of a word is the log of
the ratio of the total number of rows (documents) in the corpus to the
number of rows in which that word is present.

IDF = log(N/n), where N is the total number of rows and n is the
number of rows in which the word is present.

IDF measures the rareness of a term. Words like “a” and “the” show
up in nearly all the documents of the corpus, but rare words do not.
If a word appears in almost all documents, it is of little use to us,
because it does not help with classification or information retrieval.
IDF addresses this problem by giving such words a low weight.

TF-IDF is the product of TF and IDF, so both drawbacks are addressed:
raw counts that favor long documents, and common words that carry little
information. This makes predictions and information retrieval more relevant.
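
The definitions above can be sketched in a few lines of plain Python. This is an illustration of the textbook formulas (TF as a ratio, IDF = log(N/n)), not what scikit-learn computes internally. The helper names `tf`, `idf`, and `tf_idf` and the simple regex tokenization are assumptions; the sentences are the same three used in this recipe's example.

```python
import math
import re

docs = ["The quick brown fox jumped over the lazy dog.",
        "The dog.",
        "The gray fox"]
# Lowercase each sentence and split it into word tokens.
tokenized = [re.findall(r"\w+", d.lower()) for d in docs]

def tf(word, tokens):
    """Count of the word divided by the number of tokens in the sentence."""
    return tokens.count(word) / len(tokens)

def idf(word):
    """log(N / n): N rows in total, n rows that contain the word."""
    n = sum(1 for tokens in tokenized if word in tokens)
    if n == 0:
        raise ValueError(f"{word!r} does not occur in the corpus")
    return math.log(len(tokenized) / n)

def tf_idf(word, tokens):
    return tf(word, tokens) * idf(word)

print(idf("the"))                      # "the" is in every row, so log(3/3)
print(idf("gray"))                     # "gray" is in one row, so log(3/1)
print(tf_idf("the", tokenized[0]))
print(tf_idf("gray", tokenized[2]))
```

A word that appears in every row gets an IDF of zero under this plain formula, so its TF-IDF is zero no matter how often it occurs.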

In [6]:
Text = ["The quick brown fox jumped over the lazy dog.",
"The dog.",
"The gray fox"]
Text

['The quick brown fox jumped over the lazy dog.', 'The dog.', 'The gray fox']

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
vectorizer.fit(Text)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [14]:
#vectorizer.vocabulary_
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
vectorizer.get_feature_names_out().tolist()

['brown', 'dog', 'fox', 'gray', 'jumped', 'lazy', 'over', 'quick', 'the']

In [9]:
vectorizer.idf_

array([1.69314718, 1.28768207, 1.28768207, 1.69314718, 1.69314718,
       1.69314718, 1.69314718, 1.69314718, 1.        ])
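
These values differ from the plain IDF = log(N/n) because, with the default `smooth_idf=True`, scikit-learn computes ln((1 + N) / (1 + n)) + 1. The sketch below reproduces the array in plain Python; `smoothed_idf` is a hypothetical helper, and the document frequencies are counted by hand from the three sentences.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """Smoothed IDF: ln((1 + N) / (1 + n)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 3
# Number of sentences that contain each term (counted by hand).
doc_freq = {"brown": 1, "dog": 2, "fox": 2, "gray": 1, "jumped": 1,
            "lazy": 1, "over": 1, "quick": 1, "the": 3}

idf_values = {term: round(smoothed_idf(n_docs, df), 8)
              for term, df in doc_freq.items()}
print(idf_values)
```

Because of the added 1, a word that appears in every sentence (here "the") gets a weight of 1 rather than 0, so it is down-weighted but not discarded.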

In [15]:
# https://kavita-ganesan.com/tfidftransformer-tfidfvectorizer-usage-differences
import pandas as pd
# idf_ holds one IDF weight per vocabulary term (these are not TF-IDF scores)
df_idf = pd.DataFrame(vectorizer.idf_, index=vectorizer.get_feature_names_out(), columns=["idf_weights"])
df_idf.sort_values(by=['idf_weights'])

        idf_weights
the        1.000000
dog        1.287682
fox        1.287682
brown      1.693147
gray       1.693147
jumped     1.693147
lazy       1.693147
over       1.693147
quick      1.693147
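
The weights in this table are per-term IDF values, which are only one half of the feature. As a sketch of how a full TF-IDF row could be derived from them, the snippet below multiplies raw term counts by the smoothed IDF and then L2-normalizes the row, following the defaults shown in the vectorizer's parameters (`norm='l2'`, `smooth_idf=True`, `sublinear_tf=False`). It is plain Python written for illustration, not the library's code, and `tfidf_row` is a hypothetical helper.

```python
import math
import re
from collections import Counter

docs = ["The quick brown fox jumped over the lazy dog.",
        "The dog.",
        "The gray fox"]
# Lowercase, then keep tokens of two or more word characters
# (the same pattern as the vectorizer's default token_pattern).
tokenized = [re.findall(r"\b\w\w+\b", d.lower()) for d in docs]
vocab = sorted({t for tokens in tokenized for t in tokens})

def smoothed_idf(term):
    df = sum(1 for tokens in tokenized if term in tokens)
    return math.log((1 + len(tokenized)) / (1 + df)) + 1

def tfidf_row(tokens):
    """Raw count times smoothed IDF for each vocabulary term, L2-normalized."""
    counts = Counter(tokens)
    weights = [counts[t] * smoothed_idf(t) for t in vocab]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm if norm else 0.0 for w in weights]

row = tfidf_row(tokenized[1])          # features for "The dog."
features = dict(zip(vocab, row))
print({t: round(v, 4) for t, v in features.items() if v})
```

With the fitted vectorizer from this recipe, `vectorizer.transform(Text).toarray()` should give the corresponding matrix for all three sentences, one row per sentence and one column per vocabulary term.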
