
## TF-IDF

In all the approaches we have seen so far, every word in the text is treated as equally important; there is no notion of some words in a document mattering more than others. TF-IDF addresses this. It aims to quantify the importance of a given word relative to the other words in the document and in the corpus. It is a commonly used representation scheme in information retrieval systems, for extracting the documents in a corpus that are relevant to a given text query.
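
As a rough guide to what "importance" means here, the weighting can be written out explicitly. The form below assumes scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`); other libraries and textbooks use slightly different variants. Here $n$ is the number of documents, $\mathrm{df}(t)$ is the number of documents containing term $t$, and $\mathrm{tf}(t, d)$ is the raw count of $t$ in document $d$:

$$
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d)\cdot\mathrm{idf}(t)
$$

Each document vector is then divided by its Euclidean norm. For a two-document corpus, a word that occurs in both documents gets $\mathrm{idf} = \ln(3/3) + 1 = 1$, while a word that occurs in only one gets $\mathrm{idf} = \ln(3/2) + 1 \approx 1.405$, so rarer words receive more weight.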

This notebook shows a simple example of how to get the TF-IDF representation of a document using sklearn's TfidfVectorizer. 

In [3]:
documents = ["Some actors are singers.", "All the singers are dancers."]
processed_docs = [doc.lower().replace(".","") for doc in documents]
processed_docs

['some actors are singers', 'all the singers are dancers']

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
bow_rep_tfidf = tfidf.fit_transform(processed_docs)

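# IDF for all words in the vocabulary
print("IDF for all words in the vocabulary", tfidf.idf_)
print("-" * 10)
# All words in the vocabulary.
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
print("All words in the vocabulary", tfidf.get_feature_names_out().tolist())
print("-" * 10)

# TF-IDF representation for all documents in our corpus
print("TFIDF representation for all documents in our corpus\n", bow_rep_tfidf.toarray())
print("-" * 10)

# Transform an unseen sentence with the fitted vocabulary. None of its tokens
# ("no", "singer", "is", "actor") is in the vocabulary, so the vector is all zeros.
temp = tfidf.transform(["No singer is actor."])
print("Tfidf representation for 'No singer is actor':\n", temp.toarray())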

IDF for all words in the vocabulary [1.40546511 1.40546511 1.         1.40546511 1.         1.40546511
 1.40546511]
----------
All words in the vocabulary ['actors', 'all', 'are', 'dancers', 'singers', 'some', 'the']
----------
TFIDF representation for all documents in our corpus
 [[0.57615236 0.         0.40993715 0.         0.40993715 0.57615236
  0.        ]
 [0.         0.49922133 0.35520009 0.49922133 0.35520009 0.
  0.49922133]]
----------
Tfidf representation for 'No singer is actor':
 [[0. 0. 0. 0. 0. 0. 0.]]
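
To see where the numbers above come from, the computation can be reproduced by hand. The following is a minimal sketch using only the standard library, not scikit-learn's actual implementation. It assumes the default `TfidfVectorizer` settings (smoothed IDF, L2 normalization) and uses plain whitespace splitting in place of scikit-learn's tokenizer, which gives the same tokens for this small corpus. The names `df`, `idf`, and `tfidf_vector` are illustrative only.

```python
import math
from collections import Counter

docs = ["some actors are singers", "all the singers are dancers"]
tokenized = [doc.split() for doc in docs]  # assumption: whitespace tokenization

# Vocabulary in alphabetical order, matching the column order printed above
vocab = sorted({tok for doc in tokenized for tok in doc})
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
df = {term: sum(term in doc for doc in tokenized) for term in vocab}

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}


def tfidf_vector(tokens):
    """Return the L2-normalized TF-IDF vector for a list of tokens."""
    counts = Counter(tokens)  # out-of-vocabulary tokens are simply ignored below
    raw = [counts[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    # A document with no in-vocabulary tokens stays an all-zero vector
    return [v / norm if norm else 0.0 for v in raw]


vectors = [tfidf_vector(doc) for doc in tokenized]
for term, weight in zip(vocab, vectors[0]):
    print(f"{term:8s} idf={idf[term]:.8f}  tfidf(doc 0)={weight:.8f}")

# Same out-of-vocabulary sentence as above, lowercased and without punctuation
print(tfidf_vector("no singer is actor".split()))
```

Under these assumptions the printed IDF values and weights should agree with the scikit-learn output above up to rounding, and the last line shows why the unseen sentence maps to all zeros: "singer" and "actor" are different tokens from "singers" and "actors", so nothing in it matches the fitted vocabulary.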


