###TF-IDF###

- TF-IDF stands for Term Frequency-Inverse Document Frequency, a statistical measure used in Natural Language Processing (NLP) to evaluate the importance of a word in a document relative to a collection of documents (corpus).

- TF-IDF weights a word up when it occurs often within a document and down when it appears in many documents across the corpus, which makes it useful for tasks such as feature extraction, information retrieval, and text classification.
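The weighting described above can be sketched by hand. This is a minimal illustration using the textbook definition (relative term frequency times `log(N / df)`), not the smoothed and normalized variant that scikit-learn applies; the toy corpus and the helper `tf_idf` are hypothetical and exist only for this example.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, for illustration only).
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cats and dogs",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
df = Counter(term for tokens in tokenized for term in set(tokens))

def tf_idf(term, tokens):
    """Textbook TF-IDF for a term that occurs somewhere in the corpus."""
    tf = tokens.count(term) / len(tokens)   # relative term frequency
    idf = math.log(n_docs / df[term])       # unsmoothed inverse document frequency
    return tf * idf

print("the:", tf_idf("the", tokenized[0]))
print("cat:", tf_idf("cat", tokenized[0]))
```

In the first document "the" occurs twice and "cat" once, yet "the" appears in every document, so its IDF is `log(1) = 0` and its score drops to zero, while "cat" keeps a positive score.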



##Why Use TF-IDF?##

#Feature Extraction:

TF-IDF is used to extract features from text for machine learning models. It converts textual data into numerical data that models can work with.

#Information Retrieval:

It helps in searching and ranking documents by their relevance to a query. Query terms that occur often in a document but in few documents across the corpus contribute the most to that document's ranking.
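A minimal sketch of this ranking idea, assuming cosine similarity between TF-IDF vectors as the relevance score (one common choice, not the only one). The corpus and query are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical mini-corpus and query.
corpus = [
    "TF-IDF weighs words by how rare they are across documents",
    "Neural networks learn dense word embeddings",
    "Search engines rank documents by relevance to a query",
]
query = "rank documents for a search query"

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(corpus)   # learn vocabulary and IDF from the corpus
query_vector = vectorizer.transform([query])     # reuse the same vocabulary for the query

# Cosine similarity between the query and every document.
scores = cosine_similarity(query_vector, doc_vectors).ravel()
ranking = scores.argsort()[::-1]                 # document indices, best match first

for idx in ranking:
    print(f"{scores[idx]:.3f}  {corpus[idx]}")
```

The query is transformed with `transform`, not `fit_transform`, so that it is scored against the vocabulary and IDF weights learned from the corpus. Query words that are not in that vocabulary are ignored.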

#Text Classification:

In classification tasks like spam detection or sentiment analysis, TF-IDF helps models understand which words are relevant to each category.
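As a rough sketch of how this looks in practice, TF-IDF features can be fed to a classifier through a scikit-learn pipeline. The training texts and labels below are invented for illustration, and Multinomial Naive Bayes is just one reasonable choice of model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny hypothetical training set; a real task needs far more data.
train_texts = [
    "win a free prize now",
    "free money claim your prize",
    "meeting moved to monday morning",
    "please review the project report",
]
train_labels = ["spam", "spam", "ham", "ham"]

# The pipeline vectorizes the text and then fits the classifier on the TF-IDF features.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

predictions = model.predict(["claim your free prize", "project meeting on monday"])
print(predictions)
```

Wrapping both steps in one pipeline means new text is vectorized with the vocabulary learned during training before it reaches the classifier.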

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer  # Converts a collection of raw text documents into a matrix of TF-IDF features.

In [16]:
texts = ["Prajakta loved doing NLP. she also loved doing NLP with ML"]

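The list holds a single string, so the vectorizer treats both sentences as one document:

- Sentence 1: "Prajakta loved doing NLP."

- Sentence 2: "she also loved doing NLP with ML"

To treat them as two documents, each sentence would have to be its own element of the list.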

In [17]:
vectorizer = TfidfVectorizer()   # Creates a TfidfVectorizer instance. By default it lowercases the text and tokenizes it into words of two or more characters. It does NOT remove stop words unless stop_words is set (e.g. stop_words="english").

In [18]:
tfidf_matrix = vectorizer.fit_transform(texts)

- Fitting:

vectorizer.fit_transform(texts) first learns the vocabulary of the input texts and an IDF weight for each term.

- Transforming:

It then transforms each document into one row of a sparse matrix of TF-IDF scores. The output tfidf_matrix has one row per document and one column per vocabulary term.
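What fitting learns can be inspected directly on the vectorizer. A small sketch using the same `texts` as above; `vocabulary_` and `idf_` are attributes that scikit-learn sets after fitting.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["Prajakta loved doing NLP. she also loved doing NLP with ML"]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(texts)

print("Shape:", tfidf_matrix.shape)             # (number of documents, vocabulary size)
print("Vocabulary:", vectorizer.vocabulary_)    # maps each term to its column index
print("IDF weights:", vectorizer.idf_)          # one learned IDF weight per term
```

Because the list holds only one document, every term has the same document frequency and therefore the same IDF weight.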

In [19]:
print("TF-IDF Matrix:", tfidf_matrix.toarray())

TF-IDF Matrix: [[0.24253563 0.48507125 0.48507125 0.24253563 0.48507125 0.24253563
  0.24253563 0.24253563]]


In [20]:
print("Feature Names:", vectorizer.get_feature_names_out())

Feature Names: ['also' 'doing' 'loved' 'ml' 'nlp' 'prajakta' 'she' 'with']


INTERPRETATION :

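- Because texts contains a single document, every word has the same inverse document frequency, so the scores above reflect term frequency only (after the row is L2-normalized).

- "doing", "loved" and "nlp" each appear twice and score 0.485. "also", "ml", "prajakta", "she" and "with" appear once and score 0.243, exactly half.

- IDF only changes the picture when the corpus holds more than one document. If the two sentences were passed as separate documents, words unique to one sentence (like "Prajakta" and "ML") would receive a higher IDF than words shared by both (like "NLP" and "loved").

- This shows how TF-IDF assigns importance to words based on their frequency within a document and their spread across the entire corpus.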