### TF-IDF (Term Frequency-Inverse Document Frequency)

#### Ref 1: https://medium.com/@abhishekjainindore24/tf-idf-in-nlp-term-frequency-inverse-document-frequency-e05b65932f1d
#### Ref 2: https://www.markovml.com/blog/tf-idf
#### Ref 3: https://towardsdatascience.com/tf-idf-for-document-ranking-from-scratch-in-python-on-real-world-dataset-796d339a4089
#### Ref 4: https://www.kaggle.com/code/paulrohan2020/tf-idf-tutorial

#### YouTube ref: https://www.youtube.com/watch?v=ENLEjGozrio

TF-IDF is a statistical measure used in text mining and information retrieval to evaluate how important a word is to a document in a collection or corpus.

1. Term Frequency (TF)

Definition: The number of times a word appears in a document, divided by the total number of words in that document.

Formula:
                    TF = ( Number of times the word appears in the document / Total number of words in the document )

Purpose: TF helps understand how frequently a word is used in a single document. Words that appear frequently in a document get a higher score.
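
The TF definition above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the `tokenize` helper (lowercase, letters only) and the function name `term_frequency` are assumptions made for this example.

```python
import re
from collections import Counter

def tokenize(text):
    # Assumed, simplistic tokenizer: lowercase and keep alphabetic runs only.
    return re.findall(r"[a-z]+", text.lower())

def term_frequency(term, document):
    # TF = occurrences of the term / total number of words in the document.
    tokens = tokenize(document)
    if not tokens:
        return 0.0
    return Counter(tokens)[term] / len(tokens)

doc = "The cat is on the mat."
tf_cat = term_frequency("cat", doc)  # 1 occurrence out of 6 words
tf_the = term_frequency("the", doc)  # 2 occurrences out of 6 words
print(tf_cat, tf_the)
```

Note that "the" scores higher than "cat" here, which is why TF alone is not enough to pick out informative words.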

2. Inverse Document Frequency (IDF)

Definition: A measure of how rare, and therefore how informative, a word is across a set of documents. Words that appear in many documents are less informative and are given a lower weight.

Formula:
                    IDF = log( Total number of documents / Number of documents containing the word )


Purpose: IDF helps reduce the importance of common words (like "the", "is", "and") that occur in many documents. Rare words that appear in fewer documents get a higher score.
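
As a rough sketch of the IDF formula above, the snippet below computes IDF over a three-sentence toy corpus. The natural logarithm, the `tokenize` helper, and the function name are assumptions of this example; the formula itself does not fix a log base, and libraries often use smoothed variants instead.

```python
import math
import re

def tokenize(text):
    # Assumed, simplistic tokenizer: lowercase and keep alphabetic runs only.
    return re.findall(r"[a-z]+", text.lower())

def inverse_document_frequency(term, corpus):
    # IDF = log(total documents / documents containing the term).
    n_containing = sum(1 for doc in corpus if term in tokenize(doc))
    if n_containing == 0:
        raise ValueError(f"{term!r} does not occur in the corpus")
    return math.log(len(corpus) / n_containing)

corpus = [
    "The cat is on the mat.",
    "The dog is in the house.",
    "The cat and the dog are friends.",
]

idf_the = inverse_document_frequency("the", corpus)  # log(3/3) = 0
idf_cat = inverse_document_frequency("cat", corpus)  # log(3/2)
idf_mat = inverse_document_frequency("mat", corpus)  # log(3/1)
print(idf_the, idf_cat, idf_mat)
```

A word that occurs in every document, such as "the", gets an IDF of zero, while the rarest word gets the largest value.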

3. TF-IDF

Formula:
                    TF-IDF = TF × IDF

Purpose: TF-IDF gives a weight to each word in a document, highlighting words that are important (i.e., frequent in a specific document but not common across all documents in the corpus).

#### Example

Let's say we have a corpus of 3 documents:

1. "The cat is on the mat."
2. "The dog is in the house."
3. "The cat and the dog are friends."

Step 1: Compute Term Frequency (TF)

For the word "cat" in Document 1:

$$\text{TF} = \frac{1}{6} \quad \text{(since there are 6 words in total)}$$

Step 2: Compute Inverse Document Frequency (IDF)

"cat" appears in 2 out of 3 documents.

$$\text{IDF} = \log\left(\frac{3}{2}\right)$$

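Step 3: Compute TF-IDF

$$\text{TF-IDF} = \text{TF} \times \text{IDF} = \frac{1}{6} \times \log\left(\frac{3}{2}\right) \approx 0.068 \quad \text{(using the natural logarithm)}$$
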
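
The three steps of the worked example can be checked with a short script. This is a sketch under stated assumptions: it uses the natural logarithm, a simplistic letters-only tokenizer, and a helper named `tf_idf` that is introduced only for this illustration.

```python
import math
import re
from collections import Counter

corpus = [
    "The cat is on the mat.",
    "The dog is in the house.",
    "The cat and the dog are friends.",
]

def tokenize(text):
    # Assumed, simplistic tokenizer: lowercase and keep alphabetic runs only.
    return re.findall(r"[a-z]+", text.lower())

def tf_idf(term, document, corpus):
    tokens = tokenize(document)
    if not tokens:
        return 0.0
    tf = Counter(tokens)[term] / len(tokens)                  # Step 1
    df = sum(1 for doc in corpus if term in tokenize(doc))
    idf = math.log(len(corpus) / df) if df else 0.0           # Step 2
    return tf * idf                                           # Step 3

cat_score = tf_idf("cat", corpus[0], corpus)  # (1/6) * ln(3/2)
mat_score = tf_idf("mat", corpus[0], corpus)  # (1/6) * ln(3/1)
the_score = tf_idf("the", corpus[0], corpus)  # (2/6) * ln(3/3) = 0
print(f"cat={cat_score:.4f} mat={mat_score:.4f} the={the_score:.4f}")
# prints: cat=0.0676 mat=0.1831 the=0.0000
```

In Document 1, "mat" outranks "cat" because it appears in only one document, and "the" drops to zero despite being the most frequent word.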

## Applications of TF-IDF
a) Text Classification: It is used to convert text data into numerical features for machine learning models.

b) Search Engines: Search engines use TF-IDF to rank documents based on their relevance to a search query.

c) Document Similarity: Used in algorithms to find similar documents.

d) Keyword Extraction: Helps identify important words in a document by their weight.
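
Two of these applications, search ranking and keyword extraction, can be sketched with plain Python dictionaries. This is an illustrative toy, not how a production search engine works: the scoring rule (sum of the query terms' TF-IDF weights), the natural logarithm, and the helper names `rank` and `top_keywords` are all assumptions of this example.

```python
import math
import re
from collections import Counter

corpus = [
    "The cat is on the mat.",
    "The dog is in the house.",
    "The cat and the dog are friends.",
]

def tokenize(text):
    # Assumed, simplistic tokenizer: lowercase and keep alphabetic runs only.
    return re.findall(r"[a-z]+", text.lower())

tokenized = [tokenize(doc) for doc in corpus]
n_docs = len(tokenized)
# Document frequency: the number of documents each term occurs in.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_vector(tokens):
    # Map each term of one document to its TF-IDF weight.
    counts = Counter(tokens)
    return {t: (c / len(tokens)) * math.log(n_docs / doc_freq[t])
            for t, c in counts.items()}

vectors = [tfidf_vector(tokens) for tokens in tokenized]

def rank(query):
    # Score each document by the summed weights of the query terms it contains.
    terms = tokenize(query)
    scores = [sum(vec.get(t, 0.0) for t in terms) for vec in vectors]
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)

def top_keywords(vec, k=2):
    # The k highest-weighted terms of one document.
    return sorted(vec, key=vec.get, reverse=True)[:k]

ranking = rank("cat mat")
print("ranking (document indices):", ranking)
print("keywords of document 0:", top_keywords(vectors[0]))
```

For the query "cat mat", the first document ranks highest because it contains both terms, and the second document, which contains neither, ranks last.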

## Advantages of TF-IDF

#### Simplicity & Ease of Use

TF-IDF is relatively easy to understand and implement, making it accessible for beginners and useful for quick prototyping.

#### Effective for Basic Text Analysis

It works well for text classification, document similarity, and keyword extraction in many scenarios, especially when you need a straightforward, interpretable representation of text.

#### Highlights Important Words

TF-IDF emphasizes unique and meaningful words by assigning higher scores to terms that are frequent in a document but rare across the corpus, which is useful for understanding key concepts.

#### Reduces the Impact of Common Words

By incorporating the inverse document frequency, it reduces the weight of common stop words (e.g., "the", "is", "and"), making the representation more informative.

#### Sparse Representation

The output is usually a sparse matrix, which can be efficient in terms of storage and computation when used with libraries like SciPy.

#### Widely Supported

It is supported by many machine learning libraries and frameworks, like Scikit-Learn, which simplifies its application in projects.

## Disadvantages of TF-IDF

#### Ignores Word Order and Context

TF-IDF treats text as a "bag of words" and does not consider the order or context of words, which limits its ability to understand semantics. For example, "New York" and "York New" would be treated the same way.
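
A quick way to see this is to compare the word counts of the two phrases. The sketch below also shows bigram counts, which are a common partial workaround for the loss of word order; the helper names are assumptions of this example.

```python
import re
from collections import Counter

def bag_of_words(text):
    # Unigram counts: word order is discarded.
    return Counter(re.findall(r"[a-z]+", text.lower()))

def bag_of_bigrams(text):
    # Counts of adjacent word pairs: keeps a little local order.
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(zip(tokens, tokens[1:]))

same_unigrams = bag_of_words("New York") == bag_of_words("York New")
same_bigrams = bag_of_bigrams("New York") == bag_of_bigrams("York New")
print(same_unigrams, same_bigrams)  # True False
```

With single words as features the two phrases are indistinguishable; with word pairs as features they differ, at the cost of a larger vocabulary.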

#### Not Suitable for Complex NLP Tasks

For advanced natural language processing tasks such as machine translation, text generation, or nuanced sentiment analysis, TF-IDF alone is insufficient because it does not capture meaning, context, or relationships between words.

#### Vocabulary Size Can Be Large

In large corpora, the number of unique words (features) can be very high, leading to increased memory consumption and slower performance.

#### Suffers from High Dimensionality

The sparse matrix representation can become inefficient for very large datasets, especially when working with millions of unique words.

#### Static Weights

The term weights in TF-IDF are static and do not adapt to changes in the document set. If the corpus is updated or expanded, the weights need to be recalculated, making it inefficient for dynamic or streaming data.
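
The effect can be seen by computing the IDF of one word before and after the corpus grows. This is a minimal sketch: the added sentence is hypothetical, and the natural logarithm and helper names are assumptions of this example.

```python
import math
import re

def tokenize(text):
    # Assumed, simplistic tokenizer: lowercase and keep alphabetic runs only.
    return re.findall(r"[a-z]+", text.lower())

def idf(term, corpus):
    # Assumes the term occurs in at least one document.
    df = sum(1 for doc in corpus if term in tokenize(doc))
    return math.log(len(corpus) / df)

corpus = [
    "The cat is on the mat.",
    "The dog is in the house.",
    "The cat and the dog are friends.",
]

idf_before = idf("cat", corpus)            # ln(3/2)
corpus.append("The bird is in the tree.")  # hypothetical new document
idf_after = idf("cat", corpus)             # ln(4/2)
print(idf_before, idf_after)
```

Adding one document that does not even mention "cat" still changes the weight of "cat" in every existing document, so all vectors have to be recomputed.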

#### Domain-Specific Challenges

In some domains, important terms may not be frequent and can be underweighted by TF-IDF, which may require domain-specific adaptations.

#### Lack of Handling for Synonyms and Polysemy

TF-IDF does not understand that different words can mean the same thing (synonyms) or that the same word can have multiple meanings (polysemy). For example, "car" and "automobile" are treated as separate features even though they mean the same thing.
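
The synonym problem shows up directly in cosine similarity. The sketch below uses raw term counts for simplicity; TF-IDF only rescales each term's dimension, so terms that never co-occur still contribute nothing to the overlap. The example phrases and the `cosine` helper are assumptions of this illustration.

```python
import math
from collections import Counter

def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

synonym_only = cosine(Counter(["car"]), Counter(["automobile"]))
with_context = cosine(Counter("red car".split()),
                      Counter("red automobile".split()))
identical = cosine(Counter("red car".split()), Counter("red car".split()))
print(synonym_only, with_context, identical)
```

The two single-word documents have zero similarity, and "red car" versus "red automobile" are only as similar as their shared word "red" makes them.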

#### No Inherent Handling of Out-of-Vocabulary Words

When test data contains new words that were not in the training corpus, TF-IDF has no built-in mechanism to handle them. Such words have no entry in the vocabulary, so they are typically dropped when the text is vectorized.