# TF-IDF
TF-IDF, which stands for Term Frequency-Inverse Document Frequency, is a statistical measure used in Natural Language Processing (NLP) to evaluate the importance of a word in a document relative to a collection of documents. It combines two components: term frequency (TF) and inverse document frequency (IDF).

### Term Frequency (TF):

TF measures how frequently a term occurs in a document relative to the total number of words in that document.

It is calculated using the formula:


$$
\mathrm{TF}(t, d) = \frac{\text{number of occurrences of term } t \text{ in document } d}{\text{total number of terms in document } d}
$$
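As a rough illustration of the ratio above, here is a minimal Python sketch. The helper name `term_frequency` and the whitespace tokenization are assumptions made for this example only.

```python
from collections import Counter

def term_frequency(term, tokens):
    """Occurrences of `term` divided by the total number of tokens."""
    if not tokens:
        return 0.0  # avoid dividing by zero for an empty document
    return Counter(tokens)[term] / len(tokens)

# Hypothetical pre-tokenized document (lowercased, punctuation removed)
tokens = "the cat sat on the mat".split()

tf_cat = term_frequency("cat", tokens)  # 1 occurrence out of 6 tokens
tf_the = term_frequency("the", tokens)  # 2 occurrences out of 6 tokens
print(tf_cat, tf_the)
```

A frequent word such as "the" gets a higher TF than "cat" here, which is why TF alone is not enough to judge importance.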


### Inverse Document Frequency (IDF):

IDF measures the rarity of a term across a collection of documents.
It is calculated using the formula:


$$
\mathrm{IDF}(t) = \log_e \frac{\text{total number of documents}}{\text{number of documents containing term } t}
$$

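The sketch below is one possible way to compute IDF in plain Python. It takes IDF as the natural log of the total number of documents divided by the number of documents containing the term, which is the form the worked example in this notebook uses (loge(2/1) for "cat"). The function name `inverse_document_frequency` is made up for illustration.

```python
import math

def inverse_document_frequency(term, tokenized_docs):
    """Natural log of (total documents / documents containing `term`)."""
    n_docs = len(tokenized_docs)
    df = sum(1 for doc in tokenized_docs if term in doc)
    if df == 0:
        # Assumption for this sketch: unseen terms are treated as an error
        raise ValueError(f"{term!r} does not occur in any document")
    return math.log(n_docs / df)

docs = [
    "the cat sat on the mat".split(),
    "the dog played in the garden".split(),
]

idf_cat = inverse_document_frequency("cat", docs)  # ln(2/1)
idf_the = inverse_document_frequency("the", docs)  # ln(2/2) = 0
print(round(idf_cat, 3), idf_the)
```

A term that occurs in every document, like "the", ends up with an IDF of 0, so it contributes nothing to the final score.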
### TF-IDF Calculation:

TF-IDF is calculated by multiplying the TF of a term by its IDF.
It gives higher weight to terms that appear frequently in a specific document but infrequently across all documents.

### Example:

Consider a collection of documents:

Document 1: "The cat sat on the mat."

Document 2: "The dog played in the garden."

Let's calculate TF-IDF for the term "cat":

TF for "cat" in Document 1: 1/6 ≈ 0.1667 (one occurrence among six words)

IDF for "cat" across both documents: log_e(2/1) ≈ 0.693 (two documents, one of which contains "cat")

TF-IDF for "cat" in Document 1: (1/6) × 0.693 ≈ 0.1155
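The arithmetic above can be checked with a few lines of Python. This is only a sketch: the `tokenize` and `tf_idf` helpers are hypothetical, and the tokenizer just lowercases the text, drops periods, and splits on whitespace.

```python
import math

documents = [
    "The cat sat on the mat.",
    "The dog played in the garden.",
]

def tokenize(text):
    # Simplistic assumption: lowercase, strip periods, split on whitespace
    return text.lower().replace(".", "").split()

def tf_idf(term, doc_tokens, all_docs_tokens):
    # Assumes `term` occurs in at least one document (otherwise df would be 0)
    tf = doc_tokens.count(term) / len(doc_tokens)
    df = sum(1 for tokens in all_docs_tokens if term in tokens)
    idf = math.log(len(all_docs_tokens) / df)
    return tf * idf

tokenized = [tokenize(doc) for doc in documents]

score_cat = tf_idf("cat", tokenized[0], tokenized)
score_the = tf_idf("the", tokenized[0], tokenized)
print(round(score_cat, 4))  # 0.1155
print(score_the)            # 0.0
```

"the" scores 0 even though it is the most frequent word in Document 1, because it appears in both documents.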

### Usage:

TF-IDF is used in information retrieval, text mining, and document classification tasks.

It helps in identifying important words in a document or corpus.

It can be used to extract keywords, rank documents, or perform similarity searches.
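To make the ranking and similarity-search use case more concrete, here is a small sketch with scikit-learn. The third document and the query string are invented for this example, and cosine similarity is just one common choice of similarity measure.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical mini-corpus and query, for illustration only
corpus = [
    "The cat sat on the mat.",
    "The dog played in the garden.",
    "A cat chased the dog.",
]
query = "cat on a mat"

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(corpus)   # learn vocabulary and IDF
query_vector = vectorizer.transform([query])     # reuse the fitted vocabulary

# Cosine similarity between the query and every document
scores = cosine_similarity(query_vector, doc_vectors).ravel()
ranking = np.argsort(-scores)                    # best match first

for idx in ranking:
    print(f"{scores[idx]:.3f}  {corpus[idx]}")
```

The first document shares "cat", "on", and "mat" with the query and ranks highest, while the second document shares no terms and scores zero.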

### Advantages:

Represents every document as a vector of fixed length (one entry per vocabulary term), which suits machine learning algorithms.

Captures the importance of words by considering both their frequency in the document and their rarity across the document collection.

### Disadvantages:

Produces a sparse, high-dimensional matrix, since each document uses only a small fraction of a large vocabulary.

Out-of-vocabulary (OOV) words, meaning words not seen when the vocabulary was built, are simply ignored.

Does not capture semantic relationships between words.
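The first two limitations can be observed directly with scikit-learn. The sketch below is illustrative only, and the "unseen" sentence is invented for the example. With just two short documents the matrix is not very sparse yet; the share of zeros grows quickly as the vocabulary gets larger.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = [
    "The cat sat on the mat.",
    "The dog played in the garden.",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(train_docs)

# fit_transform returns a SciPy sparse matrix that stores only non-zero entries
n_rows, n_cols = matrix.shape
density = matrix.nnz / (n_rows * n_cols)
print(f"shape={matrix.shape}, stored values={matrix.nnz}, density={density:.2f}")

# A document made up entirely of unseen words maps to an all-zero row
unseen = vectorizer.transform(["Parrots whistle loudly."])
print("non-zero entries for the unseen document:", unseen.nnz)
```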

### N-grams:

N-grams (sequences of N consecutive words) can be used as TF-IDF features instead of, or alongside, single words. They capture local word order and short phrases, which improves results in some tasks, although they still do not model meaning.
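To show what an n-gram feature looks like, here is a minimal sketch that builds them by hand. The helper name `ngrams` is an assumption for this example, not a library function.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["cat", "sat", "mat"]

bigrams = ngrams(tokens, 2)
print(bigrams)  # ['cat sat', 'sat mat']
```

Each n-gram is then treated like a single term when computing TF and IDF.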

## Using NLTK:

In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [None]:
documents = [
    "The cat sat on the mat.",
    "The dog played in the garden."
]

In [None]:
tokenized_documents = [word_tokenize(doc.lower()) for doc in documents]
tokenized_documents

[['the', 'cat', 'sat', 'on', 'the', 'mat', '.'],
 ['the', 'dog', 'played', 'in', 'the', 'garden', '.']]

In [None]:
# Remove stopwords and lemmatize the remaining tokens.
# Punctuation such as '.' is kept here; TfidfVectorizer's default
# token pattern discards it later.
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
filtered_documents = [
    [lemmatizer.lemmatize(word) for word in doc if word not in stop_words]
    for doc in tokenized_documents
]

In [None]:
filtered_documents

[['cat', 'sat', 'mat', '.'], ['dog', 'played', 'garden', '.']]

In [None]:
# Convert the filtered documents back to strings
preprocessed_documents = [' '.join(doc) for doc in filtered_documents]
preprocessed_documents

['cat sat mat .', 'dog played garden .']

In [None]:
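# Calculate TF-IDF using TfidfVectorizer; its defaults smooth the IDF and
# L2-normalize each row, so values differ from the hand calculation above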
tfidf_vectorizer = TfidfVectorizer()
tfidf_representation = tfidf_vectorizer.fit_transform(preprocessed_documents)

In [None]:
print("Vocabulary:", tfidf_vectorizer.get_feature_names_out())

Vocabulary: ['cat' 'dog' 'garden' 'mat' 'played' 'sat']


In [None]:
print("TF-IDF matrix:")
print(tfidf_representation.toarray())

TF-IDF matrix:
[[0.57735027 0.         0.         0.57735027 0.         0.57735027]
 [0.         0.57735027 0.57735027 0.         0.57735027 0.        ]]


In [None]:
tfidf_vectorizer.vocabulary_

{'cat': 0, 'sat': 5, 'mat': 3, 'dog': 1, 'played': 4, 'garden': 2}
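The values in the matrix (0.57735027) do not match the plain TF × IDF formula because scikit-learn's documented defaults differ from it: raw counts are used for TF, the IDF is smoothed as ln((1 + N) / (1 + df)) + 1, and each row is scaled to unit L2 length. The sketch below reproduces the first row by hand under those assumptions; the helper names are made up for illustration.

```python
import math

# Preprocessed documents from the cells above, already tokenized
docs = [["cat", "sat", "mat"], ["dog", "played", "garden"]]
n_docs = len(docs)
vocabulary = sorted({term for doc in docs for term in doc})

def smoothed_idf(term):
    # Smoothed IDF: ln((1 + N) / (1 + df)) + 1
    df = sum(1 for doc in docs if term in doc)
    return math.log((1 + n_docs) / (1 + df)) + 1

def l2_normalized_row(doc):
    # Raw count times smoothed IDF, then scale the row to unit length
    raw = [doc.count(term) * smoothed_idf(term) for term in vocabulary]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]

row0 = l2_normalized_row(docs[0])
print(vocabulary)
print([round(value, 8) for value in row0])
```

Since all three terms in the first document have the same weight before normalization, each one ends up as 1/√3 ≈ 0.57735027.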

### N-grams with TfidfVectorizer:

In [None]:
# Calculate TF-IDF over bigrams only (ngram_range=(2, 2)), keeping at most 10 features
tfidf_vectorizer = TfidfVectorizer(ngram_range=(2, 2), max_features=10)
tfidf_representation = tfidf_vectorizer.fit_transform(preprocessed_documents)
print("Vocabulary:", tfidf_vectorizer.get_feature_names_out())
print("TF-IDF matrix:")
print(tfidf_representation.toarray())

Vocabulary: ['cat sat' 'dog played' 'played garden' 'sat mat']
TF-IDF matrix:
[[0.70710678 0.         0.         0.70710678]
 [0.         0.70710678 0.70710678 0.        ]]


In [None]:
tfidf_vectorizer.vocabulary_

{'cat sat': 0, 'sat mat': 3, 'dog played': 1, 'played garden': 2}

## Using spaCy:

In [None]:
import spacy
# Load the small English pipeline in spaCy
nlp = spacy.load("en_core_web_sm")

In [None]:
documents = [
    "I enjoy reading books.",
    "Books are a great source of knowledge."
]

In [None]:
# Tokenize the documents with spaCy and lowercase each token
tokenized_documents = [[token.text.lower() for token in nlp(doc)] for doc in documents]
tokenized_documents

[['i', 'enjoy', 'reading', 'books', '.'],
 ['books', 'are', 'a', 'great', 'source', 'of', 'knowledge', '.']]

In [None]:
# Convert the tokenized documents back to strings
preprocessed_documents = [' '.join(doc) for doc in tokenized_documents]
preprocessed_documents

['i enjoy reading books .', 'books are a great source of knowledge .']

In [None]:
# Calculate TF-IDF, keeping only the 5 most frequent terms in the corpus
tfidf_vectorizer = TfidfVectorizer(max_features=5)
tfidf_representation = tfidf_vectorizer.fit_transform(preprocessed_documents)

In [None]:
print("Vocabulary:", tfidf_vectorizer.get_feature_names_out())

Vocabulary: ['are' 'books' 'enjoy' 'great' 'knowledge']


In [None]:
print("TF-IDF matrix:")
print(tfidf_representation.toarray())

TF-IDF matrix:
[[0.         0.57973867 0.81480247 0.         0.        ]
 [0.53404633 0.37997836 0.         0.53404633 0.53404633]]


In [None]:
tfidf_vectorizer.vocabulary_

{'enjoy': 2, 'books': 1, 'are': 0, 'great': 3, 'knowledge': 4}