
## TF-IDF and N-grams


### TF-IDF (Term Frequency-Inverse Document Frequency)

TF-IDF is a numerical statistic that reflects how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval and text mining.

**How it works:**

*   **Term Frequency (TF):** Measures how often a term appears in a document. The more often a word appears in that document, the higher its TF.
*   **Inverse Document Frequency (IDF):** Measures how rare a term is across the corpus. It is calculated as the logarithm of the total number of documents divided by the number of documents containing the term. Words that appear in many documents (like "the" or "a") have a low IDF, while words that appear in only a few documents have a high IDF.

The TF-IDF score is the product of the TF and IDF. A high TF-IDF score indicates that a word is frequent in a document but rare in the rest of the corpus, suggesting it is a significant word for that document.
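
As a rough illustration of the product described above, the sketch below computes textbook TF-IDF by hand on a three-document toy corpus. The corpus, the helper names `tf`, `idf` and `tf_idf`, and the choice of the natural logarithm are assumptions made for this example. scikit-learn's `TfidfVectorizer`, used later in this notebook, applies a smoothed IDF and normalizes each row, so its numbers will differ.

```python
import math

# Hypothetical toy corpus, already tokenized (not taken from the SMS dataset).
docs = [
    ["free", "entry", "win", "prize"],
    ["call", "me", "later"],
    ["free", "call", "now"],
]

def tf(term, doc):
    """Share of the document's tokens that are `term`."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Log of (number of documents / number of documents containing `term`).

    Assumes the term occurs in at least one document.
    """
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

score_prize = tf_idf("prize", docs[0], docs)  # "prize" occurs in one document
score_free = tf_idf("free", docs[0], docs)    # "free" occurs in two documents
print(round(score_prize, 3), round(score_free, 3))
```

Both words occur once in the first document, so they share the same TF. "prize" still scores higher because it appears in fewer documents.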

### N-grams

N-grams are contiguous sequences of n items from a given sample of text or speech. The items can be characters or words; in this notebook, they are words.

*   **Unigrams (n=1):** Individual words (e.g., "the", "quick", "brown").
*   **Bigrams (n=2):** Sequences of two words (e.g., "the quick", "quick brown").
*   **Trigrams (n=3):** Sequences of three words (e.g., "the quick brown").

Using N-grams in text representation techniques like TF-IDF can help capture some of the context and relationships between words, which can be useful for tasks like spam detection where the order and combination of words matter. With bigrams (as used later in this notebook), a model can learn that phrases like "free entry" or "claim prize" are more indicative of spam than the individual words "free" or "claim" alone.
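
The definitions above can be sketched in a few lines of plain Python. The `ngrams` helper is a hypothetical name introduced for this example, not a function from NLTK or scikit-learn.

```python
def ngrams(tokens, n):
    """Return every contiguous run of n tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the quick brown fox".split()
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(unigrams)  # ['the', 'quick', 'brown', 'fox']
print(bigrams)   # ['the quick', 'quick brown', 'brown fox']
print(trigrams)  # ['the quick brown', 'quick brown fox']
```

A text with k words yields k - n + 1 n-grams, so a message shorter than n words produces none.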

### Load Dataset

This cell loads the SMS Spam Collection dataset into a pandas DataFrame. The dataset is a tab-separated file with two columns: 'label' (indicating if the message is 'ham' or 'spam') and 'message' (the text of the SMS message).

In [None]:
import pandas as pd
messages=pd.read_csv('/content/drive/MyDrive/Colab Notebooks/SMSSpam_Dataset/SMSSpamCollection.txt',
                    sep='\t',names=["label","message"])

### Display DataFrame

This cell displays the contents of the loaded DataFrame. The output shows the first and last few rows of the DataFrame, giving an overview of the data structure and content.

In [None]:
messages

### Data Cleaning And Preprocessing - Import Libraries

This cell imports the libraries needed for data cleaning and preprocessing:
- `re`: Regular expression module for text cleaning.
- `nltk`: Natural Language Toolkit for various text processing tasks, including downloading stopwords.

The code also downloads the 'stopwords' corpus from NLTK, which is a list of common words that are often removed during text preprocessing.

In [None]:
## Data Cleaning And Preprocessing
import re
import nltk
nltk.download('stopwords')

### Data Cleaning And Preprocessing - Initialize Tools

This cell imports `stopwords` and `WordNetLemmatizer` from NLTK and initializes a `WordNetLemmatizer` object.
- `stopwords`: A list of common English words to be removed.
- `WordNetLemmatizer`: A tool to reduce words to their base or dictionary form (lemmatization).

In [None]:
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
wordlemmatize=WordNetLemmatizer()

### Data Cleaning And Preprocessing - Download WordNet

This cell downloads the 'wordnet' corpus from NLTK. WordNet is a lexical database of English, and it's used by the `WordNetLemmatizer` to perform lemmatization.

In [None]:
nltk.download('wordnet')

### Data Cleaning And Preprocessing - Apply Cleaning and Lemmatization

This cell cleans and preprocesses the 'message' column of the DataFrame.
For each message, it performs the following steps:
1. Replaces every character that is not a letter with a space, using a regular expression.
2. Converts the message to lowercase.
3. Splits the message into individual words.
4. Drops stopwords and lemmatizes the remaining words.
5. Joins the processed words back into a single string.

The cleaned and preprocessed messages are stored in the `corpus` list.

In [None]:
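# Build the stopword set once; calling stopwords.words() for every word is slow.
stop_words=set(stopwords.words('english'))
corpus=[]
for i in range(len(messages)):
    # Replace everything except ASCII letters with a space (A-Z, not A-z).
    review=re.sub('[^a-zA-Z]',' ',messages['message'][i])
    review=review.lower()
    review=review.split()
    # Drop stopwords, then lemmatize the remaining words.
    review=[wordlemmatize.lemmatize(word) for word in review if word not in stop_words]
    review=' '.join(review)
    corpus.append(review)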
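
The steps listed above can be traced on a single string. The sketch below uses only the standard library. The tiny `STOPWORDS` set, the `clean` helper and the sample message are hypothetical stand-ins for NLTK's stopword list, the loop above and the dataset. Lemmatization is left out so that nothing needs to be downloaded.

```python
import re

# Hypothetical, tiny stand-in for stopwords.words('english').
STOPWORDS = {"a", "the", "to", "you", "have", "is"}

def clean(message):
    """Keep letters only, lowercase, split, drop stopwords, rejoin."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", message)
    words = letters_only.lower().split()
    return " ".join(w for w in words if w not in STOPWORDS)

cleaned = clean("You have WON a $1000 prize! Call 09061701461 to claim.")
print(cleaned)  # won prize call claim

# The range A-z (lowercase z) also spans the ASCII characters [ \ ] ^ _ `,
# so a pattern written that way leaves them in the text.
kept = re.sub(r"[^a-zA-z]", " ", "free_entry")
print(kept)  # free_entry
```

Because digits and punctuation are replaced by spaces, not removed, neighbouring words are never glued together.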

### Display Corpus

This cell displays the contents of the `corpus` list, which contains the cleaned and preprocessed SMS messages.

In [None]:
corpus

### Import TfidfVectorizer

This cell imports the `TfidfVectorizer` class from the `sklearn.feature_extraction.text` module. This class is used to convert a collection of raw documents to a matrix of TF-IDF features.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

### TF-IDF Vectorization

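This cell creates an instance of `TfidfVectorizer` with `max_features=1000`, which keeps only the 1000 terms that occur most frequently across the corpus, and `binary=True`, which sets the term-frequency part of each score to 1 whenever the term is present, regardless of how often it occurs. The resulting values are still real-valued TF-IDF weights, not just 0 and 1, because the IDF weighting and row normalization still apply. The cell then fits the vectorizer on `corpus` and transforms the text into a matrix of TF-IDF features, which is stored in the variable `X`. The `.toarray()` method converts the sparse matrix returned by `fit_transform` into a dense NumPy array.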

In [None]:
tfidfconverter = TfidfVectorizer(max_features=1000,binary=True)
X = tfidfconverter.fit_transform(corpus).toarray()
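
To make the effect of `binary=True` more concrete, here is a pure-Python sketch of the weighting that scikit-learn documents for its default settings (`smooth_idf=True`, `norm='l2'`): a 0/1 term-presence value multiplied by a smoothed IDF, followed by L2 normalization of each row. The toy corpus and the helpers `smoothed_idf` and `binary_tfidf_row` are made up for illustration. Treat this as an approximation of the library's behavior, not its actual implementation.

```python
import math

# Hypothetical tokenized documents (assumption, not from the SMS dataset).
docs = [
    ["free", "free", "entry", "prize"],  # "free" appears twice
    ["call", "later"],
    ["free", "call"],
]
vocab = sorted({term for doc in docs for term in doc})
n_docs = len(docs)

def smoothed_idf(term):
    """idf = ln((1 + n) / (1 + df)) + 1, the documented smooth_idf formula."""
    df = sum(1 for doc in docs if term in doc)
    return math.log((1 + n_docs) / (1 + df)) + 1

def binary_tfidf_row(doc):
    # Binary TF: 1 if the term is present, no matter how often it occurs.
    weights = [(1.0 if term in doc else 0.0) * smoothed_idf(term) for term in vocab]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

row = binary_tfidf_row(docs[0])
print(dict(zip(vocab, (round(w, 3) for w in row))))
```

The repeated "free" does not raise its weight, and the non-zero entries are fractions between 0 and 1, not ones.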

### Configure NumPy Print Options

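This cell imports the `numpy` library and configures its print options: `edgeitems=30` shows up to 30 items at the start and end of each dimension before the output is abbreviated, `linewidth=100000` keeps each row on a single line, and the formatter prints floating-point numbers with 3 significant digits.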

In [None]:
import numpy as np
np.set_printoptions(edgeitems=30,linewidth=100000,
                    formatter=dict(float=lambda x:"%.3g" % x))
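
The `"%.3g"` format used above is plain Python string formatting, so its effect can be checked without NumPy. The `format_3g` name below is a hypothetical helper that wraps the same expression as the lambda.

```python
def format_3g(x):
    """Format a float with 3 significant digits, like the lambda above."""
    return "%.3g" % x

print(format_3g(0.123456))  # 0.123
print(format_3g(12.3456))   # 12.3
print(format_3g(0.5))       # 0.5
print(format_3g(1234.5))    # 1.23e+03
```

Trailing zeros are dropped, and values that need more than three digits before the decimal point switch to scientific notation.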

### Display TF-IDF Matrix

This cell displays the `X` array, which contains the TF-IDF features of the SMS messages. The output is a numerical representation of the text data, suitable for machine learning models.

In [None]:
X

### TF-IDF with N-grams Vectorization

This cell creates an instance of `TfidfVectorizer` with `max_features=1000` and `binary=True`, but this time it also includes `ngram_range=(2, 2)`. This means the vectorizer will consider only bigrams (sequences of two words) instead of individual words. It then applies the vectorizer to the `corpus` to transform the text data into a matrix of TF-IDF features based on bigrams, overwriting the earlier unigram matrix in `X`.

In [None]:
tfidf = TfidfVectorizer(max_features=1000,binary=True,ngram_range=(2,2))
X = tfidf.fit_transform(corpus).toarray()

### Display TF-IDF Vocabulary with N-grams

This cell displays the vocabulary learned by the `TfidfVectorizer` when using bigrams (`ngram_range=(2, 2)`). The output is a dictionary where keys are the bigrams (sequences of two words) and values are the indices of the corresponding columns in the TF-IDF matrix.

In [None]:
tfidf.vocabulary_
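
The shape of this dictionary can be imitated in plain Python. The sketch below assumes that terms are numbered in alphabetical order, which matches how scikit-learn is understood to order its features. It ignores `max_features` and the library's default tokenization rules, and the `toy_corpus` and `bigrams` names are hypothetical.

```python
def bigrams(text):
    """Return the word bigrams of a whitespace-separated string."""
    words = text.split()
    return [" ".join(words[i:i + 2]) for i in range(len(words) - 1)]

# Hypothetical cleaned messages, in the style of `corpus`.
toy_corpus = ["free entry win prize", "claim prize now", "free entry today"]

# Collect the distinct bigrams and number them in sorted order (assumption).
terms = sorted({bg for doc in toy_corpus for bg in bigrams(doc)})
vocabulary = {term: index for index, term in enumerate(terms)}
print(vocabulary)
```

Each value identifies a column: here, column 3 of a matching matrix would hold the weights for "free entry".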

### Display TF-IDF Matrix with N-grams

This cell displays the `X` array, which contains the TF-IDF features of the SMS messages based on bigrams. The output is a numerical representation of the text data using bigrams, which can capture more context than individual words.

In [None]:
X