<a href="https://colab.research.google.com/github/humeratabassum/NLP/blob/main/Exp_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLP Text Representation Methods – Code Overview
This notebook demonstrates four common text representation methods on a sample corpus of three sentences: one-hot encoding, Bag of Words, TF-IDF, and n-grams. These methods are commonly used to prepare text for Natural Language Processing (NLP) tasks such as classification or clustering.

# 2. One-Hot Encoding
Function: one_hot_encoding(corpus)

Assigns each unique word in the corpus an index, then represents each sentence as a binary vector over that vocabulary.

For every sentence, a vector is created where:

1 → word is present

0 → word is absent

Example: [0, 1, 0, 1, 0] (depends on word order in the dictionary)
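The presence/absence encoding described above can be sketched in a few lines. This is a minimal illustration, not the notebook's function: the name `binary_vectors` and the two-sentence input are hypothetical, and the vocabulary is sorted here (an added assumption) so the word-to-index mapping is the same on every run.

```python
def binary_vectors(sentences):
    # Sorted vocabulary gives a stable word -> index mapping across runs.
    vocab = sorted({word for s in sentences for word in s.split()})
    index = {word: i for i, word in enumerate(vocab)}
    vectors = []
    for s in sentences:
        vec = [0] * len(vocab)
        for word in s.split():
            vec[index[word]] = 1  # mark presence only, not frequency
        vectors.append(vec)
    return vectors, index

vectors, index = binary_vectors(["I love Python", "I love data"])
print(index)    # {'I': 0, 'Python': 1, 'data': 2, 'love': 3}
print(vectors)  # [[1, 1, 0, 1], [1, 0, 1, 1]]
```

Because each entry only records presence, a word that occurs twice in a sentence still produces a single 1.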

# 3. Bag of Words (BoW)
Function: bag_of_words(corpus)

Uses CountVectorizer from sklearn.

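Converts each sentence into a vector of word counts: each entry is the number of times that word appears in the sentence.

With its default settings, CountVectorizer lowercases the text and ignores punctuation and single-character tokens, which is why "I" does not appear in the feature list and "Python." becomes "python".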

Output: A 2D matrix where:

Rows = sentences

Columns = words

Values = word count in each sentence
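To make the matrix layout concrete, the counting step can be approximated with the standard library alone. This is a rough sketch, not CountVectorizer itself: the helper name `count_vectors` is hypothetical, and the tokenization rule (lowercase, tokens of two or more word characters) is an assumption chosen to mirror the vectorizer's default behavior.

```python
import re
from collections import Counter

def count_vectors(sentences):
    # Assumed tokenization: lowercase, keep tokens of 2+ word characters.
    tokenized = [re.findall(r"\b\w\w+\b", s.lower()) for s in sentences]
    # Columns are the sorted vocabulary; rows are sentences.
    vocab = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)  # missing words count as 0
        rows.append([counts[word] for word in vocab])
    return rows, vocab

rows, vocab = count_vectors([
    "I am learning machine learning with Python.",
    "I love programming in Python.",
])
print(vocab)  # ['am', 'in', 'learning', 'love', 'machine', 'programming', 'python', 'with']
print(rows)   # [[1, 0, 2, 0, 1, 0, 1, 1], [0, 1, 0, 1, 0, 1, 1, 0]]
```

The 2 in the first row is the repeated word "learning", which a binary presence vector would record as 1.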

# 4. TF-IDF (Term Frequency - Inverse Document Frequency)
Function: tf_idf(corpus)

Uses TfidfVectorizer from sklearn.

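Similar to BoW, but each count is reweighted by how informative the word is across the corpus.

Words that appear in most or all sentences get a lower weight. Here "python" occurs in all three sentences, so it scores lower than words that occur in only one.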

Output: Matrix with TF-IDF scores for each word per sentence.
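The weighting can be worked through by hand for one sentence. The sketch below assumes scikit-learn's default settings (smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row) and the same assumed tokenization as the vectorizer's defaults. The helper name `tfidf_row` is hypothetical.

```python
import math
import re

def tfidf_row(sentences, row):
    # Assumed tokenization: lowercase, keep tokens of 2+ word characters.
    docs = [re.findall(r"\b\w\w+\b", s.lower()) for s in sentences]
    n = len(docs)
    weights = {}
    for term in set(docs[row]):
        tf = docs[row].count(term)               # raw count in this sentence
        df = sum(term in d for d in docs)        # sentences containing the term
        idf = math.log((1 + n) / (1 + df)) + 1   # smoothed inverse document frequency
        weights[term] = tf * idf
    # L2-normalize so the row has unit length.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

corpus = [
    "I love programming in Python.",
    "Python is an amazing language for data science.",
    "I am learning machine learning with Python.",
]
scores = tfidf_row(corpus, 0)
print(round(scores["python"], 4))  # 0.3227
print(round(scores["love"], 4))    # 0.5465
```

These values agree with the first row of the TF-IDF matrix printed by the notebook: "python" appears in every sentence, so its IDF is 1 and it ends up with the smallest weight in the row.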

# 5. N-Grams
Function: n_grams(corpus, n=2)

Splits text into sequences of 'n' consecutive words.

Example (for bigrams, n=2): "I love programming" → ('I', 'love'), ('love', 'programming')

Captures local word order, which representations based on individual words discard.

Output: List of n-grams for each sentence.
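The sliding-window idea behind n-grams can be sketched without NLTK. This is an illustrative equivalent for whitespace-split tokens, not the library's implementation, and the function name `make_ngrams` is hypothetical.

```python
def make_ngrams(tokens, n=2):
    # Slide a window of size n over the token list, one step at a time.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I love programming in Python.".split()
bigrams = make_ngrams(tokens, 2)
print(bigrams)
# [('I', 'love'), ('love', 'programming'), ('programming', 'in'), ('in', 'Python.')]
```

A sentence with k tokens yields k - n + 1 n-grams, and a sentence shorter than n tokens yields none.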

# 6. Execution and Output
Each function is called with the corpus, and the results are printed:

✅ One-Hot Vectors and their Word Index Mapping

✅ BoW Vectors and Feature Words

✅ TF-IDF Vectors and Feature Words

✅ 2-Grams from each sentence

In [1]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import numpy as np
from nltk.util import ngrams

# Sample Text Data
corpus = [
    "I love programming in Python.",
    "Python is an amazing language for data science.",
    "I am learning machine learning with Python."
]

# 1. One-Hot Encoding
def one_hot_encoding(corpus):
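    # Whitespace splitting keeps punctuation attached, so "Python" and "Python." are separate words.
    # set() has no stable order, so the index assigned to each word can differ between runs.
    words = list(set(' '.join(corpus).split()))  # Unique words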
    word_to_index = {word: idx for idx, word in enumerate(words)}  # Mapping word to index
    one_hot_vectors = []

    for text in corpus:
        vector = [0] * len(words)
        for word in text.split():
            vector[word_to_index[word]] = 1
        one_hot_vectors.append(vector)

    return one_hot_vectors, word_to_index

# 2. Bag of Words (BoW)
def bag_of_words(corpus):
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(corpus)
    return X.toarray(), vectorizer.get_feature_names_out()

# 3. TF-IDF
def tf_idf(corpus):
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(corpus)
    return X.toarray(), vectorizer.get_feature_names_out()

# 4. N-Grams (2-Grams)
def n_grams(corpus, n=2):
    ngram_list = []
    for text in corpus:
        tokens = text.split()
        ngram_list.append(list(ngrams(tokens, n)))  # List of n-grams
    return ngram_list

# Running the functions and printing results:

# One-Hot Encoding
one_hot_vectors, word_to_index = one_hot_encoding(corpus)
print("\nOne-Hot Encoding:")
print("Word to Index:", word_to_index)
print("One-Hot Vectors:\n", np.array(one_hot_vectors))

# Bag of Words (BoW)
bow_vectors, bow_words = bag_of_words(corpus)
print("\nBag of Words (BoW):")
print("BoW Vectors:\n", bow_vectors)
print("BoW Words:", bow_words)

# TF-IDF
tfidf_vectors, tfidf_words = tf_idf(corpus)
print("\nTF-IDF Representation:")
print("TF-IDF Vectors:\n", tfidf_vectors)
print("TF-IDF Words:", tfidf_words)

# N-Grams (2-Grams)
ngrams_result = n_grams(corpus, n=2)
print("\nN-Grams (2-Grams):")
for idx, ngram in enumerate(ngrams_result):
    print(f"Sentence {idx+1} N-Grams:", ngram)



One-Hot Encoding:
Word to Index: {'learning': 0, 'I': 1, 'is': 2, 'Python': 3, 'love': 4, 'an': 5, 'in': 6, 'programming': 7, 'data': 8, 'science.': 9, 'am': 10, 'machine': 11, 'language': 12, 'with': 13, 'for': 14, 'Python.': 15, 'amazing': 16}
One-Hot Vectors:
 [[0 1 0 0 1 0 1 1 0 0 0 0 0 0 0 1 0]
 [0 0 1 1 0 1 0 0 1 1 0 0 1 0 1 0 1]
 [1 1 0 0 0 0 0 0 0 0 1 1 0 1 0 1 0]]

Bag of Words (BoW):
BoW Vectors:
 [[0 0 0 0 0 1 0 0 0 1 0 1 1 0 0]
 [0 1 1 1 1 0 1 1 0 0 0 0 1 1 0]
 [1 0 0 0 0 0 0 0 2 0 1 0 1 0 1]]
BoW Words: ['am' 'amazing' 'an' 'data' 'for' 'in' 'is' 'language' 'learning' 'love'
 'machine' 'programming' 'python' 'science' 'with']

TF-IDF Representation:
TF-IDF Vectors:
 [[0.         0.         0.         0.         0.         0.54645401
  0.         0.         0.         0.54645401 0.         0.54645401
  0.32274454 0.         0.        ]
 [0.         0.36888498 0.36888498 0.36888498 0.36888498 0.
  0.36888498 0.36888498 0.         0.         0.         0.
  0.21786941 0.3688