## Lab 6: Tutorial Objective
In this tutorial, you will learn how to:
- Create a Bag of Words (BoW) model.
- Generate N-Grams to capture relationships between words.

# Embeddings
---

### 1. Bag of Words (BoW)

* **Key Insight:** **Presence & Prevalence.**
* **The Logic:** If a word appears frequently, the document is likely about that topic.
* **The Reality:** It treats text as an unordered "soup." It captures the **vocabulary** used but completely ignores grammar, word order, and the relative importance of common words (like "the" or "is").
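
To make the "unordered soup" point concrete, here is a minimal sketch using only the standard library. The helper `bag_of_words` and the two example sentences are illustrative assumptions, not part of the lab code; the tokenizer is a plain whitespace split rather than scikit-learn's.

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split on whitespace, and count each token."""
    return Counter(text.lower().split())

# Two sentences with opposite meanings but identical vocabulary
a = bag_of_words("the dog bites the man")
b = bag_of_words("the man bites the dog")

print(a["the"])  # 2
print(a == b)    # True: the bags are identical, so word order is lost
```

Because both sentences map to the same counts, any model built on BoW features alone cannot tell them apart.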

### 2. TF-IDF

* **Key Insight:** **Uniqueness & Significance.**
* **The Logic:** A word is only "important" if it appears often in *one* document but rarely across the rest of the collection.
* **The Reality:** It effectively filters out "noise" and automatically identifies **keywords**. It tells you what makes a specific document distinct from the crowd.
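
The weighting can be sketched by hand. The example below assumes the classic unsmoothed form, term frequency times `log(N / df)`, on a made-up three-sentence corpus. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalizes each row by default, so its numbers would differ.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents each word occurs in
df = Counter(word for tokens in tokenized for word in set(tokens))

def tf_idf(tokens):
    """Score each word as (relative frequency in this document) * log(N / df)."""
    counts = Counter(tokens)
    return {w: (c / len(tokens)) * math.log(n_docs / df[w]) for w, c in counts.items()}

scores = tf_idf(tokenized[0])
print(scores["the"])                   # 0.0: "the" occurs in every document
print(max(scores, key=scores.get))     # mat: the only word unique to document 1
```

Even though "the" is the most frequent word in the first document, its score is zero because it appears everywhere, while "mat" scores highest because no other document contains it.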

### 3. Word Embeddings (e.g., Word2Vec)

* **Key Insight:** **Similarity & Association.**
* **The Logic:** "You shall know a word by the company it keeps." Words used in similar environments are mathematically mapped close to one another.
* **The Reality:** It captures **relationships** (e.g., *Paris* is to *France* as *Tokyo* is to *Japan*). However, it struggles with words that have multiple meanings depending on the sentence.
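
The analogy arithmetic can be illustrated with a toy example. The three-dimensional vectors below are hand-made assumptions chosen so the arithmetic works out; real Word2Vec vectors are learned from a large corpus and typically have a hundred or more dimensions.

```python
import math

# Hand-made vectors, purely illustrative (not learned embeddings)
vectors = {
    "paris":  [1.0, 1.0, 0.0],
    "france": [0.0, 1.0, 0.0],
    "tokyo":  [1.0, 0.0, 1.0],
    "japan":  [0.0, 0.0, 1.0],
    "banana": [0.1, 0.1, 0.1],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def analogy(a, b, c):
    """Return the word d such that a is to b as c is to d."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("paris", "france", "tokyo"))  # japan
```

Note that each word has exactly one vector here, which is why a static embedding cannot separate the different senses of an ambiguous word.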

### 4. Contextual Embeddings (e.g., BERT/Transformers)

* **Key Insight:** **Nuance & Intent.**
* **The Logic:** The meaning of a word is defined dynamically by every other word in the sentence.
* **The Reality:** It understands **polysemy** (one word, multiple meanings) and complex structures like sarcasm or negation. It represents text as a living sequence rather than a static snapshot.

---

### Summary Comparison Table

| Method | Level of Insight | Best For... |
| --- | --- | --- |
| **BoW** | Quantitative (How many?) | Basic keyword matching |
| **TF-IDF** | Statistical (How unique?) | Search engines & document labeling |
| **Word2Vec** | Semantic (How related?) | Finding synonyms & broad concepts |
| **BERT** | Holistic (What's the intent?) | Translation, Q&A, & deep sentiment |

---

In [1]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Sample corpus
documents = [  
    "I love programming in Python.",  
    "Python is a great programming language.",  
    "I love coding in Python."  
]

print("Original Documents:")
for i, doc in enumerate(documents):
    print(f"Document {i+1}: {doc}")
print()

Original Documents:
Document 1: I love programming in Python.
Document 2: Python is a great programming language.
Document 3: I love coding in Python.



## 1. Bag of Words (BoW)
The Bag of Words model represents each document as a vector of word counts over a shared vocabulary, discarding word order.
### 1.1. Using CountVectorizer from scikit-learn
Setup: Import the necessary modules and prepare the data

In [2]:
# Initialize CountVectorizer for BoW
bow_vectorizer = CountVectorizer()

# Fit and transform the corpus
X_bow = bow_vectorizer.fit_transform(documents)

# Convert to DataFrame
bow_df = pd.DataFrame(X_bow.toarray(), columns=bow_vectorizer.get_feature_names_out())


CountVectorizer builds a matrix with one column per unique word in the corpus and one row per document. Each cell holds the number of times that word occurs in that document. The resulting matrix looks as follows:

In [3]:
print("Bag of Words Model:")
print(bow_df)
#print(f"\nVocabulary: {list(bow_vectorizer.get_feature_names_out())}")
print("............Explanation............")
print("• Each row represents a document.\n")
print("• Each column represents a word from the corpus.\n")
print("• The values in the matrix represent the frequency of the words in the corresponding document.\n")

Bag of Words Model:
   coding  great  in  is  language  love  programming  python
0       0      0   1   0         0     1            1       1
1       0      1   0   1         1     0            1       1
2       1      0   1   0         0     1            0       1
............Explanation............
• Each row represents a document.

• Each column represents a word from the corpus.

• The values in the matrix represent the frequency of the words in the corresponding document.
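
Notice that the matrix has no column for "I" or "a". This follows from CountVectorizer's default settings, which lowercase the text and keep only tokens of two or more word characters. The sketch below approximates that behavior with the standard `re` module; the helper name `tokenize` is an assumption for illustration, and the pattern is the documented default `token_pattern`.

```python
import re

# Assumed to mirror CountVectorizer's defaults: lowercase=True and
# token_pattern=r"(?u)\b\w\w+\b" (tokens of two or more word characters)
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and extract tokens of at least two characters."""
    return TOKEN_PATTERN.findall(text.lower())

print(tokenize("I love programming in Python."))
# ['love', 'programming', 'in', 'python']
print(tokenize("Python is a great programming language."))
# ['python', 'is', 'great', 'programming', 'language']
```

The same rule means that a contraction such as "I've" contributes only the token `ve`.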



## 2. N-Grams
N-Grams are contiguous sequences of n words from a given text. For example:
-  1-Grams: Single words (unigrams).
-  2-Grams: Pairs of words (bigrams).
-  3-Grams: Triples of words (trigrams)
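
As a minimal sketch of the definition above, n-grams can be produced by sliding a window of width n over a token list. The function name `ngrams` and the pre-tokenized input are assumptions for illustration; this is not how scikit-learn implements it internally.

```python
def ngrams(tokens, n):
    """Return the contiguous n-token windows of `tokens` as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["love", "programming", "in", "python"]

print(ngrams(tokens, 1))  # ['love', 'programming', 'in', 'python']
print(ngrams(tokens, 2))  # ['love programming', 'programming in', 'in python']
print(ngrams(tokens, 3))  # ['love programming in', 'programming in python']
```

A document with k tokens yields k - n + 1 n-grams, so longer windows produce fewer sequences per document.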

In [4]:
# 2. N-Grams Models
print("=" * 50)
print("2. N-GRAMS MODELS")
print("=" * 50)

# 2.1 Bigrams (2-grams)
print("\n2.1 BIGRAMS (2-GRAMS)")
print("-" * 20)

# Initialize CountVectorizer for bigrams
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))

# Fit and transform the corpus
X_bigram = bigram_vectorizer.fit_transform(documents)

# Convert to DataFrame
bigram_df = pd.DataFrame(X_bigram.toarray(), columns=bigram_vectorizer.get_feature_names_out())

print("Bigrams Model:")
print(bigram_df)
print(f"\nBigram features: {list(bigram_vectorizer.get_feature_names_out())}")
print(f"Number of bigram features: {len(bigram_vectorizer.get_feature_names_out())}")

# 2.2 Trigrams (3-grams)
print("\n2.2 TRIGRAMS (3-GRAMS)")
print("-" * 20)

# Initialize CountVectorizer for trigrams
trigram_vectorizer = CountVectorizer(ngram_range=(3, 3))

# Fit and transform the corpus
X_trigram = trigram_vectorizer.fit_transform(documents)

# Convert to DataFrame
trigram_df = pd.DataFrame(X_trigram.toarray(), columns=trigram_vectorizer.get_feature_names_out())

print("Trigrams Model:")
print(trigram_df)
print(f"\nTrigram features: {list(trigram_vectorizer.get_feature_names_out())}")
print(f"Number of trigram features: {len(trigram_vectorizer.get_feature_names_out())}")

# 2.3 Mixed N-grams (1-3 grams)
print("\n2.3 MIXED N-GRAMS (1-3 GRAMS)")
print("-" * 25)

# Initialize CountVectorizer for mixed n-grams (unigrams, bigrams, trigrams)
mixed_vectorizer = CountVectorizer(ngram_range=(1, 3))

# Fit and transform the corpus
X_mixed = mixed_vectorizer.fit_transform(documents)

# Convert to DataFrame
mixed_df = pd.DataFrame(X_mixed.toarray(), columns=mixed_vectorizer.get_feature_names_out())

print("Mixed N-grams Model (1-3 grams):")
print(mixed_df)
print(f"\nNumber of mixed n-gram features: {len(mixed_vectorizer.get_feature_names_out())}")

2. N-GRAMS MODELS

2.1 BIGRAMS (2-GRAMS)
--------------------
Bigrams Model:
   coding in  great programming  in python  is great  love coding  \
0          0                  0          1         0            0   
1          0                  1          0         1            0   
2          1                  0          1         0            1   

   love programming  programming in  programming language  python is  
0                 1               1                     0          0  
1                 0               0                     1          1  
2                 0               0                     0          0  

Bigram features: ['coding in', 'great programming', 'in python', 'is great', 'love coding', 'love programming', 'programming in', 'programming language', 'python is']
Number of bigram features: 9

2.2 TRIGRAMS (3-GRAMS)
--------------------
Trigrams Model:
   coding in python  great programming language  is great programming  \
0                 0            

## Summary
- Bag of Words: A representation of text data where each document is represented by word frequencies.
- N-Grams: Sequences of n words that capture contextual relationships between words (unigrams, bigrams, trigrams).

In [5]:
# Summary
print("\n" + "=" * 50)
print("SUMMARY")
print("=" * 50)
print(f"• Bag of Words features: {len(bow_vectorizer.get_feature_names_out())}")
print(f"• Bigram features: {len(bigram_vectorizer.get_feature_names_out())}")
print(f"• Trigram features: {len(trigram_vectorizer.get_feature_names_out())}")
print(f"• Mixed N-gram features: {len(mixed_vectorizer.get_feature_names_out())}")
print("\nNote: The number of possible n-grams grows exponentially with n, but the number observed is capped by the corpus size!")


SUMMARY
• Bag of Words features: 8
• Bigram features: 9
• Trigram features: 7
• Mixed N-gram features: 24

Note: The number of possible n-grams grows exponentially with n, but the number observed is capped by the corpus size!
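
The counts above show fewer trigram features (7) than bigram features (9), which can look surprising. The sketch below separates the two quantities involved: the number of n-grams that could exist for a vocabulary of size V (V to the power n) and the number actually observed in the corpus. It uses only the standard library, and the regular expression is an assumed approximation of CountVectorizer's default tokenization.

```python
import re

documents = [
    "I love programming in Python.",
    "Python is a great programming language.",
    "I love coding in Python.",
]

# Approximate CountVectorizer's defaults: lowercase, tokens of 2+ characters
tokenized = [re.findall(r"\b\w\w+\b", d.lower()) for d in documents]
vocab = {t for tokens in tokenized for t in tokens}

results = {}
for n in (1, 2, 3):
    # Distinct n-grams that actually occur in the corpus
    observed = {
        tuple(tokens[i:i + n])
        for tokens in tokenized
        for i in range(len(tokens) - n + 1)
    }
    possible = len(vocab) ** n  # every ordered combination of n vocabulary words
    results[n] = (len(observed), possible)
    print(f"n={n}: observed={len(observed)}, possible={possible}")
# n=1: observed=8, possible=8
# n=2: observed=9, possible=64
# n=3: observed=7, possible=512
```

On a three-sentence corpus the observed counts stay small because each short document contains only a few windows; on a large corpus the observed count moves much closer to the exponential ceiling, which is where the feature explosion becomes a practical problem.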


In [6]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

def show_vectorization(name, docs, ngram_range=(1,1)):
    print(f"\n=== {name} ===")
    print("Documents:")
    for i, d in enumerate(docs, 1):
        print(f"{i}. {d}")
    
    vec = CountVectorizer(ngram_range=ngram_range)
    X = vec.fit_transform(docs)
    df = pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())
    
    print(f"\nVocabulary size: {len(vec.get_feature_names_out())}")
    print(df)
    print("-"*70)

# Example usage
corpus1 = [
    "This movie was absolutely fantastic and thrilling.",
    "The plot was boring and the acting felt fake.",
    "I loved the visuals but hated the ending.",
    "Best action film I've seen in years!"
]

corpus2 = [
    "The battery lasts all day and charges quickly.",
    "Really disappointed with the build quality.",
    "Great value for money, highly recommend.",
    "Screen is bright but speakers are weak.",
    "Perfect size and very lightweight."
]

# Run different ranges
show_vectorization("Corpus 1 – Movies (unigrams)", corpus1, (1,1))
show_vectorization("Corpus 1 – Movies (bigrams)", corpus1, (2,2))
show_vectorization("Corpus 2 – Products (1–2 grams)", corpus2, (1,2))


=== Corpus 1 – Movies (unigrams) ===
Documents:
1. This movie was absolutely fantastic and thrilling.
2. The plot was boring and the acting felt fake.
3. I loved the visuals but hated the ending.
4. Best action film I've seen in years!

Vocabulary size: 25
   absolutely  acting  action  and  best  boring  but  ending  fake  \
0           1       0       0    1     0       0    0       0     0   
1           0       1       0    1     0       1    0       0     1   
2           0       0       0    0     0       0    1       1     0   
3           0       0       1    0     1       0    0       0     0   

   fantastic  ...  movie  plot  seen  the  this  thrilling  ve  visuals  was  \
0          1  ...      1     0     0    0     1          1   0        0    1   
1          0  ...      0     1     0    2     0          0   0        0    1   
2          0  ...      0     0     0    2     0          0   0        1    0   
3          0  ...      0     0     1    0     0          0   1    

This section recaps the key insight behind each text representation method covered in Lab 6 and reflects on how the choice of representation affects the analysis.

### **Key Insights from Text Representations**

* **Bag of Words (BoW):**
    * **Insight:** Focuses on Presence & Prevalence, assuming that frequent words indicate the document's topic.
    * **Core Idea:** It uses simple word counts or presence.
    * **Reality:** It treats text as an unordered "soup," capturing vocabulary but completely ignoring grammar, word order, and the importance of common words like "the" or "is".


* **TF-IDF (Term Frequency-Inverse Document Frequency):**
    * **Insight:** Focuses on Uniqueness & Significance.
    * **Core Idea:** It down-weights common words by multiplying term frequency by inverse document frequency.
    * **Reality:** It effectively identifies keywords that make a document distinct while filtering out "noise," though it still ignores word order.


* **Word Embeddings (e.g., Word2Vec):**
    * **Insight:** Focuses on Similarity & Association based on the distributional hypothesis ("you shall know a word by the company it keeps").
    * **Core Idea:** Uses static embeddings to map words used in similar environments close together mathematically.
    * **Reality:** It captures semantic relationships and analogies but struggles with context and polysemy (words with multiple meanings).


* **Contextual Embeddings (e.g., BERT/Transformers):**
    * **Insight:** Focuses on Nuance & Intent.
    * **Core Idea:** The meaning of a word is defined dynamically by every other word in the sentence.
    * **Reality:** It understands complex structures like sarcasm, negation, and polysemy, though it is computationally heavy.


* **N-Grams:**
    * **Insight:** Captures contextual relationships between words by looking at contiguous sequences of *n* words (unigrams, bigrams, trigrams).
    * **Reality:** While they provide more context than single words, the number of possible features grows exponentially as *n* increases.



---

### **Reflection: Impact of Choice on Analysis**

The choice of text representation serves as a filter that determines what information your analysis can actually "see."

1. **From Counting to Understanding:** Choosing **BoW** or **TF-IDF** limits the analysis to a quantitative or statistical level (keyword matching), whereas BERT allows for a holistic analysis of intent and meaning.
2. **Trade-off between Speed and Depth:** BoW is very simple and fast, making it ideal for baselines or keyword-based tasks. In contrast, BERT is computationally heavy but necessary for modern NLP tasks like Question Answering (QA) or deep sentiment analysis.
3. **Handling Ambiguity:** If your analysis involves words with multiple meanings (polysemy), static methods like Word2Vec will fail to distinguish them, whereas contextual embeddings like **BERT** represent text as a "living sequence" that adapts to the surrounding words.
4. **Feature Explosion:** Choosing to use N-Grams to capture context can significantly increase the complexity of the data; for example, a small corpus might have only 8 BoW features but jump to 24 features when using mixed 1-3 grams.