# VL03 - Text Representation
In this seminar we will explore how to transform raw text data into numerical vectors that machine learning models can understand. We will focus on two foundational methods: Bag-of-Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF).

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import pandas as pd
import re

## 1. Loading the dataset

Later we will use:
https://www.kaggle.com/datasets/saadmakhdoom/ecommerce-faq-chatbot-dataset

In [None]:
faqs = [
    "How do I reset my password, and are passwords case-sensitive?", 
    "How to update my email address, and is updating necessary for security?", 
    "I can't log in to my account. What are the common login issues?", 
    "How do I change my payment method for the upcoming billing cycle?", 
    "I want to cancel my active monthly subscription and stop future payments.",
    "What is the return policy if a product is still under warranty?",
    "How to change my username if I am currently logging in with my email?",
    "How do I enable two-factor authentication for my account?",
    "How to update the billing address associated with my credit card.",
    "How do I account for changes in my order? Can I cancel it?"
]

## 2. Bag of Words
The Bag-of-Words model is the simplest form of text representation. It represents a text (like a sentence or document) as the multiset of its words, disregarding grammar and even word order, but keeping track of word frequency.
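
To make the "multiset" idea concrete, here is a minimal sketch using only the standard library. The helper `bag_of_words` is a hypothetical name introduced for illustration, and it splits on whitespace only, which is much cruder than what `CountVectorizer` does.

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split on whitespace, and count each token."""
    return Counter(text.lower().split())

# Two sentences with the same words in a different order
bag_a = bag_of_words("reset my password")
bag_b = bag_of_words("my password reset")
print(bag_a == bag_b)  # word order is discarded, so the bags compare equal

# Word frequency is preserved
bag_c = bag_of_words("cancel my order and cancel my subscription")
print(bag_c["cancel"], bag_c["order"])
```

Because a `Counter` stores only token counts, any information carried by word order or grammar is lost, which is exactly the trade-off BoW makes.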

### 2.1 Building the vocabulary
The first step in BoW is to build a vocabulary: a list of all unique words found across the entire corpus. `CountVectorizer` performs tokenization (breaking text into words) and, optionally, basic preprocessing (like lowercasing and stop-word removal).

After processing the corpus with `.fit_transform()`, we obtain the fitted vocabulary together with a document-term matrix that has one row per document and one column per vocabulary entry.


In [None]:
# CountVectorizer comes with some text-processing features, such as stop_words, case folding
vectorizer_bow = CountVectorizer(stop_words='english')

# Fit the vectorizer to the corpus to build the vocabulary 
# (vectorizer_bow.vocabulary_ contains a dictionary mapping token -> column index)
bow_matrix = vectorizer_bow.fit_transform(faqs)

# Get the unique words (features) that form the vocabulary
vocabulary = vectorizer_bow.get_feature_names_out()

def print_vocabulary(vocabulary):
    print(f"Total vocabulary size: {len(vocabulary)}\n")
    print("Vocabulary (unique tokens):")
    print(vocabulary)

print_vocabulary(vocabulary)


In [None]:
vectorizer_bow.vocabulary_

### 2.2 Our own text-processing pipeline
We can define our own pre-processing pipeline by providing a tokenizer argument, `CountVectorizer(tokenizer=my_tokenizer)`. The vectorizer calls this function once per document, and it must return the list of (pre-processed) tokens for that document.

In [None]:
import spacy

# Load the small English language model for spaCy
# We will use this model to perform advanced pre-processing (like lemmatization)
# Note: You may need to run `python -m spacy download en_core_web_sm` once in your environment
try:
    nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
except OSError:
    print("Warning: spaCy model 'en_core_web_sm' not found. Please run 'python -m spacy download en_core_web_sm'")
    # Fall back to a blank English pipeline if the standard model is missing.
    # Note: the blank pipeline only tokenizes; it has no lemmatizer component.
    nlp = spacy.blank("en")


# --- NEW CUSTOM TOKENIZER FUNCTION ---
def spacy_tokenizer(document):
    """
    Custom tokenizer using spaCy for advanced pre-processing.
    It performs tokenization, lemmatization, and removes stop words/punctuation.
    This aligns the process with the manual steps students learned previously.
    """
    # 1. Process the document with spaCy
    doc = nlp(document)
    
    # 2. Extract tokens, perform lemmatization, and filter:
    #    - .lemma_: the base form of the word (e.g., 'updates' -> 'update')
    #    - .is_stop: skip common stop words (if the model is loaded)
    #    - .is_punct: skip punctuation
    #    - .is_alpha: keep only tokens that are purely alphabetic
    tokens = [
        token.lemma_.lower() 
        for token in doc 
        if not token.is_stop and not token.is_punct and token.is_alpha
    ]
    return tokens
# --- END NEW CUSTOM TOKENIZER FUNCTION ---


# Initialize the CountVectorizer
# 1. We now pass our custom spacy_tokenizer function to the 'tokenizer' argument.
# 2. We set 'stop_words' to None because spacy_tokenizer handles stop words internally.
# 3. We set 'token_pattern' to None because the custom tokenizer handles token extraction.
vectorizer_bow = CountVectorizer(tokenizer=spacy_tokenizer, 
                                 stop_words=None, 
                                 token_pattern=None)

# Fit the vectorizer to the corpus to build the vocabulary
# The vectorizer now calls spacy_tokenizer(faq) for every document in `faqs`
bow_matrix = vectorizer_bow.fit_transform(faqs)

# Get the unique words (features) that form the vocabulary
vocabulary = vectorizer_bow.get_feature_names_out()


print_vocabulary(vocabulary)
    
print("\n--- Note the effect of Lemmatization! ---")
print("Words like 'resets' or 'updating' would now appear as 'reset' or 'update' if they were present in the corpus.")


### 2.3 Examining the BoW vector
Each document is now represented as a vector where the length equals the size of the vocabulary. The value at each position in the vector is the count of how often that corresponding word appears in the document.

Let's look at the full matrix and then inspect the vector for a single document by changing `doc_index`.

In [None]:
# Convert to a DataFrame for readability
bow_df = pd.DataFrame(bow_matrix.toarray(), columns=vocabulary)
bow_df.index = [f"D{i+1}" for i in range(len(faqs))]

print("--- Full BoW Matrix (Counts) ---")
print(bow_df.T) # the transpose to display documents as columns
print("\n" + "="*50 + "\n")

# Inspect a specific document
doc_index = 0
doc_vector = bow_df.iloc[doc_index]

print(f"--- BoW Vector for Document D{doc_index + 1} ---")
print(f"Document: {faqs[doc_index]}\n")

# Show only the words that appear in this document (count > 0)
present_words = doc_vector[doc_vector > 0].sort_values(ascending=False)
print(present_words)


### 2.4 Document similarity
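A key use of BoW is to find documents that are similar in wording, which often (but not always) means they are about the same topic. We can calculate the cosine similarity between their vectors. Cosine similarity measures the cosine of the angle between two non-zero vectors. For count vectors, which have no negative entries, the value lies between 0 (no shared terms) and 1 (same direction, that is, identical relative word counts).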

Note: To use the `cosine_similarity()` function, we added  `from sklearn.metrics.pairwise import cosine_similarity` at the top of this notebook.
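
As a quick check of the definition, the following sketch computes cosine similarity by hand for three small count vectors. The function name `cosine` and the vectors are made up for illustration; in the notebook itself we keep using scikit-learn's `cosine_similarity()`.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

u = [1, 1, 0, 2]
v = [2, 2, 0, 4]   # same direction as u, only longer
w = [0, 0, 3, 0]   # shares no non-zero position with u

print(cosine(u, v))  # approximately 1: same relative word counts
print(cosine(u, w))  # 0: no words in common
```

Note that `v` is simply `u` repeated twice: cosine similarity ignores vector length, so a document and the same document pasted twice are treated as the same direction.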

In [None]:
vector_u_idx = 1
vector_v_idx = 6
#vector_v_idx = 2

# Calculate cosine similarity
similarity_u_v = cosine_similarity(bow_matrix[vector_u_idx], bow_matrix[vector_v_idx])

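# Label the documents with the same D{i+1} naming used for the BoW DataFrame
print(f"D{vector_u_idx + 1}: {faqs[vector_u_idx]}")
print(f"D{vector_v_idx + 1}: {faqs[vector_v_idx]}")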
print(f"\nCosine Similarity (BoW): {similarity_u_v[0][0]:.4f}")

similarity_u_v

### 2.5 Matching an incoming query
To match an incoming user query to the most relevant FAQ, we transform the query into a BoW vector using the same fitted vectorizer and calculate its similarity against all document vectors.

In the code below, notice that we reuse `vectorizer_bow`, which was last fitted on the FAQ corpus in Section 2.2 (`vectorizer_bow.fit_transform(faqs)`), so that the exact same vocabulary, tokenization, and pre-processing pipeline are applied to the incoming query. Here, however, we call `.transform()` rather than `.fit_transform()`, because the vocabulary must not change.

Let's try two queries:
- Q1. How do I change my password
- Q2. How do I change my account's password
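
The sketch below illustrates, in plain Python, why the query has to be mapped onto the vocabulary learned from the corpus. The helpers `build_vocabulary` and `to_count_vector` are hypothetical stand-ins for fitting and transforming; they split on whitespace only and are not how `CountVectorizer` is implemented.

```python
def build_vocabulary(corpus):
    """'Fit' step: map every unique token in the corpus to a column index."""
    tokens = sorted({tok for doc in corpus for tok in doc.lower().split()})
    return {tok: idx for idx, tok in enumerate(tokens)}

def to_count_vector(text, vocabulary):
    """'Transform' step: count tokens using a fixed, already built vocabulary."""
    vector = [0] * len(vocabulary)
    for tok in text.lower().split():
        idx = vocabulary.get(tok)
        if idx is not None:      # tokens unseen during fitting are ignored
            vector[idx] += 1
    return vector

toy_corpus = ["reset password", "update email address"]
vocab = build_vocabulary(toy_corpus)

# "forgotten" never occurred in the corpus, so it has no column
query_vec = to_count_vector("reset forgotten password", vocab)
print(vocab)
print(query_vec)
```

The query vector has the same length as every document vector, which is what makes the cosine comparison possible. Words that were not seen during fitting contribute nothing to the match, which mirrors how a fitted `CountVectorizer` treats out-of-vocabulary terms in `.transform()`.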

In [None]:
#query = ["How do I change my account's password",]
query = ["How do I change my password",]

# Transform the query using the *fitted* CountVectorizer
query_vector = vectorizer_bow.transform(query)

# Calculate similarity between the query and ALL document vectors
similarities = cosine_similarity(query_vector, bow_matrix)

# Create a DataFrame for the results, correctly pairing the scores and the FAQ text.
# 1. Start with the index and similarity scores.
# 2. Add the corresponding FAQ text as a new column.
results_bow = pd.DataFrame({
    'similarity_score' : similarities[0], # the index '0' is the index of the query document
    'faq_text': faqs
}, index = [f"{i}" for i in range(len(faqs))] )

results_bow = results_bow.sort_values(by=["similarity_score"], ascending=False)


print("--- BoW Matching Results ---")
print(f"Query: {query[0]}")
print("-" * 30)
print(results_bow.head(5))


### 2.6 Reflect on the BoW limitations
Inspect the results for each query. Can you explain them? In which cases does similarity-based matching fail?

## 3. TF-IDF Representations
TF-IDF (Term Frequency-Inverse Document Frequency) addresses the "equal importance" problem of BoW, where every word counts the same no matter how common it is in the corpus. It assigns a higher score to words that are both relevant (frequent within a document) and distinctive (rare across the entire corpus).

### 3.1 Inspecting TF and IDF components
`TfidfVectorizer` calculates two components:
- Term Frequency (TF): How often a term appears in a document (similar to BoW, but often normalized).
- Inverse Document Frequency (IDF): A measure of how important or rare a term is across the whole corpus. Rare words get a higher IDF score.
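
To see how rarity turns into a number, the sketch below computes a smoothed IDF by hand on a toy corpus. It assumes the formula that scikit-learn documents for its default setting (`smooth_idf=True`), namely `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, where `n` is the number of documents and `df(t)` is the number of documents containing term `t`. The corpus and helper names are made up for illustration.

```python
import math

# Toy corpus, already tokenized
toy_docs = [
    ["reset", "password"],
    ["update", "password"],
    ["cancel", "order"],
]
n_docs = len(toy_docs)

def document_frequency(term):
    """Number of documents that contain the term at least once."""
    return sum(term in doc for doc in toy_docs)

def smoothed_idf(term):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + document_frequency(term))) + 1

for term in ["password", "reset"]:
    print(term, document_frequency(term), round(smoothed_idf(term), 4))
```

"password" occurs in two documents and "reset" in only one, so "reset" receives the higher IDF. By default, `TfidfVectorizer` additionally multiplies each IDF by the term count and L2-normalizes every document vector.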

In [None]:
# Initialize the TfidfVectorizer with the custom tokenizer
vectorizer_tfidf = TfidfVectorizer(tokenizer=spacy_tokenizer, 
                                   stop_words=None,
                                   token_pattern=None)

# Fit and transform the corpus
tfidf_matrix = vectorizer_tfidf.fit_transform(faqs)

# Get the feature names and the IDF scores
vocabulary_tfidf = vectorizer_tfidf.get_feature_names_out()
idf_scores = vectorizer_tfidf.idf_

# Create a DataFrame to view words and their IDF scores
idf_df = pd.DataFrame({
    'word': vocabulary_tfidf,
    'idf_score': idf_scores
}).sort_values(by='idf_score', ascending=False).reset_index(drop=True)

print("--- Words Sorted by IDF Score (Rarity) ---")
print("High IDF -> Rare words (More valuable for distinction)")
print("Low IDF -> Frequent words (Less valuable for distinction)\n")

# Show the full table: rarest words at the top, most frequent words at the bottom
print(idf_df)


### 3.2 Query matching with TF-IDF
Let's re-run the same query matching exercise with the TF-IDF vectors. We expect the results to be more accurate because the unique, differentiating words in the query will be weighted more heavily.
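
Before running it on the FAQs, the following self-contained sketch shows on a toy corpus how IDF weighting can break a tie that raw counts cannot. Everything here (`toy_docs`, `toy_query`, `vectorize`, `cosine`) is made up for illustration, and the IDF formula is the smoothed variant assumed to match scikit-learn's default.

```python
import math

toy_docs = [
    ["change", "payment"],
    ["change", "username"],
    ["change", "order"],
    ["reset", "password"],
]
toy_query = ["change", "password"]

vocab = sorted({tok for doc in toy_docs for tok in doc})
n_docs = len(toy_docs)

# Smoothed IDF per term: ln((1 + n) / (1 + df)) + 1
idf = {
    t: math.log((1 + n_docs) / (1 + sum(t in d for d in toy_docs))) + 1
    for t in vocab
}

def vectorize(tokens, weights=None):
    """Count vector over vocab, optionally scaled by per-term weights."""
    return [tokens.count(t) * (weights[t] if weights else 1.0) for t in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

bow_scores = [cosine(vectorize(toy_query), vectorize(d)) for d in toy_docs]
tfidf_scores = [cosine(vectorize(toy_query, idf), vectorize(d, idf)) for d in toy_docs]

for doc, b, t in zip(toy_docs, bow_scores, tfidf_scores):
    print(doc, round(b, 3), round(t, 3))
```

With raw counts, every document shares exactly one word with the query, so all four scores tie. With IDF weights, the shared word "password" (rare) counts for more than the shared word "change" (common), so the "reset password" document ranks first.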

In [None]:
# `query` is reused from the BoW code cell in Section 2.5

# Transform the query using the *fitted* TfidfVectorizer
query_vector_tfidf = vectorizer_tfidf.transform(query)

# Calculate similarity between the query and ALL document vectors
similarities_tfidf = cosine_similarity(query_vector_tfidf, tfidf_matrix)

# Create a dataframe to display the results nicely
results_tfidf = pd.DataFrame({
    'similarity_score' : similarities_tfidf[0], # the index '0' is the index of the query document
    'faq_text': faqs
}, index = [f"{i}" for i in range(len(faqs))] )

results_tfidf = results_tfidf.sort_values(by=["similarity_score"], ascending=False)


print("--- BoW Matching Results ---")
print(f"Query: {query[0]}")
print("-" * 30)
print(results_bow.head(3))

print("--- TF-IDF Matching Results ---")
print(f"Query: {query[0]}")
print("-" * 30)
print(results_tfidf.head(3))


### 3.3 Reflect on the differences between BoW and TF-IDF representations
- How do they compare in terms of how they address Q1 and Q2?
- What are the lingering limitations in TF-IDF?

## 4. Demo on a real FAQ dataset

Execute the following on your terminal to download the dataset:

```
mkdir -p data && \
curl -L -o data/kaggle_faq_dataset.zip 'https://www.kaggle.com/api/v1/datasets/download/saadmakhdoom/ecommerce-faq-chatbot-dataset' && \
unzip -o data/kaggle_faq_dataset.zip -d data
```


In [None]:
# This script performs a benchmark comparison of Bag-of-Words (BoW) and TF-IDF
# on a real-world FAQ retrieval task using a locally available JSON dataset.
import json
import os

from nltk.stem import PorterStemmer
porter = PorterStemmer()

# --- 2. TEXT PRE-PROCESSING  ---

def spacy_tokenizer(text, do_normalise = True):
    """
    Custom tokenizer function using spaCy for high-quality preprocessing:
    1. Tokenization
    2. Lowercasing
    3. Lemmatization (reducing words to root form, e.g., 'updates' -> 'update')
    4. Removing punctuation, extra whitespace, and stop words
    """
    # Process the text using spaCy
    doc = nlp(text)

    if do_normalise:
        # Keep lowercased lemmas; drop punctuation, whitespace, and stop words
        tokens = [
            token.lemma_.lower()
            for token in doc
            if not token.is_punct and not token.is_space and not token.is_stop
        ]
    else:
        # Return plain strings (not spaCy Token objects) so the vectorizer can count them
        tokens = [token.text for token in doc]
    # Remove any empty strings resulting from filtering
    return [t for t in tokens if t]

def load_data(file_path):
    """Loads and preprocesses the FAQ data from the specified JSON file."""

    # --- DEBUGGING STEP ---
    # Print the absolute path to determine the expected file location
    absolute_path = os.path.abspath(file_path)
    print(f"Attempting to open file at absolute path: {absolute_path}")
    # -----------------------    
    
    try:
        with open(file_path, 'r', encoding='utf-8') as f:
            data = json.load(f)
        
        # Determine the structure: either a list of Q&A pairs or a dict with a 'questions' key
        if isinstance(data, dict) and 'questions' in data:
            qa_pairs = data['questions']
        elif isinstance(data, list):
            qa_pairs = data
        else:
            raise ValueError("JSON data structure not recognized (expected list or dict with 'questions').")

        # Extract Questions (Q) and Answers (A)
        questions = [item['question'] for item in qa_pairs]
        answers = [item['answer'] for item in qa_pairs]
        
        # Limit to the first 500 pairs for notebook performance
        questions = questions[:500]
        answers = answers[:500]
        
        if len(questions) != len(answers) or not questions:
            raise ValueError("Data extraction failed or corpus is empty.")

        print(f"Successfully loaded {len(questions)} Q/A pairs from '{file_path}'.")
        return questions, answers

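    except Exception as e:
        print(f"Error loading local file: {e}.")
        # Re-raise so the notebook stops here; returning None would only fail
        # later, when the caller tries to unpack (questions, answers)
        raise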

# Make sure you have downloaded the dataset and that it is located at the following path
JSON_FILE_PATH = "data/Ecommerce_FAQ_Chatbot_dataset.json"
QUESTIONS, ANSWERS = load_data(JSON_FILE_PATH)


# --- 3. VECTORIZATION ---

# Answers will be the document corpus (D) that we search against
DOCUMENT_CORPUS = ANSWERS 

# Initialize and fit vectorizers on the ANSWER corpus
vectorizer_bow = CountVectorizer(tokenizer=spacy_tokenizer, stop_words=None, token_pattern=None)
bow_matrix = vectorizer_bow.fit_transform(DOCUMENT_CORPUS)

vectorizer_tfidf = TfidfVectorizer(tokenizer=spacy_tokenizer, stop_words=None, token_pattern=None)
tfidf_matrix = vectorizer_tfidf.fit_transform(DOCUMENT_CORPUS)

print(f"Corpus Vocabulary Size (TF-IDF): {len(vectorizer_tfidf.get_feature_names_out())}")


# --- 4. PERFORMANCE BENCHMARK FUNCTION ---

def calculate_top_1_accuracy(vectorizer, matrix, queries):
    """
    Calculates Top-1 Accuracy: the percentage of times the model correctly 
    ranks the corresponding answer (at index i) as the top match when querying 
    with the question (at index i).
    """
    correct_matches = 0
    num_queries = len(queries)

    # Iterate over every question (Q_i) and treat it as a query
    for i, query_text in enumerate(queries):
        # 1. Transform the query (Q_i) using the fitted vectorizer
        query_vector = vectorizer.transform([query_text])

        # 2. Calculate similarity against ALL document vectors (Answers)
        similarities = cosine_similarity(query_vector, matrix)[0]
        
        # 3. Find the index of the best match
        best_match_index = np.argmax(similarities)
        
        # 4. Check if the best match index corresponds to the correct answer index (i)
        if best_match_index == i:
            correct_matches += 1

    return correct_matches / num_queries


# --- 5. EXECUTION AND RESULTS ---

print("\n--- Running FAQ Retrieval Benchmark ---")

# Calculate BoW accuracy (Questions against Answers)
accuracy_bow = calculate_top_1_accuracy(vectorizer_bow, bow_matrix, QUESTIONS)

# Calculate TF-IDF accuracy (Questions against Answers)
accuracy_tfidf = calculate_top_1_accuracy(vectorizer_tfidf, tfidf_matrix, QUESTIONS)

print(f"\nTotal Queries Tested: {len(QUESTIONS)}")
print(f"Top-1 Accuracy (BoW):   {accuracy_bow:.4f} (Raw Frequency Weighting)")
print(f"Top-1 Accuracy (TF-IDF): {accuracy_tfidf:.4f} (Rarity-Adjusted Weighting)")

if accuracy_tfidf > accuracy_bow:
    print("\nConclusion: TF-IDF achieved higher Top-1 accuracy. This is expected in real-world IR tasks because TF-IDF successfully identifies and leverages the discriminatory power of rare terms found between the questions and long answers, filtering out common noise.")
elif accuracy_bow > accuracy_tfidf:
    print("\nConclusion: BoW achieved higher Top-1 accuracy. This can happen if the Answers are very long and contain highly repetitive terms that BoW heavily weights, which might be more effective than TF-IDF's penalty on common terms.")
else:
    print("\nConclusion: Both models performed equally, suggesting the core overlapping terms are frequent and distinctive enough for both weighting schemes.")


#### Reflection
You can try changing your pre-processing pipeline to see if you get better results. 