# Praktikum 3: Applications of Word Embeddings

In this Praktikum, we compare different **text representations** in a document retrieval setting — a **FAQ retrieval task**.  
We will work on a slice of the `clips/mfaq` dataset, which contains multilingual question–answer pairs collected from various online domains.

The goal is to explore how different representations influence retrieval quality. In particular, you will:

- Apply **pre-trained word embeddings** to represent text semantically.  
- Build a **FAQ retrieval system** using embeddings in place of TF-IDF.  
- Compare **dense (semantic)** and **sparse (lexical)** representations to understand their respective strengths and limitations.

Throughout the notebook, you will be asked to complete small implementation or reflection tasks that guide you through the analysis.

## 1. Setup and Base Code

### 1.1 Preparing the corpus

We start by downloading the dataset we are going to use in this Praktikum. The `clips/mfaq` dataset is multilingual...

If your notebook has internet access, simply use `download_dataset("clips/mfaq")`.
Otherwise, open a terminal at the root of the repository and run:

```bash
$ python scripts/download_dataset.py clips/mfaq en  data/clips_mfaq_en
```

This saves the dataset in `data/clips_mfaq_en`. You can then load it with `load_from_disk()`, passing the relative path to that folder.

In [None]:
from datasets import load_from_disk
import pandas as pd

dataset_folder = "clips_mfaq_en"

ds = load_from_disk("../../data/" + dataset_folder)
df =  ds["train"].to_pandas()

In [None]:
# You can explore the dataset here

df.head()

#### Preparing Question–Answer Pairs

Each row in the dataset contains a list of **question–answer pairs** for one domain.  
To evaluate different text representations, we first need to **flatten** these lists into two aligned arrays:

- **QUESTIONS** – used as *queries* in our retrieval task  
- **ANSWERS** – used to build the *document representations* (our searchable corpus)

During evaluation, the goal will be to test whether each question retrieves **its corresponding answer** as the top-ranked result.


**Important note.**

We can control the **characteristics of the dataset** by adjusting the parameters of `build_qa_simple()`:

- `min_len` and `max_len` — limit the **length (in characters)** of the question and answer pairs included.  
  This helps us test how **document length** influences the performance of TF-IDF and embeddings.  
- The number of rows from `df["qa_pairs"]` passed to the function determines how many **domains** we include.  
  Each row corresponds to one domain containing many QA pairs.
- `max_pairs` — optionally restricts the total number of QA pairs extracted, useful for faster experimentation.

These parameters let us build smaller, controlled subsets of the dataset to explore how dataset **size** and **text length** affect different retrieval methods.

In [None]:
import numpy as np
import pandas as pd

def build_qa_simple(qa_series, min_len=None, max_len=None, dedup=True, max_pairs=None):
    """
    Flatten a Series of qa_pairs into QUESTIONS/ANSWERS.
    We expect that:
      - each cell is list-like (np.ndarray) of dicts;
      - each dict has 'question' and 'answer' (strings).
    Returns QUESTIONS, ANSWERS, df_pairs.
    """
    rows = []
    for cell in qa_series:
        # accept ndarray; skip anything else
        if isinstance(cell, np.ndarray):
            items = cell.tolist()
        else:
            continue

        for d in items:
            if not isinstance(d, dict):
                continue
            q, a = d.get("question"), d.get("answer")

            # tiny fallback if values are dicts like {'text': ...}
            if isinstance(q, dict): q = q.get("text")
            if isinstance(a, dict): a = a.get("text")

            if not q or not a:
                continue

            # length filters (if given)
            if max_len is not None and (len(q) > max_len or len(a) > max_len):
                continue
            if min_len is not None and (len(q) < min_len or len(a) < min_len):
                continue

            rows.append((q.strip(), a.strip()))

    df_pairs = pd.DataFrame(rows, columns=["question", "answer"])
    if dedup and not df_pairs.empty:
        df_pairs = df_pairs.drop_duplicates(subset=["question", "answer"], keep="first")
    if max_pairs is not None and not df_pairs.empty:
        df_pairs = df_pairs.head(max_pairs)

    QUESTIONS = df_pairs["question"].tolist()
    ANSWERS   = df_pairs["answer"].tolist()
    return QUESTIONS, ANSWERS, df_pairs


# Build from the first 150 domain rows, keeping only short pairs (30-45 characters)
QUESTIONS, ANSWERS, df_pairs = build_qa_simple(df["qa_pairs"].head(150), max_len=45, min_len=30)
print(f"Pairs extracted: {len(QUESTIONS)}")
display(df_pairs.head(3))

DOCUMENT_CORPUS = ANSWERS

### 1.2 Building Representations

#### A. Tokenization and Preprocessing

Before we can build any document representation, we need to **tokenize** and **normalize** the text.  
As in previous labs, we’ll use *spaCy* to split text into tokens, remove stop words and punctuation, and lemmatize words so that related forms (e.g., *run*, *running*, *ran*) map to the same base form.

The function below returns a clean list of tokens that will serve as the input for both **TF-IDF** and **embedding-based** representations.

In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")

def tokenize(text):
    doc = nlp(text)
    return [t.lemma_.lower() for t in doc if not t.is_stop and not t.is_punct and t.is_alpha]
    #return [t.text.lower() for t in doc if not t.is_stop and not t.is_punct and t.is_alpha]

tokenize("How do I change my payment method for the upcoming billing cycle?")

#### B. TF-IDF-based Retrieval

We first build a **TF-IDF** index over the **ANSWERS** (our document corpus).  
Given a query (the **QUESTION**), we compute cosine similarity between the query vector and all answer vectors and return the top-k matches.


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# we use our tokenizer previously defined
vectorizer_tfidf = TfidfVectorizer(tokenizer=tokenize, token_pattern=None)
tfidf_matrix = vectorizer_tfidf.fit_transform(DOCUMENT_CORPUS)

def tfidf_search(query, topk=3):
    q_vec = vectorizer_tfidf.transform([query])
    sims = cosine_similarity(q_vec, tfidf_matrix)[0]
    return (pd.DataFrame({"similarity": sims, "faq": DOCUMENT_CORPUS})
              .sort_values("similarity", ascending=False)
              .head(topk))

# Example
tfidf_search("I can’t remember my password and need to recover access", topk=3)
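
To make the scoring above less of a black box, here is a minimal pure-Python sketch of TF-IDF plus cosine similarity on a toy corpus. It mirrors the smoothed IDF and L2 normalisation that `TfidfVectorizer` applies by default, but it is only an illustration: the toy answers, the pre-tokenized input, and all `toy_*` names are assumptions made for this example, not part of the notebook's pipeline.

```python
import math
from collections import Counter

# Hypothetical, already tokenized answers (illustration only).
toy_docs = [
    ["reset", "password", "account", "settings"],
    ["update", "payment", "method", "billing"],
    ["contact", "support", "email"],
]

def toy_fit_idf(tokenized_docs):
    """Smoothed IDF: idf(t) = ln((1 + n) / (1 + df(t))) + 1."""
    n = len(tokenized_docs)
    doc_freq = Counter(t for doc in tokenized_docs for t in set(doc))
    return {t: math.log((1 + n) / (1 + c)) + 1 for t, c in doc_freq.items()}

def toy_tfidf_vector(tokens, idf_map):
    """Sparse, L2-normalised TF-IDF vector; out-of-vocabulary tokens are dropped."""
    tf = Counter(t for t in tokens if t in idf_map)
    vec = {t: count * idf_map[t] for t, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}

def toy_cosine(u, v):
    """Cosine similarity of two L2-normalised sparse vectors (a dot product)."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

toy_idf = toy_fit_idf(toy_docs)
toy_doc_vecs = [toy_tfidf_vector(d, toy_idf) for d in toy_docs]

def toy_rank(query_tokens):
    """Return document indices sorted by similarity (best first) and the raw scores."""
    q = toy_tfidf_vector(query_tokens, toy_idf)
    sims = [toy_cosine(q, d) for d in toy_doc_vecs]
    order = sorted(range(len(toy_docs)), key=lambda i: sims[i], reverse=True)
    return order, sims

# A query sharing the token "password" with document 0 ranks it first.
print(toy_rank(["forgot", "password"]))

# A paraphrase with no shared tokens scores zero against every document.
print(toy_rank(["recover", "login", "credentials"]))
```

The second query shows the main limitation of a purely lexical representation: without token overlap, every similarity is zero and the ranking carries no information.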

#### C. Embedding-based Retrieval
Next, we use **pre-trained word embeddings** to represent the meaning of words in our texts.  
We’ll use the **GloVe** model trained on Wikipedia and Gigaword (`glove-wiki-gigaword-50`), which maps each word to a 50-dimensional vector.  

The code below, which we also used in our seminar, loads the model from a local folder (if available) or downloads it automatically using **Gensim**.
Refer to the `VL06_Embeddings.ipynb` notebook for information on how to download the model locally from the terminal.

In [None]:
import os

# Set GENSIM_DATA_DIR to the expected local path
GENSIM_DATA_DIR = os.path.abspath("../../models")
os.environ["GENSIM_DATA_DIR"] = GENSIM_DATA_DIR

# Verify whether the model file is present locally
model_name = "glove-wiki-gigaword-50"
model_path = os.path.join(GENSIM_DATA_DIR, model_name)

if os.path.exists(model_path):
    print(f"Model found at: {model_path}. We will load the local model.")
else:
    print(f"Model not found at {model_path}. Attempting to download  `{model_name}`.")

# Loading the model from its files
import gensim.downloader as api

model = api.load(model_name)

**Weighted-Pooled Embeddings & Cosine Search.**  We represent each **answer** with a single vector by:
1) tokenizing the text,  
2) looking up each token’s pre-trained embedding, and  
3) taking a **TF-IDF–weighted average** of those vectors (tokens with higher IDF contribute more).

At query time, we embed the **question** the same way and rank answers by **cosine similarity**.

Note: We’ll later implement the **mean-pooled** variant as a student task and compare both.
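
The difference between the two pooling strategies can be seen on a toy example. The sketch below uses hypothetical 2-dimensional vectors and made-up IDF values (none of them come from GloVe or from our corpus) to show how a high-IDF token pulls the weighted average towards its own direction.

```python
# Hypothetical 2-d "embeddings" and IDF weights, chosen only for illustration.
toy_vectors = {
    "password": [1.0, 0.0],
    "account":  [0.0, 1.0],
}
toy_idf_weights = {"password": 3.0, "account": 1.0}

def weighted_pool(tokens, vectors, weights):
    """Weighted average sum(w_t * v_t) / sum(w_t) over tokens known to both maps."""
    dim = len(next(iter(vectors.values())))
    known = [t for t in tokens if t in vectors and t in weights]
    total = sum(weights[t] for t in known)
    if not known or total == 0:
        return [0.0] * dim  # same fallback idea as the notebook: a zero vector
    return [sum(weights[t] * vectors[t][i] for t in known) / total
            for i in range(dim)]

def mean_pool(tokens, vectors):
    """Plain mean, i.e. a weighted average in which every weight equals 1."""
    return weighted_pool(tokens, vectors, {t: 1.0 for t in vectors})

toy_tokens = ["password", "account", "unknownword"]  # the last token is out of vocabulary

print(mean_pool(toy_tokens, toy_vectors))                          # both tokens count equally
print(weighted_pool(toy_tokens, toy_vectors, toy_idf_weights))     # "password" dominates
```

With equal weights the pooled vector sits halfway between the two tokens. With the IDF weights it moves towards the rarer token, which is the behaviour the weighted variant is meant to provide.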

In [None]:
idf = dict(zip(vectorizer_tfidf.get_feature_names_out(), vectorizer_tfidf.idf_))

def doc_vector_tfidf_weighted(text, model=model, idf=idf):
    toks = tokenize(text)
    vecs, weights = [], []
    for t in toks:
        if t in model.key_to_index and t in idf:
            vecs.append(model[t])
            weights.append(idf[t])
    if not vecs:
        return np.zeros(model.vector_size, dtype=np.float32)
    return np.average(np.vstack(vecs), axis=0, weights=np.asarray(weights, dtype=np.float32))

emb_tfidf_matrix = np.vstack([doc_vector_tfidf_weighted(a) for a in DOCUMENT_CORPUS])

def hybrid_search(query_text, topk=3):
    q_vec = doc_vector_tfidf_weighted(query_text).reshape(1, -1)
    sims = cosine_similarity(q_vec, emb_tfidf_matrix)[0]
    order = np.argsort(sims)[::-1][:topk]
    return pd.DataFrame({
        "similarity": sims[order],
        "faq": [DOCUMENT_CORPUS[i] for i in order]
    })
    
hybrid_search("I can’t remember my password and need to recover access", topk=3)

### 1.3 Evaluating a Retrieval Model

#### A. Top-1 accuracy
We evaluate a search function by **Top-1 accuracy**: for each *question i*, does the system rank its **corresponding answer i** as the top result?

- `search_fn(query, topk)` must return a DataFrame with a `faq` column ordered by similarity (best first).
- We compare returned text to the gold **answer text** (string match with simple normalization).


In [None]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import pandas as pd

def calculate_top_1_accuracy(
    search_fn, queries, answers, verbose=False, normalize_text=True
):
    """
    Benchmarks a retrieval function using Top-1 accuracy.

    - search_fn(query, topk) must return a DataFrame with a 'faq' column (top-ranked first).
    - 'answers' should be the gold answer *texts* aligned with queries.
    """
    def norm(s):
        if not normalize_text or not isinstance(s, str): 
            return s
        return " ".join(s.lower().split())

    correct, total = 0, len(queries)

    for i, q in enumerate(queries):
        results = search_fn(q, topk=1)
        if results is None or len(results) == 0:
            if verbose:
                print(f"\n[WARN] No results for query: {q}")
            continue

        # Top-1 by position, independent of index labels
        top_answer = results.iloc[0]["faq"]

        if norm(top_answer) == norm(answers[i]):
            correct += 1
        elif verbose:
            print(f"\nQuery: {q}\nTop prediction: {top_answer}\nExpected: {answers[i]}")

    return correct / total

calculate_top_1_accuracy(tfidf_search, QUESTIONS, DOCUMENT_CORPUS, verbose=False)

#### B. Top-k Accuracy (Recall@k)
Besides Top-1, many retrieval scenarios care whether the **correct answer appears anywhere in the top-k** results:
- *Human-in-the-loop UIs* (FAQ search, support agents): showing the right answer in the **top 3–5** is often sufficient.
- *Exploratory search / recommendations*: users scan several options, so **Recall@k** is the key metric.

So additionally, we’ll use **Top-k accuracy (Recall@k)**: for each question *i*, check if its gold answer *i* is present in the first *k* retrieved items.

In [None]:
import html  # needed for html.unescape in norm()

def calculate_top_k_accuracy(search_fn, queries, answers, k=3, normalize_text=True):
    def norm(s):
        if not normalize_text or not isinstance(s, str):
            return s
        return " ".join(html.unescape(s).lower().split())

    correct = 0
    for i, q in enumerate(queries):
        df = search_fn(q, topk=k)
        if df is None or len(df) == 0:
            continue
        gold = norm(answers[i])
        preds = [norm(a) for a in df["faq"].tolist()]
        if gold in preds:
            correct += 1
    return correct / len(queries)

calculate_top_k_accuracy(tfidf_search, QUESTIONS, DOCUMENT_CORPUS, k=3)    
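
To see how Top-1 and Recall@k relate, consider the following small sketch. It works on hypothetical ranked lists of document ids instead of answer texts (a simplification compared with the string matching used above), and the rankings are invented for illustration.

```python
def recall_at_k(ranked_ids_per_query, gold_ids, k):
    """Fraction of queries whose gold id appears among the first k ranked ids."""
    hits = sum(gold in ranked[:k]
               for ranked, gold in zip(ranked_ids_per_query, gold_ids))
    return hits / len(gold_ids)

# Hypothetical rankings for four queries; the gold answer of query i is document i.
toy_rankings = [
    [0, 2, 1, 3],  # gold 0 is at rank 1
    [3, 1, 0, 2],  # gold 1 is at rank 2
    [0, 1, 3, 2],  # gold 2 is at rank 4
    [3, 0, 1, 2],  # gold 3 is at rank 1
]
toy_gold = [0, 1, 2, 3]

for k in (1, 2, 4):
    print(f"Recall@{k} = {recall_at_k(toy_rankings, toy_gold, k)}")
```

Top-1 accuracy is simply Recall@1, and Recall@k can only stay the same or grow as *k* increases, because every hit within the first *k* results is also a hit within the first *k+1*.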

## 2. Tasks and exploration

### Task 1. Implement Mean-Pooled Embeddings

Implement `doc_vector_mean(text)` that:
- tokenizes the text,  
- keeps only in-vocabulary tokens,  
- returns the **mean** of their embeddings (zeros if no valid tokens).

Then:
- build `emb_matrix` with your function,  
- implement `embed_search(query, topk=3)` mirroring `hybrid_search`

In [None]:

def doc_vector_mean(text, model=model):
    # TODO: tokenize, keep tokens in model.key_to_index,
    #       return mean of embeddings (or zeros if none)
    return None

# Build matrix for ANSWERS using mean pooling
emb_matrix = np.vstack([doc_vector_mean(a) for a in DOCUMENT_CORPUS])

def embed_search(query, topk=3):
    # TODO: embed query with doc_vector_mean,
    #       cosine vs emb_matrix, return top-k DataFrame like hybrid_search
    return None

# Quick check to see if the output has the proper dimensions
v = doc_vector_mean("short example")
assert v is not None and v.shape == (model.vector_size,)

embed_search("I can’t remember my password and need to recover access", topk=3) 

### Task 2. Analysing Retrieval Performance

Now that you have implemented both retrieval methods (**TF-IDF** and **Embeddings**), let’s explore how their performance changes under different conditions.

#### A. Vary document length
   - Use the parameters `min_len` and `max_len` in `build_qa_simple()` to create **small**, **medium**, and **large** answer sets.  
   - Observe how accuracy changes with length.  

 **Question**: *At what point does the mean-pooled embedding start to get “diluted”?*
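
As a hint about the mechanism (not an answer to the question), the sketch below illustrates dilution with hypothetical 2-dimensional vectors: two documents each contain one distinctive topic word plus a growing number of identical generic words. The vectors and names are assumptions made for this example. Real embeddings behave less cleanly, so you still need to measure the effect on the dataset.

```python
import math

def cosine(u, v):
    """Cosine similarity of two dense vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

# Hypothetical word vectors: two orthogonal topic words and one generic word.
topic_a = [1.0, 0.0]
topic_b = [0.0, 1.0]
generic = [1.0, 1.0]

def toy_similarity(n_generic):
    """Similarity of two mean-pooled documents that differ only in their topic word."""
    doc_a = mean_vector([topic_a] + [generic] * n_generic)
    doc_b = mean_vector([topic_b] + [generic] * n_generic)
    return cosine(doc_a, doc_b)

for n in (0, 1, 4, 20):
    print(n, round(toy_similarity(n), 3))
```

As the number of shared generic words grows, the two pooled vectors become nearly indistinguishable even though the documents are about different topics.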

#### B. Test lemmatization
   - In your `tokenize()` function, try switching between: `t.lemma_.lower()` and `t.text.lower()`
   - Re-run your evaluation

**Question**: Does lemmatization help or hurt? How does it impact the different representations, and why might that be?

#### C. Compare Top-1 and Top-k accuracy
   Use `calculate_top_1_accuracy()` and `calculate_top_k_accuracy()` to compare the performance of TF-IDF and the embedding-based methods.
   
**Question**: *Which model benefits more from allowing multiple results?*


The cell below runs all experiments (TF-IDF, weighted embeddings, and mean-pooled embeddings) for **small**, **medium**, and **large** document lengths in one go. This makes it easier to compare how each method behaves as text length increases.  
*Note:* execution may take 1–5 minutes, depending on the number of datasets and their size.


In [None]:
%%time 
N_DOMAIN_ROWS = 100     
TOPK = 3                          # 1 for Top-1; try 3 or 5 for Recall@k

bands = { 
    "small":  (30,  50),   # (min_len, max_len)
    "medium": (50, 200),
    "large":  (200, None),
}

def run_band(min_len, max_len):
    # 1) Rebuild QA with current length filters
    QUESTIONS, ANSWERS, _ = build_qa_simple(
        df["qa_pairs"].head(N_DOMAIN_ROWS),
        min_len=min_len, max_len=max_len, dedup=True
    )
    if not QUESTIONS:
        return None
    DOCS = ANSWERS

    # 2) Refit TF-IDF on current DOCS + search
    vec = TfidfVectorizer(tokenizer=tokenize, token_pattern=None)
    tfm = vec.fit_transform(DOCS)

    def tfidf_s(q, topk=3):
        sims = cosine_similarity(vec.transform([q]), tfm)[0]
        order = np.argsort(sims)[::-1][:topk]
        return pd.DataFrame({"similarity": sims[order],
                             "faq": [DOCS[i] for i in order]})

    # 3) Weighted embeddings (TF-IDF-weighted)
    idf_local = dict(zip(vec.get_feature_names_out(), vec.idf_))

    def doc_vec_weighted(text):
        toks = [t for t in tokenize(text) if t in model.key_to_index and t in idf_local]
        if not toks:
            return np.zeros(model.vector_size, dtype=np.float32)
        vecs = np.vstack([model[t] for t in toks])
        wts  = np.asarray([idf_local[t] for t in toks], dtype=np.float32)
        return np.average(vecs, axis=0, weights=wts)

    emb_w = np.vstack([doc_vec_weighted(a) for a in DOCS])

    def emb_w_s(q, topk=3):
        qv = doc_vec_weighted(q).reshape(1, -1)
        sims = cosine_similarity(qv, emb_w)[0]
        order = np.argsort(sims)[::-1][:topk]
        return pd.DataFrame({"similarity": sims[order],
                             "faq": [DOCS[i] for i in order]})

    # 4) Mean-pooled embeddings 
    def doc_vec_mean(text):
        toks = [t for t in tokenize(text) if t in model.key_to_index]
        if not toks:
            return np.zeros(model.vector_size, dtype=np.float32)
        return np.mean(np.vstack([model[t] for t in toks]), axis=0)

    emb_mean = np.vstack([doc_vec_mean(a) for a in DOCS])

    def emb_mean_s(q, topk=3):
        qv = doc_vec_mean(q).reshape(1, -1)
        sims = cosine_similarity(qv, emb_mean)[0]
        order = np.argsort(sims)[::-1][:topk]
        return pd.DataFrame({"similarity": sims[order],
                             "faq": [DOCS[i] for i in order]})

    # 5) Metrics
    row = {
        "n_pairs": len(DOCS),
        "TF-IDF@1":        calculate_top_1_accuracy(tfidf_s,   QUESTIONS, DOCS),
        "Emb(TFIDFw)@1":   calculate_top_1_accuracy(emb_w_s,   QUESTIONS, DOCS),
        "Emb(mean)@1":     calculate_top_1_accuracy(emb_mean_s,QUESTIONS, DOCS),
    }
    if TOPK and TOPK > 1:
        row.update({
            f"TF-IDF@{TOPK}":      calculate_top_k_accuracy(tfidf_s,    QUESTIONS, DOCS, k=TOPK),
            f"Emb(TFIDFw)@{TOPK}": calculate_top_k_accuracy(emb_w_s,    QUESTIONS, DOCS, k=TOPK),
            f"Emb(mean)@{TOPK}":   calculate_top_k_accuracy(emb_mean_s, QUESTIONS, DOCS, k=TOPK),
        })
    return row

rows = []
for name, (mn, mx) in bands.items():
    r = run_band(mn, mx)
    if r is not None:
        r["band"] = name
        r["range"] = f"[{mn or 0}, {mx or '∞'}]"
        rows.append(r)

df_len = pd.DataFrame(rows).set_index("band")
display(df_len)

### Reporting your results

Summarize your observations clearly and concisely in **three short paragraphs or tables**, one per experiment:

A. **Effect of document length:**  
   - Report the Top-1 accuracy for *small*, *medium*, and *large* subsets.  
   - Highlight where the **embedding performance starts to drop** (semantic dilution).  
   - You may present this in a small table like:

     | Representation | Small | Medium  | Large  |
     |----------------|-------|---------|--------|
     | TF-IDF         | 0.xxx | 0.xxx   | 0.xxx  |
     | Embeddings     | 0.xxx | 0.xxx   | 0.xxx  |     

B. **Effect of lemmatization:**  
   - Run both versions of `tokenize()` (lemma vs. raw text).
   - Keep the dataset size fixed for this comparison (e.g., medium).
   - Compare TF-IDF and embeddings.  
   - Write 2–3 sentences explaining **why** lemmatization might help or hurt the models.
  
     | Representation | Raw   | Lemma   | 
     |----------------|-------|---------|
     | TF-IDF         | 0.xxx | 0.xxx   |
     | Embeddings     | 0.xxx | 0.xxx   |

C. **Effect of Top-k retrieval:**  
   - Report Top-1 and Top-k (e.g., k=3 or 5) accuracy side-by-side.  
   - Briefly answer: *Which representation benefits more from allowing multiple retrieved answers?*  
   - Example table:

     | Representation | Top-1 | Recall@3 | Recall@5 |
     |----------------|-------|-----------|----------|
     | TF-IDF         | 0.xxx | 0.xxx     | 0.xxx    |
     | Embeddings     | 0.xxx | 0.xxx     | 0.xxx    |


Keep your answers short and interpretive — focus on describing your interpretations, not just repeating raw numbers.