# Embeddings

This lab is designed to help you solidify your understanding of embeddings by applying them to tasks like semantic similarity, clustering, and building a semantic search system.

### Tasks:
- Task 1: Semantic Similarity Comparison
- Task 2: Document Clustering
- Task 3: Semantic Search System


## Task 1: Semantic Similarity Comparison
### Objective:
Compare semantic similarity between pairs of sentences using cosine similarity and embeddings.

### Steps:
1. Load a pre-trained Sentence Transformer model.
2. Encode the sentence pairs.
3. Compute cosine similarity for each pair.
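
Step 3 can be sketched without any model: cosine similarity is the dot product of two vectors divided by the product of their lengths. The snippet below is a minimal NumPy illustration, not the lab's solution. The helper name `cosine` and the 3-dimensional toy vectors are assumptions made for readability; real embeddings from `all-MiniLM-L6-v2` have 384 dimensions.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for sentence embeddings (made-up values)
u = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 4.0, 6.0])   # same direction as u, twice the length
w = np.array([3.0, 0.0, -1.0])  # orthogonal to u: 1*3 + 2*0 + 3*(-1) = 0

print(cosine(u, v))  # same direction, so the score is 1 up to rounding
print(cosine(u, w))  # orthogonal vectors score 0

# After L2-normalization a plain dot product gives the same score
u_hat = u / np.linalg.norm(u)
v_hat = v / np.linalg.norm(v)
print(np.dot(u_hat, v_hat))
```

Because length cancels out, cosine similarity compares direction only, which is why encoding with `normalize_embeddings=True` does not change the scores.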

### Dataset:
- "A dog is playing in the park." vs. "A dog is running in a field."
- "I love pizza." vs. "I enjoy ice cream."
- "What is AI?" vs. "How does a computer learn?"


In [7]:
import os
os.environ["TRANSFORMERS_NO_TF"] = "1"
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["DISABLE_TQDM"] = "1"  

In [8]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

# Load pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Sentence pairs
sentence_pairs = [
    ("A dog is playing in the park.", "A dog is running in a field."),
    ("I love pizza.", "I enjoy ice cream."),
    ("What is AI?", "How does a computer learn?")
]

# Compute similarities into a neat table
rows = []
for s1, s2 in sentence_pairs:
    e1 = model.encode(s1, normalize_embeddings=True)  # no progress bars here
    e2 = model.encode(s2, normalize_embeddings=True)
    sim = float(cosine_similarity([e1], [e2])[0][0])
    rows.append({"sentence_1": s1, "sentence_2": s2, "cosine_similarity": sim})

sim_df = pd.DataFrame(rows).sort_values("cosine_similarity", ascending=False)
print(sim_df.to_string(index=False))

                   sentence_1                   sentence_2  cosine_similarity
                I love pizza.           I enjoy ice cream.           0.528068
A dog is playing in the park. A dog is running in a field.           0.521975
                  What is AI?   How does a computer learn?           0.319435


### Questions:
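- Which sentence pairs are the most semantically similar? Why?
    - “I love pizza.” vs. “I enjoy ice cream.” scores highest (cosine similarity ≈ 0.53), narrowly ahead of the dog pair (≈ 0.52). The two sentences share the same structure and sentiment: “love” and “enjoy” are semantically close, and “pizza” and “ice cream” are both food items.
- Can you think of cases where cosine similarity might fail to capture true semantic meaning?
    - Antonyms, context-dependent meaning, sarcasm, and numeric differences. For example, two sentences that differ only by an antonym can still score high because they share most of their wording and topic.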


## Task 2: Document Clustering
### Objective:
Cluster a set of text documents into similar groups based on their embeddings.

### Steps:
1. Encode the documents using Sentence Transformers.
2. Use KMeans clustering to group the documents.
3. Analyze the clusters for semantic meaning.
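
One detail behind step 2: KMeans groups points by Euclidean distance, while Task 1 compared texts by cosine similarity. For unit-length vectors the two agree, since the squared distance equals 2 minus twice the cosine similarity, so encoding with `normalize_embeddings=True` keeps the clustering consistent with cosine similarity. The check below is a small sketch on random stand-in vectors (the data and variable names are assumptions, not lab outputs).

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins for embeddings (hypothetical data), scaled to unit length
X = rng.normal(size=(5, 8))
X = X / np.linalg.norm(X, axis=1, keepdims=True)

# Pairwise cosine similarity: for unit vectors this is just the dot product
cos = X @ X.T

# Pairwise squared Euclidean distance via broadcasting
sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)

# Identity for unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
print(np.allclose(sq_dist, 2 - 2 * cos))
```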

In [11]:
import os
os.environ["TRANSFORMERS_NO_TF"] = "1"
os.environ["TOKENIZERS_PARALLELISM"] = "false"

try:
    from sentence_transformers import SentenceTransformer
except ImportError:
    import sys, subprocess
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-U", "sentence-transformers"])
    from sentence_transformers import SentenceTransformer

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.metrics import pairwise_distances_argmin_min
from sklearn.decomposition import PCA
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [12]:
# Documents to cluster
documents = [
    "What is the capital of France?",
    "How do I bake a chocolate cake?",
    "What is the distance between Earth and Mars?",
    "How do I change a flat tire on a car?",
    "What is the best way to learn Python?",
    "How do I fix a leaky faucet?"
]

# Encode documents
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(documents, normalize_embeddings=True)  # (n_docs, dim)

#YOUR CODE HERE
cand_ks = range(2, min(5, len(documents)-1)+1)  # 2..5 (bounded by data size)
best_k, best_score = None, -1
for k in cand_ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = km.fit_predict(embeddings)
    # Silhouette needs >1 label; guard just in case
    if len(set(labels)) > 1:
        score = silhouette_score(embeddings, labels)
        if score > best_score:
            best_k, best_score = k, score
if best_k is None:
    best_k = 3

# Final KMeans with best_k
kmeans = KMeans(n_clusters=best_k, n_init=10, random_state=42)
labels = kmeans.fit_predict(embeddings)

# Pack results
df = pd.DataFrame({"doc": documents, "cluster": labels})
print("Cluster assignments:\n", df.sort_values("cluster").to_string(index=False))

# Representative doc per cluster 
closest_idx, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, embeddings)
print("\nRepresentative document per cluster:")
for c, idx in enumerate(closest_idx):
    print(f"  Cluster {c}: {documents[idx]}")

Cluster assignments:
                                          doc  cluster
       How do I change a flat tire on a car?        0
                How do I fix a leaky faucet?        0
              What is the capital of France?        1
             How do I bake a chocolate cake?        1
What is the distance between Earth and Mars?        1
       What is the best way to learn Python?        2

Representative document per cluster:
  Cluster 0: How do I change a flat tire on a car?
  Cluster 1: What is the distance between Earth and Mars?
  Cluster 2: What is the best way to learn Python?


In [15]:
# Perform KMeans clustering

# Choose number of clusters 
k = 3
kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)

# Fit on embeddings and get labels
labels = kmeans.fit_predict(embeddings)

# Show assignments
cluster_df = pd.DataFrame({"doc": documents, "cluster": labels})
print("Cluster assignments:\n", cluster_df.sort_values("cluster").to_string(index=False))

# Show a representative document (closest to centroid) for each cluster
closest_idx, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, embeddings)
print("\nRepresentative document per cluster:")
for c, idx in enumerate(closest_idx):
    print(f"  Cluster {c}: {documents[idx]}")

Cluster assignments:
                                          doc  cluster
       How do I change a flat tire on a car?        0
                How do I fix a leaky faucet?        0
              What is the capital of France?        1
             How do I bake a chocolate cake?        1
What is the distance between Earth and Mars?        1
       What is the best way to learn Python?        2

Representative document per cluster:
  Cluster 0: How do I change a flat tire on a car?
  Cluster 1: What is the distance between Earth and Mars?
  Cluster 2: What is the best way to learn Python?


In [17]:
# Print cluster assignments

for doc, label in zip(documents, labels):
    print(f"[Cluster {label}] {doc}")

[Cluster 1] What is the capital of France?
[Cluster 1] How do I bake a chocolate cake?
[Cluster 1] What is the distance between Earth and Mars?
[Cluster 0] How do I change a flat tire on a car?
[Cluster 2] What is the best way to learn Python?
[Cluster 0] How do I fix a leaky faucet?


### Questions:
- How many clusters make the most sense? Why?
    - Three clusters (k = 3). The silhouette search over k = 2 to 5 scored highest at k = 3, indicating the best balance of cohesion within clusters and separation between them among the candidates. With only six documents, the score is a rough guide rather than strong evidence.
- Examine the documents in each cluster. Are they semantically meaningful? Can you assign a semantic "theme" to each cluster?
    - Yes, for the most part: the clusters are semantically meaningful and easy to label.
    - Cluster 0: Practical how-to tasks / household repairs
    - Cluster 1: General knowledge / informational questions (the cake-baking question is an odd fit here)
    - Cluster 2: Learning & technology
- Try this exercise with a larger dataset of your choice
Repeating this experiment with a larger dataset would make the clustering more robust. This would demonstrate how sentence embeddings capture semantic similarity and how KMeans can uncover latent themes within text data.
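
The silhouette comparison mentioned in the first answer is easier to judge when the score for every candidate k is printed, not just the winner. The sketch below does that on synthetic 2-D blobs rather than on the lab's embeddings, so the numbers are illustrative only. `make_blobs`, the three centers, and the variable names are assumptions for the example; replacing `X` with `embeddings` would apply the same loop to the documents.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic points around three well-separated centers (stand-in for embeddings)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X, _ = make_blobs(n_samples=60, centers=centers, cluster_std=0.5, random_state=0)

# Score each candidate k and keep all scores for inspection
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(f"k={k}: silhouette={scores[k]:.3f}")

# The k with the highest silhouette score
best_k = max(scores, key=scores.get)
print("best k:", best_k)
```

Seeing the full list shows how decisive the choice is: a clear peak supports the chosen k, while nearly equal scores suggest the data does not strongly favor one grouping.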

## Task 3: Semantic Search System
### Objective:
Create a semantic search engine:
A user provides a query and you search the dataset for semantically relevant documents to return. Return the top 5 results.

### Dataset:
- Use the following set of documents:
    - "What is the capital of France?"
    - "How do I bake a chocolate cake?"
    - "What is the distance between Earth and Mars?"
    - "How do I change a flat tire on a car?"
    - "What is the best way to learn Python?"
    - "How do I fix a leaky faucet?"
    - "What are the best travel destinations in Europe?"
    - "How do I set up a local server?"
    - "What is quantum computing?"
    - "How do I build a mobile app?"


In [31]:
import numpy as np

# Documents dataset
documents = [
    "What is the capital of France?",
    "How do I bake a chocolate cake?",
    "What is the distance between Earth and Mars?",
    "How do I change a flat tire on a car?",
    "What is the best way to learn Python?",
    "How do I fix a leaky faucet?",
    "What are the best travel destinations in Europe?",
    "How do I set up a local server?",
    "What is quantum computing?",
    "How do I build a mobile app?"
]

# Compute document embeddings
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device="cpu")

#YOUR CODE HERE
doc_embeddings = model.encode(
    documents,
    normalize_embeddings=True,      # normalize for cosine similarity
    convert_to_numpy=True,
    show_progress_bar=False
)

print("Embeddings shape:", doc_embeddings.shape)

Embeddings shape: (10, 384)


In [32]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def semantic_search(query, documents, doc_embeddings, top_n=5):

    # Encode the query into an embedding
    query_emb = model.encode(
        [query],
        normalize_embeddings=True,
        convert_to_numpy=True,
        show_progress_bar=False
    )

    # Compute cosine similarity between the query and all document embeddings
    sims = cosine_similarity(query_emb, doc_embeddings)[0]

    # Sort documents by similarity score (descending)
    top_indices = np.argsort(sims)[::-1][:top_n]

    # Prepare results
    results = [(documents[i], float(sims[i])) for i in top_indices]

    return results


In [33]:
# Test the search function
query = "Explain programming languages."
semantic_search(query, documents, doc_embeddings)

[('What is quantum computing?', 0.4352477192878723),
 ('What is the best way to learn Python?', 0.31878265738487244),
 ('How do I build a mobile app?', 0.11044073104858398),
 ('How do I set up a local server?', 0.09112647920846939),
 ('What are the best travel destinations in Europe?', 0.09064781665802002)]

### Questions:
- What are the top-ranked results for the given queries? For the query "Explain programming languages.":
    1. What is quantum computing? (0.435)
    2. What is the best way to learn Python? (0.319)
    3. How do I build a mobile app? (0.110)
    4. How do I set up a local server? (0.091)
    5. What are the best travel destinations in Europe? (0.091)
- How can you improve the ranking explanation for users?
    - Display the rounded cosine similarity and a “relevance” label: High (≥0.45), Medium (0.25 to <0.45), Low (<0.25).
    - Extract and show the concepts that the query and the document share.
    - Include a short rationale line and a snippet of the document.
    - Re-rank the top candidates for better quality.
    - Hide items below a similarity threshold to avoid obviously off-topic results.
- Try this approach with a larger dataset
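
For a larger dataset, the same search can be written as one matrix product followed by a partial sort, so that only the top results are ordered. The sketch below uses random unit vectors in place of real embeddings. The function name `top_k_cosine`, the corpus size, and the synthetic query are assumptions for illustration, and the function assumes all vectors are already L2-normalized.

```python
import numpy as np

def top_k_cosine(query_vec, doc_matrix, k=5):
    """Return (indices, scores) of the k rows most similar to query_vec.

    Assumes query_vec and every row of doc_matrix are L2-normalized,
    so the dot product equals cosine similarity.
    """
    scores = doc_matrix @ query_vec
    k = min(k, len(scores))
    # argpartition selects the k largest scores without sorting everything
    top = np.argpartition(-scores, k - 1)[:k]
    # sort only those k candidates, best first
    top = top[np.argsort(-scores[top])]
    return top, scores[top]

rng = np.random.default_rng(0)

# Random unit vectors standing in for 10,000 document embeddings
docs = rng.normal(size=(10_000, 384))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)

# Query: a slightly perturbed copy of document 123, re-normalized
query = docs[123] + 0.01 * rng.normal(size=384)
query /= np.linalg.norm(query)

idx, sims = top_k_cosine(query, docs, k=5)
print(idx)
print(sims)
```

The perturbed document should come back as the first hit, which is a quick sanity check before plugging in real embeddings.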

In [35]:
import pandas as pd

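def search_and_explain(query, top_n=5, min_sim=0.15):
    """Run semantic_search and return a table with a relevance label and a rough rationale."""
    # Keyword heuristic for the "why" column; tuned to this small, tech-heavy corpus
    tech_words = ["python", "server", "app", "computing", "program"]
    results = semantic_search(query, documents, doc_embeddings, top_n=top_n)
    rows = []
    for doc, s in results:
        # Hide results below the similarity threshold
        if s < min_sim:
            continue
        label = "High" if s >= 0.45 else ("Medium" if s >= 0.25 else "Low")
        rows.append({
            "similarity": round(s, 3),
            "relevance": label,
            "document": doc,
            "why": "Shares CS/tech topic" if any(w in doc.lower() for w in tech_words) else "General"
        })
    # Explicit columns keep the frame well-formed even when every result is filtered out
    return pd.DataFrame(rows, columns=["similarity", "relevance", "document", "why"])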

# Example
search_and_explain("Explain programming languages.", top_n=5, min_sim=0.10)


   similarity  relevance                               document                   why
0       0.435     Medium             What is quantum computing?  Shares CS/tech topic
1       0.319     Medium  What is the best way to learn Python?  Shares CS/tech topic
2       0.110        Low           How do I build a mobile app?  Shares CS/tech topic
