# Embeddings

This lab is designed to help you solidify your understanding of embeddings by applying them to tasks like semantic similarity, clustering, and building a semantic search system.

### Tasks:
- Task 1: Semantic Similarity Comparison
- Task 2: Document Clustering
- Task 3: Semantic Search System


## Task 1: Semantic Similarity Comparison
### Objective:
Compare semantic similarity between pairs of sentences using cosine similarity and embeddings.

### Steps:
1. Load a pre-trained Sentence Transformer model.
2. Encode the sentence pairs.
3. Compute cosine similarity for each pair.
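
Step 3 relies on cosine similarity, the dot product of two vectors divided by the product of their lengths. The sketch below spells that out in plain Python so the formula is visible. The function name `cosine_similarity_1d` and the toy 3-dimensional vectors are made up for illustration; the notebook itself uses `sklearn.metrics.pairwise.cosine_similarity` on real embeddings.

```python
import math

def cosine_similarity_1d(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy "embeddings" (made up, not produced by any model)
same_direction = cosine_similarity_1d([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity_1d([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(same_direction, orthogonal)
```

Vectors pointing the same way score close to 1 regardless of their length, and perpendicular vectors score 0, which is why the measure is popular for comparing embeddings.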

### Dataset:
- "A dog is playing in the park." vs. "A dog is running in a field."
- "I love pizza." vs. "I enjoy ice cream."
- "What is AI?" vs. "How does a computer learn?"


In [5]:
!pip install sentence-transformers


In [11]:
!pip install tf-keras --user



In [15]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

In [17]:
# Load pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Sentence pairs
sentence_pairs = [
    ("A dog is playing in the park.", "A dog is running in a field."),
    ("I love pizza.", "I enjoy ice cream."),
    ("What is AI?", "How does a computer learn?")
]


In [19]:
# Sentence pair 1
Query_1 = "A dog is playing in the park."
Query_2 = "A dog is running in a field."

# Encode the two queries
Query_1_embedding = model.encode([Query_1])
Query_2_embedding = model.encode([Query_2])

# Compute the cosine similarity between the two embeddings
similarities = cosine_similarity(Query_1_embedding, Query_2_embedding)
print("Similarity Scores:", similarities)

Similarity Scores: [[0.5219753]]


In [21]:
# Sentence pair 2
Query_1 = "I love pizza."
Query_2 = "I enjoy ice cream."

# Encode the two queries
Query_1_embedding = model.encode([Query_1])
Query_2_embedding = model.encode([Query_2])

# Compute the cosine similarity between the two embeddings
similarities = cosine_similarity(Query_1_embedding, Query_2_embedding)
print("Similarity Scores:", similarities)

Similarity Scores: [[0.52806807]]


In [23]:
# Sentence pair 3
Query_1 = "What is AI?"
Query_2 = "How does a computer learn?"

# Encode the two queries
Query_1_embedding = model.encode([Query_1])
Query_2_embedding = model.encode([Query_2])

# Compute the cosine similarity between the two embeddings
similarities = cosine_similarity(Query_1_embedding, Query_2_embedding)
print("Similarity Scores:", similarities)

Similarity Scores: [[0.3194349]]
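
The three cells above repeat the same encode-and-compare steps, so they could be folded into one loop over `sentence_pairs`. The sketch below is one way to do that under some assumptions: `score_pairs` and `cosine` are hypothetical helper names, and `toy_encode` is a stand-in that only counts vowels so the block runs without downloading a model. In the notebook you would pass `model.encode` instead, and the scores would then match the outputs above rather than the toy ones.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(float(a) * float(b) for a, b in zip(u, v))
    norm_u = math.sqrt(sum(float(a) ** 2 for a in u))
    norm_v = math.sqrt(sum(float(b) ** 2 for b in v))
    return dot / (norm_u * norm_v)

def score_pairs(pairs, encode):
    """Return (sentence_a, sentence_b, similarity) for each pair.

    `encode` is any callable mapping a list of strings to a list of vectors;
    in this notebook that would be `model.encode`.
    """
    results = []
    for a, b in pairs:
        emb_a, emb_b = encode([a, b])  # one vector per sentence
        results.append((a, b, cosine(emb_a, emb_b)))
    return results

def toy_encode(sentences):
    """Stand-in encoder (NOT a real embedding model): vowel counts."""
    return [[s.lower().count(ch) for ch in "aeiou"] for s in sentences]

sentence_pairs = [
    ("A dog is playing in the park.", "A dog is running in a field."),
    ("I love pizza.", "I enjoy ice cream."),
    ("What is AI?", "How does a computer learn?"),
]

for a, b, score in score_pairs(sentence_pairs, toy_encode):
    print(f"{score:.3f}  {a!r} vs. {b!r}")
```

Passing the encoder as an argument keeps the loop independent of any particular model, which also makes it easy to compare several models on the same pairs.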


### Questions:
- Which sentence pairs are the most semantically similar? Why?
- Can you think of cases where cosine similarity might fail to capture true semantic meaning?


In [None]:
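# The second pair ("I love pizza." vs. "I enjoy ice cream.") is the most similar (0.528), narrowly ahead of the first pair (0.522), because both sentences are about enjoying food.
# Cosine similarity can miss the true meaning when context matters, and it is only as good as the embeddings it compares.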

## Task 2: Document Clustering
### Objective:
Cluster a set of text documents into similar groups based on their embeddings.

### Steps:
1. Encode the documents using Sentence Transformers.
2. Use KMeans clustering to group the documents.
3. Analyze the clusters for semantic meaning.
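
To make step 2 less of a black box, here is a minimal sketch of the assign-then-update loop behind k-means. It is not scikit-learn's implementation: `kmeans_sketch` is a hypothetical name, the first `k` points are used as starting centroids (a simplification), and the 2-D `toy_points` are made up in place of real sentence embeddings.

```python
def kmeans_sketch(points, k, n_iter=10):
    """Lloyd-style k-means on a list of equal-length vectors.

    Simplification: the first k points seed the centroids.
    """
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid
        for i, p in enumerate(points):
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            labels[i] = dists.index(min(dists))
        # Update step: each centroid moves to the mean of its members
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Two obvious groups: points near (0, 0) and points near (5, 5)
toy_points = [[0.0, 0.1], [5.0, 5.1], [0.2, 0.0], [5.2, 4.9], [0.1, 0.2]]
labels, centroids = kmeans_sketch(toy_points, k=2)
print(labels)
```

With sentence embeddings the points simply have more dimensions (384 for `all-MiniLM-L6-v2`); the two steps stay the same.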

In [27]:
# Documents to cluster
documents = [
    "What is the capital of France?",
    "How do I bake a chocolate cake?",
    "What is the distance between Earth and Mars?",
    "How do I change a flat tire on a car?",
    "What is the best way to learn Python?",
    "How do I fix a leaky faucet?"
]

# Encode documents
X = model.encode(documents)

In [53]:
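# Perform KMeans clustering
from sklearn.cluster import KMeans

# Two clusters; a fixed random_state makes the result reproducible
kmeans = KMeans(n_clusters=2, random_state=2)
kmeans.fit(X)
pred = kmeans.predict(X)  # cluster label for each document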

In [55]:
# Pair each document with its predicted cluster
import pandas as pd
doc_clusters = pd.DataFrame({"Sentence": documents, "Cluster": pred})

In [57]:
doc_clusters

| | Sentence | Cluster |
|---|---|---|
| 0 | What is the capital of France? | 0 |
| 1 | How do I bake a chocolate cake? | 1 |
| 2 | What is the distance between Earth and Mars? | 0 |
| 3 | How do I change a flat tire on a car? | 1 |
| 4 | What is the best way to learn Python? | 1 |
| 5 | How do I fix a leaky faucet? | 1 |


### Questions:
- How many clusters make the most sense? Why?
- Examine the documents in each cluster. Are they semantically meaningful? Can you assign a semantic "theme" to each cluster?
- Try this exercise with a larger dataset of your choice.

In [None]:
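# Two clusters make the most sense, because the documents can be categorised into location questions and home activities.
# Yes: cluster 0 is about places and distances (France, Earth and Mars), while cluster 1 holds mostly how-to questions.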

## Task 3: Semantic Search System
### Objective:
Create a semantic search engine:
A user provides a query and you search the dataset for semantically relevant documents to return. Return the top 5 results.

### Dataset:
- Use the following set of documents:
    - "What is the capital of France?"
    - "How do I bake a chocolate cake?"
    - "What is the distance between Earth and Mars?"
    - "How do I change a flat tire on a car?"
    - "What is the best way to learn Python?"
    - "How do I fix a leaky faucet?"
    - "What are the best travel destinations in Europe?"
    - "How do I set up a local server?"
    - "What is quantum computing?"
    - "How do I build a mobile app?"


In [61]:
import numpy as np

# Documents dataset
documents = [
    "What is the capital of France?",
    "How do I bake a chocolate cake?",
    "What is the distance between Earth and Mars?",
    "How do I change a flat tire on a car?",
    "What is the best way to learn Python?",
    "How do I fix a leaky faucet?",
    "What are the best travel destinations in Europe?",
    "How do I set up a local server?",
    "What is quantum computing?",
    "How do I build a mobile app?"
]

# Compute document embeddings
doc_embeddings = model.encode(documents)

In [68]:
# Create the search function
# Encodes the user query and returns the top_n documents most similar to it.
# Relies on the `model` loaded earlier in the notebook.
def semantic_search(query, documents, doc_embeddings, top_n=5):
    query_embedding = model.encode([query])
    # cosine_similarity returns shape (1, n_docs); flatten() turns it into a 1-D array
    similarities = cosine_similarity(query_embedding, doc_embeddings).flatten()
    # Indices of the highest scores first, truncated to top_n
    top_indices = np.argsort(similarities)[::-1][:top_n]
    return [(documents[i], similarities[i]) for i in top_indices]

In [76]:
# Test the search function
query = "Explain programming languages."
semantic_search(query, documents, doc_embeddings, top_n=6)

[('What is quantum computing?', 0.43524772),
 ('What is the best way to learn Python?', 0.31878263),
 ('How do I build a mobile app?', 0.11044082),
 ('How do I set up a local server?', 0.09112651),
 ('What are the best travel destinations in Europe?', 0.09064777),
 ('How do I fix a leaky faucet?', 0.08145966)]

### Questions:
- What are the top-ranked results for the given queries?
- How can you improve the ranking explanation for users?
- Try this approach with a larger dataset.
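
One possible answer to the second question is to show users a rank, a rounded score, and a plain-language relevance label instead of raw tuples. The sketch below assumes a hypothetical helper `format_results` and an arbitrary cutoff of 0.3 chosen only for illustration; the input rows are the first three results from the test query above.

```python
def format_results(results, min_score=0.3):
    """Turn (document, score) pairs into ranked, human-readable lines.

    `min_score` is an arbitrary cutoff, not a recommended value.
    """
    lines = []
    for rank, (doc, score) in enumerate(results, start=1):
        label = "likely relevant" if score >= min_score else "weak match"
        lines.append(f"{rank}. {doc} (similarity {score:.2f}, {label})")
    return lines

# First three results copied from the test query output
results = [
    ("What is quantum computing?", 0.43524772),
    ("What is the best way to learn Python?", 0.31878263),
    ("How do I build a mobile app?", 0.11044082),
]

for line in format_results(results):
    print(line)
```

In the notebook, the list returned by `semantic_search` could be passed straight to this helper. A sensible cutoff depends on the model and the data, so it is worth tuning on real queries.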

In [None]:
#