# Embeddings

This lab is designed to help you solidify your understanding of embeddings by applying them to tasks like semantic similarity, clustering, and building a semantic search system.

### Tasks:
- Task 1: Semantic Similarity Comparison
- Task 2: Document Clustering
- Task 3: Semantic Search System


## Task 1: Semantic Similarity Comparison
### Objective:
Compare semantic similarity between pairs of sentences using cosine similarity and embeddings.

### Steps:
1. Load a pre-trained Sentence Transformer model.
2. Encode the sentence pairs.
3. Compute cosine similarity for each pair.
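
As a reference for step 3, the cosine similarity of two vectors $a$ and $b$ is $\frac{a \cdot b}{\|a\|\,\|b\|}$. The sketch below computes it in plain Python on made-up 3-dimensional vectors (not real embeddings). The helper name `cosine` is hypothetical and only illustrates the formula.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (hypothetical helper)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors for illustration only; real sentence embeddings have hundreds of dimensions.
same_direction = cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])   # parallel vectors
orthogonal = cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])       # perpendicular vectors
opposite = cosine([1.0, 2.0, 3.0], [-1.0, -2.0, -3.0])      # opposite directions

print(f"{same_direction:.4f} {orthogonal:.4f} {opposite:.4f}")
```

The score depends only on the angle between the vectors, not on their lengths, which is why the two parallel vectors score about 1 even though one is twice as long.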

### Dataset:
- "A dog is playing in the park." vs. "A dog is running in a field."
- "I love pizza." vs. "I enjoy ice cream."
- "What is AI?" vs. "How does a computer learn?"


In [3]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Sentence pairs
sentence_pairs = [
    ("A dog is playing in the park.", "A dog is running in a field."),
    ("I love pizza.", "I enjoy ice cream."),
    ("What is AI?", "How does a computer learn?")
]

# Compute similarities

#YOUR CODE HERE
similarities = []
for sentence1, sentence2 in sentence_pairs:
    # encode() returns NumPy arrays by default, which scikit-learn accepts directly
    embeddings = model.encode([sentence1, sentence2])
    # cosine_similarity expects 2D inputs, so slice to keep each embedding as a (1, dim) row
    similarity = cosine_similarity(embeddings[0:1], embeddings[1:2])
    similarities.append(similarity[0][0])

# Print results
for i, (sentence1, sentence2) in enumerate(sentence_pairs):
    print(f"Similarity between '{sentence1}' and '{sentence2}': {similarities[i]:.4f}")


Similarity between 'A dog is playing in the park.' and 'A dog is running in a field.': 0.5220
Similarity between 'I love pizza.' and 'I enjoy ice cream.': 0.5281
Similarity between 'What is AI?' and 'How does a computer learn?': 0.3194


### Questions:
- Which sentence pairs are the most semantically similar? Why?
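  - The pair "I love pizza." and "I enjoy ice cream." scores highest (0.5281), narrowly ahead of the dog pair (0.5220). Both sentences state a first-person liking for a food, which likely explains the score. The "What is AI?" pair scores lowest (0.3194).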
- Can you think of cases where cosine similarity might fail to capture true semantic meaning?
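  - Yes. Negation is one example: "I love pizza." and "I do not love pizza." share almost all of their words and can score as highly similar despite having opposite meanings. Sentences whose meaning depends heavily on context (sarcasm, idioms, or words with several senses) can also be misjudged, because each sentence is embedded in isolation.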


## Task 2: Document Clustering
### Objective:
Cluster a set of text documents into similar groups based on their embeddings.

### Steps:
1. Encode the documents using Sentence Transformers.
2. Use KMeans clustering to group the documents.
3. Analyze the clusters for semantic meaning.

In [4]:
from sklearn.cluster import KMeans

# Documents to cluster
documents = [
    "What is the capital of France?",
    "How do I bake a chocolate cake?",
    "What is the distance between Earth and Mars?",
    "How do I change a flat tire on a car?",
    "What is the best way to learn Python?",
    "How do I fix a leaky faucet?"
]

# Encode documents

#YOUR CODE HERE
document_embeddings = model.encode(documents, convert_to_tensor=True)


In [5]:
# Perform KMeans clustering

#YOUR CODE HERE
kmeans = KMeans(n_clusters=2, random_state=42)
kmeans.fit(document_embeddings.cpu().numpy())



In [6]:
# Print cluster assignments

#YOUR CODE HERE
for i, doc in enumerate(documents):
    print(f"Document: '{doc}' is in cluster {kmeans.labels_[i]}")

Document: 'What is the capital of France?' is in cluster 0
Document: 'How do I bake a chocolate cake?' is in cluster 0
Document: 'What is the distance between Earth and Mars?' is in cluster 0
Document: 'How do I change a flat tire on a car?' is in cluster 1
Document: 'What is the best way to learn Python?' is in cluster 1
Document: 'How do I fix a leaky faucet?' is in cluster 1
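
To make the clusters easier to inspect, the assignments can be grouped by label. The following is a minimal sketch: `group_by_cluster` is a hypothetical helper, and the labels are copied from the printed output above so the example runs without the model.

```python
from collections import defaultdict

def group_by_cluster(documents, labels):
    """Map each cluster label to the list of documents assigned to it."""
    groups = defaultdict(list)
    for doc, label in zip(documents, labels):
        groups[int(label)].append(doc)
    return dict(groups)

documents = [
    "What is the capital of France?",
    "How do I bake a chocolate cake?",
    "What is the distance between Earth and Mars?",
    "How do I change a flat tire on a car?",
    "What is the best way to learn Python?",
    "How do I fix a leaky faucet?",
]
# Labels copied from the output above; in the notebook, pass kmeans.labels_ instead.
labels = [0, 0, 0, 1, 1, 1]

groups = group_by_cluster(documents, labels)
for label, docs in sorted(groups.items()):
    print(f"Cluster {label}:")
    for doc in docs:
        print(f"  - {doc}")
```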


### Questions:
- How many clusters make the most sense? Why?
- Examine the documents in each cluster. Are they semantically meaningful? Can you assign a semantic "theme" to each cluster?
- Try this exercise with a larger dataset of your choice.
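
One possible way to approach the first question is to compare silhouette scores across several cluster counts. The sketch below is an assumption about how that could be done, not part of the original lab: `silhouette_by_k` is a hypothetical helper, and the data is synthetic (three well-separated blobs) so the example runs without the embedding model.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def silhouette_by_k(embeddings, k_values, seed=42):
    """Return {k: silhouette score} for each candidate cluster count."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, random_state=seed, n_init=10).fit_predict(embeddings)
        scores[k] = silhouette_score(embeddings, labels)
    return scores

# Synthetic stand-in data: three well-separated blobs of 10 points in 8 dimensions.
rng = np.random.default_rng(0)
centers = np.eye(3, 8) * 10.0
toy_embeddings = np.vstack([c + rng.normal(scale=0.5, size=(10, 8)) for c in centers])

scores = silhouette_by_k(toy_embeddings, k_values=range(2, 6))
best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print(f"Best k by silhouette: {best_k}")
```

In the notebook, the same helper could be called as `silhouette_by_k(document_embeddings.cpu().numpy(), range(2, 6))`. The silhouette score needs at least 2 clusters and fewer clusters than samples, so six documents allow k from 2 to 5, and scores from so few points are best read alongside the cluster contents.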

## Task 3: Semantic Search System
### Objective:
Create a semantic search engine: given a user query, search the dataset for the most semantically relevant documents and return the top 5 results.

### Dataset:
- Use the following set of documents:
    - "What is the capital of France?"
    - "How do I bake a chocolate cake?"
    - "What is the distance between Earth and Mars?"
    - "How do I change a flat tire on a car?"
    - "What is the best way to learn Python?"
    - "How do I fix a leaky faucet?"
    - "What are the best travel destinations in Europe?"
    - "How do I set up a local server?"
    - "What is quantum computing?"
    - "How do I build a mobile app?"


In [10]:
import numpy as np

# Documents dataset
documents = [
    "What is the capital of France?",
    "How do I bake a chocolate cake?",
    "What is the distance between Earth and Mars?",
    "How do I change a flat tire on a car?",
    "What is the best way to learn Python?",
    "How do I fix a leaky faucet?",
    "What are the best travel destinations in Europe?",
    "How do I set up a local server?",
    "What is quantum computing?",
    "How do I build a mobile app?"
]

# Compute document embeddings

#YOUR CODE HERE
doc_embeddings = model.encode(documents, convert_to_tensor=True)

In [11]:
# Create the search function
# This function encodes the user query and returns the top_n documents most similar to it
def semantic_search(query, documents, doc_embeddings, top_n=5):
    # YOUR CODE HERE
    query_embedding = model.encode(query, convert_to_tensor=True)
    similarities = cosine_similarity(query_embedding.cpu().unsqueeze(0), doc_embeddings.cpu())
    top_indices = np.argsort(similarities[0])[::-1][:top_n]
    return [documents[i] for i in top_indices]

In [12]:
# Test the search function
query = "Explain programming languages."
semantic_search(query, documents, doc_embeddings)

['What is quantum computing?',
 'What is the best way to learn Python?',
 'How do I build a mobile app?',
 'How do I set up a local server?',
 'What are the best travel destinations in Europe?']

### Questions:
- What are the top-ranked results for the given queries?
  - For the query "Explain programming languages.", the top 5 results in ranked order are:
    - "What is quantum computing?"
    - "What is the best way to learn Python?"
    - "How do I build a mobile app?"
    - "How do I set up a local server?"
    - "What are the best travel destinations in Europe?"
- How can you improve the ranking explanation for users?
  - Display the similarity score next to each result so users can see how closely it matches the query.
- Try this approach with a larger dataset.
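
Showing scores next to results could look roughly like the sketch below. `rank_with_scores` is a hypothetical helper, and the documents and scores are made-up placeholders, not model output.

```python
def rank_with_scores(documents, scores, top_n=5):
    """Return the top_n (document, score) pairs, highest score first."""
    ranked = sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n]

# Placeholder documents and made-up scores for illustration only (not model output).
docs = ["doc A", "doc B", "doc C"]
scores = [0.12, 0.57, 0.33]

results = rank_with_scores(docs, scores, top_n=2)
for rank, (doc, score) in enumerate(results, start=1):
    print(f"{rank}. {doc} (similarity: {score:.2f})")
```

Inside `semantic_search`, the row `similarities[0]` could be passed as `scores`, so the function returns each document together with its cosine similarity to the query.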