## RAG Mini Project
## Milestone #2:
This notebook uses the embedding code to create embeddings from the text chunks that were generated and saved to a pickle file in Milestone #1.

- Create a Python dictionary to serve as a vector database, using each embedding vector as a key (note: convert the embedding to a tuple, since dictionary keys must be hashable) and the corresponding text chunk as the value.
- Experiment with some queries and use cosine similarity to retrieve the most similar text from your vector database.
- If the results are not satisfactory, you may want to refactor your code by:
  1. changing the embedding technique
  2. modifying the chunking technique from Milestone #1.
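
The idea behind the steps above can be sketched without any embedding model. The following is a minimal illustration only, assuming hand-written 3-dimensional vectors in place of real embeddings; the names `cosine`, `toy_db`, and `toy_query` are hypothetical and do not appear in the notebook cells that follow.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

# Toy "vector database": tuple keys (hashable) map to text values.
# These vectors are made up for illustration, not produced by a model.
toy_db = {
    (1.0, 0.0, 0.0): "chunk about agents",
    (0.0, 1.0, 0.0): "chunk about chunking",
    (0.7, 0.7, 0.0): "chunk about both",
}

toy_query = (0.9, 0.1, 0.0)

# The best match is the key whose direction is closest to the query's.
best_key = max(toy_db, key=lambda k: cosine(toy_query, k))
best_toy_text = toy_db[best_key]
best_toy_score = cosine(toy_query, best_key)

print(best_toy_text, round(best_toy_score, 4))
```

Cosine similarity compares direction rather than magnitude, which is why the query `(0.9, 0.1, 0.0)` lands on the first key even though no stored vector equals it.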


In [1]:
import pickle

# Load the chunks from the pickle file
with open("chunks.pkl", "rb") as f:
    chunk_data = pickle.load(f)

# Extract fixed-size and semantic chunks
fixed_chunks = chunk_data.get("fixed_chunks", [])
semantic_chunks = chunk_data.get("semantic_chunks", [])

print(f"Loaded {len(fixed_chunks)} fixed-size chunks and {len(semantic_chunks)} semantic chunks.")


Loaded 5 fixed-size chunks and 5 semantic chunks.


In [2]:
from sentence_transformers import SentenceTransformer

# Load the same embedding model used in Assignment A1
model = SentenceTransformer("all-MiniLM-L6-v2")

# Generate embeddings for fixed-size chunks (one batched call; one row per chunk)
fixed_embeddings = model.encode(fixed_chunks)

# Generate embeddings for semantic chunks
semantic_embeddings = model.encode(semantic_chunks)

print(f"Generated {len(fixed_embeddings)} fixed-size embeddings and {len(semantic_embeddings)} semantic embeddings.")


Generated 5 fixed-size embeddings and 5 semantic embeddings.


In [3]:
# Convert each embedding to a tuple (NumPy arrays are not hashable, so they cannot be dictionary keys) and build the dictionary
vector_db_fixed = {tuple(embedding): text for embedding, text in zip(fixed_embeddings, fixed_chunks)}
vector_db_semantic = {tuple(embedding): text for embedding, text in zip(semantic_embeddings, semantic_chunks)}

print(f"Vector database created with {len(vector_db_fixed)} fixed-size chunks and {len(vector_db_semantic)} semantic chunks.")


Vector database created with 5 fixed-size chunks and 5 semantic chunks.
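
One property of keying a dictionary by embedding is worth checking: two chunks with identical embeddings (for example, duplicated text) share a key, so the later chunk silently overwrites the earlier one. The sketch below illustrates this with made-up 2-dimensional vectors; `toy_embeddings`, `toy_chunks`, and `toy_vector_db` are hypothetical names used only for this example.

```python
# Made-up embeddings; the first and third are identical on purpose.
toy_embeddings = [(0.1, 0.2), (0.3, 0.4), (0.1, 0.2)]
toy_chunks = ["first", "second", "first again"]

# Same construction pattern as the cell above.
toy_vector_db = {emb: text for emb, text in zip(toy_embeddings, toy_chunks)}

# A non-zero count means at least one chunk was overwritten by a duplicate key.
collapsed = len(toy_embeddings) - len(toy_vector_db)
print(f"{collapsed} chunk(s) lost to duplicate keys")
```

In the run above, the reported counts (5 and 5) match the number of chunks loaded, so no entries collapsed.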


In [4]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def retrieve_similar_text(query, vector_db):
    """Return the stored text most similar to the query, with its cosine similarity."""

    # Encode the query into a 2-D array of shape (1, dim), as cosine_similarity expects
    query_embedding = model.encode(query).reshape(1, -1)

    # Stack the tuple keys back into a 2-D array of shape (n_chunks, dim)
    stored_embeddings = np.array(list(vector_db.keys()))

    # Compute cosine similarity between the query and every stored embedding
    similarities = cosine_similarity(query_embedding, stored_embeddings)[0]

    # Dictionaries preserve insertion order, so index i of the keys matches index i of the values
    best_match_idx = int(np.argmax(similarities))
    best_match_text = list(vector_db.values())[best_match_idx]

    return best_match_text, similarities[best_match_idx]

# Example Query
query = "How does Agentic AI work?"
best_text, similarity_score = retrieve_similar_text(query, vector_db_semantic)

print(f"Best Match:\n{best_text}\n\nSimilarity Score: {similarity_score}")


Best Match:
Agentic AI refers to artificial intelligence systems that exhibit autonomous decision-making, adaptability, and goal-directed behavior. Unlike traditional AI, which primarily follows predefined rules or relies on statistical pattern recognition, agentic AI is characterized by its ability to plan, reason, and take initiative in dynamic environments. This type of AI is particularly relevant for applications that require independent problem-solving, such as robotics, autonomous agents, and strategic decision-making systems.

Similarity Score: 0.7948917746543884
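
When judging whether results are satisfactory, it can help to look at several ranked candidates rather than only the top one. The following is a minimal sketch under stated assumptions: `top_k_matches` and `_cosine` are hypothetical helpers, not part of the notebook, and the function takes an already-computed query vector so that it runs without the embedding model. The 2-dimensional vectors are made up for illustration.

```python
import heapq
import math

def _cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / denom if denom else 0.0

def top_k_matches(query_vec, vector_db, k=3):
    """Return up to k (score, text) pairs, highest score first."""
    scored = ((_cosine(query_vec, key), text) for key, text in vector_db.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])

# Made-up database and query vector, for illustration only.
demo_db = {
    (1.0, 0.0): "A",
    (0.0, 1.0): "B",
    (1.0, 1.0): "C",
}
ranked = top_k_matches((1.0, 0.2), demo_db, k=2)

for score, text in ranked:
    print(f"{score:.3f}  {text}")
```

With the notebook's objects, the query vector would presumably come from `model.encode(query)` and the database would be `vector_db_fixed` or `vector_db_semantic`, which makes it straightforward to compare the two chunking strategies on the same query.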


In [5]:
# Save the vector database in a pickle file
with open("vector_db.pkl", "wb") as f:
    pickle.dump({"fixed": vector_db_fixed, "semantic": vector_db_semantic}, f)

print("Vector database saved as vector_db.pkl")


Vector database saved as vector_db.pkl
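
A quick way to confirm that a saved database is usable is to load it back and compare it with what was written. The sketch below demonstrates the round trip on a small made-up database written to a temporary directory, so it does not touch `vector_db.pkl`; `sample_db` and the file name `vector_db_demo.pkl` are hypothetical.

```python
import os
import pickle
import tempfile

# Made-up database with the same two-level layout as the saved file.
sample_db = {
    "fixed": {(0.1, 0.2): "fixed chunk"},
    "semantic": {(0.3, 0.4): "semantic chunk"},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "vector_db_demo.pkl")

    # Write the database, then read it back from disk.
    with open(path, "wb") as f:
        pickle.dump(sample_db, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# Tuple keys and text values should survive the round trip unchanged.
round_trip_ok = restored == sample_db
print("Round trip OK:", round_trip_ok)
```

Because unpickling can execute arbitrary code, `pickle.load` should only be used on files from a trusted source.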
