<a href="https://colab.research.google.com/github/khawar-khan520/nlp_project/blob/main/retrieval_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Install and Import Libraries:

In [None]:
!pip install sentence-transformers faiss-cpu hf_xet transformers scikit-learn matplotlib

Load and Chunk Your Document:

In [None]:
from google.colab import files
uploaded = files.upload()


In [None]:

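# Read the uploaded file into a single string (assumes the file is UTF-8 encoded)
with open('winnie_the_pooh.txt', 'r', encoding='utf-8') as file:
    text = file.read()

# Split the text into fixed-size chunks of 200 characters
chunk_size = 200
chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]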
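
Fixed 200-character windows can cut a word or sentence in half at a chunk boundary. One common mitigation is to let neighboring chunks overlap. The following is a minimal sketch of that idea, not part of the original notebook; the function name `chunk_text` and its `size` and `overlap` parameters are assumptions chosen for illustration.

```python
def chunk_text(text, size=200, overlap=50):
    """Split text into windows of `size` characters that share `overlap` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap  # how far the window advances each time
    return [text[i:i + size] for i in range(0, len(text), step)]


# Stand-in text; in the notebook this would be the uploaded document.
sample = "abcdefghij" * 3
print(chunk_text(sample, size=10, overlap=0))  # same behavior as the fixed-size split
print(chunk_text(sample, size=10, overlap=4))  # neighboring chunks share 4 characters
```

With `overlap=0` the helper behaves like the list comprehension above, so it could be swapped in without changing the rest of the pipeline.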

Generate Embeddings with SentenceTransformers:

In [None]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(chunks)

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Take a sample of 10 chunks
sample_embeddings = embeddings[:10]
similarity_matrix = cosine_similarity(sample_embeddings)

# Print the similarity matrix
print(np.round(similarity_matrix, 2))
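
To make the similarity matrix easier to read, it may help to see what each entry represents: the dot product of two vectors divided by the product of their lengths. The sketch below computes the same quantity in plain Python on toy 2-D vectors. The helper names `cosine` and `cosine_matrix` and the toy data are assumptions for illustration, not the scikit-learn implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_matrix(vectors):
    """Pairwise cosine similarities; entry [i][j] compares vectors i and j."""
    return [[cosine(u, v) for v in vectors] for u in vectors]


# Toy vectors standing in for embeddings
toy = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
for row in cosine_matrix(toy):
    print([round(x, 2) for x in row])
```

The diagonal is always 1 because every vector points in the same direction as itself, and the matrix is symmetric because the comparison does not depend on order.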


In [None]:
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
reduced = pca.fit_transform(sample_embeddings)

plt.figure(figsize=(8, 6))
plt.scatter(reduced[:, 0], reduced[:, 1])
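# Label each point with its chunk index
for i in range(len(reduced)):
    plt.annotate(f"Chunk {i}", (reduced[i, 0], reduced[i, 1]))
plt.xlabel("Principal component 1")
plt.ylabel("Principal component 2")
plt.title("PCA of Text Embeddings")
plt.show()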


Store Embeddings in a FAISS Index for Similarity Search:

In [None]:
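import faiss
import numpy as np

# Exact (brute-force) index over the embedding dimension, using L2 distance
embeddings = np.asarray(embeddings, dtype="float32")  # FAISS expects float32
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)

# Embed the query and retrieve the 3 nearest chunks
query = "Who is always sad?"
query_embedding = model.encode([query])
# D holds the distances, I the indices of the matching chunks
D, I = index.search(np.asarray(query_embedding, dtype="float32"), k=3)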

In [None]:
for i in I[0]:
    print(chunks[i])
    print("....")
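
`IndexFlatL2` is an exact index: it compares the query against every stored vector and reports squared Euclidean distances, smallest first. The sketch below reproduces that behavior in plain Python on toy vectors, to show what `D` and `I` contain. The function name `l2_search` and the toy data are assumptions for illustration; FAISS itself uses optimized vectorized code.

```python
def l2_search(vectors, query, k=3):
    """Return (squared distances, indices) of the k vectors closest to query."""
    scored = []
    for idx, vec in enumerate(vectors):
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))  # squared L2 distance
        scored.append((dist, idx))
    scored.sort()  # smallest distance first; ties are broken by index
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


# Toy vectors standing in for chunk embeddings
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [2.0, 2.0]]
distances, indices = l2_search(toy_vectors, [1.0, 1.0], k=2)
print(distances, indices)
```

A smaller distance means a closer match, so the first index returned is the best candidate chunk.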

Build the Prompt from Retrieved Chunks:

In [None]:

retrieved_chunks = [chunks[i] for i in I[0]]

# Format the prompt
context = "\n\n".join(retrieved_chunks)
# `query` is reused from the search cell above

prompt = f"""You are a helpful assistant. Use the following context to answer the question.

Context:
{context}

Question:
{query}

Answer:"""

print(prompt)
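
If the number or size of retrieved chunks grows, the context can become longer than the language model accepts. A minimal sketch of one way to guard against that is shown below: it wraps the prompt construction in a function and stops adding chunks once a character budget is reached. The function name `build_prompt` and the `max_context_chars` parameter are hypothetical and not part of the original notebook.

```python
def build_prompt(retrieved_chunks, query, max_context_chars=1000):
    """Join chunks into a context block, stopping before the character budget is exceeded."""
    kept, used = [], 0
    for chunk in retrieved_chunks:
        if used + len(chunk) > max_context_chars:
            break  # adding this chunk would exceed the budget
        kept.append(chunk.strip())
        used += len(chunk)
    context = "\n\n".join(kept)
    return (
        "You are a helpful assistant. "
        "Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question:\n{query}\n\n"
        "Answer:"
    )


# Stand-in chunks; in the notebook these would come from the FAISS search.
print(build_prompt(["First passage.", "Second passage."], "Who is always sad?"))
```

Because chunks are added in retrieval order, the closest matches are kept first when the budget runs out.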

Generate an Answer Using a Lightweight Language Model:

In [None]:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

# Load a small, instruction-tuned model.
# Use a separate name so the SentenceTransformer in `model` is not overwritten.
model_name = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
generator = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Build the context from the retrieved chunks
retrieved_chunks = [chunks[i] for i in I[0]]
context = "\n\n".join(retrieved_chunks)

# Simple instruction-style prompt for T5
prompt = f"Answer the question based on the context.\n\nContext:\n{context}\n\nQuestion:\n{query}"

# Tokenize the input; truncation=True cuts the prompt at the tokenizer's maximum length
inputs = tokenizer(prompt, return_tensors="pt", truncation=True)

# Generate output without tracking gradients
with torch.no_grad():
    outputs = generator.generate(**inputs, max_new_tokens=100)

# Decode and print
answer = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Answer:", answer)

### Performance Observations

- **Query 1**: "Who is Winnie the Pooh?" The top-k retrieved chunks focus on Pooh's identity and characteristics.
- **Query 2**: "Tell me about Pooh." The retrieved chunks include more **descriptive information** about Pooh.
- **Query 3**: "What is the plot of Winnie the Pooh?" The retrieved chunks contain more of the **story** and overall plot.
- **Query 4**: "Who is the protagonist in the story?" The results resemble those of Query 1, though the chunks could be **more focused on his role in the story**.

The retrieved chunks differ because each query is embedded as a different vector, so FAISS returns a different set of nearest neighbors for each one. Queries that target different aspects of the text therefore pull in different passages.
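
One way to make these comparisons less impressionistic is to measure how much the retrieved index sets overlap between queries. The sketch below uses Jaccard overlap for that. The indices are made-up placeholders, not results from this notebook; in practice each list would be `I[0]` from a separate `index.search` call, and the name `jaccard` is an assumption for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard overlap of two collections of chunk indices (1.0 means identical sets)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Placeholder top-3 indices for illustration only (not real retrieval output)
retrieved = {
    "query A": [0, 1, 2],
    "query B": [1, 2, 3],
    "query C": [7, 8, 9],
}

for (name_1, ids_1), (name_2, ids_2) in combinations(retrieved.items(), 2):
    print(f"{name_1} vs {name_2}: {jaccard(ids_1, ids_2):.2f}")
```

A value near 1 suggests two queries retrieve essentially the same passages, while a value near 0 suggests they target different parts of the text.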
