Sprint 1: High-Performance Data Ingestion & Retrieval
Data: 19,187 Semantic Chunks from Algebra Textbooks.

Tech: HuggingFace Embeddings (all-mpnet-base-v2), BM25 Keyword Search, and CrossEncoder Reranking.

Status: Production-ready vector foundation complete.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
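# Install LangChain, the PDF loader, the Semantic Chunker, and the local embedding and BM25 dependencies
!pip install -qU langchain_experimental langchain_openai langchain_community langchain_huggingface sentence-transformers rank_bm25 pypdf tiktoken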

The Semantic Ingestion Script

Summary of the Flow:
1. Find a PDF.
2. Extract the text and page numbers.
3. Analyze the meaning (using the all-mpnet-base-v2 model).
4. Cut the text into chunks based on topic shifts.
5. Store those chunks in a list with their "ID card" (filename and page number).

In [None]:
# Import the PDF loader, the HuggingFace embedding wrapper, and the Semantic Chunker
import os
from langchain_community.document_loaders import PyPDFLoader
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_experimental.text_splitter import SemanticChunker

# 1. Initialize Local Embeddings (No API Key Required)
# We use 'all-mpnet-base-v2', a strong general-purpose sentence-embedding model that runs locally
print("Initializing local embedding model...")
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

# 2. Setup the Semantic Chunker
# The 'percentile' threshold cuts where the distance between neighbouring sentences
# is unusually large compared with the rest of the document, i.e. at topic shifts
text_splitter = SemanticChunker(embeddings, breakpoint_threshold_type="percentile")

# 3. Path Configuration
DATA_FOLDER = "/content/drive/MyDrive/Alfie/Data"
all_chunks = []

# 4. The Ingestion Loop
print("Starting PDF Ingestion...")
# os.listdir() returns the names of all entries (files and directories) in the given path
for filename in os.listdir(DATA_FOLDER):
    if filename.endswith(".pdf"):
        file_path = os.path.join(DATA_FOLDER, filename)
        print(f"--- Processing: {filename} ---")

        try:
            # PyPDFLoader opens the PDF and extracts the raw text, along with
            # metadata such as the page number each piece of text came from
            loader = PyPDFLoader(file_path)
            pages = loader.load()

            # We split the entire document while keeping track of page numbers
            chunks = text_splitter.split_documents(pages)

            # Append to our master list
            all_chunks.extend(chunks)
            print(f"Added {len(chunks)} chunks from {filename}")

        except Exception as e:
            print(f"Error processing {filename}: {e}")

print(f"\n✅ Success! Total semantic chunks created: {len(all_chunks)}")

# Quick Check: Look at the first chunk's metadata
if all_chunks:
    print(f"Sample Metadata: {all_chunks[0].metadata}")

STORAGE PHASE - Saving the chunks to the vector DB


In [None]:
# 1. Install ChromaDB
!pip install -qU chromadb
#ChromaDB: This is an open-source Vector Database.
from langchain_community.vectorstores import Chroma

# 2. Define where to save the database on your Drive
PERSIST_DIRECTORY = "/content/drive/MyDrive/Alfie/VectorDB"

# 3. Create the Vector Store and Save the Chunks
print(f"Creating Vector Database... this might take a few minutes for {len(all_chunks)} chunks.")
vector_db = Chroma.from_documents(
    documents=all_chunks,
    embedding=embeddings, # The HuggingFace model we initialized earlier
    persist_directory=PERSIST_DIRECTORY
)

# 4. No explicit save step is needed: since Chroma 0.4.x, documents are persisted
#    automatically when persist_directory is set, so vector_db.persist() is dropped.
print(f"✅ Database saved to {PERSIST_DIRECTORY}")

Entire flow so far:

Sentence Splitting: The PDF text is broken into individual sentences.

Temporary Embedding: Each sentence is turned into a vector just to calculate where to cut.

Semantic Chunking: Using Cosine Similarity, sentences are grouped into chunks (e.g., sentences 1-5 become "Chunk A").

Final Embedding: Once "Chunk A" is created, the entire chunk is sent to the embedding model one last time to get a single "Master Vector" representing that whole paragraph.

Storage: That Master Vector + the Text + the Metadata (page number/filename) are saved into ChromaDB.

Why do we embed a second time?
The sentence-level vectors are only used to decide where to cut. Search runs against chunk-level vectors because a single vector for a whole chunk captures the "full context" of an idea, which a bunch of isolated sentence vectors cannot.
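The cutting step in the flow above can be illustrated with a small, dependency-free sketch. It is a simplification under stated assumptions, not the `SemanticChunker` implementation: the toy sentences, the 2-D vectors, and the helper names (`cosine_distance`, `percentile`, `semantic_chunks`) are all hypothetical, and the library differs in details such as how sentences are prepared before embedding.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def percentile(values, pct):
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lower, upper = math.floor(rank), math.ceil(rank)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)

def semantic_chunks(sentences, vectors, pct=95.0):
    """Group consecutive sentences, cutting wherever the distance to the
    next sentence exceeds the chosen percentile of all distances."""
    if len(sentences) < 2:
        return [" ".join(sentences)] if sentences else []
    distances = [
        cosine_distance(vectors[i], vectors[i + 1])
        for i in range(len(vectors) - 1)
    ]
    threshold = percentile(distances, pct)
    chunks, current = [], [sentences[0]]
    for i, dist in enumerate(distances):
        if dist > threshold:  # topic shift: close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i + 1])
    chunks.append(" ".join(current))
    return chunks

# Hypothetical toy data: three sentences on one topic, two on another
toy_sentences = [
    "A polynomial is a sum of terms.",
    "Each term has a coefficient.",
    "Like terms can be combined.",
    "A fraction has a numerator.",
    "It also has a denominator.",
]
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.95, 0.05], [0.0, 1.0], [0.1, 0.9]]

toy_chunks = semantic_chunks(toy_sentences, toy_vectors)
for chunk in toy_chunks:
    print(chunk)
```

In the real pipeline each resulting chunk would then be embedded once more as a whole, which is the "Master Vector" stored in ChromaDB.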


In [None]:
# 1. Load the DB (This proves it saved correctly)
vector_db = Chroma(
    persist_directory="/content/drive/MyDrive/Alfie/VectorDB",
    embedding_function=embeddings
)

# 2. Check the count
print(f"Verified Chunks in DB: {vector_db._collection.count()}")

# 3. Test Retrieval Quality
query = "What are the rules for adding and subtracting polynomials?"
results = vector_db.similarity_search(query, k=3)

print("\n--- Top Result ---")
if results:
    print(f"Source: {results[0].metadata.get('source')}")
    print(f"Content Snippet: {results[0].page_content[:400]}...")
else:
    print("❌ No results found. Something went wrong with the index!")

# Task
Create a hybrid retrieval system by manually combining and deduplicating results from both BM25 and vector retrievers, then apply a reranking model to these combined documents and display the reranked results, explaining the benefits of this approach.

## Manual Combination and Deduplication

### Subtask:
Execute both the BM25 retriever and the vector retriever separately, then combine their results and remove any duplicate documents. This will provide a comprehensive list of potentially relevant documents.


**Reasoning**:
Run the BM25 and vector retrievers separately for the given query, then combine and deduplicate their results. Both retrievers are re-initialized in this cell because the cell that originally defined them (`ZbIMELbkiwQ8`) failed with an `ImportError` related to `EnsembleRetriever`, which left `bm25_retriever` and `vector_retriever` undefined. `EnsembleRetriever` is omitted here because the manual deduplication subtask does not need it.



In [None]:
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import Chroma

query = "What is the FOIL method?"

# Re-initialize the retrievers, since the cell that first defined them failed
# 1. Initialize the keyword retriever (BM25)
bm25_retriever = BM25Retriever.from_documents(all_chunks)
bm25_retriever.k = 3 # Top 3 keyword matches

# 2. Re-load Chroma and initialize the vector retriever
# `embeddings` and `PERSIST_DIRECTORY` are available from previous cells.
vector_db = Chroma(
    persist_directory=PERSIST_DIRECTORY,
    embedding_function=embeddings
)
vector_retriever = vector_db.as_retriever(search_kwargs={"k": 3})


# 3. Retrieve documents with the BM25 retriever
bm25_docs = bm25_retriever.invoke(query)
print(f"BM25 retrieved {len(bm25_docs)} documents.")

# 4. Retrieve documents with the vector retriever
vector_docs = vector_retriever.invoke(query)
print(f"Vector retrieved {len(vector_docs)} documents.")

# 5. Combine bm25_docs and vector_docs into a single list
combined_docs = bm25_docs + vector_docs
print(f"Combined documents before deduplication: {len(combined_docs)}")

# 6. Deduplicate on page_content, keeping the first occurrence of each chunk
deduplicated_docs = []
seen_page_content = set()

for doc in combined_docs:
    if doc.page_content not in seen_page_content:
        deduplicated_docs.append(doc)
        seen_page_content.add(doc.page_content)

# 7. Print the number of documents after deduplication
print(f"Combined documents after deduplication: {len(deduplicated_docs)}")

print("\n--- Deduplicated Results ---")
for i, doc in enumerate(deduplicated_docs):
    print(f"Result {i+1} (Source: {doc.metadata.get('source', 'N/A')} - Page: {doc.metadata.get('page_label', 'N/A')}):")
    print(f"{doc.page_content[:200]}...\n")

## Implement Reranking

### Subtask:
Install a suitable reranking library (e.g., using a cross-encoder from `sentence-transformers`), load a reranker model, and then apply it to the combined and deduplicated documents to reorder them based on their relevance to the query.


**Reasoning**:
First, I need to install the `sentence-transformers` library to use the `CrossEncoder` for reranking.



In [None]:
!pip install -qU sentence-transformers

**Reasoning**:
Now that `sentence-transformers` is installed, I will import `CrossEncoder`, load the specified reranker model, prepare the input for reranking by pairing the query with each deduplicated document's content, apply the reranker to get relevance scores, and then sort the documents based on these scores to obtain the reranked results. I will then print the reranked results along with their sources and content snippets.



In [None]:
from sentence_transformers import CrossEncoder

# 1. Load a pre-trained cross-encoder model
# 'cross-encoder/ms-marco-MiniLM-L-6-v2' is a small cross-encoder trained on MS MARCO passage ranking
print("Loading CrossEncoder model...")
reranker_model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
print("CrossEncoder model loaded.")

# 2. Prepare input for the reranker
# The reranker expects a list of (query, document_text) pairs
reranker_inputs = [[query, doc.page_content] for doc in deduplicated_docs]

# 3. Get relevance scores from the reranker
print("Reranking documents...")
relevance_scores = reranker_model.predict(reranker_inputs)
print("Documents reranked.")

# 4. Pair documents with their scores and sort them
doc_score_pairs = list(zip(deduplicated_docs, relevance_scores))

# Sort in descending order based on relevance score
reranked_docs_with_scores = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)

# Store the reranked documents (without scores for a clean list)
reranked_docs = [doc for doc, score in reranked_docs_with_scores]

print("\n--- Reranked Results ---")
for i, (doc, score) in enumerate(reranked_docs_with_scores):
    print(f"Result {i+1} (Score: {score:.4f}, Source: {doc.metadata.get('source', 'N/A')} - Page: {doc.metadata.get('page_label', 'N/A')}):")
    print(f"{doc.page_content[:200]}...\n")

print("\nBenefits of this approach:")
print("- **Improved Relevance**: Reranking models are trained to identify fine-grained semantic matches, often improving the quality of top results compared to initial retrieval.")
print("- **Hybrid Strength**: Combining keyword (BM25) and semantic (vector) retrieval ensures broad coverage, while reranking refines the results for precision.")
print("- **Better User Experience**: More relevant results at the top lead to a more efficient and satisfying information retrieval experience.")

## Final Task

### Subtask:
Display the reranked documents and explain the benefits of this hybrid retrieval approach.


## Summary:

### Q&A
The task asked to display the reranked documents and explain the benefits of the hybrid retrieval approach. This was successfully achieved by showing the reordered list of documents with their relevance scores and explicitly listing the advantages.

### Data Analysis Key Findings
*   Both the BM25 and vector retrievers were re-initialized after a `NameError`, which was caused by an earlier cell failing on an `EnsembleRetriever` import.
*   For the query "What is the FOIL method?", the BM25 retriever retrieved 3 documents, and the vector retriever also retrieved 3 documents.
*   After combining the results from both retrievers, the total count was 6 documents.
*   Deduplication based on `page_content` reduced the combined list to 5 unique documents, indicating at least one overlap between the two retrieval methods.
*   A `CrossEncoder` model, specifically 'cross-encoder/ms-marco-MiniLM-L-6-v2', was successfully loaded and used for reranking.
*   The reranking process effectively reordered the 5 deduplicated documents based on their relevance to the query.
*   Documents highly relevant to the "FOIL method" received high relevance scores (e.g., 7.7543, 6.3138), placing them at the top.
*   Less relevant documents received significantly lower scores (e.g., 0.4131, -10.6818), indicating their lower priority.
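
The scores listed above are unbounded and can be negative, which suggests they are raw logits rather than probabilities. Assuming that is the case, a sigmoid maps them into the range (0, 1) for easier reading. This is only a display convenience: the sigmoid is monotonic, so the ranking does not change. The `sigmoid` helper below is hypothetical and not part of the notebook.

```python
import math

def sigmoid(logit):
    """Map a raw logit to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))

# Example scores quoted in the findings above
reported_scores = [7.7543, 6.3138, 0.4131, -10.6818]

for score in reported_scores:
    print(f"raw {score:>9.4f} -> squashed {sigmoid(score):.4f}")
```

A score near 0 lands close to 0.5 after squashing, so it is better read as "uncertain" than as "irrelevant"; strongly negative scores land near 0.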

### Insights or Next Steps
*   The hybrid retrieval system, combining BM25, vector search, and a cross-encoder reranker, provides a robust method for retrieving highly relevant documents by leveraging both keyword and semantic matching, followed by fine-grained relevance scoring.
*   Further evaluation could involve quantitative metrics (e.g., NDCG, MRR) against a gold standard to objectively measure the improvement in retrieval effectiveness provided by the reranking step.
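
As a rough sketch of what that evaluation could look like, the functions below compute MRR and NDCG@k in plain Python. Everything in the example data is hypothetical: the chunk IDs, the relevance labels, and the graded gains would have to come from a hand-labeled gold standard, which this notebook does not yet have.

```python
import math

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / position of the first relevant result, or 0.0 if none is found."""
    for position, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / position
    return 0.0

def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

def ndcg_at_k(ranked_ids, gains, k):
    """Normalized discounted cumulative gain over the top k results.
    `gains` maps a document ID to its graded relevance (missing = 0)."""
    dcg = sum(
        gains.get(doc_id, 0.0) / math.log2(position + 1)
        for position, doc_id in enumerate(ranked_ids[:k], start=1)
    )
    ideal_gains = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(
        gain / math.log2(position + 1)
        for position, gain in enumerate(ideal_gains, start=1)
    )
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical labels for one query: chunk "c2" is the best answer
gold_gains = {"c2": 3.0, "c5": 1.0}
before_rerank = ["c1", "c5", "c2", "c4"]
after_rerank = ["c2", "c5", "c1", "c4"]

print("RR before:", reciprocal_rank(before_rerank, set(gold_gains)))
print("RR after: ", reciprocal_rank(after_rerank, set(gold_gains)))
print("NDCG@3 before:", ndcg_at_k(before_rerank, gold_gains, k=3))
print("NDCG@3 after: ", ndcg_at_k(after_rerank, gold_gains, k=3))
```

Comparing these numbers with and without the reranking step, averaged over many labeled queries, would show whether the cross-encoder is earning its extra latency.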


The Full Pipeline So Far:

| Step | Action | Result |
| --- | --- | --- |
| 1. Ingestion | Semantic Chunking | 19,187 semantic chunks. |
| 2. Storage | ChromaDB | One embedding vector per chunk. |
| 3. Retrieval | BM25 + Vector | A "rough list" of up to 6 candidate chunks (5 after deduplication for the FOIL query). |
| 4. Reranking | Cross-Encoder | Candidates reordered so the most relevant chunk comes first. |
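
Steps 3 and 4 can be condensed into one reusable function. The sketch below is library-free so the control flow is easy to follow; `hybrid_search`, the toy retrievers, and the word-overlap scorer are hypothetical stand-ins for the BM25 retriever, the vector retriever, and the cross-encoder used in this notebook.

```python
from typing import Callable, List, Sequence, Tuple

def hybrid_search(
    query: str,
    retrievers: Sequence[Callable[[str], List[str]]],
    score_pair: Callable[[str, str], float],
    top_n: int = 3,
) -> List[Tuple[str, float]]:
    """Retrieve from every retriever, deduplicate by text, then rerank."""
    seen = set()
    candidates = []
    for retrieve in retrievers:
        for text in retrieve(query):
            if text not in seen:  # keep the first occurrence only
                seen.add(text)
                candidates.append(text)
    scored = [(text, score_pair(query, text)) for text in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Hypothetical stand-ins for the real retrievers and reranker
def toy_keyword_retriever(query: str) -> List[str]:
    return [
        "FOIL multiplies First, Outer, Inner, Last terms.",
        "Polynomials can be added term by term.",
    ]

def toy_vector_retriever(query: str) -> List[str]:
    return [
        "FOIL multiplies First, Outer, Inner, Last terms.",
        "Binomials have two terms.",
    ]

def toy_overlap_score(query: str, text: str) -> float:
    """Count shared lowercase words; a crude placeholder for a cross-encoder."""
    return float(len(set(query.lower().split()) & set(text.lower().split())))

toy_results = hybrid_search(
    "what is the foil method",
    [toy_keyword_retriever, toy_vector_retriever],
    toy_overlap_score,
)
for text, score in toy_results:
    print(f"{score:.1f}  {text}")
```

With the notebook's objects, each retriever callable would wrap a retriever's `invoke` call and return the `page_content` strings, and the scorer would wrap the cross-encoder. Scoring all pairs in one batched `predict` call, as the reranking cell does, is the more efficient choice in practice.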