## Project Overview — Winnie-the-Pooh RAG

This project studies how retrieval quality and chunking strategy affect the accuracy of a RAG system for factual question answering. The full text of Winnie-the-Pooh is used as the knowledge base, while the generation model, embeddings, and prompt are kept fixed across all experiments.

Only the retrieval pipeline is varied: different chunk sizes, separators, similarity search, re-ranking, and MMR are compared under identical conditions. Performance is evaluated on a curated QA set using semantic answer matching.

## Imports

In [None]:
# !pip install numpy==1.26.4 pandas scikit-learn datasets sentence-transformers faiss-cpu langchain langchain-community langchain-huggingface transformers spacy nltk

In [None]:
# !python -m spacy download en_core_web_sm


In [73]:
import warnings
warnings.filterwarnings("ignore")
warnings.simplefilter(action='ignore', category=FutureWarning)

import requests
import string
import numpy as np
import spacy
import nltk
import re
from typing import List, Tuple
from sentence_transformers import SentenceTransformer, util
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.prompts import PromptTemplate
from langchain_huggingface import HuggingFacePipeline

nltk.download("punkt")
nlp = spacy.load("en_core_web_sm")


[nltk_data] Downloading package punkt to /home/lisa_polo/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Generation Model

FLAN-T5-Large is an instruction-tuned encoder–decoder model that is well suited to factual text-to-text tasks such as question answering.

In [100]:
model_name = "google/flan-t5-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

text_generation_pipeline = pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text2text-generation",
    do_sample=False,  # greedy decoding; temperature has no effect when sampling is off
    max_new_tokens=64
)

t5_llm = HuggingFacePipeline(pipeline=text_generation_pipeline)


Device set to use cuda:0


## Load and Clean Book

In [101]:
url = "https://www.gutenberg.org/files/67098/67098-0.txt"
response = requests.get(url)
text = response.text

start = text.find("Here is Edward Bear")
end = text.find("THE END")
book_text = text[start:end].strip()

print("Book length:", len(book_text))


Book length: 123196
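
`str.find` returns `-1` when a marker is missing, and a slice that ends at `-1` silently keeps almost the whole file. The last chunk printed further down in this notebook contains the Project Gutenberg trailer, which hints that the end marker may not be matching as intended. Below is a minimal sketch of a stricter slice. `slice_between` is a hypothetical helper, not part of the original notebook, and the sample string is made up for illustration.

```python
def slice_between(text: str, start_marker: str, end_marker: str) -> str:
    """Return the text from start_marker up to (not including) end_marker.

    Raises ValueError instead of silently slicing with -1 when a marker is absent.
    """
    start = text.find(start_marker)
    if start == -1:
        raise ValueError(f"start marker not found: {start_marker!r}")
    # Look for the end marker only after the start marker.
    end = text.find(end_marker, start)
    if end == -1:
        raise ValueError(f"end marker not found: {end_marker!r}")
    return text[start:end].strip()


# Made-up stand-in for the downloaded file.
sample = "LICENSE HEADER\nHere is Edward Bear, coming downstairs.\nTHE END\nLICENSE FOOTER"
print(slice_between(sample, "Here is Edward Bear", "THE END"))
```

Failing loudly here makes it obvious when the knowledge base would otherwise include license text.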


## Prompt

In [102]:
prompt_template = """Answer the question using ONLY the information in the context below.

CRITICAL RULES:
1. Use the EXACT words and phrasing from the context whenever possible
2. Preserve the original grammar, tense, and perspective from the source text
3. If the context says "he did X", use "he did X" (not "He does X" or "X was done")
4. If the context says "because long words bothered him", use "because long words bothered him" (not "because long words bother him")
5. Keep pronouns, verb tenses, and narrative voice exactly as written in the context
6. Answer directly and concisely - quote or paraphrase minimally from context
7. For "why" questions, include "because" and use the exact reasoning from the text
8. If you cannot find the answer in the context, respond with exactly: NOT FOUND

Context:
{context}

Question: {question}

Answer (use exact wording from context):"""

prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=prompt_template
)
llm_chain = prompt | t5_llm


## Test Set

In [103]:
test_set = [
    {
      "question": "What is Winnie-the-Pooh’s original name?",
      "answer": "Edward Bear"
    },
    {
      "question": "What name does Pooh live under?",
      "answer": "Sanders"
    },
    {
      "question": "What color balloon does Pooh choose to look like a small black cloud?",
      "answer": "Blue"
    },
    {
      "question": "What is the first thing Pooh says when he wakes up in the morning?",
      "answer": "What’s for breakfast?"
    },
    {
      "question": "What happens when Pooh eats too much at Rabbit’s house?",
      "answer": "He gets stuck in Rabbit’s doorway"
    },
    {
      "question": "How long must Pooh wait to be unstuck from Rabbit's house?",
      "answer": "About a week"
    },
    {
      "question": "What comforting book does Christopher Robin read to Pooh while he's stuck?",
      "answer": "A Sustaining Book for a Wedged Bear in Great Tightness"
    },
    {
      "question": "What does Piglet’s sign 'Trespassers W' actually stand for?",
      "answer": "Trespassers William, his grandfather"
    },
    {
      "question": "What creature do Pooh and Piglet think they are tracking in the snow?",
      "answer": "A Woozle"
    },
    {
      "question": "What does Eeyore lose in Chapter IV?",
      "answer": "His tail"
    },
    {
      "question": "Who finds Eeyore’s missing tail?",
      "answer": "Winnie-the-Pooh"
    },
    {
      "question": "Who is the new animal that arrives with Baby Roo?",
      "answer": "Kanga"
    },
    {
      "question": "What does Owl write on the pot for Eeyore’s birthday?",
      "answer": "HIPY PAPY BTHUTHDTH THUTHDA BTHUTHDY"
    },
    {
      "question": "What gift does Piglet bring Eeyore?",
      "answer": "A balloon"
    },
    {
      "question": "What happens to Piglet’s balloon?",
      "answer": "It bursts"
    },
    {
      "question": "What does Eeyore do with the burst balloon and the pot?",
      "answer": "He puts the balloon in and out of the pot"
    },
    {
      "question": "What is the name of the rescue boat Christopher Robin and Pooh use?",
      "answer": "The Brain of Pooh"
    },
    {
      "question": "Why does Christopher Robin throw a party at the end?",
      "answer": "To celebrate Pooh’s bravery in rescuing Piglet during the flood"
    },
    {
      "question": "Why does Piglet agree to use honey in the Heffalump trap instead of haycorns?",
      "answer": "Because if they used honey, Pooh would provide it, not Piglet."
    },
    {
      "question": "Why does Pooh say he is a Bear of Very Little Brain?",
      "answer": "Because long words and riddles confuse him."
    },
    {
      "question": "What does Pooh suggest Christopher Robin do to trick the bees?",
      "answer": "Walk with an umbrella saying, 'Tut-tut, it looks like rain.'"
    },
    {
      "question": "Why does Rabbit want to use Pooh’s legs as a towel-horse?",
      "answer": "Because Pooh is stuck in his doorway and not going anywhere."
    },
    {
      "question": "What leads Pooh and Piglet to believe they are being followed by multiple Woozles?",
      "answer": "They mistake their own tracks for those of Woozles."
    },
    {
      "question": "Why is the boat used to rescue Piglet called 'The Brain of Pooh'?",
      "answer": "Because it was Pooh’s idea to use an umbrella as a boat."
    },
    {
      "question": "Was Piglet ever actually bathed by Kanga?",
      "answer": "Yes, he was bathed by Kanga"
    }
]

## Baseline Chunking

In [104]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=320,
    chunk_overlap=120
)

sections = text_splitter.split_text(book_text)
print("Total chunks:", len(sections))


Total chunks: 594


In [105]:
sections[0], sections[-1], sections[150]

('Here is Edward Bear, coming downstairs now, bump, bump, bump, on the\nback of his head, behind Christopher Robin. It is, as far as he knows,\nthe only way of coming downstairs, but sometimes he feels that there\nreally is another way, if only he could stop bumping for a moment and',
 '[Transcriber\'s Note: Near the end of Chapter VI, the reference to\nKanga was modified to read "...and every Tuesday Kanga spent the day\nwith her great friend Pooh ..."]\n\n\n\n*** END OF THE PROJECT GUTENBERG EBOOK 67098 ***',
 'Winnie-the-Pooh read the two notices very carefully, first from left to\nright, and afterwards, in case he had missed some of it, from right to\nleft. Then, to make quite sure, he knocked and pulled the knocker, and\nhe pulled and knocked the bell-rope, and he called out in a very loud')
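
The effect of `chunk_size` and `chunk_overlap` is easiest to see on a toy string. The sketch below is a deliberately simplified fixed-width splitter, not the algorithm `RecursiveCharacterTextSplitter` uses (which first splits on separators and then merges pieces up to the size limit). `sliding_chunks` is a hypothetical name introduced only for this illustration.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-width windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

Purely size-driven cuts ignore sentence boundaries, which matches the first chunk shown above ending mid-sentence.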

## Embeddings

E5-Large-v2 is optimized for semantic retrieval and question–passage matching, and it produces dense vectors that work well for similarity search.

In [106]:
embedding_model_name = "intfloat/e5-large-v2"
semantic_model = SentenceTransformer(embedding_model_name)

## Initial Retriever

In [107]:
db = FAISS.from_texts(
    sections,
    HuggingFaceEmbeddings(model_name=embedding_model_name)
)

retriever_similarity = db.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 3},
    )

## Base RAG
Let's see the plain RAG results.

In [108]:
def truncate_context(context, tokenizer, max_tokens=480):
    tokens = tokenizer.encode(context, truncation=True, max_length=max_tokens)
    return tokenizer.decode(tokens, skip_special_tokens=True)

def rag_answer(question, retriever):
    docs = retriever.invoke(question)
    context = "\n\n".join(doc.page_content for doc in docs)
    context = truncate_context(context, tokenizer, max_tokens=480)

    return llm_chain.invoke({"context": context, "question": question}).strip()


In [109]:
def semantic_match(pred, true, threshold=0.8):
    e1 = semantic_model.encode(pred, convert_to_tensor=True)
    e2 = semantic_model.encode(true, convert_to_tensor=True)
    score = util.pytorch_cos_sim(e1, e2).item()
    return score >= threshold, score

def evaluate(test_set, retriever, threshold=0.8, rag_answer=rag_answer):
    semantic_correct = 0

    for qa in test_set:
        pred = rag_answer(qa["question"], retriever=retriever)
        true = qa["answer"]

        semantic, score = semantic_match(pred, true, threshold)

        print("Q:", qa["question"])
        print("Expected:", true)
        print("Predicted:", pred)
        print(f"Semantic: {semantic} {round(score,2)}")
        print("-"*60)

        if semantic:
            semantic_correct += 1

    semantic_acc = semantic_correct / len(test_set)

    print("Semantic Accuracy:", round(semantic_acc * 100, 2), "%")
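
`semantic_match` boils down to a cosine similarity compared against a threshold. A pure-Python sketch of that decision follows. The 3-dimensional vectors are made up for illustration (real E5 embeddings have far more dimensions), and `cosine_similarity` and `is_match` are hypothetical names.

```python
import math


def cosine_similarity(a, b):
    """cos(a, b) = (a . b) / (|a| * |b|)"""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def is_match(pred_vec, true_vec, threshold=0.8):
    """Return (matched, score), mirroring the shape of semantic_match."""
    score = cosine_similarity(pred_vec, true_vec)
    return score >= threshold, score


# Illustrative vectors only; they do not come from any embedding model.
pred = [0.9, 0.1, 0.4]
true = [1.0, 0.0, 0.5]
matched, score = is_match(pred, true)
print(matched, round(score, 2))
```

In the baseline run that follows, `NOT FOUND` scores 0.79 against `Blue`, only just below the 0.8 threshold. Scores from this embedding model appear to sit in a narrow high band, so it may be worth checking a few known-wrong answers before trusting a single cutoff.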

In [84]:
evaluate(test_set, retriever_similarity, 0.8)

Q: What is Winnie-the-Pooh’s original name?
Expected: Edward Bear
Predicted: Edward Bear
Semantic: True 1.0
------------------------------------------------------------
Q: What name does Pooh live under?
Expected: Sanders
Predicted: Sanders
Semantic: True 1.0
------------------------------------------------------------
Q: What color balloon does Pooh choose to look like a small black cloud?
Expected: Blue
Predicted: NOT FOUND
Semantic: False 0.79
------------------------------------------------------------
Q: What is the first thing Pooh says when he wakes up in the morning?
Expected: What’s for breakfast?
Predicted: What's for breakfast?
Semantic: True 0.98
------------------------------------------------------------
Q: What happens when Pooh eats too much at Rabbit’s house?
Expected: He gets stuck in Rabbit’s doorway
Predicted: Rabbit sternly tells Pooh to get thin again.
Semantic: False 0.79
------------------------------------------------------------
Q: How long must Pooh wait to b

The initial run gets many answers right, but it frequently returns NOT FOUND when the relevant chunk is not retrieved, and it probably receives incomplete context for longer explanatory questions. This suggests that generation is working, but retrieval coverage and chunk alignment are still weak. The next step is to introduce re-ranking to improve how the most relevant chunks are selected before generation.

### RAG with Re-ranking

In [85]:
def rag_answer_reranking(question, retriever):
    # Retrieve candidates, then re-score each one against the question
    # with the same E5 model that was used to build the index.
    docs = retriever.invoke(question)
    q_emb = semantic_model.encode(question, convert_to_tensor=True)
    scored = []
    for doc in docs:
        d_emb = semantic_model.encode(doc.page_content, convert_to_tensor=True)
        score = util.pytorch_cos_sim(q_emb, d_emb).item()
        scored.append((score, doc.page_content))

    # Keep only the two highest-scoring chunks as context.
    scored.sort(reverse=True)
    context = "\n\n".join(text for _, text in scored[:2])
    context = truncate_context(context, tokenizer, max_tokens=480)
    return llm_chain.invoke({"context": context, "question": question}).strip()
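
One thing worth checking about this re-ranker is that it scores chunks with the same embedding model that built the index. For unit-length vectors, squared Euclidean distance equals 2 − 2·cos, so ranking by distance and ranking by cosine similarity give the same order. The sketch below demonstrates this on made-up 2-D vectors. It assumes the index ranks by Euclidean distance over normalized embeddings, which is an assumption about the configuration rather than something verified here.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine(a, b):
    # Inputs are already unit length, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))


# Illustrative vectors only.
query = normalize([1.0, 1.0])
chunks = {
    "chunk_a": normalize([1.0, 0.9]),
    "chunk_b": normalize([0.2, 1.0]),
    "chunk_c": normalize([1.0, 0.5]),
}

by_l2 = sorted(chunks, key=lambda name: l2_distance(query, chunks[name]))
by_cosine = sorted(chunks, key=lambda name: cosine(query, chunks[name]), reverse=True)
print(by_l2)
print(by_cosine)
```

Under those assumptions the re-ranking step cannot reorder the candidates. It only trims the top 3 down to the top 2, which would be consistent with a very small change in accuracy.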


In [87]:
evaluate(test_set, retriever_similarity, rag_answer=rag_answer_reranking)


Q: What is Winnie-the-Pooh’s original name?
Expected: Edward Bear
Predicted: Edward Bear
Semantic: True 1.0
------------------------------------------------------------
Q: What name does Pooh live under?
Expected: Sanders
Predicted: Sanders
Semantic: True 1.0
------------------------------------------------------------
Q: What color balloon does Pooh choose to look like a small black cloud?
Expected: Blue
Predicted: not like a small black cloud in a blue sky
Semantic: False 0.78
------------------------------------------------------------
Q: What is the first thing Pooh says when he wakes up in the morning?
Expected: What’s for breakfast?
Predicted: What's for breakfast?
Semantic: True 0.98
------------------------------------------------------------
Q: What happens when Pooh eats too much at Rabbit’s house?
Expected: He gets stuck in Rabbit’s doorway
Predicted: Christopher Robin nodded. "Then there's only one thing to be done," he said. "We shall have to wait for you to get thin again

Re-ranking gave only a very minor improvement in overall accuracy, and most retrieval failures and NOT FOUND cases remain. We don't attribute this to the re-ranking step itself; the more likely issue is how the text is chunked and indexed.
Let's change the chunking strategy by switching to a separator-based text splitter, so that chunks align better with sentence and paragraph boundaries.

### Separator-Based Splitter

In [90]:
text_splitter_separators = RecursiveCharacterTextSplitter(
    separators=["\n\n", ". ", "! ", "? "],
    chunk_size=450,
    chunk_overlap=80
)
sections_separators = text_splitter_separators.split_text(book_text)
print("Total chunks:", len(sections_separators))

db_separators = FAISS.from_texts(
    sections_separators,
    HuggingFaceEmbeddings(model_name=embedding_model_name)
)

retriever_similarity_separators = db_separators.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 3},
    )
evaluate(test_set, retriever_similarity_separators)

Total chunks: 368


In [92]:
evaluate(test_set, retriever_similarity_separators)

Q: What is Winnie-the-Pooh’s original name?
Expected: Edward Bear
Predicted: Edward Bear
Semantic: True 1.0
------------------------------------------------------------
Q: What name does Pooh live under?
Expected: Sanders
Predicted: Sanders
Semantic: True 1.0
------------------------------------------------------------
Q: What color balloon does Pooh choose to look like a small black cloud?
Expected: Blue
Predicted: blue
Semantic: True 1.0
------------------------------------------------------------
Q: What is the first thing Pooh says when he wakes up in the morning?
Expected: What’s for breakfast?
Predicted: What's for breakfast?
Semantic: True 0.98
------------------------------------------------------------
Q: What happens when Pooh eats too much at Rabbit’s house?
Expected: He gets stuck in Rabbit’s doorway
Predicted: Rabbit uses Pooh’s back legs as a towel-horse
Semantic: True 0.82
------------------------------------------------------------
Q: How long must Pooh wait to be unstu

Switching to separator-based chunking gave a large improvement in accuracy. Most previous NOT FOUND cases disappear, and multi-sentence factual answers become consistently retrievable. This indicates that chunk boundaries aligned with sentence and paragraph structure matter a great deal for RAG performance. The next question is whether retrieving diverse context, instead of just the most similar chunks, can further improve answers. Let's use MMR, as it can capture complementary information spread across multiple parts of the text.

In [93]:
retriever_mmr = db_separators.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 5, "fetch_k": 10, "lambda_mult": 0.7}
)
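
To make the parameters concrete: `fetch_k` is the size of the candidate pool, `k` is how many chunks are finally kept, and `lambda_mult` trades relevance against diversity (1.0 is pure relevance, lower values favour novelty). Below is a minimal greedy MMR sketch on made-up 2-D vectors. `mmr_select` is a hypothetical helper and a simplification, not LangChain's implementation.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr_select(query, docs, k, lambda_mult):
    """Greedily pick k doc indices, balancing relevance to the query
    against similarity to documents that were already picked."""
    selected = []
    candidates = list(range(len(docs)))
    while candidates and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for i in candidates:
            relevance = cosine(query, docs[i])
            # Redundancy is the highest similarity to any already-selected doc.
            redundancy = max((cosine(docs[i], docs[j]) for j in selected), default=0.0)
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected


# Illustrative vectors: docs 0 and 1 are near-duplicates, doc 2 points elsewhere.
query = [1.0, 1.0]
docs = [[1.0, 0.9], [1.0, 0.8], [0.2, 1.0]]
print(mmr_select(query, docs, k=2, lambda_mult=0.5))  # [0, 2]
print(mmr_select(query, docs, k=2, lambda_mult=1.0))  # [0, 1]
```

With `lambda_mult=0.5` the near-duplicate is skipped in favour of a less similar chunk, which is useful when the answer is spread out but can push a relevant chunk out of the context when the answer sits in one place.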


In [94]:
evaluate(test_set, retriever_mmr)

Q: What is Winnie-the-Pooh’s original name?
Expected: Edward Bear
Predicted: Sanders
Semantic: False 0.73
------------------------------------------------------------
Q: What name does Pooh live under?
Expected: Sanders
Predicted: he had the name over the door in gold letters, and lived under it.
Semantic: False 0.7
------------------------------------------------------------
Q: What color balloon does Pooh choose to look like a small black cloud?
Expected: Blue
Predicted: blue
Semantic: True 1.0
------------------------------------------------------------
Q: What is the first thing Pooh says when he wakes up in the morning?
Expected: What’s for breakfast?
Predicted: What's for breakfast?
Semantic: True 0.98
------------------------------------------------------------
Q: What happens when Pooh eats too much at Rabbit’s house?
Expected: He gets stuck in Rabbit’s doorway
Predicted: Rabbit uses Pooh’s back legs as a towel-horse.
Semantic: True 0.83
----------------------------------------

While some answers improve, others regress because the wrong chunks are ranked first. Overall performance remains similar to similarity search but is less stable. So let's go back to similarity search but increase k, the number of retrieved chunks, to get more context.

In [98]:
retriever_similarity_separators_5 = db_separators.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 5},
    )
evaluate(test_set, retriever_similarity_separators_5)

Q: What is Winnie-the-Pooh’s original name?
Expected: Edward Bear
Predicted: Sanders
Semantic: False 0.73
------------------------------------------------------------
Q: What name does Pooh live under?
Expected: Sanders
Predicted: Sanders
Semantic: True 1.0
------------------------------------------------------------
Q: What color balloon does Pooh choose to look like a small black cloud?
Expected: Blue
Predicted: blue
Semantic: True 1.0
------------------------------------------------------------
Q: What is the first thing Pooh says when he wakes up in the morning?
Expected: What’s for breakfast?
Predicted: What's for breakfast?
Semantic: True 0.98
------------------------------------------------------------
Q: What happens when Pooh eats too much at Rabbit’s house?
Expected: He gets stuck in Rabbit’s doorway
Predicted: Rabbit uses Pooh’s back legs as a towel-horse
Semantic: True 0.82
------------------------------------------------------------
Q: How long must Pooh wait to be unstuck

Increasing k further improves recall without introducing significant noise. More supporting context is consistently retrieved, leading to the best overall balance between recall and precision in this setup.
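
One way to check the recall claim directly is to measure retrieval separately from generation, by asking whether the expected answer appears anywhere in the top-k chunks. The sketch below is a hypothetical diagnostic, not part of the original experiments. `hit_rate_at_k` and the sample chunks are made up, and case-insensitive substring matching is an assumption that only suits short literal answers such as "Sanders".

```python
def contains_answer(chunks, answer):
    """True if the answer string occurs in any chunk (case-insensitive)."""
    needle = answer.casefold()
    return any(needle in chunk.casefold() for chunk in chunks)


def hit_rate_at_k(retrieved, answers, k):
    """Share of questions whose answer appears in the top-k retrieved chunks.

    retrieved: one ranked list of chunk strings per question.
    answers:   the expected answer string for each question.
    """
    hits = sum(contains_answer(chunks[:k], ans) for chunks, ans in zip(retrieved, answers))
    return hits / len(answers)


# Made-up ranked results for two questions (not quotes from the book).
retrieved = [
    ["He lived under the name of Sanders.", "Pooh sat down."],
    ["Piglet went home.", "It was a fine day.", "The balloon was blue."],
]
answers = ["Sanders", "blue"]
print(hit_rate_at_k(retrieved, answers, k=2))  # 0.5
print(hit_rate_at_k(retrieved, answers, k=3))  # 1.0
```

With the notebook's objects, the ranked lists could be built from `retriever.invoke(question)` by collecting each document's `page_content`.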

## Conclusion
This project shows that when the generation model stays the same, retrieval strategy becomes the key factor in RAG performance. Using simple similarity search with small fixed-size chunks gives only moderate accuracy (64–68%). When switching to separator-based semantic chunking, accuracy jumps to 84–88%, most previous NOT FOUND cases disappear, and the system becomes much more reliable at retrieving multi-sentence facts.
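
With 25 questions in the test set, each question is worth 4 percentage points, so the reported ranges correspond to small whole-number counts. A quick check, assuming the percentages were computed over the full 25-question set defined above (`correct_count` is a hypothetical helper):

```python
TEST_SET_SIZE = 25  # number of QA pairs in test_set


def correct_count(accuracy_percent, n=TEST_SET_SIZE):
    """Convert an accuracy percentage back to a whole number of correct answers."""
    return round(accuracy_percent / 100 * n)


for label, low, high in [
    ("fixed-size chunks", 64, 68),
    ("separator-based chunks", 84, 88),
]:
    print(f"{label}: {correct_count(low)}-{correct_count(high)} of {TEST_SET_SIZE} correct")
```

The gap between the two chunking strategies is about five questions, while the spread inside each range is a single question.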