# Load epub book

This notebook implements a RAG (Retrieval-Augmented Generation) question-answering system for "A Christmas Carol" by
  Charles Dickens. Here's what it does:

  1. Document Loading & Chunking (Cells 1-4)

  - Loads the EPUB book using UnstructuredEPubLoader
  - Splits the book into 203 chunks using RecursiveCharacterTextSplitter
    - Chunk size: 1024 characters
    - Chunk overlap: 50 characters
  - Each chunk contains a portion of the book's text

  2. Embedding Creation (Cells 5-8)

  - Uses the BAAI/bge-small-en-v1.5 sentence transformer model
  - Converts text into 384-dimensional vector embeddings
  - These embeddings capture semantic meaning for similarity search

  3. Vector Database Setup (Cells 9-13)

  - Creates unique IDs for each chunk (203 total)
  - Stores all chunks in ChromaDB (in-memory vector database)
  - Collection name: carol
  - Enables semantic search over the book content

  4. Basic Retrieval Test (Cells 14-15)

  - Tests querying with: "What happened Marley?"
  - Returns top 5 most relevant chunks based on semantic similarity
  - Shows distances (lower = more similar) ranging from 0.31 to 0.33
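The "lower = more similar" reading of the distances can be made concrete in plain Python. This is a minimal sketch that assumes the collection uses Chroma's default squared-L2 distance and that the embeddings are unit-length; both are assumptions about the setup rather than something the notebook output confirms. The 3-dimensional vectors are toy values, not real embeddings.

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def squared_l2(a, b):
    """Squared Euclidean distance (assumed here to be the collection's metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine similarity, computed as the dot product of the normalized vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

# Toy vectors, not real embeddings.
query = normalize([1.0, 0.2, 0.0])
near = normalize([0.9, 0.3, 0.1])
far = normalize([0.0, 0.1, 1.0])

# For unit vectors: squared_l2 == 2 - 2 * cosine_similarity,
# so a smaller distance always means a higher similarity.
for name, vec in [("near", near), ("far", far)]:
    print(name, round(squared_l2(query, vec), 3), round(cosine_similarity(query, vec), 3))
```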

  5. RAG Question-Answering Workflow (Cells 17-21)

  The two-step RAG approach:

  Step 1: Question Reformulation (Cell 17)
  - Takes the original question: "What is the name of Bob Cratchit's youngest son who is ill?"
  - Uses FLAN-T5 to convert it to a declarative statement: "Bob Cratchit's youngest son is ill."
  - The intent is to make the query resemble the declarative text stored in the book; in this example the statement drops the request for a name

  Step 2: Retrieve & Answer (Cells 19-21)
  - Queries ChromaDB with the original question; cell 19 notes that searching with the reformulated statement lost details from the question
  - Retrieves the top 3 relevant chunks as context
  - Combines context + original question into a prompt: "Answer based on context:\n\n{context}\n\n{question}"
  - Uses FLAN-T5 to generate the final answer: "Tiny Tim"

  6. Discussion Points (Cell 22)

  The notebook ends with reflection questions about:
  - Performance of the solution
  - Potential issues
  - Possible improvements

  ---
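  Key idea: The two-step approach (reformulate question → retrieve → answer) is meant to bridge the gap between how
  questions are asked and how information is stored in the text. As the notebook stands, cell 19 retrieves with the
  original question, so the reformulated statement is printed but not used for retrieval.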

In [1]:
# Import libraries
import os
from langchain_community.document_loaders import UnstructuredEPubLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

import chromadb
from uuid import uuid4
from chromadb.utils import embedding_functions

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

The next cell creates a text splitter that breaks large documents into smaller, overlapping chunks. Here's what each part
  does:

  Parameters

  chunk_size = 1024
  - Maximum size of each text chunk in characters (not words)
  - Each chunk will be ≤1024 characters long
  - Think of it as cutting the book into pages of roughly equal size

  chunk_overlap = 50
  - Target number of characters shared between consecutive chunks
  - Text from the end of chunk N is repeated at the start of chunk N+1, up to 50 characters; the exact amount depends on where the separators fall
  - Reduces the chance that important information is split across chunk boundaries

  Why Overlap Matters

  Without overlap:
  Chunk 1: "...and Scrooge saw the ghost of"
  Chunk 2: "Marley appear before him..."
  ❌ Context is broken - the ghost identity is split!

  With 50-char overlap:
  Chunk 1: "...and Scrooge saw the ghost of Marley"
  Chunk 2: "ghost of Marley appear before him..."
  ✅ Both chunks maintain context about who the ghost is

  RecursiveCharacterTextSplitter

  This is a LangChain splitter that, with its default separators:
  1. Tries to split on paragraph breaks (blank lines) first
  2. Falls back to single line breaks if a piece is still too long
  3. Falls back to spaces between words, and finally to individual characters
  4. Merges the resulting pieces back together up to the chunk_size limit

  Result: The Christmas Carol book is split into 203 chunks that follow paragraph boundaries where possible and can be
  searched independently, with some shared text at boundaries where the overlap applies.
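The effect of `chunk_size` and `chunk_overlap` can be sketched with a naive fixed-width sliding window. This is a simplification for illustration only: `sliding_window_chunks` is a hypothetical helper, and LangChain's `RecursiveCharacterTextSplitter` splits on separators first, so its chunks and overlaps will not match this output exactly.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-width windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

sample = "and Scrooge saw the ghost of Marley appear before him in the doorway"
for chunk in sliding_window_chunks(sample, chunk_size=40, chunk_overlap=10):
    print(repr(chunk))
```

Each printed window starts with the last 10 characters of the previous one, which is the behaviour the overlap parameter is aiming for.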

In [2]:
# TODO: Load document 
chunk_size = 1024
chunk_overlap = 50
text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)

epub_loader = UnstructuredEPubLoader('./docs/charles-dickens_a-christmas-carol.epub')

In [3]:
# TODO Split document
chunks = epub_loader.load_and_split(text_splitter)

  data file translations/en.yaml not found



In [4]:
# TODO Examine chunk
print(len(chunks))
print(chunks[100])

203
page_content='For the people who were shovelling away on the housetops were jovial and full of glee; calling out to one another from the parapets, and now and then exchanging a facetious snowball﻿—better-natured missile far than many a wordy jest﻿—laughing heartily if it went right, and not less heartily if it went wrong. The poulterers’ shops were still half open, and the fruiterers’ were radiant in their glory. There were great, round, potbellied baskets of chestnuts, shaped like the waistcoats of jolly old gentlemen, lolling at the doors, and tumbling out into the street in their apoplectic opulence: There were ruddy, brown-faced, broad-girthed Spanish onions, shining in the fatness of their growth like Spanish friars, and winking from their shelves in wanton slyness at the girls as they went by, and glanced demurely at the hung-up mistletoe. There were pears and apples clustered high in blooming pyramids; there were bunches of grapes, made, in the shopkeepers’ benevolence, to d

# Create embeddings

In [5]:
# TODO: Create embedding model
embed_model_name = "BAAI/bge-small-en-v1.5"
#embed_model_name = "all-MiniLM-L6-v2"

chroma_embed_func = embedding_functions.SentenceTransformerEmbeddingFunction(model_name=embed_model_name)

In [6]:
# TODO: Explore embedding model
# Embed two sample sentences; the result holds one vector per input
sample_texts = ['hello, world', 'big black bug bleeds black blood']
emb_text = chroma_embed_func(sample_texts)


In [7]:
print(len(emb_text))
print(len(emb_text[0]))
print(len(emb_text[1]))
print(emb_text[0])

2
384
384
[-3.15818861e-02 -4.86476459e-02  3.21323797e-02 -6.57482818e-02
 -1.12419424e-03  1.14271725e-02 -1.62230618e-03  5.49601130e-02
  4.48704585e-02 -2.09966279e-03  7.87420943e-03 -2.20074728e-02
  3.43550295e-02  6.57045990e-02  2.98711378e-02 -2.77358951e-04
  1.02012476e-03 -3.47684883e-02 -1.21079311e-01 -1.47990584e-02
  9.72585976e-02  3.53694521e-02 -1.68968178e-02 -4.28635776e-02
 -2.48042773e-02  5.63818216e-03  6.80470234e-03  1.35493865e-02
  6.07596058e-03 -9.83635336e-02 -6.45544007e-02 -1.15323970e-02
  3.96090522e-02  2.41095815e-02  4.54738513e-02 -2.10404806e-02
  2.52141189e-02 -1.03885625e-02 -7.94329867e-02  3.64228641e-03
  4.60232683e-02 -5.09504527e-02  1.40664354e-02 -3.41338338e-03
  1.36136273e-02 -4.93411645e-02  1.70672853e-02  5.47222272e-02
 -2.78037954e-02  4.88214078e-04 -5.45994267e-02 -8.51237681e-03
 -1.97877735e-02 -2.24599219e-03  2.84830965e-02  9.09864530e-02
  7.97385275e-02  2.93904054e-03  4.68928032e-02  8.69194046e-03
  1.88648272e-0

In [8]:
# TODO: Prepare the chunks for inserting into Chroma
# Extract the text
texts = [ c.page_content for c in chunks ]
print(texts[100])
print(len(texts))


For the people who were shovelling away on the housetops were jovial and full of glee; calling out to one another from the parapets, and now and then exchanging a facetious snowball﻿—better-natured missile far than many a wordy jest﻿—laughing heartily if it went right, and not less heartily if it went wrong. The poulterers’ shops were still half open, and the fruiterers’ were radiant in their glory. There were great, round, potbellied baskets of chestnuts, shaped like the waistcoats of jolly old gentlemen, lolling at the doors, and tumbling out into the street in their apoplectic opulence: There were ruddy, brown-faced, broad-girthed Spanish onions, shining in the fatness of their growth like Spanish friars, and winking from their shelves in wanton slyness at the girls as they went by, and glanced demurely at the hung-up mistletoe. There were pears and apples clustered high in blooming pyramids; there were bunches of grapes, made, in the shopkeepers’ benevolence, to dangle from conspic

In [9]:
text_ids = [  str(uuid4())[:8] for _ in range(len(texts))]
print(text_ids)
print(len(text_ids))

['277d0bb8', 'bbe6d999', 'a07de751', 'e60d2e0a', 'b4a98aa7', '01b6444e', '49d46dce', '092ced88', '95c3810f', '1d8829e4', '65af40ca', 'd17ec7a5', 'bb1bbdbb', 'aa5514f7', '5e80c94a', '0fc79f83', '1ad4e15f', 'e387ff1b', '5db028d6', 'ff34017d', 'bd561d9c', '35c5b8d2', 'fec09abd', '773e74b7', 'b20689bb', '763d87d0', 'e2789d3b', 'b2078cdf', '26da3249', 'fa98828d', '14407106', '5327d3c1', 'defba9c2', '0eaec760', '76ceebd8', '2c2f9df7', '0901abd1', 'af7a64dc', 'b24dec98', '9f92d87b', '94cc593f', '301851ad', 'b43a424d', 'f3fff753', '8d4c56cd', 'd609844d', '7fe71b78', 'ddeebeab', '0c7a6173', 'b99640e6', '233e05ab', '9c506a5f', 'd8f2cb55', '1d642b35', 'b8a0ca1b', 'ef5f377b', 'a4df0832', '2238b7d7', 'b5e4ce98', '702c156b', '2c693bb8', '2dbae0de', '6a43dfbd', 'a65c5a28', '58e595df', 'af72ddae', '9c4913ae', '3406831b', '0a7cd4bf', '09ea8047', '8c555549', '57f15a08', 'f755bbdf', '0399c18f', '74eba842', '01934995', 'a6381159', 'db380b09', 'b49e0f47', '8107e3e7', 'd3c70d23', '6c2dcd9e', 'df832793', '42

In [11]:
# TODO: Create ephemeral Chroma client and save chunks
col_name = 'carol'

# Create the chromadb client
ch_client = chromadb.Client()

# Drop the collection if it already exists
try:
   ch_client.delete_collection(col_name)
except Exception:
   # Nothing to delete on a fresh client
   pass

# Create the collection with the embedding function
carol_col = ch_client.create_collection(
   name = col_name,
   embedding_function=chroma_embed_func
)


In [12]:
#Insert the docs into the collection
carol_col.add(
   documents = texts,
   ids = text_ids
)

In [13]:
# TODO: Print number of documents in collection 
print(carol_col.count())

203


In [14]:
# TODO: Query collection 
query = "What happened Marley?"


results = carol_col.query(
   query_texts=[ query ],
   n_results=5
)

print(results)

{'ids': [['ddeebeab', 'e60d2e0a', '773e74b7', 'fa98828d', '5327d3c1']], 'embeddings': None, 'documents': [['Marley’s Ghost bothered him exceedingly. Every time he resolved within himself, after mature inquiry that it was all a dream, his mind flew back again, like a strong spring released, to its first position, and presented the same problem to be worked all through, “Was it a dream or not?”\n\nScrooge lay in this state until the chime had gone three-quarters more, when he remembered, on a sudden, that the Ghost had warned him of a visitation when the bell tolled one. He resolved to lie awake until the hour was passed; and, considering that he could no more go to sleep than go to heaven, this was, perhaps, the wisest resolution in his power.\n\nThe quarter was so long, that he was more than once convinced he must have sunk into a doze unconsciously, and missed the clock. At length it broke upon his listening ear.\n\n“Ding, dong!”\n\n“A quarter past,” said Scrooge, counting.\n\n“Ding, 

In [15]:
# Fetch each hit by ID and print its text
for chunk_id in results['ids'][0]:
   result = carol_col.get(chunk_id)
   print(result['documents'])

['Marley’s Ghost bothered him exceedingly. Every time he resolved within himself, after mature inquiry that it was all a dream, his mind flew back again, like a strong spring released, to its first position, and presented the same problem to be worked all through, “Was it a dream or not?”\n\nScrooge lay in this state until the chime had gone three-quarters more, when he remembered, on a sudden, that the Ghost had warned him of a visitation when the bell tolled one. He resolved to lie awake until the hour was passed; and, considering that he could no more go to sleep than go to heaven, this was, perhaps, the wisest resolution in his power.\n\nThe quarter was so long, that he was more than once convinced he must have sunk into a doze unconsciously, and missed the clock. At length it broke upon his listening ear.\n\n“Ding, dong!”\n\n“A quarter past,” said Scrooge, counting.\n\n“Ding, dong!”\n\n“Half past,” said Scrooge.\n\n“Ding, dong!”\n\n“A quarter to it,” said Scrooge.\n\n“Ding, dong!”

# Question and Answer LLM
In this exercise you will implement a question and answer LLM for the 'A Christmas Carol' book that you have chunked and saved. 

The workflow is as follows:
1. Start from a question about the book, e.g. `"Who is Scrooge?"`.
2. Query Chroma for relevant context using the question, or facts extracted from the question.
3. Combine the question and the top 5 contexts returned by Chroma into a prompt.
4. Use `google/flan-t5-base` to answer the question.
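
The four steps above could be wired together roughly as follows. This is a sketch with injected callables: `retrieve` and `generate` are hypothetical stand-ins for the Chroma query and the FLAN-T5 call, the toy passages are placeholders, and the prompt wording is only one possible choice of template.

```python
from typing import Callable, List

def answer_question(
    question: str,
    retrieve: Callable[[str, int], List[str]],
    generate: Callable[[str], str],
    n_results: int = 5,
) -> str:
    """Retrieve context for a question, build a prompt, and generate an answer."""
    chunks = retrieve(question, n_results)
    context = "\n\n".join(chunks)
    prompt = f"Answer based on context:\n\n{context}\n\n{question}"
    return generate(prompt)

# Stand-ins so the sketch runs without a vector store or a model.
def fake_retrieve(question: str, n_results: int) -> List[str]:
    passages = ["Marley was dead: to begin with.", "Scrooge was his sole executor."]
    return passages[:n_results]

def fake_generate(prompt: str) -> str:
    return f"(model output for a prompt of {len(prompt)} characters)"

print(answer_question("Who is Scrooge?", fake_retrieve, fake_generate))
```

Passing the retriever and generator in as arguments keeps the workflow testable before the real collection and model are plugged in.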

Look through the FLAN templates on [GitHub](https://github.com/google-research/FLAN/blob/main/flan/templates.py) and select an appropriate template for this workshop.

Do not worry about the accuracy of the result. Focus on implementing the solution. We will discuss the nuances of the solution at the end of the workshop.

Use your RAG workflow to answer the questions provided in the `questions_for_rag.txt` file.

In [16]:
# TODO Your code 
model_name = "google/flan-t5-base"

model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [17]:
# Extract the core ideas of the question
# Only one question is active at a time; the rest are commented-out alternatives
#question = "What is the name of Scrooge's underpaid clerk?"
#question = "Who was Scrooge's deceased business partner?"
#question = "Who was Scrooge engaged to in his youth, and why did she leave him?"
question = "What is the name of Bob Cratchit's youngest son who is ill?"
#question = "What does Scrooge see written on the gravestone that frightens him into changing his ways?"
#question = " What is Scrooge's response when his nephew Fred invites him to Christmas dinner at the beginning of the story?"
#question = " What specific, generous act does Scrooge perform for the Cratchit family on Christmas morning?"

prompt = f"{question}\n\nWhat is sentence that verbalizes this data?"
#prompt = f"{question}\n\nWhat data can be extracted from this sentence?"
#prompt = f"Generate an approximately fifteen-word sentence that describes all this data: {question}"

# convert to a statement
enc_prompt = tokenizer(prompt, return_tensors='pt')
enc_answer = model.generate(enc_prompt.input_ids)
answer = tokenizer.decode(enc_answer[0], skip_special_tokens=True)

print(answer)

Bob Cratchit's youngest son is ill.


In [19]:
# TODO Your code
# FIX: Use the original question for search, not the reformulated answer
# The reformulation loses important details from the question
context = ""
results = carol_col.query(
   query_texts=[ question ],
   n_results=3
)
# Append the text of each hit; note that chunks are joined without a separator
for chunk_id in results['ids'][0]:
   result = carol_col.get(chunk_id)
   context += result['documents'][0]

print(context)

She hurried out to meet him; and little Bob in his comforter﻿—he had need of it, poor fellow﻿—came in. His tea was ready for him on the hob, and they all tried who should help him to it most. Then the two young Cratchits got upon his knees, and laid, each child, a little cheek against his face, as if they said, “Don’t mind it, father. Don’t be grieved!”

Bob was very cheerful with them, and spoke pleasantly to all the family. He looked at the work upon the table, and praised the industry and speed of Mrs. Cratchit and the girls. They would be done long before Sunday, he said.

“Sunday! You went today, then, Robert?” said his wife.

“Yes, my dear,” returned Bob. “I wish you could have gone. It would have done you good to see how green a place it is. But you’ll see it often. I promised him that I would walk there on a Sunday. My little, little child!” cried Bob. “My little child!”So Martha hid herself, and in came little Bob, the father, with at least three feet of comforter, exclusive o

In [20]:
question_prompt = f"Answer based on context:\n\n{context}\n\n{question}"
print(question_prompt)

Answer based on context:

She hurried out to meet him; and little Bob in his comforter﻿—he had need of it, poor fellow﻿—came in. His tea was ready for him on the hob, and they all tried who should help him to it most. Then the two young Cratchits got upon his knees, and laid, each child, a little cheek against his face, as if they said, “Don’t mind it, father. Don’t be grieved!”

Bob was very cheerful with them, and spoke pleasantly to all the family. He looked at the work upon the table, and praised the industry and speed of Mrs. Cratchit and the girls. They would be done long before Sunday, he said.

“Sunday! You went today, then, Robert?” said his wife.

“Yes, my dear,” returned Bob. “I wish you could have gone. It would have done you good to see how green a place it is. But you’ll see it often. I promised him that I would walk there on a Sunday. My little, little child!” cried Bob. “My little child!”So Martha hid herself, and in came little Bob, the father, with at least three feet

In [21]:
# TODO Your code
enc_query_prompt = tokenizer(question_prompt, return_tensors='pt')

enc_query_answer = model.generate(enc_query_prompt.input_ids)

query_answer = tokenizer.decode(enc_query_answer[0], skip_special_tokens=True)

print(question)
print(query_answer)

Token indices sequence length is longer than the specified maximum sequence length for this model (771 > 512). Running this sequence through the model will result in indexing errors


What is the name of Bob Cratchit's youngest son who is ill?
Tiny Tim


# Discussion

1. How did your solution perform?
2. Where do you think are the issues?
3. How can you improve it?

Based on analyzing the notebook, here are the key issues with this RAG implementation:

  1. Question Reformulation Adds Complexity & Errors

  # Cell 17: Converts question to statement first
  question = "What is the name of Bob Cratchit's youngest son who is ill?"
  # Becomes: "Bob Cratchit's youngest son is ill."
  Problems:
  - Extra LLM call adds latency and cost
  - Can lose important question details (e.g., asking for a NAME gets lost)
  - The reformulation might be wrong or incomplete
  - Why not just search with the original question?

  2. Very Limited Context (Only 3 Chunks)

  # Cell 19: Only retrieves 3 results
  # FIX APPLIED: Now uses question instead of reformulated answer
  results = carol_col.query(query_texts=[question], n_results=3)
  Issues:
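  - 3 chunks × 1024 chars ≈ 3,000 characters of context
  - Even so, the resulting prompt is 771 tokens, above the model's 512-token limit (see the warning in cell 21), so extra chunks only help after trimming or with a model that accepts longer inputs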
  - Important information might be in chunks 4-10
  - No diversity in retrieval (all chunks might be from same scene)
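
One hedged way to balance "more chunks" against the input limit is to add chunks in ranked order until a token budget is reached. In the sketch below `fit_to_budget` is a hypothetical helper and the default whitespace counter is only a rough stand-in; in the notebook one could pass something like `lambda s: len(tokenizer(s).input_ids)` instead.

```python
from typing import Callable, List

def fit_to_budget(
    chunks: List[str],
    max_tokens: int,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> List[str]:
    """Keep chunks in ranked order until the next one would exceed max_tokens."""
    kept, used = [], 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            break
        kept.append(chunk)
        used += cost
    return kept

ranked = ["one two three", "four five", "six seven eight nine"]
print(fit_to_budget(ranked, max_tokens=5))
```

The budget should leave room for the question and the template text, since those count toward the same limit.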

  3. Small Model Limitations (FLAN-T5-Base)

  - Only 250M parameters - limited reasoning capability
  - Struggles with complex questions requiring inference
  - Short output length limits detailed answers
  - No instruction fine-tuning for RAG tasks specifically

  4. Chunking Strategy Issues

  chunk_size = 1024
  chunk_overlap = 50  # Only 5% overlap!
  Problems:
  - 50-character overlap is very small (just ~10 words)
  - Character-based chunking can split mid-sentence
  - No semantic awareness (might split a conversation)
  - 1024 chars might cut important multi-paragraph context
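
As one possible alternative to cutting at arbitrary character positions, whole sentences can be packed greedily into chunks. This is a sketch, not the notebook's method: `pack_sentences` is a hypothetical helper, and the regex-based sentence split is crude (it will mis-split abbreviations such as "Mrs.").

```python
import re

def pack_sentences(text, max_chars):
    """Greedily pack whole sentences into chunks of up to max_chars characters.

    A single sentence longer than max_chars is kept whole rather than cut.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)  # close the chunk before it overflows
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

text = "Marley was dead. There is no doubt. Scrooge signed it."
for chunk in pack_sentences(text, max_chars=36):
    print(repr(chunk))
```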

  5. No Answer Validation or Confidence Scoring

  - Doesn't check if the answer is actually in the context
  - No confidence scores shown to user
  - Could hallucinate if context doesn't contain answer
  - No fallback for "I don't know"
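
A very simple form of answer validation is to accept the generated answer only if its text appears in the retrieved context. The sketch below is deliberately crude: `grounded_answer` is a hypothetical helper, and the literal substring test will reject correct answers that paraphrase the text.

```python
def grounded_answer(answer, context, fallback="I don't know"):
    """Return the answer only if its text literally appears in the context."""
    cleaned = answer.strip()
    if cleaned and cleaned.lower() in context.lower():
        return cleaned
    return fallback

context = "Alas for Tiny Tim, he bore a little crutch."
print(grounded_answer("Tiny Tim", context))
print(grounded_answer("Peter", context))
```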

  6. Embedding Model Limitations

  embed_model_name = "BAAI/bge-small-en-v1.5"
  - Only 384 dimensions (smaller models = less nuanced)
  - Might not capture subtle semantic differences
  - Questions and passages are embedded the same way (no query-specific instruction or prefix)

  7. Ephemeral Database (Lost on Restart)

  ch_client = chromadb.Client()  # In-memory only!
  - All embeddings lost when notebook restarts
  - Must re-embed entire book every time (~6 minutes)
  - No persistence for production use
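
A persistent client could avoid the re-embedding step. The sketch below assumes a recent chromadb release that provides `PersistentClient`; `open_persistent_collection` and the storage path are hypothetical names chosen for illustration.

```python
def open_persistent_collection(path, name, embedding_function=None):
    """Open (or create) a Chroma collection stored on disk under `path`."""
    # Imported lazily so this sketch can be loaded without chromadb installed.
    import chromadb

    client = chromadb.PersistentClient(path=path)
    kwargs = {"name": name}
    if embedding_function is not None:
        # Only override the library default when a function is supplied.
        kwargs["embedding_function"] = embedding_function
    return client.get_or_create_collection(**kwargs)
```

In the notebook this might be called as `open_persistent_collection("./chroma_carol", col_name, chroma_embed_func)`, adding the chunks only when `count()` returns 0.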

  8. No Metadata or Filtering

  - Can't filter by chapter, character name, or scene
  - No source attribution (which chapter is the answer from?)
  - Can't do temporal reasoning ("What happened BEFORE Marley appeared?")
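
Even without chapter metadata, the query result already carries chunk IDs that can be reported next to an answer. The sketch below flattens a result dictionary shaped like the one printed in cell 14; `summarize_hits` is a hypothetical helper, and the sample IDs, texts, and distances are placeholders rather than real output.

```python
def summarize_hits(results, preview_chars=60):
    """Flatten a Chroma-style query result for the first query into records."""
    ids = results["ids"][0]
    documents = results["documents"][0]
    # Distances may be absent if they were not requested in the query.
    distances = (results.get("distances") or [[None] * len(ids)])[0]
    return [
        {"rank": rank, "id": chunk_id, "distance": distance, "preview": doc[:preview_chars]}
        for rank, (chunk_id, doc, distance) in enumerate(zip(ids, documents, distances), start=1)
    ]

# Placeholder result, not real notebook output.
sample = {
    "ids": [["id-a", "id-b"]],
    "documents": [["first passage text", "second passage text"]],
    "distances": [[0.1, 0.2]],
}
for hit in summarize_hits(sample, preview_chars=12):
    print(hit)
```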

  9. Single Query Strategy

  - Only one attempt at retrieval
  - No query expansion (synonyms, rephrasings)
  - No hybrid search (keyword + semantic)
  - Misses the HyDE (Hypothetical Document Embeddings) opportunity
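
For hybrid search, the ranked lists from a keyword search and a semantic search need to be merged somehow. One common option, which the notebook does not use, is reciprocal rank fusion. The sketch below assumes each retriever returns a list of chunk IDs in rank order; the IDs and the constant `k=60` are illustrative choices.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked ID lists; a higher fused score ranks earlier."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score, breaking ties by ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

semantic = ["c1", "c2", "c3"]  # hypothetical IDs from the vector search
keyword = ["c3", "c1", "c4"]   # hypothetical IDs from a keyword search
print(reciprocal_rank_fusion([semantic, keyword]))
```

Chunks that appear in both lists accumulate score from each, so they tend to move ahead of chunks found by only one retriever.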

  10. Prompt Engineering Issues

  question_prompt = f"Answer based on context:\n\n{context}\n\n{question}"
  - Very basic prompt - no instructions about:
    - Answer length
    - What to do if answer not found
    - How to cite sources
    - Format expectations
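
A prompt that addresses these points might look like the sketch below. `build_prompt` is a hypothetical helper and the instruction wording is an assumption; whether FLAN-T5-base actually follows such instructions has not been tested here.

```python
def build_prompt(question, chunks, max_answer_words=20):
    """Assemble a prompt with numbered context passages and explicit instructions."""
    numbered = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer the question using only the context below.\n"
        f"Keep the answer under {max_answer_words} words.\n"
        'If the context does not contain the answer, reply "unknown".\n\n'
        f"Context:\n{numbered}\n\n"
        f"Question: {question}\nAnswer:"
    )

print(build_prompt("Who is Scrooge?", ["Marley was dead: to begin with."]))
```

Numbering the passages gives the model, and the reader, a handle for citing which chunk an answer came from.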

  Example of Failure Mode

  If you asked: "How did Scrooge change?"

  1. Reformulation might produce: "Scrooge changed" (loses the "HOW")
  2. Top 3 chunks might all be about one change, missing others
  3. FLAN-T5-base might give overly simplistic answer
  4. No way to know which parts of the book the answer came from

  How to Improve (Quick Wins)

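  1. Skip the reformulation - search with original question (already applied in cell 19)
  2. Increase context - retrieve 5-10 chunks, while keeping the prompt within the model's input limit (512 tokens here, per the cell 21 warning)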
  3. Better chunking - use 512 overlap, semantic splitting
  4. Add persistence - use chromadb.PersistentClient()
  5. Improve prompt - add instructions for "unknown" cases
  6. Use larger model - FLAN-T5-large or modern LLM
  7. Add reranking - re-score top-k results with cross-encoder
  8. Show sources - return chunk IDs/page numbers with answer

  The current approach works for simple factual questions but struggles with complex reasoning, multi-hop questions, or
  questions requiring synthesis across multiple parts of the book.