### Document pre-processing

In [43]:
import os

In [44]:
pdf_path = "Roger_Federer_story.pdf"

if not os.path.exists(pdf_path):
    raise FileNotFoundError("File not found!")

In [45]:
import fitz

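def initial_preprocess(text: str) -> str:
    """Flatten a page's text onto a single line."""
    # Replace newlines and tabs with spaces so adjacent words are not glued together.
    cleaned_text = text.replace("\n", " ").replace("\t", " ")
    return cleaned_text.strip()

def get_doc_text(file_path: str) -> list[str]:
    """Return the cleaned text of each page, in page order."""
    text_per_page = []

    # Open the PDF as a context manager so the file handle is closed afterwards.
    with fitz.open(file_path) as doc:
        for page in doc:
            text = page.get_text()
            text_per_page.append(initial_preprocess(text))

    return text_per_page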

In [46]:
import random

contents = get_doc_text(pdf_path)
random_content = random.choice(contents)
random_content

'René Stauffer 72 Andre Agassi as the favorite and himself and Juan Carlos Ferrero of Spain as  the leading contenders for the No. 2 position. Federer’s assessment quickly  proved to be wrong. Federer beat Ferrero in his opening round-robin match  and then defeated Jiri Novak for a 2-0 round robin record. Federer then  clinched his round-robin flight when Agassi lost to both Novak and Ferrero.  No longer with a chance of making the semifinals, Agassi hastily left China in  disappointment, using a hip injury as the reason for his withdrawal. In the semifinals, Federer faced Hewitt, who already clinched the year-end  No. 1 ranking for a second year in a row. The Australian barely qualified for  the semifinals and benefited from Carlos Moya winning a three-hour mean- ingless match over fellow Spaniard Costa, where a Costa victory would have  him reach the semifinals rather than Hewitt. Although Federer lost five of the  last seven matches with Hewitt, he reasoned his chances of beating hi
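
The sample page above still contains doubled spaces and a word split by a line-break hyphen ("mean- ingless"). A minimal sketch of an extra cleanup pass follows; `tidy_page_text` is a hypothetical helper, not part of the notebook, and the de-hyphenation rule is only a heuristic.

```python
import re

def tidy_page_text(text: str) -> str:
    """Collapse whitespace runs and rejoin words split by a line-break hyphen."""
    # Heuristic: a hyphen followed by whitespace, sitting between two lowercase
    # letters, is treated as a line-break hyphen ("mean- ingless" -> "meaningless").
    text = re.sub(r"(?<=[a-z])-\s+(?=[a-z])", "", text)
    # Collapse any run of whitespace into a single space.
    return re.sub(r"\s+", " ", text).strip()

sample = "a three-hour mean- ingless match over  fellow Spaniard Costa"
print(tidy_page_text(sample))
```

The heuristic can mis-join genuine compounds that happen to break at a hyphen (for example "two-sets- to-love" would lose its middle hyphen), so it is worth spot-checking a few pages before applying it to the whole book.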

In [47]:
from spacy.lang.en import English

lan = English()
lan.add_pipe("sentencizer")

def extract_info(text_per_page: list[str]) -> list[dict]:
    info_per_page = []

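    for page_no, text in enumerate(text_per_page):
        # Split the page into sentences with spaCy's rule-based sentencizer.
        sentences = [str(sen) for sen in lan(text).sents]
        info_per_page.append(
            {
                "page_number": page_no,  # zero-based index into the PDF, not the printed page number
                "char_count": len(text),
                "word_count": len(text.split(" ")),  # splitting on single spaces also counts empty strings between doubled spaces
                "sentences": sentences,
                "sentence_count": len(sentences),
                "token_count": len(text) / 4,  # rough estimate: about 4 characters per token
                "text": text
            }
        )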

    return info_per_page

In [6]:
processed_contents = extract_info(contents)
random.choice(processed_contents)

{'page_number': 202,
 'char_count': 2496,
 'word_count': 462,
 'sentences': ['René Stauffer 166 ning,” he said. “',
  'I should have stood farther up, played more aggressively and  I needed to try to put more pressure on him.”',
  'Although he reached the final of every tournament he played to date in  2006, there was now an unmistakable blemish on this record.',
  'His four losses  in 2006 were all to Nadal and all in finals.',
  'His first loss was on the hard  courts of Dubai, with the next three coming on clay in the finals of Monte  Carlo, Rome and Paris.',
  'In the meantime, his career record against Nadal was  1-6, with his lone victory being achieved after coming back from a two-sets- to-love deficit in the Key Biscayne final in 2005.',
  'Nadal, however, was cautious  to not lay claim to the No.',
  '1 ranking. “',
  'I can’t say that I’m better than Roger  because that wouldn’t be true,” he said in Paris.',
  'The comforts of the grass courts of Wimbledon—still considered Fe
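
The extracted text contains doubled spaces, so `word_count` computed with `text.split(" ")` includes empty strings. The sketch below contrasts the two ways of splitting on a fragment taken from the output above; the helper names are hypothetical.

```python
def count_words_single_space(text: str) -> int:
    # Every single space is a delimiter, so "a  b" yields ["a", "", "b"].
    return len(text.split(" "))

def count_words_whitespace(text: str) -> int:
    # With no argument, runs of whitespace act as one delimiter.
    return len(text.split())

fragment = "Nadal, however, was cautious  to not lay claim"
print(count_words_single_space(fragment), count_words_whitespace(fragment))
```

If the word counts are used later, for example to filter pages, the whitespace-based count is probably the safer of the two.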

In [49]:
len(processed_contents)

289

In [50]:
import pandas as pd

df = pd.DataFrame(processed_contents)
df.head(5)

Unnamed: 0,page_number,char_count,word_count,sentences,sentence_count,token_count,text,chunks,embeddings
0,0,59,10,[THE ROGER FEDERER STORY Quest For Perfection...,1,14.75,THE ROGER FEDERER STORY Quest For Perfection ...,[[THE ROGER FEDERER STORY Quest For Perfectio...,"[[-0.0035685853, 0.0807035, 0.0087812655, 0.06..."
1,1,77,13,[THE ROGER FEDERER STORY Quest For Perfection...,1,19.25,THE ROGER FEDERER STORY Quest For Perfection ...,[[THE ROGER FEDERER STORY Quest For Perfectio...,"[[-0.012125157, 0.077075735, -0.00013054344, 0..."
2,2,362,55,"[Cover and interior design: Emily Brackett, Vi...",2,90.5,"Cover and interior design: Emily Brackett, Vis...","[[Cover and interior design: Emily Brackett, V...","[[0.02902794, 0.039170813, -0.06723627, 0.0251..."
3,3,2675,1331,"[Contents From The Author ., ., ., ., ., ...",403,668.75,Contents From The Author . . . . . . . ...,"[[Contents From The Author .], [ . . . . . . ....","[[0.04131374, 0.021502212, -0.009723351, 0.014..."
4,4,2394,1108,"[New York, New York ., ., ., ., ., ., .,...",316,598.5,"New York, New York . . . . . . . . . ...","[[New York, New York .], [ . . . . . . . . . ....","[[-0.04760484, 0.10172539, 0.010011899, -0.046..."


In [51]:
def split_sentences_to_chunks(sentences: list[str], chunk_size: int = 5) -> list[list[str]]:
    chunks = []

    for i in range(0, len(sentences), chunk_size):
        chunks.append(sentences[i: i+chunk_size])

    return chunks
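
The function above produces non-overlapping groups, so a sentence at a chunk boundary loses its neighbouring context. One possible variant, sketched here under the assumption that a small overlap is acceptable, is a sliding window; `split_with_overlap` and its `overlap` parameter are hypothetical and not part of the notebook.

```python
def split_with_overlap(
    sentences: list[str], chunk_size: int = 5, overlap: int = 1
) -> list[list[str]]:
    """Group sentences into chunks that share `overlap` sentences with the previous chunk."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(sentences[start:start + chunk_size])
        # Stop once a chunk has reached the end, to avoid a trailing
        # chunk that only repeats already-covered sentences.
        if start + chunk_size >= len(sentences):
            break
    return chunks

sentences = [f"s{i}" for i in range(12)]
for chunk in split_with_overlap(sentences, chunk_size=5, overlap=1):
    print(chunk)
```

With `overlap=0` this reduces to the same grouping as `split_sentences_to_chunks`.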

In [52]:
for item in processed_contents:
    item["chunks"] = split_sentences_to_chunks(item["sentences"])

In [56]:
df = pd.DataFrame(processed_contents)
df.head()

Unnamed: 0,page_number,char_count,word_count,sentences,sentence_count,token_count,text,chunks,embeddings
0,0,59,10,[THE ROGER FEDERER STORY Quest For Perfection...,1,14.75,THE ROGER FEDERER STORY Quest For Perfection ...,[[THE ROGER FEDERER STORY Quest For Perfectio...,"[[-0.0035685853, 0.0807035, 0.0087812655, 0.06..."
1,1,77,13,[THE ROGER FEDERER STORY Quest For Perfection...,1,19.25,THE ROGER FEDERER STORY Quest For Perfection ...,[[THE ROGER FEDERER STORY Quest For Perfectio...,"[[-0.012125157, 0.077075735, -0.00013054344, 0..."
2,2,362,55,"[Cover and interior design: Emily Brackett, Vi...",2,90.5,"Cover and interior design: Emily Brackett, Vis...","[[Cover and interior design: Emily Brackett, V...","[[0.02902794, 0.039170813, -0.06723627, 0.0251..."
3,3,2675,1331,"[Contents From The Author ., ., ., ., ., ...",403,668.75,Contents From The Author . . . . . . . ...,"[[Contents From The Author ., ., ., ., .],...","[[0.04131374, 0.021502212, -0.009723351, 0.014..."
4,4,2394,1108,"[New York, New York ., ., ., ., ., ., .,...",316,598.5,"New York, New York . . . . . . . . . ...","[[New York, New York ., ., ., ., .], [ ., ...","[[-0.04760484, 0.10172539, 0.010011899, -0.046..."


In [57]:
for item in processed_contents:
    chunks = item["chunks"]

    processed_chunks = []
    for chunk in chunks:
        temp = []
        for sentence in chunk:
            words = sentence.split()
            if len(words) > 3:
                temp.append(sentence)

        processed_chunks.append(temp)

    item["chunks"] = processed_chunks
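
The loop above keeps a chunk even when every sentence in it was filtered out, which is why empty lists (and later empty chunk texts) appear for the table-of-contents pages. A minimal sketch of a variant that also drops empty chunks is shown below; `filter_chunks` is a hypothetical helper, and it would need to run before the embeddings are computed so that chunk and embedding positions stay aligned.

```python
def filter_chunks(chunks: list[list[str]], min_words: int = 4) -> list[list[str]]:
    """Drop sentences shorter than `min_words` words, then drop chunks left empty."""
    cleaned = []
    for chunk in chunks:
        kept = [sentence for sentence in chunk if len(sentence.split()) >= min_words]
        if kept:  # skip chunks where nothing survived the filter
            cleaned.append(kept)
    return cleaned

example = [
    ["Contents From The Author ."],
    [".", ".", "."],
    ["1 ranking.", "His four losses in 2006 were all to Nadal."],
]
print(filter_chunks(example))
```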

In [58]:
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_name_or_path = 'Alibaba-NLP/gte-multilingual-base'
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model = AutoModel.from_pretrained(model_name_or_path, trust_remote_code=True)
model.eval()

NewModel(
  (embeddings): NewEmbeddings(
    (word_embeddings): Embedding(250048, 768, padding_idx=1)
    (rotary_emb): NTKScalingRotaryEmbedding()
    (token_type_embeddings): Embedding(1, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): NewEncoder(
    (layer): ModuleList(
      (0-11): 12 x NewLayer(
        (attention): NewSdpaAttention(
          (qkv_proj): Linear(in_features=768, out_features=2304, bias=True)
          (dropout): Dropout(p=0.0, inplace=False)
          (o_proj): Linear(in_features=768, out_features=768, bias=True)
        )
        (mlp): NewGatedMLP(
          (up_gate_proj): Linear(in_features=768, out_features=6144, bias=False)
          (down_proj): Linear(in_features=3072, out_features=768, bias=True)
          (act_fn): GELUActivation()
          (hidden_dropout): Dropout(p=0.1, inplace=False)
        )
        (attn_ln): LayerNorm((768,), eps=1e-12, elementwise_affine

In [59]:
import torch

for item in processed_contents:
    embeddings = []

    for chunk in item["chunks"]:
        joined_sentences = " ".join(chunk)

        inputs = tokenizer(joined_sentences, max_length=512, padding=True, truncation=True, return_tensors='pt')

        with torch.no_grad():
            outputs = model(**inputs)

        emb = outputs.last_hidden_state[:, 0]
        emb = F.normalize(emb, p=2, dim=1)

        emb_np = emb[0].cpu().numpy()
        embeddings.append(emb_np)

    item["embeddings"] = embeddings
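
The loop above runs the model once per chunk. Since the tokenizer call already pads its input, several chunk texts could in principle be encoded per forward pass. The sketch below shows only the batching logic, with a stand-in encoder so it runs without the model; `batched`, `embed_in_batches`, and `fake_encode` are hypothetical names, and the real `encode_fn` is assumed to take a list of strings and return one vector per string.

```python
from typing import Callable, Iterator, Sequence

def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_in_batches(
    texts: Sequence[str],
    encode_fn: Callable[[list[str]], Sequence],
    batch_size: int = 16,
) -> list:
    """Encode `texts` batch by batch, preserving the input order."""
    vectors = []
    for batch in batched(texts, batch_size):
        vectors.extend(encode_fn(list(batch)))
    return vectors

# Stand-in encoder for illustration only: one 1-d "vector" per text.
def fake_encode(batch: list[str]) -> list[list[float]]:
    return [[float(len(text))] for text in batch]

print(embed_in_batches(["a", "bb", "ccc"], fake_encode, batch_size=2))
```

Larger batches trade memory for throughput, so the batch size is worth tuning on the machine that runs the notebook.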


In [15]:
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer(model_name_or_path="all-mpnet-base-v2", device="cpu")


In [16]:
for item in processed_contents:
    embeddings = []

    for chunk in item["chunks"]:
        joined_sentences = " ".join(chunk)
        emb = embedding_model.encode(joined_sentences)
        embeddings.append(emb)

    item["embeddings"] = embeddings

In [60]:
df = pd.DataFrame(processed_contents)
df.head()

Unnamed: 0,page_number,char_count,word_count,sentences,sentence_count,token_count,text,chunks,embeddings
0,0,59,10,[THE ROGER FEDERER STORY Quest For Perfection...,1,14.75,THE ROGER FEDERER STORY Quest For Perfection ...,[[THE ROGER FEDERER STORY Quest For Perfectio...,"[[-0.03971169, 0.0085977055, -0.03254035, -0.0..."
1,1,77,13,[THE ROGER FEDERER STORY Quest For Perfection...,1,19.25,THE ROGER FEDERER STORY Quest For Perfection ...,[[THE ROGER FEDERER STORY Quest For Perfectio...,"[[-0.037184387, 0.0052641653, -0.050972298, 0...."
2,2,362,55,"[Cover and interior design: Emily Brackett, Vi...",2,90.5,"Cover and interior design: Emily Brackett, Vis...","[[Cover and interior design: Emily Brackett, V...","[[-0.059621613, 0.047493614, -0.013715048, 0.0..."
3,3,2675,1331,"[Contents From The Author ., ., ., ., ., ...",403,668.75,Contents From The Author . . . . . . . ...,"[[Contents From The Author .], [], [], [ . . ....","[[-0.023638057, 0.010781779, -0.017973509, 0.0..."
4,4,2394,1108,"[New York, New York ., ., ., ., ., ., .,...",316,598.5,"New York, New York . . . . . . . . . ...","[[New York, New York .], [], [], [ . . . . . ....","[[-0.085618645, 0.07266691, -0.02959496, 0.022..."


In [61]:
import uuid

chunk_entries = []

for item in processed_contents:
    page_num = item["page_number"]

    for idx, chunk in enumerate(item["chunks"]):
        chunk_dict = {}
        chunk_text = " ".join(chunk)
        token_count = len(chunk_text)/4

        chunk_dict["chunk_id"] = str(uuid.uuid4())
        chunk_dict["page_num"] = int(page_num)
        chunk_dict["text"] = chunk_text
        chunk_dict["embedding"] = item["embeddings"][idx]

        metadata = {
            "char_count": len(chunk_text),
            "sentence_count": len(chunk),
            "token_count": token_count,
            "source": os.path.basename(pdf_path)
        }

        chunk_dict["metadata"] = metadata

        chunk_entries.append(chunk_dict)

In [62]:
df = pd.DataFrame(chunk_entries)
df.head()

Unnamed: 0,chunk_id,page_num,text,embedding,metadata
0,76457e22-e034-43f4-92e6-d9e3c0d6962f,0,THE ROGER FEDERER STORY Quest For Perfection ...,"[-0.03971169, 0.0085977055, -0.03254035, -0.00...","{'char_count': 59, 'sentence_count': 1, 'token..."
1,d57f0ac4-4ebf-4e85-9f91-07876655d0e6,1,THE ROGER FEDERER STORY Quest For Perfection ...,"[-0.037184387, 0.0052641653, -0.050972298, 0.0...","{'char_count': 77, 'sentence_count': 1, 'token..."
2,b278dad0-e16a-4cec-8220-5fd6d4f717a4,2,"Cover and interior design: Emily Brackett, Vis...","[-0.059621613, 0.047493614, -0.013715048, 0.00...","{'char_count': 362, 'sentence_count': 2, 'toke..."
3,53c40b0e-18c4-4b2a-abbf-b75ef7d1b457,3,Contents From The Author .,"[-0.023638057, 0.010781779, -0.017973509, 0.05...","{'char_count': 26, 'sentence_count': 1, 'token..."
4,1fa65fe9-02dc-4f18-8c23-edaf4c6075f7,3,,"[-0.06665535, 0.05475923, -0.05223963, 0.03440...","{'char_count': 0, 'sentence_count': 0, 'token_..."


In [63]:
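# Without parentheses this only displays the bound method (shown with the frame's repr);
# call df.describe() to get summary statistics instead.
df.describe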

<bound method NDFrame.describe of                                   chunk_id  page_num  \
0     76457e22-e034-43f4-92e6-d9e3c0d6962f         0   
1     d57f0ac4-4ebf-4e85-9f91-07876655d0e6         1   
2     b278dad0-e16a-4cec-8220-5fd6d4f717a4         2   
3     53c40b0e-18c4-4b2a-abbf-b75ef7d1b457         3   
4     1fa65fe9-02dc-4f18-8c23-edaf4c6075f7         3   
...                                    ...       ...   
1314  347a3978-18bf-428d-81ca-5d42110dd599       284   
1315  0cf397fe-5689-415a-bad7-1a03a95fd8f2       285   
1316  a66e5132-fce8-4b1a-ace3-f8cbb095126b       286   
1317  d9f0eeba-b28c-4e60-b290-93b19dde8825       287   
1318  e39d56e1-770c-46e6-afec-f8665c93c276       288   

                                                   text  \
0     THE ROGER  FEDERER STORY Quest For Perfection ...   
1     THE ROGER  FEDERER STORY Quest For Perfection ...   
2     Cover and interior design: Emily Brackett, Vis...   
3                            Contents From The Author .  

In [64]:
import faiss
import numpy as np

embedding_matrix = np.array([entry["embedding"] for entry in chunk_entries]).astype("float32")

embedding_matrix = embedding_matrix / np.linalg.norm(embedding_matrix, axis=1, keepdims=True)

index = faiss.IndexFlatIP(embedding_matrix.shape[1])
index.add(embedding_matrix)

chunk_id_map = {i: entry for i, entry in enumerate(chunk_entries)}
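
Because the vectors are L2-normalized before being added, the inner product that `IndexFlatIP` computes equals cosine similarity. The sketch below reproduces that idea as a brute-force search in plain Python (no FAISS) on toy 2-d vectors; the function names are hypothetical and the code is meant as an illustration of the scoring, not a replacement for the index.

```python
import math

def l2_normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def top_k_inner_product(
    query: list[float], vectors: list[list[float]], k: int
) -> list[tuple[int, float]]:
    """Return (position, score) pairs for the k highest inner products."""
    scores = [sum(q * x for q, x in zip(query, vec)) for vec in vectors]
    order = sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)[:k]
    return [(i, scores[i]) for i in order]

vectors = [l2_normalize(v) for v in ([3.0, 4.0], [1.0, 0.0], [0.0, 2.0])]
query = l2_normalize([6.0, 8.0])  # same direction as the first vector

for position, score in top_k_inner_product(query, vectors, k=2):
    print(position, round(score, 3))
```

The query points in the same direction as the first stored vector, so that vector ranks first with a score close to 1 regardless of the original magnitudes.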

In [65]:
def encode_gte(texts: list[str]) -> torch.Tensor:
    inputs = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=512,
        return_tensors='pt'
    )
    with torch.no_grad():
        outputs = model(**inputs)
        embeddings = outputs.last_hidden_state[:, 0]  # CLS token
        embeddings = F.normalize(embeddings, p=2, dim=1)
    return embeddings

In [66]:
query = "How many tournaments did Roger Federer win in 2003?"
query_embedding_tensor = encode_gte([query])
query_embedding = query_embedding_tensor[0].cpu().numpy().astype("float32")

D, I = index.search(np.array([query_embedding]), k=10)
retrieved_chunks = [chunk_id_map[i]["text"] for i in I[0]]
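
The search returns scores in `D` alongside the positions in `I`, but only the texts are kept above. A small sketch of keeping score and page number with each hit follows; `collect_hits` and `min_score` are hypothetical, and the `idx < 0` check covers the `-1` positions FAISS returns when fewer than `k` results exist.

```python
def collect_hits(scores, indices, chunk_id_map, min_score=0.0):
    """Pair each retrieved chunk with its similarity score and page number."""
    hits = []
    for score, idx in zip(scores, indices):
        if idx < 0:  # placeholder position: no result in this slot
            continue
        if score < min_score:  # optional cut-off for weak matches
            continue
        entry = chunk_id_map[idx]
        hits.append(
            {
                "score": float(score),
                "page_num": entry["page_num"],
                "text": entry["text"],
            }
        )
    return hits

# Toy data standing in for D[0], I[0] and chunk_id_map.
toy_map = {
    0: {"page_num": 131, "text": "In 2003, Roger Federer won six tournaments"},
    1: {"page_num": 110, "text": "The Grand Slam Block"},
}
print(collect_hits([0.82, 0.61, 0.0], [0, 1, -1], toy_map, min_score=0.5))
```

In the notebook this would be called as `collect_hits(D[0], I[0], chunk_id_map)`.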

In [67]:
retrieved_chunks

['95 Chapter 19 Duels in Texas In 2003, Roger Federer won six tournaments on all different surfaces—indoors  in Marseille and Vienna, hard courts in Dubai, clay in Munich and grass in  Halle and at Wimbledon. However, a gap in his resume was the absence of  a tournament title in North America. Of the 21 tournaments he played on  the North American continent since he turned professional, 10 ended in the  first round. On only one occasion—in Key Biscayne in 2002—did he manage  to make his way into a singles final. He wasn’t even able to muscle his way  as far as the quarterfinals at the US Open or at the big events in Cincinnati  or Indian Wells.',
 '74 Chapter 15 The Grand Slam Block Roger Federer’s declared goal for 2003 was, as before, to win a Grand Slam  tournament. He finally wanted to rid himself of the moniker as the best  player in tennis without a Grand Slam title. In his 14 career Grand Slam  tournament appearances, his best results were two modest quarterfinal fin- ishes—both

In [None]:
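query = "How many tournaments did Roger Federer win in 2003?"
# Use this cell only if the index was built from the all-mpnet-base-v2 embeddings;
# query and index vectors must come from the same model.
query_embedding = embedding_model.encode(query).astype("float32")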

k = min(len(chunk_id_map), 10)
D, I = index.search(np.array([query_embedding]), k)

retrieved_chunks = [chunk_id_map[i]["text"] for i in I[0]]

In [68]:
context = "\n\n".join(retrieved_chunks)

prompt = f"""
Context:
{context}

Question:
{query}

Answer:
"""
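
Joining all ten retrieved chunks can produce a long prompt. A minimal sketch of building the same prompt layout under a character budget is shown below; `build_prompt` and the `max_context_chars` default are hypothetical, and the budget is counted in characters rather than model tokens.

```python
def build_prompt(chunks: list[str], question: str, max_context_chars: int = 6000) -> str:
    """Assemble the prompt from the best-ranked chunks that fit the budget."""
    selected = []
    used = 0
    for chunk in chunks:  # chunks are assumed to be ordered best match first
        if used + len(chunk) > max_context_chars:
            break
        selected.append(chunk)
        used += len(chunk)

    context = "\n\n".join(selected)
    return f"Context:\n{context}\n\nQuestion:\n{question}\n\nAnswer:\n"

print(build_prompt(["aaa", "bbb", "ccc"], "How many?", max_context_chars=6))
```

In the notebook the call would be `build_prompt(retrieved_chunks, query)`.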

In [None]:
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="<Openrouter_API_key>",  # placeholder: supply your own key
)

# `prompt` is a plain string, so it has no `.context` or `.question` attributes;
# reuse the `context` and `query` variables defined earlier.
content = f"""
Context:
{context}

Question:
{query}

Answer:
"""

completion = client.chat.completions.create(
    extra_headers={},
    extra_body={},
    model="deepseek/deepseek-chat-v3-0324:free",
    messages=[
        {
            "role": "system",
            "content": "You are a question-answering assistant"
        },
        {
            "role": "user",
            "content": content
        }
    ]
)

print(completion.choices[0].message.content)

Roger Federer won **six tournaments** in 2003 across different surfaces:
1. Marseille (indoor)  
2. Vienna (indoor)  
3. Dubai (hard court)  
4. Munich (clay)  
5. Halle (grass)  
6. **Wimbledon** (grass, his first Grand Slam title).  

Additionally, his victory at the year-end **Tennis Masters Cup** in November 2003 brought his total to **seven titles** for the year (some sources may count six excluding the Masters Cup, but the text confirms it as a major title win in 2003). 

**Final Answer:** Roger Federer won **seven tournaments** in 2003, including Wimbledon and the Tennis Masters Cup.
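
The answer above first reports six tournaments and then settles on seven. One thing that may help is a system prompt that asks the model to stay within the retrieved context. The sketch below only builds the `messages` list; `build_messages` and the wording of the instruction are assumptions, and there is no guarantee that this changes the model's answer.

```python
def build_messages(context: str, question: str) -> list[dict]:
    """Build chat messages that ask the model to answer from the context only."""
    system = (
        "You are a question-answering assistant. "
        "Answer using only the provided context. "
        "If the context does not contain the answer, say so."
    )
    user = f"Context:\n{context}\n\nQuestion:\n{question}\n\nAnswer:"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]

messages = build_messages(
    "In 2003, Roger Federer won six tournaments.",
    "How many tournaments did Roger Federer win in 2003?",
)
print(messages[0]["content"])
```

The resulting list can be passed as the `messages` argument of `client.chat.completions.create`.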
