# Step 1: Setting up the environment

#### To create and activate a virtual environment, run the following in your terminal:

**Windows:**


```
python -m venv venv
```
```
venv\Scripts\Activate
```

**macOS/Linux:**

```
python3 -m venv venv
```

```
source venv/bin/activate
```

#### After the environment is activated, install the requirements:

```
pip install -r requirements.txt
```


# Step 2: Pre-processing the document

#### What you will need:
1. A PDF document (a skin cancer detection paper in my case)
2. An embedding model (I am using Sentence-BERT)

As described in the README file, the aim of this tutorial is to build a RAG-assisted LLM that can retrieve information from research papers, helping students and researchers understand a paper more quickly. I will be using the paper *Skin Cancer Detection using Machine Learning Techniques* <sup>1</sup> for this example. Feel free to use any paper you would like to retrieve information from; the same steps apply to any paper or book. All you need to do is download it in PDF format and assign its file path to the variable `path`.

To pre-process the PDF document, we will use an embedding model. 

#### ⚠️ Now, what does “embedding” mean in AI?

In this tutorial, we are trying to get our AI model to understand a paper (complex text data). The problem is, our model can only understand numbers. That is where embeddings come in.

> An embedding is a way of representing complex data (like words or images) as a list of numbers — called a vector — in such a way that the relationships between items are preserved.


#### Let’s dive into that:

Think of each item (a word, an image, a sentence) as a point in space - a location on a map. The closer two points are, the more related their meanings are.

For example:

- The word “cat” will be close to “dog”.

- The word “car” will be far away from “banana”.

That’s because in real life, cats and dogs are similar (both animals, pets), while a car and a banana are not.

So embeddings help us map meaning into a mathematical space.
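
To make the map analogy concrete, here is a minimal sketch that compares hand-made 3-dimensional vectors with cosine similarity. The numbers in `toy_embeddings` are invented for illustration (real embeddings come from a trained model and usually have hundreds of dimensions), and `cosine_similarity` is a helper written only for this example.

```python
import math

# Toy vectors, invented for illustration only (not produced by a real model).
# The three dimensions loosely stand for "animal", "vehicle" and "food".
toy_embeddings = {
    "cat": [0.9, 0.1, 0.1],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.1, 0.9, 0.0],
    "banana": [0.1, 0.0, 0.9],
}

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: values near 1 mean 'same direction'."""
    dot_product = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot_product / (norm_a * norm_b)

cat_dog = cosine_similarity(toy_embeddings["cat"], toy_embeddings["dog"])
car_banana = cosine_similarity(toy_embeddings["car"], toy_embeddings["banana"])

print(f"cat vs dog:    {cat_dog:.3f}")
print(f"car vs banana: {car_banana:.3f}")
```

With these toy vectors, the "cat"/"dog" score comes out much higher than the "car"/"banana" score, which is the behaviour we expect from a real embedding model.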

#### 🧐 What is an embedding model?

An embedding model is an AI model that has learned how to take something complex — like a sentence — and turn it into a vector (a list of numbers) that captures its meaning.

Different embedding models specialize in different kinds of data. The table below shows some examples of open-source embedding models for different use cases:


| Data type | Embedding model examples | What do they capture? |
|-----------|--------------------------|-----------------------|
| Words | Word2Vec, GloVe, FastText | Word meanings, analogies, syntactic similarity |
| Sentences / Text | Sentence-BERT (SBERT), Instructor, E5 | Semantic similarity between sentences/documents |
| Images | DINO, OpenCLIP | Visual concepts, cross-modal (image-text) meaning |
| Audio | Wav2Vec 2.0, Whisper | Speech content, audio features |
| Code | CodeBERT, GraphCodeBERT | Code syntax and semantics |

In this tutorial we want to read PDF documents, so we need a model that embeds text based on semantic similarity. I have chosen Sentence-BERT, but it is interchangeable with any sentence / text embedding model. Once you have built your own RAG-assisted LLM, you can experiment with different models and decide what works best for you.

Note that embedding models do not operate on words or sentences directly; they first split the text into tokens and embed those.

#### ❓ What is a token?

A token is the smallest unit of input that a language model (like GPT or BERT) understands.

In most modern NLP systems, tokens are not exactly words — they can be:

- A whole word (hello)

- A subword (un, believ, able)

- A punctuation mark (!, .)

- Even just a few characters (Th, is)

Think of a token as a "chunk" of text — a building block the model processes one at a time.

> **Example**
> 
> Sentence: "This is amazing!" might be tokenized as:
> 
> ['This', ' is', ' amazing', '!']


> Note that, on average in English text, 1 token corresponds to roughly 4 characters.
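
As a quick illustration of the 4-characters-per-token rule of thumb, the sketch below estimates the token count of the example sentence. `estimate_token_count` is a hypothetical helper written for this example; it is only an approximation, and a real tokenizer may split the text differently.

```python
def estimate_token_count(text: str, chars_per_token: float = 4.0) -> float:
    """Rough token estimate based on the ~4 characters per token rule of thumb."""
    return len(text) / chars_per_token

sentence = "This is amazing!"
estimated_tokens = estimate_token_count(sentence)

# Prints: 16 characters -> about 4.0 tokens
print(f"{len(sentence)} characters -> about {estimated_tokens} tokens")
```

The estimate agrees with the four tokens in the example above, but for exact counts you would run the tokenizer of the model you are using.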


#### Now that we know how the data pre-processing will work, let's get started!




<sup>1</sup> M. Vidya and M.V. Karki, "Skin Cancer Detection using Machine Learning Techniques", 2020 IEEE International Conference on Electronics, Computing and Communication Technologies, Bangalore, India, 2020, pp. 1-5, doi: 10.1109/CONECCT50063.2020.9198489.








#### 2.1. Importing the relevant modules, getting the PDF we want to read, and extracting text from it:

In [24]:
# Import relevant modules
import fitz
import os
import requests
import re
from tqdm.auto import tqdm

# Get PDF path (change this variable to your pdf path)
#____________________________________________________________________
path = r"G:\My Drive\feines 2025\MS Imaging paper\to submit.pdf"
#____________________________________________________________________

# Check that the path exists
if os.path.exists(path):
    print(f"PDF file '{path}' exists.")
else:
    print(f"PDF file '{path}' does not exist")

# Open the PDF file
paper = fitz.open(path)
    
# Define a helper function to extract text from the pdf
def extract_text(paper: fitz.Document):
  """Applies formatting to the PDF text and stores the content in a list of dictionaries
  Inputs: 
      paper (fitz.Document): PDF document
  Outputs: 
      output (list[dict]): List of dictionaries containing the formatted extracted text from each PDF page 
      and the corresponding page number
  """

  # Define an empty list that will be filled with the extracted text
  output = []

  # Loop over the pages, clean up the whitespace, and store the text with its page number
  for page_number, page in enumerate(paper):
    paper_text = page.get_text()
    paper_text = re.sub(r'\s+', ' ', paper_text).strip() # collapses newlines and repeated whitespace into single spaces
    output.append({"page_number": page_number,       
                   "text": paper_text
                   })
  return output

# Check that the helper function works as expected by printing the second page (index 1)
output = extract_text(paper=paper)
print(output[1])

PDF file 'G:\My Drive\feines 2025\MS Imaging paper\to submit.pdf' exists.
{'page_number': 1, 'text': 'Various studies have proposed using smartphone cameras for MSI, mainly motivated by the biomedical optics community, with the aim of monitoring haemodynamics by detecting the different spectral characteristics of oxygenated and deoxygenated haemoglobin in blood. Some of these require modifications or additions to the smartphone [7–10] but a new approach by He and Wang [11,12] was able to derive simulated multispectral images from an unmodified smartphone camera. Here, we adapt the method of He and Wang and apply it to generating simulated multispectral images from digitised photographs of a palimpsest. The photographs were acquired using standard digitisation protocols so the method described here could be applied to any digitised images. The technique requires a colourchecker chart which is imaged using a multispectral imaging system and with standard photography. These images are pro

#### 2.2. Chunking the extracted text

First of all, we will use the NLP library **spaCy** to divide our extracted text into sentences.

Embedding models can only process a limited number of tokens at once, so we need to limit the number of tokens per input by chunking the text into groups of sentences.

For this tutorial I have split the text into chunks of 15 sentences, although this number is arbitrary. Feel free to experiment and decide what works best with your model. Criteria to keep in mind:
1. Smaller groups of text will be easier to inspect, making it easier to filter content
2. The text chunks need to fit into our embedding model's context window
3. Chunks too large will make the context that will be passed to the LLM too vague
4. Chunks too short might leave out information that is also relevant / be misleading
5. We want to find a chunk size so that the context passed to the LLM will be specific and focused
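
One way to check the second criterion is to estimate each chunk's token count and flag the chunks that would not fit. This is a minimal sketch under two assumptions: the 4-characters-per-token rule of thumb from earlier, and a 512-token limit chosen for illustration (check your embedding model's documentation for its actual context window). `find_oversized_chunks` is a hypothetical helper, not part of any library.

```python
def find_oversized_chunks(chunks: list[str], max_tokens: int) -> list[int]:
    """Return the indices of chunks whose estimated token count exceeds max_tokens."""
    oversized = []
    for index, chunk in enumerate(chunks):
        estimated_tokens = len(chunk) / 4  # rule of thumb: ~4 characters per token
        if estimated_tokens > max_tokens:
            oversized.append(index)
    return oversized

# Made-up chunks for illustration: one short, one very long
example_chunks = [
    "A short chunk.",
    "word " * 600,  # 3000 characters, roughly 750 tokens
]

max_tokens = 512  # assumed context window, for illustration only
oversized_indices = find_oversized_chunks(example_chunks, max_tokens)
print(f"Chunks that may not fit: {oversized_indices}")
```

If a chunk is flagged, reducing the number of sentences per chunk is the simplest fix.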



In [25]:
from spacy.lang.en import English

spacy = English()

# Add a sentencizer pipeline (sentencizer turns text into sentences)
# You can check the documentation at https://spacy.io/api/sentencizer
spacy.add_pipe("sentencizer")

# Test that the sentencizer works
test_spacy = spacy("SpaCy is an NLP library. It splits text into sentences. Let's test it.")
assert len(list(test_spacy.sents)) == 3
print(list(test_spacy.sents))

# Define the number of sentences per chunk
len_chunks = 15

# Create a function to split a list into slices of a given size
def split_list(input_list: list[str],
               slice_size: int = len_chunks):
  """Splits a list into consecutive slices of at most slice_size items"""
  # Reminder: range(start, stop, step)
  return [input_list[i:(i + slice_size)] for i in range(0, len(input_list), slice_size)]

test_list = list(range(25))
split_list(test_list)


[SpaCy is an NLP library., It splits text into sentences., Let's test it.]


[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14],
 [15, 16, 17, 18, 19, 20, 21, 22, 23, 24]]

In [26]:
# Loop through pages and split the text into sentences
for item in tqdm(output):
  # Get a list of sentences in the current item's text:
  item["sentences"] = list(spacy(item["text"]).sents)
  # Make sure all sentences are strings:
  item["sentences"] = [str(sentence) for sentence in item["sentences"]]
  # Split the page's sentences into chunks, then get the number of chunks on the page:
  item["sentence_chunks"] = split_list(input_list=item["sentences"],
                                       slice_size=len_chunks)
  item["num_chunks"] = len(item["sentence_chunks"])
  # Count the sentences:
  item["page_sentence_count_spacy"] = len(item["sentences"])



100%|██████████████████████████████████████████████████████████████████████████████████████████████| 14/14 [00:00<00:00, 181.10it/s]


In [15]:
import random
random.sample(output, k=1)

[{'page_number': 10,
  'text': '5. Application to Archimedes Palimpsest The photographs of the palimpsest and ColorChecker described in section 3 were taken using the same camera, lens, lighting system and processing pipeline. It would be useful to know how far this can be generalised – can a W matrix calculated for a particular photography set-up be applied to photographs taken using a different set-up? This will not give optimal results, but given that PCA is non-quantitative, an approximate rendering of the multispectral images might still offer some useful information. Note that if more sophisticated, calibrated or quantitative image analysis was to be attempted, we would not consider using an unmatched W matrix. Indeed, we advise against using simulated multispectral images for any extended analysis. We examine the hypothesis that a generic W matrix can be used in some circumstances by reanalysing photographs taken of the Archimedes Palimpsest between 1999 and 2008 ([1–3]), downlo

#### 2.3. Embedding each text chunk
We want to **embed** each chunk of sentences into its own **numerical representation**.

This gives us a good level of granularity: when the model answers a question, we can trace the answer back to the specific chunk of text that was retrieved.

In [29]:

# Split each chunk into its own item
final_chunks = []
for item in tqdm(output):
  for sentence_chunk in item["sentence_chunks"]:
    chunk_dict = {}
    chunk_dict["page_number"] = item["page_number"] #get the page number

    # Join the sentences in a chunk into a paragraph and collapse any
    # newlines or repeated spaces into a single space:
    joined_chunk = re.sub(r'\s+', ' ', "".join(sentence_chunk)).strip()

    # Formatting specific to my document:
    # In 'joined_chunk', consecutive sentences are joined as 'end.Start'.
    # To add the missing space we use the regex library (re). The pattern '\.([A-Z])'
    # matches a '.' followed by any capital letter (A-Z), and the replacement
    # inserts 1 space after the '.'. So: ".A" -> ". A" (for any capital letter).
    joined_chunk = re.sub(r'\.([A-Z])', r'. \1', joined_chunk)

    chunk_dict["sentence_chunk"] = joined_chunk #add the joined paragraph as "sentence_chunks"

    # Estimate the token count (about 4 characters per token):
    chunk_dict["chunk_token_count"] = len(joined_chunk) / 4

    final_chunks.append(chunk_dict)

len(final_chunks)

100%|████████████████████████████████████████████████████████████████████████████████████████████| 14/14 [00:00<00:00, 16686.63it/s]


24

In [30]:
import random
random.sample(final_chunks, k=1)

[{'page_number': 8,
  'sentence_chunk': 'In this case, the simulated images unexpectedly appeared to show the greatest contrast to the undertext.',
  'chunk_token_count': 26.0}]

Optionally, filter out chunks that are too short to be useful, for example chunks under 30 tokens, which tend to be stray fragments rather than meaningful passages. Depending on your document, this step might be unnecessary.
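
Filtering out short chunks could be sketched as follows. The threshold of 30 tokens, the helper names `make_chunk` and `filter_short_chunks`, and the example chunks are all assumptions made for illustration; the dictionaries only mimic the keys used in `final_chunks`.

```python
def make_chunk(page_number: int, text: str) -> dict:
    """Build a stand-in chunk dictionary with the same keys as the items in final_chunks."""
    return {
        "page_number": page_number,
        "sentence_chunk": text,
        "chunk_token_count": len(text) / 4,  # ~4 characters per token
    }

def filter_short_chunks(chunks: list[dict], min_token_count: float = 30) -> list[dict]:
    """Keep only the chunks whose estimated token count is above min_token_count."""
    return [chunk for chunk in chunks if chunk["chunk_token_count"] > min_token_count]

# Made-up chunks for illustration (not taken from the paper)
example_chunks = [
    make_chunk(0, "Fig. 3"),
    make_chunk(1, "This sentence stands in for a full paragraph of the paper. " * 5),
]

chunks_over_min_tokens = filter_short_chunks(example_chunks, min_token_count=30)
print(f"Kept {len(chunks_over_min_tokens)} of {len(example_chunks)} chunks")
```

In the notebook you would pass `final_chunks` instead of `example_chunks`, and inspect a few of the discarded chunks before settling on a threshold.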

In [8]:
# Test the embedding model on a few example sentences
from transformers import AutoTokenizer, AutoModel
import torch

# Load the model and its tokenizer
tokenizer = AutoTokenizer.from_pretrained('allenai/specter2_base')
model = AutoModel.from_pretrained('allenai/specter2_base')

sentences = ["Sentence Transformer library provides an easy way to embed data.",
             "Sentences can be embedded one by one or in a list.",
             "I like dancing!"]

# Tokenize the sentences
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

# Forward pass to get hidden states
with torch.no_grad():
    outputs = model(**inputs)
    embeddings = outputs.last_hidden_state[:, 0, :]  # Take the CLS token (first token) as the embedding

# Now `embeddings` is a tensor with shape (3, hidden_size)
print(embeddings.shape)

# Map each sentence to its embedding
embeddings_dict = dict(zip(sentences, embeddings))
print(embeddings_dict)

torch.Size([3, 768])


In [37]:
from transformers import AutoTokenizer, AutoModel
import torch
# load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('allenai/specter2_base')
model = AutoModel.from_pretrained('allenai/specter2_base')

for item in tqdm(final_chunks):
    # Tokenize the chunk (truncated to at most 512 tokens)
    inputs = tokenizer(item["sentence_chunk"], padding=True, truncation=True, max_length=512, return_tensors="pt")

    # Forward pass to get hidden states
    with torch.no_grad():
        outputs = model(**inputs)
        item["embeddings"] = outputs.last_hidden_state[:, 0, :]  # Take the CLS token (first token) as the embedding

# Each item["embeddings"] is now a tensor with shape (1, hidden_size)
print(final_chunks[1])

100%|███████████████████████████████████████████████████████████████████████████████████████████████| 24/24 [00:10<00:00,  2.39it/s]

{'page_number': 0, 'sentence_chunk': 'Multispectral imaging was developed as a method for recovering the lost text and proved to be extremely successful when applied to the Archimedes Palimpsest in the late 1990s and early 2000s, revealing works by Archimedes and other authors that were thought to have been lost [1–3]. Multispectral imaging (MSI) uses wavelength-selective illumination to record multiple photographs of an object, showing its response in the near ultraviolet, visible and near infrared spectral range. Filters may increase sensitivity to fluorescence by excluding the illuminating light. As well as being used to reveal underwriting, MSI has also been used to detect other features that are invisible to the human eye such as erased inscriptions, faded text and previous versions of text or paintings [4,5]. MSI typically offers excellent spatial resolution but modest spectral resolution and its spectral data is uncalibrated. Using Delaney’s [6] terminology, MSI is referred to a




In [36]:
for item in tqdm(final_chunks):
    # Print each chunk to inspect the text that was embedded
    print(item["sentence_chunk"])

100%|████████████████████████████████████████████████████████████████████████████████████████████| 24/24 [00:00<00:00, 11452.02it/s]

For submission to Heritage Science Real and simulated multispectral imaging of a palimpsest Adam Gibson1, Amy Howe2, Steve Wright2, Martina Sabate Monfort1, Terence Leung1, Angela Warren-Thomas3, Tabitha Tuckett3, Katy Makin3 1. UCL Medical Physics and Biomedical Engineering, Gower St, London WC1E 6BT 2. UCL Library Services, Gower St, London WC1E 6BT 3. UCL Special Collections, Gower St, London WC1E 6BT Abstract. We have recovered undertext from a palimpsest using multispectral imaging. Moreover, we have developed a method for generating simulated multispectral images from previously acquired digitised images of the manuscript, using knowledge of how a colourchecker chart appears in the multispectral images and in the standard digitised images. The ability to identify the undertext was generally better in the real multispectral images, though there were examples of improved identification in the simulated images. However, the method was unsuccessful when applied to freely available im




#### 2.4. Similarity Search

We want to search the document with a query (e.g. "macronutrient functions") and get back the most relevant chunks.

Steps to do this:
1. Define the query string
2. Turn the query string into an embedding
3. Compute the dot product or cosine similarity between the query embedding and each text embedding
4. Sort the results from step 3 in descending order


Note: to compare vectors with the dot product, make sure both tensors have the same embedding dimension and the same data type.
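
The difference between the dot product and cosine similarity is easiest to see on a toy example. The sketch below uses made-up 2-dimensional vectors and hypothetical helpers (`normalize`, `dot`, `top_k`); it is not the notebook's implementation, only an illustration of why unnormalised vectors can distort a dot-product ranking.

```python
import math

def normalize(vector: list[float]) -> list[float]:
    """Scale a vector to length 1 (assumes the vector is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def dot(a: list[float], b: list[float]) -> float:
    """Plain dot product of two vectors of the same length."""
    return sum(x * y for x, y in zip(a, b))

def top_k(query: list[float], chunks: list[list[float]], k: int) -> list[tuple[int, float]]:
    """Return (chunk index, cosine similarity) pairs for the k most similar chunks."""
    unit_query = normalize(query)
    # The dot product of two unit vectors is exactly their cosine similarity
    scores = [(i, dot(unit_query, normalize(chunk))) for i, chunk in enumerate(chunks)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]

# Made-up vectors for illustration only
query_vector = [0.5, 1.0]
chunk_vectors = [
    [10.0, 0.0],  # very long vector pointing away from the query
    [0.6, 0.8],
    [0.1, 1.0],
]

# Raw dot product: the long vector wins just because of its length
raw_scores = [dot(query_vector, chunk) for chunk in chunk_vectors]
best_raw_index = raw_scores.index(max(raw_scores))

# Cosine similarity: only the direction matters
top_results = top_k(query_vector, chunk_vectors, k=2)
top_indices = [index for index, _ in top_results]

print(f"Best chunk by raw dot product: {best_raw_index}")
print(f"Top chunks by cosine similarity: {top_indices}")
```

Here the raw dot product picks chunk 0, while cosine similarity ranks chunks 1 and 2 first. If your embedding model does not return normalised vectors, normalising them first (or using cosine similarity directly) avoids this effect.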

In [None]:
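# 1. Define query (change this to a question about your own paper)
query = "good foods for protein"
print(f"Query: {query}")

# 2. Embed the query
# Note: it's important to embed your query with the SAME MODEL as the chunk embeddings
# (here: the tokenizer and model loaded in 2.3)
query_inputs = tokenizer(query, padding=True, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    query_embedding = model(**query_inputs).last_hidden_state[:, 0, :]  # shape (1, hidden_size)

# Stack the chunk embeddings from 2.3 into one tensor of shape (num_chunks, hidden_size)
chunk_embeddings = torch.cat([item["embeddings"] for item in final_chunks], dim=0)

# 3. Get similarity scores
# with dot product (use cosine similarity if outputs of model aren't normalised)
from time import perf_counter as timer

start_time = timer()
dot_scores = torch.matmul(query_embedding, chunk_embeddings.T)[0]
end_time = timer()
print(f"Time taken to score {len(chunk_embeddings)} chunks: {end_time - start_time:.5f} seconds")

# 4. Get the top-k results (we keep top 5)
top_results_dot_product = torch.topk(dot_scores, 5)
top_results_dot_product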