# Step 1: Setting up the environment

#### To create and activate a virtual environment, run on your terminal:

**Windows:**


```
python -m venv venv
```
```
venv\Scripts\Activate
```

**macOS/Linux:**

```
python3 -m venv venv
```

```
source venv/bin/activate
```

#### After the environment is activated, install the requirements:

```
pip install -r requirements.txt
```


# Step 2: Pre-processing the document

#### What you will need:
1. PDF Document (skin cancer detection paper in my case)
2. Embedding model (I am using Sentence-Bert)

As described in the Readme file, the aim of this tutorial is to build a RAG-assisted LLM that can retrieve information from research papers, helping students and researchers get a quicker understanding of the paper. I will be using the paper *Skin Cancer Detection using ML Techniques* <sup>1</sup> for this example.  Feel free to use any paper you would like to retrieve information from, the same steps will apply to any paper / book. All you need to do is download it in pdf format and add your file path to the variable 'path'.  

To pre-process the PDF document, we will use an embedding model. 

#### ⚠️ Now, what does “embedding” mean in AI?

In this tutorial, we are trying to get our AI model to understand a paper (complex text data). The problem is, our model can only understand numbers. That is where embeddings come in.

> An embedding is a way of representing complex data (like words or images) as a list of numbers — called a vector — in such a way that the relationships between items are preserved.


#### Let’s dive into that:

Think of each item (a word, an image, a sentence) as a point in space - a location on a map. The closer two points are, the more related their meanings are.

For example:

- The word “cat” will be close to “dog”.

- The word “car” will be far away from “banana”.

That’s because in real life, cats and dogs are similar (both animals, pets), while a car and a banana are not.

So embeddings help us map meaning into a mathematical space.

#### 🧐 What is an embedding model?

An embedding model is an AI model that has learned how to take something complex — like a sentence — and turn it into a vector (a list of numbers) that captures its meaning.

Different embedding models specialize in different kinds of data. The table below shows some examples of open-source embedding models for different use cases:


| Data Type         | Embedding model examples    | What do they capture? |
|-------------------|-----------------------------|-------------------------------
| Words | Word2Vec, GloVe, FastText | Word meanings, analogies, syntactic similarity |
| Sentences / Text | Sentence-BERT (SBERT), Instructor, E5 | Semantic similarity between sentences/documents |
| Images              |  DINO, OpenCLIP   | Visual concepts, cross-modal (image-text) meaning   |
| Audio               |  Wav2Vec 2.0, Whisper  | Speech content, audio features   |
| Code | CodeBERT, GraphCodeBERT | Code syntax and semantics |  

In this tutorial we are looking to read PDF documents, therefore, we need a model that embeds data based on semantic similarity. I have chosen Sentence-BERT, but it is interchangable for any sentence / text embedding model. Once you have build your own RAG-assisted LLM, you can experiment with different models and decide what works best for you

Note that embedding models do not exactly embed words or sentences, they embed tokens.

#### ❓ What is a token?

A token is a smallest unit of input that a language model (like GPT or BERT) understands.

In most modern NLP systems, tokens are not exactly words — they can be:

- A whole word (hello)

- A subword (un, believ, able)

- A punctuation mark (!, .)

- Even just a few characters (Th, is)

Think of a token as a "chunk" of text — a building block the model processes one at a time.

> **Example**
> 
> Sentence: "This is amazing!" might be tokenized as:
> 
> ['This', ' is', ' amazing', '!']


.
> Note that, on average in English text, 1 token is equal to 4 characters.


#### Now that we know how the data pre_processing will work, let's get started!




<sup>1</sup> M. Vidya and M.V. Karki "Skin Cancer Detection using Machine Learning Techniques", 2020 IEEE International Conference on Electronics, Computing and Communication Technologies, Bangalore, India, 2020, pp. 1-5, doi 10.1109/CONECCT50063.2020.9198489.98489. 








#### 2.1. Importing the relevant modules, getting the PDF we want to read, and extracting text from it:

In [1]:
# Import relevant modules
import fitz
import os
import requests


# Get PDF path (change this variable to your pdf path)
#____________________________________________________________________
path = r"G:\My Drive\feines 2025\MS Imaging paper\to submit.pdf"
#____________________________________________________________________

# Check that the path exists
if os.path.exists(path):
    print(f"PDF file '{path}' exists.")
else:
    print(f"PDF file '{path}' does not exist")

# Open the PDF file
paper = fitz.open(path)

# Define a helper function to clean text
def format_pdf(paper_text: str): # REMOVE -> STR ???????????????????????
  """Use this function for any formatting you want to apply to the text prior to embedding.
  Inputs: 
      paper_text (str): PDF text
  Outputs: 
      formatted_pdf (str): PDF text with leading or trailing whitespaces removed
  """
  formatted_pdf = paper_text.strip()
  # add any other formatting features here
    
  return formatted_pdf
    
# Define a helper function to extract text from the pdf
def extract_text(formatted_pdf: str):
  """Reads the text content inside the provided PDF
  Inputs: 
      formatted_pdf (str): Formatted PDF text
  Outputs: 
      output (list[dict]): List of dictionaries containing the extracted text from each PDF page,
      alongside the page number
  """
  paper = fitz.open(path)

  # Define an empty list that will be filled with the extracted text
  output = []

  # CHANGE THIS LOOP A BIT MORE !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
  for page_number, page in enumerate(paper):
    paper_text = page.get_text()
    paper_text = format_pdf(paper_text=paper_text)
    output.append({"page_number": page_number,       
                   "token_count": len(paper_text) / 4,
                   "text": paper_text
                   })
  return output

# Check that the helper function works as expected by printing the first three pages
output = extract_text(path=path)
print(output)

  from .autonotebook import tqdm as notebook_tqdm


PDF file 'G:\My Drive\feines 2025\MS Imaging paper\to submit.pdf' exists.


[{'page_number': 0,
  'token_count': 723.0,
  'text': 'For submission to Heritage Science \n \nReal and simulated multispectral imaging of a palimpsest \n \nAdam Gibson1, Amy Howe2, Steve Wright2, Martina Sabate Monfort1, Terence Leung1, Angela \nWarren-Thomas3, Tabitha Tuckett3, Katy Makin3 \n1. UCL Medical Physics and Biomedical Engineering, Gower St, London WC1E 6BT \n2. UCL Library Services, Gower St, London WC1E 6BT \n3. UCL Special Collections, Gower St, London WC1E 6BT \n \nAbstract. We have recovered undertext from a palimpsest using multispectral imaging. \nMoreover, we have developed a method for generating simulated multispectral images \nfrom previously acquired digitised images of the manuscript, using knowledge of how a \ncolourchecker chart appears in the multispectral images and in the standard digitised \nimages. The ability to identify the undertext was generally better in the real \nmultispectral images, though there were examples of improved identification in the \n

#### 2.2. Chunking the extracted text

First of all, we will the NLP library **SpaCy** to divide our extracted text in sentences.

This is due to the fact that embedding models cannot process an infinite number of tokens, therefore we need to limit the number of tokens by chunking the text into groups of sentences.

For this tutorial I have split the text in chunks of 15 sentences, although this number is arbitrary. Feel free to experiment and decide what works best with your model. What is the criteria to keep in mind:
1. Smaller groups of text will be easier to inspect, making it easier to filter content
2. The text chunks need to fit into our embedding model's context window
3. Chunks too large will make the context that will be passed to the LLM too vague
4. Chunks too short might leave out information that is also relevant / be misleading
5. We want to find a chunk size so that the context passed to the LLM will be specific and focused

# REVIEW CONTENT (siml) FROM HERE


In [2]:
from spacy.lang.en import English

spacy = English()

# Add a sentencizer pipeline (sentencizer turns text into sentences)
# You can check the documentation at https://spacy.io/api/sentencizer
spacy.add_pipe("sentencizer")

# Test that the sentencizer works
test_spacy = spacy("SpaCy is an NLP library. It splits text into sentences. Let's test it.")
assert len(list(test_spacy.sents)) == 3
print(list(test_spacy.sents))

# Define the number of sentences per chunk
len_chunks = 15

# Create a function to split the text into chunk size
def split_list(input_list: list[str],
               slice_size: int = len_chunks):
  """Splits text into chunk size"""
  return [input_list[i:(i + slice_size)] for i in range(0, len(input_list), (slice_size))] #CHANGE!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
  # Reminder: range(start, stop, step)

test_list = list(range(25))
split_list(test_list)


[SpaCy is an NLP library., It splits text into sentences., Let's test it.]


[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14],
 [15, 16, 17, 18, 19, 20, 21, 22, 23, 24]]

In [16]:
# Loop through pages and split the text into sentences
for item in tqdm(output):
  # Get a list of sentences in the current item's text:
  item["sentences"] = list(spacy(item["text"]).sents)
  # Make sure all sentences are strings:
  item["sentences"] = [str(sentence) for sentence in item["sentences"]]
  # Loop through pages and split sentences into chunks, then get number of sentences per chunk:
  item["sentence_chunks"] = split_list(input_list=item["sentences"],
                                       slice_size=len_chunks)
  item["num_chunks"] = len(item["sentence_chunks"])
  # Count the sentences:
  item["page_sentence_count_spacy"] = len(item["sentences"])



100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 14/14 [00:00<00:00, 194.03it/s]


In [17]:
import random
random.sample(output, k=1)

[{'page_number': 5,
  'token_count': 52.0,
  'text': 'Figure 2: Real and simulated multispectral images of MS LAT/15 f20v. (a) is the original digitised \nimage. (b)-(e) are four selected reflectance MSI images and (f)-(l) the corresponding simulated MSI \nimages.',
  'sentences': ['Figure 2: Real and simulated multispectral images of MS LAT/15 f20v. (',
   'a) is the original digitised \nimage. (',
   'b)-(e) are four selected reflectance MSI images and (f)-(l) the corresponding simulated MSI \nimages.'],
  'page_sentence_count_spacy': 3,
  'sentence_chunks': [['Figure 2: Real and simulated multispectral images of MS LAT/15 f20v. (',
    'a) is the original digitised \nimage. (',
    'b)-(e) are four selected reflectance MSI images and (f)-(l) the corresponding simulated MSI \nimages.']],
  'num_chunks': 1}]

#### 2.3. Embedding each text chunk
# CHANGE
We want to **embed** each chunk of sentences into its own **numerical representation**.

That will give us a good level of granularity, meaning, we can dive specifically into the text sample that was used in our model.

In [23]:
import re

# Split each chunk into its own item
final_chunks = []
for item in tqdm(output):
  for sentence_chunk in item["sentence_chunks"]:
    chunk_dict = {}
    chunk_dict["page_number"] = item["page_number"] #get the page number

    # Join sentences in a chunk into a paragraph:
    joined_chunk = "".join(sentence_chunk).replace("  ", " ").replace("\n", " ").strip() # is there a way to replace all multiple spaces for 1

    # ADD FORMATTING SPECIFIC TO MY DOCUMENT_______________________________________________________________

    # For the 'joined_chunk', new sentences will be joined as 'end.Start'
    # To add a space, we use library regex (re). '\.([A-Z])' means for any chars
    # with this format (. followed by any capital letter (A-Z)), add
    # 1 space after '.'. So: ".A" -> ". A" (for any capital letter).
    joined_chunk = re.sub(r'\.([A-Z])', r'. \1', joined_chunk)

    chunk_dict["sentence_chunk"] = joined_chunk #add the joined paragraph as "sentence_chunks"

    #Get stats:
    chunk_dict["chunk_token_count"] = len(joined_chunk) / 4

    final_chunks.append(chunk_dict)

len(final_chunks)

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 14/14 [00:00<?, ?it/s]


24

In [28]:
random.sample(final_chunks, k=1)

[{'page_number': 3,
  'sentence_chunk': 'The original digitised images were downloaded and inspected. Multispectral images were acquired  of 13 sheets using the UCL Multispectral Imaging System. Simulated MSI images were generated  using the process described in section 2.4, and using a W matrix that mapped between the Canon  EOS 6D camera used for digitisation and the PhaseOne system used for MSI.    2.5  Post processing',
  'chunk_token_count': 96.0}]

# FILTER OUT IRRELEVANT CHUNKS_____________
e.g. he does under 30 tokens see what would be useful for me otherwise remove
might be unnecessary

Chunk token count: 26.5 | Text: In  this case, the simulated images unexpectedly appeared to show the greatest contrast to the  undertext.


In [None]:
from sentence_transformers import SentenceTransformer

# Selects one of many open-source embedding models (high quality but slower)
# model doc: https://huggingface.co/sentence-transformers/all-mpnet-base-v2
embedding_model = SentenceTransformer("all-mpnet-base-v2",
                                      device = "cpu")

# Creates list of sentences as an embedding demo
sentences = ["Sentence Transformer library provides an easy way to embed data.",
             "Sentences can be embedded one by one or in a list.",
             "I like dancing!"]
# With embedding we try to capture meaning with a value. In this case, sentence
# number 3 should be further from 1 and 2.

embeddings = embedding_model.encode(sentences)
embeddings_dict = dict(zip(sentences, embeddings))

# Shows embeddings
for sentence, embedding in embeddings_dict.items():
  print(f"Sentence: {sentence}")
  print(f"Embedding: {embedding}")
  print(" ")

In [40]:
from transformers import AutoTokenizer
# load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('allenai/specter2')
model = AutoModel.from_pretrained('allenai/specter2')

sentences = ["Sentence Transformer library provides an easy way to embed data.",
             "Sentences can be embedded one by one or in a list.",
             "I like dancing!"] # CHANGE______________________________________________________________

# Tokenize the sentences
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

# Forward pass to get hidden states
with torch.no_grad():
    outputs = model(**inputs)
    embeddings = outputs.last_hidden_state[:, 0, :]  # Take the CLS token (first token) as the embedding

# Now `embeddings` is a tensor with shape (3, hidden_size)
print(embeddings.shape) 

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
