# Step 1: Setting up the environment

#### To create and activate a virtual environment, run the following in your terminal:

**Windows:**


```
python -m venv venv
```
```
venv\Scripts\Activate
```

**macOS/Linux:**

```
python3 -m venv venv
```

```
source venv/bin/activate
```

#### After the environment is activated, install the requirements:

```
pip install -r requirements.txt
```
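
If you want to confirm that the packages went into the virtual environment rather than the system Python, a quick check like the one below can help. This is only a minimal sketch: the function name `in_virtualenv` is my own, and it relies on the fact that inside a `venv` the value of `sys.prefix` differs from `sys.base_prefix`.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter is running inside a venv."""
    # Inside a venv, sys.prefix points at the environment folder,
    # while sys.base_prefix still points at the base Python install.
    return sys.prefix != sys.base_prefix


print("Virtual environment active:", in_virtualenv())
print("Interpreter in use:", sys.executable)
```

If the first line prints `False`, the environment is probably not activated in the terminal or notebook kernel you are using.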


# Step 2: Pre-processing the document

#### What you will need:
1. PDF document (a skin cancer detection paper in my case)
2. Embedding model (I am using Sentence-BERT)

As described in the README file, the aim of this tutorial is to build a RAG-assisted LLM that can retrieve information from research papers, helping students and researchers understand a paper more quickly. I will be using the paper *Skin Cancer Detection using Machine Learning Techniques* <sup>1</sup> for this example. Feel free to use any paper you would like to retrieve information from; the same steps apply to any paper or book. All you need to do is download it in PDF format and set the variable `pdf_path` to your file path.

To pre-process the PDF document, we will use an embedding model. 

#### ⚠️ Now, what does “embedding” mean in AI?

In this tutorial, we are trying to get our AI model to understand a paper (complex text data). The problem is, our model can only understand numbers. That is where embeddings come in.

> An embedding is a way of representing complex data (like words or images) as a list of numbers — called a vector — in such a way that the relationships between items are preserved.


#### Let’s dive into that:

Think of each item (a word, an image, a sentence) as a point in space - a location on a map. The closer two points are, the more related their meanings are.

For example:

- The word “cat” will be close to “dog”.

- The word “car” will be far away from “banana”.

That’s because in real life, cats and dogs are similar (both animals, pets), while a car and a banana are not.

So embeddings help us map meaning into a mathematical space.
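
One common way to measure how close two embeddings are is cosine similarity. The sketch below uses tiny, hand-made 3-dimensional vectors purely for illustration: the numbers are invented, not produced by any model, and real embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up toy vectors; the dimensions loosely mean (animal, vehicle, food)
toy_embeddings = {
    "cat":    [0.9, 0.1, 0.1],
    "dog":    [0.8, 0.2, 0.1],
    "car":    [0.1, 0.9, 0.0],
    "banana": [0.1, 0.0, 0.9],
}

cat_dog = cosine_similarity(toy_embeddings["cat"], toy_embeddings["dog"])
car_banana = cosine_similarity(toy_embeddings["car"], toy_embeddings["banana"])

print(f"cat vs dog:    {cat_dog:.2f}")
print(f"car vs banana: {car_banana:.2f}")
```

With these toy values, the cat/dog score comes out close to 1 (very similar) and the car/banana score close to 0 (unrelated).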

#### 🧐 What is an embedding model?

An embedding model is an AI model that has learned how to take something complex — like a sentence — and turn it into a vector (a list of numbers) that captures its meaning.

Different embedding models specialize in different kinds of data. The table below shows some examples of open-source embedding models for different use cases:


| Data type | Embedding model examples | What do they capture? |
|-----------|--------------------------|-----------------------|
| Words | Word2Vec, GloVe, FastText | Word meanings, analogies, syntactic similarity |
| Sentences / Text | Sentence-BERT (SBERT), Instructor, E5 | Semantic similarity between sentences/documents |
| Images | DINO, OpenCLIP | Visual concepts, cross-modal (image-text) meaning |
| Audio | Wav2Vec 2.0, Whisper | Speech content, audio features |
| Code | CodeBERT, GraphCodeBERT | Code syntax and semantics |

In this tutorial we want to read PDF documents, so we need a model that embeds text based on semantic similarity. I have chosen Sentence-BERT, but it is interchangeable with any other sentence / text embedding model. Once you have built your own RAG-assisted LLM, you can experiment with different models and decide what works best for you.
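
As a rough sketch of what using a Sentence-BERT model looks like, the snippet below relies on the `sentence-transformers` package. The checkpoint name `all-MiniLM-L6-v2` and the helpers `embed_sentences` and `most_similar` are my assumptions for illustration; no specific checkpoint has been fixed at this point in the tutorial, and the package has to be installed in your environment for `embed_sentences` to run.

```python
def embed_sentences(sentences: list[str], model_name: str = "all-MiniLM-L6-v2"):
    """Encode sentences into vectors with a Sentence-BERT style model."""
    # Imported lazily so the heavy dependency is only needed when called.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)  # downloads weights on first use
    return model.encode(sentences)           # one vector per sentence


def most_similar(query_vec, candidate_vecs) -> int:
    """Return the index of the candidate with the highest cosine similarity."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
        return dot / norm

    scores = [cosine(query_vec, vec) for vec in candidate_vecs]
    return scores.index(max(scores))
```

For example, calling `vectors = embed_sentences([...])` and then `most_similar(vectors[0], vectors[1:])` gives the position of the sentence closest in meaning to the first one.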

#### Now that we know how the data pre-processing will work, let's get started!







<sup>1</sup> M. Vidya and M. V. Karki, "Skin Cancer Detection using Machine Learning Techniques," 2020 IEEE International Conference on Electronics, Computing and Communication Technologies, Bangalore, India, 2020, pp. 1-5, doi: 10.1109/CONECCT50063.2020.9198489.



In [14]:
# Import relevant modules
import os
import requests
import fitz  # provided by PyMuPDF (pip install pymupdf), not the unrelated "fitz" package
from tqdm.auto import tqdm
# Note: tqdm displays a progress bar for any loop you wrap with it

# Get PDF path 
# Change this variable to your pdf path______________________________
pdf_path = r"G:\My Drive\feines 2025\MS Imaging paper\to submit.pdf"
#____________________________________________________________________

# Check that the path exists
if os.path.exists(pdf_path):
    print(f"PDF file '{pdf_path}' exists.")
else:
    print(f"PDF file '{pdf_path}' does not exist")

# Open the PDF file
document = fitz.open(pdf_path)


In [None]:
# Define helper function to clean text
def format_pdf(pdf_text: str) -> str:
  """Use this function for any formatting you want to apply to the text prior to embedding.
  Input: PDF text
  Output: PDF text with newlines replaced by spaces"""
  formatted_pdf = pdf_text.replace("\n", " ").strip()
  # add any other formatting features here
  return formatted_pdf

# Define helper function to open and read pdfs
def open_and_read_pdf(pdf_path: str) -> list[dict]:
  doc = fitz.open(pdf_path)
  output = [] #empty list that will be filled with info in the pdf's pages
  # Get each page (enumerate is used to number the pages) and apply pre-process
  # helper function
  for page_number, page in tqdm(enumerate(doc)):
    text = page.get_text()
    text = format_pdf(pdf_text=text)
    output.append({"page_number": page_number + 1,          #page number (enumerate starts at 0)
                   "page_char_count": len(text),            #chars in the page
                   "page_word_count": len(text.split(" ")), #words in the page
                   "page_sentence_count_raw": len(text.split(". ")), #sentences
                   "page_token_count": len(text) / 4,       #tokens in the page
                   "text": text                             #text in the page
                   # Reminder: 'Hello, World!' has 4 tokens: 'Hello', ',',
                   # 'World', and '!'. 1 token ~= 4 char on average.
                   })
  return output

# Apply open and read helper function to pdf and display the first two pages
output = open_and_read_pdf(pdf_path=pdf_path)
output[:2]
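
Once the pages are loaded, a quick summary of the per-page statistics is a handy sanity check on the extraction. The sketch below is one possible way to do that with the standard library only; `summarize_pages` is a hypothetical helper, and the two sample pages are invented stand-ins with the same keys as the real `output` list.

```python
from statistics import mean


def summarize_pages(pages: list[dict]) -> dict:
    """Average the per-page counts produced by open_and_read_pdf."""
    keys = ["page_char_count", "page_word_count", "page_token_count"]
    return {key: mean(page[key] for page in pages) for key in keys}


# Invented sample in the same shape as the tutorial's `output` list
sample_pages = [
    {"page_number": 1, "page_char_count": 400, "page_word_count": 80,
     "page_token_count": 400 / 4, "text": "..."},
    {"page_number": 2, "page_char_count": 800, "page_word_count": 160,
     "page_token_count": 800 / 4, "text": "..."},
]

summary = summarize_pages(sample_pages)
print(summary)
```

In the notebook you would pass the real list instead, for example `summarize_pages(output)`.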