##### The aim of this project is to create a Retrieval-Augmented Generation (RAG) pipeline and run it on a local GPU. In this notebook in particular, the "retrieval" part of the pipeline is performed by following the steps below:

* Open a PDF document.
* Do the exploratory data analysis.* Prepare the text of the PDF textbook for an embedding model by splitting it into chunks.
* Convert all text chunks into numerical representations for storage. 
* Store the numerical representation.

In [1]:
import fitz
import os
from tqdm.auto import tqdm
import pandas as pd
import random
from spacy.lang.de import German
import re 
from sentence_transformers import SentenceTransformer
import torch

In [2]:
pdf_path1 = "C:\\Users\\asus\\WeitBlick.pdf"
pdf_path2 = "C:\\Users\\asus\\ParkAllee.pdf"

In [3]:
token_length = 7
## Rough heuristic found online: a German token averages about 6-7 characters.
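
The character-based heuristic can be written as a small helper. This is only a rough sketch: the name `estimate_token_count` is hypothetical, the default of 7 characters per token is the assumption stated above, and counts from a real tokenizer may differ.

```python
def estimate_token_count(text: str, chars_per_token: float = 7) -> float:
    """Estimate the token count from the character count (rough heuristic)."""
    return len(text) / chars_per_token

# "Versicherungsvertrag" has 20 characters, so the estimate is 20 / 7 tokens.
word = "Versicherungsvertrag"
print(len(word), round(estimate_token_count(word), 2))
```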

In [4]:
doc1 = fitz.open(pdf_path1)

### Document/Text Processing

In [5]:
def read_pdf(doc) -> list[dict]:
    """
    Collects cleaned text and basic statistics for each page of a PDF document.

    Each page is stored as a dictionary, and the dictionaries are gathered in a list for convenient access later.

    Parameters:
    doc: the opened PDF document to be processed.

    Returns:
    list[dict]: A list of dictionaries, each containing the page number, character count, word count, raw sentence count, estimated token count, and the extracted text of one page.
    """
    pages = []
    
    for page_number, page in tqdm(enumerate(doc)):
        text = page.get_text()
        text = text.replace("\n", " ").strip() 
        text = re.sub(r'\.{2,}', ' ', text)
        text = re.sub(r'BASIS_PACK_WB', '', text)
        text = re.sub(r'Seite \([^)]*\) von \([^)]*\)', '', text)
        text = re.sub(r'\bWB\w*', '', text)
        pages.append({"page_number": page_number + 1,
                               "char_count_per_page": len(text),
                               "word_count_per_page": len(text.split(" ")),
                               "sentence_count_raw_per_page": len(text.split(". ")),
                               "token_count_per_page": len(text) / token_length ,
                               "text": text})
        
    return pages
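
The `Seite` pattern in `read_pdf` expects literal parentheses, so footers such as `Seite 1 von 3` pass through unchanged, as the extracted page text shows. A minimal sketch of a digit-based variant, assuming footers always have the form `Seite <n> von <m>` (the helper name `strip_page_footer` is hypothetical and not part of the notebook):

```python
import re

def strip_page_footer(text: str) -> str:
    """Remove footers of the form 'Seite <n> von <m>' (assumed footer format)."""
    return re.sub(r"Seite \d+ von \d+\s*", "", text)

sample = "Informationen für Ihren Versicherungsvertrag Seite 1 von 3 /D/1006/XIII/03/22"

# The parenthesised pattern only matches text like 'Seite (1) von (3)',
# so the sample passes through unchanged.
unchanged = re.sub(r"Seite \([^)]*\) von \([^)]*\)", "", sample)

# The digit-based variant removes the footer.
cleaned = strip_page_footer(sample)
print(cleaned)
```

Applying this inside `read_pdf` would change the character and token counts reported in the outputs of this notebook.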
    

In [6]:
pages = read_pdf(doc1)

In [7]:
pages

[{'page_number': 1,
  'char_count_per_page': 0,
  'word_count_per_page': 1,
  'sentence_count_raw_per_page': 1,
  'token_count_per_page': 0.0,
  'text': ''},
 {'page_number': 2,
  'char_count_per_page': 173,
  'word_count_per_page': 16,
  'sentence_count_raw_per_page': 1,
  'token_count_per_page': 24.714285714285715,
  'text': ' /D/1006/XIII/03/22 Inhaltsübersicht Informationen für Ihren Versicherungsvertrag Steuerinformationen Das Kleingedruckte – mal ganz groß: Allgemeine Versicherungsbedingungen'},
 {'page_number': 3,
  'char_count_per_page': 2445,
  'word_count_per_page': 313,
  'sentence_count_raw_per_page': 14,
  'token_count_per_page': 349.2857142857143,
  'text': 'Informationen für Ihren Versicherungsvertrag Seite 1 von 3  /D/1006/XIII/03/22 1 Wer ist Ihr Vertragspartner Versicherer ist die Standard Life International DAC (90 St Stephens Green, Dublin 2, Irland,  Register-Nr. 408507). Die Anschrift der für Sie zuständigen Zweigniederlassung lautet:  Standard Life Versicherung Z

In [8]:
df_pages = pd.DataFrame(pages)
df_pages.describe()

Unnamed: 0,page_number,char_count_per_page,word_count_per_page,sentence_count_raw_per_page,token_count_per_page
count,44.0,44.0,44.0,44.0,44.0
mean,22.5,2518.0,348.068182,17.5,359.714286
std,12.845233,1340.716816,179.334869,9.431516,191.530974
min,1.0,0.0,1.0,1.0,0.0
25%,11.75,1105.75,182.0,12.25,157.964286
50%,22.5,3108.5,427.0,19.0,444.071429
75%,33.25,3485.5,483.5,24.0,497.928571
max,44.0,5304.0,634.0,33.0,757.714286


The summary shows that some pages contain only a handful of words (the minimum word count is 1). Such pages are unlikely to carry useful information, so pages with 25 words or fewer are removed.

In [9]:
df_pages = df_pages[df_pages["word_count_per_page"] > 25]
df_pages.head(5)

Unnamed: 0,page_number,char_count_per_page,word_count_per_page,sentence_count_raw_per_page,token_count_per_page,text
2,3,2445,313,14,349.285714,Informationen für Ihren Versicherungsvertrag S...
3,4,2390,329,15,341.428571,Informationen für Ihren Versicherungsvertrag S...
4,5,245,27,2,35.0,Informationen für Ihren Versicherungsvertrag S...
5,6,892,108,7,127.428571,/D/1006/XIII/03/22 Steuerinformationen zu Wei...
6,7,3444,401,26,492.0,Steuerinformationen Seite 1 von 3 /D/1006/XII...


In [10]:
pages = df_pages.to_dict(orient = "records")

In [11]:
nlp = German()
nlp.add_pipe("sentencizer")  ## add a sentencizer pipeline


<spacy.pipeline.sentencizer.Sentencizer at 0x23e5f27ee90>

In [12]:
for item in tqdm(pages):
    item["sentences"] = list(nlp(item["text"]).sents)
    
    # Make sure all sentences are strings
    item["sentences"] = [str(sentence) for sentence in item["sentences"]]
    
    # Count the sentences 
    item["sentence_count_stacy_per_page"] = len(item["sentences"])



In [13]:
df_pages = pd.DataFrame(pages)
df_pages.head(5)

Unnamed: 0,page_number,char_count_per_page,word_count_per_page,sentence_count_raw_per_page,token_count_per_page,text,sentences,sentence_count_stacy_per_page
0,3,2445,313,14,349.285714,Informationen für Ihren Versicherungsvertrag S...,[Informationen für Ihren Versicherungsvertrag ...,17
1,4,2390,329,15,341.428571,Informationen für Ihren Versicherungsvertrag S...,[Informationen für Ihren Versicherungsvertrag ...,15
2,5,245,27,2,35.0,Informationen für Ihren Versicherungsvertrag S...,[Informationen für Ihren Versicherungsvertrag ...,1
3,6,892,108,7,127.428571,/D/1006/XIII/03/22 Steuerinformationen zu Wei...,[ /D/1006/XIII/03/22 Steuerinformationen zu We...,7
4,7,3444,401,26,492.0,Steuerinformationen Seite 1 von 3 /D/1006/XII...,[Steuerinformationen Seite 1 von 3 /D/1006/XI...,21


Let's group the sentences into sets of 8 and call each set a "chunk". Chunking divides text into manageable segments. This matters for two reasons: it keeps the text in roughly uniform sizes that are easier to manage, and it helps avoid exceeding the token capacity of the embedding model (for example, a model limited to 384 tokens truncates longer sequences, which can lose information).

In [14]:
num_sentence_per_chunk = 8

In [15]:
def split_list(input_list: list[str],
               slice_size: int = num_sentence_per_chunk) -> list[list[str]]:
    """
    Splits input_list into consecutive sublists of at most slice_size items.
    For example, with slice_size=8 a list of 17 sentences is split into
    three sublists containing 8, 8 and 1 sentences.
    """
    return [input_list[i:i + slice_size] for i in range(0, len(input_list), slice_size)]
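
As a quick check of the slicing behaviour, here is a minimal sketch with placeholder sentences. The function is re-declared with a fixed default of 8 so that the block runs on its own; the sentences are invented stand-ins for the sentences of one page.

```python
def split_list(input_list: list[str], slice_size: int = 8) -> list[list[str]]:
    """Split input_list into consecutive sublists of at most slice_size items."""
    return [input_list[i:i + slice_size] for i in range(0, len(input_list), slice_size)]

# 17 placeholder sentences, standing in for the sentences of one page.
sentences = [f"Satz {n}." for n in range(1, 18)]
chunk_sizes = [len(chunk) for chunk in split_list(sentences)]
print(chunk_sizes)  # [8, 8, 1]
```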

In [16]:
for item in tqdm(pages):
    item["chunks"] = split_list(item["sentences"], num_sentence_per_chunk)
    item["num_chunks"] = len(item["chunks"])

In [17]:
df_pages = pd.DataFrame(pages)
df_pages.head(5)

Unnamed: 0,page_number,char_count_per_page,word_count_per_page,sentence_count_raw_per_page,token_count_per_page,text,sentences,sentence_count_stacy_per_page,chunks,num_chunks
0,3,2445,313,14,349.285714,Informationen für Ihren Versicherungsvertrag S...,[Informationen für Ihren Versicherungsvertrag ...,17,[[Informationen für Ihren Versicherungsvertrag...,3
1,4,2390,329,15,341.428571,Informationen für Ihren Versicherungsvertrag S...,[Informationen für Ihren Versicherungsvertrag ...,15,[[Informationen für Ihren Versicherungsvertrag...,2
2,5,245,27,2,35.0,Informationen für Ihren Versicherungsvertrag S...,[Informationen für Ihren Versicherungsvertrag ...,1,[[Informationen für Ihren Versicherungsvertrag...,1
3,6,892,108,7,127.428571,/D/1006/XIII/03/22 Steuerinformationen zu Wei...,[ /D/1006/XIII/03/22 Steuerinformationen zu We...,7,[[ /D/1006/XIII/03/22 Steuerinformationen zu W...,1
4,7,3444,401,26,492.0,Steuerinformationen Seite 1 von 3 /D/1006/XII...,[Steuerinformationen Seite 1 von 3 /D/1006/XI...,21,[[Steuerinformationen Seite 1 von 3 /D/1006/X...,3


We will embed each chunk of sentences into its own numerical representation. To keep things clean, let's build our dataset from the chunks rather than from the pages.

In [18]:
chunks = []

for item in tqdm(pages):
    for chunk in item["chunks"]:
        chunk_dict = {}
        chunk_dict["page_number"] = item["page_number"]
        joined_sentence_chunk = "".join(chunk).replace("  ", " ").strip()
        joined_sentence_chunk = re.sub(r'\.([A-Z])', r'.\1', joined_sentence_chunk)
        chunk_dict["chunk"] = joined_sentence_chunk
        
        chunk_dict["char_count_per_chunk"] = len(joined_sentence_chunk)
        chunk_dict["word_count_per_chunk"] = len([word for word in joined_sentence_chunk.split(" ")])
        chunk_dict["token_count_per_chunk"] = len (joined_sentence_chunk) / token_length
        chunks.append(chunk_dict)

In [19]:
chunks

[{'page_number': 3,
  'chunk': 'Informationen für Ihren Versicherungsvertrag Seite 1 von 3 /D/1006/XIII/03/22 1 Wer ist Ihr Vertragspartner Versicherer ist die Standard Life International DAC (90 St Stephens Green, Dublin 2, Irland, Register-Nr.408507).Die Anschrift der für Sie zuständigen Zweigniederlassung lautet: Standard Life Versicherung Zweigniederlassung Deutschland der Standard Life International DAC Lyoner Straße 9 60528 Frankfurt/Main Ladungsfähige Anschrift und Sitz der Zweigniederlassung Standard Life Versicherung Zweigniederlassung Deutschland der Standard Life International DAC Lyoner Straße 9 60528 Frankfurt Die Zweigniederlassung ist eingetragen beim Amtsgericht Frankfurt am Main unter der Registernummer HRB 111481. Vertreter der Zweigniederlassung und zugleich Hauptbevollmächtigter: Richard Reinhard.Standard Life International DAC ist eine irische Versicherungsgesellschaft mit Sitz in Dublin und gehört zur Phoenix Gruppe in Großbritannien.Standard Life International DA
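
In the output above, neighbouring sentences run together (for example `Register-Nr.408507).Die`). spaCy sentence strings carry no trailing whitespace, so joining them with an empty string drops the separator, and the substitution `r'.\1'` writes each match back unchanged. A minimal sketch of a space-preserving join, assuming readable sentence boundaries are the goal (the helper name `join_sentences` is hypothetical):

```python
import re

def join_sentences(sentences: list[str]) -> str:
    """Join sentence strings with single spaces and collapse repeated whitespace."""
    joined = " ".join(sentence.strip() for sentence in sentences)
    return re.sub(r"\s+", " ", joined).strip()

# Two shortened sentences taken from the chunk shown above.
chunk = ["Register-Nr. 408507).", "Die Anschrift lautet:  Standard Life."]
print(join_sentences(chunk))
```

Switching to this join would add one character per sentence boundary, so the chunk statistics reported in this notebook would shift slightly.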

In [20]:
df_chunks = pd.DataFrame(chunks)
df_chunks.head(5).round(2)

Unnamed: 0,page_number,chunk,char_count_per_chunk,word_count_per_chunk,token_count_per_chunk
0,3,Informationen für Ihren Versicherungsvertrag S...,1274,154,182.0
1,3,Jegliche schriftliche und mündliche Kommunikat...,897,96,128.14
2,3,Im unwahrscheinlichen Fall einer Insolvenz und...,238,29,34.0
3,4,Informationen für Ihren Versicherungsvertrag S...,1409,176,201.29
4,4,"Eine Beschwerde, bei der zugleich ein Verfahre...",950,123,135.71


In [21]:
df_chunks.describe().round(2)

Unnamed: 0,page_number,char_count_per_chunk,word_count_per_chunk,token_count_per_chunk
count,121.0,121.0,121.0,121.0
mean,23.32,904.0,116.88,129.14
std,11.47,478.65,61.25,68.38
min,3.0,1.0,1.0,0.14
25%,14.0,609.0,84.0,87.0
50%,23.0,879.0,113.0,125.57
75%,32.0,1196.0,149.0,170.86
max,44.0,3533.0,468.0,504.71


The summary shows that some chunks contain very few tokens (the minimum is 0.14 estimated tokens, i.e. a single character). Such chunks are unlikely to carry useful information, so chunks with 15 estimated tokens or fewer are removed.

In [22]:
min_token_length = 15

df_chunks = df_chunks[df_chunks["token_count_per_chunk"] > min_token_length ]
df_chunks.describe()

Unnamed: 0,page_number,char_count_per_chunk,word_count_per_chunk,token_count_per_chunk
count,120.0,120.0,120.0,120.0
mean,23.416667,911.525,117.841667,130.217857
std,11.469787,473.417381,60.570397,67.631054
min,3.0,143.0,22.0,20.428571
25%,14.0,615.0,84.0,87.857143
50%,23.0,879.0,113.5,125.571429
75%,32.25,1197.0,149.25,171.0
max,44.0,3533.0,468.0,504.714286


In [23]:
df_chunks[df_chunks["token_count_per_chunk"] > 384]

Unnamed: 0,page_number,chunk,char_count_per_chunk,word_count_per_chunk,token_count_per_chunk
43,18,Allgemeine Versicherungsbedingungen Seite 6 vo...,3533,468,504.714286


In this project, the sentence-transformers model all-mpnet-base-v2 is used as the embedding model. It has a maximum input length of 384 tokens, meaning it was trained to ingest texts of up to 384 tokens and turn them into embeddings. Longer texts are automatically truncated to 384 tokens, potentially losing some information. Only 1 chunk exceeds the estimated 384-token mark, which is quite acceptable. On average, even a whole page (about 360 estimated tokens) fits into the all-mpnet-base-v2 model.
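
Chunks above the limit could instead be built against an estimated token budget rather than a fixed sentence count. A greedy sketch, assuming the 7-characters-per-token heuristic used in this notebook; the function name `pack_by_token_budget` and its parameters are hypothetical, and a single sentence longer than the budget still ends up in its own over-long group.

```python
def pack_by_token_budget(sentences: list[str],
                         max_tokens: float = 384,
                         chars_per_token: float = 7) -> list[list[str]]:
    """Greedily group sentences so each group's estimated token count stays within max_tokens."""
    groups: list[list[str]] = []
    current: list[str] = []
    current_tokens = 0.0
    for sentence in sentences:
        tokens = len(sentence) / chars_per_token
        # Start a new group when adding this sentence would exceed the budget.
        if current and current_tokens + tokens > max_tokens:
            groups.append(current)
            current, current_tokens = [], 0.0
        current.append(sentence)
        current_tokens += tokens
    if current:
        groups.append(current)
    return groups

# Ten placeholder sentences of 700 characters each, about 100 estimated tokens apiece.
demo = ["a" * 700 for _ in range(10)]
group_sizes = [len(group) for group in pack_by_token_budget(demo)]
print(group_sizes)  # [3, 3, 3, 1]
```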

In [24]:
chunks = df_chunks.to_dict(orient = "records")

### Embedding Creation

While humans understand text, machines understand numbers best! Ideally, an embedding model maps texts with similar meanings to similar numerical representations.
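
Similarity between embeddings is commonly measured with cosine similarity. A toy sketch with made-up 3-dimensional vectors (the embeddings in this notebook have 768 dimensions; the vectors and their names below are invented purely for illustration):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

contract_question = [0.9, 0.1, 0.0]  # invented vector
contract_passage = [0.8, 0.2, 0.1]   # invented vector pointing in a similar direction
tax_passage = [0.0, 0.2, 0.9]        # invented vector pointing in a different direction

similar = cosine_similarity(contract_question, contract_passage)
dissimilar = cosine_similarity(contract_question, tax_passage)
print(round(similar, 3), round(dissimilar, 3))
```

Vectors pointing in nearly the same direction score close to 1, while unrelated directions score close to 0.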



In [25]:
device= "cuda" if torch.cuda.is_available() else "cpu"


The sentence-transformers library provides many pre-trained embedding models. Here we use the all-mpnet-base-v2 model.

In [26]:
embedding_model = SentenceTransformer(model_name_or_path="all-mpnet-base-v2", device=device)



In [27]:
## save our embeddings to our "chunk" dictionary:

for item in tqdm(chunks):
    item["embedding"] = embedding_model.encode(item["chunk"])

In [28]:
df_chunks = pd.DataFrame(chunks)
df_chunks.head(5).round(2)

Unnamed: 0,page_number,chunk,char_count_per_chunk,word_count_per_chunk,token_count_per_chunk,embedding
0,3,Informationen für Ihren Versicherungsvertrag S...,1274,154,182.0,"[0.003149034, -0.04874641, 0.009309288, -0.002..."
1,3,Jegliche schriftliche und mündliche Kommunikat...,897,96,128.14,"[0.008266833, -0.10775237, -0.0140566705, -0.0..."
2,3,Im unwahrscheinlichen Fall einer Insolvenz und...,238,29,34.0,"[-0.034830462, -0.09795748, 0.008475932, -0.02..."
3,4,Informationen für Ihren Versicherungsvertrag S...,1409,176,201.29,"[0.02947697, -0.07836568, 0.01964098, 0.015444..."
4,4,"Eine Beschwerde, bei der zugleich ein Verfahre...",950,123,135.71,"[0.0377404, -0.018597338, -0.0036616018, 0.016..."


In [29]:
df_chunks["embedding"][0].shape

(768,)

Our embedding has a shape of (768,) meaning it's a vector of 768 numbers which represent our corresponding chunk in high-dimensional space.

No matter the size of the text passed to our all-mpnet-base-v2 model, the result is an embedding of shape (768,). This value is fixed. Whether the input is 1 token or 1000 tokens long, it is truncated to at most 384 tokens (or padded where needed) and then turned into an embedding vector of size (768,). Other embedding models may have different input/output shapes.
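
One way to see why the output size does not depend on the input length is to look at the pooling step. The toy sketch below assumes mean pooling over token vectors and uses made-up 4-dimensional vectors instead of the model's 768 dimensions; `mean_pool` is a hypothetical helper, not a sentence-transformers function.

```python
def mean_pool(token_vectors: list[list[float]]) -> list[float]:
    """Average token vectors element-wise into one fixed-size vector."""
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]

short_input = [[1.0, 2.0, 3.0, 4.0]]               # 1 token
long_input = [[float(i)] * 4 for i in range(100)]  # 100 tokens

# Both inputs are reduced to a single vector with the same number of dimensions.
print(len(mean_pool(short_input)), len(mean_pool(long_input)))
```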

In [30]:
## save embeddings to file

df_chunks_save_path = "chunks_and_embeddings_for_WeitBlick.csv"
df_chunks.to_csv(df_chunks_save_path, index=False)
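
One caveat when reading this file back: CSV stores each embedding as text, so the column has to be parsed into numbers again. A minimal standard-library sketch, assuming each cell holds a bracketed list of numbers separated by whitespace or commas (the helper name `parse_embedding` is hypothetical):

```python
def parse_embedding(cell: str) -> list[float]:
    """Parse a bracketed string of numbers back into a list of floats."""
    cleaned = cell.strip().strip("[]").replace(",", " ")
    return [float(value) for value in cleaned.split()]

# Example cell text; line breaks inside the brackets are handled as whitespace.
cell = "[ 0.003149034 -0.04874641\n  0.009309288]"
vector = parse_embedding(cell)
print(vector)
```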