# Exercises: RAG with PDF files as the knowledge base (homework #8)

## Task

- Define your own use case; accordingly, use your own PDFs.
- Address the TPM (tokens per minute) limit problem
- Text chunking
- Text cleaning
- Embeddings: Hugging Face vs. OpenAI
- System prompt and user prompt (for the second case, where questions are sent to GPT at the end)
- Vary parameters (temperature=0, top_p=0.1)
- Compare results with RAG and without RAG
- Conclusion (and what could be improved, e.g. a database)

## Exercise setup

### Team members, group 3

- Hans Wermelinger
- Helmut Gehrer
- Markus Näpflin
- Nils Hryciuk
- Stefano Mavilio

### Runtime environment

The required modules are installed on demand with `apt-get`, `npm` and `pip`.

The following is needed in addition:

- Read access to the GitHub repository NAMARKUS (currently public)
- An **API key** for the **OpenAI REST API**.
  - On **Google Colab**, it must be stored as the **secret `OPENAI_API_KEY`**.
  - Locally, an environment variable with the same name should be set.

## Setting up the environment

The following blocks set up the required tools.

We need a few libraries to convert the PDFs to text via images.

💾 For local execution, `poppler` must be installed on your own machine so that `pdf2image` works. For details, see the [pdf2image page on PyPI](https://pypi.org/project/pdf2image/).
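
Before running the conversion locally, it can help to check that the Poppler command-line tools are on the `PATH`. The following is a minimal sketch, assuming that `pdf2image` relies on the `pdftoppm` and `pdfinfo` binaries; the helper name `poppler_available` is hypothetical and not part of `pdf2image`.

```python
import shutil


def poppler_available(which=shutil.which):
    """Return True if the Poppler tools assumed to be needed by pdf2image are on PATH."""
    # `which` is injectable so the check can be exercised without Poppler installed.
    return all(which(tool) is not None for tool in ("pdftoppm", "pdfinfo"))


# Usage: print a hint instead of failing later inside the PDF conversion.
if not poppler_available():
    print("Poppler not found: install poppler-utils (or the equivalent for your OS)")
```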

#### Poppler

For PDF-to-image conversion.

In [None]:
!apt-get install poppler-utils

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following NEW packages will be installed:
  poppler-utils
0 upgraded, 1 newly installed, 0 to remove and 49 not upgraded.
Need to get 186 kB of archives.
After this operation, 696 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy-updates/main amd64 poppler-utils amd64 22.02.0-2ubuntu0.5 [186 kB]
Fetched 186 kB in 2s (104 kB/s)
Selecting previously unselected package poppler-utils.
(Reading database ... 123633 files and directories currently installed.)
Preparing to unpack .../poppler-utils_22.02.0-2ubuntu0.5_amd64.deb ...
Unpacking poppler-utils (22.02.0-2ubuntu0.5) ...
Setting up poppler-utils (22.02.0-2ubuntu0.5) ...
Processing triggers for man-db (2.10.2-1) ...


#### Degit

For cloning individual directories (the PDF files) from GitHub.

In [None]:
!npm install degit

added 1 package in 2s

#### Python modules

Various modules needed throughout the workflow.

In [None]:
%pip install pdf2image
%pip install pdfminer
%pip install pdfminer.six
%pip install openai==1.57.0
%pip install scikit-learn
%pip install rich
%pip install tqdm
%pip install pandas

Collecting pdf2image
  Downloading pdf2image-1.17.0-py3-none-any.whl.metadata (6.2 kB)
Downloading pdf2image-1.17.0-py3-none-any.whl (11 kB)
Installing collected packages: pdf2image
Successfully installed pdf2image-1.17.0
Collecting pdfminer
  Downloading pdfminer-20191125.tar.gz (4.2 MB)
     4.2/4.2 MB 34.5 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting pycryptodome (from pdfminer)
  Downloading pycryptodome-3.21.0-cp36-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.4 kB)
Downloading pycryptodome-3.21.0-cp36-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.3 MB)
   2.3/2.3 MB 46.8 MB/s eta 0:00:00
Building wheels for collected packages: pdfminer
  Building wheel for pdfminer (setup.py) ... done
  Created wheel for pdfminer: filename=pdfminer-201

In [None]:
# Imports
from pdf2image import convert_from_path
from pdf2image.exceptions import (
    PDFInfoNotInstalledError,
    PDFPageCountError,
    PDFSyntaxError
)
import pdfminer.high_level as pdf2text
# from pdfminer.high_level import extract_text
import base64
import io
import os
import concurrent.futures
from tqdm import tqdm
from openai import OpenAI
import re
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
import json
import numpy as np
from rich import print
from ast import literal_eval

#### OpenAI API key

Read from the Colab secret storage or from an environment variable.

In [None]:
# Determine the OpenAI API key depending on the runtime environment.
try:
    # Only available on Google Colab
    from google.colab import userdata
    openai_api_key = userdata.get("OPENAI_API_KEY")
except Exception:
    # Not on Colab, or the secret is missing or not accessible: fall back to the environment
    openai_api_key = os.getenv("OPENAI_API_KEY")

# Check in both cases, not only after the fallback
if not openai_api_key:
    raise RuntimeError("API key not found")

print("Key for accessing the OpenAI REST API has been set.")


### Providing the PDFs to process

The PDFs are cloned from the GitHub repository mentioned above.

In [None]:
!npx degit github:namarkus/BFH_CAS_AI_2024/Day08/Grp3/zvb_pdfs#main zvb_pdfs

! ls -al ./zvb_pdfs

## Data preparation

In this section we prepare the data from the PDFs for retrieval.

There are two variants:

1. Extract the text directly with `pdfminer`
2. Convert the PDF to images and have GPT-4o analyze them.

> You can skip the 1st method if you want to only use the content inferred from the image analysis.

### Variant 1: pdfminer

pdfminer reads the text directly from the PDF.


In [None]:
def extract_text_from_doc(path):
    mined_text = pdf2text.extract_text(path)
    return mined_text

In [None]:
# Test pdfminer
test_file = "zvb_pdfs/Helsana_sana_zvb.pdf"
text = extract_text_from_doc(test_file)
print(text)

### Variant 2: image analysis with GPT-4o

Once a PDF has been converted into one image per page, GPT-4o converts each image to text.

In [None]:
def convert_doc_to_images(path):
    images = convert_from_path(path)
    return images



In [None]:
# Test the function above
file_path =  "zvb_pdfs/Helsana_sana_zvb.pdf"
images = convert_doc_to_images(file_path)
print(f'The PDF {file_path} consists of the following {len(images)} pages (images)')
for img in images:
    display(img)

In [None]:
# Converting images to base64 encoded images in a data URI format to use with the ChatCompletions API
def get_img_uri(img):
    png_buffer = io.BytesIO()
    img.save(png_buffer, format="PNG")
    png_buffer.seek(0)

    base64_png = base64.b64encode(png_buffer.read()).decode('utf-8')

    data_uri = f"data:image/png;base64,{base64_png}"
    return data_uri

In [None]:
# Analyze a page with the OpenAI API.
def analyze_image(openai_client, system_prompt, data_uri):
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {
                "role": "user",
                "content": [
                    {
                    "type": "image_url",
                    "image_url": {
                        "url": f"{data_uri}"
                    }
                    }
                ]
                },
        ],
        max_tokens=1000,
        temperature=0,
        top_p=0.1
    )
    return response.choices[0].message.content

In [None]:
# set client
client = OpenAI(api_key=openai_api_key)

##### Official prompt

In [None]:
# System prompt (suggestion from the example)
# Output should follow Swiss German writing rules.
image_analysis_system_prompt = '''
You will be provided with an image of a PDF page concerning a part of insurance terms. Your task is to deliver a detailed and accessible explanation of the content you see, tailored for an audience with no prior knowledge of the subject (101-level). Your audience is from the German-speaking part of Switzerland, so use Swiss German writing conventions.
If there is an identifiable title, start by stating the title to provide context for your audience.
Describe visual elements in detail:
- **Diagrams**: Explain each component and how they interact. For example, "The process begins with X, which then leads to Y and results in Z."
- **Tables**: Break down the information logically in a clear sentence. For instance, "Product A costs X dollars, while Product B is priced at Y dollars."
Focus on the content itself rather than the format:
- **DO NOT** include terms referring to the content format.
- **DO NOT** mention the content type. Instead, directly discuss the information presented.
Keep your explanation comprehensive yet concise:
- Be exhaustive in describing the content, as your audience cannot see the image.
- Exclude irrelevant details such as page numbers or the position of elements on the image.
Use clear and accessible language:
- Explain technical terms or concepts in simple language appropriate for a 101-level audience.
Engage with the content:
- Interpret and analyze the information where appropriate, offering insights to help the audience understand its significance.
------
If there is an identifiable title, present the output in the following format:
# {TITLE}

{Content description}

If there is no clear title, simply provide the content description.
'''

openapi_text = ''
for img in images:
    data_uri = get_img_uri(img)
    openapi_text += analyze_image(client, image_analysis_system_prompt, data_uri)
print(openapi_text)


##### Intergalactic health insurance

In [None]:
intergalactical_prompt = """
You will be provided with an image of a PDF page concerning a part of insurance terms. Your task is to deliver a detailed and accessible explanation of the content you see, tailored for Jedi fighters, so rewrite the text the way Yoda would, but in German.
If there is an identifiable title, start by stating the title to provide context for your audience.
Describe visual elements in detail:
- **Diagrams**: Explain each component and how they interact.
- **Tables**: Break down the information logically in a clear sentence.
Focus on the content itself rather than the format:
- **DO NOT** include terms referring to the content format.
- **DO NOT** mention the content type. Instead, directly discuss the information presented.
Keep your explanation comprehensive yet concise:
- Be exhaustive in describing the content, as your audience cannot see the image.
- Exclude irrelevant details such as page numbers or the position of elements on the image.
Use clear and accessible language:
- Explain technical terms or concepts in simple language appropriate for your intergalactical readers
Engage with the content:
- Interpret and analyze the information where appropriate, offering insights to help the audience understand its significance.
------
If there is an identifiable title, present the output in the following format:
# {TITLE}

{Content description}

If there is no clear title, simply provide the content description.
"""
intergalactical_text = ''
for img in images:
    data_uri = get_img_uri(img)
    intergalactical_text += analyze_image(client, intergalactical_prompt, data_uri)
print(intergalactical_text)

In [None]:
hypergalactical_prompt = """
You are an advanced AI language model designed to extract, interpret, and paraphrase complex legal documents in German, specifically healthcare insurance contracts. Your task is to accurately process and paraphrase the content of PDF documents while adhering to the following requirements:
Requirements:

    Language and Accuracy:
        Work exclusively in German.
        Maintain high precision and avoid adding, omitting, or altering the meaning of any content.

    Text Extraction:
        Extract all text, regardless of format, including multi-column layouts, tables, and graphical elements containing text.
        If text extraction is ambiguous or incomplete due to graphical complexity, flag it for clarification.

    Tables and Graphical Content:
        Pay special attention to tables and their contents, as they may be crucial for interpretation. Represent all table data clearly and accurately.
        Extract and paraphrase text embedded in graphical elements with the same care as standard text.

    Structure and Completeness:
        Ensure the paraphrased output contains all information from the original document, preserving the document's logical structure and important relationships.
        Avoid introducing any information not present in the original document.

    Paraphrasing Rules:
        Simplify and condense sentences for readability while maintaining their original meaning and tone.
        Use consistent terminology for technical and legal terms across documents.

    Comparability:
        Structure the output in a way that facilitates direct comparison between different documents.
        Include markers or headings that align with common sections in health insurance contracts, such as "Coverage Details," "Exclusions," "Premiums," and "Claims Processes."

    Formatting:
        Present paraphrased text in a clean and structured format that reflects the logical flow of the original content.
        Use bullet points, numbered lists, or headings where applicable for clarity.

    Metadata and Footnotes:
        Retain any metadata, footnotes, or annotations if they contribute to the interpretation of the document.

    Limitations and Scope:
        If content extraction is incomplete due to illegible or inaccessible parts of the PDF, clearly indicate the gap without assuming or generating content.
        Exclude any interpretations or additional commentary not derived directly from the document.

Final Output:

The paraphrased content should be a comprehensive and faithful reproduction of the original document in a simplified form, ready for comparative analysis with other similar documents. Your primary objective is to preserve meaning and structure, enabling accurate comparison without loss of detail.
"""

hypergalactical_text = ''
for img in images:
    data_uri = get_img_uri(img)
    hypergalactical_text += analyze_image(client, hypergalactical_prompt, data_uri)
print(hypergalactical_text)


##### Easy-to-read language

In [None]:
easy_reading_prompt = '''
You will be provided with an image of a PDF page concerning a part of insurance terms. Your task is to deliver a detailed and accessible explanation of the content you see, tailored for an audience with no prior knowledge of the subject and limited literacy or cognitive abilities. Your audience is from the German-speaking part of Switzerland, so use Swiss German writing conventions.
If there is an identifiable title, start by stating the title to provide context for your audience.
Describe visual elements in detail:
- **Diagrams**: Explain each component and how they interact. For example, "The process begins with X. Then it leads to Y. The result is Z."
- **Tables**: Break down the information logically in a clear sentence. For instance, "Product A costs X dollars. Product B costs Y dollars."
Focus on the content itself rather than the format:
- **DO NOT** include terms referring to the content format.
- **DO NOT** mention the content type. Instead, directly discuss the information presented.
Keep your explanation comprehensive yet concise:
- Be exhaustive in describing the content, as your audience cannot see the image.
- Exclude irrelevant details such as page numbers or the position of elements on the image.
Use clear and accessible language:
- Explain technical terms or concepts in simple language appropriate for an audience with limited literacy or cognitive abilities, so use short sentences and explain terms.
Engage with the content:
- Interpret and analyze the information where appropriate, offering insights to help the audience understand its significance.
------
If there is an identifiable title, present the output in the following format:
# {TITLE}

{Content description}

If there is no clear title, simply provide the content description.
'''
easy_reading_text = ''
for img in images:
    data_uri = get_img_uri(img)
    easy_reading_text += analyze_image(client, easy_reading_prompt, data_uri)
print(easy_reading_text)

### Side-by-side comparison of the output variants

In [None]:
from IPython.display import display, HTML
import markdown

# Render both extraction results as HTML so they can be shown next to each other
pdf2text_formatted = markdown.markdown(text)
openai_formatted = markdown.markdown(hypergalactical_text)

html_comparison = f"""
<div style="display: flex; justify-content: space-between;">
    <div style="width: 45%; padding: 10px;">
        <h3>&lt;pdfminer&gt;</h3>
        <div>{pdf2text_formatted}</div>
    </div>
    <div style="width: 45%; padding: 10px;">
        <h3>&lt;chatgpt&gt;</h3>
        <div>{openai_formatted}</div>
    </div>
</div>
"""

display(HTML(html_comparison))

### Embeddings

In [None]:
# Storage

import json
import os


class EmbeddingStorage:
    def __init__(self, file_path):
        self.file_path = file_path
        self.data = []

        if os.path.exists(self.file_path):
            self.load_embeddings()

    def save_embeddings(self, embeddings, texts):
        self.data = [{"embedding": embedding, "text": text} for embedding, text in zip(embeddings, texts)]
        with open(self.file_path, 'w') as f:
            json.dump(self.data, f)

    def load_embeddings(self):
        with open(self.file_path, 'r') as f:
            self.data = json.load(f)

    def get_embedding(self, index):
        return self.data[index]['embedding'] if 0 <= index < len(self.data) else None

    def get_text(self, index):
        return self.data[index]['text'] if 0 <= index < len(self.data) else None

    def get_all_embeddings(self):
        return [item['embedding'] for item in self.data]

In [None]:
# Embeddings

from sklearn.metrics.pairwise import cosine_similarity
import openai


class OpenAIEmbedding:
    def __init__(self, api_key):
        openai.api_key = api_key

    def get_embedding(self, text):
        response = openai.embeddings.create(input=text, model="text-embedding-3-small")
        return response.data[0].embedding

    def __cosine_similarity(self, embedding1, embedding2):
        return cosine_similarity([embedding1], [embedding2])[0][0]

    def find_most_similar(self, embedding, embeddings_list):
        # TODO: as question is smaller, maybe fill
        similarities = [self.__cosine_similarity(embedding, emb) for emb in embeddings_list]
        most_similar_idx = np.argmax(similarities)
        return most_similar_idx, similarities[most_similar_idx]

In [None]:
# Get Most Similar Text Based On Question

embedding = OpenAIEmbedding(openai_api_key)
storage = EmbeddingStorage("Andy.json")

andy_hug = "Andy Hug wurde in Zürich geboren und wuchs zusammen mit seinem Bruder und seiner Schwester bei seinen Grosseltern in Wohlen auf."
barbara_mueller = "Barbara Müller wurde in Bern geboren."

texts = [
    andy_hug,
    barbara_mueller
]

embeddings = [
    embedding.get_embedding(andy_hug),
    embedding.get_embedding(barbara_mueller)
]

storage.save_embeddings(embeddings, texts)

storage.load_embeddings()
loaded_embeddings = storage.get_all_embeddings()

query_embedding = embedding.get_embedding("Wo ist Andi geboren?")
idx, similarity = embedding.find_most_similar(query_embedding, loaded_embeddings)

# Retrieve the text corresponding to the most similar embedding
most_similar_text = storage.get_text(idx)
print(f"Most similar text: {most_similar_text} with similarity {similarity}")

## Processing all documents


In [None]:
files_path = "zvb_pdfs"
openai_client = OpenAI(api_key=openai_api_key)

all_items = os.listdir(files_path)
files = [item for item in all_items if os.path.isfile(os.path.join(files_path, item))]
print (f'Processing {files} ...')


In [None]:
def analyze_doc_image(img):
    img_uri = get_img_uri(img)
#    data = analyze_image(openai_client, image_analysis_system_prompt, img_uri)
    data = analyze_image(openai_client, hypergalactical_prompt, img_uri)
    return data

We will list all files in the example folder and process them by
1. Extracting the text
2. Converting the docs to images
3. Analyzing pages with GPT-4o

Note: This takes about ~2 mins to run. Feel free to skip and load directly the result file (see below).
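
One common way to address the TPM limit is to retry a failed request after an exponentially growing pause. The following is a minimal sketch, not the notebook's own implementation: `call_with_backoff` and its parameters are hypothetical names, and which exception signals a rate limit is an assumption (with the `openai` 1.x client that would typically be `openai.RateLimitError`).

```python
import random
import time


def call_with_backoff(fn, *args, retry_on=(Exception,), max_retries=5,
                      base_delay=1.0, sleep=time.sleep, **kwargs):
    """Call fn(*args, **kwargs); on a retryable error, wait and try again.

    The wait doubles on every attempt (base_delay, 2 * base_delay, ...) plus a
    little random jitter so that parallel workers do not retry in lockstep.
    """
    for attempt in range(max_retries + 1):
        try:
            return fn(*args, **kwargs)
        except retry_on:
            if attempt == max_retries:
                raise  # give up after the last allowed retry
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1 * base_delay)
            sleep(delay)
```

In the processing loop, a page analysis could then be submitted as `call_with_backoff(analyze_doc_image, img, retry_on=(openai.RateLimitError,))`, assuming `openai` is imported. Lowering `max_workers` is a complementary measure, since fewer parallel requests consume fewer tokens per minute.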

In [None]:
# TPM limit: only the first two files are processed, with at most 8 parallel requests

docs = []

for filename in files[0:2]:

    path = f"{files_path}/{filename}"
    doc = {
        "filename": filename
    }
    text = extract_text_from_doc(path)
    doc['text'] = text
    imgs = convert_doc_to_images(path)
    pages_description = []

    print(f"Analyzing pages for doc {filename}")

    # Concurrent execution
    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:

        # One analysis task per page, including the first page
        futures = [
            executor.submit(analyze_doc_image, img)
            for img in imgs
        ]

        with tqdm(total=len(imgs)) as pbar:
            for _ in concurrent.futures.as_completed(futures):
                pbar.update(1)

        # Collect the results in page order (not completion order)
        for future in futures:
            pages_description.append(future.result())

    doc['pages_description'] = pages_description
    docs.append(doc)

In [None]:
# Saving result to file for later
json_path = "parsed_pdf_docs.json"

with open(json_path, 'w') as f:
    json.dump(docs, f)

In [None]:
# Optional: load content from the saved file
with open(json_path, 'r') as f:
    docs = json.load(f)

In [None]:
docs

### Embedding content
Before embedding the content, we will chunk it logically by page.
For real-world scenarios, you could explore more advanced ways to chunk the content:
- Cutting it into smaller pieces
- Adding data - such as the slide title, deck title and/or the doc description - at the beginning of each piece of content. That way, each independent chunk can be in context

For the sake of brevity, we will use a very simple chunking strategy and rely on separators to split the text by page.
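
To illustrate the more advanced options listed above, here is a minimal sketch of cutting a page into smaller, overlapping pieces and prefixing each piece with context. The function `chunk_text` and its parameters are hypothetical; fixed character windows are an assumption made for simplicity, and splitting by sentences or tokens would be equally valid alternatives.

```python
def chunk_text(text, max_chars=500, overlap=50, prefix=""):
    """Split text into overlapping character windows, each optionally prefixed with context."""
    if max_chars <= 0 or not 0 <= overlap < max_chars:
        raise ValueError("need max_chars > 0 and 0 <= overlap < max_chars")
    chunks = []
    step = max_chars - overlap  # how far the window advances each time
    for start in range(0, len(text), step):
        chunks.append(prefix + text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # this window already reaches the end of the text
    return chunks
```

A call such as `chunk_text(page_text, prefix=f"{filename}: ")` would give every chunk the name of its source document, so that a chunk retrieved on its own still carries some context.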

In [None]:
# Chunking content by page and merging together slides text & description if applicable
# Think about how best to split the text
content = []

for doc in docs:
    # Split the text by form feed ('\f'); every page is kept, including the first
    slides = doc['text'].split('\f')
    descriptions = doc['pages_description']

    # Create a mapping of description titles for faster lookup
    description_map = {
        desc.split('\n')[0].strip().lower(): desc.split('\n', 1)[1] if '\n' in desc else ""
        for desc in descriptions
    }
    used_descriptions = set()

    for slide in slides:
        slide_lines = slide.split('\n')
        slide_title = slide_lines[0].strip().lower() if slide_lines else ""
        slide_content = slide + '\n'

        # Find matching description by slide title
        if slide_title in description_map:
            slide_content += description_map[slide_title]
            used_descriptions.add(slide_title)

        content.append(slide_content)

    # Add descriptions that weren't used
    unused_descriptions = [
        desc for title, desc in description_map.items() if title not in used_descriptions
    ]
    content.extend(unused_descriptions)


In [None]:
for c in content:
    print(c)
    print("\n\n-------------------------------\n\n")

### Cleaning the texts
#### Downloading the required modules and data.

In [None]:
# !pip install spaCy nltk
!python -m spacy download de_core_news_sm

In [None]:
# Cleaning AVB / ZB texts (ghe)
# Optimized for German-language texts (insurance contract terms)
# (1) Provide the required functions

import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import spacy

nltk.download('stopwords')
nltk.download('punkt_tab')
stop_words = set(stopwords.words('german')) # German stop words
nlp = spacy.load("de_core_news_sm") # German language model for lemmatization
phrases_to_remove = ["Impressum:",
                     "Rechte vorbehalten",
                     "VVG",
                     "ZB",
                     "ZVB",
                     "Ausgabe",
                     "Gültig ab",
                     "Seite"]


def basic_text_cleaning(text):
    #text = text.lower()     # lowercase
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE) # remove URLs
    text = re.sub(r'\S+@\S+', '', text)     # remove e-mail addresses
    #text = re.sub(r'[^a-zäöüß\s]', '', text)     # remove special characters and digits
    #text = re.sub(r'[^a-z0-9äöüéèà\s]', '', text)     # remove special characters
    text = re.sub(r'\s+', ' ', text).strip() # collapse repeated whitespace
    return text

def remove_stopwords(text):
    words = word_tokenize(text, language='german')
    # The NLTK stop word list is lowercase, so compare case-insensitively
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return ' '.join(filtered_words)

def lemmatize_text(text):
    doc = nlp(text)
    lemmatized_text = ' '.join([token.lemma_ for token in doc])
    return lemmatized_text

def remove_phrases(text, phrases_to_remove):
    for phrase in phrases_to_remove:
        text = text.replace(phrase, '')
    return text


In [None]:
# Cleaning AVB / ZB texts (ghe)
# (2) Sample text blocks --> only for testing the functions below
# content = [
#     "Hallo! Besuche uns auf https://example.com oder schreibe eine Mail an info@example.com. 😊",
#     "Dies ist ein Beispieltext. Impressum: Alle Rechte vorbehalten.",
#     "Die Katzen spielen mit den Bällen. Sie haben 5 Stück.",
#     "Das ist ein kurzer Satz, der einige Stopwörter enthält.",
#     "In der Heilungskosten-Zusatzversicherung Komplementär sind versicherbar:",
#     "Für Leistungen aus Komplementär I ist in jedem Fall eine ärztliche Verordnung notwendig."
# ]
print('Uncomment the code above and run the following cell to test only the text cleaning')

In [None]:
# Cleaning AVB / ZB texts (ghe)
# Optimized for German-language texts (insurance contract terms)
# (3) Call the functions as needed.
clean_content = []
for text_block in content:
    cleaner_text_block = text_block
    cleaner_text_block = basic_text_cleaning(cleaner_text_block)
    #cleaner_text_block = remove_phrases(cleaner_text_block, phrases_to_remove)
    #cleaner_text_block = remove_stopwords(cleaner_text_block)
    #cleaner_text_block = lemmatize_text(cleaner_text_block)
    print(f'Text {text_block} \n --> {cleaner_text_block}')
    clean_content.append(cleaner_text_block)

In [None]:
for c in clean_content:
    print(c)
    print("\n\n-------------------------------\n\n")

In [None]:
# Creating the embeddings
# We'll save to a csv file here for testing purposes but this is where you should load content in your vectorDB.
df = pd.DataFrame(clean_content, columns=['content'])
print(df.shape)
df.head()

Unnamed: 0,content
0,2 2. Leistungskatalog Komplementär I II III Ma...
1,
2,**Hinweis:** - Die Nennung der Geschlechter er...
3,### Komplementärversicherung | Kategorie | I |...
4,Ausgabe 1. Januar 2021 Zusätzliche Versicherun...


In [None]:
# Homework: compare with Hugging Face embeddings

# Model actually used for all embeddings in this notebook
embeddings_model = "text-embedding-3-small"

def get_embeddings(text):
    embeddings = client.embeddings.create(
      model=embeddings_model,
      input=text,
      encoding_format="float"
    )
    return embeddings.data[0].embedding

In [None]:
df['embeddings'] = df['content'].apply(lambda x: get_embeddings(x))
df.head()

Unnamed: 0,content,embeddings
0,2 2. Leistungskatalog Komplementär I II III Ma...,"[-0.032685652, 0.019604007, 0.03211956, 0.0215..."
1,,"[0.015368387, -0.034810703, -0.009328825, 0.01..."
2,**Hinweis:** - Die Nennung der Geschlechter er...,"[-0.0103751635, 0.01803124, 0.0070108683, 0.02..."
3,### Komplementärversicherung | Kategorie | I |...,"[-0.023061814, 0.022291865, 0.023061814, 0.027..."
4,Ausgabe 1. Januar 2021 Zusätzliche Versicherun...,"[0.0029073232, 0.016002247, 0.0573272, 0.04209..."


In [None]:
# Saving locally for later
data_path = "parsed_pdf_docs_with_embeddings.csv"
df.to_csv(data_path, index=False)

In [None]:
# Optional: load data from saved file
df = pd.read_csv(data_path)
df["embeddings"] = df.embeddings.apply(literal_eval).apply(np.array)

In [None]:
df

## Retrieval-augmented generation

The last step of the process is to generate outputs in response to input queries, after retrieving content as context to reply.
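
The retrieval step boils down to ranking the stored embeddings by cosine similarity to the query embedding and keeping the best `k`. The following sketch shows that ranking on plain NumPy arrays with toy vectors; `top_k_by_cosine` is a hypothetical helper, not a function of this notebook or of any library.

```python
import numpy as np


def top_k_by_cosine(query_vec, doc_vecs, k=3):
    """Return (indices, scores) of the k document vectors most similar to the query."""
    q = np.asarray(query_vec, dtype=float)
    d = np.asarray(doc_vecs, dtype=float)
    # Cosine similarity = dot product divided by the product of the vector norms
    scores = (d @ q) / (np.linalg.norm(d, axis=1) * np.linalg.norm(q))
    order = np.argsort(-scores)[:k]  # highest score first
    return order.tolist(), scores[order].tolist()


# Toy example with 2-dimensional "embeddings"
indices, scores = top_k_by_cosine([1, 0], [[1, 0], [0, 1], [1, 1]], k=2)
```

Computing all scores in one matrix operation avoids a per-row function call, which can matter once the number of chunks grows.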

In [None]:
rag_system_prompt = '''
    You will be provided with an input prompt and content as context that can be used to reply to the prompt.

    You will do 2 things:

    1. First, you will internally assess whether the content provided is relevant to reply to the input prompt.

    2a. If that is the case, answer directly using this content. If the content is relevant, use elements found in the content to craft a reply to the input prompt.

    2b. If the content is not relevant, use your own knowledge to reply or say that you don't know how to respond if your knowledge is not sufficient to answer.

    Stay concise with your answer, replying specifically to the input prompt without mentioning additional information provided in the context content.
'''

model="gpt-4o"

def search_content(df, input_text, top_k):
    embedded_value = get_embeddings(input_text)
    # Store the similarity as a plain float ([0][0] unwraps the 1x1 result matrix) so sorting is well-defined
    df["similarity"] = df.embeddings.apply(lambda x: cosine_similarity(np.array(x).reshape(1,-1), np.array(embedded_value).reshape(1, -1))[0][0])
    res = df.sort_values('similarity', ascending=False).head(top_k)
    return res

def get_similarity(row):
    similarity_score = row['similarity']
    if isinstance(similarity_score, np.ndarray):
        similarity_score = similarity_score[0][0]
    return similarity_score

def generate_output(input_prompt, similar_content, threshold = 0.5):

    # The best match is always used as context
    content = similar_content.iloc[0]['content']

    # Add further matches if their similarity is above the threshold
    # (skip the first row, which is already included)
    for _, row in similar_content.iloc[1:].iterrows():
        if get_similarity(row) > threshold:
            content += f"\n\n{row['content']}"

    prompt = f"INPUT PROMPT:\n{input_prompt}\n-------\nCONTENT:\n{content}"

    completion = client.chat.completions.create(
        model=model,
        temperature=0.5,
        messages=[
            {
                "role": "system",
                "content": rag_system_prompt
            },
            {
                "role": "user",
                "content": prompt
            }
        ]
    )

    return completion.choices[0].message.content

In [None]:
# Example user queries related to the content
example_inputs = [
    'Kann ich mein Rechnung für Akkupunktur der Helsana einsenden?',
    'Wieviel zahlt Helsana an Akkupunktur?',
    'Übernimmt Visana die Kosten für meine Hellseherin?',
    'Welche Krankenkasse übernimmt die höheren Beiträge bei Ergotherapie?',
    'Hat mein Kind Vorbehalte, wenn ich es bei Visana alternativ versichern lasse?',
    'Hat mein Kind Vorbehalte, wenn ich es bei Helsana alternativ versichern lasse?',
    'Ab wann kann ich mein Kind bei der Helsana alternativ versichern?',
    'Ich war auf Bali in einer Ayurveda Behandlung. Werden mir diese Kosten zurückerstattet?',
    'Von welcher Versicherung erhalten ich den grössten Betrag vergütet und für welche Leistung?',
    'Wie war nochmal der Name von Taylor Swifts Lieblingshaustier?'
]

In [None]:
# Running the RAG pipeline on each example
for ex in example_inputs:
    print(f"[deep_pink4][bold]QUERY:[/bold] {ex}[/deep_pink4]\n\n")
    matching_content = search_content(df, ex, 3)
    print(f"[grey37][b]Matching content:[/b][/grey37]\n")
    for i, match in matching_content.iterrows():
        print(f"[grey37][i]Similarity: {get_similarity(match):.2f}[/i][/grey37]")
        #print(f"[grey37]{match['content'][:100]}{'...' if len(match['content']) > 100 else ''}[/[grey37]]\n\n")
    reply = generate_output(ex, matching_content)
    print(f"[turquoise4][b]REPLY:[/b][/turquoise4]\n\n[spring_green4]{reply}[/spring_green4]\n\n--------------\n\n")

#### Testing RAG

The answers can differ from run to run because they are generated by an LLM. This makes conventional unit or integration tests inherently difficult.

One way around this is to use an LLM again, this time to compare the expected and the received statement by meaning rather than by wording.

In [None]:
testing_system_prompt = '''
    You will receive 2 statements marked as "<<EXPECTED>>:" and "<<RECEIVED>>:"

    Check whether these two statements express the same base proposition.

    Answer only with the literal value "True" or "False"
'''

model="gpt-4o"

def test_statement(question, expected_answer, expected=True):

    matching_content = search_content(df, question, 3)
    actual_answer = generate_output(question, matching_content)

    #print (actual_answer)

    prompt = f"""
    <<EXPECTED>>:
    {expected_answer}

    <<RECEIVED>>:
    {actual_answer}
    """

    # print(f'Teste: {prompt}')

    completion = client.chat.completions.create(
        model=model,
        temperature=0,  # keep the judge's verdict as stable as possible
        messages=[
            {
                "role": "system",
                "content": testing_system_prompt
            },
            {
                "role": "user",
                "content": prompt
            }
        ]
    )
    # Compare against the literal "True"/"False"; strip stray whitespace first.
    verdict = completion.choices[0].message.content.strip()
    if verdict == str(expected):
        print(f'Test for "{question}" passed!')
    else:
        print(f'''Test for "{question}" (expected={expected}) failed!
        - expected answer: "{expected_answer}"
        - received answer: "{actual_answer}"
        ''')



test_statement('Übernimmt Visana die Kosten für meine Hellseherin?', 'Visana übernimmt keine Kosten für Hellseherinnen.')
test_statement('Übernimmt Visana die Kosten für meine Hellseherin?', 'Ja Visana übernimmt die Kosten für Hellseherinnen.', expected=False)
test_statement('Kann ich mein Rechnung für Akkupunktur der Helsana einsenden?', 'Ja, aber mit Vorbehalt auf anerkannte Therapeuten. Helsana erstattet 75% der Kosten')
test_statement('Kann ich mein Rechnung für Akkupunktur der Helsana einsenden?', 'Ja', expected=False)  # The restriction is missing here, so this answer is not correct.
test_statement('Stimmt es, dass Helsana und Visana nur ambulante Behandlungen und Therapien versichern?', 'Nein es werden auch stationäre Behandlungen und Medikamente vergütet')
test_statement('Stimmt es, dass Helsana und Visana nur ambulante Behandlungen und Therapien versichern?', 'Bei Helsana sind nur ambulante Leistungen gedeckt.', expected=False)
# No single document covers both brands, so the embeddings will not yield an answer to this question.
test_statement('Welche Versicherung vergütet den grösseren Anteil an meinen Kosten für Medikamente?', 'Visana, bei dieser Versicherung sind 90% der Kosten für Medikamente gedeckt.', expected=False)
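
The tests above compare the judge's reply with the literal string `True` or `False`, so a reply such as `True.` or `"true"` would count as a failure. The following is a minimal sketch of a more tolerant check. The helpers `parse_verdict` and `verdict_matches` are hypothetical names introduced here for illustration and are not part of the notebook.

```python
def parse_verdict(raw_reply):
    """Map a judge reply to True/False, or None if it is not a clear verdict."""
    # Tolerate surrounding whitespace, quotes, a trailing period and case differences.
    cleaned = raw_reply.strip().strip('"\'').rstrip(".").strip().lower()
    if cleaned == "true":
        return True
    if cleaned == "false":
        return False
    return None  # anything else is treated as "no verdict"


def verdict_matches(raw_reply, expected=True):
    """Return True only if the judge gave a clear verdict equal to `expected`."""
    return parse_verdict(raw_reply) is expected


# Illustrative replies (made up for this sketch, not real model output).
for reply in ["True", " false.\n", '"True"', "The statements agree."]:
    print(repr(reply), "->", parse_verdict(reply))
```

Inside `test_statement`, the condition could then call `verdict_matches(completion.choices[0].message.content, expected)`. Treating an unclear reply as a failure is a deliberate choice in this sketch: it surfaces a judge that ignores the output format instead of hiding it.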



## Conclusion


In this notebook, we have learned how to develop a basic RAG pipeline based on PDF documents. This includes:

- How to parse PDF documents, taking slide decks and an export from an HTML page as examples, using a Python library as well as GPT-4o to interpret the visuals
- How to process the extracted content, clean it and chunk it into several pieces
- How to embed the processed content using OpenAI embeddings
- How to retrieve content that is relevant to an input query
- How to use GPT-4o to generate an answer using the retrieved content as context

If you want to explore further, consider these optimisations:

- Playing around with the prompts provided as examples
- Chunking the content further and adding metadata as context to each chunk
- Adding rule-based filtering on the retrieval results or re-ranking results to surface the most relevant content
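
As a rough illustration of the last two points, the sketch below combines per-chunk metadata with rule-based filtering and a simple re-ranking step. It is not the notebook's implementation: the function `filter_and_rerank`, the dict keys `content`, `insurer` and `similarity`, the threshold value and the sample data are all assumptions made for this example.

```python
# Hypothetical metadata values; the notebook's DataFrame may use other columns.
KNOWN_INSURERS = ("Helsana", "Visana")


def filter_and_rerank(results, query, min_similarity=0.3):
    """Apply simple rules to retrieval results and sort them by similarity.

    `results` is a list of dicts with the assumed keys
    'content', 'insurer' and 'similarity'.
    """
    # Rule 1: drop weak matches (the threshold is an arbitrary example value).
    kept = [r for r in results if r["similarity"] >= min_similarity]

    # Rule 2: if the query names exactly one insurer, keep only its chunks.
    mentioned = [name for name in KNOWN_INSURERS if name.lower() in query.lower()]
    if len(mentioned) == 1:
        kept = [r for r in kept if r["insurer"] == mentioned[0]]

    # Re-rank: most similar chunk first.
    return sorted(kept, key=lambda r: r["similarity"], reverse=True)


# Made-up retrieval results, for illustration only.
sample_results = [
    {"content": "(chunk text A)", "insurer": "Visana", "similarity": 0.62},
    {"content": "(chunk text B)", "insurer": "Helsana", "similarity": 0.58},
    {"content": "(chunk text C)", "insurer": "Helsana", "similarity": 0.21},
]

for r in filter_and_rerank(sample_results, "Wieviel zahlt Helsana an Akupunktur?"):
    print(f"{r['similarity']:.2f}  {r['insurer']}  {r['content']}")
```

A query that names no insurer, or both, skips the second rule and keeps chunks from every source, which matters for comparison questions across Helsana and Visana.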

You can apply the techniques covered in this notebook to multiple use cases, such as assistants that can access your proprietary data, customer service or FAQ bots that can read from your internal policies, or anything that requires leveraging rich documents that would be better understood as images.