# Knowledge Base

## Auswahl der Sprachmodelle 

### llama3:8b
- **Leistung**: Starke Performance für RAG-Anwendungen mit guter Kontextverarbeitung
- **Größe**: Mit 8B Parametern effizient auf Consumer-Hardware lauffähig
- **Lizenz**: Permissive Lizenz ermöglicht kommerzielle Nutzung
- **Aktualität**: Neues Modell mit modernem Trainingsdatenset und verbesserter Instruction-Following-Fähigkeit

### mistral:7b-instruct
- **Effizienz**: Ausgezeichnetes Leistungs-Größen-Verhältnis
- **Spezialisierung**: Optimiert für Instruction-Following und Kontextverständnis
- **Architektur**: Gruppenweise Rotation der Aufmerksamkeit für verbesserte Verarbeitung langer Dokumente
- **Community-Support**: Breite Nutzerbasis und dokumentierte Anwendungsfälle für RAG

### phi4-mini
- **Ressourcenschonung**: Kleines Modell (3.8B) für Systeme mit begrenzten Ressourcen
- **Effizienz**: Hervorragende Leistung trotz geringer Größe
- **Antwortqualität**: Gute Formulierungsfähigkeit bei unternehmensbezogenen Inhalten
- **Kompatibilität**: Geringer VRAM-Bedarf macht es auf verschiedenen Systemen einsetzbar

Dieser Mix bietet eine gute Balance zwischen Performance, Ressourcenbedarf und verschiedenen Architekturen für einen aussagekräftigen Vergleich.



## Einrichtung der Knowledge Base

### Config

In [5]:
import os
import shutil
from pathlib import Path
import fitz
import re
from typing import List
from langchain_huggingface import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter, SentenceTransformersTokenTextSplitter
from langchain.schema import Document
import time
import psutil
import requests

# Folder structure
BASE_DIR = Path("knowledge-base")
IMPORT_DIR = BASE_DIR / "import"
PROCESSED_DIR = BASE_DIR / "processed"
INPUT_DIR = BASE_DIR / "embeddings-ready"
VECTOR_DB_DIR = BASE_DIR / "vector-stores"


# Query Options
CONTEXT_WINDOW = 8192
TOKEN_LIMIT = 4096
OLLAMA_MODEL = "phi4-mini"  # Ollama-Modell
TEST_QUERY = "Which callback function is called during training?"
EXPECTED_ANSWER = "ModelCheckpoint"

# Define Embedding Configs
EMBEDDING_CONFIGS = [
    {
        "name": "word_level",
        "model": "sentence-transformers/all-MiniLM-L6-v2",
        "db_path": VECTOR_DB_DIR / "word_level_db"
    },
    {
        "name": "sentence_level",
        "model": "sentence-transformers/all-mpnet-base-v2",
        "db_path": VECTOR_DB_DIR / "sentence_level_db"
    },
    {
        "name": "document_level",
        "model": "intfloat/multilingual-e5-large",
        "db_path": VECTOR_DB_DIR / "document_level_db"
    }
]

# Embedding Model to use for chunking test
EMBEDDING_MODEL = "sentence-transformers/all-mpnet-base-v2"

# Chunking-Methoden
CHUNKING_METHODS = {
    "fixed_size": RecursiveCharacterTextSplitter(
        chunk_size=1000, 
        chunk_overlap=100
    ),
    "sentence": RecursiveCharacterTextSplitter(
        separators=["\n\n", "\n", ". ", "! ", "? ", ";", ":"],
        chunk_size=1000,
        chunk_overlap=0
    ),
    "paragraph": RecursiveCharacterTextSplitter(
        separators=["\n\n", "\n"],
        chunk_size=2000,
        chunk_overlap=50
    )
}

# Ensure Folders are created
VECTOR_DB_DIR.mkdir(exist_ok=True, parents=True)

PDF-Dateien: 10
 - 1-Challenges_and_Reliability_of_Predictive_Maintenance.pdf
 - 2-1478_unsupervised_model_selection_for unsupervised anomaly detection.pdf
 - 3-Anomaly_Detection_in_Renewable_Energy_Big_Data_Usi.pdf
 - 4-Anomaly Detection and Diagnosis In Manufacturing Systems A Comparative Study Of Statistical, Machine Learning And Deep Learning Techniques.pdf
 - 5-comparative study of deep learning autoencoders for vibration anomaly detection in manufacturing equipement.pdf
Verarbeite: 1-Challenges_and_Reliability_of_Predictive_Maintenance.pdf
Verarbeite: 2-1478_unsupervised_model_selection_for unsupervised anomaly detection.pdf
Verarbeite: 3-Anomaly_Detection_in_Renewable_Energy_Big_Data_Usi.pdf
Verarbeite: 4-Anomaly Detection and Diagnosis In Manufacturing Systems A Comparative Study Of Statistical, Machine Learning And Deep Learning Techniques.pdf
Verarbeite: 5-comparative study of deep learning autoencoders for vibration anomaly detection in manufacturing equipement.pdf
Verarbeit

### Vector-DB und Query Methoden definieren

In [None]:
def load_vector_db(db_path, embedding_model):
    """Lädt eine existierende Vektordatenbank"""
    print(f"Lade Vector DB aus {db_path}")
    embeddings = HuggingFaceEmbeddings(model_name=embedding_model)
    db = Chroma(
        persist_directory=str(db_path),
        embedding_function=embeddings
    )
    return db

def query_ollama(prompt, model=OLLAMA_MODEL):
    response = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "stream": False
        }
    )
    return response.json()
  
def search_and_query_llm(db_path, query, embedding_model):
    start_time = time.time()
    print(f"\nSuche in {db_path} nach: '{query}'")
    
    db = load_vector_db(db_path, embedding_model)
    docs = db.similarity_search(query, k=3)
    suchzeit = time.time() - start_time
    
    context = "\n\n".join([doc.page_content for doc in docs])
    prompt = f"""Basierend auf dem folgenden Kontext, beantworte die Frage.

Kontext:
{context}

Frage: {query}"""
    
    start_time = time.time()
    response = query_ollama(prompt)
    antwortzeit = time.time() - start_time
    content = response['message']['content']

    print("\n" + "="*60)
    print(f"FRAGE:")
    print("-"*60)
    print(query)
    print("="*60)
    
    print(f"ANTWORT:")
    print("-"*60)
    print(content)
    print("="*60)
    
    print(f"METADATEN:")
    print("-"*60)
    print(f"Suchzeit:       {suchzeit:.4f} Sekunden")
    print(f"LLM-Antwortzeit: {antwortzeit:.2f} Sekunden")
    print(f"Gesamtzeit:     {suchzeit + antwortzeit:.2f} Sekunden")
    print("="*60 + "\n")
    
    return {
        "suchzeit": suchzeit,
        "antwortzeit": antwortzeit,
        "gesamtzeit": suchzeit + antwortzeit,
        "antwort": content
    }


def create_vector_db(chunks, model_path, db_path):
    start_time = time.time()
    memory_before = psutil.Process().memory_info().rss / 1024 / 1024  # MB
    
    print(f"\nErstelle Vector DB mit {model_path}")
    embeddings = HuggingFaceEmbeddings(model_name=model_path)
    
    db = Chroma.from_documents(
        chunks,
        embeddings,
        persist_directory=str(db_path)
    )
    
    memory_after = psutil.Process().memory_info().rss / 1024 / 1024  # MB
    erstellungszeit = time.time() - start_time
    
    print(f"Vector DB in {db_path} gespeichert")
    print(f"Erstellungsdauer: {erstellungszeit:.2f} Sekunden")
    print(f"Speicherverbrauch: {memory_after - memory_before:.2f} MB")
    
    return db, db_path


### Verzeichnis Struktur einlesen und laden

In [None]:
def load_documents(directory: Path):
    documents = []
    for file_path in directory.glob("*.txt"):
        with open(file_path, "r", encoding="utf-8") as f:
            text = f.read()
        doc = Document(page_content=text, metadata={"source": file_path.name})
        documents.append(doc)
    print(f"Geladen: {len(documents)} Dokumente")
    return documents

for dir_path in [BASE_DIR, IMPORT_DIR, PROCESSED_DIR, INPUT_DIR]:
    dir_path.mkdir(exist_ok=True, parents=True)
    
def list_pdf_files(directory):
    return [f for f in directory.glob("*.pdf")]
  
pdf_files = list_pdf_files(IMPORT_DIR)

print(f"PDF-Dateien: {len(pdf_files)}")
for pdf in pdf_files:
    print(f" - {pdf.name}")

def process_pdf(pdf_path):
    document = fitz.open(pdf_path)
    text_content = []
    
    for page_num in range(len(document)):
        page = document[page_num]
        text = page.get_text()
        
        text = re.sub(r'\s+', ' ', text) 
        text = text.strip()
        
        if text:
            text_content.append(f"--- Seite {page_num + 1} ---\n{text}")
    
    document.close()
    
    processed_text = "\n\n".join(text_content)
    
    processed_text = re.sub(r'([.!?])\s*(\w)', r'\1\n\2', processed_text)  # Satzenden mit Zeilenumbrüchen
    
    return processed_text

processed_files = []

for pdf_path in pdf_files:
    print(f"Verarbeite: {pdf_path.name}")
    
    processed_text = process_pdf(pdf_path)
    
    output_file = INPUT_DIR / f"{pdf_path.stem}.txt"
    with open(output_file, "w", encoding="utf-8") as f:
        f.write(processed_text)
    
    target_path = PROCESSED_DIR / pdf_path.name
    shutil.move(pdf_path, target_path)
    
    processed_files.append({
        "original_file": pdf_path.name,
        "processed_file": output_file.name,
        "size_kb": round(output_file.stat().st_size / 1024, 2)
    })


documents = load_documents(INPUT_DIR)

## Implementierung der Content Embeddings 

### Leitfragen zur Bewertung

Hier wurden die PDFs manuell analysiert um einen geeignete Query zu finden um die Qualität des Outputs zu prüfen.  
Da das Sprachmodell von z.B. phi4 auch schon viel Basis-Wissen ohne Knowledgebase hat, muss die Frage so formuliert werden, dass es nur mithilfe der Dokumente beantwortet werden kann.

> Which callback function is called during training?
Erwartete Antwort: ModelCheckpoint

### Ausgewählte Embeddings

- word_level
- sentence_level
- document_level

### Vector DB für Embedding erstellen

In [6]:
# Dokumente in Chunks aufteilen
def prepare_chunks(documents):
    max_chunk_size = min(CONTEXT_WINDOW // 2, TOKEN_LIMIT)
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=max_chunk_size,
        chunk_overlap=50,
        length_function=len
    )
    chunks = text_splitter.split_documents(documents)
    print(f"Dokumente in {len(chunks)} Chunks aufgeteilt")
    return chunks

chunks = prepare_chunks(documents)

for config in EMBEDDING_CONFIGS:
    create_vector_db(chunks, config["model"], config["db_path"])


Dokumente in 222 Chunks aufgeteilt

Erstelle Vector DB mit sentence-transformers/all-MiniLM-L6-v2
Vector DB in knowledge-base\vector-stores\word_level_db gespeichert
Erstellungsdauer: 13.91 Sekunden
Speicherverbrauch: 258.41 MB

Erstelle Vector DB mit sentence-transformers/all-mpnet-base-v2


To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


Vector DB in knowledge-base\vector-stores\sentence_level_db gespeichert
Erstellungsdauer: 144.09 Sekunden
Speicherverbrauch: 782.60 MB

Erstelle Vector DB mit intfloat/multilingual-e5-large


To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


Vector DB in knowledge-base\vector-stores\document_level_db gespeichert
Erstellungsdauer: 545.12 Sekunden
Speicherverbrauch: 1415.49 MB


### Evaluierung der Embedding Vector DBs

In [7]:
# Alle drei DBs mit der gleichen Frage testen
print("\n=== Vergleichstest aller Vector DBs ===")

results = {}
for config in EMBEDDING_CONFIGS:
    results[config["name"]] = search_and_query_llm(config["db_path"], TEST_QUERY, config["model"])

# Zusammenfassung der Ergebnisse
print("\n=== Zusammenfassung ===")
print("Embedding-Modell | Suchzeit (s) | Antwortzeit (s) | Gesamtzeit (s)")
print("-" * 65)
for model, result in results.items():
    print(f"{model:15} | {result['suchzeit']:.4f} | {result['antwortzeit']:.2f} | {result['gesamtzeit']:.2f}")


=== Vergleichstest aller Vector DBs ===

Suche in knowledge-base\vector-stores\word_level_db nach: 'Which callback function is called during training?'
Lade Vector DB aus knowledge-base\vector-stores\word_level_db


  db = Chroma(



FRAGE:
------------------------------------------------------------
Which callback function is called during training?
ANTWORT:
------------------------------------------------------------
The provided context does not explicitly mention any Python code that includes a callback function being used in an autoencoder or deep-learning models for anomaly detection. Callback functions are typically found within machine learning frameworks such as TensorFlow/Keras, and they can be set up to execute certain actions at various stages of the model training process (e.g., after each epoch).

Common types of callbacks include:

1. EarlyStopping: Stops training when a monitored metric has stopped improving.
2. ModelCheckpoint: Saves the best-performing model during the entire history or periodically as specified by 'save_best_only' and 'period'.
3. ReduceLROnPlateau: Decreases learning rate once an epoch with no improvement in performance is detected for epochs.
4. TensorBoard (as hinted at in Fi

## Implementierung des Chunking 

In [8]:
# Chunking-Methoden testen
for method_name, splitter in CHUNKING_METHODS.items():
    print(f"\n=== Chunking-Methode: {method_name} ===")
    
    # Dokumente in Chunks aufteilen
    chunks = splitter.split_documents(documents)
    print(f"Chunks erstellt: {len(chunks)}")
    
    # Vector-DB erstellen mit bestehender Methode
    db, _ = create_vector_db(chunks, EMBEDDING_MODEL,  VECTOR_DB_DIR / f"chunking_{method_name}")




=== Chunking-Methode: fixed_size ===
Chunks erstellt: 759

Erstelle Vector DB mit sentence-transformers/all-mpnet-base-v2
Vector DB in chunking_fixed_size gespeichert
Erstellungsdauer: 345.41 Sekunden
Speicherverbrauch: 640.40 MB

=== Chunking-Methode: sentence ===
Chunks erstellt: 745

Erstelle Vector DB mit sentence-transformers/all-mpnet-base-v2
Vector DB in chunking_sentence gespeichert
Erstellungsdauer: 301.90 Sekunden
Speicherverbrauch: 368.23 MB

=== Chunking-Methode: paragraph ===
Chunks erstellt: 407

Erstelle Vector DB mit sentence-transformers/all-mpnet-base-v2
Vector DB in chunking_paragraph gespeichert
Erstellungsdauer: 209.86 Sekunden
Speicherverbrauch: 416.54 MB



### Evaluierung Embedding

In [9]:

results = {}
for method_name, splitter in CHUNKING_METHODS.items():
    # Mit LLM testen
    results[method_name] = search_and_query_llm(VECTOR_DB_DIR / f"chunking_{method_name}", TEST_QUERY, EMBEDDING_MODEL)
        
# Zusammenfassung
print("\n=== Zusammenfassung der Chunking-Methoden ===")
print("Methode        | Suchzeit (s) | Antwortzeit (s) | Gesamtzeit (s)")
print("-" * 65)
for method, result in results.items():
    print(f"{method:15} | {result['suchzeit']:.4f} | {result['antwortzeit']:.2f} | {result['gesamtzeit']:.2f}")


Suche in knowledge-base\vector-stores\chunking_fixed_size nach: 'Which callback function is called during training?'
Lade Vector DB aus knowledge-base\vector-stores\chunking_fixed_size

FRAGE:
------------------------------------------------------------
Which callback function is called during training?
ANTWORT:
------------------------------------------------------------
Im PyTorch Framework wird während des Trainingsprozesses der Callback-Funktion `backward()` an das Modell übergeben. Dies ist Teil von PyTorchs Autograd-Modul (Automatic Differentiation), welches den Gradienten im Backpropagation-Prozess berechnet.

Um mehr Klarheit zu schaffen, hier ein Beispiel:

```python
import torch

# Ein einfaches Modell definieren.
model = MyModel()

# Ein Callback-Funktion für Training vorbereiten.
def train_callback(optimizer):
    def callback_fn(engine, batch):
        # Der Trainingsprozess wird an der aktuellen Batch-Einheit ausgeführt.
        
        optimizer.zero_grad()  # Gradient