# RAG Workshop – Dutch Pension Law  
Author: *Generated on 2025-04-30*  

This notebook accompanies an internal workshop on **Retrieval‑Augmented Generation (RAG)** with Dutch pension‑related law texts from [wetten.overheid.nl](https://wetten.overheid.nl).  
You will need:

* Python 3.10+  
* `openai`, `azure-identity`, `langchain`, `langchain-community`, `langchain-huggingface`, `langchain-openai`, `tiktoken`, `rdflib`, `faiss-cpu`, `numpy`, `scikit-learn`

Fill in `config_template.yaml` (its contents are provided separately) and save it as `config.yaml`; this file is required to call the GPT models:

1. **Configure environment**
   ```yaml
   # This is a template configuration file for the Azure OpenAI API.
   # Fill in the placeholders with your actual configuration values.
   # Then save the file as 'config.yaml' in the same directory.
   AZURE_ENDPOINT: [Enter your Azure endpoint here]

   ## Decoder
   API_VERSION_DECODER: [Enter your API version here]
   API_KEY_DECODER: [Enter your API key here]

   ## Encoder
   API_VERSION_ENCODER: [Enter your API version here]
   API_KEY_ENCODER: [Enter your API key here]
   ```
2. Next, run the cell below to install the dependencies.

In [None]:
%pip install -r requirements.txt
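
Before the first model call, it can help to check that no template placeholder is left in `config.yaml`. The sketch below is a hypothetical, standard-library-only check: the helper name `find_unfilled_keys` is not part of the workshop's `utils`, and it assumes the flat `KEY: value` layout of the template shown above.

```python
from pathlib import Path

# Keys copied from the config template shown above.
REQUIRED_KEYS = [
    "AZURE_ENDPOINT",
    "API_VERSION_DECODER",
    "API_KEY_DECODER",
    "API_VERSION_ENCODER",
    "API_KEY_ENCODER",
]


def find_unfilled_keys(config_text: str) -> list[str]:
    """Return required keys that are missing or still hold a template placeholder."""
    values = {}
    for line in config_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue  # skip blank lines, comments and anything that is not KEY: value
        key, _, value = line.partition(":")  # split on the first colon only
        values[key.strip()] = value.strip()
    return [
        key
        for key in REQUIRED_KEYS
        if not values.get(key) or values[key].startswith("[Enter")
    ]


config_path = Path("config.yaml")
if config_path.exists():
    problems = find_unfilled_keys(config_path.read_text(encoding="utf-8"))
    print("Unfilled keys:", problems or "none")
else:
    print("config.yaml not found; copy config_template.yaml first.")
```

Splitting on the first colon only keeps values such as `https://...` endpoints intact. A real YAML parser would be more robust for nested configuration.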

### Simple LLM call.

In [None]:
#Prompting API example.
from utils import gpt_4o_mini

query = "Say 'Hello world' in five different languages."
print(gpt_4o_mini(query).choices[0].message.content)

## 1 – Layout of Dutch law texts  
We will inspect local `.txt` dumps in `data/docs/`. Each file contains **one complete law** (some > 60 k tokens).

Below we load all three files and print a *single article* per file to understand the structure.

### Random article from input law texts.

Here we show, using the three law texts, one random article from each.

In [None]:
from pathlib import Path
import random
from utils import split_articles, wrap_at_spaces

# Load a law text file.
DOC_PATH = Path("data/docs")
for law_file in sorted(DOC_PATH.glob("*.txt")):
    law_text = law_file.read_text(encoding="utf-8")
    articles = split_articles(law_text)  # drop preamble

    example_idx = random.randrange(len(articles))
    article_key = list(articles.keys())[example_idx]
    print(f"\n--- {law_file.name}, total articles {len(articles)} ---")
    print(article_key)
    print(wrap_at_spaces(articles[article_key],width=100))
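
The implementation of `split_articles` lives in `utils` and is not shown here. As a rough illustration of the idea, the sketch below splits a law text on article headings. It assumes every article starts on its own line with a heading such as `Artikel 18a`; the function name `split_articles_sketch` and the regular expression are illustrative assumptions, not the workshop's actual helper.

```python
import re

# Assumption: an article heading is a line that starts with "Artikel <number>".
ARTICLE_HEADING = re.compile(r"^Artikel[ \t]+\S+.*$", re.MULTILINE)


def split_articles_sketch(law_text: str) -> dict[str, str]:
    """Map each article heading to the text that follows it; the preamble is dropped."""
    matches = list(ARTICLE_HEADING.finditer(law_text))
    articles = {}
    for current, following in zip(matches, matches[1:] + [None]):
        # An article runs until the next heading, or until the end of the text.
        end = following.start() if following else len(law_text)
        articles[current.group().strip()] = law_text[current.end():end].strip()
    return articles


sample = "Aanhef\nArtikel 1\nEerste lid.\nArtikel 18a\nTweede tekst.\n"
print(split_articles_sketch(sample))
```

A heading-based split like this would also trigger on body lines that happen to start with "Artikel", so a production splitter needs a stricter pattern or structured input.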

### Count tokens per file.

Given our working dataset, we show the number of tokens per law text below.

In [None]:
# Token count
from utils import count_tokens_in_docs

print(count_tokens_in_docs())

## 2 – Direct LLM retrieval vs. Article‑wise retrieval  
We compare:

1. **Whole‑law prompt** – push the complete text (~60 k tokens) in one call → costly ❌ fast ✔️ accurate ❌  
2. **Chunked/article prompts** – iterate over the law one article at a time → costly ❌ slow ❌ accurate ✔️  
3. **RAG** – use article-text embeddings to identify the five most relevant articles, then query only those → cheap ✔️ fast ✔️ accurate ✔️  
4. **Graph RAG** – use a law text's linked-data structure → cheap ✔️ fast ✔️ accurate ✔️

We demonstrate approaches 3 and 4 later in the notebook (Sections 4 and 5).
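
The cost column in the comparison above can be made concrete with back-of-the-envelope arithmetic. Every number in this sketch is an illustrative placeholder (article count, per-call overhead, price), not a measurement from this notebook or a real price list. It also ignores output tokens and the cost of embedding the question.

```python
# All numbers below are illustrative placeholders, not measurements or real prices.
PRICE_PER_1M_INPUT_TOKENS = 0.15  # hypothetical price in dollars
LAW_TOKENS = 60_000               # order of magnitude of one complete law
N_ARTICLES = 150                  # hypothetical number of articles in that law
QUESTION_TOKENS = 50              # hypothetical question/prompt overhead per call
TOP_K = 5                         # articles kept by the retrieval step


def input_tokens(approach: str) -> int:
    """Rough number of input tokens needed to answer one question."""
    per_article = LAW_TOKENS // N_ARTICLES
    if approach == "whole_law":
        return LAW_TOKENS + QUESTION_TOKENS               # one large call
    if approach == "per_article":
        return LAW_TOKENS + N_ARTICLES * QUESTION_TOKENS  # one call per article
    if approach == "rag":
        return TOP_K * (per_article + QUESTION_TOKENS)    # only the top-k articles
    raise ValueError(f"unknown approach: {approach}")


for approach in ("whole_law", "per_article", "rag"):
    tokens = input_tokens(approach)
    cost = tokens / 1_000_000 * PRICE_PER_1M_INPUT_TOKENS
    print(f"{approach:12s} {tokens:>7d} input tokens  ~${cost:.5f}")
```

Under these assumptions, the per-article loop reads the whole law plus one question per article. Retrieval shrinks the input to a handful of articles.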

### 2.1. First, the 'naive' approach: the full law text in one large system prompt.

In [None]:
import time
import json
from utils import gpt_4o_mini, llm_metrics,wrap_at_spaces

QUESTION = "Wat is de franchise, welk artikel gebruik je ervoor en wat zijn uitzonderingen op de wettelijke waarde? Hou het antwoord onder de 50 woorden."

# --- Whole law ---
with open("data/docs/Wet op loonbelasting 1964.txt", "r", encoding="utf-8") as f:
    law_text = f.read()

# Time only the LLM call, not the file read.
start = time.time()
response = gpt_4o_mini(user_message=QUESTION, law_text=law_text)
elapsed = time.time() - start

in_tokens = response.usage.prompt_tokens
out_tokens = response.usage.completion_tokens
print(f"Q: {QUESTION}")
print(f"A: {wrap_at_spaces(response.choices[0].message.content, 100)}")
print("\n--- Performance ---")
print(json.dumps(llm_metrics("Hele wetstekst ingeladen", in_tokens, out_tokens, elapsed), indent=2))

#### Answer incorrect!

### 2.2. Second, by looping per article section.

In [None]:
import time
import json
from utils import gpt_4o_mini,llm_metrics,wrap_at_spaces,split_articles

QUESTION = "Wat is de franchise, en wat zijn uitzonderingen op de wettelijke waarde? "+\
"Indien deze tekst hier geen expliciete informatie over geeft, antwoord met enkel 'None'. Hou het antwoord onder de 50 woorden."


start = time.time()
in_tokens,out_tokens = 0,0
answers = []
with open("data/docs/Wet op loonbelasting 1964.txt", "r", encoding="utf-8") as f:
    law_text = f.read()
    article_dict = split_articles(law_text)
    for key, law_article_text in article_dict.items():
        response = gpt_4o_mini(user_message=QUESTION,law_text = law_article_text)
        in_tokens += response.usage.prompt_tokens
        out_tokens += response.usage.completion_tokens
        if "None" in str(response.choices[0].message.content):
            continue
        answers.append((key,response.choices[0].message.content))
    elapsed = time.time()-start
    print(f"Q: {QUESTION}")
    for key,ans in answers:
        print(f"A ({key}): {wrap_at_spaces(ans,100)}")
    # Report the aggregate metrics once, after all answers, under the per-article label.
    print("\n--- Performance ---")
    print(json.dumps(llm_metrics("Zoeken per artikel", in_tokens, out_tokens, elapsed),indent=2))


#### Answer is correct, but takes a long time to find (50+ seconds).


**Main observations**

- The franchise can be lower than € 18.475, depending on the premium percentage; this exception surfaced only when looping over articles.
- Looping over articles is accurate but slow (50+ seconds versus roughly 3 seconds for the whole-law prompt). Both methods are costly in tokens.

## 3. Introduction to RAG


A word or sentence embedding is a numerical (vector) representation of its semantic meaning.

By building a vector store of the texts first, one no longer has to rely on word matching or brute-force looping to find the article that contains the relevant text. Instead, only the small subset of articles whose embeddings lie closest to the question's embedding is searched, which is RAG in a nutshell.

Cosine similarity measures the cosine of the angle between two vectors, in this case word or sentence embeddings.
Values range from -1 to 1, with values closer to 1 indicating more similar words or sentences. For L2-normalized vectors, ranking by cosine similarity is equivalent to ranking by L2 distance, because the squared L2 distance equals 2 minus twice the cosine similarity.
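
A minimal plain-Python sketch of cosine similarity, for illustration only (the notebook itself uses `sklearn`). The toy vectors are made up. The last lines check numerically that, for unit-length vectors, the squared L2 distance equals `2 - 2 * cosine`.

```python
import math


def cosine_sim(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between a and b: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def l2_normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


a, b = [1.0, 2.0, 0.0], [2.0, 1.0, 1.0]  # toy "embeddings"
cos = cosine_sim(a, b)

# For unit vectors: |ua - ub|^2 = 2 - 2 * cos(a, b)
ua, ub = l2_normalize(a), l2_normalize(b)
squared_l2 = sum((x - y) ** 2 for x, y in zip(ua, ub))
print(round(cos, 4), round(squared_l2, 4), round(2 - 2 * cos, 4))
```

Identical directions give 1, orthogonal vectors give 0, and opposite directions give -1.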

### 3.1. Comparing embeddings of words/sentences.

Change the `model` parameter to observe how the similarity scores differ between the two embedding models.

In [None]:
import numpy as np
from utils import text_embedding_3_large
from sklearn.metrics.pairwise import cosine_similarity
from langchain_huggingface import HuggingFaceEmbeddings

# Adjustable parameters
model = "mxbai-embed-large-v1" #"mxbai-embed-large-v1" or "text_embedding_3_large"

assert model in ["mxbai-embed-large-v1","text_embedding_3_large"], "model must be either 'mxbai-embed-large-v1' or 'text_embedding_3_large'"
mxbai_emb = HuggingFaceEmbeddings(model_name="mixedbread-ai/mxbai-embed-large-v1")
#Test one, same sentence, one different word.
print( "--- model:",model,"---")
text_tuple_list = list([("car","cat")]+[(f"cat {sentence}",f"kitten {sentence}") for sentence in ["","is a large animal", "is a large animal with much fur"]])
for t1,t2 in text_tuple_list:
    if model == "mxbai-embed-large-v1":
        encoder_response = mxbai_emb.embed_documents([t1,t2])
        v1,v2 = (np.array(encoder_response[i]).reshape(1,-1) for i in [0,1])
    else:
        encoder_response = text_embedding_3_large([t1,t2])
        v1,v2 = (np.array(encoder_response.data[i].embedding).reshape(1,-1) for i in [0,1])
    print(f"t1: {t1}")
    print(f"t2: {t2}")
    print("cosine similarity",np.round(cosine_similarity(v1,v2)[0,0],2))

#Test two, same meaning, different words.
print( "---")
text_tuple_list = list([("rock","object that beats scissors in rock paper scissors"),("Tallest building in New York in 1931","The Empire State Building")])
for t1,t2 in text_tuple_list:
    if model == "mxbai-embed-large-v1":
        encoder_response = mxbai_emb.embed_documents([t1,t2])
        v1,v2 = (np.array(encoder_response[i]).reshape(1,-1) for i in [0,1])
    else:
        encoder_response = text_embedding_3_large([t1,t2])
        v1,v2 = (np.array(encoder_response.data[i].embedding).reshape(1,-1) for i in [0,1])
    print(f"t1: {t1}")
    print(f"t2: {t2}")
    print("cosine similarity",np.round(cosine_similarity(v1,v2)[0,0],2))

**This cell demonstrates two things:**

1. When one word differs and the rest of the sentence grows longer, that word has less influence: the resulting embeddings move closer together.
2. For 'similar' sentences, the `text_embedding_3_large` model tends to produce lower similarity scores than the Hugging Face `mxbai-embed-large-v1` model.

This can be tested by changing the `model` parameter above.

### 3.2. Creating law-article vector stores with both models (takes 3+ minutes on a developer laptop; do not rerun).

In [None]:
# from pathlib import Path
# from langchain_community.vectorstores import FAISS
# from langchain.text_splitter import RecursiveCharacterTextSplitter
# from utils import AZURE_EMBEDDINGS, MXBAI_EMBEDDINGS, split_articles
# import os

# dir_name = os.getcwd()

# # Save law texts separately to a list and create the underlying vector_stores
# DOC_PATH = Path("data/docs")
# article_list = []
# for law_file in sorted(DOC_PATH.glob("*.txt")):
#     law_text = law_file.read_text(encoding="utf-8")
#     articles = split_articles(law_text)  # drop preamble
#     article_list += [f"""{law_file.name} {key} {value}""" for key,value in articles.items()]

# # 20000 chosen because the largest article is about 18000 characters, so no article gets split.
# splitter = RecursiveCharacterTextSplitter(chunk_size=20000, chunk_overlap=0)
# documents = splitter.create_documents(article_list)

# # 'Vector stores', for now not stored efficiently (using FAISS langchain).
# mxbai_store = FAISS.from_documents(documents, MXBAI_EMBEDDINGS,normalize_L2=True)
# mxbai_store.save_local(os.path.join(dir_name,"data/vector_stores/mxbai-embed-large-v1_nongraph.index"))
# azure_store = FAISS.from_documents(documents, AZURE_EMBEDDINGS,normalize_L2=True)
# azure_store.save_local(os.path.join(dir_name,"data/vector_stores/text-embedding-3-large_nongraph.index"))
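
Conceptually, a vector store keeps one normalized vector per article and returns the nearest ones for a query vector. The sketch below is a hypothetical brute-force stand-in (`TinyVectorStore` is an illustrative name) with made-up 3-dimensional "embeddings". It is not how FAISS is implemented; it only shows the lookup that such an index makes efficient.

```python
import math


def l2_normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit length so that dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


class TinyVectorStore:
    """Hypothetical brute-force vector store, for illustration only."""

    def __init__(self) -> None:
        self.texts: list[str] = []
        self.vectors: list[list[float]] = []

    def add(self, text: str, vector: list[float]) -> None:
        self.texts.append(text)
        self.vectors.append(l2_normalize(vector))

    def top_k(self, query_vector: list[float], k: int = 5) -> list[tuple[str, float]]:
        """Return the k texts with the highest cosine similarity to the query."""
        query = l2_normalize(query_vector)
        scored = [
            (text, sum(q * x for q, x in zip(query, vector)))
            for text, vector in zip(self.texts, self.vectors)
        ]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


store = TinyVectorStore()
store.add("Artikel A", [1.0, 0.0, 0.0])  # toy vectors, not real embeddings
store.add("Artikel B", [0.7, 0.7, 0.0])
store.add("Artikel C", [0.0, 0.0, 1.0])
print(store.top_k([0.9, 0.1, 0.0], k=2))
```

Scanning every vector is fine for a few hundred articles. Dedicated indexes matter once the collection grows.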


## 4. Demonstrating RAG using article vector store.

### 4.1. For each generated vector store, show the 5 articles that most closely match the original question.

In [None]:
import textwrap
from utils import MXBAI_STORE_NONGRAPH, AZURE_STORE_NONGRAPH, topk

# number of articles to check
n = 5

QUESTION = "Wat is de franchise, welk artikel gebruik je ervoor en wat zijn uitzonderingen op de wettelijke waarde? Hou het antwoord onder de 50 woorden."
v1 = topk(MXBAI_STORE_NONGRAPH,QUESTION,k=n)
v2 = topk(AZURE_STORE_NONGRAPH,QUESTION,k=n)

article_list_mxbai = [tup[0].page_content for tup in v1]
article_list_azure = [tup[0].page_content for tup in v2]

# Pair each model name with the articles retrieved from its vector store
results_by_model = list(zip(["mixedbread-ai/mxbai-embed-large-v1","text_embedding_3_large"],[article_list_mxbai, article_list_azure]))
for name,article_list in results_by_model:
    print(f"\n--- Top {n} --- {name}")
    for i, article in enumerate(article_list):
        print(f"{i+1}.",f"{textwrap.shorten(article,150)}")

#### Article 18a of 'Wet op loonbelasting 1964' is the correct article.
The 'text_embedding_3_large' encoder ranks it first, 'mixedbread-ai/mxbai-embed-large-v1' model ranks it second.

### 4.2. Putting everything together, using mxbai encoder

In [None]:
import time
import textwrap
import json
from utils import rag_executor, llm_metrics, MXBAI_STORE_NONGRAPH

start = time.time()
QUESTION = "Wat is de franchise, welk artikel gebruik je ervoor en wat zijn uitzonderingen op de wettelijke waarde? Hou het antwoord onder de 50 woorden."
answer,top_docs,qa_chain,performance_cache = rag_executor(QUESTION,store = MXBAI_STORE_NONGRAPH)
print("Q:", QUESTION)
print("A:", answer)
print(f"\n--- Top {len(top_docs)} documents ---")
for i,doc in enumerate(top_docs):
    print(f"{i+1}.",f"{textwrap.shorten(str(doc.page_content),150)}")

print(f"\n--- Performance ---")
in_tokens = performance_cache["in_tokens"]
out_tokens = performance_cache["out_tokens"]
elapsed = performance_cache["elapsed"]
print(json.dumps(llm_metrics("RAG op wetsartikelen", in_tokens, out_tokens, elapsed),indent=2))

#### Much faster than looping per article (Section 2.2) and more accurate than the whole-law prompt (Section 2.1).
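
How `rag_executor` builds its prompt is not shown in this notebook. As a rough illustration of the generation step, the sketch below stitches retrieved articles into a single prompt. The function name `build_rag_prompt`, the instruction wording and the character budget are assumptions made for this example.

```python
def build_rag_prompt(question: str, articles: list[str], max_chars: int = 20_000) -> str:
    """Concatenate retrieved articles into one context block followed by the question."""
    context_parts = []
    used = 0
    for rank, article in enumerate(articles, start=1):
        if used + len(article) > max_chars:
            break  # stop before the context exceeds the character budget
        context_parts.append(f"[{rank}] {article}")
        used += len(article)
    context = "\n\n".join(context_parts)
    return (
        "Answer the question using only the law articles below.\n\n"
        f"{context}\n\nQuestion: {question}"
    )


print(build_rag_prompt("Wat is de franchise?", ["Artikel X ...", "Artikel Y ..."]))
```

Numbering the articles lets the model cite which one it used, and the budget keeps the prompt small even when a retrieved article is unusually long.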

## 5. Graph RAG, taking advantage of linked data structure for RAG.

### 5.1. Law graph (created beforehand).

In [None]:
#TODO: Placeholder, visualize graph.

### 5.2. Graph-based vector store.

In [None]:
#TODO: Placeholder, create vector stores.

In [None]:
#TODO: Placeholder, display top 5 articles based on a query.

### 5.3. Questions across articles.

In [None]:
#TODO: Graph RAG demonstration.
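
One way graph-based retrieval could handle questions that span several articles is to expand the vector-search hits with the articles they reference. The sketch below assumes a hypothetical reference graph stored as a plain dictionary with made-up article names. It does not use the workshop's actual law graph.

```python
# Hypothetical reference graph: article -> articles it cites (made-up names).
REFERENCES = {
    "Artikel A": ["Artikel B", "Artikel C"],
    "Artikel B": ["Artikel D"],
    "Artikel C": [],
    "Artikel D": ["Artikel A"],
}


def expand_with_references(
    seeds: list[str], graph: dict[str, list[str]], max_hops: int = 1
) -> list[str]:
    """Breadth-first expansion of the seed articles along reference edges."""
    ordered = list(dict.fromkeys(seeds))  # de-duplicate while keeping rank order
    seen = set(ordered)
    frontier = list(ordered)
    for _ in range(max_hops):
        next_frontier = []
        for article in frontier:
            for referenced in graph.get(article, []):
                if referenced not in seen:
                    seen.add(referenced)
                    ordered.append(referenced)
                    next_frontier.append(referenced)
        frontier = next_frontier
    return ordered


# Seeds would normally come from the vector search; here they are given by hand.
print(expand_with_references(["Artikel A"], REFERENCES, max_hops=1))
```

Limiting the number of hops keeps the context small, and the `seen` set prevents cycles in the reference graph from causing repeated visits.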

## 6. Wrap‑up / Key takeaways ✅  

* **Direct prompting** on entire laws is token‑heavy, can hit context limits, and gave an inaccurate answer in Section 2.1.  
* **Per‑article prompting** improves accuracy but is slow (50+ seconds) and just as token‑heavy.  
* **RAG** with a high‑quality embedding model (MXBAI) gives the *best accuracy‑per‑dollar*.  
* Integrating domain‑specific knowledge graphs can further boost recall for cross‑article references.  

Feel free to extend the notebook by  
* replacing placeholder accuracies with manual grading,  
* adding caching for embeddings,  
* deploying the FAISS index as an API.  
