<a href="https://colab.research.google.com/github/nickradunovic/ragdemo_politie/blob/main/rechtspraak_rag.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Retrieval-Augmented Generation (RAG) with Public Justice Data

# How RAG Works
Large language models (LLMs) such as GPT-4 are pre-trained on massive public datasets, which gives them strong natural language processing capabilities out of the box. However, they have no access to your own private data, which limits their usefulness for questions about that data. RAG addresses this limitation by adding a retrieval mechanism that gives the model access to externally provided data sources, such as private documents. Here's a simplified overview of how it works:

- **Retrieval**: The model performs an initial retrieval step to gather contextually relevant information from a knowledge base or external documents.

- **Augmentation**: The retrieved information is then seamlessly integrated into the generation process, augmenting the model's understanding and improving the quality of generated content.

- **Generation**: The model generates text, now equipped with the additional knowledge obtained through retrieval, resulting in more informed and contextually rich output.
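
The three steps above can be sketched with a toy example. This is only an illustration under simplifying assumptions: the documents are invented, relevance is a plain word-overlap count instead of an embedding search, and `generate` is a placeholder where a real system would call an LLM.

```python
from collections import Counter

# Toy knowledge base; the texts are invented for illustration only.
DOCUMENTS = {
    "doc-1": "the court sentenced the suspect for theft of a bicycle",
    "doc-2": "the suspect was acquitted of fraud due to lack of evidence",
    "doc-3": "apples can be red green or yellow",
}


def score(query: str, text: str) -> int:
    """Count how many query words also occur in the text (toy relevance score)."""
    query_words = Counter(query.lower().split())
    text_words = Counter(text.lower().split())
    return sum((query_words & text_words).values())


def retrieve(query: str, top_n: int = 2) -> list[str]:
    """Retrieval: return the ids of the top_n best-matching documents."""
    ranked = sorted(
        DOCUMENTS,
        key=lambda doc_id: score(query, DOCUMENTS[doc_id]),
        reverse=True,
    )
    return ranked[:top_n]


def augment(query: str, doc_ids: list[str]) -> str:
    """Augmentation: put the retrieved texts into the prompt next to the question."""
    context = "\n".join(f"{doc_id}: {DOCUMENTS[doc_id]}" for doc_id in doc_ids)
    return f"QUESTION: {query}\nCONTEXT:\n{context}\nANSWER:"


def generate(prompt: str) -> str:
    """Generation: placeholder for the LLM call; it only reports the prompt size."""
    return f"(an LLM would answer here, given a prompt of {len(prompt)} characters)"


query = "was the suspect sentenced for theft"
doc_ids = retrieve(query)
prompt = augment(query, doc_ids)
print(doc_ids)
print(generate(prompt))
```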

In this demo, we will work with [public case law data](https://uitspraken.rechtspraak.nl/resultaat?zoekterm=&inhoudsindicatie=&publicatiestatus=ps1&sort=Relevance&rechtsgebied=r3) containing court rulings in criminal cases. We will demonstrate how RAG can retrieve relevant records and generate accurate, context-aware responses to queries about this data. This shows how RAG works behind the scenes and how it can add value over standard LLM implementations that lack a retrieval mechanism.

## Objectives
By the end of this tutorial, you will:

- Understand the core concepts of RAG and its advantages over standard LLMs.

- Learn how to implement a RAG system using public justice data.

- See practical examples of querying the data and retrieving contextually relevant responses.

# Create the LLM prompt with question and answer

First, let's create an `.env` file containing the necessary OAI keys.

In [30]:
# create .env file with OAI keys
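
As a rough sketch of what this cell could contain: the variable names below are the ones read through `os.environ` and `os.getenv` elsewhere in this notebook, while the helper names `render_env` and `write_env` and the `<fill-in>` placeholder are assumptions made for this example. The real keys and endpoints are not part of this notebook and must be supplied by you.

```python
from pathlib import Path

# Variable names taken from the os.environ / os.getenv calls in this notebook.
ENV_KEYS = [
    "SEARCH_KEY",
    "SEARCH_ENDPOINT",
    "SEARCH_INDEX_NAME",
    "OPENAI_KEY",
    "OPENAI_API_VERSION",
    "OPENAI_ENDPOINT",
    "EMBEDDING_NAME",
]


def render_env(values: dict[str, str]) -> str:
    """Render KEY=value lines; keys without a value get an obvious placeholder."""
    return "".join(f"{key}={values.get(key, '<fill-in>')}\n" for key in ENV_KEYS)


def write_env(path: str, values: dict[str, str]) -> None:
    """Write the rendered content to the given path (for example '.env')."""
    Path(path).write_text(render_env(values), encoding="utf-8")


# SEARCH_INDEX_NAME is the index name mentioned later in this notebook;
# every other value stays a placeholder that you must replace yourself.
env_text = render_env({"SEARCH_INDEX_NAME": "index-politiedemo"})
print(env_text)
```

Calling `write_env(".env", {...})` with your own values would then create the file that `load_dotenv()` reads.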

Install required dependencies

In [None]:
%%writefile requirements.txt
python-dotenv
openai
azure-search-documents
azure-core

In [None]:
!pip install -r requirements.txt

Import Required Dependencies

In [None]:
import uuid
import os
from dotenv import load_dotenv
from openai import AzureOpenAI
from typing import List
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery

### Initiating the Azure SearchClient and AzureOpenAI instance

In this first step, we set up our environment and connect to the necessary Azure services. This involves loading environment variables and initializing clients for Azure AI Search (formerly Azure Cognitive Search) and Azure OpenAI. These connections allow us to interact with the search index and the OpenAI models.

In [None]:
load_dotenv()

credential = AzureKeyCredential(os.environ.get("SEARCH_KEY"))
search_client = SearchClient(
    endpoint=os.environ.get("SEARCH_ENDPOINT"),
    index_name=os.environ.get("SEARCH_INDEX_NAME"),
    credential=credential,
)
client = AzureOpenAI(
    api_key=os.getenv("OPENAI_KEY"),
    api_version=os.getenv("OPENAI_API_VERSION"),
    azure_endpoint=os.getenv("OPENAI_ENDPOINT"),
)
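
Because the clients above read all their settings from environment variables, a missing variable tends to surface only as a less obvious error when a client is created or first used. A small check like the following sketch can make that easier to spot; the helper name `missing_env_vars` is an assumption, and the variable names are the ones used in this notebook.

```python
import os
from typing import Mapping

# Names read via os.environ.get / os.getenv in this notebook.
REQUIRED_VARS = (
    "SEARCH_KEY",
    "SEARCH_ENDPOINT",
    "SEARCH_INDEX_NAME",
    "OPENAI_KEY",
    "OPENAI_API_VERSION",
    "OPENAI_ENDPOINT",
    "EMBEDDING_NAME",
)


def missing_env_vars(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variable names that are unset or empty in env."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Example with a made-up environment in which only SEARCH_KEY is filled in.
example_env = {"SEARCH_KEY": "dummy-value", "SEARCH_ENDPOINT": ""}
print(missing_env_vars(example_env))
```

Running `missing_env_vars()` without arguments after `load_dotenv()` checks the actual environment.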

### Implementing the Assistant Class

In this step, we implement the Assistant class, which serves as the core of our RAG system. This class manages interactions with the Azure AI Search and Azure OpenAI services. It includes methods for searching documents and generating responses to user queries.

In [None]:
class Assistant:
    def __init__(self, search_client: SearchClient, openai_client, model: str = "gpt-4o"):
        self.search_client = search_client
        self.llm = openai_client.chat.completions.create
        self.embedding = openai_client.embeddings.create
        self.model = model

    def search(self, query: str, top_n_documents: int) -> List:
        """
        Searches for documents based on the given query.

        Args:
            query (str): The search query.
            top_n_documents (int): Number of top documents to retrieve.

        Returns:
            list: A list of search results.
        """
        query_embedding = (
            self.embedding(input=[query], model=os.getenv("EMBEDDING_NAME"))
            .data[0]
            .embedding
        )

        # Azure AI search requires a vector query
        vector_query = VectorizedQuery(
            vector=query_embedding,
            k_nearest_neighbors=top_n_documents,
            fields="content_vector",
            exhaustive=True,
        )

        # passing in query in search_text makes it 'hybrid' search
        search_results = self.search_client.search(
            search_text=query, vector_queries=[vector_query], top=top_n_documents
        )

        return search_results

    def rag_chat(self, question: str) -> tuple[str, str]:
        """
        Ask a question to the RAG chatbot.

        Args:
            question (str): The question to ask the chatbot.

        Returns:
            tuple[str, str]: The chatbot's response and the context string
                built from the retrieved documents.
        """
        search_results = self.search(question, top_n_documents=7)

        documents_string = ""
        for result in search_results:
            documents_string += (
                f"ECLI: {result['title']} | Text: {result['content']}\n\n"
            )

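        # English gist of the Dutch prompt below: "You are an assistant that answers
        # questions about court rulings. With each question you receive relevant info
        # from several court cases in CONTEXT. The question is under VRAAG. Base your
        # answer solely on the supplied info. Cite per document in your answer."
        response = self.llm(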
            model=self.model,
            messages=[
                {
                    "role": "user",
                    "content": f"Je bent een assistent die vragen beantwoordt over Rechtspraken. \
                        Bij elke vraag krijg je relevante info van verschillende rechtszaken meegestuurd in CONTEXT. \
                        De vraag staat bij VRAAG. \
                        Je moet je antwoord enkel en alleen baseren op de meegestuurde info. \
                        Refereer in je antwoord per stuk. \
                        VRAAG: {question}. CONTEXT:\n{documents_string}. \nJOUW ANTWOORD:",
                },
            ],
        )

        return response.choices[0].message.content, documents_string

    def normal_chat(self, question: str) -> str:
        """
        Ask a question to the plain chatbot, without retrieval (no RAG).

        Args:
            question (str): The question to ask the chatbot.

        Returns:
            str: The response from the chatbot.
        """
        response = self.llm(
            model=self.model,
            messages=[
                {
                    "role": "system",
                    "content": "Je bent een assistent die vragen beantwoordt over Rechtspraken.",
                },
                {"role": "user", "content": f"VRAAG: {question}"},
            ],
        )

        return response.choices[0].message.content
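
To make the vector part of `search` more concrete: conceptually, a k-nearest-neighbors query ranks the stored embeddings by their similarity to the query embedding and keeps the best `k`. The sketch below does this exhaustively with cosine similarity on tiny hand-made vectors. It is an illustration only: the similarity metric actually used depends on how the index was configured, real embeddings have far more dimensions, and the names `cosine_similarity`, `k_nearest`, and `toy_index` are invented for this example.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def k_nearest(
    query_vector: list[float], index: dict[str, list[float]], k: int
) -> list[tuple[str, float]]:
    """Compare the query against every stored vector and keep the k best."""
    scored = [
        (doc_id, cosine_similarity(query_vector, vector))
        for doc_id, vector in index.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hand-made 3-dimensional "embeddings" for three hypothetical documents.
toy_index = {
    "case-a": [1.0, 0.0, 0.0],
    "case-b": [0.8, 0.6, 0.0],
    "case-c": [0.0, 0.0, 1.0],
}
nearest = k_nearest([1.0, 0.1, 0.0], toy_index, k=2)
print(nearest)
```

In `search`, passing the query text as `search_text` in addition to the vector query is what turns this into a hybrid search, as the code comment there notes.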

### Instantiating the Assistant Class
Now that we have defined the Assistant class, the next step is to create an instance of this class. This will allow us to use the methods we defined to perform searches and generate responses. By passing in the search_client and client (initialized in the previous step) to the Assistant constructor, we can set up our assistant for use.

In [None]:
assistant = Assistant(search_client, client)

### Generating a Response Using the RAG System
Now we will use the Assistant instance to generate responses to specific questions, both related and unrelated to legal cases. Check whether the response quality differs between quantitative questions and substantive (content-based) questions. Try out your own questions as well.

In [None]:
# English: "Which court cases have a man as the perpetrator?"
vraag = "welke rechtszaken hebben een man als dader?"
rag_antwoord, rag_documents = assistant.rag_chat(vraag)
print(rag_antwoord)

In [None]:
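# English: "In which court cases does the use of a firearm occur? The firearm need not have been fired."
vraag = "in welke rechtszaken komt het gebruik van een vuurwapen voor? Het vuurwapen hoeft niet te zijn afgegaan."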
rag_antwoord, rag_documents = assistant.rag_chat(vraag)
print(rag_antwoord)

In [None]:
# English: "In which court cases are children involved?"
vraag = "In welke rechtszaken zijn kinderen betrokken?"
rag_antwoord, rag_documents = assistant.rag_chat(vraag)
print(rag_antwoord)

### Printing the CONTEXT of the prompt to see what documents were retrieved and used for its response

It is possible to print the context of the prompt used by the RAG assistant. Note that the `rag_chat` method also returns the retrieved documents. Print these documents to see the result of the vector search.

In [None]:
print(rag_documents)
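
The printed context can be long. Since `rag_chat` builds it from entries of the form `ECLI: <title> | Text: <content>`, a short helper can list just the identifiers of the retrieved rulings. This is a sketch: the function name `list_retrieved_titles` and the example titles are made up, and the pattern assumes that no line inside a ruling's text happens to start with `ECLI: ` itself.

```python
import re


def list_retrieved_titles(documents_string: str) -> list[str]:
    """Extract the title part of every 'ECLI: <title> | Text: ...' entry."""
    return re.findall(r"^ECLI: (.*?) \| Text: ", documents_string, flags=re.MULTILINE)


# Made-up context string in the same layout that rag_chat produces.
example_context = (
    "ECLI: TITLE-1 | Text: first ruling text\n\n"
    "ECLI: TITLE-2 | Text: second ruling text\nspanning two lines\n\n"
)
titles = list_retrieved_titles(example_context)
print(titles)
```

In the notebook, `list_retrieved_titles(rag_documents)` would give a quick overview of which rulings the answer was based on.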

### How to handle unrelated questions?

At the moment, the RAG Assistant allows users to ask questions unrelated to the data in the vector database. Imagine a use case where this behaviour is not acceptable. How can we ensure that the RAG assistant gives appropriate responses and only answers questions related to the use case?

In [None]:
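# English: "How is it that apples have different colors (for example red, green, yellow, or a combination of these colors)?"
vraag = "Hoe kan het zijn dat appels verschillende kleuren hebben (bijvoorbeeld rood, groen, geel of een combinatie van deze kleuren)?"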
rag_antwoord, rag_documents = assistant.rag_chat(vraag)
print(rag_antwoord)

**TASK**: Reprogram the RAG Assistant so that it does not answer questions unrelated to legal cases.

Does it work? Discuss why it does / why it doesn't.
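
One possible starting point for this task, not a definitive solution, is to move the instructions into a system message and tell the model explicitly what to do with off-topic questions. The helper name `build_guarded_messages`, the `REFUSAL` sentence, and the prompt wording are assumptions made for this sketch; the prompt is written in English here, whereas the notebook's own prompts are in Dutch.

```python
# Hypothetical fixed sentence the model is asked to return for off-topic questions.
REFUSAL = "I can only answer questions about the provided court rulings."


def build_guarded_messages(question: str, documents_string: str) -> list[dict[str, str]]:
    """Build chat messages whose system prompt restricts answers to the context."""
    system_prompt = (
        "You are an assistant that answers questions about court rulings. "
        "Base your answer only on the rulings given in CONTEXT. "
        "If the question is not about court rulings, or CONTEXT does not "
        f"contain the answer, reply with exactly this sentence: {REFUSAL}"
    )
    user_prompt = f"QUESTION: {question}\nCONTEXT:\n{documents_string}"
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]


messages = build_guarded_messages("Why are apples red?", "ECLI: TITLE-1 | Text: ...")
for message in messages:
    print(message["role"], "->", message["content"][:60])
```

Inside `rag_chat`, such a list could be passed as the `messages` argument of `self.llm`. A prompt-only guard rail like this is not guaranteed to hold, since the model can still ignore the instruction, which is worth keeping in mind for the discussion.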

In [None]:
# Write a query unrelated to legal case law data to test whether or not the RAG Assistant answers these kinds of questions.

### Utilize the 'normal chat' that doesn't incorporate RAG functionality

If you would like to query the OpenAI model without RAG functionality, use the `normal_chat` method of the Assistant. With this non-RAG chat you can experiment and try to spot differences between an LLM with RAG functionality and one without.

**TASK**: What are the use cases and down- and upsides of using RAG compared to using a regular LLM?

In [None]:
# Instead of using `rag_chat`, write queries using the `normal_chat` method of the Assistant instead. Try out various queries and compare with the `rag_chat` method.

### Chunk size

Chunk size is a parameter that determines the length of each segment (or "chunk") of text that the source documents are divided into. In our implementation, the chunk size is specified as a number of characters. The embeddings in the vector store we just used were created with a chunk size of `20480`, which is large enough to hold each legal case in a single chunk. So far we have used the index `SEARCH_INDEX_NAME=index-politiedemo`, which was built with this large chunk size. But what would happen with a smaller chunk size such as `1024`?

We prepared some indexes in our vector store for you to play with. Change the search_index_name to `SEARCH_INDEX_NAME=index-politiedemo-s` so we can see the effect of having a smaller chunk size of `1024`.
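
To get a feel for what the chunk size does, here is a minimal sketch of character-based chunking. The code that built the indexes is not part of this notebook, so `chunk_text` is a hypothetical stand-in; real pipelines often add overlap between chunks or split on sentence boundaries instead of cutting at a fixed position.

```python
def chunk_text(text: str, chunk_size: int) -> list[str]:
    """Split text into consecutive pieces of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[start:start + chunk_size] for start in range(0, len(text), chunk_size)]


ruling = "x" * 2500  # stand-in for the text of one ruling
print([len(chunk) for chunk in chunk_text(ruling, 1024)])   # [1024, 1024, 452]
print([len(chunk) for chunk in chunk_text(ruling, 20480)])  # [2500]
```

With the small chunk size, a single ruling is spread over several index entries, so one search result no longer covers the whole case.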

In [None]:
search_client_small = SearchClient(
    endpoint=os.environ.get("SEARCH_ENDPOINT"),
    index_name="index-politiedemo-s",
    credential=credential,
)

In [None]:
assistant_small_chunk_size = Assistant(search_client_small, client)

In [None]:
# English: "Which court cases have a man as the perpetrator?"
vraag = "welke rechtszaken hebben een man als dader?"
rag_antwoord, rag_documents = assistant_small_chunk_size.rag_chat(vraag)
print(rag_antwoord)

### Changing the `top_n_documents`

By using a smaller chunk size, it might be possible to retrieve a larger number of documents from the vector database. At the moment, `top_n_documents = 7`. Try retrieving a larger number of documents per query and see how this impacts the response quality.

**NOTE**: make sure to use the index created using the smaller chunk size of `1024`.
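
A quick back-of-the-envelope calculation shows why the chunk size leaves room for more documents. The sketch below computes an upper bound on the number of characters that the retrieved chunks add to the prompt; the helper name `max_context_chars` is made up for this example, and actual chunks may be shorter than the maximum.

```python
def max_context_chars(top_n_documents: int, chunk_size: int) -> int:
    """Upper bound on the characters the retrieved chunks add to the prompt."""
    return top_n_documents * chunk_size


large = max_context_chars(7, 20480)
small = max_context_chars(7, 1024)
print(large)          # 143360
print(small)          # 7168
print(large // 1024)  # 140 small chunks fit in the same character budget
```

Note that `rag_chat` currently hardcodes `top_n_documents=7`, so retrieving more documents requires changing that value or exposing it as a parameter.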

In [None]:
# write a query for the RAG Assistant while retrieving more than 7 documents.

### Changing the OAI model

Different LLMs can be used in a RAG implementation. The RAG assistant we just used runs on OpenAI GPT-4o. However, some use cases and projects may opt for a model that is not from OpenAI. For this demo, we stick with OpenAI, but we can try the different models they provide.

Next to `gpt-4o`, we also deployed `gpt-35-turbo` and `gpt-4` for this demo. Each model has a different context window size. Note that using `gpt-35-turbo` with the large-chunk `search_client` does not work.

**TASK**: Change the model or the search client in order to make it work.
Even though the context window size is smaller for `gpt-35-turbo` compared to the other models, it might still be preferred over the others. What would be reasons to opt for one LLM over the others?

In [31]:
assistant_gpt35_turbo = Assistant(search_client_small, client, model="gpt-35-turbo")

In [None]:
# Try out the new RAG-assistant to see if you can find any differences

### Vector search is language independent

In [None]:
vraag = "Which person was involved in multiple cases"
rag_antwoord, rag_documents = assistant.rag_chat(vraag)
print(rag_antwoord)

## Things to consider and try out for yourself:
- How can we use the system prompt to get the response we want in our application (can we include any guard rails)?
- What could be ways to make the RAG Assistant consistent in its responses?
- What is the difference in response quality between quantitative questions (how many ...?, what is the greatest ...?) and qualitative (substantive) questions?
- What could be shortcomings of this RAG implementation? --> (1) no chat history is saved at the moment, (2) ..., ...
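
For the first shortcoming listed above, the missing chat history, one simple approach is to keep earlier turns in a list and send them along with each new question. The following is a minimal sketch under that assumption; the class name `ChatHistory` and the `max_turns` limit are invented for this example and are not part of the notebook's `Assistant` class.

```python
class ChatHistory:
    """Keeps the most recent question/answer pairs of a conversation."""

    def __init__(self, system_prompt: str, max_turns: int = 5):
        if max_turns < 1:
            raise ValueError("max_turns must be at least 1")
        self.system_prompt = system_prompt
        self.max_turns = max_turns
        self.turns: list[tuple[str, str]] = []

    def add_turn(self, question: str, answer: str) -> None:
        """Store a finished turn and drop the oldest ones beyond max_turns."""
        self.turns.append((question, answer))
        self.turns = self.turns[-self.max_turns:]

    def build_messages(self, new_question: str) -> list[dict[str, str]]:
        """Return the system prompt, the stored turns, and the new question."""
        messages = [{"role": "system", "content": self.system_prompt}]
        for question, answer in self.turns:
            messages.append({"role": "user", "content": question})
            messages.append({"role": "assistant", "content": answer})
        messages.append({"role": "user", "content": new_question})
        return messages


history = ChatHistory("You answer questions about court rulings.", max_turns=1)
history.add_turn("first question", "first answer")
history.add_turn("second question", "second answer")
messages = history.build_messages("third question")
print([message["role"] for message in messages])
```

Limiting the number of stored turns keeps the prompt from growing without bound, which matters because the retrieved documents already take up much of the context window.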

In [None]:
# Try out your own queries after playing with one of the points mentioned above.