# Indexing and storing data in a vector database

Before the RAG assistant can retrieve and answer questions about our legal documents, we need to store them as embeddings in a vector database.

First, we import the necessary dependencies.

In [1]:
import os
from os.path import join, dirname, abspath
import pandas as pd
import re
import requests
import fitz
import time
from typing import List
from bs4 import BeautifulSoup
from openai import AzureOpenAI
from markdownify import markdownify as md
from langchain.docstore.document import Document
from langchain.text_splitter import TokenTextSplitter
from langchain_openai import AzureOpenAIEmbeddings

from langchain_community.vectorstores.azuresearch import (
    AzureSearch,
    FIELDS_ID,
    FIELDS_CONTENT,
    FIELDS_CONTENT_VECTOR,
    FIELDS_METADATA,
)
from azure.search.documents.indexes.models import (
    SimpleField,
    SearchableField,
    SearchField,
    SearchFieldDataType,
)
from dotenv import load_dotenv

load_dotenv()

True

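We initialize the embedding model using LangChain's `AzureOpenAIEmbeddings`.

Here, we specify the Azure OpenAI endpoint, deployment name, key and API version, along with `chunk_size`, which in LangChain's OpenAI embedding classes sets the maximum number of texts sent per embedding request.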

In [2]:
embeddings = AzureOpenAIEmbeddings(
    azure_endpoint=os.getenv("OPENAI_ENDPOINT"),
    azure_deployment=os.getenv("EMBEDDING_NAME"),
    chunk_size=20480,
    openai_api_key=os.getenv("OPENAI_KEY"),
    openai_api_version=os.getenv("OPENAI_API_VERSION"),
    openai_api_type="azure",
)
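To make the `chunk_size` setting more concrete: as far as the LangChain documentation describes it, the value is a batch size (number of texts per request), not a token count. The sketch below is plain Python and only illustrates the batching idea; `batch_texts` is a hypothetical helper and not part of LangChain.

```python
from typing import Iterator, List


def batch_texts(texts: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `batch_size` texts.

    Hypothetical helper, for illustration only.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(texts), batch_size):
        yield texts[start : start + batch_size]


# Five texts with a batch size of two give three batches.
batches = list(batch_texts(["a", "b", "c", "d", "e"], batch_size=2))
print(batches)
```

With a batch size as large as the one configured above, all chunks of a typical input would likely end up in a single batch.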

We define two helper functions: one deletes the index from the vector database in Azure, the other checks that the index no longer exists.

In [4]:
def delete_index():
    headers = {
        "api-key": os.getenv("SEARCH_KEY"),
        "Content-Type": "application/json",
    }
    response = requests.delete(
        os.getenv("SEARCH_ENDPOINT")
        + "/indexes/"
        + os.getenv("SEARCH_INDEX_NAME")
        + "?api-version=2020-06-30",
        headers=headers,
    )
    return response.status_code == 204


def check_index_deleted():
    headers = {
        "api-key": os.getenv("SEARCH_KEY"),
    }
    response = requests.get(
        os.getenv("SEARCH_ENDPOINT")
        + "/indexes/"
        + os.getenv("SEARCH_INDEX_NAME")
        + "?api-version=2020-06-30",
        headers=headers,
    )
    return response.status_code == 404
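If the deletion is not visible immediately, the two helpers can be combined with a small polling loop. The sketch below is one possible way to do that; `wait_until`, its `timeout` and its `interval` are assumptions made for illustration and do not appear in the original notebook.

```python
import time
from typing import Callable


def wait_until(
    predicate: Callable[[], bool],
    timeout: float = 30.0,
    interval: float = 1.0,
) -> bool:
    """Call `predicate` repeatedly until it returns True or `timeout` seconds pass.

    Hypothetical helper. Returns True if the predicate succeeded in time,
    otherwise False.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Intended use with the helpers defined above (not executed here):
#     if delete_index():
#         wait_until(check_index_deleted)
```

Using `time.monotonic()` rather than `time.time()` keeps the timeout unaffected by system clock changes.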

Azure AI Search (formerly known as "Azure Cognitive Search") is an Azure service that provides secure information retrieval at scale for GenAI search applications, including RAG. LangChain exposes it through the `AzureSearch` vector store.
We use it to index our legal documents and to run vector search over them.

Here, we configure the `AzureSearch` instance. In `fields`, we specify the fields stored for each indexed document.

In [5]:
def init_acs():
    acs = AzureSearch(
        azure_search_endpoint=os.getenv("SEARCH_ENDPOINT"),
        azure_search_key=os.getenv("SEARCH_KEY"),
        index_name=os.getenv("SEARCH_INDEX_NAME"),
        embedding_function=embeddings.embed_query,
        fields=[
            SimpleField(
                name=FIELDS_ID,
                type=SearchFieldDataType.String,
                key=True,
                filterable=True,
            ),
            SearchableField(
                name=FIELDS_CONTENT,
                type=SearchFieldDataType.String,
                analyzer_name="nl.lucene",
            ),
            SearchField(
                name=FIELDS_CONTENT_VECTOR,
                type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
                searchable=True,
                vector_search_dimensions=1536,
                vector_search_profile_name="myHnswProfile",
            ),
            SearchableField(
                name=FIELDS_METADATA,
                type=SearchFieldDataType.String,
                analyzer_name="nl.lucene",
            ),
            SimpleField(
                name="title",
                type=SearchFieldDataType.String,
                key=False,
                filterable=True,
                sortable=True,
                facetable=True,
            ),
            SimpleField(
                name="page",
                type=SearchFieldDataType.String,
                key=False,
                filterable=True,
                sortable=True,
                facetable=True,
            ),
        ],
    )
    return acs
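The vector field above is declared with `vector_search_dimensions=1536`, so the embedding model has to return vectors of exactly that length. A quick sanity check before indexing could look like the sketch below. `check_embedding_dimensions` and `fake_embed` are hypothetical names used for illustration; in the notebook, one would pass `embeddings.embed_query` instead of the stand-in.

```python
from typing import Callable, List


def check_embedding_dimensions(
    embed: Callable[[str], List[float]], expected: int
) -> int:
    """Embed a short probe text and verify the vector length.

    Hypothetical helper. Raises ValueError on a mismatch, otherwise
    returns the observed dimension.
    """
    vector = embed("test")
    if len(vector) != expected:
        raise ValueError(
            f"Embedding has {len(vector)} dimensions, index expects {expected}"
        )
    return len(vector)


def fake_embed(text: str) -> List[float]:
    """Stand-in for a real embedding function; returns a zero vector."""
    return [0.0] * 1536


dimension = check_embedding_dimensions(fake_embed, expected=1536)
print(dimension)
```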

As preparation, we scraped criminal-law rulings from `https://uitspraken.rechtspraak.nl/` and saved them as one spreadsheet file per year (the code below reads them as `.xlsx` files).

In the code cell below, we do the following:
- Initialize Azure AI Search and specify the relevant parameters.
- Read in each of the spreadsheet files with legal cases.
- Split the text of each document into chunks using LangChain's `TokenTextSplitter`.
- Index the chunks and add them to the vector database.
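The splitting step relies on two parameters, a chunk size and a chunk overlap, both counted in tokens. The sketch below illustrates the sliding-window idea on a plain list of integers standing in for token ids. It is a simplified illustration under that assumption, not LangChain's implementation, and `split_with_overlap` is a hypothetical name.

```python
from typing import List


def split_with_overlap(
    tokens: List[int], chunk_size: int, chunk_overlap: int
) -> List[List[int]]:
    """Split `tokens` into windows of `chunk_size` that share `chunk_overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks: List[List[int]] = []
    start = 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        chunks.append(tokens[start:end])
        if end == len(tokens):
            break  # the last window reached the end of the input
        start += step
    return chunks


# Ten tokens, windows of four, one token shared between neighbours.
chunks = split_with_overlap(list(range(10)), chunk_size=4, chunk_overlap=1)
print(chunks)  # [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

The overlap means that text near a chunk boundary appears in both neighbouring chunks, which helps retrieval when a relevant passage straddles the boundary.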



In [7]:
# AI SEARCH DATABASE ON FULL TEXT ECLI RECHTSPRAAK
acs = init_acs()

params = {
    "splitter": "TokenTextSplitter",
    "encoding_name": "cl100k_base",
    "chunk_size": 20480,
    "chunk_overlap": 2048,
}
token_splitter = TokenTextSplitter(
    encoding_name=params["encoding_name"],
    chunk_size=params["chunk_size"],
    chunk_overlap=params["chunk_overlap"],
)

for df_name in [
    "rechtspraak_metadata_2020.xlsx",
    "rechtspraak_metadata_2021.xlsx",
    "rechtspraak_metadata_2022.xlsx",
    "rechtspraak_metadata_2023.xlsx",
    "rechtspraak_metadata_2024.xlsx",
]:
    print(df_name)
    df = pd.read_excel(df_name)
    documents = []
    for index, row in df.iterrows():
        ecli = row["ecli"]
        fulltext = row["full_text"]

        chunks = token_splitter.split_text(fulltext)
        print(f"{ecli}: {len(chunks)}")
        for idx, chunk in enumerate(chunks):
            doc = Document(
                page_content=chunk,
                metadata={"title": ecli, "chunk_index": idx},
            )
            documents.append(doc)

    # ADD ALL CHUNKS TO ACS
    acs.add_documents(documents)
    print("Creation of new index successful.")

rechtspraak_metadata_2020.xlsx
ECLI:NL:RVS:2020:5: 7
ECLI:NL:GHARL:2020:214: 5
ECLI:NL:GHARL:2020:216: 5
ECLI:NL:GHARL:2020:220: 4
ECLI:NL:GHARL:2020:219: 4
ECLI:NL:HR:2020:54: 1
ECLI:NL:RBLIM:2020:401: 5
ECLI:NL:RVS:2020:252: 2
ECLI:NL:CRVB:2020:241: 3
ECLI:NL:RBNNE:2020:430: 3
ECLI:NL:RBOVE:2020:376: 3
ECLI:NL:RBROT:2020:904: 8
ECLI:NL:RVS:2020:383: 2
ECLI:NL:GHARL:2020:1106: 4
ECLI:NL:HR:2020:250: 1
ECLI:NL:RBNNE:2020:724: 2
ECLI:NL:RBOVE:2020:837: 10
ECLI:NL:RVS:2020:593: 3
ECLI:NL:CBB:2020:126: 6
ECLI:NL:CRVB:2020:537: 3
ECLI:NL:GHSHE:2020:888: 11
ECLI:NL:HR:2020:424: 1
ECLI:NL:CBB:2020:176: 3
ECLI:NL:GHSHE:2020:974: 2
ECLI:NL:OGEAC:2020:57: 3
ECLI:NL:CBB:2020:163: 11
ECLI:NL:CRVB:2020:678: 2
ECLI:NL:CRVB:2020:680: 4
ECLI:NL:CRVB:2020:744: 2
ECLI:NL:GHSHE:2020:1079: 3
ECLI:NL:CRVB:2020:763: 5
ECLI:NL:CBB:2020:210: 2
ECLI:NL:RBNNE:2020:1441: 2
ECLI:NL:RBNHO:2020:2533: 6
ECLI:NL:RBROT:2020:2985: 4
ECLI:NL:OGEAA:2020:193: 6
ECLI:NL:HR:2020:588: 1
ECLI:NL:RBNHO:2020:2651: 2
ECLI:NL:HR