<a href="https://colab.research.google.com/github/medulka/commercial-llms/blob/main/RAG_over_gaia_x_documents.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RAG search over Gaia-X Documents

A tool for asking questions about the Gaia-X documents.
Inspired by the showcase of Tobias Oberrauch (KI-Bundesverband): https://github.com/tobiasoberrauch/rag-showcase.

The PDFs of the Gaia-X documents are stored in my personal Google Drive and have to be updated occasionally.
In this version, the embeddings are created with the `sentence-transformers/all-MiniLM-L6-v2` model from the Hugging Face Hub, and the Qdrant vector database is used for storage.

In [2]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Tue Jul  2 09:51:38 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   68C    P8              11W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [3]:
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

Your runtime has 54.8 gigabytes of available RAM



In [4]:
!pip install langchain
!pip install langchain_qdrant
!pip install langchain_huggingface

!pip install pypdf
!pip install google-colab

!pip install qdrant-client



Load the PDF files from the personal Google Drive.

In [14]:
import os
from google.colab import drive
from langchain.document_loaders import PyPDFLoader
from rich import print

# mount the personal Google Drive
drive.mount("/content/drive")
path = "/content/drive/MyDrive/gaia-x-documents/"

# get a list of all the PDF files in the directory
pdf_files = [f for f in os.listdir(path) if f.endswith('.pdf')]


loaders = []
for pdf_file in pdf_files:
    loader = PyPDFLoader(os.path.join(path, pdf_file))
    loaders.append(loader)

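print(f"Created {len(loaders)} PDF loaders")

# Keep one list of chunks per PDF: documents[i] holds the chunks of pdf_files[i]
documents = []
for loader in loaders:
    document = loader.load_and_split()
    documents.append(document)

# Count the chunks of all PDFs, not only those of the last one
print(f"Loaded {sum(len(d) for d in documents)} chunks")
print("len_documents: ", len(documents))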

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
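
The loading cell keeps one list of chunks per PDF, so a per-file summary is easy to derive and helps spot PDFs that produced no text. A minimal sketch, assuming the two lists are aligned by position; `chunk_counts` is a hypothetical helper and the sample data are stand-ins for `pdf_files` and `documents`:

```python
from typing import Dict, List, Sequence


def chunk_counts(file_names: Sequence[str], chunks_per_file: Sequence[List]) -> Dict[str, int]:
    """Map each PDF file name to the number of chunks loaded from it."""
    if len(file_names) != len(chunks_per_file):
        raise ValueError("file_names and chunks_per_file must have the same length")
    return {name: len(chunks) for name, chunks in zip(file_names, chunks_per_file)}


# Stand-in data; in the notebook this would be chunk_counts(pdf_files, documents).
example = chunk_counts(["a.pdf", "b.pdf"], [["c1", "c2", "c3"], ["c1"]])
print(example)
print("total chunks:", sum(example.values()))
```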


Creating a collection in the Qdrant database.

In [6]:
from langchain_huggingface import HuggingFaceEmbeddings
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from langchain_qdrant import Qdrant


collection_name = "gaia-x-documents"
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

from google.colab import userdata

os.environ["QDRANT_URL"] = userdata.get('QDRANT_URL')
os.environ["QDRANT_API_KEY"] = userdata.get('QDRANT_API_KEY')

client = QdrantClient(url=userdata.get('QDRANT_URL'), api_key=userdata.get('QDRANT_API_KEY'))
print(client)

if not client.collection_exists(collection_name):
    # all-MiniLM-L6-v2 produces 384-dimensional embeddings
    client.create_collection(
        collection_name=collection_name,
        vectors_config=VectorParams(size=384, distance=Distance.COSINE),
    )
    print("Collection created")
else:
    print("Collection already exists")


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


<qdrant_client.qdrant_client.QdrantClient object at 0x7cdd80c8a170>
Collection already exists
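
The vector size given to `VectorParams` has to match the length of the vectors the embedding model produces. One way to avoid a mismatch is to measure the length once by embedding a short probe string. A minimal sketch, assuming the embedder exposes `embed_query` as LangChain embedding classes do; `embedding_dimension` and `FakeEmbedder` are hypothetical names, and the fake class only stands in for the real model so the example runs offline:

```python
from typing import List, Protocol


class SupportsEmbedQuery(Protocol):
    """Anything with an embed_query method, e.g. a LangChain embeddings object."""

    def embed_query(self, text: str) -> List[float]: ...


def embedding_dimension(embedder: SupportsEmbedQuery) -> int:
    """Return the vector length by embedding a short probe string."""
    return len(embedder.embed_query("dimension probe"))


class FakeEmbedder:
    """Stand-in used only for this example; returns a fixed-size vector."""

    def embed_query(self, text: str) -> List[float]:
        return [0.0] * 384


vector_size = embedding_dimension(FakeEmbedder())
print(vector_size)
```

In the notebook, `embedding_dimension(embeddings)` could then be passed as `size` when the collection is created.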


Populating the database.

In [15]:
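# Flatten the per-PDF chunk lists so that every PDF ends up in the collection.
# Calling from_documents once per PDF with force_recreate=True would wipe the
# collection on each call and keep only the last PDF.
all_chunks = [chunk for pdf_chunks in documents for chunk in pdf_chunks]

vector_store = Qdrant.from_documents(
    documents=all_chunks,
    embedding=embeddings,
    url=os.environ["QDRANT_URL"],
    api_key=os.environ["QDRANT_API_KEY"],
    collection_name=collection_name,
    force_recreate=True,  # rebuild the collection once, from scratch
)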

In [16]:
collection_info = client.get_collection(collection_name)
print(collection_info)

Query the database.

In [18]:
answer = vector_store.similarity_search("What is Trust Anchor?")
print(answer[0].page_content)
print("\npage:", answer[0].metadata["page"])
print("\nsource:", answer[0].metadata["source"])

In [19]:
search_result = vector_store.similarity_search_with_score("What is Gaia-X Trust Anchor?")
result, score = search_result[0]
print(result.page_content)
print(f"\nScore: {score}")
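
Because the collection is configured with `Distance.COSINE`, the score printed above should be a cosine similarity, where a higher value means a closer match. The sketch below only illustrates that measure on toy vectors in plain Python; `cosine_similarity` is a hypothetical helper, not what Qdrant runs internally:

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Same direction, different length: similarity close to 1.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
# Perpendicular vectors: similarity 0.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```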

In [13]:
from typing import List
from rich.console import Console
from rich.table import Table
from langchain_core.documents import Document

def create_table(title: str, documents: List[Document]):
    table = Table(title=title, show_lines=True)
    table.add_column("page")
    table.add_column("page_content")

    for document in documents:
        table.add_row(str(document.metadata["page"]), document.page_content)

    return table

similar_documents = vector_store.search("Was ist Gaia-X Trust Anchor?", "similarity")

console = Console()
console.print(create_table("similar_documents", similar_documents))
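
When showing results to a reader, a short source label is often more useful than the raw metadata. A minimal sketch, assuming the `page` metadata is zero-based (as PyPDFLoader appears to store it) and `source` holds the file path; `format_source`, `FakeDocument`, and the file name `example.pdf` are hypothetical and exist only for this example:

```python
import os
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class FakeDocument:
    """Stand-in for a LangChain Document so the example runs on its own."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def format_source(document) -> str:
    """Build a label like 'file.pdf, p. 5' from a document's metadata."""
    source = os.path.basename(str(document.metadata.get("source", "unknown source")))
    page = document.metadata.get("page")
    if page is None:
        return source
    # Assumption: page numbers are zero-based, so add 1 for display.
    return f"{source}, p. {int(page) + 1}"


doc = FakeDocument(
    "sample text",
    {"source": "/content/drive/MyDrive/gaia-x-documents/example.pdf", "page": 4},
)
label = format_source(doc)
print(label)
```

In the notebook, the same helper could be applied to each entry of `similar_documents`.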