## Creating an index and populating it with documents using Milvus

A simple example showing how to ingest PDF documents, followed by web page content, into a Milvus vector store.

### Requirements

- A **Milvus** vector database, set up according to [these instructions](../../../vector-databases/milvus/README.md).

- Connection credentials to Milvus must be available as environment variables:

  - `MILVUS_USERNAME`

  - `MILVUS_PASSWORD`

- Update the **MILVUS_HOST**, **MILVUS_PORT**, and **MILVUS_COLLECTION** in this notebook to match your deployment settings.
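
Before running the rest of the notebook, it can help to confirm that the credentials are actually visible to the kernel. The following is a minimal sketch; the helper name `missing_env_vars` is hypothetical and not part of the original notebook, and treating an empty string as "missing" is an assumption.

```python
import os

# Variable names taken from the requirements above.
REQUIRED_VARS = ["MILVUS_USERNAME", "MILVUS_PASSWORD"]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All Milvus credentials are set.")
```

Reporting the missing names instead of raising keeps the check non-blocking; raise an exception instead if you prefer the notebook to stop early.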

### Needed packages and imports

In [1]:
!pip install -q einops==0.7.0 langchain==0.1.9 pypdf==4.0.2 pymilvus==2.3.6 sentence-transformers==2.4.0

In [2]:
import os

import requests
# Community integrations are imported from langchain_community,
# consistent with the Milvus import below.
from langchain_community.document_loaders import PyPDFDirectoryLoader, WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Milvus

### Base parameters: Milvus connection info

In [3]:
# Replace values according to your Milvus deployment
MILVUS_HOST = "milvus-service"
# MILVUS_HOST = "milvus-service.<your-namespace>.svc.cluster.local"
MILVUS_PORT = 19530
MILVUS_USERNAME = os.getenv('MILVUS_USERNAME')
MILVUS_PASSWORD = os.getenv('MILVUS_PASSWORD')
MILVUS_COLLECTION = "demo_collection"

## Initial index creation and document ingestion

#### Download and load PDFs

In [4]:
product_version = "2.13"
documents = [
    "release_notes",
    "introduction_to_red_hat_openshift_ai",
    "getting_started_with_red_hat_openshift_ai_self-managed",
    "openshift_ai_tutorial_-_fraud_detection_example",
    "developing_a_model",
    "integrating_data_from_amazon_s3",
    "working_on_data_science_projects",
    "serving_models",
    "monitoring_data_science_models",
    "managing_users",
    "managing_resources",
    "installing_and_uninstalling_openshift_ai_self-managed",
    "installing_and_uninstalling_openshift_ai_self-managed_in_a_disconnected_environment",
    "upgrading_openshift_ai_self-managed",
    "upgrading_openshift_ai_self-managed_in_a_disconnected_environment",   
]

pdfs = [f"https://access.redhat.com/documentation/en-us/red_hat_openshift_ai_self-managed/{product_version}/pdf/{doc}/red_hat_openshift_ai_self-managed-{product_version}-{doc}-en-us.pdf" for doc in documents]
pdfs_to_urls = {f"red_hat_openshift_ai_self-managed-{product_version}-{doc}-en-us": f"https://access.redhat.com/documentation/en-us/red_hat_openshift_ai_self-managed/{product_version}/html-single/{doc}/index" for doc in documents}

In [5]:
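docs_dir = f"rhoai-doc-{product_version}"

# Create the target directory if it does not exist yet
os.makedirs(docs_dir, exist_ok=True)

for pdf in pdfs:
    try:
        # A timeout keeps the loop from hanging on an unresponsive server
        response = requests.get(pdf, timeout=60)
    except requests.RequestException:
        print(f"Skipped {pdf}")
        continue
    if response.status_code != 200:
        print(f"Skipped {pdf}")
        continue
    # Save the PDF under its original file name
    with open(f"{docs_dir}/{pdf.split('/')[-1]}", 'wb') as f:
        f.write(response.content)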

Skipped https://access.redhat.com/documentation/en-us/red_hat_openshift_ai_self-managed/2.13/pdf/developing_a_model/red_hat_openshift_ai_self-managed-2.13-developing_a_model-en-us.pdf
Skipped https://access.redhat.com/documentation/en-us/red_hat_openshift_ai_self-managed/2.13/pdf/integrating_data_from_amazon_s3/red_hat_openshift_ai_self-managed-2.13-integrating_data_from_amazon_s3-en-us.pdf
Skipped https://access.redhat.com/documentation/en-us/red_hat_openshift_ai_self-managed/2.13/pdf/monitoring_data_science_models/red_hat_openshift_ai_self-managed-2.13-monitoring_data_science_models-en-us.pdf
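
Because skipped downloads are only printed, it may be useful to compute which expected PDFs are absent from the local directory. The sketch below assumes the same naming rule as the download cell (the file name is the last path segment of the URL); `missing_pdfs` and the `example.com` URLs are hypothetical and used only for illustration.

```python
import os
import tempfile


def missing_pdfs(urls, directory):
    """Return the URLs whose target file is not present in `directory`."""
    present = set(os.listdir(directory)) if os.path.isdir(directory) else set()
    return [url for url in urls if url.split("/")[-1] not in present]


# Self-contained demonstration using a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "a.pdf"), "wb").close()  # pretend a.pdf was downloaded
    demo_urls = ["https://example.com/docs/a.pdf", "https://example.com/docs/b.pdf"]
    demo_missing = missing_pdfs(demo_urls, tmp)

print(demo_missing)
```

In the notebook itself, the equivalent call would be `missing_pdfs(pdfs, docs_dir)`.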


In [6]:
pdf_folder_path = f"./rhoai-doc-{product_version}"

pdf_loader = PyPDFDirectoryLoader(pdf_folder_path)
pdf_docs = pdf_loader.load()

#### Inject metadata

In [7]:
from pathlib import Path

for doc in pdf_docs:
    doc.metadata["source"] = pdfs_to_urls[Path(doc.metadata["source"]).stem]
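
The lookup above relies on each key of `pdfs_to_urls` being identical to the downloaded file name without its `.pdf` extension. The short sketch below rebuilds one entry by hand to illustrate this; the variable names `base`, `key`, `local_path`, and `stem` are introduced here only for illustration.

```python
from pathlib import Path

product_version = "2.13"
doc = "release_notes"
base = "https://access.redhat.com/documentation/en-us/red_hat_openshift_ai_self-managed"

# Same patterns as in the download cell.
pdf_url = f"{base}/{product_version}/pdf/{doc}/red_hat_openshift_ai_self-managed-{product_version}-{doc}-en-us.pdf"
key = f"red_hat_openshift_ai_self-managed-{product_version}-{doc}-en-us"
html_url = f"{base}/{product_version}/html-single/{doc}/index"

# The loader records the local file path as the "source" metadata.
local_path = f"rhoai-doc-{product_version}/{pdf_url.split('/')[-1]}"

# Path.stem removes only the final suffix, so the dot in "2.13" is preserved.
stem = Path(local_path).stem
print(stem == key)
print(html_url)
```

If the directory may contain PDFs that are not listed in `documents`, consider `pdfs_to_urls.get(stem, original_source)` to avoid a `KeyError`.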

#### Load web pages in addition to the PDF files

In [8]:
websites = [
    "https://ai-on-openshift.io/getting-started/openshift/",
    "https://ai-on-openshift.io/getting-started/opendatahub/",
    "https://ai-on-openshift.io/getting-started/openshift-ai/",
    "https://ai-on-openshift.io/odh-rhoai/configuration/",
    "https://ai-on-openshift.io/odh-rhoai/custom-notebooks/",
    "https://ai-on-openshift.io/odh-rhoai/nvidia-gpus/",
    "https://ai-on-openshift.io/odh-rhoai/custom-runtime-triton/",
    "https://ai-on-openshift.io/odh-rhoai/openshift-group-management/",
    "https://ai-on-openshift.io/tools-and-applications/minio/minio/",
    "https://access.redhat.com/articles/7047935",
    "https://access.redhat.com/articles/rhoai-supported-configs",
]

In [9]:
website_loader = WebBaseLoader(websites)
website_docs = website_loader.load()

#### Merge the PDF and website documents

In [10]:
docs = pdf_docs + website_docs

#### Split documents into chunks with some overlap

In [11]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024,
                                               chunk_overlap=128)
all_splits = text_splitter.split_documents(docs)
all_splits[0]

Document(page_content='Red Hat OpenShift AI Self-Managed\n2.13\nWorking on data science projects\nOrganize your work in projects and workbenches, create and collaborate on\nnotebooks, train and deploy models, configure model servers, and implement\npipelines\nLast Updated: 2024-09-16', metadata={'source': 'https://access.redhat.com/documentation/en-us/red_hat_openshift_ai_self-managed/2.13/html-single/working_on_data_science_projects/index', 'page': 0})
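
To build intuition for `chunk_size` and `chunk_overlap`, here is a deliberately simplified sliding-window sketch. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which additionally tries to break on separators such as paragraphs and lines; `sliding_chunks` is a hypothetical helper that only illustrates how consecutive chunks share their boundary characters.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split `text` into fixed-size windows that overlap by `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    # Stop once the remaining text would already be covered by the previous overlap.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


chunks = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk starts with the last character of the previous one, which is what lets a sentence that straddles a boundary still appear in full in at least one chunk when the overlap is large enough.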

#### Create the index and ingest the documents in Milvus

Note that the results appear in Milvus only after the `add_documents` operation has completed.

In [12]:
# To use a GPU, pass 'device': 'cuda', provided your workbench was started with a GPU accelerator.
# model_kwargs = {'device': 'cuda'}
model_kwargs = {}
embeddings = HuggingFaceEmbeddings(
    model_kwargs=model_kwargs,
    show_progress=True
)

db = Milvus(
    embedding_function=embeddings,
    connection_args={"host": MILVUS_HOST, "port": MILVUS_PORT, "user": MILVUS_USERNAME, "password": MILVUS_PASSWORD},
    collection_name=MILVUS_COLLECTION,
    metadata_field="metadata",
    text_field="page_content",
    auto_id=True,
    drop_old=False
)

db.add_documents(all_splits)

2025-03-18 01:45:13.368476: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


Batches:   0%|          | 0/36 [00:00<?, ?it/s]

[456719310776465987,
 456719310776465988,
 456719310776465989,
 456719310776465990,
 456719310776465991,
 456719310776465992,
 456719310776465993,
 456719310776465994,
 456719310776465995,
 456719310776465996,
 456719310776465997,
 456719310776465998,
 456719310776465999,
 456719310776466000,
 456719310776466001,
 456719310776466002,
 456719310776466003,
 456719310776466004,
 456719310776466005,
 456719310776466006,
 456719310776466007,
 456719310776466008,
 456719310776466009,
 456719310776466010,
 456719310776466011,
 456719310776466012,
 456719310776466013,
 456719310776466014,
 456719310776466015,
 456719310776466016,
 456719310776466017,
 456719310776466018,
 456719310776466019,
 456719310776466020,
 456719310776466021,
 456719310776466022,
 456719310776466023,
 456719310776466024,
 456719310776466025,
 456719310776466026,
 456719310776466027,
 456719310776466028,
 456719310776466029,
 456719310776466030,
 456719310776466031,
 456719310776466032,
 456719310776466033,
 ...

#### Test query

In [13]:
query = "How can I work with GPU and taints in OpenShift AI?"
docs_with_score = db.similarity_search_with_score(query)

Batches:   0%|          | 0/1 [00:00<?, ?it/s]
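
`similarity_search_with_score` returns a list of `(document, score)` pairs. If the same documents have been ingested more than once (for example by re-running the ingestion cell while `drop_old=False`), identical chunks can be returned several times. The following is a minimal sketch of removing such duplicates by text content; `dedupe_by_content` and `FakeDoc` are hypothetical names, and `FakeDoc` merely stands in for LangChain's `Document` so the example runs on its own.

```python
from collections import namedtuple


def dedupe_by_content(results):
    """Keep the first (doc, score) pair for each distinct page_content, preserving order."""
    seen = set()
    unique = []
    for doc, score in results:
        if doc.page_content in seen:
            continue
        seen.add(doc.page_content)
        unique.append((doc, score))
    return unique


# Stand-in for langchain's Document, used only for this demonstration.
FakeDoc = namedtuple("FakeDoc", ["page_content"])
sample = [(FakeDoc("a"), 0.59), (FakeDoc("a"), 0.59), (FakeDoc("b"), 0.71)]
unique_sample = dedupe_by_content(sample)
print([doc.page_content for doc, _ in unique_sample])
```

In the notebook, the equivalent call would be `dedupe_by_content(docs_with_score)`. Avoiding the duplicate ingestion in the first place is the more robust fix.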

In [14]:
for doc, score in docs_with_score:
    print("-" * 80)
    print("Score: ", score)
    print(doc.page_content)
    print("-" * 80)

--------------------------------------------------------------------------------
Score:  0.5976202487945557
Apply the taints you need to your Nodes or MachineSets, for example:
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  ...
spec:
  replicas: 1
  selector:
    ...
  template:
    ...
    spec:
      ...
      taints:
        - key: restrictedaccess
          value: "yes"
          effect: NoSchedule



Apply the relevant toleration to the NVIDIA Operator.


In the nvidia-gpu-operator namespace, get to the Installed Operator menu, open the NVIDIA GPU Operator settings, get to the ClusterPolicy tab, and edit the ClusterPolicy.
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Score:  0.5976202487945557
Apply the taints you need to your Nodes or MachineSets, for example:
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  ...
spec:
  r