## Creating an index and populating it with documents using Milvus and Nomic AI Embeddings

Simple example on how to ingest PDF documents, then web pages content into a Milvus VectorStore. . In this example, the embeddings are the fully open source ones released by NomicAI, [nomic-embed-text-v1](https://huggingface.co/nomic-ai/nomic-embed-text-v1).

As described in [this blog post](https://blog.nomic.ai/posts/nomic-embed-text-v1), those embeddings feature a "8192 context-length that outperforms OpenAI Ada-002 and text-embedding-3-small on both short and long context tasks". In additions, they are:

- Open source
- Open data
- Open training code
- Fully reproducible and auditable

Requirements:
- A Milvus instance, either standalone or cluster.

### Needed packages and imports

In [51]:
!pip install -q einops==0.7.0 langchain==0.1.9 pypdf==4.0.2 pymilvus==2.3.6 sentence-transformers==2.4.0


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip available: [0m[31;49m22.3.1[0m[39;49m -> [0m[32;49m24.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [52]:
import requests
import os
from langchain.document_loaders import PyPDFDirectoryLoader, WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import Milvus

### Base parameters, the Milvus connection info

In [53]:
MILVUS_HOST = "vectordb-milvus"
MILVUS_PORT = 19530
MILVUS_USERNAME = "root"
MILVUS_PASSWORD = "Milvus"
MILVUS_COLLECTION = "ocp_and_rhoai"

## Initial index creation and document ingestion

#### Download and load pdfs

## RHOAI

In [54]:
product_version = "2-latest"
documents = [
    "release_notes",
    "introduction_to_red_hat_openshift_ai",
    "getting_started_with_red_hat_openshift_ai_self-managed",
    "openshift_ai_tutorial_-_fraud_detection_example",
    "developing_a_model",
    "integrating_data_from_amazon_s3",
    "working_on_data_science_projects",
    "Working_with_distributed_workloads",
    "serving_models",
    "managing_users",
    "managing_resources",
    "installing_and_uninstalling_openshift_ai_self-managed",
    "installing_and_uninstalling_openshift_ai_self-managed_in_a_disconnected_environment",
    "upgrading_openshift_ai_self-managed",
    "upgrading_openshift_ai_self-managed_in_a_disconnected_environment",   
]

pdfs = [f"https://access.redhat.com/documentation/en-us/red_hat_openshift_ai_self-managed/{product_version}/pdf/{doc}/red_hat_openshift_ai_self-managed-{product_version}-{doc}-en-us.pdf" for doc in documents]
pdfs_to_urls_rhoai = {f"red_hat_openshift_ai_self-managed-{product_version}-{doc}-en-us": f"https://access.redhat.com/documentation/en-us/red_hat_openshift_ai_self-managed/{product_version}/html-single/{doc}/index" for doc in documents}

In [55]:
try:
    os.mkdir(f"rhoai-doc-{product_version}")
except:
    pass


for pdf in pdfs:
    try:
        response = requests.get(pdf)
    except:
        print(f"Skipped {pdf}")
        continue
    if response.status_code!=200:
        print(f"Skipped {pdf}")
        continue  
    with open(f"rhoai-doc-{product_version}/{pdf.split('/')[-1]}", 'wb') as f:
        f.write(response.content)

Skipped https://access.redhat.com/documentation/en-us/red_hat_openshift_ai_self-managed/2-latest/pdf/developing_a_model/red_hat_openshift_ai_self-managed-2-latest-developing_a_model-en-us.pdf


In [56]:
pdf_folder_path = f"./rhoai-doc-{product_version}"

pdf_loader = PyPDFDirectoryLoader(pdf_folder_path)
pdf_docs_rhoai = pdf_loader.load()

In [57]:
len(pdf_docs_rhoai)

498

## OpenShift

In [58]:
product_version = "4.15"
documents = [
    "about",
    "getting_started",
    "release_notes",
    "security_and_compliance",
    "architecture",
    "support",
    "installing",
    "Installing_OpenShift_Container_Platform_with_the_Assisted_Installer",
    "updating_clusters",
    "authentication_and_authorization",
    "networking",
    "registry",
    "postinstallation_configuration",
    "storage",
    "scalability_and_performance",
    "edge_computing",
    "migrating_from_version_3_to_4",
    "Migration_Toolkit_for_Containers",
    "backup_and_restore",
    "machine_management",
    "web_console",
    "hosted_control_planes",
    "cli_tools",
    "building_applications",
    "serverless",
    "images",
    "nodes",
    "operators",
    "specialized_hardware_and_driver_enablement",
    "Builds_using_BuildConfig",
    "jenkins",
    "monitoring",
    "logging",
    "distributed_tracing",
    "red_hat_build_of_opentelemetry",
    "network_observability",
    "power_monitoring",
    "cluster_observability_operator",
    "virtualization",
    "service_mesh",
    "Windows_Container_Support_for_OpenShift"  
]

In [59]:
pdfs_ocp = [f"https://access.redhat.com/documentation/de-de/openshift_container_platform/{product_version}/pdf/{doc}/OpenShift_Container_Platform-{product_version}-{doc}-en-us.pdf" for doc in documents]
pdfs_to_urls_ocp = {f"openshift_container_platform-{product_version}-{doc}-en-us": f"https://access.redhat.com/documentation/de-de/openshift_container_platform/{product_version}/html-single/{doc}/index" for doc in documents}

In [60]:
try:
    os.mkdir(f"ocp-doc-{product_version}")
except:
    pass

for pdf in pdfs_ocp:
    try:
        response = requests.get(pdf)
    except:
        print(f"Skipped {pdf}")
        continue
    if response.status_code!=200:
        print(f"Skipped {pdf}")
        continue  
    with open(f"ocp-doc-{product_version}/{pdf.split('/')[-1]}", 'wb') as f:
        f.write(response.content)

Skipped https://access.redhat.com/documentation/de-de/openshift_container_platform/4.15/pdf/Installing_OpenShift_Container_Platform_with_the_Assisted_Installer/OpenShift_Container_Platform-4.15-Installing_OpenShift_Container_Platform_with_the_Assisted_Installer-en-us.pdf


In [61]:
pdf_folder_path = f"./ocp-doc-{product_version}"

pdf_loader = PyPDFDirectoryLoader(pdf_folder_path)
pdf_docs_ocp = pdf_loader.load()

In [62]:
len(pdf_docs_ocp)

11638

## GitOps

In [63]:
product_version = "1.12"
documents = [
    "understanding_openshift_gitops",
    "release_notes",
    "installing_gitops",
    "removing_gitops",
    "argo_cd_instance",
    "access_control_and_user_management",
    "managing_resource_use",
    "argo_cd_applications",
    "argo_cd_application_sets",
    "declarative_cluster_configuration",
    "argo_rollouts",
    "security",
    "GitOps_workloads_on_infrastructure_nodes",
    "observability",
    "troubleshooting_issues"
]

In [64]:
pdfs_gitops = [f"https://access.redhat.com/documentation/en-us/red_hat_openshift_gitops/{product_version}/pdf/{doc}/Red_Hat_OpenShift_GitOps-{product_version}-{doc}-en-us.pdf" for doc in documents]
pdfs_to_urls_gitops = {f"red_hat_openshift_gitops-{product_version}-{doc}-en-us": f"https://access.redhat.com/documentation/en-us/red_hat_openshift_gitops/{product_version}/html-single/{doc}/index" for doc in documents}

In [65]:
try:
    os.mkdir(f"ocp-gitops-{product_version}")
except:
    pass

for pdf in pdfs_gitops:
    try:
        response = requests.get(pdf)
    except:
        print(f"Skipped {pdf}")
        continue
    if response.status_code!=200:
        print(f"Skipped {pdf}")
        continue  
    with open(f"ocp-gitops-{product_version}/{pdf.split('/')[-1]}", 'wb') as f:
        f.write(response.content)

In [66]:
pdf_folder_path = f"./ocp-gitops-{product_version}"

pdf_loader = PyPDFDirectoryLoader(pdf_folder_path)
pdf_docs_gitops = pdf_loader.load()

In [67]:
len(pdf_docs_gitops)

234

## Pipelines

In [68]:
product_version = "1.14"
documents = [
    "About_OpenShift_Pipelines",
    "installing_and_configuring",
    "Managing_performance_and_resource_use",
    "Creating_CICD_pipelines",
    "Pipelines_as_Code",
    "securing_openshift_pipelines",
    "observability_in_openshift_pipelines",
    "Custom_Tekton_Hub_instance"
]

In [69]:
pdfs_pipelines = [f"https://access.redhat.com/documentation/en-us/red_hat_openshift_pipelines/{product_version}/pdf/{doc}/Red_Hat_OpenShift_GitOps-{product_version}-{doc}-en-us.pdf" for doc in documents]
pdfs_to_urls_pipelines = {f"red_hat_openshift_pipelines-{product_version}-{doc}-en-us": f"https://access.redhat.com/documentation/en-us/red_hat_openshift_pipelines/{product_version}/html-single/{doc}/index" for doc in documents}

In [70]:
try:
    os.mkdir(f"ocp-pipelines-{product_version}")
except:
    pass

for pdf in pdfs_pipelines:
    try:
        response = requests.get(pdf)
    except:
        print(f"Skipped {pdf}")
        continue
    if response.status_code!=200:
        print(f"Skipped {pdf}")
        continue  
    with open(f"ocp-pipelines-{product_version}/{pdf.split('/')[-1]}", 'wb') as f:
        f.write(response.content)

In [71]:
pdf_folder_path = f"./ocp-gitops-{product_version}"

pdf_loader = PyPDFDirectoryLoader(pdf_folder_path)
pdf_docs_pipelines = pdf_loader.load()

#### Inject metadata

In [127]:
pdfs_to_urls = pdfs_to_urls_rhoai | pdfs_to_urls_ocp | pdfs_to_urls_gitops | pdfs_to_urls_pipelines
pdf_docs = pdf_docs_rhoai + pdf_docs_ocp + pdf_docs_gitops + pdf_docs_pipelines

In [128]:
len(pdf_docs)

12370

In [122]:
#from pathlib import Path

#for doc in pdf_docs:
#    doc.metadata["source"] = pdfs_to_urls[Path(doc.metadata["source"]).stem]

KeyError: 'https://access.redhat.com/documentation/en-us/red_hat_openshift_ai_self-managed/2-latest/html-single/introduction_to_red_hat_openshift_ai/index'

#### Load websites

In [125]:
websites = [
    "https://ai-on-openshift.io/getting-started/openshift/",
    "https://ai-on-openshift.io/getting-started/opendatahub/",
    "https://ai-on-openshift.io/getting-started/openshift-ai/",
    "https://ai-on-openshift.io/odh-rhoai/configuration/",
    "https://ai-on-openshift.io/odh-rhoai/custom-notebooks/",
    "https://ai-on-openshift.io/odh-rhoai/nvidia-gpus/",
    "https://ai-on-openshift.io/odh-rhoai/custom-runtime-triton/",
    "https://ai-on-openshift.io/odh-rhoai/openshift-group-management/",
    "https://ai-on-openshift.io/tools-and-applications/minio/minio/",
    "https://access.redhat.com/articles/7047935",
    "https://access.redhat.com/articles/rhoai-supported-configs",
]

In [126]:
website_loader = WebBaseLoader(websites)
website_docs = website_loader.load()

#### Merge both types of docs

In [129]:
docs = pdf_docs + website_docs

In [130]:
docs[15]

Document(page_content='. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .\n. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 

#### Split documents into chunks with some overlap

In [131]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024,
                                               chunk_overlap=80)
all_splits = text_splitter.split_documents(docs)
all_splits[15]

Document(page_content='be used, so determining which versions of what libraries to use can be very challenging. OpenShift AI\nincludes many packaged notebook images that have been built with insight from data scientists and\nrecommendation engines. You can start new projects quickly on the right foot without worrying\nabout downloading unproven and possibly insecure images from random upstream repositories.\nCustom notebooks\nIn addition to notebook images provided and supported by Red Hat and independent software\nvendors (ISVs), you can configure custom notebook images that cater to your project’s specific\nrequirements.\nData science pipelines\nOpenShift AI supports Data Science Pipelines 2.0, for an efficient way of running your data science\nworkloads. You can standardize and automate machine learning workflows that enable you to\ndevelop and deploy your data science models.\nModel serving\nAs a data scientist, you can deploy your trained machine-learning models to serve intellige

#### Create the index and ingest the documents

In [132]:
# If you don't want to use a GPU, you can remove the 'device': 'cuda' argument
model_kwargs = {'trust_remote_code': True, 'device': 'cpu'}
embeddings = HuggingFaceEmbeddings(
    model_name="nomic-ai/nomic-embed-text-v1",
    model_kwargs=model_kwargs,
    show_progress=True
)

You try to use a model that was created with version 2.4.0.dev0, however, your version is 2.4.0. This might cause unexpected behavior or errors. In that case, try to update to the latest version.



<All keys matched successfully>


In [133]:
db = Milvus(
    embedding_function=embeddings,
    connection_args={"host": MILVUS_HOST, "port": MILVUS_PORT, "user": MILVUS_USERNAME, "password": MILVUS_PASSWORD},
    collection_name=MILVUS_COLLECTION,
    metadata_field="metadata",
    text_field="page_content",
    auto_id=True,
    drop_old=True
    )

In [134]:
db.add_documents(all_splits)

Batches:   0%|          | 0/915 [00:00<?, ?it/s]

[450192143387062363,
 450192143387062364,
 450192143387062365,
 450192143387062366,
 450192143387062367,
 450192143387062368,
 450192143387062369,
 450192143387062370,
 450192143387062371,
 450192143387062372,
 450192143387062373,
 450192143387062374,
 450192143387062375,
 450192143387062376,
 450192143387062377,
 450192143387062378,
 450192143387062379,
 450192143387062380,
 450192143387062381,
 450192143387062382,
 450192143387062383,
 450192143387062384,
 450192143387062385,
 450192143387062386,
 450192143387062387,
 450192143387062388,
 450192143387062389,
 450192143387062390,
 450192143387062391,
 450192143387062392,
 450192143387062393,
 450192143387062394,
 450192143387062395,
 450192143387062396,
 450192143387062397,
 450192143387062398,
 450192143387062399,
 450192143387062400,
 450192143387062401,
 450192143387062402,
 450192143387062403,
 450192143387062404,
 450192143387062405,
 450192143387062406,
 450192143387062407,
 450192143387062408,
 450192143387062409,
 450192143387

#### Alternatively, add new documents

In [None]:
# If you don't want to use a GPU, you can remove the 'device': 'cuda' argument
# model_kwargs = {'trust_remote_code': True, 'device': 'cuda'}
# embeddings = HuggingFaceEmbeddings(
#     model_name="nomic-ai/nomic-embed-text-v1",
#     model_kwargs=model_kwargs,
#     show_progress=True
# )

# db = Milvus(
#     embedding_function=embeddings,
#     connection_args={"host": MILVUS_HOST, "port": MILVUS_PORT, "user": MILVUS_USERNAME, "password": MILVUS_PASSWORD},
#     collection_name=MILVUS_COLLECTION,
#     metadata_field="metadata",
#     text_field="page_content",
#     auto_id=True,
#     drop_old=False
#     )

# db.add_documents(all_splits)

#### Test query

In [135]:
query = "How can I work with GPU and taints in OpenShift AI?"
docs_with_score = db.similarity_search_with_score(query)

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

In [136]:
for doc, score in docs_with_score:
    print("-" * 80)
    print("Score: ", score)
    print(doc.page_content)
    print("-" * 80)

--------------------------------------------------------------------------------
Score:  0.4466228783130646
But don't worry, OpenShift AI and Open Data Hub take care of this part for you when you launch notebooks, workbenches, model servers, or pipeline runtimes!
Installation
Here is the documentation you can follow:

OpenShift AI documentation
NVIDIA documentation (more detailed)

Advanced configuration
Working with taints
In many cases, you will want to restrict access to GPUs, or be able to provide choice between different types of GPUs: simply stating "I want a GPU" is not enough. Also, if you want to make sure that only the Pods requiring GPUs end up on GPU-enabled nodes (and not other Pods that just end up being there at random because that's how Kubernetes works...), you're at the right place!
The only supported method at the moment to achieve this is to taint nodes, then apply tolerations on the Pods depending on where you want them scheduled. If you don't pay close attention t