## Creating an index and populating it with documents using Milvus and Nomic AI Embeddings

A simple example of how to ingest PDF documents, and then web page content, into a Milvus vector store. In this example, the embeddings are the fully open source ones released by Nomic AI, [nomic-embed-text-v1](https://huggingface.co/nomic-ai/nomic-embed-text-v1).

As described in [this blog post](https://blog.nomic.ai/posts/nomic-embed-text-v1), those embeddings feature an "8192 context-length that outperforms OpenAI Ada-002 and text-embedding-3-small on both short and long context tasks". In addition, they are:

- Open source
- Open data
- Open training code
- Fully reproducible and auditable

Requirements:
- A Milvus instance, either standalone or cluster.

### Needed packages and imports

In [1]:
!pip install -q einops==0.7.0 langchain==0.1.9 pypdf==4.0.2 pymilvus==2.3.6 sentence-transformers==2.4.0


In [2]:
import requests
import os
from langchain.document_loaders import PyPDFDirectoryLoader, WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import Milvus

### Base parameters, the Milvus connection info

In [21]:
MILVUS_HOST = "vectordb-milvus.ic-shared-milvus.svc.cluster.local"
MILVUS_PORT = 19530
MILVUS_USERNAME = "root"
MILVUS_PASSWORD = "Milvus"
MILVUS_COLLECTION = "collection_nomicai_embeddings"
HF_TOKEN = ""
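Hardcoding credentials in a notebook is convenient for a demo, but the same values can be read from the environment instead. The following is a minimal sketch, assuming hypothetical environment variable names (`MILVUS_HOST`, `MILVUS_PORT`, and so on, none of which come from the original notebook); the defaults mirror the values above, except for the password, which is deliberately left empty.

```python
import os

# Hypothetical environment variable names; adjust them to your deployment.
MILVUS_HOST = os.environ.get(
    "MILVUS_HOST", "vectordb-milvus.ic-shared-milvus.svc.cluster.local"
)
MILVUS_PORT = int(os.environ.get("MILVUS_PORT", "19530"))
MILVUS_USERNAME = os.environ.get("MILVUS_USERNAME", "root")
# No default for the secret: an empty value makes a missing variable obvious.
MILVUS_PASSWORD = os.environ.get("MILVUS_PASSWORD", "")
MILVUS_COLLECTION = os.environ.get(
    "MILVUS_COLLECTION", "collection_nomicai_embeddings"
)

# Same dictionary shape as the connection_args used later in this notebook.
connection_args = {
    "host": MILVUS_HOST,
    "port": MILVUS_PORT,
    "user": MILVUS_USERNAME,
    "password": MILVUS_PASSWORD,
}
```

With this approach the notebook can be shared without exposing the password, and the same cells work against different Milvus instances.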

## Initial index creation and document ingestion

#### Download and load pdfs

In [5]:
# Keep the version as a string so that releases such as "2.10" are not shortened to "2.1"
product_version = "2.12"
documents = [
    "release_notes/Red_Hat_OpenShift_AI_Self-Managed-2.12-Release_notes-en-US.pdf"
    #"introduction_to_red_hat_openshift_ai",
    #"getting_started_with_red_hat_openshift_ai_self-managed",
    #"openshift_ai_tutorial_-_fraud_detection_example",
    #"developing_a_model",
    #"integrating_data_from_amazon_s3",
    #"working_on_data_science_projects",
    #"serving_models",
    #"monitoring_data_science_models",
    #"managing_users",
    #"managing_resources",
    #"installing_and_uninstalling_openshift_ai_self-managed",
    #"installing_and_uninstalling_openshift_ai_self-managed_in_a_disconnected_environment",
    #"upgrading_openshift_ai_self-managed",
    #"upgrading_openshift_ai_self-managed_in_a_disconnected_environment",
]

pdfs = [f"https://docs.redhat.com/en-us/documentation/red_hat_openshift_ai_self-managed/{product_version}/pdf/{doc}" for doc in documents]
pdfs_to_urls = {f"red_hat_openshift_ai_self-managed-{product_version}-{doc}-en-us": f"https://access.redhat.com/documentation/en-us/red_hat_openshift_ai_self-managed/{product_version}/html-single/{doc}/index" for doc in documents}
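The metadata step later in this notebook looks up each PDF by the stem of its local file name, so the mapping keys need to match those stems exactly. The sketch below shows one way to build such a mapping; `stem_to_url` and `base_url` are illustrative names, and mapping each file back to its PDF download URL (rather than an HTML page) is an assumption made for the example.

```python
from pathlib import Path

product_version = "2.12"
documents = [
    "release_notes/Red_Hat_OpenShift_AI_Self-Managed-2.12-Release_notes-en-US.pdf"
]
base_url = (
    "https://docs.redhat.com/en-us/documentation/"
    f"red_hat_openshift_ai_self-managed/{product_version}/pdf"
)

# Key: file stem as it appears on disk after download.
# Value: the URL the file was downloaded from (assumption for this sketch).
stem_to_url = {Path(doc).stem: f"{base_url}/{doc}" for doc in documents}

# A loader typically reports the local path as the document source.
local_path = (
    "rhoai-doc-2.12/Red_Hat_OpenShift_AI_Self-Managed-2.12-Release_notes-en-US.pdf"
)
print(stem_to_url[Path(local_path).stem])
```

Deriving the key with `Path(...).stem` on both sides keeps the lookup independent of directory names and file extensions.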

In [6]:
docs_dir = f"rhoai-doc-{product_version}"

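# Create the target directory if it does not exist yet
os.makedirs(docs_dir, exist_ok=True)

for pdf in pdfs:
    try:
        response = requests.get(pdf, timeout=60)
    except requests.RequestException:
        # Network error or timeout: skip this document
        print(f"Skipped {pdf}")
        continue
    if response.status_code != 200:
        print(f"Skipped {pdf}")
        continue
    # Save the PDF under its original file name
    with open(os.path.join(docs_dir, pdf.split('/')[-1]), 'wb') as f:
        f.write(response.content)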

In [7]:
pdf_folder_path = f"./rhoai-doc-{product_version}"

pdf_loader = PyPDFDirectoryLoader(pdf_folder_path)
pdf_docs = pdf_loader.load()

#### Inject metadata

In [8]:
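from pathlib import Path

# Replace the local file path with the public URL when a mapping exists.
# The keys of pdfs_to_urls must match the stems of the downloaded files;
# when no key matches, the original local path is kept as the source.
for doc in pdf_docs:
    stem = Path(doc.metadata["source"]).stem
    doc.metadata["source"] = pdfs_to_urls.get(stem, doc.metadata["source"])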

#### Load websites

In [9]:
websites = [
    "https://ai-on-openshift.io/getting-started/openshift/",
    "https://ai-on-openshift.io/getting-started/opendatahub/",
    "https://ai-on-openshift.io/getting-started/openshift-ai/",
    "https://ai-on-openshift.io/odh-rhoai/configuration/",
    "https://ai-on-openshift.io/odh-rhoai/custom-notebooks/",
    "https://ai-on-openshift.io/odh-rhoai/nvidia-gpus/",
    "https://ai-on-openshift.io/odh-rhoai/custom-runtime-triton/",
    "https://ai-on-openshift.io/odh-rhoai/openshift-group-management/",
    "https://ai-on-openshift.io/tools-and-applications/minio/minio/",
    "https://access.redhat.com/articles/7047935",
    "https://access.redhat.com/articles/rhoai-supported-configs",
]

In [None]:
website_loader = WebBaseLoader(websites)
website_docs = website_loader.load()

#### Merge both types of docs

In [10]:
#docs = pdf_docs + website_docs
docs = pdf_docs

#### Split documents into chunks with some overlap

In [11]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024,
                                               chunk_overlap=40)
all_splits = text_splitter.split_documents(docs)
all_splits[0]

Document(page_content='Red Hat OpenShift AI Self-Managed\n \n2.12\nRelease notes\nFeatures, enhancements, resolved issues, and known issues associated with this\nrelease\nLast Updated: 2024-08-27', metadata={'source': 'rhoai-doc-2.12/Red_Hat_OpenShift_AI_Self-Managed-2.12-Release_notes-en-US.pdf', 'page': 0})
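To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a simplified, fixed-window illustration. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which additionally tries to break on separators such as paragraphs and newlines; `split_with_overlap` is a hypothetical helper written only for this explanation.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows; consecutive windows share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

The overlap means a sentence that straddles a chunk boundary is still partly present in both neighbours, which can help retrieval when the relevant passage sits at the edge of a chunk.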

#### Create the index and ingest the documents

In [28]:
# If you don't want to use a GPU, you can remove the 'device': 'cuda' argument
model_kwargs = {'trust_remote_code': True, 'device': 'cuda', 'token': HF_TOKEN}
embeddings = HuggingFaceEmbeddings(
    model_name="nomic-ai/nomic-embed-text-v1",
    model_kwargs=model_kwargs,
    show_progress=True
)


db = Milvus(
    embedding_function=embeddings,
    connection_args={"host": MILVUS_HOST, "port": MILVUS_PORT, "user": MILVUS_USERNAME, "password": MILVUS_PASSWORD},
    collection_name=MILVUS_COLLECTION,
    metadata_field="metadata",
    text_field="page_content",
    auto_id=True,
    drop_old=True
    )

You try to use a model that was created with version 2.4.0.dev0, however, your version is 2.4.0. This might cause unexpected behavior or errors. In that case, try to update to the latest version.



<All keys matched successfully>
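Rather than editing `model_kwargs` by hand depending on whether a GPU is present, the device can be chosen at runtime. The sketch below is one possible approach: `pick_device` and `build_model_kwargs` are hypothetical helpers, and leaving out the `token` entry when `HF_TOKEN` is empty is an assumption about what is convenient, not a requirement stated by the original notebook.

```python
def pick_device() -> str:
    """Return 'cuda' when PyTorch reports a usable GPU, otherwise 'cpu'."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment; fall back to CPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def build_model_kwargs(hf_token: str = "") -> dict:
    """Assemble the keyword arguments for the embedding model."""
    kwargs = {"trust_remote_code": True, "device": pick_device()}
    if hf_token:
        # Only pass a token when one was actually provided.
        kwargs["token"] = hf_token
    return kwargs


model_kwargs = build_model_kwargs()
print(model_kwargs)
```

The resulting dictionary can be passed as `model_kwargs` to `HuggingFaceEmbeddings` in the same way as the hand-written one above.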


In [29]:
db.add_documents(all_splits)

Batches: 100%|██████████| 3/3 [00:00<00:00,  4.99it/s]


[452171186136437710,
 452171186136437711,
 452171186136437712,
 452171186136437713,
 452171186136437714,
 452171186136437715,
 452171186136437716,
 452171186136437717,
 452171186136437718,
 452171186136437719,
 452171186136437720,
 452171186136437721,
 452171186136437722,
 452171186136437723,
 452171186136437724,
 452171186136437725,
 452171186136437726,
 452171186136437727,
 452171186136437728,
 452171186136437729,
 452171186136437730,
 452171186136437731,
 452171186136437732,
 452171186136437733,
 452171186136437734,
 452171186136437735,
 452171186136437736,
 452171186136437737,
 452171186136437738,
 452171186136437739,
 452171186136437740,
 452171186136437741,
 452171186136437742,
 452171186136437743,
 452171186136437744,
 452171186136437745,
 452171186136437746,
 452171186136437747,
 452171186136437748,
 452171186136437749,
 452171186136437750,
 452171186136437751,
 452171186136437752,
 452171186136437753,
 452171186136437754,
 452171186136437755,
 452171186136437756,
 452171186136

#### Alternatively, add new documents

In [24]:
# If you don't want to use a GPU, you can remove the 'device': 'cuda' argument
# model_kwargs = {'trust_remote_code': True, 'device': 'cuda'}
# embeddings = HuggingFaceEmbeddings(
#     model_name="nomic-ai/nomic-embed-text-v1",
#     model_kwargs=model_kwargs,
#     show_progress=True
# )

# db = Milvus(
#     embedding_function=embeddings,
#     connection_args={"host": MILVUS_HOST, "port": MILVUS_PORT, "user": MILVUS_USERNAME, "password": MILVUS_PASSWORD},
#     collection_name=MILVUS_COLLECTION,
#     metadata_field="metadata",
#     text_field="page_content",
#     auto_id=True,
#     drop_old=False
#     )

# db.add_documents(all_splits)
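Because the collection is created with `auto_id=True`, adding the same chunks a second time with `drop_old=False` stores them again under new IDs. One way to avoid such duplicates is to fingerprint each chunk before ingestion. The sketch below is an illustration only: `content_key` and `filter_new` are hypothetical helpers, chunks are represented as plain `(text, source)` pairs, and the set of already-seen keys would have to be persisted by you (nothing here queries Milvus).

```python
import hashlib


def content_key(text: str, source: str) -> str:
    """Stable fingerprint of a chunk, based on its source and its text."""
    return hashlib.sha256(f"{source}\n{text}".encode("utf-8")).hexdigest()


def filter_new(chunks, seen_keys):
    """Return only the (text, source) pairs whose fingerprint is not in seen_keys.

    seen_keys is updated in place so that later calls skip the same chunks.
    """
    fresh = []
    for text, source in chunks:
        key = content_key(text, source)
        if key not in seen_keys:
            seen_keys.add(key)
            fresh.append((text, source))
    return fresh


seen = set()
first_batch = [("Release notes", "a.pdf"), ("Known issues", "a.pdf")]
second_batch = [("Known issues", "a.pdf"), ("New features", "b.pdf")]

print(filter_new(first_batch, seen))   # both chunks are new
print(filter_new(second_batch, seen))  # only the chunk from b.pdf is new
```

With LangChain documents, the same idea applies by using `page_content` and `metadata["source"]` as the two inputs to the fingerprint.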

#### Test query

In [31]:
query = "How can I work with GPU and taints in OpenShift AI?"
docs_with_score = db.similarity_search_with_score(query)

Batches: 100%|██████████| 1/1 [00:00<00:00, 79.11it/s]
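`similarity_search_with_score` returns `(document, score)` pairs. The sketch below filters and orders such pairs, assuming the score is a distance where smaller means closer (as with an L2 metric); check the metric type of your collection before relying on this. The helper name `keep_close_matches`, the sample values, and the `0.8` threshold are all hypothetical.

```python
def keep_close_matches(results, max_distance):
    """Keep (item, score) pairs with score <= max_distance, closest first.

    Assumes the score is a distance: smaller values mean closer matches.
    """
    kept = [(item, score) for item, score in results if score <= max_distance]
    return sorted(kept, key=lambda pair: pair[1])


# Hypothetical results standing in for docs_with_score.
sample_results = [("chunk A", 0.615), ("chunk B", 0.93), ("chunk C", 0.71)]

close = keep_close_matches(sample_results, max_distance=0.8)
print(close)  # [('chunk A', 0.615), ('chunk C', 0.71)]
```

If the collection uses a similarity metric such as inner product instead, the comparison and the sort order would need to be reversed.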


In [32]:
for doc, score in docs_with_score:
    print("-" * 80)
    print("Score: ", score)
    print(doc.page_content)
    print("-" * 80)

--------------------------------------------------------------------------------
Score:  0.6150780916213989
For administrators, OpenShift AI enables data science workloads in an existing Red Hat OpenShift or
ROSA environment. Manage users with your existing OpenShift identity provider, and manage the
resources available to notebook servers to ensure data scientists have what they require to create, train,
and host models. Use accelerators to reduce costs and allow your data scientists to enhance the
performance of their end-to-end data science workflows using graphics processing units (GPUs) and
Intel Gaudi AI accelerators.
OpenShift AI has two deployment options:
Self-managed software
 that you can install on-premise or in the cloud. You can install
OpenShift AI Self-Managed in a self-managed environment such as OpenShift Container
Platform, or in Red Hat-managed cloud environments such as Red Hat OpenShift Dedicated
(with a Customer Cloud Subscription for AWS or GCP), Red Hat OpenShi