## Creating an index and populating it with documents using Milvus and Nomic AI Embeddings

A simple example of how to ingest PDF, DOCX, and PPTX documents into a Milvus vector store. In this example, the embeddings are the fully open source ones released by Nomic AI, [nomic-embed-text-v1](https://huggingface.co/nomic-ai/nomic-embed-text-v1).

As described in [this blog post](https://blog.nomic.ai/posts/nomic-embed-text-v1), those embeddings feature an "8192 context-length that outperforms OpenAI Ada-002 and text-embedding-3-small on both short and long context tasks". In addition, they are:

- Open source
- Open data
- Open training code
- Fully reproducible and auditable

Requirements:
- A Milvus instance, either standalone or cluster.

### Needed packages and imports

In [3]:
!pip install -q einops==0.7.0 langchain==0.1.9 pypdf==4.0.2 pymilvus==2.3.6 sentence-transformers==2.4.0 python-docx unstructured[docx,pptx] python-pptx docx2txt



In [4]:
import os
import requests
import docx2txt
from langchain.text_splitter import RecursiveCharacterTextSplitter
# All loaders, embeddings and vector stores are imported from langchain_community,
# which avoids the deprecated langchain.document_loaders / langchain.embeddings paths
from langchain_community.document_loaders import (
    Docx2txtLoader,
    PyPDFDirectoryLoader,
    PyPDFLoader,
    UnstructuredPowerPointLoader,
    WebBaseLoader,
)
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Milvus

### Base parameters, the Milvus connection info

In [5]:
MILVUS_HOST = "vectordb-milvus"
MILVUS_PORT = 19530
MILVUS_USERNAME = "root"
MILVUS_PASSWORD = "Milvus"
MILVUS_COLLECTION = "redhat_notes"
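
Hard-coding credentials in a notebook is convenient for a demo, but the same values can also be read from the environment. A minimal sketch, assuming environment variable names that mirror the constants above; the variable names and the `load_milvus_config` helper are illustrative and not part of the original notebook:

```python
import os

def load_milvus_config(env=os.environ):
    """Build Milvus connection settings, falling back to the notebook defaults."""
    return {
        "host": env.get("MILVUS_HOST", "vectordb-milvus"),
        "port": int(env.get("MILVUS_PORT", "19530")),
        "user": env.get("MILVUS_USERNAME", "root"),
        "password": env.get("MILVUS_PASSWORD", "Milvus"),
    }

config = load_milvus_config()
# Print everything except the password
print({k: v for k, v in config.items() if k != "password"})
```

The resulting dictionary uses the same keys as the `connection_args` passed to `Milvus(...)` later in this notebook.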

## Initial index creation and document ingestion

#### Load PDF files

In [6]:
pdf_folder_path = "../../knowledge_base_data"
# Create a list to store the loaded data from all files
all_data_pdfs = []
success_counter = 0

# Iterate over all files in the directory
for filename in os.listdir(pdf_folder_path):
    if filename.endswith(".pdf"):
        file_path = os.path.join(pdf_folder_path, filename)
        try:
            loader = PyPDFLoader(file_path)
            data = loader.load()
            all_data_pdfs.append(data)
            success_counter += 1
        except Exception as e:
            print(f"Error loading file '{filename}': {e}")
            continue  # Skip to the next iteration
print(f"Successfully loaded '{success_counter}' pdfs")

could not convert string to float: '0.00-10498687' : FloatObject (b'0.00-10498687') invalid; use 0.0 instead

Error loading file 'Red Hat ACS-technicalDeepDive-v3.72.pdf': Stream has ended unexpectedly


could not convert string to float: '0.00-10' : FloatObject (b'0.00-10') invalid; use 0.0 instead

Successfully loaded '406' pdfs


In [10]:
all_data_pdfs[0]

[Document(page_content='', metadata={'source': '../../knowledge_base_data/Copy of Red Hat Appendix 2 ISV Program Appendix.pdf', 'page': 0}),
 Document(page_content='', metadata={'source': '../../knowledge_base_data/Copy of Red Hat Appendix 2 ISV Program Appendix.pdf', 'page': 1}),
 Document(page_content='', metadata={'source': '../../knowledge_base_data/Copy of Red Hat Appendix 2 ISV Program Appendix.pdf', 'page': 2}),
 Document(page_content='', metadata={'source': '../../knowledge_base_data/Copy of Red Hat Appendix 2 ISV Program Appendix.pdf', 'page': 3}),
 Document(page_content='', metadata={'source': '../../knowledge_base_data/Copy of Red Hat Appendix 2 ISV Program Appendix.pdf', 'page': 4}),
 Document(page_content='', metadata={'source': '../../knowledge_base_data/Copy of Red Hat Appendix 2 ISV Program Appendix.pdf', 'page': 5}),
 Document(page_content='', metadata={'source': '../../knowledge_base_data/Copy of Red Hat Appendix 2 ISV Program Appendix.pdf', 'page': 6}),
 Document(pag

In [8]:
docs = []
for sublist_pdfs in all_data_pdfs:
    for subitem in sublist_pdfs:
        docs.append(subitem)

In [13]:
len(docs)

8596
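
The first PDF shown above yields pages whose `page_content` is empty (often a sign of scanned pages without a text layer, though that is an assumption here). Such pages add nothing to the index. A minimal sketch for dropping them before splitting; `drop_empty` is a hypothetical helper, and `SimpleNamespace` objects stand in for LangChain `Document` instances so the example runs on its own:

```python
from types import SimpleNamespace

def drop_empty(documents):
    """Keep only documents whose page_content contains non-whitespace text."""
    return [d for d in documents if d.page_content and d.page_content.strip()]

# Stand-ins for Document objects (assumption: only .page_content is needed)
sample = [
    SimpleNamespace(page_content="", metadata={"page": 0}),
    SimpleNamespace(page_content="  \n", metadata={"page": 1}),
    SimpleNamespace(page_content="Embedded Subscriptions", metadata={"page": 2}),
]
kept = drop_empty(sample)
print(f"Kept {len(kept)} of {len(sample)} documents")
```

In the notebook this could be applied as `docs = drop_empty(docs)` before the splitting step.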

#### Load docx Files

In [14]:
docx_folder_path = "../../knowledge_base_data"

# Create a list to store the loaded data from all files
all_data_docx = []
success_counter = 0

# Iterate over all files in the directory
for filename in os.listdir(docx_folder_path):
    if filename.endswith(".docx"):
        file_path = os.path.join(docx_folder_path, filename)
        try:
            loader = Docx2txtLoader(file_path)
            data = loader.load()
            all_data_docx.append(data)
            success_counter += 1
        except Exception as e:
            print(f"Error loading file '{filename}': {e}")
            continue  # Skip to the next iteration
print(f"Successfully loaded '{success_counter}' documents")

Successfully loaded '410' documents


In [15]:
for sublist_docx in all_data_docx:
    for subitem in sublist_docx:
        docs.append(subitem)

In [16]:
len(docs)

9006

#### Load pptx files

In [17]:
pptx_folder_path = "../../knowledge_base_data"

# Create a list to store the loaded data from all files
all_data_pptx = []
success_counter = 0

# Iterate over all files in the directory
for filename in os.listdir(pptx_folder_path):
    if filename.endswith(".pptx"):
        file_path = os.path.join(pptx_folder_path, filename)
        try:
            loader = UnstructuredPowerPointLoader(file_path)
            data = loader.load()
            all_data_pptx.append(data)
            success_counter += 1
        except Exception as e:
            print(f"Error loading file '{filename}': {e}")
            continue  # Skip to the next iteration
print(f"Successfully loaded '{success_counter}' documents")

Error loading file 'An open source approach to shipping VM based Siemens solutions_.pptx': invalid character in attribute value, line 1, column 1014 (<string>, line 1)
Error loading file 'RH Edge Computing for Siemens Energy January 2023.pptx': invalid character in attribute value, line 1, column 1014 (<string>, line 1)
Error loading file 'Siemens DEMA and Red Hat.pptx': invalid character in attribute value, line 1, column 1014 (<string>, line 1)
Error loading file 'Siemens OSS Days 2023.pptx': invalid character in attribute value, line 1, column 1014 (<string>, line 1)
Successfully loaded '241' documents


In [18]:
for sublist_pptx in all_data_pptx:
    for subitem in sublist_pptx:
        docs.append(subitem)

In [19]:
print(f"Loaded '{len(docs)}' files in total")

Loaded '9247' files in total
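
The three loading cells above differ only in the file extension and the loader class. A sketch of a shared helper, assuming every loader takes a file path in its constructor and exposes `load()`, as the loaders used above do; `load_folder` and the `FakeLoader` stand-in are hypothetical and exist only to keep the example self-contained:

```python
import os
import tempfile

def load_folder(folder, suffix, loader_cls):
    """Load every file in `folder` ending with `suffix`.

    Returns (docs, failed_filenames); docs is already a flat list.
    """
    docs, failed = [], []
    for filename in sorted(os.listdir(folder)):
        if not filename.endswith(suffix):
            continue
        try:
            docs.extend(loader_cls(os.path.join(folder, filename)).load())
        except Exception as e:
            print(f"Error loading file '{filename}': {e}")
            failed.append(filename)
    return docs, failed

class FakeLoader:
    """Stand-in for PyPDFLoader and friends: one 'document' per line of the file."""
    def __init__(self, path):
        self.path = path

    def load(self):
        with open(self.path, encoding="utf-8") as f:
            lines = f.read().splitlines()
        if not lines:
            raise ValueError("Stream has ended unexpectedly")
        return lines

with tempfile.TemporaryDirectory() as tmp:
    for name, text in [("a.pdf", "p1\np2"), ("b.pdf", ""), ("c.docx", "x")]:
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as f:
            f.write(text)
    docs_demo, failed_demo = load_folder(tmp, ".pdf", FakeLoader)

print(docs_demo, failed_demo)
```

With the real loaders, a call such as `load_folder(pdf_folder_path, ".pdf", PyPDFLoader)` returns an already flattened list, so the separate nested-loop cells are no longer needed, and the list of failed file names is kept for later inspection.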


#### Split documents into chunks with some overlap

In [20]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024,
                                               chunk_overlap=120)
all_splits = text_splitter.split_documents(docs)
all_splits[15]

Document(page_content='(https://access.redhat.com/solutions/5181471). For clarity, NFR subscriptions are made \navailable for the purposes of (a) integration testing, (b) certification,  (c) recreating End User \nissues, (d) troubleshooting and (e) training Siemens and End User’s personnel with respect to \nthe Red Hat Products contained in the Integrated Products.  \n \n3.4. Embedded Subscriptions  \n \nDuring the Term of the Program and in consideration for the Embedded Subscription fees paid \nto Red Hat or an authorized Red Hat channel partner for each Unit of Software (as defined \nbelow and/or in an Embedded Order Form) and for active Embedded Subscriptions, (a) Red Hat \nwill provide Siemens with access to  the Software, access to the Updates and the Partner \nSupport (“ Embedded Subscription ”) and (b) Siemens may (i) distribute the Software as an \nIntegrated Product to its end customers (“ End Users ”) and (ii) distribute the Updates, if and', metadata={'source': '../../knowl
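
To make `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified fixed-window splitter. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which prefers to break on separators such as paragraph and line breaks; it only illustrates how consecutive chunks share text. The `sliding_chunks` function is hypothetical:

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks

text = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(text, chunk_size=10, chunk_overlap=3)
for c in chunks:
    print(c)
```

Each chunk starts with the last three characters of the previous one. With the notebook's settings, the same idea applies with windows of up to 1024 characters and up to 120 characters of shared context.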

#### Create the index and ingest the documents

In [21]:
# If you don't want to use a GPU, remove the 'device': 'cuda' argument:
# model_kwargs = {'trust_remote_code': True}
model_kwargs = {'trust_remote_code': True, 'device': 'cuda'}
embeddings = HuggingFaceEmbeddings(
    model_name="nomic-ai/nomic-embed-text-v1",
    model_kwargs=model_kwargs,
    show_progress=True
)

You try to use a model that was created with version 2.4.0.dev0, however, your version is 2.4.0. This might cause unexpected behavior or errors. In that case, try to update to the latest version.



<All keys matched successfully>


In [22]:
db = Milvus(
    embedding_function=embeddings,
    connection_args={"host": MILVUS_HOST, "port": MILVUS_PORT, "user": MILVUS_USERNAME, "password": MILVUS_PASSWORD},
    collection_name=MILVUS_COLLECTION,
    metadata_field="metadata",
    text_field="page_content",
    auto_id=True,
    drop_old=True
    )

In [24]:
db.add_documents(all_splits)

Batches:   0%|          | 0/755 [00:00<?, ?it/s]

RPC error: [batch_insert], <DataNotMatchException: (code=1, message=The Input data type is inconsistent with defined schema, please check it.)>, <Time:{'RPC start': '2024-05-07 10:42:30.482060', 'RPC error': '2024-05-07 10:42:30.483596'}>
Failed to insert batch starting at entity: 15000/24138


DataNotMatchException: <DataNotMatchException: (code=1, message=The Input data type is inconsistent with defined schema, please check it.)>
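
The exception above reports only the starting offset of the failing batch (15000 of 24138), not the offending record. One way to narrow this down is to insert in smaller batches and record which ranges fail. A sketch under stated assumptions: `insert_in_batches` is a hypothetical helper, and `add_fn` stands for any callable that inserts a list of documents, for example `db.add_documents`. A fake insert function is used here so the example runs without a Milvus instance:

```python
def insert_in_batches(items, add_fn, batch_size=1000):
    """Call add_fn on consecutive slices; return (inserted_count, failed_ranges)."""
    inserted, failed = 0, []
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        try:
            add_fn(batch)
            inserted += len(batch)
        except Exception as e:
            # Record the half-open index range so it can be inspected or retried
            failed.append((start, start + len(batch), repr(e)))
    return inserted, failed

def fake_add(batch):
    """Stand-in for a vector store insert: rejects any batch containing None."""
    if any(item is None for item in batch):
        raise ValueError("input data type is inconsistent with schema")

items = ["a", "b", "c", None, "e", "f", "g"]
inserted, failed = insert_in_batches(items, fake_add, batch_size=2)
print(inserted, failed)
```

Re-running the helper on a failed range with a smaller `batch_size` narrows the search down to individual documents, whose `page_content` and `metadata` can then be inspected.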

#### Alternatively, add new documents

In [None]:
# If you don't want to use a GPU, you can remove the 'device': 'cuda' argument
# model_kwargs = {'trust_remote_code': True, 'device': 'cuda'}
# embeddings = HuggingFaceEmbeddings(
#     model_name="nomic-ai/nomic-embed-text-v1",
#     model_kwargs=model_kwargs,
#     show_progress=True
# )

# db = Milvus(
#     embedding_function=embeddings,
#     connection_args={"host": MILVUS_HOST, "port": MILVUS_PORT, "user": MILVUS_USERNAME, "password": MILVUS_PASSWORD},
#     collection_name=MILVUS_COLLECTION,
#     metadata_field="metadata",
#     text_field="page_content",
#     auto_id=True,
#     drop_old=False
#     )

# db.add_documents(all_splits)

#### Test query

In [25]:
query = "Who is Anke Fritzenkoetter?"
docs_with_score = db.similarity_search_with_score(query)

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

In [26]:
for doc, score in docs_with_score:
    print("-" * 80)
    print("Score: ", score)
    print(doc.page_content)
    print("-" * 80)

--------------------------------------------------------------------------------
Score:  0.5264995098114014
ISV Business (Anke 
Fritzenkoetter)
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Score:  0.9638674259185791
rer der Welt, steckt im Ausgebremst
ZF FRIEDRICHSHAFEN  Der Zulieferer will Ballast abwerfen. 
Auch der „People Mover“ steht zum Verkauf.
Aussteigen, bitte: ZF Friedrichshafen will sein Shuttle-Geschäft loswerden
Vor dem Move: 
der neue ZF-Chef 
Holger Klein
Fotos: Robert Hoernig, Felix Kästle
2022-09__mm__All__Namen und Nachrichten__114771__2209_NNAufmacher-010__013   132022-09__mm__All__Namen und Nachrichten__114771__2209_NNAufmacher-010__013   13 22.08.2022   09:16:4322.08.2022   09:16:43
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Score:  
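
In the output above, the passage that actually mentions the name has the lowest score (about 0.53), while an unrelated passage scores about 0.96. That is consistent with the score being a distance, where lower means closer. This sketch assumes so, which should be confirmed against the metric type configured for the collection. `filter_by_distance` and the 0.7 cut-off are illustrative choices, and plain tuples stand in for the `(Document, score)` pairs returned by `similarity_search_with_score`:

```python
def filter_by_distance(results, max_distance):
    """Keep (doc, score) pairs with score <= max_distance, closest first."""
    kept = [(doc, score) for doc, score in results if score <= max_distance]
    return sorted(kept, key=lambda pair: pair[1])

# Stand-ins for (Document, score) pairs, using two scores from the output above
results = [
    ("ZF Friedrichshafen will sein Shuttle-Geschäft loswerden", 0.9638674259185791),
    ("ISV Business (Anke Fritzenkoetter)", 0.5264995098114014),
]
for doc, score in filter_by_distance(results, max_distance=0.7):
    print(f"{score:.3f}  {doc}")
```

A fixed threshold is a blunt instrument; a suitable value depends on the embedding model and the metric, so it is best chosen by inspecting scores for a handful of known-good and known-bad queries.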