## Creating an index and populating it with documents using Milvus and Nomic AI Embeddings

A simple example of how to ingest PDF, DOCX, and PPTX documents into a Milvus vector store. In this example, the embeddings are the fully open source ones released by Nomic AI, [nomic-embed-text-v1](https://huggingface.co/nomic-ai/nomic-embed-text-v1).

As described in [this blog post](https://blog.nomic.ai/posts/nomic-embed-text-v1), those embeddings feature a "8192 context-length that outperforms OpenAI Ada-002 and text-embedding-3-small on both short and long context tasks". In addition, they are:

- Open source
- Open data
- Open training code
- Fully reproducible and auditable

Requirements:
- A Milvus instance, either standalone or cluster.

### Needed packages and imports

In [16]:
!pip install -q einops==0.7.0 langchain==0.1.9 pypdf==4.0.2 pymilvus==2.3.6 sentence-transformers==2.4.0 python-docx unstructured[docx,pptx] python-pptx docx2txt


In [17]:
import requests
import os
from langchain.document_loaders import PyPDFDirectoryLoader, WebBaseLoader, Docx2txtLoader
from langchain_community.document_loaders import UnstructuredPowerPointLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import Milvus
import docx2txt

### Base parameters, the Milvus connection info

In [18]:
MILVUS_HOST = "vectordb-milvus"
MILVUS_PORT = 19530
MILVUS_USERNAME = "root"
MILVUS_PASSWORD = "Milvus"
MILVUS_COLLECTION = "redhat_notes"
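
The connection values above are hard-coded in the notebook. As a minimal sketch, the same values could be read from environment variables, with the notebook's values as fallbacks. The environment variable names and the helper `milvus_connection_args` are assumptions made for illustration; they are not part of the original notebook.

```python
import os


def milvus_connection_args(env=os.environ):
    """Build a connection_args dict, falling back to the notebook's values.

    The variable names (MILVUS_HOST, ...) are hypothetical; adapt them to
    whatever your deployment actually exports.
    """
    return {
        "host": env.get("MILVUS_HOST", "vectordb-milvus"),
        "port": int(env.get("MILVUS_PORT", "19530")),
        "user": env.get("MILVUS_USERNAME", "root"),
        "password": env.get("MILVUS_PASSWORD", "Milvus"),
    }


# With an empty mapping, the notebook defaults are returned.
print(milvus_connection_args({}))
```

The returned dictionary has the same keys as the `connection_args` passed to `Milvus(...)` later in this notebook.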

## Initial index creation and document ingestion

#### Load PDFs

In [19]:
pdf_folder_path = "./pdfs"

pdf_loader = PyPDFDirectoryLoader(pdf_folder_path)
docs = pdf_loader.load()

In [20]:
len(docs)

1070

#### Load docx Files

In [21]:
docx_folder_path = "./docx"

# Create a list to store the loaded data from all files
all_data_docx = []

# Iterate over all files in the directory
for filename in os.listdir(docx_folder_path):
    if filename.endswith(".docx"):
        file_path = os.path.join(docx_folder_path, filename)
        loader = Docx2txtLoader(file_path)
        data = loader.load()
        all_data_docx.append(data)

In [22]:
for sublist_docx in all_data_docx:
    for subitem in sublist_docx:
        docs.append(subitem)
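
The nested loop above flattens a list of per-file lists into `docs`. A shorter equivalent, sketched here with stand-in strings instead of real `Document` objects, uses `itertools.chain.from_iterable`. The helper name `flatten` and the sample data are illustrative assumptions.

```python
from itertools import chain


def flatten(list_of_lists):
    """Return a single list containing every item of every sublist, in order."""
    return list(chain.from_iterable(list_of_lists))


# Stand-in data: each inner list mimics what one loader.load() call returns.
all_data_docx = [["doc-a"], ["doc-b", "doc-c"]]
docs = ["pdf-page-1"]

docs.extend(flatten(all_data_docx))
print(docs)
```

Calling `all_data_docx.extend(data)` inside the loading loop, instead of `append`, would avoid the nested list in the first place.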

In [23]:
len(docs)

1324

#### Load pptx files

In [24]:
pptx_folder_path = "./pptx"

# Create a list to store the loaded data from all files
all_data_pptx = []

# Iterate over all files in the directory
for filename in os.listdir(pptx_folder_path):
    if filename.endswith(".pptx"):
        file_path = os.path.join(pptx_folder_path, filename)
        try:
            loader = UnstructuredPowerPointLoader(file_path)
            data = loader.load()
            all_data_pptx.append(data)
        except Exception as e:
            print(f"Error loading file '{filename}': {e}")
            continue  # Skip to the next iteration

Error loading file 'RH Edge Computing for Siemens Energy January 2023.pptx': invalid character in attribute value, line 1, column 1014 (<string>, line 1)
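
The DOCX and PPTX cells follow the same pattern: list a folder, filter by extension, load each file, and skip files that fail. The sketch below factors that pattern into one hypothetical helper, `load_folder`, which also returns the failures instead of only printing them. `FakeLoader` is a stand-in used only so the example runs without any document files; it is not a LangChain class.

```python
import os
import tempfile


def load_folder(folder, suffix, loader_factory):
    """Load every file in `folder` ending in `suffix`.

    `loader_factory(path)` must return an object with a .load() method that
    returns a list. Returns (loaded_items, failures), where failures is a
    list of (filename, error message) tuples.
    """
    loaded, failures = [], []
    for filename in sorted(os.listdir(folder)):
        if not filename.lower().endswith(suffix):
            continue
        path = os.path.join(folder, filename)
        try:
            loaded.extend(loader_factory(path).load())
        except Exception as exc:  # keep going, but remember what failed
            failures.append((filename, str(exc)))
    return loaded, failures


class FakeLoader:
    """Stand-in loader for the demo: fails on files named 'broken*'."""

    def __init__(self, path):
        self.path = path

    def load(self):
        name = os.path.basename(self.path)
        if name.startswith("broken"):
            raise ValueError("cannot parse")
        return [f"content of {name}"]


with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.pptx", "broken.pptx", "notes.txt"):
        open(os.path.join(tmp, name), "w").close()
    loaded, failures = load_folder(tmp, ".pptx", FakeLoader)

print(loaded)
print(failures)
```

In this notebook, a call such as `load_folder("./pptx", ".pptx", UnstructuredPowerPointLoader)` should work the same way, since the loader classes used above take a file path and expose `load()`.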


In [25]:
for sublist_pptx in all_data_pptx:
    for subitem in sublist_pptx:
        docs.append(subitem)

In [26]:
len(docs)

1360

#### Inject metadata

#### Merge both types of docs

#### Split documents into chunks with some overlap

In [27]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=2048,
                                               chunk_overlap=120)
all_splits = text_splitter.split_documents(docs)
all_splits[15]

Document(page_content='Backup\nVertraulich | © Siemens 2023 | Jan -Boike Fischer | DI FA CTR SDS | 2023 -07-31', metadata={'source': 'snemeis-pdfs/POC_Summary.pdf', 'page': 14})
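
To make `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sketch of overlapping chunking. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which also tries to split on separators such as paragraph and line breaks; this version only slides a fixed-width window over the text. The function name `chunk_text` is an assumption for illustration.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters. Fixed-width only:
    no attempt is made to break on paragraphs, lines, or words.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
```

With a window of 4 and an overlap of 1, each chunk starts with the last character of the previous one, which is the role the 120-character overlap plays for the 2048-character chunks above.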

#### Create the index and ingest the documents

In [28]:
# To force GPU use, switch to the commented-out line below, which adds 'device': 'cuda'
# model_kwargs = {'trust_remote_code': True, 'device': 'cuda'}
model_kwargs = {'trust_remote_code': True}
embeddings = HuggingFaceEmbeddings(
    model_name="nomic-ai/nomic-embed-text-v1",
    model_kwargs=model_kwargs,
    show_progress=True
)

You try to use a model that was created with version 2.4.0.dev0, however, your version is 2.4.0. This might cause unexpected behavior or errors. In that case, try to update to the latest version.



<All keys matched successfully>
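
The Hugging Face model card for nomic-embed-text-v1 describes task instruction prefixes such as `search_document: ` and `search_query: ` that are expected at the start of each input text. This notebook does not add them. The sketch below shows one way to prepend a prefix to plain strings; the helper `with_task_prefix` is hypothetical, and how best to wire it into the LangChain embedding class is left open.

```python
# Task names as listed on the model card; treat this list as an assumption
# and check the card for the model version you are using.
NOMIC_TASKS = ("search_document", "search_query", "clustering", "classification")


def with_task_prefix(text, task):
    """Return text prefixed with '<task>: '."""
    if task not in NOMIC_TASKS:
        raise ValueError(f"unknown task: {task!r}")
    return f"{task}: {text}"


print(with_task_prefix("What is a vector database?", "search_query"))
print(with_task_prefix("Milvus stores and indexes embeddings.", "search_document"))
```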


In [29]:
db = Milvus(
    embedding_function=embeddings,
    connection_args={"host": MILVUS_HOST, "port": MILVUS_PORT, "user": MILVUS_USERNAME, "password": MILVUS_PASSWORD},
    collection_name=MILVUS_COLLECTION,
    metadata_field="metadata",
    text_field="page_content",
    auto_id=True,
    drop_old=True
    )

[has_collection] retry:4, cost: 0.27s, reason: <_MultiThreadedRendezvous: StatusCode.UNAVAILABLE, failed to connect to all addresses; last error: UNKNOWN: ipv4:172.30.190.66:19530: Failed to connect to remote host: No route to host>
[has_collection] retry:5, cost: 0.81s, reason: <_MultiThreadedRendezvous: StatusCode.UNAVAILABLE, failed to connect to all addresses; last error: UNKNOWN: ipv4:172.30.190.66:19530: Failed to connect to remote host: No route to host>
[has_collection] retry:6, cost: 2.43s, reason: <_MultiThreadedRendezvous: StatusCode.UNAVAILABLE, failed to connect to all addresses; last error: UNKNOWN: ipv4:172.30.190.66:19530: Failed to connect to remote host: No route to host>
[has_collection] retry:7, cost: 3.00s, reason: <_MultiThreadedRendezvous: StatusCode.UNAVAILABLE, failed to connect to all addresses; last error: UNKNOWN: ipv4:172.30.190.66:19530: Failed to connect to remote host: No route to host>
[has_collection] retry:8, c
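
The retry messages above report that the Milvus address could not be reached. A quick TCP check before creating the vector store can make that kind of problem visible earlier. The sketch below uses only the standard library; `is_port_open` is a hypothetical helper, and it only tests whether the port accepts connections, not whether Milvus is healthy or the credentials are valid.

```python
import socket


def is_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, and DNS failures
        return False
```

For this notebook, the call would be `is_port_open(MILVUS_HOST, MILVUS_PORT)`.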

In [30]:
db.add_documents(all_splits)

Batches:   0%|          | 0/125 [00:00<?, ?it/s]

[449260199505830359,
 449260199505830360,
 449260199505830361,
 449260199505830362,
 449260199505830363,
 449260199505830364,
 449260199505830365,
 449260199505830366,
 449260199505830367,
 449260199505830368,
 449260199505830369,
 449260199505830370,
 449260199505830371,
 449260199505830372,
 449260199505830373,
 449260199505830374,
 449260199505830375,
 449260199505830376,
 449260199505830377,
 449260199505830378,
 449260199505830379,
 449260199505830380,
 449260199505830381,
 449260199505830382,
 449260199505830383,
 449260199505830384,
 449260199505830385,
 449260199505830386,
 449260199505830387,
 449260199505830388,
 449260199505830389,
 449260199505830390,
 449260199505830391,
 449260199505830392,
 449260199505830393,
 449260199505830394,
 449260199505830395,
 449260199505830396,
 449260199505830397,
 449260199505830398,
 449260199505830399,
 449260199505830400,
 449260199505830401,
 449260199505830402,
 449260199505830403,
 449260199505830404,
 449260199505830405,
 449260199505
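
`db.add_documents(all_splits)` sends every chunk in a single call. For larger collections it may be more convenient to ingest in smaller batches, so that a failure part-way through loses less work. The generator below is a minimal sketch; the name `batched` and the batch size in the usage note are assumptions.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items`, each with at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


print(list(batched([1, 2, 3, 4, 5], 2)))
```

Applied to this notebook, the loop would look like `for batch in batched(all_splits, 500): db.add_documents(batch)`.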

#### Alternatively, add new documents

In [31]:
# If you don't want to use a GPU, you can remove the 'device': 'cuda' argument
# model_kwargs = {'trust_remote_code': True, 'device': 'cuda'}
# embeddings = HuggingFaceEmbeddings(
#     model_name="nomic-ai/nomic-embed-text-v1",
#     model_kwargs=model_kwargs,
#     show_progress=True
# )

# db = Milvus(
#     embedding_function=embeddings,
#     connection_args={"host": MILVUS_HOST, "port": MILVUS_PORT, "user": MILVUS_USERNAME, "password": MILVUS_PASSWORD},
#     collection_name=MILVUS_COLLECTION,
#     metadata_field="metadata",
#     text_field="page_content",
#     auto_id=True,
#     drop_old=False
#     )

# db.add_documents(all_splits)
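
When appending to an existing collection with `drop_old=False` and `auto_id=True`, re-running the ingestion on the same files will most likely store the same chunks again under new ids. One way to reduce that risk, sketched below, is to keep a set of content hashes and skip chunks that were already seen. The helpers `content_key` and `drop_seen` are hypothetical, and chunks are represented as plain `(text, source)` tuples to keep the example self-contained.

```python
import hashlib


def content_key(text, source):
    """Stable fingerprint of a chunk, based on its source and text."""
    return hashlib.sha256(f"{source}\n{text}".encode("utf-8")).hexdigest()


def drop_seen(chunks, seen_keys):
    """Return the chunks whose fingerprint is not yet in seen_keys.

    seen_keys is updated in place, so repeated calls skip earlier chunks.
    """
    fresh = []
    for text, source in chunks:
        key = content_key(text, source)
        if key not in seen_keys:
            seen_keys.add(key)
            fresh.append((text, source))
    return fresh


seen = set()
first_run = drop_seen([("a", "f1.pdf"), ("a", "f1.pdf"), ("b", "f1.pdf")], seen)
second_run = drop_seen([("a", "f1.pdf"), ("c", "f2.pdf")], seen)
print(first_run)
print(second_run)
```

For this to work across notebook sessions, the set of keys would need to be persisted somewhere, for example in a file next to the documents.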

#### Test query

In [32]:
query = "Who is Anke Fritzenkoetter?"
docs_with_score = db.similarity_search_with_score(query)

Batches:   0%|          | 0/1 [00:00<?, ?it/s]
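
`similarity_search_with_score` returns `(document, score)` pairs. If only close matches are wanted, the pairs can be filtered by score. The sketch below assumes the score is a distance, where a lower value means a closer match; whether that holds depends on the metric configured for the collection, so check it before relying on a threshold. The helper name and the threshold value are illustrative.

```python
def keep_close_matches(results, max_distance):
    """Keep (doc, score) pairs whose score is at most max_distance.

    Assumes score is a distance (lower is closer). For a similarity metric
    where higher is better, the comparison would need to be reversed.
    """
    return [(doc, score) for doc, score in results if score <= max_distance]


# Stand-in results: strings instead of Document objects.
sample = [("chunk A", 0.79), ("chunk B", 1.05), ("chunk C", 1.30)]
print(keep_close_matches(sample, max_distance=1.1))
```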

In [33]:
for doc, score in docs_with_score:
    print("-" * 80)
    print("Score: ", score)
    print(doc.page_content)
    print("-" * 80)

--------------------------------------------------------------------------------
Score:  0.7926554679870605
Name: 		Anke Fritzenkötter

Telephone:	0160-90995336

Fax:		

Email:		afritzen@redhat.com

Partner bill-to address:



Humboldtstr. 59

90459 Nürnberg, Germany



Embedded Partner (if different than Partner above):







Embedded Partner Product(s) (additions or replacements, if any):



Spectrum Power

Embedded Partner Administrative Contact for administrative issues such as reporting, purchase orders and payments:

Name: tbd

Email: 

Phone: 



Embedded Partner Support Contact for technical support issues:

Name: tbd

Email: 

Phone: 







Purchase Summary

Quantity

Unit*

Term

SKU

Subscription or Services Purchased

Per Unit Fee EUR

Per Unit Prorated Fee EUR

Total Fee

Contract Year 1**









18

Fee

1/Dec/2020 – 30/Nov/2021

MCT1242***

Red Hat Advanced ISV Partnership Fee RHEL***

2,400.00

600.00****

(EUR) 600.00

2

Fee

1/Dec/2020 –30/Nov/2021

MCT1242***

R