## Creating an index and populating it with documents using Milvus and Nomic AI Embeddings

A simple example of how to ingest PDF documents, then web page content, into a Milvus vector store. In this example, the embeddings are the fully open source ones released by Nomic AI, [nomic-embed-text-v1](https://huggingface.co/nomic-ai/nomic-embed-text-v1).

As described in [this blog post](https://blog.nomic.ai/posts/nomic-embed-text-v1), those embeddings feature an "8192 context-length that outperforms OpenAI Ada-002 and text-embedding-3-small on both short and long context tasks". In addition, they are:

- Open source
- Open data
- Open training code
- Fully reproducible and auditable

Requirements:
- A Milvus instance, either standalone or cluster.

### Needed packages and imports

In [1]:
!pip install -q einops==0.7.0 langchain==0.1.9 pypdf==4.0.2 pymilvus==2.3.6 sentence-transformers==2.4.0 grpcio grpcio-reflection


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip available: [0m[31;49m22.2.2[0m[39;49m -> [0m[32;49m24.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [2]:
import requests
import os
from langchain.document_loaders import PyPDFDirectoryLoader, WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import Milvus
from caikit_nlp_client import LangchainEmbeddings

### Base parameters, the Milvus connection info

In [3]:
MILVUS_HOST = "vectordb-milvus.milvus.svc.cluster.local"
MILVUS_PORT = 19530
MILVUS_USERNAME = os.getenv('MILVUS_USERNAME')
MILVUS_PASSWORD = os.getenv('MILVUS_PASSWORD')
MILVUS_COLLECTION = "collection_nomicai_embeddings"
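
Because the credentials above are read with `os.getenv`, an unset variable silently becomes `None` and only surfaces later as a connection failure. The following is a minimal sketch, assuming a hypothetical helper named `build_connection_args` (not part of the original notebook), that assembles the same `connection_args` dictionary and fails early when a credential is missing.

```python
import os


def build_connection_args(host, port, env=os.environ):
    """Assemble the connection_args dict for the Milvus vector store.

    Hypothetical helper: raises early if a credential variable is unset or empty.
    """
    required = ("MILVUS_USERNAME", "MILVUS_PASSWORD")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "host": host,
        "port": port,
        "user": env["MILVUS_USERNAME"],
        "password": env["MILVUS_PASSWORD"],
    }
```

The result could then be passed as `connection_args=build_connection_args(MILVUS_HOST, MILVUS_PORT)` when creating the vector store.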

## Initial index creation and document ingestion

#### Download and load pdfs

In [4]:
# Keep the version as a string: a float such as 2.10 would be formatted as "2.1" in the URLs.
product_version = "2.12"
documents = [
    "release_notes",
    "introduction_to_red_hat_openshift_ai",
    "getting_started_with_red_hat_openshift_ai_self-managed",
    "openshift_ai_tutorial_-_fraud_detection_example",
    "developing_a_model",
    "integrating_data_from_amazon_s3",
    "working_on_data_science_projects",
    "serving_models",
    "monitoring_data_science_models",
    "managing_users",
    "managing_resources",
    "installing_and_uninstalling_openshift_ai_self-managed",
    "installing_and_uninstalling_openshift_ai_self-managed_in_a_disconnected_environment",
    "upgrading_openshift_ai_self-managed",
    "upgrading_openshift_ai_self-managed_in_a_disconnected_environment",   
]

pdfs = [f"https://access.redhat.com/documentation/en-us/red_hat_openshift_ai_self-managed/{product_version}/pdf/{doc}/red_hat_openshift_ai_self-managed-{product_version}-{doc}-en-us.pdf" for doc in documents]
pdfs_to_urls = {f"red_hat_openshift_ai_self-managed-{product_version}-{doc}-en-us": f"https://access.redhat.com/documentation/en-us/red_hat_openshift_ai_self-managed/{product_version}/html-single/{doc}/index" for doc in documents}

In [5]:
docs_dir = f"rhoai-doc-{product_version}"

if not os.path.exists(docs_dir):
    os.mkdir(docs_dir)
    for pdf in pdfs:
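        try:
            # A timeout avoids hanging forever on an unresponsive server.
            response = requests.get(pdf, timeout=60)
        except requests.RequestException:
            # Only network-related errors are skipped; other exceptions still surface.
            print(f"Skipped {pdf}")
            continue
        if response.status_code != 200:
            print(f"Skipped {pdf}")
            continue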
        with open(f"{docs_dir}/{pdf.split('/')[-1]}", 'wb') as f:
            f.write(response.content)
else:
    print('PDF dir found, skipping doc load')

PDF dir found, skipping doc load


In [6]:
pdf_folder_path = f"./rhoai-doc-{product_version}"

pdf_loader = PyPDFDirectoryLoader(pdf_folder_path)
pdf_docs = pdf_loader.load()

#### Inject metadata

In [7]:
from pathlib import Path

for doc in pdf_docs:
    doc.metadata["source"] = pdfs_to_urls[Path(doc.metadata["source"]).stem]
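
The metadata step above relies on the PDF loader storing the local file path in `metadata["source"]`, and on the file stem matching a key of `pdfs_to_urls`. A small sketch of that lookup for a single document follows; the `serving_models` slug is taken from the list above, and keeping the version as a string is an assumption made for this illustration.

```python
from pathlib import Path

version = "2.12"  # assumption: version kept as a string for this sketch
doc = "serving_models"

# Same naming scheme as the pdfs_to_urls mapping built earlier.
stem = f"red_hat_openshift_ai_self-managed-{version}-{doc}-en-us"
pdfs_to_urls = {
    stem: (
        "https://access.redhat.com/documentation/en-us/"
        f"red_hat_openshift_ai_self-managed/{version}/html-single/{doc}/index"
    )
}

# Local path as it would appear in doc.metadata["source"] after loading.
local_source = f"rhoai-doc-{version}/{stem}.pdf"

# Path.stem drops the directory and only the final ".pdf" suffix,
# so the dot inside the version number is preserved.
key = Path(local_source).stem
resolved_url = pdfs_to_urls[key]
print(resolved_url)
```

With this replacement, search results point at the online HTML documentation rather than at a local file path.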

#### Load websites

In [8]:
websites = [
    "https://ai-on-openshift.io/getting-started/openshift/",
    "https://ai-on-openshift.io/getting-started/opendatahub/",
    "https://ai-on-openshift.io/getting-started/openshift-ai/",
    "https://ai-on-openshift.io/odh-rhoai/configuration/",
    "https://ai-on-openshift.io/odh-rhoai/custom-notebooks/",
    "https://ai-on-openshift.io/odh-rhoai/nvidia-gpus/",
    "https://ai-on-openshift.io/odh-rhoai/custom-runtime-triton/",
    "https://ai-on-openshift.io/odh-rhoai/openshift-group-management/",
    "https://ai-on-openshift.io/tools-and-applications/minio/minio/",
    "https://access.redhat.com/articles/7047935",
    "https://access.redhat.com/articles/rhoai-supported-configs",
]

In [9]:
website_loader = WebBaseLoader(websites)
website_docs = website_loader.load()

#### Merge both types of docs

In [10]:
docs = pdf_docs + website_docs

#### Split documents into chunks with some overlap

In [11]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=126,
                                               chunk_overlap=40)
all_splits = text_splitter.split_documents(docs)
all_splits[0]

Document(page_content='Red Hat OpenShift AI Self-Managed\n \n2.12\nGetting started with Red Hat OpenShift AI\nSelf-Managed', metadata={'source': 'https://access.redhat.com/documentation/en-us/red_hat_openshift_ai_self-managed/2.12/html-single/getting_started_with_red_hat_openshift_ai_self-managed/index', 'page': 0})
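
To illustrate what `chunk_size=126` and `chunk_overlap=40` mean, here is a deliberately simplified sketch using a fixed-width sliding window. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to cut on separators such as paragraph and line breaks); the function name `sliding_chunks` is hypothetical and only shows how consecutive chunks share text.

```python
def sliding_chunks(text, chunk_size=126, chunk_overlap=40):
    """Cut text into fixed-width windows where neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop once the remaining tail is already covered by the previous window.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


sample = "".join(chr(97 + i % 26) for i in range(300))
chunks = sliding_chunks(sample)
print(len(chunks), [len(c) for c in chunks])
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.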

#### Create the index and ingest the documents

In [12]:
# Option 1 (disabled): compute the embeddings locally with the Nomic AI model.
# If you don't want to use a GPU, you can remove the 'device': 'cuda' argument
#model_kwargs = {'trust_remote_code': True, 'device': 'cuda'}
#embeddings = HuggingFaceEmbeddings(
#    model_name="nomic-ai/nomic-embed-text-v1",
#    model_kwargs=model_kwargs,
#    show_progress=True
#)

# Option 2 (active): call an embedding model served through a Caikit endpoint.
#model = "flan-t5-small-caikit"
#model_endpoint_url = "https://caikit-flan-jary-wb.apps.jary-intel-opea.51ty.p1.openshiftapps.com"
model = "all-MiniLM-L12-v2-caikit"
model_endpoint_url = "https://caikit-nomic5-jary-wb.apps.jary-intel-opea.51ty.p1.openshiftapps.com"
# Replace <token> with a valid bearer token for the endpoint.
token = "Bearer <token>"

embeddings = LangchainEmbeddings(
    token=token,
    endpoint=model_endpoint_url,
    model=model
)

db = Milvus(
    embedding_function=embeddings,
    connection_args={"host": MILVUS_HOST, "port": MILVUS_PORT, "user": MILVUS_USERNAME, "password": MILVUS_PASSWORD},
    collection_name=MILVUS_COLLECTION,
    metadata_field="metadata",
    text_field="page_content",
    auto_id=True,
    drop_old=True
    )

In [13]:
print(len(all_splits))
db.add_documents(all_splits[0:2])

1648
Response: <Response [200]>
typeof response: <class 'dict'>
and content: {'results': {'vectors': [{'data': {'values': [0.03457500785589218, -0.11341545730829239, -0.02059202827513218, ...

[452551956821749122, 452551956821749123]
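
The cell above ingests only the first two of the 1648 chunks. When ingesting the whole list, sending it in smaller batches may keep each embedding request and each Milvus insert to a manageable size. The sketch below assumes a hypothetical `batched` helper and an arbitrary batch size of 100; neither comes from the original notebook.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items, each holding at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Intended use in this notebook (assumes `db` and `all_splits` exist):
# for batch in batched(all_splits, 100):
#     db.add_documents(batch)

print(list(batched(list(range(5)), 2)))
```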

#### Alternatively, add new documents

In [14]:
# If you don't want to use a GPU, you can remove the 'device': 'cuda' argument
# model_kwargs = {'trust_remote_code': True, 'device': 'cuda'}
# embeddings = HuggingFaceEmbeddings(
#     model_name="nomic-ai/nomic-embed-text-v1",
#     model_kwargs=model_kwargs,
#     show_progress=True
# )

# db = Milvus(
#     embedding_function=embeddings,
#     connection_args={"host": MILVUS_HOST, "port": MILVUS_PORT, "user": MILVUS_USERNAME, "password": MILVUS_PASSWORD},
#     collection_name=MILVUS_COLLECTION,
#     metadata_field="metadata",
#     text_field="page_content",
#     auto_id=True,
#     drop_old=False
#     )

# db.add_documents(all_splits)

#### Test query

In [15]:
query = "How can I work with GPU and taints in OpenShift AI?"
docs_with_score = db.similarity_search_with_score(query)

RPC error: [search], <ParamError: (code=1, message=`search_data` value [{'result': {'data': {'values': [0.014584872871637344, -0.06318628787994385, 0.04656320810317993, ...

single-end::<class 'dict'>


ParamError: <ParamError: (code=1, message=`search_data` value [{'result': {'data': {'values': [0.014584872871637344, -0.06318628787994385, 0.04656320810317993, ..., -0.0029951739124953747, -0.01858036406338215]}}, 'producer_id': {'name': 'EmbeddingModule', 'version': '0.0.1'}, 'input_token_count': 18}] is illegal)>
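
The `ParamError` above shows Milvus receiving a dictionary of the form `{'result': {'data': {'values': [...]}}, ...}` as search data, where it expects a plain list of floats. One plausible reading is that the embeddings client returns the raw Caikit response for a single query instead of the vector itself. The following is a sketch of a possible workaround, assuming that response shape; the class name `QueryUnwrappingEmbeddings` is hypothetical and the wrapper is duck-typed rather than derived from any LangChain base class.

```python
class QueryUnwrappingEmbeddings:
    """Wrap an embeddings client and unwrap dict-shaped query responses.

    Hypothetical workaround: the dict layout is inferred from the error
    message above, not from the client's documentation.
    """

    def __init__(self, inner):
        self.inner = inner

    def embed_documents(self, texts):
        # Document ingestion succeeded in the earlier cell, so it is passed through unchanged.
        return self.inner.embed_documents(texts)

    def embed_query(self, text):
        result = self.inner.embed_query(text)
        if isinstance(result, dict):
            # Extract the vector from {'result': {'data': {'values': [...]}}}.
            return [float(v) for v in result["result"]["data"]["values"]]
        return result
```

If this assumption holds, passing `embedding_function=QueryUnwrappingEmbeddings(embeddings)` when creating the `Milvus` object should let `similarity_search_with_score` send a list of floats to the server.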

In [None]:
for doc, score in docs_with_score:
    print("-" * 80)
    print("Score: ", score)
    print(doc.page_content)
    print("-" * 80)