In [None]:
%pip install -r requirements.txt

In [3]:
import os
from dotenv import load_dotenv
load_dotenv(override=True)

True

# Using Milvus to manage vector datasets with Neopilot
In this demo, we use Milvus as our vector store to enable quick document queries. We compute document embeddings using a small BERT model for semantic search.

## 1. Set up the knowledge base
We start by setting up our knowledge base in Milvus with the following steps:
- Get the data. In this demo, we will use the WikiHow dataset, which is quite large and may take some time to insert into Milvus.
- Check data quality.
- Connect to our Milvus instance and initialize a new Collection with a defined schema.
- Upload the data to Milvus.

### 1.1 Download the dataset
Acquire the dataset from the following URL: `https://ucsb.box.com/s/7yq601ijl1lzvlfu4rjdbbxforzd2oag`

This can take some time depending on connection speed. The file path and name should be provided in the environment variable `WH_PATH`.

In [4]:
WH_PATH = os.environ['WH_PATH']

### 1.2 Load and check the data
In this case, we observe that some of the data could be cleaner:
- One of the titles seems to be mistakenly registered as a sectionLabel
- Some odd codepoint choices, for example for apostrophes
- Some titles end in spurious numbers

In this case we'll manually skip lines with non-string data during processing (see below). Other options include normalizing the data to an application-dependent degree, from simple codepoint normalization up to full normalization/canonicalization.
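
As an illustration of the codepoint-level option mentioned above, here is a minimal sketch using only the standard library. The function name `normalize_title` and the punctuation table are hypothetical and not part of this notebook's pipeline. Stripping trailing digits is a heuristic that would also damage titles that legitimately end in a number, so it should be checked against the data before use.

```python
import re
import unicodedata

# Hypothetical mapping of typographic quotes to their ASCII equivalents.
_PUNCT = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # curly single quotes / apostrophes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
})

def normalize_title(raw):
    """Apply a light, lossy cleanup to a title string."""
    text = unicodedata.normalize("NFKC", raw)   # fold compatibility forms
    text = text.translate(_PUNCT)               # NFKC leaves curly quotes alone
    text = re.sub(r"\d+\s*$", "", text)         # heuristic: drop trailing digits
    return " ".join(text.split())               # collapse runs of whitespace
```

For example, `normalize_title("How to Tie a Tie\u2019s Knot2")` yields a title with a plain apostrophe and no trailing digit.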

In [6]:
import pandas
doc = pandas.read_csv(WH_PATH)

In [7]:
doc_indexed = doc.set_index(['title', 'headline']).sort_index()

In [8]:
doc_indexed.tail()

title,headline,overview,text,sectionLabel
How to Zydeco,\nTry a side step.,Zydeco dancing is type of Cajun dancing perfo...,If you already have the rock step down (or ev...,Adding Movement
How to Zydeco,\nTry the open position.,Zydeco dancing is type of Cajun dancing perfo...,"The open position is, as it sounds, much more...",Learning the Closed and Open Position
How to Zydeco,\nUse a rock step.,Zydeco dancing is type of Cajun dancing perfo...,"Often, you'll just be shifting your weight ba...",Adding Movement
How to Zydeco,\nUse dance techniques for the extra beat.,Zydeco dancing is type of Cajun dancing perfo...,It can be hard to remember to hold for the ex...,Learning the Beat
,\nInsert the following into your <head> section:\n\n\n\n\n\n,Do you want to change the user's cursor when ...,"Steps,Tips,Related wikiHows",How to Set Cursors for Webpage Links


### 1.3a Create Milvus connection
We will interact with our Milvus instance using the official pymilvus library. Alternatively, LangChain's Milvus vector store class can add documents to the instance. In that case, a simple `from_documents` or `from_texts` (or similar) call generates the collection with the settings LangChain expects.

Milvus requires a connection for all operations.

The alias on the connection is used from then on (with `using=` parameters in other functions) to refer to the connection that was established.
The connection is not managed and we should remember to disconnect at the end. The `using=` field has a value of `default` when not specified, so starting a connection with an alias of `default` allows us to write a little less code.
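
Since the connection is not managed, one way to avoid forgetting the disconnect is to wrap it in a context manager. The sketch below is an assumption rather than part of pymilvus: the helper name `milvus_connection` is hypothetical, and the `connections` object is passed in as an argument so the helper only relies on its `connect(alias=..., ...)` and `disconnect(alias)` methods.

```python
from contextlib import contextmanager

@contextmanager
def milvus_connection(connections, alias="default", **kwargs):
    """Open a connection under `alias` and always disconnect on exit.

    `connections` is expected to be the object imported with
    `from pymilvus import connections`; it is a parameter here so the
    sketch does not hard-code the import.
    """
    connections.connect(alias=alias, **kwargs)
    try:
        yield alias
    finally:
        # Runs on normal exit and on exceptions alike.
        connections.disconnect(alias)

# Intended usage (requires a reachable Milvus instance):
#
#   from pymilvus import connections
#   with milvus_connection(connections, host="localhost", port="19530"):
#       ...  # work with collections here
```

This notebook keeps the connection open across cells instead, which is why the disconnect appears explicitly in the cleanup section.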

In [9]:
from pymilvus import connections
connections.connect(
  alias="default",
  host=os.environ['MILVUS_HOST'],
  port=os.environ['MILVUS_PORT']
)

### 1.3b Create schema for the milvus store
Note that if a collection with the same name but a different schema exists, Milvus may throw a SchemaNotReady exception.
Also, text fields' max length is actually in bytes, not characters. Even though it's possible to get the byte size of the string and trim it to fit the byte limits in the schema, there are finicky bits and it's better to simply set limits to the max allowable (65535).
We will not be using the LangChain Milvus vectorstores, but we still show how to create a minimal LangChain-compatible store through pymilvus. In this case, fields in the collection must follow some special rules:
- The primary key must be called pk
- The vector must be called vector
- The text entry must be called text

Milvus also supports schemaless operations if `enable_dynamic_field=True`.
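
To make the byte-versus-character point concrete, the following sketch trims a string to a byte budget without cutting a multi-byte character in half. The helper name `truncate_utf8` is hypothetical, and the sketch assumes that the limit applies to the UTF-8 encoding of the string.

```python
def truncate_utf8(text, max_bytes):
    """Return the longest prefix of `text` whose UTF-8 encoding fits in max_bytes."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # Cutting the byte string may split a character; errors="ignore"
    # drops the incomplete trailing bytes instead of raising.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")
```

A character-based slice such as `text[:max_bytes]` would only be safe for pure ASCII input, which this dataset is not.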

In [10]:
from pymilvus import CollectionSchema, FieldSchema, DataType, Collection, utility

In [None]:
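MAX_TITLE = 512   # Byte budget used when trimming titles before insertion
MAX_TEXT = 1024   # Byte budget used when trimming document text before insertion
MAX_VEC = 384     # Embedding dimension of all-MiniLM-L6-v2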

NAME = "WikiHow"

if NAME in utility.list_collections():
    whcollection = Collection(NAME)
    whcollection.drop()

whschema = CollectionSchema(
    fields=[
        FieldSchema(name="pk", dtype=DataType.INT64, is_primary=True, auto_id=True),
        FieldSchema(name="title", dtype=DataType.VARCHAR, max_length=65535, default_value=""),
        FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=65535, default_value=""),
        FieldSchema(name="vector", dtype=DataType.FLOAT_VECTOR, dim=MAX_VEC)
    ],
    enable_dynamic_field=False,
    description="WikiHow collection"
)
whcollection = Collection(
    name=NAME,
    schema=whschema,
    consistency_level="Session" # Make sure we read our own writes, otherwise allowed to be a bit out of date.
)

### 1.4 Batch-wise insertion into milvus
We use a small BERT model to compute embeddings for our documents to place in the milvus store. We will be using the same model later to compute query embeddings for similarity search.

The choice of batch size in this example is arbitrary, and a double-batch system may be preferable to accommodate both the embedding model and Milvus.

When the embedding model runs on GPU, the batch size should be selected so as to optimize the transfer-to-memory vs runtime overheads (too small and a major amount of time will be wasted on memory transfers instead of embedding proper, too large and it won't fit on the device).
If the model is accessed over the network, the batch size should be selected with the same concerns in mind, although further overhead may be incurred depending on how the model is scheduled or how the API is designed.

With regard to milvus, the idea is the same: a batch size that's too small means incurring milvus' operational overhead along with communication overhead. The other tradeoff of note regards any temporary processing or data streaming that may occur: a higher batch size also implies loading more data into memory and possibly generating longer-lasting temporary artifacts before submitting the data to milvus, after which it can all be discarded.
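
The double-batch idea described above can be sketched as two nested chunking loops: a larger outer batch sized for the insert, split into smaller inner batches sized for the embedding model. Everything here is illustrative. The names `chunked` and `embed_and_insert`, the default sizes, and the callables `embed_fn` and `insert_fn` are assumptions standing in for `embeddings.embed_documents` and a wrapper around `whcollection.insert`.

```python
from itertools import islice

def chunked(items, size):
    """Yield successive lists of at most `size` items from any iterable."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def embed_and_insert(texts, embed_fn, insert_fn, embed_batch=64, insert_batch=2048):
    """Embed in small batches, insert in large ones."""
    for outer in chunked(texts, insert_batch):
        vectors = []
        for inner in chunked(outer, embed_batch):
            # embed_fn takes a list of strings and returns one vector per string.
            vectors.extend(embed_fn(inner))
        # insert_fn receives the texts and their vectors in matching order.
        insert_fn(outer, vectors)
```

Only one outer batch of texts and vectors is held in memory at a time, which bounds the size of the temporary artifacts mentioned above.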



### 1.4a Load embeddings
We use HuggingFaceEmbeddings with the MiniLM BERT model.

In [4]:
import langchain
from langchain.embeddings import HuggingFaceEmbeddings

In [None]:
embeddings = HuggingFaceEmbeddings(model_name='all-MiniLM-L6-v2')

In [None]:
BATCH_SIZE = 2048

def insert_data(data):
    """Embed the titles in `data` and insert one row per title into Milvus."""
    titles = list(data.keys())

    # Embeddings are computed from the title (which includes the overview).
    vecs = embeddings.embed_documents(titles)

    # Column-wise payload: titles, texts, vectors (pk is auto-generated).
    entries = [[], [], []]

    for b, title in enumerate(titles):
        text = title + ":\n"
        for cat in data[title]:
            text += cat + ":\n"
            text += "\n".join(data[title][cat])

        # VARCHAR limits are in bytes, so trim the UTF-8 encoding rather than
        # the character count; errors='ignore' drops a character cut in half.
        entries[0].append(title.encode('utf-8')[:MAX_TITLE].decode('utf-8', errors='ignore'))
        entries[1].append(text.encode('utf-8')[:MAX_TEXT].decode('utf-8', errors='ignore'))
        entries[2].append(vecs[b])

    whcollection.insert(entries)

import collections, tqdm

def new_batch():
    # Maps title -> headline -> list of steps.
    return collections.defaultdict(lambda: collections.defaultdict(list))

doc_data = new_batch()
for i in tqdm.tqdm(range(len(doc_indexed)), total=len(doc_indexed)):
    title, headline = doc_indexed.index[i]
    row = doc_indexed.iloc[i]
    # Skip rows with non-string (e.g. NaN) data in any field we use.
    if not isinstance(title, str) or not isinstance(headline, str):
        continue
    if any(not isinstance(row[col], str) for col in ['text', 'overview', 'sectionLabel']):
        continue
    section_head = title + " (" + row['overview'].strip() + ")"
    category = headline
    step = " ".join(x.strip() for x in row[['sectionLabel', 'text']])

    # Send a full batch only when a new document starts, so that a document's
    # steps are never split across two inserts.
    if section_head not in doc_data and len(doc_data) >= BATCH_SIZE:
        insert_data(doc_data)
        doc_data = new_batch()
    doc_data[section_head][category].append(step)

# Insert whatever is left, even if the last rows of the file were skipped.
if doc_data:
    insert_data(doc_data)

### 1.4b Flush!
Milvus will not seal segments that are too small, so a flush is necessary to force it.

In [50]:
whcollection.flush()

### 1.4c Create index
Search can be accelerated significantly by creating an index on the vector field. Here we use the L2 (Euclidean) distance metric with an inverted-file index that stores vectors uncompressed (`IVF_FLAT`).

If using the LangChain Milvus store interface, now is a good time to disconnect as well. Otherwise, now is the time to load the collection.
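
To give an intuition for what an inverted-file index trades off, here is a toy pure-Python sketch. It is not how Milvus implements `IVF_FLAT`: real indexes train their centroids (typically with k-means), whereas this sketch simply takes the first `nlist` vectors as centroids. The function names are hypothetical. Vectors are assigned to their nearest centroid at build time, and a search scans only the `nprobe` buckets closest to the query.

```python
def sq_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_ivf(vectors, nlist):
    """Assign every vector to the bucket of its nearest centroid."""
    centroids = vectors[:nlist]  # toy choice, see the note above
    buckets = [[] for _ in centroids]
    for idx, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: sq_l2(vec, centroids[c]))
        buckets[nearest].append(idx)
    return centroids, buckets

def search_ivf(query, vectors, centroids, buckets, nprobe, limit=1):
    """Scan only the nprobe buckets whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda c: sq_l2(query, centroids[c]))[:nprobe]
    candidates = [idx for c in probe for idx in buckets[c]]
    return sorted(candidates, key=lambda idx: sq_l2(query, vectors[idx]))[:limit]
```

With a small `nprobe` the search is faster but can miss the true nearest neighbour when it sits in a bucket that was not scanned. Setting `nprobe` equal to `nlist` scans everything and is equivalent to a brute-force search.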

In [14]:
whcollection.create_index(
    field_name="vector",
    index_params={"metric_type": "L2", "index_type": "IVF_FLAT", "params": {"nlist": 1024}}
)
# Load the collection into memory; this is required before any queries.
whcollection.load()
# Once done with queries, `whcollection.release()` frees those resources (see Cleanup).

alloc_timestamp unimplemented, ignore it


## 2. Set up relevance search
Now that the data store is ready, we can do searches against it. Below we build a demo document retrieval system.

In [15]:
RELEVANCE_CUTOFF = 0.75 # Arbitrary threshold for document relevance.
# This is metric-dependent and will have to be tuned depending on the dataset.
# It will also depend on the data in the vector: mind aspects like normalization.
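
To see why normalization matters for the cutoff, note that for unit-length vectors the squared L2 distance and the cosine similarity are tied by ‖a − b‖² = 2 − 2·cos(a, b). The sketch below converts between the two. It rests on two assumptions that should be verified for your setup: that the embedding model returns unit-normalized vectors, and that the distance reported for the `L2` metric is the squared distance. The function names are hypothetical.

```python
import math

def squared_l2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def sq_l2_cutoff_to_cosine(cutoff):
    """Cosine similarity equivalent to a squared-L2 cutoff on unit vectors."""
    return 1.0 - cutoff / 2.0
```

Under those assumptions, a cutoff of 0.75 on the squared distance corresponds to requiring a cosine similarity above 0.625. If the vectors are not normalized, no such fixed correspondence exists and the cutoff has to be tuned empirically.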

In [16]:
def find(what):
    found = whcollection.search(
            [embeddings.embed_query(what)], # Vector for the query
            anns_field="vector", # Name of the field to search against
            param={'metric_type': 'L2', # Must match the metric the index was built with
                        'offset': 0,
                        'params': {'nprobe': 1} # Number of IVF clusters to scan
                        },
            limit=1, # Only the single closest document
            output_fields=['text', 'title']) # Also get the document title.
    hits = found[0] # Results for our one query vector
    if len(hits) == 0: # Nothing came back (e.g. empty collection)
        return { "found": False, "title": None, "text": None }
    best = hits[0]

    return {
        "found": best.distance < RELEVANCE_CUTOFF, # Smaller L2 distance means closer
        "title": best.entity.get('title'),
        "text": best.entity.get('text')
    }

## 3. Demo driver
We set up a simple driver to test our work. Enter a query at the prompt to receive the most relevant document, or a notice that there was no suitable match in the database.

To stop providing queries, simply enter an empty line.

In [17]:
while True:
    ipt = input(">").strip()
    if len(ipt) == 0:
        break

    result = find(ipt)
    if result['found']:
        print(f"Title: {result['title']}\nContents:\n{result['text']}\n")
    else:
        print(f"No matching document for query '{ipt}'.")

Title: How to Be a Fast Runner (Always come up last in the big race? Want some tips on how to speed yourself up whether you're in the Olympics or just out on the playground? Here are some ideas to help you out.)
Contents:
How to Be a Fast Runner (Always come up last in the big race? Want some tips on how to speed yourself up whether you're in the Olympics or just out on the playground? Here are some ideas to help you out.):

Adjust your stride according to the distance you're running, if it's a sprint then quickly turn over your legs and keep your knees high.:
Conserving energy If it's a mid distance(half mile) focus more on running hard, kicking out in front of you. Or, over longer distances, where you must keep efficient, you can do this by keeping your elbows at 90˚ angles, placing your hands near your waist, and puffing out your chest. Keep your pelvis underneath you, with your back straight, and don't kick behind you. Raise your knees, and pretend to kick your butt.
Always do a qu

## 4. Cleanup
Unload the collection to stop using up resources, then close the connection. We're done!

In [55]:
whcollection.release()
connections.disconnect("default")