# Getting Started with Milvus and OpenAI
### Finding your next book
[Milvus](https://milvus.io/) is a popular open-source vector database that powers AI applications with highly performant and scalable vector similarity search.


In this notebook, we generate embeddings of book descriptions with OpenAI and use those embeddings in Milvus to find relevant books. The dataset in this example is sourced from Hugging Face Datasets and contains a little over 1 million title-description pairs.


For demonstration purposes, we use a reduced subset of 10,000 samples from the original Hugging Face dataset. This subset is enough to illustrate the embedding and retrieval process without the overhead of handling the full dataset.

Let's begin by installing the required libraries for this notebook:
- `openai` is used for communicating with the OpenAI embedding service
- `pymilvus` is used for communicating with the Milvus server
- `datasets` is used for downloading the dataset
- `tqdm` is used for the progress bars


In [None]:
! pip install openai pymilvus datasets tqdm

> If you are using Google Colab, to enable dependencies just installed, you may need to **restart the runtime** (click on the "Runtime" menu at the top of the screen, and select "Restart session" from the dropdown menu).

We will use OpenAI as the embedding provider in this example. You should prepare the [API key](https://platform.openai.com/docs/quickstart) `OPENAI_API_KEY` as an environment variable.

In [3]:
import os

os.environ["OPENAI_API_KEY"] = "sk-***********"
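
Pasting a real key into a notebook cell makes it easy to leak when the notebook is shared. As a minimal sketch, a small check can fail early with a clear message when the variable is missing. Note that `require_api_key` is a hypothetical helper introduced here for illustration, not part of the `openai` package:

```python
import os


def require_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the key stored under `name`, or raise if it is missing or blank.

    Hypothetical helper for illustration; `env` is any mapping of
    environment variables and defaults to the real process environment.
    """
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running the notebook")
    return key
```

Calling `require_api_key()` before creating the client surfaces a missing key immediately instead of at the first embedding request.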

## Initialize OpenAI client and Milvus
Initialize the OpenAI client.

In [4]:
from openai import OpenAI

openai_client = OpenAI()

Set the collection name and dimension for the embeddings.

In [5]:
COLLECTION_NAME = "book_search"
DIMENSION = 1536

BATCH_SIZE = 1000
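
`DIMENSION` has to match the length of the vectors returned by the embedding model; 1536 is the default output size of `text-embedding-3-small`, the model used later in this notebook. A minimal sketch of a sanity check that could be run on a batch of embeddings before inserting them is shown here. `check_dimensions` is a hypothetical helper, not a Milvus or OpenAI API:

```python
def check_dimensions(embeddings, expected_dim=1536):
    """Raise ValueError if any embedding's length differs from expected_dim.

    Hypothetical helper for illustration. Returns the number of
    embeddings checked so the call can be logged or asserted on.
    """
    for position, vector in enumerate(embeddings):
        if len(vector) != expected_dim:
            raise ValueError(
                f"embedding {position} has length {len(vector)}, expected {expected_dim}"
            )
    return len(embeddings)
```

A mismatch caught here is easier to diagnose than an insert that fails against the collection schema.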

Connect to Milvus.

In [6]:
from pymilvus import MilvusClient

# Connect to Milvus Database
client = MilvusClient("./milvus_demo.db")

> As for the arguments `uri` and `token`:
> - Setting the `uri` as a local file, e.g. `./milvus.db`, is the most convenient method, as it automatically utilizes [Milvus Lite](https://milvus.io/docs/milvus_lite.md) to store all data in this file.
> - If you have a large amount of data, say more than a million vectors, you can set up a more performant Milvus server on [Docker or Kubernetes](https://milvus.io/docs/quickstart.md). In this setup, use the server address and port as your `uri`, e.g. `http://localhost:19530`. If you enable the authentication feature on Milvus, use `<your_username>:<your_password>` as the `token`; otherwise, leave the token unset.
> - If you want to use [Zilliz Cloud](https://zilliz.com/cloud), the fully managed cloud service for Milvus, adjust the `uri` and `token`, which correspond to the [Public Endpoint and API key](https://docs.zilliz.com/docs/on-zilliz-cloud-console#free-cluster-details) in Zilliz Cloud.
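
The three options in the note can be summarized as a small function that builds the keyword arguments for `MilvusClient`. This is only a sketch: `milvus_client_kwargs` and its `deployment` labels (`"lite"`, `"server"`, `"cloud"`) are hypothetical names chosen for illustration, and the default file path simply mirrors the one used in this notebook.

```python
def milvus_client_kwargs(deployment, address=None, token=None):
    """Build keyword arguments for MilvusClient (hypothetical helper).

    deployment: "lite" (local file), "server" (self-hosted), or "cloud" (Zilliz Cloud).
    address:    file path for "lite", otherwise an endpoint such as http://localhost:19530.
    token:      "<username>:<password>" for an authenticated server, or a cloud API key.
    """
    if deployment == "lite":
        # Milvus Lite stores everything in a local file; no token is involved.
        return {"uri": address or "./milvus_demo.db"}
    if deployment not in ("server", "cloud"):
        raise ValueError(f"unknown deployment: {deployment!r}")
    if not address:
        raise ValueError("a server or cloud deployment needs an endpoint address")
    if deployment == "cloud" and not token:
        raise ValueError("a cloud deployment needs a token")
    kwargs = {"uri": address}
    if token:
        # Only pass a token when authentication is actually enabled.
        kwargs["token"] = token
    return kwargs
```

The result would be unpacked into the constructor, for example `MilvusClient(**milvus_client_kwargs("server", "http://localhost:19530"))`.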

In [5]:
# Remove collection if it already exists
if client.has_collection(COLLECTION_NAME):
    client.drop_collection(COLLECTION_NAME)

Define the fields for the collection, which include the id, title, description, and embedding.

In [None]:
from pymilvus import DataType

# Create collection which includes the id, title, description, and embedding.

# 1. Create schema
schema = MilvusClient.create_schema(
    auto_id=True,
    enable_dynamic_field=False,
)

# 2. Add fields to schema
schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True)
schema.add_field(field_name="title", datatype=DataType.VARCHAR, max_length=64000)
schema.add_field(field_name="description", datatype=DataType.VARCHAR, max_length=64000)
schema.add_field(field_name="embedding", datatype=DataType.FLOAT_VECTOR, dim=DIMENSION)

# 3. Create collection with the schema
client.create_collection(collection_name=COLLECTION_NAME, schema=schema)

Create the index on the collection and load it.

In [None]:
# Create the index on the collection and load it.

# 1. Prepare index parameters
index_params = client.prepare_index_params()


# 2. Add an index on the embedding field
index_params.add_index(
    field_name="embedding",
    metric_type="L2",
    index_type="HNSW",
    params={"M": 8, "efConstruction": 64},
)


# 3. Create index
client.create_index(collection_name=COLLECTION_NAME, index_params=index_params)


# 4. Load collection
client.load_collection(collection_name=COLLECTION_NAME, replica_number=1)
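
The index above uses the `L2` metric, so a smaller score means a closer match. The toy sketch below illustrates that ranking rule in plain Python with made-up two-dimensional vectors. It is not how Milvus or HNSW computes results internally, and the names `squared_l2`, `rank_by_l2`, and `toy` are introduced only for this example:

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def rank_by_l2(query, candidates):
    """Return candidate names sorted from nearest to farthest from `query`."""
    return sorted(candidates, key=lambda name: squared_l2(query, candidates[name]))


# Made-up vectors; real embeddings in this notebook have 1536 dimensions.
toy = {
    "dog_story": [0.9, 0.1],
    "cookbook": [0.1, 0.9],
    "wolf_tale": [0.8, 0.3],
}

print(rank_by_l2([1.0, 0.0], toy))  # ['dog_story', 'wolf_tale', 'cookbook']
```

Taking the square root would not change the order, which is why the squared form is enough for ranking.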

## Dataset
With Milvus up and running we can begin grabbing our data. Hugging Face Datasets is a hub that holds many different user datasets, and for this example we are using Skelebor's book dataset. This dataset contains title-description pairs for over 1 million books. To keep the demonstration efficient, we will be working with a smaller subset of this dataset. We are going to embed each description and store it within Milvus along with its title.

In [None]:
import datasets

# Download the dataset and only use the `train` portion
dataset = datasets.load_dataset(
    "Skelebor/book_titles_and_descriptions_en_clean", split="train"
)

# Shuffle and select a subset of 10,000 entries
dataset = dataset.shuffle(seed=42).select(range(10000))

## Insert the Data
Now that we have our data on our machine we can begin embedding it and inserting it into Milvus. The embedding function takes in text and returns the embeddings in a list format.

In [9]:
# Simple function that converts the texts to embeddings
def emb_texts(texts):
    res = openai_client.embeddings.create(input=texts, model="text-embedding-3-small")
    return [res_data.embedding for res_data in res.data]

This next step does the actual inserting. We iterate through all the entries and collect them into a batch, which we embed and insert once it reaches the configured batch size. On the final iteration we also insert the last remaining partial batch, if there is one.

In [10]:
from tqdm import tqdm

# batch (data to be inserted) is a list of dictionaries
batch = []

# Embed and insert in batches
for i in tqdm(range(0, len(dataset))):
    batch.append(
        {
            "title": dataset[i]["title"] or "",
            "description": dataset[i]["description"] or "",
        }
    )

    if len(batch) % BATCH_SIZE == 0 or i == len(dataset) - 1:
        embeddings = emb_texts([item["description"] for item in batch])

        for item, emb in zip(batch, embeddings):
            item["embedding"] = emb

        client.insert(collection_name=COLLECTION_NAME, data=batch)
        batch = []

100%|██████████| 10000/10000 [01:09<00:00, 144.77it/s]
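
The batching pattern in the loop above can also be expressed as a generator that yields fixed-size chunks, with a shorter final chunk when the total is not a multiple of the batch size. This is a sketch of an alternative formulation, not the notebook's own code; `chunked` is a hypothetical helper and assumes a plain Python sequence such as a list of dictionaries:

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size <= 0:
        raise ValueError("size must be a positive integer")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Seven rows in batches of three: the last batch is the partial remainder.
for batch_of_rows in chunked(list(range(7)), 3):
    print(batch_of_rows)
```

Note that slicing a Hugging Face `Dataset` returns a dictionary of columns rather than a list of rows, so the rows would need to be collected into a list first for this helper to apply.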


## Query the Database
With our data safely inserted in Milvus, we can now perform a query. The `query` function takes a string or a list of strings, embeds them, and searches the collection. It prints the provided description, followed by each result's rank, score, title, and book description.



In [15]:
import textwrap


def query(queries, top_k=5):
    res = client.search(
        collection_name=COLLECTION_NAME,
        data=emb_texts(queries),
        limit=top_k,
        output_fields=["title", "description"],
        search_params={
            "metric_type": "L2",
            "params": {"ef": 64},
        },
    )
    print("Description:", queries)

    for hit_group in res:
        print("Results:")
        for rank, hit in enumerate(hit_group, start=1):
            entity = hit["entity"]

            print(
                f"\tRank: {rank} Score: {hit['distance']} Title: {entity.get('title', '')}"
            )
            description = entity.get("description", "")
            print(textwrap.fill(description, width=88))
            print()


query("Book about a k-9 from europe")

Description: Book about a k-9 from europe
Results:
	Rank: 1 Score: 1.086188554763794 Title: The Purloined Poodle (The Iron Druid Chronicles, #8.5)
Thanks to his relationship with the ancient Druid Atticus O'Sullivan, Oberon the Irish
wolfhound knows trouble when he smells it--and furthermore, he knows he can handle it.
When he discovers that a prizewinning poodle has been abducted in Eugene, Oregon, he
learns that it's part of a rash of hound abductions all over the Pacific Northwest.
Since the police aren't too worried about dogs they assume have run away, Oberon knows
it's up to him to track down those hounds and reunite them with their humans. For
justice! And gravy! Engaging the services of his faithful Druid, Oberon must travel
throughout Oregon and Washington to question a man with a huge salami, thwart the plans
of diabolical squirrels, and avoid, at all costs, a fight with a great big bear. But if
he's going to solve the case of the Purloined Poodle, Oberon will have to recruit