# Milvus Quick Start

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sujee/data-prep-kit-examples/blob/main/milvus/milvus_1_quick_start.ipynb)


[Milvus](https://milvus.io/) is a popular vector database.  This guide will show you how to get up and running with it quickly.

References
- [Milvus quick start](https://milvus.io/docs/quickstart.md)

**This notebook is designed to run in both a local Python environment and Google Colab 😄**


## Configuration

In [1]:
class MyConfig:
    pass

MY_CONFIG = MyConfig()

# Set any setting here
MY_CONFIG.SAMPLE_SETTING = 1

## Determine Runtime

In [2]:
# are we running in Colab?
import os

if os.getenv("COLAB_RELEASE_TAG"):
    print("Running in Colab")
    MY_CONFIG.RUNNING_IN_COLAB = True
else:
    print("NOT running in Colab")
    MY_CONFIG.RUNNING_IN_COLAB = False

NOT running in Colab
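
The same check can be wrapped in a small helper, which makes it easy to exercise without touching the real environment. This is only a minimal sketch: the function name `running_in_colab` and its `env` parameter are illustrative and not part of the notebook.

```python
import os


def running_in_colab(env=None) -> bool:
    """Return True when COLAB_RELEASE_TAG is set to a non-empty value."""
    # `env` is a hypothetical parameter for testing; it defaults to the real environment.
    env = os.environ if env is None else env
    return bool(env.get("COLAB_RELEASE_TAG"))


print(running_in_colab({"COLAB_RELEASE_TAG": "release-tag"}))  # True
print(running_in_colab({}))  # False
```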


## Install Dependencies (If required)

**A note for Google Colab Users**

After installing the dependencies, if you get errors loading libraries, **restart the runtime** and **run the notebook** again.

In [3]:
if MY_CONFIG.RUNNING_IN_COLAB:
    !pip install pymilvus 'pymilvus[model]'

## Setup Embedded Database

Milvus can run as an embedded database, which makes it easy to get started.

After running this code, you will see the files `milvus_demo.db` and `.milvus_demo.db.lock` in the current folder.

In [4]:
from pymilvus import MilvusClient

client = MilvusClient("milvus_demo.db")

## Create a Collection

A collection is how data is organized in Milvus.

Think of a collection as a database table.

Here we use the default schema:
- The primary key is `id`
- Vector field is `vector`

In [5]:
# if we already have a collection, clear it first
if client.has_collection(collection_name="demo_collection"):
    client.drop_collection(collection_name="demo_collection")

client.create_collection(
    collection_name="demo_collection",
    dimension=768,  # The vectors used in this demo have 768 dimensions
)
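
With this default schema, every entity needs an `id` and a `vector` whose length matches the collection's `dimension`. A pre-insert check along the following lines can catch mismatches early. It is only a sketch: `validate_entity` is a hypothetical helper, not a pymilvus function, and it assumes integer primary keys, as used in this demo.

```python
def validate_entity(entity: dict, dim: int = 768) -> list:
    """Return a list of problems found in one entity (an empty list means OK)."""
    problems = []
    # Assumption: the primary key is an integer, as in this demo.
    if not isinstance(entity.get("id"), int):
        problems.append("'id' must be an integer primary key")
    vector = entity.get("vector")
    if vector is None:
        problems.append("'vector' field is missing")
    elif len(vector) != dim:
        problems.append(f"'vector' has {len(vector)} dimensions, expected {dim}")
    return problems


good = {"id": 0, "vector": [0.0] * 768, "text": "example"}
bad = {"id": "a", "vector": [0.0] * 3}
print(validate_entity(good))  # []
print(validate_entity(bad))   # two problems: non-integer id, wrong vector length
```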


## Data and Embeddings

We will prepare some simple text data and compute its embeddings.

In [6]:
from pymilvus import model

# If the connection to https://huggingface.co/ fails, uncomment the following lines
# import os
# os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'


# This will download a small embedding model "paraphrase-albert-small-v2" (~50MB).
embedding_fn = model.DefaultEmbeddingFunction()

docs = [
    "Artificial intelligence was founded as an academic discipline in 1956.",
    "Alan Turing was the first person to conduct substantial research in AI.",
    "Born in Maida Vale, London, Turing was raised in southern England.",
]

vectors = embedding_fn.encode_documents(docs)
print("Dim:", embedding_fn.dim, vectors[0].shape)  # Dim: 768 (768,)

# Each entity has id, vector representation, raw text, and a subject label that we use
# to demo metadata filtering later.

# data is an array of dictionaries [ {...}, {...}]
data = [
    {"id": i, "vector": vectors[i], "text": docs[i], "subject": "history"}
    for i in range(len(vectors))
]

print("Data has", len(data), "entities, each with fields: ", data[0].keys())
print("Vector dim:", len(data[0]["vector"]))


Dim: 768 (768,)
Data has 3 entities, each with fields:  dict_keys(['id', 'vector', 'text', 'subject'])
Vector dim: 768
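
If the embedding model cannot be downloaded, the same `data` structure can be built from random vectors, at the cost of meaningless search results. The sketch below uses only the standard library; `fake_embed` is a hypothetical stand-in for `embedding_fn.encode_documents`, not part of pymilvus.

```python
import random


def fake_embed(texts, dim=768, seed=42):
    """Return one random vector per text. NOT semantically meaningful."""
    rng = random.Random(seed)  # seeded so reruns produce the same vectors
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in texts]


docs = [
    "Artificial intelligence was founded as an academic discipline in 1956.",
    "Alan Turing was the first person to conduct substantial research in AI.",
    "Born in Maida Vale, London, Turing was raised in southern England.",
]
vectors = fake_embed(docs)

# Same entity layout as in the notebook: id, vector, raw text, subject label.
data = [
    {"id": i, "vector": vectors[i], "text": docs[i], "subject": "history"}
    for i in range(len(vectors))
]
print(len(data), len(data[0]["vector"]))  # 3 768
```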


## Insert data

In [7]:
res = client.insert(collection_name="demo_collection", data=data)

print(res)

{'insert_count': 3, 'ids': [0, 1, 2], 'cost': 0}
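
The returned dictionary can be checked against the input before moving on. The sketch below uses the result shown above as sample input; `check_insert` is a hypothetical helper, not a pymilvus function.

```python
def check_insert(res: dict, expected_ids: list) -> None:
    """Raise ValueError if the insert result does not match what was sent."""
    if res["insert_count"] != len(expected_ids):
        raise ValueError(
            f"inserted {res['insert_count']} entities, expected {len(expected_ids)}"
        )
    if list(res["ids"]) != list(expected_ids):
        raise ValueError(f"unexpected ids: {res['ids']}")


# Sample result copied from the output of the insert cell.
sample = {"insert_count": 3, "ids": [0, 1, 2], "cost": 0}
check_insert(sample, expected_ids=[0, 1, 2])
print("insert OK")
```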


## Perform Vector Search

Let's run a semantic search: embed the query text, then retrieve the stored entities closest to it.
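
As a rough illustration of what "closest" means, the sketch below computes cosine similarity in plain Python on made-up 3-dimensional vectors. It assumes the collection ranks results by cosine similarity, which is believed to be the default for collections created this way; treat that as an assumption and confirm it in the Milvus documentation if it matters for your use case.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors, made up for illustration (the real embeddings here have 768 dimensions).
query = [1.0, 0.0, 0.0]
stored = {"same direction": [2.0, 0.0, 0.0], "orthogonal": [0.0, 1.0, 0.0]}

# Rank stored vectors from most to least similar to the query.
ranked = sorted(stored, key=lambda k: cosine_similarity(query, stored[k]), reverse=True)
print(ranked)  # ['same direction', 'orthogonal']
```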

In [18]:
from pprint import pprint

query = "Who is Alan Turing?"

query_vectors = embedding_fn.encode_queries([query])
# If you don't have the embedding function, you can use a fake vector to finish the demo:
# import random
# query_vectors = [[random.uniform(-1, 1) for _ in range(768)]]

res = client.search(
    collection_name="demo_collection",  # target collection
    data=query_vectors,  # query vectors
    limit=2,  # number of returned entities
    output_fields=["text", "subject"],  # specifies fields to be returned
)

# pprint(res, depth=1, indent=4)

for r in res:
    pprint(r, indent=4)


[   {   'distance': 0.5859944820404053,
        'entity': {   'subject': 'history',
                      'text': 'Born in Maida Vale, London, Turing was raised '
                              'in southern England.'},
        'id': 2},
    {   'distance': 0.5118255019187927,
        'entity': {   'subject': 'history',
                      'text': 'Alan Turing was the first person to conduct '
                              'substantial research in AI.'},
        'id': 1}]
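
The `subject` field stored with each entity can also narrow a search to matching entities. The sketch below wraps `client.search` with a `filter` expression. `search_by_subject` is a hypothetical helper, and it expects a `MilvusClient` (or any object with a compatible `search` method) to be passed in.

```python
def search_by_subject(client, query_vectors, subject, limit=2):
    """Vector search restricted to entities whose `subject` equals the given value."""
    return client.search(
        collection_name="demo_collection",  # collection created earlier in this notebook
        data=query_vectors,
        filter=f'subject == "{subject}"',  # scalar filter applied alongside the vector search
        limit=limit,
        output_fields=["text", "subject"],
    )
```

With the `client` and `query_vectors` from this notebook, `search_by_subject(client, query_vectors, "history")` should return the same two hits as the unfiltered search, since all three documents carry the `history` label.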
