# Similarity Search with Milvus and OpenAI

This tutorial shows how to use Milvus and the OpenAI embedding service for semantic similarity search.

We use OpenAI's `text-embedding-3-small` model to convert the contents of a blog page into embeddings and store them in the Milvus vector database. We then ask questions about the blog page and use each question's embedding as a Milvus search query. In this way, the document chunks most semantically similar to the question are retrieved from the vector space.
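To make the idea concrete before any service is involved, here is a minimal sketch of nearest-neighbor retrieval over toy vectors. The three-dimensional vectors, the sample sentences, and the `toy_search` helper are all made up for illustration; in the tutorial itself the embeddings come from OpenAI and the search is performed by Milvus.

```python
import math

# Hypothetical 3-dimensional "embeddings"; real embeddings have far more dimensions.
DOCS = {
    "Pricing was reduced.": [0.9, 0.1, 0.0],
    "The model supports long context.": [0.1, 0.9, 0.1],
    "A new preview model was released.": [0.0, 0.2, 0.9],
}


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def toy_search(query_vector, top_k=2):
    """Return the top_k (distance, text) pairs closest to query_vector."""
    scored = [(l2_distance(query_vector, vec), text) for text, vec in DOCS.items()]
    return sorted(scored)[:top_k]


# A query vector that points mostly along the first axis, like the pricing sentence.
results = toy_search([0.8, 0.2, 0.1], top_k=2)
for distance, text in results:
    print(f"distance = {distance:.2f}  {text}")
```

The pricing sentence ranks first because its vector is closest to the query vector. The rest of the tutorial follows the same pattern with real embeddings and a real vector database.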





## Preparations

You need an API key from the [OpenAI website](https://openai.com/product) and a running Milvus instance (see [how to start one with Docker](https://milvus.io/docs/install_standalone-docker.md)). The code below also assumes that the `openai` and `pymilvus` Python packages are installed.


Import the required packages.

In [1]:
import os
from openai import OpenAI
from pymilvus import (
    connections,
    utility,
    FieldSchema,
    CollectionSchema,
    DataType,
    Collection,
)

The following variables are the main settings to adjust for your own environment. Where needed, a comment beside the variable describes it.

In [2]:
MILVUS_HOST = "localhost"
MILVUS_PORT = "19530"
COLLECTION_NAME = "openai_doc_collection"  # Milvus collection name
EMBEDDING_MODEL = "text-embedding-3-small"  # OpenAI embedding model name; you can change it to `text-embedding-3-large` or `text-embedding-ada-002`

# Initialize the OpenAI client. The key is read from the OPENAI_API_KEY environment variable;
# pass your own key here instead if you prefer.
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
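The settings above are hard-coded. One possible alternative is to read them from the environment and fail early when the API key is missing. In this sketch, the `load_settings` helper and the `MILVUS_HOST` and `MILVUS_PORT` environment variable names are assumptions made for illustration; only `OPENAI_API_KEY` appears in the original code.

```python
import os


def load_settings(env=os.environ):
    """Read connection settings, falling back to the tutorial defaults."""
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    return {
        # MILVUS_HOST and MILVUS_PORT are hypothetical variable names.
        "milvus_host": env.get("MILVUS_HOST", "localhost"),
        "milvus_port": env.get("MILVUS_PORT", "19530"),
        "openai_api_key": api_key,
    }


# Demonstration with an explicit mapping and a placeholder key, so no real secret is needed.
settings = load_settings({"OPENAI_API_KEY": "sk-example"})
print(settings["milvus_host"], settings["milvus_port"])
```

Calling `load_settings()` without arguments would read the real process environment instead of the placeholder mapping.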

Let's call the OpenAI embedding service once and read the dimension of the vectors the model returns.

In [3]:
response = client.embeddings.create(
    input="Your text string goes here",
    model=EMBEDDING_MODEL
)
res_embedding = response.data[0].embedding
print(f'{res_embedding[:20]} ...')
dimension = len(res_embedding)
print(f'\nDimensions of `{EMBEDDING_MODEL}` embedding model is: {dimension}')

[0.00514861848205328, 0.017234396189451218, -0.018690429627895355, -0.01859242655336857, -0.04732108861207962, -0.030296696349978447, 0.027692636474967003, 0.003640083596110344, 0.011249258182942867, 0.006401647347956896, -0.0016966640250757337, 0.0157923623919487, -0.0013186553260311484, -0.007833180017769337, 0.059921376407146454, 0.050261154770851135, -0.027538632974028587, 0.009940228424966335, -0.04040492698550224, 0.05000915005803108] ...

Dimensions of `text-embedding-3-small` embedding model is: 1536


## Load documents to Milvus

This section sets up Milvus for our use case: we create a collection and build an index on it. For more information on how to use Milvus, see the [example code](https://milvus.io/docs/example_code.md).


In [4]:
# Connect to Milvus
connections.connect(host=MILVUS_HOST, port=MILVUS_PORT)

# Remove collection if it already exists
if utility.has_collection(COLLECTION_NAME):
    utility.drop_collection(COLLECTION_NAME)

# Create a collection with three fields: primary key, text, and embedding.
fields = [
    FieldSchema(name="pk", dtype=DataType.INT64, is_primary=True, auto_id=False),
    FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=65_535),
    FieldSchema(name="embeddings", dtype=DataType.FLOAT_VECTOR, dim=dimension)
]
schema = CollectionSchema(fields, description="Sentences from the OpenAI embedding blog and their embeddings.")
doc_collection = Collection(COLLECTION_NAME, schema)

# Create an index for the collection.
index = {
    "index_type": "IVF_FLAT",
    "metric_type": "L2",
    "params": {"nlist": 128},
}
doc_collection.create_index("embeddings", index)

Status(code=0, message=)
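The `nlist` parameter in the index above and the `nprobe` parameter used at search time both belong to the IVF family of indexes: vectors are grouped into `nlist` clusters, and a query scans only the `nprobe` clusters whose centroids are closest to it. The following toy sketch illustrates that idea in pure Python. It is not how Milvus implements `IVF_FLAT`; the 2-D data and hand-picked centroids are made up, whereas a real index learns its centroids by clustering the data.

```python
def sq_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_ivf(vectors, centroids):
    """Assign each vector index to the bucket of its nearest centroid."""
    buckets = {i: [] for i in range(len(centroids))}
    for idx, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: sq_l2(vec, centroids[c]))
        buckets[nearest].append(idx)
    return buckets


def ivf_search(query, vectors, centroids, buckets, nprobe, top_k):
    """Scan only the nprobe buckets whose centroids are closest to the query."""
    probe_order = sorted(range(len(centroids)), key=lambda c: sq_l2(query, centroids[c]))
    candidates = [idx for c in probe_order[:nprobe] for idx in buckets[c]]
    ranked = sorted(candidates, key=lambda idx: sq_l2(query, vectors[idx]))
    return ranked[:top_k]


# Hypothetical 2-D data with two hand-picked centroids (nlist = 2).
vectors = [[0.0, 0.1], [0.2, 0.0], [1.0, 1.1], [0.9, 1.0], [0.45, 0.45]]
centroids = [[0.0, 0.0], [1.0, 1.0]]
buckets = build_ivf(vectors, centroids)

query = [0.55, 0.55]
narrow = ivf_search(query, vectors, centroids, buckets, nprobe=1, top_k=1)
wide = ivf_search(query, vectors, centroids, buckets, nprobe=2, top_k=1)
print(narrow, wide)
```

With `nprobe=1` the search returns index 3 and misses the true nearest neighbor at index 4, which sits just across the cluster boundary; with `nprobe=2` it is found. This is the trade-off the two parameters control: scanning fewer clusters is faster but can lower recall.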

Here we have prepared a data source crawled from an OpenAI [blog post](https://openai.com/blog/new-embedding-models-and-api-updates#fn-A) and saved as `openai_embedding_blog.txt`. The text has been split into sentences, one sentence per line.

Convert each line in the document into an embedding, and then insert the embeddings into the Milvus collection.

In [5]:
with open('./docs/openai_embedding_blog.txt', 'r', encoding='utf-8') as f:
    lines = f.readlines()

embeddings = []
for line in lines:
    response = client.embeddings.create(
        input=line,
        model=EMBEDDING_MODEL
    )
    embeddings.append(response.data[0].embedding)

entities = [
    list(range(len(lines))),  # field pk
    lines,  # field text
    embeddings,  # field embeddings
]
insert_result = doc_collection.insert(entities)

# After the final entity is inserted, call flush so that no growing segments are left in memory
doc_collection.flush()
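The loop above sends one request per line. The embeddings endpoint also accepts a list of strings as `input`, so requests can be batched, subject to the API's per-request limits. A minimal sketch follows; the `embed_in_batches` helper and the batch size are illustrative choices, not part of the original notebook, and the stub client exists only so the example runs offline.

```python
from types import SimpleNamespace


def embed_in_batches(client, texts, model, batch_size=100):
    """Embed texts in batches, preserving input order."""
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        response = client.embeddings.create(input=batch, model=model)
        # Each returned item carries the index of its input within the batch.
        ordered = sorted(response.data, key=lambda item: item.index)
        vectors.extend(item.embedding for item in ordered)
    return vectors


class StubEmbeddings:
    """Offline stand-in that mimics the shape of the embeddings response."""

    def __init__(self):
        self.calls = 0

    def create(self, input, model):
        self.calls += 1
        # Fake one-dimensional "embedding": the length of each text.
        data = [
            SimpleNamespace(index=i, embedding=[float(len(text))])
            for i, text in enumerate(input)
        ]
        return SimpleNamespace(data=data)


stub_client = SimpleNamespace(embeddings=StubEmbeddings())
vectors = embed_in_batches(stub_client, ["a", "bb", "ccc"], model="stub-model", batch_size=2)
print(vectors, stub_client.embeddings.calls)
```

With the real client, `embeddings = embed_in_batches(client, lines, EMBEDDING_MODEL)` could stand in for the per-line loop, reducing the number of requests.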

## Query

Next we build a `semantic_search` function that retrieves the documents in Milvus that are most semantically similar to a query.


In [6]:
# Load the collection into memory for searching
doc_collection.load()


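def semantic_search(query, top_k=3):
    """Embed the query and return the top_k closest documents from Milvus."""
    response = client.embeddings.create(
        input=query,
        model=EMBEDDING_MODEL
    )
    vectors_to_search = [response.data[0].embedding]
    # The metric must match the one used to build the index.
    search_params = {
        "metric_type": "L2",
        "params": {"nprobe": 10},  # number of index clusters to scan per query
    }
    result = doc_collection.search(
        vectors_to_search,
        "embeddings",
        search_params,
        limit=top_k,
        output_fields=["text"],
    )
    # Only one query vector was sent, so return its list of hits.
    return result[0]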

First we ask about the price of one of the new embedding models.

In [7]:
question = 'What is the price of the `text-embedding-3-small` model?'

match_results = semantic_search(question, top_k=3)
for match in match_results:
    print(f"distance = {match.distance:.2f}\n{match.text}")

distance = 0.50
Pricing for `text-embedding-3-small` has therefore been reduced by 5X compared to `text-embedding-ada-002`, from a price per 1k tokens of $0.0001 to $0.00002.

distance = 0.56
`text-embedding-3-small` is our new highly efficient embedding model and provides a significant upgrade over its predecessor, the `text-embedding- ada-002` model released in December 2022.

distance = 0.56
**`text-embedding-3-large` is our new best performing model.



The smaller the distance, the closer the vectors are and the more similar their semantics. Here the top result directly answers the question.
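These distances can be related to cosine similarity under two assumptions that are worth checking against the current documentation: that OpenAI embeddings are normalized to unit length, and that Milvus reports the squared Euclidean distance for the `L2` metric. If both hold, a distance `d` and a cosine similarity `s` satisfy `d = 2 - 2s`. The sketch below checks the identity on made-up unit vectors; the helper names are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a, b):
    """Dot product, which equals cosine similarity for unit vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_from_squared_l2(distance):
    """Convert a squared L2 distance between unit vectors to cosine similarity."""
    return 1.0 - distance / 2.0


a = normalize([3.0, 4.0, 0.0])
b = normalize([4.0, 3.0, 1.0])
d = squared_l2(a, b)
identity_holds = math.isclose(cosine_from_squared_l2(d), cosine(a, b))
print(identity_holds)
print(cosine_from_squared_l2(0.50))
```

Under these assumptions, the top hit's distance of 0.50 would correspond to a cosine similarity of 0.75.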

Let's try another question, this time about GPT-4.

In [8]:
question = 'What is the context window size of GPT-4??'

match_results = semantic_search(question, top_k=3)
for match in match_results:
    print(f"distance = {match.distance:.2f}\n{match.text}")

distance = 0.97
Over 70% of requests from GPT-4 API customers have transitioned to GPT-4 Turbo since its release, as developers take advantage of its updated knowledge cutoff, larger 128k context windows, and lower prices.

distance = 1.02
Today, we are releasing an updated GPT-4 Turbo preview model, `gpt-4-0125-preview`.

distance = 1.02
* Overview * Index * GPT-4 * DALL·E 3



In both cases, the closest match returned from the Milvus collection contains the answer to the question, which shows that the retrieval works on the meaning of the query.
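The third hit for the GPT-4 question is navigation-menu text at distance 1.02, so `top_k` alone does not guarantee that every returned item is relevant. One common mitigation is a distance cutoff. This is a minimal sketch; the `filter_by_distance` helper and the cutoff of 1.0 are hypothetical and would need tuning on real data.

```python
def filter_by_distance(hits, max_distance):
    """Keep (distance, text) pairs whose distance is at most max_distance."""
    return [(distance, text) for distance, text in hits if distance <= max_distance]


# Distances copied from the output above; texts abbreviated.
hits = [
    (0.97, "Over 70% of requests from GPT-4 API customers ..."),
    (1.02, "Today, we are releasing an updated GPT-4 Turbo preview model ..."),
    (1.02, "* Overview * Index * GPT-4 ..."),
]
kept = filter_by_distance(hits, max_distance=1.0)
print(kept)
```

With real search results, the input pairs could be built as `[(m.distance, m.text) for m in match_results]`. Only the first hit survives this particular cutoff.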

Finally, we can drop the collection to free resources.

In [9]:
# Drops the collection
utility.drop_collection(COLLECTION_NAME)