# Similarity Search with Milvus and OpenAI

We'll showcase how [OpenAI's Embedding API](https://platform.openai.com/docs/guides/embeddings) can be used with the Milvus vector database to search across book titles. Many existing book search solutions (such as those used by public libraries) rely on keyword matching rather than a semantic understanding of what the title is actually about. Searching over embeddings produced by a trained model is known as semantic search, and the same approach extends to other text-based use cases, including anomaly detection and document search.

## Getting started
We need an API key from the OpenAI website. Then we need to install some dependencies. Your PyMilvus version should be higher than 2.4.2.

In [None]:
!pip install pymilvus
!pip install openai

We'll also prepare the data that we're going to use for this example. You can grab the book titles [here](https://www.kaggle.com/datasets/jealousleopard/goodreadsbooks). Let's create a function to load book titles from our CSV.

In [None]:
import csv
import os
import random
from openai import OpenAI

In [None]:
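# Extract the book titles
def csv_load(file):
    with open(file, newline='') as f:
        reader = csv.reader(f, delimiter=',')
        next(reader, None)  # Skip the header row so the column name is not treated as a title
        for row in reader:
            yield row[1]  # The title is the second column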

With this, we're ready to move on to generating embeddings.

## Searching book titles with OpenAI & Milvus
Here, we list the main parameters needed for this example. Beside each is a description of what it represents.

In [None]:
FILE = './books.csv'  # Download it from https://www.kaggle.com/datasets/jealousleopard/goodreadsbooks and save it in the folder that holds your script.
COLLECTION_NAME = 'title_db'  # Collection name
DIMENSION = 1536  # Embeddings size
COUNT = 100  # How many titles to embed and insert.
MILVUS_URI = 'example.db'  # Define the URI for Milvus Lite
MILVUS_TOKEN = ''  # Can stay empty for a local Milvus Lite file
OPENAI_ENGINE = 'text-embedding-3-small'  # Which model to use; `text-embedding-ada-002` also returns 1536 dimensions, while `text-embedding-3-large` returns 3072 by default, so update DIMENSION if you switch to it
os.environ["OPENAI_API_KEY"] = "<YOUR_OPENAI_API_KEY>"  # Use your own OpenAI API key here
openai_client = OpenAI()

### Note
Because embedding with a free OpenAI account is relatively slow, we use a small data set to balance the script's run time against the precision of the search results. You can change the `COUNT` constant to fit your needs.
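
The collection's `DIMENSION` also has to match the length of the vectors the chosen model returns. As a small sketch, the lookup below records the default output sizes of the three models named in the parameters; `MODEL_DIMENSIONS` and `dimension_for` are hypothetical helper names, not part of the OpenAI or PyMilvus libraries.

```python
# Default embedding sizes for the OpenAI models named in this example.
MODEL_DIMENSIONS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
    "text-embedding-ada-002": 1536,
}

def dimension_for(model):
    """Return the default vector size for a known model, or raise a clear error."""
    try:
        return MODEL_DIMENSIONS[model]
    except KeyError:
        raise ValueError(f"Unknown embedding model: {model!r}") from None

print(dimension_for("text-embedding-3-small"))  # 1536
```

With a helper like this, `DIMENSION = dimension_for(OPENAI_ENGINE)` would keep the two settings in sync when the model changes.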

In [None]:
from pymilvus import MilvusClient, DataType

# Set up a Milvus client
milvus_client = MilvusClient(
    uri=MILVUS_URI,
    token=MILVUS_TOKEN
)
milvus_client.create_collection(
    collection_name=COLLECTION_NAME,
    vector_field_name="vector",
    dimension=DIMENSION,
    auto_id=False,
    enable_dynamic_field=True,
    metric_type="COSINE"
)

Once we have the collection set up, we need to start inserting our data. This process involves three steps: reading the data, embedding the titles, and inserting into Milvus.

In [None]:
# Extract embedding from text using OpenAI
def embed(text):
    response = openai_client.embeddings.create(
        input=text,
        model=OPENAI_ENGINE
    )
    return response.data[0].embedding

# Insert each title and its embedding
for idx, text in enumerate(random.sample(sorted(csv_load(FILE)), k=COUNT)):  # Load COUNT amount of random values from dataset
    ins=[{"id": idx, "title": (text[:198] + '..') if len(text) > 200 else text, 'vector': embed(text)}]  # Insert the title id, the title text, and the title embedding vector
    milvus_client.insert(COLLECTION_NAME, ins)
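
The loop above makes one embedding request and one insert call per title. The embeddings endpoint also accepts a list of inputs, and `insert` accepts a list of rows, so the work can be grouped into batches. The following is a minimal sketch rather than part of the original example: `batched`, `truncate`, `build_rows`, and the `embed_many` callable are hypothetical names, and a stand-in embedder is used so the snippet runs without an API key.

```python
def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def truncate(title, limit=200):
    """Shorten long titles the same way the insert loop does."""
    return title[:limit - 2] + '..' if len(title) > limit else title

def build_rows(titles, embed_many, batch_size=32):
    """Yield one list of rows per batch; each list is ready to pass to insert()."""
    next_id = 0
    for batch in batched(titles, batch_size):
        vectors = embed_many(batch)  # One embedding call per batch
        rows = []
        for title, vector in zip(batch, vectors):
            rows.append({"id": next_id, "title": truncate(title), "vector": vector})
            next_id += 1
        yield rows

# Stand-in embedder (an assumption for this sketch) so it runs offline.
def fake_embed_many(texts):
    return [[float(len(t)), 0.0] for t in texts]

titles = ["On Beauty", "Close Range", "City of God"]
for rows in build_rows(titles, fake_embed_many, batch_size=2):
    print([row["id"] for row in rows])
# prints [0, 1] and then [2]
```

With the real clients, `embed_many` could wrap `openai_client.embeddings.create(input=texts, model=OPENAI_ENGINE)` and return `[d.embedding for d in response.data]`, and each yielded list would be passed to `milvus_client.insert(COLLECTION_NAME, rows)`.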

In [None]:
# Search the database based on input text
def search(text):
    results = milvus_client.search(
        collection_name = COLLECTION_NAME,
        data=[embed(text)],    # Embedded search value
        output_fields=["title"], # Include title field in result
        search_params={"metric_type": "COSINE"},
        limit=5 # Limit to five results per search
    )
    ret=[]
    for hit in results[0]:
        row=[]
        row.extend([hit["id"], hit["distance"], hit["entity"]['title']])  # Get the id, distance, and title for the results
        ret.append(row)
    return ret

search_terms=['self-improvement', 'landscape']

for x in search_terms:
    print('Search term:', x)
    for result in search(x):
        print(result)

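You should see output similar to the following. Because the titles are sampled at random, your results will differ. With the `COSINE` metric, the second value in each row is a similarity score, so a higher value means a closer match: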

In [None]:
# Search term: self-improvement
# [92, 0.3417697846889496, 'Me vs. Me']
# [68, 0.34026533365249634, 'Finding and Exploring Your Spiritual Path: An Exploration of the Pleasures and Perils of Seeking Personal Enlightenment']
# [75, 0.3175859749317169, 'The Power of Infinite Love & Gratitude: An Evolutionary Journey to Awakening Your Spirit']
# [18, 0.3085497319698334, 'On Beauty']
# [41, 0.2611697316169739, 'A Sentimental Education']

# Search term: landscape
# [18, 0.2569754123687744, 'On Beauty']
# [95, 0.2353191375732422, 'Close Range']
# [24, 0.22978681325912476, 'This Lullaby']
# [72, 0.20614467561244965, "Cliffs of Despair: A Journey to Suicide's Edge"]
# [23, 0.20127272605895996, 'City of God']
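
To help read these scores, here is a minimal sketch of cosine similarity for two vectors in plain Python. It illustrates the formula behind the `COSINE` metric and is not how Milvus computes it internally; the function name and the toy vectors are made up for illustration.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # 1.0  (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # 0.0  (orthogonal)
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0 (opposite direction)
```

Scores near 1 mean the query and the title embeddings point in nearly the same direction, while scores near 0 mean they are largely unrelated.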