# Book Title Search Using Zilliz Cloud and OpenAI
In this notebook we search for the book titles that best match a query phrase, using Zilliz Cloud as the vector database and OpenAI to generate the embeddings.

## Packages
We begin by importing the required packages. In this example, the only non-builtin packages are openai and pymilvus, which are the client packages for their respective services. If they are not present on your system, install them with `pip install openai pymilvus`.

In [62]:
import csv
import json
import random
import openai
import time
from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection, utility

## Parameters
These are the main parameters to modify when running with your own accounts. Each one has a comment beside it describing what it is.

In [63]:
FILE = '/Users/filiphaltmayer/Documents/code/openai/books.csv'  # https://www.kaggle.com/datasets/jealousleopard/goodreadsbooks
COLLECTION_NAME = 'title_db'  # Collection name
DIMENSION = 1536  # Embeddings size
COUNT = 100  # How many titles to embed and insert
URI='https://in01-4770aba22d5783f.aws-us-west-2.vectordb.zillizcloud.com:19531'  # Endpoint URI obtained from Zilliz Cloud
USER='db_admin'  # Username specified when you created this database
PASSWORD='TESTTEST!1'  # Password set for that account 
OPENAI_ENGINE = 'text-embedding-ada-002'  # Which engine to use
openai.api_key = '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'  # OpenAI api key
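
Since these parameters include credentials, one option is to read them from environment variables rather than writing them into the notebook. The following is a minimal sketch of that idea, not part of the original notebook. The function `load_settings` and the variable names `ZILLIZ_URI`, `ZILLIZ_USER`, `ZILLIZ_PASSWORD`, and `OPENAI_API_KEY` are hypothetical and can be renamed freely.

```python
import os


def load_settings(env=os.environ):
    """Collect connection settings from a mapping of environment variables.

    The variable names are hypothetical; pick whatever suits your setup.
    """
    required = ["ZILLIZ_URI", "ZILLIZ_USER", "ZILLIZ_PASSWORD", "OPENAI_API_KEY"]
    # Report every missing or empty variable at once instead of failing on the first.
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in required}
```

With something like this in place, the values returned by `load_settings()` could be assigned to `URI`, `USER`, `PASSWORD`, and `openai.api_key` in place of the literals above.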


## Zilliz Cloud
This section sets up the database in Zilliz Cloud for this use case. We need to create a collection and then build an index on it.

In [64]:
# Connect to Milvus Database
connections.connect(uri=URI, user=USER, password=PASSWORD, secure=True)

In [65]:
# Remove collection if it already exists
if utility.has_collection(COLLECTION_NAME):
    utility.drop_collection(COLLECTION_NAME)

In [66]:
# Create collection which includes the id, title, and embedding.
fields = [
    FieldSchema(name='id', dtype=DataType.INT64, description='Ids', is_primary=True, auto_id=False),
    FieldSchema(name='title', dtype=DataType.VARCHAR, description='Title texts', max_length=200),
    FieldSchema(name='embedding', dtype=DataType.FLOAT_VECTOR, description='Embedding vectors', dim=DIMENSION)
]
schema = CollectionSchema(fields=fields, description='Title collection')
collection = Collection(name=COLLECTION_NAME, schema=schema)

In [67]:
# Create an AUTOINDEX index on the embedding field, using L2 distance as the metric.
index_params = {
    'metric_type':'L2',
    'index_type':"AUTOINDEX",
    'params':{}
}
collection.create_index(field_name="embedding", index_params=index_params)

Status(code=0, message=)

## Insert Data
Once the collection is set up, we can start inserting our data. This takes three steps: reading the data, embedding the titles, and inserting them into Zilliz Cloud.

In [68]:
# Extract the book titles
def csv_load(file):
    with open(file, newline='') as f:
        reader = csv.reader(f, delimiter=',')
        for row in reader:
            yield row[1]

# Extract embedding from text using OpenAI
def embed(text):
    return openai.Embedding.create(
        input=text, 
        engine=OPENAI_ENGINE)["data"][0]["embedding"]
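
Note that `csv_load` yields the second column of every row, so if the CSV file starts with a header row, the header text is treated as a title too. The sketch below shows one way to skip it. It assumes the file has a single header row and that the title is in the second column, as in the cell above; `load_titles` and the toy CSV text are hypothetical and only serve as an illustration.

```python
import csv
import io


def load_titles(lines, has_header=True):
    """Yield non-empty titles from the second column of CSV input."""
    reader = csv.reader(lines, delimiter=',')
    if has_header:
        next(reader, None)  # Discard the header row, if there is one
    for row in reader:
        # Skip short or blank rows instead of raising IndexError
        if len(row) > 1 and row[1].strip():
            yield row[1].strip()


# Toy in-memory data standing in for the real file
sample = io.StringIO(
    "bookID,title,authors\n"
    "1,White Fang,Jack London\n"
    "2,Christine,Stephen King\n"
)
titles = list(load_titles(sample))
```

For the real dataset, the same function could be given an open file handle, for example `open(FILE, newline='')`.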

In [69]:
# Insert each title and its embedding
for idx, text in enumerate(random.sample(sorted(csv_load(FILE)), k = COUNT)):  # Load COUNT amount of random values from dataset
    ins = [[idx], [text], [embed(text)]]  # Insert the title id, the title text, and the title embedding vector
    collection.insert(ins)
    time.sleep(3)  # Pause between requests, presumably to stay within OpenAI rate limits
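
The loop above makes one embedding request and one insert call per title. A possible alternative is to group titles into batches so that fewer calls are needed. The sketch below shows only the batching logic: `chunked` and `insert_in_batches` are hypothetical helpers, and `embed_batch` and `insert` are caller-supplied functions, so nothing here calls OpenAI or Zilliz Cloud directly.

```python
def chunked(items, size):
    """Yield consecutive lists of at most `size` items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # Emit the final, possibly shorter, batch
        yield batch


def insert_in_batches(titles, embed_batch, insert, batch_size=20):
    """Embed and insert titles batch by batch; return the number of rows inserted.

    embed_batch(list_of_titles) must return one vector per title, and
    insert(columns) receives [ids, titles, vectors] in column order.
    Both callables are assumptions supplied by the caller.
    """
    next_id = 0
    for batch in chunked(titles, batch_size):
        vectors = embed_batch(batch)
        ids = list(range(next_id, next_id + len(batch)))
        insert([ids, batch, vectors])
        next_id += len(batch)
    return next_id
```

In this notebook, `insert` could be `collection.insert`, which already takes column-ordered lists as in the loop above. `embed_batch` would need to wrap an embedding request that accepts a list of inputs and returns the vectors in the same order; that wrapper is left out here because it depends on the client version in use.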

In [70]:
# Load the collection into memory for searching
collection.load()

## Search for Titles
Once all the data is inserted and indexed within Zilliz Cloud, we can search for titles by taking our search phrase, embedding it with OpenAI, and searching with Zilliz Cloud.
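
To make the idea of vector search concrete, the sketch below performs an exhaustive nearest-neighbour search over a few toy two-dimensional vectors using Euclidean (L2) distance. It is only a conceptual illustration: Zilliz Cloud uses an index rather than comparing the query against every stored vector, and the names `l2_distance`, `brute_force_search`, and `toy_vectors` are hypothetical.

```python
import heapq
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def brute_force_search(query, vectors, limit=5):
    """Return up to `limit` (id, distance) pairs, closest first."""
    scored = ((idx, l2_distance(query, vec)) for idx, vec in vectors.items())
    return heapq.nsmallest(limit, scored, key=lambda pair: pair[1])


# Toy stand-ins for the 1536-dimensional title embeddings
toy_vectors = {0: [0.0, 0.0], 1: [1.0, 0.0], 2: [3.0, 4.0]}
nearest = brute_force_search([0.9, 0.1], toy_vectors, limit=2)
nearest_ids = [idx for idx, _ in nearest]  # ids 1 and 0, in that order
```

With a distance metric such as L2, a smaller value means a closer match, so results are ordered from smallest to largest distance.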

In [71]:
# Search the database based on input text
def search(text):
    # Search parameters for the index
    search_params = {
        "metric_type": "L2", 
        "params": {"level": 1}
    }

    results = collection.search(
        data = [embed(text)],  # Embedded search value
        anns_field="embedding",  # Search across embeddings
        param=search_params,
        limit = 5,  # Limit to five results per search
        output_fields=['title']  # Include title field in result
    )

    ret = []
    for hit in results[0]:
        row = []
        row.extend([hit.id, hit.score, hit.entity.get('title')])  # Get the id, distance, and title for the results
        ret.append(row)
    return ret


In [72]:
# Search for the titles that most closely match these phrases.
search_terms = ['A large bird', 'An old human']

# Print out the results in order of [id, L2 distance (lower is closer), title]
for x in search_terms:
    print('Search term:', x)
    for result in search(x):
        print(result)
    print()

Search term: A large bird
[76, 0.37748369574546814, 'Hope is the Thing with Feathers: A Personal Chronicle of Vanished Birds']
[54, 0.38402289152145386, 'The Big U']
[60, 0.3866572082042694, 'White Fang']
[37, 0.41218090057373047, 'A Confederate General from Big Sur / Dreaming of Babylon / The Hawkline Monster']
[29, 0.4169483184814453, 'The Hand of Dinotopia']

Search term: An old human
[36, 0.3807057738304138, 'The Story of a Shipwrecked Sailor']
[70, 0.38546591997146606, 'Christine']
[35, 0.3888280391693115, "Little Pilgrim's Progress"]
[29, 0.39572960138320923, 'The Hand of Dinotopia']
[77, 0.40066787600517273, "Self-Made Man: One Woman's Journey Into Manhood and Back Again"]

