# Book Title Search Using Towhee, Milvus and OpenAI
In this notebook we go over how to search for the best matching book titles using [Towhee](https://github.com/towhee-io/towhee) as the data processing pipeline, [Milvus](https://github.com/milvus-io/milvus) as the Vector Database and [OpenAI](https://beta.openai.com/docs/guides/embeddings) as the embedding system.

## Packages
We begin by importing the required packages. The only non-built-in packages in this example are towhee and pymilvus, the client packages for their respective services. Install them with `pip install pymilvus towhee`.

In [None]:
import csv
import json
import random
import time
from towhee.dc2 import pipe, ops, DataCollection
from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection, utility

## Parameters
These are the main parameters to adjust when running with your own accounts. The comment beside each one describes what it controls.

In [None]:
FILE = './data/books.csv'  # https://www.kaggle.com/datasets/jealousleopard/goodreadsbooks
COLLECTION_NAME = 'title_db'  # Collection name
DIMENSION = 1536  # Embeddings size
COUNT = 100  # How many titles to embed and insert
HOST = 'localhost'  # Milvus IP address
PORT = 19530  # Milvus port
OPENAI_KEY = 'your key here'  # OpenAI API key
OPENAI_ENGINE = 'text-embedding-ada-002'  # Which embedding engine to use
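
If you would rather not paste an API key into the notebook, the same parameters can be read from the environment. This is a minimal sketch, not part of the original notebook: the helper `load_settings` and the variable names `MILVUS_HOST` and `MILVUS_PORT` are hypothetical, and `OPENAI_API_KEY` is assumed to be where you keep your key.

```python
import os


def load_settings(env=os.environ):
    """Build connection settings from environment variables.

    Falls back to the notebook defaults when a variable is not set.
    MILVUS_HOST and MILVUS_PORT are hypothetical names chosen for this sketch.
    """
    return {
        'host': env.get('MILVUS_HOST', 'localhost'),
        'port': int(env.get('MILVUS_PORT', '19530')),
        'openai_key': env.get('OPENAI_API_KEY', ''),
    }


settings = load_settings()
```

The resulting values could then be assigned to `HOST`, `PORT` and `OPENAI_KEY` in place of the hard-coded ones.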

## Milvus
This section sets up Milvus for this use case: we create a collection and then build an index on it. For more information on how to install and run Milvus, see the [Milvus documentation](https://milvus.io/docs).

In [None]:
# Connect to Milvus Database
connections.connect(host=HOST, port=PORT)

# Remove collection if it already exists
if utility.has_collection(COLLECTION_NAME):
    utility.drop_collection(COLLECTION_NAME)

In [None]:
# Create collection which includes the id, title, and embedding.
fields = [
    FieldSchema(name='id', dtype=DataType.INT64, description='Ids', is_primary=True, auto_id=False),
    FieldSchema(name='title', dtype=DataType.VARCHAR, description='Title texts', max_length=200),
    FieldSchema(name='embedding', dtype=DataType.FLOAT_VECTOR, description='Embedding vectors', dim=DIMENSION)
]
schema = CollectionSchema(fields=fields, description='Title collection')
collection = Collection(name=COLLECTION_NAME, schema=schema)

In [None]:
# Create an IVF_FLAT index for the collection.
index_params = {
    'metric_type': 'L2',
    'index_type': 'IVF_FLAT',
    'params': {'nlist': 1536}
}
collection.create_index(field_name="embedding", index_params=index_params)

## Insert Data
Once the collection is set up, we can start inserting data. This is done with a Towhee pipeline that has two steps: embedding the input text, and inserting the resulting vector, together with its id and title, into Milvus.

In [None]:
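# Extract the book titles (second column of the CSV)
def csv_load(file):
    with open(file, newline='') as f:
        reader = csv.reader(f, delimiter=',')
        next(reader, None)  # Skip the header row so 'title' is not treated as a book
        for row in reader:
            yield row[1]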
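
To see which column the loader picks out without downloading the dataset, the same logic can be run against a small in-memory sample. This is only a sketch: the sample rows and the helper name `titles_from` are hypothetical, and it assumes the real file has a header row followed by rows whose second column is the title.

```python
import csv
import io

# Hypothetical sample shaped like the dataset: a header row, then one row per book.
SAMPLE = (
    'bookID,title,authors\n'
    '1,"Dune",Frank Herbert\n'
    '2,"Emma, Annotated",Jane Austen\n'
)


def titles_from(handle, skip_header=True):
    """Yield the second column of each CSV row read from an open file handle."""
    reader = csv.reader(handle, delimiter=',')
    if skip_header:
        next(reader, None)  # Drop the header row if there is one
    for row in reader:
        if len(row) > 1:  # Ignore blank or malformed rows
            yield row[1]


sample_titles = list(titles_from(io.StringIO(SAMPLE)))
```

Because `csv.reader` handles quoting, a title containing a comma stays in one piece, and passing `skip_header=False` shows that the header value `title` would otherwise be returned as if it were a book.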

In [None]:
# Pipeline which embeds data and inserts into Milvus
insert_p = (
    pipe.input('id', 'text')
    .map(
        'text',  # Input columns
        'vec',   # Output columns
        ops.text_embedding.openai(
            engine=OPENAI_ENGINE,
            api_key=OPENAI_KEY
        )
    )
    .map(
        ('id', 'text', 'vec'),
        (),
        ops.ann_insert.milvus_client(
            host=HOST,
            port=PORT,
            collection_name=COLLECTION_NAME
        )
    )
    .output()
)

In [None]:
# Insert the book titles
for idx, text in enumerate(random.sample(sorted(csv_load(FILE)), k=COUNT)):  # Pick COUNT random titles from the dataset
    insert_p(idx, text)
    time.sleep(3)  # Pause between requests (e.g. to stay under API rate limits)

## Search the Data
With the collection set up and all our embedded data inserted, we can begin searching. The search pipeline is similar to the insert pipeline, except that the second step queries Milvus instead of writing to it. The search phrase is embedded, and its vector is compared against the stored vectors to find the closest matches. These matches are the most semantically similar titles.

In [None]:
# Pipeline to search through titles.
search_p = (
    pipe.input('text')
    .map(
        'text',
        'vec',
        ops.text_embedding.openai(
            engine=OPENAI_ENGINE,
            api_key=OPENAI_KEY
        )
    )
    .flat_map(
        'vec',
        ('id', 'score', 'text'),
        ops.ann_search.milvus_client(
            host=HOST,
            port=PORT,
            collection_name=COLLECTION_NAME,
            output_fields=['title']
        )
    )
    .output('id', 'score', 'text')
)

In [None]:
collection.load()  # The collection must be loaded into memory before it can be searched
dc = DataCollection(search_p('self-improvement'))
dc.show()
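
Conceptually, the search ranks stored vectors by their distance to the query vector under the `L2` metric chosen for the index. The sketch below shows that idea as an exhaustive search over a few toy 2-dimensional vectors in plain Python. It is an illustration only, not how Milvus implements it: the helper names and toy data are hypothetical, an IVF_FLAT index is approximate and typically scans only some clusters rather than every vector, and the scores Milvus reports may be on a different scale.

```python
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(query, stored, k=2):
    """Return the k (title, distance) pairs closest to the query vector.

    `stored` maps each title to its embedding vector.
    """
    scored = [(title, l2_distance(query, vec)) for title, vec in stored.items()]
    scored.sort(key=lambda pair: pair[1])  # Smaller distance means a closer match
    return scored[:k]


# Hypothetical toy "embeddings"; real ones have DIMENSION = 1536 components.
toy = {
    'Title A': [0.0, 0.0],
    'Title B': [3.0, 4.0],
    'Title C': [1.0, 0.0],
}
matches = nearest([0.0, 0.0], toy, k=2)
```

Here `matches` holds the two closest titles in order of increasing distance, which mirrors the `id`, `score` and `text` rows returned by the search pipeline.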