# Question Answering Using Milvus, Towhee and Hugging Face
In this notebook we go over how to search for the best answer to a question, using Milvus as the vector database, Towhee to build the pipelines, and a Hugging Face model to embed the text.

## Packages
We begin by importing the required packages. In this example, the only non-built-in packages are `datasets`, `towhee`, and `pymilvus`: `datasets` is the Hugging Face package for loading the data, `towhee` builds the embedding pipelines, and `pymilvus` is the Python client for Milvus. If they are not present on your system, install them with `pip install towhee datasets pymilvus`.

In [1]:
# Disable tokenizer parallelism to silence the warning caused by forking
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection, utility
from towhee.dc2 import pipe, ops, DataCollection
from datasets import load_dataset_builder, load_dataset, Dataset


## Parameters
These are the main parameters to adjust when running against your own setup. Each one is described in the comment beside it.

In [2]:
DATASET = 'squad'  # Hugging Face dataset to use
MODEL = 'bert-base-uncased'  # Transformer model to use
INSERT_RATIO = .001  # Fraction of the dataset to embed and insert (0.001 = 0.1%)
COLLECTION_NAME = 'huggingface_db'  # Collection name
DIMENSION = 768  # Embeddings size
LIMIT = 10  # How many results to search for
HOST = 'localhost'  # Milvus IP
PORT = '19530'  # Milvus Port

## Milvus
This section sets up the Milvus database for this use case. Within Milvus we need to create a collection and build an index on it.

In [3]:
# Connect to Milvus Database
connections.connect(host=HOST, port=PORT)

In [4]:
# Remove collection if it already exists
if utility.has_collection(COLLECTION_NAME):
    utility.drop_collection(COLLECTION_NAME)

In [5]:
# Create a collection that holds the id, original question, answer, and question embedding.
fields = [
    FieldSchema(name='id', dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name='original_question', dtype=DataType.VARCHAR, max_length=1000),
    FieldSchema(name='answer', dtype=DataType.VARCHAR, max_length=1000),
    FieldSchema(name='original_question_embedding', dtype=DataType.FLOAT_VECTOR, dim=DIMENSION)
]
schema = CollectionSchema(fields=fields)
collection = Collection(name=COLLECTION_NAME, schema=schema)

In [6]:
# Create an IVF_FLAT index (L2 metric) on the embedding field, then load the collection.
index_params = {
    'metric_type':'L2',
    'index_type':"IVF_FLAT",
    'params':{"nlist":1536}
}
collection.create_index(field_name="original_question_embedding", index_params=index_params)
collection.load()
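
For intuition about the `L2` metric chosen above, here is a minimal pure-Python sketch of a brute-force nearest-neighbor ranking. It assumes, as the Milvus documentation describes, that `L2` scores are reported as squared Euclidean distances (no square root is taken). The toy vectors and the `squared_l2` and `rank_by_l2` helpers are hypothetical and are not part of pymilvus.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def rank_by_l2(query, vectors, limit):
    """Return (index, distance) pairs for the `limit` closest vectors."""
    scored = [(i, squared_l2(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])  # smaller distance = closer match
    return scored[:limit]


# Toy 2-dimensional "collection"; the real one stores 768-dimensional vectors.
stored = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]]
nearest = rank_by_l2([1.0, 2.0], stored, limit=2)
print(nearest)
```

An `IVF_FLAT` index avoids this exhaustive scan by grouping vectors into `nlist` clusters and comparing the query only against vectors in the closest clusters.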

## Insert Data
Once the collection is set up, we can start inserting data. This is done in three steps: tokenizing the original question, embedding the tokenized question, and inserting the embedding, original question, and answer. In our example we use Hugging Face Datasets to load the dataset and then feed it into a Towhee pipeline for embedding and inserting.

In [7]:
data_dataset = load_dataset(DATASET, split='all')
data_dataset = data_dataset.train_test_split(test_size=INSERT_RATIO)['test']
# Clean up the data structure in the dataset.
data_dataset = data_dataset.map(lambda val: {'answer': val['answers']['text'][0]}, remove_columns=['answers'])

100%|██████████| 99/99 [00:00<00:00, 2055.91ex/s]
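
The progress bar above reports 99 examples. The sketch below shows where that number plausibly comes from. It rests on two assumptions that this notebook does not state: that `train_test_split` rounds a fractional `test_size` up, as scikit-learn does, and that SQuAD v1.1 contains 87,599 training and 10,570 validation examples. The `expected_subset_size` helper is hypothetical.

```python
import math


def expected_subset_size(total_rows, ratio):
    """Rows kept when a fractional split size is rounded up."""
    return math.ceil(total_rows * ratio)


# Assumed SQuAD v1.1 split sizes; split='all' concatenates both splits.
SQUAD_TRAIN_ROWS = 87_599
SQUAD_VALIDATION_ROWS = 10_570

total = SQUAD_TRAIN_ROWS + SQUAD_VALIDATION_ROWS
subset = expected_subset_size(total, 0.001)
print(subset)
```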


In [8]:
# The inserting pipeline
p = (
    pipe.input('question', 'answer')
        # Embed the question with the model set in the parameters
        .map('question', 'embedding', ops.text_embedding.transformers(model_name=MODEL))
        # Keep only the first token's vector (the [CLS] embedding)
        .map('embedding', 'cls_embedding', lambda x: x[0])
        # Insert the question, answer, and embedding into Milvus
        .map(
            ('question', 'answer', 'cls_embedding'),
            (),
            ops.ann_insert.milvus_client(
                host=HOST,
                port=PORT,
                collection_name=COLLECTION_NAME,
            )
        )
        .output()
)

for x in data_dataset:
    p(x['question'], x['answer'])
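
The `lambda x: x[0]` step in the pipeline keeps only the first row of the embedding operator's output. Here is a minimal sketch of that selection, assuming the operator returns one vector per token with the `[CLS]` token first (the usual BERT layout). The tiny 4-dimensional matrix and the `take_cls` helper are hypothetical stand-ins for the real 768-dimensional output.

```python
def take_cls(token_embeddings, expected_dim):
    """Return the first token's vector after checking its dimension."""
    if not token_embeddings:
        raise ValueError("no token embeddings were produced")
    cls_vector = token_embeddings[0]
    if len(cls_vector) != expected_dim:
        raise ValueError(
            f"expected dimension {expected_dim}, got {len(cls_vector)}"
        )
    return cls_vector


# One row per token; values are made up for illustration.
token_embeddings = [
    [0.1, 0.2, 0.3, 0.4],  # [CLS]
    [0.5, 0.6, 0.7, 0.8],  # "who"
    [0.9, 1.0, 1.1, 1.2],  # [SEP]
]
cls_embedding = take_cls(token_embeddings, expected_dim=4)
print(cls_embedding)
```

A check like this would surface a mismatch between `MODEL` and `DIMENSION` before the vector reaches the collection.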



In [10]:
# The searching pipeline
search_p = (
    pipe.input('question')
        # Embed the query with the same model used at insert time
        .map('question', 'embedding', ops.text_embedding.transformers(model_name=MODEL))
        .map('embedding', 'cls_embedding', lambda x: x[0])
        # Each Milvus hit becomes its own output row
        .flat_map(
            'cls_embedding',
            ('id', 'score', 'original_question', 'answer'),
            ops.ann_search.milvus_client(
                host=HOST,
                port=PORT,
                collection_name=COLLECTION_NAME,
                output_fields=['original_question', 'answer'],
                limit=LIMIT,
            )
        )
        .output('answer', 'score', 'original_question')
)

search_questions = ['What was Orsini known for?', 'What does the finding of gold cause?']
for x in search_questions:
    res = search_p(x)
    print()
    print('Question:')
    print(x)
    print('Answer, Distance, Original Question:')
    while True:
        answer = res.get()
        if answer:
            print(answer)
        else:
            break
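
The loop above drains the result queue until `get()` returns nothing. Below is a minimal sketch of the same pattern against a hypothetical `FakeResultQueue` stand-in (not Towhee's own class), assuming `get()` returns `None` once the queue is exhausted. The rows are abbreviated, rounded copies of the search output, used only for illustration.

```python
from collections import deque


class FakeResultQueue:
    """Hypothetical stand-in for a pipeline result queue."""

    def __init__(self, rows):
        self._rows = deque(rows)

    def get(self):
        # Return the next row, or None when nothing is left.
        return self._rows.popleft() if self._rows else None


def drain(queue):
    """Collect rows until the queue signals exhaustion with None."""
    rows = []
    while (row := queue.get()) is not None:
        rows.append(row)
    return rows


results = drain(FakeResultQueue([
    ['Napoleon III', 13.18, 'Who did Orsini try to assassinate?'],
    ['1865', 25.82, 'When did Palmerston die?'],
]))
for answer, distance, original_question in results:
    print(answer, distance, original_question)
```

Comparing against `None` explicitly, rather than relying on truthiness, keeps the loop from stopping early if a row happens to be empty.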


Question:
What was Orsini known for?
Answer, Distance, Original Question:
['Napoleon III', 13.175783157348633, 'Who did Orsini try to assassinate?']
['retracted his decision', 24.993133544921875, 'What did Nasser do after mass demonstrations?']
['1865', 25.8225154876709, 'When did Palmerston die?']
['US$320,000,000', 26.799659729003906, 'How much money did Nasser spend on weapons?']
['bid for statehood', 27.64345932006836, 'Why was this constitutional convention held?']
['Oba Kosoko', 27.844146728515625, 'Which Lagos king had supported the slave trade?']
['2001', 28.558818817138672, 'When was the Doha Declaration adopted?']
['James River and Kanawha Canal', 29.44527816772461, 'What man-made body of water was designed in part by George Washington?']
['in a state of self-sacrifice', 29.550865173339844, 'How did the Cathars live?']
['1748 with the signing of the Treaty of Aix-la-Chapelle', 29.62041473388672, 'What was the end of the War of the Austrian Succession?']

Question:
What does 