# Question Answering System
This example walks through the code used to build a question answering system. It uses a modified BERT model to extract features from questions and Milvus to search for similar questions and answers.

## Data
This example uses the [InsuranceQA Corpus](https://github.com/shuzi/insuranceQA) dataset, which contains 27,413 answers totaling 3,065,492 running words.

Download location: https://github.com/chatopera/insuranceqa-corpus-zh/tree/release/corpus/pairs

In this example, we use a small subset of the dataset that contains 100 question-answer pairs. It can be found under the **data** directory.

## Requirements


| Packages              | Servers      |
| --------------------- | ------------ |
| pymilvus              | milvus-1.1.0 |
| sentence_transformers | postgres     |
| psycopg2              |              |
| pandas                |              |
| numpy                 |              |

The included `requirements.txt` file lists the required packages.


## Up and Running

### Installing Packages
Install the required Python packages with `requirements.txt`.

In [1]:
pip install -r requirements.txt

Collecting sentence_transformers
  Using cached sentence_transformers-1.1.1-py3-none-any.whl
Collecting pymilvus
  Using cached pymilvus-1.1.0-py3-none-any.whl (56 kB)
Collecting psycopg2
  Using cached psycopg2-2.8.6-cp38-cp38-macosx_10_9_x86_64.whl
Collecting pandas
  Using cached pandas-1.2.4-cp38-cp38-macosx_10_9_x86_64.whl (10.5 MB)
Collecting numpy
  Downloading numpy-1.20.3-cp38-cp38-macosx_10_9_x86_64.whl (16.0 MB)
Collecting grpcio>=1.22.0
  Downloading grpcio-1.37.1-cp38-cp38-macosx_10_10_x86_64.whl (3.9 MB)
Collecting grpcio-tools>=1.22.0
  Downloading grpcio_tools-1.37.1-cp38-cp38-macosx_10_10_x86_64.whl (2.0 MB)
Collecting ujson>=2.0.0
  Using cached ujson-4.0.2-cp38-cp38-macosx_10_14_x86_64.whl (45 kB)
Collecting protobuf<4.0dev,>=3.5.0.post1
  Down

### Starting Milvus Server

This demo uses Milvus 1.1.0. Please refer to the [Install Milvus](https://milvus.io/docs/v1.1.0/install_milvus.md) guide to learn how to use this Docker container. For this example we won't be mapping any local volumes.

In [2]:
! docker run --name milvus_cpu_1.1.0 -d \
-p 19530:19530 \
-p 19121:19121 \
milvusdb/milvus:1.1.0-cpu-d050721-5e559c

docker: Error response from daemon: Conflict. The container name "/milvus_cpu_1.1.0" is already in use by container "1687960e49c352988ffbe1e5c91b6976545c51b5f09e0e99c8f62a51d4a6f26e". You have to remove (or rename) that container to be able to reuse that name.
See 'docker run --help'.


### Starting Postgres Server
For now, Milvus doesn't support storing string data. Thus, we need a relational database to store questions and answers. In this example, we use [PostgreSQL](https://www.postgresql.org/).

In [3]:
! docker run --name postgres -d  -p 5432:5432 -e POSTGRES_HOST_AUTH_METHOD=trust postgres

docker: Error response from daemon: Conflict. The container name "/postgres" is already in use by container "244c288ae2317ab906211bda102ab7c849a7840bbd781480926fa571b9fdee79". You have to remove (or rename) that container to be able to reuse that name.
See 'docker run --help'.


### Confirm Running Servers

In [4]:
! docker logs milvus_cpu_1.1.0


    __  _________ _   ____  ______    
   /  |/  /  _/ /| | / / / / / __/    
  / /|_/ // // /_| |/ / /_/ /\ \    
 /_/  /_/___/____/___/\____/___/     

Welcome to use Milvus!
Milvus Release version: v1.1.0, built at 2021-05-06 14:50.43, with OpenBLAS library.
You are using Milvus CPU edition
Last commit id: 5e559cd7918297bcdb55985b80567cb6278074dd

Loading configuration from: /var/lib/milvus/conf/server_config.yaml
WARNNING: You are using SQLite as the meta data management, which can't be used in production. Please change it to MySQL!
Supported CPU instruction sets: avx2, sse4_2
FAISS hook AVX2
Milvus server started successfully!


In [5]:
! docker logs postgres --tail 6

2021-05-20 19:03:54.900 UTC [1] LOG:  starting PostgreSQL 13.2 (Debian 13.2-1.pgdg100+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.3.0-6) 8.3.0, 64-bit
2021-05-20 19:03:54.900 UTC [1] LOG:  listening on IPv4 address "0.0.0.0", port 5432
2021-05-20 19:03:54.900 UTC [1] LOG:  listening on IPv6 address "::", port 5432
2021-05-20 19:03:54.903 UTC [1] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2021-05-20 19:03:54.907 UTC [68] LOG:  database system was shut down at 2021-05-20 19:03:54 UTC
2021-05-20 19:03:54.911 UTC [1] LOG:  database system is ready to accept connections


## Code Overview

### Connecting to Servers
We first start off by connecting to the servers. In this case the docker containers are running on localhost and the ports are the default ports. 

In [6]:
# Connect to Milvus and PostgreSQL
import milvus
import psycopg2

milv = milvus.Milvus(host='localhost', port='19530')
conn = psycopg2.connect(host='localhost', port='5432', user='postgres', password='postgres')
cursor = conn.cursor()

### Creating Collection and Setting Index
#### 1. Creating the Collection  
A collection in Milvus is similar to a table in a relational database, and is used for storing all the vectors.  
The required parameters for creating a collection are as follows:  
- `collection_name`: the name of a collection.  
- `dimension`: the dimension of the stored vectors. BERT generates 768-dimensional vectors.  
- `index_file_size`: how large each data segment will be within the collection.      
- `metric_type`: the distance formula being used to calculate similarity. In this example we are using Inner product (IP).

In [30]:
TABLE_NAME = 'question_answering'

# Delete any previously stored collection for a clean run
milv.drop_collection(TABLE_NAME)


collection_param = {
            'collection_name': TABLE_NAME,
            'dimension': 768,
            'index_file_size': 1024,  
            'metric_type': milvus.MetricType.IP 
            }

status = milv.create_collection(collection_param)
print(status)

Status(code=0, message='Create collection successfully!')


#### 2. Setting an Index
After creating the collection we want to assign it an index type. This can be done before or after inserting the data. When done before, indexes are built as data comes in and fills the data segments. In this example we use IVF_FLAT, which requires the `nlist` parameter. Each index type carries its own parameters. More info about this parameter can be found [here](https://milvus.io/docs/v1.1.0/index.md#CPU).

In [31]:
param = {'nlist': 40}
status = milv.create_index(TABLE_NAME, milvus.IndexType.IVF_FLAT, param)
print(status)

Status(code=0, message='Build index successfully!')
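To make the role of `nlist` more concrete, here is a toy sketch of an IVF-style search in plain Python. It is an illustration under simplifying assumptions, not Milvus code: the centroids are given by hand (a real IVF index learns them, typically with k-means), and the helper names `build_buckets` and `ivf_search` are hypothetical. Vectors are grouped into `nlist` buckets, and a query scans only the `nprobe` buckets whose centroids score highest.

```python
# Toy IVF-style search with inner-product scoring (illustrative only).

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def build_buckets(vectors, centroids):
    """Assign each vector index to the centroid with the highest inner product."""
    buckets = [[] for _ in centroids]
    for idx, vec in enumerate(vectors):
        best = max(range(len(centroids)), key=lambda c: dot(vec, centroids[c]))
        buckets[best].append(idx)
    return buckets

def ivf_search(query, vectors, centroids, buckets, nprobe, top_k):
    """Scan only the nprobe buckets whose centroids score highest for the query."""
    ranked = sorted(range(len(centroids)),
                    key=lambda c: dot(query, centroids[c]), reverse=True)
    candidates = [i for c in ranked[:nprobe] for i in buckets[c]]
    scored = sorted(((dot(query, vectors[i]), i) for i in candidates), reverse=True)
    return [i for _, i in scored[:top_k]]

# Hypothetical 2-D data: two small clusters near the x and y axes.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
centroids = [[1.0, 0.0], [0.0, 1.0]]  # nlist = 2, chosen by hand
buckets = build_buckets(vectors, centroids)

query = [0.2, 0.8]
print(ivf_search(query, vectors, centroids, buckets, nprobe=1, top_k=2))
print(ivf_search(query, vectors, centroids, buckets, nprobe=2, top_k=4))
```

With `nprobe` equal to `nlist`, every bucket is scanned, so the result matches an exhaustive search. Smaller `nprobe` values trade recall for speed.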


### Creating Table in Postgres  
PostgreSQL will be used to store the Milvus ID and its corresponding question-answer combo.

In [32]:
# Delete any previously stored table for a clean run
drop_table = "DROP TABLE IF EXISTS " + TABLE_NAME
cursor.execute(drop_table)
conn.commit()

try:
    sql = "CREATE TABLE if not exists " + TABLE_NAME + " (ids bigint, question text, answer text);"
    cursor.execute(sql)
    conn.commit()
    print("create postgres table successfully!")
except Exception as e:
    print("can't create a postgres table: ", e)

create postgres table successfully!


### Processing and Storing QA Dataset
#### 1. Generating Embeddings
In this example we use the `sentence_transformers` library to encode the sentences into vectors. This library uses a modified BERT model to generate the embeddings, and here we use a model pretrained using Microsoft's `mpnet`. More info can be found [here](https://www.sbert.net/docs/pretrained_models.html#sentence-embedding-models).

In [33]:
from sentence_transformers import SentenceTransformer
import pandas as pd
from sklearn.preprocessing import normalize

model = SentenceTransformer('paraphrase-mpnet-base-v2')

# Get questions and answers.
data = pd.read_csv('data/example.csv')
question_data = data['question'].tolist()
answer_data = data['answer'].tolist()

sentence_embeddings = model.encode(question_data)
sentence_embeddings = normalize(sentence_embeddings)

You try to use a model that was created with version 1.2.0, however, your version is 1.1.1. This might cause unexpected behavior or errors. In that case, try to update to the latest version.
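The cell above L2-normalizes the embeddings before they are stored. A likely reason, given that the collection uses the inner product (IP) metric, is that the inner product of two unit-length vectors equals their cosine similarity. The sketch below checks this on two hypothetical 2-D vectors using only the standard library; the helper names are illustrative and the vectors are assumed to be non-zero.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def cosine(a, b):
    """Cosine similarity computed directly from the unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Hypothetical example vectors.
a = [3.0, 4.0]
b = [4.0, 3.0]

ip_normalized = dot(l2_normalize(a), l2_normalize(b))
print(ip_normalized, cosine(a, b))
```

Both printed values agree up to floating-point rounding, which is why the scores returned by an IP search on normalized vectors can be read as cosine similarities.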





#### 2. Inserting Vectors into Milvus
Since this example dataset contains only 100 vectors, we are inserting all of them as one batch insert.

In [34]:
status, ids = milv.insert(collection_name=TABLE_NAME, records=sentence_embeddings)
print(status)

Status(code=0, message='Add vectors successfully!')
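For a larger dataset, a single insert call may not be practical. One possible approach, sketched here as an assumption rather than as part of this example, is to split the embeddings into fixed-size chunks and insert them one chunk at a time. The `batches` helper and the batch size are hypothetical, and a plain list stands in for the embeddings so the snippet runs without a Milvus server.

```python
def batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Hypothetical stand-in for the embedding list. Real code would pass each
# chunk to milv.insert(...) and collect the returned IDs.
fake_embeddings = [[float(i)] for i in range(10)]
sizes = [len(chunk) for chunk in batches(fake_embeddings, 4)]
print(sizes)
```

The IDs returned for each chunk would need to be concatenated in order so that they still line up with the question and answer lists.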


#### 3. Inserting IDs and Question-Answer Combos into PostgreSQL
To transfer the data into Postgres, we create a new file that combines all the data into a readable format. Once created, we pass this file to the Postgres server through STDIN, because the Postgres container does not have access to the local file.

In [35]:
import csv
import os

def record_temp_csv(fname, ids, answer, question):
    # Write a header row followed by one "id|question|answer" row per record.
    # csv.writer quotes any field that contains the delimiter, quotes, or newlines.
    with open(fname, 'w', newline='') as f:
        writer = csv.writer(f, delimiter='|')
        writer.writerow(['ids', 'question', 'answer'])
        for i in range(len(ids)):
            writer.writerow([ids[i], question[i], answer[i]])

def copy_data_to_pg(table_name, fname, conn, cur):
    fname = os.path.join(os.getcwd(), fname)
    try:
        # HEADER makes COPY skip the first line of the file, so the file must start with a header row.
        sql = "COPY " + table_name + " FROM STDIN DELIMITER '|' CSV HEADER"
        with open(fname, "r") as f:
            cur.copy_expert(sql, f)
        conn.commit()
        print("Inserted into Postgres successfully!")
    except Exception as e:
        print("Copy data into Postgres failed: ", e)

DATA_WITH_IDS = 'data/test.csv'

record_temp_csv(DATA_WITH_IDS, ids, answer_data, question_data)
copy_data_to_pg(TABLE_NAME, DATA_WITH_IDS, conn, cursor)

Inserted into Postgres successfully!
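Question or answer text can itself contain the `|` delimiter or double quotes, which would corrupt a file built by plain string concatenation. The sketch below shows, on two hypothetical rows, how Python's `csv` module quotes such fields so that a CSV parser can recover them. It writes to an in-memory buffer instead of a file, and it includes a header row because the `HEADER` option makes `COPY` skip the first line of its input.

```python
import csv
import io

# Hypothetical rows; the first one contains the delimiter and double quotes.
rows = [
    (1, "Is A|B covered?", 'They said "yes".'),
    (2, "Plain question", "Plain answer"),
]

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="|")
writer.writerow(["ids", "question", "answer"])  # header row
writer.writerows(rows)

text = buffer.getvalue()
print(text)

# Reading the text back recovers the original field values.
parsed = list(csv.reader(io.StringIO(text), delimiter="|"))
print(parsed[1])
```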


### Search
#### 1. Processing Query
When searching for a question, we first put the question through the same model to generate an embedding. Then, with that embedding vector, we can search for similar embeddings in Milvus.


In [55]:
SEARCH_PARAM = {'nprobe': 40}

question = "What is AAA?"

# Encode the question, reshape it into a single-row 2-D array, and L2-normalize it
embed = model.encode(question)
embed = embed.reshape(1, -1)
embed = normalize(embed)
query_embeddings = embed.tolist()

status, results = milv.search(collection_name=TABLE_NAME, query_records=query_embeddings, top_k=5, params=SEARCH_PARAM)


#### 2. Getting the Similar Questions
The collection may not contain any question similar to the given one. We therefore set a threshold, here 0.5: when the best similarity score retrieved (reported by Milvus as `distance`) falls below this value, the system reports that it has no relevant question. We then use the result IDs to pull the similar questions from the Postgres server and print them with their corresponding similarity scores.

In [57]:
similar_questions = []

if results[0][0].distance < 0.5:
    print("There are no similar questions in the database, here are the closest matches:")
else:
    print("There are similar questions in the database, here are the closest matches: ")
    
for result in results[0]:
    sql = "select question from " + TABLE_NAME + " where ids = " + str(result.id) + ";"
    cursor.execute(sql)
    rows=cursor.fetchall()
    if len(rows):
        similar_questions.append((rows[0][0], result.distance))
        print((rows[0][0], result.distance))

There are similar questions in the database, here are the closest matches: 
('What  Does  AAA  Home  Insurance  Cover?', 0.5728842616081238)
('What  Does  Credit  Have  To  Do  With  Auto  Insurance?', 0.4042107164859772)
('Is  Car  Insurance  Prepaid?', 0.31805452704429626)
('Does  AARP  Have  Long  Term  Care  Insurance?', 0.30669423937797546)
('Is  Car  Insurance  Credit  Checked?', 0.3045981228351593)


#### 3. Getting the Answer
After getting a list of similar questions, choose the one that you feel is closest to yours. Then you can use that question to find the corresponding answer in Postgres.

In [61]:
# Pass the question as a query parameter so quotes in the text cannot break the SQL
sql = "select answer from " + TABLE_NAME + " where question = %s;"
cursor.execute(sql, (similar_questions[0][0],))
rows=cursor.fetchall()
print("Question:")
print(question)
print("Answer:")
print(rows[0][0])

Question:
What is AAA?
Answer:
 AAA Home insurance, like all other major carriers, covers a wide variety of claims, including fire, theft, vandalism, and many other items. However, there are numerous types of policies offered, so it is best to determine the type of policy you have to accurately understand all of the benefits. An experienced broker can help.
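Looking up an answer by question text is safer with a parameterized query, because a question that contains an apostrophe would otherwise end the SQL string literal early. The sketch below demonstrates the idea with the standard library's `sqlite3` module as a stand-in for Postgres, so it runs without a server. The row contents are hypothetical. Note that `sqlite3` uses `?` as its placeholder, whereas `psycopg2` uses `%s`.

```python
import sqlite3

# In-memory SQLite database standing in for the Postgres table.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE question_answering (ids INTEGER, question TEXT, answer TEXT)")
cur.execute(
    "INSERT INTO question_answering VALUES (?, ?, ?)",
    (1, "What's covered by renters insurance?", "Hypothetical answer text."),
)
conn.commit()

# The apostrophe in the question is handled by the driver, not by string concatenation.
question = "What's covered by renters insurance?"
cur.execute("SELECT answer FROM question_answering WHERE question = ?", (question,))
row = cur.fetchone()
print(row[0])
```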
