# Question Answering System
In this example we will be going over the code used to build a question answering system. This example uses BERT to extract features from questions and Milvus to search for similar questions and answers.

## Data
This example uses the [InsuranceQA Corpus](https://github.com/shuzi/insuranceQA) dataset, which contains 27,413 answers totaling 3,065,492 running words.

Download location: https://github.com/chatopera/insuranceqa-corpus-zh/tree/release/corpus/pairs

In this example, we use a small dataset containing 100 question-answer pairs, which you can find under the **data** directory.

## Requirements


| Packages     | Servers         |
| -            | -               |
| pymilvus     | milvus-1.1.0    |
| bert_serving | postgres        |
| psycopg2     | bert-as-service |
| pandas       |                 |
| numpy        |                 |

We have included a `requirements.txt` file that lists the required packages.


## Up and Running

### Install Requirements
Install the required Python packages with `requirements.txt`.

In [1]:
pip install -r requirements.txt

ERROR: Could not find a version that satisfies the requirement tensorflow==1.13.0rc1
ERROR: No matching distribution found for tensorflow==1.13.0rc1
Note: you may need to restart the kernel to use updated packages.


### Start Milvus Server

This demo uses Milvus 1.1.0. Please refer to the [Install Milvus](https://milvus.io/docs/v1.1.0/install_milvus.md) guide to learn how to use this Docker container. For this example we won't be mapping any local volumes.

In [2]:
! docker run -d \
-p 19530:19530 \
-p 19121:19121 \
milvusdb/milvus:1.1.0-cpu-d050721-5e559c

baac6a3535071632e275085d8b12c41f581a2a52e6990d4072b0727c945bbf13


### Start Postgres Server
Milvus does not currently support storing string data, so we need a relational database to store the questions and answers. In this example, we use [PostgreSQL](https://www.postgresql.org/).

In [3]:
! docker run  -d  -p 5432:5432 postgres

8b47266fdb7d2ac3af752215b20de3fb67e5df5ca5b9f89295d586a24efdadbb


### Start BERT Server
#### 1. Download model

In [None]:
! wget -P model https://storage.googleapis.com/bert_models/2018_10_18/cased_L-12_H-768_A-12.zip
! unzip model/cased_L-12_H-768_A-12.zip -d model/

#### 2. Start BERT server

In [5]:
import subprocess
subp = subprocess.Popen('nohup bert-serving-start -model_dir model/cased_L-12_H-768_A-12/ -num_worker=2 -max_seq_len=40 &', shell=True)
subp.wait(2)
if subp.poll() == 0:
    print("Successfully!")
else:
    print("Failed!")

Successfully!


## Code Overview

### Connecting to Servers
We first start off by connecting to the servers. In this case the docker containers are running on localhost and the ports are the default ports. 

In [None]:
# Connect to Milvus, BERT and PostgreSQL

from milvus import Milvus, IndexType, MetricType, Status
from bert_serving.client import BertClient
import psycopg2

# TODO: Change USER_NAME to your computer user name.
USER_NAME = 'mialee'

milvus = Milvus(host = '127.0.0.1', port = 19530)
bc = BertClient(ip='127.0.0.1', port=5555, check_length=False)
conn = psycopg2.connect(host='localhost', port='5432', user=USER_NAME, password='', database=USER_NAME)
cursor = conn.cursor()

### Create collection and set index
#### 1. Create a collection  
A collection in Milvus is similar to a table in a relational database, and is used for storing all the vectors.  
Required parameters for creating a collection:  
- `collection_name`: the name of a collection.  
- `dimension`: BERT generates 768-dimensional vectors.  
- `index_file_size`: how large each data segment will be within the collection.      
- `metric_type`: the distance formula being used to calculate similarity. In this example we are using Inner product (IP).

In [None]:
TABLE_NAME = 'chatbot'

collection_param = {
            'collection_name': TABLE_NAME,
            'dimension': 768,
            'index_file_size': 1024,  
            'metric_type': MetricType.IP 
            }

status = milvus.create_collection(collection_param)
print(status)

#### 2. Set an index
After creating the collection we want to assign it an index type. This can be done before or after inserting the data. When done before, indexes are built as data comes in and fills the data segments. In this example we use IVF_FLAT, which requires the `nlist` parameter. Each index type carries its own parameters. More information about this parameter can be found [here](https://milvus.io/docs/v1.1.0/index.md#CPU).

In [None]:
param = {'nlist': 16384}
status = milvus.create_index(TABLE_NAME, IndexType.IVF_FLAT, param)
print(status)
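
To make the role of `nlist` (and of the `nprobe` search parameter used later) more concrete, here is a toy NumPy sketch of the inverted-file idea. It is not how Milvus implements IVF_FLAT: real implementations train centroids with k-means, while this sketch simply picks random vectors as centroids. The function names `build_ivf` and `ivf_search` and the random data are hypothetical.

```python
import numpy as np

def build_ivf(vectors, nlist, seed=0):
    """Partition vectors into nlist buckets around randomly chosen centroids."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), size=nlist, replace=False)]
    # Assign every vector to the centroid with the largest inner product.
    assignment = np.argmax(vectors @ centroids.T, axis=1)
    buckets = [np.flatnonzero(assignment == c) for c in range(nlist)]
    return centroids, buckets

def ivf_search(query, vectors, centroids, buckets, nprobe, top_k):
    """Scan only the nprobe buckets whose centroids score highest for the query."""
    probe = np.argsort(-(centroids @ query))[:nprobe]
    candidates = np.concatenate([buckets[c] for c in probe])
    scores = vectors[candidates] @ query
    order = np.argsort(-scores)[:top_k]
    return candidates[order], scores[order]

# Random unit vectors standing in for normalized BERT embeddings (assumption).
rng = np.random.default_rng(42)
data = rng.normal(size=(200, 8))
data /= np.linalg.norm(data, axis=1, keepdims=True)
query = data[0]

centroids, buckets = build_ivf(data, nlist=10)
# Probing 2 of 10 buckets is cheaper but may miss some true neighbors.
ids_fast, _ = ivf_search(query, data, centroids, buckets, nprobe=2, top_k=5)
# Probing every bucket is equivalent to an exhaustive search.
ids_full, scores_full = ivf_search(query, data, centroids, buckets, nprobe=10, top_k=5)
print(ids_fast, ids_full)
```

In this sketch, a larger `nprobe` scans more buckets, trading speed for recall. When `nprobe` equals `nlist`, every vector is scanned and the result matches a brute-force search.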

### Create table in PostgreSQL
PostgreSQL will be used to store each Milvus ID along with its corresponding question and answer.

In [None]:
try:
    sql = "CREATE TABLE if not exists " + TABLE_NAME + " (ids bigint, question text, answer text);"
    cursor.execute(sql)
    conn.commit()
    print("create postgres table successfully!")
except Exception as e:
    print("can't create a postgres table: ", e)

### Process and Store QA dataset
#### 1. Generate embeddings
Use BERT to convert questions into vectors. Then normalize and import vectors into Milvus. 

In [None]:
import pandas as pd
from functools import reduce
import numpy as np

# Get questions and answers.
data = pd.read_csv('data/example.csv')
question_data = data['question'].tolist()
answer_data = data['answer'].tolist()

# Convert questions to embeddings and normalize them.
def normaliz_vec(vec_list):
    question_vec = []
    for vec in vec_list:
        square_sum = reduce(lambda x,y:x+y, map(lambda x:x*x ,vec))
        sqrt_square_sum = np.sqrt(square_sum)
        coef = 1/sqrt_square_sum
        vec = list(map(lambda x:x*coef, vec))
        question_vec.append(vec)
    return question_vec

question_vec = bc.encode(question_data)
question_norm_vec = normaliz_vec(question_vec)
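
Normalization matters here because the collection uses the inner product metric: once every vector has unit length, the inner product of two vectors equals their cosine similarity. The sketch below shows a vectorized NumPy alternative to the loop above and checks that property on toy data. The name `normalize_rows` and the 3-dimensional toy vectors are assumptions for illustration, not part of the original notebook.

```python
import numpy as np

def normalize_rows(vectors):
    """L2-normalize each row; a hypothetical vectorized stand-in for normaliz_vec."""
    arr = np.asarray(vectors, dtype=np.float64)
    norms = np.linalg.norm(arr, axis=1, keepdims=True)
    # Leave all-zero rows unchanged instead of dividing by zero.
    norms[norms == 0] = 1.0
    return arr / norms

# Toy 3-dimensional vectors standing in for 768-dimensional BERT embeddings.
toy = np.array([[3.0, 4.0, 0.0], [1.0, 1.0, 1.0]])
unit = normalize_rows(toy)

# Cosine similarity computed on the raw vectors.
cosine = toy[0] @ toy[1] / (np.linalg.norm(toy[0]) * np.linalg.norm(toy[1]))
# Inner product computed on the normalized vectors.
inner = unit[0] @ unit[1]

print(np.allclose(np.linalg.norm(unit, axis=1), 1.0), np.isclose(inner, cosine))
```

If a list of lists is needed, as in the search step later in this notebook, call `.tolist()` on the returned array.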

#### 2. Insert vectors in Milvus

In [None]:
status, ids = milvus.insert(collection_name=TABLE_NAME, records=question_norm_vec)

status

#### 3. Import IDs, questions and answers in PostgreSQL
Next, import the generated (or specified) IDs, together with their corresponding questions and answers, into PostgreSQL.

In [None]:
import os 

def record_temp_csv(fname, ids, answer, question):
    with open(fname,'w') as f:
        for i in range(len(ids)):
            line = str(ids[i]) + "|" + question[i] + "|" + answer[i] + "\n"
            f.write(line)

def copy_data_to_pg(table_name, fname, conn, cur):
    fname = os.path.join(os.getcwd(),fname)
    sql = "copy " + table_name + " from '" + fname + "' with CSV delimiter '|';"
    try:
        # Use the cursor passed in as `cur` rather than the global `cursor`.
        cur.execute(sql)
        conn.commit()
        print("insert to pg successfully!")
    except Exception as e:
        print("copy data to postgres failed: ", e)
        
DATA_WITH_IDS = 'data/example_with_ids.csv'   

record_temp_csv(DATA_WITH_IDS, ids, answer_data, question_data)
copy_data_to_pg(TABLE_NAME, DATA_WITH_IDS, conn, cursor)
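
The plain string concatenation in `record_temp_csv` assumes that no question or answer contains the `|` delimiter, a double quote, or a newline. As a hedged alternative, the sketch below writes the same pipe-delimited layout with Python's `csv` module, which quotes such fields automatically. The helper name `write_rows` and the sample rows are hypothetical, and an in-memory buffer replaces the file so the round trip can be checked.

```python
import csv
import io

def write_rows(handle, ids, questions, answers):
    """Write id|question|answer rows, quoting any field that needs it."""
    writer = csv.writer(handle, delimiter='|', quoting=csv.QUOTE_MINIMAL,
                        lineterminator='\n')
    for row in zip(ids, questions, answers):
        writer.writerow(row)

# Made-up rows containing characters that would break naive concatenation.
buffer = io.StringIO()
write_rows(
    buffer,
    [1, 2],
    ['What is a "rider"?', 'Term | whole life?'],
    ['An add-on.', 'It depends.'],
)

# Read the data back to confirm every field survives intact.
buffer.seek(0)
rows = list(csv.reader(buffer, delimiter='|'))
print(rows)
```

To write to disk instead, pass a file opened with `open(fname, 'w', newline='')` as `handle`.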

### Search
#### 1. Process query question and search in Milvus
When searching for a question, we first pass it through the same BERT model to generate an embedding, then use that vector to search for similar embeddings in Milvus.  
The collection may not contain any question similar to the given one, so we set a threshold (0.8 here). When the highest similarity score retrieved is below this value, the system reports that it does not include a relevant question.

In [None]:
SEARCH_PARAM = {'nprobe': 32}

question = "Which life insurance is more recommended?"
vector = bc.encode([question])
vector_list = normaliz_vec(vector.tolist())
status, results = milvus.search(collection_name=TABLE_NAME, query_records=vector_list, top_k=5, params=SEARCH_PARAM)

if results[0][0].distance < 0.8:
    print("No similar questions in the database!")
else:
    print(results)
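
The cell above checks only the top hit against the threshold. As a variation, the sketch below keeps every hit whose score clears the threshold and also handles an empty result list. It works on plain `(id, score)` tuples so it can run without a Milvus server. The function name `pick_confident_hits` and the sample scores are made up for illustration.

```python
from typing import List, Tuple

def pick_confident_hits(hits: List[Tuple[int, float]],
                        threshold: float = 0.8) -> List[Tuple[int, float]]:
    """Keep hits whose inner-product score is at least the threshold, best first."""
    ranked = sorted(hits, key=lambda hit: hit[1], reverse=True)
    return [hit for hit in ranked if hit[1] >= threshold]

# Hypothetical (id, score) pairs; with the notebook's results these could be
# built as [(r.id, r.distance) for r in results[0]].
hits = [(57, 0.81), (101, 0.93), (8, 0.42)]
confident = pick_confident_hits(hits)

if confident:
    print(confident)
else:
    print("No similar questions in the database!")
```

Because the vectors are normalized and the metric is inner product, a higher score means a closer match, which is why the comparison keeps scores at or above the threshold.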

#### 2. Get similar questions
Query PostgreSQL for the questions that correspond to the returned IDs.

In [None]:
similar_questions = []
for result in results[0]:
    sql = "select question from " + TABLE_NAME + " where ids = " + str(result.id) + ";"
    cursor.execute(sql)
    rows=cursor.fetchall()
    if len(rows):
        similar_questions.append(rows[0][0])

print(similar_questions)

#### 3. Get the answer
After getting a list of similar questions, choose the one that you feel is closest to yours. Then you can use the question to search for the corresponding answer in PostgreSQL.

In [None]:
# Pass the question as a query parameter so quotes in the text cannot break the SQL.
sql = "select answer from " + TABLE_NAME + " where question = %s;"
cursor.execute(sql, (similar_questions[0],))
rows=cursor.fetchall()

print(rows[0][0])