# Text Search Engine
This example walks through the code used to build a text search engine. It uses a modified BERT model to convert text into vectors, stores those vectors in Milvus, and then queries Milvus to find texts similar to the user's input.

## Data

This example uses the English News dataset. We use a small subset of the dataset containing 180 title-text pairs, which can be found in the `data` directory.

## Requirements


| Packages              | Servers      |
| -                     | -            |
| pymilvus              | milvus-1.1.0 |
| sentence_transformers | postgres     |
| psycopg2              |              |
| pandas                |              |
| numpy                 |              |

We have included a `requirements.txt` file that lists the required packages.


## Up and Running

### Installing Packages
Install the required Python packages listed in `requirements.txt`.

In [21]:
pip install -r requirements.txt


Note: you may need to restart the kernel to use updated packages.


### Starting Milvus Server

This demo uses Milvus 1.1.0. Refer to the [Install Milvus](https://milvus.io/docs/v1.1.0/install_milvus.md) guide to learn how to use this Docker container. For this example we won't be mapping any local volumes; the container ports 19530 and 19121 are published on host ports 19533 and 19123.

In [2]:
! docker run --name milvus_cpu_1.1 -d \
-p 19533:19530 \
-p 19123:19121 \
milvusdb/milvus:1.1.0-cpu-d050721-5e559c

0ecbc23b01742c348a7af90edb06925bf96e3d3b269a8120b07b3de5ab4cbfff


### Starting Postgres Server
Milvus 1.1.0 doesn't support storing additional attributes alongside the vectors. Because of this, we use another service, in this case PostgreSQL, to store these attributes and search through them.

In [4]:
! docker run --name postgres -d  -p 5432:5432 -e POSTGRES_HOST_AUTH_METHOD=trust postgres

3623cf3503bae1a2e7a32dddd59862014bfdd09015fb5a74472a14a1b89e07cd


### Confirm Running Servers

In [5]:
! docker logs milvus_cpu_1.1


    __  _________ _   ____  ______    
   /  |/  /  _/ /| | / / / / / __/    
  / /|_/ // // /_| |/ / /_/ /\ \    
 /_/  /_/___/____/___/\____/___/     

Welcome to use Milvus!
Milvus Release version: v1.1.0, built at 2021-05-06 14:50.43, with OpenBLAS library.
You are using Milvus CPU edition
Last commit id: 5e559cd7918297bcdb55985b80567cb6278074dd

Loading configuration from: /var/lib/milvus/conf/server_config.yaml
WARNNING: You are using SQLite as the meta data management, which can't be used in production. Please change it to MySQL!
Supported CPU instruction sets: avx2, sse4_2
FAISS hook AVX2
Milvus server started successfully!


In [63]:
! docker logs postgres --tail 6

2021-05-24 03:47:24.245 UTC [1] LOG:  starting PostgreSQL 13.3 (Debian 13.3-1.pgdg100+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.3.0-6) 8.3.0, 64-bit
2021-05-24 03:47:24.245 UTC [1] LOG:  listening on IPv4 address "0.0.0.0", port 5432
2021-05-24 03:47:24.245 UTC [1] LOG:  listening on IPv6 address "::", port 5432
2021-05-24 03:47:24.252 UTC [1] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2021-05-24 03:47:24.259 UTC [67] LOG:  database system was shut down at 2021-05-24 03:47:24 UTC
2021-05-24 03:47:24.265 UTC [1] LOG:  database system is ready to accept connections
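
Besides reading the container logs, a quick TCP check can confirm that the published ports accept connections. The following is a minimal sketch, not part of the original notebook: the helper name `port_is_open` is hypothetical, and it assumes the host ports used in this example (19533 for Milvus, 5432 for Postgres). An open port only shows that something is listening, not that the service is healthy.

```python
import socket

def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False

# Host ports as published by the docker commands in this example (assumption).
for name, port in [("milvus", 19533), ("postgres", 5432)]:
    state = "open" if port_is_open("localhost", port) else "closed"
    print(f"{name} port {port}: {state}")
```
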


## Code Overview

### Connecting to Servers
We start by connecting to the servers. Both Docker containers run on localhost: Milvus is reached through the mapped host port 19533, and Postgres through its default port 5432.

In [11]:
# Connect to Milvus and PostgreSQL
import milvus
import psycopg2

milv = milvus.Milvus(host='localhost', port='19533')
conn = psycopg2.connect(host='localhost', port='5432', user='postgres', password='postgres')
cursor = conn.cursor()

### Creating Collection and Setting Index
#### 1. Creating the Collection    
A collection in Milvus is similar to a table in a relational database, and is used for storing all the vectors.  
We need to specify the parameters `collection_name`, `dimension`, `index_file_size`, and `metric_type` when creating it. In this case we store 768-dimensional vectors and use the inner product (IP) as the similarity metric.
The `index_file_size` of 1024 keeps the data segments at the default 1024 MB.

In [5]:
TABLE_NAME = 'title_text_2'

# Drop any previously stored collection for a clean run
milv.drop_collection(TABLE_NAME)


collection_param = {
            'collection_name': TABLE_NAME,
            'dimension': 768,
            'index_file_size': 1024,  
            'metric_type': milvus.MetricType.IP 
            }

status = milv.create_collection(collection_param)
print(status)

Status(code=0, message='Create collection successfully!')


#### 2. Setting an Index
After creating the collection we want to assign it an index type. This can be done before or after inserting the data. When done before, indexes will be made as data comes in and fills the data segments. In this example we are using IVF_SQ8 which requires the 'nlist' parameter.

In [6]:
param = {'nlist': 16384}
status = milv.create_index(TABLE_NAME, milvus.IndexType.IVF_SQ8, param)
print(status)

Status(code=0, message='Build index successfully!')
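
To give a feel for what `nlist` (here) and `nprobe` (used later at search time) control, the following toy sketch imitates an inverted-file (IVF) index in pure Python. It is an illustration under stated assumptions, not Milvus's implementation: the function names are hypothetical, the centroids are hand-picked rather than learned by clustering, and the 8-bit scalar quantization that IVF_SQ8 applies to stored vectors is left out.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def build_ivf(vectors, centroids):
    """Put each vector index into the bucket of its highest-scoring centroid."""
    buckets = {c: [] for c in range(len(centroids))}
    for idx, vec in enumerate(vectors):
        best = max(range(len(centroids)), key=lambda c: dot(vec, centroids[c]))
        buckets[best].append(idx)
    return buckets

def ivf_search(query, vectors, centroids, buckets, nprobe, top_k):
    """Scan only the nprobe buckets whose centroids score highest against the query."""
    ranked = sorted(range(len(centroids)),
                    key=lambda c: dot(query, centroids[c]), reverse=True)
    candidates = [i for c in ranked[:nprobe] for i in buckets[c]]
    scored = sorted(((dot(query, vectors[i]), i) for i in candidates), reverse=True)
    return [(i, score) for score, i in scored[:top_k]]

# Toy 2-D unit vectors; the two hand-picked centroids play the role of nlist = 2.
vectors = [(1.0, 0.0), (0.8, 0.6), (0.0, 1.0), (-0.6, 0.8)]
centroids = [(1.0, 0.0), (0.0, 1.0)]
buckets = build_ivf(vectors, centroids)

query = (0.6, 0.8)
narrow = ivf_search(query, vectors, centroids, buckets, nprobe=1, top_k=2)
wide = ivf_search(query, vectors, centroids, buckets, nprobe=2, top_k=2)
print([i for i, _ in narrow])  # [2, 3]
print([i for i, _ in wide])    # [1, 2]
```

With `nprobe=1` only one bucket is scanned, so vector 1, the true best match for this query, is missed; probing both buckets finds it. A larger `nprobe` trades speed for recall.
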


### Creating Table in Postgres  
PostgreSQL will store each Milvus ID together with its corresponding title and text.

In [12]:
# Drop any previously stored table for a clean run
drop_table = "DROP TABLE IF EXISTS " + TABLE_NAME
cursor.execute(drop_table)
conn.commit()

try:
    sql = "CREATE TABLE if not exists " + TABLE_NAME + " (ids bigint, title text, text text);"
    cursor.execute(sql)
    conn.commit()
    print("create postgres table successfully!")
except Exception as e:
    print("can't create a postgres table: ", e)

create postgres table successfully!


### Processing and Storing the News Data
#### 1. Generating Embeddings
In this example we use the `sentence_transformers` library to encode each title into a vector. This library uses a modified BERT model to generate the embeddings, and here we use the `paraphrase-mpnet-base-v2` model, which is built on Microsoft's `mpnet`. The embeddings are then L2-normalized so that the inner product used by the collection equals cosine similarity. More info can be found [here](https://www.sbert.net/docs/pretrained_models.html#sentence-embedding-models).

In [2]:
from sentence_transformers import SentenceTransformer
import pandas as pd
from sklearn.preprocessing import normalize

model = SentenceTransformer('paraphrase-mpnet-base-v2')
# Load the titles and their corresponding texts.
data = pd.read_csv('data/example.csv')
title_data = data['title'].tolist()
text_data = data['text'].tolist()

sentence_embeddings = model.encode(title_data)
sentence_embeddings = normalize(sentence_embeddings)

You try to use a model that was created with version 1.2.0, however, your version is 1.1.1. This might cause unexpected behavior or errors. In that case, try to update to the latest version.
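
The `normalize` call matters because the collection uses the inner product metric: for unit-length vectors, the inner product is the cosine similarity. The following pure-Python sketch illustrates this with two small made-up vectors; the helper names are hypothetical and stand in for what `sklearn.preprocessing.normalize` does row by row with its default L2 norm.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0 else [x / norm for x in vec]

def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return inner_product(a, b) / (norm_a * norm_b)

a, b = [3.0, 4.0], [4.0, 3.0]
ip_normalized = inner_product(l2_normalize(a), l2_normalize(b))
cos = cosine_similarity(a, b)

# Both values are 24/25 = 0.96, up to floating-point rounding.
print(ip_normalized, cos)
```
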





#### 2. Inserting Vectors into Milvus
Since this example dataset is small, we insert all of the vectors in a single batch.

In [7]:
status, ids = milv.insert(collection_name=TABLE_NAME, records=sentence_embeddings)
print(status)

Status(code=0, message='Add vectors successfully!')
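
For a larger dataset, a single insert call may become impractical, and the vectors could be sent in chunks instead. The helper below is a minimal sketch; `iter_batches` is a hypothetical name, and the batch size in the commented usage is an arbitrary assumption rather than a Milvus recommendation.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Hypothetical usage with the objects defined earlier in this example:
# all_ids = []
# for batch in iter_batches(sentence_embeddings, 10000):
#     status, batch_ids = milv.insert(collection_name=TABLE_NAME, records=batch)
#     all_ids.extend(batch_ids)

sizes = [len(batch) for batch in iter_batches(list(range(25)), 10)]
print(sizes)  # [10, 10, 5]
```
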


#### 3. Inserting IDs and Title-text into PostgreSQL
To transfer the data into Postgres, we write a new file that combines the Milvus IDs with their titles and texts, one record per line. Because the Postgres container has no access to the local file, we stream the file to the server through STDIN.

In [13]:
import csv
import os

def record_temp_csv(fname, ids, title, text):
    # csv.writer quotes any field that contains the delimiter, a quote, or a newline
    with open(fname, 'w', newline='') as f:
        writer = csv.writer(f, delimiter='|', lineterminator='\n')
        for row in zip(ids, title, text):
            writer.writerow(row)

def copy_data_to_pg(table_name, fname, conn, cur):
    fname = os.path.join(os.getcwd(), fname)
    try:
        # No HEADER option: the file has no header row, so HEADER would skip the first record
        sql = "COPY " + table_name + " FROM STDIN DELIMITER '|' CSV"
        with open(fname, "r", newline='') as f:
            cur.copy_expert(sql, f)
        conn.commit()
        print("Inserted into Postgres successfully!")
    except Exception as e:
        conn.rollback()  # leave the connection usable after a failed COPY
        print("Copy data into Postgres failed: ", e)

DATA_WITH_IDS = 'data/test.csv'

record_temp_csv(DATA_WITH_IDS, ids, title_data, text_data)
copy_data_to_pg(TABLE_NAME, DATA_WITH_IDS, conn, cursor)

Inserted into Postgres successfully!
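
Titles or texts that contain the `|` delimiter, double quotes, or line breaks need quoting to survive a CSV-format `COPY`. The sketch below shows how Python's `csv` module handles this for pipe-delimited rows. The function name and the sample rows are made up for illustration. Writing to an in-memory buffer is an assumption of this sketch; such a buffer could in principle be handed to `copy_expert` in place of a temporary file, since that call reads from a file-like object.

```python
import csv
import io

def rows_to_pipe_csv(rows):
    """Serialize rows as pipe-delimited CSV, quoting only the fields that need it."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter='|', lineterminator='\n')
    writer.writerows(rows)
    return buf.getvalue()

# Made-up sample rows: (id, title, text)
rows = [
    (1, 'Plain title', 'Plain text'),
    (2, 'Title with | pipe', 'Text with "quotes"'),
]
payload = rows_to_pipe_csv(rows)
print(payload)

# Reading with the same dialect recovers the original fields (as strings).
parsed = list(csv.reader(io.StringIO(payload), delimiter='|'))
print(parsed)
```
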


### Search
#### 1. Processing Query
When searching for a title, we first pass it through the same model to generate an embedding and normalize it in the same way as the stored vectors. With that embedding vector we can then search for similar embeddings in Milvus.

In [20]:
SEARCH_PARAM = {'nprobe': 64}

title = "Loosing the War on Terrorism"

# Encode the query title, reshape it to a single-row matrix, and L2-normalize it
embed = model.encode(title)
embed = embed.reshape(1,-1)
embed = normalize(embed)
query_embeddings = embed.tolist()


status, results = milv.search(collection_name=TABLE_NAME, query_records=query_embeddings, top_k=9, params=SEARCH_PARAM)
print(status)
print(status)

Status(code=0, message='Search vectors successfully!')


#### 2. Getting the Similar Titles
The dataset may not contain any title similar to the given one, so we set a threshold, here 0.5. When the score of the best match is below this value, a hint is printed that the database holds no closely related title. Because the vectors are normalized and the metric is the inner product, a higher score (reported by Milvus as `distance`) means a closer match. We then use the result IDs to pull the matching titles from the Postgres server and print them with their similarity scores.

In [16]:
similar_titles = []

if results[0][0].distance < 0.5:
    print("There are no similar titles in the database, here are the closest matches:")
else:
    print("There are similar titles in the database, here are the closest matches: ")

# Look up each hit's title by its Milvus ID; the ID is passed as a bound parameter
sql = "select title from " + TABLE_NAME + " where ids = %s;"
for result in results[0]:
    cursor.execute(sql, (result.id,))
    rows = cursor.fetchall()
    if len(rows):
        similar_titles.append((rows[0][0], result.distance))
        print((rows[0][0], result.distance))


There are similar titles in the database, here are the closest matches: 
('Loosing the War on Terrorism', 0.9999999403953552)
('Politics an Afterthought Amid Hurricane  ', 0.294342577457428)
('Kerry-Kerrey Confusion Trips Up Campaign  ', 0.2923681437969208)
('News: Sluggish movement on power grid cyber security', 0.2759988009929657)
('Promoting a Shared Vision', 0.2651735246181488)
('U.S. Brokers Cease-fire in Western Afghanistan', 0.2630968689918518)
('On front line of AIDS in Russia', 0.2530452609062195)
('Fresh Fighting Shatters Short-Lived Ceasefire Deal', 0.23172177374362946)
('Flop in the ninth inning sinks Jays', 0.21579097211360931)
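
The cell above applies the threshold only to the best match. A variation, sketched here as an assumption rather than as the notebook's method, is to apply the threshold to every hit and separate confident matches from weak ones. The function name is hypothetical, and the sample scores are the first three from the output above, rounded to four decimals.

```python
def split_by_threshold(hits, threshold=0.5):
    """Partition (title, score) pairs into confident matches and weaker ones."""
    confident = [(title, score) for title, score in hits if score >= threshold]
    weak = [(title, score) for title, score in hits if score < threshold]
    return confident, weak

# First three results from the output above, scores rounded to four decimals.
hits = [
    ("Loosing the War on Terrorism", 1.0),
    ("Politics an Afterthought Amid Hurricane", 0.2943),
    ("Kerry-Kerrey Confusion Trips Up Campaign", 0.2924),
]
confident, weak = split_by_threshold(hits)
print(confident)  # [('Loosing the War on Terrorism', 1.0)]
print(len(weak))  # 2
```
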


#### 3. Getting the Text
After getting a list of similar titles, choose the one that you feel is closest to yours. Then you can use that title to find the corresponding text in Postgres.

In [19]:
# Pass the title as a bound parameter so apostrophes in a title cannot break the query
sql = "select text from " + TABLE_NAME + " where title = %s;"
cursor.execute(sql, (similar_titles[0][0],))
rows=cursor.fetchall()
print("Title:")
print(similar_titles[0][0])
print("Text:")
print(rows[0][0])

Title:
Loosing the War on Terrorism
Text:
 Sven Jaschan, self-confessed author of the Netsky and Sasser viruses, is responsible for 70 percent of virus infections in 2004, according to a six-month virus roundup published Wednesday by antivirus company Sophos.  The 18-year-old Jaschan was taken into custody in Germany in May by police who said he had admitted programming both the Netsky and Sasser worms, something experts at Microsoft confirmed. (A Microsoft antivirus reward program led to the teenager's arrest.) During the five months preceding Jaschan's capture, there were at least 25 variants of Netsky and one of the port-scanning network worm Sasser.  Graham Cluley, senior technology consultant at Sophos, said it was staggeri   
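
Looking up a row by title is safest when the title is passed as a bound parameter rather than concatenated into the SQL string, since titles may contain apostrophes. The sketch below demonstrates the idea with Python's built-in `sqlite3` as a stand-in for Postgres, so that it runs without a server. The table contents and the function name are made up, and `sqlite3` uses `?` placeholders where `psycopg2` uses `%s`.

```python
import sqlite3

# In-memory stand-in for the Postgres table used in this example (assumption).
demo_conn = sqlite3.connect(":memory:")
demo_cur = demo_conn.cursor()
demo_cur.execute("CREATE TABLE title_text (ids INTEGER, title TEXT, text TEXT)")
demo_cur.executemany(
    "INSERT INTO title_text VALUES (?, ?, ?)",
    [
        (1, "It's a title with an apostrophe", "First body"),
        (2, "Another title", "Second body"),
    ],
)
demo_conn.commit()

def text_for_title(cursor, title):
    """Return the text stored for a title, or None if the title is absent."""
    cursor.execute("SELECT text FROM title_text WHERE title = ?", (title,))
    row = cursor.fetchone()
    return row[0] if row else None

found = text_for_title(demo_cur, "It's a title with an apostrophe")
print(found)  # First body
```
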
