# Load Vector Embeddings to Milvus in watsonx.data (Web)

## Overview
This Jupyter Notebook is a step-by-step guide to preparing document data for retrieval-augmented generation (RAG), using Milvus in watsonx.data as the vector database.

In this notebook, we convert text to vector embeddings and store them in Milvus, the vector database in watsonx.data. The steps are:
1. Install and import libraries.
2. Connect to watsonx.data.
3. Connect to Milvus.
4. Create a Collection in Milvus.
5. Insert Vectors into Milvus.
6. Query Docs from Milvus through Semantic Search.

- Author: ahmad.muzaffar@ibm.com (APAC Ecosystem Technical Enablement).
- This material has been adapted from material originally produced by Katherine Ciaravalli, Ken Bailey and George Baklarz.

## 1. Install and import libraries

In [None]:
!pip install python-dotenv
!pip install wikipedia
!pip install pymilvus
!pip install sentence_transformers
!pip install grpcio==1.60.0 

In [None]:
# Import libraries
import pandas as pd
import warnings
import os
import re
#os.environ["TOKENIZERS_PARALLELISM"] = "false"

from sentence_transformers import SentenceTransformer
from pymilvus import Collection, connections
from pymilvus import utility

warnings.filterwarnings('ignore')

## 2. Connect to watsonx.data

In [None]:
!rm -f /tmp/presto.crt
!echo QUIT | openssl s_client -showcerts -connect localhost:8443 | awk '/-----BEGIN CERTIFICATE-----/ {p=1}; p; /-----END CERTIFICATE-----/ {p=0}' > /tmp/presto.crt

In [None]:
%run presto.ipynb

In [None]:
%%sql
   connect
   userid=ibmlhadmin
   password=password
   hostname=watsonxdata
   port=8443
   catalog=tpch
   schema=tiny
   certfile=/certs/lh-ssl-ts.crt

## 3. Connect to Milvus

#### Milvus Connection Settings

In [None]:
host            = 'watsonxdata'
port            = 19530
user            = 'ibmlhadmin'
password        = 'password'
server_pem_path = '/tmp/presto.crt'
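
The settings above are hard-coded for the lab environment. As a hedged alternative, the sketch below reads the same values from environment variables and falls back to the lab defaults. The variable names (`MILVUS_HOST`, `MILVUS_PORT`, `MILVUS_USER`, `MILVUS_PASSWORD`, `MILVUS_SERVER_PEM`) and the helper `milvus_settings` are hypothetical and are not defined by watsonx.data or pymilvus.

```python
import os

def milvus_settings(env=os.environ):
    """Collect Milvus connection settings, falling back to the lab defaults.

    The environment variable names are illustrative assumptions.
    """
    return {
        "host": env.get("MILVUS_HOST", "watsonxdata"),
        "port": int(env.get("MILVUS_PORT", "19530")),
        "user": env.get("MILVUS_USER", "ibmlhadmin"),
        "password": env.get("MILVUS_PASSWORD", "password"),
        "server_pem_path": env.get("MILVUS_SERVER_PEM", "/tmp/presto.crt"),
    }

# With no overrides set, the helper returns the values used in this notebook
settings = milvus_settings({})
print(settings["host"], settings["port"])
```

The resulting dictionary could be unpacked into the individual variables used by the connection cell that follows.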

In [None]:
# Generate a Connection to Milvus
# Only the names used in this notebook are imported
from pymilvus import (
    connections,
    FieldSchema,
    DataType,
    Collection,
    CollectionSchema,
)

connections.connect(alias='default',
                   host=host,
                   port=port,
                   user=user,
                   password=password,
                   server_pem_path=server_pem_path,
                   server_name='watsonxdata',
                   secure=True)

## 4. Create a Collection in Milvus
The code below drops the `rag_docs` collection if it exists and then recreates it with a vector index. The cell should end with the following output.
```
Status(code=0, message=)
```

In [None]:
# Drop the collection only if it already exists
if utility.has_collection("rag_docs"):
    utility.drop_collection("rag_docs")

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True), # Primary key
    FieldSchema(name="article_text", dtype=DataType.VARCHAR, max_length=2500,),
    FieldSchema(name="article_title", dtype=DataType.VARCHAR, max_length=200,),
    FieldSchema(name="vector", dtype=DataType.FLOAT_VECTOR, dim=384),
]

schema = CollectionSchema(fields, "rag collection schema")

wiki_collection = Collection("rag_docs", schema)

# Create index
index_params = {
    'metric_type': 'L2',
    'index_type': 'IVF_FLAT',
    'params': {'nlist': 2048}
}

wiki_collection.create_index(field_name="vector", index_params=index_params)

In [None]:
# Double-check that the collection exists
utility.list_collections()

## 5. Insert Vectors into Milvus

Here we read the data from a watsonx.data table. We pull the text chunks and titles from the table and keep them in two separate lists. We then vectorize the text chunks with the `sentence-transformers/all-MiniLM-L6-v2` sentence transformer model. Learn more about Hugging Face sentence transformers here: [Sentence Transformers](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2).

It is important that we assemble the article text, article titles and vector embeddings into a `data` object, in the same order as the fields in the collection schema (the auto-generated `id` field is omitted). This object is used to load the data into Milvus.
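
Before inserting, it can help to confirm that the three columns line up with the schema. The following is a minimal sketch, not part of the original notebook: `check_columns` is a hypothetical helper, and it assumes that Milvus counts the `VARCHAR` `max_length` limit in bytes (adjust the check if your version counts characters). The limits 2500, 200 and 384 are taken from the schema defined earlier.

```python
def check_columns(texts, titles, vectors, dim=384,
                  max_text_bytes=2500, max_title_bytes=200):
    """Return a list of problems found; an empty list means the columns look insertable."""
    problems = []
    # Column-based inserts need one entry per row in every column
    if not (len(texts) == len(titles) == len(vectors)):
        problems.append(
            f"column lengths differ: {len(texts)} texts, "
            f"{len(titles)} titles, {len(vectors)} vectors"
        )
        return problems
    for row, (text, title, vector) in enumerate(zip(texts, titles, vectors)):
        # Assumption: the VARCHAR limit is measured in UTF-8 bytes
        if len(text.encode("utf-8")) > max_text_bytes:
            problems.append(f"row {row}: article_text exceeds {max_text_bytes} bytes")
        if len(title.encode("utf-8")) > max_title_bytes:
            problems.append(f"row {row}: article_title exceeds {max_title_bytes} bytes")
        if len(vector) != dim:
            problems.append(f"row {row}: vector has {len(vector)} dimensions, expected {dim}")
    return problems

# Toy data: the second row has an oversized text and a vector of the wrong size
sample_texts = ["Solar power reduces emissions.", "x" * 3000]
sample_titles = ["Solar power", "Oversized"]
sample_vectors = [[0.0] * 384, [0.0] * 383]
for problem in check_columns(sample_texts, sample_titles, sample_vectors):
    print(problem)
```

In the notebook, the same check could be run on `passages`, `passage_titles` and `passage_embeddings` before calling `insert`.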

In [None]:
# Read the Wikipedia articles from watsonx.data using the connection created earlier
articles_df = %sql --pandas SELECT * from hive_data.rag_web.web_wikipedia

# Extract text + titles into separate lists
passages = articles_df['text'].tolist()
passage_titles = articles_df['title'].tolist()

# Create vector embeddings (one 384-dimensional vector per passage)
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2') # 384 dim
passage_embeddings = model.encode(passages)

# Assemble the columns in schema order: article_text, article_title, vector
basic_collection = Collection("rag_docs")
data = [
    passages,
    passage_titles,
    passage_embeddings
]
out = basic_collection.insert(data)
basic_collection.flush()  # Ensures data persistence
print("Done")

#### Check that the Collection has been Loaded

In [None]:
basic_collection = Collection("rag_docs") 
basic_collection.load()
basic_collection.num_entities 

## 6. Query Docs from Milvus through Semantic Search
After gathering the data from Wikipedia, vectorizing it, and inserting it into Milvus, we are ready to query the vector database. We use the same `sentence-transformers/all-MiniLM-L6-v2` model to generate the query vector, and then use Milvus to find the most similar vectors in the collection.
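
To make "most similar" concrete, here is a small sketch of ranking by L2 distance, the metric this notebook configures for the index and the search. It uses hand-written three-dimensional toy vectors rather than real 384-dimensional embeddings, and the helper names `squared_l2` and `rank_by_l2` are illustrative only. It is a brute-force comparison, not how Milvus searches an IVF_FLAT index internally.

```python
def squared_l2(a, b):
    """Squared Euclidean distance; it ranks vectors the same way as the plain L2 distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def rank_by_l2(query, candidates, limit=3):
    """Return the `limit` closest candidates as (distance, name) pairs, closest first."""
    scored = [(squared_l2(query, vector), name) for name, vector in candidates.items()]
    return sorted(scored)[:limit]

# Toy "embeddings": the first two topics point in a similar direction to the query
toy_vectors = {
    "renewable energy": [0.9, 0.1, 0.0],
    "carbon offsets": [0.7, 0.3, 0.1],
    "football scores": [0.0, 0.2, 0.9],
}
query_vector = [0.8, 0.2, 0.0]

ranking = rank_by_l2(query_vector, toy_vectors, limit=2)
for distance, name in ranking:
    print(f"{name}: {distance:.2f}")
```

A smaller distance means a closer match, so the unrelated topic is the one left out of the top two.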

#### Create a Query Function
The following function will be used to query the Milvus database.

In [None]:
# Load the embedding model once so that repeated queries do not reload it
query_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2') # 384 dim

def query_milvus(query, num_results=5):
    # Vectorize the query with the same model used for the stored passages
    query_embeddings = query_model.encode([query])

    # Search with the same metric as the index (L2);
    # nprobe sets how many index clusters are scanned
    search_params = {
        "metric_type": "L2",
        "params": {"nprobe": 5}
    }
    results = basic_collection.search(
        data=query_embeddings,
        anns_field="vector",
        param=search_params,
        limit=num_results,
        expr=None,
        output_fields=['article_text'],
    )
    return results

#### Query Question
Consider how climate change may relate to the industries and processes connected to your business. Select one of the questions below to use as the Milvus query.

In [None]:
question_text = "What can my company do to help fight climate change?"
#question_text = "How do businesses negatively affect climate change?"
#question_text = "What can a business do to have a positive effect on climate change?"
#question_text = "How can a business reduce their carbon footprint?"

#### Search a Question in Milvus
An embedding is created for the question and used to search Milvus for the most relevant chunks. The top 3 chunks are retrieved below and can be used in a large language model prompt.

The documents that best match the question are listed below.
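
The display cell that follows cleans each chunk with the regular expression `^.*?\. (.*\.).*$` before showing it. As a small illustration on a made-up chunk (the sample text and the helper name `trim_chunk` are not from the dataset or the notebook), the pattern drops everything up to the first sentence break and everything after the last period, which removes partial sentences left at the chunk boundaries. Text that does not match the pattern is returned unchanged.

```python
import re

# Same pattern as the display cell: skip up to the first ". ",
# keep through the last ".", and drop whatever trails after it
TRIM_PATTERN = re.compile(r"^.*?\. (.*\.).*$")

def trim_chunk(text):
    """Remove the leading and trailing sentence fragments from a text chunk."""
    return TRIM_PATTERN.sub(r"\1", text)

sample_chunk = (
    "ate change is accelerating. Businesses can cut emissions. "
    "They can also switch to renewables. Many compan"
)
print(trim_chunk(sample_chunk))
```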

In [None]:
num_results = 3
results = query_milvus(question_text, num_results)

display_articles = []
relevant_chunks  = []
hits = results[0]
for i in range(len(hits.ids)):  # Milvus may return fewer hits than requested
    # Drop the partial sentences at the start and end of the chunk
    article = re.sub(r"^.*?\. (.*\.).*$", r"\1", hits[i].entity.get('article_text'))
    display_articles.append({
        "ID"      : hits.ids[i],
        "Distance": hits.distances[i],
        "Article" : article
    })
    relevant_chunks.append(article)

# With the L2 metric a smaller distance means a closer match, so sort ascending
df = pd.DataFrame.from_dict(display_articles).sort_values("Distance", ascending=True)
df.style.set_properties(**{'text-align': 'left'}).set_caption(question_text).set_table_styles([{
    'selector': 'caption',
    'props': [
        ('color', 'blue'),
        ('font-size', '20px')
    ]
}])
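
The chunks collected in `relevant_chunks` can then be combined with the question into a prompt for a large language model. The sketch below shows one possible way to do this. The function `build_prompt`, the prompt wording and the `max_chars` budget are illustrative assumptions, not part of watsonx.data or any specific model's required format.

```python
def build_prompt(question, chunks, max_chars=4000):
    """Combine retrieved chunks and a question into a single prompt string.

    Chunks are numbered and added in order until the character budget is used up.
    """
    context_parts = []
    used = 0
    for number, chunk in enumerate(chunks, start=1):
        entry = f"[{number}] {chunk.strip()}"
        if used + len(entry) > max_chars:
            break  # Stop before exceeding the context budget
        context_parts.append(entry)
        used += len(entry)
    context = "\n".join(context_parts)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Hypothetical chunks standing in for the notebook's relevant_chunks list
example_chunks = [
    "Businesses can reduce emissions by improving energy efficiency.",
    "Switching to renewable electricity lowers a company's carbon footprint.",
]
prompt = build_prompt("What can my company do to help fight climate change?", example_chunks)
print(prompt)
```

In the notebook, `build_prompt(question_text, relevant_chunks)` would produce the equivalent prompt from the actual search results.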