# Load Vector Embeddings to Milvus in watsonx.data

## Overview
This Jupyter Notebook provides a step-by-step guide to preparing document data for retrieval-augmented generation (RAG) using Milvus, the vector database in watsonx.data.

In this notebook, we convert text to vector embeddings and store them in Milvus. Here are the steps:
1. Connect to watsonx.data.
2. Connect to Milvus.
3. Create a Collection in Milvus.
4. Insert Vectors into Milvus.
5. Query Docs from Milvus through Semantic Search.

#### Credits
- ahmad.muzaffar@ibm.com (APAC Ecosystem Technical Enablement team)
- This material has been adapted from material originally produced by Katherine Ciaravalli, Ken Bailey and George Baklarz.

### Install required libraries

In [34]:
!pip install python-dotenv
!pip install wikipedia
!pip install pymilvus
!pip install sentence_transformers
!pip install grpcio==1.60.0 
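
After installation, it can help to confirm which versions were actually resolved, since `grpcio` is pinned above. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is an illustration and not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_versions(packages):
    """Return {package name: version string, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

# Distribution names of the packages installed in the cell above
required = ["python-dotenv", "wikipedia", "pymilvus", "sentence-transformers", "grpcio"]
for name, ver in installed_versions(required).items():
    print(f"{name:25s} {ver or 'NOT INSTALLED'}")
```

A package reported as `NOT INSTALLED` usually means the `pip install` ran against a different Python environment than the notebook kernel.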



## 1. Connect to watsonx.data

In [35]:
!rm -f /tmp/presto.crt
!echo QUIT | openssl s_client -showcerts -connect localhost:8443 | awk '/-----BEGIN CERTIFICATE-----/ {p=1}; p; /-----END CERTIFICATE-----/ {p=0}' > /tmp/presto.crt

Can't use SSL_get_servername
depth=0 C = YY, ST = XX, L = Home-Town, O = Data and AI, OU = For-CPD, emailAddress = dummy@example.dum, CN = Dummy-Self-signed-Cert
verify error:num=18:self-signed certificate
verify return:1
depth=0 C = YY, ST = XX, L = Home-Town, O = Data and AI, OU = For-CPD, emailAddress = dummy@example.dum, CN = Dummy-Self-signed-Cert
verify return:1
DONE
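
The `verify error` lines above are expected for a self-signed certificate, so they do not show whether the extraction itself worked. As a rough check, the sketch below counts complete PEM blocks in the output file. The helper `count_pem_certificates` is hypothetical, and the check only looks at the BEGIN/END markers; it does not validate the certificate contents.

```python
import re
from pathlib import Path

# Matches one complete PEM certificate block, including its markers
PEM_BLOCK = re.compile(
    r"-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----", re.DOTALL
)

def count_pem_certificates(text):
    """Count complete BEGIN/END CERTIFICATE blocks in a PEM string."""
    return len(PEM_BLOCK.findall(text))

cert_path = Path("/tmp/presto.crt")  # file written by the openssl cell above
if cert_path.exists():
    print(count_pem_certificates(cert_path.read_text()), "certificate(s) found")
else:
    print(f"{cert_path} not found; rerun the openssl cell")
```

A count of zero suggests the `openssl` command could not reach the server or the `awk` filter matched nothing.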


In [36]:
%run presto.ipynb

Presto Extensions Loaded.


In [37]:
%%sql
   connect
   userid=ibmlhadmin
   password=password
   hostname=watsonxdata
   port=8443
   catalog=tpch
   schema=tiny
   certfile=/certs/lh-ssl-ts.crt

Connection successful.


## 2. Connect to Milvus

#### Milvus Connection Settings

In [38]:
host            = 'watsonxdata'
port            = 19530
user            = 'ibmlhadmin'
password        = 'password'
server_pem_path = '/tmp/presto.crt'
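
The settings above are hard-coded for the lab environment. Outside the lab, one option is to read them from environment variables and fall back to the lab defaults. This is a minimal sketch under that assumption: the variable names (`MILVUS_HOST` and so on) and the helper `milvus_settings` are hypothetical, not something watsonx.data or pymilvus defines.

```python
import os

def milvus_settings(env=os.environ):
    """Read Milvus connection settings, falling back to the lab defaults."""
    return {
        "host": env.get("MILVUS_HOST", "watsonxdata"),
        "port": int(env.get("MILVUS_PORT", "19530")),
        "user": env.get("MILVUS_USER", "ibmlhadmin"),
        "password": env.get("MILVUS_PASSWORD", "password"),
        "server_pem_path": env.get("MILVUS_PEM_PATH", "/tmp/presto.crt"),
    }

settings = milvus_settings()
# Mask the password so it does not end up in the notebook output
print({k: ("***" if k == "password" else v) for k, v in settings.items()})
```

If the values live in a `.env` file, calling `load_dotenv()` from the `python-dotenv` package installed earlier would copy them into `os.environ` before this helper runs.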

#### Generate a Connection to Milvus

In [39]:
# Only the names used in this notebook are imported
from pymilvus import (
    connections,
    FieldSchema,
    DataType,
    Collection,
    CollectionSchema,
)

connections.connect(alias='default',
                   host=host,
                   port=port,
                   user=user,
                   password=password,
                   server_pem_path=server_pem_path,
                   server_name='watsonxdata',
                   secure=True)

## 3. Create a Collection in Milvus
This code drops the `wiki_articles` collection if it exists and then recreates it with an index on the vector field. The cell should return the following text:
```
Status(code=0, message=)
```

In [40]:
from pymilvus import utility

utility.drop_collection("wiki_articles")

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True), # Primary key
    FieldSchema(name="article_text", dtype=DataType.VARCHAR, max_length=2500,),
    FieldSchema(name="article_title", dtype=DataType.VARCHAR, max_length=200,),
    FieldSchema(name="vector", dtype=DataType.FLOAT_VECTOR, dim=384),
]

schema = CollectionSchema(fields, "wikipedia article collection schema")

wiki_collection = Collection("wiki_articles", schema)

# Create index
index_params = {
        'metric_type':'L2',
        'index_type':"IVF_FLAT",
        'params':{"nlist":2048}
}

wiki_collection.create_index(field_name="vector", index_params=index_params)

Status(code=0, message=)

#### Double Check that the Collection Exists

In [41]:
from pymilvus import utility
utility.list_collections()

['wiki_articles']

## 4. Insert Vectors into Milvus

Here we read the text chunks and titles from a watsonx.data table into two separate lists, then vectorize the chunks with the `sentence-transformers/all-MiniLM-L6-v2` sentence transformer model. Learn more about the model on Hugging Face: [Sentence Transformers](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2).

The article text, article titles and vector embeddings must be assembled into a `data` object whose column order matches the collection schema (the auto-generated `id` field is omitted). This object is used to load the data into Milvus.
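
Before inserting, it can be useful to check that the three columns line up with the schema defined in step 3. The sketch below is one way to do that in plain Python; `validate_columns` is a hypothetical helper, and measuring string length in UTF-8 bytes is a conservative assumption about how the `max_length` limit is enforced.

```python
def validate_columns(texts, titles, vectors, max_text=2500, max_title=200, dim=384):
    """Return a list of problems found; an empty list means the columns look consistent."""
    problems = []
    if not (len(texts) == len(titles) == len(vectors)):
        problems.append(
            f"column lengths differ: {len(texts)} texts, "
            f"{len(titles)} titles, {len(vectors)} vectors"
        )
    for i, text in enumerate(texts):
        if len(text.encode("utf-8")) > max_text:
            problems.append(f"row {i}: article_text exceeds {max_text} bytes")
    for i, title in enumerate(titles):
        if len(title.encode("utf-8")) > max_title:
            problems.append(f"row {i}: article_title exceeds {max_title} bytes")
    for i, vec in enumerate(vectors):
        if len(vec) != dim:
            problems.append(f"row {i}: vector has {len(vec)} dimensions, expected {dim}")
    return problems

# Toy row standing in for the real passages, titles and embeddings
print(validate_columns(["some text"], ["Title"], [[0.0] * 384]))  # → []
```

In the notebook, the same call would take `passages`, `passage_titles` and `passage_embeddings` just before `insert`.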

In [42]:
import pandas as pd
from sentence_transformers import SentenceTransformer
from pymilvus import Collection, connections
import warnings
import os
#os.environ["TOKENIZERS_PARALLELISM"] = "false"

warnings.filterwarnings('ignore')

# Read the Wikipedia articles from watsonx.data using the Presto connection opened earlier

articles_df = %sql --pandas SELECT * from hive_data.rag_docs.web_wikipedia

# extract text + titles

passages = articles_df['text'].tolist()
passage_titles = articles_df['title'].tolist()

# Create vector embeddings + data

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2') # 384 dim
passage_embeddings = model.encode(passages)

basic_collection = Collection("wiki_articles") 
data = [
    passages,
    passage_titles,
    passage_embeddings
]
out = basic_collection.insert(data)
basic_collection.flush()  # Ensures data persistence
print("Done")

Done


#### Check that the Collection has been Loaded

In [43]:
basic_collection = Collection("wiki_articles") 
basic_collection.load()
basic_collection.num_entities 

53

## 5. Query Docs from Milvus through Semantic Search
With the Wikipedia data vectorized and inserted into Milvus, we are ready to query the vector database. We use the same `sentence-transformers/all-MiniLM-L6-v2` model to generate the query vector, then ask Milvus for the most similar stored vectors.
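
To make "most similar" concrete, the sketch below ranks a few toy vectors by L2 (Euclidean) distance to a query vector. It is an exhaustive scan in plain Python for illustration only; Milvus uses the `IVF_FLAT` index created in step 3 rather than comparing against every stored vector, and the function names here are hypothetical.

```python
import math

def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(query, vectors, k=2):
    """Return the k (index, distance) pairs with the smallest L2 distance."""
    scored = [(i, l2_distance(query, v)) for i, v in enumerate(vectors)]
    return sorted(scored, key=lambda pair: pair[1])[:k]

# Toy 2-dimensional "embeddings"; the real ones have 384 dimensions
stored = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]]
print([i for i, _ in nearest([1.0, 2.0], stored)])  # → [1, 0]
```

The vector at index 1 is closest to the query, so it is returned first; a smaller distance always means a closer match.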

#### Create a Query Function
The following function will be used to query the Milvus database.

In [44]:
from sentence_transformers import SentenceTransformer

def query_milvus(query, num_results=5):
    """Return the num_results chunks in wiki_articles closest to the query text."""

    # Vectorize the query with the same model used for the stored passages
    model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2') # 384 dim
    query_embeddings = model.encode([query])

    # Search; metric_type must match the metric the index was built with (L2)
    search_params = {
        "metric_type": "L2",
        "params": {"nprobe": 5}  # number of IVF clusters to scan
    }
    # basic_collection is the collection loaded in the previous cell
    results = basic_collection.search(
        data=query_embeddings,
        anns_field="vector",
        param=search_params,
        limit=num_results,
        expr=None,
        output_fields=['article_text'],
    )
    return results

#### Query Question
Consider how climate change may relate to other industries and processes related to your business. Select one of the questions below to use as the Milvus query.

In [45]:
question_text = "What can my company do to help fight climate change?"
#question_text = "How do businesses negatively affect climate change?"
#question_text = "What can a business do to have a positive effect on climate change?"
#question_text = "How can a business reduce their carbon footprint?"

#### Search a Question in Milvus
An embedding is generated for the question and used to search Milvus for the most relevant chunks. The top 5 chunks are retrieved below and can be used as context in a large language model prompt.

The best-matching documents are listed below. Because the search uses L2 distance, a smaller distance means a closer match; the table is sorted by descending distance, so the closest match appears in the last row.

In [47]:
import re
num_results = 5
results = query_milvus(question_text, num_results)

display_articles = []
relevant_chunks  = []
for i in range(len(results[0].ids)):  # Milvus may return fewer hits than requested
    # Drop everything up to the first sentence break and anything after the last period
    article = re.sub(r"^.*?\. (.*\.).*$", r"\1", results[0][i].entity.get('article_text'))
    display_articles.append({
        "ID"      : results[0].ids[i],
        "Distance": results[0].distances[i],
        "Article" : article
    })
    relevant_chunks.append(article)

df = pd.DataFrame.from_dict(display_articles).sort_values("Distance",ascending=False)
df.style.set_properties(**{'text-align': 'left'}).set_caption(question_text).set_table_styles([{
    'selector': 'caption',
    'props': [
        ('color', 'blue'),
        ('font-size', '20px')
    ]
}])

Index,ID,Distance,Article
4,452740999904618768,0.980102,"Although there is no single pathway to limit global warming to 1.5 or 2 °C, most scenarios and strategies see a major increase in the use of renewable energy in combination with increased energy efficiency measures to generate the needed greenhouse gas reductions. To reduce pressures on ecosystems and enhance their carbon sequestration capabilities, changes would also be necessary in agriculture and forestry, such as preventing deforestation and restoring natural ecosystems by reforestation. Other approaches to mitigating climate change have a higher level of risk. Scenarios that limit global warming to 1.5 °C typically project the large-scale use of carbon dioxide removal methods over the 21st century. There are concerns, though, about over-reliance on these technologies, and environmental impacts. Solar radiation modification (SRM) is also a possible supplement to deep reductions in emissions. However, SRM raises significant ethical and legal concerns, and the risks are imperfectly understood. === Clean energy === Renewable energy is key to limiting climate change. For decades, fossil fuels have accounted for roughly 80% of the world's energy use. The remaining share has been split between nuclear power and renewables (including hydropower, bioenergy, wind and solar power and geothermal energy). Fossil fuel use is expected to peak in absolute terms prior to 2030 and then to decline, with coal use experiencing the sharpest reductions."
3,452740999904618769,0.95047,"Oxfam found that in 2023 the wealthiest 10% of people were responsible for 50% of global emissions, while the bottom 50% were responsible for just 8%. Production of emissions is another way to look at responsibility: under that approach, the top 21 fossil fuel companies would owe cumulative climate reparations of $5.4 trillion over the period 2025–2050. To achieve a just transition, people working in the fossil fuel sector would also need other jobs, and their communities would need investments. === International climate agreements === Nearly all countries in the world are parties to the 1994 United Nations Framework Convention on Climate Change (UNFCCC). The goal of the UNFCCC is to prevent dangerous human interference with the climate system. As stated in the convention, this requires that greenhouse gas concentrations are stabilized in the atmosphere at a level where ecosystems can adapt naturally to climate change, food production is not threatened, and economic development can be sustained. The UNFCCC does not itself restrict emissions but rather provides a framework for protocols that do. Global emissions have risen since the UNFCCC was signed. Its yearly conferences are the stage of global negotiations."
2,452740999904618806,0.930857,"There are synergies but also trade-offs between adaptation and mitigation. An example for synergy is increased food productivity, which has large benefits for both adaptation and mitigation. An example of a trade-off is that increased use of air conditioning allows people to better cope with heat, but increases energy demand. Another trade-off example is that more compact urban development may reduce emissions from transport and construction, but may also increase the urban heat island effect, exposing people to heat-related health risks. == Policies and politics == Countries that are most vulnerable to climate change have typically been responsible for a small share of global emissions. This raises questions about justice and fairness. Limiting global warming makes it much easier to achieve the UN's Sustainable Development Goals, such as eradicating poverty and reducing inequalities. The connection is recognized in Sustainable Development Goal 13 which is to ""take urgent action to combat climate change and its impacts"". The goals on food, clean water and ecosystem protection have synergies with climate mitigation. The geopolitics of climate change is complex. It has often been framed as a free-rider problem, in which all countries benefit from mitigation done by other countries, but individual countries would lose from switching to a low-carbon economy themselves. Sometimes mitigation also has localized benefits though."
1,452740999904618762,0.926673,"If that fails, managed retreat may be needed. There are economic barriers for tackling dangerous heat impact. Avoiding strenuous work or having air conditioning is not possible for everybody. In agriculture, adaptation options include a switch to more sustainable diets, diversification, erosion control, and genetic improvements for increased tolerance to a changing climate. Insurance allows for risk-sharing, but is often difficult to get for people on lower incomes. Education, migration and early warning systems can reduce climate vulnerability. Planting mangroves or encouraging other coastal vegetation can buffer storms. Ecosystems adapt to climate change, a process that can be supported by human intervention. By increasing connectivity between ecosystems, species can migrate to more favourable climate conditions. Species can also be introduced to areas acquiring a favourable climate. Protection and restoration of natural and semi-natural areas helps build resilience, making it easier for ecosystems to adapt. Many of the actions that promote adaptation in ecosystems, also help humans adapt via ecosystem-based adaptation. For instance, restoration of natural fire regimes makes catastrophic fires less likely, and reduces human exposure. Giving rivers more space allows for more water storage in the natural system, reducing flood risk."
0,452740999904618771,0.859215,"==== Climate movement ==== Climate protests demand that political leaders take action to prevent climate change. They can take the form of public demonstrations, fossil fuel divestment, lawsuits and other activities. Prominent demonstrations include the School Strike for Climate. In this initiative, young people across the globe have been protesting since 2018 by skipping school on Fridays, inspired by Swedish teenager Greta Thunberg. Mass civil disobedience actions by groups like Extinction Rebellion have protested by disrupting roads and public transport. Litigation is increasingly used as a tool to strengthen climate action from public institutions and companies. Activists also initiate lawsuits which target governments and demand that they take ambitious action or enforce existing laws on climate change. Lawsuits against fossil-fuel companies generally seek compensation for loss and damage. == History == === Early discoveries === Scientists in the 19th century such as Alexander von Humboldt began to foresee the effects of climate change. In the 1820s, Joseph Fourier proposed the greenhouse effect to explain why Earth's temperature was higher than the Sun's energy alone could explain. Earth's atmosphere is transparent to sunlight, so sunlight reaches the surface where it is converted to heat."
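
The `relevant_chunks` list built in the cell above is what would be passed to a large language model as context. The sketch below shows one way to assemble such a prompt. The template wording, the `max_chars` budget and the helper name `build_prompt` are all assumptions made for illustration; the notebook does not prescribe a prompt format or a specific model.

```python
def build_prompt(question, chunks, max_chars=4000):
    """Join retrieved chunks into a context block followed by the question."""
    context_parts = []
    used = 0
    for chunk in chunks:
        if used + len(chunk) > max_chars:
            break  # stop before exceeding the context budget
        context_parts.append(chunk)
        used += len(chunk)
    context = "\n\n".join(context_parts)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Toy chunks standing in for `relevant_chunks`
prompt = build_prompt(
    "What is Milvus?",
    ["Milvus is a vector database.", "It stores embeddings."],
)
print(prompt)
```

In the notebook, the call would be `build_prompt(question_text, relevant_chunks)`, with the chunks ordered from closest to farthest as Milvus returned them.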
