# Query Data Using an LLM

Here is the overall RAG pipeline. In this notebook, we will do steps (6), (7), (8), (9) and (10):
- Importing data was already done in the notebook [rag_2_load_data_into_milvus.ipynb](rag_2_load_data_into_milvus.ipynb)
- 👉 Step 6: Calculate embedding for user query
- 👉 Step 7 & 8: Send the query to vector db to retrieve relevant documents
- 👉 Step 9 & 10: Send the query and the relevant documents (returned by the previous step) to the LLM and get answers to our query

![image missing](media/rag-overview-2.png)

## Step-1: Configuration

In [1]:
from my_config import MY_CONFIG
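
The contents of `my_config.py` are not shown in this notebook. As a rough sketch, it might look like the following. The attribute names are the ones this notebook reads, and `DB_URI` and `LLM_MODEL` match values printed later in this notebook. The values for `COLLECTION_NAME`, `EMBEDDING_MODEL` and `MAX_CONTEXT_WINDOW` are placeholder assumptions, and the real file may differ.

```python
# Hypothetical my_config.py (a sketch; the real file may differ)

class MyConfig:
    """Simple namespace for settings shared across the RAG notebooks."""
    pass

MY_CONFIG = MyConfig()

# Values that match output printed elsewhere in this notebook
MY_CONFIG.DB_URI = "./rag_1_dpk.db"
MY_CONFIG.LLM_MODEL = "ibm-granite/granite-3.0-8b-instruct"

# Assumed placeholder values (not taken from the notebook)
MY_CONFIG.COLLECTION_NAME = "docs"  # assumption: any collection name works
MY_CONFIG.EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"  # assumption: a 384-dim model
MY_CONFIG.MAX_CONTEXT_WINDOW = 4096  # assumption: depends on the chosen LLM

# Filled in later from the .env file
MY_CONFIG.REPLICATE_API_TOKEN = None
```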

## Step-2: Load .env file


In [2]:
import os,sys
## Load Settings from .env file
from dotenv import find_dotenv, dotenv_values

# _ = load_dotenv(find_dotenv()) # read local .env file
config = dotenv_values(find_dotenv())

# debug
# print (config)

MY_CONFIG.REPLICATE_API_TOKEN = config.get('REPLICATE_API_TOKEN')

if  MY_CONFIG.REPLICATE_API_TOKEN:
    print ("✅ config REPLICATE_API_TOKEN found")
else:
    raise Exception ("❌ 'REPLICATE_API_TOKEN' is not set. Please set it in the .env file to continue...")


✅ config REPLICATE_API_TOKEN found
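
If the `.env` file is missing, the token could still come from the process environment. Here is a minimal sketch, assuming a hypothetical helper `get_setting` (not part of this notebook or of `python-dotenv`) that checks the parsed `.env` values first and then falls back to `os.environ`. Plain dicts stand in for the real sources so the example runs on its own.

```python
import os

def get_setting(key, dotenv_config, environ=None):
    """Return the value for `key` from the .env values, else from the environment."""
    environ = os.environ if environ is None else environ
    value = dotenv_config.get(key) or environ.get(key)
    if not value:
        raise ValueError(f"{key} is not set in the .env file or the environment")
    return value

# Plain dicts stand in for dotenv_values(...) and os.environ
token = get_setting(
    "REPLICATE_API_TOKEN",
    dotenv_config={},                               # nothing found in .env
    environ={"REPLICATE_API_TOKEN": "r8_example"},  # made-up token value
)
print(token)
```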


## Step-3: Connect to Vector Database

Milvus can run embedded in the application (here, from a local database file), which makes it easy to use.

<span style="color:blue;">Note: If you encounter an error about being unable to load the database, try this: </span>

- <span style="color:blue;">In **vscode** : **restart the kernel** of previous notebook. This will release the db.lock </span>
- <span style="color:blue;">In **Jupyter**: Do `File --> Close and Shutdown Notebook` of previous notebook. This will release the db.lock</span>
- <span style="color:blue;">Re-run this cell again</span>


In [3]:
from pymilvus import MilvusClient

milvus_client = MilvusClient(MY_CONFIG.DB_URI)

print ("✅ Connected to Milvus instance:", MY_CONFIG.DB_URI)

✅ Connected to Milvus instance: ./rag_1_dpk.db


## Step-4: Setup Embeddings

Use the same embeddings we used to index our documents!

In [4]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(MY_CONFIG.EMBEDDING_MODEL)

def get_embeddings (text):
    # `text` avoids shadowing the built-in `str`; normalization yields unit-length vectors
    embeddings = model.encode(text, normalize_embeddings=True)
    return embeddings

In [5]:
# Test embeddings
embeddings = get_embeddings('Paris 2024 Olympics')
print ('embeddings len =', len(embeddings))
print ('embeddings[:5] = ', embeddings[:5])

embeddings len = 384
embeddings[:5] =  [ 0.02468892  0.10352131  0.0275264  -0.08551715 -0.01412829]
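
Because `get_embeddings` passes `normalize_embeddings=True`, every embedding has unit length. For unit-length vectors the inner product equals the cosine similarity, which is why the vector search later in this notebook can use the inner-product (`IP`) metric. The sketch below illustrates this with small hand-written vectors in plain Python. The helper names are illustrative only.

```python
import math

def normalize(vec):
    """Scale a vector to unit length (L2 norm of 1)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def inner_product(a, b):
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Inner product divided by the product of the two vector lengths."""
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return inner_product(a, b) / (norm_a * norm_b)

a = [3.0, 4.0]
b = [4.0, 3.0]

ip_normalized = inner_product(normalize(a), normalize(b))
cos = cosine_similarity(a, b)

# Both values are approximately 0.96
print(ip_normalized, cos)
```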


## Step-5: Vector Search and RAG

In [6]:
# Get relevant documents using vector / semantic search

def fetch_relevant_documents (query : str) :
    search_res = milvus_client.search(
        collection_name=MY_CONFIG.COLLECTION_NAME,
        data = [get_embeddings(query)], # Use the `get_embeddings` function to convert the question to an embedding vector
        limit=3,  # Return top 3 results
        search_params={"metric_type": "IP", "params": {}},  # Inner product distance
        output_fields=["text"],  # Return the text field
    )
    # print (search_res)

    retrieved_docs_with_distances = [
        {'text': res["entity"]["text"], 'distance' : res["distance"]} for res in search_res[0]
    ]
    return retrieved_docs_with_distances
## --- end ---
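
`fetch_relevant_documents` reads a nested result: one inner list of hits per query vector, where each hit carries a `distance` and an `entity` with the requested fields. The sketch below uses mock data in that shape (the texts and scores are made up) and adds a hypothetical `min_score` filter, which is not part of the original function. With the `IP` metric on normalized embeddings, a higher `distance` value means a closer match.

```python
# Mock result in the shape that fetch_relevant_documents reads:
# a list with one inner list of hits per query vector.
mock_search_res = [[
    {"distance": 0.55, "entity": {"text": "Granite Code Instruct models are trained on permissively licensed data."}},
    {"distance": 0.41, "entity": {"text": "Attention maps a query and key-value pairs to an output."}},
    {"distance": 0.12, "entity": {"text": "An unrelated passage."}},
]]

def extract_docs(search_res, min_score=0.0):
    """Flatten the hits for the first (only) query and drop weak matches."""
    hits = search_res[0]
    return [
        {"text": hit["entity"]["text"], "distance": hit["distance"]}
        for hit in hits
        if hit["distance"] >= min_score  # higher inner product = more similar
    ]

docs = extract_docs(mock_search_res, min_score=0.3)
for doc in docs:
    print(doc["distance"], doc["text"])
```

A threshold like this is one possible way to avoid sending clearly unrelated passages to the LLM; a suitable value depends on the embedding model and the data.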


In [7]:
# test relevant vector search
import json
import pprint

question = "What was the training data used to train Granite models?"
relevant_docs = fetch_relevant_documents(question)
pprint.pprint(relevant_docs, indent=4)

[   {   'distance': 0.5530709028244019,
        'text': '## 5 Instruction Tuning\n'
                '\n'
                'Finetuning code LLMs on a variety of tasks explained via '
                'instructions has been shown to improve model usability and '
                'general performance. While there has been much progress in '
                'code instruction tuning, most of them adopt synthetically '
                'generated data from OpenAI models, which limits the model use '
                'in many enterprise applications. Thus, following OctoCoder '
                '(Muennighoff et al., 2023), we use only a combination of '
                'permissively licensed data, with an aim to enhance '
                'instruction following capabilities of our models, including '
                'logical reasoning and problem-solving skills. Speciﬁcally, '
                'Granite Code Instruct models are trained on the following '
                'types of data.\n'
            

## Step-6: Initialize LLM

### LLM Choices at Replicate


| Model                               | Publisher | Params | Description                                          |
|-------------------------------------|-----------|--------|------------------------------------------------------|
| ibm-granite/granite-3.0-8b-instruct | IBM       | 8 B    | IBM's newest Granite Model v3.0  (default)           |
| ibm-granite/granite-3.0-2b-instruct | IBM       | 2 B    | IBM's newest Granite Model v3.0                      |
| meta/meta-llama-3.1-405b-instruct   | Meta      | 405 B  | Meta's flagship 405 billion parameter language model |
| meta/meta-llama-3-8b-instruct       | Meta      | 8 B    | Meta's 8 billion parameter language model            |
| meta/meta-llama-3-70b-instruct      | Meta      | 70 B   | Meta's 70 billion parameter language model           |

References 

- https://www.ibm.com/granite
- https://www.llama.com/
- https://replicate.com/  

In [8]:
import os
os.environ["REPLICATE_API_TOKEN"] = MY_CONFIG.REPLICATE_API_TOKEN

print ('Using model:', MY_CONFIG.LLM_MODEL)

Using model: ibm-granite/granite-3.0-8b-instruct


In [9]:
import replicate

def ask_LLM (question, relevant_docs):
    context = "\n".join(
        [doc['text'] for doc in relevant_docs]
    )
    
    max_new_tokens = 1024
    
    ## Truncate the context so we don't overshoot the context window (note: this slices by characters, not tokens)
    context = context[:(MY_CONFIG.MAX_CONTEXT_WINDOW - max_new_tokens - 100)]
    # print ("context length:", len(context))
    # print ('============ context (this is the context supplied to LLM) ============')
    # print (context)
    # print ('============ end  context ============', flush=True)

    system_prompt = """
    Human: You are an AI assistant. You are able to find answers to the questions from the contextual passage snippets provided.
    """
    user_prompt = f"""
    Use the following pieces of information enclosed in <context> tags to provide an answer to the question enclosed in <question> tags.
    <context>
    {context}
    </context>
    <question>
    {question}
    </question>
    """
    # print ("user_prompt length:", len(user_prompt))

    print ('============ here is the answer from LLM =====')
    # The model streams its output as it is generated.
    # Note: `stop_sequences` and `prompt_template` below use Llama 3 chat tokens
    # and may need adjusting for other models.
    for event in replicate.stream(
        MY_CONFIG.LLM_MODEL,
        input={
            "top_k": 1,
            "top_p": 0.95,
            "prompt": user_prompt,
            #"max_tokens": MY_CONFIG.MAX_CONTEXT_WINDOW,
            "temperature": 0.1,
            "system_prompt": system_prompt,
            "length_penalty": 1,
            "max_new_tokens": max_new_tokens,
            "stop_sequences": "<|end_of_text|>,<|eot_id|>",
            "prompt_template": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
            "presence_penalty": 0,
            "log_performance_metrics": False
        },
    ):
        print(str(event), end="")
    ## ---
    print ('\n======  end LLM answer ======\n', flush=True)
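
`ask_LLM` trims the joined context with a character slice, which can cut the last document mid-sentence. As an alternative sketch, the hypothetical helper `build_context` below (not part of the notebook) keeps only whole documents that fit a character budget and falls back to slicing when even the first document is too large. Character counts are only a rough proxy if the context window is measured in tokens.

```python
def build_context(docs, max_chars):
    """Join document texts with newlines, keeping only whole documents that fit."""
    parts = []
    used = 0
    for doc in docs:
        text = doc["text"]
        # +1 accounts for the newline separator between documents
        extra = len(text) + (1 if parts else 0)
        if used + extra > max_chars:
            break
        parts.append(text)
        used += extra
    if not parts and docs:
        # Even the first document is too long: fall back to a plain slice
        return docs[0]["text"][:max_chars]
    return "\n".join(parts)

docs = [{"text": "a" * 40}, {"text": "b" * 40}, {"text": "c" * 40}]
context = build_context(docs, max_chars=100)
print(len(context))  # 81: two 40-character documents plus one newline
```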


## Step-7: Query

In [10]:
%%time

question = "What was the training data used to train Granite models?"
relevant_docs = fetch_relevant_documents(question)
ask_LLM(question=question, relevant_docs=relevant_docs)

The Granite Code Instruct models were trained on a combination of permissively licensed data, including the Code Commits Dataset (CommitPackFT) and Math Datasets (MathInstruct and MetaMathQA). Additionally, they were trained on Code Instruction Datasets such as Glaive-Code-Assistant-v3, Self-OSS-Instruct-SC2, Glaive-Function-Calling-v2, and NL2SQL.

CPU times: user 78.4 ms, sys: 12.3 ms, total: 90.6 ms
Wall time: 3.04 s


In [11]:
%%time

question = "What is attention mechanism?"
relevant_docs = fetch_relevant_documents(question)
ask_LLM(question=question, relevant_docs=relevant_docs)

The attention mechanism is a method that allows a model to focus on specific parts of the input when producing an output. It maps a query and a set of key-value pairs to an output, where the output is computed as a weighted sum of the values, and the weight assigned to each value is determined by a compatibility function of the query with the corresponding key. In the context of the Transformer model, attention is used in three ways: encoder-decoder attention layers, self-attention layers in the encoder, and self-attention layers in the decoder.

CPU times: user 43 ms, sys: 13.7 ms, total: 56.7 ms
Wall time: 1.22 s


In [12]:
%%time

question = "When was the moon landing?"
relevant_docs = fetch_relevant_documents(question)
ask_LLM(question=question, relevant_docs=relevant_docs)

I'm sorry, the provided context does not contain information about the moon landing.

CPU times: user 29 ms, sys: 7.71 ms, total: 36.7 ms
Wall time: 1.07 s
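
The three queries above follow the same two-step pattern: retrieve, then ask. A minimal sketch of a wrapper, assuming a hypothetical helper `rag_query` that takes the two steps as arguments. Stand-in functions are used so the example runs without Milvus or Replicate.

```python
def rag_query(question, fetch, ask):
    """Retrieve documents for a question, then pass both to the LLM step."""
    relevant_docs = fetch(question)
    return ask(question=question, relevant_docs=relevant_docs)

# Stand-ins so the sketch runs without a vector database or an LLM
def fake_fetch(question):
    return [{"text": "Milvus is a vector database.", "distance": 0.9}]

def fake_ask(question, relevant_docs):
    return f"{question} -> {len(relevant_docs)} document(s) in context"

result = rag_query("What is Milvus?", fetch=fake_fetch, ask=fake_ask)
print(result)
```

In this notebook the call would be `rag_query(question, fetch=fetch_relevant_documents, ask=ask_LLM)`. Since `ask_LLM` prints the answer rather than returning it, the return value would be `None`.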
