## RAG example with LangChain, Milvus, and vLLM

Requirements:
- A Milvus instance, either standalone or cluster.
- Connection credentials for Milvus must be available as the environment variables MILVUS_USERNAME and MILVUS_PASSWORD.
- A vLLM inference endpoint. This example uses its OpenAI-compatible API; the endpoint URL and API key are read from the environment variables API_URL and API_KEY.
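
Before running the notebook, it can help to confirm that the expected environment variables are set. The following is a minimal sketch: the helper name `missing_env_vars` is hypothetical, and the variable list is taken from the parameter cell further down, which reads API_URL and API_KEY in addition to the Milvus credentials.

```python
import os

# Variable names as read by the parameter cell of this notebook.
REQUIRED_ENV_VARS = ["API_URL", "API_KEY", "MILVUS_USERNAME", "MILVUS_PASSWORD"]


def missing_env_vars(names, environ=os.environ):
    """Return the names that are unset or empty in `environ` (hypothetical helper)."""
    return [name for name in names if not environ.get(name)]


missing = missing_env_vars(REQUIRED_ENV_VARS)
if missing:
    print("Missing environment variables: " + ", ".join(missing))
```

If the list is non-empty, set the variables in the workbench or shell before continuing, since `os.getenv` silently returns `None` for unset names.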

### Needed packages and imports

In [1]:
!pip install -q einops==0.7.0 langchain==0.1.9 pymilvus==2.3.6
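# Alternative install lines, for environments where openai and sentence-transformers are not already available:
#!pip install -q einops==0.7.0 langchain==0.1.9 pymilvus==2.3.6 openai==1.13.3
#!pip install einops==0.7.0 langchain==0.1.9 pymilvus==2.3.6 sentence-transformers==2.4.0 openai==1.13.3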




In [2]:
import os
from langchain.callbacks.base import BaseCallbackHandler
from langchain.chains import RetrievalQA
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain_community.llms import VLLMOpenAI
from langchain.prompts import PromptTemplate
from langchain_community.vectorstores import Milvus

#### Base parameters, inference server and Milvus info

In [3]:
# Replace values according to your inference server and Milvus deployment
INFERENCE_SERVER_URL = os.getenv('API_URL')
MODEL_NAME = "mistral-7b-instruct"
API_KEY = os.getenv('API_KEY')
# API_KEY = "Empty"
MAX_TOKENS = 1024
TOP_P = 0.95
TEMPERATURE = 0.01
PRESENCE_PENALTY = 1.03
MILVUS_HOST = "vectordb-milvus.milvus.svc.cluster.local"
MILVUS_PORT = 19530
MILVUS_USERNAME = os.getenv('MILVUS_USERNAME')
MILVUS_PASSWORD = os.getenv('MILVUS_PASSWORD')
MILVUS_COLLECTION = "collection_nomicai_embeddings"

#### Initialize the connection

In [4]:
model_kwargs = {'trust_remote_code': True}
embeddings = HuggingFaceEmbeddings(
    model_name="nomic-ai/nomic-embed-text-v1",
    model_kwargs=model_kwargs,
    show_progress=False
)

store = Milvus(
    embedding_function=embeddings,
    connection_args={"host": MILVUS_HOST, "port": MILVUS_PORT, "user": MILVUS_USERNAME, "password": MILVUS_PASSWORD},
    collection_name=MILVUS_COLLECTION,
    metadata_field="metadata",
    text_field="page_content",
    drop_old=False
    )

You try to use a model that was created with version 2.4.0.dev0, however, your version is 2.4.0. This might cause unexpected behavior or errors. In that case, try to update to the latest version.



<All keys matched successfully>


#### Initialize query chain

In [5]:
template="""<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant named HatBot answering questions.
You will be given a question you need to answer, and a context to provide you with information. You must answer the question based as much as possible on this context.
Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>

Context: 
{context}

Question: {question} [/INST]
"""

QA_CHAIN_PROMPT = PromptTemplate.from_template(template)

# VLLMOpenAI (imported above) targets the OpenAI-compatible completions API served by vLLM
llm = VLLMOpenAI(
    openai_api_key=API_KEY,
    openai_api_base=INFERENCE_SERVER_URL,
    model_name=MODEL_NAME,
    max_tokens=MAX_TOKENS,
    top_p=TOP_P,
    temperature=TEMPERATURE,
    presence_penalty=PRESENCE_PENALTY,
    streaming=True,
    verbose=False,
    callbacks=[StreamingStdOutCallbackHandler()]
)

qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=store.as_retriever(
        search_type="similarity",
        search_kwargs={"k": 4}
    ),
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT},
    return_source_documents=True
)

os.environ["TOKENIZERS_PARALLELISM"] = "false"
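
To make the role of the `{context}` and `{question}` placeholders concrete, the sketch below mimics in plain Python what the chain does before calling the model. It assumes the default "stuff" chain type, which concatenates the text of the retrieved chunks (separated by blank lines) into `{context}`. The `build_prompt` helper, the shortened template, and the sample chunks are hypothetical; in the real chain the chunks come from the Milvus retriever.

```python
# Hypothetical stand-ins for the k=4 chunks the retriever would return.
chunks = ["First retrieved passage.", "Second retrieved passage."]


def build_prompt(template, chunks, question):
    """Join the chunks with blank lines and fill the template placeholders."""
    context = "\n\n".join(chunks)
    return template.format(context=context, question=question)


# Shortened template for illustration; the notebook uses the full one above.
demo_template = "Context:\n{context}\n\nQuestion: {question}"
prompt = build_prompt(demo_template, chunks, "How can I use an accelerator profile?")
print(prompt)
```

This is only an illustration of the prompt assembly; `RetrievalQA` performs the retrieval, formatting, and model call itself.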

### Query the LLM without RAG

In [6]:
# VLLMOpenAI, StreamingStdOutCallbackHandler and PromptTemplate are already imported above
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferMemory

template="""<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always be as helpful as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>

Current conversation:
{history}
Human: {input}
AI:
[/INST]
"""
PROMPT = PromptTemplate(input_variables=["history", "input"], template=template)

memory=ConversationBufferMemory()

conversation = ConversationChain(
    llm=llm,
    prompt=PROMPT,
    verbose=False,
    memory=memory
)

question = "How can I use an accelerator profile?"
conversation.predict(input=question);

An accelerator profile is a setting in some software applications or devices that can enhance the performance of the system. The specific steps to use an accelerator profile may vary depending on the software or device you're using. Here's a general idea of how you might use one:

1. Locate the settings or preferences menu within your software or device. This is usually found in the main menu or under the gear icon.

2. Look for a section labeled "Performance," "Performance Settings," or something similar.

3. In this section, you should find options for different performance profiles. These might be labeled as "Balanced," "High Performance," "Ultra," etc.

4. Select the "High Performance" or "Ultra" option to enable the accelerator profile. This will typically increase the speed and responsiveness of your software or device, but may also consume more resources and potentially drain battery life faster.

5. Save your changes and test the performance of your software or device to see if

### Query example with RAG

In [7]:
question = "How can I use an accelerator profile?"
result = qa_chain.invoke({"query": question})


To use an accelerator profile in OpenShift AI, you need to follow these steps:

1. First, ensure that your OpenShift instance contains an associated accelerator. If it's a new accelerator, you'll need to configure an accelerator profile for the accelerator in context. You can create an accelerator profile from the "Settings" page on the OpenShift AI dashboard, under the "Accelerator profiles" section.

2. If you have upgraded your OpenShift AI to version 2.13 or later and your instance already has an accelerator, its accelerator profile will be preserved after the upgrade. No additional action is required for existing accelerators.

3. For Intel Gaudi AI accelerators, you'll need to install the necessary dependencies and the version of the HabanaAI Operator that matches the Habana version of the HabanaAI workbench image in your deployment. You can find more information about this process in the resources provided: "HabanaAI Operator v1.10 for OpenShift" and "HabanaAI Operator v1.13 for

#### Retrieve source

In [8]:
def remove_duplicates(input_list):
    """Return the unique 'source' metadata values of the documents, in first-seen order."""
    unique_list = []
    for item in input_list:
        if item.metadata['source'] not in unique_list:
            unique_list.append(item.metadata['source'])
    return unique_list

results = remove_duplicates(result['source_documents'])

for s in results:
    print(s)

https://access.redhat.com/documentation/en-us/red_hat_openshift_ai_self-managed/2.13/html-single/managing_resources/index
https://access.redhat.com/documentation/en-us/red_hat_openshift_ai_self-managed/2.13/html-single/upgrading_openshift_ai_self-managed_in_a_disconnected_environment/index
https://access.redhat.com/documentation/en-us/red_hat_openshift_ai_self-managed/2.13/html-single/installing_and_uninstalling_openshift_ai_self-managed_in_a_disconnected_environment/index
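
The same order-preserving deduplication can be written more compactly with `dict.fromkeys`, which keeps the first occurrence of each key. This is an optional sketch: `unique_sources` is a hypothetical name, and the sample objects and example.com URLs only mimic the `.metadata` attribute of the returned documents.

```python
from types import SimpleNamespace


def unique_sources(documents):
    """Return each document's 'source' metadata once, in first-seen order."""
    return list(dict.fromkeys(doc.metadata["source"] for doc in documents))


# Hypothetical documents standing in for result['source_documents'].
docs = [
    SimpleNamespace(metadata={"source": "https://example.com/a"}),
    SimpleNamespace(metadata={"source": "https://example.com/b"}),
    SimpleNamespace(metadata={"source": "https://example.com/a"}),
]
print(unique_sources(docs))
```

For the handful of documents returned here either version is fine; the dictionary-based one avoids the repeated list scans if `k` is raised substantially.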
