# Create a RAG Application Using MyScale and BentoML

Retrieval-augmented generation (RAG) enhances AI applications such as chatbots and recommendation systems by combining vector databases with Large Language Models (LLMs). Choosing the right LLM means weighing cost, privacy, and scalability. Commercial LLMs such as OpenAI's GPT-4 and Google's Gemini are effective, but they can be expensive and raise data privacy concerns. Open-source LLMs offer flexibility and cost savings but require significant resources for fine-tuning and deployment.

A more efficient option is to deploy open-source LLMs in the cloud, which provides the necessary computational power and scalability without the high costs and complexity of local hosting. This approach reduces upfront infrastructure expenses and maintenance overhead. Let's build a RAG application using a cloud-hosted open-source LLM and a scalable vector database.

## Setting up the Environment

To install the required dependencies, run the following command in your terminal:
```bash
pip install bentoml langchain langchain-community langchain-text-splitters wikipedia clickhouse-connect pandas
```

## Load the Data
We begin by importing the `WikipediaLoader` from the `langchain_community.document_loaders.wikipedia` module. We use this loader to fetch documents related to "Albert Einstein" from Wikipedia.

In [None]:
from langchain_community.document_loaders.wikipedia import WikipediaLoader
loader = WikipediaLoader(query="Albert Einstein")

# Load the documents
docs = loader.load()

# Display the content of the first document
print(docs[0].page_content)

## Split the Text into Chunks
We import the `CharacterTextSplitter` from `langchain_text_splitters`. We first join the content of all pages into a single string and then split that string into manageable, overlapping chunks.

In [None]:
from langchain_text_splitters import CharacterTextSplitter
# Split the text into chunks
text = ' '.join([page.page_content.replace('\t', ' ') for page in docs])
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=400,
    chunk_overlap=100,
    length_function=len,
    is_separator_regex=False,
)
texts = text_splitter.create_documents([text])
splits = [item.page_content for item in texts]
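
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size chunking with overlap in plain Python. It is a simplification and not LangChain's implementation: `CharacterTextSplitter` first splits on the separator and then merges the pieces, so its chunks end at newlines and vary in length. The helper name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int = 400, chunk_overlap: int = 100) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# A 1000-character stand-in document (assumption: any string works here)
sample = "".join(str(i % 10) for i in range(1000))
chunks = sliding_chunks(sample)
print(len(chunks), [len(c) for c in chunks])
```

Each chunk repeats the last 100 characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.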

## Deploy the Models on BentoML


Our data is ready, and the next step is to deploy the models on BentoML and use them in our RAG application. We will deploy the LLM first. For that, you need to create an account on [BentoML](https://cloud.bentoml.com/) if you don't have one already. After that, navigate to the deployments section and click on the "Create Deployment" button at the top right corner. A new page will open that looks like this:

<img src="../assets/opening-page.png" height="500" width="1100"/>

Select the "bentoml/bentovllm-llama3-8b-instruct-service" model from the drop-down and click "Submit" at the bottom right corner. This starts the deployment and opens a new page like this:


<img src="../assets/deploy-the-llm.PNG" height="500" width="1100"/>

The deployment can take some time. Once it is deployed, copy the endpoint.

**Note**: BentoML's free tier only allows the deployment of a single model. If you have upgraded your plan and can deploy more than one model, follow the steps below. If not, you can skip them: we will use an open-source model locally for embeddings.

The process of deploying the embedding model is very similar to deploying the LLM. You need to follow these steps again:

1. Go to the deployments page.
2. Click on the "Create Deployment" button.
3. Select the `sentence-transformers` model from the list and click "Submit."
4. Once the deployment is complete, copy the endpoint.

Next, go to the API Tokens page and generate a new API key. Now, you are ready to use the deployed models in your RAG application.


## Defining the Embeddings Method
We define `get_embeddings` to generate text embeddings. If an endpoint and API token are provided, it calls the embedding service deployed on BentoML; otherwise, it runs the `sentence-transformers/all-MiniLM-L6-v2` model locally with `transformers` and `torch`.

In [None]:
import subprocess
import sys
import numpy as np

def install(package):
    subprocess.check_call([sys.executable, "-m", "pip", "install", package])

def get_embeddings(texts: list, BENTO_EMBEDDING_MODEL_END_POINT=None, BENTO_API_TOKEN=None) -> list:
    
    if BENTO_EMBEDDING_MODEL_END_POINT and BENTO_API_TOKEN:
        import bentoml
        embedding_client = bentoml.SyncHTTPClient(BENTO_EMBEDDING_MODEL_END_POINT, token=BENTO_API_TOKEN)
        return embedding_client.encode(sentences=texts).tolist()
    else:
        # Install transformers and torch if not already installed
        try:
            import transformers
        except ImportError:
            install("transformers")
        try:
            import torch
        except ImportError:
            install("torch")
        
        from transformers import AutoTokenizer, AutoModel
        
        # Initialize the tokenizer and model for embeddings
        tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
        model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
        
        inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt", max_length=512)
        with torch.no_grad():
            outputs = model(**inputs)
        embeddings = outputs.last_hidden_state.mean(dim=1)
        return embeddings.numpy().tolist()
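
The local branch above averages `last_hidden_state` over every token position, including the padding positions that `padding=True` adds to shorter texts in a batch. A common alternative for sentence-embedding models is to weight the average by the attention mask so that padding is ignored. The sketch below shows the idea in plain Python, with nested lists standing in for tensors; the function name `masked_mean_pool` and the toy values are assumptions for illustration only.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average token vectors per sequence, skipping positions where the mask is 0."""
    pooled = []
    for vectors, mask in zip(token_vectors, attention_mask):
        dim = len(vectors[0])
        kept = [vec for vec, m in zip(vectors, mask) if m == 1]
        count = max(len(kept), 1)  # guard against a row that is all padding
        pooled.append([sum(vec[d] for vec in kept) / count for d in range(dim)])
    return pooled


# One sequence with three positions; the last position is padding.
token_vectors = [[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]]
attention_mask = [[1, 1, 0]]
print(masked_mean_pool(token_vectors, attention_mask))  # [[2.0, 3.0]]
```

With a plain mean, the padding vector would pull the result far away from the two real tokens. Whether this matters in practice depends on how much the texts in a batch differ in length.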

## Get the Embeddings
We iterate over the text chunks (`splits`) in batches of 25 to generate embeddings using the `get_embeddings` function defined earlier.

In [None]:
all_embeddings = []
for i in range(0, len(splits), 25):
    batch = splits[i:i+25]
    embeddings_batch = get_embeddings(batch)
    all_embeddings.extend(embeddings_batch)
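
The batching pattern in this loop can be factored into a small generator. This is a minimal sketch; the helper name `iter_batches` is hypothetical, and the stand-in list replaces the real `splits` so the example runs on its own.

```python
from typing import Iterator, Sequence


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Stand-in for the real text chunks
example_chunks = [f"chunk {i}" for i in range(60)]
sizes = [len(batch) for batch in iter_batches(example_chunks, 25)]
print(sizes)  # [25, 25, 10]
```

The last batch is simply shorter when the number of chunks is not a multiple of the batch size, which is also how the slice in the loop above behaves.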

## Creating a DataFrame
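Now, create a pandas DataFrame to store the text chunks and their corresponding embeddings, along with an integer `id` for each row that matches the `id` column of the table we create later.

In [None]:
import pandas as pd
df = pd.DataFrame({
    'id': range(len(splits)),
    'page_content': splits,
    'embeddings': all_embeddings
})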

## Connecting to MyScaleDB

The knowledge base is complete, and it is time to save the data to the vector database. Here, we use MyScaleDB for vector storage. To start a MyScaleDB cluster in the cloud, follow the [quickstart guide](https://myscale.com/docs/en/quickstart/#how-to-launch-your-first-cluster). After that, we establish a connection to MyScaleDB using the `clickhouse_connect` library.

In [None]:
import clickhouse_connect
client = clickhouse_connect.get_client(
    host='your-host-name',
    port=443,
    username='your-user-name',
    password='your-password'
)
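
Instead of hardcoding credentials in the notebook, you may prefer to read them from environment variables. The sketch below builds the keyword arguments for `clickhouse_connect.get_client`. The variable names `MYSCALE_HOST`, `MYSCALE_PORT`, `MYSCALE_USERNAME`, and `MYSCALE_PASSWORD` are assumptions chosen for this example, not names that MyScaleDB defines.

```python
import os


def myscale_connection_kwargs(env=os.environ) -> dict:
    """Collect connection settings from an environment-like mapping."""
    return {
        "host": env["MYSCALE_HOST"],
        "port": int(env.get("MYSCALE_PORT", "443")),  # 443 matches the cell above
        "username": env["MYSCALE_USERNAME"],
        "password": env["MYSCALE_PASSWORD"],
    }


# Stand-in environment so the example runs without real credentials
fake_env = {
    "MYSCALE_HOST": "your-host-name",
    "MYSCALE_USERNAME": "your-user-name",
    "MYSCALE_PASSWORD": "your-password",
}
kwargs = myscale_connection_kwargs(fake_env)
print(kwargs["host"], kwargs["port"])
# With real variables set: client = clickhouse_connect.get_client(**kwargs)
```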

## Creating a Table and Inserting Data
Create a table in MyScaleDB to store the text chunks and embeddings. The table schema includes an ID, the page content, and the embeddings.

In [None]:
client.command("""
CREATE TABLE IF NOT EXISTS default.RAG (
    id Int64,
    page_content String,
    embeddings Array(Float32),
    CONSTRAINT check_data_length CHECK length(embeddings) = 384
) ENGINE = MergeTree()
    ORDER BY id
""")

# Insert data into the table
batch_size = 100
num_batches = (len(df) + batch_size - 1) // batch_size

for i in range(num_batches):
    batch_data = df[i * batch_size: (i + 1) * batch_size]
    client.insert('default.RAG', batch_data.values.tolist(), column_names=batch_data.columns.tolist())
    print(f"Batch {i+1}/{num_batches} inserted.")
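
The `CHECK` constraint in the table requires every embedding to have exactly 384 values, which is the output dimension of `all-MiniLM-L6-v2`. A quick check before inserting can point to the offending rows if a different embedding model was used. This is a minimal sketch; the helper name `find_bad_rows` is hypothetical.

```python
EXPECTED_DIM = 384  # must match the CHECK constraint in the table definition


def find_bad_rows(embeddings, expected_dim=EXPECTED_DIM):
    """Return the indices of embeddings whose length differs from expected_dim."""
    return [i for i, vec in enumerate(embeddings) if len(vec) != expected_dim]


# Stand-in data: the second vector is one value short
sample_embeddings = [[0.0] * 384, [0.0] * 383, [0.0] * 384]
print(find_bad_rows(sample_embeddings))  # [1]
```

In the notebook you could call `find_bad_rows(all_embeddings)` and expect an empty list before running the insert loop.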


## Creating a Vector Index
The next step is to add a vector index to the embeddings column in the RAG table. The vector index allows for efficient similarity searches, which are essential for retrieval-augmented generation tasks.

In [None]:
client.command("""
ALTER TABLE default.RAG
    ADD VECTOR INDEX vector_index embeddings
    TYPE MSTG
""")

## Retrieving Relevant Vectors
Let's define a function to retrieve relevant documents for a user query. It generates the query embedding with the `get_embeddings` function and runs a SQL vector search to find the closest matches in the database.

In [None]:
def get_relevant_docs(user_query, top_k):
    # get_embeddings expects a list of texts, so wrap the single query
    query_embeddings = get_embeddings([user_query])[0]
    results = client.query(f"""
        SELECT page_content,
        distance(embeddings, {query_embeddings}) as dist FROM default.RAG ORDER BY dist LIMIT {top_k}
    """)
    # Join the retrieved chunks with newlines so they stay separated in the prompt
    return "\n".join(row["page_content"] for row in results.named_results())

# Example query
message = "Who is Albert Einstein?"
relevant_docs = get_relevant_docs(message, 8)
print(relevant_docs)
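
Conceptually, the `ORDER BY dist LIMIT top_k` query ranks every stored chunk by its distance to the query embedding and keeps the closest ones; the vector index makes this fast by avoiding a full scan. The brute-force sketch below shows the same ranking in plain Python. It assumes Euclidean distance, while the metric the database actually uses depends on how the vector index is configured. The function name and the toy two-dimensional vectors are illustrative only.

```python
import math


def top_k_by_distance(query, rows, top_k):
    """rows is a list of (page_content, embedding) pairs; return the closest texts."""
    scored = [(math.dist(query, embedding), text) for text, embedding in rows]
    scored.sort(key=lambda pair: pair[0])  # smallest distance first
    return [text for _, text in scored[:top_k]]


rows = [
    ("physics", [1.0, 0.0]),
    ("music", [0.0, 1.0]),
    ("relativity", [0.9, 0.1]),
]
print(top_k_by_distance([1.0, 0.0], rows, 2))  # ['physics', 'relativity']
```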

## Connecting to BentoML LLM
Let’s establish a connection to our hosted LLM on BentoML. The `llm_client` object will be used to interact with the LLM for generating responses based on the retrieved documents.

In [None]:
import bentoml
# Replace this with the endpoint of your own deployment
BENTO_LLM_END_POINT = "https://bentovllm-llama-3-8-b-insruct-service-cffa-88f11f2e.mt-guc1.bentoml.ai"

llm_client = bentoml.SyncHTTPClient(BENTO_LLM_END_POINT, token="your_bento_token")

## Performing RAG
Define a function to perform RAG. The function takes a user question and the retrieved context as input. It constructs a prompt for the LLM, instructing it to answer the question based on the provided context. The response from the LLM is then returned as the answer.

In [None]:
def dorag(question: str, context: str):
    
    prompt = (f"You are a helpful assistant. The user has a question. Answer the user question based only on the context: {context}. \n"
              f"The user question is {question}")
    
    results = llm_client.generate(
        max_tokens=1024,
        prompt=prompt,
    )
    
    res = ""
    for result in results:
        res += result
    return res
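
Separating prompt construction from the LLM call makes the prompt easy to inspect without a live deployment. The sketch below is one possible arrangement, not the only one. `build_prompt`, `collect_stream`, and `fake_generate` are hypothetical names, and `fake_generate` is a stand-in that assumes `llm_client.generate` yields the answer as a sequence of text pieces, as the loop in `dorag` suggests.

```python
from typing import Iterable, Iterator


def build_prompt(question: str, context: str) -> str:
    """Assemble the instruction, the retrieved context, and the question."""
    return (
        "You are a helpful assistant. The user has a question. "
        f"Answer the user question based only on the context: {context}. \n"
        f"The user question is {question}"
    )


def collect_stream(chunks: Iterable[str]) -> str:
    """Concatenate streamed text pieces into the full answer."""
    return "".join(chunks)


def fake_generate(prompt: str) -> Iterator[str]:
    """Stand-in for the hosted LLM so the example runs offline."""
    yield from ["Albert ", "Einstein ", "was a physicist."]


prompt = build_prompt("Who is Albert Einstein?", "Einstein developed the theory of relativity")
answer = collect_stream(fake_generate(prompt))
print(answer)  # Albert Einstein was a physicist.
```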

## Making a Query
Finally, we make a query to the RAG application. We ask the question "Who is Albert Einstein?" and use the `dorag` function to get the answer based on the relevant documents retrieved earlier. The output provides a detailed response to the question, demonstrating the effectiveness of the RAG setup.

In [None]:
query = "Who is Albert Einstein?"
dorag(question=query, context=relevant_docs)

The output of the query looks like this:

<img src="../assets/First-response.PNG"/>

When the RAG application was asked about the death of Albert Einstein, the response looked like this:

<img src="../assets/Second-response.PNG"/>