# Evaluating 3 RAG Chatbots using 🔭 Galileo Evaluate


In this tutorial, we'll build three RAG-based chatbots and evaluate their results in Galileo Evaluate.

This notebook pulls its data from the web and uses OpenAI as the LLM provider. Feel free to change these sources as you'd like.

## 1. Set-up of the environment

Let's start by installing the required libraries.

In [1]:
! pip install promptquality langchain langchain-community langchain_openai faiss-cpu openai ipywidgets



## 2. Set-up Galileo Clients

Next we will set up the Galileo Evaluate client. Here we also define the metrics we wish to evaluate the models on. For this lab we will be using 9 metrics. Feel free to change them as needed and play around.

You will need to enter 3 things:
 - GALILEO API KEY: This is the API key used to connect to the client. You can fetch it from the console.
 - OPENAI API KEY: This notebook uses OpenAI, so enter your OpenAI key here. If you are using some other model, you can skip this.
 - Project Name: Define a name for the project.

In [2]:
import os
import promptquality as pq

os.environ["GALILEO_CONSOLE_URL"] = "https://console.dev.rungalileo.io"
os.environ["GALILEO_API_KEY"] = ""  # Enter Galileo key here
os.environ["OPENAI_API_KEY"] = ""  # Enter Open AI Key here
pq.login()

Go to https://console.dev.rungalileo.io/get-token to generate a new Galileo token.
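
Hardcoding keys in a notebook makes them easy to leak when the notebook is shared. As a minimal sketch, assuming you run the notebook interactively, the keys could instead be prompted for only when they are not already set in the environment. The helper name `ensure_env` is hypothetical and is not part of promptquality or the OpenAI SDK.

```python
import getpass
import os


def ensure_env(name, prompt_fn=getpass.getpass):
    """Return os.environ[name], prompting for it only if it is unset or empty.

    `ensure_env` is a hypothetical helper written for this notebook.
    """
    value = os.environ.get(name, "")
    if not value:
        value = prompt_fn(f"Enter {name}: ")
        os.environ[name] = value
    return value


# Usage (commented out so the cell does not block in non-interactive runs):
# ensure_env("GALILEO_API_KEY")
# ensure_env("OPENAI_API_KEY")
```

If you adopt this pattern, call the helper before `pq.login()` so that the token is available when the client connects.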



In [2]:
from promptquality import EvaluateRun

PROJECT_NAME = "evaluate-rag-chatbot"
metrics = [
    pq.Scorers.context_adherence_luna,
    pq.Scorers.completeness_luna,
    pq.Scorers.correctness,
    pq.Scorers.chunk_attribution_utilization_luna,
    pq.Scorers.instruction_adherence_plus,
    pq.Scorers.sexist,
    pq.Scorers.tone,
    pq.Scorers.toxicity,
    pq.Scorers.prompt_injection,
]

## 3. Loading and Preparing Data

For this lab we will use a fictitious use case: a chatbot that answers questions about Galileo. A typical technique for building such a chatbot is Retrieval-Augmented Generation (RAG).

To build the chatbot, we will:
 - Fetch some documents from Galileo's website
 - Create some questions
 - Ask the chatbot those questions
 - Evaluate the chatbot's responses with the help of Galileo Evaluate

Let's start by downloading some Galileo documentation pages from the website.

In [None]:
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.vectorstores import FAISS

# Load data from a website URL
from langchain_community.document_loaders import WebBaseLoader

urls = [
    "https://docs.rungalileo.io/galileo",
    "https://docs.rungalileo.io/galileo/gen-ai-studio-products/galileo-guardrail-metrics/context-adherence",
    "https://docs.rungalileo.io/galileo/gen-ai-studio-products/galileo-guardrail-metrics/context-relevance",
]
loader = WebBaseLoader(urls)
documents = loader.load()

Now that the context documents have been downloaded, we will split them into smaller text chunks using the LangChain library. The CharacterTextSplitter divides the text into chunks of a specified size and can optionally overlap consecutive chunks so that sentences are less likely to be cut in half (the cell below sets `chunk_overlap=0`, meaning no overlap). When setting the chunk size, make sure the retrieved chunks fit into the context window of your LLM, and feel free to experiment with different chunk sizes.

In [4]:
# Split the text into chunks
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
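
To make the effect of `chunk_overlap` concrete, here is a simplified sketch of overlapping character windows in plain Python. It is not LangChain's implementation (CharacterTextSplitter splits on a separator first and then merges pieces); the function name `chunk_with_overlap` is hypothetical and only illustrates the idea.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows sharing `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghij"
print(chunk_with_overlap(sample, chunk_size=4, chunk_overlap=0))
# → ['abcd', 'efgh', 'ij']
print(chunk_with_overlap(sample, chunk_size=4, chunk_overlap=2))
# → ['abcd', 'cdef', 'efgh', 'ghij']
```

With overlap, each chunk repeats the tail of the previous one, which costs extra storage and tokens but keeps text near a boundary available in both neighbours.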

Let's have a look at the size of our data.

In [None]:
# Print size statistics for the loaded documents and the resulting chunks
def avg_doc_length(documents):
    """Average number of characters per document (integer division)."""
    return sum(len(doc.page_content) for doc in documents) // len(documents)


avg_char_count_pre = avg_doc_length(documents)
avg_char_count_post = avg_doc_length(texts)
print(
    f"Average length among {len(documents)} pages loaded is {avg_char_count_pre} characters."
)
print(f"After the split you have {len(texts)} chunks.")
print(f"Average length among {len(texts)} chunks is {avg_char_count_post} characters.")

Next we convert our chunks into embeddings and store them in a vector database. This is a common technique in RAG: instead of always passing all the documents to the LLM as context, we pull only the chunks that are most relevant to a given question and pass those to the LLM. This is achieved by running a semantic similarity search in the vector DB between the question embedding and the chunk embeddings. Passing concise, relevant information to the LLM helps improve its accuracy.

In [6]:
# Initialize OpenAI embeddings
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

# Create a vector store
vectorstore = FAISS.from_documents(texts, embeddings)
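
As a conceptual sketch of what a similarity search does, the snippet below ranks a few toy vectors against a query vector by cosine similarity. The vectors are made up for illustration, the function names are hypothetical, and non-zero vectors are assumed. FAISS performs this search with its own index structures and distance metric, so this only shows the idea and is not what the vector store runs internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vec, chunk_vecs, k=3):
    """Return the indices of the k chunks most similar to the query, best first."""
    scored = [
        (cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(chunk_vecs)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]


# Toy 2-dimensional "embeddings" (made up for illustration)
chunk_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
query_vec = [0.9, 0.1]
print(top_k_chunks(query_vec, chunk_vecs, k=2))  # → [0, 2]
```

Real embeddings have hundreds or thousands of dimensions, but the ranking step follows the same pattern: score every candidate against the query and keep the best `k`.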

## 4. Run Inference for 3 models with Galileo Evaluate

In [7]:
# Create a function to generate a response using Open AI

from openai import OpenAI

client = OpenAI()


def generate_response(prompt: str, history=None, model_name: str = "gpt-4o-mini"):
    # Use None instead of a mutable default list; fall back to an empty history
    history = history or []
    response = client.chat.completions.create(
        model=model_name,
        messages=history + [{"role": "user", "content": prompt}],
        max_tokens=512,
        temperature=1,
        top_p=1,
    )

    response_text = response.choices[0].message.content
    input_tokens = response.usage.prompt_tokens
    output_tokens = response.usage.completion_tokens
    total_tokens = response.usage.total_tokens

    return response_text, input_tokens, output_tokens, total_tokens
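
The request sent by `generate_response` is the prior conversation followed by the new user turn. The sketch below isolates that step without calling the API; `build_messages` is a hypothetical helper name used only for illustration.

```python
def build_messages(history, prompt):
    """Return a new messages list: prior turns followed by the new user turn."""
    return list(history) + [{"role": "user", "content": prompt}]


history = [{"role": "system", "content": "You are a helpful assistant."}]
messages = build_messages(history, "What does Galileo do?")
print(len(history), len(messages))  # → 1 2
```

Because a new list is built for each call, the caller's `history` is left unchanged, and it is up to the caller to append the user and assistant turns afterwards.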

If you want to type in your own questions on the fly for a richer chatbot experience, set `USE_PREDEFINED_QUESTIONS` to False. Otherwise the models will run on the pre-defined questions below.

In [8]:
USE_PREDEFINED_QUESTIONS = True

questions = [
    "What does Galileo do?",
    "What are some of the RAG Metrics Galileo provides?",
    "What is LUNA and where is it used?",
    "How does LUNA calculate context adherence?",
    "What is chainpoll?",
]
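
The choice between predefined and typed questions can be wrapped in a small helper that also signals when the chat should stop. This is a sketch under the assumption that typing "exit" ends an interactive session; `next_question` is a hypothetical name and not part of any library used here.

```python
def next_question(round_index, questions, use_predefined, ask=input):
    """Return the question for this round, or None when the chat should stop."""
    if use_predefined:
        if round_index >= len(questions):
            return None  # predefined list is exhausted
        return questions[round_index]
    typed = ask("You: ")
    return None if typed.strip().lower() == "exit" else typed


demo_questions = ["What does Galileo do?", "What is chainpoll?"]
print(next_question(0, demo_questions, use_predefined=True))
print(next_question(5, demo_questions, use_predefined=True))
```

Returning `None` once the predefined list runs out avoids an `IndexError` if the number of rounds is set higher than the number of questions.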


Here we define the models we want to evaluate. The system prompt for the LLM is set inside the inference loop further below.

In [9]:
# Evaluate models
models = [
    "gpt-4o-mini",
    "gpt-4o",
    "o1-mini",
]

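Now let's run the actual inference and log the information to Galileo! If you want to run the LLM chat longer, set the `max_rounds` variable accordingly. When `USE_PREDEFINED_QUESTIONS` is True, keep `max_rounds` at or below the number of entries in `questions`, since the loop indexes into that list.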

In [None]:
max_rounds = 5

# Iterate over all three models defined above
for model_name in models:
    rounds = 0
    evaluate_run = EvaluateRun(
        run_name=model_name, project_name=PROJECT_NAME, scorers=metrics
    )
    history = [{"role": "system", "content": "You are a helpful assistant."}]

    while rounds < max_rounds:
        question = questions[rounds] if USE_PREDEFINED_QUESTIONS else input("You: ")
        if question.lower() == "exit":
            break
        # Retrieve relevant documents from the vector store
        relevant_docs = vectorstore.similarity_search(question, k=3)
        context_list = [doc.page_content for doc in relevant_docs]
        context = " ".join(context_list)
        prompt = f"""Context: {context}

        Question: {question}

        Answer: """

        # Create your workflow to log to Galileo.
        wf = evaluate_run.add_workflow(
            input={"question": question, "model_name": model_name},
            name=model_name,
            metadata={"env": "demo"},
        )
        wf.add_retriever(
            input=question,
            documents=context_list,
            metadata={"env": "demo"},
            name=f"{model_name}_RAG",
        )

        # Generate the response with the updated history
        model_response, input_tokens, output_tokens, total_tokens = generate_response(
            prompt, history, model_name
        )

        # Add the current question to the history
        history.append({"role": "user", "content": question})
        # Update history with the new interaction
        history.append({"role": "assistant", "content": model_response})

        print("You: ", question)
        print(f"Assistant: {model_response}")
        print("*" * 100)

        # Log your llm call step to Galileo.
        wf.add_llm(
            input=prompt,
            output=model_response,
            model=model_name,
            input_tokens=input_tokens,
            output_tokens=output_tokens,
            total_tokens=total_tokens,
            metadata={"env": "demo"},
            name=f"{model_name}_QA",
        )

        # Conclude the workflow.
        wf.conclude(output={"output": model_response})
        rounds += 1
    evaluate_run.finish()
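
The triple-quoted f-string inside the loop carries the loop's indentation into the prompt, so the "Question:" and "Answer:" lines reach the model with leading spaces. As a minimal sketch, the prompt could be assembled by a small function instead; `build_rag_prompt` is a hypothetical helper, and the default separator mirrors the single space used to join chunks in the loop.

```python
def build_rag_prompt(context_chunks, question, separator=" "):
    """Join retrieved chunks and the question into one prompt string."""
    context = separator.join(context_chunks)
    return f"Context: {context}\n\nQuestion: {question}\n\nAnswer: "


demo_prompt = build_rag_prompt(
    ["Galileo evaluates LLM applications.", "Luna models power some metrics."],
    "What does Galileo do?",
)
print(demo_prompt)
```

Passing `separator="\n\n"` is one option worth trying if you want chunk boundaries to stay visible to the model.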

You can have a look at the final results in the console via the link generated for the project.

## Conclusion

Throughout this notebook, we have explored the process of creating and evaluating a chatbot for a QA-RAG application using OpenAI models, Python, and LangChain. We covered the essential steps: setting up the environment, loading and preparing context data, retrieving relevant context, generating answers, and logging to Galileo.