# RAGAS: Performance Evaluation of RAG Systems

## InsureLLM Company Question-Answering Chatbot
This project builds a low-cost, high-accuracy question-answering system for employees of InsureLLM, an insurance tech company. The chatbot acts as an expert knowledge worker, helping staff quickly find accurate answers to domain-specific queries. For reliability, the system uses Retrieval-Augmented Generation (RAG), which combines document retrieval with LLM generation so that responses are grounded in the retrieved context and relevant to the question.

The project then uses RAGAS metrics (faithfulness, answer relevancy, context precision, context recall, answer correctness) to assess answer quality automatically, and saves detailed results for analysis.

### Installing the Packages

In [2]:
pip install -U langchain-huggingface

Note: you may need to restart the kernel to use updated packages.


### Importing The Packages

In [3]:
import os
import glob
from dotenv import load_dotenv
import gradio as gr

In [4]:
# Document loading and chunking
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.schema import Document
# Vector store
from langchain_chroma import Chroma
import numpy as np
from sklearn.manifold import TSNE
import plotly.graph_objects as go
# Conversation chain and Hugging Face models
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain
from langchain_huggingface import HuggingFaceEndpoint, ChatHuggingFace, HuggingFaceEmbeddings

In [5]:
MODEL = "mistralai/Mistral-7B-Instruct-v0.3"
db_name = "vector_db"

In [6]:
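# Load environment variables from a file called .env

load_dotenv(override=True)
hf_token = os.getenv("HUGGINGFACEHUB_API_TOKEN")
if not hf_token:
    # Fail early with a clear message instead of a TypeError on assignment
    raise RuntimeError("HUGGINGFACEHUB_API_TOKEN is not set; add it to .env")
os.environ["HUGGINGFACEHUB_API_TOKEN"] = hf_token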


### Reading The Document from Our Knowledge Base

In [7]:
folders = glob.glob("knowledge-base/*")
text_loader_kwargs = {'encoding': 'utf-8'}

documents = []
for folder in folders:
    doc_type = os.path.basename(folder)
    loader = DirectoryLoader(folder, glob="**/*.md", loader_cls=TextLoader, loader_kwargs=text_loader_kwargs)
    folder_docs = loader.load()
    for doc in folder_docs:
        doc.metadata["doc_type"] = doc_type
        documents.append(doc)

### Creating Chunks

In [8]:
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = text_splitter.split_documents(documents)

Created a chunk of size 1088, which is longer than the specified 1000


In [9]:
len(chunks)

123
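
The warning in the previous cell is expected: `CharacterTextSplitter` splits only on its separator (a blank line by default) and then merges the pieces up to `chunk_size`, so a single paragraph longer than 1000 characters is kept as one oversized chunk. The sketch below imitates that merge-with-overlap behaviour in plain Python to make it concrete. It is a simplified illustration, not LangChain's implementation, and `split_with_overlap` is a hypothetical helper.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=200, separator="\n\n"):
    """Greedily merge separator-delimited pieces into chunks (illustrative only)."""
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks, current = [], []

    def size(parts):
        return len(separator.join(parts))

    for piece in pieces:
        if current and size(current + [piece]) > chunk_size:
            chunks.append(separator.join(current))
            # Keep trailing pieces as overlap, dropping from the front until the
            # carry-over fits the overlap budget and leaves room for the new piece.
            while current and (size(current) > chunk_overlap
                               or size(current + [piece]) > chunk_size):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


# Three short paragraphs: the middle one appears in both chunks (overlap).
short = "\n\n".join(["a" * 40, "b" * 40, "c" * 40])
print([len(c) for c in split_with_overlap(short, chunk_size=100, chunk_overlap=50)])
# [82, 82]

# A paragraph longer than chunk_size cannot be split further, so it stays oversized.
mixed = "\n\n".join(["a" * 40, "c" * 150, "d" * 40])
print([len(c) for c in split_with_overlap(mixed, chunk_size=100, chunk_overlap=50)])
# [40, 150, 40]
```

If oversized chunks are a problem, a splitter that falls back to smaller separators is one option to consider.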

### Documents in Knowledge Base

In [10]:
doc_types = set(chunk.metadata['doc_type'] for chunk in chunks)
print(f"Document types found: {', '.join(doc_types)}")

Document types found: company, employees, contracts, products


### Embeddings and "Auto-Encoding LLMs"

We map each chunk of text to a vector that represents its meaning, known as an embedding.
The embedding model used here, `sentence-transformers/all-MiniLM-L6-v2`, is an example of an "auto-encoding LLM", which produces an output from a complete input rather than generating text token by token.
Another example of an auto-encoding LLM is BERT from Google. In addition to embedding, auto-encoding LLMs are often used for classification.
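
To make "a vector that represents the meaning" concrete: texts with similar meaning should map to vectors that point in similar directions, and retrieval ranks chunks by their closeness to the query vector. Below is a minimal sketch using cosine similarity on made-up 3-dimensional vectors. The vectors, labels, and the `top_k` helper are illustrative assumptions, and the vector store may use a different distance measure internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=2):
    """Return the names of the k chunks most similar to the query (hypothetical helper)."""
    ranked = sorted(
        chunk_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Toy embeddings, invented for illustration only
toy_chunks = {
    "claims policy": [0.9, 0.1, 0.0],
    "employee bio": [0.0, 0.2, 0.9],
    "claims contract": [0.8, 0.3, 0.1],
}
toy_query = [1.0, 0.0, 0.0]

print(top_k(toy_query, toy_chunks))
# ['claims policy', 'claims contract']
```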


In [12]:
# Put the chunks of data into a vector store that associates a vector embedding with each chunk
# Chroma is a popular open-source vector database based on SQLite

# Import from langchain_huggingface; the langchain.embeddings path is deprecated
from langchain_huggingface import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Delete if already exists

if os.path.exists(db_name):
    Chroma(persist_directory=db_name, embedding_function=embeddings).delete_collection()

# Create vectorstore

vectorstore = Chroma.from_documents(documents=chunks, embedding=embeddings, persist_directory=db_name)
print(f"Vectorstore created with {vectorstore._collection.count()} documents")


Vectorstore created with 123 documents


In [13]:
# Get one vector and find how many dimensions it has

collection = vectorstore._collection
sample_embedding = collection.get(limit=1, include=["embeddings"])["embeddings"][0]
dimensions = len(sample_embedding)
print(f"The vectors have {dimensions:,} dimensions")

The vectors have 384 dimensions


## RAG Implementation

In [15]:
import os
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
from langchain_huggingface import HuggingFaceEndpoint, ChatHuggingFace

# Hugging Face Endpoint (Mistral Instruct = conversational model)
endpoint = HuggingFaceEndpoint(
    repo_id="mistralai/Mistral-7B-Instruct-v0.3",
    task="conversational",
    temperature=0.7,
    max_new_tokens=512,
    huggingfacehub_api_token=os.getenv("HUGGINGFACEHUB_API_TOKEN")
)

# Wrap in Chat interface
llm = ChatHuggingFace(llm=endpoint)

# Memory
memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True,
    output_key="answer" 
)

# Retriever (from vectorstore)
retriever = vectorstore.as_retriever()

# Conversation Chain
conversation_chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=retriever,
    memory=memory,
    return_source_documents=True,
    output_key="answer" 
)

# Test Query
query = "Can you describe Insurellm in a few sentences?"
result = conversation_chain.invoke({"question": query})
print(result["answer"])



 Insurellm is an innovative insurance tech startup founded by Avery Lancaster in 2015. It offers a range of services including AI-powered risk assessment, dynamic pricing, instant claim processing, predictive maintenance alerts, multi-channel integration, a customer portal, and comprehensive support for effective onboarding and 24/7 technical assistance. Insurellm aims to transform the home insurance landscape by combining innovation and reliability.


### Sample Query

In [16]:
query = "Can you describe Insurellm in a few sentences?"
result = conversation_chain.invoke({"question": query})
print(result["answer"])

 Insurellm is an insurance tech startup founded by Avery Lancaster in 2015. It was designed to disrupt the insurance industry with innovative products. Their first product was Markellm, a marketplace connecting consumers with insurance providers. By 2024, Insurellm had expanded to 200 employees and 12 offices across the US. They offer ongoing updates and enhancements to their Homellm platform, including new features and security improvements, and they actively solicit feedback from their clients to ensure their products continue to meet their evolving needs. They provide 24/7 technical support via email and phone assistance for the duration of their contracts.


### Setting Up Conversation Chain

In [17]:
# set up a new conversation memory for the chat
memory = ConversationBufferMemory(memory_key='chat_history', return_messages=True)

# putting it together: set up the conversation chain with the LLM, the vector store and memory
conversation_chain = ConversationalRetrievalChain.from_llm(llm=llm, retriever=retriever, memory=memory)

### Gradio Chatbot View 

In [19]:
def chat(message, history):
    result = conversation_chain.invoke({"question": message})
    return result["answer"]

In [20]:
view = gr.ChatInterface(chat, type="messages").launch(inbrowser=True)

* Running on local URL:  http://127.0.0.1:7862
* To create a public link, set `share=True` in `launch()`.


## Implementing RAGAS

In [21]:
pip install ragas

Note: you may need to restart the kernel to use updated packages.


In [22]:
%pip install -q ragas datasets pandas tqdm

In [23]:
import os
import json
import pandas as pd
from tqdm import tqdm
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
    answer_correctness,
)

from langchain_huggingface import HuggingFaceEndpoint, ChatHuggingFace, HuggingFaceEmbeddings
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
import matplotlib.pyplot as plt
from typing import List, Dict, Optional, Tuple

In [24]:
assert retriever is not None, "retriever must be defined from previous RAG pipeline"

# Use Hugging Face Inference API
RAGAS_MODEL = "mistralai/Mistral-7B-Instruct-v0.3"
print(f"Using RAGAS_MODEL: {RAGAS_MODEL}")

Using RAGAS_MODEL: mistralai/Mistral-7B-Instruct-v0.3


### Building LLM Wrapper

In [25]:
def get_llm():
    endpoint = HuggingFaceEndpoint(
        repo_id=RAGAS_MODEL,
        task="conversational",   # important for Mistral-Instruct
        temperature=0.0,
        max_new_tokens=512,
        huggingfacehub_api_token=os.getenv("HUGGINGFACEHUB_API_TOKEN")
    )
    return ChatHuggingFace(llm=endpoint)

### Creating a new evaluation chain with source documents returned

In [26]:
def make_eval_chain() -> ConversationalRetrievalChain:
    llm = get_llm()
    memory = ConversationBufferMemory(
        memory_key="chat_history",
        return_messages=True,
        output_key="answer"
    )
    return ConversationalRetrievalChain.from_llm(
        llm=llm,
        retriever=retriever,
        memory=memory,
        return_source_documents=True,
        output_key="answer"
    )

### Running RAG generation and collecting answers and contexts.

In [27]:
def run_generation(questions: List[str], ground_truth: Optional[List[str]] = None) -> List[Dict]:
    results = []
    failures = 0
    for i, q in enumerate(tqdm(questions, desc="Generating answers")):
        try:
            chain = make_eval_chain()
            output = chain.invoke({"question": q})
            answer = output.get("answer", "").strip()
            docs = output.get("source_documents", [])
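            # De-duplicate while preserving retrieval rank (a set would scramble the
            # order, and context_precision is rank-sensitive)
            contexts = list(dict.fromkeys(doc.page_content.strip()[:2000] for doc in docs if doc.page_content.strip()))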
        except Exception as e:
            print(f"Generation failed for Q{i}: {e}")
            answer, contexts = "", []
            failures += 1
        row = {
            "question": q,
            "answer": answer,
            "contexts": contexts,
        }
        if ground_truth:
            row["ground_truth"] = ground_truth[i]
        results.append(row)
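    # Guard against division by zero when the question list is empty
    avg_contexts = sum(len(r["contexts"]) for r in results) / max(len(results), 1)
    print(f"Evaluated {len(questions)} questions with {failures} failures. Avg contexts: {avg_contexts:.2f}")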
    return results

### Converting rows to a RAGAS-compatible Hugging Face dataset.

In [28]:
def build_ragas_dataset(rows: List[Dict]) -> Dataset:
    keys = ["question", "answer", "contexts"]
    if all("ground_truth" in r for r in rows):
        keys.append("ground_truth")
    return Dataset.from_dict({k: [r.get(k, "") for r in rows] for k in keys})
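
`run_generation` records a failed question as an empty answer with no contexts, and such rows cannot be scored meaningfully. Below is a small sketch of filtering them out before building the dataset. `drop_failed_rows` is a hypothetical helper, not part of RAGAS, and it works on the same row dictionaries used above; the example answer text is invented.

```python
from typing import Dict, List, Tuple


def drop_failed_rows(rows: List[Dict]) -> Tuple[List[Dict], List[str]]:
    """Split rows into those with an answer and contexts, and the questions that failed."""
    kept, skipped = [], []
    for row in rows:
        if row.get("answer", "").strip() and row.get("contexts"):
            kept.append(row)
        else:
            skipped.append(row.get("question", ""))
    return kept, skipped


# Made-up rows: the second one mimics a failed generation
example_rows = [
    {"question": "What is Insurellm?", "answer": "An insurance tech company.", "contexts": ["# About Insurellm"]},
    {"question": "How does the claims process work?", "answer": "", "contexts": []},
]
kept, skipped = drop_failed_rows(example_rows)
print(len(kept), skipped)
# 1 ['How does the claims process work?']
```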

### Running RAGAS evaluation and returning metrics DataFrame and summary.

In [29]:
def run_ragas(ds: Dataset) -> Tuple[pd.DataFrame, Dict]:
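    # Reference-free metrics can always run
    metrics = [faithfulness, answer_relevancy]
    # These metrics compare against a ground-truth reference, so add them only when one exists
    if "ground_truth" in ds.column_names:
        metrics += [context_precision, context_recall, answer_correctness]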
    
    ragas_llm = get_llm()   # use Mistral via HF API
    ragas_embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

    result = evaluate(ds, metrics=metrics, llm=ragas_llm, embeddings=ragas_embeddings)
    df = result.to_pandas()
    summary = {metric: round(df[metric].mean(skipna=True), 3) for metric in df.columns if df[metric].dtype != "O"}
    return df, summary
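
Of these metrics, context precision is the one that depends on the order of the retrieved chunks: it rewards relevant chunks that appear near the top. Below is a rough sketch of the averaging step, assuming the definition in the RAGAS documentation (the mean of precision@k over the ranks that hold a relevant chunk). In RAGAS the per-chunk relevance verdicts come from the judge LLM; here they are hard-coded, and `context_precision_from_verdicts` is a hypothetical helper.

```python
from typing import Sequence


def context_precision_from_verdicts(verdicts: Sequence[int]) -> float:
    """Average precision@k over relevant ranks; verdicts are 1 (relevant) or 0, in rank order."""
    relevant_total = sum(verdicts)
    if relevant_total == 0:
        return 0.0
    score, hits = 0.0, 0
    for rank, verdict in enumerate(verdicts, start=1):
        if verdict:
            hits += 1
            score += hits / rank  # precision@rank, counted only at relevant ranks
    return score / relevant_total


print(context_precision_from_verdicts([1, 0, 0, 0]))  # 1.0
print(context_precision_from_verdicts([0, 0, 0, 1]))  # 0.25
```

With four retrieved chunks, a single relevant chunk in last place scores 0.25 while the same chunk in first place scores 1.0, so the retrieval order matters when contexts are collected.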

### Save evaluation results to disk

In [30]:
def save_outputs(df: pd.DataFrame, summary: Dict, path: str = "./eval") -> None:
    os.makedirs(path, exist_ok=True)
    df.to_csv(f"{path}/results_rows.csv", index=False)
    with open(f"{path}/summary.json", "w") as f:
        json.dump(summary, f, indent=2)
    print(f"Saved results to {path}/results_rows.csv and summary to {path}/summary.json")

### Example questions and ground-truth

In [31]:
questions = [
    "What is Insurellm?",
    "How does the claims process work?",
    "What are the benefits of using this platform?",
    "Can you explain the underwriting model?",
    "Is there support for multi-language documents?"
]

ground_truth = [
    "Insurellm is a platform for insurance-related language model tasks.",
    "Claims are processed by extracting structured data from documents.",
    "Benefits include automation, accuracy, and scalability.",
    "The underwriting model uses AI to assess risk based on documents.",
    "Yes, the platform supports multilingual document ingestion."
]

### Running The Evaluation

In [32]:
rows = run_generation(questions, ground_truth)
ds = build_ragas_dataset(rows)
df, summary = run_ragas(ds)

print("Sample results:")
display(df.head())

Generating answers: 100%|██████████| 5/5 [00:07<00:00,  1.40s/it]


Evaluated 5 questions with 0 failures. Avg contexts: 4.00


Evaluating:   0%|          | 0/25 [00:00<?, ?it/s]

Exception raised in Job[7]: ClientResponseError(402, message='Payment Required', url='https://router.huggingface.co/together/v1/chat/completions')
Exception raised in Job[12]: ClientResponseError(402, message='Payment Required', url='https://router.huggingface.co/together/v1/chat/completions')
Exception raised in Job[16]: ClientResponseError(402, message='Payment Required', url='https://router.huggingface.co/together/v1/chat/completions')
Exception raised in Job[13]: ClientResponseError(402, message='Payment Required', url='https://router.huggingface.co/together/v1/chat/completions')
Exception raised in Job[10]: ClientResponseError(402, message='Payment Required', url='https://router.huggingface.co/together/v1/chat/completions')
Exception raised in Job[9]: ClientResponseError(402, message='Payment Required', url='https://router.huggingface.co/together/v1/chat/completions')
Exception raised in Job[11]: ClientResponseError(402, message='Payment Required', url='https://router.huggingface.

Sample results:


| | user_input | retrieved_contexts | response | reference | faithfulness | answer_relevancy | context_precision | context_recall | answer_correctness |
|---|---|---|---|---|---|---|---|---|---|
| 0 | What is Insurellm? | `[# About Insurellm\n\nInsurellm was founded by...` | Insurellm is an innovative insurance tech firm... | Insurellm is a platform for insurance-related ... | NaN | NaN | 0.25 | NaN | NaN |
| 1 | How does the claims process work? | `[# Contract with GreenField Holdings for Marke...` | The claims process with Homellm works through ... | Claims are processed by extracting structured ... | NaN | NaN | NaN | 0.0 | NaN |
| 2 | What are the benefits of using this platform? | `[---\n\n## Features\n\n- **AI-Powered Risk Ass...` | The benefits of using this platform include:\n... | Benefits include automation, accuracy, and sca... | NaN | NaN | NaN | NaN | NaN |
| 3 | Can you explain the underwriting model? | `[# Contract with GreenField Holdings for Marke...` | The underwriting model in the context provided... | The underwriting model uses AI to assess risk ... | 1.0 | NaN | NaN | NaN | NaN |
| 4 | Is there support for multi-language documents? | `[- **User-Friendly Interface**: Designed with ...` | Yes, there is a plan to enhance the AI custome... | Yes, the platform supports multilingual docume... | 0.75 | NaN | NaN | NaN | 0.149856 |


In [33]:
print("\nMacro metric means:")
for k, v in summary.items():
    print(f"{k}: {v:.3f}")


Macro metric means:
faithfulness: 0.875
answer_relevancy: nan
context_precision: 0.250
context_recall: 0.000
answer_correctness: 0.150
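
These means should be read with care: only 5 of the 25 metric scores in the table above are populated, so each mean rests on at most two rows. Below is a sketch of reporting that coverage alongside each mean, in plain Python with the scores copied from the table. `metric_coverage` is a hypothetical helper.

```python
import math

nan = math.nan

# Per-row scores copied from the results table (rows 0 to 4); nan marks a missing score
scores = {
    "faithfulness":       [nan, nan, nan, 1.0, 0.75],
    "answer_relevancy":   [nan, nan, nan, nan, nan],
    "context_precision":  [0.25, nan, nan, nan, nan],
    "context_recall":     [nan, 0.0, nan, nan, nan],
    "answer_correctness": [nan, nan, nan, nan, 0.149856],
}


def metric_coverage(values):
    """Return (number of valid scores, mean of valid scores or nan if there are none)."""
    valid = [v for v in values if not math.isnan(v)]
    mean = sum(valid) / len(valid) if valid else math.nan
    return len(valid), mean


for name, values in scores.items():
    n, mean = metric_coverage(values)
    print(f"{name}: mean={mean:.3f} over {n}/{len(values)} rows")
```

Reporting the row count next to each mean makes it clear when a score such as `context_recall: 0.000` comes from a single question rather than the whole evaluation set.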
