# RAGAS: Performance Evaluation of RAG Systems (Mistral)

## Insurellm Company Question-Answering Chatbot
This project builds a low-cost, high-accuracy question-answering system for employees of Insurellm, an insurance technology company. The chatbot acts as an expert knowledge worker, helping staff quickly find accurate answers to domain-specific queries. For reliability, the system uses Retrieval-Augmented Generation (RAG), which combines document retrieval with LLM reasoning so that responses are grounded in the retrieved context.

The project integrates RAGAS metrics (faithfulness, answer relevancy, context precision, context recall, answer correctness) to assess answer quality automatically, and saves detailed results for analysis.

### Importing the Packages

In [1]:
import os
import json
import pandas as pd
from tqdm import tqdm
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
    answer_correctness,
)
from langchain_huggingface import HuggingFaceEndpoint, ChatHuggingFace, HuggingFaceEmbeddings
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
import matplotlib.pyplot as plt
from typing import List, Dict, Optional, Tuple
import glob
from openai import OpenAI
from dotenv import load_dotenv
import gradio as gr
from langchain_openai import ChatOpenAI
from langchain.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.schema import Document
from langchain_chroma import Chroma
# ConversationBufferMemory, ConversationalRetrievalChain and HuggingFaceEmbeddings are already imported above.
# Re-importing HuggingFaceEmbeddings from langchain.embeddings would shadow the langchain_huggingface class.
from langchain_community.chat_models import ChatOllama
import numpy as np
from sklearn.manifold import TSNE
import plotly.graph_objects as go
import backoff
import time
from langchain.prompts import ChatPromptTemplate, SystemMessagePromptTemplate, HumanMessagePromptTemplate

### Loading the LLM

In [2]:
MODEL = "mistral:latest"
DB_NAME = "vector_db"

In [3]:
load_dotenv(override=True)

True

In [4]:
llm = ChatOllama(model=MODEL, temperature=0.7)


### Loading the Documents

In [5]:
folders = glob.glob("knowledge-base/*")
text_loader_kwargs = {'encoding': 'utf-8'}

documents = []
for folder in folders:
    doc_type = os.path.basename(folder)
    loader = DirectoryLoader(folder, glob="**/*.md", loader_cls=TextLoader, loader_kwargs=text_loader_kwargs)
    folder_docs = loader.load()
    for doc in folder_docs:
        doc.metadata["doc_type"] = doc_type
        documents.append(doc)

### Creating Chunks

In [6]:
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = text_splitter.split_documents(documents)

Created a chunk of size 1088, which is longer than the specified 1000


In [7]:
len(chunks)

123
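
The "Created a chunk of size 1088" message above is expected for a separator-based splitter: `CharacterTextSplitter` cuts only at its separator (a blank line by default), so a single paragraph longer than `chunk_size` cannot be broken up. The following is a simplified pure-Python sketch of that idea, not LangChain's implementation; it ignores `chunk_overlap`, and `split_on_separator` is a hypothetical helper name.

```python
def split_on_separator(text: str, chunk_size: int, separator: str = "\n\n") -> list[str]:
    """Greedily pack separator-delimited pieces into chunks of at most chunk_size.

    Simplified illustration only: no overlap handling, and a piece longer than
    chunk_size is emitted as-is because there is nowhere to cut it.
    """
    chunks: list[str] = []
    current = ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # Either the piece still fits, or it starts a new (possibly oversized) chunk
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


# Three paragraphs; the middle one alone exceeds the 100-character limit
paragraphs = ["a" * 40, "b" * 150, "c" * 30]
chunks = split_on_separator("\n\n".join(paragraphs), chunk_size=100)
print([len(c) for c in chunks])  # [40, 150, 30]
```

If oversized chunks become a problem, a splitter that falls back to finer separators is one option to consider.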

### Documents in Knowledge Base

In [8]:
doc_types = set(chunk.metadata['doc_type'] for chunk in chunks)
print(f"Document types found: {', '.join(doc_types)}")

Document types found: contracts, employees, products, company


### Initializing Chroma Vectorstore with HuggingFace Embeddings

In [9]:
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

if os.path.exists(DB_NAME):
    Chroma(persist_directory=DB_NAME, embedding_function=embeddings).delete_collection()

vectorstore = Chroma.from_documents(documents=chunks, embedding=embeddings, persist_directory=DB_NAME)
print(f"Vectorstore created with {vectorstore._collection.count()} documents")


Vectorstore created with 123 documents


In [10]:
# Get one vector and find how many dimensions it has

collection = vectorstore._collection
sample_embedding = collection.get(limit=1, include=["embeddings"])["embeddings"][0]
dimensions = len(sample_embedding)
print(f"The vectors have {dimensions:,} dimensions")

The vectors have 384 dimensions
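
Each chunk is stored as a 384-dimensional vector, and retrieval returns the chunks whose vectors lie closest to the query vector. A minimal sketch of that ranking step is shown here with toy 3-dimensional vectors and cosine similarity. The vectors and names are made up for illustration, and the distance metric the Chroma collection actually uses depends on its configuration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k=2):
    """Return the names of the k vectors most similar to the query."""
    ranked = sorted(
        vectors.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Hypothetical toy embeddings standing in for real 384-dimensional ones
toy_vectors = {
    "company_overview": [0.9, 0.1, 0.0],
    "employee_record": [0.1, 0.9, 0.1],
    "product_page": [0.7, 0.2, 0.4],
}
query_vector = [1.0, 0.0, 0.1]
print(top_k(query_vector, toy_vectors, k=2))  # ['company_overview', 'product_page']
```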


### RAG Implementation

In [11]:
llm = ChatOllama(model=MODEL, temperature=0.7)
memory = ConversationBufferMemory(memory_key='chat_history', return_messages=True)
retriever = vectorstore.as_retriever()
conversation_chain = ConversationalRetrievalChain.from_llm(llm=llm, retriever=retriever, memory=memory)


In [12]:
# Test query
query = "Can you describe Insurellm in a few sentences"
result = conversation_chain.invoke({"question": query})
print(result["answer"])

 Insurellm is an innovative insurance technology firm with over 200 employees across the US. Founded by Avery Lancaster in 2015, it offers four insurance software products - Carllm, Homellm, Rellm, and Marketllm - catering to various sectors such as auto, home, reinsurance, and connecting consumers with providers. Insurellm has more than 300 clients worldwide and provides technical support from 9 AM to 7 PM EST, Monday through Friday, with a commitment to respond to all queries within 24 business hours.


## RAGAS Evaluation Setup

In [13]:
RAGAS_MODEL = "sonar-reasoning-pro"
CACHE_FILE = "generation_cache.json"
REQUEST_DELAY = 1.0  # seconds between calls
TIMEOUT = 60  # seconds per request

### RAGAS Based Evaluation of RAG Systems Using Perplexity Sonar

In [14]:
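perplexity_llm = ChatOpenAI(
    api_key=os.getenv("PERPLEXITY_API_KEY"),
    base_url="https://api.perplexity.ai",
    model=RAGAS_MODEL,  # "sonar-reasoning-pro", defined in the setup cell
    timeout=300,  # judge calls can be slow; this is separate from TIMEOUT above
)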

### Creating a new evaluation chain with source documents returned

In [15]:
def make_eval_chain() -> ConversationalRetrievalChain:
    return ConversationalRetrievalChain.from_llm(
        llm=ChatOllama(model=MODEL, temperature=0),
        retriever=retriever,
        memory=ConversationBufferMemory(
            memory_key="chat_history",
            return_messages=True,
            output_key="answer" 
        ),
        return_source_documents=True,
        output_key="answer"       
    )

### Safe Chain Invocation with Exponential Backoff and Retry Handling

In [16]:
@backoff.on_exception(
    backoff.expo,
    # Exception already covers TimeoutError; narrow this (e.g. to rate-limit errors) if needed
    Exception,
    max_tries=5
)
def safe_invoke(chain, q):
    return chain.invoke({"question": q})
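
To make the retry behaviour concrete, here is a standard-library sketch of exponential backoff: each failed attempt waits longer than the previous one before retrying. It is an illustration of the idea only; `retry_with_backoff` and its parameters are hypothetical, and the `backoff` library additionally applies jitter to the wait times by default.

```python
import time


def retry_with_backoff(func, max_tries=5, base_delay=0.01, factor=2.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay * factor**attempt, then retry."""
    for attempt in range(max_tries):
        try:
            return func()
        except Exception:
            if attempt == max_tries - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * factor ** attempt)


calls = {"count": 0}


def flaky():
    """Simulated call that times out twice before succeeding."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"


delays = []  # record the waits instead of actually sleeping
result = retry_with_backoff(flaky, sleep=delays.append)
print(result, calls["count"], delays)  # ok 3 [0.01, 0.02]
```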

### Cache Management: Load and Save Functions

In [17]:
def load_cache():
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE, "r") as f:
            return json.load(f)
    return {}

def save_cache(cache):
    with open(CACHE_FILE, "w") as f:
        json.dump(cache, f, indent=2)
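
The cache is keyed on the exact question text, so a repeated question is answered from disk without calling the model, and an entry stays in place even if the chain or knowledge base changes afterwards. The round trip can be sketched as follows; the `path` parameter and the placeholder entry are assumptions added so the example runs in a temporary directory.

```python
import json
import os
import tempfile


def load_cache(path):
    """Return the cached dict, or an empty dict if no cache file exists yet."""
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    return {}


def save_cache(cache, path):
    """Write the whole cache back to disk as JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(cache, f, indent=2)


with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "generation_cache.json")
    first_load = load_cache(cache_path)  # no file yet
    entry = {"answer": "placeholder answer", "contexts": ["placeholder chunk"]}
    save_cache({"What is Insurellm?": entry}, cache_path)
    second_load = load_cache(cache_path)  # read back from disk

print(first_load)  # {}
print(second_load["What is Insurellm?"]["answer"])  # placeholder answer
```

Deleting the cache file forces regeneration after the retriever, prompt, or model changes.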

### Running RAG generation and collecting answers and contexts

In [18]:
def run_generation(questions, ground_truth=None, delay=REQUEST_DELAY):
    results = []
    failures = 0
    cache = load_cache()
    chain = make_eval_chain()

    for i, q in enumerate(tqdm(questions, desc="Generating answers")):
        if q in cache:
            answer, contexts = cache[q]["answer"], cache[q]["contexts"]
        else:
            try:
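                chain.memory.clear()  # reset chat history so earlier questions cannot leak into this one
                output = safe_invoke(chain, q)
                answer = output.get("answer", "").strip()
                docs = output.get("source_documents", [])
                # Deduplicate while preserving retrieval order, since context_precision is rank-sensitive
                texts = [doc.page_content.strip()[:2000] for doc in docs if doc.page_content.strip()]
                contexts = list(dict.fromkeys(texts))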
                cache[q] = {"answer": answer, "contexts": contexts}
                save_cache(cache)
            except Exception as e:
                print(f"Generation failed for Q{i}: {e}")
                answer, contexts = "", []
                failures += 1

            time.sleep(delay)  # throttle

        row = {"question": q, "answer": answer, "contexts": contexts}
        if ground_truth:
            row["ground_truth"] = ground_truth[i]
        results.append(row)

    print(f"Evaluated {len(questions)} questions with {failures} failures.")
    return results

### Building RAGAS Compatible Dataset from QA Rows

In [19]:
def build_ragas_dataset(rows):
    keys = ["question", "answer", "contexts"]
    if all("ground_truth" in r for r in rows):
        keys.append("ground_truth")
    return Dataset.from_dict({k: [r.get(k, "") for r in rows] for k in keys})
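
`Dataset.from_dict` expects one list per column rather than one dict per row, which is what the dict comprehension above produces. A pure-Python sketch of that row-to-column pivot, using placeholder rows and a hypothetical helper name `rows_to_columns`:

```python
def rows_to_columns(rows, keys):
    """Pivot a list of row dicts into a dict of column lists."""
    return {k: [r.get(k, "") for r in rows] for k in keys}


# Placeholder rows shaped like the output of run_generation
rows = [
    {"question": "Q1", "answer": "A1", "contexts": ["c1", "c2"]},
    {"question": "Q2", "answer": "A2", "contexts": ["c3"]},
]
columns = rows_to_columns(rows, ["question", "answer", "contexts"])
print(columns["question"])  # ['Q1', 'Q2']
print(columns["contexts"])  # [['c1', 'c2'], ['c3']]
```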

### Running RAGAS evaluation and returning metrics DataFrame and summary.

In [20]:
def run_ragas(ds):
    metrics = [faithfulness, answer_relevancy, context_precision, context_recall]
    if "ground_truth" in ds.column_names:
        metrics.append(answer_correctness)
    ragas_embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
    result = evaluate(ds, metrics=metrics, llm=perplexity_llm, embeddings=ragas_embeddings)
    df = result.to_pandas()
    summary = {metric: round(df[metric].mean(skipna=True), 3) for metric in df.columns if df[metric].dtype != "O"}
    return df, summary

### Save evaluation results to disk

In [21]:
def save_outputs(df, summary, path="./eval"):
    os.makedirs(path, exist_ok=True)
    df.to_csv(f"{path}/results_rows.csv", index=False)
    with open(f"{path}/summary.json", "w") as f:
        json.dump(summary, f, indent=2)
    print(f"Saved results to {path}")

In [22]:
questions = [
    "What is Insurellm?",
    "Who is Avery Lancaster"
]

ground_truth = [
    "Insurellm is a platform for insurance-related language model tasks.",
    "Avery Lancaster is Co-Founder & Chief Executive Officer (CEO) of Insurellm"
]

In [23]:
rows = run_generation(questions, ground_truth)
ds = build_ragas_dataset(rows)
df, summary = run_ragas(ds)

print("Sample results:")
df.head()

Generating answers: 100%|██████████| 2/2 [00:00<00:00, 12157.40it/s]


Evaluated 2 questions with 0 failures.


Evaluating:   0%|          | 0/10 [00:00<?, ?it/s]

Sample results:


| | user_input | retrieved_contexts | response | reference | faithfulness | answer_relevancy | context_precision | context_recall | answer_correctness |
|---|---|---|---|---|---|---|---|---|---|
| 0 | What is Insurellm? | `[## Support\n1. **Technical Support**: Technic...` | Insurellm is an insurance tech startup that wa... | Insurellm is a platform for insurance-related ... | 0.875 | 0.963939 | 0.5 | 0.0 | 0.196197 |
| 1 | Who is Avery Lancaster | `[- **2022**: **Satisfactory** \n Avery focus...` | Avery Lancaster is the Co-Founder and Chief Ex... | Avery Lancaster is Co-Founder & Chief Executiv... | 0.857143 | 0.885467 | 0.333333 | 1.0 | 0.406071 |


In [24]:
print("\nMacro metric means:")
for k, v in summary.items():
    print(f"{k}: {v:.3f}")


Macro metric means:
faithfulness: 0.866
answer_relevancy: 0.925
context_precision: 0.417
context_recall: 0.500
answer_correctness: 0.301
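
The macro means are plain averages of the per-question scores in the results table. As a quick arithmetic check using only the values printed above (with two questions, these means are indicative rather than statistically robust):

```python
from statistics import mean

# Per-question scores copied from the results table above
per_question = {
    "faithfulness": [0.875, 0.857143],
    "answer_relevancy": [0.963939, 0.885467],
    "context_precision": [0.5, 0.333333],
    "context_recall": [0.0, 1.0],
    "answer_correctness": [0.196197, 0.406071],
}

summary_check = {name: round(mean(scores), 3) for name, scores in per_question.items()}
for name, value in summary_check.items():
    print(f"{name}: {value:.3f}")
```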


# OBSERVATIONS
## RAGAS Evaluation on Mistral Local Model

## Query 1: "What is Insurellm?"
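Faithfulness: 0.875 → Most claims in the answer are supported by the retrieved context.

Answer Relevancy: 0.964 → The response directly addresses the user’s query.

Context Precision: 0.50 → Relevant chunks were retrieved but not ranked at the top. The metric is rank-weighted, so 0.50 does not necessarily mean that exactly half of the chunks were useful.

Context Recall: 0.0 → None of the statements in the ground-truth reference could be attributed to the retrieved context.

Answer Correctness: 0.196 → The answer matches the ground-truth reference poorly.


### Interpretation:
Precision above zero shows that the model had some useful grounding, which is consistent with the high faithfulness score. Recall of zero means the retrieved chunks did not support the reference answer, which is why correctness is low. Part of the gap may lie in the reference itself: it describes Insurellm as "a platform for insurance-related language model tasks", whereas the generated answer describes an insurance tech startup. A reference that differs from the knowledge base would depress both recall and correctness regardless of retrieval quality.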

## Query 2: "Who is Avery Lancaster?"
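Faithfulness: 0.857 → The response is largely consistent with the retrieved context.

Answer Relevancy: 0.885 → The answer is clearly relevant.

Context Precision: 0.333 → Relevant chunks were retrieved but ranked low among the results.

Context Recall: 1.0 → The retrieved context supports every statement in the ground-truth reference.

Answer Correctness: 0.406 → Closer to the ground truth than Query 1, but still below 0.5.


### Interpretation:
Retrieval recall is the strong point here: with a recall of 1.0, the model had all the information the reference requires. Correctness (0.406) is roughly double that of Query 1, while faithfulness (0.857) is about the same as for Query 1 (0.875), so the perfect recall helped correctness rather than faithfulness. The low precision (0.333) suggests the relevant chunk was outranked by less relevant ones, and the answer may contain details beyond the one-sentence reference, which would lower correctness.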
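
Context precision is rank-weighted: it averages precision@k over the ranks at which a relevant chunk appears, so a relevant chunk at rank 1 scores higher than the same chunk at rank 3. The sketch below implements that formula for 0/1 relevance verdicts. The verdict lists are hypothetical; the evaluator's actual per-chunk verdicts are not shown in this notebook, and other patterns can yield the same scores.

```python
def context_precision(relevance):
    """Mean of precision@k over the ranks k that hold a relevant chunk.

    relevance: 0/1 verdicts for the retrieved chunks, in rank order.
    """
    hits = 0
    weighted = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            weighted += hits / rank  # precision@rank at a relevant position
    return weighted / hits if hits else 0.0


# Hypothetical verdicts for four retrieved chunks
print(round(context_precision([0, 1, 0, 0]), 3))  # 0.5   (one relevant chunk at rank 2)
print(round(context_precision([0, 0, 1, 0]), 3))  # 0.333 (one relevant chunk at rank 3)
print(round(context_precision([1, 0, 0, 0]), 3))  # 1.0   (relevant chunk ranked first)
```

Under this reading, scores of 0.50 and 0.333 are compatible with a single relevant chunk sitting at rank 2 and rank 3 respectively, which points at ranking rather than coverage as the thing to improve.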

## Overall Insight
Retrieval quality is improving → Both queries have non-zero context precision (0.50 and 0.333), although relevant chunks are not yet ranked first.

Consistency is still uneven → For the definitional query (“What is Insurellm?”), recall is 0.0. For the entity-based query (“Who is Avery Lancaster?”), recall is 1.0.

Correctness is improving but still low → Even when retrieval recall is perfect, answer correctness is only ~0.40. This suggests the LLM is not fully grounding itself in the retrieved chunks, or that its answers contain more detail than the short references.

Key next step → Focus on retrieval recall for conceptual questions, on prompting/model instructions so the LLM relies strictly on retrieved context, and on checking the ground-truth references against the knowledge base.
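
One way to approach the prompting part of that next step is a system instruction that restricts the model to the retrieved chunks and gives it an explicit fallback. The sketch below builds such a prompt with plain string formatting. The wording, `GROUNDED_SYSTEM_TEMPLATE`, and `build_grounded_prompt` are illustrative assumptions, not part of the pipeline above.

```python
GROUNDED_SYSTEM_TEMPLATE = (
    "You are a knowledge assistant for Insurellm employees.\n"
    "Answer using ONLY the context below. If the context does not contain "
    "the answer, reply exactly: I don't know.\n\n"
    "Context:\n{context}"
)


def build_grounded_prompt(question, contexts):
    """Number the retrieved chunks and place them ahead of the question."""
    numbered = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(contexts, start=1))
    return GROUNDED_SYSTEM_TEMPLATE.format(context=numbered) + f"\n\nQuestion: {question}"


# Placeholder chunks standing in for retriever output
prompt = build_grounded_prompt(
    "Who is Avery Lancaster?",
    ["placeholder chunk about the founder", "placeholder chunk about products"],
)
print(prompt)
```

With LangChain, a template along these lines could be wrapped in a `ChatPromptTemplate` and passed to `ConversationalRetrievalChain.from_llm` through `combine_docs_chain_kwargs={"prompt": ...}`, and recall could be probed by raising the number of retrieved chunks with `vectorstore.as_retriever(search_kwargs={"k": ...})`. Both should be checked against the installed LangChain version.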