# RAG Evaluation with RAGAS

Ragas is an open-source framework that helps developers evaluate and test Large Language Model (LLM) applications, particularly Retrieval Augmented Generation (RAG) pipelines.

Ragas uses LLMs to score a pipeline's retrieved contexts and generated answers on metrics such as answer correctness, faithfulness, and context relevancy.

This notebook shows how to use the Ragas RAG metrics in LangSmith to evaluate your RAG app.

## Prerequisites
This notebook requires Ragas and LangSmith.

LangSmith is a DevOps platform that helps developers and data scientists build, test, deploy, and monitor large language model (LLM) applications.

In [4]:
%%capture --no-stderr
%pip install -U langsmith ragas numpy openai

In [5]:
import getpass
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LANGCHAIN_API_KEY")
os.environ["OPENAI_API_KEY"] = getpass.getpass("OPENAI_API_KEY")

LANGCHAIN_API_KEY ········
OPENAI_API_KEY ········


## Clone a dataset to use

In [7]:
import langsmith

client = langsmith.Client()
dataset_url = (
    "https://smith.langchain.com/public/56fe54cd-b7d7-4d3b-aaa0-88d7a2d30931/d"
)
dataset_name = "BaseCamp Q&A"
client.clone_public_dataset(dataset_url)

## Define your pipeline

In [8]:
import io
import os
import zipfile

import requests

# Fetch the source documents
url = "https://storage.googleapis.com/benchmarks-artifacts/basecamp-data/basecamp-data.zip"

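response = requests.get(url)
response.raise_for_status()  # Stop early if the download failed

# Extract the archive into the current working directory
with io.BytesIO(response.content) as zipped_file:
    with zipfile.ZipFile(zipped_file, "r") as zip_ref:
        zip_ref.extractall()

# Load every Markdown file from the extracted "data" directory
data_dir = os.path.join(os.getcwd(), "data")
docs = []
for filename in os.listdir(data_dir):
    if filename.endswith(".md"):
        with open(os.path.join(data_dir, filename), "r", encoding="utf-8") as file:
            docs.append({"file": filename, "content": file.read()})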
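
The extract-and-load step can be tried without a network connection. The following is a minimal sketch, assuming a small in-memory archive with the same `data/*.md` layout. The helper name `load_markdown_docs`, the file names, and the file contents are hypothetical and are not part of the original notebook.

```python
import io
import os
import tempfile
import zipfile


def load_markdown_docs(zip_bytes: bytes, target_dir: str) -> list:
    """Extract an archive into target_dir and load every .md file under data/."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes), "r") as zip_ref:
        zip_ref.extractall(target_dir)
    data_dir = os.path.join(target_dir, "data")
    loaded = []
    for filename in sorted(os.listdir(data_dir)):
        if filename.endswith(".md"):
            path = os.path.join(data_dir, filename)
            with open(path, "r", encoding="utf-8") as file:
                loaded.append({"file": filename, "content": file.read()})
    return loaded


# Build a tiny stand-in archive (made-up contents, not the real dataset).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("data/benefits.md", "# Benefits\nPaid time off policy")
    zf.writestr("data/notes.txt", "ignored: not a Markdown file")

# Extract into a temporary directory so the working directory stays clean.
with tempfile.TemporaryDirectory() as tmp:
    sample_docs = load_markdown_docs(buffer.getvalue(), tmp)

print(sample_docs)
```

Extracting into a temporary directory is a design choice for this sketch only; the notebook cell itself extracts into the current working directory.
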

## Create the retriever

In [9]:
from typing import List

import numpy as np
import openai
from langsmith import traceable


class VectorStoreRetriever:
    def __init__(self, docs: list, vectors: list, oai_client):
        self._arr = np.array(vectors)
        self._docs = docs
        self._client = oai_client

    @classmethod
    async def from_docs(cls, docs, oai_client):
        embeddings = await oai_client.embeddings.create(
            model="text-embedding-3-small", input=[doc["content"] for doc in docs]
        )
        vectors = [emb.embedding for emb in embeddings.data]
        return cls(docs, vectors, oai_client)

    @traceable
    async def query(self, query: str, k: int = 5) -> List[dict]:
        embed = await self._client.embeddings.create(
            model="text-embedding-3-small", input=[query]
        )
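        # "@" is matrix multiplication; for unit-length embeddings the
        # resulting dot products are cosine similarities
        scores = np.array(embed.data[0].embedding) @ self._arr.T
        # Never request more results than there are documents
        k = min(k, len(scores))
        # argpartition finds the top k in linear time; then sort only those k
        top_k_idx = np.argpartition(scores, -k)[-k:]
        top_k_idx_sorted = top_k_idx[np.argsort(-scores[top_k_idx])]
        return [
            {**self._docs[idx], "similarity": float(scores[idx])}
            for idx in top_k_idx_sorted
        ]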
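
The ranking step in `query` scores every document by its dot product with the query embedding and keeps the `k` best. Here is a dependency-free sketch of that idea in plain Python, assuming unit-length vectors so that the dot product equals cosine similarity. The 2-D vectors and the helper names `normalize` and `top_k` are hypothetical, and the sketch uses a full sort rather than the `np.argpartition` approach in the class above.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def top_k(query_vec, doc_vecs, k=2):
    """Return (index, score) pairs for the k highest dot products, best first."""
    scores = [sum(q * d for q, d in zip(query_vec, vec)) for vec in doc_vecs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in ranked[: min(k, len(scores))]]


# Toy 2-D "embeddings"; real embeddings have many more dimensions.
doc_vecs = [normalize(v) for v in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0])]
query_vec = normalize([2.0, 1.0])

ranking = top_k(query_vec, doc_vecs, k=2)
print(ranking)
```

The document pointing in the direction closest to the query (index 2 here) ranks first, followed by index 0.
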

In [11]:
from langsmith import traceable
from langsmith.wrappers import wrap_openai


class NaiveRagBot:
    def __init__(self, retriever, model: str = "gpt-4-turbo-preview"):
        self._retriever = retriever
        # Wrapping the client instruments the LLM
        # and is completely optional
        self._client = wrap_openai(openai.AsyncClient())
        self._model = model

    @traceable
    async def get_answer(self, question: str):
        similar = await self._retriever.query(question)
        response = await self._client.chat.completions.create(
            model=self._model,
            messages=[
                {
                    "role": "system",
                    "content": "You are a helpful AI assistant."
                    " Use the following docs to help answer the user's question.\n\n"
                    f"## Docs\n\n{similar}",
                },
                {"role": "user", "content": question},
            ],
        )

        # The RAGAS evaluators expect the "answer" and "contexts"
        # keys to work properly. If your pipeline does not return these values,
        # you should wrap in a function that provides them.
        return {
            "answer": response.choices[0].message.content,
            "contexts": [str(doc) for doc in similar],
        }
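
If an existing pipeline returns different keys, a thin wrapper can supply the `"answer"` and `"contexts"` keys mentioned in the comment above. The sketch below assumes a hypothetical `legacy_pipeline` that returns `"output"` and `"documents"`; both the function and its output shape are invented for illustration.

```python
import asyncio


async def legacy_pipeline(question: str) -> dict:
    """Hypothetical pipeline whose output keys differ from the expected ones."""
    return {
        "output": f"Echo: {question}",
        "documents": [{"file": "benefits.md", "content": "Paid time off policy"}],
    }


async def ragas_compatible(question: str) -> dict:
    """Adapt the pipeline output to the "answer" and "contexts" keys."""
    raw = await legacy_pipeline(question)
    return {
        "answer": raw["output"],
        # Each context is passed along as a plain string.
        "contexts": [doc["content"] for doc in raw["documents"]],
    }


# asyncio.run works in a script; inside a notebook, use
# `await ragas_compatible(...)` instead.
adapted = asyncio.run(ragas_compatible("How much time off do we get?"))
print(adapted)
```

The wrapper, rather than the underlying pipeline, would then be the function passed to the evaluation run.
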

In [12]:
retriever = await VectorStoreRetriever.from_docs(docs, openai.AsyncClient())
rag_bot = NaiveRagBot(retriever)

In [13]:
response = await rag_bot.get_answer("How much time off do we get?")
response["answer"][:150]

'Based on the information provided in the "Benefits & Perks" document, here is a summary of the time-off benefits at 37signals:\n\n### Paid Time Off (PTO'

## Evaluate

In [23]:
from langchain.smith import RunEvalConfig
from ragas.integrations.langchain import EvaluatorChain
from ragas.metrics import (
    answer_correctness,
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

# Wrap the RAGAS metrics to use in LangChain
evaluators = [
    EvaluatorChain(metric)
    for metric in [
        answer_correctness,
        answer_relevancy,
        context_precision,
        context_recall,
        faithfulness,
    ]
]
eval_config = RunEvalConfig(custom_evaluators=evaluators)

In [25]:
results = await client.arun_on_dataset(
    dataset_name=dataset_name,
    llm_or_chain_factory=rag_bot.get_answer,
    evaluation=eval_config,
)

View the evaluation results for project 'diligent-tooth-31' at:
https://smith.langchain.com/o/305dbe54-58db-588f-9847-6b70dc6ac00a/datasets/88c50c7b-003c-4b5e-bcb5-1544b2025b49/compare?selectedSessions=9493172a-2b33-49e2-bc42-5f260b626dce

View all tests for Dataset BaseCamp Q&A at:
https://smith.langchain.com/o/305dbe54-58db-588f-9847-6b70dc6ac00a/datasets/88c50c7b-003c-4b5e-bcb5-1544b2025b49
[-------------------->                             ] 9/21

Error evaluating run 487201ca-ebe6-4273-86ad-cc77b5cf9672 with EvaluatorChain: APIConnectionError('Connection error.')
Traceback (most recent call last):
  File "/opt/conda/lib/python3.11/site-packages/openai/_base_client.py", line 1558, in _request
    response = await self._client.send(
               ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/httpx/_client.py", line 1675, in send
    raise exc
  File "/opt/conda/lib/python3.11/site-packages/httpx/_client.py", line 1669, in send
    await response.aread()
  File "/opt/conda/lib/python3.11/site-packages/httpx/_models.py", line 911, in aread
    self._content = b"".join([part async for part in self.aiter_bytes()])
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/httpx/_models.py", line 911, in <listcomp>
    self._content = b"".join([part async for part in self.aiter_bytes()])
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

[------------------------------------------------->] 21/21