## LLM-as-a-Judge Evaluation
This section evaluates the pipeline's end-to-end performance using the LLM-as-a-Judge metrics provided by Ragas.

#### 1. Context Precision (without reference)

**Context Precision** evaluates the retriever's ability to rank relevant chunks above irrelevant ones for a given query: it measures how consistently the relevant chunks in the retrieved context appear at the top of the ranking. In the without-reference variant used here, the judge LLM decides each chunk's relevance by comparing it with the generated response, so no ground-truth answer is required.

It is calculated as the mean of `precision@k` over the ranks k that hold a relevant chunk, where `precision@k` is the ratio of the number of relevant chunks in the top k results to k, the total number of chunks in the top k.
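
To make the calculation concrete, here is a minimal sketch of the scoring step alone. It assumes the judge LLM has already produced one relevant/irrelevant verdict per retrieved chunk, in rank order. The function name `context_precision` and the sample verdicts are illustrative and are not part of the Ragas API.

```python
from typing import Sequence


def context_precision(verdicts: Sequence[bool]) -> float:
    """Rank-weighted precision over per-chunk relevance verdicts.

    verdicts[i] is True when the chunk at rank i + 1 was judged relevant.
    """
    relevant_so_far = 0
    weighted_sum = 0.0
    for k, is_relevant in enumerate(verdicts, start=1):
        if is_relevant:
            relevant_so_far += 1
            # precision@k = relevant chunks in the top k / k
            weighted_sum += relevant_so_far / k
    # Average over the relevant ranks only; no relevant chunk scores 0.0.
    return weighted_sum / relevant_so_far if relevant_so_far else 0.0


# Same two relevant chunks, ranked first versus ranked last.
good_ranking = context_precision([True, True, False])
poor_ranking = context_precision([False, True, True])
print(good_ranking, poor_ranking)
```

Both rankings contain the same two relevant chunks, but the one that places them first scores higher, which is the behavior the metric is meant to reward.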

In [1]:
pip install ragas

Collecting ragas
  Using cached ragas-0.4.1-py3-none-any.whl.metadata (22 kB)
Collecting datasets>=4.0.0 (from ragas)
  Downloading datasets-4.4.1-py3-none-any.whl.metadata (19 kB)
Collecting appdirs (from ragas)
  Downloading appdirs-1.4.4-py2.py3-none-any.whl.metadata (9.0 kB)
Collecting diskcache>=5.6.3 (from ragas)
  Using cached diskcache-5.6.3-py3-none-any.whl.metadata (20 kB)
Collecting instructor (from ragas)
  Using cached instructor-1.13.0-py3-none-any.whl.metadata (11 kB)
Collecting scikit-network (from ragas)
  Using cached scikit_network-0.33.5-cp312-cp312-macosx_11_0_arm64.whl.metadata (4.5 kB)
Collecting langchain-community (from ragas)
  Downloading langchain_community-0.4.1-py3-none-any.whl.metadata (3.0 kB)
Collecting dill<0.4.1,>=0.3.0 (from datasets>=4.0.0->ragas)
  Downloading dill-0.4.0-py3-none-any.whl.metadata (10 kB)
Collecting multiprocess<0.70.19 (from datasets>=4.0.0->ragas)
  Downloading multiprocess-0.70.18-py312-none-any.whl.metadata (7.5 kB)
Collecting f

In [2]:
import os
from dotenv import load_dotenv

load_dotenv()

from datasets import Dataset
from ragas import evaluate, SingleTurnSample
from ragas.metrics import LLMContextPrecisionWithoutReference

from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI



In [3]:
evaluator_llm = ChatOpenAI(
    model=os.getenv("RAGAS_EVAL_MODEL", "gpt-4o"),
    temperature=0,
)

wrapped_evaluator_llm = LangchainLLMWrapper(evaluator_llm)



In [None]:
from typing import List, Dict, Any
from graph.graph import app

def run_agent_and_collect(
    queries: List[str],
) -> List[Dict[str, Any]]:
    runs: List[Dict[str, Any]] = []

    for q in queries:
        result = app.invoke({"input": q})

        # 1) user_input: the original query
        user_input = q

        # 2) response: the agent's final answer
        final_response = result.get("response", "")

        # 3) retrieved_contexts
        #    ⚠️ Adapt this part to your own RAG implementation.
        #    For example, assume retrieval_tool also stores the raw
        #    products / web_results contexts in the state, and that
        #    their text is extracted into a list here.
        #
        #    The lookup below is a placeholder example:
        retrieved_contexts = result.get("documents") or []

        if not isinstance(retrieved_contexts, list):
            retrieved_contexts = [str(retrieved_contexts)]

        runs.append(
            {
                "user_input": user_input,
                "response": final_response,
                "retrieved_contexts": retrieved_contexts,
            }
        )

    return runs


test_queries = [
    "I want to find a stuffed animal for kids less than $30"
]

runs = run_agent_and_collect(test_queries)

len(runs), runs[0]


ROUTER NODE Started
router response:  * Task:  
  Find a stuffed animal for kids.

* Constraints:  
  - Budget: Less than $30  
  - Material: Not specified  
  - Brand: Not specified  

* Safety Flags: No
Planner NODE Started
planner response:  * Data Source: private
* Fields to Retrieve: product descriptions, prices, ratings
* Comparison Criteria: price, rating
Retriever NODE Started
RETRIEVAL TOOL Invoked with query: Retrieve products based on this plan:
* Data Source: private
* Fields to Retrieve: product descriptions, prices, ratings
* Comparison Criteria: price, rating
[MCP] Calling http://0.0.0.0:8001/tools/rag.search with args={'query': 'Retrieve products based on this plan:\n* Data Source: private\n* Fields to Retrieve: product descriptions, prices, ratings\n* Comparison Criteria: price, rating', 'top_k': 5, 'max_price': None, 'min_rating': None, 'brand': None}
[MCP] tool=rag.search response type: <class 'dict'>
[MCP] Calling http://0.0.0.0:8001/tools/web.search with args={'que

(1,
 {'user_input': 'I want to find a stuffed animal for kids less than $30',
  'response': '* Final Answer:  \nBased on the retrieved knowledge, a suitable option for a stuffed animal for kids under $30 is the "Melissa & Doug Annie Doll & Feeding Set Bundle," priced at $24.99 with a rating of 4.4 [3]. This product fits within your budget and is from a reputable brand known for children\'s toys. There are no safety concerns noted in the retrieved data.',
  'retrieved_contexts': []})
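
The run above came back with `'retrieved_contexts': []`, which suggests the placeholder `documents` key is not populated by this graph. Once the retriever's raw results are exposed in the state, they still need to be flattened into the plain list of strings that `SingleTurnSample` expects. The helper below is a sketch of that normalization step. The name `to_context_strings` and the `text` / `page_content` field names are assumptions and should be replaced with whatever the retrieval tool actually returns.

```python
from typing import Any, List


def to_context_strings(documents: Any) -> List[str]:
    """Coerce retriever output into a list of non-empty strings."""
    if documents is None:
        return []
    if not isinstance(documents, (list, tuple)):
        documents = [documents]

    contexts: List[str] = []
    for doc in documents:
        if isinstance(doc, str):
            text = doc
        elif isinstance(doc, dict):
            # "text" and "page_content" are assumed key names.
            text = str(doc.get("text") or doc.get("page_content") or "")
        else:
            # Objects that expose a .page_content attribute (assumed shape).
            text = str(getattr(doc, "page_content", "") or "")
        if text.strip():
            contexts.append(text)
    return contexts


# Hypothetical retriever output mixing the shapes handled above.
example_contexts = to_context_strings(
    [{"text": "Plush bear, $19.99"}, "Toy rabbit, $12.50", {"title": "no text"}, None]
)
print(example_contexts)
```

Entries without usable text are dropped rather than passed to the judge as empty chunks.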

In [11]:
from ragas import SingleTurnSample

def to_single_turn_samples(runs: List[Dict[str, Any]]) -> List[SingleTurnSample]:
    samples: List[SingleTurnSample] = []

    for r in runs:
        sample = SingleTurnSample(
            user_input=r["user_input"],
            response=r["response"],
            retrieved_contexts=r["retrieved_contexts"],
        )
        samples.append(sample)

    return samples

samples = to_single_turn_samples(runs)
len(samples), samples[0]


(1,
 SingleTurnSample(user_input='I want to find a stuffed animal for kids less than $30', retrieved_contexts=[], reference_contexts=None, retrieved_context_ids=None, reference_context_ids=None, response='* Final Answer:  \nBased on the retrieved knowledge, a suitable option for a stuffed animal for kids under $30 is the "Melissa & Doug Annie Doll & Feeding Set Bundle," priced at $24.99 with a rating of 4.4 [3]. This product fits within your budget and is from a reputable brand known for children\'s toys. There are no safety concerns noted in the retrieved data.', multi_responses=None, reference=None, rubrics=None, persona_name=None, query_style=None, query_length=None))

In [12]:
from ragas import SingleTurnSample
from ragas.metrics import LLMContextPrecisionWithoutReference

context_precision_metric = LLMContextPrecisionWithoutReference(
    llm=wrapped_evaluator_llm
)

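async def score_one_sample(sample: SingleTurnSample) -> float:
    """Score a single sample with the LLM-based context precision metric."""
    return await context_precision_metric.single_turn_ascore(sample)

# Top-level await works in a notebook cell; in a plain script, wrap the call in asyncio.run().
one_score = await score_one_sample(samples[0])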
print("First sample LLM Context Precision (no reference):", one_score)

First sample LLM Context Precision (no reference): 0.0
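
Note that this sample was scored with an empty `retrieved_contexts` list, so the 0.0 most likely reflects the missing contexts rather than a poorly ranked retrieval. One way to avoid misreading such scores is to set aside runs with no contexts before scoring. The sketch below works on the plain run dictionaries built earlier. The helper name `split_scorable` and the demo data are illustrative.

```python
from typing import Any, Dict, List, Tuple

Run = Dict[str, Any]


def split_scorable(runs: List[Run]) -> Tuple[List[Run], List[Run]]:
    """Separate runs that have retrieved contexts from runs that have none."""
    scorable: List[Run] = []
    skipped: List[Run] = []
    for run in runs:
        contexts = run.get("retrieved_contexts") or []
        (scorable if contexts else skipped).append(run)
    return scorable, skipped


# Illustrative data only: one run with contexts, one without.
demo_runs = [
    {"user_input": "q1", "response": "a1", "retrieved_contexts": ["chunk A"]},
    {"user_input": "q2", "response": "a2", "retrieved_contexts": []},
]
scorable_runs, skipped_runs = split_scorable(demo_runs)
print(len(scorable_runs), "scorable;", len(skipped_runs), "skipped")
```

Reporting the skipped count alongside the metric keeps a retrieval or wiring failure from being averaged in as a ranking failure.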
