# RAGAS Topic Adherence

 - Requires a predefined set of topics, provided in `reference_topics`.
 

### **class MultiTurnSample**

 What is this class?

    - A class designed to represent conversational data samples that involve multiple back-and-forth exchanges (turns) between users and AI. 

 What is it doing?

    - It organizes and validates conversational data exchanged between a human and an AI.

    - It provides a structured format for evaluation:

            - Supported evaluations: topic adherence, tool usage, answer accuracy.

            - ```

            class MultiTurnSample(BaseSample):
                user_input: t.List[t.Union[HumanMessage, AIMessage, ToolMessage]]
                # Each message type has specific attributes:
                # HumanMessage: content
                # AIMessage: content, tool_calls
                # ToolMessage: content
                # These fields provide "ground truth" for evaluation
                reference: t.Optional[str] = None  # Expected answer
                reference_tool_calls: t.Optional[t.List[ToolCall]] = None  # Expected tool usage
                reference_topics: t.Optional[t.List[str]] = None  # Expected conversation topics
                rubrics: t.Optional[t.Dict[str, str]] = None  # Evaluation criteria

            ```
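To make the "organizes and validates" point concrete, here is a minimal sketch of one structural check such a sample class could run. The stand-in classes `Human`, `AI`, and `Tool` and the function `check_turn_order` are hypothetical names for illustration, and the rule itself (a tool result must directly follow an AI message that issued tool calls) is an assumption, not a statement of what RAGAS enforces.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Human:
    content: str


@dataclass
class AI:
    content: str
    tool_calls: List[dict] = field(default_factory=list)


@dataclass
class Tool:
    content: str


Message = Union[Human, AI, Tool]


def check_turn_order(messages: List[Message]) -> None:
    """Raise ValueError if a tool result does not follow an AI tool call (assumed rule)."""
    previous = None
    for message in messages:
        if isinstance(message, Tool):
            if not isinstance(previous, AI) or not previous.tool_calls:
                raise ValueError("Tool output must follow an AI message with tool_calls")
        previous = message


valid = [
    Human("What's the weather in New York?"),
    AI("Let me check.", tool_calls=[{"name": "weather_check", "args": {"location": "New York"}}]),
    Tool("75°F and partly cloudy."),
    AI("It is 75°F and partly cloudy."),
]
check_turn_order(valid)  # a well-formed conversation passes silently

invalid = [Human("Hi"), Tool("orphan tool output")]
try:
    check_turn_order(invalid)
    order_error = None
except ValueError as exc:
    order_error = str(exc)  # the orphan tool message is rejected
```

Checking the structure up front means a malformed conversation fails early, before any LLM-based metric is run on it.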

### **class TopicAdherenceScore**

   What is this class?

      - It evaluates how well a conversational AI system stays on topic during a multi-turn conversation.

   What is it doing?

      The class measures topic adherence through three main steps:
         
         a) Topic Extraction:
         
         Takes a conversation (user inputs and AI responses)
         
         Uses TopicExtractionPrompt to identify all topics discussed in the conversation
         
         Example: From a conversation about physics, it might extract topics like "Einstein's theory of relativity"
         
         b) Topic Refusal Check:

         For each extracted topic, checks whether the AI actually answered questions about it or refused to answer

         Uses TopicRefusedPrompt to determine whether the AI addressed each topic

         Creates a boolean array where True means the topic was addressed

         c) Topic Classification:

         Compares extracted topics against reference topics (expected topics for the conversation)

         Uses TopicClassificationPrompt to determine if each extracted topic matches the reference topics

         Creates another boolean array of classifications

         Finally, it calculates a score using one of three modes:

               precision: The fraction of addressed topics that were relevant
               
               recall: The fraction of relevant topics that were addressed
               
               f1: The harmonic mean of precision and recall (default)
         
         The scoring formula uses:
         
               True positives: Topics that were both relevant and addressed
               
               False positives: Topics that were addressed but not relevant
               
               False negatives: Topics that were relevant but not addressed
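The scoring step can be sketched in plain Python from the two boolean arrays described above. This is a sketch under stated assumptions, not the library's code: the function name `adherence_scores` is hypothetical, the verdicts below are made up for illustration (the real ones come from the LLM prompts), and the small `EPSILON` in the denominators is an assumption, chosen because it is consistent with outputs such as `0.99999999995` recorded in this notebook.

```python
from typing import Dict, List

EPSILON = 1e-10  # assumed guard against division by zero


def adherence_scores(answered: List[bool], on_topic: List[bool]) -> Dict[str, float]:
    """Return precision, recall and F1 from two per-topic verdict lists."""
    pairs = list(zip(answered, on_topic))
    true_pos = sum(a and t for a, t in pairs)       # relevant and addressed
    false_pos = sum(a and not t for a, t in pairs)  # addressed but not relevant
    false_neg = sum(t and not a for a, t in pairs)  # relevant but not addressed
    precision = true_pos / (true_pos + false_pos + EPSILON)
    recall = true_pos / (true_pos + false_neg + EPSILON)
    f1 = 2 * precision * recall / (precision + recall + EPSILON)
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical verdicts for four extracted topics.
topics = [
    "Einstein's theory of relativity",
    "General Theory of Relativity",
    "chocolate cake recipe",
    "current US president",
]
answered = [True, True, True, False]   # the president question was refused
on_topic = [True, True, False, False]  # only the physics topics count as "science"

scores = adherence_scores(answered, on_topic)
print(scores)
```

With these verdicts there are two true positives, one false positive (the cake recipe) and no false negatives, so precision is about 2/3 and recall is about 1.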




**WHEN IS THIS METRIC APPLICABLE?**
When a chatbot is expected to respond only to queries within predefined domains.

In [1]:
import os
from openai import OpenAI
# LangChain
from langchain_openai import ChatOpenAI
# RAGAS
from ragas.dataset_schema import SingleTurnSample, MultiTurnSample, EvaluationDataset
from ragas.messages import HumanMessage, AIMessage, ToolMessage, ToolCall
from ragas.metrics import TopicAdherenceScore
from ragas.llms import LangchainLLMWrapper



In [2]:
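from dotenv import load_dotenv

# load_dotenv() copies the variables from .env into os.environ, so no manual reassignment is needed
load_dotenv()
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set; add it to your .env file")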

In [17]:
sample_input_4 = [
HumanMessage(content="Can you provide me with details about Einstein's theory of relativity?"),
AIMessage(content="Sure, let me retrieve the relevant information for you.", tool_calls=[
    ToolCall(name="document_search", args={"query": "Einstein's theory of relativity"})
]),
ToolMessage(content="Found relevant documents: 1. Relativity: The Special and the General Theory, 2. General Theory of Relativity by A. Einstein."),
AIMessage(content="I found some documents on Einstein's theory of relativity. Which one would you like to know more about: 'Relativity: The Special and the General Theory' or 'General Theory of Relativity by A. Einstein'?"),
HumanMessage(content="Tell me about the 'General Theory of Relativity'."),
AIMessage(content="Got it! Let me fetch more details from 'General Theory of Relativity by A. Einstein'.", tool_calls=[
    ToolCall(name="document_retrieve", args={"document": "General Theory of Relativity by A. Einstein"})
]),
ToolMessage(content="The document discusses how gravity affects the fabric of spacetime, describing the relationship between mass and spacetime curvature."),
AIMessage(content="The 'General Theory of Relativity' explains how gravity affects the fabric of spacetime and the relationship between mass and spacetime curvature. Would you like more details or a specific explanation?"),
HumanMessage(content="No, that's perfect. By the way, do you know any good recipes for a chocolate cake?"),
AIMessage(content="Sure! Let me find a simple and delicious recipe for a chocolate cake.", tool_calls=[
    ToolCall(name="recipe_search", args={"query": "chocolate cake recipe"})
]),
ToolMessage(content="Here’s a popular recipe for a chocolate cake: Ingredients include flour, sugar, cocoa powder, eggs, milk, and butter. Instructions: Mix dry ingredients, add wet ingredients, and bake at 350°F for 30-35 minutes."),
AIMessage(content="I found a great recipe for chocolate cake! Would you like the full details, or is that summary enough?"),

HumanMessage(content="Who is the current president of the US?"),
AIMessage(content="Sorry, I can only answer questions related to science."),
# Alternative ending, kept for comparison: the assistant answers the off-topic question instead of refusing.
# AIMessage(content="Let me look into this.", tool_calls=[
#     ToolCall(name="politics_research", args={"query": "Who is the current president of the US?"})]),
# ToolMessage(content="The current president of the US is Donald Trump."),
# AIMessage(content="Donald Trump won the 2024 presidential election; his term runs until January 2029."),

]

In [18]:
sample = MultiTurnSample(user_input=sample_input_4, reference_topics=["science"])
scorer = TopicAdherenceScore(mode="precision")

In [19]:

langchain_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
# Wrap the LangChain model so RAGAS can use it
ragas_llm = LangchainLLMWrapper(langchain_llm)

# Create the scorer with the wrapped LLM
scorer = TopicAdherenceScore(mode="precision")
scorer.llm = ragas_llm

# Run the scoring
await scorer.multi_turn_ascore(sample)

0.6666666666444444

In [6]:
scorer = TopicAdherenceScore(mode="recall")
scorer.llm = ragas_llm

# Run the scoring
await scorer.multi_turn_ascore(sample)

0.99999999995

# **Tool Call Accuracy**


- By default, tool names and arguments are compared using exact string matching. Any RAGAS metric that returns a value between 0 and 1 can be used instead to compare arguments, by setting `arg_comparison_metric`.



### **class ToolCallAccuracy**

1. What is this class

    A metric evaluator that measures how accurately an AI system uses tools in a conversation by comparing the tool calls it actually made against the reference tool calls.

2. What is it doing?

    The class evaluates tool usage accuracy in three main aspects:

    a. Sequence alignment

        Checks whether tools are called in the correct order, using is_sequence_aligned().

    b. Tool name matching

        Checks whether the name of each tool call matches that of the reference (using exact string matching).

    c. Argument accuracy

        Uses _get_arg_score() to compare predicted arguments against reference arguments

        By default, uses exact string matching for argument comparison

        How is the argument score calculated?

            It takes the average of the per-argument match results.
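The three aspects above can be sketched without the library. This is an illustration under stated assumptions, not the RAGAS implementation: `tool_call_accuracy`, `arg_score`, `exact_match`, and `fuzzy_match` are hypothetical names, "aligned" is simplified here to mean the same tool names in the same order (the real `is_sequence_aligned()` may be more lenient), and predicted and reference calls are paired by position.

```python
from difflib import SequenceMatcher
from typing import Any, Callable, Dict, List, Tuple

ToolCallSpec = Tuple[str, Dict[str, Any]]  # (tool name, arguments)
Comparator = Callable[[Any, Any], float]


def exact_match(predicted: Any, reference: Any) -> float:
    """Exact string matching: 1.0 on equality, else 0.0."""
    return 1.0 if str(predicted) == str(reference) else 0.0


def fuzzy_match(predicted: Any, reference: Any) -> float:
    """A softer comparator in [0, 1], built on difflib purely for illustration."""
    return SequenceMatcher(None, str(predicted), str(reference)).ratio()


def arg_score(pred_args: Dict[str, Any], ref_args: Dict[str, Any], compare: Comparator) -> float:
    """Average the comparator over the reference arguments; a missing argument scores 0."""
    if not ref_args:
        return 1.0 if not pred_args else 0.0
    total = 0.0
    for key, ref_value in ref_args.items():
        if key in pred_args:
            total += compare(pred_args[key], ref_value)
    return total / len(ref_args)


def tool_call_accuracy(
    predicted: List[ToolCallSpec],
    reference: List[ToolCallSpec],
    compare: Comparator = exact_match,
) -> float:
    """Return 0.0 when the tool order differs, else the mean argument score."""
    pred_names = [name for name, _ in predicted]
    ref_names = [name for name, _ in reference]
    if not reference or pred_names != ref_names:
        return 0.0
    scores = [arg_score(p, r, compare) for (_, p), (_, r) in zip(predicted, reference)]
    return sum(scores) / len(scores)


reference = [
    ("weather_check", {"location": "New York"}),
    ("temperature_conversion", {"temperature_fahrenheit": 75}),
]
in_order = list(reference)
swapped = list(reversed(reference))
near_miss = [
    ("weather_check", {"location": "New York City"}),
    ("temperature_conversion", {"temperature_fahrenheit": 75}),
]

print(tool_call_accuracy(in_order, reference))   # same calls, same order
print(tool_call_accuracy(swapped, reference))    # same calls, wrong order
print(tool_call_accuracy(near_miss, reference))  # one argument differs under exact matching
print(tool_call_accuracy(near_miss, reference, compare=fuzzy_match))
```

Swapping the comparator shows why a similarity metric can be useful: under exact matching, "New York City" counts as a complete miss, while a similarity score gives partial credit.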





In [26]:
from ragas.metrics import ToolCallAccuracy
from ragas.dataset_schema import MultiTurnSample
from ragas.messages import HumanMessage, AIMessage, ToolMessage, ToolCall

sample = [
    HumanMessage(content="What's the weather like in New York right now?"),
    AIMessage(content="The current temperature in New York is 75°F and it's partly cloudy.", tool_calls=[
        ToolCall(name="weather_check", args={"location": "New York"})
    ]),
    HumanMessage(content="Can you translate that to Celsius?"),
    AIMessage(content="Let me convert that to Celsius for you.", tool_calls=[
        ToolCall(name="temperature_conversion", args={"temperature_fahrenheit": 75})
    ]),
    ToolMessage(content="75°F is approximately 23.9°C."),
    AIMessage(content="75°F is approximately 23.9°C.")
]


In [27]:
sample = MultiTurnSample(
    user_input=sample,
    reference_tool_calls=[
        ToolCall(name="weather_check", args={"location": "New York"}),
        ToolCall(name="temperature_conversion", args={"temperature_fahrenheit": 75})
    ]
)

**IF THE TOOLS ARE NOT CALLED IN THE ORDER SPECIFIED IN reference_tool_calls, THE SCORE IS ZERO!**

In [28]:
scorer = ToolCallAccuracy()
scorer.llm = ragas_llm
await scorer.multi_turn_ascore(sample)

1.0

In [11]:
from ragas.metrics._string import NonLLMStringSimilarity
# Any RAGAS metric that returns a value between 0 and 1 can replace exact matching
# when comparing tool-call arguments. For example:
metric = ToolCallAccuracy()
metric.arg_comparison_metric = NonLLMStringSimilarity()

# **Agent Goal Accuracy**

A binary metric: an LLM judge decides whether the score is 0 or 1.


### **AgentGoalAccuracyWithReference**

1. What is this class?

It is a metric evaluator that measures how well an AI agent achieves the user's intended goal by comparing the actual outcome against a reference (expected) outcome.

2. What is it doing?

It performs this evaluation by:

    a. Workflow analysis:

        Takes conversation (human, ai messages, and tool calls) and:

       - Uses InferGoalOutcomePrompt to extract:

            - user_goal: The original objective

            - end_state: What was actually achieved
    
    b. Outcome Comparison:

        - Uses CompareOutcomePrompt to compare:
                
                - desired_outcome: The reference (expected) outcome
                
                - arrived_outcome: The actual end state

        
        - Returns a binary score:

                - 1.0: The outcomes match
                
                - 0.0: The outcomes differ
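The two-stage flow above can be sketched with the LLM calls replaced by injected functions. Everything here is a stand-in: `GoalOutcome`, `goal_accuracy_with_reference`, `stub_infer`, and `stub_compare` are hypothetical names, and the stubs use toy rules where the real metric would prompt an LLM judge.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class GoalOutcome:
    user_goal: str  # the original objective
    end_state: str  # what was actually achieved


def goal_accuracy_with_reference(
    conversation: List[str],
    reference: str,
    infer: Callable[[List[str]], GoalOutcome],
    compare: Callable[[str, str], bool],
) -> float:
    """Stage 1 infers the end state; stage 2 compares it with the reference."""
    outcome = infer(conversation)
    return 1.0 if compare(reference, outcome.end_state) else 0.0


def stub_infer(conversation: List[str]) -> GoalOutcome:
    # Toy rule: treat the first message as the goal and the last as the end state.
    return GoalOutcome(user_goal=conversation[0], end_state=conversation[-1])


def stub_compare(desired_outcome: str, arrived_outcome: str) -> bool:
    # Toy rule: the outcomes "match" when both mention a completed booking.
    return "booked" in desired_outcome.lower() and "booked" in arrived_outcome.lower()


reference = "Table booked at chinese restaurant at 8 pm"
success = [
    "Hey, book a table at the nearest best Chinese restaurant for 8:00pm",
    "Your table at Golden Dragon is booked for 8:00pm. Enjoy your meal!",
]
failure = [
    "Hey, book a table at the nearest best Chinese restaurant for 8:00pm",
    "Sorry, no tables are available tonight.",
]

score_success = goal_accuracy_with_reference(success, reference, stub_infer, stub_compare)
score_failure = goal_accuracy_with_reference(failure, reference, stub_infer, stub_compare)
print(score_success, score_failure)
```

Passing the judge in as a function keeps the control flow visible and makes the binary nature of the score explicit: whatever the judge decides, the result is either 1.0 or 0.0.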






In [36]:
from ragas.dataset_schema import MultiTurnSample
from ragas.messages import HumanMessage, AIMessage, ToolMessage, ToolCall
from ragas.metrics import AgentGoalAccuracyWithReference


sample = MultiTurnSample(user_input=[
    HumanMessage(content="Hey, book a table at the nearest best Chinese restaurant for 8:00pm"),
    AIMessage(content="Sure, let me find the best options for you.", tool_calls=[
        ToolCall(name="restaurant_search", args={"cuisine": "Chinese", "time": "8:00pm"})
    ]),
    ToolMessage(content="Found a few options: 1. Golden Dragon, 2. Jade Palace"),
    AIMessage(content="I found some great options: Golden Dragon and Jade Palace. Which one would you prefer?"),
    HumanMessage(content="Let's go with Golden Dragon."),
    AIMessage(content="Great choice! I'll book a table for 8:00pm at Golden Dragon.", tool_calls=[
        ToolCall(name="restaurant_book", args={"name": "Golden Dragon", "time": "8:00pm"})
    ]),
    ToolMessage(content="Table booked at Golden Dragon for 8:00pm."),
    AIMessage(content="Your table at Golden Dragon is booked for 8:00pm. Enjoy your meal!"),
    HumanMessage(content="thanks"),
],
    reference="Table booked at chinese restaurant at 8 pm")

scorer = AgentGoalAccuracyWithReference()
scorer.llm = ragas_llm
await scorer.multi_turn_ascore(sample)

1.0

### **AgentGoalAccuracyWithoutReference**


1. What is this class?

    AgentGoalAccuracyWithoutReference is a variant of the goal accuracy evaluator that doesn't require a predefined reference outcome. Instead, it compares the achieved outcome against the inferred user goal from the conversation itself.

2. What is it doing?

    The key difference is that this class:
        
        - Extracts both the user's goal and the end state from the conversation
        
        - Compares these two extracted elements with each other, rather than comparing the end state against a reference

In [31]:
from ragas.dataset_schema import MultiTurnSample
from ragas.messages import HumanMessage, AIMessage, ToolMessage, ToolCall
from ragas.metrics import AgentGoalAccuracyWithoutReference


sample = MultiTurnSample(user_input=[
    HumanMessage(content="Hey, book a table at the nearest best Chinese restaurant for 8:00pm"),
    AIMessage(content="Sure, let me find the best options for you.", tool_calls=[
        ToolCall(name="restaurant_search", args={"cuisine": "Chinese", "time": "8:00pm"})
    ]),
    ToolMessage(content="Found a few options: 1. Golden Dragon, 2. Jade Palace"),
    AIMessage(content="I found some great options: Golden Dragon and Jade Palace. Which one would you prefer?"),
    HumanMessage(content="Let's go with Golden Dragon."),
    AIMessage(content="Great choice! I'll book a table for 8:00pm at Golden Dragon.", tool_calls=[
        ToolCall(name="restaurant_book", args={"name": "Golden Dragon", "time": "8:00pm"})
    ]),
    ToolMessage(content="Table booked at Golden Dragon for 8:00pm."),
    AIMessage(content="Your table at nearest best Chinese restaurant is booked for 8:00pm. Enjoy your meal!"),
    HumanMessage(content="thanks"),
])

scorer = AgentGoalAccuracyWithoutReference()
scorer.llm = ragas_llm
await scorer.multi_turn_ascore(sample)

0.0