<a href="https://colab.research.google.com/github/petrovortex/dls-homework-sem-2/blob/main/agentic_system_project_SUBMIT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **README**

This notebook implements an **Agentic AI system** designed to estimate the **relevance of Map POIs** (Points of Interest) to user queries using Large Language Models.

---

### **Input Data:**
Training (35k rows) and test (570 rows) datasets containing:
1. User Query `Text`: The raw search request.
2. POI Attributes (6 cols): metadata including `name`, `address`, and other descriptions.
3. Ground Truth: The `relevance` score of the POI to the query.

---

### **Base Solution:**

The baseline approach utilizes a direct, **single-pass LLM call** with a generic **Zero-shot prompt**. This serves as a benchmark for performance without agentic capabilities.

---

### **Best Solution:**

To overcome the limitations of the baseline, the following architectural improvements were implemented:

1.  **RAG System (Knowledge Base):** Construction of a vector-based Knowledge Base from the training dataset to enable **Dynamic Few-Shot Prompting**. This allows the model to take into account labeling patterns from similar historical cases.
2.  **Conditional Two-Stage Inference:** If the first pass flags ambiguity (`needs_search`), it triggers a second inference step. Crucially, this step replaces the Knowledge Base examples (which lack useful signal for these specific edge cases) with external search results to make a final decision.
3.  **External Search Tool:** Integration of Tavily API to fetch real-time verification data. This tool is invoked only when the primary model lacks confidence, providing the necessary context to resolve the uncertainty.
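
The control flow described in the list above can be sketched without any LLM or search dependency. This is a minimal illustration, not the notebook's implementation: `Tier1Result`, `two_stage_predict`, and the stub callables are hypothetical names invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Tier1Result:
    """Hypothetical container for the first-pass verdict."""
    relevance: float
    needs_search: bool
    search_query: Optional[str] = None


def two_stage_predict(
    tier1: Callable[[], Tier1Result],
    search: Callable[[str], str],
    tier2: Callable[[str], float],
) -> float:
    """Return the final relevance, escalating to stage two only when requested."""
    first = tier1()
    # Escalate only if the first pass flagged ambiguity AND produced a query.
    if first.needs_search and first.search_query:
        evidence = search(first.search_query)
        return tier2(evidence)
    return first.relevance


# Stubs standing in for the LLM calls and the search tool (all made up).
def confident() -> Tier1Result:
    return Tier1Result(relevance=1.0, needs_search=False)


def unsure() -> Tier1Result:
    return Tier1Result(relevance=0.1, needs_search=True, search_query="pool at camp X")


def fake_search(query: str) -> str:
    return f"no pool found for: {query}"


def fake_tier2(evidence: str) -> float:
    return 0.0 if "no pool" in evidence else 1.0


print(two_stage_predict(confident, fake_search, fake_tier2))  # 1.0
print(two_stage_predict(unsure, fake_search, fake_tier2))     # 0.0
```

The second stage runs only for the ambiguous case, so the search tool is never called when the first pass is already confident.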

---

## **Results:**

The notebook concludes with a quantitative assessment of the impact of the listed architectural improvements.

---

## **Tech Stack:**

*   **LLM Orchestration & Routing:**
    *   **LiteLLM (Router):** Implements usage-based routing and a fallback mechanism across multiple API endpoints (via OpenRouter) to ensure high availability while staying on free-tier models.
    *   **Pydantic:** Used for structured outputs and rigorous data validation of LLM responses (JSON schema enforcement).
*   **Knowledge base components:**
    *   **FAISS:** Vector database for efficient similarity search.
    *   **Sentence-Transformers (`multilingual-e5-large`):** A widely used model for multilingual and cross-lingual embeddings.
*   **Observability & Tracing:**
    *   **Opik (Comet):** Comprehensive agent tracing and evaluation platform. Used to monitor spans, track multi-turn reasoning, and perform quantitative evaluation of the agent's performance.
*   **Search engine:**
    *   **Tavily AI:** Providing clean real-time external context for ambiguous POI queries.


## **0. Data loading**

In [None]:
!pip install -q pandas openai pydantic requests tqdm litellm opik scikit-learn sentence-transformers faiss-cpu tavily-python razdel

In [None]:
import requests
from urllib.parse import urlencode
import pandas as pd
import io

from sklearn.model_selection import train_test_split

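def download_file_from_yadisk(public_key: str) -> io.BytesIO:
    """Download a publicly shared Yandex Disk file and return it as an in-memory buffer."""
    base_url = 'https://cloud-api.yandex.net/v1/disk/public/resources/download?'
    final_url = base_url + urlencode(dict(public_key=public_key))
    # Step 1: resolve the public link to a direct download URL.
    response = requests.get(final_url)
    response.raise_for_status()
    download_url = response.json()['href']

    # Step 2: fetch the file content itself.
    download_response = requests.get(download_url)
    download_response.raise_for_status()
    return io.BytesIO(download_response.content)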

TEST_PUBLIC_LINK = "https://disk.360.yandex.ru/d/aCpPMD--Yi_y5g"
TRAIN_PUBLIC_LINK = "https://disk.360.yandex.ru/d/Y4HNAcJh6_cNog"

try:
    test_file_content = download_file_from_yadisk(TEST_PUBLIC_LINK)
    train_file_content = download_file_from_yadisk(TRAIN_PUBLIC_LINK)

    df_test = pd.read_json(test_file_content, lines=True)
    df_train = pd.read_json(train_file_content, lines=True)

    print(f"Test dataset loaded. Shape: {df_test.shape}")
    print(f"Train dataset loaded. Shape: {df_train.shape}")
    print("Columns test:", df_test.columns.tolist())
    print("Columns train:", df_train.columns.tolist())
except Exception as e:
    print(f"Loading error: {e}")

Test dataset loaded. Shape: (570, 9)
Train dataset loaded. Shape: (35094, 9)
Columns test: ['Text', 'address', 'name', 'normalized_main_rubric_name_ru', 'permalink', 'prices_summarized', 'relevance', 'reviews_summarized', 'relevance_new']
Columns train: ['Text', 'address', 'name', 'normalized_main_rubric_name_ru', 'permalink', 'prices_summarized', 'relevance', 'reviews_summarized', 'relevance_new']


In [None]:
df_kb, df_val = train_test_split(df_train, test_size=0.05, random_state=42, stratify=df_train['relevance_new'])

## **1. Imports and configs**

In [None]:
import os
import json
from typing import List, Optional, Dict, Any, Union

import pickle
import numpy as np
import faiss
import torch
from sentence_transformers import SentenceTransformer, util

import litellm
from litellm import Router
from pydantic import BaseModel, Field, ValidationError, field_validator

from opik import track
from opik.opik_context import get_current_span_data, update_current_span
from litellm.integrations.opik.opik import OpikLogger

from tavily import TavilyClient

from razdel import sentenize
import re

from sklearn.metrics import accuracy_score, classification_report

os.environ["OPIK_API_KEY"] = "..."
os.environ["OPIK_WORKSPACE"] = "default"
os.environ["TAVILY_API_KEY"] = "tvly-dev-..."

opik_logger = OpikLogger()

tavily_client = TavilyClient(api_key=os.environ["TAVILY_API_KEY"])

API_KEYS = [
    "sk-or-v1-...",
    "sk-or-v1-...",
    "sk-or-v1-...",
    "sk-or-v1-...",
    "sk-or-v1-...",
    "sk-or-v1-...",
    "sk-or-v1-...",
    "sk-or-v1-..."
]

MODELS_CONFIG = {
    #"mimo-v2-flash": "openrouter/xiaomi/mimo-v2-flash:free",
    "glm-4.5-air": "openrouter/z-ai/glm-4.5-air:free",
    "deepseek-r1": "openrouter/deepseek/deepseek-r1-0528:free",
    "trinity": "openrouter/arcee-ai/trinity-large-preview:free",
    "nemotron": "openrouter/nvidia/nemotron-3-nano-30b-a3b:free",
}

model_list = []

for alias, model_id in MODELS_CONFIG.items():
    for key in API_KEYS:
        model_list.append({
            "model_name": alias,
            "litellm_params": {
                "model": model_id,
                "api_key": key
            }
        })

llm_router = Router(
    model_list=model_list,
    routing_strategy="usage-based-routing",
    num_retries=len(API_KEYS)-1,
    allowed_fails=1,
    cooldown_time=600
)

class BaseOutput(BaseModel):
    reasoning: str
    relevance: float

    @field_validator('relevance')
    @classmethod
    def snap_relevance(cls, v: float) -> float:
        allowed = [0.0, 0.1, 1.0]
        return min(allowed, key=lambda x: abs(x - v))

class Tier1Output(BaseOutput):
    needs_search: bool
    search_query: Optional[str] = None

class Tier2Output(BaseOutput):
    reasoning: str = Field(description="Final detailed reasoning")

def clean_json_content(content: str) -> str:
    content = re.sub(r'<think>.*?</think>', '', content, flags=re.DOTALL)

    start_index = content.find('{')
    end_index = content.rfind('}')

    if start_index != -1 and end_index != -1:
        return content[start_index : end_index + 1]

    return content.strip()
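
To illustrate what the two post-processing helpers in this cell are meant to handle, the snippet below feeds them made-up inputs. The helper body is repeated from the cell above so the snippet runs on its own; `snap` is a hypothetical plain-function version of the `snap_relevance` validator, and the sample response string is invented.

```python
import json
import re


def clean_json_content(content: str) -> str:
    # Same logic as the helper above: drop <think> blocks, keep the outermost {...}.
    content = re.sub(r'<think>.*?</think>', '', content, flags=re.DOTALL)
    start_index = content.find('{')
    end_index = content.rfind('}')
    if start_index != -1 and end_index != -1:
        return content[start_index : end_index + 1]
    return content.strip()


def snap(value: float) -> float:
    # Same rule as the validator: map any float to the nearest allowed label.
    allowed = [0.0, 0.1, 1.0]
    return min(allowed, key=lambda x: abs(x - value))


# Made-up reasoning-model response with a thinking trace and chatter around the JSON.
raw = (
    '<think>Rubric {pharmacy} matches the query.</think>\n'
    'Here is the answer: {"reasoning": "ok", "relevance": 0.3} Hope it helps.'
)

parsed = json.loads(clean_json_content(raw))
print(parsed["reasoning"])        # ok
print(snap(parsed["relevance"]))  # 0.1
print(snap(0.6))                  # 1.0
```

Stripping the `<think>` block first matters here: the braces inside the trace would otherwise become the start of the extracted span.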

In [None]:
def check_model(model_alias):
    """
    check if model is alive
    """
    messages = [
        {"role": "system", "content": "Answer 'pong' in json field 'answer'"},
        {"role": "user", "content": "ping"}
    ]
    try:
        response = llm_router.completion(
            model=model_alias,
            messages=messages,
            response_format={ "type": "json_object" }
        )
        content = response.choices[0].message.content
        if model_alias == 'deepseek-r1':
            content = clean_json_content(content)
        print(content)
    except Exception as e:
        print(f"LLM Call: {e}")

In [None]:
check_model('deepseek-r1')

{
  "answer": "pong"
}


In [None]:
search_results = tavily_client.search(
    "время работы Советская аптека Махачкала",
    search_depth="basic",
    include_answer=True,
    max_results=5
)

In [None]:
search_results

## **2. Agent components**

### 2.1 Knowledge Base

In [None]:
def get_top_k_chunks(model, query, chunks, k=3):
    if not chunks or len(chunks) <= k:
        return chunks

    query_embedding = model.encode([f"query: {query}"], normalize_embeddings=True)
    chunk_embeddings = model.encode([f"passage: {c}" for c in chunks], normalize_embeddings=True)

    cos_scores = util.cos_sim(query_embedding, chunk_embeddings)[0]

    k_actual = min(k, len(chunks))
    top_results = torch.topk(cos_scores, k=k_actual)
    top_indices = top_results.indices.tolist()

    return [chunks[i] for i in top_indices]
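
The ranking step in `get_top_k_chunks` can be illustrated with toy vectors instead of real embeddings. This is a sketch under stated assumptions: the 2-D vectors and chunk labels are made up, and `cosine` / `top_k_by_cosine` are hypothetical pure-Python stand-ins for the `util.cos_sim` and `torch.topk` calls.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_by_cosine(
    query_vec: Sequence[float],
    chunk_vecs: List[Sequence[float]],
    chunks: List[str],
    k: int = 3,
) -> List[str]:
    """Return the k chunks whose vectors are most similar to the query, best first."""
    if len(chunks) <= k:
        return chunks  # nothing to prune
    ranked = sorted(
        range(len(chunks)),
        key=lambda i: cosine(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return [chunks[i] for i in ranked[:k]]


# Toy 2-D "embeddings" (made up) in place of multilingual-e5-large vectors.
chunks = ["pharmacy", "bakery", "drugstore", "car wash"]
chunk_vecs = [(1.0, 0.1), (0.0, 1.0), (0.9, 0.2), (-1.0, 0.0)]
query_vec = (1.0, 0.0)

print(top_k_by_cosine(query_vec, chunk_vecs, chunks, k=2))  # ['pharmacy', 'drugstore']
```

As in the original function, the result is ordered by similarity rather than by the chunks' original position.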

In [None]:
class KnowledgeBase:
    def __init__(self, dataframe, model_name='intfloat/multilingual-e5-large', index_path="kb_index"):
        self.df = dataframe.reset_index(drop=True)
        self.model_name = model_name
        self.index_path_faiss = f"{index_path}.faiss"
        self.index_path_meta = f"{index_path}.pkl"

        self.device = 'cuda' if torch.cuda.is_available() else 'cpu'
        print(f"Loading embedding model on {self.device}...")
        self.model = SentenceTransformer(model_name, device=self.device)

        self.index = None

        if os.path.exists(self.index_path_faiss) and os.path.exists(self.index_path_meta):
            self.load()
        else:
            self.build_index()

    def _row_to_text(self, row, is_query):
        prefix = "query: " if is_query else "passage: "
        content = f"User request: {row['Text']} | Name: {row['name']} | Address: {row['address']} | Rubric: {row['normalized_main_rubric_name_ru']}"
        return prefix + content

    def build_index(self, batch_size=64):
        print("Building Vector Index...")
        sentences = self.df.apply(lambda x: self._row_to_text(x, is_query=False), axis=1).tolist()

        embeddings = self.model.encode(
            sentences,
            batch_size=batch_size,
            show_progress_bar=True,
            convert_to_numpy=True,
            normalize_embeddings=True
        )

        d = embeddings.shape[1]
        self.index = faiss.IndexFlatIP(d)
        self.index.add(embeddings)

        print(f"Index built. Vectors: {self.index.ntotal}")
        self.save()

    @track(name="vector_search")
    def search(self, row, k=3):
        query_text = self._row_to_text(row, is_query=True)
        query_vec = self.model.encode([query_text], convert_to_numpy=True, normalize_embeddings=True)

        distances, indices = self.index.search(query_vec, k + 1)

        results = []
        for i, idx in enumerate(indices[0]):
            if i >= k: break

            match_row = self.df.iloc[idx]
            name_chunks = [n.strip() for n in str(match_row['name']).split(';') if n.strip()]

            top_names = get_top_k_chunks(
                self.model,
                query=match_row['Text'],
                chunks=name_chunks,
                k=2
            )

            results.append({
                "text": match_row['Text'],
                "name": " | ".join(top_names),
                "relevance": match_row['relevance'],
                "distance": float(distances[0][i])
            })
        return results

    def save(self):
        faiss.write_index(self.index, self.index_path_faiss)
        with open(self.index_path_meta, "wb") as f:
            pickle.dump(self.df, f)

    def load(self):
        self.index = faiss.read_index(self.index_path_faiss)
        with open(self.index_path_meta, "rb") as f:
            self.df = pickle.load(f)
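
`KnowledgeBase` combines `normalize_embeddings=True` with `faiss.IndexFlatIP`, an exhaustive inner-product index. On unit-length vectors the inner product equals cosine similarity, which the pure-Python sketch below demonstrates. The helper names and toy vectors are hypothetical, and the brute-force loop only mimics what the index computes.

```python
import math
from typing import List, Tuple


def l2_normalize(v: List[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def inner_product(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def flat_ip_search(index_vectors: List[List[float]], query: List[float], k: int) -> List[Tuple[float, int]]:
    """Score every stored vector against the query and return the k best (score, id) pairs."""
    scored = [(inner_product(vec, query), i) for i, vec in enumerate(index_vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Made-up raw vectors; after normalization only their direction matters.
raw_vectors = [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]]
index_vectors = [l2_normalize(v) for v in raw_vectors]
query = l2_normalize([6.0, 8.0])  # same direction as the first stored vector

hits = flat_ip_search(index_vectors, query, k=2)
for score, idx in hits:
    print(idx, round(score, 3))
```

One consequence worth keeping in mind: the value stored under `"distance"` in `search` is a similarity score, so higher means closer.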

In [None]:
from google.colab import drive
drive.mount('/content/drive')
drive_folder = "/content/drive/MyDrive/YandexMaps_Agent_Project"
os.makedirs(drive_folder, exist_ok=True)

In [None]:
index_path_drive = os.path.join(drive_folder, "kb_index")
kb = KnowledgeBase(df_kb, index_path=index_path_drive)

### 2.2 Prompts

In [None]:
vectorizer = kb.model

In [None]:
def format_input_context(row: pd.Series, has_reviews=True) -> str:
    """
    Constructs a concise context prompt by semantically filtering POI attributes against the user query.

    The function splits composite fields (names, prices, reviews) into chunks and
    retains only the top-k segments that are semantically closest to the query (using vector search).
    This ensures the context remains focused on relevant details.

    Args:
        row (pd.Series): A row containing POI metadata (Text, name, address, etc.).
        has_reviews (bool): Whether to include summarized reviews in the output.

    Returns:
        str: A formatted XML-like string containing the filtered object details.
    """
    user_text = row['Text']

    name_chunks = [n.strip() for n in str(row['name']).split(';') if n.strip()]
    top_names = get_top_k_chunks(vectorizer, user_text, name_chunks, k=3)
    name_str = " | ".join(top_names)

    prices_str = "No data"
    if pd.notna(row['prices_summarized']):
        parts = re.split(r'[|\n]', str(row['prices_summarized']))
        parts = [p.strip() for p in parts if p.strip()]
        if parts:  # skip fields that contain only separators or whitespace
            # First segment is the general summary; the rest are detail chunks.
            summary, chunks = parts[0], parts[1:]
            top_chunks = get_top_k_chunks(vectorizer, user_text, chunks, k=3)
            prices_str = f"{summary} [Details: {' | '.join(top_chunks)}]" if top_chunks else summary

    reviews_str = "No data"
    if pd.notna(row['reviews_summarized']) and has_reviews:
        parts = re.split(r'[|\n]', str(row['reviews_summarized']))
        parts = [r.strip() for r in parts if r.strip()]
        if parts:  # skip fields that contain only separators or whitespace
            # First segment is the general summary; the rest are review snippets.
            summary, chunks = parts[0], parts[1:]
            top_chunks = get_top_k_chunks(vectorizer, user_text, chunks, k=3)
            reviews_str = f"{summary} [Key snippets: {' | '.join(top_chunks)}]" if top_chunks else summary

    return f"""
<context>
  <user_query>"{user_text}"</user_query>

  <object_details>
    - RUBRIC: {row['normalized_main_rubric_name_ru']}
    - NAME: {name_str}
    - ADDRESS: {row['address']}
    - PRICE_INFO: {prices_str}
    - REVIEW_HIGHLIGHTS: {reviews_str}
  </object_details>
</context>
"""

In [None]:
def get_base_instructions():
    return """
You are a search relevance expert. Your goal is to assess if a map object satisfies a user query.

### DATA STRUCTURE INTERPRETATION:
- **SUMMARY (General Info):** This is a broad overview. If it says "The shop sells electronics", it's a general truth.
- **HIGHLIGHTS/DETAILS (Evidence):** These are specific snippets selected as most relevant to the query. They are NOT an exhaustive list.
- **PRUNED NAMES:** You only see names most similar to the query.

### VALID RELEVANCE VALUES:
- 1.0: Perfect match. The object definitely provides what the user wants.
- 0.0: Irrelevant. Wrong category, closed, or completely unrelated.
- 0.1: Partial/Unsure. The object might be relevant, but specific IMPORTANT constraints (price, specific service, rare item) are not explicitly confirmed in the provided snippets.

### INSTRUCTIONS:
1. Think step-by-step. Detect the **USER INTENT**: Is the user looking for a specific **ITEM/SERVICE** (e.g., "buy pills", "sauna") or a specific **VENUE TYPE** (e.g., "Pharmacy", "Recreation Base")?
2. **IF USER WANTS A VENUE TYPE (Category Search):**
   - **RUBRIC PRIORITY:** The object's `Rubric` must semantically match the requested venue type.
   - **MISMATCH PENALTY:** If the user asks for "Category A" (e.g., "Holiday Center") and the object is "Category B" (e.g., "Sports Camp"), the relevance is likely **0.0**, even if they share some features (like saunas or beds).
   - *Exception:* Only give 0.1 or 1.0 if Category B is closely related (like "кафе" and "ресторан") or a direct sub-type or synonym of Category A.
3. **IF USER WANTS AN ITEM/SERVICE:**
   - **CATEGORY LOGIC:** If the user asks for a COMMON item (e.g., "aspirin") and the object is a standard provider (Pharmacy), **RELEVANCE IS 1.0**.
   - **PARTIAL INVENTORY (The "Open List" Rule):**
     - Treat 'Prices' and 'Description' as **incomplete examples**, not a full catalog.
     - If the object sells "Bags" but lists only "Wallets", assume it MIGHT sell "Suitcases".
     - **DO NOT DOWNGRADE TO 0.0** just because a specific item or brand is missing from the text description, unless the category makes it impossible (e.g., asking for "Suitcase" in a "Bakery").
     - **Verdict:** If Category matches but Item/Brand is unconfirmed -> Relevance is **0.1**.
   - Use 0.1 for RARE/SPECIFIC items where availability is truly unknown.
4. Pay attention to "Hard Constraints" (open now, free wifi) in user request.
5. QUERY RECOVERY (Typos & Layout Errors):
   - DETECT NOISE: Check if the query contains obvious typos (e.g., "купить машинист" instead of "купить машину") or wrong keyboard layout patterns (e.g., gibberish text that maps to meaningful words in another language/layout).
   - RECONSTRUCT INTENT: If the literal query is nonsensical but a highly probable correction exists, evaluate the object based on the CORRECTED query.
"""

In [None]:
def get_output_format(schema):
    return f"""
### **OUTPUT FORMAT:**
Return strictly ONLY a valid JSON object (without thinking tags) matching this schema.
{schema}
"""

In [None]:
def get_rag_instructions():
    return """
### HANDLING HISTORICAL EXAMPLES:
- Use provided "SIMILAR HISTORICAL CASES" to calibrate your judgment (see how strict the scoring should be for this category).
- **WARNING:** Historical labels can be noisy or based on outdated info.
- If a historical case contradicts the current data or common sense, prioritize your own analysis of the current object.
"""

In [None]:
def get_search_instructions():
    return """
### UNCERTAINTY & SEARCH:
- If your analysis leads to **0.1**, you MUST set `needs_search` to `true`.
- **SEARCH QUERY RULES:**
  1. Language: Russian.
  2. Format: [Missing feature] + [Object Name] + [Address or City].
  3. Example: "наличие бассейна в лагере Остров детства Тюмень" or "меню кафе Ромашка на Ленина 10".
- If relevance is definitively 0.0 or 1.0, set `needs_search` to `false`.
"""

In [None]:
def get_system_prompt(is_tier2=False, use_rag=False, use_search=False) -> str:
    """
    Collects system prompt:
    1. Base instructions
    2. RAG instructions (optional)
    3. Search instructions (optional)
    4. Output format instructions
    """
    system_prompt = get_base_instructions()
    if use_rag:
        system_prompt += get_rag_instructions()
    if use_search:
        system_prompt += get_search_instructions()
    if is_tier2:
        system_prompt += get_output_format(Tier2Output.model_json_schema())
    else:
        system_prompt += get_output_format(Tier1Output.model_json_schema())
    return system_prompt

In [None]:
def format_rag_context(examples: list) -> str:
    if not examples:
        return "No similar examples found."

    text = ""
    for i, ex in enumerate(examples):
        text += f"""---
[Historical Example {i+1}]
Query: "{ex['text']}"
Object: "{ex['name']}"
Relevance: {ex['relevance']}
---"""
    return text

In [None]:
def get_user_prompt(
        context: str,
        examples: Optional[str] = None,
        search_results: Optional[str] = None
    ):
    sections = []

    sections.append(context)

    if examples:
        sections.append(f"<historical_examples>\n{examples}\n</historical_examples>")

    if search_results:
        sections.append(f"<web_search_results>\n{search_results}\n</web_search_results>")

    sections.append("Analyze the data above and provide your assessment in JSON format.")

    return "\n\n".join(sections)
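
As a usage illustration, the snippet below shows how the same assembly logic yields different prompts depending on which optional parts are supplied. The function body is repeated from above so the snippet runs standalone; the context, example, and search strings are made up.

```python
from typing import Optional


def get_user_prompt(
    context: str,
    examples: Optional[str] = None,
    search_results: Optional[str] = None,
) -> str:
    # Same assembly logic as the function above.
    sections = [context]
    if examples:
        sections.append(f"<historical_examples>\n{examples}\n</historical_examples>")
    if search_results:
        sections.append(f"<web_search_results>\n{search_results}\n</web_search_results>")
    sections.append("Analyze the data above and provide your assessment in JSON format.")
    return "\n\n".join(sections)


# Made-up inputs for illustration only.
first_pass = get_user_prompt("<context>...</context>", examples="[Historical Example 1] ...")
second_pass = get_user_prompt("<context>...</context>", search_results="The pharmacy is open 24/7.")

print("<web_search_results>" in first_pass)   # False: no search has run yet
print("<web_search_results>" in second_pass)  # True
```

Because empty or missing sections are skipped entirely, the model never sees an empty `<historical_examples>` or `<web_search_results>` tag.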

### 2.3 Agent class

In [None]:
class AgentConfig(BaseModel):
    model_alias: str
    use_rag: bool = False
    use_search: bool = False
    name: str  # Experiment name, e.g. "deepseek_rag"

In [None]:
class Agent:
    def __init__(
            self, router: Router,
            kb: KnowledgeBase,
            search_client: TavilyClient,
            config: AgentConfig
    ):
        self.router = router
        self.kb = kb
        self.search_client = search_client
        self.config = config

    @track
    def _call_llm(
        self,
        messages: List[Dict[str, str]],
        response_model: type[BaseOutput],  # the Pydantic class itself, not an instance
        step_name: str,
        timeout: int = 180,
        max_retries: int = 3
    ):
        update_current_span(name=step_name)

        current_messages = list(messages)

        completion_params = {
            "model": self.config.model_alias,
            "timeout": timeout,
            "response_format": response_model
        }

        for attempt in range(max_retries):
            try:
                completion_params["messages"] = current_messages

                response = self.router.completion(**completion_params)
                content = clean_json_content(response.choices[0].message.content.strip())

                return response_model.model_validate_json(content)

            except (ValidationError, json.JSONDecodeError) as e:
                print(f"⚠️ JSON Validation Error in {step_name} (Attempt {attempt + 1}/{max_retries}): {e}")

                if attempt == max_retries - 1:
                    print(f"❌ Failed to validate JSON after {max_retries} attempts.")
                    return None

                current_messages.append({"role": "assistant", "content": content})
                error_feedback = (
                    f"Your previous response caused a JSON validation error: {str(e)}. "
                    f"Please fix the JSON structure to match the schema exactly. "
                    f"Return ONLY the JSON."
                )
                current_messages.append({"role": "user", "content": error_feedback})

            except Exception as e:
                error_msg = str(e).lower()

                if "timeout" in error_msg:
                    print(f"⏳ Timeout Error in {step_name} after {timeout}s")
                    return None

                print(f"🛑 Critical LLM Error in {step_name}: {e}")
                print("Stopping execution safely...")

                raise KeyboardInterrupt

    @track(name="tavily_search")
    def _execute_search(self, query: str) -> str:
        try:
            search_answer = self.search_client.search(
                query,
                include_answer=True,
                max_results=5
            )['answer']

            return search_answer

        except Exception as e:
            return f"Search failed: {e}"

    @track(name="relevance_prediction")
    def predict(self, row: pd.Series) -> dict:
        tier1_system_prompt = get_system_prompt(
            is_tier2=False,
            use_rag=self.config.use_rag,
            use_search=self.config.use_search
        )

        context = format_input_context(row)
        examples = None
        if self.config.use_rag:
            examples_list = self.kb.search(row, k=2)
            examples = format_rag_context(examples_list)

        tier1_user_prompt = get_user_prompt(context, examples)

        messages_tier1 = [
            {"role": "system", "content": tier1_system_prompt},
            {"role": "user", "content": tier1_user_prompt}
        ]

        tier1_output = self._call_llm(messages_tier1, Tier1Output, step_name="tier1_preliminary_analysis")

        trace_log = {
            "tier1_reasoning": None,
            "tier1_relevance": None,
            "needs_search": None,
            "search_query": None,
            "search_results": None,
            "tier2_reasoning": None,
            "tier2_relevance": None,
            "final_relevance": None
        }

        if not tier1_output:
            return trace_log

        trace_log["tier1_reasoning"] = tier1_output.reasoning
        trace_log["tier1_relevance"] = tier1_output.relevance
        trace_log["needs_search"] = tier1_output.needs_search
        trace_log["search_query"] = tier1_output.search_query
        trace_log["final_relevance"] = tier1_output.relevance

        should_search = tier1_output.needs_search and self.config.use_search

        if should_search and tier1_output.search_query:
            search_results = self._execute_search(tier1_output.search_query)
            trace_log["search_results"] = search_results

            tier2_system_prompt = get_system_prompt(
                is_tier2=True,
                use_rag=self.config.use_rag,
                use_search=False
            )

            poor_context = format_input_context(row, has_reviews=False)
            tier2_user_prompt = get_user_prompt(poor_context, examples, search_results)

            messages_tier2 = [
                {"role": "system", "content": tier2_system_prompt},
                {"role": "user", "content": tier2_user_prompt}
            ]

            tier2_output = self._call_llm(messages_tier2, Tier2Output, step_name="tier2_final_analysis")

            if tier2_output:
                trace_log["tier2_reasoning"] = tier2_output.reasoning
                trace_log["tier2_relevance"] = tier2_output.relevance
                trace_log["final_relevance"] = tier2_output.relevance

        return trace_log
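
The validation-and-repair loop inside `_call_llm` can be studied in isolation. The sketch below is a simplified stand-in using only the standard library: `validate`, `call_with_repair`, and `fake_llm` are hypothetical names, and the hand-written check replaces Pydantic validation purely for illustration.

```python
import json
from typing import Callable, Dict, List, Optional

Message = Dict[str, str]


def validate(payload: str) -> dict:
    """Minimal stand-in for schema validation: parse JSON, require a numeric 'relevance'."""
    data = json.loads(payload)
    if not isinstance(data, dict) or not isinstance(data.get("relevance"), (int, float)):
        raise ValueError("field 'relevance' must be a number")
    return data


def call_with_repair(
    llm: Callable[[List[Message]], str],
    messages: List[Message],
    max_retries: int = 3,
) -> Optional[dict]:
    history = list(messages)  # copy so the caller's list is not mutated
    for attempt in range(max_retries):
        content = llm(history)
        try:
            return validate(content)
        except ValueError as err:  # json.JSONDecodeError is a subclass of ValueError
            if attempt == max_retries - 1:
                return None
            # Show the model its own invalid answer plus the error, then retry.
            history.append({"role": "assistant", "content": content})
            history.append({"role": "user", "content": f"Validation error: {err}. Return ONLY valid JSON."})
    return None


def fake_llm(history: List[Message]) -> str:
    """Made-up model: answers in prose first, then in JSON once it has seen feedback."""
    saw_feedback = any(m["content"].startswith("Validation error") for m in history)
    return '{"relevance": 0.1}' if saw_feedback else "relevance is 0.1"


result = call_with_repair(fake_llm, [{"role": "user", "content": "Rate this POI."}])
print(result)  # {'relevance': 0.1}
```

Feeding the invalid answer back as an assistant turn gives the model something concrete to correct, which is the design choice the original loop makes as well.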

## **3. Evaluation**


In [None]:
from opik.evaluation.metrics.score_result import ScoreResult
from opik.evaluation.metrics import Equals, BaseMetric
from opik.evaluation import evaluate
from opik import Opik

from functools import partial

In [None]:
def prepare_dataset(name, df, sample_size=40):
    client = Opik()
    dataset = client.get_or_create_dataset(name=name)

    if not dataset.dataset_items_count:  # None or 0: the dataset is still empty
        sample = df.sample(n=min(sample_size, len(df)), random_state=42)
        records = [
            {
                "input_data": row.to_dict(),
                "reference": str(float(row['relevance_new']))
            }
            for _, row in sample.iterrows()
        ]
        dataset.insert(records)

    return dataset

In [None]:
def evaluation_task(item, agent):
    row = pd.Series(item["input_data"])
    trace_log = agent.predict(row)

    final_rel = trace_log.get("final_relevance")
    output_val = str(float(final_rel)) if final_rel is not None else "-1.0"

    return {
        "output": output_val,
        "tier1_relevance": trace_log.get("tier1_relevance"),
        "tier1_reasoning": trace_log.get("tier1_reasoning"),
        "search_results": trace_log.get("search_results"),
        "search_query": trace_log.get("search_query"),
        "tier2_reasoning": trace_log.get("tier2_reasoning"),
        "tier2_relevance": trace_log.get("tier2_relevance")
    }

In [None]:
class MyAccuracyMetric(BaseMetric):
    def __init__(self):
        super().__init__(name="Accuracy")

    def score(self, output, reference, **kwargs):
        if output is None:
            return ScoreResult(value=-1.0, name=self.name)

        try:
            matches = float(output) == float(reference)
            val = 1.0 if matches else 0.0
        except (TypeError, ValueError):
            # An unparseable prediction or reference counts as a miss.
            val = 0.0

        return ScoreResult(value=val, name=self.name)

class RelevanceQualityMetric(BaseMetric):
    def __init__(self):
        super().__init__(name="Soft Match Score")

    def score(self, output, reference, **kwargs):
        if output is None:
            return ScoreResult(value=-1.0, name=self.name)

        if isinstance(output, dict):
            pred = output.get("output", -1)
        else:
            pred = output

        pred = float(pred)
        target = float(reference)

        if pred == target:
            val = 1.0
        else:
            pair = {pred, target}
            if pair == {0.0, 1.0}:
                val = 0.0
            else:
                val = 0.5

        return ScoreResult(value=val, name=self.name)

In [None]:
dataset = prepare_dataset("Search_Relevance_Benchmark_40", df_val)

model_aliases = ["glm-4.5-air", "trinity", "nemotron"]

metrics = [
    MyAccuracyMetric(),
    RelevanceQualityMetric()
]

configs = [
    {"use_rag": True, "use_search": True, "label": "full"},
    {"use_rag": True, "use_search": False, "label": "rag"},
    {"use_rag": False, "use_search": True, "label": "search"},
    {"use_rag": False, "use_search": False, "label": "baseline"}
]

In [None]:
for model in model_aliases:
    for cfg in configs:
        agent_config = AgentConfig(
            model_alias=model,
            use_rag=cfg["use_rag"],
            use_search=cfg["use_search"],
            name=f"{model}_{cfg['label']}"
        )

        agent = Agent(router=llm_router, kb=kb, search_client=tavily_client, config=agent_config)
        task = partial(evaluation_task, agent=agent)

        evaluate(
            dataset=dataset,
            task=task,
            task_threads=1,
            scoring_metrics=metrics,
            experiment_name=agent_config.name,
            project_name='POI_relevance',
        )

## **4. Results**

### **Metrics Definition**
To capture a nuanced view of model performance, two metrics were used:
1.  **Accuracy:** Exact match between the predicted and ground truth relevance scores.
2.  **Soft Match Score:** A weighted metric that gives partial credit (0.5) for "near-miss" errors (specifically cases involving the `0.1` relevance class) and no credit for binary flips between `0` and `1`.
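
A small worked example may make the Soft Match Score concrete. The function below restates the scoring rule from `RelevanceQualityMetric` as a plain function; `soft_match` is a hypothetical name and the prediction/label pairs are invented.

```python
def soft_match(pred: float, target: float) -> float:
    """Full credit for exact matches, none for 0/1 flips, half credit otherwise."""
    if pred == target:
        return 1.0
    if {pred, target} == {0.0, 1.0}:
        return 0.0
    return 0.5


# Made-up (prediction, ground truth) pairs.
pairs = [(1.0, 1.0), (0.1, 1.0), (0.0, 0.1), (0.0, 1.0)]
scores = [soft_match(p, t) for p, t in pairs]

print(scores)                     # [1.0, 0.5, 0.5, 0.0]
print(sum(scores) / len(scores))  # 0.5
```

On the same four pairs, exact-match Accuracy would be 0.25, which shows how the soft score separates near misses from outright flips. Under this rule, any mismatch other than a 0/1 flip earns 0.5, so a failed prediction encoded as `-1.0` would also receive half credit.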

### **1. Component Impact (Ablation Study)**
A series of experiments was conducted using **Trinity-Large** and **Nemotron-3-Nano** to evaluate the incremental value of each architectural component.
*   **Baseline vs. RAG/Search:** For both models, we observed a consistent upward trend in performance. Adding RAG and Web Search capabilities improved accuracy by **~15-20%** compared to the Zero-shot baseline.
*   **Agentic Workflow (Full):** The "Full" configuration (System Prompts + RAG + Search + Reflection) yielded the highest scores for these models, peaking at **0.70 Accuracy** for Nemotron-3-Nano.

![Evaluation Results](https://raw.githubusercontent.com/petrovortex/dls-homework-sem-2/main/img/trinity.png)

![Evaluation Results](https://raw.githubusercontent.com/petrovortex/dls-homework-sem-2/main/img/nemotron.png)

### **2. Scaling & Model Architecture**
The impact of model scale and reasoning capabilities was analyzed across different architectures:
*   **GLM-4.5-Air:** Demonstrated the highest overall performance with **0.85 Accuracy**. While significantly more accurate, it introduced a latency trade-off, with an average response time of **~30s**.
*   **Reasoning Model (DeepSeek-R1):** Despite its reasoning capabilities, DeepSeek-R1 showed lower performance in this specific task. It exhibited high latency (**88s avg.**) and frequently **struggled with JSON schema** enforcement, leading to validation errors.
*   **Small Model (GPT-5-Nano):** Showed limited utility for this complex relevance estimation task, significantly underperforming compared to mid-sized alternatives.

![Evaluation Results](https://raw.githubusercontent.com/petrovortex/dls-homework-sem-2/main/img/all_full.png)

![Evaluation Results](https://raw.githubusercontent.com/petrovortex/dls-homework-sem-2/main/img/info.png)

### **3. Experimental Constraints**
*   **Sample Size:** Due to latency and rate limits, experiments for **DeepSeek-R1** and **GPT-5-Nano** were run on a smaller subset of the validation data, so their scores are not directly comparable with the other models.
*   **Operational Factors:** Average durations may have been influenced by fluctuating network stability and API response variability during the evaluation process.

### **4. Potential Enhancements: Programmatic Prompt Optimization**
A planned next step for this project is to migrate the agent logic to the DSPy framework.

**Objective**: To move away from manual prompts.

According to DSPy's documentation, the system can automatically generate more robust instructions and few-shot examples by learning directly from failure patterns identified in the training dataset.

This would allow for a rigorous comparison between human-engineered system prompts and those optimized by DSPy's compiler, potentially leading to higher performance and better generalization across different model scales.