# 🧩 ACME Q&A Agent Assignment

---

## Part 1: Agent and Tooling

1. **Google AI Studio Application**
    Google AI Studio app: https://ai.studio/apps/drive/1FYrTQywe4P9926W_LH4epOd-e6sjdmk9

2. **Exported Conversation JSON Files**
    `data/part_1_scenarios`

The agent was generated with Gemini in Google AI Studio. However, the resulting code base did not connect smoothly and required some debugging: the GeminiAgent was receiving incorrect outputs from the ACME API, which I fixed manually by formatting the tool output properly.

Most of the scenarios requested in Part 1 of the task are already fulfilled by the @google/genai implementation. For instance, the reasoning chain and multi-hop reasoning it produces likely come from how `gemini-2.5-flash` was trained.

I had the most trouble with `Scenario 4: Ambiguous Query Requiring Disambiguation`. This is likely because the @google/genai implementation encourages the agent to call tools. When I added the instruction to ask clarifying questions before calling tools as just one of the rules in the system prompt, it did not work and was overridden by the agent implementation every time. I had to redesign the entire system prompt to force the agent to first "classify" the user query and only then decide when and how to call the tools, as in the following excerpt:


>STEP 1: ASSESS QUERY CLARITY
>Ask yourself: "Is this query specific enough to answer?"
>
>VAGUE QUERIES - Ask clarifying questions WITHOUT calling tools:
>Examples:
>- "Tell me about ACME's people" → Ask: "Would you like to know about our leadership team, HR policies, employee benefits, or something else?"
>- "What products do we have?" → Ask: "Are you interested in a complete list, specific product categories, or details about a particular product?"
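
Whether the redesigned prompt has the intended effect on a vague query can also be checked deterministically on an exported conversation. The following is a minimal sketch, assuming the Part 1 exports use the same turn layout as the Part 2 checks (`tool_invocations` and `model_response.content`). The helper name `asked_before_calling_tools` is hypothetical, and the question-mark test is only a rough proxy for "asked a clarifying question".

```python
def asked_before_calling_tools(turn):
    """Return True if the turn made no tool calls and the reply contains a question.

    Assumption: the turn layout matches the exported conversation JSON used in
    the Part 2 checks. The '?' heuristic is illustrative, not a robust detector.
    """
    made_tool_calls = bool(turn.get("tool_invocations", []))
    reply = turn.get("model_response", {}).get("content", "")
    return (not made_tool_calls) and "?" in reply


# Hypothetical turn for a vague query such as "Tell me about ACME's people"
vague_turn = {
    "type": "user_query",
    "tool_invocations": [],
    "model_response": {
        "content": "Would you like to know about our leadership team or HR policies?"
    },
}
print(asked_before_calling_tools(vague_turn))
```

A turn that calls a tool before asking anything, or that answers without a question, would be reported as `False`.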


## Part 2: Evaluation and Learning Pipeline

### Insights From Logs

In a way, this agent is too good: I used up all free requests on three separate Google accounts trying to find any issues.

The QA works well on concrete, factoid questions.

The issues are revealed when the conversations get longer. This requires having multiple conversations with over 5 turns, which quickly depletes request quotas.

Due to time constraints and request quotas, I have decided to stop at two concrete issues I found:

- *redundant calls*: the agent very often does not consult conversation history and keeps sending requests to the same file;
- *insufficient reasoning search / gives up fast*: the agent does not consult conversation history nor queries for another document, but states the answer to the question is not in the file or tries to ask clarifying questions instead of looking for relevant information in the document.

I have implemented two functions to deterministically check whether the above issues appear in conversations. I also have a quick citation check function to demonstrate a check that works relatively well and can be saved to insights. I made sure these metrics are applicable to all conversations and produce concrete outputs. This allows the insights to be used in regression tests: the metric values should be checked every time the agent is updated to verify that fixing one issue did not break any other component.

#### No Citations

An additional insight: whether a citation is present and valid. This check rarely fails.

This function will return False for cases where the question is not covered by the documents, in which case it is correct to have no citation. It is normal for some percentage of responses to be uncited, but it is still a good statistic to track across agent updates.

This function can be improved by further distinguishing whether the citation was missing from the answer entirely or the wrong file was cited.

In [1]:
import re

cite_prefixes = ['source:', 'from:', 'based on:', 'sources:']
prefix_regex = r'%s (.*\.mdc)'

def correct_sources_cited(turn):
    # IDs of documents fetched in this turn, lowercased to match the lowercased response below
    doc_ids_from_tools = {call['arguments']['id'].lower() for call in turn['tool_invocations'] if 'id' in call['arguments']}
    if not doc_ids_from_tools:
        return False
    
    model_response = turn['model_response']['content'].replace('\n', ' ').lower()
    
    for cite_prefix in cite_prefixes:
        cited_docs = re.search(prefix_regex % cite_prefix, model_response)
        if cited_docs:
            cited_docs = set(cited_docs.group(1).split(', '))
            if len(cited_docs.intersection(doc_ids_from_tools)):
                return True
    return False

def count_no_citations(conversation_json):
    not_cited = 0
    for turn in conversation_json['turns']:
        if turn['type'] != 'user_query':
            continue
        
        if not correct_sources_cited(turn):
            not_cited += 1
    return not_cited
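
The improvement mentioned above, separating a missing citation from a citation of the wrong file, could look roughly like the sketch below. It reuses the same prefixes and `.mdc` pattern as `correct_sources_cited`. The function name `classify_citation`, the label strings, and the example document IDs are hypothetical.

```python
import re

CITE_PREFIXES = ["source:", "from:", "based on:", "sources:"]


def classify_citation(turn):
    """Return 'correct', 'wrong_file', 'missing' or 'no_tool_calls' for one turn."""
    # Documents the agent actually fetched in this turn
    retrieved = {
        call["arguments"]["id"].lower()
        for call in turn.get("tool_invocations", [])
        if "id" in call.get("arguments", {})
    }
    if not retrieved:
        return "no_tool_calls"

    response = turn["model_response"]["content"].replace("\n", " ").lower()

    # Collect every file name that follows one of the citation prefixes
    cited = set()
    for prefix in CITE_PREFIXES:
        match = re.search(re.escape(prefix) + r" (.*\.mdc)", response)
        if match:
            cited.update(name.strip() for name in match.group(1).split(","))

    if not cited:
        return "missing"
    return "correct" if cited & retrieved else "wrong_file"


# Hypothetical turn: the agent fetched one file but cited another
turn = {
    "tool_invocations": [{"tool_name": "getFull", "arguments": {"id": "hr_policy.mdc"}}],
    "model_response": {"content": "Vacation is 20 days.\nSource: products.mdc"},
}
print(classify_citation(turn))  # wrong_file
```

Counting the four labels separately would show whether uncited answers come from turns without retrieval, from missing citations, or from citations of the wrong document.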

#### Redundant calls

We can easily detect redundant calls by checking whether any document ID was requested more than once by the same tool within one conversation.

The goal is to have 0 redundant calls on all conversations.

In [2]:
import json
import os
import re

In [3]:
def count_redundant_calls(conversation_json):
    num_redundant_calls = 0
    tool_invocations = [call for turn in conversation_json['turns'][1:] for call in turn['tool_invocations']]
    
    doc_id_outline = [call['arguments']['id'] for call in tool_invocations if call['tool_name'] == 'getOutline']
    doc_id_full = [call['arguments']['id'] for call in tool_invocations if call['tool_name'] == 'getFull']

    # difference between document queried and number of unique documents queried
    num_redundant_calls += len(doc_id_outline) - len(set(doc_id_outline))
    num_redundant_calls += len(doc_id_full) - len(set(doc_id_full))
    
    return num_redundant_calls
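
For debugging it can help to see which documents were fetched repeatedly, not only how many repeats there were. The following is a small sketch under the same assumptions about the conversation JSON. Unlike `count_redundant_calls`, it looks at every turn and at every tool that takes an `id` argument, so its totals may differ. The name `repeated_documents` and the example IDs are hypothetical.

```python
from collections import Counter


def repeated_documents(conversation_json):
    """Map (tool_name, document id) to the number of extra calls beyond the first."""
    calls = Counter(
        (call["tool_name"], call["arguments"]["id"])
        for turn in conversation_json["turns"]
        for call in turn.get("tool_invocations", [])
        if "id" in call.get("arguments", {})
    )
    return {key: count - 1 for key, count in calls.items() if count > 1}


# Hypothetical conversation: the same file is fetched in two different turns
conversation = {
    "turns": [
        {"tool_invocations": [{"tool_name": "getFull", "arguments": {"id": "a.mdc"}}]},
        {"tool_invocations": [
            {"tool_name": "getFull", "arguments": {"id": "a.mdc"}},
            {"tool_name": "getOutline", "arguments": {"id": "b.mdc"}},
        ]},
    ]
}
print(repeated_documents(conversation))  # {('getFull', 'a.mdc'): 1}
```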

#### Insufficient Search

There are many questions the agent cannot answer, and for those a reply like "I am unable to answer" is appropriate. However, for queries whose answer we know is in the documents, we can detect early giving up by scanning for keywords. This insight will flag both unanswerable and answerable questions, but we can additionally label answerable questions and track this metric for them. We can also track this metric in general to see whether the agent has more successful conversations after changes to the code base or an expanded document collection.

**Note:** This is essentially a bad retrieval issue. This deterministic check can be made more accurate with some NLP techniques. For instance, we can check whether the main constituent phrases from the user query appear in any documents which the agent did not call. A more advanced version of this is BM25 search. A BM25 retriever is relatively light, and we can compare LLM reasoning performance against retriever matches during regression tests.

In [4]:
def detect_insufficient_search(conversation_json):
    """
    Detects when agent gives up without adequate searching
    """

    for turn in conversation_json['turns']:
        if turn['type'] != 'user_query':
            continue
        
        response = turn['model_response']['content'].lower()
        tool_calls = turn.get('tool_invocations', [])
        
        # Keywords indicating agent gave up (all lowercase, because the response is lowercased above)
        gave_up_phrases = [
            "don't have information",
            "not available in the documentation",
            "isn't in the documentation",
            "couldn't find",
            "not in the available documentation",
            "can you clarify",
            "which would you like to know",
            "can you be more specific",
            "i'm sorry",
            "i cannot determine",
        ]
        
        agent_gave_up = any(phrase in response for phrase in gave_up_phrases)
        
        if agent_gave_up:
            return True
    
    return False
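
The BM25 idea from the note above could be sketched in plain Python as follows. This is a minimal illustration, not a tuned retriever. It assumes the document texts are available as a dict from document ID to content, and it uses a common non-negative IDF variant with default `k1` and `b` values. The names `bm25_scores` and `missed_documents` and the example documents are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each document (id -> text) against the query with Okapi BM25."""
    if not documents:
        return {}
    tokenized = {doc_id: tokenize(text) for doc_id, text in documents.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs
    # Number of documents containing each term
    doc_freq = Counter(term for tokens in tokenized.values() for term in set(tokens))

    query_terms = set(tokenize(query))
    scores = {}
    for doc_id, tokens in tokenized.items():
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(tokens) / avg_len))
        scores[doc_id] = score
    return scores


def missed_documents(query, documents, retrieved_ids, top_k=1):
    """Return top-ranked documents with a positive score that the agent never retrieved."""
    scores = bm25_scores(query, documents)
    ranked = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return [d for d in ranked if scores[d] > 0 and d not in retrieved_ids]


# Hypothetical document collection and a turn where the agent fetched the wrong file
documents = {
    "hr_policy.mdc": "Employees receive twenty vacation days per year.",
    "products.mdc": "The rocket skates ship in three sizes.",
}
print(missed_documents("How many vacation days do employees get?", documents, {"products.mdc"}))
# ['hr_policy.mdc']
```

A turn that matches a give-up phrase while `missed_documents` returns a non-empty list is a stronger signal of insufficient search than the keyword match alone.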

#### Extract Insights

In [5]:
def get_insights(conversations_path):
    insights = []
    
    for root, folder, files in os.walk(conversations_path):
        for file in files:
            if file.endswith('.json'):
                # Use a context manager so each file handle is closed after loading
                with open(os.path.join(root, file)) as f:
                    conv_jsn = json.load(f)
    
                redundant_calls = count_redundant_calls(conv_jsn)
                insufficient_search = detect_insufficient_search(conv_jsn)
                no_citations = count_no_citations(conv_jsn)
    
                insights.append({
                    "conversation_id": conv_jsn["conversation_id"],
                    "redundant_calls": redundant_calls,
                    "insufficient_search": insufficient_search,
                    "no_citations": no_citations
                })
    return insights

In [12]:
conversations_path = 'data/part_2_conversations/run_1'

In [13]:
insights = get_insights(conversations_path)

In [15]:
num_redundant_calls = sum([convo["redundant_calls"] for convo in insights])
num_insufficient_search = sum([convo["insufficient_search"] for convo in insights])
num_no_citations = sum([convo["no_citations"] for convo in insights])

In [16]:
print(f"Redundant calls: {num_redundant_calls}")
print(f"Apparently insufficient search: {num_insufficient_search}")
print(f"Answers with no citations: {num_no_citations}")

Redundant calls: 11
Apparently insufficient search: 4
Answers with no citations: 31


In [19]:
# todo: change path
with open('data/insights_run1.json', 'w') as f:
    json.dump(insights, f)

### Insight Integration

It's not really possible to automatically (deterministically) improve the system prompt from the insights, unless we want to prompt an LLM to generate a new system prompt for us.

But we can run all the same questions with an updated system prompt and see if the number of redundant calls and apparently insufficient search have gone down.


I believe we can fix both issues with a single system prompt patch. After many tries, I finally got the agent to reduce the number of calls by forcing it to first generate a list of documents retrieved within the conversation. Added to the system prompt:



>REQUIRED TWO-PHASE RESPONSE PATTERN:
>
>PHASE 1 - INVENTORY CHECK (You must always do this first):
>State explicitly:
>"Let me check what I already know:
>- Documents retrieved so far: [list]
>- Information I already have: [brief summary]
>- Do I need new documents? [YES/NO]"
>
>PHASE 2 - ACTION:
>- If you said "NO" in Phase 1 → Answer using existing information
>- If you said "YES" in Phase 1 → Call getFull ONLY on NEW documents
>
>You are FORBIDDEN from calling tools until you complete Phase 1.


Ideally, I should re-run all conversations from run 1, but I have run out of requests, so I only ran the problematic ones:

In [25]:
conversations_path_run2 = 'data/part_2_conversations/run_2'

In [26]:
insights_run2 = get_insights(conversations_path_run2)

In [27]:
num_redundant_calls_run2 = sum([convo["redundant_calls"] for convo in insights_run2])
num_insufficient_search_run2 = sum([convo["insufficient_search"] for convo in insights_run2])
num_no_citations_run2 = sum([convo["no_citations"] for convo in insights_run2])

In [28]:
print(f"Redundant calls: {num_redundant_calls_run2}")
print(f"Apparently insufficient search: {num_insufficient_search_run2}")
print(f"Answers with no citations: {num_no_citations_run2}")

Redundant calls: 1
Apparently insufficient search: 0
Answers with no citations: 12


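Both targeted issues appear to have been reduced: redundant calls went from 11 to 1, and apparently insufficient searches from 4 to 0. The uncited-answer count (31 to 12) is not directly comparable, since run 2 covers only a subset of the run 1 conversations.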
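
A like-for-like comparison could restrict both runs to the conversations they share. The sketch below works on the lists returned by `get_insights`. It assumes, and this is not verified here, that a re-run conversation keeps the same `conversation_id` in both runs. The name `compare_runs` and the example values are hypothetical.

```python
METRICS = ("redundant_calls", "insufficient_search", "no_citations")


def compare_runs(baseline, candidate):
    """Sum each metric over the conversations present in both runs."""
    base_by_id = {c["conversation_id"]: c for c in baseline}
    cand_by_id = {c["conversation_id"]: c for c in candidate}
    shared = set(base_by_id) & set(cand_by_id)

    report = {"shared_conversations": len(shared)}
    for metric in METRICS:
        # int() turns the boolean insufficient_search flag into 0 or 1
        before = sum(int(base_by_id[cid][metric]) for cid in shared)
        after = sum(int(cand_by_id[cid][metric]) for cid in shared)
        report[metric] = {"before": before, "after": after, "regressed": after > before}
    return report


# Hypothetical insights for two runs; only conversation "c1" was re-run
run1 = [
    {"conversation_id": "c1", "redundant_calls": 3, "insufficient_search": True, "no_citations": 2},
    {"conversation_id": "c2", "redundant_calls": 0, "insufficient_search": False, "no_citations": 1},
]
run2 = [
    {"conversation_id": "c1", "redundant_calls": 1, "insufficient_search": False, "no_citations": 2},
]
print(compare_runs(run1, run2))
```

Used as a regression test, any metric with `regressed` set to `True` would fail the check for that agent update.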