In [1]:
%load_ext autoreload
%autoreload 2

# Week 6 - Systematically Improving Your RAG Application

> If you haven't already, please make sure that you've completed the previous notebooks `1. Evaluate Tools` and `2. Generate Dataset` before continuing with this notebook. We'll be using the results from the previous notebook to evaluate the effectiveness of our new techniques.

In this notebook, we'll be looking at how we can improve the performance of our model by using techniques like few shot prompting and system prompts provided by users. 

Specifically we'll be using these techniques to compensate for some of the failure points that we covered in the previous notebook. 

We'll do so in two steps:

1. **System Prompts** - We'll look at how we can improve the prompt that we provide to our model to help it understand how the user uses each tool.

2. **Few Shot Prompting** - We'll then look at how we can use fixed few shot examples to show the model how to call specific tools in combination with one another.

As in the second notebook, we'll reuse the functions we defined previously to evaluate the performance of our model and compare it against our initial precision and recall baselines.
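For reference, precision and recall for a single query can be computed from the set of selected commands and the set of expected commands. The sketch below is a minimal illustration under that assumption. `set_precision` and `set_recall` are hypothetical names, and the `calculate_precision` and `calculate_recall` helpers may differ in detail (for example in how they treat duplicates or empty predictions).

```python
def set_precision(predicted: list[str], expected: list[str]) -> float:
    """Fraction of selected commands that were actually expected."""
    predicted_set, expected_set = set(predicted), set(expected)
    if not predicted_set:
        # Assumption: an empty prediction scores 0 rather than raising
        return 0.0
    return len(predicted_set & expected_set) / len(predicted_set)


def set_recall(predicted: list[str], expected: list[str]) -> float:
    """Fraction of expected commands that the model actually selected."""
    predicted_set, expected_set = set(predicted), set(expected)
    if not expected_set:
        return 0.0
    return len(predicted_set & expected_set) / len(expected_set)


expected = ["imessage.findChat", "imessage.sendMessage"]
predicted = ["imessage.sendMessage"]

print(set_precision(predicted, expected))  # 1.0: everything selected was expected
print(set_recall(predicted, expected))     # 0.5: one of two expected commands was found
```

Under these definitions, a model that skips a required step keeps a high precision but loses recall, which is the pattern we'll be watching for.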

We'll also use `braintrust` here to store and log the performance of our model since it makes it easy to compare the performance of our model across different runs.

## System Prompts

By letting users supply a system prompt that outlines their specific workflow and tool usage, our model can adapt to a greater variety of users and their tool usage patterns.

Let's see this in action below where we add the user provided system prompt to our prompt template.

In [8]:
import instructor
from helpers import load_commands, load_queries, Command, SelectedCommands

user_system_prompt = """
I work as a software engineer at a company. When it comes to work, we normally track all outstanding tasks in Jira and handle the code review/discussions in github itself. 

refer to jira to get ticket information and updates but github is where code reviews, discussions and any other specific code related updates are tracked. Use the recently updated issues command to get the latest updates over search. 

for todos, i use a single note in apple notes for all my todos unless i say otherwise. Obsidian is where I store diagrams, charts and notes that I've taken down for things that I'm studying up on. Our company uses confluence for documentation, wikis, release reports, meeting notes etc that need to be shared with the rest of the team. Notion I use it for financial planning, tracking expenses and planning for trips. I always use databases in notion.

For messaging apps, I tend to just use discord for chatting with my friends when we game, i use microsoft teams for communicating with colleague about spcifically work related matters and iMessage for personal day to day stuff (Eg. coordinate a party, ask about general things in a personal context)
"""


async def generate_commands_with_system_prompt(
    query: str,
    client: instructor.AsyncInstructor,
    commands: list[Command],
    user_system_prompt: str,
):
    response = await client.chat.completions.create(
        messages=[
            {
                "role": "system",
                "content": """
                You are a helpful assistant that can execute commands in response to a user query. You have access to the following commands:
                
                <commands>
                {% for command in commands %}
                - {{ command.key }} : {{ command.command_description }}
                {% endfor %}
                </commands>

                You must select at least one command to be called.

                Here is some information about how the user uses each extension. Remember to find a chat before sending a message.

                <user_behaviour>
                {{ user_behaviour }}
                </user_behaviour>
                """,
            },
            {
                "role": "user",
                "content": query,
            },
        ],
        response_model=SelectedCommands,
        context={"commands": commands, "user_behaviour": user_system_prompt},
    )
    return response.selected_commands
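The placeholders in the template above are filled in from the `context` argument. To make the final prompt easier to inspect, here is a rough plain-Python sketch that builds a prompt with the same sections. `SketchCommand` and `render_system_prompt` are hypothetical stand-ins, and the two command descriptions are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class SketchCommand:
    # Hypothetical stand-in for the notebook's `Command` model
    key: str
    command_description: str


def render_system_prompt(commands: list[SketchCommand], user_behaviour: str) -> str:
    """Build a system prompt with the same sections as the template above."""
    command_lines = "\n".join(
        f"- {c.key} : {c.command_description}" for c in commands
    )
    return (
        "You are a helpful assistant that can execute commands in response "
        "to a user query. You have access to the following commands:\n\n"
        f"<commands>\n{command_lines}\n</commands>\n\n"
        "You must select at least one command to be called.\n\n"
        "<user_behaviour>\n"
        f"{user_behaviour.strip()}\n"
        "</user_behaviour>"
    )


# Illustrative descriptions only; the real ones come from raw_commands.json
demo_commands = [
    SketchCommand("discord.searchMessages", "Search messages in a channel"),
    SketchCommand("imessage.sendMessage", "Send an iMessage to a chat"),
]
prompt = render_system_prompt(demo_commands, "I use discord for gaming chats.")
print(prompt)
```

Printing the rendered prompt like this is a quick way to check that the user's description of their workflow ends up where the model can see it.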

Let's now run our evaluations again to see how it performs.

In [160]:
import instructor
import google.generativeai as genai

commands = load_commands("raw_commands.json")
queries = load_queries(commands, "queries.jsonl")
client = instructor.from_gemini(
    genai.GenerativeModel("models/gemini-1.5-flash-latest"), use_async=True
)
await generate_commands_with_system_prompt(
    "Did anyone respond to my message in #gaming about playing it takes two later?",
    client,
    commands,
    user_system_prompt,
)

[UserCommand(key='discord.searchMessages', arguments=[UserCommandArgument(title='Channel', value='#gaming')])]

In [35]:
from braintrust import Score, Eval
from helpers import calculate_precision, calculate_recall
import google.generativeai as genai


def evaluate_braintrust(input, output, **kwargs):
    return [
        Score(
            name="precision",
            score=calculate_precision(output, kwargs["expected"]),
        ),
        Score(
            name="recall",
            score=calculate_recall(output, kwargs["expected"]),
        ),
    ]


commands = load_commands("raw_commands.json")
queries = load_queries(commands, "queries.jsonl")

client = instructor.from_gemini(
    genai.GenerativeModel("models/gemini-1.5-flash-latest"), use_async=True
)


async def task(query, hooks):
    resp = await generate_commands_with_system_prompt(
        query, client, commands, user_system_prompt
    )
    hooks.meta(
        input=query,
        output=resp,
    )
    return [item.key for item in resp]


results = await Eval(
    "function-calling",
    data=[
        {
            "input": row["query"],
            "expected": row["labels"],
        }
        for row in queries
    ],
    task=task,
    scores=[evaluate_braintrust],
)

Experiment week-6-1736773843 is running at https://www.braintrust.dev/app/567/p/function-calling/experim

function-calling (tasks):   0%|          | 0/52 [00:00<?, ?it/s]

week-6-1736773843 compared to week-6-1736768887:
59.15% (-15.88%) 'recall'    score	(4 improvements, 21 regressions)
67.81% (-03.21%) 'precision' score	(10 improvements, 11 regressions)

1736773843.88s start
1736773846.17s end
2.28s (+55.91%) 'duration'	(2 improvements, 50 regressions)

See results for week-6-1736773843 at https://www.braintrust.dev/app/567/p/function-calling/experiments/week-6-1736773843


By providing a system prompt, precision rose by ~55% (from 0.44 to 0.68) and recall rose by ~64% (from 0.36 to 0.59) relative to the baseline.
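The percentages here are relative changes over the baseline. As a quick check of the arithmetic, the sketch below recomputes them, using the baseline precision of 0.44 and the baseline recall of 0.36 listed in the summary table at the end of this notebook. `relative_change` is a hypothetical helper name.

```python
def relative_change(baseline: float, new: float) -> float:
    """Relative change of `new` over `baseline`, as a percentage."""
    return (new - baseline) / baseline * 100


precision_gain = relative_change(0.44, 0.68)
recall_gain = relative_change(0.36, 0.59)

print(round(precision_gain))  # 55
print(round(recall_gain))     # 64
```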

This is a significant improvement and shows that providing a system prompt can help the model understand how the user uses each tool. 

Better yet, using system prompts allows our model to be more flexible and handle a greater variety of users who may have different ways of interacting with the tools.

### Comparing System Prompts with Baseline

Now that we've seen an overall improvement across the board, let's look at which specific queries our model is still having issues with. We'll do so by computing the same metrics as we did in our previous notebook.

In [36]:
import pandas as pd


df = pd.DataFrame(
    [
        {
            "query": row.input,
            "expected": row.expected,
            "output": row.output,
        }
        for row in results.results
    ]
)

# Get all unique tools
all_tools = set()
for tools in df["expected"] + df["output"]:
    all_tools.update(tools)

# Group by tools and get counts
tool_df = pd.DataFrame(
    {
        "expected_count": df["expected"].explode().value_counts(),
        "actual_count": df["output"]
        .explode()[df["output"].explode().isin(df["expected"].explode().unique())]
        .value_counts(),
    }
).fillna(0)

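# Calculate precision and recall
# NOTE: both columns use the same ratio (times called / times expected), so they
# are always identical here. The ratio is closest to a per-tool recall; it does
# not check per query whether a call was correct.
tool_df["precision"] = round(tool_df["actual_count"] / tool_df["expected_count"], 2)
tool_df["recall"] = round(tool_df["actual_count"] / tool_df["expected_count"], 2)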

# Sort by recall ascending
tool_df = tool_df.sort_values(["recall"], ascending=[True])
tool_df.head(10)

| Tool | expected_count | actual_count | precision | recall |
|------|----------------|--------------|-----------|--------|
| github.unread-notifications | 3 | 0.0 | 0.0 | 0.0 |
| confluence-search.go | 1 | 0.0 | 0.0 | 0.0 |
| confluence-search.new-blog | 2 | 0.0 | 0.0 | 0.0 |
| github.notifications | 2 | 0.0 | 0.0 | 0.0 |
| imessage.findChat | 5 | 0.0 | 0.0 | 0.0 |
| microsoft-teams.findChat | 12 | 2.0 | 0.17 | 0.17 |
| obsidian.searchMedia | 3 | 1.0 | 0.33 | 0.33 |
| discord.searchMessages | 2 | 1.0 | 0.5 | 0.5 |
| github.workflow-runs | 2 | 1.0 | 0.5 | 0.5 |
| jira.active-sprints | 2 | 1.0 | 0.5 | 0.5 |


From the table above, we can see a few failure points:

1. `github.unread-notifications` and `github.notifications` are never called.
2. `findChat` is rarely called: `imessage.findChat` is never selected (0 of 5 expected calls) and `microsoft-teams.findChat` is selected only twice (2 of 12 expected calls).
3. Our model rarely uses the `obsidian.searchMedia` command (1 of 3 expected calls), resulting in a low score for this command.
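Because the `precision` and `recall` columns above come from the same ratio, a stricter per-tool breakdown would count true positives query by query. The following is a minimal sketch of that idea in plain Python, not the method used to produce the table. The three `(expected, output)` rows are copied from query results shown in this notebook, and `None` is an assumed convention for a metric that is undefined.

```python
from collections import Counter

# (expected, output) pairs copied from queries inspected in this notebook
rows = [
    (["imessage.findChat", "imessage.sendMessage"], ["imessage.sendMessage"]),
    (
        ["notion.search-page", "imessage.findChat", "imessage.sendMessage"],
        ["notion.search-page", "imessage.sendMessage"],
    ),
    (["obsidian.searchMedia"], ["obsidian.searchNoteCommand"]),
]

expected_count, predicted_count, true_positives = Counter(), Counter(), Counter()
for expected, output in rows:
    expected_set, output_set = set(expected), set(output)
    expected_count.update(expected_set)
    predicted_count.update(output_set)
    # A true positive is a tool that was both expected and selected for this query
    true_positives.update(expected_set & output_set)

per_tool = {}
for tool in sorted(set(expected_count) | set(predicted_count)):
    tp = true_positives[tool]
    per_tool[tool] = {
        # None marks a metric that is undefined (tool never selected / never expected)
        "precision": tp / predicted_count[tool] if predicted_count[tool] else None,
        "recall": tp / expected_count[tool] if expected_count[tool] else None,
    }

for tool, scores in per_tool.items():
    print(tool, scores)
```

With this breakdown, a tool that is never selected (such as `imessage.findChat` in these rows) shows a recall of 0 and an undefined precision, while a tool that is selected in place of another shows a precision of 0.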

Let's see what queries expect to use `findChat` and what was called instead.

In [38]:
# Filter for rows where extension.findChat is in expected tools
chat_df = df[df["expected"].apply(lambda x: any("findChat" in item for item in x))]

for _, row in chat_df.head(3).iterrows():
    print(
        f"Query: {row['query']}\n\nExpected: {row['expected']}\n\nOutput: {row['output']}\n\n{'-' * 80}\n"
    )


Query: Let's create a new release post about our latest deployment, also make sure to link the specific issues that were fixed in the latest sprint and send a message the #engineering channel to let them know about it

Expected: ['confluence-search.new-blog', 'jira.active-sprints', 'microsoft-teams.findChat', 'microsoft-teams.sendMessage']

Output: ['confluence-search.new-page', 'confluence-search.add-text', 'confluence-search.add-text', 'microsoft-teams.sendMessage']

--------------------------------------------------------------------------------

Query: Tell mum i'll be back for dinner around 7pm

Expected: ['imessage.findChat', 'imessage.sendMessage']

Output: ['imessage.sendMessage']

--------------------------------------------------------------------------------

Query: pull up munich plans, send mike the airbnb link to the accoms on the 22nd

Expected: ['notion.search-page', 'imessage.findChat', 'imessage.sendMessage']

Output: ['notion.search-page', 'imessage.sendMessage']

--------------------------------------------------------------------------------

In [40]:
# Filter for rows where github.unread-notifications is in expected tools
gh_df = df[df["expected"].apply(lambda x: "github.unread-notifications" in x)]

for _, row in gh_df.head(3).iterrows():
    print(
        f"Query: {row['query']}\n\nExpected: {row['expected']}\n\nOutput: {row['output']}\n\n{'-' * 80}\n"
    )


Query: any security alerts raised since we upgraded our nextjs dependencies over to 14.2.0? 

Expected: ['github.unread-notifications']

Output: ['jira.recently-updated-issues']

--------------------------------------------------------------------------------

Query: check if there are any dependency vulnerabilities raised recently

Expected: ['github.unread-notifications']

Output: ['jira.recently-updated-issues']

--------------------------------------------------------------------------------

Query: pull up those security alerts and ping the security team

Expected: ['github.unread-notifications', 'microsoft-teams.findChat', 'microsoft-teams.sendMessage']

Output: ['jira.search-issues', 'microsoft-teams.sendMessage']

--------------------------------------------------------------------------------



In [41]:
# Filter for rows where obsidian.searchMedia is in expected tools
obsidian_df = df[df["expected"].apply(lambda x: "obsidian.searchMedia" in x)]

for _, row in obsidian_df.head(3).iterrows():
    print(
        f"Query: {row['query']}\n\nExpected: {row['expected']}\n\nOutput: {row['output']}\n\n{'-' * 80}\n"
    )


Query: can you help me find the diagrams I did to show how docker containerisation works in my personal notes?

Expected: ['obsidian.searchMedia']

Output: ['obsidian.searchMedia']

--------------------------------------------------------------------------------

Query: can you help me find that sketch I made that explained the docker containerisation process, i need it to explain to the team

Expected: ['obsidian.searchMedia']

Output: ['obsidian.searchNoteCommand']

--------------------------------------------------------------------------------

Query: search for my aws architecture drawings

Expected: ['obsidian.searchMedia']

Output: ['obsidian.searchNoteCommand']

--------------------------------------------------------------------------------
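The three inspections above follow the same pattern: compare what was expected with what was selected. A small helper can make the gap explicit for each query. This is a minimal sketch, and `diff_commands` is a hypothetical name rather than something defined in `helpers`. The example row is the security alerts query shown above.

```python
def diff_commands(expected: list[str], output: list[str]) -> dict[str, list[str]]:
    """Split a prediction into commands that were missed and commands that were extra."""
    return {
        "missing": [c for c in expected if c not in output],
        "unexpected": [c for c in output if c not in expected],
    }


expected = [
    "github.unread-notifications",
    "microsoft-teams.findChat",
    "microsoft-teams.sendMessage",
]
output = ["jira.search-issues", "microsoft-teams.sendMessage"]

result = diff_commands(expected, output)
print(result["missing"])     # ['github.unread-notifications', 'microsoft-teams.findChat']
print(result["unexpected"])  # ['jira.search-issues']
```

Commands that show up repeatedly under `missing` are natural candidates for few shot examples.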



## Few Shot Prompting

Once we've identified potential problem areas, like the model failing to call `findChat`, few shot examples can explicitly demonstrate these commands used in context.

For instance, we can show a few examples of how to use the `findChat` command together with a `sendMessage` command. A natural fit here could be to grab some content from an internal documentation site like `confluence` and then send it over to a chat.

```
<query>generate release notes for the tickets closed in our current sprint and send the link over to the #product channel ahead of time so they know what's coming</query>
<commands>
    confluence-search.new-blog,
    confluence-search.add-text,
    microsoft-teams.findChat,
    microsoft-teams.sendMessage
</commands>
```

We could also be inventive and use the `searchMedia` command alongside a normal `searchNoteCommand` to show the model how each command differs.

```
<query>Can you grab my notes and sketches which I put together about cross-attention?</query>
<commands>
    obsidian.searchMedia,
    obsidian.searchNoteCommand
</commands>
```

Including these concrete examples in the prompt shows the model the correct sequence of steps and makes it less likely to call the wrong command.
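Keeping the examples as structured data and rendering them into the prompt makes it easier to add new ones as more failure cases are found. The sketch below is one possible way to do this. `render_examples` is a hypothetical helper, and the two examples mirror the format used in this notebook.

```python
few_shot_examples = [
    {
        "query": "Tell Alaistar that I'll be late for dinner at Somma bar tonight "
        "and that I'm on my way down in a cab already",
        "commands": ["imessage.findChat", "imessage.sendMessage"],
    },
    {
        "query": "Check if I have any new feedback on my PRs and additionally "
        "what's the status of the build on main",
        "commands": ["github.unread-notifications", "github.workflow-runs"],
    },
]


def render_examples(examples: list[dict]) -> str:
    """Render few shot examples as an <examples> block for the system prompt."""
    blocks = []
    for example in examples:
        commands = ",\n".join(f"        {key}" for key in example["commands"])
        blocks.append(
            "  <example>\n"
            f"    <query>{example['query']}</query>\n"
            "    <commands>\n"
            f"{commands}\n"
            "    </commands>\n"
            "  </example>"
        )
    return "<examples>\n" + "\n".join(blocks) + "\n</examples>"


rendered = render_examples(few_shot_examples)
print(rendered)
```

The rendered string could then be passed to the prompt template as another context variable instead of being hard-coded.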

In [44]:
async def generate_commands_with_prompt_and_examples(
    query: str,
    client: instructor.AsyncInstructor,
    commands: list[Command],
    user_behaviour: str,
):
    response = await client.chat.completions.create(
        messages=[
            {
                "role": "system",
                "content": """
You are a helpful assistant that can execute commands in response to a user query. Only choose from the commands listed below. 

You have access to the following commands:

<commands>
{% for command in commands %}
<command>
    <command key>{{ command.key }}</command key>
    <command description>{{ command.command_description }}</command description>
</command>
{% endfor %}
</commands>

Select between 1-4 commands to be called in response to the user query.

Here is some information about how the user uses each extension. Remember to find a chat before sending a message.

<user_behaviour>
{{ user_behaviour }}
</user_behaviour>

Here are some past examples of queries that the user has asked in the past and the keys of the commands that were expected to be called. These provide valuable context and so look at it carefully and understand why each command was called, taking into account the user query below and the user behaviour provided above.

<examples>
    <example>
        <query>generate release notes for the tickets closed in our current sprint and send the link over to the #product channel ahead of time so they know what's coming</query>
        <commands>
            confluence-search.new-blog,
            confluence-search.add-text,
            microsoft-teams.findChat,
            microsoft-teams.sendMessage
        </commands>
    </example>
    <example>
        <query>Tell Alaistar that I'll be late for dinner at Somma bar tonight and that I'm on my way down in a cab already</query>
        <commands>
            imessage.findChat,
            imessage.sendMessage
        </commands>
    </example>
    <example>
        <query>Can you grab my notes and sketches which I put together about cross-attention?</query>
        <commands>
            obsidian.searchMedia,
            obsidian.searchNoteCommand
        </commands>
    </example>
    <example>
        <query>Check if I have any new feedback on my PRs and additionally what's the status of the build on main</query>
        <commands>
            github.unread-notifications,
            github.workflow-runs
        </commands>
    </example>
</examples>
                """,
            },
            {
                "role": "user",
                "content": query,
            },
        ],
        response_model=SelectedCommands,
        context={"commands": commands, "user_behaviour": user_behaviour},
    )
    return response.selected_commands


In [45]:
from braintrust import Score, Eval
from helpers import calculate_precision, calculate_recall
import google.generativeai as genai


def evaluate_braintrust(input, output, **kwargs):
    return [
        Score(
            name="precision",
            score=calculate_precision(output, kwargs["expected"]),
        ),
        Score(
            name="recall",
            score=calculate_recall(output, kwargs["expected"]),
        ),
    ]


commands = load_commands("raw_commands.json")
queries = load_queries(commands, "queries.jsonl")

client = instructor.from_gemini(
    genai.GenerativeModel("models/gemini-1.5-flash-latest"), use_async=True
)


async def task(query, hooks):
    resp = await generate_commands_with_prompt_and_examples(
        query, client, commands, user_system_prompt
    )
    hooks.meta(
        input=query,
        output=resp,
    )
    return [item.key for item in resp]


results = await Eval(
    "function-calling",
    data=[
        {
            "input": row["query"],
            "expected": row["labels"],
        }
        for row in queries
    ],
    task=task,
    scores=[evaluate_braintrust],
)

Experiment week-6-1736774743 is running at https://www.braintrust.dev/app/567/p/function-calling/experim

function-calling (tasks):   0%|          | 0/52 [00:00<?, ?it/s]

week-6-1736774743 compared to week-6-1736774328:
74.54% (+00.31%) 'recall'    score	(2 improvements, 3 regressions)
71.65% (-01.12%) 'precision' score	(4 improvements, 4 regressions)

1736774743.37s start
1736774745.96s end
2.59s (+00.53%) 'duration'	(28 improvements, 24 regressions)

See results for week-6-1736774743 at https://www.braintrust.dev/app/567/p/function-calling/experiments/week-6-1736774743


| Method | Precision | Recall |
|--------|-----------|--------|
| Baseline | 0.44 | 0.36 |
| System Prompt | 0.68 (+55%) | 0.59 (+64%) |
| Few Shot Prompting | 0.72 (+64%) | 0.75 (+108%) |

These results show that:

- A system prompt alone substantially boosts precision (0.44 to 0.68) and recall (0.36 to 0.59) by supplying the context the model needs to understand the specific usage patterns of the user.
- Few shot examples further improve both metrics, most notably recall (0.59 to 0.75), by providing explicit examples of correct command usage. Percentages in the table are relative to the baseline.

## Conclusion

In this notebook, we've shown how simple techniques like few shot prompting and system prompts can significantly boost model precision and recall for tool calling. We demonstrated how we can identify specific failure patterns to guide the selection of few-shot examples.

This approach can be extended by writing more detailed command descriptions, collecting additional failure cases, and providing more domain-specific few-shot examples.