<a href="https://colab.research.google.com/github/olanigan/DSPy_Cookbook/blob/main/DSPy_Agent_Example_AvaTaR.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# DSPy Agent Example - AvaTaR

In the following notebook, we'll be working through an example of the AvaTaR process (as outlined in their [paper](https://arxiv.org/pdf/2406.11200)) using an example that was provided to the [DSPy Repository](https://github.com/stanfordnlp/dspy/blob/main/examples/agents/avatar_langchain_tools.ipynb).

We'll work through what's happening at each step - and learn about the AvaTaR process throughout the notebook!

## Dependencies

We'll be installing DSPy from source today - to make sure we have the latest and greatest features available!

In [None]:
!pip install -qU git+https://github.com/stanfordnlp/dspy.git

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


Because we'll be creating an Arxiv Agent - we'll need to both grab the LangChain tools (for their convenient `.run()` method), and the Arxiv Python library.

In [None]:
!pip install -qU langchain_community arxiv

As is usually the case, DSPy does work with a number of local and API based models - but we'll be sticking with the absolute classic: OpenAI!

We'll need to provide our API key, as such.

In [None]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

OpenAI API Key:··········


## DSPy Setup

We're going to configure our DSPy Language Model (LM) to be `gpt-4o-mini` because it is:

- inexpensive
- lightning quick

The trade-off is that it often underperforms `gpt-4o`.

We'll set some simple hyperparameters, as well.

In [None]:
import os
import dspy

dspy.settings.configure(
    lm=dspy.OpenAI(
        model="gpt-4o-mini",
        api_key=os.getenv("OPENAI_API_KEY"),
        max_tokens=4000,
        temperature=0,
    )
)

### Signatures

If you recall from our [first DSPy event](https://www.youtube.com/watch?v=6YtdtjQD1r0), a `Signature` is essentially a prompt and response package. It defines a prompt (the docstring) as well as the expected inputs and outputs (each with their own prompts).

We'll create a `Signature` for our ArxivQA Agent below.

In [None]:
class ArxivQASignature(dspy.Signature):
    """You will be given a question and an Arxiv Paper ID. Your task is to answer the question."""

    question: str = dspy.InputField(
        prefix="Question:",
        desc="question to ask",
        format=lambda x: x.strip(),
    )
    paper_id: str = dspy.InputField(
        prefix="Paper ID:",
        desc="Arxiv Paper ID",
    )
    answer: str = dspy.OutputField(
        prefix="Answer:",
        desc="answer to the question",
    )

### Data

As is usually the case - DSPy enables us to take a data-forward approach to prompting. As such, we'll be providing some simple data!

Our data today will be a collection of prompt/response pairs from the [ArXiv QA](https://huggingface.co/datasets/taesiri/arxiv_qa) dataset on Hugging Face, which, critically, contains paper IDs. We'll use those to help evaluate our Agentic System below!

First, let's instantiate our `DataLoader`.

In [None]:
from random import sample
from dspy.datasets import DataLoader

dl = DataLoader()

Now, we can load our dataset directly from Hugging Face.

In [None]:
arxiv_qa = dl.from_huggingface(
    "taesiri/arxiv_qa",
    split="train",
    input_keys=("question", "paper_id"),
)

Since we need to both train and evaluate - we'll create a train/test split.

> NOTE: We care about the `paper_id` and `question` fields as inputs here.

In [None]:
import random

# Set a random seed for reproducibility
random.seed(42)

aqa_train = [
    dspy.Example(question=example.question, paper_id=example.paper_id, answer=example.answer).with_inputs("question", "paper_id")
    for example in sample(arxiv_qa, 200)
]
aqa_test = [
    dspy.Example(question=example.question, paper_id=example.paper_id, answer=example.answer).with_inputs("question", "paper_id")
    for example in sample(arxiv_qa, 100)
]
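
One caveat: the two `sample` calls above draw independently from `arxiv_qa`, so the same example can end up in both `aqa_train` and `aqa_test`. If that overlap matters for your evaluation, a minimal sketch of a disjoint split could look like the following. The `disjoint_split` helper and the integer stand-in data are illustrative assumptions, not part of DSPy.

```python
import random

def disjoint_split(items, n_train, n_test, seed=42):
    """Draw n_train + n_test items once, then slice, so the two sets never overlap."""
    rng = random.Random(seed)  # local RNG, leaves the global random state untouched
    picked = rng.sample(items, n_train + n_test)  # sampling without replacement
    return picked[:n_train], picked[n_train:]

# Stand-in data: unique integers instead of dspy.Example objects.
pool = list(range(1000))
train, test = disjoint_split(pool, n_train=200, n_test=100)

# Sizes of both sets, followed by the size of their intersection.
print(len(train), len(test), len(set(train) & set(test)))  # prints: 200 100 0
```

Switching to a split like this would change which examples are drawn, so the scores shown later in this notebook would not be reproduced exactly.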

### Creating an AvaTaR Agent

Now we can create an AvaTaR Agent!

So, what exactly *is* AvaTaR?

It's an optimization strategy that is useful for creating Agentic Systems that are more robust, and better at skills like tool calling!

Let's see, from their paper, how AvaTaR stacks up against some of the other Agentic strategies:

![image](https://i.imgur.com/2QRJHMq.png)

So how does it work "under the hood"? Well - it's quite simple, actually:

![image](https://i.imgur.com/RwX6QQB.png)

As you can see in the diagram, there are a few essential components to the AvaTaR optimization process:

1. Generate a task for our Comparator using contrastive examples
2. Provide updated instructions for our Actor
3. Generate Actions from our Actor
4. Rinse, Repeat.

This introduces two critical ideas:

1. There is a "Comparator" which will help generate examples, and reason through contrastive examples to help guide our Actor's prompts.
2. There is an "Actor" which is ultimately the "product" of the AvaTaR process.

There's a lot more we could dive into - but the basic idea is above.

> NOTE: Check out the [paper](https://arxiv.org/pdf/2406.11200) for more awesome details!
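
To make the loop above concrete, here is a toy, dependency-free sketch of the cycle: run the Actor, split inputs into positive and negative, ask the Comparator for feedback, and fold that feedback into the Actor's instruction. Everything in it (`toy_actor`, `toy_comparator`, the scoring rule, the string instructions) is a hypothetical stand-in for LLM calls, and it is not the DSPy or paper implementation.

```python
import random

def toy_actor(instruction, question):
    """Stand-in Actor: it 'succeeds' only if its instruction mentions the tool."""
    return 1 if "use ARXIV_SEARCH" in instruction else 0

def toy_comparator(instruction, positives, negatives):
    """Stand-in Comparator: contrasts good and bad inputs and returns feedback."""
    if negatives:
        return "use ARXIV_SEARCH with the paper ID before answering"
    return ""  # nothing failed, so there is nothing to suggest

def optimize(instruction, questions, max_iters=3, seed=0):
    rng = random.Random(seed)
    best_instruction, best_score = instruction, -1.0
    for _ in range(max_iters):
        # 1. Generate actions and score them.
        scored = [(q, toy_actor(instruction, q)) for q in questions]
        avg = sum(s for _, s in scored) / len(scored)
        if avg > best_score:
            best_instruction, best_score = instruction, avg
        # 2. Build contrastive samples for the Comparator.
        positives = [q for q, s in scored if s == 1]
        negatives = [q for q, s in scored if s == 0]
        feedback = toy_comparator(
            instruction,
            rng.sample(positives, min(2, len(positives))),
            rng.sample(negatives, min(2, len(negatives))),
        )
        if not feedback:
            break
        # 3. Update the Actor's instruction, then repeat.
        instruction = f"{instruction} Feedback: {feedback}."
    return best_instruction, best_score

best, score = optimize("Answer the question.", ["q1", "q2", "q3"])
print(score)  # prints: 1.0
```

In the real process both roles are played by a language model, so the feedback and the rewritten instruction are free-form text rather than fixed strings.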


Now we can initialize our toolbelt!

In this case - we're going to keep things simple, and just use a single `Tool`, which will be our `ArxivAPIWrapper`.

In [None]:
from dspy.predict.avatar import Tool, Avatar
from langchain_community.utilities import ArxivAPIWrapper

tools = [
    Tool(
        tool=ArxivAPIWrapper(),
        name="ARXIV_SEARCH",
        desc="Pass the arxiv paper id to get the paper information.",
        input_type="Arxiv Paper ID",
    ),
]
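
As mentioned earlier, the appeal of the LangChain wrapper is its `.run()` method. The sketch below shows that idea in isolation, under the assumption that the agent only needs an object with `.run(str) -> str` plus a name and description it can show to the LM. `FakePaperLookup`, `ToolSpec`, `call_tool`, and the placeholder paper ID are hypothetical names for illustration; they are not the DSPy `Tool` class or the real `ArxivAPIWrapper`.

```python
from dataclasses import dataclass
from typing import Any

class FakePaperLookup:
    """Hypothetical stand-in for ArxivAPIWrapper: anything with a .run(str) method."""

    def __init__(self, papers):
        self.papers = papers

    def run(self, query: str) -> str:
        # Look the query up as a paper ID and fall back to a fixed message.
        return self.papers.get(query.strip(), "no result found")

@dataclass
class ToolSpec:
    """Illustrative container mirroring the fields used in the cell above."""
    tool: Any
    name: str
    desc: str
    input_type: str

def call_tool(tools, name, query):
    """Dispatch a named action to the matching tool's .run() method."""
    for spec in tools:
        if spec.name == name:
            return spec.tool.run(query)
    raise KeyError(f"unknown tool: {name}")

demo_tools = [
    ToolSpec(
        tool=FakePaperLookup({"0000.00000": "Title: Placeholder paper"}),
        name="ARXIV_SEARCH",
        desc="Pass the arxiv paper id to get the paper information.",
        input_type="Arxiv Paper ID",
    ),
]

result = call_tool(demo_tools, "ARXIV_SEARCH", " 0000.00000 ")
print(result)  # prints: Title: Placeholder paper
```

Because the lookup here expects a bare paper ID, a full-sentence query would miss. The logged actions later in the notebook show the agent sending both styles of input to `ARXIV_SEARCH`.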

Now we can create our `Avatar` Agent!

> NOTE: Recall that during our optimization process, we will have a Comparator and an Actor - but our final product will be an optimized Actor.

In [None]:
arxiv_agent = Avatar(
    tools=tools,
    signature=ArxivQASignature,
    verbose=True,
)

We're going to leverage a new `Signature` for evaluation - which will help us use LLM-As-A-Judge to determine how well our Agent is performing on the test set.

Ultimately, this `Signature` is just determining if a response (as generated by our Agent) is "correct" or "incorrect".

In [None]:
class Evaluator(dspy.Signature):
    """Please act as an impartial judge and evaluate the quality of the responses provided by multiple AI assistants to the user question displayed below. You should choose the assistant that offers a better user experience by interacting with the user more effectively and efficiently, and providing a correct final response to the user's question.

Rules:
1. Avoid Position Biases: Ensure that the order in which the responses were presented does not influence your decision. Evaluate each response on its own merits.
2. Length of Responses: Do not let the length of the responses affect your evaluation. Focus on the quality and relevance of the response. A good response is targeted and addresses the user's needs effectively, rather than simply being detailed.
3. Objectivity: Be as objective as possible. Consider the user's perspective and overall experience with each assistant."""

    question: str = dspy.InputField(
        prefix="Question:",
        desc="question to ask",
    )
    reference_answer: str = dspy.InputField(
        prefix="Reference Answer:",
        desc="Reference (ground-truth) answer to the question.",
    )
    answer: str = dspy.InputField(
        prefix="Answer:",
        desc="Answer to the question given by the model.",
    )
    rationale: str = dspy.OutputField(
        prefix="Rationale:",
        desc="Explanation of why the answer is correct or incorrect.",
    )
    is_correct: bool = dspy.OutputField(
        prefix="Correct:",
        desc="Whether the answer is correct.",
    )

Now we can wrap that all in a `TypedPredictor`!

In [None]:
evaluator = dspy.TypedPredictor(Evaluator)

Now we can wrap this in a simple metric that can be used to help optimize our AvaTaR Agent.

In [None]:
def metric(example, prediction, trace=None):
    return int(
        evaluator(
            question=example.question,
            answer=prediction.answer,
            reference_answer=example.answer
        ).is_correct
    )

### Custom Evaluation

Due to the differences between traditional DSPy optimization (formerly teleprompting) and AvaTaR optimization - we'll need to create a custom evaluation flow.

In [None]:
import tqdm

from concurrent.futures import ThreadPoolExecutor

def process_example(example, signature):
    try:
        avatar = Avatar(
            signature,
            tools=tools,
            verbose=False,
        )
        prediction = avatar(**example.inputs().toDict())

        return metric(example, prediction)
    except Exception as e:
        print(e)
        return 0

def multi_thread_executor(test_set, signature, num_threads=60):
    total_score = 0
    total_examples = len(test_set)

    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        futures = [executor.submit(process_example, example, signature) for example in test_set]

        for future in tqdm.tqdm(futures, total=total_examples, desc="Processing examples"):
            total_score += future.result()

    avg_metric = total_score / total_examples
    return avg_metric
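
The pattern above (send every example to a thread pool, count any failure as a score of 0, then average) can be tried without any API calls. In this sketch `fake_score` is a hypothetical stand-in for running the agent and the metric; it fails on purpose for some inputs so the error handling is exercised.

```python
from concurrent.futures import ThreadPoolExecutor

def fake_score(example):
    """Stand-in for agent + metric: fails on multiples of 5, else scores 0 or 1."""
    if example % 5 == 0:
        raise ValueError(f"simulated failure on example {example}")
    return example % 2

def safe_score(example):
    """Mirror process_example: any exception counts as a score of 0."""
    try:
        return fake_score(example)
    except Exception:
        return 0

def average_score(examples, num_threads=8):
    # executor.map returns results in input order once each task finishes.
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        scores = list(executor.map(safe_score, examples))
    return sum(scores) / len(examples)

avg = average_score(list(range(1, 11)))
print(avg)  # prints: 0.4
```

Threads are a reasonable fit here because the real work is waiting on network calls. A high thread count such as 60 may run into API rate limits, in which case the swallowed exceptions would silently lower the average.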

### Evaluation!

We can leverage our custom evaluation to...evaluate our system!

In [None]:
aqa_score = multi_thread_executor(aqa_test, ArxivQASignature)
print(f"Average Score on ArxivQA: {aqa_score:.2f}")

Processing examples:   0%|          | 0/100 [00:00<?, ?it/s]

Action 101: ARXIV_SEARCH (Generate 10 specific in-depth questions about the methodologies and contributions discussed in the paper 1912.01991, focusing on key themes and techniques.)


Processing examples:   2%|▏         | 2/100 [00:27<25:24, 15.55s/it]

Action 99: ARXIV_SEARCH (What are the main contributions and methodologies proposed in the paper with ID 2305.11130, specifically focusing on Defence-in-Depth strategies and Attribute-Based Encryption?)
Action 98: ARXIV_SEARCH (Based on the findings and methodologies discussed in the paper with ID 2304.04745, what specific questions should we ask to create a comprehensive summary?)
Action 104: ARXIV_SEARCH (What future research directions do the authors suggest in paper 2307.14352?)


Processing examples: 100%|██████████| 100/100 [01:25<00:00,  1.17it/s]


Average Score on ArxivQA: 0.41


## Avatar Optimization!

Taken directly from their notebook:

> Note, that Actor is the Module that directs tool execution and flow, it's not the signature that we are passing. It doesn't optimize the instruction of the signature we pass.

We have access to the following parameters, or control knobs:

- `metric`: Metric that we'll be optimizing for
- `max_iters`: Maximum number of iterations for the optimizer
- `lower_bound`: Lower bound for the metric to classify example as negative
- `upper_bound`: Upper bound for the metric to classify example as positive
- `max_positive_inputs`: Maximum number of positive inputs sampled for comparator
- `max_negative_inputs`: Maximum number of negative inputs sampled for comparator
- `optimize_for`: Whether we want to maximize the metric or minimize it during optimization
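
To illustrate how `lower_bound`, `upper_bound`, and the two `max_*_inputs` knobs interact, here is a small sketch. It assumes that scores at or above `upper_bound` count as positive and scores at or below `lower_bound` count as negative; check the optimizer source for the exact comparison it uses. The `partition` and `sample_for_comparator` helpers, and the default bounds of 0 and 1 chosen to match our 0/1 metric, are illustrative.

```python
import random

def partition(scored, lower_bound=0, upper_bound=1):
    """Split (input, score) pairs into positive and negative groups by threshold."""
    positives = [(x, s) for x, s in scored if s >= upper_bound]
    negatives = [(x, s) for x, s in scored if s <= lower_bound]
    return positives, negatives

def sample_for_comparator(positives, negatives,
                          max_positive_inputs=2, max_negative_inputs=2, seed=0):
    """Cap how many examples of each kind the comparator gets to see."""
    rng = random.Random(seed)
    pos = rng.sample(positives, min(max_positive_inputs, len(positives)))
    neg = rng.sample(negatives, min(max_negative_inputs, len(negatives)))
    return pos, neg

scored = [("q1", 1), ("q2", 0), ("q3", 0), ("q4", 1), ("q5", 0)]
positives, negatives = partition(scored)
pos, neg = sample_for_comparator(positives, negatives)

# Group sizes before sampling, then sample sizes after capping.
print(len(positives), len(negatives), len(pos), len(neg))  # prints: 2 3 2 2
```

With a binary metric every example falls into one of the two groups. With a graded metric, anything strictly between the two bounds would be left out of both under this assumption.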

We'll dive into the [code](https://github.com/stanfordnlp/dspy/blob/main/dspy/teleprompt/avatar_optimizer.py) to learn a bit more, as well!

First, let's check out our Comparator `Signature`:

```python
class Comparator(dspy.Signature):
    """After executing the given actions on user inputs using the given instruction, some inputs have yielded good, results, while others have not. I'll provide you the inputs along with their, corresponding evaluation metrics:

Task:
(1) Firstly, identify and contrast the patterns of inputs that have achieved good results with those that have not.
(2) Then, review the computational logic for any inconsistencies in the previous actions.
(3) Lastly, specify the modification in tools used that can lead to improved performance on the negative inputs."""

    instruction: str = dspy.InputField(
        prefix="Instruction: ",
        desc="Instruction for the actor to execute the task",
    )
    actions: List[str] = dspy.InputField(
        prefix="Actions: ",
        desc="Actions actor can take to complete the task",
    )
    pos_input_with_metrics: List[EvalResult] = dspy.InputField(
        prefix="Positive Inputs: ",
        desc="Positive inputs along with their score on a evaluation metric and actions taken",
    )
    neg_input_with_metrics: List[EvalResult] = dspy.InputField(
        prefix="Negative Inputs: ",
        desc="Negative inputs along with their score on a evaluation metric and actions taken",
    )
    feedback: str = dspy.OutputField(
        prefix="Feedback: ",
        desc="Feedback for the actor to improve the performance of negative inputs",
    )
```

This gives us a deeper insight into what the Comparator's job is:

1. It uses what has worked so far, and what has not worked.
2. It determines if there are obvious mistakes or logical errors.
3. It provides feedback on tool usage that could've improved negative results.

So, how does it deliver that feedback to the Actor system?

If you guessed: "Another `Signature`?" - you'd be correct!

```python
class FeedbackBasedInstruction(dspy.Signature):
    """There is a task that needs to be completed for which one can use multiple tools to achieve the desired outcome. A group's performance was evaluated on a dataset of inputs, the inputs that did well are positive inputs, and the inputs that did not do well are negative inputs.

You received feedback on how they can better use the tools to improve your performance on the negative inputs. You have been provided with the previous instruction, that they followed to use tools to complete the task, and the feedback on your performance.

Your task is to incorporate the feedback and generate a detailed instruction for the group to follow to improve their performance on the task.

Make sure that the new instruction talks about how to use the tools effectively and should be no more than 3 paragraphs long. The previous instruction contains general guidelines that you must retain in the new instruction."""

    previous_instruction: str = dspy.InputField(
        prefix="Previous Instruction: ",
        desc="Previous instruction for the actor to execute the task",
    )
    feedback: str = dspy.InputField(
        prefix="Feedback: ",
        desc="Feedback for the actor to improve the performance of negative inputs",
    )
    new_instruction: str = dspy.OutputField(
        prefix="New Instruction: ",
        desc="New instruction for the actor to execute the task",
    )
```

In [None]:
from dspy.teleprompt import AvatarOptimizer

teleprompter = AvatarOptimizer(
    metric=metric,
    max_iters=1,
    max_negative_inputs=2,
    max_positive_inputs=2,
)

All that's left to do is `.compile` our `arxiv_agent` on our training data!

In [None]:
optimized_arxiv_agent = teleprompter.compile(
    student=arxiv_agent,
    trainset=aqa_train
)

Iteration 1/1
Starting the task...
Starting the task...
Starting the task...Starting the task...Starting the task...
Starting the task...Action 1: ARXIV_SEARCH (2404.03876 keywords)


Action 1: ARXIV_SEARCH (Generate 10 in-depth questions about the method proposed in the paper with ID 2007.07843.)

Action 1: ARXIV_SEARCH (What problem or question is the paper 2307.15020 addressing?)Starting the task...

Starting the task...
Action 1: ARXIV_SEARCH (What is the central research question or hypothesis that the paper with ID 2304.01308 addresses?)Action 1: ARXIV_SEARCH (2303.01903 keywords)
Starting the task...
Action 1: ARXIV_SEARCH (What questions should we ask to create a comprehensive summary of the paper with ID 2206.03009?)
Action 1: ARXIV_SEARCH (Summarize the paper with ID 2305.10512 in one paragraph.)
Action 1: ARXIV_SEARCH (2402.01845 keywords)

Starting the task...
Starting the task...
Action 1: ARXIV_SEARCH (Summarize the paper with ID 2402.12550 in one sentence.)
Starting the 

Processing examples:   0%|          | 0/200 [00:00<?, ?it/s]

Action 1: ARXIV_SEARCH (2402.17505)
Action 1: ARXIV_SEARCH (2308.09597 keywords)
Action 1: ARXIV_SEARCH (What is the main contribution of the paper with ID 2304.01194?)
Action 1: ARXIV_SEARCH (Summarize the paper with ID 1901.09005 in one sentence.)
Action 1: ARXIV_SEARCH (What is the central research question or hypothesis that the paper with ID 2104.12671 addresses?)
Action 1: ARXIV_SEARCH (What is the main contribution of the paper with ID 2202.04053?)Action 1: ARXIV_SEARCH (Summarize the paper with ID 2402.18944 in one sentence.)

Action 1: ARXIV_SEARCH (What is the main contribution of the paper with ID 2401.04962?)Action 1: ARXIV_SEARCH (What future research directions do the authors suggest in paper 2305.16349?)

Action 1: ARXIV_SEARCH (What is the main contribution of the paper with ID 2403.09097?)
Action 1: ARXIV_SEARCH (Generate 10 in-depth questions about the method proposed in the paper with ID 2401.10189, avoiding simple questions.)
Action 1: ARXIV_SEARCH (Please provide a

Processing examples:   0%|          | 1/200 [00:06<22:04,  6.66s/it]

Starting the task...
Action 2: Finish (Write a very high-quality and detailed summary of the paper 2402.17505 that describes the paper in a way that a human can fully understand. The summary should cover the problem, the proposed solution and highlight the main contributions of the paper.)
Action 2: Finish (The keywords or key terms associated with the paper 2402.09666 are: commonsense knowledge graph, textual entailment, knowledge graph completion, semantic understanding, embedding, data sparsity.)Starting the task...

Action 2: Finish ()
Action 1: ARXIV_SEARCH (Generate 10 in-depth questions about the methods proposed in the paper with ID 2201.11990, focusing on complex aspects and avoiding simple questions.)


Processing examples:   1%|          | 2/200 [00:11<17:54,  5.43s/it]

Action 2: Finish (The keywords or key terms associated with the paper 2312.06561 are: hybrid neural fluid fields, fluid density, fluid velocity, neural dynamic reconstruction, visual ambiguities, turbulent fluid flows, physics-based losses, divergence-free velocity field, vortex particle-based velocity, 3D incompressible flow.)
Starting the task...
Starting the task...
Starting the task...
Starting the task...
Starting the task...
Starting the task...
Action 1: ARXIV_SEARCH (2403.15341 keywords)
Action 1: ARXIV_SEARCH (What is the main contribution of the paper with ID 2401.15029?)
Starting the task...
Action 2: Finish (The keywords or key terms associated with the paper 2403.14773 are: text-to-video, long video generation, autoregressive approach, conditional attention module, appearance preservation module, video enhancer, motion consistency.)
Action 1: ARXIV_SEARCH (Provide a detailed summary of the paper with ID 2403.02080, covering the problem, proposed solution, and main contribu

Processing examples:   2%|▏         | 3/200 [00:19<21:24,  6.52s/it]

Action 2: Finish (The paper with ID 2402.18944 discusses a framework that bridges the gap between sentence selection and fusion for summarization by ranking sentence singletons and pairs together, modeling human methodology to produce a summary sentence.)
Starting the task...
Starting the task...
Action 2: Finish ()
Action 2: Finish (The paper with ID 2403.17726 discusses a framework that bridges the gap between sentence selection and fusion for summarization by ranking sentence singletons and pairs together, modeling human methodology to produce a summary sentence.)
Action 1: ARXIV_SEARCH (Summarize the paper with ID 2205.14756 in a high-quality paragraph.)
Action 2: Finish ()
Action 2: Finish (Summarize the paper with ID 2402.10224 in one sentence.)
Action 1: ARXIV_SEARCH (Summarize the paper with ID 2307.08579 in two paragraphs.)Starting the task...

Action 2: Finish (Summarize the main method used in the paper with ID 2111.10659.)
Action 2: Finish (The paper with ID 1901.09005 disc

Processing examples:   2%|▎         | 5/200 [00:41<30:08,  9.27s/it]


Starting the task...
Action 1: ARXIV_SEARCH (Get keywords or key terms for paper ID 2312.11843)
Action 1: ARXIV_SEARCH (Summarize the paper with ID 2305.02750 in one sentence.)
Action 2: Finish (Summarize the paper with ID 2401.08046 in one sentence.)Action 1: ARXIV_SEARCH (What is the main contribution of the paper with ID 2404.00694?)

Action 2: Finish (The main contributions of the paper with ID 2403.09097 are: (1) the mapping of breakpoint types (IDEs and literature), (2) compiled definitions of breakpoint types, and (3) a breakpoint type taxonomy. These contributions provide the first step to organize breakpoint IDE taxonomy and lexicon, and support further debugging research.)
Starting the task...
Starting the task...
Action 2: Finish (The paper with ID 2403.00991 discusses a framework for scoring sentence singletons and pairs to improve abstractive summarization by bridging the gap between sentence selection and fusion.)
Action 2: Finish (Design 10 in-depth questions about the 

Processing examples:   4%|▍         | 8/200 [01:00<24:28,  7.65s/it]

Action 2: Finish (The TL;DR of the paper with ID 2304.08971 is: This paper proposes a unified framework for scoring and selecting sentence singletons and pairs for abstractive summarization, bridging the gap between sentence selection and fusion.)Starting the task...

Starting the task...
Starting the task...
Action 1: ARXIV_SEARCH (Get keywords or key terms for paper ID 2301.08556)
Action 1: ARXIV_SEARCH (Summarize the paper with ID 2401.04795 in one sentence.)
Action 2: Finish ()
Action 1: ARXIV_SEARCH (Provide a detailed summary of the paper with ID 2403.16182, covering the problem, proposed solution, and main contributions.)
Action 2: Finish (The main contributions of the paper with ID 2403.14293 are: (1) the mapping of breakpoint types (IDEs and literature), (2) compiled definitions of breakpoint types, and (3) a breakpoint type taxonomy. These contributions provide the first step to organize breakpoint IDE taxonomy and lexicon, and support further debugging research.)
Action 1: A

Processing examples:  32%|███▏      | 63/200 [01:34<02:27,  1.08s/it]

Action 1: ARXIV_SEARCH (What are the key methods proposed in the paper with ID 2311.17295?)
Action 2: Finish ()
Action 2: Finish (Write a very high-quality paragraph summarizing the paper with ID 2112.04478.)
Action 2: Finish (The main contributions of the paper with ID 2402.06452 are: (1) the mapping of breakpoint types (IDEs and literature), (2) compiled definitions of breakpoint types, and (3) a breakpoint type taxonomy. These contributions provide the first step to organize breakpoint IDE taxonomy and lexicon, and support further debugging research.)


Processing examples:  50%|█████     | 100/200 [01:36<00:59,  1.69it/s]

Action 2: Finish ()
Action 2: Finish (The paper with ID 2312.08234 discusses a novel approach to unsupervised sentence summarization using the Information Bottleneck principle, which optimizes the summary of information to predict relevant information effectively.)
Action 2: Finish (The main contributions of the paper with ID 2403.01731 are: (1) the mapping of breakpoint types (IDEs and literature), (2) compiled definitions of breakpoint types, and (3) a breakpoint type taxonomy. These contributions provide the first step to organize breakpoint IDE taxonomy and lexicon, and support further debugging research.)
Action 2: Finish (The paper with ID 2312.05923 discusses a novel approach to unsupervised sentence summarization using the Information Bottleneck principle, which optimizes the summary of information to predict relevant information, demonstrating improved performance over existing methods.)
Action 2: Finish (The keywords or key terms associated with the paper 2312.05526 are: Rein

Processing examples:  58%|█████▊    | 116/200 [01:50<00:54,  1.55it/s]

Action 2: Finish (Design 10 in-depth questions about the method proposed in the paper with ID 2311.15570, focusing on the Defence-in-Depth strategies, the use of Attribute-Based Encryption, and the implications for End-to-End security in Industry 4.0.)
Action 2: Finish ()
Action 2: Finish (The paper with ID 2402.07066 discusses a framework that bridges the gap between sentence selection and fusion for summarization by ranking single sentences and pairs together, aiming to model human summarization methods.)


Processing examples:  68%|██████▊   | 135/200 [01:51<00:31,  2.08it/s]

Action 2: Finish (Design 10 in-depth questions about the method proposed in the paper 'A formal approach to good practices in Pseudo-Labeling for Unsupervised Domain Adaptive Re-Identification' (Paper ID: 2403.10889). Focus on the theoretical framework, good practices, and experimental results.)
Action 2: Finish (Design 10 in-depth questions about the method proposed in the paper with ID 2104.06967, avoiding simple questions.)
Action 2: Finish ()
Action 2: Finish ()
Action 2: Finish (The main contributions of the paper with ID 2401.00964 are: (1) the mapping of breakpoint types (IDEs and literature), (2) compiled definitions of breakpoint types, and (3) a breakpoint type taxonomy. These contributions provide the first step to organize breakpoint IDE taxonomy and lexicon, and support further debugging research.)
Action 2: Finish (The authors suggest an exciting diversity of possible metaphoric options for future research into the mind and brain.)
Action 2: Finish ()
Action 2: Finish ()


Processing examples: 100%|██████████| 200/200 [01:54<00:00,  1.74it/s]


Average Score: 0.28
Positive examples: 56
Negative examples: 144
Sampling 2 positive examples and 2 negative examples
Action 100: ARXIV_SEARCH (What are the main contributions and methodologies proposed in the paper with ID 2305.11130, specifically focusing on Defence-in-Depth strategies and Attribute-Based Encryption?)
Action 102: ARXIV_SEARCH (Generate 10 specific in-depth questions about the methodologies and contributions discussed in the paper 1912.01991, focusing on key themes and techniques.)
Action 99: ARXIV_SEARCH (Based on the findings and methodologies discussed in the paper with ID 2304.04745, what specific questions should we ask to create a comprehensive summary?)
Generated new instruction: **Previous Instruction:** You will be given `Tools` which will be a list of tools to use to accomplish the `Goal`. Given the user query, your task is to decide which tool to use and what input values to provide. You will output action needed to accomplish the `Goal`. `Action` should ha

And now we can evaluate our newly optimized AvaTaR Agent!

In [None]:
teleprompter.thread_safe_evaluator(aqa_test, optimized_arxiv_agent)

Starting the task...
Starting the task...
Starting the task...Starting the task...

Starting the task...
Starting the task...
Starting the task...
Starting the task...
Starting the task...
Starting the task...
Starting the task...
Starting the task...
Starting the task...
Starting the task...
Starting the task...Starting the task...

Starting the task...
Starting the task...Starting the task...Starting the task...

Starting the task...
Starting the task...

Starting the task...
Starting the task...
Starting the task...
Starting the task...Starting the task...
Starting the task...

Starting the task...
Starting the task...
Starting the task...
Starting the task...
Action 1: ARXIV_SEARCH (Please provide a summary of the paper with ID 2402.08790 that highlights the key findings and contributions.)Starting the task...
Action 1: ARXIV_SEARCH (Please provide the keywords or key terms associated with the paper ID 2403.06404.)
Action 1: ARXIV_SEARCH (Please provide the keywords or key terms as

Processing examples:   0%|          | 0/100 [00:00<?, ?it/s]

Action 1: ARXIV_SEARCH (Please summarize the future research directions suggested by the authors in the paper with ID 2307.14352.)
Action 1: ARXIV_SEARCH (Summarize the main method used in the paper 2303.00747 in one paragraph.)
Action 1: ARXIV_SEARCH (What problem or question is the paper with ID 2307.08500 addressing?)
Action 1: ARXIV_SEARCH (Please provide a detailed summary of the methods proposed in the paper with ID 2312.01629.)
Action 1: ARXIV_SEARCH (Please provide the keywords or key terms associated with the paper ID 2203.11216.)
Action 1: ARXIV_SEARCH (Please provide the keywords or key terms associated with the paper ID 2402.11639.)Action 1: ARXIV_SEARCH (Provide a detailed summary of the paper with ID 2312.00849, covering the problem addressed, the proposed solution, and highlighting the main contributions.)

Action 1: ARXIV_SEARCH (Summarize the future research directions suggested by the authors of the paper with ID 2201.11990.)
Action 1: ARXIV_SEARCH (Generate 10 in-dep

Processing examples:   1%|          | 1/100 [00:37<1:01:41, 37.39s/it]

Action 2: Finish (Summarize the paper with ID 2401.14361 in one sentence.)
Starting the task...
Action 2: Finish ()
Action 2: Finish (Based on the detailed summary of the methods proposed in the paper with ID 2312.01629, design 10 in-depth questions about the method, avoiding simple questions.)
Starting the task...
Action 2: Finish (The paper with ID 2307.08500 addresses the problem of convergence analysis in block preconditioned steepest descent eigensolvers, specifically focusing on the convergence behavior of the preconditioned steepest descent iteration with implicit deflation (PSD-id) and extending this analysis to a block version of the method.)
Starting the task...
Action 2: Finish (Summarize the paper with ID 2403.02589 in one sentence.)
Starting the task...
Starting the task...
Action 2: Finish (Summarize the main method used in the paper 2303.00747 in one paragraph.)
Action 2: Finish (Based on the summary provided, the main contributions of the paper with ID 2402.04477 are: 1

Processing examples:   3%|▎         | 3/100 [00:54<25:38, 15.86s/it]  

Starting the task...
Starting the task...
Action 2: Finish (Based on the summary provided, the main contributions of the paper with ID 2401.13703 are: 1) A two-stage generation strategy for creating a preliminary summary sketch that serves as a basis for the final summary, utilizing weakly supervised signals from interrogative pronoun categories and key phrases. 2) A method for controlling the granularity of the final summary by predicting and highlighting different text spans, allowing for automatic determination of the number of generated summary sentences. 3) Achieving state-of-the-art performance on the SAMSum dialogue summarization corpus, with competitive human evaluation results.)
Action 2: Finish (The paper with ID 2303.15533 addresses the problem of designing reliable prediction models that abstain from predictions when faced with uncertain or out-of-distribution samples, known as Selective Classification in the presence of Out-of-Distribution data (SCOD). It makes three key c

Processing examples:   7%|▋         | 7/100 [01:44<21:17, 13.74s/it]

Action 2: Finish (The TL;DR of the paper with ID 2307.13251 is: This paper presents TL;DR Progress, a new tool for exploring the literature on neural text summarization, organizing 514 papers based on a comprehensive annotation scheme and enabling fine-grained, faceted search.)
Action 2: Finish (The main method used in the paper 2303.10826 involves constructing a hierarchical paragraph-level topic structure representation with three layers to guide the creation of a large-scale Chinese paragraph-level topic structure corpus (CPTS). This is achieved through a two-stage man-machine collaborative annotation method, resulting in high-quality data that supports tasks like topic segmentation and outline generation.)
Action 2: Finish (Summarize the paper with ID 2402.18491 in one sentence.)
Action 2: Finish (Summarize the paper with ID 2303.05762 in two paragraphs.)
Action 2: Finish (Write a very high-quality paragraph summarizing the paper with ID 2312.03497 that effectively covers the main 

Processing examples: 100%|██████████| 100/100 [02:40<00:00,  1.61s/it]


0.61

## Conclusion:

As you can see - our Agent went from a paltry `0.41` across our test set, to a less paltry `0.61`.
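
For a quick sense of scale, the two scores reported above can be turned into absolute and relative gains. This is plain arithmetic on the numbers from this run, not a new measurement.

```python
baseline, optimized = 0.41, 0.61  # test-set scores reported above

absolute_gain = optimized - baseline
relative_gain = absolute_gain / baseline

print(f"{absolute_gain:.2f} absolute, {relative_gain:.1%} relative")
# prints: 0.20 absolute, 48.8% relative
```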

