<a href="https://colab.research.google.com/github/parthasarathydNU/gen-ai-coursework/blob/main/advanced-llms/CourseWork/INFO_7374_Lecture_10_Agents.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Agents



## The Shift to Agentic Workflows
- Traditional non-agentic workflows: One-shot answers from LLMs.
- Agentic workflows: Iterative process with planning, execution, revision, and tool use.

## Advantages of Agentic Workflows
- Remarkably better results compared to non-agentic workflows.
- Case study: on the HumanEval coding benchmark, agentic workflows show improved performance.
- GPT-3.5 wrapped in an agentic workflow outperforms GPT-4 with zero-shot prompting.
- Significant productivity boost in AI development with agentic workflows.
- Expansion of AI capabilities through agentic workflows.


You'll likely see better results from GPT-4 with an agentic workflow than from GPT-5 by itself, so learning to use agents effectively is a useful skill.



## Design Patterns in Agentic Workflows

**Reflection**: LLMs can review and improve their own output.

Example: Code generation, review, and revision by the same LLM.

---


**Tool Use**: LLMs using external tools to expand capabilities.

Example: Integration with image manipulation or code execution tools.

---

**Planning**: LLMs outlining steps and executing tasks sequentially.

Example: Research agents for literature review that break a topic down into smaller subtopics.

---

**Multi-Agent Collaboration**: Multiple LLMs collaborating to achieve complex tasks.

Example: ChatDev, where LLMs take on different roles in software development. Multi-agent debate leading to better performance.

## Reflection

Large language models like ChatGPT, Claude, and Gemini are powerful tools for generating human-like text based on prompts. However, their initial outputs are not always ideal or fully satisfactory.

The typical process involves:
1. Prompting the model
2. Receiving an imperfect output
3. Providing critical feedback to help the model improve
4. Getting an enhanced response based on that feedback

But what if **step 3 could be automated?** What if the language model could analyze its own initial response, identify weaknesses and areas for improvement, and then refine the output - all without human intervention?

This is the core idea behind **Reflection** - enabling AI models to engage in self-critique to iteratively enhance their responses. By building in the ability to evaluate their own outputs against certain criteria and heuristics, language models could spot gaps, errors, or places where more detail and nuance is needed. They could then revise the response, resulting in higher quality, more comprehensive outputs.

Reflection turns the process of critique and revision into an automatic feedback loop within the model itself. The potential benefits include more efficient generation of strong responses, less need for human fine-tuning, and AI systems that continuously sharpen their own performance. It's an exciting avenue for making large language models more robust and self-optimizing.

### Iterative Code Generation via Self-Reflection

When tasked with writing code to accomplish a specific objective, large language models can be remarkably effective. By providing a clear prompt detailing the desired functionality, we can often obtain working code that gets the job done.

However, the initial code output may not always be optimal in terms of correctness, style, efficiency, or other key factors. This is where the power of self-reflection comes into play.

After generating code intended to carry out task X, we can prompt the model to carefully review its own work and provide constructive criticism. For example:

```
Here's code intended for task X: [previously generated code]

Please check the code carefully for correctness, style, and efficiency. Provide constructive criticism on how the code could be improved.
```

By framing the request in this way, we're asking the model to put on its critic hat and analyze the strengths and weaknesses of the generated code. The LLM will often spot issues such as logical errors, edge cases not handled, inefficient algorithms, inconsistent styling, lack of comments/documentation, etc. It can then suggest targeted improvements.

Armed with this internal feedback, we move to the next step - rewriting the code. The new prompt includes both the original code and the model's own constructive criticism:

```
Here is the code generated previously: [original code]

Here is the constructive feedback on how to improve it: [self-critique]

Please rewrite the code from scratch, thoroughly addressing the issues raised in the feedback while still accomplishing task X.
```

By explicitly asking the model to take its own suggestions into account during the rewrite, we can get a new version of the code that is enhanced along multiple dimensions. The LLM leverages both its original understanding of the task and its self-reflective analysis to produce a higher quality solution.

We can even repeat this critique-rewrite loop multiple times, progressively polishing the code until the model no longer has any substantive feedback for itself.

This iterative process of self-reflection, where the model serves as both generator and critic, can be a powerful tool for optimizing code quality. It allows LLMs to identify and overcome their own blind spots and limitations in a targeted way. Beyond code, the same technique can be applied to a wide variety of language tasks, using self-reflection to iteratively refine the outputs.
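The critique-rewrite loop described above can be sketched as follows. This is a minimal illustration, not a definitive implementation: `llm` is assumed to be any callable that maps a prompt string to a completion string, and the `NO FEEDBACK` stop marker, `reflect_and_rewrite`, and `fake_llm` are hypothetical names introduced here so the loop can run without an API key.

```python
from typing import Callable

CRITIQUE_PROMPT = (
    "Here's code intended for task: {task}\n\n{code}\n\n"
    "Please check the code carefully for correctness, style, and efficiency. "
    "Provide constructive criticism on how the code could be improved. "
    "If there is nothing substantive to fix, reply with exactly: NO FEEDBACK"
)
REWRITE_PROMPT = (
    "Here is the code generated previously:\n\n{code}\n\n"
    "Here is the constructive feedback on how to improve it:\n\n{critique}\n\n"
    "Please rewrite the code from scratch, thoroughly addressing the issues "
    "raised in the feedback while still accomplishing task: {task}"
)

def reflect_and_rewrite(llm: Callable[[str], str], task: str, max_rounds: int = 3) -> str:
    """Generate code, then alternate critique and rewrite until the critic is satisfied."""
    code = llm(f"Write code for the following task: {task}")
    for _ in range(max_rounds):
        critique = llm(CRITIQUE_PROMPT.format(task=task, code=code))
        if critique.strip() == "NO FEEDBACK":  # assumed stop marker
            break
        code = llm(REWRITE_PROMPT.format(task=task, code=code, critique=critique))
    return code

def fake_llm(prompt: str) -> str:
    """Scripted stand-in for a real model so the example is deterministic."""
    if prompt.startswith("Write code"):
        return "def add(a, b): return a - b"  # deliberately buggy first draft
    if prompt.startswith("Here's code intended"):
        return "NO FEEDBACK" if "a + b" in prompt else "The function subtracts instead of adding."
    return "def add(a, b): return a + b"

final_code = reflect_and_rewrite(fake_llm, "add two numbers")
print(final_code)
```

Capping the loop with `max_rounds` matters in practice, since a real model may keep producing feedback indefinitely.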


### Enhancing Reflection with External Evaluation Tools

The power of self-reflection for improving LLM outputs can be further augmented by providing the model with external tools to evaluate its own work. For example, when generating code, the model can be prompted to run its solution through a suite of unit tests to check correctness on various test cases. If the code fails any tests, the model can analyze the specific errors and propose targeted fixes.
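As a rough sketch of the unit-test idea, the helper below runs candidate code against a few test cases and returns the failures as text that could be pasted into the next critique prompt. The function name `run_unit_tests` and the `is_even` example are illustrative assumptions; `exec` is used only for brevity and should not be run on untrusted model output outside a sandbox.

```python
def run_unit_tests(code: str, func_name: str, cases: list) -> list:
    """Execute candidate code and return a list of human-readable failures."""
    namespace: dict = {}
    try:
        exec(code, namespace)  # NOTE: only safe for trusted or sandboxed code
        func = namespace[func_name]
    except Exception as exc:
        return [f"code failed to load: {exc!r}"]

    failures = []
    for args, expected in cases:
        try:
            actual = func(*args)
        except Exception as exc:
            failures.append(f"{func_name}{args} raised {exc!r}")
            continue
        if actual != expected:
            failures.append(f"{func_name}{args} returned {actual!r}, expected {expected!r}")
    return failures

cases = [((2,), True), ((3,), False)]
buggy = "def is_even(n):\n    return n % 2 == 1"
fixed = "def is_even(n):\n    return n % 2 == 0"

buggy_failures = run_unit_tests(buggy, "is_even", cases)
fixed_failures = run_unit_tests(fixed, "is_even", cases)

# The failure messages become the external feedback for the next rewrite prompt.
feedback = "The code failed these tests:\n" + "\n".join(buggy_failures)
print(feedback)
```

Because the feedback comes from actually running the code, it does not depend on the model noticing its own mistake.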

Similarly, for text generation tasks, the LLM can be asked to search the web for relevant information and cross-reference its output to check for factual accuracy, consistency, and completeness. Any discrepancies or gaps identified through this external validation process can then be fed back into the model's self-reflection to drive iterative improvements.


### A Multi-Agent Approach to Reflection

Another promising approach is to implement Reflection using a multi-agent framework. This involves creating two distinct AI agents with specialized roles: a generator agent focused on producing high-quality outputs, and a critic agent tasked with providing constructive feedback.

The generator agent is prompted to create an initial response to the given task. The critic agent then analyzes this output and offers suggestions for improvement. This kicks off a back-and-forth dialogue between the two agents, with the generator proposing revisions based on the critic's feedback, and the critic evaluating each new iteration.

Through this collaborative process of generation and critique, the output can be progressively refined. The multi-agent setup allows for a natural division of labor and a structured way to implement the Reflection pattern.

### Self-Refine: Iterative Refinement with Self-Feedback (May 2023)

https://arxiv.org/abs/2303.17651



![](https://raw.githubusercontent.com/madaan/self-refine/main/docs/static/images/animation_oldstyle_oneloop.gif)

Reflection is a powerful tool for iterative improvement across various domains. The process involves the following high-level steps:

1. Initial attempt
2. Reflection
3. Analysis
4. Revision
5. Evaluation
6. Iteration (if needed)
7. Final output

Here's how these steps map to the two specific examples:

Example 1: Drafting an email request
1. Initial attempt: Write a direct request, e.g., "Send me the data ASAP".
2. Reflection: Recognize that the phrasing may come across as impolite.
3. Analysis: Consider how the recipient might perceive the message and think about ways to make the request more polite and professional.
4. Revision: Revise the request, e.g., "Hi Ashley, could you please send me the data at your earliest convenience?".
5. Evaluation: Review the revised message to ensure it strikes the right tone and clearly communicates the request.
6. Iteration (if needed): If the revised message can still be improved, go through another round of reflection and revision.
7. Final output: Send the polished, professional request to the colleague.

Example 2: Writing code
1. Initial attempt: Implement a "quick and dirty" solution.
2. Reflection: Identify areas where the code can be improved in terms of efficiency and readability.
3. Analysis: Pinpoint specific inefficiencies or unclear sections that need refactoring.
4. Revision: Refactor the code, optimizing for efficiency and enhancing readability.
5. Evaluation: Test the refactored code to verify that it still functions correctly and has improved in terms of performance and maintainability.
6. Iteration (if needed): If further opportunities for optimization or clarity are identified, repeat the refactoring process.
7. Final output: Commit the optimized, readable code to the codebase.

By applying the reflection process to these specific examples, we can see how it leads to more effective communication in the email example and higher-quality code in the programming example. The same high-level steps can be adapted to suit various contexts and goals, demonstrating the versatility and value of reflection as a tool for continuous improvement.

![](https://selfrefine.info/static/images/fig2.png)

![](https://i.imgur.com/3TMpxHE.png)

In [None]:
!pip install -q openai


In [None]:
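from openai import OpenAI
from google.colab import userdata

# Together AI exposes an OpenAI-compatible endpoint, so the OpenAI client
# works once it is pointed at Together's base URL.
client = OpenAI(
    api_key=userdata.get('TOGETHER_API_KEY'),
    base_url='https://api.together.xyz/v1',
)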


In [None]:
model = "Qwen/Qwen1.5-4B-Chat"

def generate_tweet(messages):
    """Generate a tweet from the running conversation."""
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        max_tokens=50,  # short cap; longer tweets are cut off
        n=1,
        stop=None,
        temperature=0.7,
    )
    return response.choices[0].message.content.strip()

def evaluate_tweet(tweet):
    """Ask the model, in a critic role, to critique a tweet."""
    messages = [
        {"role": "system", "content": "You are a Twitter influencer."},
        {"role": "user", "content": f"Please critique the following tweet:\n\n{tweet}"}
    ]
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        max_tokens=100,  # short cap; longer critiques are cut off
        n=1,
        stop=None,
        temperature=0.7,
    )
    return response.choices[0].message.content.strip()


messages = [
    {"role": "system", "content": "You are a social media manager for Northeastern University."},
    {"role": "user", "content": "Generate an engaging tweet about Northeastern University:"}
]

for i in range(2):
    tweet = generate_tweet(messages)
    print(f"\n#### Generated Tweet {i+1}:")
    print(tweet)

    critique = evaluate_tweet(tweet)
    print(f"\n#### Critique {i+1}:")
    print(critique)

    # Feed the draft and its critique back so the next draft can address it.
    messages.append({"role": "assistant", "content": tweet})
    messages.append({"role": "user", "content": f"Improve the tweet based on this critique. Only write the improved tweet:\n\n{critique}"})

print("\n#### Final Tweet:")
messages.append({"role": "assistant", "content": generate_tweet(messages)})
print(messages[-1]["content"])



#### Generated Tweet 1:
"Join us at Northeastern University and be part of a community that fosters creativity, collaboration, and innovation. #Northeastern #University #Community #Innovation"

#### Critique 1:
The tweet you've provided is a promotion for Northeastern University, and it does a few things well. Firstly, it highlights the key values that the university is known for: creativity, collaboration, and innovation. By using hashtags like #Northeastern and #University, the tweet is easily discoverable by people interested in these topics.
Secondly, the tweet is concise and straightforward, which is important for grabbing people's attention and making them want to know more. The call to action is clear:

#### Generated Tweet 2:
"Be part of Northeastern University's community of creativity, collaboration, and innovation. Join now! #Northeastern #University #Community #Innovation"

#### Critique 2:
The tweet aims to promote Northeastern University's community of creativity, collab

In [None]:
for message in messages:
    print(message)
    print('---')

{'role': 'system', 'content': 'You are a social media manager for Northeastern University.'}
---
{'role': 'user', 'content': 'Generate an engaging tweet about Northeastern University:'}
---
{'role': 'assistant', 'content': '"Join Northeastern University as we continue to push the boundaries of knowledge and innovation! Check out our latest achievements and upcoming events at @NEU.edu." #NEU #Innovation #Education'}
---
{'role': 'user', 'content': "Improve the tweet based on this critique. Only write the improve tweet:\n\nAs an AI language model, I don't have personal opinions or biases. However, I can provide a neutral critique of the tweet based on the key elements of a successful tweet:\n\n1. Clear and concise headline: The headline is clear and concise, and it clearly communicates the purpose of the tweet. It's a good way to capture people's attention and encourage them to click on the tweet to learn more.\n\n2. Use of hashtags: Hashtags are a great way to increase the visibility of

### Reflexion: Language Agents with Verbal Reinforcement Learning (October 2023)

https://arxiv.org/abs/2303.11366


Reflexion agents reflect on task feedback signals to improve their decision making over time. The key aspects are:

- They **verbally reflect** on the feedback signals they receive after completing a task
- They **store** these reflections in an **episodic memory buffer**
- In subsequent trials of the task, they use the stored reflections to **make better decisions**


![](https://i.imgur.com/VXYMKoy.png)

#### Reflexion's Three Models

 ![](https://www.promptingguide.ai/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Freflexion.1053729e.png&w=1080&q=75)

Reflexion has three main parts:

1. **Actor**: This part makes text and actions based on what it sees. It does something, looks at the result, and remembers what happened. It uses methods like Chain-of-Thought (CoT) and ReAct. It also has a memory to help it remember important things.

2. **Evaluator**: This part gives a score to what the Actor did. It looks at what happened and decides how good or bad it was. The score depends on the task. It uses language models and rules to make decisions.

3. **Self-Reflection**: This part helps the Actor get better. It's like a teacher giving tips. It looks at the score, what just happened, and things from the past. Then it makes suggestions to help the Actor do better next time. These suggestions go into the Actor's memory so it can keep learning and improving.
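The interaction between the three parts can be sketched as a simple trial loop. This is a toy illustration under stated assumptions, not the paper's implementation: `reflexion_loop` and the three stub functions are hypothetical names, and in a real agent each stub would be an LLM call or a task-specific scorer.

```python
from typing import Callable

def reflexion_loop(
    actor: Callable[[str, list], str],
    evaluator: Callable[[str], float],
    self_reflect: Callable[[str, float], str],
    task: str,
    max_trials: int = 3,
    success_threshold: float = 1.0,
) -> tuple:
    """Retry a task, carrying verbal reflections forward in an episodic memory."""
    memory: list = []  # episodic memory buffer of verbal reflections
    attempt = ""
    for _ in range(max_trials):
        attempt = actor(task, memory)      # Actor sees the task plus past reflections
        score = evaluator(attempt)         # Evaluator scores the attempt
        if score >= success_threshold:
            break
        memory.append(self_reflect(attempt, score))  # store advice for the next trial
    return attempt, memory

# Toy stand-ins for the three components.
def actor(task, memory):
    return "HELLO" if memory else "hello"

def evaluator(attempt):
    return 1.0 if attempt == "HELLO" else 0.0

def self_reflect(attempt, score):
    return f"Attempt {attempt!r} scored {score}: the output must be upper case."

answer, reflections = reflexion_loop(actor, evaluator, self_reflect, "Shout the word hello")
print(answer, reflections)
```

Note that nothing is fine-tuned here: the only thing that changes between trials is the text stored in memory.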

### CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing (Feb 2024)
https://arxiv.org/abs/2305.11738


CRITIC proposes that for some tasks, using the LLM to directly provide feedback for an answer is not sufficient. It is better to use external tools to help verify the correctness of an answer and use the tools' output as part of the critique.
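A minimal sketch of a tool-backed critique is shown below. Everything in it is an assumption made for illustration: `FAKE_KNOWLEDGE_BASE` stands in for a real search engine or knowledge base, and `tool_critique` uses a plain substring check where a real system would have the LLM compare the answer with the retrieved evidence.

```python
from typing import Optional

# Hypothetical stand-in for a real search or knowledge-base tool.
FAKE_KNOWLEDGE_BASE = {"capital of australia": "Canberra"}

def lookup(query: str) -> Optional[str]:
    """Return evidence for the query, or None if the tool finds nothing."""
    return FAKE_KNOWLEDGE_BASE.get(query.lower())

def tool_critique(question: str, answer: str) -> str:
    """Build a critique grounded in the tool's output rather than the model's opinion."""
    evidence = lookup(question)
    if evidence is None:
        return "No external evidence found; the answer could not be verified."
    if evidence.lower() in answer.lower():
        return f"The answer is consistent with the evidence: {evidence}."
    return f"The answer conflicts with the evidence, which says: {evidence}. Please revise."

critique = tool_critique("capital of Australia", "The capital is Sydney.")
print(critique)
```

The resulting critique string would then be fed back to the model in a revision prompt, in the same way as a self-generated critique.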

 ![](https://i.imgur.com/K8ca6i2.png)

 ![](https://i.imgur.com/2s3MYxE.png)

## Tool Use

- Large Language Models (LLMs) can be made even more powerful by giving them access to additional tools.
  - For example, LLMs can be connected to web search engines, allowing them to find up-to-date information to help answer questions better.
  - They can also be given the ability to run computer code, which lets them give working code examples and solve programming problems.

- By combining LLMs with these extra capabilities, they can provide responses that are more accurate and relevant to the specific situation.

- This opens up exciting possibilities for what AI language models can do to assist humans with a wide variety of tasks.

### Tools for Enhancing Large Language Models

Web search tool:
- When prompted with a question like "What is the best coffee maker according to reviewers?", an LLM can use a web search tool to gain context
- The LLM generates a special string (e.g., `{tool: web-search, query: "coffee maker reviews"}`) to request a web search
- The result is passed back to the LLM as additional input context for further processing

Code execution tool:
- When asked a question requiring computation (e.g., compound interest calculation), an LLM can use a code execution tool
- The LLM generates a string (e.g., `{tool: python-interpreter, code: "100 * (1+0.07)**12"}`) to request code execution
- This approach is more likely to result in the correct answer compared to generating the answer directly using a transformer network

Database query tool:
- LLMs can be integrated with databases to retrieve specific information
- For example, if asked "How many employees work at Company X?", the LLM can generate a database query (e.g., `{tool: database, query: "SELECT COUNT(*) FROM employees WHERE company = 'X'"}`) to fetch the answer
- This allows LLMs to provide accurate, up-to-date information from structured data sources

Image analysis tool:
- LLMs can be combined with computer vision models to analyze and describe images
- When presented with an image, the LLM can generate a request (e.g., `{tool: image-analyzer, image: <image_data>}`) to extract information from the image
- The image analysis results (e.g., object detection, scene description) can then be used by the LLM to answer questions or generate text related to the image

Translation tool:
- LLMs can leverage machine translation models to communicate in multiple languages
- If prompted with a request in a foreign language, the LLM can generate a translation request (e.g., `{tool: translator, text: "Bonjour, comment allez-vous?", target_language: "English"}`)
- The translated text is then passed back to the LLM, allowing it to understand and respond to the request in the original language
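On the application side, each of these request strings has to be parsed and routed to the right tool. The sketch below assumes the model emits strict JSON (the notation in the bullets above is shorthand, not valid JSON) and that only two tools are registered. `handle_tool_request`, `safe_arithmetic`, and the `web-search` placeholder are hypothetical names; the arithmetic evaluator walks the syntax tree instead of calling `eval()` on model output.

```python
import ast
import json
import operator

_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Pow: operator.pow, ast.USub: operator.neg,
}

def safe_arithmetic(expression: str) -> float:
    """Evaluate +, -, *, /, ** on numeric literals without calling eval()."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.operand))
        raise ValueError("unsupported expression")
    return walk(ast.parse(expression, mode="eval").body)

TOOLS = {
    "python-interpreter": lambda req: safe_arithmetic(req["code"]),
    # Placeholder: a real system would call a search API here.
    "web-search": lambda req: f"(search results for {req['query']!r})",
}

def handle_tool_request(raw: str):
    """Parse a JSON tool request and dispatch it to the matching tool."""
    request = json.loads(raw)
    tool = TOOLS.get(request.get("tool"))
    if tool is None:
        raise ValueError(f"unknown tool: {request.get('tool')!r}")
    return tool(request)

result = handle_tool_request('{"tool": "python-interpreter", "code": "100 * (1+0.07)**12"}')
print(round(result, 2))  # 225.22
```

The tool's return value is then appended to the model's context so it can phrase the final answer.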

#### Expanding Tool Use in Agentic Workflows

To use these tools, LLMs can be given detailed descriptions of the available functions, including:

- A text description explaining the purpose of each function
- Information about the arguments each function expects

With this knowledge, the LLM is expected to automatically select the most appropriate function to call in order to complete a given task.

![](https://dl-staging-website.ghost.io/content/images/size/w1000/2024/04/TOOL-USE-3.png)

### Handling Large Numbers of Tools
- Systems are being built where LLMs have access to hundreds of tools
- Including all tool descriptions in the LLM context may not be feasible
- Heuristics can be used to select the most relevant subset of tools to include in the context at each processing step
  - Similar to retrieval augmented generation (RAG) systems, which use heuristics to select a subset of text when the full context is too large
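One very simple heuristic of this kind is word overlap between the user query and each tool description. The sketch below is only an illustration of the idea: `select_tools` and the `translate_text` entry are hypothetical, and a production system would more likely rank tools by embedding similarity and ignore stop words.

```python
import re

def select_tools(query: str, tools: list, k: int = 2) -> list:
    """Return the k tools whose descriptions share the most words with the query."""
    def words(text: str) -> set:
        return set(re.findall(r"[a-z]+", text.lower()))

    query_words = words(query)
    ranked = sorted(
        tools,
        key=lambda tool: len(query_words & words(tool["description"])),
        reverse=True,
    )
    return ranked[:k]

catalog = [
    {"name": "translate_text", "description": "Translate text into a target language"},
    {"name": "get_popular_sport_team",
     "description": "Get the most popular professional sports team in a given location"},
    {"name": "get_current_weather",
     "description": "Get the current weather in a given location"},
]

selected = select_tools("What is the current weather in Paris?", catalog, k=1)
print([tool["name"] for tool in selected])  # ['get_current_weather']
```

Only the selected subset would be passed as tool definitions in the next model call, which keeps the prompt within the context limit.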

### Expansion of Tool Use Practices
- Tool-use practices have expanded rapidly since the introduction of large multimodal models (LMMs) such as LLaVA, GPT-4V, and Gemini
- **GPT-4's function calling capability, released in June 2023, was a significant step toward general-purpose tool use**

In [None]:
import json


# Define function(s)
tools = [
  {
    "type": "function",
    "function": {
      "name": "get_current_weather",
      "description": "Get the current weather in a given location",
      "parameters": {
        "type": "object",
        "properties": {
          "city": {
            "type": "string",
            "description": "The city e.g. San Francisco"
          },
          "country": {
            "type": "string",
            "description": "The country e.g. USA"
          },
          "unit": {
            "type": "string",
            "enum": [
              "celsius",
              "fahrenheit"
            ]
          }
        }
      }
    }
  },
  {
    "type": "function",
    "function": {
      "name": "get_popular_sport_team",
      "description": "Get the most popular professional sports team in a given location",
      "parameters": {
        "type": "object",
        "properties": {
          "city": {
            "type": "string",
            "description": "The city e.g. San Francisco"
          },
          "sport": {
            "type": "string",
            "description": "The sport e.g. basketball"
          },
        }
      }
    }
  }
]

In [None]:
# Generate
response = client.chat.completions.create(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the current temperature of Paris?"},
    ],
    tools=tools,
    tool_choice="auto",
)

# model_dump() is the Pydantic v2 replacement for the deprecated .dict()
print(json.dumps(response.choices[0].message.model_dump()['tool_calls'], indent=2))



[
  {
    "id": "call_i36rt3cvvuip6wxwzmvx9k2u",
    "function": {
      "arguments": "{\"city\":\"Paris\",\"country\":\"France\",\"unit\":\"celsius\"}",
      "name": "get_current_weather"
    },
    "type": "function"
  }
]


In [None]:
# Generate
response = client.chat.completions.create(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the most popular basketball team in Oakland?"},
    ],
    tools=tools,
    tool_choice="auto",
)

# model_dump() is the Pydantic v2 replacement for the deprecated .dict()
print(json.dumps(response.choices[0].message.model_dump()['tool_calls'], indent=2))



[
  {
    "id": "call_mvguz6xxho4whuzh7h0zk3ry",
    "function": {
      "arguments": "{\"city\":\"Oakland\",\"sport\":\"basketball\"}",
      "name": "get_popular_sport_team"
    },
    "type": "function"
  }
]


Note that the LLM only tells you which tool to call and with which parameters; it does not produce the answer itself. From this response, we extract the tool name and parameters and use them to call our own implementation of the tool.

In [None]:
# Take the first tool call; the arguments arrive as a JSON-encoded string.
tool_call = response.choices[0].message.tool_calls[0]
tool_name = tool_call.function.name
params = json.loads(tool_call.function.arguments)

print(f"{tool_name=}")
print(f"{params=}")

tool_name='get_popular_sport_team'
params={'city': 'Oakland', 'sport': 'basketball'}


In [None]:
type(params)

dict

In [None]:
def get_popular_sport_team(city, sport):
    if city == 'Oakland' and sport == 'basketball':
        return "Warriors"
    else:
        return "not sure"

def get_current_weather(city, country=None, unit="celsius"):
    # Signature mirrors the tool schema (city, country, unit) so that
    # **params unpacks without a TypeError.
    # add custom logic here
    return "call 3rd party api"


def call_tool(tool_name, params):
    """Dispatch the model's tool request to the matching Python function."""
    if tool_name == 'get_popular_sport_team':
        return get_popular_sport_team(**params)
    elif tool_name == 'get_current_weather':
        return get_current_weather(**params)
    else:
        raise NotImplementedError(f"Unknown tool: {tool_name}")

call_tool(tool_name, params)

'Warriors'
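To turn the tool result into a natural-language answer, the result is usually sent back to the model in a follow-up request. The sketch below only builds the message list, following the OpenAI chat format of an assistant message carrying `tool_calls` followed by a `tool` message. `build_followup_messages` is a hypothetical helper, and whether a given OpenAI-compatible provider accepts these messages unchanged is an assumption to verify.

```python
import json

def build_followup_messages(messages: list, tool_call: dict, result) -> list:
    """Return a new history with the assistant's tool call and the tool's result appended."""
    return messages + [
        {"role": "assistant", "content": None, "tool_calls": [tool_call]},
        {"role": "tool", "tool_call_id": tool_call["id"], "content": json.dumps(result)},
    ]

# Tool call copied from the model output shown earlier.
tool_call = {
    "id": "call_mvguz6xxho4whuzh7h0zk3ry",
    "type": "function",
    "function": {
        "name": "get_popular_sport_team",
        "arguments": "{\"city\":\"Oakland\",\"sport\":\"basketball\"}",
    },
}
history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the most popular basketball team in Oakland?"},
]

followup = build_followup_messages(history, tool_call, "Warriors")
print(json.dumps(followup[-1], indent=2))

# followup could then be passed as `messages` to another
# client.chat.completions.create(...) call to get the final answer.
```

The `tool_call_id` ties the result to the specific call, which matters when the model requests several tools in one turn.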

### Gorilla: Large Language Model Connected with Massive APIs (May 2023)

https://arxiv.org/abs/2305.15334

At the time, LLMs had difficulty calling APIs and tools with accurate input arguments and tended to hallucinate incorrect API usage.

Gorilla is a fine-tuned LLaMA-based model that, at the time, surpassed GPT-4 at writing API calls.



### HuggingGPT

https://arxiv.org/abs/2303.17580

![](https://lilianweng.github.io/posts/2023-06-23-agent/hugging-gpt.png)

HuggingGPT uses an LLM as a controller that plans tasks and delegates them to expert models in four stages: task planning, model selection, task execution, and response generation.

#### Task Planning
- The LLM acts as the "brain" to parse user requests into multiple subtasks
- Each subtask has four attributes:
  1. Task type
  2. ID
  3. Dependencies
  4. Arguments
- Few-shot examples are used to guide the LLM in task parsing and planning

*prompt*

The AI assistant can parse user input to several tasks: [{"task": task, "id", task_id, "dep": dependency_task_ids, "args": {"text": text, "image": URL, "audio": URL, "video": URL}}]. The "dep" field denotes the id of the previous task which generates a new resource that the current task relies on. A special tag "-task_id" refers to the generated text image, audio and video in the dependency task with id as task_id. The task MUST be selected from the following options: {{ Available Task List }}. There is a logical relationship between tasks, please note their order. If the user input can't be parsed, you need to reply empty JSON. Here are several cases for your reference: {{ Demonstrations }}. The chat history is recorded as {{ Chat History }}. From this chat history, you can find the path of the user-mentioned resources for your task planning.
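Once the LLM returns a task list in this format, the `dep` fields determine the order in which tasks can run. The sketch below orders tasks with the standard-library `graphlib` module. The two example tasks, the helper name `execution_order`, and the use of `-1` to mean "no dependency" are assumptions made for illustration.

```python
from graphlib import TopologicalSorter

def execution_order(tasks: list) -> list:
    """Order task ids so that every task runs after the tasks it depends on."""
    graph = {
        # -1 is treated as "no dependency" (assumed convention)
        task["id"]: {dep for dep in task["dep"] if dep != -1}
        for task in tasks
    }
    return list(TopologicalSorter(graph).static_order())

# Illustrative parsed output; listed out of order on purpose.
tasks = [
    {"task": "pose-to-image", "id": 1, "dep": [0], "args": {"image": "-0"}},
    {"task": "pose-detection", "id": 0, "dep": [-1], "args": {"image": "boy.jpg"}},
]

order = execution_order(tasks)
print(order)  # [0, 1]
```

`TopologicalSorter` raises `CycleError` if the model produces circular dependencies, which is a useful sanity check on generated plans.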

#### Model Selection
- The LLM distributes the subtasks to specialized expert models
- The request to the expert model is framed as a multiple-choice question
- The LLM is given a list of models to choose from
- Task type based filtering is needed due to limited context length


*prompt*

Given the user request and the call command, the AI assistant helps the user to select a suitable model from a list of models to process the user request. The AI assistant merely outputs the model id of the most appropriate model. The output must be in a strict JSON format: "id": "id", "reason": "your detail reason for the choice". We have a list of models for you to choose from {{ Candidate Models }}. Please select one model from the list.


#### Task Execution
- The selected expert models execute their assigned subtasks
- The results of each subtask are logged

*prompt*

With the input and the inference results, the AI assistant needs to
describe the process and results. The previous stages can be formed as:

    User Input: {{ User Input }}
    Task Planning: {{ Tasks }}
    Model Selection: {{ Model Assignment }}
    Task Execution: {{ Predictions }}


You must first answer the user's request in a straightforward manner.
Then describe the task process and show your analysis and model
inference results to the user in the first person. If inference
results contain a file path, you must tell the user the complete
file path.


#### Response Generation
- The LLM receives the execution results from the expert models
- It then generates a summarized response for the end user based on the combined subtask results


### AutoGPT

```
You are {{ai-name}}, {{user-provided AI bot description}}.
Your decisions must always be made independently without seeking user assistance. Play to your strengths as an LLM and pursue simple strategies with no legal complications.

GOALS:

1. {{user-provided goal 1}}
2. {{user-provided goal 2}}
3. ...
4. ...
5. ...

Constraints:
1. ~4000 word limit for short term memory. Your short term memory is short, so immediately save important information to files.
2. If you are unsure how you previously did something or want to recall past events, thinking about similar events will help you remember.
3. No user assistance
4. Exclusively use the commands listed in double quotes e.g. "command name"
5. Use subprocesses for commands that will not terminate within a few minutes

Commands:
1. Google Search: "google", args: "input": "<search>"
2. Browse Website: "browse_website", args: "url": "<url>", "question": "<what_you_want_to_find_on_website>"
3. Start GPT Agent: "start_agent", args: "name": "<name>", "task": "<short_task_desc>", "prompt": "<prompt>"
4. Message GPT Agent: "message_agent", args: "key": "<key>", "message": "<message>"
5. List GPT Agents: "list_agents", args:
6. Delete GPT Agent: "delete_agent", args: "key": "<key>"
7. Clone Repository: "clone_repository", args: "repository_url": "<url>", "clone_path": "<directory>"
8. Write to file: "write_to_file", args: "file": "<file>", "text": "<text>"
9. Read file: "read_file", args: "file": "<file>"
10. Append to file: "append_to_file", args: "file": "<file>", "text": "<text>"
11. Delete file: "delete_file", args: "file": "<file>"
12. Search Files: "search_files", args: "directory": "<directory>"
13. Analyze Code: "analyze_code", args: "code": "<full_code_string>"
14. Get Improved Code: "improve_code", args: "suggestions": "<list_of_suggestions>", "code": "<full_code_string>"
15. Write Tests: "write_tests", args: "code": "<full_code_string>", "focus": "<list_of_focus_areas>"
16. Execute Python File: "execute_python_file", args: "file": "<file>"
17. Generate Image: "generate_image", args: "prompt": "<prompt>"
18. Send Tweet: "send_tweet", args: "text": "<text>"
19. Do Nothing: "do_nothing", args:
20. Task Complete (Shutdown): "task_complete", args: "reason": "<reason>"

Resources:
1. Internet access for searches and information gathering.
2. Long Term memory management.
3. GPT-3.5 powered Agents for delegation of simple tasks.
4. File output.

Performance Evaluation:
1. Continuously review and analyze your actions to ensure you are performing to the best of your abilities.
2. Constructively self-criticize your big-picture behavior constantly.
3. Reflect on past decisions and strategies to refine your approach.
4. Every command has a cost, so be smart and efficient. Aim to complete tasks in the least number of steps.

You should only respond in JSON format as described below
Response Format:
{
    "thoughts": {
        "text": "thought",
        "reasoning": "reasoning",
        "plan": "- short bulleted\n- list that conveys\n- long-term plan",
        "criticism": "constructive self-criticism",
        "speak": "thoughts summary to say to user"
    },
    "command": {
        "name": "command name",
        "args": {
            "arg name": "value"
        }
    }
}
Ensure the response can be parsed by Python json.loads

```
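Because the prompt requires replies that `json.loads` can parse, the controlling program can extract the command with a few lines of code. The following is a minimal sketch, not AutoGPT's actual implementation: `parse_agent_response` and the two handlers in `COMMANDS` are hypothetical, and the reply is hand-written to match the response format above.

```python
import json

def parse_agent_response(raw: str) -> tuple:
    """Validate a reply in the response format above and return (command name, args)."""
    reply = json.loads(raw)
    command = reply["command"]
    name, args = command["name"], command.get("args", {})
    if not isinstance(name, str) or not isinstance(args, dict):
        raise ValueError("malformed command")
    return name, args

# Hypothetical handlers for two of the listed commands.
COMMANDS = {
    "do_nothing": lambda: "no action taken",
    "task_complete": lambda reason: f"shutting down: {reason}",
}

raw_reply = json.dumps({
    "thoughts": {
        "text": "All goals are met.",
        "reasoning": "Nothing is left to do.",
        "plan": "- stop",
        "criticism": "none",
        "speak": "Done.",
    },
    "command": {"name": "task_complete", "args": {"reason": "goals achieved"}},
})

name, args = parse_agent_response(raw_reply)
outcome = COMMANDS[name](**args)
print(outcome)  # shutting down: goals achieved
```

In a full agent loop, the outcome of each command would be appended to the conversation before asking the model for its next step.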

### Efficient Tool Use with Chain-of-Abstraction Reasoning (Feb 2024)

https://arxiv.org/abs/2401.17464

The key idea behind CoA is:

1. Train LLMs to first generate reasoning chains with **abstract placeholders** (e.g., y1, y2, y3).

2. Then, call domain-specific tools to fill in those placeholders with **specific knowledge**, grounding the final answer.

This decoupling of general reasoning and domain knowledge allows for parallel processing, where LLMs can generate the next abstract chain while tools fill in the current one, accelerating the overall inference.
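The two-stage idea can be illustrated with a toy arithmetic chain. In this sketch the abstract chain is hand-written rather than generated by a trained LLM, the bracketed `[a op b = yN]` syntax is an assumed notation, and a small calculator plays the role of the domain tool. `fill_placeholders` is a hypothetical helper name.

```python
import operator
import re

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}
STEP = re.compile(r"\[(\S+) ([+\-*/]) (\S+) = (y\d+)\]")

def fill_placeholders(chain: str) -> tuple:
    """Resolve each [a op b = yN] step with a calculator tool, left to right."""
    values: dict = {}

    def resolve(token: str) -> float:
        # A token is either an earlier placeholder or a numeric literal.
        return values[token] if token in values else float(token)

    for left, op, right, name in STEP.findall(chain):
        values[name] = OPS[op](resolve(left), resolve(right))

    def render(match):
        name = match.group()
        return f"{values[name]:g}" if name in values else name

    # Substitute the computed values back into the abstract chain.
    grounded = re.sub(r"\by\d+\b", render, chain)
    return grounded, values

abstract_chain = (
    "The two orders total [20 + 35 = y1] items. "
    "Doubling that gives [y1 * 2 = y2], so the answer is y2."
)
grounded, values = fill_placeholders(abstract_chain)
print(grounded)
```

The reasoning structure is fixed before any tool runs, so the model never has to pause mid-generation to wait for a tool result.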

#### Benefits of CoA

CoA offers several advantages over prior methods where LLM decoding and API calls are interleaved:

- Promotes effective planning by encouraging LLMs to interconnect multiple tool calls and adopt more feasible reasoning strategies.

- Enables robust and efficient multistep reasoning by separating high-level reasoning from granular knowledge retrieval.

![](https://i.imgur.com/TEeSRDH.png)

![](https://i.imgur.com/IQor7uK.png)

## Planning

https://arxiv.org/abs/2402.02716 (Understanding the planning of LLM agents: A survey, Feb 2024)



Many complex tasks require multiple steps and tools to complete. An AI agent can break down the task into smaller, manageable steps and decide which tools to use at each step. For instance, if the goal is to create an image of a girl in the same pose as a boy in a given picture, the agent might plan the following steps:

1. Use a pose detection tool to analyze the boy's pose in the input image.
2. Use a pose-to-image generation tool to render an image of a girl in the detected pose.

The agent can be fine-tuned or prompted to generate a plan specifying the tools, inputs, and outputs for each step, like this:

```
{tool: pose-detection, input: image.jpg, output: temp1}
{tool: pose-to-image, input: temp1, output: final.jpg}
```

![](https://dl-staging-website.ghost.io/content/images/2024/04/unnamed---2024-04-10T140722.194.png)

#### Example: Planning for a Research Agent

Let's consider a text-based example involving a research agent. Suppose the agent is tasked with writing a summary of the impact of climate change on coral reefs. The agent might create a plan like this:

```
{tool: web-search, input: "climate change impact on coral reefs", output: search_results}
{tool: text-summarization, input: search_results, output: key_points}
{tool: article-writing, input: key_points, output: draft_summary}
{tool: grammar-checking, input: draft_summary, output: final_summary}
```

In this plan, the agent:

1. Searches the web for relevant information on the impact of climate change on coral reefs.
2. Summarizes the key points from the search results.
3. Writes a draft summary using the key points.
4. Checks and corrects the grammar in the draft summary to produce the final output.

By breaking down the task into smaller steps and specifying the appropriate tools for each step, the AI agent can effectively plan and execute complex tasks.
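
A minimal sketch of how such a plan could be executed sequentially. Everything here is an assumption made for illustration: the four tool functions are hypothetical stand-ins that return canned text (a real agent would call a search API, an LLM, and so on), and `run_plan` is a made-up helper, not part of any library.

```python
# Hypothetical stand-in tools with deterministic, canned behavior.
def web_search(query):
    return [f"Finding A about {query}", f"Finding B about {query}"]


def text_summarization(documents):
    return [doc.split(" about ")[0] for doc in documents]


def article_writing(key_points):
    return "summary: " + "; ".join(key_points)


def grammar_checking(text):
    text = text[0].upper() + text[1:]
    return text if text.endswith(".") else text + "."


TOOLS = {
    "web-search": web_search,
    "text-summarization": text_summarization,
    "article-writing": article_writing,
    "grammar-checking": grammar_checking,
}

# The plan from the text, written as a list of Python dictionaries.
PLAN = [
    {"tool": "web-search", "input": "climate change impact on coral reefs", "output": "search_results"},
    {"tool": "text-summarization", "input": "search_results", "output": "key_points"},
    {"tool": "article-writing", "input": "key_points", "output": "draft_summary"},
    {"tool": "grammar-checking", "input": "draft_summary", "output": "final_summary"},
]


def run_plan(plan, tools):
    """Run each step in order, storing every output under its declared name."""
    workspace = {}
    for step in plan:
        tool = tools[step["tool"]]
        # An input that names an earlier output is looked up in the workspace;
        # anything else is passed to the tool as a literal value.
        argument = workspace.get(step["input"], step["input"])
        workspace[step["output"]] = tool(argument)
    return workspace


workspace = run_plan(PLAN, TOOLS)
print(workspace["final_summary"])
```

Keeping every intermediate output in a named workspace mirrors the `output: temp1` convention in the plans above and makes each step easy to inspect when a plan goes wrong.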

#### Utility and Challenges of Planning in AI Agents

For complex tasks where a predefined set of steps cannot be specified in advance, AI agents with planning capabilities can dynamically determine the necessary steps to complete the task. This allows for more flexibility and adaptability in problem-solving.

**Planning is a powerful capability that enables AI agents to tackle complex, multi-step problems.** However, the dynamic nature of planning can lead to less predictable results compared to predefined step-by-step approaches.

Planning is also less mature than other agentic capabilities such as reflection and tool use, so it can be hard to predict exactly which actions an agent will take when it plans its way through a problem. The field is evolving rapidly, though, and agents' planning abilities are likely to become more reliable as it matures.

#### Addressing Complexity and Uncertainty in AI Planning: Multi-Plan Selection

AI agents face challenges when generating plans for complex tasks because of the inherent uncertainty in large language models (LLMs). Although LLMs have strong reasoning abilities, a single plan generated by an LLM-based agent may be suboptimal or even infeasible. A more robust approach is multi-plan selection, which consists of two main steps:

1. **Multi-Plan Generation**
   - This step involves generating multiple candidate plans to form a diverse set of potential solutions.
   - Common methods for multi-plan generation include:
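     - Exploiting the randomness of the decoding process (e.g., temperature or top-k sampling) to obtain different plans from the same prompt
     - Sampling several Chain of Thought (CoT) reasoning paths rather than a single one
     - Tree of Thoughts (ToT) approach
     - Self-consistency techniques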

2. **Optimal Plan Selection**
   - After generating a set of candidate plans, the next step is to evaluate and select the most promising plan.
   - This process involves assessing the feasibility, efficiency, and potential outcomes of each plan.
   - Techniques such as simulations, heuristic evaluations, or expert feedback can be used to rank and select the optimal plan from the candidate set.
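
The two steps can be sketched as follows, assuming the candidate plans have already been sampled from an LLM. The candidates are hard-coded here, and `score_plan`, `select_plan`, and the scoring rule (reject infeasible plans, then prefer fewer steps) are hypothetical choices made for illustration. They stand in for the heuristic evaluation mentioned above and are not taken from the survey.

```python
# Hypothetical tool inventory; "pdf-parser" is deliberately not available.
AVAILABLE_TOOLS = {
    "web-search",
    "text-summarization",
    "article-writing",
    "grammar-checking",
}


def score_plan(plan, available_tools, required_tool):
    """Heuristic evaluation: infeasible plans get -inf, shorter plans score higher."""
    uses_unknown_tool = any(tool not in available_tools for tool in plan)
    if uses_unknown_tool or required_tool not in plan:
        return float("-inf")
    return -len(plan)


def select_plan(candidates, available_tools, required_tool):
    """Return the highest-scoring candidate, or None if none is feasible."""
    scored = [(score_plan(plan, available_tools, required_tool), plan) for plan in candidates]
    best_score, best_plan = max(scored, key=lambda pair: pair[0])
    return None if best_score == float("-inf") else best_plan


# Step 1: candidate plans, as if sampled from an LLM with non-zero temperature.
candidate_plans = [
    ["web-search", "pdf-parser", "article-writing"],  # uses a missing tool
    ["web-search", "text-summarization", "article-writing", "grammar-checking"],
    ["web-search", "article-writing"],
]

# Step 2: evaluate every candidate and keep the best one.
best = select_plan(candidate_plans, AVAILABLE_TOOLS, required_tool="article-writing")
print(best)  # ['web-search', 'article-writing']
```

Preferring the shortest feasible plan is only one possible criterion; a scorer based on simulation or human feedback could be swapped in without changing `select_plan`.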

### External Memory

* [Generative Agents (August 2023)](https://arxiv.org/abs/2304.03442): stores the daily experiences of human-like agents in text form and retrieves memories based on a composite score of recency, importance, and relevance to the current situation.
* [MemoryBank (May 2023)](https://arxiv.org/abs/2305.10250), [TiM (Nov 2023)](https://arxiv.org/pdf/2311.08719.pdf), and [RecMind (March 2024)](https://arxiv.org/abs/2308.14296): encode each memory with an embedding model and store it in a vector index such as FAISS. During retrieval, a description of the current status is used as the query to retrieve memories from the memory pool. The methods differ in how memory is updated.
* [REMEMBER (October 2023)](https://arxiv.org/abs/2306.07929): stores historical memories in a Q-value table, where each record is (environment, task, action, Q-value). During retrieval, both positive and negative memories are retrieved.
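
A minimal sketch of score-based memory retrieval in the spirit of the first bullet, combining recency and relevance. Several simplifications are assumptions made for illustration: a bag-of-words `Counter` stands in for a real embedding model, the decay rate and equal weights are arbitrary, timestamps are in hours, and the importance term that the Generative Agents paper also uses is omitted. `Memory`, `embed`, `cosine`, and `retrieve` are hypothetical names.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class Memory:
    text: str
    timestamp: float  # hours since the start of the simulation (assumed unit)


def embed(text):
    """Toy stand-in for an embedding model: bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(memories, query, now, k=2, decay=0.99, w_recency=1.0, w_relevance=1.0):
    """Return the k memories with the highest recency-plus-relevance score."""
    query_vector = embed(query)

    def score(memory):
        recency = decay ** (now - memory.timestamp)  # exponential decay with age
        relevance = cosine(embed(memory.text), query_vector)
        return w_recency * recency + w_relevance * relevance

    return sorted(memories, key=score, reverse=True)[:k]


memories = [
    Memory("ate breakfast with Maria", timestamp=1.0),
    Memory("discussed coral reefs research with Tom", timestamp=2.0),
    Memory("watered the plants", timestamp=99.0),
]
top = retrieve(memories, query="coral reefs research", now=100.0)
print([m.text for m in top])
```

In this example the old but relevant memory about coral reefs ranks first and the very recent but unrelated one ranks second, which shows how the two terms trade off. A vector index such as FAISS would replace the linear scan once the memory pool grows large.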

<!-- ### Embodied Memory: fine tuning LM

* usually the experiential samples are collected from the agents' interactions with the environment
* [CALM (October 2020)](https://arxiv.org/abs/2010.02903): uses ground truth trajectories to fine tune GPT2 on next token prediction task -->

## Multi Agent Collaboration

- Multiple LLMs collaborating to achieve complex tasks.
- Example: ChatDev, where LLMs take on different roles in software development.
- Multi-agent debate leading to better performance.

### ChatDev (December 2023)

https://arxiv.org/pdf/2307.07924.pdf

![](https://i.imgur.com/EeL8s8p.png)

![](https://i.imgur.com/ZxQpBfL.png)

![](https://i.imgur.com/C42iBOU.png)