# Project Agent: working with a repository

>>> TODO writeup: write a paragraph to explain what "Explorative Repository Agent" is.

Authors: Runlong Zhou, Beibin Li


- Motivation: TODO writeup
- Warm-Start: we warm-start the agent by reading important files and the repository's architecture, and then summarize the findings in `SUMMARY.txt`.
- **NOTE**: we have not yet implemented any truncation or caching mechanism to reduce computational cost, so the OpenAI API cost can be high if you use this code base. Please proceed with caution.
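
One way the warm-start file could be used is sketched below. This is a minimal illustration, not the OptiGuide implementation: `load_or_create_summary` and the `summarize` callback are hypothetical names, and the sketch assumes the summary is reused across sessions, which the text above does not confirm. In the real agent, an LLM writes the summary while exploring the repository.

```python
from pathlib import Path
from typing import Callable


def load_or_create_summary(root: str, summarize: Callable[[str], str]) -> str:
    """Return the stored summary for `root`, creating it on first use.

    `summarize` is a hypothetical callback standing in for the LLM-driven
    exploration; it receives the repository root and returns summary text.
    """
    summary_path = Path(root) / "SUMMARY.txt"
    if summary_path.exists():
        # Warm start: reuse the stored summary instead of exploring again.
        return summary_path.read_text(encoding="utf-8")
    # First run: explore once, then store the findings for later sessions.
    summary = summarize(root)
    summary_path.write_text(summary, encoding="utf-8")
    return summary
```

With this shape, the expensive exploration runs only when `SUMMARY.txt` is missing; deleting the file forces a fresh summary.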

Timeline
- July, 2023: The original project idea and [code](https://github.com/BeibinLi/SPM/commit/d18392d5ffabdf1af621f392619bf9d0ce031437) were created during Runlong's 2023 summer internship at Microsoft Research. During the project, we used "long-term" and "short-term" external memory inside LLM agents to store and retrieve useful information.
- August, 2023: We created the [Coffee Dataset](https://github.com/BeibinLi/Coffee_Roasting_Dataset) as a synthetic benchmark for supply chain products.
- January, 2024: We re-implemented the pipeline with [AutoGen](https://github.com/microsoft/autogen) and included the research code in [OptiGuide](https://github.com/microsoft/OptiGuide). We also provide a simpler design that does not use an external memory mechanism.




### Introduction: Why do we need exploration?
TODO writeup


### Methodology: Warm-Start Summarization
TODO writeup


### Related Work
The idea in this notebook is similar to [GPT Code Interpreter](https://openai.com/blog/chatgpt-plugins), [InterCode](http://arxiv.org/abs/2306.14898), [Open Interpreter](https://github.com/KillianLucas/open-interpreter), and many other open-source projects. Here we list only a subset of important inspirations, because an exhaustive list of related projects is nearly impossible to compile.

Prior Work:
- LLM: [Recursive summarization](http://arxiv.org/abs/2109.10862)
- Prompting guidelines: [ReAct](https://arxiv.org/abs/2210.03629), [CoT](https://proceedings.neurips.cc/paper_files/paper/2022/hash/9d5609613524ecf4f15af0f7b31abca4-Abstract-Conference.html)
- LLM Tools: [AutoGen](https://github.com/microsoft/autogen), LangChain, AutoGPT, LlamaIndex

Some concurrent studies:
- Code editing: [Magicoder](http://arxiv.org/abs/2312.02120), [RepoAgent](https://github.com/LOGIC-10/RepoAgent)
- Exploration with LLMs: [Explore-Instruct](http://arxiv.org/abs/2310.09168), [search engine with copilot](http://arxiv.org/abs/2311.01235)


### Installing this experimental notebook

Install the explorer branch of OptiGuide:
```bash
pip install https://github.com/microsoft/OptiGuide/archive/explore.zip
```

In [1]:
import os

import autogen
from optiguide.experimental.explorer import ProjectAgent

config_list_gpt4 = autogen.config_list_from_json(
    "OAI_CONFIG_LIST",
    filter_dict={
        "model": ["gpt-4", "gpt-4-0314", "gpt4", "gpt-4-32k", "gpt-4-32k-0314", "gpt-4-32k-v0314"],
    },
)


config_list_gpt3 = autogen.config_list_from_json(
    "OAI_CONFIG_LIST",
    filter_dict={
        "model": ["gpt-35-turbo", "gpt-3.5-turbo"],
    },
)
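
`config_list_from_json` reads a JSON list of model configurations from the `OAI_CONFIG_LIST` environment variable or a file with that name, and `filter_dict` keeps only the entries whose fields match one of the listed values. The sketch below mimics that filtering in plain Python so the expected file shape is visible. The entries and placeholder keys are illustrative, and `filter_configs` is a hypothetical helper, not AutoGen's implementation.

```python
import json

# Illustrative contents of an OAI_CONFIG_LIST file; the keys are placeholders.
sample_config_json = """
[
    {"model": "gpt-4", "api_key": "<your OpenAI API key>"},
    {"model": "gpt-3.5-turbo", "api_key": "<your OpenAI API key>"}
]
"""


def filter_configs(configs, filter_dict):
    """Keep configs whose value for every filtered field is in the allowed list."""
    return [
        config
        for config in configs
        if all(config.get(field) in allowed for field, allowed in filter_dict.items())
    ]


configs = json.loads(sample_config_json)
gpt4_only = filter_configs(configs, {"model": ["gpt-4", "gpt-4-32k"]})
print([config["model"] for config in gpt4_only])
```

If a filter matches no entry, the resulting list is empty, so it is worth checking that the model names in the file match the names used in the filters.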

## Example: Coffee Dataset

We use the [Coffee Dataset](https://github.com/BeibinLi/Coffee_Roasting_Dataset) as the example application.


```bash
git clone https://github.com/BeibinLi/Coffee_Roasting_Dataset ~/Coffee_Roasting_Dataset
```

In [2]:
repo_loc = os.path.expanduser("~/Coffee_Roasting_Dataset/opti_coffee/")
assert os.path.exists(repo_loc)

In [3]:
agent = ProjectAgent(
    name="Project-Agent",
    root=repo_loc,
    max_consecutive_auto_reply=30,
    llm_config={"config_list": config_list_gpt4, "seed": 42},
    description="",
)

user = autogen.UserProxyAgent(
    name="User_proxy",
    system_message="A human admin.",
    human_input_mode="NEVER",  # Try either "ALWAYS" or "NEVER"
    max_consecutive_auto_reply=0,
)


Data copied to temporary directory: /home/beibinli/OptiGuide/notebook/tmpak2w_cwy
Repository Environment Agent (to Project-Agent):

Read files, understand the repo, and summarize the knowledge in 500 words.

--------------------------------------------------------------------------------
Project-Agent (to Repository Environment Agent):

REASON: I need to explore the content of the files within the repository to summarize the knowledge contained in it.
ACTION:
```sh
ls
```

--------------------------------------------------------------------------------
>>>>>>>> EXECUTING CODE BLOCK 0 (inferred language is sh)...
Repository Environment Agent (to Project-Agent):

exitcode: 0 (execution succeeded)
Code output: 
database
internal
public
solver
visualization


--------------------------------------------------------------------------------
Project-Agent (to Repository Environment Agent):

REASON: I need to explore the contents of each t


--------------------------------------------------------------------------------
Project-Agent (to Repository Environment Agent):

REASON: Now I have an understanding of the structure and contents of the `database` directory, which seems to include data about the company's cafes, employees, and sales, among other aspects. I need to explore the other top-level directories to form a comprehensive summary. Next, I will explore the `internal` directory to understand the kind of information it contains about the company's internal affairs.

ACTION:
```sh
ls internal
```

--------------------------------------------------------------------------------
>>>>>>>> EXECUTING CODE BLOCK 0 (inferred language is sh)...
Repository Environment Agent (to Project-Agent):

exitcode: 0 (execution succeeded)
Code output: 
Employee Benefits.md
Financial Plan.md
Manual%3A How to Negotiate with Coffee Bean Suppliers and Cafe Shops.md
Marketing Strategies.md
Quality Control Manuals


--------------------------------------------------------------------------------
Project-Agent (to Repository Environment Agent):

REASON: I've explored the `database` and `internal` directories, gaining insights into employee data and internal documentation on employee benefits. I will continue to explore the remaining directories such as `public`, `solver`, and `visualization` by listing their contents to understand the breadth of data and documents available.

ACTION:
```sh
ls public
```

--------------------------------------------------------------------------------
>>>>>>>> EXECUTING CODE BLOCK 0 (inferred language is sh)...
Repository Environment Agent (to Project-Agent):

exitcode: 0 (execution succeeded)
Code output: 
contact_info.md
newsletter
our_company.md
service_and_price.md


--------------------------------------------------------------------------------
Project-Agent (to Repository Environment Agent):

REASON: I will now explore th


--------------------------------------------------------------------------------
Project-Agent (to Repository Environment Agent):

REASON: I have explored the `database`, `internal`, `public`, and `solver` directories and read some of their key files. The last top-level directory to explore is `visualization`. I'll need to gather information on the types of visualization scripts included.

ACTION:
```sh
ls visualization
```

--------------------------------------------------------------------------------
>>>>>>>> EXECUTING CODE BLOCK 0 (inferred language is sh)...
Repository Environment Agent (to Project-Agent):

exitcode: 0 (execution succeeded)
Code output: 
bean_price.py
cpp
cs
java
membership_duration.py
salary.py
supplier_on_map.py
supplier_price.py
test


--------------------------------------------------------------------------------
Project-Agent (to Repository Environment Agent):

REASON: To finish the exploration of the repo and gain some

>>>>>>>> EXECUTING CODE BLOCK 0 (inferred language is sh)...
Repository Environment Agent (to Project-Agent):

exitcode: 0 (execution succeeded)
Code output: 
Cafe.h
CoffeeDistributionOptimizer.h
DistributionNetwork.h
main.cpp
Roastery.h
Supplier.h


--------------------------------------------------------------------------------
Project-Agent (to Repository Environment Agent):

REASON: Following the directory exploration, the main programming languages used in this project are Python, C#, and C++. Now, I will check the `visualization` directory to confirm the languages used for visualization scripts.

ACTION:
```sh
ls visualization
```

--------------------------------------------------------------------------------
>>>>>>>> EXECUTING CODE BLOCK 0 (inferred language is sh)...
Repository Environment Agent (to Project-Agent):

exitcode: 0 (execution succeeded)
Code output: 
bean_price.py
cpp
cs
java
membership_duration.py
salary.py
supplier
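
Each agent turn in the log above follows a `REASON:` / `ACTION:` pattern with one fenced command. A minimal sketch of how such a reply could be split into its parts is shown below. `parse_reply` and `AgentStep` are hypothetical names for illustration and are not part of OptiGuide or AutoGen.

```python
import re
from typing import NamedTuple, Optional

FENCE = "`" * 3  # built indirectly so this example can live inside a fenced block

REPLY_PATTERN = re.compile(
    r"REASON:\s*(?P<reason>.*?)\s*ACTION:\s*"
    + FENCE + r"(?P<lang>\w+)\n(?P<code>.*?)" + FENCE,
    re.DOTALL,
)


class AgentStep(NamedTuple):
    reason: str
    language: str
    code: str


def parse_reply(reply: str) -> Optional[AgentStep]:
    """Split a REASON/ACTION reply into its reason, language, and command."""
    match = REPLY_PATTERN.search(reply)
    if match is None:
        return None  # for example, a truncated reply with no ACTION block
    return AgentStep(match["reason"], match["lang"], match["code"].strip())


reply = f"REASON: I need to list the files.\nACTION:\n{FENCE}sh\nls\n{FENCE}"
print(parse_reply(reply))
```

Returning `None` for replies without an `ACTION` block lets the caller decide whether to ask the agent to retry or to stop the conversation.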

In [5]:
# user.initiate_chat(agent, message="Plot the average of employee salary in a global map. Use the data and documents inside the repository.")
user.initiate_chat(agent, message="Read the database, and show me the average of employee salary per country.")

User_proxy (to Project-Agent):

Read the database, and show me the average of employee salary per country.

--------------------------------------------------------------------------------
Repository Environment Agent (to Project-Agent):

Read the database, and show me the average of employee salary per country.

--------------------------------------------------------------------------------
Project-Agent (to Repository Environment Agent):

REASON: To present the average employee salary per country from the database, I'll execute a Python script to process the 'employee.csv' file.

ACTION:
```python
import pandas as pd

# Read CSV data into a DataFrame
data = pd.read_csv('database/employee.csv')

# Group by 'country' and calculate mean of 'salary'
avg_salary_by_country = data.groupby('country')['salary'].mean().reset_index()

# Rename the columns for clarity
avg_salary_by_country.columns = ['Country', 'Average Salary']

# Print the average salary by country
print(avg_salary_by_country)
```

## Future Work
- Recover the privacy and security features using Docker.
- Memory management and efficient prompting (store and retrieve information).
- Better serialization to store agents' states.
- Allow the agent to create pull requests automatically.
