# Creating a Pandas Agent

LangChain currently has no support for creating new variables with the Pandas agent.

I think we can build a simpler agent that writes code and hands back the variables it created.

OpenAI models (maximum context length, in tokens):

1. "gpt-4": 8192,
1. "gpt-4-0314": 8192,
1. "gpt-4-32k": 32768,
1. "gpt-4-32k-0314": 32768,
1. "gpt-3.5-turbo": 4096,
1. "gpt-3.5-turbo-0301": 4096,
1. "text-ada-001": 2049,
1. "ada": 2049,
1. "text-babbage-001": 2040,
1. "babbage": 2049,
1. "text-curie-001": 2049,
1. "curie": 2049,
1. "davinci": 2049,
1. "text-davinci-003": 4097,
1. "text-davinci-002": 4097,
1. "code-davinci-002": 8001,
1. "code-davinci-001": 8001,
1. "code-cushman-002": 2048,
1. "code-cushman-001": 2048,
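
A minimal sketch of how these limits could be used to budget a completion. The dictionary copies a few entries from the list above; `remaining_tokens` is a hypothetical helper, not part of any library.

```python
# Context-window sizes (in tokens), copied from a few entries in the list above.
MODEL_CONTEXT_TOKENS = {
    "gpt-4": 8192,
    "gpt-4-32k": 32768,
    "gpt-3.5-turbo": 4096,
    "text-davinci-003": 4097,
}


def remaining_tokens(model: str, prompt_tokens: int) -> int:
    """Return how many tokens are left for the completion (hypothetical helper)."""
    limit = MODEL_CONTEXT_TOKENS[model]
    return max(limit - prompt_tokens, 0)


budget = remaining_tokens("gpt-3.5-turbo", 48)
print(budget)
```

The prompt token count can be taken from the `usage.prompt_tokens` field of a previous response, or estimated with a tokenizer.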

Idea:
1. Root agent: receives the task from the user, creates a plan, and sends the plan to a second agent.
    1. On receiving the leaf's response, it decides whether the response is good enough.
        1. Good enough: reply to the user and finish.
        1. Not good enough: send the problems back to the leaf.
1. Leaf agent: receives the plan from the root.
    1. Can I write code that carries out this plan?
        1. Yes -> write the code and return it to the root.
        1. No -> create other leaf agents and become a stem agent.
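
The root/leaf exchange above could be sketched roughly as the loop below. Everything here is an assumption for illustration: `run_root`, the three callables, and the `max_rounds` cap are hypothetical names, and the `fake_*` functions stand in for real LLM calls.

```python
from typing import Callable


def run_root(task: str,
             plan_fn: Callable[[str], str],
             leaf_fn: Callable[[str, str], str],
             review_fn: Callable[[str, str], str],
             max_rounds: int = 3) -> str:
    """Root loop: plan, delegate to the leaf, review, and retry with feedback."""
    plan = plan_fn(task)
    feedback = ""
    answer = ""
    for _ in range(max_rounds):
        answer = leaf_fn(plan, feedback)
        feedback = review_fn(task, answer)
        if not feedback:  # empty feedback means the root accepts the answer
            return answer
    return answer  # cap reached: return the last attempt


# Stand-ins for LLM calls (hypothetical, for demonstration only).
def fake_plan(task: str) -> str:
    return f"1. {task}"


def fake_leaf(plan: str, feedback: str) -> str:
    return "df[df.id == 4]" if feedback else "df"


def fake_review(task: str, answer: str) -> str:
    return "" if "id == 4" in answer else "filter is missing"


result = run_root("Filter rows where id == 4", fake_plan, fake_leaf, fake_review)
print(result)
```

The cap on rounds is a design choice to keep a root that is never satisfied from looping forever.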

Questions:
1. What types of responses do the leaves return?
1. What type of response does the user receive?
1. What are the steps for executing the task? Something like Plan-Execute-Debug-Answer?


# Conversation with GPT-4

Ivan:

Hello chat. I want to create python code that creates "agents" which will receive a pandas related task and will solve it. This would use LLMs and code execution from inside python.

My current idea is this: Have LLMs in a tree graph structure, and each node can either write code to solve a task it was asked to, or generate new nodes to solve steps which can solve the task it was asked to solve.

The first node is the root. It receives a user prompt such as "Load df.parquet and give me only the columns for which id==4". Somehow I have to prompt an LLM to achieve this:

- Either it creates code to solve the task, and prompts another LLM instance (another node) to check whether the task is done or if it contains a mistake
- Or it realizes the task is too complex to solve in just one prompt, and creates a list of steps. It then creates as many LLM instances as it needs to do each step and return to it the result. It then checks if the result return answers the user prompt. If so, it returns the answer to the user. Otherwise, it keeps correcting mistakes or updating its plans.

Right now I'm just thinking of how to go about creating this agent graph structure. It's still not crystal clear to me how exactly I should prompt agents and what to ask of them in each step of this process. What do you think?

---
GPT-4:

Your concept sounds fascinating! You're proposing a system that uses large language models (LLMs) to automate code generation and error checking for pandas-related tasks. Here's a sketch of a structure that could be used to implement this:

**Data structures:**

1. **Agent**: Represents a task-solving entity. It can be represented by an instance of an LLM or a function that generates LLM prompts. Each agent should have a method to:
    - Receive tasks
    - Generate a solution or break down the problem
    - Verify if a task is done or contains mistakes
    - Return the solution or the results of its tasks

2. **Task**: Represents a pandas-related task. This could be a class with properties such as:
    - The task prompt
    - The current state of the solution
    - The complexity level of the task

3. **Tree**: A structure to represent your hierarchy of agents. Each node in the tree will correspond to an agent, and the edges will represent the relationship between agents, i.e., which agent generated which other agents.

**Logic flow:**

1. You give the root node agent the task.

2. The root node agent, based on the task's complexity, either:
    - Generates the solution directly, or
    - Breaks down the task into subtasks. It then creates child nodes (new agents) and assigns each one of them a subtask. 

3. Each child node agent then follows the same process: it either solves the subtask or breaks it down further. 

4. Once a node comes up with a solution, it passes it to another LLM (could be a child node or a peer node) for error checking. 

5. Once all subtasks are done, the root node collects all the sub-solutions, compiles them into a single solution, and passes it back to the user.

6. If the solution does not fulfill the user's requirements, the root node updates its plans and repeats the process.
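
A rough sketch of steps 2, 3, and 5 of this flow as a recursive function. The names `solve`, `split_fn`, `code_fn`, and `max_depth` are assumptions made for illustration, the `fake_*` functions stand in for LLM calls, and the error-checking of step 4 is left out.

```python
from typing import Callable, List


def solve(task: str,
          split_fn: Callable[[str], List[str]],
          code_fn: Callable[[str], str],
          depth: int = 0,
          max_depth: int = 3) -> str:
    """Solve a task directly, or split it and join the sub-solutions."""
    # Stop splitting once max_depth is reached, so the tree cannot grow forever.
    subtasks = split_fn(task) if depth < max_depth else []
    if not subtasks:
        return code_fn(task)  # simple enough: write code directly
    parts = [solve(t, split_fn, code_fn, depth + 1, max_depth) for t in subtasks]
    return "\n".join(parts)


# Stand-ins for LLM calls (hypothetical, for demonstration only).
def fake_split(task: str) -> List[str]:
    parts = task.split(" then ")
    return parts if len(parts) > 1 else []


def fake_code(task: str) -> str:
    return f"# code for: {task}"


solution = solve("load df.parquet then filter id == 4", fake_split, fake_code)
print(solution)
```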

**How to prompt LLMs:**

You can use a question-and-answer format to prompt the LLMs. Here's a rough idea:

- **Task-solving prompt**: "Given the task '{task.prompt}', write Python code using pandas that can solve this task."
- **Error-checking prompt**: "Given this Python code '{code}', does it correctly solve the task '{task.prompt}'? If not, what are the errors and how can they be corrected?"

You would need to wrap these prompts with code that feeds them into the LLM and handles the output.
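
As one possible way to wrap the error-checking prompt, the sketch below builds the chat messages without calling any API. `create_check_prompt` and `build_messages` are hypothetical helper names; the message layout follows the `role`/`content` convention of chat models.

```python
from typing import Dict, List


def create_check_prompt(task: str, code: str) -> str:
    """Build the error-checking prompt for a piece of generated code."""
    return (
        f"Given this Python code:\n{code}\n"
        f"Does it correctly solve the task '{task}'? "
        "If not, what are the errors and how can they be corrected?"
    )


def build_messages(system: str, user: str) -> List[Dict[str, str]]:
    """Return the message list expected by a chat-style completion call."""
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]


messages = build_messages(
    "You are a Senior Data Analyst and expert in Pandas.",
    create_check_prompt("Show the first rows of df", "df.head()"),
)
print(messages[1]["content"])
```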

Remember that this is a conceptual sketch and a complex project. Depending on the specifics of your task, your environment, and your resources, you might need to adjust it. Also note that this approach can be CPU and memory intensive, especially if the tasks are complex or the tree becomes large. The LLMs used (like GPT-3 or GPT-4) can also generate incorrect or suboptimal solutions, and might require monitoring and fine-tuning.

Furthermore, it's important to remember that the pandas library must be properly installed and working in the environment in which this code will be running. Code generation should also follow good security practices, as running dynamically generated code can present risks.

Let me know if you'd like more help with any specific part of this project!

---
Ivan:

Very nice thoughts! Thanks!

So the task object can contain tasks inside itself as well. This can be a two-way link between the parent and child tasks. It can also contain the python/pandas objects related to it, as well as code that was used to generate each object. Actually, one "code" object can generate many python/pandas objects, so I can also create a "code" object, which a task can contain.

Each agent can contain a history of prompts that were given to it, as well as its answers. It can also contain tasks and code that it generated.
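
A possible shape for these objects, sketched with dataclasses. The class and field names (`Code`, `Task`, `Agent`, `add_child`, `record`) are assumptions made for illustration; nothing here is settled design.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Code:
    source: str
    # Python/pandas objects this code produced, keyed by variable name.
    objects: Dict[str, Any] = field(default_factory=dict)


@dataclass(eq=False)  # compare by identity: parent/child links form a cycle
class Task:
    prompt: str
    parent: Optional["Task"] = field(default=None, repr=False)
    children: List["Task"] = field(default_factory=list)
    code: List[Code] = field(default_factory=list)

    def add_child(self, prompt: str) -> "Task":
        """Create a subtask and link it both ways."""
        child = Task(prompt, parent=self)
        self.children.append(child)
        return child


@dataclass
class Agent:
    history: List[Dict[str, str]] = field(default_factory=list)
    tasks: List[Task] = field(default_factory=list)

    def record(self, prompt: str, answer: str) -> None:
        """Store one prompt/answer exchange."""
        self.history.append({"prompt": prompt, "answer": answer})


root = Task("Load df.parquet and keep rows where id == 4")
child = root.add_child("Load df.parquet")
child.code.append(Code("df = pd.read_parquet('df.parquet')"))
agent = Agent(tasks=[root])
agent.record(child.prompt, child.code[0].source)
```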

In [1]:
import os
import openai
os.environ["OPENAI_API_KEY"] = "<YOUR_OPENAI_API_KEY>"  # placeholder; never commit a real key
# openai.organization = "<YOUR_ORG_ID>"
openai.api_key = os.getenv("OPENAI_API_KEY")

# Testing

In [23]:
response = openai.Completion.create(
    model='text-ada-001',
    prompt="Hello!"
)

print(response.choices[0].text)



Hello!

Hello!

Hello!

Hello!


In [25]:
response = openai.ChatCompletion.create(
    model='gpt-3.5-turbo',
    messages=[
        {"role": "system", "content": "You are a Senior Data Analyst and expert in Pandas. You are always concise and precise."},
        {"role": "user", "content": "Hello!"},
    ]
)

print(response.choices[0].message.content)

Hi! How can I assist you today?


# Ask for an Action Plan

## def create_plan_prompt(task):

In [36]:
def create_plan_prompt(task):
    return f"Create a numbered list with the precise steps to fulfil the task below. Do not write anything else.\nTask: {task}"

In [37]:
response = openai.ChatCompletion.create(
    model='gpt-3.5-turbo',
    messages=[
        {"role": "system", "content": "You are a Senior Data Analyst and expert in Pandas."},
        {"role": "user", "content": create_plan_prompt("Create a dataframe with two random columns and two random rows")},
    ]
)

print(response.choices[0].message.content)

1. Import pandas library
2. Create a dictionary with two lists as values, each list containing two random values
3. Convert the dictionary to a dataframe using the pandas DataFrame() function
4. Print the resulting dataframe to verify the two random columns and two random rows


In [40]:
tasks = response.choices[0].message.content
tasks.split('\n')

['1. Import pandas library',
 '2. Create a dictionary with two lists as values, each list containing two random values',
 '3. Convert the dictionary to a dataframe using the pandas DataFrame() function',
 '4. Print the resulting dataframe to verify the two random columns and two random rows']
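
Splitting on newlines keeps the numbering and would also keep any blank or stray lines the model adds. A small sketch of a stricter parser follows; `parse_steps` is a hypothetical helper, and it assumes the model numbers its steps as `1.` or `1)`.

```python
import re
from typing import List


def parse_steps(plan_text: str) -> List[str]:
    """Extract the step descriptions from a numbered list, dropping other lines."""
    steps = []
    for line in plan_text.splitlines():
        # Match "1. text" or "1) text" and capture only the text.
        match = re.match(r"\s*\d+[.)]\s+(.*\S)", line)
        if match:
            steps.append(match.group(1))
    return steps


plan = "1. Import pandas library\n2. Create a dictionary\n\nLet me know if you need more."
steps = parse_steps(plan)
print(steps)
```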

## def prompt_openai(task,prompt_design_fn = None):
## def create_action_prompt(task):

In [41]:
def prompt_openai(task,prompt_design_fn = None):
    task_msg = prompt_design_fn(task) if prompt_design_fn else task

    response = openai.ChatCompletion.create(
        model='gpt-3.5-turbo',
        messages=[
            {"role": "system", "content": "You are a Senior Data Analyst and expert in Pandas."},
            {"role": "user", "content": task_msg},
        ]
    )
    
    return response

def create_action_prompt(task):
    return f"Write Python code to fulfil the task below. Do not write anything else.\nTask: {task}"

In [44]:
tasks = tasks.split('\n')
tasks[0]

'1. Import pandas library'

In [47]:
tasks

['1. Import pandas library',
 '2. Create a dictionary with two lists as values, each list containing two random values',
 '3. Convert the dictionary to a dataframe using the pandas DataFrame() function',
 '4. Print the resulting dataframe to verify the two random columns and two random rows']

In [51]:
code = []

code.append(prompt_openai(tasks[0], create_action_prompt))

In [52]:
code[0]

<OpenAIObject chat.completion id=chatcmpl-7Lkj7Af4c8KUSeKLPONPRdPSiU19j at 0x7f3ef97d5da0> JSON: {
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "```python\nimport pandas as pd\n```",
        "role": "assistant"
      }
    }
  ],
  "created": 1685418861,
  "id": "chatcmpl-7Lkj7Af4c8KUSeKLPONPRdPSiU19j",
  "model": "gpt-3.5-turbo-0301",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 9,
    "prompt_tokens": 48,
    "total_tokens": 57
  }
}

In [53]:
code.append(prompt_openai(tasks[1], create_action_prompt))

In [54]:
code[1]

<OpenAIObject chat.completion id=chatcmpl-7LkjFSwjQgVik6njynmRCkDlwsICj at 0x7f3ef97d5df0> JSON: {
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "```python\nimport random\ndict_ = {}\n\nlist1 = [random.randint(1, 10) for i in range(2)]\nlist2 = [random.randint(1, 10) for i in range(2)]\n\ndict_['list1'] = list1\ndict_['list2'] = list2\n\nprint(dict_)\n```",
        "role": "assistant"
      }
    }
  ],
  "created": 1685418869,
  "id": "chatcmpl-7LkjFSwjQgVik6njynmRCkDlwsICj",
  "model": "gpt-3.5-turbo-0301",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 72,
    "prompt_tokens": 60,
    "total_tokens": 132
  }
}

In [55]:
code.append(prompt_openai(tasks[2], create_action_prompt))

In [56]:
code[2]

<OpenAIObject chat.completion id=chatcmpl-7Lkjp8AjfYRO4bhmNFzmYeve0gmSc at 0x7f3ef974b380> JSON: {
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "```python\nimport pandas as pd\n\n# assuming the dictionary is stored in the variable 'my_dict'\ndf = pd.DataFrame(my_dict)\n```",
        "role": "assistant"
      }
    }
  ],
  "created": 1685418905,
  "id": "chatcmpl-7Lkjp8AjfYRO4bhmNFzmYeve0gmSc",
  "model": "gpt-3.5-turbo-0301",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 29,
    "prompt_tokens": 57,
    "total_tokens": 86
  }
}

In [57]:
code.append(prompt_openai(tasks[3], create_action_prompt))

In [58]:
code[3]

<OpenAIObject chat.completion id=chatcmpl-7LkkDbqy94WVsW0A1kOALoWmmk3Pu at 0x7f3ef97676f0> JSON: {
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "I am assuming that you have already created a Pandas DataFrame and want to print two random columns and two random rows of that DataFrame. Here is the code to achieve that:\n\n```python\nimport pandas as pd\nimport numpy as np\n\n# create a sample dataframe\ndf = pd.DataFrame(np.random.randint(0,100,size=(10, 5)), columns=list('ABCDE'))\n\n# select two random columns and two random rows\ncols = np.random.choice(df.columns, 2, replace=False)\nrows = np.random.choice(df.index, 2, replace=False)\n\n# print the resulting dataframe\nprint(df.loc[rows,cols])\n```\n\nThis will print a DataFrame containing only the two random columns and two random rows you selected.",
        "role": "assistant"
      }
    }
  ],
  "created": 1685418929,
  "id": "chatcmpl-7LkkDbqy94WVsW0A1kOALoWmmk3Pu",


In [None]:
# code = []
# for task in tasks:
#     code.append(prompt_openai(task, create_action_prompt))

In [65]:
for i, act in enumerate(code):
    print('-'*50)
    print(f'i: {i}')
    print(act.choices[0].message.content.split('```')[-2].replace('python\n', ''))

--------------------------------------------------
i: 0
import pandas as pd

--------------------------------------------------
i: 1
import random
dict_ = {}

list1 = [random.randint(1, 10) for i in range(2)]
list2 = [random.randint(1, 10) for i in range(2)]

dict_['list1'] = list1
dict_['list2'] = list2

print(dict_)

--------------------------------------------------
i: 2
import pandas as pd

# assuming the dictionary is stored in the variable 'my_dict'
df = pd.DataFrame(my_dict)

--------------------------------------------------
i: 3
import pandas as pd
import numpy as np

# create a sample dataframe
df = pd.DataFrame(np.random.randint(0,100,size=(10, 5)), columns=list('ABCDE'))

# select two random columns and two random rows
cols = np.random.choice(df.columns, 2, replace=False)
rows = np.random.choice(df.index, 2, replace=False)

# print the resulting dataframe
print(df.loc[rows,cols])
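
The `split` plus `replace` approach above raises an `IndexError` when a reply has no fenced block, and `replace('python\n', '')` would also remove that text from inside the code itself. A regex-based alternative is sketched below; `extract_code` is a hypothetical helper, and it assumes the fence is tagged `python`, `py`, or left untagged.

```python
import re

FENCE = "`" * 3  # three backticks, built this way so the example stays readable


def extract_code(reply: str) -> str:
    """Return the contents of all fenced blocks, or the raw reply if there are none."""
    pattern = re.compile(FENCE + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE, re.DOTALL)
    blocks = pattern.findall(reply)
    if not blocks:
        return reply.strip()
    return "\n".join(block.strip() for block in blocks)


reply = "Here is the code:\n" + FENCE + "python\nimport pandas as pd\n" + FENCE + "\nDone."
extracted = extract_code(reply)
print(extracted)
```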



In [66]:
act = code[1]
act = act.choices[0].message.content.split('```')[-2].replace('python\n', '')
act

"import random\ndict_ = {}\n\nlist1 = [random.randint(1, 10) for i in range(2)]\nlist2 = [random.randint(1, 10) for i in range(2)]\n\ndict_['list1'] = list1\ndict_['list2'] = list2\n\nprint(dict_)\n"

## def exec_code(code):

In [19]:
code = '''
x = 10
y = 20
a = 2.7
z = x + y + a
'''
exec(code)
print(x)

10


In [22]:
del x

In [7]:
# Exec no globals
def exec_code(code):
    variables = {}
    exec(code, {}, variables)
    return variables

code = '''
x = 10
y = 20
a = 2.7
z = x + y + a
'''
a = 5.5
result = exec_code(code)
print(result)
print(f'a = {a}')

{'x': 10, 'y': 20, 'a': 2.7, 'z': 32.7}
a = 5.5


In [3]:
act = "import random\ndict_ = {}\n\nlist1 = [random.randint(1, 10) for i in range(2)]\nlist2 = [random.randint(1, 10) for i in range(2)]\n\ndict_['list1'] = list1\ndict_['list2'] = list2\n\nprint(dict_)\n"

In [5]:
print(act)

import random
dict_ = {}

list1 = [random.randint(1, 10) for i in range(2)]
list2 = [random.randint(1, 10) for i in range(2)]

dict_['list1'] = list1
dict_['list2'] = list2

print(dict_)



In [27]:
def exec_code(code):
    local_namespace = {}
    exec(code, globals(), local_namespace)
    return local_namespace

exec_code(act)

{'list1': [7, 4], 'list2': [8, 1]}


{'random': <module 'random' from '/usr/lib/python3.10/random.py'>,
 'dict_': {'list1': [7, 4], 'list2': [8, 1]},
 'list1': [7, 4],
 'list2': [8, 1]}
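
To chain steps so that later code can see earlier variables, one option is to run every step in the same dictionary and report only what each step added. This is a sketch under assumptions: `exec_step` is a hypothetical helper, modules are filtered out of the result by choice, and `exec` still runs arbitrary code, so it belongs in a sandbox when the code comes from an LLM.

```python
import types


def exec_step(code: str, namespace: dict) -> dict:
    """Run code in `namespace` and return the names it created or rebound."""
    before = dict(namespace)
    # Passing a single dict makes it act as both globals and locals,
    # so comprehensions and functions in the code can see imported names.
    exec(code, namespace)
    return {
        name: value
        for name, value in namespace.items()
        if name != "__builtins__"
        and not isinstance(value, types.ModuleType)
        and (name not in before or before[name] is not value)
    }


namespace = {}
step1 = "import random\nvalues = [random.randint(1, 10) for _ in range(2)]"
step2 = "total = sum(values)"
created1 = exec_step(step1, namespace)
created2 = exec_step(step2, namespace)
print(list(created1), list(created2))
```

Keeping one namespace per task would let each node hand its variables to its parent without touching the notebook's own globals.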

# Testing Llama

In [1]:
from llama_cpp import Llama

In [4]:
import psutil

def get_available_ram():
    memory = psutil.virtual_memory()
    available_ram = memory.available / (1024 ** 3)  # Convert bytes to gigabytes
    return available_ram

# Example usage
available_ram = get_available_ram()
print(f"Available RAM: {available_ram:.2f} GB")

Available RAM: 8.16 GB


In [3]:
llm = Llama(model_path="/media/ivanmasf/ext-main/llama-python/models/wizard-mega-13B.ggmlv3.q8_0.bin", verbose=True,n_gpu_layers=30,n_threads=8)

llama.cpp: loading model from /media/ivanmasf/ext-main/llama-python/models/wizard-mega-13B.ggmlv3.q8_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 7 (mostly Q8_0)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =    0.09 MB
llama_model_load_internal: mem required  = 5594.59 MB (+ 1608.00 MB per state)
llama_model_load_internal: [cublas] offloading 30 layers to GPU
llama_model_load_internal: [cublas] total VRAM used: 9643 MB
...........................................................................
llama

In [7]:
output = llm("Group a pandas dataframe on column \'price\' ", max_tokens=200, stop=["Q:"])
output

Llama.generate: prefix-match hit

llama_print_timings:        load time =  9515.84 ms
llama_print_timings:      sample time =    28.76 ms /    63 runs   (    0.46 ms per token)
llama_print_timings: prompt eval time =   602.43 ms /    10 tokens (   60.24 ms per token)
llama_print_timings:        eval time = 10620.24 ms /    62 runs   (  171.29 ms per token)
llama_print_timings:       total time = 11380.50 ms


{'id': 'cmpl-3f95c8db-2301-4ed1-9655-58b69585ab18',
 'object': 'text_completion',
 'created': 1685836422,
 'model': '/media/ivanmasf/ext-main/llama-python/models/wizard-mega-13B.ggmlv3.q8_0.bin',
 'choices': [{'text': "\ndf = df.groupby('price')['date'].mean()\n\nGroup a numpy array on column 'price'\narr = arr.reshape(-1,)\narr = arr.groupby(np.arange(len(arr))//5).mean()\n```</s>",
   'index': 0,
   'logprobs': None,
   'finish_reason': 'stop'}],
 'usage': {'prompt_tokens': 11, 'completion_tokens': 62, 'total_tokens': 73}}

In [8]:
print(output['choices'][0]['text'])


df = df.groupby('price')['date'].mean()

Group a numpy array on column 'price'
arr = arr.reshape(-1,)
arr = arr.groupby(np.arange(len(arr))//5).mean()
```</s>
