<div class="anchor" id="top" style="
    margin-right: auto; 
    margin-left: auto;
    padding: 15px;
   font-size : 100%;
    background-color: #FEF2EF;
    border-radius: 10px;
    font-color :  #581845  ;        
    border: 1.5px solid #FF5733 ;"> 

## Building a data analysis agent using pandas is simple because:
1. Most LLMs have seen the pandas library's methods and functions during pretraining, so we don't need to explain those functions to the LLM.
2. These models can also write fairly complex pandas commands.
3. But this comes with a risk: the LLM can be tricked into exposing sensitive data if it has access to it.
4. So we should be very careful about the level of access we give to the LLM.
5. One way to restrict access is to let it execute only a small set of vetted tools (functions), which limits its search space.
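
The last point can be sketched as a small allowlist of tools. This is only an illustration under assumptions: `ALLOWED_TOOLS`, `register_tool`, `dispatch` and the two example tools are hypothetical names that do not appear in this notebook, and a plain list of dicts stands in for the dataframe.

```python
from typing import Any, Callable, Dict

# Hypothetical registry: only functions listed here can be called for the LLM.
ALLOWED_TOOLS: Dict[str, Callable[..., Any]] = {}


def register_tool(func: Callable[..., Any]) -> Callable[..., Any]:
    """Add a vetted function to the allowlist and return it unchanged."""
    ALLOWED_TOOLS[func.__name__] = func
    return func


@register_tool
def count_rows(rows: list) -> int:
    """Return the number of records."""
    return len(rows)


@register_tool
def column_names(rows: list) -> list:
    """Return the column names taken from the first record."""
    return list(rows[0].keys()) if rows else []


def dispatch(tool_call: dict, rows: list) -> Any:
    """Run a tool call only if its name is on the allowlist."""
    name = tool_call["name"]
    if name not in ALLOWED_TOOLS:
        raise PermissionError(f"Tool {name!r} is not allowed")
    return ALLOWED_TOOLS[name](rows, **tool_call.get("arguments", {}))


# Stand-in data (two rows borrowed from the dataset preview).
sample_rows = [{"score": 1, "content": "Acha hai"}, {"score": 5, "content": "Sbuioss"}]
print(dispatch({"name": "count_rows", "arguments": {}}, sample_rows))
```

Compared with a single function that evaluates arbitrary code, this design trades flexibility for control: the LLM can only pick from functions that were reviewed in advance.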

<div class="anchor" id="top" style="
    margin-right: auto; 
    margin-left: auto;
    padding: 15px;
   font-size : 100%;
    background-color: #FEF2EF;
    border-radius: 10px;
    font-color :  #581845  ;        
    border: 1.5px solid #FF5733 ;"> 

Here, we'll need:
1. An LLM (Mistral)
2. A dataset (to be analysed)
3. A simple function for executing the Python code

[Extras]:
a. To store the conversations with the agent, you can try implementing this: [Conversational Agent using Mistral from Scratch](https://www.kaggle.com/code/ashishkumarak/conversational-agent-using-mistral-from-scratch)

<div class="anchor" id="top" style="
    margin-right: auto; 
    margin-left: auto;
    padding: 15px;
   font-size : 100%;
    background-color: #FEF2EF;
    border-radius: 10px;
    font-color :  #581845  ;        
    border: 1.5px solid #FF5733 ;"> 

## FULL PIPELINE
1. The user passes a query to the LLM.
2. The LLM generates the code.
3. The code is executed in the Python interpreter; its output hopefully contains the answer to the user's query.
4. The query, the code, and its output are passed to the model again, which generates an answer in simple language that is shown to the user.
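
The four steps above can be sketched with stand-in functions. This is a minimal outline, not the notebook's implementation: `run_pipeline` and the `fake_*` helpers are hypothetical, and the "LLM" here is a stub that always returns the same code.

```python
from typing import Callable


def run_pipeline(
    query: str,
    generate_code: Callable[[str], str],
    execute: Callable[[str], object],
    summarize: Callable[[str, str, object], str],
) -> str:
    """Chain the steps: query -> code -> execution result -> final answer."""
    code = generate_code(query)             # step 2: the LLM writes the code
    result = execute(code)                  # step 3: the code is executed
    return summarize(query, code, result)   # step 4: the LLM phrases the answer


# Stand-ins for the model and the data, for illustration only.
data = [3, 1, 2]


def fake_generate_code(query: str) -> str:
    return "len(data)"


def fake_execute(code: str) -> object:
    # Evaluate with an empty builtins table and only the names we pass in.
    return eval(code, {"__builtins__": {}}, {"data": data, "len": len})


def fake_summarize(query: str, code: str, result: object) -> str:
    return f"The answer to '{query}' is {result}."


answer = run_pipeline(
    "How many items are there?", fake_generate_code, fake_execute, fake_summarize
)
print(answer)
```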

To learn more about function calling, try this notebook: [AI agents : Function Calling in LLMs [Mistral 7B]](https://www.kaggle.com/code/ashishkumarak/conversational-agent-using-mistral-from-scratch)

<div class="anchor" id="top" style="
    margin-right: auto; 
    margin-left: auto;
    padding: 15px;
   font-size : 100%;
    background-color: #FEF2EF;
    border-radius: 10px;
    font-color :  #581845  ;        
    border: 1.5px solid #FF5733 ;"> 
    
## Getting the HuggingFace token for fetching the gated model

In [1]:
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
secret_value_0 = user_secrets.get_secret("HUGGINGFACE_TOKEN")

<div class="anchor" id="top" style="
    margin-right: auto; 
    margin-left: auto;
    padding: 15px;
   font-size : 100%;
    background-color: #FEF2EF;
    border-radius: 10px;
    font-color :  #581845  ;        
    border: 1.5px solid #FF5733 ;"> 

## Downloading Mistral and tokenizer
Pass the token through the `token` argument of both `AutoModelForCausalLM.from_pretrained` and `AutoTokenizer.from_pretrained`.

In [2]:
## Importing necessary libraries
from transformers import pipeline ## For sequential text generation
from transformers import AutoModelForCausalLM, AutoTokenizer # For loading the model and tokenizer from the huggingface repository
import warnings
warnings.filterwarnings("ignore") ## To remove warning messages from output


## Providing the huggingface model repository name for mistral 7B
model_name = "mistralai/Mistral-7B-Instruct-v0.3"

## Downloading the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(model_name, token = secret_value_0,  device_map='auto')
tokenizer = AutoTokenizer.from_pretrained(model_name, token = secret_value_0)


<div class="anchor" id="top" style="
    margin-right: auto; 
    margin-left: auto;
    padding: 15px;
   font-size : 100%;
    background-color: #FEF2EF;
    border-radius: 10px;
    font-color :  #581845  ;        
    border: 1.5px solid #FF5733 ;"> 

## Dataset
- Here, I'm using the Netflix reviews dataset.
- We'll pass the first five rows to the LLM as a markdown table to help the model understand what the data looks like.
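
To make the second point concrete, the sketch below builds a small markdown table by hand and embeds it in a system prompt. It is an assumption-laden stand-in: the notebook itself calls `df.head(5).to_markdown()` later on (pandas needs the optional `tabulate` package for that), while `rows_to_markdown` here is a hypothetical helper that uses only the standard library and does not escape `|` characters inside cells.

```python
def rows_to_markdown(columns, rows):
    """Render a list of dict records as a simple markdown table."""
    header = "| " + " | ".join(columns) + " |"
    divider = "| " + " | ".join("---" for _ in columns) + " |"
    body = [
        "| " + " | ".join(str(row.get(col, "")) for col in columns) + " |"
        for row in rows
    ]
    return "\n".join([header, divider] + body)


# Two rows borrowed from the dataset preview, reduced to three columns.
sample = [
    {"userName": "Sanjay k Roy", "content": "Acha hai", "score": 1},
    {"userName": "Sibusiso Mvutho", "content": "Sbuioss", "score": 5},
]

preview = rows_to_markdown(["userName", "content", "score"], sample)
system_prompt = (
    "You are a Data Analysis Assistant. First rows of the dataset:\n" + preview
)
print(system_prompt)
```

Only a handful of rows is sent, so the model learns the column names and value formats without the full dataset entering the prompt.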

In [3]:
import pandas as pd
reviews_df = pd.read_csv('/kaggle/input/netflix-reviews-playstore-daily-updated/netflix_reviews.csv')
reviews_df.head()

| | reviewId | userName | content | score | thumbsUpCount | reviewCreatedVersion | at | appVersion |
|---|---|---|---|---|---|---|---|---|
| 0 | 30a04503-2371-4e8f-900c-fb7c5926f31c | Sanjay k Roy | Acha hai | 1 | 0 | NaN | 2024-10-06 06:22:42 | NaN |
| 1 | 341beef2-16d9-4a92-b1c5-448512ab3bbe | Gary Ray | Really I'm surprised we aren't boycotting Netf... | 1 | 0 | 8.127.1 build 10 50788 | 2024-10-04 12:49:31 | 8.127.1 build 10 50788 |
| 2 | 4c9890cb-0c48-42ab-8495-0155eb5adb52 | Kadir MeTe | I have an annoying bug in this app. When I tap... | 1 | 0 | 8.0.0 build 5 40003 | 2024-09-28 01:16:38 | 8.0.0 build 5 40003 |
| 3 | 68e2791d-21a8-41f1-914c-706b2750b1b0 | patrick | doesn't work on my phone anymore. 1 star until... | 1 | 0 | 8.132.2 build 18 50846 | 2024-09-22 10:49:18 | 8.132.2 build 18 50846 |
| 4 | 945c4e73-9f70-43b3-8fc0-bd7542a3c36b | Sibusiso Mvutho | Sbuioss | 5 | 0 | 7.120.6 build 63 35594 | 2024-09-19 13:01:07 | 7.120.6 build 63 35594 |


<div class="anchor" id="top" style="
    margin-right: auto; 
    margin-left: auto;
    padding: 15px;
   font-size : 100%;
    background-color: #FEF2EF;
    border-radius: 10px;
    font-color :  #581845  ;        
    border: 1.5px solid #FF5733 ;"> 
    
## Execution Function
- A function for executing a command in the Python interpreter
- LLMs can be tricked into performing malicious tasks in many ways, so make sure the function runs only in a sandboxed or containerised environment that restricts access to sensitive information

In [4]:


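def exec_any_command(command: str) :
    """
    Execute the provided Python expression against the loaded dataframe `df`.

    Args:
        command: A string containing a valid Python expression to be evaluated.

    Returns:
        The output of the executed command.
    """
    import pandas as pd 
    # The CSV is re-read on every call so that `df` is available to the evaluated command
    df = pd.read_csv('/kaggle/input/netflix-reviews-playstore-daily-updated/netflix_reviews.csv')
    
    # eval() handles expressions only (no statements) and runs arbitrary code: use it only in a sandbox
    output = eval(command)
    
    return output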



command = """df.head()"""

outp = exec_any_command(command)
print(outp)

                               reviewId         userName  \
0  30a04503-2371-4e8f-900c-fb7c5926f31c     Sanjay k Roy   
1  341beef2-16d9-4a92-b1c5-448512ab3bbe         Gary Ray   
2  4c9890cb-0c48-42ab-8495-0155eb5adb52       Kadir MeTe   
3  68e2791d-21a8-41f1-914c-706b2750b1b0          patrick   
4  945c4e73-9f70-43b3-8fc0-bd7542a3c36b  Sibusiso Mvutho   

                                             content  score  thumbsUpCount  \
0                                           Acha hai      1              0   
1  Really I'm surprised we aren't boycotting Netf...      1              0   
2  I have an annoying bug in this app. When I tap...      1              0   
3  doesn't work on my phone anymore. 1 star until...      1              0   
4                                            Sbuioss      5              0   

     reviewCreatedVersion                   at              appVersion  
0                     NaN  2024-10-06 06:22:42                     NaN  
1  8.127.1 build 10 5078
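
The sandboxing advice for the execution function can be complemented by checking the generated command before it is evaluated. The following is a minimal sketch and not a security boundary: `ALLOWED_NAMES`, `validate_expression` and `restricted_eval` are hypothetical names that are not part of this notebook, and a plain list of dicts stands in for the dataframe.

```python
import ast

# Assumption: the only names the generated expression may reference.
ALLOWED_NAMES = {"df", "len"}


def validate_expression(command: str) -> ast.Expression:
    """Parse the command as a single expression and reject suspicious nodes."""
    # mode="eval" raises SyntaxError for statements such as `import os`.
    tree = ast.parse(command, mode="eval")
    for node in ast.walk(tree):
        if isinstance(node, ast.Name) and node.id not in ALLOWED_NAMES:
            raise ValueError(f"Name {node.id!r} is not allowed")
        if isinstance(node, ast.Attribute) and node.attr.startswith("_"):
            raise ValueError(f"Attribute {node.attr!r} is not allowed")
    return tree


def restricted_eval(command: str, df):
    """Evaluate a validated expression with no builtins except `len`."""
    tree = validate_expression(command)
    code = compile(tree, "<llm_command>", "eval")
    return eval(code, {"__builtins__": {}}, {"df": df, "len": len})


rows = [{"score": 1}, {"score": 5}]
print(restricted_eval("len(df)", rows))
```

Checks like these reduce the attack surface but can be bypassed by a determined attacker, so the function should still run inside an isolated environment.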

<div class="anchor" id="top" style="
    margin-right: auto; 
    margin-left: auto;
    padding: 15px;
   font-size : 100%;
    background-color: #FEF2EF;
    border-radius: 10px;
    font-color :  #581845  ;        
    border: 1.5px solid #FF5733 ;"> 

## Getting the LLM's response for the query and executing it 
- We're setting up a function that receives:
  1. the user's query,
  2. the dataframe, and
  3. the execution function.
- It sends this information to the Mistral model, gets its response as a string, and passes the extracted tool call to the `exec_any_command` function for execution.
- The natural-language answer is then generated by passing the whole conversation, including the tool output, back to the LLM.

<div class="anchor" id="top" style="
    margin-right: auto; 
    margin-left: auto;
    padding: 15px;
   font-size : 100%;
    background-color: #FEF2EF;
    border-radius: 10px;
    font-color :  #581845  ;        
    border: 1.5px solid #FF5733 ;"> 

## Generating a tool_call_id [MISTRAL SPECIFIC] 
   Ref : [HuggingFace](https://huggingface.co/docs/transformers/main/chat_templating#a-complete-tool-use-example)
-  The `tool_call_id` uniquely identifies a tool call and matches it with its response, which keeps conversations that involve external tools consistent.
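
As a rough illustration of that matching, the sketch below creates an ID and uses it to find the tool result that answers a given call. The 9-character length mirrors the value used in this notebook; `new_tool_call_id` and `find_result` are hypothetical helper names, and the message contents are taken from the first example run.

```python
import random
import string


def new_tool_call_id(length: int = 9) -> str:
    """Return a random alphanumeric ID (9 characters, as used in this notebook)."""
    return "".join(random.choices(string.ascii_letters + string.digits, k=length))


call_id = new_tool_call_id()

# The same ID appears in the assistant's tool call and in the tool's reply.
messages = [
    {"role": "user", "content": "How many rows are there in the DataFrame?"},
    {
        "role": "assistant",
        "tool_calls": [
            {
                "type": "function",
                "id": call_id,
                "function": {
                    "name": "exec_any_command",
                    "arguments": {"command": "len(df)"},
                },
            }
        ],
    },
    {
        "role": "tool",
        "tool_call_id": call_id,
        "name": "exec_any_command",
        "content": "115672",
    },
]


def find_result(messages, tool_call_id):
    """Return the content of the tool message that answers the given call."""
    for message in messages:
        if message.get("role") == "tool" and message.get("tool_call_id") == tool_call_id:
            return message["content"]
    return None


print(find_result(messages, call_id))
```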

In [5]:
import json
import random
import string
import re

def extract_tool_call(llm_response) : 
    """Extract the tool call (function name and arguments) from the text generated by the LLM."""
    try :
        tool_call = json.loads(llm_response)[0]

    except json.JSONDecodeError :
        # Step 1: Extract the JSON-like part using regex
        match = re.search(r'\[.*\]', llm_response, re.DOTALL)
        if match is None :
            raise ValueError("No JSON tool call found in the LLM response")

        # Step 2: Convert it to a list of dictionaries
        tool_call = json.loads(match.group(0))[0]

    return tool_call


def redef_messages(messages, response) :
    """Generates a tool_call_id and append it to the messages prompt-template (as required by Mistral)"""
    tool_call_id = ''.join(random.choices(string.ascii_letters + string.digits, k=9))
    # Append the tool call to the conversation
    messages.append({"role": "assistant", "tool_calls": [{"type": "function", "id": tool_call_id, "function": response}]})
    
    return messages, tool_call_id




def extract_and_exec(messages,tool_call,tool_call_id) : 
    """Execute the requested tool and append its result to the messages prompt-template."""
    function_name = tool_call["name"]   # Name of the tool (function) requested by the LLM
    arguments = tool_call["arguments"]  # Keyword arguments for that function

    # Only one tool is available here, so it is called directly
    console_output = exec_any_command(**arguments)

    print(console_output)
    messages.append({"role": "tool", "tool_call_id": tool_call_id, "name": function_name, "content": str(console_output)})
    
    return messages, console_output




In [6]:
## For printing in Bold green
TGREEN = '\033[1;32m' 
## For printing in Bold Blue
TBLUE = '\033[1;34m'
## For printing in Black
TBLACK = '\033[30m'

In [7]:
def llm_response_exec(user_input,df = reviews_df,func = exec_any_command) : 
     
    print(TBLUE + f">> User's Input : {user_input}\n")
    
    messages = [{"role": "system", "content": f"You are a Data Analysis Assistant, with access to the first few rows of this dataset information :\n {df.head(5).to_markdown()}.After receiving the tool's/function's response, provide the final answer based on the output of the function/tool.Respond in the correct, concisely and (natural english language) simple way to the user."},
               {"role": "user", "content": f"Generate Python code as an argument using the specified function/tool from the input. Do not include any print statements. Maintain the exact format and avoid importing libraries.\n >> Do not include any print statements in the tool command output code\n Question:{user_input}."}]
        
    
    inputs = tokenizer.apply_chat_template(
        messages,  # The conversation so far, as a list of messages
        tools = [func], # The tools (functions) the model may call; their schema is built from the signature and docstring
        add_generation_prompt=True,  # Append the tokens that mark the start of the assistant's turn, so the model generates a reply
        return_dict=True,  # Return a dictionary (input_ids, attention_mask) instead of a bare tensor
        return_tensors="pt"  # Return PyTorch tensors
    )
                

    inputs = {k: v.to(model.device) for k, v in inputs.items()} #  Moves all the input tensors to the same device (CPU/GPU) as the model.
    outputs = model.generate(**inputs, max_new_tokens=128, pad_token_id=model.config.eos_token_id)
    
    ## This is the initial response; next we extract the tool (function) call along with its arguments
    response = tokenizer.decode(outputs[0][len(inputs["input_ids"][0]):], skip_special_tokens=True)  # Decode only the newly generated tokens (the prompt tokens are sliced off)
    
    
    
    print(TBLUE + ">> Response generated by the LLM : \n") 
    print(TBLACK + f"{response}\n")

    ## Extracting the tool call generated by the LLM
    tool_call = extract_tool_call(response)

    ## Appending the parsed tool call (a dict, not the raw string) to the messages, along with a tool_call_id
    messages,tool_call_id = redef_messages(messages,tool_call)
    
    print(TBLUE + ">> Running the function with the generated code\n")
        
    ## Running the tool to get the results and appending it to the messages prompt-template 
    messages, result = extract_and_exec(messages,tool_call,tool_call_id)
    
    print(TBLUE + ">> Results of function execution : \n")
    print(TBLACK + f"{result}" + '\n')
    
    ## The whole conversation, including the tool result, is now passed to the LLM again to generate the final answer
    inputs_2 = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt"
    )
    
    inputs_2 = {k: v.to(model.device) for k, v in inputs_2.items()}
    
    outputs_2 = model.generate(**inputs_2, max_new_tokens=150, pad_token_id=model.config.eos_token_id)
    final_response = tokenizer.decode(outputs_2[0][len(inputs_2["input_ids"][0]):],skip_special_tokens=True)
    
    print(TGREEN + '>> Final Answer :\n')
    print(f'{final_response}')


In [8]:
llm_response_exec('How many rows are there in the DataFrame?')

>> User's Input : How many rows are there in the DataFrame?

Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)

>> Response generated by the LLM : 

[{"name": "exec_any_command", "arguments": {"command": "len(df)"}}]

>> Running the function with the generated code

115672
>> Results of function execution : 

115672

>> Final Answer :

The DataFrame contains 115672 rows.


In [9]:
llm_response_exec('How many columns are there in the DataFrame?')

>> User's Input : How many columns are there in the DataFrame?

>> Response generated by the LLM : 

[{"name": "exec_any_command", "arguments": {"command": "len(df.columns)"}}]

>> Running the function with the generated code

8
>> Results of function execution : 

8

>> Final Answer :

The DataFrame has 8 columns.


In [10]:
llm_response_exec('What are the column names in the DataFrame?')

>> User's Input : What are the column names in the DataFrame?

>> Response generated by the LLM : 

[{"name": "exec_any_command", "arguments": {"command": "df.columns"}}]

>> Running the function with the generated code

Index(['reviewId', 'userName', 'content', 'score', 'thumbsUpCount',
       'reviewCreatedVersion', 'at', 'appVersion'],
      dtype='object')
>> Results of function execution : 

Index(['reviewId', 'userName', 'content', 'score', 'thumbsUpCount',
       'reviewCreatedVersion', 'at', 'appVersion'],
      dtype='object')

>> Final Answer :

The column names in the DataFrame are: ['reviewId', 'userName', 'content', 'score', 'thumbsUpCount', 'reviewCreatedVersion', 'at', 'appVersion']


In [None]:
llm_response_exec('What '