<a href="https://colab.research.google.com/github/navalepratham18/LLM-Powered-Pandas-Query-Engine/blob/main/PandasQueryEngine.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LLM-Powered Pandas Query Engine

This notebook uses the `PandasQueryEngine` to create a data analysis chatbot. This approach is ideal for structured tabular data (like our Excel files) as it leverages a Large Language Model (LLM) to dynamically generate and execute Python (Pandas) code, enabling precise calculations and filtering.

### 1. Configuration & Data Preparation
- **Setup:** We first install the required libraries and configure `Gemini 1.5 Pro` as our reasoning engine. The LLM's role is to act as an on-demand programmer.
- **Data Unification:** All individual Excel files from Google Drive are loaded and concatenated into a single, master Pandas DataFrame. This provides a unified, in-memory table for the engine to query.

### 2. Contextual Querying via Code Generation
This engine's core logic does not use vectorization. Instead, it operates through a three-step process for each query:
1.  **Prompt Construction (Context):** When a user asks a question, the engine constructs a detailed prompt for the Gemini API. This prompt includes the user's query and, most importantly, the **schema of the DataFrame** (all column names and their data types). This schema provides the necessary context for the LLM.
2.  **LLM-Powered Code Generation:** Using this context, the Gemini LLM analyzes the user's intent and generates a string of Pandas code designed to answer the question by manipulating the DataFrame.
3.  **Local Execution:** This generated code is executed locally within the Colab environment on the master DataFrame. The numerical or factual result is then used by the LLM to formulate the final, human-readable answer.

### 3. Stateless Interaction
The `PandasQueryEngine` is re-initialized inside the `while` loop for every new question. This makes the system **stateless**, ensuring that each query is treated as a fresh, independent analysis without any memory of previous interactions, which is critical for accuracy.

In [23]:
# --- 1. INSTALL LIBRARIES ---
print("Installing libraries... This may take a minute.")
%pip install llama-index llama-index-llms-google-genai pandas openpyxl &> /dev/null
%pip install llama-index-experimental &> /dev/null
print("Installation complete.")


# --- 2. MOUNT GOOGLE DRIVE & SETUP ---
import os
import pandas as pd
from google.colab import userdata, drive
from llama_index.core import Settings
from llama_index.llms.google_genai import GoogleGenAI
from llama_index.experimental.query_engine import PandasQueryEngine

# Mount your Google Drive
drive.mount('/content/drive')
print("Google Drive mounted.")

# Securely import your Google API key
try:
    os.environ["GOOGLE_API_KEY"] = userdata.get('GOOGLE_API_KEY')
except userdata.SecretNotFoundError as e:
    print("Secret not found. Please add your GOOGLE_API_KEY to Colab's secrets.")
    raise e
print("API Key loaded successfully.")


# --- 3. CONFIGURE THE LLM ---
Settings.llm = GoogleGenAI(model_name="models/gemini-1.5-pro-latest")
print("LLM configured to use Gemini 1.5 Pro.")


# --- 4. LOAD ALL EXCEL FILES INTO A SINGLE PANDAS DATAFRAME ---
INPUT_DIR = "/content/drive/MyDrive/data"
print(f"\nLoading and combining all Excel files from {INPUT_DIR}...")

excel_files = [os.path.join(INPUT_DIR, f) for f in os.listdir(INPUT_DIR) if f.endswith('.xlsx')]
list_of_dfs = [pd.read_excel(file) for file in excel_files]
df = pd.concat(list_of_dfs, ignore_index=True)

print(f"Successfully loaded and combined {len(excel_files)} files into a single DataFrame with {len(df)} rows.")


# --- 5. ASK QUESTIONS! ---
print("\n--- Start Chatting With Your Data ---")
while True:
    try:
        query = input("Ask a question about your data (or type 'exit' to quit): ")
        if query.lower() == 'exit':
            print("Exiting. Goodbye!")
            break

        # We create a new, stateless query engine for every single query.
        # This prevents the engine from remembering previous answers.
        # print("\nCreating a fresh query engine...")
        query_engine = PandasQueryEngine(df=df, llm=Settings.llm, verbose=True)

        # print("Thinking...")
        # response = query_engine.query(query)

        print("\nResponse:")
        print(str(response))
        print("\n" + "="*50 + "\n")
    except EOFError:
        print("\nExiting due to input interruption.")
        break

Installing libraries... This may take a minute.
Installation complete.
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Google Drive mounted.
API Key loaded successfully.
LLM configured to use Gemini 1.5 Pro.

Loading and combining all Excel files from /content/drive/MyDrive/data...
Successfully loaded and combined 1 files into a single DataFrame with 27863 rows.

--- Start Chatting With Your Data ---
Ask a question about your data (or type 'exit' to quit): What was the density for the observation ID 1900083_26/08/2001 at a depth of 759.3?

Response:
12.464110217851632


Ask a question about your data (or type 'exit' to quit): what was the average temperature throughout the year 2001?

Response:
12.464110217851632


Ask a question about your data (or type 'exit' to quit): exit
Exiting. Goodbye!
