This repository contains the official implementation of our EMNLP'25 paper: "Reliable and Cost-Effective Exploratory Data Analysis via Graph-Guided RAG".
RAGvis is a novel, two-stage Retrieval-Augmented Generation (RAG) framework that automates Exploratory Data Analysis (EDA). It addresses the limitations of approaches that rely on a Large Language Model (LLM) alone, which can struggle with accuracy and reliability, particularly on private or less common datasets.
RAGvis operates in two primary stages:
- Offline Knowledge Graph Semantic Enrichment: A knowledge graph is built from a large collection of EDA notebooks and then enriched with structured EDA semantics. An LLM guides the enrichment, using an empirically developed taxonomy of EDA operations.
- Online EDA Notebook Generation: When presented with a new, unseen dataset, RAGvis performs the following steps:
  - Retrieves relevant EDA operations from the knowledge graph.
  - Aligns these retrieved operations with the structure of the new dataset.
  - Refines the aligned operations through LLM reasoning.
  - Generates and verifies executable Python code using a self-correcting agent.
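
The online steps above can be pictured with a small, self-contained sketch. It is not the RAGvis implementation: `Operation`, `KNOWLEDGE`, `retrieve`, `align`, and `run_with_repair` are hypothetical names chosen for illustration, a flat list stands in for the knowledge graph, the LLM refinement step is omitted, and the `fix` callback stands in for whatever model call a self-correcting agent would make.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Operation:
    """Hypothetical record for one EDA operation in the knowledge base."""
    name: str      # label of the operation, e.g. "value_counts"
    needs: str     # column kind it applies to: "numeric" or "categorical"
    template: str  # code template containing a {col} placeholder


# Assumption: a flat list stands in for the enriched knowledge graph.
KNOWLEDGE = [
    Operation("sorted_values", "numeric", "result = sorted(data['{col}'])"),
    Operation("value_counts", "categorical",
              "result = {v: data['{col}'].count(v) for v in set(data['{col}'])}"),
]


def column_kind(values: list) -> str:
    """Classify a column as numeric or categorical from its values."""
    if all(isinstance(v, (int, float)) for v in values):
        return "numeric"
    return "categorical"


def retrieve(data: dict, knowledge: list) -> list:
    """Keep only operations whose required column kind occurs in the dataset."""
    kinds = {column_kind(values) for values in data.values()}
    return [op for op in knowledge if op.needs in kinds]


def align(ops: list, data: dict) -> list:
    """Bind each operation to every column of the matching kind."""
    return [
        (op.name, col, op.template.replace("{col}", col))
        for op in ops
        for col, values in data.items()
        if column_kind(values) == op.needs
    ]


def run_with_repair(code: str, data: dict,
                    fix: Callable[[str, Exception], str],
                    max_attempts: int = 3) -> tuple:
    """Execute code; on failure, ask `fix` for a revised version and retry."""
    for _ in range(max_attempts):
        env: dict = {"data": data}
        try:
            exec(code, env)  # only safe for trusted or sandboxed code
            return code, env.get("result")
        except Exception as err:
            code = fix(code, err)  # stand-in for an LLM repair call
    raise RuntimeError("no executable code after %d attempts" % max_attempts)


# Toy dataset: one numeric and one categorical column.
data = {"age": [31, 25, 40], "city": ["Oslo", "Rome", "Oslo"]}
plan = align(retrieve(data, KNOWLEDGE), data)

# A deliberately broken snippet, repaired by a trivial rule-based `fix`.
broken = "result = sorted(dta['age'])"
fixed_code, result = run_with_repair(
    broken, data, lambda code, err: code.replace("dta", "data")
)
```

Separating the repair callback from the execution loop keeps the verification logic independent of any particular model, so the same loop can be exercised with a simple rule-based `fix` like the one above.
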
This is not an officially supported Google product. This project is not eligible for the Google Open Source Software Vulnerability Rewards Program.
