Autonomous Data Cleaning & Feature Engineering Agent
ProcessumAir is a next-generation data engineering agent that democratizes data preparation. Powered by Google Gemini 2.5 Flash, it runs an autonomous loop that ingests raw datasets (CSV/Excel), identifies quality issues, formulates a rigorous cleaning strategy, and executes it to produce machine-learning-ready data. From raw CSV to ML-ready code in seconds.

- Autonomous Reasoning: Analyzes dataset schema and user goals (e.g., "Predict Churn") to formulate a tailored cleaning plan.
- Universal Import: Drag-and-drop support for CSV and Excel (.xlsx) files.
- Smart Profiling: Automatic detection of data types, missing values, and outliers.
- Trace Visibility: A visual "Decision Trace Graph" showing the agent's internal thought process (Input → Reasoning → Code).
- Client-Side Processing: Data parsing and heuristic cleaning happen locally in the browser for speed and privacy.
- Deliverables:
  - Cleaned Dataset: Download the processed file as `.xlsx`.
  - Python Script: Get a reproducible Pandas/Scikit-Learn script.
  - Executive Audit Report: A professional PDF certificate of data health.
- Frontend: React, TypeScript, Tailwind CSS
- AI Engine: Google Gemini API (`@google/genai`)
- Data Engine: SheetJS (`xlsx`)
- Visualization: Recharts
- Reporting: jsPDF
1. Clone the repository

   ```bash
   git clone https://github.com/yourusername/processum-air.git
   cd processum-air
   ```

2. Install dependencies

   ```bash
   npm install
   ```

3. Configure API Key

   ProcessumAir requires a Google Gemini API Key. Create a `.env` file in the root directory:

   ```bash
   # Note: In a Vite setup, you might need to configure 'define' in vite.config.ts
   # to support 'process.env.API_KEY' or switch to 'import.meta.env'.
   API_KEY=your_google_gemini_api_key_here
   ```

4. Run the development server

   ```bash
   npm run dev
   ```
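If you take the `define` route mentioned in the note above, a minimal `vite.config.ts` could look like the following. This is a sketch of one possible setup, not the project's committed configuration:

```typescript
// vite.config.ts -- one possible way to expose API_KEY to client code
// as process.env.API_KEY (an assumption; the project may use import.meta.env instead).
import { defineConfig, loadEnv } from 'vite';

export default defineConfig(({ mode }) => {
  // Load all variables from .env (empty prefix loads unprefixed vars like API_KEY).
  const env = loadEnv(mode, process.cwd(), '');
  return {
    define: {
      // Statically replace process.env.API_KEY in the bundle at build time.
      'process.env.API_KEY': JSON.stringify(env.API_KEY),
    },
  };
});
```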
```
processum-air/
├── src/
│   ├── components/   # UI Components (Dashboard, Charts, Modals)
│   ├── services/     # Gemini AI Service & Prompt Engineering
│   ├── utils/        # CSV/Excel Parsers & Cleaning Logic
│   ├── types.ts      # TypeScript Interfaces
│   └── App.tsx       # Main Application State & Routing
├── public/
└── package.json
```
1. Profiling Phase: The app uses SheetJS to parse the file locally and generates a statistical metadata summary (row count, null %, types) without sending raw rows to the cloud.
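The profiling idea can be sketched in a few lines. This is an illustrative stand-alone Python version (the app itself does this in the browser with SheetJS); the function name and null markers are assumptions:

```python
import csv
import io

# Sketch of the profiling step: summarize row count, null percentage,
# and a rough column type without ever exposing the raw rows themselves.
def profile_csv(text: str) -> dict:
    rows = list(csv.DictReader(io.StringIO(text)))
    summary = {"row_count": len(rows), "columns": {}}
    for col in (rows[0].keys() if rows else []):
        values = [r[col] for r in rows]
        nulls = sum(1 for v in values if v in ("", None, "NA", "null"))
        non_null = [v for v in values if v not in ("", None, "NA", "null")]

        def is_num(v):
            try:
                float(v)
                return True
            except ValueError:
                return False

        # Crude inference: numeric only if every non-null value parses as a float.
        inferred = "numeric" if non_null and all(is_num(v) for v in non_null) else "text"
        summary["columns"][col] = {
            "null_pct": round(100 * nulls / len(rows), 1) if rows else 0.0,
            "type": inferred,
        }
    return summary

sample = "age,city\n34,Lisbon\n,Porto\n51,\n"
print(profile_csv(sample))
```

Only this compact summary (counts, percentages, inferred types) would leave the machine, which is what makes the privacy claim above work.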
2. Planning Phase: The metadata and the user's goal are sent to Gemini 2.5 Flash. The LLM returns a JSON cleaning plan: a list of steps, each with reasoning and Python code.
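A plan of this shape might look like the following. The field names and validation below are illustrative, not the app's actual schema; in the real flow the JSON comes back from Gemini, so a hard-coded example is validated here instead:

```python
import json

# Hypothetical JSON cleaning plan, mimicking what the LLM might return:
# each step carries a target column, an action, reasoning, and pandas code.
plan_json = """
{
  "steps": [
    {"column": "age", "action": "impute_median",
     "reasoning": "Numeric column with nulls; median is robust to outliers.",
     "code": "df['age'] = df['age'].fillna(df['age'].median())"},
    {"column": "user_id", "action": "drop_column",
     "reasoning": "High-cardinality identifier; no predictive value for churn.",
     "code": "df = df.drop(columns=['user_id'])"}
  ]
}
"""

def parse_plan(raw: str) -> list:
    plan = json.loads(raw)
    for step in plan["steps"]:
        # Reject steps missing any of the required fields.
        assert {"column", "action", "reasoning", "code"} <= step.keys()
    return plan["steps"]

steps = parse_plan(plan_json)
print([s["action"] for s in steps])
```

Validating the structure before execution is what lets the agent loop trust the LLM's output enough to act on it.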
3. Execution Phase: The dashboard simulates the execution: the `cleanAndExportData` utility replicates the planned logic (imputation, dropping columns) on the full dataset in browser memory.
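The in-memory execution idea can be sketched as follows. This is a minimal Python analogue of the approach, not the actual `cleanAndExportData` implementation; rows are held as dicts and each plan step is applied in turn:

```python
from statistics import median

# Sketch of plan execution: apply imputation and column drops to
# rows held entirely in memory, mirroring the browser-side approach.
def execute_plan(rows, steps):
    rows = [dict(r) for r in rows]  # copy so the original data is untouched
    for step in steps:
        col = step["column"]
        if step["action"] == "impute_median":
            observed = [r[col] for r in rows if r[col] is not None]
            fill = median(observed)
            for r in rows:
                if r[col] is None:
                    r[col] = fill
        elif step["action"] == "drop_column":
            for r in rows:
                r.pop(col, None)
    return rows

data = [
    {"age": 34, "user_id": 1},
    {"age": None, "user_id": 2},
    {"age": 50, "user_id": 3},
]
plan = [
    {"column": "age", "action": "impute_median"},
    {"column": "user_id", "action": "drop_column"},
]
cleaned = execute_plan(data, plan)
print(cleaned)  # the missing age is filled with the median of observed values
```

Because the same plan also ships as a pandas script (see Deliverables), the browser result and the reproducible script stay in sync by construction.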
This project is licensed under the MIT License - see the LICENSE file for details.