
🚀 PyFrameX

Next-Generation Native DataFrame for Python

PyPI version · Python 3.8+ · License: MIT

Simple like Excel, Powerful like SQL, Smart like AI

PyFrameX is a revolutionary DataFrame engine built from scratch in pure Python. It combines the simplicity of Excel, the power of SQL, and the intelligence of machine learning into one intuitive package.


🌟 What Makes PyFrameX Different?

❌ The Problem

  • Pandas: Powerful but complicated (.loc, .iloc, .apply confusion)
  • Polars: Fast but too technical for beginners
  • Excel: Simple but limited in scale and automation

✅ The Solution: PyFrameX

from pyframex import Frame

# Load data - just like Excel
df = Frame("sales.csv")

# Excel-style operations
df["profit"] = df["revenue"] - df["cost"]

# SQL-style queries
df.sql("SELECT region, SUM(revenue) FROM df GROUP BY region")

# AI-powered automation
df.auto_predict(target="sales")

🎯 Key Features

1️⃣ Pure Python Native Engine

  • Zero dependencies for core functionality
  • Custom column store implementation
  • Type-aware operations (Int, Float, String, Date, Bool)
  • Automatic type inference
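To illustrate the idea behind automatic type inference (this is a minimal sketch in pure Python, not PyFrameX's actual implementation), raw string values can be typed by trying progressively looser parsers:

```python
def infer_type(values):
    """Guess a column type from raw string values: Bool, Int, Float, or String."""
    def all_parse(parse):
        try:
            for v in values:
                parse(v)
            return True
        except ValueError:
            return False

    if all(v.lower() in {"true", "false"} for v in values):
        return "Bool"
    if all_parse(int):       # strictest numeric parser first
        return "Int"
    if all_parse(float):     # ints parse as floats, so check int before float
        return "Float"
    return "String"          # fallback when nothing else fits

print(infer_type(["1", "2", "3"]))       # Int
print(infer_type(["1.5", "2", "3.25"]))  # Float
print(infer_type(["true", "False"]))     # Bool
print(infer_type(["Alice", "Bob"]))      # String
```

Checking `int` before `float` matters: every integer string also parses as a float, so the order encodes the preference for the narrower type.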

2️⃣ Excel-Like Simplicity

# Simple, intuitive operations
df["ratio"] = df["sales"] / df["visits"]
df["status"] = "active"

# No confusing .loc or .iloc needed!

3️⃣ Built-in SQL Engine

# Execute SQL queries directly on DataFrames
result = df.sql("""
    SELECT 
        region, 
        SUM(revenue) as total_revenue,
        AVG(profit) as avg_profit
    FROM df 
    WHERE year = 2024 
    GROUP BY region
    ORDER BY total_revenue DESC
    LIMIT 10
""")

4️⃣ AI-Powered Automation

# Automatic data cleaning
clean_df = df.auto_clean()

# Automatic predictive modeling
results = df.auto_predict(target="price")
print(f"Accuracy: {results['metrics']['accuracy']}")

# Automatic clustering
clustered = df.auto_cluster(n_clusters=3)

# Automatic feature engineering
enriched = df.auto_feature_engineering()

5️⃣ Optimized Performance

  • Lazy evaluation
  • Column-oriented storage
  • Cached statistics
  • Query optimization
  • Filter pushdown
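As an illustration of the cached-statistics idea (a sketch, not the actual PyFrameX internals), a column can memoize its aggregates and invalidate them on any write:

```python
class CachedColumn:
    """Numeric column that caches aggregate statistics until mutated."""

    def __init__(self, values):
        self._values = list(values)
        self._stats = {}               # cache: stat name -> computed value

    def sum(self):
        if "sum" not in self._stats:
            self._stats["sum"] = sum(self._values)
        return self._stats["sum"]

    def mean(self):
        if "mean" not in self._stats:
            self._stats["mean"] = self.sum() / len(self._values)
        return self._stats["mean"]

    def append(self, value):
        self._values.append(value)
        self._stats.clear()            # any write invalidates the cache

col = CachedColumn([1, 2, 3])
print(col.mean())   # 2.0 (computed, then cached)
col.append(6)
print(col.mean())   # 3.0 (cache was invalidated by the append)
```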

📦 Installation

# Basic installation
pip install pyframex

# With ML capabilities
pip install pyframex[ml]

# Install all features
pip install pyframex[all]

🚀 Quick Start

Loading Data

from pyframex import Frame

# From CSV
df = Frame("data.csv")

# From JSON
df = Frame("data.json")

# From dictionary
df = Frame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
    "salary": [50000, 60000, 70000]
})

# From list of dictionaries
df = Frame([
    {"name": "Alice", "age": 25, "salary": 50000},
    {"name": "Bob", "age": 30, "salary": 60000},
    {"name": "Charlie", "age": 35, "salary": 70000}
])

Basic Operations

# View data
print(df)
print(df.head(10))
print(df.tail(5))

# Get info
print(df.summary())
print(df.shape())  # (rows, columns)
print(df.dtypes())  # Column types

# Select columns
names = df["name"]
subset = df[["name", "salary"]]

# Add/modify columns
df["bonus"] = df["salary"] * 0.1
df["total"] = df["salary"] + df["bonus"]

Filtering

# Excel-style filtering
high_earners = df.filter("salary > 60000")
young_staff = df.filter("age < 30")

# Combined conditions
filtered = df.filter("age > 25 and salary < 70000")

# Using column comparisons
mask = df["age"] > 30
filtered = df.filter(mask)
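String predicates like `"age > 25 and salary < 70000"` can be understood as expressions evaluated against each row. A minimal sketch of that idea (using `eval` with a restricted namespace — illustrative only, not PyFrameX's parser, and not safe for untrusted input):

```python
def filter_rows(rows, expr):
    """Keep the row dicts for which the predicate string is truthy."""
    code = compile(expr, "<filter>", "eval")
    # Empty __builtins__ blocks accidental builtin access; column names
    # resolve through each row's own dict.
    return [row for row in rows
            if eval(code, {"__builtins__": {}}, dict(row))]

rows = [
    {"name": "Alice", "age": 25, "salary": 50000},
    {"name": "Bob", "age": 30, "salary": 60000},
    {"name": "Charlie", "age": 35, "salary": 70000},
]
print(filter_rows(rows, "age > 25 and salary < 70000"))
# [{'name': 'Bob', 'age': 30, 'salary': 60000}]
```

A production engine would parse the expression into an AST and evaluate it column-wise rather than per row.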

Sorting & Grouping

# Sort
sorted_df = df.sort("salary", ascending=False)

# Group by
by_region = df.groupby("region").agg({
    "revenue": "sum",
    "orders": "count"
})

# Multiple aggregations
summary = df.groupby(["region", "category"]).agg({
    "revenue": "sum",
    "profit": "mean",
    "orders": "count"
})
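Under the hood, a group-by with named aggregations reduces to bucketing rows by key and folding each bucket. A pure-Python sketch of the pattern (the `groupby_agg` helper is hypothetical, not part of the PyFrameX API):

```python
from collections import defaultdict

def groupby_agg(rows, key, aggs):
    """Group row dicts by `key`; `aggs` maps column -> "sum" | "mean" | "count"."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    funcs = {"sum": sum, "mean": lambda xs: sum(xs) / len(xs), "count": len}
    return {
        group_key: {col: funcs[fn]([m[col] for m in members])
                    for col, fn in aggs.items()}
        for group_key, members in groups.items()
    }

rows = [
    {"region": "East", "revenue": 100, "orders": 1},
    {"region": "East", "revenue": 200, "orders": 2},
    {"region": "West", "revenue": 50,  "orders": 1},
]
print(groupby_agg(rows, "region", {"revenue": "sum", "orders": "count"}))
# {'East': {'revenue': 300, 'orders': 2}, 'West': {'revenue': 50, 'orders': 1}}
```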

SQL Queries

# Simple query
result = df.sql("SELECT name, salary FROM df WHERE age > 30")

# With aggregation
result = df.sql("""
    SELECT 
        region, 
        SUM(revenue) as total,
        AVG(profit) as avg_profit
    FROM df 
    GROUP BY region
""")

# With ordering and limit
result = df.sql("""
    SELECT * FROM df 
    WHERE status = 'active' 
    ORDER BY created_date DESC 
    LIMIT 100
""")

# Explain query plan
from pyframex.query import QueryPlanner
planner = QueryPlanner()
print(planner.explain("SELECT * FROM df WHERE revenue > 1000"))

🤖 Machine Learning Integration

Auto Clean

# Automatically:
# - Remove duplicates
# - Handle missing values (median/mode imputation)
# - Remove outliers
# - Fix data types
clean_df = df.auto_clean()
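Two of the cleaning steps above are easy to sketch with the standard library (the `impute_missing` and `drop_duplicates` helpers below are illustrative, not PyFrameX's actual code):

```python
from statistics import median

def impute_missing(values):
    """Fill None entries in a numeric column with the median of the rest."""
    present = [v for v in values if v is not None]
    fill = median(present)
    return [fill if v is None else v for v in values]

def drop_duplicates(rows):
    """Remove duplicate row dicts, keeping first occurrences."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))  # hashable fingerprint of the row
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

print(impute_missing([10, None, 30]))  # [10, 20.0, 30]
print(drop_duplicates([{"a": 1}, {"a": 1}, {"a": 2}]))  # [{'a': 1}, {'a': 2}]
```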

Auto Predict

# Automatic model training
results = df.auto_predict(
    target="price",
    test_size=0.2
)

# Results include:
print(results['metrics'])  # Performance metrics
print(results['model'])  # Trained model
print(results['predictions'])  # Test predictions

# Feature importance
for feature, importance in results['metrics']['feature_importance'].items():
    print(f"{feature}: {importance:.4f}")

Auto Cluster

# Automatic clustering
clustered = df.auto_cluster(n_clusters=3)
print(clustered["cluster"].value_counts())

Feature Engineering

# Automatically create:
# - Polynomial features
# - Interaction terms
# - Date extractions
enriched = df.auto_feature_engineering()
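Interaction terms, for example, are just pairwise products of numeric columns. A minimal sketch (the `interaction_terms` helper is hypothetical, shown only to make the idea concrete):

```python
from itertools import combinations

def interaction_terms(row, numeric_cols):
    """Return a copy of the row extended with pairwise products of numeric columns."""
    out = dict(row)
    for a, b in combinations(numeric_cols, 2):
        out[f"{a}_x_{b}"] = row[a] * row[b]
    return out

row = {"price": 4, "quantity": 3, "discount": 2}
print(interaction_terms(row, ["price", "quantity", "discount"]))
# adds price_x_quantity=12, price_x_discount=8, quantity_x_discount=6
```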

Smart Suggestions

# Get transformation suggestions
suggestions = df._ml_engine.suggest_transformations(df)
for suggestion in suggestions:
    print(f"💡 {suggestion}")

🔧 Advanced Features

Column Operations

# Numeric columns
df["price"].sum()
df["price"].mean()
df["price"].median()
df["price"].min()
df["price"].max()
df["price"].std()  # Standard deviation

# String columns
df["name"].lower()
df["name"].upper()
df["name"].strip()
df["name"].contains("alice")
df["name"].replace("old", "new")
df["name"].len()  # String lengths

# Date columns
df["date"].year()
df["date"].month()
df["date"].day()
df["date"].weekday()
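The date accessors map directly onto the standard library's `datetime` module; a self-contained sketch of extracting those parts from ISO date strings (illustrative, not the PyFrameX internals):

```python
from datetime import date

def extract_date_parts(values):
    """Split ISO date strings into year/month/day/weekday columns."""
    parts = {"year": [], "month": [], "day": [], "weekday": []}
    for v in values:
        d = date.fromisoformat(v)
        parts["year"].append(d.year)
        parts["month"].append(d.month)
        parts["day"].append(d.day)
        parts["weekday"].append(d.weekday())  # Monday == 0
    return parts

print(extract_date_parts(["2024-01-15", "2024-06-01"]))
```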

Mathematical Operations

# Column arithmetic
df["total"] = df["price"] * df["quantity"]
df["discount_price"] = df["price"] * 0.9
df["profit"] = df["revenue"] - df["cost"]

# Column-to-column operations
df["ratio"] = df["sales"] / df["visits"]
df["growth"] = df["current"] - df["previous"]

Data Export

# Save to CSV
df.to_csv("output.csv")

# Save to JSON
df.to_json("output.json")

# Convert to dictionary
data_dict = df.to_dict()

📊 Real-World Examples

Example 1: Sales Analysis

from pyframex import Frame

# Load sales data
df = Frame("sales.csv")

# Calculate profit
df["profit"] = df["revenue"] - df["cost"]
df["margin"] = df["profit"] / df["revenue"]

# Find top performing regions
top_regions = df.sql("""
    SELECT 
        region,
        SUM(revenue) as total_revenue,
        AVG(margin) as avg_margin
    FROM df
    GROUP BY region
    ORDER BY total_revenue DESC
    LIMIT 5
""")

print(top_regions)

Example 2: Customer Segmentation

# Load customer data
customers = Frame("customers.csv")

# Auto-clean data
customers = customers.auto_clean()

# Perform clustering
segmented = customers.auto_cluster(n_clusters=4)

# Analyze clusters
cluster_summary = segmented.groupby("cluster").agg({
    "age": "mean",
    "purchases": "sum",
    "lifetime_value": "mean"
})

print(cluster_summary)

Example 3: Predictive Modeling

# Load historical data
data = Frame("historical_sales.csv")

# Engineer features
data = data.auto_feature_engineering()

# Train model
results = data.auto_predict(target="next_month_sales")

print(f"Model RΒ²: {results['metrics']['r2']:.4f}")
print(f"RMSE: {results['metrics']['rmse']:.2f}")

# Feature importance
for feature, importance in results['metrics']['feature_importance'].items():
    if importance > 0.05:
        print(f"  {feature}: {importance:.2%}")

🎯 Use Cases

Perfect For:

✅ Data Analysts - Excel-like simplicity with SQL power
✅ Data Scientists - Built-in ML with no setup
✅ Python Beginners - Intuitive, no steep learning curve
✅ Rapid Prototyping - Fast iteration with auto features
✅ Educational Projects - Learn data science easily
✅ Small to Medium Data - Pure Python, no heavy dependencies

Not Ideal For:

❌ Massive datasets (100M+ rows) - Use Polars/DuckDB
❌ Distributed computing - Use Spark/Dask
❌ Production big data pipelines - Use enterprise solutions


πŸ—οΈ Architecture

PyFrameX consists of five core components:

┌─────────────────────────────────────────┐
│           Frame (Main API)              │
│   Simple like Excel, Powerful like SQL  │
└─────────────────────────────────────────┘
                  │
        ┌─────────┴─────────┐
        │                   │
┌───────▼────────┐  ┌───────▼────────┐
│ Column Engine  │  │  Query Planner │
│ - IntColumn    │  │  - SQL Parser  │
│ - FloatColumn  │  │  - Optimizer   │
│ - StringColumn │  │  - Executor    │
│ - DateColumn   │  │  - Cache       │
│ - BoolColumn   │  └────────────────┘
└────────────────┘
        │
┌───────▼────────┐  ┌────────────────┐
│   AutoML       │  │  Visualizer    │
│ - auto_clean   │  │  - Charts      │
│ - auto_predict │  │  - Summaries   │
│ - auto_cluster │  │  - Reports     │
└────────────────┘  └────────────────┘

📈 Performance

PyFrameX is optimized for clarity and moderate-sized datasets:

  • Column-oriented storage for efficient operations
  • Lazy evaluation where possible
  • Cached statistics to avoid recomputation
  • Type-specific optimizations for each column type
  • Query optimization with filter pushdown

Benchmark (1M rows):

  • Loading CSV: ~2-3 seconds
  • Filtering: ~0.1-0.5 seconds
  • Grouping: ~0.5-1 second
  • SQL query: ~0.5-2 seconds
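Figures like these vary by machine; a simple way to reproduce such measurements is a best-of-N timing harness (the `bench` helper below is a generic sketch, not a PyFrameX utility — the filter here is a stand-in list comprehension):

```python
import time

def bench(label, fn, repeats=3):
    """Report the best wall-clock time over several runs of fn()."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    print(f"{label}: {best:.3f}s")
    return best

# Example: filtering a million synthetic rows.
rows = list(range(1_000_000))
bench("filter 1M ints", lambda: [r for r in rows if r > 500_000])
```

Taking the best of several runs, rather than the mean, reduces noise from caches warming up and background load.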

πŸ› οΈ CLI Usage

# Show DataFrame info
pyframex info data.csv

# Show first 10 rows
pyframex head data.csv -n 10

# Execute SQL query
pyframex query data.csv "SELECT * FROM df WHERE age > 30"

# Auto-clean data
pyframex clean data.csv cleaned_data.csv

# Show version
pyframex version

🤝 Contributing

Contributions are welcome! Here's how you can help:

  1. Report bugs - Open an issue on GitHub
  2. Suggest features - Describe your use case
  3. Submit PRs - Fix bugs or add features
  4. Write docs - Improve documentation
  5. Share examples - Show how you use PyFrameX

πŸ“ License

MIT License - see LICENSE file for details.


πŸ™ Acknowledgments

PyFrameX is inspired by:

  • Pandas - The gold standard for DataFrame operations
  • Polars - Modern columnar data processing
  • DuckDB - Fast in-process SQL
  • Excel - Universal data manipulation tool

🎓 Citation

If you use PyFrameX in your research, please cite:

@software{pyframex2024,
  author = {Bado, Idriss},
  title = {PyFrameX: Next-Generation Native DataFrame for Python},
  year = {2024},
  url = {https://github.com/idrissbado/PyFrameX}
}

⭐ Star History

If you find PyFrameX useful, please give it a star on GitHub! ⭐


Made with ❤️ by Idriss Bado

Simple like Excel, Powerful like SQL, Smart like AI
